Comparing Python, Go, and C++ on the N-Queens Problem

by Pascal Fua, et al.

Python currently is the dominant language in the field of Machine Learning but is often criticized for being slow to perform certain tasks. In this report, we use the well-known N-queens puzzle as a benchmark to show that once compiled using the Numba compiler it becomes competitive with C++ and Go in terms of execution speed while still allowing for very fast prototyping. This is true of both sequential and parallel programs. In most cases that arise in an academic environment, it therefore makes sense to develop in ordinary Python, identify computational bottlenecks, and use Numba to remove them.




1 Introduction

Python is currently the dominant language in the field of Machine Learning and gives easy access to powerful Deep Learning packages such as TensorFlow and PyTorch. However, it is known to be slow at some operations, such as loops, which are not always easy to vectorize away. In such situations, one might consider switching to another language, such as C++ or the more recent Go language, whose similarity to Python makes them potentially attractive replacements. In this note, we argue that this is only rarely necessary because the Numba Python compiler [8] delivers performance close to that of C++ while preserving the compactness and ease of development that make Python such a powerful prototyping tool. Furthermore, it is easy to use. Once a set of Python functions has been identified as computationally intensive, one simply adds Numba decorators before their definitions to have them compiled, while leaving the rest of the code largely unchanged.
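As a rough illustration of what "adding a decorator" means in practice, here is a minimal sketch. The function `sum_of_squares` is a made-up example and is not part of the benchmark. The fallback decorator is an assumption of this sketch only, so that the snippet still runs, uncompiled, on a machine without Numba.

```python
try:
    from numba import njit
except ImportError:
    # Assumption for this sketch only: a no-op stand-in so the example
    # still runs as ordinary Python when Numba is not installed.
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]          # used as @njit
        return lambda func: func    # used as @njit(...)


@njit()  # the only change with respect to the plain Python function
def sum_of_squares(n):
    total = 0
    for k in range(n):  # explicit loop, the kind of code Numba targets
        total += k * k
    return total


print(sum_of_squares(10))  # 285
```

The body of the function is ordinary Python, so it can be developed and debugged without the decorator and compiled only once it turns out to be a bottleneck.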

To demonstrate this, we use the well-known $N$-queens puzzle [1] as a benchmark. It involves placing $N$ chess queens on an $N \times N$ chessboard so that no two queens threaten each other. Fig. 1 depicts three solutions on a standard board and one on a larger one. We focus on finding the number of solutions as a function of $N$. This is easy for small values of $N$ but quickly becomes computationally intractable for larger ones because the complexity of our algorithm is exponential with respect to $N$. There is no known formula for the exact number of solutions and, to date, $N = 27$ is the largest value of $N$ for which the answer has been computed [12].

All our benchmarking code is available online. We welcome comments and suggestions for potential improvements.

2 Sequential Processing

We started from the recursive algorithm described in [15] to compute one single solution of the 8-queens problem and translated it to Python. Our implementation relies on the fact that for two queens at board locations $(i_1, j_1)$ and $(i_2, j_2)$ not to be in conflict with each other, they must not be on the same row, on the same column, or on the same diagonal. The first two mean that $i_1 \neq i_2$ and $j_1 \neq j_2$. The third holds if $i_1 + j_1 \neq i_2 + j_2$ and $i_1 - j_1 \neq i_2 - j_2$. In other words, the two diagonals going through any location $(i, j)$ are completely characterized by $i + j$ and $i - j$.
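The conflict test described above can be sketched in a few lines of Python. The helper name `in_conflict` is ours and does not appear in the benchmark code; it only illustrates the row, column, and diagonal conditions.

```python
def in_conflict(q1, q2):
    """Return True if queens at (i1, j1) and (i2, j2) threaten each other."""
    (i1, j1), (i2, j2) = q1, q2
    same_row = i1 == i2
    same_col = j1 == j2
    # A diagonal is identified by i + j, an anti-diagonal by i - j
    same_diag = (i1 + j1 == i2 + j2) or (i1 - j1 == i2 - j2)
    return same_row or same_col or same_diag


print(in_conflict((0, 0), (3, 3)))  # True: same value of i - j
print(in_conflict((0, 3), (3, 0)))  # True: same value of i + j
print(in_conflict((0, 0), (1, 2)))  # False: a knight's move apart
```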

To exploit this, the function allQueensRec of Tab. 1 allocates boolean arrays col, dg1, and dg2 to keep track of which columns and diagonals are still available to place a new queen on an $N \times N$ board. It then invokes the recursive function allQueensRecursive. At recursion level $i$, for each $j$, it adds a queen at location $(i, j)$ if it is available, marks the column $j$ and the diagonals $i + j$ and $i - j$ as unavailable for additional queens, and calls itself for row $i + 1$. The recursion ends in one of two ways. Either $i$ reaches $N$, meaning that all rows have been successfully filled, or no more queens can be added. In the first case, a counter is incremented. In the second case, nothing happens. In both cases, the program backtracks, undoes its earlier marking, and continues until all solutions have been found. This process could be sped up by exploiting the symmetries of the $N$-queens problem. However, this is not required for benchmarking purposes and we chose not to do it to keep the code simple. We also chose to use the C++ and Go naming convention for functions and variables, that is, we use allQueensRec instead of the more typical all_queens_rec, so that we can use the same names in all versions of the code we present.
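For readers curious about the symmetry shortcut that the benchmark deliberately leaves out, the following is a minimal sketch of one possible variant, using only the left-right mirror symmetry. It is not the code that was benchmarked. The function names are ours, and plain Python lists stand in for the NumPy arrays.

```python
def count_from_row(n, i, col, dg1, dg2):
    """Count the ways to fill rows i..n-1 given the flags already set."""
    if i == n:
        return 1
    nsol = 0
    for j in range(n):
        if col[j] and dg1[i + j] and dg2[i - j + n]:
            col[j] = dg1[i + j] = dg2[i - j + n] = False
            nsol += count_from_row(n, i + 1, col, dg1, dg2)
            col[j] = dg1[i + j] = dg2[i - j + n] = True
    return nsol


def count_with_first_queen(n, j):
    """Count solutions whose first-row queen sits in column j."""
    col = [True] * n
    dg1 = [True] * (2 * n)
    dg2 = [True] * (2 * n)
    col[j] = dg1[j] = dg2[n - j] = False
    return count_from_row(n, 1, col, dg1, dg2)


def count_all_mirrored(n):
    """Count all solutions while exploring only half of the first row."""
    half = sum(count_with_first_queen(n, j) for j in range(n // 2))
    total = 2 * half          # every solution has a left-right mirror image
    if n % 2 == 1:            # the middle column is its own mirror image
        total += count_with_first_queen(n, n // 2)
    return total


print(count_all_mirrored(8))  # 92
```

Mirroring roughly halves the work. The remaining rotations and reflections of the board could be exploited as well, at the cost of more bookkeeping.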

In Fig. 2(a), the red curve depicts the computation time on a 2.9 GHz Quad-Core Intel Core i7 Mac running macOS Catalina. In the top part of the figure, we plot the wall-clock time as a function of the board size $N$ using a linear scale. In the bottom part of the figure, we plot the same computation times using a log scale instead, which results in an almost straight curve. This serves as a sanity check because the computational complexity grows exponentially with $N$. As allQueensRecursive performs loops, our vanilla Python implementation is inefficient. To remedy this, we used the Numba Python compiler [8] as shown in Tab. 2. The code is almost unchanged except for the addition of a couple of Numba decorators. Yet, as depicted by the green curve in Fig. 2(a), these minor modifications deliver a 33-fold increase in average computing speed of allQueensNmb over allQueensRec.
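A timing harness along the following lines is enough to reproduce the sanity check on a small scale. It is only a sketch: the names `count_solutions` and `time_counts` are ours, plain lists replace the NumPy arrays, and the absolute timings will of course differ from those of the figure.

```python
import math
import time


def count_solutions(n, i=0, col=None, dg1=None, dg2=None):
    """Count N-queens solutions with the recursive scheme of Tab. 1."""
    if col is None:
        col, dg1, dg2 = [True] * n, [True] * (2 * n), [True] * (2 * n)
    if i == n:
        return 1
    nsol = 0
    for j in range(n):
        if col[j] and dg1[i + j] and dg2[i - j + n]:
            col[j] = dg1[i + j] = dg2[i - j + n] = False
            nsol += count_solutions(n, i + 1, col, dg1, dg2)
            col[j] = dg1[i + j] = dg2[i - j + n] = True
    return nsol


def time_counts(sizes):
    """Return {n: (number of solutions, wall-clock seconds)}."""
    results = {}
    for n in sizes:
        start = time.perf_counter()
        nsol = count_solutions(n)
        results[n] = (nsol, time.perf_counter() - start)
    return results


if __name__ == "__main__":
    for n, (nsol, secs) in time_counts(range(4, 9)).items():
        # On a log scale, exponential growth shows up as a roughly
        # constant increment from one board size to the next.
        log_t = math.log10(max(secs, 1e-9))
        print(f"N={n}  solutions={nsol}  log10(seconds)={log_t:.2f}")
```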

The Numba decorator njit() that appears in Tab. 2 is short for jit(nopython=True). It ensures that if the code compiles without errors it will not invoke the Python interpreter while running and will therefore be fast. Additionally, we could have used jit(nopython=True,nogil=True) to instruct Numba to release the Python Global Interpreter Lock [3] while executing the function, thus allowing several instances to run simultaneously on threads of a single process, something that standard Python code cannot do because of the aforementioned lock. This does not have any significant impact on performance in a sequential execution scenario.

import numpy as np

def allQueensRec(n):
    # Arrays used to flag available columns and diagonals
    col = np.ones(n,dtype=bool)
    dg1 = np.ones(2*n,dtype=bool)
    dg2 = np.ones(2*n,dtype=bool)

    return allQueensRecursive(n,0,col,dg1,dg2)

def allQueensRecursive(n,i,col,dg1,dg2):
    # All rows are filled, stop the recursion and report a new solution
    if n == i :
        return 1
    # Try putting a queen in each cell of row i
    nsol = 0
    for j in range(n):
        if (col[j] and dg1[i+j] and dg2[i-j+n]):

            col[j]     = False  # Mark column j as occupied
            dg1[i+j]   = False  # Mark diagonal i+j as occupied
            dg2[i-j+n] = False  # Mark diagonal i-j as occupied

            nsol+=allQueensRecursive(n,i+1,col,dg1,dg2)

            col[j]     = True   # Unmark column j
            dg1[i+j]   = True   # Unmark diagonal i+j
            dg2[i-j+n] = True   # Unmark diagonal i-j

    return nsol
Table 1: Vanilla Python code.
Figure 2: Run times as a function of the board size. Linear scale at the top and log scale at the bottom. (a) Sequential. (b) Parallel.
import numpy as np
from numba import njit

@njit()  # Numba decorator
def allQueensNmb(n):
    # Use np.bool_ rather than np.bool as the dtype
    col = np.ones(n,dtype=np.bool_)
    dg1 = np.ones(2*n,dtype=np.bool_)
    dg2 = np.ones(2*n,dtype=np.bool_)

    return allQueensNumba(n,0,col,dg1,dg2)

@njit()  # Numba decorator
def allQueensNumba(n,i,col,dg1,dg2):
    # All rows are filled, stop the recursion and report a new solution
    if n == i :
        return 1
    # Try putting a queen in each cell of row i
    nsol=0
    for j in range(n):
        if (col[j] and dg1[i+j] and dg2[i-j+n]):

            col[j]     = False  # Mark column j as occupied
            dg1[i+j]   = False  # Mark diagonal i+j as occupied
            dg2[i-j+n] = False  # Mark diagonal i-j as occupied

            nsol+=allQueensNumba(n,i+1,col,dg1,dg2)

            col[j]     = True   # Unmark column j
            dg1[i+j]   = True   # Unmark diagonal i+j
            dg2[i-j+n] = True   # Unmark diagonal i-j

    return nsol
Table 2: The Python code of Tab. 1 slightly modified to force Numba compilation.

To further assess how effective the Python/Numba combination is, we rewrote the code in Go and C++, as shown in Tabs. 3 and 4. Short variable declarations make the Go code very similar to the Python code even though Go is statically typed. The C++ code is slightly more verbose and one must remember to deallocate what has been allocated because there is no garbage collector. As shown in Tab. 5, this can be remedied by using more sophisticated containers such as the standard vectors of C++, which are automatically deallocated at the end of the scope in which they are defined. Note that we used vector<uint8_t> as the type for our boolean arrays instead of the apparently more natural vector<bool>. We did this because the latter packs the bits densely and has to perform binary arithmetic to extract the requested bit on each access, since memory can only be addressed down to whole bytes. In other words, it reduces memory usage at the expense of increased computation. As we are interested in speed, it is more effective to explicitly use bytes (uint8_t) for our purpose. Nevertheless, we have verified that even when using byte vectors, the implementation of Tab. 5 incurs a small, but noticeable, performance decrease with respect to that of Tab. 4. We therefore chose to stick with the latter, even though it is less elegant. In short, unlike Go, C++ gives the programmer great freedom to carry out tasks in many different ways but it takes a lot of expertise to exploit it effectively and to avoid the many lurking pitfalls.
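The trade-off between bit-packed and byte-per-flag storage can be illustrated in Python. This is only an analogy for what vector<bool> and vector<uint8_t> do internally, not a description of any particular C++ standard library implementation; the helpers `packed_get` and `packed_set` are ours.

```python
def packed_get(bits, k):
    """Read flag k from a bit-packed bytearray (8 flags per byte)."""
    return ((bits[k >> 3] >> (k & 7)) & 1) == 1


def packed_set(bits, k, value):
    """Write flag k into a bit-packed bytearray using a mask."""
    mask = 1 << (k & 7)
    if value:
        bits[k >> 3] |= mask
    else:
        bits[k >> 3] &= ~mask & 0xFF


n_flags = 16
packed = bytearray((n_flags + 7) // 8)  # 2 bytes hold the 16 flags
unpacked = bytearray(n_flags)           # 16 bytes, one per flag

packed_set(packed, 10, True)            # shift, mask, read-modify-write
unpacked[10] = 1                        # a single direct byte access

print(packed_get(packed, 10), bool(unpacked[10]))  # True True
print(len(packed), len(unpacked))                  # 2 16
```

The packed version uses an eighth of the memory but every access goes through a shift and a mask, which is the extra computation mentioned above.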

For example, unlike Python and Go, C++ does not automatically check that one does not write beyond the bounds of arrays. As a result, the buggy code of Tab. 6 runs but returns nonsensical values. We unintentionally made this mistake while translating the code from Python and, even though this is a short program, it took us a while to spot it. Of course, we could have used a tool such as valgrind, which would have detected the error, but this is far less convenient than being given a runtime warning. By default Numba does not perform bounds checks but they can be enabled using the decorator njit(boundscheck=True), which can be useful while debugging.
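To make the contrast concrete, here is a sketch of the same sizing mistake transcribed to plain Python, with lists standing in for the arrays. The function name `count_buggy` is ours. Instead of silently returning nonsensical values, the interpreter stops at the first out-of-range access.

```python
def count_buggy(n):
    """Recursion of Tab. 1, but dg1 and dg2 wrongly have n entries."""
    col = [True] * n
    dg1 = [True] * n  # should be 2 * n
    dg2 = [True] * n  # should be 2 * n

    def recurse(i):
        if i == n:
            return 1
        nsol = 0
        for j in range(n):
            # dg2[i - j + n] is out of range as soon as i - j >= 0
            if col[j] and dg1[i + j] and dg2[i - j + n]:
                col[j] = dg1[i + j] = dg2[i - j + n] = False
                nsol += recurse(i + 1)
                col[j] = dg1[i + j] = dg2[i - j + n] = True
        return nsol

    return recurse(0)


try:
    count_buggy(8)
except IndexError as err:
    print("caught:", err)
```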

The cyan and purple curves of Fig. 2(a) depict the corresponding runtimes. The Numba, Go, and C++ curves are almost superposed. Closer examination of the raw numbers given in Tab. 15 in the appendix shows that C++ wins, with Go slightly slower and Numba slower still. Numba is slower mostly for low values of $N$, which suggests that the algorithm itself runs just as fast but that calling the Numba function from Python involves an overhead. While the observed differences are statistically significant based on the variances of the different runs, in our daily research practice they are rarely large enough to justify giving up the development speed that Python provides and contending with potential bugs such as the one discussed above.

However, there are optimizations that require the low-level control that C++ or Go can provide. For example, in all versions of the code presented here, the memory for the col, dg1, and dg2 arrays is allocated dynamically on the heap. The array sizes are decided at runtime and this code could in theory handle the $N$-queens problem for any value of $N$. However, the computational cost is exponential and large values of $N$ are wildly impractical. If we accept limiting ourselves to $N \leq 32$, we can use fixed-size arrays allocated on the stack by declaring them as var col [32]bool in Go or std::array<bool, 32> in C++. Unlike in the case discussed above, using bool instead of uint8_t has no adverse effect. We have checked that the C++ code modified in this manner delivers a larger gain over Numba than the one reported above. Potential explanations are that putting the arrays on the stack works better for the CPU cache or that the optimizer has an easier time reasoning about fixed-size stack arrays. In any event, this shows that C++ and Go, being closer to the hardware, may be useful to fine-tune code under some circumstances.

Go can therefore be considered a promising alternative to both Python and C++ because its run-time checks make bugs such as the one of Tab. 6 easy to detect and correct. Furthermore, it is almost as concise as Python and a little faster than Numba. However, some of its design features make it unwieldy in the prototyping role. For example, insisting that all variables and packages declared in a file be used makes sense for production code but is unhelpful when groping for a solution to a research problem: commenting out a particular line of code can require many other modifications in the file, which are unnecessary until a final solution has been found. Similarly, not providing a full-fledged class system can be understood as a way to discourage the writing of hard-to-maintain spaghetti code, which is commendable in production mode but unnecessarily rigid in prototyping mode.

func allQueensRec(n int) int {
    // Allocate arrays
    col := make([]bool, n, n)
    dg1 := make([]bool, 2*n, 2*n)
    dg2 := make([]bool, 2*n, 2*n)
    // All columns and diagonals are initially available
    for i := 0; i < n; i++ {
        col[i] = true
    }
    for i := 0; i < 2*n; i++ {
        dg1[i] = true
        dg2[i] = true
    }
    // Perform the recursive computation and return the results
    return allQueensRecursive(n, 0, col, dg1, dg2)
}

// The slices share their backing arrays with the caller, so the marks
// made at one recursion level are visible at the next one.
func allQueensRecursive(n int, i int, col []bool, dg1 []bool, dg2 []bool) int {
    if n == i {
        return 1
    }
    nsol := 0
    for j := 0; j < n; j++ {
        if col[j] && dg1[i+j] && dg2[i-j+n] {
            col[j]     = false
            dg1[i+j]   = false
            dg2[i-j+n] = false

            nsol += allQueensRecursive(n, i+1, col, dg1, dg2)

            col[j]     = true
            dg1[i+j]   = true
            dg2[i-j+n] = true
        }
    }
    return nsol
}
Table 3: Go version of the Python code of Tab. 1.
#include <cstring>  // memset

int allQueensRecursive(int n,int i,bool *col,bool *dg1,bool *dg2)
{
    if (n == i) {
        return 1;
    }
    int nsol = 0;
    for (int j = 0; j < n; j++) {
        if (col[j] && dg1[i+j] && dg2[i-j+n]) {
            col[j]     = false;
            dg1[i+j]   = false;
            dg2[i-j+n] = false;

            nsol += allQueensRecursive(n, i+1, col, dg1, dg2);

            col[j]     = true;
            dg1[i+j]   = true;
            dg2[i-j+n] = true;
        }
    }
    return nsol;
}

int allQueensRec(int n)
{
    // Allocate dynamic memory on the heap
    bool *col = new bool[n];
    bool *dg1 = new bool[2*n];
    bool *dg2 = new bool[2*n];
    // All columns and diagonals are initially available
    memset((void *)col,1,n*sizeof(bool));
    memset((void *)dg1,1,2*n*sizeof(bool));
    memset((void *)dg2,1,2*n*sizeof(bool));
    // Perform the recursive computation
    int nsol = allQueensRecursive(n,0, col, dg1, dg2);
    // No garbage collector, must deallocate to prevent memory leaks
    delete[] col;
    delete[] dg1;
    delete[] dg2;

    return nsol;
}
Table 4: C++ version of the Python code of Tab. 1. To initialize the arrays, we could have used loops as in the Go code of Tab. 3. Instead we used the lower-level function memset, which performs the same task without explicit loops and can therefore be expected to be faster.
#include <cstdint>
#include <vector>
using std::vector;

typedef vector<uint8_t> BoolArray;  // Use uint8_t instead of bool to boost efficiency

int allQueensRecursive(int n,int i,BoolArray& col,BoolArray& dg1,BoolArray& dg2){
       ........
}

int allQueensRec(int n)
{
    BoolArray col(n,   true);
    BoolArray dg1(2*n, true);
    BoolArray dg2(2*n, true);

    int nsol = allQueensRecursive(n,0,col,dg1,dg2);

    return nsol;
}
Table 5: Using C++ vectors makes it unnecessary to explicitly free them. The signature of allQueensRecursive has been slightly modified so that the vectors are passed by reference instead of by value and are therefore not copied.
int allQueensRec(int n)
{
    // dg1 and dg2 are of size n instead of 2n
    bool *col = new bool[n];
    bool *dg1 = new bool[n];
    bool *dg2 = new bool[n];
    memset((void *)col,1,n*sizeof(bool));
    memset((void *)dg1,1,n*sizeof(bool));
    memset((void *)dg2,1,n*sizeof(bool));

        ........
Table 6: Buggy version of the C++ code of Tab. 4. It runs but returns nonsensical results.

3 Parallel Processing

@njit()
def allQueensCol(n,j):

    col = np.ones(n,dtype=np.bool_)
    dg1 = np.ones(2*n,dtype=np.bool_)
    dg2 = np.ones(2*n,dtype=np.bool_)
    # Put a queen in cell j of the first row
    col[j]   = False
    dg1[j]   = False
    dg2[n-j] = False
    # Fill the rest of the board starting with the second row
    return allQueensNumba(n,1,col,dg1,dg2)

if __name__ == "__main__":
    nsol = 0
    for j in range(8):
        nsol += allQueensCol(8,j)
Table 7: The Python code of Tab. 2 rewritten to perform independent computations.
from functools import partial
from multiprocessing import Pool as Pool_proc
from numba import njit, prange

@njit(parallel=True)
def allQueensPara(n):
    nsol = 0
    for j in prange(n):         # prange is only applicable inside jit(parallel=True)
        nsol+=allQueensCol(n,j)
    return nsol

def allQueensPool(n,nproc=None):
    with Pool_proc(nproc) as pool:  # Create a pool of processes
        nsols=pool.map(partial(poolWorker,n),range(n))
        return sum(nsols)

def poolWorker(n,j):
    return allQueensCol(n,j)
Table 8: Two different ways to invoke the function allQueensCol of Tab. 7 so that the computation is split into tasks potentially running on different cores. Note how compact this code is.
func allQueensPara(nd int) int {
    // Create the structure that will be used to synchronize
    var wg sync.WaitGroup
    wg.Add(nd)
    // Explicitly allow go to run on 8 cores
    runtime.GOMAXPROCS(8)

    sols := make([]int, nd)

    f := func(wg *sync.WaitGroup, n int, j int) {
        sols[j] = allQueensCol(n, j) // Result for a queen in cell j of first row
        wg.Done()                    // Flag the goroutine as complete
    }
    for j := 0; j < nd; j++ {
        go f(&wg, nd, j)             // Launch a new goroutine for each computation
    }
    wg.Wait()                        // Wait for all goroutines to be completed

    nsol := sols[0]                  // Sum the individual results
    for j := 1; j < nd; j++ {
        nsol += sols[j]
    }
    return nsol
}

func allQueensCol(n int, j int) int {

    col := make([]bool, n, n)
    dg1 := make([]bool, 2*n, 2*n)
    dg2 := make([]bool, 2*n, 2*n)

    for i := 0; i < n; i++ {
        col[i] = true
    }
    for i := 0; i < 2*n; i++ {
        dg1[i] = true
        dg2[i] = true
    }
    col[j]   = false
    dg1[j]   = false
    dg2[n-j] = false

    return allQueensRecursive(n, 1, col, dg1, dg2)
}
Table 9: Go version of the parallel Python code of Tabs. 7 and 8.
#include <cstring>
#include <future>
#include <vector>
using namespace std;

int allQueensCol(int n,int j) {

    bool *col = new bool[n];
    bool *dg1 = new bool[2*n];
    bool *dg2 = new bool[2*n];
    memset((void *)col,1,n*sizeof(bool));
    memset((void *)dg1,1,2*n*sizeof(bool));
    memset((void *)dg2,1,2*n*sizeof(bool));

    col[j]   = false;
    dg1[j]   = false;
    dg2[n-j] = false;

    int ncol = allQueensRecursive(n,1,col, dg1, dg2);

    // Memory obtained with new[] must be released with delete[], not free()
    delete[] col;
    delete[] dg1;
    delete[] dg2;
    return ncol;
}

int allQueensPara(int nd){

    vector<future<int>> running_tasks;
    // Start one asynchronous task per column
    for(int col = 0; col < nd; col++){
        running_tasks.push_back(
            async(std::launch::async, [=]() {return allQueensCol(nd,col);})
        );
    }
    // Wait for results
    int nsol_sum = 0;
    for(auto& f : running_tasks) {
        nsol_sum += f.get();
    }
    return nsol_sum;
}
Table 10: C++ version of the parallel Python code of Tabs. 7 and 8. The async template function makes the code very compact.

Nearly every modern computer, including the one we used, has a multicore CPU and we can speed things up by running independent parts of the computation simultaneously on separate cores. In Go and C++, this can be done using multiple threads. Standard Python cannot do this due to the Global Interpreter Lock (GIL) [3] that we have already encountered in the previous section. Fortunately, there are several workarounds and we explored two of them:

  1. Using Numba’s automatic parallelization [10]. Numba implements the ability to run loops in parallel as in Open Multi-Processing (OpenMP). The loop’s body is scheduled in separate threads and the system automatically takes care of data privatization and reduction.

  2. Using a pool of processes. The pool distributes the computation into separate processes and tasks are sent to the available processors using FIFO scheduling. Each process has its own interpreter and GIL, so they do not interfere. The price to pay is that objects need to be serialized and sent to the processes.

To test these two approaches, we parallelized allQueensRec in a simple way. As shown in Tab. 7, we defined a new function allQueensCol that puts a queen in column $j$ of the first row and then invokes the function allQueensNumba defined in Tab. 2 starting at the second row instead of the first, as in allQueensRec. Summing the results for all possible values of $j$ yields the same result by performing $N$ independent computations. In Tab. 8, we integrate this code into two functions that spread the tasks across separate cores: allQueensPara uses the first method described above while allQueensPool uses the second. We will refer to them as para and pool respectively. Numba can take parallelization even further and produce functions that exploit the GPU. However, we did not explore this aspect in this study because our problem is not conducive to GPU processing.
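The decomposition into independent per-column tasks can be checked with a few lines of plain Python. The sketch below uses lists instead of NumPy arrays and a thread pool from the standard library purely to show that the tasks share no state. Because of the GIL, these threads do not make uncompiled Python any faster. All function names are ours.

```python
from concurrent.futures import ThreadPoolExecutor


def count_from_row(n, i, col, dg1, dg2):
    """Count the ways to fill rows i..n-1 given the flags already set."""
    if i == n:
        return 1
    nsol = 0
    for j in range(n):
        if col[j] and dg1[i + j] and dg2[i - j + n]:
            col[j] = dg1[i + j] = dg2[i - j + n] = False
            nsol += count_from_row(n, i + 1, col, dg1, dg2)
            col[j] = dg1[i + j] = dg2[i - j + n] = True
    return nsol


def count_with_first_queen(n, j):
    """One independent task: first-row queen fixed in column j."""
    col = [True] * n            # each task owns its own arrays
    dg1 = [True] * (2 * n)
    dg2 = [True] * (2 * n)
    col[j] = dg1[j] = dg2[n - j] = False
    return count_from_row(n, 1, col, dg1, dg2)


def count_all_split(n, max_workers=4):
    """Run the n tasks through a pool and sum the partial counts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(lambda j: count_with_first_queen(n, j), range(n)))


print([count_with_first_queen(8, j) for j in range(8)])
print(count_all_split(8))  # 92
```

The per-column counts are not all equal, so the tasks take different amounts of time. This is one reason why the scheduling strategy of the pool matters.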

In Fig. 2(b), we compare runtimes of the sequential Numba-compiled code of the previous section with our two parallelized versions. As before, the sequential code is depicted by the green curve while the two parallel versions are depicted by the red and blue curves, labeled para and pool respectively. para clearly delivers a significant improvement. However, for smaller values of $N$, we noted that para does not always fully use the 8 cores of our machine, which impacts its performance. For values of $N$ up to 13, the overhead involved in spawning new processes dominates the computational cost of pool and makes it uncompetitive. However, for $N > 13$, this overhead becomes negligible with respect to the rest of the computation and pool starts dominating, albeit only by a small margin for large values of $N$, as can be seen in Tab. 15.

To again compare against Go and C++, we rewrote the code in these two languages using their built-in multi-threading capabilities, as shown in Tabs. 9 and 10. Note that we used the template function std::async to make the C++ version compact. The corresponding performance measurements are depicted by the cyan and purple curves of Fig. 2(b). As before, for small values of $N$, para and pool are uncompetitive because the initial overhead is too large. However, for larger values of $N$ they catch up and eventually do better than Go and almost as well as C++. In short, there are corner cases in which it might pay to switch from Python to C++ or Go but it is not clear how pervasive they are in our research practice.

4 Numba Limitations

In the two previous sections, we have argued that Numba is a powerful tool to painlessly compile potentially slow Python code so that it runs almost as fast as Go and C++. However, it also has limitations: Only a subset of Python [11] and NumPy [9] features are available inside compiled functions. Numba has a compilation mode that generates code able to handle all values as Python objects and uses the Python C API to perform all operations on such objects. Unfortunately, relying on this mode results in almost no performance gain over non-compiled code. This is why we used the njit() decorator in all our examples. It yields much faster code but requires that the native types of all values in the function can be inferred, which is not necessarily true in standard Python. Otherwise, the compilation fails.
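The type-inference requirement can be illustrated with a function that is perfectly valid Python but whose return type depends on the value of its argument. The sketch below is hedged in two ways: the helper `try_nopython` and both example functions are ours, and the snippet reports "numba missing" instead of failing when Numba is not installed.

```python
try:
    from numba import njit
    from numba.core.errors import TypingError
except ImportError:
    njit = None  # Numba not available: the sketch degrades gracefully


def mixed_return(flag):
    """Valid Python, but returns an int or a str depending on flag."""
    if flag:
        return 1
    return "one"


def double(x):
    """A function whose types can be inferred without ambiguity."""
    return 2 * x


def try_nopython(func, *args):
    """Report whether func compiles and runs in nopython mode."""
    if njit is None:
        return "numba missing"
    try:
        njit(func)(*args)  # compilation happens on the first call
        return "compiled"
    except TypingError:
        return "typing error"


print(try_nopython(double, 21))
print(try_nopython(mixed_return, True))
```

With Numba installed, the first call is expected to compile and the second to be rejected, because the two return statements cannot be given a single native type.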

In practice, this imposes an additional workload on the programmer, who has to figure out what parts of the code are computationally expensive, encapsulate them in separate functions, and make sure that these functions can be compiled using the no-python mode that njit() enforces. This is probably why there are ongoing efforts to optimize whole Python programs, such as PyPy [13]. Unfortunately, the results are not always compatible with libraries utilizing the C API, such as those routinely used in the field of scientific computing. As discussed in the appendix, Julia is a potential alternative to Python/Numba that supports both high-performance scientific computing and fast prototyping, is compiled, and could eventually address this problem.

5 Conclusion

As Computer Vision and Machine Learning researchers, we primarily need a language that allows us to test and refine ideas quickly while giving us access to as many mathematical, image processing, and machine learning libraries as possible. The latter spares us the need to reinvent the wheel every time we want to try something new. Maintainability and the ability to work in large teams are secondary considerations as our code often stops evolving once the PhD student or post-doctoral researcher who wrote it leaves our lab. Before that happens, we typically make it publicly available to demonstrate that the ideas we published in conferences and journals truly work and, in the end, that is often its main function.

Python fits that bill perfectly at the cost of being slow when performing operations such as loops. Fortunately, as we showed in this report, this shortcoming can be largely overcome by using the Numba compiler, which delivers performance comparable to that of C++, which itself tends to be faster than Go. This suggests that a perfectly valid workflow is to first write and debug a program in ordinary Python, then identify the computational bottlenecks, and finally use Numba to eliminate them. This will work most of the time. In the rare cases where it does not, we can rewrite the relevant section of the code in C++ and call it from Python, which can be achieved using Cython [2] or pybind11 [4]. Interestingly, this approach harkens back to the standard way one used to work in the much older Common Lisp language, as discussed in the appendix.
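The "identify the computational bottlenecks" step of this workflow can be carried out with the profiler from the Python standard library. The sketch below is one possible way to do it, not the procedure used for this report. The helper `hottest_function` is ours, and the solver is a list-based copy of the recursion of Tab. 1.

```python
import cProfile
import pstats


def count_from_row(n, i, col, dg1, dg2):
    """Recursive kernel: this is where the time is expected to go."""
    if i == n:
        return 1
    nsol = 0
    for j in range(n):
        if col[j] and dg1[i + j] and dg2[i - j + n]:
            col[j] = dg1[i + j] = dg2[i - j + n] = False
            nsol += count_from_row(n, i + 1, col, dg1, dg2)
            col[j] = dg1[i + j] = dg2[i - j + n] = True
    return nsol


def count_solutions(n):
    """Thin driver that allocates the flags and starts the recursion."""
    return count_from_row(n, 0, [True] * n, [True] * (2 * n), [True] * (2 * n))


def hottest_function(func, *args):
    """Profile func(*args); return the name with the largest own time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) to
    # (primitive calls, total calls, own time, cumulative time, callers)
    (_, _, name), _ = max(stats.stats.items(), key=lambda item: item[1][2])
    return name


print(hottest_function(count_solutions, 7))  # count_from_row
```

Once the profiler has singled out a function such as `count_from_row`, that function alone becomes the candidate for a Numba decorator.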


Appendix A Other Languages

program eightqueen1(output);

var i : integer; q : boolean;
    a : array[ 1 .. 8] of boolean;
    b : array[ 2 .. 16] of boolean;
    c : array[ -7 .. 7] of boolean;
    x : array[ 1 .. 8] of integer;

procedure try( i : integer; var q : boolean);
    var j : integer;
    begin
    j := 0;
    repeat
        j := j + 1;
        q := false;
        if a[ j] and b[ i + j] and c[ i - j] then
            begin
            x[ i    ] := j;
            a[ j    ] := false;
            b[ i + j] := false;
            c[ i - j] := false;
            if i < 8 then
                begin
                try( i + 1, q);
                if not q then
                    begin
                    a[ j]     := true;
                    b[ i + j] := true;
                    c[ i - j] := true;
                    end
                end
            else
                q := true
            end
    until q or (j = 8);
    end;

begin
for i :=  1 to  8 do a[ i] := true;
for i :=  2 to 16 do b[ i] := true;
for i := -7 to  7 do c[ i] := true;
try( 1, q);
if q then
    for i := 1 to 8 do write( x[ i]:4);
end.
Table 11: Pascal program written by Niklaus Wirth in 1976. It finds one solution to the eight queens problem.
/* Use clpfd package to loop through all configurations until a feasible one is found */
:- use_module(library(clpfd)).

n_queens(N, Qs) :-
        length(Qs, N),
        Qs ins 1..N,
        safe_queens(Qs).

/* Predicate is true if the configuration is feasible */
safe_queens([]).
safe_queens([Q|Qs]) :- safe_queens(Qs, Q, 1), safe_queens(Qs).
safe_queens([], _, _).
safe_queens([Q|Qs], Q0, D0) :-
        Q0 #\= Q,
        abs(Q0 - Q) #\= D0,
        D1 #= D0 + 1,
        safe_queens(Qs, Q0, D1).

 /* Example */
 ?- n_queens(8, Qs), labeling([ff], Qs).
   Qs = [1, 5, 8, 6, 3, 7, 2, 4] ;
   Qs = [1, 6, 8, 3, 7, 4, 2, 5] .
Table 12: Prolog version of the Pascal code of Tab. 11.
(defun allQueensRec(n)
  (declare (type fixnum n))

  (let ((col (make-array n       :initial-element t :element-type 'boolean))
        (dg1 (make-array (* 2 n) :initial-element t :element-type 'boolean))
        (dg2 (make-array (* 2 n) :initial-element t :element-type 'boolean)))
    (declare (type (array boolean 1) col dg1 dg2 ))

    (allQueensRecursive n 0 col dg1 dg2)))

(defun allQueensRecursive(n i col dg1 dg2)
  ;; Optional declarations. Some compilers exploit them to speed up the code
  (declare (type (array boolean 1) col dg1 dg2 ))
  (declare (type fixnum n i))

  (if (= i n)

      1

    (let ((nsol 0))
       (declare (type fixnum nsol))

      (loop for j from 0 below n
            when (and (aref col j) (aref dg1 (+ i j)) (aref dg2 (- (+ i n) j)))
            do

              (setf
                (aref col j) nil
                (aref dg1 (+ i j)) nil
                (aref dg2 (- (+ i n) j)) nil)

              (incf nsol (allQueensRecursive n (+ i 1) col dg1 dg2))

              (setf
                (aref col j) t
                (aref dg1 (+ i j)) t
                (aref dg2 (- (+ i n) j)) t))
      nsol)))
Table 13: Common Lisp version of the Python code of Tab. 1.
function allQueensRecursive(n, i, col, dg1, dg2)
    # All rows are filled, stop the recursion and report a new solution
    if n == i
        return 1
    end

    # Try putting a queen in each cell of row i
    nsol = 0

    for j = 0:n-1
        # Julia arrays are 1-based, hence the +1 in every index
        if (col[j+1] && dg1[i+j+1] && dg2[i-j+n+1])

            col[j+1]     = false  # Mark column j as occupied
            dg1[i+j+1]   = false  # Mark diagonal i+j as occupied
            dg2[i-j+n+1] = false  # Mark diagonal i-j as occupied

            nsol += allQueensRecursive(n,i+1,col,dg1,dg2)

            col[j+1]     = true   # Unmark column j
            dg1[i+j+1]   = true   # Unmark diagonal i+j
            dg2[i-j+n+1] = true   # Unmark diagonal i-j
        end
    end
    return nsol
end

function allQueensRec(n)
    col = ones(Bool, n)
    dg1 = ones(Bool, 2*n)
    dg2 = ones(Bool, 2*n)

    return allQueensRecursive(n,0,col,dg1,dg2)
end
Table 14: Julia version of the code of Tab. 1.

In his book [15], N. Wirth proposed the $N$-queens algorithm implemented in Pascal and shown in Tab. 11. It is specific to the case $N = 8$ and is designed to stop when the first solution is found. It then returns the corresponding configuration. As shown in Tab. 12, this can be done even more concisely in Prolog, a logic programming language of the same era as Pascal. What makes Prolog particularly interesting is that, of all the languages discussed in this report, it is the only one that forces a truly different approach to programming. It is well suited to performing the kind of systematic exploration and backtracking that solving the $N$-queens problem requires and is still used to solve specific tasks that involve rule-based logical queries such as searching databases.

As not all implementations of the Pascal standard support dynamic arrays, extending the program of Tab. 11 from the N = 8 case to the general case would require manually allocating memory using pointers, much as in C. This would not be necessary in the even older Common Lisp language, as shown in Tab. 13. Although among the most ancient languages still in regular use, Lisp offers many of the same amenities as Python, including dynamically allocated arrays, sophisticated loop structures, and garbage collection. It can be used either in interpreted or compiled form, much like Python without and with Numba. When the compiler is invoked, optional type declarations help it generate faster code. A standard approach to developing in Lisp is therefore to prototype quickly without the declarations and then add them as needed to speed up the code. Interestingly, Python is now moving in that direction with its support for type hints, but it does not yet enforce that variable and argument values match their declared types at runtime. The hints are only intended to be used by third-party tools such as type checkers, IDEs, and linters [14].
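The remark about type hints not being enforced at runtime can be illustrated with a minimal sketch. The function name `count_cells` is hypothetical and does not come from the report; the example relies only on standard Python behavior, namely that annotations are stored on the function but not checked when it is called.

```python
def count_cells(n: int) -> int:
    """Return the number of cells on an n x n board (hypothetical helper)."""
    return n * n


# The annotations are recorded as metadata on the function object...
hints = count_cells.__annotations__

# ...but nothing checks them at call time: a float is accepted even though
# the hint says int, and the result is a float rather than the declared int.
result = count_cells(2.5)
```

A static checker such as mypy would be expected to flag the last call, which is the kind of third-party tool usage the text refers to.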

As can be seen in Fig. 2(a), because it is compiled, Lisp does not perform badly compared to the more recent languages discussed in this report. The even newer Julia [6] language can be understood as being related to it in that it supports rapid prototyping by allowing interactive execution while being compiled. Type declarations are available but optional, and the code is very Python-like, as can be seen in Tab. 14. Like Numba, Julia uses LLVM [7] to perform low-level optimizations and produce efficient native binary code. Unlike in C++ or Go, there is no explicit compilation step, and yet it delivers performance that is almost on par with that of the other compiled languages we discussed, as can be seen in Fig. 2(a) and Tab. 15.

This makes Julia a potentially attractive alternative to Python/Numba. Unfortunately, there are some significant obstacles to its adoption. First, it is still a new language. Some important features remain experimental and the number of third-party libraries is limited, whereas Python gives access to a wealth of powerful libraries, such as the deep learning ones that have become absolutely central to our research activities. Furthermore, it features some design choices that differentiate it from currently popular languages [5], such as 1-indexed arrays and multiple dispatch instead of classes. Whether or not these choices are wise, they make it harder to switch from an established language like Python to Julia.
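To make the indexing difference concrete, here is a minimal pure-Python sketch of the same recursion as the Julia code of Tab. 14. The snake_case names are our own choice for illustration, and the sketch is not claimed to be identical to the code of Tab. 1. Because Python arrays are 0-indexed, every `+1` offset that appears in the Julia subscripts disappears.

```python
def all_queens_recursive(n, i, col, dg1, dg2):
    """Count the ways to complete a board whose rows 0..i-1 are already filled."""
    # All rows are filled: one complete solution has been found
    if i == n:
        return 1

    nsol = 0
    for j in range(n):
        # Julia's col[j+1], dg1[i+j+1], dg2[i-j+n+1] lose their "+1" here
        if col[j] and dg1[i + j] and dg2[i - j + n]:
            # Mark the column and both diagonals as occupied
            col[j] = dg1[i + j] = dg2[i - j + n] = False
            nsol += all_queens_recursive(n, i + 1, col, dg1, dg2)
            # Unmark them before trying the next cell of row i
            col[j] = dg1[i + j] = dg2[i - j + n] = True
    return nsol


def all_queens_rec(n):
    """Return the number of solutions of the n-queens problem."""
    col = [True] * n
    dg1 = [True] * (2 * n)
    dg2 = [True] * (2 * n)
    return all_queens_recursive(n, 0, col, dg1, dg2)
```

For example, `all_queens_rec(8)` counts the 92 solutions of the classic eight-queens puzzle.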

Sequential, N = 8         9         10        11        12        13        14        15        16        17        18
Python          0.00391   0.01625   0.07024   0.33707   1.77808   9.58502   57.5610   -         -         -         -
Lisp            0.00036   0.00148   0.00577   0.02711   0.13346   0.72808   4.43882   29.1243   -         -         -
Julia           0.00010   0.00047   0.00213   0.01057   0.05555   0.30745   1.90335   13.1295   89.8649   634.371   -
Numba           0.00011   0.00053   0.00219   0.01054   0.05455   0.28898   1.74438   11.0766   75.5948   546.312   4085.54
Go              0.00009   0.00044   0.00198   0.01007   0.05404   0.30519   1.80617   10.9284   73.6867   527.514   4014.05
C++             0.00009   0.00041   0.00189   0.00935   0.04991   0.29386   1.67399   9.99408   66.7773   491.910   3677.10

Parallel, N =   8         9         10        11        12        13        14        15        16        17        18
Numba Para      0.00008   0.000248  0.00087   0.00473   0.02509   0.15463   0.98384   6.6811    17.2807   135.166   1184.33
Numba Pool      0.13120   0.130002  0.13113   0.13105   0.13349   0.13280   0.44365   2.52153   16.9606   135.078   994.799
Go              0.00012   0.000221  0.00055   0.00236   0.01222   0.08580   0.51022   3.44080   22.9841   140.860   1068.71
C++             0.00012   0.000202  0.00063   0.00272   0.01293   0.06945   0.39773   2.75142   18.7342   122.827   877.398
Table 15: Benchmarking results in seconds per trial for N from 8 to 18.

Appendix B Raw Data

The performance numbers we used to produce the plots of Fig. 2 are given in Tab. 15. For both the sequential and parallel versions of the code and for each value of N, the time for the fastest implementation appears in red and the one for the second best in blue. These numbers were obtained on a 2.9 GHz Quad-Core Intel Core i7 Mac running the Catalina operating system. For all versions of the code, we ran 20, 10, or 3 trials depending on the value of N and computed the mean and variance in each case. We have rerun all these benchmarks on an Intel Xeon X5690 CPU running Ubuntu 18.04 and the overall ranking of the implementations was unchanged.
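The measurement procedure described above, repeated trials summarized by their mean and variance, could be sketched as follows. This is only an illustration under stated assumptions: the report does not say which timer or which variance estimator was used, so `time.perf_counter` and the sample variance are our choices, and `time_trials` and `busy_work` are hypothetical names.

```python
import statistics
import time


def time_trials(func, arg, n_trials):
    """Run func(arg) n_trials times; return (mean, variance) of the wall-clock seconds."""
    times = []
    for _ in range(n_trials):
        start = time.perf_counter()
        func(arg)
        times.append(time.perf_counter() - start)
    # Assumption: sample variance, which needs at least two trials
    return statistics.mean(times), statistics.variance(times)


def busy_work(n):
    """Stand-in workload (hypothetical); replace it with an N-queens solver."""
    return sum(k * k for k in range(n))


mean_s, var_s = time_trials(busy_work, 10_000, n_trials=3)
```

Using fewer trials for the slowest runs keeps the total benchmarking time manageable, at the cost of a noisier variance estimate.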