## 1 Introduction

Many practical problems, such as the design of power distribution networks and highways, can be mapped to a network and modeled as graphs, and their solution is often based on finding the Minimum Spanning Tree (MST) of the modeled graph. These graphs are usually large and complex, which leads to computationally demanding systems with a high time cost. Many efficient solutions for these problems are based on Evolutionary Algorithms (EAs), which have been shown to offer good performance and efficiency [De Jong 2006].

Finding an MST is solvable in polynomial time with two well-known algorithms: Prim's and Kruskal's. However, these algorithms fail when a degree restriction is applied to all nodes of the graph, which is not an uncommon scenario in real-world problems. In this case, the problem becomes NP-Hard [Knowles and Corne 2000].
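The effect of a degree restriction can be illustrated with a minimal sketch. The code below is a plain textbook version of Kruskal's algorithm, not the modified version used by the platform, and the toy graph and helper names (`kruskal`, `max_degree`) are assumptions made only for this illustration. It shows that the unconstrained MST of a star-like graph concentrates all edges on one node, so it would violate any degree limit below 4.

```python
def kruskal(num_nodes, edges):
    """Return the MST edges of a connected undirected graph.

    edges is a list of (weight, u, v) tuples, nodes numbered 0..num_nodes-1.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two components, so keep it
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


def max_degree(num_nodes, tree):
    """Largest number of tree edges incident to a single node."""
    degree = [0] * num_nodes
    for _, u, v in tree:
        degree[u] += 1
        degree[v] += 1
    return max(degree)


# Hypothetical toy graph: node 0 is cheaply connected to every other node.
edges = [(1, 0, 1), (1, 0, 2), (1, 0, 3), (1, 0, 4),
         (5, 1, 2), (5, 2, 3), (5, 3, 4)]
mst = kruskal(5, edges)
mst_weight = sum(w for w, _, _ in mst)
mst_max_degree = max_degree(5, mst)
print(mst_weight, mst_max_degree)
```

With a degree limit of 2, for instance, the tree returned here is infeasible, and a feasible tree has to use some of the heavier edges.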

In this paper we propose an extension of the platform developed by [Gois et al. 2014] in order to solve larger problems by exploiting multiple FPGAs. We present a modeling framework and then a real implementation in hardware.

## 2 Materials and Methods

In a previous work, [Gois et al. 2014] developed a platform called NDEWG, which finds an MST for a given graph by using an EA whose time complexity with serial processing elements is the one reported in [Delbem et al. 2012], and which is reduced further if the processing elements (henceforth called Workers) work in parallel. This complexity was achieved by using a special representation for trees, called Node-Depth Encoding, or NDE [Santos et al. 2010].

The NDE represents a tree as a linear list of pairs $(n, d)$, where $n$ is a tree node and $d$ is its depth. The pairs are arranged in the list in the order in which a depth-first search visits the nodes.
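As an illustration of this representation, the sketch below builds the node-depth list of a small tree with an iterative depth-first traversal. The adjacency-dictionary input and the function name `node_depth_encoding` are assumptions made for this example and do not reflect how the hardware stores the encoding.

```python
def node_depth_encoding(adjacency, root):
    """Return the (node, depth) pairs of a tree in depth-first (preorder) order.

    adjacency maps each node to the list of its neighbours in the tree.
    """
    encoding = []
    stack = [(root, 0, None)]  # (node, depth, parent)
    while stack:
        node, depth, parent = stack.pop()
        encoding.append((node, depth))
        # Push children in reverse so they are visited in their listed order.
        for child in reversed(adjacency[node]):
            if child != parent:
                stack.append((child, depth + 1, node))
    return encoding


# Hypothetical 5-node tree: 0 is the root, 1 and 4 its children,
# 2 and 3 the children of 1.
tree = {0: [1, 4], 1: [0, 2, 3], 2: [1], 3: [1], 4: [0]}
nde = node_depth_encoding(tree, root=0)
print(nde)  # [(0, 0), (1, 1), (2, 2), (3, 2), (4, 1)]
```

A useful property of this layout is that every subtree occupies a contiguous range of the list: it starts at its root and ends just before the next entry whose depth is not greater than the root's depth.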

The NDEWG hardware transforms a graph into a forest of spanning trees by using a modified version of Kruskal's algorithm. A pair of trees is randomly chosen and a tree operator called Preserve Ancestor Operator (PAO) [Santos et al. 2010], [De Lima et al. 2008] is applied to both, generating two different trees that still span the original graph and respect any applied degree restriction. The results are analysed for improvements in overall weight and discarded if no optimisation is achieved. Many Workers may apply the PAO in parallel with different random seeds, each one analysing its result for the best weight reduction. After many iterations, the result is an optimised spanning tree with degree restriction.
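The core move of such an operator, pruning a subtree from one tree and grafting it onto a node of another, can be sketched on node-depth lists as follows. This is a simplified illustration and not the hardware implementation: the function names and the example trees are hypothetical, the prune and graft positions are given explicitly instead of being drawn at random, and the weight evaluation and degree check are left out.

```python
def subtree_end(tree, start):
    """Index one past the subtree rooted at tree[start] in a node-depth list."""
    root_depth = tree[start][1]
    end = start + 1
    while end < len(tree) and tree[end][1] > root_depth:
        end += 1
    return end


def transfer_subtree(tree_from, tree_to, prune_idx, graft_idx):
    """Move the subtree at prune_idx of tree_from under the node at graft_idx of tree_to.

    prune_idx must not be 0 (the root), otherwise tree_from would become empty.
    Both input lists are left unmodified; two new lists are returned.
    """
    end = subtree_end(tree_from, prune_idx)
    pruned = tree_from[prune_idx:end]
    # The pruned root becomes a child of the graft node, so shift all depths.
    shift = tree_to[graft_idx][1] + 1 - pruned[0][1]
    grafted = [(node, depth + shift) for node, depth in pruned]
    new_from = tree_from[:prune_idx] + tree_from[end:]
    new_to = tree_to[:graft_idx + 1] + grafted + tree_to[graft_idx + 1:]
    return new_from, new_to


# Hypothetical forest with two trees in node-depth encoding.
tree_a = [(0, 0), (1, 1), (2, 2), (3, 2), (4, 1)]
tree_b = [(5, 0), (6, 1), (7, 1)]
# Prune the subtree rooted at node 1 and graft it below node 6.
new_a, new_b = transfer_subtree(tree_a, tree_b, prune_idx=1, graft_idx=1)
print(new_a)  # [(0, 0), (4, 1)]
print(new_b)  # [(5, 0), (6, 1), (1, 2), (2, 3), (3, 3), (7, 1)]
```

Because a subtree is a contiguous slice of the list, the move reduces to a slice copy plus a constant depth offset, which is what makes the encoding attractive for a hardware Worker.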

This project was implemented on an Altera Cyclone II FPGA, handling up to 512 nodes, and later on a Stratix IV GX, reaching up to 1024 nodes. Further expansion was not possible due to bus-width restrictions. To increase the size of solvable graphs, the system bus was expanded to 64 bits.

Since most of the hardware resources of the platform are dedicated to Worker synthesis, which increases linearly with the graph size, the Workers were moved to a separate set of FPGAs, enabling scalability. The proposed platform is a star architecture consisting of one Central FPGA, responsible for managing the spanning trees to be processed by the Workers, and many Satellite FPGAs, each hosting a number of Workers. Each Satellite FPGA communicates with the Central FPGA through a dedicated link. Figure 1 shows how the logic of each FPGA was implemented.

For validation on physical hardware, it was still necessary to develop a network system to connect the FPGAs. We connected the boards through high-speed 10 Gbps Ethernet interfaces and developed a complete system to abstract all aspects of the network layer. This system is called the Network Abstraction System, or NAS.

The NAS implements master and slave Avalon Memory-Mapped frontends, which are compatible with the network ports of the previously described Central and Satellite modules. The system converts the transactions to a streaming-based Avalon bus, which is used to communicate with the Ethernet MAC. The Ethernet MAC and the XAUI PHY operate on the physical network layer.

The Central, Satellite and NAS projects were merged to validate the whole system. Two Stratix V GX Development Kits (5SGXEA7) were used, connected through two Dual XAUI to SFP+ HSMC boards and two full-duplex SFP+ Avago AFBR-703SDZ optical interfaces with speeds of up to 10 Gbps. Figure 2 shows the final validation system.

## 3 Results

The Central and Satellite modules were first tested by simulation, in which memory and interconnection delays were not considered. Table 1 shows some results from the simulation and synthesis tools, with one local memory used for each Sat. module. For each configuration, the table lists the number of memories, the number of Satellite FPGAs, the amount of Logic Elements (ALUTs) used by the Central and Satellite modules respectively, the maximum frequency (MHz) of the Central and Sat. modules, the average time per iteration (in seconds) to solve graphs of up to 1024 nodes, and the resulting speedup.

Mem. | Sat. FPGAs | ALUTs (Central) | ALUTs (Sat.) | Max. Freq. Central (MHz) | Max. Freq. Sat. (MHz) | Avg. Time per Iter. (s) | Speedup |
---|---|---|---|---|---|---|---|
1 | 1 | 1552 | 56352 | 174.4 | 75.91 | 1.44e-07 | 1x |
1 | 4 | 2945 | 13956 | 124.42 | 122.43 | 8.34e-08 | 1.73x |
1 | 8 | 5051 | 7142 | 122.99 | 123.92 | 9.43e-08 | 1.53x |
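The speedup column can be reproduced from the average times per iteration, taking the single-Satellite configuration as the baseline. The short sketch below only restates the table values; the variable names are chosen for this illustration.

```python
# Average time per iteration (s) from Table 1, keyed by number of Sat. FPGAs.
avg_time = {1: 1.44e-07, 4: 8.34e-08, 8: 9.43e-08}

baseline = avg_time[1]
speedup = {n: round(baseline / t, 2) for n, t in avg_time.items()}
print(speedup)  # {1: 1.0, 4: 1.73, 8: 1.53}
```

Note that, in these simulation results, going from 4 to 8 Satellite FPGAs does not reduce the average time per iteration any further.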

On the final hardware platform, the size of solvable graphs was increased to 4096 nodes; expansion to 8192 nodes was not possible due to the lack of available on-chip memory. Figure 3 shows the average processing time on DNDEWG-64 (multi-FPGA approach), NDEWG-32 (previous work) and an Intel Core 2 Quad Q6600 (2.40 GHz) with the operator coded in C and running in parallel using OpenMP. Data for NDEWG-32 and the PC are available only up to 512 nodes.

The NDEWG shows better efficiency, since it is not affected by the latency introduced by the network interface. However, its scalability is strongly limited by its monolithic nature, which restricts it to the resources of a single FPGA. For 1024 nodes, the DNDEWG-64 has an average time of approximately 85 µs, whereas in simulation (1 memory, 1 Sat. module) the average time was 0.144 µs, roughly 590 times shorter. Thus the bottleneck is concentrated in the network system, which must be investigated for optimisations.

## 4 Conclusion

Parallelised Evolutionary Algorithms are good candidates for optimising NP-Hard problems. This kind of solution can be applied to real-world problems, such as electric power distribution, where complex networks can have up to 100,000 nodes. Much work must still be done to improve the multi-FPGA approach. The NAS operates at a low speed compared to the high-speed link, so improving this system is an objective for further research.

## References

- [De Jong 2006] De Jong, K. A. (2006). Evolutionary computation: a unified approach. MIT Press, Cambridge.
- [De Lima et al. 2008] De Lima, T. W., Rothlauf, F., and Delbem, A. C. (2008). The node-depth encoding: analysis and application to the bounded-diameter minimum spanning tree problem. In GECCO ’08: Proceedings of the 10th annual conference on Genetic and evolutionary computation, pages 969–976, New York, NY, USA. ACM.
- [Delbem et al. 2012] Delbem, A. C. B., De Lima, T., and Telles, G. P. (2012). Efficient forest data structure for evolutionary algorithms applied to network design. IEEE Transactions on Evolutionary Computation, PP(99):1.
- [Gois et al. 2014] Gois, M. M., Matias, P., Perina, A. B., Bonato, V., and Delbem, A. C. B. (2014). A parallel hardware architecture based on node-depth encoding to solve network design problems. International Journal of Natural Computing Research, 4(1):54–75.
- [Knowles and Corne 2000] Knowles, J. and Corne, D. (2000). A new evolutionary approach to the degree-constrained minimum spanning tree problem. IEEE Transactions on Evolutionary Computation, 4(2):125–134.
- [Santos et al. 2010] Santos, A., Delbem, A., London, J., and Bretas, N. (2010). Node-depth encoding and multiobjective evolutionary algorithm applied to large-scale distribution system reconfiguration. IEEE Transactions on Power Systems, 25(3):1254–1265.