1. Introduction
One of the most important properties of a numerical algorithm designed for large-scale cluster systems is scalability. Scalability can be defined as a measure of a parallel system's capacity to decrease computation time in proportion to the number of processors. The upper bound of scalability, an integral characteristic of a parallel algorithm/program, is the least number of processor nodes for which the speedup takes its maximal value. It is valuable to be able to estimate this bound in the early phases of program development; a parallel computation model is a tool that provides this possibility. A model of computation is a framework for specifying and analyzing algorithms or programs [1]. Many parallel computation models have been proposed for distributed-memory multiprocessors. The most famous of these are the BSP model family (see [2, 3, 4, 5, 6, 7]) and the LogP model family (see [8, 9, 10, 11, 12, 13, 14]). Most of these are low-level models and require a detailed description of the structure of the algorithm, down to the level of code in a programming language or pseudocode. This article extends the basic BSP (Bulk Synchronous Parallelism) model [15] to deal with compute-intensive iterative numerical methods executed on distributed-memory multiprocessor systems. Iterative methods are an important class of numerical methods; an overview of various iterative methods can be found in [16, 17, 18]. The new parallel computation model proposed in this article is named BSF (Bulk Synchronous Farm). The BSF model is a high-level parallel computation model based on the master-worker (master-slave) framework [19] and the SPMD (Single-Program-Multiple-Data) programming model [20, 21]. A distinctive feature of the BSF model is the ability to estimate the upper bound of scalability in the early stages of algorithm design.
The rest of the article is organized as follows. Section 2 describes the BSF parallel computation model. Section 3 introduces a cost metric for BSF-programs and provides equations for estimating the speedup and parallel efficiency of an algorithm before its implementation in a programming language; moreover, a simple inequality for estimating the upper scalability bound of a BSF-program is deduced. Section 4 summarizes the results and outlines some directions for future research.
2. BSF computational model
The BSF (Bulk Synchronous Farm) model is intended for multiprocessor systems with distributed memory. A BSF-computer consists of a collection of homogeneous computing nodes with private memory connected by a communication network delivering messages among the nodes. There is just one node called the master-node in a BSF-computer. The rest of the nodes are the worker-nodes. A BSF-computer must include at least one master-node and one worker-node. The BSF-computer layout is shown in Fig. 1.
A BSF-computer utilizes the SPMD programming model, according to which all the worker-nodes execute the same program but process different data. A BSF-program consists of sequences of macro-steps and global barrier synchronizations performed by the master and all the workers. Each macro-step is divided into two sections: the master section and the worker section. The master section includes instructions performed only by the master; the worker section includes instructions performed only by the workers. The sequential order of the master section and the worker section within a macro-step is not important. All the worker-nodes operate on the same data array, but the base address of the data assigned to a worker-node for processing is determined by the logical number of this node. A BSF-program includes the following sequential sections (see Fig. 2):
- initialization;
- iterative process;
- finalization.
Initialization is a macro-step in which the master and workers read or generate input data. Initialization is followed by barrier synchronization. The iterative process repeatedly performs its body until the exit condition checked by the master becomes true. In the finalization macro-step, the master outputs the results and ends the program.
The body of the iterative process includes the following macro-steps:

1) sending orders (from master to workers);
2) processing orders (by workers);
3) receiving results (from workers to master);
4) evaluating the results (by master).
In the first macro-step, the master sends the same orders to all workers. Then, the workers execute the received orders (the master is idle at that time). All the workers execute the same program code but operate on different data with a base address that depends on the worker-node number. Therefore, all workers spend the same amount of time on calculation. There are no data transfers between nodes during order processing. In the third step, all workers send their results to the master. Next, global barrier synchronization is performed. During the fourth step, the master evaluates the results it has received. The workers are idle at this time. After evaluating the results, the master checks the exit condition. If the exit condition is true, the iterative process is finished; otherwise, the iterative process is continued. BSF-program execution is illustrated in Fig. 3.
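The iteration scheme above can be sketched as a small sequential Python simulation. The functions `make_order`, `process_order`, `evaluate` and `exit_condition` are hypothetical placeholders for task-specific logic (they are not part of the model's definition), and everything runs on one machine rather than on separate nodes:

```python
# A minimal sequential simulation of the BSF iterative process.
# make_order, process_order, evaluate and exit_condition are hypothetical
# placeholders for task-specific logic; a real BSF-program would run the
# worker calls on separate nodes of a cluster.

def run_bsf(data, make_order, process_order, evaluate, exit_condition, k):
    """Simulate the BSF iterative process with k workers."""
    state = None
    while True:
        order = make_order(state)                # master sends the same order to all workers
        chunks = [data[i::k] for i in range(k)]  # each worker owns its data segment
        results = [process_order(order, c) for c in chunks]  # workers process orders
        state = evaluate(results, state)         # master evaluates gathered results
        if exit_condition(state):                # master checks the exit condition
            return state

# Example: a single "iteration" that sums an array across 3 workers.
total = run_bsf(
    data=list(range(10)),
    make_order=lambda state: None,
    process_order=lambda order, chunk: sum(chunk),
    evaluate=lambda results, state: sum(results),
    exit_condition=lambda state: True,
    k=3,
)
assert total == sum(range(10))
```

On a real BSF-computer the per-worker loop would be replaced by communication, e.g. a broadcast of the order and a gather of the results.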
3. Evaluation of BSF-program scalability
The main characteristic of scalability is the speedup. For a parallel program, the speedup $a$ can be defined as the ratio of the execution time $T_1$ on a system with one worker-node to the execution time $T_k$ on a system with $k$ worker-nodes:

(1) $a = \dfrac{T_1}{T_k}$.
Parallel efficiency is another important characteristic of scalability. The parallel efficiency $e$ can be defined as the ratio of the speedup to the number of worker-nodes:

(2) $e = \dfrac{a}{k}$.
This section offers a cost metric which can be used to estimate the scalability of a BSF-program. We assume that the time spent on initialization and finalization of a BSF-program is negligible compared to the cost of executing the iterative process. The cost of the iterative process is equal to the sum of the costs of the separate iterations. Therefore, to estimate the execution time of a BSF-program, it is sufficient to estimate the execution time of a single iteration. For this purpose, the following main parameters of the BSF model are introduced:

- $k$: the number of worker-nodes;
- $L$: an upper bound on the latency, or delay, incurred in communicating a message containing one byte from its source node to its target node;
- $t_s$: the time that the master-node is engaged in sending one order to one worker-node, excluding latency;
- $t_w$: the time a BSF-computer with one worker-node needs to perform one order;
- $t_r$: the total time that the master-node is engaged in receiving the results from all worker-nodes, excluding latency;
- $t_p$: the total time that the master-node is engaged in evaluating the results received from all worker-nodes.
The global barrier synchronization performed in the iterative process is implemented by the master waiting for the completion of reading all messages from the workers; therefore, it incurs no additional cost.
The time $T_1$ needed to execute a single iteration on a BSF-computer with one master-node and one worker-node can be calculated as follows:

(3) $T_1 = (L + t_s) + t_w + (L + t_r) + t_p$,

which is equivalent to

(4) $T_1 = 2L + t_s + t_w + t_r + t_p$.
Now, let us calculate the time $T_k$ a BSF-computer with one master-node and $k$ worker-nodes needs to execute a single iteration. All of the workers receive the same message from the master, so the total time for sending the orders from the master to the workers is equal to $k(L + t_s)$. All of the workers perform the same program code on their own data segments, so the time of order execution by a group of $k$ workers is equal to $t_w / k$. The resulting data volume produced by the workers is a parameter of the task and does not depend on $k$, so the total time needed for sending the results from the workers to the master is equal to $kL + t_r$. The time needed for the master to evaluate the results received from the workers is also a task parameter and does not depend on the number of workers. Thus, the total execution time of one iteration on a BSF-computer with one master and $k$ workers can be calculated as follows:
(5) $T_k = k(L + t_s) + \dfrac{t_w}{k} + (kL + t_r) + t_p$,

which is equivalent to

(6) $T_k = k(2L + t_s) + \dfrac{t_w}{k} + t_r + t_p$.

By reducing the right-hand side of the equation to the common denominator, we obtain

(7) $T_k = \dfrac{k^2 (2L + t_s) + k (t_r + t_p) + t_w}{k}$.
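The two forms of the iteration cost can be checked against each other numerically. The sketch below assumes the parameter names $k$, $L$, $t_s$, $t_w$, $t_r$, $t_p$ introduced above and purely illustrative parameter values:

```python
def iteration_time(k, L, t_s, t_w, t_r, t_p):
    """One-iteration cost, form (5): send k orders, process in
    parallel, receive k results, evaluate on the master."""
    return k * (L + t_s) + t_w / k + (k * L + t_r) + t_p

def iteration_time_common_denominator(k, L, t_s, t_w, t_r, t_p):
    """The same cost reduced to the common denominator, form (7)."""
    return (k**2 * (2 * L + t_s) + k * (t_r + t_p) + t_w) / k

# The two forms agree for any k > 0 (illustrative values).
params = dict(L=0.1, t_s=0.5, t_w=1000.0, t_r=2.0, t_p=3.0)
for k in (1, 4, 16, 64):
    assert abs(iteration_time(k, **params)
               - iteration_time_common_denominator(k, **params)) < 1e-9
```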
Using equations (1), (4) and (7), we obtain the following equation for the speedup of a BSF-program:

(8) $a(k) = \dfrac{T_1}{T_k} = \dfrac{k \left( 2L + t_s + t_w + t_r + t_p \right)}{k^2 (2L + t_s) + k (t_r + t_p) + t_w}$.
Let us analyze $a$ as a function of $k$. The function $a(k)$ takes the value 1 at $k = 1$, which is concordant with the definition of the speedup and equation (1). The function $a(k)$ is continuous and positive on the interval $(0, +\infty)$. Let us find the derivative of the function $a(k)$:

(9) $a'(k) = \dfrac{\left( 2L + t_s + t_w + t_r + t_p \right) \left( t_w - k^2 (2L + t_s) \right)}{\left( k^2 (2L + t_s) + k (t_r + t_p) + t_w \right)^2}$.
It follows from (9) that the derivative takes the value 0 at the point $k = \sqrt{t_w / (2L + t_s)}$. Moreover, the derivative takes positive values for $k < \sqrt{t_w / (2L + t_s)}$ and negative values for $k > \sqrt{t_w / (2L + t_s)}$. This means that the BSF-program speedup takes its maximum value at this point. Thus, we may conclude that the value $k_{max}$ is the upper bound of BSF-program scalability:

(10) $k_{max} = \sqrt{\dfrac{t_w}{2L + t_s}}$.
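A quick numerical check, assuming illustrative parameter values, confirms that the discrete maximum of the speedup sits at k_max = sqrt(t_w / (2L + t_s)):

```python
import math

def speedup(k, L, t_s, t_w, t_r, t_p):
    """Speedup a(k) of a BSF-program under the cost model above."""
    T1 = 2 * L + t_s + t_w + t_r + t_p
    Tk = (k**2 * (2 * L + t_s) + k * (t_r + t_p) + t_w) / k
    return T1 / Tk

def scalability_bound(L, t_s, t_w):
    """Upper bound of scalability: sqrt(t_w / (2L + t_s))."""
    return math.sqrt(t_w / (2 * L + t_s))

# Illustrative parameters chosen so that 2L + t_s = 1 and t_w = 10000,
# giving a bound of exactly 100 worker-nodes.
params = dict(L=0.2, t_s=0.6, t_w=10000.0, t_r=5.0, t_p=5.0)
k_max = scalability_bound(params["L"], params["t_s"], params["t_w"])
best_k = max(range(1, 1000), key=lambda k: speedup(k, **params))
assert best_k == round(k_max)  # the discrete maximum sits at k_max
```

Note that, as stated above, the bound depends only on $L$, $t_s$ and $t_w$; changing $t_r$ or $t_p$ shifts the speedup curve but not the position of its maximum.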
Note that the upper bound of BSF-program scalability does not depend on the amount of time that the master is engaged in receiving and evaluating the worker results.
One more important characteristic of a parallel program is the parallel efficiency, calculated by equation (2). Let us estimate the efficiency of a BSF-program. Using equations (2) and (8), we obtain

$e(k) = \dfrac{a(k)}{k} = \dfrac{2L + t_s + t_w + t_r + t_p}{k^2 (2L + t_s) + k (t_r + t_p) + t_w}$.

Assuming $t_r + t_p \ll t_w$ and $2L + t_s \ll t_w$, we have

$2L + t_s + t_w + t_r + t_p \approx t_w$

and

$k^2 (2L + t_s) + k (t_r + t_p) + t_w \approx k^2 (2L + t_s) + t_w$.

Hence, equation (8) yields

(11) $a(k) \approx \dfrac{k \, t_w}{k^2 (2L + t_s) + t_w}$

for $k (t_r + t_p) \ll t_w$. Dividing both parts of equation (11) by $k$, we obtain the following approximate equation to estimate the parallel efficiency of a BSF-program:

(12) $e(k) \approx \dfrac{t_w}{k^2 (2L + t_s) + t_w}$.
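When $t_w$ dominates the remaining parameters, the approximate efficiency stays close to the exact value; a small sketch with illustrative parameter values:

```python
def efficiency(k, L, t_s, t_w, t_r, t_p):
    """Exact parallel efficiency e(k) = a(k) / k under the cost model."""
    T1 = 2 * L + t_s + t_w + t_r + t_p
    Tk = (k**2 * (2 * L + t_s) + k * (t_r + t_p) + t_w) / k
    return (T1 / Tk) / k

def efficiency_estimate(k, L, t_s, t_w):
    """Approximate efficiency: t_w / (k^2 (2L + t_s) + t_w)."""
    return t_w / (k**2 * (2 * L + t_s) + t_w)

# With t_w much larger than L, t_s, t_r and t_p, the estimate tracks
# the exact value to within 1% over a wide range of k (illustrative values).
params = dict(L=0.1, t_s=0.4, t_w=50000.0, t_r=3.0, t_p=2.0)
for k in (1, 10, 100):
    exact = efficiency(k, **params)
    approx = efficiency_estimate(k, params["L"], params["t_s"], params["t_w"])
    assert abs(exact - approx) / exact < 0.01
```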
4. Conclusion
In this article, the new BSF (Bulk Synchronous Farm) model of parallel computations was introduced. The BSF model is intended for evaluating iterative numerical algorithms designed for distributed-memory multiprocessors. One distinctive feature of the BSF model is the ability to evaluate the scalability of an algorithm in the early phases of its development. The structure of a BSF-computer was described. A BSF-computer includes one master-node and several worker-nodes connected by a communication network. The structure of a BSF-program was described. A BSF-program uses the SPMD (Single-Program-Multiple-Data) model, according to which all the worker-nodes execute the same program but process different data. The execution of a BSF-program is divided into iterations. In each iteration, the master sends the orders to the workers; the workers execute the orders and send the results to the master; the master processes the results and checks the exit condition; if the condition is not satisfied, then the master sends new orders to the workers, beginning the next iteration; otherwise, the calculations are stopped. A cost metric was constructed for BSF-programs. This metric offers the following simple estimation of the upper bound of scalability:

$k_{max} = \sqrt{\dfrac{t_w}{2L + t_s}}$,

where $k_{max}$ is the number of worker-nodes at which the maximal speedup is reached, $L$ is the latency, $t_w$ is the time a BSF-computer with one worker-node needs to execute the order, and $t_s$ is the time needed to send an order to one worker-node, excluding latency.
A BSF-implementation of the NSLP algorithm [22] was performed to validate the theoretical results presented in this article. The NSLP algorithm is used to solve large-scale non-stationary linear programming problems. A BSF-implementation of the NSLP algorithm is described in article [23]. The source code of this implementation is freely available on GitHub, at https://github.com/leonidsokolinsky/BSFNSLP. The results of the computational experiments presented in [23] show that the BSF model accurately predicts the upper bound of scalability for the NSLP algorithm implemented as a BSF-program. Future work concerning the BSF model includes the following directions. First, develop a formalism to describe BSF-programs through higher-order functions. Next, design and implement a BSF skeleton for the rapid development of BSF-programs in C++ using the MPI library. Finally, validate the BSF model with different well-known iterative numerical methods.
References
[1] Bilardi, G., Pietracaprina, A.: Models of Computation, Theoretical. In: Encyclopedia of Parallel Computing, pp. 1150–1158. Springer US, Boston, MA (2011). DOI: 10.1007/978-0-387-09766-4_218
[2] Valiant, L.G.: A bridging model for parallel computation. Communications of the ACM. 33(8), 103–111 (1990). DOI: 10.1145/79173.79181
[3] Auf der Heide, F.M., Wanka, R.: Parallel Bridging Models and Their Impact on Algorithm Design. In: Proceedings of the International Conference on Computational Science – ICCS'01. Part II. Lecture Notes in Computer Science, vol. 2074, pp. 628–637. Springer, Berlin, Heidelberg (2001). DOI: 10.1007/3-540-45718-6_68
[4] Valiant, L.G.: A bridging model for multi-core computing. Journal of Computer and System Sciences. 77(1), 154–166 (2011). DOI: 10.1016/j.jcss.2010.06.012
[5] Blanco, V., Gonzalez, J.A., Leon, C., Rodriguez, C., Rodriguez, G., Printista, M.: Predicting the performance of parallel programs. Parallel Computing. 30(3), 337–356 (2004). DOI: 10.1016/j.parco.2003.11.004
[6] Gerbessiotis, A.V.: Extending the BSP model for multi-core and out-of-core computing: MBSP. Parallel Computing. 41, 90–102 (2015). DOI: 10.1016/j.parco.2014.12.002
[7] Cha, H., Lee, D.: H-BSP: A Hierarchical BSP Computation Model. The Journal of Supercomputing. 18(2), 179–200 (2001). DOI: 10.1023/A:1008113017444
[8] Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K.E., Santos, E., Subramonian, R., von Eicken, T.: LogP: towards a realistic model of parallel computation. In: Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming – PPOPP'93, pp. 1–12. ACM Press, New York, NY, USA (1993). DOI: 10.1145/155332.155333
[9] Alexandrov, A., Ionescu, M.F., Schauser, K.E., Scheiman, C.: LogGP: Incorporating Long Messages into the LogP Model for Parallel Computation. Journal of Parallel and Distributed Computing. 44(1), 71–79 (1997). DOI: 10.1006/jpdc.1997.1346
[10] Liu, G., Wang, Y., Zhao, T., Gu, J., Li, D.: mHLogGP: A Parallel Computation Model for CPU/GPU. In: Park, J.J., Zomaya, A., Yeo, S.-S., Sahni, S. (eds.) Network and Parallel Computing – 9th IFIP International Conference, NPC 2012, Gwangju, Korea, September 6–8, 2012, Proceedings, pp. 217–224. Springer, Berlin, Heidelberg (2012). DOI: 10.1007/978-3-642-35606-3_25
[11] Lu, F., Song, J., Pang, Y.: HLognGP: A parallel computation model for GPU clusters. Concurrency and Computation: Practice and Experience. 27(17), 4880–4896 (2015). DOI: 10.1002/cpe.3475
[12] Ino, F., Fujimoto, N., Hagihara, K.: LogGPS: A parallel computational model for synchronization analysis. ACM SIGPLAN Notices. 36(7), 133–142 (2001). DOI: 10.1145/568014.379592
[13] Cameron, K.W., Ge, R., Sun, X.: logNP and log3P: Accurate Analytical Models of Point-to-Point Communication in Distributed Systems. IEEE Transactions on Computers. 56(3), 314–327 (2007). DOI: 10.1109/TC.2007.38
[14] Yuan, L., Zhang, Y., Tang, Y., Rao, L., Sun, X.: LogGPH: A Parallel Computational Model with Hierarchical Communication Awareness. In: Proceedings of the 2010 13th IEEE International Conference on Computational Science and Engineering – CSE'10, pp. 268–274. IEEE Computer Society, Washington, DC, USA (2010). DOI: 10.1109/CSE.2010.40
[15] Tiskin, A.: BSP (Bulk Synchronous Parallelism). In: Encyclopedia of Parallel Computing, pp. 192–199. Springer US, Boston, MA (2011). DOI: 10.1007/978-0-387-09766-4_311
[16] Hageman, L.A., Young, D.M.: Applied Iterative Methods. Academic Press, New York, London, Toronto, Sydney, San Francisco (1981).
[17] Kelley, C.T.: Iterative Methods for Linear and Nonlinear Equations. Society for Industrial and Applied Mathematics, Philadelphia (1995). DOI: 10.1137/1.9781611970944
[18] Hadjidimos, A.: A survey of the iterative methods for the solution of linear systems by extrapolation, relaxation and other techniques. Journal of Computational and Applied Mathematics. 20, 37–51 (1987). DOI: 10.1016/0377-0427(87)90124-5
[19] Sahni, S., Vairaktarakis, G.: The master-slave paradigm in parallel computer and industrial settings. Journal of Global Optimization. 9(3–4), 357–377 (1996). DOI: 10.1007/BF00121679
[20] Darema, F., George, D.A., Norton, V.A., Pfister, G.F.: A single-program-multiple-data computational model for EPEX/FORTRAN. Parallel Computing. 7(1), 11–24 (1988). DOI: 10.1016/0167-8191(88)90094-4
[21] Darema, F.: SPMD Computational Model. In: Encyclopedia of Parallel Computing, pp. 1933–1943. Springer US, Boston, MA (2011). DOI: 10.1007/978-0-387-09766-4_26
[22] Sokolinskaya, I., Sokolinsky, L.B.: On the Solution of Linear Programming Problems in the Age of Big Data. In: Parallel Computational Technologies. PCT 2017. Communications in Computer and Information Science, vol. 753, pp. 86–100. Springer, Cham (2017). DOI: 10.1007/978-3-319-67035-5_7
[23] Sokolinskaya, I., Sokolinsky, L.B.: Scalability Evaluation of NSLP Algorithm for Solving Non-Stationary Linear Programming Problems on Cluster Computing Systems. In: Supercomputing. RuSCDays 2017. Communications in Computer and Information Science, vol. 793. Springer, Cham (2017).