A Comparative Study of Two Dynamic Load Balancing Algorithms as a Means to Increase Performance in Shared Memory Parallel Computing
Frankline Makokhaα & William Okello-Odongoσ
The proliferation of multicore computing devices ranging from notebooks, tablets and smartphones has led to a need for load balancing to ensure optimum and full utilization of all the cores. Various algorithms exist for implementing load balancing in these multicore platforms albeit with different performance characteristics. To ensure optimum usage of all the cores, an experimental comparison of the performance of the various algorithms on real world problem domains is necessary to inform on which algorithm to use for each domain of computational problems.
This research focused on two dynamic load balancing algorithms namely, centralized dynamic load balancing algorithm and cyclic load balancing algorithm using matrix multiplication, sorting and searching as the problem domains, with the measured parameters being processing time and processor idle time.
Based on the experiments carried out, centralized dynamic load balancing performed better in matrix multiplication in terms of processing time while cyclic load balancing algorithm performed better for sorting and searching.
From the results obtained, it is recommended to use centralized dynamic load balancing algorithm for mathematical computations while for non-mathematical computations it is recommended to use cyclic load balancing algorithm.
Keywords: load balancing; parallel computing; distributed computing; distributed memory parallel systems and shared memory parallel systems.
Author α σ: School of Computing and Informatics University of Nairobi, Nairobi, Kenya.
With the increasing application of computers in almost all facets of human life, there has been a significant demand for high performance from the various computing platforms.
Due to this demand, the evolution of computers has been characterized by increasing processor speed, decreasing component size, increasing memory size, and increasing I/O capacity and speed . One factor contributing to increases in processor speed is the shrinking size of microprocessor components which reduces the distance between components and hence increases speed . However, the true gains in speed in recent years have come from the organization of the processor, including heavy use of pipelining, parallel execution techniques and the use of speculative execution techniques (tentative execution of future instructions that might be needed).
These advances have led to development of super computers. A super computer is defined as the fastest computer currently available that provides peak performance . The value of supercomputers derives from the value of problems they solve and not from the technology they showcase . It is also these advances that led to the development of vector computer systems. A vector computer/processor is a machine designed to efficiently handle arithmetic operations on elements of arrays called vectors .
To design computers that are effectively fast, performance of various components has to be balanced, so that a performance of some components is not dragged by low performance of others. Case in point, processor speeds have increased drastically than memory access. To ameliorate this mismatch, various techniques have been adopted, namely: use of caches, wide data path between memory and processor, and more intelligent memory chips .
Further, apart from increasing performance by considering the design of the internal computer components, other techniques have been developed, e.g. use of parallel computing, the use of a memory cache hierarchy, and speedup in memory access time and I/O transfer rate due to technology improvements.
However, this has also been limited by the nature of problems being solved as highlighted by Amdahl’s law which states that the speedup of a parallel algorithm is effectively limited by the number of operations which must be performed sequentially .
Due to the above limitation, Computer Scientists have always strived to increase the performance of their computer architectures. High performance may come from fast dense circuitry, packaging technology, and parallelism . This research focuses on parallelism enhanced by shrewd load balancing techniques.
Parallel computing systems are computer systems consisting of multiple processing units connected via some interconnection network plus the software needed to make the processing units work together .
Parallel computing systems can be classified as shared memory systems and distributed memory systems . A shared memory system typically accomplishes inter processor coordination through a global memory shared by all processors while a distributed memory system combines the local memory and processor at each node of the interconnection network. Since there in no global memory in a distributed memory system, it is necessary to move data from one local memory to another by means of message passing.
One of the important mechanisms for utilizing and sharing the CPUs optimally is the policy of balancing the load amongst the processors. This type of load balancing can be achieved by transferring some of the tasks from a heavily loaded processor to a lightly loaded processor .
A diagrammatic representation of the two parallel computing systems is as shown in figure 1 and figure 2 .
Figure 1: Shared Memory Parallel System
Figure 2: Distributed Memory Parallel System
Load balancing is the process of roughly equalizing the work load among all nodes of the distributed system .
Comprehensively, Load balancing is the distribution and/or redistribution of processing tasks among the processing nodes of a parallel or distributed system for the purpose of improving the efficiency and effectiveness of the entire processing system.
An imbalance on today’s fastest supercomputers can force hundreds of thousands of cores to idle, and on future exascale machines this cost will increase by over a factor of a thousand .
Load balancing is one of the central problems which have to be solved in parallel computation . Since load imbalance leads directly to processor idle times, high efficiency can only be achieved if the computational load is evenly balanced among the processors.
The load balancing problem can be stated as: given the initial job arrival rates at each computer in the system find an allocation of jobs among the computers so that the response time of the entire system over all jobs is minimized .
Load balancing algorithms can be broadly categorized as either static load balancing algorithms or dynamic load balancing algorithms. Static load balancing algorithms distribute the tasks to processing elements at compile time, while dynamic algorithms bind tasks to processing elements at run time .
Other authors have classified load balancing algorithms into three main classes: static algorithms, dynamic algorithms, and adaptive algorithms . Static algorithms decide how to distribute the workload according a prior knowledge of the problem and the system characteristics. Dynamic algorithms use state information to make decisions during program execution. Finally, Adaptive algorithms are a special case of dynamic algorithms, which dynamically change their parameters in order to adapt its behavior to the load balancing requirements.
The various dynamic load balancing algorithms include: Centralized Dynamic Load balancing algorithm, Random (RAND), Adaptive Contracting with Neighbor, Prioritized Random (PRAND) and Cyclic Algorithm .
Whereas work has been done in analysis of the various dynamic load balancing algorithms, most emphasis has been on distributed systems and using qualitative parameters e.g. overload rejection, reliability, predictability, adaptability, scalability, stability, waiting time, throughput etc., and thus there has been little practical emphasis on shared memory parallel systems and use of quantitative parameters like execution time and processor idle times .
This is attributed to the fact that during the times of those researches, shared memory parallel devices were not as prevalent as they are today.
The various works that have performed a comparative study of algorithms using qualitative parameters include: ; ; .
This research aimed to address this gap by performing a comparative study of two dynamic load balancing algorithms, namely Centralized Dynamic Load balancing algorithm and Cyclic Algorithm. The outcome form this research informs on the choice of load balancing algorithm to use on the various computational domains.
In the comparison of the performance of the two dynamic load balancing algorithms, the following methodology was adopted.
5.1 Experimentation Methodology
The aim of this project was to practically and using real world problems compare the performance of centralized dynamic load balancing algorithm and cyclic load balancing algorithm. The problems were implemented using each of the identified load balancing techniques and then a comparison made based on how each of the algorithm performs by measuring the processing time and processor idle time.
5.2 Experimental Problems
In this project, the identified common problems were implemented on a shared memory system for parallel execution, using each of the identified dynamic load balancing techniques.
The identified common problems are Matrix Multiplication, Sorting and Searching.
Matrix multiplication was chosen because it is an important linear algebra operation and hence a number of scientific and engineering applications include this operation as building blocks . Further, matrix multiplications are important linear algebra algorithms which may simulate many real applications like image processing, video compression . Due to their fundamental importance, much effort has been devoted to studying and implementing matrix multiplications. For parallel matrix multiplications, the entire task should be decomposed, this introduces various overheads. The most important are the communication and load balancing overheads.
Sorting was chosen because sorting algorithms are widely used in a broad variety of applications e.g. in commercial computing where government organizations, financial institutions, and commercial enterprises organize much of this information by sorting it  Further, keeping data in sorted order makes it possible to efficiently search through it.
The selection sort was chosen because it is an in-place sorting (requires no extra memory, thus ideal for small parallel systems e.g. phones and tablets where auxiliary memory is limited), and also for its simplicity in implementations.
Searching was chosen because keeping data in sorted order makes it possible to efficiently search through it. Linear search was chosen because it is extremely common in most real world applications e.g. in ruby's find_index and jQuery, further it is also the most basic search that can be found.
5.3 Experiment Design
The experiment to compare the performance of the algorithms was designed as below:
For matrix multiplication, the following structure was used:
If Matrix A:
and Matrix B:
Then the multiplication shall be:
This results in four tasks, which are divided into chunks that are assigned to the processing entities. The experimentation was organized into chunk size 4, 8, 16 and 32.
For random number generation, part of the module code was from the code distributed under the GNU LGPL License .
For sorting and searching, the sort and search spaces are divided into chunk sizes 4, 8, 16 and 32 at different times during experimentation e.g. if we have chunk search space or sort space 120, then using chunk size 4, this shall be divided into sort/search space of 30 and each assigned to the processing entities
During experimentation, the following procedures were adapted
For matrix multiplication, the sizes used were as shown in table 1. Each matrix size is run eight (8) times with results recorded for each run and averaged at the end. This is in line with  on secrets of successful simulation studies.
Table 1: Matrix Multiplication
(500 by 500) *(500 by 500)
(800 by 800)* (800 by 800)
(2000 by 2000) *(2000 by 2000)
(4000 by 4000) *(4000 by 4000)
These sizes were chosen due to their ability to produce better results based on preliminary runs.
For sorting, the sort spaces used are as shown in table 2. Each sort space is run eight (8) times and results recorded for each run and averaged at the end.
Table 2: Sort Space Size
For searching, the search space used is as shown in table 3 below: each is run eight (8) times and results recorded for each run an averaged.
Table 3: Search Space Sizes
5.5 Chunk Sizes
A chunk is an ordered, fixed-sized array of fixed-sized slots . In this experiment chunk size refers to the portion of the whole problem being solved that is assigned to a given core for execution. The chunk sizes used are as shown in table 4 .
Table 4: Chunk Sizes
These chunk sizes were chosen because they are multiples of the physical processors in the hardware used in the experiment and also are multiples of the total threads in the system.
The system was developed as below:
6.1 Softwares used
Minimalist GNU for Windows (MinGW). This is an Open Source development environment for native Microsoft Windows applications.
Eclipse for Parallel Application Developers. This is an IDE for Parallel Application Developers.
The development language used in the implemententation was OpenMP programming language, which enables parallel execution and timing of processing time.
Parallelism was enabled by the directive: # pragma omp parallel, while the timing of the processing time shall be done using: omp_get_wtime ( ) which is an OpenMP function that returns a double precision value equal to the number of seconds since the initial value of the operating system real-time clock.
For load balancing algorithm implementation, # pragma omp for schedule (type, chunk), was used. The type shall represent the load balancing algorithm to use while chunk represented size of work allocated to a processing entity at a given time.
The algorithm used in the experiment for matrix multiplication is as shown below:
The algorithm used in the experiment for searching is as shown below:
The algorithm used in the experiment for sorting is as shown below.
6.3 Experimentation platform
The experiment was perfomed on a computer with specifications as indicated in table 5:
Table 5: Experiment Terminal Specifications
Intel Core i7 3630QM
Windows 8 64 bit
RESULTS AND DISCUSSION
7.1 General Results
From the experiments done, the results are as below:
Table 6: Matrix Multiplication Results
The processing time can be depicted graphically as shown in figure 3.
Figure 3: Matrix Multiplication results
The results for the search experiment are shown in table 7.
Table 7: Search Experiment Results
The processing time can be depicted graphical as shown in figure 4.
Figure 4: Search Experiment Results
From the above summaries, the cyclic dynamic load balancing performs better than the centralized dynamic on linear search.
The results for the sort experiment are as shown in table 8.
Table 8: Sort Experiment Results
Graphically, the processing time can be presented as in figure 5.
Figure 5: Sort Experiment Results
7.2 Summary of Results
- Processing speed
For matrix multiplication, centralized dynamic load balancing algorithm performs better than cyclic load balancing as depicted in table 6. This is because for computations, which vary in computation complexities, it is better for the core to first finish its portion of assigned work before it can assign more work upon request as opposed to continuous assignment in a cyclic way. This is to reduce unbalanced workload as much as possible.
For sorting applications, the cyclic load balancing performs better than the centralized dynamic load balancing as depicted in table 8. This is because there are no varied complexities in sorting and hence waiting for the core to finish the assigned portion of work and waiting for it to request wastes time.
For searching applications the cyclic load balancing performs better than the centralized dynamic load balancing as depicted in table 7. This is because there are no varied complexities in searching and hence waiting for the core to finish the assigned portion of work and waiting for it to request wastes time
- Processor idle time
From the conducted experiment, there was no substantial consistency in the observed processor idle time to warrant a conclusive analysis.
From the results obtained, it is recommended to use centralized dynamic load balancing algorithm for mathematical computations while for non- mathematical computations it is recommended to use cyclic load balancing algorithm.
I acknowledge the assistance and guidance provided by the late Prof.Okello-Odongo, of the University of Nairobi, throughout this research before his untimely demise.
- Stallings, W. (2010). Computer Organization and Architecture: Designing for Performance. 8th Ed. New Jersey: Prentice.
- Dongara, J. (2004) Trends in High Performance Computing. The Boole Lecture. Vol. 47. No. 4. 10th March.
- Hesham, E. and Mostafa, A. (2005) Advanced Computer Architecture and Parallel processing. New Jersey: John Wiley & Sons, Inc.
- Barmon, C., Faruqui, M. N. and Battacharjee, J. P. (1990/91). Dynamic Load Balancing Algorithm in a Distributed System. Microprocessing and Microprogramming. Volume 29, Issue 5.
- Chandra, R., Dagum, L., Kohr, D., Maydan, D., McDonald, J., and Menon, R., (2001). Parallel Programming in OpenMP. San Diego: Morgan Kaufman Publishers.
- Alakeel, A. M. (2010). A Guide to Dynamic Load Balancing in Distributed Computer Systems. IJCSNS International Journal of Computer Science and Network Security, Vol. 10 No.6.
- Pearce, O., Gamblin, T., Supinskiy, B., Schulzy, M., and Amato, M. (2012) Quantifying the Effectiveness of Load Balance Algorithms.US: Department of Energy.
- Horton, G. (1993) A multi-level diffusion method for dynamic load balancing. Erlangen: Elsevier Science Publishers B.V.
- Grosu, D., Chronopoulos, T. A., and Ming-Ying L. (2002). Various Schemes of Load Balancing in Distributed Systems- A Review. In Proc. of the 16th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2002): Fort Lauderdale, Florida, USA, IEEE Computer Society Press.
- Mandal, A. and Chandra, S. (2010) An Empirical Study and Analysis of the Dynamic Load Balancing Algorithms Used in Parallel Computing Systems: Proceedings of ICCS-2010, 19-20 Nov. West Bengal: University of North Bengal.
- Aldasht, M., Ortega, J. and Puntonet, C. (2007) Dynamic Load Balancing in Heterogeneous Clusters: Exploitation of the Processing Power. 2nd Palestinian International Conference on Computer and Information Technology (PICCIT), Hebron, Palestine.
- Firoj, A. & Khan, Z. (2012) The Study on Load Balancing Strategies in Distributed computing System: International Journal of Computer Science & Engineering Survey (IJCSES) Vol.3, No.2.
- Makokha, F. and Okello_Odongo (2018) A review of Dynamic Load Balancing Algorithms. International Journal of Computer and Information Technology. Volume 7– Issue 1, 2018
- Kushwaha, M. and Gupta, S (2015). Various Schemes of Load Balancing in Distributed Systems- A Review. International Journal of Scientific Research Engineering & Technology (IJSRET), Volume 4, Issue 7.
- Manekar, S. A., Poundekar, D. M., Gupta, H. and Nagle, M.(2012). A Pragmatic Study and Analysis of Load Balancing Techniques In Parallel Computing. International Journal of Engineering Research and Applications. Vol. 2, Issue 4.
- Willebeek-Lemair, H. M. and Revees, A. P. (1993). Strategies for Dynamic Load Balancing on Highly Parallel Computers. IEEE Transactions on Parallel and Distributed Systems. Volume 4. No 9.
- Jin, D. and Ziavras,S. G. (2004). A Super-Programming Technique for Large. Sparse Matrix Multiplication on PC Clusters: IEICE Transactions on. Information and Systems, Vol. E87- D, issue 7
- Sedgewick, R. and Wayne, K. (2011) Algorithms, 4th Ed. Boston: Addison-Wesley.
- Burkardt, J. (2008) Matrix Multiplication using C++ http://people.sc.fsu.edu/~ jburkardt/cpp_src/mxm/mxm.cpp [Accessed on 17th February 2014].
- Law, M. A. and McComas, G. M. (1991). Secrets of successful simulation studies, Winter Simulation Conference Proceedings, Phoenix, AZ, 1991.
- Paluska, M., J. (2013). Computing with Chunks. PhD, Massachusetts Institute of Technology.