Coloring for distributed-memory-parallel Gauss-Seidel algorithm

(1)

COLORING FOR

DISTRIBUTED-MEMORY-PARALLEL

GAUSS-SEIDEL ALGORITHM

a thesis submitted to

the graduate school of engineering and science

of bilkent university

in partial fulfillment of the requirements for

the degree of

master of science

in

computer engineering

By

Onur Ko¸cak

September 2019

(2)

COLORING FOR DISTRIBUTED-MEMORY-PARALLEL GAUSS-SEIDEL ALGORITHM

By Onur Ko¸cak September 2019

We certify that we have read this thesis and that in our opinion it is fully adequate, in scope and in quality, as a thesis for the degree of Master of Science.

Cevdet Aykanat(Advisor)

¨

Ozg¨ur Ulusoy

Engin Demir

Approved for the Graduate School of Engineering and Science:

Ezhan Kara¸san

(3)

ABSTRACT

COLORING FOR

DISTRIBUTED-MEMORY-PARALLEL GAUSS-SEIDEL

ALGORITHM

Onur Ko¸cak M.S. in Computer Engineering Advisor: Cevdet Aykanat

September 2019

Gauss-Seidel is a well-known iterative method for solving linear system of equa-tions. The computations performed on Gauss-Seidel sweeps are sequential in nature since each component of new iterations depends on previously computed results. Graph coloring is widely used for extracting parallelism in Gauss-Seidel by eliminating data dependencies caused by precedence in the calculations. In this thesis, we present a method to provide a better coloring for distributed-memory-parallel Gauss-Seidel algorithm. Our method utilizes combinatorial approaches including graph partitioning and balanced graph coloring in order to decrease the number of colors while maintaining a computational load balance among the color classes. Experiments performed on irregular sparse problems arising from various scientific applications show that our model effectively reduces the required number of colors thus the number of parallel sweeps in the Gauss-Seidel algorithm.

(4)

¨

OZET

DA ˘

GITIK-BELLEK-PARALEL GAUSS-SEIDEL

ALGOR˙ITMASI ˙IC

¸ ˙IN RENKLEND˙IRME

Onur Ko¸cak

Bilgisayar M¨uhendisli˘gi, Y¨uksek Lisans Tez Danı¸smanı: Cevdet Aykanat

Eyl¨ul 2019

Gauss-Seidel do˘grusal denklem sistemlerinin ¸cözümünde iyi bilinen bir ¨

ozyinelemeli yöntemdir. Gauss-Seidel taramalarında yapılan hesaplamalar, yeni yinelemelerin her bir bile¸seni önceden hesaplanan sonu¸clara ba˘glı oldu˘gundan do˘gası gere˘gi ardı¸sıktır. Ç izge renklendirme, hesaplamalarda önceli˘gin neden oldu˘gu veri ba˘gımlılı˘gını ortadan kaldırarak paralellik elde etmek i¸cin yaygın olarak kullanılır. Bu tezde, da˘gıtık-bellek-paralel Gauss-Seidel algoritması i¸cin daha iyi bir renklendirme sa˘glayan bir method sunuyoruz. Metodumuz, renk sınıfları arasındaki hesaplama yük dengesini korurken renk sayısını azaltmak i¸cin ¸cizge bölümleme ve dengeli ¸cizge renklendirme gibi ¸ce¸sitli kombinasyonel yakla¸sımları kullanır. Ç e¸sitli bilimsel uygulamalardan do˘gan düzensiz seyrek problemler üzerinde yapılan deneyler, modelimizin Gauss-Seidel algoritmasındaki gerekli renk sayısını ve dolayısıyla paralel taramaları etkili bir ¸sekilde azalttı˘gını göstermektedir.

Anahtar sözcükler : paralel hesaplama, Gauss-Seidel yöntemi, paralel Gauss-Seidel algoritması, da˘gıtık bellek algoritma, ¸cizge renklendirme.

(5)

Acknowledgement

I would like to express my deepest gratitude to my thesis supervisor Prof. Cevdet Aykanat for his guidance, valuable suggestions and support throughout the de-velopment of this thesis. His guidance and immense knowledge helped me in all the aspects of research and writing of this thesis.

I would like to express my endless thanks to Dr. Reha O˘guz Selvitopi for his support and suggestions. I appreciate his guidance to me for this work.

I would like to thank the jury members, Prof. ¨Ozg¨ur Ulusoy and Asst. Prof. Engin Demir, for reading and commenting on this thesis.

My sincere thanks also go to my friend Cihan Bostancı for his moral support and suggestions throughout this thesis.

Finally, I am grateful to my lovely family and friends for their infinite moral support and understanding throughout this thesis.

(6)

List of Figures

2.1 A simple M matrix and its CSR representation . . . 10 2.2 A simple symmetric matrix and its undirected graph representation 12 2.3 An undirected graph and its 4-way partitioning. . . 13

2.4 An undirected graph and the same graph with distance-1 coloring 14

4.1 Step by step coarsening stage . . . 22 4.2 Traditional parallel processing vs. parallel processing with

coars-ening and balanced coloring . . . 25

5.1 A comparison of coloring vs. balanced coloring by number of super vertices per each color . . . 38 5.2 A comparison of coloring vs. balanced coloring by sum of weights

per each color . . . 38 5.3 A comparison of coloring vs. balanced coloring by number of colors

(9)

List of Tables

5.1 Properties of Test Matrices . . . 28 5.2 Coarsening and Coloring Results for B ∈ {1, 50, 100, 250, 500} . . 33 5.3 Average Results for All Test Matrices . . . 40

(10)

List of Algorithms

1 Sequential Gauss-Seidel Algorithm . . . 8

(11)

Chapter 1 Introduction

Many scientific applications and engineering simulations depend on solving linear algebraic systems. In diverse problems arising from several areas such as electro-magnetic, computational fluid dynamics, power systems, genetics, graph theory, synthetic geometry, computer vision, quantum mechanics; scientific problems of-ten necessitate dealing with linear algebraic systems. Therefore, it is important to efficiently solve linear algebraic systems in scientific computing. There are two categories of linear algebraic solvers: direct solvers and iterative solvers. Direct solvers attempt to calculate an exact solution to the system of linear equations, whereas iterative solvers repeatedly try to solve system of equations using ap-proximations. Direct solvers may be impractical if the coefficient matrix of the linear algebraic system to be solved is large and sparse because the sought-after factor can be dense [1]. Therefore, iterative solvers are more commonly used in solving large sparse system of linear equations and we can obtain an approximate solution with an acceptable error tolerance if the algorithm converges.

Gauss-Seidel is one of the most popular stationary iterative solvers in linear algebra. It is also known as Liebmann method or successive displacement method. It is an inherently sequential method and very efficient in solving problems with tightly coupled constraints. Gauss-Seidel method is very similar to the Jacobi

(12)

is that Gauss-Seidel uses previously computed results in new iterations. Unlike the Jacobi method, only one storage vector is required as elements can be over-written as they are computed. This can be advantageous for large problems. However, computations cannot be performed in parallel because of the depen-dency and the values at every iteration are dependent of the order of the original equations. Gauss-Seidel has better convergence properties than the Gauss-Jacobi algorithm [2, 3]. Successive Over Relaxation (SOR) method is one of other sta-tionary iterative methods used to solve linear problems. However, the difficulty in choosing an optimal relaxation parameter makes it impractical [4]. Therefore, Gauss-Seidel is practically one of the most popular among the other iterative solvers in linear algebra. With its attractive properties, the Gauss-Seidel method is also widely used as a preconditioner in multi-grid applications [5].

Parallelization of the Gauss-Seidel is difficult since the nature of the algorithm is sequential. There have been many parallelization studies done on sparse matrix systems in scientific computing field [5, 6, 7, 8], however tightly coupled problems lead to a dense matrix which prevents parallelization. Additionally, the cost of synchronization in distributed memory systems affects parallelization efficiency. Therefore, Jacobi method can be used as an alternative despite of the fact that it is slow to converge but fully parallel. There are several approaches to extract parallelism in Gauss-Seidel through performing special orderings on sparse matri-ces to eliminate data dependencies. Graph coloring is a widely used method for extracting parallelism in Gauss-Seidel by eliminating data dependencies caused by precedence in the calculations.

In this thesis, we present a method to provide a better coloring for distributed-memory-parallel Gauss-Seidel algorithm. Our method utilizes combinatorial ap-proaches including graph partitioning and balanced graph coloring in order to decrease the number of colors while maintaining a computational load balance among the color classes. We present a method that (i) first utilizes graph parti-tioning in order to decrease the number of colors required in the coloring stage, (ii) then applies a balanced coloring algorithm in order to extract balanced par-allelism by decoupling the data dependencies while maintaining strict precedence rules in sequential Gauss-Seidel iterations. We have tested the validity of our

(13)

model on various test matrices arising from different scientific applications. Ex-perimental results show that the proposed model effectively reduces the number of colors required in coloring stage of parallel Gauss-Seidel algorithm which re-sults in reducing number of parallel sweeps and synchronization overheads. In the experiments, it is also observed that utilized balanced coloring method efficiently maintains computational load balance among the color classes.

This thesis is organized as follows: Chapter 2 gives the necessary background and preliminary information. In Chapter 3, the existing work in the literature on parallelization of Gauss-Seidel is investigated. Chapter 4 describes our method-ology with 3 stages. The experimental results are presented in Chapter 5. We present the conclusion of this thesis in Chapter 6.

(14)

Chapter 2 Background

In this chapter, the preliminary information and definitions related to our methodology for the Gauss-Seidel algorithm are presented. Following sections cover the basic principles of the Gauss-Seidel method, sparse matrices and their storage models, graph model, graph partitioning and coloring problem.

2.1 The Gauss-Seidel Method

The Gauss–Seidel method is an iterative method to solve a system of n linear equations with unknown x.

Consider the following linear system:

Ax = b (2.1) A =        a11 a12 · · · a1n a21 a22 · · · a2n .. . ... . .. ... an1 an2 · · · ann        , x =        x1 x2 .. . xn        , b =        b1 b2 .. . bn        (2.2)

(15)

of size n.

The Gauss-Seidel method is defined by the iteration;

L∗x(k+1)= b − U x(k) (2.3)

where x(k) is the k -th approximation or iteration of x, and the coefficient matrix A is decomposed into a lower triangular component L∗, and a strictly upper

triangular component U : A = L∗+ U .

In detail, we can decompose A with its lower and upper triangular components as follows: A = L∗+U where L∗ =        a11 0 · · · 0 a21 a22 · · · 0 .. . ... . .. ... an1 an2 · · · ann        , U =        0 a12 · · · a1n 0 0 · · · a2n .. . ... . .. ... 0 0 · · · 0        (2.4) The system of linear equations can be written as:

L∗x = b − U x (2.5)

The Gauss–Seidel method then solves the left hand side of this expression for x, using previous value for x on the right hand side. This can be shown as:

x(k+1) = L−1_∗ b − U x(k)

(2.6)

Using the triangular component L∗, we can sequentially compute the values of

x(k+1) with forward substitution. Hence, the Gauss-Seidel method can be written as:

(16)

x(k+1)_i = 1 aii bi− i−1 X j=1 aijx (k+1) j − n X j=i+1 aijx (k) j ! , i = 1, 2, . . . , n (2.7) where:

x(k)_i is the ith _{unknown in x during the k}th _{iteration, i = 1, · · · , n}

x(0)_i is the initial guess for the ith _{unknown in x,}

aij is the coefficient of A in the ith row and jth column

bi is the ith value in b.

In matrix view, the Gauss-Seidel method can be defined as:

x(k+1) = (D + L∗)−1b − Ux(k)

(2.8)

where:

x(k) _{is the k}th _{solution to x}

x(0) _{is the initial guess at x,}

D is the diagonal of A,

L∗ is the lower triangular portion of A,

U is the strictly upper triangular portion of A, b is right-hand-side vector.

Convergence is evaluated by using the absolute relative approximate error cal-culated as follows after each iteration to check if the error is within an acceptable tolerance:

|Ea| =

x(k+1)− x(k)

(17)

Characteristics of Gauss-Seidel are as follows:

• Convergence of the Gauss-Seidel solver depends on the characteristics of the coefficient matrix. Convergence is only guaranteed if the coefficient matrix A is either symmetric positive definite or strictly/irreducibly diagonally dominant [9].

• The Gauss-Seidel method has better convergence rates compared to the Jacobi method on model problems as it uses the updated values during each of the iterations [5].

• The Gauss-Seidel method is widely used in multigrid applications. It works well on unstructured problems since there is no need for choosing a damping parameter [5].

The Gauss-Seidel algorithm is inherently sequential since new iterations de-pend on the results of previously calculated values in that same iteration. This is in contrast to the Jacobi method, where we need to have x(k)_i in order to calculate x(k+1)_i . Therefore, it is required to identify tasks which can be computed simul-taneously by resolving data dependencies between tasks to efficiently parallelize the Gauss-Seidel method.

Algorithm 1 presents a sequential algorithm to solve a linear system Ax = b using the Gauss-Seidel method. The algorithm takes A and b as input and solves for x, which is the solution vector. We assume a initial guess for x (i.e., x(0) = 0). The algorithm algebraically solves each linear equation for xi and repeats until

converges. The iteration is terminated when either (i) the user-specified maximum number of iterations has been reached or (ii) the norm of successive iterates is less than a user-specified epsilon Eu.

(18)

Algorithm 1: Sequential Gauss-Seidel Algorithm Input: A, b, kmax, Eu

Output: x k ← 0

Choose an initial guess x(0) _{to solution x}

/* Repeat until reaching a desired error tolerance or maximum

iterations limit */

while k < kmax and Ea > Eu do

for i = 1 to n do sum ← 0 for j = 1 to i − 1 do sum ← sum + aijxj end for j = i + 1 to n do sum ← sum + aijxj end xi ← (bi− sum)/aii end

/* Calculate new iteration error */

if k > 0 then Ea ← kx − xoldk

end

/* Update the old solution */

xold← x

/* Update the iteration counter */

k ← k + 1 end

(19)

2.2 Sparse Matrices and Compressed Sparse

Row (CSR) Format

A matrix is called to be a sparse matrix if most of its elements are zero. Sparse matrices are distinct from matrices with mostly non-zero values, which are re-ferred to as dense matrices. The sparsity of a matrix can be defined as the number of non-zero elements divided by the total number of elements (m × n). Many of the scientific problems require working with sparse matrices.

When working with sparse matrices, it is efficient and often necessary to use special data structures since they may allocate large amount of memory and storage resources when treated as if they are dense and store zero values as well. Compressed Storage by Rows (CSR) is an efficient data structure to store and process large sparse matrices. The CSR data structure allows fast row access in matrix-vector multiplications.

The CSR data structure represents a matrix by three arrays: row ptr, col ind and values. nnz denotes the total number of non-zero entries in the matrix.

• values is an array of size nnz and it stores the subsequent non-zeros of the matrix rows on continuous memory locations traversed in row-wise fashion. • col ind is an array of size nnz and it stores the column indices of the

respective nonzero values in the values array.

• row ptr is an array of size n + 1 and it stores the row start pointers which corresponds to the items the values array.

(20)

M = 1 2 3 4 5                 20 0 0 25 0 1 30 35 0 40 0 2 45 0 50 55 60 3 0 0 65 70 0 4 0 0 0 0 75 5 1 2 3 4 5 6 row ptr 1 3 6 10 12 13 1 2 3 4 5 6 7 8 9 10 11 12 col ind 1 4 1 2 4 1 3 4 5 3 4 5 1 2 3 4 5 6 7 8 9 10 11 12 values 20 25 30 35 40 45 50 55 60 65 70 75

Figure 2.1: A simple M matrix and its CSR representation

In Figure 2.1, a simple M matrix and its CSR representation with correspond-ing arrays presented. One can observe that the lengths of col ind and values arrays are equal to the number of nonzero items. Numbers in row ptr array points the row start positions in col ind and values arrays and its length is one more than the number of rows.

Algorithm 2 presents the same Gauss-Seidel algorithm as Algorithm 1, only difference is that the coefficient matrix A is supplied in the CSR format. In-stead of a m × n two dimensional array for storing the coefficient matrix A, the corresponding row ptr, col ind and values arrays are provided as input.

(21)

Algorithm 2: Sequential Gauss-Seidel Algorithm with CSR data format Input: row ptr, col ind, values, b, kmax, Eu

Output: x k ← 0

Choose an initial guess x(0) _{to solution x}

/* Repeat until reaching a desired error tolerance or maximum

iterations limit */

while k < kmax and Ea > Eu do

for i = 1 to n do sum ← 0

for j = row ptr[i] to row ptr[i + 1] − 1 do if j 6= i then

sum ← sum + values[j] ∗ x[col ind[j]] end else d ← values[j] end end xi ← (bi− sum)/d end

/* Calculate new iteration error */

if k > 0 then Ea ← kx − xoldk

end

/* Update the old solution */

xold← x

/* Update the iteration counter */

k ← k + 1 end

2.3 Graphs and Matrices

An undirected graph is defined as G = (V, E ), where V denotes the set of vertices and E denotes the set of edges. Each edge eij ∈ E connects a pair of distinct

vertices vi and vj. The degree di of a vertex vi is equal to the number of edges

incident to vi. For each vertex vi ∈ V, a weight w(vi); and for each edge eij ∈ E,

a cost c(eij) can be assigned. The set of vertices connected by vertex vi is called

(22)

An n × n square symmetric matrix A = (aij) can be represented as an

undi-rected graph G(A) = (V, E ) with n vertices, where the the sparsity pattern of A corresponds to the adjacency matrix representation of graph G. Vertex set V represents the rows/columns and edge set E represents non-zeros of matrix A. V contains a vertex vifor each row/column i. E contains an edge eij, which connects

the vertices vi and vj for each non-zero pair aij and aji in A.

        0 × 0 0 0 0 × 0 × × 0 0 0 × 0 0 × 0 0 × 0 0 × × 0 0 × × 0 × 0 0 0 × × 0         6 5 3 4 1 2

Figure 2.2: A simple symmetric matrix and its undirected graph representation Graph representation of matrices is often used in scientific computing applica-tions to investigate data dependencies of tasks and communication patterns.

2.4 Graph Partitioning

Graph partitioning is a widely used method in scientific applications including matrix multiplications to decompose computational tasks for parallelization [10, 11, 12].

Π = {P1, P2, . . . , PK} is said to be a K-way partition of G if the following

conditions hold: (i) each part Pk is a non-empty subset of V, (ii) parts are

pairwise disjoint, i.e., Pk ∩ Pl = ∅ for all 1 ≤ k < l ≤ K, and (iii) union of

all K parts is equal to V. A partition Π of V is balanced if the following balance criterion is satisfied by each part:

Wk ≤ Wavg(1 + ε), for k = 1, 2, . . . , K, (2.10)

(23)

Wavg is the average weight, i.e., Wavg =

P

vi∈Vw(vi)/K , and ε is the maximum

imbalance ratio allowed. In a partition Π of G, an edge eij is called a cut edge

(external) if connects two different parts, and it is called uncut edge (internal) if both vertices vi and vj are in the same part. EE denotes the set of external edges.

The cost of a partition Π is determined in terms of the cut edges as follows:

cost(Π) = X

eij∈EE

c(eij) (2.11)

Graph partitioning problem is to partition the graph into two or more parts while minimizing the cutsize in (2.11) and satisfying the maximum load balance constraint in (2.10). The graph partitioning problem is known to be NP-hard. There exist well known successful graph partitioning tools in the literature such as Chaco [13] and METIS [14, 15].

Figure 2.3: An undirected graph and its 4-way partitioning.

Figure 2.3 presents an undirected graph and its 4-way partitioning. The total cutsize is 9. All part weights are perfectly balanced and equal to 6.

(24)

2.5 Graph Coloring

A distance-1 coloring of a graph G is defined as a mapping clr : V → {1, 2, . . . , s} such that clr(vi) 6= clr(vj) for all edges ei,j ∈ E, where clr(vi) is the color of

vertex vi. This corresponds to an assignment of colors to vertices such that any

two adjacent vertices have different colors. The complexity of graph coloring problem with using minimum number of colors is known to be NP-Hard [16].

The standard graph coloring aims to minimize the number of colors used with-out considering the size of color classes relative to each other. A coloring of a graph is called equitable if the sizes of any two color classes differ by at most one. Another coloring method is balanced coloring. A coloring of a graph is called balanced if the ratio of any two color classes is at most within a given threshold. Given a graph G(V, E ), balanced coloring is defined as computing a distance-1 coloring such that each color class receives approximately |V|_C vertices, where C is the number of colors used [17].

Figure 2.4: An undirected graph and the same graph with distance-1 coloring

Figure 2.4 represents a simple undirected graph and its coloring. The graph has 9 vertices and 14 edges. The maximum degree of the graph is 5. There are 3 colors used. Color classes have different sizes as 3, 2 and 4.

(25)

Graph coloring is frequently used in parallel computing domain to resolve the dependencies in scientific problems by identifying the independent tasks that can be computed in parallel. It is used for extracting parallelism in iterative sparse solvers including Gauss-Seidel [18, 7, 19] and also used in many other scientific applications such as calculation of sparse Jacobian matrices in partial differential equation solvers [20], time tabling and scheduling problems [21, 22].

(26)

Chapter 3 Related Work

In this chapter, we investigate and discuss the existing work on parallel Gauss-Seidel algorithms. There are several studies on parallelization of Gauss-Gauss-Seidel in the literature, which use graph decomposition, matrix ordering and graph coloring models.

Koester et al. [6] presented a parallel Gauss-Seidel algorithm developed for ir-regular sparse matrices from power system applications. They suggest a two step matrix ordering technique that first partitions the matrix into a block-diagonal bordered form using diakoptic techniques and then multi-colors the data in the last diagonal block using graph coloring method. They order the matrix into block-diagonal-bordered form using a node-tearing technique, and then assign block-diagonal blocks of matrix partitions Ai,i, Ai,m, Am,i to the same

proces-sor. Their work utilizes active remote procedure calls in order to minimize the communication overhead and obtain good speed-up. The drawback of their algo-rithm is that the number of color classes increased dramatically in unstructured problems, load imbalance is observed in many experiments and the calculation of values in the last block show poor parallel performance. Since they inherently apply different orderings in the matrix structure in order to extract parallelism, the convergence rates are affected by the ordering.

(27)

Motivated by this observation, Adams [5] proposed a new distributed mem-ory Gauss-Seidel algorithm as a multi-grid smoother with unstructured problems. Their approach takes advantage of processor partitions and uses the domain de-composition with the distribution of the stiffness matrix. Their approach suggests that nodes are partitioned to minimize communication and then each processor partitions its nodes into groups of interior (Int1, Int2) and boundary nodes (Top,

Bot, Mid). Interior nodes, which do not have any edges with nodes on

dif-ferent processors, are computed first to hide the communication latency of the boundary nodes. The boundary nodes are partitioned into nodes, which only communicates with processors of higher and lower colors which are referred to Bot and Top nodes, respectively. By processing the processor partitions with special orderings, their algorithm can be effective by utilizing optimal partition-ing of finite element problems. However, a weakness of their approach is that their algorithm requires instant communication for some computations that oc-cur in Mid nodes and also requires different orderings depending the the number of processors. This algorithm works well under certain assumptions such as load balancing of computations are perfectly satisfied and there are enough number of interior nodes to hide the communication requirement of boundary nodes.

Huang and Ongsakul [7] propose a task allocation algorithm and apply it in parallel Gauss-Seidel implementation with two stages on distributed memory systems. Their approach first suggests a recursive clustering algorithm to re-duce communication costs and then devises a coloring algorithm to coordinate information exchange among processors. Their approach partitions the graph into clusters depending on the number of processors by recursively using a balanced 2-way weakcut algorithm, which minimizes the number of boundary vertices instead of minimizing the number of inter-connection edges between two clusters. Each cluster is assigned to a single processor. Then, a coloring algorithm is proposed to color only the boundary vertices. They have implemented their algorithm on nCUBE2 machine by applying recursive k times partitioning depending on the number of processors. Their load balancing paradigm is based on distributing non-colored vertices among cluster groups. The weakness of their algorithm is

(28)

of cluster sizes if vertices are not ideally connected, and load imbalance may be observed between the color classes.

Courtecuisse and Allard [23] propose several Gauss-Seidel parallelization schemes for many-core processors, that can be used in fully coupled dense prob-lems. Their approach aims to synchronize computations without CPU interven-tion by exploiting a small number of GPU processors. They suggest block-column based parallelization, which first computes the blocks on the diagonal of the ma-trix, and then process other columns. Their work also presents a different ap-proach, which eliminates the need for global synchronization. They suggest to use a shared memory counter, to store how many of the diagonal blocks are pro-cessed and deduce if xi is updated for newer iterations. The drawback of their

algorithm is that it does not guarantee the order of invocation of thread blocks and the scheduling is manually done.

Shang [8] proposes another distributed memory parallel Gauss-Seidel algo-rithm. Their approach first suggests to divide the coefficient matrix and the right hand side of the linear algebraic system into row blocks in the natural row-wise order, then distribute the row blocks among local memories of all processors through special mapping techniques. The communication volume is decreased by conveying the solution iteration vector among processors in cycles at each iteration. The parallel efficiency in their algorithm depends on a row-block parti-tioning parameter g and an optimal number of processors. They suggest that pa-rameter g depends on the relative ratio of communication to computation speed. To increase parallel efficiency, their approach overlaps communication and compu-tations. Their approach is a true parallel Gauss-Seidel algorithm that maintains to convergance rates of the serial Gauss-Seidel algorithm. Although communica-tion time is decreased by implementacommunica-tion of cyclically conveyed solucommunica-tion iteracommunica-tion vector, load balancing issue is not handled among processors.

Since Gauss-Seidel is also used as a preconditioner and multigrid smoother in linear problems, there exist some work in the literature, which implements a parallel Gauss-Seidel algorithm for these purposes. Heuveline et al. [24] uses multi-coloring technique for block Gauss-Seidel preconditioner on GPUs. Their

(29)

approach aims to resolve neighbor dependencies by introducing new neighborship classes. They evaluated scalability and performance results on hybrid multi-core CPU and GPU platforms.

Park et al. [19] implemented a shared memory parallel Gauss-Seidel algorithm as a preconditioner in High Performance Conjugate Gradient Benchmark (HPCG) application as the Gauss-Seidel kernel is the most time consuming operation in HPCG. Their approach uses three parallelization approaches including task scheduling with point-to-point synchronization, block multi-color reordering and running multiple MPI ranks within a node. Their multi-coloring approach colors vertices to the same color even there are some transitive dependency between a part of vertices. Although their studies report good results in terms of parallel efficiency, it is out of our interest as it targets shared memory architectures.

(30)

Chapter 4 Method

In this chapter, we present our 2 stage novel scheme for efficient and load bal-anced parallelization of Gauss-Seidel method. In the first stage, we apply graph partitioning and build a coarse graph based on partitioning of vertices. We apply coarsening in order to obtain better coloring and decrease the number of synchro-nization points. In the second stage, we use a balanced graph coloring approach in order to decouple the dependencies among Gauss-Seidel iterations that is re-quired to extract parallelism while maintaining the balance of the size of the color classes.

4.1 Coarsening Stage

In this stage, we propose a coarsening technique based on graph partitioning on the graph representation of the coefficient matrix A. This stage takes the original input graph G corresponding to the coefficient matrix A and outputs a coarse graph GC of this input graph in which each coarse vertex of the output

graph contains approximately B vertices of the original graph, where B is the block size parameter. The coefficient matrix A is represented as a undirected graph G = (V, E ), as described in Chapter 2. Providing a block size parameter

(31)

B, we suggest K -way partitioning the graph G with N vertices into K parts with each part approximately having B vertices such that K P , where P denotes the number of processors. Note that K = N/B. The block size parameter may depend on the size and sparsity structure of the coefficient matrix A and it may affect the overall efficiency of system. In our experiments, we have chosen various block-size parameters that range from 50 to 500 depending on the problem size. The effect of block size parameter is evaluated in Chapter 5 on scientific problems of various sizes.

Given a number of partitions K as a parameter, we partition the graph

into K parts with approximately N/B vertices. We obtain a partitioning

Π = {V1, V2, . . . , VK}. We calculate the edge-cut of the partitioning solution.

Then, we form a coarse graph GC = (VC, EC), where VC denotes the vertex set

of coarse graph GC, and called as super vertices, EC denotes edge set of coarse

graph GC, and called as super edges. The coarse graph GC is formed using the

partitioning information Π, that is each part of VK becomes a super vertex of

VC. In detail, let p be a vector of size |V | such that p[i] stores the number of the

partition that vertex vi belongs to. By partitioning the graph, we obtain a list

of vertex-part pairs hvi, p[i]i. We assign the group of vertices in G to GC based

on their partition numbers p[i]. If there exists at least one edge in between any two vertices, each in different super vertices, we form a super edge in between the respective super vertices. Multiple edges between vertices of corresponding super vertices are coalesced into a single edge.

The aim in graph partitioning problem is to minimize the cut size among all partitions. By minimizing the cut size, we obtain highly intra-connected cluster of vertices, and loosely inter-connected super vertices. Each super vertex denotes an atomic task and the rows corresponding to the vertices in that super vertex will be processed by a single processor in the Gauss-Seidel algorithm. However, note that multiple super vertices are assigned to a single processor based on the coloring results. This enables us to resolve communication costs while processing vertices in multiple processors in parallel. No communication is required when processing a single super vertex, and those fine-grain vertices that make up the

(32)

resulting coarse graph GC is less coupled than the initial graph. Therefore, the

chance of obtaining fewer number of colors in the coloring stage (described in Sec-tion 4.2) is increased. AddiSec-tionally, coarsening reduces the coloring time, which is a necessary pre-processing step in a parallel Gauss-Seidel implementation.

In our experimental results presented in Chapter 5, we will compare the char-acteristics of the initial graph G with resulting coarse graph GC observed that the

resulting coarsened graph GC is less coupled in terms of vertex degrees, and the

required number of colors is fewer in GC than in G.

Figure 4.1: Step by step coarsening stage

Figure 4.1 represents an undirected graph with 27 vertices and its step by step coarsening process. The initial graph has been partitioned into 5 parts with partition size of 5 or 6. Each part is considered as a super vertex, then super edges are formed among the super vertices. The resulting coarse graph has 5 super vertices. The weights of super vertices are provided with labels. One can observe that the vertices within a single partition is densely connected.

(33)

4.2 Coloring Stage

In order to extract parallelism in the Gauss-Seidel iterations, we use a graph coloring technique. We apply distance-1 coloring on the coarsened graph GC that

is obtained at the end of the coarsening stage. We first assign colors to the super vertices of GC, process nodes of GC belonging to same color in parallel, and proceed

with the next color. We will describe details of parallel execution steps later in this section.

Traditional graph coloring methods aim to minimize the number of colors. However, in the context of many parallel processing applications, it is also im-portant to obtain a balanced distribution of the sizes of the color classes. If the color classes are imbalanced in size, utilization of hardware resources becomes in-efficient, especially for the smaller color classes. Therefore our goal is to achieve a balanced coloring of the vertices in GC, that the number of vertices in each color

class is approximately the same. Lu et al. [17] presented a graph coloring toolkit named Grappolo which provides an algorithm for balanced coloring with Vertex First Fit (VFF) method. Their work presents a set of coloring strategies based on greedy, shuffling and recoloring approach. Basically, a vertex is assigned to the least used color among the set of permissible colors which is bounded by a bal-ancing constraint. Their work also includes guided balbal-ancing strategies in which a scheduled and unscheduled shuffling and recoloring methods are introduced.

We have also considered and discussed using a traditional coloring method due to the possibility that a balanced coloring method may increase the required number of colors potentially. For this purpose, we have evaluated a set of color-ing algorithms proposed by Gebremedhin et al. [25]. Their work presents a graph coloring toolkit named ColPack for scientific purposes. ColPack offers various sequential greedy coloring algorithms with degree based ordering in the context of distance-1 coloring. Each algorithm progressively extends a partial coloring by processing a single vertex at a time in some order by permanently assigning a vertex the smallest allowable color in each step. By comparing two different

(34)

coloring approaches in our experiments, we have observed that a balanced color-ing is generally possible with the same number of colors as traditional colorcolor-ing. Related experimental results are discussed in Chapter 5 in more detail.

In this stage, Grappolo is used to obtain a balanced coloring of vertices, which is required for the kernel of parallel Gauss-Seidel implementation. We color the vertices of coarse graph GC, and upon completion, each super vertex vci ∈ VC is

assigned a color number ci, where ci ∈ C. C denotes the set of all colors, and

nc denotes the total number of colors required. Note that vertices of G have not

been subjected to coloring at this stage. Instead, they are assigned colors based on the color of the corresponding super vertex which they are part of. Proposed coarsening and coloring method in together decreases the number of colors which results in reducing number of synchronization points that contributes to the total communication volume. Fewer number of colors requires less synchronization points between sub-iterations, hence, communication overhead decreases. The coloring time is also decreases with the proposed coarsening method.

We assign the rows of the iteration matrix A that correspond to the constituent vertices of each super vertex to a different processor. Since the number of su-per vertices of GC is much greater than the number of processors (K P ), a

single processor processes multiple super vertices of GC. All processors computes

the corresponding xi entries of same color. Once, processing of a color class is

complete, all processors exchange xi values. This is called a local synchronization

point. Then all processors proceed with the next color after receiving the updated xi values.

Figure 4.2 represents a visual comparison of traditional parallel processing versus parallel processing with coarsening and balanced coloring. Since number of colors decreases with the coarsening stage, fewer number of synchronization points are required. Size of color classes are almost perfectly balanced and all processors are assigned almost equal number of super vertices that are considered as atomic tasks.

(35)

Figure 4.2: Traditional parallel processing vs. parallel processing with coarsening and balanced coloring

(36)

Chapter 5 Experimental Results

In this chapter, we present the results of our experiments that we carried out for the validity of the proposed methodology. Firstly, we give the details of the experimental platform and data sets used. Then, following sections include eval-uation of numeric results for the implementation of the 2 stages of our proposed methodology. In our experiments, we evaluate two important metrics which are the number of colors and partitioning cutsize. Due to the randomized nature of some algorithms used, the experiments were repeated 5 times with different seeds and results are averaged. Timing results for each stage of our methodolody are also presented in each section.

5.1 Experimental Platform

The platform used in our experiments is a server computer with 4 x 6 Core AMD Opteron 8425 HE with 2.1 GHz clock frequency, 64 KB L1 cache, 512 KB L2 cache, 6 MB shared L3 cache and 128 GB total memory. The operating system is Debian GNU/Linux version 7. gcc version 4.7.2 is used as C/C++ compiler with −O3 optimization flag while compiling. METIS version 5.1.10 with default parameters is used as a graph partitioning tool. In the coloring stage, Grappolo

(37)

with currently available source code with no specific version tag and ColPack version 1.0.10 are used.

5.2 Data Set

In order to validate and measure the performance of our methodology, we used a real life test experiment data from wide range of scientific applications. We per-formed our tests on 40 different matrices from the SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) [26]. SuiteSparse Matrix Collection is a widely used set of sparse matrix benchmarks collected from a wide range of applications. The chosen problems are symmetric and positive definite matrices as per the convergence requirement of the Gauss-Seidel method. The size of test matrices ranges from 16,146 to 4,588,484 vertices and 1,015,156 to 48,538,952 nonzeros in a variety of scientific applications including electromag-netics, structural and quantum chemistry problems. The details of test matrices are presented in Table 5.1.

The matrices in the SuiteSparse Matrix Collection are provided in Matrix Market (MTX ) format which is a widely used storage format. However, CSR format is used in our implementation due to its advantages discussed before. We converted those matrices to CSR format while processing.

(38)

Table 5.1: Properties of Test Matrices

Matrix Name Problem Kind Rows Columns Nonzeros

Vertex Degrees

Min Avg Max

2cubes sphere electromagnetics problem 101,492 101,492 1,647,264 5 16 31

af 0 k101 structural problem 503,625 503,625 17,550,675 15 35 35

af shell1 structural problem sequence 504,855 504,855 17,588,875 20 35 40

bcsstk31 structural problem 35,588 35,588 1,181,416 2 33 189

bmw7st 1 structural problem 141,347 141,347 7,339,667 1 52 435

cfd2 fluid dynamics problem 123,440 123,440 3,087,898 12 20 33

coAuthorsCiteseer undirected graph 227,320 227,320 1,628,268 1 7 1372

coAuthorsDBLP undirected graph 299,067 299,067 1,955,352 1 7 336

consph 2D/3D problem 83,334 83,334 6,010,480 1 72 81

cop20k A 2D/3D problem 121,192 121,192 2,624,331 0 22 81

crankseg 2 structural problem 63,838 63,838 14,148,858 48 222 3423

CurlCurl 4 model reduction problem 2,380,515 2,380,515 26,515,867 4 11 13

dblp-2010 undirected graph 326,186 326,186 1,615,400 0 5 238

delaunay n20 undirected graph 1,048,576 1,048,576 6,291,372 3 6 23

delaunay n21 undirected graph 2,097,152 2,097,152 12,582,816 3 6 23

dielFilterV2real electromagnetics problem 1,157,456 1,157,456 48,538,952 6 42 110

filter3D model reduction problem 106,437 106,437 2,707,179 8 25 112

fullb structural problem 199,187 199,187 11,708,077 18 59 144

Ga10As10H30 quantum chemistry problem 113,081 113,081 6,115,633 7 54 698

Ga3As3H12 quantum chemistry problem 61,349 61,349 5,970,947 15 97 1622

(39)

Matrix Name Problem Kind Rows Columns Nonzeros

Vertex Degrees

Min Avg Max

GaAsH6 quantum chemistry problem 61,349 61,349 3,381,809 15 55 1646

gas sensor model reduction problem 66,917 66,917 1,703,365 8 25 33

gupta3 optimization problem 16,783 16,783 9,323,427 33 556 14672

hugetrace-00000 undirected graph 4,588,484 4,588,484 13,758,266 2 3 3

il2010 undirected weighted graph 451,554 451,554 2,164,464 1 5 90

mo2010 undirected weighted graph 343,565 343,565 1,656,568 1 5 84

msdoor structural problem 415,863 415,863 20,240,935 28 49 77

nc2010 undirected weighted graph 288,987 288,987 1,416,620 1 5 83

offshore electromagnetics problem 259,789 259,789 4,242,673 5 16 31

oilpan structural problem 73,752 73,752 3,597,188 28 49 70

olafu structural problem 16,146 16,146 1,015,156 24 63 89

pdb1HYS weighted undirected graph 36,417 36,417 4,344,765 18 119 204

PFlow 742 2D/3D problem 742,793 742,793 37,138,461 1 50 137

pkustk05 structural problem 37,164 37,164 2,205,144 24 59 126

rgg n 2 17 s0 undirected random graph 131,072 131,072 1,457,506 0 11 28

s3dkq4m2 structural problem 90,449 90,449 4,820,891 13 53 54

thermal2 thermal problem 1,228,045 1,228,045 8,580,313 1 7 11

thermomech dM thermal problem 204,316 204,316 1,423,116 4 7 10

va2010 undirected weighted graph 285,762 285,762 1,402,128 1 5 102

venturiLevel3 undirected graph 4,026,819 4,026,819 16,108,474 2 4 6

(40)

5.3 Evaluation

5.3.1 Experiments on the Coarsening Stage

To measure the effectiveness of our coarsening model, we analyze and compare the structure of coarse graph GC with the original graph G that corresponds to the

co-efficient matrix A. We partition G for different block sizes B ∈ {50, 100, 250, 500}. Then, we form a coarse graph GC. In our experiments, we evaluate the structure

of G and GC by comparing the vertex degrees. The minimum, average and

max-imum vertex degrees of G and GC are presented in Table 5.2 with respect to the

block size parameter used for the partitioning step. Please note that the original graph properties are presented with the lines where B = 1 for each test matrix and the subsequent rows present the coarsening results for the respective block sizes.

The vertex degree properties of a sparse matrix reflect the dependencies of any nonzero item to other nonzero items on the off-diagonal blocks. Depending on the reordering due to the partitioning, the connectivity between any items may cause computational dependency in the Gauss-Seidel iterations. These dependencies contributes to the output of the coloring stage. Experimental results on Table 5.2 show that our coarsening model effectively reduces the average and maximum degrees on test matrices. For instance, the test matrix named af shell has average degree of 34 and maximum degree of 40. After coarsening applied with B = 50, the resulting coarse graph GC has average degree of 7 and maximum vertex degree

of 11. This observation holds true for the almost any of test matrices.

In the implementation of the graph partitioning problem, we used METIS [14] as a graph partitioning tool. METIS is a multilevel graph partitioning software. It uses a multilevel approach with three stages and comes with several algorithms for each stage. It is well accepted and commonly used as a successful graph partitioning tool used various domains including finite element methods, linear programming, VLSI, and transportation. Their experiments show that METIS produces partitions that are consistently better than those produced by other

(41)

widely used algorithms.

One important quality metric of coarsening stage that is related to the parallel performance is the partitioning cutsize. Partitioning cutsize is important since it affects the quality of output of the coarsening stage. In Table 5.2, number of partitions denoted by K and partitioning cutsize are presented. Experimental results verify that the cutsize decreases with the coarsening process. For instance, considering the test matrix af shell1, the cutsize of partitioning decreases from 3,227,892 to 1,005,261 in relation to the blocksize parameter.

Timing results for partitioning with METIS (tpartitioning), coarsening

(tcoarsening) and coloring (tcoloring) stages are also provided in Table 5.2, all in

seconds. Our experiments show that proposed coarsening model reduces the col-oring time. Timing results for the colcol-oring of original graph G are presented on the lines where B = 1 for each test matrix and the subsequent rows present the timing results for larger block sizes. We observe that the most time consuming operation is the initial graph partitioning done by METIS.

5.3.2 Experiments on the Coloring Stage

Graph coloring is a crucial task of any parallel Gauss-Seidel implementation due to the sequential nature of the algorithm. Number of colors obtained from the coloring stage is directly related to the parallel efficiency. When the computa-tions within a color class are completed, synchronization points are required to exchange xi values, which results in synchronization overhead. The

synchroniza-tion overheads increase to the total communicasynchroniza-tion volume. Another important metric is the balance of the color classes. Balancing the computational weights among multiple color classes increases the overall efficiency of parallel system. In this section, we discuss the efficiency of our proposed model by evaluating these metrics. We evaluated and compared two graph coloring tools ColPack [25] and Grappolo [17].

(42)

In Table 5.2, we evaluated the number of colors for each test matrix in relation to various block sizes. By examining the experimental results, we observed that our coarsening model effectively reduces the number of colors required to color the graph. For instance, 908 colors are required to color Ga3As3H12 matrix, whereas our coarsening model with B = 50 only requires 32 colors which results in 28 times decrease in the synchronization overhead.

Balancing of computations in multiple color classes is important for a paral-lel Gauss-Seidel implementation. For this requirement, we suggested a balanced coloring approach as discussed previously. Here, we compare the results of tra-ditional coloring with balanced coloring in Figure 5.1 and Figure 5.2. ColPack and Grappolo are used for coloring and balanced coloring, respectively. Figure 5.1 compares two approach by the number of vertices in each color classes of the test matrix delaunay n20. Figure 5.2 compares the sum of weights of ver-tices having the same color for the test matrix. The results show that vertex counts and weights are almost equal with balanced coloring method we suggested whereas there is a huge imbalance among different colors in the traditional color-ing method. Color IDs are sorted to remark the imbalance. In the traditional col-oring experiments, color classes of very small number of nodes are observed. This may cause that some processors stay idle in parallel systems with large number of processors. Although, the balancing results are provided only for delaunay n20, experiments were repeated for every test matrix and results are similar.

We also present a comparison of number of counts required to color same problem with traditional coloring versus balanced coloring. In Figure 5.3, total number of colors required is given by applying both coloring techniques for all test matrices. Experiments show that in only rare cases balanced coloring approach slightly increases the number of colors. In most cases, balanced coloring method requires the same number of colors as the traditional coloring method, moreover it produces quality color classes of almost equal sizes.

Timing results for coloring stage are presented across the column titled (tcoloring) of Table 5.2. The results show that our coarsening model decreases

(43)

Table 5.2: Coarsening and Coloring Results for B ∈ {1, 50, 100, 250, 500}

Matrix Name B K Cutsize Vertex Degrees # Colors Time (seconds)

Min Avg Max tpartitioning tcoarsening tcoloring

2cubes sphere 1 101,492 - 5 16 31 13 - - 1.03 50 2,029 350,747 7 16 28 12 6.36 0.1 0.2 100 1,014 262,980 6 16 25 11 3.15 0.07 0.26 250 405 193,804 6 15 22 11 1.54 0.06 0.24 500 202 153,665 6 14 21 10 1.02 0.04 0.22 af 0 k101 1 503,625 - 15 34 35 15 - - 6.61 50 10,072 3,218,626 3 7 12 7 48.73 0.57 0.21 100 5,036 2,240,029 3 7 9 6 20.15 0.39 0.19 250 2,014 1,422,306 3 6 11 6 8.21 0.37 0.19 500 1,007 1,011,004 3 6 11 7 4.4 0.25 0.18 af shell1 1 504,855 - 20 34 40 25 0.01 - 7.99 50 10,097 3,227,892 3 7 11 7 46.77 0.51 0.22 100 5,048 2,242,409 4 6 9 7 17.6 0.42 0.19 250 2,019 1,422,405 3 6 10 6 7.67 0.3 0.2 500 1,009 1,005,261 3 6 12 6 5.19 0.3 0.19 bcsstk31 1 35,588 - 2 33 189 36 - - 0.57 50 711 246,610 4 12 25 10 2.45 0.04 0.17 100 355 170,328 4 10 20 10 1.13 0.03 0.18 250 142 107,868 3 9 17 7 0.53 0.02 0.18 500 71 70,833 4 8 12 7 0.32 0.02 0.17 bmw7st 1 1 141,347 - 1 51 435 54 - - 3.01 50 2,826 1,673,618 3 9 23 9 14.77 0.15 0.19 100 1,413 1,191,352 3 8 25 8 5.67 0.1 0.18 250 565 753,110 2 7 19 8 2.71 0.09 0.16 500 282 519,441 3 7 14 7 1.54 0.08 0.18 cfd2 1 123,440 - 8 25 30 15 - - 1.32 50 2,468 715,902 4 14 24 12 11.15 0.16 0.19 100 1,234 538,840 4 12 21 10 5.51 0.08 0.19 250 493 375,650 3 10 20 9 2.23 0.06 0.18 500 246 280,725 4 10 16 9 1.41 0.05 0.18 coAuthorsCiteseer 1 227,320 - 2 8 1373 87 - - 1.19 50 4,546 189,178 2 23 255 19 14.03 0.32 0.22 100 2,273 140,696 2 36 225 25 7.15 0.3 0.22 250 909 115,505 11 69 195 28 3.87 0.24 0.21 500 454 105,737 18 106 205 42 2.6 0.2 0.2 coAuthorsDBLP 1 299,067 - 2 7 337 115 0.01 - 1.49 50 5,981 300,029 3 41 290 35 20.09 0.64 0.29 100 2,990 256,610 6 67 290 48 10.47 0.57 0.28

(44)

Matrix Name B K Cutsize

Vertex Degrees

# Colors

Time (seconds) Min Avg Max tpartitioning tcoarsening tcoloring

consph 1 83,334 - 1 72 81 33 - - 2.41 50 1,666 1,864,232 1 18 44 14 14.4 0.17 0.18 100 833 1,499,045 1 15 24 12 6.48 0.12 0.17 250 333 1,110,406 1 13 21 11 3.09 0.08 0.18 500 166 871,298 1 12 20 10 2.03 0.07 0.16 cop20k A 1 121,192 - 1 21 81 20 - - 1.29 50 2,423 663,733 1 13 35 12 10.33 0.14 0.19 100 1,211 442,535 1 12 24 10 5.31 0.11 0.17 250 484 311,814 1 10 19 10 2.01 0.06 0.2 500 242 234,704 1 9 15 9 1.36 0.06 0.19 crankseg 2 1 63,838 - 48 221 3423 198 - - 5.56 50 1,276 5,690,067 10 54 179 34 25.51 0.42 0.19 100 638 4,773,022 10 30 91 22 9.96 0.19 0.19 250 255 3,563,088 7 17 40 13 5.09 0.13 0.18 500 127 2,707,962 7 13 27 12 2.87 0.12 0.17 CurlCurl 4 1 2,380,515 - 4 11 13 7 0.02 - 11.78 50 47,610 5,463,353 5 16 29 12 247.51 2.19 0.69 100 23,805 3,950,078 5 15 31 12 132.71 1.59 0.52 250 9,522 2,929,611 5 15 23 12 76.52 1.41 0.31 500 4,761 2,327,368 5 14 23 12 49.86 2.02 0.26 dblp-2010 1 326,186 - 1 5 239 75 - - 1.16 50 6,523 198,143 1 25 248 29 18.41 0.43 0.24 100 3,261 156,011 1 38 225 33 8.88 0.32 0.26 250 1,304 137,542 1 72 293 40 5.61 0.3 0.24 500 652 127,118 1 105 266 55 3.98 0.29 0.21 delaunay n20 1 1,048,576 - 4 6 24 8 0.01 - 3.36 50 20,971 520,061 4 7 17 7 51.75 0.82 0.25 100 10,485 368,355 3 7 14 7 30.23 0.58 0.24 250 4,194 235,678 4 7 14 6 13.23 0.49 0.2 500 2,097 168,358 4 6 14 6 6.76 0.4 0.21 delaunay n21 1 2,097,152 - 4 6 24 9 0.01 - 7.17 50 41,943 1,042,183 3 7 16 7 127.13 1.35 0.37 100 20,971 737,897 3 7 17 7 73.66 1.19 0.28 250 8,388 472,685 4 7 17 7 29.38 1.23 0.23 500 4,194 337,595 4 6 14 7 16.86 0.88 0.2 dielFilterV2real 1 1,157,456 - 6 41 110 28 - - 21.08 50 23,149 12,330,229 6 18 41 14 199.49 1.47 0.55 100 11,574 9,780,125 5 16 31 13 102.55 1.13 0.35 250 4,629 7,126,512 5 15 34 13 61.65 1.1 0.24 500 2,314 5,589,193 5 14 26 12 36.99 1.21 0.19

(45)

Vertex Degrees

# Colors

filter3D 1 106,437 - 8 25 112 16 - - 1.27 50 2,128 616,562 3 13 29 11 9.47 0.13 0.19 100 1,064 450,344 3 11 24 10 4.23 0.1 0.19 250 425 306,507 3 9 17 8 1.99 0.08 0.17 500 212 221,077 4 7 12 8 1.47 0.08 0.16 fullb 1 199,187 - 18 58 144 54 - - 4.48 50 3,983 2,959,022 4 11 26 11 26.79 0.3 0.2 100 1,991 2,249,181 4 10 19 10 11.07 0.18 0.18 250 796 1,556,875 5 10 15 8 4.82 0.16 0.18 500 398 1,195,246 4 9 16 8 3 0.15 0.18 Ga10As10H30 1 113,081 - 7 54 697 390 - - 2.77 50 2,261 2,359,972 30 76 218 52 23.73 0.54 0.24 100 1,130 2,118,671 27 70 223 50 14.45 0.23 0.21 250 452 1,563,816 15 53 139 37 8.88 0.14 0.18 500 226 1,090,058 19 40 115 29 5.32 0.16 0.18 Ga3As3H12 1 61,349 - 15 97 1622 908 - - 2.62 50 1,226 2,589,109 30 81 223 58 12.9 0.27 0.2 100 613 2,454,111 27 77 231 61 7.57 0.17 0.19 250 245 2,143,907 22 59 165 45 4.93 0.12 0.19 500 122 1,625,746 17 39 101 32 3.15 0.08 0.17 GaAsH6 1 61,349 - 15 55 1645 910 - - 1.7 50 1,226 1,337,907 33 70 195 50 10.69 0.24 0.2 100 613 1,232,815 25 64 197 52 6.56 0.16 0.19 250 245 1,067,671 22 51 140 37 3.76 0.08 0.17 500 122 847,381 16 33 90 23 2.46 0.06 0.17 gas sensor 1 66,917 - 8 25 33 13 - - 0.82 50 1,338 402,689 6 14 22 12 5.24 0.08 0.2 100 669 306,530 6 13 22 12 2.66 0.06 0.18 250 267 221,316 5 12 22 10 1.39 0.05 0.15 500 133 168,873 6 10 19 8 0.79 0.04 0.17 gupta3 1 16,783 - 33 555 14672 201 - - 3.76 50 335 4,424,987 117 239 335 152 26.2 0.22 0.21 100 167 4,355,360 76 142 167 104 17.39 0.11 0.19 250 67 4,243,284 54 64 67 56 9.54 0.08 0.18 500 33 4,478,075 1 31 32 32 13.31 0.07 0.16 hugetrace-00000 1 4,588,484 - 3 3 4 4 0.04 - 11.63 50 91,769 843,383 3 6 12 7 238.55 3.17 0.64 100 45,884 579,429 3 6 12 7 151.64 2.57 0.46 250 18,353 367,904 3 6 11 7 64.02 3.07 0.28 500 9,176 262,679 3 6 11 7 35.16 3.15 0.26

(46)

Vertex Degrees

# Colors

il2010 1 451,554 - 2 5 91 7 - - 1.29 50 9,031 150,546 2 6 18 7 22.08 0.29 0.2 100 4,515 100,285 2 6 15 7 8.93 0.24 0.2 250 1,806 59,596 2 6 13 7 4.26 0.2 0.2 500 903 40,074 2 6 12 6 2.55 0.18 0.19 mo2010 1 343,565 - 2 5 85 7 - - 1.11 50 6,871 120,522 2 6 14 7 14.86 0.24 0.2 100 3,435 80,280 2 6 14 7 6.23 0.16 0.23 250 1,374 47,872 2 6 12 7 3.21 0.17 0.23 500 687 32,500 3 6 13 6 2.08 0.15 0.18 msdoor 1 415,863 - 28 48 77 42 - - 8.89 50 8,317 4,310,933 4 7 17 8 54.68 0.51 0.21 100 4,158 3,005,766 3 7 18 7 21.23 0.38 0.19 250 1,663 1,844,762 3 7 17 7 8.4 0.36 0.17 500 831 1,239,576 3 6 15 6 4.63 0.28 0.18 nc2010 1 288,987 - 2 5 84 7 - - 0.86 50 5,779 102,515 2 6 15 7 13.32 0.21 0.19 100 2,889 67,852 2 6 17 7 5.64 0.17 0.19 250 1,155 40,075 3 6 13 6 2.39 0.11 0.19 500 577 26,957 3 6 11 6 1.42 0.12 0.19 offshore 1 259,789 - 5 16 31 12 - - 2.05 50 5,195 893,918 6 16 37 13 22.35 0.39 0.21 100 2,597 669,790 5 16 41 12 11.69 0.3 0.2 250 1,039 482,485 5 15 50 12 5.7 0.22 0.17 500 519 365,778 4 14 42 12 3.28 0.17 0.18 oilpan 1 73,752 - 28 48 70 35 - - 1.55 50 1,475 792,888 4 7 14 7 5.85 0.06 0.18 100 737 556,249 4 7 15 7 2.48 0.05 0.19 250 295 351,572 4 6 11 6 1.08 0.04 0.16 500 147 248,793 4 6 10 6 0.71 0.04 0.18 olafu 1 16,146 - 24 62 89 39 - - 0.52 50 322 260,616 3 9 22 9 1.39 0.03 0.17 100 161 179,089 3 7 13 7 0.54 0.01 0.16 250 64 103,724 3 6 9 5 0.24 0.01 0.18 500 32 66,522 2 5 8 4 0.15 0.01 0.17 pdb1HYS 1 36,417 - 18 119 204 82 - - 1.77 50 728 1,458,402 4 19 59 15 6.35 0.07 0.16 100 364 1,078,272 3 15 36 13 2.98 0.05 0.18 250 145 684,938 2 10 19 9 1.35 0.04 0.17 500 72 489,993 4 8 19 7 0.78 0.04 0.15

(47)

Vertex Degrees

# Colors

PFlow 742 1 742,793 - 1 49 137 27 - - 14.39 50 14,855 11,241,896 1 18 71 15 187.2 0.93 0.47 100 7,427 8,584,193 1 14 33 12 77.68 0.79 0.26 250 2,971 6,282,559 1 12 21 11 55.39 0.68 0.23 500 1,485 4,837,489 1 10 17 9 50.23 0.66 0.19 pkustk05 1 37,164 - 24 59 126 66 - - 1 50 743 584,681 7 13 28 12 3.49 0.04 0.18 100 371 450,434 5 11 20 8 1.58 0.03 0.18 250 148 304,077 4 8 15 8 0.73 0.02 0.18 500 74 215,495 3 7 11 7 0.47 0.02 0.18 rgg n 2 17 s0 1 131,072 - 1 12 29 16 - - 0.74 50 2,621 93,384 3 6 14 6 5.46 0.11 0.27 100 1,310 57,871 3 6 11 6 2.39 0.11 0.18 250 524 33,878 3 6 13 7 1.12 0.07 0.18 500 262 23,023 3 6 13 6 0.65 0.07 0.17 s3dkq4m2 1 90,449 - 13 53 54 24 - - 1.89 50 1,808 1,084,415 3 7 13 7 8.77 0.08 0.18 100 904 765,330 3 7 10 7 3.46 0.07 0.18 250 361 483,160 3 6 9 6 1.56 0.07 0.18 500 180 339,316 3 6 9 6 0.88 0.05 0.17 thermal2 1 1,228,045 - 1 6 10 7 0.01 - 4.37 50 24,560 993,578 1 6 11 7 84.27 0.95 0.28 100 12,280 513,059 1 6 11 7 28.87 0.77 0.22 250 4,912 321,477 1 6 10 7 15.4 0.7 0.18 500 2,456 227,701 1 6 9 6 7.54 0.57 0.2 thermomech dM 1 204,316 - 4 6 10 7 - - 0.87 50 4,086 161,932 3 6 10 6 8.35 0.19 0.19 100 2,043 82,570 3 6 9 6 3.52 0.14 0.19 250 817 50,374 3 6 10 6 1.74 0.15 0.19 500 408 35,122 3 6 9 6 1.04 0.15 0.19 va2010 1 285,762 - 2 5 103 7 - - 0.97 50 5,715 103,804 2 6 21 7 12.3 0.17 0.27 100 2,857 69,078 3 6 16 7 5.31 0.17 0.19 250 1,143 41,042 2 6 13 6 2.31 0.12 0.2 500 571 27,647 2 6 12 7 1.41 0.11 0.18 venturiLevel3 1 4,026,819 - 3 5 7 5 0.03 - 9.91 50 80,536 1,255,709 3 6 11 7 216.99 2.17 0.48 100 40,268 870,008 3 6 12 7 135.32 1.95 0.4 250 16,107 543,187 3 6 11 6 71.34 2.3 0.29 500 8,053 383,938 3 6 11 7 40.61 1.73 0.23 End of Table

(48)

Figure 5.1: A comparison of coloring vs. balanced coloring by number of super vertices per each color

Figure 5.2: A comparison of coloring vs. balanced coloring by sum of weights per each color

(49)

Figure 5.3: A comparison of coloring vs. balanced coloring by number of colors obtained

(50)

In Table 5.3, we present the average results for all test matrices with respect to the block size parameter by geometric mean. Results show that our model performs up to 67.7% better coloring by the number of colors in average compared to the baseline where B = 1. The coloring time is decreased up to 92.8% for B = 500. However, partitioning and coarsening times are introduced.

B K Cutsize Vertex Degrees # Colors Time (seconds)

Min Avg Max tpartitioning tcoarsening tcoloring

1 162,652.00 - 5.77 32.55 203.20 38.15 - - 2.65 50 3,251.63 1,136,266.95 5.15 20.62 59.45 16.99 21.13 0.30 0.25 100 1,625.49 867,214.79 4.14 19.05 46.48 15.78 10.30 0.21 0.22 250 649.48 621,076.86 4.77 16.24 34.58 13.51 5.20 0.16 0.20 500 324.30 472,592.98 4.51 14.29 27.57 12.32 3.35 0.14 0.19

(51)

Chapter 6 Conclusion

Gauss-Seidel is a popular iterative solver for the linear system of equations due to its fast convergence property. In this thesis, we proposed a method based on coarsening and balanced coloring for the efficient parallelization of the Gauss-Seidel algorithm on distributed memory systems. Our coarsening method effec-tively reduces the required number of colors in the coloring stage, thus decreases the volume of communication caused by the synchronization overheads. Since the Gauss-Seidel method is inherently sequential, we utilized a graph coloring technique in order to extract parallelism. Utilized balanced coloring method effi-ciently balances the computational weights among the color classes. Experiments performed on a wide range of test matrices arising from various scientific do-mains verify the validity of proposed model. We tested the performance of our model against the traditional model and observed that proposed model reduces the number of colors in the coloring stage by up to 67.7% depending on a chosen block size parameter while balancing computations among the color classes.

(52)

Bibliography

[1] N. Lord, G. H. Golub, and C. F. V. Loan, “Matrix Computations,” The Mathematical Gazette, vol. 83, no. 498, p. 556, 2007.

[2] N. Chrisochoides, E. Houstis, and J. Rice, “Mapping algorithms and software environment for data parallel pde iterative solvers,” Journal of Parallel and Distributed Computing, vol. 21, no. 1, pp. 75–95, 1994.

[3] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker, “Solving Problems on Concurrent Computers,” Osti.Gov, 1988.

[4] L. A. Hageman and D. M. Young, “Iterative Solution of Large Linear Sys-tems.,” The American Mathematical Monthly, 2006.

[5] M. F. Adams, “A distributed memory unstructured gauss-seidel algorithm for multigrid smoothers,” pp. 4–4, Association for Computing Machinery (ACM), feb 2004.

[6] D. P. Koester, S. Ranka, and G. C. Fox, “A parallel Gauss-Seidel algorithm for sparse power system matrices,” 2004.

[7] G. Huang and W. Ongsakul, “An efficient task allocation algorithm and its use to parallelize irregular Gauss-Seidel type algorithms,” pp. 497–501, Institute of Electrical and Electronics Engineers (IEEE), dec 2002.

[8] Y. Shang, “A distributed memory parallel Gauss-Seidel algorithm for linear algebraic systems,” Computers and Mathematics with Applications, vol. 57, no. 8, pp. 1369–1376, 2009.

(53)

[9] F. B. Hildebrand and Mathematics, Introduction to Numerical Analysis: Sec-ond Edition. 1993.

[10] W. J. Camp, S. J. Plimpton, B. A. Hendrickson, and R. W. Leland, “Mas-sively parallel methods for engineering and science problems,” Communica-tions of the ACM, vol. 37, no. 4, pp. 30–41, 1994.

[11] T. Bultan and C. Aykanat, “A new mapping heuristic based on mean field annealing,” Journal of Parallel and Distributed Computing, vol. 16, no. 4, pp. 292–305, 1992.

[12] M. Kaddoura, C. W. Ou, and S. Ranka, “Partitioning Unstructured Compu-tational Graphs for Nonuniform and Adaptive Environments,” IEEE Parallel and Distributed Technology, vol. 3, no. 3, pp. 63–69, 1995.

[13] B. Hendrickson and R. Leland, “The Chaco User’s Guide Version 2.0,” Tech. Rep. July, Technical Report SAND95-2344, Sandia National Laboratories, 1995.

[14] G. Karypis and V. Kumar, “Metis: a software package for partitioning un-structured graphs,” International Cryogenics Monograph, pp. p´ags. 121–124, 1998.

[15] G. Karypis and V. Kumar, “A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs,” SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359–392, 2003.

[16] S. R. dos Santos, [Computers in nursing: development of free software ap-plication with care and management]., vol. 44. 2010.

[17] H. Lu, M. Halappanavar, D. Chavarria-Miranda, A. H. Gebremedhin, A. Pa-nyala, and A. Kalyanaraman, “Algorithms for balanced graph colorings with applications in parallel computing,” IEEE Transactions on Parallel and Dis-tributed Systems, 2017.

(54)

[19] J. Park, M. Smelyanskiy, K. Vaidyanathan, A. Heinecke, D. D. Kalamkar, X. Liu, M. M. A. Patwary, Y. Lu, and P. Dubey, “Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices,” in International Conference for High Performance Computing, Networking, Storage and Analysis, SC, vol. 2015-Janua, pp. 945–955, 2014.

[20] T. F. Coleman and J. J. Mor´e, “Estimation of sparse hessian matrices and graph coloring problems,” Mathematical Programming, vol. 28, no. 3, pp. 243–270, 1984.

[21] D. Welsh and M. Powell, “Summary for Policymakers,” Climate Change 2013 - The Physical Science Basis, pp. 1–30, 1967.

[22] D. Br´elaz, “New methods to color the vertices of a graph,” Communications of the ACM, vol. 22, no. 4, pp. 251–256, 1979.

[23] H. Courtecuisse and J. Allard, “Parallel dense gauss-seidel algorithm on many-core processors,” 2009 11th IEEE International Conference on High Performance Computing and Communications, HPCC 2009, no. 1, pp. 139– 147, 2009.

[24] V. Heuveline, D. Lukarski, and J. P. Weiss, “Scalable multi-coloring precon-ditioning for multi-core CPUs and GPUs,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lec-ture Notes in Bioinformatics), vol. 6586 LNCS, pp. 389–397, 2011.

[25] A. H. Gebremedhin, D. Nguyen, M. M. A. Patwary, and A. Pothen, “Col-Pack,” ACM Transactions on Mathematical Software, vol. 40, pp. 1–31, oct 2013.

[26] T. A. Davis and Y. Hu, “The university of Florida sparse matrix collection,” ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1–25, 2011.

Coloring for distributed-memory-parallel Gauss-Seidel algorithm

COLORING FOR

DISTRIBUTED-MEMORY-PARALLEL

GAUSS-SEIDEL ALGORITHM

a thesis submitted to

the graduate school of engineering and science

of bilkent university

in partial fulfillment of the requirements for

the degree of

master of science

in

computer engineering

By

Onur Ko¸cak

September 2019

ABSTRACT

COLORING FOR

DISTRIBUTED-MEMORY-PARALLEL GAUSS-SEIDEL

ALGORITHM

¨

OZET

DA ˘

GITIK-BELLEK-PARALEL GAUSS-SEIDEL

ALGOR˙ITMASI ˙IC

¸ ˙IN RENKLEND˙IRME

Acknowledgement

Contents

List of Figures

List of Tables

List of Algorithms

Chapter 1

Introduction

Chapter 2

Background

2.1

The Gauss-Seidel Method

2.2

Sparse Matrices and Compressed Sparse

Row (CSR) Format

2.3

Graphs and Matrices

2.4

Graph Partitioning

2.5

Graph Coloring

Chapter 3

Related Work

Chapter 4

Method

4.1

Coarsening Stage

4.2

Coloring Stage

Chapter 5

Experimental Results

5.1

Experimental Platform

5.2

Data Set

5.3

Evaluation

5.3.1

Experiments on the Coarsening Stage

5.3.2

Experiments on the Coloring Stage

Chapter 6

Conclusion

Bibliography