Reducing latency cost in 2D sparse matrix partitioning models

(1)

Parallel

Computing

journal homepage: www.elsevier.com/locate/parco

Reducing

latency

cost

in

2D

sparse

matrix

partitioning

models

R

Oguz

Selvitopi

a

_,

_Cevdet

_Aykanat

a , ∗

Bilkent University, Computer Engineering Department, 06800, Ankara, Turkey

a

r

t

i

c

l

e

i

n

f

o

Article history: Received 8 March 2015 Revised 11 December 2015 Accepted 21 April 2016 Available online 23 April 2016 MSC:

00-01 99-00, Keywords:

Parallel iterative solvers Nonsymmetric linear systems Sparse matrix-vector multiplication Sparse matrix partitioning Latency overhead Bandwidth overhead

a

b

s

t

r

a

c

t

Sparsematrix partitioning is acommon technique used for improving performance of parallellineariterative solvers.Comparedto solversusedfor symmetriclinearsystems, solversfornonsymmetricsystemsoffer morepotential foraddressing differentmultiple communicationmetricsduetotheflexibilityofadoptingdifferentpartitionsontheinput andoutputvectorsofsparsematrix-vectormultiplicationoperations.Inthisregard,there existworksbasedonone-dimensional(1D)andtwo-dimensional(2D)fine-grain partition-ingmodels thateffectively addressbothbandwidth and latency costsinnonsymmetric solvers.Inthiswork,weproposetwonewmodelsbasedon2Dcheckerboardandjagged partitioning.Thesemodels aimatminimizing totalmessage count whilemaintaining a balanceoncommunicationvolume loadsofprocessors;hence, theyaddressboth band-widthand latencycosts. Weevaluateallpartitioningmodelsontwo nonsymmetric sys-temsolversimplementedusingthe widelyadopted PETSctoolkitand conductextensive experimentsusing thesesolversona modernsystem (aBlueGene/Qmachine) success-fullyscalingthemupto8Kprocessors.Alongwiththeproposedmodels,weputpractical aspectsofeight evaluatedmodels(two1D- and six2D-based)under thoroughanalysis. Tothe bestofourknowledge,thisisthefirstwork thatanalyzespracticalperformance of2Dmodelsonthisscale.Amongevaluatedmodels,themodelsthatrelyon2Djagged partitioningobtainthemostpromisingresultsbystrikingabalancebetweenminimizing bandwidthandlatencycosts.

1. Introduction

Many scientiﬁc and engineering applications necessitate solving a linear system of equations. The methods used for this purpose are categorized as direct and iterative methods. When the linear system is large and sparse, iterative methods are preferred to their direct counterparts due to their speed and ﬂexibility. Most widely used iterative methods for solving large-scale linear systems are based on Krylov subspace iterations.

A single iteration in Krylov subspace methods usually consists of one or more Sparse Matrix–Vector multiplica- tions (SpMV), dot product(s) and vector updates. In a distributed setting, SpMV operations require regular or irregular point-to-point (P2P) communication depending on the sparsity pattern of the coeﬃcient matrix in which each processor

R This work was supported by The Scientiﬁc and Technological Research Council of Turkey (TÜB ˙ITAK) under Grant EEEAG-114E545. This article is also based upon work from COST Action CA 15109 (COSTNET).

∗ _{Corresponding author. Tel.: +90 (312) 290 1213; fax: +90 (312) 266 4047.}

E-mail addresses: [email protected] (O. Selvitopi), [email protected] (C. Aykanat). http://dx.doi.org/10.1016/j.parco.2016.04.004

(2)

sends/receives messages to/from a subset of processors. On the other hand, dot products necessitate global communication that involves a reduction operation on one or a few scalars in which all processors participate. Vector updates usually do not require any communication.

A common model to capture the cost of communicating a single message of m words consists of two major components and is given by the formula t s+ mt w. Here, t sis the startup time and signifies the costs related to preparation of the message (referred to as latency cost). The t wcomponent is the time required to transfer a single word between two processors and is equal to the reciprocal bandwidth (referred to as bandwidth cost). Latency cost is proportional to the number of messages whereas bandwidth cost is proportional to the number of words. Among these two components, latency costs prove to be more vital for parallel performance as they are generally harder to avoid and improve [1] . Although both costs are reduced within time, the gap between them gradually increases in favor of bandwidth costs with an approximately 20% annual improvement over latency costs [1,2] . Furthermore, computation speeds evolve faster than communication speeds, making communication costs more critical for performance. With the latest developments in the scientific computing field, communication costs are likely to be a major factor in ranking fastest high performance computing (HPC) systems [3] . This work focuses on reducing latency costs of parallel SpMV operations in the context of nonsymmetric iterative solvers. The models and methods proposed in our work apply to matrices both with regular and irregular sparsity patterns. However, the benefits are more evident for the irregular ones, which are usually harder to exploit.

1.1. Related work

Communication requirements of iterative solvers have been of interest for more than three decades. There are numerous works on reducing communication overhead of global reduction operations in iterative solvers. Several works in this cate- gory aim at decreasing the number of global synchronization points in a single iteration of the solver by reformulating it [4–12] . Another important area of study is s -step Krylov subspace methods, which focus on further reducing the number of global synchronization points by a factor of s by performing only a single reduction once in every s iterations [8,13–17] . The performance gain of s -step methods comes at the cost of deteriorated stability and complications related to integration of preconditioners. However, these methods recently gained popularity again and promising studies address these shortcom- ings [13,16,18,19] . Another common technique is to overlap communication and computation with the aim of hiding global synchronization overheads [20,21] . Especially with the introduction of nonblocking collective constructs in the MPI-3 standard, this technique is gaining attraction [22–24] . Overlapping is commonly used for SpMV operations as well. In addition, a recent work proposed hierarchical and nested Krylov methods that constrain global reductions into smaller subsets of processors where they are cheaper [25] . Another recent work uses the idea of embedding SpMV communications into global reductions to avoid latency overhead of SpMV communications [26] .

The performance of iterative solvers is also addressed by minimizing communication costs related to parallel SpMV operations, which is also addressed by this work. There are studies that can handle sparse matrices that are well-structured and have predictable sparsity patterns, generally arising from 2D/3D problems [16,27–29] . However, the studies in this field generally focus on combinatorial models that are capable of exploiting both regular and irregular patterns to obtain a good partition of the coefficient matrix. In this regard, graph and hypergraph partitioning models are widely utilized with successful partitioning tools such as MeTiS [30] , PaToH [31] , Scotch [32] , Mondriaan [33] . These models can broadly be categorized as one-dimensional (1D) and two-dimensional (2D) partitioning models. In 1D models [30,31,34–39] , each processor is responsible for a row/column stripe, whereas in 2D models, each processor may be responsible for a submatrix block (generally defined by a subset of rows and columns) or as in the most general case, each processor may be responsible for an arbitrarily defined subset of nonzeros. Compared to 1D models, 2D models possess more freedom in partitioning the coefficient matrix. Some works on 2D models do not take the communication volume into account, however they provide an upper bound on the number of messages communicated [40–44] . On the other hand, there are 2D models that aim at reducing volume, with or without providing a bound on the maximum number of messages [33,45–50] . 2D partitioning models in the literature can further be categorized into three classes: checkerboard partitioning [47,49,50] (also known as coarse-grain partitioning), jagged partitioning [45,49] and fine-grain partitioning [46,4 8,4 9] . Notably, a recent work [50] proposes a fast 2D partitioning for scale-free graphs via a two-phase approach. This method uses 1D partitioning to reduce volume in the first phase and an efficient heuristic in the second phase to obtain a bound on the maximum number of messages. This work differs from ours as it does not explicitly minimize the message count, instead, it uses a property of the Cartesian distribution of the matrices to provide the mentioned upper bound.

1.2. Motivation and contributions

Most of the aforementioned and other existing partitioning models optimize the objective of minimizing total communication volume, which is an effort to reduce bandwidth costs. However, communication cost is a function of both bandwidth and latency, with the latter being at least as important as the former, as the current trends indicate. The need for partitioning models that also consider other cost metrics has been noted in other works [26,34] . There are a few notable works that focus on different communication cost metrics. Balancing communication volume is one of them [33,51,52] . More important and overlooked work targets multiple communication metrics including latency [53] , on which this study is based. Compared to [53] , this study concentrates more on practical aspects.

(3)

a symmetric partition where the same partition is imposed on both input and output vectors, or by using a nonsymmetric partition where a distinct partition is employed for input and output vectors. The latter alternative is more appealing and it should be adopted whenever convenient since it is more ﬂexible and allows operating in a broader search space. A nonsymmetric partition can be utilized in nonsymmetric linear system solvers such as the conjugate gradient normal equation error (CGNE) [54–56] and residual method (CGNR) [54–57] , and the standard quasi-minimal residual (QMR) [58] where the coeﬃcient matrix is square and nonsymmetric. The details of how to utilize nonsymmetric partitioning without incurring communication during linear vector update operations are explained for CGNE and CGNR solvers in Appendix E . We constrain ourselves to nonsymmetric square matrices in this work, but all proposed models apply to certain iterative methods that involve rectangular matrices as well.

Our work is based on [53] , which also achieves a nonsymmetric partition through a two-phase methodology with a model called communication hypergraph . Our contributions and differences from [53] are as follows:

(i) We propose two new partitioning models for reducing latency which are based on 2D checkerboard and jagged partitioning. These models aim at reducing latency costs usually at the expense of increasing bandwidth costs. Similar models have been investigated [48,53] , but they are based on 1D and 2D ﬁne-grain models.

(ii) All proposed and investigated partitioning models are realized on two iterative methods CGNE and CGNR implemented with the widely adopted PETSc toolkit [59] . We describe how to obtain a nonsymmetric partition on the vectors utilized in these solvers using the communication hypergraph model and thoroughly evaluate partitioning requirements of them via experiments. In this manner, we differ from [53] , in which the proposed methods were tested with a code developed by the authors that contains only parallel SpMV computations.

(iii) We conduct extensive experiments for the mentioned iterative solvers. Although better suited to large-scale systems, the communication hypergraph model was originally tested only for 24 processors on a local cluster and only for 1D partitioning. In this work, we test and show this model’s validity on a modern HPC system (a BlueGene/Q machine) successfully scaling up to 8K processors.

(iv) We compare one 1D-based, three 2D-based models (checkerboard, jagged and ﬁne-grain), and these four models’ latency-improved versions, making a total of eight partitioning models. Among these, the 2D models are somewhat overlooked in the literature, never being tested in a realistic setting on a large-scale system. Although their theoretical merits are of no question, their practical merits are not appreciated. In our experiments, we put these methods’ practical aspects into a thorough analysis. The experiments show surprising results with 2D jagged partitioning and its latency-improved version performing better in the majority of the matrices.

The rest of the paper is organized as follows: Section 2.1 and 2.2 describe the proposed partitioning models to reduce the latency overhead of checkerboard and jagged models, respectively. These two sections describe basic checkerboard and jagged models as well. We also briefly review the fine-grain model and its latency-improved version in Section 2.3 , since they are included in our experiments. We compare communication properties of all partitioning models in Section 3 . Section 4 contains the results and discussions of the extensive large-scale experimental evaluation of eight partitioning models on a BlueGene/Q system with 28 matrices. Our experiments range from 256 to 8192 processors. Final remarks are given in Section 5 . The presentation of the paper relies on the assumption that the reader is already familiar with the communication hypergraph model for 1D partitioning [53] . The unfamiliar reader can find a detailed background with explanatory examples about the communication hypergraph model in Appendix D .

2. Reducinglatencycostin2Dpartitioningmodels

2D models work at a ﬁner level of partitioning granularity compared to 1D models by allowing nonzeros of a single row/column to be assigned to more than one processor. In this manner, they possess more ﬂexibility in partitioning since they do not constrain the search space by assigning all nonzeros of a row/column to the same processor. This leads them to exploit existing partitioning tools better. 1D models necessitate a row-parallel/column-parallel algorithm ( Appendix B ), whereas 2D models necessitate a row-column-parallel algorithm. The fundamental difference between them is that the former necessitates a single communication stage in parallel SpMV operations, which is either pre-communication if the matrix is partitioned row wise, or post-communication if the matrix is partitioned column wise, whereas the latter necessitates two distinct communication stages: one before the local SpMV computations (on the input vector in a pre -communication stage via expand communication tasks) and one after the local SpMV computations (on the output vector in a post -communication stage via fold communication tasks), For more details on implementation issues regarding 2D partitioning, see [49] .

In this section, we describe how to reduce latency overhead of 2D checkerboard and jagged partitioning models. For these models, we assume a K ₌P _{× Q virtual} processor mesh. A simple example depicting a matrix partitioned with these two models is given in Fig. 1 . Compared to their original counterparts, the proposed models are likely to increase the bandwidth costs by increasing communication volume. However, this issue is addressed by maintaining a balance on this metric. We also brieﬂy review the 2D ﬁne-grain model and compare it to checkerboard and jagged models as we evaluate it in our

(4)

(a) Checkerboard

(b) Jagged

Fig. 1. A sample of 2D checkerboard and jagged partitionings on a 16 = 4 × 4 virtual processor mesh.

experiments. Note that a model to reduce latency overhead of ﬁne-grain partitioning has already been investigated [48] . All proposed models are discussed on parallel w = Ap. However, the arguments are also valid for z =A T_{r as communication} requirements of w = Ap and z = A T_{r are}_{the dual of each other and minimizing the}_{objective function in w}_{= Ap is equivalent} to minimizing the objective function in z =A T_r.

2.1. Checkerboard partitioning

Checkerboard partitioning is a two-phase process in which each phase utilizes a 1D partitioning model. The second phase depends on the first phase by using information obtained in the first phase to determine multiple vertex weights utilized in the second phase. For the rest of the section, we assume a 1D row wise partition in the first phase and a 1D column wise partition in the second phase. For the arguments made in this section, an analogous discussion holds for the dual scheme as well.

Consider a K ₌P _{× Q processor} mesh and an n _{× n} square matrix A . In the ﬁrst phase, the column-net hypergraph model H R =

(

V R, N C

)

is used to obtain a P -way partition

R =

{

V 1 , . . . , V P

}

, which induces a P -way row wise partition

{

R1 ,...,RP

}

of A . Here, Rα denotes the set of rows that correspond to vertices in Vα, for

α

=1 ,...,P . The rows in Rα form a row stripe A _α whose size is n _α × n, with n _α=

|

Vα

|

. At the end of the ﬁrst phase, the assignment of rows of A is determined by associating row stripe A _αwith the Q processors in row

α

of the processor mesh, P _α,∗.

In the second phase, the row-net hypergraph model HC =

(

VC,NR

)

is used to obtain a Q -way partition

C =

{

V1 ,...VQ

}

, which induces a Q -way column wise partition

{

C1 ,...,CQ

}

of A . Here, Cβ denotes the set of columns that correspond to vertices in _V_β_, for

β

₌ 1 _,_._._._,Q. The columns in _C_β form column stripe A _β whose size is n _{× n}_β, with n _β₌

|

_V_β

|

. At the end of the second phase, we complete the assignment of columns of A and actually obtain a Q -way column wise partition of each row stripe A _α, forming Q submatrix blocks A _α_,₁,...,A _α_,Q. Hence,

(

_R,

C

)

deﬁnes an assignment for rows and columns of A where processor P _α_,_β is responsible for the set of rows in _R_α and the set of columns in _C_β. In other words, nonzero a ijis assigned to P _α, β if r i ∈ R α and c j ∈ C _β.

This two-phase process aims at minimizing total communication volume for the pre- and post-communication stages in the first and second phases, respectively, while maintaining computational load balance [47,49] . A notable property of checkerboard partitioning is that it confines the communication in expand and fold operations to the processors in the same column and row of the processor mesh, respectively. It achieves a Cartesian distribution of the matrix, in which each processor owns an intersection of a subset of rows and a subset of columns. A row (column) is said to be coherent if the nonzeros of this row (column) generate partial results for (require) the same w -vector ( x -vector) element. Consider a row r i that is assigned to Rα at the end of the first phase. The coherency of this row is preserved at this point as it is modeled by

v

i in HR. In the second phase, the nonzeros of this row can be distributed among Q processors in row

α

of the processor mesh, which is also the case for all other rows in _R_α. Hence, row coherency is respected in a coarse level by assigning nonzeros of rows in R α to the processors in the same row of the processor mesh, P _α, ∗. A coarse level here implies that the nonzeros belonging to a subset of rows are distributed among the same subset of processors (in this case among P processors in a speciﬁc column of the processor mesh). This provides the upper bound Q _{− 1} on the number of messages communicated in the post-communication stage as there are Q processors in row

α

of the processor mesh. With a similar argument, column coherency is also respected in a coarse level by assigning nonzeros of columns in Cβ to the processors in the same column of the processor mesh, P _∗,_β. This provides the upper bound P _{− 1} on the number of messages communicated in the pre-communication stage as there are P processors in column

β

of the processor mesh. Hence, the maximum number of messages handled by a single processor is bounded by P +Q − 2. In checkerboard partitioning, the second phase is performed with P -way multi-constraint [60,61] partitioning to balance computational loads.

(5)

Fig. 2. Formation of the communication matrix for the third column of the processor mesh ( β= 3 ) to summarize expand operations in the pre- communication stage.

2.1.1. Communication matrices

The expand communication tasks in the checkerboard model are bound to distinct columns of the processor mesh. For this reason, to summarize the communication requirements of expand tasks in the pre-communication stage, we form Q distinct communication matrices. Let p C_β denote the vector elements that necessitate communication in column

β

of the processor mesh, for 1 ≤

β

≤ Q. Note that at most P processors can participate in communicating p C_β, conﬁned to the set of processors in P _∗,_β. We summarize the communication operations in column

β

of the mesh with the P ×

|

p C_β

|

communication matrix M _β. Rows of M _β correspond to processors in P _∗,_β and columns of M _β correspond to expand tasks on p C_β. There exists a nonzero m _αj ∈ M β if and only if there is a non-empty column segment in submatrix A α, β at the respective column. The nonzeros in column j of M _β represent the set of processors that participate in communicating p C_β[ j] , which is a subset of processors in P _∗,_β. The nonzeros in row

α

of M _β represent the expand tasks processor P _α_,_β participates in. Note that vector elements corresponding to internal columns (those which have a single non-empty column segment in P row stripes) do not incur communication and they are not included in M _β. These vector elements should be assigned to the respective processors to avoid unnecessary communication. An example in Fig. 2 is presented to illustrate the formation of communication matrix M _β for the third column of the processor mesh (

β

₌3 ) to summarize the expand tasks. There are four input vector elements that necessitate communication (denoted by p C_β) and they form columns of M β. For instance, the ﬁrst column of M _β has nonzeros corresponding to processors P _1,₃, P _{2, 3}and P _{4, 3}since in matrix A , there exist nonzero column segments in the respective submatrix blocks. Two vector elements–second and ﬁfth–corresponding to internal columns do not incur communication and they are not included in M _β; these elements should be assigned to P _3,₃and P _2,₃, respectively, to avoid unnecessary communication.

= 3 ) to summarize the fold tasks. The dual of the discussions made for M _β in Fig. 2 follows also for M _α.

We form a total of P ₊Q communication matrices to summarize communication requirements of checkerboard parti- tioning. We can address the communication requirements of both pre- and post-communication stages independently since communication operations in these stages are bound to distinct columns and rows of the processor mesh, respectively. For- mation of these communication matrices is illustrated in Fig. 4 .

(6)

Fig. 3. Formation of the communication matrix for the third row of the processor mesh ( α= 3 ) to summarize fold operations in the post-communication stage.

2.1.2. Formation of communication hypergraphs

We form Q hypergraphs from Q communication matrices for the pre-communication stage. For each M _β, a communication hypergraph HCM

β is formed using the row-net hypergraph model, for 1 ≤

β

≤ Q. The net set of HCMβ represents the

processors in column

β

( P _∗,_β) of the processor mesh and the vertex set of _HCM

α

( P _α, ∗) of the processor mesh and the vertex set of HCM

α represents the fold tasks on w C_α. Hence, there are Q nets and

|

w C_α

|

β that represents

P _α_,_β. The connectivity set of this net contains the parts that correspond to the processors each of which send a message to P _α_,_β. The size of this set can be at most P since _HCM

β is partitioned into P , bounding the number of messages sent/received

by a single processor by P − 1 in the pre-communication stage. Hence, this feature of original checkerboard partitioning is still respected. In partitioning H CM

β , the partitioning objective of minimizing cutsize corresponds to minimizing the number

of messages communicated in column

β

of the processor mesh in the pre-communication stage, and the partitioning constraint of maintaining balance among part weights corresponds to obtaining a balance on the communication volume loads of these processors.

Similarly, we partition HCM

α to get a Q -way partition

_α=

{

V1,V2,...,VQ

}

and obtain a distribution of fold tasks among Q processors in row

α

of the processor mesh for the post-communication stage, for 1 ≤

α

≤ P . H CM

α is partitioned into Q ,

bounding the number of messages sent/received by a single processor by Q − 1 in the post-communication stage. Hence, this feature of original checkerboard partitioning is also respected. The partitioning objective and the balancing constraint are identical to those in partitioning _HCM

β .

The formed P ₊Q hypergraphs can be independently partitioned since they do not depend on each other in any way. The maximum number of messages handled by a single processor is still P + Q − 2 as in the original checkerboard partitioning. As a result, we improve latency costs in each row/column of the processor mesh independently while respecting basic characteristics of checkerboard partitioning.

(7)

Fig. 4. Minimizing latency cost in checkerboard partitioning model.

2.2. Jagged partitioning

Jagged partitioning consists of two phases. The first phase consists of a single 1D partitioning model, whereas the second phase consists of multiple, independent and same type of 1D partitioning models. The second phase depends on the first phase by using the partitioning information obtained in the first phase to determine vertex sets and vertex weights for the models formed in the second phase. For the rest of the section, we assume a 1D row wise partition in the first phase and a 1D column wise partition in the second phase. For the arguments made in this section, an analogous discussion holds for the dual scheme as well.

Assume a K = P × Q processor mesh and an n × n square matrix A . The ﬁrst phase of jagged partitioning is exactly the same as the ﬁrst phase of checkerboard partitioning: a column-net hypergraph model HR =

(

VR,NC

)

is used to obtain a P -way partition

_R =

{

V1,...,VP

}

, which induces a row wise partition

{

)

+

(

Q − 1

)

=K − 1 in jagged partitioning.

2.2.1. Communication matrices

The expand communication tasks in the jagged model are not bound to distinct columns of the processor mesh. For this reason, we form a single communication matrix to summarize the communication requirements of expand tasks in the pre- communication stage. Let p Cdenote the vector elements that necessitate communication. We summarize the communication operations with the

(

K = P × Q

)

×

|

p C

|

communication matrix M R. Rows of M Rcorrespond to all processors and columns of M R correspond to expand tasks on p C. Consider two vector elements owned by the same processor. Although these two elements can be communicated by at most P processors (each of which belongs to a distinct row of the processor mesh), they do not necessarily need to be conﬁned to the same column of the processor mesh. For this reason, we include all processors in M R. The formation of M R essentially resembles that of 1D row-parallel w =Ap(D.1) : m kj = 0 if and only if column c _jhas a nonzero column segment in k th row stripe of A . Nonzeros in column c _j_∈ M Rrepresent the set of processors that participate in communicating p C[ j ] and nonzeros in row r k ∈ M R represent the expand tasks P k participates in. The difference is, however, as a consequence of jagged partitioning, each column in M R can have at most P nonzeros instead of K . As usual, vector elements corresponding to internal columns are not included in M R since they do not necessitate communication.

In contrast to expand tasks, the fold communication tasks are bound to distinct rows of the processor mesh. The communication requirements of fold tasks in the post-communication stage are thus summarized by P distinct communication matrices, M _α, for 1 ≤

α

≤ P . The formation of these matrices is the same as the formation of matrices for summarizing communication requirements of fold tasks in checkerboard partitioning. The semantics of nonzeros in rows and columns of M _α are identical to those of M _α in checkerboard partitioning.

We form a total of P ₊1 communication matrices to summarize communication requirements of jagged partitioning. We can address the communication requirements of the post-communication stage independently using P matrices. However, since communication operations in the pre-communication stage are not bound to the processors in the same column of the processor mesh, the expand tasks are represented in a single matrix with all K processors. Formation of these communication matrices is illustrated in Fig. 5 .

2.2.2. Formation of the communication hypergraphs

For the pre-communication stage, we form a single communication hypergraph HCM

R from communication matrix M R using the row-net hypergraph model. The net set of _HCM

R corresponds to K processors and the vertex set of HCMR corresponds to expand tasks on p C. Hence, there are K nets and | p C| vertices in H CMR . A vertex

v

jin H CMR is connected by the set of nets corresponding to processors that communicate the respective vector element p C[ j ]. Note that

v

jcan be connected by at most P nets, i.e., d _j_{≤ P}.

For the post-communication stage, we form P communication hypergraphs from P communication matrices. The formation of these communication hypergraphs is actually the same as for checkerboard partitioning. A communication hypergraph _HCM

α is formed using the column-net hypergraph model for matrix M α, for 1 _≤

α

_{≤ P}. The semantics of these hypergraphs are also the same: the net set of H CM

α represents the processors in row

α

of the processor mesh, P α, ∗, and the vertex set of HCM

α represents the fold tasks on w C_α. Similarly, there are

|

w C

|

vertices and K nets in all communication hypergraphs.

In total, we form P + 1 communication hypergraphs from P + 1 communication matrices. This process is illustrated in Fig. 5 .

2.2.3. Partitioning of the communication hypergraphs Communication hypergraph _HCM

R is partitioned to obtain a K -way partition

R =

{

V1 ,V2 ,...,VK

}

to induce a distribution of expand tasks in the pre-communication stage. The responsibility of the expand tasks represented by the vertices in V k is assigned to processor P k. Consider a net n k in HCMR that represents P k. The connectivity set of this net corresponds to processors each of which send a message to P k. Since HCMR is partitioned into K parts, the size of this set can be at most

(9)

Fig. 5. Minimizing latency cost in jagged partitioning model.

K . Thus, the maximum number of messages sent/received by a single processor is K − 1 in the pre-communication stage (note that it is K _{− Q in} the original jagged partitioning). In partitioning _HCM

R , the partitioning objective of minimizing cutsize corresponds to minimizing the number of messages communicated in the pre-communication stage, and the partitioning constraint of maintaining balance among part weights corresponds to obtaining a balance on the communication volume loads of K processors.

Communication hypergraph H CM

α is partitioned to obtain a Q -way partition

α=

{

V 1 , V 2 , . . . , V Q

}

, for 1 ≤

α

≤ P , to induce a distribution of fold tasks in the post-communication stage among Q processors in row

α

of the processor mesh. The partitioning of these communication hypergraphs have the same semantics with those in checkerboard partitioning.

The formed P + 1 hypergraphs can be partitioned independently, since they do not depend on each other in any way. Note that the maximum number of messages handled by a single processor is slightly increased from K − 1 to K +Q − 2, which is caused by the partitioning for distributing tasks in the pre-communication stage. However, the cases beyond this are expected to be rare as a good partitioning tool will avoid them. As a result, we improve latency costs while respecting most characteristics of the original jagged partitioning.

2.3. Fine-grain (nonzero-based) partitioning

We briefly review fundamental properties of fine-grain partitioning. This model is first proposed in [46] and its communication costs are improved in a later work using the CHG model [48] . Both of these models are included in our experiments. The fine-grain model forms a hypergraph in which vertices represent nonzeros of A and nets represent rows and columns of A . Nets corresponding to columns of A capture the communication volume incurred in the pre-communication stage, while nets corresponding to rows of A capture the communication volume incurred in the post-communication stage. Par- titioning this hypergraph into K parts induces a distribution of nonzeros of the matrix among K processors. This leads to a completely arbitrary distribution of fine-grain computational tasks on a nonzero basis, where each vertex signifies a scalar

(10)

Fig. 6. Minimizing latency cost in ﬁne-grain partitioning model.

multiplication with a single nonzero. Therefore, the ﬁne-grain model respects neither row nor column coherence. In this aspect, it accommodates the highest level of ﬂexibility by not restraining the computational tasks to coarser levels (i.e., nonzeros of a row and/or column) compared to checkerboard, jagged and 1D models. As a result, a processor may communicate with any other processor. Thus, the maximum number of messages handled by a single processor is K − 1 in the pre-communication stage and is also K _{− 1} in the post-communication stage, summing up to a total of 2

(

K _{− 1}

)

messages. The ﬁne-grain model correctly minimizes the total communication volume while maintaining computational load balance. For more details, see [46,49] .

The approach to improve communication requirements of the fine-grain model [48] consists of forming two communication matrices: one matrix for summarizing communication operations in the pre-communication stage and one matrix for summarizing communication operations in the post-communication stage. Compare this to the formation of communication matrices in the checkerboard and jagged models. We address the communication requirements in the checkerboard model by separately forming a total of P + Q communication matrices since they are confined to distinct columns and rows of the processor mesh. Also in the jagged model, we form P communication matrices for the post-communication stage. These are not valid for the fine-grain model since the resulting partition is arbitrarily defined on the nonzeros of the matrix and any of the K processors may communicate with any other processor in both pre- and post-communication stages. For this reason, the whole set of processors and all vector elements that necessitate communication are included in two communication matrices. The process for reducing the communication costs of the fine-grain model is illustrated in Fig. 6 .

3. Comparisonofpartitioningmodels

We compare the basic properties of the investigated partitioning models to aid the discussions of the results in experiments. It is assumed that P ₌ Q ₌√ K in the P _{× Q} processor mesh for ease of presentation.

Fig. 7 compares the partitioning models in terms of ﬂexibility they provide during partitioning. 1D models lie at the left extreme of the spectrum since they represent each row/column of the coeﬃcient matrix with a single distinct vertex as an atomic task. This leads to the assignment of all nonzeros of a row/column to an individual processor as a whole. Hence, 1D

(11)

Fig. 7. Comparison of models in terms of partitioning ﬂexibility. Table 1

Comparison of partitioning models in terms of latency overhead and partitioning granularity.

Number of messages Comm. stage 1D 2D partitioning models

Checkerboard Jagged Fine-grain

pre K − 1 √ K − 1 K −√ K K − 1

max post N/A √ K − 1 √ K − 1 K − 1

overall K − 1 2(√ K − 1) K − 1 2(K − 1)

total overall K(K − 1) 2 K(√ K − 1) K(K − 1) 2 K(K − 1)

Row/column coherency Either entire row or Coarse-level on both Coarse-level on either None (Partitioning granularity) Entire column Rows and columns Rows or columns

models respect row/column coherency at the individual processor level. The fine-grain model lies at the right extreme of the spectrum since in this model each vertex represents an atomic task corresponding to a single nonzero of the matrix. This is the most flexible and the finest level of partitioning granularity available, where neither row nor column coherency is preserved. So, in theory, nonzeros of a row/column can be distributed among K processors. Between these two extremes, the checkerboard and jagged models strive to distribute nonzeros of a row/column among a subset of processors. By doing so, they obtain a coarse-level row/column coherency at the processor mesh’s row/column level. The checkerboard model leverages a coarse-level coherency in both partitioning phases whereas the jagged model leverages it in a single partitioning phase.

Among these partitioning models, 2D models are expected to achieve lower bandwidth costs compared to 1D models since they offer more ﬂexibility in optimizing the objective of minimizing total communication volume. Among 2D models, ﬁne-grain is expected to obtain the best results in terms of bandwidth costs, whereas checkerboard is likely to obtain the worst. The metrics related to latency costs (as upper bounds on the maximum number of messages) are presented in Table 1 . Checkerboard has the lowest overhead with 2

(

√ K − 1

)

maximum messages per processor, whereas ﬁne-grain has the highest overhead with 2

(

K _{− 1}

)

. Although 1D and jagged models have the same upper bound K _{− 1}_, in practice jagged partitioning is more likely to achieve better results in this metric since it restricts the number of messages in both stages of communication.

and

FG+CHG

. These aim to reduce total message count and maximum volume. Hence, we evaluate the merit of reducing a latency-related cost metric in partitioning. We use PaToH [31,61] to partition the computational hypergraphs formed in the ﬁrst phase of all models and the communication hypergraphs formed in the second phase of the

CHG

-enhanced models.

We implemented CGNE and CGNR solvers via the PETSc toolkit [59] and utilized the mentioned models for partitioning the coeﬃcient matrix and vectors in these solvers. Since obtained runtime results for both solvers are similar, we only present speedup results corresponding to CGNR. Note that the metrics regarding partitioning models ( Section 4.1 ) such as total volume, message count, etc. are the same for both solvers as they contain the same type of communication operations.

(12)

Table 2

Test matrices and their properties.

Nonzeros

Number of per row per column

Matrix rows/cols nonzeros avg min max min max

venkat01 62,424 1,717,792 27 .52 16 44 16 44 mc2depi 525,825 2,100,225 3 .99 2 4 2 4 poisson3Db 85,623 2,374,949 27 .74 6 145 6 145 thermomech_dK 204,316 2,846,228 13 .93 7 20 7 20 stomach 213,360 3,021,648 14 .16 7 19 6 22 FEM_3D_thermal2 147,900 3,489,300 23 .59 12 27 12 27 laminar_duct3D 67,173 3,833,077 57 .06 1 89 3 89 xenon2 157,464 3,866,688 24 .56 1 27 1 27 iChem_Jacobian 274,087 4,137,369 15 .10 5 17 5 17 torso3 259,156 4,429,042 17 .09 7 22 6 21 tmt_unsym 917,825 4,584,801 5 .00 3 5 3 5 t2em 921,632 4,590,832 4 .98 1 5 1 5 Hamrle3 1,447,360 5,514,242 3 .81 2 6 2 9 largebasis 440,020 5,560,100 12 .64 4 14 4 14 Chevron4 711,450 6,376,412 8 .96 2 9 2 9 cage13 445,315 7,479,343 16 .80 3 39 3 39 PR02R 161,070 8,185,136 50 .82 1 92 5 88 atmosmodl 1,489,752 10,319,760 6 .93 4 7 4 7 kim2 456,976 11,330,020 24 .79 4 25 5 25 memchip 2,707,524 14,810,202 5 .47 2 27 1 27 Freescale1 3,428,755 18,920,347 5 .52 1 27 1 25 circuit5M_dc 3,523,317 19,194,193 5 .45 1 27 1 25 fem_hifreq_circuit 491,100 20,239,237 41 .21 12 110 12 110 rajat31 4,690,002 20,316,253 4 .33 1 1252 1 1252 CoupCons3D 416,800 22,322,336 53 .56 20 76 20 76 Transport 1,602,111 23,500,731 14 .67 5 15 5 15 ML_Laplace 377,002 27,689,972 73 .45 26 74 26 74 RM07R 381,689 37,464,962 98 .16 1 295 1 245

All models are tested with 28 matrices chosen from the UFL matrix collection [63] . The properties of these matrices are presented in Table 2 . The evaluated models are tested on a BlueGene/Q system with varying number of processors K ∈ {256, 512, 1024, 2048, 4096, 8192}. A node on this system consists of 16 cores (single PowerPC A2 processor) with 1.6 GHz clock frequency and 16 GB memory. The nodes are connected by a 5D torus chip-to-chip network. We only consider the case of strong scaling.

and

JGD+CHG

are the same since they use the same heuristic to distribute expand tasks. Total communication volume of two successive K values (512 and 1024, 2048 and 4096, etc.) are the same for the models that are based on checkerboard and jagged partitioning since these models have the same number of processor columns in the corresponding processor mesh. For instance, at K =512 and K =1024 , the processor meshes are of sizes 32 × 16 and 32 × 32, respectively. We also present the average speedup values obtained by models to give an idea about the eﬃciency. A more detailed and accurate discussion with performance proﬁles and speedup curves can be found in the next section.

(13)

K Model Expand Fold Sum Expand Fold Sum Expand Fold Sum Expand Fold Sum Speedup 1D+BP 2464 – 2464 17 .9 – 17 .9 164470 – 164470 728 – 728 143 1D+CHG 1640 – 1640 13 .5 – 13 .5 245514 – 245514 1234 – 1234 148 CKBD+BP 414 1728 2141 4 .8 6 .6 11 .4 38476 145554 184029 367 759 1125 142 256 CKBD+CHG 414 1423 1836 4 .8 5 .8 10 .5 38476 214657 253133 367 1176 1542 138 JGD+BP 850 1266 2117 10 .0 4 .9 14 .9 38476 112712 151188 311 599 910 158 JGD+CHG 850 1131 1982 10 .0 5 .0 15 .0 38476 165081 203557 311 959 1270 153 FG+BP 1959 1751 3710 15 .5 6 .8 22 .3 107993 48359 156352 546 275 821 150 FG+CHG 1513 1257 2770 12 .5 4 .6 17 .1 173449 74672 248121 847 377 1223 150 1D+BP 5683 – 5683 21 .6 – 21 .6 226127 – 226127 513 – 513 236 1D+CHG 3496 – 3496 15 .5 – 15 .5 327581 – 327581 829 – 829 231 CKBD+BP 1019 3590 4608 6 .4 6 .8 13 .1 57434 195114 252548 264 530 794 225 512 CKBD+CHG 1019 2885 3904 6 .4 6 .6 13 .0 57434 280548 337981 264 804 1067 220 JGD+BP 2088 2604 4692 11 .7 5 .6 17 .3 57434 147488 204922 220 405 626 247 JGD+CHG 2088 2267 4355 11 .7 4 .5 16 .2 57434 212002 269436 220 609 829 238 FG+BP 4261 3784 8045 17 .7 8 .0 25 .7 144 96 8 70492 215460 374 205 580 228 FG+CHG 3165 2622 5787 13 .8 4 .8 18 .6 228987 106423 335410 566 273 839 229 1D+BP 13201 – 13201 27 .0 – 27 .0 314109 – 314109 359 – 359 294 1D+CHG 7451 – 7451 16 .9 – 16 .9 441340 – 441340 568 – 568 343 CKBD+BP 1587 9624 11211 5 .9 9 .1 15 .0 57434 288909 346343 167 399 567 320 1024 CKBD+CHG 1587 6897 8484 5 .9 7 .1 13 .0 57434 404550 461984 167 534 701 320 JGD+BP 3571 6775 10346 11 .7 7 .4 19 .1 57434 223665 281099 138 314 452 354 JGD+CHG 3571 5266 8837 11 .7 5 .2 16 .9 57434 314303 371737 138 430 568 341 FG+BP 9286 8141 17427 20 .6 8 .5 29 .1 192165 103683 295848 252 153 405 318 FG+CHG 6537 5420 11957 15 .1 6 .1 21 .3 297236 152370 449606 376 201 577 327 1D+BP 30651 – 30651 30 .9 – 30 .9 437009 – 437009 256 – 256 355 1D+CHG 16028 – 16028 19 .0 – 19 .0 591207 – 591207 389 – 389 450 CKBD+BP 3885 21008 24892 7 .3 9 .4 16 .7 82863 395862 478725 116 277 393 421 2048 CKBD+CHG 3885 14275 18159 7 .3 8 .5 15 .8 82863 533056 615920 116 362 478 429 JGD+BP 8301 14621 22921 14 .1 7 .4 21 .5 82863 297989 380852 99 214 313 468 JGD+CHG 8301 10833 19134 14 .1 5 .6 19 .6 82863 404389 487252 99 284 383 457 FG+BP 20050 17669 37719 22 .3 10 .3 32 .6 251398 154339 405737 171 116 287 412 FG+CHG 13494 11272 24766 16 .3 8 .2 24 .5 379611 221040 600651 243 152 395 440 1D+BP 71313 – 71313 37 .0 – 37 .0 615662 – 615662 185 – 185 372 1D+CHG 36563 – 36563 21 .8 – 21 .8 810012 – 810012 267 – 267 530 CKBD+BP 6122 55186 61308 6 .7 13 .5 20 .3 82863 579257 662121 72 209 280 511 4096 CKBD+CHG 6122 32230 38352 6 .7 8 .7 15 .4 82863 747663 830526 72 252 324 556 JGD+BP 13440 38102 51543 12 .9 10 .4 23 .2 82863 449614 532477 62 163 225 584 JGD+CHG 13440 24711 38151 12 .9 6 .7 19 .5 82863 584 4 42 667306 62 203 264 586 FG+BP 43084 38239 81323 24 .8 11 .0 35 .7 326646 228666 555312 114 87 201 516 FG+CHG 28683 24606 53289 18 .0 8 .8 26 .7 486581 323623 810204 160 117 278 554 1D+BP 160507 – 160507 42 .9 – 42 .9 871090 – 871090 137 – 137 392 1D+CHG 81592 – 81592 26 .0 – 26 .0 1040719 – 1040719 175 – 175 560 CKBD+BP 14589 125660 140249 7 .9 15 .7 23 .6 115948 814437 930384 49 149 198 593 8192 CKBD+CHG 14589 70942 85531 7 .9 10 .7 18 .6 115948 1003817 1119764 49 171 220 667 JGD+BP 28838 85171 114009 13 .9 11 .4 25 .2 115948 618294 734242 43 114 157 683 JGD+CHG 28838 52882 81720 13 .9 6 .5 20 .3 115948 769539 885486 43 137 180 701 FG+BP 90483 81319 171801 26 .3 13 .3 39 .6 430725 327017 757741 77 65 142 591 FG+CHG 59692 52380 112072 19 .8 10 .1 29 .9 572597 432085 1004682 99 84 183 663 The bold values in ﬁve columns total/maximum number of messages (Sum), total/maximum communication volume (Sum) and Speedup indicate the best values obtained by the respective model at a speciﬁc K .

The major factors that determine overall latency and bandwidth costs are the maximum number of messages and the maximum volume of communication handled by a single processor, respectively. As seen in Table 3 , with increasing number of processors, the maximum number of messages increases sharply, whereas maximum volume decreases despite the increase in total volume. For instance, for

1D+BP

, when K increases from 256 to 8192 processors, the maximum number of messages increases from 17.9 to 42.9, whereas maximum volume decreases from 728 to 137 words, on average. Hence, the latency overhead on average increases by a factor of 2.4, whereas the bandwidth overhead on average decreases by a factor of 5.3. Moreover, the total message count increases more sharply compared to total volume: 65.1 times versus 5.3 times. These ﬁgures imply that with increasing number of processors, latency costs steadily become more important than bandwidth costs in determining overall communication cost of parallel SpMV operations. Hence, reducing latency costs should pay off with improved scalability, as will be seen in the following section. Observe that similar arguments hold for other partitioning models as well.

(14)

Table 4

Comparison of partitioning models with communication hypergraphs normalized with respect to their baseline counterparts averaged over all matrices for each K .

Number of messages Communication volume

is often the worst in terms of total and maximum number of messages because it causes increases in these metrics in order to reduce total volume.

To aid the assessment of beneﬁts of the communication hypergraph, we present Table 4 . In this table, each latency- improved (

CHG

-enhanced) model’s performance metrics are normalized with respect to those of their baselines. In other words, the results of

1D+CHG

are normalized with respect to those of

1D+BP

, the results of

CKBD+CHG

are normalized with respect to those of

CKBD+BP

, etc. The normalization is performed on a matrix basis and the averages of these normalized values over 28 matrices are given separately for each K . As seen from the table,

CHG

-enhanced models improve the total message count drastically, as minimizing this metric is one of the main objectives in these models. For example at 2048 processors,

1D+CHG

achieves 36% improvement over

1D+BP

,

CKBD+CHG

CKBD+BP

,

JGD+CHG

JGD+BP

, and

FG+CHG

FG+BP

. However, this comes at the cost of increased total volume. Again at 2048 processors,

1D+CHG

increases the total volume by 40%,

CKBD+CHG

by 33%,

JGD+CHG

by 30%, and

FG+CHG

by 49%. The crucial observation, however, is that the message count improvements of partitioning models that rely on the communication hypergraph model tend to increase with increasing number of processors. For example,

1D+CHG

achieves a 24% improvement over

1D+BP

at K =256 in total message count and this improvement be- comes 37% at K = 8192 . This improvement increases from 11% to 33%, 2% to 24%, and 14% to 24% for

CKBD+CHG

,

JGD+CHG

(15)

32.43 33.79 21.68 114.22

CHG 1.68 0.98 1.13 1.98

and

FG+CHG

, respectively. This implies that the beneﬁts obtained using the communication hypergraph model are more prominent at higher processor counts. Note that for

CKBD+CHG

and

JGD+CHG

JGD+CHG

and

FG+CHG

, respectively, compared to their baseline counterparts. This is an important beneﬁt of the communication hypergraph model since it strives for balancing volume.

The sequential partitioning times of the evaluated models are given in Table 5 averaged over all matrices and K values. The CHG times (indicated via

CHG

row) include only the partitioning times of the communication hypergraphs formed for the respective model. As expected, the ﬁne-grain model has the highest partitioning time as the hypergraphs formed in this model are typically larger. The partitioning times of the communication hypergraphs are quite low compared to the respective original partitionings since they are small as they contain only the vertices that correspond to the vector elements that necessitate communication. Note that

CKBD+CHG

,

JGD+CHG

and

FG+CHG

form a number of communication hypergraphs that can independently be partitioned, hence the partitioning of them can easily be parallelized. A more healthy comparison of partitioning overhead for 2D models can be found in [49] .

, respectively.

1D+BP

CKBD+CHG

. These ﬁgures show that the communication hypergraph proves to be a valuable method for achieving scalability.