• Sonuç bulunamadı

Reducing latency cost in 2D sparse matrix partitioning models

N/A
N/A
Protected

Academic year: 2021

Share "Reducing latency cost in 2D sparse matrix partitioning models"

Copied!
24
0
0

Yükleniyor.... (view fulltext now)

Tam metin

(1)

Parallel

Computing

journal homepage: www.elsevier.com/locate/parco

Reducing

latency

cost

in

2D

sparse

matrix

partitioning

models

R

Oguz

Selvitopi

a

,

Cevdet

Aykanat

a , ∗

Bilkent University, Computer Engineering Department, 06800, Ankara, Turkey

a

r

t

i

c

l

e

i

n

f

o

Article history: Received 8 March 2015 Revised 11 December 2015 Accepted 21 April 2016 Available online 23 April 2016 MSC:

00-01 99-00, Keywords:

Parallel iterative solvers Nonsymmetric linear systems Sparse matrix-vector multiplication Sparse matrix partitioning Latency overhead Bandwidth overhead

a

b

s

t

r

a

c

t

Sparsematrix partitioning is acommon technique used for improving performance of parallellineariterative solvers.Comparedto solversusedfor symmetriclinearsystems, solversfornonsymmetricsystemsoffer morepotential foraddressing differentmultiple communicationmetricsduetotheflexibilityofadoptingdifferentpartitionsontheinput andoutputvectorsofsparsematrix-vectormultiplicationoperations.Inthisregard,there existworksbasedonone-dimensional(1D)andtwo-dimensional(2D)fine-grain partition-ingmodels thateffectively addressbothbandwidth and latency costsinnonsymmetric solvers.Inthiswork,weproposetwonewmodelsbasedon2Dcheckerboardandjagged partitioning.Thesemodels aimatminimizing totalmessage count whilemaintaining a balanceoncommunicationvolume loadsofprocessors;hence, theyaddressboth band-widthand latencycosts. Weevaluateallpartitioningmodelsontwo nonsymmetric sys-temsolversimplementedusingthe widelyadopted PETSctoolkitand conductextensive experimentsusing thesesolversona modernsystem (aBlueGene/Qmachine) success-fullyscalingthemupto8Kprocessors.Alongwiththeproposedmodels,weputpractical aspectsofeight evaluatedmodels(two1D- and six2D-based)under thoroughanalysis. Tothe bestofourknowledge,thisisthefirstwork thatanalyzespracticalperformance of2Dmodelsonthisscale.Amongevaluatedmodels,themodelsthatrelyon2Djagged partitioningobtainthemostpromisingresultsbystrikingabalancebetweenminimizing bandwidthandlatencycosts.

© 2016PublishedbyElsevierB.V.

1. Introduction

Many scientific and engineering applications necessitate solving a linear system of equations. The methods used for this purpose are categorized as direct and iterative methods. When the linear system is large and sparse, iterative methods are preferred to their direct counterparts due to their speed and flexibility. Most widely used iterative methods for solving large-scale linear systems are based on Krylov subspace iterations.

A single iteration in Krylov subspace methods usually consists of one or more Sparse Matrix–Vector multiplica- tions (SpMV), dot product(s) and vector updates. In a distributed setting, SpMV operations require regular or irregular point-to-point (P2P) communication depending on the sparsity pattern of the coefficient matrix in which each processor

R This work was supported by The Scientific and Technological Research Council of Turkey (TÜB ˙ITAK) under Grant EEEAG-114E545. This article is also based upon work from COST Action CA 15109 (COSTNET).

Corresponding author. Tel.: +90 (312) 290 1213; fax: +90 (312) 266 4047.

E-mail addresses: [email protected] (O. Selvitopi), [email protected] (C. Aykanat). http://dx.doi.org/10.1016/j.parco.2016.04.004

(2)

sends/receives messages to/from a subset of processors. On the other hand, dot products necessitate global communication that involves a reduction operation on one or a few scalars in which all processors participate. Vector updates usually do not require any communication.

A common model to capture the cost of communicating a single message of m words consists of two major components and is given by the formula t s+ mt w. Here, t sis the startup time and signifies the costs related to preparation of the message (referred to as latency cost). The t wcomponent is the time required to transfer a single word between two processors and is equal to the reciprocal bandwidth (referred to as bandwidth cost). Latency cost is proportional to the number of messages whereas bandwidth cost is proportional to the number of words. Among these two components, latency costs prove to be more vital for parallel performance as they are generally harder to avoid and improve [1] . Although both costs are reduced within time, the gap between them gradually increases in favor of bandwidth costs with an approximately 20% annual improvement over latency costs [1,2] . Furthermore, computation speeds evolve faster than communication speeds, making communication costs more critical for performance. With the latest developments in the scientific computing field, communication costs are likely to be a major factor in ranking fastest high performance computing (HPC) systems [3] . This work focuses on reducing latency costs of parallel SpMV operations in the context of nonsymmetric iterative solvers. The models and methods proposed in our work apply to matrices both with regular and irregular sparsity patterns. However, the benefits are more evident for the irregular ones, which are usually harder to exploit.

1.1. Related work

Communication requirements of iterative solvers have been of interest for more than three decades. There are numerous works on reducing communication overhead of global reduction operations in iterative solvers. Several works in this cate- gory aim at decreasing the number of global synchronization points in a single iteration of the solver by reformulating it [4–12] . Another important area of study is s -step Krylov subspace methods, which focus on further reducing the number of global synchronization points by a factor of s by performing only a single reduction once in every s iterations [8,13–17] . The performance gain of s -step methods comes at the cost of deteriorated stability and complications related to integration of preconditioners. However, these methods recently gained popularity again and promising studies address these shortcom- ings [13,16,18,19] . Another common technique is to overlap communication and computation with the aim of hiding global synchronization overheads [20,21] . Especially with the introduction of nonblocking collective constructs in the MPI-3 stan- dard, this technique is gaining attraction [22–24] . Overlapping is commonly used for SpMV operations as well. In addition, a recent work proposed hierarchical and nested Krylov methods that constrain global reductions into smaller subsets of pro- cessors where they are cheaper [25] . Another recent work uses the idea of embedding SpMV communications into global reductions to avoid latency overhead of SpMV communications [26] .

The performance of iterative solvers is also addressed by minimizing communication costs related to parallel SpMV oper- ations, which is also addressed by this work. There are studies that can handle sparse matrices that are well-structured and have predictable sparsity patterns, generally arising from 2D/3D problems [16,27–29] . However, the studies in this field gen- erally focus on combinatorial models that are capable of exploiting both regular and irregular patterns to obtain a good par- tition of the coefficient matrix. In this regard, graph and hypergraph partitioning models are widely utilized with successful partitioning tools such as MeTiS [30] , PaToH [31] , Scotch [32] , Mondriaan [33] . These models can broadly be categorized as one-dimensional (1D) and two-dimensional (2D) partitioning models. In 1D models [30,31,34–39] , each processor is respon- sible for a row/column stripe, whereas in 2D models, each processor may be responsible for a submatrix block (generally defined by a subset of rows and columns) or as in the most general case, each processor may be responsible for an arbi- trarily defined subset of nonzeros. Compared to 1D models, 2D models possess more freedom in partitioning the coefficient matrix. Some works on 2D models do not take the communication volume into account, however they provide an upper bound on the number of messages communicated [40–44] . On the other hand, there are 2D models that aim at reducing volume, with or without providing a bound on the maximum number of messages [33,45–50] . 2D partitioning models in the literature can further be categorized into three classes: checkerboard partitioning [47,49,50] (also known as coarse-grain partitioning), jagged partitioning [45,49] and fine-grain partitioning [46,4 8,4 9] . Notably, a recent work [50] proposes a fast 2D partitioning for scale-free graphs via a two-phase approach. This method uses 1D partitioning to reduce volume in the first phase and an efficient heuristic in the second phase to obtain a bound on the maximum number of messages. This work differs from ours as it does not explicitly minimize the message count, instead, it uses a property of the Cartesian distribution of the matrices to provide the mentioned upper bound.

1.2. Motivation and contributions

Most of the aforementioned and other existing partitioning models optimize the objective of minimizing total communi- cation volume, which is an effort to reduce bandwidth costs. However, communication cost is a function of both bandwidth and latency, with the latter being at least as important as the former, as the current trends indicate. The need for parti- tioning models that also consider other cost metrics has been noted in other works [26,34] . There are a few notable works that focus on different communication cost metrics. Balancing communication volume is one of them [33,51,52] . More im- portant and overlooked work targets multiple communication metrics including latency [53] , on which this study is based. Compared to [53] , this study concentrates more on practical aspects.

(3)

a symmetric partition where the same partition is imposed on both input and output vectors, or by using a nonsymmetric partition where a distinct partition is employed for input and output vectors. The latter alternative is more appealing and it should be adopted whenever convenient since it is more flexible and allows operating in a broader search space. A non- symmetric partition can be utilized in nonsymmetric linear system solvers such as the conjugate gradient normal equation error (CGNE) [54–56] and residual method (CGNR) [54–57] , and the standard quasi-minimal residual (QMR) [58] where the coefficient matrix is square and nonsymmetric. The details of how to utilize nonsymmetric partitioning without incurring communication during linear vector update operations are explained for CGNE and CGNR solvers in Appendix E . We con- strain ourselves to nonsymmetric square matrices in this work, but all proposed models apply to certain iterative methods that involve rectangular matrices as well.

Our work is based on [53] , which also achieves a nonsymmetric partition through a two-phase methodology with a model called communication hypergraph . Our contributions and differences from [53] are as follows:

(i) We propose two new partitioning models for reducing latency which are based on 2D checkerboard and jagged par- titioning. These models aim at reducing latency costs usually at the expense of increasing bandwidth costs. Similar models have been investigated [48,53] , but they are based on 1D and 2D fine-grain models.

(ii) All proposed and investigated partitioning models are realized on two iterative methods CGNE and CGNR imple- mented with the widely adopted PETSc toolkit [59] . We describe how to obtain a nonsymmetric partition on the vectors utilized in these solvers using the communication hypergraph model and thoroughly evaluate partitioning re- quirements of them via experiments. In this manner, we differ from [53] , in which the proposed methods were tested with a code developed by the authors that contains only parallel SpMV computations.

(iii) We conduct extensive experiments for the mentioned iterative solvers. Although better suited to large-scale systems, the communication hypergraph model was originally tested only for 24 processors on a local cluster and only for 1D partitioning. In this work, we test and show this model’s validity on a modern HPC system (a BlueGene/Q machine) successfully scaling up to 8K processors.

(iv) We compare one 1D-based, three 2D-based models (checkerboard, jagged and fine-grain), and these four models’ latency-improved versions, making a total of eight partitioning models. Among these, the 2D models are somewhat overlooked in the literature, never being tested in a realistic setting on a large-scale system. Although their theoretical merits are of no question, their practical merits are not appreciated. In our experiments, we put these methods’ practical aspects into a thorough analysis. The experiments show surprising results with 2D jagged partitioning and its latency-improved version performing better in the majority of the matrices.

The rest of the paper is organized as follows: Section 2.1 and 2.2 describe the proposed partitioning models to reduce the latency overhead of checkerboard and jagged models, respectively. These two sections describe basic checkerboard and jagged models as well. We also briefly review the fine-grain model and its latency-improved version in Section 2.3 , since they are included in our experiments. We compare communication properties of all partitioning models in Section 3 . Section 4 contains the results and discussions of the extensive large-scale experimental evaluation of eight partitioning mod- els on a BlueGene/Q system with 28 matrices. Our experiments range from 256 to 8192 processors. Final remarks are given in Section 5 . The presentation of the paper relies on the assumption that the reader is already familiar with the commu- nication hypergraph model for 1D partitioning [53] . The unfamiliar reader can find a detailed background with explanatory examples about the communication hypergraph model in Appendix D .

2. Reducinglatencycostin2Dpartitioningmodels

2D models work at a finer level of partitioning granularity compared to 1D models by allowing nonzeros of a single row/column to be assigned to more than one processor. In this manner, they possess more flexibility in partitioning since they do not constrain the search space by assigning all nonzeros of a row/column to the same processor. This leads them to exploit existing partitioning tools better. 1D models necessitate a row-parallel/column-parallel algorithm ( Appendix B ), whereas 2D models necessitate a row-column-parallel algorithm. The fundamental difference between them is that the for- mer necessitates a single communication stage in parallel SpMV operations, which is either pre-communication if the matrix is partitioned row wise, or post-communication if the matrix is partitioned column wise, whereas the latter necessitates two distinct communication stages: one before the local SpMV computations (on the input vector in a pre -communication stage via expand communication tasks) and one after the local SpMV computations (on the output vector in a post -communication stage via fold communication tasks), For more details on implementation issues regarding 2D partitioning, see [49] .

In this section, we describe how to reduce latency overhead of 2D checkerboard and jagged partitioning models. For these models, we assume a K =P × Q virtual processor mesh. A simple example depicting a matrix partitioned with these two models is given in Fig. 1 . Compared to their original counterparts, the proposed models are likely to increase the bandwidth costs by increasing communication volume. However, this issue is addressed by maintaining a balance on this metric. We also briefly review the 2D fine-grain model and compare it to checkerboard and jagged models as we evaluate it in our

(4)

(a) Checkerboard

(b) Jagged

Fig. 1. A sample of 2D checkerboard and jagged partitionings on a 16 = 4 × 4 virtual processor mesh.

experiments. Note that a model to reduce latency overhead of fine-grain partitioning has already been investigated [48] . All proposed models are discussed on parallel w = Ap. However, the arguments are also valid for z =A Tr as communication requirements of w = Ap and z = A Tr are the dual of each other and minimizing the objective function in w = Ap is equivalent to minimizing the objective function in z =A Tr.

2.1. Checkerboard partitioning

Checkerboard partitioning is a two-phase process in which each phase utilizes a 1D partitioning model. The second phase depends on the first phase by using information obtained in the first phase to determine multiple vertex weights utilized in the second phase. For the rest of the section, we assume a 1D row wise partition in the first phase and a 1D column wise partition in the second phase. For the arguments made in this section, an analogous discussion holds for the dual scheme as well.

Consider a K =P × Q processor mesh and an n × n square matrix A . In the first phase, the column-net hypergraph model H R =

(

V R, N C

)

is used to obtain a P -way partition



R =

{

V 1 , . . . , V P

}

, which induces a P -way row wise partition

{

R1 ,...,RP

}

of A . Here, denotes the set of rows that correspond to vertices in Vα, for

α

=1 ,...,P . The rows in form a row stripe A α whose size is n α × n, with n α=

|

|

. At the end of the first phase, the assignment of rows of A is determined by associating row stripe A αwith the Q processors in row

α

of the processor mesh, P α,∗.

In the second phase, the row-net hypergraph model HC =

(

VC,NR

)

is used to obtain a Q -way partition



C =

{

V1 ,...VQ

}

, which induces a Q -way column wise partition

{

C1 ,...,CQ

}

of A . Here, denotes the set of columns that correspond to vertices in Vβ, for

β

= 1 ,...,Q. The columns in Cβ form column stripe A β whose size is n × nβ, with n β=

|

Vβ

|

. At the end of the second phase, we complete the assignment of columns of A and actually obtain a Q -way column wise partition of each row stripe A α, forming Q submatrix blocks A α,1 ,...,A α,Q. Hence,

(

R,



C

)

defines an assignment for rows and columns of A where processor P α, β is responsible for the set of rows in Rα and the set of columns in Cβ. In other words, nonzero a ijis assigned to P α, β if r i ∈ R α and c jC β.

This two-phase process aims at minimizing total communication volume for the pre- and post-communication stages in the first and second phases, respectively, while maintaining computational load balance [47,49] . A notable property of checkerboard partitioning is that it confines the communication in expand and fold operations to the processors in the same column and row of the processor mesh, respectively. It achieves a Cartesian distribution of the matrix, in which each processor owns an intersection of a subset of rows and a subset of columns. A row (column) is said to be coherent if the nonzeros of this row (column) generate partial results for (require) the same w -vector ( x -vector) element. Consider a row r i that is assigned to at the end of the first phase. The coherency of this row is preserved at this point as it is modeled by

v

i in HR. In the second phase, the nonzeros of this row can be distributed among Q processors in row

α

of the processor mesh, which is also the case for all other rows in Rα. Hence, row coherency is respected in a coarse level by assigning nonzeros of rows in R α to the processors in the same row of the processor mesh, P α, ∗. A coarse level here implies that the nonzeros belonging to a subset of rows are distributed among the same subset of processors (in this case among P processors in a specific column of the processor mesh). This provides the upper bound Q − 1 on the number of messages communicated in the post-communication stage as there are Q processors in row

α

of the processor mesh. With a similar argument, column coherency is also respected in a coarse level by assigning nonzeros of columns in to the processors in the same column of the processor mesh, P ∗, β. This provides the upper bound P − 1 on the number of messages communicated in the pre-communication stage as there are P processors in column

β

of the processor mesh. Hence, the maximum number of messages handled by a single processor is bounded by P +Q − 2. In checkerboard partitioning, the second phase is performed with P -way multi-constraint [60,61] partitioning to balance computational loads.

(5)

Fig. 2. Formation of the communication matrix for the third column of the processor mesh ( β= 3 ) to summarize expand operations in the pre- communication stage.

2.1.1. Communication matrices

The expand communication tasks in the checkerboard model are bound to distinct columns of the processor mesh. For this reason, to summarize the communication requirements of expand tasks in the pre-communication stage, we form Q distinct communication matrices. Let p Cβ denote the vector elements that necessitate communication in column

β

of the processor mesh, for 1 ≤

β

≤ Q. Note that at most P processors can participate in communicating p Cβ, confined to the set of processors in P ∗, β. We summarize the communication operations in column

β

of the mesh with the P ×

|

p Cβ

|

communica- tion matrix M β. Rows of M β correspond to processors in P ∗, β and columns of M β correspond to expand tasks on p Cβ. There exists a nonzero m αj ∈ M β if and only if there is a non-empty column segment in submatrix A α, β at the respective column. The nonzeros in column j of M β represent the set of processors that participate in communicating p Cβ[ j] , which is a subset of processors in P ∗, β. The nonzeros in row

α

of M β represent the expand tasks processor P α, β participates in. Note that vector elements corresponding to internal columns (those which have a single non-empty column segment in P row stripes) do not incur communication and they are not included in M β. These vector elements should be assigned to the respective processors to avoid unnecessary communication. An example in Fig. 2 is presented to illustrate the formation of communi- cation matrix M β for the third column of the processor mesh (

β

=3 ) to summarize the expand tasks. There are four input vector elements that necessitate communication (denoted by p Cβ) and they form columns of M β. For instance, the first col- umn of M β has nonzeros corresponding to processors P 1, 3 , P 2, 3 and P 4, 3 since in matrix A , there exist nonzero column segments in the respective submatrix blocks. Two vector elements–second and fifth–corresponding to internal columns do not incur communication and they are not included in M β; these elements should be assigned to P 3, 3 and P 2, 3 , respectively, to avoid unnecessary communication.

The fold communication tasks in checkerboard model are bound to distinct rows of the processor mesh. Following a similar approach, we form P distinct communication matrices to summarize the communication requirements of fold tasks in the post-communication stage. Let w Cα denote the vector elements that necessitate communication in row

α

of the processor mesh, for 1 ≤

α

≤ P . We summarize the communication operations in row

α

of the mesh with the

|

w Cα

|

× Q communication matrix M α, where there exists a nonzero m ∈ M α if and only if there is a non-empty row segment in submatrix A α, β at the respective row. An example is presented in Fig. 3 to illustrate the formation of communication matrix M α in the third row of the processor mesh (

α

= 3 ) to summarize the fold tasks. The dual of the discussions made for M β in Fig. 2 follows also for M α.

We form a total of P +Q communication matrices to summarize communication requirements of checkerboard parti- tioning. We can address the communication requirements of both pre- and post-communication stages independently since communication operations in these stages are bound to distinct columns and rows of the processor mesh, respectively. For- mation of these communication matrices is illustrated in Fig. 4 .

(6)

Fig. 3. Formation of the communication matrix for the third row of the processor mesh ( α= 3 ) to summarize fold operations in the post-communication stage.

2.1.2. Formation of communication hypergraphs

We form Q hypergraphs from Q communication matrices for the pre-communication stage. For each M β, a communi- cation hypergraph HCM

β is formed using the row-net hypergraph model, for 1 ≤

β

≤ Q. The net set of HCMβ represents the

processors in column

β

( P ∗, β) of the processor mesh and the vertex set of HCM

β represents the expand tasks on p Cβ. Hence, there are P nets and

|

p Cβ

|

vertices in H CMβ . A vertex

v

jin H CMβ is connected by the set of nets corresponding to processors that communicate the respective vector element p Cβ[ j] . In all Q communication hypergraphs, the total number of vertices is equal to

|

p C

|

=  βQ=1

|

p Cβ

|

and the total number of nets is equal to Q × P = K.

In a similar manner, we form P hypergraphs from P communication matrices for the post-communication stage. For each M α, a communication hypergraph H CM

α is formed using the column-net hypergraph model, for 1 ≤

α

≤ P . The net set of H CMα

represents the processors in row

α

( P α, ∗) of the processor mesh and the vertex set of HCM

α represents the fold tasks on w Cα. Hence, there are Q nets and

|

w Cα

|

vertices in HαCM. A vertex

v

iin HCMα is connected by the set of nets corresponding to the processors that communicate the respective vector element w Cα[ i ] . In all P communication hypergraphs, the total number of vertices is equal to

|

w C

|

=Pα=1

|

w Cα

|

and the total number of nets is equal to P × Q= K.

In total, we form P +Q communication hypergraphs from P +Q communication matrices. This process is illustrated in Fig. 4 .

2.1.3. Partitioning of the communication hypergraphs We partition HCM

β to get a P -way partition



β=

{

V1 ,V2 ,...,VP

}

and obtain a distribution of expand tasks among P processors in column

β

of the processor mesh for the pre-communication stage, for 1 ≤

β

≤ Q. The responsibility of the ex- pand tasks represented by the vertices in



βis assigned to processor P α, β. Consider a net n α, β in HCM

β that represents

P α, β. The connectivity set of this net contains the parts that correspond to the processors each of which send a message to P α, β. The size of this set can be at most P since HCM

β is partitioned into P , bounding the number of messages sent/received

by a single processor by P − 1 in the pre-communication stage. Hence, this feature of original checkerboard partitioning is still respected. In partitioning H CM

β , the partitioning objective of minimizing cutsize corresponds to minimizing the number

of messages communicated in column

β

of the processor mesh in the pre-communication stage, and the partitioning con- straint of maintaining balance among part weights corresponds to obtaining a balance on the communication volume loads of these processors.

Similarly, we partition HCM

α to get a Q -way partition



α=

{

V1,V2,...,VQ

}

and obtain a distribution of fold tasks among Q processors in row

α

of the processor mesh for the post-communication stage, for 1 ≤

α

≤ P . H CM

α is partitioned into Q ,

bounding the number of messages sent/received by a single processor by Q − 1 in the post-communication stage. Hence, this feature of original checkerboard partitioning is also respected. The partitioning objective and the balancing constraint are identical to those in partitioning HCM

β .

The formed P +Q hypergraphs can be independently partitioned since they do not depend on each other in any way. The maximum number of messages handled by a single processor is still P + Q − 2 as in the original checkerboard partitioning. As a result, we improve latency costs in each row/column of the processor mesh independently while respecting basic characteristics of checkerboard partitioning.

(7)

Fig. 4. Minimizing latency cost in checkerboard partitioning model.

2.2. Jagged partitioning

Jagged partitioning consists of two phases. The first phase consists of a single 1D partitioning model, whereas the second phase consists of multiple, independent and same type of 1D partitioning models. The second phase depends on the first phase by using the partitioning information obtained in the first phase to determine vertex sets and vertex weights for the models formed in the second phase. For the rest of the section, we assume a 1D row wise partition in the first phase and a 1D column wise partition in the second phase. For the arguments made in this section, an analogous discussion holds for the dual scheme as well.

Assume a K = P × Q processor mesh and an n × n square matrix A . The first phase of jagged partitioning is exactly the same as the first phase of checkerboard partitioning: a column-net hypergraph model HR =

(

VR,NC

)

is used to obtain a P -way partition



R =

{

V1,...,VP

}

, which induces a row wise partition

{

R1,...,RP

}

of A . At the end of this phase, the rows in row stripe A αare associated with the Q processors in row

α

of the processor mesh, P α, ∗.

In the second phase, we form a hypergraph for each row submatrix A α obtained in the former phase using the row- net hypergraph model, for 1 ≤

α

≤ P. In total, P hypergraphs are formed. In this aspect, jagged partitioning differs from checkerboard partitioning – which forms a single hypergraph in the second phase. The net set of Hαrepresents rows of A α and the vertex set of H α represents columns of A that have a nonzero column segment in A α. Hence, the same vertex can appear in multiple hypergraphs since the corresponding column may have nonzero column segments in more than one row stripes. These P hypergraphs are independently partitioned into Q parts to obtain a Q -way partition of each row stripe. At the end of the second phase, for each row stripe A α, we obtain Q submatrix blocks A α,1 , . . . , A α,Q by partitioning vertices corresponding to columns of A α.

The first and the second phase aim to minimize the volume of communication in pre- and post-communication stages, respectively, while maintaining computational load balance [47,49] . In contrast to checkerboard partitioning, the objective in the second phase of the jagged partitioning is addressed independently by partitioning row stripes separately in distinct hypergraphs. The jagged model also differs from the checkerboard model as it does not lead to a Cartesian distribution

(8)

of the matrix. Jagged partitioning is more flexible in this sense since it allows nonzeros of a column to be distributed among any processor that is in distinct rows of the processor mesh – not just among the processors that are in the same column of the mesh, as is the case for checkerboard partitioning. Hence, the column coherency is not preserved in the assignment of columns. This leads to improved communication volume for expand tasks in the pre-communication stage at the expense of increasing the upper bound on the number of messages handled by a single processor. In other words, the jagged model sacrifices the coarse level coherency of columns and causes the number of messages handled by a single processor in the pre-communication stage to be at most P × Q− Q= K − Q. In this stage, a processor may communicate with any other processor except the processors that are in the same row of the processor mesh as this processor. On the other hand, the coherency of rows owned by a processor is respected in a coarse level as in checkerboard partitioning. This is because nonzeros of these rows are distributed among processors in the same row of processor mesh, P α, ∗, if the respective processor is in row

α

. Hence, the number of messages handled by a single processor for fold tasks in the post-communication stage is bounded by Q − 1 as there are Q processors in a single row of the processor mesh. Conse- quently, the maximum number of messages handled by a processor can be at most

(

K − Q

)

+

(

Q − 1

)

=K − 1 in jagged partitioning.

2.2.1. Communication matrices

The expand communication tasks in the jagged model are not bound to distinct columns of the processor mesh. For this reason, we form a single communication matrix to summarize the communication requirements of expand tasks in the pre- communication stage. Let p Cdenote the vector elements that necessitate communication. We summarize the communication operations with the

(

K = P × Q

)

×

|

p C

|

communication matrix M R. Rows of M Rcorrespond to all processors and columns of M R correspond to expand tasks on p C. Consider two vector elements owned by the same processor. Although these two elements can be communicated by at most P processors (each of which belongs to a distinct row of the processor mesh), they do not necessarily need to be confined to the same column of the processor mesh. For this reason, we include all processors in M R. The formation of M R essentially resembles that of 1D row-parallel w =Ap(D.1) : m kj = 0 if and only if column c jhas a nonzero column segment in k th row stripe of A . Nonzeros in column c j M Rrepresent the set of processors that participate in communicating p C[ j ] and nonzeros in row r k ∈ M R represent the expand tasks P k participates in. The difference is, however, as a consequence of jagged partitioning, each column in M R can have at most P nonzeros instead of K . As usual, vector elements corresponding to internal columns are not included in M R since they do not necessitate communication.

In contrast to expand tasks, the fold communication tasks are bound to distinct rows of the processor mesh. The com- munication requirements of fold tasks in the post-communication stage are thus summarized by P distinct communication matrices, M α, for 1 ≤

α

≤ P . The formation of these matrices is the same as the formation of matrices for summarizing communication requirements of fold tasks in checkerboard partitioning. The semantics of nonzeros in rows and columns of M α are identical to those of M α in checkerboard partitioning.

We form a total of P +1 communication matrices to summarize communication requirements of jagged partitioning. We can address the communication requirements of the post-communication stage independently using P matrices. However, since communication operations in the pre-communication stage are not bound to the processors in the same column of the processor mesh, the expand tasks are represented in a single matrix with all K processors. Formation of these commu- nication matrices is illustrated in Fig. 5 .

2.2.2. Formation of the communication hypergraphs

For the pre-communication stage, we form a single communication hypergraph HCM

R from communication matrix M R using the row-net hypergraph model. The net set of HCM

R corresponds to K processors and the vertex set of HCMR corresponds to expand tasks on p C. Hence, there are K nets and | p C| vertices in H CMR . A vertex

v

jin H CMR is connected by the set of nets corresponding to processors that communicate the respective vector element p C[ j ]. Note that

v

jcan be connected by at most P nets, i.e., d j ≤ P.

For the post-communication stage, we form P communication hypergraphs from P communication matrices. The for- mation of these communication hypergraphs is actually the same as for checkerboard partitioning. A communication hy- pergraph HCM

α is formed using the column-net hypergraph model for matrix M α, for 1

α

≤ P. The semantics of these hypergraphs are also the same: the net set of H CM

α represents the processors in row

α

of the processor mesh, P α, ∗, and the vertex set of HCM

α represents the fold tasks on w Cα. Similarly, there are

|

w C

|

vertices and K nets in all communication hypergraphs.

In total, we form P + 1 communication hypergraphs from P + 1 communication matrices. This process is illustrated in Fig. 5 .

2.2.3. Partitioning of the communication hypergraphs Communication hypergraph HCM

R is partitioned to obtain a K -way partition



R =

{

V1 ,V2 ,...,VK

}

to induce a distribution of expand tasks in the pre-communication stage. The responsibility of the expand tasks represented by the vertices in V k is assigned to processor P k. Consider a net n k in HCMR that represents P k. The connectivity set of this net corresponds to processors each of which send a message to P k. Since HCMR is partitioned into K parts, the size of this set can be at most

(9)

Fig. 5. Minimizing latency cost in jagged partitioning model.

K . Thus, the maximum number of messages sent/received by a single processor is K − 1 in the pre-communication stage (note that it is K − Q in the original jagged partitioning). In partitioning HCM

R , the partitioning objective of minimizing cut- size corresponds to minimizing the number of messages communicated in the pre-communication stage, and the partitioning constraint of maintaining balance among part weights corresponds to obtaining a balance on the communication volume loads of K processors.

Communication hypergraph H CM

α is partitioned to obtain a Q -way partition



α=

{

V 1 , V 2 , . . . , V Q

}

, for 1 ≤

α

≤ P , to induce a distribution of fold tasks in the post-communication stage among Q processors in row

α

of the processor mesh. The partitioning of these communication hypergraphs have the same semantics with those in checkerboard partitioning.

The formed P + 1 hypergraphs can be partitioned independently, since they do not depend on each other in any way. Note that the maximum number of messages handled by a single processor is slightly increased from K − 1 to K +Q − 2, which is caused by the partitioning for distributing tasks in the pre-communication stage. However, the cases beyond this are expected to be rare as a good partitioning tool will avoid them. As a result, we improve latency costs while respecting most characteristics of the original jagged partitioning.

2.3. Fine-grain (nonzero-based) partitioning

We briefly review fundamental properties of fine-grain partitioning. This model is first proposed in [46] and its commu- nication costs are improved in a later work using the CHG model [48] . Both of these models are included in our experiments. The fine-grain model forms a hypergraph in which vertices represent nonzeros of A and nets represent rows and columns of A . Nets corresponding to columns of A capture the communication volume incurred in the pre-communication stage, while nets corresponding to rows of A capture the communication volume incurred in the post-communication stage. Par- titioning this hypergraph into K parts induces a distribution of nonzeros of the matrix among K processors. This leads to a completely arbitrary distribution of fine-grain computational tasks on a nonzero basis, where each vertex signifies a scalar

(10)

Fig. 6. Minimizing latency cost in fine-grain partitioning model.

multiplication with a single nonzero. Therefore, the fine-grain model respects neither row nor column coherence. In this aspect, it accommodates the highest level of flexibility by not restraining the computational tasks to coarser levels (i.e., nonzeros of a row and/or column) compared to checkerboard, jagged and 1D models. As a result, a processor may com- municate with any other processor. Thus, the maximum number of messages handled by a single processor is K − 1 in the pre-communication stage and is also K − 1 in the post-communication stage, summing up to a total of 2

(

K − 1

)

messages. The fine-grain model correctly minimizes the total communication volume while maintaining computational load balance. For more details, see [46,49] .

The approach to improve communication requirements of the fine-grain model [48] consists of forming two communica- tion matrices: one matrix for summarizing communication operations in the pre-communication stage and one matrix for summarizing communication operations in the post-communication stage. Compare this to the formation of communication matrices in the checkerboard and jagged models. We address the communication requirements in the checkerboard model by separately forming a total of P + Q communication matrices since they are confined to distinct columns and rows of the processor mesh. Also in the jagged model, we form P communication matrices for the post-communication stage. These are not valid for the fine-grain model since the resulting partition is arbitrarily defined on the nonzeros of the matrix and any of the K processors may communicate with any other processor in both pre- and post-communication stages. For this reason, the whole set of processors and all vector elements that necessitate communication are included in two communication matrices. The process for reducing the communication costs of the fine-grain model is illustrated in Fig. 6 .

3. Comparisonofpartitioningmodels

We compare the basic properties of the investigated partitioning models to aid the discussions of the results in experi- ments. It is assumed that P = Q =K in the P × Q processor mesh for ease of presentation.

Fig. 7 compares the partitioning models in terms of flexibility they provide during partitioning. 1D models lie at the left extreme of the spectrum since they represent each row/column of the coefficient matrix with a single distinct vertex as an atomic task. This leads to the assignment of all nonzeros of a row/column to an individual processor as a whole. Hence, 1D

(11)

Fig. 7. Comparison of models in terms of partitioning flexibility. Table 1

Comparison of partitioning models in terms of latency overhead and partitioning granularity.

Number of messages Comm. stage 1D 2D partitioning models

Checkerboard Jagged Fine-grain

pre K − 1 √ K − 1 K −√ K K − 1

max post N/A √ K − 1 √ K − 1 K − 1

overall K − 1 2(K − 1) K − 1 2(K − 1)

total overall K(K − 1) 2 K(K − 1) K(K − 1) 2 K(K − 1)

Row/column coherency Either entire row or Coarse-level on both Coarse-level on either None (Partitioning granularity) Entire column Rows and columns Rows or columns

models respect row/column coherency at the individual processor level. The fine-grain model lies at the right extreme of the spectrum since in this model each vertex represents an atomic task corresponding to a single nonzero of the matrix. This is the most flexible and the finest level of partitioning granularity available, where neither row nor column coherency is preserved. So, in theory, nonzeros of a row/column can be distributed among K processors. Between these two extremes, the checkerboard and jagged models strive to distribute nonzeros of a row/column among a subset of processors. By doing so, they obtain a coarse-level row/column coherency at the processor mesh’s row/column level. The checkerboard model leverages a coarse-level coherency in both partitioning phases whereas the jagged model leverages it in a single partitioning phase.

Among these partitioning models, 2D models are expected to achieve lower bandwidth costs compared to 1D models since they offer more flexibility in optimizing the objective of minimizing total communication volume. Among 2D models, fine-grain is expected to obtain the best results in terms of bandwidth costs, whereas checkerboard is likely to obtain the worst. The metrics related to latency costs (as upper bounds on the maximum number of messages) are presented in Table 1 . Checkerboard has the lowest overhead with 2

(

K − 1

)

maximum messages per processor, whereas fine-grain has the highest overhead with 2

(

K − 1

)

. Although 1D and jagged models have the same upper bound K − 1, in practice jagged partitioning is more likely to achieve better results in this metric since it restricts the number of messages in both stages of communication.

The discussions made so far in this section reflect the characteristics of the original partitioning models in which the communication hypergraph model is not used. The original models completely focus on minimizing bandwidth costs, disre- garding latency-related objectives. Using the communication hypergraph model as a further step reduces latency costs at the expense of increasing bandwidth costs while respecting certain characteristics of the original models as much as possible. Our experimental evaluation shows that latency should definitely be on the table to achieve scalable performance.

4. Experiments

We evaluate two 1D-based and six 2D-based models, that is, a total of eight partitioning models. The evaluated models are based on 1D row wise partitioning (

1D

), checkerboard partitioning (

CKBD

), jagged partitioning (

JGD

) and fine-grain partitioning (

FG

). Four of the evaluated models are the baseline models in which the communication tasks are assigned to processors using a simple heuristic that aims at balancing the communication volume loads while respecting total volume attained in the initial partitioning. This heuristic is also utilized in [53] and is an adaptation of the best-fit-decreasing heuristic used in solving the NP-hard K -feasible Bin Packing (

BP

) problem [62] . These

BP

-enhanced baseline models are referred to as

1D+BP

,

CKBD+BP

,

JGD+BP

and

FG+BP

. These four baseline models aim to reduce two important volume- related communication cost metrics, namely total volume and maximum volume. The remaining four evaluated models are the

CHG

-enhanced versions in which the communication tasks are assigned to processors using the communication hypergraph model. These

CHG

-enhanced models are referred to as

1D+CHG

,

CKBD+CHG

,

JGD+CHG

and

FG+CHG

. These aim to reduce total message count and maximum volume. Hence, we evaluate the merit of reducing a latency-related cost metric in partitioning. We use PaToH [31,61] to partition the computational hypergraphs formed in the first phase of all models and the communication hypergraphs formed in the second phase of the

CHG

-enhanced models.

We implemented CGNE and CGNR solvers via the PETSc toolkit [59] and utilized the mentioned models for partitioning the coefficient matrix and vectors in these solvers. Since obtained runtime results for both solvers are similar, we only present speedup results corresponding to CGNR. Note that the metrics regarding partitioning models ( Section 4.1 ) such as total volume, message count, etc. are the same for both solvers as they contain the same type of communication operations.

(12)

Table 2

Test matrices and their properties.

Nonzeros

Number of per row per column

Matrix rows/cols nonzeros avg min max min max

venkat01 62,424 1,717,792 27 .52 16 44 16 44 mc2depi 525,825 2,100,225 3 .99 2 4 2 4 poisson3Db 85,623 2,374,949 27 .74 6 145 6 145 thermomech_dK 204,316 2,846,228 13 .93 7 20 7 20 stomach 213,360 3,021,648 14 .16 7 19 6 22 FEM_3D_thermal2 147,900 3,489,300 23 .59 12 27 12 27 laminar_duct3D 67,173 3,833,077 57 .06 1 89 3 89 xenon2 157,464 3,866,688 24 .56 1 27 1 27 iChem_Jacobian 274,087 4,137,369 15 .10 5 17 5 17 torso3 259,156 4,429,042 17 .09 7 22 6 21 tmt_unsym 917,825 4,584,801 5 .00 3 5 3 5 t2em 921,632 4,590,832 4 .98 1 5 1 5 Hamrle3 1,447,360 5,514,242 3 .81 2 6 2 9 largebasis 440,020 5,560,100 12 .64 4 14 4 14 Chevron4 711,450 6,376,412 8 .96 2 9 2 9 cage13 445,315 7,479,343 16 .80 3 39 3 39 PR02R 161,070 8,185,136 50 .82 1 92 5 88 atmosmodl 1,489,752 10,319,760 6 .93 4 7 4 7 kim2 456,976 11,330,020 24 .79 4 25 5 25 memchip 2,707,524 14,810,202 5 .47 2 27 1 27 Freescale1 3,428,755 18,920,347 5 .52 1 27 1 25 circuit5M_dc 3,523,317 19,194,193 5 .45 1 27 1 25 fem_hifreq_circuit 491,100 20,239,237 41 .21 12 110 12 110 rajat31 4,690,002 20,316,253 4 .33 1 1252 1 1252 CoupCons3D 416,800 22,322,336 53 .56 20 76 20 76 Transport 1,602,111 23,500,731 14 .67 5 15 5 15 ML_Laplace 377,002 27,689,972 73 .45 26 74 26 74 RM07R 381,689 37,464,962 98 .16 1 295 1 245

All models are tested with 28 matrices chosen from the UFL matrix collection [63] . The properties of these matrices are presented in Table 2 . The evaluated models are tested on a BlueGene/Q system with varying number of processors K ∈ {256, 512, 1024, 2048, 4096, 8192}. A node on this system consists of 16 cores (single PowerPC A2 processor) with 1.6 GHz clock frequency and 16 GB memory. The nodes are connected by a 5D torus chip-to-chip network. We only consider the case of strong scaling.

For the pre-communication stages of

CKBD+CHG

and

JGD+CHG

, we opted not to apply the communication hypergraph model since the partitioning corresponding to this stage leads to a number of very small communication hypergraphs in which the number of communication tasks (that is, vertices) and the number of messages per processor are very low. Hence, utilizing a dedicated tool to partition these hypergraphs often does not pay off. Instead, the simple aforementioned heuristic is able to obtain comparable partition qualities in a shorter amount of time. Hence, the pre-communication stages (expand tasks) of

CKBD+BP

and

CKBD+CHG

models, and

JGD+BP

and

JGD+CHG

models have the same quantities for the statistics presented in Section 4.1 . However, for the post-communication stage, the respective quantities drastically differ in these models as the benefits of using the communication hypergraph model are more apparent.

4.1. Bandwidth and latency costs of partitioning models

Table 3 displays the metrics related to latency and bandwidth costs for the evaluated models. The metrics related to latency are highlighted under “Number of messages” columns and the metrics related to bandwidth are highlighted under “Communication volume” columns. The statistics for both total and maximum metrics are presented. The columns “Expand” and “Fold” in the table indicate the results obtained in the pre- and post-communication stages of 2D models, respectively. Recall that the 1D models (

1D+BP

and

1D+CHG

) have a single communication stage, which is the pre-communication stage in our case. The values are averaged over 20 test matrices separately for each K . The communication volume statistics are in terms of words. Note that the total/maximum number of messages and maximum volume of communication of

CKBD+BP

and

CKBD+CHG

, and

JGD+BP

and

JGD+CHG

are the same since they use the same heuristic to distribute expand tasks. Total communication volume of two successive K values (512 and 1024, 2048 and 4096, etc.) are the same for the models that are based on checkerboard and jagged partitioning since these models have the same number of processor columns in the corresponding processor mesh. For instance, at K =512 and K =1024 , the processor meshes are of sizes 32 × 16 and 32 × 32, respectively. We also present the average speedup values obtained by models to give an idea about the efficiency. A more detailed and accurate discussion with performance profiles and speedup curves can be found in the next section.

(13)

K Model Expand Fold Sum Expand Fold Sum Expand Fold Sum Expand Fold Sum Speedup 1D+BP 2464 – 2464 17 .9 – 17 .9 164470 – 164470 728 – 728 143 1D+CHG 1640 – 1640 13 .5 – 13 .5 245514 – 245514 1234 – 1234 148 CKBD+BP 414 1728 2141 4 .8 6 .6 11 .4 38476 145554 184029 367 759 1125 142 256 CKBD+CHG 414 1423 1836 4 .8 5 .8 10 .5 38476 214657 253133 367 1176 1542 138 JGD+BP 850 1266 2117 10 .0 4 .9 14 .9 38476 112712 151188 311 599 910 158 JGD+CHG 850 1131 1982 10 .0 5 .0 15 .0 38476 165081 203557 311 959 1270 153 FG+BP 1959 1751 3710 15 .5 6 .8 22 .3 107993 48359 156352 546 275 821 150 FG+CHG 1513 1257 2770 12 .5 4 .6 17 .1 173449 74672 248121 847 377 1223 150 1D+BP 5683 – 5683 21 .6 – 21 .6 226127 – 226127 513 – 513 236 1D+CHG 3496 – 3496 15 .5 – 15 .5 327581 – 327581 829 – 829 231 CKBD+BP 1019 3590 4608 6 .4 6 .8 13 .1 57434 195114 252548 264 530 794 225 512 CKBD+CHG 1019 2885 3904 6 .4 6 .6 13 .0 57434 280548 337981 264 804 1067 220 JGD+BP 2088 2604 4692 11 .7 5 .6 17 .3 57434 147488 204922 220 405 626 247 JGD+CHG 2088 2267 4355 11 .7 4 .5 16 .2 57434 212002 269436 220 609 829 238 FG+BP 4261 3784 8045 17 .7 8 .0 25 .7 144 96 8 70492 215460 374 205 580 228 FG+CHG 3165 2622 5787 13 .8 4 .8 18 .6 228987 106423 335410 566 273 839 229 1D+BP 13201 – 13201 27 .0 – 27 .0 314109 – 314109 359 – 359 294 1D+CHG 7451 – 7451 16 .9 – 16 .9 441340 – 441340 568 – 568 343 CKBD+BP 1587 9624 11211 5 .9 9 .1 15 .0 57434 288909 346343 167 399 567 320 1024 CKBD+CHG 1587 6897 8484 5 .9 7 .1 13 .0 57434 404550 461984 167 534 701 320 JGD+BP 3571 6775 10346 11 .7 7 .4 19 .1 57434 223665 281099 138 314 452 354 JGD+CHG 3571 5266 8837 11 .7 5 .2 16 .9 57434 314303 371737 138 430 568 341 FG+BP 9286 8141 17427 20 .6 8 .5 29 .1 192165 103683 295848 252 153 405 318 FG+CHG 6537 5420 11957 15 .1 6 .1 21 .3 297236 152370 449606 376 201 577 327 1D+BP 30651 – 30651 30 .9 – 30 .9 437009 – 437009 256 – 256 355 1D+CHG 16028 – 16028 19 .0 – 19 .0 591207 – 591207 389 – 389 450 CKBD+BP 3885 21008 24892 7 .3 9 .4 16 .7 82863 395862 478725 116 277 393 421 2048 CKBD+CHG 3885 14275 18159 7 .3 8 .5 15 .8 82863 533056 615920 116 362 478 429 JGD+BP 8301 14621 22921 14 .1 7 .4 21 .5 82863 297989 380852 99 214 313 468 JGD+CHG 8301 10833 19134 14 .1 5 .6 19 .6 82863 404389 487252 99 284 383 457 FG+BP 20050 17669 37719 22 .3 10 .3 32 .6 251398 154339 405737 171 116 287 412 FG+CHG 13494 11272 24766 16 .3 8 .2 24 .5 379611 221040 600651 243 152 395 440 1D+BP 71313 – 71313 37 .0 – 37 .0 615662 – 615662 185 – 185 372 1D+CHG 36563 – 36563 21 .8 – 21 .8 810012 – 810012 267 – 267 530 CKBD+BP 6122 55186 61308 6 .7 13 .5 20 .3 82863 579257 662121 72 209 280 511 4096 CKBD+CHG 6122 32230 38352 6 .7 8 .7 15 .4 82863 747663 830526 72 252 324 556 JGD+BP 13440 38102 51543 12 .9 10 .4 23 .2 82863 449614 532477 62 163 225 584 JGD+CHG 13440 24711 38151 12 .9 6 .7 19 .5 82863 584 4 42 667306 62 203 264 586 FG+BP 43084 38239 81323 24 .8 11 .0 35 .7 326646 228666 555312 114 87 201 516 FG+CHG 28683 24606 53289 18 .0 8 .8 26 .7 486581 323623 810204 160 117 278 554 1D+BP 160507 – 160507 42 .9 – 42 .9 871090 – 871090 137 – 137 392 1D+CHG 81592 – 81592 26 .0 – 26 .0 1040719 – 1040719 175 – 175 560 CKBD+BP 14589 125660 140249 7 .9 15 .7 23 .6 115948 814437 930384 49 149 198 593 8192 CKBD+CHG 14589 70942 85531 7 .9 10 .7 18 .6 115948 1003817 1119764 49 171 220 667 JGD+BP 28838 85171 114009 13 .9 11 .4 25 .2 115948 618294 734242 43 114 157 683 JGD+CHG 28838 52882 81720 13 .9 6 .5 20 .3 115948 769539 885486 43 137 180 701 FG+BP 90483 81319 171801 26 .3 13 .3 39 .6 430725 327017 757741 77 65 142 591 FG+CHG 59692 52380 112072 19 .8 10 .1 29 .9 572597 432085 1004682 99 84 183 663 The bold values in five columns total/maximum number of messages (Sum), total/maximum communication volume (Sum) and Speedup indicate the best values obtained by the respective model at a specific K .

The major factors that determine overall latency and bandwidth costs are the maximum number of messages and the maximum volume of communication handled by a single processor, respectively. As seen in Table 3 , with increasing num- ber of processors, the maximum number of messages increases sharply, whereas maximum volume decreases despite the increase in total volume. For instance, for

1D+BP

, when K increases from 256 to 8192 processors, the maximum number of messages increases from 17.9 to 42.9, whereas maximum volume decreases from 728 to 137 words, on average. Hence, the latency overhead on average increases by a factor of 2.4, whereas the bandwidth overhead on average decreases by a factor of 5.3. Moreover, the total message count increases more sharply compared to total volume: 65.1 times versus 5.3 times. These figures imply that with increasing number of processors, latency costs steadily become more important than band- width costs in determining overall communication cost of parallel SpMV operations. Hence, reducing latency costs should pay off with improved scalability, as will be seen in the following section. Observe that similar arguments hold for other partitioning models as well.

(14)

Table 4

Comparison of partitioning models with communication hypergraphs normalized with respect to their baseline counterparts averaged over all matrices for each K .

Number of messages Communication volume

Total Maximum Total Maximum

K Model Expand Fold Sum Expand Fold Sum Expand Fold Sum Expand Fold Sum 1D+CHG 0 .76 – 0 .76 0 .85 – 0 .85 1 .51 – 1 .51 1 .66 – 1 .66 256 CKBD+CHG 1 .00 0 .86 0 .89 1 .00 1 .01 0 .98 1 .00 1 .49 1 .41 1 .00 1 .54 1 .38 JGD+CHG 1 .00 0 .95 0 .98 1 .00 1 .21 1 .09 1 .00 1 .45 1 .35 1 .00 1 .53 1 .36 FG+CHG 0 .87 0 .85 0 .86 0 .96 0 .81 0 .90 1 .56 1 .51 1 .56 1 .53 1 .26 1 .47 1D+CHG 0 .72 – 0 .72 0 .84 – 0 .84 1 .48 – 1 .48 1 .62 – 1 .62 512 CKBD+CHG 1 .00 0 .85 0 .88 1 .00 1 .41 1 .08 1 .00 1 .46 1 .37 1 .00 1 .52 1 .35 JGD+CHG 1 .00 0 .92 0 .96 1 .00 0 .97 0 .97 1 .00 1 .42 1 .32 1 .00 1 .50 1 .33 FG+CHG 0 .85 0 .84 0 .84 0 .94 0 .80 0 .88 1 .55 1 .50 1 .54 1 .52 1 .23 1 .44 1D+CHG 0 .67 – 0 .67 0 .77 – 0 .77 1 .44 – 1 .44 1 .61 – 1 .61 1024 CKBD+CHG 1 .00 0 .78 0 .81 1 .00 1 .03 0 .97 1 .00 1 .44 1 .38 1 .00 1 .37 1 .27 JGD+CHG 1 .00 0 .86 0 .91 1 .00 0 .88 0 .94 1 .00 1 .41 1 .34 1 .00 1 .42 1 .29 FG+CHG 0 .81 0 .81 0 .80 0 .86 1 .37 0 .87 1 .53 1 .48 1 .51 1 .51 1 .24 1 .43 1D+CHG 0 .64 – 0 .64 0 .74 – 0 .74 1 .40 – 1 .40 1 .57 – 1 .57 2048 CKBD+CHG 1 .00 0 .74 0 .77 1 .00 1 .02 0 .99 1 .00 1 .39 1 .33 1 .00 1 .33 1 .23 JGD+CHG 1 .00 0 .82 0 .87 1 .00 0 .94 0 .95 1 .00 1 .37 1 .30 1 .00 1 .37 1 .25 FG+CHG 0 .77 0 .77 0 .77 0 .82 0 .96 0 .84 1 .51 1 .45 1 .49 1 .46 1 .32 1 .41 1D+CHG 0 .65 – 0 .65 0 .68 – 0 .68 1 .39 – 1 .39 1 .54 – 1 .54 4096 CKBD+CHG 1 .00 0 .66 0 .69 1 .00 0 .77 0 .84 1 .00 1 .35 1 .31 1 .00 1 .27 1 .19 JGD+CHG 1 .00 0 .74 0 .80 1 .00 0 .82 0 .90 1 .00 1 .33 1 .29 1 .00 1 .33 1 .23 FG+CHG 0 .77 0 .78 0 .77 0 .82 0 .85 0 .81 1 .50 1 .46 1 .49 1 .45 1 .38 1 .42 1D+CHG 0 .63 – 0 .63 0 .69 – 0 .69 1 .35 – 1 .35 1 .47 – 1 .47 8192 CKBD+CHG 1 .00 0 .64 0 .67 1 .00 0 .79 0 .84 1 .00 1 .30 1 .26 1 .00 1 .23 1 .16 JGD+CHG 1 .00 0 .69 0 .76 1 .00 0 .70 0 .85 1 .00 1 .29 1 .25 1 .00 1 .30 1 .21 FG+CHG 0 .76 0 .76 0 .76 0 .82 0 .75 0 .78 1 .47 1 .42 1 .45 1 .44 1 .40 1 .42

If we compare the partitioning models that do not use the communication hypergraph model among themselves (i.e.,

1D+BP

,

CKBD+BP

,

JGD+BP

and

FG+BP

) in terms of total communication volume, we see from Table 3 that

JGD+BP

ob- tains the best results, whereas

CKBD+BP

obtains the worst results.

FG+BP

is expected to achieve the best results in this metric since it offers the highest flexibility by performing the partitioning on a nonzero basis – the finest granularity avail- able. However, the reason why

JGD+BP

achieves slightly better results than

FG+BP

in this metric is related not to models themselves but to the shortcomings of recursive bisectioning used in partitioning. The shortcomings of recursive biparti- tioning are well known for high partitioning values [64,65] . For example at K =4096 ,

FG+BP

directly partitions the input matrix into 4096 parts whereas

JGD+BP

first partitions it into 64 parts, and for each of these parts, it partitions them into 64 parts again to obtain a 4096-way partition. Hence, by using smaller partition values compared to

FG+BP

,

JGD+BP

is relatively able to mitigate the drawbacks of recursive bisection. The poor performance of

CKBD+BP

in this metric is due to the use of multi-constraint partitioning. This limits the search space drastically, where the higher the number of constraints, the harder it is to get good quality partitions as the search space narrows down with increasing number of constraints. However, this is a tradeoff for

CKBD+BP

as it often achieves good results in total message count, which are comparable to those of

JGD+BP

at lower processor counts. At higher processor counts, like 4096 and 8192,

JGD+BP

achieves better results in total message count. This is again because the high number of constraints at high values of K leads to poor total volume in

CKBD+BP

, which in turn affects the total message count as a side effect by causing an increase. As expected, the smallest maximum number of messages is obtained by

CKBD+BP

as it bounds the communication to specific rows and columns of the processor mesh in both stages of communication.

FG+BP

is often the worst in terms of total and maximum number of messages because it causes increases in these metrics in order to reduce total volume.

To aid the assessment of benefits of the communication hypergraph, we present Table 4 . In this table, each latency- improved (

CHG

-enhanced) model’s performance metrics are normalized with respect to those of their baselines. In other words, the results of

1D+CHG

are normalized with respect to those of

1D+BP

, the results of

CKBD+CHG

are normalized with respect to those of

CKBD+BP

, etc. The normalization is performed on a matrix basis and the averages of these normal- ized values over 28 matrices are given separately for each K . As seen from the table,

CHG

-enhanced models improve the total message count drastically, as minimizing this metric is one of the main objectives in these models. For example at 2048 pro- cessors,

1D+CHG

achieves 36% improvement over

1D+BP

,

CKBD+CHG

achieves 23% improvement over

CKBD+BP

,

JGD+CHG

achieves 13% improvement over

JGD+BP

, and

FG+CHG

achieves 23% improvement over

FG+BP

. However, this comes at the cost of increased total volume. Again at 2048 processors,

1D+CHG

increases the total volume by 40%,

CKBD+CHG

by 33%,

JGD+CHG

by 30%, and

FG+CHG

by 49%. The crucial observation, however, is that the message count improvements of par- titioning models that rely on the communication hypergraph model tend to increase with increasing number of processors. For example,

1D+CHG

achieves a 24% improvement over

1D+BP

at K =256 in total message count and this improvement be- comes 37% at K = 8192 . This improvement increases from 11% to 33%, 2% to 24%, and 14% to 24% for

CKBD+CHG

,

JGD+CHG

(15)

32.43 33.79 21.68 114.22

CHG 1.68 0.98 1.13 1.98

and

FG+CHG

, respectively. This implies that the benefits obtained using the communication hypergraph model are more prominent at higher processor counts. Note that for

CKBD+CHG

and

JGD+CHG

, normalized total message average is closer to fold message average rather than expand message average, which is also the case for normalized average volume. This is because the message count and communication volume in the post-communication stage are much higher than the pre- communication stage in these models. This is also where the communication hypergraph model is expected to perform well.

The models that use the communication hypergraph model improve the maximum number of messages as well. This is a consequence of the reduction in total message count. In Table 3 , if we compare partitioning models in this metric, it can be seen that

CKBD+CHG

obtains the best results which is usually followed by

JGD+CHG

. For example, at 8192 processors on average, the maximum number of messages handled by a single processor for

CKBD+CHG

is only 18.6 and for

JGD+CHG

it is only 20.3. These values are followed by

CKBD+BP

with 23.6 and

JGD+BP

with 25.2. When we examine the other important metric the maximum volume in Table 4 , it is seen that the models that rely on the communication hypergraph model close the gap with their baseline counterparts with increasing K . For instance, when K increases from 256 to 8192 processors, the increase in maximum volume incurred by the use of the communication hypergraph model decreases from 66%, 38%, 36% and 47% to 47%, 16%, 21% and 42% in

1D+CHG

,

CKBD+CHG

,

JGD+CHG

and

FG+CHG

, respectively, compared to their baseline counterparts. This is an important benefit of the communication hypergraph model since it strives for balancing volume.

The sequential partitioning times of the evaluated models are given in Table 5 averaged over all matrices and K values. The CHG times (indicated via

CHG

row) include only the partitioning times of the communication hypergraphs formed for the respective model. As expected, the fine-grain model has the highest partitioning time as the hypergraphs formed in this model are typically larger. The partitioning times of the communication hypergraphs are quite low compared to the respective original partitionings since they are small as they contain only the vertices that correspond to the vector elements that necessitate communication. Note that

CKBD+CHG

,

JGD+CHG

and

FG+CHG

form a number of communication hypergraphs that can independently be partitioned, hence the partitioning of them can easily be parallelized. A more healthy comparison of partitioning overhead for 2D models can be found in [49] .

4.2. Speedup analysis

For a detailed comparison of the partitioning models in terms of parallel solver running times/speedups, we present the runtime performance profiles in Fig. 8 . Performance profiles provide a better understanding of the characteristics of the compared models as they capture the relative performance of the compared models more accurately [66] . A point x , y in a profile reads as the respective model is within a x factor of the best result in a y fraction of the test instances. In other words, the closer the performance profile of a scheme to the y -axis, the better it is. A test instance in our case is the parallel solver running time obtained for a specific matrix and K . We compare the performances of partitioning models for all K values in Fig. 8 a and for K 4096, 8192 in Fig. 8 b. The former contains 168 instances and the latter contains 56 instances.

When we compare the models considering all K values in Fig. 8 a,

JGD+BP

is clearly the best performing model followed by

JGD+CHG

.

JGD+BP

obtains the best results for more than 40% of the test cases and exhibits very good performance for a very large fraction of the test cases. These two models are followed by two models that use the communication hypergraph:

1D+CHG

and

FG+CHG

. Except the jagged model, applying the communication hypergraph seems to improve performance of the partitioning models as

1D+CHG

,

CKBD+CHG

and

FG+CHG

perform better than

1D+BP

,

CKBD+BP

and

FG+BP

, respectively.

1D+BP

obtains the worst results, proving itself to be not a viable partitioning model compared to the 2D models as long as the communication hypergraph is not used for it.

Fig. 8 b is presented to better assess the benefits of using the communication hypergraph model. As discussed, latency gets more important with increasing K and it is expected that the models using the communication hypergraph model should be performing better as K increases. If we consider the performances of partitioning models at only 4096 and 8192 processors, it can be seen from the figure that the models that use the communication hypergraph improve the performance much more compared to the case when all K values are considered. In other words, for example, if we compare

CKBD+BP

and

CKBD+CHG

in Fig. 8 a and Fig. 8 b the performance difference between them increases in favor of

CKBD+CHG

in Fig. 8 b. This can be observed for all partitioning models, i.e., by comparing

1D+BP

and

1D+CHG

,

CKBD+BP

and

CKBD+CHG

,

JGD+BP

and

JGD+CHG

,

FG+BP

and

FG+CHG

in Fig. 8 a and Fig. 8 b. This is also validated as

JGD+CHG

can be said to be the best performing model in Fig. 8 b followed by

JGD+BP

. These two models are again followed by two models that use the com- munication hypergraph:

FG+CHG

and

CKBD+CHG

. These figures show that the communication hypergraph proves to be a valuable method for achieving scalability.

Şekil

Fig. 1. A sample of 2D checkerboard and jagged partitionings on a 16 = 4 × 4 virtual processor mesh
Fig. 2. Formation of the communication matrix for the third column of the processor mesh (  β = 3 ) to summarize expand operations in the pre-  communication stage
Fig. 3. Formation of the communication matrix for the third row of the processor mesh (  α = 3 ) to summarize fold operations in the post-communication  stage
Fig. 4. Minimizing latency cost in checkerboard partitioning model.
+7

Referanslar

Benzer Belgeler

• The topic map data model provided for a Web-based information resource (i.e., DBLP) is a semantic data model describing the contents of the documents (i.e., DBLP

This study examined the in fluence of immersive and non-immersive VDEs on design process creativ- ity in basic design studios, through observing factors related to creativity as

Analysis of Volvo IT’s Closed Problem Management Processes By Using Process Mining Software ProM and Disco.. Eyüp Akçetin | Department of Maritime Business Administration,

The objective was to maximize the throughput of the serial production line by allocating the total fixed number of buffer slots among the buffer locations and in order to achieve

Each layer contributes a different amount to the mode confinement, roughly proportional to their thickness, and quantum dot layers have the lowest confinement factor compared to

Örneğin, Aziz Çalışlar’ın çevirdiği Sanat ve Edebiyat (1996) başlıklı derleme ile “Sol Yayınları”nın derlediği Yazın ve Sanat Üzerine (1995) başlıklı

In this study, a mechanistic cutting force model for milling CFRPs is proposed based on experimentally collected cutting force data during slot milling of unidirectional

Using this result and the filtered-graded transfer of Grobner basis obtained in [LW] for (noncommutative) solvable polynomial algebras (in the sense of [K-RW]), we are able