
REDUCING COMMUNICATION OVERHEAD IN SPARSE MATRIX AND TENSOR COMPUTATIONS

A dissertation submitted to the Graduate School of Engineering and Science of Bilkent University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering

By
Mustafa Ozan Karsavuran
August 2020

Reducing Communication Overhead in Sparse Matrix and Tensor Computations
By Mustafa Ozan Karsavuran
August 2020

We certify that we have read this dissertation and that in our opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy.

Cevdet Aykanat (Advisor)
İbrahim Körpeoğlu
Özcan Öztürk
Murat Manguoğlu
Tayfun Küçükyılmaz

Approved for the Graduate School of Engineering and Science:

Ezhan Karaşan

Copyright Information

In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of Bilkent University's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink.

© 2020 IEEE. Reprinted, with permission, from M. O. Karsavuran, S. Acer, and C. Aykanat, "Reduce operations: Send volume balancing while minimizing latency," IEEE Transactions on Parallel and Distributed Systems, June 2020.

© 2021 IEEE. Reprinted, with permission, from M. O. Karsavuran, S. Acer, and C. Aykanat, "Partitioning Models for General Medium-Grain Parallel Sparse Tensor Decomposition," IEEE Transactions on Parallel and Distributed Systems, January 2021.

ABSTRACT

REDUCING COMMUNICATION OVERHEAD IN SPARSE MATRIX AND TENSOR COMPUTATIONS

Mustafa Ozan Karsavuran
Ph.D. in Computer Engineering
Advisor: Cevdet Aykanat
August 2020

Encapsulating multiple communication cost metrics, i.e., bandwidth and latency, is proven to be important in reducing communication overhead in the parallelization of sparse and irregular applications.

The communication hypergraph model was proposed in a two-phase setting for encapsulating multiple communication cost metrics. The reduce-communication hypergraph model suffers from failing to correctly encapsulate send-volume balancing. We propose a novel vertex weighting scheme that enables part weights to correctly encode the send-volume loads of processors for send-volume balancing. The model also suffers from increasing the total communication volume during partitioning. To control this increase, we propose a method that utilizes the recursive bipartitioning (RB) paradigm and refines each bipartition by vertex swaps. For performance evaluation, we consider column-parallel SpMV, which is one of the most widely known applications in which the reduce-task assignment problem arises. Extensive experiments on 313 matrices show that, compared to the existing model, the proposed models achieve considerable improvements in all communication cost metrics. These improvements lead to an average decrease of 30 percent in parallel SpMV time on 512 processors for 70 matrices with high irregularity.

We further enhance the reduce-communication hypergraph model so that it also encapsulates the minimization of the maximum number of messages sent by a processor. For this purpose, we propose a novel cutsize metric which we realize using RB paradigm while partitioning the reduce-communication hypergraph. We also introduce a new type of net for the communication hypergraph which models decreasing the increase in the total communication volume directly with the partitioning objective. Experiments on 300 matrices show that the proposed models achieve considerable improvements in communication cost metrics which lead to better column-parallel SpMM time on 1024 processors.

We propose a hypergraph model for general medium-grain sparse tensor partitioning which does not enforce any topological constraint on the partitioning. The proposed model is based on splitting the given tensor into nonzero-disjoint component tensors. Then a mode-dependent coarse-grain hypergraph is constructed for each component tensor. A net amalgamation operation is proposed to form a composite medium-grain hypergraph from these mode-dependent coarse-grain hypergraphs to correctly encapsulate the minimization of the communication volume. We propose a heuristic which splits the nonzeros of dense slices to obtain sparse slices in component tensors. We also utilize the well-known RB paradigm to improve the quality of the splitting heuristic. We propose a medium-grain tripartite graph model with the aim of a faster partitioning at the expense of increasing the total communication volume. Parallel experiments conducted on 10 real-world tensors on up to 1024 processors confirm the validity of the proposed hypergraph and graph models.

Keywords: distributed-memory systems, parallel computing, communication cost, recursive bipartitioning, graph partitioning, hypergraph partitioning, sparse matrix, sparse tensor.

ÖZET

REDUCING COMMUNICATION OVERHEAD IN SPARSE MATRIX AND TENSOR COMPUTATIONS

Mustafa Ozan Karsavuran
Ph.D. in Computer Engineering
Advisor: Cevdet Aykanat
August 2020

Encapsulating multiple communication cost metrics (volume and latency) is important for reducing the communication overhead when parallelizing sparse and irregular applications.

The communication hypergraph model was proposed as a two-phase method that encapsulates multiple communication cost metrics. The reduce-communication hypergraph model cannot correctly model the send-volume balance among processors. With the novel vertex weighting scheme we propose, part weights correspond to the actual send volumes, so the send-volume loads of the processors are balanced. Moreover, this hypergraph model causes an increase in the total communication volume during partitioning. To reduce this increase, we propose a method that utilizes the recursive bipartitioning (RB) paradigm and swaps vertices during each bipartitioning step. For performance evaluation, we use column-parallel SpMV, one of the most widely known applications in which the reduce-task assignment problem arises. Extensive experiments on 313 matrices show that the proposed methods achieve considerable improvements in all communication cost metrics compared to the existing model. These improvements yield, on average, 30% shorter parallel running times on 512 processors for 70 SpMV instances with high irregularity.

We further enhance the communication hypergraph model so that it also encapsulates the maximum number of messages sent by a processor. For this purpose, we propose a novel cutsize metric, realized via the RB paradigm, to be used while partitioning the hypergraph. We also propose a new type of net that models the increase in the total communication volume directly within the partitioning objective in order to reduce it. In experiments on 300 matrices for 1024 processors, the proposed methods achieve considerable improvements in the communication cost metrics, and these improvements lead to better column-parallel SpMM running times.

For sparse tensor partitioning, we propose a general medium-grain hypergraph model that does not impose any topological constraint on the partitioning. The proposed model is based on splitting the given tensor into nonzero-disjoint component tensors. Then, for each component tensor, we construct a mode-dependent coarse-grain hypergraph. With the proposed net amalgamation method, we form a composite medium-grain hypergraph from these mode-dependent coarse-grain hypergraphs. The proposed composite medium-grain hypergraph model correctly encapsulates the minimization of the total communication volume. With the proposed splitting heuristic, we split the nonzeros of the dense slices of the tensor to obtain sparse slices in the component tensors. We also use the RB paradigm so that the splitting heuristic obtains higher-quality results. With the proposed medium-grain tripartite graph model, we obtain faster partitioning times at the expense of increasing the total communication volume. Parallel experiments conducted on 10 real-world tensors on up to 1024 processors confirm the validity of the proposed methods.

Keywords: distributed-memory systems, parallel computing, communication cost, recursive bipartitioning, graph partitioning, hypergraph partitioning, sparse matrix, sparse tensor.

Acknowledgement

First of all, I would like to express my deepest and greatest thanks to my advisor Professor Cevdet Aykanat for being a great person in all possible aspects. His guidance and suggestions taught me a lot while performing the research for this thesis. His attitude to research and life will be a guiding light in my future academic career.

I would like to thank Dr. Seher Acer for being a great research fellow, sharing her great ideas and experience, and being a good friend. Her contributions to this thesis are invaluable. Also, I would like to thank all colleagues in the Bilkent Parallel Group for being a great team.

I would like to thank Dr. M. Mustafa Özdal for sharing his great ideas which improved our research.

I would like to thank Dr. İbrahim Körpeoğlu, Dr. Tayfun Küçükyılmaz, Dr. Özcan Öztürk, and Dr. Murat Manguoğlu for reading and providing valuable feedback on this thesis.

I am also grateful to my whole family for their support and endless love. I feel very lucky to have such a great family.

I would like to thank the Scientific and Technological Research Council of Turkey (TÜBİTAK) 1001 program for supporting me throughout my Ph.D. studies in the EEEAG-114E545, EEEAG-116E043, and EEEAG-119E035 projects.

Contents

1 Introduction
2 Background
  2.1 Hypergraph/Graph Partitioning Problem
    2.1.1 HP with fixed vertices
  2.2 Recursive Bipartitioning Paradigm
3 Reduce Operations: Send Volume Balancing While Minimizing Latency
  3.1 Communication Hypergraph for Reduce Operations
    3.1.1 Reduce-Task Assignment Problem
    3.1.2 Reduce-Communication Hypergraph Model
    3.1.3 Deficiencies of Reduce-Communication Hypergraph
  3.2 Correct Reduce-Communication Hypergraph Model
    3.2.1 A Novel Vertex Weighting Scheme
    3.2.2 Eliminating Outcast Vertices via Recursive Bipartitioning
  3.3 Experiments
    3.3.1 Test Application: Column-Parallel SpMV
    3.3.2 Setup
    3.3.3 Dataset
    3.3.4 Results
  3.4 Conclusion
4 Reduce Operations: Minimizing and Balancing Latency
  4.1 Enhanced Reduce-Communication Hypergraph Model
    4.1.1 A Novel Cutsize Metric to Minimize Maximum Number of Messages
    4.1.2 Minimizing Outcast Vertices via Volume Nets
  4.2 Experiments
    4.2.1 Test Application: Column-Parallel SpMM
    4.2.2 Setup
    4.2.3 Dataset
    4.2.4 Results
  4.3 Conclusion
5 Partitioning Models for General Medium-Grain Parallel Sparse Tensor Decomposition
  5.1 Related Work
  5.2 Contributions
  5.3 Background Information
    5.3.1 Tensor Notation
    5.3.2 Canonical Polyadic Decomposition
    5.3.3 Parallel Medium-Grain CPD-ALS Algorithm
  5.4 Medium-Grain Hypergraph Model
    5.4.1 Splitting Framework
    5.4.2 Splitting Heuristic
    5.4.3 Discussion
    5.4.4 Recursive-Bipartitioning Scheme
  5.5 Medium-Grain Tripartite Graph Model
  5.6 Experiments
    5.6.1 Setup
    5.6.2 Dataset
    5.6.3 Hypergraph Size Comparison
    5.6.4 Parallel Performance Comparison
    5.6.5 Preprocessing Overhead
  5.7 Conclusion

List of Figures

3.1 (a) Three processors, six reduce tasks and partial results in between and (b) a partition of reduce tasks in (a) (R_k assigned to p_k)
3.2 Reduce-communication hypergraph of the reduce-task assignment problem given in Figure 3.1a
3.3 A partition of the reduce tasks shown in Figure 3.1a with an outcast reduce task (r_4)
3.4 A balanced partition of the reduce-communication hypergraph given in Figure 3.2 with proposed vertex weights
3.5 Among vertices v_g^r, v_h^r, v_i^r, only v_i^r is a candidate although both v_h^r and v_i^r are outcast vertices
3.6 Strong scaling curves for column-parallel SpMV obtained by RC_b and RC^s_vw
3.7 Variation of communication cost metrics for RC^s_vw with respect to RC_b for CV > 0.50 (top) and CV > 0.10 (bottom)
4.1 A four-way partition of a sample reduce-communication hypergraph; the connectivity of cut net n_k signifies the processors to which p_k will send messages
4.2 Updating the cost of net n according to bipartitionings in the RB tree
4.3 A reduce-communication hypergraph with three processors and four reduce tasks augmented with proposed volume nets whose pins are shown with dashed lines
4.4 Strong scaling curves for column-parallel SpMM obtained by RC^s_vw and RC_Vn
5.1 Split tensor, coarse-grain mode-dependent hypergraphs and medium-grain hypergraph; "6" denotes the vanishing vertices and nets
5.2 Medium-grain tripartite graph obtained from H_MG(X) in Figure 5.1
5.3 Performance profile for parallel CPD-ALS and partitioning times
5.4 Strong scaling curves for CPD-ALS obtained by CartHP, MGHP, and MGTGP up to P = 1024 processors and R = 64

List of Tables

3.1 Properties of Test Matrices and Their Reduce-Communication Hypergraphs
3.2 Outcast Vertex Elimination for P = 512
3.3 Comparison of Communication Metrics and Parallel Runtimes for P = 512
3.4 Sequential 512-way Partitioning Times (seconds)
4.1 Properties of Test Matrices and Their Reduce-Communication Hypergraphs
4.2 Outcast Vertex Elimination for P = 1024
4.3 Comparison of Communication Metrics for P = 1024
5.1 Size Comparison of Coarse-, Fine-, and Medium-Grain Models
5.2 Properties of the Test Tensors
5.3 Hypergraph Size Comparison for CG, FG, and MG Models
5.4 Comparison of Communication Metrics and Parallel CPD-ALS Time of FGHP and MGHP for P = 512 and R = 64
5.5 Comparison of Communication Metrics and Parallel CPD-ALS Time of CartHP, MGHP, and MGTGP for P = 512 and R = 64
5.6 Comparison of Parallel CPD-ALS Time of CartHP, MGHP, and MGTGP for R = {64, 32, 16, 8} and P = 512
5.7 Sequential 512-Way Partitioning Overhead in terms of Sequential CPD-ALS Factorization Time for R = 64
5.8 Average Sequential 512-Way Partitioning Overhead in terms of Sequential CPD-ALS Factorization Time

List of Algorithms

1 RB with Swap
2 SWAP-OUTCAST
3 RB utilizing novel cutsize metric
4 CPD-ALS
5 Parallel CPD-ALS
6 SPLIT-TENSOR


Chapter 1

Introduction

Scaling sparse and irregular applications on distributed-memory parallel systems requires encapsulating multiple communication cost metrics. Communication cost metrics are categorized under bandwidth and latency costs. Bandwidth costs consist of total communication volume and maximum communication volume handled by a processor, whereas latency costs consist of total number of messages and maximum number of messages handled by a processor. In this thesis, we propose graph/hypergraph partitioning based models and methods that effectively reduce these communication costs of sparse matrix and tensor computations.

Sparse matrix and tensor computations are important kernels for many scientific and engineering applications [1, 2, 3, 4, 5, 6, 7, 8], machine learning algorithms [9, 10, 11], and recommender systems [12, 13]. In this thesis, we focus on sparse matrix-vector multiplication (SpMV), sparse matrix-dense matrix multiplication (SpMM), and canonical polyadic sparse tensor decomposition utilizing alternating least squares (CPD-ALS).

SpMV is a widely used kernel operation in a variety of sparse iterative solvers such as the Conjugate Gradient [1, 2, 3, 4, 5, 6]. On distributed-memory systems, parallel SpMV is considered to be a latency-bound operation. However, bandwidth costs can also be a bottleneck for the SpMV operation. Especially the maximum communication volume handled by a processor becomes an important metric for some instances. In Chapter 3, we focus on column-parallel SpMV, which incurs reduce operations on the output vector. We enhance the reduce-communication hypergraph model [14] so that it encapsulates the maximum communication volume of the column-parallel SpMV operation.

SpMM is also a widely used kernel operation in block iterative solvers such as block Conjugate Gradient variants [7], block Lanczos [8] methods and multi-source breadth-first search algorithms [15, 16]. On distributed-memory systems, the performance of parallel SpMM is mainly affected by the bandwidth costs since the amount of communication volume scales with the column size of the dense input and output matrix. However, latency costs, especially the maximum number of messages handled by a processor, are still important. In Chapter 4, we focus on column-parallel SpMM operation which also incurs reduce operations on the output matrix. We further enhance the reduce-communication hypergraph model [14] so that it also encapsulates the maximum number of messages handled by a processor in the column-parallel SpMM operation.

Sparse tensor decomposition is an important kernel for machine learning applications [9, 10, 11] and recommender systems [12, 13]. The parallel CPD-ALS operation on distributed-memory systems becomes a bandwidth-bound application for high-rank tensors, since the amount of communication scales with the rank. In Chapter 5, we propose medium-grain graph and hypergraph partitioning based methods for reducing the total communication volume with less preprocessing overhead compared to the state-of-the-art method [17].

In all of the above-mentioned methods, we utilize the recursive-bipartitioning (RB) paradigm for graph and hypergraph partitioning. The RB paradigm is commonly used by well-known tools [18, 19, 20] for obtaining multi-way partitions of graphs or hypergraphs. In the RB paradigm, the graph/hypergraph is partitioned into two parts (bipartitioned) and then the resulting graphs/hypergraphs induced by these parts are recursively partitioned into two parts until the desired number of parts is obtained. The RB paradigm enables many opportunities for optimizing the partitioning which are not possible with direct multi-way partitioning. For instance, it is possible to refine the bipartition at each RB step according to a metric which is not encapsulated by the objective of the partitioning. Another example is re-adjusting the costs and/or constraints of the bipartitioning according to the current outcome.

The rest of the thesis is organized as follows: Chapter 2 gives background information on graph and hypergraph partitioning and their variants, as well as the RB paradigm. Chapter 3 presents communication volume balancing while minimizing the latency for reduce operations. Chapter 4 presents latency balancing and minimization for reduce operations. Chapter 5 presents medium-grain partitioning for distributed-memory parallel sparse tensor decomposition. Chapter 6 concludes the thesis.

Chapter 2

Background

2.1 Hypergraph/Graph Partitioning Problem

A hypergraph H = (V, N) is defined as a set V of vertices and a set N of nets. Each net n connects a subset of vertices, which is denoted by Pins(n). The set of nets that connect vertex v is denoted by Nets(v). In H, each vertex v is assigned a weight w(v) and each net n is assigned a cost c(n).

A tripartite graph G = (V_A ∪ V_B ∪ V_C, E) is defined as three disjoint sets V_A, V_B, V_C of vertices and a set E of edges. Each edge (u, v) connects two vertices from different sets of vertices. Adj(v) is used to denote the set of vertices adjacent to vertex v. In G, each vertex v is assigned a weight w(v) and each edge (u, v) is assigned a cost c((u, v)).

Hypergraph and graph partitioning are defined as finding a P -way partition Π of vertices with the objective of minimizing the cutsize while maintaining a balance constraint on the part weights.

The same balancing constraint applies to both hypergraph and graph partitioning as follows. Let Π = {V_1, V_2, . . . , V_P} denote a P-way partition of the vertex set. The weight W(V_k) of a part V_k is defined as the sum of the weights of the vertices in V_k. That is,

    W(V_k) = Σ_{v ∈ V_k} w(v).   (2.1)

A P-way vertex partition of H or G is said to satisfy the partitioning constraint if

    W(V_k) ≤ W_avg (1 + ε)   (2.2)

for each part V_k in Π, for a given maximum allowed imbalance ratio ε. Here W_avg denotes the weight of each part under perfect balance, that is,

    W_avg = W_tot / P, where W_tot = Σ_{k=1}^{P} W(V_k).   (2.3)

In a given partition Π of a hypergraph, net n_j is said to connect part V_k if it connects at least one vertex in V_k. The connectivity set of n_j is defined as the set of parts connected by n_j. The connectivity of n_j, λ(n_j), denotes the number of parts connected by n_j. n_j is said to be cut if it connects more than one part, i.e., λ(n_j) > 1, and uncut otherwise. Then the cutsize is defined as

    cutsize(Π) = Σ_{n ∈ N} (λ(n) − 1) c(n).   (2.4)

This is referred to as the connectivity cutsize metric in the literature [19], where each net incurs its connectivity minus one, scaled by its cost, to the cutsize.

In a given partition Π of a (tripartite) graph, an edge (u, v) is said to be cut if u and v belong to two different parts. The set of cut edges in Π is denoted as E_cut. Then the cutsize is defined as

    cutsize(Π) = Σ_{(u,v) ∈ E_cut} c((u, v)).   (2.5)
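To make these definitions concrete, the following short Python sketch (not part of the thesis; the representation of a hypergraph by pin lists and a part array is an assumption made here for illustration) computes the connectivity cutsize (2.4) and checks the balance constraint (2.2) for a given partition:

    # pins[n]: set of vertices connected by net n; part[v]: part id of vertex v.
    def connectivity_cutsize(pins, cost, part):
        total = 0
        for n, pin_set in enumerate(pins):
            connected_parts = {part[v] for v in pin_set}   # connectivity set of net n
            total += (len(connected_parts) - 1) * cost[n]  # (lambda(n) - 1) * c(n)
        return total

    def satisfies_balance(weight, part, P, eps):
        W = [0.0] * P
        for v, p in enumerate(part):
            W[p] += weight[v]                              # part weights W(V_k), eq. (2.1)
        W_avg = sum(W) / P                                 # eq. (2.3)
        return all(Wk <= W_avg * (1 + eps) for Wk in W)    # eq. (2.2)

For the bipartitions used in the RB steps later in the thesis, P = 2 and part[v] ∈ {0, 1}.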

2.1.1 HP with fixed vertices

The HP problem with fixed vertices is a version of the HP problem in which the assignments of some vertices are determined before partitioning. These vertices are called fixed vertices, and F_k denotes the set of vertices that are fixed to part V_k. At the end of the partitioning, vertices in F_k remain in V_k, i.e., F_k ⊆ V_k. The rest of the vertices are called free vertices.

2.2 Recursive Bipartitioning Paradigm

The recursive-bipartitioning (RB) paradigm is commonly used by successful graph/hypergraph partitioning tools [18, 19, 20] for obtaining multi-way partitions of graphs or hypergraphs. In the RB paradigm, the graph/hypergraph is partitioned into two parts (bipartitioned) and then the resulting graphs/hypergraphs induced by these parts are recursively partitioned into two parts until the desired number of parts is obtained. Here, we describe how the RB paradigm works for partitioning a hypergraph into a P-way partition. Without loss of generality, we assume that the number P of processors is an exact power of 2.

In the RB paradigm, the given hypergraph is bipartitioned into two subhypergraphs, which are further bipartitioned recursively until P parts are obtained. This procedure produces a complete binary tree with log2(P) levels, which is referred to as the RB tree. The 2^ℓ hypergraphs in the ℓth level of the RB tree are denoted by H_0^ℓ, . . . , H_{2^ℓ−1}^ℓ from left to right, for 0 ≤ ℓ ≤ log2(P).

A bipartition Π_2 = {V_L, V_R} of an ℓth-level hypergraph H_k^ℓ forms two new vertex-induced subhypergraphs H_{2k}^{ℓ+1} = (V_L, N_L) and H_{2k+1}^{ℓ+1} = (V_R, N_R), both in level ℓ + 1. Here, V_L and V_R respectively refer to the left and right parts of the bipartition. Internal nets of the left and right parts are assigned to net sets N_L and N_R as is, respectively, whereas the cut nets are assigned to both net sets, only with the pins found in the respective part. That is, N_L = {n_i : Pins(n_i) ∩ V_L ≠ ∅}, where Pins(n_j^L) = Pins(n_j) ∩ V_L for each n_j^L ∈ N_L, whereas N_R is formed in a dual manner. This way of forming N_L and N_R is known as the cut-net splitting method [19], which is proposed to encode the connectivity cutsize metric (2.4) in the final P-way partition. These split nets are also called subnets.
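The cut-net splitting step can be sketched as follows in Python (illustrative only; the function name and data layout are assumptions, not the thesis code, and vertex renumbering in the subhypergraphs is omitted). Given a bipartition stored in part[v] ∈ {0, 1}, it forms the pin lists of the left and right subhypergraphs, splitting each cut net into two subnets:

    # Form left/right subhypergraph pin lists from a bipartition via cut-net splitting.
    # pins[n]: set of vertices of net n; part[v]: 0 (left part V_L) or 1 (right part V_R).
    def cut_net_split(pins, part):
        left_nets, right_nets = [], []
        for pin_set in pins:
            left_pins  = {v for v in pin_set if part[v] == 0}
            right_pins = {v for v in pin_set if part[v] == 1}
            if left_pins:                  # internal net of V_L, or left subnet of a cut net
                left_nets.append(left_pins)
            if right_pins:                 # internal net of V_R, or right subnet of a cut net
                right_nets.append(right_pins)
        return left_nets, right_nets

A cut net contributes one subnet to each side, which is exactly what lets the connectivity cutsize (2.4) accumulate correctly over the RB levels.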

Chapter 3

Reduce Operations: Send Volume Balancing While Minimizing Latency

Several successful partitioning models and methods have been proposed for efficient parallelization of irregular applications on distributed-memory systems. These partitioning models and methods aim at reducing communication overhead while maintaining computational load balance [19, 22, 23, 24, 25, 26]. Encapsulating multiple communication cost metrics is proven to be important in reducing communication overhead for scaling irregular applications [14, 27, 28, 29, 30, 31, 32, 33, 34].

The communication hypergraph model was proposed for modeling the minimization of multiple communication cost metrics in a two-phase setting [14, 27, 28, 29]. This model was first proposed by Uçar and Aykanat [14] for parallel sparse matrix-vector multiplication (SpMV) based on one-dimensional (1D) partitioning of sparse matrices. Later, this model was extended for two-dimensional (2D) fine-grain partitioned sparse matrices [27] and for checkerboard and 2D-jagged partitioned sparse matrices [28]. Communication hypergraph models were also developed for parallel sparse matrix-matrix multiplication operations based on 1D partitions [29].

The communication hypergraph model encapsulates multiple communication cost metrics in a two-phase setting as follows. In the first phase, computational-task-to-processor assignment is performed with the objective of minimizing total communication volume while maintaining computational load balance. Several successful graph/hypergraph models and methods have been proposed for the first phase [19, 23, 25, 27, 29, 35, 36]. In the second phase, communication-task-to-processor assignment is performed with the objective of minimizing the total number of messages while maintaining communication-volume balance. The computational-task-to-processor assignment obtained in the first phase determines the communication tasks to be distributed in the second phase. The communication hypergraph model was proposed for assigning these communication tasks to processors in the second phase.

In the communication hypergraph model, vertices represent communication tasks (expand and/or reduce tasks) and hyperedges (nets) represent processors where each net is anchored to the respective part/processor via a fixed vertex. The partitioning objective of minimizing the cutsize correctly encapsulates minimizing the total number of messages, i.e., total latency cost.

In this model, the partitioning constraint of maintaining balance on the part weights aims to encode maintaining balance on the communication volume loads of the processors. Communication volume balancing is expected to decrease the communication load of the maximally loaded processor. The communication volume load of a processor is considered as its send-volume load, whereas the receive-volume load is omitted with the assumption that each processor has enough local computation to overlap with incoming messages in the network [14, 27, 28, 29]. An accurate vertex weighting scheme is needed for part weights to encode the send-volume loads of processors. The vertex weighting scheme proposed for the expand-communication hypergraph enables the part weights to correctly encode the send-volume loads of processors [14]. However, the vertex weighting scheme proposed for the reduce-communication hypergraph fails to encode the send-volume loads of processors, as already reported in [14]. The authors of [14] explicitly stated that their partitioning constraint corresponds to approximate send-volume load balancing and reported this approximation to be a reasonable one only if net degrees are close to each other.

In this work, in order to address the above-mentioned deficiency of the reduce-communication hypergraph model, we propose a novel vertex weighting scheme so that a part weight becomes exactly equal to the send-volume load of the respective processor. The proposed vertex weighting scheme involves negative vertex weights. Since the current implementations of hypergraph partitioning tools do not support negative vertex weights, we propose a vertex reweighting scheme to transform all vertex weights to positive values.

The communication hypergraph models also suffer from outcast vertex assignment. In a partition, a vertex assigned to a part is said to be outcast if it is not connected by the net anchored to that part. Outcast vertices have the following adverse effects: First, communication volume increases with increasing number of outcast vertices, so that balancing the communication volume loads of processors begins to loosely relate to minimizing the maximum communication volume load. Second, the correctness of the proposed vertex weighting scheme may decrease with increasing number of outcast vertices. So the number of outcast vertices should be reduced as much as possible to avoid these adverse effects.

In this work, we also propose a method for decreasing the number of outcast vertices during the partitioning of the reduce-communication hypergraph. The proposed method utilizes the well known recursive bipartitioning (RB) paradigm. After each RB step, the proposed method refines the bipartition by swapping outcast vertices so that they are not outcast anymore. This method involves swapping as many outcast vertices as possible without increasing the cutsize and without disturbing the balance of the current bipartition.

For evaluating the performance of the proposed models, we consider column-parallel SpMV, which is one of the most widely known applications in which the reduce-task assignment problem arises. We conduct extensive experiments on the reduce-communication hypergraphs obtained from 1D column-wise partitioning of 313 sparse matrices. The performance of the proposed models is reported and discussed both in terms of the multiple communication cost metrics attained for column-parallel SpMV and in terms of the runtime of column-parallel SpMV on a distributed-memory system. Compared to the baseline model, the proposed model achieves an average improvement of 30% in parallel SpMV time on 512 processors for 70 matrices that have a high level of irregularity.

The rest of the chapter is organized as follows: Section 3.1 defines the reduce-task assignment problem, gives the background material on the reduce-communication hypergraph model, and then explains its above-mentioned deficiencies. In Section 3.2 we propose a novel vertex weighting scheme that encapsulates the send-volume loads of processors and an algorithm that eliminates the outcast vertices. Section 3.3 presents experiments, and Section 3.4 concludes.

3.1 Communication Hypergraph for Reduce Operations

In this work, the discussion uses the reduce-task assignment problem and the reduce-communication hypergraph model interchangeably; the two formulations are equivalent and both are given for better understanding.

3.1.1 Reduce-Task Assignment Problem

Assume that the target application to be parallelized involves computational tasks that produce partial results for possibly multiple data elements. Also assume that the computational-task-to-processor assignment has already been determined in the first phase. Based on this assignment, if there are at least two processors that produce a partial result for an output data element, then those results are reduced to obtain a final value through communication. Here and hereinafter, reducing the partial results to a final value is referred to as a reduce task. Each reduce task is assigned to a processor, which is the sole processor that holds the final value of the respective output data element.

Let R = {r_1, r_2, . . . , r_n} denote the set of reduce tasks for which at least two processors produce a partial result. Let results(p_k) ⊆ R denote the set of reduce tasks for which processor p_k produces a partial result. Figure 3.1a illustrates three processors, six reduce tasks and the partial results in between. For example, p_3 computes partial results for reduce tasks r_4, r_5, and r_6, that is, results(p_3) = {r_4, r_5, r_6}. Reduce task r_6 needs partial results from p_2 and p_3.

Figure 3.1: (a) Three processors, six reduce tasks and partial results in between and (b) a partition of reduce tasks in (a) (R_k assigned to p_k).

Let Π = {R_1, R_2, . . . , R_P} denote a P-way partition of reduce tasks for a P-processor system, where the reduce tasks in R_k are assumed to be assigned to processor p_k. Then p_k needs to send the partial results in results(p_k) − R_k to the processors to which the respective reduce tasks are assigned.

In the reduce-task partition Π, the amount of data sent by p_k, i.e., the communication volume load of p_k, is defined as

    vol^Π(p_k) = |results(p_k) − R_k|,   (3.1)

and the maximum communication volume among processors becomes

    vol^Π_max = max_k vol^Π(p_k).   (3.2)

In the reduce-task partition Π, the number of messages sent by p_k, i.e., the latency cost of p_k, is

    nmsg^Π(p_k) = |{R_{m≠k} : results(p_k) ∩ R_m ≠ ∅}|.   (3.3)

That is, nmsg^Π(p_k) is equal to the number of distinct processors to which the reduce tasks in results(p_k) − R_k are assigned. Then, the total number of messages, i.e., the total latency cost, becomes

    nmsg^Π_tot = Σ_{k=1}^{P} nmsg^Π(p_k).   (3.4)

Definition 1. The Reduce-Task Assignment Problem: Consider a set of reduce tasks R = {r_1, r_2, . . . , r_n}. Assume that results(p_k) ⊆ R is given for k = 1, 2, . . . , P. The reduce-task assignment problem is defined as the problem of finding a P-way partition Π = {R_1, R_2, . . . , R_P} of R with the objective of minimizing both vol^Π_max and nmsg^Π_tot given in (3.2) and (3.4), respectively.

Figure 3.1b illustrates a 3-way partition Π of the reduce tasks displayed in Figure 3.1a. In the figure, the set of reduce tasks in R_k is assigned to processor p_k for k = 1, 2, 3. A dashed arrow line denotes a processor producing a result for a local reduce task, whereas a solid arrow line denotes a processor producing a result for a reduce task assigned to another processor. So, dashed arrow lines do not incur communication, whereas solid arrow lines do. In Π, vol^Π(p_2) = |results(p_2) − R_2| = |{r_1, r_2, r_3, r_4, r_6} − {r_3, r_4}| = 3. Similarly, vol^Π(p_1) = 2 and vol^Π(p_3) = 1. Then, vol^Π_max = vol^Π(p_2) = 3. In Π, nmsg^Π(p_2) = |{R_1, R_3}| = 2. Similarly, nmsg^Π(p_1) = 2 and nmsg^Π(p_3) = 1.
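As an illustration of these metrics, the following Python sketch (not thesis code; names such as results and assignment are illustrative, and the results(p_k) sets are read off Figure 3.1a) computes vol^Π(p_k), vol^Π_max, nmsg^Π(p_k), and nmsg^Π_tot for the partition of Figure 3.1b:

    # results[k]: reduce tasks for which processor k produces a partial result.
    # assignment[r]: processor to which reduce task r is assigned (the partition Pi).
    results = {1: {1, 2, 3, 5}, 2: {1, 2, 3, 4, 6}, 3: {4, 5, 6}}
    assignment = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}   # R1={r1,r2}, R2={r3,r4}, R3={r5,r6}

    vol, nmsg = {}, {}
    for k, res in results.items():
        sent = {r for r in res if assignment[r] != k}    # results(p_k) - R_k
        vol[k] = len(sent)                               # send-volume load, eq. (3.1)
        nmsg[k] = len({assignment[r] for r in sent})     # distinct destinations, eq. (3.3)

    vol_max = max(vol.values())                          # eq. (3.2): 3
    nmsg_tot = sum(nmsg.values())                        # eq. (3.4): 5
    print(vol, vol_max, nmsg, nmsg_tot)                  # {1: 2, 2: 3, 3: 1} 3 {1: 2, 2: 2, 3: 1} 5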

3.1.2 Reduce-Communication Hypergraph Model

3.1.2.1 Reduce-Communication Hypergraph Model

The reduce-communication hypergraph model [14] H = (V^p ∪ V^r, N) contains two types of vertices, which correspond to processors and reduce tasks, and a single net type, which corresponds to processors. Each processor p_k is represented by a vertex v_k^p in V^p, whereas each reduce task r_i in R is represented by a vertex v_i^r in V^r. Then the set of vertices V is formulated as

    V = V^p ∪ V^r = {v_1^p, v_2^p, . . . , v_P^p} ∪ {v_i^r : r_i ∈ R}.   (3.5)

Each processor p_k is also represented by a net n_k in N. Then the set of nets N is formulated as

    N = {n_1, n_2, . . . , n_P}.   (3.6)

Each net n_k connects the vertex that represents p_k as well as the vertices that represent the reduce tasks for which p_k produces a partial result. That is,

    Pins(n_k) = {v_k^p} ∪ {v_i^r : r_i ∈ results(p_k)}.   (3.7)

The vertices in V^p are assigned zero weight, whereas the vertices in V^r are assigned unit weight. That is,

    w(v_k^p) = 0, ∀ v_k^p ∈ V^p and w(v_i^r) = 1, ∀ v_i^r ∈ V^r.   (3.8)

The nets in N are assigned unit cost. That is,

    c(n_k) = 1, ∀ n_k ∈ N.   (3.9)

The reduce-communication hypergraph model utilizes fixed vertices. The vertices in V^p are fixed, whereas the vertices in V^r are free. For k = 1, 2, . . . , P, vertex v_k^p in V^p is fixed to part V_k, i.e., F_k = {v_k^p}.
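A direct construction of this hypergraph from the results(p_k) sets can be sketched in Python as follows (illustrative only; the vertex and net numbering conventions are assumptions made here). Processor vertices are fixed to their own parts, reduce-task vertices are free, and net n_k gets one pin per partial result of p_k plus the anchor vertex v_k^p:

    # Build the reduce-communication hypergraph of [14] from results(p_k).
    # results[k]: set of reduce-task ids (0-based) of processor k, for k = 0..P-1.
    # Vertices 0..P-1 are the fixed processor vertices v_k^p; task i becomes vertex P + i.
    def build_reduce_comm_hypergraph(results, num_tasks):
        P = len(results)
        pins = [{k} | {P + r for r in results[k]} for k in range(P)]   # net n_k, eq. (3.7)
        weight = [0] * P + [1] * num_tasks                             # eq. (3.8)
        cost = [1] * P                                                 # eq. (3.9)
        fixed = {k: k for k in range(P)}                               # v_k^p fixed to part V_k
        return pins, weight, cost, fixed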

Figure 3.2 displays the reduce-communication hypergraph of the reduce-task assignment problem shown in Figure 3.1a. In the figure, fixed and free vertices are represented by triangles and circles, respectively. Nets and pins are represented by small circles and lines, respectively.

Figure 3.2: Reduce-communication hypergraph of the reduce-task assignment problem given in Figure 3.1a.

A P-way partition

    Π = {V_1, V_2, . . . , V_P}, where V_k = {v_k^p} ∪ V_k^r,

of the reduce-communication hypergraph H is decoded as follows. Each free vertex v_i^r in V_k induces that reduce task r_i is assigned to processor p_k, since v_k^p ∈ V_k in Π. That is, the vertex partition {V_1, V_2, . . . , V_P} induces a reduce-task partition {R_1, R_2, . . . , R_P}, where R_k contains the reduce tasks corresponding to the vertices in V_k^r. So we use the symbol Π for both the reduce-task partition/assignment and the hypergraph partition interchangeably.

The partitioning objective of minimizing the cutsize (2.4) encodes minimizing the number of messages, which is also referred to as the latency cost (3.4). During the partitioning of communication hypergraphs, almost all nets remain cut. This is because a communication hypergraph contains a small number (as many as the number of processors/parts) of nets with possibly high degrees, and an uncut net refers to a processor that does not send any messages. Therefore, it is important to utilize the connectivity cutsize metric (2.4). The vertex weight definition given in (3.8) encodes the part weight (2.1) as the number of reduce tasks assigned to the respective processor. So, the partitioning constraint of maintaining balance on the part weights corresponds to maintaining balance on the number of reduce tasks assigned to processors, i.e., on the |R_k| values.

3.1.3 Deficiencies of Reduce-Communication Hypergraph

3.1.3.1 Failure to Encode Communication Volume Loads of Processors

The part weights computed according to the vertex weighting scheme utilized in the reduce-communication hypergraph model fail to correctly encapsulate the communication volume loads of processors. That is, the existing reduce-communication hypergraph model computes the volume load of processor p_k as

    vol^Π(p_k) = |R_k|,   (3.10)

whereas the actual volume load of p_k is

    vol^Π(p_k) = |results(p_k) − R_k|.   (3.11)

So, the partitioning constraint of maintaining balance on part weights does not correctly correspond to maintaining communication volume load balance.

In regular reduce-task assignment instances, processors produce partial results for similar numbers of reduce tasks, that is, they have similar |results(p_k)| values, which correspond to similar net degrees. For such regular instances, the approximation provided by the existing reduce-communication hypergraph model can be considered reasonable, as also reported in [14]. This is because the existing model makes a similar amount of error in computing the volume loads of processors according to (3.10), hence maintaining balance on the |R_k| values corresponds to maintaining balance on the volume loads. However, the deficiency of the existing model in encapsulating correct communication volume balancing grows with increasing irregularity in net degrees.

Figure 3.1b exemplifies the above-mentioned deficiency. Note that |R_1| = |R_2| = |R_3| = 2 in Π = {R_1, R_2, R_3}. Assume that perfect balance on these |R_k| values is obtained via achieving a perfect balance on part weights in partitioning the reduce-communication hypergraph model. Also note that |results(p_1)| = 4, |results(p_2)| = 5, and |results(p_3)| = 3. The imbalance on these |results(p_k)| values induces an imbalance on the vol^Π(p_k) values as vol^Π(p_1) = 4 − 2 = 2, vol^Π(p_2) = 5 − 2 = 3, and vol^Π(p_3) = 3 − 2 = 1.

3.1.3.2 Increase in Total Communication Volume

The existing reduce-communication hypergraph model also suffers from the increase in the total communication volume during the partitioning. A reduce task r_i assigned to a processor which does not compute a partial result for r_i is referred to here as an outcast reduce task. Each outcast reduce-task assignment increases the total communication volume by one. However, this increase due to the outcast reduce tasks is controlled neither by the problem formulation given in Section 3.1.1 nor by the reduce-communication hypergraph model described in Section 3.1.2.1. This deficiency has an adverse effect on the correspondence between maintaining communication volume balance and minimizing the maximum communication volume in the communication hypergraph model. The larger the increase in the total communication volume, the more pronounced the above-mentioned adverse effect becomes. This is because attaining a tight balance on processors' communication volume loads while increasing the total communication volume may not correspond to reducing the maximum communication volume (3.2).

Figure 3.3 displays a reduce-task partition which contains one outcast reduce task. This partition is obtained from the outcast-free partition given in Figure 3.1b by changing the assignments of r_2 and r_4 to processors p_2 and p_1, respectively. As seen in the figure, reduce task r_4 is outcast in the current partition since processor p_1 does not compute a partial result for r_4. Note that this change increases the total communication volume by one.

Figure 3.3: A partition of the reduce tasks shown in Figure 3.1a with an outcast reduce task (r_4).

In a P-way partition Π of the reduce-communication hypergraph H, we define outcast vertices to identify the outcast reduce-task assignments. A vertex v_i^r is called outcast if v_i^r is assigned to a part V_k where net n_k does not connect v_i^r, that is, v_i^r ∈ V_k and v_i^r ∉ Pins(n_k). Note that v_i^r ∈ V_k signifies that reduce task r_i is assigned to processor p_k, and v_i^r ∉ Pins(n_k) signifies that p_k does not compute a partial result for r_i.

In a partition Π, the existence of outcast vertices does not necessarily disturb the partitioning objective of minimizing the cutsize (2.4). Indeed, partitioning the hypergraph while trying to maintain balance without increasing the cutsize might motivate the partitioning tool to assign vertices to parts where they become outcast. Moreover, when the partitioning tool discovers a partition with outcast vertices, it has no motivation to refine it as long as the cutsize and the imbalance on the part weights remain the same.

3.2 Correct Reduce-Communication Hypergraph Model

3.2.1 A Novel Vertex Weighting Scheme

In order to minimize the maximum communication volume handled by processors, we propose a novel vertex weighting scheme that encapsulates the communication volume loads of processors via part weights.

Consider a P-way outcast-vertex-free partition Π_of of a given reduce-communication hypergraph H = (V^p ∪ V^r, N). Here and hereafter, we refer to an outcast-vertex-free partition shortly as an outcast-free partition. Note that in an outcast-free partition each reduce task r_i is assigned to a processor that computes a partial result for r_i. Let {R_1, R_2, . . . , R_P} denote the reduce-task partition/assignment induced by Π_of. Then the communication volume load of each processor p_k becomes

    vol^{Π_of}(p_k) = |results(p_k) − R_k|   (3.12a)
                    = |results(p_k)| − |R_k|   (3.12b)
                    = (|Pins(n_k)| − 1) − |V_k^r|.   (3.12c)

We obtain (3.12b) from (3.12a) since R_k ⊆ results(p_k) for each part R_k in an outcast-free partition. We obtain (3.12c) from (3.12b) by utilizing the hypergraph-theoretical view.

According to (3.12b), for any outcast-free partition, |results(p_k)| is an upper bound on the send-volume load of processor p_k, and each reduce task assigned to processor p_k reduces the communication volume load of p_k by one. In other words, each free vertex that is connected by n_k and assigned to part V_k reduces the volume load of processor p_k by one. So we propose the following vertex weighting scheme:

    w(v_k^p) = |results(p_k)|, ∀ v_k^p ∈ V^p and w(v_i^r) = −1, ∀ v_i^r ∈ V^r.   (3.13)

Then the weight of part V_k becomes

    W(V_k) = Σ_{v ∈ V_k} w(v)   (3.14a)
           = w(v_k^p) + Σ_{v_i^r ∈ V_k^r} w(v_i^r)   (3.14b)
           = |results(p_k)| + Σ_{v_i^r ∈ V_k^r} (−1)   (3.14c)
           = |results(p_k)| − |V_k^r|   (3.14d)
           = |results(p_k)| − |R_k|   (3.14e)
           = vol^{Π_of}(p_k).   (3.14f)

That is, part weight W(V_k) correctly encodes the volume of data sent by processor p_k.

As seen in (3.13), the proposed vertex weighting scheme assigns a negative weight to all free vertices. However, current implementations of hypergraph/graph partitioning tools (PaToH [19, 37], hMETIS [20], METIS [18]) do not support negative vertex weights. We propose the following vertex reweighting scheme for transforming all vertex weights to positive values.

We first multiply each vertex weight by −1. This scaling transforms the weights of all free vertices to +1, while transforming the weight of each fixed vertex v_k^p to the negative value −|results(p_k)|. Then we shift the weights of the fixed vertices to positive values by adding the maximum fixed-vertex weight to the weight of all fixed vertices.

That is, after the proposed reweighting scheme, the vertex weights become

    ŵ(v_k^p) = −|results(p_k)| + M_fvw, ∀ v_k^p ∈ V^p and ŵ(v_i^r) = +1, ∀ v_i^r ∈ V^r,   (3.15)

where M_fvw denotes the maximum fixed-vertex weight, i.e., M_fvw = max_k |results(p_k)|.

Under the proposed vertex reweighting scheme, we can compute the weight of part V_k as

    Ŵ(V_k) = M_fvw − vol^{Π_of}(p_k)   (3.16)

by following the steps of equation (3.14) for an outcast-free partition Π_of. As seen in (3.16), the part weights encode the send-volume loads of processors with the same constant shift amount M_fvw.
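A minimal Python sketch of the reweighting scheme (illustrative only, assuming results(p_k) is given as a list of sets) assigns the fixed-vertex weights of (3.15); the free vertices all receive weight +1:

    # Reweighting (3.15): fixed vertex k gets M_fvw - |results(p_k)|, free vertices get +1.
    def reweight_fixed(results):
        M_fvw = max(len(res) for res in results)           # maximum fixed-vertex weight
        return [M_fvw - len(res) for res in results], M_fvw

For the instance of Figure 3.4, M_fvw = 5 and the fixed-vertex weights become [1, 0, 2]; with two, three, and one free vertices in the three parts, every part weight evaluates to 3 = M_fvw − 2, matching (3.16) for the perfectly balanced outcast-free partition discussed below.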

Note that maintaining balance on the part weights corresponds to maintaining balance on the send-volume loads of processors. Hence, perfect balance on the part weights corresponds to minimizing the maximum send volume (3.2) which is one of the objectives of the Reduce-Task Assignment Problem.

We present the following theorem to address the validity of the proposed vertex reweighting scheme.

Theorem 1. Let (H, w) denote the reduce-communication hypergraph model with the vertex weighting scheme proposed in (3.13). Let (H, ŵ) denote the model with the vertex reweighting scheme proposed in (3.15). Then Π* is a perfectly-balanced partition of (H, w) if and only if it is a perfectly-balanced partition of (H, ŵ).

Proof. We find the relation between the total vertex weights W_tot and Ŵ_tot to derive the relation between the average part weights W_avg and Ŵ_avg for the two vertex weighting schemes w and ŵ, respectively. The derivations of the expressions for W_avg and Ŵ_avg are important since in a perfectly-balanced partition the weight of each part should be equal to the average part weight by (2.2), that is, W(V_k) = W_avg and Ŵ(V_k) = Ŵ_avg for k = 1, . . . , P.

From (3.14f) and (3.16) we obtain

    W(V_k) = M_fvw − Ŵ(V_k).   (3.17)

Then, we compute the sum of both sides of (3.17) over all k:

    Σ_{k=1}^{P} W(V_k) = Σ_{k=1}^{P} (M_fvw − Ŵ(V_k)),   (3.18a)
    W_tot = P M_fvw − Ŵ_tot.   (3.18b)

Finally, we divide both sides of (3.18b) by P to obtain

    W_avg = M_fvw − Ŵ_avg.   (3.19)

(3.19) holds because the reduce-communication hypergraph model contains P fixed vertices in total, whereas (3.17) holds because it contains exactly one fixed vertex in each part. Hence, shifting the weight of each fixed vertex by M_fvw shifts each part weight and the average part weight by the same amount M_fvw.

(⇒) Assume that Π* is a perfectly-balanced partition of (H, w). Then, we have

    W(V_k) = W_avg for k = 1, . . . , P.   (3.20)

Replacing the left-hand side by (3.17) and the right-hand side by (3.19), (3.20) becomes M_fvw − Ŵ(V_k) = M_fvw − Ŵ_avg, and hence

    Ŵ(V_k) = Ŵ_avg for k = 1, . . . , P.

This shows that Π* is also a perfectly-balanced partition of (H, ŵ).

(⇐) A dual proof holds. That is, assume Ŵ(V_k) = Ŵ_avg and then show W(V_k) = W_avg for k = 1, . . . , P.

Figure 3.4 illustrates a perfectly-balanced partition of the communication hypergraph given in Figure 3.2 with vertex weights assigned by the proposed vertex (re)weighting scheme. Dashed pins denote partial results for local reduce tasks, whereas solid pins denote partial results for external ones. So, the number of solid lines (except the one connected to the fixed vertex) incident to each net is equal to the communication volume load of the respective processor. As seen in the figure, the weight of fixed vertex v_1^p is ŵ(v_1^p) = −|results(p_1)| + M_fvw = −4 + 5 = 1. Similarly, ŵ(v_2^p) = −5 + 5 = 0 and ŵ(v_3^p) = −3 + 5 = 2. As seen in the figure, parts V_1, V_2, and V_3 contain two, three, and one free vertices, respectively. Note that Ŵ(V_1) = 1 + 2 = 3, Ŵ(V_2) = 0 + 3 = 3, and Ŵ(V_3) = 2 + 1 = 3. Also note that vol^{Π*_of}(p_1) = |results(p_1)| − |R_1| = 4 − 2 = 2, vol^{Π*_of}(p_2) = |results(p_2)| − |R_2| = 5 − 3 = 2, and vol^{Π*_of}(p_3) = |results(p_3)| − |R_3| = 3 − 1 = 2.

Figure 3.4: A balanced partition of the reduce-communication hypergraph given in Figure 3.2 with the proposed vertex weights.

3.2.2 Eliminating Outcast Vertices via Recursive Bipartitioning

In this section, we propose an RB-based scheme that aims at minimizing the total number of outcast vertices. The details of the general RB paradigm are given in Section 2.2. Note that communication hypergraphs contain one fixed vertex in each part of the resulting P-way partition. Hence, at each RB step, one half of the fixed vertices in the current hypergraph are assigned to V_L, whereas the other half are assigned to V_R, in order to attain one fixed vertex in each part of the final P-way partition. In this way, hypergraph H_k^ℓ contains P/2^ℓ fixed vertices.

Although every net of the reduce-communication hypergraph model connects exactly one fixed vertex, subhypergraphs may contain nets that do not connect a fixed vertex. This stems from the cut-net splitting method described in Section 2.2.

Algorithm 1 shows the basic steps of the proposed RB-based scheme. In the algorithm, BIPARTITION at line 4 denotes a call to a 2-way HP tool to obtain Π_2 = {V_L, V_R}, whereas lines 6 and 7 show the formation of the left and right subhypergraphs according to the above-mentioned cut-net splitting technique. The proposed scheme is applied after obtaining bipartition Π_2, through calling the SWAP-OUTCAST function at line 5.

Algorithm 1 RB with Swap
Require: H = (V, N), P
 1: H_0^0 = H
 2: for ℓ ← 0 to log2(P) − 1 do
 3:   for k ← 0 to 2^ℓ − 1 do
 4:     Π_2 ← BIPARTITION(H_k^ℓ)           ▷ Π_2 = {V_L, V_R}
 5:     Π_2 ← SWAP-OUTCAST(H_k^ℓ, Π_2)      ▷ updates Π_2
 6:     Form H_L = H_{2k}^{ℓ+1} = (V_L, N_L) induced by V_L
 7:     Form H_R = H_{2k+1}^{ℓ+1} = (V_R, N_R) induced by V_R
 8:   end for
 9: end for
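The overall driver of Algorithm 1 can be sketched in Python as follows (illustrative only; bipartition, swap_outcast, and cut_net_split stand in for the 2-way HP tool call, Algorithm 2, and the subhypergraph formation of Section 2.2, and are assumed to be supplied by the caller):

    # RB with swap (cf. Algorithm 1): recursively bipartition until P parts (P a power of 2).
    def rb_with_swap(H, P, bipartition, swap_outcast, cut_net_split):
        level = [H]                               # hypergraphs at the current RB-tree level
        while len(level) < P:
            next_level = []
            for Hk in level:
                part = bipartition(Hk)            # 2-way partition {V_L, V_R}
                part = swap_outcast(Hk, part)     # refine: make swappable outcast vertices non-outcast
                HL, HR = cut_net_split(Hk, part)  # vertex-induced subhypergraphs
                next_level += [HL, HR]
            level = next_level
        return level                              # P leaf hypergraphs, one per part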

In Π_2, a vertex in the left part V_L is said to be outcast if it is not connected by any left-anchored net. Here, a net is said to be left-/right-anchored if it connects a fixed vertex in the left/right part. It is clear that an outcast vertex in V_L remains outcast in the further RB steps as well as in the final P-way partition. A similar argument holds for an outcast vertex in V_R. The SWAP-OUTCAST function refines Π_2 by swapping outcast vertices so that they are not outcast anymore in Π_2.

The vertices of V_L satisfying all three conditions given below are defined as candidates for swap operations:

(i) v_i^r is not connected by a left-anchored net,
(ii) v_i^r is connected by a cut right-anchored net, and
(iii) v_i^r is not connected by an internal net of V_L.

Condition (i) identifies v_i^r as an outcast vertex of V_L. Condition (ii) ensures that v_i^r would not be outcast in Π_2 if it were assigned to V_R. Thus, conditions (i) and (ii) together identify that moving v_i^r to V_R in a swap operation will make v_i^r not outcast anymore in Π_2. The swap of any two vertices in V_L and V_R, both of which satisfy conditions (i) and (ii), reduces the number of outcast vertices in Π_2 by two. Swap operations are preferred over individual moves in order not to disturb the balance of the current bipartition. In a swap operation, moving v_i^r to V_R increases the cutsize by the number of internal nets that connect v_i^r in V_L. Hence, condition (iii) ensures that moving v_i^r to V_R does not increase the cutsize and thus the partitioning objective of minimizing the cutsize is not disturbed.

Figure 3.5 shows an RB step illustrating different states of vertices in terms of their candidacy for being swapped. For simplicity, we only show them in the left part. Consider vertices v_g^r, v_h^r, and v_i^r in V_L. Vertex v_g^r is connected by a left-anchored net (n_a), so it violates condition (i), which means that it is not outcast. Vertex v_h^r is not connected by a left-anchored net and it is connected by a right-anchored net (n_b). So, it satisfies conditions (i) and (ii), which means that v_h^r would not be outcast in V_R. However, since it is connected by an internal net (n_t), moving it to V_R increases the cutsize; hence, it violates condition (iii). Vertex v_i^r, on the other hand, satisfies all three conditions; hence, it is a candidate for being swapped.

Figure 3.5: Among vertices v_g^r, v_h^r, and v_i^r, only v_i^r is a candidate although both v_h^r and v_i^r are outcast vertices.

Algorithm 2 shows the basic steps of the SWAP-OUTCAST algorithm. As seen in the algorithm, the for loop in lines 4–19 makes a single pass over all nets of the current hypergraph H. Lines 5–9 identify the vertices that do not satisfy condition (i), so the candidate flags of these vertices are set to false. The else-if statement in lines 10–11 identifies the vertices that satisfy both conditions (i) and (ii). Line 10 ensures that a vertex is never considered again if it was once found to violate condition (i). The else-if statement in lines 14–18 identifies the vertices that do not satisfy condition (iii). Note that a vertex which was found to be a candidate earlier can turn out to violate condition (iii) later. After executing the for loop in lines 4–19, only the vertices that satisfy all three conditions have their candidate flags set to true.

The for loop in lines 21–29 performs a pass over all free vertices to construct the swappable vertex sets S_L and S_R by utilizing the candidate vertex flags. Finally, the while loop in lines 30–36 performs min{|S_L|, |S_R|} swaps.

The running time of the proposed SWAP-OUTCAST algorithm is Θ(|Pins| + |V|), where |Pins| denotes the total number of pins in H. Note that the proposed SWAP-OUTCAST algorithm is quite efficient since it performs a single pass over the pins and free vertices of H.

Algorithm 2 SWAP-OUTCAST
Require: H = (V, N), Π_2 = {V_L, V_R}
 1: for each free vertex v_i^r ∈ V do
 2:   cand(v_i^r) ← maybe
 3: end for
 4: for each net n ∈ N do
 5:   if n connects a fixed vertex then              ▷ n is an anchored net
 6:     v_k^p ← the fixed vertex in Pins(n)
 7:     for each free vertex v_i^r ∈ Pins(n) do
 8:       if part(v_k^p) = part(v_i^r) then
 9:         cand(v_i^r) ← false
10:       else if cand(v_i^r) ≠ false then
11:         cand(v_i^r) ← true
12:       end if
13:     end for
14:   else if n is internal then
15:     for each v_i^r ∈ Pins(n) do
16:       cand(v_i^r) ← false
17:     end for
18:   end if
19: end for
20: S_L ← ∅ and S_R ← ∅                             ▷ swappable left/right vertex sets
21: for each free vertex v_i^r ∈ V do
22:   if cand(v_i^r) = true then
23:     if part(v_i^r) = L then
24:       S_L ← S_L ∪ {v_i^r}
25:     else
26:       S_R ← S_R ∪ {v_i^r}
27:     end if
28:   end if
29: end for
30: while S_L ≠ ∅ and S_R ≠ ∅ do                    ▷ min{|S_L|, |S_R|} swaps
31:   Let v_i^r ∈ S_L and v_j^r ∈ S_R
32:   part(v_i^r) ← R                                ▷ swap v_i^r and v_j^r
33:   part(v_j^r) ← L
34:   S_L ← S_L − {v_i^r}
35:   S_R ← S_R − {v_j^r}
36: end while

*Here cand refers to a three-state variable, where true denotes swappable, false denotes not swappable, and maybe denotes not yet decided.

3.3 Experiments

3.3.1 Test Application: Column-Parallel SpMV

SpMV is denoted by y ← Ax, where A = (a_ij) is an n × m sparse matrix and x = (x_i) and y = (y_j) are dense vectors. In column-parallel SpMV, the columns of matrix A as well as the entries of vectors x and y are distributed among processors. The partitions of the columns of A and of the entries of x and y are obtained by a two-phase partitioning approach.

In the first phase, the row-net hypergraph partitioning model [19] is utilized to obtain a partition of columns of A in such a way that the total communication volume is minimized while maintaining balance on the computational loads of processors. This column partition induces a conformable partition on the input vector x, that is, xi is assigned to the processor to which column i is assigned.

Note that assigning all nonzeros of column i together with xi to a single processor

eliminates the need for the pre-communication phase, which is performed for broadcasting x-vector entries. However, since multiple processors may produce partial results for the same y-vector entries, the post-communication phase needs to be performed to reduce those partial results for obtaining final values of y-vector entries.
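To make the local computation and the post-communication phase concrete, the following C/MPI sketch outlines one possible realization of column-parallel SpMV. The data layout (local CSC arrays and per-neighbor send/receive lists) and all identifier names are our own illustrative assumptions rather than the actual implementation of [38]; the sketch further assumes that y_owned already holds this processor's own partial results for the rows it is responsible for reducing.

#include <mpi.h>
#include <stdlib.h>

/* Local columns of A in CSC form; row_ind holds local ids of the rows touched
 * by the local columns. send_* describe, per neighbor processor, which local
 * partial results must be shipped in the post-communication phase; recv_*
 * describe the partial results to be received and reduced.                  */
void column_parallel_spmv(
    int n_local_cols, int n_local_rows,
    const int *col_ptr, const int *row_ind, const double *val,
    const double *x_local, double *y_part, double *y_owned,
    int n_send, const int *send_proc, const int *send_ptr, const int *send_rows,
    int n_recv, const int *recv_proc, const int *recv_cnt, const int *recv_row)
{
    /* 1. Local SpMV: no pre-communication, since each column is stored on
     *    the processor that owns the corresponding x entry.                 */
    for (int r = 0; r < n_local_rows; r++) y_part[r] = 0.0;
    for (int c = 0; c < n_local_cols; c++)
        for (int k = col_ptr[c]; k < col_ptr[c + 1]; k++)
            y_part[row_ind[k]] += val[k] * x_local[c];

    /* 2. Post-communication: send partial results to the owners of the
     *    corresponding reduce tasks.                                        */
    MPI_Request *req = malloc(n_send * sizeof(MPI_Request));
    double **sbuf = malloc(n_send * sizeof(double *));
    for (int s = 0; s < n_send; s++) {
        int len = send_ptr[s + 1] - send_ptr[s];
        sbuf[s] = malloc(len * sizeof(double));
        for (int k = 0; k < len; k++)
            sbuf[s][k] = y_part[send_rows[send_ptr[s] + k]];
        MPI_Isend(sbuf[s], len, MPI_DOUBLE, send_proc[s], 0,
                  MPI_COMM_WORLD, &req[s]);
    }

    /* 3. Receive partial results and reduce (sum) them into the final
     *    values of the owned y entries.                                     */
    int off = 0;
    for (int r = 0; r < n_recv; r++) {
        double *rbuf = malloc(recv_cnt[r] * sizeof(double));
        MPI_Recv(rbuf, recv_cnt[r], MPI_DOUBLE, recv_proc[r], 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int k = 0; k < recv_cnt[r]; k++)
            y_owned[recv_row[off + k]] += rbuf[k];
        off += recv_cnt[r];
        free(rbuf);
    }

    MPI_Waitall(n_send, req, MPI_STATUSES_IGNORE);
    for (int s = 0; s < n_send; s++) free(sbuf[s]);
    free(sbuf); free(req);
}

In this organization, the volume each processor sends in step 2 corresponds to the send volume of the post-communication phase, which is the quantity the second-phase partitioning seeks to balance.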

In the second phase, a partition of output vector y is obtained via the proposed reduce-communication hypergraph model as follows. The set of reduce tasks, R, corresponds to the subset of y-vector entries for which multiple processors compute a partial result. That is,

R = {r_i : ∃ columns j_1 and j_2 assigned to different processors such that a_{i,j_1} ≠ 0 and a_{i,j_2} ≠ 0}.

Here, r_i represents the reduce task associated with y_i, as well as with row i. Then, results(p_k), the set of reduce tasks for which processor p_k produces a partial result, can be formulated as

results(p_k) = {r_i ∈ R : ∃ column j assigned to p_k such that a_{i,j} ≠ 0}.

Note that a row whose nonzeros are all assigned to a single processor does not incur a reduce task.
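As a small illustrative sketch of how R can be identified in practice, the following C routine assumes that A is also available in CSR form (row_ptr, col_ind) and that col_part[] holds the first-phase column-to-processor assignment; a row incurs a reduce task exactly when its nonzeros fall into columns assigned to at least two different processors. All names are hypothetical.

#include <stdlib.h>

/* Returns an array reduce_task[0..n_rows-1]; reduce_task[i] is 1 if row i
 * incurs a reduce task, i.e., its nonzeros are spread over columns assigned
 * to at least two different processors, and 0 otherwise.                    */
int *find_reduce_tasks(int n_rows, const int *row_ptr, const int *col_ind,
                       const int *col_part /* column -> processor */)
{
    int *reduce_task = calloc(n_rows, sizeof(int));
    for (int i = 0; i < n_rows; i++) {
        if (row_ptr[i] == row_ptr[i + 1]) continue;     /* empty row */
        int first = col_part[col_ind[row_ptr[i]]];
        for (int k = row_ptr[i] + 1; k < row_ptr[i + 1]; k++) {
            if (col_part[col_ind[k]] != first) {        /* a second processor */
                reduce_task[i] = 1;
                break;
            }
        }
    }
    return reduce_task;
}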

3.3.2 Setup

The performance of the proposed models is compared against the existing reduce-communication hypergraph model [14] (Section 3.1.2.1), which is referred to as the baseline model RC_b. The reduce-communication hypergraph model that utilizes the proposed novel vertex (re)weighting scheme (Section 3.2.1) is referred to as RC_vw, whereas the model that utilizes both the proposed vertex weighting scheme and the proposed outcast vertex elimination scheme (Section 3.2.2) is referred to as RC^s_vw. We used P = 512 processors for the performance comparison of these models.

We use PaToH [19, 37] for partitioning both the row-net hypergraphs and the reduce-communication hypergraphs, in the first and second phases, respectively. For the row-net, RC_b, and RC_vw models, PaToH is called for P-way partitioning for a P-processor system, whereas for the RC^s_vw model, PaToH is used for 2-way partitioning (line 6 of Algorithm 1). PaToH is used with default parameters for partitioning the row-net hypergraphs, whereas the refinement algorithm is set to boundary FM for partitioning the communication hypergraphs. The maximum allowed imbalance is set to 10%, i.e., ε = 0.10, for all models. Since PaToH utilizes randomized algorithms, we partitioned each hypergraph three times and report average results.

We utilize the column-parallel SpMV implementation [38], which is implemented in C using MPI for interprocess communication. Parallel SpMV times are obtained on a cluster with 19 nodes, where each node contains 28 cores (two Intel Xeon E5-2680 v4 CPUs) running at a 2.40 GHz clock frequency and 128 GB of memory. The nodes are connected by an InfiniBand FDR 56 Gbps network with a switched fabric topology.


3.3.3 Dataset

We conduct experiments on a very large set of sparse matrices obtained from the SuiteSparse Matrix Collection (formerly known as the University of Florida Sparse Matrix Collection) [39]. We select square (both symmetric and unsymmetric) matrices that have more than 100K and less than 51M rows/columns. The number of nonzeros of these matrices ranges from 207K to 1.1B. The collection contains 358 such matrices. We exclude the matrices that fail to satisfy either of the following two conditions:

(i) the row-net hypergraph partitioning in the first phase does not incur empty parts,

(ii) the communication hypergraph in the second phase contains more than 100 vertices per part on average.

Condition (i) prevents unrealistic results, whereas condition (ii) ensures partitioning quality in the second phase. The resulting dataset contains 313 matrices for P = 512 processors.

As described in Section 3.1.3.1, the deficiency of the existing reduce-communication hypergraph model increases with increasing irregularity on the net degrees. Therefore, in order to better show the validity of the proposed vertex (re)weighting scheme, we group the test matrices according to the coefficient of variation (CV) values of the net degrees of their communication hypergraphs. Here, CV refers to the ratio of the standard deviation to the mean of the net degrees. Recall that the net degree in a communication hypergraph also refers to the number of partial results produced by the respective processor. We use six matrix groups, denoted by CV>0.50, CV>0.30, CV>0.20, CV>0.15, CV>0.10, and CV>0.00 (all matrices). CV>α consists of the matrices whose CV value is greater than α. Note that the set of matrices in a group associated with a smaller α value is a superset of the matrices in a group associated with a larger α value.
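As a short illustrative sketch (the function and array names are ours), the CV value of a communication hypergraph can be computed directly from its net degrees as follows.

#include <math.h>

/* Coefficient of variation of net degrees: standard deviation / mean.
 * Assumes n_nets > 0 and a nonzero mean degree.                             */
double net_degree_cv(const int *net_degree, int n_nets)
{
    double sum = 0.0, sumsq = 0.0;
    for (int n = 0; n < n_nets; n++) {
        sum   += net_degree[n];
        sumsq += (double)net_degree[n] * net_degree[n];
    }
    double mean = sum / n_nets;
    double var  = sumsq / n_nets - mean * mean;    /* population variance */
    return sqrt(var > 0.0 ? var : 0.0) / mean;
}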


Table 3.1 displays properties of the test matrices as well as properties of their reduce-communication hypergraphs. The first column shows the lower bound of the CV value of the matrices in each group, and the second column shows the number of matrices in the corresponding CV group. The rest of the columns show values averaged over the matrices of each CV group. The third and fourth columns show the number of rows/columns and nonzeros, whereas the fifth column shows the number of nonzeros per row/column. The following two columns show the maximum number of nonzeros per row and per column, respectively.

The last six columns of Table 3.1 show properties of the reduce-communication hypergraphs. The first of those six columns shows the average number of reduce tasks obtained from the column-wise partitioning (using the row-net hypergraph model) of the matrices in the respective CV group, i.e., the number of free vertices in the communication hypergraph. The second and third columns show the average and maximum free-vertex degrees of those hypergraphs, respectively. Note that all fixed vertices have a unit degree. The last three columns show the minimum, average, and maximum net degrees of those hypergraphs, respectively.


Table 3.1: Properties of Test Matrices and Their Reduce-Communication Hypergraphs.

          sparse matrices                                                 reduce-communication hypergraph
  CV      number of  rows/cols   nonzeros    avg nnz per  max nnz per     # of reduce  vtx degree     net degree
          matrices                           row/col      row      col    tasks        avg    max     min    avg    max
  > 0.50      32      615,123    9,335,833     15.18       971    3,219     156,698    3.56    44      105  1,082  7,153
  > 0.30      70      651,939    8,855,071     13.58       684    1,453     136,263    3.67    39      115    968  4,445
  > 0.20     148      445,686    4,982,116     11.18       280      387      86,634    3.06    19      105    512  1,426
  > 0.15     206      499,288    6,641,398     13.30       186      235     109,187    2.97    15      160    627  1,479
  > 0.10     302      575,028    7,892,009     13.72        98      114     119,804    2.69    11      202    625  1,248
  ALL        313      555,892    7,616,780     13.70        94      109     121,017    2.71    11      212    635  1,248


3.3.4 Results

Performance results are displayed in three tables and two figures. In all tables, for each CV group, the first row shows the actual values averaged over the matrices of that group, whereas the second row shows these values normalized with respect to the respective baseline.

Table 3.2 is introduced to show the performance of the proposed SWAP-OUTCAST algorithm in eliminating outcast vertices. The table compares RC^s_vw against RC_vw in terms of the ratio of the number of outcast vertices to the total number of free vertices. As seen in the table, the proposed SWAP-OUTCAST algorithm achieves approximately 13% fewer outcast vertices on average. Furthermore, this performance improvement does not vary much across the CV groups.

Table 3.2: Outcast Vertex Elimination for P = 512

            outcast vertex ratio
  CV        RC_vw     RC^s_vw
  > 0.50      84%       76%
             1.00      0.91
  > 0.30      82%       74%
             1.00      0.91
  > 0.20      75%       67%
             1.00      0.89
  > 0.15      75%       66%
             1.00      0.88
  > 0.10      73%       64%
             1.00      0.87
  ALL         74%       64%
             1.00      0.87

Table 3.3 compares the relative performance of the three RC models in terms of multiple communication cost metrics as well as parallel runtime on 512 processors. The communication cost metrics include the maximum send volume, average volume, maximum number of send messages, and average number of messages. In the table, after the CV column, each of the four 3-column groups among the 12 columns compares the three RC models in terms of one of the above-mentioned communication cost metrics averaged over the respective CV group. Here, the average volume and average message values refer to the total communication volume and the total number of messages divided by the number of processors. We prefer to report average values instead of total values, because average values give a better feeling of how much the maximum values deviate from the average values.

As seen in Table 3.3, in terms of the maximum send volume metric, both RC_vw and RC^s_vw perform significantly better than RC_b, where RC^s_vw is the clear winner. The performance gap between the proposed RC schemes (RC_vw and RC^s_vw) and the baseline RC_b scheme increases with increasing CV values. For example, RC^s_vw achieves a 6% improvement over RC_b for the matrices in the CV>0.10 group, and this improvement increases to 8%, 9%, 15%, and 19% for the matrices in the CV>0.15, CV>0.20, CV>0.30, and CV>0.50 groups, respectively. This is expected since the irregularity on the net degrees increases with the increasing CV value.

In terms of the average/total communication volume metric, RC_vw performs slightly worse than RC_b, whereas RC^s_vw performs slightly better than RC_b and considerably better than RC_vw. This is also expected since neither RC_b nor RC_vw makes an explicit effort towards decreasing the total communication volume incurred by the outcast vertex assignments, whereas RC^s_vw tries to decrease the number of such assignments by utilizing the SWAP-OUTCAST algorithm.

