Scheduling Loops with Partial Loop-Carried Dependencies

Frédéric Brégier, Marie-Christine Counilh and Jean Roman
LaBRI, ENSERB and Université Bordeaux I, 33405 Talence Cedex, France

Abstract

This paper deals with task scheduling, where each task is one particular iteration of a DO loop with partial loop-carried dependencies. Independent iterations of such loops can be scheduled in an order different from the one of classical serial execution, so as to increase program performance. The approach that we present is based both on the use of a directive added to the HPF2 language, which specifies the dependencies between iterations, and on inspector/executor support, implemented in the CoLUMBO library, which builds the task graph and schedules tasks associated with iterations. We validate our approach by showing results achieved on an IBM SP2 for a sparse Cholesky factorization algorithm applied to real problems.

Key words: HPF 2.0, Irregular Applications, Loop-Carried Dependencies, Task Scheduling, Inspection-Execution

1 Introduction

High Performance Fortran (HPF2 [17]), the standard language for writing data-parallel programs, is quite efficient for coding regular applications, but efficiency is still a major concern for irregular applications. As a matter of fact, irregular applications often include irregular loops that need specific processing at compile time, and sometimes even at run time using inspection-execution techniques, when compile-time information is not sufficient. We are interested in loops with partial loop-carried dependencies. Loop-carried dependencies mean that there exist data dependencies between instructions executed in different loop iterations [22], such that iterations must be executed in sequential order. Loops with partial loop-carried dependencies are loops such that there still exist independent iterations that can be executed


in any order. These loops have intrinsic parallelism that is currently not exploitable within the HPF2 framework. In this study, we only consider the case where loop-carried dependencies are precomputable, that is, they do not depend on the execution of the loop and can therefore be analyzed prior to its execution. Hence, we study loops with precomputable partial loop-carried dependencies, denoted PPLD loops in all of the following.

The order of execution of two independent iterations on a single processor is not relevant with respect to the final result, but can have some impact on the global efficiency of the program. Scheduling loop iterations according to some performance criteria while enforcing the precedence constraints of the execution amounts to performing task scheduling, where every task is associated with one iteration of the PPLD loop. The tasks and their mutual dependencies are classically represented by a Directed Acyclic task Graph, denoted DAG.

Since we consider loops in the context of compilation according to the data-parallelism paradigm, the mapping of tasks (i.e. iterations) is performed with respect to the data distribution directives supplied by the programmer within the HPF2 code, and according to the "owner computes" rule generally used by the compiler; the mapping is never reconsidered at run time. In other words, each task has to be executed in a distributed way by several processors, namely the ones that own the data which are read and written during the task. So, the number and the identity of the processors which execute a task are fixed by the HPF context. The problem we deal with is therefore referred to as scheduling multi-processor tasks on dedicated processors.

This paper is organized as follows. Section 2 sums up previous work related to the task scheduling problem, and specifically to our scheduling problem. Section 3 introduces our new SCHEDULE directive, which allows optimized execution of a PPLD loop based on the inspection-execution approach. Section 4 outlines the implementation of the scheduler carried out in our CoLUMBO library. Finally, Section 5 presents an experimental validation of this work for sparse Cholesky factorization applied to real-size problems.

2 Related Work

Most of the studies on task scheduling are devoted to mono-processor tasks (where each task is executed by only one processor) and also deal with the task mapping problem [31,32]. Schedules are often computed based on a task priority list, so as to execute tasks using a greedy strategy. The critical path heuristic as list priority (denoted by CP) was demonstrated to be close to the optimal solution (see [32, and included references] for the mono-processor task scheduling problem).

The works devoted to multi-processor task scheduling on dedicated processors essentially deal with complexity issues [10,21,30]. Finding the task schedule that minimizes the global time is an NP-complete problem (and even NP-hard, see [10,21]), so one uses heuristics to find approximate solutions. In [23,30], the dedicated multi-processor task scheduling problem is presented from an algorithmic point of view. An algorithm based on task insertion in partial schedules is given, which has the advantage of being less expensive than the ones presented in other studies ([21], for example). We will discuss this algorithm in Section 4.1.

The HPF2 language does not compute schedules of loop iterations, neither using task-parallelism techniques nor data-parallelism ones. There has not been much work on task scheduling within HPF. The one study we know of [26] only considers regular problems (regular loop-carried dependencies). So, to our knowledge, no work has dealt with this irregular task scheduling problem in HPF.

Outside the HPF framework, the RAPID system [12] provides a code generation mechanism based on task mapping and scheduling, but only for mono-processor tasks. It uses an inspection-execution approach to parallelize programs using an irregular task graph. In this system, mono-processor tasks can communicate in the epilog of a task and the prolog of another one. In our approach, tasks (i.e. iterations) can perform communications during their execution (between processors participating in the same task). Consequently, to be handled by the RAPID system, our tasks would have to be subdivided into successive computation-communication phases. The work presented in [9] uses, as in the RAPID system, an inspection-execution approach to perform mapping and task scheduling, but uses a fine-grained irregular DAG (instruction level).

In [27], the authors introduce the notion of inspector-executor to deal with precomputable partial loop-carried dependencies. The principle is to set up wavefronts of concurrently executable loop iterations, and it can be related to our scheduling algorithm. However, this work is mainly intended for shared-memory architectures and, being earlier, does not address the HPF language. Second, it obviously does not address the problem of an already fixed task mapping, and therefore tries to load-balance computations on processors by means of the repartition of iterations. Third, their results imply that their inspection-execution method should be reserved for iterative algorithms, since the inspector cost can be too heavy for a single execution step, contrary to our approach.

Other works related to data-parallelism support parallel and multi-threaded execution of HPF codes, but do not consider the scheduling problem. The language and runtime support Opus [14] allows the execution of asynchronous activities and was included in the Vienna Fortran Compilation System [15]. The study presented in [4] uses multi-threaded execution in a data-parallel model by considering numerous virtual processors that are treated as lightweight processes to map onto real processors.

Other works related to HPF support irregular problems based on sparse matrix computations, for example to limit the inspector overhead [28] or to support fill-in computations involving dynamicity in sparse structures. But they do not address the scheduling problem of such irregular loops.

In a previous study [6], we introduced an extension of the HPF2 ON HOME directive [17] which allowed algorithmic specification of active processor sets associated with loop iterations. Each active processor set specified which processors should execute one particular iteration, and we considered that no communication with processors outside the active processor set was required; this was specified by the RESIDENT clause of the ON HOME directive. Moreover, we provided an inspection-execution support to manage active processor sets so that each processor executed only the necessary iterations (i.e. iterations for which the processor belonged to the active processor set), in increasing order. Therefore, the use of active processor sets allowed us to exploit some of the concurrency contained in PPLD loops. Indeed, two iterations that used disjoint active processor sets could be executed in parallel, because these separate sets were a consequence of the absence of direct dependencies between these two iterations. This increased the usable concurrency and also reduced communication (which was restricted to the required processors only). In this paper, we propose to go further and to provide an additional support for the scheduling of independent iterations.

3 Construction of the task graph associated with a PPLD loop

We use an inspection-execution approach [24,28,29] to generate the task graph (DAG) and the associated schedule. The inspector and the executor code must be generated by a compiler (here an HPF compiler, as in [1-3,18]). The inspector code analyzes at run time some properties of the program (usually data accesses) not known at compile time. Then, the executor code uses the inspection results so as to execute computations efficiently. Usually, inspection-execution techniques are used in iterative programs, where inspection results can be re-used by several execution steps, so that the inspection cost is amortized over many execution steps. However, our first objective is (as in [6]) to repay the inspection cost (due to the construction of the task graph and the schedule) within a single execution step.

In order to make the generation of the task graph associated with a PPLD loop easier and cheaper, we introduce the new SCHEDULE directive, which allows the programmer to specify dependencies between iterations. As described above, the processors which participate in the execution of one task are determined by the active processor set inspection based on the extended ON HOME directive associated with the loop (see [6]).

      REAL A(N)
      INTEGER INDEX(N,M), NB_ELT(N)
!HPF$ DISTRIBUTE A(CYCLIC)

!HPF$ SCHEDULE (J = 1:K-1, ANY(INDEX(K,1:NB_ELT(K)) .EQ. J))
      DO K = 1, N
!HPF$   ON HOME (A(J), J=1:K, J .EQ. K .OR. &
!HPF$            ANY(INDEX(K,1:NB_ELT(K)) .EQ. J)), RESIDENT, BEGIN
        B = 0.0
!HPF$   INDEPENDENT, REDUCE(B)
        DO J = 1, NB_ELT(K)
          B = B + A(INDEX(K,J))
        END DO
        A(K) = A(K) + B
!HPF$   END ON
      END DO

Program 1. A PPLD loop with extended ON HOME and SCHEDULE directives.

Program 1 will be our reference example. It shows a PPLD loop (loop K) which performs, in every iteration K, a partial sum of elements of A. The result of this partial sum is added to element A(K). The selection of the elements of A used in the partial sum is driven by the index array INDEX, so that the loop-carried dependencies are partial and precomputable (the INDEX array is not modified during the loop).
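To make these dependencies concrete, the following small Python model (purely illustrative, not part of the HPF code) shows the sequential semantics that any reordering of the iterations must preserve; a, index and nb_elt mirror A, INDEX and NB_ELT of Program 1, with 0-based indexing for brevity.

def ppld_loop_reference(a, index, nb_elt):
    # Sequential semantics of the PPLD loop of Program 1: iteration k adds to
    # a[k] the partial sum of the elements a[index[k][j]], j = 0..nb_elt[k]-1.
    # Iteration k therefore depends on every earlier iteration j < k that
    # appears in index[k], since that iteration wrote a[j].
    for k in range(len(a)):
        b = 0.0
        for j in range(nb_elt[k]):
            b += a[index[k][j]]
        a[k] += b
    return a

# Iteration 3 reads a[0] and a[2], so it depends on iterations 0 and 2;
# iterations 0, 1 and 2 read nothing and are mutually independent.
print(ppld_loop_reference([1.0, 2.0, 3.0, 4.0], [[], [], [], [0, 2]], [0, 0, 0, 2]))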

3.1 The SCHEDULE directive

The format of the new SCHEDULE directive we propose is the following:

!HPF$ SCHEDULE (schedule-triplet-spec-list [, scalar-mask-expr])

This directive must precede a PPLD loop. It defines data dependencies between instructions of different iterations using schedule-triplet-spec-list and scalar-mask-expr (similar to the forall-triplet-spec-list and scalar-mask-expr parameters used in the FORALL statement). For each value of the reference variable (the control variable associated with the loop), all of the values of the constraint variable (the first control variable in schedule-triplet-spec-list) determine the loop iteration index subset for which there is a dependence relation. In Program 1, the extended ON HOME directive specifies the active processor set associated with each iteration K: it contains the processors which own A(K) or one of the A(INDEX(K,J)) for J varying from 1 to NB_ELT(K). The SCHEDULE directive defines the precedence constraints (left dependencies) of every task K: iteration K depends on the iterations J such that J < K and J is contained in the INDEX(K,1:NB_ELT(K)) array section. In fact, the dependencies can be expressed by any logical expression: here we use an array-based expression, but any boolean expression may be used, provided it remains precomputable.
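As an illustration only (the real inspection code is generated by the compiler and run by CoLUMBO), the following Python sketch computes, for Program 1, both the active processor set induced by the extended ON HOME directive (assuming the CYCLIC distribution of A, where element i is owned by processor (i-1) mod P) and the left-dependence set induced by the SCHEDULE directive; indices are 1-based to match the Fortran code.

def inspect_program1(index, nb_elt, nprocs):
    # index[k] lists the 1-based elements of A read by iteration k; only the
    # first nb_elt[k] entries are meaningful.
    def owner(i):                       # CYCLIC distribution of A
        return (i - 1) % nprocs
    proc_set, deps = {}, {}
    for k in index:
        reads = index[k][:nb_elt[k]]
        # Extended ON HOME: owners of A(k) and of the A(index(k, j)).
        proc_set[k] = {owner(k)} | {owner(i) for i in reads}
        # SCHEDULE: iteration k depends on the iterations j < k that it reads.
        deps[k] = {j for j in reads if j < k}
    return proc_set, deps

# Four iterations on 2 processors: iteration 3 reads A(1), iteration 4 reads A(2) and A(3).
print(inspect_program1({1: [], 2: [], 3: [1], 4: [2, 3]}, {1: 0, 2: 0, 3: 1, 4: 2}, nprocs=2))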

From the extended ON HOME directive, the inspector code first generates the active processor sets associated with every iteration. Then, from the SCHEDULE directive, it generates the DAG associated with the loop. Finally, the knowledge of both the active processor sets and the task graph is used to compute the task schedule.

3.2 Inspection-Execution associated with the SCHEDULE directive

Program 2 presents the inspector pseudo-code obtained from Program 1 by the first two phases: the processor set inspection (1, see [6]) and the DAG inspection (2, 3). Note that the notation "add ... to" is not standard Fortran 90, but a pseudo-code notation. At the end of these inspections, PROC_SET(K) and DEPENDENCIES(K) contain respectively the active processor set and the precedence constraint set for iteration K.

      ! PROCESSOR SETS INSPECTOR
!HPF$ INDEPENDENT
(1)   DO K = 1, N
        FORALL (J=1:K, J .EQ. K .OR. ANY(INDEX(K,1:NB_ELT(K)) .EQ. J))
          add (OWNER(A(J))) to PROC_SET(K)
        END FORALL
      END DO
      FINALIZE_SETS()

      ! DAG INSPECTOR
!HPF$ INDEPENDENT
(2)   DO K = 1, N
        FORALL (J=1:K-1, ANY(INDEX(K,1:NB_ELT(K)) .EQ. J))
          add J to DEPENDENCIES(K)
        END FORALL
      END DO
(3)   FINALIZE_DAG()

(4)   EXECUTE_DAG()

Program 2. Inspection-Execution pseudo-code for Program 1.

The DAG construction pseudo-code (2) is a direct translation of the SCHEDULE directive. Then, the first step of the FINALIZE_DAG call (3) builds a simplified DAG [19] where only important data are kept and where redundant constraints are removed. In order to do so, global communications are performed to fetch the distributed information coming from the distributed inspector phase (the INDEPENDENT loop in (2)). Then, in a second step, the inspector computes the task schedule (see Section 4).

In the executor code generated by the compiler, the PPLD loop is translated into SPMD code so that execution can be driven by the result of the inspection phase. The loop body is rewritten into a procedure taking the iteration number K as parameter (see pseudo-Program 3). Note that the notations "I am in ..." and "is local" are not standard Fortran 90, but pseudo-code notations. PROC_SET(K) indicates the active processor set required by task K, and thus its mapping. Note also that the code generation always follows the HPF rules (in general, the owner computes rule). So the procedure can include communications (point-to-point or global) between the processors participating in the task, but not between different tasks, as in the HPF code generation model.

      SUBROUTINE TASK(K)
        USE TASK_MODULE          ! Common variables with the main program
        INTEGER K
        IF (I am in PROC_SET(K)) THEN
          PUSH_SET(PROC_SET(K))  ! Communications in the context of PROC_SET(K)
          B = 0.0
          DO J = 1, NB_ELT(K)
            WHERE A(INDEX(K,J)) IS LOCAL
              B = B + A(INDEX(K,J))
          END DO
          CALL REDUCTION(B)
          IF (A(K) is local) THEN
            A(K) = A(K) + B
          END IF
          POP_SET()
        END IF
      END SUBROUTINE

Program 3. Pseudo-procedure generated from the PPLD loop body.

The scheduled loop is therefore replaced by the EXECUTE_DAG call (4), which performs the calls to the TASK subroutine shown in Program 3 according to the schedule obtained from the inspection phase.
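The executor side can then be pictured as the following minimal Python sketch (illustrative names only, not the CoLUMBO API): EXECUTE_DAG simply walks the local schedule produced by the inspector and invokes the task procedure, which, as in Program 3, only does work on processors belonging to the active processor set.

def execute_dag(local_schedule, my_rank, proc_set, run_task):
    # Model of the EXECUTE_DAG call on one processor: run the locally
    # scheduled tasks in the order chosen by the inspector.
    for k in local_schedule:
        if my_rank in proc_set[k]:      # guard corresponding to Program 3
            run_task(k)                 # may communicate within PROC_SET(k)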

4 Scheduling of tasks associated with a PPLD loop

The scheduler that we have implemented is distributed across processors: every processor has a local scheduler, which only orders the iterations that must be executed locally. In fact, the scheduler is the last part of our inspector (taking place at the end of the FINALIZE_DAG call in Program 2 (3)). In order to avoid deadlock situations that could arise if two iterations were executed in different orders on distinct processors, every local scheduler owns exactly the same information about the iterations that it has to schedule (but the information required to schedule one iteration is replicated only on the processors that belong to the active processor set of this iteration). Moreover, a distributed implementation of the scheduler avoids the bottleneck induced by a centralized client/server implementation, which constitutes a natural limitation for performance.

4.1 Simple scheduling and Wennink's scheduling

The general principle of our scheduling algorithm is to perform what we call a symbolic execution of the DAG in order to obtain on every processor a sorted list holding the schedule of the local tasks. The real execution of the local tasks (the executor phase performed in the EXECUTE_DAG call) is based on this local task list.

We studied two different algorithms for scheduling multi-processor tasks on dedicated processors, in order to assess the interest of each one. Both algorithms are based on priority lists, the priority criterion being the critical path heuristic. When ties have to be broken between two tasks, secondary criteria are used, such as the iteration number associated with each task (iteration numbers ensure a total order between all of the tasks). The first scheduling algorithm (designed by us) ignores the fact that tasks are possibly executed on multiple processors: in this so-called simple scheduling, every local scheduler considers its local tasks as if they only required one processor (that is, the local processor). The second algorithm can handle the multiple processors associated with each task, and is directly based on Wennink's algorithm [23,30], using an insertion method; it is expected to provide an efficient schedule for our particular problem, and thus serves as a reference.

Simple schedule: In this algorithm (see Program 4), local tasks are scheduled without considering the other processors of the same active processor set. The symbolic execution of the DAG consists of several steps.

Notations:
  Vp = Set of tasks to be executed on processor p
  Rp = Set of tasks without active predecessors (ready tasks)
  LP = Priority list
  LC = Completed task list
  Pp = Partial schedule result (FIFO) for processor p

Symbolic execution: (2nd part of the FINALIZE_DAG call)
  Pp = {}
  LC = {}
  While there are tasks not yet executed (i.e. while Vp is not empty)
    (1) For each ready task (from Rp)
          Add task to LP
    (2) While LP not empty
          Remove first task from Vp and LP and add it to LC and Pp
    (3) Exchange LC task lists between all processors
    (4) Update Rp ready task set using LC (at the end, LC = {})

Real execution: (EXECUTE_DAG call)
  For each task in Pp (in order)
    Execute current task

Program 4. Simple scheduling: symbolic and real execution.

On processor p, a step is defined by the set Rp of all tasks that are ready at the beginning of the step. A task is ready if all of the tasks that precede it in the DAG (called its predecessors) have completed their symbolic execution (i.e. if its counter of active predecessors equals zero). Phase (1) sorts the set Rp into the priority list LP according to the priority criteria. Phase (2) successively and symbolically activates each task in the ordered list LP and adds it to the local list LC of completed tasks as well as to the local schedule list Pp. Once all of the tasks in the LP list have been processed, all processors exchange their local LC lists for the current step (Phase 3). Then, every processor updates the active predecessor counters of its not yet (symbolically) executed local tasks (Phase 4). All of the tasks whose active predecessor counters become zero constitute the set of ready tasks for the next symbolic execution step. Once all tasks have completed their symbolic execution, the real execution step begins (see Program 4): tasks are executed on every processor p according to the order defined by the Pp list. If N is the number of tasks to schedule, the execution cost of the simple scheduling algorithm is in O(N log N), due to the cost of the sorting algorithm. The communication cost is in O(N), due to the exchange of the LC task lists.

Wennink's scheduling: This algorithm, presented in [23,30], is based on the insertion method. Unlike the simple scheduling algorithm, it allows the schedule list Pp to be reconsidered at each step. The task of greatest priority is inserted in the best position into Pp according to heuristics based on the two following longest paths, computed for every already scheduled task v: the longest path, denoted by Iv, from the symbolic initial task I to the task v, and the longest path, denoted by vT, from the task v to the symbolic terminal task T. This algorithm is therefore more flexible than the previous one, and less dependent on the priority criteria used; that is why it is expected to be a reasonably efficient scheduling algorithm. However, its complexity is in O(N²) (both in execution and communication), well above the complexity of the previous algorithm.
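The symbolic execution can be summarized by the following single-process Python simulation of all the local schedulers (an illustrative model, not the distributed CoLUMBO implementation); priority is typically the critical-path value, and ties are broken by the iteration number as described above.

def simple_schedule(tasks, preds, procs, priority):
    # tasks      : task ids (loop iteration numbers)
    # preds[k]   : set of predecessors of task k in the DAG
    # procs[k]   : active processor set of task k (fixed by the HPF mapping)
    # priority[k]: priority of task k (e.g. its critical-path length)
    # Returns P, where P[p] is the ordered local schedule of processor p.
    nprocs = 1 + max(p for k in tasks for p in procs[k])
    P = {p: [] for p in range(nprocs)}
    V = {p: {k for k in tasks if p in procs[k]} for p in range(nprocs)}
    counter = {p: {k: len(preds[k]) for k in V[p]} for p in range(nprocs)}
    done = set()
    while len(done) < len(tasks):
        LC = set()                                    # tasks completed in this step
        for p in range(nprocs):                       # every local scheduler
            ready = [k for k in V[p] if counter[p][k] == 0]
            for k in sorted(ready, key=lambda t: (-priority[t], t)):
                P[p].append(k)                        # phases (1) and (2)
                V[p].discard(k)
                LC.add(k)
        done |= LC
        for p in range(nprocs):                       # phases (3) and (4)
            for k in V[p]:
                counter[p][k] -= len(preds[k] & LC)
    return P

Each pass of the outer loop corresponds to one step of Program 4; the exchange of the LC lists is modeled here by letting every local scheduler observe the same completed set.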

4.2 Non-iterative scheme: the static scheduling

Whatever algorithm is used to compute the schedule, information is needed on each task to compute the critical path, such as the cost of the task and the identity of its predecessors and successors in the task graph. In an HPF compilation framework, it seems difficult to extract the execution cost of every iteration at compile time. If iteration costs cannot be known at compile time and if the PPLD loop is not part of an iterative scheme, then iteration costs are set by default to unit costs. In this case, the schedule only depends on the precedence constraints in the DAG, and is said to be static. In fact, with static scheduling, the critical path is the distance in the graph with unit costs (that is, the number of edges to cross from one vertex to a symbolic terminal vertex). Therefore, the priority list in static scheduling is based first on distances in the graph with unit costs, and second, in case of ties, on secondary criteria, notably task iteration numbers. Although presented here in a non-iterative scheme, it is obviously possible to use the static schedule in an iterative way, but the Pp list will never be reconsidered, since the criteria used will never evolve.
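A possible sketch of this priority computation (again an illustrative Python model, not the library code) computes the longest path from each task to a terminal task; with unit costs it gives the static priority (counting tasks rather than edges, which yields the same ordering), and with measured costs it gives the dynamic priority of Section 4.3.

def critical_path(tasks, succs, cost=None):
    # succs[k] is the set of successors of task k in the DAG.  cost=None
    # means unit costs (static scheduling); a dict of measured costs gives
    # the dynamic (TT) priority.
    if cost is None:
        cost = {k: 1 for k in tasks}
    cp = {}
    def longest(k):                      # memoized longest path starting at k
        if k not in cp:
            cp[k] = cost[k] + max((longest(s) for s in succs[k]), default=0)
        return cp[k]
    for k in tasks:
        longest(k)
    return cp

Such a value can then be passed as the priority argument of the simple_schedule sketch above.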

4.3 Extension to the iterative scheme: dynamic scheduling

Clearly, not all iterations have the same computation and communication costs. Consequently, the static schedule may be inaccurate. In the dynamic scheduling algorithm, iteration costs (that is, communication and/or computation costs) are taken into account. If the PPLD loop is part of an iterative scheme such that the precedence constraints (and thus the task graph) do not evolve during the iterative scheme, then dynamic scheduling can be used. It consists of the following steps. First of all, a static schedule is computed. Then, a first (real) execution of the PPLD loop (the first call to EXECUTE_DAG) is performed, based on the result of the static scheduling. During this execution, cost measures are collected for every task on every processor. At the end of this first execution, a short inspection phase takes place, using collective communications, which computes the real cost of every task. Then, a second task schedule is computed, where the critical path of each task in the graph is calculated by taking the cost measures into account. Finally, every following (real) execution of the PPLD loop within the external loop is driven by the result of this second schedule. Our CoLUMBO library can use different metrics for task costs, such as pure computation times (without communication times) or theoretical task times (pure computation times plus theoretical communication times). Obviously, these metrics are valid for one particular execution only, and therefore cannot be exact. Thus, one objective of this study was also to identify which of these cost criteria seemed to be the most accurate. In this paper, we will only consider the most relevant task cost (in theory and as observed in practice), which is the theoretical task time cost, denoted TT.
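Combining the two sketches above, the iterative scheme can be outlined as follows (a sketch under the assumption of a hypothetical execute_and_time helper that performs one real execution driven by a schedule and returns the measured TT cost of every task):

def dynamic_scheme(tasks, preds, succs, procs, execute_and_time, n_outer):
    # Pass 1: static schedule (unit costs), first real execution + cost measures.
    P = simple_schedule(tasks, preds, procs, critical_path(tasks, succs))
    costs = execute_and_time(P)
    # Pass 2: reschedule with the measured costs and reuse this schedule for
    # the remaining iterations of the outer (iterative) scheme.
    P = simple_schedule(tasks, preds, procs, critical_path(tasks, succs, costs))
    for _ in range(n_outer - 1):
        execute_and_time(P)
    return P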

4.4 The case of a multi-threaded runtime system

In the above algorithms, two independent tasks that have the same value of the scheduling criterion are ordered according to their loop iteration numbers only. However, there is no reason to start one of the two rather than the other. A way to avoid this final choice consists in providing a runtime support with light-weight processes (threads), so that local concurrency between these tasks can also be exploited. In order to avoid thread overhead as much as possible, in our runtime support, threads keep the processor until they enter a blocking communication call, and the number of simultaneous tasks (one of them being active, the others sleeping) is limited. This allows communications performed in one iteration to be overlapped with computations performed in another one. Naturally, the initial program semantics is enforced, since a new thread is initiated for the next task in the schedule list only if all of its predecessors have completed their execution. To implement this runtime support, we used threads in user mode (as opposed to heavier thread systems) provided by the MARCEL library of the PM2 High-Perf environment [5].
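This thread policy can be modeled in Python as follows (a single-node sketch under the assumption that the predecessor sets only refer to locally scheduled tasks; run_task is assumed to block on its communications, which is when another thread gets the processor):

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def threaded_execute(local_schedule, preds, run_task, max_active=3):
    # At most max_active tasks run concurrently, and the next task of the
    # schedule list is started only once all of its predecessors (given as
    # sets in preds) have completed, preserving the program semantics.
    finished, pending, i = set(), {}, 0
    with ThreadPoolExecutor(max_workers=max_active) as pool:
        while i < len(local_schedule) or pending:
            while (i < len(local_schedule) and len(pending) < max_active
                   and preds[local_schedule[i]] <= finished):
                k = local_schedule[i]
                pending[pool.submit(run_task, k)] = k
                i += 1
            if pending:
                done, _ = wait(pending, return_when=FIRST_COMPLETED)
                for fut in done:
                    finished.add(pending.pop(fut))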

4.5 Memory costs of the CoLUMBO library

If every task owns on average succ successors (or predecessors) in the task graph, then before scheduling, the memory cost of a DAG with N vertices is in O(N × succ) with local simplifications (if a task does not concern a given processor, the associated information is not stored). Besides having a higher computation cost, Wennink's algorithm also imposes an additional memory cost due to the necessity of storing the transitive closure of the graph, so its memory cost is in O(N × succ + N²). Once the schedule is computed, if the thread runtime system is not activated, only the ordered task execution list (Pp) is kept, resulting in a cost in O(N). Conversely, with threads, successor information is kept in order to verify the legality of executing a task simultaneously with the other already active threads; the memory cost is then in O(N × succ + N).

5 Experimental Results

The algorithm that we have chosen to validate our ideas experimentally is a block Cholesky factorization algorithm for symmetric positive definite sparse matrices [11,13]. This factorization is an extremely important computation arising in many scientific and engineering applications. It is quite time-consuming and is frequently the computational bottleneck in these applications. In Karmarkar's algorithm [20], an iterative algorithm, the Cholesky factorization can be used in every iteration to solve a sparse system AX = B. In this algorithm, the sparse structure of the matrix is preserved during all the iterations, so that the inspector result is always valid and can be reused. We therefore use the Cholesky factorization as our kernel example both in non-iterative and iterative schemes, in order to illustrate our study. At the time of writing, no compiler can directly generate inspector-executor code from an HPF source code using our extended ON HOME and SCHEDULE directives. Therefore, we wrote HPF codes with explicit calls to the TriDenT and CoLUMBO libraries, and compiled them with the ADAPTOR [7] HPF compiler. The TriDenT library provides Trees (implemented with Fortran 95 derived data types) to write irregular data structures with hierarchical access [6]. The CoLUMBO library provides inspection-execution algorithms for irregular processor sets, irregular loop indexes, and inter-iteration or extra-iteration

communications [6]; it also provides a support for irregular progressive prefix operations [8] and for the task scheduling associated with a PPLD loop.

The Cholesky factorization algorithm used is a 1D column-block version, so the representation of the matrix is column-block oriented (see Figure 1). The A matrix is defined as an array of column-blocks (type Level1). Each of these column-blocks (BCOL) is constituted by an array of blocks (type Level2), every block (BLOCK) being a dense sub-matrix (type Level3). The matrix is distributed at the column-block level according to a “subtree-to-subcube”-like mapping [11,16] using an HPF2 INDIRECT distribution format; this mapping leads to an efficient reduction of communication while keeping a good load balance between processors.

      type Level3
        real VAL
      end type Level3
      type Level2
        type (Level3), pointer :: BLOCK(:,:)
      end type Level2
      type Level1
        type (Level2), pointer :: BCOL(:)
      end type Level1

      type (Level1), allocatable :: A(:)
!HPF$ TREE A
!HPF$ DISTRIBUTE A(INDIRECT(MAP))

Figure 1. Derived type definitions and associated Tree.

The Cholesky factorization program (see Program 5) has the same structure as the simple program given as an introductory example (Program 1). It contains a reduction step, and the processors which participate in an iteration are fixed by an extended ON HOME directive. At each iteration K, we update column-block K (CMOD operation) by using a reduction of the contributions of some of the column-blocks which precede it. More precisely, these are the column-blocks from 1 to K-1 which own a non-zero block facing the diagonal block of column-block K (denoted A(J)%BCOL(:) in A(K)%BCOL(1) to simplify). The expression used in the extended ON HOME directive is also used in the SCHEDULE directive, since it specifies the induced loop-carried dependencies. The CDIV operation computes the final value of column-block K.

We will denote the different experimental versions as follows. The first version, called V0, is a native reference HPF2 version without any inspection-execution optimization. The second one, called V1, contains all of our inspection-execution schemes except the DAG inspection, and is used to evaluate the impact of the DAG inspection. For the simple scheduling (see 4.1), S will denote the static version (with unit costs), and D the dynamic version (with TT costs). When using the MARCEL multi-thread runtime system, all of these versions will be denoted with the _th suffix (for example, S_th). Wennink's scheduling will be denoted by W.

!HPF$ SCHEDULE (J = 1:K-1, A(J)%BCOL(:) in A(K)%BCOL(1))
      DO K = 1, N
!HPF$   ON HOME (A(J), J=1:K, J .EQ. K .OR. A(J)%BCOL(:) &
!HPF$            in A(K)%BCOL(1)), RESIDENT, BEGIN
        B(:)%BLOCK(:,:) = 0.0
!HPF$   INDEPENDENT, REDUCE(B)
        DO J = 1, K-1
          IF (A(J)%BCOL(:) in A(K)%BCOL(1)) THEN
            B = B + CMOD(A(J)%BCOL)
          END IF
        END DO
        A(K)%BCOL = A(K)%BCOL + B
        CDIV(A(K)%BCOL)
!HPF$   END ON
      END DO

Program 5. Block Cholesky factorization.

Figure 2 shows the characteristics of the four studied matrices. The last column shows the average coefficient of irregularity [8] for each matrix: every column-block K has K × coefficient left dependencies; therefore, the lower the coefficient, the fewer column-blocks modify each column-block K.

Name               Column-Blocks   Columns   Non-Zeroes   Operations   Average Coefficient
2D Grid 511×511    8216            261121    12 M         2.5 GFlop    0.121%
Oilpan             8024            73752     9.5 M        3.35 GFlop   0.129%
BCSSTK32           4286            44609     5.5 M        1.3 GFlop    0.368%
3D Cube 39×39×39   1129            59319     22 M         22 GFlop     4.518%

Figure 2. Characteristics of test matrices.

5.1 Experiments on the scheduling of tasks associated with a PPLD loop

In all of the following, global time refers to the sum of the inspection and execution times, and re-execution time refers only to the execution time spent in the iterative scheme (without inspection time).

Figure 3. Comparisons between versions (Global time - Re-execution time).

Comparisons between versions: Figure 3 presents, for our test matrices and on 16 processors, the relative efficiencies of the different versions with respect to version V1; the leftmost curves show global times and the rightmost curves re-execution times. As we have already shown in [6], the inspection of active processor sets leads to a great improvement compared to the basic HPF2 version V0 (the ratio of times between versions V1 and V0 varies from 4 to 17% depending on the test matrix). Except for the W scheduling, all versions yield better global times than version V1 for all matrices. This confirms that, in order to be profitable, our

inspection-execution scheme does not necessarily require applications to be iterative, unlike other works that used inspection-execution schemes. The ability to improve the performance of the PPLD loop execution depends on the irregularity coefficient (the fewer dependencies exist between iterations, the more the scheduler can optimize the iteration order), but also, naturally, on the number of iterations to schedule (the smaller this number, the more limited the choices). Hence, the various versions achieve their best improvements for the 2D Grid 511 and Oilpan matrices. In the rest of the experimental analysis, we will focus on the extreme situations (lowest irregularity coefficient and highest iteration number and, conversely, highest coefficient and lowest iteration number), and thus consider the 2D Grid and 3D Cube matrices.

The simple scheduling algorithm does not use processor sets (cf. Section 4.1). This comes at the risk of a loss of efficiency, since not considering the associated processors could produce delays among those processors, and thus passive waits on communications. Consequently, we thought that it could be appreciably improved by using dynamic cost criteria. However, experiments show that the improvements compared to S are very small (only 2.4% for the 2D Grid 511 and 0.1% for Cube 39). Dynamic cost criteria other than the TT criterion yielded even lower performance (not shown), in particular because they do not represent task properties well.

Considering global times, W presents a significant additional inspection cost due to the algorithm (see 4.1). On the other hand, with respect to re-execution times, W gives results comparable to the S version, and is less sensitive to the cost criteria used than the simple scheduling algorithm: while the variation between the best and worst times observed for the simple scheduling with various cost criteria is around 10%, it is only 0.6% for Wennink's scheduling.

The thread runtime system (see 4.4), thanks to computation/communication overlapping, allows a relative improvement in efficiency (between 3 and 7%).

The static criterion yields the best performance for both matrices, since the dynamicity brought by the thread runtime support allows the schedule supplied by the inspector to be refined at run time.

Efficiency comparisons: Figure 4 shows relative efficiencies for the 2D Grid matrix with respect to the sequential time of V1.


Figure 4. Efficiencies relative to V1 on one processor for the 2D Grid matrix (Global time - Re-execution time).

Again, all versions except W perform much better, with relative efficiencies on 16 processors of 40% for V1 and more than 55% for S. These values confirm the good scalability of our inspectors, especially if we consider that version V0 has a very low relative efficiency (between 2 and 11%, not shown in the figure). Therefore, irrespective of the number of processors, the versions using our inspection-execution scheme are more efficient than the basic HPF2 version. The versions using task scheduling improve their relative efficiency with the number of processors: the more processors there are, the more they have to synchronize, and the more the schedule can play a role in minimizing the influence of this synchronization. By comparing versions S and S_th (on re-execution times), we observe that for a sufficient number of processors (8 in the example), as communication becomes more important, the thread runtime support yields better performance (a 5% improvement). The thread runtime support is beneficial whenever the amount of communication is sufficient, that is to say when the number of processors is sufficiently large.

Figure 5. Time proportions of the various phases of S version global time.

Time proportion for inspection: Figure 5 shows, for the 2D Grid and 3D Cube matrices, the amount of time spent in the various phases of version D: the static (DAG1) and the dynamic (DAG2) scheduling phases, the active processor set inspection phase (PROC) and the first execution phase (EXEC). The relative proportions are matrix dependent because of the inspection codes (which are linear with respect to the number of edges in the task graph induced by the matrix). It is therefore natural that inspection costs are higher for the 2D Grid than for the 3D Cube. Nevertheless, one can notice that the total

inspection part is relatively stable with respect to the number of processors. The importance of the inspection times for the 2D Grid matrix should be put into perspective, since the additional DAG inspection cost is compensated with only 8 processors and one iteration (as we can see in Figure 4, with respect to version V1). With fewer than 8 processors, the execution benefit is still important, and the execution phase needs only a few re-executions to break even (15, 6 and 4 re-executions on 1, 2 and 4 processors respectively). To sum up, DAG inspection improves performance if, on the one hand, the underlying task graph provides enough independence between tasks (a low irregularity coefficient in our example) and, on the other hand, there are enough processors, since the aim of scheduling is to decrease the synchronization induced at run time on these processors (more than 8 processors for our examples).

5.2 Comparisons between HPF and MPI versions

Figure 6 presents the relative efficiencies for global times (top) and re-execution times (bottom) for the 2D Grid matrix (left) and the 3D Cube matrix (right), compared to the time of a state-of-the-art optimized MPI implementation (the PaSTiX software, see [16]). The aim of this comparison is only to estimate the potential performance of our approach. We introduce in Figure 6 another version of the Cholesky factorization, in order to show the interest of the scheduling. This version, denoted by the prefix P, uses the progressive irregular prefix operation presented in [8], as well as the DAG inspection. The performance of this version is closer to that of the MPI version than that of the version presented above. This is due to the prefix operation, and above all to the DAG inspection. The main difference between these versions is that the task graph for version P contains only

mono-processor tasks, due to the prefix operation implementation, unlike for Program 5. Note that all the analysis regarding the first version is also valid for version P.


Figure 6. Efficiencies relative to PasTiX software on 2 processors for 2D Grid matrix (left) and 3D Cube matrix (right): global times (top) - re-execution times (bottom).

One can note that the differences between the performance of the HPF and MPI versions for re-execution times are rather low: between 8 and 20% for the 2D Grid matrix (PS), and between 4 and 28% for the 3D Cube matrix (S_th). For global times, the differences are larger, while remaining reasonable: between 40 and 50% for the 2D Grid (PS) and between 7 and 18% for the 3D Cube (S_th). It is generally accepted that the time ratio between a regular HPF program and an equivalent hand-written MPI program is about 1.5, and is greater for irregular programs, which are our framework of study. Here, the re-execution time ratios are between 1.14 and 1.24 (PW) for the 2D Grid, and between 1.04 and 1.31 (S_th) for the 3D Cube. For global times, these ratios are higher due to the inspector costs (between 2 and 2.46 (PS) for the 2D Grid, and only between 1.07 and 1.3 (S_th) for the 3D Cube). Nevertheless, these ratios are satisfactory, and even excellent if we only consider the execution part. They confirm that an HPF compiled code can be an honest competitor to an ad hoc MPI version.

6 Conclusion

In this paper, we have presented a study on task scheduling where each task is one iteration of a loop with precomputable partial loop-carried dependencies. Our approach is based both on the use of a directive which specifies the dependencies between iterations, and on an inspector/executor support, implemented in the CoLUMBO library, which builds the task graph associated with the iterations and schedules these tasks. We validate our contribution with a sparse Cholesky factorization algorithm applied to real-size problems, which shows good performance, relatively close to that of an equivalent MPI version. Moreover, we think that our study can be transposed to other parallelizing compilers associated with other parallel languages, such as OpenMP [25].

References

[1] S. Benkner. Vienna Fortran 90 — an advanced data parallel language. In Victor Malyshkin, editor, Parallel Computing Technologies: Third International Conference, PaCT-95, St. Petersburg, Russia, Lecture Notes in Computer Science, pages 142–156, Berlin/Heidelberg/London, September 12–25 1995. Springer-Verlag.

[2] S. Benkner. HPF+: High Performance Fortran for advanced scientific and engineering applications. Future Generation Computer Systems, 15(3):381–391, 1999. Also in Tech. Report TR 99-1, Institute for Software Technology and Parallel Systems, University of Vienna, with E. Laure and H. Zima.

[3] S. Benkner, P. Mehrotra, J. Van Rosendale, and H. Zima. High-Level Management of Communication Schedules in HPF-like Languages. In Proceedings of the International Conference on Supercomputing (ICS-98), pages 109–116, New York, July 13–17 1998. ACM Press.

[4] L. Bougé, P. Hatcher, R. Namyst, and C. Perez. Multithreaded code generation for a HPF data-parallel compiler. In Proc. 1998 Int. Conf. Parallel Architectures and Compilation Techniques (PACT'98), ENST, Paris, France, October 1998.

[5] L. Bougé, J.F. Méhaut, and R. Namyst. Madeleine: an efficient and portable communication interface for multithreaded environments. In Proc. 1998 Int. Conf. Parallel Architectures and Compilation Techniques (PACT'98), ENST, Paris, France, October 1998. Also available as LIP Research Report RR9826.

[6] T. Brandes, F. Brégier, M.C. Counilh, and J. Roman. Contribution to Better Handling of Irregular Problems in HPF2. In Proceedings of EURO-PAR'98, volume 1470 of LNCS, pages 639–649, Southampton, UK, September 1998. Springer-Verlag. Also available as LaBRI Research Report RR 120598, 1998, http://dept-info.labri.u-bordeaux.fr/~bregier/DOC/paper.ps.

[7] T. Brandes and F. Zimmermann. ADAPTOR — A transformation tool for HPF programs. In K. M. Decker and R. M. Rehmann, editors, Programming Environments for Massively Parallel Distributed Systems: Working Conference of the IFIP WG10.3, Ascona, Italy, pages 91–96, Cambridge, MA, USA, April 25–29 1994. Birkhäuser Boston Inc.

[8] F. Brégier, M.-C. Counilh, and J. Roman. Asynchronous Irregular Prefix Operation in HPF2. In PDP'2000 - 8th Euromicro Workshop on Parallel and Distributed Processing - IEEE, 2000. To appear.

[9] F.T. Chong, S.D. Sharma, E.A. Brewer, and J. Saltz. Multiprocessor runtime support for fine-grained irregular DAGs. Parallel Processing Letters, 5(4):671–683, December 1995.

[10] M. Drozdowski. An overview of multiprocessor task scheduling. Habilitation Thesis, Institute of Comp. Sci., Poznan University of Technology, 1997.

[11] K.A. Gallivan et al. Parallel Algorithms for Matrix Computations. SIAM, Philadelphia, 1990.

[12] C. Fu and T. Yang. Run-Time Techniques for Exploiting Irregular Task Parallelism on Distributed Memory Architectures. Journal of Parallel and Distributed Computing, 42:143–156, 1997.

[13] A. George and J.W.-H. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, Englewood Cliffs, NJ, 1981.

[14] M. Haines, B. Hess, and P. Mehrotra. Exploiting Parallelism in Multidisciplinary Applications Using Opus. In D.H. Bailey, P.E. Bjørstad, J.E. Gilbert, M.V. Mascagni, R.S. Schreiber, H.D. Simon, V.J. Torczon, and L.T. Watson, editors, Proceedings of the 27th Conference on Parallel Processing for Scientific Computing, pages 710–715, Philadelphia, PA, USA, February 15–17 1995. SIAM Press.

[15] M. Haines, B. Hess, P. Mehrotra, J. Van Rosendale, and H. Zima. Runtime Support for Data Parallel Tasks. In Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation, pages 432–439, McLean, VA, February 1995. IEEE.

[16] P. Henon, P. Ramet, and J. Roman. A Mapping and Scheduling Algorithm for Parallel Sparse Fan-In Numerical Factorization. In Europar'99 Parallel Processing, number 1685 in Lecture Notes in Computer Science, pages 1059–1067, Toulouse, France, August 1999. Springer-Verlag.

[17] HPF Forum. High Performance Fortran Language Specification, January 1997. Version 2.0.

[18] Y-S. Hwang, B. Moon, S.D. Sharma, R. Ponnusamy, R. Das, and J.H. Saltz. Runtime and Language Support for Compiling Adaptive Irregular Programs on Distributed Memory Machines. Software - Practice and Experience, 25(6):597–621, 1995.

[19] C. Jacques and P. Chretienne. Problèmes d'Ordonnancement : Modélisation, Complexité, Algorithmes. Masson, 1988.

[20] N. Karmarkar. A New Polynomial-Time Algorithm for Linear Programming. Combinatorica, 4:373–395, 1984.

[21] A. Krämer. Scheduling Multiprocessor Tasks on Dedicated Processors. Dissertation, February 1995.

[22] D. Kulkarni and M. Stumm. Loop and Data Transformations: A Tutorial. Tech. Report 337, Department of Computer Science and Department of Electrical and Computer Engineering, University of Toronto, November 1993.

[23] R. Lioce and C. Martini. Heuristic methods for machine scheduling problems with processor sets: a computational investigation. Memorandum COSOR 9510, Eindhoven University of Technology, March 1995.

[24] S.S. Mukherjee, S.D. Sharma, M.D. Hill, J.R. Larus, A. Rogers, and J. Saltz. Efficient Support for Irregular Applications on Distributed-Memory Machines. In ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPoPP), pages 68–79, July 1995.

[25] OpenMP Consortium. OpenMP Fortran Application Program Interface, Version 1.0, October 1997.

[26] S. Ramaswany, S. Sapatnekar, and P. Banerjee. A Framework for Exploiting Task and Data Parallelism on DMM. IEEE TPDS, 8(11), 1997.

[27] J. Saltz, R. Mirchandaney, and K. Crowley. Run-Time Parallelization and Scheduling of Loops. IEEE Transactions on Computers, 40(5), May 1991.

[28] M. Ujaldon, E.L. Zapata, S.D. Sharma, and J. Saltz. Parallelization Techniques for Sparse Matrix Applications. Journal of Parallel and Distributed Computing, 38(2):256–266, November 1996.

[29] R. von Hanxleden, K. Kennedy, C. Koelbel, R. Das, and J. Saltz. Compiler Analysis for Irregular Problems in Fortran D. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing, volume 757 of Lecture Notes in Computer Science, pages 97–111, New Haven, Connecticut, August 3–5 1992. Springer-Verlag.

[30] M. Wennink. Algorithmic Support for Automated Planning Boards. PhD thesis, Eindhoven University of Technology, 1995.

[31] S.S. Wu and D. Sweeting. Heuristic Algorithms for Task Assignment and Scheduling in a Processor Network. Parallel Computing, 20(1):1–14, 1994.

[32] T. Yang and A. Gerasoulis. List Scheduling with and without Communication Delays. Technical Report, Department of Computer Science, Rutgers University, NJ 08903, August 1992.