A Fault-tolerant Scheduling Heuristic for Distributed Servers on Heterogeneous Computing Systems

M. Nakechbandi, J.-Y. Colin, LITIS, Université du Havre, France


Outline
I. The central problem
  I.1. The Distributed Servers System
  I.2. The multi-valued DAG
  I.3. What is a feasible solution?
II. The scheduling algorithm
  II.1. Algorithm presentation
  II.2. Complexity
III. Discussion


I. The central problem

I.1. The Distributed Servers System (DSS)

A DSS is a set Σ = {σ1, ..., σs} of geographically distributed, multi-user, heterogeneous servers connected by a network.

[Figure: four servers connected by a network, each with the list of tasks it can execute]
• Server σ1, possible tasks: task 1, task 2, task 4, task 5
• Server σ2, possible tasks: task 2, task 3, task 5
• Server σ3, possible tasks: task 1, task 2, task 3, task 5, task 6
• Server σ4, possible tasks: task 1, task 4, task 5, task 6

• The servers and the network are heterogeneous.
• A given task may or may not be executable by every server.
• We suppose that at most one permanent failure, of a single server, can occur, and that temporary failures do not occur (1-fault hypothesis).


Hypothesis (H): the concurrent executions of tasks of the application on one server have a negligible effect on the processing time of any other task of the same application on the same server.


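I.2. The multi-valued DAG

• I = {1, ..., n} is the set of tasks to be executed on Σ.
• Πi is the set of the processing times $\pi_{i/\sigma_r}$ of task i of I on all servers σr of Σ.
• ∆i,j is the set of all possible communication delays $c_{i/\sigma_r,\, j/\sigma_p}$ of the transmission of the result of task i toward task j on Σ.

Example: Π1 = (3, ∞, 2, 2), so $\pi_{1/\sigma_1} = 3$; in ∆1,2, $c_{1/\sigma_3,\, 2/\sigma_2} = 2$.

[Table: the matrix ∆1,2, with one row and one column per server σ1 to σ4; each entry is a communication delay or ∞]

[Figure: the task graph of the example. Nodes 1 to 6 are labelled with Π1 to Π6; the arcs 1→2, 1→3, 2→4, 3→4, 2→5, 3→5, 4→6 and 5→6 are labelled with ∆1,2, ∆1,3, ∆2,4, ∆3,4, ∆2,5, ∆3,5, ∆4,6 and ∆5,6.]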


• This is a static scheduling problem: the processing times and communication delays of each task on each server are assumed to be known in advance.
• Task duplication is allowed.
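Because all values are known in advance, the multi-valued DAG can be stored as plain lookup tables. The sketch below is one possible encoding, not the authors' implementation. The names `SERVERS`, `PI`, `ARCS`, `pred` and `capable_servers` are hypothetical. Π1 is the vector quoted in the slides; Π2 to Π6 are derived as r minus b from the earliest-date tables of the worked example, on the assumption that the earliest end date is the earliest start date plus the processing time. The delay sets ∆i,j are left out.

```python
import math

INF = math.inf
SERVERS = ("s1", "s2", "s3", "s4")  # hypothetical labels standing for σ1..σ4

# Processing times Π_i, one entry per server; INF means "cannot run here".
# Π_1 is quoted in the slides; Π_2..Π_6 are derived as r_i - b_i (assumption).
PI = {
    1: (3, INF, 2, 2),
    2: (2, 4, 3, INF),
    3: (INF, 2, 2, 4),
    4: (4, INF, INF, 2),
    5: (5, 3, 2, 4),
    6: (INF, INF, 2, 3),
}

# Arcs (i, j) of the task graph: the result of task i is needed by task j.
ARCS = [(1, 2), (1, 3), (2, 4), (3, 4), (2, 5), (3, 5), (4, 6), (5, 6)]


def pred(task):
    """PRED(task): the direct predecessors of a task, in increasing order."""
    return sorted(i for (i, j) in ARCS if j == task)


def capable_servers(task):
    """Σ_task: the servers on which the task has a finite processing time."""
    return [s for s, p in zip(SERVERS, PI[task]) if p != INF]


print(pred(6), capable_servers(6))
```

The server sets obtained this way agree with the "possible tasks" lists of the four servers.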


I.3. What is a feasible solution?

In what follows, Σi denotes the set of servers of Σ that can execute task i. A solution is feasible if:
• each task i is executed at least once, on at least one server σr of Σi,
• each execution of a task i on a server σr of Σi is associated with one positive execution date $t_{i/\sigma_r}$,
• for each execution of a task i on a server σr such that PRED(i) ≠ ∅, there is at least one execution of each task k ∈ PRED(i), on a server σp ∈ Σk, that can transmit its result to server σr before the execution date $t_{i/\sigma_r}$.

The last condition is known as the Generalized Precedence Constraint (GPC).

Objective: minimize the makespan T of a feasible solution S, with T defined as

$$T = \max_{i/\sigma_r \in S} \left( t_{i/\sigma_r} + \pi_{i/\sigma_r} \right)$$
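As a small illustration of this definition, the sketch below evaluates T on the primary schedule of the worked example given later in these slides. The dictionary layout and the name `makespan` are assumptions made for the illustration; the processing times are taken as end date minus start date from that example.

```python
# (task, server) -> (execution date t, processing time pi).
# "s3" and "s4" are hypothetical labels standing for σ3 and σ4.
schedule = {
    (1, "s3"): (0, 2),
    (2, "s3"): (2, 3),
    (3, "s3"): (2, 2),
    (3, "s4"): (4, 4),
    (4, "s4"): (8, 2),
    (5, "s3"): (5, 2),
    (6, "s4"): (10, 3),
}


def makespan(schedule):
    """T = max over all scheduled executions of (start date + processing time)."""
    return max(t + pi for (t, pi) in schedule.values())


print(makespan(schedule))  # 13
```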


II. The scheduling algorithm

II.1. Algorithm presentation

Our algorithm has two phases:
• the first one schedules the primary copies, using the DSS-OPT algorithm (J.-Y. Colin, M. Nakechbandi, PDPTA'05);
• the second one schedules the backup copies, using a variant of the eFRCD algorithm (X. Qin and H. Jiang, Parallel Computing, 2006).


Phase 1: Scheduling of primary copies (the DSS_OPT algorithm)

DSS_OPT has two steps.

step 1:
• computes the earliest feasible execution dates $b_{i/\sigma_r}$ for all possible executions i/σr of each task i on each possible server σr,
• uses a PERT-like algorithm that takes the communication delays into account, with

$$b_{i/\sigma_r} \leftarrow \max_{k \in \mathrm{PRED}(i)} \; \min_{\sigma_p \in \Sigma_k} \left( b_{k/\sigma_p} + \pi_{k/\sigma_p} + c_{k/\sigma_p,\, i/\sigma_r} \right)$$
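The recurrence of step 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DSS_OPT code: the function name `earliest_dates`, the data layout and the small two-server instance (including its uniform delay of 2 between distinct servers) are all hypothetical. Tasks without predecessors are assumed to start at date 0 on every server that can run them.

```python
import math
from graphlib import TopologicalSorter  # Python 3.9+

INF = math.inf


def earliest_dates(pi, preds, delay):
    """Return b[(i, r)], the earliest start date of task i on server index r.

    pi[i][r]          : processing time of task i on server r (INF if impossible)
    preds[i]          : list of the predecessor tasks of i
    delay(k, p, i, r) : delay to send the result of k run on p to i run on r
    """
    b = {}
    for i in TopologicalSorter(preds).static_order():
        for r, p_time in enumerate(pi[i]):
            if p_time == INF:
                b[(i, r)] = INF  # task i cannot run on server r
                continue
            # For each predecessor keep its best copy (min); the task must
            # wait for the latest of these best arrivals (max).
            b[(i, r)] = max(
                (min(b[(k, p)] + pi[k][p] + delay(k, p, i, r)
                     for p in range(len(pi[k])))
                 for k in preds[i]),
                default=0,
            )
    return b


# Hypothetical instance: 3 tasks, 2 servers, delay 2 between distinct servers.
pi = {1: (3, 2), 2: (2, INF), 3: (1, 1)}
preds = {1: [], 2: [1], 3: [1, 2]}


def delay(k, p, i, r):
    return 0 if p == r else 2


b = earliest_dates(pi, preds, delay)
print(b[(3, 0)], b[(3, 1)])  # 5 7
```

Processing the tasks in topological order guarantees that every value on the right-hand side of the recurrence is already known when it is needed.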


[Figure: the task graph of the example (nodes 1 to 6 with Π1 to Π6, arcs labelled with the delay sets ∆i,j), with the table of b1 and r1 attached to task 1]

This gives, for example, the following earliest execution dates b and earliest end dates r:

Task   Date   σ1   σ2   σ3   σ4
1      b1     0    ∞    0    0
       r1     3    ∞    2    2
2      b2     3    5    2    ∞
       r2     5    9    5    ∞
3      b3     ∞    4    2    4
       r3     ∞    6    4    8
4      b4     7    ∞    ∞    8
       r4     11   ∞    ∞    10
5      b5     7    7    5    7
       r5     12   10   7    11
6      b6     ∞    ∞    12   10
       r6     ∞    ∞    14   13


step 2:
• The tasks are examined recursively, starting from the tasks without successors and ending when the tasks examined have no predecessors.
• For every task i that has no successor, determine its execution i/σr ending at the earliest possible date $r_{i/\sigma_r}$. If several executions of task i end at this same earliest date, one is chosen, arbitrarily or using other criteria of convenience, and is kept in the solution.
• Then, for each execution i/σr kept in the solution that has at least one predecessor, the subset Li of the executions of its predecessors that satisfy GPC(i/σr) is established. This subset contains at least one execution of each predecessor of i. One execution k/σp of every predecessor task k of task i is chosen in the subset, arbitrarily or using any criteria of convenience, and kept in the solution. This 'primary' copy is executed at date $b_{k/\sigma_p}$.
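The backward selection of step 2 can be sketched as follows. This is an illustration under assumptions, not the DSS_OPT implementation: the name `select_primaries`, the data layout, and the hypothetical two-server instance (with its precomputed earliest dates `b` and a uniform delay of 2 between distinct servers) are invented for the example. Where the slides allow an arbitrary choice, the sketch takes the lowest server index.

```python
import math

INF = math.inf


def select_primaries(b, pi, preds, delay):
    """Walk back from the tasks without successors and keep, for each kept
    execution, one execution of every predecessor that satisfies the GPC."""
    tasks = list(pi)
    has_successor = {k for i in tasks for k in preds[i]}
    kept = {}
    stack = []
    for i in tasks:
        if i not in has_successor:
            # Execution of the sink task with the earliest end date b + pi.
            r = min(range(len(pi[i])), key=lambda r: b[(i, r)] + pi[i][r])
            stack.append((i, r))
    while stack:
        i, r = stack.pop()
        if (i, r) in kept:
            continue
        kept[(i, r)] = b[(i, r)]  # primary copy executed at its earliest date
        for k in preds[i]:
            # Copies of k whose result reaches server r by date b[(i, r)] (GPC).
            ok = [p for p in range(len(pi[k]))
                  if b[(k, p)] + pi[k][p] + delay(k, p, i, r) <= b[(i, r)]]
            stack.append((k, ok[0]))  # arbitrary choice: lowest server index
    return kept


# Hypothetical instance: 3 tasks, 2 servers, earliest dates given as input.
pi = {1: (3, 2), 2: (2, INF), 3: (1, 1)}
preds = {1: [], 2: [1], 3: [1, 2]}
b = {(1, 0): 0, (1, 1): 0, (2, 0): 3, (2, 1): INF, (3, 0): 5, (3, 1): 7}


def delay(k, p, i, r):
    return 0 if p == r else 2


kept = select_primaries(b, pi, preds, delay)
print(sorted(kept.items()))
```

Because two kept executions may pick different copies of the same predecessor, a task can end up with several primary copies, as task 3 does in the example of the slides.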


Applied to the earliest execution dates and earliest end dates of the example, step 2 gives the following result:

Execution   b (start)   r (end)
1/σ3        0           2
2/σ3        2           5
3/σ3        2           4
3/σ4        4           8
4/σ4        8           10
5/σ3        5           7
6/σ4        10          13


This finally gives the following optimal schedule:

[Figure: Gantt chart of the primary copies over a time axis from 1 to 14. Server σ3, drawn on two rows, runs 1/σ3, 2/σ3, 3/σ3 and 5/σ3; server σ4 runs 3/σ4, 4/σ4 and 6/σ4. Arrows mark the communication times 1 to 3, 2 to 4, 3 to 5 and 5 to 6.]


Phase 2: Scheduling of the backup copies (eFRCD algorithm)

The eFRCD algorithm adds extra time slots in which backup copies can be executed if one server fails. It uses a list algorithm that sorts the tasks according to their earliest end date in the optimal schedule built by DSS_OPT in phase 1. A greedy algorithm then places the most urgent tasks first, each on a server different from the one that executes the primary copy of the task.

Example: using the previous example, we obtain the following list.

[Table: for each task, the candidate servers for its backup copy, the earliest end date of the task, and the server chosen for the backup copy]
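The ordering and placement idea of phase 2 can be illustrated with a short sketch. It is not the eFRCD algorithm: it only chooses a server for each backup copy and ignores the time slots. Two rules are assumptions made for the illustration: a task that phase 1 already placed on two servers gets no backup, and among the candidate servers the one with the smallest processing time is chosen. The names `PI`, `PRIMARY_END` and `place_backups` are hypothetical, the labels `s1` to `s4` stand for σ1 to σ4, and the processing times are derived from the example as end date minus start date.

```python
# Processing time of each task on each server able to run it (derived values).
PI = {
    1: {"s1": 3, "s3": 2, "s4": 2},
    2: {"s1": 2, "s2": 4, "s3": 3},
    3: {"s2": 2, "s3": 2, "s4": 4},
    4: {"s1": 4, "s4": 2},
    5: {"s1": 5, "s2": 3, "s3": 2, "s4": 4},
    6: {"s3": 2, "s4": 3},
}

# Primary copies kept by phase 1: task -> {server: end date}.
PRIMARY_END = {
    1: {"s3": 2},
    2: {"s3": 5},
    3: {"s3": 4, "s4": 8},
    4: {"s4": 10},
    5: {"s3": 7},
    6: {"s4": 13},
}


def place_backups(pi, primary_end):
    """Greedy server choice for backup copies, most urgent task first."""
    backups = {}
    by_urgency = sorted(primary_end, key=lambda t: min(primary_end[t].values()))
    for task in by_urgency:
        if len(primary_end[task]) >= 2:
            continue  # assumption: two primary copies already cover one fault
        candidates = [s for s in pi[task] if s not in primary_end[task]]
        # Assumed criterion (not from the slides): fastest candidate server.
        backups[task] = min(candidates, key=lambda s: pi[task][s])
    return backups


print(place_backups(PI, PRIMARY_END))
```

With these assumed rules, the servers chosen for tasks 1, 2, 4, 5 and 6 are σ4, σ1, σ1, σ2 and σ3, which are the servers carrying the backup copies in the fault-tolerant schedule of the example.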


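This gives the following fault-tolerant schedule:

[Figure: Gantt chart over a time axis from 1 to 16. Server σ1 holds the backup copies 2B/σ1 and 4B/σ1; server σ2 holds the backup copy 5B/σ2; server σ3, drawn on two rows, runs 1/σ3, 2/σ3, 3/σ3 and 5/σ3 and holds the backup copy 6B/σ3; server σ4 holds the backup copy 1B/σ4 and runs 3/σ4, 4/σ4 and 6/σ4.]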


II.2. Complexity

• On a problem with n tasks and s servers, the global complexity of the algorithm DSS_OPT(P) is $O(n^2 s^2)$.
• eFRCD is also polynomial, and its complexity is $O(n^2 s)$.
• Consequently, the global complexity of our new algorithm is $O(n^2 s^2)$.
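As a quick check of the last statement, assuming the two phases simply run one after the other, their costs add and the larger term dominates:

$$O(n^2 s^2) + O(n^2 s) = O(n^2 s^2), \quad \text{since } n^2 s \le n^2 s^2 \text{ for } s \ge 1.$$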


III. Discussion

This algorithm has two advantages:
• When no server fails, our algorithm is optimal, because it uses the optimal solution computed by DSS-OPT.
• When one server fails, our solution is a good one: before the fault, the DSS-OPT solution is better than the one eFRCD would produce, and after the fault it uses a solution computed by eFRCD, which builds a good solution in this case.

Our solution is guaranteed to finish if one fault occurs, because every task has two or more scheduled copies on different servers in the final solution. If more than one fault occurs, the solution may still finish, but this is no longer guaranteed.
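The placement side of this guarantee can be checked mechanically on the example. The sketch below uses the copies shown in the fault-tolerant schedule of the slides; the names `COPIES` and `survives` are hypothetical, and `s1` to `s4` stand for σ1 to σ4. It only tests that each task keeps a copy on a surviving server, not that the dates of the remaining copies still satisfy the precedence constraints.

```python
# task -> servers holding a copy (primary or backup) in the final schedule
COPIES = {
    1: {"s3", "s4"},
    2: {"s3", "s1"},
    3: {"s3", "s4"},
    4: {"s4", "s1"},
    5: {"s3", "s2"},
    6: {"s4", "s3"},
}


def survives(copies, failed):
    """True if every task still has a copy on a server outside `failed`."""
    failed = set(failed)
    return all(servers - failed for servers in copies.values())


servers = ["s1", "s2", "s3", "s4"]
print(all(survives(COPIES, [s]) for s in servers))  # True: any single fault
print(survives(COPIES, ["s1", "s2"]))               # True: this pair is survivable
print(survives(COPIES, ["s3", "s4"]))               # False: task 1 loses both copies
```

The last two lines illustrate the remark about multiple faults: some pairs of failures leave a copy of every task, others do not.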




The concurrent execution of tasks on a server is supposed to have no effect, or a negligible one, on the processing times of the other tasks on the same server (Hypothesis H). This is similar to the 'unlimited number of available processors' hypothesis present in all classical PERT-like problems. In the same way that classical PERT results are used in real-life problems with limited resources as a first step for list scheduling algorithms, our result may be used as a first step for list scheduling algorithms or heuristics in real-life distributed systems with real servers (without hypothesis H).

The failure model, which features at most one crash, may seem poor. However, if the probability of any single failure is very low and failures are independent, then the probability of two failures is much smaller still. Furthermore, we are currently working on extending the algorithm to cases with two or more failures, for example by using two or more backup copies.
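To give an order of magnitude, here is a small worked example under assumptions that are not in the slides: each of the s = 4 servers fails independently with the same probability p = 10^{-3} over the duration of the schedule.

$$P(\text{at least one failure}) = 1 - (1-p)^s \approx s\,p = 4 \times 10^{-3}$$

$$P(\text{at least two failures}) = 1 - (1-p)^s - s\,p\,(1-p)^{s-1} \approx \binom{s}{2}\,p^2 = 6 \times 10^{-6}$$

With these illustrative numbers, a second failure is several hundred times less likely than a first one.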
