Instruction Pipelining II. Pipeline Hazards Floating Point Pipeline

Instruction Pipelining II. Pipeline Hazards Floating Point Pipeline Ing. Miloš Bečvář

Computer Architecture

Presentation Outline
° Pipeline hazards in the integer pipeline and their solutions (pipeline stall, forwarding, branch evaluation, delayed and cancelled branch)
° Floating point pipeline
° More complex hazards in the floating point pipeline and their solutions

Ideal Pipeline Execution
The pipeline produces a result every clock cycle (after the initial latency to fill the pipeline).

Clock cycle       1    2    3    4    5    6    7    8    9
instr.1           IF   ID   EX   MEM  WB
instr.2                IF   ID   EX   MEM  WB
instr.3                     IF   ID   EX   MEM  WB
instr.4                          IF   ID   EX   MEM  WB
instr.5                               IF   ID   EX   MEM  WB

CPI pipeline_ideal = 1

Realistic Pipeline Execution – 1 cycle stall

Clock cycle       1    2    3          4    5    6    7    8    9
instr.1           IF   ID   EX         MEM  WB
instr.2                IF   ID stall   ID   EX   MEM  WB
instr.3                     IF stall   IF   ID   EX   MEM  WB
instr.4                                     IF   ID   EX   MEM  WB

No instruction started in cycle 3. No instruction finished in cycle 6.

CPI pipeline_real = CPI pipeline_ideal + Stalls per instruction

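Speedup From Pipelining

$$\text{Speedup} = \frac{T_{cpu\_single\_cycle}}{T_{cpu\_pipeline}} = \frac{IC \cdot CPI_{single\_cycle} \cdot T_{clk\_single\_cycle}}{IC \cdot CPI_{pipeline\_real} \cdot T_{clk\_pipe}}$$

With $CPI_{single\_cycle} = 1$ and $CPI_{pipeline\_real} = CPI_{pipeline\_ideal} + SPI$ (SPI = stalls per instruction):

$$\text{Speedup} = \frac{1}{CPI_{pipeline\_ideal} + SPI} \cdot \frac{T_{clk\_single\_cycle}}{T_{clk\_pipe}}$$

Writing $S = T_{clk\_single\_cycle} / T_{clk\_pipe}$ for the ratio of the clock periods:

$$\text{Speedup} = \frac{S}{CPI_{pipeline\_ideal} + SPI} = \frac{S}{1 + SPI}$$

Our goal is to minimize SPI and thereby maximize the speedup from pipelining.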
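The speedup relation can be checked numerically. The following is a minimal Python sketch; the function name `pipeline_speedup`, the clock ratio S = 5 and the SPI value 0.11 are illustrative assumptions, not part of the derivation.

```python
def pipeline_speedup(s: float, spi: float, cpi_ideal: float = 1.0) -> float:
    """Speedup = S / (CPI_pipeline_ideal + SPI).

    s   -- ratio Tclk_single_cycle / Tclk_pipe (assumed value, e.g. 5)
    spi -- average stalls per instruction
    """
    if s <= 0 or spi < 0:
        raise ValueError("s must be positive and spi non-negative")
    return s / (cpi_ideal + spi)


ideal = pipeline_speedup(5, 0.0)         # no stalls at all
with_stalls = pipeline_speedup(5, 0.11)  # assumed SPI of 0.11
print(f"ideal: {ideal:.2f}, with stalls: {with_stalls:.2f}")
```

Every stall per instruction adds directly to the denominator, which is why the remaining slides concentrate on reducing SPI.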

Another Diagram for Pipeline Stall

Without a stall:

Clock cycle   1         2         3         4         5
IF stage      Instr1    Instr2    Instr3    Instr4    Instr5
ID stage      Instr0    Instr1    Instr2    Instr3    Instr4
EX stage      Instr-1   Instr0    Instr1    Instr2    Instr3
MEM stage     Instr-2   Instr-1   Instr0    Instr1    Instr2
WB stage      Instr-3   Instr-2   Instr-1   Instr0    Instr1

With a stall (Instr2 stalls in the ID stage):

Clock cycle   1         2         3         4         5         6         7
IF stage      Instr1    Instr2    Instr3    Instr3    Instr4    Instr5    Instr6
ID stage      Instr0    Instr1    Instr2    Instr2    Instr3    Instr4    Instr5
EX stage      Instr-1   Instr0    Instr1    NOP       Instr2    Instr3    Instr4
MEM stage     Instr-2   Instr-1   Instr0    Instr1    NOP       Instr2    Instr3
WB stage      Instr-3   Instr-2   Instr-1   Instr0    Instr1    NOP       Instr2

Stall Implementation in the Pipeline – The Idea

[Figure: pipeline Stage 1 to Stage 4 with combinational logic blocks Comb. Log. S1 to Comb. Log. S4; one stage is marked "Stall occurred here".]

° Recycle (hold the contents of) all stages before the stall source until the stall is resolved.
° Insert a NOP into the following stage until the stall is resolved.
° Continue the flow through the rest of the pipeline.

Note that the end of the pipeline is not affected. This ensures forward progress: the conflicting instruction completes at the end of the pipeline, which removes the reason for the stall.
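The hold-and-insert-NOP idea can be sketched as a small cycle-by-cycle model. This is a simplified Python illustration, not the hardware from the slides: the function `simulate`, the list-based stage state and the choice of instruction numbers are assumptions made here. Stages are ordered IF, ID, EX, MEM, WB, and a stall in cycle 3 means that the instruction in ID is held for one extra cycle.

```python
NOP = "NOP"


def simulate(cycles, stall_cycles, stall_stage=1):
    """Return the stage contents [IF, ID, EX, MEM, WB] for each clock cycle.

    stall_cycles -- cycles in which the instruction in `stall_stage` must be held
    stall_stage  -- index of the stalling stage (1 = ID)
    """
    state = [1, 0, -1, -2, -3]  # instruction numbers in cycle 1
    next_fetch = 2
    history = [list(state)]
    for cycle in range(1, cycles):
        if cycle in stall_cycles:
            # Hold IF..stall_stage, insert a NOP behind them,
            # let the rest of the pipeline advance normally.
            state = state[:stall_stage + 1] + [NOP] + state[stall_stage + 1:-1]
        else:
            # Normal flow: fetch a new instruction, shift everything by one stage.
            state = [next_fetch] + state[:-1]
            next_fetch += 1
        history.append(list(state))
    return history


history = simulate(7, stall_cycles={3})
for cycle, stages in enumerate(history, start=1):
    print(cycle, stages)
```

With a stall in cycle 3 the NOP enters EX in cycle 4 and then travels through MEM and WB, while the stages behind the stall source keep their contents for one cycle.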

Why is the pipeline stalled?
The pipeline must be stalled to avoid hazards. A hazard is a potential violation of the semantics of program execution.
° Data hazards: violation of data or name dependences
° Control hazards: violation of control dependences
° Structural hazards: imperfect resource replication in the pipeline, collision on shared resources (memory, register file, functional unit)
Note that data and control hazards exist because program semantics collide with pipelined execution. Structural hazards are the result of imperfect pipelining.

Data and Name Dependences
The first operand of each instruction is its destination.

(True) Data Dependency:
    InstrI  Variable1, Variable2, Variable3
    ....
    InstrJ  Variable4, Variable5, Variable1

Output Dependency:
    InstrI  Variable1, Variable2, Variable3
    ....
    InstrJ  Variable1, Variable4, Variable5

Antidependency:
    InstrI  Variable3, Variable2, Variable1
    ....
    InstrJ  Variable1, Variable4, Variable5

Name (False) Dependencies = Output Dependency + Antidependency
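The three kinds of dependence can be told apart mechanically from the destination and source operands. The sketch below is a hypothetical Python helper (the `Instr` type and the `dependences` function are names introduced here for illustration) and assumes that the first operand is the destination.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    dest: str               # variable written by the instruction
    srcs: Tuple[str, ...]   # variables read by the instruction


def dependences(i: Instr, j: Instr) -> Set[str]:
    """Dependences of the later instruction j on the earlier instruction i."""
    found = set()
    if i.dest in j.srcs:
        found.add("data")    # j reads what i writes
    if i.dest == j.dest:
        found.add("output")  # both write the same variable
    if j.dest in i.srcs:
        found.add("anti")    # j overwrites what i still has to read
    return found


data = dependences(Instr("Variable1", ("Variable2", "Variable3")),
                   Instr("Variable4", ("Variable5", "Variable1")))
output = dependences(Instr("Variable1", ("Variable2", "Variable3")),
                     Instr("Variable1", ("Variable4", "Variable5")))
anti = dependences(Instr("Variable3", ("Variable2", "Variable1")),
                   Instr("Variable1", ("Variable4", "Variable5")))
print(data, output, anti)
```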

Data Dependences – Typical Program Sequences

Typical define-use-reuse sequence for Variable1:
    InstrI  Variable1, Variable2, Variable3
    ....
    InstrJ  Variable5, Variable4, Variable1    (data dependency on InstrI)
    ....
    InstrK  Variable1, Variable6, Variable7    (antidependency on InstrJ, output dependency on InstrI)

Redundant computation of Variable1, a typical by-product of aggressive compiler optimization:
    InstrI  Variable1, Variable2, Variable3
    ....
    InstrJ  Variable1, Variable4, Variable5    (output dependency on InstrI)

Dependences exist between variables in registers (easy to identify) as well as between variables accessed in memory (more difficult to identify).

Data Hazards – Classification
Data hazards are potential violations of data and name dependences in a program during pipelined execution.
Let i < j be two integers (instruction i is earlier in the program than instruction j).
° RAW (Read After Write) hazard:
  1. There is a true data dependency between i and j.
  2. Instruction j wants to read the new value before instruction i can write it.
° WAR (Write After Read) hazard:
  1. There is an antidependency between i and j.
  2. Instruction j wants to write its result earlier than instruction i reads the previous value.
° WAW (Write After Write) hazard:
  1. There is an output dependency between i and j.
  2. Instruction j wants to write its data earlier than instruction i.
The existence of a hazard depends on
- the distance between i and j with respect to the pipeline latencies
- the order in which instructions read and write their operands with respect to their order in the program.
Only RAW hazards can occur in our simple integer pipeline. Why?

Only RAW hazards occur in integer DLX
Variables in registers, all instructions:

Cycle    1.    2.    3.    4.    5.
Stage    IF    ID    EX    MEM   WB
               read              write

All instructions are executed in program order. Register read occurs in the 2nd cycle, register write in the 5th cycle.
For j > i: instruction j is fetched in cycle j, instruction i in cycle i.
° RAW hazard? Instruction j reads in cycle j+1, instruction i writes in cycle i+4. A RAW hazard occurs if j+1 < i+4, i.e. j < i+3.
° WAR hazard? Instruction j writes in cycle j+4, instruction i reads in cycle i+1. A WAR hazard cannot occur because j+4 > i+1.
° WAW hazard? Instruction j writes in cycle j+4, instruction i writes in cycle i+4. A WAW hazard cannot occur because j+4 > i+4.

Similarly, you can prove that no RAW, WAR or WAW hazards occur between load and store instructions.
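The cycle arithmetic for the integer pipeline can be written down directly. This Python sketch assumes one instruction is fetched per cycle, no forwarding, and that a register written in a given cycle can be read in that same cycle; the function names are illustrative only.

```python
READ_STAGE = 2   # registers are read in ID, the 2nd cycle of an instruction
WRITE_STAGE = 5  # registers are written in WB, the 5th cycle of an instruction


def raw_stalls(i: int, j: int) -> int:
    """Stall cycles needed by consumer j of producer i (no forwarding)."""
    if j <= i:
        raise ValueError("j must be later in the program than i")
    read_cycle = j + READ_STAGE - 1    # cycle j+1
    write_cycle = i + WRITE_STAGE - 1  # cycle i+4
    return max(0, write_cycle - read_cycle)


def war_possible(i: int, j: int) -> bool:
    """WAR would need j to write (cycle j+4) before i reads (cycle i+1)."""
    return j + WRITE_STAGE - 1 < i + READ_STAGE - 1


def waw_possible(i: int, j: int) -> bool:
    """WAW would need j to write (cycle j+4) before i writes (cycle i+4)."""
    return j + WRITE_STAGE - 1 < i + WRITE_STAGE - 1


stalls_by_distance = {d: raw_stalls(10, 10 + d) for d in (1, 2, 3, 4)}
print(stalls_by_distance)
```

The distances 1 and 2 give two and one stall cycles; from distance 3 onward the dependency no longer causes a hazard.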

Not Every Dependency Leads to a Hazard!

Clock cycle       1    2    3    4    5    6
ADD R1,R2,R3      IF   ID   EX   MEM  WB          ADD writes a new value of R1 in WB
SUB R4,R1,R3           IF   ID   EX   MEM  WB     SUB reads an old value of R1 in ID
Dependency and hazard!

Clock cycle       1    2    3    4    5    6    7    8
ADD R1,R2,R3      IF   ID   EX   MEM  WB
OR R5,R2,R3            IF   ID   EX   MEM  WB
AND R6,R2,R6                IF   ID   EX   MEM  WB
SUB R4,R1,R3                     IF   ID   EX   MEM  WB
Dependency but no hazard!

Example of RAW Data Hazard

1) Stall:

Clock cycle       1    2    3          4          5    6    7    8
ADD R1,R2,R3      IF   ID   EX         MEM        WB
SUB R4,R1,R3           IF   ID stall   ID stall   ID   EX   MEM  WB

2) Better solution – forwarding (also called register bypassing):

Clock cycle       1    2    3    4    5    6
ADD R1,R2,R3      IF   ID   EX   MEM  WB
SUB R4,R1,R3           IF   ID   EX   MEM  WB
Forward the new value of R1 from ADD to the EX stage of SUB.

The result of ADD produced in the EX stage is immediately forwarded to the following instruction => no stall is necessary!

Forwarding – Principle of HW Solution
[Figure: datapath of the five-stage pipeline (IF, ID, EX, MEM, WB) with instruction memory, register file (32 32-bit registers), ALU with inputs busA and busB, data memory, next address logic and forwarding control (Forwarding ctrl) with the MEM/EX and WB/EX forwarding paths.]
Results are forwarded from the pipeline registers using the MEM/EX and WB/EX paths.
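The decision made by the forwarding control can be sketched in a few lines. This is a hypothetical Python model rather than the actual control logic of the figure: the function name and its parameters are assumptions, and the load-use case, where the value in the MEM stage is not yet available, is deliberately left out.

```python
from typing import Optional


def select_operand_source(src_reg: int,
                          mem_dest: Optional[int],
                          wb_dest: Optional[int]) -> str:
    """Choose where an ALU operand comes from for the instruction in EX.

    src_reg  -- register number read by the instruction in EX
    mem_dest -- destination register of the instruction currently in MEM
    wb_dest  -- destination register of the instruction currently in WB
    """
    if src_reg == 0:
        return "REGFILE"   # R0 always reads as zero in DLX, never forwarded
    if mem_dest == src_reg:
        return "MEM/EX"    # youngest producer wins
    if wb_dest == src_reg:
        return "WB/EX"
    return "REGFILE"


# ADD R1,R2,R3 followed by SUB R4,R1,R3: when SUB is in EX, ADD is in MEM.
sub_operand_a = select_operand_source(1, mem_dest=1, wb_dest=None)
sub_operand_b = select_operand_source(3, mem_dest=1, wb_dest=None)
print(sub_operand_a, sub_operand_b)
```

Checking the MEM stage before the WB stage matters when both older instructions write the same register: the operand must come from the more recent producer.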

RAW Hazard Requiring Stall

Clock cycle       1    2    3    4    5    6
LW R1,(R2)+5      IF   ID   EX   MEM  WB          data available from the memory at the end of MEM
SUB R4,R1,R3           IF   ID   EX   MEM  WB     data needed by SUB at the beginning of EX
Forwarding back in time?

Data from a LOAD are available at the end of the MEM stage but are needed at the beginning of the EX stage => they cannot be forwarded back in time.

RAW Hazard Requiring Stall (contd.)

Clock cycle       1    2    3          4    5    6    7
LW R1,(R2)+5      IF   ID   EX         MEM  WB
SUB R4,R1,R3           IF   ID stall   ID   EX   MEM  WB
Forward: the loaded value is passed from the end of the MEM stage of LW to the EX stage of SUB.

This delay is called the LOAD-USE delay (typical for a RISC pipeline). It can be eliminated by inserting an independent instruction between the LOAD and the USE instruction. This is a typical task of an optimizing compiler.
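The effect of such scheduling can be illustrated by counting load-use stalls in a short instruction sequence. The Python sketch below is a simplified model with forwarding assumed; the tuple encoding `(opcode, dest, sources)` and the independent `ADD R5,R6,R7` are hypothetical and serve only as an example.

```python
from typing import List, Optional, Tuple

# (opcode, destination register, source registers)
Instr = Tuple[str, Optional[str], Tuple[str, ...]]


def load_use_stalls(program: List[Instr]) -> int:
    """Count one stall for every instruction that uses the result of the
    load immediately preceding it (pipeline with forwarding)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        opcode, dest, _ = prev
        if opcode == "LW" and dest is not None and dest in cur[2]:
            stalls += 1
    return stalls


unscheduled = [
    ("LW", "R1", ("R2",)),
    ("SUB", "R4", ("R1", "R3")),
]
scheduled = [
    ("LW", "R1", ("R2",)),
    ("ADD", "R5", ("R6", "R7")),   # independent instruction fills the slot
    ("SUB", "R4", ("R1", "R3")),
]
print(load_use_stalls(unscheduled), load_use_stalls(scheduled))
```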


Performance Impact of Data Hazards on CPI
a) Consider the integer pipeline with stalls
Let i be the position of the producing instruction and j the position of the consuming instruction:

Position j        Stalls in the pipeline    Probability P (nonoptimized code)
i+1               2 stalls                  40 %
i+2               1 stall                   20 %
i+3 and more      0 stalls                  40 %

Instruction type      Dynamic frequency F
ALU                   48 %
Load                  26 %
Store                 13 %
Branch (control)      13 %

SPIdata = (Falu + Fload) * (p1*2 + p2*1)
SPIdata = (0.48 + 0.26) * (0.4*2 + 0.2*1)
SPIdata = 0.74
CPIpipe_real >= 1.74 !

Performance Impact of Data Hazards on CPI
b) Consider the integer pipeline with forwarding
Let i be the position of the Load instruction and j the position of the consuming instruction:

Position j        Stalls in the pipeline    Probability P (nonoptimized code)
i+1               1 stall                   40 %
i+2 and more      0 stalls                  60 %

Instruction type      Dynamic frequency F
ALU                   48 %
Load                  26 %
Store                 13 %
Branch (control)      13 %

SPIdata = Fload * p1 * 1
SPIdata = 0.26 * 0.4 * 1
SPIdata = 0.104
CPIpipe_real >= 1.104 !

Beware that forwarding can also affect Tclk negatively!
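Both SPIdata estimates follow from the instruction mix and the distance probabilities given in the tables. The short Python sketch below reproduces the arithmetic; the dictionary and function names are illustrative assumptions.

```python
# Dynamic instruction frequencies from the table
FREQ = {"alu": 0.48, "load": 0.26, "store": 0.13, "branch": 0.13}


def spi_data_stalls_only(p_next: float = 0.40, p_second: float = 0.20) -> float:
    """Pipeline without forwarding: ALU and Load results both cause stalls,
    2 cycles for a consumer at i+1 and 1 cycle for a consumer at i+2."""
    return (FREQ["alu"] + FREQ["load"]) * (p_next * 2 + p_second * 1)


def spi_data_forwarding(p_next: float = 0.40) -> float:
    """Pipeline with forwarding: only a consumer directly after a Load stalls."""
    return FREQ["load"] * p_next * 1


spi_a = spi_data_stalls_only()
spi_b = spi_data_forwarding()
print(f"stalls only: SPIdata = {spi_a:.3f}, CPI >= {1 + spi_a:.3f}")
print(f"forwarding:  SPIdata = {spi_b:.3f}, CPI >= {1 + spi_b:.3f}")
```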

Control Hazard – Branch Instruction
[Figure: the five-stage datapath (IF, ID, EX, MEM, WB) with condition evaluation (Condition Evaluation, zero, jump_now), target address calculation and next address logic.]
A branch is evaluated during the EX stage; the PC is changed during the MEM stage.

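Solution of Control Hazard by Pipeline Stall

Clock cycle         1    2          3          4          5    6    7    8    9
BEQZ R3, M1         IF   ID         EX         MEM        WB
next instruction         IF stall   IF stall   IF stall   IF   ID   EX   MEM  WB

The next instruction is instr.2 (followed by instr.3 and instr.4) when the branch is not taken, or M1: ADD R4,R6,R7 when it is taken. Condition resolution and address calculation take place in the EX stage of the branch; the PC is changed in its MEM stage.

3 stalls for every branch (taken / not taken): SPIcontrol = 0.13*3 = 0.39 !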

Solution of Control Hazard by Flushing

Clock cycle         1    2    3    4    5    6    7    8    9
BEQZ R3, M1         IF   ID   EX   MEM  WB
instr.2                  IF   ID   EX   (cancelled, NOP)
instr.3                       IF   ID   (cancelled, NOP)
instr.4                            IF   (cancelled, NOP)
M1: ADD R4,R6,R7                        IF   ID   EX   MEM  WB

Condition resolution and address calculation take place in the EX stage of the branch; the PC is changed in its MEM stage.

Instructions 2, 3 and 4 are flushed only when the branch is taken:
° 13 % of instructions in SpecInt2000 are branches, 67 % of branches are taken, and the 3 stalls occur only when the branch is taken: SPIcontrol = 0.13*0.67*3 = 0.26

Better Solution of Control Stalls
° The solution has two parts:
  - Evaluate the condition and the target address earlier
  - Change the PC earlier
° Possible implementation:
  - Add condition evaluation logic to the ID stage
  - Compute the next PC during the ID stage
° Result: penalty of 1 instead of 3 => SPIcontrol = 0.13*0.67*1 = 0.0871
° Note, however, that condition evaluation in ID cannot use the forwarding to the EX stage => SPIdata can increase. This can be mitigated by optimizing the instruction ordering.
° Tclk can also be affected by the more complex logic in the ID stage!
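The three SPIcontrol figures quoted for the stall, flush and early-evaluation schemes come from one product. A minimal Python sketch of that arithmetic follows; the function name `spi_control` is an assumption, while the frequencies and penalties are the ones given in the slides.

```python
def spi_control(branch_freq: float, penalty: int, taken_fraction: float = 1.0) -> float:
    """Control stalls per instruction.

    branch_freq    -- fraction of instructions that are branches
    penalty        -- stall cycles per penalized branch
    taken_fraction -- fraction of branches that pay the penalty
                      (1.0 when every branch stalls)
    """
    return branch_freq * taken_fraction * penalty


stall_always = spi_control(0.13, penalty=3)                        # stall on every branch
flush_taken = spi_control(0.13, penalty=3, taken_fraction=0.67)    # flush only when taken
early_eval = spi_control(0.13, penalty=1, taken_fraction=0.67)     # branch resolved in ID
print(f"{stall_always:.4f} {flush_taken:.4f} {early_eval:.4f}")
```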


Improved Branch Instruction - Implementation
[Figure: the five-stage datapath with the condition evaluation logic, the target address calculation and the next address logic moved into the ID stage.]

Improved Branch Instruction

Clock cycle         1    2    3    4    5    6    7
BEQZ R3, M1         IF   ID   EX   MEM  WB
instr.2                  IF   (cancelled, NOP)
M1: ADD R4,R6,R7              IF   ID   EX   MEM  WB

Condition evaluation and address calculation take place in the ID stage of the branch, and the PC is changed there as well.

Instr.2 is said to be in the branch delay slot. Similarly, the instruction after a load is said to be in the load delay slot.

Further Improved Branch – Delayed Branch

Clock cycle                    1    2    3    4    5    6    7
BEQZ R3, M1                    IF   ID   EX   MEM  WB
instr.2 (branch delay slot)         IF   ID   EX   MEM  WB
M1: ADD R4,R6,R7                         IF   ID   EX   MEM  WB

Condition evaluation and address calculation take place in the ID stage of the branch.

Let the instruction in the branch delay slot always execute. This principle is called delayed branch.
SPIcontrol = Taken Branch Frequency * Branch Penalty
If the instruction in the branch delay slot performs useful computation, the branch penalty is reduced.

Which Instructions Can be Placed in a Branch Delay Slot?

Solution 1: From Before

Original code:               Scheduled code:
     ADD R1, R2, R3               BEQZ R4, LAB
     BEQZ R4, LAB                 ADD R1, R2, R3    (delay slot)
     (delay slot)
LAB:                         LAB:

° This is a safe solution.
° The branch penalty is reduced to zero.
° The branch must not depend on the instruction placed in the delay slot (obviously).

Which Instructions Can be Placed in a Branch Delay Slot (2)?

Solution 2: From Target

Original code:               Scheduled code:
LAB: SUB R4, R5, R6               SUB R4, R5, R6
     ADD R1, R2, R3          LAB: ADD R1, R2, R3
     BEQZ R7, LAB                 BEQZ R7, LAB
     (delay slot)                 SUB R4, R5, R6    (delay slot)

Solution 3: From Fall-Through

Original code:               Scheduled code:
     ADD R1, R2, R3               ADD R1, R2, R3
     BEQZ R7, LAB                 BEQZ R7, LAB
     (delay slot)                 SUB R4, R5, R6    (delay slot)
     SUB R4, R5, R6
LAB:                         LAB:

Which Instructions Could be Placed in the Branch Delay Slot (3)?
° The branch penalty is reduced only when the program follows the expected path from which the instruction was taken.
° Solutions 2 and 3 are legal only when the instruction in the delay slot can be safely executed when the branch goes down the unexpected path. The instruction in the branch delay slot must not raise an exception when executed on the unexpected path. This is a problem especially with Load and Store instructions (virtual memory protection fault).
Solution: the so-called canceling branch. The instruction in the delay slot is executed only when the branch is taken. It is cancelled otherwise (when the branch falls through).

Delayed Branches - Summary
Some architectures support all 3 types of branches:
° Delayed branches – used when the instruction in the delay slot can be safely executed regardless of the branch direction (preferred option)
° Conventional branches – the instruction in the delay slot is cancelled for a taken branch
° Canceling branches – the instruction in the delay slot is cancelled for the fall-through path
Delayed branches work well for a simple pipeline, but they complicate exception (interrupt) handling and superscalar execution (more advanced pipelines).

Structural Hazards

° Structural hazards: imperfect resource replication in the pipeline, collision on shared resources (memory, register file, functional unit)
° Our integer DLX pipeline has no resources shared between pipeline stages, hence there are no structural hazards: SPIstruct = 0
° In real pipelines such shared resources exist (to save area and cost) and are sources of structural stalls

Example of Structural Hazard – Unified Memory

Clock cycle       1    2    3    4    5    6    7    8    9
Load              IF   ID   EX   MEM  WB
instr.2                IF   ID   EX   MEM  WB
instr.3                     IF   ID   EX   MEM  WB
Stall                            (lost instruction slot, pipeline bubble)
instr.4                               IF   ID   EX   MEM  WB

Instruction 4 cannot start in the 4th cycle because the Load is accessing the unified instruction and data memory. This can be seen as a stall in the IF stage.
Performance impact of unified memory (cache): when memory operands are accessed, instructions cannot be fetched. 26 % of instructions are Loads and 13 % are Stores => SPIstruct = 0.26 + 0.13 = 0.39

Example of Structural Hazard – Missing MEM stage

Clock cycle       1     2     3     4     5
Load              IF    ID    EX    MEM   WB
ADD                     IF    ID    EX    WB
Collision on the write port of the register file.

Reason: instructions have a different number of pipeline stages.
Solution: adding a register file write port, but this is not always reasonable (area, delay). A stall is the cheaper solution.

Clock cycle       1     2     3     4     5      6     7     8     9     10
Load              IF    ID    EX    MEM   WB
ADD                     IF    ID    EX    stall  WB
instr.3                       IF    ID    stall  EX    WB
instr.4                             IF    stall  ID    EX    MEM   WB
instr.5                                   stall  IF    ID    EX    MEM   WB

This is the reason why ALU instructions go through the MEM stage: it avoids this type of hazard.

Identifying Structural Hazards (1)
Identify the clock cycles in which pipeline resources are used by different instruction types.

Example 1 – shared memory (* = shared memory used)

Cycle             1     2     3     4      5
Load, Store       IF*   ID    EX    MEM*   WB
Other instr.      IF*   ID    EX    MEM    WB

Example 2 – different number of pipeline stages (* = register file write port used)

Cycle             1     2     3     4      5
Load              IF    ID    EX    MEM    WB*
Store             IF    ID    EX    MEM
Other instr.      IF    ID    EX    WB*

A shared resource is a resource used by instructions in several different cycles => structural hazards can occur.

Identifying Structural Hazards (2)

Instruction number    Clock cycle ->  1.   2.   3.   4.   5.   6.   7.   8.
1. Load                               IF   ID   EX   MEM  WB
2. instr.2                                 IF   ID   EX   MEM  WB
3. instr.3                                      IF   ID   EX   MEM  WB
4. instr.4                                           IF   ID   EX   MEM  WB

Instruction j using a shared resource in its cycle c2 collides with instruction i using the same resource in its cycle c1 if and only if j = i + (c1 – c2).
In our example: i = 1, c1 = 4; j = 4, c2 = 1.
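The collision condition can be derived from absolute clock cycles: instruction number n is fetched in cycle n, so its relative cycle c corresponds to the absolute cycle n + c - 1. The Python sketch below checks this under the assumption that one instruction is issued per cycle without other stalls; both function names are hypothetical.

```python
def absolute_cycle(instr_number: int, relative_cycle: int) -> int:
    """Clock cycle in which an instruction is in its relative cycle
    (instruction n is fetched in clock cycle n)."""
    return instr_number + relative_cycle - 1


def collides(i: int, c1: int, j: int, c2: int) -> bool:
    """True if instruction i (in its cycle c1) and instruction j (in its
    cycle c2) use the shared resource in the same clock cycle."""
    return absolute_cycle(i, c1) == absolute_cycle(j, c2)


# Load is instruction 1 and uses the memory in its 4th cycle (MEM);
# instruction 4 uses the memory in its 1st cycle (IF).
example = collides(1, 4, 4, 1)
print(example)
```

Setting i + c1 - 1 equal to j + c2 - 1 and solving for j gives exactly j = i + (c1 - c2).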

Performance of Integer DLX Pipeline (with Delayed Branches) on SPECInt95

Average stalls per instruction in SPECInt95: SPIcontrol = 0.06, SPIdata = 0.05
CPI (assuming a perfect memory system) = 1.11
Speedup against the non-pipelined version is 4.5
© David Patterson, CS152 UCB 1996

DLX and Multicycle Instructions – FP Pipeline (WinDLX Model)
[Figure: the IF and ID stages feed four parallel EX functional units (Integer Unit, FP/Int Multiply, FP ADD/SUB, FP/Int Divide), which are followed by the MEM and WB stages.]

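DLX FP Pipeline – More Realistic Model
[Figure: stages F and D are followed by parallel functional units: the integer unit (E), the FP multiplier (MX1, MX2, MX3), the FP adder (AX1 to AX4) and the multicycle FP DIV/SQRT unit (DX1-DX28); the pipeline ends with stages M and W.]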

Functional Unit Parameters
° Latency - the number of clock cycles before the result is available - typically the number of pipeline stages in the unit
° Initiation (Repetition) Interval - the number of clock cycles before the next operation can use this functional unit

Example for the integer pipeline (IF ID EX MEM WB):
Pipeline latency = 5 cycles
Latency of the integer unit = 1 cycle
Init. interval = 1 cycle

Functional Unit Parameters: an Example

Functional unit (producing instruction type)    Latency    Initiation interval
Integer (ALU Reg-Reg, Reg-Imm)                  1          1
Integer (Load instruction)                      2          1
FP MULT (pipelined)                             6          1
FP ADD (pipelined)                              3          1
DIVIDER (iterative)                             28         28

In WinDLX all functional units are multicycle (latency = initiation interval).

Instruction Latency and Initiation Interval
Latency of Instruction - the number of clock cycles before the result is available for the dependent instruction. It depends on:
- Latency of the functional unit for the given instruction
- Forwarding paths in the processor
- Consuming instruction type

Initiation (Repetition) Interval of Instruction - the number of clock cycles before the same type of instruction can start execution (without resource conflicts). It depends on:
- Number of functional units and their initiation intervals

M non-pipelined functional units of latency M have the same initiation interval as a single fully-pipelined functional unit of latency M. See also the supplementary material about the parameters of various DLX CPUs.

New Problems with a FP Pipeline
° Increasing number of structural hazards: blockages of non-pipelined units, register write ports
° RAW hazards are more frequent (higher latency of FP operations)
° WAR hazards cannot occur because operands are read at the beginning of execution and instructions start execution in order
° WAW hazards can occur because instructions complete out of order (they take a different number of EX stages)
° More complex hazard detection logic in the ID stage

New Sources of Structural Hazards

Clock cycle         1    2    3    4    5    6    7    8    9    10
FMULT F0, F4, F6    IF   ID   MX1  MX2  MX3  MX4  MX5  MX6  MEM  WB
Integer instr.           IF   ID   EX   MEM  WB
Integer instr.                IF   ID   EX   MEM  WB
FADD F2, F4, F6                    IF   ID   AX1  AX2  AX3  MEM  WB
Integer instr.                          IF   ID   EX   MEM  WB
LDF F3, 15(R1)                               IF   ID   EX   MEM  WB

• More register file write ports are needed (up to 4), otherwise a structural stall occurs.
• This stall occurs in WinDLX during the MEM stage (only a single instruction can finish in 1 cycle in WinDLX).
• In a real CPU, integer writes and FP writes would not collide, and there would be more write ports for the FP register file.

An example of a WAW Hazard

Clock cycle         1    2    3    4    5    6    7    8    9    10
FMULT F0, F4, F6    IF   ID   MX1  MX2  MX3  MX4  MX5  MX6  MEM  WB
Integer instr.           IF   ID   EX   MEM  WB
FADD F0, F3, F2               IF   ID   AX1  AX2  AX3  MEM  WB

° The value from FADD could be overwritten by FMULT.
° The WAW hazard can be resolved by stalling the output-dependent instruction in the ID stage.
° The performance impact is not significant because the hazard occurs infrequently (only for "redundant computation").
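Detecting such a WAW hazard amounts to comparing write-back cycles. The Python sketch below assumes one instruction fetched per cycle, no other stalls, and the numbers of EX cycles shown in the diagram (1 for integer, 3 for FADD, 6 for FMULT); the function names and the integer destination `R1` are hypothetical.

```python
from typing import List, Optional, Tuple

# Number of EX cycles per operation, as in the diagram
EX_CYCLES = {"INT": 1, "FADD": 3, "FMULT": 6}


def wb_cycle(fetch_cycle: int, op: str) -> int:
    """Cycle of the WB stage: IF, ID, the EX cycles, MEM, then WB."""
    return fetch_cycle + 1 + EX_CYCLES[op] + 2


def waw_hazards(program: List[Tuple[str, Optional[str]]]) -> List[Tuple[int, int]]:
    """Pairs (i, j) of program positions where the later instruction j
    would write the same destination no later than the earlier one i."""
    hazards = []
    for i, (op_i, dest_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            op_j, dest_j = program[j]
            same_dest = dest_j is not None and dest_i == dest_j
            # Position k in the list is fetched in clock cycle k + 1.
            if same_dest and wb_cycle(j + 1, op_j) <= wb_cycle(i + 1, op_i):
                hazards.append((i, j))
    return hazards


program = [("FMULT", "F0"), ("INT", "R1"), ("FADD", "F0")]
hazards = waw_hazards(program)
print(wb_cycle(1, "FMULT"), wb_cycle(3, "FADD"), hazards)
```

FADD reaches WB in cycle 9 and FMULT in cycle 10, so without a stall the older FMULT result would be the one left in F0.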

Performance of FP DLX (SPECfp95)
[Figures: stall charts for the SPECfp95 benchmarks.]
© David Patterson, CS152 UCB 1996

Performance of FP DLX
° Integer stalls are negligible compared to FP stalls
° RAW stalls dominate (see the "FP result stalls" bars)
° Structural stalls due to the divider are negligible
The number of stalls per FP operation corresponds to about 50 % of its latency (the rest is overlapped by useful computation of other instructions):
- 14.2 stalls per division (latency = 28 cycles)
- 2.7 stalls per multiplication (latency = 6 cycles)
Impact on CPI (SPIFP):
° Depends on the application: 0.65 – 1.21
° SPECfp stall average is 0.87
° DLX FP CPI = 1.87 (SPECfp95)
© David Patterson, CS152 UCB 1996

Conclusion
° Pipeline performance is limited by data, control and structural hazards
° Data hazards are RAW, WAW and WAR
° These hazards can stall the pipeline
° Some techniques to avoid stalls were shown (forwarding, delayed branch, instruction scheduling by the compiler)
Following 3 weeks: memory hierarchy (what is behind the IF and MEM stages).
We will return one more time to the topic of processor architecture (superscalar, dynamic scheduling, branch prediction, speculation …)