Instruction Pipelining II. Pipeline Hazards Floating Point Pipeline Ing. Miloš Bečvář
Computer Architecture
Presentation Outline
° Pipeline hazards in the integer pipeline and their solutions (pipeline stall, forwarding, branch evaluation, delayed and cancelled branch)
° Floating point pipeline
° More complex hazards in the floating point pipeline and their solutions
Ideal Pipeline Execution
The pipeline produces a result every clock cycle once it has been filled (latency to fill the pipeline).

Cycle      1    2    3    4    5    6    7    8    9
instr.1    IF   ID   EX   MEM  WB
instr.2         IF   ID   EX   MEM  WB
instr.3              IF   ID   EX   MEM  WB
instr.4                   IF   ID   EX   MEM  WB
instr.5                        IF   ID   EX   MEM  WB

CPI pipeline_ideal = 1
Realistic Pipeline Execution – 1 cycle stall

Cycle      1         2         3         4         5         6         7         8         9
instr.1    IF        ID        EX        MEM       WB
instr.2              IF        ID stall  ID        EX        MEM       WB
instr.3                        IF stall  IF        ID        EX        MEM       WB
instr.4                                            IF        ID        EX        MEM       WB

No instruction started in cycle 3. No instruction finished in cycle 6.
CPI pipeline_real = CPI pipeline_ideal + Stalls per instruction
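Speedup From Pipelining
$$\text{Speedup} = \frac{T_{\text{cpu\_single\_cycle}}}{T_{\text{cpu\_pipeline}}} = \frac{IC \cdot CPI_{\text{single\_cycle}} \cdot T_{\text{clk\_single\_cycle}}}{IC \cdot CPI_{\text{pipeline\_real}} \cdot T_{\text{clk\_pipe}}}$$
With $CPI_{\text{single\_cycle}} = 1$ and $CPI_{\text{pipeline\_real}} = CPI_{\text{pipeline\_ideal}} + SPI$ (SPI = stalls per instruction):
$$\text{Speedup} = \frac{1}{CPI_{\text{pipeline\_ideal}} + SPI} \cdot \frac{T_{\text{clk\_single\_cycle}}}{T_{\text{clk\_pipe}}}$$
Writing $S = T_{\text{clk\_single\_cycle}} / T_{\text{clk\_pipe}}$ and using $CPI_{\text{pipeline\_ideal}} = 1$:
$$\text{Speedup} = \frac{S}{CPI_{\text{pipeline\_ideal}} + SPI} = \frac{S}{1 + SPI}$$
Our goal is to minimize SPI and thereby maximize the speedup from pipelining.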
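As a rough illustration of the last formula, the following Python sketch evaluates S / (1 + SPI) for a few values of SPI. The function name `pipeline_speedup` and the choice S = 5 (a five-stage pipeline with perfectly balanced stages) are assumptions made for this example only.

```python
def pipeline_speedup(clock_ratio, spi, cpi_ideal=1.0):
    """Speedup = S / (CPI_ideal + SPI), where S = Tclk_single_cycle / Tclk_pipe."""
    return clock_ratio / (cpi_ideal + spi)


if __name__ == "__main__":
    # S = 5 is an assumed clock ratio for a balanced five-stage pipeline.
    for spi in (0.0, 0.25, 0.5, 1.0):
        print(f"SPI = {spi:.2f} -> speedup = {pipeline_speedup(5, spi):.2f}")
```

Even a single stall per instruction on average (SPI = 1) halves the ideal speedup in this model.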
Another Diagram for Pipeline Stall

Without a stall:
Clock cycle   1        2        3        4        5
IF stage      Instr1   Instr2   Instr3   Instr4   Instr5
ID stage      Instr0   Instr1   Instr2   Instr3   Instr4
EX stage      Instr-1  Instr0   Instr1   Instr2   Instr3
MEM stage     Instr-2  Instr-1  Instr0   Instr1   Instr2
WB stage      Instr-3  Instr-2  Instr-1  Instr0   Instr1

With a one-cycle stall:
Clock cycle   1        2        3        4        5        6        7
IF stage      Instr1   Instr2   Instr3   Instr3   Instr4   Instr5   Instr6
ID stage      Instr0   Instr1   Instr2   Instr2   Instr3   Instr4   Instr5
EX stage      Instr-1  Instr0   Instr1   NOP      Instr2   Instr3   Instr4
MEM stage     Instr-2  Instr-1  Instr0   Instr1   NOP      Instr2   Instr3
WB stage      Instr-3  Instr-2  Instr-1  Instr0   Instr1   NOP      Instr2

Instr2 stalls in the ID stage.
Stall Implementation in the Pipeline – The Idea
[Figure: four pipeline stages (Stage 1 to Stage 4), each with its combinational logic (Comb. Log. S1 to S4); one stage is marked "Stall occurred here".]
° Recycle all stages before the stall source until the stall is resolved.
° Insert a NOP into the following stage until the stall is resolved.
° Continue the flow through the rest of the pipeline.
Note that the end of the pipeline is not affected. This ensures forward progress: the conflicting instruction completes at the end of the pipeline, which removes the reason for the stall.
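The recycle-and-insert-NOP idea can be sketched as a tiny simulation. This is only an illustrative model, assuming a five-stage pipeline in which the stall is detected in the ID stage; the function name `simulate_stall` and the instruction numbering (taken from the stall diagram with Instr-3 to Instr6) are choices made for this example.

```python
def simulate_stall(n_cycles, stall_cycles):
    """Return the per-cycle stage occupancy [IF, ID, EX, MEM, WB].

    Instructions are identified by number; None stands for a NOP (bubble).
    stall_cycles holds the cycles in which the instruction in ID must wait.
    """
    state = [1, 0, -1, -2, -3]  # cycle 1, numbering as in the stall diagram
    next_instr = 2
    history = [list(state)]
    for cycle in range(2, n_cycles + 1):
        if cycle - 1 in stall_cycles:
            # Recycle IF and ID, insert a NOP into EX, let MEM and WB advance.
            state = [state[0], state[1], None, state[2], state[3]]
        else:
            # Normal flow: every instruction moves one stage forward.
            state = [next_instr] + state[:4]
            next_instr += 1
        history.append(list(state))
    return history


if __name__ == "__main__":
    stages = ["IF", "ID", "EX", "MEM", "WB"]
    for name, row in zip(stages, zip(*simulate_stall(7, {3}))):
        print(f"{name:4}", *("NOP" if x is None else f"Instr{x}" for x in row))
```

With a stall of Instr2 in cycle 3, the printed rows reproduce the stage occupancy of the stall diagram, including the NOP that travels through EX, MEM and WB.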
Why Is the Pipeline Stalled?
The pipeline must be stalled to avoid hazards. A hazard is a potential violation of the semantics of program execution.
° Data hazards: violation of data or name dependences
° Control hazards: violation of control dependences
° Structural hazards: imperfect resource replication in the pipeline, collision on shared resources (memory, register file, functional unit)
Note that data and control hazards exist because program semantics collide with pipelined execution. Structural hazards are the result of imperfect pipelining.
Data and Name Dependences

(True) Data Dependency:
  InstrI Variable1, Variable2, Variable3
  ....
  InstrJ Variable4, Variable5, Variable1

Output Dependency:
  InstrI Variable1, Variable2, Variable3
  ....
  InstrJ Variable1, Variable4, Variable5

Antidependency:
  InstrI Variable3, Variable2, Variable1
  ....
  InstrJ Variable1, Variable4, Variable5

Name (False) Dependencies = Output Dependency + Antidependency
Data Dependences – Typical Program Sequences

Typical define-use-reuse sequence for Variable1:
  InstrI Variable1, Variable2, Variable3
  ....
  InstrJ Variable5, Variable4, Variable1    (data dependency on InstrI)
  ....
  InstrK Variable1, Variable6, Variable7    (antidependency on InstrJ, output dependency on InstrI)

Redundant computation of Variable1, a typical by-product of aggressive compiler optimization:
  InstrI Variable1, Variable2, Variable3
  ....
  InstrJ Variable1, Variable4, Variable5    (output dependency on InstrI)

Dependences exist between variables in registers (easy to identify) as well as between variables accessed in memory (more difficult to identify).
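The three kinds of dependence can be told apart mechanically from the destination and source operands. The sketch below assumes the three-operand form used on these slides (destination first, then two sources); the `Instr` type and the `dependences` function are hypothetical names, and the check deliberately ignores any instructions between the two that might overwrite the variable.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str   # first operand: the variable written
    src1: str   # second operand: first variable read
    src2: str   # third operand: second variable read


def dependences(earlier, later):
    """Return the set of dependences of `later` on `earlier`."""
    found = set()
    if earlier.dest in (later.src1, later.src2):
        found.add("data")      # later reads what earlier writes
    if earlier.dest == later.dest:
        found.add("output")    # both write the same variable
    if later.dest in (earlier.src1, earlier.src2):
        found.add("anti")      # later overwrites what earlier reads
    return found


if __name__ == "__main__":
    instr_i = Instr("Variable1", "Variable2", "Variable3")
    instr_j = Instr("Variable5", "Variable4", "Variable1")
    instr_k = Instr("Variable1", "Variable6", "Variable7")
    print(dependences(instr_i, instr_j))  # define-use
    print(dependences(instr_j, instr_k))  # use-reuse
    print(dependences(instr_i, instr_k))  # define-reuse
```

Applied to the define-use-reuse sequence, the three calls report a data dependence, an antidependence and an output dependence respectively.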
Data Hazards – Classification
Data hazards are potential violations of data and name dependences in a program during pipelined execution.
Let i < j be two integers (instruction i is earlier in the program than instruction j).
° RAW (Read After Write) hazard:
  1. There is a true data dependency between i and j.
  2. Instruction j wants to read the new value before instruction i can write it.
° WAR (Write After Read) hazard:
  1. There is an antidependency between i and j.
  2. Instruction j wants to write its result earlier than instruction i reads the previous value.
° WAW (Write After Write) hazard:
  1. There is an output dependency between i and j.
  2. Instruction j wants to write its result earlier than instruction i.
The existence of a hazard depends on:
- the distance between i and j with respect to pipeline latencies
- the order in which instructions read and write their operands with respect to their order in the program.
Only RAW hazards can occur in our simple integer pipeline. Why?
Only RAW Hazards Occur in Integer DLX
Variables in registers, all instructions:

Cycle    1.   2.    3.   4.   5.
Stage    IF   ID    EX   MEM  WB
              read            write

All instructions are executed in program order. The register read occurs in the 2nd cycle, the register write in the 5th cycle.
For j > i: instruction j is fetched in cycle j, instruction i in cycle i.
° RAW hazard? Instruction j reads in cycle j+1, instruction i writes in cycle i+4. A RAW hazard occurs if j+1 < i+4, i.e. j < i+3.
° WAW hazard? Instruction j writes in cycle j+4, instruction i writes in cycle i+4. A WAW hazard cannot occur because j+4 > i+4.
° WAR hazard? Instruction j writes in cycle j+4, instruction i reads in cycle i+1. A WAR hazard cannot occur because j+4 > i+1.
Similarly, you can prove that no RAW, WAR or WAW hazards occur between load and store instructions.
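The argument can be checked by brute force. The sketch below assumes, as on this slide, that instruction k is fetched in cycle k, reads its registers in cycle k+1 (ID) and writes in cycle k+4 (WB), and that a register written and read in the same cycle delivers the new value. The names `READ_OFFSET`, `WRITE_OFFSET` and `possible_hazards` are made up for this example.

```python
READ_OFFSET = 1    # ID stage: instruction fetched in cycle k reads in cycle k + 1
WRITE_OFFSET = 4   # WB stage: it writes in cycle k + 4


def possible_hazards(i, j):
    """Hazards that could arise between instructions i < j if they were dependent."""
    assert i < j
    hazards = set()
    if j + READ_OFFSET < i + WRITE_OFFSET:
        hazards.add("RAW")   # j would read before i has written
    if j + WRITE_OFFSET < i + READ_OFFSET:
        hazards.add("WAR")   # j would write before i has read
    if j + WRITE_OFFSET < i + WRITE_OFFSET:
        hazards.add("WAW")   # j would write before i has written
    return hazards


if __name__ == "__main__":
    for distance in range(1, 5):
        print(f"j = i+{distance}:", possible_hazards(1, 1 + distance) or "none")
```

Under these assumptions only distances 1 and 2 produce a RAW hazard, and no distance produces a WAR or WAW hazard.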
Not Every Dependency Leads to a Hazard!

Dependency and hazard: ADD writes a new value of R1, but SUB reads an old value of R1.

Cycle           1    2    3    4    5    6
ADD R1,R2,R3    IF   ID   EX   MEM  WB
SUB R4,R1,R3         IF   ID   EX   MEM  WB

Dependency but no hazard:

Cycle           1    2    3    4    5    6    7    8
ADD R1,R2,R3    IF   ID   EX   MEM  WB
OR R5,R2,R3          IF   ID   EX   MEM  WB
AND R6,R2,R6              IF   ID   EX   MEM  WB
SUB R4,R1,R3                   IF   ID   EX   MEM  WB
Example of RAW Data Hazard
1) Stall:

Cycle           1         2         3         4         5         6         7         8
ADD R1,R2,R3    IF        ID        EX        MEM       WB
SUB R4,R1,R3              IF        ID stall  ID stall  ID        EX        MEM       WB

2) Better solution – forwarding (also called register bypassing):

Cycle           1    2    3    4    5    6
ADD R1,R2,R3    IF   ID   EX   MEM  WB
SUB R4,R1,R3         IF   ID   EX   MEM  WB
(the new value of R1 is forwarded from ADD to the EX stage of SUB)

The result of ADD produced in the EX stage is immediately forwarded to the following instruction => no stall is necessary!
Forwarding – Principle of HW Solution
[Figure: DLX pipeline datapath with IF, ID, EX, MEM and WB stages (instruction memory, register file with 32 32-bit registers, ALU, data memory, next-address logic, target address calculation), extended with forwarding control and the MEM/EX and WB/EX forwarding paths.]
Results are forwarded from the pipeline registers using the MEM/EX and WB/EX paths.
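The decision made by the forwarding control can be sketched as a simple priority check. This is a simplified model, not the actual DLX control logic: the function name `select_operand_source` is hypothetical, destination registers are given as numbers (None when the instruction writes no register), and the load case, where the MEM-stage result is not yet available, is ignored.

```python
def select_operand_source(src_reg, mem_stage_dest, wb_stage_dest):
    """Pick where the EX stage should take the value of src_reg from.

    mem_stage_dest / wb_stage_dest: destination register of the instruction
    currently in the MEM / WB stage, or None if it writes no register.
    """
    if src_reg == 0:
        return "REGFILE"      # R0 is hardwired to zero in DLX, never forwarded
    if mem_stage_dest == src_reg:
        return "MEM/EX"       # newest value: produced by the previous instruction
    if wb_stage_dest == src_reg:
        return "WB/EX"        # value produced two instructions earlier
    return "REGFILE"          # no producer in flight: use the register file


if __name__ == "__main__":
    # ADD R1,R2,R3 is in MEM while SUB R4,R1,R3 is in EX.
    print(select_operand_source(1, mem_stage_dest=1, wb_stage_dest=None))
    print(select_operand_source(3, mem_stage_dest=1, wb_stage_dest=None))
```

The MEM/EX path is checked first so that, when two in-flight instructions write the same register, the more recent value wins.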
RAW Hazard Requiring Stall

Cycle           1    2    3    4    5    6
LW R1,(R2)+5    IF   ID   EX   MEM  WB
SUB R4,R1,R3         IF   ID   EX   MEM  WB

Forwarding back in time? Data from a LOAD are available at the end of the MEM stage (cycle 4), but SUB needs them at the beginning of its EX stage (also cycle 4) => they cannot be forwarded back in time.
RAW Hazard Requiring Stall (contd.)

Cycle           1         2         3         4         5         6         7
LW R1,(R2)+5    IF        ID        EX        MEM       WB
SUB R4,R1,R3              IF        ID stall  ID        EX        MEM       WB
(the loaded value is forwarded from LW to the EX stage of SUB)

This delay is called the LOAD-USE delay (typical for a RISC pipeline). It can be eliminated by inserting an independent instruction between the LOAD and USE instructions. This is a typical task of an optimizing compiler.
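A minimal sketch of how load-use stalls could be counted in an instruction sequence, assuming a pipeline with forwarding where only the instruction immediately after a load can stall. The tuple layout `(opcode, dest, sources)`, the function name `load_use_stalls` and the independent `ADD R5,R6,R7` used as filler are all assumptions of this example.

```python
def load_use_stalls(program):
    """Count one-cycle stalls caused by using a loaded value in the next instruction.

    program: list of (opcode, dest, sources) tuples in program order.
    """
    stalls = 0
    for producer, consumer in zip(program, program[1:]):
        opcode, dest, _ = producer
        if opcode == "LW" and dest in consumer[2]:
            stalls += 1
    return stalls


if __name__ == "__main__":
    original = [
        ("LW", "R1", ("R2",)),
        ("SUB", "R4", ("R1", "R3")),
    ]
    scheduled = [
        ("LW", "R1", ("R2",)),
        ("ADD", "R5", ("R6", "R7")),   # hypothetical independent instruction
        ("SUB", "R4", ("R1", "R3")),
    ]
    print(load_use_stalls(original), load_use_stalls(scheduled))
```

Moving an independent instruction into the slot after the load removes the stall without changing the result of the program.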
Performance Impact of Data Hazards on CPI
a) Consider the integer pipeline with stalls.
Let i be the position of the producing instruction and j the position of the consuming instruction:

j               Stalls in the pipeline   Probability P (non-optimized code)
i+1             2 stalls                 40 %
i+2             1 stall                  20 %
i+3 and more    0 stalls                 40 %

Instruction type    Dynamic frequency F
ALU                 48 %
Load                26 %
Store               13 %
Branch (control)    13 %

SPIdata = (Falu + Fload) * (p1*2 + p2*1)
SPIdata = (0.48 + 0.26) * (0.4*2 + 0.2*1)
SPIdata = 0.74
CPIpipe_real >= 1.74 !
Performance Impact of Data Hazards on CPI
b) Consider the integer pipeline with forwarding.
Let i be the position of the Load instruction and j the position of the consuming instruction:

j               Stalls in the pipeline   Probability P (non-optimized code)
i+1             1 stall                  40 %
i+2 and more    0 stalls                 60 %

Instruction type    Dynamic frequency F
ALU                 48 %
Load                26 %
Store               13 %
Branch (control)    13 %

SPIdata = Fload * p1 * 1
SPIdata = 0.26 * 0.4 * 1
SPIdata = 0.104
CPIpipe_real >= 1.104 !
Beware that Tclk can be negatively affected by forwarding!
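Both SPIdata estimates, a) with stalls only and b) with forwarding, follow the same pattern: the frequency of producing instructions times the expected number of stalls per producer. The sketch below recomputes them from the figures on the slides; the names `FREQ` and `spi_data` are chosen for this example.

```python
# Dynamic instruction frequencies from the slides.
FREQ = {"ALU": 0.48, "Load": 0.26, "Store": 0.13, "Branch": 0.13}


def spi_data(producer_freq, stall_profile):
    """Stalls per instruction caused by data hazards.

    stall_profile: list of (probability, stalls) pairs, one per consumer distance.
    """
    return producer_freq * sum(p * stalls for p, stalls in stall_profile)


# a) stalls only: ALU and Load results both cause stalls (2 at i+1, 1 at i+2).
spi_stall_only = spi_data(FREQ["ALU"] + FREQ["Load"], [(0.4, 2), (0.2, 1), (0.4, 0)])

# b) forwarding: only a Load followed directly by its consumer stalls (1 cycle).
spi_forwarding = spi_data(FREQ["Load"], [(0.4, 1), (0.6, 0)])

if __name__ == "__main__":
    print(f"stalls only: SPIdata = {spi_stall_only:.3f}, CPI >= {1 + spi_stall_only:.3f}")
    print(f"forwarding:  SPIdata = {spi_forwarding:.3f}, CPI >= {1 + spi_forwarding:.3f}")
```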
Control Hazard – Branch Instruction
[Figure: DLX pipeline datapath with IF, ID, EX, MEM and WB stages (instruction memory, register file with 32 32-bit registers, ALU with condition evaluation, data memory, next-address logic, target address calculation, jump_now signal).]
A branch is evaluated during the EX stage; the PC is changed during the MEM stage.
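Solution of Control Hazard by Pipeline Stall

Cycle               1         2         3         4         5         6         7         8         9
BEQZ R3, M1         IF        ID        EX        MEM       WB
next instruction              IF stall  IF stall  IF stall  IF        ID        EX        MEM       WB

Condition resolution and address calculation take place in the EX stage of the branch; the PC is changed in its MEM stage. The next instruction (instr.2 if the branch is not taken, M1: ADD R4,R6,R7 if it is taken) is fetched afterwards.
3 stalls for every branch (taken or not taken): SPIcontrol = 0.13*3 = 0.39 !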
Solution of Control Hazard by Flushing

Cycle               1    2    3    4    5    6    7    8    9
BEQZ R3, M1         IF   ID   EX   MEM  WB
instr.2                  IF   ID   EX   NOP
instr.3                       IF   ID   NOP
instr.4                            IF   NOP
M1: ADD R4,R6,R7                        IF   ID   EX   MEM  WB

Condition resolution and address calculation take place in the EX stage of the branch; the PC is changed in its MEM stage. Instr.2, instr.3 and instr.4 are the cancelled (flushed) instructions.
Instructions 2, 3 and 4 are flushed only when the branch is taken:
° 13 % of instructions in SpecInt2000 are branches
° 67 % of branches are taken
° 3 stalls occur only when the branch is taken
SPIcontrol = 0.13*0.67*3 = 0.26
Better Solution of Control Stalls
° The solution has two parts:
  - Evaluate the condition and the target address earlier
  - Change the PC earlier
° Possible implementation:
  - Add condition evaluation logic to the ID stage
  - Compute the next PC during the ID stage
° Result: penalty of 1 instead of 3 => SPIcontrol = 0.0871
° Note, however, that condition evaluation in ID cannot use the forwarding to the EX stage => SPIdata can increase. This can be solved by optimizing the instruction ordering.
° Tclk can also be affected due to the more complex logic in the ID stage!
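The three SPIcontrol figures quoted so far (0.39 for stalling, 0.26 for flushing, 0.0871 for branch evaluation in ID) all come from one product. A small sketch, with `spi_control` as a name chosen for this example and the frequencies taken from the slides:

```python
def spi_control(branch_freq, penalty, penalized_fraction=1.0):
    """Stalls per instruction caused by branches.

    penalized_fraction: share of branches that actually pay the penalty
    (1.0 when every branch stalls, the taken fraction when only taken
    branches are flushed).
    """
    return branch_freq * penalized_fraction * penalty


if __name__ == "__main__":
    print(f"stall on every branch:     {spi_control(0.13, 3):.4f}")
    print(f"flush taken branches:      {spi_control(0.13, 3, 0.67):.4f}")
    print(f"branch evaluated in ID:    {spi_control(0.13, 1, 0.67):.4f}")
```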
Improved Branch Instruction - Implementation
[Figure: DLX pipeline datapath in which the condition evaluation logic and the jump/branch target address calculation are placed in the ID stage; the MEM/EX and WB/EX forwarding paths and the jump_now signal are shown.]
Improved Branch Instruction

Cycle               1    2    3    4    5    6    7
BEQZ R3, M1         IF   ID   EX   MEM  WB
instr.2                  IF   NOP                      (cancelled/flushed instruction)
M1: ADD R4,R6,R7              IF   ID   EX   MEM  WB

Condition evaluation, address calculation and the PC change take place in the ID stage of the branch.
Instr.2 is said to be in the branch delay slot. Similarly, the instruction after a load is said to be in the load delay slot.
Further Improved Branch – Delayed Branch

Cycle                          1    2    3    4    5    6    7
BEQZ R3, M1                    IF   ID   EX   MEM  WB
instr.2 (branch delay slot)         IF   ID   EX   MEM  WB
M1: ADD R4,R6,R7                         IF   ID   EX   MEM  WB

Condition evaluation and address calculation take place in the ID stage of the branch.
Let the instruction in the branch delay slot always execute. This principle is called delayed branch.
SPIcontrol = Taken Branch Frequency * Branch Penalty
If the instruction in the branch delay slot performs some useful computation, the branch penalty is reduced.
Which Instructions Can Be Placed in a Branch Delay Slot?
Solution 1: From Before

Before:
      ADD R1, R2, R3
      BEQZ R4, LAB
      (delay slot)
LAB:

After:
      BEQZ R4, LAB
      ADD R1, R2, R3    (delay slot)
LAB:

° This is a safe solution.
° The branch penalty is reduced to zero.
° The branch must not depend on the instruction placed in the delay slot (obviously).
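The "from before" rule can be sketched as a small transformation on a basic block. This is a deliberately simplified model that only considers the instruction immediately preceding the branch; the tuple layout `(text, dest, sources)` and the function name `fill_delay_slot_from_before` are assumptions of this example.

```python
NOP = ("NOP", None, ())


def fill_delay_slot_from_before(block):
    """Move the instruction before the branch into the delay slot if that is legal.

    block: list of (text, dest, sources) tuples; the last entry is the branch.
    Returns a new list with the delay slot filled (by the moved instruction or a NOP).
    """
    *body, branch = block
    if body and body[-1][1] not in branch[2]:
        # The branch does not read what the candidate writes: safe to move.
        return body[:-1] + [branch, body[-1]]
    return body + [branch, NOP]


if __name__ == "__main__":
    block = [
        ("ADD R1, R2, R3", "R1", ("R2", "R3")),
        ("BEQZ R4, LAB", None, ("R4",)),
    ]
    for text, _, _ in fill_delay_slot_from_before(block):
        print(text)
```

When the branch condition depends on the preceding instruction (for example if the ADD wrote R4), the sketch falls back to a NOP in the delay slot.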
Which Instructions Can Be Placed in a Branch Delay Slot (2)?
Solution 2: From Target

Before:
LAB:  SUB R4, R5, R6
      ADD R1, R2, R3
      BEQZ R7, LAB
      (delay slot)

After:
      SUB R4, R5, R6
LAB:  ADD R1, R2, R3
      BEQZ R7, LAB
      SUB R4, R5, R6    (delay slot)

Solution 3: From Fall-Through

Before:
      ADD R1, R2, R3
      BEQZ R7, LAB
      (delay slot)
      SUB R4, R5, R6
LAB:

After:
      ADD R1, R2, R3
      BEQZ R7, LAB
      SUB R4, R5, R6    (delay slot)
LAB:
Which Instructions Could Be Placed in the Branch Delay Slot (3)?
º The branch penalty is reduced only when the program follows the expected path from which the instruction was taken.
º Solutions 2 and 3 are legal only when the instruction in the delay slot can be safely executed when the branch goes down the unexpected path.
º The instruction in a branch delay slot should not raise an exception when executed on the unexpected path. This is a problem especially with Load and Store instructions (virtual memory protection fault).
Solution: the so-called canceling branch. The instruction in the delay slot is executed only when the branch is taken. It is cancelled otherwise (when the branch falls through).
Delayed Branches - Summary
Some architectures support all 3 types of branches:
º Delayed branches – used when the instruction in the delay slot can be safely executed regardless of the branch direction (preferred option)
º Conventional branches – the instruction in the delay slot is cancelled for a taken branch
º Canceling branches – the instruction in the delay slot is cancelled for the fall-through path
Delayed branches work well for a simple pipeline, but they complicate exception (interrupt) handling and superscalar execution (more advanced pipelines).
Structural Hazards
° Structural hazards: imperfect resource replication in the pipeline, collision on shared resources (memory, register file, functional unit)
° Our integer DLX pipeline has no resources shared between pipeline stages, hence there are no structural hazards: SPIstruct = 0
° In real pipelines such shared resources exist (to save area and cost) and are sources of structural stalls
Example of Structural Hazard – Unified Memory

Cycle      1    2    3    4    5    6    7    8    9
Load       IF   ID   EX   MEM  WB
instr.2         IF   ID   EX   MEM  WB
instr.3              IF   ID   EX   MEM  WB
Stall                     (lost instruction slot, pipeline bubble)
instr.4                        IF   ID   EX   MEM  WB

Instruction 4 cannot start in the 4th cycle because Load is accessing the unified instruction and data memory. This can be seen as a stall in the IF stage.
Performance impact of unified memory (cache): when memory operands are accessed, instructions cannot be fetched. 26 % of instructions are Loads and 13 % are Stores => SPIstruct = 0.39
Example of Structural Hazard – Missing MEM stage

Cycle      1     2     3     4     5
Load       IF    ID    EX    MEM   WB
ADD              IF    ID    EX    WB

Collision on the write port into the register file (cycle 5).
Reason: instructions have different numbers of pipeline stages.
Solution: adding a register file write port, but this is not always reasonable (area, delay). A stall is the cheaper solution:

Cycle      1     2     3     4     5     6     7     8     9     10
Load       IF    ID    EX    MEM   WB
ADD              IF    ID    EX    stall WB
instr.3                IF    ID    stall EX    WB
instr.4                      IF    stall ID    EX    MEM   WB
instr.5                            stall IF    ID    EX    MEM   WB

This is the reason why ALU instructions go through the MEM stage: it avoids this type of hazard.
Identifying Structural Hazards (1)
Identify the clock cycles in which pipeline resources are used by different instruction types.

Example 1 – shared memory (cycles in which the shared memory is used are marked *)

Cycle           1     2     3     4     5
Load, Store     IF*   ID    EX    MEM*  WB
Other instr.    IF*   ID    EX    MEM   WB

Example 2 – different number of pipeline stages (cycles in which the register file write port is used are marked *)

Cycle           1     2     3     4     5
Load            IF    ID    EX    MEM   WB*
Other instr.    IF    ID    EX    WB*
Store           IF    ID    EX    MEM

A shared resource is a resource used by instructions in multiple different cycles => structural hazards can occur.
Identifying Structural Hazards (2)

Clock cycle    1.   2.   3.   4.   5.   6.   7.   8.
1. Load        IF   ID   EX   MEM  WB
2. instr.2          IF   ID   EX   MEM  WB
3. instr.3               IF   ID   EX   MEM  WB
4. instr.4                    IF   ID   EX   MEM  WB

Instruction j using a shared resource in cycle c2 collides with instruction i, which uses the same resource in cycle c1, if and only if j = i + (c1 – c2).
In our example: i = 1, c1 = 4; j = 4, c2 = 1.
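The collision condition j = i + (c1 - c2) can be turned into a small check that lists the distances j - i at which a later instruction collides with an earlier one. The function name `colliding_distances` and the representation of resource usage as sets of cycle numbers (counted from the instruction's own IF cycle = 1) are assumptions of this sketch.

```python
def colliding_distances(earlier_cycles, later_cycles):
    """Distances j - i at which a later instruction j collides with instruction i.

    earlier_cycles: cycles (relative to its own fetch = 1) in which
                    instruction i uses the shared resource (values c1).
    later_cycles:   the same for instruction j (values c2).
    """
    return sorted({c1 - c2
                   for c1 in earlier_cycles
                   for c2 in later_cycles
                   if c1 - c2 > 0})


if __name__ == "__main__":
    # Unified memory: a Load uses it in IF (cycle 1) and MEM (cycle 4),
    # any other instruction at least in IF (cycle 1).
    print("unified memory:", colliding_distances({1, 4}, {1}))
    # Register file write port: Load writes in cycle 5, a 4-stage ALU
    # instruction without MEM would write in cycle 4.
    print("write port:    ", colliding_distances({5}, {4}))
```

For the unified memory the only colliding distance is 3, which is the i = 1, j = 4 case of the example; for the write port it is 1, the ADD directly behind the Load.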
Performance of Integer DLX Pipeline (with Delayed Branches) on SPECInt95
Average stalls per instruction in SPECInt95: SPIcontrol = 0.06, SPIdata = 0.05
CPI (assuming a perfect memory system) = 1.11
Speedup against the non-pipelined version is 4.5
© David Patterson, CS152 UCB 1996
DLX and Multicycle Instructions – FP Pipeline
[Figure: WinDLX model. IF and ID are followed by four parallel EX functional units (integer unit, FP/Int multiply, FP ADD/SUB, FP/Int divide), then MEM and WB.]
DLX FP Pipeline – More Realistic Model
[Figure: after F and D, an instruction enters one of the functional units: integer unit (E, M), FP multiplier (MX1, MX2, MX3), FP adder (AX1, AX2, AX3, AX4), or FP DIV/SQRT (DX1-DX28, multicycle); all units end in W.]
Functional Unit Parameters
° Latency - the number of clock cycles before the result is available; typically the number of pipeline stages in a unit
° Initiation (Repetition) Interval - the number of clock cycles before the next operation can use this functional unit

instr.1    IF   ID   EX   MEM  WB
instr.2         IF   ID   EX   MEM  WB

Pipeline latency = 5 cycles. Latency of the integer unit = 1 cycle (EX). Initiation interval = 1 cycle.
Functional Unit Parameters: an Example

Functional unit (producing instruction type)   Latency   Initiation interval
Integer (ALU Reg-Reg, Reg-Imm)                 1         1
Integer (Load instruction)                     2         1
FP MULT (pipelined)                            6         1
FP ADD (pipelined)                             3         1
DIVIDER (iterative)                            28        28

In WinDLX all functional units are multicycle (latency = initiation interval).
Instruction Latency and Initiation Interval
Latency of an instruction - the number of clock cycles before the result is available to the dependent instruction. It depends on:
- the latency of the functional unit for the given instruction
- the forwarding paths in the processor
- the consuming instruction type
Initiation (Repetition) Interval of an instruction - the number of clock cycles before the same type of instruction can start execution (without resource conflicts). It depends on:
- the number of functional units and their initiation intervals
M non-pipelined functional units of latency M have the same initiation interval as a single fully-pipelined functional unit of latency M.
See also the supplementary material about parameters of various DLX CPUs.
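The claim about M non-pipelined units can be illustrated with a small scheduling sketch. It assumes in-order issue of at most one operation per cycle and that each operation goes to the unit that becomes free first; the function name `start_cycles` and the cycle numbering from 0 are choices made for this example.

```python
def start_cycles(num_ops, num_units, unit_interval):
    """Earliest start cycle of each operation of one type.

    num_units:     number of identical functional units
    unit_interval: initiation interval of a single unit (equal to its latency
                   for a non-pipelined unit, 1 for a fully pipelined unit)
    """
    free_at = [0] * num_units   # cycle in which each unit can accept a new operation
    next_issue = 0              # in-order issue: at most one operation per cycle
    starts = []
    for _ in range(num_ops):
        unit = min(range(num_units), key=lambda u: free_at[u])
        start = max(free_at[unit], next_issue)
        starts.append(start)
        free_at[unit] = start + unit_interval
        next_issue = start + 1
    return starts


if __name__ == "__main__":
    print("3 non-pipelined units, latency 3:", start_cycles(6, 3, 3))
    print("1 fully pipelined unit:          ", start_cycles(6, 1, 1))
    print("1 non-pipelined unit, latency 3: ", start_cycles(6, 1, 3))
```

The first two configurations start one operation every cycle, while a single non-pipelined unit starts one only every third cycle.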
New Problems with a FP Pipeline
o Increasing number of structural hazards: blocked non-pipelined units, register write ports
o RAW hazards are more frequent (higher latency of FP operations)
o WAR hazards cannot occur because operands are read at the beginning of execution and instructions start execution in order
o WAW hazards can occur because instructions complete out of order (they take different numbers of EX stages)
o More complex hazard detection logic in the ID stage
New Sources of Structural Hazards

Cycle              1    2    3    4    5    6    7    8    9    10
FMULT F0, F4, F6   IF   ID   MX1  MX2  MX3  MX4  MX5  MX6  MEM  WB
Integer instr.          IF   ID   EX   MEM  WB
Integer instr.               IF   ID   EX   MEM  WB
FADD F2, F4, F6                   IF   ID   AX1  AX2  AX3  MEM  WB
Integer instr.                         IF   ID   EX   MEM  WB
LDF F3, 15(R1)                              IF   ID   EX   MEM  WB

• More register file write ports are needed (up to 4), otherwise a structural stall occurs.
• This stall occurs in WinDLX during the MEM stage (only a single instruction can finish in 1 cycle in WinDLX).
• In a real CPU, integer writes and FP writes would not collide and there would be more write ports for the FP register file.
An Example of a WAW Hazard

Cycle              1    2    3    4    5    6    7    8    9    10
FMULT F0, F4, F6   IF   ID   MX1  MX2  MX3  MX4  MX5  MX6  MEM  WB
Integer instr.          IF   ID   EX   MEM  WB
FADD F0, F3, F2              IF   ID   AX1  AX2  AX3  MEM  WB

º The value from FADD could be overwritten by FMULT.
º The WAW hazard can be resolved by stalling the output-dependent instruction in the ID stage.
º The performance impact is not significant because the hazard occurs infrequently (only for "redundant computation").
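A WAW hazard of this kind can be detected by comparing write-back cycles. The sketch below takes the numbers of EX stages from the diagram (6 for FMULT, 3 for FADD, 1 for an integer instruction) and assumes every instruction passes through IF, ID, its EX stages, MEM and WB; the names `EX_STAGES`, `writeback_cycle` and `waw_hazard` are hypothetical.

```python
# Number of EX stages per operation, as drawn in the diagram.
EX_STAGES = {"FMULT": 6, "FADD": 3, "INT": 1}


def writeback_cycle(fetch_cycle, op):
    """Cycle of the WB stage: IF, ID, the EX stages and MEM come before it."""
    return fetch_cycle + 1 + EX_STAGES[op] + 1 + 1


def waw_hazard(earlier, later):
    """earlier / later: (fetch_cycle, op, dest) with earlier first in program order."""
    same_dest = earlier[2] == later[2]
    later_writes_first = writeback_cycle(later[0], later[1]) <= writeback_cycle(earlier[0], earlier[1])
    return same_dest and later_writes_first


if __name__ == "__main__":
    fmult = (1, "FMULT", "F0")
    fadd = (3, "FADD", "F0")
    print("FMULT WB in cycle", writeback_cycle(*fmult[:2]))
    print("FADD  WB in cycle", writeback_cycle(*fadd[:2]))
    print("WAW hazard:", waw_hazard(fmult, fadd))
```

Here FADD would write F0 in cycle 9 and FMULT in cycle 10, so the older FMULT result would wrongly survive unless FADD is stalled in ID.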
Performance of FP DLX (SPECfp95)
[Figures: charts of stalls in the FP DLX pipeline on SPECfp95.]
© David Patterson, CS152 UCB 1996
Performance of FP DLX
º Integer stalls are negligible compared to FP stalls
º RAW stalls dominate (see the "FP result stalls" bars)
º Structural stalls due to the divider are negligible
The number of stalls per FP operation corresponds to about 50 % of its latency (the rest is overlapped by useful computation of other instructions):
- 14.2 stalls per division (latency = 28 cycles)
- 2.7 stalls per multiplication (latency = 6 cycles)
Impact on CPI (SPIFP):
o Depends on the application: 0.65 to 1.21
o The SPECfp stall average is 0.87
o DLX FP CPI = 1.87 (SPECfp95)
© David Patterson, CS152 UCB 1996
Conclusion
° Pipeline performance is limited by data, control and structural hazards
° Data hazards are RAW, WAW and WAR
° These hazards can stall the pipeline
° Some techniques to avoid stalls were shown (forwarding, delayed branch, instruction scheduling by the compiler)
Following 3 weeks: memory hierarchy (what is behind the IF and MEM stages).
We will return once more to the topic of processor architecture (superscalar, dynamic scheduling, branch prediction, speculation …).