4.4 Instruction Pipeline
The processor pipeline consists of stages for instruction fetch, instruction decode, register read,
execution, and result writeback. Certain stages involve multiple clock cycles of execution. The processor also contains an instruction prefetch buffer to allow buffering of instructions prior to the decode stage.
Instructions proceed from this buffer to the instruction decode stage by entering the instruction decode register.
Table 4-2 explains the five pipeline stages.
Figure 4-2 shows the pipeline diagram.
Table 4-2. Pipeline Stages

  Stage                      Description
  IFETCH                     Instruction fetch from memory
  DECODE/RF READ/FF/MEM EA   Instruction decode/register read/operand forwarding/
                             memory effective address generation
  EXECUTE0/MEM0              Instruction execution stage 0/memory access stage 0
  EXECUTE1/MEM1              Instruction execution stage 1/memory access stage 1
  WB                         Write back to registers

Figure 4-2. Pipeline Diagram
[Figure 4-2: simple instructions flow through IFetch; Decode/Reg Read/FFwd; Execute0; Feedforward; and Writeback. Load instructions flow through IFetch; Decode/Reg Read/EA Calc; Memory0; Memory1; and Writeback.]
4.4.1 Description of Pipeline Stages
The IFetch pipeline stage retrieves instructions from the memory system and determines where the next instruction fetch is performed. Up to two 32-bit instructions or four 16-bit instructions are sent from memory to the instruction buffers each cycle.
The decode pipeline stage decodes instructions, reads operands from the register file, and performs dependency checking.
Execution occurs in one or both of the execute pipeline stages in each execution unit (perhaps over multiple cycles). Execution of most load/store instructions is pipelined. The load/store unit has three pipeline stages: effective address calculation (EA Calc), initial memory access (MEM0), and final memory access, data format, and forward (MEM1).
Simple integer instructions complete execution in the Execute 0 stage of the pipeline. Multiply instructions require both the Execute 0 and Execute 1 stages but may be pipelined as well. Most condition-setting instructions complete in the Execute 0 stage of the pipeline, thus conditional branches dependent on a condition-setting instruction may be resolved by an instruction in this stage.
Result feed-forward hardware forwards the result of one instruction into the source operand(s) of a following instruction so that the execution of data-dependent instructions does not wait until the
completion of the result write-back. Feed forward hardware is supplied to allow bypassing of completed instructions from both execute stages into the first execution stage for a subsequent data-dependent instruction.
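The operand-selection priority implied by the feed-forward description can be sketched in software. This is a minimal illustrative model, not the actual bypass network: the function name, argument shapes, and the E0-over-E1 priority (the younger instruction's result wins) are assumptions for clarity.

```python
# Illustrative model of result feed-forwarding between pipeline stages.
# An instruction entering execution may take a source operand from the
# register file, or from a result completing in E0 or E1 of an earlier
# instruction, avoiding a wait for result write-back.

def select_operand(src_reg, regfile, e0_result, e1_result):
    """Pick the newest available value for src_reg.

    e0_result / e1_result are (dest_reg, value) pairs for instructions
    currently in the execute stages, or None if that stage is idle.
    """
    # Newest result wins: E0 (younger instruction) over E1, and both
    # over the (possibly stale) register-file copy.
    if e0_result is not None and e0_result[0] == src_reg:
        return e0_result[1]
    if e1_result is not None and e1_result[0] == src_reg:
        return e1_result[1]
    return regfile[src_reg]

regfile = {1: 10, 2: 20, 3: 0}
# r3 just produced 30 in E0; a dependent instruction reading r3 gets 30
# via the bypass even though the register file still holds the old 0.
assert select_operand(3, regfile, (3, 30), None) == 30
assert select_operand(1, regfile, None, None) == 10
```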
4.4.2 Instruction Prefetch Buffers and Branch Target Buffer
The e200z4 contains an eight-entry instruction prefetch buffer, which supplies instructions into the instruction register (IR) for decoding. Each slot in the prefetch buffer is 32 bits wide, capable of holding a single 32-bit instruction or a pair of 16-bit instructions.
Instruction prefetches request a 64-bit double word, and the prefetch buffer is filled with a pair of
instructions at a time, except for the case of a change of flow fetch where the target is to the second (odd) word. In that case only a 32-bit prefetch is performed to load the instruction prefetch buffer. This 32-bit fetch may be immediately followed by a 64-bit prefetch to fill slots 0 and 1 in the event that the branch is resolved to be taken.
In normal sequential execution, instructions are loaded into the IR from prefetch buffer slots 0 and 1. As a pair of slots is emptied, it is refilled. Whenever a pair of slots is empty, a 64-bit prefetch is initiated, which fills the earliest empty slot pair, beginning with slot 0.
If the instruction prefetch buffer empties, instruction issue stalls, and the buffer is refilled. The first returned instruction is forwarded directly to the IR. Open cycles on the memory bus are utilized to keep the buffer full when possible.
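The pair-at-a-time refill policy described above can be sketched as a toy model. The class, method names, and issue behavior here are illustrative assumptions, not the hardware implementation:

```python
# Toy model of the 8-slot prefetch buffer: a 64-bit prefetch fills a
# pair of adjacent slots; refill targets the earliest empty slot pair.

class PrefetchBuffer:
    def __init__(self):
        self.slots = [None] * 8          # each slot holds one 32-bit word

    def earliest_empty_pair(self):
        # Slot pairs are (0,1), (2,3), (4,5), (6,7).
        for i in range(0, 8, 2):
            if self.slots[i] is None and self.slots[i + 1] is None:
                return i
        return None

    def prefetch64(self, word0, word1):
        # Fill the earliest empty pair with a 64-bit doubleword.
        i = self.earliest_empty_pair()
        if i is not None:
            self.slots[i], self.slots[i + 1] = word0, word1

    def issue_pair(self):
        # Normal sequential issue consumes slots 0 and 1 into the IR.
        pair = self.slots[0], self.slots[1]
        self.slots[0] = self.slots[1] = None
        return pair

buf = PrefetchBuffer()
buf.prefetch64("i0", "i1")
buf.prefetch64("i2", "i3")
assert buf.issue_pair() == ("i0", "i1")
assert buf.earliest_empty_pair() == 0   # slots 0-1 free again for refill
```

The model omits the change-of-flow case (odd-word target, 32-bit fetch) and the slot rotation details, which the text above describes.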
Figure 4-3 shows the instruction prefetch buffers.
Figure 4-3. e200z4 Instruction Prefetch Buffers
To resolve branch instructions and improve the accuracy of branch predictions, the e200z4 implements a dynamic branch prediction mechanism using an 8-entry branch target buffer (BTB).
An entry is allocated in the BTB whenever a branch resolves as taken and the BTB is enabled. Entries in the BTB are allocated on taken branches using a FIFO replacement algorithm.
Each BTB entry holds the branch target address and a 2-bit branch history counter whose value is incremented or decremented on a BTB hit depending on whether the branch was taken. The counter can assume four different values: strongly taken, weakly taken, weakly not taken, and strongly not taken. On initial allocation of an entry to the BTB for a taken branch, the counter is initialized to the weakly-taken state.
A branch will be predicted as taken on a hit in the BTB with a counter value of strongly or weakly taken.
In this case the target address contained in the BTB is used to redirect the instruction fetch stream to the target of the branch prior to the branch reaching the instruction decode stage. In the case of a BTB miss, static prediction is used to predict the outcome of the branch. In the case of a mispredicted branch, the instruction fetch stream will return to the proper instruction stream after the branch has been resolved.
When a branch is predicted taken and the branch is later resolved (in the branch execute stage), the value of the appropriate BTB counter is updated. If a branch whose counter indicates weakly taken is resolved as taken, the counter increments so that the prediction becomes strongly taken. If the branch resolves as not taken, the prediction changes to weakly not-taken. The counter saturates in the strongly taken (or strongly not-taken) state when the prediction remains correct.
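The allocation, prediction, and counter-update rules above can be sketched as a software model. The class, the 0–3 counter encoding (0 = strongly not taken … 3 = strongly taken), and the use of an ordered map for FIFO replacement are illustrative assumptions; the manual specifies only the four counter states, weakly-taken initialization, and FIFO replacement.

```python
# Illustrative model of the 8-entry BTB with FIFO replacement and
# 2-bit saturating branch history counters.
from collections import OrderedDict

class BTB:
    SIZE = 8

    def __init__(self):
        self.entries = OrderedDict()    # branch addr -> [target, counter]

    def predict(self, addr):
        """Return the predicted target on a hit with a 'taken' counter."""
        e = self.entries.get(addr)
        if e is not None and e[1] >= 2:          # weakly/strongly taken
            return e[0]
        return None                              # predicted not taken, or miss

    def resolve(self, addr, target, taken):
        e = self.entries.get(addr)
        if e is not None:
            # Saturating update: increment on taken, decrement otherwise.
            e[1] = min(3, e[1] + 1) if taken else max(0, e[1] - 1)
        elif taken:
            # Allocate on a taken branch, FIFO-evicting the oldest entry;
            # new entries start in the weakly-taken state.
            if len(self.entries) == self.SIZE:
                self.entries.popitem(last=False)
            self.entries[addr] = [target, 2]

btb = BTB()
btb.resolve(0x100, 0x200, taken=True)    # allocated, weakly taken
assert btb.predict(0x100) == 0x200
btb.resolve(0x100, 0x200, taken=False)   # now weakly not taken
assert btb.predict(0x100) is None
```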
The e200z4 does not implement the static branch prediction that is defined by the Power ISA embedded category architecture. The BO prediction bit in branch encodings is ignored.
Dynamic branch prediction is enabled by setting BUCSR[BPEN]. Allocation of branch target buffer entries may be controlled using the BUCSR[BALLOC] field, which determines whether forward or backward branches (or both) are candidates for entry into the BTB, and thus for branch prediction. Once a branch is in the BTB, BUCSR[BALLOC] has no further effect on that branch entry. Clearing BUCSR[BPEN] disables dynamic branch prediction, in which case the e200z4 reverts to a static prediction mechanism.
[Figure 4-3 contents: prefetch buffer slots 0–7, filled 64 bits at a time (DATA[0:63]), feed through a mux into the IR for decode.]
The BTB uses virtual addresses for performing tag comparisons. On allocation of a BTB entry, the effective address of a taken branch, along with the current Instruction Space (as indicated by MSR[IS]) is loaded into the entry and the counter value is set to weakly taken. The current PID value is not maintained as part of the tag information.
The e200z4 supports automatic flushing of the BTB when the current PID value is updated by an mtspr PID0 instruction. Software is otherwise responsible for maintaining coherency in the BTB when the effective-to-real (virtual-to-physical) address mapping changes. This is supported by the BUCSR[BBFI] control bit.
Figure 4-4 shows the branch target buffer.
Figure 4-4. e200z4 Branch Target Buffer
4.4.3 Single-Cycle Instruction Pipeline Operation
Sequences of single-cycle execution instructions follow the flow in Figure 4-5. Instructions are issued and completed in program order. Most arithmetic and logical instructions fall into this category. Instructions may feed-forward results of execution at the end of the E0 or FF stage.
Figure 4-5. Basic Pipeline Flow, Single-Cycle Instructions
[Figure 4-4 contents: eight BTB entries (entry 0 through entry 7), each holding a tag (branch address[0:30] and IS, where IS = instruction space) and data (target address[0:30] and counter).]
[Figure 4-5: instruction pairs (1st/2nd, 3rd/4th, 5th/6th) each flow through IF, DEC, E0, FF, WB, with each pair entering the pipeline one time slot after the previous pair.]
4.4.4 Basic Load and Store Instruction Pipeline Operation
For load and store instructions, the effective address is calculated in the EA Calc stage, and memory is accessed in the MEM0–MEM1 stages. Data selection and alignment is performed in MEM1, and the result is available at the end of MEM1 for the following instruction. If the instruction has a data dependency on the result of a load, there is a single stall cycle. Data will be fed-forward from the preceding load at the end of the MEM1 stage.
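The single load-use stall cycle can be illustrated with a toy cycle counter. The instruction tuples and the `issue_cycles` helper are hypothetical, and the model deliberately ignores every hazard except the one described above:

```python
# Toy cycle counter for the load-use hazard: a single-cycle instruction
# that consumes load data immediately after the load incurs exactly one
# stall cycle, because the data is available only at the end of MEM1.

def issue_cycles(instrs):
    """instrs: list of (kind, dest, srcs) tuples; returns issue cycles.

    Simplified model: one instruction issues per cycle, plus one stall
    when an instruction reads the destination of the immediately
    preceding load.
    """
    cycles = 0
    prev_kind, prev_dest = None, None
    for kind, dest, srcs in instrs:
        cycles += 1
        if prev_kind == "load" and prev_dest in srcs:
            cycles += 1                   # the single load-use stall cycle
        prev_kind, prev_dest = kind, dest
    return cycles

program = [
    ("load", "r3", ("r1",)),      # lwz r3, 0(r1)
    ("add",  "r4", ("r3", "r2")), # add r4, r3, r2 -- depends on r3
]
assert issue_cycles(program) == 3         # 2 issues + 1 stall
```

Scheduling an independent instruction between the load and its consumer hides the stall, as the pipelining discussion above implies.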
Figure 4-6 shows the basic pipeline flow for load/store instructions.
Figure 4-6. Basic Pipeline Flow, Load/Store Instructions
4.4.5 Change-of-Flow Instruction Pipeline Operation
Simple change-of-flow instructions require 2 clock cycles to refill the pipeline with the target instruction; this timing applies to taken branches and branch-and-link instructions with no BTB hit but a correct branch prediction.
Figure 4-7 shows the basic pipeline flow for change-of-flow instructions.
Figure 4-7. Basic Pipeline Flow, Branch Instructions (BTB Miss, Correct Prediction, Branch Taken)
[Figure 4-6: 1st load: IF, DEC/EA, M0, M1, WB; 2nd load: IF, DEC/EA, M0, M1, WB; 3rd single-cycle instruction (data dependent on the load): IF, DEC, stall, E0, FF, WB.]
[Figure 4-7: branch: IF, DEC, (E0), (E1), WB; target instruction: TF, DEC, E0, E1, WB.]
This 2-cycle timing may be reduced for branch-type instructions by performing the target fetch early when the target address can be obtained from the BTB. The resulting branch timing is reduced to a single clock when the target fetch is initiated early enough and the branch is correctly predicted.
Figure 4-8 shows the basic pipeline flow for the reduced timing.
Figure 4-8. Basic Pipeline Flow, Branch Instructions (BTB Hit, Correct Prediction, Branch Taken)
[Figure 4-8: branch: IF, DEC, (E0), (E1), WB; target instruction (BTB hit): TF, DEC, E0, E1, WB.]
For certain cases where the branch is incorrectly predicted, 3 cycles are required for the not-taken branch, which must correct the misprediction outcome. Figure 4-9 shows one example.
Figure 4-9. Basic Pipeline Flow, Branch Instruction (BTB Hit, Predict Taken, Incorrect Prediction)
[Figure 4-9: branch: IF, DEC, (E0), (E1), WB; speculatively fetched target (BTB hit): TF, DEC, then aborted; next sequential instruction: IF, DEC, E0, E1, WB.]
For certain other cases where the branch is incorrectly predicted as taken, a stall cycle is required to correct the misprediction outcome and begin refilling the instruction buffer. Figure 4-10 shows one example.
Figure 4-10. Basic Pipeline Flow, Branch Instructions
(BTB Miss, Predict Taken, Incorrect Prediction, Instruction Buffer Empty)
4.4.6 Basic Multi-Cycle Instruction Pipeline Operation
Most multi-cycle instructions may be pipelined so that the effective execution time is smaller than the overall number of clock cycles spent in execution. The restrictions on this execution overlap are that no data dependencies between the instructions are present and that instructions must complete and write back results in order. A single-cycle instruction that follows a multi-cycle instruction must wait for completion of the multi-cycle instruction prior to its own write-back in order to meet the in-order requirement.
Result feed-forward paths are provided so that execution may continue prior to result write-back.
[Figure 4-10: branch (BTB miss): IF, DEC, (E0), (E1), WB; next instruction: IF, DEC, E0, E1, WB; speculative target fetch: TF, then aborted; next sequential instruction: IF, DEC, E0, E1, WB.]
Figure 4-11 shows the basic pipeline flow for multi-cycle instructions.
Figure 4-11. Basic Pipeline Flow, Multiply Class Instructions
Since load and store instructions calculate the effective address in the DEC stage, any dependency on a previous instruction for EA calculation may stall the load or store in DEC until the result is available.
Figure 4-12 shows the infrequent case of a load instruction dependent on a multiply instruction.
Figure 4-12. Pipeline Flow, Multiply with Data-Dependent Load Instruction
[Figure 4-11: 1st multiply: IF, DEC, E0, E1, WB; 2nd (single-cycle) instruction: IF, DEC, E0, FF, WB; 3rd (data-dependent single-cycle) instruction: IF, DEC, E0, FF, WB.]
[Figure 4-12: 1st multiply: IF, DEC, E0, E1, WB; 2nd (single-cycle) instruction: IF, DEC, E0, FF, WB; 3rd instruction (data-dependent load): IF, DEC, then stalls in DEC until the multiply result is available, then proceeds to WB.]
The divide and load and store multiple instructions require multiple cycles in the execute stage as shown in Figure 4-13.
Figure 4-13. Basic Pipeline Flow, Long Instructions
4.4.7 Additional Examples of Instruction Pipeline Operation for Load and Store
Figure 4-14 shows an example of pipelining a data-dependent add instruction following a load with update instruction. While the first load begins accessing memory in the M0 stage, the next load with update can be calculating a new effective address in the EA Calc stage. Following the EA Calc, the updated base register value can be fed-forward to subsequent instructions, even during the MEM0 or MEM1 stage. The add in this example will not stall, even though a data dependency exists on the updated base register of the load with update.
Figure 4-14. Pipeline Flow, Load/Store Instructions with Base Register Update
[Figure 4-13: long instruction: IF, DEC, E0, E1, …, Elast, WB; next (single-cycle) instruction: IF, DEC, then E0, FF, WB once the long instruction completes execution.]
[Figure 4-14: 1st load: IF, DEC/EA, M0, M1, WB; 2nd load with update: IF, DEC/EA, M0, M1, WB; 3rd single-cycle instruction (data dependent on the EA calculation): IF, DEC, E0, FF, WB, with no stall.]
Figure 4-15 shows an example of pipelining a data-dependent store instruction following a load instruction. The store in this example stalls because its store data depends on the load data of the preceding load instruction.
Figure 4-15. Pipelined Store Instruction with Store Data Dependency
4.4.8 Move To/From SPR Instruction Pipeline Operation
Many mtspr and mfspr instructions are treated like single-cycle instructions in the pipeline and do not cause stalls. The following SPRs are exceptions and do cause stalls:
• MSR
• Debug SPRs
• SPE SPRs (SPEFSCR)
• Cache/MMU SPRs
Figure 4-16–Figure 4-18 show examples of mtspr and mfspr instruction timing.
[Figure 4-15: 1st load: IF, DEC/EA, M0, M1, WB; 2nd load: IF, DEC/EA, M0, M1, WB; 3rd instruction (store, data dependent on the load data): IF, DEC, stall, DEC/EA, M0, M1, WB.]
Figure 4-16 applies to the debug SPRs and SPEFSCR. These instructions do not begin execution until all previous instructions have finished their execute stage(s). In addition, execution of subsequent instructions is stalled until the mfspr and mtspr instructions complete.
Figure 4-16. mtspr, mfspr Instruction Execution, Debug and SPE SPRs
[Figure 4-16: previous instruction: IF, DEC/EA, E0, E1, WB; mtspr/mfspr (debug, SPE): IF, DEC, stall, E0, E1, WB; next instruction: IF, DEC, stall, stall, stall, E0, E1, WB.]
Figure 4-17 applies to the mtmsr instruction and the wrtee and wrteei instructions. Execution of subsequent instructions is stalled until the cycle after these instructions write back.
Figure 4-17. mtmsr, wrtee[i] Instruction Execution
[Figure 4-17: previous instruction: IF, DEC/EA, E0, E1, WB; mtmsr/wrtee/wrteei: IF, DEC, E0, E1, WB; next instruction: IF, DEC, stall, stall, E0, E1, WB.]
Accesses to cache and MMU SPRs are stalled until all outstanding bus accesses have completed on both interfaces and the cache and MMU are idle (p_[d,i]_cmbusy negated), providing an access window in which no translations or cache cycles are required. Other situations, such as a cache linefill, may cause the cache to be busy even when the processor interface is idle (p_[d,i]_tbusy[0]_b negated). In these cases, execution stalls until the cache and MMU are idle, as signaled by negation of p_[d,i]_cmbusy. Processor access requests are held off during execution of a cache/MMU SPR instruction. A subsequent access request may be generated in the cycle following the last execute stage (that is, during the WB cycle). This same protocol applies to cache and MMU management instructions (e.g., icbi, tlbre, tlbwe) as well as the DCRs.
Figure 4-18 shows an example where an outstanding bus access causes mtspr/mfspr execution to be delayed until the bus becomes idle.
Figure 4-18. Cache/DCR, MMU mtspr, mfspr and MMU Management Instruction Execution