4.4 Instruction Pipeline
The processor pipeline consists of stages for instruction fetch, instruction decode, register read,
execution, and result writeback. Certain stages involve multiple clock cycles of execution. The processor also contains an instruction prefetch buffer to allow buffering of instructions prior to the decode stage.
Instructions proceed from this buffer to the instruction decode stage by entering the instruction decode register.
Table 4-2 explains the five pipeline stages.
Figure 4-2 shows the pipeline diagram.
Table 4-2. Pipeline Stages

  Stage                      Description
  IFETCH                     Instruction fetch from memory
  DECODE/RF READ/FF/MEM EA   Instruction decode/register read/operand forwarding/
                             memory effective address generation
  EXECUTE0/MEM0              Instruction execution stage 0/memory access stage 0
  EXECUTE1/MEM1              Instruction execution stage 1/memory access stage 1
  WB                         Write back to registers

Figure 4-2. Pipeline Diagram
[Figure 4-2: simple instructions flow through IFetch; Decode/Reg Read/FFwd; Execute0; Feedforward; and Writeback. Load instructions flow through IFetch; Decode/Reg Read/EA Calc; Memory0; Memory1; and Writeback.]
4.4.1 Description of Pipeline Stages
The IFetch pipeline stage retrieves instructions from the memory system and determines where the next instruction fetch is performed. Up to two 32-bit instructions or four 16-bit instructions are sent from memory to the instruction buffers each cycle.
The decode pipeline stage decodes instructions, reads operands from the register file, and performs dependency checking.
Execution occurs in one or both of the execute pipeline stages in each execution unit (perhaps over multiple cycles). Execution of most load/store instructions is pipelined. The load/store unit has three pipeline stages: effective address calculation (EA Calc), initial memory access (MEM0), and final memory access, data format, and forward (MEM1).
Simple integer instructions complete execution in the Execute 0 stage of the pipeline. Multiply instructions require both the Execute 0 and Execute 1 stages but may be pipelined as well. Most condition-setting instructions complete in the Execute 0 stage of the pipeline, thus conditional branches dependent on a condition-setting instruction may be resolved by an instruction in this stage.
Result feed-forward hardware forwards the result of one instruction into the source operand(s) of a following instruction so that the execution of data-dependent instructions does not wait until the
completion of the result write-back. Feed forward hardware is supplied to allow bypassing of completed instructions from both execute stages into the first execution stage for a subsequent data-dependent instruction.
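The operand-selection priority implied by the feed-forward description can be sketched in software. This is a minimal illustrative model, not the actual bypass network: the function name, argument shapes, and the E0-over-E1 priority (the younger instruction's result wins) are assumptions for clarity.

```python
# Illustrative model of result feed-forwarding between pipeline stages.
# An instruction entering execution may take a source operand from the
# register file, or from a result completing in E0 or E1 of an earlier
# instruction, avoiding a wait for result write-back.

def select_operand(src_reg, regfile, e0_result, e1_result):
    """Pick the newest available value for src_reg.

    e0_result / e1_result are (dest_reg, value) pairs for instructions
    currently in the execute stages, or None if that stage is idle.
    """
    # Newest result wins: E0 (younger instruction) over E1, and both
    # over the (possibly stale) register-file copy.
    if e0_result is not None and e0_result[0] == src_reg:
        return e0_result[1]
    if e1_result is not None and e1_result[0] == src_reg:
        return e1_result[1]
    return regfile[src_reg]

regfile = {1: 10, 2: 20, 3: 0}
# r3 just produced 30 in E0; a dependent instruction reading r3 gets 30
# via the bypass even though the register file still holds the old 0.
assert select_operand(3, regfile, (3, 30), None) == 30
assert select_operand(1, regfile, None, None) == 10
```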
4.4.2 Instruction Prefetch Buffers and Branch Target Buffer
The e200z4 contains an eight-entry instruction prefetch buffer, which supplies instructions into the instruction register (IR) for decoding. Each slot in the prefetch buffer is 32 bits wide, capable of holding a single 32-bit instruction or a pair of 16-bit instructions.
Instruction prefetches request a 64-bit double word, and the prefetch buffer is filled with a pair of
instructions at a time, except for the case of a change of flow fetch where the target is to the second (odd) word. In that case only a 32-bit prefetch is performed to load the instruction prefetch buffer. This 32-bit fetch may be immediately followed by a 64-bit prefetch to fill slots 0 and 1 in the event that the branch is resolved to be taken.
In normal sequential execution, instructions are loaded into the IR from prefetch buffer slots 0 and 1. As a pair of slots is emptied, it is refilled. Whenever a pair of slots is empty, a 64-bit prefetch is initiated, which fills the earliest empty slot pair, beginning with slot 0.
If the instruction prefetch buffer empties, instruction issue stalls, and the buffer is refilled. The first returned instruction is forwarded directly to the IR. Open cycles on the memory bus are utilized to keep the buffer full when possible.
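The pair-at-a-time refill policy described above can be sketched as a toy model. The class, method names, and issue behavior here are illustrative assumptions, not the hardware implementation:

```python
# Toy model of the 8-slot prefetch buffer: a 64-bit prefetch fills a
# pair of adjacent slots; refill targets the earliest empty slot pair.

class PrefetchBuffer:
    def __init__(self):
        self.slots = [None] * 8          # each slot holds one 32-bit word

    def earliest_empty_pair(self):
        # Slot pairs are (0,1), (2,3), (4,5), (6,7).
        for i in range(0, 8, 2):
            if self.slots[i] is None and self.slots[i + 1] is None:
                return i
        return None

    def prefetch64(self, word0, word1):
        # Fill the earliest empty pair with a 64-bit doubleword.
        i = self.earliest_empty_pair()
        if i is not None:
            self.slots[i], self.slots[i + 1] = word0, word1

    def issue_pair(self):
        # Normal sequential issue consumes slots 0 and 1 into the IR.
        pair = self.slots[0], self.slots[1]
        self.slots[0] = self.slots[1] = None
        return pair

buf = PrefetchBuffer()
buf.prefetch64("i0", "i1")
buf.prefetch64("i2", "i3")
assert buf.issue_pair() == ("i0", "i1")
assert buf.earliest_empty_pair() == 0   # slots 0-1 free again for refill
```

The model omits the change-of-flow case (odd-word target, 32-bit fetch) and the slot rotation details, which the text above describes.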
Figure 4-3 shows the instruction prefetch buffers.
Figure 4-3. e200z4 Instruction Prefetch Buffers
To resolve branch instructions and improve the accuracy of branch predictions, the e200z4 implements a dynamic branch prediction mechanism using an 8-entry branch target buffer (BTB).
An entry is allocated in the BTB whenever a branch resolves as taken and the BTB is enabled. Entries in the BTB are allocated on taken branches using a FIFO replacement algorithm.
Each BTB entry holds the branch target address and a 2-bit branch history counter whose value is incremented or decremented on a BTB hit depending on whether the branch was taken. The counter can assume four different values: strongly taken, weakly taken, weakly not taken, and strongly not taken. On initial allocation of an entry to the BTB for a taken branch, the counter is initialized to the weakly-taken state.
A branch will be predicted as taken on a hit in the BTB with a counter value of strongly or weakly taken.
In this case the target address contained in the BTB is used to redirect the instruction fetch stream to the target of the branch prior to the branch reaching the instruction decode stage. In the case of a BTB miss, static prediction is used to predict the outcome of the branch. In the case of a mispredicted branch, the instruction fetch stream will return to the proper instruction stream after the branch has been resolved.
When a branch is predicted taken and the branch is later resolved (in the branch execute stage), the value of the appropriate BTB counter is updated. If a branch whose counter indicates weakly taken is resolved as taken, the counter increments so that the prediction becomes strongly taken. If the branch resolves as not taken, the prediction changes to weakly not-taken. The counter saturates in the strongly taken (or strongly not-taken) state when the prediction remains correct.
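The allocation, prediction, and counter-update rules above can be sketched as a software model. The class, the 0–3 counter encoding (0 = strongly not taken … 3 = strongly taken), and the use of an ordered map for FIFO replacement are illustrative assumptions; the manual specifies only the four counter states, weakly-taken initialization, and FIFO replacement.

```python
# Illustrative model of the 8-entry BTB with FIFO replacement and
# 2-bit saturating branch history counters.
from collections import OrderedDict

class BTB:
    SIZE = 8

    def __init__(self):
        self.entries = OrderedDict()    # branch addr -> [target, counter]

    def predict(self, addr):
        """Return the predicted target on a hit with a 'taken' counter."""
        e = self.entries.get(addr)
        if e is not None and e[1] >= 2:          # weakly/strongly taken
            return e[0]
        return None                              # predicted not taken, or miss

    def resolve(self, addr, target, taken):
        e = self.entries.get(addr)
        if e is not None:
            # Saturating update: increment on taken, decrement otherwise.
            e[1] = min(3, e[1] + 1) if taken else max(0, e[1] - 1)
        elif taken:
            # Allocate on a taken branch, FIFO-evicting the oldest entry;
            # new entries start in the weakly-taken state.
            if len(self.entries) == self.SIZE:
                self.entries.popitem(last=False)
            self.entries[addr] = [target, 2]

btb = BTB()
btb.resolve(0x100, 0x200, taken=True)    # allocated, weakly taken
assert btb.predict(0x100) == 0x200
btb.resolve(0x100, 0x200, taken=False)   # now weakly not taken
assert btb.predict(0x100) is None
```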
The e200z4 does not implement the static branch prediction that is defined by the Power ISA embedded category architecture. The BO prediction bit in branch encodings is ignored.
Dynamic branch prediction is enabled by setting BUCSR[BPEN]. Allocation of branch target buffer entries may be controlled using the BUCSR[BALLOC] field, which determines whether forward or backward branches (or both) are candidates for entry into the BTB, and thus for branch prediction. Once a branch is in the BTB, BUCSR[BALLOC] has no further effect on that branch entry. Clearing BUCSR[BPEN] disables dynamic branch prediction, in which case the e200z4 reverts to a static prediction mechanism.
[Figure 4-3 contents: prefetch buffer slots 0–7, filled 64 bits at a time (DATA[0:63]), feed through a mux into the IR for decode.]
The BTB uses virtual addresses for performing tag comparisons. On allocation of a BTB entry, the effective address of a taken branch, along with the current Instruction Space (as indicated by MSR[IS]) is loaded into the entry and the counter value is set to weakly taken. The current PID value is not maintained as part of the tag information.
The e200z4 supports automatic flushing of the BTB when the current PID value is updated by an mtspr PID0 instruction. Software is otherwise responsible for maintaining coherency in the BTB when the effective-to-real (virtual-to-physical) address mapping changes. This is supported by the BUCSR[BBFI] control bit.
Figure 4-4 shows the branch target buffer.
Figure 4-4. e200z4 Branch Target Buffer
4.4.3 Single-Cycle Instruction Pipeline Operation
Sequences of single-cycle execution instructions follow the flow in Figure 4-5. Instructions are issued and completed in program order. Most arithmetic and logical instructions fall into this category. Instructions may feed-forward results of execution at the end of the E0 or FF stage.
Figure 4-5. Basic Pipeline Flow, Single-Cycle Instructions
[Figure 4-4 contents: eight BTB entries (entry 0 through entry 7), each holding a tag (branch address[0:30] and IS, where IS = instruction space) and data (target address[0:30] and counter).]
[Figure 4-5: instruction pairs (1st/2nd, 3rd/4th, 5th/6th) each flow through IF, DEC, E0, FF, WB, with each pair entering the pipeline one time slot after the previous pair.]
4.4.4 Basic Load and Store Instruction Pipeline Operation
For load and store instructions, the effective address is calculated in the EA Calc stage, and memory is accessed in the MEM0–MEM1 stages. Data selection and alignment is performed in MEM1, and the result is available at the end of MEM1 for the following instruction. If the instruction has a data dependency on the result of a load, there is a single stall cycle. Data will be fed-forward from the preceding load at the end of the MEM1 stage.
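The single load-use stall cycle can be illustrated with a toy cycle counter. The instruction tuples and the `issue_cycles` helper are hypothetical, and the model deliberately ignores every hazard except the one described above:

```python
# Toy cycle counter for the load-use hazard: a single-cycle instruction
# that consumes load data immediately after the load incurs exactly one
# stall cycle, because the data is available only at the end of MEM1.

def issue_cycles(instrs):
    """instrs: list of (kind, dest, srcs) tuples; returns issue cycles.

    Simplified model: one instruction issues per cycle, plus one stall
    when an instruction reads the destination of the immediately
    preceding load.
    """
    cycles = 0
    prev_kind, prev_dest = None, None
    for kind, dest, srcs in instrs:
        cycles += 1
        if prev_kind == "load" and prev_dest in srcs:
            cycles += 1                   # the single load-use stall cycle
        prev_kind, prev_dest = kind, dest
    return cycles

program = [
    ("load", "r3", ("r1",)),      # lwz r3, 0(r1)
    ("add",  "r4", ("r3", "r2")), # add r4, r3, r2 -- depends on r3
]
assert issue_cycles(program) == 3         # 2 issues + 1 stall
```

Scheduling an independent instruction between the load and its consumer hides the stall, as the pipelining discussion above implies.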
Figure 4-6 shows the basic pipeline flow for load/store instructions.
Figure 4-6. Basic Pipeline Flow, Load/Store Instructions
4.4.5 Change-of-Flow Instruction Pipeline Operation
Simple change-of-flow instructions require 2 clock cycles to refill the pipeline with the target instruction; this timing applies to taken branches and branch-and-link instructions with no BTB hit but a correct branch prediction.
Figure 4-7 shows the basic pipeline flow for change-of-flow instructions.
Figure 4-7. Basic Pipeline Flow, Branch Instructions (BTB Miss, Correct Prediction, Branch Taken)
[Figure 4-6: 1st load: IF, DEC/EA, M0, M1, WB; 2nd load: IF, DEC/EA, M0, M1, WB; 3rd single-cycle instruction (data dependent on the load): IF, DEC, stall, E0, FF, WB.]
[Figure 4-7: branch: IF, DEC, (E0), (E1), WB; target instruction: TF, DEC, E0, E1, WB.]
This 2-cycle timing may be reduced for branch-type instructions by performing the target fetch early when the target address can be obtained from the BTB. The resulting branch timing is reduced to a single clock when the target fetch is initiated early enough and the branch is correctly predicted.
Figure 4-8 shows the basic pipeline flow for the reduced timing.
Figure 4-8. Basic Pipeline Flow, Branch Instructions (BTB Hit, Correct Prediction, Branch Taken)
[Figure 4-8: branch: IF, DEC, (E0), (E1), WB; target instruction (BTB hit): TF, DEC, E0, E1, WB.]
For certain cases where the branch is incorrectly predicted, 3 cycles are required for the not-taken branch, which must correct the misprediction outcome. Figure 4-9 shows one example.
Figure 4-9. Basic Pipeline Flow, Branch Instruction (BTB Hit, Predict Taken, Incorrect Prediction)
[Figure 4-9: branch: IF, DEC, (E0), (E1), WB; speculatively fetched target (BTB hit): TF, DEC, then aborted; next sequential instruction: IF, DEC, E0, E1, WB.]
For certain other cases where the branch is incorrectly predicted as taken, a stall cycle is required to correct the misprediction outcome and begin refilling the instruction buffer. Figure 4-10 shows one example.
Figure 4-10. Basic Pipeline Flow, Branch Instructions
(BTB Miss, Predict Taken, Incorrect Prediction, Instruction Buffer Empty)
4.4.6 Basic Multi-Cycle Instruction Pipeline Operation
Most multi-cycle instructions may be pipelined so that the effective execution time is smaller than the overall number of clock cycles spent in execution. The restrictions on this execution overlap are that no data dependencies between the instructions are present and that instructions must complete and write back results in order. A single-cycle instruction that follows a multi-cycle instruction must wait for completion of the multi-cycle instruction prior to its own write-back in order to meet the in-order requirement.
Result feed-forward paths are provided so that execution may continue prior to result write-back.
[Figure 4-10: branch (BTB miss): IF, DEC, (E0), (E1), WB; next instruction: IF, DEC, E0, E1, WB; speculative target fetch: TF, then aborted; next sequential instruction: IF, DEC, E0, E1, WB.]
Figure 4-11 shows the basic pipeline flow for multi-cycle instructions.
Figure 4-11. Basic Pipeline Flow, Multiply Class Instructions
Since load and store instructions calculate the effective address in the DEC stage, any dependency on a previous instruction for EA calculation may stall the load or store in DEC until the result is available.
Figure 4-12 shows the infrequent case of a load instruction dependent on a multiply instruction.
Figure 4-12. Pipeline Flow, Multiply with Data-Dependent Load Instruction
[Figure 4-11: 1st multiply: IF, DEC, E0, E1, WB; 2nd (single-cycle) instruction: IF, DEC, E0, FF, WB; 3rd (data-dependent single-cycle) instruction: IF, DEC, E0, FF, WB.]
[Figure 4-12: 1st multiply: IF, DEC, E0, E1, WB; 2nd (single-cycle) instruction: IF, DEC, E0, FF, WB; 3rd instruction (data-dependent load): IF, DEC, then stalls in DEC until the multiply result is available, then proceeds to WB.]
The divide and load and store multiple instructions require multiple cycles in the execute stage as shown in Figure 4-13.
Figure 4-13. Basic Pipeline Flow, Long Instructions
4.4.7 Additional Examples of Instruction Pipeline Operation for Load and Store
Figure 4-14 shows an example of pipelining a data-dependent add instruction following a load with update instruction. While the first load begins accessing memory in the M0 stage, the next load with update can be calculating a new effective address in the EA Calc stage. Following the EA Calc, the updated base register value can be fed-forward to subsequent instructions, even during the MEM0 or MEM1 stage. The add in this example will not stall, even though a data dependency exists on the updated base register of the load with update.
Figure 4-14. Pipeline Flow, Load/Store Instructions with Base Register Update
[Figure 4-13: long instruction: IF, DEC, E0, E1, …, Elast, WB; next (single-cycle) instruction: IF, DEC, then E0, FF, WB once the long instruction completes execution.]
[Figure 4-14: 1st load: IF, DEC/EA, M0, M1, WB; 2nd load with update: IF, DEC/EA, M0, M1, WB; 3rd single-cycle instruction (data dependent on the EA calculation): IF, DEC, E0, FF, WB, with no stall.]
Figure 4-15 shows an example of pipelining a data-dependent store instruction following a load instruction. The store in this example stalls because its store data depends on the load data of the preceding load instruction.
Figure 4-15. Pipelined Store Instruction with Store Data Dependency
4.4.8 Move To/From SPR Instruction Pipeline Operation
Many mtspr and mfspr instructions are treated like single-cycle instructions in the pipeline and do not cause stalls. The following SPRs are exceptions and do cause stalls:
• MSR
• Debug SPRs
• SPE SPRs (SPEFSCR)
• Cache/MMU SPRs
Figure 4-16–Figure 4-18 show examples of mtspr and mfspr instruction timing.
[Figure 4-15: 1st load: IF, DEC/EA, M0, M1, WB; 2nd load: IF, DEC/EA, M0, M1, WB; 3rd instruction (store, data dependent on the load data): IF, DEC, stall, DEC/EA, M0, M1, WB.]
Figure 4-16 applies to the debug SPRs and SPEFSCR. These instructions do not begin execution until all previous instructions have finished their execute stage(s). In addition, execution of subsequent instructions is stalled until the mfspr and mtspr instructions complete.
Figure 4-16. mtspr, mfspr Instruction Execution, Debug and SPE SPRs
[Figure 4-16: previous instruction: IF, DEC/EA, E0, E1, WB; mtspr/mfspr (debug, SPE): IF, DEC, stall, E0, E1, WB; next instruction: IF, DEC, stall, stall, stall, E0, E1, WB.]
Figure 4-17 applies to the mtmsr instruction and the wrtee and wrteei instructions. Execution of subsequent instructions is stalled until the cycle after these instructions write back.
Figure 4-17. mtmsr, wrtee[i] Instruction Execution
[Figure 4-17: previous instruction: IF, DEC/EA, E0, E1, WB; mtmsr/wrtee/wrteei: IF, DEC, E0, E1, WB; next instruction: IF, DEC, stall, stall, E0, E1, WB.]
Accesses to cache and MMU SPRs are stalled until all outstanding bus accesses have completed on both interfaces and the cache and MMU are idle (p_[d,i]_cmbusy negated), providing an access window in which no translations or cache cycles are required. Other situations, such as a cache linefill, may cause the cache to be busy even when the processor interface is idle (p_[d,i]_tbusy[0]_b negated). In these cases, execution stalls until the cache and MMU are idle, as signaled by negation of p_[d,i]_cmbusy. Processor access requests are held off during execution of a cache/MMU SPR instruction. A subsequent access request may be generated in the cycle following the last execute stage (that is, during the WB cycle). This same protocol applies to cache and MMU management instructions (e.g., icbi, tlbre, tlbwe) as well as the DCRs.
Figure 4-18 shows an example where an outstanding bus access causes mtspr/mfspr execution to be delayed until the bus becomes idle.
Figure 4-18. Cache/DCR, MMU mtspr, mfspr and MMU Management Instruction Execution