Delay slot

In computer architecture, a delay slot is an instruction slot being executed without the effects of a preceding instruction. The most common form is a single arbitrary instruction located immediately after a branch instruction on a RISC or DSP architecture; this instruction will execute even if the preceding branch is taken. This makes the instruction execute out-of-order compared to its location in the original assembler language code.

Modern processor designs generally do not use delay slots, and instead perform ever more complex forms of branch prediction. In these systems, the CPU immediately moves on to what it believes will be the correct side of the branch and thereby eliminates the need for the code to specify some unrelated instruction, which may not always be obvious at compile-time. If the assumption is wrong, and the other side of the branch has to be called, this can introduce a lengthy delay. This occurs rarely enough that the speed up of avoiding the delay slot is easily made up by the smaller number of wrong decisions.

Pipelining
A central processing unit generally performs instructions from the machine code using a four-step process; the instruction is first read from memory, then decoded to understand what needs to be performed, those actions are then executed, and finally, any results are written back to memory. In early designs, each of these stages was performed in series, so that instructions took some multiple of the machine's clock cycle to complete. For instance, in the Zilog Z80, the minimum number of clocks needed to complete an instruction was four, but could be as many as 23 clocks for some (rare) instructions.

At any given stage of the instruction's processing, only one part of the chip is involved. For instance, during the execution stage, typically only the arithmetic logic unit (ALU) is active, while other units, like those that interact with main memory or decode the instruction, are idle. One way to improve the overall performance of a computer is through the use of an instruction pipeline. This adds some additional circuitry to hold the intermediate states of the instruction as it flows through the units. While this does not improve the cycle timing of any single instruction, the idea is to allow a second instruction to use the other CPU sub-units when the previous instruction has moved on.

For instance, while one instruction is using the ALU, the next instruction from the program can be in the decoder, and a third can be fetched from memory. In this assembly line type arrangement, the total number of instructions processed at any time can be improved by up to the number of pipeline stages. In the Z80, for example, a four-stage pipeline could improve overall throughput by four times. However, due to the complexity of the instruction timing, this would not be easy to implement. The much simpler instruction set architecture (ISA) of the MOS 6502 allowed a two-stage pipeline to be included, which gave it performance that was about double that of the Z80 at any given clock speed.

Branching problems
A major issue with the implementation of pipelines in early systems was that instructions had widely varying cycle counts. For instance, the instruction to add two values would often be offered in multiple versions, or opcodes, which varied on where they read in the data. One version of add might take the value found in one processor register and add it to the value in another, another version might add the value found in memory to a register, while another might add the value in one memory location to another memory location. Each of these instructions takes a different amount of bytes to represent it in memory, meaning they take different amounts of time to fetch, may require multiple trips through the memory interface to gather values, etc. This greatly complicates the pipeline logic. One of the goals of the RISC chip design concept was to remove these variants so that the pipeline logic was simplified, which leads to the classic RISC pipeline which completes one instruction every cycle.

However, there is one problem that comes up in pipeline systems that can slow performance. This occurs when the next instruction may change depending on the results of the last. In most systems, this happens when a branch occurs. For instance, consider the following pseudo-code:

top: read a number from memory and store it in a register read another number and store it in a different register add the two numbers into a third register write the result to memory read a number from memory and store it in another register ...

In this case, the program is linear and can be easily pipelined. As soon as the first read instruction has been read and is being decoded, the second read instruction can be read from memory. When the first moves to execute, the add is being read from memory while the second read is decoding, and so forth. Although it still takes the same number of cycles to complete the first read, by the time it is complete the value from the second is ready and the CPU can immediately add them. In a non-pipelined processor the first four instructions will take 16 cycles to complete, in a pipelined one, it takes only five.

Now consider what occurs when a branch is added:

top: read a number from memory and store it in a register read another number and store it in a different register add the two numbers into a third register if the result in the 3rd register is greater than 1000, then go back to top: (if it is not) write the result to memory read a number from memory and store it in another register ...

In this example the outcome of the comparison on line four will cause the "next instruction" to change; sometimes it will be the following write to memory, and sometimes it will be the read from memory at the top. The processor's pipeline will normally have already read the next instruction, the write, by the time the ALU has calculated which path it will take. This is known as a branch hazard. If it has to return to the top, the write instruction has to be discarded and the read instruction read from memory instead. That takes one full instruction cycle, at a minimum, and results in the pipeline being empty for at least one instruction's time. This is known as a "pipeline stall" or "bubble", and, depending on the number of branches in the code, can have a noticeable impact on overall performance.

Branch delay slots
One strategy for dealing with this problem is to use a delay slot, which refers to the instruction slot after any instruction that needs more time to complete. In the examples above, the instruction that requires more time is the branch, which is by far the most common type of delay slot, and these are more commonly referred to as a branch delay slot.

In early implementations, the instruction following the branch would be filled with a no-operation, or NOP, simply to fill out the pipeline to ensure the timing was right such that by the time the NOP had been loaded from memory the branch was complete and the program counter could be updated with the correct value. This simple solution wastes the processing time available. More advanced solutions would instead try to identify another instruction, typically nearby in the code, to place in the delay slot so that useful work would be accomplished.

In the examples above, the read instruction at the end is completely independent, it does not rely on any other information and can be performed at any time. This makes it suitable for placement in the branch delay slot. Normally this would be handled automatically by the assembler program or compiler, which would re-order the instructions:

read a number from memory and store it in a register read another number and store it in a different register add the two numbers into a third register if the result in the 3rd register is greater than 1000, then go back to the top read a number from memory and store it in another register (if it is not) write the result to memory ...

Now when the branch is executing, it goes ahead and performs the next instruction. By the time that instruction is read into the processor and starts to decode, the result of the comparison is ready and the processor can now decide which instruction to read next, the read at the top or the write at the bottom. This prevents any wasted time and keeps the pipeline full at all times.

Finding an instruction to fill the slot can be difficult. The compilers generally have a limited "window" to examine and may not find a suitable instruction in that range of code. Moreover, the instruction cannot rely on any of the data within the branch; if an add instruction takes a previous calculation as one of its inputs, that input cannot be part of the code in a branch that might be taken. Deciding if this is true can be very complex in the presence of register renaming, in which the processor may place data in registers other than what the code specifies without the compiler being aware of this.

Another side effect is that special handling is needed when managing breakpoints on instructions as well as stepping while debugging within the branch delay slot. An interrupt is unable to occur during a branch delay slot and is deferred until after the branch delay slot. Placing branch instruction in the branch delay slot is prohibited or deprecated.

The ideal number of branch delay slots in a particular pipeline implementation is dictated by the number of pipeline stages, the presence of register forwarding, what stage of the pipeline the branch conditions are computed, whether or not a branch target buffer (BTB) is used and many other factors. Software compatibility requirements dictate that an architecture may not change the number of delay slots from one generation to the next. This inevitably requires that newer hardware implementations contain extra hardware to ensure that the architectural behaviour is followed despite no longer being relevant.

Implementations
Branch delay slots are found mainly in DSP architectures and older RISC architectures. MIPS, PA-RISC (delayed or non-delayed branch can be specified), ETRAX CRIS, SuperH (unconditional branch instructions have one delay slot), Am29000, Intel i860 (unconditional branch instructions have one delay slot), MC88000 (delayed or non-delayed branch can be specified), and SPARC are RISC architectures that each have a single branch delay slot; PowerPC, ARM, Alpha, V850, and RISC-V do not have any. DSP architectures that each have a single branch delay slot include μPD77230 and the VS DSP. The SHARC DSP and MIPS-X use a double branch delay slot; such a processor will execute a pair of instructions following a branch instruction before the branch takes effect. Both TMS320C3x and TMS320C4x use a triple branch delay slot. The TMS320C4x has both non-delayed and delayed branches.

The following example shows delayed branches in assembly language for the SHARC DSP including a pair after the RTS instruction. Registers R0 through R9 are cleared to zero in order by number (the register cleared after R6 is R7, not R9). No instruction executes more than once. R0 = 0; CALL fn (DB);     /* call a function, below at label "fn" */ R1 = 0;           /* first delay slot */ R2 = 0;           /* second delay slot */ /***** discontinuity here (the CALL takes effect) *****/

R6 = 0;           /* the CALL/RTS comes back here, not at "R1 = 0" */ JUMP end (DB); R7 = 0;           /* first delay slot */ R8 = 0;           /* second delay slot */ /***** discontinuity here (the JUMP takes effect) *****/

/* next 4 instructions are called from above, as function "fn" */ fn: R3 = 0; RTS (DB);         /* return to caller, past the caller's delay slots */ R4 = 0;           /* first delay slot */ R5 = 0;           /* second delay slot */ /***** discontinuity here (the RTS takes effect) *****/

end: R9 = 0;

Load delay slot
A load delay slot is an instruction which executes immediately after a load (of a register from memory) but does not see, and need not wait for, the result of the load. Load delay slots are very uncommon because load delays are highly unpredictable on modern hardware. A load may be satisfied from RAM or from a cache, and may be slowed by resource contention. Load delays were seen on very early RISC processor designs. The MIPS I ISA (implemented in the R2000 and R3000 microprocessors) suffers from this problem.

The following example is MIPS I assembly code, showing both a load delay slot and a branch delay slot.