Talk:Out-of-order execution

branch prediction
how does out of order execution fit in with branch prediction? some people claim that to have branch prediction there needs to be an out of order execution, since otherwise the next instruction can only be fetched when the branch is done, thus having no benefit from any prediction at all. — Preceding unsigned comment added by 62.155.231.153 (talk) 21:19, 15 June 2011 (UTC)


 * There is a tradition of instruction fetch, instruction decode, instruction execution, instruction retirement, without OoO you can still do those in parallel. So, with branch prediction you can still do fetch and decode ahead of time, and then do execution in order. With core memory, this was pretty important. A big part of the 360/91 is 16-way interleaved memory, but it still took a long time. After the 360/91, the 360/85, first machine with memory cache, was close to the speed on the 360/91 for many programs, as the cache made up for the processing speed. As well as I remember, the 360/91 doesn't do branch prediction, but can prefetch on two possible branch paths. The 360/91 has a special 'loop mode' which keeps a cache of instruction words, such that tight loops can run without fetching instructions from core. Not exactly prediction, but it only works for successful back branches. Gah4 (talk) 18:01, 16 November 2015 (UTC)


 * You can implement OoO execution without branch prediction, but performance would likely be drastically impaired. The purpose of branch prediction is to enlarge the instruction window under execution (the depth of the pipe), the window being defined as the range of instructions in the dynamic instruction stream from the issue point (next instruction to fetch/decode/issue) through the commit point (next instruction to retire). I have long used the rule of thumb that execution rate in an OoO machine could be first order modeled using the approximation, "IPC = min(I, sqrt(W)/L)", where IPC is instructions per clock (throughput), I is the machine's peak issue rate (e.g, in a 4-issue machine, I = 4), W is the mean window size in instructions, and L is the mean latency to execute an instruction once it is "ready" (all input dependencies satisfied) and "dispatched" (sent to an execution unit). L for example is 1 for a single-cycle ALU while it could be 2 or more clocks for a load-to-L1D-cache-hit instruction. Assuming a conditional branch density of 20%, the average window size without branch prediction would be quite small. With branch prediction, we can greatly increase the mean window size. This will obviously have a big impact on performance Mshebanow (talk) 04:17, 31 July 2020 (UTC)


 * the advantage of out-of-order is that multiple "in-flight" instructions and their associated "not-yet-committed" results may sit in internal latches, until such time that the information sufficient to resolve the branch has been fully calculated. once calculated, the relevant "in flight" instructions may be cancelled, and if they have already completed (bear in mind they can complete but are prevented from committing), their results discarded. it's all really quite straightforward. User:lkcl

Queue
Isn't the point of a queue to have something first in first out? FIFO?. If the instruction is fetched and decoded and stuck in a queue, wouldn't that imply some kind of order of instruction? Or is it more like a priority queue? I'm kind of confused on that point, could someone clarify? Epachamo 03:54, 2 March 2006 (UTC)

Conceptually, the queue holds instructions and (once produced) their results. Instructions are kept in order by the queue. When the result of an instruction is computed, it is written to the proper entry in the queue, next to the instruction. When the instruction commits (reaches head of the queue) its result is read from the queue and written to the programmer visible registers. In summary: it is a FIFO discipline with respect to instructions, but not results. Results are written into the queue out of order, but extracted in program order. Schiermer 19:53, 30 March 2006 (UTC)

I suppose that seems right for in-order retirement. I believe the 360/91 does out-of-order retirement, with register renaming to get the data in the right place. But it is hard to do OoO retirement and virtual memory, as you don't know when a page fault might occur. The 360/91 has the dreaded "multiple imprecise interrupt" when more than one interrupt happens before the pipeline has been flushed. The next instruction address doesn't tell you where the fault occurred, either. You can't have a page fault that doesn't identify the faulting instruction and allow it to be retried. Gah4 (talk) 18:09, 16 November 2015 (UTC)

History
I believe there are a few inaccuracies in the history section of the article.

Firstly, the article mentions, 'IBM also introduced the first out-of-order microprocessor, the POWER1 (1990) for its RS/6000.'

This is misleading and has potential for confusion as the POWER1 could only execute floating point instructions out of order, but the article speaks of it in such a way that suggests that it was fully out of order. It was not until the POWER3 was released in 1997 that the POWER series had a fully OoO member.

Secondly, the article says, 'It was the release of the Intel Pentium Pro (1995) which brought the technology to the mainstream.'

This too is also misleading and has potential for confusion because it sounds like that Intel, by using OoO, prompted other companies such as MTI, HP, HAL and DEC design equivalent CPUs. In fact, the Pentium Pro was based on the OoO technique of the PA-8000, which Intel had access to thanks to its alliance with HP. Other companies such as MIPS also had OoO CPUs in development at roughly the same time as HP and Intel. Further more, the article says that the Pentium Pro brought the technology into the mainstream which is inaccurate as the processor was deemed too expensive for the consumer market by Intel late into the design stage and was retargeted towards enterprise markets.

Does anyone have any thoughts or comments? Rilak 17:39, 14 August 2007 (UTC)


 * I just find odd that the claim "it was the release of the Intel Pentium Pro (1995) which brought the technology to the mainstream" when the Power Macs (sp. models 6100/60, 7100/66, and 8100/80) were introduced in March of 1994. These models were based on the PPC601 and the Motorola & IBM white papers I head read in 1992/1993 stated that the 601 operated "fully" out-of-order (or "out-of-queue" as the authors preferred) as opposed to out-of-order FPU execution of the POWER1. (I hope I still have those 15-year-old papers) ▪ NeoAmsterdam TalkEdits 00:10, 18 March 2008 (UTC)


 * From this paper: The Design Space of Register Renaming Techniques by Dezsö Sima of Budapest Polytechnic, published in the September/October 2000 issue of the IEEE Micro, the ES/9000 mainframe of 1992 from IBM was the first processor to implement full out of order execution. I'm not sure of the out of order capabilities of the PowerPC, but I think that they were limited much like the POWER1 and POWER2. The NextGen Nx586 of 1994, released one year before the Pentium Pro could execute fixed point instructions out of order, and the AMD K6 was released in the same year as the Pentium Pro had complete capability. Other designs such as the SPARC64 from HAL and the PowerPC chips of 1995 were also fully out of order. I think with all this, the we can conclude that the Pentium Pro was certainly not innovative at all in regards of OoO and was following the general trend like everyone else. Rilak (talk) 06:22, 20 March 2008 (UTC)


 * I was the person who wrote the sentence "brought the technology to the mainstream". What I meant is that the Pentium Pro device was the first OoO device to be sold in large mass-market volumes, not so much that it caused other design houses to change their microarchitectures. Reading all of the other CPUs listed on this discussion page, I still believe my statement is an accurate assessment. The only other device that you might claim sold in large volumes are the PPCs used in the Macs, but by the early 1990s Apple had perhaps 10 percent(?) of the PC market in total and much of that still 68K based. As a separate point, the Pentium Pro CPU was the one that ended the architecture wars of the '80s and '90s by removing any hope of performance advantage to be provided by the simpler RISC architectures. Dyl (talk) 07:19, 20 March 2008 (UTC)


 * Maybe the statement could be clarified further? The Pentium Pro was not exactly common, it was only used in expensive (relative to PCs of the time) x86 workstations and entry level servers. I'm not sure what you mean by the PPro removing any hope of a performance advantage by simpler RISC architectures, but I think that it is fair to say that most RISC designs of the same period were still faster, to put it bluntly. I think that the only significance of the PPro is that it was part of the trend to translate x86 instructions in simpler micro-ops that could be executed more effectively by a RISC-like or inspired core. Rilak (talk) 19:29, 20 March 2008 (UTC)


 * My belief is that the volume of x86 workstations/entry-level servers is still higher then Apple's 601-based systems. Though I don't have the references to back that up. Perhaps the statement would be less controversial if it said PentiumPro and its descendant PentiumII brought OoO to the masses. I disagree about the performance issue. I was there at Microprocessor Forum '96? when the first PPro performance numbers were announced and the integer numbers were as good as any at the time. Yes, the RISC machines would continue to have better FP numbers for a long while, but for the mass-market that's a non-issue. Dyl (talk) 16:34, 21 March 2008 (UTC)


 * Yes, PPro workstations and servers were far more widespread than PPC systems but isn't that the point? The PPro was not a consumer device, eg. Pentium in a $1,500 box. I would say that the Pentium II was the processor that really brought OoO to the masses but I think it would be better if we simply didn't state what chip did this and that first. Instead by providing the history without any "controversial" words with debatable meanings such as 'mainstream' or 'masses', the article would be a lot more neutral. As for performance, I would say that the PA-8000 ad Alpha beat the P6, but that's completely irrelevant to this article. Rilak (talk) 06:52, 22 March 2008 (UTC)


 * I changed the "mainstream" comment to be less controversial. Since I didn't add the "first" comments, I'll leave that to others to change or defend. Dyl (talk) 23:35, 22 March 2008 (UTC)


 * yes it's definitely a false statement. the CDC 6600 was much earlier and i heard somewhere that the 6502, from analysis of its internal architecture when it was reverse-engineered finally a few years ago, was found to be a superscalar microarchitecture and may have been out-of-order, i can't quite remember the exact details.  the 6600 however is definitely waay before whatever intel came up with. User:lkcl

Regarding POWER1 could only execute floating point instructions out of order,: For scientific programming, floating point is the dominant time use, and most fixed point instructions are fast enough that in-order is fine. The design of the CDC 6600, 7600, and IBM 360/91 is mostly to speed up floating point calculations. Gah4 (talk) 02:55, 24 September 2016 (UTC)

Lynn Conway
During the development of ACS-1, Lynn Conway invented Dynamic Instruction Scheduling. Her paper can be found here: http://ai.eecs.umich.edu/people/conway/ACS/DIS/DynInstSched-paper.html. This is likely the most importent development in out-of-order execution, but is nowhere mentioned in the article.

Yale Patt seems to be getting to much credit. As he mostly plagiarized Lynn's work. This is mentioned here: http://ai.eecs.umich.edu/people/conway/Retrospective2.html#anchor305182 — Preceding unsigned comment added by 145.120.12.69 (talk) 22:49, 6 February 2012 (UTC)
 * It's an interesting debate. The problem is that Lynn (who wasn't named Lynn at the time, which makes some things more tricky in some cases) published internally only (as I understand it).   Obviously Tomasulo is the one who gets the base credit--and it's well deserved--even though his algorithm was really only used for floating point scheduling.  I think Lynn's work deserves credit, but needs some independent reliable sourcing (WP:V) about its impact.   The links above probably don't cut it.  And I very much doubt the movement of the idea from Lynn to Yale will able to be shown to have occurred one way or the other.
 * Out of curiosity, does Dr. Conway's paper address renaming? If not, I'm not really seeing how this is an improvement over scoreboarding.  In fact, after a quick scan, I'm not seeing how name dependencies are resolved/dealt with at all.  I assuming I'm missing something obvious.  While a bit off topic for Wikipedia, I'd love to know what I'm missing.  Back to work! Hobit (talk) 01:06, 8 February 2012 (UTC)
 * Reading the ACS disclosure now, I can definitely state than in 1984 when initial concepts of HPS were fleshed out, we were not aware of that paper. So, I think the statement, "As he mostly plagiarized Lynn's work," is not correct. (For the record, Wen-mei Hwu and myself started working on HPS in the summer of that year while we were interns at DEC's ERL (Eastern Research Lab) in Hudson, Mass.) Mshebanow (talk) 03:56, 31 July 2020 (UTC)
 * Does this change anything? IBM seems to think that they capitalized on her work after firing her:
 * Dario Gil, Director of IBM Research, who revealed the award during the online event, says, "Lynn was recently awarded the rare IBM Lifetime Achievement Award, given to individuals who have changed the world through technology inventions. Lynn's extraordinary technical achievements helped define the modern computing industry. She paved the way for how we design and make computing chips today — and forever changed microelectronics, devices, and people's lives."
 * The company also acknowledged that after Conway’s departure in 1968, her research aided its own success. “In 1965 Lynn created the architectural level Advanced Computing System-1 simulator and invented a method that led to the development of a superscalar computer. This dynamic instruction scheduling invention was later used in computer chips, greatly improving their performance,” a spokesperson stated.
 * That certainly sounds to me like IBM credited her dynamic instruction scheduling invention for their success after they fired her, and gives Wikipedia editors a reference to look into for documenting her contribution.CherylJosie (talk) 16:35, 1 April 2024 (UTC)
 * Dario Gil, Director of IBM Research, who revealed the award during the online event, says, "Lynn was recently awarded the rare IBM Lifetime Achievement Award, given to individuals who have changed the world through technology inventions. Lynn's extraordinary technical achievements helped define the modern computing industry. She paved the way for how we design and make computing chips today — and forever changed microelectronics, devices, and people's lives."
 * The company also acknowledged that after Conway’s departure in 1968, her research aided its own success. “In 1965 Lynn created the architectural level Advanced Computing System-1 simulator and invented a method that led to the development of a superscalar computer. This dynamic instruction scheduling invention was later used in computer chips, greatly improving their performance,” a spokesperson stated.
 * That certainly sounds to me like IBM credited her dynamic instruction scheduling invention for their success after they fired her, and gives Wikipedia editors a reference to look into for documenting her contribution.CherylJosie (talk) 16:35, 1 April 2024 (UTC)
 * That certainly sounds to me like IBM credited her dynamic instruction scheduling invention for their success after they fired her, and gives Wikipedia editors a reference to look into for documenting her contribution.CherylJosie (talk) 16:35, 1 April 2024 (UTC)
 * That certainly sounds to me like IBM credited her dynamic instruction scheduling invention for their success after they fired her, and gives Wikipedia editors a reference to look into for documenting her contribution.CherylJosie (talk) 16:35, 1 April 2024 (UTC)

Merge from Decoupled architecture
Decoupled architecture → Out-of-order execution

✅ ~Kvng (talk) 04:59, 14 February 2016 (UTC)
 * Support looks like the contents of Decoupled architecture can be effectively merged into Out-of-order_execution. ~KvnG 00:00, 3 February 2015 (UTC)
 * Support the Decoupled architecture is too complicated for some readers and is very short. I think this should be merged. -- ☣  Anar chyte  ☣ 02:53, 22 May 2015 (UTC)
 * Support I think I understand Decoupled architecture, but it seems to be saying that "decoupled architecture" is another way of saying "an architecture with out-of-order execution". If that isn't the case, then it should be rewritten in a way that explains the distinction.--Wcoole (talk) 19:24, 30 October 2015 (UTC)

Static vs. Dynamic Scheduling Superscalar
I'm a bit muddled on the distinction between Static and Dynamic Scheduling and they overlap with OOOE, VLIW, and Superscalar. I think I understand, in a general way, the technology involved, but the terminology seems to be a bit variable. The Conway paper uses "Dynamic Instruction Scheduling", but describes something I recognize as OOOE. http://foldoc.org/Very%20Long%20Instruction%20Word says that VLIW is equivalent to "staticly scheduled superscalar". It seems as though "static" in this context means "in-order" and "dynamic" means "out-of-order", but I haven't found a reference that states this clearly. I also haven't found a source that contradicts this interpretation.

I'd be very happy if someone more conversant with this area were to add to an appropriate article, explaining these terms.--Wcoole (talk) 19:38, 30 October 2015 (UTC)


 * It isn't so obvious. Seems to me that the idea of dynamic is that it is determined during execution. For example, it could change from run to run, as memory access, and especially effects from other tasks running, could be unpredictable.
 * Note that Itanium is listed as not OoO. Instead, Itanium depends on the compiler to order things in the optimal order. I would say, echnically, static should refer to fixed order, not necessarily in-order. Gah4 (talk) 02:30, 24 September 2016 (UTC)

Basic concept
The Basic concept section seem to suggest that the steps (fetch, decode, execute, retire) happen sequentially for in-order. As well as I know, they can overlap as long as execute is in-order.

Also, for OoO, the section seems to imply in-order retirement. As far as I know, all current processors do that, as it is almost necessary for dynamic memory. The 360/91, as well as I know, allows OoO retirement, consistent with register renaming. With OoO retirement, it is necessary to flush all pipelines before an interrupt can be taken. Gah4 (talk) 02:35, 24 September 2016 (UTC)


 * Also, in explaining in order, it says: the instruction is dispatched to the appropriate functional unit. I suppose this is right, but the reason is that such processors tend to have only one functional unit.  The idea of more functional units is to allow more parallelism, which increases the need for OoO. Gah4 (talk) 15:10, 8 January 2018 (UTC)


 * FWIW, we always defined six major events in the life of an instruction:
 * Fetch - an "FPC" (fetch program counter) is used to index memory to get the next instruction to execute. That instruction is read and brought into the machine.
 * Decode - the instruction is decoded and classified along with a determination of all operands (to be read and/or written).
 * Issue - the instruction is staged for execution. Staging means it is made available for the execution core.
 * Dispatch - the instruction is destaged and sent to an execution unit for actual execution.
 * Complete - the instruction completes execution and its result is broadcast to other (potentially) waiting, staged instructions.
 * Commit (aka retire) - the instruction's side effects are made (permanently) visible outside the machine.
 * (Note there is controversy on the definitions of Dispatch versus Issue). I would define that in an in-order machine, at the very least, Fetch/Decode/Issue/Dispatch and Commit must occur in program order; Complete can occur out of program order so long as side effects from completion are not exposed outside the machine (in order retirement/commit). Whereas in an OoO machine, only Fetch/Decode/Issue and Commit must occur in order; Dispatch and Complete can occur out of program order so long as operand dependency hazards are honored. Mshebanow (talk) 04:31, 31 July 2020 (UTC)
 * I suspect that all six are commonly only distinguished in fancier processors, which do some overlap. Also, on simpler processors those might not all be really separate. Early computers published instruction timings, which were often a requirement when bidding for sales. As processor got more parallel, it wasn't possible to give individual times. That is where the distinctions above become interesting. Early parallel processors used out-of-order retirement (or commit), but that is a lost art. Mostly virtual addressing complicated things. If an interrupt, especially an exceptional condition in the executing instructions, occurs, the pipeline must be flushed. Translation interrupts can't occur during that time. The IBM 360/91, a popular example in many books on pipelined processors, allows for out-of-order retirement.  (There isn't enough state to keep it all in.)  One result is the dreaded multiple imprecise interrupt.  The IBM PL/I compilers generate a message saying where a problem occurred.  On the 360/91, the message says "near" instead. Fetch is important, as that affects self-modifying code. Gah4 (talk) 07:11, 31 July 2020 (UTC)
 * there is quite a lot of confusion, here. in-order is by definition not allowed to do *anything* out of order. its sole exclusive option on detecting a hazard is to stall. Reg renaming has quite a significant impact in preventing stalls which otherwise occur quite a lot. but you *cannot* do out of order *anything* in an in-order system.  even trying to do early-out from pipelines in an in-order system is considered extremely complex and risky.  i outline below how Shadowing 100% solves "imprecision". User:lkcl  — Preceding unsigned comment added by 92.40.200.150 (talk) 03:00, 21 May 2021 (UTC)
 * Mshebanow: no, commit does not *have* to be in order. this is a fallacy based on the assumption that the Tomasulo Algorithm is the sole exclusive "Precise" OoO algorithm, which is documented as requiring that commit occurs in-order. The sole reason why is because the Tomasulo ROB is based around sequential numbering.  A precise-augmented Shadow-capable 6600 Algorithm does NOT demand in-order commit.  Any result that has no "Shadow" over it is, by definition, already 100% clear of all exceptions and other "damage" opportunities: consequently the order in which those commits are actioned is *completely immaterial*. User:lkcl
 * Put for ~'s at the end, to automatically sign your note. Though SineBot should eventually find it.
 * The complication is always exceptions, and especially precise interrupts. The 360/91 has imprecise interrupts as they might occur on instructions after later ones were retired. In the 1960's, it was believed that Fortran COMMON could not have padding, and there was a fix-up in the case of alignment exceptions. The data is copied somewhere else, the operation performed, and copied back (if needed). That assumes a precise interrupt that can identify the failing instruction. The interrupt is not precise on the 360/91, so that can't be used.  Next there is multiple imprecise interrupts. After the first one, and before the pipeline is empty, more exceptions might be detected. All are reported. The 360/91 doesn't have enough internal storage to keep enough state to avoid such. By the way, the first use of IC memory on a production processor is the protection keys on the 360/91. Four bits for every 2K of core memory, built from 16 bit SRAM chips.  Gah4 (talk) 05:00, 21 May 2021 (UTC)
 * Mshebanow: no, commit does not *have* to be in order. this is a fallacy based on the assumption that the Tomasulo Algorithm is the sole exclusive "Precise" OoO algorithm, which is documented as requiring that commit occurs in-order. The sole reason why is because the Tomasulo ROB is based around sequential numbering.  A precise-augmented Shadow-capable 6600 Algorithm does NOT demand in-order commit.  Any result that has no "Shadow" over it is, by definition, already 100% clear of all exceptions and other "damage" opportunities: consequently the order in which those commits are actioned is *completely immaterial*. User:lkcl
 * Put for ~'s at the end, to automatically sign your note. Though SineBot should eventually find it.
 * The complication is always exceptions, and especially precise interrupts. The 360/91 has imprecise interrupts as they might occur on instructions after later ones were retired. In the 1960's, it was believed that Fortran COMMON could not have padding, and there was a fix-up in the case of alignment exceptions. The data is copied somewhere else, the operation performed, and copied back (if needed). That assumes a precise interrupt that can identify the failing instruction. The interrupt is not precise on the 360/91, so that can't be used.  Next there is multiple imprecise interrupts. After the first one, and before the pipeline is empty, more exceptions might be detected. All are reported. The 360/91 doesn't have enough internal storage to keep enough state to avoid such. By the way, the first use of IC memory on a production processor is the protection keys on the 360/91. Four bits for every 2K of core memory, built from 16 bit SRAM chips.  Gah4 (talk) 05:00, 21 May 2021 (UTC)
 * The complication is always exceptions, and especially precise interrupts. The 360/91 has imprecise interrupts as they might occur on instructions after later ones were retired. In the 1960's, it was believed that Fortran COMMON could not have padding, and there was a fix-up in the case of alignment exceptions. The data is copied somewhere else, the operation performed, and copied back (if needed). That assumes a precise interrupt that can identify the failing instruction. The interrupt is not precise on the 360/91, so that can't be used.  Next there is multiple imprecise interrupts. After the first one, and before the pipeline is empty, more exceptions might be detected. All are reported. The 360/91 doesn't have enough internal storage to keep enough state to avoid such. By the way, the first use of IC memory on a production processor is the protection keys on the 360/91. Four bits for every 2K of core memory, built from 16 bit SRAM chips.  Gah4 (talk) 05:00, 21 May 2021 (UTC)

Pipelining
There isn't much discussion on pipelining and parallel execution. The big advantages of OoO come when you can execute more than one instruction at a time, and especially if you pipeline them. There is a small advantage in memory access without execution overlap, but not the big advantage possible. Gah4 (talk) 03:12, 24 September 2016 (UTC)

although out-of-order execution was limited to floating-point instructions
I am wondering about the meaning of:  although out-of-order execution was limited to floating-point instructions. Does that mean that if you erase all the floating point instructions, those that are left are in-order? If you exchange the order of a floating point and fixed point instruction, which one is out of order? But yes, all the OoO hardware on the 360/91, such as register renaming and reservation stations is for floating point, but that allows faster fixed-point operations to finish earlier. Gah4 (talk) 15:31, 10 April 2017 (UTC)
 * The only part of that machine that could execute instructions out of order was the floating point processor. I don't recall how interactions between the integer side and the fp side were handled. Hobit (talk) 18:23, 10 April 2017 (UTC)


 * OK, but this is pretty fundamental to the meaning of OoO execution, the subject here. Many processors overlap fetch, decoding, and retirement with execution, but still do execution in order.  For the 8086, you can see this by modifying instructions after fetch, where the processor does not detect the change.  That is part of the 8086 model.  The 360/91 does detect such modification, because S/360 allows for self modifying code. Since fixed point is used for computing addresses, you pretty much have to be able to execute fixed point instructions mixed in with floating point instructions. Next, consider the 8086/8087 where the 8087 executes concurrently with the 8086.  (A WAIT instruction is required to resynchronize the instruction stream.)  This does not, as far as I know, count as OoO, as fixed point instructions are executed in order, and floating point instructions are in order, even though the two instruction streams can overlap.  It seems to me that the article needs to make this distinction clear.  Gah4 (talk) 20:16, 10 April 2017 (UTC)
 * It would be good to clarify, I agree. I even knew enough details of this processor at one point I could have clarified, but I don't right now.  If you have solid insight (and ideally a source) I'd suggest you tackle it in the article. Hobit (talk) 22:02, 10 April 2017 (UTC)

meltdown versus spectre
The revision on 19:45, 4 January 2018‎, has added information about the Meltdown (security vulnerability) stating that the vulnerability takes advantage of out-of-order execution. This statement is actually incorrect in that it was taking advantage of missing privilege checks that occurred during the execution of Tomasulo algorithm across a security boundary (ring levels) and not OoOE directly. This vulnerability also was only exploitable against Intel's implementation of their architecture (and not AMD which actually performed the proper checks).

I believe the author intended to mention Spectre (security vulnerability) (speculative execution) which is an attack on Robert Tomasulo's Tomasulo algorithm and directly taking advantage of an oversight of out-of-order execution. Despite the Meltdown (security vulnerability) being the more practical vulnerability (and thus the more public one), Spectre (security vulnerability) is the vulnerability that directly attacks the speculative execution provided by Tomasulo algorithm. I've updated the page to reference this. 2605:6000:E890:B100:CD1D:1DAC:CAE2:5F48 (talk) 20:00, 31 January 2018 (UTC)


 * Even more, it seems to me that it is speculative execution, and not OoO, that creates the problem. The 360/91, where Tomasulo algorithm came from, does OoO but not speculative execution.  It does speculative prefetch (on two additional paths), though.  Also, with no cache, cache related effects don't occur. But otherwise, it seems to me that this is big enough for an article of its own, mentioned from this page.  Gah4 (talk) 21:47, 31 January 2018 (UTC)


 * Agreed. It seems that Speculative_execution has this written to completion. Should we just remove the mentioning of these attacks and perhaps just replace them with a link to that article? 2605:6000:E890:B100:CD1D:1DAC:CAE2:5F48 (talk) —Preceding undated comment added 18:24, 9 February 2018 (UTC)

Basic concept - what about writing results back to memory
I feel that a major piece is missing from the description of the basic concept. The described instruction cycle stops after writing operation results back to register file. But at some point some results must be stored in memory (the so-called side effects of the instructions, observable from the outside).

In a very simple architecture, you could have all operations on register and have separate instructions to store registers in memory. Even this simplified case is absent in the article. I think however in most CPUs: One instruction can fetch operands from memory, operate on it, and store the result in memory. This has effects on the OoO instruction cycle, and for completeness, this should be described, as well as any observable effects of OoO on the memory data flow. Basically: Will memory reads and writes always occur in "program order" or not?

(Pardon me for not being updated on all the correct terminology. Although memory in practice is cached, I assume here a simple memory model without caching or virtual addressing, to simplify this discussion.)

1. Data dependency in memory. In the stage "Instruction dispatch to an instruction queue", if this instruction needs operands from memory, the logic must discover if an earlier queued instruction will update the same memory. If so, this instruction must (in principle) stall until the previous instruction has finished the memory update cycle, in addition to waiting for the register file to be updated. Here I don't know the tech details: theoretically there is an obvious possibility to bypass the memory here and just pass the data from instruction A to B via internal register instead, in particular if instruction B will also update the memory cell.

2. An instruction that writes to memory is not finished when the register file is updated, it is finished when the write-to-memory cycle is completed. (Memory is most likely cache memory.)

3. The question, as already stated: Will the CPU guarantee that all observable side effects (memory reads and writes) occur in "program order" so that the (assembly-level) programmer of an OoO CPU does not have to worry about it in terms of program correctness? (Performance optimization is another thing.)

3a. Only considering one CPU vs memory.

3b. Including exceptions and interrupts, which is however already discussed in the article and the talk page.

3c. With multiple processing units sharing one memory, or CPU + peripheral. If the answer to 3a is yes, then it appears to me that there are also no OoO-induced synchronization issues when using multiple CPUs. Am I wrong? Probably.

Note, related to 3c: I know that there are many, many issues with multiprocessing, but how much does CPU-level OoO contribute to this, this is a part of what I'm trying to get at. Perhaps most or all of the trouble comes from "external" OoO execution caused by compiler optimizations that actually modify the program flow seen by the CPU, as well as cache coherency issues? I'm posting a separate talk subject about the need to distinguish between internal and external OoO.

I apologize for the length of this talk. This looks like a bunch of questions. I know this is not the Stack Overflow forum. My point is that the article should clarify these things in order to address some of these questions. 195.60.183.2 (talk) 12:34, 25 January 2021 (UTC) (Anders Hallström non-registered user)
 * As well as I know it, all processors now do in-order retirement. Data is written to memory in order, such that any canceled or otherwise not executed instructions don't have any effect. This is especially true with virtual storage, where there could be a page fault on write. If the processor verifies that can't happen, I suppose it could go ahead, but then there is the problem of other interrupts, especially exceptions. The 360/91, and I believe contemporary CDC processors, didn't have virtual memory and could (and did) do OoO retirement. Otherwise, it must be that anything OoO is consistent all the way with in-order results, unless exceptions occur. The 360/91 has the well-known multiple imprecise interrupt. Consider that a floating point overflow occurs in an instruction in which following instructions have already been executed. The saved PSW doesn't point to the following instruction, so fix-up can't be done. Even more, other exceptions might follow, as for example underflow or divide by zero. There is no way to consistently flush the pipeline and handle the interrupt(s). With in-order retirement, you only have to be sure that the stored PSW has the right address to continue. Gah4 (talk) 14:29, 31 January 2021 (UTC)
 * Gah4, what you are referring to here ("all processors now do in-order memory retirement") is known as "Total Store Order" (TSO) and it is done by x86 but NOT necessarily by ALL modern architectures. RISC-V implements something called "Weak Memory Order" where as far as SMP is concerned and as far as programs are concerned, it all *looks* like it occurred in the right order but actually wasn't if you track the actual bus reads/writes. You have to use atomic operations on critical data structures but that's par for the course, snd is a hell of a lot less complicated to implement than TSO. Mitch Alsup explained to me on comp.arch and in his book chapters extending Thornton's original work that as long as you batch-commit sequences of reads before writes, and batxh-commit writes before reads, it's all good.  i.e. *batches* have to be in-order, but one read in amongst a batch of reads (or one write in amongst a batch of writes) you actually don't care *unless* it is IO memory or an atomic operation. in other words, surpriiise, you can actually apply the exact same RaW-WaR 6600 scoreboard algorithm *to memory operations* and it all works. User:lkcl
 * The 360/91 does out-of-order retirement. Not so hard with only real addresses, but much harder with virtual addresses. As far as I know, all processors now do in-order retirement. That is, an instruction is not forgotten until all previous instructions are forgotten (retired). The result, for the 360/91, in the dreaded multiple imprecise interrupt. After a program exception, the pipeline is emptied before the interrupt is taken. (There isn't much choice.) Instructions might have written to memory, for example, and yes it might be visible to I/O devices. (Consider that S/360 allows modification of running channel programs.) More interrupts might happen while emptying the pipeline. In the case of imprecice interrupts, the actual address of the instruction causing it is not available, which makes virtual memory difficult. Note also trial execution for instructions like TR on S/370, to avoid problems with page faults in the middle of instructions. And note that it is not only memory, but reigsters, too.  Though the 91 does have register renaming, but still, registers can change. Gah4 (talk) 00:45, 21 May 2021 (UTC)
 * Mitch Alsup also went to the trouble of documenting Shadowing and explaining it to me. Shadowing is how you deal with interrupts, predication, speculative execution, exceptions, branch prediction. It's Very Simple. you hook into the write commit and prevent write until you are absolutely 100% certain it and all prior dependent instructions can succeed. note very carefully those conditions. hence the "shadow". see this diagram https://libre-soc.org/3d_gpu/shadow.svg the shadow you can see hooks into write, preventing it. "success" clears the shadow and allows write. fail CANCELS the entire operation AND all downstream instructions. this is where you are conflating "in order". the result COMPUTATION can be OoO. the result WRITE can be OoO (only when no shadows fall across it). but a Cancel MUST cancel the instruction AND all instructions issued after it, whete it is the Shadow Matrix's job to record exactly that ordering. quite simple and elegant really, but just not part of the widespread academic literature thanks to Patterson's inaccurate evaluation being taken as law. User:lkcl  — Preceding unsigned comment added by 92.40.200.149 (talk) 02:39, 21 May 2021 (UTC)
 * The 360/91 does out-of-order retirement. Not so hard with only real addresses, but much harder with virtual addresses. As far as I know, all processors now do in-order retirement. That is, an instruction is not forgotten until all previous instructions are forgotten (retired). The result, for the 360/91, in the dreaded multiple imprecise interrupt. After a program exception, the pipeline is emptied before the interrupt is taken. (There isn't much choice.) Instructions might have written to memory, for example, and yes it might be visible to I/O devices. (Consider that S/360 allows modification of running channel programs.) More interrupts might happen while emptying the pipeline. In the case of imprecice interrupts, the actual address of the instruction causing it is not available, which makes virtual memory difficult. Note also trial execution for instructions like TR on S/370, to avoid problems with page faults in the middle of instructions. And note that it is not only memory, but reigsters, too.  Though the 91 does have register renaming, but still, registers can change. Gah4 (talk) 00:45, 21 May 2021 (UTC)
 * Mitch Alsup also went to the trouble of documenting Shadowing and explaining it to me. Shadowing is how you deal with interrupts, predication, speculative execution, exceptions, branch prediction. It's Very Simple. you hook into the write commit and prevent write until you are absolutely 100% certain it and all prior dependent instructions can succeed. note very carefully those conditions. hence the "shadow". see this diagram https://libre-soc.org/3d_gpu/shadow.svg the shadow you can see hooks into write, preventing it. "success" clears the shadow and allows write. fail CANCELS the entire operation AND all downstream instructions. this is where you are conflating "in order". the result COMPUTATION can be OoO. the result WRITE can be OoO (only when no shadows fall across it). but a Cancel MUST cancel the instruction AND all instructions issued after it, whete it is the Shadow Matrix's job to record exactly that ordering. quite simple and elegant really, but just not part of the widespread academic literature thanks to Patterson's inaccurate evaluation being taken as law. User:lkcl  — Preceding unsigned comment added by 92.40.200.149 (talk) 02:39, 21 May 2021 (UTC)
 * Mitch Alsup also went to the trouble of documenting Shadowing and explaining it to me. Shadowing is how you deal with interrupts, predication, speculative execution, exceptions, branch prediction. It's Very Simple. you hook into the write commit and prevent write until you are absolutely 100% certain it and all prior dependent instructions can succeed. note very carefully those conditions. hence the "shadow". see this diagram https://libre-soc.org/3d_gpu/shadow.svg the shadow you can see hooks into write, preventing it. "success" clears the shadow and allows write. fail CANCELS the entire operation AND all downstream instructions. this is where you are conflating "in order". the result COMPUTATION can be OoO. the result WRITE can be OoO (only when no shadows fall across it). but a Cancel MUST cancel the instruction AND all instructions issued after it, whete it is the Shadow Matrix's job to record exactly that ordering. quite simple and elegant really, but just not part of the widespread academic literature thanks to Patterson's inaccurate evaluation being taken as law. User:lkcl  — Preceding unsigned comment added by 92.40.200.149 (talk) 02:39, 21 May 2021 (UTC)
 * Mitch Alsup also went to the trouble of documenting Shadowing and explaining it to me. Shadowing is how you deal with interrupts, predication, speculative execution, exceptions, branch prediction. It's Very Simple. you hook into the write commit and prevent write until you are absolutely 100% certain it and all prior dependent instructions can succeed. note very carefully those conditions. hence the "shadow". see this diagram https://libre-soc.org/3d_gpu/shadow.svg the shadow you can see hooks into write, preventing it. "success" clears the shadow and allows write. fail CANCELS the entire operation AND all downstream instructions. this is where you are conflating "in order". the result COMPUTATION can be OoO. the result WRITE can be OoO (only when no shadows fall across it). but a Cancel MUST cancel the instruction AND all instructions issued after it, whete it is the Shadow Matrix's job to record exactly that ordering. quite simple and elegant really, but just not part of the widespread academic literature thanks to Patterson's inaccurate evaluation being taken as law. User:lkcl  — Preceding unsigned comment added by 92.40.200.149 (talk) 02:39, 21 May 2021 (UTC)

Internal vs external OoO
Slightly related to my previous talk about memory writes where I also brought this up.

I feel there is a need to distinguish between internal out-of-order (OoO) execution (occurring inside a processing unit) and external out-of-order execution (caused for example by compiler optimizations that modify the "program flow" seen by the processing unit). Another kind of external OoO is OoO data flow in memory caused by caching effects in a multi-processing environment. The article as it stands only discusses "internal" OoO, while the article on Memory barrier refers to this one, but seemingly in the context of external OoO causing the need for a memory barrier. I feel there is an information gap here that needs filling. 195.60.183.2 (talk) 13:00, 25 January 2021 (UTC) (Anders Hallström)
 * A lot of questions. Somehow it has to work. A favorite for books on pipelined processors is the IBM 360/91, which pioneered much that was later used in other processors. The 91 is before cache, and also no virtual addressing. OK, for one, the 360/91 has 16 way interleaved 64 bit wide memory. And yes, it tests the write buffer registers on memory read, and so bypasses the writing. Many parts of the 360/91 design were needed, as it was required to run programs written for any S/360 processor. Among others, it has to detect self-modifying code, modifying instructions already fetched. The 8086 defines that modifying already fetched doesn't update the prefetch cache. The prefetch cache was changed in the 8088, and that is used to detect which processor is in use. In any case, for machines with cache, it is only needed to test against what is in the cache. Gah4 (talk) 13:11, 25 January 2021 (UTC)
 * I'm sorry, I was a bit wrong and I let several days pass before writing it. The article on "Memory barriers" does in fact refer to "internal OoO", but implies that this causes "reordering of memory operations". This may be true, but my point is, this point is not discussed at all in the referred OoOE article, i.e. this one. The effect on memory operations should be included in this article, to fill the gap. (See my other talk subject about that.) (If there is indeed no such effect because the queues are fully synchronized internally before memory operations take place, then clarify that, and in that case the "Memory barriers" article needs to be corrected...) 195.60.183.2 (talk) 23:48, 30 January 2021 (UTC) (Anders H)
 * OK, so external like Itanium. As well as I know, one reason Itanium didn't do so well is that compiler technology did not keep up. Memory barriers might still be needed, for example in the case of I/O. A single thread must do everything consistently, including memory access. But in the case of multi-thread, or I/O, it gets more complicated. Gah4 (talk) 14:39, 31 January 2021 (UTC)
 * my feeling is that what you are referring to is simply a job carried out by compilers. there is nothing special about it: it's just what compilers (and assembly writers) have to do. they do have to know in some cases the internal design of the architecture, wbich to be honest and frank, is almost always a damn nuisance, and costs time and money.  and, worse, you end up tying the compiler specifically to a machine. upgrade the internal architecture for a new revision of a processor, the old compiler and all hand-written assembler is now worthless.  this is why good general purpose processors don't try this trick that you've named "external" OoO. there is however one exception: The Mill.  The Mill is deeply impressive in that it is utterly immune to all forms of speculative attacks, because all instructions are pre-scheduled by the compiler on fixed (known) length pipelines and internal "belts".  i don't think it justifies putting into this page, however a separate page on compiler static scheduling or a special section on an existing compiler page then crosslinking here would be a good idea. User:lkcl
 * Please sign your posts with ~.
 * the 360/91 was designed to execute code written for any S/360 model, with the small exception of imprecise interrupts. Among other features, it detected and properly processed self-modifying code. If a write went to an address already prefetched, it would do what it needed to do. There are some complications related to modifying running channel programs, where out of order stores might cause problems. But the idea of Itanium was that it could all be figured out at compile time. The instruction word is 128 bits wide, which can specify three instructions each 41 bits wide. Then, six instructions can be executed per clock cycle. But all the dependencies have to be figured out by the compiler. The extra five bits (128-3*41) help to the processor about dependencies. There are 128 registers, so register renaming isn't needed. (It was for the 360/91, as S/360 has only four floating point registers.) It seems, though, that compiler technology wasn't up to the task. Gah4 (talk) 09:18, 23 May 2021 (UTC)
 * my feeling is that what you are referring to is simply a job carried out by compilers. there is nothing special about it: it's just what compilers (and assembly writers) have to do. they do have to know in some cases the internal design of the architecture, wbich to be honest and frank, is almost always a damn nuisance, and costs time and money.  and, worse, you end up tying the compiler specifically to a machine. upgrade the internal architecture for a new revision of a processor, the old compiler and all hand-written assembler is now worthless.  this is why good general purpose processors don't try this trick that you've named "external" OoO. there is however one exception: The Mill.  The Mill is deeply impressive in that it is utterly immune to all forms of speculative attacks, because all instructions are pre-scheduled by the compiler on fixed (known) length pipelines and internal "belts".  i don't think it justifies putting into this page, however a separate page on compiler static scheduling or a special section on an existing compiler page then crosslinking here would be a good idea. User:lkcl
 * Please sign your posts with ~.
 * the 360/91 was designed to execute code written for any S/360 model, with the small exception of imprecise interrupts. Among other features, it detected and properly processed self-modifying code. If a write went to an address already prefetched, it would do what it needed to do. There are some complications related to modifying running channel programs, where out of order stores might cause problems. But the idea of Itanium was that it could all be figured out at compile time. The instruction word is 128 bits wide, which can specify three instructions each 41 bits wide. Then, six instructions can be executed per clock cycle. But all the dependencies have to be figured out by the compiler. The extra five bits (128-3*41) help to the processor about dependencies. There are 128 registers, so register renaming isn't needed. (It was for the 360/91, as S/360 has only four floating point registers.) It seems, though, that compiler technology wasn't up to the task. Gah4 (talk) 09:18, 23 May 2021 (UTC)
 * Please sign your posts with ~.
 * the 360/91 was designed to execute code written for any S/360 model, with the small exception of imprecise interrupts. Among other features, it detected and properly processed self-modifying code. If a write went to an address already prefetched, it would do what it needed to do. There are some complications related to modifying running channel programs, where out of order stores might cause problems. But the idea of Itanium was that it could all be figured out at compile time. The instruction word is 128 bits wide, which can specify three instructions each 41 bits wide. Then, six instructions can be executed per clock cycle. But all the dependencies have to be figured out by the compiler. The extra five bits (128-3*41) help to the processor about dependencies. There are 128 registers, so register renaming isn't needed. (It was for the 360/91, as S/360 has only four floating point registers.) It seems, though, that compiler technology wasn't up to the task. Gah4 (talk) 09:18, 23 May 2021 (UTC)
 * the 360/91 was designed to execute code written for any S/360 model, with the small exception of imprecise interrupts. Among other features, it detected and properly processed self-modifying code. If a write went to an address already prefetched, it would do what it needed to do. There are some complications related to modifying running channel programs, where out of order stores might cause problems. But the idea of Itanium was that it could all be figured out at compile time. The instruction word is 128 bits wide, which can specify three instructions each 41 bits wide. Then, six instructions can be executed per clock cycle. But all the dependencies have to be figured out by the compiler. The extra five bits (128-3*41) help to the processor about dependencies. There are 128 registers, so register renaming isn't needed. (It was for the 360/91, as S/360 has only four floating point registers.) It seems, though, that compiler technology wasn't up to the task. Gah4 (talk) 09:18, 23 May 2021 (UTC)

= Scoreboard and 6600 misinformation. =

"although in modern usage, such scoreboarding is considered to be in-order execution, not out-of-order execution, since such machines stall on the first RAW conflict – strictly speaking, such machines initiate execution in-order, although they may complete execution out-of-order)."

this is plain factually wrong. the 6600 was perfectly capable of out-of-order execution and coped perfectly adequately with both RaW and WaR: it was WaW (renaming) that it lacked. Seymour Cray and Janes Thornton simply ran out of time to fit WaW hazard logic in before they had to make a commercial sale, and likely later added WaW to the 7600 although we do not have the same level of documentation to confirm that. Fact is, that the lack of WaW (reg renaming) DOES NOT "make" a system "in-order" and is is blindingly obvious and trivial to create an example demonstrating this statement to be false. i spent 5 months studying the 6600 under Mitch Alsup's tutelage, Mitch is the world leading expert on the 6600. he was particularly pissed off with Patterson's incompetent assessment of the 6600, and wrote two chapters to augment Thornton's book, translating it to modern IEEE gate diagrams (Thornton's book predates IEEE and uses ECL gate diagrams). as this is the main page on OoO i will have a think about what the paragraph should say and will draft it here first. User:lkcl

Draft:

Arguably the first machine to use out-of-order execution is the CDC 6600 (1964), which used a scoreboard to resolve conflicts. The 6600 however lacked WAW conflict handling, choosing instead to stall. This situation was termed a "First Order Conflict" by Thornton. Whilst it had both RAW conflict resolution (termed "Second Order Conflict" ) and WAR conflict resolution (termed "Third Order Conflict" ) all of which is sufficient to declare it capable of full out-of-order execution, the 6600 did not have precise exception handling. An early and limited form of Branch prediction was possible as long as the branch was to locations on what was termed the "Instruction Stack" which was limited to within a depth of seven words from the Program Counter.

References for above: