Intel 8087

The Intel 8087, announced in 1980, was the first floating-point coprocessor for the 8086 line of microprocessors. The purpose of the chip was to speed up floating-point arithmetic operations, such as addition, subtraction, multiplication, division, and square root. It also computes transcendental functions such as exponential, logarithmic or trigonometric calculations. The performance enhancements were from approximately 20% to over 500%, depending on the specific application. The 8087 could perform about 50,000 FLOPS using around 2.4 watts.



The 8087 was an advanced integrated circuit, pushing the limits of manufacturing technology of the period. Basic operations on the 8087 such as addition and subtraction can take over 100 machine cycles to execute and some instructions exceed 1000 cycles. The chip lacks a hardware multiplier and implements calculations using the CORDIC algorithm.

Sales of the 8087 received a significant boost when a coprocessor socket was included on the 1981 IBM PC motherboard. Development of the 8087 led to the IEEE 754-1985 standard for floating-point arithmetic. The available speed version were 4.77 (5), 8, and 10 MHz. There were later x87 coprocessors for the 80186, 80286, 80386, and 80386SX processors. Starting with the 80486, the later Intel x86 processors did not use a separate floating-point coprocessor; floating-point functions were integrated with the processor.

Intel 80186 and its associated products were discontinued on March 30, 2007 for orders and September 28, 2007 for shipments.

Design and development
Intel had previously manufactured the 8231 Arithmetic processing unit, and the 8232 Floating Point Processor. These were designed for use with 8080 or similar processors and used an 8-bit data bus. They were interfaced to a host system either through programmed I/O or a DMA controller.

The 8087 was initially conceived by Bill Pohlman, the engineering manager at Intel who oversaw the development of the 8086 chip. Bill took steps to be sure that the 8086 chip could support a yet-to-be-developed math chip.

In 1977 Pohlman got the go ahead to design the 8087 math chip. Bruce Ravenel was assigned as architect, and John Palmer was hired to be co-architect and mathematician for the project. The two came up with a revolutionary design with 64 bits of mantissa and 16 bits of exponent for the longest-format real number, with a stack architecture CPU and eight 80-bit stack registers, with a computationally rich instruction set. The design solved a few outstanding known problems in numerical computing and numerical software: rounding-error problems were eliminated for 64-bit operands, and numerical mode conversions were solved for all 64-bit numbers. Palmer credited William Kahan's writings on floating point as a significant influence on their design.

The 8087 design initially met a cool reception in Santa Clara due to its aggressive design. Eventually, the design was assigned to Intel Israel, and Rafi Nave was assigned to lead the implementation of the chip. Palmer, Ravenel and Nave were awarded patents for the design. Robert Koehler and John Bayliss were also awarded a patent for the technique where some instructions with a particular bit pattern were offloaded to the coprocessor.

The 8087 had 65,000 transistors and was manufactured as a 4.5 μm (then shrunk to 3 μm) depletion-load HMOS circuit. It worked in tandem with the 8086 or 8088 and introduced about 60 new instructions. Most 8087 assembly mnemonics begin with F, such as FADD, FMUL, FCOM and so on, making them easily distinguishable from 8086 instructions. The binary encodings for all 8087 instructions begin with the bit pattern 11011, decimal 27, the same as the ASCII character ESC, although in the higher-order bits of a byte; similar instruction prefixes are also sometimes referred to as "escape codes." The instruction mnemonic assigned by Intel for these coprocessor instructions is "ESC." The 8087 was expensive and difficult to manufacture with low yields. It also ran quite hot, forcing Intel to use a more expensive ceramic package for improved thermal dissipation.

When the 8086 or 8088 CPU executed the ESC instruction, if the second byte (the ModR/M byte) specified a memory operand, the CPU would execute a bus cycle to read one word from the memory location specified in the instruction (using any 8086 addressing mode), but it would not store the read operand into any CPU register or perform any operation on it; the 8087 would observe the bus and decode the instruction stream in sync with the 8086, recognizing the coprocessor instructions meant for itself. For an 8087 instruction with a memory operand, if the instruction called for the operand to be read, the 8087 would take the word of data read by the main CPU from the data bus. If the operand to be read was longer than one word, the 8087 would also copy the address from the address bus; then, after completion of the data read cycle driven by the CPU, the 8087 would immediately use DMA to take control of the bus and transfer the additional bytes of the operand itself. If an 8087 instruction with a memory operand called for that operand to be written, the 8087 would ignore the read word on the data bus and just copy the address, then request DMA and write the entire operand, in the same way that it would read the end of an extended operand. In this way, the main CPU maintained general control of the bus and bus timing, while the 8087 handled all other aspects of execution of coprocessor instructions, except for brief DMA periods when the 8087 would take over the bus to read or write operands to/from its own internal registers. As a consequence of this design, the 8087 could only operate on operands taken either from memory or from its own registers, and any exchange of data between the 8087 and the 8086 or 8088 was only through RAM.

The main CPU program continued to execute while the 8087 executed an instruction; from the perspective of the main 8086 or 8088 CPU, a coprocessor instruction took only as long as the processing of the opcode and any memory operand cycle (2 clock cycles for no operand, 8 clock cycles plus the EA calculation time [5 to 12 clock cycles] for a memory operand [plus 4 more clock cycles on an 8088], to transfer the second byte of the operand word), after which the CPU would begin executing the next instruction of the program. Thus, a system with an 8087 was capable of true parallel processing, performing one operation in the integer ALU of the main CPU while at the same time performing a floating-point operation in the 8087 coprocessor. Since the 8086 or 8088 exclusively controlled the instruction flow and timing and had no direct access to the internal status of the 8087, and because the 8087 could execute only one instruction at a time, programs for the combined 8086/8087 or 8088/8087 system had to ensure that the 8087 had time to complete the last instruction issued to it before it was issued another one. The WAIT instruction (of the main CPU) was provided for this purpose, and most assemblers implicitly asserted a WAIT instruction before each instance of most floating-point coprocessor instructions. (It is not necessary to use a WAIT instruction before an 8087 operation if the program uses other means to ensure that enough time elapses between the issuance of timing-sensitive 8087 instructions so that the 8087 can never receive such an instruction before it completes the previous one. It is also not necessary, if a WAIT is used, that it immediately precede the next 8087 instruction.)  The WAIT instruction waited for the −TEST input pin of the 8086/8088 to be asserted (low), and this pin was connected to the BUSY pin of the 8087 in all systems that had an 8087 (so TEST was asserted when BUSY was deasserted).

Because the instruction prefetch queues of the 8086 and 8088 make the time when an instruction is executed not always the same as the time it is fetched, a coprocessor such as the 8087 cannot determine when an instruction for itself is the next instruction to be executed purely by watching the CPU bus. The 8086 and 8088 have two queue status signals connected to the coprocessor to allow it to synchronize with the CPU's internal timing of execution of instructions from its prefetch queue. The 8087 maintains its own identical prefetch queue, from which it reads the coprocessor opcodes that it actually executes. Because the 8086 and 8088 prefetch queues have different sizes and different management algorithms, the 8087 determines which type of CPU it is attached to by observing a certain CPU bus line when the system is reset, and the 8087 adjusts its internal instruction queue accordingly. The redundant duplication of prefetch queue hardware in the CPU and the coprocessor is inefficient in terms of power usage and total die area, but it allowed the coprocessor interface to use very few dedicated IC pins, which was important. At the time when the 8086, which defined the coprocessor interface, was introduced, IC packages with more than 40 pins were rare, expensive, and suffered from problems such as excessive lead capacitance, a major limiting factor for signalling speeds.

The coprocessor operation codes are encoded in 6 bits across 2 bytes, beginning with the escape sequence:

┌───────────┬───────────┐ │ 1101 1xxx │ mmxx xrrr │ └───────────┴───────────┘

The first three "x" bits are the first three bits of the floating-point opcode. Then two "m" bits, then the latter half three bits of the floating-point opcode, followed by three "r" bits. The "m" and "r" bits specify the addressing-mode information.

Application programs had to be written to make use of the special floating-point instructions. At run time, software could detect the coprocessor and use it for floating-point operations. When detected absent, similar floating-point functions had to be calculated in software, or the whole coprocessor could be emulated in software for more precise numerical compatibility.

Registers
The x87 family does not use a directly addressable register set such as the main registers of the x86 processors; instead, the x87 registers form an eight-level deep stack structure ranging from st0 to st7, where st0 is the top. The x87 instructions operate by pushing, calculating, and popping values on this stack. However, dyadic operations such as FADD, FMUL, FCMP, and so on may either implicitly use the topmost st0 and st1 or may use st0 together with an explicit memory operand or register; the st0 register may thus be used as an accumulator (i.e. as a combined destination and left operand) and can also be exchanged with any of the eight stack registers using an instruction called FXCH stX (codes D9C8–D9CFh). This makes the x87 stack usable as seven freely addressable registers plus an accumulator. This is especially applicable on superscalar x86 processors (Pentium of 1993 and later), where these exchange instructions are optimized down to a zero-clock penalty.

IEEE floating-point standard
When Intel designed the 8087, it aimed to make a standard floating-point format for future designs. An important aspect of the 8087 from a historical perspective was that it became the basis for the IEEE 754 floating-point standard. The 8087 did not implement the eventual IEEE 754 standard in all its details, as the standard was not finished until 1985, but the 80387 did. The 8087 provided two basic 32/64-bit floating-point data types and an additional extended 80-bit internal temporary format (that could also be stored in memory) to improve accuracy over large and complex calculations. Apart from this, the 8087 offered an 80-bit/18-digit packed BCD (binary-coded decimal) format and 16-, 32-, and 64-bit integer data types.

Infinity
The 8087 handles infinity values by either affine closure or projective closure (selected by the status register). With affine closure, positive and negative infinities are treated as different values. With projective closure, infinity is treated as an unsigned representation for very small or very large numbers. These two methods of handling infinity were incorporated into the draft version of the IEEE 754 floating-point standard. However, projective closure (projectively extended real number system) was dropped from the later formal issue of IEEE 754-1985. The 80287 retained projective closure as an option, but the 80387 and subsequent floating-point processors (including the 80187) only supported affine closure.

Coprocessor interface
The 8087 differed from subsequent Intel coprocessors in that it was directly connected to the address and data buses. The 8087 looked for instructions that commenced with the "11011" sequence and acted on them, immediately requesting DMA from the main CPU as necessary to access memory operands longer than one word (16 bits), then immediately releasing bus control back to the main CPU. The coprocessor did not hold up execution of the program until the coprocessor instruction was complete, and the program had to explicitly synchronize the two processors, as explained above (in the "Design and development" section). There was a potential crash problem if the coprocessor instruction failed to decode to one that the coprocessor understood. Intel's later coprocessors did not connect to the buses in the same way, but received instructions through the main processor I/O ports. This yielded an execution time penalty, but the potential crash problem was avoided because the main processor would ignore the instruction if the coprocessor refused to accept it. The 8087 was able to detect whether it was connected to an 8088 or an 8086 by monitoring the data bus during the reset cycle.

The 8087 was, in theory, capable of working concurrently while the 8086/8 processes additional instructions. In practice, there was the potential for program failure if the coprocessor issued a new instruction before the last one had completed. The assembler would automatically insert an FWAIT instruction after every coprocessor opcode, forcing the 8086/8 to halt execution until the 8087 signalled that it had finished. This limitation was removed from later designs.

Successors
Just as the 8088 and 8086 processors were superseded by later parts, so was the 8087 superseded. Other Intel coprocessors were the 80287 (actually - 80C287, as Intel moved to a CMOS process by that time), 80387, and the 80187. Starting with the 80486, the later Intel processors did not use a separate floating-point coprocessor; virtually all included it on the main processor die, with the significant exception of the 80486SX, which was a modified 80486DX with the FPU disabled. 80487 was in fact a full-blown 80486DX chip with an extra pin. When installed, it disabled the 80486SX CPU. The 80486DX, Pentium, and later processors include floating-point functionality on the CPU core.