SIMD instructions

MMX instructions
MMX instructions operate on the mm registers, which are 64 bits wide. These registers are aliased onto the x87 FPU registers.

Original MMX instructions
Added with Pentium MMX

MMX instructions added with MMX+ and SSE
The following MMX instructions were added with SSE. They are also available on the Athlon under the name MMX+.

MMX instructions added with SSE2
The following MMX instructions were added with SSE2:

SSE instructions
Added with Pentium III

SSE instructions operate on xmm registers, which are 128 bits wide.

SSE consists of the following SSE SIMD floating-point instructions:

* The single-precision floating-point bitwise operations ANDPS, ANDNPS, ORPS and XORPS produce the same results as the SSE2 integer operations (PAND, PANDN, POR, PXOR) and the double-precision operations (ANDPD, ANDNPD, ORPD, XORPD), but can incur extra latency for domain changes when applied to values of the wrong type.
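The equivalence described in this note can be illustrated with a small model that treats a 128-bit register as a plain integer. The helper names below (`pack_ps`, `unpack_ps`, `xor128`) are hypothetical and only simulate the bit-level behaviour; they say nothing about the latency effects mentioned above.

```python
import struct

def f32_bits(x):
    """Return the 32-bit pattern of x rounded to single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b):
    """Interpret a 32-bit pattern as a single-precision value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def pack_ps(lanes):
    """Pack four floats into a 128-bit integer, lane 0 in the low bits."""
    return sum(f32_bits(x) << (32 * i) for i, x in enumerate(lanes))

def unpack_ps(v):
    """Split a 128-bit integer back into four single-precision lanes."""
    return [bits_f32((v >> (32 * i)) & 0xFFFFFFFF) for i in range(4)]

def xor128(a, b):
    """Model of XORPS, XORPD and PXOR: the same 128-bit XOR in all three cases."""
    return (a ^ b) & ((1 << 128) - 1)

# XOR with the sign bit of every lane negates four floats at once.
SIGN_MASK = pack_ps([-0.0] * 4)
v = pack_ps([1.0, -2.5, 0.0, 8.0])
negated = unpack_ps(xor128(v, SIGN_MASK))  # [-1.0, 2.5, -0.0, -8.0]
```

Because the operation is defined purely on bits, the result does not depend on which of the three mnemonics is used; only the execution domain differs.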

SSE2 instructions
Added with Pentium 4

SSE2 conversion instructions

 * CMPSD and MOVSD have the same names as the string instruction mnemonics CMPSD (CMPS) and MOVSD (MOVS); however, the former operate on scalar double-precision floating-point values whereas the latter operate on doubleword strings. Assemblers disambiguate them based on the presence or absence of operands.

SSE2 MMX-like instructions extended to SSE registers
SSE2 allows execution of MMX instructions on SSE registers, processing twice the amount of data at once.
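As a rough illustration of "twice the amount of data", the sketch below models a packed 16-bit add (the behaviour of PADDW) on a 64-bit and on a 128-bit register. Registers are represented as Python integers, and `paddw`, `pack_words` and `unpack_words` are hypothetical helper names, not real APIs.

```python
def pack_words(words):
    """Pack 16-bit lanes into one integer, lane 0 in the low bits."""
    return sum((w & 0xFFFF) << (16 * i) for i, w in enumerate(words))

def unpack_words(v, count):
    """Split an integer into `count` 16-bit lanes."""
    return [(v >> (16 * i)) & 0xFFFF for i in range(count)]

def paddw(a, b, width_bits):
    """Add 16-bit lanes independently, wrapping around on overflow."""
    result = 0
    for shift in range(0, width_bits, 16):
        lane = ((a >> shift) + (b >> shift)) & 0xFFFF
        result |= lane << shift
    return result

# 64-bit MMX register: four lanes per instruction.
mmx = paddw(pack_words([1, 2, 3, 0xFFFF]),
            pack_words([10, 20, 30, 1]), 64)

# 128-bit SSE register: eight lanes per instruction.
sse = paddw(pack_words([1, 2, 3, 0xFFFF, 5, 6, 7, 8]),
            pack_words([10, 20, 30, 1, 50, 60, 70, 80]), 128)

print(unpack_words(mmx, 4))  # [11, 22, 33, 0]
print(unpack_words(sse, 8))  # [11, 22, 33, 0, 55, 66, 77, 88]
```

The lane at index 3 shows the wrap-around: 0xFFFF + 1 overflows to 0 without carrying into the neighbouring lane.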

SSE2 integer instructions for SSE registers only
The following instructions can be used only on SSE registers, since by their nature they do not work on MMX registers:

SSE3 instructions
Added with Pentium 4 supporting SSE3

SSSE3 instructions
Added with Xeon 5100 series and initial Core 2

The following MMX-like instructions extended to SSE registers were added with SSSE3:

SSE4.1
Added with Core 2 manufactured in 45 nm

SSE4a
Added with Phenom processors

SSE4.2
Added with Nehalem processors

F16C
Half-precision floating-point conversion.
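The effect of a half-precision round trip can be sketched with Python's `struct` module, whose `e` format code is IEEE 754 binary16. The function names below merely echo the F16C mnemonics VCVTPS2PH and VCVTPH2PS; this is a scalar model that assumes round-to-nearest-even and starts from a Python double rather than a single-precision value, so it is not a model of the instructions' rounding-control options.

```python
import struct

def cvtps2ph(x):
    """Round x to half precision and return the 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def cvtph2ps(bits):
    """Widen a 16-bit half-precision pattern back to a float (exact)."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

print(hex(cvtps2ph(1.0)))        # 0x3c00
print(cvtph2ps(cvtps2ph(0.1)))   # 0.0999755859375
```

Widening is always exact, while narrowing loses precision: 0.1 survives only to about three decimal digits.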

AVX
AVX was first supported by Intel with Sandy Bridge and by AMD with Bulldozer.

Vector operations on 256-bit registers.

AVX2
Introduced in Intel's Haswell microarchitecture and AMD's Excavator.

Expansion of most vector integer SSE and AVX instructions to 256 bits.

FMA3 and FMA4 instructions
Floating-point fused multiply-add instructions are introduced in x86 as two instruction set extensions, "FMA3" and "FMA4", both of which build on top of AVX to provide a set of scalar/vector instructions using the xmm/ymm/zmm vector registers. FMA3 defines a set of 3-operand fused-multiply-add instructions that take three input operands and write their result back to the first of them. FMA4 defines a set of 4-operand fused-multiply-add instructions that take four operands: a destination operand and three source operands.
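What "fused" buys can be shown with a small numerical sketch: a fused multiply-add rounds once, after the exact product and sum, whereas a separate multiply and add round twice. The model below uses exact rational arithmetic to emulate the single rounding on doubles; `fused_multiply_add` and `unfused_multiply_add` are hypothetical helper names, and the inputs are chosen so that the two results differ.

```python
from fractions import Fraction

def fused_multiply_add(a, b, c):
    """Compute a*b + c exactly, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def unfused_multiply_add(a, b, c):
    """Multiply, round, add, round again (ordinary double arithmetic)."""
    return a * b + c

a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# Exact product is 1 - 2**-60, which rounds to 1.0 as a double,
# so the unfused version loses the small term entirely.
print(unfused_multiply_add(a, b, c))  # 0.0
print(fused_multiply_add(a, b, c))    # -8.673617379884035e-19 (i.e. -2**-60)
```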

FMA3 is supported on Intel CPUs starting with Haswell, on AMD CPUs starting with Piledriver, and on Zhaoxin CPUs starting with YongFeng. FMA4 was only supported on AMD Family 15h (Bulldozer) CPUs and has been abandoned from AMD Zen onwards. The FMA3/FMA4 extensions are not considered to be an intrinsic part of AVX or AVX2, although all Intel and AMD (but not Zhaoxin) processors that support AVX2 also support FMA3. FMA3 instructions (in EVEX-encoded form) are, however, AVX-512 foundation instructions.

The FMA3 and FMA4 instruction sets both define a set of 10 fused-multiply-add operations, all available in FP32 and FP64 variants. For each of these variants, FMA3 defines three operand orderings while FMA4 defines two.

FMA3 encoding

FMA3 instructions are encoded with the VEX or EVEX prefixes, in the form VEX.66.0F38 xy /r or EVEX.66.0F38 xy /r. The VEX.W/EVEX.W bit selects the floating-point format (W=0 means FP32, W=1 means FP64). The opcode byte xy consists of two nibbles, where the top nibble x selects the operand ordering (9='132', A='213', B='231') and the bottom nibble y (values 6..F) selects which one of the 10 fused-multiply-add operations to perform. (Values of x and y outside the given ranges result in something that is not an FMA3 instruction.)
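The nibble scheme can be sketched as a small decoder. The mapping assumed here is the commonly documented one: top nibble 9, A or B for the orderings 132, 213 and 231, and bottom nibbles 6 through F for the ten operations in the order shown in the table below. The function name `decode_fma3` is hypothetical, and the sketch looks only at the opcode byte and the W bit, ignoring prefixes and operands.

```python
# Top nibble of the opcode byte -> operand ordering.
ORDERINGS = {0x9: "132", 0xA: "213", 0xB: "231"}

# Bottom nibble -> (operation name, True if packed / False if scalar).
OPERATIONS = {
    0x6: ("VFMADDSUB", True), 0x7: ("VFMSUBADD", True),
    0x8: ("VFMADD", True),    0x9: ("VFMADD", False),
    0xA: ("VFMSUB", True),    0xB: ("VFMSUB", False),
    0xC: ("VFNMADD", True),   0xD: ("VFNMADD", False),
    0xE: ("VFNMSUB", True),   0xF: ("VFNMSUB", False),
}

def decode_fma3(opcode, w):
    """Return the FMA3 mnemonic for an opcode byte and W bit, or None."""
    order = ORDERINGS.get(opcode >> 4)
    operation = OPERATIONS.get(opcode & 0xF)
    if order is None or operation is None:
        return None  # not an FMA3 opcode
    name, packed = operation
    suffix = ("P" if packed else "S") + ("D" if w else "S")
    return name + order + suffix

print(decode_fma3(0x98, 0))  # VFMADD132PS
print(decode_fma3(0xB9, 1))  # VFMADD231SD
print(decode_fma3(0x95, 0))  # None
```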

At the assembly language level, the operand ordering is specified in the mnemonic of the instruction:
 * vfmadd132sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm1*xmm3)+xmm2
 * vfmadd213sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm1)+xmm3
 * vfmadd231sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm3)+xmm1

For all FMA3 variants, the first two arguments must be xmm/ymm/zmm vector register arguments, while the last argument may be either a vector register or memory argument. Under AVX-512, the EVEX-encoded variants support EVEX-prefix-encoded broadcast, opmasks and rounding-controls.
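One way to read the three-digit ordering in an FMA3 mnemonic is that the first two digits name the operands that are multiplied and the third names the operand that is added, with the result always written to operand 1. The sketch below models that reading for a plain multiply-add; `fma3` is a hypothetical helper, and unlike the hardware instruction it rounds twice, which does not matter for the small integer-valued inputs used here.

```python
def fma3(order, op1, op2, op3):
    """Return the new value of operand 1 for a given ordering string."""
    operands = {"1": op1, "2": op2, "3": op3}
    x, y, z = (operands[digit] for digit in order)
    return x * y + z

# With op1=2, op2=3, op3=5:
print(fma3("132", 2.0, 3.0, 5.0))  # 13.0  = (op1*op3)+op2
print(fma3("213", 2.0, 3.0, 5.0))  # 11.0  = (op2*op1)+op3
print(fma3("231", 2.0, 3.0, 5.0))  # 17.0  = (op2*op3)+op1
```

Having three orderings lets the assembler choose which input is overwritten and which one may come from memory, since only the last argument can be a memory operand.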

The AVX512-FP16 extension, introduced in Sapphire Rapids, adds FP16 variants of the FMA3 instructions. These all take the form EVEX.66.MAP6.W0 xy /r, with the opcode byte working in the same way as for the FP32/FP64 variants. (For the FMA4 instructions, no FP16 variants are defined.)

FMA4 encoding

FMA4 instructions are encoded with the VEX prefix, in the form VEX.66.0F3A xx /r ib (no EVEX encodings are defined). The opcode byte xx uses its bottom bit to select the floating-point format (0=FP32, 1=FP64) and the remaining bits to select one of the 10 fused-multiply-add operations to perform.

For FMA4, operand ordering is controlled by the VEX.W bit. If VEX.W=0, then the third operand is the r/m operand specified by the instruction's ModR/M byte and the fourth operand is a register operand, specified by bits 7:4 of the ib (8-bit immediate) part of the instruction. If VEX.W=1, then these two operands are swapped. For example:
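 * vfmaddsd xmm1,xmm2,[mem],xmm3 will perform xmm1 ← (xmm2*[mem])+xmm3 and require a W=0 encoding.
 * vfmaddsd xmm1,xmm2,xmm3,[mem] will perform xmm1 ← (xmm2*xmm3)+[mem] and require a W=1 encoding.
 * vfmaddsd xmm1,xmm2,xmm3,xmm4 will perform xmm1 ← (xmm2*xmm3)+xmm4 and can be encoded with either W=0 or W=1.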
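The FMA4 scheme can be sketched in the same spirit. The opcode values assumed below (5C to 5F, 68 to 6F and 78 to 7F) are the commonly documented FMA4 opcode bytes, and the sketch further assumes that the second assembly operand comes from VEX.vvvv. The names `decode_fma4` and `fma4_sources` are hypothetical.

```python
# Opcode byte with the bottom bit stripped -> operation stem.
FMA4_OPERATIONS = {
    0x5C: "VFMADDSUBP", 0x5E: "VFMSUBADDP",
    0x68: "VFMADDP",    0x6A: "VFMADDS",
    0x6C: "VFMSUBP",    0x6E: "VFMSUBS",
    0x78: "VFNMADDP",   0x7A: "VFNMADDS",
    0x7C: "VFNMSUBP",   0x7E: "VFNMSUBS",
}

def decode_fma4(opcode):
    """Return the FMA4 mnemonic for an opcode byte, or None."""
    stem = FMA4_OPERATIONS.get(opcode & 0xFE)
    if stem is None:
        return None
    return stem + ("D" if opcode & 1 else "S")  # bottom bit: 0=FP32, 1=FP64

def fma4_sources(w, vvvv_reg, rm_operand, imm_reg):
    """Return the three source operands in assembly order.

    rm_operand is the ModR/M r/m operand (register or memory);
    imm_reg is the register named by bits 7:4 of the immediate.
    """
    if w == 0:
        return (vvvv_reg, rm_operand, imm_reg)
    return (vvvv_reg, imm_reg, rm_operand)  # W=1 swaps the last two

print(decode_fma4(0x6B))                          # VFMADDSD
print(fma4_sources(0, "xmm2", "[mem]", "xmm3"))   # ('xmm2', '[mem]', 'xmm3')
print(fma4_sources(1, "xmm2", "[mem]", "xmm3"))   # ('xmm2', 'xmm3', '[mem]')
```

Since only the r/m operand can refer to memory, the W bit is what allows the memory operand to appear as either the third or the fourth assembly operand.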

Opcode table

The 10 fused-multiply-add operations and the 110 instruction variants they give rise to are given by the following table – with FMA4 instructions highlighted with * and yellow cell coloring, and FMA3 instructions not highlighted:
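The figure of 110 variants can be reproduced with a short enumeration, assuming it counts three orderings in FP32, FP64 and FP16 for FMA3 (nine per operation) plus FP32 and FP64 for FMA4 (two per operation). The mnemonic construction below follows the usual naming pattern, and FMA4 entries are marked with * as in the table; `fma_variants` is a hypothetical helper.

```python
# The ten operations: (name, "P" for packed or "S" for scalar).
OPERATIONS = [
    ("VFMADDSUB", "P"), ("VFMSUBADD", "P"),
    ("VFMADD", "P"),    ("VFMADD", "S"),
    ("VFMSUB", "P"),    ("VFMSUB", "S"),
    ("VFNMADD", "P"),   ("VFNMADD", "S"),
    ("VFNMSUB", "P"),   ("VFNMSUB", "S"),
]

def fma_variants():
    """List every FMA3 and FMA4 mnemonic; FMA4 names end with '*'."""
    names = []
    for base, kind in OPERATIONS:
        for order in ("132", "213", "231"):
            for fmt in ("S", "D", "H"):      # FP32, FP64, FP16
                names.append(base + order + kind + fmt)
        for fmt in ("S", "D"):               # FMA4: FP32 and FP64 only
            names.append(base + kind + fmt + "*")
    return names

variants = fma_variants()
print(len(variants))                               # 110
print(sum(n.endswith("*") for n in variants))      # 20
```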

AVX-512
AVX-512, introduced in 2014, adds 512-bit wide vector registers (extending the 256-bit registers, which become the new registers' lower halves) and doubles their count to 32; the new registers are thus named zmm0 through zmm31. It adds eight mask registers, named k0 through k7, which may be used to restrict operations to specific parts of a vector register. Unlike previous instruction set extensions, AVX-512 is implemented in several groups; only the foundation ("AVX-512F") extension is mandatory. Most of the added instructions may also be used with the 256- and 128-bit registers.
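How a mask register restricts an operation can be sketched with a lane-by-lane model of a masked 32-bit integer add on a 512-bit register (16 lanes). The sketch assumes the two masking behaviours commonly described for AVX-512: merge-masking, where unselected lanes keep the destination's old value, and zero-masking, where they are cleared. `masked_add` is a hypothetical helper name.

```python
def masked_add(dst, a, b, k, zeroing=False):
    """Add 32-bit lanes of a and b where the matching bit of mask k is set."""
    out = []
    for i, (old, x, y) in enumerate(zip(dst, a, b)):
        if (k >> i) & 1:
            out.append((x + y) & 0xFFFFFFFF)   # lane selected by the mask
        else:
            out.append(0 if zeroing else old)  # zero or keep the old value
    return out

dst = [9] * 16
a = list(range(16))
b = [100] * 16
k = 0b0000000000000101  # select lanes 0 and 2

merged = masked_add(dst, a, b, k)
zeroed = masked_add(dst, a, b, k, zeroing=True)
print(merged[:4])  # [100, 9, 102, 9]
print(zeroed[:4])  # [100, 0, 102, 0]
```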

AMX
Intel AMX adds eight new tile-registers, tmm0 through tmm7, each holding a matrix, with a maximum capacity of 16 rows of 64 bytes per tile-register. It also adds a TILECFG register to configure the sizes of the actual matrices held in each of the eight tile-registers, and a set of instructions to perform matrix multiplications on these registers.
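The capacity figures above translate into simple arithmetic, sketched below. The constants come from the text (eight tiles, at most 16 rows of 64 bytes); the helper names `tile_bytes` and `elements_per_row` are hypothetical, and the element sizes used in the example (1, 2 and 4 bytes) are illustrative assumptions rather than a list of supported data types.

```python
NUM_TILES = 8
MAX_ROWS = 16
MAX_BYTES_PER_ROW = 64

def tile_bytes(rows, bytes_per_row):
    """Return the storage used by a tile shape, rejecting oversized shapes."""
    if not (0 <= rows <= MAX_ROWS and 0 <= bytes_per_row <= MAX_BYTES_PER_ROW):
        raise ValueError("shape exceeds the tile-register capacity")
    return rows * bytes_per_row

def elements_per_row(bytes_per_row, element_size):
    """Return how many elements of a given byte size fit in one row."""
    return bytes_per_row // element_size

print(tile_bytes(MAX_ROWS, MAX_BYTES_PER_ROW))               # 1024
print(NUM_TILES * tile_bytes(MAX_ROWS, MAX_BYTES_PER_ROW))   # 8192
print(elements_per_row(64, 1), elements_per_row(64, 2),
      elements_per_row(64, 4))                               # 64 32 16
```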