The 13 new instructions are summarized below. For detailed information on each instruction, see
Chapter 3, "Instruction Set Reference," of the Next Generation Intel® Processor: Software Developers Guide.
One instruction improves x87-FP integer conversion: FISTTP.
FISTTP (Store Integer and Pop from x87-FP with Truncation) behaves like the FISTP instruction but always truncates, irrespective of the rounding mode specified in the floating-point control word (FCW). The instruction converts the top of stack (ST0) to an integer by truncation and then pops the stack. FISTTP is available in three precisions: short integer (word or 16-bit), integer (double word or 32-bit), and long integer (64-bit). With FISTTP, applications no longer need to change the FCW when truncation is desired. This instruction is the only x87-FP instruction in the Prescott New Instruction technology.
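The truncating conversion can be sketched in C, where a cast from a floating-point type to an integer type always rounds toward zero. The function name `fisttp32_model` is hypothetical, and the sketch covers only the 32-bit form for in-range inputs; it does not model the stack pop or out-of-range handling.

```c
#include <stdint.h>

/* Hypothetical scalar model of the 32-bit form of FISTTP.
   A C cast from double to an integer type discards the fraction
   (rounds toward zero), whatever rounding mode is currently set,
   which mirrors the truncation described above.
   Assumption: st0 is finite and fits in int32_t; other inputs are
   not modeled here because the cast would be undefined in C. */
static int32_t fisttp32_model(double st0)
{
    return (int32_t)st0;
}
```

For example, `fisttp32_model(2.9)` yields 2 and `fisttp32_model(-2.9)` yields -2.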
Three instructions enhance LOAD/MOVE/DUPLICATE performance: MOVSHDUP, MOVSLDUP, and MOVDDUP.
MOVSHDUP loads/moves 128 bits, duplicating the second and fourth 32-bit data elements.
- MOVSHDUP OperandA OperandB
- OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a
- OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b
- Result (stored in OperandA): 3b, 3b, 1b, 1b
MOVSLDUP loads/moves 128 bits, duplicating the first and third 32-bit data elements.
- MOVSLDUP OperandA OperandB
- OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a
- OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b
- Result (stored in OperandA): 2b, 2b, 0b, 0b
MOVDDUP loads/moves 64 bits (bits [63:0] if the source is a register) and returns the same 64 bits in both the lower and upper halves of the 128-bit result register. This action duplicates the 64 bits from the source.
- MOVDDUP OperandA OperandB
- OperandA (128 bits, two data elements): 1a, 0a
- OperandB (64 bits, one data element): 0b
- Result (stored in OperandA): 0b, 0b
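The three duplicating moves can be modeled with plain C arrays as a rough illustration. The function names are hypothetical, and array index i is assumed to hold data element i, so index 0 is the lowest element (the lists above write elements from highest to lowest).

```c
#include <string.h>

/* Hypothetical scalar models; a[] plays the role of OperandA,
   b[] the role of OperandB. Results are built in a temporary so
   that passing the same array for both operands still works. */

/* MOVSHDUP: result = {3b, 3b, 1b, 1b} (highest element first). */
static void movshdup_model(float a[4], const float b[4])
{
    float r[4] = { b[1], b[1], b[3], b[3] };
    memcpy(a, r, sizeof r);
}

/* MOVSLDUP: result = {2b, 2b, 0b, 0b} (highest element first). */
static void movsldup_model(float a[4], const float b[4])
{
    float r[4] = { b[0], b[0], b[2], b[2] };
    memcpy(a, r, sizeof r);
}

/* MOVDDUP: the single 64-bit source element fills both halves. */
static void movddup_model(double a[2], const double *b)
{
    double lo = *b;
    a[0] = lo;
    a[1] = lo;
}
```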
One instruction provides specialized 128-bit unaligned data load: LDDQU.
LDDQU is a special 128-bit unaligned load designed to avoid cache-line splits. If the address of the load is aligned on a 16-byte boundary, LDDQU loads the 16 bytes requested. If the address of the load is not aligned on a 16-byte boundary, LDDQU loads a 32-byte block starting at the 16-byte aligned address immediately below the load request. It then extracts the requested 16 bytes. The instruction provides a significant performance improvement on 128-bit unaligned memory accesses at the cost of some usage-model restrictions.
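The load-then-extract behavior described above can be sketched as follows. This is only a model under stated assumptions: `lddqu_model` is a hypothetical name, `mem` stands in for memory whose index 0 is 16-byte aligned, and `addr` is a byte offset into it rather than a real address.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of LDDQU on a byte array.
   Assumption: the caller guarantees that the whole 32-byte block
   starting at the aligned offset lies inside mem. */
static void lddqu_model(const uint8_t *mem, size_t addr, uint8_t out[16])
{
    size_t offset = addr & 15u;          /* distance above the 16-byte boundary */

    if (offset == 0) {
        /* Aligned request: only the 16 requested bytes are read. */
        memcpy(out, mem + addr, 16);
    } else {
        /* Unaligned request: read the 32-byte block that starts at the
           aligned offset just below addr, then extract 16 bytes. */
        uint8_t block[32];
        size_t base = addr - offset;
        memcpy(block, mem + base, 32);
        memcpy(out, block + offset, 16);
    }
}
```

The model makes one consequence visible: in the unaligned case, bytes outside the 16 requested are also read.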
Two instructions provide packed addition/subtraction: ADDSUBPS and ADDSUBPD.
ADDSUBPS has two 128-bit operands. The instruction performs single-precision addition on the second and fourth pairs of 32-bit data elements within the operands, and single-precision subtraction on the first and third pairs. This instruction is effective at evaluating complex products on packed single-precision data.
- ADDSUBPS OperandA OperandB
- OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a
- OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b
- Result (stored in OperandA): 3a+3b, 2a-2b, 1a+1b, 0a-0b
ADDSUBPD has two 128-bit operands. The instruction performs double-precision addition on the second pair of quadwords, and double-precision subtraction on the first pair. This instruction is useful when evaluating complex products on packed double-precision data.
- ADDSUBPD OperandA OperandB
- OperandA (128 bits, two data elements): 1a, 0a
- OperandB (128 bits, two data elements): 1b, 0b
- Result (stored in OperandA): 1a+1b, 0a-0b
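As an illustration of why the alternating subtract/add pattern suits complex products, here is a scalar sketch. All function names are hypothetical, array index 0 is assumed to be the lowest data element, and complex numbers are assumed to be stored as {re0, im0, re1, im1}.

```c
#include <string.h>

/* ADDSUBPS model: subtract in elements 0 and 2, add in elements 1 and 3. */
static void addsubps_model(float a[4], const float b[4])
{
    float r[4] = { a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3] };
    memcpy(a, r, sizeof r);
}

/* ADDSUBPD model: subtract in element 0, add in element 1. */
static void addsubpd_model(double a[2], const double b[2])
{
    double r[2] = { a[0] - b[0], a[1] + b[1] };
    memcpy(a, r, sizeof r);
}

/* Multiplies two pairs of complex numbers, out = x * y per pair. */
static void complex_mul2_model(float out[4], const float x[4], const float y[4])
{
    /* Real parts of x duplicated (the MOVSLDUP pattern), times y. */
    float t[4] = { x[0] * y[0], x[0] * y[1], x[2] * y[2], x[2] * y[3] };
    /* Imaginary parts of x duplicated (the MOVSHDUP pattern),
       times y with its real and imaginary parts swapped. */
    float u[4] = { x[1] * y[1], x[1] * y[0], x[3] * y[3], x[3] * y[2] };

    /* re = xr*yr - xi*yi (subtract), im = xr*yi + xi*yr (add). */
    addsubps_model(t, u);
    memcpy(out, t, sizeof t);
}
```

With x = {1, 2, 3, 4} and y = {5, 6, 7, 8}, the sketch computes (1+2i)(5+6i) = -7+16i and (3+4i)(7+8i) = -11+52i.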
Four instructions provide horizontal addition/subtraction: HADDPS, HSUBPS, HADDPD, and HSUBPD.
Most SIMD instructions operate vertically: the element in position i of the result is a function of the elements in position i of both operands. Horizontal addition/subtraction instead operates horizontally: contiguous data elements from the same operand are used to produce each result data element.
HADDPS performs a single-precision addition on contiguous data elements. The first data element of the result is obtained by adding the first and second elements of the first operand. The second element is obtained by adding the third and fourth elements of the first operand. The third element is obtained by adding the first and second elements of the second operand. The fourth element is obtained by adding the third and fourth elements of the second operand.
- HADDPS OperandA OperandB
- OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a
- OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b
- Result (stored in OperandA): 3b+2b, 1b+0b, 3a+2a, 1a+0a
HSUBPS performs a single-precision subtraction on contiguous data elements. The first data element of the result is obtained by subtracting the second element of the first operand from the first element of the first operand. The second element is obtained by subtracting the fourth element of the first operand from the third element of the first operand. The third element is obtained by subtracting the second element of the second operand from the first element of the second operand. The fourth element is obtained by subtracting the fourth element of the second operand from the third element of the second operand.
- HSUBPS OperandA OperandB
- OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a
- OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b
- Result (stored in OperandA): 2b-3b, 0b-1b, 2a-3a, 0a-1a
HADDPD performs a double-precision addition on contiguous data elements. The first data element of the result is obtained by adding the first and second elements of the first operand. The second element is obtained by adding the first and second elements of the second operand.
- HADDPD OperandA OperandB
- OperandA (128 bits, two data elements): 1a, 0a
- OperandB (128 bits, two data elements): 1b, 0b
- Result (stored in OperandA): 1b+0b, 1a+0a
HSUBPD performs a double-precision subtraction on contiguous data elements. The first data element of the result is obtained by subtracting the second element of the first operand from the first element of the first operand. The second element is obtained by subtracting the second element of the second operand from the first element of the second operand.
- HSUBPD OperandA OperandB
- OperandA (128 bits, two data elements): 1a, 0a
- OperandB (128 bits, two data elements): 1b, 0b
- Result (stored in OperandA): 0b-1b, 0a-1a
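The four horizontal operations can be summarized in a scalar sketch. The function names are hypothetical, and array index 0 is assumed to be the lowest data element, so the lower half of each result comes from OperandA and the upper half from OperandB.

```c
#include <string.h>

/* Hypothetical scalar models; a[] is OperandA (also the destination),
   b[] is OperandB. Each result is built in a temporary first, so the
   same array may be passed for both operands. */

static void haddps_model(float a[4], const float b[4])
{
    float r[4] = { a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
    memcpy(a, r, sizeof r);
}

static void hsubps_model(float a[4], const float b[4])
{
    float r[4] = { a[0] - a[1], a[2] - a[3], b[0] - b[1], b[2] - b[3] };
    memcpy(a, r, sizeof r);
}

static void haddpd_model(double a[2], const double b[2])
{
    double r[2] = { a[0] + a[1], b[0] + b[1] };
    memcpy(a, r, sizeof r);
}

static void hsubpd_model(double a[2], const double b[2])
{
    double r[2] = { a[0] - a[1], b[0] - b[1] };
    memcpy(a, r, sizeof r);
}
```

One use that follows from these definitions: applying `haddps_model(v, v)` twice leaves the sum of all four original elements of `v` in every position.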
Two instructions improve synchronization between agents: MONITOR and MWAIT.
- MONITOR sets up an address range used to monitor write-back stores.
- MWAIT enables a logical processor to enter into an optimized state while waiting for a write-back store to the address range set up by the MONITOR instruction.
Support for MONITOR/MWAIT is indicated by the CPUID MONITOR/MWAIT feature flag. Software need not check for SSE support in order to use MONITOR/MWAIT.
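A minimal sketch of the independent feature checks, assuming the flags sit where Intel's CPUID documentation places them for leaf 1 (EAX = 1): SSE3 in ECX bit 0 and MONITOR/MWAIT in ECX bit 3. The helper names are hypothetical, and obtaining the ECX value is compiler-specific, so it is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions in ECX after executing CPUID with EAX = 1. */
#define CPUID_1_ECX_SSE3    (1u << 0)
#define CPUID_1_ECX_MONITOR (1u << 3)

/* True if the MONITOR/MWAIT flag is set in the given ECX value. */
static bool has_monitor_mwait(uint32_t cpuid_1_ecx)
{
    return (cpuid_1_ecx & CPUID_1_ECX_MONITOR) != 0;
}

/* True if the flag for the SIMD instructions described above is set. */
static bool has_sse3(uint32_t cpuid_1_ecx)
{
    return (cpuid_1_ecx & CPUID_1_ECX_SSE3) != 0;
}
```

The two checks are deliberately separate functions, reflecting that MONITOR/MWAIT support is reported by its own flag.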