4 Instruction tables - Agner Fog
Definition of terms
Operands can be different types of registers, memory operands, or immediate constants.
The abbreviations used in the tables are: i = immediate constant, r = any general purpose
register, r32 = 32-bit register, etc., mm = 64-bit mmx register, x or xmm = 128-bit xmm
register, y = 256-bit ymm register, sr = segment register, m = any memory operand including
indirect operands, m64 = 64-bit memory operand, etc.
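The abbreviations above follow a regular pattern, which can be captured in a small lookup. This is only an illustrative sketch: the names `FIXED_ABBREVIATIONS` and `describe_operand` are hypothetical and are not part of the instruction tables.

```python
import re

# Abbreviations exactly as defined in the text; the dict name is hypothetical.
FIXED_ABBREVIATIONS = {
    "i": "immediate constant",
    "r": "any general purpose register",
    "mm": "64-bit mmx register",
    "x": "128-bit xmm register",
    "xmm": "128-bit xmm register",
    "y": "256-bit ymm register",
    "sr": "segment register",
    "m": "any memory operand",
}


def describe_operand(abbrev):
    """Expand an operand abbreviation such as 'r32' or 'm64' into words."""
    if abbrev in FIXED_ABBREVIATIONS:
        return FIXED_ABBREVIATIONS[abbrev]
    # Sized forms: 'r' or 'm' followed by a bit width, e.g. r32, m64.
    match = re.fullmatch(r"([rm])(\d+)", abbrev)
    if match:
        kind = "register" if match.group(1) == "r" else "memory operand"
        return f"{match.group(2)}-bit {kind}"
    raise ValueError(f"unknown operand abbreviation: {abbrev!r}")


print(describe_operand("r32"))
print(describe_operand("m64"))
```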
The latency of an instruction is the delay that the instruction generates in a dependency
chain. The measurement unit is clock cycles. Where the clock frequency is varied
dynamically, the figures refer to the core clock frequency. The numbers listed are
minimum values. Cache misses, misalignment, and exceptions may increase the
clock counts considerably. Floating point operands are presumed to be normal numbers.
Denormal numbers, NaNs, and infinity may increase the latencies, possibly by
more than 100 clock cycles on many processors, except in move, shuffle, and Boolean
instructions. Floating point overflow, underflow, denormal, or NaN results may give a
similar delay. A missing value in the table means that the value has not been measured
or that it cannot be measured in a meaningful way.
Some processors have a pipelined execution unit that is smaller than the largest register
size, so that different parts of the operand are calculated at different times. Assume,
for example, that we have a long dependency chain of 128-bit vector instructions
running in a fully pipelined 64-bit execution unit with a latency of 4. The lower 64
bits of each operation will be calculated at times 0, 4, 8, 12, 16, etc., and the upper 64
bits of each operation will be calculated at times 1, 5, 9, 13, 17, etc., as shown in the
figure below. If we look at one 128-bit instruction in isolation, the latency will be 5. But
if we look at a long chain of 128-bit instructions, the total latency will be 4 clock cycles
per instruction plus one extra clock cycle at the end. The latency in this case is listed
as 4 in the tables because this is the value it adds to a dependency chain.
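The arithmetic in this example can be checked with a small timing model. The sketch below is a simplification under stated assumptions, not a description of any real processor: the function name `chain_total_cycles` is hypothetical, the unit is assumed to accept one 64-bit half per clock cycle, and each half is assumed to depend only on the same half of the previous result.

```python
def chain_total_cycles(n_instructions, latency=4, halves=2):
    """Cycle at which the last half of the last instruction in the chain is done.

    Assumptions (illustrative only): the execution unit is fully pipelined and
    accepts one half per clock cycle; half h of an instruction needs only
    half h of the previous instruction's result.
    """
    ready = [0] * halves   # cycle at which each half of the previous result is ready
    next_free_slot = 0     # earliest cycle at which the unit can accept another half
    for _ in range(n_instructions):
        for h in range(halves):
            start = max(ready[h], next_free_slot)
            next_free_slot = start + 1
            ready[h] = start + latency
    return max(ready)


print(chain_total_cycles(1))    # one instruction in isolation
print(chain_total_cycles(10))   # a chain of ten dependent instructions
```

In this model a single instruction finishes after 5 cycles, while a chain of n instructions finishes after 4n + 1 cycles, which is why 4 is the figure that matters for a dependency chain.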
The throughput is the maximum number of instructions of the same kind that can be
executed per clock cycle when the operands of each instruction are independent of
the preceding instructions. The values listed are the reciprocals of the throughputs,
i.e. the average number of clock cycles per instruction when the instructions are not
part of a limiting dependency chain. For example, a reciprocal throughput of 2 for
FMUL means that a new FMUL instruction can start executing 2 clock cycles after a
previous FMUL. A reciprocal throughput of 0.33 for ADD means that the execution
units can handle 3 integer additions per clock cycle.
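As a rough illustration of how the two figures are used, the sketch below estimates the clock count for a run of identical instructions. It is a first-order estimate only: the function name `estimated_cycles` is hypothetical, the latency of 5 used in the example is an invented value rather than a figure from the tables, and other bottlenecks such as instruction fetch, decoding, and memory access are ignored.

```python
def estimated_cycles(n_instructions, latency, reciprocal_throughput, dependent):
    """First-order clock count estimate for n identical instructions.

    dependent=True:  each instruction needs the result of the previous one,
                     so the chain is limited by latency.
    dependent=False: the instructions are independent, so the limit is the
                     reciprocal throughput.
    """
    if dependent:
        return n_instructions * latency
    return n_instructions * reciprocal_throughput


# Reciprocal throughput of 2, as in the FMUL example; the latency of 5 is hypothetical.
print(estimated_cycles(100, latency=5, reciprocal_throughput=2, dependent=False))
print(estimated_cycles(100, latency=5, reciprocal_throughput=2, dependent=True))

# A reciprocal throughput of 0.33 corresponds to about 3 instructions per clock cycle.
print(round(1 / 0.33))
```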
The reason for listing the reciprocal values is that this makes comparisons between
latency and throughput easier. The reciprocal throughput is also called issue latency.
The values listed are for a single thread or a single core. A missing value in the table
means that the value has not been measured.