03.03.2013 Views

4 Instruction tables - Agner Fog


Definition of terms

Operands

Operands can be different types of registers, memory, or immediate constants. Abbreviations used in the tables are: i = immediate constant, r = any general purpose register, r32 = 32-bit register, etc., mm = 64-bit mmx register, x or xmm = 128-bit xmm register, y = 256-bit ymm register, sr = segment register, m = any memory operand including indirect operands, m64 = 64-bit memory operand, etc.
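The abbreviations above can be collected into a small lookup, which may help when reading table rows programmatically. This is only a sketch: the function name `describe_operand` and the dictionary `FIXED_ABBREVIATIONS` are hypothetical, and the handling of sized forms such as r32 and m64 simply generalizes the "etc." in the text.

```python
import re

# Abbreviations exactly as listed in the definition of terms.
FIXED_ABBREVIATIONS = {
    "i": "immediate constant",
    "r": "any general purpose register",
    "mm": "64-bit mmx register",
    "x": "128-bit xmm register",
    "xmm": "128-bit xmm register",
    "y": "256-bit ymm register",
    "sr": "segment register",
    "m": "any memory operand including indirect operands",
}


def describe_operand(abbrev):
    """Return a plain-text description of an operand abbreviation."""
    if abbrev in FIXED_ABBREVIATIONS:
        return FIXED_ABBREVIATIONS[abbrev]
    # Sized forms: r32 = 32-bit register, m64 = 64-bit memory operand, etc.
    match = re.fullmatch(r"([rm])(\d+)", abbrev)
    if match:
        kind = "register" if match.group(1) == "r" else "memory operand"
        return f"{match.group(2)}-bit {kind}"
    raise ValueError(f"unknown operand abbreviation: {abbrev!r}")


if __name__ == "__main__":
    for name in ("i", "r32", "mm", "xmm", "m64"):
        print(name, "=", describe_operand(name))
```

The fixed entries are checked before the sized pattern so that "mm" is read as the mmx register and not as a memory operand.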

Latency

The latency of an instruction is the delay that the instruction generates in a dependency chain. The measurement unit is clock cycles. Where the clock frequency is varied dynamically, the figures refer to the core clock frequency. The numbers listed are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs and infinity may increase the latencies, possibly by more than 100 clock cycles on many processors, except in move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NaN results may give a similar delay. A missing value in the table means that the value has not been measured or that it cannot be measured in a meaningful way.

Some processors have a pipelined execution unit that is smaller than the largest register size, so that different parts of the operand are calculated at different times. Assume, for example, that we have a long dependency chain of 128-bit vector instructions running in a fully pipelined 64-bit execution unit with a latency of 4. The lower 64 bits of each operation will be calculated at times 0, 4, 8, 12, 16, etc., and the upper 64 bits of each operation will be calculated at times 1, 5, 9, 13, 17, etc., as shown in the figure below. If we look at one 128-bit instruction in isolation, the latency will be 5. But if we look at a long chain of 128-bit instructions, the total latency will be 4 clock cycles per instruction plus one extra clock cycle at the end. The latency in this case is listed as 4 in the tables because this is the value it adds to a dependency chain.
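The timing in this example can be reproduced with a few lines of arithmetic. The sketch below assumes the model described in the text (a 64-bit unit with latency 4, the upper half starting one clock after the lower half); the function names are hypothetical.

```python
def half_start_times(n_instructions, latency=4):
    """Start times of the lower and upper 64-bit halves in a dependency chain."""
    # Each lower half must wait for the lower half of the previous instruction.
    lower = [i * latency for i in range(n_instructions)]
    # The upper half enters the pipelined unit one clock after the lower half.
    upper = [t + 1 for t in lower]
    return lower, upper


def chain_total_latency(n_instructions, latency=4):
    """Clock cycles until the upper half of the last instruction is finished."""
    _, upper = half_start_times(n_instructions, latency)
    return upper[-1] + latency


if __name__ == "__main__":
    lower, upper = half_start_times(5)
    print(lower)                     # [0, 4, 8, 12, 16]
    print(upper)                     # [1, 5, 9, 13, 17]
    print(chain_total_latency(1))    # 5
    print(chain_total_latency(100))  # 401
```

For a chain of n instructions the total is 4n + 1, so each instruction contributes 4 clock cycles to the chain, which is the value listed.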

Reciprocal throughput

The throughput is the maximum number of instructions of the same kind that can be executed per clock cycle when the operands of each instruction are independent of the preceding instructions. The values listed are the reciprocals of the throughputs, i.e. the average number of clock cycles per instruction when the instructions are not part of a limiting dependency chain. For example, a reciprocal throughput of 2 for FMUL means that a new FMUL instruction can start executing 2 clock cycles after a previous FMUL. A reciprocal throughput of 0.33 for ADD means that the execution units can handle 3 integer additions per clock cycle.
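As a rough illustration of how the two figures are used, the sketch below converts a reciprocal throughput to instructions per clock cycle and estimates the cycle count for a run of identical instructions. It is a simplified model, not a measurement method: the function names are hypothetical, and the latency of 5 is an assumed value chosen for the example, not one taken from the tables.

```python
def instructions_per_cycle(recip_throughput):
    """Convert a reciprocal throughput (cycles per instruction) to a throughput."""
    return 1.0 / recip_throughput


def estimated_cycles(count, latency, recip_throughput, dependent):
    """Rough cycle estimate for `count` instructions of the same kind.

    dependent=True: each instruction needs the previous result, so the
    latency is the limiting figure.
    dependent=False: operands are independent, so the reciprocal
    throughput is the limiting figure.
    """
    per_instruction = latency if dependent else recip_throughput
    return count * per_instruction


if __name__ == "__main__":
    print(instructions_per_cycle(2))            # 0.5 FMUL per clock cycle
    print(round(instructions_per_cycle(0.33)))  # 3 ADD per clock cycle
    hypothetical_latency = 5  # assumed for illustration only
    print(estimated_cycles(1000, hypothetical_latency, 2, dependent=True))   # 5000
    print(estimated_cycles(1000, hypothetical_latency, 2, dependent=False))  # 2000
```

Both estimates are in clock cycles per instruction times the instruction count, so the latency-bound and throughput-bound cases can be compared directly.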

The reason for listing the reciprocal values is that this makes comparisons between latency and throughput easier. The reciprocal throughput is also called issue latency. The values listed are for a single thread or a single core. A missing value in the table means that the value has not been measured.
