03.03.2013 Views

4 Instruction tables - Agner Fog


AMD Bulldozer

List of instruction timings and macro-operation breakdown

Explanation of column headings:

Instruction:

Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands:

i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, x = 128-bit xmm register, y = 256-bit ymm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops:

Number of macro-operations issued from the instruction decoder to the schedulers. Instructions with more than 2 macro-operations use microcode.

Latency:

This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. Where the listing for register and memory operands is joined (r/m), the latency listed does not include the memory operand.

Reciprocal throughput:

This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.

Execution pipe:

Indicates which execution pipe or unit is used for the macro-operations:

Integer pipes:

EX0: integer ALU, division

EX1: integer ALU, multiplication, jump

EX01: can use either EX0 or EX1

AG01: address generation unit 0 or 1

Floating point and vector pipes:

P0: floating point add, mul, div, convert, shuffle, shift

P1: floating point add, mul, div, shuffle, shift

P2: move, integer add, boolean

P3: move, integer add, boolean, store

P01: can use either P0 or P1

P23: can use either P2 or P3

Two macro-operations can execute simultaneously if they go to different execution pipes.

Domain:

Tells which execution unit domain is used:

ivec: integer vector execution unit.

fp: floating point execution unit.

fma: floating point multiply/add subunit.

inherit: the output operand inherits the domain of the input operand.

ivec/fma means the input goes to the ivec domain and the output comes from the fma domain.

There is an additional latency of 1 clock cycle if the output of an ivec instruction goes to the input of an fp or fma instruction, and when the output of an fp or fma instruction goes to the input of an ivec or store instruction. There is no latency between the fp and fma units. All other latencies after memory load and before memory store instructions are included in the latency counts.

An fma instruction has a latency of 5 if the output goes to another fma instruction, 6 if the output goes to an fp instruction, and 6+1 if the output goes to an ivec or store instruction.
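The latency, reciprocal throughput, and domain rules above can be combined into a rough lower-bound estimate. The following Python sketch is only an illustration: the names `FMA_LATENCY`, `fma_chain_latency`, `domain_crossing_penalty`, `throughput_bound`, and `estimate_cycles` are hypothetical and not part of the tables, and the model ignores every other pipeline bottleneck. It uses only the numbers quoted above (fma latency 5, 6, and 6+1; one extra cycle when crossing between ivec and fp/fma).

```python
from fractions import Fraction

# Latency of an fma-domain instruction, keyed by the domain of the
# instruction that consumes its result. Values are those quoted in the
# text (5, 6, and 6+1); the 6+1 entries already include the crossing cycle.
FMA_LATENCY = {"fma": 5, "fp": 6, "ivec": 6 + 1, "store": 6 + 1}


def fma_chain_latency(consumers):
    """Total latency of a dependency chain of fma instructions.

    `consumers` lists, for each fma instruction in the chain, the domain
    of the instruction that reads its output.
    """
    return sum(FMA_LATENCY[c] for c in consumers)


def domain_crossing_penalty(producer, consumer):
    """Extra clock cycles when a result moves between domains.

    Do not add this on top of FMA_LATENCY, which already includes it.
    """
    fp_like = {"fp", "fma"}
    if producer == "ivec" and consumer in fp_like:
        return 1
    if producer in fp_like and consumer in {"ivec", "store"}:
        return 1
    return 0  # same domain, or between the fp and fma units


def throughput_bound(count, reciprocal_throughput):
    """Minimum cycles to issue `count` independent instructions of one kind."""
    return count * Fraction(reciprocal_throughput)


def estimate_cycles(chain_latency, count, reciprocal_throughput):
    """Simplified lower bound: the larger of the latency and throughput limits."""
    return max(Fraction(chain_latency),
               throughput_bound(count, reciprocal_throughput))


# Three fma results feeding further fma instructions, the last one stored:
chain = fma_chain_latency(["fma", "fma", "fma", "store"])  # 5 + 5 + 5 + 7

# Twelve independent instructions with a reciprocal throughput of 1/3:
issue = throughput_bound(12, "1/3")  # 12 * 1/3 cycles
```

Taking the maximum of the two limits reflects that a dependency chain is bound by latency while independent instructions are bound by throughput. Real code may run slower than either limit for the reasons noted under Latency and Reciprocal throughput.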
