
GPUWattch + GPGPU-Sim:

An Integrated Framework for Performance and Energy Optimizations in Manycore Architectures

Jingwen Leng (The University of Texas at Austin)

Tayler Hetherington (University of British Columbia)

Version of simulator corresponding to these slides = GPGPU-Sim 3.2.1

Website: gpgpu-sim.org (/gpuwattch)


Tutorial Goals

• Make you more effective in your research using GPGPU-Sim & GPUWattch

§ Feel free to ask questions when you have them

• After this tutorial, you will be able to:

§ Describe what GPGPU-Sim simulates

§ New: Describe what GPUWattch simulates

§ Set up GPGPU-Sim/GPUWattch and run CUDA/OpenCL applications on it

§ Extend GPGPU-Sim/GPUWattch for your own research

April 2013 GPGPU-Sim/GPUWattch Tutorial (ISPASS 2013) 1.2


Quick Survey

• How many of you are:

§ Graduate students or Faculty members?

§ Working for industry?

• Have you written a CUDA or OpenCL program before?

• Have you used GPGPU-Sim/GPUWattch?


Overview

1 GPGPU-Sim | Brief Background on GPU Computing | 40 min

2 GPGPU-Sim | Overview | 30 min

3 GPUWattch | Power Basics | 20 min

Coffee Break (10:00 – 10:30am)

4 GPUWattch | Details of GPUWattch Modeling I | 30 min

5 GPUWattch | Details of GPUWattch Modeling II | 35 min

6 Demo: Setup and Run | 15 min

7 Wrap Up and Discussion | 10 min

Lunch (12:00 – 1:00pm)


What is a GPU?

§ Optimized for Highly Parallel Workloads

§ Highly Programmable

§ Commodity Hardware (“Desktop Supercomputing”)

§ 512 ALUs


GPU Computing

4 core CPU + 1536 core GPU

§ Heterogeneous computing


Why GPU?

*Slides from GTC 2011, GPU Computing: Past, Present and Future, David Luebke, NVIDIA

*Slide from AFDS 2011, The Programmer’s Guide to the APU Galaxy, Phil Rogers, AMD


Why GPU?

• OpenCL supported GPUs (besides AMD and NVIDIA)

§ Adreno™ 3xx GPU from Qualcomm

§ Mali™-T600 Series GPUs from ARM

§ HD 4000 on Intel’s Ivy Bridge

§ Intel Xeon Phi (Knights Corner)

• GPU Computing is gaining broad industry support.


Programming Model

§ Producer / Consumer

§ CPU offloads data-parallel code sections onto the GPU

§ GPU = computation workhorse

§ CPU = sequential code “accelerator” and I/O offload engine


GPU Microarchitecture Overview (10,000 feet)

[Figure: block diagram of a GPU. SIMT core clusters, each containing SIMT cores, connect through an interconnection network to memory partitions, each backed by off-chip GDDR3/GDDR5 DRAM. SIMT = Single-Instruction, Multiple-Threads.]


CUDA and OpenCL

§ This tutorial focuses on CUDA

• More applications today


CUDA Thread Hierarchy

• Kernel = grid of blocks of warps of scalar threads

• Thread blocks (CTAs) contain up to 1024 threads

• Threads are grouped into warps in hardware

• Each block is dispatched to a SIMT core as a unit of work: all of its warps run in the core’s pipeline until they are all done.

[Figure: thread blocks (CTAs) assigned to a SIMT core, each block divided into warps of 32 threads. Source: NVIDIA]
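The block-to-warp grouping can be made concrete with a little arithmetic. The sketch below assumes the 32-thread warp width shown in the figure and the 1024-thread block limit from the slide; the function name `warps_per_block` is made up for illustration and is not part of CUDA or GPGPU-Sim.

```cpp
#include <cassert>

// Warp width shown on the slide (32 threads per warp).
constexpr int kWarpSize = 32;

// Number of warps needed to hold one thread block (CTA).
// A partially filled last warp still occupies a whole warp.
int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0 && threads_per_block <= 1024);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;  // round up
}
```

For example, a 128-thread block maps to 4 warps, while a 33-thread block still needs 2 warps because the second warp is only partly filled.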


Warp = SIMT Execution of Scalar Threads

• Warp = Scalar threads grouped to execute in lockstep

• SIMT vs SIMD

§ SIMD: HW pipeline width must be known by SW

§ SIMT: Pipeline width hidden from SW (★)

[Figure: a thread warp made of scalar threads W, X, Y, Z sharing a common PC; thread warps 3, 8 and 7 feed the SIMT pipeline.]

(★) Can still write software that assumes threads in a warp execute in lockstep (e.g. see reduction in NVIDIA SDK)


SIMT Execution Model

foo[] = {4,8,12,16};
A: v = foo[tid.x];
B: if (v < 10)
C:     v = 0;
   else
D:     v = 10;
E: w = bar[tid.x]+v;

Execution order over time, for threads T1 to T4:

A T1 T2 T3 T4
B T1 T2 T3 T4
C T1 T2
D T3 T4
E T1 T2 T3 T4
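The divergence pattern in this example can be reproduced on a CPU by tracking one active mask per branch side. This is a minimal sketch, not GPGPU-Sim code: the names `run_warp`, `DivergenceTrace` and `Lanes` are invented here, and the warp is four lanes wide only to match the slide.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWarpWidth = 4;  // four threads, as in the slide's example
using Lanes = std::array<int, kWarpWidth>;

struct DivergenceTrace {
    std::uint32_t mask_c;  // lanes that execute block C (v < 10)
    std::uint32_t mask_d;  // lanes that execute block D (else side)
    Lanes w;               // per-lane result of block E
};

DivergenceTrace run_warp(const Lanes& foo, const Lanes& bar) {
    const std::uint32_t full_mask = (1u << kWarpWidth) - 1u;
    Lanes v{};
    // A: all lanes active.
    for (int tid = 0; tid < kWarpWidth; ++tid) v[tid] = foo[tid];
    // B: evaluate the branch condition per lane and build the taken mask.
    std::uint32_t mask_c = 0;
    for (int tid = 0; tid < kWarpWidth; ++tid)
        if (v[tid] < 10) mask_c |= 1u << tid;
    const std::uint32_t mask_d = ~mask_c & full_mask;
    // C: only lanes in mask_c run; the others are masked off.
    for (int tid = 0; tid < kWarpWidth; ++tid)
        if ((mask_c >> tid) & 1u) v[tid] = 0;
    // D: only lanes in mask_d run.
    for (int tid = 0; tid < kWarpWidth; ++tid)
        if ((mask_d >> tid) & 1u) v[tid] = 10;
    // E: the two sides have reconverged, so all lanes are active again.
    DivergenceTrace t{mask_c, mask_d, {}};
    for (int tid = 0; tid < kWarpWidth; ++tid) t.w[tid] = bar[tid] + v[tid];
    assert((t.mask_c | t.mask_d) == full_mask);
    return t;
}
```

With `foo = {4, 8, 12, 16}`, lanes 0 and 1 (T1, T2) take C and lanes 2 and 3 (T3, T4) take D, which is the split shown in the schedule above.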


CUDA Syntax Highlights

__global__ void foo(...); // runs on GPU, callable from CPU
__device__ void bar(...); // function callable from a GPU thread
foo<<<500, 128>>>(...); // 500 blocks, 128 threads each
dim3 threadIdx; dim3 blockIdx; dim3 blockDim;


CUDA Example Code

Standard C Code

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

CUDA code

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

Source: High Performance Computing with CUDA, SC09 Tutorial, David Luebke, NVIDIA
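The global index computed from `blockIdx.x`, `blockDim.x` and `threadIdx.x` can be checked on a CPU by looping over every (block, thread) pair of a 1-D launch. The sketch below is a host-side emulation under that assumption, not a CUDA kernel; `saxpy_emulated` and its `block_dim` parameter are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a 1-D launch: each (block, thread) pair computes
// the same global index a GPU thread would compute in the kernel.
void saxpy_emulated(int n, float a, const std::vector<float>& x,
                    std::vector<float>& y, int block_dim) {
    assert(n >= 0 && block_dim > 0);
    assert(x.size() >= static_cast<std::size_t>(n));
    assert(y.size() >= static_cast<std::size_t>(n));
    const int grid_dim = (n + block_dim - 1) / block_dim;  // enough blocks to cover n
    for (int block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            const int i = block_idx * block_dim + thread_idx;
            if (i < n) y[i] = a * x[i] + y[i];  // threads past the end do nothing
        }
    }
}
```

With `n = 5` and `block_dim = 4`, two blocks are needed and the last three threads of the second block fail the bounds check, which is why the guard on `i` matters.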


GPU Microarchitecture Overview

[Figure: block diagram of a GPU. SIMT core clusters, each containing SIMT cores, connect through an interconnection network to memory partitions, each backed by off-chip GDDR3/GDDR5 DRAM.]


Inside a SIMT Core

§ Interleave warp execution to hide latency

§ Register values of all threads stay in the core

[Figure: SIMT front end (Fetch, Decode, Schedule, Branch) feeding a SIMD datapath with a register file; the memory subsystem (SMem, L1 D$, Tex $, Const$) connects to the interconnection network; a "Done (Warp ID)" signal returns to the front end.]


Inside a SIMT Core (2.0)

§ Add extra stage for Register Read

§ Add fine-grained multithreading

§ Add SIMT stacks

[Figure: pipeline stages: Schedule + Fetch, Decode, Register Read, Execute, Memory, Writeback.]


Inside a SIMT Core (3.0)

§ Three decoupled warp schedulers

§ Scoreboard

§ Operand collector

§ Multiple SIMD functional units

[Figure: SIMT front end (Fetch, I-Cache, Decode, I-Buffer with Valid[1:N], Scoreboard, Issue, SIMT-Stack with Branch Target PC, Active Mask and Pred.) with Schedulers 1 to 3, an Operand Collector, and a SIMD datapath with ALU, ALU and MEM units; a "Done (WID)" signal returns to the front end.]
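The slide names a scoreboard without detailing it. As a generic illustration of the idea (not GPGPU-Sim's implementation), a scoreboard can track, per warp, the registers that still have a write in flight and block the issue of any instruction that touches one of them. The class and method names below are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Generic per-warp scoreboard sketch: a register is "pending" from the
// moment an instruction that writes it issues until that write completes.
class Scoreboard {
public:
    explicit Scoreboard(int num_warps) : pending_(num_warps) {}

    // True if any register the instruction reads or writes is still pending.
    bool has_hazard(int warp, const std::vector<int>& regs) const {
        assert(warp >= 0 && static_cast<std::size_t>(warp) < pending_.size());
        for (int r : regs)
            if (pending_[warp].count(r) != 0) return true;
        return false;
    }

    // Called at issue for the destination register.
    void reserve(int warp, int dst) { pending_.at(warp).insert(dst); }

    // Called at writeback once the result is available.
    void release(int warp, int dst) { pending_.at(warp).erase(dst); }

private:
    std::vector<std::set<int>> pending_;  // one pending-register set per warp
};
```

Because the bookkeeping is per warp, a stalled warp does not stop the schedulers from issuing instructions from other warps, which is how interleaving hides latency.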


CUDA Memory Model

§ Local

§ Shared

§ Global

§ Constant

§ Texture

Source: CUDA programming manual


GPU Microarchitecture Overview

[Figure: block diagram of a GPU. SIMT core clusters, each containing SIMT cores, connect through an interconnection network to memory partitions, each backed by off-chip GDDR3/GDDR5 DRAM.]


Memory Partition

• Services memory requests (Load/Store/AtomicOp)

§ Contains L2 cache bank, DRAM timing model

§ Models Raster Operations Pipeline (ROP) latency


GPGPU-Sim in a Nutshell

§ New: Power Model: GPUWattch


Accuracy

[Figure: scatter plot of GPGPU-Sim IPC versus hardware IPC (both axes 0 to 200) for the RODINIA benchmark suite on a Quadro FX5800 using SASS. GPGPU-Sim 3.1.0, correlation: 98.37%. Labeled points: Similarity Score copyChunks_kernel(), Back Propagation bpnn_layerforward_CUDA(), HotSpot calculate_temp().]


Accuracy

[Figure: scatter plot of GPGPU-Sim IPC versus hardware IPC (both axes 0 to 500) for the RODINIA benchmark suite on a Tesla C2050 (Fermi) using SASS. GPGPU-Sim 3.1.0, correlation: 97.35%.]
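The slides do not say how the correlation figure is computed. Assuming it is the Pearson correlation coefficient between per-kernel hardware IPC and simulated IPC, it could be computed along these lines; the function name `pearson` and any inputs passed to it are illustrative, not data from the tutorial.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation between hardware IPC and simulated IPC,
// with one entry per benchmark kernel.
double pearson(const std::vector<double>& hw, const std::vector<double>& sim) {
    assert(hw.size() == sim.size() && hw.size() >= 2);
    const double n = static_cast<double>(hw.size());
    double mean_hw = 0.0, mean_sim = 0.0;
    for (std::size_t i = 0; i < hw.size(); ++i) {
        mean_hw += hw[i];
        mean_sim += sim[i];
    }
    mean_hw /= n;
    mean_sim /= n;
    double cov = 0.0, var_hw = 0.0, var_sim = 0.0;
    for (std::size_t i = 0; i < hw.size(); ++i) {
        const double dh = hw[i] - mean_hw;
        const double ds = sim[i] - mean_sim;
        cov += dh * ds;
        var_hw += dh * dh;
        var_sim += ds * ds;
    }
    assert(var_hw > 0.0 && var_sim > 0.0);  // undefined for constant inputs
    return cov / std::sqrt(var_hw * var_sim);
}
```

A high correlation means the simulator tracks relative performance trends across kernels; it does not by itself bound the absolute IPC error of any single kernel.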


Accuracy (Average Power)

[Figure: scatter plot of estimated power (W) versus measured power (W), both axes 0 to 250, for an NVIDIA GTX 480. Average absolute error ≈ 12%.]
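One plausible reading of "average absolute error" is the mean, over benchmarks, of the absolute difference between estimated and measured power relative to the measured value. The sketch below computes that quantity under this assumption; the function name is invented and no values from the tutorial are used.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean absolute relative error of estimated power against measured power,
// one entry per benchmark. Returns a fraction (0.12 means 12%).
double mean_abs_rel_error(const std::vector<double>& measured,
                          const std::vector<double>& estimated) {
    assert(measured.size() == estimated.size() && !measured.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < measured.size(); ++i) {
        assert(measured[i] > 0.0);
        sum += std::fabs(estimated[i] - measured[i]) / measured[i];
    }
    return sum / static_cast<double>(measured.size());
}
```

Taking the absolute value before averaging keeps overestimates and underestimates from cancelling each other out.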


Dependencies

§ GCC, Make, etc.


Citation

Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, Tor M. Aamodt, Analyzing CUDA Workloads Using a Detailed GPU Simulator, in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 163-174, Boston, MA, April 26-28, 2009.

§ E.g. “GPGPU-Sim version 3.2.1”


Citation

Jingwen Leng, Syed Gilani, Tayler Hetherington, Ahmed El-Shafiey, Nam Sung Kim, Tor M. Aamodt, Vijay Janapa Reddi, GPUWattch: Enabling Energy Optimizations in GPGPUs, in ISCA 2013

§ E.g. “GPUWattch version 1.0”


Session Summary

• GPU Computing

• CUDA Programming Model Concepts

§ Thread Hierarchy

§ Memory Spaces

§ SIMT Execution Model

• GPGPU-Sim: Timing + power simulator of modern GPUs

§ Good accuracy

§ Runs on systems without HW GPUs




Outline

§ Functional model for PTX/SASS + CUDA/OpenCL

§ Timing model for the compute part of a GPU

§ New: Power model: GPUWattch


Session Objective

• After this session, you will be able to:

1. Summarize what GPGPU-Sim simulates

2. Describe how GPGPU-Sim interfaces with CUDA applications and supports SASS

3. Summarize the advances between GPGPU-Sim 2.1.1b and 3.2.1


What GPGPU-Sim Simulates

§ PTX = Parallel Thread eXecution

• A scalar, low-level, data-parallel virtual ISA defined by NVIDIA

§ SASS = Native ISA for NVIDIA GPUs

§ Not DirectX, Not shader model N, Not AMD’s ISA, Not x86, Not Larrabee. Only PTX or SASS.

§ Not for CPU or PCIe

§ Only models microarchitecture timing relevant to GPU compute

§ Other parts idle when GPU is running compute kernels


What GPGPU-Sim Simulates

Functional Model


Functional Model (PTX)

§ Instruction level

§ Unlimited registers

§ Parallel threads running in blocks; barrier synchronization instruction

§ SIMT execution model

[Figure: compilation flow. .cu files (via NVCC) and .cl files (via the OpenCL driver) are compiled to PTX; ptxas then targets G80, GT200, Fermi or Kepler.]


Functional Model (PTX)

• Scalar PTX ISA

• Scalar control flow (if-branch, for-loops)

• Parallel Intrinsic (__syncthreads())

• Register allocation not done in PTX

CUDA source:

for (int d = blockDim.x; d > 0; d /= 2)
{
    __syncthreads();
    if (tid < d) {
        float f0 = shared[tid];
        float f1 = shared[tid + d];
        if (f1 < f0)
            shared[tid] = f1;
    }
}

Corresponding PTX:

// some initialization code omitted
$Lt_0_6146:
    bar.sync 0;
    setp.le.s32 %p3, %r7, %r1;
    @%p3 bra $Lt_0_6402;
    ld.shared.f32 %f3, [%rd9+0];
    add.s32 %r9, %r7, %r1;
    cvt.s64.s32 %rd18, %r9;
    mul.lo.u64 %rd19, %rd18, 4;
    add.u64 %rd20, %rd6, %rd19;
    ld.shared.f32 %f4, [%rd20+0];
    setp.gt.f32 %p4, %f3, %f4;
    @!%p4 bra $Lt_0_6914;
    st.shared.f32 [%rd9+0], %f4;
$Lt_0_6914:
$Lt_0_6402:
    shr.s32 %r10, %r7, 31;
    mov.s32 %r11, 1;
    and.b32 %r12, %r10, %r11;
    add.s32 %r13, %r12, %r7;
    shr.s32 %r7, %r13, 1;
    mov.u32 %r14, 0;
    setp.gt.s32 %p5, %r7, %r14;
    @%p5 bra $Lt_0_6146;
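The reduction loop on this slide can be emulated on a CPU to see what the barrier buys. The sketch below is a host-side model under stated assumptions: `block_min_emulated` is an invented name, the block size is taken to be a power of two, and the shared array is taken to hold `2 * block_dim` elements because the first iteration reads `shared[tid + blockDim.x]`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of the slide's min-reduction over a shared array.
// Running the inner loop to completion before halving d plays the role
// of __syncthreads(): no thread starts a step before all finish the last.
float block_min_emulated(std::vector<float> shared, int block_dim) {
    assert(block_dim > 0 && (block_dim & (block_dim - 1)) == 0);  // power of two
    assert(shared.size() >= static_cast<std::size_t>(2 * block_dim));
    for (int d = block_dim; d > 0; d /= 2) {
        for (int tid = 0; tid < block_dim; ++tid) {
            if (tid < d) {
                const float f0 = shared[tid];
                const float f1 = shared[tid + d];
                if (f1 < f0) shared[tid] = f1;
            }
        }
    }
    return shared[0];  // minimum of the first 2 * block_dim elements
}
```

Within one step every active thread writes only `shared[tid]` with `tid < d` and reads `shared[tid + d]` with `tid + d >= d`, so the writes of a step never touch the elements other threads read in that same step.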


Functional Model (SASS)

§ Better correlation with HW GPU

§ “SASS” is what NVIDIA’s cuobjdump calls it (note: some NVIDIA SM architects are unaware of this :-)

[Figure: conversion flow: CUDA executable, cuobjdump, SASS, conversion, PTXPlus.]


When to use SASS?

§ ptxas reschedules instructions after converting PTX to SASS to increase computation-memory overlap.

§ It also converts short branches into predicated instructions.

§ In SASS (for Quadro FX 5800), shared memory and constant memory can be accessed directly as an operand of an instruction.
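Converting a short branch into predicated instructions can be illustrated in ordinary C++. The sketch below is only an analogy for what the slide attributes to ptxas, not its actual transformation: both functions compute the same result, but the second evaluates the condition into a predicate and selects a value without any jump. The function names are made up.

```cpp
#include <cassert>

// Branching form: control flow depends on v.
int select_branchy(int v) {
    if (v < 10)
        v = 0;
    else
        v = 10;
    return v;
}

// Predicated form: the comparison sets a predicate, and both candidate
// values are combined with bit masks derived from it.
int select_predicated(int v) {
    const bool p = v < 10;                  // predicate
    const int mask = -static_cast<int>(p);  // all ones if p, else zero
    const int then_val = 0;
    const int else_val = 10;
    return (then_val & mask) | (else_val & ~mask);
}
```

In a SIMT pipeline the predicated form means lanes on different sides of the condition never split into separately scheduled paths for such a short region.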


PTX vs. SASS

PTX

$Lt_25_13570:
    ld.global.s32 %r9, [%rd5+0];
    add.s32 %r10, %r9, %r8;
    ld.global.s32 %r11, [%rd5+1024];
    add.s32 %r8, %r11, %r10;
    add.u32 %r5, %r7, %r5;
    add.u64 %rd5, %rd5, %rd6;
    ld.param.u32 %r6, [size];
    setp.lt.u32 %p2, %r5, %r6;
    @%p2 bra $Lt_25_13570;
    ...
    mov.u32 %r12, 127;
    setp.gt.u32 %p3, %r3, %r12;
    @%p3 bra $Lt_25_14082;
    ld.shared.s32 %r13, [%rd10+512];
    add.s32 %r8, %r13, %r8;
    st.shared.s32 [%rd10+0], %r8;
$Lt_25_14082:
    bar.sync 0;

SASS (PTXPlus)

l0x00000060:
    add.half.u32 $r7, $r4, 0x00000400;
    ld.global.u32 $r8, [$r4];
    ld.global.u32 $r7, [$r7];
    add.half.u32 $r0, $r5, $r0;
    add.half.u32 $r6, $r8, $r6;
    set.gt.u32.u32 $p0/$o127, s[0x0020], $r0;
    add.half.u32 $r6, $r7, $r6;
    add.half.u32 $r4, $r4, $r3;
    @$p0.ne bra l0x00000060;
    ...
    set.gt.u32.u32 $p0/$o127, $r2, const [0x0000];
    @$p0.equ add.u32 $ofs2, $ofs1, 0x00000230;
    @$p0.equ add.u32 $r6, s[$ofs2+0x0000], $r6;
    @$p0.equ mov.u32 s[$ofs1+0x0030], $r6;
    bar.sync 0x00000000;


What GPGPU-Sim Simulates

Timing Model


Timing Model for Compute Parts of a GPU

• GPGPU-Sim models timing for:

§ SIMT Core (SM, SIMD Unit)

§ Caches (Texture, Constant, …)

§ Interconnection Network

§ Memory Partition

§ Graphics DRAM

• It does NOT model timing for:

§ CPU, PCIe

§ Graphics Specific HW (Rasterizer, Clipping, Display… etc.)

[Figure: system block diagram. The GPU contains SIMT cores, caches, the interconnect, memory partitions and graphics hardware (raster…), attached to graphics DRAM; the CPU connects over PCIe.]


Timing Model for GPU Micro-architecture

• GPGPU-Sim simulates the timing model of a GPU running each launched CUDA kernel.

§ Reports # cycles spent running the kernels.

§ Excludes any time spent on data transfer on the PCIe bus.

§ CPU may run concurrently with asynchronous kernel launches.

[Figure: timelines comparing CPU and GPU activity on GPU hardware and under GPGPU-Sim, for synchronous (blocking) and asynchronous kernel launches.]


Timing Model for GPU Micro-architecture

§ Cycle-level model for each part of the microarchitecture

§ Research focused

• Ignoring rare corner cases to reduce complexity

§ CUDA manual provides some hints. NVIDIA IEEE Micro articles provide other hints. In most cases we can only guess at details. Guesses “informed” by studying patents and microbenchmarking.

GPGPU-Sim w/ SASS is ~0.98 correlated to the real HW.


What GPGPU-Sim Simulates

Power Model: GPUWattch


New: Power Model GPUWattch

[Figure: data flow. The GPGPU-Sim timing model produces microarchitectural activities (performance counters); the GPUWattch power model (McPAT++) turns them into a power estimation.]
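The flow from activity counters to a power estimate can be sketched with a generic activity-based model: dynamic energy is the sum of event counts times a per-event energy, divided by the length of the sampling window, plus a static term. This is a simplified illustration under those assumptions, not GPUWattch's actual model; the struct, the function and all values passed to them are hypothetical.

```cpp
#include <cassert>
#include <vector>

// One microarchitectural component as seen by a generic power model.
struct ComponentActivity {
    double access_count;       // events reported by the timing model in the window
    double energy_per_access;  // joules per event (illustrative parameter)
};

// Average power over a sampling window: static power plus dynamic energy
// divided by the window length.
double estimate_power_watts(const std::vector<ComponentActivity>& components,
                            double static_power_watts,
                            double window_seconds) {
    assert(window_seconds > 0.0);
    double dynamic_energy_joules = 0.0;
    for (const ComponentActivity& c : components)
        dynamic_energy_joules += c.access_count * c.energy_per_access;
    return static_power_watts + dynamic_energy_joules / window_seconds;
}
```

The key design point is the split of responsibilities: the timing model only has to count events, and the power model only has to know what each event costs.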


Interfacing GPGPU-Sim to Applications

§ libcudart.so ← CUDA runtime API

§ libOpenCL.so ← OpenCL API

§ Need a config file (gpgpusim.config), an interconnection config file and a GPUWattch config as well

We provide the config files for modeling:

- Quadro FX 5800 (GT200)

- GeForce GTX 480 and Tesla C2050 (Fermi)


Debugging and Visualization

• GPGPU-Sim provides tools to debug and visualize simulated GPU behavior.

§ GDB macros: Cycle-level debugging

§ AerialVision: High-level performance dynamics


GPGPU-Sim 3.2.1

§ Refactored for C++ Object-Oriented Implementation

§ Redesigned Timing Models

• SIMT Core model, Cache models, GDDR5 timing … (later)

§ Asynchronous Kernel Calls

§ Concurrent Kernel Execution

§ Support for CUDA 3.1, 4.0 and 4.2


GPGPU-Sim 3.2.1

§ Updated timing model to model Fermi more accurately

§ Much more robust SASS support

§ Support for CUDA 4.0 (New runtime flow)

§ Support for CUDA 4.1 and 4.2 (Robust runtime flow)

§ Support for OpenCL with newer NVIDIA drivers

§ Two-Level Warp Scheduler from ISCA 2012 Tutorial

§ Experimental Support for Libraries (CUBLAS, CUFFT)

§ Redesigned Cache Model

§ Power Model: GPUWattch


Roadmap

• Unified timing model framework

§ From simple (~v2.x) to detailed (v3.x)

• Fermi SASS (HW ISA) support

• AMD Graphics Core Next (GCN) ISA

• Kepler Model (HW ISA and timing)


Session Summary

• GPGPU-Sim simulates

§ PTX/SASS

§ Timing Model for GPU Compute

§ Power Model: GPUWattch

• It interfaces with CUDA/OpenCL applications via a shared runtime library

• Enhancements in GPGPU-Sim 3.2.1
