
GPUWattch + GPGPU-Sim: 

An Integrated Framework for Performance and 

Energy Optimizations in Manycore Architectures 

Jingwen Leng (The University of Texas at Austin) 

Tayler Hetherington (University of British Columbia) 

Version of simulator corresponding to these slides: GPGPU-Sim 3.2.1

Website: gpgpu-sim.org (/gpuwattch)

Tutorial Goals 

• Make you more effective in your research 

using GPGPU-Sim & GPUWattch 

§ Feel free to ask questions when you have them 

• After this tutorial, you will be able to: 

§ Describe what GPGPU-Sim simulates 

§ New: Describe what GPUWattch simulates 

§ Set up GPGPU-Sim/GPUWattch and run CUDA/OpenCL applications on it

§ Extend GPGPU-Sim/GPUWattch for your own research 

April 2013 GPGPU-Sim/GPUWattch Tutorial (ISPASS 2013)

Quick Survey 

• How many of you are: 

§ Graduate students or Faculty members? 

§ Working for industry? 

• Have you written a CUDA or OpenCL 

program before? 

• Have you used GPGPU-Sim/GPUWattch? 


Overview 

1 GPGPU-Sim | Brief Background on GPU Computing (40 min)
2 GPGPU-Sim | Overview (30 min)
3 GPUWattch | Power Basics (20 min)
Coffee Break (10:00 – 10:30am)
4 GPUWattch | Details of GPUWattch Modeling I (30 min)
5 GPUWattch | Details of GPUWattch Modeling II (35 min)
6 Demo: Setup and Run (15 min)
7 Wrap Up and Discussion (10 min)
Lunch (12:00 – 1:00pm)


What is a GPU? 

§ Optimized for Highly Parallel Workloads 

§ Highly Programmable 

§ Commodity Hardware (“Desktop Supercomputing”) 

§ 512 ALUs 


GPU Computing 

4 core CPU + 1536 core GPU 

§ Heterogeneous computing 



Why GPU? 

*Slide from GTC 2011, GPU Computing: Past, Present and Future, David Luebke, NVIDIA



Why GPU? 

*Slide from AFDS 2011, The Programmer’s Guide to the APU Galaxy, Phil Rogers, AMD 



Why GPU? 

• OpenCL supported GPUs 

(besides AMD and NVIDIA) 

§ Adreno™ 3xx GPU from Qualcomm

§ Mali™-T600 Series GPUs from ARM

§ HD 4000 on Intel’s Ivy Bridge 

§ Intel Xeon Phi (Knights Corner) 

• GPU Computing is gaining broad industry 

support. 


Programming Model 

§ Producer / Consumer 

§ CPU offloads data-parallel code sections onto the GPU

§ GPU = computation workhorse 

§ CPU = sequential code “accelerator” and I/O offload 

engine 



GPU Microarchitecture Overview 

(10,000 feet) 

[Figure: a GPU made of SIMT core clusters, each holding several SIMT cores (SIMT = Single-Instruction, Multiple-Threads). An interconnection network links the clusters to the memory partitions, and each memory partition attaches to off-chip GDDR3/GDDR5 DRAM.]

CUDA and OpenCL 

§ This tutorial focuses on CUDA

• More applications today 



CUDA Thread Hierarchy 

• kernel = grid of blocks of warps 

of scalar threads 

• Thread blocks (CTAs) contain up to 1024 threads

• Threads are grouped into warps 

in hardware 

[Figure: a SIMT core holding several thread blocks (CTAs); each block is divided into warps of 32 threads. Source: NVIDIA]

Each block is dispatched to a SIMT core as a unit of work: all of its warps run in the core’s pipeline until they are all done.
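The block-to-warp grouping can be made concrete with a short sketch. It assumes one-dimensional thread blocks, where consecutive thread indices fall into the same warp; the kernel name `warp_layout` and the helper `warps_per_block` are illustrative names, not part of GPGPU-Sim or the CUDA API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Host helper (hypothetical name): warps needed to hold one thread block, rounded up.
int warps_per_block(int threads_per_block, int warp_size = 32)
{
    return (threads_per_block + warp_size - 1) / warp_size;
}

// Each thread records which warp of its block it falls in and its lane in that warp.
__global__ void warp_layout(int *warp_id, int *lane_id)
{
    int tid = threadIdx.x;                     // position within the block (CTA)
    int gid = blockIdx.x * blockDim.x + tid;   // position within the grid
    warp_id[gid] = tid / warpSize;             // warpSize is a CUDA built-in variable
    lane_id[gid] = tid % warpSize;
}

int main()
{
    const int blocks = 2, threads = 96, n = blocks * threads;
    int *d_warp = 0, *d_lane = 0;
    cudaMalloc(&d_warp, n * sizeof(int));
    cudaMalloc(&d_lane, n * sizeof(int));

    warp_layout<<<blocks, threads>>>(d_warp, d_lane);

    int h_warp[n], h_lane[n];
    cudaMemcpy(h_warp, d_warp, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_lane, d_lane, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Thread 40 of block 1 is global thread 96 + 40 = 136.
    // With a warp size of 32 it sits in warp 1 (40 / 32), lane 8 (40 % 32).
    printf("warps per block: %d\n", warps_per_block(threads));
    printf("block 1, thread 40 -> warp %d, lane %d\n", h_warp[136], h_lane[136]);

    cudaFree(d_warp);
    cudaFree(d_lane);
    return 0;
}
```

A block of 96 threads therefore occupies three warps, and a block whose size is not a multiple of 32 leaves some lanes of its last warp unused.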


Warp = SIMT Execution of 

Scalar Threads 

• Warp = Scalar threads grouped to execute in lockstep 

• SIMT vs SIMD 

§ SIMD: HW pipeline width must be known by SW 

§ SIMT: Pipeline width hidden from SW (★) 

[Figure: a thread warp groups scalar threads W, X, Y, Z under a common PC; several thread warps (e.g. warps 3, 8, 7) are interleaved on the SIMT pipeline.]

(★) Can still write software that assumes threads in a warp execute in lockstep (e.g. see reduction in NVIDIA SDK)

SIMT Execution Model 

foo[] = {4,8,12,16}; 

A: v = foo[tid.x]; 

B: if (v < 10) 

C: v = 0; 

else 

D: v = 10; 

E: w = bar[tid.x]+v; 

Active threads for each instruction, in execution order (time runs downward):

A: T1 T2 T3 T4
B: T1 T2 T3 T4
C: T1 T2
D: T3 T4
E: T1 T2 T3 T4
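The divergence example can be written as a small runnable program. This is a sketch under stated assumptions: the slide gives no contents for `bar`, so the values below are made up, and the kernel and helper names (`simt_branch`, `run_simt_branch`) are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void simt_branch(const int *foo, const int *bar, int *w)
{
    int tid = threadIdx.x;
    int v = foo[tid];        // A: all four threads active
    if (v < 10)              // B: control flow splits at the source level
        v = 0;               // C: T1, T2 (foo = 4, 8)
    else
        v = 10;              // D: T3, T4 (foo = 12, 16)
    w[tid] = bar[tid] + v;   // E: all four threads active again
}

// Runs the kernel with one block of four threads; returns 0 on success.
int run_simt_branch(int *h_w)
{
    const int n = 4;
    const int h_foo[n] = {4, 8, 12, 16};   // values from the slide
    const int h_bar[n] = {1, 2, 3, 4};     // hypothetical values, not in the slide
    int *d_foo = 0, *d_bar = 0, *d_w = 0;
    size_t bytes = n * sizeof(int);

    cudaMalloc(&d_foo, bytes);
    cudaMalloc(&d_bar, bytes);
    cudaMalloc(&d_w, bytes);
    cudaMemcpy(d_foo, h_foo, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_bar, h_bar, bytes, cudaMemcpyHostToDevice);

    simt_branch<<<1, n>>>(d_foo, d_bar, d_w);

    cudaError_t err = cudaMemcpy(h_w, d_w, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_foo);
    cudaFree(d_bar);
    cudaFree(d_w);
    return err == cudaSuccess ? 0 : 1;
}

int main()
{
    int w[4];
    if (run_simt_branch(w) != 0) return 1;
    printf("%d %d %d %d\n", w[0], w[1], w[2], w[3]);   // prints "1 2 13 14"
    return 0;
}
```

For a branch this short the compiler may emit predicated instructions instead of a real branch, so the C and D steps in the timeline describe the source-level behavior rather than a guaranteed instruction sequence.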


CUDA Syntax Highlights 

__global__ void foo(...); // runs on GPU, callable from CPU 

__device__ void bar(...); // function callable from a GPU thread 

foo<<<500, 128>>>(...); // 500 blocks, 128 threads each

dim3 threadIdx; dim3 blockIdx; dim3 blockDim; 
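Put together, these pieces might be used as in the following sketch. The function bodies are invented for illustration (the slide elides them with `...`), and `run_foo` is a hypothetical host helper.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// __device__: callable only from code running on the GPU.
__device__ int bar(int block, int block_size, int thread)
{
    return block * block_size + thread;
}

// __global__: runs on the GPU, launched from the CPU.
__global__ void foo(int *out)
{
    int i = bar(blockIdx.x, blockDim.x, threadIdx.x);   // built-in index variables
    out[i] = i;
}

// Launches foo and returns the value written by the last thread, or -1 on error.
int run_foo(int blocks, int threads)
{
    int n = blocks * threads;
    int *d_out = 0;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;

    foo<<<blocks, threads>>>(d_out);   // execution configuration: grid size, block size

    std::vector<int> h_out(n);
    cudaError_t err = cudaMemcpy(&h_out[0], d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? h_out[n - 1] : -1;
}

int main()
{
    // 500 blocks of 128 threads give 64000 threads, so the last index is 63999.
    printf("last index: %d\n", run_foo(500, 128));
    return 0;
}
```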



CUDA Example Code 

Standard C Code

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

CUDA code

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

Source: High Performance Computing with CUDA, SC09 Tutorial, David Luebke, NVIDIA
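A complete host program around a SAXPY kernel might look like the sketch below. The kernel body mirrors the serial loop; the block size of 256, the input values, and the helper name `run_saxpy` are assumptions made for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard: the last block may be only partly used
        y[i] = a * x[i] + y[i];
}

// Runs SAXPY on the GPU and returns how many elements differ from the CPU result.
int run_saxpy(int n)
{
    std::vector<float> x(n), y(n);
    for (int i = 0; i < n; ++i) { x[i] = (float)i; y[i] = 1.0f; }

    float *d_x = 0, *d_y = 0;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, &x[0], bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, &y[0], bytes, cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;   // round up so every element gets a thread
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

    cudaMemcpy(&y[0], d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (y[i] != 2.0f * i + 1.0f) ++mismatches;   // expected: 2*x[i] + 1
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", run_saxpy(1000));
    return 0;
}
```

With n = 1000 the launch uses four blocks of 256 threads, and the `i < n` guard keeps the 24 surplus threads of the last block from writing past the arrays.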



Inside a SIMT Core 

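[Figure: the SIMT front end (Fetch, Decode, Schedule, Branch) feeds a SIMD datapath with a register file; a “Done (Warp ID)” signal returns to the front end. The memory subsystem (shared memory, L1 data cache, texture cache, constant cache) connects to the interconnection network.]

§ Interleave warp execution to hide latency

§ Register values of all threads stay in core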

Inside a SIMT Core (2.0) 

§ Add extra stage for Register Read

§ Add fine-grained multithreading

§ Add SIMT stacks

[Figure: pipeline stages Schedule + Fetch, Decode, Register Read, Execute, Memory, Writeback]

Inside a SIMT Core (3.0) 

§ Three decoupled warp schedulers

§ Scoreboard

§ Operand collector

§ Multiple SIMD functional units

[Figure: the SIMT front end (Fetch, I-Cache, Decode, I-Buffer with Valid[1:N], Scoreboard, Issue, and a SIMT-Stack handling branch target PC, predicate, and active mask) feeds Schedulers 1 to 3, the operand collector, and a SIMD datapath with two ALU units and one MEM unit; “Done (WID)” returns to the front end.]

CUDA Memory Model 

§ Local 

§ Shared 

§ Global 

§ Constant 

§ Texture 


Source: CUDA programming manual 
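Four of these memory spaces appear in the sketch below; texture memory is left out to keep it short. The kernel, the `BLOCK` size, and the helper `run_reverse_scale` are illustrative names chosen for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 128

__constant__ float scale;                  // constant memory: read-only inside kernels

// in and out point to global memory, visible to all threads and to the host.
__global__ void reverse_scale(const float *in, float *out)
{
    __shared__ float tile[BLOCK];          // shared memory: one copy per thread block
    int t = threadIdx.x;                   // private per-thread variable (a register,
    int base = blockIdx.x * blockDim.x;    // or local memory if the compiler spills it)

    tile[t] = in[base + t];
    __syncthreads();                       // the whole tile is loaded before any read
    out[base + t] = scale * tile[blockDim.x - 1 - t];
}

// Reverses each BLOCK-sized chunk of 0, 1, 2, ... and scales it by 2; returns out[index].
float run_reverse_scale(int blocks, int index)
{
    int n = blocks * BLOCK;
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float h_scale = 2.0f;
    cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));

    float *d_in = 0, *d_out = 0;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, &h[0], bytes, cudaMemcpyHostToDevice);

    reverse_scale<<<blocks, BLOCK>>>(d_in, d_out);

    cudaMemcpy(&h[0], d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h[index];
}

int main()
{
    // out[0] = 2 * in[127] = 254 and out[128] = 2 * in[255] = 510
    printf("%.0f %.0f\n", run_reverse_scale(2, 0), run_reverse_scale(2, 128));
    return 0;
}
```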


Memory Partition 

• Services memory requests (Load/Store/AtomicOp)

§ Contains L2 cache bank, DRAM timing model

§ Models Raster Operations Pipeline (ROP) latency


GPGPU-Sim in a Nutshell 

§ New: Power Model: GPUWattch 



Accuracy 

[Figure: scatter plot of GPGPU-Sim IPC against hardware IPC (both axes 0 to 200) for the RODINIA benchmark suite on a Quadro FX5800 using SASS. GPGPU-Sim 3.1.0 correlation: 98.37%. Labeled points: Similarity Score copyChunks_kernel(), Back Propagation bpnn_layerforward_CUDA(), HotSpot calculate_temp().]

Accuracy 

[Figure: scatter plot of GPGPU-Sim IPC against hardware IPC (both axes 0 to 500) for the RODINIA benchmark suite on a Tesla C2050 (Fermi) using SASS. GPGPU-Sim 3.1.0 correlation: 97.35%.]

Accuracy (Average Power) 

[Figure: estimated power (W) against measured power (W), both axes 0 to 250, for the NVIDIA GTX 480. Average absolute error ≈ 12%.]
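The slides report a correlation and an average absolute error without giving formulas. The sketch below shows one plausible way to compute such figures, assuming the Pearson correlation coefficient and a mean absolute relative error; every sample number is hypothetical and not taken from the plots. The code is host-only, so it builds with nvcc or any C++ compiler.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>

// Pearson correlation coefficient between two equally sized samples.
double pearson(const std::vector<double> &x, const std::vector<double> &y)
{
    size_t n = x.size();
    double mx = 0.0, my = 0.0;
    for (size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n;
    my /= n;

    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    return sxy / std::sqrt(sxx * syy);
}

// Mean of |estimated - measured| / measured over all samples.
double mean_abs_rel_error(const std::vector<double> &est,
                          const std::vector<double> &meas)
{
    double sum = 0.0;
    for (size_t i = 0; i < est.size(); ++i)
        sum += std::fabs(est[i] - meas[i]) / meas[i];
    return sum / est.size();
}

int main()
{
    // Hypothetical per-kernel IPC values, for illustration only.
    double hw[]  = {50.0, 100.0, 150.0, 200.0};
    double sim[] = {55.0,  95.0, 160.0, 190.0};
    std::vector<double> hw_ipc(hw, hw + 4), sim_ipc(sim, sim + 4);

    printf("correlation: %.4f\n", pearson(sim_ipc, hw_ipc));
    printf("mean abs. relative error: %.4f\n", mean_abs_rel_error(sim_ipc, hw_ipc));
    return 0;
}
```

A high correlation shows that the simulator tracks relative differences between kernels, while the error metric shows how far individual estimates are from the measurements; the two can disagree, which is why both are worth reporting.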


Dependencies 

§ GCC, Make, etc. 



Citation 

Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, Tor M. Aamodt, Analyzing CUDA Workloads Using a Detailed GPU Simulator, in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 163-174, Boston, MA, April 26-28, 2009.

§ E.g. “GPGPU-Sim version 3.2.1” 


Citation 

Jingwen Leng, Syed Gilani, Tayler Hetherington, Ahmed El-Shafiey, Nam Sung Kim, Tor M. Aamodt, Vijay Janapa Reddi, GPUWattch: Enabling Energy Optimizations in GPGPUs, in ISCA 2013

§ E.g. “GPUWattch version 1.0” 


Session Summary 

• GPU Computing 

• CUDA Programming Model Concepts 

§ Thread Hierarchy 

§ Memory Spaces 

§ SIMT Execution Model 

• GPGPU-Sim: 

Timing + power simulator of modern GPUs 

§ Good accuracy 

§ Runs on systems without HW GPUs 




Outline 

§ Functional model for 

PTX/SASS + CUDA/OpenCL 

§ Timing model for the compute part of a GPU 

§ New: Power model: GPUWattch 



Session Objective 

• After this session, you will be able to: 

1. Summarize what GPGPU-Sim simulates 

2. Describe how GPGPU-Sim interfaces with CUDA 

applications and supports SASS 

3. Summarize the advances between 

GPGPU-Sim 2.1.1b and 3.2.1 



What GPGPU-Sim Simulates 

§ PTX = Parallel Thread eXecution

• A scalar low-level, data-parallel virtual ISA defined by NVIDIA

§ SASS = Native ISA for NVIDIA GPUs

§ Not DirectX, not shader model N, not AMD’s ISA, not x86, not Larrabee. Only PTX or SASS.

§ Not for CPU or PCIe

§ Only models microarchitecture timing relevant to GPU compute

§ Other parts idle when GPU is running compute kernels




Functional Model 



Functional Model (PTX) 

§ Instruction level 

§ Unlimited registers 

§ Parallel threads running in blocks; barrier 

synchronization instruction 

§ SIMT execution model 

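[Figure: compilation flow. .cu files pass through NVCC and .cl files through the OpenCL driver to produce PTX; ptxas then compiles the PTX for a specific GPU (G80, GT200, Fermi, Kepler).]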

Functional Model (PTX)

• Scalar PTX ISA

• Scalar control flow (if-branch, for-loops)

• Parallel Intrinsic (__syncthreads())

• Register allocation not done in PTX

for (int d = blockDim.x; d > 0; d /= 2)
{
    __syncthreads();
    if (tid < d) {
        float f0 = shared[tid];
        float f1 = shared[tid + d];
        if (f1 < f0)
            shared[tid] = f1;
    }
}

// some initialization code omitted

$Lt_0_6146: 

bar.sync 0; 

setp.le.s32 %p3, %r7, %r1; 

@%p3 bra $Lt_0_6402; 

ld.shared.f32 %f3, [%rd9+0]; 

add.s32 %r9, %r7, %r1; 

cvt.s64.s32 %rd18, %r9; 

mul.lo.u64 %rd19, %rd18, 4; 

add.u64 %rd20, %rd6, %rd19; 

ld.shared.f32 %f4, [%rd20+0]; 

setp.gt.f32 %p4, %f3, %f4; 

@!%p4 bra $Lt_0_6914; 

st.shared.f32 [%rd9+0], %f4; 

$Lt_0_6914: 

$Lt_0_6402: 

shr.s32 %r10, %r7, 31; 

mov.s32 %r11, 1; 

and.b32 %r12, %r10, %r11; 

add.s32 %r13, %r12, %r7; 

shr.s32 %r7, %r13, 1; 

mov.u32 %r14, 0; 

setp.gt.s32 %p5, %r7, %r14; 

@%p5 bra $Lt_0_6146; 
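The PTX above corresponds to a minimum-reduction loop over shared memory. A self-contained kernel built around such a loop might look like the sketch below. It assumes that the block size is a power of two and that each block reduces 2 × blockDim.x elements (the loop starts at d = blockDim.x and reads shared[tid + d]); the names `block_min` and `run_block_min` and the input data are illustrative.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each block finds the minimum of 2 * blockDim.x consecutive input elements.
__global__ void block_min(const float *in, float *out)
{
    extern __shared__ float shared[];            // 2 * blockDim.x floats, sized at launch
    int tid = threadIdx.x;
    int base = blockIdx.x * 2 * blockDim.x;

    shared[tid] = in[base + tid];
    shared[tid + blockDim.x] = in[base + tid + blockDim.x];

    for (int d = blockDim.x; d > 0; d /= 2) {
        __syncthreads();                         // previous step's writes are visible
        if (tid < d) {
            float f0 = shared[tid];
            float f1 = shared[tid + d];
            if (f1 < f0)
                shared[tid] = f1;                // keep the smaller of the pair
        }
    }
    if (tid == 0)
        out[blockIdx.x] = shared[0];             // thread 0 wrote shared[0] itself
}

// Reduces 128 values with one block of 64 threads and returns the minimum found.
float run_block_min()
{
    const int threads = 64, n = 2 * threads;
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = 1000.0f - i;   // smallest regular value: 873
    h[77] = -5.0f;                                    // planted minimum

    float *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, &h[0], n * sizeof(float), cudaMemcpyHostToDevice);

    block_min<<<1, threads, n * sizeof(float)>>>(d_in, d_out);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main()
{
    printf("min = %.1f\n", run_block_min());   // prints "min = -5.0"
    return 0;
}
```

The `__syncthreads()` call sits outside the `if`, so every thread of the block reaches the barrier in every iteration; this matches the `bar.sync 0` at the top of the PTX loop.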



Functional Model (SASS) 

§ Better correlation with HW GPU 

§ “SASS” is what NVIDIA’s cuobjdump calls it. Note that some NVIDIA SM architects are unaware of this :)

[Figure: CUDA executable → cuobjdump → SASS → conversion → PTXPlus]

When to use SASS? 

§ ptxas reschedules instructions after converting PTX to SASS to increase 

computation-memory overlap. 

§ It also converts short branches into predicated instructions. 

§ In SASS (for Quadro FX 5800), shared memory and constant memory can 

be accessed directly as an operand of an instruction. 



PTX vs. SASS 

PTX 

$Lt_25_13570: 

ld.global.s32 %r9, [%rd5+0]; 

add.s32 %r10, %r9, %r8; 

ld.global.s32 %r11, [%rd5+1024]; 

add.s32 %r8, %r11, %r10; 

add.u32 %r5, %r7, %r5; 

add.u64 %rd5, %rd5, %rd6; 

ld.param.u32 %r6, [size]; 

setp.lt.u32 %p2, %r5, %r6; 

@%p2 bra $Lt_25_13570; 

... 

mov.u32 %r12, 127; 

setp.gt.u32 %p3, %r3, %r12; 

@%p3 bra $Lt_25_14082; 

ld.shared.s32 %r13, [%rd10+512]; 

add.s32 %r8, %r13, %r8; 

st.shared.s32 [%rd10+0], %r8; 

$Lt_25_14082: 

bar.sync 0; 

SASS (PTXPlus) 

l0x00000060: 

add.half.u32 $r7, $r4, 0x00000400; 

ld.global.u32 $r8, [$r4]; 

ld.global.u32 $r7, [$r7]; 

add.half.u32 $r0, $r5, $r0; 

add.half.u32 $r6, $r8, $r6; 

set.gt.u32.u32 $p0/$o127, s[0x0020], $r0; 

add.half.u32 $r6, $r7, $r6; 

add.half.u32 $r4, $r4, $r3; 

@$p0.ne bra l0x00000060; 

... 

set.gt.u32.u32 $p0/$o127, $r2, const [0x0000]; 

@$p0.equ add.u32 $ofs2, $ofs1, 0x00000230; 

@$p0.equ add.u32 $r6, s[$ofs2+0x0000], $r6; 

@$p0.equ mov.u32 s[$ofs1+0x0030], $r6; 

bar.sync 0x00000000; 



Timing Model 



Timing Model for 

Compute Parts of a GPU 

• GPGPU-Sim models timing for: 

§ SIMT Core (SM, SIMD Unit) 

§ Caches (Texture, Constant, …) 

§ Interconnection Network 

§ Memory Partition 

§ Graphics DRAM 

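• It does NOT model timing for:

§ CPU, PCIe

§ Graphics-specific HW (rasterizer, clipping, display, etc.)

[Figure: system block diagram showing the CPU, PCIe, and the GPU; inside the GPU are the SIMT cores, caches, interconnect, memory partitions, graphics DRAM, and graphics hardware such as the rasterizer.]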


GPU Micro-architecture 

• GPGPU-Sim simulates the 

timing model of a GPU 

running each launched 

CUDA kernel. 

§ Reports # cycles spent 

running the kernels. 

§ Excludes any time spent on data transfer over the PCIe bus.

§ CPU may run concurrently 

with asynchronous kernel 

launches. 

[Figure: timelines comparing real GPU hardware with GPGPU-Sim. With a synchronous kernel launch the CPU blocks until the kernel is done; with an asynchronous kernel launch the CPU keeps running while the kernel executes.]
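From the application's side, an asynchronous launch with overlapping CPU work might look like the sketch below. It uses only standard CUDA runtime calls; `cudaDeviceSynchronize()` is the CUDA 4.0+ name, and older code uses `cudaThreadSynchronize()`. The kernel and the helper `overlap_demo` are illustrative, and whether a given GPGPU-Sim version supports each call is not something this sketch assumes.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns true if the kernel and the overlapping CPU work both give the expected results.
bool overlap_demo(int n)
{
    int *d_data = 0;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;
    cudaMemset(d_data, 0, bytes);

    // Kernel launches are asynchronous: control returns to the CPU right away.
    add_one<<<(n + 255) / 256, 256>>>(d_data, n);

    // CPU work that may run while the kernel is still executing.
    long long cpu_sum = 0;
    for (int i = 0; i < n; ++i) cpu_sum += i;

    // Block the CPU until all previously launched GPU work has finished.
    cudaError_t err = cudaDeviceSynchronize();

    std::vector<int> h(n);
    cudaMemcpy(&h[0], d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    bool gpu_ok = (err == cudaSuccess);
    for (int i = 0; i < n && gpu_ok; ++i) gpu_ok = (h[i] == 1);
    return gpu_ok && cpu_sum == (long long)n * (n - 1) / 2;
}

int main()
{
    printf("%s\n", overlap_demo(1 << 16) ? "ok" : "failed");
    return 0;
}
```

The blocking `cudaMemcpy` would also wait for the kernel, so the explicit synchronization is there to make the ordering visible rather than because it is strictly required.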


GPU Micro-architecture 

§ Cycle-level model for each part of the 

microarchitecture 

§ Research focused 

• Ignoring rare corner cases to reduce complexity 

§ CUDA manual provides some hints. NVIDIA IEEE Micro 

articles provide other hints. In most cases we can only 

guess at details. Guesses “informed” by studying 

patents and microbenchmarking. 

GPGPU-Sim w/ SASS is ~0.98 correlated to the real HW.


Power Model: GPUWattch 



New: Power Model GPUWattch 

[Figure: the GPGPU-Sim timing model supplies microarchitectural activities (performance counters) to the GPUWattch power model (McPAT++), which produces the power estimate.]
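The flow from activity counters to a power estimate can be illustrated with a generic activity-based model: dynamic energy is the sum of access counts times per-access energy, divided by execution time, plus static power. This is a conceptual sketch only, not GPUWattch's or McPAT's actual code or interface; the struct, the function, and every number are hypothetical. The code is host-only, so it builds with nvcc or any C++ compiler.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>

// One hardware component as seen by a simple activity-based power model (hypothetical).
struct ComponentActivity {
    const char *name;
    double accesses;               // activity count from the performance counters
    double energy_per_access_nj;   // dynamic energy per access, in nanojoules
    double static_power_w;         // leakage and other activity-independent power
};

// Average power = (sum of accesses * energy per access) / time + total static power.
double average_power_w(const std::vector<ComponentActivity> &parts,
                       double cycles, double clock_hz)
{
    double seconds = cycles / clock_hz;
    double dynamic_j = 0.0, static_w = 0.0;
    for (size_t i = 0; i < parts.size(); ++i) {
        dynamic_j += parts[i].accesses * parts[i].energy_per_access_nj * 1e-9;
        static_w  += parts[i].static_power_w;
    }
    return dynamic_j / seconds + static_w;
}

int main()
{
    // Made-up counters and energies, for illustration only.
    ComponentActivity alu = {"alu",           4.0e9, 0.05, 10.0};
    ComponentActivity rf  = {"register_file", 8.0e9, 0.02,  5.0};
    std::vector<ComponentActivity> parts;
    parts.push_back(alu);
    parts.push_back(rf);

    // 1.4e9 cycles at an assumed 1.4 GHz core clock is one second of execution.
    printf("average power: %.2f W\n", average_power_w(parts, 1.4e9, 1.4e9));
    return 0;
}
```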


Interfacing GPGPU-Sim to Applications 

§ libcudart.so ← CUDA runtime API

§ libOpenCL.so ← OpenCL API

§ Need a config file (gpgpusim.config), an interconnection config file and a GPUWattch config as well

We provide the config files for modeling:

- Quadro FX 5800 (GT200)

- GeForce GTX 480 and Tesla C2050 (Fermi)



Debugging and Visualization 

• GPGPU-Sim provides tools to debug and 

visualize simulated GPU behavior. 

§ GDB macros: 

Cycle-level debugging 

§ AerialVision: 

High-level performance dynamics 



GPGPU-Sim 3.2.1 

§ Refactored for C++ Object-Oriented Implementation 

§ Redesigned Timing Models 

• SIMT Core model, Cache models, GDDR5 timing … (later) 

§ Asynchronous Kernel Calls 

§ Concurrent Kernel Execution 

§ Support for CUDA 3.1, 4.0 and 4.2 



GPGPU-Sim 3.2.1 

§ Updated timing model to model Fermi more accurately 

§ Much more robust SASS support 

§ Support for CUDA 4.0 (New runtime flow) 

§ Support for CUDA 4.1 and 4.2 (Robust runtime flow) 

§ Support for OpenCL with newer NVIDIA drivers 

§ Two-Level Warp Scheduler from ISCA 2012 Tutorial 

§ Experimental Support for Libraries (CUBLAS, CUFFT) 

§ Redesigned Cache Model 

§ Power Model: GPUWattch 


Roadmap 

• Unified timing model framework 

§ From simple (~v2.x) to detailed (v3.x) 

• Fermi SASS (HW ISA) support 

• AMD Graphics Core Next (GCN) ISA 

• Kepler Model (HW ISA and timing) 



Session Summary 

• GPGPU-Sim simulates 

§ PTX/SASS 

§ Timing Model for GPU Compute 

§ Power Model: GPUWattch 

• It interfaces with CUDA/OpenCL applications via a shared runtime library

• Enhancements in GPGPU-Sim 3.2.1 

