Introduction to CUDA C
Multiblock Dot Product: Algorithm
• Each block computes pairwise products and a block-local sum (Block 0 over a[0]*b[0], a[1]*b[1], …; Block 1 over a[512]*b[512], a[513]*b[513], …)
• …and then contributes its sum to the final result c
[Figure: two blocks, each reducing its pairwise products to a partial sum; both partial sums feed into c]
Multiblock Dot Product: dot()<br />
#define N (2048*2048)<br />
#define THREADS_PER_BLOCK 512<br />
__global__ void dot( int *a, int *b, int *c ) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;__shared__ int temp[THREADS_PER_BLOCK];<br />
&nbsp;&nbsp;&nbsp;&nbsp;int index = threadIdx.x + blockIdx.x * blockDim.x;<br />
&nbsp;&nbsp;&nbsp;&nbsp;temp[threadIdx.x] = a[index] * b[index];<br />
&nbsp;&nbsp;&nbsp;&nbsp;__syncthreads();<br />
&nbsp;&nbsp;&nbsp;&nbsp;if( 0 == threadIdx.x ) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;int sum = 0;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for( int i = 0; i < THREADS_PER_BLOCK; i++ )<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sum += temp[i];<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;atomicAdd( c, sum );  /* replaces the racy *c += sum; */<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
}<br />
• With each block doing a plain *c += sum;, we have a race condition…<br />
• We can fix it with one of CUDA's atomic operations: atomicAdd()