InfiniBand and 10-Gigabit Ethernet for Dummies
High Performance MPI Support in MVAPICH2
for InfiniBand Clusters

A Talk at the NVIDIA Booth (SC '11)
by
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

NVIDIA-SC '11

Nov. 1996: 0/500 (0%)      Jun. 2002: 80/500 (16%)     Nov. 2007: 406/500 (81.2%)
Jun. 1997: 1/500 (0.2%)    Nov. 2002: 93/500 (18.6%)   Jun. 2008: 400/500 (80.0%)
Nov. 1997: 1/500 (0.2%)    Jun. 2003: 149/500 (29.8%)  Nov. 2008: 410/500 (82.0%)
Jun. 1998: 1/500 (0.2%)    Nov. 2003: 208/500 (41.6%)  Jun. 2009: 410/500 (82.0%)
Nov. 1998: 2/500 (0.4%)    Jun. 2004: 291/500 (58.2%)  Nov. 2009: 417/500 (83.4%)
Jun. 1999: 6/500 (1.2%)    Nov. 2004: 294/500 (58.8%)  Jun. 2010: 424/500 (84.8%)
Nov. 1999: 7/500 (1.4%)    Jun. 2005: 304/500 (60.8%)  Nov. 2010: 415/500 (83.0%)
Jun. 2000: 11/500 (2.2%)   Nov. 2005: 360/500 (72.0%)  Jun. 2011: 411/500 (82.2%)
Nov. 2000: 28/500 (5.6%)   Jun. 2006: 364/500 (72.8%)  Nov. 2011: 410/500 (82.0%)
Jun. 2001: 33/500 (6.6%)   Nov. 2006: 361/500 (72.2%)
Nov. 2001: 43/500 (8.6%)   Jun. 2007: 373/500 (74.6%)
InfiniBand in the Top500

The percentage share of InfiniBand is steadily increasing.
Large-scale InfiniBand Installations

• 209 IB clusters (41.8%) in the November '11 Top500 list (http://www.top500.org)
• Installations in the Top 30 (13 systems):
  – 120,640 cores (Nebulae) in China (4th)
  – 73,278 cores (Tsubame-2.0) in Japan (5th)
  – 111,104 cores (Pleiades) at NASA Ames (7th)
  – 138,368 cores (Tera-100) in France (9th)
  – 122,400 cores (RoadRunner) at LANL (10th)
  – 137,200 cores (Sunway Blue Light) in China (14th)
  – 46,208 cores (Zin) at LLNL (15th)
  – 33,072 cores (Lomonosov) in Russia (18th)
  – 29,440 cores (Mole-8.5) in China (21st)
  – 42,440 cores (Red Sky) at Sandia (24th)
  – 62,976 cores (Ranger) at TACC (25th)
  – 20,480 cores (Bull Benchmarks) in France (27th)
  – 20,480 cores (Helios) in Japan (28th)
  – More are being installed!
Data Movement in GPU+IB Clusters

• Many systems today combine GPUs with high-speed networks such as InfiniBand
• Steps in data movement in InfiniBand clusters with GPUs:
  – From GPU device memory to main memory at the source process, using CUDA
  – From source to destination process, using MPI
  – From main memory to GPU device memory at the destination process, using CUDA
• Earlier, the GPU device and the InfiniBand device required separate memory registration
• GPU-Direct (a collaboration between NVIDIA and Mellanox) enabled common registration between these devices
• However, GPU-GPU communication is still costly and harder to program

[Figure: CPU, GPU and NIC connected over PCIe, with the NIC attached to a switch]
Sample Code – Without MPI Integration

• Naïve implementation with standard MPI and CUDA

At Sender:
    cudaMemcpy(sbuf, sdev, . . .);
    MPI_Send(sbuf, size, . . .);

At Receiver:
    MPI_Recv(rbuf, size, . . .);
    cudaMemcpy(rdev, rbuf, . . .);

• High Productivity but Poor Performance
Sample Code – User-Optimized Code

• Pipelining at user level with non-blocking MPI and CUDA interfaces
• Code at the sender side (and repeated at the receiver side)

At Sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, . . .);
    for (j = 0; j < pipeline_len; j++) {
        do {
            result = cudaStreamQuery(…);
            if (j > 0) MPI_Test(…);
        } while (result != cudaSuccess);
        MPI_Isend(sbuf + j * blksz, blksz, . . .);
    }
    MPI_Waitall(…);

• User-level copying may not match the internal MPI design
• High Performance but Poor Productivity
Can This Be Done within the MPI Library?

• Support GPU-to-GPU communication through standard MPI interfaces
  – e.g., enable MPI_Send and MPI_Recv from/to GPU memory
• Provide high performance without exposing low-level details to the programmer
  – Pipelined data transfer that automatically provides optimizations inside the MPI library, without user tuning
• A new design incorporated in MVAPICH2 to support this functionality
MVAPICH/MVAPICH2 Software

• High-performance MPI library for IB and HSE
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2)
  – Available since 2002
  – Used by more than 1,810 organizations in 65 countries
  – More than 85,000 downloads directly from the OSU site
  – Empowering many TOP500 clusters
    • 5th-ranked 73,278-core cluster (Tsubame 2) at Tokyo Tech
    • 7th-ranked 111,104-core cluster (Pleiades) at NASA
    • 17th-ranked 62,976-core cluster (Ranger) at TACC
  – Available with the software stacks of many IB, HSE and server vendors, including the OpenFabrics Enterprise Distribution (OFED) and Linux distributions (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
Sample Code – MVAPICH2-GPU

• MVAPICH2-GPU: standard MPI interfaces used

At Sender:
    MPI_Send(s_device, size, …);

At Receiver:
    MPI_Recv(r_device, size, …);

• Data movement from/to GPU memory is handled inside MVAPICH2
• High Performance and High Productivity
Design Considerations

• Memory detection
  – CUDA 4.0 introduces Unified Virtual Addressing (UVA)
  – The MPI library can differentiate between device memory and host memory without any hints from the user
• Overlap of CUDA copies and RDMA transfers
  – Data movement from the GPU and RDMA transfers are both DMA operations
  – Allow for asynchronous progress
MPI-Level Two-sided Communication

[Figure: two panels of latency (us, 0–3000) vs. message size (32K–4M bytes), with and without GPU-Direct, comparing Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU]

• 45% and 38% improvement compared to Memcpy+Send, with and without GPU-Direct respectively, for 4MB messages
• 24% and 33% improvement compared to MemcpyAsync+Isend, with and without GPU-Direct respectively, for 4MB messages

H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC '11
MPI-Level One-sided Communication

[Figure: two panels of latency (us) vs. message size (32K–4M bytes) comparing Memcpy+Put and Memcpy+Get against MVAPICH2-GPU, each with and without GPU-Direct]

• 45% improvement compared to Memcpy+Put (with GPU-Direct)
• 39% improvement compared to Memcpy+Put (without GPU-Direct)
• Similar improvement for the Get operation
MVAPICH2-GPU Non-contiguous Datatypes: Design Goals

• Support GPU-GPU non-contiguous data communication through standard MPI interfaces
  – e.g., MPI_Send / MPI_Recv can operate on a GPU memory address for a non-contiguous datatype such as MPI_Type_vector
• Provide high performance without exposing low-level details to the programmer
  – Offload datatype pack and unpack to the GPU
    • Pack: pack non-contiguous data into a contiguous buffer inside the GPU, then move it out
    • Unpack: move contiguous data into GPU memory, then unpack it to the destination address
  – Pipelined data transfer that automatically provides optimizations inside the MPI library, without user tuning

H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur and D. K. Panda, Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2, IEEE Cluster '11, Sept. 2011.
MPI Latency for Vector Data

[Figure: two panels of latency (us) vs. message size, small messages (8 bytes–4K) and large messages (up to 4M), comparing Cpy2D+Send, Cpy2DAsync+CpyAsync+Isend and MV2-GPU-NC]

• MV2-GPU-NonContiguous (MV2-GPU-NC)
  – 88% improvement compared with Cpy2D+Send at 4MB
  – Latency similar to Cpy2DAsync+CpyAsync+Isend, but much easier to program
Optimization with Collectives: Alltoall Latency (Large Messages)

[Figure: Alltoall latency (us) vs. message size (128K–2M) for No MPI Level Optimization, Point-to-Point Based SD, Point-to-Point Based PE and Dynamic Staging SD, with 46% and 26% improvements marked]

• P2P based
  – Pipelining is enabled for each P2P channel (ISC '11); better than No MPI Level Optimization
• Dynamic Staging Scatter Dissemination (SD)
  – Overlaps DMA and RDMA not only within each channel but also across channels
  – Up to 46% improvement for Dynamic Staging SD over No MPI Level Optimization
  – Up to 26% improvement for Dynamic Staging SD over the P2P Based SD method

A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur and D. K. Panda, MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits, Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), held in conjunction with Cluster '11, Sept. 2011.
MVAPICH2 1.8a1p1 Release

• Supports basic point-to-point communication (as outlined in the ISC '11 paper)
• Supports the following transfers:
  – Device-Device
  – Device-Host
  – Host-Device
• Supports GPU-Direct through CUDA support (from 4.0)
• Tuned and optimized for CUDA 4.1 RC1
• Allows the block size (MV2_CUDA_BLOCK_SIZE) to be selected at runtime, to tune the performance of an MPI application based on its predominant message size
• Available from http://mvapich.cse.ohio-state.edu
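A launch sketch for the runtime block-size knob: MV2_CUDA_BLOCK_SIZE is the parameter named on this slide, while the value, host names and application name below are illustrative assumptions, not recommendations.

```shell
# Hypothetical run: pick a 256 KB pipeline block size for an application
# dominated by large messages (value and app name are illustrative).
# mpirun_rsh passes NAME=VALUE environment settings before the executable.
mpirun_rsh -np 2 node1 node2 MV2_CUDA_BLOCK_SIZE=262144 ./gpu_app
```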
OSU MPI Micro-Benchmarks (OMB) 3.5 Release

• OSU MPI Micro-Benchmarks provide a comprehensive suite of benchmarks for comparing the performance of different MPI stacks and networks
• Enhancements made to three benchmarks:
  – Latency
  – Bandwidth
  – Bi-directional Bandwidth
• Flexibility to place buffers in NVIDIA GPU device memory (D) or host memory (H)
• Flexibility to select data movement: D->D, D->H and H->D
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Also available integrated with the MVAPICH2 stack
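An invocation sketch for the D/H placement described above: the benchmark name and D/H arguments follow the OMB scheme on this slide, while the launcher line and host names are illustrative assumptions.

```shell
# Hypothetical run: measure device-to-device latency between two nodes.
# 'D D' asks the benchmark to place both the send and the receive buffer
# in GPU device memory; 'H' would select host memory instead.
mpirun_rsh -np 2 node1 node2 ./osu_latency D D
```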
MVAPICH2 vs. OpenMPI (Device-Device, Inter-node)

[Figure: latency (us), bandwidth (MB/s) and bi-directional bandwidth (MB/s) vs. message size (1 byte–1M) for MVAPICH2 and OpenMPI]

MVAPICH2 1.8a1p1 and OpenMPI (trunk nightly snapshot of Nov. 1, 2011)
Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2050 GPU and CUDA Toolkit 4.1 RC1
MVAPICH2 vs. OpenMPI (Device-Host, Inter-node)

[Figure: latency (us), bandwidth (MB/s) and bi-directional bandwidth (MB/s) vs. message size (1 byte–1M) for MVAPICH2 and OpenMPI]

• Host-Device performance is similar

MVAPICH2 1.8a1p1 and OpenMPI (trunk nightly snapshot of Nov. 1, 2011)
Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2050 GPU and CUDA Toolkit 4.1 RC1
MVAPICH2 vs. OpenMPI (Device-Device, Intra-node, Multi-GPU)

[Figure: latency (us), bandwidth (MB/s) and bi-directional bandwidth (MB/s) vs. message size (1 byte–1M) for MVAPICH2 and OpenMPI]

MVAPICH2 1.8a1p1 and OpenMPI (trunk nightly snapshot of Nov. 1, 2011)
Magny-Cours with 2 ConnectX-2 QDR HCAs, 2 NVIDIA Tesla C2075 GPUs and CUDA Toolkit 4.1 RC1
MVAPICH2 vs. OpenMPI (Device-Host, Intra-node, Multi-GPU)

[Figure: latency (us), bandwidth (MB/s) and bi-directional bandwidth (MB/s) vs. message size (1 byte–1M) for MVAPICH2 and OpenMPI]

• Host-Device performance is similar

MVAPICH2 1.8a1p1 and OpenMPI (trunk nightly snapshot of Nov. 1, 2011)
Magny-Cours with 2 ConnectX-2 QDR HCAs, 2 NVIDIA Tesla C2075 GPUs and CUDA Toolkit 4.1 RC1
Application-Level Evaluation: Lattice Boltzmann Method (LBM)

[Figure: time per LB step (us) for MV2 1.7 vs. MV2 1.8a1p1 at matrix sizes 64x512x64 through 1024x512x64, showing improvements of 13.04%, 12.29%, 12.6%, 11.94% and 11.11%]

• LBM-CUDA (courtesy of Carlos Rosales, TACC) is a parallel distributed CUDA implementation of a Lattice Boltzmann Method for multiphase flows with large density ratios
• NVIDIA Tesla C2050, Mellanox QDR InfiniBand HCA MT26428, Intel Westmere processor with 12 GB main memory; CUDA 4.1 RC1, MVAPICH2 1.7 and MVAPICH2 1.8a1p1
• One process run per node, one GPU per process (4-node cluster)
Future Plans

• Extend the current design/implementation to support:
  – Point-to-point communication with datatype support
  – Direct one-sided communication
  – Optimized designs for all collectives
• Additional tuning based on:
  – Platform characteristics
  – InfiniBand adapters (FDR)
  – Upcoming CUDA versions
• Carry out application-based evaluations and scalability studies, and enhance/adjust the designs
• Extend the current designs to MPI-3 RMA and PGAS models (UPC, OpenSHMEM, etc.)
• Add further benchmarks with GPU support to the OSU MPI Micro-Benchmark (OMB) suite
Conclusion

• Clusters with NVIDIA GPUs and InfiniBand are becoming very common
• However, programming these clusters with MPI and CUDA while achieving the highest performance is a big challenge
• The MVAPICH2-GPU design provides a novel approach that extracts the highest performance with the least burden on MPI programmers
• Significant potential for programmers to write high-performance, scalable MPI applications for NVIDIA GPU+IB clusters
Funding Acknowledgments

Funding support by: (sponsor logos in original slide)
Equipment support by: (sponsor logos in original slide)
Personnel Acknowledgments

Current Students
– V. Dhanraj (M.S.)
– J. Huang (Ph.D.)
– N. Islam (Ph.D.)
– J. Jose (Ph.D.)
– K. Kandalla (Ph.D.)
– M. Luo (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekhar (Ph.D.)
– M. Rahman (Ph.D.)
– A. Singh (Ph.D.)
– H. Subramoni (Ph.D.)

Current Post-Docs
– J. Vienne
– H. Wang

Current Programmers
– M. Arnold
– D. Bureddy
– J. Perkins

Past Students
– P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– P. Lai (Ph.D.)
– J. Liu (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– G. Santhanaraman (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)

Past Post-Docs
– X. Besseron
– H.-W. Jin
– E. Mancini
– S. Marcarelli

Past Research Scientist
– S. Sur
Web Pointers

http://www.cse.ohio-state.edu/~panda
http://nowlab.cse.ohio-state.edu

MVAPICH Web Page: http://mvapich.cse.ohio-state.edu

panda@cse.ohio-state.edu