
High Performance MPI Support in MVAPICH2 for InfiniBand Clusters

A Talk at NVIDIA Booth (SC '11)

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda


Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

Nov. 1996:   0/500 (0%)      Jun. 2002:  80/500 (16%)     Nov. 2007: 406/500 (81.2%)
Jun. 1997:   1/500 (0.2%)    Nov. 2002:  93/500 (18.6%)   Jun. 2008: 400/500 (80.0%)
Nov. 1997:   1/500 (0.2%)    Jun. 2003: 149/500 (29.8%)   Nov. 2008: 410/500 (82.0%)
Jun. 1998:   1/500 (0.2%)    Nov. 2003: 208/500 (41.6%)   Jun. 2009: 410/500 (82.0%)
Nov. 1998:   2/500 (0.4%)    Jun. 2004: 291/500 (58.2%)   Nov. 2009: 417/500 (83.4%)
Jun. 1999:   6/500 (1.2%)    Nov. 2004: 294/500 (58.8%)   Jun. 2010: 424/500 (84.8%)
Nov. 1999:   7/500 (1.4%)    Jun. 2005: 304/500 (60.8%)   Nov. 2010: 415/500 (83%)
Jun. 2000:  11/500 (2.2%)    Nov. 2005: 360/500 (72.0%)   Jun. 2011: 411/500 (82.2%)
Nov. 2000:  28/500 (5.6%)    Jun. 2006: 364/500 (72.8%)   Nov. 2011: 410/500 (82.0%)
Jun. 2001:  33/500 (6.6%)    Nov. 2006: 361/500 (72.2%)
Nov. 2001:  43/500 (8.6%)    Jun. 2007: 373/500 (74.6%)


InfiniBand in the Top500

Percentage share of InfiniBand is steadily increasing.


Large-scale InfiniBand Installations

• 209 IB clusters (41.8%) in the November '11 Top500 list (http://www.top500.org)
• Installations in the Top 30 (13 systems):
  – 120,640 cores (Nebulae) in China (4th)
  – 73,278 cores (Tsubame-2.0) in Japan (5th)
  – 111,104 cores (Pleiades) at NASA Ames (7th)
  – 138,368 cores (Tera-100) in France (9th)
  – 122,400 cores (RoadRunner) at LANL (10th)
  – 137,200 cores (Sunway Blue Light) in China (14th)
  – 46,208 cores (Zin) at LLNL (15th)
  – 33,072 cores (Lomonosov) in Russia (18th)
  – 29,440 cores (Mole-8.5) in China (21st)
  – 42,440 cores (Red Sky) at Sandia (24th)
  – 62,976 cores (Ranger) at TACC (25th)
  – 20,480 cores (Bull Benchmarks) in France (27th)
  – 20,480 cores (Helios) in Japan (28th)
  – More are getting installed!


Data movement in GPU+IB clusters

• Many systems today combine GPUs with high-speed networks such as InfiniBand
• Steps in data movement in InfiniBand clusters with GPUs:
  – From GPU device memory to main memory at the source process, using CUDA
  – From the source to the destination process, using MPI
  – From main memory to GPU device memory at the destination process, using CUDA
• Earlier, the GPU device and the InfiniBand device required separate memory registration
• GPU-Direct (a collaboration between NVIDIA and Mellanox) supported a common registration between these devices
• However, GPU-GPU communication is still costly and programming is harder

[Figure: node architecture, with the CPU connected over PCIe to the GPU and the NIC, and the NIC connected to the InfiniBand switch]


Sample Code - Without MPI integration

• Naïve implementation with standard MPI and CUDA

At Sender:
    cudaMemcpy(sbuf, sdev, . . .);
    MPI_Send(sbuf, size, . . .);

At Receiver:
    MPI_Recv(rbuf, size, . . .);
    cudaMemcpy(rdev, rbuf, . . .);

• High Productivity and Poor Performance
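For reference, a complete two-rank version of this naïve pattern is sketched below. It is a sketch only: the buffer size, tag, and lack of error handling are illustrative additions, not taken from the slide.

/* Naive GPU-to-GPU transfer: stage through host memory, then use MPI.
 * Message size and tag are illustrative assumptions. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size_bytes = 4 * 1024 * 1024;           /* 4 MB message (assumed) */
    void *dev_buf, *host_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc(&dev_buf, size_bytes);                  /* data lives in GPU memory */
    host_buf = malloc(size_bytes);                     /* staging buffer in host memory */

    if (rank == 0) {                                   /* sender: GPU -> host -> network */
        cudaMemcpy(host_buf, dev_buf, size_bytes, cudaMemcpyDeviceToHost);
        MPI_Send(host_buf, size_bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                            /* receiver: network -> host -> GPU */
        MPI_Recv(host_buf, size_bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(dev_buf, host_buf, size_bytes, cudaMemcpyHostToDevice);
    }

    free(host_buf);
    cudaFree(dev_buf);
    MPI_Finalize();
    return 0;
}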


Sample Code – User Optimized Code

• Pipelining at user level with non-blocking MPI and CUDA interfaces
• Code at Sender side (and repeated at Receiver side)

At Sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(sbuf + j * block_sz, sdev + j * block_sz, . . .);
    for (j = 0; j < pipeline_len; j++) {
        while (result != cudaSuccess) {
            result = cudaStreamQuery(…);
            if (j > 0) MPI_Test(…);
        }
        MPI_Isend(sbuf + j * block_sz, block_sz, . . .);
    }
    MPI_Waitall();

• User-level copying may not match the internal MPI design
• High Performance and Poor Productivity
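Filling in the elided arguments, a self-contained sender-side sketch of this pipelining scheme could look as follows. The pipeline depth, block size, per-chunk streams, tags, and peer rank are assumptions for illustration; none are specified on the slide.

/* User-level pipelining sketch: overlap device-to-host copies with MPI sends.
 * Assumes sbuf is pinned host memory (e.g. from cudaMallocHost) so the
 * asynchronous copies can actually overlap; all sizes are illustrative. */
#include <mpi.h>
#include <cuda_runtime.h>

#define PIPELINE_LEN 8
#define BLOCK_SZ     (512 * 1024)

void pipelined_send(const char *sdev, char *sbuf, int peer)
{
    cudaStream_t streams[PIPELINE_LEN];
    MPI_Request  reqs[PIPELINE_LEN];
    int j, flag;

    /* Post one asynchronous D2H copy per chunk, each on its own stream */
    for (j = 0; j < PIPELINE_LEN; j++) {
        cudaStreamCreate(&streams[j]);
        cudaMemcpyAsync(sbuf + j * BLOCK_SZ, sdev + j * BLOCK_SZ,
                        BLOCK_SZ, cudaMemcpyDeviceToHost, streams[j]);
    }

    /* As each copy completes, send its chunk; poll earlier sends while waiting */
    for (j = 0; j < PIPELINE_LEN; j++) {
        while (cudaStreamQuery(streams[j]) != cudaSuccess) {
            if (j > 0)
                MPI_Test(&reqs[j - 1], &flag, MPI_STATUS_IGNORE);   /* progress MPI */
        }
        MPI_Isend(sbuf + j * BLOCK_SZ, BLOCK_SZ, MPI_BYTE,
                  peer, j, MPI_COMM_WORLD, &reqs[j]);
    }

    MPI_Waitall(PIPELINE_LEN, reqs, MPI_STATUSES_IGNORE);

    for (j = 0; j < PIPELINE_LEN; j++)
        cudaStreamDestroy(streams[j]);
}

The extra streams, queries, and request bookkeeping are exactly the productivity cost the slide refers to.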


Can this be done within the MPI Library?

• Support GPU-to-GPU communication through standard MPI interfaces
  – e.g., enable MPI_Send and MPI_Recv from/to GPU memory
• Provide high performance without exposing low-level details to the programmer
  – Pipelined data transfer that automatically provides optimizations inside the MPI library without user tuning
• A new design incorporated in MVAPICH2 to support this functionality


MVAPICH/MVAPICH2 Software

• High Performance MPI Library for IB and HSE
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2)
  – Available since 2002
  – Used by more than 1,810 organizations in 65 countries
  – More than 85,000 downloads from the OSU site directly
  – Empowering many Top500 clusters
    • 5th ranked 73,278-core cluster (Tsubame 2) at Tokyo Tech
    • 7th ranked 111,104-core cluster (Pleiades) at NASA
    • 17th ranked 62,976-core cluster (Ranger) at TACC
  – Available with the software stacks of many IB, HSE and server vendors, including OpenFabrics Enterprise Distribution (OFED) and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu


Sample Code – MVAPICH2-GPU

• MVAPICH2-GPU: standard MPI interfaces used

At Sender:
    MPI_Send(s_device, size, …);

At Receiver:
    MPI_Recv(r_device, size, …);

• Data movement and pipelining are handled inside MVAPICH2
• High Performance and High Productivity
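With a CUDA-aware MVAPICH2 build, the two-rank exchange from the earlier naïve sketch reduces to passing device pointers straight to MPI. A minimal sketch follows; the buffer size and tag are again illustrative assumptions.

/* MVAPICH2-GPU sketch: pass GPU device pointers directly to MPI calls.
 * Assumes a CUDA-aware MVAPICH2 build; message size and tag are examples. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, size_bytes = 4 * 1024 * 1024;
    void *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc(&d_buf, size_bytes);                    /* buffer lives in GPU memory */

    if (rank == 0)
        MPI_Send(d_buf, size_bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);       /* s_device */
    else if (rank == 1)
        MPI_Recv(d_buf, size_bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                        /* r_device */

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

The staging and pipelining shown in the earlier sketches now happen inside the library.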


Design considerations

• Memory detection
  – CUDA 4.0 introduces Unified Virtual Addressing (UVA)
  – The MPI library can differentiate between device memory and host memory without any hints from the user
• Overlap of CUDA copy and RDMA transfer
  – Data movement from the GPU and the RDMA transfer are both DMA operations
  – Allow for asynchronous progress
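Under UVA, one way a library can classify an arbitrary pointer is through the CUDA runtime, as in the sketch below. The helper name is hypothetical (not MVAPICH2's actual internal routine), and memoryType is the CUDA 4.x-era attribute field.

/* Pointer classification under UVA: decide whether a buffer handed to MPI_Send
 * lives in GPU or host memory. Helper name is illustrative, not from MVAPICH2. */
#include <cuda_runtime.h>

static int is_device_pointer(const void *buf)
{
    struct cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, buf) != cudaSuccess) {
        cudaGetLastError();            /* clear the error raised for plain malloc'ed host memory */
        return 0;                      /* not known to CUDA: treat as host memory */
    }
    return attr.memoryType == cudaMemoryTypeDevice;    /* CUDA 4.x attribute field */
}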


MPI-Level Two-sided Communication

[Figure: latency (Time in us) vs. message size (32K–4M bytes) for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU, shown with GPU Direct and without GPU Direct]

• 45% and 38% improvements compared to Memcpy+Send, with and without GPUDirect respectively, for 4MB messages
• 24% and 33% improvements compared to MemcpyAsync+Isend, with and without GPUDirect respectively, for 4MB messages

H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC '11.


MPI-Level One-sided Communication

[Figure: latency (Time in us) vs. message size (32K–4M bytes) for Memcpy+Put / Memcpy+Get and MVAPICH2-GPU, with and without GPU Direct]

• 45% improvement compared to Memcpy+Put (with GPU Direct)
• 39% improvement compared to Memcpy+Put (without GPU Direct)
• Similar improvement for the Get operation
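For context, the one-sided pattern being measured is an MPI_Put into a window, with the origin and window buffers resident in GPU memory when MVAPICH2-GPU is used. A minimal fence-based sketch follows; the window size, target rank, and choice of fence synchronization are assumptions.

/* One-sided sketch: MPI_Put from a GPU-resident origin buffer into a window
 * that also exposes GPU memory (assumes CUDA-aware MVAPICH2; details illustrative). */
#include <mpi.h>
#include <cuda_runtime.h>

void put_example(int rank, int nbytes)
{
    void *d_origin, *d_target;
    MPI_Win win;

    cudaMalloc(&d_origin, nbytes);
    cudaMalloc(&d_target, nbytes);

    /* Every rank exposes its device buffer through the window */
    MPI_Win_create(d_target, nbytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                        /* open the access epoch */
    if (rank == 0)                                /* rank 0 writes into rank 1's window */
        MPI_Put(d_origin, nbytes, MPI_BYTE, 1, 0, nbytes, MPI_BYTE, win);
    MPI_Win_fence(0, win);                        /* complete the epoch */

    MPI_Win_free(&win);
    cudaFree(d_origin);
    cudaFree(d_target);
}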


MVAPICH2-GPU Non-Contiguous Datatypes: Design Goals

• Support GPU-GPU non-contiguous data communication through standard MPI interfaces
  – e.g., MPI_Send / MPI_Recv can operate on a GPU memory address for a non-contiguous datatype, such as MPI_Type_vector
• Provide high performance without exposing low-level details to the programmer
  – Offload datatype pack and unpack to the GPU
    • Pack: pack non-contiguous data into a contiguous buffer inside the GPU, then move it out
    • Unpack: move contiguous data into GPU memory, then unpack it to the destination address
  – Pipelined data transfer that automatically provides optimizations inside the MPI library without user tuning

H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur and D. K. Panda, Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2, IEEE Cluster '11, Sept. 2011.
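As an illustration of the intended usage, a vector datatype describing a strided region of a GPU buffer can be handed directly to MPI_Send. The sketch below sends one column of a row-major matrix; the matrix dimensions, peer rank, and tag are arbitrary example values.

/* Non-contiguous sketch: send a strided column of a GPU-resident 2D array
 * using MPI_Type_vector (assumes CUDA-aware MVAPICH2; sizes are examples). */
#include <mpi.h>
#include <cuda_runtime.h>

void send_strided_column(int peer)
{
    const int rows = 512, cols = 512;
    double *d_matrix;
    MPI_Datatype column_t;

    cudaMalloc((void **)&d_matrix, rows * cols * sizeof(double));

    /* One element per row, separated by a full row stride */
    MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column_t);
    MPI_Type_commit(&column_t);

    /* The library packs the strided data on the GPU before moving it out */
    MPI_Send(d_matrix, 1, column_t, peer, 0, MPI_COMM_WORLD);

    MPI_Type_free(&column_t);
    cudaFree(d_matrix);
}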


MPI Latency for Vector Data

[Figure: latency (Times in us) vs. message size for Cpy2D+Send, Cpy2DAsync+CpyAsync+Isend and MV2-GPU-NC; small messages (8 bytes–4K) and large messages up to 4MB]

• MV2-GPU-NonContiguous (MV2-GPU-NC)
  – 88% improvement compared with Cpy2D+Send at 4MB
  – Similar latency to Cpy2DAsync+CpyAsync+Isend, but much easier to program


Optimization with Collectives: Alltoall Latency (Large Message)

[Figure: Alltoall latency (Time in us) vs. message size (128K–2M) for No MPI Level Optimization, Point-to-Point Based SD, Point-to-Point Based PE and Dynamic Staging SD; annotations mark the 46% and 26% improvements]

• P2P Based
  – Pipeline is enabled for each P2P channel (ISC '11); better than No MPI Level Optimization
• Dynamic Staging Scatter Dissemination (SD)
  – Overlaps DMA and RDMA not only within each channel, but also across different channels
  – Up to 46% improvement for Dynamic Staging SD over No MPI Level Optimization
  – Up to 26% improvement for Dynamic Staging SD over the P2P Based SD method

A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur and D. K. Panda, MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits, Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), held in conjunction with Cluster '11, Sept. 2011.
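From the user's perspective, the collective under study is simply MPI_Alltoall called on device buffers; the staging and pipelining choices above happen inside the library. A minimal sketch, with an assumed per-peer chunk size:

/* Alltoall sketch with GPU-resident send/receive buffers
 * (assumes CUDA-aware MVAPICH2; chunk size is illustrative). */
#include <mpi.h>
#include <cuda_runtime.h>

void alltoall_device(void)
{
    int nprocs, chunk = 1 << 20;                  /* 1 MB per peer (example) */
    void *d_send, *d_recv;

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    cudaMalloc(&d_send, (size_t)chunk * nprocs);
    cudaMalloc(&d_recv, (size_t)chunk * nprocs);

    /* Each rank exchanges one chunk with every other rank */
    MPI_Alltoall(d_send, chunk, MPI_BYTE, d_recv, chunk, MPI_BYTE, MPI_COMM_WORLD);

    cudaFree(d_send);
    cudaFree(d_recv);
}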


MVAPICH2 1.8a1p1 Release

• Supports basic point-to-point communication (as outlined in the ISC '11 paper)
• Supports the following transfers
  – Device-Device
  – Device-Host
  – Host-Device
• Supports GPU-Direct through CUDA support (from 4.0)
• Tuned and optimized for CUDA 4.1 RC1
• Provides flexibility for the block size (MV2_CUDA_BLOCK_SIZE) to be selected at runtime, to tune the performance of an MPI application based on its predominant message size
• Available from http://mvapich.cse.ohio-state.edu


OSU MPI Micro-Benchmarks (OMB) 3.5 Release

• OSU MPI Micro-Benchmarks provides a comprehensive suite of benchmarks to compare the performance of different MPI stacks and networks
• Enhancements done for three benchmarks
  – Latency
  – Bandwidth
  – Bi-directional Bandwidth
• Flexibility for using buffers in NVIDIA GPU device memory (D) and host memory (H)
• Flexibility for selecting data movement between D->D, D->H and H->D
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack


MVAPICH2 vs. OpenMPI (Device-Device, Inter-node)

[Figure: Latency (us), Bandwidth (MB/s) and Bi-directional Bandwidth (MB/s) vs. message size (1 byte–1M) for MVAPICH2 and OpenMPI]

MVAPICH2 1.8a1p1 and OpenMPI (trunk nightly snapshot on Nov 1, 2011)
Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2050 GPU and CUDA Toolkit 4.1 RC1


MVAPICH2 vs. OpenMPI (Device-Host, Inter-node)

[Figure: Latency (us), Bandwidth (MB/s) and Bi-directional Bandwidth (MB/s) vs. message size (1 byte–1M) for MVAPICH2 and OpenMPI]

• Host-Device performance is similar

MVAPICH2 1.8a1p1 and OpenMPI (trunk nightly snapshot on Nov 1, 2011)
Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2050 GPU and CUDA Toolkit 4.1 RC1


MVAPICH2 vs. OpenMPI (Device-Device, Intra-node, Multi-GPU)

[Figure: Latency (us), Bandwidth (MB/s) and Bi-directional Bandwidth (MB/s) vs. message size (1 byte–1M) for MVAPICH2 and OpenMPI]

MVAPICH2 1.8a1p1 and OpenMPI (trunk nightly snapshot on Nov 1, 2011)
Magny-Cours with 2 ConnectX-2 QDR HCAs, 2 NVIDIA Tesla C2075 GPUs and CUDA Toolkit 4.1 RC1


MVAPICH2 vs. OpenMPI (Device-Host, Intra-node, Multi-GPU)

[Figure: Latency (us), Bandwidth (MB/s) and Bi-directional Bandwidth (MB/s) vs. message size (1 byte–1M) for MVAPICH2 and OpenMPI]

• Host-Device performance is similar

MVAPICH2 1.8a1p1 and OpenMPI (trunk nightly snapshot on Nov 1, 2011)
Magny-Cours with 2 ConnectX-2 QDR HCAs, 2 NVIDIA Tesla C2075 GPUs and CUDA Toolkit 4.1 RC1


Application-Level Evaluation: Lattice Boltzmann Method (LBM)

[Figure: time for LB step (us) with MV2 1.7 vs. MV2 1.8a1p1 for matrix sizes (XxYxZ) 64x512x64, 128x512x64, 256x512x64, 512x512x64 and 1024x512x64; improvements of 13.04%, 12.29%, 12.6%, 11.94% and 11.11%, respectively]

• LBM-CUDA (courtesy Carlos Rosales, TACC) is a parallel distributed CUDA implementation of a Lattice Boltzmann Method for multiphase flows with large density ratios
• NVIDIA Tesla C2050, Mellanox QDR InfiniBand HCA MT26428, Intel Westmere processor with 12 GB main memory; CUDA 4.1 RC1, MVAPICH2 1.7 and MVAPICH2 1.8a1p1
• One process per node, one GPU per process (4-node cluster)


Future Plans

• Extend the current design/implementation to support
  – Point-to-point communication with datatype support
  – Direct one-sided communication
  – Optimized designs for all collectives
• Additional tuning based on
  – Platform characteristics
  – InfiniBand adapters (FDR)
  – Upcoming CUDA versions
• Carry out application-based evaluations and scalability studies, and enhance/adjust the designs
• Extend the current designs to MPI-3 RMA and PGAS models (UPC, OpenSHMEM, etc.)
• Extend additional benchmarks in the OSU MPI Micro-Benchmark (OMB) suite with GPU support


Conclusion

• Clusters with NVIDIA GPUs and InfiniBand are becoming very common
• However, programming these clusters with MPI and CUDA while achieving the highest performance is a big challenge
• The MVAPICH2-GPU design provides a novel approach to extract the highest performance with the least burden on MPI programmers
• Significant potential for programmers to write high-performance and scalable MPI applications for NVIDIA GPU+IB clusters


Funding Acknowledgments

Funding support by: [sponsor logos in the original slide]

Equipment support by: [vendor logos in the original slide]


Personnel Acknowledgments

Current Students
– V. Dhanraj (M.S.)
– J. Huang (Ph.D.)
– N. Islam (Ph.D.)
– J. Jose (Ph.D.)
– K. Kandalla (Ph.D.)
– M. Luo (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekhar (Ph.D.)
– M. Rahman (Ph.D.)
– A. Singh (Ph.D.)
– H. Subramoni (Ph.D.)

Current Post-Docs
– J. Vienne
– H. Wang

Current Programmers
– M. Arnold
– D. Bureddy
– J. Perkins

Past Students
– P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– P. Lai (Ph.D.)
– J. Liu (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– G. Santhanaraman (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)

Past Post-Docs
– X. Besseron
– H.-W. Jin
– E. Mancini
– S. Marcarelli

Past Research Scientist
– S. Sur


Web Pointers

http://www.cse.ohio-state.edu/~panda
http://nowlab.cse.ohio-state.edu

MVAPICH Web Page
http://mvapich.cse.ohio-state.edu

panda@cse.ohio-state.edu
