GPU based cloud computing - Open Grid Forum

GPU based cloud computing 

Dairsie Latimer, Petapath, UK 


© NVIDIA Corporation 2010

About Petapath 

! Founded in 2008 to focus on delivering innovative hardware and software solutions into the high performance computing (HPC) markets

! Partnered with HP and SGI to deliver two Petascale prototype systems as part of the PRACE WP8 programme

! The system is a testbed for new ideas in usability, scalability and efficiency of large computer installations

! Active in exploiting emerging standards for acceleration technologies; a member of the Khronos Group, sitting on the OpenCL working committee

! We also provide consulting expertise for companies wishing to explore the advantages offered by heterogeneous systems


What is Heterogeneous or GPU Computing? 

[Diagram: an x86 CPU connected to a GPU over the PCIe bus. Computing with CPU + GPU: Heterogeneous Computing]

Low Latency or High Throughput? 

CPU

! Optimised for low-latency access to cached data sets

! Control logic for out-of-order and speculative execution

GPU

! Optimised for data-parallel, throughput computation

! Architecture tolerant of memory latency

! More transistors dedicated to computation


NVIDIA GPU Computing Ecosystem 

[Diagram labels: ISV; Architecture; CUDA Training Company; CUDA Development Specialist; TPP / OEM; Hardware Architect; VAR; CUDA SDK & Tools; NVIDIA Hardware Solutions; Customer Requirements; Customer Application; Hardware Architecture; Deployment]

Science is Desperate for Throughput 

[Chart: performance in Gigaflops (1; 1,000; 1,000,000 = 1 Petaflop; 1,000,000,000 = 1 Exaflop) against year (1982, 1997, 2003, 2006, 2010, 2012). Annotation: "Ran for 8 months to simulate 2 nanoseconds"]

Power Crisis in Supercomputing 

[Chart: household power equivalent by performance level, plotted over the years 1982, 1996, 2008, 2020]

Gigaflop: Block

Teraflop: Neighborhood

Petaflop: Town

Exaflop: City


Enter the GPU 

GeForce®: Entertainment

Tesla™: High-Performance Computing

Quadro®: Design & Creation


NVIDIA GPU Product Families

NEXT-GENERATION GPU ARCHITECTURE: ‘FERMI’


Introducing the ‘Fermi’ Tesla Architecture 

The Soul of a Supercomputer in the body of a GPU 

! 3 billion transistors 

! Up to 2× the cores (C2050 has 448) 

! Up to 8× the peak DP performance 

! ECC on all memories 

! L1 and L2 caches 


! Improved memory bandwidth (GDDR5) 

! Up to 1 Terabyte of GPU memory 

! Concurrent kernels 

! Hardware support for C++ 


Design Goal of Fermi 

! Expand performance sweet spot of the GPU

! Bring more users, more applications to the GPU

[Diagram labels: Data Parallel; Instruction Parallel; Many Decisions; Large Data Sets]

Streaming Multiprocessor Architecture 

! 32 CUDA cores per SM (512 total)

! 8× peak double precision floating point performance

! 50% of peak single precision

! Dual Thread Scheduler

! 64 KB of RAM for shared memory and L1 cache (configurable)

[Diagram labels: Load/Store Units × 16; Special Func Units × 4]
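The core counts and the 50% double-to-single precision ratio quoted for Fermi can be turned into a rough peak-throughput estimate. The sketch below is illustrative only: it assumes each CUDA core retires one fused multiply-add (counted as 2 floating-point operations) per clock, and it assumes a 1.15 GHz core clock, which is not stated on these slides. The function name `peak_gflops` is hypothetical.

```python
def peak_gflops(cores, clock_ghz, flops_per_core_per_clock=2, dp_ratio=0.5):
    """Estimate peak single and double precision throughput in Gigaflops.

    flops_per_core_per_clock=2 assumes one fused multiply-add per core per clock.
    dp_ratio=0.5 reflects the slide's "50% of peak single precision".
    """
    single = cores * clock_ghz * flops_per_core_per_clock
    double = single * dp_ratio
    return single, double


# 512 cores in total at 32 cores per SM gives the number of SMs on a full chip.
sm_count = 512 // 32

# Tesla C2050: 448 cores (from the slides); the 1.15 GHz clock is an assumption.
sp, dp = peak_gflops(cores=448, clock_ghz=1.15)

print(f"SMs on a full chip: {sm_count}")
print(f"Estimated peak single precision: {sp:.1f} Gigaflops")
print(f"Estimated peak double precision: {dp:.1f} Gigaflops")
```

Under these assumptions the double precision estimate lands just above 500 Gigaflops, which is consistent with the "500+ Gigaflops" figure quoted for the 20 Series boards.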


CUDA Core Architecture 

! New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs

! Fused multiply-add (FMA) instruction for both single and double precision

! New integer ALU optimized for 64-bit and extended precision operations

[Diagram labels: FP Unit; INT Unit; Load/Store Units × 16; Special Func Units × 4]
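The benefit of a fused multiply-add is that a×b+c is rounded once, rather than once after the multiply and again after the add. The sketch below emulates that on the host using exact rational arithmetic; it is not GPU code, and the helper names `fma_reference` and `mul_then_add` are hypothetical.

```python
from fractions import Fraction


def fma_reference(a, b, c):
    """Emulate a fused multiply-add: form a*b + c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def mul_then_add(a, b, c):
    """Unfused version: the product is rounded before the addition."""
    return a * b + c


# The exact product is 1 - 2**-60, which rounds to 1.0 in double precision.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

fused = fma_reference(a, b, c)
unfused = mul_then_add(a, b, c)

print("fused  :", fused)
print("unfused:", unfused)
```

With these inputs the unfused version loses the result entirely (the rounded product cancels against c), while the single-rounding version keeps the small residual of -2**-60.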


Cached Memory Hierarchy 

! First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory

! L1 Cache per SM (32 cores)

! Improves bandwidth and reduces latency

! Unified L2 Cache (768 KB)

! Fast, coherent data sharing across all cores in the GPU

[Diagram label: Parallel DataCache Memory Hierarchy]


Larger, Faster, Resilient Memory Interface 

! GDDR5 memory interface 

! 2× signaling speed of GDDR3 

! Up to 1 Terabyte of memory attached to GPU 

! Operate on larger data sets (3 and 6 GB Cards) 


! ECC protection for GDDR5 DRAM 

! All major internal memories are ECC protected 

! Register file, L1 cache, L2 cache 


GigaThread Hardware Thread Scheduler 


GigaThread Streaming Data Transfer Engine 

! Dual DMA engines

! Simultaneous CPU→GPU and GPU→CPU data transfer

! Fully overlapped with CPU and GPU processing time

! Activity Snapshot:

[Timeline figure: Kernels 0 to 3, each shown alongside streaming data transfers SDT0 and SDT1]
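The effect of overlapping both transfer directions with kernel execution can be estimated with a simple three-stage pipeline model. The sketch below is a back-of-envelope model, not a measurement: it assumes the work is split into equal chunks, that upload, kernel and download each have a dedicated engine, and that the per-chunk times are constant. The function names and the example timings are hypothetical.

```python
def serial_time(n_chunks, h2d, kernel, d2h):
    """Total time when each chunk is uploaded, computed and downloaded in turn."""
    return n_chunks * (h2d + kernel + d2h)


def overlapped_time(n_chunks, h2d, kernel, d2h):
    """Total time when upload, kernel and download run as a three-stage pipeline.

    The first chunk pays the full latency; every later chunk adds only the
    time of the slowest stage.
    """
    if n_chunks == 0:
        return 0.0
    return (h2d + kernel + d2h) + (n_chunks - 1) * max(h2d, kernel, d2h)


# Illustrative per-chunk times in milliseconds (assumed values).
chunks, up, run, down = 4, 2.0, 5.0, 2.0

t_serial = serial_time(chunks, up, run, down)
t_overlap = overlapped_time(chunks, up, run, down)

print("serial    :", t_serial, "ms")
print("overlapped:", t_overlap, "ms")
print("speed-up  :", round(t_serial / t_overlap, 2))
```

In this model the overlapped schedule approaches the cost of the slowest stage alone as the number of chunks grows, which is why the transfers need to be shorter than the kernels to be hidden completely.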


Enhanced Software Support 

! Many new features in CUDA Toolkit 3.0 

! To be released on Friday 

! Including early support for the Fermi architecture: 

! Native 64-bit GPU support 

! Multiple Copy Engine support 

! ECC reporting 

! Concurrent Kernel Execution 

! Fermi HW debugging support in cuda-gdb 


Enhanced Software Support 

! OpenCL 1.0 Support 

! First class language citizen in CUDA Architecture 

! Supports ICD (so interoperability between vendors is a possibility) 

! Profiling support available 

! Debug support coming to Parallel Nsight (NEXUS) soon 

! gDebugger CL from graphicREMEDY 

! Third party OpenCL profiler/debugger/memory checker 

! Software Tools Ecosystem is starting to grow 

! Given boost by existence of OpenCL 


“Oak Ridge National Lab (ORNL) has already announced it will be using Fermi technology in an upcoming super that is "expected to be 10-times more powerful than today's fastest supercomputer."

Since ORNL's Jaguar supercomputer, for all intents and purposes, holds that title, and is in the process of being upgraded to 2.3 PFlops….

…we can surmise that the upcoming Fermi-equipped super is going to be in the 20 Petaflops range.”

September 30 2009


NVIDIA TESLA PRODUCTS 


Tesla GPU Computing Products: 10 Series 

Product                         GPUs           Single Precision   Double Precision   Memory
SuperMicro 1U GPU SuperServer   2 Tesla GPUs   1.87 Teraflops     156 Gigaflops      8 GB (4 GB / GPU)
Tesla S1070 1U System           4 Tesla GPUs   4.14 Teraflops     346 Gigaflops      16 GB (4 GB / GPU)
Tesla C1060 Computing Board     1 Tesla GPU    933 Gigaflops      78 Gigaflops       4 GB
Tesla Personal Supercomputer    4 Tesla GPUs   3.7 Teraflops      312 Gigaflops      16 GB (4 GB / GPU)


Tesla GPU Computing Products: 20 Series 

Product                 GPUs           Double Precision      Memory
Tesla S2050 1U System   4 Tesla GPUs   2.1 – 2.5 Teraflops   12 GB (3 GB / GPU)
Tesla S2070 1U System   4 Tesla GPUs   2.1 – 2.5 Teraflops   24 GB (6 GB / GPU)
Tesla C2050             1 Tesla GPU    500+ Gigaflops        3 GB
Tesla C2070             1 Tesla GPU    500+ Gigaflops        6 GB


HETEROGENEOUS CLUSTERS 


Data Centers: Space and Energy Limited 

Traditional Data Center Cluster

! Quad-core CPU, 8 cores per server

! 2x Performance requires 2x Number of Servers

! 1000’s of cores, 1000’s of servers

Heterogeneous Data Center Cluster

! 10,000’s of cores, 100’s of servers

! Augment/replace host servers


Cluster Deployment 

! Now a number of GPU aware Cluster Management Systems 

! ActiveEon ProActive Parallel Suite® Version 4.2 

! Platform Cluster Manager and HPC Workgroup 

! Streamline Computing GPU Environment (SCGE) 

! Not just installation aids

! i.e. putting the driver and toolkits in the right place

! now starting to provide GPU node discovery and job steering

! NVIDIA and Mellanox

! Better interoperability between Mellanox InfiniBand adapters and NVIDIA Tesla GPUs

! Can provide as much as a 30% performance improvement by eliminating unnecessary data movement in a multi node heterogeneous application


Cluster Deployment 

! A number of cluster and distributed debug tools now support CUDA and NVIDIA Tesla

! Allinea® DDT for NVIDIA CUDA

! Extends the well known Distributed Debugging Tool (DDT) with CUDA support

! TotalView® debugger (part of an Early Experience Program)

! Extends TotalView with CUDA support; intentions to support OpenCL have also been announced

! Both based on the Parallel Nsight (NEXUS) Debugging API


NVIDIA Reality Server 3.0 

! Cloud computing platform for running 3D web applications

! Consists of a Tesla RS GPU-based server cluster running RealityServer software from mental images

! Deployed in a number of different sizes

! From 2 to 100’s of 1U Servers

! iray® - Interactive Photorealistic Rendering Technology

! Streams interactive 3D applications to any web connected device

! Designers and architects can now share and visualize complex 3D models under different lighting and environmental conditions


DISTRIBUTED COMPUTING PROJECTS 


Distributed Computing Projects 

! Traditional distributed computing projects have been making use of GPUs for some time (non-commercial)

! Typically have 1,000’s to 10,000’s of contributors

! Folding@Home has access to 6.5 PFLOPS of compute

! Of which ~95% comes from GPUs or PS3s

! Many are bio-informatics, molecular dynamics and quantum chemistry codes

! Represent the current sweet spot applications

! Ubiquity of GPUs in home systems helps



! Folding@Home 

! Directed by Prof. Vijay Pande at Stanford University (http://folding.stanford.edu/) 

! Most recent GPU3 Core based on OpenMM 1.0 (https://simtk.org/home/openmm) 

! OpenMM library provides tools for molecular modeling simulation 

! Can be hooked into any MM application, allowing that code to do molecular modeling with minimal extra effort

! OpenMM has a strong emphasis on hardware acceleration providing not just a consistent API, but much greater performance

! Current NVIDIA target is via CUDA Toolkit 2.3 

! OpenMM 1.0 also provides Beta support for OpenCL 

! OpenCL is long term convergence software platform 



! Berkeley Open Infrastructure for Network Computing 

! BOINC project (http://boinc.berkeley.edu/) 

! Platform infrastructure originally evolved from SETI@home 

! Many projects use BOINC and several of these have heterogeneous compute implementations (http://boinc.berkeley.edu/wiki/GPU_computing)

! Examples include: 

! GPUGRID.net 

! SETI@home 

! Milkyway@home (IEEE 754 Double precision capable GPU required) 

! AQUA@home 

! Lattice 

! Collatz Conjecture 



! GPUGRID.net 

! Dr. Gianni De Fabritiis, Research Group of Biomedical Informatics, University Pompeu Fabra-IMIM, Barcelona

! Uses GPUs to deliver high-performance all-atom biomolecular simulation of proteins using ACEMD (http://multiscalelab.org/acemd)

! ACEMD is a production bio-molecular dynamics code specially optimized to run on graphics processing units (GPUs) from NVIDIA

! It reads CHARMM/NAMD and AMBER input files with a simple and powerful configuration interface

! A commercial implementation of ACEMD is available from Acellera Ltd (http://www.acellera.com/acemd/)

! What makes this particularly interesting is that it is implemented using OpenCL



! Distributed projects have had to use brute force methods to deal with robustness

! Run the same work unit (WU) with multiple users and compare results

! Running on purpose designed heterogeneous grids with ECC

! Means that some of the paranoia can be relaxed (soft errors or WU corruption can at least be detected)

! Results in better throughput on these systems

! But does result in divergence between Consumer and HPC devices

! Should be compensated for by HPC class devices being about 4x faster
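The brute force approach of running the same work unit with multiple users and comparing results amounts to a quorum check. The sketch below is one minimal way to express it, assuming each host returns a comparable summary of its result, such as a checksum string. The function name `validate_work_unit` and the `quorum` parameter are hypothetical and do not describe any particular project's validator; real projects may instead compare numerical results within a tolerance.

```python
from collections import Counter


def validate_work_unit(results, quorum=2):
    """Return the agreed result, or None if no result reaches the quorum.

    results: hashable result summaries, one per host that ran the work unit.
    quorum:  minimum number of matching results needed to accept one.
    """
    if not results:
        return None
    ranked = Counter(results).most_common(2)
    best, best_count = ranked[0]
    if best_count < quorum:
        return None  # not enough agreement yet; the WU would be reissued
    if len(ranked) > 1 and ranked[1][1] == best_count:
        return None  # tie between two candidate results; cannot decide
    return best


# Two hosts agree and one disagrees: the majority result is accepted.
accepted = validate_work_unit(["a3f1", "a3f1", "ffff"])

# Only two hosts and they disagree: nothing can be accepted.
rejected = validate_work_unit(["a3f1", "ffff"])

print("accepted:", accepted)
print("rejected:", rejected)
```

Every accepted result costs at least `quorum` executions of the same work unit, which is the throughput penalty that ECC-protected devices allow a project to reduce.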


Tesla Bio Workbench 

Accelerating New Science 

January, 2010 

http://www.nvidia.com/bio_workbench 


Introducing Tesla Bio WorkBench 

[Diagram: Tesla Bio Workbench. Applications: TeraChem, LAMMPS, GPU-AutoDock, MUMmerGPU. Resources: Download, Documentation, Technical papers, Discussion Forums, Benchmarks & Configurations. Platforms: Tesla Personal Supercomputer, Tesla GPU Clusters]


Tesla Bio Workbench Applications 

! AMBER (MD) 

! ACEMD (MD) 

! GROMACS (MD) 

! GROMOS (MD) 

! LAMMPS (MD) 

! NAMD (MD) 

! TeraChem (QC) 

! VMD (Visualization MD & QC) 

! Docking 

! GPU AutoDock 

! Sequence analysis 

! CUDASW++ (Smith-Waterman)

! MUMmerGPU 

! GPU-HMMER 

! CUDA-MEME Motif Discovery 


Recommended Hardware Configurations 

Tesla Personal Supercomputer

! Up to 4 Tesla C1060s per workstation

! 4 GB main memory / GPU

Tesla GPU Clusters

! Tesla S1070 1U

! 4 GPUs per 1U

! Integrated CPU-GPU Server

! 2 GPUs per 1U + 2 CPUs

Specifics at http://www.nvidia.com/bio_workbench


Molecular Dynamics and Quantum Chemistry Applications

! AMBER (MD) 

! ACEMD (MD) 

! HOOMD (MD) 

! GROMACS (MD) 

! LAMMPS (MD) 

! NAMD (MD) 

! TeraChem (QC) 

! VMD (Viz. MD & QC) 

! Typical speed ups of 3-8x on a single Tesla C1060 vs a modern 1U server

! Some applications (compute bound) show 20-100x speed ups 


Usage of TeraGrid National Supercomputing Grid 

[Chart annotation: Half of the cycles]



Summary


! ‘Fermi’ debuts HPC/Enterprise features 

! Particularly ECC and high performance double precision 

! Software development environments are now more mature 

! Significant software ecosystem is starting to emerge 

! Broadening availability of development tools, libraries and applications 

! Heterogeneous (GPU) aware cluster management systems 

! Economics, open standards and improving programming methodologies

! Heterogeneous computing is gradually changing the long held perception that it is just an ‘exotic’ niche technology



Questions?


Supporting Slides

AMBER Molecular Dynamics 

[Roadmap timeline: Alpha (now), Q1 2010, Q2 2010]

• Generalized Born

• PME: Particle Mesh Ewald

• Beta release

• Multi-GPU + MPI support

• Beta 2 release

Generalized Born Simulations

! Implicit solvent GB results

! 1 Tesla GPU 8x faster than 2 quad-core CPUs

[Chart bar labels: 7x, 8.6x]

More Info

http://www.nvidia.com/object/amber_on_tesla.html

Data courtesy of San Diego Supercomputer Center

GROMACS Molecular Dynamics 

[Roadmap timeline: Beta (now), Q2 2010]

• Particle Mesh Ewald (PME)

• Implicit solvent GB

• Arbitrary forms of nonbonded interactions

• Multi-GPU + MPI support

• Beta 2 release

! PME results

! 1 Tesla GPU 3.5x-4.7x faster than CPU

[Chart: GROMACS on Tesla GPU vs CPU. Categories: Particle-Mesh-Ewald (PME), Reaction-Field, Cutoffs. Bar labels: 3.5x, 5.2x, 22x]

More Info

http://www.nvidia.com/object/gromacs_on_tesla.html

Data courtesy of Stockholm Center for Biomembrane Research

HOOMD Blue Molecular Dynamics 

! Written bottom-up for CUDA GPUs

! Modeled after LAMMPS

! Supports multiple GPUs

! 1 Tesla GPU outperforms 32 CPUs running LAMMPS

More Info 

http://www.nvidia.com/object/hoomd_on_tesla.html 


Data courtesy of University of Michigan

LAMMPS: Molecular Dynamics on a GPU Cluster 

! Available as beta on CUDA 

! Cut-off based non-bonded terms 

! 2 GPUs outperform 24 CPUs

! PME based electrostatic

! Preliminary results: 5X speed-up

! Multiple GPU + MPI support enabled

[Chart label: 2 GPUs = 24 CPUs]

More Info 

http://www.nvidia.com/object/lammps_on_tesla.html 


Data courtesy of Scott Hampton & Pratul K. Agarwal, Oak Ridge National Laboratory

NAMD: Scaling Molecular Dynamics on a GPU Cluster 

! Feature complete on CUDA: available in NAMD 2.7 Beta 2

! Full electrostatics with PME

! Multiple time-stepping

! 1-4 Exclusions

! 4 GPU Tesla PSC outperforms 8 CPU servers

[Chart label: 4 GPUs = 16 CPUs]

! Scales to a GPU cluster

More Info 

http://www.nvidia.com/object/namd_on_tesla.html 


Data courtesy of Theoretical and Computational Bio-physics Group, UIUC

TeraChem: Quantum Chemistry Package for GPUs 

[Roadmap timeline: Beta (now), Q1 2010]

• HF, Kohn-Sham, DFT

• Multiple GPUs supported

• Full release

• MPI support

! First QC SW written ground-up for 


! 4 Tesla GPUs outperform 256 quad-core CPUs

More Info 

http://www.nvidia.com/object/terachem_on_tesla.html 


VMD: Acceleration using CUDA GPUs 

! Several CUDA applications in VMD 1.8.7

! Molecular Orbital Display 

! Coulomb-based Ion Placement 

! Implicit Ligand Sampling 

! Speedups : 20x - 100x 

! Multiple GPU support enabled 

More Info 

http://www.nvidia.com/object/vmd_on_tesla.html 

Images and data courtesy of Beckman Institute for Advanced Science and Technology, UIUC 


GPU-HMMER: Protein Sequence Alignment 

! Protein sequence alignment using profile HMMs

! Available now

! Supports multiple GPUs

! Speedups of 60-100x over CPU

! Download 

! http://www.mpihmmer.org/releases.htm 


MUMmerGPU: Genome Sequence Alignment

! High-throughput pair-wise local sequence alignment

! Designed for large sequences

! Drop-in replacement for “mummer” component in MUMmer software

! Speedups 3.5x to 3.75x 

! Download 


! http://mummergpu.sourceforge.net
