
CUDA Toolkit 4.0 Overview - Nvidia


CUDA 4.0
The 'Super' Computing Company
From Super Phones to Super Computers

© NVIDIA Corporation 2011


CUDA 4.0 for Broader Developer Adoption


CUDA 4.0: Application Porting Made Simpler
• Rapid application porting: Unified Virtual Addressing
• Faster multi-GPU programming: NVIDIA GPUDirect 2.0
• Easier parallel programming in C++: Thrust


CUDA 4.0: Highlights

Easier parallel application porting
• Share GPUs across multiple threads
• Single thread access to all GPUs
• No-copy pinning of system memory
• New CUDA C/C++ features
• Thrust templated primitives library
• NPP image/video processing library
• Layered Textures

Faster multi-GPU programming
• NVIDIA GPUDirect v2.0: peer-to-peer access and peer-to-peer transfers
• Unified Virtual Addressing

New & improved developer tools
• Auto performance analysis
• C++ debugging
• GPU binary disassembler
• cuda-gdb for MacOS


Easier Porting of Existing Applications

Share GPUs across multiple threads
• Easier porting of multi-threaded apps: pthreads / OpenMP threads share a GPU
• Launch concurrent kernels from different host threads
• Eliminates context switching overhead
• New, simple context management APIs (old context migration APIs still supported)

Single thread access to all GPUs
• Each host thread can now access all GPUs in the system (one-thread-per-GPU limitation removed)
• Easier than ever for applications to take advantage of multi-GPU
• Single-threaded applications can now benefit from multiple GPUs
• Easily coordinate work across multiple GPUs (e.g. halo exchange)
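As a minimal sketch of the single-thread multi-GPU model (kernel name and sizes are illustrative; error checking omitted), one host thread can iterate over every device, launch work on each, then wait for all of them:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: scale a buffer in place
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1024;
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    float *d_data[16];  // assume at most 16 GPUs for this sketch

    // One thread launches work on every GPU; kernels run concurrently
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);                    // switch devices from the same thread
        cudaMalloc(&d_data[dev], n * sizeof(float));
        scale<<<4, 256>>>(d_data[dev], 2.0f, n);
    }

    // Wait for each GPU to finish, then clean up
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(d_data[dev]);
    }
    return 0;
}
```

Before CUDA 4.0 this pattern required one host thread per GPU or explicit context migration.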


No-copy Pinning of System Memory

Reduce system memory usage and CPU memcpy() overhead
• Easier to add CUDA acceleration to existing applications
• Just register malloc'd system memory for async operations, then call cudaMemcpy() as usual

Before no-copy pinning (extra allocation and extra copy required):
malloc(a)
cudaMallocHost(b)
memcpy(b, a)
cudaMemcpy() to GPU, launch kernels, cudaMemcpy() from GPU
memcpy(a, b)
cudaFreeHost(b)

With no-copy pinning (just register and go!):
malloc(a)
cudaHostRegister(a)
cudaMemcpy() to GPU, launch kernels, cudaMemcpy() from GPU
cudaHostUnregister(a)

All CUDA-capable GPUs on Linux or Windows
Requires Linux kernel 2.6.15+ (RHEL 5)
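The "register and go" path can be sketched as follows (buffer size is illustrative; error checking and kernel launches omitted):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main()
{
    size_t bytes = 1 << 20;
    float *a = (float *)malloc(bytes);   // ordinary system allocation

    // Pin the existing allocation in place; no staging buffer, no extra copy
    cudaHostRegister(a, bytes, cudaHostRegisterDefault);

    float *d_a;
    cudaMalloc(&d_a, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    // ... launch kernels ...
    cudaMemcpy(a, d_a, bytes, cudaMemcpyDeviceToHost);

    cudaHostUnregister(a);               // unpin before freeing
    cudaFree(d_a);
    free(a);
    return 0;
}
```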


New CUDA C/C++ Language Features
• C++ new/delete: dynamic memory management in device code
• C++ virtual functions: easier porting of existing applications
• Inline PTX: enables assembly-level optimization
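A minimal sketch combining two of these features, device-side new/delete and inline PTX (kernel name is hypothetical; requires a Fermi-class GPU):

```cuda
// Demonstrates CUDA 4.0 device-code features: dynamic allocation
// with new/delete and an inline PTX add instruction.
__global__ void features_demo(int *out)
{
    int *scratch = new int[4];       // device-side dynamic allocation
    scratch[0] = threadIdx.x;

    int incremented;
    // Inline PTX: 32-bit integer add exposed through asm()
    asm("add.s32 %0, %1, 1;" : "=r"(incremented) : "r"(scratch[0]));

    out[threadIdx.x] = incremented;
    delete[] scratch;                // free device-side allocation
}
```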


C++ Templatized Algorithms & Data Structures (Thrust)
• Powerful open-source C++ parallel algorithms & data structures
• Similar to the C++ Standard Template Library (STL)
• Automatically chooses the fastest code path at compile time
• Divides work between GPUs and multi-core CPUs
• Parallel sorting 5x to 100x faster

Data structures: thrust::device_vector, thrust::host_vector, thrust::device_ptr, etc.
Algorithms: thrust::sort, thrust::reduce, thrust::exclusive_scan, etc.
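A minimal sketch of the STL-like style (data size is illustrative): fill a host vector, move it to the device, and sort and reduce it in parallel:

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <cstdlib>

int main()
{
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i)
        h[i] = rand();

    thrust::device_vector<int> d = h;    // copy host to device
    thrust::sort(d.begin(), d.end());    // parallel sort on the GPU
    int sum = thrust::reduce(d.begin(), d.end());  // parallel reduction
    (void)sum;

    thrust::copy(d.begin(), d.end(), h.begin());   // results back to host
    return 0;
}
```

The same source compiles against Thrust's CPU backends, which is how work can be divided between GPUs and multi-core CPUs.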


GPU-Accelerated Image Processing

NVIDIA Performance Primitives (NPP) library
• 10x to 36x faster image processing
• Initial focus on imaging- and video-related primitives

Data exchange & initialization: Set, Convert, CopyConstBorder, Copy, Transpose, SwapChannels
Color conversion: RGBToYCbCr (& vice versa), ColorTwist, LUT_Linear
Threshold & compare ops: Threshold, Compare
Statistics: Mean, StdDev, NormDiff, MinMax, Histogram, SqrIntegral, RectStdDev
Filter functions: FilterBox, Row, Column, Max, Min, Median, Dilate, Erode, SumWindowColumn/Row
Geometry transforms: Mirror, WarpAffine / Back / Quad, WarpPerspective / Back / Quad, Resize
Arithmetic & logical ops: Add, Sub, Mul, Div, AbsDiff
JPEG: DCTQuantInv/Fwd, QuantizationTable


Layered Textures: Faster Image Processing

Ideal for processing multiple textures with the same size/format
• Large sizes supported on Tesla T20 (Fermi) GPUs (up to 16k x 16k x 2k)
• e.g. medical imaging, terrain rendering (flight simulators)

Faster performance
• Reduced CPU overhead: single binding for an entire texture array
• Faster than 3D textures: more efficient filter caching
• Fast interop with OpenGL / Direct3D for each layer
• No need to create/manage a texture atlas

No sampling artifacts
• Linear/bilinear filtering applied only within a layer
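A minimal allocation sketch (dimensions are illustrative; texture binding and kernel launch elided): a 2D layered CUDA array holds many same-sized layers behind one binding:

```cuda
#include <cuda_runtime.h>

int main()
{
    int width = 1024, height = 1024, numLayers = 16;

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    // With cudaArrayLayered, the third extent is the layer count,
    // so filtering never crosses layer boundaries
    cudaExtent extent = make_cudaExtent(width, height, numLayers);

    cudaArray *layeredArray;
    cudaMalloc3DArray(&layeredArray, &desc, extent, cudaArrayLayered);

    // ... bind to a layered texture reference and launch kernels ...

    cudaFreeArray(layeredArray);
    return 0;
}
```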




NVIDIA GPUDirect: Towards Eliminating the CPU Bottleneck

Version 1.0: for applications that communicate over a network
• Direct access to GPU memory for 3rd-party devices
• Eliminates unnecessary system memory copies & CPU overhead
• Supported by Mellanox and QLogic
• Up to 30% improvement in communication performance

Version 2.0: for applications that communicate within a node
• Peer-to-peer memory access, transfers & synchronization
• Less code, higher programmer productivity

Details @ http://www.nvidia.com/object/software-for-tesla-products.html


Before NVIDIA GPUDirect v2.0: Required Copy into Main Memory

Two copies required to move data from GPU1 memory to GPU2 memory:
1. cudaMemcpy(sysmem, GPU1)
2. cudaMemcpy(GPU2, sysmem)

(Diagram: GPU1 and GPU2 each attached to the chipset over PCI-e; all traffic staged through CPU system memory.)


NVIDIA GPUDirect v2.0: Peer-to-Peer Communication

Direct transfers between GPUs. Only one copy required:
1. cudaMemcpy(GPU2, GPU1)

(Diagram: the same PCI-e topology, with data moving directly from GPU1 memory to GPU2 memory.)


GPUDirect v2.0: Peer-to-Peer Communication

Direct communication between GPUs
• Faster: no system memory copy overhead
• More convenient multi-GPU programming

Direct transfers
• Copy from GPU0 memory to GPU1 memory
• Works transparently with UVA

Direct access
• GPU0 reads or writes GPU1 memory (load/store)

Supported on Tesla 20-series and other Fermi GPUs
64-bit applications on Linux and Windows TCC
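A minimal sketch of both modes, assuming two peer-capable Fermi GPUs at device indices 0 and 1 (error checking omitted):

```cuda
#include <cuda_runtime.h>

int main()
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can GPU0 reach GPU1?

    size_t bytes = 1 << 20;
    float *d0, *d1;

    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    if (canAccess)
        cudaDeviceEnablePeerAccess(1, 0);  // direct access: GPU0 kernels may
                                           // now load/store GPU1 memory

    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);

    // Direct transfer GPU0 -> GPU1, no staging through system memory
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}
```

With UVA enabled, a plain cudaMemcpy between the two device pointers works as well; cudaMemcpyPeer makes the devices explicit.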


Unified Virtual Addressing: Easier to Program with a Single Address Space

Without UVA: multiple memory spaces. System memory, GPU0 memory, and GPU1 memory each span their own separate address range.

With UVA: a single address space. System memory and all GPU memories occupy disjoint ranges of one shared virtual address space.


Unified Virtual Addressing

One address space for all CPU and GPU memory
• Determine physical memory location from a pointer value
• Enables libraries to simplify their interfaces (e.g. cudaMemcpy)

Before UVA: separate options for each permutation
• cudaMemcpyHostToHost
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice

With UVA: one option handles all cases
• cudaMemcpyDefault (data location becomes an implementation detail)

Supported on Tesla 20-series and other Fermi GPUs
64-bit applications on Linux and Windows TCC
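A minimal sketch (buffer size is illustrative): the same cudaMemcpyDefault flag covers host-to-device and device-to-host, because under UVA the runtime infers each pointer's location:

```cuda
#include <cuda_runtime.h>

int main()
{
    size_t bytes = 256 * sizeof(float);
    float *h_src, *h_dst, *d_buf;

    cudaMallocHost(&h_src, bytes);  // pinned host memory (has a UVA address)
    cudaMallocHost(&h_dst, bytes);
    cudaMalloc(&d_buf, bytes);

    // Same direction flag for both copies; no HostToDevice/DeviceToHost needed
    cudaMemcpy(d_buf, h_src, bytes, cudaMemcpyDefault);
    cudaMemcpy(h_dst, d_buf, bytes, cudaMemcpyDefault);

    cudaFreeHost(h_src);
    cudaFreeHost(h_dst);
    cudaFree(d_buf);
    return 0;
}
```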




Automated Performance Analysis in Visual Profiler

Summary analysis & hints at the session, device, context, and kernel levels

New UI for kernel analysis
• Identify limiting factor
• Analyze instruction throughput
• Analyze memory throughput
• Analyze kernel occupancy


New Features in cuda-gdb

Now available for both Linux and MacOS
• 'info cuda threads' output, automatically updated in DDD
• Breakpoints on all instances of templated functions
• Fermi disassembly (cuobjdump)
• C++ symbols shown in stack trace view

Details @ http://developer.nvidia.com/object/cuda-gdb.html


cuda-gdb Now Available for MacOS
Details @ http://developer.nvidia.com/object/cuda-gdb.html


NVIDIA Parallel Nsight Pro 1.5

Features
• Professional CUDA debugging
• Compute Analyzer: CUDA / OpenCL profiling
• Tesla Compute Cluster (TCC) debugging
• OpenGL and OpenCL Analyzer
• DirectX 10 & 11 Analyzer, Debugger & Graphics inspector

Requirements
• Tesla support: C1050/S1070 or higher
• Quadro support: G9x or higher
• GeForce support: 9 series or higher
• Windows 7, Vista and HPC Server 2008
• Visual Studio 2008 SP1 and Visual Studio 2010


CUDA Registered Developer Program

All GPGPU developers should become NVIDIA Registered Developers. Benefits include:

Early access to pre-release software
• Beta software and libraries
• CUDA 4.0 Release Candidate available now

Submit & track issues and bugs
• Interact directly with NVIDIA QA engineers

New benefits in 2011
• Exclusive Q&A webinars with NVIDIA Engineering
• Exclusive deep-dive CUDA training webinars
• In-depth engineering presentations on pre-release software

Sign up now: www.nvidia.com/ParallelDeveloper


Additional Information…
• CUDA Features Overview
• CUDA Developer Resources from NVIDIA
• CUDA 3rd Party Ecosystem
• PGI CUDA x86
• GPU Computing Research & Education
• NVIDIA Parallel Developer Program
• GPU Technology Conference 2011


CUDA Features Overview

New in CUDA 4.0
• GPUDirect (v2.0): peer-to-peer communication
• Unified Virtual Addressing
• C++ new/delete and C++ virtual functions
• Thrust C++ library: templated performance primitives
• Parallel Nsight Pro 1.5

Platform
• Hardware features: ECC memory, double precision, native 64-bit architecture, concurrent kernel execution, dual copy engines, 6 GB per GPU supported
• Operating system support: MS Windows 32/64, Linux 32/64, Mac OS X 32/64
• Designed for HPC: cluster management, GPUDirect, Tesla Compute Cluster (TCC), multi-GPU support

Programming model
• C support: NVIDIA C compiler, CUDA C parallel extensions, function pointers, recursion, atomics, malloc/free
• C++ support: classes/objects, class inheritance, polymorphism, operator overloading, class templates, function templates, virtual base classes, namespaces
• Fortran support: CUDA Fortran (PGI)

Parallel libraries
• NVIDIA library support: complete math.h, complete BLAS library (1, 2 and 3), sparse matrix math library, RNG library, FFT library (1D, 2D and 3D), video decoding library (NVCUVID), video encoding library (NVCUVENC), image processing library (NPP), video processing library (NPP)
• 3rd-party math libraries: CULA Tools (EM Photonics), MAGMA Heterogeneous LAPACK, IMSL (Rogue Wave), VSIPL (GPU VSIPL)

Development tools
• NVIDIA developer tools: Parallel Nsight for MS Visual Studio, cuda-gdb debugger with multi-GPU support, CUDA/OpenCL Visual Profiler, CUDA memory checker, CUDA disassembler, GPU Computing SDK, NVML, CUPTI
• 3rd-party developer tools: Allinea DDT, Rogue Wave TotalView, Vampir, Tau, CAPS HMPP


CUDA Developer Resources from NVIDIA

Development tools
• CUDA Toolkit: complete GPU computing development kit
• cuda-gdb: GPU hardware debugging
• cuda-memcheck: identifies memory errors
• cuobjdump: CUDA binary disassembler
• Visual Profiler: GPU hardware profiler for CUDA C and OpenCL
• Parallel Nsight Pro: integrated development environment for Visual Studio

SDKs and code samples
• GPU Computing SDK: CUDA C/C++, DirectCompute, OpenCL code samples and documentation
• Books: CUDA by Example, GPU Computing Gems, Programming Massively Parallel Processors, and many more
• Optimization guides: best practices for GPU computing and graphics development

Libraries and engines
• Math libraries: CUFFT, CUBLAS, CUSPARSE, CURAND, math.h
• 3rd-party libraries: CULA LAPACK, VSIPL
• NPP image libraries: performance primitives for imaging
• App acceleration engines: ray tracing (OptiX, iray)
• Video encoding / decoding: NVCUVENC / NVCUVID

http://developer.nvidia.com


CUDA 3rd Party Ecosystem

Cluster tools
• Cluster management: Platform HPC, Platform Symphony, Bright Cluster Manager, Ganglia Monitoring System, Moab Cluster Suite
• Job scheduling: Altair PBSpro, TORQUE, Platform LSF
• MPI libraries: coming soon…

Parallel language solutions & APIs
• PGI CUDA Fortran, PGI Accelerator (C/Fortran), PGI CUDA x86, CAPS HMPP
• pyCUDA (Python), Tidepowerd GPU.NET (C#), JCuda (Java)
• Khronos OpenCL, Microsoft DirectCompute
• 3rd-party math libraries: CULA Tools (EM Photonics), MAGMA Heterogeneous LAPACK, IMSL (Rogue Wave), VSIPL (GPU VSIPL), NAG

Parallel tools
• Parallel debuggers: MS Visual Studio with Parallel Nsight Pro, Allinea DDT debugger, TotalView debugger
• Parallel performance tools: ParaTools VampirTrace, Tau CUDA performance tools, PAPI, HPC Toolkit

Compute platform providers
• Cloud providers: Amazon EC2, Peer 1
• OEMs: Dell, HP, IBM
• Infiniband providers: Mellanox, QLogic


PGI CUDA x86 Compiler

Benefits
• Deploy CUDA apps on legacy systems without GPUs
• Less code maintenance for developers

Timeline
• April/May: 1.0 initial release (develop, debug, test functionality)
• August: 1.1 performance release (multicore, SSE/AVX support)


GPU Computing Research & Education
http://research.nvidia.com

World-class research leadership and teaching
• University of Cambridge, Harvard University, University of Utah, University of Tennessee, University of Maryland, University of Illinois at Urbana-Champaign, Tsinghua University, Tokyo Institute of Technology, Chinese Academy of Sciences, National Taiwan University, Georgia Institute of Technology

Proven research vision
• Johns Hopkins University, Mass. Gen. Hospital/NE Univ, Nanyang University, North Carolina State University, Technical University-Czech, Swinburne University of Tech., CSIRO, Technische Univ. Munich, SINTEF, UCLA, HP Labs, University of New Mexico, ICHEC, University of Warsaw-ICM, Barcelona SuperComputer Center, VSB-Technical University of Ostrava, Clemson University, Fraunhofer SCAI, Karlsruhe Institute of Technology, and more coming shortly

Academic partnerships / fellowships
• GPGPU education at 350+ universities


"Don't kid yourself. GPUs are a game-changer," said Frank Chambers, a GTC conference attendee shopping for GPUs for his finite element analysis work. "What we are seeing here is like going from propellers to jet engines. That made transcontinental flights routine. Wide access to this kind of computing power is making things like artificial retinas possible, and that wasn't predicted to happen until 2060."

- Inside HPC (Sept 22, 2010)


GPU Technology Conference 2011
October 11-14 | San Jose, CA

The one event you can't afford to miss
• Learn about leading-edge advances in GPU computing
• Explore the research as well as the commercial applications
• Discover advances in computational visualization
• Take a deep dive into parallel programming

Ways to participate
• Speak: share your work and gain exposure as a thought leader
• Register: learn from the experts and network with your peers
• Exhibit/Sponsor: promote your company as a key player in the GPU ecosystem

www.gputechconf.com
