
CUDA Toolkit 4.0 Overview - Nvidia


CUDA 4.0
The 'Super' Computing Company
From Super Phones to Super Computers

© NVIDIA Corporation 2011


CUDA 4.0 for Broader Developer Adoption


CUDA 4.0: Application Porting Made Simpler
• Rapid application porting: Unified Virtual Addressing
• Faster multi-GPU programming: NVIDIA GPUDirect 2.0
• Easier parallel programming in C++: Thrust


CUDA 4.0: Highlights

Easier parallel application porting
• Share GPUs across multiple threads
• Single thread access to all GPUs
• No-copy pinning of system memory
• New CUDA C/C++ features
• Thrust templated primitives library
• NPP image/video processing library
• Layered Textures

Faster multi-GPU programming
• NVIDIA GPUDirect v2.0: peer-to-peer access and peer-to-peer transfers
• Unified Virtual Addressing

New & improved developer tools
• Auto performance analysis
• C++ debugging
• GPU binary disassembler
• cuda-gdb for MacOS


Easier Porting of Existing Applications

Share GPUs across multiple threads
• Easier porting of multi-threaded apps: pthreads / OpenMP threads share a GPU
• Launch concurrent kernels from different host threads
• Eliminates context switching overhead
• New, simple context management APIs (old context migration APIs still supported)

Single thread access to all GPUs
• Each host thread can now access all GPUs in the system (one-thread-per-GPU limitation removed)
• Easier than ever for applications to take advantage of multi-GPU
• Single-threaded applications can now benefit from multiple GPUs
• Easily coordinate work across multiple GPUs (e.g. halo exchange)
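As a minimal sketch of the single-thread multi-GPU model (kernel name and sizes are illustrative; error checking omitted), one host thread can iterate over every device, launch work on each, then wait for all of them:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: scale a buffer in place
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1024;
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    float *d_data[16];  // assume at most 16 GPUs for this sketch

    // One thread launches work on every GPU; kernels run concurrently
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);                    // switch devices from the same thread
        cudaMalloc(&d_data[dev], n * sizeof(float));
        scale<<<4, 256>>>(d_data[dev], 2.0f, n);
    }

    // Wait for each GPU to finish, then clean up
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(d_data[dev]);
    }
    return 0;
}
```

Before CUDA 4.0 this pattern required one host thread per GPU or explicit context migration.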


No-copy Pinning of System Memory

Reduce system memory usage and CPU memcpy() overhead
• Easier to add CUDA acceleration to existing applications
• Just register malloc'd system memory for async operations, then call cudaMemcpy() as usual

Before no-copy pinning (extra allocation and extra copy required):
malloc(a)
cudaMallocHost(b)
memcpy(b, a)
cudaMemcpy() to GPU, launch kernels, cudaMemcpy() from GPU
memcpy(a, b)
cudaFreeHost(b)

With no-copy pinning (just register and go!):
malloc(a)
cudaHostRegister(a)
cudaMemcpy() to GPU, launch kernels, cudaMemcpy() from GPU
cudaHostUnregister(a)

All CUDA-capable GPUs on Linux or Windows
Requires Linux kernel 2.6.15+ (RHEL 5)
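The "register and go" path can be sketched as follows (buffer size is illustrative; error checking and kernel launches omitted):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main()
{
    size_t bytes = 1 << 20;
    float *a = (float *)malloc(bytes);   // ordinary system allocation

    // Pin the existing allocation in place; no staging buffer, no extra copy
    cudaHostRegister(a, bytes, cudaHostRegisterDefault);

    float *d_a;
    cudaMalloc(&d_a, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    // ... launch kernels ...
    cudaMemcpy(a, d_a, bytes, cudaMemcpyDeviceToHost);

    cudaHostUnregister(a);               // unpin before freeing
    cudaFree(d_a);
    free(a);
    return 0;
}
```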


New CUDA C/C++ Language Features
• C++ new/delete: dynamic memory management in device code
• C++ virtual functions: easier porting of existing applications
• Inline PTX: enables assembly-level optimization
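A minimal sketch combining two of these features, device-side new/delete and inline PTX (kernel name is hypothetical; requires a Fermi-class GPU):

```cuda
// Demonstrates CUDA 4.0 device-code features: dynamic allocation
// with new/delete and an inline PTX add instruction.
__global__ void features_demo(int *out)
{
    int *scratch = new int[4];       // device-side dynamic allocation
    scratch[0] = threadIdx.x;

    int incremented;
    // Inline PTX: 32-bit integer add exposed through asm()
    asm("add.s32 %0, %1, 1;" : "=r"(incremented) : "r"(scratch[0]));

    out[threadIdx.x] = incremented;
    delete[] scratch;                // free device-side allocation
}
```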


C++ Templatized Algorithms & Data Structures (Thrust)
• Powerful open-source C++ parallel algorithms & data structures
• Similar to the C++ Standard Template Library (STL)
• Automatically chooses the fastest code path at compile time
• Divides work between GPUs and multi-core CPUs
• Parallel sorting 5x to 100x faster

Data structures: thrust::device_vector, thrust::host_vector, thrust::device_ptr, etc.
Algorithms: thrust::sort, thrust::reduce, thrust::exclusive_scan, etc.
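A minimal sketch of the STL-like style (data size is illustrative): fill a host vector, move it to the device, and sort and reduce it in parallel:

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <cstdlib>

int main()
{
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i)
        h[i] = rand();

    thrust::device_vector<int> d = h;    // copy host to device
    thrust::sort(d.begin(), d.end());    // parallel sort on the GPU
    int sum = thrust::reduce(d.begin(), d.end());  // parallel reduction
    (void)sum;

    thrust::copy(d.begin(), d.end(), h.begin());   // results back to host
    return 0;
}
```

The same source compiles against Thrust's CPU backends, which is how work can be divided between GPUs and multi-core CPUs.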


GPU-Accelerated Image Processing

NVIDIA Performance Primitives (NPP) library
• 10x to 36x faster image processing
• Initial focus on imaging- and video-related primitives

Data exchange & initialization: Set, Convert, CopyConstBorder, Copy, Transpose, SwapChannels
Color conversion: RGBToYCbCr (& vice versa), ColorTwist, LUT_Linear
Threshold & compare ops: Threshold, Compare
Statistics: Mean, StdDev, NormDiff, MinMax, Histogram, SqrIntegral, RectStdDev
Filter functions: FilterBox, Row, Column, Max, Min, Median, Dilate, Erode, SumWindowColumn/Row
Geometry transforms: Mirror, WarpAffine / Back / Quad, WarpPerspective / Back / Quad, Resize
Arithmetic & logical ops: Add, Sub, Mul, Div, AbsDiff
JPEG: DCTQuantInv/Fwd, QuantizationTable


Layered Textures: Faster Image Processing

Ideal for processing multiple textures with the same size/format
• Large sizes supported on Tesla T20 (Fermi) GPUs (up to 16k x 16k x 2k)
• e.g. medical imaging, terrain rendering (flight simulators)

Faster performance
• Reduced CPU overhead: single binding for an entire texture array
• Faster than 3D textures: more efficient filter caching
• Fast interop with OpenGL / Direct3D for each layer
• No need to create/manage a texture atlas

No sampling artifacts
• Linear/bilinear filtering applied only within a layer
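A minimal allocation sketch (dimensions are illustrative; texture binding and kernel launch elided): a 2D layered CUDA array holds many same-sized layers behind one binding:

```cuda
#include <cuda_runtime.h>

int main()
{
    int width = 1024, height = 1024, numLayers = 16;

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    // With cudaArrayLayered, the third extent is the layer count,
    // so filtering never crosses layer boundaries
    cudaExtent extent = make_cudaExtent(width, height, numLayers);

    cudaArray *layeredArray;
    cudaMalloc3DArray(&layeredArray, &desc, extent, cudaArrayLayered);

    // ... bind to a layered texture reference and launch kernels ...

    cudaFreeArray(layeredArray);
    return 0;
}
```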




NVIDIA GPUDirect: Towards Eliminating the CPU Bottleneck

Version 1.0: for applications that communicate over a network
• Direct access to GPU memory for 3rd-party devices
• Eliminates unnecessary system memory copies & CPU overhead
• Supported by Mellanox and QLogic
• Up to 30% improvement in communication performance

Version 2.0: for applications that communicate within a node
• Peer-to-peer memory access, transfers & synchronization
• Less code, higher programmer productivity

Details @ http://www.nvidia.com/object/software-for-tesla-products.html


Before NVIDIA GPUDirect v2.0: Required Copy into Main Memory

Two copies required to move data from GPU1 memory to GPU2 memory:
1. cudaMemcpy(sysmem, GPU1)
2. cudaMemcpy(GPU2, sysmem)

(Diagram: GPU1 and GPU2 each attached to the chipset over PCI-e; all traffic staged through CPU system memory.)


NVIDIA GPUDirect v2.0: Peer-to-Peer Communication

Direct transfers between GPUs. Only one copy required:
1. cudaMemcpy(GPU2, GPU1)

(Diagram: the same PCI-e topology, with data moving directly from GPU1 memory to GPU2 memory.)


GPUDirect v2.0: Peer-to-Peer Communication

Direct communication between GPUs
• Faster: no system memory copy overhead
• More convenient multi-GPU programming

Direct transfers
• Copy from GPU0 memory to GPU1 memory
• Works transparently with UVA

Direct access
• GPU0 reads or writes GPU1 memory (load/store)

Supported on Tesla 20-series and other Fermi GPUs
64-bit applications on Linux and Windows TCC
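A minimal sketch of both modes, assuming two peer-capable Fermi GPUs at device indices 0 and 1 (error checking omitted):

```cuda
#include <cuda_runtime.h>

int main()
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can GPU0 reach GPU1?

    size_t bytes = 1 << 20;
    float *d0, *d1;

    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    if (canAccess)
        cudaDeviceEnablePeerAccess(1, 0);  // direct access: GPU0 kernels may
                                           // now load/store GPU1 memory

    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);

    // Direct transfer GPU0 -> GPU1, no staging through system memory
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}
```

With UVA enabled, a plain cudaMemcpy between the two device pointers works as well; cudaMemcpyPeer makes the devices explicit.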


Unified Virtual Addressing: Easier to Program with a Single Address Space

Without UVA: multiple memory spaces. System memory, GPU0 memory, and GPU1 memory each span their own separate address range.

With UVA: a single address space. System memory and all GPU memories occupy disjoint ranges of one shared virtual address space.


Unified Virtual Addressing

One address space for all CPU and GPU memory
• Determine physical memory location from a pointer value
• Enables libraries to simplify their interfaces (e.g. cudaMemcpy)

Before UVA: separate options for each permutation
• cudaMemcpyHostToHost
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice

With UVA: one option handles all cases
• cudaMemcpyDefault (data location becomes an implementation detail)

Supported on Tesla 20-series and other Fermi GPUs
64-bit applications on Linux and Windows TCC
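A minimal sketch (buffer size is illustrative): the same cudaMemcpyDefault flag covers host-to-device and device-to-host, because under UVA the runtime infers each pointer's location:

```cuda
#include <cuda_runtime.h>

int main()
{
    size_t bytes = 256 * sizeof(float);
    float *h_src, *h_dst, *d_buf;

    cudaMallocHost(&h_src, bytes);  // pinned host memory (has a UVA address)
    cudaMallocHost(&h_dst, bytes);
    cudaMalloc(&d_buf, bytes);

    // Same direction flag for both copies; no HostToDevice/DeviceToHost needed
    cudaMemcpy(d_buf, h_src, bytes, cudaMemcpyDefault);
    cudaMemcpy(h_dst, d_buf, bytes, cudaMemcpyDefault);

    cudaFreeHost(h_src);
    cudaFreeHost(h_dst);
    cudaFree(d_buf);
    return 0;
}
```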




Automated Performance Analysis in Visual Profiler

Summary analysis & hints at the session, device, context, and kernel levels

New UI for kernel analysis
• Identify limiting factor
• Analyze instruction throughput
• Analyze memory throughput
• Analyze kernel occupancy


New Features in cuda-gdb

Now available for both Linux and MacOS
• 'info cuda threads' output, automatically updated in DDD
• Breakpoints on all instances of templated functions
• Fermi disassembly (cuobjdump)
• C++ symbols shown in stack trace view

Details @ http://developer.nvidia.com/object/cuda-gdb.html


cuda-gdb Now Available for MacOS
Details @ http://developer.nvidia.com/object/cuda-gdb.html


NVIDIA Parallel Nsight Pro 1.5

Features
• Professional CUDA debugging
• Compute Analyzer: CUDA / OpenCL profiling
• Tesla Compute Cluster (TCC) debugging
• OpenGL and OpenCL Analyzer
• DirectX 10 & 11 Analyzer, Debugger & Graphics inspector

Requirements
• Tesla support: C1050/S1070 or higher
• Quadro support: G9x or higher
• GeForce support: 9 series or higher
• Windows 7, Vista and HPC Server 2008
• Visual Studio 2008 SP1 and Visual Studio 2010


CUDA Registered Developer Program

All GPGPU developers should become NVIDIA Registered Developers. Benefits include:

Early access to pre-release software
• Beta software and libraries
• CUDA 4.0 Release Candidate available now

Submit & track issues and bugs
• Interact directly with NVIDIA QA engineers

New benefits in 2011
• Exclusive Q&A webinars with NVIDIA Engineering
• Exclusive deep-dive CUDA training webinars
• In-depth engineering presentations on pre-release software

Sign up now: www.nvidia.com/ParallelDeveloper


Additional Information…
• CUDA Features Overview
• CUDA Developer Resources from NVIDIA
• CUDA 3rd Party Ecosystem
• PGI CUDA x86
• GPU Computing Research & Education
• NVIDIA Parallel Developer Program
• GPU Technology Conference 2011


CUDA Features Overview

New in CUDA 4.0
• GPUDirect (v2.0): peer-to-peer communication
• Unified Virtual Addressing
• C++ new/delete and C++ virtual functions
• Thrust C++ library: templated performance primitives
• Parallel Nsight Pro 1.5

Platform
• Hardware features: ECC memory, double precision, native 64-bit architecture, concurrent kernel execution, dual copy engines, 6 GB per GPU supported
• Operating system support: MS Windows 32/64, Linux 32/64, Mac OS X 32/64
• Designed for HPC: cluster management, GPUDirect, Tesla Compute Cluster (TCC), multi-GPU support

Programming model
• C support: NVIDIA C compiler, CUDA C parallel extensions, function pointers, recursion, atomics, malloc/free
• C++ support: classes/objects, class inheritance, polymorphism, operator overloading, class templates, function templates, virtual base classes, namespaces
• Fortran support: CUDA Fortran (PGI)

Parallel libraries
• NVIDIA library support: complete math.h, complete BLAS library (1, 2 and 3), sparse matrix math library, RNG library, FFT library (1D, 2D and 3D), video decoding library (NVCUVID), video encoding library (NVCUVENC), image processing library (NPP), video processing library (NPP)
• 3rd-party math libraries: CULA Tools (EM Photonics), MAGMA Heterogeneous LAPACK, IMSL (Rogue Wave), VSIPL (GPU VSIPL)

Development tools
• NVIDIA developer tools: Parallel Nsight for MS Visual Studio, cuda-gdb debugger with multi-GPU support, CUDA/OpenCL Visual Profiler, CUDA memory checker, CUDA disassembler, GPU Computing SDK, NVML, CUPTI
• 3rd-party developer tools: Allinea DDT, Rogue Wave TotalView, Vampir, Tau, CAPS HMPP


CUDA Developer Resources from NVIDIA

Development tools
• CUDA Toolkit: complete GPU computing development kit
• cuda-gdb: GPU hardware debugging
• cuda-memcheck: identifies memory errors
• cuobjdump: CUDA binary disassembler
• Visual Profiler: GPU hardware profiler for CUDA C and OpenCL
• Parallel Nsight Pro: integrated development environment for Visual Studio

SDKs and code samples
• GPU Computing SDK: CUDA C/C++, DirectCompute, OpenCL code samples and documentation
• Books: CUDA by Example, GPU Computing Gems, Programming Massively Parallel Processors, and many more
• Optimization guides: best practices for GPU computing and graphics development

Libraries and engines
• Math libraries: CUFFT, CUBLAS, CUSPARSE, CURAND, math.h
• 3rd-party libraries: CULA LAPACK, VSIPL
• NPP image libraries: performance primitives for imaging
• App acceleration engines: ray tracing (OptiX, iray)
• Video encoding / decoding: NVCUVENC / NVCUVID

http://developer.nvidia.com


CUDA 3rd Party Ecosystem

Cluster tools
• Cluster management: Platform HPC, Platform Symphony, Bright Cluster Manager, Ganglia Monitoring System, Moab Cluster Suite
• Job scheduling: Altair PBSpro, TORQUE, Platform LSF
• MPI libraries: coming soon…

Parallel language solutions & APIs
• PGI CUDA Fortran, PGI Accelerator (C/Fortran), PGI CUDA x86, CAPS HMPP
• pyCUDA (Python), Tidepowerd GPU.NET (C#), JCuda (Java)
• Khronos OpenCL, Microsoft DirectCompute
• 3rd-party math libraries: CULA Tools (EM Photonics), MAGMA Heterogeneous LAPACK, IMSL (Rogue Wave), VSIPL (GPU VSIPL), NAG

Parallel tools
• Parallel debuggers: MS Visual Studio with Parallel Nsight Pro, Allinea DDT debugger, TotalView debugger
• Parallel performance tools: ParaTools VampirTrace, Tau CUDA performance tools, PAPI, HPC Toolkit

Compute platform providers
• Cloud providers: Amazon EC2, Peer 1
• OEMs: Dell, HP, IBM
• Infiniband providers: Mellanox, QLogic


PGI CUDA x86 Compiler

Benefits
• Deploy CUDA apps on legacy systems without GPUs
• Less code maintenance for developers

Timeline
• April/May: 1.0 initial release (develop, debug, test functionality)
• August: 1.1 performance release (multicore, SSE/AVX support)


GPU Computing Research & Education
http://research.nvidia.com

World-class research leadership and teaching
• University of Cambridge, Harvard University, University of Utah, University of Tennessee, University of Maryland, University of Illinois at Urbana-Champaign, Tsinghua University, Tokyo Institute of Technology, Chinese Academy of Sciences, National Taiwan University, Georgia Institute of Technology

Proven research vision
• Johns Hopkins University, Mass. Gen. Hospital/NE Univ, Nanyang University, North Carolina State University, Technical University-Czech, Swinburne University of Tech., CSIRO, Technische Univ. Munich, SINTEF, UCLA, HP Labs, University of New Mexico, ICHEC, University of Warsaw-ICM, Barcelona SuperComputer Center, VSB-Technical University of Ostrava, Clemson University, Fraunhofer SCAI, Karlsruhe Institute of Technology, and more coming shortly

Academic partnerships / fellowships
• GPGPU education at 350+ universities


"Don't kid yourself. GPUs are a game-changer," said Frank Chambers, a GTC conference attendee shopping for GPUs for his finite element analysis work. "What we are seeing here is like going from propellers to jet engines. That made transcontinental flights routine. Wide access to this kind of computing power is making things like artificial retinas possible, and that wasn't predicted to happen until 2060."

- Inside HPC (Sept 22, 2010)


GPU Technology Conference 2011
October 11-14 | San Jose, CA

The one event you can't afford to miss
• Learn about leading-edge advances in GPU computing
• Explore the research as well as the commercial applications
• Discover advances in computational visualization
• Take a deep dive into parallel programming

Ways to participate
• Speak: share your work and gain exposure as a thought leader
• Register: learn from the experts and network with your peers
• Exhibit/Sponsor: promote your company as a key player in the GPU ecosystem

www.gputechconf.com
