2H 2015
intel-xeon-phi-sw-ecosystem-guide-2h-2015-public3
Distributed Iso3DFD*: 16th Order Isotropic Kernel

CLUSTER BENCHMARK | 1 NODE | NEW | APPROVED FOR PUBLIC PRESENTATION

Figure: Distributed Iso3DFD Speed Up. Vertical axis: Tflops (0 to 3). Horizontal axis: 1 node, 2 nodes, 3 nodes, 4 nodes. Bar labels in node order: 14.7 GCells, 29.3 GCells, 44 GCells, 58.7 GCells. Each node: 2S Intel® Xeon® processor E5-2697 v3 + 2 Intel® Xeon Phi coprocessor 7120P. For configuration details, go here.

SOURCE: INTEL MEASURED RESULTS AS OF FEBRUARY, 2015

Application: Distributed version of Iso3DFD.

Description: One-dimensional domain decomposition of Iso3DFD (16th order isotropic kernel). Multi-node tests; cluster performance and scalability of the MPI implementation.

Availability:
• Code: Not available.
• Recipe: Available here. Optimization guide available here.

Usage Model: Scaling analysis with each Intel® Xeon Phi coprocessor in a node solving a 14GB subdomain and each pair of Intel® Xeon® processors solving a 10GB subdomain.
• Symmetric MPI. One MPI process per node or device.
• OpenMP* within each MPI process on processor or coprocessor.
(Note: No disk I/O is performed to save seismic wave-fields.)

Highlights:
• Symmetric model allows host-coprocessor load balancing at MPI job launch time.
• Processes running on either the host or the MIC can be tuned independently.
• Leaves the remaining DDR3 node memory available for fast I/O and other tasks.

Results: Linear scalability was confirmed, indicating no cluster-level limitation. The 16GB memory on the coprocessors allows high overlap of computation and halo exchanges, hiding the cost of communication with the processors and other nodes.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance
*Other names and brands may be claimed as the property of others.
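The Iso3DFD usage model above gives each coprocessor a 14GB subdomain and each pair of host processors a 10GB subdomain under a one-dimensional decomposition. The benchmark code is not available, so the following is only a minimal Python sketch of how such a weighted split along one axis could be computed. The function name `split_1d`, the 380-plane grid, and the use of the subdomain sizes as weights are illustrative assumptions, not details of the benchmark. The halo width of 8 planes assumes a 16th-order central stencil, which reaches 8 points on each side.

```python
def split_1d(n_planes, weights, halo=8):
    """Split n_planes grid planes along one axis in proportion to weights.

    Hypothetical helper: returns one dict per rank with the owned plane
    range [start, stop) and the halo width needed on each side.
    """
    total = sum(weights)
    # Cumulative boundaries in integer arithmetic (round to nearest),
    # so every plane is assigned to exactly one rank.
    bounds = [0]
    acc = 0
    for w in weights:
        acc += w
        bounds.append((n_planes * acc + total // 2) // total)

    parts = []
    last = len(weights) - 1
    for rank in range(len(weights)):
        parts.append({
            "rank": rank,
            "start": bounds[rank],
            "stop": bounds[rank + 1],
            # Only interior faces need halo planes from a neighbour.
            "halo_lo": halo if rank > 0 else 0,
            "halo_hi": halo if rank < last else 0,
        })
    return parts


# One node: host process (10GB share) plus two coprocessors (14GB each).
weights = [10, 14, 14]
parts = split_1d(380, weights)
for p in parts:
    print(p["rank"], p["start"], p["stop"], p["halo_lo"], p["halo_hi"])
```

Because the weights are ordinary inputs, a split like this could be chosen when the job is launched, which is the kind of host-coprocessor balancing the highlights describe.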
Petrobras* Isotropic RTM

1 NODE | APPROVED FOR PUBLIC PRESENTATION

Figure: Petrobras* Isotropic RTM Speed Up. Vertical axis: Comparative Performance (0 to 5). Bars:
• 2S Intel® Xeon® processor E5-2697 v2, 2 MPI/12 OMP (baseline): 1
• 2S Xeon E5-2697 v2 (2/12) + Xeon Phi 7120A, 1 MPI/244 OMP: 2.2X
• 2S Xeon E5-2697 v2 (2/12) + 2x Xeon Phi 7120A, 1/244: 3.4X
• 2S Xeon E5-2697 v2 (2/12) + 3x Xeon Phi 7120A, 1/244: 4.6X
• 2S Xeon E5-2697 v2 (2/12) + 4x Xeon Phi 7120A, 1/244: 5.6X
“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2
“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120A
For configuration details, go here.

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY, 2014

Application: Isotropic RTM (reverse time migration).

Description: 3D isotropic RTM plays a major role in accurate imaging of complex subsurface structures. This benchmark measures performance and scalability of the compute kernel on a hybrid host + Intel® Xeon Phi coprocessor configuration. Halo exchanges are performed between devices. More at http://www.petrobras.com/en/home.htm.

Availability:
• Code: Proprietary.
• Recipe: Not available. Check for future availability here.

Usage Model: Intel® Xeon Phi coprocessor 7120A as a host in native mode, executing concurrently with the Intel® Xeon® processor E5-2697 v2. Same code for both devices, with OpenMP* and intrinsics.

Highlights: Scalable; competitive performance/watt.

Results: A 2S Intel® Xeon® processor E5-2697 v2 system populated with 4 Intel® Xeon Phi cards shows up to 5.6X scaling compared to the baseline 2S Intel® Xeon® processor E5-2697 v2.
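The Petrobras RTM speed-up figures reported above can be read as a gain per added coprocessor. The short Python sketch below does that arithmetic. The values are the bar labels from the chart; the names `speedup` and `incremental_gain` are illustrative only.

```python
# Speed-up over the 2S Xeon E5-2697 v2 baseline, keyed by the number of
# Xeon Phi 7120A cards in the node (bar labels from the chart).
speedup = {0: 1.0, 1: 2.2, 2: 3.4, 3: 4.6, 4: 5.6}


def incremental_gain(speedup):
    """Return the speed-up added by each extra card, in baseline units."""
    cards = sorted(speedup)
    return {
        n: round(speedup[n] - speedup[prev], 1)
        for prev, n in zip(cards, cards[1:])
    }


gains = incremental_gain(speedup)
for n, gain in gains.items():
    print(f"card {n}: +{gain:.1f}x baseline")
```

On these figures, each of the first three cards adds about 1.2 times the baseline throughput, and the fourth adds about 1.0.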