07.12.2015 Views

2H 2015

intel-xeon-phi-sw-ecosystem-guide-2h-2015-public3



Key Messages<br />

• Developers demonstrate the benefits of standards-based programming models using Intel® Xeon® processors and Intel® Xeon Phi coprocessors:<br />

Intel® Xeon Phi is the ideal platform for many core and wide vector code development (threads, vector lanes, memory).<br />

Once you’ve optimized for the most cores, threads and wide SIMD vectors, your code scales everywhere – including on future Intel<br />

Xeon Phi generations and on future (highly-parallel) Intel® Xeon® processors.<br />

Offload, symmetric, and native execution models are demonstrated in the proof points.<br />

• Continued Momentum Across Verticals<br />

78 performance proof points (41 Haswell, 37 Ivy Bridge) in this updated guide, featuring 38 new proof points, 12 new codes and 5 new<br />

recipes. 26 are cluster proof points.<br />

Several new Life Science (BLAST*, LAMMPS* and Johns Hopkins Bowtie2*) and Manufacturing proof points (ANSYS*, MSC Software*,<br />

HPCG benchmark, Sandia Mantevo*, and Fujitsu*) added. For more information about Life Sciences, click here.<br />

Includes new proof points for Financial Services, Weather, Physics, and Digital Content Creation.<br />

• Get started today with Intel and 3rd party tools/libraries:<br />

Software tailored to support the Intel® Many Integrated Core Architecture.<br />

Download the Application Catalog to see the list of available or in-flight codes.<br />

• Leverage Intel® Parallel Studio XE 2016 and Intel® Cluster Ready to create faster code.<br />

• Code Modernization – Drive faster breakthroughs through faster code: Get more results on your hardware today and carry<br />

your code forward to the future. More Code Modernization resources:<br />

Intel® Code Modernization Enablement Program<br />

What is Code Modernization?<br />



Intel® Modern Code Developer Challenge <strong>2015</strong><br />

Code Modernization contest for HPC developers<br />

• Learn valuable skills, make a difference and contribute to a social cause.<br />

• In partnership with CERN & others<br />

• Prizes include trip to CERN, Scholarship for top student participants.<br />

• Email interest list live today: http://software.intel.com/moderncode/challenge<br />



Table of Contents<br />

Intel® Xeon Phi Coprocessors<br />

• New or Updated Proof Points ><br />

• Proof Points Highlights and Summary ><br />

• A Growing Vendor Ecosystem ><br />

Customer Proof Points<br />

• Life Sciences ><br />

• Material Sciences ><br />

• Manufacturing ><br />

• Financial Services ><br />

• Energy ><br />

• Astrophysics ><br />

• Geophysics ><br />

• Physics ><br />

• Climate and Weather ><br />

• Digital Content Creation ><br />

Additional Resources<br />

• Tools and Libraries ><br />

• Intel Developer Training ><br />

• Development Tools ><br />



New or Updated Proof Points<br />

NEW proof points (38):<br />

• Life Sciences:<br />

LAMMPS*<br />

BLAST* (2)<br />

Johns Hopkins Bowtie2* (2)<br />

• Manufacturing and DCC:<br />

ANSYS Mechanical* (5)<br />

ANSYS server demo<br />

Altair Hyperworks*<br />

MSC Software*<br />

HPCG benchmark<br />

Sandia Mantevo miniFE<br />

Autodesk Maya* (2)<br />

Fujitsu HPC Gateway*<br />

Fujitsu server demo<br />

E4*<br />

Intel Modern Code<br />

CST*<br />

Embree (2)<br />

• Financial Services:<br />

Monte Carlo* European Option Pricing<br />

Monte Carlo Excel*-Based Option Pricing<br />

QuantLib* (2)<br />

• Energy Industry:<br />

Iso3DFD<br />

Distributed Iso3DFD<br />

specfem3D<br />

DownUnder GeoSolutions*<br />

• Physics:<br />

miniGhost* (2)<br />

BerkeleyGW*<br />

ASKAP*<br />

Texas Advanced Computing Center*<br />

• Weather Research:<br />

ROMS*<br />

Proof points with updated code and/or recipe (7):<br />

• Life Sciences:<br />

Amber 14 (2 codes)<br />

• Financial Services:<br />

Monte Carlo<br />

QuantLib (2 codes, 2 recipes)<br />



New Proof Points and New Processor Speed Ups:<br />

Verticals Snapshot<br />

Average speed up in proof points 1 :<br />

• Life Sciences: 2.58X average speed up among new proof points and new processor.<br />

• Manufacturing and DCC: 2.48X average speed up among new proof points and new processor.<br />

• Financial Services: 3.86X average speed up among new proof points and new processor.<br />

• Energy Industry: 1.72X speed up in new proof point and new processor.<br />

• Physics: 1.6X average speed up among new proof points and new processor.<br />

• Weather Research: 1.46X average speed up among new proof points and new processor.<br />

1 - Performance demonstrated in proof points in this presentation<br />

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,<br />

components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated<br />

purchases, including the performance of that product when combined with other products.<br />



Intel® Xeon Phi Coprocessors Software Ecosystem Highlights and Summary<br />

Intel® Xeon Phi coprocessors<br />

Powerful software used to gain meaningful insights<br />

Compelling reasons to optimize applications<br />

This presentation demonstrates improved software performance for key applications in key business<br />

segments. The majority of the latest proof points show increases 1 when adding the Intel® Xeon Phi<br />

coprocessor to the host Intel® Xeon® processor E5-2697 v2 or Intel® Xeon® processor E5-2697 v3.<br />

Some of the best examples are summarized below:<br />

Life Sciences Highlights:<br />

• Molecular systems with classical models simulation:<br />

Up to 6.5X performance increase.<br />

• Biomolecular simulations (protein, DNA, RNA, virus, etc.):<br />

Up to 5X performance increase.<br />

Manufacturing Highlights:<br />

• Static structural analysis<br />

Up to 4.7X single node performance increase.<br />

Financial Services Highlights:<br />

• European Option Pricing using Monte Carlo:<br />

Up to 8.91X performance increase.<br />

Energy Industry Highlights:<br />

• Accurate imaging of complex subsurface structures:<br />

Up to 2.2X performance increase.<br />

Physics and Astronomy Highlights:<br />

• Finite difference mini-application that implements a difference stencil across a homogeneous three-dimensional domain:<br />

Up to 1.76X performance increase.<br />

Weather Research and Forecasting Highlights:<br />

• A weather prediction system designed to serve atmospheric research and operational forecasting needs:<br />

Up to 1.94X performance increase.<br />

1 - Performance demonstrated in proof points in this presentation<br />



Application Performance Snapshot<br />

Intel® Xeon Phi Coprocessor vs. 2 socket Intel® Xeon® processor<br />

[Bar chart: relative performance (axis 0 to 10), grouped by segment: DCC, Energy, Financial Services, Life Sciences, Mfg., Physics, Weather. Bar labels range from 0.45 to 8.91.]<br />

Source: Intel measured as of Q1 2015. Configuration details: please see slide speaker notes. For more information go to http://www.intel.com/performance<br />


Intel® Xeon® Processor E5-2697 v2 and Intel® Xeon® Processor E5-2697 v3 Better Together with Intel® Xeon Phi Coprocessor<br />

Software Performance Increases with 2-Socket Intel® Xeon® Processor E5-2697 v2 + Intel® Xeon Phi Coprocessor 7120 compared to Intel® Xeon® Processor E5-2697 v2 1<br />

Software Performance Increases with 2-Socket Intel® Xeon® Processor E5-2697 v3 + Intel® Xeon Phi Coprocessor 7120 compared to Intel® Xeon® Processor E5-2697 v3 1<br />

Applications in the first chart: Monte Carlo* European Options, QuantLib* SP Monte Carlo, ANSYS Mechanical, Sandia Mantevo*, Petrobras*, Black-Scholes* Formula Valuation, Monte Carlo Excel*-based, QuantLib* SP Black Scholes, Binomial Options*, ASKAP*, Iso3DFD, LAMMPS Production Protein, GROMACS*, miniGhost*, BWA-ALN*, OpenLB*.<br />

Bar labels as extracted (pairing with applications not recoverable): 2.12, 2.09, 1.84, 1.85, 1.73, 1.72, 1.61, 1.56, 1.55, 1.50, 1.53, 2.30, 2.20, 3.34, 6.93, 8.91.<br />

Applications in the second chart: QuantLib* SP Monte Carlo, ANSYS Mechanical, Monte Carlo RNG* European Options, NAMD* 2.10 STMV, Autodesk Maya* 2014, Binomial Options*, NAMD* 2.10 ApoA1, BLASTn v.30, WRF*, LAMMPS* Liquid Crystal (32 nodes), DMI*, AMBER* 14 Tobacco Virus.<br />

Bar labels as extracted (pairing with applications not recoverable): 1.45, 1.33, 1.25, 1.54, 1.52, 1.52, 1.49, 1.87, 1.75, 1.74, 2.61, 3.8.<br />

1 Performance demonstrated in respective proof points in this presentation<br />

* Other names and brands may be claimed as the property of others.<br />


Is Intel® Xeon Phi Coprocessor Performance Compelling Compared to NVIDIA*?<br />

Software Performance Increases with 2-Socket Intel® Xeon® Processor E5-2697 v2 or v3 + Intel® Xeon Phi Coprocessor 7120 compared to Intel® Xeon® Processor E5-2697 v2 or v3 + NVIDIA GPU 1<br />

• QuantLib* SP Monte Carlo: 5.17<br />

• Isotropic FD RTM Kernel*: 4.78<br />

• LAMMPS* Stillinger-Weber: 3.73<br />

• Embree* 2.2 Pathtracer: 3.48<br />

• QuantLib* Black-Scholes: 2.62<br />

• LAMMPS* Rhodopsin 512K: 1.94<br />

• ASKAP* and Xcelerit*: bar labels 1.25 and 1.43 as extracted (pairing not recoverable)<br />

1 Performance demonstrated in respective proof points in this presentation<br />



Memory Comparison: Intel vs. NVIDIA*<br />

Memory Capacity (GB), without ECC / with ECC:<br />

• 3120P: 6.0 / 5.8<br />

• 5110P: 8.0 / 7.8<br />

• 7120P: 16.0 / 15.5<br />

• K10: 4.0 / 3.5<br />

• K20: 5.0 / 4.4<br />

• K20X: 6.0 / 5.3<br />

• K40: 12.0 / 10.5<br />

Intel: 3.25% drop w/ECC on. NVIDIA: 12.5% drop w/ECC on.<br />

• Intel has Higher Memory Capacity (16GB vs. 12GB)<br />

• Intel has less impact to capacity when ECC is Enabled<br />

• NVIDIA loses ~12.5% mem capacity when ECC is enabled vs. ~3.25% for Intel<br />
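The ECC percentages quoted in these bullets can be checked against the capacity labels on the chart. The sketch below is only an illustration: the pairing of each capacity label with a card name is an assumption read off the chart, and `ecc_drop_percent` is a hypothetical helper name.

```python
# Capacity pairs in GB as (without ECC, with ECC).
# Assumption: chart labels are paired with card names in the order shown.
CAPACITY_GB = {
    "3120P": (6.0, 5.8),
    "5110P": (8.0, 7.8),
    "7120P": (16.0, 15.5),
    "K10": (4.0, 3.5),
    "K20": (5.0, 4.4),
    "K20X": (6.0, 5.3),
    "K40": (12.0, 10.5),
}


def ecc_drop_percent(without_ecc, with_ecc):
    """Percentage of memory capacity lost when ECC is enabled."""
    return (without_ecc - with_ecc) / without_ecc * 100.0


if __name__ == "__main__":
    for card, (raw, ecc) in CAPACITY_GB.items():
        print(f"{card}: {ecc_drop_percent(raw, ecc):.2f}% drop with ECC on")
```

Because the chart labels are rounded to one decimal place, the computed drops land near, rather than exactly on, the quoted figures: roughly 2.5% to 3.3% for the Intel cards and roughly 11.7% to 12.5% for the NVIDIA cards.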

Source: Intel results measured as of October 2013. NVIDIA* values based on www.nvidia.com<br />

K10 = capacity behind each GPU on the card<br />

Some results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.<br />



Leverage Code Resources<br />

Code Modernization benefits both the Processor & the Coprocessor<br />

Code Modernization results in significant improvements for both the processor and the coprocessor<br />

Relative Performance (chart axis 0 to 5):<br />

Amber 14 Cellulose NPT<br />

• 2S Xeon E5-2697 v2, Original Parallel Code (Xeon Only): 1<br />

• 2S Xeon E5-2697 v2, Optimized Code (Xeon Only): 1.70<br />

• 2S Xeon E5-2697 v2 + 1 Xeon Phi 7120A, Optimized Code (Xeon + Xeon Phi): 2.22<br />

LAMMPS Liquid Crystal<br />

• 2S Xeon E5-2697 v2, Original Parallel Code (Xeon Only): 1<br />

• 2S Xeon E5-2697 v2, Optimized Code (Xeon Only): 3.43<br />

• 2S Xeon E5-2697 v2 + 1 Xeon Phi 7120A, Optimized Code (Xeon + Xeon Phi): 5.07<br />

Xeon = Intel® Xeon® processor<br />

Xeon Phi = Intel® Xeon Phi coprocessor<br />

Code Optimizations<br />

- Reduce host data xfer<br />

- Better data decomp. between CPU & Card<br />

- Better parallelization<br />

- Better OpenMP sched.<br />

Code Optimizations<br />

- Better Vectorization<br />

- Mixed Precision support<br />

- Misc changes for better parallelization<br />



A Growing Ecosystem:<br />

The Intel® Xeon Phi Coprocessors Applications & Solutions Catalog<br />

One place for all current public announcements, collateral, recipes, articles, benchmarks and case studies:<br />

• Provides insight into work-in-process optimization and porting efforts in the software communities<br />

• Lists optimized, ported, and downloadable software<br />

• Click here to Access the Catalog Now!<br />

• Click here for companies offering Intel® Xeon Phi Coprocessors<br />

Intel.com/xeonphi<br />

Developer Zone<br />



“Intel’s leading technology & product provide great high performance computing power which enable us achieve more genome scientific research success for genome application development for China and for the whole human being.”<br />

Optimizing applications. Accelerating discovery.<br />

Life Sciences<br />

Wang Bingqiang<br />

Head of High Performance Computing,<br />

BGI<br />



Comparative Performance<br />

LAMMPS*<br />

Stillinger-Weber Water Benchmark<br />

32 NODES CLUSTER BENCHMARK<br />

APPROVED FOR PUBLIC PRESENTATION<br />

NEW<br />

LAMMPS* Stillinger-Weber Water Benchmark Speed Up<br />

[Bar chart, axis 0 to 3, for 1 Node (256K molecules) and 32 Nodes (8.2M molecules). Bar labels as extracted: 3X, 3.41X, 3.05X, 3.6X, 1, 0.9X, 1. Chart note: No testing on Tesla.]<br />

Chart series:<br />

• 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS baseline)<br />

• 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package)<br />

• 2S Xeon E5-2697 v3 + Tesla K40c*, boost off, ECC on<br />

• 2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package)<br />

“Xeon E5-2697 v3” = Intel® Xeon® processor E5-2697 v3<br />

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120A<br />

Application: LAMMPS*<br />

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov/<br />

Availability:<br />

• Code: In main LAMMPS repository.<br />

• Recipe: Available here.<br />

Usage Model: Load balancer offloads part of neighbor-list and non-bond force calculations to the Intel® Xeon Phi coprocessor for concurrent calculations with the CPU.<br />

Highlights: Improved results with Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi coprocessor 7120A. Dynamic load balancing allows for concurrent:<br />

• Data transfer between host and coprocessor.<br />

• Calculations of neighbor-list, non-bond, bond, and long-range terms.<br />

The same routines in the LAMMPS Intel Package also run faster on the CPU.<br />

Results: Simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel Xeon Phi coprocessor computations and MPI communications yield improved speedup at higher node counts.<br />
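To make the idea of dynamic load balancing between host and coprocessor concrete, here is a minimal sketch. It is not the LAMMPS implementation: the function names `rebalance` and `simulate`, the damping factor, and the device speeds are all hypothetical, chosen only to show how an offload fraction can be adjusted until both sides finish at about the same time.

```python
def rebalance(fraction, host_time, coproc_time, damping=0.5):
    """Return an updated offload fraction from the last step's timings.

    fraction    -- share of the work sent to the coprocessor, in (0, 1)
    host_time   -- seconds the host spent on its share
    coproc_time -- seconds the coprocessor spent on its share
    """
    host_rate = (1.0 - fraction) / host_time    # work per second on the host
    coproc_rate = fraction / coproc_time        # work per second on the coprocessor
    # Fraction at which both sides would finish at the same moment.
    target = coproc_rate / (host_rate + coproc_rate)
    # Move only part of the way toward the target to damp timing noise.
    new_fraction = fraction + damping * (target - fraction)
    return min(max(new_fraction, 0.01), 0.99)   # keep both sides busy


def simulate(host_speed, coproc_speed, steps=50, fraction=0.5):
    """Run the balancer against two devices with fixed (made-up) speeds."""
    for _ in range(steps):
        host_time = (1.0 - fraction) / host_speed
        coproc_time = fraction / coproc_speed
        fraction = rebalance(fraction, host_time, coproc_time)
    return fraction


if __name__ == "__main__":
    # A coprocessor three times faster than the host ends up with ~75% of the work.
    print(f"offload fraction: {simulate(1.0, 3.0):.3f}")
```

The damping term is a design choice for this sketch: real timings fluctuate from step to step, so moving only partway toward the measured optimum avoids oscillating between over- and under-loading the coprocessor.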

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF MARCH, <strong>2015</strong><br />



Comparative Performance<br />

Johns Hopkins Bowtie 2*<br />

Multiple workloads<br />

Johns Hopkins Bowtie 2 Multiple Workloads Speed Up<br />

[Bar chart. Workloads: ERR161544, SRR034966_1, ERR000589, SRR002273_1. Series: Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla* K40; Intel® Xeon® processor E5-2697 v3. Bar labels as extracted: 1, .88X, 1.08X, 1.59X, 1.87X.]<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY, <strong>2015</strong><br />

1 NODE<br />

Applications: 1) Bowtie2 version 2.2.3; Intel® AVX2 port. Bowtie<br />

2 is an ultrafast and memory-efficient tool for aligning<br />

sequencing reads to long reference sequences. 2) NVBowtie<br />

version 0.9.9.3. NVBowtie is a GPU-accelerated re-engineering of<br />

Bowtie2, a widely used short-read aligner. While being<br />

completely rewritten from scratch, NVBowtie reproduces many<br />

(though not all) of the features of Bowtie2.<br />

http://nvlabs.github.io/nvbio/nvbowtie_page.html<br />

Description: Bowtie 2, in addition to aligning sequencing reads<br />

to long reference sequences, is particularly good at aligning<br />

reads of about 50 up to 100s or 1,000s of characters, and<br />

particularly good at aligning to relatively long (e.g. mammalian)<br />

genomes.<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Not available. Check for future availability here.<br />

Usage Models: ERR161544, SRR002273_1, HEK001(TGen),<br />

ERR000589_1, SRR033552_1, SRR034966_1, ERR024139_1<br />

Highlights: See more here.<br />

APPROVED FOR PUBLIC PRESENTATION<br />

NEW<br />

Results: Bowtie2 running on the Intel® Xeon® processor E5-2697 v3 with the Intel® AVX2 port is faster than NVBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40* for 6 of 7 workloads. NVIDIA published data for the K40 compared to the Intel® Xeon® processor E5-2600 (6 cores) on one workload.<br />



Comparative Performance<br />

Johns Hopkins Bowtie 2*<br />

TGen workload<br />

1 NODE<br />

APPROVED FOR PUBLIC PRESENTATION<br />

NEW<br />

Johns Hopkins Bowtie 2 TGen Workload Speed Up<br />

• Intel® Xeon® processor E5-2697 v2: 1<br />

• Xeon E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A: 1.37X<br />

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120A<br />

Application: Johns Hopkins Bowtie2* SSE2 and Bowtie2 running<br />

on the Intel® Xeon Phi coprocessor. Bowtie 2 is an ultrafast and<br />

memory-efficient tool for aligning sequencing reads to long<br />

reference sequences.<br />

Description: TGen workload – analyzing cancer tumor RNA data from the Translational Genomics Research Institute (TGen), available here and here. This workload was used in Intel® Xeon® processor launch benchmarks.<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Not available. Check for future availability here.<br />

Usage Model: Heterogeneous model: data is processed on the host (78 input files) and on the Intel® Xeon Phi coprocessor (36 input files); each file contains 500,000 reads. On the single Intel Xeon Phi coprocessor, 12 simultaneous jobs with 10 threads each were run; one job with 24 threads was run on the host.<br />
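The figures in this usage model can be turned into a quick sanity check of how the work is divided. The sketch below uses only the numbers quoted above; the function name `workload_summary` and the dictionary keys are hypothetical.

```python
# Figures taken from the usage model description.
HOST_FILES = 78
COPROCESSOR_FILES = 36
READS_PER_FILE = 500_000
COPROCESSOR_JOBS = 12
THREADS_PER_COPROCESSOR_JOB = 10
HOST_THREADS = 24


def workload_summary():
    """Summarize how files, reads and threads are divided between devices."""
    total_files = HOST_FILES + COPROCESSOR_FILES
    return {
        "total_files": total_files,
        "total_reads": total_files * READS_PER_FILE,
        "coprocessor_share": COPROCESSOR_FILES / total_files,
        "coprocessor_threads": COPROCESSOR_JOBS * THREADS_PER_COPROCESSOR_JOB,
        "host_threads": HOST_THREADS,
    }


if __name__ == "__main__":
    for key, value in workload_summary().items():
        print(f"{key}: {value}")
```

Under these numbers the coprocessor handles a little under a third of the 114 input files, using 120 threads against the host's 24.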

Highlights: See more here.<br />

Results: Up to 1.37X improved performance with the Intel®<br />

Xeon® processor E5-2697 v2 (host) and the Intel Xeon Phi<br />

coprocessor 7120A over the baseline host.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY, <strong>2015</strong><br />



Comparative Performance<br />

BLAST*<br />

BLASTn v.30<br />

BLASTn* v.30 Speed Up<br />

• 2S Intel® Xeon® processor E5-2697 v2 (BLASTn v.30 baseline): 1<br />

• 2S Xeon E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A: 1.26X<br />

• 2S Xeon E5-2697 v2 + Xeon Phi 7120A (OFS parallelized): 1.41X<br />

• 2S Intel® Xeon® processor E5-2697 v3 (BLASTn v.30 baseline): 1.22X<br />

• 2S Xeon E5-2697 v3 + Intel® Xeon Phi coprocessor 7120A: 1.49X<br />

• 2S Xeon E5-2697 v3 + Xeon Phi 7120A (OFS parallelized): 1.52X<br />

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3<br />

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120A<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF MARCH, <strong>2015</strong><br />

Application: Basic Local Alignment Search Tool (BLASTn) v.30.<br />

Description: Searching for alignment in nucleotide query<br />

sequences against a known nucleotide db volume set. National<br />

Center for Biotechnology Information (NCBI*). More at<br />

http://blast.ncbi.nlm.nih.gov/.<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />

Usage Model: #4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna.00-02 are distributed between the Intel® Xeon® processor and the Intel® Xeon Phi coprocessor to find the split that gives the maximum speedup (the sweet spot). The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 80/20 and 59/23/18.<br />
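As an illustration of how a randomized query set might be divided according to such a split, consider the sketch below. It is an assumption-laden example rather than the recipe's method: `split_by_shares` is a hypothetical helper, the query names are placeholders, and the source does not say which device receives which share.

```python
import random


def split_by_shares(items, shares):
    """Partition items into consecutive chunks with the given sizes."""
    if sum(shares) != len(items):
        raise ValueError("shares must add up to the number of items")
    chunks, start = [], 0
    for size in shares:
        chunks.append(items[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    queries = [f"query_{i:03d}" for i in range(100)]  # placeholder query names
    random.shuffle(queries)                           # randomized pick, as in the experiment
    two_way = split_by_shares(queries, [80, 20])
    three_way = split_by_shares(queries, [59, 23, 18])
    print([len(chunk) for chunk in two_way])
    print([len(chunk) for chunk in three_way])
```

Each chunk would then be concatenated into one query file and handed to one device or job; repeating the shuffle gives the randomized repetitions described in the usage model.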

Highlights: Throughput for this load sharing model has a small<br />

sweet spot for a sufficiently large query set.<br />

Results: Compared to the baseline, the speed up with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.<br />


1 NODE<br />

APPROVED FOR PUBLIC PRESENTATION<br />

NEW<br />



Comparative Performance<br />

BLAST*<br />

BLASTp v.30<br />

BLASTp* v.30 Speed Up<br />

• 2S Intel® Xeon® processor E5-2697 v2 (BLASTp v.30 baseline): 1<br />

• 2S Xeon E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A: 1.15X<br />

• 2S Xeon E5-2697 v2 + Xeon Phi 7120A (OFS parallelized): 1.3X<br />

• 2S Intel® Xeon® processor E5-2697 v3 (BLASTp v.30 baseline): 1.21X<br />

• 2S Xeon E5-2697 v3 + Intel® Xeon Phi coprocessor 7120A: 1.39X<br />

• 2S Xeon E5-2697 v3 + Xeon Phi 7120A (OFS parallelized): 1.41X<br />

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3<br />

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120A<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF MARCH, <strong>2015</strong><br />

Application: Basic Local Alignment Search Tool (BLASTp) v.30<br />

Description: Searching for alignment in protein query sequence<br />

against a known protein db volume set. More at<br />

http://blast.ncbi.nlm.nih.gov/.<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />

Usage Model: #4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted.00-02 are distributed between the Intel® Xeon® processor and the Intel® Xeon Phi coprocessor to find the split that gives the maximum speedup (the sweet spot). The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 33/7 and 28/5/7.<br />

Highlights: Throughput for this offload model has a small sweet<br />

spot for a sufficiently large query set. Throughput is limited due<br />

to GAT stage not parallelized.<br />

Results: Compared to the baseline, the speed up with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.<br />


1 NODE<br />

APPROVED FOR PUBLIC PRESENTATION<br />

NEW<br />



Comparative Performance<br />

NAMD* 2.10 Pre-Release<br />

STMV<br />

CLUSTER BENCHMARK<br />

32 NODES<br />

APPROVED FOR PUBLIC PRESENTATION<br />

NAMD* 2.10 (pre-release) Cluster Performance Increase<br />

STMV (~1M atoms)<br />

Speed up at 1 Node / 8 Nodes / 32 Nodes (chart axis 0 to 30):<br />

• Intel® Xeon® processor E5-2697 v2 (Baseline: 1 node, 23 or 47 PPN): 1 / 6.8X / 20X<br />

• Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN): 1.2X / 7.9X / 24.2X<br />

• Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi coprocessor 7120A (240T): 2X / 12.2X / 27.2X<br />

• Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi coprocessor 7110A (240T): 2.1X / 13.1X / 32X<br />

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3<br />

Application: NAMD 2.10 pre-release; STMV<br />

Description: A parallel, object-oriented molecular<br />

dynamics code designed for high-performance<br />

simulation of large biomolecular systems. More at<br />

http://www.ks.uiuc.edu/Research/namd/<br />

Availability:<br />

• Code: Intel® Xeon Phi coprocessor support is<br />

available as a pre-release. Use the nightly build.<br />

• Recipe: Available here.<br />

Usage Model: Single rank on host with 47 threads.<br />

Various computations are offloaded to Intel® Xeon<br />

Phi coprocessor from each thread.<br />

Highlights: Intel® Xeon Phi coprocessor support is<br />

now in the development branch of NAMD 2.10 pre-release.<br />

Results: For the STMV workload, the Intel® Xeon®<br />

processor E5-2697 v3 and the Intel® Xeon Phi<br />

coprocessor (32 nodes, 55 PPN) improved<br />

performance by up to 32X compared to the baseline<br />

processor (1 node, 47 PPN).<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014<br />



Comparative Performance<br />

NAMD* 2.10 Pre-Release<br />

ApoA1<br />

CLUSTER BENCHMARK<br />

2 NODES<br />

Chart: NAMD* 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms); 55 PPN<br />

Node counts shown: 1 Node, 2 Nodes<br />

Series:<br />

• Intel® Xeon® processor E5-2697 v3 (Baseline: 1 node, 55 PPN)<br />

• Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi coprocessor B1-7110A (240T)<br />

Data labels: 2.61X, 1, 1.52X, 1.94X<br />

Application: NAMD* 2.10 pre-release; ApoA1<br />

Description: A parallel, object-oriented molecular dynamics code<br />

designed for high-performance simulation of large biomolecular<br />

systems. More at http://www.ks.uiuc.edu/Research/namd/<br />

Availability:<br />

• Code: Intel® Xeon Phi coprocessor support is available as a pre-release.<br />

Use the nightly build.<br />

• Recipe: Available here.<br />

Usage Model: Single rank on host with 55 threads. Various<br />

computations are offloaded to Intel® Xeon Phi coprocessor from<br />

each thread.<br />

Highlights: Intel® Xeon Phi coprocessor support is now in the<br />

development branch of NAMD 2.10 pre-release.<br />

Results: For the ApoA1 workload, 2-node performance can be<br />

accelerated by up to 2.61X using a single Intel® Xeon Phi<br />

coprocessor.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014<br />



Comparative Performance<br />

NAMD* 2.10 Pre-Release<br />

ApoA1<br />

CLUSTER BENCHMARK<br />

2 NODES<br />

Chart: NAMD* 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms); 47 PPN / 55 PPN<br />

Node counts shown: 1 Node, 2 Nodes<br />

Series:<br />

• Intel® Xeon® processor E5-2697 v2 (Baseline: 1 node, 47 PPN)<br />

• Intel® Xeon® processor E5-2697 v3 (55 PPN)<br />

• Xeon E5-2697 v2 (47 PPN) + 1 Xeon Phi C0-7120A (240T)<br />

• Xeon E5-2697 v3 (55 PPN) + 1 Xeon Phi B1-7110A (240T)<br />

Data labels: 1, 1.2X, 1.64X, 1.82X, 1.92X, 2.33X, 2.85X, 3.14X<br />

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3<br />

“Xeon Phi 7110A/7120A” = Intel® Xeon Phi coprocessor 7110A/7120A<br />

Application: NAMD 2.10 pre-release; ApoA1<br />

Description: A parallel, object-oriented molecular dynamics code<br />

designed for high-performance simulation of large biomolecular<br />

systems. More at http://www.ks.uiuc.edu/Research/namd/<br />

Availability:<br />

• Code: Intel® Xeon Phi coprocessor support is available as a<br />

pre-release. Use the nightly build.<br />

• Recipe: Available here.<br />

Usage Model: Single rank on host with 47 threads. Various<br />

computations are offloaded to Intel® Xeon Phi coprocessor from<br />

each thread.<br />

Highlights: Intel® Xeon Phi coprocessor support is now in the<br />

development branch of NAMD 2.10 pre-release.<br />

Results: For the ApoA1 workload, the Intel® Xeon® processor E5-<br />

2697 v3 and the Intel® Xeon Phi coprocessor (2 nodes, 55 PPN)<br />

improved performance by up to 3.14X compared to the baseline<br />

processor (1 node, 47 PPN).<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014<br />



Comparative Performance<br />

LAMMPS*<br />

Liquid Crystal Benchmark; 524K Atoms<br />

1 NODE<br />

Chart: LAMMPS* Liquid Crystal Benchmark Performance (Mixed Precision)<br />

Series:<br />

• 2S Intel® Xeon® processor E5-2697 v2 (LAMMPS Baseline)<br />

• 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline)<br />

• 2S Intel® Xeon® processor E5-2697 v2 (LAMMPS IA Package)<br />

• 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package)<br />

• 2S E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A Turbo Off (LAMMPS IA Package)<br />

• 2S E5-2697 v3 + Intel® Xeon Phi coprocessor 7110P/7120A Turbo Off (LAMMPS IA Package)<br />

Data labels: 1, 1.2X, 3.4X, 5.4X, 5.1X, 6.5X<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF AUGUST, 2014<br />

Application: LAMMPS*<br />

Description: Simulation of molecular systems with classical<br />

models. More at http://lammps.sandia.gov/<br />

Availability:<br />

• Code: In main LAMMPS repository.<br />

• Recipe: Available here.<br />

Usage Model: Load balancer offloads part of neighbor-list and<br />

non-bond force calculations to Intel® Xeon Phi coprocessor for<br />

concurrent calculations with CPU.<br />

Highlights: Improved results with Intel® Xeon® processor E5-2697<br />

v3 and Intel® Xeon Phi coprocessor 7120A. Dynamic load<br />

balancing allows for concurrent:<br />

• Data transfer between host and coprocessor.<br />

• Calculations of neighbor-list, non-bond, bond, and long-range<br />

terms.<br />

Same routines in LAMMPS Intel Package also run faster on CPU.<br />
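The dynamic load balancing described above can be pictured with a small sketch. This is not the algorithm used in the LAMMPS Intel Package; it is a generic, damped rebalancing rule written for illustration. The names `rebalance` and `simulate`, the `damping` and `min_fraction` parameters, and the device speeds are all assumptions. The rule adjusts the share of work sent to the coprocessor so that host and coprocessor finish their concurrent calculations at about the same time.

```python
def rebalance(fraction, host_time, coproc_time, damping=0.5, min_fraction=0.01):
    """Return an updated offload fraction from the last step's timings.

    fraction    -- share of the work sent to the coprocessor last step (0..1)
    host_time   -- seconds the host spent on its share
    coproc_time -- seconds the coprocessor spent on its share
    """
    # Observed throughput of each side, in "share of work per second".
    host_rate = (1.0 - fraction) / host_time
    coproc_rate = fraction / coproc_time
    # Split at which both sides would finish together, given those rates.
    target = coproc_rate / (host_rate + coproc_rate)
    # Move only part of the way to the target to damp timing noise.
    updated = fraction + damping * (target - fraction)
    # Keep both sides busy so that both rates stay measurable.
    return min(max(updated, min_fraction), 1.0 - min_fraction)


def simulate(host_speed, coproc_speed, steps=20, fraction=0.5):
    """Run the balancer against two devices with fixed, made-up speeds."""
    for _ in range(steps):
        host_time = (1.0 - fraction) / host_speed
        coproc_time = fraction / coproc_speed
        fraction = rebalance(fraction, host_time, coproc_time)
    return fraction


if __name__ == "__main__":
    # Hypothetical speeds: the coprocessor handles work 3x as fast as the host.
    print(round(simulate(host_speed=1.0, coproc_speed=3.0), 3))
```

In this toy model, a coprocessor that is three times as fast as the host drives the offload fraction toward 3 / (1 + 3) = 0.75. Real timings are noisy, which is why the sketch moves only part of the way to the target on each step.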

Results: Up to 6.5X performance improvement utilizing Intel®<br />

Xeon® processors and Intel® Xeon Phi coprocessors in<br />

combination with application optimization and generational<br />

improvements in architecture and higher core counts.<br />



Comparative Performance<br />

LAMMPS*<br />

Liquid Crystal Benchmark; 524K Atoms<br />

CLUSTER BENCHMARK<br />

32 NODES<br />

Chart: LAMMPS* Liquid Crystal Benchmark Performance (Mixed Precision)<br />

Node counts shown: 1 Node (524K Atoms), 32 Nodes (16.8M Atoms)<br />

Series:<br />

• 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline)<br />

• 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package)<br />

• 2S E5-2697 v3 + Intel® Xeon Phi coprocessor 7110P/7120A Turbo Off (LAMMPS IA Package)<br />

Data labels: 1, 4.5X, 5.4X (1 Node); 1, 3.66X, 5.3X (32 Nodes)<br />

Application: LAMMPS*<br />

Description: Simulation of molecular systems with classical models.<br />

More at http://lammps.sandia.gov/<br />

Availability:<br />

• Code: In main LAMMPS repository.<br />

• Recipe: Available here.<br />

Usage Model: Load balancer offloads part of neighbor-list and non-bond<br />

force calculations to Intel® Xeon Phi coprocessor for<br />

concurrent calculations with CPU.<br />

Highlights: Improved results with Intel® Xeon® processor E5-2697<br />

v3 and Intel® Xeon Phi coprocessor 7120A. Dynamic load<br />

balancing allows for concurrent:<br />

• Data transfer between host and coprocessor.<br />

• Calculations of neighbor-list, non-bond, bond, and long-range<br />

terms.<br />

Same routines in LAMMPS Intel Package also run faster on CPU.<br />

Results: Up to 5.4X performance improvement utilizing Intel®<br />

Xeon® processors and Intel® Xeon Phi coprocessors with<br />

application optimization on a single node compared to the baseline<br />

configuration. Performance gains continue to hold at 5.3X when<br />

scaling up to 32 nodes.<br />
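One way to read "continue to hold" is as the ratio of the two quoted speedups. The sketch below assumes each figure is measured against the baseline configuration at the same node count; the function name `scaling_retention` is hypothetical and is used here only for illustration.

```python
def scaling_retention(single_node_speedup, multi_node_speedup):
    """Fraction of the single-node speedup that is still seen at scale."""
    if single_node_speedup <= 0:
        raise ValueError("single_node_speedup must be positive")
    return multi_node_speedup / single_node_speedup


if __name__ == "__main__":
    # Liquid Crystal figures quoted in the Results text: 5.4X and 5.3X.
    retention = scaling_retention(5.4, 5.3)
    print(f"{retention:.1%}")  # prints 98.1%
```

The same helper can be applied to any of the single-node and 32-node speedup pairs quoted in the Results text of the other cluster proof points.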

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF AUGUST, 2014<br />



Comparative Performance<br />

LAMMPS*<br />

Rhodopsin Benchmark; 512K Atoms<br />

CLUSTER BENCHMARK<br />

32 NODES<br />

Chart: LAMMPS* Rhodopsin Benchmark Performance (Mixed Precision)<br />

Node counts shown: 1 Node, 32 Nodes<br />

Series:<br />

• 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline)<br />

• 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package)<br />

• 2S E5-2697 v3 + Intel® Xeon Phi coprocessor 7110P/7120A Turbo Off (LAMMPS IA Package)<br />

Data labels: 1, 1.27X, 1.68X (1 Node); 1, 1.07X, 1.47X (32 Nodes)<br />

Application: LAMMPS*<br />

Description: Simulation of molecular systems with classical<br />

models. More at http://lammps.sandia.gov/<br />

Availability:<br />

• Code: In main LAMMPS repository.<br />

• Recipe: Available here.<br />

Usage Model: Load balancer offloads part of neighbor-list and<br />

non-bond force calculations to Intel® Xeon Phi coprocessor for<br />

concurrent calculations with CPU.<br />

Highlights: Improved results with Intel® Xeon® processor E5-2697<br />

v3 and Intel Xeon Phi coprocessor 7120A. Dynamic load<br />

balancing allows for concurrent:<br />

• Data transfer between host and coprocessor.<br />

• Calculations of neighbor-list, non-bond, bond, and long-range<br />

terms.<br />

Same routines in LAMMPS Intel Package also run faster on CPU.<br />

Results: Up to 1.68X performance improvement utilizing Intel®<br />

Xeon® processors and Intel® Xeon Phi coprocessors with<br />

application optimization on a single node compared to the<br />

baseline configuration. Performance gains continue to hold at<br />

1.47X when scaling up to 32 nodes.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF AUGUST, 2014<br />



Comparative Performance<br />

AMBER* 14<br />

Particle Mesh Ewald (PME) Tobacco Virus<br />

Chart: AMBER* 14 PME: Tobacco Virus, 1 Million Atoms<br />

Series:<br />

• Intel® Xeon® processor E5-2697 v2 (baseline)<br />

• Intel® Xeon® processor E5-2697 v2 (optimized)<br />

• Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi coprocessor 7120A<br />

• Xeon E5-2697 v2 (optimized) + NVIDIA* K40 DPFP<br />

• Intel® Xeon® processor E5-2697 v3<br />

• Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi coprocessor 7120A<br />

Data labels: 1, 1.52X, 2X, 2.26X, 1.93X, 2.41X<br />

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3<br />

1 NODE<br />

Application: AMBER* 14<br />

Description: Biomolecular Simulations (Protein, DNA, RNA, virus, etc.).<br />

Full double precision (DPDP). More at http://ambermd.org/<br />

Availability:<br />

• Code: Available as a patch.<br />

• Recipe: Available here (Section 18.7 of the manual).<br />

Usage Model:<br />

• Baseline is the Intel® Xeon® processor E5-2697 v2 compared to<br />

the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi<br />

coprocessor 7120A.<br />

• Offload processing runs on both the processor and the coprocessor,<br />

using the released double-precision code across the platforms, with<br />

50% of the workload on the host and 50% on the coprocessor.<br />
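A fixed 50/50 division of work can be sketched as a simple partition of a list of work items. This only illustrates a static split and is not AMBER's offload code; `split_work` and `coprocessor_share` are hypothetical names chosen for the example.

```python
def split_work(items, coprocessor_share=0.5):
    """Split a list of work items into (host_part, coprocessor_part).

    coprocessor_share is the fraction of items assigned to the coprocessor;
    0.5 corresponds to the 50% / 50% split described above.
    """
    if not 0.0 <= coprocessor_share <= 1.0:
        raise ValueError("coprocessor_share must be between 0 and 1")
    # Index of the first item that goes to the coprocessor.
    cut = round(len(items) * (1.0 - coprocessor_share))
    return items[:cut], items[cut:]


if __name__ == "__main__":
    host_part, coproc_part = split_work(list(range(10)))
    print(len(host_part), len(coproc_part))
```

Unlike a dynamic balancer, a static split does not react to measured timings, so it fits best when the relative speed of the two devices is known in advance.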

Highlights: The code was optimized and delivered to the AMBER<br />

community (license holders); it is available as an update patch<br />

during code configuration. The benchmark information is at<br />

http://www.ks.uiuc.edu/Research/STMV/<br />

Results: Optimized Intel Xeon processor E5-2697 v3 and Intel Xeon<br />

Phi coprocessor 7120A offload demonstrated up to 2.41X improved<br />

performance over the Intel Xeon processor E5-2697 v2. Optimized<br />

offload process demonstrated 1.07X increased performance<br />

compared to NVIDIA K40* performance.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014<br />



Comparative Performance<br />

LAMMPS*<br />

Liquid Crystal Benchmark<br />

Chart: LAMMPS* Liquid Crystal Benchmark Performance (Mixed Precision)<br />

Node counts shown: 1 Node, 32 Nodes<br />

Series:<br />

• 2S Intel® Xeon® processor E5-2697 v2 (LAMMPS Baseline)<br />

• 2S Intel® Xeon® processor E5-2697 v2 (LAMMPS IA Package)<br />

• 2S E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A Turbo Off (LAMMPS IA Package)<br />

Data labels: 1, 3.43X, 5.07X (1 Node); 1, 3.06X, 4.84X (32 Nodes)<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2014<br />

CLUSTER BENCHMARK<br />

1-32 NODES<br />

Application: LAMMPS*<br />

Description: Simulation of molecular systems with classical models.<br />

Wide variety of academic, government, and industry users. Popular<br />

due to its versatility and support for a wide range of forcefields/potential<br />

models: Materials Science, Chemistry, Biophysics,<br />

Solid Mechanics, Granular Flow, etc. More at<br />

http://lammps.sandia.gov/<br />

Availability:<br />

• Code: In main LAMMPS repository.<br />

• Recipe: Available here.<br />

Usage Model: Load balancer offloads part of neighbor-list and non-bond<br />

force calculations to Intel® Xeon Phi coprocessor for<br />

concurrent calculations with CPU.<br />

Highlights: Improved results with Intel® Xeon® processor E5-2697<br />

v2 and Intel Xeon Phi coprocessor 7120A. Dynamic load balancing<br />

allows for concurrent:<br />

• Data transfer between host and coprocessor.<br />

• Calculations of neighbor-list, non-bond, bond, and long-range<br />

terms.<br />

Same routines in LAMMPS Intel Package also run faster on CPU.<br />

Results: Up to 5.07X performance improvement utilizing Intel®<br />

Xeon® processors and Intel® Xeon Phi coprocessors with<br />

application optimization on a single node compared to the baseline<br />

configuration. Performance holds at 4.84X when scaling up to 32 nodes.<br />



Comparative Performance<br />

LAMMPS*<br />

Rhodopsin Benchmark 256K Atoms<br />

Chart: LAMMPS* Rhodopsin Benchmark Performance (Mixed Precision); Includes External NVIDIA* Results 2/K20X + 1S AMD*<br />

Node counts shown: 1 Node, 32 Nodes<br />

Series:<br />

• 2S Intel® Xeon® E5-2697 v2 (LAMMPS Baseline)<br />

• 2S Intel® Xeon® E5-2697 v2 (LAMMPS IA Package)<br />

• 2S E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A Turbo Off (LAMMPS IA Package)<br />

• Cray XK7: 1S AMD Opteron* 6274 + NVIDIA Tesla* K20X; Cray Gemini* Interconnect, PCIe* 2.0 (LAMMPS GPU Package; http://www.nvidia.com/docs/IO/122634/computational-chemistry-benchmarks.pdf)<br />

Data labels: 1, 1.26X, 1.71X, 0.94X, 1, 1.56X, 2.15X, 1.46X<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF MAY, 2014<br />

CLUSTER BENCHMARK<br />

32 NODES<br />

Application: LAMMPS*<br />

Description: Simulation of molecular systems with classical<br />

models. Wide variety of academic, government, and industry users.<br />

Popular due to its versatility and support for a wide range of forcefields/potential<br />

models: Materials Science, Chemistry, Biophysics,<br />

Solid Mechanics, Granular Flow, etc. More at<br />

http://lammps.sandia.gov/<br />

Availability:<br />

• Code: In main LAMMPS repository.<br />

• Recipe: Available here.<br />


Usage Model: Load balancer offloads part of neighbor-list and<br />

non-bond force calculations to Xeon Phi for concurrent<br />

calculations with CPU.<br />

Highlights: Dynamic load balancing allows for concurrent:<br />

• Data transfer between host and coprocessor.<br />

• Calculations of neighbor-list, non-bond, bond, and long-range<br />

terms.<br />

Same routines in LAMMPS Intel Package also run faster on CPU.<br />

Results: Up to 1.71X performance improvement utilizing Intel®<br />

Xeon® processors and Intel® Xeon Phi coprocessors with<br />

application optimization on a single node compared to the<br />

baseline configuration. Performance increases up to 2.15X when<br />

scaling up to 32 nodes, out-performing the alternative<br />

configuration.<br />



Comparative Performance<br />

LAMMPS*<br />

Rhodopsin Benchmark; 512K Atoms<br />

Chart: LAMMPS* Rhodopsin Benchmark Performance (Mixed Precision)<br />

Node counts shown: 1 Node, 32 Nodes<br />

Series:<br />

• 2S Intel® Xeon® processor E5-2697 v2 (LAMMPS Baseline)<br />

• 2S Intel® Xeon® processor E5-2697 v2 (LAMMPS IA Package)<br />

• 2S E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A Turbo Off (LAMMPS IA Package)<br />

Data labels: 1.78X, 1.75X, 1, 1.2X, 1, 1.17X<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2014<br />

CLUSTER BENCHMARK<br />

32 NODES<br />

Application: LAMMPS*<br />

Description: Simulation of molecular systems with classical models.<br />

Wide variety of academic, government, and industry users. Popular<br />

due to its versatility and support for a wide range of forcefields/potential<br />

models: Materials Science, Chemistry, Biophysics,<br />

Solid Mechanics, Granular Flow, etc. More at<br />

http://lammps.sandia.gov/<br />

Availability:<br />

• Code: In main LAMMPS repository.<br />

• Recipe: Available here.<br />

Usage Model: Load balancer offloads part of neighbor-list and non-bond<br />

force calculations to Intel® Xeon Phi coprocessor for<br />

concurrent calculations with CPU.<br />

Highlights: Improved results with Intel® Xeon® processor E5-2697 v2<br />

and Intel Xeon Phi coprocessor 7120A. Dynamic load balancing<br />

allows for concurrent:<br />

• Data transfer between host and coprocessor.<br />

• Calculations of neighbor-list, non-bond, bond, and long-range<br />

terms.<br />

Same routines in LAMMPS Intel Package also run faster on CPU.<br />

Results: Up to 1.78X performance improvement utilizing Intel® Xeon®<br />

processors and Intel® Xeon Phi coprocessors with application<br />

optimization on a single node compared to the baseline<br />

configuration. Performance gains continue to hold at 1.75X when<br />

scaling up to 32 nodes.<br />



Comparative Performance<br />

LAMMPS*<br />

Rhodopsin Benchmark; 512K Atoms<br />

Chart: LAMMPS* Rhodopsin Benchmark Performance (Mixed Precision); Includes External NVIDIA* Results 2/K20X + 1S AMD*<br />

Node counts shown: 1 Node, 32 Nodes<br />

Series:<br />

• 2S Intel® Xeon® processor E5-2697 v2 (LAMMPS Baseline)<br />

• 2S Intel® Xeon® processor E5-2697 v2 (LAMMPS IA Package)<br />

• 2S E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A Turbo Off (LAMMPS IA Package)<br />

• Cray XK7: 1S AMD Opteron* 6274 + NVIDIA Tesla* K20X; Cray Gemini* Interconnect, PCIe* 2.0 (LAMMPS GPU Package; http://www.nvidia.com/docs/IO/122634/computational-chemistry-benchmarks.pdf)<br />

Data labels: 1.21X, 1.75X, 1, 1, 0.9X, 1.2X, 1.72X, 1.22X<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2014<br />

CLUSTER BENCHMARK<br />

32 NODES<br />

Application: LAMMPS*<br />

Description: Simulation of molecular systems with classical models.<br />

Wide variety of academic, government, and industry users. Popular<br />

due to its versatility and support for a wide range of forcefields/potential<br />

models: Materials Science, Chemistry, Biophysics,<br />

Solid Mechanics, Granular Flow, etc. More at<br />

http://lammps.sandia.gov/<br />

Availability:<br />

• Code: In main LAMMPS repository.<br />

• Recipe: Available here.<br />

Usage Model: Load balancer offloads part of neighbor-list and non-bond<br />

force calculations to Intel® Xeon Phi coprocessor for<br />

concurrent calculations with CPU.<br />

Highlights: Improved results with Intel® Xeon® processor E5-2697 v2<br />

and Intel Xeon Phi coprocessor 7120A. Dynamic load balancing<br />

allows for concurrent:<br />

• Data transfer between host and coprocessor.<br />

• Calculations of neighbor-list, non-bond, bond, and long-range<br />

terms.<br />

Same routines in LAMMPS Intel Package also run faster on CPU.<br />

Results: Up to 1.75X performance improvement utilizing Intel® Xeon®<br />

processors and Intel® Xeon Phi coprocessors with application<br />

optimization on a single node compared to the baseline<br />

configuration. Performance gains continue to hold at 1.72X when<br />

scaling up to 32 nodes, out-performing the alternative configuration.<br />



Comparative Performance<br />

LAMMPS*<br />

Production Protein Sim.; 474K Atoms<br />

Chart: LAMMPS* Production Protein Simulation Performance (Mixed Precision)<br />

Node counts shown: 1 Node, 32 Nodes<br />

Series:<br />

• 2S Intel® Xeon® processor E5-2697 v2 (LAMMPS Baseline)<br />

• 2S Intel® Xeon® processor E5-2697 v2 (LAMMPS IA Package)<br />

• 2S E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A Turbo Off (LAMMPS IA Package)<br />

Data labels: 1, 1.18X, 1.9X (1 Node); 1, 1.07X, 1.7X (32 Nodes)<br />

CLUSTER BENCHMARK<br />

32 NODES<br />

Application: LAMMPS*<br />

Description: Simulation of molecular systems with classical models.<br />

Wide variety of academic, government, and industry users. Popular<br />

due to its versatility and support for a wide range of forcefields/potential<br />

models: Materials Science, Chemistry, Biophysics,<br />

Solid Mechanics, Granular Flow, etc. More at<br />

http://lammps.sandia.gov/.<br />

Availability:<br />

• Code: In main LAMMPS repository.<br />

• Recipe: Available here.<br />

Usage Model: Load balancer offloads part of neighbor-list and non-bond<br />

force calculations to Intel® Xeon Phi coprocessor for<br />

concurrent calculations with CPU.<br />

Highlights: Improved results with Intel® Xeon® processor E5-2697<br />

v2 and Intel Xeon Phi coprocessor 7120A. Dynamic load balancing<br />

allows for concurrent:<br />

• Data transfer between host and coprocessor.<br />

• Calculations of neighbor-list, non-bond, bond, and long-range<br />

terms.<br />

Same routines in LAMMPS Intel Package also run faster on CPU.<br />

Results: Up to 1.9X performance improvement utilizing Intel® Xeon®<br />

processors and Intel® Xeon Phi coprocessors with application<br />

optimization on a single node compared to the baseline<br />

configuration. Performance holds at 1.7X when scaling up to 32 nodes.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2014<br />



Comparative Performance<br />

AMBER* 14<br />

Generalized Born (GB) Nucleosome<br />

1 NODE<br />

Chart: AMBER* 14 GB Nucleosome, 25000 atoms<br />

Series:<br />

• Intel® Xeon® processor E5-2697 v2<br />

• Intel® Xeon® processor E5-2697 v2 optimized<br />

• Xeon E5-2697 v2 optimized + Xeon Phi 7120A offload<br />

Data labels: 1, 1.85X, 2.2X<br />

“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2<br />

“Xeon Phi” = Intel® Xeon Phi coprocessor 7120A<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014<br />

Application: AMBER* 14<br />

Description: Biomolecular Simulations (Protein, DNA, RNA, virus<br />

etc.). Full double precision (DPDP). More at http://ambermd.org/<br />

Availability:<br />

• Code: Available as a patch.<br />

• Recipe: Available here (Section 18.7 of the manual).<br />

Usage Model:<br />

• Baseline is the Intel® Xeon® processor E5-2697 v2 host (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks); speed-up is shown with offload processing on both the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi coprocessor 7120A.<br />

• Performance is shown for the released code, all double precision, across the platforms, with 50% of the workload on the host and 50% on the coprocessor.<br />
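To make the 50/50 split concrete, the following is a minimal sketch of a static work-split model. It is not AMBER's implementation: the function names and the processing rates are hypothetical, and the cost of moving data between host and coprocessor is ignored.

```python
def static_split_time(total_work, coproc_fraction, host_rate, coproc_rate):
    """Wall time when host and coprocessor process their shares concurrently.

    total_work      -- amount of work in arbitrary units (hypothetical)
    coproc_fraction -- share of the work sent to the coprocessor, in [0, 1]
    host_rate       -- work units per second on the host (hypothetical)
    coproc_rate     -- work units per second on the coprocessor (hypothetical)
    """
    if not 0.0 <= coproc_fraction <= 1.0:
        raise ValueError("coproc_fraction must be in [0, 1]")
    host_time = total_work * (1.0 - coproc_fraction) / host_rate
    coproc_time = total_work * coproc_fraction / coproc_rate
    # Both devices run at the same time, so the slower share sets the wall time.
    return max(host_time, coproc_time)


def speedup_over_host_only(coproc_fraction, host_rate, coproc_rate):
    """Speed-up of the split run relative to running everything on the host."""
    host_only = static_split_time(1.0, 0.0, host_rate, coproc_rate)
    split = static_split_time(1.0, coproc_fraction, host_rate, coproc_rate)
    return host_only / split


if __name__ == "__main__":
    # Assumed rates: both devices process work at the same speed.
    print(speedup_over_host_only(0.5, host_rate=1.0, coproc_rate=1.0))
    # Assumed rates: the coprocessor is half as fast as the host.
    print(speedup_over_host_only(0.5, host_rate=1.0, coproc_rate=0.5))
```

Under these assumptions, an even split reaches the ideal 2X only when both devices process their share at the same rate; otherwise the slower device becomes the bottleneck.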

Highlights: The code has been optimized and will be delivered to the AMBER community (license holders); it will be available as an update patch during code configuration.<br />

Results: Optimized offload process demonstrated up to 2.2X<br />

improved performance over the baseline Intel® Xeon® processor<br />

E5-2697 v2.<br />



Comparative Performance<br />

AMBER* 14<br />

Particle Mesh Ewald (PME) Cellulose NPT<br />

1 NODE<br />

[Chart: AMBER* 14 PME: Cellulose NPT]<br />

• Intel® Xeon® processor E5-2697 v2: 1 (baseline)<br />

• Optimized Intel® Xeon® processor E5-2697 v2: 1.7X<br />

• Optimized Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A: 2.2X<br />

Application: AMBER* 14<br />

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org/<br />

Availability:<br />

• Code: Available as a patch.<br />

• Recipe: Available here (Section 18.7 of the manual).<br />

Usage Model:<br />

• Baseline is the host Intel® Xeon® processor E5-2697 v2 compared<br />

to the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi<br />

coprocessor 7120A using double precision.<br />

• Offload processing on both, using the released code, double<br />

precision across the platforms.<br />

Highlights: The code was optimized and delivered to the AMBER community (license holders); it is available as an update patch during code configuration.<br />

Results: Optimized Intel® Xeon® processor E5-2697 v2 and Intel® Xeon Phi coprocessor 7120A offload demonstrated up to 2.2X improved performance over the baseline code running on the Intel® Xeon® processor E5-2697 v2 only.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014<br />



Comparative Performance<br />

AMBER* 14<br />

Particle Mesh Ewald (PME) Cellulose NPT<br />

1 NODE<br />

[Chart: AMBER* 14 PME Cellulose NPT Speed Up]<br />

• Intel® Xeon® processor E5-2697 v2: 1 (baseline)<br />

• Optimized Intel® Xeon® processor E5-2697 v2: 1.7X<br />

• Optimized Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A: 2X<br />

Application: AMBER* 14<br />

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org/<br />

Availability:<br />

• Code: Available as AMBER 14 patches (Update 5 and Update 8), applied when the user updates AMBER (http://ambermd.org/bugfixes14.html, http://ambermd.org/bugfixesat.html).<br />

• Recipe: http://ambermd.org/doc12/Amber14.pdf (Section 18.7 of<br />

the manual).<br />

Usage Model: Baseline is the Intel® Xeon® processor E5-2697 v2<br />

compared to the Intel® Xeon® processor E5-2697 v2 and the Intel®<br />

Xeon Phi coprocessor 7120A with offload processing on both, and<br />

using the released code (double precision code across the<br />

platforms).<br />

Highlights: The code was optimized and delivered to the AMBER community (license holders); it is available as an update patch during code configuration.<br />

Results: Optimized Intel® Xeon® processor E5-2697 v2 and Intel® Xeon Phi coprocessor 7120A offload demonstrated up to 2X improved performance over the baseline code running on the Intel® Xeon® processor E5-2697 v2 only.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014<br />



Comparative Performance<br />

AMBER* 14<br />

Particle Mesh Ewald (PME) Cellulose NPT<br />

CLUSTER BENCHMARK<br />

4 NODES<br />

[Chart: AMBER* 14 PME Cellulose NPT (408K Atoms), Node 1 through Node 4. Speed-up values shown: 1 (baseline), 1.3X, 1.6X, 1.9X, 2.3X, 2.3X, 2.6X, 2.8X.]<br />

Legend: Intel® Xeon® processor E5-2697 v2; Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014<br />

Application: AMBER* 14<br />

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org/<br />

Availability:<br />

• Code: Available as a patch.<br />

• Recipe: Available here (Section 18.7 of the manual).<br />

Usage Model:<br />

• Baseline is the Intel® Xeon® processor E5-2697 v2 host only (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks); speed-up is shown with offload processing on both the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi coprocessor 7120A.<br />

• Performance shown is for the released code, double precision across the platforms, with 50% of the workload on the host and 50% on the coprocessor.<br />

Highlights: The code has been optimized and will be delivered to the AMBER community (license holders); it will be available as an update patch during code configuration.<br />

Results: Optimized offload process demonstrated compelling<br />

cluster performance improvement, up to 2.8X, over the baseline<br />

Intel® Xeon® processor E5-2697 v2.<br />



Comparative Performance<br />

AMBER* 14<br />

Particle Mesh Ewald (PME) Cellulose NPT<br />

CLUSTER BENCHMARK<br />

3 NODES<br />

[Chart: AMBER* 14 PME Cellulose NPT (408K Atoms), 2 nodes and 3 nodes. Speed-up values shown: 1 (baseline), 1.14X, 1.11X, 1.37X, 1.57X, 1.32X.]<br />

Legend: Intel® Xeon® processor E5-2697 v2 (baseline); Xeon E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A; Xeon E5-2697 v2 + NVIDIA* K40 DPFP<br />

“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2<br />

Application: AMBER* 14<br />

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org/<br />

Availability:<br />

• Code: Available as a patch.<br />

• Recipe: Available here (Section 18.7 of the manual).<br />

Usage Model:<br />

• Baseline is the Intel® Xeon® processor E5-2697 v2 host only (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks); speed-up is shown with offload processing on both the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi coprocessor 7120A.<br />

• Performance shown is for the released code, double precision across the platforms, with 50% of the workload on the host and 50% on the coprocessor.<br />

Highlights: The code has been optimized and will be delivered to the AMBER community (license holders); it will be available as an update patch during code configuration.<br />

Results: Optimized offload process demonstrated compelling<br />

cluster performance improvement, up to 2.6X, over the baseline<br />

Intel® Xeon® processor E5-2697 v2.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014<br />



Comparative Performance<br />

Burrows-Wheeler Aligner (BWA-ALN)*<br />

Human Genome<br />

1 NODE<br />

[Chart: BWA-ALN* Speed Up]<br />

• 2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN): 1<br />

• 2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN): 1.24X<br />

• 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A: 1.86X<br />

Application: Burrows-Wheeler Aligner*, version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (3.5 GB read file, 3.0 GB reference database).<br />

Description: BWA is a popular software package for<br />

mapping low-divergent sequences against a large<br />

reference genome, such as the human genome. More at<br />

http://bio-bwa.sourceforge.net/.<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />

Usage Model: Hybrid MPI + OpenMP* using symmetric<br />

mode.<br />

Highlights: Results are identical to those of the unmodified BWA-ALN run.<br />

Results: The Intel® Xeon® processor E5-2697 v2 and the<br />

Intel® Xeon Phi coprocessor symmetric process<br />

demonstrated up to 1.86X improved performance over the<br />

baseline Intel® Xeon® processor E5-2697 v2.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY, 2014<br />



Comparative Performance<br />

GROMACS*<br />

512K H2O with RF<br />

1 NODE<br />

[Chart: GROMACS* 512K H2O with RF Speed Up. Speed-up values shown: 1 (baseline), 1.03X, 1.56X, 1.72X, 1.79X.]<br />

Legend: Intel® Xeon® processor E5-2697 v2; 1 Intel® Xeon Phi coprocessor 7120P/X; 2 Intel® Xeon Phi coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi coprocessor 7120P/X<br />

Application: GROMACS* 5.0-RC1; Workload: 512K H2O with RF<br />

method<br />

Description: GROMACS is a versatile package for molecular dynamics, i.e., for simulating the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages.<br />

Availability:<br />

• Code: Version 5.0-rc1 available here and here.<br />

• Recipe: Available here.<br />

Highlights:<br />

• Highly optimized for Intel® Xeon® Processors (AVX-intrinsics).<br />

• Able to run full simulation on Intel® Xeon Phi coprocessor<br />

natively + host processor using a symmetric model.<br />

• Optimized with intrinsics for 512-bit vectorization on Intel<br />

Xeon Phi coprocessors.<br />

Results: Symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF APRIL, 2014<br />



Comparative Performance<br />

NWChem*<br />

CCSD(T) Method<br />

CLUSTER BENCHMARK<br />

32 NODES<br />

[Chart: NWChem* 6.3rev2 and 6.5 CCSD(T) Method 32 Node Speed Up]<br />

• NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2: 1 (baseline)<br />

• NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2: 1.24X<br />

• NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi Coprocessor 7120A: 1.52X<br />

Application: NWChem* is a computational chemistry software<br />

package that includes quantum chemical and molecular<br />

dynamics functionality. NWChem is developed by the<br />

Environmental Molecular Sciences Laboratory (EMSL) at<br />

the Pacific Northwest National Laboratory (PNNL). More at<br />

http://www.nwchem-sw.org<br />

Availability:<br />

• Code: Available here and from the SVN repository.<br />

• Recipe: Available here.<br />

Usage Model: Offload using LEO and OpenMP*<br />

Highlights: NWChem with Intel® Xeon Phi coprocessor 7120A offloading is a compelling application for the NWChem community on both single nodes and clusters.<br />

Results: Compared to the NWChem* 6.3rev2 and Intel® Xeon®<br />

processor E5-2697 v2 baseline:<br />

1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the<br />

Intel® Xeon® processor E5-2697 v2.<br />

2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the<br />

Intel® Xeon® processor E5-2697 v2 and the Intel Xeon Phi<br />

coprocessor 7120A.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2014<br />



Comparative Performance<br />

NWChem*<br />

CCSD(T) Method<br />

1 NODE<br />

[Chart: NWChem* 6.3rev2 CCSD(T) Method Single Node Speed Up]<br />

• 2S Intel® Xeon® processor E5-2697 v2: 1 (baseline)<br />

• 2S Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi Coprocessor 7120A: 1.35X<br />

Application: NWChem* is a computational chemistry software<br />

package that includes quantum chemical and molecular<br />

dynamics functionality. It is designed for high-performance<br />

parallel supercomputers and aims to be scalable both in its<br />

ability to treat large problems efficiently, and in its usage of<br />

available parallel computing resources. NWChem is developed by<br />

the Environmental Molecular Sciences Laboratory (EMSL) at<br />

the Pacific Northwest National Laboratory (PNNL). More at<br />

http://www.nwchem-sw.org<br />

Availability:<br />

• Code: Available here and from the SVN repository.<br />

• Recipe: Available here.<br />

Usage Model: Offload using LEO and OpenMP*.<br />

Highlights: NWChem with Intel® Xeon Phi coprocessor 7120A offloading is a compelling application for the NWChem community on both single nodes and clusters.<br />

Results: NWChem* 6.3rev2 CCSD(T) performed up to 1.35X<br />

faster with the Intel® Xeon® processor E5-2697 v2 and the Intel<br />

Xeon Phi coprocessor 7120A compared to the Intel Xeon<br />

processor E5-2697 v2 baseline.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2014<br />



Discover and design like never before.<br />

Material Sciences<br />



Comparative Performance<br />

miniGhost*<br />

BSPMA<br />

1 NODE<br />


NEW<br />

[Chart: miniGhost v. 0.9 Speed Up. Speed-up values shown: 1 (baseline), 1.17X, 1.53X, 1.76X.]<br />

Application: miniGhost v. 0.9<br />

Description: Finite-difference mini-application that implements a difference stencil across a homogeneous three-dimensional domain. More at https://mantevo.org/download/<br />
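As a rough illustration of what a difference stencil over a three-dimensional domain looks like, here is a minimal pure-Python sketch of one 7-point averaging sweep. The stencil shape, the fixed boundary handling, and the names `make_grid` and `stencil_step` are assumptions chosen for illustration; this is not miniGhost's code.

```python
def make_grid(nx, ny, nz, value=0.0):
    """Build an nx x ny x nz grid (nested lists) filled with one value."""
    return [[[value for _ in range(nz)] for _ in range(ny)] for _ in range(nx)]


def stencil_step(grid):
    """Apply one 7-point averaging sweep and return a new grid.

    Each interior cell becomes the mean of itself and its six face
    neighbours. Boundary cells are copied unchanged (an assumption made
    here to keep the sketch short).
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    # Copy the grid so that every update reads only old values.
    new = [[row[:] for row in plane] for plane in grid]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                new[i][j][k] = (
                    grid[i][j][k]
                    + grid[i - 1][j][k] + grid[i + 1][j][k]
                    + grid[i][j - 1][k] + grid[i][j + 1][k]
                    + grid[i][j][k - 1] + grid[i][j][k + 1]
                ) / 7.0
    return new


if __name__ == "__main__":
    grid = make_grid(3, 3, 3)
    grid[1][1][1] = 7.0  # single hot cell in the centre
    smoothed = stencil_step(grid)
    print(smoothed[1][1][1])
```

In a multi-node run each rank would own a sub-block of the domain and exchange boundary layers with its neighbours before every sweep, which is why a code of this kind stresses the interconnect.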

Availability:<br />

• Code: Available here or here.<br />

• Recipe: Available here.<br />

Usage Model: Symmetric model.<br />

Highlights: The code is a proxy for the CTH (shock physics<br />

code) and can be used as a good test-case to measure network<br />

(interconnect bandwidth vs. latency) performance.<br />

Results: Symmetric process (2S Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi coprocessor 7120A) demonstrated up to 1.76X improved performance over the baseline processor.<br />

Chart legend: Intel® Xeon® processor E5-2697 v2; Intel® Xeon® processor E5-2697 v3; Intel® Xeon® processor E5-2697 v2 + Xeon Phi 7120A; Intel® Xeon® processor E5-2697 v3 + Xeon Phi 7120A<br />

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120A<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY, <strong>2015</strong><br />



Comparative Performance<br />

miniGhost*<br />

BSPMA<br />

CLUSTER BENCHMARK<br />

4 NODES<br />

NEW<br />

[Chart: miniGhost v. 0.9 Speed Up. Speed-up values shown: 1 (baseline), 1.18X, 1.38X, 1.48X.]<br />

Chart legend: Intel® Xeon® processor E5-2697 v2; Intel® Xeon® processor E5-2697 v3; Intel® Xeon® processor E5-2697 v2 + Xeon Phi 7120A; Intel® Xeon® processor E5-2697 v3 + Xeon Phi 7120A<br />

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120A<br />

Application: miniGhost v. 0.9<br />

Description: Finite-difference mini-application that implements a difference stencil across a homogeneous three-dimensional domain. More at https://mantevo.org/download/<br />

Availability:<br />

• Code: Available here or here.<br />

• Recipe: Available here.<br />

Usage Model: Symmetric model.<br />

Highlights: The code is a proxy for the CTH (shock physics<br />

code) and can be used as a good test-case to measure network<br />

(interconnect bandwidth vs. latency) performance.<br />

Results: Symmetric process (2S Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi coprocessor 7120A) demonstrated up to 1.48X improved performance over the baseline processor.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY, <strong>2015</strong><br />



Comparative Performance<br />

Quantum ESPRESSO*<br />

GRIR443<br />

CLUSTER BENCHMARK<br />

8 NODES<br />

[Chart: Quantum ESPRESSO* 5.0.3 GRIR443 Speed Up]<br />

• 16S Intel® Xeon® processor E5-2697 v2: 1 (baseline)<br />

• 16S Intel® Xeon® processor E5-2697 v2 + 16 Intel® Xeon Phi Coprocessor 7120A: 1.4X<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2014<br />

Application: Quantum ESPRESSO* 5.0.3<br />

Workload Description: Quantum ESPRESSO* is an integrated suite of Open-Source computer codes for electronic structure calculations and materials modeling at the nanoscale. GRIR443 is a public PRACE benchmark suited for multi-node execution. ESPRESSO is an acronym for opEn-Source Package for Research in Electronic Structure, Simulation, and Optimization. More at http://www.quantum-espresso.org/.<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />

Usage Model: Offload using OpenMP* and Intel® Math Kernel Library.<br />

Highlights:<br />

• Results are obtained from real-world, publicly available benchmarks<br />

– no special treatment.<br />

• Effect of offloading heavily depends on the type of the workload –<br />

larger workloads (such as GRIR443) generate large dense matrix<br />

multiplications, benefiting from the Intel® Xeon Phi coprocessor<br />

7120A.<br />
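One way to see why larger dense matrix multiplications favor offload is to compare arithmetic against data movement. The sketch below uses the standard estimate of about 2n³ floating-point operations for an n x n matrix product and assumes that exactly three n x n matrices cross the host-coprocessor link (two inputs, one result). The function names are hypothetical and nothing here is taken from Quantum ESPRESSO or Intel® MKL.

```python
def gemm_flops(n):
    """Approximate floating-point operations for C = A * B, all n x n."""
    return 2 * n ** 3


def gemm_elements_moved(n):
    """Matrix elements transferred: send A and B, receive C (assumed)."""
    return 3 * n ** 2


def flops_per_element(n):
    """Arithmetic done per element moved; grows linearly with n."""
    return gemm_flops(n) / gemm_elements_moved(n)


if __name__ == "__main__":
    for n in (30, 300, 3000):
        print(n, flops_per_element(n))
```

Because the ratio grows in proportion to n, the fixed cost of moving data is amortized over more computation as the matrices get larger, which matches the observation that the effect of offloading depends on the workload.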

Results: Quantum ESPRESSO 5.0.3 GRIR443 performed up to 1.4X<br />

faster with Intel® Xeon® processor E5-2697 v2 and the Intel Xeon Phi<br />

coprocessor compared to the Intel® Xeon® processor E5-2697 v2<br />

baseline.<br />



Comparative Performance<br />

Quantum ESPRESSO*<br />

AUSURF112<br />

1 NODE<br />

[Chart: Quantum ESPRESSO* 5.0.3 AUSURF112 Speed Up]<br />

• 2S Intel® Xeon® processor E5-2697 v2: 1 (baseline)<br />

• 2S Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi Coprocessor 7120A: 1.2X<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2014<br />

Application: Quantum ESPRESSO* 5.0.3 AUSURF112<br />

Description: Quantum ESPRESSO is an integrated suite of Open-<br />

Source computer codes for electronic-structure calculations and<br />

materials modeling at the nanoscale. AUSURF112 is a publicly<br />

available DEISA benchmark suited for single node execution.<br />

ESPRESSO is an acronym for opEn-Source Package for Research<br />

in Electronic Structure, Simulation, and Optimization. More at<br />

http://www.quantum-espresso.org/<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />

Usage Model: Offload using OpenMP* and Intel® Math Kernel<br />

Library.<br />

Highlights:<br />

• Results are obtained from real-world, publicly available<br />

benchmarks – no special treatment.<br />

• Effect of offloading heavily depends on the type of the<br />

workload.<br />

Results: Quantum ESPRESSO 5.0.3 AUSURF112 performed up to 1.2X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi coprocessor compared to the Intel® Xeon® processor E5-2697 v2 baseline.<br />



Improving simulation speed and quality.<br />

Manufacturing<br />



Comparative Performance<br />

ANSYS Mechanical* 16.0<br />

V16sp-3 DMP<br />

CLUSTER BENCHMARK<br />

8 NODES<br />

NEW<br />

[Chart: ANSYS Mechanical* V16sp-3 Solver Rate, DMP Mode. X-axis: Number of MPI processes on host: 1 (1 node x 1 MPI ranks), 8 (1 node x 8 MPI ranks), 16 (2 nodes x 8 MPI ranks), 16 (8 nodes x 2 MPI ranks). Speed-up values shown: 1 (baseline), 2X, 1.8X, 4.7X, 6X, 9.4X, 7.6X, 6.9X, 12.3X, 17.7X, 16.9X, 21.2X, 19.4X, 22.4X, 32.3X.]<br />

Legend: Intel® Xeon® processor E5-2697 v2; Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A; Intel® Xeon® processor E5-2697 v3; Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi coprocessor 7120A; Intel® Xeon® processor E5-2697 v3 + 2 Intel® Xeon Phi coprocessors 7120A<br />

Application: ANSYS Mechanical* 16.0 DMP Mode<br />

Description: V16sp-3 Harmonic Analysis, 1.7M DOF.<br />

Availability:<br />

• Code: Contact ANSYS for official V16 release,<br />

www.ansys.com<br />

• Recipe: Available here.<br />

Usage Model: Intel® MKL automatic offload.<br />

Highlights: First commercial simulator supporting the<br />

Intel® Xeon Phi coprocessor. V15sp-3 speaker<br />

benchmark model.<br />

Results: The Intel® Xeon Phi coprocessor provides speed-up over the Intel® Xeon® processor E5-2697 v3 host on cluster runs, and dual Intel® Xeon Phi coprocessors provide greater speed-up on both single-node and cluster runs.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF FEBRUARY, <strong>2015</strong><br />



Comparative Performance<br />

ANSYS Mechanical* 16.0<br />

V145sp-5 DMP (Previous Version Model)<br />

1 NODE<br />

NEW<br />

[Chart: ANSYS Mechanical* V145sp-5 Solver Rate, DMP Mode. X-axis: Number of MPI processes on host (1, 2, 4, 8). Speed-up values shown: 11.3X, 3.34X, 4.6X, 7.24X, 5.48X, 3.77X, 9.83X, 8.83X, 5.39X, 3.62X, 9.51X, 8.4X, 4.79X, 1.98X, 2.04X, 1 (baseline).]<br />

Legend: Intel® Xeon® processor E5-2697 v2; Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A; Intel® Xeon® processor E5-2697 v3; Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi coprocessor 7120A<br />

Application: ANSYS Mechanical* 16.0 DMP Mode<br />

Description: V145sp-5 Static Structural Analysis, 2.1M DOF.<br />

Availability:<br />

• Code: Contact ANSYS for official V16 release, www.ansys.com<br />

• Recipe: Available here.<br />

Usage Model: Intel® MKL automatic offload.<br />

Highlights: First commercial simulator supporting the<br />

Intel® Xeon Phi coprocessor. V145sp-5 turbine blade<br />

benchmark model (discusses impact of Intel® SSDs).<br />

Results: At 1 MPI process on the host, the Intel® Xeon<br />

Phi coprocessor provides up to 2.32X speed-up over<br />

the Intel® Xeon® processor E5-2697 v3 host, and the<br />

Intel® Xeon® processor E5-2697 v3 + the Intel® Xeon<br />

Phi coprocessor 7120A provides up to 4.6X speed-up<br />

over the Intel® Xeon® processor E5-2697 v2 host.<br />

Note: Intel® Xeon® processor uses 1 core per MPI process<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF FEBRUARY, <strong>2015</strong><br />



Comparative Performance<br />

ANSYS Mechanical* 16.0<br />

V16ln-2 DMP<br />

1 NODE<br />

NEW<br />

Chart: ANSYS Mechanical* V16ln-2 Solver Rate, DMP Mode (y-axis: comparative performance; x-axis: number of MPI processes on host: 1, 2, 4, 8). Chart series:<br />

Intel® Xeon® processor E5-2697 v2<br />

Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A<br />

Intel® Xeon® processor E5-2697 v3<br />

Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi coprocessor 7120A<br />

Note: Intel® Xeon® processor uses 1 core per MPI process<br />

Application: ANSYS Mechanical* 16.0 DMP Mode<br />

Description: V16ln-2 Dynamic Structural Modal<br />

Analysis, 2M DOF.<br />

Availability:<br />

• Code: Contact ANSYS for official V16 release,<br />

www.ansys.com<br />

• Recipe: Available here.<br />

Usage Model: Intel® MKL automatic offload.<br />

Highlights: First commercial simulator supporting the<br />

Intel® Xeon Phi coprocessor.<br />

Results: At 2 MPI processes on the host, the Intel® Xeon<br />

Phi coprocessor provides up to 1.78X speed-up over<br />

the Intel® Xeon® E5-2697 v3 host, and up to 2.44X<br />

speed-up over the Intel® Xeon® E5-2697 v2 host.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF FEBRUARY, <strong>2015</strong><br />



Comparative Performance<br />

ANSYS Mechanical* 16.0<br />

V145sp-4 DMP (Previous Version Model)<br />

1 NODE<br />

NEW<br />

Chart: ANSYS Mechanical* V145sp-4 Solver Rate, DMP Mode (y-axis: comparative performance; x-axis: number of MPI processes on host: 1, 2, 4, 8, 16). Chart series:<br />

Intel® Xeon® processor E5-2697 v2<br />

Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor 31S1P<br />

Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi coprocessors 31S1P<br />

Application: ANSYS Mechanical* 16.0 DMP Mode<br />

Description: V145sp-4 Static Structural Analysis, 3.2M<br />

DOF.<br />

Availability:<br />

• Code: V16 preview 4 is now available. Contact ANSYS<br />

for official V16 release, www.ansys.com<br />

• Recipe: Available here.<br />

Usage Model: Intel® MKL automatic offload.<br />

Highlights: First commercial simulator supporting the<br />

Intel® Xeon Phi coprocessor. V145sp-5 turbine blade<br />

benchmark model (discusses impact of Intel® SSDs).<br />

Results: At 2 MPI processes on the host, a single Intel®<br />

Xeon Phi coprocessor 31S1P provides up to 3.11X speed-up over the<br />

Intel® Xeon® processor E5-2697 v2 host alone. Two Intel® Xeon Phi<br />

31S1P coprocessors provide a speed-up of up to 3.7X.<br />

Note: Intel® Xeon® processor uses 1 core per MPI process<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF FEBRUARY, <strong>2015</strong><br />



Comparative Performance<br />

ANSYS Mechanical* 16.0<br />

V16ln-2 SMP<br />

1 NODE<br />

NEW<br />

Chart: ANSYS Mechanical* V16ln-2 Solver Rate, SMP Mode (y-axis: comparative performance; x-axis: number of threads on host: 1, 2, 8, 16). Chart series:<br />

Intel® Xeon® processor E5-2697 v2<br />

Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A<br />

Intel® Xeon® processor E5-2697 v3<br />

Application: ANSYS Mechanical* 16.0 SMP Mode<br />

Description: V16ln-2 Dynamic Structural Modal<br />

Analysis, 2M DOF.<br />

Availability:<br />

• Code: Contact ANSYS for official V16 release,<br />

www.ansys.com<br />

• Recipe: Available here.<br />

Usage Model: Intel® MKL automatic offload.<br />

Highlights: First commercial simulator supporting the<br />

Intel® Xeon Phi coprocessor.<br />

Results: At 1 thread on the host, the Intel® Xeon Phi<br />

coprocessor provides up to 1.72X speed-up over the<br />

Intel® Xeon® E5-2697 v3 host, and up to 2.06X speed-up<br />

over the Intel® Xeon® E5-2697 v2 host.<br />

Chart series (continued): Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi coprocessor 7120A<br />

Note: Intel® Xeon® processor uses 1 core per thread<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF FEBRUARY, <strong>2015</strong><br />



Comparative Increase<br />

HPCG* benchmark<br />

Conjugate gradient method<br />

Chart: HPCG Benchmark Speed Up (y-axis: comparative increase; x-axis: 1 node, 4 nodes, 16 nodes). Chart series:<br />

Intel® Xeon® processor E5-2697 v3 (reference)<br />

Intel® Xeon® processor E5-2697 v3 (optimized)<br />

Xeon E5-2697 v3 + Xeon Phi 7120A (symmetric)<br />

Xeon E5-2697 v3 + 2 Xeon Phi 7120A (offload)<br />

Xeon E5-2697 v3 + 2 Xeon Phi 7120A (symmetric)<br />

“Xeon E5-2697 v3” = Intel® Xeon® processor E5-2697 v3<br />

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120A<br />

CLUSTER BENCHMARK<br />

16 NODES<br />

NEW<br />

Application: HPCG benchmark<br />

Description: The HPCG benchmark is intended to complement the<br />

High Performance LINPACK benchmark used in the TOP500* system<br />

ranking by providing a metric that better aligns with a broader set of<br />

important cluster applications. The workload performs 50 iterations of<br />

the conjugate gradient (CG) method with a Gauss-Seidel preconditioner<br />

and a synthetic multigrid V-cycle.<br />
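As a rough illustration of the method named in the description, here is a minimal Python sketch of a preconditioned conjugate gradient solve. It is not the HPCG source code: the small 1D Poisson matrix, the dense list-of-lists storage, the tolerance, and every function name are assumptions made for this example. A symmetric Gauss-Seidel sweep stands in for the preconditioner, and the multigrid V-cycle is left out. Only the cap of 50 iterations is taken from the text.<br />

```python
"""Preconditioned conjugate gradient with a symmetric Gauss-Seidel preconditioner."""


def laplacian_1d(n):
    """Dense 1D Poisson matrix (2 on the diagonal, -1 next to it); an assumed test problem."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]


def matvec(a, x):
    """Dense matrix-vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def sym_gauss_seidel(a, r):
    """One forward and one backward Gauss-Seidel sweep, starting from z = 0."""
    n = len(r)
    z = [0.0] * n
    for i in range(n):                       # forward sweep
        s = sum(a[i][j] * z[j] for j in range(n) if j != i)
        z[i] = (r[i] - s) / a[i][i]
    for i in reversed(range(n)):             # backward sweep
        s = sum(a[i][j] * z[j] for j in range(n) if j != i)
        z[i] = (r[i] - s) / a[i][i]
    return z


def pcg(a, b, max_iters=50, tol=1e-10):
    """Solve A x = b for symmetric positive definite A; returns (x, iterations used)."""
    x = [0.0] * len(b)
    r = list(b)                              # residual for the zero initial guess
    z = sym_gauss_seidel(a, r)
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for k in range(1, max_iters + 1):
        ap = matvec(a, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, k
        z = sym_gauss_seidel(a, r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iters


if __name__ == "__main__":
    solution, iterations = pcg(laplacian_1d(20), [1.0] * 20)
    print("iterations used:", iterations)
```

The benchmark itself works on a much larger sparse 3D problem distributed over MPI ranks; the sketch only shows the order of operations inside one solve.<br />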

Availability:<br />

• Code: Available here. Intel Optimized Technical Preview for HPCG<br />

benchmark available by subscription. Contact intelmkl@intel.com.<br />

• Recipe: Available here.<br />

Usage Model:<br />

HPCG is a hybrid (MPI + OpenMP*) code supporting symmetric and<br />

offload modes. The reference version is pure MPI code running one<br />

rank per core.<br />

Highlights:<br />

Symmetric mode improves performance: the CPU cores that offload mode<br />

uses for data transfer are freed for computation.<br />

Results:<br />

The optimized version demonstrated a 1.7X performance improvement<br />

over the reference implementation, and 4.48X with 2 Intel® Xeon<br />

Phi coprocessors, on a single node.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF MAY, <strong>2015</strong><br />



Comparative Performance<br />

Sandia Mantevo*<br />

MiniFE-2.0.1<br />

Chart: Sandia Mantevo* MiniFE 2.0.1 Speed Up (y-axis: comparative performance; x-axis: 1 node, 8 nodes). Chart series:<br />

Intel® Xeon® processor E5-2697 v2<br />

Intel® Xeon® processor E5-2697 v3<br />

Xeon E5-2697 v3 + 1 Intel® Xeon Phi coprocessor 7120A<br />

Xeon E5-2697 v3 + 2 Intel® Xeon Phi coprocessors 7120A<br />

“Xeon E5-2697 v3” = Intel® Xeon® processor E5-2697 v3<br />

CLUSTER BENCHMARK<br />

8 NODES<br />

Application: Sandia Mantevo* MiniFE 2.0.1.<br />

Description: From Sandia National Laboratories. Intended to be<br />

the best approximation to an unstructured implicit finite element<br />

application that includes all important computational phases. More at<br />

https://mantevo.org.<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />


NEW<br />

Usage Model: The baseline is the Intel® Xeon® processor<br />

E5-2697 v2 host, compared with the host and the Intel® Xeon Phi<br />

coprocessor in symmetric mode.<br />

Highlight: Porting ease using OpenMP*. The Intel® MPI<br />

Library enables rapid performance improvement when<br />

adding an Intel Xeon Phi coprocessor.<br />

Results: Up to 2.58X speed up for the Intel® Xeon®<br />

processor E5-2697 v3 and the Intel® Xeon Phi<br />

coprocessor 7120A in symmetric mode over the host.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF APRIL, <strong>2015</strong><br />



Comparative Performance<br />

MSC Software*<br />

Nastran* 2016 Alpha<br />

1 NODE<br />

Chart: MSC Software Nastran* 2016 Alpha Speed Up (y-axis: comparative performance; x-axis: cores: 1, 2, 4, 8). Chart series:<br />

2S Intel® Xeon® processor E5-2697 v2<br />

2S Xeon E5-2697 v2 + 1 Intel® Xeon Phi coprocessor 7120P<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF APRIL, 2014<br />

Application: MSC Software Nastran* 2016 Alpha.<br />

Description: MSC Software Nastran sparse direct solver optimized<br />

for Intel® Xeon Phi coprocessors. Intel Xeon Phi coprocessors are<br />

especially well suited for large models that consist primarily of<br />

three-dimensional elements and have tightly-coupled multiphysics,<br />

such as noise, vibration, and harshness (NVH) studies.<br />

Availability:<br />

• Code: Contact an MSC Software representative.<br />

• Recipe: Use hStreams* with MPSS 3.5 or later.<br />


NEW<br />

Usage Model: When large matrices are split into submatrices for<br />

processing, the solver offloads the largest submatrices from the<br />

Intel® Xeon® processor to the Intel® Xeon Phi coprocessor.<br />

Performance is maximized through asynchronous compute<br />

capabilities, which allow matrix computations on the coprocessor to<br />

start almost instantly once the data offload begins.<br />
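The overlap of data transfer and computation described above can be sketched in a few lines of Python. This is an illustration only, not MSC Nastran or hStreams code: the `transfer` and `compute` functions are hypothetical stand-ins, the row-count threshold `min_rows` is an assumed way of picking the "largest" submatrices, and overlap happens per block rather than within a single block.<br />

```python
from concurrent.futures import ThreadPoolExecutor


def transfer(block):
    """Stand-in for copying a submatrix to the coprocessor (here: a plain copy)."""
    return [row[:] for row in block]


def compute(block):
    """Stand-in for the solver kernel (here: the sum of squared entries)."""
    return sum(v * v for row in block for v in row)


def pipelined_offload(blocks, min_rows):
    """Offload blocks with at least min_rows rows; keep the smaller ones on the host.

    Returns (offloaded_results, host_results), each in input order.
    """
    large = [b for b in blocks if len(b) >= min_rows]
    small = [b for b in blocks if len(b) < min_rows]
    # One worker per resource: 'link' models the data transfer, 'device' the coprocessor.
    with ThreadPoolExecutor(max_workers=1) as link, \
            ThreadPoolExecutor(max_workers=1) as device:
        in_flight = []
        for block in large:
            staged = link.submit(transfer, block)
            # Waiting here blocks only the host thread; the device keeps working
            # on the previous block while this transfer completes.
            in_flight.append(device.submit(compute, staged.result()))
        host_results = [compute(b) for b in small]    # host work overlaps device work
        offloaded_results = [f.result() for f in in_flight]
    return offloaded_results, host_results


if __name__ == "__main__":
    demo = [[[1.0] * 3 for _ in range(n)] for n in (1, 5, 2, 8)]
    print(pipelined_offload(demo, min_rows=4))
```

Using one worker per resource keeps the results in submission order, so the caller can match each result to its submatrix without extra bookkeeping.<br />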

Highlight: This strategy has demonstrated performance gains<br />

across a range of solution sequences for MSC Nastran 2016 Alpha,<br />

including structural statics and modal decomposition, which<br />

account for the majority of customer workloads.<br />

Results: Up to 2X speed up for the Intel® Xeon® processor E5-2697<br />

v2 and the Intel® Xeon Phi coprocessor 7120A in offload mode<br />

over the host.<br />



Comparative Increase<br />

Autodesk Maya* 2014 Viewport Plugin<br />

Embree-based workloads<br />

1 NODE<br />

NEW<br />

Chart: Embree-Based Viewport Plugin for Autodesk Maya* 2014 Speed Up (y-axis: comparative increase). Workloads on the x-axis: Chinese Dragon with translucent material; Full-screen Crytek Sponza simple materials + textures; Power Plant; Rungholt. Chart series:<br />

2S Intel® Xeon® processor E5-2697 v2<br />

2S Xeon E5-2697 v2 + Xeon Phi 7120A<br />

2S Xeon E5-2697 v3 + Xeon Phi 7120A<br />

“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2<br />

Application: Embree-Based Viewport Plugin for Autodesk Maya* 2014<br />

Description: An open-source, proof-of-concept Autodesk Maya 2014<br />

viewport plugin which renders the scene being interactively edited using<br />

the Embree high-performance ray-tracing kernels and a modified version<br />

of the Embree sample path-tracer.<br />

Availability:<br />

• Code: Available here (branch “mayarender_v2.3.2”).<br />

• Recipe: Available here.<br />

Usage Model:<br />

This plugin can run its computation on the host only, on the coprocessor<br />

only (via offload), or on both the host and the coprocessor (via offload). It can also<br />

make use of multiple coprocessors. The plugin is invoked every time the<br />

screen updates during interactive content authoring.<br />

Highlights:<br />

• Workstation application for both Linux* and Windows*<br />

• Makes use of the high-performance Embree ray-tracing kernels, which<br />

are optimized for both Intel® Xeon® processors and the Intel® Xeon Phi<br />

coprocessor<br />

Results:<br />

Dual Intel® Xeon® processor E5-2697 v2 + 1 Intel Xeon Phi coprocessor<br />

7120A hybrid mode improves performance by up to 2.8X over the dual<br />

Intel® Xeon® processor E5-2697 v2 host alone.<br />

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120A<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF FEBRUARY, <strong>2015</strong><br />



Comparative Increase<br />


Autodesk Maya* 2014 Viewport Plugin<br />

Selected workloads<br />

1 NODE<br />

NEW<br />

Chart: Selected Workloads Viewport Plugin for Autodesk Maya* 2014 Speed Up (y-axis: comparative increase). Workloads on the x-axis: Chinese Dragon with translucent material; Full-screen Crytek Sponza simple materials + textures; Power Plant; Rungholt. Chart series:<br />

2S Intel® Xeon® processor E5-2697 v2<br />

2S Intel® Xeon® processor E5-2697 v3<br />

2S Xeon E5-2697 v3 + Xeon Phi 7120A<br />

“Xeon E5-2697 v2 / v3” = Intel® Xeon® processor E5-2697 v2 / v3<br />

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120A<br />

Application: Embree-Based Viewport Plugin for Autodesk Maya* 2014<br />

Description: An open-source, proof-of-concept Autodesk Maya 2014<br />

viewport plugin which renders the scene being interactively edited using<br />

the Embree high-performance ray-tracing kernels and a modified version<br />

of the Embree sample path-tracer.<br />

Availability:<br />

• Code: Available here (branch “mayarender_v2.3.2”).<br />

• Recipe: Available here.<br />

Usage Model:<br />

This plugin can run its computation on the host only, on the coprocessor<br />

only (via offload), or on both the host and the coprocessor (via offload). It can also<br />

make use of multiple coprocessors. The plugin is invoked every time the<br />

screen updates during interactive content authoring.<br />

Highlights:<br />

• Workstation application for both Linux* and Windows*<br />

• Makes use of the high-performance Embree ray-tracing kernels, which<br />

are optimized for both Intel® Xeon® processors and the Intel® Xeon Phi<br />

coprocessor<br />

Results:<br />

Dual Intel® Xeon® processor E5-2697 v3 + 1 Intel Xeon Phi coprocessor<br />

7120A hybrid mode improves performance by up to 2.51X over the dual<br />

Intel® Xeon® processor E5-2697 v2 host alone.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF FEBRUARY, <strong>2015</strong><br />



Comparative Performance<br />

OpenLB*<br />

Cylinder2d<br />

1 NODE<br />

Chart: OpenLB* Speed Up (y-axis: comparative performance). Chart series and bar labels:<br />

2S Intel® Xeon® processor E5-2697 v2, baseline OpenLB* (12 Ranks): 1<br />

2S Intel® Xeon® processor E5-2697 v2, optimized OpenLB* (12 Ranks): 1X<br />

2S Xeon E5-2697 v2 (12 Ranks) + Xeon Phi 7120A (12 Ranks * 20 threads per Rank): 1.53X<br />

“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2<br />

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120A<br />

Application: OpenLB* 0.8r0. Workload is cylinder2d, which is<br />

based on the D2Q9 lattice model and comes from the examples in the<br />

source code package; the size is 16384 × 4096.<br />
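For readers unfamiliar with the D2Q9 model mentioned above, the following Python sketch shows its nine lattice velocities and weights, the usual second-order equilibrium distribution, and a single-node BGK collision step. It is not taken from OpenLB: the function names are invented for this example, and the choice of BGK collision is an assumption about the workload.<br />

```python
# D2Q9: nine discrete velocities on a 2D square lattice
# (one rest velocity, four axis-aligned, four diagonal) with their standard weights.
VELOCITIES = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
              (1, 1), (-1, 1), (-1, -1), (1, -1)]
WEIGHTS = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def equilibrium(rho, ux, uy):
    """Second-order equilibrium populations for density rho and velocity (ux, uy)."""
    usq = ux * ux + uy * uy
    feq = []
    for (cx, cy), w in zip(VELOCITIES, WEIGHTS):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq


def moments(f):
    """Recover density and velocity from the nine populations of one lattice node."""
    rho = sum(f)
    ux = sum(fi * cx for fi, (cx, _) in zip(f, VELOCITIES)) / rho
    uy = sum(fi * cy for fi, (_, cy) in zip(f, VELOCITIES)) / rho
    return rho, ux, uy


def bgk_collide(f, omega):
    """Relax the populations toward equilibrium with relaxation rate omega."""
    rho, ux, uy = moments(f)
    feq = equilibrium(rho, ux, uy)
    return [fi - omega * (fi - fe) for fi, fe in zip(f, feq)]


if __name__ == "__main__":
    node = equilibrium(1.0, 0.05, 0.0)
    print(moments(bgk_collide(node, omega=1.0)))
```

The collision step conserves mass and momentum at each node, and it touches only that node's nine values, so every lattice node can be processed independently by threads and vector lanes.<br />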

Description: OpenLB is an open-source software package that<br />

implements lattice Boltzmann methods (LBM) in an<br />

object-oriented framework. More at<br />

http://optilb.com/openlb/.<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />

Usage Model: Hybrid MPI + OpenMP* on Intel® Xeon Phi<br />

coprocessor 7120A and MPI on Intel® Xeon® processor E5-2697<br />

v2 using symmetric mode.<br />

Highlights: Pure MPI mode shows the best performance on the<br />

host processor. For the coprocessor, MPI + OpenMP* hybrid mode<br />

produces the best performance.<br />

Results: The performance shows up to 1.53X speed up in<br />

symmetric mode over the baseline Intel® Xeon® processor<br />

E5-2697 v2.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF OCTOBER, 2014<br />



Data Center Server Demo:<br />

Fujitsu HPC Gateway*<br />

1 NODE<br />


NEW<br />

Intel Better Together<br />

Hardware<br />

Intel® Xeon Phi coprocessor 7120P<br />

Intel® Xeon® processor E5-2600 v3<br />

Intel® Solid-State Drives<br />

Intel® True Scale Fabric<br />

Fujitsu HPC Gateway*<br />

Fujitsu CELSIUS* workstation<br />

Software<br />

Intel® Solutions for Lustre* software<br />

The Fujitsu HPC Gateway* is an extendable Web desktop that provides easy access to HPC capabilities. This demo<br />

features the Intel® Xeon Phi coprocessor 7120P and the Intel® Xeon® processor E5-2600 v3 powering this integrated<br />

and intuitive HPC environment that can be adapted by organizations to meet specific requirements, such as access<br />

policies and licensing. See the demo here.<br />



CLUSTER BENCHMARKS<br />

New Data Center Server Demos at Intel.com/XeonE5softwaresolutions:<br />

These demos feature the Intel® Xeon Phi coprocessor 7120P and the Intel® Xeon® processor E5-2600 v3<br />

powering applications that are improving the way HPC is utilized. See the demos linked here:<br />

• Intel® Xeon Phi Coprocessor Code Modernization<br />

• Altair HyperWorks* and Intel® Cluster Ready Shortens the HPC Learning Curve<br />

• ANSYS* Delivers Fast, Accurate Engineering Simulations<br />

• CST* 3D Simulations Powered by the Intel® Xeon Phi Coprocessor<br />

• E4* Accelerated Bioinformatics with Intel® Xeon® and Intel® Xeon Phi<br />

• Fujitsu* HPC Clusters Based on Intel® Cluster Ready<br />


NEW<br />

Intel Better Together<br />

Hardware<br />

Intel® Xeon Phi coprocessor 7120P<br />

Intel® Xeon® processor E5-2600 v3<br />

Intel® Solid-State Drives<br />

Software<br />

Intel® Solutions for Lustre* software<br />



Improving financial outcomes through faster simulations<br />

Financial Services<br />



Comparative Performance<br />

Monte Carlo*<br />

European Option Pricing<br />

1 NODE<br />

Chart: Monte Carlo European Option Pricing Speed Up (y-axis: comparative performance). Chart series and bar labels:<br />

Intel® Xeon® processor E5-2697 v2 + Java* managed code: 0.69X<br />

Intel® Xeon® processor E5-2697 v2 + C/C++ native: 1<br />

Intel® Xeon Phi coprocessor 7120A + C/C++ native: 8.91X<br />

Intel® Xeon Phi coprocessor 7120A + Hadoop*: 31.25X<br />

Java: version 8 update 30+<br />

C/C++: Intel® Parallel Composer XE 15.0<br />

Hadoop: Cloudera* Distribution of Hadoop 5.3+<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF MARCH, <strong>2015</strong><br />

Application: Monte Carlo European option<br />

Description: Implements European Option Pricing using Monte Carlo. It<br />

compares the performance of<br />

1) Java* code, 2) C/C++ native code, 3) C/C++ offload accelerated code,<br />

4) C/C++ accelerated code using Hadoop*.<br />
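The pricing technique itself is compact. The Python sketch below is a minimal illustration, not the benchmarked Java or C/C++ code: it assumes a European call under geometric Brownian motion, and the contract parameters, path count, and function names are all chosen for the example. The closed-form Black-Scholes price is included only as a cross-check on the Monte Carlo estimate.<br />

```python
import math
import random


def black_scholes_call(s0, k, r, sigma, t):
    """Closed-form European call price, used here only to check the simulation."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)

    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return s0 * cdf(d1) - k * math.exp(-r * t) * cdf(d2)


def mc_european_call(s0, k, r, sigma, t, n_paths, seed=0):
    """Monte Carlo estimate: average the discounted payoff over simulated end prices."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * t
    vol = sigma * math.sqrt(t)
    payoff_sum = 0.0
    for _ in range(n_paths):
        # Terminal price of one path under geometric Brownian motion.
        s_t = s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        payoff_sum += max(s_t - k, 0.0)
    return math.exp(-r * t) * payoff_sum / n_paths


if __name__ == "__main__":
    args = (100.0, 100.0, 0.05, 0.2, 1.0)   # spot, strike, rate, volatility, years
    print("closed form:", black_scholes_call(*args))
    print("monte carlo:", mc_european_call(*args, n_paths=100_000, seed=1))
```

Each path is independent of the others, which is why the loop maps naturally onto many threads and wide vector units in the native and accelerated variants the slide compares.<br />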

Availability:<br />

• Code and Recipe: Available here.<br />


NEW<br />

Usage Model: Java (Managed Offload), Native on the host, Native and<br />

Accelerators, Hadoop on native and Accelerated.<br />

Highlights:<br />

• Java code can use parallelism but cannot vectorize any control flows<br />

• Native C/C++ code can take advantage of both vectorization &<br />

parallelization<br />

• Native Accelerated code offloads the whole workload to Intel® Xeon<br />

Phi Coprocessor.<br />

• Hadoop distributes the workload to 4 remote nodes using MapReduce<br />

• Remote nodes accelerate the workload and send the run result back<br />

to the head node.<br />

Results:<br />

• Intel Java Stream can parallelize the workload but not vectorize it<br />

• The native interface brings vectorization and parallelization to the<br />

workload<br />

• Acceleration extends the parallelism from multicore to many-core<br />

• Hadoop MapReduce can further distribute the application across 4<br />

nodes and achieve up to 31X improvement.<br />
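The distribution step in the last bullet can be pictured as a map over per-node shares of the paths followed by a reduce on the head node. The Python sketch below imitates that pattern in a single process with the built-in `map` and `functools.reduce`; it does not use Hadoop, and the contract parameters, the per-node seeding scheme, and the function names are assumptions made for the example. Only the node count of 4 comes from the slide.<br />

```python
import math
import random
from functools import reduce

# Illustrative contract parameters (assumptions, not taken from the slide).
S0, STRIKE, RATE, SIGMA, T = 100.0, 100.0, 0.05, 0.2, 1.0


def make_tasks(n_paths, n_nodes):
    """Split n_paths as evenly as possible; each task is (seed, path_count)."""
    share, extra = divmod(n_paths, n_nodes)
    return [(node, share + (1 if node < extra else 0)) for node in range(n_nodes)]


def map_task(task):
    """Work done on one node: return (sum of call payoffs, number of paths)."""
    seed, n = task
    rng = random.Random(seed)
    drift = (RATE - 0.5 * SIGMA ** 2) * T
    vol = SIGMA * math.sqrt(T)
    total = 0.0
    for _ in range(n):
        s_t = S0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        total += max(s_t - STRIKE, 0.0)
    return total, n


def reduce_task(left, right):
    """Combine two partial results on the head node."""
    return left[0] + right[0], left[1] + right[1]


def distributed_price(n_paths, n_nodes=4):
    """Price the option from per-node partial sums; returns (price, paths used)."""
    partials = map(map_task, make_tasks(n_paths, n_nodes))   # would run remotely
    total, count = reduce(reduce_task, partials)
    return math.exp(-RATE * T) * total / count, count


if __name__ == "__main__":
    print(distributed_price(40_000))
```

Because each node returns only a payoff sum and a path count, the data sent back to the head node stays tiny no matter how many paths each node simulates.<br />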



Comparative Increase<br />

Monte Carlo*<br />

Microsoft Excel*-based option pricing<br />

1 NODE<br />

NEW<br />

Chart: Microsoft Excel*-based Monte Carlo Option Pricing Speed Up (y-axis: comparative increase). Chart series:<br />

Intel® Xeon® processor E5-2697 v2<br />

Xeon E5-2697 v2 + Intel® Xeon Phi coprocessor 7120P<br />

“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2<br />

Application: Microsoft Excel*-based Monte Carlo option pricing.<br />

Description: This application uses Excel to price European<br />

options with Monte Carlo simulation. Traditionally these calculations are done on<br />

CPUs. This framework demonstrates how Excel can offload<br />

calculations to the Intel® Xeon Phi x100 (Knights Corner) coprocessor using<br />

the Windows* stack.<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />

Usage Model: Offload, OpenMP*. No processing on the host;<br />

host offloads all data from Excel to the Intel® Xeon Phi<br />

coprocessor. All calculations were performed on the Intel Xeon<br />

Phi Coprocessor.<br />

Highlights: Excel users can speed up their compute-intensive<br />

applications by offloading these tasks to the Intel® Xeon Phi<br />

coprocessor.<br />

Results: Offloading calculations to Intel® Xeon Phi coprocessor<br />

7120P resulted in up to 2.09X performance improvement<br />

compared to the Intel® Xeon® processor E5-2697 v2.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF MARCH, <strong>2015</strong><br />



Comparative Performance<br />

QuantLib*<br />

Single Precision Monte Carlo<br />

1 NODE<br />

APPROVED FOR PUBLIC PRESENTATION<br />

NEW<br />

Chart: QuantLib* SP Monte Carlo Speed Up (baseline: Intel® Xeon® processor E5-2697 v2 = 1)<br />

• Intel® Xeon® processor E5-2697 v2: 1<br />

• Intel® Xeon® processor E5-2697 v3: 1.82X<br />

• Xeon E5-2697 v2 + 1 Intel® Xeon Phi coprocessor 7120A: 6.93X<br />

• Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla* K40c: 1.34X<br />

“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2<br />

Application: QuantLib* Single Precision Monte Carlo<br />

Description: Monte Carlo simulation is a popular mathematical<br />

technique for valuing and analyzing complex financial instruments, in<br />

this case European options. This version is single-precision.<br />

More at Accelerating Financial Applications on the GPU.<br />

Availability:<br />

• Code and Recipe: Available here.<br />

Usage Model: Multi-option scenario; 400k paths, 4k options.<br />

The model uses 250 time steps per path, OpenMP* for<br />

threading, and Intel® MKL to generate vectors of random numbers.<br />

Highlights: The optimized Intel® Architecture code is compared<br />

with the reference CUDA* version.<br />

Results:<br />

• Intel® Xeon® processor E5-2697 v3 demonstrates a speed up<br />

of up to 1.82X over Intel Xeon processor E5-2697 v2.<br />

• Intel® Xeon Phi (native) demonstrates a speed up of up to<br />

6.93X over Intel Xeon processor E5-2697 v2.<br />

• NVIDIA Tesla* K40c demonstrates a speed up of up to 1.34X<br />

over Intel Xeon processor E5-2697 v2 (measurements do not<br />

include data transfer time).<br />
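The path-based pricing described in this slide can be sketched in a few lines of Python. This is a minimal illustration, not the QuantLib* or benchmark code: it assumes geometric Brownian motion for the underlying, and the function name `mc_european_call` and all parameter values are hypothetical.

```python
import math
import random


def mc_european_call(s0, strike, rate, sigma, maturity,
                     n_paths, n_steps, seed=0):
    """Estimate a European call price by simulating price paths.

    Assumption: the underlying follows geometric Brownian motion.
    """
    rng = random.Random(seed)            # seeded so runs are repeatable
    dt = maturity / n_steps
    drift = (rate - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    payoff_sum = 0.0
    for _ in range(n_paths):
        log_s = math.log(s0)
        for _ in range(n_steps):         # one Gaussian draw per time step
            log_s += drift + vol * rng.gauss(0.0, 1.0)
        payoff_sum += max(math.exp(log_s) - strike, 0.0)
    # Discount the average payoff back to today.
    return math.exp(-rate * maturity) * payoff_sum / n_paths


if __name__ == "__main__":
    # Far fewer paths and steps than the 400k paths and 250 steps quoted above.
    print(mc_european_call(100.0, 100.0, 0.05, 0.2, 1.0, 20000, 10))
```

The two nested loops are where threading (over paths or options) and vectorized random number generation would apply in an optimized version.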

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF MARCH, 2015<br />



Comparative Performance<br />

QuantLib*<br />

Single Precision Black-Scholes<br />

Chart: QuantLib* SP Black-Scholes Speed Up (baseline: Intel® Xeon® processor E5-2697 v2 = 1)<br />

• Intel® Xeon® processor E5-2697 v2: 1<br />

• Intel® Xeon® processor E5-2697 v3: 1.26X<br />

• Xeon E5-2697 v2 + 1 Intel® Xeon Phi coprocessor 7120A: 1.84X<br />

• Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla* K40c: 0.69X<br />

“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2<br />

Application: QuantLib* Single Precision Black-Scholes<br />

Description: QuantLib is a popular library used in many<br />

areas of computational finance. Black-Scholes is a<br />

popular mathematical model for European option valuation.<br />

This is a single-precision version. More at Accelerating<br />

Financial Applications on the GPU.<br />
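For readers unfamiliar with the model, the closed-form Black-Scholes price of a European option can be written as a short Python sketch. It assumes no dividends; the function names `norm_cdf` and `black_scholes` are illustrative and are not QuantLib* APIs.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes(s0, strike, rate, sigma, maturity):
    """Return (call, put) prices for a European option, no dividends."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(s0 / strike)
          + (rate + 0.5 * sigma ** 2) * maturity) / (sigma * sqrt_t)
    d2 = d1 - sigma * sqrt_t
    discount = math.exp(-rate * maturity)
    call = s0 * norm_cdf(d1) - strike * discount * norm_cdf(d2)
    put = strike * discount * norm_cdf(-d2) - s0 * norm_cdf(-d1)
    return call, put


if __name__ == "__main__":
    # Hypothetical inputs: spot 100, strike 100, 5% rate, 20% volatility, 1 year.
    print(black_scholes(100.0, 100.0, 0.05, 0.2, 1.0))
```

Each option is independent of the others, so a benchmark over millions of options is mostly a stream of logarithm, exponential, and error-function evaluations.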

Availability:<br />

• Code and Recipe: Available here.<br />

Usage Model: Test Conditions: 5 million options, 2048<br />

iterations. Uses OpenMP* for threads and Intel® TBB for<br />

memory alignment.<br />

Highlights: The optimized Intel® Architecture code is<br />

compared with the reference CUDA* version.<br />

1 NODE<br />

NEW<br />

Results:<br />

• Intel® Xeon® processor E5-2697 v3 demonstrates a speed<br />

up of up to 1.26X over Intel Xeon processor E5-2697 v2.<br />

• Intel® Xeon Phi (native) demonstrates a speed up of up<br />

to 1.84X over Intel Xeon processor E5-2697 v2.<br />

• NVIDIA Tesla* K40c runs at 0.69X the performance of the<br />

Intel Xeon processor E5-2697 v2, which is a slowdown<br />

(measurements do not include data transfer time).<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF MARCH, 2015<br />



Comparative Performance<br />

Monte Carlo* RNG<br />

European Options; Double Precision<br />

1 NODE<br />


Chart: Monte Carlo* RNG European Options, Double Precision (baseline: 2S Intel® Xeon® processor E5-2697 v3 = 1)<br />

• 2S Intel® Xeon® processor E5-2697 v2: .72X<br />

• 2S Intel® Xeon® processor E5-2697 v3: 1<br />

• 2S Xeon E5-2697 v3 + Xeon Phi 7120A: 1.87X<br />

• 2S Xeon E5-2697 v3 + 2 Xeon Phi 7120A: 2.77X<br />

“Xeon E5-2697 v3” = Intel® Xeon® processor E5-2697 v3<br />

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120A<br />

Application: Monte Carlo* RNG European option pricing; double<br />

precision.<br />

Description: Monte Carlo RNG is a popular derivative pricing<br />

benchmark widely used by investment banks such as UBS*, Bank of<br />

America* and Goldman-Sachs*. More at<br />

https://software.intel.com/en-us/articles/case-study-achieving-high-performance-on-monte-carlo-european-option-using-stepwise.<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />

Usage Model: Hybrid or Asynchronous offload on one shared<br />

memory host node with up to two coprocessor devices.<br />

Highlights:<br />

• One of the two most computationally intensive workloads in FSI<br />

benchmark suites.<br />

• More operations per option data set.<br />

• Benefits from Intel® AVX2 FMA.<br />

Results: The baseline Intel® Xeon® processor E5-2697 v3 delivers 1.39X<br />

the performance of the Intel® Xeon® processor E5-2697 v2. Adding<br />

one coprocessor card nearly doubles the baseline performance;<br />

adding two coprocessors nearly triples it.<br />
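The 1.39X figure follows from the chart's normalization: the chart sets the E5-2697 v3 to 1 and shows the E5-2697 v2 at .72X, so the v3 advantage is 1 / 0.72. A small sketch of that re-basing, using the chart labels as read from this slide (the helper name `rebase` is hypothetical):

```python
def rebase(speedups, new_baseline):
    """Re-express a set of speed-ups relative to a different entry."""
    ref = speedups[new_baseline]
    return {name: value / ref for name, value in speedups.items()}


# Values as labelled on the chart, normalized to the E5-2697 v3.
chart = {
    "2S E5-2697 v2": 0.72,
    "2S E5-2697 v3": 1.00,
    "2S E5-2697 v3 + 1 Xeon Phi 7120A": 1.87,
    "2S E5-2697 v3 + 2 Xeon Phi 7120A": 2.77,
}

vs_v2 = rebase(chart, "2S E5-2697 v2")

if __name__ == "__main__":
    for name, value in vs_v2.items():
        print(f"{name}: {value:.2f}X")
```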

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JUNE, 2014<br />



Comparative Performance<br />

Binomial Options*<br />

Double Precision<br />

1 NODE<br />


Chart: Binomial Options Double Precision speed up (baseline: 2S Intel® Xeon® processor E5-2697 v3 = 1). Bar values in legend order: .55X, 1, 1.54X, 2X.<br />

Application: Binomial European Option Pricing.<br />

Description: Binomial Options* is a popular derivative pricing<br />

benchmark widely used by investment banks such as UBS*, Bank<br />

of America* and Goldman-Sachs*. One of the two most<br />

computationally intensive workloads in FSI benchmark suites.<br />


Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />


.55X<br />

2S Intel® Xeon® processor E5-2697 v2<br />

2S Intel® Xeon® processor E5-2697 v3<br />

2S Xeon E5-2697 v3 + Xeon Phi 7120A<br />

2S Xeon E5-2697 v3 + 2 Xeon Phi 7120A<br />

“Xeon E5-2697 v3” = Intel® Xeon® processor E5-2697 v3<br />

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120A<br />

Usage Model: Hybrid or Asynchronous offload on one shared<br />

memory host node with up to two coprocessor devices.<br />

Highlights:<br />

• More operations per option data set.<br />

• Benefits from Intel® AVX2 FMA.<br />

Results: The baseline Intel® Xeon® processor E5-2697 v3 delivers<br />

nearly double the performance of the Intel® Xeon® processor<br />

E5-2697 v2. Adding one coprocessor card increases the baseline<br />

performance by over 50%; adding two coprocessors roughly<br />

doubles it (2X on the chart).<br />
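As background for this workload, a binomial-tree price for a European call can be sketched as follows. The sketch assumes the Cox-Ross-Rubinstein parameterization, which the slide does not specify, and the function name `binomial_european_call` is hypothetical.

```python
import math


def binomial_european_call(s0, strike, rate, sigma, maturity, n_steps):
    """Price a European call on a recombining binomial tree.

    Assumption: Cox-Ross-Rubinstein up/down factors.
    """
    dt = maturity / n_steps
    up = math.exp(sigma * math.sqrt(dt))
    down = 1.0 / up
    growth = math.exp(rate * dt)
    p_up = (growth - down) / (up - down)   # risk-neutral up probability
    discount = 1.0 / growth
    # Payoffs at expiry; node j has had j up moves.
    values = [max(s0 * up ** j * down ** (n_steps - j) - strike, 0.0)
              for j in range(n_steps + 1)]
    # Roll back one time step at a time toward the root.
    for step in range(n_steps, 0, -1):
        for j in range(step):
            values[j] = discount * (p_up * values[j + 1]
                                    + (1.0 - p_up) * values[j])
    return values[0]


if __name__ == "__main__":
    print(binomial_european_call(100.0, 100.0, 0.05, 0.2, 1.0, 500))
```

The roll-back performs on the order of n_steps squared multiply-adds per option, which is consistent with the "more operations per option data set" highlight.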

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JUNE, 2014<br />



Comparative Performance<br />

Monte Carlo*<br />

European Options<br />

1 NODE<br />


Chart: Monte Carlo* European Options Speed Up, Single and Double Precision (baseline: 2S Intel® Xeon® processor E5-2670 = 1)<br />

• Single Precision: 2S Intel® Xeon® processor E5-2697 v2 1.51X; 2x Intel® Xeon Phi coprocessor 7120A 10.65X<br />

• Double Precision: 2S Intel® Xeon® processor E5-2697 v2 1.4X; 2x Intel® Xeon Phi coprocessor 7120A 4.32X<br />

Application: Monte Carlo* RNG European option pricing.<br />

Description: Monte Carlo RNG is a popular derivative pricing<br />

benchmark widely used by investment banks such as UBS*, Bank<br />

of America* and Goldman-Sachs*. More here.<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />

Usage Model: Single and double precision testing, host node<br />

with up to two coprocessor devices. Performance depends on<br />

raw computational power and the performance of exp2().<br />
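Monte Carlo path generation typically evaluates an exponential for every simulated price. One common way to express that exponential through a base-2 exponential is the identity exp(x) = 2^(x · log2 e), sketched below. This is an illustration of the identity only; it is an assumption that the benchmark code uses it, and the name `exp_via_exp2` is hypothetical.

```python
import math

LOG2_E = math.log2(math.e)   # about 1.4427


def exp_via_exp2(x):
    """Compute e**x as 2**(x * log2(e))."""
    return 2.0 ** (x * LOG2_E)


if __name__ == "__main__":
    for x in (-0.3, 0.0, 1.0):
        print(x, exp_via_exp2(x), math.exp(x))
```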

Highlights: Dramatic performance scaling for both single-precision<br />

and double-precision calculations.<br />

Results: For single precision, the Intel® Xeon® processor E5-<br />

2697 v2 performance improvement is up to 1.51X compared to<br />

the baseline, and the 2x Intel® Xeon Phi coprocessor<br />

performance improvement is up to 10.65X.<br />

2S Intel® Xeon® processor E5-2697 v2<br />

2x Intel® Xeon Phi coprocessor 7120A (pre-production HW/SW)<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2013<br />



Comparative Performance<br />

Black-Scholes Formula Valuation*<br />

Single and Double Precision<br />

1 NODE<br />


Chart: Black-Scholes Speed Up, Single and Double Precision (baseline: 2S Intel® Xeon® processor E5-2670 = 1)<br />

• Single Precision: 2S Intel® Xeon® processor E5-2697 v2 1.36X; Intel® Xeon Phi coprocessor 7120A (pre-production HW/SW) 2.89X<br />

• Double Precision: 2S Intel® Xeon® processor E5-2697 v2 1.65X; Intel® Xeon Phi coprocessor 7120A (pre-production HW/SW) 2.85X<br />

Application: Black-Scholes* financial modeling requires<br />

raw computational power plus high bandwidth<br />

between execution cores and memory<br />

Description: Popular derivative pricing benchmarks widely<br />

used by investment banks such as UBS, Bank of America,<br />

Goldman-Sachs<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />

Usage Model: Baseline is the Intel® Xeon® processor E5-2670<br />

host only; speed up is shown for the Intel® Xeon®<br />

processor E5-2697 v2 and for the Intel® Xeon Phi<br />

coprocessor 7120A.<br />

Highlights: Dramatic scaling for both single and double<br />

precision computations.<br />

Results: Up to 2.85X improved performance with the Intel®<br />

Xeon Phi coprocessor 7120A over the baseline Intel® Xeon®<br />

processor E5-2670 for the double precision benchmark.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2013<br />



Comparative Performance<br />

Monte Carlo* RNG<br />

European Options<br />

1 NODE<br />


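Chart: Monte Carlo* RNG European Options Speed Up, Single and Double Precision (baseline: 2S Intel® Xeon® E5-2670 = 1)<br />

• Single Precision: 2S Intel® Xeon® E5-2697 v2 1.64X; 2x Intel® Xeon Phi coprocessor 7120A 2.6X<br />

• Double Precision: 2S Intel® Xeon® E5-2697 v2 1.55X; 2x Intel® Xeon Phi coprocessor 7120A 1.81X<br />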

Application: Monte Carlo* RNG European option pricing; double<br />

precision.<br />

Description: Monte Carlo RNG is a popular derivative pricing<br />

benchmark widely used by investment banks such as UBS*, Bank<br />

of America* and Goldman-Sachs*. More at<br />

https://software.intel.com/en-us/articles/case-study-achieving-high-performance-on-monte-carlo-european-option-using-stepwise.<br />


Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />

Usage Model: Single and double precision testing, host node<br />

with up to two coprocessor devices.<br />

2S Intel® Xeon® E5-2670<br />

2S Intel® Xeon® E5-2697 v2<br />

2x Intel® Xeon Phi Coprocessor 7120A (pre-production HW/SW)<br />

Highlights: Intel® Xeon Phi coprocessor fast exp2 and FMA<br />

instructions deliver high performance, high accuracy for single<br />

and double precision computations.<br />

Results: For double precision, the Intel® Xeon® processor E5-<br />

2697 v2 performance improvement is up to 1.55X compared to<br />

the baseline, and the Intel® Xeon Phi coprocessor performance<br />

improvement is up to 1.81X.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2013<br />



Comparative Performance<br />

Binomial Options*<br />

Single and Double Precision<br />

1 NODE<br />


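Chart: Binomial Options Speed Up, Single and Double Precision (baseline: 2S Intel® Xeon® processor E5-2697 v2 = 1). With one Intel® Xeon Phi coprocessor: 1.85X in single precision and 1.85X in double precision.<br />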

Application: Binomial European Option Pricing.<br />

Description: Binomial Options* is a popular derivative pricing<br />

benchmark widely used by investment banks such as UBS*, Bank<br />

of America* and Goldman-Sachs*.<br />


Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />

Usage Model: Hybrid or Asynchronous offload on one shared<br />

memory host node with up to two coprocessor devices.<br />

2S Intel® Xeon® processor E5-2697 v2<br />

Highlights:<br />

• The L2 cache on Intel® Architecture is large enough for this algorithm, which is an<br />

advantage compared to GPU shared memory.<br />

• Intel compiler-based SIMD vectorization can handle certain loop-carried<br />

dependencies, while a GPU has to synchronize explicitly.<br />

Results: Adding one Intel® Xeon Phi coprocessor card increases<br />

the baseline Intel® Xeon processor E5-2697 v2 performance by<br />

up to 1.85X.<br />

2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi<br />

coprocessor (pre-production HW/SW)<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF DECEMBER, 2013<br />



Comparative Performance<br />

Xcelerit*<br />

Libor Market Model (LMM)<br />

1 NODE<br />


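Chart: Xcelerit* LMM Speed Up (baseline: 2S Intel® Xeon® processor E5-2697 v2 = 1)<br />

• 128k paths: Intel® Xeon Phi coprocessor 7120A 1.29X; NVIDIA K40* 1.03X<br />

• 256k paths: Intel® Xeon Phi coprocessor 7120A 1.29X; NVIDIA K40* 1.07X<br />

• 512k paths: Intel® Xeon Phi coprocessor 7120A 1.34X; NVIDIA K40* 1.12X<br />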

Application: Xcelerit* LMM<br />

Description: Used in “Fixed income” division to model forward<br />

interest rates. More at http://blog.xcelerit.com/intel-xeon-phi-vs-nvidia-tesla-gpu/.<br />

Availability:<br />

• Code:<br />

• Base version.<br />

• Optimized versions: Contact Xcelerit* at<br />

http://www.xcelerit.com/ Contact Names: Hicham Lahlou,<br />

hicham.lahlou@xcelerit.com, Jorg Lotze,<br />

jorg.lotze@xcelerit.com<br />

• Recipe: Not available. Check for future availability here.<br />

0<br />

Paths: 128k 256k 512k<br />

2S Intel® Xeon® processor E5-2697 v2<br />

Usage Model: Baseline is the Intel® Xeon® processor E5-2697<br />

v2 host only and speed up is with native Intel® Xeon Phi<br />

coprocessor 7120A, double precision.<br />

Highlights: The code has been optimized and delivered to<br />

Xcelerit.<br />

Intel® Xeon Phi coprocessor 7120A<br />

NVIDIA K40*<br />

Results: Optimized Intel Xeon Phi coprocessor is the best<br />

performing platform for all configurations.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014<br />



“Intel Xeon Phi coprocessors<br />

enable exciting applications; we<br />

are seeing a noticeable boost in<br />

the performance of<br />

computationally-intensive kernels<br />

of wave migration solutions<br />

that are vital to our partners in the<br />

energy field.”<br />

Rishi Khan<br />

Vice President of Research and Development,<br />

ET International<br />

Enhancing exploration and extraction processes.<br />

Energy Industry<br />



Iso3DFD*<br />

16th Order Isotropic Kernel<br />

Chart: Iso3DFD Speed Up (y-axis: Comparative Increase; baseline: Intel® Xeon® processor E5-2697 v2 = 1)<br />

• Intel® Xeon® processor E5-2697 v2 (turbo off): 1<br />

• Intel® Xeon® processor E5-2697 v3 (turbo off): 1.37X<br />

• Intel® Xeon Phi coprocessor 7120A (native, turbo on): 1.72X<br />

Application: Iso3DFD.<br />

1 NODE<br />


NEW<br />

Description: Iso3DFD is a 16th-order isotropic kernel at the<br />

heart of the RTM algorithm. This code computes the wave<br />

propagation used in seismic imaging.<br />

Availability:<br />

• Code: Intel® Xeon Phi coprocessor version available here.<br />

Intel® Advanced Vector Extensions 2 (Intel® AVX2) version not<br />

yet ready for publication.<br />

• Recipe: Available here.<br />

Usage Model: OpenMP*, no MPI, native Intel® Xeon Phi<br />

coprocessor. Domain sizes, cache blocking sizes, prefetching<br />

distances are auto-tuned to provide the maximum possible<br />

throughput for each platform.<br />

Highlights: Intel® AVX2 intrinsics implementation. Auto-tuning<br />

used to find best set of parameters for Ivy Bridge, KNC and<br />

Haswell.<br />

Results: Intel® Xeon Phi coprocessor 7120P performance is up to<br />

1.72X better than Intel® Xeon® processor E5-2697 v2 and up to<br />

1.25X better than Intel® Xeon® processor E5-2697 v3. Intel®<br />

Xeon® processor E5-2697 v3 performance is up to 1.37X better<br />

than Intel® Xeon® processor E5-2697 v2.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF MARCH, 2015<br />



Distributed Iso3DFD*<br />

16th Order Isotropic Kernel<br />

Chart: Distributed Iso3DFD Speed Up (y-axis: Tflops; x-axis: 1 to 4 nodes)<br />

• Problem size: 14.7 GCells (1 node), 29.3 GCells (2 nodes), 44 GCells (3 nodes), 58.7 GCells (4 nodes)<br />

• Legend: 2S Intel® Xeon® processor E5-2697 v3 + 2 Intel® Xeon Phi coprocessor 7120P<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF FEBRUARY, 2015<br />

CLUSTER BENCHMARK<br />

1 NODE<br />

Application: Distributed version of Iso3DFD.<br />


NEW<br />

Description: One dimensional domain decomposition Iso3DFD<br />

(16th order Isotropic kernel). Multi-node tests; cluster performance<br />

and scalability of MPI implementation.<br />

Availability:<br />

• Code: Not available.<br />

• Recipe: Available here. Optimization guide available here.<br />

Usage Model: Scaling analysis with each Intel® Xeon Phi<br />

coprocessor in a node solving a 14GB subdomain and each pair of<br />

Intel® Xeon® processors solving a 10GB subdomain.<br />

• Symmetric MPI. One MPI process per node or device.<br />

• OpenMP* within each MPI process on processor or coprocessor.<br />

(Note: No disk I/O is performed to save seismic wave-fields.)<br />

Highlights:<br />

• Symmetric model allows host-coprocessor load balancing at MPI<br />

job launch time.<br />

• Processes running either on host or MIC can be independently<br />

tuned.<br />

• Enabled the remaining DDR3 node memory to be used for fast<br />

I/O and other tasks.<br />

Results: Linear scalability confirmed to ensure no cluster-level<br />

limitation. 16GB memory on the coprocessors allows high overlap<br />

of computations and halo exchanges, hiding the cost of<br />

communication with the processors and other nodes.<br />
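The one-dimensional decomposition and halo exchange described in this slide can be illustrated with a small partitioning sketch. It is not the benchmark code: the function name `decompose_1d` is hypothetical, the 14:14:10 weights reuse the sub-domain sizes quoted above purely as relative shares, and the halo width of 8 planes is an assumption based on a 16th-order stencil reaching 8 neighbours on each side.

```python
def decompose_1d(n_planes, weights, halo):
    """Split n_planes along one axis in proportion to weights.

    Returns one (start, stop, lo, hi) tuple per part: [start, stop) is the
    slab a process owns, and [lo, hi) also covers the halo planes it must
    receive from its neighbours before each time step.
    """
    total = sum(weights)
    bounds = [0]
    running = 0
    for weight in weights:
        running += weight
        bounds.append(round(n_planes * running / total))
    parts = []
    for start, stop in zip(bounds, bounds[1:]):
        lo = max(0, start - halo)          # no halo beyond the domain edge
        hi = min(n_planes, stop + halo)
        parts.append((start, stop, lo, hi))
    return parts


if __name__ == "__main__":
    # Hypothetical node: two coprocessors (weight 14 each) and one host pair (10).
    for part in decompose_1d(380, [14, 14, 10], halo=8):
        print(part)
```

Choosing the weights at launch time is one simple way to express the host-coprocessor load balancing mentioned in the highlights.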



Comparative Performance<br />

Petrobras*<br />

Isotropic RTM<br />

Chart: Petrobras* Isotropic RTM Speed Up (baseline: 2S Intel® Xeon® processor E5-2697 v2, 2 MPI/12 OMP = 1)<br />

• 2S Xeon E5-2697 v2, (2/12) + Xeon Phi 7120A, 1 MPI/244 OMP: 2.2X<br />

• 2S Xeon E5-2697 v2, (2/12) + 2x Xeon Phi 7120A, 1/244: 3.4X<br />

• 2S Xeon E5-2697 v2, (2/12) + 3x Xeon Phi 7120A, 1/244: 4.6X<br />

• 2S Xeon E5-2697 v2, (2/12) + 4x Xeon Phi 7120A, 1/244: 5.6X<br />

“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2<br />

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120A<br />

1 NODE<br />


Application: Isotropic RTM has a major role in accurate imaging<br />

of complex subsurface structures.<br />

Description: The 3D Isotropic RTM plays a major role in accurate<br />

imaging of complex subsurface structures. This benchmark<br />

measures performance and scalability of the compute kernel on<br />

a hybrid host + Intel® Xeon Phi coprocessor configuration. Halo<br />

exchanges are performed between devices. More at<br />

http://www.petrobras.com/en/home.htm.<br />

Availability:<br />

• Code: Proprietary.<br />

• Recipe: Not available. Check for future availability here.<br />

Usage Model: Intel® Xeon Phi coprocessor 7120A as a host in<br />

native mode concurrently executing with the Intel® Xeon®<br />

processor E5-2697 v2. Same code for both devices with<br />

OpenMP* and Intrinsics.<br />

Highlights: Scalable; competitive performance/watt<br />

Results: An Intel® Xeon® processor host populated with 4 Intel® Xeon Phi<br />

cards shows up to 5.6X scaling compared to the baseline<br />

Intel® Xeon® processor E5-2697 v2.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY, 2014<br />



Comparative Performance<br />

ISO 3DFD*<br />

16th Order Isotropic Kernel<br />

1 NODE<br />


Chart: ISO 3DFD* Kernel Speed Up (baseline: 2S Intel® Xeon® E5-2697 v2 = 1). Intel® Xeon Phi coprocessor 7120A: 1.66X.<br />

Application: ISO 3DFD* 16th order Isotropic kernel.<br />

Description: ISO 3DFD is a kernel that computes an isotropic<br />

wave propagation in 3D (ISO). These kernel types are the heart of<br />

Reverse Time Migration (RTM). RTM can also be computed with<br />

VTI or TTI kernels. ISO 3DFD computes the wave propagation with<br />

16th-order accuracy in space and 2nd-order accuracy in time. On a single Intel®<br />

Xeon Phi coprocessor, this kernel is able to update around 5800<br />

Mcells/s (350 Gflops).<br />
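To make "16th order in space" concrete: a central finite difference of order 16 for a second derivative uses 8 neighbours on each side of a cell along each axis. The sketch below derives the weights from the standard closed-form expression for central-difference weights and applies them along one axis. The function names are hypothetical, and the real kernel is a 3D stencil written with intrinsics, which this 1D sketch does not attempt to reproduce.

```python
from fractions import Fraction
from math import factorial


def second_derivative_weights(radius):
    """Central-difference weights for d2/dx2 with accuracy order 2 * radius.

    Index k of the result holds the weight for offsets +k and -k.
    """
    m = radius
    side = [
        Fraction(2 * (-1) ** (k + 1) * factorial(m) ** 2,
                 k * k * factorial(m - k) * factorial(m + k))
        for k in range(1, m + 1)
    ]
    center = -2 * sum(side)
    return [center] + side


def second_derivative(samples, i, dx, weights):
    """Apply the symmetric stencil to samples around index i."""
    total = weights[0] * samples[i]
    for k in range(1, len(weights)):
        total += weights[k] * (samples[i - k] + samples[i + k])
    return total / (dx * dx)


if __name__ == "__main__":
    weights = second_derivative_weights(8)   # radius 8 gives 16th order
    cubes = [Fraction(x) ** 3 for x in range(17)]
    print(second_derivative(cubes, 8, 1, weights))
```

In three dimensions the same weights are applied along x, y, and z, so each cell update reads 49 values (the cell itself plus 8 neighbours in each of the 6 directions).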

Availability:<br />

• Code: Not available.<br />

• Recipe: Available here.<br />

Usage Model: Native mode with intrinsics. OpenMP* but no MPI.<br />

Results: The Intel® Xeon Phi coprocessor increased<br />

performance by up to 1.66X compared to the Intel® Xeon®<br />

processor E5-2697 v2.<br />

Intel® Xeon Phi Coprocessor 7120A (native 244T)<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF APRIL, 2014<br />



CLUSTER BENCHMARK<br />

Data Center Server Demo:<br />

DownUnder GeoSolutions*<br />

3,800+ NODES<br />


NEW<br />

Intel Better Together<br />

Hardware<br />

Intel® Xeon Phi coprocessor<br />

Intel® Xeon® E5-2600 v3 processor<br />

Intel® Solid-State Drives<br />

Software<br />

Intel® Solutions for Lustre* software<br />

DownUnder GeoSolutions Insight* is optimized with Intel® Software Development Tools for Intel® Xeon Phi<br />

coprocessors and the Intel® Xeon® processor E5-2600 v3. It gives the oil and gas industry the tools and<br />

functionality required to perform standard workflows more efficiently, to interactively manipulate<br />

and migrate data, and to achieve results in minutes that previously took hours or days¹. See the demo here.<br />



Astrophysics<br />

Speeding discovery through faster, more accurate simulations.<br />

“The Intel Xeon Phi coprocessor architecture provides truly impressive performance, and the potential to easily port applications represents a significant leap ahead.”<br />

Greg Peterson, Director, National Institute for Computational Sciences<br />


Comparative Performance<br />

BerkeleyGW*<br />

Sigma Phase<br />

1 NODE<br />

APPROVED FOR PUBLIC PRESENTATION<br />

NEW<br />

Chart: BerkeleyGW* Sigma Phase Speed Up. Bar labels: 1, 1.31X, 1.28X, 1.58X<br />

Application: BerkeleyGW* (Sigma)<br />

Description: BerkeleyGW is a massively parallel computational<br />

package for electron-excited state properties. Sigma is the<br />

second half of the GW code. It gives the quasiparticle self-energies<br />

and dispersion relation for quasielectron and<br />

quasihole states.<br />

Availability:<br />

• Code: Available here (version 1.1 beta).<br />

• Recipe: None. The exact input used is not available to the<br />

public although a similar input is available.<br />

Usage Model: MPI+OMP. In symmetric mode, MPI ranks are<br />

distributed to both the host and KNC card. Each MPI rank uses<br />

OMP threads (Hybrid).<br />

Chart legend: Intel® Xeon® processor E5-2697 v2; Intel® Xeon® processor E5-2697 v3; Xeon E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A; Xeon E5-2697 v3 + Intel® Xeon Phi coprocessor 7120A<br />

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3<br />

Highlights: Hybrid MPI+OMP allows code to be run on both<br />

Intel® Xeon® processors and Intel® Xeon Phi coprocessors<br />

without any source modifications.<br />

Results: The Intel® Xeon® processor E5-2697 v3 delivers a speedup of up to 1.31X, and the Intel Xeon processor E5-2697 v3 + Intel® Xeon Phi coprocessor 7120A delivers a speedup of up to 1.58X, over the baseline processor.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF DECEMBER, 2014<br />



Comparative Performance<br />

ASKAP*<br />

tHogBomClean<br />

Chart: Cleaning Rate Throughput per Second Speed Up (100 iterations). Bar labels: 1, 1.12X, .71X<br />

Chart legend: Intel® Xeon® processor E5-2697 v3; Intel® Xeon Phi coprocessor 7120A; Xeon E5-2697 v2 + NVIDIA K40*<br />

“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JUNE, 2015<br />

1 NODE<br />

APPROVED FOR PUBLIC PRESENTATION<br />

NEW<br />

Application: Australian Square Kilometer Array Pathfinder*<br />

(ASKAP) tHogBomClean.<br />

Description: The tHogBomClean benchmark implements the<br />

kernel of the HogBom Clean deconvolution algorithm. This<br />

benchmark is minimal: it omits the final step, convolution of the model with the clean beam, but that step involves operations similar to the other steps as far as the CPU is concerned. More here.<br />
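As a rough illustration of the kind of kernel this benchmark exercises, the Python sketch below follows the textbook Högbom CLEAN loop: find the peak of the residual image, add a fraction of it to the model, and subtract the correspondingly scaled and shifted point spread function (PSF). It is not the ASKAP benchmark code. The function name `hogbom_clean`, the `gain` and `threshold` parameters, and the list-of-lists image layout are assumptions made for this example; the default of 100 iterations mirrors the chart label. Like the benchmark, the sketch omits the final convolution with the clean beam.

```python
def hogbom_clean(dirty, psf, gain=0.1, iterations=100, threshold=0.0):
    """Return (model, residual) after a fixed number of CLEAN minor cycles.

    dirty and psf are 2-D lists of floats. The PSF is assumed to be
    normalized so that its peak value is 1.0.
    """
    rows, cols = len(dirty), len(dirty[0])
    p_rows, p_cols = len(psf), len(psf[0])

    # Locate the PSF peak so it can be aligned with each residual peak.
    pr, pc = max(
        ((r, c) for r in range(p_rows) for c in range(p_cols)),
        key=lambda rc: abs(psf[rc[0]][rc[1]]),
    )

    residual = [row[:] for row in dirty]
    model = [[0.0] * cols for _ in range(rows)]

    for _ in range(iterations):
        # Step 1: find the strongest remaining pixel in the residual.
        r0, c0 = max(
            ((r, c) for r in range(rows) for c in range(cols)),
            key=lambda rc: abs(residual[rc[0]][rc[1]]),
        )
        peak = residual[r0][c0]
        if abs(peak) <= threshold:
            break

        # Step 2: record a fraction of the peak in the model.
        flux = gain * peak
        model[r0][c0] += flux

        # Step 3: subtract the PSF, scaled by flux and shifted onto the peak.
        for r in range(p_rows):
            for c in range(p_cols):
                rr, cc = r0 + r - pr, c0 + c - pc
                if 0 <= rr < rows and 0 <= cc < cols:
                    residual[rr][cc] -= flux * psf[r][c]

    return model, residual
```

The peak search and the PSF subtraction are both simple sweeps over 2-D arrays, which is why the workload is sensitive to threading and vector width.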

Availability:<br />

• Code: Available here.<br />

• Recipe: Not available. Check here for future availability.<br />

Usage Model: Native execution on Intel® Xeon® processor<br />

E5-2697 v3, Intel® Xeon Phi coprocessor 7120, and NVIDIA<br />

K40*, using the same codebase.<br />

Results:<br />

• The Intel Xeon processor E5-2697 v3 increased<br />

performance up to 1.39X compared to NVIDIA K40*.<br />

• The Intel Xeon Phi coprocessor 7120 increased<br />

performance up to 1.57X compared to NVIDIA K40.<br />

• The Intel Xeon Phi coprocessor 7120 increased<br />

performance up to 1.12X compared to the Intel Xeon<br />

processor E5-2697 v3.<br />
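The three results are ratios of the same normalized chart bars. The short sketch below recomputes them, assuming the bar labels map to the legend as 1 for the Intel Xeon processor E5-2697 v3, 1.12 for the Intel Xeon Phi coprocessor, and .71 for the NVIDIA K40 system; that mapping and the dictionary keys are assumptions made for this example. Because the chart labels are rounded, the recomputed ratios agree with the quoted 1.39X and 1.57X only approximately.

```python
# Normalized bar values read from the chart (assumed mapping to the legend).
chart = {
    "xeon_e5_2697_v3": 1.00,  # bar labelled 1
    "xeon_phi_7120a": 1.12,
    "nvidia_k40": 0.71,
}


def relative_speedup(chart, device, reference):
    """Speedup of `device` over `reference`, both taken from the same chart."""
    return chart[device] / chart[reference]


for device in ("xeon_e5_2697_v3", "xeon_phi_7120a"):
    ratio = relative_speedup(chart, device, "nvidia_k40")
    print(f"{device} vs nvidia_k40: {ratio:.2f}X")
```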



Comparative Performance<br />

ASKAP*<br />

tHogBomClean<br />

1 NODE<br />

APPROVED FOR PUBLIC PRESENTATION<br />

Chart: Cleaning Rate Throughput per Second Speed Up. Bar labels: 1, 1.03X, .68X, 1.74X, 1.78X, 1.07X, 1.24X<br />

Chart legend: E5-2697 v2 Baseline; E5-2697 v2 Optimized; E5-2697 v2 + 7120A; E5-2697 v2 + 7120A Turbo Off Optimized; E5-2697 v2 + 7120A Turbo On Optimized; E5-2697 v2 + NVIDIA K40* Boost Off; E5-2697 v2 + NVIDIA K40* Boost On<br />

“E5-2697 v2” = Intel® Xeon® processor E5-2697 v2<br />

“7120A” = Intel® Xeon Phi coprocessor 7120A<br />

Application: Australian Square Kilometer Array Pathfinder*<br />

(ASKAP) tHogBomClean.<br />

Description: The tHogBomClean benchmark implements<br />

the kernel of the HogBom Clean deconvolution algorithm.<br />

This benchmark is minimal: it omits the final step, convolution of the model with the clean beam, but that step involves operations similar to the other steps as far as the CPU is concerned. More here.<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Not available. Check here for future availability.<br />

Usage Model: Offload using OpenMP*. The host (Intel® Xeon® processor E5-2697 v2) only initializes the data and transfers it to the Intel® Xeon Phi coprocessor 7120A; all computing is performed by the Intel Xeon Phi coprocessor 7120A.<br />

Results: With optimized code and turbo on, the Intel Xeon Phi coprocessor 7120A improved throughput by up to 1.78X compared to the baseline Intel® Xeon® processor E5-2697 v2.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF MARCH, 2014<br />



Simulations – Fast, detailed, accurate.<br />

Geophysics


Comparative Increase<br />

specfem3D<br />

300K Mesh, 1000 Time Steps<br />

Chart: specfem3D 300K Mesh Speed Up, 1 Node and 2 Nodes. Bar labels: 1, 1.15X, 1, 1.16X. An additional chart label reads 58.7 GCells.<br />

Chart legend: 2S Intel® Xeon® processor E5-2697 v3; 2S Xeon E5-2697 v3 + 1 Intel® Xeon Phi coprocessor 7120P<br />

“Xeon E5-2697 v3” = Intel® Xeon® processor E5-2697 v3<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF JUNE, 2015<br />

CLUSTER BENCHMARK<br />

2 NODES<br />

APPROVED FOR PUBLIC PRESENTATION<br />

NEW<br />

Application: specfem3D, 300k Mesh, 1000 time steps simulation.<br />

Description: SPECFEM3D Cartesian simulates acoustic (fluid),<br />

elastic (solid), coupled acoustic/elastic, poroelastic or seismic wave<br />

propagation in any type of conforming mesh of hexahedra<br />

(structured or not). It can, for instance, model seismic waves<br />

propagating in sedimentary basins or any other regional geological<br />

model following earthquakes. It can also be used for nondestructive<br />

testing or for ocean acoustics. More here.<br />

Availability:<br />

• Code: Available here. Specific version is r20645e5-2697 v3<br />

• Recipe: Available here. Optimization guide available here.<br />

Usage Model: Symmetric run on a 2S Intel® Xeon® processor and an Intel® Xeon Phi coprocessor 7120P, using MPI + OpenMP* programming with a large rank count.<br />

Highlights:<br />

• The code was hybridized and the major hotspots were multi-threaded.<br />

• Performance gain with the Intel compiler flag “-xCORE-AVX2”.<br />

Results:<br />

• For 1 and 2 nodes, 1 Intel Xeon Phi coprocessor 7120P delivered a gain of up to 1.16X compared to a 2S Intel Xeon processor E5-2697 v3.<br />

• For larger node counts, MPI dominates and there is no benefit from using the Intel Xeon Phi coprocessor.<br />



CLUSTER BENCHMARK<br />

3,800+ NODES<br />

Data Center Server Demo:<br />

Texas Advanced Computing Center*<br />

APPROVED FOR PUBLIC PRESENTATION<br />

NEW<br />

Intel Better Together<br />

Hardware<br />

Intel® Xeon Phi coprocessor 7120A<br />

Intel® Xeon® E5-2600 v3 processor<br />

Intel® Solid-State Drive Data Center for PCIe*<br />

Software<br />

CentOS* 6.5, Intel® MPSS 3.4.1, VNC*, Intel®<br />

Parallel Studio XE 2015, Intel® SPC, Intel®<br />

True Scale Fabric, Intel® MPI Library<br />

The Texas Advanced Computing Center* (TACC) combines CT scan and simulation data to analyze aquifer permeability in order to protect freshwater supplies in Florida. The demo uses Intel® Xeon Phi coprocessors, Intel® Xeon® processors, and several Intel® Software products, and features visualization techniques made possible by ray tracing kernels. See the demo here.<br />



CLUSTER BENCHMARK<br />

6,400 NODES<br />

APPROVED FOR PUBLIC PRESENTATION<br />

Texas Advanced Computing Center* (TACC)<br />

1 http://www.tacc.utexas.edu/resources/hpc/stampede<br />

SOURCE: INTEL MEASURED RESULTS AS OF MAY, 2014<br />

System: TACC Stampede*, a 10 petaflop supercomputer,<br />

one of the largest computing systems in the world for<br />

open science research, became operational on January 7,<br />

2013. Stampede will begin a two-phase transition to the<br />

Intel 15 compiler and the compatible software stack on<br />

March 31, 2015. Please visit: Stampede Two-Phase<br />

Transition to Intel 15<br />

Status: In service.<br />

Workloads: Runs hundreds of applications for thousands<br />

of users around the world.<br />

Performance:<br />

• More than 7 petaflops using Intel® Xeon Phi<br />

coprocessors 1<br />

• More than 2 petaflops using the Intel® Xeon® processor<br />

E5-2600 family 1<br />

More Information:<br />

• Stampede at the Texas Advanced Computing Center<br />

(Produced by Dell*/Intel):<br />

https://www.youtube.com/watch?v=f174GjUY1K4<br />

• TACC HPC systems overview:<br />

www.tacc.utexas.edu/resources/hpc<br />



Unlock, discover, innovate.<br />

Physics


Comparative Performance<br />

Gyrokinetic Toroidal Code*<br />

Princeton (GTC-P)<br />

CLUSTER BENCHMARK<br />

4 NODES<br />

APPROVED FOR PUBLIC PRESENTATION<br />

Chart: GTC-P 4 Node Cluster Speed Up. Bar labels: 1, 1.18X<br />

Chart legend: 4x 2S Intel® Xeon® processor E5-2697 v2<br />

Application: The gyrokinetic toroidal code (GTC) is a<br />

particle-in-cell code for turbulence simulation in support<br />

of the burning plasma experiment ITER, the crucial next<br />

step in the quest for fusion energy*. GTC-P is an optimized version of the<br />

GTC code developed at the Princeton Plasma Physics Lab.<br />

More here.<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />

Usage Model: Hybrid MPI + OpenMP* using symmetric<br />

processing.<br />

Results: GTC-P shows a performance increase of up to 1.18X on a 4-node Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor 7120P cluster.<br />

Chart legend: 4x 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor (C0-7120P)<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF DECEMBER, 2013<br />



Increasing accuracy and timeliness of forecasts.<br />

Climate And Weather<br />



Comparative Increase<br />

ROMS*<br />

Idealized Southern Ocean Benchmark<br />

1 NODE<br />

APPROVED FOR PUBLIC PRESENTATION<br />

NEW<br />

Chart: ROMS Idealized Southern Ocean Benchmark Speed Up. Bar labels: 1, 1.31X<br />

Chart legend: 2S Intel® Xeon® processor E5-2697 v2; 2S Xeon E5-2697 v2 + Intel® Xeon Phi coprocessor 7120P<br />

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120P<br />

Application: ROMS idealized southern ocean benchmark; ROMS 3.6+<br />

(svn version 709)<br />

Description: ROMS is a free-surface, terrain-following, primitive<br />

equations ocean model that employs a split-explicit time-stepping<br />

scheme with special treatment and coupling between barotropic (fast)<br />

and baroclinic (slow) modes. The benchmark input file is<br />

“ocean_benchmark3.in” (distributed with publicly available ROMS<br />

source code), modified to increase horizontal resolution in X and Y. See<br />

more at https://www.myroms.org/<br />
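The split-explicit idea can be sketched with a toy two-variable system: the fast (barotropic) variable takes many short sub-steps inside each long step of the slow (baroclinic) variable, and the slow step sees only the time average of the fast one. This is a minimal sketch, not ROMS code. The function name `split_explicit_step`, the linear decay equations, the rate parameters, and the plain arithmetic average are all assumptions chosen for illustration.

```python
def split_explicit_step(slow, fast, dt, n_fast, slow_rate, fast_rate):
    """Advance one long (slow-mode) step using n_fast short sub-steps.

    Toy model, forward Euler throughout:
      d(fast)/dt = -fast_rate * fast
      d(slow)/dt = -slow_rate * slow + <fast>
    where <fast> is the average of the fast variable over the sub-steps.
    """
    dt_fast = dt / n_fast
    fast_sum = 0.0
    for _ in range(n_fast):
        # Fast mode: short sub-step so the explicit scheme stays stable.
        fast += dt_fast * (-fast_rate * fast)
        fast_sum += fast
    fast_mean = fast_sum / n_fast

    # Slow mode: one long step, coupled to the time-averaged fast mode.
    slow += dt * (-slow_rate * slow + fast_mean)
    return slow, fast
```

With `fast_rate=10.0` and `dt=1.0`, a single sub-step makes the fast variable grow in magnitude, while 20 sub-steps let it decay. That is the motivation for sub-cycling only the fast mode instead of shrinking the time step for the whole model.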

Availability:<br />

• Code: Available here. Patch available here (commit 37b3c86 applied).<br />

• Recipe: Not available. Check here for future availability.<br />

Usage Model: Baseline is the Intel® Xeon® processor host only; the speedup is shown with symmetric processing on the Intel Xeon processor and the Intel® Xeon Phi coprocessor. Double-precision code is used across the platforms.<br />

Highlights: The code has been optimized and will be available to the<br />

ROMS community as an update patch.<br />

Results: Symmetric processing demonstrated a 1.31X performance improvement with the Intel Xeon processor E5-2697 v2 (host) and the Intel Xeon Phi coprocessor 7120A over the baseline host.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF DECEMBER, 2014<br />



Comparative Performance<br />

Weather & Research Forecast (WRF*)<br />

WRFV3.6 CONUS2.5KM<br />

CLUSTER BENCHMARK<br />

4 NODES<br />

APPROVED FOR PUBLIC PRESENTATION<br />

Chart: WRF V3.6 CONUS2.5KM Speed Up, 4 Node Clusters. Bar labels: 1, 1.27X, 1.3X, 1.94X<br />

Chart legend: 2S Intel® Xeon® processor E5-2697 v2; 2S Xeon E5-2697 v2 + Xeon Phi 7120P; 2S Intel® Xeon® processor E5-2697 v3; 2S Xeon E5-2697 v3 + Xeon Phi 7120P<br />

“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2<br />

“Xeon E5-2697 v3” = Intel® Xeon® processor E5-2697 v3<br />

“Xeon Phi 7120P” = Intel® Xeon Phi coprocessor 7120P<br />

Application: Weather & Research Forecast Model (WRF*) V3.6<br />

Description: WRF Model is a numerical weather prediction<br />

system designed to serve atmospheric research and<br />

operational forecasting needs. More here.<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />

Usage Model: Baseline is the Intel® Xeon® processor E5-2697 v2 host only. Speedup is shown for runs on both the host and the Intel® Xeon Phi coprocessor 7120P, and on the Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi coprocessor 7120P. Performance shown is for released code.<br />

Highlights: The code has been optimized since 2011, delivered to the community, and is available for download from the community site.<br />

Results: Symmetric processing demonstrated up to 1.94X<br />

improved performance over the baseline Intel® Xeon®<br />

processor E5-2697 v2.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014<br />



Comparative Performance<br />

NASA*<br />

OVERFLOW 2.2g<br />

1 NODE<br />

APPROVED FOR PUBLIC PRESENTATION<br />

Chart: NASA OVERFLOW* 2.2g Speed Up. Bar labels: 1, 1.2X<br />

Chart legend: 2S Intel® Xeon® E5-2697 v2; 2S Intel® Xeon® E5-2697 v2 + Intel® Xeon Phi Coprocessor 7120P<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF DECEMBER, 2013<br />

Application: NASA OVERFLOW* 2.2g; fluid flow using<br />

overset grids.<br />

Description: OVERFLOW 2.2g is a three-dimensional<br />

time-marching implicit Navier-Stokes code that can also<br />

operate in two-dimensional or axisymmetric mode. The<br />

code uses structured overset grid systems. More at<br />

http://people.nas.nasa.gov/~pulliam/Overflow/Overflow_Manuals.html<br />

Availability:<br />

• Code: Available here. Xeon Phi optimization (but not build setup) is included in the standard source code available from NASA.<br />

• Recipe: Not available.<br />

Usage Model: MPI/OpenMP* symmetric processing<br />

distributed across an Intel® Xeon® processor and an Intel®<br />

Xeon Phi coprocessor.<br />

Results: Symmetric processing demonstrated up to 1.2X<br />

improved performance over the baseline Intel® Xeon®<br />

processor E5-2697 v2.<br />



Comparative Performance<br />

NOAA NIM*<br />

Dynamic G5<br />

1 NODE<br />

APPROVED FOR PUBLIC PRESENTATION<br />

Chart: NOAA NIM Dynamic Core* G5 Speed Up. Bar labels: 1, 1.04X, .98X, 1.49X<br />

Chart legend: 2S Intel® Xeon® processor E5-2697 v2; Intel® Xeon Phi coprocessor 7120A; Titan: AMD Opteron* 6274 (16 cores, 2.2 GHz) processor + NVIDIA Tesla* K20X GPU; 2S Xeon E5-2697 v2 + Xeon Phi 7120A (symmetric)<br />

“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2<br />

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120A<br />

Application: Non-Hydrostatic Icosahedral Model<br />

(NOAA NIM Dynamic Core*). NIM is available from the<br />

NOAA Earth Systems Research Lab.<br />

Description: A next-generation model, being evaluated across CPU, GPU and MIC, that focuses on next-generation high-resolution weather (especially severe storm) forecasting. The benchmark is the “Dynamic” core of the NOAA NIM weather simulator. More at https://earthsystemcog.org/projects/dcmip-2012/nim.<br />

Availability:<br />

• Code: Contact NOAA/ESRL<br />

• Recipe: Available here.<br />

Usage Model: Release 1598 with the G5 workload. G5<br />

resolution; K96 configuration. Symmetric processing<br />

with Intel® Xeon® processor E5-2697 v2 (host) and<br />

Intel® Xeon Phi coprocessor 7120A.<br />

Results: Symmetric processing demonstrated up to<br />

1.49X improved performance over the baseline Intel®<br />

Xeon® processor E5-2697 v2.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF DECEMBER, 2013<br />



Improving speed and quality through digital design<br />

Digital Content Creation<br />



Comparative Performance<br />

Embree 2.2 Pathtracer Renderer<br />

Ray Tracing<br />

Chart: Frames Per Second, 1024x1024 Image Resolution. Bar labels by scene:<br />

• Bentley (2.3M Tris): 3.69X, 2.15X, 1X<br />

• Crown (4.8M Tris): 3.63X, 2.2X, 1X<br />

• Dragon (7.4M Tris): 3.48X, 2.13X, 1X<br />

• Karst Fluid Flow (8.4M Tris): 3.4X, 2X, 1X<br />

• Power Plant (12.8M Tris): 2.73X, 1.54X, 1X<br />

Chart legend: Intel® Xeon® Processor E5-2699 v3; Intel® Xeon Phi Coprocessor 7120A; Nvidia GeForce GTX Titan X*<br />

1 NODE<br />

APPROVED FOR PUBLIC PRESENTATION<br />

NEW<br />

Application: Embree 2.2 Pathtracer Renderer using Embree and<br />

Optix* RT Libraries<br />

Description: Path tracing application for benchmarking ray tracing (RT) libraries.<br />

Availability:<br />

• Code: Available here<br />

• Recipe: embree.github.io; http://embree.github.com<br />

Usage Model:<br />

• Native on Intel® Xeon® processor, COI-based offload on Intel® Xeon Phi coprocessor. Uses Intel® TBB, ISPC and intrinsics.<br />

Highlights:<br />

• Typical ray tracing rendering pipeline code used throughout DCC<br />

to show comparative performance on different types of hardware<br />

with a variety of input 3D data models. Optimized for all Intel ISAs,<br />

delivered to the community, and available for download.<br />

• End-User benefits: Ability to achieve competitive performance and<br />

the flexibility of IA for rendering and render farm applications<br />

Results:<br />

• Embree on the Intel® Xeon Phi 7120A coprocessor performs up to 2.15X faster than on the Nvidia GeForce GTX Titan X*<br />

– Path tracing causes low SIMD utilization (better for Xeon)<br />

– Xeon® E5-2699 v3 is the latest silicon, while the Xeon Phi 7120A is now 2.5 years old<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF AUGUST, 2015<br />



Comparative Performance<br />

Embree 2.2<br />

Morton Build<br />

1 NODE<br />

APPROVED FOR PUBLIC PRESENTATION<br />

NEW<br />

Chart: Morton Build, Million Triangles/Second. Scenes: Crown, Bentley Car, Dragon. Bar labels: 1X, 1.42X, 1.3X, 1.54X<br />

Chart legend: Intel® Xeon® Processor E5-2699 v3; Intel® Xeon Phi coprocessor 7120A<br />

Application: Embree 2.2 Morton build<br />

Description: High-performance ray tracing kernel library used to calculate photorealistic images. This proof point measures Morton build performance for dynamic scenes. With Embree, a build for typical model sizes of a couple of million triangles is possible in a fraction of a second.<br />
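The name suggests a builder that orders primitives along a Morton (Z-order) curve before grouping them into a hierarchy. The sketch below shows only that ordering step, assuming 10 bits per axis and triangle centroids given as (x, y, z) tuples. The helpers `part1by2`, `morton3` and `morton_order` are hypothetical names for this example, and this is not Embree's implementation.

```python
def part1by2(v, bits=10):
    """Spread the low `bits` bits of v so that bit i moves to bit 3*i."""
    out = 0
    for i in range(bits):
        out |= ((v >> i) & 1) << (3 * i)
    return out


def morton3(x, y, z, bits=10):
    """Interleave three integer grid coordinates into one Morton code."""
    return (
        part1by2(x, bits)
        | (part1by2(y, bits) << 1)
        | (part1by2(z, bits) << 2)
    )


def morton_order(centroids, bits=10):
    """Return primitive indices sorted by the Morton code of their centroid."""
    lo = [min(c[a] for c in centroids) for a in range(3)]
    hi = [max(c[a] for c in centroids) for a in range(3)]
    scale = (1 << bits) - 1

    def code(c):
        q = []
        for a in range(3):
            extent = hi[a] - lo[a]
            # Quantize each coordinate onto the integer grid [0, scale].
            t = (c[a] - lo[a]) / extent if extent > 0 else 0.0
            q.append(int(t * scale))
        return morton3(q[0], q[1], q[2], bits)

    return sorted(range(len(centroids)), key=lambda i: code(centroids[i]))
```

After sorting, primitives that are close in space tend to be close in the list, so a hierarchy can be formed by splitting the sorted list. The approach trades tree quality for build speed, which fits the "low-quality Morton build" wording used for dynamic scenes.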

Availability:<br />

• Code: embree.github.io; http://embree.github.com<br />

• Recipe: embree.github.io.<br />

Usage Model:<br />

• Native on Intel® Xeon® processor, COI-based offload on Intel® Xeon Phi coprocessor. Uses Intel® TBB, ISPC and intrinsics.<br />

Highlights: The low-quality Morton build is even faster and enables handling of dynamic models with 1M triangles. It targets application developers in professional rendering environments (e.g. Autodesk*, DreamWorks*, Pixar*).<br />

Results:<br />

• Embree on the Intel® Xeon Phi 7120A coprocessor is up to 1.54X faster than on the 2S Intel® Xeon® E5-2699 v3 processor.<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF AUGUST, 2015<br />



Comparative Performance<br />

Embree* 2.0<br />

Ray Tracing Kernels<br />

1 NODE<br />

APPROVED FOR PUBLIC PRESENTATION<br />

Chart: Embree* 2.0 Ray Tracing Speed Up. Bar labels: 1, 1.49X, 1.89X<br />

Application: Embree 2.0<br />

Description: Embree 2.0 is a simple ray tracing framework<br />

with optimized kernels for Intel® Xeon® processors and<br />

Intel® Xeon Phi coprocessors. It is useful for digital<br />

content creation (DCC), mechanical computer aided design<br />

(CAD), industrial design, and scientific visualization. More<br />

at http://embree.github.io/.<br />

Availability:<br />

• Code: Available here.<br />

• Recipe: Available here.<br />

Usage Model: 4M (million) triangles scene, ambient<br />

occlusion.<br />

Results: Up to 1.89X improved performance with the Intel® Xeon Phi coprocessor compared to the baseline Intel® Xeon® processor E5-2670.<br />

Chart legend: 2S Intel® Xeon® processor E5-2670; 2S Intel® Xeon® processor E5-2697 v2; Intel® Xeon Phi coprocessor 7120A<br />

For configuration details, go here.<br />

SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2013<br />



Intel® Software Development Tools & 3rd Party Support for Developers and System Administrators<br />


What’s Inside: Intel® Parallel Studio XE 2016<br />

Composer Edition<br />
What it does: Build fast code using industry-leading compilers and libraries, including the new data analytics library<br />
What’s included:<br />
• C++ and/or Fortran compilers<br />
• Performance libraries<br />
• Parallel models<br />
• Data analytics acceleration library<br />

Professional Edition<br />
What it does: Adds analysis tools<br />
What’s included: Composer edition +<br />
• Performance profiling<br />
• Threading design/prototyping & vectorization advisor<br />
• Memory & thread debugger<br />

Cluster Edition<br />
What it does: Adds MPI cluster tools<br />
What’s included: Professional edition +<br />
• MPI cluster communications library<br />
• MPI error checking and tuning<br />



Features and Configurations<br />

Intel® Parallel Studio XE 2016<br />

Composer Edition<br />
• Intel® C++ Compiler<br />
• Intel® Fortran Compiler<br />
• Intel® Data Analytics Acceleration Library<br />
• Intel® Threading Building Blocks<br />
• Intel® Integrated Performance Primitives<br />
• Intel® Math Kernel Library<br />
• Intel® Cilk Plus & Intel® OpenMP*<br />
• Bundle or Add-on: Rogue Wave IMSL* Library<br />

Professional Edition<br />
• Intel® C++ Compiler<br />
• Intel® Fortran Compiler<br />
• Intel® Data Analytics Acceleration Library<br />
• Intel® Threading Building Blocks<br />
• Intel® Integrated Performance Primitives<br />
• Intel® Math Kernel Library<br />
• Intel® Cilk Plus & Intel® OpenMP*<br />
• Intel® Advisor XE<br />
• Intel® Inspector XE<br />
• Intel® VTune Amplifier XE<br />
• Add-on: Rogue Wave IMSL* Library<br />

Cluster Edition<br />
• Intel® C++ Compiler<br />
• Intel® Fortran Compiler<br />
• Intel® Data Analytics Acceleration Library<br />
• Intel® Threading Building Blocks<br />
• Intel® Integrated Performance Primitives<br />
• Intel® Math Kernel Library<br />
• Intel® Cilk Plus & Intel® OpenMP*<br />
• Intel® Advisor XE<br />
• Intel® Inspector XE<br />
• Intel® VTune Amplifier XE<br />
• Intel® MPI Library<br />
• Intel® Trace Analyzer and Collector<br />
• Intel® Cluster Checker preview (Linux only)<br />
• Add-on: Rogue Wave IMSL* Library<br />

Additional configurations, including floating and academic, are available at: http://intel.ly/perf-tools<br />



Performance without Compromise<br />

Intel® C++ and Fortran Compilers on Windows*, Linux* & OS X*<br />

Boost C++ application performance on Windows* & Linux* using Intel® C++ Compiler (higher is better)<br />

[Chart: relative geomean performance, SPEC* benchmark, higher is better. Panels: Floating Point (Estimated SPECfp®_rate_base2006) and Integer (Estimated SPECint®_rate_base2006), each for Windows and Linux. Compilers compared: Visual C++* 2015 vs. Intel C++ 16.0 on Windows; GCC 5.2.0 vs. Intel C++ 16.0 on Linux. Bar labels in extraction order: 1.51, 1.00, 1.30, 1.00, 1.00, 1.00, 1.24, 1.51.]<br />

Boost Fortran application performance on Windows* & Linux* using Intel® Fortran Compiler (higher is better)<br />

[Chart: relative geomean performance, Polyhedron* benchmark, higher is better. Compilers compared on Windows: PGI Fortran* 15.3, Absoft* 15.0.1, Intel Fortran 16.0; on Linux: PGI Fortran* 15.3, gFortran* 5.1.0, Open64* 4.5.2, Absoft* 15.0.1, Intel Fortran 16.0. Bar labels in extraction order: 1.00, 1.33, 1.88, 1.00, 1.07, 1.09, 1.32, 1.64.]<br />

Configuration: Windows hardware: HP DL320e Gen8 v2 (single-socket server) with Intel(R) Xeon(R) CPU E3-1280 v3 @ 3.60GHz, 32 GB RAM, HyperThreading is off;<br />

Linux hardware: HP BL460c Gen9 with Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 256 GB RAM, HyperThreading is on. Software: Intel C++ compiler 16.0, Microsoft (R)<br />

C/C++ Optimizing Compiler Version 19.00.23026 for x86/x64, GCC 5.2.0. Linux OS: Red Hat Enterprise Linux Server release 7.1 (Maipo), kernel 3.10.0-229.el7.x86_64.<br />

Windows OS: Windows 8.1. SPEC* Benchmark (www.spec.org).<br />
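The charts on this slide report relative geomean performance. As a minimal sketch of how such a figure can be computed, assuming each benchmark yields a ratio against a baseline compiler where higher is better, the Python below returns the geometric mean of those ratios. The function name and any inputs are hypothetical illustrations, not Intel's measurement scripts or data.

```python
import math


def geomean(ratios):
    """Geometric mean of per-benchmark performance ratios.

    Each ratio is (score with the compiler under test) / (score with the
    baseline compiler), so a result above 1.0 means faster on average.
    """
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive numbers")
    # Averaging logarithms keeps one outlier benchmark from dominating,
    # which is why suite-level summaries usually prefer it to the
    # arithmetic mean.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

A compiler that is twice as fast on one benchmark and half as fast on another gets a geomean of 1.0, whereas an arithmetic mean would report 1.25.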

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and<br />

MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results<br />

to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that<br />

product when combined with other products. * Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation<br />

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not<br />

unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not<br />

guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent<br />

optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture<br />

are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific<br />

instruction sets covered by this notice. Notice revision #20110804 .<br />

Configuration: Hardware: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz, HyperThreading is off, 16 GB RAM. Software: Intel Fortran compiler 16.0, Absoft* 15.0.1, PGI Fortran* 15.3, Open64* 4.5.2, gFortran* 5.1.0. Linux OS: Red Hat Enterprise Linux Server release 7.0 (Maipo), kernel 3.10.0-123.el7.x86_64. Windows OS: Windows 7, Service pack 1. Polyhedron Fortran Benchmark (www.fortran.uk).<br />

Windows compiler switches: Absoft: -m64 -O5 -speed_math=10 -fast_math -march=core -xINTEGER -stack:0x80000000. Intel® Fortran compiler: /fast /Qparallel /link /stack:64000000. PGI Fortran: -fastsse -Munroll=n:4 -Mipa=fast,inline -Mconcur=numa.<br />

Linux compiler switches: Absoft: -m64 -mavx -O5 -speed_math=10 -march=core -xINTEGER. Gfortran: -Ofast -mfpmath=sse -flto -march=native -funroll-loops -ftree-parallelize-loops=4. Intel Fortran compiler: -fast -parallel. PGI Fortran: -fast -Mipa=fast,inline -Msmartalloc -Mfprelaxed -Mstack_arrays -Mconcur=bind. Open64: -march=bdver1 -mavx -mno-fma4 -Ofast -mso -apo.<br />

* Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation<br />


Turn Big Data Into Information Faster with Intel® Data Analytics Acceleration Library<br />

Advanced analytics algorithms supporting all data analysis stages.<br />

Business, Scientific, Engineering, Web/Social<br />

• Pre-processing: Decompression, Filtering, Normalization<br />
• Transformation: Aggregation, Dimension Reduction<br />
• Analysis: Summary Statistics, Clustering<br />
• Modeling: Machine Learning, Parameter Estimation, Simulation<br />
• Validation: Hypothesis testing, Model errors<br />
• Decision Making: Forecasting, Decision Trees, etc.<br />

Designed and Built by Intel to Delight Data Scientists<br />

Simple to incorporate object-oriented APIs for C++ and Java<br />

Easy connections to:<br />
• Popular analytics platforms (Hadoop, Spark)<br />
• Data sources (SQL, non-SQL, files, in-memory)<br />

[Chart: PCA Performance Boost Using Intel® DAAL vs. Spark* MLLib. Y-axis: Speedup; x-axis: Table Size (1M x 200, 1M x 400, 1M x 600, 1M x 800, 1M x 1000). Bar labels in extraction order: 4X, 6X, 6X, 7X, 7X.]<br />

Configuration Info - Versions: Intel® Data Analytics Acceleration Library 2016, CDH v5.3.1, Apache Spark* v1.2.0; Hardware: Intel® Xeon® Processor E5-2699 v3, 2 Eighteen-core CPUs (45MB LLC, 2.3GHz), 128GB of RAM per node; Operating System: CentOS 6.6 x86_64. PCA normalized input.<br />

* Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation<br />



Memory Access Analysis<br />

New! Intel® VTune Amplifier 2016<br />


Tune data structures for better performance<br />

• Attribute cache misses to data structures<br />

Bandwidth Analysis for Non-Uniform Memory<br />

• See Read & Write contributions to<br />

Total Bandwidth<br />

• Easier tuning of multi-socket bandwidth<br />

Seeing total bandwidth can suggest data blocking opportunities<br />

to change a bandwidth bound app into a compute bound app.<br />
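Data blocking (also called tiling) restructures a loop nest so that a small tile of data is reused while it is still in cache, instead of streaming whole arrays from memory on every pass. The sketch below illustrates only the loop structure on a matrix multiply. It is written in Python for readability, so it will not itself show a speedup; the benefit appears when the same structure is used in compiled C, C++, or Fortran. The function names and the `block` parameter are hypothetical and not part of Intel® VTune Amplifier.

```python
def matmul_naive(a, b):
    """Reference triple loop: walks a full column of b for every output element."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_blocked(a, b, block=2):
    """Same product computed tile by tile, so each tile is reused while hot."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # The three outer loops pick a (block x block) tile of each matrix.
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                # The three inner loops do all the work for that tile;
                # min() handles edges when sizes are not multiples of block.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += aik * b[k][j]
    return c
```

In a compiled implementation the tile size would typically be chosen so that the working tiles fit in a cache level, which is the kind of decision a bandwidth profile can inform.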



Scalable Profiling for MPI and Hybrid Clusters with<br />

MPI Performance Snapshot<br />

Lightweight – Low overhead<br />

profiling up to 32K Ranks<br />

Scalability – Performance<br />

variation at scale can be<br />

detected sooner<br />

Identifying Key Metrics –<br />

Shows PAPI counters and<br />

MPI/OpenMP* imbalances<br />
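To make the idea of an imbalance metric concrete, here is a minimal sketch assuming you already have the busy time of each MPI rank (or OpenMP thread) for one region. It uses one common definition, (max - mean) / max, which is an assumption for illustration and not necessarily the formula MPI Performance Snapshot reports. The function name is hypothetical.

```python
def load_imbalance(rank_times):
    """Fraction of the slowest rank's time that an average rank spends idle.

    rank_times: per-rank (or per-thread) busy times for the same region,
    in any consistent unit. Returns 0.0 for a perfectly balanced region
    and approaches 1.0 when a single rank does nearly all the work.
    """
    if not rank_times:
        raise ValueError("need at least one rank time")
    slowest = max(rank_times)
    if slowest <= 0:
        return 0.0  # nothing was measured, so nothing is imbalanced
    mean = sum(rank_times) / len(rank_times)
    return (slowest - mean) / slowest
```

Because every rank waits for the slowest one at the next synchronization point, this value approximates the share of run time that better work distribution could recover.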



Altair RADIOSS* and PBS Professional*<br />

Finite Element Analysis (FEA) / HPC Workload Management<br />

RADIOSS<br />

First evaluation of FEA code with Intel® Xeon Phi coprocessors<br />

• Speed time to results for computer aided engineering (CAE)<br />

simulations with an optimized solver<br />

• Perform linear and non-linear simulations of structures,<br />

fluids, fluid-structure interaction, sheet metal stamping, and<br />

mechanical systems<br />

• Extend value through tight integration with the Altair<br />

HyperWorks CAE software platform<br />

PBS Professional<br />

Schedule jobs for Intel Xeon Phi coprocessors<br />

• Improve resource utilization with a commercial grade HPC<br />

workload management solution<br />

• Improve value with advanced support for green provisioning,<br />

security, and policy-driven job scheduling<br />

• Get more information: Video, press release, white paper,<br />

configuration toolkit<br />

“Through our tight collaboration with Intel, we were able to test first-generation Intel Xeon Phi coprocessors at a very early stage and learned a great deal about how to optimize our solvers for Intel’s manycore architecture. We are eager to port our HPC solvers, such as RADIOSS, to the next-generation Intel Xeon Phi processor, which looks really promising for delivering cutting-edge, high performance to our customers.”<br />

Eric Lequiniou, HPC Director, Altair<br />

See the:<br />

• Solution Brief<br />

• Case Study<br />

* Other names and brands may be claimed as the property of others.<br />



Bright Cluster Manager*<br />

Advanced Cluster Management Made Easy<br />

Bright Cluster Manager: An easy to use, integrated software solution<br />

that transforms the way clusters of all kinds are deployed and<br />

managed.<br />

Bright plus Intel® Xeon Phi coprocessor: Get everything you<br />

need to enable Intel Xeon Phi coprocessors in a cluster.<br />

• Deploy a software environment that works “out of the box”<br />

• Bare metal provisioning, powerful user interface, comprehensive monitoring<br />

• Reduce the time, effort, and cost associated with building and operating an HPC<br />

cluster<br />

Simplified management: For clusters that include Intel Xeon Phi<br />

coprocessors<br />

• Install, manage and use clusters of any size, with no need for Linux* knowledge.<br />

• Configure, schedule, monitor, and manage Intel Xeon Phi coprocessors in your<br />

cluster – with ease.<br />

“Bright Cluster Manager and Intel Xeon Phi provide full cluster management. Deploying, using, and optimizing your cluster just got easier.”<br />

Martijn de Vries, CTO at Bright Computing<br />

* Other names and brands may be claimed as the property of others.<br />



Intel® Xeon Phi Coprocessor Developer Site<br />

• Architecture, setup, and programming resources<br />

• Self-guided training<br />

• Case studies<br />

• Codes and Recipes<br />

• Information on tools and ecosystem<br />

• Support through community forum<br />

View at: http://software.intel.com/mic-developer/<br />



Modern Code Developer Community<br />

A global online community for learning and sharing with experts<br />

software.intel.com/moderncode<br />

Developer Zone<br />

• Modern Code Zone<br />

• Software Tools, training webinars<br />

• How-to guides, parallel programming BKMs<br />

• Remote Access to hardware<br />

• Support Forums<br />

Topics<br />

• Vectorization/single instruction, multiple data (SIMD)<br />

• Multi-threading<br />

• Multi-node/clustering<br />

• Take advantage of on-package high-bandwidth memory.<br />

• Increase memory and power efficiency.<br />

Experts<br />

• Black Belts and Intel engineer experts<br />

• Technical content; training (webinars, face-to-face); forum support<br />

• Conferences and tradeshows: keynotes, presentations, BOFs, demos, tutorials<br />



Intel® Developer Zone<br />

Join us on Social Media<br />

@intelsoftware<br />

Intel Developer Zone<br />

Intel Software TV<br />

@IntelDeveloperZone<br />

software.intel.com<br />



Intel® Software Development Tools<br />

For The Intel® Xeon Phi Coprocessor<br />

Intel® Parallel Studio XE 2016<br />

• Information: http://software.intel.com/en-us/intel-parallel-studio-xe<br />

Intel® System Studio<br />

• Information: https://software.intel.com/en-us/intel-system-studio<br />

• Video: https://www.youtube.com/watch?v=jraWWlc21gc<br />

Intel® Integrated Native Developer Experience (Intel® INDE)<br />

• Information: https://software.intel.com/en-us/intel-inde<br />

• Video: https://www.youtube.com/watch?v=PAGauYOLNGI&index=2&list=PLg-UKERBljNwPClBBuJvgG1Wl_P0iQXsQ<br />

Intel® Cluster Ready<br />

• https://software.intel.com/cluster-ready<br />

James Reinders: Capitalize on Parallelism with Intel® Xeon Phi coprocessors<br />

• youtube.com/watch?v=g9ehO6duNuE&list=UUH5Rft7GYM8KZpxA-4Ohihg&index=9&feature=plcp<br />

* Other names and brands may be claimed as the property of others.<br />



Recommended Links<br />

Getting Started:<br />

• An Overview of Programming for Intel® Xeon® processors and Intel® Xeon Phi coprocessors<br />

• Intel® Xeon Phi Coprocessor Developer’s Quick Start Guide<br />

• http://software.intel.com/mic-developer:<br />

The Training tab has Beginner and Advanced workshop videos, and links to past/future webinars.<br />

The Tools and Downloads tab has a link to Intel and Third Party Tools and Libraries.<br />

− This page has links to available beta and production tools for developers.<br />

Vectorization:<br />

• http://software.intel.com/en-us/intel-vectorization-tools<br />

• Check out the 6 steps in the toolkit!<br />

Performance:<br />

• Life Sciences<br />

• Manufacturing<br />

• Financial Services<br />

• Energy<br />

• Physics<br />

• Digital Content Creation<br />

• Weather<br />



Hardware Configuration - Intel® Xeon® processor E5-2697 v2<br />

Configurations compared: Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A, and Intel® Xeon® processor E5-2697 v2 + NVIDIA K40*. The two systems are identical except for the coprocessor.<br />

• Platform: Intel® Server System R2208GZ4GC platform, 2U chassis, hot-swap drives, 24 DIMMs, 1 750W Redundant Power Supply<br />
• CPU/Stepping: Intel® Xeon® processor E5-2697 v2, 2.7 GHz, 12 core, 8GT/s dual QPI links, 130 W, 3.5GHz Max Turbo Frequency, 768kB instr L1 / 3072kB L2 / 30MB L3 cache<br />
• Coprocessor (Intel® Xeon Phi system): Intel® Xeon Phi coprocessor 7110 and 7120; 61 cores, 1.1 and 1.238 GHz, ECC enabled, TURBO disabled. Software Details: MPSS version - 2.1.6720-13/16/19, Flash version - 2.1.03.0386<br />
• Coprocessor (NVIDIA system): Nvidia K40c*, 2880 CUDA* Cores, 12GB memory, ECC enabled, Boost disabled. Software Details: CUDA Version 5.5<br />
• Memory: 64GB total, 8*8GB 1600MHz Reg ECC DDR3<br />
• Chipset: Rev 4.6, SE5C600.86B.99.99.x069.071520130923<br />
• HDD Specs: SEAGATE* ST9600205SS (scsi), 1x600 GB SAS HDD 10kRPM<br />
• OS: RHEL* 6.4<br />

MSC Nastran 2016 Alpha: Testing conducted by MSC Software on MSC Nastran* 2016 Alpha release, SOL 101. Baseline configuration: Intel® Xeon® processor E5-2679 v2 (2.6 GHz, 12<br />

cores, 30 MB cache), 128 GB (8 x 16) DDR3 memory @ 1.6 GHz, 4 TB HDD (SATA @ 6 GB/s 5900 RPM), Intel® Math Kernel Library (Intel® MKL) 11.2.3, Intel® Manycore Platform Software<br />

Stack (Intel® MPSS) 3.5. Test configuration: Identical to baseline configuration with the addition of one Intel® Xeon Phi coprocessor 7120A.<br />

* Other names and brands may be claimed as the property of others.<br />



Hardware Configuration – Intel® Xeon® processor E5-2697 v3<br />

Intel® Xeon® processor E5-2697 v2 system:<br />
• Platform: Intel® Server System R2208GZ4GC<br />
• CPU: Intel® Xeon® processor E5-2697 v2 2.7 GHz, Dual socket 12 core, 8GT/s dual QPI links, 130 W, 3.5GHz Max Turbo Frequency, 768kB instr L1 / 3072kB L2 / 30MB L3 cache<br />
• RAM: 64 GB total/node, 8*8GB 1600MHz Reg ECC DDR3<br />
• Hard drive: SEAGATE* ST9600205SS (scsi), 1x600 GB SAS HDD 10kRPM<br />
• Intel® Xeon Phi Coprocessor: ECC Mode: Enabled; Coprocessor Stepping: C0, Board SKU: C0PRQ-7120 P/A/X/D, Frequency: 1.24GHz and Coprocessor Stepping: B1, Turbo On/Off documented within measurements, SKU used documented within measurements<br />
• Cluster File System Abstract: Panasas (60 TB storage), IEEL Lustre DDN (49 TB storage), IEEL Lustre SSD (32 TB storage), Firmware version 5.5.0.b-1067797.15<br />
• Nvidia* Coprocessor: Nvidia K40c*, 2880 CUDA* Cores, 12GB memory, ECC enabled, Boost disabled, Software Details: CUDA Version 5.5<br />
• OS / Kernel / IB stack: Red Hat Enterprise Linux Server* release 6.5 / 2.6.32-358.6.2.el6.x86_64.crt1 / OFED 3.5-2-MIC-rc1<br />

Intel® Xeon® processor E5-2697 v3 system:<br />
• Platform: Intel® Xeon® processor E5-2697 v3, Grantley platform<br />
• CPU: Intel® Xeon® processor E5-2697 v3 @ 2.60GHz / 9.6 GT/S, 145W, 14 core, Intel Smart Cache 35 MB<br />
• RAM: 8x 8GB 2133 DDR4 DIMMs<br />
• Hard drive: SSDSA2B220, 200GB, Intel® SSD DC S3500 Series 800GB<br />
• Intel® Xeon Phi Coprocessor: ECC Mode: Enabled; Coprocessor Stepping: C0, Board SKU: C0PRQ-7120 P/A/X/D, Frequency: 1.24GHz and Coprocessor Stepping: B1, Turbo On/Off documented within measurements, SKU used documented within measurements<br />
• Cluster File System Abstract: Panasas (60 TB storage), IEEL Lustre DDN (49 TB storage), IEEL Lustre SSD (32 TB storage), Firmware version 5.5.0.b-1067797.15<br />
• Nvidia* Coprocessor: Nvidia K40c*, 2880 CUDA* Cores, 12GB memory, ECC enabled, Boost disabled, Software Details: CUDA Version 5.5<br />
• OS / Kernel / IB stack: Red Hat Enterprise Linux Server* release 6.5 / 2.6.32-358.6.2.el6.x86_64.crt1 / OFED 3.5-2-MIC-rc1, IFS-7.2.2.0.8<br />

* Other names and brands may be claimed as the property of others.<br />



LEGAL DISCLAIMERS<br />

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY<br />

THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED<br />

WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY<br />

PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.<br />

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer<br />

systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating<br />

your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance<br />

Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each<br />

of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.<br />
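The relative-performance calculation described above can be sketched in a few lines. This is a minimal illustration, assuming higher raw results are better (for time-based results the ratio would be inverted); the function name and the platform labels in the usage note are hypothetical, not values from this guide.

```python
def relative_performance(results, baseline):
    """Normalize raw benchmark results so the baseline platform scores 1.0.

    results:  dict mapping platform name -> raw benchmark result
              (higher is better).
    baseline: key of the platform that is assigned the value 1.0.
    """
    base = results[baseline]
    if base <= 0:
        raise ValueError("baseline result must be positive")
    # Divide the baseline result into every platform's result, as the
    # disclaimer describes; the baseline itself becomes exactly 1.0.
    return {name: value / base for name, value in results.items()}
```

For example, calling `relative_performance({"baseline": 200.0, "candidate": 378.0}, "baseline")` assigns 1.0 to the baseline and a value close to 1.89 to the candidate.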

Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where<br />

similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.<br />

Intel® Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology performance varies depending on hardware, software and overall system<br />

configuration. Check with your platform manufacturer on whether your system delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost<br />

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor series, not across different processor sequences. See<br />

http://www.intel.com/products/processor_number for details. Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications. All dates and<br />

products specified are for planning purposes only and are subject to change without notice<br />

Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product roadmaps. Product plans, dates,<br />

and specifications are preliminary and subject to change without notice<br />

Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at<br />

less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration<br />

and you can learn more at http://www.intel.com/go/turbo.<br />

For more information go to http://www.intel.com/performance<br />

Any difference in system hardware or software design or configuration may affect actual performance<br />

Copyright © 2015 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon logo, Xeon Phi, and Xeon Phi logo are trademarks of Intel Corporation<br />

in the U.S. and/or other countries. All dates and products specified are for planning purposes only and are subject to change without notice.<br />

*Other names and brands may be claimed as the property of others.<br />



OPTIMIZATION NOTICE<br />

Optimization Notice<br />

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for<br />

optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and<br />

SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or<br />

effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent<br />

optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not<br />

specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable<br />

product User and Reference Guides for more information regarding the specific instruction sets covered by<br />

this notice.<br />

Notice revision #20110804<br />
