2H 2015

2H 2015

Key Messages 

• Developers show standards-based programming models benefits using Intel® Xeon® processors and Intel® Xeon Phi 

coprocessors: 

Intel® Xeon Phi is the ideal platform for many core and wide vector code development (threads, vector lanes, memory). 

Once you’ve optimized for the most cores, threads and wide SIMD vectors, your code scales everywhere – including on future Intel 

Xeon Phi generations and on future (highly-parallel) Intel® Xeon® processors. 

Offload, symmetric, and native execution models are demonstrated in the proof points. 

• Continued Momentum Across Verticals 

78 performance proof points (41 Haswell, 37 Ivy Bridge) in this updated guide, featuring 38 new proof points, 12 new codes and 5 new 

recipes. 26 are cluster proof points. 

Several new Life Science (BLAST*, LAMMPS* and Johns Hopkins Bowtie2*) and Manufacturing proof points (ANSYS*, MSC Software*, 

HPCG benchmark, Sandia Mantevo*, and Fujitsu*) added. For more information about Life Sciences, click here. 

Includes new proof points for Financial Services, Weather, Physics, and Digital Content Creation. 

• Get started today with Intel and 3rd party tools/libraries: 

Software tailored to support the Intel® Many Integrated Core Architecture. 

Download the Application Catalog to see the list of available or in-flight codes. 

• Leverage Intel® Parallel Studio XE 2016 and Intel® Cluster Ready to create faster code. 

• Code Modernization – Drive faster breakthroughs through faster code: Get more results on your hardware today and carry 

your code forward to the future. More Code Modernization resources: 

Intel® Code Modernization Enablement Program 

What is Code Modernization? 

2

Intel® Modern Code Developer Challenge 2015 

Code Modernization contest for HPC developers 

• Learn valuable skills, make a difference and contribute to a social cause. 

• In partnership with CERN & others 

• Prizes include trip to CERN, Scholarship for top student participants. 

• Email interest list live today: http://software.intel.com/moderncode/challenge 

3

Table of Contents 

Intel® Xeon Phi Coprocessors 

• New or Updated Proof Points > 

• Proof Points Highlights and Summary > 

• A Growing Vendor Ecosystem > 

Customer Proof Points 

• Life Sciences > 

• Material Sciences > 

• Manufacturing > 

• Financial Services > 

• Energy > 

• Astrophysics > 

• Geophysics > 

• Physics > 

• Climate and Weather > 

• Digital Content Creation > 

Additional Resources 

• Tools and Libraries > 

• Intel Developer Training > 

• Development Tools > 

4

New or Updated Proof Points 

NEW 

proof points (38): 

• Life Sciences: 

LAMMPS* 

BLAST* (2) 

Johns Hopkins Bowtie2* (2) 

• Manufacturing and DCC: 

ANSYS Mechanical* (5) 

ANSYS server demo 

Altair Hyperworks* 

MSC Software* 

HPCG benchmark 

Sandia Mantevo miniFE 

Autodesk Maya* (2) 

Fujitsu HPC Gateway* 

Fujitsu server demo 

E4* 

Intel Modern Code 

CST* 

Embree (2) 

• Financial Services: 

Monte Carlo* European Option 

Pricing 

Monte Carlo Excel*-Based 

Option Pricing 

QuantLib* (2) 

• Energy Industry: 

Iso3DFD 

Distributed Iso3DFD 

specfem3D 

DownUnder GeoSolutions* 

• Physics: 

miniGhost* (2) 

BerkeleyGW* 

ASKAP* 

Texas Advanced Computing 

Center* 

• Weather Research: 

ROMS* 

Proof points with updated 

code and/or recipe (7): 

• Life Sciences: 

Amber 14 (2 codes) 

• Financial Services: 

Monte Carlo 

QuanLib (2 codes, 2 recipes) 

5

New Proof Points and New Processor Speed Ups: 

Verticals Snapshot 

Average speed up in proof points 1 : 

• Life Sciences: 2.58X average speed up among new proof points and new processor. 

• Manufacturing and DCC: 2.48X average speed up among new proof points and new processor. 

• Financial Services: 3.86X average speed up among new proof points and new processor. 

• Energy Industry: 1.72X speed up in new proof point and new processor. 

• Physics: 1.6X average speed up among new proof points and new processor. 

• Weather Research: 1.46X average speed up among new proof points and new processor. 

1 - Performance demonstrated in proof points in this presentation 

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, 

components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated 

purchases, including the performance of that product when combined with other products. 

6

Intel® Xeon Phi Coprocessors Software Ecosystem Highlights and Summary 

Intel® Xeon Phi 

coprocessors 

Powerful software 

used to gain 

meaningful insights 

Compelling reasons 

to optimize 

applications 

This presentation demonstrates improved software performance for key applications in key business 

segments. The majority of the latest proof points show increases 1 when adding the Intel® Xeon Phi 

coprocessor to the host Intel® Xeon® processor E5-2697 v2 or Intel® Xeon® processor E5-2697 v3. 

Some of the best examples are summarized below: 

Life Sciences Highlights: 

• Molecular systems with classical models simulation: 

Up to 6.5X performance increase. 

• Bimolecular simulations (Protein, DNA, RNA, virus etc.): 

Up to 5X performance increase. 

Manufacturing and Highlights: 

• Static structural analysis 

Up to 4.7X single node performance increase. 

Financial Services Highlights: 

• European Option Pricing using Monte Carlo: 


Energy Industry Highlights: 

• Accurate imaging of complex subsurface structures: 


Physics and Astronomy Highlights: 

• Finite Difference mini-application which implements a difference stencil across a homogenous three 

dimensional domain: 


Weather Research and Forecasting Highlights: 

• A weather prediction system designed to serve atmospheric research and operational forecasting needs: 


1 - Performance demonstrated in proof points in this presentation 

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer 

systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your 

contemplated purchases, including the performance of that product when combined with other products. 

7

Relative performance 

1.00 

1.22 

3.00 

1.23 

2.19 

2.35 

3.40 

4.58 

5.51 

1.11 

1.17 

1.34 

1.58 

1.59 

1.72 

1.85 

1.85 

2.09 

2.12 

3.13 

6.94 

7.04 

8.91 

1.15 

1.20 

1.26 

1.28 

1.35 

1.37 

1.49 

1.50 

1.55 

1.56 

1.67 

2.03 

3.25 

6.50 

1.51 

1.88 

2.43 

2.04 

4.70 

0.45 

1.11 

1.12 

1.18 

1.37 

1.20 

1.27 

1.53 

1.73 

3.48 

1.31 

1.57 

Application Performance Snapshot 

Intel® Xeon Phi Coprocessor vs. 2 socket Intel® Xeon® processor 

10 

DCC 

Energy 

Financial Services 

Life Sciences 

Mfg. 

Physics 

Weather 

9 

8 

7 

6 

5 

4 

3 

2 

1 

0 



purchases, including the performance of that product when combined with other products. Source: Intel measured as of Q1 2015 Configuration Details: Please slide speaker notes. For more information go to 

http://www.intel.com/performance 

8

Intel® Xeon® Processor E5-2697 v2 and Intel® Xeon® Processor 

E5-2697 v3 Better Together with Intel® Xeon Phi Coprocessor 

Software Performance Increases with 2-Socket Intel® Xeon® 

Processor E5-2697 v2 + Intel® Xeon Phi Coprocessor 7120 

compared to Intel® Xeon® Processor E5-2697 v2 1 

Software Performance Increases with 2-Socket Intel® Xeon® 

Processor E5-2697 v3 + Intel® Xeon Phi Coprocessor 7120 

compared to Intel® Xeon® Processor E5-2697 v3 1 

Monte Carlo* European Options 

QuantLib* SP Monte Carlo 

ANSYS Mechanical 

Sandia Mantevo* 

Petrobas* 

Black-Scholes* Formula Valuation 

Monte Carlo Excel*-based 

QuantLib* SP Black Scholes 

Binomial Options* 

ASKAP* 

Iso3DFD 

LAMMPS Production Protein 

GROMACS* 

miniGhost* 

BWA-ALN* 

OpenLB* 

2.12 

2.09 

1.84 

1.85 

1.73 

1.72 

1.61 

1.56 

1.55 

1.50 

1.53 

2.30 

2.20 

3.34 

6.93 

1 Performance demonstrated in respective proof points in this presentation 



purchases, including the performance of that product when combined with other products. * Other names and brands may be claimed as the property of others. 

8.91 

Quantlib* SP Monte Carlo 

ANSYS Mechanical 

Monte Carlo RNG* European Options 

NAMD* 2.10 STMV 

Autodesk Maya* 2014 


NAMD* 2.10 ApoA1 

BLASTn v.30 

WRF* 

LAMMPS* Liquid Crystal; 32 nodes 

DMI* 

AMBER* 14 Tobacco Virus 

1.45 

1.33 

1.25 

1.54 

1.52 

1.52 

1.49 

1.87 

1.75 

1.74 

2.61 

3.8 

9

Is Intel® Xeon Phi Coprocessor Performance Compelling 

Compared to NVIDIA*? 

Software Performance Increases with 2-Socket Intel® Xeon® Processor E5-2697 v2 or v3 + Intel® Xeon Phi Coprocessor 7120 compared 

to Intel® Xeon® Processor E5-2697 v2 or v3 + NVIDIA GPU 1 

QuanLib* SP Monte Carlo 

QuanLib* SP Monte Carlo 

5.17 

Isotropic FD RTM Kernel* 

Isotropic FD RTM Kernel* 

4.78 

LAMMPS* Stillinger-Weber 

LAMMPS* Stillinger-Weber 

3.73 

Embree* 2.2 Pathtracer 

Embree* 2.2 Pathtracer 

3.48 

QuantLib* Black-Scholes 

QuantLib* Black-Scholes 

2.62 

LAMMPS* Rhodopsin 512K 

LAMMPS* Rhodopsin 512K 

1.94 

ASKAP* 

ASKAP 

Xcelerit* 

Xcelerit* 

1.25 

1.43 

1 Performance demonstrated in respective proof points in this presentation 



purchases, including the performance of that product when combined with other products. * Other names and brands may be claimed as the property of others. 

10

Memory Capacity (GB) 

Memory Comparison: Intel vs. NVIDIA* 

Memory Capacity (GB) 

Without ECC With ECC 

16 

14 

12 

10 

8 

6 

4 

2 

0 

6.0 

5.8 

3.25% Drop 

w/ECC On 

8.0 

7.8 

16.0 

15.5 

4.0 

3.5 

5.0 

12.5% Drop 

w/ECC On 

3120P 5110P 7120P K10 K20 K20X K40 

4.4 

6.0 

5.3 

12.0 

10.5 

• Intel has Higher Memory Capacity (16GB vs. 12GB) 

• Intel has less impact to capacity when ECC is Enabled 

• NVIDIA loses ~12.5% mem capacity when ECC is enabled vs. ~3.25% for Intel 

Source: 

Intel results Measured as Oct, 2013. NVIDIA* results 

values based on www.nvidia.com 

K10 = capacity behind each GPU on the card 

Some results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. 



purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance *Other names and brands may be claimed as the property of others. 

11

Relative Performance 

Leverage Code Resources 

Code Modernization benefits both the Processor & the Coprocessor 

2S Xeon E5-2697 v2 (Original Parallel) 2S Xeon E5-2697 v2 (Optimized) 2S Xeon E5-2697 v2 + 1 Xeon Phi 7120A (Optimized) 

5 

4 

3 

2 

1 

0 

Code Modernization results in 

significant improvements for both the 

processor and the coprocessor 

Original 

Parallel Code 

(Xeon Only) 

1.7X 2.22 

3.43X 

1.70 

Optimized 

1 Optimized 

Code 

1 

Code 

(Xeon + 

(Xeon Only) 

Xeon Phi) 

Amber 14 Cellulose NPT 

Original 

Parallel Code 

(Xeon Only) 

3.43 

Optimized 

Code 

(Xeon Only) 

LAMMPS Liquid Crystal 

5.07 

Optimized 

Code 

(Xeon + 

Xeon Phi) 

Xeon = Intel® Xeon® processor 

Xeon Phi = Intel® Xeon Phi coprocessor 

Code Optimizations 

- Reduce host data xfer 

- Better data decomp. between CPU & Card 

- Better parallellization 

- Better OpenMP sched. 

Code Optimizations 

- Better Vectorization 

- Mixed Precision support 

- Misc changes for better 

parallelization 



purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance *Other names and brands may be claimed as the property of others. 

12

A Growing Ecosystem: 

The Intel® Xeon Phi Coprocessors Applications & Solutions Catalog 

One place for all current public 

announcements, collateral, recipes, 

articles, benchmarks and case studies: 

• Provides insight into work-in 

process optimization and porting 

efforts in the software communities 

• Lists optimized, ported, and 

downloadable software 

• Click here to Access the Catalog 

Now! 

• Click here for companies offering 

Intel® Xeon Phi Coprocessors 

Intel.com/xeonphi 

Developer Zone 

13

“Intel’s leading technology & 

product provide great high 

performance computing power 

which enable us achieve more 

genome scientific research success 

for genome application 

development for China and for the 

whole human being.” 

Optimizing applications. Accelerating discovery. 

Life Sciences 

Wang Bingqiang 

Head of High Performance Computing, 

BGI 

14

Comparative Performance 

LAMMPS* 

Stillinger-Weber Water Benchmark 

32 NODES CLUSTER BENCHMARK 

APPROVED FOR PUBLIC PRESENTATION 

NEW 

3 

LAMMPS* Stillinger-Weber Water Benchmark Speed Up 

3X 

3.41X 

3.05X 

3.6X 

Application: LAMMPS* 

Description: Simulation of molecular systems with classical 

models. More at http://lammps.sandia.gov/ 

Availability: 

• Code: In main LAMMPS repository. 

• Recipe: Available here. 

2 

1 

0 

1 

0.9X 

1 Node (256K molecules) 32 Nodes (8.2M molecules) 

2S Intel® Xeon® processor E5-2697 v3 (LAMMPS baseline) 

2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package) 

2S Xeon E5-2697 v3 + Tesla K40c*, boost off, ECC on 

2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package) 

“Xeon E5-2697 v3” = Intel® Xeon® processor E5-2697 v3 

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120A 

1 

No 

testing 

on Tesla 

Usage Model: Load balancer offloads part of neighbor-list and 

non-bond force calculations to Intel® Xeon Phi coprocessor 

for concurrent calculations with CPU. 

Highlights: Improved results with Intel® Xeon® processor E5- 

2697 v3 and Intel® Xeon Phi coprocessor 7120A. Dynamic 

load balancing allows for concurrent: 

• Data transfer between host and coprocessor. 

• Calculations of neighbor-list, non-bond, bond, and longrange 

terms. 

Same routines in LAMMPS Intel Package also run faster on 

CPU. 

Results: Simulation rate increase with Intel® Package is up to 

3.6X. Concurrent Intel Xeon Phi coprocessor computations and 

MPI communications yield improved speedup and higher node 

counts. 

For configuration details, go here. 

SOURCE: INTEL MEASURED RESULTS AS OF MARCH, 2015 



purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance *Other names and brands may be claimed as the property of others 

15


Johns Hopkins Bowtie 2* 

Multiple workloads 

1 

0 

Johns Hopkins Bowtie 2 Multiple Workloads Speed Up 

1 

.88X 


1.08X 

ERR161544 SRR034966_1 ERR000589 SRR002273_1 

Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla* K40 

Intel® Xeon® processor E5-2697 v3 

1.59X 

1.87X 

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY, 2015 

1 NODE 

Applications: 1) Bowtie2 version 2.2.3; Intel® AVX2 port. Bowtie 

2 is an ultrafast and memory-efficient tool for aligning 

sequencing reads to long reference sequences. 2) NVBowtie 

version 0.9.9.3. NVBowtie is a GPU-accelerated re-engineering of 

Bowtie2, a widely used short-read aligner. While being 

completely rewritten from scratch, NVBowtie reproduces many 

(though not all) of the features of Bowtie2. 

http://nvlabs.github.io/nvbio/nvbowtie_page.html 

Description: Bowtie 2, in addition to aligning sequencing reads 

to long reference sequences, is particularly good at aligning 

reads of about 50 up to 100s or 1,000s of characters, and 

particularly good at aligning to relatively long (e.g. mammalian) 

genomes. 

Availability: 

• Code: Available here. 

• Recipe: Not available. Check for future availability here. 

Usage Models: ERR161544, SRR002273_1, HEK001(TGen), 

ERR000589_1, SRR033552_1, SRR034966_1, ERR024139_1 

Highlights: See more here. 


NEW 

Results: Bowtie2 running on the Intel® Xeon® processor E5-2697 

v3 with Intel® AVX2 port faster than NVBowtie running on the 

Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40* 

for 6 of 7 workloads. NVIDIA published data of K40 compared to 

Intel® Xeon® processor E5-2600 (6 cores) on one workload. 




16


Johns Hopkins Bowtie 2* 

TGen workload 

1 NODE 


NEW 

1 

0 

Johns Hopkins Bowtie 2 TGen Workload Speed Up 

1.37X 

1 


Xeon E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A 


Application: Johns Hopkins Bowtie2* SSE2 and Bowtie2 running 

on the Intel® Xeon Phi coprocessor. Bowtie 2 is an ultrafast and 

memory-efficient tool for aligning sequencing reads to long 

reference sequences. 

Description: TGen workload – Analyzing cancer tumor RNA data 

from Translational Genomics Research Institute (TGen), available 

here and here. This workload used in Intel® Xeon® processor 

launch benchmarks. 

Availability: 



Usage Model: Heterogeneous model: Process data on host (78 

input files) and Intel® Xeon Phi coprocessor (36 input files), each 

file contains 500,000 reads. One Intel Xeon Phi coprocessor, 12 

simultaneous jobs with 10 threads each, were run and one job 

with 24 threads was run on the host. 

Highlights: See more here. 

Results: Up to 1.37X improved performance with the Intel® 

Xeon® processor E5-2697 v2 (host) and the Intel Xeon Phi 

coprocessor 7120A over the baseline host. 






17


1 

0 

BLAST* 

BLASTn v.30 

1 

2S Intel® Xeon® processor E5-2697 v2 (BLASTn v.30 baseline) 

2S Xeon E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A 

2S Xeon E5-2697 v2 + Xeon Phi 7120A (OFS parallelized) 

2S Intel® Xeon® processor E5-2697 v3 (BLASTn v.30 baseline) 



“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3 



BLASTn* v.30 Speed Up 

1.26X 

1.41X 

1.22X 

1.49X 

1.52X 


Application: Basic Local Alignment Search Tool (BLASTn) v.30. 

Description: Searching for alignment in nucleotide query 

sequences against a known nucleotide db volume set. National 

Center for Biotechnology Information (NCBI*). More at 

http://blast.ncbi.nlm.nih.gov/. 

Availability: 



Usage Model: #4 (multiple queries multiple db) 100 NCBI queries 

(concatenated) against db refseq_rna.00-02 are distributed to the 

Intel® Xeon® processor and Intel® Xeon Phi coprocessor for 

maximum speedup sweet spot. Experiment was repeated 20 times 

with the pick of queries randomized for a sweet spot split 80/20 

and 59/23/18. 

Highlights: Throughput for this load sharing model has a small 

sweet spot for a sufficiently large query set. 

Results: Compared to the baseline, simulation rate speed up with 

Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi 

coprocessor 7120A heterogeneous model is 1.52X. Performance is 

also improved on the CPU due to Output Formatting Section (OFS) 

parallelization. 




. 

1 NODE 


NEW 

18


1 

0 

BLAST* 

BLASTp v.30 

1 

2S Intel® Xeon® processor E5-2697 v2 (BLASTp v.30 baseline) 



2S Intel® Xeon® processor E5-2697 v3 (BLASTp v.30 baseline) 




BLASTp* v.30 Speed Up 

1.15X 

1.3X 

1.21X 

1.39X 



1.41X 


Application: Basic Local Alignment Search Tool (BLASTp) v.30 

Description: Searching for alignment in protein query sequence 

against a known protein db volume set. More at 

http://blast.ncbi.nlm.nih.gov/. 

Availability: 



• Usage Model: #4 (multiple queries multiple db) 40 NCBI 

queries (concatenated) against db nr_sorted.00-02 are 

distributed to Intel® Xeon® processor and Intel® Xeon Phi 

coprocessor for maximum speedup sweet spot. Experiment 

was repeated 20 times with the pick of queries randomized for 

a sweet spot split 33/7 and 28/5/7. 

Highlights: Throughput for this offload model has a small sweet 

spot for a sufficiently large query set. Throughput is limited due 

to GAT stage not parallelized. 

Results: Compared to the baseline, simulation rate speed up 

with Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi 

coprocessor 7120A heterogeneous model is 1.41X. 

Performance is also improved on the CPU due to Output 

Formatting Section (OFS) parallelization. 




. 

1 NODE 


NEW 

19


NAMD* 2.10 Pre-Release 

STMV 

CLUSTER BENCHMARK 

32 NODES 


30 

25 

20 

15 

10 

5 

0 

1 

NAMD* 2.10 (pre-release) Cluster Performance Increase 

STMV (~1M atoms) 

1.2X 

2X 

2.1X 

6.8X 

7.9X 

12.2X 13.1X 

1 Node 8 Nodes 32 Nodes 

Intel® Xeon® processor E5-2697 v2 (Baseline: 1 node, 23 or 47 PPN) 

Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN) 

Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi coprocessor 7120A (240T) 

Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi coprocessor 7110A (240T) 

20X 

24.2X 


27.2X 

32X 

Application: NAMD 2.10 pre-release; STMV 

Description: A parallel, object-oriented molecular 

dynamics code designed for high-performance 

simulation of large biomolecular systems. More at 

http://www.ks.uiuc.edu/Research/namd/ 

Availability: 

• Code: Intel® Xeon Phi coprocessor support is 

available as a pre-release. Use the nightly build. 


Usage Model: Single rank on host with 47 threads. 

Various computations are offloaded to Intel® Xeon 

Phi coprocessor from each thread. 

Highlights: Intel® Xeon Phi coprocessor support is 

now in the development branch of NAMD 2.10 prerelease. 

Results: For the STMV workload, the Intel® Xeon® 

processor E5-2697 v3 and the Intel® Xeon Phi 

coprocessor (32 nodes, 55 PPN) improved 

performance by up to 32X compared to the baseline 

processor (1 node, 47 PPN). 


SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER, 2014 




20



ApoA1 


2 NODES 


2 

1 

0 


ApoA1 (~92K atoms); 55 PPN 

2.61X 

1 

1.52X 

1.94X 

1 Node 2 Nodes 

Intel® Xeon® processor E5-2697 v3 (Baseline: (Baseline; 1 node, node) 55PPN) 

Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi coprocessor B1- 

7110A (240T) 

Application: NAMD* 2.10 pre-release; ApoA1 

Description: A parallel, object-oriented molecular dynamics code 

designed for high-performance simulation of large bio molecular 

systems. More at http://www.ks.uiuc.edu/Research/namd/ 

Availability: 

• Code: Intel® Xeon Phi coprocessor support is available as a prerelease. 

Use the nightly build. 


Usage Model: Single rank on host with 55 threads. Various 

computations are offloaded to Intel® Xeon Phi coprocessor from 

each thread. 

Highlights: Intel® Xeon Phi coprocessor support is now in the 

development branch of NAMD 2.10 pre-release. 

Results: For the ApoA1 workload, 2-node performance can be 

accelerated by up to 2.61X using a single Intel® Xeon Phi 

coprocessor. 






21



ApoA1 


2 NODES 


3 

2 

1 

0 


ApoA1 (~92K atoms); 47 PPN / 55 PPN 

1 

1.2X 

1.64X 1.82X 1.92X 

2.33X 


2.85X 

3.14X 

Intel® Xeon® processor E5-2697 v2 (Baseline: 1 node, 47 PPN) 

Intel® Xeon® processor E5-2697 v3 (55 PPN) 

Xeon E5-2697 v2 (47 PPN) + 1 Xeon Phi C0-7120A (240T) 

Xeon E5-2697 v3 (55 PPN) + 1 Xeon Phi B1-7110A (240T) 


“Xeon Phi 7110A/7120A” = Intel® Xeon Phi coprocessor 7110A/7120A 

Application: NAMD 2.10 pre-release; ApoA1 

Description: A parallel, object-oriented molecular dynamics code 

designed for high-performance simulation of large biomolecular 

systems. More at http://www.ks.uiuc.edu/Research/namd/ 

Availability: 

• Code: Intel® Xeon Phi coprocessor support is available as a 

pre-release. Use the nightly build. 


Usage Model: Single rank on host with 47 threads. Various 

computations are offloaded to Intel® Xeon Phi coprocessor from 

each thread. 

Highlights: Intel® Xeon Phi coprocessor support is now in the 

development branch of NAMD 2.10 pre-release. 

Results: For the ApoA1 workload, the Intel® Xeon® processor E5- 

2697 v3 and the Intel® Xeon Phi coprocessor (2 nodes, 55 PPN) 

improved performance by up to 3.14X compared to the baseline 

processor (1 node, 47 PPN). 






22


LAMMPS* 

Liquid Crystal Benchmark; 524K Atoms 

1 NODE 


6 

5 

4 

3 

2 

1 

0 

2S Intel® Xeon® processor E5-2697 v2 (LAMMPS Baseline) 




2S E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A Turbo Off (LAMMPS IA 

Package) 

2S E5-2697 v3 + Intel® Xeon Phi coprocessor 7110P/7120A Turbo Off (LAMMPS 

IA Package) 


LAMMPS* Liquid Crystal Benchmark Performance 

(Mixed Precision) 

1 1.2X 

3.4X 

5.4X 

5.1X 

6.5X 

SOURCE: INTEL MEASURED RESULTS AS OF AUGUST, 2014 




Availability: 




non-bond force calculations to Intel® Xeon Phi coprocessor for 

concurrent calculations with CPU. 

Highlights: Improved results with Intel® Xeon® processor E5-2697 

v3 and Intel® Xeon PHI coprocessor 7120A. Dynamic load 

balancing allows for concurrent: 


• Calculations of neighbor-list, non-bond, bond, and long-range 

terms. 

Same routines in LAMMPS Intel Package also run faster on CPU. 

Results: Up to 6.5x performance improvement utilizing Intel® 

Xeon® processors and Intel® Xeon Phi coprocessors in 

combination with application optimization and generational 

improvements in architecture and higher core counts. 




23


LAMMPS* 

Liquid Crystal Benchmark; 524K Atoms 


32 NODES 


LAMMPS* Liquid Crystal Benchmark Performance (Mixed Precision) 

5 

4 

3 

2 

1 

0 

1 

4.5X 

5.4X 

1 Node (524K Atoms) 32 Nodes (16.8M Atoms) 

2S Intel® Xeon® processor E5-2697v3 (LAMMPS Baseline) 


2S E5-2697 v3 + Intel® Xeon Phi coprocessor 7110P/7120A Turbo Off 

(LAMMPS IA Package) 

1 

3.66X 

5.3X 


Description: Simulation of molecular systems with classical models. 

More at http://lammps.sandia.gov/ 

Availability: 



Usage Model: Load balancer offloads part of neighbor-list and nonbond 

force calculations to Intel® Xeon Phi coprocessor for 



v3 and Intel® Xeon PHI coprocessor 7120A. Dynamic load 




terms. 


Results: Up to 5.4X performance improvement utilizing Intel® 

Xeon® processors and Intel® Xeon Phi coprocessors with 

application optimization on a single node compared to the baseline 

configuration. Performance gains continue to hold at 5.3x when 

scaling up to 32 nodes. 






24


LAMMPS* 

Rhodopsin Benchmark; 512K Atoms 


32 NODES 


1 

0 

LAMMPS* Rhodopsin Benchmark Performance 


1 

1.27X 

1.68X 




2S E5-2697 v3 + Intel® Xeon Phi coprocessor 7110P/7120A Turbo Off 

(LAMMPS IA Package) 

1 

1.07X 

1.47X 




Availability: 




non-bond force calculations to Intel® Xeon Phi coprocessor for 



v3 and Intel Xeon Phi coprocessor 7120A. Dynamic load 




terms. 




application optimization on a single node compared to the 

baseline configuration. Performance gains continue to hold at 

1.47X when scaling up to 32 nodes. 






25


AMBER* 14 

Particle Mesh Ewald (PME) Tobacco Virus 

2 

1 

0 

AMBER* 14 PME: Tobacco Virus, 1 Million Atoms 

1 

1.52X 

2X 

Intel® Xeon® processor E5-2697 v2 (baseline) 

Intel® Xeon® processor E5-2697 v2 (optimized) 

Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi coprocessor 7120A 

Xeon E5-2697 v2 (optimized) + NVIDIA* K40 DPFP 


2.26X 

1.93X 

2.41X 

Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi coprocessor 7120A 


Application: AMBER* 14 

1 NODE 


Description: Bimolecular Simulations (Protein, DNA, RNA, virus etc.). 

Full double precision (DPDP). More at http://ambermd.org/ 

Availability: 

• Code: Available as a patch. 

• Recipe: Available here (Section 18.7 of the manual). 

Usage Model: 

• Baseline is the Intel® Xeon® processor E5-2697 v2 compared to 

the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi 

coprocessor 7120A. 

• Offload processing on both, and using the released code, double 

precision code, across the platforms, 50% workload on the host 

and 50% on the coprocessor. 

Highlights: The code was optimized, delivered to the AMBER 

community (whoever has license) and available as an update patch 

during code configuration. The benchmark information is at 

http://www.ks.uiuc.edu/Research/STMV/ 

Results: Optimized Intel Xeon processor E5-2697 v3 and Intel Xeon 

Phi coprocessor 7120A offload demonstrated up to 2.41X improved 

performance over the Intel Xeon processor E5-2697 v2. Optimized 

offload process demonstrated 1.07X increased performance 

compared to NVIDIA K40* performance. 






26


LAMMPS* 

Liquid Crystal Benchmark 

5 

4 

3 

2 

1 

0 

LAMMPS* Liquid Crystal Benchmark Performance 


1 




2S E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A Turbo Off (LAMMPS 

IA Package) 


3.43X 

5.07X 

1 

3.06X 

4.84X 

SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2014 



1-32 NODES 


Wide variety of academic, government, and industry users. Popular 

due to its versatility and support for a wide range of forcefields/potential 

models: Materials Science, Chemistry, Biophysics, 

Solid Mechanics, Granular Flow, etc. More at 

http://lammps.sandia.gov/ 

Availability: 








v2 and Intel Xeon Phi coprocessor 7120A. Dynamic load balancing 

allows for concurrent: 



terms. 




application optimization on a single node compared to the baseline 

configuration. Performance at 4.84X when scaling up to 32 nodes. 




27


LAMMPS* 

Rhodopsin Benchmark 256K Atoms 

LAMMPS* Rhodopsin Benchmark Performance (Mixed Precision); Includes 

External NVIDIA* Results 2/K20X + 1S AMD* 

2 

1 

0 

1 

1.26X 


2S Intel® Xeon® E5-2697 v2 (LAMPPS baseline) 

2S Intel® Xeon® E5-2697 v2 (LAMPPS IA Package) 

2S E5-2697 v2 + Intel® Xeon Phi Coprocessor 7120A Turbo off (LAMPPS 

IA package) 

Cray XK7: 1S AMD Opteron* 6274 + NVIDIA Tesla* K20X; Cray Gemini* 

Interconnect, PCIe2.0* (LAMMPS GPU Package) 

http://www.nvidia.com/docs/IO/122634/computational-chemistry-benchmarks.pdf) 


1.71X 

.94X 1 

1.56X 

2.15X 

1.46X 

SOURCE: INTEL MEASURED RESULTS AS OF MAY, 2014 



32 NODES 


models. Wide variety of academic, government, and industry users. 

Popular due to its versatility and support for a wide range of forcefields/potential 




Availability: 





non-bond force calculations to Xeon Phi for concurrent 

calculations with CPU. 

Highlights: Dynamic load balancing allows for concurrent: 



terms. 




application optimization on a single node compared to the 

baseline configuration. Performance increases up to 2.15X when 

scaling up to 32 nodes, out-performing the alternative 

configuration. 




28


LAMMPS* 


1 

0 

LAMMPS* Rhodopsin Benchmark Performance 


1.78X 1.75X 

1 





IA Package) 


1.2X 

1 

1.17X 




32 NODES 







Availability: 







Highlights: Improved results with Intel® Xeon® processor E5-2697 v2 

and Intel Xeon Phi coprocessor 7120A. Dynamic load balancing 




terms. 


Results: Up to 1.78X performance improvement utilizing Intel® Xeon® 

processors and Intel® Xeon Phi coprocessors with application 

optimization on a single node compared to the baseline 

configuration. Performance gains continue to hold at 1.75X when 

scaling up to 32 nodes. 




29


LAMMPS* 


LAMMPS* Rhodopsin Benchmark Performance (Mixed Precision); Includes 

External NVIDIA* Results 2/K20X + 1S AMD* 

1 

0 


2S Intel® Xeon® processor E5-2697v2 (LAMMPS Baseline) 

2S Intel® Xeon® processor E5-2697v2 (LAMMPS IA Package) 

2S E5-2697v2 + Intel® Xeon Phi coprocessor 7120A Turbo Off (LAMMPS 

IA Package) 

Cray XK7: 1S AMD Opteron* 6274 + NVIDIA Tesla* K20X; Cray Gemini* 

Interconnect, PCIe* 2.0 (LAMMPS GPU Package) 

http://www.nvidia.com/docs/IO/122634/computational-chemistrybenchmarks.pdf) 


1.21X 

1.75X 

1 1 

.9X 

1.2X 

1.72X 

1.22X 




32 NODES 







Availability: 







Highlights: Improved results with Intel® Xeon® processor E5-2697 v2 

and Intel Xeon Phi coprocessor 7120A. Dynamic load balancing 




terms. 





configuration. Performance gains continue to hold at 1.72X when 

scaling up to 32 nodes, out-performing the alternative configuration. 




30


LAMMPS* 

Production Protein Sim.; 474K Atoms 

1 

0 

LAMMPS* Production Protein Simulation Performance 


1 

1.18X 

1.9X 





IA Package) 

1 

1.07X 

1.7X 



32 NODES 






http://lammps.sandia.gov/. 

Availability: 








v2 and Intel Xeon Phi coprocessor 7120A. Dynamic load balancing 




terms. 





configuration. Performance at 4.84X when scaling up to 32 nodes. 






31


AMBER* 14 

Generalized Born (GB) Nucleosome 

1 NODE 


2 

1 

0 

AMBER* 14 GB Nucleosome, 25000 atoms 

1 


Intel® Xeon® processor E5-2697 v2 optimized 

Xeon E5-2697 Phi 7120A v2 + optimized Xeon E5-2697 + Xeon v2 Phi offload 7120A offload 


“Xeon Phi” = Intel® Xeon Phi coprocessor 7120A 


1.85X 

2.2X 



Description: Bimolecular Simulations (Protein, DNA, RNA, virus 

etc.). Full double precision (DPDP). More at http://ambermd.org/ 

Availability: 



Usage Model: 

• Baseline is the Intel® Xeon® processor E5-2697 v2 host (also 

measured in 

http://ambermd.org/gpus/benchmarks.htm#Benchmarks) and 

speed up is shown with offload processing on both the Intel® 

Xeon® processor E5-2697 v2 and the Intel® Xeon Phi 


• Performance shown for the released code, all double precision, 

across the platforms. 50% workload on the host, 50% on the 

coprocessor. 

Highlights: The code had been optimized, will be delivered to 

the AMBER community (whoever has license) and available as 

update patch during code configuration. 

Results: Optimized offload process demonstrated up to 2.2X 

improved performance over the baseline Intel® Xeon® processor 

E5-2697 v2. 




32


AMBER* 14 

Particle Mesh Ewald (PME) Cellulose NPT 

1 NODE 


2 

1 

0 

AMBER* 14 PME: Cellulose NPT 

1 


1.7X 

Optimized Intel® Xeon® processor E5-2697 v2 

2.2X 

Optimized Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi 

coprocessor 7120A 




Availability: 



Usage Model: 

• Baseline is the host Intel® Xeon® processor E5-2697 v2 compared 

to the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi 

coprocessor 7120A using double precision. 

• Offload processing on both, using the released code, double 

precision across the platforms. 



during code configuration. 


Phi coprocessor 7120A offload demonstrated up to 2.2X improved 

performance over the baseline Intel Xeon processor E5-2697 only 

code. 






33


AMBER* 14 


1 NODE 


2 

1 

0 

AMBER* 14 PME Cellulose NPT Speed Up 

1 


1.7X 

Optimized Intel® Xeon® processor E5-2697 v2 

Optimized Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi 

coprocessor 7120A 

2X 




Availability: 

• Code: As a patch of AMBER 14 when user updates AMBER 

(http://ambermd.org/bugfixes14.html, 

http://ambermd.org/bugfixesat.html) Update 5 and update 8. 

• Recipe: http://ambermd.org/doc12/Amber14.pdf (Section 18.7 of 

the manual). 

Usage Model: Baseline is the Intel® Xeon® processor E5-2697 v2 

compared to the Intel® Xeon® processor E5-2697 v2 and the Intel® 

Xeon Phi coprocessor 7120A with offload processing on both, and 

using the released code (double precision code across the 

platforms). 



during code configuration. 


Phi coprocessor 7120A offload demonstrated up to 2X improved 

performance over the baseline Intel Xeon processor E5-2697 only 

code. 






34


AMBER* 14 



4 NODES 


AMBER* 14 PME Cellulose NPT (408K Atoms) 

2.8X 

2.6X 

2.3X 

2.3X 

2 

1.9X 

1.6X 

1.3X 

1 

1 

0 

Node 1 Node 2 Node 3 Node 4 


Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A 






Availability: 



Usage Model: 

• Baseline is on the Intel® Xeon® processor E5-2697 v2 host 

only (also measured in 


speed up is shown with offload processing on both the Intel 

Xeon processor E5-2697 v2 and the Intel® Xeon Phi 


• Performance shown is for the released code, double precision 

across the platforms, 50% workload on the host, 50% on the 

coprocessor. 

Highlights: The code had been optimized, will be delivered to 

the AMBER community (whoever has license) and available as 

update patch during code configuration. 

Results: Optimized offload process demonstrated compelling 

cluster performance improvement, up to 2.8X, over the baseline 

Intel® Xeon® processor E5-2697 v2. 




35


1 

0 

AMBER* 14 


AMBER* 14 PME Cellulose NPT (408K Atoms) 

1 

1.14X 1.11X 

1.37X 

2 nodes 3 nodes 

Intel® Xeon® processor E5-2697 v2 (baseline) 


Xeon E5-2697 v2 + NVIDIA* K40 DPFP 

1.57X 

1.32X 




3 NODES 




Availability: 



Usage Model: 

• Baseline is on the Intel® Xeon® processor E5-2697 v2 host only 

(also measured in 


speed up is shown with offload processing on both the Intel 

Xeon processor E5-2697 v2 and the Intel® Xeon Phi 


• Performance shown is for the released code, double precision 

across the platforms, 50% workload on the host, 50% on the 

coprocessor. 

Highlights: The code had been optimized, will be delivered to the 

AMBER community (whoever has license) and available as update 

patch during code configuration. 

Results: Optimized offload process demonstrated compelling 

cluster performance improvement, up to 2.6X, over the baseline 







36


Burrows-Wheeler Aligner (BWA-ALN)* 

Human Genome 

1 NODE 


1 

0 

1 

BWA-ALN* Speed Up 

1.24X 

1.86X 

2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN) 

2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN) 

2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor 

7120A 

Application: Burrows-Wheeler Aligner*, version 0.5.10. 

BWA-ALN is represented in this benchmark. Workload is 

korean_female (read file 3.5 GB, 3.0 GB reference data 

base). 

Description: BWA is a popular software package for 

mapping low-divergent sequences against a large 

reference genome, such as the human genome. More at 

http://bio-bwa.sourceforge.net/. 

Availability: 



Usage Model: Hybrid MPI + OpenMP* using symmetric 

mode. 

Highlights: Results are identical to the unmodified run of 

BWA-ALN 

Results: The Intel® Xeon® processor E5-2697 v2 and the 

Intel® Xeon Phi coprocessor symmetric process 

demonstrated up to 1.86X improved performance over the 

baseline Intel® Xeon® processor E5-2697 v2. 


SOURCE: INTEL MEASURED RESULTS AS OF JANUARY, 2014 




37


GROMACS* 

512K H2O with RF 

1 NODE 


GROMACS* 512K H2O with RF Speed Up 

1.79X 

1.72X 

1.56X 

1 1.03X 

1 

0 


1 Intel® Xeon Phi coprocessor 7120P/X 

2 Intel® Xeon Phi coprocessor 7120P/X 

Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi coprocessor 7120P/X 

Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi coprocessor 7120P/X 

Application: GROMACS* 5.0-RC1; Workload: 512K H2O with RF 

method 

Description: GROMACS is a versatile package to perform 

molecular dynamics, i.e. simulate the Newtonian equations of 

motion for systems with hundreds to millions of particles. It is 

one of the fastest and the most popular Molecular Dynamics 

packages. 

Availability: 

• Code: Version 5.0-rc1 available here and here. 


Highlights: 

• Highly optimized for Intel® Xeon® Processors (AVX-intrinsics). 

• Able to run full simulation on Intel® Xeon Phi coprocessor 

natively + host processor using a symmetric model. 

• Optimized with intrinsics for 512-bit vectorization on Intel 

Xeon Phi coprocessors. 

• Results: Symmetric process demonstrated up to 1.79X 

improved performance over the baseline Intel® Xeon® 

processor E5-2697 v2. 


SOURCE: INTEL MEASURED RESULTS AS OF APRIL, 2014 




38


NWChem* 

CCSD(T) Method 


32 NODES 


NWChem* 6.3rev2 and 6.5 CCSD(T) Method 32 Node Speed Up 

1 

0 

1 

1.24X 

NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2 

NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 

1.52X 

NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon 

Phi Coprocessor 7120A 2 

Application: NWChem* is a computational chemistry software 

package that includes quantum chemical and molecular 

dynamics functionality. NWChem is developed the 

Environmental Molecular Sciences Laboratory (EMSL) at 

the Pacific Northwest National Laboratory (PNNL). More at 

http://www.nwchem-sw.org 

Availability: 

• Code: Available here and from the SVN repository. 


Usage Model: Offload using LEO and OpenMP* 

Highlights: NWChem with Intel® Xeon Phi coprocessor 7120A 

offloading is a compelling and cluster compelling application for 

the NWChem community . 

Results: Compared to the NWChem* 6.3rev2 and Intel® Xeon® 

processor E5-2697 v2 baseline: 

1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the 


2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the 

Intel® Xeon® processor E5-2697 v2 and the Intel Xeon Phi 







39


NWChem* 

CCSD(T) Method 

1 NODE 


NWChem* 6.3rev2 CCSD(T) Method Single Node Speed Up 

1 

0 

1 

2S Intel® Xeon® processor E5-2697 v2 

1.35X 

2S Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi 

Coprocessor 7120A 

Application: NWChem* is a computational chemistry software 

package that includes quantum chemical and molecular 

dynamics functionality. It is designed for high-performance 

parallel supercomputers and aims to be scalable both in its 

ability to treat large problems efficiently, and in its usage of 

available parallel computing resources. NWChem is developed 

the Environmental Molecular Sciences Laboratory (EMSL) at 

the Pacific Northwest National Laboratory (PNNL). More at 

http://www.nwchem-sw.org 

Availability: 

• Code: Available here and from the SVN repository. 


Usage Model: Offload using LEO and OpenMP*. 

Highlights: NWChem with Intel® Xeon Phi coprocessor 7120A 

offloading is a compelling and cluster compelling application for 

the NWChem community . 

Results: NWChem* 6.3rev2 CCSD(T) performed up to 1.35X 

faster with the Intel® Xeon® processor E5-2697 v2 and the Intel 

Xeon Phi coprocessor 7120A compared to the Intel Xeon 

processor E5-2697 v2 baseline. 






40

Discover and design like never before. 

material Sciences 

41


miniGhost* 

BSPMA 

1 NODE 


NEW 

miniGhost v. 0.9 Speed Up 

1.76X 

1 

1 

1.17X 

1.53X 

Application: miniGhost v. 0.9 

Description: Finite Difference mini-application which 

implements a difference stencil across a homogenous three 

dimensional domain. More at https://mantevo.org/download/ 

Availability: 

• Code: Available here or here. 


Usage Model: Symmetric model. 

Highlights: The code is a proxy for the CTH (shock physics 

code) and can be used as a good test-case to measure network 

(interconnect bandwidth vs. latency) performance. 

0 



Results: Symmetric process (2S Intel® Xeon® processor E5- 

2697 v3 + Intel® Xeon Phi coprocessor 7120A) demonstrated 

up to 1.76X improved performance over the baseline 

processor. 

Intel® Xeon® processor E5-2697 v2 + Xeon Phi 7120A 








42


miniGhost* 

BSPMA 

1 

0 

1 

miniGhost v. 0.9 Speed Up 

1.18X 



1.38X 




1.48X 


Application: miniGhost v. 0.9 

Description: Finite Difference mini-application which 

implements a difference stencil across a homogenous three 

dimensional domain. More at https://mantevo.org/download/ 

Availability: 

4 NODES 

• Code: Available here or here. 


Usage Model: Symmetric model. 


NEW 

Highlights: The code is a proxy for the CTH (shock physics 

code) and can be used as a good test-case to measure network 

(interconnect bandwidth vs. latency) performance. 

Results: Symmetric process (2S Intel® Xeon® processor E5- 

2697 v3 + Intel® Xeon Phi coprocessor 7120A) demonstrated 

up to 1.48X improved performance over the baseline 

processor. 






43


Quantum ESPRESSO* 

GRIR443 

1 

0 

Quantum ESPRESSO* 5.0.3 GRIR443 Speed Up 


1 


1.4X 

16S Intel Xeon processor E5-2697 v2 + 16 Intel® Xeon Phi 




Application: Quantum ESPRESSO* 5.0.3 

Workload Description: Quantum ESPRESSO* is an integrated suite of 

Open-Source computer codes for electronic structure calculations 

and materials modeling at the nanoscale. GRIR443 is a public PRACE 

benchmark suited for multi-node execution. ESPRESSO is an acronym 

for opEn-Source Package for Research in Electronic Structure, 

Simulation, and Optimization. More at http://www.quantumespresso.org/. 

Availability: 



8 NODES 


Usage Model: Offload using OpenMP* and Intel® Math Kernel Library. 

Highlights: 

• Results are obtained from real-world, publicly available benchmarks 

– no special treatment. 

• Effect of offloading heavily depends on the type of the workload – 

larger workloads (such as GRIR443) generate large dense matrix 

multiplications, benefiting from the Intel® Xeon Phi coprocessor 

7120A. 

Results: Quantum ESPRESSO 5.0.3 GRIR443 performed up to 1.4X 

faster with Intel® Xeon® processor E5-2697 v2 and the Intel Xeon Phi 

coprocessor compared to the Intel® Xeon® processor E5-2697 v2 

baseline. 




44


Quantum ESPRESSO* 

AUSRUF112 

1 NODE 


1 

0 


Quantum ESPRESSO* 5.0.3 AUSURF112 Speed Up 

1 


1.2X 

2S Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi 



Application: Quantum ESPRESSO* 5.0.3 AUSRUF112 

Description: Quantum ESPRESSO is an integrated suite of Open- 

Source computer codes for electronic-structure calculations and 

materials modeling at the nanoscale. AUSURF112 is a publicly 

available DEISA benchmark suited for single node execution. 

ESPRESSO is an acronym for opEn-Source Package for Research 

in Electronic Structure, Simulation, and Optimization. More at 

http://www.quantum-espresso.org/ 

Availability: 



Usage Model: Offload using OpenMP* and Intel® Math Kernel 

Library. 

Highlights: 

• Results are obtained from real-world, publicly available 

benchmarks – no special treatment. 

• Effect of offloading heavily depends on the type of the 

workload. 

Results: Quantum ESPRESSO 5.0.3 AUSURF112 performed up to 

1.2X faster with Intel® Xeon® processor E5-2697 v2 and the Intel® 

Xeon Phi coprocessor compared to the Intel Xeon processor E5- 

2697 v2 baseline. 




45

Improving simulations speed and quality. 

Manufacturing 

46


ANSYS Mechanical* 16.0 

V16sp-3 DMP 

ANSYS Mechanical* V16sp-3 Solver Rate, DMP Mode 


8 NODES 


NEW 

30 

25 

20 

15 

10 

5 

0 

1 

2X 1.8X 

4.7X 

1 (1 node x 1 MPI 

ranks) 

6X 

9.4X 

7.6X 6.9X 

12.3X 


8 (1 node x 8 MPI 16 (2 nodes x 8 MPI 

ranks) 

ranks) 

Number of MPI processes on host 

16 (8 nodes x 2 MPI 

ranks) 



17.7X 16.9X 

21.2X 

19.4X 


22.4X 

Intel® Xeon® processor E5-2697 v3 + 2 Intel® Xeon Phi coprocessors 7120A 

32.3X 

Application: ANSYS Mechanical* 16.0 DMP Mode 

Description: V16sp-3 Harmonic Analysis, 1.7M DOF. 

Availability: 

• Code: Contact ANSYS for official V16 release, 

www.ansys.com 


Usage Model: Intel® MKL automatic offload. 

Highlights: First commercial simulator supporting the 

Intel® Xeon Phi coprocessor. V15sp-3 speaker 

benchmark model. 

Results: The Intel® Xeon Phi coprocessor provides 

speed-up over Intel® Xeon® E5-2697 v3 host on cluster 

runs, and dual Intel® Xeon Phi® coprocessors provide 

greater speed-up on both single node and on cluster 

runs. 


SOURCE: INTEL MEASURED RESULTS AS OF FEBRUARY, 2015 




47



V145sp-5 DMP (Previous Version Model) 

1 NODE 


NEW 


11.3X 


10 

8 

6 

4 

3.34X 

4.6X 

7.24X 

5.48X 

3.77X 

9.83X 

8.83X 

5.39X 

3.62X 

9.51X 

8.4X 

4.79X 

Description: V145sp-5 Static Structural Analysis, 2.1M 

DOF. 

Availability: 


www.ansys.com 

• Recipe: Available here 


2 

0 

1.98X 2.04X 

1 

1 2 4 8 







Intel® Xeon Phi coprocessor. V145sp-5 turbine blade 

benchmark model (discusses impact of Intel® SSDs). 

Results: At 1 MPI processes on the host, the Intel® Xeon 

Phi coprocessor provides up to 2.32X speed-up over 

the Intel® Xeon® processor E5-2697 v3 host, and the 

Intel® Xeon® processor E5-2697 v3 + the Intel® Xeon 

Phi coprocessor 7120A provides up to 4.6X speed-up 

over the Intel® Xeon® E5-2697 v2 host. 

Note: Intel® Xeon® processor uses 1 core per MPI process 






48



V16ln-2 DMP 

1 NODE 


NEW 

10 

8 

6 

4 

2 

0 

ANSYS Mechanical* V16ln-2 Solver Rate, DMP Mode 

2.48X 

2.03X 2X 

1.61X 

1 

4.89X 

3.73X 

1 2 4 8 




6.65X 




3X 

6.47X 

5.83X 

9.58X 

6X 

9.42X 

8.97X 

10.08X 


Description: V16ln-2 Dynamic Structural Modal 

Analysis, 2M DOF. 

Availability: 


www.ansys.com 




Intel® Xeon Phi coprocessor. 

Results: At 2 MPI processes on the host, the Intel® Xeon 

Phi coprocessor provides up to 1.78X speed-up over 

the Intel® Xeon® E5-2697 v3 host, and up to 2.44X 

speed-up over the Intel® Xeon® E5-2697 v2 host. 






49



V145sp-4 DMP (Previous Version Model) 

1 NODE APPROVED FOR PUBLIC PRESENTATION 

NEW 


14.8X 

15.11X 

14 

12 

11.76X 

10 

10.2X 

10.16X 

8.76X 

8 

6.99X 

6 

5.88X 

5.89X 

4 

3.43X 

2 

1.89X 

1 .96X 

0 

1 2 4 8 16 



Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor 3151P 

Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi coprocessors 3151P 


Description: V145sp-4 Static Structural Analysis, 3.2M 

DOF. 

Availability: 

• Code: V16 preview 4 is now available. Contact ANSYS 

for official V16 release, www.ansys.com 




Intel® Xeon Phi coprocessor. V145sp-5 turbine blade 

benchmark model (discusses impact of Intel® SSDs). 

Results: At 2 MPI processes on the host, a single Intel® 

Xeon Phi 31S1P provides up to 3.11x speed-up over 

Intel® Xeon® host E5-2692 v2 alone. Two Intel Xeon Phi 

31S1P coprocessors provide a speed-up of up to 3.7x. 







50



V16ln-2 SMP 

ANSYS Mechanical* V16ln-2 Solver Rate, SMP Mode 


NEW 

5 

4 

3 

2 

1 

0 

1 

2.06X 

1.67X 

2.87X 

1.77X 

3.12X 

2.75X 

1 2 8 16 




3.99X 3.88X 

4.38X 

Number of threads on host 

5.65X 5.75X 5.8X 

5.31X 

4.66X 4.95X 

Application: ANSYS Mechanical* 16.0 SMP Mode 

Description: V16ln-2 Dynamic Structural Modal 

Analysis, 2M DOF. 

Availability: 


www.ansys.com 




Intel® Xeon Phi coprocessor. 

Results: At 1 thread on the host, the Intel® Xeon Phi 

coprocessor provides up to 1.72X speed-up over the 

Intel® Xeon® E5-2697 v3 host, and up to 2.06X speed-up 

over the Intel® Xeon® E5-2697 v2 host. 


Note: Intel® Xeon® processor uses 1 core per thread 






51

Comparative Increase 

HPCG* benchmark 

Conjugate gradient method 

5 

4 

3 

2 

1 

0 

1.7X 

HPCG Benchmark Speed Up 

4.48X 

3.14X 3.27X 

2.95X 

4.15X 

5.67X 

5.31X 

1 1 1 

3.04X 

1 node 4 nodes 16 nodes 

Intel® Xeon® processor E5-2697 v3 (reference) 

Intel® Xeon® processor E5-2697 v3 (optimized) 

Xeon E5-2697 v3 + Xeon Phi 7120A (symmetric) 

Xeon E5-2697 v3 + 2 Xeon Phi 7120A (offload) 

Xeon E5-2697 v3 + 2 Xeon Phi 7120A (symmetric) 

3.95X 



4.59X 

5.23X 


16 NODES APPROVED FOR PUBLIC PRESENTATION 

NEW 

Application: HPCG benchmark 

Description: The HPCG benchmark is intended to complement the 

High Performance LINPACK benchmark used in the TOP500* system 

ranking by providing a metric that better aligns with a broader set of 

important cluster applications. The workload performs 50 iterations of 

CG method with Gauss-Seidel preconditioner and synthetic multigrid 

V-cycle. 

Availability: 

• Code: Available here. Intel Optimized Technical Preview for HPCG 

benchmark available by subscription. Contact intelmkl@intel.com. 


Usage Model: 

HPCG is a hybrid (MPI + OpenMP*) code supporting symmetric and 

offload modes. The reference version is pure MPI code running one 

rank per core. 

Highlights: 

Symmetric mode provides performance benefit by freeing CPU cores 

used by data transfer in offload for computations. 

Results: 

Optimized version demonstrated 1.7X performance improvement 

compared to reference implementation and 4.48X with 2 Intel® Xeon 

Phi single node. 


SOURCE: INTEL MEASURED RESULTS AS OF MAY, 2015 




52


Sandia Mantevo* 

MiniFE-2.0.1 

2 

1 

0 

1 

Sandia Mantevo* MiniFE 2.0.1 Speed Up 

1.1X 

2.14X 

2.58X 

1 node 8 nodes 



Xeon E5-2697 v3 + 1 Intel® Xeon Phi coprocessor 7120A 



1 

1.2X 

2.23X 

2.49X 


8 NODES 

Application: Sandia Mantevo* MiniFE 2.0.1. 

Description: Sandia National Laboratories; Intended to be 

the best approximation to an unstructured implicit finite 

that includes all important computational phases. More at 

https://mantevo.org. 

Availability: 




NEW 

Usage Model: Baseline is the Intel® Xeon® processor E5- 

2697 v2 host, and the host and the Intel® Xeon Phi 

coprocessor in a symmetric mode. 

Highlight: Porting ease using OpenMP*. The Intel® MPI 

Library enables rapid performance improvement when 

adding an Intel Xeon Phi coprocessor. 

Results: Up to 2.58X speed up for the Intel® Xeon® 

processor E5-2697 v3 and the Intel® Xeon Phi 

coprocessor 7120A in symmetric mode over the host. 


SOURCE: INTEL MEASURED RESULTS AS OF APRIL, 2015 




53


2 

1 

0 

MSC Software* 

Nastran* 2016 Alpha 

1 


MSC Software Nastran* 2016 Alpha Speed Up 

1.97X 2X 

1 

1 2 4 8 

Cores 


1.61X 

1 1 

2S Xeon E5-2697 v2 + 1 Intel® Xeon Phi coprocessor 7120P 

1.3X 


1 NODE 

Application: MSC Software Nastran* 2016 Alpha. 

Description: MSC Software Nastran sparse direct solver optimized 

for Intel® Xeon Phi coprocessors. Intel Xeon Phi coprocessors are 

especially well suited for large models that consist primarily of 

three-dimensional elements and have tightly-coupled multiphysics, 

such as noise, vibration, and harshness (NHV) studies. 

Availability: 

• Code: Contact an MSC Software representative. 

• Recipe: Use hStreams* with MPSS 3.5 or later. 


NEW 

Usage Model: When large matrices are split into submatrices for 

processing, the solver offloads the largest submatrices from the 

Intel® Xeon® processor to the Intel® Xeon Phi coprocessor. 

Performance is maximized through asynchronous compute 

capabilities, which allow matrix computations on the coprocessor to 

start almost instantly once the data offload begins. 

Highlight: This strategy has demonstrated performance gains 

across a range of solution sequences for MSC Nastran 2016 Alpha, 

including structural statics and modal decomposition, which 

account for the majority of customer workloads. 

Results: Up to 2X speed up for the Intel® Xeon® processor E5-2697 

v2 and the Intel® Xeon Phi coprocessor 7120A in offload mode 

over the host. 




54


Autodesk Maya* 2014 Viewport Plugin 

Embree-based workloads 


NEW 

2 

1 

0 

Embree-Based Viewport Plugin for Autodesk Maya* 2014 

Speed Up 

1.68X 

1.23X 

Chinese Dragon 

with translucent 

material 

1.67X 

1.2X 

Full-screen Crytek 

Sponza simple 

materials + 

textures 

Power Plant 


2S Xeon E5-2697 v2 + Xeon Phi 7120A 


1.8X 1.77X 

1.58X 


Rungholt 

2.8X 

Application: Embree-Based Viewport Plugin for Autodesk Maya* 2014 

Description: An open-source, proof-of-concept Autodesk Maya 2014 

viewport plugin which renders the scene being interactively edited using 

the Embree high-performance ray-tracing kernels and a modified version 

of the Embree sample path-tracer. 

Availability: 

• Code: Available here (branch “mayarender_v2.3.2”). 


Usage Model: 

This plugin can do computation using only the host, only the coprocessor 

via offload, and on both the host and coprocessor via offload. It can also 

make use of multiple coprocessors. The plugin is invoked every time the 

screen updates during interactive content authoring. 

Highlights: 

• Workstation application for both Linux* and Windows* 

• Makes use of the high-performance Embree ray-tracing kernels, which 

are optimized for both Intel® Xeon® processors and the Intel® Xeon Phi 

coprocessor 

Results: 

Dual Intel® Xeon® processor E5-2697 v2 + 1 Intel Xeon Phi coprocessor 

7120A hybrid mode improves performance by up to 2.8X over the dual 

Intel® Xeon® processor E5-2697 v2 host alone. 







55


2 

1 

0 

Autodesk Maya* 2014 Viewport Plugin 

Selected workloads 

Selected Workloads Viewport Plugin for Autodesk Maya* 2014 

Speed Up 

1.4X 

1.69X 

Chinese Dragon 

with translucent 

material 

1.35X 

1.82X 

Full-screen Crytek 

Sponza simple 

materials + 

textures 

1.6X 

1.43X 

Power Plant 




Rungholt 

“Xeon E5-2697 v2 / v3” = Intel® Xeon® processor E5-2697 v2 / v3 


1.44X 

2.51X 


NEW 

Application: Embree-Based Viewport Plugin for Autodesk Maya* 2014 

Description: An open-source, proof-of-concept Autodesk Maya 2014 

viewport plugin which renders the scene being interactively edited using 

the Embree high-performance ray-tracing kernels and a modified version 

of the Embree sample path-tracer. 

Availability: 

• Code: Available here (branch “mayarender_v2.3.2”). 


Usage Model: 

This plugin can do computation using only the host, only the coprocessor 

via offload, and on both the host and coprocessor via offload. It can also 

make use of multiple coprocessors. The plugin is invoked every time the 

screen updates during interactive content authoring. 

Highlights: 

• Workstation application for both Linux* and Windows* 

• Makes use of the high-performance Embree ray-tracing kernels, which 

are optimized for both Intel® Xeon® processors and the Intel® Xeon Phi 

coprocessor 

Results: 

Dual Intel® Xeon® processor E5-2697 v3 + 1 Intel Xeon Phi coprocessor 

7120A hybrid mode improves performance by up to 2.51X over the dual 

Intel® Xeon® processor E5-2697 v2 host alone. 






56


OpenLB* 

Cylinder2d 

1 NODE 


1 

0 

OpenLB* Speed Up 

1 1X 

1.53X 

2S Intel® Xeon® processor E5-2697 v2, baseline OpenLB* (12 Ranks) 

2S Intel Xeon processor E5-2697 v2, optimized OpenLB (12 Ranks) 

2S Xeon E5-2697 v2 (12 Ranks) + Xeon Phi 7120A (12 Ranks * 20 thread per 

Rank) 



Application: OpenLB* 0.8r0. Workload is cylinder2d, which is 

based on D2Q9 lattice mode and comes from the example in the 

source code package; the size is 16384*4096. 

Description: The OpenLB is a software package designed to set 

up an open source implementation of Lattice Boltzmann 

methods (LBM) in an object-oriented framework. More at 

http://optilb.com/openlb/. 

Availability: 



Usage Model: Hybrid MPI + OpenMP* on Intel® Xeon Phi 

coprocessor 7120A and MPI on Intel® Xeon® processor E5-2697 

v2 using symmetric mode. 

Highlights: Pure MPI mode shows the best performance on the 

host processor. For the coprocessor, MPI + OMP hybrid mode 

produces the best performance. 

Results: The performance shows up to 1.53X speed up in 

symmetric mode over the baseline Intel® Xeon® processor E5- 

2697 v2. 


SOURCE: INTEL MEASURED RESULTS AS OF OCTOBER, 2014 




57

Data Center Server Demo: 


1 NODE 


NEW 

Intel Better Together 

Hardware 

Intel® Xeon Phi coprocessor 7120P 


Intel® Solid-State Drives 

Intel® True Scale Fabric 


Fujitsu CELSIUS* workstation 

Software 

Intel® Solutions for Lustre* software 

The Fujitsu HPC Gateway* is an extendable Web desktop that provides easy access to HPC capabilities. This demo 

features the Intel® Xeon Phi coprocessor 7120P and the Intel® Xeon® processor E5-2600 v3 powering this integrated 

and intuitive HPC environment that can be adapted by organizations to meet specific requirements, such as access 

policies and licensing. See the demo here. 



purchases, including the performance of that product when combined with other products. *Other names and brands may be claimed as the property of others 

58

CLUSTER BENCHMARKS 

New Data Center Server Demos at Intel.com/XeonE5softwaresolutions: 

These demos feature the Intel® Xeon Phi coprocessor 7120P and the Intel® Xeon® processor E5-2600 v3 

powering applications that are improving the way HPC is utilized. See the demos linked here: 

• Intel® Xeon Phi Coprocessor Code Modernization 

• Altair HyperWorks* and Intel® Cluster Ready Shortens the HPC Learning Curve 

• ANSYS* Delivers Fast, Accurate Engineering Simulations 

• CST* 3D Simulations Powered by the Intel® Xeon Phi Coprocessor 

• E4* Accelerated Bioinformatics with Intel® Xeon® and Intel® Xeon Phi 

• Fujitsu* HPC Clusters Based on Intel® Cluster Ready 


NEW 


Hardware 

Intel® Xeon Phi coprocessor 7120P 



Software 





59

Improving financial outcomes through faster simulations 

Financial Services 

60


Monte Carlo * 

European Option Pricing 

30 

25 

20 

15 

10 

5 

0 

Intel® Xeon® processor E5-2697 v2 + Java* managed code 

Intel® Xeon® processor E5-2697 v2 + C/C++ native 

Intel® Xeon Phi coprocessor 7120A + C/C++ native 

Intel® Xeon Phi coprocessor 7120A + Hadoop* 

Java: version 8 update 30+ C/C++: Intel® Parallel Composer XE 15.0 


Monte Carlo European Option Pricing Speed Up 

0.69X 

1 

8.91X 

Hadoop: Cloudera* Distribution of Hadoop 5.3+ 

31.25X 


1 NODE 

Application: Monte Carlo European option 

Description: Implements European Option Pricing using Monte Carlo. It 

compares the performance of 

1) Java* code, 2) C/C++ native code, 3) C/C++ offload accelerated code, 

4) C/C++ accelerated code using Hadoop*. 

Availability: 

• Code and Recipe: Available here. 


NEW 

Usage Model: Java (Managed Offload), Native on the host, Native and 

Accelerators, Hadoop on native and Accelerated. 

Highlights: 

• Java code can use parallelism but cannot vectorize any control flows 

• Native C/C++ code can take advantage of both vectorization & 

parallelization 

• Native Accelerated code offloads the whole workload to Intel® Xeon 

Phi Coprocessor. 

• Hadoop distribute the workload to 4 remote modes using mapreduce 

• Remote nodes accelerate the workload and send the run result back 

to the head note. 

Results: 

• Intel Java Stream can parallelize the workload but not vectorize 

• Native interface bring vectorization and parallelization to the 

workload 

• Acceleration extends the parallelism from Multicore to manycore 

• Hadoop map reduce can further distribute the application into 4 

nodes and achieve up to 31X improvement. 




61


Monte Carlo* 

Microsoft Excel*-based option pricing 

1 NODE 


NEW 

Microsoft Excel*-based Monte Carlo Option Pricing Speed Up 

2.09X 

2 

1 

1 

0 


Xeon E5-2697 v2 + Intel® Xeon Phi coprocessor 7120P 


Application: Microsoft Excel*-based Monte Carlo option pricing. 

Description: This application uses Excel to price Monte Carlo 

European Options. Traditionally these calculations are done on 

CPUs. This framework demonstrates how Excel can offload 

calculations to Intel® Xeon Phi x100 (Knights Corner) using 

Windows* stack. 

Availability: 



Usage Model: Offload, OpenMP*. No processing on the host; 

host offloads all data from Excel to the Intel® Xeon Phi 

coprocessor. All calculations were performed on the Intel Xeon 

Phi Coprocessor. 

Highlights: Excel users can speed-up their compute intensive 

applications by offloading these tasks to the Intel Xeon Phi 

Coprocessor. 

Results: Offloading calculations to Intel® Xeon Phi coprocessor 

7120P resulted in up to 2.09X performance improvement 

compared to the Intel® Xeon® processor E5-2697 v2. 






62


QuantLib* 

Single Precision Monte Carlo 

1 NODE 


NEW 

7 

6 

5 

4 

3 

2 

1 

0 

QuantLib* SP Monte Carlo Speed Up 

6.93X 

1.82X 

1.34X 

1 




Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla* K40c 


Application: QuantLib* Single Precision Monte Carlo 

Description: Monte Carlo is a popular simulation mathematical 

model to value and analyze complex financial instruments, in 

this case European options. This version is single-precision. 

More at Accelerating Financial Applications on the GPU. 

Availability: 


Usage Model: Multi-option scenario; 400k paths, 4k options. 

This model uses a time-step of 250 per path. Uses OpenMP* for 

threads and Intel® MKL Library for vector of random numbers 

Highlights: The optimized Intel® Architecture code is compared 

with the reference CUDA* version. 

Results: 

• Intel® Xeon® processor E5-2697 v3 demonstrates a speed up 

of up to 1.82X over Intel Xeon processor E5-2697 v2. 

• Intel® Xeon Phi (native) demonstrates a speed up of up to 

6.93X over Intel Xeon processor E5-2697 v2. 

• NVIDIA Tesla* K40c demonstrates a speed up of up to 1.34X 

over Intel Xeon processor E5-2697 v2 (measurements do not 

include data transfer time). 






67


QuantLib* 

Single Precision Black-Scholes 

1 

0 

QuantLib* SP Black-Scholes Speed Up 

1 

1.26X 



1.84X 


Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla* K40c 


0.69X 

Application: QuantLib* Single Precision Black-Scholes 

Description: The QuantLib library is a popular library used 

for many areas of computational finance. Black-Scholes is a 

popular mathematical model for European option valuation. 

This is a single-precision version. More at Accelerating 

Financial Applications on the GPU. 

Availability: 


Usage Model: Test Conditions: 5 million options, 2048 

iterations. Uses OpenMP* for threads and Intel® TBB for 

memory alignment. 

Highlights: The optimized Intel® Architecture code is 

compared with the reference CUDA* version. 

Results: 

1 NODE 


NEW 

• Intel® Xeon® processor E5-2697 v3 demonstrates a speed 

up of up to 1.26X over Intel Xeon processor E5-2697 v2. 

• Intel® Xeon Phi (native) demonstrates a speed up of up 

to 1.84X over Intel Xeon processor E5-2697 v2. 

• NVIDIA Tesla* K40c demonstrates a slow down of up to 

0.69X compared to the Intel Xeon processor E5-2697 v2 

(measurements do not include data transfer time). 






64


Monte Carlo* RNG 

European Options; Double Precision 

1 NODE 


2 

1 

0 

Monte Carlo* RNG European Options – Double Precision 

.72X 

1 

1.87X 




2S Xeon E5-2697 v3 + 2 Xeon Phi 7120A 



2.77X 

Application: Monte Carlo* RNG European option pricing; double 

precision. 

Description: Monte Carlo RNG is a popular derivative pricing 

benchmark widely used by investment banks such as UBS*, Bank of 

America* and Goldman-Sachs*. More at 

https://software.intel.com/en-us/articles/case-study-achieving-highperformance-on-monte-carlo-european-option-using-stepwise. 

Availability: 



Usage Model: Hybrid or Asynchronous offload on one shared 

memory host node with up to two coprocessor devices. 

Highlights: 

• One of two most computational intensive workloads in FSI 

benchmark suites. 

• More operations per option data set. 

• Benefits from Intel® AVX2 FMA. 

Results: The baseline Intel® Xeon® processor E5-2697 v3 is 1.39X 

higher performance than Intel® Xeon® processor E5-2697 v2. Adding 

one coprocessor card nearly doubles the baseline performance; 

adding two coprocessors nearly triples it. 


SOURCE: INTEL MEASURED RESULTS AS OF JUNE, 2014 




65



Double Precision 

1 NODE 


Binomial Options Double Precision 

2 

1.54X 

2X 

Application: Binomial European Option Pricing. 

Description: Binomial Options* is a popular derivative pricing 

benchmark widely used by investment banks such as UBS*, Bank 

of America* and Goldman-Sachs*. One of two most 

computational intensive workloads in FSI benchmark suites. 

1 

1 

Availability: 



0 

.55X 




2S Xeon E5-2697 v3 + 2 Xeon Phi 7120A 





Highlights: 

• More operations per option data set. 

• Benefits from Intel® AVX2 FMA. 

Results: The baseline Intel® Xeon® processor E5-2697 v3 is 

nearly double the performance of the Intel® Xeon® processor E5- 

2697 v2. Adding one coprocessor card increases the baseline 

performance by over 50%; adding two coprocessors more than 

doubles it. 


SOURCE: INTEL MEASURED RESULTS AS OF JUNE, 2014 




66


Monte Carlo* 

European Options 

1 NODE 


10 

8 

6 

4 

2 

0 

Monte Carlo* European Options Speed Up 

Single and Double Precision 

1 

1.51X 

Single Precision 

10.65X 

2S Intel® Xeon® processor E5-2670 

1 

1.4X 


4.32X 

Application: Monte Carlo* RNG European option pricing. 



of America* and Goldman-Sachs*. More here. 

Availability: 



Usage Model: Single and double precision testing, host node 

with up to two coprocessor devices. Performance depends on 

raw computational power and the performance of exp2() 

Highlights: Dramatic performance scaling for both singleprecision 

and double-precision calculations. 

Results: For single precision, the Intel® Xeon® processor E5- 

2697 v2 performance improvement is up to 1.51X compared to 

the baseline, and the 2x Intel® Xeon Phi coprocessor 

performance improvement is up to 10.65X. 


2x Intel® Xeon Phi coprocessor 7120A (pre-production HW/SW) 


SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2013 




67


Black-Scholes Formula Valuation* 


1 NODE 


2 

1 

0 

1 

Black-Scholes Speed Up 


1.36X 


2.89X 




Intel® Xeon Phi coprocessor 7120A (pre-production HW/SW) 

1 

1.65X 

2.85X 

Application: Black-Scholes* financial modeling requires 

raw computational power plus high bandwidth 

between execution cores and memory 

Description: Popular derivative pricing benchmarks widely 

used by investment banks such as UBS, Bank of America, 

Goldman-Sachs 

Availability: 



Usage Model: Baseline is the Intel® Xeon® processor E5- 

2670 host only, and speed up is shown on the Intel® Xeon® 

processor E5-2697 v2 and on both the Intel® Xeon® 

processor E5-2697 v2 and on the Intel® Xeon Phi 


Highlights: Dramatic scaling for both single and double 

precision computations. 

Results: Up to 2.85X improved performance with Intel® 

Xeon Phi coprocessor 7120A over the baseline Intel® Xeon 

processor E5-2697 v2 for the double precision benchmark. 






68


Monte Carlo* RNG 

European Options 

1 NODE 


2 

Monte Carlo* RNG European Options Speed Up 


1.64X 

2.6X 

1.55X 

1.81X 

Application: Monte Carlo* RNG European option pricing; double 

precision. 



of America* and Goldman-Sachs*. More at 

https://software.intel.com/en-us/articles/case-study-achievinghigh-performance-on-monte-carlo-european-option-usingstepwise. 

1 

1 

1 

Availability: 



Usage Model: Single and double precision testing, host node 

with up to two coprocessor devices. 

0 


2S Intel® Xeon® E5-2670 

2S Intel® Xeon® E5-2697 v2 


2x Intel® Xeon Phi Coprocessor 7120A (pre-production HW/SW) 

Highlights: Intel® Xeon Phi coprocessor fast exp2 and FMA 

instructions deliver high performance, high accuracy for single 

and double precision computations. 

Results: For double precision, the Intel® Xeon® processor E5- 

2697 v2 performance improvement is up to 1.55X compared to 

the baseline, and the Intel® Xeon Phi coprocessor performance 

improvement is up to 1.81X. 






69




1 NODE 


Binomial Options Speed Up 


1.85X 

1.85X 

Application: Binomial European Option Pricing. 

Description: Binomial Options* is a popular derivative pricing 


of America* and Goldman-Sachs*. 

1 

1 

1 

Availability: 





0 




Highlights: 

• L2 cache for IA is large enough for this algorithm, which is an 

advantage compared to GPU shared memory. Intel compiler 

based SIMD vectorization can handle certain loop carry 

dependencies while GPU has to explicitly synch. 

Results: Adding one Intel® Xeon Phi coprocessor card increases 

the baseline Intel® Xeon processor E5-2697 v2 performance by 

up to 1.85X. 

2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi 

coprocessor (pre-production HW/SW) 


SOURCE: INTEL MEASURED RESULTS AS OF DECEMBER, 2013 




70


Xcelerit* 

Libor Market Model (LMM) 

1 NODE 


Xcelerit* LMM Speed Up 

1 

1 

1.29X 

1.03X 

1 

1.29X 

1.07X 

1 

1.34X 

1.12X 

Application: Xcelerit* LMM 

Description: Used in “Fixed income” division to model forward 

interest rates. More at http://blog.xcelerit.com/intel-xeon-phivs-nvidia-tesla-gpu/. 

Availability: 

• Code: 

• Base version. 

• Optimized versions: Contact Xcelerit* at 

http://www.xcelerit.com/ Contact Names: Hicham Lahlou, 

hicham.lahlou@xcelerit.com, Jorg Lotze, 

jorg.lotze@xcelerit.com 

• Recipe: Not available. Check for future availability at here. 

0 

Paths: 128k 256k 512k 


Usage Model: Baseline is the Intel® Xeon® processor E5-2697 

v2 host only and speed up is with native Intel® Xeon Phi 

coprocessor 7120A, double precision. 

Highlights: The code has been optimized and delivered to 

Xcelerit. 

Intel® Xeon Phi coprocessor 7120A 

NVIDIA K40* 

Results: Optimized Intel Xeon Phi coprocessor is the best 

performing platform for all configurations. 






71

"Intel Xeon Phi coprocessors 

enable exciting applications; we 

are seeing a noticeable boost in 

the performance of 

computationally-intensive kernels 

of wave migration solutions 

that are vital to our partners in the 

energy field.” 

Rishi Khan 

Vice President of Research and Development, 

ET International 

Enhancing exploration and extraction processes. 

Energy Industry 

72


1 

0 

Iso3DFD* 

16th Order Isotropic Kernel 

1 

Iso3DFD Speed Up 

1.37X 

Intel® Xeon® processor E5-2697 v2 (turbo off) 

Intel® Xeon® processor E5-2697 v3 (turbo off) 

1.72X 

Intel® Xeon Phi coprocessor 7120A (native, turbo on) 

Application: Iso3DFD. 

1 NODE 


NEW 

Description: Iso3DFD - 16th order Isotropic kernel that is at the 

heart of RTM algorithm. This code computes the wave 

propagation used in seismic imaging. 

Availability: 

• Code: Intel® Phi Xeon coprocessor version available here. 

Intel® Advanced Vector Extensions 2 (Intel® AVX2) version not 

yet ready for publication 


Usage Model: OpenMP*, no MPI, native Intel® Xeon Phi 

coprocessor. Domain sizes, cache blocking sizes, prefetching 

distances are auto-tuned to provide the maximum possible 

throughput for each platform. 

Highlights: Intel® AVX2 intrinsics implementation. Auto-tuning 

used to find best set of parameters for Ivy Bridge, KNC and 

Haswell. 

Results: Intel Phi Xeon coprocessor 7120P performance is up to 

1.72X better than Intel® Xeon® processor E5-2697 v2 and up to 

1.25X better than Intel® Xeon® processor E5-2697 v3. Intel® 

Xeon® processor E5-2697 v3 performance is up to 1.37X better 

than Intel® Xeon® processor E5-2697 v2. 






73

Tflops 

3 

2 

1 

0 

Distributed Iso3DFD* 


14.7 GCells 

1 node 2 nodes 3 nodes 4 nodes 

2S Intel® Xeon® processor E5-2697 v3 + 2 Intel® Xeon Phi coprocessor 

7120P 


Distributed Iso3DFD Speed Up 

29.3 GCells 

44 GCells 

58.7 GCells 



1 NODE 

Application: Distributed version of Iso3DFD. 


NEW 

Description: One dimensional domain decomposition Iso3DFD 

(16th order Isotropic kernel). Multi-node tests; cluster performance 

and scalability of MPI implementation. 

Availability: 

• Code: Not available. 

• Recipe: Available here. Optimization guide available here. 

Usage Model: Scaling analysis with each Intel® Xeon Phi 

coprocessor in a node solving a 14GB subdomain and each pair of 

Intel® Xeon® processors solving a 10GB subdomain. 

• Symmetric MPI. One MPI process per node or device. 

• OpenMP* within each MPI process on processor or coprocessor. 

(Note: No disc I/O being performed to save seismic wave-fields.) 

Highlights: 

• Symmetric model allows host-coprocessor load balancing at MPI 

job launch time. 

• Processes running either on host or MIC can be independently 

tuned. 

• Enabled the remaining DDR3 node memory to be used for fast 

I/O and other tasks. 

Results: Linear scalability confirmed to ensure no cluster-level 

limitation. 16GB memory on the coprocessors allows high overlap 

of computations and halo exchanges, hiding the cost of 

communication with the processors and other nodes. 




74


Petrobras* 

Isotropic RTM 

5 

4 

3 

2 

1 

0 

1 

Petrobras* Isotropic RTM Speed Up 

2.2X 

3.4X 

4.6X 

2S Intel® Xeon® processor E5-2697 v2, 2 MPI/12 OMP 

2S Xeon E5-2697 v2, (2/12) + Xeon Phi 7120A, 1 MPI/244 OMP 

2S Xeon E5-2697 v2, (2/12) + 2x Xeon Phi 7120A, 1/244 





5.6X 

1 NODE 


Application: Isotropic RTM has a major role in accurate imaging 

of complex subsurface structures. 

Description: The 3D Isotropic RTM plays a major role in accurate 

imaging of complex subsurface structures. This benchmark 

measures performance and scalability of the compute kernel on 

a hybrid host + Intel® Xeon Phi coprocessor configuration. Halo 

exchanges are performed between devices. More at 

http://www.petrobras.com/en/home.htm. 

Availability: 

• Code: Proprietary. 


Usage Model: Intel® Xeon Phi coprocessor 7120A as a host in 

native mode concurrently executing with the Intel® Xeon® 

processor E5-2697 v2. Same code for both devices with 

OpenMP* and Intrinsics. 

Highlights: Scalable; competitive performance/watt 

Results: Intel® Xeon® processor populated with 4 Intel Xeon Phi 

cards show up to 5.6X scaling when compared to the baseline 



SOURCE: INTEL MEASURED RESULTS AS OF JANUARY, 2014 




75


ISO 3DFD* 


1 NODE 


1 

0 

ISO 3DFD* Kernel Speed Up 

1 


1.66X 

Application: ISO 3DFD* 16th order Isotropic kernel. 

Description: ISO 3DFD is a kernel that computes an isotropic 

wave propagation in 3D (ISO). These kernel types are the heart of 

Reverse Time Migration (RTM). RTM can also be computed with 

VTI or TTI kernels. ISO 3DFD computes the wave propagation in 

the 16th order in space and 2nd order in time. On a single Intel® 

Xeon Phi coprocessor, this kernel is able to update around 5800 

Mcells/s (350 Gflops). 

Availability: 

• Code: Not available. 


Usage Model: Native mode with intrinsics. OpenMP* but no MPI. 

Results: The Intel® Xeon Phi coprocessor increased 

performance by up to 1.66X compared to the Intel® Xeon® 


Intel® Xeon Phi Coprocessor 7120A (native 244T) 






76



DownUnder GeoSolutions* 

3,800+ NODES 


NEW 


Hardware 

Intel® Xeon Phi coprocessor 

Intel® Xeon® E5-2600 v3 processor 


Software 


DownUnder GeoSolutions Insight* is optimized with Intel® Software Development Tools for Intel® Xeon Phi 

coprocessors and the Intel® processor Xeon® E5-2600 v3, providing oil and gas industry with the tools and 

functionality required to perform standard oil and gas workflows more efficiently and to interactively manipulate 

and migrate data and to achieve results in minutes that previously took hours or days 1 . See the demo here. 




77

“The Intel Xeon Phi coprocessor 

architecture provides truly 

impressive performance, and the 

potential to easily port 

applications represents a 

significant leap ahead.” 

Speeding discovery through faster, more accurate simulations. 

Astrophysics 

Greg Peterson, 

Director, National Institute for 

Computational Sciences 

78


BerkeleyGW* 

Sigma Phase 

1 NODE 


NEW 

1 

BerkeleyGW* Sigma Phase Speed Up 

1.58X 

1.31X 1.28X 

1 

Application: BerkeleyGW* (Sigma) 

Description: BerkeleyGW is a massively parallel computational 

package for electron-excited state properties. Sigma is the 

second half of the GW code. It gives the quasiparticle selfenergies 

and dispersion relation for quasielectron and 

quasihole states. 

Availability: 

• Code: Available here (version 1.1 beta). 

• Recipe: None. The exact input used is not available to the 

public although a similar input is available. 

Usage Model: MPI+OMP. In symmetric mode, MPI ranks are 

distributed to both the host and KNC card. Each MPI rank uses 

OMP threads (Hybrid). 

0 






Highlights: Hybrid MPI+OMP allows code to be run on both 

Intel® Xeon® processors and Intel® Xeon Phi coprocessors 

without any source modifications. 

Results: Intel® Xeon® processor E5-2697 v3 speed up is up to 

1.31X, and Intel Xeon processor E5-2697 v3 + Intel® Xeon Phi 

coprocessor 7120A speed up is up to 1.58X over the baseline 

processor. 






79


ASKAP* 

tHogBomClean 

1 

0 


Cleaning Rate Throughput per Second Speed Up 

1 

1.12X 

100 iterations 



Xeon E5-2697 v2 + NVIDIA K40* 


.71X 

SOURCE: INTEL MEASURED RESULTS AS OF JUNE, 2015 

1 NODE 


NEW 

Application: Australian Square Kilometer Array Pathfinder* 

(ASKAP) tHogBomClean. 

Description: The tHogBomClean benchmark implements the 

kernel of the HogBom Clean deconvolution algorithm. This 

benchmark is quite minimal and actually omits the final step, 

convolution of the model with the clean beam, but this 

involves the similar operations to the other steps as far as 

the CPU is concerned. More here. 

Availability: 



Usage Model: Native execution on Intel® Xeon® processor 

E5-2697 v3 , Intel® Xeon Phi coprocessor 7120, and NVIDIA 

K40*, using the same codebase. 

Results: 

• The Intel Xeon processor E5-2697 v3 increased 

performance up to 1.39X compared to NVIDIA K40*. 

• The Intel Xeon Phi coprocessor 7120 increased 

performance up to 1.57X compared to NVIDIA K40. 

• The Intel Xeon Phi coprocessor 7120 increased 

performance up to 1.12X compared.to Intel Xeon 





80


ASKAP* 

tHogBomClean 

1 NODE 


1 

0 

Cleaning Rate Throughput per Second Speed Up 

1.74X 1.78X 

1 1.03X 

E5-2697 v2 Baseline E5-2697 v2 Optimized 

E5-2697 v2 + 7120A E5-2697 v2 + 7120A Turbo Off Optimized 

E5-2697 v2 + 7120A Turbo On Optimized E5-2697 v2 + NVIDIA K40* Boost Off 

E5-2697 v2 + NVIDIA K40* Boost On 

.68X 

1.07X 

“E5-2697 v2” = Intel® Xeon® processor E5-2697 v2 

“7120A” = Intel® Xeon Phi coprocessor 7120A 

1.24X 

Application: Australian Square Kilometer Array Pathfinder* 

(ASKAP) tHogBomClean. 

Description: The tHogBomClean benchmark implements 

the kernel of the HogBom Clean deconvolution algorithm. 

This benchmark is quite minimal and actually omits the final 

step, convolution of the model with the clean beam, but this 

involves the similar operations to the other steps as far as 

the CPU is concerned. More here. 

Availability: 



Usage Model: Offload using OpenMP*; host only (Intel® 

Xeon® processor E5-2697 v2) performs data initialization 

and transfers to the Intel® Xeon Phi coprocessor 7120A; all 

computing is performed by the Intel Xeon Phi coprocessor 

7120A. 

Results: The optimized, turbo on, Intel Xeon Phi 

coprocessor 7120A improved throughput speed by up to 

1.78X compared to the baseline Intel® Xeon® processor E5- 

2697 v2. 


SOURCE: INTEL MEASURED RESULTS AS OF MARCH, 2014 




81

Simulations – Fast, detailed, accurate. 

geophysics 

82


specfem3D 

300K Mesh, 1000 Time Steps 

1 

0 

1 



2S Xeon E5-2697 v3 + 1 Intel® Xeon Phi coprocessor 7120P 


specfem3D 300K Mesh Speed Up 

1.15X 58.71.16X 

GCells 


1 

SOURCE: INTEL MEASURED RESULTS AS OF JUNE, 2015 


2 NODES 


NEW 

Application: specfem3D, 300k Mesh, 1000 time steps simulation. 

Description: SPECFEM3D Cartesian simulates acoustic (fluid), 

elastic (solid), coupled acoustic/elastic, poroelastic or seismic wave 

propagation in any type of conforming mesh of hexahedra 

(structured or not.) It can, for instance, model seismic waves 

propagating in sedimentary basins or any other regional geological 

model following earthquakes. It can also be used for nondestructive 

testing or for ocean acoustics. More here. 

Availability: 

• Code: Available here. Specific version is r20645e5-2697 v3 

• Recipe: Available here. Optimization guide available here. 

Usage Model: Symmetric run on 2S Intel® Xeon® processor and 

Intel® Xeon Phi coprocessor 7120P, MPI + OpenMP* programming, 

and with a large rank count. 

Highlights: 

• Hybridized code and major hotspots were multi-threaded. 

• Performance gain with the Intel compiler flag “-xCORE-AVX2”. 

Results: 

• For 1 and 2 nodes, 1 Intel Xeon Phi coprocessor 7120P created up to 

1.16X gain compared to a 2S Intel Xeon processor E5-2697 v3. 

• For larger node counts, MPI dominates and there is no benefit using 

the Intel Xeon Phi coprocessor. 




83


3,800+ NODES 


Texas Advanced Computer Center* 


NEW 


Hardware 


Intel® Xeon® E5-2600 v3 processor 

Intel® Solid-State Drive Data Center for PCIe* 

Software 

CentOS* 6.5, Intel® MPSS 3.4.1, VNC*, Intel® 

Parallel Studio XE 2015, Intel® SPC, Intel® 

True Scale Fabric, Intel® MPI Library 

The Texas Advanced Computing Center* (TACC) combines CT scan and simulation data to analyze aquifer 

permeability in order to protect freshwater supplies in Florida, utilizing Intel® Xeon Phi coprocessors, Intel® Xeon® 

processors, and several Intel® Software products, and features visualization techniques made possible by ray 

tracing kernels. See the demo here. 




84


6,400 NODES 


Texas Advanced Computing Center* (TACC) 

1 http://www.tacc.utexas.edu/resources/hpc/stampede 

SOURCE: INTEL MEASURED RESULTS AS OF MAY, 2014 

System: TACC Stampede*, a 10 petaflop supercomputer, 

one of the largest computing systems in the world for 

open science research, became operational on January 7, 

2013. Stampede will begin a two-phase transition to the 

Intel 15 compiler and the compatible software stack on 

March 31, 2015. Please visit: Stampede Two-Phase 

Transition to Intel 15 

Status: In service. 

Workloads: Runs hundreds of applications for thousands 

of users around the world. 

Performance: 

• More than 7 petaflops using Intel® Xeon Phi 

coprocessors 1 

• More than 2 petaflops using the Intel® Xeon® processor 

E5-2600 family 1 

More Information: 

• Stampede at the Texas Advanced Computing Center 

(Produced by Dell*/Intel): 

https://www.youtube.com/watch?v=f174GjUY1K4 

• TACC HPC systems overview: 

www.tacc.utexas.edu/resources/hpc 




85

Unlock, discover, innovate. 

physics 

86


Gyrokinetic Toroidal Code* 

Princeton (GTC-P) 


4 NODES 


GTC-P 4 Node Cluster Speed Up 

1 

1 

0 

4x 2S Intel® Xeon® processor E5-2697 v2 

1.18X 

Application: The gyrokinetic toroidal code (GTC) is a 

particle-in-cell code for turbulence simulation in support 

of the burning plasma experiment ITER, the crucial next 

step in the quest for fusion energy*. GTC-P is optimized 

GTC code developed at Princeton Plasma Physics Lab. 

More here. 

Availability: 



Usage Model: Hybrid MPI + OpenMP* using symmetric 

processing. 

Results: GTC-P shows performance increase of up to 

1.18X on 4 node Intel® Xeon® processor E5-2697 v2 + 

Intel® Xeon Phi coprocessor 7120P clusters. 

4x 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor (C0- 

7120P) 






87

Increasing accuracy and timeliness of forecasts. 

Climate And Weather 

88


ROMS* 

Idealized Southern Ocean Benchmark 

1 NODE 


NEW 

ROMS Idealized Southern Ocean Benchmark Speed Up 

1.31X 

1 

1 

0 


2S Xeon E5-2697 v2 + Intel® Xeon Phi coprocessor 7120P 

“Xeon Phi 7120A” = Intel® Xeon Phi coprocessor 7120P 

Application: ROMS idealized southern ocean benchmark; ROMS 3.6+ 

(svn version 709) 

Description: ROMS is a free-surface, terrain-following, primitive 

equations ocean model that employs a split-explicit time-stepping 

scheme with special treatment and coupling between barotropic (fast) 

and baroclinic (slow) modes. The benchmark input file is 

“ocean_benchmark3.in” (distributed with publically available ROMS 

source code), modified to increase horizontal resolution in X and Y. See 

more at https://www.myroms.org/ 

Availability: 

• Code: Available here. Patch available here (commit 37b3c86 applied). 


Usage Model: Baseline is an Intel® Xeon® processor host only and 

speedup is shown with symmetric processing on the Intel Xeon 

processor the Intel® Xeon Phi coprocessor. Double precision code 

across the platforms. 

Highlights: The code has been optimized and will be available to the 

ROMS community as an update patch. 

Results: Symmetric process demonstrated 1.31X improved 

performance with the Intel Xeon processor E5-2697 v2 (host) and the 

Intel Xeon Phi coprocessor 7120A over the baseline host. 






89


Weather & Research Forecast (WRF*) 

WRFV3.6 CONUS2.5KM 


4 NODES 


1 

0 

1 

WRF V3.6 CONUS2.5KM Speed Up 

4 Node Clusters 

1.27X 1.3X 


2S Xeon E5-2697 v2 + Xeon Phi 7120P 


2S Xeon E5-2697 v3 + Xeon Phi 7120P 

1.94X 



“Xeon Phi 7120P” = Intel® Xeon Phi coprocessor 7120P 

Application: Weather & Research Forecast Model (WRF*) V3.6 

Description: WRF Model is a numerical weather prediction 

system designed to serve atmospheric research and 

operational forecasting needs. More at here. 

Availability: 



Usage Model: Baseline is the Intel® Xeon® processor E5-2697 

v2 host only, and speed up is shown with on both the host and 

on the Intel® Xeon Phi coprocessor 7120P, and on the Intel® 

Xeon® processor E5-2697 v3 + Intel® Xeon Phi coprocessor 

7120P. Performance shown is for released code. 

Highlights: The code had been optimized since 2011, delivered 

to the community, and available for download from the 

community site. 

Results: Symmetric process demonstrated up to 1.94X 








90


NASA* 

OVERFLOW 2.2g 

1 NODE 


NASA OVERFLOW* 2.2g Speed Up 

1.2X 

1 

1 

0 


2S Intel® Xeon® E5-2697 v2 + Intel® Xeon Phi Coprocessor 7120P 



Application: NASA OVERFLOW* 2.2g; fluid flow using 

overset grids, 

Description OVERFLOW 2.2g is a three-dimensional 

time-marching implicit Navier-Stokes code that can also 

operate in two-dimensional or axisymmetric mode. The 

code uses structured overset grid systems. More at 

http://people.nas.nasa.gov/~pulliam/Overflow/ 

Overflow_Manuals.html 

Availability: 

• Code: Available here. Xeon Phi optimization (but not 

build setup) is included in standard source code 

available from NASA 

• Recipe: Not available. 

Usage Model: MPI/OpenMP* symmetric processing 

distributed across an Intel® Xeon® processor and an Intel® 

Xeon Phi coprocessor. 

Results: Symmetric process demonstrated up to 1.2X 






91


NOAA NIM* 

Dynamic G5 

1 NODE 


1 

0 

NOAA NIM Dynamic Core* G5 Speed Up 

1 1.04X 



Titan: AMD Opteron* 6274 (16 cores, 2.2 GHz) processor, a NVIDIA Tesla* K20X GPU 

2S Xeon E5-2697 v2 + Xeon Phi 7120A (symmetric) 

.98X 



1.49X 

Application: Non-Hydrostatic Icosahedral Model 

(NOAA NIM Dynamic Core*). NIM is available from the 

NOAA Earth Systems Research Lab. 

Description: Next generation model that is being 

evaluated across CPU, GPU and MIC and focused on 

next generation high resolution weather (especially 

severe storm) forecasting. Benchmark is “Dynamic” 

core of the NOAA NIM weather simulator. More at 

https://earthsystemcog.org/projects/dcmip- 

2012/nim. 

Availability: 

• Code: Contact NOAA/ESRL 


Usage Model: Release 1598 with the G5 workload. G5 

resolution; K96 configuration. Symmetric processing 

with Intel® Xeon® processor E5-2697 v2 (host) and 

Intel® Xeon Phi coprocessor 7120A. 

Results: Symmetric process demonstrated up to 

1.49X improved performance over the baseline Intel® 

Xeon® processor E5-2697 v2. 






92

Improving speed and quality through digital design 

Digital Content Creation 

93


40 

30 

20 

10 

Embree 2.2 Pathtracer Renderer 

Ray Tracing 

0 

3.69X 

2.15X 

1X 

Bentley 

(2.3M Tris) 

Frames Per Second 

1024x1024 Image Resolution 

3.63X 

2.2X 

1X 

Crown 

(4.8M Tris) 

3.48X 

2.13X 

1X 

Dragon 

(7.4M Tris) 

Intel® Xeon® Processor E5-2699 v3 

Karst Fluid 

Flow 

(8.4M Tris) 

Intel® Xeon Phi Coprocessor 7120A 

Nvidia GeForce GTX Titan X* 

3.4X 

2X 

1X 

2.73X 

1.54X 

1X 

Power Plant 

(12.8M Tris) 

1 NODE 


NEW 

Application: Embree 2.2 Pathtracer Renderer using Embree and 

Optix* RT Libraries 

Description: Pathtracing app for use in benchmarking RT libraries 

Availability: 

• Code: Available here 

• Recipe: embree.github.io; http://embree.github.com 

Usage Model: 

• Native on Intel® Xeon® processor, COI-based offload on Intel® 

Xeon Phi coprocessor. Intel® TBB, ISPC and Intrinsics 

Highlights: 

• Typical ray tracing rendering pipeline code used throughout DCC 

to show comparative performance on different types of hardware 

with a variety of input 3D data models. Optimized for all Intel ISAs 

and delivered to the community and available for download. 

• End-User benefits: Ability to achieve competitive performance and 

the flexibility of IA for rendering and render farm applications 

Results: 

• Embree on Intel® Xeon Phi 7120A coprocessor performs up to 

2.15X faster than Nvidia 

– Path tracing causes low SIMD utilization (better for Xeon) 

– Xeon® E5-2699 v3 is latest silicon while Xeon Phi 7120A now 

2.5 years old 


SOURCE: INTEL MEASURED RESULTS AS OF AUGUST, 2015 




94


Embree 2.2 

Morton Build 

1 NODE 


NEW 

1 

0 

1X 

Morton Build 

Million Triangles/Second 

1.42X 

1.3X 

Crown Bentley Car Dragon 

Intel® Xeon® Processor E5-2699 v3 


1.54X 

Application: Embree 2.2 Morton build 

Description: High performance ray tracing kernel library 

required to calculate photorealistic images; Morton build 

performance for dynamic scenes. Build for typical model 

sizes of a couple millions of triangles possible in a fraction 

of a second using Embree. 

Availability: 

• Code: embree.github.io; http://embree.github.com 

• Recipe: embree.github.io. 

Usage Model: 

• Native on Intel® Xeon® processor, COI-based offload on 

Intel® Xeon Phi coprocessor. Intel® TBB, ISPC and 

Intrinsics 

Highlights: Low-quality Morton build even faster and 

enables handling of dynamic models with 1M triangles. 

Targets application developers in professional rendering 

environment (e.g. Autodesk*, Dreamwork*, Pixar*, etc.) 

Results: 

• Embree on Intel® Xeon Phi 7120A coprocessor up to 

1.54X faster than 2S Intel® Xeon® E5-2699 v3 processor 


SOURCE: INTEL MEASURED RESULTS AS OF AUGUST, 2015 




95


Embree* 2.0 

Ray Tracing Kernels 

1 NODE 


Embree* 2.0 Ray Tracing Speed Up 

1.89X 

Application: Embree 2.0 

1 

1 

1.49X 

Description: Embree 2.0 is a simple ray tracing framework 

with optimized kernels for Intel® Xeon® processors and 

Intel® Xeon Phi coprocessors. It is useful for digital 

content creation (DCC), mechanical computer aided design 

(CAD), industrial design, and scientific visualization. More 

at http://embree.github.io/. 

Availability: 



Usage Model: 4M (million) triangles scene, ambient 

occlusion. 

0 


Results: Up to 1.89X improved performance with the Intel® 

Xeon Phi coprocessor compared to the baseline Intel® 

Xeon® processor E5-2670. 








96

Intel® Software Development Tools & 3RD Party Support for Developers 

and System Administrators 

97

What’s Inside: 

Intel® Parallel Studio XE 2016 

Composer Edition Professional Edition Cluster Edition 

What it does: 

Build fast code using 

industry leading compilers 

and libraries including new 

data analytics library 

Adds analysis tools 

Adds MPI cluster tools 

What’s included: 

• C++ and/or Fortran 

compilers 

• Performance libraries 

• Parallel models 

Composer edition + 

• Performance profiling 

• Threading design/prototyping 

& vectorization advisor 

• Memory & thread debugger 

• Data analytics acceleration 

library 

Professional edition + 

• MPI cluster 

communications library 

• MPI error checking and 

tuning 

98

Features and Configurations 


Composer Edition Professional Edition Cluster Edition 

Intel® C++ Compiler 

Intel® Fortran Compiler 

Intel® Data Analytics Acceleration Library 

Intel® Threading Building Blocks 

Intel® Integrated Performance Primitives 

Intel® Math Kernel Library 

Intel® Cilk Plus & Intel® OpenMP* 

Bundle or Add-on: 

Rogue Wave IMSL* Library 








Intel® Advisor XE 

Intel® Inspector XE 

Intel® VTune Amplifier XE 

Add-on: 









Intel® Advisor XE 

Intel® Inspector XE 

Intel® VTune Amplifier XE 

Intel® MPI Library 

Intel® Trace Analyzer and Collector 

Intel® Cluster Checker preview (Linux only) 

Add-on: 


Additional configurations including, floating and academic, are available at: http://intel.ly/perf-tools 

99

Visual C++* 

2015 

Intel 16.0 

GCC 

5.2.0 

Intel 16.0 

Visual C++* 

2015 

Intel C++ 

16.0 

GCC 

5.2.0 

Intel C++ 

16.0 

PGI 

Fortran* 

15.3 

Absoft* 

15.0.1 

Intel Fortran 16.0 

PGI 

Fortran* 

15.3 

gFortran* 

5.1.0 

Open64* 

4.5.2 

Absoft* 

15.0.1 

Intel Fortran 16.0 

Performance without Compromise 

Intel® C++ and Fortran Compilers on Windows*, Linux* & OS X* 

Boost C++ application performance 

on Windows* & Linux* using Intel® C++ Compiler 

(higher is better) 

Floating Point 

Integer 

1.51 

1.00 1.30 1.00 1.00 

1.00 

1.24 

1.51 

Boost Fortran application performance 

on Windows* & Linux* using Intel® Fortran Compiler 

(higher is better) 

1.00 

1.33 

1.88 

1.00 

1.07 

1.09 

1.32 

1.64 

Windows Linux Windows Linux 

Estimated SPECfp®_rate_base2006 Estimated SPECint®_rate_base2006 

Relative geomean performance, SPEC* benchmark - higher is better 

0.00 

Windows 

Linux 

Relative geomean performance, Polyhedron* benchmark– higher is better 

Configuration: Windows hardware: HP DL320e Gen8 v2 (single-socket server) with Intel(R) Xeon(R) CPU E3-1280 v3 @ 3.60GHz, 32 GB RAM, HyperThreading is off; 

Linux hardware: HP BL460c Gen9 with Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 256 GB RAM, HyperThreading is on. Software: Intel C++ compiler 16.0, Microsoft (R) 

C/C++ Optimizing Compiler Version 19.00.23026 for x86/x64, GCC 5.2.0. Linux OS: Red Hat Enterprise Linux Server release 7.1 (Maipo), kernel 3.10.0-229.el7.x86_64. 

Windows OS: Windows 8.1. SPEC* Benchmark (www.spec.org). 

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and 

MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results 

to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that 

product when combined with other products. * Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation 

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not 

unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not 

guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessordependent 

optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture 

are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific 

instruction sets covered by this notice. Notice revision #20110804 . 

Configuration: Hardware: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz, HyperThreading is off, 16 GB RAM. Software: Intel Fortran compiler 16.0, Absoft*15.0.1,. PGI 

Fortran* 15.3, Open64* 4.5.2, gFortran* 5.1.0. Linux OS: Red Hat Enterprise Linux Server release 7.0 (Maipo), kernel 3.10.0-123.el7.x86_64. Windows OS: Windows 7, 

Service pack 1. Polyhedron Fortran Benchmark (www.fortran.uk). Windows compiler switches: Absoft: -m64 -O5 -speed_math=10 -fast_math -march=core -xINTEGER 

-stack:0x80000000. Intel® Fortran compiler: /fast /Qparallel /link /stack:64000000. PGI Fortran: -fastsse -Munroll=n:4 -Mipa=fast,inline -Mconcur=numa. 

Linux compiler switches: Absoft -m64 -mavx -O5 -speed_math=10 -march=core -xINTEGER. Gfortran: -Ofast -mfpmath=sse -flto -march=native -funroll-loops -ftreeparallelize-loops=4. 

Intel Fortran compiler: -fast –parallel. PGI Fortran: -fast -Mipa=fast,inline -Msmartalloc -Mfprelaxed -Mstack_arrays -Mconcur=bind. Open64: - 

march=bdver1 -mavx -mno-fma4 -Ofast -mso –apo. 











100

Speedup 

Turn Big Data Into Information Faster with 


Advanced analytics algorithms supporting all data analysis 

stages. 

Business 

Scientific 

Engineering 

Web/Social 

Pre-processing 

• Decompression 

• Filtering 

• Normalization 

Transformation 

• Aggregation 

• Dimension 

Reduction 

Analysis 

• Summary 

Statistics 

• Clustering. 

Modeling 

• Machine 

Learning 

• Parameter 

Estimation 

• Simulation 

Validation 

• Hypothesis 

testing 

• Model 

errors 

Decision Making 

• Forecasting 

• Decision Trees 

• Etc. 

8 

Designed and 

Built by Intel 

to 

Delight 

Data Scientists 

PCA Performance Boost 

Using Intel® DAAL vs. Spark* MLLib 

Simple to incorporate object-oriented APIs for C++ and Java 

Easy connections to: 

• Popular analytics platforms (Hadoop, Spark) 

• Data sources (SQL, non-SQL, files, in-memory) 

6 

4 

2 

0 

4X 

6X 

Configuration Info - Versions: Intel® Data Analytics Acceleration Library 2016, CDH v5.3.1, Apache Spark* v1.2.0; Hardware: Intel® Xeon® Processor E5-2699 v3, 2 

Eighteen-core CPUs (45MB LLC, 2.3GHz), 128GB of RAM per node; Operating System: CentOS 6.6 x86_64. PCA normalized input. 











6X 

1M x 200 1M x 400 1M x 600 1M x 800 1M x 1000 

Table Size 

7X 

7X 

101

Memory Access Analysis 

New! Intel® VTune Amplifier 2016 

New! 

Tune data structures for better performance 

• Attribute cache misses to data structures 

Bandwidth Analysis for Non-Uniform Memory 

• See Read & Write contributions to 

Total Bandwidth 

• Easier tuning of multi-socket bandwidth 

Seeing total bandwidth can suggest data blocking opportunities 

to change a bandwidth bound app into a compute bound app. 

102

Scalable Profiling for MPI and Hybrid Clusters with 

MPI Performance Snapshot 

Lightweight – Low overhead 

profiling up to 32K Ranks 

Scalability- Performance 

variation at scale can be 

detected sooner 

Identifying Key Metrics – 

Shows PAPI counters and 

MPI/OpenMP* imbalances 

103

Altair RADIOSS* and PBS Professional* 

Finite Element Analysis (FEA) / HPC Workload Management 

RADIOSS 

First evaluation of FEA code with Intel® Xeon Phi coprocessors 

• Speed time to results for computer aided engineering (CAE) 

simulations with an optimized solver 

• Perform linear and non-linear simulations of structures, 

fluids, fluid-structure interaction, sheet metal stamping, and 

mechanical systems 

• Extend value through tight integration with the Altair 

HyperWorks CAE software platform 

PBS Professional 

Schedule jobs for Intel Xeon Phi coprocessors 

• Improve resource utilization with a commercial grade HPC 

workload management solution 

• Improve value with advanced support for green provisioning, 

security, and policy-driven job scheduling 

• Get more information: Video, press release, white paper, 

configuration toolkit 

“Through our tight collaboration with Intel, we were able 

to test first-generation Intel Xeon Phi coprocessors at a 

very early stage and learned a great deal about how to 

optimize our solvers for Intel’s manycore architecture. 

We are eager to port our HPC solvers, such as RADIOSS, 

to the next-generation Intel Xeon Phi processor, which 

looks really promising for delivering cutting-edge, high 

performance to our customers.” 

Eric Lequiniou 

HPC Director, Altair 

See the: 

• Solution Brief 

• Case Study 

* Other names and brands may be claimed as the property of others. 

104

Bright Cluster Manager* 

Advanced Cluster Management Made Easy 

Bright Cluster Manager: An easy to use, integrated software solution 

that transforms the way clusters of all kinds are deployed and 

managed. 

Bright plus Intel® Xeon Phi coprocessor: Get everything you 

need to enable Intel Xeon Phi coprocessors in a cluster. 

• Deploy a software environment that works "out of the box“ 

• Bare metal provisioning, powerful user interface, comprehensive monitoring 

• Reduce the time, effort, and cost associated with building and operating an HPC 

cluster 

Simplified management: For clusters that include Intel Xeon Phi 

coprocessors 

• Install, manage and use clusters of any size, with no need for Linux* knowledge. 

• Configure, schedule, monitor, and manage Intel Xeon Phi coprocessors in your 

cluster – with ease. 

“Bright Cluster Manager and 

Intel Xeon Phi provide full 

cluster management. 

Deploying, using, and 

optimizing your cluster just got 

easier.” 

Martijn de Vries 

CTO at Bright Computing 


105

106

Intel® Xeon Phi Coprocessor Developer Site 

• Architecture, setup, and programming resources 

• Self-guided training 

• Case studies 

• Codes and Recipes 

• Information on tools and ecosystem 

• Support through community forum 

View at: http://software.intel.com/mic-developer/ 

107

Modern Code Developer Community 

A global online community for learning and sharing with experts 

software.intel.com/moderncode 

Developer Zone 

• Modern Code Zone 

• Software Tools, training webinars 

• How-to guides, parallel programming BKMs 

• Remote Access to hardware 

• Support Forums 

Topics 

• Vectorization/single instruction, multiple data (SIMD) 

• Multi-threading 

• Multi node/clustering 

• Take advantage of on-package high-bandwidth memory. 

• Increase memory and power efficiency. 

Experts 

• Black Belts, & Intel Engineer experts 

• Technical Content, Training -webinars, F2F; forum support 

• Conference and Tradeshows: Keynotes, Presentations, 

BOFs, Demos, Tutorials 

108

Intel® Developer Zone 

Join us on Social Media 

@intelsoftware 

Intel Developer Zone 

Intel Software TV 

@IntelDeveloperZone 

software.intel.com 

109

Intel® Software Development Tools 

For The Intel® Xeon Phi Coprocessor 


• Information: http:\\software.intel.com\en-us\intel-parallel-studio-xe 

Intel® System Studio 

• Information: https://software.intel.com/en-us/intel-system-studio 

• Video: https://www.youtube.com/watch?v=jraWWlc21gc 

Intel® Integrated Native Developer Experience (Intel® INDE) 

• Information: https://software.intel.com/en-us/intel-inde 

• Video: https://www.youtube.com/watch?v=PAGauYOLNGI&index=2&list=PLg-UKERBljNwPClBBuJvgG1Wl_P0iQXsQ 

Intel® Cluster Ready 

• https://software.intel.com/cluster-ready 

James Reinders: Capitalize on Parallelism with Intel® Xeon Phi coprocessors 

• youtube.com/watch?v=g9ehO6duNuE&list=UUH5Rft7GYM8KZpxA-4Ohihg&index=9&feature=plcp 


110

Recommended Links 

Getting Started: 

• An Overview of Programming for Intel® Xeon® processors and Intel® Xeon Phi coprocessors 

• Intel® Xeon Phi Coprocessor Developer’s Quick Start Guide 

• http://software.intel.com/mic-developer: 

The Training tab has Beginner and Advanced workshop videos, and links to past/future webinars. 

The Tools and Downloads tab has a link to Intel and Third Party Tools and Libraries. 

− This page has links to available beta and production for developers. 

Vectorization: 

• http://software.intel.com/en-us/intel-vectorization-tools 

• Check out the 6 steps in the toolkit! 

Performance: 

• Life Sciences 

• Manufacturing 

• Financial Services 

• Energy 

• Physics 

• Digital Content Creation 

• Weather 

111

Hardware Configuration - Intel® Xeon® processor E5-2697 v2 

Platform 

CPU/Stepping 

Coprocessor 

Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi coprocessor 7120A Intel® Xeon® processor E5-2697 v2 + NVIDIA K40* 

Intel® Server System R2208GZ4GC platform, 2U chassis, hot-swap drives, 24 

DIMMs, 1 750W Redundant Power Supply 

Intel® Xeon® processor E5-2697 v2, 2.7 GHz , 12 core, 8GT/s dual QPI links, 

130 W, 3.5GHz Max Turbo Frequency, 768kB instr L1 / 3072kB L2 / 30MB L3 

cache 

Intel® Xeon Phi coprocessor 7110 and 7120; 61 cores, 1.1 and 1.238 GHz, 

ECC enabled, TURBO disabled. Software Details: MPSS version - 2.1.6720- 

13/16/19, Flash version - 2.1.03.0386 

Intel® Server System R2208GZ4GC platform, 2U chassis, hot-swap drives, 24 

DIMMs, 1 750W Redundant Power Supply 

Intel® Xeon® processor E5-2697 v2, 2.7 GHz , 12 core, 8GT/s dual QPI links, 

130 W, 3.5GHz Max Turbo Frequency, 768kB instr L1 / 3072kB L2 / 30MB L3 

cache 

Nvidia K40c*, 2880 CUDA* Cores, 12GB memory, ECC enabled, Boost 

disabled. Software Details: CUDA Version 5.5 

Memory 64GB total 8*8GB 1600MHZ Reg ECC DDR3 64GB total 8*8GB 1600MHZ Reg ECC DDR3 

Chipset 

Rev 4.6 

SE5C600.86B.99.99.x069.071520130923 

Rev 4.6 

SE5C600.86B.99.99.x069.071520130923 

HDD Specs SEAGATE* ST9600205SS (scsi), 1x600 GB SAS HDD 10kRPM SEAGATE* ST9600205SS (scsi), 1x600 GB SAS HDD 10kRPM 

OS RHEL* 6.4 RHEL* 6.4 

MSC Nastran 2016 Alpha: Testing conducted by MSC Software on MSC Nastran* 2016 Alpha release, SOL 101. Baseline configuration: Intel® Xeon® processor E5-2679 v2 (2.6 GHz, 12 

cores, 30 MB cache), 128 GB (8 x 16) DDR3 memory @ 1.6 GHz, 4 TB HDD (SATA @ 6 GB/s 5900 RPM), Intel® Math Kernel Library (Intel® MKL) 11.2.3, Intel® Manycore Platform Software 

Stack (Intel® MPSS) 3.5. Test configuration: Identical to baseline configuration with the addition of one Intel® Xeon Phi coprocessor 7120A. 


104

Hardware Configuration – Intel® Xeon® processor E5-2697 v3 



Platform Intel® Server System R2208GZ4GC Intel® Xeon® processor E5-2697 v3, Grantley platform 

CPU 

Intel® Xeon® processor E5-2697 v2 2.7 GHz , Dual socket 12 core, 8GT/s dual QPI links, 

130 W, 3.5GHz Max Turbo Frequency 768kB instr L1 / 3072kB L2 / 30MB L3 cache 

RAM 64 GB total/node, 8*8GB 1600MHz Reg ECC DDR3 8x 8GB 2133 DDR4 DIMMs 

Intel® Xeon® processor E5-2697 v3 @ 2.60GHz / 9.6 GT/S, 145W, 14 core, Intel Smart 

Cache 35 MB 

Hard drive SEAGATE* ST9600205SS (scsi), 1x600 GB SAS HDD 10kRPM SSDSA2B220, 200GB, Intel® SSD DC S3500 Series 800GB 

Intel® Xeon Phi 

Coprocessor 

Cluster File 

System Abstract 

Nvidia* 

Coprocessor 

OS / Kernel / IB 

stack 

ECC Mode: Enabled; Coprocessor Stepping: C0, Board SKU: C0PRQ-7120 P/A/X/D, 

Frequency: 1.24GHz and Coprocessor Stepping: B1, Turbo On/Off documented within 

measurements, SKU used documented within measurements 

Panasas (60 TB storage), IEEL Lustre DDN (49 TB storage), IEEL Lustre SSD (32 TB 

storage), Firmware version 5.5.0.b-1067797.15 

Nvidia K40c*, 2880 CUDA* Cores, 12GB memory , ECC enabled, Boost disabled, 

Software Details: CUDA Version 5.5 

Red Hat Enterprise Linux Server* release 6.5 / 2.6.32-358.6.2.el6.x86_64.crt1 / OFED 

3.5-2-MIC-rc1 

ECC Mode: Enabled; Coprocessor Stepping: C0, Board SKU: C0PRQ-7120 P/A/X/D, 

Frequency: 1.24GHz and Coprocessor Stepping: B1, Turbo On/Off documented within 

measurements, SKU used documented within measurements 

Panasas (60 TB storage), IEEL Lustre DDN (49 TB storage), IEEL Lustre SSD (32 TB 

storage), Firmware version 5.5.0.b-1067797.15 

Nvidia K40c*, 2880 CUDA* Cores, 12GB memory, ECC enabled, Boost disabled, 

Software Details: CUDA Version 5.5 

Red Hat Enterprise Linux Server* release 6.5 / 2.6.32-358.6.2.el6.x86_64.crt1 / OFED 

3.5-2-MIC-rc1,IFS-7.2.2.0.8 


113

LEGAL DISCLAIMERS 

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY 

THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED 

WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY 

PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. 

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer 

systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating 

your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance 

Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each 

of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported. 

Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where 

similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase. 

Intel® Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology performance varies depending on hardware, software and overall system 

configuration. Check with your platform manufacturer on whether your system delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost 

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor series, not across different processor sequences. See 

http://www.intel.com/products/processor_number for details. Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications. All dates and 

products specified are for planning purposes only and are subject to change without notice 

Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product roadmaps. Product plans, dates, 

and specifications are preliminary and subject to change without notice 

Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at 

less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration 

and you can learn more at http://www.intel.com/go/turbo. 

For more information go to http://www.intel.com/performance 

Any difference in system hardware or software design or configuration may affect actual performance 

Copyright © 2015 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon logo, Xeon Phi, and Xeon Phi logo are trademarks of Intel Corporation 

in the U.S. and/or other countries. All dates and products specified are for planning purposes only and are subject to change without notice. 

*Other names and brands may be claimed as the property of others. 

114

OPTIMIZATION NOTICE 

Optimization Notice 

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for 

optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and 

SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or 

effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent 

optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not 

specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable 

product User and Reference Guides for more information regarding the specific instruction sets covered by 

this notice. 

Notice revision #20110804 

115

116

2H 2015

Create successful ePaper yourself

Delete template?

Save as template?