
MEMSCALE: A New Memory Architecture For Clusters

Presentation at the CERCS Systems Seminar at Georgia Tech

Holger Fröning (1), Federico Silla (2), Hector Montaner (2), Jose Duato (2)

(1) Institute of Computer Engineering, University of Heidelberg, Germany
(2) Parallel Architectures Group, Technical University of Valencia, Spain

April 11th, 2012


Outline

• Motivation
• The MEMSCALE Architecture
• Evaluation of the exclusive use model
• Overcoming Scalability Limitations of Shared Memory
• Evaluation of the shared use model
• Conclusion & Future Work


Motivation – The Data Deluge

• Memory requirements
  • Merck Bio Research DB: 1.5TB/quarter
  • Typical oil company: 350TB+
  • Wal-Mart transaction DB: 500TB (2004)
  • Facebook: 150TB
  • Twitter: more than 340M tweets per day
  • Particle physics, Large Hadron Collider: 15PB
  • Human genomics: 200PB+ (1GB/person, 200% CAGR)
• Fourth paradigm in science: Data-Intensive Scientific Discovery
  • After theory, experiment and simulation [Gibbons2008]


Motivation – Technology Constraints

• Performance disparity for different technologies
  • The memory hierarchy only helps if enough locality is present
• Effective memory capacity
  • Memory capacity per core is not keeping pace with the increase in core count

[Figure: memory capacity per computing core for 4P servers at highest memory speed; annotation: 500x]


Motivation – In-Memory Computing

• DRAM for storage → In-Memory Databases
  • Jim Gray: "Memory is the new disk and disk is the new tape"
  • HDDs are bad for random access, but pretty good for linear access
    • Natural for logging and journaling an in-memory database
  • Google, Yahoo!: entire indices are stored in DRAM
  • Bigtable, Memcached
• Little or no locality for many new applications
  • "[…] new Web applications such as Facebook appear to have little or no locality, due to complex linkages between data (e.g., friendships in Facebook). As of August 2009 about 25% of all the online data for Facebook is kept in main memory on memcached servers at any given point in time, providing a hit rate of 96.5%." [Ousterhout2009]
  • Typical computer: 2-8MB cache, 2-8GB DRAM. At most 0.1% of DRAM capacity can be held in caches


Motivation – Scalable Shared Memory Programming

• Shared memory programming
  + Pervasive, easy to use
  - No scalability
• Message passing
  + Scalable
  - Difficult to program
  - Overhead
• Overcome message passing's overhead
  • Tag matching, progress, data distribution

→ Combine the advantages of shared memory with the scalability of message passing

Communication is done by FPGAs, configured as a message passing engine or a shared memory engine, respectively.


Motivation – Datacenter Utilization

• Many efforts on power usage effectiveness (PUE), but poor utilization
• Commodity-only approach for datacenters and clusters
  • Resource partitioning
  • Overprovisioning of resources to avoid demand paging
  • Underutilization in the average case

[Figures: average CPU utilization at Google [Hennessy2011]; inter-server variation in a rendering farm [Lim2010]]


Research Questions

• Overcoming memory capacity constraints
  • Enabling in-core computing for data-intensive applications
• Re-visiting shared memory to overcome MPI limitations
  • Avoiding the overhead associated with MPI
• Benefits of application-specific solutions for database appliances or datacenters
  • Overcoming commodity limitations


The MEMSCALE Architecture


Cluster/Distributed Computing

• Cluster computing is a highly viable option, and has proven itself over more than a decade
• Distributed resources, more or less coupled (from loose to tight coupling):
  • Datacenter
  • Standard cluster
  • High performance cluster
  • Supercomputer
• Coupling: local/remote access cost disparities
• Appealing because of the strong scaling (and the cost-effectiveness of commodity parts)
• Resource partitioning
  • Rules out remote accesses!
  • Leads to data partitioning and data movement!


Current architectures

Shared Memory Computers (shared-everything):
• Single address space
• Up to 64 cores
• Up to 512GB RAM
• Shared memory programming model
• Global coherency among all processors
• Limited scalability!
• Resource aggregation

Message Passing Systems (shared-nothing):
• Multiple address spaces
• Nodes are identical – symmetric view
• Memory is over-provisioned
• Message passing programming model
• No coherency among nodes, only within nodes
• Unlimited scalability
• Resource partitioning

[Figure: shared-everything, a single address space spanning all main memories and processors/caches; shared-nothing, five nodes each with its own main memory, processors/caches, and private address space (Address Space 1-5)]


Our approach: MEMSCALE

• Selective aggregation of resources
  • Shared use of scalable resources (memory)
  • Exclusive use of resources with limited scalability (cores/caches)
• Memory regions can expand to other nodes
• The overhead of global coherency is avoided

→ Spanning up global address spaces
→ Decoupled resource aggregation, no resource partitioning

[Figure: five nodes, each with its own processors/caches; memory regions 1-5 expand across the main memories of neighboring nodes]
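Before the details, a hedged sketch of what this exclusive use model could look like from a program's point of view; memscale_alloc is an invented name for illustration, not the actual MEMSCALE API. The point is only that borrowed remote capacity is reached through ordinary loads and stores:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical allocation call: map 'size' bytes, backed by memory
 * borrowed from other nodes, into this process's address space.
 * Invented for illustration; not the actual MEMSCALE API. */
extern void *memscale_alloc(size_t size);

int main(void)
{
    /* Ask for more memory than one node physically has, e.g. 512 GB. */
    size_t n = 512ULL << 30;
    uint64_t *big = memscale_alloc(n);
    if (!big) return 1;

    /* From here on, ordinary loads and stores; the hardware
     * transparently forwards accesses that fall on remote nodes. */
    for (size_t i = 0; i < n / sizeof(uint64_t); i += 4096 / sizeof(uint64_t))
        big[i] = i;   /* touch one word per page */

    return 0;
}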


Setting up global address spaces

• Distributed shared memory
• The local address space is split up into:
  1. A private partition
  2. A shared partition
  3. A mapping to the global address space
• Virtually unlimited
  • Only limited by physical address sizes
  • Currently: 2^48 bytes, or 256 TB


Remote memory access

• Software-transparent access to remote memory
  • Loads/stores to the local mapping of the global address space
  • Part of the global address identifies the target node
  • The request is forwarded over the network
  • The request hits the shared local partition on the target node
  • If appropriate, a response is sent back

• Direct, low-latency path to remote memory
• Shared Memory Engine (SME)

[Figure: the origin address space (private local and shared local below 2^40, global mapping from 2^40 up to 2^64-1) maps through the global address space, partitioned among nodes 0..n-1, into the target's shared local partition; oAddr → gAddr → tAddr]
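To make the oAddr → gAddr → tAddr translation concrete, here is a minimal C sketch of the address arithmetic such an engine could perform. The 2^40 local/global boundary is from the slide; the per-node window size and field positions are assumptions for illustration, not the actual SME encoding:

#include <stdint.h>
#include <stdbool.h>

/* Assumed layout: addresses below 2^40 are node-local; above that,
 * the global address space is divided into fixed-size per-node windows. */
#define LOCAL_GLOBAL_BOUNDARY (1ULL << 40)   /* from the slide */
#define NODE_WINDOW_BITS      34             /* hypothetical: 16 GB window per node */

typedef struct {
    uint16_t node;    /* target node hosting the memory (assumes <= 65536 nodes) */
    uint64_t t_addr;  /* offset into the target's shared local partition */
} remote_ref_t;

/* Decide whether a load/store leaves the node, and if so, compute
 * the target node and the target-local address (oAddr -> gAddr -> tAddr). */
static bool decode_global_address(uint64_t o_addr, remote_ref_t *out)
{
    if (o_addr < LOCAL_GLOBAL_BOUNDARY)
        return false;                        /* private or shared local: served by local DRAM */

    uint64_t g_addr = o_addr - LOCAL_GLOBAL_BOUNDARY;        /* offset into global space */
    out->node   = (uint16_t)(g_addr >> NODE_WINDOW_BITS);    /* upper bits select the node */
    out->t_addr = g_addr & ((1ULL << NODE_WINDOW_BITS) - 1); /* lower bits index its window */
    return true;                             /* forward the request over the network */
}

In hardware this reduces to bit slicing on the load/store path, which is one reason a direct, low-latency path to remote memory is feasible.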


Remote memory access in detail

• Node #0 (source): issues loads/stores on remote memory
• Node #1 (target): serves as the memory host
• Source-side SME: target node determination, address calculation (source-local address → global address)
• Network: loss-less and in-order forwarding of EXTOLL packets
• Target-side SME: source tag management, address calculation (global address → target-local address)
• Remote load latency:
  • 1.89 usec (R1, Virtex-4)
  • 1.44 usec (R2, Virtex-6)

[Figure: two nodes, each a coherency domain with CPUs/memory controllers (MCT), DRAM controller, and an SME attached via HyperTransport (HT); EXTOLL packets travel between the SMEs]


Excursion: EXTOLL

• High performance interconnection network
  • Designed from scratch for HPC demands
  • Optimized for low latency and high message rate
• Virtex-4 (156MHz, HT400, 6.24Gbps)
  • 40% faster for WRF (compared to IB DDR)

[Figure: EXTOLL block diagram: host interface (HT3 or PCIe), on-chip network, functional units (VELO, RMA, ATU, Control & Status) connected via network ports to an integrated switch with six link ports]


Evaluation Methodology

• Prototype cluster in Valencia
  • 64 nodes, 1k K10 cores, 1TB RAM, Virtex-4 FPGA-based network
1. Execution on remote memory only
  • Worst case; applications could also use local memory for locality reasons
2. Compared to execution on local memory
  • Serves as an upper performance bound
3. Compared to execution on remote swap
  • Remote swap as a software-only alternative – similar to ScaleMP [Schmidl2010]

[Figure: a process on one node runs entirely on a memory region hosted by other nodes]


Evaluation – Data-intensive calculations

• PARSEC operating on remote memory
  • Canneal has a huge footprint → heavy use of remote swapping
  • Streamcluster is too small → no remote swapping; linear access pattern


Evaluation – Data-intensive calculations

• Benchmarks operating on remote memory
  • PARSEC
  • SPLASH2


Evaluation – Data-intensive calculations

• The idea is to borrow memory from other nodes
• Sensitivity to background noise
  • MMULT on the client and STREAM on the server
• Impact of remote access latency
  • Each hop adds 600ns


Evaluation – In-Memory-Database

• Social network-like DB
  • MySQL with a modified memory storage engine
  • Approx. 100GB
  • Many fine-grain accesses, almost no locality
  • Currently: read-only


Evaluation – In-Memory-Database (I)

• Comparing (executed on a single node):
  • Local memory: 30x faster
  • Remote memory (MEMSCALE): reference
  • SSD (SATA): 25x slower

• Local memory: upper performance bound, but insufficient capacity; included for reference
• MEMSCALE: unlimited capacity, closing the performance gap (2500 Q/s) to local memory
• SSD: insufficient performance (90 Q/s), not scaling with the number of available cores per node
• HDD: neither scalable nor providing performance


Evaluation – Limitations

• Concurrent load requests to remote memory
  • Performance saturates at 4-5 flows
  • Limited number of outstanding requests for MMIO space
• Solution: coherent integration using cHT (coherent HyperTransport)
  • Viewed as a memory controller by the OS
  • (Cache coherency)
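A back-of-the-envelope check with Little's law explains the saturation (the 1.44 usec latency is from an earlier slide; the 64-byte line size and the arithmetic are my own illustration, not from the talk): sustained bandwidth ≈ outstanding requests × request size / latency, so with 4 outstanding 64-byte loads, BW ≈ 4 × 64 B / 1.44 usec ≈ 178 MB/s. A handful of in-flight MMIO requests therefore caps throughput far below DRAM bandwidth regardless of core count, which is why coherent integration (allowing more outstanding loads) is the proposed fix.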


Overcoming Scalability Limitations of Shared Memory


Collaborative Use of Memory Regions

[Figures: five nodes, each with main memory and processors/caches; in the first diagram, memory regions 1-5 are mapped one per node; in the second, memory regions 1-4 are used collaboratively by the processors/caches of multiple nodes]


On-Demand Consistency

• BSP-like execution model
  • Consistency is only guaranteed at synchronization points
• Enriched barriers/locks
  • Cache invalidations (WBINVD, CLFLUSH, ...)
• Reverting from continuous consistency (pessimistic model) to partial consistency (optimistic model)
  • It is no longer necessary to broadcast probes on every shared memory access

[Figure: timeline of threads 0..4 in a BSP-like task model: computation phases separated by communication and synchronization (barriers, lock/unlock pairs)]
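To illustrate what an "enriched" synchronization primitive might look like, here is a minimal user-level sketch: written shared lines are flushed before the barrier, so the barrier is the only point where updates become globally visible. The helper names and the use of CLFLUSH plus fences are assumptions for x86; the actual MEMSCALE runtime may differ:

#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (x86 SSE2) */

/* Hypothetical runtime hook; not the actual MEMSCALE API. */
extern void memscale_barrier(void);  /* global barrier built on remote stores */

#define CACHELINE 64

/* Flush every cache line of a shared region so written data reaches
 * (remote) memory before we signal the synchronization point. */
static void flush_region(const void *buf, size_t len)
{
    const char *p = (const char *)buf;
    for (size_t off = 0; off < len; off += CACHELINE)
        _mm_clflush(p + off);
    _mm_mfence();  /* order the flushes before the barrier's stores */
}

/* An "enriched" barrier: make local writes globally visible, then
 * synchronize; consistency is only guaranteed here (BSP-style). */
void enriched_barrier(const void *shared_buf, size_t len)
{
    flush_region(shared_buf, len);
    memscale_barrier();
    /* After the barrier, re-reads of shared data fetch fresh values;
     * CLFLUSH both writes back and invalidates, so stale cached
     * copies are gone as well. */
}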


Evaluation: Barrier Performance

• Example synchronization primitive
• Only using remote stores
  • Push-style communication
  • Local coherent spinning on updates
• Tree-like structure
• "Fastest barrier in the west"
  • 256 threads in 9.3 usec
  • 1k threads in 18.0 usec
  • For comparison, IB: 20-30 usec for 64 processes [Graham2010]
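A plausible shape for such a barrier is sketched below: each node pushes its arrival flag into its parent's memory with a single remote store, parents spin locally (cache-coherently) on their children's flags, and the release travels back down the tree. This is a generic tree-barrier sketch, not the actual MEMSCALE implementation; remote_store and the flag placement are hypothetical:

#include <stdint.h>

/* Hypothetical primitive: a store that lands in another node's memory
 * through the global address space (on MEMSCALE, an ordinary store
 * to a mapped remote address). */
extern void remote_store(volatile uint8_t *remote_flag, uint8_t value);

#define FANOUT 4

typedef struct {
    volatile uint8_t arrived[FANOUT];  /* children push arrival here (local memory) */
    volatile uint8_t release;          /* parent pushes the release here (local memory) */
    volatile uint8_t *parent_slot;     /* our slot in the parent's arrived[] (remote); NULL at root */
    volatile uint8_t *child_release[FANOUT]; /* children's release flags (remote) */
    int nchildren;
} tree_node_t;

/* 'epoch' increments once per barrier episode, avoiding flag resets. */
void tree_barrier(tree_node_t *n, uint8_t epoch)
{
    /* 1. Gather: spin locally until all children have arrived. */
    for (int i = 0; i < n->nchildren; i++)
        while (n->arrived[i] != epoch) { /* local coherent spinning */ }

    /* 2. Signal our own arrival upward with one remote store. */
    if (n->parent_slot)
        remote_store(n->parent_slot, epoch);

    /* 3. Wait for the release from the parent (the root skips this). */
    if (n->parent_slot)
        while (n->release != epoch) { /* local coherent spinning */ }

    /* 4. Release: push the wake-up down the tree. */
    for (int i = 0; i < n->nchildren; i++)
        remote_store(n->child_release[i], epoch);
}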


Evaluation – On-Demand Consistency

• FFT
  • Focus on scalability
  • Matched problem size
  • 5 barriers per iteration
• Result
  • Scalability matches the MPI execution
  • Unmodified program; the only changes are within library calls (barrier)


Evaluation – In-Memory-Database: MySQL Cluster

• Cluster-level execution
  • MEMSCALE vs. MySQL Cluster
• MySQL Cluster
  • 16 nodes, Gigabit Ethernet
  • 450 queries per second
  • Saturation starts at 20-30 flows
• MEMSCALE
  • 16 nodes, EXTOLL R1 with SME
  • 128GB memory pool
  • 35k queries per second: 77x
  • Linear scalability up to 4-5 flows per node (64-80 total), then saturation
• Limited by
  • The number of outstanding loads
  • Access latency


Conclusion and Future Work


Conclusion

• The sole use of commodity parts is hitting a wall
• MEMSCALE: a new memory architecture for clusters
  • Overcoming resource partitioning
  • Scalable memory aggregation
• Enabling in-memory computing
  • The 500x gap is reduced to 10-36x


Future Work

• Eventually consistent model for MEMSCALE-DB
• Use of global address spaces to improve utilization in datacenters – "Virtual DIMMs"
  • Reduce memory overprovisioning
• Support for "heavy queries" (TPC-H like) – "Red Fox"
  • MEMSCALE is currently not very suitable for stream-like accesses

→ Collaborations with Sudhakar Yalamanchili, Jeff Young et al. from GT


Future Work – MEERkAT

• MEERkAT – improving datacenter utilization and energy efficiency
  • System-level virtualization & dynamic resource aggregation of heterogeneous resources, various migration levels
• Analogy
  • A meerkat is a small, highly social animal, living in large groups in big underground networks with multiple entrances. When outside, meerkats are best known for their altruistic behavior: one or more stand sentry while the others forage or play.
  • The MEERkAT approach aims to mimic this behavior, relying on a tightly-coupled network with resources acting very socially, being altruistic when idle and thus aspiring to optimize the overall situation.


• Let's check the "compatibility" of meerkats with red foxes, ocelots or oncillas ;)


UCAA Workshop

• Unconventional Cluster Architectures and Applications (http://www.gap.upv.es/ucaa)
  • A platform for innovative ideas that might turn out to be disruptive in the future
  • Co-located with ICPP 2012
• Topics of interest include
  • High-performance, data-intensive, and power-aware computing
  • Application-specific clusters, datacenters, and cloud architectures
  • New industry and technology trends
  • Software cluster-level virtualization for consolidation purposes
  • New uses of GPUs, FPGAs, and other specialized hardware
  • ...
• Submission deadline: May 1st, 2012


Thanks for your attention!

Holger Fröning
froening@uni-hd.de


References

[Ousterhout2009] John Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra, Aravind Narayanan, Guru Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan Stutsman. 2010. The case for RAMClouds: scalable high-performance storage entirely in DRAM. SIGOPS Oper. Syst. Rev. 43, 4 (January 2010), 92-105.

[Hennessy2011] John L. Hennessy and David A. Patterson. 2011. Computer Architecture: A Quantitative Approach. 5th ed. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

[Graham2010] Richard Graham, Stephen W. Poole, Pavel Shamis, Gil Bloch, Noam Bloch, Hillel Chapman, Michael Kagan, Ariel Shahar, Ishai Rabinovitz, and Gilad Shainer. 2010. Overlapping Computation and Communication: Barrier Algorithms and ConnectX-2 CORE-Direct Capabilities. 10th Workshop on Communication Architecture for Clusters, in conjunction with the International Parallel and Distributed Processing Symposium (IPDPS 2010).

[Schmidl2010] Schmidl et al. 2010. How to Scale Nested OpenMP Applications on the ScaleMP vSMP Architecture. 2010 IEEE International Conference on Cluster Computing.


Backup Slides


Barrier Performance


Technology

• Most experiments were made with outdated Virtex-4 technology
• Ventoux (HT600, 200MHz)
  • 1.45 usec remote load latency predicted, 1.44 usec measured


Evaluation – In-Memory-Database

Name of my friends:
SELECT SQL_NO_CACHE u_name FROM users, friendships WHERE f_u1 = <id> AND f_u2 = u_id

Number of new messages directly sent to me:
SELECT SQL_NO_CACHE m_text FROM messages WHERE m_to_id = <id> AND m_to_t = "user" AND m_new = TRUE

Number of new messages sent to any of my groups:
SELECT SQL_NO_CACHE m_text FROM messages, affiliations WHERE a_user = <id> AND m_to_id = a_group AND m_to_t = "group" AND m_new = TRUE LIMIT 50

100 last messages sent by me:
SELECT SQL_NO_CACHE m_text FROM messages WHERE m_from = <id> LIMIT 100

Name of my groups:
SELECT SQL_NO_CACHE g_name FROM groups, affiliations WHERE a_user = <id> AND a_group = g_id LIMIT 50

Text of 3 given messages:
SELECT SQL_NO_CACHE m_text FROM messages WHERE m_id = <id> OR m_id = <id> OR m_id = <id> OR m_id = <id> OR m_id = <id>
