MEMSCALE: A New Memory Architecture For Clusters
Presentation at the CERCS Systems Seminar, Georgia Tech
Holger Fröning¹, Federico Silla², Hector Montaner², Jose Duato²
¹ Institute of Computer Engineering, University of Heidelberg, Germany
² Parallel Architectures Group, Technical University of Valencia, Spain
April 11th, 2012
Outline
• Motivation
• The MEMSCALE Architecture
• Evaluation of exclusive use model
• Overcoming Scalability Limitations of Shared Memory
• Evaluation of shared use model
• Conclusion & Future Work

2012/04/11 - CERCS Systems Seminar
Motivation – The Data Deluge
• Memory requirements
  • Merck Bio Research DB: 1.5 TB/quarter
  • Typical oil company: 350 TB+
  • Wal-Mart transaction DB: 500 TB (2004)
  • Facebook: 150 TB
  • Twitter: more than 340M tweets per day
  • Particle physics, Large Hadron Collider: 15 PB
  • Human genomics: 200 PB+ (1 GB/person, 200% CAGR)
• Fourth paradigm in science: Data-Intensive Scientific Discovery
  • After theory, experiment, and simulation [Gibbons2008]
Motivation – Technology Constraints
• Performance disparity between different technologies
  • The memory hierarchy only helps if enough locality is present
• Effective memory capacity
  • Memory capacity per core is not keeping pace with the increase in core count
(Figure: memory capacity per computing core for 4P servers at highest memory speed — a 500x disparity)
Motivation – In-Memory Computing
• DRAM for storage → in-memory databases
  • Jim Gray: "Memory is the new disk and disk is the new tape"
  • HDDs are bad for random access, but pretty good for linear access
    • Natural for logging and journaling an in-memory database
  • Google, Yahoo!: entire indices are stored in DRAM
    • Bigtable, Memcached
• Little or no locality for many new applications
  • "[…] new Web applications such as Facebook appear to have little or no locality, due to complex linkages between data (e.g., friendships in Facebook). As of August 2009 about 25% of all the online data for Facebook is kept in main memory on memcached servers at any given point in time, providing a hit rate of 96.5%." [Ousterhout2009]
  • Typical computer: 2–8 MB cache, 2–8 GB DRAM. At most 0.1% of DRAM capacity can be held in caches.
Motivation – Scalable Shared Memory Programming
• Shared memory programming
  + Pervasive, easy to use
  − No scalability
• Message passing
  + Scalable
  − Difficult to program
  − Overhead
• Overcome message passing's overhead
  • Tag matching, progress, data distribution
→ Combine the advantages of shared memory with the scalability of message passing
Communication is done by FPGAs, configured as a message passing engine or a shared memory engine, respectively.
Motivation – Datacenter Utilization
• Many efforts on power usage effectiveness (PUE), but poor utilization
• Commodity-only approach for datacenters and clusters
  • Resource partitioning
  • Overprovisioning of resources to avoid demand paging
  • Underutilization in the average case
(Figures: average CPU utilization at Google [Hennessy2011]; inter-server variation in a rendering farm [Lim2010])
Research Questions
• Overcome memory capacity constraints
  • Enabling in-core computing for data-intensive applications
• Re-visiting shared memory to overcome MPI limitations
  • Avoiding the overhead associated with MPI
• Benefits of application-specific solutions for database appliances or datacenters
  • Overcoming commodity limitations
The MEMSCALE Architecture
Cluster/Distributed Computing
• Cluster computing is a highly viable option, and has proven itself over more than a decade
• Distributed resources, more or less coupled (from loose to tight coupling):
  • Datacenter
  • Standard cluster
  • High performance cluster
  • Supercomputer
• Coupling: local/remote access cost disparities
• Appealing because of the strong scaling (and the cost-effectiveness of commodity parts)
• Resource partitioning
  • Rules out remote accesses!
  • Leads to data partitioning and data movement!
Current architectures
Shared Memory Computers (shared-everything)
• Single address space
• Up to 64 cores
• Up to 512 GB RAM
• Shared memory programming model
• Global coherency among all processors
• Limited scalability!
• Resource aggregation
Message Passing Systems (shared-nothing)
• Multiple address spaces
• Nodes are identical – symmetric view
• Memory over-provisioned
• Message passing programming model
• No coherency among nodes, only within nodes
• Unlimited scalability
• Resource partitioning
(Diagrams: a shared-memory machine with a single address space spanning all main memories and processors/caches; a message-passing system of five nodes, each with its own main memory, address space, and processors/caches.)
Our approach: MEMSCALE
• Selective aggregation of resources
  • Shared use of scalable resources (memory)
  • Exclusive use of resources with limited scalability (cores/caches)
• Memory regions can expand to other nodes
• Overhead of global coherency is avoided
→ Spanning global address spaces
→ Decoupled resource aggregation, no resource partitioning
(Diagram: five nodes; each node's main memory hosts a memory region, and regions can expand into neighboring nodes' memories.)
Setting up global address spaces
• Distributed shared memory
• The local address space is split up:
  1. Private partition
  2. Shared partition
  3. Mapping to the global address space
• Virtually unlimited
  • Only limited by physical address sizes
  • Currently: 2^48 bytes or 256 TB
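The capacity limit and the three-way split can be checked with a small sketch (Python; the 2^40 boundary for the local partitions follows the address-space figure on the next slide, while the private/shared split point is a purely hypothetical parameter):

```python
PHYS_ADDR_BITS = 48           # current x86-64 physical address width
LOCAL_TOP      = 1 << 40      # local partitions end here (per the figure)
SHARED_BASE    = 1 << 36      # hypothetical start of the shared partition

def classify(addr):
    """Classify a node-local physical address (hypothetical split points)."""
    if addr < SHARED_BASE:
        return "private"      # 1. private partition, node-local only
    if addr < LOCAL_TOP:
        return "shared"       # 2. shared partition, remotely accessible
    return "global"           # 3. local mapping of the global address space

# 2^48 bytes is indeed 256 TB of physically addressable memory
assert (1 << PHYS_ADDR_BITS) == 256 * (1 << 40)
```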
Remote memory access
• Software-transparent access to remote memory
  • Loads/stores go to the local mapping of the global address space
  • Parts of the global address identify the target node
  • The request is forwarded over the network
  • The request hits the shared local partition on the target node
  • If appropriate, a response is sent back
• Direct, low-latency path to remote memory
• Shared Memory Engine (SME)
(Diagram: the origin address space [0 … 2^40: private and shared local partitions; 2^40 … 2^64−1: global] maps via the global address space, partitioned among nodes 0 … n−1, into the target node's shared local partition: oAddr → gAddr → tAddr.)
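The oAddr → gAddr → tAddr translation can be sketched as follows (Python; the 2^40 window base follows the figure, while the per-node region size is a hypothetical parameter):

```python
GLOBAL_BASE = 1 << 40   # origin-local addresses above this fall into the global window
REGION_BITS = 34        # hypothetical per-node region size: 2^34 bytes = 16 GB

def origin_to_global(o_addr):
    """A load/store to the local mapping of the global space yields gAddr."""
    assert o_addr >= GLOBAL_BASE, "below 2^40 the access stays node-local"
    return o_addr - GLOBAL_BASE

def target_node(g_addr):
    """Parts of the global address identify the target node."""
    return g_addr >> REGION_BITS

def global_to_target(g_addr, shared_base):
    """On the target node, the request lands in its shared local partition."""
    return shared_base + (g_addr & ((1 << REGION_BITS) - 1))
```

With these parameters, an origin address of `GLOBAL_BASE + (3 << 34) + 0x100` would be routed to node 3, at offset 0x100 inside that node's shared partition.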
Remote memory access in detail
• Node #0 (source): issues loads/stores on remote memory
  • SME tasks: target node determination, address calculation (source-local address → global address)
• Loss-less and in-order packet forwarding over EXTOLL
• Node #1 (target): serves as memory host
  • SME tasks: source tag management, address calculation (global address → target-local address)
• Remote load latency:
  • 1.89 µs (R1, Virtex-4)
  • 1.44 µs (R2, Virtex-6)
(Diagram: on each node, CPUs with memory controllers (MCTs) and DRAM controllers form a coherency domain; the SME attaches via HyperTransport and exchanges EXTOLL packets with the remote SME.)
Excursion: EXTOLL
• High performance interconnection network
  • Designed from scratch for HPC demands
  • Optimized for low latency and high message rate
• Virtex-4 (156 MHz, HT400, 6.24 Gbps)
  • 40% faster for WRF (compared to IB DDR)
(Block diagram: host interface (HT3 or PCIe) → on-chip network connecting the VELO, RMA, and ATU units plus control & status registers → network ports → EXTOLL network switch with six link ports.)
Evaluation Methodology
• Prototype cluster in Valencia
  • 64 nodes, 1k K10 cores, 1 TB memory, Virtex-4 FPGA-based network
1. Execution on remote memory only
  • Worst case; applications could also use local memory for locality reasons
2. Compared to execution on local memory
  • Serves as an upper performance bound
3. Compared to execution on remote swap
  • Remote swap as a software-only alternative – similar to ScaleMP
(Diagram: one node's processors operating on a memory region hosted in another node's main memory.)
Evaluation – Data-intensive calculations
• PARSEC operating on remote memory
  • Canneal has a huge footprint → heavy use of remote swapping
  • Streamcluster is too small → no remote swapping, linear access pattern
Evaluation – Data-intensive calculations
• Benchmarks operating on remote memory
  • PARSEC
  • SPLASH2
Evaluation – Data-intensive calculations
• The idea is to borrow memory from other nodes
• Sensitivity to background noise
  • MMULT on the client and STREAM on the server
• Impact of remote access latency
  • Each hop adds 600 ns
Evaluation – In-Memory Database
• Social-network-like DB
  • MySQL with a modified memory storage engine
  • Approx. 100 GB
  • Many fine-grain accesses, almost no locality
  • Currently: read-only
Evaluation – In-Memory Database (I)
• Comparing
  • Local memory: 30x faster
  • Remote memory (MEMSCALE): reference
  • SSD (SATA): 25x slower
• Local memory: upper performance bound, but insufficient capacity; included for reference
• MEMSCALE: unlimited capacity, closing the performance gap (2,500 Q/s) to local memory
• SSD: insufficient performance (90 Q/s), not scaling with the number of available cores per node
• HDD: neither scalable nor providing performance
• Executed on a single node
Evaluation – Limitations
• Concurrent load requests to remote memory
  • Performance saturates at 4–5 flows
  • Limited number of outstanding requests for MMIO space
• Solution: coherent integration using cHT
  • Viewed as a memory controller by the OS
  • (Cache coherency)
Overcoming Scalability Limitations of Shared Memory
Collaborative Use of Memory Regions
(Diagrams: five nodes, each with main memory, memory regions, and processors/caches; in the collaborative case, memory regions are shared among several nodes' processors instead of being used exclusively.)
On-Demand Consistency
• BSP-like execution model
• Consistency is only guaranteed at synchronization points
• Enriched barriers/locks
  • Cache invalidations (WBINVD, CLFLUSH, ...)
• Reverting from continuous consistency (pessimistic model) to partial consistency (optimistic model)
• No longer necessary to broadcast probes on every shared memory access
(Diagram: a task model over time for threads 0–4, alternating computation, communication, and synchronization phases separated by barriers and lock/unlock pairs.)
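As a toy model of this idea (Python, not the actual hardware mechanism): each node caches reads from the shared store, tolerates staleness between synchronization points, and drops its cached lines only at a sync point, mirroring the WBINVD/CLFLUSH step in the enriched barriers:

```python
class Node:
    """A node under on-demand consistency (software model, not hardware)."""
    def __init__(self, store):
        self.store = store   # globally shared memory (a plain dict here)
        self.cache = {}      # local cache, only trusted between sync points

    def read(self, addr):
        if addr not in self.cache:           # cache miss: fetch from the store
            self.cache[addr] = self.store.get(addr)
        return self.cache[addr]              # may be stale until the next sync

    def write(self, addr, val):
        self.store[addr] = val               # write through to the shared store
        self.cache[addr] = val

    def sync(self):
        self.cache.clear()                   # invalidate at the barrier/lock
```

Between two sync points a node may read stale values without any coherency traffic; after `sync()` the next read refetches from the shared store, which is exactly the partial-consistency contract above.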
Evaluation: Barrier Performance
• Example synchronization primitive
• Only uses remote stores
  • Push-style communication
  • Local coherent spinning on updates
• Tree-like structure
• "Fastest barrier in the west"
  • 256 threads in 9.3 µs
  • 1k threads in 18.0 µs
  • For comparison, InfiniBand: 20–30 µs for 64 processes [Graham2010]
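A minimal software model of such a tree barrier (Python threads standing in for nodes; arrival flags are "pushed" with plain stores and each participant spins only on its own release flag — the real implementation issues remote stores through the SME instead):

```python
import threading
import time

class TreeBarrier:
    """Single-use combining-tree barrier: push-style stores, local spinning."""
    def __init__(self, n):
        self.n = n
        self.arrived = [False] * n    # child tid stores here when it arrives
        self.released = [False] * n   # each tid spins only on its own flag

    def wait(self, tid):
        left, right = 2 * tid + 1, 2 * tid + 2
        for child in (left, right):           # gather arrivals from the subtree
            while child < self.n and not self.arrived[child]:
                time.sleep(0)                 # yield; models local spinning
        if tid != 0:
            self.arrived[tid] = True          # push arrival up to the parent
            while not self.released[tid]:
                time.sleep(0)
        for child in (left, right):           # release propagates down the tree
            if child < self.n:
                self.released[child] = True

N = 8
barrier = TreeBarrier(N)
counter, lock, results = 0, threading.Lock(), []

def worker(tid):
    global counter
    with lock:
        counter += 1
    barrier.wait(tid)
    with lock:
        results.append(counter)   # after the barrier everyone sees all N increments

threads = [threading.Thread(target=worker, args=(t,)) for t in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The tree shape is what makes arrival gathering O(log n); the per-thread release flags are what allow purely local spinning instead of contended polling on one shared location.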
Evaluation – On-Demand Consistency
• FFT
  • Focus on scalability
  • Matched problem size
  • 5 barriers per iteration
• Result
  • Scalability matches MPI execution
  • Unmodified program; changes only within library calls (barrier)
Evaluation – In-Memory Database
• Cluster-level execution
  • MEMSCALE vs. MySQL Cluster
• MySQL Cluster
  • 16 nodes, Gigabit Ethernet
  • 450 queries per second
  • Saturation starts at 20–30 flows
• MEMSCALE
  • 16 nodes, EXTOLL R1 with SME
  • 128 GB memory pool
  • 35k queries per second (77x)
  • Linear scalability up to 4–5 flows per node (64–80 total), then saturation
• Limited by
  • Number of outstanding loads
  • Access latency
Conclusion and Future Work
Conclusion
• The sole use of commodity parts is hitting a wall
• MEMSCALE: a new memory architecture for clusters
  • Overcoming resource partitioning
  • Scalable memory aggregation
  • Enabling in-memory computing
  • 500x gap reduced to 10–36x
Future Work
• Eventually consistent model for MEMSCALE-DB
• Use of global address spaces to improve utilization in datacenters – "Virtual DIMMs"
  • Reduce memory overprovisioning
• Support for "heavy queries" (TPC-H like) – "Red Fox"
  • MEMSCALE is currently not very suitable for stream-like accesses
→ Collaborations with Sudhakar Yalamanchili, Jeff Young et al. from GT
Future Work – MEERkAT
• MEERkAT – improving datacenter utilization and energy efficiency
  • System-level virtualization & dynamic resource aggregation of heterogeneous resources, various migration levels
• Analogy
  • A meerkat is a small, highly social animal, living in large groups in big underground networks with multiple entrances. When outside, meerkats are best known for their altruistic behavior: one or more stand sentry while the others forage or play.
  • The MEERkAT approach aims to mimic this behavior, relying on a tightly-coupled network with resources acting very socially, being altruistic when idle and thus aspiring to optimize the overall situation.
• Let's check the "compatibility" of meerkats with red foxes, ocelots, or oncillas ;)
UCAA Workshop
• Unconventional Cluster Architectures and Applications (http://www.gap.upv.es/ucaa)
  • A platform for innovative ideas that might turn out to be disruptive in the future
  • Co-located with ICPP2012
• Topics of interest include
  • High-performance, data-intensive, and power-aware computing
  • Application-specific clusters, datacenters, and cloud architectures
  • New industry and technology trends
  • Software cluster-level virtualization for consolidation purposes
  • New uses of GPUs, FPGAs, and other specialized hardware
  • ...
• Submission deadline: May 1st, 2012
Thanks for your attention!
Holger Fröning
froening@uni-hd.de
References
[Ousterhout2009] John Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra, Aravind Narayanan, Guru Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan Stutsman. 2010. The case for RAMClouds: scalable high-performance storage entirely in DRAM. SIGOPS Oper. Syst. Rev. 43, 4 (January 2010), 92–105.
[Hennessy2011] John L. Hennessy and David A. Patterson. 2011. Computer Architecture: A Quantitative Approach. 5th ed. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[Graham2010] Richard Graham, Stephen W. Poole, Pavel Shamis, Gil Bloch, Noam Bloch, Hillel Chapman, Michael Kagan, Ariel Shahar, Ishai Rabinovitz, and Gilad Shainer. 2010. Overlapping Computation and Communication: Barrier Algorithms and ConnectX-2 CORE-Direct Capabilities. 10th Workshop on Communication Architecture for Clusters, in conjunction with the International Parallel and Distributed Processing Symposium (IPDPS 2010).
[Schmidl2010] Schmidl et al. How to Scale Nested OpenMP Applications on the ScaleMP vSMP Architecture. 2010 IEEE International Conference on Cluster Computing.
Backup Slides
Barrier Performance
Technology
• Most experiments were made with outdated Virtex-4 technology
• Ventoux (HT600, 200 MHz)
  • 1.45 µs predicted, 1.44 µs measured
Evaluation – In-Memory Database
Name of my friends:
SELECT SQL_NO_CACHE u_name FROM users, friendships WHERE f_u1 = AND f_u2 = u_id
Number of new messages directly sent to me:
SELECT SQL_NO_CACHE m_text FROM messages WHERE m_to_id = AND m_to_t = "user" AND m_new = TRUE
Number of new messages sent to any of my groups:
SELECT SQL_NO_CACHE m_text FROM messages, affiliations WHERE a_user = AND m_to_id = a_group AND m_to_t = "group" AND m_new = TRUE LIMIT 50
100 last messages sent by me:
SELECT SQL_NO_CACHE m_text FROM messages WHERE m_from = LIMIT 100
Name of my groups:
SELECT SQL_NO_CACHE g_name FROM groups, affiliations WHERE a_user = AND a_group = g_id LIMIT 50
Text of 3 given messages:
SELECT SQL_NO_CACHE m_text FROM messages WHERE m_id = OR m_id = OR m_id = OR m_id = OR m_id =