
• Segregated storage and compute
  – NFS, GPFS, PVFS, Lustre
  – Batch-scheduled systems: Clusters, Grids, and Supercomputers
  – Programming paradigm: HPC, MTC, and HTC
• Co-located storage and compute
  – HDFS, GFS
  – Data centers at Google, Yahoo, and others
  – Programming paradigm: MapReduce
  – Others from academia: Sector, MosaStore, Chirp


• Local Disk:
  – 2002-2004: ANL/UC TG Site (70GB SCSI)
  – Today: PADS (RAID-0, 6 drives, 750GB SATA)
• Cluster:
  – 2002-2004: ANL/UC TG Site (GPFS, 8 servers, 1Gb/s each)
  – Today: PADS (GPFS, SAN)
• Supercomputer:
  – 2002-2004: IBM Blue Gene/L (GPFS)
  – Today: IBM Blue Gene/P (GPFS)
[Chart: MB/s per processor core for local disk, cluster, and supercomputer, 2002-2004 vs. today; per-core bandwidth declines ranging from 2.2X to 438X]


What if we could combine the scientific community's existing programming paradigms, yet still exploit the data locality that naturally occurs in scientific workloads?


“Significant performance improvements can be obtained in the analysis of large datasets by leveraging information about data analysis workloads rather than individual data analysis tasks.”
• Important concepts related to the hypothesis
  – Workload: a complex query (or set of queries) decomposable into simpler tasks to answer broader analysis questions
  – Data locality is crucial to the efficient use of large-scale distributed systems for scientific and data-intensive applications
  – Allocate computational and caching storage resources, co-scheduled to optimize workload performance


• FA: first-available
  – simple load balancing
• MCH: max-cache-hit
  – maximize cache hits
• MCU: max-compute-util
  – maximize processor utilization
• GCC: good-cache-compute
  – maximize both cache hit and processor utilization at the same time (see the sketch below)
[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
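A minimal sketch of how a good-cache-compute (GCC) style decision could look, assuming a simplified in-memory view of per-node caches; the Node class and selectNode method are illustrative, not Falkon's actual scheduler API:

import java.util.*;

// Illustrative sketch of a good-cache-compute (GCC) decision: prefer a
// node that already caches the task's file, but fall back to any
// under-utilized node so CPUs do not sit idle.
public class GccSchedulerSketch {
    static class Node {
        final String id;
        final Set<String> cachedFiles = new HashSet<>();
        int runningTasks;
        final int cores;
        Node(String id, int cores) { this.id = id; this.cores = cores; }
        boolean hasFreeCore() { return runningTasks < cores; }
    }

    // Pick a node for a task that reads 'file'.
    static Node selectNode(List<Node> nodes, String file) {
        // 1) max-cache-hit side: a free node that already holds the file
        for (Node n : nodes) {
            if (n.hasFreeCore() && n.cachedFiles.contains(file)) return n;
        }
        // 2) max-compute-util side: otherwise the least loaded free node
        return nodes.stream()
                .filter(Node::hasFreeCore)
                .min(Comparator.comparingInt(n -> n.runningTasks))
                .orElse(null); // queue the task if every core is busy
    }

    public static void main(String[] args) {
        Node a = new Node("node-a", 2), b = new Node("node-b", 2);
        a.cachedFiles.add("f1.dat");
        Node chosen = selectNode(Arrays.asList(a, b), "f1.dat");
        System.out.println("dispatch to " + chosen.id); // node-a (cache hit)
    }
}

Under this framing, first-available (FA) would skip step 1 entirely, max-cache-hit (MCH) would wait for the caching node even when it is busy, and max-compute-util (MCU) would always take the least loaded node.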


• 3GHz dual CPUs
• ANL/UC TG with 128 processors
• Scheduling window: 2500 tasks
• Dataset: 100K files, 1 byte each
• Tasks: read 1 file, write 1 file
[Charts: CPU time per task (ms) broken down into task submit, notification for task availability, task dispatch (data-aware scheduler), task results (data-aware scheduler), notification for task results, and WS communication; and throughput (tasks/sec) for the first-available, max-compute-util, max-cache-hit, and good-cache-compute policies, with and without I/O]
[DIDC09] “Towards Data Intensive Many-Task Computing”, under review


• Monotonically Increasing Workload
  – Emphasizes increasing loads
• Sine-Wave Workload
  – Emphasizes varying loads
• All-Pairs Workload
  – Compare to best-case model of active storage
• Image Stacking Workload (Astronomy)
  – Evaluate data diffusion on a real large-scale data-intensive application from the astronomy domain
[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”


• 250K tasks
  – 10MB reads
  – 10ms compute
• Vary arrival rate:
  – Min: 1 task/sec
  – Increment function: CEILING(*1.3) (see the sketch below)
  – Max: 1000 tasks/sec
• 128 processors
• Ideal case:
  – 1415 sec
  – 80Gb/s peak throughput
[Chart: arrival rate (per second) and cumulative tasks completed over time]
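A small sketch of that arrival-rate ramp, reading CEILING(*1.3) as "multiply the previous rate by 1.3 and round up" (an assumption); how long each rate is held, and hence how the 250K tasks are spread over the ramp, is not specified here:

import java.util.ArrayList;
import java.util.List;

// Sketch of the monotonically increasing arrival-rate schedule described
// on the slide: start at 1 task/sec, grow by CEILING(previous * 1.3),
// and cap at 1000 tasks/sec.
public class IncreasingWorkloadSketch {
    static List<Integer> arrivalRates(int start, double factor, int max) {
        List<Integer> rates = new ArrayList<>();
        int rate = start;
        while (rate < max) {
            rates.add(rate);
            rate = (int) Math.ceil(rate * factor);
        }
        rates.add(max); // final plateau at the cap
        return rates;
    }

    public static void main(String[] args) {
        // 1, 2, 3, 4, 6, 8, 11, 15, 20, 26, ..., 836, 1000
        System.out.println(arrivalRates(1, 1.3, 1000));
    }
}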


• GPFS vs. ideal: 5011 sec vs. 1415 sec
[Chart: nodes allocated, throughput (Gb/s), demand (Gb/s), and wait queue length over time]


[Charts: max-compute-util and max-cache-hit policies; nodes allocated, throughput (Gb/s), demand (Gb/s), wait queue length, cache hit (local/global) and miss percentages, and CPU utilization over time]


[Charts: data diffusion with per-node cache sizes of 1GB, 1.5GB, 2GB, and 4GB; nodes allocated, throughput (Gb/s), demand (Gb/s), wait queue length, and cache hit (local/global) and miss percentages over time]


• Data Diffusion vs. ideal: 1436 sec vs. 1415 sec
[Chart: nodes allocated, throughput (Gb/s), demand (Gb/s), wait queue length, and cache hit (local/global) and miss percentages over time]


• Throughput:
  – Average: 14Gb/s vs. 4Gb/s
  – Peak: 81Gb/s vs. 6Gb/s
• Response time:
  – 3 sec vs. 1569 sec (506X)
[Charts: throughput (local worker caches, remote worker caches, GPFS) and average response time (sec) for Ideal, FA, GCC (1GB, 1.5GB, 2GB, 4GB caches), MCH (4GB), and MCU (4GB)]


• Performance index:
  – 34X higher
• Speedup:
  – 3.5X faster than GPFS
[Chart: performance index and speedup (compared to first-available) for FA, GCC (1GB, 1.5GB, 2GB, 4GB), GCC 4GB with SRP, MCH 4GB, and MCU 4GB]


• 2M tasks
  – 10MB reads
  – 10ms compute
• Vary arrival rate:
  – Min: 1 task/sec
  – Arrival rate function: A = (sin(sqrt(time + 0.11) * 2.859678) + 1) * (time + 0.11) * 5.705 (see the sketch below)
  – Max: 1000 tasks/sec
• 200 processors
• Ideal case:
  – 6505 sec
  – 80Gb/s peak throughput
[Chart: arrival rate (per sec) and cumulative number of tasks completed over 0-6600 sec]
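A sketch of that sine-wave arrival-rate function with the '+' operators filled back in from the garbled slide text; treating the time variable as elapsed minutes is an assumption, but it makes the stated 1-1000 tasks/sec range and the ~2M total tasks come out about right:

// A = (sin(sqrt(t + 0.11) * 2.859678) + 1) * (t + 0.11) * 5.705
public class SineWaveWorkloadSketch {
    // Returns an arrival rate in tasks/sec; t is elapsed minutes (assumed).
    static double arrivalRatePerSec(double elapsedMinutes) {
        double t = elapsedMinutes + 0.11;
        return (Math.sin(Math.sqrt(t) * 2.859678) + 1.0) * t * 5.705;
    }

    public static void main(String[] args) {
        double total = 0;
        for (int m = 0; m < 108; m++) {          // roughly 6500 sec of workload
            total += arrivalRatePerSec(m) * 60;  // rate held for one minute
        }
        System.out.printf("approx. total tasks: %.0f%n", total); // ~2,000,000
    }
}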


• GPFS: 5.7 hrs, ~8Gb/s, 1138 CPU hrs
[Chart: nodes allocated, throughput (Gb/s), demand (Gb/s), and wait queue length over time]


• GPFS: 5.7 hrs, ~8Gb/s, 1138 CPU hrs
• GCC+SRP: 1.8 hrs, ~25Gb/s, 361 CPU hrs
[Chart: nodes allocated, throughput (Gb/s), demand (Gb/s), wait queue length, and cache hit (local/global) and miss percentages over time]


• GPFS: 5.7 hrs, ~8Gb/s, 1138 CPU hrs
• GCC+SRP: 1.8 hrs, ~25Gb/s, 361 CPU hrs
• GCC+DRP: 1.86 hrs, ~24Gb/s, 253 CPU hrs
[Chart: nodes allocated, throughput (Gb/s), demand (Gb/s), wait queue length, and cache hit (local/global) and miss percentages over time]


• 500x500
  – 250K tasks
  – 24MB reads
  – 100ms compute
  – 200 CPUs
• 1000x1000
  – 1M tasks
  – 24MB reads
  – 4 sec compute
  – 4096 CPUs
• Ideal case:
  – 6505 sec
  – 80Gb/s peak throughput
• All-Pairs(set A, set B, function F) returns matrix M:
  – Compare all elements of set A to all elements of set B via function F, yielding matrix M, such that M[i,j] = F(A[i], B[j]) (see the sketch below)

  foreach $i in A
    foreach $j in B
      submit_job F $i $j
    end
  end

[DIDC09] “Towards Data Intensive Many-Task Computing”, under review
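A compact sketch of the All-Pairs pattern above; in a real run each F(A[i], B[j]) would be submitted as an independent task to the execution framework (e.g. Falkon), so the submitJob stand-in below, which just evaluates F inline, is purely illustrative:

import java.util.List;
import java.util.function.BiFunction;

// All-Pairs: compare every element of A with every element of B via F,
// yielding M[i][j] = F(A[i], B[j]).
public class AllPairsSketch {
    static <T, R> R[][] allPairs(List<T> a, List<T> b,
                                 BiFunction<T, T, R> f, R[][] m) {
        for (int i = 0; i < a.size(); i++) {
            for (int j = 0; j < b.size(); j++) {
                m[i][j] = submitJob(f, a.get(i), b.get(j));
            }
        }
        return m;
    }

    // Stand-in for an asynchronous task submission; here it runs F locally.
    static <T, R> R submitJob(BiFunction<T, T, R> f, T x, T y) {
        return f.apply(x, y);
    }

    public static void main(String[] args) {
        List<String> a = List.of("img1", "img2");
        List<String> b = List.of("img3", "img4");
        String[][] m = allPairs(a, b, (x, y) -> x + "~" + y, new String[2][2]);
        System.out.println(m[0][1]); // img1~img4
    }
}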


• Efficiency: 75%
[Chart: throughput (Gb/s) over time; data diffusion throughput vs. max throughput of GPFS and local disk, with cache hit (local/global) and miss percentages]


• Efficiency: 86%
[Chart: throughput (Gb/s) over time; data diffusion throughput vs. max throughput of GPFS and local memory, with cache hit (local/global) and miss percentages]
[DIDC09] “Towards Data Intensive Many-Task Computing”, under review


• Pull vs. Push
  – Data Diffusion:
    • Pulls task working set
    • Incremental spanning forest
  – Active Storage:
    • Pushes workload working set to all nodes
    • Static spanning tree
• Christopher Moretti, Douglas Thain, University of Notre Dame
[Chart: efficiency of best case (active storage), Falkon (data diffusion), and best case (parallel file system) across the four experiments below]

Data movement per experiment:

Experiment | Approach | Local Disk/Memory (GB) | Network (node-to-node) (GB) | Shared File System (GB)
500x500, 200 CPUs, 1 sec | Best Case (active storage) | 6000 | 1536 | 12
500x500, 200 CPUs, 1 sec | Falkon (data diffusion) | 6000 | 1698 | 34
500x500, 200 CPUs, 0.1 sec | Best Case (active storage) | 6000 | 1536 | 12
500x500, 200 CPUs, 0.1 sec | Falkon (data diffusion) | 6000 | 1528 | 62
1000x1000, 4096 CPUs, 4 sec | Best Case (active storage) | 24000 | 12288 | 24
1000x1000, 4096 CPUs, 4 sec | Falkon (data diffusion) | 24000 | 4676 | 384
1000x1000, 5832 CPUs, 4 sec | Best Case (active storage) | 24000 | 12288 | 24
1000x1000, 5832 CPUs, 4 sec | Falkon (data diffusion) | 24000 | 3867 | 906

[DIDC09] “Towards Data Intensive Many-Task Computing”, under review


• Best to use active storage if:
  – Slow data source
  – Workload working set fits on local node storage
• Best to use data diffusion if:
  – Medium to fast data source
  – Task working set


• Purpose
  – On-demand “stacks” of random locations within a ~10TB dataset
• Challenge
  – Processing costs: O(100ms) per object
  – Data intensive: 40MB : 1 sec
  – Rapid access to 10-10K “random” files
  – Time-varying load
[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
[TG06] “AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis”
[Figure: image stacking, multiple Sloan Data cutouts summed into one AstroPortal stack]

Locality | Number of Objects | Number of Files
1 | 111700 | 111700
1.38 | 154345 | 111699
2 | 97999 | 49000
3 | 88857 | 29620
4 | 76575 | 19145
5 | 60590 | 12120
10 | 46480 | 4650
20 | 40460 | 2025
30 | 23695 | 790
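The locality column in the table above is the ratio of objects requested to distinct files touched (97999 objects over 49000 files gives locality ~2, 23695 over 790 gives ~30). A tiny sketch of that metric; the ObjectRef type and the SDSS-style file names are made up for illustration:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Locality = number of requested objects / number of distinct files.
public class StackingLocalitySketch {
    record ObjectRef(String file, long offset) {}

    static double locality(List<ObjectRef> requested) {
        Set<String> distinctFiles = new HashSet<>();
        for (ObjectRef o : requested) distinctFiles.add(o.file());
        return (double) requested.size() / distinctFiles.size();
    }

    public static void main(String[] args) {
        List<ObjectRef> stack = List.of(
                new ObjectRef("sdss/frame-001.fit", 0),
                new ObjectRef("sdss/frame-001.fit", 4096),
                new ObjectRef("sdss/frame-002.fit", 0),
                new ObjectRef("sdss/frame-002.fit", 8192));
        System.out.println(locality(stack)); // 2.0
    }
}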


[Chart: per-stack time (ms) broken down into open, radec2xy, readHDU+getTile+curl+convertArray, calibration+interpolation+doStacking, and writeStacking, for the GPFS GZ, LOCAL GZ, GPFS FIT, and LOCAL FIT filesystem and image format combinations]
[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”


• Low data locality: similar (but better) performance compared to GPFS
• High data locality: near-perfect scalability
[Charts: time (ms) per stack per CPU vs. number of CPUs (2-128) for data diffusion (GZ, FIT) and GPFS (GZ, FIT), at low and high data locality]
[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”


• Aggregate throughput:
  – 39Gb/s
  – 10X higher than GPFS
• Reduced load on GPFS:
  – 0.49Gb/s
  – 1/10 of the original load
• Big performance gains as locality increases
[Charts: aggregate throughput (Gb/s) and time (ms) per stack per CPU vs. locality (1-30 and ideal) for data diffusion (local, cache-to-cache, GPFS) and GPFS (FIT, GZ)]
[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”


• Data access patterns: write once, read many
• Task definition must include input/output file metadata (see the sketch below)
• Per-task working set must fit in local storage
• Needs IP connectivity between hosts
• Needs local storage (disk, memory, etc.)
• Needs Java 1.4+
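A sketch of what a task definition carrying the required input/output file metadata could look like; the field names and FileRef type are hypothetical, not Falkon's or Swift's actual task description format:

import java.util.List;

// Illustrative task descriptor: declaring inputs up front is what lets a
// data-aware scheduler match the task to a node that already caches its
// files; outputs are written once and then treated as read-only.
public class TaskDescriptorSketch {
    record FileRef(String logicalName, long sizeBytes) {}

    record Task(String executable,
                List<String> args,
                List<FileRef> inputs,
                List<FileRef> outputs) {}

    public static void main(String[] args) {
        Task t = new Task(
                "stack.sh",
                List.of("--band", "r"),
                List.of(new FileRef("sdss/frame-001.fit", 6L * 1024 * 1024)),
                List.of(new FileRef("stacks/stack-42.fit", 2L * 1024 * 1024)));
        System.out.println(t.inputs().get(0).logicalName());
    }
}

The write-once, read-many access pattern is what keeps cached copies consistent without any invalidation protocol.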


• [Ghemawat03, Dean04]: MapReduce + GFS
• [Bialecki05]: Hadoop + HDFS
• [Gu06]: Sphere + Sector
• [Tatebe04]: Gfarm
• [Chervenak04]: RLS, DRS
• [Kosar06]: Stork
• Conclusions
  – None focused on the co-location of storage and generic black-box computations with data-aware scheduling while operating in a dynamic elastic environment
  – Swift + Falkon + Data Diffusion is arguably a more generic and powerful solution than MapReduce


• Identified that data locality is crucial to the efficient use of large-scale distributed systems for data-intensive applications → Data Diffusion
  – Integrated streamlined task dispatching with data-aware scheduling policies
  – Heuristics to maximize real-world performance
  – Suitable for varying, data-intensive workloads
  – Proof of O(NM) competitive caching


• Falkon is a real system
  – Late 2005: initial prototype, AstroPortal
  – January 2007: Falkon v0
  – November 2007: Globus incubator project v0.1
    • http://dev.globus.org/wiki/Incubator/Falkon
  – February 2009: Globus incubator project v0.9
• Implemented in Java (~20K lines of code) and C (~1K lines of code)
  – Open source: svn co https://svn.globus.org/repos/falkon
• Source code contributors (besides myself)
  – Yong Zhao, Zhao Zhang, Ben Clifford, Mihael Hategan
[Globus07] “Falkon: A Proposal for Project Globus Incubation”


• Workload
  – 160K CPUs
  – 1M tasks
  – 60 sec per task
• 2 CPU years in 453 sec
• Throughput: 2312 tasks/sec
• 85% efficiency
[TPDS09] “Middleware Support for Many-Task Computing”, under preparation


[TPDS09] “Middleware Support for Many-Task Computing”, under preparation


ACM MTAGS09 Workshop @ SC09
Due date: August 1st, 2009


IEEE TPDS Journal, Special Issue on MTC
Due date: December 1st, 2009


• More information:
  – Other publications: http://people.cs.uchicago.edu/~iraicu/
  – Falkon: http://dev.globus.org/wiki/Incubator/Falkon
  – Swift: http://www.ci.uchicago.edu/swift/index.php
• Funding:
  – NASA: Ames Research Center, GSRP
  – DOE: Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy
  – NSF: TeraGrid
• Relevant activities:
  – ACM MTAGS09 Workshop at Supercomputing 2009
    • http://dsl.cs.uchicago.edu/MTAGS09/
  – Special Issue on MTC in IEEE TPDS Journal
    • http://dsl.cs.uchicago.edu/TPDS_MTC/
