Computer Systems @ Colorado

COMPUTER AND CYBERPHYSICAL SYSTEMS RESEARCH AT COLORADO

http://systems.cs.colorado.edu

A Methodology for Fine-Grained Parallelism in JavaScript Applications

Jeff Fifield and Dirk Grunwald

Department of Computer Science

University of Colorado

LCPC Workshop

September 8, 2011


Modern JavaScript

Most web applications are written using JavaScript
• Email
• Office suites
• Games
• IDEs

Many server side applications use JavaScript
• CommonJS
• node.js, RingoJS, Wakanda, many more...
• MongoDB map/reduce, CouchDB views

APIs for Desktop, Mobile


Why is JavaScript Slow?

It's not because JavaScript is interpreted:
- current JavaScript engines are JIT compilers
- they perform standard optimizations and code gen
- they can generate good code

It's because JavaScript is untyped:

    what does c = a + b mean?
        if a = 1, b = 2
        if a = "cat", b = "dog"

Other reasons too:
object creation, use of eval, sparse arrays, ...
but these things shouldn't be in hot loops anyway
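
The `c = a + b` question above can be made concrete. In this minimal sketch (the `add` helper is introduced here only for illustration), the same source-level operation performs numeric addition or string concatenation depending on the run-time types of its operands, so an engine cannot choose a single machine instruction for `+` without knowing those types.

```javascript
// Hypothetical helper for illustration: one operator, several meanings.
function add(a, b) {
  return a + b;
}

var sum = add(1, 2);            // numeric addition: 3
var joined = add("cat", "dog"); // string concatenation: "catdog"
var mixed = add(1, "dog");      // the number is converted to a string: "1dog"

console.log(sum, joined, mixed);
```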


Type Specializing JITs

Luckily, programmers usually write Type-Stable code:

That is,

    a = 1;
    b = 2;
    if (something)
        b = 5;
    c = a + b;

and not,

    a = 1;
    b = 2;
    if (something)
        b = "word";
    c = a + b;

Type specializing, dynamic optimizing, JIT compilers
• Observe types through profiling or tracing (still emit checks)
• Prove types using type inference (no checks, research area)
• Generate code specialized to given types
• Google V8 Crankshaft, Mozilla TraceMonkey, JägerMonkey

Good JS performance requires Type-Stable Code
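
To illustrate why types that are only observed still require checks, here is a hand-written sketch of what type specialization amounts to, expressed in JavaScript itself. It does not describe how any particular engine is implemented; `addGeneric`, `addSpecialized`, `run`, `runUnstable`, and the `stats` counters are hypothetical names used only for this illustration. The specialized version guards on the types seen so far and falls back to the generic path when the guard fails.

```javascript
var stats = { fast: 0, slow: 0 };

// Generic path: must handle every combination of operand types.
function addGeneric(a, b) {
  stats.slow++;
  return a + b;
}

// Specialized path: assumes numbers were observed, but still checks.
function addSpecialized(a, b) {
  if (typeof a === "number" && typeof b === "number") {
    stats.fast++;
    return a + b; // a JIT could emit a plain numeric add for this branch
  }
  return addGeneric(a, b); // guard failed: fall back to the generic path
}

// Type-stable: b is a number on every path.
function run(something) {
  var a = 1;
  var b = 2;
  if (something) b = 5;
  return addSpecialized(a, b);
}

// Type-unstable: b may become a string, so the guard can fail.
function runUnstable(something) {
  var a = 1;
  var b = 2;
  if (something) b = "word";
  return addSpecialized(a, b);
}
```

Calling `run` with either argument stays on the guarded fast path, while `runUnstable(true)` fails the guard and takes the generic path.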


Go Faster with Parallelism

● JavaScript does not support parallelism
  ● Lots of concurrent programming models in use
  ● Various multi-threaded implementations
● WebWorkers
  ● Coming soon, threads with asynchronous message passing
● WebCL
  ● Coming soon, OpenCL for JavaScript
● Other proposals from the Server Side JavaScript community
  ● Sync and lock – JVM semantics
  ● Fork – as in fork a new process
● None of these are appealing for fine-grained parallelism
  ● Fine-grained parallelism == loop or procedure level
  ● Too low level or too heavy weight


Making Parallelism Easier

Use an Embedded Domain Specific Language (EDSL)

How to put a DSL into a general purpose language:
1. restrict the language to not do things you don't like
2. augment the language to do the things you do like
3. add runtime AST construction + DSL compiler

Get the benefits of the host syntax, compiler, libraries, etc.

Examples:
Copperhead for Python,
Intel Array Building Blocks / RapidMind / Ct for C++,
Microsoft Accelerator for .NET
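
The three steps above can be sketched in plain JavaScript. This is an illustration under stated assumptions, not Sluice's implementation: `kernelSource`, `checkRestrictions`, and `augment` are hypothetical helpers, and the list of forbidden constructs is invented for the example. The sketch relies only on `Function.prototype.toString`, which returns a function's source text at run time; a real EDSL would pass that text to a JavaScript parser to build an AST rather than scanning it with regular expressions.

```javascript
// Step 3 (first half): recover the kernel's source text at run time.
function kernelSource(fn) {
  return Function.prototype.toString.call(fn);
}

// Step 1: restrict the language. The forbidden list is illustrative only,
// and a regex scan is a crude stand-in for a check on a parsed AST.
var FORBIDDEN = [/\beval\s*\(/, /\bwith\s*\(/, /\bnew\s+Function\b/];

function checkRestrictions(fn) {
  var src = kernelSource(fn);
  return FORBIDDEN.every(function (pattern) {
    return !pattern.test(src);
  });
}

// Step 2: augment the language with stream operations on the kernel object.
function augment(kernel, input, output) {
  kernel.pop = function () { return input.shift(); };
  kernel.push = function (e) { output.push(e); };
  return kernel;
}

var ok = function () { this.push(this.pop() + 1); };
var bad = function () { this.push(eval("1 + 1")); };

console.log(checkRestrictions(ok), checkRestrictions(bad));
```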


Sluice: EDSL for Stream Parallelism

• Kahn Process Network model of computation
• StreamIt style program construction

[Diagram: kernel graphs showing three composition patterns: pipeline, split-join, and data parallel.]

    function Adder(arg) {
        this.a = arg;
        this.work = function() {
            var e = this.pop();
            e = e + this.a;
            this.push(e);
            return false;
        }
    }

    var add0 = new Adder(1);
    var add1 = new Adder(1);
    var add2 = new Adder(1);

    var sj = Sluice.SplitRR(1, add0, add1, add2).JoinRR(1);
    var p = Sluice.Pipeline(new Count(10), sj, new Printer());

    > p.run()
    1 2 3 4 5 6 7 8 9 10
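
The kernel protocol used above (a `work` function that pops inputs, pushes outputs, and returns `false` to be called again) can be exercised without Sluice. The following is a minimal sketch under assumptions: `Count` is not defined on the slide, so the version here is a guess at a source kernel that emits 0 through n-1, and `runPipeline` is a hypothetical stand-in that runs each stage to completion in order rather than scheduling kernels the way Sluice does.

```javascript
// Hypothetical source kernel: emits 0 .. n-1, then signals completion.
function Count(n) {
  this.next = 0;
  this.work = function () {
    this.push(this.next++);
    return this.next >= n; // true = finished, false = call work() again
  };
}

// Adder kernel in the style of the slide.
function Adder(arg) {
  this.a = arg;
  this.work = function () {
    var e = this.pop();
    this.push(e + this.a);
    return false;
  };
}

// Stand-in for a pipeline runner: each stage consumes the previous
// stage's complete output. Assumes the first kernel is a source.
function runPipeline(kernels) {
  var stream = [];
  kernels.forEach(function (k, index) {
    var input = stream;
    var output = [];
    k.pop = function () { return input.shift(); };
    k.push = function (e) { output.push(e); };
    if (index === 0) {
      while (k.work() === false) { /* source runs until it reports done */ }
    } else {
      while (input.length > 0) k.work(); // run while input remains
    }
    stream = output;
  });
  return stream;
}

var result = runPipeline([new Count(10), new Adder(1)]);
console.log(result.join(" ")); // 1 2 3 4 5 6 7 8 9 10
```

Running stages one after another keeps the sketch short; it gives up the concurrency between stages that a stream runtime would exploit.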


Default Sluice Execution: Run as Normal JavaScript

● Sluice uses a pure JavaScript implementation by default
● Kernels execute without modification (it's just JavaScript)
● Map streams to JavaScript Arrays with bounded sizes
● Simple Scheduling Algorithm:
  ● Single threaded scheduler based on coroutines
  ● yield to the scheduler when blocked on full or empty stream
  ● Choose next kernel to run using round-robin
● Requires generators – not yet in the language.
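
Generator functions were standardized in ECMAScript 2015, so the scheduling scheme described above can be sketched directly. This is an illustrative sketch, not Sluice's scheduler: `makeStream`, `producer`, `consumer`, and `runRoundRobin` are hypothetical names. Each kernel is a generator that yields to the scheduler when its output stream is full or its input stream is empty, and the scheduler resumes the live kernels in round-robin order.

```javascript
// A bounded stream backed by a plain array.
function makeStream(capacity) {
  return { data: [], capacity: capacity };
}

// Kernel that writes n values, yielding while the stream is full.
function* producer(out, n) {
  for (var i = 0; i < n; i++) {
    while (out.data.length >= out.capacity) yield; // blocked on full stream
    out.data.push(i);
  }
}

// Kernel that reads n values, yielding while the stream is empty.
function* consumer(inp, n, results) {
  for (var i = 0; i < n; i++) {
    while (inp.data.length === 0) yield; // blocked on empty stream
    results.push(inp.data.shift());
  }
}

// Single-threaded round-robin scheduler: resume each live kernel once
// per round and drop kernels that have finished.
// Note: this loops forever if every remaining kernel is blocked.
function runRoundRobin(tasks) {
  var live = tasks.slice();
  while (live.length > 0) {
    live = live.filter(function (task) {
      return !task.next().done;
    });
  }
}

var stream = makeStream(2);
var received = [];
runRoundRobin([producer(stream, 5), consumer(stream, 5, received)]);
console.log(received.join(" ")); // 0 1 2 3 4
```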


Stream Parallelism Simplifies Program Analysis

    function CalculateForces(pos_rd, softeningSquared) {
        this.m_pos_rd = pos_rd;
        this.m_softeningSquared = softeningSquared;
        this.work = function () {
            var force = [0.0,0.0,0.0];
            var i = this.pop()*4;
            var N = this.pop()*4;
            for (var j=0; j


Sluice Acceleration

Target a low level representation for stream programs
● Stream and Kernel Intermediate Representation (SKIR)
● SKIR = LLVM + Kernels + Streams

Code generation:
● Typed kernels only, i.e. type-stable + inference
● Generate AST from Sluice kernel at runtime
● Translate AST to SKIR-C to SKIR

Currently uses programmer annotation:

    var k = new MyKernel(...);
    Sluice.toSkir(k, function(err, ret) {
        Sluice.Pipeline(..., ret, ...).run();
    });


Kernel Offload

● Offload selected kernels to SKIR runtime
  ● These kernels run in parallel with Sluice
● SKIR Runtime consists of:
  ● JIT compiler for stream parallel computation
    – Standard Synchronous Dataflow optimizations
      ● e.g. fusion, fission, batching
    – Shared memory multi-core and OpenCL backends
    – Support for dynamic optimization
  ● Dynamic scheduling algorithm
    – Single threaded or parallel execution
  ● Out of process communication mechanisms
    – RPC implemented as protobufs over sockets
    – Supports shared memory stream communication
    – Supports shared memory for kernel state
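
Fusion, one of the synchronous dataflow optimizations named above, can be illustrated in a few lines. The sketch is conceptual and written in JavaScript only for readability; the slide attributes these optimizations to the SKIR JIT, whose actual transformation is not shown in this deck. `scale`, `offset`, `runUnfused`, `fuse`, and `runFused` are hypothetical names. Fusing two adjacent one-in, one-out kernels removes the intermediate stream between them.

```javascript
// Two one-in, one-out kernels expressed as plain functions.
function scale(x) { return x * 2; }
function offset(x) { return x + 1; }

// Unfused: the first kernel writes a full intermediate stream
// that the second kernel then reads.
function runUnfused(input) {
  var intermediate = input.map(scale);
  return intermediate.map(offset);
}

// Fusion: combine both kernels into one work function,
// so no intermediate stream is materialized.
function fuse(first, second) {
  return function (x) { return second(first(x)); };
}

function runFused(input) {
  return input.map(fuse(scale, offset));
}

console.log(runUnfused([1, 2, 3]).join(" ")); // 3 5 7
console.log(runFused([1, 2, 3]).join(" "));   // 3 5 7
```

Fission is the reverse trade: one kernel is replicated so that several copies each process a share of the stream in parallel.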


Kernel Offload

[Diagram: a Sluice pipeline K1, K2, K3 in the JavaScript process, with K2 offloaded across the process boundary to SKIR. The input stream, output stream, and state copy for K2 are held in shared memory; the call to K2 goes over RPC.]

1) Compile K2 to SKIR code
2) Transfer SKIR code
3) Allocate streams
4) Allocate & copy state
5) Call K2
6) JIT compile K2
7) Execute K2
8) Copy state back to JS


Evaluation

● Set of compute intensive kernels
  ● Routines from Pixastic image processing library
    – Already JavaScript, trivial to port to Sluice
    – 5 Megapixel images (2592x1944)
    – Task parallelism
  ● n-body particle physics simulation from CUDA SDK
    – Ported to JavaScript and Sluice from C++
    – Varying number of particles
    – Pipeline parallelism, Data parallelism
● Compare with V8 (node.js) execution on a 4-core Intel Nehalem


Pixastic Benchmarks

Ported 4 image processing routines: invert, sepia, sharpen, and edge detection

Original Invert Code

    function invert_ref(params)
    {
        var data = Pixastic.prepareData(params);
        var invertAlpha =
            !!params.options.invertAlpha;
        var rect = params.options.rect;
        var p = rect.width * rect.height;
        var pix = p*4, pix1 = pix + 1;
        var pix2 = pix + 2, pix3 = pix + 3;
        while (p--) {
            data[pix-=4] = 255 - data[pix];
            data[pix1-=4] = 255 - data[pix1];
            data[pix2-=4] = 255 - data[pix2];
            if (invertAlpha)
                data[pix3-=4] = 255 - data[pix3];
        }
        return true;
    }

Sluice Invert Code

    function invert_sluice(data, invert_alpha)
    {
        this.invertAlpha = invert_alpha;
        this.data = data;
        this.work = function () {
            var w = this.pop();
            var h = this.pop();
            var p = w * h;
            var pix = p*4, pix1 = pix + 1;
            var pix2 = pix + 2, pix3 = pix + 3;
            while (p--) {
                this.data[pix-=4] = 255 - this.data[pix];
                this.data[pix1-=4] = 255 - this.data[pix1];
                this.data[pix2-=4] = 255 - this.data[pix2];
                if (this.invertAlpha)
                    this.data[pix3-=4] = 255 - this.data[pix3];
            }
            this.push(w);
            this.push(h);
            return false;
        };
    }
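
To make the indexing in the invert loop easier to follow, here is a standalone sketch of the same loop applied to a tiny two-pixel RGBA image. `invertPixels` is a hypothetical wrapper that takes the pixel array and dimensions directly, so it needs neither Pixastic nor Sluice. Each index starts one pixel past the end of the array and is decremented by 4 before use, so the loop walks the image from the last pixel to the first.

```javascript
// Standalone version of the invert loop over a flat RGBA array.
function invertPixels(data, width, height, invertAlpha) {
  var p = width * height;
  var pix = p * 4, pix1 = pix + 1, pix2 = pix + 2, pix3 = pix + 3;
  while (p--) {
    data[pix -= 4] = 255 - data[pix];    // red
    data[pix1 -= 4] = 255 - data[pix1];  // green
    data[pix2 -= 4] = 255 - data[pix2];  // blue
    if (invertAlpha)
      data[pix3 -= 4] = 255 - data[pix3]; // alpha, only on request
  }
  return data;
}

// Two pixels: (0, 128, 255, alpha 255) and (10, 20, 30, alpha 200).
var image = [0, 128, 255, 255, 10, 20, 30, 200];
invertPixels(image, 2, 1, false);
console.log(image.join(" ")); // 255 127 0 255 245 235 225 200
```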


Results: Pixastic

[Figure, left panel: execution time (ms) per benchmark (invert, sepia, sharpen, edges) for ref, skir cold, and skir warm.]

● Test of acceleration due to SKIR
● Offload processing of single image to SKIR
● ref = ordinary JavaScript
● cold = SKIR - includes JIT overhead
● warm = SKIR - does not include JIT overhead

[Figure, right panel: time per image (ms) against number of images (1 to 10) for edges, sharpen, sepia, and invert.]

● Test of task parallelism
● Launch 1 to 10 image processing kernels in parallel
● Shows mean time per image
● 8 worker threads (4 cores / 8 hardware threads)


n-body simulation

    function CalculateForces(pos_rd, softeningSquared) {
        this.m_pos_rd = pos_rd;
        this.m_softeningSquared = softeningSquared;
        this.work = function () {
            var force = [0.0,0.0,0.0];
            var i = this.pop()*4;
            var N = this.pop()*4;
            for (var j=0; j


Results: n-body

[Figure, left panel: time per iteration (ms) against number of bodies for ref, sluice, and skir.]

● Test of acceleration due to SKIR
● Offload CalculateForces kernel to SKIR
● Shows mean time per simulation iteration
● ref = ordinary JavaScript
● sluice = sluice implementation
● skir = sluice implementation + SKIR offload

[Figure, right panel: speedup against number of threads.]

● Test of data parallelism
● Offload CalculateForces kernel as a data parallel kernel, use SKIR kernel fission
● Shows speedup vs. single thread
● Use 2, 4 or 8 SKIR threads (4 cores / 8 hardware threads)
● JavaScript also uses one thread


Summary

● Stream parallelism
  ● Isolates program state
  ● Abstracts inter-task communication
● Presented a method for making concurrency explicit and aiding type inference in JavaScript applications using stream parallelism
● Showed how to use these techniques to offload program kernels for parallel execution using a high performance JIT and runtime system for streaming computation
● Showed significant performance gains for the tested benchmarks


extend kernel work function

    function createKernel(workfn)
    {
        var that = workfn;
        if (that._sluice_kernel !== true) {
            that._sluice_kernel = true;
            if (that.work == undefined)
                that.work = workfn;
            that.ins = [];
            that.outs = [];
            that.pop = createPop(that,0);
            that.popN = createPop(that);
            that.push = createPush(that,0);
            that.pushN = createPush(that);
            that.fiber = Fiber(
                function() {
                    var r = false;
                    while(r == false)
                        r = that.work.call(that);
                    return r;
                });
        }
        return that;
    }

    // return a pop operation
    function createArrayPop(kernel,stream)
    {
        var k = kernel;
        if (stream != null) {
            var s = stream;
            return function() {
                var a = k.ins[s].data;
                while (!a.length) yield(k.ins[s].src == 1);
                var e = a.shift();
                return e;
            };
        } else {
            return function(s) {
                var a = k.ins[s].data;
                while (!a.length) yield(k.ins[s].src == 1);
                var e = a.shift();
                return e;
            };
        }
    }


    benchmarks['invert'] = {
        ref : function (rgba, cb) {
            var ret = invert_ref(rgba, 2592, 1944, false);
            cb(ret);
        },
        skir : function (rgba, cb) {
            var invert = new invert_sluice(rgba, 0);
            toSkir(invert,
                function(err, ret) {
                    var p = new sluice.Pipeline(
                        function() { this.push(2592); //w
                                     this.push(1944); //h
                                     return true; },
                        ret,
                        function() { this.pop();
                                     return true; }
                    ).run(function() { cb(invert.data); });
                });
        }
    };

    var rgba = [];
    for (var i=0; i
