A Methodology for Fine-Grained Parallelism in JavaScript Applications
Computer Systems @ Colorado<br />
COMPUTER AND CYBERPHYSICAL SYSTEMS RESEARCH AT COLORADO<br />
http://systems.cs.colorado.edu<br />
A Methodology for Fine-Grained Parallelism in JavaScript Applications<br />
Jeff Fifield and Dirk Grunwald<br />
Department of Computer Science<br />
University of Colorado<br />
LCPC Workshop<br />
September 8, 2011
Modern JavaScript<br />
Most web applications are written using JavaScript<br />
• Email<br />
• Office suites<br />
• Games<br />
• IDEs<br />
Many server side applications use JavaScript<br />
• CommonJS<br />
• node.js, RingoJS, Wakanda, many more...<br />
• MongoDB map/reduce, CouchDB views<br />
APIs for Desktop, Mobile
Why is JavaScript Slow?<br />
It's not because JavaScript is interpreted:<br />
- current JavaScript engines are JIT compilers<br />
- they perform standard optimizations and code generation<br />
- they can generate good code<br />
It's because JavaScript is untyped:<br />
what does c = a + b mean<br />
if a = 1, b = 2?<br />
if a = "cat", b = "dog"?<br />
Other reasons too:<br />
object creation, use of eval, sparse arrays, ...<br />
but these things shouldn't be in hot loops anyway
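The overloaded `+` above can be made concrete; a minimal sketch of the two cases the slide asks about:

```javascript
// The same expression c = a + b dispatches on the runtime types of a and b:
// numeric addition for numbers, concatenation for strings. The engine cannot
// know which until it sees the values.
function addDemo(a, b) {
  return a + b;
}

console.log(addDemo(1, 2));         // → 3
console.log(addDemo("cat", "dog")); // → "catdog"
```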
Type Specializing JITs<br />
Luckily, programmers usually write Type-Stable code:<br />
That is,<br />
a = 1;<br />
b = 2;<br />
if (something)<br />
b = 5;<br />
c = a + b;<br />
and not,<br />
a = 1;<br />
b = 2;<br />
if (something)<br />
b = "word";<br />
c = a + b;<br />
Type specializing, dynamic optimizing JIT compilers:<br />
• Observe types through profiling or tracing (still emit checks)<br />
• Prove types using type inference (no checks, research area)<br />
• Generate code specialized to given types<br />
• Google V8 Crankshaft, Mozilla TraceMonkey, JägerMonkey<br />
Good JS performance requires Type-Stable Code
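The "still emit checks" point can be sketched in plain JavaScript. This is a hypothetical illustration of what a profiling JIT conceptually emits, not actual engine output:

```javascript
// Sketch of a type-specialized add: after observing only numbers at this
// site, the engine emits a fast path guarded by type checks, with a
// fall-back ("deoptimization") to the generic semantics when a guard fails.
function genericAdd(a, b) {
  return a + b; // full JavaScript + semantics
}

function specializedAdd(a, b) {
  // Guards: verify the profiled assumption (both operands are numbers).
  if (typeof a === "number" && typeof b === "number") {
    return a + b; // in a real engine, a raw machine-level add
  }
  return genericAdd(a, b); // bail out to the unspecialized path
}

console.log(specializedAdd(1, 2));         // → 3
console.log(specializedAdd("cat", "dog")); // → "catdog"
```

Type-stable code keeps execution on the guarded fast path; type-unstable code forces the bail-out, which is why it is slow.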
Go Faster with Parallelism<br />
● JavaScript does not support parallelism<br />
  ● Lots of concurrent programming models in use<br />
  ● Various multi-threaded implementations<br />
● WebWorkers<br />
  ● Coming soon, threads with asynchronous message passing<br />
● WebCL<br />
  ● Coming soon, OpenCL for JavaScript<br />
● Other proposals from the Server Side JavaScript community<br />
  ● Sync and lock – JVM semantics<br />
  ● Fork – as in fork a new process<br />
● None of these are appealing for fine-grained parallelism<br />
  ● Fine-grained parallelism == loop or procedure level<br />
  ● Too low level or too heavy weight
Making Parallelism Easier<br />
Use an Embedded Domain Specific Language (EDSL)<br />
How to put a DSL into a general purpose language:<br />
1. restrict the language to not do things you don't like<br />
2. augment the language to do the things you do like<br />
3. add runtime AST construction + DSL compiler<br />
Get the benefits of the host syntax, compiler, libraries, etc.<br />
Examples:<br />
Copperhead for Python,<br />
Intel Array Building Blocks / RapidMind / Ct for C++,<br />
Microsoft Accelerator for .NET
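Step 3 above, runtime AST construction, can be sketched with a toy EDSL. All names here are hypothetical illustrations, not the Sluice API:

```javascript
// Minimal sketch of runtime AST construction for an EDSL: host-language
// calls build an AST instead of computing immediately, and a tiny "DSL
// compiler" walks the tree. Here the backend is just an interpreter, but it
// could equally emit parallel or vector code.
function Num(v)    { return { op: "num", v: v }; }
function Add(l, r) { return { op: "add", l: l, r: r }; }
function Mul(l, r) { return { op: "mul", l: l, r: r }; }

function evalAst(n) {
  switch (n.op) {
    case "num": return n.v;
    case "add": return evalAst(n.l) + evalAst(n.r);
    case "mul": return evalAst(n.l) * evalAst(n.r);
  }
}

var ast = Mul(Add(Num(1), Num(2)), Num(10)); // (1 + 2) * 10
console.log(evalAst(ast)); // → 30
```

The host language provides the syntax, parser, and libraries for free; the EDSL only has to define the restricted set of operations it is willing to compile.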
Sluice: EDSL for Stream Parallelism<br />
• Kahn Process Network model of computation<br />
• StreamIt style program construction<br />
[Figure: stream graphs built from Kernels – a pipeline, a split-join with Split and Join nodes, and a data parallel graph]<br />
function Adder(arg) {<br />
this.a = arg;<br />
this.work = function() {<br />
var e = this.pop();<br />
e = e + this.a;<br />
this.push(e);<br />
return false;<br />
}<br />
}<br />
var add0 = new Adder(1);<br />
var add1 = new Adder(1);<br />
var add2 = new Adder(1);<br />
var sj = Sluice.SplitRR(1, add0, add1, add2).JoinRR(1);<br />
var p = Sluice.Pipeline(new Count(10), sj, new Printer());<br />
> p.run()<br />
1 2 3 4 5 6 7 8 9 10
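The round-robin split-join in the example can be modeled on plain arrays. This is an assumption about the SplitRR/JoinRR semantics, not the Sluice implementation:

```javascript
// Sketch of round-robin split-join semantics: SplitRR(1, ...) deals elements
// one at a time to each branch kernel, JoinRR(1) collects results back in
// the same order, so the output order matches the input order.
function splitJoinRR(input, kernels) {
  var output = [];
  for (var i = 0; i < input.length; i++) {
    var k = kernels[i % kernels.length]; // round-robin branch choice
    output.push(k(input[i]));
  }
  return output;
}

var addOne = function (e) { return e + 1; };
// Three identical Adder(1)-style branches, as in the slide.
console.log(splitJoinRR([0, 1, 2, 3, 4], [addOne, addOne, addOne]));
// → [1, 2, 3, 4, 5]
```

Because each branch holds no shared state and element order is preserved by construction, the three branches are free to run in parallel.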
Default Sluice Execution: Run as Normal JavaScript<br />
● Sluice uses a pure JavaScript implementation by default<br />
● Kernels execute without modification (it's just JavaScript)<br />
● Map streams to JavaScript Arrays with bounded sizes<br />
● Simple Scheduling Algorithm:<br />
  ● Single threaded scheduler based on coroutines<br />
  ● Yield to the scheduler when blocked on a full or empty stream<br />
  ● Choose next kernel to run using round-robin<br />
● Requires generators – not yet in the language.
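The scheduling algorithm above can be sketched with ES6 generators (standardized after this talk; at the time they required engine extensions). The kernel and scheduler names are hypothetical:

```javascript
// Coroutine scheduler sketch: kernels are generator functions that yield
// when blocked on an empty stream; the scheduler resumes them round-robin
// until every kernel has finished.
function makeScheduler(kernels) {
  // Instantiate one coroutine (generator object) per kernel.
  var live = kernels.map(function (k) { return k(); });
  while (live.length) {
    // Round-robin: give each runnable kernel one turn; drop finished ones.
    live = live.filter(function (f) { return !f.next().done; });
  }
}

var queue = []; // bounded stream modeled as a plain array
var out = [];

function* producer() {
  for (var i = 1; i <= 3; i++) { queue.push(i); yield; }
}

function* consumer() {
  var got = 0;
  while (got < 3) {
    while (queue.length === 0) yield; // blocked on empty stream: yield
    out.push(queue.shift() * 10);
    got++;
  }
}

makeScheduler([producer, consumer]);
console.log(out); // → [10, 20, 30]
```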
Stream Parallelism Simplifies Program Analysis<br />
function CalculateForces(pos_rd, softeningSquared) {<br />
this.m_pos_rd = pos_rd;<br />
this.m_softeningSquared = softeningSquared;<br />
this.work = function () {<br />
var force = [0.0,0.0,0.0];<br />
var i = this.pop()*4;<br />
var N = this.pop()*4;<br />
for (var j=0; j<N; j+=4) { ... }
Sluice Acceleration<br />
Target a low level representation for stream programs:<br />
● Stream and Kernel Intermediate Representation (SKIR)<br />
● SKIR = LLVM + Kernels + Streams<br />
Code generation:<br />
● Typed kernels only, i.e. type-stable + inference<br />
● Generate AST from Sluice kernel at runtime<br />
● Translate AST to SKIR-C to SKIR<br />
Currently uses programmer annotation:<br />
var k = new MyKernel(...);<br />
Sluice.toSkir(k, function(err, ret) {<br />
Sluice.Pipeline(..., ret, ...).run();<br />
});
Kernel Offload<br />
● Offload selected kernels to SKIR runtime<br />
  ● These kernels run in parallel with Sluice<br />
● SKIR Runtime consists of:<br />
  ● JIT compiler for stream parallel computation<br />
    – Standard Synchronous Dataflow optimizations, e.g. fusion, fission, batching<br />
    – Shared memory multi-core and OpenCL backends<br />
    – Support for dynamic optimization<br />
  ● Dynamic scheduling algorithm<br />
    – Single threaded or parallel execution<br />
  ● Out of process communication mechanisms<br />
    – RPC implemented as protobufs over sockets<br />
    – Supports shared memory stream communication<br />
    – Supports shared memory for kernel state
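Kernel fusion, one of the synchronous-dataflow optimizations named above, can be illustrated in plain JavaScript (illustrative only, not SKIR code):

```javascript
// Fusion sketch: two rate-matched kernels (each pops 1, pushes 1) are merged
// into a single kernel, eliminating the intermediate stream and its
// enqueue/dequeue traffic.
var scale = function (x) { return x * 2; };  // kernel A: pop 1, push 1
var offset = function (x) { return x + 3; }; // kernel B: pop 1, push 1

// Unfused: run A over the stream, materialize the middle stream, then run B.
function runPipeline(input) {
  var mid = input.map(scale);
  return mid.map(offset);
}

// Fused: one pass, no intermediate stream.
function runFused(input) {
  return input.map(function (x) { return offset(scale(x)); });
}

console.log(runPipeline([1, 2, 3])); // → [5, 7, 9]
console.log(runFused([1, 2, 3]));    // → [5, 7, 9]
```

Fission is the dual transformation: splitting one data-parallel kernel into several copies, as in the n-body experiment later in the talk.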
Kernel Offload<br />
[Figure: a Sluice pipeline K1 → K2 → K3 in the JavaScript process; K2 is offloaded across the process boundary to SKIR, connected by shared memory input and output streams, a shared memory copy of its state, and RPC]<br />
1) Compile K2 to SKIR code<br />
2) Transfer SKIR code<br />
3) Allocate streams<br />
4) Allocate & copy state<br />
5) Call K2<br />
6) JIT compile K2<br />
7) Execute K2<br />
8) Copy state back to JS
Evaluation<br />
● Set of compute intensive kernels<br />
  ● Routines from Pixastic image processing library<br />
    – Already JavaScript, trivial to port to Sluice<br />
    – 5 Megapixel images (2592x1944)<br />
    – Task parallelism<br />
  ● n-body particle physics simulation from CUDA SDK<br />
    – Ported to JavaScript and Sluice from C++<br />
    – Varying number of particles<br />
    – Pipeline parallelism, Data parallelism<br />
● Compare with V8 (node.js) execution on a 4-core Intel Nehalem
Pixastic Benchmarks<br />
Ported 4 image processing routines: invert, sepia, sharpen, and edge detection<br />
Original Invert Code<br />
function invert_ref(params)<br />
{<br />
var data = Pixastic.prepareData(params);<br />
var invertAlpha = !!params.options.invertAlpha;<br />
var rect = params.options.rect;<br />
var p = rect.width * rect.height;<br />
var pix = p*4, pix1 = pix + 1;<br />
var pix2 = pix + 2, pix3 = pix + 3;<br />
while (p--) {<br />
data[pix-=4] = 255 - data[pix];<br />
data[pix1-=4] = 255 - data[pix1];<br />
data[pix2-=4] = 255 - data[pix2];<br />
if (invertAlpha)<br />
data[pix3-=4] = 255 - data[pix3];<br />
}<br />
return true;<br />
}<br />
Sluice Invert Code<br />
function invert_sluice(data, invert_alpha)<br />
{<br />
this.invertAlpha = invert_alpha;<br />
this.data = data;<br />
this.work = function () {<br />
var w = this.pop();<br />
var h = this.pop();<br />
var p = w * h;<br />
var pix = p*4, pix1 = pix + 1;<br />
var pix2 = pix + 2, pix3 = pix + 3;<br />
while (p--) {<br />
this.data[pix-=4] = 255 - this.data[pix];<br />
this.data[pix1-=4] = 255 - this.data[pix1];<br />
this.data[pix2-=4] = 255 - this.data[pix2];<br />
if (this.invertAlpha)<br />
this.data[pix3-=4] = 255 - this.data[pix3];<br />
}<br />
this.push(w);<br />
this.push(h);<br />
return false;<br />
};<br />
}
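The invert inner loop above can be exercised on a tiny buffer to see its backwards walk over the RGBA channels:

```javascript
// The invert loop from the slide, run on a two-pixel RGBA buffer (w = 2,
// h = 1). pix starts one pixel past the end and the -=4 in each subscript
// steps it back before the channel is read and written.
function invert(data, w, h, invertAlpha) {
  var p = w * h;
  var pix = p * 4, pix1 = pix + 1;
  var pix2 = pix + 2, pix3 = pix + 3;
  while (p--) {
    data[pix -= 4] = 255 - data[pix];
    data[pix1 -= 4] = 255 - data[pix1];
    data[pix2 -= 4] = 255 - data[pix2];
    if (invertAlpha)
      data[pix3 -= 4] = 255 - data[pix3];
  }
  return data;
}

var d = [0, 10, 20, 255, 100, 110, 120, 255];
console.log(invert(d, 2, 1, false));
// → [255, 245, 235, 255, 155, 145, 135, 255]
```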
Results: Pixastic<br />
[Left chart: execution time (ms) per benchmark (invert, sepia, sharpen, edges) for ref, skir cold, and skir warm]<br />
[Right chart: mean time per image (ms) vs. number of images (1–10) for edges, sharpen, sepia, and invert]<br />
● Test of acceleration due to SKIR<br />
  ● Offload processing of single image to SKIR<br />
  ● ref = ordinary JavaScript<br />
  ● cold = SKIR - includes JIT overhead<br />
  ● warm = SKIR - does not include JIT overhead<br />
● Test of task parallelism<br />
  ● Launch 1 to 10 image processing kernels in parallel<br />
  ● Shows mean time per image<br />
  ● 8 worker threads (4 cores / 8 hardware threads)
n-body simulation<br />
function CalculateForces(pos_rd, softeningSquared) {<br />
this.m_pos_rd = pos_rd;<br />
this.m_softeningSquared = softeningSquared;<br />
this.work = function () {<br />
var force = [0.0,0.0,0.0];<br />
var i = this.pop()*4;<br />
var N = this.pop()*4;<br />
for (var j=0; j<N; j+=4) { ... }
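The slide's loop body is cut off in this copy. A hedged reconstruction of what such a kernel computes, following the standard softened gravitational interaction in the CUDA SDK n-body sample and assuming positions are packed [x, y, z, mass] per body (matching the *4 strides above):

```javascript
// Reconstruction sketch (not the slide's exact code): accumulate the softened
// gravitational force on body i from every body j. Softening keeps the
// self-interaction term finite; its contribution is zero because dx=dy=dz=0.
function calculateForce(pos, i, softeningSquared) {
  var force = [0.0, 0.0, 0.0];
  var N = pos.length;
  for (var j = 0; j < N; j += 4) {
    var dx = pos[j] - pos[i];
    var dy = pos[j + 1] - pos[i + 1];
    var dz = pos[j + 2] - pos[i + 2];
    var distSqr = dx * dx + dy * dy + dz * dz + softeningSquared;
    var invDist = 1.0 / Math.sqrt(distSqr);
    var s = pos[j + 3] * invDist * invDist * invDist; // mass / dist^3
    force[0] += dx * s;
    force[1] += dy * s;
    force[2] += dz * s;
  }
  return force;
}

// Two unit-mass bodies one unit apart on the x axis.
var f = calculateForce([0, 0, 0, 1, 1, 0, 0, 1], 0, 0.01);
console.log(f[0] > 0 && f[1] === 0 && f[2] === 0); // → true
```

The kernel only reads the shared position array and its own state, which is what makes the stream-parallel analysis on the slide straightforward.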
Results: n-body<br />
[Left chart: time per iteration (ms) vs. number of bodies (100–1500) for ref, sluice, and skir]<br />
[Right chart: speedup vs. number of threads (2, 4, 8) for 1024, 2048, 3072, and 4096 bodies]<br />
● Test of acceleration due to SKIR<br />
  ● Offload CalculateForces kernel to SKIR<br />
  ● Shows mean time per simulation iteration<br />
  ● ref = ordinary JavaScript<br />
  ● sluice = sluice implementation<br />
  ● skir = sluice implementation + SKIR offload<br />
● Test of data parallelism<br />
  ● Offload CalculateForces kernel as a data parallel kernel, use SKIR kernel fission<br />
  ● Shows speedup vs. single thread<br />
  ● Use 2, 4 or 8 SKIR threads (4 cores / 8 hardware threads)<br />
  ● JavaScript also uses one thread
Summary<br />
● Stream parallelism<br />
  ● Isolates program state<br />
  ● Abstracts inter-task communication<br />
● Presented a method for making concurrency explicit and aiding type inference in JavaScript applications using stream parallelism<br />
● Showed how to use these techniques to offload program kernels for parallel execution using a high performance JIT and runtime system for streaming computation<br />
● Showed significant performance gains for the tested benchmarks
Blank Slide
Extend kernel work function<br />
function createKernel(workfn)<br />
{<br />
var that = workfn;<br />
if (that._sluice_kernel !== true) {<br />
that._sluice_kernel = true;<br />
if (that.work == undefined)<br />
that.work = workfn;<br />
that.ins = [];<br />
that.outs = [];<br />
that.pop = createPop(that,0);<br />
that.popN = createPop(that);<br />
that.push = createPush(that,0);<br />
that.pushN = createPush(that);<br />
that.fiber = Fiber(<br />
function() {<br />
var r = false;<br />
while (r == false)<br />
r = that.work.call(that);<br />
return r;<br />
});<br />
}<br />
return that;<br />
}<br />
// return a pop operation<br />
function createArrayPop(kernel,stream)<br />
{<br />
var k = kernel;<br />
if (stream != null) {<br />
var s = stream;<br />
return function() {<br />
var a = k.ins[s].data;<br />
while (!a.length) yield(k.ins[s].src == 1);<br />
var e = a.shift();<br />
return e;<br />
};<br />
} else {<br />
return function(s) {<br />
var a = k.ins[s].data;<br />
while (!a.length) yield(k.ins[s].src == 1);<br />
var e = a.shift();<br />
return e;<br />
};<br />
}<br />
}
benchmarks['invert'] = {<br />
ref : function (rgba, cb) {<br />
var ret = invert_ref(rgba, 2592, 1944, false);<br />
cb(ret);<br />
},<br />
skir : function (rgba, cb) {<br />
var invert = new invert_sluice(rgba, 0);<br />
toSkir(invert,<br />
function(err, ret) {<br />
var p = new sluice.Pipeline( function(){ this.push(2592); //w<br />
this.push(1944); //h<br />
return true; },<br />
ret,<br />
function() { this.pop();<br />
return true; }<br />
).run(function() { cb(invert.data); });<br />
});<br />
}<br />
};<br />
var rgba = [];<br />
for (var i=0; i < ...