A Methodology for Fine-Grained Parallelism in JavaScript Applications
Computer Systems @ Colorado<br />
COMPUTER AND CYBERPHYSICAL SYSTEMS RESEARCH AT COLORADO<br />
http://systems.cs.colorado.edu<br />
A Methodology for Fine-Grained Parallelism in JavaScript Applications<br />
Jeff Fifield and Dirk Grunwald<br />
Department of Computer Science<br />
University of Colorado<br />
LCPC Workshop<br />
September 8, 2011
Modern JavaScript<br />
Most web applications are written using JavaScript<br />
• Email<br />
• Office suites<br />
• Games<br />
• IDEs<br />
Many server side applications use JavaScript<br />
• CommonJS<br />
• node.js, RingoJS, Wakanda, many more...<br />
• MongoDB map/reduce, CouchDB views<br />
APIs for Desktop, Mobile
Why is JavaScript Slow?<br />
It's not because JavaScript is interpreted:<br />
- current JavaScript engines are JIT compilers<br />
- they perform standard optimizations and code generation<br />
- they can generate good code<br />
It's because JavaScript is untyped:<br />
what does c = a + b mean<br />
if a = 1, b = 2?<br />
if a = "cat", b = "dog"?<br />
Other reasons too:<br />
object creation, use of eval, sparse arrays, ...<br />
but these things shouldn't be in hot loops anyway
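The overloaded `+` above can be made concrete; a minimal sketch of the two cases the slide asks about:

```javascript
// The same expression c = a + b dispatches on the runtime types of a and b:
// numeric addition for numbers, concatenation for strings. The engine cannot
// know which until it sees the values.
function addDemo(a, b) {
  return a + b;
}

console.log(addDemo(1, 2));         // → 3
console.log(addDemo("cat", "dog")); // → "catdog"
```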
Type Specializing JITs<br />
Luckily, programmers usually write Type-Stable code:<br />
That is,<br />
a = 1;<br />
b = 2;<br />
if (something)<br />
b = 5;<br />
c = a + b;<br />
and not,<br />
a = 1;<br />
b = 2;<br />
if (something)<br />
b = "word";<br />
c = a + b;<br />
Type specializing, dynamic optimizing JIT compilers:<br />
• Observe types through profiling or tracing (still emit checks)<br />
• Prove types using type inference (no checks, research area)<br />
• Generate code specialized to given types<br />
• Google V8 Crankshaft, Mozilla TraceMonkey, JägerMonkey<br />
Good JS performance requires Type-Stable Code
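The "still emit checks" point can be sketched in plain JavaScript. This is a hypothetical illustration of what a profiling JIT conceptually emits, not actual engine output:

```javascript
// Sketch of a type-specialized add: after observing only numbers at this
// site, the engine emits a fast path guarded by type checks, with a
// fall-back ("deoptimization") to the generic semantics when a guard fails.
function genericAdd(a, b) {
  return a + b; // full JavaScript + semantics
}

function specializedAdd(a, b) {
  // Guards: verify the profiled assumption (both operands are numbers).
  if (typeof a === "number" && typeof b === "number") {
    return a + b; // in a real engine, a raw machine-level add
  }
  return genericAdd(a, b); // bail out to the unspecialized path
}

console.log(specializedAdd(1, 2));         // → 3
console.log(specializedAdd("cat", "dog")); // → "catdog"
```

Type-stable code keeps execution on the guarded fast path; type-unstable code forces the bail-out, which is why it is slow.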
Go Faster with Parallelism<br />
● JavaScript does not support parallelism<br />
  ● Lots of concurrent programming models in use<br />
  ● Various multi-threaded implementations<br />
● WebWorkers<br />
  ● Coming soon, threads with asynchronous message passing<br />
● WebCL<br />
  ● Coming soon, OpenCL for JavaScript<br />
● Other proposals from the Server Side JavaScript community<br />
  ● Sync and lock – JVM semantics<br />
  ● Fork – as in fork a new process<br />
● None of these are appealing for fine-grained parallelism<br />
  ● Fine-grained parallelism == loop or procedure level<br />
  ● Too low level or too heavy weight
Making Parallelism Easier<br />
Use an Embedded Domain Specific Language (EDSL)<br />
How to put a DSL into a general purpose language:<br />
1. restrict the language to not do things you don't like<br />
2. augment the language to do the things you do like<br />
3. add runtime AST construction + DSL compiler<br />
Get the benefits of the host syntax, compiler, libraries, etc.<br />
Examples:<br />
Copperhead for Python,<br />
Intel Array Building Blocks / RapidMind / Ct for C++,<br />
Microsoft Accelerator for .NET
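Step 3 above, runtime AST construction, can be sketched with a toy EDSL. All names here are hypothetical illustrations, not the Sluice API:

```javascript
// Minimal sketch of runtime AST construction for an EDSL: host-language
// calls build an AST instead of computing immediately, and a tiny "DSL
// compiler" walks the tree. Here the backend is just an interpreter, but it
// could equally emit parallel or vector code.
function Num(v)    { return { op: "num", v: v }; }
function Add(l, r) { return { op: "add", l: l, r: r }; }
function Mul(l, r) { return { op: "mul", l: l, r: r }; }

function evalAst(n) {
  switch (n.op) {
    case "num": return n.v;
    case "add": return evalAst(n.l) + evalAst(n.r);
    case "mul": return evalAst(n.l) * evalAst(n.r);
  }
}

var ast = Mul(Add(Num(1), Num(2)), Num(10)); // (1 + 2) * 10
console.log(evalAst(ast)); // → 30
```

The host language provides the syntax, parser, and libraries for free; the EDSL only has to define the restricted set of operations it is willing to compile.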
Sluice: EDSL for Stream Parallelism<br />
• Kahn Process Network model of computation<br />
• StreamIt style program construction<br />
[Figure: stream graphs built from Kernels – a pipeline, a split-join with Split and Join nodes, and a data parallel graph]<br />
function Adder(arg) {<br />
this.a = arg;<br />
this.work = function() {<br />
var e = this.pop();<br />
e = e + this.a;<br />
this.push(e);<br />
return false;<br />
}<br />
}<br />
var add0 = new Adder(1);<br />
var add1 = new Adder(1);<br />
var add2 = new Adder(1);<br />
var sj = Sluice.SplitRR(1, add0, add1, add2).JoinRR(1);<br />
var p = Sluice.Pipeline(new Count(10), sj, new Printer());<br />
> p.run()<br />
1 2 3 4 5 6 7 8 9 10
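The round-robin split-join in the example can be modeled on plain arrays. This is an assumption about the SplitRR/JoinRR semantics, not the Sluice implementation:

```javascript
// Sketch of round-robin split-join semantics: SplitRR(1, ...) deals elements
// one at a time to each branch kernel, JoinRR(1) collects results back in
// the same order, so the output order matches the input order.
function splitJoinRR(input, kernels) {
  var output = [];
  for (var i = 0; i < input.length; i++) {
    var k = kernels[i % kernels.length]; // round-robin branch choice
    output.push(k(input[i]));
  }
  return output;
}

var addOne = function (e) { return e + 1; };
// Three identical Adder(1)-style branches, as in the slide.
console.log(splitJoinRR([0, 1, 2, 3, 4], [addOne, addOne, addOne]));
// → [1, 2, 3, 4, 5]
```

Because each branch holds no shared state and element order is preserved by construction, the three branches are free to run in parallel.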
Default Sluice Execution: Run as Normal JavaScript<br />
● Sluice uses a pure JavaScript implementation by default<br />
● Kernels execute without modification (it's just JavaScript)<br />
● Map streams to JavaScript Arrays with bounded sizes<br />
● Simple Scheduling Algorithm:<br />
  ● Single threaded scheduler based on coroutines<br />
  ● Yield to the scheduler when blocked on a full or empty stream<br />
  ● Choose next kernel to run using round-robin<br />
● Requires generators – not yet in the language.
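The scheduling algorithm above can be sketched with ES6 generators (standardized after this talk; at the time they required engine extensions). The kernel and scheduler names are hypothetical:

```javascript
// Coroutine scheduler sketch: kernels are generator functions that yield
// when blocked on an empty stream; the scheduler resumes them round-robin
// until every kernel has finished.
function makeScheduler(kernels) {
  // Instantiate one coroutine (generator object) per kernel.
  var live = kernels.map(function (k) { return k(); });
  while (live.length) {
    // Round-robin: give each runnable kernel one turn; drop finished ones.
    live = live.filter(function (f) { return !f.next().done; });
  }
}

var queue = []; // bounded stream modeled as a plain array
var out = [];

function* producer() {
  for (var i = 1; i <= 3; i++) { queue.push(i); yield; }
}

function* consumer() {
  var got = 0;
  while (got < 3) {
    while (queue.length === 0) yield; // blocked on empty stream: yield
    out.push(queue.shift() * 10);
    got++;
  }
}

makeScheduler([producer, consumer]);
console.log(out); // → [10, 20, 30]
```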
Stream Parallelism Simplifies Program Analysis<br />
function CalculateForces(pos_rd, softeningSquared) {<br />
this.m_pos_rd = pos_rd;<br />
this.m_softeningSquared = softeningSquared;<br />
this.work = function () {<br />
var force = [0.0,0.0,0.0];<br />
var i = this.pop()*4;<br />
var N = this.pop()*4;<br />
for (var j=0; j<N; j+=4) { ... }
Sluice Acceleration<br />
Target a low level representation for stream programs:<br />
● Stream and Kernel Intermediate Representation (SKIR)<br />
● SKIR = LLVM + Kernels + Streams<br />
Code generation:<br />
● Typed kernels only, i.e. type-stable + inference<br />
● Generate AST from Sluice kernel at runtime<br />
● Translate AST to SKIR-C to SKIR<br />
Currently uses programmer annotation:<br />
var k = new MyKernel(...);<br />
Sluice.toSkir(k, function(err, ret) {<br />
Sluice.Pipeline(..., ret, ...).run();<br />
});
Kernel Offload<br />
● Offload selected kernels to SKIR runtime<br />
  ● These kernels run in parallel with Sluice<br />
● SKIR Runtime consists of:<br />
  ● JIT compiler for stream parallel computation<br />
    – Standard Synchronous Dataflow optimizations, e.g. fusion, fission, batching<br />
    – Shared memory multi-core and OpenCL backends<br />
    – Support for dynamic optimization<br />
  ● Dynamic scheduling algorithm<br />
    – Single threaded or parallel execution<br />
  ● Out of process communication mechanisms<br />
    – RPC implemented as protobufs over sockets<br />
    – Supports shared memory stream communication<br />
    – Supports shared memory for kernel state
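Kernel fusion, one of the synchronous-dataflow optimizations named above, can be illustrated in plain JavaScript (illustrative only, not SKIR code):

```javascript
// Fusion sketch: two rate-matched kernels (each pops 1, pushes 1) are merged
// into a single kernel, eliminating the intermediate stream and its
// enqueue/dequeue traffic.
var scale = function (x) { return x * 2; };  // kernel A: pop 1, push 1
var offset = function (x) { return x + 3; }; // kernel B: pop 1, push 1

// Unfused: run A over the stream, materialize the middle stream, then run B.
function runPipeline(input) {
  var mid = input.map(scale);
  return mid.map(offset);
}

// Fused: one pass, no intermediate stream.
function runFused(input) {
  return input.map(function (x) { return offset(scale(x)); });
}

console.log(runPipeline([1, 2, 3])); // → [5, 7, 9]
console.log(runFused([1, 2, 3]));    // → [5, 7, 9]
```

Fission is the dual transformation: splitting one data-parallel kernel into several copies, as in the n-body experiment later in the talk.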
Kernel Offload<br />
[Figure: a Sluice pipeline K1 → K2 → K3 in the JavaScript process; K2 is offloaded across the process boundary to SKIR, connected by shared memory input and output streams, a shared memory copy of its state, and RPC]<br />
1) Compile K2 to SKIR code<br />
2) Transfer SKIR code<br />
3) Allocate streams<br />
4) Allocate & copy state<br />
5) Call K2<br />
6) JIT compile K2<br />
7) Execute K2<br />
8) Copy state back to JS
Evaluation<br />
● Set of compute intensive kernels<br />
  ● Routines from Pixastic image processing library<br />
    – Already JavaScript, trivial to port to Sluice<br />
    – 5 Megapixel images (2592x1944)<br />
    – Task parallelism<br />
  ● n-body particle physics simulation from CUDA SDK<br />
    – Ported to JavaScript and Sluice from C++<br />
    – Varying number of particles<br />
    – Pipeline parallelism, Data parallelism<br />
● Compare with V8 (node.js) execution on a 4-core Intel Nehalem
Pixastic Benchmarks<br />
Ported 4 image processing routines: invert, sepia, sharpen, and edge detection<br />
Original Invert Code<br />
function invert_ref(params)<br />
{<br />
var data = Pixastic.prepareData(params);<br />
var invertAlpha = !!params.options.invertAlpha;<br />
var rect = params.options.rect;<br />
var p = rect.width * rect.height;<br />
var pix = p*4, pix1 = pix + 1;<br />
var pix2 = pix + 2, pix3 = pix + 3;<br />
while (p--) {<br />
data[pix-=4] = 255 - data[pix];<br />
data[pix1-=4] = 255 - data[pix1];<br />
data[pix2-=4] = 255 - data[pix2];<br />
if (invertAlpha)<br />
data[pix3-=4] = 255 - data[pix3];<br />
}<br />
return true;<br />
}<br />
Sluice Invert Code<br />
function invert_sluice(data, invert_alpha)<br />
{<br />
this.invertAlpha = invert_alpha;<br />
this.data = data;<br />
this.work = function () {<br />
var w = this.pop();<br />
var h = this.pop();<br />
var p = w * h;<br />
var pix = p*4, pix1 = pix + 1;<br />
var pix2 = pix + 2, pix3 = pix + 3;<br />
while (p--) {<br />
this.data[pix-=4] = 255 - this.data[pix];<br />
this.data[pix1-=4] = 255 - this.data[pix1];<br />
this.data[pix2-=4] = 255 - this.data[pix2];<br />
if (this.invertAlpha)<br />
this.data[pix3-=4] = 255 - this.data[pix3];<br />
}<br />
this.push(w);<br />
this.push(h);<br />
return false;<br />
};<br />
}
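The invert inner loop above can be exercised on a tiny buffer to see its backwards walk over the RGBA channels:

```javascript
// The invert loop from the slide, run on a two-pixel RGBA buffer (w = 2,
// h = 1). pix starts one pixel past the end and the -=4 in each subscript
// steps it back before the channel is read and written.
function invert(data, w, h, invertAlpha) {
  var p = w * h;
  var pix = p * 4, pix1 = pix + 1;
  var pix2 = pix + 2, pix3 = pix + 3;
  while (p--) {
    data[pix -= 4] = 255 - data[pix];
    data[pix1 -= 4] = 255 - data[pix1];
    data[pix2 -= 4] = 255 - data[pix2];
    if (invertAlpha)
      data[pix3 -= 4] = 255 - data[pix3];
  }
  return data;
}

var d = [0, 10, 20, 255, 100, 110, 120, 255];
console.log(invert(d, 2, 1, false));
// → [255, 245, 235, 255, 155, 145, 135, 255]
```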
Results: Pixastic<br />
[Left chart: execution time (ms) per benchmark (invert, sepia, sharpen, edges) for ref, skir cold, and skir warm]<br />
[Right chart: mean time per image (ms) vs. number of images (1–10) for edges, sharpen, sepia, and invert]<br />
● Test of acceleration due to SKIR<br />
  ● Offload processing of single image to SKIR<br />
  ● ref = ordinary JavaScript<br />
  ● cold = SKIR - includes JIT overhead<br />
  ● warm = SKIR - does not include JIT overhead<br />
● Test of task parallelism<br />
  ● Launch 1 to 10 image processing kernels in parallel<br />
  ● Shows mean time per image<br />
  ● 8 worker threads (4 cores / 8 hardware threads)
n-body simulation<br />
function CalculateForces(pos_rd, softeningSquared) {<br />
this.m_pos_rd = pos_rd;<br />
this.m_softeningSquared = softeningSquared;<br />
this.work = function () {<br />
var force = [0.0,0.0,0.0];<br />
var i = this.pop()*4;<br />
var N = this.pop()*4;<br />
for (var j=0; j<N; j+=4) { ... }
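The slide's loop body is cut off in this copy. A hedged reconstruction of what such a kernel computes, following the standard softened gravitational interaction in the CUDA SDK n-body sample and assuming positions are packed [x, y, z, mass] per body (matching the *4 strides above):

```javascript
// Reconstruction sketch (not the slide's exact code): accumulate the softened
// gravitational force on body i from every body j. Softening keeps the
// self-interaction term finite; its contribution is zero because dx=dy=dz=0.
function calculateForce(pos, i, softeningSquared) {
  var force = [0.0, 0.0, 0.0];
  var N = pos.length;
  for (var j = 0; j < N; j += 4) {
    var dx = pos[j] - pos[i];
    var dy = pos[j + 1] - pos[i + 1];
    var dz = pos[j + 2] - pos[i + 2];
    var distSqr = dx * dx + dy * dy + dz * dz + softeningSquared;
    var invDist = 1.0 / Math.sqrt(distSqr);
    var s = pos[j + 3] * invDist * invDist * invDist; // mass / dist^3
    force[0] += dx * s;
    force[1] += dy * s;
    force[2] += dz * s;
  }
  return force;
}

// Two unit-mass bodies one unit apart on the x axis.
var f = calculateForce([0, 0, 0, 1, 1, 0, 0, 1], 0, 0.01);
console.log(f[0] > 0 && f[1] === 0 && f[2] === 0); // → true
```

The kernel only reads the shared position array and its own state, which is what makes the stream-parallel analysis on the slide straightforward.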
Results: n-body<br />
[Left chart: time per iteration (ms) vs. number of bodies (100–1500) for ref, sluice, and skir]<br />
[Right chart: speedup vs. number of threads (2, 4, 8) for 1024, 2048, 3072, and 4096 bodies]<br />
● Test of acceleration due to SKIR<br />
  ● Offload CalculateForces kernel to SKIR<br />
  ● Shows mean time per simulation iteration<br />
  ● ref = ordinary JavaScript<br />
  ● sluice = sluice implementation<br />
  ● skir = sluice implementation + SKIR offload<br />
● Test of data parallelism<br />
  ● Offload CalculateForces kernel as a data parallel kernel, use SKIR kernel fission<br />
  ● Shows speedup vs. single thread<br />
  ● Use 2, 4 or 8 SKIR threads (4 cores / 8 hardware threads)<br />
  ● JavaScript also uses one thread
Summary<br />
● Stream parallelism<br />
  ● Isolates program state<br />
  ● Abstracts inter-task communication<br />
● Presented a method for making concurrency explicit and aiding type inference in JavaScript applications using stream parallelism<br />
● Showed how to use these techniques to offload program kernels for parallel execution using a high performance JIT and runtime system for streaming computation<br />
● Showed significant performance gains for the tested benchmarks
Blank Slide
Extend kernel work function<br />
function createKernel(workfn)<br />
{<br />
var that = workfn;<br />
if (that._sluice_kernel !== true) {<br />
that._sluice_kernel = true;<br />
if (that.work == undefined)<br />
that.work = workfn;<br />
that.ins = [];<br />
that.outs = [];<br />
that.pop = createPop(that,0);<br />
that.popN = createPop(that);<br />
that.push = createPush(that,0);<br />
that.pushN = createPush(that);<br />
that.fiber = Fiber(<br />
function() {<br />
var r = false;<br />
while (r == false)<br />
r = that.work.call(that);<br />
return r;<br />
});<br />
}<br />
return that;<br />
}<br />
// return a pop operation<br />
function createArrayPop(kernel,stream)<br />
{<br />
var k = kernel;<br />
if (stream != null) {<br />
var s = stream;<br />
return function() {<br />
var a = k.ins[s].data;<br />
while (!a.length) yield(k.ins[s].src == 1);<br />
var e = a.shift();<br />
return e;<br />
};<br />
} else {<br />
return function(s) {<br />
var a = k.ins[s].data;<br />
while (!a.length) yield(k.ins[s].src == 1);<br />
var e = a.shift();<br />
return e;<br />
};<br />
}<br />
}
benchmarks['invert'] = {<br />
ref : function (rgba, cb) {<br />
var ret = invert_ref(rgba, 2592, 1944, false);<br />
cb(ret);<br />
},<br />
skir : function (rgba, cb) {<br />
var invert = new invert_sluice(rgba, 0);<br />
toSkir(invert,<br />
function(err, ret) {<br />
var p = new sluice.Pipeline( function(){ this.push(2592); //w<br />
this.push(1944); //h<br />
return true; },<br />
ret,<br />
function() { this.pop();<br />
return true; }<br />
).run(function() { cb(invert.data); });<br />
});<br />
}<br />
};<br />
var rgba = [];<br />
for (var i=0; i < ...