03.05.2014 Views

Programming with DRMAA - Open Grid Forum


OGF25/EGEE User <strong>Forum</strong><br />

<strong>Programming</strong> <strong>with</strong> <strong>DRMAA</strong><br />

<strong>DRMAA</strong> + WS + JSDL<br />

Krzysztof Kurowski<br />

Paweł Lichocki<br />

Mariusz Mamoński<br />

{krzysztof.kurowski,lichocki,mamonski}@man.poznan.pl


Agenda<br />

•introduction and motivations<br />

•<strong>DRMAA</strong><br />

•idea<br />

•overview<br />

•routines<br />

•Academic point of view<br />

•running numerical algorithms<br />

•boosting performance<br />

•exemplary application<br />

•Industrial point of view<br />

•developing middleware<br />

•enhanced security, reliability and functionality<br />

•real-life applications


Introduction … back in <strong>Grid</strong>lab times<br />

•What we liked:<br />

• people, parties and results ;-)<br />

• idea = GAT + middleware<br />

• new scenarios and user driven<br />

use cases, e.g. Zakopane/migration<br />

• the feedback we provided to O(G)GF<br />

•What we did not like:<br />

• the way middleware was integrated <strong>with</strong> DRMS<br />

• script-based job managers, new versions/releases, ...<br />

• security model, no support for advanced AAA scenarios<br />

• poor quality, no docs, many bugs and problems <strong>with</strong><br />

deployment and support (at that time ;-)<br />

• we had to wait for OGSA, WSRF, … and we liked C and WS<br />



Motivations<br />

•<strong>Grid</strong>Lab feedback:<br />

•GAT -> a new OGF standard: SAGA<br />

•Middleware -> OGF <strong>DRMAA</strong>, JSDL, HPC-Profile, ...<br />

•What was really important for our middleware:<br />

• performance (e.g. one million job submissions per day)<br />

• stability and portability (e.g. ANSI C not Java)<br />

• simple, well-accepted standards, not only<br />

recommendations or best practices<br />

• flexibility, new security models, …<br />

• support for local and external users performing relatively<br />

simple job submission and monitoring operations (more<br />

advanced scenarios we were discussing under a new<br />

working group in OGF), thus <strong>DRMAA</strong><br />



<strong>DRMAA</strong> idea<br />

[Figure: without <strong>DRMAA</strong>, each application (binary file) needs a specific script and specific commands for every DRMS (DRMS 1, DRMS 2, DRMS 3); with <strong>DRMAA</strong>, one application calls a unified API and each DRMS supplies its vendor <strong>DRMAA</strong> lib]


Overview<br />

•technically <strong>DRMAA</strong> may be seen as a substitute for the<br />

dedicated scripts for job handling<br />

•from a broader perspective <strong>DRMAA</strong> is a type of parallel<br />

programming model (like <strong>Open</strong>MP or MPI)<br />

•benefits<br />

•provides compatibility <strong>with</strong> all major DRMSs<br />

•eases the implementation effort and deployment<br />

•well suited to use cases such as workflows or parameter sweeps<br />

•pitfalls<br />

•<strong>DRMAA</strong> application must be run from the submit host of<br />

the chosen DRMS<br />

•<strong>DRMAA</strong> application uses only one DRMS at a time


Routines in a nutshell<br />

•session handling<br />

•drmaa_init<br />

•drmaa_exit<br />

•job submission<br />

•job_template routines<br />

•drmaa_run_job<br />

•drmaa_run_bulk_jobs<br />

•job monitoring and control<br />

•drmaa_job_ps<br />

•drmaa_control<br />

•job synchronization<br />

•drmaa_wait<br />

•drmaa_synchronize<br />

init<br />

set template<br />

submit job (template)<br />

wait (job)<br />

exit


<strong>DRMAA</strong> academic use case<br />

•Imagine an MPSD algorithm<br />

•It does not saturate the provided computational power<br />

•There exist different versions, each of them suited to<br />

a different specific case<br />

•It is a crucial part of a very time-demanding application<br />

•It is (practically) impossible to know a priori which<br />

version to use (in order to minimize<br />

the execution time)<br />

•Such algorithms exist, for example<br />

•Simplex - a popular algorithm for<br />

numerical solution of the linear<br />

programming problem<br />

•Hyper-heuristics - an approach that chooses an appropriate<br />

heuristic based on the characteristics of the instance


Example<br />

•Observations<br />

•many methods - standard, revised, Bartels-Golub, sparse<br />

Bartels-Golub, Reid’s Method, ...<br />

•each method best suits a different type of data (e.g. sparse<br />

or dense matrices)<br />

•the differences in execution time may be huge<br />

•Problem<br />

•one should analyze the input and choose the method<br />

which seems most suitable, but this takes time and<br />

the choice might be wrong since there are no clear criteria<br />

•Solution<br />

•run all methods, wait for the first to finish, gather<br />

results, terminate other methods


Application flow and assumptions<br />

•the scheme presents the idea of<br />

the <strong>DRMAA</strong> application<br />

•in the source code we assume<br />

•the binary file name is simplexN<br />

•the binary takes as an argument<br />

the path to the input file<br />

•the input file is named data.in<br />

•the input file is accessible on all<br />

execution hosts in the working<br />

directory (otherwise file staging<br />

must be used)<br />

•we ignore potential errors and<br />

failures of <strong>DRMAA</strong> calls<br />

init<br />

for i = [0..N)<br />

set i-th template<br />

submit job (i-th template)<br />

j = wait (ANY)<br />

for i = [0..N) and i != j<br />

terminate job (i-th)<br />

exit
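
The flow above (submit all variants, wait for any, terminate the rest) can be tried on a single machine without a DRMS. The following is a minimal sketch that uses POSIX processes as a stand-in for DRMAA jobs: `fork` plays the role of job submission, `wait` the role of waiting for any job, and `kill` the role of job termination. The names `race_variants`, `fake_solver` and `MAX_VARIANTS` are hypothetical, and each "solver" merely sleeps for a given number of milliseconds.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define MAX_VARIANTS 16   /* hypothetical upper bound for this sketch */

/* Child body: sleep for ms milliseconds, standing in for one solver variant. */
static void fake_solver(unsigned ms)
{
    struct timespec ts;
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (long)(ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
    _exit(0);
}

/* Start n variants, wait for the first to exit normally, terminate the rest.
 * Returns the index of the winner, or -1 if no variant finished normally. */
int race_variants(const unsigned delays_ms[], int n)
{
    pid_t pids[MAX_VARIANTS];
    int started, winner = -1;

    if (n <= 0 || n > MAX_VARIANTS)
        return -1;

    /* submit job (i-th template) */
    for (started = 0; started < n; ++started) {
        pids[started] = fork();
        if (pids[started] == 0)
            fake_solver(delays_ms[started]);   /* never returns */
        if (pids[started] < 0)
            break;                             /* fork failed: clean up below */
    }

    /* j = wait (ANY), repeated until some job exits normally */
    while (started == n && winner < 0) {
        int status = 0;
        pid_t done = wait(&status);
        if (done < 0)
            break;                             /* no children left */
        for (int i = 0; i < n; ++i) {
            if (pids[i] != done)
                continue;
            pids[i] = -1;                      /* already reaped */
            if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                winner = i;
        }
    }

    /* terminate job (i-th) for every i != j, then reap it */
    for (int i = 0; i < started; ++i) {
        if (pids[i] > 0) {
            kill(pids[i], SIGTERM);
            waitpid(pids[i], NULL, 0);
        }
    }
    return winner;
}
```

Calling `race_variants` with delays of 500, 20 and 500 ms should return index 1, since the second variant finishes first and the other two are terminated. Unlike the DRMAA version, this analogue skips the winner when terminating, because a reaped process id must not be signalled again.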


Application source code 1/2<br />

int i;<br />
char e[DRMAA_ERROR_STRING_BUFFER]; /* buffer for error diagnoses */<br />
size_t s = DRMAA_ERROR_STRING_BUFFER;<br />
/* init */<br />
drmaa_init( NULL, e, s );<br />
/* allocate template */<br />
drmaa_job_template_t *jt = NULL;<br />
drmaa_allocate_job_template( &jt, e, s );<br />
/* set generic template: every job gets data.in as its only argument */<br />
const char *args[2] = { "data.in", NULL };<br />
drmaa_set_vector_attribute( jt, DRMAA_V_ARGV, args, e, s );<br />
char jobid[ N ][DRMAA_JOBNAME_BUFFER];<br />
/* for i = [0..N) */<br />
for (i = 0; i < N; ++i) {<br />
  /* set i-th template: the command is simplex0, simplex1, ... (N <= 10) */<br />
  char cmd[] = "simplex_";<br />
  cmd[7] = '0' + i; /* index 7 is the '_' placeholder */<br />
  drmaa_set_attribute( jt, DRMAA_REMOTE_COMMAND, cmd, e, s );<br />
  /* submit job (i-th template) */<br />
  drmaa_run_job( jobid[ i ], DRMAA_JOBNAME_BUFFER, jt, e, s );<br />
}<br />
/* delete template */<br />
drmaa_delete_job_template( jt, e, s );
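
The loop on this slide builds the command name by patching a single character, which only works for single-digit indices. A minimal sketch of a more general helper, assuming the binaries follow the simplexN naming stated earlier; the function name `simplex_command` is hypothetical and not part of DRMAA:

```c
#include <stdio.h>

/* Writes "simplex<index>" into buf.
 * Returns 0 on success, -1 if index is negative or the name does not fit. */
int simplex_command(char *buf, size_t size, int index)
{
    int written;

    if (index < 0)
        return -1;
    written = snprintf(buf, size, "simplex%d", index);
    /* snprintf reports the length it needed; reject truncated names */
    return (written >= 0 && (size_t)written < size) ? 0 : -1;
}
```

The resulting string could then be passed as the value of DRMAA_REMOTE_COMMAND in place of the patched `cmd` array.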


Application source code 2/2<br />

/* for i = [0..N): j = wait (ANY) */<br />
for (i = 0; i < N; ++i) {<br />
  char jobid_out[DRMAA_JOBNAME_BUFFER];<br />
  int status = 0, aborted = 0, exited = 0;<br />
  drmaa_attr_values_t *rusage = NULL;<br />
  drmaa_wait( DRMAA_JOB_IDS_SESSION_ANY, jobid_out,<br />
    DRMAA_JOBNAME_BUFFER, &status,<br />
    DRMAA_TIMEOUT_WAIT_FOREVER, &rusage, e, s );<br />
  if (rusage != NULL)<br />
    drmaa_release_attr_values( rusage ); /* resource usage is not needed here */<br />
  /* check if job exited normally; if yes do not wait again */<br />
  drmaa_wifaborted( &aborted, status, NULL, 0 );<br />
  if (aborted != 1) {<br />
    drmaa_wifexited( &exited, status, NULL, 0 );<br />
    if (exited == 1)<br />
      break;<br />
  }<br />
}<br />
/* for i = [0..N): terminate job (i-th); the finished job is included and the resulting error is ignored */<br />
for (i = 0; i < N; ++i)<br />
  drmaa_control( jobid[ i ], DRMAA_CONTROL_TERMINATE, e, s );<br />
/* exit */<br />
drmaa_exit(e, s);
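
The termination loop on this slide also addresses the job that has already finished, because the `i != j` condition is commented out. A minimal sketch of two helpers that would let an application be stricter, using plain C stand-ins so that it compiles without a DRMAA library. `struct wait_outcome`, `finished_normally`, `job_index` and `JOBID_LEN` are hypothetical names. The three flags are assumed to be filled from drmaa_wifaborted, drmaa_wifexited and drmaa_wexitstatus, and treating a non-zero exit status as failure is an assumption about the solver binaries.

```c
#include <string.h>

#define JOBID_LEN 64   /* stand-in for the job id buffer size used at submission */

/* Flags decoded from the status value returned by a wait call. */
struct wait_outcome {
    int aborted;       /* job never ran to completion */
    int exited;        /* job terminated through a normal exit */
    int exit_status;   /* exit code, meaningful only when exited is set */
};

/* A job counts as the winner only if it exited on its own with status 0. */
int finished_normally(const struct wait_outcome *o)
{
    return !o->aborted && o->exited && o->exit_status == 0;
}

/* Maps the job id reported by the wait call back to its submission index,
 * so that the termination loop can skip it. Returns -1 if the id is unknown. */
int job_index(char ids[][JOBID_LEN], int n, const char *finished_id)
{
    for (int i = 0; i < n; ++i)
        if (strcmp(ids[i], finished_id) == 0)
            return i;
    return -1;
}
```

With the index of the finished job known, the termination loop can restore the `i != j` test from the application flow.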


Advanced programming in <strong>DRMAA</strong><br />

•The solution was to use drmaa_wait <strong>with</strong><br />

DRMAA_JOB_IDS_SESSION_ANY. However, waiting for any<br />

“normally terminated” job is not that straightforward<br />

•Other limitations<br />

•The previous approach allows us to run many algorithms<br />

simultaneously, but uses only one DRMS<br />

•The <strong>DRMAA</strong> application must be run from the submission<br />

host of the chosen DRMS<br />

•<strong>DRMAA</strong> does not cover issues regarding resource requests<br />

•Questions<br />

•What if we have access to many separated clusters?<br />

•What about security?<br />

•Solution<br />

•map <strong>DRMAA</strong> functionalities to web-services along <strong>with</strong> JSDL


Industrial approach - SMOA Computing<br />

• Successor of the <strong>Open</strong>DSP (<strong>Open</strong> <strong>DRMAA</strong> Service Provider)<br />

• Web Service interface to <strong>DRMAA</strong> compliant systems<br />

• Adds authentication, authorization and accounting layers,<br />

which are out of scope of the <strong>DRMAA</strong> specification<br />

• Robust implementation (C, gSOAP toolkit)<br />

• Modular architecture - new scenarios can be realized by<br />

adding new C/Python modules<br />

• Use of standard interfaces (JSDL, <strong>DRMAA</strong>, ODBC, BES HPC<br />

Basic Profile) - easier integration and maintenance<br />

• https://sourceforge.net/projects/smoa-project


JSDL to <strong>DRMAA</strong> mapping<br />

• ➞ DRMAA_JOB_NAME<br />

• ➞ DRMAA_REMOTE_COMMAND<br />

•* ➞ DRMAA_V_ARGV<br />

•* ➞ DRMAA_V_ENV<br />

• ➞ DRMAA_WD<br />

• ➞ DRMAA_INPUT_PATH<br />

• ➞ DRMAA_OUTPUT_PATH<br />

• ➞ DRMAA_ERROR_PATH<br />

• other JSDL elements can, if needed, be mapped to<br />

DRMAA_NATIVE_SPECIFICATION by the dedicated SMOA<br />

Computing JSDL Filter module


BES to <strong>DRMAA</strong> mapping (interfaces)<br />

•CreateActivity ➞ drmaa_run_job<br />

•GetActivityStatuses ➞ drmaa_job_ps, drmaa_wait<br />

•TerminateActivities ➞ drmaa_control
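
Mapping a status query onto drmaa_job_ps also means translating job states. The sketch below shows one plausible translation from DRMAA-style program states to the state names of the BES basic state model (Pending, Running, Finished, Failed). It is an assumption for illustration and is not taken from SMOA Computing. The `enum job_state` values are local stand-ins for the DRMAA_PS_* constants, and `bes_state_name` is a hypothetical function.

```c
#include <stddef.h>

/* Local stand-ins for the DRMAA_PS_* program states reported by drmaa_job_ps. */
enum job_state {
    JOB_UNDETERMINED,
    JOB_QUEUED,
    JOB_ON_HOLD,
    JOB_RUNNING,
    JOB_SUSPENDED,
    JOB_DONE,
    JOB_FAILED
};

/* Returns the BES state name for a job state, or NULL when the state is
 * undetermined and the caller should query again. */
const char *bes_state_name(enum job_state state)
{
    switch (state) {
    case JOB_QUEUED:
    case JOB_ON_HOLD:
        return "Pending";    /* accepted but not yet executing */
    case JOB_RUNNING:
    case JOB_SUSPENDED:
        return "Running";    /* assumption: suspension is folded into Running */
    case JOB_DONE:
        return "Finished";
    case JOB_FAILED:
        return "Failed";
    case JOB_UNDETERMINED:
        break;
    }
    return NULL;
}
```

The BES Cancelled state is left out of this sketch on the assumption that a service would record it itself when it handles TerminateActivities.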


Example SMOA realization<br />

[Figure: My organization uses two sites, Cluster A and Cluster B; each runs SMOA Computing in front of a <strong>DRMAA</strong> Compliant System, and a SMOA Notification service is also shown. Numbered interactions:]<br />

1 - Subscribe<br />

2, 3 - Create Activity<br />

4, 5 - Notify<br />

6 - Terminate Activity


<strong>Open</strong>DSP use case<br />

G-Render - <strong>Grid</strong>-based Image<br />

Processing System<br />

•used in TeleHVEM<br />

•virtual laboratory for High<br />

Voltage Electron Microscope<br />

•e-science<br />

•Server system<br />

•Sun <strong>Grid</strong> Engine (SGE) for<br />

computational grid<br />

•<strong>Open</strong>DSP for <strong>DRMAA</strong> web<br />

service<br />

•Client<br />

•Various image processing<br />

features<br />

[Figure: G-Render architecture; components shown: GRender GUI, GRender, GRender Agent, gSOAP stubs, <strong>Open</strong>DSP, sge_qmaster, sge_schedd, sge_execd]


Summary & Future<br />

• The best way to learn how to program in <strong>DRMAA</strong> is to start<br />

writing your own <strong>DRMAA</strong> application instead of using native<br />

programming interfaces or scripts for a specific DRMS<br />

• If you have access to many computing resources managed by<br />

different DRMSs, you may want to use the SDKs for the <strong>DRMAA</strong> Service<br />

Provider (now SMOA Computing) that we developed for C/C++,<br />

Java and .NET, as well as example tools like the Vine/GS toolkit,<br />

Jabber clients, …<br />

• We have already extended <strong>DRMAA</strong> <strong>with</strong> a set of generic<br />

advance reservation APIs (currently being tested <strong>with</strong> LSF, SGE, PBSPro)<br />

• We integrated our middleware <strong>with</strong> the well-known programming<br />

and execution environments ProActive (Java) and <strong>Open</strong>MPI<br />

(C/C++/Python) to manage cluster-to-cluster parallel apps<br />

• It should not be difficult to implement a new SAGA adaptor<br />

for SMOA Computing


Links and literature<br />

•[1] OGF <strong>DRMAA</strong> Working Group, http://www.drmaa.org<br />

•[2] <strong>DRMAA</strong> Specification, http://www.ogf.org/documents/GFD.133.pdf<br />

•[3] <strong>DRMAA</strong> C bindings, https://forge.gridforum.org/sf/docman/do/downloadDocument/projects.drmaa-wg/docman.root.ggf_13/doc5545<br />

•[4] <strong>Grid</strong> Engine HOWTOs, http://gridengine.sunsource.net/howto/howto.html#DRMAA<br />

•[5] FedStage <strong>DRMAA</strong> Wiki, http://wiki.fedstage.com/FedStage%20DRMAA<br />

•[6] FedStage <strong>DRMAA</strong> for PBS PRO, http://sourceforge.net/projects/pbspro-drmaa<br />

•[7] FedStage <strong>DRMAA</strong> for LSF, http://sourceforge.net/projects/lsf-drmaa<br />

•[8] JSDL Working Group, http://forge.gridforum.org/sf/sfmain/do/viewProject/projects.jsdl-wg<br />

•[9] JSDL Specification, http://www.gridforum.org/documents/GFD.56.pdf<br />

•[10] gSOAP project, http://www.cs.fsu.edu/~engelen/soap.html<br />

•[11] FedStage <strong>Open</strong> <strong>DRMAA</strong> Service Provider, http://sourceforge.net/projects/opendsp<br />

•[12] TeleHVEM, http://goc.pragma-grid.net/wiki/images/2/2e/Hvem-yeom.pdf<br />

•[13] SMOA Project, http://sf.net/projects/smoa-project


Thank you
