Programming with DRMAA - Open Grid Forum
OGF25/EGEE User Forum
Programming with DRMAA
DRMAA + WS + JSDL
Krzysztof Kurowski
Paweł Lichocki
Mariusz Mamoński
{krzysztof.kurowski,lichocki,mamonski}@man.poznan.pl
Agenda
• introduction and motivations
• DRMAA
  • idea
  • overview
  • routines
• Academic point of view
  • running numerical algorithms
  • boosting performance
  • exemplary application
• Industrial point of view
  • developing middleware
  • enhanced security, reliability and functionality
  • real-life applications
Introduction … back in GridLab times
• What we liked:
  • people, parties and results ;-)
  • idea = GAT + middleware
  • new scenarios and user-driven use cases, e.g. Zakopane/migration
  • the feedback we provided to O(G)GF
• What we did not like:
  • the way middleware was integrated with DRMS
  • script-based job managers, new versions/releases, ...
  • security model, no support for advanced AAA scenarios
  • poor quality, no docs, many bugs and problems with deployment and support (at that time ;-)
  • we had to wait for OGSA, WSRF, … and we liked C and WS
Motivations
• GridLab feedback:
  • GAT -> a new OGF standard: SAGA
  • Middleware -> OGF DRMAA, JSDL, HPC-Profile, ...
• What was really important for our middleware:
  • performance (e.g. one million job submissions per day)
  • stability and portability (e.g. ANSI C, not Java)
  • simple, well-accepted standards, not only recommendations or best practices
  • flexibility, new security models, …
  • support for local and external users performing relatively simple job submission and monitoring operations (we were discussing more advanced scenarios in a new OGF working group), thus DRMAA
DRMAA idea
[Figure: an application (binary file) reaching DRMS 1, DRMS 2 and DRMS 3 through a specific script and specific commands for each system, contrasted with a DRMAA application that uses the unified API and a vendor DRMAA lib for each of DRMS 1, DRMS 2 and DRMS 3.]
Overview
• technically, DRMAA may be seen as a substitute for the dedicated job-handling scripts
• from a broader perspective, DRMAA is a type of parallel programming model (like OpenMP or MPI)
• benefits
  • provides compatibility with all major DRMSs
  • eases implementation and deployment
  • good at use cases such as workflows or parameter sweeps
• pitfalls
  • a DRMAA application must be run from the submit host of the chosen DRMS
  • a DRMAA application uses only one DRMS at a time
Routines in a nutshell
• session handling
  • drmaa_init
  • drmaa_exit
• job submission
  • job_template routines
  • drmaa_run_job
  • drmaa_run_bulk_jobs
• job monitoring and control
  • drmaa_job_ps
  • drmaa_control
• job synchronization
  • drmaa_wait
  • drmaa_synchronize
Typical call sequence: init ➞ set template ➞ submit job (template) ➞ wait (job) ➞ exit
DRMAA academic use case
• Imagine an MPSD algorithm
  • It does not saturate the available computational power
  • Different versions exist, each suited to a different specific case
  • It is a crucial part of a very time-demanding application
  • It is (practically) impossible to know a priori which version to use (in order to minimize the execution time)
• Such algorithms exist, for example
  • Simplex - a popular algorithm for the numerical solution of linear programming problems
  • Hyper-heuristics - an approach that chooses an appropriate heuristic based on the characteristics of the instance
Example
• Observations
  • many methods - standard, revised, Bartels-Golub, sparse Bartels-Golub, Reid’s method, ...
  • each method best suits a different type of data (e.g. sparse or dense matrices)
  • the differences in execution time may be huge
• Problem
  • one should analyze the input and choose the method that seems most suitable, but this takes time and the choice might be wrong since there are no clear criteria
• Solution
  • run all methods, wait for the first to finish, gather results, terminate the other methods
Application flow and assumptions
• the scheme presents the idea of the DRMAA application
• in the source code we assume
  • the binary file name is simplexN
  • the binary takes the path to the input file as an argument
  • the input file is named data.in
  • the input file is accessible on all execution hosts in the working directory (otherwise file staging must be used)
  • we ignore potential errors and failures of DRMAA calls
Scheme:
init
for i = [0..N)
  set i-th template
  submit job (i-th template)
j = wait (ANY)
for i = [0..N) and i != j
  terminate job (i-th)
exit
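The flow above can be tried without a cluster. The following is a minimal sketch of the same race, assuming POSIX processes as a local stand-in for DRMAA jobs: fork plays the role of job submission, waitpid(-1, ...) the role of wait (ANY), and kill the role of terminate. The names run_worker, race_workers and MAX_WORKERS are hypothetical, and each worker only sleeps to imitate one simplex variant.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define MAX_WORKERS 16   /* hypothetical upper bound for this sketch */

/* Hypothetical worker: "computes" for delay_ms milliseconds, then exits. */
static void run_worker(int delay_ms, int exit_code)
{
    struct timespec ts;
    ts.tv_sec = delay_ms / 1000;
    ts.tv_nsec = (long)(delay_ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
    _exit(exit_code);
}

/* Starts n workers and returns the index of the first one that exits
 * normally with status 0, or -1 if none does. All other workers are
 * terminated and reaped before returning. */
int race_workers(const int *delay_ms, const int *exit_code, int n)
{
    pid_t pid[MAX_WORKERS];
    int i, running = 0, winner = -1;

    assert(delay_ms != NULL && exit_code != NULL);
    if (n < 1 || n > MAX_WORKERS)
        return -1;

    /* for i = [0..N): submit */
    for (i = 0; i < n; ++i) {
        pid[i] = fork();
        if (pid[i] == 0)
            run_worker(delay_ms[i], exit_code[i]);  /* child never returns */
        if (pid[i] > 0)
            ++running;
    }

    /* j = wait (ANY), repeated until one worker exits normally */
    while (winner < 0 && running > 0) {
        int status = 0;
        pid_t done = waitpid(-1, &status, 0);
        if (done < 0)
            break;
        for (i = 0; i < n; ++i) {
            if (pid[i] == done) {
                pid[i] = -1;   /* reaped, must not be signalled later */
                --running;
                if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                    winner = i;
            }
        }
    }

    /* for i = [0..N) and i != j: terminate and reap the rest */
    for (i = 0; i < n; ++i) {
        if (pid[i] > 0) {
            kill(pid[i], SIGTERM);
            waitpid(pid[i], NULL, 0);
        }
    }
    return winner;
}
```

With delays of 400, 50 and 300 ms and all-zero exit codes, race_workers is expected to return index 1. A worker that finishes first with a nonzero exit code is skipped, which mirrors the "exited normally" check of the DRMAA version.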
Application source code 1/2
/* Requires drmaa.h; N (number of simplex variants, at most 10) is assumed to be defined elsewhere */
int i;
char e[DRMAA_ERROR_STRING_BUFFER];
size_t s = DRMAA_ERROR_STRING_BUFFER;

/* init */
drmaa_init( NULL, e, s );

/* allocate template */
drmaa_job_template_t *jt = NULL;
drmaa_allocate_job_template( &jt, e, s );

/* set generic template: every variant reads data.in */
const char *args[2] = { "data.in", NULL };
drmaa_set_vector_attribute( jt, DRMAA_V_ARGV, args, e, s );

char jobid[ N ][DRMAA_JOBNAME_BUFFER];
/* for i = [0..N) */
for (i = 0; i < N; ++i) {
    /* set i-th template: the binary is named simplex<i> (single digit) */
    char cmd[] = "simplex_";
    cmd[7] = '0' + i;   /* index 7 is the '_' placeholder */
    drmaa_set_attribute( jt, DRMAA_REMOTE_COMMAND, cmd, e, s );
    /* submit job (i-th template) */
    drmaa_run_job( jobid[ i ], DRMAA_JOBNAME_BUFFER, jt, e, s );
}

/* delete template */
drmaa_delete_job_template( jt, e, s );
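The listing derives the binary name by patching one character of a fixed string, which only works for single-digit indices. A minimal sketch of a more general alternative, assuming the simplexN naming from the assumptions slide; the helper name make_command_name is hypothetical and not part of DRMAA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes "simplex<i>" into buf.
 * Returns 0 on success, -1 if i is negative or buf is too small. */
int make_command_name(char *buf, size_t size, int i)
{
    int n;

    assert(buf != NULL);
    if (size == 0 || i < 0)
        return -1;
    n = snprintf(buf, size, "simplex%d", i);
    /* snprintf reports the length it wanted to write; a value >= size
     * means the name was truncated. */
    return (n > 0 && (size_t)n < size) ? 0 : -1;
}
```

The resulting buffer could then be passed to drmaa_set_attribute as the DRMAA_REMOTE_COMMAND value in place of cmd.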
Application source code 2/2
/* for i = [0..N): j = wait (ANY) */
for (i = 0; i < N; ++i) {
    char jobid_out[DRMAA_JOBNAME_BUFFER];
    int status = 0, aborted = 0, exited = 0;
    drmaa_attr_values_t *rusage = NULL;
    drmaa_wait( DRMAA_JOB_IDS_SESSION_ANY, jobid_out,
                DRMAA_JOBNAME_BUFFER, &status,
                DRMAA_TIMEOUT_WAIT_FOREVER, &rusage, e, s );
    if (rusage != NULL)
        drmaa_release_attr_values( rusage );   /* resource usage is not needed here */
    /* check if job exited normally; if yes do not wait again */
    drmaa_wifaborted( &aborted, status, NULL, 0 );
    if (aborted != 1) {
        drmaa_wifexited( &exited, status, NULL, 0 );
        if (exited == 1)
            break;
    }
}
/* for i = [0..N): terminate job (i-th); the finished job j is not skipped,
   any error returned for it is ignored */
for (i = 0; i < N; ++i)
    drmaa_control( jobid[ i ], DRMAA_CONTROL_TERMINATE, e, s );
/* exit */
drmaa_exit(e, s);
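The flow asks for "i != j", but the termination loop in the listing covers every job. To skip the finished one, the id that drmaa_wait writes into jobid_out has to be mapped back to an index. A minimal sketch of that lookup, under the assumption that job ids are compared as plain strings; find_job_index is a hypothetical helper, and JOBNAME_BUFFER is a local stand-in for DRMAA_JOBNAME_BUFFER so the fragment compiles without drmaa.h.

```c
#include <assert.h>
#include <string.h>

#define JOBNAME_BUFFER 1024   /* stand-in for DRMAA_JOBNAME_BUFFER */

/* Hypothetical helper: returns the index of `finished` in the table of
 * submitted job ids, or -1 if the id is not in the table. */
int find_job_index(char jobid[][JOBNAME_BUFFER], int n, const char *finished)
{
    int i;

    assert(jobid != NULL && finished != NULL);
    for (i = 0; i < n; ++i) {
        if (strcmp(jobid[i], finished) == 0)
            return i;
    }
    return -1;
}
```

The termination loop could then call drmaa_control only for indices different from find_job_index(jobid, N, jobid_out), provided jobid_out is declared outside the wait loop so that it is still in scope.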
Advanced programming in DRMAA
• The solution was to use drmaa_wait with DRMAA_JOB_IDS_SESSION_ANY. However, waiting for any “normally terminated” job is not that straightforward
• Other limitations
  • The previous approach allows us to run many algorithms simultaneously, but uses only one DRMS
  • The DRMAA application must be run from the submission host of the chosen DRMS
  • DRMAA does not cover resource requests
• Questions
  • What if we have access to many separate clusters?
  • What about security?
• Solution
  • map DRMAA functionality to web services, along with JSDL
Industrial approach - SMOA Computing
• Successor of OpenDSP (Open DRMAA Service Provider)
• Web Service interface to DRMAA-compliant systems
• Adds authentication, authorization and accounting layers, which are outside the scope of the DRMAA specification
• Robust implementation (C, gSOAP toolkit)
• Modular architecture - new scenarios can be realized by adding new C/Python modules
• Use of standard interfaces (JSDL, DRMAA, ODBC, BES HPC Basic Profile) - easier integration and maintenance
• https://sourceforge.net/projects/smoa-project
JSDL to DRMAA mapping
• ➞ DRMAA_JOB_NAME
• ➞ DRMAA_REMOTE_COMMAND
•* ➞ DRMAA_V_ARGV
•* ➞ DRMAA_V_ENV
• ➞ DRMAA_WD
• ➞ DRMAA_INPUT_PATH
• ➞ DRMAA_OUTPUT_PATH
• ➞ DRMAA_ERROR_PATH
• other JSDL elements can, if needed, be mapped to DRMAA_NATIVE_SPECIFICATION by the dedicated SMOA Computing JSDL Filter module
BES to DRMAA mapping (interfaces)
• CreateActivity ➞ drmaa_run_job
• GetActivityStatuses ➞ drmaa_job_ps, drmaa_wait
• TerminateActivities ➞ drmaa_control
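The status query maps to drmaa_job_ps, which reports an integer job state that then has to be expressed as a BES activity state. A minimal sketch of one possible translation, assuming the five states of the basic BES state model (Pending, Running, Cancelled, Failed, Finished). The function name and the choice of mapping are assumptions for illustration, not SMOA Computing's actual code. The PS_* codes repeat the values of the DRMAA_PS_* constants from the DRMAA 1.0 C binding so that the sketch compiles without the library.

```c
#include <assert.h>
#include <stddef.h>

/* Values of the DRMAA_PS_* constants, repeated locally for this sketch. */
enum {
    PS_UNDETERMINED          = 0x00,
    PS_QUEUED_ACTIVE         = 0x10,
    PS_SYSTEM_ON_HOLD        = 0x11,
    PS_USER_ON_HOLD          = 0x12,
    PS_USER_SYSTEM_ON_HOLD   = 0x13,
    PS_RUNNING               = 0x20,
    PS_SYSTEM_SUSPENDED      = 0x21,
    PS_USER_SUSPENDED        = 0x22,
    PS_USER_SYSTEM_SUSPENDED = 0x23,
    PS_DONE                  = 0x30,
    PS_FAILED                = 0x40
};

/* Basic BES activity states. */
enum bes_state { BES_PENDING, BES_RUNNING, BES_CANCELLED, BES_FAILED, BES_FINISHED };

/* Assumed mapping: queued and held jobs are Pending, running and suspended
 * jobs are Running. Returns 0 and stores the state in *out, or -1 when the
 * DRMAA state is undetermined or unknown. BES_CANCELLED is never produced
 * here, because the job state alone does not say who ended a failed job. */
int bes_state_from_drmaa_ps(int ps, enum bes_state *out)
{
    assert(out != NULL);
    switch (ps) {
    case PS_QUEUED_ACTIVE:
    case PS_SYSTEM_ON_HOLD:
    case PS_USER_ON_HOLD:
    case PS_USER_SYSTEM_ON_HOLD:
        *out = BES_PENDING;
        return 0;
    case PS_RUNNING:
    case PS_SYSTEM_SUSPENDED:
    case PS_USER_SUSPENDED:
    case PS_USER_SYSTEM_SUSPENDED:
        *out = BES_RUNNING;
        return 0;
    case PS_DONE:
        *out = BES_FINISHED;
        return 0;
    case PS_FAILED:
        *out = BES_FAILED;
        return 0;
    default:
        return -1;
    }
}
```

A service that also tracks its own TerminateActivities calls could report Cancelled for jobs it terminated itself; that bookkeeping is left out of the sketch.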
Example SMOA realization
[Figure: "My organization" interacting with Cluster A and Cluster B, each shown as SMOA Computing on top of a DRMAA-compliant system, together with a SMOA Notification component. Numbered steps: 1 - Subscribe; 2, 3 - Create Activity; 4, 5 - Notify; 6 - Terminate Activity.]
OpenDSP use case
G-Render - Grid-based Image Processing System
• used in TeleHVEM
  • virtual laboratory for High Voltage Electron Microscope
  • e-science
• Server system
  • Sun Grid Engine (SGE) for computational grid
  • OpenDSP for DRMAA web service
• Client
  • Various image processing features
[Figure: architecture with the components GRender GUI, GRender, GRender Agent, gSOAP stubs, OpenDSP, sge_qmaster, sge_schedd and sge_execd.]
Summary & Future
• The best way to learn how to program in DRMAA is to start writing your own DRMAA application instead of using native programming interfaces or scripts for a specific DRMS
• If you have access to many computing resources managed by different DRMSs, you may want to use the SDKs for the DRMAA Service Provider (now SMOA Computing) that we developed for C/C++, Java and .NET, as well as example tools like the Vine/GS toolkit, Jabber clients, …
• We have already extended DRMAA with a set of generic advance reservation APIs (now being tested with LSF, SGE, PBSPro)
• We integrated our middleware with the well-known programming and execution environments ProActive (Java) and OpenMPI (C/C++/Python) to manage cluster-to-cluster parallel apps
• It should not be difficult to implement a new SAGA adaptor for SMOA Computing
Thank you