25.01.2015 Views

Cost-Based Optimization of Integration Flows - Datenbanken ...


2.1 Integration Flows

they group many independent service calls of an ActiveXML document that target the
same Web service into one service call and thus, in most cases, they reduce the amount
of transferred data. Note that all of these approaches for reducing the transferred data are
rule-based in the sense that rewriting is only applied once during the initial deployment
of an integration flow or the execution model was changed.

B. Control Flow Parallelization and Operator Reordering

In contrast to approaches that optimize the data transfer between the integration platform
and the external systems, further optimization approaches either use control flow
parallelization, which speeds up integration flows by explicitly parallelizing processing
steps, or operator reordering, which reduces the size of intermediate results.
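To illustrate why operator reordering reduces intermediate results, the following minimal sketch compares two orderings of a chain of filter operators. It assumes independent filters with known selectivities and equal per-tuple costs; the function names and the numbers are hypothetical and are not taken from any of the cited approaches.

```python
def tuples_processed(input_rows, selectivities):
    """Sum of the input sizes of all filters in a chain.

    Assumption: filters are independent, so the output size of a filter
    is its input size times its selectivity.
    """
    total = 0.0
    rows = float(input_rows)
    for s in selectivities:
        total += rows      # this filter has to inspect all incoming tuples
        rows *= s          # only the matching fraction flows downstream
    return total


def reorder_by_selectivity(filters):
    """Order (name, selectivity) pairs so the most selective filter runs first."""
    return sorted(filters, key=lambda f: f[1])


# Hypothetical flow with three filters on 1000 input tuples.
filters = [("check_region", 0.9), ("check_status", 0.1), ("check_date", 0.5)]
reordered = reorder_by_selectivity(filters)

before = tuples_processed(1000, [s for _, s in filters])
after = tuples_processed(1000, [s for _, s in reordered])
```

Under these assumptions, the final result size is the same for both orderings, but the reordered chain inspects fewer tuples in total because the most selective filter shrinks the intermediate result first.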

In general, all of those approaches analyze dependencies between tasks of a workflow and
then rewrite tasks with no dependencies between them into parallel subflows. For example,
Li and Zhan determine the critical path of a workflow with regard to the execution time and
then optimize only this part of the workflow [LZ05]. Typically, such rewritings are made
only once during the initial deployment (optimize-once model [BBD09]) and thus, they
are rule-based optimization approaches, where rewriting rules are often defined in terms of
algebraic equivalences [HML09, YB08]. For example, Behrend and Jörg proposed to use
the known Magic Sets technique for the rule-based optimization of incremental ETL flows
[Beh09, BJ10]. There, selection constants gathered in a subflow are propagated to other
subflows with common attributes in order to filter tuples as early as possible. In addition,
the XPEDIA system [BABO+09] achieves partitioned parallelism by partitioning large
XML documents into multiple parts, evaluating the parts in parallel, and finally, merging
the results. This concept is a hybrid approach of control flow and data flow parallelism
because parallel subflows are used to evaluate independent data partitions.
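The general idea of dependency analysis can be sketched as follows. This is a simplified illustration under stated assumptions and not the algorithm of any cited work: tasks and their dependencies are given explicitly as a dictionary, the task names and durations are hypothetical, and tasks of the same level are assumed to run fully in parallel.

```python
def parallel_levels(tasks, deps):
    """Group tasks into levels of mutually independent tasks.

    deps maps a task to the set of tasks it depends on. All tasks of one
    level only depend on earlier levels and could run as parallel subflows.
    """
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    done = set()
    levels = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("cyclic or unknown dependency")
        levels.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return levels


def critical_path_time(tasks, deps, duration):
    """Execution time of the longest dependency chain (the critical path)."""
    finish = {}
    for level in parallel_levels(tasks, deps):
        for t in level:
            start = max((finish[d] for d in deps.get(t, ())), default=0)
            finish[t] = start + duration[t]
    return max(finish.values(), default=0)


# Hypothetical workflow: two independent fetches between a load and a join.
tasks = ["load", "fetch_a", "fetch_b", "join"]
deps = {"fetch_a": {"load"}, "fetch_b": {"load"}, "join": {"fetch_a", "fetch_b"}}
duration = {"load": 1, "fetch_a": 4, "fetch_b": 2, "join": 1}

levels = parallel_levels(tasks, deps)                  # [['load'], ['fetch_a', 'fetch_b'], ['join']]
parallel_time = critical_path_time(tasks, deps, duration)
sequential_time = sum(duration.values())
```

In this example, the parallel plan is bounded by the critical path load, fetch_a, join, so further optimization of fetch_b alone would not reduce the total execution time.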

Furthermore, Srivastava et al. proposed an algorithm for finding the best plan of Web
service calls (control-flow semantics) with regard to the highest degree of parallelism and
thus, the lowest total execution time [BMS05, SMWM06]. In addition to the exploitation of
parallelism, they also include the selectivity of services as a metric for arranging these
services. However, costs of local processing steps are neglected and the optimal plan
is computed for each incoming ad-hoc Web service query (optimize-always optimization
model). Similar to this, Simitsis et al. proposed an optimization algorithm for ETL flows
[Sim04, SVS05] by modeling this optimization problem as a state-space search problem. In
addition to the parallelization of operators, they used techniques for merging, splitting and
reordering operators. Although they use a cost model, the approach is mainly rule-based
because optimization takes place on the logical level, and optimization is triggered whenever
a flow instance is requested. This optimize-always model [BBD09] is advantageous for
long-running flow instances. However, when executing many instances with rather small
amounts of data, as is the case for typical workloads of integration flows, the
optimize-always model falls short due to the predominant optimization costs.
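The effect of predominant optimization costs can be made concrete with a small back-of-the-envelope sketch. All numbers are hypothetical, and the sketch makes the simplifying assumption that every instance has the same optimization and execution costs.

```python
def optimization_overhead(opt_cost, exec_cost):
    """Fraction of the per-instance time that is spent on optimization."""
    return opt_cost / (opt_cost + exec_cost)


def total_cost_optimize_always(n, opt_cost, exec_cost):
    """Total cost of n instances if every instance is optimized before execution."""
    return n * (opt_cost + exec_cost)


def total_cost_optimize_once(n, opt_cost, exec_cost):
    """Total cost of n instances if the plan is optimized only at deployment.

    Assumption: the plan found once performs as well as a per-instance plan.
    """
    return opt_cost + n * exec_cost


# Hypothetical costs in milliseconds.
long_running = optimization_overhead(opt_cost=50, exec_cost=60_000)  # one large instance
short_running = optimization_overhead(opt_cost=50, exec_cost=5)      # one small instance

always = total_cost_optimize_always(n=10_000, opt_cost=50, exec_cost=5)
once = total_cost_optimize_once(n=10_000, opt_cost=50, exec_cost=5)
```

With these assumed values, optimization is negligible for the long-running instance but accounts for most of the time of each short instance, and the difference accumulates over many instances.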

The BPEL/SQL approach uses, in addition to SQL-activity-specific rewriting rules,
parallelization techniques for the rewriting of SQL activities [Rei08]. This is an example
of a hybrid approach, where aspects of data transfer optimization and control flow
parallelization are combined in order to achieve the highest performance.

To summarize, existing techniques of optimizing integration flows mainly try to decrease
the costs for accessing external systems and to increase the degree of parallelism. Thereby,
