25.01.2015 Views

Cost-Based Optimization of Integration Flows - Datenbanken ...


3.2 Prerequisites for Cost-Based Optimization

concept is to use an empirical cost model. There, physical operators are analyzed empirically (static or dynamic/learning-based) with regard to their execution costs when varying input parameters such as the input cardinalities. The results of these experimental measurements are then used to determine continuous cost functions that approximately describe the cost of the evaluated operator. This is done for all physical operators, and the results are then generalized to cost functions for logical operators. These continuous cost functions are either tailor-made for a given hardware platform or include parameters to fine-tune and adjust the cost models to different hardware platforms. Empirical cost models are known to produce precise results in local settings, where operators and data structures are known in advance and where all operations are executed locally. For example, such empirical cost models exist for object-relational DBMS [Mak07], native XML DBMS [WH09, Wei11], and computational-science applications [SBC06]. Unfortunately, due to the integration of heterogeneous systems and applications, an empirical cost model is inadequate for integration flows, or for distributed data management in general, because dynamic adaptation to external costs (execution times of external queries or remote data access) is required [ROH99, JSHL02, NGT98, GST96, RZL04, SW97]. Furthermore, learning-based approaches can lead to non-monotonic cost functions.
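The empirical approach described above can be illustrated with a minimal sketch. It assumes a single operator whose cost is modeled as a + b * basis(n) over the input cardinality n and fits the two parameters by ordinary least squares. The function name `fit_cost_function`, the choice of a linear basis, and the measurements (generated synthetically from a known function, not taken from any real system) are assumptions made for illustration only.

```python
from typing import Callable, Sequence, Tuple


def fit_cost_function(
    cardinalities: Sequence[float],
    measured_times: Sequence[float],
    basis: Callable[[float], float],
) -> Tuple[float, float]:
    """Least-squares fit of time ~ a + b * basis(cardinality); returns (a, b)."""
    xs = [basis(n) for n in cardinalities]
    k = len(xs)
    if k < 2 or k != len(measured_times):
        raise ValueError("need at least two (cardinality, time) pairs")
    mean_x = sum(xs) / k
    mean_y = sum(measured_times) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("the input cardinalities must vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, measured_times))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic "measurements" generated from time = 2.0 + 0.5 * n (illustrative only).
cards = [100.0, 200.0, 400.0, 800.0]
times = [2.0 + 0.5 * n for n in cards]

# Fit a continuous cost function with a linear basis (an assumed operator shape).
a, b = fit_cost_function(cards, times, basis=lambda n: n)
```

Such a fit depends on measurements taken in one local environment, which is why the fitted parameters say nothing about the execution time of an external system that changes independently.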

In contrast to empirical cost models in DBMS, our double-metric cost model relies on a fundamentally different concept. The core idea is twofold. First, we define abstract cost functions according to the asymptotic time complexity of the different operator implementations, using the monitored cardinalities as input parameters (similar to empirical cost models). Second, we additionally use the monitored execution times to weight the computed costs. This ensures, similar to cardinality estimation in the Leo project [SLMK01, ML02, AHL+04], that the cost model is self-adjusting by definition, which means that it does not require any knowledge about the involved external systems, and that we are able to compare the costs of interaction-, control-flow-, and data-flow-oriented operators by a combined metric of cardinalities and execution times.
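A minimal sketch of the double-metric idea follows. The text above does not give the exact weighting formula, so the ratio used here (monitored execution time divided by the abstract cost at the monitored cardinality) is an assumption, as are the class name `MonitoredOperator`, the chosen complexity functions, and all numbers.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MonitoredOperator:
    """Hypothetical record of one operator's monitored statistics."""

    abstract_cost: Callable[[float], float]  # asymptotic cost as a function of cardinality
    monitored_cardinality: float             # cardinality observed during monitoring
    monitored_time_ms: float                 # execution time observed during monitoring

    def estimate_time_ms(self, cardinality: float) -> float:
        """Weight the abstract cost by the monitored time (assumed ratio-based weighting)."""
        base = self.abstract_cost(self.monitored_cardinality)
        if base <= 0:
            raise ValueError("abstract cost of the monitored run must be positive")
        weight = self.monitored_time_ms / base  # time per unit of abstract cost
        return weight * self.abstract_cost(cardinality)


# Illustrative operators; the cost shapes are assumptions, not taken from the text.
selection = MonitoredOperator(lambda n: n, 1000.0, 20.0)
sort = MonitoredOperator(lambda n: n * math.log2(n), 1024.0, 51.2)
# An interaction-oriented operator modeled with constant abstract cost, so its
# estimate is driven entirely by the monitored time of the external system.
external_query = MonitoredOperator(lambda n: 1.0, 1000.0, 150.0)

selection_estimate = selection.estimate_time_ms(500.0)
sort_estimate = sort.estimate_time_ms(2048.0)
external_estimate = external_query.estimate_time_ms(500.0)
```

Because every estimate is expressed in the same unit (weighted time), operators of different kinds become comparable, and a change in an external system's response time shifts the weight without any model of that system.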

Statistics Monitoring and Cost Model

Execution statistics are monitored at the granularity of single operators in order to enable our double-metric cost estimation approach. Figure 3.3 illustrates the conceptual model of operator statistics.

[Figure 3.3 (image not reproduced). Each panel shows an operator $o_i$ annotated with its node identifier nid, execution time $W(o_i)$, and waiting time $wait(o_i)$. (a) Generic Operator: inputs $(|ds_{in_1}(o_i)|, \dots, |ds_{in_{k_1}}(o_i)|)$ and outputs $(|ds_{out_1}(o_i)|, \dots, |ds_{out_{k_2}}(o_i)|)$. (b) Unary Operator: input $|ds_{in}(o_i)|$ and output $|ds_{out}(o_i)|$. (c) Binary Operator: inputs $|ds_{in_1}(o_i)|$ and $|ds_{in_2}(o_i)|$ and output $|ds_{out}(o_i)|$.]

Figure 3.3: General Model of Operator Execution Statistics
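The statistics model of Figure 3.3 can be sketched as a small record type. This is an illustrative sketch, not the system's actual data structure: the class name `OperatorStats`, the field names, the millisecond unit, and the derived `local_time_ms` property are assumptions. The only constraint taken from the text is that the waiting time is subsumed in the execution time, that is, 0 ≤ wait(o_i) ≤ W(o_i).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class OperatorStats:
    """Hypothetical per-operator statistics record following Figure 3.3."""

    nid: int                               # node identifier
    input_cardinalities: Tuple[int, ...]   # one entry per input data set
    output_cardinalities: Tuple[int, ...]  # one entry per output data set
    exec_time_ms: float                    # total execution time W(o_i)
    wait_time_ms: float                    # subsumed waiting time wait(o_i)

    def __post_init__(self) -> None:
        # Enforce 0 <= wait(o_i) <= W(o_i) as stated in the text.
        if not 0 <= self.wait_time_ms <= self.exec_time_ms:
            raise ValueError("waiting time must lie between 0 and the execution time")
        if any(c < 0 for c in self.input_cardinalities + self.output_cardinalities):
            raise ValueError("cardinalities must be non-negative")

    @property
    def local_time_ms(self) -> float:
        """Time not spent waiting (a derived convenience value, assumed here)."""
        return self.exec_time_ms - self.wait_time_ms


# A binary operator with a single output, the common case named in the text.
stats = OperatorStats(
    nid=7,
    input_cardinalities=(1000, 250),
    output_cardinalities=(180,),
    exec_time_ms=12.0,
    wait_time_ms=9.0,
)
```

Validating the invariant at construction time keeps inconsistent measurements out of any later cost computation.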

An operator $o_i$ is described, in the general case (see Figure 3.3(a)), by the cardinalities of arbitrary input data sets $|ds_{in_i}|$, a node identifier nid, the total execution time $W(o_i)$, the subsumed waiting time $wait(o_i)$ (e.g., the time of external query execution, during which local processing is blocked) with $0 \le wait(o_i) \le W(o_i)$, and the cardinalities of arbitrary output data sets $|ds_{out_j}|$. Naturally, unary (see Figure 3.3(b)) and binary (see Figure 3.3(c)) operators with a single output are most common. With regard to efficient monitoring, we use the
