Cost-Based Optimization of Integration Flows - Datenbanken ...
3.2 Prerequisites for Cost-Based Optimization<br />
concept is to use an empirical cost model. There, physical operators are empirically analyzed<br />
(static or dynamic/learning-based) with regard to the execution costs when varying<br />
the input parameters such as the input cardinalities. The results of these experimental<br />
measurements are then used in order to determine continuous cost functions that approximately<br />
describe the cost of the evaluated operator. This is done for all physical operators,<br />
and the results are then generalized to cost functions for logical operators. These continuous cost<br />
functions are either tailor-made for a given hardware platform or include parameters that<br />
adjust the cost model to different hardware platforms. Empirical<br />
cost models are known to produce precise results in local settings, where operators<br />
and data structures are known in advance and where all operations are executed locally.<br />
For example, such empirical cost models exist for object-relational DBMS [Mak07], native<br />
XML DBMS [WH09, Wei11], and computational-science applications [SBC06]. Unfortunately,<br />
due to the integration of heterogeneous systems and applications, an empirical<br />
cost model is inadequate for integration flows or distributed data management in general<br />
because dynamic adaptation to external costs (execution times of external queries<br />
or remote data access) is required [ROH99, JSHL02, NGT98, GST96, RZL04, SW97].<br />
Furthermore, learning-based approaches can lead to non-monotonic cost functions.<br />
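The fitting step of an empirical cost model can be illustrated with a small sketch. It assumes a linear cost function cost(n) = a + b * n and an ordinary least-squares fit; the function name `fit_linear_cost` and the measurement values are synthetic and purely illustrative, not taken from the cited systems.

```python
def fit_linear_cost(samples):
    """Least-squares fit of cost(n) = a + b * n.

    samples: list of (input_cardinality, measured_seconds) pairs.
    Returns the coefficients (a, b).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("input cardinalities must vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic measurements (hypothetical): execution time vs. input cardinality.
measurements = [(100, 0.012), (200, 0.022), (400, 0.042), (800, 0.082)]
a, b = fit_linear_cost(measurements)

# Use the fitted continuous cost function for an unseen cardinality.
estimate = a + b * 1000
```

Such a fitted function only describes the platform on which it was measured, which is why the text above notes that it cannot follow changing external costs.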
In contrast to empirical cost models in DBMS, our double-metric cost model relies on a<br />
fundamentally different concept. The core idea is twofold. First, we define abstract cost<br />
functions according to the asymptotic time complexity of the different operator implementations<br />
using the monitored cardinalities (similar to empirical cost models) as input<br />
parameters. Second, we additionally use the monitored execution times in order to weight<br />
the computed costs. Similar to cardinality estimation in the Leo project<br />
[SLMK01, ML02, AHL+04], this ensures that the cost model is self-adjusting by definition, which<br />
means that it does not require any knowledge about the involved external systems, and<br />
that we are able to compare costs of interaction-, control-flow-, and data-flow-oriented<br />
operators by a combined metric of cardinalities and execution times.<br />
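A minimal sketch of how such a double-metric estimate might be computed. It assumes that the weight is the ratio of the monitored execution time to the abstract cost at the monitored cardinality; this section does not give the exact formula, so the names `ABSTRACT_COST` and `weighted_cost`, the two operator names, and the weighting rule itself are illustrative assumptions.

```python
import math

# Abstract cost functions chosen by asymptotic time complexity.
# Operator names and complexities are illustrative assumptions.
ABSTRACT_COST = {
    "selection": lambda n: float(n),                            # O(n)
    "sort": lambda n: n * math.log2(n) if n > 1 else float(n),  # O(n log n)
}


def weighted_cost(op, monitored_card, monitored_time, estimated_card):
    """Scale the abstract cost at estimated_card by the monitored time.

    weight = monitored_time / abstract_cost(monitored_card)   (assumed rule)
    result = weight * abstract_cost(estimated_card)
    """
    abstract_monitored = ABSTRACT_COST[op](monitored_card)
    if abstract_monitored <= 0:
        raise ValueError("monitored cardinality must yield a positive cost")
    weight = monitored_time / abstract_monitored
    return weight * ABSTRACT_COST[op](estimated_card)
```

Because the weight is derived from monitored execution times, the result is expressed in time units for every operator type, and it shifts automatically when an external system becomes slower or faster.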
Statistics Monitoring and Cost Model<br />
Execution statistics are monitored at the granularity of single operators in order to enable<br />
our double-metric cost estimation approach. Figure 3.3 illustrates the conceptual model<br />
of operator statistics.<br />
[Figure: three operator nodes, each labeled with nid, W(o_i), and wait(o_i), with input cardinalities above and output cardinalities below: (a) Generic Operator, k1 inputs and k2 outputs; (b) Unary Operator, one input and one output; (c) Binary Operator, two inputs and one output]<br />
Figure 3.3: General Model of Operator Execution Statistics<br />
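The statistics model of Figure 3.3 could be represented as a small record type. The following is only a sketch: the class name `OperatorStats`, its field names, and the derived `local_time` property are assumptions made for illustration, while the constraint 0 ≤ wait(o_i) ≤ W(o_i) is the one stated for the model.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class OperatorStats:
    """Execution statistics of one operator o_i (hypothetical layout)."""
    nid: int                    # node identifier
    in_cards: Tuple[int, ...]   # cardinalities of the input data sets
    out_cards: Tuple[int, ...]  # cardinalities of the output data sets
    total_time: float           # W(o_i), total execution time
    wait_time: float            # wait(o_i), subsumed waiting time

    def __post_init__(self):
        # Enforce 0 <= wait(o_i) <= W(o_i).
        if not 0 <= self.wait_time <= self.total_time:
            raise ValueError("require 0 <= wait_time <= total_time")
        if any(c < 0 for c in self.in_cards + self.out_cards):
            raise ValueError("cardinalities must be non-negative")

    @property
    def local_time(self) -> float:
        # Assumed derived quantity: time not spent waiting.
        return self.total_time - self.wait_time


# A binary operator with a single output, using made-up values.
stats = OperatorStats(nid=7, in_cards=(100, 50), out_cards=(120,),
                      total_time=2.0, wait_time=0.5)
```

Unary and binary operators are simply the cases with one or two entries in `in_cards` and one entry in `out_cards`.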
An operator o_i is described, in the general case (see Figure 3.3(a)), by the cardinalities of<br />
arbitrary input data sets |ds_in_i|, a node identifier nid, the total execution time W(o_i), the<br />
subsumed waiting time wait(o_i) (e.g., the time of an external query execution, during which local processing<br />
is blocked) with 0 ≤ wait(o_i) ≤ W(o_i), and the cardinalities of arbitrary output data<br />
sets |ds_out_j|. Naturally, unary (see Figure 3.3(b)) and binary (see Figure 3.3(c)) operators<br />
with a single output are most common. With regard to efficient monitoring, we use the<br />