Cost-Based Optimization of Integration Flows - Datenbanken ...
Abstract
Integration flows are increasingly used to specify and execute data-intensive integration
tasks between several heterogeneous systems and applications. There are many different
application areas such as (near) real-time ETL (Extraction, Transformation, Loading) and
data synchronization between operational systems. Because of (1) an increasing
amount of data, (2) typically highly distributed IT infrastructures, and (3) high requirements
for data consistency and up-to-dateness, many instances of integration flows, each with
a rather small amount of data, are executed over time by the central integration
platform. Due to this high load as well as blocking synchronous source systems or
client applications, the performance of the central integration platform is crucial for an
IT infrastructure. As a result, there is a need for optimizing integration flows. Existing
approaches for the optimization of integration flows tackle this problem with rule-based
optimization in the form of algebraic simplifications or static rewriting decisions during
deployment. Unfortunately, rule-based optimization exhibits two major drawbacks. First,
we cannot exploit the full optimization potential because the decision on rewriting alternatives
often depends on dynamically changing costs with regard to execution statistics
such as cardinalities, selectivities and execution times. Second, there is no re-optimization
over time and hence, the adaptation to changing workload characteristics is impossible.
In conclusion, there is a need for adaptive cost-based optimization of integration flows.
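
To illustrate why a static rewriting decision can become suboptimal, the following minimal sketch orders two selection operators by estimated cost. It is not the cost model of this thesis: the operator names (`filter_a`, `filter_b`), the per-tuple costs, the selectivities, and the helper functions are all hypothetical and chosen only for illustration.

```python
from itertools import permutations


def plan_cost(order, stats, input_cardinality):
    """Estimate the cost of applying selection operators in the given order.

    stats maps an operator name to (per_tuple_cost, selectivity).
    This is a simplified, assumed cost model for illustration only.
    """
    cost = 0.0
    cardinality = float(input_cardinality)
    for name in order:
        per_tuple_cost, selectivity = stats[name]
        cost += cardinality * per_tuple_cost  # every incoming tuple is evaluated
        cardinality *= selectivity            # only qualifying tuples flow on
    return cost


def best_order(stats, input_cardinality):
    """Enumerate all operator orders and return the cheapest one."""
    return min(
        permutations(sorted(stats)),
        key=lambda order: plan_cost(order, stats, input_cardinality),
    )


# Hypothetical statistics observed at two different points in time.
early = {"filter_a": (1.0, 0.1), "filter_b": (1.0, 0.9)}
late = {"filter_a": (1.0, 0.9), "filter_b": (1.0, 0.1)}

print(best_order(early, 1000))
print(best_order(late, 1000))
```

Under the first set of statistics the more selective `filter_a` should run first; once the selectivities swap, the cheaper plan runs `filter_b` first. A decision fixed at deployment time would keep the first order and miss this change.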
This problem of cost-based optimization of integration flows is not as straightforward as
it may appear at first glance. The differences from optimization in traditional data management
systems are manifold. First, integration flows are reactive in the sense that they process
remote, partially inaccessible data that is received in the form of message streams.
Thus, proactive optimization such as dedicated physical design is impossible. Second,
there is also the problem of missing knowledge about data properties of external systems
because, in the context of loosely coupled applications, statistics are inaccessible or do
not exist at all. Third, in contrast to traditional declarative queries, integration flows are
described as imperative flow specifications including both data-flow-oriented and control-flow-oriented
operators. This requires awareness of semantic correctness when
rewriting such flows. Additionally, further integration-flow-specific transactional properties
such as the serial order of messages, the cache coherency problem when interacting
with external systems, and the compensation-based rollback must be taken into account
when optimizing such integration flows. In conclusion, the cost-based optimization of
integration flows is a hard but highly relevant problem in today’s IT infrastructures.
In this thesis, we introduce the concept of cost-based optimization of integration flows
that relies on incremental statistics maintenance and inter-instance plan re-optimization.
As a foundation, we propose the concept of periodical re-optimization and present how
to integrate such a cost-based optimizer into the system architecture of an integration
platform. This includes integration-flow-specific (1) prerequisites such as the dependency
analysis and a cost model for interaction-, control-flow- and data-flow-oriented operators
as well as (2) specific statistics maintenance strategies, optimization algorithms and
optimization techniques. While this architecture was inspired by cost-based optimizers