Applying OLAP Pre-Aggregation Techniques to ... - Jacobs University
Applying OLAP Pre-Aggregation Techniques to Speed Up
Aggregate Query Processing in Array Databases

by

Angélica García Gutiérrez

A thesis submitted in partial fulfillment
of the requirements for the degree of

Doctor of Philosophy
in Computer Science

Approved, Thesis Committee:

Prof. Dr. Peter Baumann
Prof. Dr. Vikram Unnithan
Prof. Dr. Inés Fernando Vega López

Date of Defense: November 12, 2010

School of Engineering and Science
In memory of my grandmother, Naty.
Acknowledgments

I would like to express my sincere gratitude to my thesis advisor, Prof. Dr. Peter
Baumann, for his excellent guidance throughout the course of this dissertation. With
his tremendous passion for science and his great efforts to explain things clearly and
simply, he made this research one of the richest experiences of my life. He
always suggested new ideas and guided my research through many pitfalls. Furthermore,
I learned from him to be kind and cooperative. Thank you for every
single meeting, for every single discussion that you always managed to make thought-provoking,
for your continued encouragement, and for believing that I could bring this
project to success.
I am also grateful to Prof. Dr. Inés Fernando Vega López for his valuable suggestions.
He not only provided me with technical advice but also gave me important
hints on scientific writing that I applied in this dissertation. My sincere gratitude also
goes to Prof. Dr. Vikram Unnithan. Despite being one of Jacobs University's most popular
and busiest professors due to his genuine engagement with student life beyond
academics, Prof. Unnithan took interest in this work and provided me with unconditional
support.
I would like to thank two promising graduate students, Irina Calciu and Eugen
Sorbalo, for their outstanding contributions to some of the experiments presented in
Chapter 5 of this thesis.
I am especially grateful to my colleagues Michael Owonibi, Salah Al Jubeh, and
Yu Jinsongdi for many valuable discussions, and for providing a stimulating and
fun environment in which to learn and grow.
I am grateful to the team assistants at the School of Engineering and Science for helping
the School to run smoothly and for assisting me in many different ways. Sigrid
Manss deserves special mention. Thank you for all your kindness and caring.
Also, I would like to thank Connie Garcia, Jim Toersten, Greg White, Irina Prjadeha,
and all of my friends who helped me proofread this thesis. Victoria Inness-Brown
deserves special mention for applying her expertise as an editor in reviewing
each chapter of this thesis.
Thank you to all my great friends who provided support and encouragement in so
many ways, for helping me to see the bright side of my problems in difficult times, and for
all the emotional support, camaraderie, entertainment, and caring provided. Especially
to Salah Al Jubeh, Asma Alazeib, Talina Eslava, Rainer Gruenheid, Yu Jinsongdi,
Maria Joy, Ghada Kadamany, Ingrid Lara, Blessing Musunda, Michael Owonibi, Jessica
Price, Irina Prjadeha, Joerg Reinekirchen, Yannic Ramaye, Mila Tarabashkina,
Ruiju Tong, Derya Toykan, Iyad Tumar, Vanya Uzunova, Tanja Vaitulevich, and Justo
Vargas. You all have a place in my heart. Also, to my friend Samantha Hooton, whom
I learned to love as a sister shortly after meeting her. Her authenticity, self-confidence,
and drive to succeed are a real inspiration. Thank you for your caring, for sharing your
wisdom, for taking me to the hospital when I was in pain, and for being there anytime
I needed a friend.
My warmest thanks to Father Matthew I. Nwoko for his spiritual guidance, his
caring, his advice, and above all, for his unconditional love.
Thank you to my parents, my brother, and my sisters, who have always been very supportive
of my aspirations. Their support has been instrumental in getting me on the
path that brought me to this project. Especially, thank you, Mom, for being my
example of tenacity and commitment. I dedicate this thesis to you as well.
The financial support and trust of DAAD and CONACYT are gratefully acknowledged.
To everyone who has been a part of my life, thank you very much.
Lastly, I thank the Lord God Almighty for giving me health, ideas, and wisdom to
enable me to complete this research project successfully.
Abstract

Large multidimensional arrays of data are common in a variety of scientific applications.
In the past, arrays have typically been stored in files, and then manipulated
by customized programs operating on those files. Nowadays, with science moving
toward computational databases, the trend is toward a new class of database, the array
database. In the broadest sense, the array database supports various types of multidimensional
array data, including remote-sensor data, satellite imagery, and data
resulting from scientific simulations.
As with traditional databases for business applications, analytics in array databases
often involves the extraction of general characteristics from large repositories. This requires
efficient methods for computing queries that involve data summarization, such
as aggregate queries. A typical solution is to pre-compute the whole or parts of each
query, and then save the results of those queries that are frequently submitted against
the database and those that can be used to compute the results of similar future queries.
This process is known as pre-aggregation. Unfortunately, pre-aggregation support for
array databases is currently limited to one specific operation, scaling (zooming), and
to two-dimensional datasets (images).
In this respect, database technology for business applications is much more mature.
Technologies such as On-Line Analytical Processing (OLAP) provide the means to
analyze business data from one or multiple sources, and thus facilitate the decision-making
process. In OLAP, the information is viewed as data cubes. These cubes
are typically stored in relational tables, in multidimensional arrays, or in a hybrid
model. In order to enable fast interactive multidimensional data analysis, database
systems frequently pre-compute and store the results of aggregate queries. While there
are some valuable research results in the realm of OLAP pre-aggregation techniques
with varying degrees of power and refinement, not enough work has been done and
reported for array databases.
The purpose of this thesis is to investigate the application of OLAP pre-aggregation
techniques with the objective of speeding up aggregate operations in array databases.
In particular, we consider enhancing aggregate computation in Geographic Information
Systems (GIS) and remote-sensing imaging applications. To this end, we describe
a set of fundamental operations in GIS based on a sound algebraic framework.
This allows us to identify those operations that require data summarization and that
therefore may benefit from pre-aggregation. We introduce a conceptual framework
and cost model for rewriting basic aggregate queries in terms of pre-aggregated data,
and conduct experiments to assess the performance of our algorithms. Results show
that query response times can be substantially reduced by strategically selecting the
pre-aggregate with the least cost in terms of execution time. We also investigate the
problem of selecting a set of queries for pre-aggregation, but failed to find an analytical
solution for all possible types of aggregate queries. Nevertheless, we present a
framework and algorithms for the selection of scaling operations for pre-aggregation
considering 2D, 3D, and 4D datasets. The results of our experiments with 2D datasets
outperform the results of image pyramids, the current technique used to speed up scaling
operations on 2D datasets. Furthermore, our experiments on 3D and 4D datasets
show that query response times can also be substantially reduced by intelligently selecting
a set of scaling operations for pre-aggregation.
The work presented in this thesis is the first of its kind for array databases in scientific
applications.
Contents

1 Introduction and Problem Statement                                          9
  1.1 Overview of Thesis and Contributions . . . . . . . . . . . . . . . .  12
  1.2 Publications Related to this Thesis . . . . . . . . . . . . . . . . .  12

2 Background and Related Work                                                15
  2.1 Array Databases . . . . . . . . . . . . . . . . . . . . . . . . . . .  15
      2.1.1 Basic Notion of Arrays . . . . . . . . . . . . . . . . . . . .  15
      2.1.2 2D Data Models . . . . . . . . . . . . . . . . . . . . . . . .  16
      2.1.3 Multidimensional Data Models . . . . . . . . . . . . . . . . .  17
      2.1.4 Storage Management . . . . . . . . . . . . . . . . . . . . . .  18
      2.1.5 2D Pre-Aggregation . . . . . . . . . . . . . . . . . . . . . .  19
      2.1.6 Pre-Aggregation Beyond 2D . . . . . . . . . . . . . . . . . . .  23
      2.1.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . .  25
  2.2 On-Line Analytical Processing (OLAP) . . . . . . . . . . . . . . . .  25
      2.2.1 OLAP Data Model . . . . . . . . . . . . . . . . . . . . . . . .  25
      2.2.2 OLAP Operations . . . . . . . . . . . . . . . . . . . . . . . .  26
      2.2.3 OLAP Architectures . . . . . . . . . . . . . . . . . . . . . .  26
      2.2.4 OLAP Pre-Aggregation . . . . . . . . . . . . . . . . . . . . .  30
  2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  33

3 Fundamental Geo-Raster Operations                                          37
  3.1 Array Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . .  37
      3.1.1 Constructor . . . . . . . . . . . . . . . . . . . . . . . . . .  38
      3.1.2 Condenser . . . . . . . . . . . . . . . . . . . . . . . . . . .  39
      3.1.3 Sorter . . . . . . . . . . . . . . . . . . . . . . . . . . . .  39
  3.2 Geo-Raster Operations . . . . . . . . . . . . . . . . . . . . . . . .  39
      3.2.1 Mathematical Operations . . . . . . . . . . . . . . . . . . . .  39
      3.2.2 Aggregation Operations . . . . . . . . . . . . . . . . . . . .  45
      3.2.3 Statistical Aggregate Operations . . . . . . . . . . . . . . .  51
      3.2.4 Affine Transformations . . . . . . . . . . . . . . . . . . . .  55
      3.2.5 Terrain Analysis . . . . . . . . . . . . . . . . . . . . . . .  57
      3.2.6 Other Operations . . . . . . . . . . . . . . . . . . . . . . .  59
  3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  61

4 Answering Basic Aggregate Queries Using Pre-Aggregated Data                63
  4.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  63
      4.1.1 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . .  64
      4.1.2 Pre-Aggregation . . . . . . . . . . . . . . . . . . . . . . . .  64
      4.1.3 Aggregate Query and Pre-Aggregate Equivalence . . . . . . . . .  64
  4.2 Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  67
      4.2.1 Computing Queries from Raw Data . . . . . . . . . . . . . . . .  68
      4.2.2 Computing Queries from Independent and Overlapped Pre-Aggregates  68
      4.2.3 Computing Queries from Dominant Pre-Aggregates . . . . . . . .  69
  4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . .  70
  4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . .  73
  4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  74

5 Pre-Aggregation Support Beyond Basic Aggregate Operations                  77
  5.1 Non-Standard Aggregate Operations . . . . . . . . . . . . . . . . . .  77
  5.2 Conceptual Framework . . . . . . . . . . . . . . . . . . . . . . . .  78
      5.2.1 Lattice Representation . . . . . . . . . . . . . . . . . . . .  79
      5.2.2 Pre-Aggregation Selection Problem . . . . . . . . . . . . . . .  80
  5.3 Pre-Aggregates Selection . . . . . . . . . . . . . . . . . . . . . .  82
      5.3.1 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . .  83
  5.4 Answering Scaling Operations Using Pre-Aggregated Data . . . . . . .  83
  5.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . .  85
      5.5.1 2D Datasets . . . . . . . . . . . . . . . . . . . . . . . . . .  86
      5.5.2 3D Datasets . . . . . . . . . . . . . . . . . . . . . . . . . .  91
      5.5.3 4D Datasets . . . . . . . . . . . . . . . . . . . . . . . . . .  98
  5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

6 Conclusion                                                                103
  6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
List of Figures

2.1  3D Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  16
2.2  Map Algebra Functions . . . . . . . . . . . . . . . . . . . . . . . .  17
2.3  Image Tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . .  19
2.4  Image Pyramids . . . . . . . . . . . . . . . . . . . . . . . . . . .  20
2.5  Nearest Neighbor, Bilinear and Cubic Interpolation Methods . . . . .  22
2.6  3D Scaling Operations on Time-Series Imagery Datasets . . . . . . . .  24
2.7  OLAP Data Cube . . . . . . . . . . . . . . . . . . . . . . . . . . .  26
2.8  Typical OLAP Cube Operations . . . . . . . . . . . . . . . . . . . .  27
2.9  OLAP Approaches: MOLAP, ROLAP, and HOLAP . . . . . . . . . . . . . .  27
2.10 MOLAP Storage Scheme . . . . . . . . . . . . . . . . . . . . . . . .  28
2.11 ROLAP Storage Scheme . . . . . . . . . . . . . . . . . . . . . . . .  29
2.12 Typical Query as Expressed in ROLAP and MOLAP Systems . . . . . . . .  29
2.13 Star Model of a Spatial Warehouse . . . . . . . . . . . . . . . . . .  32
2.14 Comparison of Roll-Up and Scaling Operations . . . . . . . . . . . .  34

3.1  Reduction of Contrast in the Green Channel of an RGB Image . . . . .  40
3.2  Highlighted Infrared Areas of an NRG Image . . . . . . . . . . . . .  41
3.3  Cells of Rasters A and B with Equal Values . . . . . . . . . . . . .  42
3.4  Re-Classification of the Cell Values of a Raster Image . . . . . . .  43
3.5  Computation of a Proximity Operation . . . . . . . . . . . . . . . .  44
3.6  Computation of an Overlay Operation . . . . . . . . . . . . . . . . .  45
3.7  Computation of an Overlay Operation Considering Values Greater than Zero  46
3.8  Calculation of the Total Sum of Cell Values in a Raster . . . . . . .  47
3.9  Result of an Average Aggregate Operation . . . . . . . . . . . . . .  48
3.10 Result of a Maximum Aggregate Operation . . . . . . . . . . . . . . .  48
3.11 Result of a Minimum Aggregate Operation . . . . . . . . . . . . . . .  49
3.12 Computation of the Histogram for a Raster Image . . . . . . . . . . .  50
3.13 Computation of the Diversity for a Raster Image . . . . . . . . . . .  50
3.14 Computation of a Majority Operation for a Raster Image . . . . . . .  51
3.15 Computation of the Variance for a Raster Image . . . . . . . . . . .  52
3.16 Computation of the Standard Deviation for a Raster Image . . . . . .  52
3.17 Computation of the Median for a Raster Image . . . . . . . . . . . .  54
3.18 Computation of a Top-k Operation for a Raster Image . . . . . . . . .  54
3.19 Computation of a Translation Operation for a Raster Image . . . . . .  56
3.20 Computation of a Scaling Operation for a Raster Image . . . . . . . .  57
3.21 Slopes Along the X and Y Directions . . . . . . . . . . . . . . . . .  58
3.22 Flow Directions . . . . . . . . . . . . . . . . . . . . . . . . . . .  59
3.23 Sobel Masks . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  60
3.24 Computation of an Edge-Detection for a Raster Image . . . . . . . . .  60

4.1  Types of Pre-Aggregates . . . . . . . . . . . . . . . . . . . . . . .  66
4.2  Selected Queries for Pre-Aggregation (left) and Decomposed Queries (right)  67

5.1  Sample Lattice Diagram for a Workload with Five Scaling Operations .  79
5.2  Query Workload with Uniform Distribution . . . . . . . . . . . . . .  87
5.3  Query Workload with Poisson Distribution . . . . . . . . . . . . . .  88
5.4  Selected Queries for Pre-Aggregation . . . . . . . . . . . . . . . .  89
5.5  Query Workload with Peak Distribution . . . . . . . . . . . . . . . .  90
5.6  Selected Queries for Pre-Aggregation . . . . . . . . . . . . . . . .  90
5.7  Query Workload with Step Distribution . . . . . . . . . . . . . . . .  91
5.8  Selected Queries for Pre-Aggregation . . . . . . . . . . . . . . . .  92
5.9  Workload with Uniform Distribution Along x, y, and t . . . . . . . .  93
5.10 Average Query Cost over Storage Space . . . . . . . . . . . . . . . .  93
5.11 Selected Pre-Aggregates, c = 36% . . . . . . . . . . . . . . . . . .  94
5.12 Workload with Uniform Distribution Along x, y, and Poisson Distribution in t  95
5.13 Average Query Cost as Space is Varied . . . . . . . . . . . . . . . .  95
5.14 Selected Pre-Aggregates, c = 26% . . . . . . . . . . . . . . . . . .  96
5.15 Workload with Poisson Distribution Along x, y, and t . . . . . . . .  96
5.16 Average Query Cost as Space is Varied . . . . . . . . . . . . . . . .  97
5.17 Selected Pre-Aggregates, c = 30% . . . . . . . . . . . . . . . . . .  97
5.18 Workload with Poisson Distribution Along x, y, and Uniform Distribution in t  98
5.19 Average Query Cost as Space is Varied . . . . . . . . . . . . . . . .  99
5.20 Selected Pre-Aggregates, c = 21% . . . . . . . . . . . . . . . . . .  99
List of Tables

3.1 UNO and FAO Suitability Classifications . . . . . . . . . . . . . . .  43
3.2 Capability Indexes for Different Capability Classes . . . . . . . . .  43
3.3 Array Algebra Classification of Geo-Raster Operations . . . . . . . .  62

4.1 Cost Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . .  68
4.2 Database and Queries of the Experiment . . . . . . . . . . . . . . .  74
4.3 Comparison of Query Evaluation Costs Using Pre-Aggregated Data and Original Data  74

5.1 Sample Pre-Aggregates . . . . . . . . . . . . . . . . . . . . . . . .  84
5.2 ECHAM T-42 Climate Simulation Dimensions . . . . . . . . . . . . . . 100
5.3 4D Scaling: Scale Vector Distribution . . . . . . . . . . . . . . . . 100
5.4 4D Scaling: Selected Pre-Aggregates . . . . . . . . . . . . . . . . . 100
Chapter 1

Introduction and Problem Statement

Scientific computing platforms and infrastructures are making new kinds of experiments
possible, resulting in the generation of vast volumes of array data. This
is happening in many specialized application areas such as meteorology, oceanography,
hydrology, astronomy, medical imaging, and exploration systems for oil, natural
gas, coal, and diamonds. These datasets range from uniformly spaced points
(cells) along a single dimension to multidimensional arrays containing several different
types of data. For example, astronomy and the earth sciences operate on two- or
three-dimensional spatial grids, often using a plethora of spherical coordinate systems.
Furthermore, nearly all sciences must deal with data series over time. It is frequently
necessary to understand relationships between consecutive elements in time,
or to analyze entire sequences of observations, and such datasets may represent spatial,
temporal, or spatio-temporal information. For example, if ocean measurements
such as temperature, salinity, and oxygen are recorded every hour at spacings of
one meter in depth and ten meters in the two horizontal dimensions, the result is
a four-dimensional array with three spatial dimensions and one temporal dimension,
and three values attached to each cell of the array.
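The shape and size of such an array can be made concrete with a small back-of-the-envelope sketch. The extents below (a 1 km × 1 km region, 100 m deep, observed for one day) are hypothetical, chosen only to illustrate the dimensionality described above:

```python
# Hypothetical extents for the 4D ocean-measurement example:
# samples every 10 m horizontally, every 1 m in depth, every hour for one day.
x_cells = 1000 // 10   # 1 km of ocean, one sample per 10 m -> 100
y_cells = 1000 // 10   # 100
z_cells = 100 // 1     # 100 m deep, one sample per 1 m -> 100
t_cells = 24           # one sample per hour for a day

shape = (x_cells, y_cells, z_cells, t_cells)
values_per_cell = 3    # temperature, salinity, oxygen

total_values = x_cells * y_cells * z_cells * t_cells * values_per_cell
print(shape, total_values)  # (100, 100, 100, 24) 72000000
```

Even this modest setup yields 72 million stored values, which suggests why file-based processing quickly becomes unwieldy.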
In the past, arrays were typically stored in files and then manipulated by programs
that operated on those files. Nowadays, with science moving toward being computational
and data-based, the trend is toward a new class of database system which provides
support not only for traditional, or coded, data types such as text, integers, etc.,
but also for richer data types like multidimensional arrays. This new class of databases is
referred to as array databases.
Implementing an efficient array database management system (DBMS) can be very
challenging. Typically, there are two approaches that can be taken to store array
datasets in a DBMS. In the first, the values of each cell are stored in a separate row,
along with fields describing the position of the cell in the array. The most obvious
drawback of this approach is the need for a large multidimensional index to efficiently
find rows in the table. Moreover, the space taken by a multidimensional index is larger
than the size of the table itself if all dimensions forming an array are used as the key.
In the second approach, a multidimensional array is written to a Binary Large Object
(BLOB), which is stored in a field of a table in the database. Applications then fetch
the contents of the BLOB when they wish to operate on the data. The main drawback
of this approach is that it either requires the entire array to be passed to the client, or it
requires the client to perform a large number of BLOB input/output (I/O) operations
to read only the required portions of the array. With databases growing beyond a few
tens of terabytes, the analysis of large volumes of array datasets is severely limited
by the relatively low I/O performance of most of today's computing platforms. High-performance
numerical simulations are also increasingly feeling the I/O bottleneck.
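The two storage approaches can be sketched on a toy 4×4 array using SQLite as an illustrative stand-in (not any particular array DBMS): one row per cell with explicit position columns, versus the full array packed into a single BLOB:

```python
import sqlite3
import struct

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Approach 1: one row per cell, with position columns as the key.
# Locating a cell requires an index over all dimension columns.
cur.execute("CREATE TABLE cells (x INTEGER, y INTEGER, val REAL, PRIMARY KEY (x, y))")
data = [(x, y, float(x * 10 + y)) for x in range(4) for y in range(4)]
cur.executemany("INSERT INTO cells VALUES (?, ?, ?)", data)

# Approach 2: the whole array serialized into one BLOB field.
# Reading a single cell means fetching the BLOB (or many small I/O reads).
blob = struct.pack("16d", *[v for _, _, v in data])
cur.execute("CREATE TABLE arrays (id INTEGER PRIMARY KEY, payload BLOB)")
cur.execute("INSERT INTO arrays VALUES (1, ?)", (blob,))

# Reading cell (2, 3) via each route:
row_val = cur.execute("SELECT val FROM cells WHERE x=2 AND y=3").fetchone()[0]
payload = cur.execute("SELECT payload FROM arrays WHERE id=1").fetchone()[0]
blob_val = struct.unpack_from("d", payload, (2 * 4 + 3) * 8)[0]
print(row_val, blob_val)  # 23.0 23.0
```

The drawbacks discussed above follow directly from this sketch: the cell-per-row table duplicates coordinate data and needs a large key, while the BLOB route forces the client to fetch (or seek within) the serialized array itself.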
To improve data management and analytics on large repositories of data, aggregation
has been put forward as a key process for describing high-level data. An
example of data aggregation is the computation and storage of statistical parameters,
such as count, average, median, and standard deviation. Aggregate computation has
been studied in a variety of settings [4, 21, 66]. In particular, On-Line Analytical Processing
(OLAP) technology has emerged to address the problem of efficiently computing
complex multidimensional aggregate queries on large data warehouses. Most
OLAP systems rely on the process of selecting aggregate combinations, and then pre-computing
and storing their results so the database system can make use of them in
subsequent requests. This process is known as pre-aggregation, and it has proved to
speed up aggregate queries by several orders of magnitude in business and statistical
applications [31, 41].
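A minimal illustration of the idea (pure Python, not the thesis's actual C++/RasDaMan implementation): block sums over a one-dimensional array are pre-computed once, and a SUM query is then rewritten to use the stored aggregates wherever whole blocks fall inside the query range, falling back to raw cells only at the edges:

```python
# Toy pre-aggregation sketch: pre-compute sums over fixed-size blocks,
# then answer range-sum queries from the stored block aggregates.
BLOCK = 4
raw = list(range(32))  # stand-in for the array database contents
block_sums = [sum(raw[i:i + BLOCK]) for i in range(0, len(raw), BLOCK)]

def range_sum(lo, hi):
    """Sum of raw[lo:hi], using pre-aggregated block sums where blocks align."""
    total, i = 0, lo
    while i < hi:
        if i % BLOCK == 0 and i + BLOCK <= hi:
            total += block_sums[i // BLOCK]  # whole block: reuse the pre-aggregate
            i += BLOCK
        else:
            total += raw[i]                  # partial block: read raw cells
            i += 1
    return total

print(range_sum(4, 20), sum(raw[4:20]))  # 184 184
```

The speed-up comes from touching one stored value per block instead of every cell; the selection problem studied later in the thesis is essentially deciding which such aggregates are worth materializing under a storage budget.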
While considerable work has been done on the problem of efficiently computing
aggregate queries in OLAP-based applications, such computations continue to be a
data management challenge in scientific applications. A relevant example in which the
use of advanced data management and efficient query processing is highly desirable
is hyper-spectral remote-sensing imaging, in which an imaging spectrometer collects
hundreds or even thousands of measurements for the same area of the surface of the
Earth. The scenes provided by such sensors are often called data cubes to denote
the dimensionality of the data. Notably, efficient query processing and data mining
techniques facilitate the exploration of spatio-temporal data patterns, both interactively and
in batch on archived data.
A significant fraction of scientific data is image-based and can be naturally represented
in multidimensional arrays. These datasets fit poorly into relational databases,
which lack efficient support for the concepts of physical proximity and order. They
are typically stored in array-friendly formats such as HDF5, netCDF, or FITS. The
extremely high computational requirements introduced by image-based scientific applications
make them an excellent case study for our research.
Since array databases and OLAP/data warehousing both deal with large multidimensional
datasets and aggregate queries, adapting OLAP pre-aggregation techniques
to the management and computation of aggregate queries in array databases offers
a strong potential benefit. This thesis investigates the application of OLAP pre-aggregation
techniques to speeding up query processing in array databases. In particular,
we focus on enhancing aggregate computation in GIS and remote-sensing imaging
applications. However, the results can be generalized to other domains as well.
Relevant and complementary questions to this thesis are:

1. What factors influence the decision to select an aggregate query for pre-aggregation?

2. What formalisms are necessary to establish an efficient and scalable pre-aggregation
framework for array databases?

3. What types of constraints are typically considered by existing OLAP pre-aggregation
algorithms, and how do they affect performance?
The thesis objectives are outlined as follows:

1. To illustrate the necessity of improving aggregate computation in array databases
for GIS and remote-sensing imaging applications.

2. To achieve a solid understanding of OLAP pre-aggregation algorithms and architectural
issues when manipulating large amounts of data.

3. To formally describe fundamental operations in GIS and remote-sensing imaging
applications and identify those that involve data summarization.

4. To design a theoretical pre-aggregation framework for array databases supporting
GIS and remote-sensing imaging applications.

5. To design query selection and query rewriting algorithms using existing OLAP/data
warehousing pre-aggregation techniques.

6. To implement the algorithms in an array database management system.

7. To conduct a performance study of the developed algorithms.
The methodological approach employed in this thesis is centered on a three-stage
design methodology:

• Identification of fundamental operations in GIS and remote-sensing imaging
applications. A literature review helped us identify fundamental operations in GIS that require
data summarization. The literature included different classification schemes,
international standards, and best practices.

• Design and implementation. Existing OLAP pre-aggregation techniques are used as a basis for the construction
of a pre-aggregation framework for array databases. Storage space constraints
are considered while designing the query selection algorithms. The algorithms
were developed using the C++ programming language and tested in the
RasDaMan multidimensional array database management system.

• Evaluation. The performance of the developed algorithms is measured on 2D, 3D, and 4D datasets.
For scaling operations on 2D datasets we compare our results against those of
the traditional image pyramids approach.
1.1 Overview of Thesis and Contributions

This section provides an overview of the following chapters.
Chapter 2 presents a comparative study of array databases and OLAP, and
devotes special attention to data structures and operations. It starts with a discussion
of existing approaches to data modeling, storage management, and query processing
in both array databases and the data warehousing/OLAP environment. Existing
pre-aggregation and related techniques are also discussed for both application domains.
From this study, one can observe similarities in data structures and operations
between the two application domains. This suggests that array databases can benefit
from pre-aggregation schemes to accelerate the computation of aggregate queries.
Chapter 3 describes fundamental operations in GIS and remote-sensing imaging applications. The selection of operations is based on a thorough review of existing surveys of GIS operations, international standards, and feedback from GIS practitioners. To better understand the structural characteristics of common queries in array databases, these operations were modeled using a proven array model. This allowed us to identify the set of operations requiring data summarization (aggregation) and the candidate operations to be supported by pre-aggregation techniques.
Chapter 4 deals with the computation of aggregate queries in array databases using pre-aggregated data. The proposed pre-aggregation framework distinguishes different types of pre-aggregates and shows that such a distinction is useful in finding an optimal solution that reduces the CPU cost of computing aggregate queries. A cost model is used to assess the benefit of using pre-aggregated data for computing aggregate queries. Measurements on real-life raster image datasets show that the computation of aggregate queries is always faster with our algorithms than with traditional methods.
Chapter 5 considers the problem of offering pre-aggregation support for non-standard aggregate operations in GIS and remote-sensing imaging applications. We discuss the issues encountered while attempting to provide pre-aggregation support for all non-standard aggregate operations, as well as the motivation for focusing on scaling operations. The framework and cost model presented in Chapter 4 are adapted to support scaling operations. Experiments covering 2D, 3D, and 4D datasets show that our pre-aggregation approach not only generalizes the most common approach for 2D, but also reduces computation times for 2D, 3D, and 4D datasets.
Chapter 6 presents a summary of our findings and outlines future lines of research.
1.2 Publications Related to this Thesis
A number of papers related to the work described in this thesis have been published. Doctoral workshops provided a platform to discuss the feasibility of the proposed research and an opportunity to receive feedback from experts in computer science [6] and the GIS scientific community [5]. Participation in those workshops led to a refinement of the research objectives outlined in Chapter 1. The study and algebraic modeling of geo-raster operations reported in Chapter 3 are presented in [7, 8].
The pre-aggregation framework described in Chapter 4 is presented in [9]. Finally, findings on the query selection problem addressed in Chapter 5 have been accepted for publication in [10].
Chapter 2
Background and Related Work
This chapter describes existing database technology for two environments: GIS/remote-sensing imaging and data warehousing/OLAP. Our investigation shows that conceptual data models and operations are similar in both application domains. This suggests that array database technology can be substantially enhanced by adopting a pre-aggregation scheme built on existing OLAP technology.
2.1 Array Databases
Multidimensional data analysis has recently taken the spotlight in the context of scientific applications. A fundamental demand from science users is extremely fast response times for multidimensional queries. While most scientific users can use relational tables, and have often been forced to do so by commercial DBMSs, only a few find tables to be a natural data model that closely matches their data. Furthermore, few users are satisfied with SQL as the interface language [30]. In contrast, arrays appear to be a natural data model for a significant subset of science users, specifically in astronomy, oceanography, and remote-sensing applications. Moreover, a table with a primary key is merely a 1D array. Hence, an array data model can subsume the needs of users who are satisfied with tables.
Next we review existing database technology supporting multidimensional arrays in scientific applications: 1D sensor time series, 2D satellite imagery, 3D image time series, and 4D atmospheric data.
2.1.1 Basic Notion of Arrays
Several approaches have been proposed towards the formalization of arrays and array query languages. The underlying methods of formalization differ, and the discussion is still open. However, the following notion of arrays is quite common [79]:
An array is a set of cells of a fixed data type T, with a fixed cell size. Each cell corresponds to one element in the multidimensional domain of the array. The domain D of an array is a d-dimensional subinterval of a discrete coordinate set S = S_1 × ... × S_d, where each S_i, i = 1, ..., d, is a finite, totally ordered discrete set and d is the dimensionality of the array.
The definition domain of an array is expressed as a multidimensional interval given by its lower and upper bounds, l_i and u_i respectively, along each dimension i, denoted D = [l_1:u_1; ...; l_d:u_d], where l_i < u_i, i = 1, ..., d, and l_i, u_i ∈ S_i.
Figure 2.1(a) shows the constituents of a sample 3D array.
Figure 2.1. 3D Array
The following subsections provide a brief summary of the main contributions in data modeling and query languages that support array data in GIS and remote-sensing imaging applications.
2.1.2 2D Data Models
A uniform representation and algebraic notation for manipulating image-based data structures, known as map algebra, was first advanced by Tomlin and Berry [56]. While not the first to describe this type of spatial data processing, Tomlin and Berry put forward the methodological basis for organizing this form of geographical data analysis. Map algebra treats individual rasters, or array layers, as members of algebraic equations. Map algebra functions are grouped into the following categories:
• Local functions create outputs in which output cell values are determined on a cell-by-cell basis, without regard for the values of neighboring cells.
• Focal functions create outputs in which the value of each output cell is affected by the values of neighboring cells. Low-pass filters are commonly used to smooth out data.
• Zonal functions create outputs in which the values of output cells are determined in part by the spatial association between cells in the input grids.
• Global functions compute an output raster where the value of each output cell is potentially a function of all input cell values.
Figure 2.2 shows a graphical classification of grid functions according to map algebra.
Figure 2.2. Map Algebra Functions
Map algebra is primarily oriented toward 2D static data. Each layer is associated with a particular moment or period of time, and analytical operations are intended to deal with spatial relationships. In its original form, map algebra was never intended to handle spatial data with a temporal component.
2.1.3 Multidimensional Data Models
AQL
Libkin et al. [63] presented an array data model called AQL that embeds array support into a nested relational calculus and treats arrays as functions rather than collection types. The AQL data model combines complex objects such as sets, bags, and lists with multidimensional arrays. To express complex object values, the core calculus on which AQL is based has been extended with concepts such as comprehensions, pattern matching, and block structures that strengthen the expressive power of the language. Still, AQL does not provide a declarative mechanism to define the order in which queries manipulate data.
Array Manipulation Language (AML)
AML is a query language for multidimensional array data [80]. The model is aimed at applications in image databases, particularly remote sensing, but is customizable to support a wide variety of application domains. An interesting characteristic of this language is its use of bit patterns, an array indexing mechanism that allows for a more powerful access structure over arrays. AML's algebra consists of three operators that enable the manipulation of arrays: subsample, merge, and apply. Each operator takes one or more arrays as arguments and produces an array as result. Subsample is a unary operator that eliminates cells from an array by cutting out slices. Merge is a binary operator that combines two arrays defined over the same domain. The apply operator applies a user-defined function to an array, thereby producing a new array. All AML operators take bit patterns as parameters.
Data and Query Model for Stream Geo-Raster Imagery
Gertz et al. [67] introduced a data and query model for managing and querying streams of remote-sensing imagery. The data model considers the spatio-temporal and geo-referenced nature of satellite imagery. Three classes of operators allow the formulation of queries. A stream restriction operator acts as a filter that selects points from a stream that satisfy a given condition on the spatial, temporal, or spatio-temporal component of the image. The stream transform operator maps the points or values associated with a stream to a new point or value set; this class of operators is useful for processing on a point-by-point basis. The third class of operators, stream compositions, allows the combination of image data from different spectral bands. To this end, each stream is considered to represent a single spectral band.
However, since the primary objective of the authors was to stream geo-raster image data, they put less emphasis on post-processing satellite images. Core operations such as Fourier transforms and edge detection are therefore not supported by their framework.
Array Algebra
Baumann [75] introduced a formal array model called Array Algebra that supports the description and manipulation of multidimensional array data types [76]. This compact algebra consists of three core operators: an array constructor, a general condenser for computing aggregations, and an index sorter. Through these operators, the expressive power of Array Algebra covers a wide range of signal processing, imaging, and statistical operations. Moreover, the termination of any well-formed query is guaranteed by limiting the expressive power to non-recursive operations. Array Algebra is described in more detail in Chapter 3.
To date, Array Algebra is the most comprehensive approach, supporting a variety of applications including sensor, image, and statistical data. Recently, a geo-raster service standard based on Array Algebra concepts has been issued by the Open Geospatial Consortium (OGC) [78]. Commercial and open-source implementations of Array Algebra are currently available to the scientific community.
2.1.4 Storage Management
At present, handling large image data stored in a database is usually carried out by adopting a tiling strategy [23]. An image is split into sub-images (tiles), as shown in Fig. 2.3. When a region of interest is requested in a given query operation, only the relevant tiles are accessed. This strategy results in significant I/O bandwidth savings. Tiles form the basic processing units for indexing and compression. Spatial indexing allows for the quick retrieval of the identifier and location of a required tile, while compression improves disk I/O bandwidth efficiency. The choice of tile size is crucial for efficiency: while large tiles return much redundant data in response to a range query, small tiles result in a poor compression ratio; typical tile sizes range from 8 KB (very small) to 512 KB (very large) [23, 96]. A comprehensive approach toward the storage of large amounts of data on tertiary storage media, considering tiling techniques in multidimensional database management systems, is presented in [23, 24, 25].
Figure 2.3. Image Tiling
A key factor influencing the effectiveness of a tiling scheme is compression. Raster data compression algorithms are essentially the same as those for other image data. However, remote-sensing images are usually of much higher resolution, are multi-spectral, and have significantly larger volumes than natural images. To effectively compress raster data in GIS environments, emphasis must be placed on the management of schemas to deal with large volumes of remote-sensing imagery, and on the integration of various types of datasets, such as vector and multidimensional datasets [3, 87].
Dehmel [3] proposed a comprehensive framework for the compression of multidimensional arrays based on different model layers, including various kinds of predictors and a generic wavelet engine for lossy compression with arbitrary quality levels. In particular, the author introduces concepts such as channel separation, which compresses the values of each channel separately, and predictors, which calculate approximate values for some cells and express those cell values relative to the approximations. Further, the proposed method applies wavelets to transform the channels individually into multi-resolution representations with coarse approximations and various levels of detail information. This led to a wavelet engine architecture consisting of three major components (transformation, quantization, and compression) that together improve compression rates in array databases considerably.
2.1.5 2D Pre-Aggregation
Aggregate operations in GIS and remote-sensing applications have been shown to be computationally expensive due to the size of the data and the complexity of the operations [8]. One such operation is zooming (scaling), which is carried out by interpolating the values of the original dataset to downsample it to a lower resolution. This is particularly necessary in web-based raster applications, where limitations such as bandwidth and other resources prevent efficient processing of the original raster datasets. For smooth interactive panning, browsers load the image in tiles and in quantities larger than what is actually displayed. Zooming far out results in large scale factors, meaning that large amounts of data must be moved to deliver minimal results.
Current database technology for GIS and remote-sensing imaging applications employs multi-scale image pyramids to improve the performance of scaling operations on 2D raster images [51, 70, 82]. The image pyramid technique resamples the original dataset into a number of copies, each at a coarser resolution (Fig. 2.4). The pyramid consists of a finite number of levels that differ in scale by a fixed step factor; the levels are much smaller in size than the original dataset but adequate for visualization at a lower scale (zoom ratio). Common practice is to construct pyramid levels in powers of 2, yielding scale factors 2, 4, 8, 16, 32, 64, 128, 256, and 512. When more detailed data are needed, or when the original image itself must be accessed, better access speed can be achieved by cutting the original data into smaller pieces, so that only a restricted area of the image, rather than the entire image, is read.
Figure 2.4. Image Pyramids
Pyramid Construction
The construction of pyramid layers requires resampling of the original image cell values. Resampling interpolates cell values, or otherwise assigns values, to the cells of a new raster object. It results in a raster with larger or smaller cells and different dimensions. Resampling changes the scale of an input raster and is used in conjunction with geometric transformation models that change the internal geometry of a raster. The following are the most popular interpolation methods [34]:
• Nearest neighbor is the resampling technique of choice for discrete (categorical) data, since it does not alter the values of the input cells [64]. After the center of an output cell is located on the input raster, nearest-neighbor assignment determines the closest cell center on the input raster and assigns that cell's value to the cell on the output raster.
• Linear interpolation is used to interpolate along value curves. It assumes that cell values vary in proportion to distance along a value segment: v = a + bx. Linear interpolation may be used to interpolate feature attribute values along a line segment connecting any two point-value pairs.
• Bilinear interpolation is used to interpolate cell values at direct positions within a quadrilateral grid. It assumes that feature attribute values vary as a bilinear function of position within the grid cell: v = a + bx + cy + dxy. Given a direct position p in a grid cell whose vertices are V, V+V_1, V+V_2, and V+V_1+V_2, where V_1 and V_2 are the offset vectors of the grid, and with cell values v_1, v_2, v_3, and v_4 at these vertices, respectively, there are unique numbers i and j, with 0 ≤ i ≤ 1 and 0 ≤ j ≤ 1, such that p = V + iV_1 + jV_2. The cell value at p is:
v = (1−i)(1−j)v_1 + i(1−j)v_2 + j(1−i)v_3 + ij v_4.
Since the values of output cells are calculated according to the relative positions and values of input cells, bilinear interpolation is preferred for data where the value assigned to a cell is determined by its location relative to a known point or phenomenon (that is, continuous surfaces). Elevation, slope, intensity of noise from an airport, and salinity of groundwater near an estuary are phenomena represented as continuous surfaces and are most appropriately resampled using bilinear interpolation.
• Quadratic interpolation is used to interpolate cell values along curves. It assumes that cell values vary as a quadratic function of distance along a value segment: v = a + bx + cx^2, where a is the value of a cell at the start of a value segment and v is the value of a cell at distance x along the curve from the start. Three point-value pairs are needed to provide control values for calculating the coefficients of the function.
• Cubic interpolation is used to interpolate cell values along curves. It assumes that cell values vary as a cubic function of distance along a value segment: v = a + bx + cx^2 + dx^3, where a is the value of a cell at the start of a value segment and v is the value of a cell at distance x along the curve from the start. Four point-value pairs are needed to provide control values for calculating the coefficients of the function.
Cubic convolution tends to sharpen the edges of the data more than bilinear interpolation, since more cells are involved in the calculation of the output values.
Pyramid Evaluation
During the evaluation of a scaling operation with a target scale factor s, the pyramid level with the largest scale factor s′ such that s′ ≤ s is determined. This level is loaded, and an adjustment is then made by scaling the resulting image by a factor of s/s′. If, for example, scaling by s = 11 is required, then pyramid level 3 with scale factor s′ = 8 is chosen, requiring a residual scaling of 11/8 = 1.375 and thereby touching only 1/64 of what would be read without a pyramid.
The computational complexity of a scaling operation depends on the chosen resampling method. For example, nearest-neighbor resampling considers the closest cell center of the input raster and assigns the value of that cell to the corresponding cell on the output raster. Other resampling methods, such as bilinear and cubic interpolation, consider a subset of cells to calculate each of the cell values of the output raster. Fig. 2.5 shows three common options for interpolating output cell values. Note that the bold outline (center image) indicates the current target cell for which a value is being interpolated.
(a) Portion of original raster. (b) Portion of output raster. (c) Input cells used by common resampling methods.
Figure 2.5. Nearest Neighbor, Bilinear and Cubic Interpolation Methods
A characteristic of the pyramid approach is that it increases the size of a raster dataset by approximately 33 percent, because the additional reduced-resolution representations are stored in the system together with the original dataset. This overhead is offset, however, by the improved response times obtained in return. The choice of resampling method for constructing the pyramid is influenced by the data characteristics and the type of analysis performed on the data. For example, the visual appearance of remote-sensing imagery is best with nearest-neighbor resampling, whereas scientific interpretation may require cubic interpolation. Rasters representing categorical data, e.g., land use data, do not allow interpolation, since it is important that the original data values remain unchanged; hence only nearest-neighbor resampling can be applied [64]. The reason categorical data should not be interpolated is that intermediate values cannot be derived with meaningful results. For example, soil type data cannot be interpolated, since soil types 14 and 15 cannot sensibly be averaged to derive a soil type 14.5. Creating pyramids for several different resampling methods is not efficient, due to the additional resources required for storage and maintenance. Thus, the hard-wired resampling approach imposes significant flexibility limitations on users when analytic objectives diverge.
Fast retrieval of raster image datasets has also been investigated in distributed database systems. Kitamoto [14] proposed a caching mechanism that allows two-dimensional satellite imagery to be cached at minimum resolution to provide a coarse view of the images in distributed satellite image databases. The cache management problem is treated as a knapsack problem [14], where the relevance and size of the data determine whether the data will be cached or not. Additionally, access patterns influence the relevance of the data: the frequency of requests for a given image and its resulting popularity rank are included in the strategy for cache selection. Prediction of user access patterns is not considered, however.
More recently, methods exploiting the capabilities of modern graphics hardware have been applied to the organization and processing of large amounts of satellite imagery. For example, Boettger et al. presented a method based on the concepts of perspective and complex logarithm [90] for the visualization and navigation of satellite and aerial imagery [50]. Datasets are decomposed into tiles of different sizes and resolution levels according to a pre-defined area of interest. Tiles closer to the center of interest have higher resolution, whereas low-resolution tiles are created for parts further away. The resulting tiles are indexed and cached in the memory of the graphics hardware, enabling quick access to the area of interest at the best available resolution. When the center of interest changes, tiles not yet available in graphics memory are loaded. Based on the assumption that the graphics memory offers more space than needed, the cache contains not only the tiles that conform to the area of interest, but also those that will presumably be needed in the future.
2.1.6 Pre-Aggregation Beyond 2D
Geographic phenomena can be examined at different granularities, including different spatial perspectives and temporal views. Earth remote-sensing imagery can be treated as time-series data to study and track changes over time. For example, a user looking at changes in vegetation patterns over a certain region during the past 10 years can see their effect on the regional maps over that time period. Fig. 2.6 shows various instances of scaling operations on 3D image time series. Figure 2.6(a) shows the original dataset, which consists of two spatial dimensions (dim 1, dim 2) and one temporal dimension (dim 3). Figure 2.6(b) shows the original dataset scaled down along the two spatial dimensions. Figure 2.6(c) shows a scaling operation along the time dimension of the original dataset. Figure 2.6(d) shows the original dataset scaled down in both the spatial and temporal dimensions.
Shifts in temporal detail have been studied in various application domains [18, 22, 43]. At the time of this writing, there is little support for zooming with respect to time in GIS technology: the focus has been on studying such alterations with respect to the geometric (vector) properties of objects [54, 58, 59].
Datasets in environmental observation and climate modeling are often defined over a 4D spatio-temporal space of the form (x, y, z, t), possibly extended with topology relationships. Scaling operations are also critical for these kinds of applications due to the size and dimensionality of the data. Extremely large volumes of data are generated during climate simulations; while only one part might be needed for a specific data analysis, huge data volumes are moved. This is particularly true for time-series data analysis. At the time of this writing, however, 4D scaling operations are not supported in GIS and remote-sensing imaging applications.
(a) 3D dataset. (b) 3D dataset scaled down along dim 1 and dim 2 by a factor of 2. (c) 3D dataset scaled down along dim 3 by a factor of 4. (d) 3D dataset scaled down along all dimensions by a factor of 2.
Figure 2.6. 3D Scaling Operations on Time-Series Imagery Datasets
2.1.7 Summary
Array database theory is gradually entering its consolidation phase. The notion of arrays as functions mapping points of some hypercube-shaped domain to values of some range set is commonly accepted. Two main modeling paradigms are used: calculus and algebra. Multidimensional data models embed arrays into the relational world, either by providing conceptual stubs, like Array Algebra, or by adding relational capabilities explicitly, such as AQL and RAM. Notably, aggregate query processing plays a critical role given the large volumes of the arrays. Our study shows that pre-aggregation techniques focus only on 2D datasets, and that support is limited to one particular operation: scaling. We identify the pyramid approach as the most popular method for speeding up scaling operations on 2D datasets, despite its known limitations such as hard-wired interpolation and lack of support for higher-dimensional datasets. Advances in graphics hardware are enabling quicker and more accurate visualization and navigation capabilities for raster imagery; however, little work has been reported on how array database technology is exploiting these hardware advances. A critical gap with respect to pre-aggregation is the lack of support for aggregate operations other than 2D scaling.
2.2 On-Line Analytical Processing (OLAP)

Data warehousing/OLAP is an application domain where complex multidimensional aggregates on large databases have been studied intensively. Typically, a data warehouse collects business data from one or multiple sources so that the desired financial, marketing, and business analyses can be performed. These kinds of analyses detect trends and anomalies, make projections, and support business decisions [41]. When such analysis predominantly involves aggregate queries, it is called on-line analytical processing, or OLAP [38, 39]. To understand the mechanism of pre-computation, the following subsections review different approaches to structuring multidimensional data, storage mechanisms, and operations in OLAP.
2.2.1 OLAP Data Model

The multidimensional OLAP model begins with the observation that the factors that influence decision-making processes are related to enterprise-specific facts, such as sales, shipments, hospital admissions, surgeries, and so on [68]. Instances of a fact correspond to events that occur: for example, every sale or shipment carried out is an event. Each fact is described by the values of a set of relevant measures providing quantitative descriptions of events; e.g., sales receipts, amounts shipped, hospital admission costs, and surgery times are all measures.
In OLAP, information is viewed conceptually as cubes that consist of descriptive categories (dimensions) and quantitative values (measures) [26, 81, 69, 83]. In the scientific literature, measures are at times called variables, metrics, properties, attributes, or indicators. Figure 2.7 illustrates a 3D OLAP data cube where business events (facts) are mapped at the intersection of a specific combination of dimensions.
Different attributes along each dimension are often organized in hierarchical structures that determine the different levels at which data can be further analyzed [26]. For example, within the time dimension, one may have levels composed of years, months, and days. Similarly, within the geography dimension, one may have levels such as country, region, state/province, or city. Hierarchical structures are used to infer summarization (aggregation), that is, whether an aggregate view (query) defined for some category can be correctly derived from a set of precomputed views defined for other categories.
Figure 2.7. OLAP Data Cube
2.2.2 OLAP Operations

OLAP includes a set of operations for the manipulation of dimensional data organized in multiple levels of abstraction. Basic OLAP operations are roll-up, drill-down, slice, dice, and pivot [44]. A roll-up (aggregation) operation computes higher aggregations from lower aggregations or base facts according to their hierarchies, whereas drill-down (disaggregation) is an analytic technique whereby the user navigates among levels of data ranging from the most summarized/aggregated to the most detailed. Typical OLAP aggregate functions include average, maximum, minimum, count, and sum. Drilling paths may be defined by the hierarchies within dimensions or by other relationships, dynamic within or between dimensions. A slice consists of the selection of a smaller data cube, or even the reduction of a multidimensional data cube to fewer dimensions, by a point restriction in some dimension. The dice operation works similarly to the slice except that it performs a selection on two or more dimensions. Figure 2.8 provides a graphical description of these operations.
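These cube operations can also be illustrated on a small numeric cube. The following sketch uses NumPy on an invented 12-month x 3-product x 2-store sales cube; all shapes and dimension names are assumptions for illustration, not tied to any particular OLAP product:

```python
import numpy as np

# Hypothetical sales cube: 12 months x 3 products x 2 stores.
cube = np.arange(12 * 3 * 2, dtype=float).reshape(12, 3, 2)

# Roll-up along the time hierarchy: aggregate 3 months into each quarter.
quarters = cube.reshape(4, 3, 3, 2).sum(axis=1)   # shape (4, 3, 2)

# Drill-down navigates back to the detailed level (here: the monthly cube).

# Slice: a point restriction on one dimension reduces dimensionality.
product0 = cube[:, 0, :]                          # shape (12, 2)

# Dice: range selections on two or more dimensions keep dimensionality.
sub = cube[0:6, 0:2, :]                           # shape (6, 2, 2)
```

Note how roll-up changes the granularity of a dimension, while slice and dice only restrict the domain without re-aggregating values.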
2.2.3 OLAP Architectures

Figure 2.9 shows the different approaches for the implementation of OLAP functionality: Multidimensional OLAP (MOLAP), Relational OLAP (ROLAP), and Hybrid OLAP (HOLAP). These approaches offer a common view in the form of data cubes, which is independent of how the data is stored.

Figure 2.8. Typical OLAP Cube Operations

Figure 2.9. OLAP Approaches: MOLAP, ROLAP, and HOLAP
MOLAP

MOLAP maintains data in a multidimensional matrix based on a non-relational, specialized storage structure [37]; see Fig. 2.10(a). While building the storage structure, selected aggregations associated with all possible roll-ups are precomputed and stored [92]. Thus, roll-up and drill-down operations are executed in interactive time. Products such as Oracle Essbase, IBM Cognos PowerPlay, and the open-source Palo have adopted this approach.
A MOLAP system is based on an ad-hoc logical model that directly represents multidimensional data and its applicable operations. The underlying multidimensional database physically stores data as arrays, and access to it is positional [68]. Grid-files [53, 55], R*-trees [71], and UB-trees [84] are among the techniques used for that purpose.

The main advantage of this approach is that it contains pre-computed aggregate values that offer a very compact and efficient way to retrieve answers for specific
aggregate queries [68]. One difficulty that MOLAP poses, however, pertains to the sparseness of the data. Sparseness means that many events did not take place, and valuable processing time is spent adding up zeros [91]. For example, a company may not sell every item every day in every store, so no values appear at the intersections where products are not sold in a particular region at a particular time. On the other hand, MOLAP can be much faster for applications where subsets of the data cube are dense [100]. Another limitation of this approach is that the computation of a cube requires a complex aggregate query across all data in a warehouse. Though it is possible to incrementally update cubes as new data arrives, it is impractical to dynamically create new cubes to answer ad-hoc queries [68].
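The cost of sparseness can be seen in a minimal sketch (shapes and values invented for illustration): a dense positional store must touch every cell when aggregating, while a sparse map of recorded events touches only the cells where something actually happened.

```python
import numpy as np

# Dense MOLAP-style cell store: 100 products x 50 stores x 365 days,
# but only two sales events ever took place.
dense = np.zeros((100, 50, 365))
dense[3, 7, 10] = 19.99
dense[42, 1, 200] = 5.50

# Sparse alternative: keep only the cells of events that occurred.
sparse = {(3, 7, 10): 19.99, (42, 1, 200): 5.50}

# Summing the dense cube adds up 1,825,000 cells, almost all zeros;
# the sparse representation adds exactly two values.
total_dense = dense.sum()
total_sparse = sum(sparse.values())
```

Both representations yield the same total, which is why dense positional storage pays off only where the cube, or a subset of it, is densely populated.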
Figure 2.10. MOLAP Storage Scheme
ROLAP

In ROLAP, the underlying data is stored in a relational database; see Fig. 2.11(a). The relational model, however, does not include the concepts of dimension, measure, and hierarchy. Thus, specific types of schemata must be created so the multidimensional model can be represented in terms of basic relational elements such as attributes, relations, and integrity constraints [68]. Such a representation is usually done with a star schema, although the snowflake schema is also often adopted.
ROLAP implementations can handle large amounts of data and leverage all functionalities of the relational database [72]. Disadvantages are that overall performance is slow and that each ROLAP report represents an SQL query, with the limitations that entails. ROLAP vendors have tried to mitigate this problem by including out-of-the-box complex functions in their product offerings and by providing users the capability of defining their own functions. Another problem with ROLAP implementations results from the performance hit caused by costly join operations between large tables [68]. To overcome this issue, fact tables in data warehouses are usually de-normalized. Substantial performance gains can be achieved through the materialization of derived tables (views) that store aggregate data used for typical OLAP queries.
Figure 2.11. ROLAP Storage Scheme
Figure 2.12 shows the formulation of a typical query in both ROLAP and MOLAP. The query yields sales information for a specific product sold in a particular city by a given vendor. The queries are formulated according to the syntax of Oracle 10g. Note the considerable difference in length between the two formulations.

Figure 2.12. Typical Query as Expressed in (a) ROLAP and (b) MOLAP Systems
HOLAP

The intermediate architecture type, HOLAP, combines the advantages offered by ROLAP and MOLAP. It takes from ROLAP implementations the standardization level and the ability to manage large amounts of data, and from MOLAP systems their typical query speed. For summary-type information, HOLAP leverages cube technology; for drilling down into details, it uses the ROLAP model. In a HOLAP architecture, the largest amount of data should be stored in an RDBMS to avoid the problems caused by sparsity, and a multidimensional system should store only the information users most frequently need to access [68]. If that information is not enough to answer a query, the system transparently accesses the data managed by the relational system.
2.2.4 OLAP Pre-Aggregation

OLAP systems require fast, interactive multidimensional analysis of aggregates. To fulfill this requirement, database systems frequently pre-compute aggregate views on some subset of dimensions and their corresponding hierarchies. Virtually all OLAP products resort to some degree of pre-computation of these aggregates, a process known as pre-aggregation. OLAP pre-aggregation techniques have been proven to speed up aggregate queries by several orders of magnitude in business applications [31, 41]. A full pre-aggregation of all possible combinations of aggregate queries, however, is not considered feasible because it often exceeds the available storage limit and incurs a high maintenance cost. Therefore, modern OLAP systems adopt a partial pre-aggregation approach where only a set of aggregates is materialized, which can then be re-used for efficiently computing other aggregates.
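A minimal sketch of this reuse (data and granularities invented for illustration): materializing only a monthly SUM view still lets coarser queries, such as a yearly total, be answered without rescanning the base data, because SUM distributes over the time hierarchy.

```python
import numpy as np

# Base data: daily sales for one year (12 idealized 30-day months).
daily = np.arange(360, dtype=float)

# Partial pre-aggregation: materialize only the monthly SUM view.
monthly_view = daily.reshape(12, 30).sum(axis=1)

# A yearly query is rewritten against the view: 12 reads instead of 360.
yearly_from_view = monthly_view.sum()
yearly_from_base = daily.sum()
```

The two totals coincide; the saving grows with the ratio of base cells to materialized cells, which is exactly what makes partial materialization attractive.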
Pre-aggregation techniques consist of three inter-related processes: view selection, query rewriting, and view maintenance. A view is a derived relation defined in terms of base relations. Views can be materialized by storing the tuples of a view in the database, as was first investigated in the 1980s [36]. Like a cache, a materialized view provides fast access to its data; also like a cache, it may get dirty whenever its underlying base relations are updated. The process of updating a materialized view in response to changes to its base data is called view maintenance [12].
View Selection

Gupta et al. [13] proposed a framework that shows how to use materialized views to help answer aggregate queries. The framework provides a set of query rewriting rules to determine which materialized aggregate views can be employed to answer aggregate queries. An algorithm uses these rules to transform a query tree into an equivalent tree with some or all base relations replaced by materialized views. Thus, a query optimizer can choose the most efficient tree and provide the best query response time. Harinarayan et al. [92] investigated the issue of how to select views for materialization under storage space constraints so that the average query cost is minimal.
To meet changing user needs, several dynamic pre-aggregation approaches have been proposed. In principle, views may be either selected on demand or pre-selected using some prediction strategy. For applications where storage space is a constraint, replacement algorithms identify those views that can be replaced with new selections [60]. Kotidis et al. [97] introduced a dynamic view selection approach for Multidimensional Range Queries (MRQ), known as slice queries in OLAP, which uses an on-demand fetching strategy. Within this approach, the level of detail, or granularity, is a compromise between materializing many small, highly specific queries and materializing a few large queries from which incoming queries are then answered at each stage. This approach, however, does not take user access patterns into account before making selections.
The first work to consider user access information when evaluating potential queries to be materialized is presented in [26], where the author introduced PROMISE, an approach that predicts the structure and value of the next query based on the current query. Yao et al. [99] proposed a different approach for the materialization of dynamic views: a set of batch queries is rewritten using certain canonical queries so that the total cost of execution can be reduced by using intermediate results to answer queries appearing later in the batch. This approach requires all queries to be precisely known beforehand, and though it might work well in a particular database scenario, it is of limited use in dynamic OLAP, where it is extremely difficult to accurately predict the exact nature of future queries.
View Maintenance

In most cases it is wasteful to maintain a view by recomputing it from scratch. Materialized views are therefore maintained using an incremental approach [11]: only the changes to be propagated to the materialized view are computed, using the changes of the source relations [1, 33, 89]. To date, view maintenance has been investigated along four dimensions [11]:

• Information Dimension: Focuses on accessing the information required for view maintenance, such as base relations and the materialized view.

• Modification Dimension: Focuses on the kinds of modifications, e.g., insertions and deletions, that a view maintenance algorithm can handle.

• Language Dimension: Addresses the problems related to the language of the views supported by the view maintenance algorithm. That is, what is the language of the views that can be maintained by the algorithm? How are views expressed? Does the algorithm allow duplicates?

• Instance Dimension: Considers the applicability of the algorithm to all or a specific set of instances of the database.
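The incremental idea can be sketched as follows (relation layout and names invented for illustration): rather than recomputing a SUM view from its base relation, only the delta of insertions and deletions is propagated to the view.

```python
# Base relation: (product, month) -> sales amount.
base = {("beer", "jan"): 100.0, ("wine", "jan"): 40.0}

# Materialized SUM view: total sales per product.
view = {}
for (product, _), amount in base.items():
    view[product] = view.get(product, 0.0) + amount

def apply_delta(view, inserts, deletes):
    """Propagate only the changes, not the whole base relation."""
    for (product, _), amount in inserts.items():
        view[product] = view.get(product, 0.0) + amount
    for (product, _), amount in deletes.items():
        view[product] = view.get(product, 0.0) - amount
    return view

# One insertion and one deletion arrive at the source relation.
view = apply_delta(view,
                   inserts={("beer", "feb"): 25.0},
                   deletes={("wine", "jan"): 40.0})
```

Note that SUM is self-maintainable under insertions and deletions given only the delta; holistic functions such as MEDIAN would require access to the base data.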
View maintenance cost is the sum of the costs of propagating each base relation change to the affected materialized views. The sum can be weighted, where each weight indicates the frequency of propagation of the changes of the associated source relation. When a base relation affects more than one materialized view, multiple maintenance expressions must be evaluated. Multi-query optimization techniques can be used to detect common sub-expressions among the maintenance expressions so that an efficient global evaluation plan for them can be achieved [61, 62].
Numerous methods have been developed for materialized view maintenance in conventional database systems. Zhuge et al. [101] introduced the Eager Compensating Algorithm (ECA), based on previous incremental view maintenance algorithms and on compensating queries used to eliminate anomalies. In [102], the authors define the task of keeping multiple views consistent with each other as the multiple-view consistency problem. Further research by the same authors [102, 103] considers data warehouse views defined on base tables located in different data sources, i.e., if a view involves n base tables, then n data sources are also involved.

A common characteristic of the early approaches to view maintenance is the considerable need for accessing base relations, which in most cases results in performance degradation. Improving the efficiency of view maintenance techniques has been a topic of active research in the database community [15, 65, 85, 98].
Spatial OLAP (SOLAP)

The multidimensional approach used by data warehouses and OLAP does not support array data types or spatial data types such as points, lines, or polygons. Following the development trends of data warehouse and data mining techniques, Stefanovic et al. [52] proposed the construction of a spatial data warehouse to enable on-line data analysis in spatial-information repositories. The authors used a star/snowflake model to build a spatial data cube consisting of both spatial and non-spatial dimensions and measures: the data cube shown in Fig. 2.13 consists of one spatial dimension (region) and three non-spatial dimensions (precipitation, temperature, and time).

Figure 2.13. Star Model of a Spatial Warehouse
Current research in spatial data management focuses on querying spatial data, particularly regarding the improvement of aggregate query performance [57] for spatial-vector data structures. Alas, little attention has been given to spatial-raster data [42, 73, 86]. Support for spatial-raster data typically consists of creating a spatial-raster cube from information in the metadata file (such as size, level, width, height, date of creation, format, and location) [28, 94].
Vega et al. [40] presented a model to analyze and compare existing techniques for the evaluation of aggregate queries on spatial, temporal, and spatio-temporal data. The study shows that existing aggregate computation techniques rely on some form of pre-aggregation and that support is restricted to distributive aggregate functions such as COUNT, SUM, and MAX. Additionally, the authors identify several important needs concerning aggregate computation. First, they discuss the need to develop further and more substantial techniques to support holistic aggregate functions, e.g., MEDIAN and RANK, and to better support selective predicates. The second observation pertains to the lack of support for queries that need to be efficiently evaluated at every granule in time. Existing aggregate computation techniques focus only on spatial objects such as lines, points, and polygons, but do not consider aggregate computation on data grid (array) structures.
2.3 Discussion

Query performance is a major concern underlying the design of databases in both business and remote-sensing imaging applications. While there are some valuable research results in the realm of pre-aggregation techniques to support query processing in business and statistical applications, little has been done in the field of array databases.
The question therefore arises: what distinguishes array data from traditional data types such that it cannot be fully supported by relational databases and thus take advantage of advanced technologies such as OLAP? OLAP from its very conception was designed to assist in the decision-making process of business applications, where business perspectives, such as products and/or stores, represented the dimensions of the data cube. And while the different columns in a data cube are usually called dimensions, they generally cannot be considered a special extent of the entities modeled by the database. Instead, they are regarded as explicit attributes that characterize a particular entity. Some dimensions in a data cube (e.g., CustomerId) are defined over discrete domains which do not have a natural ordering among their values (customer 1000 cannot be considered close to customer 1001). In such cases, any ordering defined for the values in one of these columns is arbitrary [40]. For this reason, existing OLAP solutions and related pre-aggregation techniques cannot be applied to multidimensional arrays, at least not in a straightforward manner.
Recently, however, a new trend in OLAP has gained considerable popularity due to its capability to support geo-spatial data. Spatial OLAP considers the case in which a data cube may have both spatial and non-spatial dimensions. However, spatial OLAP focuses mainly on spatial-vector data, and so far little support has been provided for spatial-raster data in terms of selective materialization for the optimization of aggregates. Support is limited to those operations that can be constructed from the metadata available for the raster, and does not extend to improving the computation of aggregate operations over the values of raster datasets.
At present, pre-aggregation support in array databases is limited. Only one comparatively simple pre-aggregation technique has been used, namely image pyramids. The limitation of this technique to two-dimensional datasets and hard-wired interpolation calls for the development of more flexible and efficient techniques.
From our study of data modeling, storage techniques, and operations in OLAP and remote-sensing imaging applications, we have observed the following similarities:

• Array databases and OLAP systems typically employ multidimensional data models to organize their data.

• Both application domains handle large volumes of multidimensional data.

• Operations convey a high degree of similarity; for instance, a roll-up (aggregate) operation in OLAP, such as computing the weekly sales per product, is very similar to scaling a satellite image by a factor of seven along the x axis. Figure 2.14 illustrates this similarity.
Figure 2.14. Comparison of Roll-Up and Scaling Operations: (a) scaling operation; (b) roll-up operation
• Both application domains use pre-aggregation approaches to speed up query processing. OLAP pre-aggregation techniques support a wide range of aggregate operations and speed up query processing by several orders of magnitude (the last benchmark reported factors of up to 100 times [29, 88]). Scaling of 2D datasets always uses the same scale factor on each dimension to maintain a coherent view, whereas for datasets of higher dimensionality the scale factors are independent. Scaling resembles a primitive form of pre-aggregation in comparison to existing OLAP pre-aggregation techniques.

• While data in OLAP applications is sparsely populated, remote-sensing imagery usually is densely populated (100%). There are no guidelines stating when an OLAP data cube is considered sparse or dense; however, a data cube containing 30 percent empty cells is usually treated with sparsity-handling techniques in most OLAP systems.
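The roll-up/scaling analogy of Figure 2.14 can be made concrete with a small sketch (array shape and factor invented for illustration): grouping seven daily cells into one weekly cell is structurally the same operation as scaling an image by a factor of seven along one axis; only the aggregation applied differs.

```python
import numpy as np

# A toy 2-D array: 14 "days" x 4 "products" (or 14 x 4 pixels).
data = np.arange(14 * 4, dtype=float).reshape(14, 4)

# OLAP roll-up: aggregate 7 daily cells into each weekly cell (SUM).
weekly = data.reshape(2, 7, 4).sum(axis=1)    # shape (2, 4)

# Image scaling by factor 7 along the x axis (mean resampling).
scaled = data.reshape(2, 7, 4).mean(axis=1)   # shape (2, 4)
```

Both results group the same cells; the roll-up differs from the mean-based scaling only by the constant factor of the group size.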
Furthermore, when compared to well-known OLAP pre-aggregation techniques, GIS image pyramids differ in several respects:

• Image pyramids are constrained to 2D imagery. To the best of our knowledge, there is no generalization of pyramids to n-D.

• The x and y axes are always zoomed by the same scalar factor s in the 2D zoom vector (s, s). Image pyramids exploit this by offering pre-aggregates only along a scalar range; in this respect, image pyramids actually are 1D pre-aggregates.

• Several interpolation methods are used for resampling during scaling. Some techniques are standardized [48]; they include nearest-neighbor, bi-linear, bi-quadratic, bi-cubic, and barycentric. The two scaling steps incurred for image pyramids (construction of the pyramid level and the rest scaling) must be done using the same interpolation technique to achieve valid results. In OLAP, summation during roll-up corresponds to linear interpolation in imaging.

• Scale factors are continuous, as opposed to the discrete hierarchy levels in OLAP. It is, therefore, impossible to materialize all possible pre-aggregates.
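For concreteness, a pyramid over a toy image can be sketched as follows (sizes and the mean-resampling choice are assumptions for illustration): each level scales both axes by the same factor 2, so the whole pyramid is indexed by a single scalar scale, and an arbitrary requested factor is served from the nearest coarser level plus a residual "rest scaling".

```python
import numpy as np

def build_pyramid(img, levels):
    """Image pyramid via repeated 2x2 block-mean downscaling; both axes
    always share the same scale factor (s, s)."""
    out = [img]
    for _ in range(levels):
        h, w = out[-1].shape
        out.append(out[-1].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return out

img = np.arange(8 * 8, dtype=float).reshape(8, 8)
pyramid = build_pyramid(img, 3)   # levels: 8x8, 4x4, 2x2, 1x1
```

Because every level uses the same (mean) interpolation, a rest scaling applied on top of a pyramid level remains consistent with scaling the base image directly.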
Based on these observations, this thesis aims to systematically carry over results from OLAP to array databases and to provide pre-aggregation support not only for queries using basic aggregate functions, but also for more complex operations such as scaling. As a preliminary and fundamental step, it is necessary to have a clear understanding of the various operations performed on remote-sensing imagery and to identify those that involve aggregation computation. The next chapter addresses this issue in more detail.
Chapter 3

Fundamental Geo-Raster Operations in GIS and Remote-sensing Applications

This chapter describes a set of fundamental operations in GIS and remote-sensing imaging applications. For rigorous comparison and classification, these operations are discussed by means of a sound mathematical framework. The aim is to identify those operations requiring data summarization that may benefit from a pre-aggregation approach. To that end, we use Array Algebra as our modeling framework.
3.1 Array Algebra

The rationale behind the selection of Array Algebra as the modeling framework is grounded in the following observations:

• It is oriented towards multidimensional data in a variety of applications, including imaging.

• It provides the means to formulate a wide variety of operations on multidimensional arrays.

• There are commercial and open-source implementations of Array Algebra that show the soundness and maturity of the framework.

The expressive power of Array Algebra, the simplicity of its operators, and its successful implementation in both commercial and scientific applications make it suitable for our investigation.
Essentially, the algebra consists of three operators: an array constructor, a generalized aggregation, and a multidimensional sorter [75, 76]. Array Algebra is minimal in the sense that no subset of its operations exhibits the same expressive power. It is safe in evaluation: every formula can be evaluated in a finite number of steps. It is closed in its application: any resulting expression is either a scalar or an array.
Arrays are represented as functions mapping n-dimensional points from discrete Euclidean space to values. The spatial domain of an array is defined as a finite set of n-dimensional points in Euclidean space forming a hypercube with boundaries parallel to the coordinate system axes.

Let X ⊆ Z^d be a spatial domain and F a value set, i.e., a homogeneous algebra. Then, an F-valued d-dimensional array over the spatial domain X (a multidimensional array) is defined as:

a : X → F (i.e., a ∈ F^X),
a = {(x, a(x)) : x ∈ X, a(x) ∈ F}

The array elements a(x) are referred to as cells. The auxiliary function sdom(a) denotes the spatial domain of an array a.
3.1.1 Construc<strong>to</strong>r<br />
The MARRAY array construc<strong>to</strong>r allows arrays <strong>to</strong> be defined by indicating a spatial<br />
domain and an expression evaluated for each cell position of the array. An iteration<br />
variable bound <strong>to</strong> a spatial domain is available in the cell expression so that the cell<br />
value depends on its position. Let X be a spatial domain, F a value set, and v a free<br />
identifier. Let e v be an expression with result type F containing zero or more free occurrences<br />
of v as placeholder(s) for an expression with result type X. Then, an array<br />
over spatial domain X with base type F is constructed through:<br />
MARRAY X,v (e v ) = {(x, a(x)) : a(x) = e x , x ∈ X}<br />
A straightforward application of MARRAY is spatio-temporal sub-setting by simply<br />
changing its domain.<br />
Example: For some 2-D grey-scale image a, its cutout to domain [x0:x1, y0:y1] (assumed to lie inside the array) is given by:

MARRAY [x0:x1,y0:y1],p (a[p])
Similarly, trimming produces a cutout of an array with smaller volume but unchanged dimensionality, and a section cuts out a hyperplane with reduced dimensionality.
We can also change an array's values by changing the e_v expression. In the simplest case this expression takes the cell value and modifies it. The following expression adds the cell values of two raster images a and b defined over a common spatial domain X:

a + b = MARRAY X,p (a[p] + b[p])
If we allow the use of all operations known on the base algebra, i.e., on the pixel<br />
type, we immediately obtain a cohort of the following useful operations.
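The MARRAY semantics above can be mimicked with a small, illustrative Python/NumPy sketch: the constructor is simply an evaluation of the cell expression at every point of the domain. The helper name `marray` is ours and is neither Array Algebra nor rasql syntax.

```python
import numpy as np

def marray(domain_shape, cell_expr):
    """Evaluate cell_expr at every index of the domain (MARRAY sketch)."""
    out = np.empty(domain_shape)
    for idx in np.ndindex(*domain_shape):
        out[idx] = cell_expr(idx)
    return out

a = np.arange(12, dtype=float).reshape(3, 4)
b = np.ones((3, 4))

# a + b expressed as a MARRAY over the common domain
summed = marray(a.shape, lambda p: a[p] + b[p])

# cutout to domain [1:2, 0:2] (inclusive bounds, as in the text)
cutout = marray((2, 3), lambda p: a[p[0] + 1, p[1]])
```

In practice an array DBMS evaluates such expressions lazily and tile by tile; the explicit loop only illustrates the cell-wise definition.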
3.1.2 Condenser<br />
The COND array condenser (aggregator) takes the values of an array's cells and combines them through some commutative and associative operation o, thereby obtaining a scalar value. For some free identifier v, a spatial domain X = {x_1, ..., x_n} consisting of n points x_i ∈ Z^d, and e_{a,v} an expression of result type F containing occurrences of an array a and of identifier v, the condense of a by o is defined as:

COND o,X,v (e_{a,v}) := e_{a,x_1} o ... o e_{a,x_n}
Example: Let a be the image defined above, with sdom(a) = [1:m, 1:n]. The average over all pixel intensities in a is then given by:

COND +,sdom(a),p (a[p]) / (m ∗ n) = ( Σ_{x∈[1:m,1:n]} a[x] ) / (m ∗ n)
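The condenser can likewise be sketched as a left fold of a commutative, associative operation over all domain points. The helper `cond` below is a hypothetical illustration in Python/NumPy, not rasql syntax.

```python
import numpy as np
from functools import reduce

def cond(op, domain_shape, cell_expr):
    """Fold cell_expr over all points of the domain (COND sketch)."""
    values = [cell_expr(idx) for idx in np.ndindex(*domain_shape)]
    return reduce(op, values)

a = np.arange(1.0, 13.0).reshape(3, 4)   # a small 3x4 "image"

# sum of all cells via COND with +, then the average
total = cond(lambda x, y: x + y, a.shape, lambda p: a[p])
average = total / a.size
```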
3.1.3 Sorter
The SORT array sorter proceeds along a selected dimension to reorder the corresponding hyperslices. The functional sort_s rearranges a given array along a specified dimension s without changing its value set or spatial domain. To that end, an order-generating function f_{s,a} is provided that associates a sequence position with each (d−1)-dimensional hyperslice. Note that f_{s,a} has all degrees of freedom to assess any of a's cell values for determining the measure value of the hyperslice at hand: it can be a particular cell value in the current hyperslice, the average of all hyperslice values, or the value of one or more neighboring slices. Note also that the sort operator subsumes the relational group by.
The language is recursive in the array expression e v and hence allows arbitrary<br />
nesting of expressions. In the sequel we use the abbreviations introduced above for<br />
nested expressions.<br />
3.2 Geo-Raster Operations<br />
This section presents a set of fundamental operations for Geo-raster data. These<br />
operations have been selected based on an exhaustive literature review of classification<br />
schemes, international standards, and best practices [2, 19, 27, 32, 35, 45, 46, 47, 49].<br />
By examining the Array Algebra operators involved in the computation of the operations, we identify those that require data summarization (aggregation) and therefore may benefit from pre-aggregation.
Queries were executed in a raster database management system (RasDaMan) and formulated in rasql, an SQL-based query language for multidimensional raster databases based on Array Algebra.
3.2.1 Mathematical Operations<br />
The following groups of mathematical operators are distinguished: arithmetic, trigonometric, Boolean, and relational. They operate at the cell level and can be applied
to a single raster or to multiple rasters of numerical type and identical spatial domain. The basic arithmetic operators include addition (+), subtraction (−), multiplication (∗), and division (/). Trigonometric functions perform trigonometric calculations on the values of an input raster: sine (sin), cosine (cos), tangent (tan), or their inverses (arcsin, arccos, arctan). Consider, for example, the following query:
Query 3.2.1. Consider an RGB (red, green, blue) raster image A. Extract the green component from the image, and reduce the contrast by a factor of 2.
With Array Algebra, the query can be computed as follows:

MARRAY sdom(A),i (A.green[i]/2)

Results are shown in Fig. 3.1.
Figure 3.1. Reduction of Contrast in the Green Channel of an RGB Image ((a) original RGB image; (b) green component; (c) output raster)
All or part of a raster image can be manipulated using the rules of Boolean algebra integrated into database query languages such as SQL [2]. Boolean algebra uses logical operators such as and, or, not, and xor to determine whether a particular condition is true or false. These operators are often combined with relational operators: equal (=), not equal (≠), less than (<), less than or equal to (≤), greater than (>), and greater than or equal to (≥). Consider, for example, the following queries:
Query 3.2.2. Given a near-infrared green (NRG) raster image A, highlight the cells<br />
with sufficient near-infrared values.<br />
This query can be answered by imposing a lower bound on the near-infrared intensity and upper bounds on the green and blue intensities. The resulting boolean array is
multiplied by the original image A to show the original cell value where an infrared value prevails, and black otherwise.

MARRAY sdom(A),i (A[i] ∗ ((A[i].nir ≥ 130) and (A[i].green ≤ 110) and (A[i].blue ≤ 140)))

Results are shown in Fig. 3.2.
Figure 3.2. Highlighted Infrared Areas of an NRG Image ((a) original NRG raster; (b) output raster)
Query 3.2.3. Compare the cell values of two 8-bit gray raster images A and B. Create<br />
a new raster where each cell value takes the value of 255 (white pixel) when the cell<br />
values of A and B are identical.<br />
The algebraic formulation is as follows:

MARRAY sdom(A),i ((A[i] = B[i]) ∗ 255)

Results are shown in Fig. 3.3.
Figure 3.3. Cells of Rasters A and B with Equal Values ((a) grey 8-bit raster A; (b) grey 8-bit raster B; (c) output raster image)
Reclassification<br />
Reclassification is a generalization technique used to re-assign cell values in classified rasters. For example, consider the query below, where reclassification is based on a land suitability study.
Query 3.2.4. Given an 8-bit gray image A, map each cell value to its corresponding suitability class shown in Table 3.2¹, and decrease the contrast of the image according to the decrease factor.
The query can be answered as follows:

MARRAY sdom(A),g (((A[g] > 180) ∗ A[g]/2) +
    (((A[g] ≥ 130) and (A[g] < 180)) ∗ A[g]/3) +
    (((A[g] ≥ 80) and (A[g] < 130)) ∗ A[g]/4) +
    ((A[g] < 80) ∗ A[g]/5))

Results are shown in Fig. 3.4.
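The same mask-and-divide pattern can be checked with a small NumPy sketch (sample values are ours; integer division is assumed, matching the 8-bit setting):

```python
import numpy as np

A = np.array([[200, 150], [100, 50]], dtype=np.int32)

# one divisor per capability class, following Table 3.2;
# each boolean mask is 0/1, so exactly one term survives per cell
out = ((A > 180) * (A // 2)
       + ((A >= 130) & (A < 180)) * (A // 3)
       + ((A >= 80) & (A < 130)) * (A // 4)
       + (A < 80) * (A // 5))
```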
1 Classification taken from http://www.fao.org/docrep/X5310E/X5310E00.htm
Table 3.1. UNO and FAO Suitability Classifications

Classification  Description
S1              Highly suitable
S2              Moderately suitable
S3              Marginally suitable
NS              Not suitable
Table 3.2. Capability Indexes for Different Capability Classes

Capability index  Class  Suitability class  Decrease factor
> 180             I      S1                 2
130-180           II     S2                 3
80-130            III    S3                 4
< 80              IV     NS                 5
Figure 3.4. Re-Classification of the Cell Values of a Raster Image ((a) original raster; (b) output raster)
Proximity<br />
The proximity operation creates a new raster where each cell value contains the distance to a specified reference point. As an example consider the following query:
Query 3.2.5. Estimate the proximity of each cell of the raster image shown in Fig. 3.4(a)<br />
<strong>to</strong> the reference cell located in [30,5].<br />
The computation of this query can be formulated as:

MARRAY sdom(A),(g,h) (|g − 30| + |h − 5|)

Results are shown in Fig. 3.5.
Figure 3.5. Computation of a Proximity Operation<br />
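A NumPy sketch of the proximity formula, using Manhattan distance to the reference cell [30, 5] (the raster extent below is an assumption for illustration):

```python
import numpy as np

rows, cols = 40, 10
g, h = np.ogrid[0:rows, 0:cols]   # open grids broadcast to the full domain

# Manhattan distance of every cell to the reference cell [30, 5]
prox = np.abs(g - 30) + np.abs(h - 5)
```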
Overlay<br />
The overlay operation refers to the process of stacking two or more identical georeferenced rasters on top of each other so that each position in the covered area can be analyzed in terms of these data. The overlay operation can be solved using arithmetic and relational operators. For example, consider the following query:
Query 3.2.6. Given two 8-bit gray raster images A and B with identical spatial domain,<br />
perform an overlay operation. That is, make a cell-wise comparison between<br />
the two rasters. Each cell value of the new array must take the maximum cell value<br />
between A and B.<br />
The computation of this query can be formulated as:<br />
MARRAY sdom(A),g (((A[g] > B[g]) ∗ A[g]) + ((A[g] ≤ B[g]) ∗ B[g]))<br />
The above formulation works as follows. The left-hand operand of the addition tests whether the cell value of array A is greater than the cell value of B. The result of this test is either 0 (condition not satisfied) or 1 (condition satisfied), which in
turn is multiplied by the cell value of array A. Thus, the left-hand operand evaluates to either 0 or the cell value of A. Similarly, the right-hand operand verifies whether the cell value of A is less than or equal to the cell value of B; the resulting 0 or 1 is multiplied by the cell value of B. Note that exactly one of the two operands is non-zero for each cell, and that its value corresponds to the larger of the two cell values of A and B. Results are shown in Fig. 3.6.
Figure 3.6. Computation of an Overlay Operation ((a) 8-bit gray raster A; (b) 8-bit gray raster B; (c) output raster)
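The boolean-mask trick used in the overlay formula can be verified with a short NumPy sketch (sample values are ours):

```python
import numpy as np

A = np.array([[10, 200], [90, 40]], dtype=np.int32)
B = np.array([[50, 120], [90, 60]], dtype=np.int32)

# (A > B) and (A <= B) are 0/1 masks; exactly one term survives per cell,
# so the sum equals the cell-wise maximum of A and B
overlay = (A > B) * A + (A <= B) * B
```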
An overlay operation can also apply a different condition when determining the cell values of the output array. For example:
Query 3.2.7. Compute an overlay operation between rasters A and B. That is, compare<br />
cell-wise the two rasters: if the cell value of B is non-zero, then set this value<br />
as the cell value of the corresponding cell in array A. Otherwise, the cell value of A<br />
remains unchanged.<br />
The query can be answered as follows:<br />
MARRAY sdom(A),g (((B[g] > 0) ∗ B[g]) + ((B[g] ≤ 0) ∗ A[g]))<br />
Results are shown in Fig. 3.7.<br />
3.2.2 Aggregation Operations

We now present the modeling of operations that consist of one or more aggregate functions. An aggregate function takes a collection of cells and returns a single value that summarizes the information contained in the set of cells. The SQL standard provides a variety of aggregate functions: SQL-92 includes count, sum, average, min,
Figure 3.7. Computation of an Overlay Operation Considering Values Greater than Zero ((a) grey 8-bit raster A; (b) grey 8-bit raster B; (c) output raster)
and max. SQL:1999 adds every, some, and any. OLAP functions were first published as an addendum to the ISO SQL:1999 standard; they have since been fully incorporated into the SQL:2003 and SQL:2008 ISO standards. OLAP functions include rank, ntile, cume_dist, percent_rank, row_number, percentile_cont, and percentile_disc.
Add<br />
The add operation sums up the cell values of a raster and returns the total as a scalar value. It can also be applied to two or more rasters with identical spatial domains, returning a new raster with the same spatial domain; in this case, the cells of the new raster contain the sum of the inputs computed on a cell-by-cell basis. As an example of the add operation on a single raster, consider the following query:
Query 3.2.8. Return the sum of all cell values of the raster shown in Fig. 3.8(a).

add cells(A) = COND +,sdom(A),i (A[i])

Results are shown in Fig. 3.8.
Figure 3.8. Calculation of the Total Sum of Cell Values in a Raster ((a) original NRG raster; (b) output result)
Count<br />
The count operation returns the number of cells that fulfill a boolean condition applied to a raster. For example, consider the following query:

Query 3.2.9. Return the number of cells of raster A of boolean type containing the value true in the green channel.

count cells(A) = COND +,sdom(A),i (A[i].green = 1)

Average
The average operation returns a scalar value representing the mean of all values contained<br />
in a raster. As an example consider the following query:<br />
Query 3.2.10. Return the average of the cell values in each channel of the NRG image<br />
shown in Fig. 3.9(a).<br />
Let add cells(A) be the sum of all cell values, as computed in Section 3.2.2, and card(sdom(A)) a function returning the cardinality of the spatial domain of A. Then, the average of A is calculated as follows:

avg cells(A) = add cells(A) / card(sdom(A))

Results are shown in Fig. 3.9.
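The add, count, and average aggregates can be sketched together in NumPy; the names `add_cells`, `count_cells`, and `avg_cells` mirror the text, but the code is an illustration, not rasql:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 6.0]])

add_cells = A.sum()              # COND with +
count_cells = (A > 2).sum()      # count of cells satisfying a condition
avg_cells = add_cells / A.size   # sum divided by card(sdom(A))
```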
Maximum<br />
A maximum operation returns the largest cell value contained in a raster of numerical<br />
type. As an example, consider the following query:
Figure 3.9. Result of an Average Aggregate Operation ((a) original NRG raster; (b) output result)
Query 3.2.11. Return the maximum of all cell values contained in the NRG raster image shown in Fig. 3.10(a).

max cells(A) = COND max,sdom(A),i (A[i])

Results are shown in Fig. 3.10.
Figure 3.10. Result of a Maximum Aggregate Operation ((a) original NRG raster; (b) output result)
Minimum<br />
A minimum operation returns the smallest cell value contained in a raster of numerical<br />
type. As an example, consider the following query:
Query 3.2.12. Return the smallest of all cell values in the NRG raster image shown in Fig. 3.11(a).

min cells(A) = COND min,sdom(A),i (A[i])

Results are shown in Fig. 3.11.
Figure 3.11. Result of a Minimum Aggregate Operation ((a) original NRG raster; (b) output result)
Histogram

A histogram provides information about the number of times a value occurs across a range of possible values. For an 8-bit raster, up to 256 different values are possible. As an example consider the following query:
Query 3.2.13. Calculate the histogram for a 2D raster A with 8-bit integer pixel resolution.

The query can be computed as follows:

MARRAY sdom(A),g (count cells(A = g[0])) (3.1)

Results are shown in Fig. 3.12.

Diversity
The diversity operation returns the different classifications in a raster. For example,<br />
consider the following query:<br />
Query 3.2.14. Given the classifications in an 8-bit gray raster image, return true (1) for those classes whose total number of cells is greater than 0.
Figure 3.12. Computation of the His<strong>to</strong>gram for a Raster Image<br />
For the computation of this operation we make use of the histogram calculated in Query 3.2.13. Let B be a 1-D array containing the histogram values:

B = MARRAY sdom(A),g (COND +,sdom(A),i (A[i] = g))
then, C is the array containing true values for the elements of the histogram that are greater than 0:

C = MARRAY sdom(B),i (B[i] > 0)
Results are shown in Fig. 3.13.<br />
Figure 3.13. Computation of the Diversity for a Raster Image<br />
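The histogram-then-threshold construction of B and C can be sketched in NumPy (the sample raster is ours):

```python
import numpy as np

# a tiny 8-bit classified raster (values 0..255)
A = np.array([[3, 3, 7], [7, 7, 250]], dtype=np.uint8)

# histogram B: for each value g in [0:255], count the cells of A equal to g
B = np.array([(A == g).sum() for g in range(256)])

# diversity C: true where a class occurs at least once
C = B > 0
```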
Majority/Minority<br />
In a classified raster, the majority operation finds the class value with the largest number of elements in the raster. Similarly, the minority operation finds the cell value with the fewest number of elements. As an example, consider the following query:
Query 3.2.15. Return the cell representing the majority of all cell values contained in<br />
2D 8-bit gray raster image A shown in Fig. 3.14(a).<br />
To solve this query we use the histogram computed in Query 3.2.13, and then select the cell value representing the majority of the different classes. Let h be a 1-D array
containing the histogram values, and h1 a 1-D array of spatial domain [0:255] containing the values from 0 to 255. Let h2 be an array containing the cell-wise sum of h and h1:

h2 = MARRAY [0:255],g (h[g] + h1[g])

then, the majority can be computed as follows:

COND +,sdom(h),i ((max cells(h) = (h2[i] − h1[i])) ∗ h1[i])

Results are shown in Fig. 3.14.
Figure 3.14. Computation of a Majority Operation for a Raster Image ((a) classified raster; (b) majority class)
3.2.3 Statistical Aggregate Operations<br />
We now consider operations that consist of or include one or more statistical aggregate functions. The basic statistical aggregate functions include standard deviation, square root, power, mode, median, variance, and top-k. These functions can be applied to a raster, or to a set of rasters retrieved by a logical search. Consider the following examples:
Variance<br />
Let n be the cardinality of the spatial domain of A, n = card(sdom(A)), and avg the average of all cell values of A, avg = avg cells(A); then the variance v of A can be computed as follows:

v(A) = (1/n) ∗ COND +,sdom(A),i ((A[i] − avg) ∗ (A[i] − avg))

Results are shown in Fig. 3.15.
Figure 3.15. Computation of the Variance for a Raster Image<br />
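The variance formula can be checked against NumPy's built-in population variance on a small sample raster (values are ours, for illustration only):

```python
import numpy as np

A = np.array([[2.0, 4.0], [4.0, 6.0]])

n = A.size                                # card(sdom(A))
avg = A.sum() / n                         # avg_cells(A)
# (1/n) * COND with + over the squared deviations
v = (1.0 / n) * ((A - avg) * (A - avg)).sum()
```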
Standard Deviation<br />
Query 3.2.16. Estimate the standard deviation of the cell values of the NRG raster<br />
image shown in Fig. 3.8(a).<br />
Let n be the cardinality of the spatial domain of A, n = card(sdom(A)), and avg the average of the cell values of A, avg = avg cells(A); then the standard deviation s of A can be computed as follows:

s(A) = sqrt((1/n) ∗ COND +,sdom(A),i ((A[i] − avg) ∗ (A[i] − avg)))

Results are shown in Fig. 3.16.
Figure 3.16. Computation of the Standard Deviation for a Raster Image<br />
Median<br />
The median can be calculated by sorting the cell values of raster A in ascending order<br />
and choosing the middle value. In case the number of cells is even, the median
is the average of the two middle values. To solve this operation, we use the sort operator to perform the ascending sort of array A. However, for an array of dimensionality higher than 1 it is first necessary to flatten the array into a one-dimensional array. For example, the conversion of a two-dimensional raster A[0:m,0:n] into a one-dimensional raster B[0:m∗n] can be expressed as follows. Let d be the cardinality of A, d = card(sdom(A)); let r be the number of rows; and let c be the number of columns. Then, the flattening of A can be calculated as:

B = MARRAY [0:m∗n],g (COND +,[0:m,0:n],i (((g > (m ∗ (i − 1))) and (g ≤ i)) ∗ A[1 : (g − (m ∗ (i − 1))), 1 : i]))
Let S be the raster containing the sorted values of B (the flattening of A), S = SORT 0,asc,f (B), and let n be the cardinality of S, n = card(sdom(S)). Assuming integer division and array indexing starting at zero, the median of array A is obtained as follows: if n is odd, then the median is equal to S[n/2]; otherwise, median = (S[(n − 1)/2] + S[(n + 1)/2]) / 2. Consider the following query:
Query 3.2.17. Obtain the median of the 1-D array A whose cell values are shown in<br />
Fig. 3.17(a).<br />
Since the array has an odd number of elements, the computation of the query is as follows:

A[card(sdom(A))/2]
Results are shown in Fig. 3.17(b).<br />
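The flatten-sort-pick-middle procedure can be sketched in NumPy, covering both the odd and even cases (the helper name `median_cells` and the sample arrays are ours):

```python
import numpy as np

def median_cells(A):
    """Median of all cell values: flatten, sort ascending, pick the middle."""
    S = np.sort(A.ravel())        # flattening + ascending sort
    n = S.size
    if n % 2 == 1:
        return S[n // 2]
    # even case: average of the two middle values (integer division on indices)
    return (S[(n - 1) // 2] + S[(n + 1) // 2]) / 2

odd = np.array([5, 1, 9, 3, 7])
even = np.array([[4, 1], [3, 2]])
```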
Top-k<br />
The top-k function returns the k cells with the highest values within a raster. For example, consider the following query:
Query 3.2.18. Find the five highest values contained in raster A.<br />
To solve this query we first sort A in descending order and then select the top five values. Let d = 0 indicate sorting along dimension 0, and let f be the sorting function f_{d,A}(p) = A[p]. Then S is a sorted array of raster A (see Fig. 3.18):

S = SORT 0,desc,f (A)

thus, the top five cell values are obtained by:

S[0 : 4]
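A NumPy sketch of the top-k selection, sorting in descending order so that the first k entries are the k highest (sample values are ours):

```python
import numpy as np

A = np.array([9, 2, 14, 7, 11, 3, 8, 1])

S = np.sort(A)[::-1]   # ascending sort, then reverse -> descending order
top5 = S[0:5]          # the five highest values
```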
Figure 3.17. Computation of the Median for a Raster Image ((a) 1-D array; (b) median)
Figure 3.18. Computation of a Top-k Operation for a Raster Image (top five values)
3.2.4 Affine Transformations<br />
Geometric transformations permit the elimination of geometric distortions that occur when images are captured. An example is the attempt to match remotely sensed images of the same area taken one year apart, when the more recent image was probably not taken from precisely the same position. Another example is Landsat Level 1B data that are already transformed onto a plane but may not be rectified to the user's desired map projection [46]. Applying an affine transformation to a uniformly distorted raster image can correct for a range of perspective distortions by transforming the measurements from the ideal coordinates to those actually used. An affine transformation is an important class of linear 2-D geometric transformations that maps variables, e.g. cell intensity values located at position (x1, y1) in an input raster image, into new variables (x2, y2) in an output raster image by applying a linear combination of translation, rotation, scaling, and shearing operations. The computation of these operations often requires interpolation techniques.
In the remainder of this section we discuss special cases of affine transformations.<br />
Translation<br />
Translation performs a geometric transformation that maps the position of each cell in an input raster image into a new position in an output raster image. Under translation, a cell located at (x1, y1) in the original raster is shifted to a new position (x2, y2) in the corresponding output raster image by displacing it through a user-specified translation vector (h, k). The cell values remain unchanged, and the spatial domain of the output raster image has the same extent as that of the original input raster. Consider, for example, the following query:

Query 3.2.19. Shift the spatial domain of a raster defined as A[x1 : x2, y1 : y2] by the point [h:k].

The query can be solved by invoking the shift function of Array Algebra:

shift(A[x1 : x2, y1 : y2], [h : k])

Results are shown in Fig. 3.19.
Rotation<br />
Rotation performs a geometric transformation that maps position (x1, y1) of a cell in an input raster image onto a position (x2, y2) in an output raster image by rotating it clockwise or counterclockwise through a user-specified angle θ about an origin O. The rotation operation performs a transformation of the form:

x2 = cos(θ) ∗ (x1 − x0) − sin(θ) ∗ (y1 − y0) + x0
y2 = sin(θ) ∗ (x1 − x0) + cos(θ) ∗ (y1 − y0) + y0
Figure 3.19. Computation of a Translation Operation for a Raster Image ((a) original domain; (b) translated domain)
where (x0, y0) are the coordinates of the center of rotation in the input raster image, and θ is the angle of rotation. Existing algorithms for the computation of rotation, unlike those employed for translation, can produce coordinates (x2, y2) that are not integers. A common solution to this problem is the application of interpolation techniques such as nearest-neighbor, bilinear, or cubic interpolation. For large raster datasets this is a computationally intensive problem because every output cell must be computed separately using data from its neighbors. Consequently, the rotation operation is not yet properly supported by Array Algebra.
Scaling<br />
Scaling stretches or compresses the coordinates of a raster (or part of it) according to a scaling factor. This operation can be used to change the visual appearance of an image, to alter the quantity of information stored in a scene representation, or as a low-level preprocessor in a multi-stage image processing chain that operates on features of a particular scale. For the estimation of the cell values in a scaled output raster image, two common approaches exist:
• one pixel value within a local neighborhood is chosen (perhaps randomly) <strong>to</strong><br />
be representative of its surroundings. This method is computationally simple<br />
but may lead <strong>to</strong> poor results when the sampling neighborhood is <strong>to</strong>o large and<br />
diverse.<br />
• the second method interpolates cell values within a neighborhood by taking the<br />
average of the local intensity values.
As in the rotation operation, applying scaling with interpolation techniques to large raster datasets is computationally intensive because every output cell must be computed separately using data from its neighbors. Consider the following query performing a scaling operation using bilinear interpolation; that is, the cell value for (x0, y0) in the output raster is calculated by averaging the values of its nearest cells: two in the horizontal direction (x0, x1) and two in the vertical direction (y0, y1). Note that the query is applied to a raster of spatial domain [0:255, 0:255], but as mentioned earlier, raster datasets tend to be extremely large (TB, PB).
Query 3.2.20. Scale the 2D raster shown in Fig. 3.20(a), along the x and y dimensions<br />
by a fac<strong>to</strong>r of 2.<br />
The query can be solved as follows:

B = MARRAY [0:m/2,0:n/2],(x,y) (COND +,[0:1,0:1],(i,j) (A[i + x ∗ 2, j + y ∗ 2]/4))
Results are shown in Fig. 3.20.<br />
Figure 3.20. Computation of a Scaling Operation for a Raster Image ((a) original raster; (b) scaled raster)
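The 2x2 block-averaging behind Query 3.2.20 can be sketched directly (an illustrative loop version on a small sample raster, with the halved domain as in the formula):

```python
import numpy as np

A = np.arange(16, dtype=float).reshape(4, 4)

m, n = A.shape
# each output cell averages a 2x2 neighborhood of the input
B = np.empty((m // 2, n // 2))
for x in range(m // 2):
    for y in range(n // 2):
        B[x, y] = A[2 * x:2 * x + 2, 2 * y:2 * y + 2].sum() / 4
```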
3.2.5 Terrain Analysis
Raster image data is particularly useful for tasks related <strong>to</strong> terrain analysis. Some<br />
of the most popular operations include slope/aspect, drainage networks, and catchments<br />
(or watersheds). The processing of these operations may involve interpolation
techniques that lead to expensive computational costs. For simplicity, we model these operations with approaches that do not use interpolation methods.
Slope/Aspect<br />
Slope is defined by a plane tangent to a topographic surface, as modeled by the Digital Elevation Model (DEM), at a point [2]. Slope is classified as a vector and thus has two components: a quantity (gradient) and a direction (aspect). The gradient is defined as the maximum rate of change in altitude, and the aspect as the compass direction of the maximum rate of change. Several approaches exist for the computation of slope/aspect; we follow the method proposed by [32], where z(r, c) denotes the elevation at row r and column c, and g is the grid spacing:

• Slope in the X direction (difference in height values on either side of P):

tan Θx = (z(r, c + 1) − z(r, c − 1)) / (2g)

• Slope in the Y direction:

tan Θy = (z(r + 1, c) − z(r − 1, c)) / (2g)

• Gradient at P:

tan Θ = sqrt(tan² Θx + tan² Θy)

• Direction (aspect) of the gradient:

tan α = tan Θx / tan Θy

Results are shown in Fig. 3.21.
Figure 3.21. Slopes Along the X and Y Directions<br />
Note that after the calculation of the slopes for each cell in a raster image, the<br />
results may need <strong>to</strong> be classified <strong>to</strong> display them clearly on a map [2].<br />
Query 3.2.21. Calculate the slope along the X direction of an 8-bit grey raster A:

MARRAY sdom(A),(r,c) (arctan((A[r, c + 1] − A[r, c − 1]) / (2g)))
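The central-difference slope formulas can be sketched in NumPy on the interior cells of a small elevation grid (the sample values and grid spacing g are assumptions for illustration):

```python
import numpy as np

z = np.array([[1.0, 2.0, 4.0],
              [2.0, 3.0, 5.0],
              [4.0, 6.0, 9.0]])
g = 1.0                                        # grid spacing

# central differences on the interior cells
tan_x = (z[1:-1, 2:] - z[1:-1, :-2]) / (2 * g)  # slope in the X direction
tan_y = (z[2:, 1:-1] - z[:-2, 1:-1]) / (2 * g)  # slope in the Y direction
slope_x = np.arctan(tan_x)                      # slope angle, as in Query 3.2.21
```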
Local Drain Directions (ldd)<br />
The ldd network is useful for computing several properties of a DEM because it explicitly contains information about the connectivity of different cells. Two steps are required to derive a drainage network: the estimation of the flow of material over the surface and the removal of pits. For instance (see Fig. 3.22), cell A1 has three neighboring cells (A2, B1, and B2) and the lowest of them is B1, thus the flow direction is south (downward). For cell C3, the lowest of its eight neighboring cells is D2, so the flow direction is southwest (to the lower left). This method is one of the most popular algorithms for estimating flow directions and is commonly known as the D8 algorithm [2].
Figure 3.22. Flow Directions<br />
Query 3.2.22. Estimate the flow of material over raster A, where each cell contains the slope along the X direction.

Let each cell of A contain the slope along the X direction. The ldd is then calculated as:

MARRAY sdom(A),(i,j) (COND min,[−1:1,−1:1],(v,w) (A[i + v, j + w]))
Irrespective of the algorithm used <strong>to</strong> compute flow directions, the resulting ldd network<br />
is extremely useful for computing other properties of a DEM such as stream<br />
channels, ridges, and catchments.<br />
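The minimum-over-neighborhood step of the ldd formula above can be sketched as follows (interior cells only, to sidestep border handling; the sample values are ours):

```python
import numpy as np

A = np.array([[3.0, 4.0, 5.0],
              [2.0, 8.0, 6.0],
              [1.0, 0.5, 7.0]])

# for each interior cell, the minimum over its 3x3 neighborhood,
# mirroring COND with min over the window [-1:1, -1:1]
m, n = A.shape
ldd = np.empty((m - 2, n - 2))
for i in range(1, m - 1):
    for j in range(1, n - 1):
        ldd[i - 1, j - 1] = A[i - 1:i + 2, j - 1:j + 2].min()
```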
3.2.6 Other Operations<br />
Edge Detection<br />
Edge detection produces a new raster containing only the boundary cells of a given raster. The detection of intensity discontinuities in a raster is very useful; for example, the boundary representation is easy to integrate into a large variety of detection algorithms. The following parameterized function can be used to express filtering operations in Array Algebra:

f(A, M) = MARRAY sdom(A),x (COND +,sdom(M),i (A[x + i] ∗ M(i)))

where sdom(M) is the size of the corresponding filter window, e.g., 3x3. As an example consider the following query:
Figure 3.23. Sobel Masks: (a) M1, (b) M2
Query 3.2.23. Apply edge detection to raster A shown in Fig. 3.24(a) using a 3x3 Sobel filter.

To compute this query, a Sobel filter and its inverse are applied to the original raster A (see Fig. 3.23):
(|f(A, M1)| + |f(A, M2)|) / 9

which in Array Algebra can be computed as follows:

MARRAY_{sdom(A),x}(COND_{+,sdom(M1),i}((abs(A[x + i] * M1(i)) + abs(A[x + i] * M2(i))) / 9))

Results are shown in Fig. 3.24.
Figure 3.24. Computation of an Edge-Detection for a Raster Image: (a) original raster image, (b) output raster image
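The filter function f(A, M) and the Sobel query above can be read as an ordinary discrete convolution. The sketch below is an illustrative reading with two assumptions the thesis does not spell out: cells addressed outside A are treated as 0, and the mask values are the conventional Sobel coefficients, since Fig. 3.23 is not reproduced here.

```python
def apply_filter(A, M):
    """f(A, M): for each cell x of A, sum A[x + i] * M(i) over the 3x3 mask domain.
    Cells of A addressed outside its domain contribute 0 (an assumption)."""
    rows, cols = len(A), len(A[0])
    out = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            s = 0
            for i in (-1, 0, 1):          # mask domain [-1:1, -1:1]
                for j in (-1, 0, 1):
                    if 0 <= x + i < rows and 0 <= y + j < cols:
                        s += A[x + i][y + j] * M[i + 1][j + 1]
            out[x][y] = s
    return out

def sobel_edges(A, M1, M2):
    """(|f(A, M1)| + |f(A, M2)|) / 9, as in Query 3.2.23."""
    g1, g2 = apply_filter(A, M1), apply_filter(A, M2)
    return [[(abs(a) + abs(b)) / 9 for a, b in zip(r1, r2)]
            for r1, r2 in zip(g1, g2)]

# Conventional Sobel masks (assumed values for Fig. 3.23):
M1 = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
M2 = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
```

On a constant raster the interior response is 0, and on a left-to-right ramp only the horizontal mask M1 responds, which matches the intuition that the operator reports intensity discontinuities.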
Slicing

The slicing operation extracts lower-dimensional sections from a raster. Array Algebra accomplishes slicing by indicating the slicing position in the desired dimension; the operation thus reduces the dimensionality of the raster by one. For example, consider the following query:

Query 3.2.24. Slice raster A along the second dimension at position 50.

The query is solved by specifying the slicing position as follows:
MARRAY_{sdom(A),(x,y,z)}(A[x, 50, z])

3.3 Summary
By examining the fundamental structure of Geo-raster operations and breaking down their computational steps into a few basic Array Algebra operators, we determine that Geo-raster operations can be grouped into the following classes:

• COND and MARRAY combined operations. Operations whose computation requires both the MARRAY and COND operators: add, count, average, maximum, minimum, majority, minority, histogram, diversity, variance, standard deviation, scaling, edge detection, and local drain directions.

• MARRAY exclusive operations. Operations whose computation requires only the MARRAY operator: arithmetic, trigonometric, boolean, logical, overlay, reclassification, proximity, translation, slicing, and slope/aspect.

• SORT operations. Operations whose computation requires the SORT operator: top-k, median.

• AFFINE transformations. Special cases of affine transformations partially or not yet supported by Array Algebra: rotation and scaling.

This classification allows us to identify a set of operations that require data summarization and are thus potential candidates for treatment with pre-aggregation techniques: add, count, average, maximum, minimum, majority, minority, histogram, diversity, variance, standard deviation, scaling, edge detection, and local drain directions.

Table 3.3 summarizes the usage of Array Algebra operators for each operation discussed in Section 3.2.
Table 3.3. Array Algebra Classification of Geo-Raster Operations.

Operation MARRAY COND SORT AFFINE
1. Count x
2. Add x
3. Average x
4. Maximum x
5. Minimum x
6. Majority x x
7. Minority x x
8. Std. Deviation x
9. Median x x
10. Variance x
11. Top-k x
12. Histogram x x
13. Diversity x x
14. Proximity x
15. Arithmetic x
16. Trigonometric x
17. Boolean x
18. Logical x
19. Overlay x
20. Re-classification x
21. Translation x
22. Rotation x
23. Scaling x x x
24. Slicing x
25. Edge Detection x x
26. Slope/Aspect x
27. Local drain directions (ldd) x x
Chapter 4

Answering Basic Aggregate Queries Using Pre-Aggregated Data
As discussed in previous chapters, aggregation is an important mechanism that allows users to extract general characterizations from very large repositories of data. In this chapter, we study the effect of selecting a set of aggregate queries, computing their results, and using them for subsequent query requests. In particular, we study the effect of pre-aggregation on the computation of aggregate queries in the field of GIS and remote-sensing imaging applications.

We introduce a pre-aggregation framework that distinguishes among different types of pre-aggregates for computing a query. We show that in most cases several pre-aggregates may qualify for answering an aggregate query, and we address the problem of selecting the best pre-aggregate in terms of execution time. To this end, we introduce a model that measures the cost of using qualified pre-aggregates for the computation of a query. We then present an algorithm that selects the best pre-aggregate for computing a query. We measure the performance of our algorithms in an array database management system (RasDaMan) and show that they give significantly better performance than straightforward methods.
4.1 Framework

Most major database management systems allow the user to store query results through a process known as view materialization. The query optimizer may then automatically use the materialized data to speed up the evaluation of a new query. Queries that benefit from using materialized data are those that involve the summarization of large amounts of data. They are known as aggregate queries because their query statements include one or more aggregate functions. The ANSI SQL:2008 standard defines a wide variety of aggregate functions, including COUNT, SUM, AVG, MAX, MIN, EVERY, ANY, SOME, VAR_POP, VAR_SAMP, STDDEV_POP, STDDEV_SAMP, ARRAY_AGG, REGR_COUNT, COVAR_POP, COVAR_SAMP, CORR, REGR_R2, REGR_SLOPE, and REGR_INTERCEPT [20].
4.1.1 Aggregation

An aggregate operation contains one or more aggregate functions that map a multiset of cell values in a dataset to a single scalar value. In our framework, queries may contain an arbitrary number of aggregate functions, e.g., COUNT, SUM, AVG, MAX, MIN, and a spatial domain. We formulate our queries using rasql¹, the declarative interface to the RasDaMan server. We use the Array Algebra notation for spatial domains:

sdom = [l_1 : h_1, ..., l_d : h_d]    (4.1)

where the vector variables l (low) and h (high) deliver the lower and upper bound vectors, respectively.
4.1.2 Pre-Aggregation

The term pre-aggregation refers to the process of pre-computing and storing the results of aggregate queries for subsequent use in the same or similar query requests. The decision to use pre-aggregated data during the computation of an aggregate query is influenced by the structural characteristics of the query and the pre-aggregate. By comparing the data structures of the two, one can determine whether the pre-aggregated result contributes fully or partially to the final answer of the query, and whether it is worth using pre-aggregated data.
4.1.3 Aggregate Query and Pre-Aggregate Equivalence

An aggregate query Q and a pre-aggregate p_i are equivalent if and only if all of the following conditions are met:

1. The aggregate operation of the query Q is the same as the aggregate operation defined for the pre-aggregate p_i.

2. The aggregate operation of the query Q and the pre-aggregate p_i must be applied over the same objects.

3. The same logical and boolean conditions, if any, apply to both the query Q and the pre-aggregate p_i.

4. For aggregate operations to be applied over a specific spatial domain, the extent of the spatial domain in query Q must be the same as the one in pre-aggregate p_i.
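The four conditions above can be written as a simple predicate. The dictionary fields below are hypothetical names chosen for illustration, not the thesis's internal representation:

```python
def full_matching(q, p):
    """Conditions 1-4: same aggregate operation, same objects,
    same logical/boolean conditions, and same spatial domain."""
    return (q["op"] == p["op"] and q["obj"] == p["obj"]
            and q["cond"] == p["cond"] and q["sdom"] == p["sdom"])

def partial_matching_candidate(q, p):
    """Conditions 1-3 hold but the spatial domains differ."""
    return (q["op"] == p["op"] and q["obj"] == p["obj"]
            and q["cond"] == p["cond"] and q["sdom"] != p["sdom"])
```

A pre-aggregate over a sub-window of the query's domain satisfies the second predicate but not the first, which is exactly the partial-matching case discussed next.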
When all of the above conditions are satisfied, we say there is a full-matching between the query and the pre-aggregate. In this case, retrieving the pre-aggregated result is much faster than computing the query from raw (original) data. Moreover, the storage overhead required to save the pre-aggregated result is compensated by the faster computation of the query obtained in return. However, cases do occur in which only conditions 1, 2 and 3 are satisfied. We refer to this case as a partial-matching between the query and the pre-aggregate. We can use the partial results provided by these pre-aggregates and thus speed up the computation of the query. However, further analysis must be carried out to find those pre-aggregates that provide the maximum speedup for computing a query. To that end, we define the following types of pre-aggregates: independent, overlapped, and dominant.

¹ rasql is a SQL-based query language for multidimensional raster databases based on Array Algebra.
Independent Pre-Aggregates

Definition 4.1 (Independent Pre-Aggregates) – A set of pre-aggregates is called Independent Pre-Aggregates (IPAS) with respect to Q if the spatial domain of each pre-aggregate is contained within the spatial domain of query Q and there is no intersection among the spatial domains of the pre-aggregates. Fig. 4.1(a) shows an example of an independent pre-aggregate.

IPAS := {p_1, p_2, ..., p_n | p_i.sdom ⊆ Q.sdom, p_i.sdom ∩ p_j.sdom = ∅ for i ≠ j}    (4.2)

✷
Overlapped Pre-Aggregates

Definition 4.2 (Overlapped Pre-Aggregates) – A set of pre-aggregates is called Overlapped Pre-Aggregates (OPAS) if the spatial domain of each pre-aggregate intersects with the spatial domain of the query Q. Fig. 4.1(b) shows an example of an overlapped pre-aggregate.

OPAS := {p_1, p_2, ..., p_n | p_i.sdom ∩ Q.sdom ≠ ∅}    (4.3)

✷
Dominant Pre-Aggregates

Definition 4.3 (Dominant Pre-Aggregates) – A set of pre-aggregates is called Dominant Pre-Aggregates (DPAS) if the spatial domain of the query Q is contained within the spatial domain of each pre-aggregate. Fig. 4.1(c) shows an example of a dominant pre-aggregate. Note that dominant pre-aggregates can only be used to answer the following types of aggregate queries: ADD, COUNT, and AVG.

DPAS := {p_1, p_2, ..., p_n | Q.sdom ⊆ p_i.sdom}    (4.4)

✷

Moreover, given an ordered DPAS

DPAS = {p_1, p_2, ..., p_n | Q.sdom ⊆ p_1.sdom ⊆ ... ⊆ p_n.sdom},    (4.5)

the closest dominant pre-aggregate (p_cd) to Q is given by p_1, i.e., p_cd = p_1.
Figure 4.1. Types of Pre-Aggregates: (a) independent pre-aggregate, (b) overlapped pre-aggregate, (c) dominant pre-aggregate
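Under Definitions 4.1-4.3, candidate pre-aggregates can be sorted into the three sets from their spatial domains alone. A sketch with axis-aligned domains given as ((lo1, hi1), (lo2, hi2)); this representation is an assumption, and the pairwise-disjointness requirement of Definition 4.1 is left out for brevity:

```python
def contains(outer, inner):
    """True if every interval of inner lies inside the matching interval of outer."""
    return all(ol <= il and ih <= oh for (ol, oh), (il, ih) in zip(outer, inner))

def intersects(a, b):
    """True if the two boxes share at least one point in every dimension."""
    return all(al <= bh and bl <= ah for (al, ah), (bl, bh) in zip(a, b))

def classify(q_sdom, candidates):
    """Split candidate domains into independent (inside Q), dominant (covering Q),
    and overlapped (crossing Q's border) pre-aggregates."""
    ipas, opas, dpas = [], [], []
    for p in candidates:
        if contains(q_sdom, p):
            ipas.append(p)          # p.sdom ⊆ Q.sdom    (Def. 4.1)
        elif contains(p, q_sdom):
            dpas.append(p)          # Q.sdom ⊆ p.sdom    (Def. 4.3)
        elif intersects(p, q_sdom):
            opas.append(p)          # p.sdom ∩ Q.sdom ≠ ∅ (Def. 4.2)
    return ipas, opas, dpas
```

A candidate that neither touches nor contains the query domain falls into none of the sets and is simply ignored.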
Cases may occur where a pre-aggregate intersects with one or more pre-aggregates of the same or a different type. Intersections are problematic because the greater the number of intersections, the greater the number of cells that may need to be computed from raw data to determine the real contribution of a given pre-aggregate towards the result of the query. The computation process involves several intermediary operations, such as decomposing the pre-aggregate into sub-partitions that in turn must be aggregated. Moreover, the same procedure must be performed on the other intersected pre-aggregates should we want to use their results. For example, assume that pre-aggregates p_1, p_2 and p_3 can be used to answer query Q, and that they all intersect with each other. Since the result of each pre-aggregate includes a partial result of the other two pre-aggregates, we must use raw data to compute the intersected area and adjust the result of the pre-aggregate according to the aggregate function specified in the query predicate.

To overcome this problem, a query selected for pre-aggregation, for which other pre-aggregates exist with different spatial domains but identical structural properties, can be decomposed into a set of sub-partitions prior to the pre-aggregation process.
By partitioning the query to be pre-aggregated, we can avoid intersection among pre-aggregates; see the example shown in Fig. 4.2.

Figure 4.2. Selected Queries for Pre-Aggregation (left) and Decomposed Queries (right)
4.2 Cost Model

This section introduces a cost model that allows us to estimate the cost (in terms of execution time) of computing a query using pre-aggregates compared to raw data. In our model, the access cost is driven by the number of required disk I/Os and memory accesses. These parameters are influenced by the number of tiles needed to answer a given query and by the number and size of the cells in the datasets. The following assumptions underlie our estimates.

1. We assume that the tiles needed to answer a given query are stored using implicit storage of coordinates, which is the prevalent storage format for raster image data [79]. Implicit storage of coordinate values is a storage technique that leads to a higher degree of clustering of cell values that are close in data space; that is, it preserves the spatial proximity of cell values. Given that state-of-the-art disk drives improve access to multidimensional datasets by allowing the spatial locality of the data to be preserved on the disk itself [93], we assume that it takes the same time to retrieve a tile from disk as to retrieve any other tile needed to answer a given query. Clearly, there are other factors, not considered here, that influence access cost. Among them are the cost of storing intermediate results and the communication cost of sending the results from the server to the client. More complicated cost models are certainly possible, but we believe the cost model we pick, being both simple and realistic, enables us to design and analyze powerful algorithms.
2. We consider the time taken to access a given cell (pixel) in main memory to be the same as that required to access any other cell. That is, we assume that a tile sits in main memory and is not swapped out.

3. We ignore the time it takes to combine partial aggregate results. Investigations have shown this time to be negligible compared to tile iteration [74].

Table 4.1 lists the parameters involved in the different cost functions presented in the remainder of this section.
Table 4.1. Cost Parameters

Parameter  Description
Ntiles     Number of tiles
Ncells     Number of cells
sdom       Spatial domain
IPAS       Independent pre-aggregates set
OPAS       Overlapped pre-aggregates set
DPAS       Dominant pre-aggregates set
p_cd       Closest dominant pre-aggregate
SP         Sub-partitions
4.2.1 Computing Queries from Raw Data

The cost of computing an aggregate query Q (or sub-partitions of pre-aggregates) from raw data, C_r, is given by

C_r(Q) = C_acc(Ntiles(Q)) + C_agg(Ncells(Q))    (4.6)

where C_acc is the cost of retrieving the tiles required to answer Q, and C_agg is the time taken to access and aggregate the total cells given by the spatial component of the query.
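Equation 4.6 can be instantiated directly. The unit costs below are illustrative assumptions, not values measured in the thesis:

```python
def cost_raw(n_tiles, n_cells, t_tile_io=5.0, t_cell=0.001):
    """C_r(Q) = C_acc(Ntiles(Q)) + C_agg(Ncells(Q)): per-tile I/O cost
    t_tile_io plus per-cell aggregation cost t_cell, in arbitrary time units."""
    return n_tiles * t_tile_io + n_cells * t_cell
```

For a query touching 10 tiles of 100 cells each, the estimate is dominated by tile I/O, which is why reducing the number of tiles read is the main lever for the pre-aggregation strategies that follow.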
4.2.2 Computing Queries from Independent and Overlapped Pre-Aggregates

The cost of answering an aggregate query using independent and overlapped pre-aggregates is given by:

C_IOPAS(Q) = C_IPAS(Q) + C_OPAS(Q) + C_SP(Q),    (4.7)

where C_IPAS and C_OPAS are the costs of using the results of independent and overlapped pre-aggregates, respectively, and C_SP is the cost of decomposing the query Q into a set of sub-partitions and aggregating each from raw data.
Cost of independent pre-aggregates

The cost of retrieving the results of independent pre-aggregates, C_IPAS, is given by:

C_IPAS(Q, T) = C_fin(Q, T) + Σ_{i=0}^{|IPAS|} C_acc(p_i)    (4.8)

where C_fin is the cost of finding the pre-aggregates ∈ IPAS in the pre-aggregated pool T, and C_acc is the accumulated cost of retrieving the results of the pre-aggregates.
Cost of overlapped pre-aggregates

The cost of retrieving the results of overlapped pre-aggregates, C_OPAS, is given by:

C_OPAS(Q) = C_fin(Q, T) + Σ_{i=0}^{|OPAS|} C_dec(p_i) + Σ_{i=0}^{|S|} C_r(s_i)    (4.9)

where C_fin is the cost of finding the pre-aggregates ∈ OPAS in the pre-aggregated pool T, C_dec is the cost of decomposing the spatial domain of each pre-aggregate into a set of sub-partitions S such that the spatial domain of the partitioned pre-aggregate corresponds to p_i.sdom − (p_i.sdom ∩ Q.sdom), and C_r is the cost of aggregating each resulting sub-partition s_i ∈ S from raw data.
Cost of aggregating sub-partitions of a query

The cost of aggregating all sub-partitions forming a query is given by:

C_SP(Q) = C_dec(Q) + Σ_{i=0}^{|SP|} C_r(s_i),    (4.10)

where C_dec is the cost of decomposing Q into a set SP of sub-partitions, and C_r is the cost of aggregating each resulting sub-partition s_i ∈ SP from raw data. Note that C_dec is influenced by the cost of accessing the tiles required to aggregate each sub-partition, and the cost of accessing the spatial properties of the pre-aggregates in IPAS and OPAS.
4.2.3 Computing Queries from Dominant Pre-Aggregates

The cost of computing an aggregate query Q using a dominant pre-aggregate is given by:

C_DPAS(Q) = C_DP(Q, T) + C_agg(p_cd),    (4.11)

where C_DP is the sum of the cost of finding the pre-aggregates ∈ DPAS in the pre-aggregated pool T and the cost of finding the closest dominant pre-aggregate p_cd, and C_agg is the cost of computing the aggregate difference of p_cd corresponding to p_cd.sdom − Q.sdom.
Cost of aggregating sub-partitions of the closest dominant pre-aggregate

The cost C_agg can be calculated as follows:

C_agg(p_cd) = C_dec(p_cd) + Σ_{i=0}^{|SP|} C_r(s_i),    (4.12)

where C_dec is the cost of decomposing p_cd into a set SP of sub-partitions, and C_r is the cost of aggregating each resulting sub-partition s_i ∈ SP from raw data.
4.3 Implementation

This section describes the application of a query optimization technique that transforms an input query written in terms of arrays so that it can be executed faster using pre-aggregated data. The query processing module of an array database management system (RasDaMan) has been extended with our pre-aggregation framework for query rewriting, which has been implemented as part of the optimization and evaluation phases. As discussed earlier in this chapter, there are two problems related to the computation of an aggregate query using pre-aggregated data. First, we must find all pre-aggregates that can be used to compute an aggregate query, including those that provide partial answers. Next, from all candidate pre-aggregates, we must find the one that minimizes the execution time (or cost) of computing the query. Our solution is based on an existing approach for answering queries using views in OLAP applications. Halevy et al. [95] showed that all possible rewritings of a query can be obtained by considering containment mappings from the bodies of the views to the body of the query. They also showed that such characterization is an NP-complete problem.
The QUERYCOMPUTATION procedure returns the result of a query or an execution plan for a given query Q. An execution plan is an indicator of the kind of data that must be used to compute the query. It returns a raw indicator if the query must be computed from the original data. Other valid indicators include IPAS, OPAS, and DPAS, which indicate that the query will be answered using one or more partial pre-aggregates.

The input of the algorithm is a query tree Q_t of an aggregate query. The algorithm first verifies whether the conditions for a perfect matching between the query and the pre-aggregated queries are satisfied. If a perfect matching is found, it returns the result of the pre-aggregated query. Otherwise, the algorithm verifies whether the conditions for a partial matching between the query and the set of pre-aggregated queries are satisfied. Then, the algorithm makes use of our cost model to determine the cost of using pre-aggregates that satisfy the partial-matching conditions for the computation of the query, and the cost of computing the query using the original data. Finally, the algorithm picks the plan with the least cost in terms of execution time. The algorithm makes use of the following auxiliary procedures:

• DECOMPOSEQUERY(Q_t) examines the nodes of the query tree Q_t and generates a standardized representation S_qt that can be manipulated via SQL statements.
Algorithm 1 QUERYCOMPUTATION
Require: A query tree Q_t, a set P of k pre-aggregated queries
1: initialize R = 0, key = false
2: S_qt = decomposeQuery(Q_t)
3: key = perfectMatching(S_qt, P)
4: if key then
5:     R = fetchResult(key)
6:     return R
7: end if
8: if !key then
9:     plan = partialMatching(S_qt, P)
10:    return plan
11: end if
• PERFECTMATCHING(S_qt, P) compares the standardized representation S_qt of the query tree against the existing k pre-aggregates. The output is the corresponding key of the matched pre-aggregated query. A null value is returned if no perfect matching is found.

• FETCHRESULT(key) retrieves the result R of the pre-aggregated query identified by key.
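Algorithm 1's control flow amounts to a lookup followed by a fallback. A compressed Python sketch; the pool's key structure (operation, object identifier, spatial domain) and the indicator tuples are assumptions for illustration:

```python
def query_computation(sig, pool):
    """Return ('result', value) on a perfect matching, else a plan indicator
    that a full implementation would obtain from partialMatching."""
    if sig in pool:                     # perfectMatching + fetchResult
        return ("result", pool[sig])
    return ("plan", "raw")              # fallback: partial matching / raw data

# Hypothetical pre-aggregation pool keyed by (operation, oid, spatial domain):
pool = {("add_cells", 49153, ((7680, 8191), (29000, 31000))): 1234567}
```

A signature that differs in any component of the key misses the pool and falls through to the partial-matching path, mirroring conditions 1-4 of Section 4.1.3.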
The PARTIALMATCHING algorithm identifies an aggregate sub-expression in a query tree Q_t and finds pre-aggregated queries satisfying conditions 1, 2 and 3, but not condition 4, as defined in Section 4.1.3. It considers the use of pre-aggregates that partially contribute to the answer of a query sub-expression and that are either independent, overlapped, or dominant. The algorithm calculates the cost of using each pre-aggregate for computing the query and returns an indicator of the type of plan providing the least cost.

The aggregateOp() procedure compares a node n of a given query tree Q_t against a list of pre-defined aggregate operations, e.g., add_cells, count_cells, avg_cells, max_cells, and min_cells. If the node matches any such operation, it returns a true value.

The getSubtree() procedure receives as parameters a query tree Q_t and a pointer to an aggregate node. If the aggregate node has children, it creates a subtree Q′ whose root node corresponds to the aggregate node.

The findPreaggregate() procedure receives as parameters an aggregate operation op, an object identifier ro, and a spatial domain sd. It then determines whether the values of these parameters match those of any existing pre-aggregate. If a match is found, the result of the matched pre-aggregate is returned.

The findIpasPreaggregates() procedure receives as a parameter a subtree Q′ and verifies whether any pre-aggregates satisfy conditions 1, 2 and 3, as defined in Section 4.1.3, for equivalence between a query and a pre-aggregate. For those pre-aggregates
Algorithm 2 PARTIALMATCHING
Require: A standardized query tree Q_t with m nodes.
1: initialize IPAS, OPAS, DPAS = {}
2: initialize plan = "raw", key = false
3: for each node n of Q_t do
4:     if aggregateOp(node[n]) then
5:         Q′ = getSubtree(Q_t, node[n])
6:         op = getOperation(Q′)
7:         ro = getRasterObject(Q′)
8:         sd = getSpatialDomain(Q′)
9:         key = findPreaggregate(op, ro, sd)
10:        if key then
11:            R = fetchResult(key)
12:            return R
13:        end if
14:        if !key then
15:            IPAS = findIpasPreaggregates(op, ro, sd)
16:            OPAS = findOpasPreaggregates(op, ro, sd)
17:            DPAS = findDpasPreaggregates(op, ro, sd)
18:        end if
19:        plan = selectPlan(Q′, IPAS, OPAS, DPAS)
20:    end if
21: end for
22: return plan
that qualify, it identifies those whose spatial domains are contained in the spatial domain of the query. The output is a set of independent pre-aggregates.

The findOpasPreaggregates() procedure receives as a parameter a subtree Q′ and verifies whether any pre-aggregates satisfy conditions 1, 2 and 3 as defined in Section 4.1.3. For those pre-aggregates that qualify, it identifies those whose spatial domains intersect with the spatial domain of the query. The output is a set of overlapped pre-aggregates.

The findDpasPreaggregates() procedure receives as a parameter a subtree Q′ and verifies whether any pre-aggregates satisfy conditions 1, 2 and 3 as defined in Section 4.1.3. For those pre-aggregates that qualify, it identifies those whose spatial domains dominate the spatial domain of the query. The output is a set of dominant pre-aggregates.

The selectPlan() procedure receives as parameters a sub-query tree Q′, a set of independent pre-aggregates IPAS, a set of overlapped pre-aggregates OPAS, and a set of dominant pre-aggregates DPAS. It then calculates the cost of answering the query using the different types of pre-aggregates and raw data. The output of this procedure is an indicator of the best plan for executing the query.
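The final comparison in selectPlan() reduces to picking the minimum over the applicable alternatives. A sketch in which the cost values are placeholders that a full implementation would obtain from the cost model of Section 4.2:

```python
def select_plan(cost_raw, cost_ipas=None, cost_opas=None, cost_dpas=None):
    """Return the indicator of the cheapest execution plan; None marks a
    pre-aggregate type that is not applicable to the query."""
    candidates = {"raw": cost_raw}
    for name, cost in (("IPAS", cost_ipas), ("OPAS", cost_opas),
                       ("DPAS", cost_dpas)):
        if cost is not None:
            candidates[name] = cost
    return min(candidates, key=candidates.get)
```

Raw computation is always a candidate, so a pre-aggregate is only chosen when its estimated cost actually undercuts scanning the original tiles.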
Query Evaluation

The query optimizer module provides an optimized query tree, along with the plan suggested for the computation of the query, to the final phase, evaluation. Typically, the evaluation phase identifies the tiles affected by an aggregate query and executes the aggregate operation on each tile. Finally, it combines the results to generate the answer to the query. With the extension of pre-aggregation in the optimizer, the traditional process differs in that the selected plan is considered before proceeding to execution. If the plan corresponds to raw, then the computation of the query is done entirely from raw data. Otherwise, it executes the aggregate operation only on those sub-expressions for which there are no pre-aggregated results.
4.4 Experimental Results

This section presents the performance results of our algorithms on real-life raster image datasets. We ran our experiments on an Intel Pentium 4 CPU at 3.00 GHz running SuSE Linux 9.1. The workstation had a total physical memory of 512 MB. The datasets were stored in RasDaMan, an array database management system (our research vehicle).

Table 4.2 lists the test queries used in our experiments. We ran each query 200 times against the database to obtain average query response times. The queries are formulated using rasql syntax, the declarative query interface to the RasDaMan server. We performed a cold test where the queries were run sequentially; the cache buffer was cleaned after the completion of each query. The dataset consists of a collection of 2D raster images, each associated with an object identifier (oid). Each image shows a portion of the Black Sea, is 260 MB in size, and consists of 100 indexed tiles. We artificially created a set of pre-aggregates for the experiment. They are stored in a pre-aggregation pool containing a total of 5000 pre-aggregates requiring a total storage space of 50 MB.

Computing the test queries involves the execution of two fundamental operations in GIS and remote-sensing imaging: sub-setting and aggregation. The values of the spatial domain of the queries were chosen such that we could measure the impact of using pre-aggregation for the following cases:

• The computation of queries Q1, Q2 and Q3 can be done by combining the results of partial pre-aggregates with the remaining parts computed from original data.

• The computation of queries Q4, Q5 and Q6 can be done by using the results of full pre-aggregates. That is, the full answer to these queries has been pre-computed and stored in the database.

• The computation of queries Q7, Q8 and Q9 can be done by combining the results of two or more pre-aggregates. There is no need to use original data to compute these queries.
Table 4.2. Database and Queries of the Experiment.

Qid  Description
Q1   select add_cells(y[6000:10000, 29000:32000]) from blacksea as y where oid(y) = 49153
Q2   select add_cells(y[7000:10000, 29000:31000]) from blacksea as y where oid(y) = 49154
Q3   select add_cells(y[6700:10000, 28000:30000]) from blacksea as y where oid(y) = 49155
Q4   select add_cells(y[7680:8191, 29000:31000]) from blacksea as y where oid(y) = 49153
Q5   select add_cells(y[8704:9215, 29000:31000]) from blacksea as y where oid(y) = 49154
Q6   select add_cells(y[9728:10000, 29000:31000]) from blacksea as y where oid(y) = 49155
Q7   select add_cells(y[7680:8191, 29696:30207]) from blacksea as y where oid(y) = 49153
Q8   select add_cells(y[8704:9215, 30720:31000]) from blacksea as y where oid(y) = 49154
Q9   select add_cells(y[9216:9727, 30208:30719]) from blacksea as y where oid(y) = 49155
Table 4.3 compares the CPU cost required for the computation of the queries using pre-aggregated data and using raw data. The CPU cost was obtained using the time library of C++. The column #aff. tiles shows the number of tiles that need to be read to compute the given query. Column #preagg. tiles gives the number of pre-aggregates that can be used to compute the query. Column t_pre shows the total CPU cost of computing the query using pre-aggregated data, and column t_ex the time taken to execute the query entirely from raw data. Column ratio gives t_pre as a percentage of t_ex; it shows that CPU time is always lower when the computation uses pre-aggregated data.
Table 4.3. Comparison of Query Evaluation Costs Using Pre-Aggregated Data and Original Data.

Q_id  #aff. tiles  #preagg. tiles  t_pre  t_ex  ratio
Q1    63           24              15.6   17.8  87%
Q2    35           24               6.9    9.3  74%
Q3    35            8               9.4   10.0  94%
Q4     5            5               1.02   1.55 65%
Q5     5            5               1.1    1.63 67%
Q6     5            5               0.74   1.01 73%
Q7     2            1               0.04   0.41  9%
Q8     2            1               0.04   0.45  8%
Q9     2            1               0.04   0.41  9%
4.5 Summary

In this chapter we presented a framework for computing aggregate queries in array databases using pre-aggregated data. We distinguished among different types of pre-aggregates: independent, overlapped, and dominant. We showed that such a distinction is useful for finding a set of pre-aggregated queries that can reduce the CPU cost of query computation. We proposed a cost model to calculate the cost of using different pre-aggregates and to select the best option for evaluating a query using pre-aggregated data. Measurements on real-life raster images showed that the computation of the queries is always faster with our algorithms than with straightforward methods. We focused on queries using basic aggregate functions, covering a large number of operations in GIS and remote-sensing imaging applications. The challenge remains, however, in supporting more complex aggregate operations, e.g., scaling, which is discussed in the following chapter.
Chapter 5

Pre-Aggregation Support Beyond Basic Aggregate Operations

In this chapter we investigate the problem of offering pre-aggregation support to non-standard aggregate operations such as scaling and edge detection. We discuss issues found while attempting to provide a pre-aggregation framework for all non-standard aggregate operations. We then justify our reasons for focusing on scaling operations. We adapt the framework and cost model presented in Chapter 4 to support scaling operations. Finally, we discuss the efficiency of our algorithms based on a performance analysis covering 2D, 3D, and 4D datasets. We indicate how our approach generalizes and outperforms the well-known 2D image pyramids widely used in Web mapping.
5.1 Non-Standard Aggregate Operations

As shown in Chapter 2, aggregate operations are not limited to queries using basic aggregate functions. In the GIS domain, operations such as scaling, edge detection, and those related to terrain analysis also require data summarization and may therefore benefit from pre-aggregation. See Table 3.3 for a complete list of operations requiring summarization. Finding a general pre-aggregation approach for computing those kinds of operations, however, introduces additional complications compared to finding pre-aggregates for basic aggregate functions.

Basic aggregate functions each consolidate the values of a group of cells and return a scalar value. The value may represent the total sum, the number of cells, the maximum or minimum cell value, or the average value of the affected cells. Affected cells are determined by the spatial domain defined in the predicate of the query. In contrast, the computation of a scaling operation may require consolidating the values of a group of cells to calculate each cell value in the output raster. Here, the affected cells are determined by both the resampling method and the scale vector, as described in Chapter 3. A similar situation occurs with edge detection, where the affected cells are determined by the size and values of the applied Sobel filter. For simplicity, we refer to those kinds of operations as non-standard aggregate operations.
There is an important concern that must now be taken into account. From Chapter 3, we see that the result returned by a group of affected cells for a given non-standard aggregate operation such as scaling is not likely to be useful in computing another non-standard aggregate operation such as edge detection. This is because non-standard operations differ significantly with respect to the way their affected cells are determined. Nevertheless, this result may be useful in computing the same type of non-standard operation under certain conditions. For example, the result of scaling by a factor of 8 could be used to compute scaling by a factor of 10 (assuming that both operations use the same resampling method). This result, however, is not likely to be useful in edge detection for the same object.
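This reuse of a coarser result within the same operation type can be sketched in a few lines. The example below is purely illustrative (not the thesis implementation): it uses 1D data and nearest-neighbour resampling, computes a factor-10 downscale once from the original signal and once from a precomputed factor-8 result with the remaining factor 10/8, and shows that the two outputs agree in shape while the reused version is only an approximation.

```python
import math

def nn_downscale(row, factor):
    """Nearest-neighbour downscaling of a 1D signal by a (possibly
    fractional) reduction factor: output cell i samples source cell
    floor(i * factor)."""
    out_len = math.ceil(len(row) / factor)
    return [row[min(int(i * factor), len(row) - 1)] for i in range(out_len)]

row = list(range(1000))
direct = nn_downscale(row, 10)        # factor 10 from the original data
pre = nn_downscale(row, 8)            # pre-computed factor-8 result
reused = nn_downscale(pre, 10 / 8)    # remaining factor 10/8 = 1.25

assert len(reused) == len(direct)     # same output size
assert reused != direct               # cell values differ: an approximation
```

The second path touches only the factor-8 result (125 cells here) instead of the full signal, which is exactly the saving exploited in the rest of this chapter.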
We therefore simplify the problem of offering pre-aggregation support to non-standard aggregations by treating each type of non-standard operation separately. This simplification is similar to those found in data warehousing techniques, where pre-aggregation algorithms cover a specific type of query. For instance, pre-aggregation algorithms exist for queries that include a group-by clause in their predicates, while other algorithms are used for queries without join conditions.
We now focus on pre-aggregation support for one non-standard aggregate operation, scaling, for the following reasons:

• One of the most frequent operations in GIS and remote-sensing imaging applications is downscaling of some dataset or part thereof, such as obtaining a 1 GB overview of a 10 TB dataset.
• Scaling is a very expensive operation, as it normally requires a full scan of the dataset plus costly main-memory operations. Query optimization is therefore critical for this class of retrieval operations.
• Scaling is the only such operation that is already supported by pre-aggregation, at least for 2D datasets. This provides a point of reference for comparing the effectiveness of our algorithms against existing techniques.

Although the framework discussed in the following sections is centered around scaling operations, it can be adapted to support other non-standard aggregate operations by modifying the matching conditions, as discussed later in this chapter.
5.2 Conceptual Framework

A common optimization technique for speeding up scaling operations is to materialize selected downscaled versions of an object, e.g., using image pyramids. When evaluating a scaling operation with target scale factor s, the pyramid level with the largest scale factor s′ is determined, where s′ < s. This relationship between scaling operations places them within a lattice framework similar to that used for data cubes in data warehouse/OLAP applications [92]. Our conceptual framework and greedy algorithm for the selection of pre-aggregates are based on the work of Harinarayan et al. presented in [92]. The use of this approach was motivated by the similarities between our datasets (multidimensional arrays) and OLAP data cubes. Furthermore, the lattice framework and the greedy algorithm have proven successful in a variety of business applications.

Figure 5.1. Sample Lattice Diagram for a Workload with Five Scaling Operations
5.2.1 Lattice Representation

A scaling lattice consists of a set of queries L and dependence relations ≼, denoted by ⟨L, ≼⟩. The ≼ operator imposes a partial ordering on the queries of the lattice. Consider two queries q1 and q2. We say q1 ≼ q2 if q1 can be answered using only the results of q2. The base node of the lattice is the scaling operation with the smallest scale vector, upon which every query is dependent. Lattices are commonly represented in a diagram in which the elements are nodes, and there is a path downward from q1 to q2 if and only if q1 ≼ q2. The selection of pre-aggregates, that is, of queries for materialization, is equivalent to selecting vertices from the underlying nodes of the lattice. Fig. 5.1 shows a lattice diagram for a workload containing five queries. Each node has an associated label that represents a scaling operation for a given dataset, scale vector, and resampling method.
In our framework, we use the following function to define scaling operations:

scale(objName[lo_1:hi_1, ..., lo_n:hi_n], ⃗s, resMeth)   (5.1)

where

• objName[lo_1:hi_1, ..., lo_n:hi_n] is the name of the multidimensional raster image to be scaled. The operation can be restricted to a specific area of the raster object; in that case, the area is specified by defining lower (lo_n) and upper (hi_n) bounds for each dimension. If the spatial domain is omitted, the operation is performed on the full spatial extent defining the raster image.
• ⃗s is a vector where each element is a numeric value representing the scale factor used in a specific dimension of the raster image.
• resMeth specifies the resampling method to be applied to the original raster object.

For example, scale(CalFires, [2, 2, 2], nn) defines a scaling operation by a factor of two in each dimension, using nearest neighbor as the resampling method on a 3D dataset identified as CalFires.
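Under this notation, the dependence relation ≼ of Section 5.2.1 reduces to a componentwise comparison of scale vectors. The sketch below is our own illustration and assumes scale vector values denote reduction factors, so a query can be answered from any pre-aggregate of the same object and resampling method whose factors are componentwise no larger (i.e., whose result is at least as fine):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScaleOp:
    obj: str        # objName
    s: tuple        # scale vector (one reduction factor per dimension)
    res_meth: str   # resampling method

def depends_on(q1, q2):
    """q1 <= q2 in the lattice: q1 is answerable from the result of q2.
    q2 must address the same object with the same resampling method and
    be at least as fine (componentwise smaller-or-equal factors)."""
    return (q1.obj == q2.obj
            and q1.res_meth == q2.res_meth
            and len(q1.s) == len(q2.s)
            and all(a >= b for a, b in zip(q1.s, q2.s)))

base = ScaleOp("CalFires", (2, 2, 2), "nn")   # smallest scale vector: base node
q = ScaleOp("CalFires", (8, 8, 8), "nn")
assert depends_on(q, base)       # a factor-8 overview is derivable from factor-2
assert not depends_on(base, q)   # but not the other way around
```

The `ScaleOp` class and field names are illustrative; the thesis describes the operations only through the scale function of Eq. 5.1.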
5.2.2 Pre-Aggregation Selection Problem

Definition 5.4 (Pre-Aggregates Selection Problem) – Given a query workload Q and a storage space constraint C, the pre-aggregates selection problem is to select a set P ⊆ Q of queries such that P minimizes the overall cost of computing Q while the storage space required by P does not exceed the limit given by C. ✷
Considering existing view selection strategies in data warehousing/OLAP, the following selection criteria are suggested for pre-aggregates:

• Frequency. Pre-aggregates yield particularly significant increases in processing speed when scaling operations are executed with high frequency within a workload.
• Storage space. The storage space constraint of a candidate scaling operation must be at least the size of the storage required by the query in the workload with the smallest scale vector. This guarantees that for any query in the workload at least one pre-aggregate can be used for its computation.
• Benefit. A scaling operation may be used to compute the same and other dependent queries in the workload. A metric is therefore used to calculate the cost savings gained by using a candidate scaling operation. To evaluate the cost, we use the model presented in Section 4.2. We call this the benefit of a pre-aggregate set and normalize the benefit against the base object's storage volume.
Frequency

The frequency of query q, denoted by F(q), is the relative number of occurrences of q in the workload:

F(q) = N(q) / |Q|   (5.2)

where N(q) is a function that returns the number of occurrences of query q in workload Q.
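Eq. 5.2 can be computed directly by counting occurrences in the workload; the query strings below are placeholders:

```python
from collections import Counter

def frequencies(workload):
    """F(q) = N(q) / |Q| for every distinct query q in the workload."""
    counts = Counter(workload)
    return {q: n / len(workload) for q, n in counts.items()}

Q = ["scale(r,4,nn)", "scale(r,4,nn)", "scale(r,8,nn)", "scale(r,2,nn)"]
F = frequencies(Q)
assert F["scale(r,4,nn)"] == 0.5
assert abs(sum(F.values()) - 1.0) < 1e-12   # normalized, as in Eq. 5.5
```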
Storage Space

The storage space of a given query, denoted by S(q), represents the storage space required to save the result of query q; it is determined by the number of cells composing the output object defined in query q.
Benefit

The benefit of a candidate scale operation q for pre-aggregation is computed by adding the savings in query cost for each scaling operation in the workload that depends on q, including all queries identical to q. That is, query q may contribute to saving processing costs for the same or similar queries in the workload. In both cases, specific matching conditions must be satisfied.
Full-Match Conditions. Let q be a candidate query for pre-aggregation and p a query in workload Q. Let p and q both be scaling operations as defined in Eq. 5.1. There is a full match between q and p if and only if:

• the value of parameter objName[] in the scale function defined for q is the same as in p,
• the value of parameter ⃗s in the scale function defined for q is the same as in p, and
• the value of parameter resMeth in the scale function defined for q is the same as in p.

Partial-Match Conditions. Let q be a candidate query for pre-aggregation and p be a query in workload Q. There is a partial match between p and q if and only if:

• the value of parameter objName[] in the scale function defined for q is the same as in p,
• the value of parameter resMeth in the scale function defined for q is the same as in p,
• the parameter ⃗s for both q and p has the same dimensionality, and
• the vector values defined in ⃗s for q are higher than those defined in p.
Definition 5.5 (Benefit) – Let T ⊆ Q be a subset of scaling operations that can be fully or partially computed using query q. The benefit of query q per unit space, denoted by B(q), is the sum of the computational cost savings gained by selecting query q for pre-aggregation:

B(q) = ( F(q) · C(q) + Σ_{t∈T} F(t) · C_r(t, q) ) / size(q)   (5.3)

where F(q) represents the frequency of query q in the workload, C(q) is the cost of computing query q on the original dataset, C_r(t, q) is the relative cost of computing query t from q, and size(q) is a function that returns the number of cells composing the spatial domain component of query q. ✷
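Eq. 5.3 translates directly into code. In the sketch below, the cost values stand in for the cost model of Section 4.2, and all numbers are invented for illustration:

```python
def benefit(q, T, F, C, C_r, size):
    """B(q) per Eq. 5.3: savings from answering q itself plus savings on
    every dependent query t in T, normalized by the size of q."""
    saved = F[q] * C[q] + sum(F[t] * C_r[(t, q)] for t in T)
    return saved / size[q]

F = {"q": 0.5, "t1": 0.25, "t2": 0.25}         # frequencies (sum to 1)
C = {"q": 100.0}                               # cost of q on the original data
C_r = {("t1", "q"): 40.0, ("t2", "q"): 10.0}   # relative costs of t from q
size = {"q": 1000}                             # cells in q's spatial domain
b = benefit("q", ["t1", "t2"], F, C, C_r, size)
assert abs(b - (0.5 * 100 + 0.25 * 40 + 0.25 * 10) / 1000) < 1e-12
```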
5.3 Pre-Aggregates Selection

Pre-aggregating all distinct scaling operations in the workload is not always possible because of space limitations. This is similar to the problem of selecting views for materialization in OLAP. One approach to finding the optimal set of scaling operations to pre-compute consists of enumerating all possible combinations and finding the one that yields the minimum average query cost, or the maximum benefit. Finding the optimal set of pre-aggregates in this way has a complexity of O(2^n), where n is the number of queries in the workload. If the number of scaling operations on a given raster object is 50, there are 2^50 possible pre-aggregates for that object. Therefore, computing the optimal set of pre-aggregates exhaustively is not feasible; in fact, it is an NP-hard problem [92, 17].
We therefore consider the selection of pre-aggregates as an optimization problem where the input includes multidimensional datasets, a query workload, and an upper bound on available disk space. The output is a set of queries that minimizes the total cost of evaluating the query workload subject to the storage limit. We present an algorithm that uses the benefit per unit space of a scaling operation. We model the expected queries by a query workload, which is a set of scaling operations:

Q = {q_i | 0 < i ≤ n}   (5.4)

where each q_i has an associated non-negative frequency f_i. We normalize frequencies so that they sum up to 1:

Σ_{i=1}^{n} f_i = 1   (5.5)

Based on this setup we study different workload patterns.
The PRE-AGGREGATESSELECTION procedure returns a set P = {p_i | 0 < i ≤ n} of queries to be pre-aggregated. Input is a workload Q and a storage space constraint c. The workload contains a number of queries, each corresponding to a scaling operation as defined in Eq. 5.1.

Algorithm 3 PRE-AGGREGATESSELECTION
Require: A workload Q, and a storage space constraint c
1: P = {top scaling operation}
2: while (c > 0 and |P| ≠ |Q|) do
3:   p = highestBenefit(Q, P)
4:   if (c − |p| > 0) then
5:     c = c − |p|
6:     P = P ∪ {p}
7:   else
8:     c = 0
9:   end if
10: end while
11: return P

Frequency, storage space, and benefit per unit space are calculated for each distinct query in the workload. When calculating the benefit, we assume that each query is evaluated using the root (top) node, which is the first selected pre-aggregate, p_1. The second chosen pre-aggregate p_2 is the one with the highest benefit per unit space. The algorithm then recalculates the benefit of each scaling operation given that it is computed either from the root, if the scaling operation is above p_2, or from p_2 otherwise. Subsequent selections are performed in a similar manner; the benefit is recalculated each time a scaling operation is selected for pre-aggregation. The algorithm stops selecting pre-aggregates when the storage space constraint is reached, or when there are no more queries in the workload to be considered for pre-aggregation, i.e., all scaling operations in the workload have already been selected.

The function highestBenefit(Q, P) returns the scaling operation in Q with the highest benefit per unit space, given the already selected set P. The complexity of the algorithm is O(k · n²), where k is the number of selected pre-aggregates and n is the number of vertices in the lattice; this arises from the cost of sorting the pre-aggregates by benefit per unit size.
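Algorithm 3 can be sketched as follows. The benefit computation is abstracted behind a callable, since it depends on the cost model; the workload, sizes, and benefit values below are invented for illustration:

```python
def select_preaggregates(workload, sizes, capacity, top, benefit_of):
    """Greedy selection following Algorithm 3: repeatedly pick the query
    with the highest benefit per unit space until the storage budget is
    exhausted or every workload query has been selected."""
    selected = {top}              # the top scaling operation is mandatory
    c = capacity
    while c > 0 and len(selected) != len(workload):
        candidates = [q for q in workload if q not in selected]
        p = max(candidates, key=lambda q: benefit_of(q, selected))
        if c - sizes[p] > 0:
            c -= sizes[p]
            selected.add(p)
        else:
            c = 0                 # best candidate does not fit: stop
    return selected

workload = ["s2", "s4", "s8", "s16"]
sizes = {"s2": 100, "s4": 25, "s8": 6, "s16": 2}
static_benefit = {"s4": 4.0, "s8": 2.0, "s16": 1.0}   # fixed for the demo
chosen = select_preaggregates(workload, sizes, 30, "s2",
                              lambda q, sel: static_benefit[q])
assert chosen == {"s2", "s4"}    # budget 30 fits s4 (25) but not also s8 (6)
```

In the real algorithm `benefit_of` recomputes B(q) against the currently selected set, as described above; the demo uses static values only to keep the example self-contained.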
5.3.1 Complexity Analysis

Let m be the number of queries in the lattice. Suppose we have no queries selected except for the top query, which is mandatory. The time to answer a given query in the workload is the time taken to compute the query using the top query, calculated according to our cost model. We denote this time by T_o. Suppose that in addition to the top query, we choose a set of queries P. Denote the average time to answer a query by T_p. The benefit of the set of queries P is the reduction in average time to answer a query, that is, T_o − T_p. Thus, minimizing the average time to answer a query is equivalent to maximizing the benefit of a set of queries.

Let p_1, p_2, ..., p_k be the k queries selected by the PRE-AGGREGATESSELECTION algorithm. Let b_i be the benefit achieved by the selection of p_i, for i = 1, 2, ..., k. That is, b_i is the benefit of p_i with respect to the set consisting of the top query and p_1, p_2, ..., p_{i−1}. Let P = {p_1, p_2, ..., p_k}.

Let O = {o_1, o_2, ..., o_k} be an optimal set of k queries, i.e., those queries giving the maximum benefit. Let m_i be the benefit achieved by the selection of o_i, for i = 1, 2, ..., k. That is, m_i is the benefit of o_i with respect to the set consisting of the top query and o_1, o_2, ..., o_{i−1}.

Harinarayan et al. [92] proved that the benefit of the greedy algorithm can never be less than (e − 1)/e ≈ 0.63 times the benefit of the optimal choice of pre-aggregated queries.
5.4 Answering Scaling Operations Using Pre-Aggregated Data

We say that a pre-aggregate p answers query q if there exists some other query q′ which, when executed on the result of p, provides the result of q. The result can be either exact with respect to q (q′ ∘ p ≡ q) or only an approximation (q′ ∘ p ≈ q). In practice, the result is often an approximation because of the effect of resampling the original dataset. The same effect is observed in the traditional image pyramids approach, but it is considered negligible since the approximations are good enough for many applications. In our approach, when two or more pre-aggregates qualify for computing a given scaling operation, we pick the pre-aggregate whose scale vector value is closest to the one defined in the scaling operation.

Example 5.1 – Assume the queries listed in Table 5.1 have been pre-aggregated, and suppose we want to compute the following query: q = scale(ras01, (4.0, 4.0, 4.0), bi). From the list of available pre-aggregates, the query can be answered either by using p2 or p3. Of these two pre-aggregates, p3 has the scale vector closest to q. Thus, q′ = scale(p3, (0.87, 0.87, 0.87), bi). Note that q′ represents a rewritten scaling operation in terms of the pre-aggregate. ✷
Table 5.1. Sample Pre-Aggregates.

ID   Raster Name   Scale Vector       Resampling Method
p1   ras01         (2.0, 2.0, 2.0)    nn
p2   ras01         (3.0, 3.0, 3.0)    bi
p3   ras01         (3.5, 3.5, 3.5)    bi
p4   ras01         (6.0, 6.0, 6.0)    bi
The REWRITEOPERATION procedure returns for query q a query q′ that has been rewritten in terms of a pre-aggregate identified by p_id. The input of the algorithm is the scaling operation q and a set of pre-aggregates P. The algorithm looks for a FULL-MATCH between q and one of the elements in P. To this end, the algorithm verifies that the matching conditions listed in Section 5.2.2 are all satisfied. If a full match is found, it returns the identifier of the matched pre-aggregate. Otherwise, the algorithm verifies PARTIAL-MATCH conditions for all pre-aggregates in P; all qualified pre-aggregates are added to set S. In the case of a partial match, the algorithm finds the pre-aggregate with the scale vector closest to the one defined in q. REWRITEQUERY rewrites the original query as a function of the selected pre-aggregate and adjusts the values of the scale vector to perform the complementary scaling operation. The algorithm makes use of the following auxiliary functions:

• FULLMATCH(q, P). Verifies that all full-match conditions are satisfied. If no match is found, it returns 0; otherwise it returns the id of the matching pre-aggregate.
• PARTIALMATCH(q, P). Verifies that all partial-match conditions are satisfied. Each qualified pre-aggregate of P is added to set S.
• CLOSESTSCALEVECTOR(q, S). Compares the scale vectors of q and the elements of S, and returns the identifier (p_id) of the pre-aggregate whose scale vector is the closest to that defined for q.
• REWRITEQUERY(q, p_id). Rewrites query q in terms of the selected pre-aggregate and adjusts the scale vector values accordingly.
Algorithm 4 REWRITEOPERATION
Require: A query q, and a set of pre-aggregates P
1: initialize S = {}, p_id = 0
2: p_id = fullMatch(q, P)
3: if (p_id == 0) then
4:   S = partialMatch(q, P)
5:   p_id = closestScaleVector(q, S)
6: end if
7: q′ = rewriteQuery(q, p_id)
8: return q′
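Algorithm 4 can be sketched over the pre-aggregates of Table 5.1. The dictionary representation is our own; the partial-match test assumes a pre-aggregate qualifies when its scale factors are componentwise no larger than the query's, and the complementary scale vector follows Example 5.1 (s_pre / s_query, e.g., 3.5/4.0 ≈ 0.87):

```python
def full_match(q, P):
    """Return the id of a pre-aggregate matching q exactly, else 0."""
    for pid, p in P.items():
        if p["obj"] == q["obj"] and p["s"] == q["s"] and p["res"] == q["res"]:
            return pid
    return 0

def partial_match(q, P):
    """Pre-aggregates on the same object and resampling method whose scale
    vector is componentwise no larger than q's (at least as fine)."""
    return {pid: p for pid, p in P.items()
            if p["obj"] == q["obj"] and p["res"] == q["res"]
            and len(p["s"]) == len(q["s"])
            and all(ps <= qs for ps, qs in zip(p["s"], q["s"]))}

def closest_scale_vector(q, S):
    """Id of the candidate whose scale vector is closest to q's."""
    return min(S, key=lambda pid: sum((qs - ps) ** 2
                                      for ps, qs in zip(S[pid]["s"], q["s"])))

def rewrite_operation(q, P):
    pid = full_match(q, P)
    if pid == 0:
        pid = closest_scale_vector(q, partial_match(q, P))
    # complementary scale vector, as in Example 5.1: s_pre / s_query
    comp = tuple(ps / qs for ps, qs in zip(P[pid]["s"], q["s"]))
    return pid, comp

P = {"p1": {"obj": "ras01", "s": (2.0, 2.0, 2.0), "res": "nn"},
     "p2": {"obj": "ras01", "s": (3.0, 3.0, 3.0), "res": "bi"},
     "p3": {"obj": "ras01", "s": (3.5, 3.5, 3.5), "res": "bi"},
     "p4": {"obj": "ras01", "s": (6.0, 6.0, 6.0), "res": "bi"}}
q = {"obj": "ras01", "s": (4.0, 4.0, 4.0), "res": "bi"}
pid, comp = rewrite_operation(q, P)
assert pid == "p3" and abs(comp[0] - 0.875) < 1e-9   # Example 5.1: ~0.87
```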
5.5 Experimental Results

Experiments were conducted to evaluate the effectiveness of the pre-aggregation selection and rewriting algorithms in supporting scaling operations. They were run on a machine with a 3.00 GHz Intel Pentium 4 processor running SuSE Linux 9.1. The workstation had a total physical memory of 512 MB.

The query workload consisted of scaling operations with different scaling vectors. Different data distributions of the query workload were also considered. Despite the growing popularity of Web mapping services for GIS raster information processing, very few studies report on user behavior with those services. One of the primary reasons for the lack of research in this area may be the limited availability of the datasets outside of specialized research groups. Moreover, while query patterns related to scaling operations on 2D datasets are difficult to find, no empirical workload distributions were found for datasets of higher dimensionality. We therefore resorted to a set of artificial distributions that cover many practical situations in GIS and remote-sensing imaging.
Most pre-aggregation algorithms in OLAP and image pyramids assume a uniform distribution of the values given for the scale vector in the query workload, so we considered the same type of distribution for our experiments. We also considered a Poisson distribution of the scale vector values; the rationale is that such a distribution covers situations where the dataset is scaled down by factors that typically fall within a narrow range of scale vectors. For example, very large objects may need to be scaled down by large scale vectors so they can be efficiently transferred back and forth via Web services [77]. We further considered applications where the dataset is always scaled down by the same scale vector; we refer to such an access pattern as a peak distribution. Finally, we investigated a step distribution, which covers cases where scaling operations can be grouped within specific ranges of scale vectors.
Our experiments were performed on datasets generated from three real-life raster objects:

• Dataset R1 consists of a 2D raster object with spatial domain [0:15359, 0:10239]. The dataset contains 600 tiles, each with a spatial domain of [0:512, 0:512]. The total number of cells composing the raster object is about 157 million.
• Dataset R2 consists of a 3D raster object with spatial domain [0:11299, 0:10459, 0:3650]. The dataset contains 3,214 tiles, each with a spatial domain of [0:512, 0:512, 0:512]. The total number of cells composing the raster object is about 432 billion.
• Dataset R3 consists of a 4D raster object with spatial domain [0:10150, 0:7259, 0:2430, 0:75640]. The dataset contains 197,070 tiles, each with a spatial domain of [0:512, 0:512, 0:512, 0:512]. The total number of cells composing the raster object is about 1.35 × 10^16.
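As a quick check, the cell counts for R1 and R3 follow directly from the spatial domains, whose bounds are inclusive (a dimension [lo:hi] contributes hi − lo + 1 cells):

```python
def cell_count(domain):
    """Number of cells in a spatial domain given as (lo, hi) pairs,
    with inclusive bounds as used throughout this chapter."""
    n = 1
    for lo, hi in domain:
        n *= hi - lo + 1
    return n

r1 = cell_count([(0, 15359), (0, 10239)])
r3 = cell_count([(0, 10150), (0, 7259), (0, 2430), (0, 75640)])
assert r1 == 157_286_400              # ~157 million cells
assert abs(r3 / 1.35e16 - 1) < 0.01   # ~1.35e16 cells
```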
In the rest of this section, we present the results of our experiments according to the dimensionality of the data.

5.5.1 2D Datasets

In this experiment the workload consisted of 12,800 scaling operations defined for dataset R1.
Uniform Distribution

The scaling vectors of the queries in the workload were uniformly distributed; scale vectors were integers ranging from 2 to 256. Per observations in practice, we assumed that both dimensions were coupled. We considered a storage space constraint of 35%, which is slightly higher than the additional storage space taken by image pyramids. The PRE-AGGREGATESSELECTION algorithm yields 12 pre-aggregates for this test, corresponding to scaling operations with scale vectors 2, 4, 6, 11, 15, 22, 32, 46, 67, 95, 137, and 182. The cost of computing the workload using these pre-aggregates is 18,565. In contrast, the image pyramid approach selects scaling operations with scale vectors 2, 4, 8, 16, 32, 64, 128, and 256, requires 33% additional storage space, and computes the workload at a cost of 29,166. The results of this experiment show that the pre-aggregates selected by our algorithm provide improved performance for scaling operations over image pyramids: the cost of computing the workload using our algorithm is 36% less than that incurred by image pyramids, at a price of 2% additional storage space.
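The reported figures follow directly from the stated workload costs and storage fractions; a one-line check:

```python
ours, pyramids = 18_565, 29_166       # workload costs from the text
saving = 1 - ours / pyramids
assert round(saving * 100) == 36      # 36% cheaper than image pyramids
assert 35 - 33 == 2                   # 2% extra storage vs. pyramids
```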
Fig. 5.2(a) shows the distribution of the scale vectors of all queries in the workload. The pre-aggregates selected by image pyramids and by our pre-aggregation selection algorithm are shown in Fig. 5.2(b) and 5.2(c), respectively.
Poisson Distribution<br />
The workload for this experiment consisted of scaling operations where the scale vec<strong>to</strong>rs<br />
had a Poisson distribution, and the mean value of the scale vec<strong>to</strong>r equaled 50. The<br />
PRE-AGGREGATES-SELECTION algorithm yields 33 pre-aggregates for this test that<br />
executed scaling operations using scale vec<strong>to</strong>rs from 34 <strong>to</strong> 66. The cost of computing<br />
the workload using these pre-aggregates is 42, 455. In contrast, image pyramids
5.5 Experimental Results 87<br />
(a) Query workload (Uniform distribution)<br />
(b) Selected queries for materialization by image pyramids<br />
(c) Selected queries for materialization by our pre-aggregation selection algorithm<br />
Figure 5.2. Query Workload with Uniform Distribution<br />
selects scaling operations with scale vec<strong>to</strong>rs: 2, 4, 8, 16, 32, 64, 128, and the cost of<br />
computing the workload is 95, 468. Thus, the cost of computing the workload using
88 5. <strong>Pre</strong>-<strong>Aggregation</strong> Support Beyond Basic Aggregate Operations<br />
pre-aggregates selected by our algorithm is 55% less than that incurred using image<br />
pyramids. There is also a major difference with respect <strong>to</strong> the additional s<strong>to</strong>rage<br />
space required by both approaches: image pyramids requires 33% additional s<strong>to</strong>rage<br />
space, while our algorithm requires only 5% additional space <strong>to</strong> s<strong>to</strong>re the selected<br />
pre-aggregates.<br />
Figure 5.3. Query Workload with Poisson Distribution: (a) query workload (Poisson distribution)

Fig. 5.3(a) shows the distribution of the scale vectors of all queries in the workload. The pre-aggregates selected by image pyramids are shown in Fig. 5.4(a). Even though no query in the workload has a scale factor smaller than 33, image pyramids still allocates space for pre-aggregates 2, 4, 8, 16, and 32, which account for much of the overall space requirement (33%). In contrast, our algorithm uses the query frequencies in the workload to select the queries for pre-aggregation; see Fig. 5.4(b). For this workload configuration, it is possible to pre-aggregate all distinct queries and provide much faster query response times than image pyramids. This shows the benefit of considering query frequencies in the workload. If we pick a mean higher than 50, the additional storage space needed by the pre-aggregates is minimal. Conversely, if the mean is shifted to a lower scale vector value, e.g. 16, the storage space needed by our pre-aggregation algorithm can increase to up to 35%.
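The storage asymmetry noted above is a direct consequence of cell counts: a 2D pre-aggregate with scale factor s stores roughly 1/s² of the base cells, so the low pyramid levels dominate the space budget. A back-of-the-envelope check (a sketch; the thesis's exact storage accounting may differ, e.g. due to tiling):

```python
# Fraction of base-array cells stored by a set of 2D pre-aggregates:
# scaling by s in both dimensions keeps roughly (1/s)**2 of the cells.
def storage_fraction(scales):
    return sum((1.0 / s) ** 2 for s in scales)

pyramid_low = storage_fraction([2, 4, 8, 16, 32])   # levels unused by this workload
ours = storage_fraction(range(34, 67))              # all queried scales
print(f"pyramid levels 2..32: {pyramid_low:.1%}")   # ~33.3%
print(f"scales 34..66:        {ours:.1%}")          # ~1.5%
```

Materializing every queried scale from 34 to 66 costs a small fraction of what the unused low pyramid levels cost, which is why query-frequency-driven selection wins so clearly here.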
Peak Distribution

In this experiment, the query workload consisted of scaling operations with a scale vector having a value of 100 in each dimension. The PRE-AGGREGATES-SELECTION algorithm yields a single pre-aggregate for this test, corresponding to a scaling operation with scale vector (100, 100). The cost of computing the workload using this pre-aggregate is 1.27e+08. In contrast, image pyramids selects scaling operations with scale factor values 2, 4, 8, 16, 32, 64, and 128 in each dimension, and computes the workload at a cost of 3.01e+08. Thus, the cost of computing the workload using the pre-aggregates selected by our algorithm is 58% less than the cost incurred by image pyramids. Furthermore, there is a major difference in the storage space required by the two approaches: image pyramids requires 33% additional storage space, while our algorithm requires only 5% additional space.

Figure 5.4. Selected Queries for Pre-Aggregation: (a) queries selected for pre-aggregation by image pyramids; (b) queries selected for pre-aggregation by our pre-aggregation selection algorithm
Fig. 5.5(a) shows the distribution of the scale vectors of all queries in the workload. The pre-aggregates selected by image pyramids are shown in Fig. 5.6(a). Image pyramids allocates space for pre-aggregates with scale factors 2, 4, 8, 16, 32, 128, and 256 in each dimension. In contrast, our pre-aggregation selection algorithm selected a single query, shown in Fig. 5.6(b). Although our algorithm makes more efficient use of storage space and computes the workload faster than image pyramids, this kind of scenario is unlikely to occur in practice, and the storage overhead of a full pyramid is simply not justified for it. However, users may benefit from having a system that automatically pre-aggregates such operations with minimal overhead, a capability that our algorithm provides.
Figure 5.5. Query Workload with Peak Distribution: (a) query workload (peak distribution)

Figure 5.6. Selected Queries for Pre-Aggregation: (a) queries selected for pre-aggregation by image pyramids; (b) queries selected for pre-aggregation by our pre-aggregation selection algorithm

Step Distribution

We now consider a scenario where the scale vectors are distributed across several frequency ranges, i.e. they follow a step distribution. The PRE-AGGREGATES-SELECTION algorithm yields 6 pre-aggregates for this test, corresponding to scaling operations with scale vectors 6, 8, 13, 19, 75, and 200. The cost of computing the workload using these pre-aggregates is 1.5e+09. In contrast, image pyramids selects scaling operations with scale vectors 2, 4, 8, 16, 32, 64, and 128, and computes the workload at a cost of 2.21e+09. The cost of computing the workload using the pre-aggregates selected by our algorithm is therefore 32% less than that incurred by image pyramids. Moreover, there is a major difference in the additional storage space required by the two approaches: image pyramids requires 33% additional storage space, while our algorithm requires only 15%.
Figure 5.7. Query Workload with Step Distribution: (a) query workload (step distribution)

Fig. 5.7(a) shows the distribution of the scale vectors of all queries in the workload. The pre-aggregates selected by image pyramids are shown in Fig. 5.8(a).
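The PRE-AGGREGATES-SELECTION algorithm itself is presented earlier in the chapter; as a reading aid, the following is a minimal greedy sketch of the idea the 2D experiments exercise: benefit-per-storage selection under a space budget, in the spirit of classical OLAP view selection. All names and the cost model (answering a query with scale factor q from the best materialized scale p ≤ q costs the cell count of that pre-aggregate, for a hypothetical 65536 × 65536 base array) are simplifying assumptions, not the thesis code.

```python
# Greedy, benefit-per-storage selection of 2D pre-aggregates under a
# storage budget -- a simplified sketch, not the thesis algorithm.
def cost(q, materialized, n=65536):
    # Answering scale q from the best materialized scale p <= q reads
    # (n/p)**2 cells; the base array (p = 1) is always available.
    p = max(s for s in materialized | {1} if s <= q)
    return (n / p) ** 2

def select(workload, budget, candidates):
    # workload: {scale: frequency}; storing scale s costs 1/s**2.
    chosen, space = set(), 0.0
    while True:
        best, best_ratio = None, 0.0
        for s in candidates - chosen:
            if space + 1 / s ** 2 > budget:
                continue  # would exceed the storage budget
            gain = sum(f * (cost(q, chosen) - cost(q, chosen | {s}))
                       for q, f in workload.items())
            ratio = gain * s * s  # benefit per unit of storage
            if ratio > best_ratio:
                best, best_ratio = s, ratio
        if best is None:
            return chosen
        chosen.add(best)
        space += 1 / best ** 2

wl = {2: 5, 4: 3, 50: 10, 100: 2}
print(sorted(select(wl, budget=0.35, candidates=set(wl))))  # [2, 4, 50, 100]
```

Under these assumptions the frequent high scales are materialized first because their benefit per unit of storage is enormous, mirroring the behavior observed for the Poisson and peak workloads above.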
Figure 5.8. Selected Queries for Pre-Aggregation: (a) queries selected for pre-aggregation by image pyramids; (b) queries selected for pre-aggregation by our pre-aggregation selection algorithm

5.5.2 3D Datasets

To test our pre-aggregation algorithms on 3D time-series datasets, we picked four distribution patterns for the scaling vectors. For simplicity, we label the dimensions x, y, and t, respectively. The following assumption, taken from observations in practice, is common to all distribution types: the scale vector along the first two dimensions is the same, i.e. x = y. The aim of this test is to measure average query cost while varying the storage space available for pre-aggregation.

Uniform distribution in x, y, t

In this experiment, the workload consisted of 10,000 scaling operations referring to the 3D dataset R2 described at the beginning of this section. Scale vectors were uniformly distributed along the x, y, and t dimensions, with values ranging from 2 to 256. Fig. 5.9 shows the distribution of the scaling vectors in the workload. We executed the PRE-AGGREGATES-SELECTION algorithm for different values of the storage space constraint (c). The minimum storage space required to support the root node of the lattice was 12.5% of the size of the original dataset. Fig. 5.10 shows the average query cost as storage space is increased. A small amount of storage space dramatically reduces the average query cost; the improvement diminishes, however, as allocated space grows beyond 36%. Fig. 5.11 shows the scaling operations selected for pre-aggregation when c = 36%. For this instance of the storage space constraint, the algorithm selected 49 pre-aggregates. The total cost of computing the workload is 6.44e+05. In contrast, computing the workload using the original dataset incurs a cost of 1.28e+12.

Figure 5.9. Workload with Uniform Distribution along x, y, and t

Figure 5.10. Average Query Cost over Storage Space
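The quoted root-node sizes follow directly from cell counts: the root of the lattice is the pre-aggregate with the smallest scale factor per dimension, and scaling a 3D array by (2, 2, 2) keeps (1/2)³ = 12.5% of its cells. A sketch (the scale vector (2, 2, 8), matching the 3.13% reported for the mixed uniform/Poisson lattice below, is inferred from the stated ranges, not given in the text):

```python
from functools import reduce

def storage_fraction(scale_vector):
    # Fraction of base-array cells kept by a pre-aggregate that scales
    # dimension i down by scale_vector[i].
    return reduce(lambda acc, s: acc / s, scale_vector, 1.0)

print(storage_fraction((2, 2, 2)))  # 0.125   -> the 12.5% root node above
print(storage_fraction((2, 2, 8)))  # 0.03125 -> 3.13% (rounded)
```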
Figure 5.11. Selected Pre-Aggregates, c = 36%
Uniform distribution in x, y and Poisson distribution in t

In this experiment, the workload consisted of 23,460 scaling operations referring to the 3D dataset R2. The scale vectors were uniformly distributed along x and y, and followed a Poisson distribution along t. Scale vector values ranged from 2 to 256 in the x and y dimensions, whereas in t they ranged from 8 to 16, with a mean value of 12. Fig. 5.12 shows the distribution of the scaling vectors in the workload. Note that the scale vector values in the x and y dimensions are coupled. The frequency of the various scale factor values is denoted by f. We ran the PRE-AGGREGATES-SELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 3.13% of the size of the original dataset. Fig. 5.13 shows the average query cost as storage space increases. A small amount of storage space dramatically reduces the average query cost; the improvement diminishes, however, as allocated space grows beyond 26%. Fig. 5.14 shows the scaling operations selected for pre-aggregation when c = 26%. For this instance of the storage space constraint, the algorithm selected 67 pre-aggregates. The total cost of computing the workload is 1.21e+07. In contrast, computing the workload using the original dataset incurs a cost of 2.31e+11.
Figure 5.12. Workload with Uniform Distribution along x, y, and Poisson Distribution in t

Figure 5.13. Average Query Cost as Space is Varied

Poisson distribution in x, y, t

In this experiment, the workload consisted of 600 scaling operations referring to the 3D dataset R2. The scale vectors followed a Poisson distribution along all three dimensions x, y, and t. Scale vector values ranged from 2 to 10 in the x and y dimensions, whereas in t they ranged between 8 and 16, with a mean value of 12. Fig. 5.15 shows the distribution of the scaling vectors in the workload. We ran the PRE-AGGREGATES-SELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 4.18% of the size of the original dataset. Fig. 5.16 shows the average query cost as storage space is increased. A small amount of storage space dramatically reduces the average query cost; the improvement diminishes, however, as allocated space grows beyond 26%. Fig. 5.17 shows the scaling operations selected for pre-aggregation when c = 30%. For this instance of the storage space constraint, the algorithm selected 23 pre-aggregates. The total cost of computing the workload is 1680. In contrast, computing the workload using the original dataset incurs a cost of 1.34e+12.

Figure 5.14. Selected Pre-Aggregates, c = 26%

Figure 5.15. Workload with Poisson Distribution along x, y, and t
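The Poisson workloads were generated synthetically; the thesis does not spell out the generator, so the following sketch shows one plausible way to draw such a workload with the standard library only. The x/y mean of 6 is an assumption (the text states only the ranges and the t mean of 12), and the clipping to the stated ranges is likewise an assumption.

```python
import math
import random

random.seed(0)

def poisson(lam):
    # Knuth's algorithm: multiply uniform draws until the running
    # product drops below exp(-lam); the number of draws before that
    # point is Poisson(lam)-distributed.
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def clipped(lam, lo, hi):
    return min(max(poisson(lam), lo), hi)

# 600 scale vectors with coupled x = y, as in the 3D Poisson experiment.
workload = []
for _ in range(600):
    xy = clipped(6, 2, 10)        # assumed mean 6, stated range [2, 10]
    workload.append((xy, xy, clipped(12, 8, 16)))
```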
Figure 5.16. Average Query Cost as Space is Varied

Figure 5.17. Selected Pre-Aggregates, c = 30%

Poisson distribution in x, y and uniform distribution along t

In this experiment, the workload consisted of 924 scaling operations referring to the 3D dataset R2. The scale vectors followed a Poisson distribution along the x and y dimensions and a uniform distribution along t. Scale vector values ranged from 2 to 10 in the x and y dimensions, and were uniformly distributed along t. Fig. 5.18 shows the distribution of the scaling vectors in the workload. We ran the PRE-AGGREGATES-SELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 4% of the size of the original dataset. Fig. 5.19 shows the average query cost as storage space is increased. A small amount of storage space dramatically reduces the average query cost; the improvement diminishes, however, as allocated space grows beyond 21%. Fig. 5.20 shows the scaling operations selected for pre-aggregation when c = 21%. For this instance of the storage space constraint, the algorithm selected 17 pre-aggregates. The total cost of computing the workload is 1472. In contrast, computing the workload using the original dataset incurs a cost of 1.63e+12.

Figure 5.18. Workload with Poisson Distribution along x, y, and Uniform Distribution in t
Figure 5.19. Average Query Cost as Space is Varied

Figure 5.20. Selected Pre-Aggregates, c = 21%

5.5.3 4D Datasets

For 4D datasets, we considered ECHAM T-42 as a typical use case from climate modeling. ECHAM T-42 is an energy and mass budget model developed by the Max-Planck-Institute for Meteorology [16]. We assumed that the x and y dimensions are scaled down by the same scale value, whereas the scale values along z and t may vary according to the specific analysis requirements of a given application. Looking at the sample dimensions of the ECHAM T-42 model shown in Table 5.2, it is clear that the extents of the first three dimensions are much smaller than that of the fourth dimension (time).

In this experiment, the workload consisted of 1,137 scaling operations referring to the 4D dataset R3. We assumed that the scale vectors followed a Poisson distribution in each of the four dimensions; the rationale behind this assumption is that scientists are often interested in a highly selective subset of the data, and a Poisson distribution fits this access pattern nicely. Scale vector values ranged from 2 to 11 in the x and y dimensions, with a mean of 6; from 10 to 19 along the z dimension, with a mean of 14; and from 230 to 239 along t, with a mean of 234. Table 5.3 shows the distribution of the scale factors of all scaling operations in the workload.

We ran the PRE-AGGREGATES-SELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 1.25% of the size of the original dataset. Table 5.4 shows the scaling operations selected for pre-aggregation when c = 1.3%. For this instance of the storage space constraint, the algorithm selected the 4 pre-aggregates shown in Table 5.4. The total cost of computing the workload is 3361. In contrast, computing the workload using the original dataset incurs a cost of 1.35e+16.
Table 5.2. ECHAM T-42 Climate Simulation Dimensions

Dimension             Extent
Longitude             128
Latitude              64
Elevation             17
Time (24 min/slice)   200 years (2,190,000 slices)
Table 5.3. 4D Scaling: Scale Vector Distribution

Scale Vector     Count
2,2,10,230       200
3,3,11,231       300
4,4,12,232       500
5,5,13,233       800
6,6,14,234       1000
7,7,15,235       1000
8,8,16,236       800
9,9,17,237       500
10,10,18,238     300
11,11,19,239     200
Table 5.4. 4D Scaling: Selected Pre-Aggregates

Scale Vector     Count
2,2,10,230       200
4,4,12,232       500
6,6,14,234       1000
8,8,16,236       800
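A quick consistency check on the storage side: under the cell-count model, the four pre-aggregates of Table 5.4 together occupy far less than the c = 1.3% budget (a sketch; the thesis's storage accounting may differ, e.g. due to tiling):

```python
import math

# Cell-count fractions of the four selected pre-aggregates (Table 5.4):
# a 4D pre-aggregate with scale vector s keeps 1/(s_x*s_y*s_z*s_t) cells.
selected = [(2, 2, 10, 230), (4, 4, 12, 232), (6, 6, 14, 234), (8, 8, 16, 236)]
fraction = sum(1 / math.prod(s) for s in selected)
print(fraction)  # ~0.00014, i.e. well below the c = 1.3% budget
```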
5.6 Summary

This chapter described our investigation of the problem of intelligently picking a subset of scaling operations for pre-aggregation given a storage space constraint. There is a trade-off between the amount of space allocated for pre-aggregation and the average query cost of scaling operations. We introduced a pre-aggregation selection algorithm that, based on a given query workload, determines a set of pre-aggregates in the face of storage space constraints.

We performed experiments on 2D, 3D, and 4D datasets using different distribution patterns for the scale vectors. We relied on artificial data distributions since no empirical distributions were available. In addition to uniformly distributed scale vectors, we considered non-uniform distributions including Poisson, peak, and step. For 2D datasets, we showed that our algorithm outperforms image pyramids. In particular, for non-uniform data distributions, our pre-aggregation selection algorithm not only provides a lower average query cost but also makes much more efficient use of storage space. This is because our algorithm considers the frequency of each query and the cost savings (benefit) it provides for computing the workload. Indeed, the major advantage of our algorithm over image pyramids is not the improved average query cost but the reduced amount of storage space required for the pre-aggregates, especially for non-uniform distributions.

In our experiments with 3D and 4D datasets, we showed the effect of the storage space available for pre-aggregation on average query cost. We observed that a small storage overhead is sufficient to reduce average query costs dramatically. Since there are no similar techniques against which to compare our results, we compared them against the average query costs obtained by using the original data.
Chapter 6

Conclusion

One of the biggest challenges for database technology is to provide effective and efficient solutions for archiving and managing extremely large volumes of multidimensional array data. This thesis investigates the problem of applying OLAP pre-aggregation technology to speed up aggregate query processing in array databases for GIS and remote-sensing imaging applications.

We presented a study of fundamental imaging operations in GIS. Using a formal algebraic framework, Array Algebra, we classified GIS operations according to three basic algebraic operators and thus identified a set of operations that can benefit from pre-aggregation techniques. We argued that OLAP pre-aggregation techniques cannot be applied in a straightforward manner to array databases for our target applications: although similar, the data structures of the two application domains differ in fundamental aspects. In OLAP, multidimensional data spaces are spanned by axes, with cell values sitting on the grid at intersection points. This is paralleled by raster image data, which are discretized during acquisition. Thus, the structure of an OLAP data cube is rather similar to a raster array. Dimension hierarchies in OLAP serve to group value ranges along an axis. Querying data by referring to coordinates on the measure axes yields ground data, whereas queries using axes higher up in a dimension hierarchy return aggregated values. A main differentiating criterion between OLAP data and raster image data is density: OLAP data are sparse, typically 5% dense, whereas raster image datasets are 100% dense. Note also that dimensions in OLAP are treated as business perspectives, such as products or stores; these are non-spatial dimensions, in contrast with the spatial nature of raster image datasets. There are, however, core similarities that motivated us to research OLAP pre-aggregation techniques further. For example, both array databases and OLAP systems employ multidimensional data models to organize their data. The operations also convey a high degree of similarity: a roll-up (aggregate) operation in OLAP is very similar to a scaling operation in the raster domain. Moreover, both application domains make use of pre-aggregation to speed up query processing, although at different levels of maturity and scalability.

We presented a framework that focuses on computing basic aggregate operations using pre-aggregated data. We argued that the decision to compute an aggregate
query using pre-aggregated data is influenced by the structural characteristics of the query and of the pre-aggregate. Thus, by comparing the query tree structures of the two, one can determine whether the pre-aggregated result contributes fully or partially to the final answer of the query. The best case occurs when there is a full match between the query and the pre-aggregate, since the time taken to compute the query is reduced to the time it takes to retrieve the result. In the case of a partial match, however, several pre-aggregates can be considered for computing the answer to a query, and a decision has to be made as to which pre-aggregates provide the best performance in terms of execution time. To this end, we distinguished between different types of pre-aggregates and presented a cost model to calculate the cost of using each qualifying pre-aggregate. We then presented an algorithm that selects the best execution plan for evaluating a query over pre-aggregated data. Tests performed on real-life raster image datasets showed that our distinction between different types of pre-aggregates is useful for determining the pre-aggregate that provides the highest benefit (in terms of execution time) for computing a given query.
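The full/partial matching decision can be illustrated with a toy predicate. This is a sketch under simplifying assumptions: the thesis compares query tree structures, whereas here a pre-aggregate is reduced to its scale vector and a partial match to componentwise divisibility; `match_kind` is a hypothetical helper, not thesis code.

```python
def match_kind(query_scale, pre_scale):
    # Full match: the pre-aggregate was built with exactly the requested
    # scale vector, so answering the query is a pure retrieval.
    # Partial match: each query factor is a multiple of the stored one,
    # so the query can be computed by further scaling the pre-aggregate.
    if query_scale == pre_scale:
        return "full"
    if all(q % p == 0 for q, p in zip(query_scale, pre_scale)):
        return "partial"
    return "none"

print(match_kind((100, 100), (100, 100)))  # full
print(match_kind((8, 8), (4, 4)))          # partial: rescale by (2, 2)
print(match_kind((6, 6), (4, 4)))          # none (4 does not divide 6)
```

With several partial matches available, a cost model such as the one in Chapter 4 decides which pre-aggregate is cheapest to finish the computation from.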
We then described the issues that arise in generalizing our pre-aggregation framework to support more complex aggregate operations, and justified our decision to focus on one particular operation: scaling. Traditionally, 2D scaling operations have been performed using image pyramids. Practice shows that pyramids are typically constructed with scale levels in powers of 2, yielding scale vectors 2, 4, 8, 16, 32, 64, 128, 256, and 512. The materialization of the pyramid requires an estimated 33% additional storage space. Our pre-aggregation selection algorithm is similar to the pyramid approach in that it selects a set of queries for materialization, where each level corresponds to a scaling operation with a defined scale factor. However, the selection of such queries is not restricted to a fixed number of levels separated by powers of two. Instead, our selection algorithm considers the frequency of each query in the workload and how the result of each individual query can help reduce the overall cost of computing the workload. We compared the performance of our pre-aggregation algorithm against that of image pyramids: the results showed that for workloads with uniformly distributed scale vectors, our algorithm computes the workload 36% more cheaply than image pyramids while requiring 7% more space. For scale vectors following a Poisson distribution, our algorithm computes the workload at a cost 55% lower than the pyramid approach. Furthermore, our algorithm can be applied to datasets of higher dimensions, a feature not supported by traditional image pyramids.
6.1 Future Work

There are natural extensions to this work that would help expand and strengthen the results. One area of further work is adding self-management capabilities, so that the DBMS maintains statistics about each scaling operation appearing in incoming queries and, at some suitable time, adjusts the pre-aggregate set accordingly. OLAP dynamic pre-aggregation addresses a similar problem. Another area is applying the results studied here to the many real-world situations where data cubes contain one or more non-spatio-temporal dimensions, such as pressure, which are common in meteorological and oceanographic datasets.

Workload distribution deserves further investigation. While the distributions chosen are practical and relevant, there may be further situations worth considering. Gaining empirical figures from user-exposed services such as EarthLook 1 would be useful for tuning our pre-aggregation selection algorithms. Further investigation is also necessary in the realm of rewriting scaling operations. In OLAP applications, there is a trade-off between speed and accuracy. But accuracy may be critical for certain geo-raster applications, so solutions to the query rewriting problem must weigh these two aspects according to the users' data analysis requirements. Moreover, they must take into account that the same dataset may be accessed by various users with totally different analysis needs.

1 www.earthlook.org
Bibliography

[1] Blakeley J. A., Larson P.-Å., and Tompa F. Efficiently updating materialized views. In SIGMOD Rec., volume 15, pages 61–71, New York, NY, USA, 1986. ACM.

[2] Burrough P. A. and McDonnell R. A. Principles of Geographical Information Systems. Oxford, 2004.

[3] Dehmel A. A Compression Engine for Multidimensional Array Database Systems. PhD thesis, Technical University Munich, Germany, 2002.

[4] Dobra A., Garofalakis M., Gehrke J., and Rastogi R. Processing complex aggregate queries over data streams. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 61–72, New York, NY, USA, 2002. ACM.

[5] Garcia-Gutierrez A. Applying OLAP pre-aggregation techniques to speed up query processing in raster-image databases. In GI-Days 2007 – Young Researchers Forum, pages 189–191, Muenster, Germany, 2007. IfGIprints 30.

[6] Garcia-Gutierrez A. Applying OLAP pre-aggregation techniques to speed up query response times in raster image databases. In ICSOFT (ISDM/EHST/DC), pages 259–266, 2007.

[7] Garcia-Gutierrez A. Modeling geo-raster operations with array algebra. Technical Report (7), 2007.

[8] Garcia-Gutierrez A. and Baumann P. Modeling fundamental geo-raster operations with array algebra. In ICDM Workshops, pages 607–612, 2007.

[9] Garcia-Gutierrez A. and Baumann P. Computing aggregate queries in raster image databases using pre-aggregated data. In Proceedings of the International Conference on Computer Science and Applications, pages 84–89, San Francisco, CA, USA, 2008.

[10] Garcia-Gutierrez A. and Baumann P. Using pre-aggregation to speed up scaling operations on massive spatio-temporal data. In 29th International Conference on Conceptual Modeling, November 2010.
[11] Gupta A. and Mumick I. S. Maintenance of materialized views: Problems, techniques, and applications. In IEEE Data Engineering Bulletin, volume 18, pages 3–18, 1995.
[12] Gupta A. and Mumick I. S. Materialized Views. The MIT Press, 2007.
[13] Gupta A., Harinarayan V., and Quass D. Aggregate-query processing in data warehousing environments. In Proceedings of the 21st International Conference on Very Large Data Bases, pages 358–369, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.
[14] Kitamoto A. Multiresolution cache management for distributed satellite image database using NACSIS-Thai international link. In Proceedings of the 6th International Workshop on Academic Information Networks and Systems (WAINS), pages 243–250, 2000.
[15] Koeller A. and Rundensteiner E. A. Incremental maintenance of schema-restructuring views in SchemaSQL. In IEEE Transactions on Knowledge and Data Engineering, volume 16, pages 1096–1111, Piscataway, NJ, USA, 2004. IEEE Educational Activities Department.
[16] Lauer A., Hendricks J., Ackermann I., Schell B., Hass H., and Metzger S. Simulating aerosol microphysics with the ECHAM/MADE GCM, Part I: Model description and comparison with observations. In Atmospheric Chemistry and Physics, volume 5, pages 3251–3276, 2005.
[17] Shukla A., Deshpande P., and Naughton J. F. Materialized view selection for multidimensional datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 488–499, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.
[18] Spokoiny A. and Shahar Y. An active database architecture for knowledge-based incremental abstraction of complex concepts from continuously arriving time-oriented raw data. In Journal of Intelligent Information Systems, volume 28, pages 199–231, Hingham, MA, USA, 2007. Kluwer Academic Publishers.
[19] Aronoff S. Geographic Information Systems: A Management Perspective. WDL Publications, 1991.
[20] American National Standards Institute Inc. (ANSI). ANSI/ISO/IEC 9075-2:2008, Information Technology – Database Languages – SQL – Part 2: Foundation (SQL/Foundation). Technical report, International Organization for Standardization (ISO), 2008.
[21] Barbará D. and Imielinski T. Sleepers and workaholics: Caching strategies in mobile environments. In SIGMOD Conference, pages 1–12, 1994.
[22] Moon B., Vega-Lopez I. F., and Vijaykumar I. Scalable algorithms for large temporal aggregation. In Proceedings of the 16th International Conference on Data Engineering, page 145, Washington, DC, USA, 2000. IEEE Computer Society.
[23] Reiner B. HEAVEN: A Hierarchical Storage and Archive Environment for Multidimensional Array Database Management Systems. PhD thesis, Technical University Munich, Germany, 2004.
[24] Reiner B. and Hahn K. Tertiary storage support for large-scale multidimensional array database management systems, 2002.
[25] Reiner B., Hahn K., Hoefling G., and Baumann P. Hierarchical storage support and management for large-scale multidimensional array database management systems. In Proceedings of the 13th International Conference on Database and Expert Systems Applications (DEXA), Aix-en-Provence, 2002.
[26] Sapia C. PROMISE: Predicting query behavior to enable predictive caching strategies for OLAP systems. In Proceedings of the 2nd International Conference on Data Warehousing and Knowledge Discovery, pages 224–233, London, UK, 2000. Springer-Verlag.
[27] Open GIS Consortium. Web Coverage Processing Service (WCPS). In Best Practices Document No. 06-035r1, pages 21–47, 2006.
[28] The OLAP Council. Efficient storage and management of environmental information. www.olapreport.com, accessed July 11, 2002.
[29] The OLAP Council. APB-1 OLAP Benchmark Release II. http://www.olapcouncil.org/research/resrchly.htm, accessed July 11, 2010.
[30] Cudre-Mauroux P., Kimura H., Lim K.-T., Rogers J., Simakov R., Soroush E., Velikhov P., Wang D. L., Balazinska M., Becla J., DeWitt D., Heath B., Maier D., Madden S., Patel J., Stonebraker M., and Zdonik S. A demonstration of SciDB: A science-oriented DBMS. In Proceedings of the VLDB Endowment, volume 2, pages 1534–1537. VLDB Endowment, 2009.
[31] Chatziantoniou D. Ad hoc OLAP: Expression and evaluation. In Proceedings of the 15th International Conference on Data Engineering, page 250, Washington, DC, USA, 1999. IEEE Computer Society.
[32] O'Sullivan D. and Unwin D. Geographic Information Analysis. John Wiley, 2003.
[33] Quass D. Maintenance expressions for views with aggregation. In VIEWS, pages 110–118, 1996.
[34] Tveito I. D., Dobesch H., Grueter E., Perdigao A., Tveito O. E., Thornes J. E., Van der Wel F., and Bottai L. The use of geographic information systems in climatology and meteorology. In Final Report of COST Action 719, 2006.
[35] Nguyen D. H. Using JavaScript for some interactive operations in virtual geographic model with GeoVRML. In Proceedings of the International Symposium on Geoinformatics for Spatial Infrastructure Development in Earth and Allied Sciences, 2006.
[36] Adiba M. E. and Lindsay B. G. Database snapshots. In Proceedings of the Sixth International Conference on Very Large Data Bases, October 1-3, 1980, Montreal, Quebec, Canada, pages 86–91. IEEE Computer Society, 1980.
[37] Thomsen E. OLAP Solutions: Building Multidimensional Information Systems. John Wiley and Sons, 1997.
[38] Codd E. F., Codd S. B., and Salley C. T. Beyond decision support. In Computerworld, volume 27, 1993.
[39] Codd E. F., Codd S. B., and Salley C. T. Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate. In Technical Report, 1993.
[40] Vega-Lopez I. F., Snodgrass R. T., and Moon B. Spatiotemporal aggregate computation: A survey. In IEEE Transactions on Knowledge and Data Engineering, volume 17, pages 271–286, Piscataway, NJ, USA, 2005. IEEE Educational Activities Department.
[41] Colliat G. OLAP, relational, and multidimensional database systems. In SIGMOD Record, volume 25, pages 64–69, New York, NY, USA, 1996. ACM.
[42] Pestana G., da Silva M. M., and Bedard Y. Spatial OLAP modeling: An overview based on spatial objects changing over time. In IEEE 3rd International Conference on Computational Cybernetics, pages 149–154, April 2005.
[43] Wiederhold G., Jajodia S., and Litwin W. Dealing with granularity of time in temporal databases. In Proceedings of the 3rd International Conference on Advanced Information Systems Engineering, pages 124–140, New York, NY, USA, 1991. Springer-Verlag New York, Inc.
[44] García-Molina H., Ullman J. D., and Widom J. Database Systems: The Complete Book. Prentice Hall, 2002.
[45] Samet H. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers, 2006.
[46] ERDAS Inc. ERDAS Field Guide. 1997.
[47] ESRI Inc. ArcGIS 9 Geoprocessing Commands Quick Reference Guide. ArcGIS, 2004.
[48] ISO. ISO 19123:2005, Geographic Information - Coverage Geometry and Functions, 2005.
[49] Albrecht J. Universal analytical GIS operations: A task-oriented systematization of data structure-independent GIS functionality. In Geographic Information Research: Transatlantic Perspectives, pages 577–591, 1998.
[50] Boettger J., Preiser M., Balzer M., and Deussen O. Detail-in-context visualization for satellite imagery. In Computer Graphics Forum, volume 27, pages 587–596, 2008.
[51] Burt P. J. and Adelson E. H. The Laplacian pyramid as a compact code. In IEEE Transactions on Communications, volume 31, pages 532–540, 1983.
[52] Han J., Stefanovic N., and Koperski K. Selective materialization: An efficient method for spatial data cube construction. In Proceedings of the Second Pacific-Asia Conference on Research and Development in Knowledge Discovery and Data Mining, pages 144–158, London, UK, 1998. Springer-Verlag.
[53] Nievergelt J., Hinterberger H., and Sevcik K. C. The grid file: An adaptable, symmetric multikey file structure. In ACM Transactions on Database Systems, volume 9, pages 38–71, 1984.
[54] Peuquet D. J. Making space for time: Issues in space-time data representation. In GeoInformatica, volume 5, pages 11–32, Hingham, MA, USA, 2001. Kluwer Academic Publishers.
[55] Whang K. Y. and Krishnamurthy R. The multilevel grid file: A dynamic hierarchical multidimensional file structure. In DASFAA, pages 449–459, 1991.
[56] Berry J. K. and Tomlin C. D. A mathematical structure for cartographic modeling in environmental analysis. In Proceedings of the American Congress on Surveying and Mapping, pages 269–283, 1979.
[57] Choi K. and Luk W. Processing aggregate queries on spatial OLAP data. In Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery, pages 125–134, Berlin, Heidelberg, 2008. Springer-Verlag.
[58] Hornsby K. and Egenhofer M. J. Shifts in detail through temporal zooming. In International Workshop on Database and Expert Systems Applications, page 487, Los Alamitos, CA, USA, 1999. IEEE Computer Society.
[59] Hornsby K. and Egenhofer M. J. Identity-based change: A foundation for spatio-temporal knowledge representation. In International Journal of Geographical Information Science, volume 14, pages 207–224, 2000.
[60] Ramachandran K., Shah B., and Raghavan V. V. Dynamic pre-fetching of views based on user-access patterns in an OLAP system. In ICEIS (1), pages 60–67, 2005.
[61] Sellis T. K. Multiple-query optimization. In ACM Transactions on Database Systems, volume 13, pages 23–52, New York, NY, USA, 1988. ACM.
[62] Shim K., Sellis T., and Nau D. Improvements on a heuristic algorithm for multiple-query optimization. In Data and Knowledge Engineering, volume 12, pages 197–222, 1994.
[63] Libkin L., Machlin R., and Wong L. A query language for multidimensional arrays: Design, implementation, and optimization techniques. In SIGMOD Record, volume 25, pages 228–239, New York, NY, USA, 1996. ACM.
[64] Usery E. L., Finn M. P., Scheidt D. J., Ruhl S., Beard T., and Bearden M. Geospatial data resampling and resolution effects on watershed modeling: A case study using the agricultural non-point source pollution model. In Journal of Geographical Systems, volume 6, pages 289–306, 2004.
[65] Yong K. L. and Kim M. H. Optimizing the incremental maintenance of multiple join views. In Proceedings of the 8th ACM International Workshop on Data Warehousing and OLAP, pages 107–113, New York, NY, USA, 2005. ACM.
[66] Benedikt M. and Libkin L. Exact and approximate aggregation in constraint query languages. In Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 102–113, New York, NY, USA, 1999. ACM.
[67] Gertz M., Hart Q., Rueda C., Singhal S., and Zhang J. A data and query model for streaming geospatial image data. In EDBT Workshops, pages 687–699, 2006.
[68] Golfarelli M. and Rizzi S. Data Warehouse Design: Modern Principles and Methodologies. McGraw-Hill, 2009.
[69] Gyssens M. and Lakshmanan L. V. A foundation for multi-dimensional databases. In Proceedings of the 23rd International Conference on Very Large Data Bases, pages 106–115, 1997.
[70] Ogden J. M., Adelson E. H., Bergen J. R., and Burt P. J. Pyramid methods in computer graphics. In RCA Engineer, volume 30, 1985.
[71] Beckmann N., Kriegel H. P., Schneider R., and Seeger B. The R*-tree: An efficient and robust access method for points and rectangles. In SIGMOD Record, volume 19, pages 322–331, New York, NY, USA, 1990. ACM.
[72] Roussopoulos N. Materialized views and data warehouses. In SIGMOD Record, volume 27, pages 21–26, 1998.
[73] Stefanovic N., Han J., and Koperski K. Object-based selective materialization for efficient implementation of spatial data cubes. In IEEE Transactions on Knowledge and Data Engineering, volume 12, pages 938–958, Piscataway, NJ, USA, 2000. IEEE Educational Activities Department.
[74] Widmann N. and Baumann P. Performance evaluation of multidimensional array storage techniques in databases. In Proceedings of the IDEAS Conference, 1999.
[75] Baumann P. Management of multidimensional discrete data. In The VLDB Journal, volume 3, pages 401–444, Secaucus, NJ, USA, 1994. Springer-Verlag New York, Inc.
[76] Baumann P. A database array algebra for spatio-temporal data and beyond. In Next Generation Information Technologies and Systems, pages 76–93, 1999.
[77] Baumann P. Web-enabled raster GIS services for large image and map databases. In Proceedings of the 12th International Workshop on Database and Expert Systems Applications, page 870, Washington, DC, USA, 2001. IEEE Computer Society.
[78] Baumann P. Web Coverage Processing Service (WCPS) Implementation Specification. OGC document number 08-068, version 1.0.0, 2008.
[79] Furtado P. and Baumann P. Storage of multidimensional arrays based on arbitrary tiling. In Proceedings of the 15th International Conference on Data Engineering, page 480, Washington, DC, USA, 1999. IEEE Computer Society.
[80] Marathe A. P. and Salem K. A language for manipulating arrays. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97), pages 46–55, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
[81] Vassiliadis P. Modeling multidimensional databases, cubes and cube operations. In Proceedings of the 10th International Conference on Scientific and Statistical Database Management, pages 53–62, Washington, DC, USA, 1998. IEEE Computer Society.
[82] Burt P. J. Fast filter transforms for image processing. In Computer Graphics and Image Processing, volume 16, pages 16–51, 1981.
[83] Agrawal R., Gupta A., and Sarawagi S. Modeling multidimensional databases. In Proceedings of the 13th International Conference on Data Engineering, pages 232–243, Washington, DC, USA, 1997. IEEE Computer Society.
[84] Pieringer R., Markl V., Ramsak F., and Bayer R. HINTA: A linearization algorithm for physical clustering of complex OLAP hierarchies. In DMDW, page 11, 2001.
[85] Chen S., Liu B., and Rundensteiner E. A. Multiversion-based view maintenance over distributed data sources. In ACM Transactions on Database Systems, volume 29, pages 675–709, New York, NY, USA, 2004. ACM.
[86] Prasher S. and Zhou X. Multiresolution amalgamation: Dynamic spatial data cube generation. In Proceedings of the 15th Australasian Database Conference, pages 103–111, Darlinghurst, Australia, 2004. Australian Computer Society, Inc.
[87] Shekhar S. and Xiong H. Encyclopedia of GIS. Springer, 2008.
[88] SYBASE. Sybase solutions guide. http://www.sybase.cz/uploads/CEEMEA_SybaseIQ_FINAL.pdf, accessed July 11, 2010.
[89] Griffin T. and Libkin L. Incremental maintenance of views with duplicates. In SIGMOD Record, volume 24, pages 328–339, New York, NY, USA, 1995. ACM.
[90] Needham T. Visual Complex Analysis. Oxford University Press, 1998.
[91] Niemi T., Nummenmaa J., and Thanisch P. Normalizing OLAP cubes for controlling sparsity. In Data and Knowledge Engineering, volume 46, pages 317–343, Amsterdam, The Netherlands, 2003. Elsevier Science Publishers B. V.
[92] Harinarayan V., Rajaraman A., and Ullman J. D. Implementing data cubes efficiently. In SIGMOD Record, volume 25, pages 205–216, New York, NY, USA, 1996. ACM.
[93] Schlosser S. W., Schindler J., Papadomanolakis S., Shao M., Ailamaki A., Faloutsos C., and Ganger G. R. On multidimensional data and modern disks. In Proceedings of the 4th USENIX Conference on File and Storage Technologies, pages 225–238, 2005. USENIX Association.
[94] Mingjie X. Experiments on remote sensing image cube and its OLAP. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, volume 7, pages 4398–4401, September 2004.
[95] Halevy A. Y. Answering queries using views: A survey. In The VLDB Journal, volume 10, pages 270–294, Secaucus, NJ, USA, December 2001. Springer-Verlag New York, Inc.
[96] Jiebing Y. and DeWitt D. J. Processing satellite images on tertiary storage: A study of the impact of tile size on performance. In Proceedings of the 5th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 460–476, 1996.
[97] Kotidis Y. and Roussopoulos N. A case for dynamic view management. In ACM Transactions on Database Systems, volume 26, pages 388–423, New York, NY, USA, 2001. ACM.
[98] Lee K. Y., Son J. H., and Kim M. H. Efficient incremental view maintenance in data warehouses. In Proceedings of the 10th International Conference on Information and Knowledge Management, pages 349–356, New York, NY, USA, 2001. ACM.
[99] Qingsong Y. and Aijun A. Using user access patterns for semantic query caching. In DEXA, pages 737–746, 2003.
[100] Zhao Y., Deshpande P. M., and Naughton J. F. An array-based algorithm for simultaneous multidimensional aggregates. In SIGMOD Record, volume 26, pages 159–170, New York, NY, USA, 1997. ACM.
[101] Zhuge Y., García-Molina H., Hammer J., and Widom J. View maintenance in a warehousing environment. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 316–327, New York, NY, USA, 1995. ACM.
[102] Zhuge Y., García-Molina H., and Wiener J. L. Multiple view consistency for data warehousing. In Proceedings of the 13th International Conference on Data Engineering, pages 289–300, Washington, DC, USA, 1997. IEEE Computer Society.
[103] Zhuge Y., García-Molina H., and Wiener J. L. Consistency algorithms for multi-source warehouse view maintenance. In Distributed and Parallel Databases, volume 6, pages 7–40, Hingham, MA, USA, 1998. Kluwer Academic Publishers.