Applying OLAP Pre-Aggregation Techniques to ... - Jacobs University
Applying OLAP Pre-Aggregation Techniques to Speed Up
Aggregate Query Processing in Array Databases

by

Angélica García Gutiérrez

A thesis submitted in partial fulfillment
of the requirements for the degree of

Doctor of Philosophy
in Computer Science

Approved, Thesis Committee:

Prof. Dr. Peter Baumann
Prof. Dr. Vikram Unnithan
Prof. Dr. Inés Fernando Vega López

Date of Defense: November 12, 2010

School of Engineering and Science
In memory of my grandmother, Naty.
Acknowledgments

I would like to express my sincere gratitude to my thesis advisor, Prof. Dr. Peter
Baumann, for his excellent guidance throughout the course of this dissertation. With
his tremendous passion for science and his great efforts to explain things clearly and
simply, he made this research one of the richest experiences of my life. He
always suggested new ideas and guided my research through many pitfalls. Furthermore,
I learned from him to be kind and cooperative. Thank you for every
single meeting, for every single discussion that you always managed to make thought-provoking,
for your continued encouragement, and for believing that I could bring this
project to success.
I am also grateful to Prof. Dr. Inés Fernando Vega López for his valuable suggestions.
He not only provided me with technical advice but also gave me important
hints on scientific writing that I applied in this dissertation. My sincere gratitude also
goes to Prof. Dr. Vikram Unnithan. Despite being one of Jacobs University's most popular
and busiest professors due to his genuine engagement with student life beyond
academics, Prof. Unnithan took interest in this work and provided me with unconditional
support.
I would like to thank two promising graduate students, Irina Calciu and Eugen
Sorbalo, for their outstanding contributions to some of the experiments presented in
Chapter 5 of this thesis.
I am especially grateful to my colleagues Michael Owonibi, Salah Al Jubeh, and
Yu Jinsongdi for many valuable discussions, and for providing a stimulating and
fun environment in which to learn and grow.
I am grateful to the team assistants at the School of Engineering and Science for helping
the School to run smoothly and for assisting me in many different ways. Sigrid
Manss deserves special mention. Thank you for all your kindness and caring.
Also, I would like to thank Connie Garcia, Jim Toersten, Greg White, Irina Prjadeha,
and all of my friends who helped me proofread this thesis. Victoria Inness-Brown
deserves special mention for applying her expertise as an editor in reviewing
each chapter of this thesis.
Thank you to all my great friends who provided support and encouragement in so
many ways, for helping me to see the bright side of my problems in difficult times, and for
all the emotional support, camaraderie, entertainment, and caring provided. Especially
to Salah Al Jubeh, Asma Alazeib, Talina Eslava, Rainer Gruenheid, Yu Jinsongdi,
Maria Joy, Ghada Kadamany, Ingrid Lara, Blessing Musunda, Michael Owonibi, Jessica
Price, Irina Prjadeha, Joerg Reinekirchen, Yannic Ramaye, Mila Tarabashkina,
Ruiju Tong, Derya Toykan, Iyad Tumar, Vanya Uzunova, Tanja Vaitulevich, and Justo
Vargas. You all have a place in my heart. Also, to my friend Samantha Hooton, whom
I learned to love as a sister shortly after meeting her. Her authenticity, self-confidence,
and drive to succeed are a real inspiration. Thank you for your caring, for sharing your
wisdom, for taking me to the hospital when I was in pain, and for being there anytime
I needed a friend.
My warmest thanks to Father Matthew I. Nwoko for his spiritual guidance, his
caring, his advice, and above all, for his unconditional love.
Thank you to my parents, my brother, and my sisters, who have always been very supportive
of my aspirations. Their support has been instrumental in getting me on the
path that brought me to this project. Especially, thank you, Mom, for being my
example of tenacity and commitment. I dedicate this thesis to you as well.
The financial support and trust of DAAD and CONACYT are gratefully acknowledged.
To everyone who has been a part of my life, thank you very much.
Lastly, I thank the Lord God Almighty for giving me health, ideas, and wisdom to
enable me to complete this research project successfully.
Abstract

Large multidimensional arrays of data are common in a variety of scientific applications.
In the past, arrays have typically been stored in files, and then manipulated
by customized programs operating on those files. Nowadays, with science moving
toward computational databases, the trend is toward a new class of database, the array
database. In the broadest sense, the array database supports various types of multidimensional
array data, including remote-sensor data, satellite imagery, and data
resulting from scientific simulations.
As with traditional databases for business applications, analytics in array databases
often involves the extraction of general characteristics from large repositories. This requires
efficient methods for computing queries that involve data summarization, such
as aggregate queries. A typical solution is to pre-compute the whole or parts of each
query, and then save the results of those queries that are frequently submitted against
the database and those that can be used to compute the results of similar future queries.
This process is known as pre-aggregation. Unfortunately, pre-aggregation support for
array databases is currently limited to one specific operation, scaling (zooming), and
to two-dimensional datasets (images).
In this respect, database technology for business applications is much more mature.
Technologies such as On-Line Analytical Processing (OLAP) provide the means to
analyze business data from one or multiple sources, and thus facilitate the decision-making
process. In OLAP, the information is viewed as data cubes. These cubes
are typically stored in relational tables, in multidimensional arrays, or in a hybrid
model. In order to enable fast interactive multidimensional data analysis, database
systems frequently pre-compute and store the results of aggregate queries. While there
are some valuable research results in the realm of OLAP pre-aggregation techniques
with varying degrees of power and refinement, not enough work has been done and
reported for array databases.
The purpose of this thesis is to investigate the application of OLAP pre-aggregation
techniques with the objective of speeding up aggregate operations in array databases.
In particular, we consider enhancing aggregate computation in Geographic Information
Systems (GIS) and remote-sensing imaging applications. To this end, we describe
a set of fundamental operations in GIS based on a sound algebraic framework.
This allows us to identify those operations that require data summarization and that
therefore may benefit from pre-aggregation. We introduce a conceptual framework
and cost model for rewriting basic aggregate queries in terms of pre-aggregated data,
and conduct experiments to assess the performance of our algorithms. Results show
that query response times can be substantially reduced by strategically selecting the
pre-aggregate with the least cost in terms of execution time. We also investigate the
problem of selecting a set of queries for pre-aggregation, but failed to find an analytical
solution for all possible types of aggregate queries. Nevertheless, we present a
framework and algorithms for the selection of scaling operations for pre-aggregation
considering 2D, 3D, and 4D datasets. The results of our experiments with 2D datasets
outperform the results of image pyramids, the current technique used to speed up scaling
operations on 2D datasets. Furthermore, our experiments on 3D and 4D datasets
show that query response times can also be substantially reduced by intelligently selecting
a set of scaling operations for pre-aggregation.
The work presented in this thesis is the first of its kind for array databases in scientific
applications.
Contents

1 Introduction and Problem Statement                                          9
  1.1 Overview of Thesis and Contributions . . . . . . . . . . . . . . . .  12
  1.2 Publications Related to this Thesis . . . . . . . . . . . . . . . . .  12

2 Background and Related Work                                                15
  2.1 Array Databases . . . . . . . . . . . . . . . . . . . . . . . . . . .  15
      2.1.1 Basic Notion of Arrays . . . . . . . . . . . . . . . . . . . .  15
      2.1.2 2D Data Models . . . . . . . . . . . . . . . . . . . . . . . .  16
      2.1.3 Multidimensional Data Models . . . . . . . . . . . . . . . . .  17
      2.1.4 Storage Management . . . . . . . . . . . . . . . . . . . . . .  18
      2.1.5 2D Pre-Aggregation . . . . . . . . . . . . . . . . . . . . . .  19
      2.1.6 Pre-Aggregation Beyond 2D . . . . . . . . . . . . . . . . . . .  23
      2.1.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . .  25
  2.2 On-Line Analytical Processing (OLAP) . . . . . . . . . . . . . . . .  25
      2.2.1 OLAP Data Model . . . . . . . . . . . . . . . . . . . . . . . .  25
      2.2.2 OLAP Operations . . . . . . . . . . . . . . . . . . . . . . . .  26
      2.2.3 OLAP Architectures . . . . . . . . . . . . . . . . . . . . . .  26
      2.2.4 OLAP Pre-Aggregation . . . . . . . . . . . . . . . . . . . . .  30
  2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  33

3 Fundamental Geo-Raster Operations                                          37
  3.1 Array Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . .  37
      3.1.1 Constructor . . . . . . . . . . . . . . . . . . . . . . . . . .  38
      3.1.2 Condenser . . . . . . . . . . . . . . . . . . . . . . . . . . .  39
      3.1.3 Sorter . . . . . . . . . . . . . . . . . . . . . . . . . . . .  39
  3.2 Geo-Raster Operations . . . . . . . . . . . . . . . . . . . . . . . .  39
      3.2.1 Mathematical Operations . . . . . . . . . . . . . . . . . . . .  39
      3.2.2 Aggregation Operations . . . . . . . . . . . . . . . . . . . .  45
      3.2.3 Statistical Aggregate Operations . . . . . . . . . . . . . . .  51
      3.2.4 Affine Transformations . . . . . . . . . . . . . . . . . . . .  55
      3.2.5 Terrain Analysis . . . . . . . . . . . . . . . . . . . . . . .  57
      3.2.6 Other Operations . . . . . . . . . . . . . . . . . . . . . . .  59
  3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  61

4 Answering Basic Aggregate Queries Using Pre-Aggregated Data                63
  4.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  63
      4.1.1 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . .  64
      4.1.2 Pre-Aggregation . . . . . . . . . . . . . . . . . . . . . . . .  64
      4.1.3 Aggregate Query and Pre-Aggregate Equivalence . . . . . . . . .  64
  4.2 Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  67
      4.2.1 Computing Queries from Raw Data . . . . . . . . . . . . . . . .  68
      4.2.2 Computing Queries from Independent and Overlapped Pre-Aggregates  68
      4.2.3 Computing Queries from Dominant Pre-Aggregates . . . . . . . .  69
  4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . .  70
  4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . .  73
  4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  74

5 Pre-Aggregation Support Beyond Basic Aggregate Operations                  77
  5.1 Non-Standard Aggregate Operations . . . . . . . . . . . . . . . . . .  77
  5.2 Conceptual Framework . . . . . . . . . . . . . . . . . . . . . . . .  78
      5.2.1 Lattice Representation . . . . . . . . . . . . . . . . . . . .  79
      5.2.2 Pre-Aggregation Selection Problem . . . . . . . . . . . . . . .  80
  5.3 Pre-Aggregates Selection . . . . . . . . . . . . . . . . . . . . . .  82
      5.3.1 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . .  83
  5.4 Answering Scaling Operations Using Pre-Aggregated Data . . . . . . .  83
  5.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . .  85
      5.5.1 2D Datasets . . . . . . . . . . . . . . . . . . . . . . . . . .  86
      5.5.2 3D Datasets . . . . . . . . . . . . . . . . . . . . . . . . . .  91
      5.5.3 4D Datasets . . . . . . . . . . . . . . . . . . . . . . . . . .  98
  5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

6 Conclusion                                                                103
  6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
List of Figures

2.1  3D Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  16
2.2  Map Algebra Functions . . . . . . . . . . . . . . . . . . . . . . . .  17
2.3  Image Tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . .  19
2.4  Image Pyramids . . . . . . . . . . . . . . . . . . . . . . . . . . .  20
2.5  Nearest Neighbor, Bilinear and Cubic Interpolation Methods . . . . .  22
2.6  3D Scaling Operations on Time-Series Imagery Datasets . . . . . . . .  24
2.7  OLAP Data Cube . . . . . . . . . . . . . . . . . . . . . . . . . . .  26
2.8  Typical OLAP Cube Operations . . . . . . . . . . . . . . . . . . . .  27
2.9  OLAP Approaches: MOLAP, ROLAP, and HOLAP . . . . . . . . . . . . . .  27
2.10 MOLAP Storage Scheme . . . . . . . . . . . . . . . . . . . . . . . .  28
2.11 ROLAP Storage Scheme . . . . . . . . . . . . . . . . . . . . . . . .  29
2.12 Typical Query as Expressed in ROLAP and MOLAP Systems . . . . . . . .  29
2.13 Star Model of a Spatial Warehouse . . . . . . . . . . . . . . . . . .  32
2.14 Comparison of Roll-Up and Scaling Operations . . . . . . . . . . . .  34

3.1  Reduction of Contrast in the Green Channel of an RGB Image . . . . .  40
3.2  Highlighted Infrared Areas of an NRG Image . . . . . . . . . . . . .  41
3.3  Cells of Rasters A and B with Equal Values . . . . . . . . . . . . .  42
3.4  Re-Classification of the Cell Values of a Raster Image . . . . . . .  43
3.5  Computation of a Proximity Operation . . . . . . . . . . . . . . . .  44
3.6  Computation of an Overlay Operation . . . . . . . . . . . . . . . . .  45
3.7  Computation of an Overlay Operation Considering Values Greater than Zero  46
3.8  Calculation of the Total Sum of Cell Values in a Raster . . . . . . .  47
3.9  Result of an Average Aggregate Operation . . . . . . . . . . . . . .  48
3.10 Result of a Maximum Aggregate Operation . . . . . . . . . . . . . . .  48
3.11 Result of a Minimum Aggregate Operation . . . . . . . . . . . . . . .  49
3.12 Computation of the Histogram for a Raster Image . . . . . . . . . . .  50
3.13 Computation of the Diversity for a Raster Image . . . . . . . . . . .  50
3.14 Computation of a Majority Operation for a Raster Image . . . . . . .  51
3.15 Computation of the Variance for a Raster Image . . . . . . . . . . .  52
3.16 Computation of the Standard Deviation for a Raster Image . . . . . .  52
3.17 Computation of the Median for a Raster Image . . . . . . . . . . . .  54
3.18 Computation of a Top-k Operation for a Raster Image . . . . . . . . .  54
3.19 Computation of a Translation Operation for a Raster Image . . . . . .  56
3.20 Computation of a Scaling Operation for a Raster Image . . . . . . . .  57
3.21 Slopes Along the X and Y Directions . . . . . . . . . . . . . . . . .  58
3.22 Flow Directions . . . . . . . . . . . . . . . . . . . . . . . . . . .  59
3.23 Sobel Masks . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  60
3.24 Computation of an Edge-Detection for a Raster Image . . . . . . . . .  60

4.1  Types of Pre-Aggregates . . . . . . . . . . . . . . . . . . . . . . .  66
4.2  Selected Queries for Pre-Aggregation (left) and Decomposed Queries (right)  67

5.1  Sample Lattice Diagram for a Workload with Five Scaling Operations .  79
5.2  Query Workload with Uniform Distribution . . . . . . . . . . . . . .  87
5.3  Query Workload with Poisson Distribution . . . . . . . . . . . . . .  88
5.4  Selected Queries for Pre-Aggregation . . . . . . . . . . . . . . . .  89
5.5  Query Workload with Peak Distribution . . . . . . . . . . . . . . . .  90
5.6  Selected Queries for Pre-Aggregation . . . . . . . . . . . . . . . .  90
5.7  Query Workload with Step Distribution . . . . . . . . . . . . . . . .  91
5.8  Selected Queries for Pre-Aggregation . . . . . . . . . . . . . . . .  92
5.9  Workload with Uniform Distribution Along x, y, and t . . . . . . . .  93
5.10 Average Query Cost over Storage Space . . . . . . . . . . . . . . . .  93
5.11 Selected Pre-Aggregates, c = 36% . . . . . . . . . . . . . . . . . .  94
5.12 Workload with Uniform Distribution Along x, y, and Poisson Distribution in t  95
5.13 Average Query Cost as Space is Varied . . . . . . . . . . . . . . . .  95
5.14 Selected Pre-Aggregates, c = 26% . . . . . . . . . . . . . . . . . .  96
5.15 Workload with Poisson Distribution Along x, y, and t . . . . . . . .  96
5.16 Average Query Cost as Space is Varied . . . . . . . . . . . . . . . .  97
5.17 Selected Pre-Aggregates, c = 30% . . . . . . . . . . . . . . . . . .  97
5.18 Workload with Poisson Distribution Along x, y, and Uniform Distribution in t  98
5.19 Average Query Cost as Space is Varied . . . . . . . . . . . . . . . .  99
5.20 Selected Pre-Aggregates, c = 21% . . . . . . . . . . . . . . . . . .  99
List of Tables

3.1 UNO and FAO Suitability Classifications . . . . . . . . . . . . . . .  43
3.2 Capability Indexes for Different Capability Classes . . . . . . . . .  43
3.3 Array Algebra Classification of Geo-Raster Operations . . . . . . . .  62

4.1 Cost Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . .  68
4.2 Database and Queries of the Experiment . . . . . . . . . . . . . . .  74
4.3 Comparison of Query Evaluation Costs Using Pre-Aggregated Data and Original Data  74

5.1 Sample Pre-Aggregates . . . . . . . . . . . . . . . . . . . . . . . .  84
5.2 ECHAM T-42 Climate Simulation Dimensions . . . . . . . . . . . . . . 100
5.3 4D Scaling: Scale Vector Distribution . . . . . . . . . . . . . . . . 100
5.4 4D Scaling: Selected Pre-Aggregates . . . . . . . . . . . . . . . . . 100
Chapter 1

Introduction and Problem Statement

Scientific computing platforms and infrastructures are making new kinds of experiments
possible, resulting in the generation of vast volumes of array data. This
is happening in many specialized application areas such as meteorology, oceanography,
hydrology, astronomy, medical imaging, and exploration systems for oil, natural
gas, coal, and diamonds. These datasets range from uniformly spaced points
(cells) along a single dimension to multidimensional arrays containing several different
types of data. For example, astronomy and the earth sciences operate on two- or
three-dimensional spatial grids, often using a plethora of spherical coordinate systems.
Furthermore, nearly all sciences must deal with data series over time. It is frequently
necessary to understand relationships between consecutive elements in time,
or to analyze entire sequences of observations, and such datasets may represent spatial,
temporal, or spatio-temporal information. For example, if ocean measurements
such as temperature, salinity, and oxygen are recorded every hour at spacings of
one meter in depth and ten meters in the two horizontal dimensions, the result is
a four-dimensional array with three spatial dimensions and one temporal dimension,
and three values attached to each cell of the array.
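The shape and size of such an array can be made concrete with a small back-of-the-envelope sketch. The extents below (a 1 km × 1 km region, 100 m deep, observed for one day) are hypothetical, chosen only to illustrate the dimensionality described above:

```python
# Hypothetical extents for the 4D ocean-measurement example:
# samples every 10 m horizontally, every 1 m in depth, every hour for one day.
x_cells = 1000 // 10   # 1 km of ocean, one sample per 10 m -> 100
y_cells = 1000 // 10   # 100
z_cells = 100 // 1     # 100 m deep, one sample per 1 m -> 100
t_cells = 24           # one sample per hour for a day

shape = (x_cells, y_cells, z_cells, t_cells)
values_per_cell = 3    # temperature, salinity, oxygen

total_values = x_cells * y_cells * z_cells * t_cells * values_per_cell
print(shape, total_values)  # (100, 100, 100, 24) 72000000
```

Even this modest setup yields 72 million stored values, which suggests why file-based processing quickly becomes unwieldy.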
In the past, arrays were typically stored in files and then manipulated by programs
that operated on those files. Nowadays, with science moving toward being computational
and data-based, the trend is toward a new class of database system which provides
support not only for traditional, or coded, data types such as text, integers, etc.,
but also for richer data types like multidimensional arrays. This new class of databases is
referred to as array databases.
Implementing an efficient array database management system (DBMS) can be very
challenging. Typically, there are two approaches that can be taken to store array
datasets in a DBMS. In the first, the values of each cell are stored in a separate row,
along with fields describing the position of the cell in the array. The most obvious
drawback of this approach is the need for a large multidimensional index to efficiently
find rows in the table. Moreover, the space taken by a multidimensional index is larger
than the size of the table itself if all dimensions forming an array are used as the key.
In the second approach, a multidimensional array is written to a Binary Large Object
(BLOB), which is stored in a field of a table in the database. Applications then fetch
the contents of the BLOB when they wish to operate on the data. The main drawback
of this approach is that it either requires the entire array to be passed to the client, or it
requires the client to perform a large number of BLOB input/output (I/O) operations
to read only the required portions of the array. With databases growing beyond a few
tens of terabytes, the analysis of large volumes of array datasets is severely limited
by the relatively low I/O performance of most of today's computing platforms. High-performance
numerical simulations are also increasingly feeling the I/O bottleneck.
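The two storage approaches can be sketched on a toy 4×4 array using SQLite as an illustrative stand-in (not any particular array DBMS): one row per cell with explicit position columns, versus the full array packed into a single BLOB:

```python
import sqlite3
import struct

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Approach 1: one row per cell, with position columns as the key.
# Locating a cell requires an index over all dimension columns.
cur.execute("CREATE TABLE cells (x INTEGER, y INTEGER, val REAL, PRIMARY KEY (x, y))")
data = [(x, y, float(x * 10 + y)) for x in range(4) for y in range(4)]
cur.executemany("INSERT INTO cells VALUES (?, ?, ?)", data)

# Approach 2: the whole array serialized into one BLOB field.
# Reading a single cell means fetching the BLOB (or many small I/O reads).
blob = struct.pack("16d", *[v for _, _, v in data])
cur.execute("CREATE TABLE arrays (id INTEGER PRIMARY KEY, payload BLOB)")
cur.execute("INSERT INTO arrays VALUES (1, ?)", (blob,))

# Reading cell (2, 3) via each route:
row_val = cur.execute("SELECT val FROM cells WHERE x=2 AND y=3").fetchone()[0]
payload = cur.execute("SELECT payload FROM arrays WHERE id=1").fetchone()[0]
blob_val = struct.unpack_from("d", payload, (2 * 4 + 3) * 8)[0]
print(row_val, blob_val)  # 23.0 23.0
```

The drawbacks discussed above follow directly from this sketch: the cell-per-row table duplicates coordinate data and needs a large key, while the BLOB route forces the client to fetch (or seek within) the serialized array itself.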
To improve data management and analytics on large repositories of data, aggregation
has been put forward as a key process for describing high-level data. An
example of data aggregation is the computation and storage of statistical parameters,
such as count, average, median, and standard deviation. Aggregate computation has
been studied in a variety of settings [4, 21, 66]. In particular, On-Line Analytical Processing
(OLAP) technology has emerged to address the problem of efficiently computing
complex multidimensional aggregate queries on large data warehouses. Most
OLAP systems rely on the process of selecting aggregate combinations, and then pre-computing
and storing their results so the database system can make use of them in
subsequent requests. This process is known as pre-aggregation, and it has proved to
speed up aggregate queries by several orders of magnitude in business and statistical
applications [31, 41].
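A minimal illustration of the idea (pure Python, not the thesis's actual C++/RasDaMan implementation): block sums over a one-dimensional array are pre-computed once, and a SUM query is then rewritten to use the stored aggregates wherever whole blocks fall inside the query range, falling back to raw cells only at the edges:

```python
# Toy pre-aggregation sketch: pre-compute sums over fixed-size blocks,
# then answer range-sum queries from the stored block aggregates.
BLOCK = 4
raw = list(range(32))  # stand-in for the array database contents
block_sums = [sum(raw[i:i + BLOCK]) for i in range(0, len(raw), BLOCK)]

def range_sum(lo, hi):
    """Sum of raw[lo:hi], using pre-aggregated block sums where blocks align."""
    total, i = 0, lo
    while i < hi:
        if i % BLOCK == 0 and i + BLOCK <= hi:
            total += block_sums[i // BLOCK]  # whole block: reuse the pre-aggregate
            i += BLOCK
        else:
            total += raw[i]                  # partial block: read raw cells
            i += 1
    return total

print(range_sum(4, 20), sum(raw[4:20]))  # 184 184
```

The speed-up comes from touching one stored value per block instead of every cell; the selection problem studied later in the thesis is essentially deciding which such aggregates are worth materializing under a storage budget.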
While considerable work has been done on the problem of efficiently computing
aggregate queries in OLAP-based applications, such computations continue to be a
data management challenge in scientific applications. A relevant example in which the
use of advanced data management and efficient query processing is highly desirable
is hyper-spectral remote-sensing imaging, in which an imaging spectrometer collects
hundreds or even thousands of measurements for the same area of the surface of the
Earth. The scenes provided by such sensors are often called data cubes to denote
the dimensionality of the data. Notably, efficient query processing and data mining
techniques facilitate the exploration of spatio-temporal data patterns, both interactively and
in batch on archived data.
A significant fraction of scientific data is image-based and can be naturally represented
in multidimensional arrays. These datasets fit poorly into relational databases,
which lack efficient support for the concepts of physical proximity and order. They
are typically stored in array-friendly formats such as HDF5, netCDF, or FITS. The
extremely high computational requirements introduced by image-based scientific applications
make them an excellent case study for our research.
Since array databases and OLAP/data warehousing both deal with large multidimensional
datasets and aggregate queries, adapting OLAP pre-aggregation techniques
to the management and computation of aggregate queries in array databases offers
a strong potential benefit. This thesis investigates the application of OLAP pre-aggregation
techniques to speeding up query processing in array databases. In particular,
we focus on enhancing aggregate computation in GIS and remote-sensing imaging
applications. However, the results can be generalized to other domains as well.
Relevant and complementary questions to this thesis are:

1. What factors influence the decision to select an aggregate query for pre-aggregation?

2. What formalisms are necessary to establish an efficient and scalable pre-aggregation
framework for array databases?

3. What types of constraints are typically considered by existing OLAP pre-aggregation
algorithms, and how do they affect performance?
The thesis objectives are outlined as follows:

1. To illustrate the necessity of improving aggregate computation in array databases
for GIS and remote-sensing imaging applications.

2. To achieve a solid understanding of OLAP pre-aggregation algorithms and architectural
issues when manipulating large amounts of data.

3. To formally describe fundamental operations in GIS and remote-sensing imaging
applications and identify those that involve data summarization.

4. To design a theoretical pre-aggregation framework for array databases supporting
GIS and remote-sensing imaging applications.

5. To design query selection and query rewriting algorithms using existing OLAP/data
warehousing pre-aggregation techniques.

6. To implement the algorithms in an array database management system.

7. To conduct a performance study of the developed algorithms.
The methodological approach employed in this thesis is centered on a three-stage
design methodology:

• Identification of fundamental operations in GIS and remote-sensing imaging
applications. A literature review helped us identify fundamental operations in GIS that require
data summarization. The literature included different classification schemes,
international standards, and best practices.

• Design and implementation. Existing OLAP pre-aggregation techniques are used as a basis for the construction
of a pre-aggregation framework for array databases. Storage space constraints
are considered while designing the query selection algorithms. The algorithms
were developed using the C++ programming language and tested in the
RasDaMan multidimensional array database management system.

• Evaluation. The performance of the developed algorithms is measured on 2D, 3D, and 4D datasets.
For scaling operations on 2D datasets we compare our results against those of
the traditional image pyramids approach.
1.1 Overview of Thesis and Contributions

This section provides an overview of the following chapters.
Chapter 2 presents a comparative study of array databases and OLAP, and
devotes special attention to data structures and operations. It starts with a discussion
of existing approaches to data modeling, storage management, and query processing
in both array databases and the data warehousing/OLAP environment. Existing
pre-aggregation and related techniques are also discussed for both application domains.
From this study, one can observe similarities in data structures and operations
between the two application domains. This suggests that array databases can benefit
from pre-aggregation schemes to accelerate the computation of aggregate queries.
Chapter 3 describes fundamental operations in GIS and remote-sensing imaging applications. The selection of operations is based on a thorough review of existing surveys of GIS operations, international standards, and feedback from GIS practitioners. To better understand the structural characteristics of common queries in array databases, these operations were modeled using a proven array model. This allowed us to identify the set of operations requiring data summarization (aggregation) and the candidate operations to be supported by pre-aggregation techniques.
Chapter 4 deals with the computation of aggregate queries in array databases using pre-aggregated data. The proposed pre-aggregation framework distinguishes different types of pre-aggregates and shows that such a distinction is useful in finding an optimal solution that reduces the CPU cost of computing aggregate queries. A cost model is used to assess the benefit of using pre-aggregated data for computing aggregate queries. Measurements on real-life raster image datasets show that the computation of aggregate queries is always faster with our algorithms than with traditional methods.
Chapter 5 considers the problem of offering pre-aggregation support for non-standard aggregate operations in GIS and remote-sensing imaging applications. We discuss the issues encountered while attempting to provide pre-aggregation support for all non-standard aggregate operations, as well as the motivation for focusing on scaling operations. The framework and cost model presented in Chapter 4 are adapted to support scaling operations. Experiments covering 2D, 3D, and 4D datasets show that our pre-aggregation approach not only generalizes the most common approach for 2D, but also reduces computation times for 2D, 3D, and 4D datasets.
Chapter 6 presents a summary of our findings and outlines future lines of research.
1.2 Publications Related to this Thesis
A number of papers related to the work described in this thesis have been published. Doctoral workshops provided a platform to discuss the feasibility of the proposed research and an opportunity to receive feedback from experts in computer science [6] and the GIS scientific community [5]. Participation in those workshops led to a refinement of the research objectives outlined in Chapter 1. The study and algebraic modeling of geo-raster operations reported in Chapter 3 are presented in [7, 8].
The pre-aggregation framework described in Chapter 4 is presented in [9]. Finally, findings on the query selection problem addressed in Chapter 5 have been accepted for publication in [10].
Chapter 2
Background and Related Work
This chapter describes existing database technology for two environments: GIS/remote-sensing imaging and data warehousing/OLAP. Our investigation shows that conceptual data models and operations are similar in both application domains. This suggests that array database technology can be substantially enhanced by adopting a pre-aggregation scheme built on existing OLAP technology.
2.1 Array Databases
Multidimensional data analysis has recently taken the spotlight in the context of scientific applications. A fundamental demand from science users is extremely fast response times for multidimensional queries. While most scientific users can use relational tables, and have often been forced to do so by commercial DBMSs, only a few find tables to be a natural data model that closely matches their data. Furthermore, few users are satisfied with SQL as the interface language [30]. In contrast, arrays appear to be a natural data model for a significant subset of science users, specifically in astronomy, oceanography, and remote-sensing applications. Moreover, a table with a primary key is merely a 1D array. Hence, an array data model can subsume the needs of users who are satisfied with tables.
Next we review existing database technology supporting multidimensional arrays in scientific applications: 1D sensor time series, 2D satellite imagery, 3D image time series, and 4D atmospheric data.
2.1.1 Basic Notion of Arrays
Several approaches have been proposed towards the formalization of arrays and array query languages. The underlying methods of formalization differ, and the discussion is still open. However, the following notion of arrays is quite common [79]:
An array is a set of cells of a fixed data type T, with a fixed cell size. Each cell corresponds to one element in the multidimensional domain of the array. The domain D of an array is a d-dimensional subinterval of a discrete coordinate set S = S_1 × ... × S_d, where each S_i, i = 1, ..., d, is a finite, totally ordered discrete set and d is the dimensionality of the array.
The definition domain of an array is expressed as a multidimensional interval given by its lower and upper bounds, l_i and u_i respectively, along each dimension i, denoted D = [l_1:u_1; ...; l_d:u_d], where l_i < u_i, i = 1, ..., d, and l_i, u_i ∈ S_i.
Figure 2.1(a) shows the constituents of a sample 3D array.
Figure 2.1. 3D Array
The following subsections provide a brief summary of the main contributions in data modeling and query languages that support array data in GIS and remote-sensing imaging applications.
2.1.2 2D Data Models
A uniform representation and algebraic notation for manipulating image-based data structures, known as map algebra, was first advanced by Tomlin and Berry [56]. While not the first to describe this type of spatial data processing, Tomlin and Berry put forward the methodological basis for organizing this form of geographical data analysis. Map algebra treats individual rasters, or array layers, as members of algebraic equations. Map algebra functions are grouped into the following categories:
• Local functions create outputs in which output cell values are determined on a cell-by-cell basis, without regard for the values of neighboring cells.
• Focal functions create outputs in which the value of each output cell is affected by the values of neighboring cells. Low-pass filters are commonly used to smooth out data.
• Zonal functions create outputs in which the values of output cells are determined in part by the spatial association between cells in the input grids.
• Global functions compute an output raster where the value of each output cell is potentially a function of all input cell values.
Figure 2.2 shows a graphical classification of grid functions according to map algebra.
Figure 2.2. Map Algebra Functions
Map algebra is primarily oriented toward 2D static data. Each layer is associated with a particular moment or period of time, and analytical operations are intended to deal with spatial relationships. In its original form, map algebra was never intended to handle spatial data with a temporal component.
2.1.3 Multidimensional Data Models
AQL
Libkin et al. [63] presented an array data model called AQL that embeds array support into a nested relational calculus and treats arrays as functions rather than collection types. The AQL data model combines complex objects such as sets, bags, and lists with multidimensional arrays. To express complex object values, the core calculus on which AQL is based has been extended with concepts such as comprehensions, pattern matching, and block structures that strengthen the expressive power of the language. Still, AQL does not provide a declarative mechanism to define the order in which queries manipulate data.
Array Manipulation Language (AML)
AML is a query language for multidimensional array data [80]. The model is aimed at applications in image databases, particularly remote sensing, but is customizable to support a wide variety of application domains. An interesting characteristic of this language is its use of bit patterns, an array indexing mechanism that allows for a more powerful access structure over arrays. AML's algebra consists of three operators that enable the manipulation of arrays: subsample, merge, and apply. Each operator takes one or more arrays as arguments and produces an array as result. Subsample is a unary operator that eliminates cells from an array by cutting out slices. Merge is a binary operator that combines two arrays defined over the same domain. The apply operator applies a user-defined function to an array, thereby producing a new array. All AML operators take bit patterns as parameters.
Data and Query Model for Stream Geo-Raster Imagery
Gertz et al. [67] introduced a data and query model for managing and querying streams of remote-sensing imagery. The data model considers the spatio-temporal and geo-referenced nature of satellite imagery. Three classes of operators allow the formulation of queries. A stream restriction operator acts as a filter that selects points from a stream that satisfy a given condition on the spatial, temporal, or spatio-temporal component of the image. The stream transform operator maps the points or values associated with a stream to a new point or value set; this class of operators is useful for processing on a point-by-point basis. The third class of operators, stream compositions, allows the combination of image data from different spectral bands. To this end, each stream is considered to represent a single spectral band.
However, since the primary objective of the authors was to stream geo-raster image data, they put less emphasis on post-processing satellite images. Core operations such as Fourier transforms and edge detection are therefore not supported by their framework.
Array Algebra
Baumann [75] introduced a formal array model called Array Algebra that supports the description and manipulation of multidimensional array data types [76]. This compact algebra consists of three core operators: an array constructor, a general condenser for computing aggregations, and an index sorter. Through these operators, the expressive power of Array Algebra covers a wide range of signal processing, imaging, and statistical operations. Moreover, the termination of any well-formed query is guaranteed by limiting the expressive power to non-recursive operations. Array Algebra is described in more detail in Chapter 3.
To date, Array Algebra is the most comprehensive approach, supporting a variety of applications including sensor, image, and statistical data. Recently, a geo-raster service standard based on Array Algebra concepts has been issued by the Open Geospatial Consortium (OGC) [78]. Commercial and open-source implementations of Array Algebra are currently available to the scientific community.
2.1.4 Storage Management
At present, handling large image data stored in a database is usually carried out by adopting a tiling strategy [23]. An image is split into sub-images (tiles), as shown in Fig. 2.3. When a region of interest is requested in a given query operation, only the relevant tiles are accessed. This strategy results in significant I/O bandwidth savings. Tiles form the basic processing units for indexing and compression. Spatial indexing allows for the quick retrieval of the identifier and location of a required tile, while compression improves disk I/O bandwidth efficiency. The choice of tile size is crucial for efficiency: while large tiles return much redundant data in response to a range query, small tiles result in a poor compression ratio; typical tile sizes range from 8 KB (very small) to 512 KB (very large) [23, 96]. A comprehensive approach toward the storage of large amounts of data on tertiary storage media, considering tiling techniques in multidimensional database management systems, is presented in [23, 24, 25].
Figure 2.3. Image Tiling
A key factor influencing the effectiveness of a tiling scheme is compression. Raster data compression algorithms are essentially the same as those for other image data. However, remote-sensing images are usually of much higher resolution, are multi-spectral, and have significantly larger volumes than natural images. To effectively compress raster data in GIS environments, emphasis must be placed on the management of schemas to deal with large volumes of remote-sensing imagery, and on the integration of various types of datasets, such as vector and multidimensional datasets [3, 87].
Dehmel [3] proposed a comprehensive framework for the compression of multidimensional arrays based on different model layers, including various kinds of predictors and a generic wavelet engine for lossy compression with arbitrary quality levels. In particular, the author introduces concepts such as channel separation, which compresses the values of each channel separately, and predictors, which calculate approximate values for some cells and express those cell values relative to the approximations. Further, the proposed method applies wavelets to transform the channels individually into multi-resolution representations with coarse approximations and various levels of detail information. This led to a wavelet engine architecture consisting of three major components (transformation, quantization, and compression) that together improve compression rates in array databases considerably.
2.1.5 2D Pre-Aggregation
Aggregate operations in GIS and remote-sensing applications have been shown to be computationally expensive due to the size of the data and the complexity of the operations [8]. One such operation is zooming (scaling), which is carried out by interpolating the values of the original dataset to downsample it to a lower resolution. This is particularly necessary in web-based raster applications, where limitations such as bandwidth and other resources prevent efficient processing of the original raster datasets. For smooth interactive panning, browsers load the image in tiles and in quantities larger than what is actually displayed. Zooming far out results in large scale factors, meaning that large amounts of data must be moved to deliver minimal results.
Current database technology for GIS and remote-sensing imaging applications employs multi-scale image pyramids to improve the performance of scaling operations on 2D raster images [51, 70, 82]. The image pyramid technique resamples the original dataset into a number of copies, each at a coarser resolution (Fig. 2.4). The pyramid consists of a finite number of levels that differ in scale by a fixed step factor; the levels are much smaller in size than the original dataset but adequate for visualization at a lower scale (zoom ratio). Common practice is to construct pyramid levels in powers of 2, yielding scale factors 2, 4, 8, 16, 32, 64, 128, 256, and 512. When more detailed data are needed, or when the original image itself must be accessed, better access speed can be achieved by cutting the original data into smaller pieces, so that only a restricted area of the image, rather than the entire image, is read.
Figure 2.4. Image Pyramids
Pyramid Construction
The construction of pyramid layers requires resampling of the original image cell values. Resampling interpolates cell values, or otherwise assigns values, to the cells of a new raster object. It results in a raster with larger or smaller cells and different dimensions. Resampling changes the scale of an input raster and is used in conjunction with geometric transformation models that change the internal geometry of a raster. The following are the most popular interpolation methods [34]:
• Nearest neighbor is the resampling technique of choice for discrete (categorical) data, since it does not alter the values of the input cells [64]. After the center of an output cell is located on the input raster, nearest-neighbor assignment determines the closest cell center on the input raster and assigns that cell's value to the cell on the output raster.
• Linear interpolation is used to interpolate along value curves. It assumes that cell values vary in proportion to distance along a value segment: v = a + bx. Linear interpolation may be used to interpolate feature attribute values along a line segment connecting any two point-value pairs.
• Bilinear interpolation is used to interpolate cell values at direct positions within a quadrilateral grid. It assumes that feature attribute values vary as a bilinear function of position within the grid cell: v = a + bx + cy + dxy. Given a direct position p in a grid cell whose vertices are V, V+V_1, V+V_2, and V+V_1+V_2, where V_1 and V_2 are the offset vectors of the grid, and with cell values v_1, v_2, v_3, and v_4 at these vertices, respectively, there are unique numbers i and j, with 0 ≤ i ≤ 1 and 0 ≤ j ≤ 1, such that p = V + iV_1 + jV_2. The cell value at p is:
v = (1−i)(1−j)v_1 + i(1−j)v_2 + j(1−i)v_3 + ij v_4.
Since the values of output cells are calculated according to the relative positions and values of input cells, bilinear interpolation is preferred for data where the value assigned to a cell is determined by its location relative to a known point or phenomenon (that is, continuous surfaces). Elevation, slope, intensity of noise from an airport, and salinity of groundwater near an estuary are phenomena represented as continuous surfaces and are most appropriately resampled using bilinear interpolation.
• Quadratic interpolation is used to interpolate cell values along curves. It assumes that cell values vary as a quadratic function of distance along a value segment: v = a + bx + cx^2, where a is the value of a cell at the start of a value segment and v is the value of a cell at distance x along the curve from the start. Three point-value pairs are needed to provide control values for calculating the coefficients of the function.
• Cubic interpolation is used to interpolate cell values along curves. It assumes that cell values vary as a cubic function of distance along a value segment: v = a + bx + cx^2 + dx^3, where a is the value of a cell at the start of a value segment and v is the value of a cell at distance x along the curve from the start. Four point-value pairs are needed to provide control values for calculating the coefficients of the function.
Cubic convolution tends to sharpen the edges of the data more than bilinear interpolation, since more cells are involved in the calculation of the output values.
Pyramid Evaluation
During the evaluation of a scaling operation with a target scale factor s, the pyramid level with the largest scale factor s′ such that s′ ≤ s is determined. This level is loaded, and an adjustment is then made by scaling the resulting image by a factor of s/s′. If, for example, scaling by s = 11 is required, then pyramid level 3 with scale factor s′ = 8 is chosen, requiring a residual scaling of 11/8 = 1.375 and thereby touching only 1/64 of what would be read without a pyramid.
The computational complexity of a scaling operation depends on the chosen resampling method. For example, nearest-neighbor resampling considers the closest cell center of the input raster and assigns the value of that cell to the corresponding cell on the output raster. Other resampling methods, such as bilinear and cubic interpolation, consider a subset of cells to calculate each of the cell values of the output raster. Fig. 2.5 shows three common options for interpolating output cell values. Note that the bold outline (center image) indicates the current target cell for which a value is being interpolated.
(a) Portion of original raster. (b) Portion of output raster. (c) Input cells used by common resampling methods.
Figure 2.5. Nearest Neighbor, Bilinear and Cubic Interpolation Methods
A characteristic of the pyramid approach is that it increases the size of a raster dataset by approximately 33 percent, because the additional reduced-resolution representations are stored in the system together with the original dataset. This overhead is offset, however, by the improved response times obtained in return. The choice of resampling method for constructing the pyramid is influenced by the data characteristics and the type of analysis performed on the data. For example, the visual appearance of remote-sensing imagery is best with nearest-neighbor resampling, whereas scientific interpretation may require cubic interpolation. Rasters representing categorical data, e.g., land use data, do not allow interpolation, since it is important that the original data values remain unchanged; hence only nearest-neighbor resampling can be applied [64]. The reason categorical data should not be interpolated is that intermediate values cannot be derived with meaningful results. For example, soil type data cannot be interpolated, since soil types 14 and 15 cannot sensibly be averaged to derive a soil type 14.5. Creating pyramids for several different resampling methods is not efficient, due to the additional resources required for storage and maintenance. Thus, the hard-wired resampling approach imposes significant flexibility limitations on users when analytic objectives diverge.
Fast retrieval of raster image datasets has also been investigated in distributed database systems. Kitamoto [14] proposed a caching mechanism that allows two-dimensional satellite imagery to be cached at minimum resolution to provide a coarse view of the images in distributed satellite image databases. The cache management problem is treated as a knapsack problem [14], where the relevance and size of the data determine whether the data will be cached or not. Additionally, access patterns influence the relevance of the data: the frequency of requests for a given image and its resulting popularity rank are included in the strategy for cache selection. Prediction of user access patterns is not considered, however.
More recently, methods exploiting the capabilities of modern graphics hardware have been applied to the organization and processing of large amounts of satellite imagery. For example, Boettger et al. presented a method based on the concepts of perspective and complex logarithm [90] for the visualization and navigation of satellite and aerial imagery [50]. Datasets are decomposed into tiles of different sizes and resolution levels according to a pre-defined area of interest. Tiles closer to the center of interest have higher resolution, whereas low-resolution tiles are created for parts further away. The resulting tiles are indexed and cached in the memory of the graphics hardware, enabling quick access to the area of interest at the best available resolution. When the center of interest changes, tiles not yet available in graphics memory are loaded. Based on the assumption that the graphics memory offers more space than needed, the cache contains not only the tiles that conform to the area of interest, but also those that will presumably be needed in the future.
2.1.6 Pre-Aggregation Beyond 2D
Geographic phenomena can be examined at different granularities, including different spatial perspectives and temporal views. Earth remote-sensing imagery can be treated as time-series data to study and track changes over time. For example, a user looking at changes in vegetation patterns over a certain region during the past 10 years can see their effect on the regional maps over that time period. Fig. 2.6 shows various instances of scaling operations on 3D image time series. Figure 2.6(a) shows the original dataset, which consists of two spatial dimensions (dim 1, dim 2) and one temporal dimension (dim 3). Figure 2.6(b) shows the original dataset scaled down along the two spatial dimensions. Figure 2.6(c) shows a scaling operation along the time dimension of the original dataset. Figure 2.6(d) shows the original dataset scaled down in both the spatial and temporal dimensions.
Shifts in temporal detail have been studied in various application domains [18, 22, 43]. At the time of this writing, there is little support for zooming with respect to time in GIS technology: the focus has been on studying such alterations with respect to the geometric (vector) properties of objects [54, 58, 59].
Datasets in environmental observation and climate modeling are often defined over a 4D spatio-temporal space of the form (x, y, z, t), possibly extended with topology relationships. Scaling operations are also critical for these kinds of applications due to the size and dimensionality of the data. Extremely large volumes of data are generated during climate simulations; while only one part might be needed for a specific data analysis, huge data volumes are moved. This is particularly true for time-series data analysis. At the time of this writing, however, 4D scaling operations are not supported in GIS and remote-sensing imaging applications.
(a) 3D dataset. (b) 3D dataset scaled down along dim 1 and dim 2 by a factor of 2. (c) 3D dataset scaled down along dim 3 by a factor of 4. (d) 3D dataset scaled down along all dimensions by a factor of 2.
Figure 2.6. 3D Scaling Operations on Time-Series Imagery Datasets
2.1.7 Summary
Array database theory is gradually entering its consolidation phase. The notion of arrays as functions mapping points of some hypercube-shaped domain to values of some range set is commonly accepted. Two main modeling paradigms are used: calculus and algebra. Multidimensional data models embed arrays into the relational world, either by providing conceptual stubs, like Array Algebra, or by adding relational capabilities explicitly, such as AQL and RAM. Notably, aggregate query processing plays a critical role given the large volumes of the arrays. Our study shows that pre-aggregation techniques focus only on 2D datasets, and that support is limited to one particular operation: scaling. We identify the pyramid approach as the most popular method for speeding up scaling operations on 2D datasets, despite its known limitations such as hard-wired interpolation and lack of support for higher-dimensional datasets. Advances in graphics hardware are enabling quicker and more accurate visualization and navigation capabilities for raster imagery; however, little work has been reported on how array database technology is exploiting these hardware advances. A critical gap with respect to pre-aggregation is the lack of support for aggregate operations other than 2D scaling.
2.2 On-Line Analytical Processing (OLAP)

Data warehousing/OLAP is an application domain where complex multidimensional aggregates on large databases have been studied intensively. Typically, a data warehouse collects business data from one or multiple sources so that the desired financial, marketing, and business analyses can be performed. These kinds of analyses detect trends and anomalies, make projections, and support business decisions [41]. When such analysis predominantly involves aggregate queries, it is called on-line analytical processing, or OLAP [38, 39]. To understand the mechanism of pre-computation, the following subsections review different approaches to structuring multidimensional data, storage mechanisms, and operations in OLAP.
2.2.1 OLAP Data Model

The multidimensional OLAP model begins with the observation that the factors that influence decision-making processes are related to enterprise-specific facts, such as sales, shipments, hospital admissions, surgeries, and so on [68]. Instances of a fact correspond to events that occur: for example, every sale or shipment carried out is an event. Each fact is described by the values of a set of relevant measures providing quantitative descriptions of events; e.g., sales receipts, amounts shipped, hospital admission costs, and surgery times are all measures.
In OLAP, information is viewed conceptually as cubes that consist of descriptive categories (dimensions) and quantitative values (measures) [26, 81, 69, 83]. In the scientific literature, measures are at times called variables, metrics, properties, attributes, or indicators. Figure 2.7 illustrates a 3D OLAP data cube where business events (facts) are mapped at the intersection of a specific combination of dimensions.
Different attributes along each dimension are often organized in hierarchical structures that determine the different levels at which data can be further analyzed [26]. For example, within the time dimension, one may have levels composed of years, months, and days. Similarly, within the geography dimension, one may have levels such as country, region, state/province, or city. Hierarchical structures are used to infer summarization (aggregation), that is, whether an aggregate view (query) defined for some category can be correctly derived from a set of precomputed views defined for other categories.
Figure 2.7. OLAP Data Cube
2.2.2 OLAP Operations

OLAP includes a set of operations for the manipulation of dimensional data organized in multiple levels of abstraction. Basic OLAP operations are roll-up, drill-down, slice, dice, and pivot [44]. A roll-up (aggregation) operation computes higher aggregations from lower aggregations or base facts according to their hierarchies, whereas drill-down (disaggregation) is an analytic technique whereby the user navigates among levels of data ranging from the most summarized/aggregated to the most detailed. Typical OLAP aggregate functions include average, maximum, minimum, count, and sum. Drilling paths may be defined by the hierarchies within dimensions or by other relationships, dynamic within or between dimensions. A slice consists of the selection of a smaller data cube, or even the reduction of a multidimensional data cube to fewer dimensions, by a point restriction in some dimension. The dice operation works similarly to the slice except that it performs a selection on two or more dimensions. Figure 2.8 provides a graphical description of these operations.
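These cube operations can also be illustrated on a small numeric cube. The following sketch uses NumPy on an invented 12-month x 3-product x 2-store sales cube; all shapes and dimension names are assumptions for illustration, not tied to any particular OLAP product:

```python
import numpy as np

# Hypothetical sales cube: 12 months x 3 products x 2 stores.
cube = np.arange(12 * 3 * 2, dtype=float).reshape(12, 3, 2)

# Roll-up along the time hierarchy: aggregate 3 months into each quarter.
quarters = cube.reshape(4, 3, 3, 2).sum(axis=1)   # shape (4, 3, 2)

# Drill-down navigates back to the detailed level (here: the monthly cube).

# Slice: a point restriction on one dimension reduces dimensionality.
product0 = cube[:, 0, :]                          # shape (12, 2)

# Dice: range selections on two or more dimensions keep dimensionality.
sub = cube[0:6, 0:2, :]                           # shape (6, 2, 2)
```

Note how roll-up changes the granularity of a dimension, while slice and dice only restrict the domain without re-aggregating values.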
2.2.3 OLAP Architectures

Figure 2.9 shows the different approaches for the implementation of OLAP functionality: Multidimensional OLAP (MOLAP), Relational OLAP (ROLAP), and Hybrid OLAP (HOLAP). These approaches offer a common view in the form of data cubes, which is independent of how the data is stored.

Figure 2.8. Typical OLAP Cube Operations

Figure 2.9. OLAP Approaches: MOLAP, ROLAP, and HOLAP
MOLAP

MOLAP maintains data in a multidimensional matrix based on a non-relational, specialized storage structure [37]; see Fig. 2.10(a). While building the storage structure, selected aggregations associated with all possible roll-ups are precomputed and stored [92]. Thus, roll-up and drill-down operations are executed in interactive time. Products such as Oracle Essbase, IBM Cognos PowerPlay, and the open-source Palo have adopted this approach.
A MOLAP system is based on an ad-hoc logical model that directly represents multidimensional data and its applicable operations. The underlying multidimensional database physically stores data as arrays, and access to it is positional [68]. Grid-files [53, 55], R*-trees [71], and UB-trees [84] are among the techniques used for that purpose.

The main advantage of this approach is that it contains pre-computed aggregate values that offer a very compact and efficient way to retrieve answers for specific
aggregate queries [68]. One difficulty that MOLAP poses, however, pertains to the sparseness of the data. Sparseness means that many events did not take place, and valuable processing time is spent adding up zeros [91]. For example, a company may not sell every item every day in every store, so no values appear at the intersections where products are not sold in a particular region at a particular time. On the other hand, MOLAP can be much faster for applications where subsets of the data cube are dense [100]. Another limitation of this approach is that the computation of a cube requires a complex aggregate query across all data in a warehouse. Though it is possible to incrementally update cubes as new data arrives, it is impractical to dynamically create new cubes to answer ad-hoc queries [68].
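The cost of sparseness can be seen in a minimal sketch (shapes and values invented for illustration): a dense positional store must touch every cell when aggregating, while a sparse map of recorded events touches only the cells where something actually happened.

```python
import numpy as np

# Dense MOLAP-style cell store: 100 products x 50 stores x 365 days,
# but only two sales events ever took place.
dense = np.zeros((100, 50, 365))
dense[3, 7, 10] = 19.99
dense[42, 1, 200] = 5.50

# Sparse alternative: keep only the cells of events that occurred.
sparse = {(3, 7, 10): 19.99, (42, 1, 200): 5.50}

# Summing the dense cube adds up 1,825,000 cells, almost all zeros;
# the sparse representation adds exactly two values.
total_dense = dense.sum()
total_sparse = sum(sparse.values())
```

Both representations yield the same total, which is why dense positional storage pays off only where the cube, or a subset of it, is densely populated.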
Figure 2.10. MOLAP Storage Scheme
ROLAP

In ROLAP, the underlying data is stored in a relational database; see Fig. 2.11(a). The relational model, however, does not include the concepts of dimension, measure, and hierarchy. Thus, specific types of schemata must be created so the multidimensional model can be represented in terms of basic relational elements such as attributes, relations, and integrity constraints [68]. Such a representation is usually done with a star schema, although the snowflake schema is also often adopted.
ROLAP implementations can handle large amounts of data and leverage all functionalities of the relational database [72]. Disadvantages are that overall performance is slow and that each ROLAP report represents an SQL query, with the limitations that entails. ROLAP vendors have tried to mitigate this problem by including out-of-the-box complex functions in their product offerings and by providing users the capability of defining their own functions. Another problem with ROLAP implementations results from the performance hit caused by costly join operations between large tables [68]. To overcome this issue, fact tables in data warehouses are usually de-normalized. Substantial performance gains can be achieved through the materialization of derived tables (views) that store aggregate data used for typical OLAP queries.
Figure 2.11. ROLAP Storage Scheme
Figure 2.12 shows the formulation of a typical query in both ROLAP and MOLAP. The query yields sales information for a specific product sold in a particular city by a given vendor. The queries are formulated according to the syntax of Oracle 10g. Note the considerable difference in length between the two formulations.

Figure 2.12. Typical Query as Expressed in (a) ROLAP and (b) MOLAP Systems
HOLAP

The intermediate architecture type, HOLAP, combines the advantages offered by ROLAP and MOLAP. It takes from ROLAP implementations the standardization level and the ability to manage large amounts of data, and from MOLAP systems their typical query speed. For summary-type information, HOLAP leverages cube technology; for drilling down into details, it uses the ROLAP model. In a HOLAP architecture, the largest amount of data should be stored in an RDBMS to avoid the problems caused by sparsity, and a multidimensional system should store only the information users most frequently need to access [68]. If that information is not enough to answer a query, the system transparently accesses the data managed by the relational system.
2.2.4 OLAP Pre-Aggregation

OLAP systems require fast, interactive multidimensional analysis of aggregates. To fulfill this requirement, database systems frequently pre-compute aggregate views on some subset of dimensions and their corresponding hierarchies. Virtually all OLAP products resort to some degree of pre-computation of these aggregates, a process known as pre-aggregation. OLAP pre-aggregation techniques have been proven to speed up aggregate queries by several orders of magnitude in business applications [31, 41]. A full pre-aggregation of all possible combinations of aggregate queries, however, is not considered feasible because it often exceeds the available storage limit and incurs a high maintenance cost. Therefore, modern OLAP systems adopt a partial pre-aggregation approach where only a set of aggregates is materialized, which can then be re-used for efficiently computing other aggregates.
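A minimal sketch of this reuse (data and granularities invented for illustration): materializing only a monthly SUM view still lets coarser queries, such as a yearly total, be answered without rescanning the base data, because SUM distributes over the time hierarchy.

```python
import numpy as np

# Base data: daily sales for one year (12 idealized 30-day months).
daily = np.arange(360, dtype=float)

# Partial pre-aggregation: materialize only the monthly SUM view.
monthly_view = daily.reshape(12, 30).sum(axis=1)

# A yearly query is rewritten against the view: 12 reads instead of 360.
yearly_from_view = monthly_view.sum()
yearly_from_base = daily.sum()
```

The two totals coincide; the saving grows with the ratio of base cells to materialized cells, which is exactly what makes partial materialization attractive.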
Pre-aggregation techniques consist of three inter-related processes: view selection, query rewriting, and view maintenance. A view is a derived relation defined in terms of base relations. Views can be materialized by storing the tuples of a view in the database, as was first investigated in the 1980s [36]. Like a cache, a materialized view provides fast access to its data; also like a cache, it may get dirty whenever its underlying base relations are updated. The process of updating a materialized view in response to changes to its base data is called view maintenance [12].
View Selection

Gupta et al. [13] proposed a framework that shows how to use materialized views to help answer aggregate queries. The framework provides a set of query rewriting rules to determine which materialized aggregate views can be employed to answer aggregate queries. An algorithm uses these rules to transform a query tree into an equivalent tree with some or all base relations replaced by materialized views. Thus, a query optimizer can choose the most efficient tree and provide the best query response time. Harinarayan et al. [92] investigated the issue of how to select views for materialization under storage space constraints so that the average query cost is minimal.
To meet changing user needs, several dynamic pre-aggregation approaches have been proposed. In principle, views may be either selected on demand or pre-selected using some prediction strategy. For applications where storage space is a constraint, replacement algorithms identify those views that can be replaced with new selections [60]. Kotidis et al. [97] introduced a dynamic view selection approach for Multidimensional Range Queries (MRQ), known as slice queries in OLAP, which uses an on-demand fetching strategy. Within this approach, the level of detail, or granularity, is a compromise between materializing many small, highly specific queries and materializing a few large queries from which incoming queries are then answered at each stage. This approach, however, does not take user access patterns into account before making selections.
The first work to consider user access information when evaluating potential queries to be materialized is presented in [26], where the author introduced PROMISE, an approach that predicts the structure and value of the next query based on the current query. Yao et al. [99] proposed a different approach for the materialization of dynamic views: a set of batch queries is rewritten using certain canonical queries so that the total cost of execution can be reduced by using intermediate results to answer queries appearing later in the batch. This approach requires all queries to be precisely known beforehand, and though it might work well in a particular database scenario, it is of limited use in dynamic OLAP, where it is extremely difficult to accurately predict the exact nature of future queries.
View Maintenance

In most cases it is wasteful to maintain a view by recomputing it from scratch. Materialized views are therefore maintained using an incremental approach [11]: only the changes to be propagated to the materialized view are computed, using the changes of the source relations [1, 33, 89]. To date, view maintenance has been investigated along four dimensions [11]:

• Information Dimension: Focuses on accessing the information required for view maintenance, such as base relations and the materialized view.

• Modification Dimension: Focuses on the kinds of modifications, e.g., insertions and deletions, that a view maintenance algorithm can handle.

• Language Dimension: Addresses the problems related to the language of the views supported by the view maintenance algorithm. That is, what is the language of the views that can be maintained by the algorithm? How are views expressed? Does the algorithm allow duplicates?

• Instance Dimension: Considers the applicability of the algorithm to all or a specific set of instances of the database.
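The incremental idea can be sketched as follows (relation layout and names invented for illustration): rather than recomputing a SUM view from its base relation, only the delta of insertions and deletions is propagated to the view.

```python
# Base relation: (product, month) -> sales amount.
base = {("beer", "jan"): 100.0, ("wine", "jan"): 40.0}

# Materialized SUM view: total sales per product.
view = {}
for (product, _), amount in base.items():
    view[product] = view.get(product, 0.0) + amount

def apply_delta(view, inserts, deletes):
    """Propagate only the changes, not the whole base relation."""
    for (product, _), amount in inserts.items():
        view[product] = view.get(product, 0.0) + amount
    for (product, _), amount in deletes.items():
        view[product] = view.get(product, 0.0) - amount
    return view

# One insertion and one deletion arrive at the source relation.
view = apply_delta(view,
                   inserts={("beer", "feb"): 25.0},
                   deletes={("wine", "jan"): 40.0})
```

Note that SUM is self-maintainable under insertions and deletions given only the delta; holistic functions such as MEDIAN would require access to the base data.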
View maintenance cost is the sum of the costs of propagating each base relation change to the affected materialized views. The sum can be weighted, where each weight indicates the frequency of propagation of the changes of the associated source relation. When a base relation affects more than one materialized view, multiple maintenance expressions must be evaluated. Multi-query optimization techniques can be used to detect common sub-expressions among the maintenance expressions so that an efficient global evaluation plan for them can be achieved [61, 62].
Numerous methods have been developed for materialized view maintenance in conventional database systems. Zhuge et al. [101] introduced the Eager Compensating Algorithm (ECA), based on previous incremental view maintenance algorithms and on compensating queries used to eliminate anomalies. In [102], the authors define the task of keeping multiple views consistent with each other as the multiple-view consistency problem. Further research by the same authors [102, 103] considers data warehouse views defined on base tables located in different data sources, i.e., if a view involves n base tables, then n data sources are also involved.

A common characteristic of the early approaches to view maintenance is the considerable need for accessing base relations, which in most cases results in performance degradation. Improving the efficiency of view maintenance techniques has been a topic of active research in the database community [15, 65, 85, 98].
Spatial OLAP (SOLAP)

The multidimensional approach used by data warehouses and OLAP does not support array data types or spatial data types such as points, lines, or polygons. Following the development trends of data warehouse and data mining techniques, Stefanovic et al. [52] proposed the construction of a spatial data warehouse to enable on-line data analysis in spatial-information repositories. The authors used a star/snowflake model to build a spatial data cube consisting of both spatial and non-spatial dimensions and measures: the data cube shown in Fig. 2.13 consists of one spatial dimension (region) and three non-spatial dimensions (precipitation, temperature, and time).

Figure 2.13. Star Model of a Spatial Warehouse
Current research in spatial data management focuses on querying spatial data, particularly regarding the improvement of aggregate query performance [57] for spatial-vector data structures. Alas, little attention has been given to spatial-raster data [42, 73, 86]. Support for spatial-raster data typically consists of creating a spatial-raster cube from information in the metadata file (such as size, level, width, height, date of creation, format, and location) [28, 94].
Vega et al. [40] presented a model to analyze and compare existing techniques for the evaluation of aggregate queries on spatial, temporal, and spatio-temporal data. The study shows that existing aggregate computation techniques rely on some form of pre-aggregation and that support is restricted to distributive aggregate functions such as COUNT, SUM, and MAX. Additionally, the authors identify several important needs concerning aggregate computation. First, they discuss the need to develop further and more substantial techniques to support holistic aggregate functions, e.g., MEDIAN and RANK, and to better support selective predicates. The second observation pertains to the lack of support for queries that need to be efficiently evaluated at every granule in time. Existing aggregate computation techniques focus only on spatial objects such as lines, points, and polygons, but do not consider aggregate computation on data grid (array) structures.
2.3 Discussion

Query performance is a major concern underlying the design of databases in both business and remote-sensing imaging applications. While there are some valuable research results in the realm of pre-aggregation techniques to support query processing in business and statistical applications, little has been done in the field of array databases.
The question therefore arises: what distinguishes array data from traditional data types such that it cannot be fully supported by relational databases and thus take advantage of advanced technologies such as OLAP? OLAP from its very conception was designed to assist in the decision-making process of business applications, where business perspectives, such as products and/or stores, represented the dimensions of the data cube. And while the different columns in a data cube are usually called dimensions, they generally cannot be considered a special extent of the entities modeled by the database. Instead, they are regarded as explicit attributes that characterize a particular entity. Some dimensions in a data cube (e.g., CustomerId) are defined over discrete domains which do not have a natural ordering among their values (customer 1000 cannot be considered close to customer 1001). In such cases, any ordering defined for the values in one of these columns is arbitrary [40]. For this reason, existing OLAP solutions and related pre-aggregation techniques cannot be applied to multidimensional arrays, at least not in a straightforward manner.
Recently, however, a new trend in OLAP has gained considerable popularity due to its capability to support geo-spatial data. Spatial OLAP considers the case in which a data cube may have both spatial and non-spatial dimensions. However, spatial OLAP focuses mainly on spatial-vector data, and so far little support has been provided for spatial-raster data in terms of selective materialization for the optimization of aggregates. Support is limited to those operations that can be constructed from the metadata available for the raster, and does not extend to improving the computation of aggregate operations over the values of raster datasets.
At present, pre-aggregation support in array databases is limited. Only one comparatively simple pre-aggregation technique has been used, namely image pyramids. The limitation of this technique to two-dimensional datasets and hard-wired interpolation calls for the development of more flexible and efficient techniques.
From our study of data modeling, storage techniques, and operations in OLAP and remote-sensing imaging applications, we have observed the following similarities:

• Array databases and OLAP systems typically employ multidimensional data models to organize their data.

• Both application domains handle large volumes of multidimensional data.

• Operations convey a high degree of similarity; for instance, a roll-up (aggregate) operation in OLAP, such as computing the weekly sales per product, is very similar to scaling a satellite image by a factor of seven along the x axis. Figure 2.14 illustrates this similarity.
Figure 2.14. Comparison of Roll-Up and Scaling Operations: (a) scaling operation; (b) roll-up operation
• Both application domains use pre-aggregation approaches to speed up query processing. OLAP pre-aggregation techniques support a wide range of aggregate operations and speed up query processing by several orders of magnitude (the last benchmark reported factors of up to 100 times [29, 88]). Scaling of 2D datasets always uses the same scale factor on each dimension to maintain a coherent view, whereas for datasets of higher dimensionality the scale factors are independent. Scaling resembles a primitive form of pre-aggregation in comparison to existing OLAP pre-aggregation techniques.

• While data in OLAP applications is sparsely populated, remote-sensing imagery usually is densely populated (100%). There are no guidelines stating when an OLAP data cube is considered sparse or dense; however, a data cube containing 30 percent empty cells is usually treated with sparsity-handling techniques in most OLAP systems.
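The roll-up/scaling analogy of Figure 2.14 can be made concrete with a small sketch (array shape and factor invented for illustration): grouping seven daily cells into one weekly cell is structurally the same operation as scaling an image by a factor of seven along one axis; only the aggregation applied differs.

```python
import numpy as np

# A toy 2-D array: 14 "days" x 4 "products" (or 14 x 4 pixels).
data = np.arange(14 * 4, dtype=float).reshape(14, 4)

# OLAP roll-up: aggregate 7 daily cells into each weekly cell (SUM).
weekly = data.reshape(2, 7, 4).sum(axis=1)    # shape (2, 4)

# Image scaling by factor 7 along the x axis (mean resampling).
scaled = data.reshape(2, 7, 4).mean(axis=1)   # shape (2, 4)
```

Both results group the same cells; the roll-up differs from the mean-based scaling only by the constant factor of the group size.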
Furthermore, when compared to well-known OLAP pre-aggregation techniques, GIS image pyramids differ in several respects:

• Image pyramids are constrained to 2D imagery. To the best of our knowledge, there is no generalization of pyramids to n-D.

• The x and y axes are always zoomed by the same scalar factor s in the 2D zoom vector (s, s). Image pyramids exploit this by offering pre-aggregates only along a scalar range; in this respect, image pyramids actually are 1D pre-aggregates.

• Several interpolation methods are used for resampling during scaling. Some techniques are standardized [48]; they include nearest-neighbor, bi-linear, bi-quadratic, bi-cubic, and barycentric. The two scaling steps incurred for image pyramids (construction of the pyramid level and the rest scaling) must be done using the same interpolation technique to achieve valid results. In OLAP, summation during roll-up corresponds to linear interpolation in imaging.

• Scale factors are continuous, as opposed to the discrete hierarchy levels in OLAP. It is, therefore, impossible to materialize all possible pre-aggregates.
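For concreteness, a pyramid over a toy image can be sketched as follows (sizes and the mean-resampling choice are assumptions for illustration): each level scales both axes by the same factor 2, so the whole pyramid is indexed by a single scalar scale, and an arbitrary requested factor is served from the nearest coarser level plus a residual "rest scaling".

```python
import numpy as np

def build_pyramid(img, levels):
    """Image pyramid via repeated 2x2 block-mean downscaling; both axes
    always share the same scale factor (s, s)."""
    out = [img]
    for _ in range(levels):
        h, w = out[-1].shape
        out.append(out[-1].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return out

img = np.arange(8 * 8, dtype=float).reshape(8, 8)
pyramid = build_pyramid(img, 3)   # levels: 8x8, 4x4, 2x2, 1x1
```

Because every level uses the same (mean) interpolation, a rest scaling applied on top of a pyramid level remains consistent with scaling the base image directly.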
Based on these observations, this thesis aims to systematically carry over results from OLAP to array databases and to provide pre-aggregation support not only for queries using basic aggregate functions, but also for more complex operations such as scaling. As a preliminary and fundamental step, it is necessary to have a clear understanding of the various operations performed on remote-sensing imagery and to identify those that involve aggregation computation. The next chapter addresses this issue in more detail.
Chapter 3

Fundamental Geo-Raster Operations in GIS and Remote-sensing Applications

This chapter describes a set of fundamental operations in GIS and remote-sensing imaging applications. For rigorous comparison and classification, these operations are discussed by means of a sound mathematical framework. The aim is to identify those operations requiring data summarization that may benefit from a pre-aggregation approach. To that end, we use Array Algebra as our modeling framework.
3.1 Array Algebra

The rationale behind the selection of Array Algebra as the modeling framework is grounded in the following observations:

• It is oriented towards multidimensional data in a variety of applications, including imaging.

• It provides the means to formulate a wide variety of operations on multidimensional arrays.

• There are commercial and open-source implementations of Array Algebra that show the soundness and maturity of the framework.

The expressive power of Array Algebra, the simplicity of its operators, and its successful implementation in both commercial and scientific applications make it suitable for our investigation.
Essentially, the algebra consists of three operators: an array constructor, a generalized aggregation, and a multidimensional sorter [75, 76]. Array Algebra is minimal in the sense that no subset of its operations exhibits the same expressive power. It is safe in evaluation: every formula can be evaluated in a finite number of steps. It is closed in its application: any resulting expression is either a scalar or an array.
Arrays are represented as functions mapping n-dimensional points from discrete Euclidean space to values. The spatial domain of an array is defined as a finite set of n-dimensional points in Euclidean space forming a hypercube with boundaries parallel to the coordinate system axes.

Let X ⊆ Z^d be a spatial domain and F a value set, i.e., a homogeneous algebra. Then, an F-valued d-dimensional array over the spatial domain X (a multidimensional array) is defined as:

a : X → F (i.e., a ∈ F^X),
a = {(x, a(x)) : x ∈ X, a(x) ∈ F}

The array elements a(x) are referred to as cells. The auxiliary function sdom(a) denotes the spatial domain of an array a.
3.1.1 Construc<strong>to</strong>r<br />
The MARRAY array construc<strong>to</strong>r allows arrays <strong>to</strong> be defined by indicating a spatial<br />
domain and an expression evaluated for each cell position of the array. An iteration<br />
variable bound <strong>to</strong> a spatial domain is available in the cell expression so that the cell<br />
value depends on its position. Let X be a spatial domain, F a value set, and v a free<br />
identifier. Let e v be an expression with result type F containing zero or more free occurrences<br />
of v as placeholder(s) for an expression with result type X. Then, an array<br />
over spatial domain X with base type F is constructed through:<br />
MARRAY X,v (e v ) = {(x, a(x)) : a(x) = e x , x ∈ X}<br />
A straightforward application of MARRAY is spatio-temporal sub-setting by simply<br />
changing its domain.<br />
Example: For some 2-D grey-scale image a, its cutout to domain [x0:x1, y0:y1] (assumed to lie inside the array) is given by:

MARRAY [x0:x1,y0:y1],p (a[p])
Similarly, trimming produces a cutout of an array with smaller volume but unchanged dimensionality, and a section cuts out a hyperplane with reduced dimensionality.
We can also change an array's values by changing the e_v expression. In the simplest case this expression takes the cell value and modifies it. The following expression adds the cell values of two raster images a and b defined over a common spatial domain X:

a + b = MARRAY X,p (a[p] + b[p])
If we allow the use of all operations known on the base algebra, i.e., on the pixel<br />
type, we immediately obtain a cohort of the following useful operations.
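The MARRAY semantics above can be mimicked with a small, illustrative Python/NumPy sketch: the constructor is simply an evaluation of the cell expression at every point of the domain. The helper name `marray` is ours and is neither Array Algebra nor rasql syntax.

```python
import numpy as np

def marray(domain_shape, cell_expr):
    """Evaluate cell_expr at every index of the domain (MARRAY sketch)."""
    out = np.empty(domain_shape)
    for idx in np.ndindex(*domain_shape):
        out[idx] = cell_expr(idx)
    return out

a = np.arange(12, dtype=float).reshape(3, 4)
b = np.ones((3, 4))

# a + b expressed as a MARRAY over the common domain
summed = marray(a.shape, lambda p: a[p] + b[p])

# cutout to domain [1:2, 0:2] (inclusive bounds, as in the text)
cutout = marray((2, 3), lambda p: a[p[0] + 1, p[1]])
```

In practice an array DBMS evaluates such expressions lazily and tile by tile; the explicit loop only illustrates the cell-wise definition.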
3.1.2 Condenser<br />
The COND array condenser (aggregator) takes the values of an array's cells and combines them through some commutative and associative operation o, thereby obtaining a scalar value. For some free identifier v, a spatial domain X = {x_1, ..., x_n} consisting of n points x_i ∈ Z^d, and e_{a,v} an expression of result type F containing occurrences of an array a and of identifier v, the condense of a by o is defined as:

COND o,X,v (e_{a,v}) := e_{a,x_1} o ... o e_{a,x_n}
Example: Let a be the image defined above, with sdom(a) = [1:m, 1:n]. The average over all pixel intensities in a is then given by:

COND +,sdom(a),p (a[p]) / (m ∗ n) = ( Σ_{x∈[1:m,1:n]} a[x] ) / (m ∗ n)
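The condenser can likewise be sketched as a left fold of a commutative, associative operation over all domain points. The helper `cond` below is a hypothetical illustration in Python/NumPy, not rasql syntax.

```python
import numpy as np
from functools import reduce

def cond(op, domain_shape, cell_expr):
    """Fold cell_expr over all points of the domain (COND sketch)."""
    values = [cell_expr(idx) for idx in np.ndindex(*domain_shape)]
    return reduce(op, values)

a = np.arange(1.0, 13.0).reshape(3, 4)   # a small 3x4 "image"

# sum of all cells via COND with +, then the average
total = cond(lambda x, y: x + y, a.shape, lambda p: a[p])
average = total / a.size
```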
3.1.3 Sorter
The SORT array sorter proceeds along a selected dimension to reorder the corresponding hyperslices. The functional sort_s rearranges a given array along a specified dimension s without changing its value set or spatial domain. To that end, an order-generating function f_{s,a} is provided that associates a sequence position with each (d−1)-dimensional hyperslice. Note that f_{s,a} has all degrees of freedom to assess any of a's cell values for determining the measure value of the hyperslice at hand: it can be a particular cell value in the current hyperslice, the average of all hyperslice values, or the value of one or more neighboring slices. Note also that the sort operator subsumes the relational group by.
The language is recursive in the array expression e v and hence allows arbitrary<br />
nesting of expressions. In the sequel we use the abbreviations introduced above for<br />
nested expressions.<br />
3.2 Geo-Raster Operations<br />
This section presents a set of fundamental operations for Geo-raster data. These<br />
operations have been selected based on an exhaustive literature review of classification<br />
schemes, international standards, and best practices [2, 19, 27, 32, 35, 45, 46, 47, 49].<br />
By examining the Array Algebra operators involved in the computation of the operations, we identify those that require data summarization (aggregation) and therefore may benefit from pre-aggregation.
Queries were executed in a raster database management system (RasDaMan) and formulated in rasql, an SQL-based query language for multidimensional raster databases based on Array Algebra.
3.2.1 Mathematical Operations<br />
The following groups of mathematical operators are distinguished: arithmetic, trigonometric, Boolean, and relational. They operate at the cell level and can be applied
to a single raster or to multiple rasters of numerical type and identical spatial domain. The basic arithmetic operators include addition (+), subtraction (−), multiplication (∗), and division (/). Trigonometric functions perform trigonometric calculations on the values of an input raster: sine (sin), cosine (cos), tangent (tan), or their inverses (arcsin, arccos, arctan). Consider, for example, the following query:
Query 3.2.1. Consider an RGB (red, green, blue) raster image A. Extract the green component from the image, and reduce the contrast by a factor of 2.
With Array Algebra, the query can be computed as follows:

MARRAY sdom(A),i (A.green[i]/2)

Results are shown in Fig. 3.1.
Figure 3.1. Reduction of Contrast in the Green Channel of an RGB Image ((a) original RGB image; (b) green component; (c) output raster)
All or part of a raster image can be manipulated using the rules of Boolean algebra integrated into database query languages such as SQL [2]. Boolean algebra uses logical operators such as and, or, not, and xor to determine whether a particular condition is true or false. These operators are often combined with relational operators: equal (=), not equal (≠), less than (<), less than or equal to (≤), greater than (>), and greater than or equal to (≥). Consider, for example, the following queries:
Query 3.2.2. Given a near-infrared green (NRG) raster image A, highlight the cells<br />
with sufficient near-infrared values.<br />
This query can be answered by imposing a lower bound on the near-infrared intensity and upper bounds on the green and blue intensities. The resulting boolean array is
multiplied by the original image A to show the original cell value where an infrared value prevails, and black otherwise.

MARRAY sdom(A),i (A[i] ∗ ((A[i].nir ≥ 130) and (A[i].green ≤ 110) and (A[i].blue ≤ 140)))

Results are shown in Fig. 3.2.
Figure 3.2. Highlighted Infrared Areas of an NRG Image ((a) original NRG raster; (b) output raster)
Query 3.2.3. Compare the cell values of two 8-bit gray raster images A and B. Create<br />
a new raster where each cell value takes the value of 255 (white pixel) when the cell<br />
values of A and B are identical.<br />
The algebraic formulation is as follows:

MARRAY sdom(A),i ((A[i] = B[i]) ∗ 255)

Results are shown in Fig. 3.3.
Figure 3.3. Cells of Rasters A and B with Equal Values ((a) grey 8-bit raster A; (b) grey 8-bit raster B; (c) output raster image)
Reclassification<br />
Reclassification is a generalization technique used to re-assign cell values in classified rasters. For example, consider the query below, where reclassification is based on a land suitability study.
Query 3.2.4. Given an 8-bit gray image A, map each cell value to its corresponding suitability class shown in Table 3.2¹, and decrease the contrast of the image according to the decrease factor.
The query can be answered as follows:

MARRAY sdom(A),g (((A[g] > 180) ∗ A[g]/2) +
    (((A[g] ≥ 130) and (A[g] < 180)) ∗ A[g]/3) +
    (((A[g] ≥ 80) and (A[g] < 130)) ∗ A[g]/4) +
    ((A[g] < 80) ∗ A[g]/5))

Results are shown in Fig. 3.4.
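The same mask-and-divide pattern can be checked with a small NumPy sketch (sample values are ours; integer division is assumed, matching the 8-bit setting):

```python
import numpy as np

A = np.array([[200, 150], [100, 50]], dtype=np.int32)

# one divisor per capability class, following Table 3.2;
# each boolean mask is 0/1, so exactly one term survives per cell
out = ((A > 180) * (A // 2)
       + ((A >= 130) & (A < 180)) * (A // 3)
       + ((A >= 80) & (A < 130)) * (A // 4)
       + (A < 80) * (A // 5))
```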
1 Classification taken from http://www.fao.org/docrep/X5310E/X5310E00.htm
Table 3.1. UNO and FAO Suitability Classifications

Classification  Description
S1              Highly suitable
S2              Moderately suitable
S3              Marginally suitable
NS              Not suitable
Table 3.2. Capability Indexes for Different Capability Classes

Capability index  Class  Suitability class  Decrease factor
> 180             I      S1                 2
130-180           II     S2                 3
80-130            III    S3                 4
< 80              IV     NS                 5
Figure 3.4. Re-Classification of the Cell Values of a Raster Image ((a) original raster; (b) output raster)
Proximity<br />
The proximity operation creates a new raster where each cell value contains the distance to a specified reference point. As an example consider the following query:
Query 3.2.5. Estimate the proximity of each cell of the raster image shown in Fig. 3.4(a)<br />
<strong>to</strong> the reference cell located in [30,5].<br />
The computation of this query can be formulated as:

MARRAY sdom(A),(g,h) (|g − 30| + |h − 5|)

Results are shown in Fig. 3.5.
Figure 3.5. Computation of a Proximity Operation<br />
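A NumPy sketch of the proximity formula, using Manhattan distance to the reference cell [30, 5] (the raster extent below is an assumption for illustration):

```python
import numpy as np

rows, cols = 40, 10
g, h = np.ogrid[0:rows, 0:cols]   # open grids broadcast to the full domain

# Manhattan distance of every cell to the reference cell [30, 5]
prox = np.abs(g - 30) + np.abs(h - 5)
```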
Overlay<br />
The overlay operation refers to the process of stacking two or more identical georeferenced rasters on top of each other so that each position in the covered area can be analyzed in terms of these data. The overlay operation can be solved using arithmetic and relational operators. For example, consider the following query:
Query 3.2.6. Given two 8-bit gray raster images A and B with identical spatial domain,<br />
perform an overlay operation. That is, make a cell-wise comparison between<br />
the two rasters. Each cell value of the new array must take the maximum cell value<br />
between A and B.<br />
The computation of this query can be formulated as:<br />
MARRAY sdom(A),g (((A[g] > B[g]) ∗ A[g]) + ((A[g] ≤ B[g]) ∗ B[g]))<br />
The above formulation works as follows. The left-hand operand of the addition tests whether the cell value of array A is greater than the cell value of B. The result of this test is either 0 (condition not satisfied) or 1 (condition satisfied), which in
turn is multiplied by the cell value of array A. Thus, the left-hand operand evaluates to either 0 or the cell value of A. Similarly, the right-hand operand verifies whether the cell value of A is less than or equal to the cell value of B; the resulting 0 or 1 is multiplied by the cell value of B. Note that exactly one of the two operands is non-zero for each cell, and that its value corresponds to the larger of the two cell values of A and B. Results are shown in Fig. 3.6.
Figure 3.6. Computation of an Overlay Operation ((a) 8-bit gray raster A; (b) 8-bit gray raster B; (c) output raster)
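The boolean-mask trick used in the overlay formula can be verified with a short NumPy sketch (sample values are ours):

```python
import numpy as np

A = np.array([[10, 200], [90, 40]], dtype=np.int32)
B = np.array([[50, 120], [90, 60]], dtype=np.int32)

# (A > B) and (A <= B) are 0/1 masks; exactly one term survives per cell,
# so the sum equals the cell-wise maximum of A and B
overlay = (A > B) * A + (A <= B) * B
```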
An overlay operation can also apply a different condition when determining the cell values of the output array. For example:
Query 3.2.7. Compute an overlay operation between rasters A and B. That is, compare<br />
cell-wise the two rasters: if the cell value of B is non-zero, then set this value<br />
as the cell value of the corresponding cell in array A. Otherwise, the cell value of A<br />
remains unchanged.<br />
The query can be answered as follows:<br />
MARRAY sdom(A),g (((B[g] > 0) ∗ B[g]) + ((B[g] ≤ 0) ∗ A[g]))<br />
Results are shown in Fig. 3.7.<br />
3.2.2 Aggregation Operations

We now present the modeling of operations that consist of one or more aggregate functions. An aggregate function takes a collection of cells and returns a single value that summarizes the information contained in the set of cells. The SQL standard provides a variety of aggregate functions: SQL-92 includes count, sum, average, min,
Figure 3.7. Computation of an Overlay Operation Considering Values Greater than Zero ((a) grey 8-bit raster A; (b) grey 8-bit raster B; (c) output raster)
and max. SQL:1999 adds every, some, and any. OLAP functions were first published as an addendum to the ISO SQL:1999 standard; they have since been fully incorporated into the SQL:2003 and SQL:2008 ISO standards. OLAP functions include rank, ntile, cume_dist, percent_rank, row_number, percentile_cont, and percentile_disc.
Add<br />
The add operation sums up the cell values of a raster and returns the total as a scalar value. It can also be applied to two or more rasters with identical spatial domains, returning a new raster with the same spatial domain; in this case, the cells of the new raster contain the sum of the inputs computed on a cell-by-cell basis. As an example of the add operation on a single raster, consider the following query:
Query 3.2.8. Return the sum of all cell values of the raster shown in Fig. 3.8(a).

add cells(A) = COND +,sdom(A),i (A[i])

Results are shown in Fig. 3.8.
Figure 3.8. Calculation of the Total Sum of Cell Values in a Raster ((a) original NRG raster; (b) output result)
Count<br />
The count operation returns the number of cells that fulfill a boolean condition applied to a raster. For example, consider the following query:

Query 3.2.9. Return the number of cells of raster A of boolean type containing the value true in the green channel.

count cells(A) = COND +,sdom(A),i (A[i].green = 1)

Average
The average operation returns a scalar value representing the mean of all values contained<br />
in a raster. As an example consider the following query:<br />
Query 3.2.10. Return the average of the cell values in each channel of the NRG image<br />
shown in Fig. 3.9(a).<br />
Let add cells(A) be the sum of all cell values, as computed in Section 3.2.2, and card(sdom(A)) a function returning the cardinality of the spatial domain of A. Then, the average of A is calculated as follows:

avg cells(A) = add cells(A) / card(sdom(A))

Results are shown in Fig. 3.9.
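The add, count, and average aggregates can be sketched together in NumPy; the names `add_cells`, `count_cells`, and `avg_cells` mirror the text, but the code is an illustration, not rasql:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 6.0]])

add_cells = A.sum()              # COND with +
count_cells = (A > 2).sum()      # count of cells satisfying a condition
avg_cells = add_cells / A.size   # sum divided by card(sdom(A))
```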
Maximum<br />
A maximum operation returns the largest cell value contained in a raster of numerical<br />
type. As an example, consider the following query:
Figure 3.9. Result of an Average Aggregate Operation ((a) original NRG raster; (b) output result)
Query 3.2.11. Return the maximum of all cell values contained in the NRG raster image shown in Fig. 3.10(a).

max cells(A) = COND max,sdom(A),i (A[i])

Results are shown in Fig. 3.10.
Figure 3.10. Result of a Maximum Aggregate Operation ((a) original NRG raster; (b) output result)
Minimum<br />
A minimum operation returns the smallest cell value contained in a raster of numerical<br />
type. As an example, consider the following query:
Query 3.2.12. Return the smallest of all cell values in the NRG raster image shown in Fig. 3.11(a).

min cells(A) = COND min,sdom(A),i (A[i])

Results are shown in Fig. 3.11.
Figure 3.11. Result of a Minimum Aggregate Operation ((a) original NRG raster; (b) output result)
Histogram

A histogram provides information about the number of times a value occurs across a range of possible values. For an 8-bit raster, up to 256 different values are possible. As an example consider the following query:
Query 3.2.13. Calculate the histogram for a 2D raster A with 8-bit integer pixel resolution.

The query can be computed as follows:

MARRAY sdom(A),g (count cells(A = g[0])) (3.1)

Results are shown in Fig. 3.12.

Diversity
The diversity operation returns the different classifications in a raster. For example,<br />
consider the following query:<br />
Query 3.2.14. Given the classifications in an 8-bit gray raster image, return true (1) for those classes whose total number of cells is greater than 0.
Figure 3.12. Computation of the His<strong>to</strong>gram for a Raster Image<br />
For the computation of this operation we make use of the histogram calculated in Query 3.2.13. Let B be a 1-D array containing the histogram values:

B = MARRAY sdom(A),g (COND +,sdom(A),i (A[i] = g))
then, C is the array containing true values for the elements of the histogram that are greater than 0:

C = MARRAY sdom(B),i (B[i] > 0)
Results are shown in Fig. 3.13.<br />
Figure 3.13. Computation of the Diversity for a Raster Image<br />
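The histogram-then-threshold construction of B and C can be sketched in NumPy (the sample raster is ours):

```python
import numpy as np

# a tiny 8-bit classified raster (values 0..255)
A = np.array([[3, 3, 7], [7, 7, 250]], dtype=np.uint8)

# histogram B: for each value g in [0:255], count the cells of A equal to g
B = np.array([(A == g).sum() for g in range(256)])

# diversity C: true where a class occurs at least once
C = B > 0
```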
Majority/Minority<br />
In a classified raster, the majority operation finds the class value with the largest number of elements in the raster. Similarly, the minority operation finds the cell value with the fewest number of elements. As an example, consider the following query:
Query 3.2.15. Return the cell representing the majority of all cell values contained in<br />
2D 8-bit gray raster image A shown in Fig. 3.14(a).<br />
To solve this query we use the histogram computed in Query 3.2.13, and then select the cell value representing the majority of the different classes. Let h be a 1-D array
containing the histogram values, and h1 a 1-D array of spatial domain [0:255] containing the values from 0 to 255. Let h2 be an array containing the cell-wise sum of h and h1:

h2 = MARRAY [0:255],g (h[g] + h1[g])

then, the majority can be computed as follows:

COND +,sdom(h),i ((max cells(h) = (h2[i] − h1[i])) ∗ h1[i])

Results are shown in Fig. 3.14.
Figure 3.14. Computation of a Majority Operation for a Raster Image ((a) classified raster; (b) majority class)
3.2.3 Statistical Aggregate Operations<br />
We now consider operations that consist of or include one or more statistical aggregate functions. The basic statistical aggregate functions include standard deviation, square root, power, mode, median, variance, and top-k. These functions can be applied to a raster, or to a set of rasters retrieved by a logical search. Consider the following examples:
Variance<br />
Let n be the cardinality of the spatial domain of A, n = card(sdom(A)), and avg the average of all cell values of A, avg = avg cells(A); then the variance v of A can be computed as follows:

v(A) = (1/n) ∗ COND +,sdom(A),i ((A[i] − avg) ∗ (A[i] − avg))

Results are shown in Fig. 3.15.
Figure 3.15. Computation of the Variance for a Raster Image<br />
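The variance formula can be checked against NumPy's built-in population variance on a small sample raster (values are ours, for illustration only):

```python
import numpy as np

A = np.array([[2.0, 4.0], [4.0, 6.0]])

n = A.size                                # card(sdom(A))
avg = A.sum() / n                         # avg_cells(A)
# (1/n) * COND with + over the squared deviations
v = (1.0 / n) * ((A - avg) * (A - avg)).sum()
```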
Standard Deviation<br />
Query 3.2.16. Estimate the standard deviation of the cell values of the NRG raster<br />
image shown in Fig. 3.8(a).<br />
Let n be the cardinality of the spatial domain of A, n = card(sdom(A)), and avg the average of the cell values of A, avg = avg cells(A); then the standard deviation s of A can be computed as follows:

s(A) = sqrt((1/n) ∗ COND +,sdom(A),i ((A[i] − avg) ∗ (A[i] − avg)))

Results are shown in Fig. 3.16.
Figure 3.16. Computation of the Standard Deviation for a Raster Image<br />
Median<br />
The median can be calculated by sorting the cell values of raster A in ascending order<br />
and choosing the middle value. In case the number of cells is even, the median
is the average of the two middle values. To solve this operation, we use the sort operator to perform the ascending sort of array A. However, for an array of dimensionality higher than 1 it is first necessary to flatten the array into a one-dimensional array. For example, the conversion of a two-dimensional raster A[0:m,0:n] into a one-dimensional raster B[0:m∗n] can be expressed as follows. Let d be the cardinality of A, d = card(sdom(A)); let r be the number of rows; and let c be the number of columns. Then, the flattening of A can be calculated as:

B = MARRAY [0:m∗n],g (COND +,[0:m,0:n],i (((g > (m ∗ (i − 1))) and (g ≤ i)) ∗ A[1 : (g − (m ∗ (i − 1))), 1 : i]))
Let S be the raster containing the sorted values of B (the flattening of A), S = SORT 0,asc,f (B), and let n be the cardinality of S, n = card(sdom(S)). Assuming integer division and array indexing starting at zero, the median of array A is obtained as follows: if n is odd, then the median is equal to S[n/2]; otherwise, median = (S[(n − 1)/2] + S[(n + 1)/2]) / 2. Consider the following query:
Query 3.2.17. Obtain the median of the 1-D array A whose cell values are shown in<br />
Fig. 3.17(a).<br />
Since the array has an odd number of elements, the computation of the query is as follows:

A[card(sdom(A))/2]
Results are shown in Fig. 3.17(b).<br />
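The flatten-sort-pick-middle procedure can be sketched in NumPy, covering both the odd and even cases (the helper name `median_cells` and the sample arrays are ours):

```python
import numpy as np

def median_cells(A):
    """Median of all cell values: flatten, sort ascending, pick the middle."""
    S = np.sort(A.ravel())        # flattening + ascending sort
    n = S.size
    if n % 2 == 1:
        return S[n // 2]
    # even case: average of the two middle values (integer division on indices)
    return (S[(n - 1) // 2] + S[(n + 1) // 2]) / 2

odd = np.array([5, 1, 9, 3, 7])
even = np.array([[4, 1], [3, 2]])
```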
Top-k<br />
The top-k function returns the k cells with the highest values within a raster. For example, consider the following query:
Query 3.2.18. Find the five highest values contained in raster A.<br />
To solve this query we first sort A in descending order and then select the top five values. Let d = 0 indicate sorting along dimension 0, and let f be the sorting function f_{d,A}(p) = A[p]. Then S is a sorted array of raster A (see Fig. 3.18):

S = SORT 0,desc,f (A)

thus, the top five cell values are obtained by:

S[0 : 4]
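A NumPy sketch of the top-k selection, sorting in descending order so that the first k entries are the k highest (sample values are ours):

```python
import numpy as np

A = np.array([9, 2, 14, 7, 11, 3, 8, 1])

S = np.sort(A)[::-1]   # ascending sort, then reverse -> descending order
top5 = S[0:5]          # the five highest values
```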
Figure 3.17. Computation of the Median for a Raster Image ((a) 1-D array; (b) median)
Figure 3.18. Computation of a Top-k Operation for a Raster Image (top five values)
3.2.4 Affine Transformations<br />
Geometric transformations permit the elimination of geometric distortions that occur when images are captured. An example is the attempt to match remotely sensed images of the same area taken one year apart, when the more recent image was probably not taken from precisely the same position. Another example is Landsat Level 1B data that are already transformed onto a plane but may not be rectified to the user's desired map projection [46]. Applying an affine transformation to a uniformly distorted raster image can correct for a range of perspective distortions by transforming the measurements from the ideal coordinates to those actually used. An affine transformation is an important class of linear 2-D geometric transformations that maps variables, e.g. cell intensity values located at position (x1, y1) in an input raster image, into new variables (x2, y2) in an output raster image by applying a linear combination of translation, rotation, scaling, and shearing operations. The computation of these operations often requires interpolation techniques.
In the remainder of this section we discuss special cases of affine transformations.<br />
Translation<br />
Translation performs a geometric transformation that maps the position of each cell in an input raster image into a new position in an output raster image. Under translation, a cell located at (x1, y1) in the original raster is shifted to a new position (x2, y2) in the corresponding output raster image by displacing it through a user-specified translation vector (h, k). The cell values remain unchanged, and the spatial domain of the output raster image has the same extent as that of the original input raster. Consider, for example, the following query:

Query 3.2.19. Shift the spatial domain of a raster defined as A[x1 : x2, y1 : y2] by the point [h:k].

The query can be solved by invoking the shift function of Array Algebra:

shift(A[x1 : x2, y1 : y2], [h : k])

Results are shown in Fig. 3.19.
Rotation<br />
Rotation performs a geometric transformation that maps position (x1, y1) of a cell in an input raster image onto a position (x2, y2) in an output raster image by rotating it clockwise or counterclockwise through a user-specified angle θ about an origin O. The rotation operation performs a transformation of the form:

x2 = cos(θ) ∗ (x1 − x0) − sin(θ) ∗ (y1 − y0) + x0
y2 = sin(θ) ∗ (x1 − x0) + cos(θ) ∗ (y1 − y0) + y0
Figure 3.19. Computation of a Translation Operation for a Raster Image ((a) original domain; (b) translated domain)
where (x0, y0) are the coordinates of the center of rotation in the input raster image, and θ is the angle of rotation. Existing algorithms for the computation of rotation, unlike those employed for translation, can produce coordinates (x2, y2) that are not integers. A common solution to this problem is the application of interpolation techniques such as nearest-neighbor, bilinear, or cubic interpolation. For large raster datasets this is a computationally intensive problem because every output cell must be computed separately using data from its neighbors. Consequently, the rotation operation is not yet properly supported by Array Algebra.
Scaling<br />
Scaling stretches or compresses the coordinates of a raster (or part of it) according to a scaling factor. This operation can be used to change the visual appearance of an image, to alter the quantity of information stored in a scene representation, or as a low-level preprocessor in a multi-stage image processing chain that operates on features of a particular scale. For the estimation of the cell values in a scaled output raster image, two common approaches exist:
• one pixel value within a local neighborhood is chosen (perhaps randomly) <strong>to</strong><br />
be representative of its surroundings. This method is computationally simple<br />
but may lead <strong>to</strong> poor results when the sampling neighborhood is <strong>to</strong>o large and<br />
diverse.<br />
• the second method interpolates cell values within a neighborhood by taking the<br />
average of the local intensity values.
As in the rotation operation, applying scaling with interpolation techniques to large raster datasets is computationally intensive because every output cell must be computed separately using data from its neighbors. Consider the following query performing a scaling operation using bilinear interpolation; that is, the cell value for (x0, y0) in the output raster is calculated by averaging the values of its nearest cells: two in the horizontal direction (x0, x1) and two in the vertical direction (y0, y1). Note that the query is applied to a raster of spatial domain [0:255, 0:255], but as mentioned earlier, raster datasets tend to be extremely large (TB, PB).
Query 3.2.20. Scale the 2D raster shown in Fig. 3.20(a), along the x and y dimensions<br />
by a fac<strong>to</strong>r of 2.<br />
The query can be solved as follows:

B = MARRAY [0:m/2,0:n/2],(x,y) (COND +,[0:1,0:1],(i,j) (A[i + x ∗ 2, j + y ∗ 2]/4))
Results are shown in Fig. 3.20.<br />
Figure 3.20. Computation of a Scaling Operation for a Raster Image ((a) original raster; (b) scaled raster)
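The 2x2 block-averaging behind Query 3.2.20 can be sketched directly (an illustrative loop version on a small sample raster, with the halved domain as in the formula):

```python
import numpy as np

A = np.arange(16, dtype=float).reshape(4, 4)

m, n = A.shape
# each output cell averages a 2x2 neighborhood of the input
B = np.empty((m // 2, n // 2))
for x in range(m // 2):
    for y in range(n // 2):
        B[x, y] = A[2 * x:2 * x + 2, 2 * y:2 * y + 2].sum() / 4
```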
3.2.5 Terrain Analysis
Raster image data is particularly useful for tasks related <strong>to</strong> terrain analysis. Some<br />
of the most popular operations include slope/aspect, drainage networks, and catchments<br />
(or watersheds). The processing of these operations may involve interpolation
techniques that lead to expensive computational costs. For simplicity, we model these operations with approaches that do not use interpolation methods.
Slope/Aspect<br />
Slope is defined by a plane tangent to a topographic surface, as modeled by the Digital Elevation Model (DEM), at a point [2]. Slope is classified as a vector and thus has two components: a quantity (gradient) and a direction (aspect). The gradient is defined as the maximum rate of change in altitude, and the aspect as the compass direction of the maximum rate of change. Several approaches exist for the computation of slope/aspect; we follow the method proposed by [32], where z(r, c) denotes the elevation at row r and column c, and g is the grid spacing:

• Slope in the X direction (difference in height values on either side of P):

tan Θx = (z(r, c + 1) − z(r, c − 1)) / (2g)

• Slope in the Y direction:

tan Θy = (z(r + 1, c) − z(r − 1, c)) / (2g)

• Gradient at P:

tan Θ = sqrt(tan² Θx + tan² Θy)

• Direction (aspect) of the gradient:

tan α = tan Θx / tan Θy

Results are shown in Fig. 3.21.
Figure 3.21. Slopes Along the X and Y Directions<br />
Note that after the calculation of the slopes for each cell in a raster image, the<br />
results may need <strong>to</strong> be classified <strong>to</strong> display them clearly on a map [2].<br />
Query 3.2.21. Calculate the slope along the X direction of an 8-bit grey raster A:

MARRAY sdom(A),(r,c) (arctan((A[r, c + 1] − A[r, c − 1]) / (2g)))
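The central-difference slope formulas can be sketched in NumPy on the interior cells of a small elevation grid (the sample values and grid spacing g are assumptions for illustration):

```python
import numpy as np

z = np.array([[1.0, 2.0, 4.0],
              [2.0, 3.0, 5.0],
              [4.0, 6.0, 9.0]])
g = 1.0                                        # grid spacing

# central differences on the interior cells
tan_x = (z[1:-1, 2:] - z[1:-1, :-2]) / (2 * g)  # slope in the X direction
tan_y = (z[2:, 1:-1] - z[:-2, 1:-1]) / (2 * g)  # slope in the Y direction
slope_x = np.arctan(tan_x)                      # slope angle, as in Query 3.2.21
```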
Local Drain Directions (ldd)<br />
The ldd network is useful for computing several properties of a DEM because it explicitly contains information about the connectivity of different cells. Two steps are required to derive a drainage network: the estimation of the flow of material over the surface and the removal of pits. For instance (see Fig. 3.22), cell A1 has three neighboring cells (A2, B1, and B2) and the lowest of them is B1, thus the flow direction is south (downward). For cell C3, the lowest of its eight neighboring cells is D2, so the flow direction is southwest (to the lower left). This method is one of the most popular algorithms for estimating flow directions and is commonly known as the D8 algorithm [2].
Figure 3.22. Flow Directions<br />
Query 3.2.22. Estimate the flow of material over raster A, where each cell contains the slope along the X direction.

Let each cell of A contain the slope along the X direction. The ldd is then calculated as:

MARRAY sdom(A),(i,j) (COND min,[−1:1,−1:1],(v,w) (A[i + v, j + w]))
Irrespective of the algorithm used <strong>to</strong> compute flow directions, the resulting ldd network<br />
is extremely useful for computing other properties of a DEM such as stream<br />
channels, ridges, and catchments.<br />
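The minimum-over-neighborhood step of the ldd formula above can be sketched as follows (interior cells only, to sidestep border handling; the sample values are ours):

```python
import numpy as np

A = np.array([[3.0, 4.0, 5.0],
              [2.0, 8.0, 6.0],
              [1.0, 0.5, 7.0]])

# for each interior cell, the minimum over its 3x3 neighborhood,
# mirroring COND with min over the window [-1:1, -1:1]
m, n = A.shape
ldd = np.empty((m - 2, n - 2))
for i in range(1, m - 1):
    for j in range(1, n - 1):
        ldd[i - 1, j - 1] = A[i - 1:i + 2, j - 1:j + 2].min()
```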
3.2.6 Other Operations<br />
Edge Detection<br />
Edge detection produces a new raster containing only the boundary cells of a given raster. The detection of intensity discontinuities in a raster is very useful; for example, the boundary representation is easy to integrate into a large variety of detection algorithms. The following parameterized function can be used to express filtering operations in Array Algebra:

f(A, M) = MARRAY sdom(A),x (COND +,sdom(M),i (A[x + i] ∗ M(i)))

where sdom(M) is the size of the corresponding filter window, e.g., 3x3. As an example consider the following query:
Figure 3.23. Sobel Masks: (a) M1, (b) M2
Query 3.2.23. Apply edge detection to raster A shown in Fig. 3.24(a) using a 3x3 Sobel filter.

To compute this query, a Sobel filter and its inverse are applied to the original raster A (see Fig. 3.23):
(|f(A, M1)| + |f(A, M2)|) / 9

which in Array Algebra can be computed as follows:

MARRAY_{sdom(A),x}(COND_{+,sdom(M1),i}((abs(A[x + i] * M1(i)) + abs(A[x + i] * M2(i))) / 9))

Results are shown in Fig. 3.24.
Figure 3.24. Computation of an Edge-Detection for a Raster Image: (a) original raster image, (b) output raster image
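The filter function f(A, M) and the Sobel query above can be read as an ordinary discrete convolution. The sketch below is an illustrative reading with two assumptions the thesis does not spell out: cells addressed outside A are treated as 0, and the mask values are the conventional Sobel coefficients, since Fig. 3.23 is not reproduced here.

```python
def apply_filter(A, M):
    """f(A, M): for each cell x of A, sum A[x + i] * M(i) over the 3x3 mask domain.
    Cells of A addressed outside its domain contribute 0 (an assumption)."""
    rows, cols = len(A), len(A[0])
    out = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            s = 0
            for i in (-1, 0, 1):          # mask domain [-1:1, -1:1]
                for j in (-1, 0, 1):
                    if 0 <= x + i < rows and 0 <= y + j < cols:
                        s += A[x + i][y + j] * M[i + 1][j + 1]
            out[x][y] = s
    return out

def sobel_edges(A, M1, M2):
    """(|f(A, M1)| + |f(A, M2)|) / 9, as in Query 3.2.23."""
    g1, g2 = apply_filter(A, M1), apply_filter(A, M2)
    return [[(abs(a) + abs(b)) / 9 for a, b in zip(r1, r2)]
            for r1, r2 in zip(g1, g2)]

# Conventional Sobel masks (assumed values for Fig. 3.23):
M1 = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
M2 = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
```

On a constant raster the interior response is 0, and on a left-to-right ramp only the horizontal mask M1 responds, which matches the intuition that the operator reports intensity discontinuities.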
Slicing

The slicing operation extracts lower-dimensional sections from a raster. Array Algebra accomplishes slicing by indicating the slicing position in the desired dimension; the operation thus reduces the dimensionality of the raster by one. For example, consider the following query:

Query 3.2.24. Slice raster A along the second dimension at position 50.

The query is solved by specifying the slicing position as follows:
MARRAY_{sdom(A),(x,y,z)}(A[x, 50, z])

3.3 Summary
By examining the fundamental structure of Geo-raster operations and breaking down their computational steps into a few basic Array Algebra operators, we determine that Geo-raster operations can be grouped into the following classes:

• COND and MARRAY combined operations. Operations whose computation requires both the MARRAY and COND operators: add, count, average, maximum, minimum, majority, minority, histogram, diversity, variance, standard deviation, scaling, edge detection, and local drain directions.

• MARRAY exclusive operations. Operations whose computation requires only the MARRAY operator: arithmetic, trigonometric, boolean, logical, overlay, reclassification, proximity, translation, slicing, and slope/aspect.

• SORT operations. Operations whose computation requires the SORT operator: top-k, median.

• AFFINE transformations. Special cases of affine transformations partially or not yet supported by Array Algebra: rotation and scaling.

This classification allows us to identify a set of operations that require data summarization and are thus potential candidates for treatment with pre-aggregation techniques: add, count, average, maximum, minimum, majority, minority, histogram, diversity, variance, standard deviation, scaling, edge detection, and local drain directions.

Table 3.3 summarizes the usage of Array Algebra operators for each operation discussed in Section 3.2.
Table 3.3. Array Algebra Classification of Geo-Raster Operations.

Operation MARRAY COND SORT AFFINE
1. Count x
2. Add x
3. Average x
4. Maximum x
5. Minimum x
6. Majority x x
7. Minority x x
8. Std. Deviation x
9. Median x x
10. Variance x
11. Top-k x
12. Histogram x x
13. Diversity x x
14. Proximity x
15. Arithmetic x
16. Trigonometric x
17. Boolean x
18. Logical x
19. Overlay x
20. Re-classification x
21. Translation x
22. Rotation x
23. Scaling x x x
24. Slicing x
25. Edge Detection x x
26. Slope/Aspect x
27. Local drain directions (ldd) x x
Chapter 4

Answering Basic Aggregate Queries Using Pre-Aggregated Data
As discussed in previous chapters, aggregation is an important mechanism that allows users to extract general characterizations from very large repositories of data. In this chapter, we study the effect of selecting a set of aggregate queries, computing their results, and using them for subsequent query requests. In particular, we study the effect of pre-aggregation on the computation of aggregate queries in the field of GIS and remote-sensing imaging applications.

We introduce a pre-aggregation framework that distinguishes among different types of pre-aggregates for computing a query. We show that in most cases several pre-aggregates may qualify for answering an aggregate query, and we address the problem of selecting the best pre-aggregate in terms of execution time. To this end, we introduce a model that measures the cost of using qualified pre-aggregates for the computation of a query. We then present an algorithm that selects the best pre-aggregate for computing a query. We measure the performance of our algorithms in an array database management system (RasDaMan) and show that they give significantly better performance than straightforward methods.
4.1 Framework

Most major database management systems allow the user to store query results through a process known as view materialization. The query optimizer may then automatically use the materialized data to speed up the evaluation of a new query. Queries that benefit from using materialized data are those that involve the summarization of large amounts of data. They are known as aggregate queries because their query statements include one or more aggregate functions. The ANSI SQL:2008 standard defines a wide variety of aggregate functions, including COUNT, SUM, AVG, MAX, MIN, EVERY, ANY, SOME, VAR_POP, VAR_SAMP, STDDEV_POP, STDDEV_SAMP, ARRAY_AGG, REGR_COUNT, COVAR_POP, COVAR_SAMP, CORR, REGR_R2, REGR_SLOPE, and REGR_INTERCEPT [20].
4.1.1 Aggregation

An aggregate operation contains one or more aggregate functions that map a multiset of cell values in a dataset to a single scalar value. In our framework, queries may contain an arbitrary number of aggregate functions, e.g., COUNT, SUM, AVG, MAX, MIN, and a spatial domain. We formulate our queries using rasql¹, the declarative interface to the RasDaMan server. We use the Array Algebra notation for spatial domains:

sdom = [l_1 : h_1, ..., l_d : h_d]    (4.1)

where the vector variables l (low) and h (high) deliver the lower and upper bound vectors, respectively.
4.1.2 Pre-Aggregation

The term pre-aggregation refers to the process of pre-computing and storing the results of aggregate queries for subsequent use in the same or similar query requests. The decision to use pre-aggregated data during the computation of an aggregate query is influenced by the structural characteristics of the query and the pre-aggregate. By comparing the data structures of the two, one can determine whether the pre-aggregated result contributes fully or partially to the final answer of the query, and whether it is worth using pre-aggregated data.
4.1.3 Aggregate Query and Pre-Aggregate Equivalence

An aggregate query Q and a pre-aggregate p_i are equivalent if and only if all of the following conditions are met:

1. The aggregate operation of the query Q is the same as the aggregate operation defined for the pre-aggregate p_i.

2. The aggregate operation of the query Q and the pre-aggregate p_i must be applied over the same objects.

3. The same logical and boolean conditions, if any, apply to both the query Q and the pre-aggregate p_i.

4. For aggregate operations to be applied over a specific spatial domain, the extent of the spatial domain in query Q must be the same as the one in pre-aggregate p_i.
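The four conditions above can be written as a simple predicate. The dictionary fields below are hypothetical names chosen for illustration, not the thesis's internal representation:

```python
def full_matching(q, p):
    """Conditions 1-4: same aggregate operation, same objects,
    same logical/boolean conditions, and same spatial domain."""
    return (q["op"] == p["op"] and q["obj"] == p["obj"]
            and q["cond"] == p["cond"] and q["sdom"] == p["sdom"])

def partial_matching_candidate(q, p):
    """Conditions 1-3 hold but the spatial domains differ."""
    return (q["op"] == p["op"] and q["obj"] == p["obj"]
            and q["cond"] == p["cond"] and q["sdom"] != p["sdom"])
```

A pre-aggregate over a sub-window of the query's domain satisfies the second predicate but not the first, which is exactly the partial-matching case discussed next.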
When all of the above conditions are satisfied, we say there is a full-matching between the query and the pre-aggregate. In this case, retrieving the pre-aggregated result is much faster than computing the query from raw (original) data. Moreover, the storage overhead required to save the pre-aggregated result is compensated by the faster computation of the query obtained in return. However, cases do occur in which only conditions 1, 2 and 3 are satisfied. We refer to this case as a partial-matching between the query and the pre-aggregate. We can use the partial results provided by these pre-aggregates and thus speed up the computation of the query. However, further analysis must be carried out to find those pre-aggregates that provide the maximum speedup for computing a query. To that end, we define the following types of pre-aggregates: independent, overlapped, and dominant.

¹ rasql is a SQL-based query language for multidimensional raster databases based on Array Algebra.
Independent Pre-Aggregates

Definition 4.1 (Independent Pre-Aggregates) – A set of pre-aggregates is called Independent Pre-Aggregates (IPAS) with respect to Q if the spatial domain of each pre-aggregate is contained within the spatial domain of query Q and there is no intersection among the spatial domains of the pre-aggregates. Fig. 4.1(a) shows an example of an independent pre-aggregate.

IPAS := {p_1, p_2, ..., p_n | p_i.sdom ⊆ Q.sdom, p_i.sdom ∩ p_j.sdom = ∅ for i ≠ j}    (4.2)

✷
Overlapped Pre-Aggregates

Definition 4.2 (Overlapped Pre-Aggregates) – A set of pre-aggregates is called Overlapped Pre-Aggregates (OPAS) if the spatial domain of each pre-aggregate intersects with the spatial domain of the query Q. Fig. 4.1(b) shows an example of an overlapped pre-aggregate.

OPAS := {p_1, p_2, ..., p_n | p_i.sdom ∩ Q.sdom ≠ ∅}    (4.3)

✷
Dominant Pre-Aggregates

Definition 4.3 (Dominant Pre-Aggregates) – A set of pre-aggregates is called Dominant Pre-Aggregates (DPAS) if the spatial domain of the query Q is contained within the spatial domain of each pre-aggregate. Fig. 4.1(c) shows an example of a dominant pre-aggregate. Note that dominant pre-aggregates can only be used to answer the following types of aggregate queries: ADD, COUNT, and AVG.

DPAS := {p_1, p_2, ..., p_n | Q.sdom ⊆ p_i.sdom}    (4.4)

✷

Moreover, given an ordered DPAS

DPAS = {p_1, p_2, ..., p_n | Q.sdom ⊆ p_1.sdom ⊆ ... ⊆ p_n.sdom},    (4.5)

the closest dominant pre-aggregate (p_cd) to Q is given by p_1, i.e., p_cd = p_1.
Figure 4.1. Types of Pre-Aggregates: (a) independent pre-aggregate, (b) overlapped pre-aggregate, (c) dominant pre-aggregate
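Under Definitions 4.1-4.3, candidate pre-aggregates can be sorted into the three sets from their spatial domains alone. A sketch with axis-aligned domains given as ((lo1, hi1), (lo2, hi2)); this representation is an assumption, and the pairwise-disjointness requirement of Definition 4.1 is left out for brevity:

```python
def contains(outer, inner):
    """True if every interval of inner lies inside the matching interval of outer."""
    return all(ol <= il and ih <= oh for (ol, oh), (il, ih) in zip(outer, inner))

def intersects(a, b):
    """True if the two boxes share at least one point in every dimension."""
    return all(al <= bh and bl <= ah for (al, ah), (bl, bh) in zip(a, b))

def classify(q_sdom, candidates):
    """Split candidate domains into independent (inside Q), dominant (covering Q),
    and overlapped (crossing Q's border) pre-aggregates."""
    ipas, opas, dpas = [], [], []
    for p in candidates:
        if contains(q_sdom, p):
            ipas.append(p)          # p.sdom ⊆ Q.sdom    (Def. 4.1)
        elif contains(p, q_sdom):
            dpas.append(p)          # Q.sdom ⊆ p.sdom    (Def. 4.3)
        elif intersects(p, q_sdom):
            opas.append(p)          # p.sdom ∩ Q.sdom ≠ ∅ (Def. 4.2)
    return ipas, opas, dpas
```

A candidate that neither touches nor contains the query domain falls into none of the sets and is simply ignored.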
Cases may occur where a pre-aggregate intersects with one or more pre-aggregates of the same or a different type. Intersections are problematic because the greater the number of intersections, the greater the number of cells that may need to be computed from raw data to determine the real contribution of a given pre-aggregate towards the result of the query. The computation process involves several intermediary operations, such as decomposing the pre-aggregate into sub-partitions that in turn must be aggregated. Moreover, the same procedure must be performed on the other intersected pre-aggregates should we want to use their results. For example, assume that pre-aggregates p_1, p_2 and p_3 can be used to answer query Q, and that they all intersect with each other. Since the result of each pre-aggregate includes a partial result of the other two pre-aggregates, we must use raw data to compute the intersected area and adjust the result of the pre-aggregate according to the aggregate function specified in the query predicate.

To overcome this problem, a query selected for pre-aggregation, for which other pre-aggregates exist with different spatial domains but identical structural properties, can be decomposed into a set of sub-partitions prior to the pre-aggregation process.
By partitioning the query to be pre-aggregated, we can avoid intersection among pre-aggregates; see the example shown in Fig. 4.2.

Figure 4.2. Selected Queries for Pre-Aggregation (left) and Decomposed Queries (right)
4.2 Cost Model

This section introduces a cost model that allows us to estimate the cost (in terms of execution time) of computing a query using pre-aggregates compared to raw data. In our model, the access cost is driven by the number of required disk I/Os and memory accesses. These parameters are influenced by the number of tiles needed to answer a given query and by the number and size of the cells in the datasets. The following assumptions underlie our estimates.

1. We assume that the tiles needed to answer a given query are stored using implicit storage of coordinates, which is the prevalent storage format for raster image data [79]. Implicit storage of coordinate values is a storage technique that leads to a higher degree of clustering of cell values that are close in data space; that is, it preserves the spatial proximity of cell values. Given that state-of-the-art disk drives improve access to multidimensional datasets by allowing the spatial locality of the data to be preserved on the disk itself [93], we assume that it takes the same time to retrieve a tile from disk as to retrieve any other tile needed to answer a given query. Clearly, there are other factors, not considered here, that influence access cost. Among them are the cost of storing intermediate results and the communication cost of sending the results from the server to the client. More complicated cost models are certainly possible, but we believe the cost model we pick, being both simple and realistic, enables us to design and analyze powerful algorithms.
2. We consider the time taken to access a given cell (pixel) in main memory to be the same as that required to access any other cell. That is, we assume that a tile sits in main memory and is not swapped out.

3. We ignore the time it takes to combine partial aggregate results. Investigations have shown this time to be negligible compared to tile iteration [74].

Table 4.1 lists the parameters involved in the different cost functions presented in the remainder of this section.
Table 4.1. Cost Parameters

Parameter  Description
Ntiles     Number of tiles
Ncells     Number of cells
sdom       Spatial domain
IPAS       Independent pre-aggregates set
OPAS       Overlapped pre-aggregates set
DPAS       Dominant pre-aggregates set
p_cd       Closest dominant pre-aggregate
SP         Sub-partitions
4.2.1 Computing Queries from Raw Data

The cost of computing an aggregate query Q (or sub-partitions of pre-aggregates) from raw data, C_r, is given by

C_r(Q) = C_acc(Ntiles(Q)) + C_agg(Ncells(Q))    (4.6)

where C_acc is the cost of retrieving the tiles required to answer Q, and C_agg is the time taken to access and aggregate the total cells given by the spatial component of the query.
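Equation 4.6 can be instantiated directly. The unit costs below are illustrative assumptions, not values measured in the thesis:

```python
def cost_raw(n_tiles, n_cells, t_tile_io=5.0, t_cell=0.001):
    """C_r(Q) = C_acc(Ntiles(Q)) + C_agg(Ncells(Q)): per-tile I/O cost
    t_tile_io plus per-cell aggregation cost t_cell, in arbitrary time units."""
    return n_tiles * t_tile_io + n_cells * t_cell
```

For a query touching 10 tiles of 100 cells each, the estimate is dominated by tile I/O, which is why reducing the number of tiles read is the main lever for the pre-aggregation strategies that follow.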
4.2.2 Computing Queries from Independent and Overlapped Pre-Aggregates

The cost of answering an aggregate query using independent and overlapped pre-aggregates is given by:

C_IOPAS(Q) = C_IPAS(Q) + C_OPAS(Q) + C_SP(Q),    (4.7)

where C_IPAS and C_OPAS are the costs of using the results of independent and overlapped pre-aggregates, respectively, and C_SP is the cost of decomposing the query Q into a set of sub-partitions and aggregating each from raw data.
Cost of independent pre-aggregates

The cost of retrieving the results of independent pre-aggregates, C_IPAS, is given by:

C_IPAS(Q, T) = C_fin(Q, T) + Σ_{i=0}^{|IPAS|} C_acc(p_i)    (4.8)

where C_fin is the cost of finding the pre-aggregates ∈ IPAS in the pre-aggregated pool T, and C_acc is the accumulated cost of retrieving the results of the pre-aggregates.
Cost of overlapped pre-aggregates

The cost of retrieving the results of overlapped pre-aggregates, C_OPAS, is given by:

C_OPAS(Q) = C_fin(Q, T) + Σ_{i=0}^{|OPAS|} C_dec(p_i) + Σ_{i=0}^{|S|} C_r(s_i)    (4.9)

where C_fin is the cost of finding the pre-aggregates ∈ OPAS in the pre-aggregated pool T, C_dec is the cost of decomposing the spatial domain of each pre-aggregate into a set of sub-partitions S such that the spatial domain of the partitioned pre-aggregate corresponds to p_i.sdom − (p_i.sdom ∩ Q.sdom), and C_r is the cost of aggregating each resulting sub-partition s_i ∈ S from raw data.
Cost of aggregating sub-partitions of a query

The cost of aggregating all sub-partitions forming a query is given by:

C_SP(Q) = C_dec(Q) + Σ_{i=0}^{|SP|} C_r(s_i),    (4.10)

where C_dec is the cost of decomposing Q into a set SP of sub-partitions, and C_r is the cost of aggregating each resulting sub-partition s_i ∈ SP from raw data. Note that C_dec is influenced by the cost of accessing the tiles required to aggregate each sub-partition, and the cost of accessing the spatial properties of the pre-aggregates in IPAS and OPAS.
4.2.3 Computing Queries from Dominant Pre-Aggregates

The cost of computing an aggregate query Q using a dominant pre-aggregate is given by:

C_DPAS(Q) = C_DP(Q, T) + C_agg(p_cd),    (4.11)

where C_DP is the sum of the cost of finding the pre-aggregates ∈ DPAS in the pre-aggregated pool T and the cost of finding the closest dominant pre-aggregate p_cd, and C_agg is the cost of computing the aggregate difference of p_cd corresponding to p_cd.sdom − Q.sdom.
Cost of aggregating sub-partitions of the closest dominant pre-aggregate

The cost C_agg can be calculated as follows:

C_agg(p_cd) = C_dec(p_cd) + Σ_{i=0}^{|SP|} C_r(s_i),    (4.12)

where C_dec is the cost of decomposing p_cd into a set SP of sub-partitions, and C_r is the cost of aggregating each resulting sub-partition s_i ∈ SP from raw data.
4.3 Implementation

This section describes the application of a query optimization technique that transforms an input query written in terms of arrays so that it can be executed faster using pre-aggregated data. The query processing module of an array database management system (RasDaMan) has been extended with our pre-aggregation framework for query rewriting, which has been implemented as part of the optimization and evaluation phases. As discussed earlier in this chapter, there are two problems related to the computation of an aggregate query using pre-aggregated data. First, we must find all pre-aggregates that can be used to compute an aggregate query, including those that provide partial answers. Next, from all candidate pre-aggregates, we must find the one that minimizes the execution time (or cost) of computing the query. Our solution is based on an existing approach for answering queries using views in OLAP applications. Halevy et al. [95] showed that all possible rewritings of a query can be obtained by considering containment mappings from the bodies of the views to the body of the query. They also showed that such characterization is an NP-complete problem.
The QUERYCOMPUTATION procedure returns the result of a query or an execution plan for a given query Q. An execution plan is an indicator of the kind of data that must be used to compute the query. It returns a raw indicator if the query must be computed from the original data. Other valid indicators include IPAS, OPAS, and DPAS, which indicate that the query will be answered using one or more partial pre-aggregates.

The input of the algorithm is a query tree Q_t of an aggregate query. The algorithm first verifies whether the conditions for a perfect matching between the query and the pre-aggregated queries are satisfied. If a perfect matching is found, it returns the result of the pre-aggregated query. Otherwise, the algorithm verifies whether the conditions for a partial matching between the query and the set of pre-aggregated queries are satisfied. Then, the algorithm makes use of our cost model to determine the cost of using pre-aggregates that satisfy the partial-matching conditions for the computation of the query, and the cost of computing the query using the original data. Finally, the algorithm picks the plan with the least cost in terms of execution time. The algorithm makes use of the following auxiliary procedures:

• DECOMPOSEQUERY(Q_t) examines the nodes of the query tree Q_t and generates a standardized representation S_qt that can be manipulated via SQL statements.
Algorithm 1 QUERYCOMPUTATION
Require: A query tree Q_t, a set P of k pre-aggregated queries
1: initialize R = 0, key = false
2: S_qt = decomposeQuery(Q_t)
3: key = perfectMatching(S_qt, P)
4: if key then
5:     R = fetchResult(key)
6:     return R
7: end if
8: if !key then
9:     plan = partialMatching(S_qt, P)
10:    return plan
11: end if
• PERFECTMATCHING(S_qt, P) compares the standardized representation S_qt of the query tree against the existing k pre-aggregates. The output is the corresponding key of the matched pre-aggregated query. A null value is returned if no perfect matching is found.

• FETCHRESULT(key) retrieves the result R of the pre-aggregated query identified by key.
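Algorithm 1's control flow amounts to a lookup followed by a fallback. A compressed Python sketch; the pool's key structure (operation, object identifier, spatial domain) and the indicator tuples are assumptions for illustration:

```python
def query_computation(sig, pool):
    """Return ('result', value) on a perfect matching, else a plan indicator
    that a full implementation would obtain from partialMatching."""
    if sig in pool:                     # perfectMatching + fetchResult
        return ("result", pool[sig])
    return ("plan", "raw")              # fallback: partial matching / raw data

# Hypothetical pre-aggregation pool keyed by (operation, oid, spatial domain):
pool = {("add_cells", 49153, ((7680, 8191), (29000, 31000))): 1234567}
```

A signature that differs in any component of the key misses the pool and falls through to the partial-matching path, mirroring conditions 1-4 of Section 4.1.3.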
The PARTIALMATCHING algorithm identifies an aggregate sub-expression in a query tree Q_t and finds pre-aggregated queries satisfying conditions 1, 2 and 3, but not condition 4, as defined in Section 4.1.3. It considers the use of pre-aggregates that partially contribute to the answer of a query sub-expression and that are either independent, overlapped, or dominant. The algorithm calculates the cost of using each pre-aggregate for computing the query and returns an indicator of the type of plan providing the least cost.

The aggregateOp() procedure compares a node n of a given query tree Q_t against a list of pre-defined aggregate operations, e.g., add_cells, count_cells, avg_cells, max_cells, and min_cells. If the node matches any such operation, it returns a true value.

The getSubtree() procedure receives as parameters a query tree Q_t and a pointer to an aggregate node. If the aggregate node has children, it creates a subtree Q′ whose root node corresponds to the aggregate node.

The findPreaggregate() procedure receives as parameters an aggregate operation op, an object identifier ro, and a spatial domain sd. It then determines whether the values of these parameters match those of any existing pre-aggregate. If a match is found, the result of the matched pre-aggregate is returned.

The findIpasPreaggregates() procedure receives as a parameter a subtree Q′ and verifies whether any pre-aggregates satisfy conditions 1, 2 and 3, as defined in Section 4.1.3, for equivalence between a query and a pre-aggregate. For those pre-aggregates
Algorithm 2 PARTIALMATCHING
Require: A standardized query tree Q_t with m nodes.
1: initialize IPAS, OPAS, DPAS = {}
2: initialize plan = "raw", key = false
3: for each node n of Q_t do
4:     if aggregateOp(node[n]) then
5:         Q′ = getSubtree(Q_t, node[n])
6:         op = getOperation(Q′)
7:         ro = getRasterObject(Q′)
8:         sd = getSpatialDomain(Q′)
9:         key = findPreaggregate(op, ro, sd)
10:        if key then
11:            R = fetchResult(key)
12:            return R
13:        end if
14:        if !key then
15:            IPAS = findIpasPreaggregates(op, ro, sd)
16:            OPAS = findOpasPreaggregates(op, ro, sd)
17:            DPAS = findDpasPreaggregates(op, ro, sd)
18:        end if
19:        plan = selectPlan(Q′, IPAS, OPAS, DPAS)
20:    end if
21: end for
22: return plan
that qualify, it identifies those whose spatial domains are contained in the spatial domain of the query. The output is a set of independent pre-aggregates.

The findOpasPreaggregates() procedure receives as a parameter a subtree Q′ and verifies whether any pre-aggregates satisfy conditions 1, 2 and 3 as defined in Section 4.1.3. For those pre-aggregates that qualify, it identifies those whose spatial domains intersect with the spatial domain of the query. The output is a set of overlapped pre-aggregates.

The findDpasPreaggregates() procedure receives as a parameter a subtree Q′ and verifies whether any pre-aggregates satisfy conditions 1, 2 and 3 as defined in Section 4.1.3. For those pre-aggregates that qualify, it identifies those whose spatial domains dominate the spatial domain of the query. The output is a set of dominant pre-aggregates.

The selectPlan() procedure receives as parameters a sub-query tree Q′, a set of independent pre-aggregates IPAS, a set of overlapped pre-aggregates OPAS, and a set of dominant pre-aggregates DPAS. It then calculates the cost of answering the query using the different types of pre-aggregates and raw data. The output of this procedure is an indicator of the best plan for executing the query.
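The final comparison in selectPlan() reduces to picking the minimum over the applicable alternatives. A sketch in which the cost values are placeholders that a full implementation would obtain from the cost model of Section 4.2:

```python
def select_plan(cost_raw, cost_ipas=None, cost_opas=None, cost_dpas=None):
    """Return the indicator of the cheapest execution plan; None marks a
    pre-aggregate type that is not applicable to the query."""
    candidates = {"raw": cost_raw}
    for name, cost in (("IPAS", cost_ipas), ("OPAS", cost_opas),
                       ("DPAS", cost_dpas)):
        if cost is not None:
            candidates[name] = cost
    return min(candidates, key=candidates.get)
```

Raw computation is always a candidate, so a pre-aggregate is only chosen when its estimated cost actually undercuts scanning the original tiles.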
Query Evaluation

The query optimizer module provides an optimized query tree, along with the plan suggested for the computation of the query, to the final phase, evaluation. Typically, the evaluation phase identifies the tiles affected by an aggregate query and executes the aggregate operation on each tile. Finally, it combines the results to generate the answer to the query. With the extension of pre-aggregation in the optimizer, the traditional process differs in that the selected plan is considered before proceeding to execution. If the plan corresponds to raw, then the computation of the query is done entirely from raw data. Otherwise, it executes the aggregate operation only on those sub-expressions for which there are no pre-aggregated results.
4.4 Experimental Results

This section presents the performance results of our algorithms on real-life raster image datasets. We ran our experiments on an Intel Pentium 4 CPU at 3.00 GHz running SuSE Linux 9.1. The workstation had a total physical memory of 512 MB. The datasets were stored in RasDaMan, an array database management system (our research vehicle).

Table 4.2 lists the test queries used in our experiments. We ran each query 200 times against the database to obtain average query response times. The queries are formulated using rasql syntax, the declarative query interface to the RasDaMan server. We performed a cold test where the queries were run sequentially; the cache buffer was cleaned after the completion of each query. The dataset consists of a collection of 2D raster images, each associated with an object identifier (oid). Each image shows a portion of the Black Sea, is 260 MB in size, and consists of 100 indexed tiles. We artificially created a set of pre-aggregates for the experiment. They are stored in a pre-aggregation pool containing a total of 5000 pre-aggregates requiring a total storage space of 50 MB.

Computing the test queries involves the execution of two fundamental operations in GIS and remote-sensing imaging: sub-setting and aggregation. The values of the spatial domain of the queries were chosen such that we could measure the impact of using pre-aggregation for the following cases:

• The computation of queries Q1, Q2 and Q3 can be done by combining the results of partial pre-aggregates with the remaining parts computed from original data.

• The computation of queries Q4, Q5 and Q6 can be done by using the results of full pre-aggregates. That is, the full answer to these queries has been pre-computed and stored in the database.

• The computation of queries Q7, Q8 and Q9 can be done by combining the results of two or more pre-aggregates. There is no need to use original data to compute these queries.
Table 4.2. Database and Queries of the Experiment.

Qid  Description
Q1   select add_cells(y[6000:10000, 29000:32000]) from blacksea as y where oid(y) = 49153
Q2   select add_cells(y[7000:10000, 29000:31000]) from blacksea as y where oid(y) = 49154
Q3   select add_cells(y[6700:10000, 28000:30000]) from blacksea as y where oid(y) = 49155
Q4   select add_cells(y[7680:8191, 29000:31000]) from blacksea as y where oid(y) = 49153
Q5   select add_cells(y[8704:9215, 29000:31000]) from blacksea as y where oid(y) = 49154
Q6   select add_cells(y[9728:10000, 29000:31000]) from blacksea as y where oid(y) = 49155
Q7   select add_cells(y[7680:8191, 29696:30207]) from blacksea as y where oid(y) = 49153
Q8   select add_cells(y[8704:9215, 30720:31000]) from blacksea as y where oid(y) = 49154
Q9   select add_cells(y[9216:9727, 30208:30719]) from blacksea as y where oid(y) = 49155
Table 4.3 compares the CPU cost required for the computation of the queries using pre-aggregated data and using raw data. The CPU cost was obtained using the time library of C++. The column #aff. tiles shows the number of tiles that need to be read to compute the given query. Column #preagg. tiles gives the number of pre-aggregates that can be used to compute the query. Column t_pre shows the total CPU cost of computing the query using pre-aggregated data, and column t_ex the time taken to execute the query entirely from raw data. Column ratio gives t_pre as a percentage of t_ex; it shows that CPU time is always lower when the computation uses pre-aggregated data.
Table 4.3. Comparison of Query Evaluation Costs Using Pre-Aggregated Data and Original Data.

Q_id  #aff. tiles  #preagg. tiles  t_pre  t_ex  ratio
Q1    63           24              15.6   17.8  87%
Q2    35           24               6.9    9.3  74%
Q3    35            8               9.4   10.0  94%
Q4     5            5               1.02   1.55 65%
Q5     5            5               1.1    1.63 67%
Q6     5            5               0.74   1.01 73%
Q7     2            1               0.04   0.41  9%
Q8     2            1               0.04   0.45  8%
Q9     2            1               0.04   0.41  9%
4.5 Summary

In this chapter we presented a framework for computing aggregate queries in array databases using pre-aggregated data. We distinguished among different types of pre-aggregates: independent, overlapped, and dominant. We showed that such a distinction is useful for finding a set of pre-aggregated queries that can reduce the CPU cost of query computation. We proposed a cost model to calculate the cost of using different pre-aggregates and to select the best option for evaluating a query using pre-aggregated data. Measurements on real-life raster images showed that the computation of the queries is always faster with our algorithms than with straightforward methods. We focused on queries using basic aggregate functions, covering a large number of operations in GIS and remote-sensing imaging applications. The challenge remains, however, in supporting more complex aggregate operations, e.g., scaling, which is discussed in the following chapter.
Chapter 5

Pre-Aggregation Support Beyond Basic Aggregate Operations

In this chapter we investigate the problem of offering pre-aggregation support to non-standard aggregate operations such as scaling and edge detection. We discuss issues found while attempting to provide a pre-aggregation framework for all non-standard aggregate operations. We then justify our reasons for focusing on scaling operations. We adapt the framework and cost model presented in Chapter 4 to support scaling operations. Finally, we discuss the efficiency of our algorithms based on a performance analysis covering 2D, 3D, and 4D datasets. We indicate how our approach generalizes and outperforms the well-known 2D image pyramids widely used in Web mapping.
5.1 Non-Standard Aggregate Operations

As shown in Chapter 2, aggregate operations are not limited to queries using basic aggregate functions. In the GIS domain, operations such as scaling, edge detection, and those related to terrain analysis also require data summarization and may therefore benefit from pre-aggregation. See Table 3.3 for a complete list of operations requiring summarization. Finding a general pre-aggregation approach for computing those kinds of operations, however, introduces additional complications compared to finding pre-aggregates for basic aggregate functions.

Basic aggregate functions each consolidate the values of a group of cells and return a scalar value. The value may represent the total sum, the number of cells, the maximum or minimum cell value, or the average value of the affected cells. Affected cells are determined by the spatial domain defined in the predicate of the query. In contrast, the computation of a scaling operation may require consolidating the values of a group of cells to calculate each cell value in the output raster. Here, the affected cells are determined by both the resampling method and the scale vector, as described in Chapter 3. A similar situation occurs with edge detection, where the affected cells are determined by the size and values of the applied Sobel filter. For simplicity, we refer to those kinds of operations as non-standard aggregate operations.
There is an important concern that must now be taken into account. From Chapter 3, we see that the result returned by a group of affected cells for a given non-standard aggregate operation such as scaling is not likely to be useful in computing another non-standard aggregate operation such as edge detection. This is because non-standard operations differ significantly with respect to the way their affected cells are determined. Nevertheless, this result may be useful in computing the same type of non-standard operation under certain conditions. For example, the result of scaling by a factor of 8 could be used to compute scaling by a factor of 10 (assuming that both operations use the same resampling method). This result, however, is not likely to be useful in edge detection for the same object.
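This reuse of a coarser result within the same operation type can be sketched in a few lines. The example below is purely illustrative (not the thesis implementation): it uses 1D data and nearest-neighbour resampling, computes a factor-10 downscale once from the original signal and once from a precomputed factor-8 result with the remaining factor 10/8, and shows that the two outputs agree in shape while the reused version is only an approximation.

```python
import math

def nn_downscale(row, factor):
    """Nearest-neighbour downscaling of a 1D signal by a (possibly
    fractional) reduction factor: output cell i samples source cell
    floor(i * factor)."""
    out_len = math.ceil(len(row) / factor)
    return [row[min(int(i * factor), len(row) - 1)] for i in range(out_len)]

row = list(range(1000))
direct = nn_downscale(row, 10)        # factor 10 from the original data
pre = nn_downscale(row, 8)            # pre-computed factor-8 result
reused = nn_downscale(pre, 10 / 8)    # remaining factor 10/8 = 1.25

assert len(reused) == len(direct)     # same output size
assert reused != direct               # cell values differ: an approximation
```

The second path touches only the factor-8 result (125 cells here) instead of the full signal, which is exactly the saving exploited in the rest of this chapter.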
We therefore simplify the problem of offering pre-aggregation support to non-standard aggregations by treating each type of non-standard operation separately. This simplification is similar to those found in data warehousing techniques, where pre-aggregation algorithms cover a specific type of query. For instance, pre-aggregation algorithms exist for queries that include a group-by clause in their predicates, while other algorithms are used for queries without join conditions.
We now focus on pre-aggregation support for one non-standard aggregate operation, scaling, for the following reasons:

• One of the most frequent operations in GIS and remote-sensing imaging applications is downscaling of some dataset or part thereof, such as obtaining a 1 GB overview of a 10 TB dataset.
• Scaling is a very expensive operation, as it normally requires a full scan of the dataset plus costly main-memory operations. Query optimization is therefore critical for this class of retrieval operations.
• Scaling is the only such operation that is already supported by pre-aggregation, at least for 2D datasets. This provides a point of reference for comparing the effectiveness of our algorithms against existing techniques.

Although the framework discussed in the following sections is centered around scaling operations, it can be adapted to support other non-standard aggregate operations by modifying the matching conditions, as discussed later in this chapter.
5.2 Conceptual Framework

A common optimization technique for speeding up scaling operations is to materialize selected downscaled versions of an object, e.g., using image pyramids. When evaluating a scaling operation with target scale factor s, the pyramid level with the largest scale factor s′ is determined, where s′ < s. This relationship between scaling operations places them within a lattice framework similar to that used for data cubes in data warehouse/OLAP applications [92]. Our conceptual framework and greedy algorithm for the selection of pre-aggregates are based on the work of Harinarayan et al. presented in [92]. The use of this approach was motivated by the similarities between our datasets (multidimensional arrays) and OLAP data cubes. Furthermore, the lattice framework and the greedy algorithm have proven successful in a variety of business applications.

Figure 5.1. Sample Lattice Diagram for a Workload with Five Scaling Operations
5.2.1 Lattice Representation

A scaling lattice consists of a set of queries L and dependence relations ≼, denoted by ⟨L, ≼⟩. The ≼ operator imposes a partial ordering on the queries of the lattice. Consider two queries q1 and q2. We say q1 ≼ q2 if q1 can be answered using only the results of q2. The base node of the lattice is the scaling operation with the smallest scale vector, upon which every query is dependent. Lattices are commonly represented in a diagram in which the elements are nodes, and there is a path downward from q1 to q2 if and only if q1 ≼ q2. The selection of pre-aggregates, that is, of queries for materialization, is equivalent to selecting vertices from the underlying nodes of the lattice. Fig. 5.1 shows a lattice diagram for a workload containing five queries. Each node has an associated label that represents a scaling operation for a given dataset, scale vector, and resampling method.
In our framework, we use the following function to define scaling operations:

scale(objName[lo_1:hi_1, ..., lo_n:hi_n], ⃗s, resMeth)   (5.1)

where

• objName[lo_1:hi_1, ..., lo_n:hi_n] is the name of the multidimensional raster image to be scaled. The operation can be restricted to a specific area of the raster object; in that case, the area is specified by defining lower (lo_n) and upper (hi_n) bounds for each dimension. If the spatial domain is omitted, the operation is performed on the full spatial extent defining the raster image.
• ⃗s is a vector where each element is a numeric value representing the scale factor used in a specific dimension of the raster image.
• resMeth specifies the resampling method to be applied to the original raster object.

For example, scale(CalFires, [2, 2, 2], nn) defines a scaling operation by a factor of two in each dimension, using nearest neighbor as the resampling method on a 3D dataset identified as CalFires.
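Under this notation, the dependence relation ≼ of Section 5.2.1 reduces to a componentwise comparison of scale vectors. The sketch below is our own illustration and assumes scale vector values denote reduction factors, so a query can be answered from any pre-aggregate of the same object and resampling method whose factors are componentwise no larger (i.e., whose result is at least as fine):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScaleOp:
    obj: str        # objName
    s: tuple        # scale vector (one reduction factor per dimension)
    res_meth: str   # resampling method

def depends_on(q1, q2):
    """q1 <= q2 in the lattice: q1 is answerable from the result of q2.
    q2 must address the same object with the same resampling method and
    be at least as fine (componentwise smaller-or-equal factors)."""
    return (q1.obj == q2.obj
            and q1.res_meth == q2.res_meth
            and len(q1.s) == len(q2.s)
            and all(a >= b for a, b in zip(q1.s, q2.s)))

base = ScaleOp("CalFires", (2, 2, 2), "nn")   # smallest scale vector: base node
q = ScaleOp("CalFires", (8, 8, 8), "nn")
assert depends_on(q, base)       # a factor-8 overview is derivable from factor-2
assert not depends_on(base, q)   # but not the other way around
```

The `ScaleOp` class and field names are illustrative; the thesis describes the operations only through the scale function of Eq. 5.1.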
5.2.2 Pre-Aggregation Selection Problem

Definition 5.4 (Pre-Aggregates Selection Problem) – Given a query workload Q and a storage space constraint C, the pre-aggregates selection problem is to select a set P ⊆ Q of queries such that P minimizes the overall cost of computing Q while the storage space required by P does not exceed the limit given by C. ✷
Considering existing view selection strategies in data warehousing/OLAP, the following selection criteria are suggested for pre-aggregates:

• Frequency. Pre-aggregates yield particularly significant increases in processing speed when scaling operations are executed with high frequency within a workload.
• Storage space. The storage space constraint of a candidate scaling operation must be at least the size of the storage required by the query in the workload with the smallest scale vector. This guarantees that for any query in the workload at least one pre-aggregate can be used for its computation.
• Benefit. A scaling operation may be used to compute the same and other dependent queries in the workload. A metric is therefore used to calculate the cost savings gained by using a candidate scaling operation. To evaluate the cost, we use the model presented in Section 4.2. We call this the benefit of a pre-aggregate set and normalize the benefit against the base object's storage volume.
Frequency

The frequency of query q, denoted by F(q), is the relative number of occurrences of q in the workload:

F(q) = N(q) / |Q|   (5.2)

where N(q) is a function that returns the number of occurrences of query q in workload Q.
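Eq. 5.2 can be computed directly by counting occurrences in the workload; the query strings below are placeholders:

```python
from collections import Counter

def frequencies(workload):
    """F(q) = N(q) / |Q| for every distinct query q in the workload."""
    counts = Counter(workload)
    return {q: n / len(workload) for q, n in counts.items()}

Q = ["scale(r,4,nn)", "scale(r,4,nn)", "scale(r,8,nn)", "scale(r,2,nn)"]
F = frequencies(Q)
assert F["scale(r,4,nn)"] == 0.5
assert abs(sum(F.values()) - 1.0) < 1e-12   # normalized, as in Eq. 5.5
```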
Storage Space

The storage space of a given query, denoted by S(q), represents the storage space required to save the result of query q; it is determined by the number of cells composing the output object defined in query q.
Benefit

The benefit of a candidate scale operation q for pre-aggregation is computed by adding the savings in query cost for each scaling operation in the workload that depends on q, including all queries identical to q. That is, query q may contribute to saving processing costs for the same or similar queries in the workload. In both cases, specific matching conditions must be satisfied.
Full-Match Conditions. Let q be a candidate query for pre-aggregation and p a query in workload Q. Let p and q both be scaling operations as defined in Eq. 5.1. There is a full match between q and p if and only if:

• the value of parameter objName[] in the scale function defined for q is the same as in p,
• the value of parameter ⃗s in the scale function defined for q is the same as in p, and
• the value of parameter resMeth in the scale function defined for q is the same as in p.

Partial-Match Conditions. Let q be a candidate query for pre-aggregation and p be a query in workload Q. There is a partial match between p and q if and only if:

• the value of parameter objName[] in the scale function defined for q is the same as in p,
• the value of parameter resMeth in the scale function defined for q is the same as in p,
• the parameter ⃗s for both q and p has the same dimensionality, and
• the vector values defined in ⃗s for q are higher than those defined in p.
Definition 5.5 (Benefit) – Let T ⊆ Q be a subset of scaling operations that can be fully or partially computed using query q. The benefit of query q per unit space, denoted by B(q), is the sum of the computational cost savings gained by selecting query q for pre-aggregation:

B(q) = ( F(q) · C(q) + Σ_{t∈T} F(t) · C_r(t, q) ) / size(q)   (5.3)

where F(q) represents the frequency of query q in the workload, C(q) is the cost of computing query q on the original dataset, C_r(t, q) is the relative cost of computing query t from q, and size(q) is a function that returns the number of cells composing the spatial domain component of query q. ✷
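Eq. 5.3 translates directly into code. In the sketch below, the cost values stand in for the cost model of Section 4.2, and all numbers are invented for illustration:

```python
def benefit(q, T, F, C, C_r, size):
    """B(q) per Eq. 5.3: savings from answering q itself plus savings on
    every dependent query t in T, normalized by the size of q."""
    saved = F[q] * C[q] + sum(F[t] * C_r[(t, q)] for t in T)
    return saved / size[q]

F = {"q": 0.5, "t1": 0.25, "t2": 0.25}         # frequencies (sum to 1)
C = {"q": 100.0}                               # cost of q on the original data
C_r = {("t1", "q"): 40.0, ("t2", "q"): 10.0}   # relative costs of t from q
size = {"q": 1000}                             # cells in q's spatial domain
b = benefit("q", ["t1", "t2"], F, C, C_r, size)
assert abs(b - (0.5 * 100 + 0.25 * 40 + 0.25 * 10) / 1000) < 1e-12
```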
5.3 Pre-Aggregates Selection

Pre-aggregating all distinct scaling operations in the workload is not always possible because of space limitations. This is similar to the problem of selecting views for materialization in OLAP. One approach to finding the optimal set of scaling operations to pre-compute consists of enumerating all possible combinations and finding the one that yields the minimum average query cost, or the maximum benefit. Finding the optimal set of pre-aggregates in this way has a complexity of O(2^n), where n is the number of queries in the workload. If the number of scaling operations on a given raster object is 50, there are 2^50 possible pre-aggregates for that object. Therefore, computing the optimal set of pre-aggregates exhaustively is not feasible; in fact, it is an NP-hard problem [92, 17].
We therefore consider the selection of pre-aggregates as an optimization problem where the input includes multidimensional datasets, a query workload, and an upper bound on available disk space. The output is a set of queries that minimizes the total cost of evaluating the query workload subject to the storage limit. We present an algorithm that uses the benefit per unit space of a scaling operation. We model the expected queries by a query workload, which is a set of scaling operations:

Q = {q_i | 0 < i ≤ n}   (5.4)

where each q_i has an associated non-negative frequency f_i. We normalize frequencies so that they sum up to 1:

Σ_{i=1}^{n} f_i = 1   (5.5)

Based on this setup we study different workload patterns.
The PRE-AGGREGATESSELECTION procedure returns a set P = {p_i | 0 < i ≤ n} of queries to be pre-aggregated. Input is a workload Q and a storage space constraint c. The workload contains a number of queries, each corresponding to a scaling operation as defined in Eq. 5.1.

Algorithm 3 PRE-AGGREGATESSELECTION
Require: A workload Q, and a storage space constraint c
1: P = {top scaling operation}
2: while (c > 0 and |P| ≠ |Q|) do
3:   p = highestBenefit(Q, P)
4:   if (c − |p| > 0) then
5:     c = c − |p|
6:     P = P ∪ {p}
7:   else
8:     c = 0
9:   end if
10: end while
11: return P

Frequency, storage space, and benefit per unit space are calculated for each distinct query in the workload. When calculating the benefit, we assume that each query is evaluated using the root (top) node, which is the first selected pre-aggregate, p_1. The second chosen pre-aggregate p_2 is the one with the highest benefit per unit space. The algorithm then recalculates the benefit of each scaling operation given that it is computed either from the root, if the scaling operation is above p_2, or from p_2 otherwise. Subsequent selections are performed in a similar manner; the benefit is recalculated each time a scaling operation is selected for pre-aggregation. The algorithm stops selecting pre-aggregates when the storage space constraint is reached, or when there are no more queries in the workload to be considered for pre-aggregation, i.e., all scaling operations in the workload have already been selected.

The function highestBenefit(Q, P) returns the scaling operation in Q with the highest benefit per unit space, given the already selected set P. The complexity of the algorithm is O(k · n²), where k is the number of selected pre-aggregates and n is the number of vertices in the lattice; this arises from the cost of sorting the pre-aggregates by benefit per unit size.
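Algorithm 3 can be sketched as follows. The benefit computation is abstracted behind a callable, since it depends on the cost model; the workload, sizes, and benefit values below are invented for illustration:

```python
def select_preaggregates(workload, sizes, capacity, top, benefit_of):
    """Greedy selection following Algorithm 3: repeatedly pick the query
    with the highest benefit per unit space until the storage budget is
    exhausted or every workload query has been selected."""
    selected = {top}              # the top scaling operation is mandatory
    c = capacity
    while c > 0 and len(selected) != len(workload):
        candidates = [q for q in workload if q not in selected]
        p = max(candidates, key=lambda q: benefit_of(q, selected))
        if c - sizes[p] > 0:
            c -= sizes[p]
            selected.add(p)
        else:
            c = 0                 # best candidate does not fit: stop
    return selected

workload = ["s2", "s4", "s8", "s16"]
sizes = {"s2": 100, "s4": 25, "s8": 6, "s16": 2}
static_benefit = {"s4": 4.0, "s8": 2.0, "s16": 1.0}   # fixed for the demo
chosen = select_preaggregates(workload, sizes, 30, "s2",
                              lambda q, sel: static_benefit[q])
assert chosen == {"s2", "s4"}    # budget 30 fits s4 (25) but not also s8 (6)
```

In the real algorithm `benefit_of` recomputes B(q) against the currently selected set, as described above; the demo uses static values only to keep the example self-contained.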
5.3.1 Complexity Analysis

Let m be the number of queries in the lattice. Suppose we have no queries selected except for the top query, which is mandatory. The time to answer a given query in the workload is the time taken to compute the query using the top query, calculated according to our cost model. We denote this time by T_o. Suppose that in addition to the top query, we choose a set of queries P. Denote the average time to answer a query by T_p. The benefit of the set of queries P is the reduction in average time to answer a query, that is, T_o − T_p. Thus, minimizing the average time to answer a query is equivalent to maximizing the benefit of a set of queries.

Let p_1, p_2, ..., p_k be the k queries selected by the PRE-AGGREGATESSELECTION algorithm. Let b_i be the benefit achieved by the selection of p_i, for i = 1, 2, ..., k. That is, b_i is the benefit of p_i with respect to the set consisting of the top query and p_1, p_2, ..., p_{i−1}. Let P = {p_1, p_2, ..., p_k}.

Let O = {o_1, o_2, ..., o_k} be an optimal set of k queries, i.e., those queries giving the maximum benefit. Let m_i be the benefit achieved by the selection of o_i, for i = 1, 2, ..., k. That is, m_i is the benefit of o_i with respect to the set consisting of the top query and o_1, o_2, ..., o_{i−1}.

Harinarayan et al. [92] proved that the benefit of the greedy algorithm can never be less than (e − 1)/e ≈ 0.63 times the benefit of the optimal choice of pre-aggregated queries.
5.4 Answering Scaling Operations Using Pre-Aggregated Data

We say that a pre-aggregate p answers query q if there exists some other query q′ which, when executed on the result of p, provides the result of q. The result can be either exact with respect to q (q′ ∘ p ≡ q) or only an approximation (q′ ∘ p ≈ q). In practice, the result is often an approximation because of the effect of resampling the original dataset. The same effect is observed in the traditional image pyramids approach, but it is considered negligible since the approximations are good enough for many applications. In our approach, when two or more pre-aggregates qualify for computing a given scaling operation, we pick the pre-aggregate whose scale vector value is closest to the one defined in the scaling operation.

Example 5.1 – Assume the queries listed in Table 5.1 have been pre-aggregated, and suppose we want to compute the following query: q = scale(ras01, (4.0, 4.0, 4.0), bi). From the list of available pre-aggregates, the query can be answered either by using p2 or p3. Of these two pre-aggregates, p3 has the scale vector closest to q. Thus, q′ = scale(p3, (0.87, 0.87, 0.87), bi). Note that q′ represents a rewritten scaling operation in terms of the pre-aggregate. ✷
Table 5.1. Sample Pre-Aggregates.

ID   Raster Name   Scale Vector       Resampling Method
p1   ras01         (2.0, 2.0, 2.0)    nn
p2   ras01         (3.0, 3.0, 3.0)    bi
p3   ras01         (3.5, 3.5, 3.5)    bi
p4   ras01         (6.0, 6.0, 6.0)    bi
The REWRITEOPERATION procedure returns for query q a query q′ that has been rewritten in terms of a pre-aggregate identified by p_id. The input of the algorithm is the scaling operation q and a set of pre-aggregates P. The algorithm looks for a FULL-MATCH between q and one of the elements in P. To this end, the algorithm verifies that the matching conditions listed in Section 5.2.2 are all satisfied. If a full match is found, it returns the identifier of the matched pre-aggregate. Otherwise, the algorithm verifies PARTIAL-MATCH conditions for all pre-aggregates in P; all qualified pre-aggregates are added to set S. In the case of a partial match, the algorithm finds the pre-aggregate with the scale vector closest to the one defined in q. REWRITEQUERY rewrites the original query as a function of the selected pre-aggregate and adjusts the values of the scale vector to perform the complementary scaling operation. The algorithm makes use of the following auxiliary functions:

• FULLMATCH(q, P). Verifies that all full-match conditions are satisfied. If no match is found, it returns 0; otherwise it returns the id of the matching pre-aggregate.
• PARTIALMATCH(q, P). Verifies that all partial-match conditions are satisfied. Each qualified pre-aggregate of P is added to set S.
• CLOSESTSCALEVECTOR(q, S). Compares the scale vectors of q and the elements of S, and returns the identifier (p_id) of the pre-aggregate whose scale vector is the closest to that defined for q.
• REWRITEQUERY(q, p_id). Rewrites query q in terms of the selected pre-aggregate and adjusts the scale vector values accordingly.
Algorithm 4 REWRITEOPERATION
Require: A query q, and a set of pre-aggregates P
1: initialize S = {}, p_id = 0
2: p_id = fullMatch(q, P)
3: if (p_id == 0) then
4:   S = partialMatch(q, P)
5:   p_id = closestScaleVector(q, S)
6: end if
7: q′ = rewriteQuery(q, p_id)
8: return q′
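Algorithm 4 can be sketched over the pre-aggregates of Table 5.1. The dictionary representation is our own; the partial-match test assumes a pre-aggregate qualifies when its scale factors are componentwise no larger than the query's, and the complementary scale vector follows Example 5.1 (s_pre / s_query, e.g., 3.5/4.0 ≈ 0.87):

```python
def full_match(q, P):
    """Return the id of a pre-aggregate matching q exactly, else 0."""
    for pid, p in P.items():
        if p["obj"] == q["obj"] and p["s"] == q["s"] and p["res"] == q["res"]:
            return pid
    return 0

def partial_match(q, P):
    """Pre-aggregates on the same object and resampling method whose scale
    vector is componentwise no larger than q's (at least as fine)."""
    return {pid: p for pid, p in P.items()
            if p["obj"] == q["obj"] and p["res"] == q["res"]
            and len(p["s"]) == len(q["s"])
            and all(ps <= qs for ps, qs in zip(p["s"], q["s"]))}

def closest_scale_vector(q, S):
    """Id of the candidate whose scale vector is closest to q's."""
    return min(S, key=lambda pid: sum((qs - ps) ** 2
                                      for ps, qs in zip(S[pid]["s"], q["s"])))

def rewrite_operation(q, P):
    pid = full_match(q, P)
    if pid == 0:
        pid = closest_scale_vector(q, partial_match(q, P))
    # complementary scale vector, as in Example 5.1: s_pre / s_query
    comp = tuple(ps / qs for ps, qs in zip(P[pid]["s"], q["s"]))
    return pid, comp

P = {"p1": {"obj": "ras01", "s": (2.0, 2.0, 2.0), "res": "nn"},
     "p2": {"obj": "ras01", "s": (3.0, 3.0, 3.0), "res": "bi"},
     "p3": {"obj": "ras01", "s": (3.5, 3.5, 3.5), "res": "bi"},
     "p4": {"obj": "ras01", "s": (6.0, 6.0, 6.0), "res": "bi"}}
q = {"obj": "ras01", "s": (4.0, 4.0, 4.0), "res": "bi"}
pid, comp = rewrite_operation(q, P)
assert pid == "p3" and abs(comp[0] - 0.875) < 1e-9   # Example 5.1: ~0.87
```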
5.5 Experimental Results

Experiments were conducted to evaluate the effectiveness of the pre-aggregation selection and rewriting algorithms in supporting scaling operations. They were run on a machine with a 3.00 GHz Intel Pentium 4 processor running SuSE Linux 9.1. The workstation had a total physical memory of 512 MB.

The query workload consisted of scaling operations with different scaling vectors. Different data distributions of the query workload were also considered. Despite the growing popularity of Web mapping services for GIS raster information processing, very few studies report on user behavior with those services. One of the primary reasons for the lack of research in this area may be the limited availability of the datasets outside of specialized research groups. Moreover, while query patterns related to scaling operations on 2D datasets are difficult to find, no empirical workload distributions were found for datasets of higher dimensionality. We therefore resorted to a set of artificial distributions that cover many practical situations in GIS and remote-sensing imaging.
Most pre-aggregation algorithms in OLAP and image pyramids assume a uniform distribution of the values given for the scale vector in the query workload, so we considered the same type of distribution for our experiments. We also considered a Poisson distribution of the scale vector values; the rationale is that such a distribution covers situations where the dataset is scaled down by factors that typically fall within a narrow range of scale vectors. For example, very large objects may need to be scaled down by large scale vectors so they can be efficiently transferred back and forth via Web services [77]. We further considered applications where the dataset is always scaled down by the same scale vector; we refer to such an access pattern as a peak distribution. Finally, we investigated a step distribution, which covers cases where scaling operations can be grouped within specific ranges of scale vectors.
Our experiments were performed on datasets generated from three real-life raster objects:

• Dataset R1 consists of a 2D raster object with spatial domain [0:15359, 0:10239]. The dataset contains 600 tiles, each with a spatial domain of [0:512, 0:512]. The total number of cells composing the raster object is about 157 million.
• Dataset R2 consists of a 3D raster object with spatial domain [0:11299, 0:10459, 0:3650]. The dataset contains 3,214 tiles, each with a spatial domain of [0:512, 0:512, 0:512]. The total number of cells composing the raster object is about 432 billion.
• Dataset R3 consists of a 4D raster object with spatial domain [0:10150, 0:7259, 0:2430, 0:75640]. The dataset contains 197,070 tiles, each with a spatial domain of [0:512, 0:512, 0:512, 0:512]. The total number of cells composing the raster object is about 1.35 × 10^16.
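As a quick check, the cell counts for R1 and R3 follow directly from the spatial domains, whose bounds are inclusive (a dimension [lo:hi] contributes hi − lo + 1 cells):

```python
def cell_count(domain):
    """Number of cells in a spatial domain given as (lo, hi) pairs,
    with inclusive bounds as used throughout this chapter."""
    n = 1
    for lo, hi in domain:
        n *= hi - lo + 1
    return n

r1 = cell_count([(0, 15359), (0, 10239)])
r3 = cell_count([(0, 10150), (0, 7259), (0, 2430), (0, 75640)])
assert r1 == 157_286_400              # ~157 million cells
assert abs(r3 / 1.35e16 - 1) < 0.01   # ~1.35e16 cells
```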
In the rest of this section, we present the results of our experiments according to the dimensionality of the data.

5.5.1 2D Datasets

In this experiment the workload consisted of 12,800 scaling operations defined for dataset R1.
Uniform Distribution

The scaling vectors of the queries in the workload were uniformly distributed; scale vectors were integers ranging from 2 to 256. Per observations in practice, we assumed that both dimensions were coupled. We considered a storage space constraint of 35%, which is slightly higher than the additional storage space taken by image pyramids. The PRE-AGGREGATESSELECTION algorithm yields 12 pre-aggregates for this test, corresponding to scaling operations with scale vectors 2, 4, 6, 11, 15, 22, 32, 46, 67, 95, 137, and 182. The cost of computing the workload using these pre-aggregates is 18,565. In contrast, the image pyramid approach selects scaling operations with scale vectors 2, 4, 8, 16, 32, 64, 128, and 256, requires 33% additional storage space, and computes the workload at a cost of 29,166. The results of this experiment show that the pre-aggregates selected by our algorithm provide improved performance for scaling operations over image pyramids: the cost of computing the workload using our algorithm is 36% less than that incurred by image pyramids, at a price of 2% additional storage space.
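The reported figures follow directly from the stated workload costs and storage fractions; a one-line check:

```python
ours, pyramids = 18_565, 29_166       # workload costs from the text
saving = 1 - ours / pyramids
assert round(saving * 100) == 36      # 36% cheaper than image pyramids
assert 35 - 33 == 2                   # 2% extra storage vs. pyramids
```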
Fig. 5.2(a) shows the distribution of the scale vectors of all queries in the workload. The pre-aggregates selected by image pyramids and by our pre-aggregation selection algorithm are shown in Fig. 5.2(b) and 5.2(c), respectively.
Poisson Distribution<br />
The workload for this experiment consisted of scaling operations where the scale vec<strong>to</strong>rs<br />
had a Poisson distribution, and the mean value of the scale vec<strong>to</strong>r equaled 50. The<br />
PRE-AGGREGATES-SELECTION algorithm yields 33 pre-aggregates for this test that<br />
executed scaling operations using scale vec<strong>to</strong>rs from 34 <strong>to</strong> 66. The cost of computing<br />
the workload using these pre-aggregates is 42, 455. In contrast, image pyramids
5.5 Experimental Results 87<br />
(a) Query workload (Uniform distribution)<br />
(b) Selected queries for materialization by image pyramids<br />
(c) Selected queries for materialization by our pre-aggregation selection algorithm<br />
Figure 5.2. Query Workload with Uniform Distribution<br />
selects scaling operations with scale vec<strong>to</strong>rs: 2, 4, 8, 16, 32, 64, 128, and the cost of<br />
computing the workload is 95, 468. Thus, the cost of computing the workload using
88 5. <strong>Pre</strong>-<strong>Aggregation</strong> Support Beyond Basic Aggregate Operations<br />
pre-aggregates selected by our algorithm is 55% less than that incurred using image<br />
pyramids. There is also a major difference with respect <strong>to</strong> the additional s<strong>to</strong>rage<br />
space required by both approaches: image pyramids requires 33% additional s<strong>to</strong>rage<br />
space, while our algorithm requires only 5% additional space <strong>to</strong> s<strong>to</strong>re the selected<br />
pre-aggregates.<br />
Figure 5.3. Query Workload with Poisson Distribution: (a) query workload (Poisson distribution)

Fig. 5.3(a) shows the distribution of the scale vectors of all queries in the workload. The pre-aggregates selected by image pyramids are shown in Fig. 5.4(a). Even though no query in the workload has a scale factor smaller than 33, image pyramids still allocates space for pre-aggregates 2, 4, 8, 16, and 32, which account for much of the overall space requirement (33%). In contrast, our algorithm uses the query frequencies in the workload to select the queries for pre-aggregation; see Fig. 5.4(b). For this workload configuration, it is possible to pre-aggregate all distinct queries and provide much faster query response times than image pyramids. This shows the benefit of considering query frequencies in the workload. If we pick a mean higher than 50, the additional storage space needed by the pre-aggregates is minimal. Conversely, if the mean is shifted to a lower scale vector value, e.g. 16, the storage space needed by our pre-aggregation algorithm can increase to up to 35%.
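The storage asymmetry noted above is a direct consequence of cell counts: a 2D pre-aggregate with scale factor s stores roughly 1/s² of the base cells, so the low pyramid levels dominate the space budget. A back-of-the-envelope check (a sketch; the thesis's exact storage accounting may differ, e.g. due to tiling):

```python
# Fraction of base-array cells stored by a set of 2D pre-aggregates:
# scaling by s in both dimensions keeps roughly (1/s)**2 of the cells.
def storage_fraction(scales):
    return sum((1.0 / s) ** 2 for s in scales)

pyramid_low = storage_fraction([2, 4, 8, 16, 32])   # levels unused by this workload
ours = storage_fraction(range(34, 67))              # all queried scales
print(f"pyramid levels 2..32: {pyramid_low:.1%}")   # ~33.3%
print(f"scales 34..66:        {ours:.1%}")          # ~1.5%
```

Materializing every queried scale from 34 to 66 costs a small fraction of what the unused low pyramid levels cost, which is why query-frequency-driven selection wins so clearly here.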
Peak Distribution

In this experiment, the query workload consisted of scaling operations with a scale vector having a value of 100 in each dimension. The PRE-AGGREGATES-SELECTION algorithm yields a single pre-aggregate for this test, corresponding to a scaling operation with scale vector (100, 100). The cost of computing the workload using this pre-aggregate is 1.27e+08. In contrast, image pyramids selects scaling operations with scale factor values 2, 4, 8, 16, 32, 64, and 128 in each dimension, and computes the workload at a cost of 3.01e+08. Thus, the cost of computing the workload using the pre-aggregates selected by our algorithm is 58% less than the cost incurred by image pyramids. Furthermore, there is a major difference in the storage space required by the two approaches: image pyramids requires 33% additional storage space, while our algorithm requires only 5% additional space.

Figure 5.4. Selected Queries for Pre-Aggregation: (a) queries selected for pre-aggregation by image pyramids; (b) queries selected for pre-aggregation by our pre-aggregation selection algorithm
Fig. 5.5(a) shows the distribution of the scale vectors of all queries in the workload. The pre-aggregates selected by image pyramids are shown in Fig. 5.6(a). Image pyramids allocates space for pre-aggregates with scale factors 2, 4, 8, 16, 32, 128, and 256 in each dimension. In contrast, our pre-aggregation selection algorithm selected a single query, shown in Fig. 5.6(b). Although our algorithm makes more efficient use of storage space and computes the workload faster than image pyramids, this kind of scenario is unlikely to occur in practice, and the storage overhead of a full pyramid is simply not justified for it. However, users may benefit from having a system that automatically pre-aggregates such operations with minimal overhead, a capability that our algorithm provides.
Figure 5.5. Query Workload with Peak Distribution: (a) query workload (peak distribution)

Figure 5.6. Selected Queries for Pre-Aggregation: (a) queries selected for pre-aggregation by image pyramids; (b) queries selected for pre-aggregation by our pre-aggregation selection algorithm

Step Distribution

We now consider a scenario where the scale vectors are distributed across several frequency ranges, i.e. they follow a step distribution. The PRE-AGGREGATES-SELECTION algorithm yields 6 pre-aggregates for this test, corresponding to scaling operations with scale vectors 6, 8, 13, 19, 75, and 200. The cost of computing the workload using these pre-aggregates is 1.5e+09. In contrast, image pyramids selects scaling operations with scale vectors 2, 4, 8, 16, 32, 64, and 128, and computes the workload at a cost of 2.21e+09. The cost of computing the workload using the pre-aggregates selected by our algorithm is therefore 32% less than that incurred by image pyramids. Moreover, there is a major difference in the additional storage space required by the two approaches: image pyramids requires 33% additional storage space, while our algorithm requires only 15%.
Figure 5.7. Query Workload with Step Distribution: (a) query workload (step distribution)

Fig. 5.7(a) shows the distribution of the scale vectors of all queries in the workload. The pre-aggregates selected by image pyramids are shown in Fig. 5.8(a).
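The PRE-AGGREGATES-SELECTION algorithm itself is presented earlier in the chapter; as a reading aid, the following is a minimal greedy sketch of the idea the 2D experiments exercise: benefit-per-storage selection under a space budget, in the spirit of classical OLAP view selection. All names and the cost model (answering a query with scale factor q from the best materialized scale p ≤ q costs the cell count of that pre-aggregate, for a hypothetical 65536 × 65536 base array) are simplifying assumptions, not the thesis code.

```python
# Greedy, benefit-per-storage selection of 2D pre-aggregates under a
# storage budget -- a simplified sketch, not the thesis algorithm.
def cost(q, materialized, n=65536):
    # Answering scale q from the best materialized scale p <= q reads
    # (n/p)**2 cells; the base array (p = 1) is always available.
    p = max(s for s in materialized | {1} if s <= q)
    return (n / p) ** 2

def select(workload, budget, candidates):
    # workload: {scale: frequency}; storing scale s costs 1/s**2.
    chosen, space = set(), 0.0
    while True:
        best, best_ratio = None, 0.0
        for s in candidates - chosen:
            if space + 1 / s ** 2 > budget:
                continue  # would exceed the storage budget
            gain = sum(f * (cost(q, chosen) - cost(q, chosen | {s}))
                       for q, f in workload.items())
            ratio = gain * s * s  # benefit per unit of storage
            if ratio > best_ratio:
                best, best_ratio = s, ratio
        if best is None:
            return chosen
        chosen.add(best)
        space += 1 / best ** 2

wl = {2: 5, 4: 3, 50: 10, 100: 2}
print(sorted(select(wl, budget=0.35, candidates=set(wl))))  # [2, 4, 50, 100]
```

Under these assumptions the frequent high scales are materialized first because their benefit per unit of storage is enormous, mirroring the behavior observed for the Poisson and peak workloads above.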
Figure 5.8. Selected Queries for Pre-Aggregation: (a) queries selected for pre-aggregation by image pyramids; (b) queries selected for pre-aggregation by our pre-aggregation selection algorithm

5.5.2 3D Datasets

To test our pre-aggregation algorithms on 3D time-series datasets, we picked four distribution patterns for the scaling vectors. For simplicity, we label the dimensions x, y, and t, respectively. The following assumption, taken from observations in practice, is common to all distribution types: the scale vector along the first two dimensions is the same, i.e. x = y. The aim of this test is to measure average query cost while varying the storage space available for pre-aggregation.

Uniform distribution in x, y, t

In this experiment, the workload consisted of 10,000 scaling operations referring to the 3D dataset R2 described at the beginning of this section. Scale vectors were uniformly distributed along the x, y, and t dimensions, with values ranging from 2 to 256. Fig. 5.9 shows the distribution of the scaling vectors in the workload. We executed the PRE-AGGREGATES-SELECTION algorithm for different values of the storage space constraint (c). The minimum storage space required to support the root node of the lattice was 12.5% of the size of the original dataset. Fig. 5.10 shows the average query cost as storage space is increased. A small amount of storage space dramatically reduces the average query cost; the improvement diminishes, however, as allocated space grows beyond 36%. Fig. 5.11 shows the scaling operations selected for pre-aggregation when c = 36%. For this instance of the storage space constraint, the algorithm selected 49 pre-aggregates. The total cost of computing the workload is 6.44e+05. In contrast, computing the workload using the original dataset incurs a cost of 1.28e+12.

Figure 5.9. Workload with Uniform Distribution along x, y, and t

Figure 5.10. Average Query Cost over Storage Space
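The quoted root-node sizes follow directly from cell counts: the root of the lattice is the pre-aggregate with the smallest scale factor per dimension, and scaling a 3D array by (2, 2, 2) keeps (1/2)³ = 12.5% of its cells. A sketch (the scale vector (2, 2, 8), matching the 3.13% reported for the mixed uniform/Poisson lattice below, is inferred from the stated ranges, not given in the text):

```python
from functools import reduce

def storage_fraction(scale_vector):
    # Fraction of base-array cells kept by a pre-aggregate that scales
    # dimension i down by scale_vector[i].
    return reduce(lambda acc, s: acc / s, scale_vector, 1.0)

print(storage_fraction((2, 2, 2)))  # 0.125   -> the 12.5% root node above
print(storage_fraction((2, 2, 8)))  # 0.03125 -> 3.13% (rounded)
```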
Figure 5.11. Selected Pre-Aggregates, c = 36%
Uniform distribution in x, y and Poisson distribution in t

In this experiment, the workload consisted of 23,460 scaling operations referring to the 3D dataset R2. The scale vectors were uniformly distributed along x and y, and followed a Poisson distribution along t. Scale vector values ranged from 2 to 256 in the x and y dimensions, whereas in t they ranged from 8 to 16, with a mean value of 12. Fig. 5.12 shows the distribution of the scaling vectors in the workload. Note that the scale vector values in the x and y dimensions are coupled. The frequency of the various scale factor values is denoted by f. We ran the PRE-AGGREGATES-SELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 3.13% of the size of the original dataset. Fig. 5.13 shows the average query cost as storage space increases. A small amount of storage space dramatically reduces the average query cost; the improvement diminishes, however, as allocated space grows beyond 26%. Fig. 5.14 shows the scaling operations selected for pre-aggregation when c = 26%. For this instance of the storage space constraint, the algorithm selected 67 pre-aggregates. The total cost of computing the workload is 1.21e+07. In contrast, computing the workload using the original dataset incurs a cost of 2.31e+11.
Figure 5.12. Workload with Uniform Distribution along x, y, and Poisson Distribution in t

Figure 5.13. Average Query Cost as Space is Varied

Poisson distribution in x, y, t

In this experiment, the workload consisted of 600 scaling operations referring to the 3D dataset R2. The scale vectors followed a Poisson distribution along all three dimensions x, y, and t. Scale vector values ranged from 2 to 10 in the x and y dimensions, whereas in t they ranged between 8 and 16, with a mean value of 12. Fig. 5.15 shows the distribution of the scaling vectors in the workload. We ran the PRE-AGGREGATES-SELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 4.18% of the size of the original dataset. Fig. 5.16 shows the average query cost as storage space is increased. A small amount of storage space dramatically reduces the average query cost; the improvement diminishes, however, as allocated space grows beyond 26%. Fig. 5.17 shows the scaling operations selected for pre-aggregation when c = 30%. For this instance of the storage space constraint, the algorithm selected 23 pre-aggregates. The total cost of computing the workload is 1680. In contrast, computing the workload using the original dataset incurs a cost of 1.34e+12.

Figure 5.14. Selected Pre-Aggregates, c = 26%

Figure 5.15. Workload with Poisson Distribution along x, y, and t
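The Poisson workloads were generated synthetically; the thesis does not spell out the generator, so the following sketch shows one plausible way to draw such a workload with the standard library only. The x/y mean of 6 is an assumption (the text states only the ranges and the t mean of 12), and the clipping to the stated ranges is likewise an assumption.

```python
import math
import random

random.seed(0)

def poisson(lam):
    # Knuth's algorithm: multiply uniform draws until the running
    # product drops below exp(-lam); the number of draws before that
    # point is Poisson(lam)-distributed.
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def clipped(lam, lo, hi):
    return min(max(poisson(lam), lo), hi)

# 600 scale vectors with coupled x = y, as in the 3D Poisson experiment.
workload = []
for _ in range(600):
    xy = clipped(6, 2, 10)        # assumed mean 6, stated range [2, 10]
    workload.append((xy, xy, clipped(12, 8, 16)))
```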
Figure 5.16. Average Query Cost as Space is Varied

Figure 5.17. Selected Pre-Aggregates, c = 30%

Poisson distribution in x, y and uniform distribution along t

In this experiment, the workload consisted of 924 scaling operations referring to the 3D dataset R2. The scale vectors followed a Poisson distribution along the x and y dimensions and a uniform distribution along t. Scale vector values ranged from 2 to 10 in the x and y dimensions, and were uniformly distributed along t. Fig. 5.18 shows the distribution of the scaling vectors in the workload. We ran the PRE-AGGREGATES-SELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 4% of the size of the original dataset. Fig. 5.19 shows the average query cost as storage space is increased. A small amount of storage space dramatically reduces the average query cost; the improvement diminishes, however, as allocated space grows beyond 21%. Fig. 5.20 shows the scaling operations selected for pre-aggregation when c = 21%. For this instance of the storage space constraint, the algorithm selected 17 pre-aggregates. The total cost of computing the workload is 1472. In contrast, computing the workload using the original dataset incurs a cost of 1.63e+12.

Figure 5.18. Workload with Poisson Distribution along x, y, and Uniform Distribution in t
Figure 5.19. Average Query Cost as Space is Varied

Figure 5.20. Selected Pre-Aggregates, c = 21%

5.5.3 4D Datasets

For 4D datasets, we considered ECHAM T-42 as a typical use case from climate modeling. ECHAM T-42 is an energy and mass budget model developed by the Max-Planck-Institute for Meteorology [16]. We assumed that the x and y dimensions are scaled down by the same scale value, whereas the scale values along z and t may vary according to the specific analysis requirements of a given application. Looking at the sample dimensions of the ECHAM T-42 model shown in Table 5.2, it is clear that the extents of the first three dimensions are much smaller than that of the fourth dimension (time).

In this experiment, the workload consisted of 1,137 scaling operations referring to the 4D dataset R3. We assumed that the scale vectors followed a Poisson distribution in each of the four dimensions; the rationale behind this assumption is that scientists are often interested in a highly selective subset of the data, and a Poisson distribution fits this access pattern nicely. Scale vector values ranged from 2 to 11 in the x and y dimensions, with a mean of 6; from 10 to 19 along the z dimension, with a mean of 14; and from 230 to 239 along t, with a mean of 234. Table 5.3 shows the distribution of the scale factors of all scaling operations in the workload.

We ran the PRE-AGGREGATES-SELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 1.25% of the size of the original dataset. Table 5.4 shows the scaling operations selected for pre-aggregation when c = 1.3%. For this instance of the storage space constraint, the algorithm selected the 4 pre-aggregates shown in Table 5.4. The total cost of computing the workload is 3361. In contrast, computing the workload using the original dataset incurs a cost of 1.35e+16.
Table 5.2. ECHAM T-42 Climate Simulation Dimensions

Dimension             Extent
Longitude             128
Latitude              64
Elevation             17
Time (24 min/slice)   200 years (2,190,000 slices)
Table 5.3. 4D Scaling: Scale Vector Distribution

Scale Vector     Count
2,2,10,230       200
3,3,11,231       300
4,4,12,232       500
5,5,13,233       800
6,6,14,234       1000
7,7,15,235       1000
8,8,16,236       800
9,9,17,237       500
10,10,18,238     300
11,11,19,239     200
Table 5.4. 4D Scaling: Selected Pre-Aggregates

Scale Vector     Count
2,2,10,230       200
4,4,12,232       500
6,6,14,234       1000
8,8,16,236       800
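A quick consistency check on the storage side: under the cell-count model, the four pre-aggregates of Table 5.4 together occupy far less than the c = 1.3% budget (a sketch; the thesis's storage accounting may differ, e.g. due to tiling):

```python
import math

# Cell-count fractions of the four selected pre-aggregates (Table 5.4):
# a 4D pre-aggregate with scale vector s keeps 1/(s_x*s_y*s_z*s_t) cells.
selected = [(2, 2, 10, 230), (4, 4, 12, 232), (6, 6, 14, 234), (8, 8, 16, 236)]
fraction = sum(1 / math.prod(s) for s in selected)
print(fraction)  # ~0.00014, i.e. well below the c = 1.3% budget
```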
5.6 Summary

This chapter described our investigation of the problem of intelligently picking a subset of scaling operations for pre-aggregation given a storage space constraint. There is a trade-off between the amount of space allocated for pre-aggregation and the average query cost of scaling operations. We introduced a pre-aggregation selection algorithm that, based on a given query workload, determines a set of pre-aggregates in the face of storage space constraints.

We performed experiments on 2D, 3D, and 4D datasets using different distribution patterns for the scale vectors. We relied on artificial data distributions since no empirical distributions were available. In addition to uniformly distributed scale vectors, we considered non-uniform distributions including Poisson, peak, and step. For 2D datasets, we showed that our algorithm outperforms image pyramids. In particular, for non-uniform data distributions, our pre-aggregation selection algorithm not only provides a lower average query cost but also makes much more efficient use of storage space. This is because our algorithm considers the frequency of each query and the cost savings (benefit) it provides for computing the workload. Indeed, the major advantage of our algorithm over image pyramids is not the improved average query cost but the reduced amount of storage space required for the pre-aggregates, especially for non-uniform distributions.

In our experiments with 3D and 4D datasets, we showed the effect of the storage space available for pre-aggregation on average query cost. We observed that a small storage overhead is sufficient to reduce average query costs dramatically. Since there are no similar techniques against which to compare our results, we compared them against the average query costs obtained by using the original data.
Chapter 6

Conclusion

One of the biggest challenges for database technology is to provide effective and efficient solutions for archiving and managing extremely large volumes of multidimensional array data. This thesis investigates the problem of applying OLAP pre-aggregation technology to speed up aggregate query processing in array databases for GIS and remote-sensing imaging applications.

We presented a study of fundamental imaging operations in GIS. Using a formal algebraic framework, Array Algebra, we classified GIS operations according to three basic algebraic operators and thus identified a set of operations that can benefit from pre-aggregation techniques. We argued that OLAP pre-aggregation techniques cannot be applied in a straightforward manner to array databases for our target applications: although similar, the data structures of the two application domains differ in fundamental aspects. In OLAP, multidimensional data spaces are spanned by axes, with cell values sitting on the grid at intersection points. This is paralleled by raster image data, which are discretized during acquisition. Thus, the structure of an OLAP data cube is rather similar to a raster array. Dimension hierarchies in OLAP serve to group value ranges along an axis. Querying data by referring to coordinates on the measure axes yields ground data, whereas queries using axes higher up in a dimension hierarchy return aggregated values. A main differentiating criterion between OLAP data and raster image data is density: OLAP data are sparse, typically 5% dense, whereas raster image datasets are 100% dense. Note also that dimensions in OLAP are treated as business perspectives, such as products or stores; these are non-spatial dimensions, in contrast with the spatial nature of raster image datasets. There are, however, core similarities that motivated us to research OLAP pre-aggregation techniques further. For example, both array databases and OLAP systems employ multidimensional data models to organize their data. The operations also convey a high degree of similarity: a roll-up (aggregate) operation in OLAP is very similar to a scaling operation in the raster domain. Moreover, both application domains make use of pre-aggregation to speed up query processing, although at different levels of maturity and scalability.

We presented a framework that focuses on computing basic aggregate operations using pre-aggregated data. We argued that the decision to compute an aggregate
query using pre-aggregated data is influenced by the structural characteristics of the query and of the pre-aggregate. Thus, by comparing the query tree structures of the two, one can determine whether the pre-aggregated result contributes fully or partially to the final answer of the query. The best case occurs when there is a full match between the query and the pre-aggregate, since the time taken to compute the query is reduced to the time it takes to retrieve the result. In the case of a partial match, however, several pre-aggregates can be considered for computing the answer to a query, and a decision has to be made as to which pre-aggregates provide the best performance in terms of execution time. To this end, we distinguished between different types of pre-aggregates and presented a cost model to calculate the cost of using each qualifying pre-aggregate. We then presented an algorithm that selects the best execution plan for evaluating a query over pre-aggregated data. Tests performed on real-life raster image datasets showed that our distinction between different types of pre-aggregates is useful for determining the pre-aggregate that provides the highest benefit (in terms of execution time) for computing a given query.
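The full/partial matching decision can be illustrated with a toy predicate. This is a sketch under simplifying assumptions: the thesis compares query tree structures, whereas here a pre-aggregate is reduced to its scale vector and a partial match to componentwise divisibility; `match_kind` is a hypothetical helper, not thesis code.

```python
def match_kind(query_scale, pre_scale):
    # Full match: the pre-aggregate was built with exactly the requested
    # scale vector, so answering the query is a pure retrieval.
    # Partial match: each query factor is a multiple of the stored one,
    # so the query can be computed by further scaling the pre-aggregate.
    if query_scale == pre_scale:
        return "full"
    if all(q % p == 0 for q, p in zip(query_scale, pre_scale)):
        return "partial"
    return "none"

print(match_kind((100, 100), (100, 100)))  # full
print(match_kind((8, 8), (4, 4)))          # partial: rescale by (2, 2)
print(match_kind((6, 6), (4, 4)))          # none (4 does not divide 6)
```

With several partial matches available, a cost model such as the one in Chapter 4 decides which pre-aggregate is cheapest to finish the computation from.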
We then described the issues that arise in generalizing our pre-aggregation framework to support more complex aggregate operations, and justified our decision to focus on one particular operation: scaling. Traditionally, 2D scaling operations have been performed using image pyramids. Practice shows that pyramids are typically constructed with scale levels in powers of 2, yielding scale vectors 2, 4, 8, 16, 32, 64, 128, 256, and 512. The materialization of the pyramid requires an estimated 33% additional storage space. Our pre-aggregation selection algorithm is similar to the pyramid approach in that it selects a set of queries for materialization, where each level corresponds to a scaling operation with a defined scale factor. However, the selection of such queries is not restricted to a fixed number of levels separated by powers of two. Instead, our selection algorithm considers the frequency of each query in the workload and how the result of each individual query can help reduce the overall cost of computing the workload. We compared the performance of our pre-aggregation algorithm against that of image pyramids: the results showed that for workloads with uniformly distributed scale vectors, our algorithm computes the workload 36% more cheaply than image pyramids while requiring 7% more space. For scale vectors following a Poisson distribution, our algorithm computes the workload at a cost 55% lower than the pyramid approach. Furthermore, our algorithm can be applied to datasets of higher dimensions, a feature not supported by traditional image pyramids.
6.1 Future Work

There are natural extensions to this work that would help expand and strengthen the results. One area of further work is adding self-management capabilities, so that the DBMS maintains statistics about each scaling operation appearing in incoming queries and, at some suitable time, adjusts the pre-aggregate set accordingly. OLAP dynamic pre-aggregation addresses a similar problem. Another area is applying the results studied here to the many real-world situations where data cubes contain one or more non-spatio-temporal dimensions, such as pressure, which are common in meteorological and oceanographic datasets.

Workload distribution deserves further investigation. While the distributions chosen are practical and relevant, there may be further situations worth considering. Gaining empirical figures from user-exposed services such as EarthLook 1 would be useful for tuning our pre-aggregation selection algorithms. Further investigation is also necessary in the realm of rewriting scaling operations. In OLAP applications, there is a trade-off between speed and accuracy. But accuracy may be critical for certain geo-raster applications, so solutions to the query rewriting problem must weigh these two aspects according to the users' data analysis requirements. Moreover, they must take into account that the same dataset may be accessed by various users with totally different analysis needs.

1 www.earthlook.org
Bibliography

[1] Blakeley J. A., Larson P.-Å., and Tompa F. Efficiently updating materialized views. In SIGMOD Rec., volume 15, pages 61–71, New York, NY, USA, 1986. ACM.

[2] Burrough P. A. and McDonnell R. A. Principles of Geographical Information Systems. Oxford, 2004.

[3] Dehmel A. A Compression Engine for Multidimensional Array Database Systems. PhD thesis, Technical University Munich, Germany, 2002.

[4] Dobra A., Garofalakis M., Gehrke J., and Rastogi R. Processing complex aggregate queries over data streams. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 61–72, New York, NY, USA, 2002. ACM.

[5] Garcia-Gutierrez A. Applying OLAP pre-aggregation techniques to speed up query processing in raster-image databases. In GI-Days 2007 – Young Researchers Forum, pages 189–191, Muenster, Germany, 2007. IfGIprints 30.

[6] Garcia-Gutierrez A. Applying OLAP pre-aggregation techniques to speed up query response times in raster image databases. In ICSOFT (ISDM/EHST/DC), pages 259–266, 2007.

[7] Garcia-Gutierrez A. Modeling geo-raster operations with array algebra. Technical Report (7), 2007.

[8] Garcia-Gutierrez A. and Baumann P. Modeling fundamental geo-raster operations with array algebra. In ICDM Workshops, pages 607–612, 2007.

[9] Garcia-Gutierrez A. and Baumann P. Computing aggregate queries in raster image databases using pre-aggregated data. In Proceedings of the International Conference on Computer Science and Applications, pages 84–89, San Francisco, CA, USA, 2008.

[10] Garcia-Gutierrez A. and Baumann P. Using pre-aggregation to speed up scaling operations on massive spatio-temporal data. In 29th International Conference on Conceptual Modeling, November 2010.
[11] Gupta A. and Mumick I. S. Maintenance of materialized views: Problems, techniques, and applications. In IEEE Data Engineering Bulletin, volume 18, pages 3–18, 1995.
[12] Gupta A. and Mumick I. S. Materialized Views. The MIT Press, 2007.
[13] Gupta A., Harinarayan V., and Quass D. Aggregate-query processing in data warehousing environments. In Proceedings of the 21st International Conference on Very Large Data Bases, pages 358–369, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.
[14] Kitamoto A. Multiresolution cache management for distributed satellite image database using NACSIS-Thai international link. In Proceedings of the 6th International Workshop on Academic Information Networks and Systems (WAINS), pages 243–250, 2000.
[15] Koeller A. and Rundensteiner E. A. Incremental maintenance of schema-restructuring views in SchemaSQL. In IEEE Transactions on Knowledge and Data Engineering, volume 16, pages 1096–1111, Piscataway, NJ, USA, 2004. IEEE Educational Activities Department.
[16] Lauer A., Hendricks J., Ackermann I., Schell B., Hass H., and Metzger S. Simulating aerosol microphysics with the ECHAM/MADE GCM, Part I: Model description and comparison with observations. In Atmospheric Chemistry and Physics, volume 5, pages 3251–3276, 2005.
[17] Shukla A., Deshpande P., and Naughton J. F. Materialized view selection for multidimensional datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 488–499, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.
[18] Spokoiny A. and Shahar Y. An active database architecture for knowledge-based incremental abstraction of complex concepts from continuously arriving time-oriented raw data. In Journal of Intelligent Information Systems, volume 28, pages 199–231, Hingham, MA, USA, 2007. Kluwer Academic Publishers.
[19] Aronoff S. Geographic Information Systems: A Management Perspective. WDL Publications, 1991.
[20] American National Standards Institute Inc. (ANSI). ANSI/ISO/IEC 9075-2:2008, Information Technology – Database Languages – SQL – Part 2: Foundation (SQL/Foundation). Technical report, International Organization for Standardization (ISO), 2008.
[21] Barbará D. and Imielinski T. Sleepers and workaholics: Caching strategies in mobile environments. In SIGMOD Conference, pages 1–12, 1994.
[22] Moon B., Vega-Lopez I. F., and Vijaykumar I. Scalable algorithms for large temporal aggregation. In Proceedings of the 16th International Conference on Data Engineering, page 145, Washington, DC, USA, 2000. IEEE Computer Society.
[23] Reiner B. HEAVEN: A Hierarchical Storage and Archive Environment for Multidimensional Array Database Management Systems. PhD thesis, Technical University Munich, Germany, 2004.
[24] Reiner B. and Hahn K. Tertiary storage support for large-scale multidimensional array database management systems, 2002.
[25] Reiner B., Hahn K., Hoefling G., and Baumann P. Hierarchical storage support and management for large-scale multidimensional array database management systems. In Proceedings of the 13th International Conference on Database and Expert Systems Applications (DEXA), Aix-en-Provence, 2002.
[26] Sapia C. PROMISE: Predicting query behavior to enable predictive caching strategies for OLAP systems. In Proceedings of the 2nd International Conference on Data Warehousing and Knowledge Discovery, pages 224–233, London, UK, 2000. Springer-Verlag.
[27] Open GIS Consortium. Web Coverage Processing Service (WCPS). In Best Practices Document No. 06-035r1, pages 21–47, 2006.
[28] The OLAP Council. Efficient storage and management of environmental information. www.olapreport.com, accessed July 11, 2002.
[29] The OLAP Council. APB-1 OLAP Benchmark Release II. http://www.olapcouncil.org/research/resrchly.htm, accessed July 11, 2010.
[30] Cudre-Mauroux P., Kimura H., Lim K.-T., Rogers J., Simakov R., Soroush E., Velikhov P., Wang D. L., Balazinska M., Becla J., DeWitt D., Heath B., Maier D., Madden S., Patel J., Stonebraker M., and Zdonik S. A demonstration of SciDB: A science-oriented DBMS. In Proceedings of the VLDB Endowment, volume 2, pages 1534–1537. VLDB Endowment, 2009.
[31] Chatziantoniou D. Ad hoc OLAP: Expression and evaluation. In Proceedings of the 15th International Conference on Data Engineering, page 250, Washington, DC, USA, 1999. IEEE Computer Society.
[32] O'Sullivan D. and Unwin D. Geographic Information Analysis. John Wiley, 2003.
[33] Quass D. Maintenance expressions for views with aggregation. In VIEWS, pages 110–118, 1996.
[34] Tveito I. D., Dobesch H., Grueter E., Perdigao A., Tveito O. E., Thornes J. E., Van der Wel F., and Bottai L. The use of geographic information systems in climatology and meteorology. In Final Report of COST Action 719, 2006.
[35] Nguyen D. H. Using JavaScript for some interactive operations in virtual geographic model with GeoVRML. In Proceedings of the International Symposium on Geoinformatics for Spatial Infrastructure Development in Earth and Allied Sciences, 2006.
[36] Adiba M. E. and Lindsay B. G. Database snapshots. In Proceedings of the Sixth International Conference on Very Large Data Bases, October 1-3, 1980, Montreal, Quebec, Canada, pages 86–91. IEEE Computer Society, 1980.
[37] Thomsen E. OLAP Solutions: Building Multidimensional Information Systems. John Wiley and Sons, 1997.
[38] Codd E. F., Codd S. B., and Salley C. T. Beyond decision support. In Computerworld, volume 27, 1993.
[39] Codd E. F., Codd S. B., and Salley C. T. Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate. In Technical Report, 1993.
[40] Vega-Lopez I. F., Snodgrass R. T., and Moon B. Spatiotemporal aggregate computation: A survey. In IEEE Transactions on Knowledge and Data Engineering, volume 17, pages 271–286, Piscataway, NJ, USA, 2005. IEEE Educational Activities Department.
[41] Colliat G. OLAP, relational, and multidimensional database systems. In SIGMOD Record, volume 25, pages 64–69, New York, NY, USA, 1996. ACM.
[42] Pestana G., da Silva M. M., and Bedard Y. Spatial OLAP modeling: An overview based on spatial objects changing over time. In IEEE 3rd International Conference on Computational Cybernetics, pages 149–154, April 2005.
[43] Wiederhold G., Jajodia S., and Litwin W. Dealing with granularity of time in temporal databases. In Proceedings of the 3rd International Conference on Advanced Information Systems Engineering, pages 124–140, New York, NY, USA, 1991. Springer-Verlag New York, Inc.
[44] García-Molina H., Ullman J. D., and Widom J. Database Systems: The Complete Book. Prentice Hall, 2002.
[45] Samet H. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers, 2006.
[46] ERDAS Inc. ERDAS Field Guide. 1997.
[47] ESRI Inc. ArcGIS 9 Geoprocessing Commands Quick Reference Guide. ArcGIS, 2004.
[48] ISO. ISO 19123:2005, Geographic Information - Coverage Geometry and Functions, 2005.
[49] Albrecht J. Universal analytical GIS operations: A task-oriented systematization of data structure-independent GIS functionality. In Geographic Information Research: Transatlantic Perspectives, pages 577–591, 1998.
[50] Boettger J., Preiser M., Balzer M., and Deussen O. Detail-in-context visualization for satellite imagery. In Computer Graphics Forum, volume 27, pages 587–596, 2008.
[51] Burt P. J. and Adelson E. H. The Laplacian pyramid as a compact code. In IEEE Transactions on Communications, volume 31, pages 532–540, 1983.
[52] Han J., Stefanovic N., and Koperski K. Selective materialization: An efficient method for spatial data cube construction. In Proceedings of the Second Pacific-Asia Conference on Research and Development in Knowledge Discovery and Data Mining, pages 144–158, London, UK, 1998. Springer-Verlag.
[53] Nievergelt J., Hinterberger H., and Sevcik K. C. The grid file: An adaptable, symmetric multikey file structure. In ACM Transactions on Database Systems, volume 9, pages 38–71, 1984.
[54] Peuquet D. J. Making space for time: Issues in space-time data representation. In GeoInformatica, volume 5, pages 11–32, Hingham, MA, USA, 2001. Kluwer Academic Publishers.
[55] Whang K. Y. and Krishnamurthy R. The multilevel grid file: A dynamic hierarchical multidimensional file structure. In DASFAA, pages 449–459, 1991.
[56] Berry J. K. and Tomlin C. D. A mathematical structure for cartographic modeling in environmental analysis. In Proceedings of the American Congress on Surveying and Mapping, pages 269–283, 1979.
[57] Choi K. and Luk W. Processing aggregate queries on spatial OLAP data. In Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery, pages 125–134, Berlin, Heidelberg, 2008. Springer-Verlag.
[58] Hornsby K. and Egenhofer M. J. Shifts in detail through temporal zooming. In International Workshop on Database and Expert Systems Applications, page 487, Los Alamitos, CA, USA, 1999. IEEE Computer Society.
[59] Hornsby K. and Egenhofer M. J. Identity-based change: A foundation for spatio-temporal knowledge representation. In International Journal of Geographical Information Science, volume 14, pages 207–224, 2000.
[60] Ramachandran K., Shah B., and Raghavan V. V. Dynamic pre-fetching of views based on user-access patterns in an OLAP system. In ICEIS (1), pages 60–67, 2005.
[61] Sellis T. K. Multiple-query optimization. In ACM Transactions on Database Systems, volume 13, pages 23–52, New York, NY, USA, 1988. ACM.
[62] Shim K., Sellis T., and Nau D. Improvements on a heuristic algorithm for multiple-query optimization. In Data and Knowledge Engineering, volume 12, pages 197–222, 1994.
[63] Libkin L., Machlin R., and Wong L. A query language for multidimensional arrays: Design, implementation, and optimization techniques. In SIGMOD Record, volume 25, pages 228–239, New York, NY, USA, 1996. ACM.
[64] Usery E. L., Finn M. P., Scheidt D. J., Ruhl S., Beard T., and Bearden M. Geospatial data resampling and resolution effects on watershed modeling: A case study using the agricultural non-point source pollution model. In Journal of Geographical Systems, volume 6, pages 289–306, 2004.
[65] Yong K. L. and Kim M. H. Optimizing the incremental maintenance of multiple join views. In Proceedings of the 8th ACM International Workshop on Data Warehousing and OLAP, pages 107–113, New York, NY, USA, 2005. ACM.
[66] Benedikt M. and Libkin L. Exact and approximate aggregation in constraint query languages. In Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 102–113, New York, NY, USA, 1999. ACM.
[67] Gertz M., Hart Q., Rueda C., Singhal S., and Zhang J. A data and query model for streaming geospatial image data. In EDBT Workshops, pages 687–699, 2006.
[68] Golfarelli M. and Rizzi S. Data Warehouse Design: Modern Principles and Methodologies. McGraw-Hill, 2009.
[69] Gyssens M. and Lakshmanan L. V. A foundation for multi-dimensional databases. In Proceedings of the 23rd International Conference on Very Large Data Bases, pages 106–115, 1997.
[70] Ogden J. M., Adelson E. H., Bergen J. R., and Burt P. J. Pyramid methods in computer graphics. In RCA Engineer, volume 30, 1985.
[71] Beckmann N., Kriegel H. P., Schneider R., and Seeger B. The R*-tree: An efficient and robust access method for points and rectangles. In SIGMOD Record, volume 19, pages 322–331, New York, NY, USA, 1990. ACM.
[72] Roussopoulos N. Materialized views and data warehouses. In SIGMOD Record, volume 27, pages 21–26, 1998.
[73] Stefanovic N., Han J., and Koperski K. Object-based selective materialization for efficient implementation of spatial data cubes. In IEEE Transactions on Knowledge and Data Engineering, volume 12, pages 938–958, Piscataway, NJ, USA, 2000. IEEE Educational Activities Department.
[74] Widmann N. and Baumann P. Performance evaluation of multidimensional array storage techniques in databases. In Proceedings of the IDEAS Conference, 1999.
[75] Baumann P. Management of multidimensional discrete data. In The VLDB Journal, volume 3, pages 401–444, Secaucus, NJ, USA, 1994. Springer-Verlag New York, Inc.
[76] Baumann P. A database array algebra for spatio-temporal data and beyond. In Next Generation Information Technologies and Systems, pages 76–93, 1999.
[77] Baumann P. Web-enabled raster GIS services for large image and map databases. In Proceedings of the 12th International Workshop on Database and Expert Systems Applications, page 870, Washington, DC, USA, 2001. IEEE Computer Society.
[78] Baumann P. Web Coverage Processing Service (WCPS) Implementation Specification. OGC document number 08-068, version 1.0.0, 2008.
[79] Furtado P. and Baumann P. Storage of multidimensional arrays based on arbitrary tiling. In Proceedings of the 15th International Conference on Data Engineering, page 480, Washington, DC, USA, 1999. IEEE Computer Society.
[80] Marathe A. P. and Salem K. A language for manipulating arrays. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97), pages 46–55, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
[81] Vassiliadis P. Modeling multidimensional databases, cubes and cube operations. In Proceedings of the 10th International Conference on Scientific and Statistical Database Management, pages 53–62, Washington, DC, USA, 1998. IEEE Computer Society.
[82] Burt P. J. Fast filter transforms for image processing. In Computer Graphics and Image Processing, volume 16, pages 16–51, 1981.
[83] Agrawal R., Gupta A., and Sarawagi S. Modeling multidimensional databases. In Proceedings of the 13th International Conference on Data Engineering, pages 232–243, Washington, DC, USA, 1997. IEEE Computer Society.
[84] Pieringer R., Markl V., Ramsak F., and Bayer R. HINTA: A linearization algorithm for physical clustering of complex OLAP hierarchies. In DMDW, page 11, 2001.
[85] Chen S., Liu B., and Rundensteiner E. A. Multiversion-based view maintenance over distributed data sources. In ACM Transactions on Database Systems, volume 29, pages 675–709, New York, NY, USA, 2004. ACM.
[86] Prasher S. and Zhou X. Multiresolution amalgamation: Dynamic spatial data cube generation. In Proceedings of the 15th Australasian Database Conference, pages 103–111, Darlinghurst, Australia, 2004. Australian Computer Society, Inc.
[87] Shekhar S. and Xiong H. Encyclopedia of GIS. Springer, 2008.
[88] SYBASE. Sybase solutions guide. http://www.sybase.cz/uploads/CEEMEA_SybaseIQ_FINAL.pdf, accessed July 11, 2010.
[89] Griffin T. and Libkin L. Incremental maintenance of views with duplicates. In SIGMOD Record, volume 24, pages 328–339, New York, NY, USA, 1995. ACM.
[90] Needham T. Visual Complex Analysis. Oxford University Press, 1998.
[91] Niemi T., Nummenmaa J., and Thanisch P. Normalizing OLAP cubes for controlling sparsity. In Data and Knowledge Engineering, volume 46, pages 317–343, Amsterdam, The Netherlands, 2003. Elsevier Science Publishers B. V.
[92] Harinarayan V., Rajaraman A., and Ullman J. D. Implementing data cubes efficiently. In SIGMOD Record, volume 25, pages 205–216, New York, NY, USA, 1996. ACM.
[93] Schlosser S. W., Schindler J., Papadomanolakis S., Shao M., Ailamaki A., Faloutsos C., and Ganger G. R. On multidimensional data and modern disks. In Proceedings of the 4th USENIX Conference on File and Storage Technologies, pages 225–238, 2005. USENIX Association.
[94] Mingjie X. Experiments on remote sensing image cube and its OLAP. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, volume 7, pages 4398–4401, September 2004.
[95] Halevy A. Y. Answering queries using views: A survey. In The VLDB Journal, volume 10, pages 270–294, Secaucus, NJ, USA, December 2001. Springer-Verlag New York, Inc.
[96] Jiebing Y. and DeWitt D. J. Processing satellite images on tertiary storage: A study of the impact of tile size on performance. In Proceedings of the 5th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 460–476, 1996.
[97] Kotidis Y. and Roussopoulos N. A case for dynamic view management. In ACM Transactions on Database Systems, volume 26, pages 388–423, New York, NY, USA, 2001. ACM.
[98] Lee K. Y., Son J. H., and Kim M. H. Efficient incremental view maintenance in data warehouses. In Proceedings of the 10th International Conference on Information and Knowledge Management, pages 349–356, New York, NY, USA, 2001. ACM.
[99] Qingsong Y. and Aijun A. Using user access patterns for semantic query caching. In DEXA, pages 737–746, 2003.
[100] Zhao Y., Deshpande P. M., and Naughton J. F. An array-based algorithm for simultaneous multidimensional aggregates. In SIGMOD Record, volume 26, pages 159–170, New York, NY, USA, 1997. ACM.
[101] Zhuge Y., García-Molina H., Hammer J., and Widom J. View maintenance in a warehousing environment. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 316–327, New York, NY, USA, 1995. ACM.
[102] Zhuge Y., García-Molina H., and Wiener J. L. Multiple view consistency for data warehousing. In Proceedings of the 13th International Conference on Data Engineering, pages 289–300, Washington, DC, USA, 1997. IEEE Computer Society.
[103] Zhuge Y., García-Molina H., and Wiener J. L. Consistency algorithms for multi-source warehouse view maintenance. In Distributed and Parallel Databases, volume 6, pages 7–40, Hingham, MA, USA, 1998. Kluwer Academic Publishers.