Visual Analytics of Patterns in 

High-Dimensional Data 

Dissertation for the academic degree of

Doctor of Natural Sciences (Dr. rer. nat.)

submitted by

Andrada Tatu

at the

Mathematisch-Naturwissenschaftliche Sektion 

Fachbereich Informatik und Informationswissenschaft 

Date of the oral examination: 12 July 2013

Reviewers:

Prof. Dr. Daniel A. Keim, Universität Konstanz 

Prof. Dr. Oliver Deussen, Universität Konstanz 

Prof. Dr. Giuseppe Santucci, Sapienza Università di Roma

For my loving parents.

Acknowledgements 

This dissertation is the most important milestone in my academic career. One of the 

joys of completion is to look back and remember all the mentors, friends, collaborators, 

colleagues and family who have guided, supported, and inspired me along this fulfilling 

journey. 

First and foremost, I would like to express my deep appreciation to my advisor, Professor 

Dr. Daniel Keim, who has stirred my interest in Visual Analytics early on in my 

studies. He has not only been a strong supporter of my work, but he has also allowed 

me great freedom to develop my thesis. Without his guidance and persistent help, this 

dissertation would not have been possible. As a part of his group, I was able to perfect 

my research skills and draw appropriate conclusions. 

In addition, I would like to thank my committee members, Professor Dr. Oliver Deussen 

and Professor Dr. Giuseppe Santucci for their encouraging and insightful comments and 

their analytic questions that prompted me to shape my ideas comprehensively.

I am especially grateful to Dr. Enrico Bertini and Dr. Tobias Schreck, who closely 

accompanied my research during these years and motivated me to seek perfect solutions. 

Many of the results reported here present joint efforts. Their recommendations and instructions

have enabled me to assemble and finish the dissertation effectively.

I would also like to express my gratitude to my collaborators for their guidance and 

inspirations in these past years, and especially name Ines Färber, Professor Dr. Thomas 

Seidl, Professor Dr. Tamara Munzner, Dr. Michael Sedlmair, Professor Dr. Melanie Tory, 

Georgia Albuquerque, Dr. Martin Eisemann, Dr. Jörn Schneidewind and Dr. Peter Bak. 

I am grateful to my colleagues for creating a pleasant working atmosphere. A special 

thank you goes to Svenja Simon (for her friendship and tricky R programming sessions), 

Miloš Krstajić (for supporting all my moods and encouraging me throughout these years), 

Dr. Florian Mansmann (for getting me into the group and becoming a lovely friend), 

David Spretke (for accompanying me from the first day of my Bachelor studies to the 

last of my doctoral work as a friend and hardworking colleague), Dr. Andreas Stoffel (for

always keeping his door open and the helpful debugging sessions), Christian Rohrdantz 

(for helpful suggestions and mental support during the writing phase and preparation 

of my defense talk), Dr. Leishi Zhang (for the great collaboration during the ClustNails 

project), Dr. Daniela Oelke (for initial paper writing suggestions and providing me the 

thesis template), and Sabine Kuhr (for her support in administrative work). I am very 

happy that, in many cases, my friendship with all of you has enriched my time beyond 

our shared time in the office.

Special thanks go to my student assistant Fabian Maaß, who implemented parts of

the subspace visualization system and whose creativity shaped the research outcome.

This acknowledgement would not be complete without extending my sincere thanks 

to our DBVIS support team, which really made my life easier by providing fast, anytime 

technical support, computational power, and storage opportunities for my projects. I 

would like to specially mention Florian Stoffel and Juri Buchmüller.

Special thanks go to Mrs. Anna Dowden-Williams from the Academic Staff Development

for proofreading most of my research papers and this thesis, which has profoundly 

improved its overall composition. 

My deepest appreciation and gratitude go, however, to my family, who has encouraged

my studies from the start and provided me with the moral and emotional support 

needed through the entire process. They believed in my dream and helped me to fulfill it. 

I will be forever grateful for your unconditional love and support. 

I also gratefully acknowledge the financial support received from the German Research

Foundation (DFG) under the research grant DFG-611 within the DFG Priority Program

“Scalable Visual Analytics: Interactive Visual Analysis Systems of Complex Information

Spaces” (SPP 1335). I also acknowledge my status as an associated PhD student of the GK-1042

(PhD Graduate Program) “Explorative Analysis and Visualization of Large Information 

Spaces”.

Abstract 

Due to the technological progress over the last decades, today’s scientific and commercial

applications are capable of generating, storing, and processing massive amounts of

data. This influences the type of data generated, which in turn means that with each

data entry different aspects are combined and stored in one common database. Often

the describing attributes are numeric; we call data with more than a handful of attributes

(dimensions) high-dimensional. Having to make use of these types of data archives poses

new challenges to analysis techniques.

The work of this thesis centers around the question of finding interesting patterns 

(meaningful information) in high-dimensional data sets. This task is highly challenging 

because of the so-called curse of dimensionality, which states that when dimensionality

increases the data becomes sparse. This phenomenon disrupts standard analysis techniques.

Automatic techniques have to deal with data complexity that not only increases their

runtime but also vitiates their computation functions (such as distance functions). Moreover,

exploring these data sets visually is hindered by the high number of dimensions that have

to be displayed in the two-dimensional screen space.

This thesis is motivated by the idea that searching for interesting patterns in this 

kind of data can be done through a mixed approach of automation, visualization, and 

interaction. The amount of patterns a visualization contains can be measured by so-called

quality metrics. These automated functions can then filter the large number of high-dimensional

visualizations and present to the user a pre-filtered, promising subset for further

investigation. We propose quality metrics for scatterplots and parallel coordinates focusing

on different user tasks like identifying clusters and correlations. We also evaluate these

measures with regard to (1) their ability to identify clusters in a variety of real and 

synthetic datasets; (2) their correlation with human perception of clusters in scatterplots. 

A thorough discussion of the results follows, reflecting on their impact and on directions for future

research. 

As quality metrics were developed for a large number of different high-dimensional

visualization techniques, we present our reflections on how these methods are related to 

each other and how the approach can be developed further. For this purpose, we provide 

an overview of approaches that use quality metrics in high-dimensional data visualization 

and propose a systematization based on a comprehensive literature review. 

In high-dimensional data, patterns often exist only in a subset of the dimensions.

Subspace clustering techniques aim at finding these subspaces where clusters exist and 

which might otherwise be hidden if a traditional clustering algorithm is applied. While 

subspace clustering approaches tackle the sparsity problem in high-dimensional data well, 

designing effective visualizations to help analyze the clustering result is not trivial. In

addition to the cluster membership information, the relevant sets of dimensions and the

overlaps of memberships and dimensions also need to be considered. Although a number

of techniques (for example, scatterplots, heat maps, dendrograms, hierarchical parallel

coordinates) exist for visualizing traditional clustering results, little research has been

done on visualizing subspace clustering results. Moreover, while extensive research has

been carried out with regard to designing subspace clustering algorithms, surprisingly

little attention has been paid to developing effective visualization tools for analyzing the

clustering result. Appropriate visualization techniques will not only help in monitoring the

clustering process but, with special mining techniques, they could also enable the domain 

expert to guide and even to steer the subspace clustering process to reveal the patterns of 

interest. To this end, we envision a concept that combines subspace clustering algorithms

and interactive scalable visual exploration techniques. This work includes the task of 

comparative visualization and feedback guided computation of alternative clusterings.

Summary

Due to the technological progress of recent decades, today’s commercial applications

are able to generate, store, and process huge amounts of data. This development also

influences the nature of the generated data: for each data entry, different aspects are

stored in the same database. The describing attributes are often numeric. I call data sets

that contain more than five such attributes (dimensions) high-dimensional. Making

valuable use of such data archives poses new challenges for analysis techniques.

This dissertation addresses the question of how interesting patterns (meaningful

information) can be found in high-dimensional spaces. This task is extremely challenging

because of the curse of dimensionality, which states that data in high-dimensional space

is sparse. Conventional analysis techniques are impaired by this. Automatic methods must

take the data complexity into account with regard not only to their runtime but also to

their computation functions (e.g., distance functions). In addition, the visual exploration

of such data is impaired by the two-dimensionality of the displays.

This dissertation is based on the concept that the search for interesting patterns in

high-dimensional data sets can be carried out with a combined approach of automatic,

visual, and interactive methods. How pronounced the patterns in a visualization are can be

measured by so-called quality measures. With these automatic functions, the large number

of high-dimensional visualizations can be narrowed down and a selected set can be made

available to the user for further investigation. I propose quality measures for scatterplots

and parallel coordinates that focus on different tasks, such as identifying groups or

correlations. In addition, these techniques are examined with regard to (1) their ability to

identify clusters in various real and synthetic data sets and (2) their correlation with human

perception. The detailed discussion of these results is followed by considerations for future

research.

Since many different quality measures have been developed for a range of other

high-dimensional visualizations, I present proposals for how they can be related to one

another and developed further. To this end, an overview of the various approaches is

compiled, based on a systematization derived from a comprehensive literature review.

In high-dimensional space, some patterns exist only in particular subspaces of the data

space. Subspace clustering algorithms were developed to find subspaces containing clusters

that traditional clustering algorithms would not find. Although these algorithms can explore

sparsely populated high-dimensional spaces well, developing effective visualization

techniques for analyzing their clustering results is not trivial. In addition to the cluster

membership of elements, the relevant attribute sets of a cluster and the object and

dimension overlaps of subspace clusters must be displayed. Although a number of techniques

exist for visualizing traditional clustering results (e.g., scatterplots, heat maps,

dendrograms, hierarchical parallel coordinates), there are only a few approaches for

visualizing the results of subspace clustering algorithms. Moreover, surprisingly few

approaches that support a visual analysis of subspace clustering results have been presented

so far, even though much research has been done on subspace clustering algorithms.

Appropriate visualization techniques, supported by special methods for extracting

information, would not only make it possible to trace the clustering results but would also

help domain experts steer the subspace clustering process so that relevant patterns emerge.

With this goal in mind, I present a concept that combines subspace clustering algorithms

with interactive, scalable visualizations. My approaches are therefore devoted to the task of

visualization for comparing alternative clusterings, guided by user feedback.

Contents 

1 Introduction

1.1 Need for Visual Interactive Data Exploration

1.2 Contributions of the Thesis

1.3 Thesis Structure

2 High-Dimensional Data Analysis

2.1 Basic Techniques for High-Dimensional Data Analysis

2.1.1 Common Challenges with High-Dimensional Data

2.1.2 Feature Selection and Feature Extraction

2.2 Information Visualization Techniques for High-Dimensional Data

2.2.1 Information Visualization Techniques

2.2.2 Limitations while Visualizing High-Dimensional Data

2.3 Automated Techniques for High-Dimensional Data

2.3.1 Data Mining Techniques for High-Dimensional Data

2.3.2 Quality Measures for High-Dimensional Data Visualizations

2.4 Visual Analytics for High-Dimensional Data

2.4.1 Visual Interactive Systems for High-Dimensional Data Analysis

2.4.2 Subspace Cluster Analysis and Visualization

3 Quality Measures based Visual Analysis of High-Dimensional Data

3.1 Quality Measures for Scatterplots and Parallel Coordinates

3.1.1 Overview and Problem Description

3.1.2 Quality Measures for Scatterplots with Unclassified Data

3.1.3 Quality Measures for Scatterplots with Classified Data

3.1.4 Quality Measures for Parallel Coordinates with Unclassified Data

3.1.5 Quality Measures for Parallel Coordinates with Classified Data

3.1.6 Application on Real Data Sets

3.1.7 Evaluation of the Measures’ Performance Using Synthetic Data

3.1.8 Conclusion and Future Work

3.2 Quality Measures and Human Perception – An Empirical Study

3.2.1 Measures

3.2.2 Empirical Evaluation

3.2.3 Results

3.2.4 Discussion

3.2.5 Guidelines

4 A Systematization of Quality Metrics in High-Dimensional Data Visualization

4.1 Quality Metrics in High-Dimensional Data Visualization

4.1.1 Background

4.1.2 Methodology

4.1.3 Quality Metrics Pipeline

4.1.4 Systematic Analysis

4.1.5 Examples

4.1.6 Findings

4.1.7 Directions for Further Research

4.1.8 Limitations

4.2 Visual Cluster Separation Factors: Sketching a Taxonomy

4.2.1 Introduction

4.2.2 Method

4.2.3 Visual Cluster Separation Taxonomy

4.2.4 Discussion and Further Research

5 Visual Subspace Analysis of High-Dimensional Data

5.1 Visual Exploration for Subspace Clustering

5.1.1 Motivation

5.1.2 Subspace Clustering Algorithms

5.1.3 Task Definition and Design Space for Visual Subspace Cluster Analysis

5.1.4 The ClustNails System

5.1.5 Use Case and System Comparison

5.1.6 Conclusions and Future Work

5.2 Visual Analytics of Subspace Search

5.2.1 Introduction

5.2.2 Subspace Analysis

5.2.3 Proposed Analytical Workflow

5.2.4 Application

5.2.5 Discussion and Possible Extensions

5.2.6 Conclusions

6 Conclusion and Future Work

6.1 Summary of Contributions and Future Work

List of Figures

List of Tables

A Appendix

A.1 Original Data Dimensions for Used Data Sets

A.2 Empirical Study

A.2.1 General Questions Form

A.2.2 Experiment Form

A.2.3 Additional Experiment Results

A.3 Quality Metrics Pipelines for the Literature Review

A.4 Hierarchical Grouping of Interesting Subspaces

Bibliography

1 Introduction

“Everybody gets so much information all day long

that they lose their common sense.”

Gertrude Stein

Contents

1.1 Need for Visual Interactive Data Exploration

1.2 Contributions of the Thesis

1.3 Thesis Structure

1.1 Need for Visual Interactive Data Exploration 

Today data is produced everywhere: everything is recorded, from production processes

in industry to employees’ working behavior and their personal data. Even animals

are equipped with sensors and all their movements are recorded over long periods of time,

click behavior of internet users is traced, or supermarket purchases are stored for later

analysis. Since today’s technology allows for inexpensive and abundant storage space,

even more data will be stored in the near future. At the same time, these advantages

reveal the problem of how to handle the data most effectively. The gap between the

generated data and the understanding of it increases [154], which also poses a challenge

for analysis techniques: it is difficult to filter and extract relevant information since

not only the volume increases, but also the complexity.

Visualization has long been used as an effective tool to explore and make sense of data,

especially when analysts need to generate hypotheses about the information that is hidden

in the data. While some techniques and commercial products have proven to be useful in

providing effective solutions, modern databases can store data of such

complexity that it goes well beyond the limits of human understanding.

The goal of this thesis is pattern finding in high-dimensional or multidimensional data. 

The methods presented here work with numerical data sets, with a large number of objects, 

and a large number of dimensions, also called attributes. Depending on the application 

area, a large number of objects can already start at hundreds and go up to thousands. The 

same is true for the describing attributes, or features, of the objects. In this work we call

all data sets with more than a hundred objects and more than ten dimensions

high-dimensional data. An example of analysis tasks based on a customer database will be described

later in this section. 

Classical data exploration requires the user to find interesting phenomena in the data

interactively, by starting with an initial visual representation. In [36] the authors suggest 

that “the purpose of visualization is insight, not pictures”. The techniques for high-dimensional

data visualization can also incorporate automated analysis components to

reduce its complexity and to effectively guide the user during the interactive exploration

process. This process is called visual analytics. “Visual analytics strives to facilitate 

the analytical reasoning process by creating software that maximizes human capacity to 

perceive, understand, and reason about complex and dynamic data and situations” [137]. 

Patterns are also not a new concept when analyzing data. Witten and Frank expressed 

this perfectly in [154]: “There is nothing new about this” (patterns). “People have been 

seeking patterns in data since human life began. Hunters seek patterns in animal migration 

behavior, farmers seek patterns in crop growth, politicians seek patterns in voter opinion, 

and lovers seek patterns in their partners’ responses. A scientist’s job (like a baby’s) is 

to make sense of data, to discover the patterns that govern how the physical world works 

and encapsulate them in theories that can be used for predicting what will happen in new 

situations.” 

In large-scale multivariate data sets, purely interactive exploration becomes ineffective

or even infeasible since the number of possible representations grows rapidly with the

number of dimensions. Methods are needed that help the user to automatically find

effective and expressive visualizations. Effective and efficient analysis methods for large

multidimensional data are necessary to understand the complexity of the information hidden

in these databases. Data dimensionality is often the major limiting factor.
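
To give a feeling for how quickly the number of candidate views grows, the following minimal sketch counts two kinds of views for a data set with d dimensions. The counting conventions are assumptions made for this illustration only: one scatterplot per unordered pair of dimensions, and one parallel coordinates plot per axis ordering, with an ordering and its mirror image counted once. The function names are hypothetical.

```python
from math import comb, factorial


def count_scatterplots(d: int) -> int:
    """Number of unordered dimension pairs, i.e. distinct 2D scatterplots."""
    return comb(d, 2)


def count_parallel_coordinate_orderings(d: int) -> int:
    """Number of axis orderings, counting an ordering and its mirror once."""
    if d < 2:
        return 1
    return factorial(d) // 2


if __name__ == "__main__":
    # Print the growth of both counts for a few dimensionalities.
    for d in (5, 10, 20):
        print(d, count_scatterplots(d), count_parallel_coordinate_orderings(d))
```

Even under these simple conventions the number of scatterplots grows quadratically and the number of axis orderings grows factorially, which is why manual inspection of all views quickly becomes impractical.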

For automatic pattern detection, a typically employed paradigm is clustering, that is,

identifying groups of objects based on their mutual similarity. For the aforementioned

high-dimensional data, however, considering all features simultaneously, as traditional

clustering methods do, is no longer effective due to the so-called curse of dimensionality [28].

As dimensionality increases, the distances between any two objects become less discriminative.

Moreover, the probability of many dimensions being irrelevant for the underlying cluster

structure increases. In such data sets it can be observed that each object may participate

in different groupings, meaning that objects may have different roles. In comparison, in

classical clustering each object belongs to one cluster, and the data set is partitioned into

a number of clusters. “For example, in customer segmentation, we observe for each customer

multiple possible behaviors which should be detected as clusters. In other domains,

such as sensor networks each sensor node can be assigned to multiple clusters according to

different environmental events. In gene expression analysis, objects should be detected in

multiple clusters due to the various functions of each gene. In general, multiple groupings

are desired as they characterize different views of the data” [103].
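
The loss of distance contrast mentioned above can be observed in a small experiment. The sketch below is an illustration under stated assumptions, not a method from this thesis: points are drawn uniformly at random from the unit hypercube, Euclidean distance is used, and contrast is measured as (farthest - nearest) / nearest with respect to one random query point. The function name and parameters are hypothetical.

```python
import math
import random


def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (farthest - nearest) / nearest for distances from a random
    query point to n_points uniformly random points in [0, 1]^n_dims."""
    rng = random.Random(seed)  # local generator keeps the run reproducible
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    # The contrast shrinks as the dimensionality grows.
    for dims in (2, 10, 100, 500):
        print(dims, round(relative_contrast(200, dims), 3))
```

In low dimensions the farthest point is many times farther away than the nearest one, while in hundreds of dimensions all distances are of similar magnitude, so a nearest-neighbor or distance-based cluster criterion has little to discriminate on.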

If we consider, for example, a customer database with a large number of customers

(rows in the table) described by a large number of attributes (columns in the table), we

may ask how these customers relate to each other and what kinds of patterns (in this

case, groups) can be identified in this database. Figure 1.1 shows a toy example

of this kind of multiple valid groupings for one database. We can have groups

like: “rich oldies”, “healthy sporties”, “unhealthy gamers”, “unemployed people”, “average

people” and “sport professionals” 1 . To facilitate the data analysis in this direction, we

present in Chapter 5 visual interactive systems and new analysis methods to support the

understanding and comparison of different groupings in high-dimensional data.

Figure 1.1: Multiple valid and interesting groupings of a high-dimensional data set [104].

1 This image appeared in the tutorial slides of Müller et al. [104]; the accompanying story is

my own.

As already mentioned, this thesis is about visual analytics of patterns in high-dimensional

data. To assist the analysis of such data sets, effective information visualization techniques

providing a mapping of data properties to the screen, have been developed and are needed 

to make sense of the complex data at hand. The visualization of large complex information 

spaces typically involves mapping high-dimensional data to lower-dimensional visual 

representations. The challenge for the analyst is to find an insightful mapping, while the 

dimensionality of the data, and consequently the number of possible mappings, increases. 

As we will see later in Chapter 2, numerous expressive and effective low-dimensional

visualizations for high-dimensional data sets have been proposed in the past, such as

scatterplots and scatterplot matrices (SPLOM) [37], parallel coordinates [78], glyph-based

techniques [147], pixel-based displays [145] and geometrically transformed displays [86,

145]. However, finding information-bearing and user-interpretable visual representations

automatically remains a difficult task since there could be a large number of possible

representations. In addition, it could be difficult to explain their relevance to the user.

Finding relations, patterns, and trends over numerous dimensions is also difficult because

the projection of n-dimensional objects onto 2D spaces necessarily entails some form

of information loss. Projection techniques like multidimensional scaling (MDS) and principal

component analysis (PCA) offer traditional solutions by creating data embeddings

that try as much as possible to preserve distances of the original multidimensional space

in the 2D projection. These techniques have, however, severe problems in terms of interpretation,

as it is no longer possible to interpret the observed patterns in terms of the

dimensions of the original data space.

Mechanisms to measure the quality of the visualizations are therefore needed. In 

the past, quality measures have been developed for different areas like measures for data

quality (outliers, missing values, sampling rate, level of detail), clustering quality (purity,

F-measure (combining precision and recall), Rand index [114], silhouette coefficient [85],

etc.), association rule quality (support and confidence [7], information gain [40], etc.) or 

the distance distribution measure in SURFING [16], a subspace search algorithm described 

and used in Chapter 5 to filter data spaces and find interesting subspaces. For visualizations, 

a number of authors have started introducing quality measures to quantify their 

importance. The rationale behind this method is that quality measures can help users 

reduce the search space by filtering out views with low information content. In the ideal


system, users can select one or more measures and the system optimizes the visualization 

in such a way as to reflect the choice of the user. This thesis also contributes to the field 

of quality measures, and in Chapter 3 new measures are presented for scatterplot matrices 

and parallel coordinates plots. 
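
As a rough illustration of how a quality measure can filter views, the sketch below scores every pair of dimensions and keeps the highest-ranked scatterplots. The absolute Pearson correlation used here is only a stand-in chosen for brevity; it is not one of the measures proposed in Chapter 3. The function names and the toy customer table are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant dimension carries no linear pattern
    return cov / sqrt(var_x * var_y)


def rank_scatterplots(columns, top_k=3):
    """Score each pair of dimensions and return the top_k pairs as
    (dimension_a, dimension_b, score), best first."""
    scored = [
        (abs(pearson(columns[a], columns[b])), a, b)
        for a, b in combinations(sorted(columns), 2)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(a, b, score) for score, a, b in scored[:top_k]]


# Hypothetical toy table: one list of values per dimension.
data = {
    "age": [23, 35, 47, 52, 61, 68],
    "income": [21, 34, 49, 50, 63, 70],
    "games_per_week": [9, 2, 7, 1, 8, 3],
}

if __name__ == "__main__":
    for a, b, score in rank_scatterplots(data, top_k=2):
        print(f"{a} vs {b}: {score:.2f}")
```

Swapping `pearson` for another scoring function changes which patterns the ranking favors, which mirrors the idea that users select a measure according to their analysis task.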

However, there is one problem with these measures: the lack of empirical validation

based on user studies. Such studies are needed to test the underlying assumption

that the patterns captured by these measures correspond to the patterns captured by

the human eye. Since many different patterns can be analyzed, in this thesis we start

with clusters: we compare some of the

most promising quality measures for filtering visualizations that present clusters against

human judgement of the same visualizations.

The analysis of high-dimensional data is a ubiquitously relevant yet notoriously difficult

problem. Problems exist both in automatic data analysis and in the visualization

of this kind of data. On the visual-interactive side, a limited number of available visual

variables and the limited short-term memory of human analysts make it difficult to effectively

visualize data in high numbers of dimensions. In Chapter 5 we tackle this problem from

the visual-interactive side. We present a visual-interactive tool to make sense of clusters

in different subspaces, as well as an approach to identify subspaces that might show

complementary clusterings.

In summary, the focus of this thesis is to contribute on both sides of pattern finding in 

high-dimensional data, the automatic and the visual interactive part. We believe that these 

parts are simultaneously needed to solve the problem. We therefore present automatic

mechanisms, namely quality measures, to reduce the number of alternative visualizations of

high-dimensional data, and on the other side we visualize the relations between results to

support the user in an interactive pattern finding process. 

1.2 Contributions of the Thesis 

This dissertation provides visual analytics mechanisms for pattern finding in high-dimensional 

data. In achieving this goal, we make the following contributions:

• Quality measures for scatterplots and parallel coordinates plots are developed. Visual

quality metrics have been recently devised to automatically extract interesting visual

projections out of a large number of available candidates in the exploration of high-dimensional

databases. The metrics make it possible, for instance, to search within a large set of

scatterplots (e.g., in a scatterplot matrix) and select the views that contain the best

separation among clusters. The rationale behind these techniques is that automatic

selection of “best” views is not only useful but also necessary when the number of

potential projections exceeds the limit of human interpretation (Chapter 3) [132,

133].

• Validating the measures through a perceptual study. We present a perceptual study 

investigating the relationship between human interpretation of clusters in 2D scatterplots 

and the measures that were automatically extracted from these plots. Specifically, 

we compare a series of selected metrics and analyze how they predict human

detection of clusters. A thorough discussion of results follows with reflections on 

their impact and directions for future research (Chapter 3) [134]. 

• A systematization of techniques that use quality metrics to help in the visual exploration 

of meaningful patterns in high-dimensional data. We present reflections 

on how different quality measure methods are related to each other and how the 

approach can be developed further. For this purpose, we provide an overview of approaches 

that use quality metrics in high-dimensional data visualization and propose 

a systematization based on a thorough literature review. We carefully analyze the 

papers and derive a set of factors for discriminating the quality metrics, visualization 

techniques, and the process itself. A quality metrics pipeline is proposed to model 

all the encountered varieties of metrics (Chapter 4) [27]. 

• A visual subspace cluster analysis system (ClustNails) to understand the result of 

subspace clustering. In subspace clustering, in addition to the grouping information 

(clusters), the relevance of dimensions for particular groups and overlaps between 

groups, both in terms of dimensions and records, need to be analyzed. ClustNails integrates 

several novel visualization techniques with various user interaction facilities 

to support navigating and interpreting the result of subspace clustering algorithms 

(Chapter 5) [136]. 

• A novel method for the visual analysis of high-dimensional data for understanding 

high-dimensional data from different perspectives and investigating alternative 

clusterings. We employ an interestingness-guided subspace search algorithm to detect 

a candidate set of interesting subspaces, that may contain important patterns 

for further analysis. Based on appropriately defined subspace similarity functions, 

we visualize the subspaces and provide navigation facilities to interactively explore 

large sets of subspaces. Our approach allows users to effectively compare and relate 

subspaces identifying complementary or contradicting relations among them, thus 

identifying alternative clusterings (Chapter 5) [135]. 

1.3 Thesis Structure 

After illustrating the problem in the previous section and enumerating the contributions 

of this thesis, the remainder of the thesis is structured as follows. 

Chapter 2 provides a brief overview of important related work in the field of high-dimensional 

data analysis, covering three main areas. Section 2.1 introduces the common 

challenges when analyzing high-dimensional data and presents dimension reduction techniques 

that reduce the data complexity. Section 2.2 describes important visualization 

techniques for high-dimensional data. Section 2.3 introduces standard automatic techniques 

from the Data Mining community and presents quality measures, which are 

automated ranking functions, to judge the quality of a visualization with respect to a 

given task. Section 2.4 presents some examples where the interplay between visualization, 

automation, and interaction is far more beneficial than any of these techniques alone. 

Chapter 3 proposes eight new quality metrics for different tasks and two visualization 

types: scatterplot matrices and parallel coordinates. The metrics are tested on a set of 

synthetic and real data sets to demonstrate their effect. To ensure that the metrics reflect the


user’s perception, a selected subset of measures for scatterplot matrices is evaluated and 

compared with the user’s perception. We found that both perform similarly. Based on this 

study, we have formulated guidelines for further evaluation of existing metrics. 

Based on a literature review, Chapter 4 introduces a systematization of different quality 

measures for high-dimensional data visualization. Their relation is described through 

characteristic factors such as visualization technique or purpose, in order to arrive at a coherent 

and unified picture for these techniques. By putting the existing methods into a 

common framework, we hope to ease the generation of new research in the field and to spot 

relevant gaps to bridge with future research. Subsequently, Section 4.2 briefly presents 

the results of a qualitative data analysis that led to a visual cluster separability taxonomy. 

These results are the basis for the follow-up discussion on relevant aspects that arise 

when analyzing clusters visually and on what future work needs to focus. 

Chapter 5 presents two interactive systems that help to make sense of high-dimensional 

data sets with respect to different clusterings. Searching in subspaces is 

needed as automatic pattern search is done through clustering algorithms, and it is not feasible 

to search for clusters in full space for high-dimensional data. Section 5.1 introduces a 

visual tool, ClustNails, to investigate subspace clustering results for different state-of-the-art 

subspace clustering algorithms. This tool is intended to support the interpretation of 

the result with respect to the subspace cluster relations. With this visual tool, questions 

such as how many objects and how many dimensions clusters contain, which dimensions 

overlap between clusters, or which objects are shared by several clusters can be answered. 

Section 5.2 goes one step further and presents an analytical approach to support the 

identification of alternative clusterings in these spaces. High dimensionality 

provides different facets of the data: for example, in a data set about people we might 

have clusters from the perspective of musical taste (rock music, classical music, jazz, etc.), but 

at the same time we might also have different groupings of the same people describing their 

level of sports activity. Both views on this data are valid but provide different insights 

about the data. To discover such alternative clusterings in high-dimensional data, in this 

section we propose an analytical workflow that starts from searching the set of possible 

subspaces identifying interesting subspaces. We then group these subspaces according to 

their data similarity providing filtering mechanisms for further interactive investigation. 

Supported by interaction, different clusterings of the data can be identified. 

Chapter 6 concludes the thesis and gives an overview of further research questions that 

seem interesting to investigate in the future. 

A schematic overview of the chapter interrelations is shown in Figure 1.2.


[Figure 1.2 (diagram): Chapter 1: Introduction; Chapter 2: High Dimensional Data Analysis; Chapter 3: QM based Visual Analysis of HD Data; Chapter 4: A Model of HD Data Visualization; Chapter 5: Visual Subspace Analysis of HD Data; Chapter 6: Conclusion and Future Work.] 

Figure 1.2: Schematic overview of the interrelation of chapters in this thesis. 

Parts of this thesis were published in: 

1. A. Tatu, G. Albuquerque, M. Eisemann, J. Schneidewind, H. Theisel, M. Magnor, 

and D. Keim. Combining automated analysis and visualization techniques 

for effective exploration of high dimensional data. Proceedings of the IEEE 

Symposium on Visual Analytics Science and Technology (VAST), pages 59-66, 2009. 

The contributions: for this publication I took the lead on the computer science 

research part of the paper implementing the data space measures and leading also 

the writing of the paper itself. G. Albuquerque and M. Eisemann implemented the 

image quality metrics and provided their description in the paper and some parts 

of the evaluation section with these metrics. The Histogram Density measures were 

programmed by myself. J. Schneidewind gave advice for structuring the paper and 

presenting the results. D. Keim accompanied the project with suggestions for improvements 

for application and text. H. Theisel and M. Magnor gave advice to the 

project. All parts of the paper were revised several times by me, thus in this thesis 

I use the paper text without citation marks. G. Albuquerque’s thesis (title unknown 

by the time of my submission) might contain some text passages of this paper too for 

the parts she took part in the project. 

2. A. Tatu, G. Albuquerque, M. Eisemann, P. Bak, H. Theisel, M. Magnor, and D. A. 

Keim. Automated Visual Analysis Methods for an Effective Exploration 

of High-Dimensional Data. IEEE Transactions on Visualization and Computer 

Graphics (TVCG), 17(5):584-597, May 2011.


The contributions: publication 1. was elected as one of the best for the VAST’09 

conference and this publication is an invited extension of 1. As primary author, I 

was responsible for writing the paper, generating new use-cases, testing our measures 

and describing further research directions in this area. G. Albuquerque implemented, 

described and tested the new CSM measure. P. Bak gave advice for structuring 

the experiments and presenting the results. D. Keim accompanied the paper with 

suggestions for improvements for application and text. M. Eisemann, H. Theisel 

and M. Magnor gave advice to the paper. All parts of the paper were revised several 

times by me, thus, in this thesis I use the paper text without citation marks. G. 

Albuquerque’s thesis (title unknown by the time of my submission) might contain 

some text passages of this paper too for the parts she took part in the project. 

3. A. Tatu, P. Bak, E. Bertini, D. A. Keim, and J. Schneidewind. Visual quality 

metrics and human perception: an initial study on 2D projections of large 

multidimensional data. In Proceedings of the Working Conference on Advanced 

Visual Interfaces (AVI), pages 49-56. ACM, 2010. 

The contributions: for this publication I took primary responsibility and additionally, 

I took the lead on the automatic evaluation. P. Bak took the lead on the human 

experiment. Together we compared the results and evaluated them statistically. E. 

Bertini, D. Keim and J. Schneidewind accompanied the paper with suggestions for 

improvements for experimental design and text. All parts of the paper were revised 

several times by me; thus, in this thesis I use the paper text without citation marks. 

4. D. J. Lehmann, G. Albuquerque, M. Eisemann, A. Tatu, D. A. Keim, H. Schumann, 

M. Magnor and H. Theisel. Visualisierung und Analyse multidimensionaler 

Datensätze. Informatik-Spektrum, Springer Berlin/Heidelberg, 33(6):589- 

600, 2010. 

The contributions: this publication was authored by D. Lehmann. My contribution 

was to describe the use of quality metrics for high-dimensional data. This thesis was 

inspired by the discussions of this paper. 

5. E. Bertini, A. Tatu, and D. A. Keim. Quality Metrics in High-Dimensional 

Data Visualization: An Overview and Systematization. Proceedings of the 

IEEE Symposium on Information Visualization (InfoVis), 17(12):2203-2212, 

Dec. 2011. 

The contributions: this publication was authored equally by E. Bertini and myself. 

We decided to show this by enumerating our names alphabetically in the authors list. 

E. Bertini and I conducted the literature review, came up with the systematization 

and description model of quality metrics, and described this process in this paper. D. 

Keim played the devil's advocate to test our model and gave advice for improvement. 

All parts of the paper were written and revised several times by both leading authors. 

Thus, in this thesis I use the paper text without citation marks. 

6. M. Sedlmair, A. Tatu, T. Munzner, and M. Tory. A taxonomy of visual cluster 

separation factors. Computer Graphics Forum (EuroVis), 31(3pt4):1335-1344, 

June 2012. 

The contributions: M. Sedlmair took the lead in writing this publication. M. 

Sedlmair and I conducted the qualitative analysis of the over 800 plots, and labeled


all the cases with different keywords. Based on these, M. Sedlmair and T. Munzner 

came up with the taxonomy, and described it in the paper. I tested special cases like 

grid size influence during the writing process of the paper. M. Tory accompanied the 

paper with suggestions for improvements for the analysis and taxonomy and revised 

the text. In this thesis, I describe the results presented in that paper, without using 

the text, and I provide further ideas for research in this area. 

7. A. Tatu, F. Maaß, I. Färber, E. Bertini, T. Schreck, T. Seidl, and D. Keim. Subspace 

Search and Visualization to Make Sense of Alternative Clusterings 

in High-Dimensional Data. IEEE Symposium on Visual Analytics Science and 

Technology (VAST), pages 63-72, 2012. 

The contributions: for this publication I took the lead on the project and paper 

writing. F. Maaß implemented the subspace tool advised by myself, E. Bertini and T. 

Schreck. T. Schreck gave advice in structuring the paper and presenting the results 

by providing initial sections of the paper. I. Färber provided an initial section on 

subspace clustering. T. Seidl and D. Keim gave advice to the project. Major parts of 

the paper were written by myself and all the other parts were revised several times 

by me. Thus, in this thesis I use the paper text without citation marks. 

8. A. Tatu, L. Zhang, E. Bertini, T. Schreck, D. A. Keim, S. Bremm, and T. von Landesberger. 

ClustNails: Visual Analysis of Subspace Clusters. Tsinghua Science 

and Technology, Special Issue on Visualization and Computer Graphics, 17(4):419- 

428, Aug. 2012. 

The contributions: for this publication I took the lead on the project and paper 

writing. I implemented the subspace tool supported for some components by L. Zhang. 

E. Bertini and T. Schreck gave advice in structuring the paper and presenting the results 

and provided initial sections that I shaped for the final submission. D. A. Keim, S. 

Bremm, and T. von Landesberger gave advice to the project. Major parts of the 

paper were written by myself and I revised all the other parts of my co-authors 

several times to shape the final paper version. Thus, in this thesis I use the paper 

text without citation marks. 

Other publications to which I contributed but are not included in this thesis: 

1. M. Schaefer, L. Zhang, T. Schreck, A. Tatu, J. A. Lee, M. Verleysen and D. A. 

Keim. Improving projection-based data analysis by feature space transformations. 

In Proceedings of SPIE 8654, Visualization and Data Analysis, 2013. 

2. B. Bustos, D. A. Keim, D. Saupe, T. Schreck and A. Tatu. Methods and User 

Interfaces for Effective Retrieval in 3D Databases (in German). Datenbank 

- Spektrum - Zeitschrift fuer Datenbank Technologie und Information Retrieval, 

dpunkt.verlag, 7(20):23-32, 2007.

2 High-Dimensional Data Analysis 

“You can observe a lot by watching.” 

Yogi Berra 

Contents 

2.1 Basic Techniques for High-Dimensional Data Analysis 

2.1.1 Common Challenges with High-Dimensional Data 

2.1.2 Feature Selection and Feature Extraction 

2.2 Information Visualization Techniques for High-Dimensional Data 

2.2.1 Information Visualization Techniques 

2.2.2 Limitations while Visualizing High-Dimensional Data 

2.3 Automated Techniques for High-Dimensional Data 

2.3.1 Data Mining Techniques for High-Dimensional Data 

2.3.2 Quality Measures for High-Dimensional Data Visualizations 

2.4 Visual Analytics for High-Dimensional Data 

2.4.1 Visual Interactive Systems for High-Dimensional Data Analysis 

2.4.2 Subspace Cluster Analysis and Visualization 

High-dimensional data contains complex patterns, and different data analysis approaches 

have been developed during the past years to uncover the possible hidden 

patterns of this data. As is outlined in the following, this thesis is related to a number of 

broader areas in data analysis and visualization of high-dimensional data. 

In this chapter, Section 2.1 describes the main challenges when dealing with high-dimensional 

data and some basic techniques to reduce its dimensionality. Section 2.2 gives 

an overview of existing visualization techniques for high-dimensional data, and identifies 

the visualization challenges that arise due to the data complexity. Section 2.3 presents a 

series of automated techniques from Data Mining for pattern analysis in high-dimensional 

data, focusing on clustering. The second part presents mechanisms to quantify the quality 

of visualizations, called quality metrics. Due to the limitations of a purely visual-interactive 

solution or a solely automatic approach, in Section 2.4 we present works from 

related fields where the interplay of visualization and automation together with interactive 

features can provide better solutions to the tasks at hand. All examples of these sections 

are in the context of pattern finding and understanding of high-dimensional data. 

Parts of this chapter appeared in [27, 132, 133, 134, 135, 136].

2.1 Basic Techniques for High-Dimensional Data Analysis 

2.1.1 Common Challenges with High-Dimensional Data 

Before presenting different techniques to analyze high-dimensional data sets, we will discuss 

two common challenges in this area. 

The first issue is the so-called curse of dimensionality, which is known to make 

high-dimensional analysis problems difficult. This term was 

formulated by R. Bellman [20] in the context of dynamic programming and describes 

the fact that the data becomes sparse when dimensionality increases. In other words, 

in high-dimensional data everything tends to be basically equidistant, making it hard to 

make any distinctions between objects. Additionally, many existing Data Mining algorithms 

have a complexity that is exponential with respect to the number of data dimensions. 

With increasing dimensionality, these algorithms become computationally intractable and 

therefore inapplicable in many real applications. 

The second issue is that the meaning of similarity in a high-dimensional space is 

diminished. It was shown in [28] that as dimensionality increases the distance to 

the nearest data point approaches the distance to the farthest data point. This problem 

influences the design of similarity functions for objects in high-dimensional spaces. 
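
The effect can be illustrated with a small experiment. The following Python sketch is only an illustration and is not taken from this thesis or from [28]; the uniformly distributed random data, the function name relative_contrast, and all parameter values are assumptions made for the example. It measures how much farther the farthest point is than the nearest point, relative to the nearest distance.

```python
import math
import random

def relative_contrast(num_points: int, dim: int, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min for one random query point.

    Assumption: points are drawn uniformly from the unit hypercube.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(num_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))  # Euclidean distance
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min

# Compare a low-dimensional and a high-dimensional setting.
contrast_2d = relative_contrast(200, 2)
contrast_1000d = relative_contrast(200, 1000)
```

Under these assumptions, the contrast between the farthest and the nearest neighbor is much smaller in 1000 dimensions than in two, which is why distance-based similarity loses discriminative power.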

2.1.2 Feature Selection and Feature Extraction 

A simple, but sometimes very effective, way to deal with high-dimensional data is to reduce 

the number of dimensions by eliminating those that seem to be irrelevant. 

Dimension reduction can be achieved by either feature selection [61] or feature extraction 

[44]. Feature selection is the problem of selecting from a large space of input features 

(or dimensions) a smaller number of features that optimize a measurable criterion, e.g., 

the accuracy of a classifier [97]. 
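
As a minimal sketch of this idea, the following Python code performs a greedy forward selection. It is one possible search strategy among many and not a method from the cited works; the function name forward_selection, the criterion callback, and the toy relevance scores are hypothetical. In practice the criterion could be, for example, the accuracy of a classifier trained on the candidate subset.

```python
from typing import Callable, List, Sequence

def forward_selection(num_features: int, k: int,
                      criterion: Callable[[Sequence[int]], float]) -> List[int]:
    """Greedily pick k feature indices that maximize `criterion`."""
    selected: List[int] = []
    remaining = list(range(num_features))
    while remaining and len(selected) < k:
        # Add the feature that improves the criterion the most.
        best = max(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy criterion (hypothetical): sum of fixed per-feature relevance scores.
relevance = [0.1, 0.9, 0.3, 0.7]
chosen = forward_selection(4, 2, lambda subset: sum(relevance[f] for f in subset))
```

The greedy search evaluates only a small part of the exponentially large space of feature subsets, so it may miss the optimal subset.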

Feature extraction methods reduce the dimensionality of the data by forming a new 

set of dimensions as a linear or nonlinear combination of the original dimensions. These 

synthetic dimensions represent most (or all) of the structure of the original data set by 

using fewer attributes. Depending on the training data, the methods can be supervised 

or unsupervised. “Supervised methods rely on class labels and optimize the performance 

of a supervised learning algorithm, typically a classifier. Unsupervised methods rely on 

quality criteria measured from the output of an unsupervised learning method, typically a 

clustering algorithm. However, many algorithms have variations for both supervised and 

unsupervised learning” [119]. Most automatic feature selection methods rely on supervised 

information (e.g., class labeled data) to perform the selection. Consequently, they are not 

directly applicable to the explorative analysis problem. 

To convey the fundamental principle of feature extraction techniques, the 

next paragraphs describe two traditional dimension reduction methods: principal 

component analysis (PCA) [83] and multidimensional scaling (MDS) [41]. 

PCA tries to preserve the variance in the data and transforms the set of possibly 

correlated dimensions into a new set of linearly uncorrelated dimensions that are a linear 

combination of the original dimensions and are called principal components. The first 

component contains the largest variance of the original dimension set, the second component 

is linearly uncorrelated to the previous one and also contains the maximal possible

variance and so on. The data set can be reduced by maintaining a smaller set of principal 

components as the transformed dimensions. 
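
The principle can be sketched in a few lines of Python. The code below is an illustrative sketch, not the implementation used in this thesis: it approximates only the first principal component by power iteration on the sample covariance matrix, and the function names, the start vector, and the toy data are assumptions made for the example.

```python
import math
from typing import List

def center(data: List[List[float]]) -> List[List[float]]:
    """Subtract the per-dimension mean from every record."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]

def first_principal_component(data: List[List[float]],
                              iterations: int = 200) -> List[float]:
    """Approximate the unit direction of largest variance by power iteration."""
    x = center(data)
    n, d = len(x), len(x[0])
    # Sample covariance matrix (d x d) of the centered data.
    cov = [[sum(r[i] * r[j] for r in x) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # assumed start vector; must not be orthogonal to the result
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

def project(data: List[List[float]], direction: List[float]) -> List[float]:
    """Reduce each record to its coordinate along one component."""
    return [sum(a * b for a, b in zip(row, direction)) for row in center(data)]

# Toy data lying exactly on the line y = 2x.
points = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
pc1 = first_principal_component(points)
reduced = project(points, pc1)
```

For the toy data, all variance lies along the direction (1, 2), so a single transformed dimension is enough to describe the records. Further components would be obtained by repeating the procedure on the data with the first component removed.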

MDS tries to preserve the pairwise distances between the data points. There are many 

variants of MDS, depending on the distance function used [31]. The simplest version 

is the linear MDS, also called classical scaling, and its solution is very closely related to 

PCA when using a Euclidean distance function. 
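
A minimal sketch of classical scaling, restricted to a one-dimensional embedding, could look as follows. It is an illustration under stated assumptions and not the procedure of [31] or [41] verbatim: the distance matrix is assumed to be symmetric, the dominant eigenvector is approximated by power iteration, and the function name classical_mds_1d and the start vector are chosen for the example.

```python
import math
from typing import List

def classical_mds_1d(dist: List[List[float]], iterations: int = 500) -> List[float]:
    """Embed points on a line from a symmetric matrix of pairwise distances."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(sq[i]) / n for i in range(n)]
    total_mean = sum(row_mean) / n
    # Double centering of the squared distances yields the inner-product matrix B.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    # Assumed non-constant start vector: a constant vector is mapped to zero by B.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Rayleigh quotient gives the eigenvalue belonging to the unit vector v.
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    return [math.sqrt(max(eigenvalue, 0.0)) * c for c in v]

# Distances between three points located at 0, 1 and 3 on a line.
coords = classical_mds_1d([[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]])
```

The recovered coordinates are only determined up to translation and reflection, but their pairwise distances reproduce the input distances. The matrix B equals the matrix of inner products of the centered points, which is where the close relation to PCA comes from.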

All these techniques rely on the idea that variation of the data can be explained by 

a smaller number of transformed features. Their main difference from the feature selection 

methods is that, instead of choosing a subset of dimensions from the data, they 

create new dimensions defined as functions over all dimensions. They also do not consider 

class labels; their computation relies solely on the data points. 

These techniques share some general problems. The mapping is often not unique, the 

techniques have several parameters that influence the result, and the 

resulting dimensions are sometimes difficult to interpret: the original dimensions come 

from a specific domain and have a certain interpretation (like age, income, etc.), but their 

linear combinations can hardly be interpreted. 

Koren and Carmel propose a series of new methods for creating projections from high-dimensional 

data sets using linear transformations [89]. For non-labeled data, they propose 

a generalization of the PCA, the normalized PCA, that normalizes the squared pairwise 

distances to reduce the dominance of the large distances normally occurring for the standard 

PCA transformation. For labeled data, their methods integrate the class labels of 

the data in the computation, resulting in projections with a clearer separation between 

the classes. Compared to traditional PCA or MDS, these methods have the advantage that 

they also capture intra-cluster shapes. 

In addition to PCA and MDS presented above, further techniques have been developed 

that are based on linear or non-linear transformations of the original features to obtain a 

reduced set of synthetic dimensions. Detailed surveys can be found in [111, 153]. Another 

prominent group of techniques for dimension reduction, which we briefly recall 

at this point, relies on signal processing techniques that, when applied to a data vector, 

transform it to a numerically different vector [64]. Examples are the Discrete Fourier 

Transform, Cosine Transform, and Wavelet Transform. Since input and transformed data 

vectors have the same length, the data is reduced by a user-specified threshold that is used 

to truncate the transformed vector (e.g., wavelet coefficients). 
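
The truncation step can be sketched for the Discrete Fourier Transform as follows. This is a simplified illustration and not the procedure of [64]; the naive transform, the function names dft and truncate, and the cut-off parameter keep (standing in for the user-specified threshold) are assumptions made for the example.

```python
import cmath
from typing import List

def dft(signal: List[float]) -> List[complex]:
    """Naive discrete Fourier transform; output has the same length as the input."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def truncate(coefficients: List[complex], keep: int) -> List[complex]:
    """Keep only the first `keep` coefficients as the reduced representation."""
    return coefficients[:keep]

# A constant signal is fully described by its first coefficient.
full = dft([2.0, 2.0, 2.0, 2.0])
reduced = truncate(full, 1)
```

For smooth data vectors most of the energy is typically concentrated in the first coefficients, so discarding the remaining ones loses little information.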

2.2 Information Visualization Techniques for High-Dimensional Data 

2.2.1 Information Visualization Techniques 

The representation of high-dimensional data is one of the main research challenges in 

visualization. Several techniques have been developed in recent years to deal with the 

problem of representing relations among many dimensions on a computer display, which 

is inherently two-dimensional. By also exploiting visual variables such as color and shape, visualizations can 

go a bit beyond 2D, but they still face various issues when representing 

high-dimensional data sets. Classic approaches include parallel coordinates, scatterplot 

matrices, glyph-based and pixel-oriented techniques [145]. Figure 2.1 shows some examples


for these techniques taken from [145]. 

Figure 2.1: High-dimensional visualization techniques taken from [145]. A: Scatterplot matrix 

showing on the diagonal a histogram plot for each dimension. Selected points are marked in red in 

all plots. B: Parallel coordinates plot of a seven-dimensional data set. One polyline representing 

one data point is highlighted in red. C: Star glyphs in a MDS layout. D: Dense pixel displays 

representing a 14-dimensional data set. 

Scatterplots and Scatterplot Matrices [37] 

2D scatterplots are one of the most commonly used visualization techniques in data analysis. 

The data is represented by points in a rectangular box, where the value of one 

variable (dimension) determines the position on the horizontal axis and the value of the 

other variable determines the position on the vertical axis. To represent a data set of a 

higher dimensionality, a common approach is to build a scatterplot matrix (SPLOM) [37]. 

Figure 2.1A shows an example of such a matrix for a four-dimensional data set, where 

every pair of dimensions is represented in one scatterplot. The matrix shows every plot 

twice, being symmetrical with respect to the diagonal. Additionally, on the diagonal, dimension 

histograms show the value distribution information for each dimension. Selected 

points are highlighted in red and a purple rectangle indicates their region.
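
The number of plots in such a matrix grows quadratically with the number of dimensions. The following Python sketch enumerates the distinct dimension pairs; the function name splom_pairs and the dimension names are hypothetical and serve only as an illustration.

```python
from itertools import combinations
from typing import List, Tuple

def splom_pairs(dimensions: List[str]) -> List[Tuple[str, str]]:
    """All unordered pairs of dimensions, one per distinct scatterplot."""
    return list(combinations(dimensions, 2))

# A four-dimensional data set as in Figure 2.1A (dimension names are made up).
pairs = splom_pairs(["a", "b", "c", "d"])
num_distinct_plots = len(pairs)  # n * (n - 1) / 2 distinct scatterplots
```

Four dimensions yield six distinct scatterplots, each of which appears twice in the symmetric matrix.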

Parallel Coordinates [78] 

Another important visualization method for multivariate data sets is parallel coordinates. 

Parallel coordinates were first introduced by Inselberg [77] and are used in several tools, 

e.g. XmdvTool [146] and VIS-STAMP [60], for visualizing multivariate data. The basic 

idea is that each dimension¹ of the data is a vertical line, so the axes of the plot are a 

collection of parallel lines. Each data point is a polyline that crosses each dimension axis 

by intersecting it at its dimension value. Figure 2.1B shows an example of parallel coordinates 

for a seven-dimensional data set where one data point’s polyline is highlighted in 

red. In comparison to scatterplots, parallel coordinates can show data sets of higher 

dimensionality in one display. In a SPLOM, a higher-dimensional data set can be visualized 

by plotting every two-dimensional combination in one scatterplot. For both parallel coordinates 

and SPLOMs, the ordering is important: for parallel coordinates the order of axes 

(dimensions) and, analogously, for the SPLOM the order of rows and columns, since different 

orderings make different relations in the data visible. It is therefore important to decide the order 

of the dimensions that are to be presented to the user. The effectiveness of both techniques, however, is 

highly related to the dimensionality of the data under inspection. Because the resolution 

available decreases as the number of data dimensions increases, it becomes very difficult, if 

not impossible, to explore the whole set of available orderings manually. In Section 2.3.2, 

we describe the notion of quality metrics that are mechanisms to automatically quantify 

the quality of the display and in Section 3.1.4, we introduce new quality metrics to determine 

the best ordering in parallel coordinates with respect to a given task. 
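
To make the construction and the ordering problem concrete, the following Python sketch computes the vertices of one record's polyline for a given axis order and counts the possible axis permutations. It is an illustration only; the function name polyline, the min-max normalization, and the example values are assumptions and not part of the tools cited above.

```python
import math
from typing import List, Tuple

def polyline(record: List[float], minima: List[float], maxima: List[float],
             order: List[int]) -> List[Tuple[int, float]]:
    """Vertices (axis position, normalized height) of one record's polyline."""
    vertices = []
    for position, dim in enumerate(order):
        span = maxima[dim] - minima[dim]
        # Normalize to [0, 1]; a constant dimension is drawn at mid-height.
        height = (record[dim] - minima[dim]) / span if span else 0.5
        vertices.append((position, height))
    return vertices

# One three-dimensional record, drawn with the axis order 2, 0, 1.
vertices = polyline([5.0, 10.0, 0.0], [0.0, 0.0, 0.0], [10.0, 10.0, 10.0], [2, 0, 1])

# Number of axis permutations for the seven dimensions of Figure 2.1B.
num_orderings = math.factorial(7)
```

Even for seven dimensions there are already 5040 axis permutations, which illustrates why an automatic ranking of orderings is helpful.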

Glyph-based techniques [147] 

“Glyphs are graphical entities that convey one or more data values via attributes such 

as shape, size, color, and position” [147]. A variety of glyphs has been proposed in the 

literature so far, for example star glyphs, face glyphs, profile glyphs, 

and box glyphs. An overview of multivariate glyphs can be found in [147]. They all have 

in common that they have one graphical representation per object, but use different encodings 

for the objects’ attributes (e.g., length, area, color). In Figure 2.1C star glyphs 

are exemplified. As the name suggests, each object is represented by a star-shaped glyph, 

where the value of each dimension is represented by the length of evenly spaced rays. The 

ray ends are connected by a polyline. 
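
The geometry of a star glyph can be sketched as follows. The Python code is a minimal illustration under the assumption that all values have been normalized to the range [0, 1]; the function name star_glyph and its parameters are hypothetical.

```python
import math
from typing import List, Tuple

def star_glyph(values: List[float], center: Tuple[float, float] = (0.0, 0.0),
               radius: float = 1.0) -> List[Tuple[float, float]]:
    """End points of evenly spaced rays whose lengths encode the values."""
    n = len(values)
    points = []
    for i, value in enumerate(values):
        angle = 2 * math.pi * i / n  # rays are spread evenly over the circle
        points.append((center[0] + radius * value * math.cos(angle),
                       center[1] + radius * value * math.sin(angle)))
    return points

# A four-dimensional object; connecting the end points gives the glyph outline.
outline = star_glyph([1.0, 0.5, 1.0, 0.5])
```

Connecting consecutive end points, and the last one with the first, yields the closed polyline that gives the glyph its characteristic shape.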

Pixel-oriented techniques [145] 

Pixel-oriented techniques “map each value to individual pixels and create a filled polygon to 

represent each dimension” [145]. In Figure 2.1D a 14-dimensional data set is represented 

by dense pixel displays showing each dimension in a separate rectangle and each data 

value as a colored pixel in the rectangle. The values are sorted according to the tenth 

dimension, which is marked with a black border. Here we can see several challenges for 

these techniques. One is the already mentioned ordering of data values to spot correlated 

dimensions; another one is the ordering of dimensions to position similar dimensions close 

to each other on the screen. Using different colormaps can also reveal different patterns in 

the data, thus choosing a suitable colormap for each data set and task 

is yet another challenge. Additionally, positioning the dimensions on the screen is not 

trivial, since different layouts – not only the grid layout – are possible. 
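
A strongly simplified sketch of such a display is given below. It is not the technique of [145]: the row-by-row pixel arrangement, the linear gray-level colormap, and the function name pixel_display are assumptions made for the example, whereas real systems may use other arrangements and colormaps.

```python
from typing import List

def pixel_display(data: List[List[float]], sort_dim: int,
                  width: int) -> List[List[List[int]]]:
    """One grid of gray levels (0-255) per dimension, records sorted by sort_dim."""
    ordered = sorted(data, key=lambda row: row[sort_dim])
    grids = []
    for d in range(len(data[0])):
        column = [row[d] for row in ordered]
        lo, hi = min(column), max(column)
        # Linear colormap: smallest value maps to 0, largest to 255.
        levels = [round(255 * (v - lo) / (hi - lo)) if hi > lo else 0
                  for v in column]
        # Fill the rectangle row by row with `width` pixels per row.
        grids.append([levels[i:i + width] for i in range(0, len(levels), width)])
    return grids

# Four records with two negatively correlated dimensions, sorted by dimension 0.
grids = pixel_display([[3, 10], [1, 30], [2, 20], [4, 0]], sort_dim=0, width=2)
```

Because all rectangles share the same record order, the opposite brightness gradients of the two grids reveal the negative correlation between the dimensions.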

¹ We use the terms dimension and attribute (as well as feature, variable, column and axis) interchangeably 

in this thesis. We choose among them based on the context of the discussion, while attempting to be 

consistent with their use in the literature.


2.2.2 Limitations while Visualizing High-Dimensional Data 

As previously demonstrated, there are different ways to represent high-dimensional data 

on the screen, and all of them bring a number of challenges with them. As already 

identified, these include the scalability of the display, the ordering of displayed 

objects or dimensions, the positioning of objects on the screen, and the high number 

of possible visual mappings. Providing solutions for some of these problems would ease the 

exploration of high-dimensional data. By an appropriate sorting of dimensions and 

an appropriate mapping to visual variables, clutter can be reduced and these visualization 

methods could provide an overview of high-dimensional data sets and allow them to be related [49]. The data 

dimensionality causes problems in the visual mapping stage: it is unclear which 

mapping is the best, that is, which data dimension should be mapped to which visual variable. 

Because of the high number of possible mappings for a high-dimensional data set, automated 

methods are needed to restrict this number. One way to judge the quality of these 

mappings is to compute quality measures for the displayed data (see Chapter 3 for more 

details) or to reduce the number of dimensions by dimensionality reduction techniques 

(see Section 2.1.2). 

Enriching Visualizations 

Static visualization techniques are not flexible enough to reveal the complex high-dimensional 

patterns; thus, interaction is needed at this point. Different solutions have been proposed to make 

visualizations interactive, supporting a dynamic use for high-dimensional data. These include 

brushing and linking [46], panning and zooming [19], focus-plus-context [92], and magic 

lenses [29]. 

“Brushing and linking refers to the connecting of two or more views of the same data, 

such that a change to the representation in one view affects the representation in the 

other views as well. ... Panning and zooming refers to the actions of a movie camera 

that can scan sideways across a scene (panning) or move in for a closeup or back away to 

get a wider view (zooming). ... When zooming is used, the more detail is visible about 

a particular item, the less can be seen about the surrounding items. Focus-plus-context 

is used to partly alleviate this effect. The idea is to make one portion of the view – the 

focus of attention – larger, while simultaneously shrinking the surrounding objects. The 

farther an object is from the focus of attention, the smaller it is made to appear. ... Magic 

lenses are directly manipulable transparent windows that, when overlapped on some other 

data type, cause a transformation to be applied to the underlying data, thus changing 

its appearance” [15]. A full exemplification of these techniques is out of the scope of this 

work, and more details can be found in [15] (see footnote 2). 

Patterns that are just visible in subspaces of the original data space also need specialized 

visualizations to disclose the relations between the different subspaces from which 

they originate as well as their possible object overlap. In Chapter 5 we present a visual-interactive 

tool for this purpose. 

2 The cited descriptions of the techniques are from Chapter 10: User Interfaces and Visualization - by 

Marti Hearst. This chapter can also be found online at http://people.ischool.berkeley.edu/~hearst/ 

irbook/10/node3.html#SECTION00122000000000000000f (last accessed on 03/13).

2.3 Automated Techniques for High-Dimensional Data 

In this section, we present automated methods for analyzing high-dimensional data. Section 

2.3.1 discusses different data mining approaches to extract patterns from data. The 

focus is on clustering. We present general approaches, enumerating approaches that have 

been especially developed for coping with high-dimensional data, and present the difference 

between clustering in a dimension reduced data set and subspace clustering. Besides 

automated pattern extraction, in Section 2.3.2 we introduce automation to judge the quality 

of visualization, namely by quality metrics. Given the huge number of possible visual 

representations for high-dimensional data, the user is assisted in finding the right visual 

mapping or the right projection for their data. Our contribution to this area, consisting of 

new measures, a quality measures pipeline, and a systematization of existing measures, is 

outlined in Chapters 3 and 4. 

2.3.1 Data Mining Techniques for High-Dimensional Data 

Data Mining refers to extracting, or mining, knowledge (interesting patterns) from large 

amounts of data [64]. In order to extract these data patterns, different intelligent methods 

have been developed in the past. One important method, which is also the closest to 

this thesis, is clustering. Clustering takes the data set as input and groups the objects 

according to their similarity into different groups, called clusters. The similarity 

between objects of one group is maximized, and between objects of different groups the 

similarity is minimized. That means that objects of one group are very similar to each 

other, while dissimilar to objects of other groups. The similarity is calculated on the full 

attribute space, using different distance functions, like Euclidean, Minkowski, or City-block 

distances. 
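
The distance functions named above can be illustrated with a minimal sketch. It assumes numeric attribute vectors of equal length; the function names are chosen here for illustration and do not come from the thesis. The City-block and Euclidean distances are the Minkowski distance of order p = 1 and p = 2, respectively.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of attributes")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def city_block(x, y):
    """City-block (Manhattan) distance: Minkowski distance with p = 1."""
    return minkowski(x, y, 1)


def euclidean(x, y):
    """Euclidean distance: Minkowski distance with p = 2."""
    return minkowski(x, y, 2)


a = [0.0, 0.0]
b = [3.0, 4.0]
print(city_block(a, b))  # 7.0
print(euclidean(a, b))   # 5.0
```

Because the distance is computed over all attributes, every irrelevant attribute contributes to the sum, which is one reason full-space similarity becomes less informative as the dimensionality grows.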

State of the Art Clustering 

There are different criteria to classify the existing clustering algorithms. We would like to 

differentiate them roughly into hierarchical clustering algorithms and partitioning clustering 

algorithms, and enumerate some of the best-known representatives. For further details 

please refer to the following surveys [21, 155] or the original papers of the algorithms. 

Hierarchical clustering organizes objects into groups that are themselves grouped 

into larger groups. This is done successively, building up a hierarchy of clusters. Representatives 

for this category, which we will also use later in Section 5.2, are hierarchical clusterings 

with different linkage methods, like single-linkage, complete-linkage, average-linkage, or 

minimum variance [144]. To handle large-scale data, new hierarchical algorithms have appeared in recent 

years that improve the clustering performance. 

Examples include BIRCH [162], an algorithm that uses a height-balanced tree to 

store summaries of the original data and can achieve linear computational complexity. 
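
The role of the linkage method can be sketched as follows. This is a naive agglomerative procedure written for illustration only, assuming Euclidean distances and a user-given number of clusters `k`; it is not BIRCH and not the implementation used later in the thesis, and its cost grows quickly with the number of objects.

```python
def agglomerative(points, k, linkage="single"):
    """Merge the two closest clusters until k clusters remain.

    Returns the clusters as lists of point indices.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5

    def cluster_dist(c1, c2):
        d = [dist(i, j) for i in c1 for j in c2]
        if linkage == "single":      # distance of the closest pair
            return min(d)
        if linkage == "complete":    # distance of the farthest pair
            return max(d)
        if linkage == "average":     # mean over all pairs
            return sum(d) / len(d)
        raise ValueError("unknown linkage: " + linkage)

    # Start with one cluster per object.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_dist(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(agglomerative(pts, 3))  # [[0, 1], [2, 3], [4]]
```

Recording the sequence of merges instead of stopping at `k` clusters would yield the full hierarchy that is usually drawn as a dendrogram.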

The partitioning methods divide all the data objects into a fixed number of groups, 

without any hierarchical structure. Major representatives for this category are algorithms 

like the density-based DBSCAN [50] and OPTICS [10], or relocation methods like 

k-medoids and k-means [56].


Clustering in High Dimensions 

For high-dimensional data sets, the challenge is to design effective and efficient clustering 

algorithms that can cope with the high number of objects, dimensions, and the noise level 

of this kind of data. Therefore a number of different algorithms were proposed to cluster 

this type of data. 

CURE [57] is a hierarchical clustering algorithm that can explore arbitrary cluster 

shapes and utilizes a random sample strategy to reduce computational complexity. 

Density-based clustering (DENCLUE) [70] is a well-known approach for density-based 

clustering of high-dimensional data. To make computations more feasible, the data is indexed 

using a B+-tree. The algorithm is built on the idea that the influence of each data 

point on its neighborhood can be modeled using a so-called influence function. The overall 

density of the data space can be modeled analytically as the sum of the influence function 

applied to all data points. Clusters are then determined by identifying local maxima of 

the overall density function. 
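
The idea of an overall density built from influence functions can be sketched as follows. The Gaussian influence function, the bandwidth `sigma`, and the grid search for local maxima are assumptions made for this illustration; the actual algorithm uses index structures and a hill-climbing procedure rather than a grid scan.

```python
import math


def gaussian_influence(x, y, sigma):
    """Influence of data point y at location x, modeled by a Gaussian kernel."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def density(x, data, sigma):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)


# One-dimensional toy data with two groups of points.
data = [(0.9,), (1.0,), (1.1,), (5.0,), (5.2,)]
grid = [(i / 10.0,) for i in range(0, 61)]
dens = [density(g, data, sigma=0.5) for g in grid]

# Grid positions whose density exceeds both neighbors approximate the local maxima.
peaks = [grid[i][0] for i in range(1, len(grid) - 1)
         if dens[i] > dens[i - 1] and dens[i] > dens[i + 1]]
print(peaks)
```

In this toy example the scan finds one local maximum per group of points; each data point would then be assigned to the maximum it climbs to.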

Although these algorithms can deal with large-scale data, they are sometimes not 

sufficient to analyze high-dimensional data. Due to the previously described 

curse of dimensionality, algorithms relying on distance functions can no longer 

perform well in high-dimensional spaces. To overcome this problem, dimension reduction 

(see Section 2.1.2) is used in cluster analysis to reduce the dimensionality of the data 

sets. However, dimensionality reduction methods cause some loss of information, and 

may destroy the interpretability of the results, or even distort the real clusters. Moreover, 

such techniques do not actually remove any of the original attributes from the analysis. 

This is problematic when there are a large number of irrelevant attributes. The irrelevant 

information may mask the real clusters, even after transformation. Another way to tackle 

this problem is to use subspace clustering algorithms that search for data clusters in 

different subsets of the same data set. Different subspaces may contain different meaningful 

clusters. The problem here is how to identify such subspace clusters efficiently. 
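
Why distance-based algorithms degrade in high-dimensional spaces can be made tangible with a small experiment. The sketch below assumes uniformly distributed random points in the unit hypercube and a fixed random seed; the function name and the relative contrast measure (d_max - d_min) / d_min are chosen here for illustration.

```python
import random


def relative_contrast(n_points, n_dims, rng):
    """(d_max - d_min) / d_min of the Euclidean distances from a random query
    point to n_points uniformly distributed points in the unit hypercube."""
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(n_dims)]
        dists.append(sum((a - b) ** 2 for a, b in zip(p, query)) ** 0.5)
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # fixed seed so that the experiment is repeatable
contrast_low = relative_contrast(500, 2, rng)
contrast_high = relative_contrast(500, 200, rng)
print(contrast_low, contrast_high)
```

With 200 dimensions the nearest and the farthest point are at almost the same distance from the query, so the contrast is much smaller than in two dimensions, and a nearest-neighbor or cluster criterion based on such distances loses its discriminative power.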

A large number of algorithms for subspace clustering have been developed in the past 

and we picked some representatives to be briefly described next. CLIQUE (CLustering 

In QUEst) [6] employs a bottom-up approach and searches for dense rectangular cells in 

all subspaces with high density of points. The clusters are generated by merging these 

rectangles. OptiGrid [71] is designed to obtain an optimal grid partitioning using cutting 

hyperplanes. It uses density estimations similar to DENCLUE to find the plane that 

separates two significantly dense half spaces, and goes through a point of minimal density, 

using a set of linear projections. In Section 5.1 we use the k-medoid based algorithm 

PROCLUS (PROjected CLUstering) [4], one of the most robust algorithms for subspace 

clustering. It defines a cluster as a densely distributed subset of data objects in a subspace. 

ORCLUS (arbitrarily ORiented projected CLUster generation) [5] uses a similar approach 

but uses non-axis-parallel subspaces to find the clusters. Further elaborations on the 

problem of subspace clustering are described in Section 2.4.2 and Section 5.1.2. 
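
The bottom-up search for dense cells can be sketched in the style of frequent itemset mining. This is a simplified illustration and not the published CLIQUE algorithm: it assumes attribute values in [0, 1), an equal-width grid with `n_intervals` intervals per dimension, and a count threshold `min_points`, and it stops at the dense units without merging adjacent units into clusters.

```python
from collections import Counter
from itertools import combinations


def dense_units(data, n_intervals, min_points):
    """Find dense grid cells in all axis-parallel subspaces, bottom-up.

    A unit is a frozenset of (dimension, interval) pairs.
    Returns a dict that maps each dense unit to its number of points.
    """
    def cell(value):
        return min(int(value * n_intervals), n_intervals - 1)

    rows = [frozenset((d, cell(v)) for d, v in enumerate(row)) for row in data]

    # Level 1: dense intervals in the single dimensions.
    counts = Counter(frozenset([item]) for row in rows for item in row)
    current = {u: c for u, c in counts.items() if c >= min_points}
    result = dict(current)
    k = 1
    while current:
        k += 1
        # Join dense (k-1)-dimensional units into k-dimensional candidates.
        candidates = set()
        for u1, u2 in combinations(current, 2):
            cand = u1 | u2
            if len(cand) == k and len({d for d, _ in cand}) == k:
                # A unit can only be dense if all of its projections are dense.
                if all(frozenset(s) in current for s in combinations(cand, k - 1)):
                    candidates.add(cand)
        counts = Counter()
        for row in rows:
            for cand in candidates:
                if cand <= row:
                    counts[cand] += 1
        current = {u: c for u, c in counts.items() if c >= min_points}
        result.update(current)
    return result


data = [(0.11, 0.12, 0.05), (0.12, 0.15, 0.35), (0.15, 0.11, 0.65),
        (0.18, 0.19, 0.95), (0.85, 0.55, 0.25), (0.45, 0.85, 0.75)]
units = dense_units(data, n_intervals=5, min_points=3)
for unit, count in units.items():
    print(sorted(unit), count)
```

In the toy data, four objects fall into the same cell in dimensions 0 and 1 but are spread over dimension 2, so the only dense two-dimensional unit lies in the subspace spanned by dimensions 0 and 1.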

Other Data Mining Techniques 

In addition to clustering techniques, many other techniques have been developed in 

the past. 

They mainly mine frequent patterns, associations, correlations, or outliers 

in data. A frequent pattern is a set of items that occur frequently in a data set. This 

term was first proposed by [7] in the context of frequent itemsets and association rule 

mining. By mining frequent patterns, the goal is to identify regularities in the data, like 

products purchased often together in basket data analysis. Frequent patterns form the 

foundation for many essential data mining tasks, such as association analysis, correlation 

analysis, classification (associative classification) and cluster analysis (frequent pattern-based 

clustering). “Association analysis is the discovery of association rules showing 

attribute-value conditions that occur frequently together in a dataset” [63]. As mentioned 

in Section 1.1 support and confidence can characterize the quality of association rules. The 

rules are generated based on the identified frequent itemsets in the data. One problem, 

however, is that for low support and confidence levels the resulting set of association rules 

is very large. Using higher levels of support and confidence can remove useful rules, so 

a mechanism is needed to detect the right confidence level. Visualization can help to 

overcome this issue, and supports the user in identifying the right rules. In Section 3.1 we 

will present image based quality measures to identify correlation among data attributes 

and attributes forming strong groups (clusters) in the data. 
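
Support and confidence, the two measures used above to characterize association rules, can be computed as in the following minimal sketch. It assumes transactions given as sets of items; the basket data and function names are made up for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    s_antecedent = support(antecedent, transactions)
    if s_antecedent == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / s_antecedent


baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support({"bread", "milk"}, baskets))       # 0.5
print(confidence({"bread"}, {"milk"}, baskets))  # 2 of the 3 bread baskets also contain milk
```

Lowering the thresholds on these two values admits more rules, which is the source of the large rule sets mentioned above.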

In classification analysis the data is often classified (labeled), and a model is derived 

to distinguish these data classes. This model is trained on a subset of the data, called 

training set. Another subset of the data, the so-called test set, is used to validate the 

model. It can be represented by classification rules, decision trees, neural 

networks or mathematical formulas and is used to classify new data. However, often users 

need to predict missing values in the data, rather than class labels. When the predicted 

values are numerical the process is named prediction. Our work on quality metrics with 

labeled data (see Section 3.1.3 and Section 3.1.5), can be seen as a complementary way 

to identify the attributes that can best distinguish the classes in the data relevant for 

building the classification model. Classification is also referred to as supervised learning, 

because the training set is used to teach how to classify new data. Clustering is referred to 

as unsupervised learning, since there are no class labels for training, and clusters or classes 

are established to group the data elements. 

In some applications, as in fraud detection, rare events can be of interest. The analysis 

of outlier data is referred to as outlier mining. Outliers can be detected for example by 

using statistical tests, but also by some quality metrics. Examples for quality metrics for 

outliers are marked in Table 4.2 later in Chapter 4. 

2.3.2 Quality Measures for High-Dimensional Data Visualizations 

General Measures 

Quality metrics (or measures) in visualization have a long history. While in our work we 

focus only on their specific use in high-dimensional data analysis, they have a broader 

scope than we can describe here. Early attempts to calculate quality metrics can be 

traced back to the work of Tufte [139], where he proposed metrics such as the data-ink 

ratio and the lie factor, which respectively optimize the use of the visualization space 

and reduce the distortions that visualization may introduce. Later, in 1997, Richard Brath 

proposed a rich set of metrics to characterize the quality of business visualizations [32] 

and, around the same period, Miller et al. advocated the use of visualization metrics as a 

way to compare visualizations [100]. The graph drawing community developed its own set


of metrics, most notably aesthetic metrics such as those found in the foundational work of 

Ware et al. on cognitive measurements of graph aesthetics [149]. Later, the term quality 

metrics assumed a more specific meaning; in particular it appeared in the context of a 

number of papers related to clutter reduction and scalability [24, 26, 80, 82, 112]. 

For the sake of completeness, it is worth mentioning that the word metric is also used in 

the context of information visualization user studies as a way to indicate how the elements 

of interest are measured (e.g., [108, 113]). 

Scatterplot Measures 

The idea of using measures calculated over the data or over the visualization space to select 

interesting projections, has been proposed already in some foundational works like Projection 

Pursuit [54, 74] and Grand Tour [13]. Projection Pursuit searches for low-dimensional 

(one- or two-dimensional) projections that expose interesting structures, using a “Projection 

Pursuit Index” that considers inter-point distances and their variation. Grand Tour 

adopts a more interactive approach by allowing the user to easily navigate through many 

viewing directions, creating a movie-like presentation of the whole original space. 

More recently, several works appeared in the visualization community that propose different 

forms of quality measures. Examples are graph-theoretic measures for scatterplot 

matrices [151], measures over pixel-based visualizations [120], measures based on clutter 

reduction for visualizations [25, 112], and composite measures to find several data structures 

such as outliers, correlations, and sub-clusters [82]. We present a systematization of works 

on quality measures in Chapter 4 and propose a quality measures pipeline to describe the 

process of these measures. Additionally, several factors are derived to characterize the 

measures in a common language, and implications on further research are raised. At this 

point, it seems important to provide a short description of the first two categories, and 

postpone the details of the others to Chapter 4. 

First, the scagnostics measures [140] have an important role since they are a major 

inspiration source for our work. As an alternative to Projection Pursuit, the scagnostics 

method [140] was proposed to analyze structures in scatterplots. Since the specifics of 

the method were never published, Wilkinson et al. [151] took the opportunity to 

present the scagnostics ideas and apply them to high-dimensional data. They describe 

detailed graph-theoretic measures for scatterplots. This means that graphs and their properties 

(like convex hull, alpha hull, Minimum Spanning Tree (MST)) are used as bases for 

computing scagnostics measures. Their scagnostics indices assess five aspects of the point 

distribution: outliers, shape, trend, density and coherence, proposing nine characteristic 

indices for the distribution of points in scatterplots: outlying, skewed, clumpy, convex, 

skinny, striated, stringy, straight, and monotonic. Originally these indices are used to 

form a SPLOM of scagnostics, where each axis is a scagnostics measure. Here each data 

scatterplot is represented by a point according to its measures. The scagnostics SPLOM 

was used to spot unusual scatterplots regarding their data distribution (see Figure 2.2A). 

These indices were also used as ranking functions in data SPLOMs supporting different 

analysis tasks [152] as shown in Figure 2.2B. 
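
To give an impression of how such an index turns a scatterplot into a single number, the following sketch computes the simplest of the nine indices, monotonic, taken here to be the squared Spearman rank correlation of the two plotted dimensions. The helper names are chosen for illustration, and the graph-based indices (which need a convex hull, alpha hull, or MST) are not covered.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if one variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def monotonic(x, y):
    """Squared Spearman rank correlation of the scatterplot (x, y)."""
    return pearson(ranks(x), ranks(y)) ** 2


x = [1, 2, 3, 4, 5]
print(monotonic(x, [1, 8, 27, 64, 125]))  # strictly increasing, non-linear trend
print(monotonic(x, [1, 3, 5, 3, 1]))      # symmetric peak, no monotonic trend
```

Computing such an index for every pair of dimensions yields one value per scatterplot, which can then serve as a ranking function for the plots of a SPLOM.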

Second, the approach most similar to ours presented in Chapter 3 is Pixnostics, proposed 

by Schneidewind et al. [120]. They also use image-analysis techniques to rank the 

different lower-dimensional views of the data set and present only the best ranked to the 

user. The method not only provides valuable lower-dimensional projections to the 

user, but also optimized parameter settings for pixel-level visualizations. However, while 

their approach concentrates on pixel-level visualizations, we focus on scatterplots and 

parallel coordinates. 

Figure 2.2: (A) Scagnostics SPLOM having as axes scagnostics measures and showing each data 

scatterplot as a point in the measures scatterplot [152]. (B) Scagnostics indices used as quality 

measures to rank data scatterplots [152]. 

We contribute to the field of quality metrics by proposing image-based and data-based 

measures for classified and non-classified data in scatterplots and parallel coordinates 

in Section 3.1. In Section 3.1.2 we present an image-based measure for non-classified 

scatterplots in order to quantify the structures and correlations between the respective 

dimensions. Our measure could for example be used as an additional index in a scagnostics 

matrix. 

Parallel to our work from Section 3.1 published in [133], Sips et al. [129] developed a 

class consistency visualization algorithm. Similar to our Histogram Density measures, the 

class consistency method proposes measures to rank 2D scatterplots. It filters the highest 

ranked scatterplots and presents them in an ordinary scatterplot matrix. 

Parallel Coordinates Measures 

Measures were not only used to rank a high number of visualizations regarding their 

structures, but also with the purpose to optimize visualizations for high-dimensional data 

representation. One major factor handled by these measures is optimizing the ordering of 

elements (like axes or data points) in the visualization. Aiming at dimension reordering, 

Ankerst et al. [9] presented a method based on similarity clustering of dimensions, placing 

similar dimensions close to each other. Yang [159] developed a method to generate interesting 

projections also based on similarity between the dimensions. Similar dimensions 

are clustered and used to create a lower-dimensional projection of the data. 

As an alternative to the methods for dimension reordering for parallel coordinates, we 

propose a method based on the structure presented on the low-dimensional embeddings 

of the data set. Three different kinds of measures to rank these embeddings are presented


in Section 3.1.4 for class and non-class based visualizations. 

Evaluating Measures 

A common denominator of all these works is the total absence of user studies able to inspect 

the relationship between human-detected and machine-detected data patterns. While it 

is certainly clear how these measures can help users deal with large data spaces, there 

are a number of open issues related to the human perception of the structures captured 

automatically by the suggested algorithms. In Section 3.2 we focus on the question of 

whether there is a correlation between what the human eye perceives and what the machine 

detects. 

Despite the lack of user studies specifically focused on the issues discussed above, 

there are a number of user studies focused on the detection of visual patterns which are 

worth mentioning here. A large literature exists on the detection of pre-attentive features, 

notably the work of Healey focused on visualization [67] and of Gestalt Laws [148], which 

are often taken as the basis for the detection of patterns from visual representations. Some 

more specific works focused on visualization are: [25] and [68] based on the perception of 

density in pixel-based scatterplots and in visualizations based on “pexels” (perceptual 

texture elements) respectively, [81] on the study of thresholds for the detection of patterns 

in parallel coordinates, and [65] on the correlation between the visualization performance 

and similarity with natural images. The study presented in [118] is also relevant and very 

similar to ours presented in Section 3.2 in terms of experiment design. Users ranked a 

series of images in terms of their perception of the degree of clutter exposed by the image, 

and the study measured the degree of correlation between the user rank and the rank 

given by the suggested measure named feature congestion. 

2.4 Visual Analytics for High-Dimensional Data 

2.4.1 Visual Interactive Systems for High-Dimensional Data Analysis 

As presented in the previous chapter, combining data visualization with interactive and 

automated components speeds up the analysis of high-dimensional data sets. As a consequence, 

many interactive systems have been developed recently to support the user in 

analyzing high-dimensional data sets. Since there is a large number of interactive systems 

in the literature, presenting a full summary would overload this section. Hence in 

the following paragraphs, we identify only the four main domains related to this thesis 

and enumerate a selection of visual interactive systems for visual feature selection, visual 

clustering, visual classification, and dimension reordering. 

Visual Feature Selection 

Reducing high-dimensional data to a smaller subset of features that express the data characteristics 

is a crucial task in high-dimensional data analysis. Data features are therefore 

compared, for example by computing correlations, data variation, etc., to identify their importance 

in expressing the data characteristics. Since fully automated feature selection methods 

are often infeasible due to the data complexity and dimensionality, visual-interactive 

systems have been developed to deal with this problem. We illustrate three examples for 

such systems in Figure 2.3 with a short description, and point to more literature in this 

field in the next paragraphs. 

Figure 2.3: Visual interactive feature selection systems. A: Rank-by-Feature Framework presented 

in [125]. B: Feature selection supported by quality measures [82]. C: DimStiller for feature selection 

[76]. 

In existing works involving visual-interactive selections or comparison of features, the 

Rank-by-Feature Framework [125] (see Figure 2.3A) provides a sorted visual overview 

of the correlation among pairs of features. In [82], the selection of input features was 

supported by a measure of the interestingness of the visual view provided by candidate 

features (see Figure 2.3B). An interactive dimensionality reduction workflow was presented 

in [76], relying on visual approaches to guide users in selecting features (see Figure 2.3C). 

In [33] and [34], interactive visual comparison was proposed to relate data described 

in different given feature spaces based on 2D mappings and tree structures extracted from 

the different data spaces. Furthermore, in [93] a visual design based on network and heat 

map visualization was proposed to relate clusterings in different subsets of dimensions. 

In [159], dimensions are hierarchically clustered based on a simple value-oriented similarity 

measure. Based on this structure, user navigation can take place to identify interesting 

subspaces. In a recent work [161], the output of this simple search method was visualized 

by tree- and matrix-based views, where each dimension combination was represented by 

a single MDS plot. 

In summary, many of these methods are applicable to compare data regarding different


criteria. However, most of them assume the feature selection to be performed globally and 

do not take the subspace search problem directly into account. One focus of this thesis 

is to show that local selection of features is essential when analyzing patterns of high-dimensional 

data. The analysis is then performed in different subspaces of the data and 

related work on visual analysis tools that deal especially with subspaces will be presented 

in the next subsection. 

Visual Clustering 

Identification and relation of groups of data is a key explorative data analysis task. Often, 

user interaction is needed to identify and revise the number and characteristics of data 

clusters found by automatic search methods. To this end, visual-interactive approaches are 

useful. Although many methods have been proposed, we can only highlight a few of them 

in an exemplary manner. In [124], interactive exploration of hierarchically clustered data 

along a dendrogram data structure is proposed to help users find the right level of clusters 

for their tasks (see Figure 2.4A). In [159], the parallel coordinates approach serves as a 

basic display to show data clustering results, allowing users to compare clusters along their high-dimensional 

data space. Also, 2D projections, possibly in conjunction with glyph-based 

representation of clusters, are widely employed; a recent example is [35] (see Figure 2.4B). 

Figure 2.4: Interactive visual analysis systems for clustering in high-dimensional visualization. A: 

Interactive exploration of hierarchically clustered data along a dendrogram [124]. B: (a) Grouping 

icons to form clusters based on visual similarity. (b) User-defined grouping of icons [35]. 

These approaches to visualization and clustering in high-dimensional data spaces all 

have in common that they are based on a given full (or reduced) dimensionality of the 

input data set. Thereby, they show only a singular perspective of the usually multi-faceted 

high-dimensional data, which might not be the most relevant one. As we will show in this 

thesis, it is also useful to explore high-dimensional data for patterns in different subsets 

of its full high-dimensional input space to increase potential data insight. 

Visual Classification 

Classification uses a model that distinguishes data classes, created from a 

labeled training data set, to label new data. The classification model can be represented 

by decision trees. With pure automatic approaches, problems like over-fitting the model 

or tree pruning are difficult to tackle [86]. Using visualization can help to overcome

these problems, for example by incorporating the user in the tree constructing process. 

Ankerst et al. present in [11] a user-centered approach that combines the domain knowledge 

of users with the computational strengths of the computer to create rules that satisfy the 

user’s constraints and generate visualizations of these patterns. Additionally, the pattern 

recognition of the human supported by adequate data visualizations can be used to increase 

the effectiveness of decision trees. In Figure 2.5A the visual classification shows the decision 

tree, visualizing each attribute-value by a colored pixel arranged in bars. Each attribute 

bar is sorted, and the purest value distribution is selected as split attribute of the decision 

tree. This procedure is repeated until all leaves contain pure classes. The split is marked 

with a black vertical line, and the leaves are underlined with a black line. Compared to 

standard visualizations of decision trees, additional information is encoded in a compact 

way, namely: size of the nodes (number of training records for the corresponding node), 

quality of the split (visible in the purity of the resulting partitions), class distribution 

(frequency and location of the training instances of all classes). 
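
The notion of a pure value distribution after sorting an attribute can be quantified in several ways. The sketch below uses the Gini impurity of a binary split as one possible criterion; this choice, the function names, and the toy data are assumptions for illustration and are not taken from the system described in [11], where the user inspects the sorted attribute bars visually.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure partition)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Sort the records by attribute value and return (threshold, impurity)
    of the binary split with the lowest weighted Gini impurity,
    or None if all attribute values are equal."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split point between equal attribute values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_split(values, labels))  # (6.5, 0.0)
```

Evaluating `best_split` for every attribute and choosing the attribute with the lowest impurity corresponds to selecting the bar whose sorted class colors separate most cleanly.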

Figure 2.5: Interactive visual analysis systems for classification in high-dimensional data. A: Visual 

classification from [11] illustrates the decision tree for DNA training data having 19 attributes, 

visualizing each attribute-value by a colored pixel arranged in bars. B: Decision tree construction 

system [142], representing the tree in a node-link diagram, displaying split points on the links and 

the split attributes on the node. 

Figure 2.5B shows a recent example from [142] of an interactive system for decision 

tree construction. Here the authors have the same goal, i.e., to bring the domain-specific


knowledge of the user into the construction of the tree. A tight integration of visualization, 

interaction and automation supports domain experts in growing, pruning, optimizing and 

analyzing decision trees [142]. Compared to the previous example, here the tree representation 

is a more classic one since the tree is represented by node-link diagrams. Internal 

and leaf nodes are represented by node glyphs, and each parent-child relationship is represented 

by a link from parent to child node. The advantage of this visual representation 

is that it allows for an easier counting of the number of leaves while at the same time it 

shows which nodes are on the same level [142]. The main view displays split points on 

the links, using the width to encode the number of items and color to encode the class membership 

of the items. The split attribute is shown on the nodes of the tree. These are visualized 

as rectangles containing relevant information like split attribute, class distribution, split 

points, and class histogram. Additional linked views support the user in constructing and 

optimizing the decision tree. 

Dimension Reordering 

As already discussed, dimension ordering is a relevant component of high-dimensional data 

visualization and exploration, as different orderings can expose different patterns. Ankerst 

et al. introduced the problem of dimensional ordering as an optimization problem in [9] 

and demonstrated that it is an NP-complete problem that must thus be solved through 

heuristics. Peng et al. in [112] apply dimension reordering on a series of n-dimensional 

visualization techniques to reduce clutter. Matrix-based visualizations, starting from the 

seminal work of Bertin [22], have also been heavily researched in terms of the patterns 

they can expose through reordering. In Section 5.1 we use dimensional reordering and 

cluster reordering to make relationships among dimensions and clusters apparent in our 

ClustNails system. 
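
Since an optimal ordering cannot be computed efficiently, heuristics are used in practice. The following sketch shows one simple greedy heuristic: starting from a given dimension, it repeatedly appends the remaining dimension that is most similar to the last placed one. The similarity measure (absolute Pearson correlation), the start dimension, and all names are assumptions for illustration; this is neither the method of [9] nor the reordering used in ClustNails.

```python
def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if one variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def greedy_dimension_order(columns, start):
    """Order dimensions so that each one follows the remaining dimension that
    is most similar (highest absolute correlation) to the previously placed one."""
    order = [start]
    remaining = [name for name in columns if name != start]
    while remaining:
        last = columns[order[-1]]
        nxt = max(remaining, key=lambda name: abs(pearson(last, columns[name])))
        order.append(nxt)
        remaining.remove(nxt)
    return order


columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [1, 3, 2, 5, 4],
    "c": [3, 1, 4, 1, 5],
    "d": [2, 4, 6, 8, 10.5],
}
print(greedy_dimension_order(columns, "a"))  # ['a', 'd', 'b', 'c']
```

A greedy pass like this needs only a quadratic number of similarity computations, but the result depends on the start dimension and carries no optimality guarantee.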

In [59] Guo also addresses ways to integrate visual and computational measures for 

picking and ordering variables for display on parallel coordinates. He describes a human-centered 

exploration environment, which incorporates a coordinated suite of computational 

and visualization methods to explore high-dimensional data and find patterns in 

these spaces. The main difference between this approach and our approach presented in 

Section 3.1.4 is that Guo searches for locally defined patterns in subspaces, while our work 

concentrates on finding global patterns in a 2-dimensional projection of the data set. 

To summarize, ordering plays an important role in different areas, such as ordering axes 

of parallel coordinates, ordering as a way to reduce clutter in scatterplot matrices, and ordering 

to support similarity search of glyph-based visualizations or pixel-based displays. 

2.4.2 Subspace Cluster Analysis and Visualization 

As traditional full-space clustering is often not effective for revealing a meaningful clustering 

structure for high-dimensional data (see Section 2.3.1), in the emerging research 

field of subspace clustering [90] several approaches aim at discovering meaningful clusters 

in locally relevant subspaces. The problem of finding clusters in high-dimensional data 

can be divided into two sub-problems: subspace search and cluster search. The first one 

aims at finding the subspaces where clusters exist, the second one at finding the actual 

clusters. The large majority of existing algorithms consider the two problems simultaneously 

and produce a set of clusters, where each cluster is typically represented by a set of 

clustered objects (rows of the original data table) and the subset of relevant dimensions 

(columns of the original data table). Several methods have been proposed that differ in 

the cluster search strategy and in the constraints with respect to the overlap of clusters and 

dimensions [38, 84, 107]. Kriegel et al. [90] categorize these algorithms into four classes: 

(1) projected clustering; (2) “soft” projected clustering; (3) subspace clustering; (4) hybrid. 

The first two generate clusters that do not overlap, that is, every object belongs to 

only one cluster. Subspace clustering and hybrid may generate clusters that do overlap. 
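
The representation of a subspace cluster as a set of objects plus a set of relevant dimensions can be sketched as a small data structure. This is an illustrative sketch only: the class name `SubspaceCluster` and the use of the Jaccard index to quantify overlap are assumptions made here, not definitions taken from [90].

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubspaceCluster:
    objects: frozenset      # row indices of the clustered objects
    dimensions: frozenset   # column indices of the relevant dimensions

def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two sets (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap(c1, c2):
    """Return (object overlap, dimension overlap) of two subspace clusters."""
    return jaccard(c1.objects, c2.objects), jaccard(c1.dimensions, c2.dimensions)

# Two hypothetical clusters sharing objects 2 and 3 and dimension 1.
c1 = SubspaceCluster(frozenset({0, 1, 2, 3}), frozenset({0, 1}))
c2 = SubspaceCluster(frozenset({2, 3, 4}), frozenset({1, 2}))
object_overlap, dimension_overlap = overlap(c1, c2)
```

For non-overlapping results, as produced by projected clustering, the object overlap of any two distinct clusters would be zero; subspace clustering and hybrid approaches may yield positive values.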

While extensive research has been carried out in designing subspace clustering algorithms, 

surprisingly little attention has been paid to developing visualization support for 

subspace clustering. To our knowledge only a few subspace cluster visualization systems 

exist. 

Figure 2.6: (a) VISA system [14]. Left: MDS projection for the global view of clusters. Right: 

Matrix of subspace clusters for in-depth view. (b) Heidi Matrix [141] over a subspace. 

The VISA system [14] implements both a global view and an in-depth view (see 

Figure 2.6(a)) to help interpret the subspace clustering result. In the global view, the 

subspace clusters are projected onto a 2D display using a multidimensional scaling (MDS) 

projection. The aim is to show the similarity between clusters in terms of the number 

of records and dimensions in each cluster. Each cluster is represented as a colored circle 

where color represents the number of dimensions and the size represents the number of 

instances. The in-depth view shows the detailed characteristics of the clustering result 

including data items in each cluster and their values using a matrix representation. It 

uses different color codes to visualize all characteristics of an object: black for unselected 

dimensions, brightness for areas of interest, and hue for value. The MDS projection in 

VISA provides a good overview of the clustering results. However, using circles of different 

sizes in the MDS projection can be problematic: the distance between two clusters 

can be obscured by the radius of the circles, and the overlap between clusters often causes 

a cluttered display. The in-depth view shows detailed characteristics of the clustering 

result, but as shown in Figure 2.6(a), both hue and brightness are relatively weak at 

showing differences/variations between numbers and values in unselected dimensions. 

Heidi Matrix [141] uses a complex arrangement of subspaces in a matrix representation. 

This matrix is based on the computation of the k-Nearest Neighbors (kNN) in 

each subspace (see Figure 2.6(b)). Rows and columns represent the data items, and each


entry (i, j) in the matrix represents the number of subspaces in which i and j are neighbors. 

A categorical coloring scheme is used to color the cells according to the particular 

combination of subspaces in which two data items are neighbors. In addition, rows and 

columns are ordered according to the output generated by a clustering algorithm. The 

biggest advantage of Heidi Matrix is that it displays the full information of the data and 

the subspace clustering result. However, the rather abstract visual mapping scheme makes 

interpretation of the results difficult, and to the best of our knowledge its effectiveness has 

not been evaluated yet. The scalability of the visualization is another critical issue because 

it requires n × n display space, where n is the number of data items. 
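
The counting step behind such a matrix can be illustrated with a short sketch. It is not the Heidi Matrix implementation of [141]: it only computes, for each pair (i, j), the number of subspaces in which j is among the k nearest neighbors of i. The asymmetric neighbor definition, the Euclidean distance, and the function names are assumptions made for this example.

```python
def knn_sets(points, dims, k):
    """For each point, the set of indices of its k nearest neighbors,
    using squared Euclidean distance restricted to the dimensions in dims."""
    def dist2(p, q):
        return sum((p[d] - q[d]) ** 2 for d in dims)
    result = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: (dist2(p, points[j]), j))
        result.append(set(others[:k]))
    return result

def neighbor_count_matrix(points, subspaces, k):
    """counts[i][j] = number of subspaces in which j is a k-nearest neighbor of i."""
    n = len(points)
    counts = [[0] * n for _ in range(n)]   # n x n entries, hence the scalability issue
    for dims in subspaces:
        nn = knn_sets(points, dims, k)
        for i in range(n):
            for j in nn[i]:
                counts[i][j] += 1
    return counts

# Four toy 2D points and three subspaces: x only, y only, and both.
points = [(0.0, 0.0), (0.1, 5.0), (5.0, 0.2), (5.2, 5.1)]
subspaces = [(0,), (1,), (0, 1)]
counts = neighbor_count_matrix(points, subspaces, k=1)
```

In the toy data, point 1 is the nearest neighbor of point 0 in the x subspace and in the full space, whereas point 2 is its nearest neighbor in the y subspace only.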

Figure 2.7: Visualization techniques applied in Ferdosi’s work [52]. Left: 1D subspace. Middle: 

2D subspace. Right: Subspace with 3 or more dimensions. 

Ferdosi et al. [52] proposed an algorithm for finding interesting subspaces in astronomical 

data as well as a visual system for displaying the results. The algorithm identifies 

candidate subspaces from data and ranks those by a quality metric based on density estimation 

and morphological operators. The resulting subspaces are visualized in different 

forms: line graphs for 1-dimensional subspaces, 2D scatterplots for 2-dimensional subspaces, 

and principal component analysis (PCA) projections for subspaces with higher 

dimensionalities (see Figure 2.7). Ferdosi’s work provides some interesting insight into 

subsets of dimensions in astronomical data with a high density of data objects. However, 

the algorithm does not assign objects to subspaces. Hence, the subspace clustering information 

is partially missing from both the data mining and the visualization compared to 

VISA and Heidi Matrix, meaning there is no direct way of comparing subspaces. 

In all of the above-mentioned visualization systems, the visualization of overlapping 

dimensions and overlapping clusters is lacking. It is difficult to see and compare such 

overlapping information in the visual representations. In Section 5.1 we propose a visual 

tool to investigate subspace clustering results that also represents dimension and object 

overlap among clusters. 

We note that if we apply one of these subspace clustering visualizations, we immediately 

inherit two main challenges of this paradigm that are still considered open research issues, 

namely the efficiency challenge (relating to subspace cluster search) and the redundancy 

challenge (relating to the typical redundancy of the outputs generated). In Section 5.2 the 

redundancy problem is addressed by our proposed analytical workflow.

3 

Quality Measures based Visual Analysis of High-Dimensional Data 

“Measure what is measurable, and make measurable what is not so.” 

Galileo Galilei 

Contents 

3.1 Quality Measures for Scatterplots and Parallel Coordinates 

3.1.1 Overview and Problem Description 

3.1.2 Quality Measures for Scatterplots with Unclassified Data 

3.1.3 Quality Measures for Scatterplots with Classified Data 

3.1.4 Quality Measures for Parallel Coordinates with Unclassified Data 

3.1.5 Quality Measures for Parallel Coordinates with Classified Data 

3.1.6 Application on Real Data Sets 

3.1.7 Evaluation of the Measures’ Performance Using Synthetic Data 

3.1.8 Conclusion and Future Work 

3.2 Quality Measures and Human Perception – An Empirical Study 

3.2.1 Measures 

3.2.2 Empirical Evaluation 

3.2.3 Results 

3.2.4 Discussion 

3.2.5 Guidelines 

Visual exploration of multivariate data typically requires projection onto lower-dimensional 

representations. The number of possible representations grows rapidly with 

the number of dimensions, and manual exploration quickly becomes ineffective or even 

unfeasible. In this chapter, we propose automatic analysis methods to extract potentially 

relevant visual structures from a set of candidate visualizations. Based on these features, 

the visualizations are ranked in accordance with a specified user task. The user is provided 

with a manageable number of potentially useful candidate visualizations that can be used 

as a starting point for interactive data analysis. This can effectively ease the task of finding 

truly useful visualizations and potentially speed up the data exploration task. Therefore, 

in Section 3.1, we present quality measures for class-based as well as non-class-based 

scatterplots and parallel coordinates visualizations. The proposed analysis methods are 

evaluated on real and synthetic data sets, and the results are presented in Sections 3.1.6 

and 3.1.7. Section 3.2 presents an empirical study that compares the measures’ rankings with 

user perception. The study helped us to derive further factors that we must take into 

account when designing new measures that have to fit the users’ perception.


Parts of this chapter appeared in the following publications [132, 133, 134].¹ 

3.1 Quality Measures for Scatterplots and Parallel Coordinates 

In this section, we present an automated approach that supports the user in the exploration 

process of high-dimensional data. The basic idea is to generate different projections from 

the high-dimensional data set and to automatically identify potentially relevant visual or 

data-structures from this set of possible candidates. These structures are used to determine 

the relevance of each projection to common predefined analysis tasks. The user may then 

use the projection with the highest relevance as the starting point of the visual interactive 

analysis. We present relevance measures for typical analysis tasks based on scatterplots 

and parallel coordinates. The experiments on class-labeled and non class-labeled data 

sets demonstrate the potential of our quality measures to find interesting projections and 

visualizations and thus speed up the exploration process. 
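
The overall idea of generating candidate projections, scoring them, and handing the best ones to the user can be sketched as follows. This is only an illustration under stated assumptions: the absolute Pearson correlation is used as a stand-in score and is not one of the quality measures proposed in this chapter, and the names `abs_correlation` and `rank_scatterplots` are hypothetical.

```python
import math
from itertools import combinations

def abs_correlation(xs, ys):
    """Stand-in relevance score: absolute Pearson correlation of two columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return abs(cov / (sx * sy)) if sx > 0 and sy > 0 else 0.0

def rank_scatterplots(data, measure):
    """Score every 2D projection (pair of dimensions) and sort best first.

    data maps a dimension name to its list of values; the result is a list
    of (score, (dim_a, dim_b)) tuples in descending order of score."""
    scored = [(measure(data[a], data[b]), (a, b))
              for a, b in combinations(sorted(data), 2)]
    return sorted(scored, key=lambda item: item[0], reverse=True)

# Toy data set with three dimensions (illustrative values only).
data = {
    "dim1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "dim2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "dim3": [3.0, 1.0, 4.0, 1.0, 5.0],
}
ranking = rank_scatterplots(data, abs_correlation)
best_score, best_pair = ranking[0]
```

The `measure` argument is deliberately a plain function, so that a task-specific quality measure can replace the stand-in without changing the ranking step.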

3.1.1 Overview and Problem Description 

Increasing dimensionality and growing volumes of data lead to the necessity of effective exploration 

techniques to present the hidden information and structures of high-dimensional 

data sets. To support visual exploration, the high-dimensional data is commonly mapped 

to low-dimensional views, also called projections. Depending on the technique, exponentially 

many different low-dimensional views exist that cannot be analyzed manually. 

As already presented in Section 2.2.1, scatterplots and parallel coordinates plots are 

commonly used visualization techniques to deal with multivariate data sets. These low-dimensional 

embeddings of the high-dimensional data in a 2D view can be interpreted 

easily by the users. We have also seen that these techniques entail different challenges for 

high-dimensional data sets. For scatterplots, the high number of possible 2D projections 

for a high-dimensional data set is challenging. Since there are $\frac{n^2 - n}{2}$ different plots for an n-dimensional 

data set in a scatterplot matrix, an automatic analysis technique to preselect 

the important projections is useful and necessary. 
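
The count of distinct scatterplots is simply the number of unordered pairs of dimensions, which can be checked with a few lines of code. The function name `num_scatterplots` is chosen here for illustration only.

```python
from itertools import combinations

def num_scatterplots(n):
    """Number of distinct 2D projections of an n-dimensional data set: (n^2 - n) / 2."""
    return (n * n - n) // 2

# Cross-check the closed form against an explicit enumeration of dimension pairs.
pair_counts = {n: len(list(combinations(range(n), 2))) for n in (4, 10, 22, 100)}
formula_counts = {n: num_scatterplots(n) for n in pair_counts}
```

For 22 dimensions this already gives 231 candidate plots, and for 100 dimensions 4950, which illustrates why a manual inspection of the full scatterplot matrix quickly becomes impractical.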

For parallel coordinates one problem is the large number of possible arrangements of 

the dimension axes. It has been shown in [30] that for an n-dimensional data set $\frac{n+1}{2}$ permutations 

are needed to visualize all relations between dimensions, but there are n! possible 

arrangements. An automated analysis of the visualizations can help to find the best ordering 

out of all possible arrangements. We attempt to analyze the pairwise combinations of 

dimensions that are later assembled to find the best visualizations by reducing the visual 

analysis to $n^2$ visualizations. We propose ranking functions to judge the quality of a visual 

embedding. These ranking functions are called quality measures and automatically select 

the best visual representation with respect to a given task. 

¹ Please note that parts of the publications used here are slightly changed to adapt to the dissertation’s 

terminology. For reasons of readability, and being an author in a leading role for these publications, I decided 

not to quote these excerpts. 

The intense collaboration for [133] with G. Albuquerque and M. Eisemann from Braunschweig brought up 

new image quality measures that they implemented and described for our joint publication. I participated 

in some of the discussions. Together we ran experiments for the application section on real data sets and 

described them in the paper. The evaluation part on the synthetic data was completely designed by 

myself. I decided to include the full description of the metrics in my thesis for a better understanding 

of the experiments and the discussions about the outcome. Major parts of Section 3.1.2, Section 3.1.3, 

Section 3.1.4 and Section 3.1.5 are therefore credited to the aforementioned authors.

[Figure: pipeline diagram with the labels "HD Data", "Visual Mapping & Projection", "Set of Visualizations", "Quality Measures", and "Ranked Visualizations"; the set of visualizations is shown as "2D projections in scatterplots", with example panels plotting dim 4 and dim 22, dim 5 and dim 7, and dim 5 and dim 22.] 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● ● ● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● ● 

● 

● 

● ● ● ● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● ● ● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● ● 

● 

● 

● 

● 

● 

● ● ● ● 

● 

● 

● 

● 

● 

● ● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● ● ● ● 

● 

● 

●● 

● 

● ● ● ● 

● 

●● 

● 

● 

● 

● 

●● 

● 

● 

● 

● 

● 

● 

●● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ●● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● ● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● ● 

● 

● ● ● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

●● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● ● ● 

● 

● ● ●● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● ● ● 

● 

● 

● 

● ● ● 

● 

● ● ● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● ● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

●● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

●● 

●● 

● 

● 

●● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● ● 

● ● ● ● 

● ● ● ● 

● ● ● ● 

● ● ● ● 

● ● ● ● 

● ● ● ● 

● 

● 

● 

0 2000 4000 6000 8000 

0 200 400 600 800 

dim 6 

dim 22 

[Figure: scatter plot, axes Comp.1 and Comp.2]

[Figure: scatter plot, axes Comp.1 and Comp.2]


● ● ● 

● 

● 

● 

● ● 

● 

● 

● ● 

● 

● 

● 

● 

● ● 

● 

● 

● ● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● ● ● 

● 

● 

● 

● ● ● 

● 

● 

● 

● ● ● 

● 

● 

● 

● ● 

● 

● 

● ● 

● 

● ● ● 

● ● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● ● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● ● ● 

● 

● ● 

● 

● 

● ● 

● 

● 

● ● 

● 

● 

● ● 

● 

● 

● 

● ● 

● 

● 

● 

● ● ● 

● 

● 

● 

● ● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ●●● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● ● ● 

● ● 

● 

● 

● 

● 

● ● ● 

● ● 

● 

● 

● 

● 

● ● ● 

● 

● 

● ● 

● 

●● 

● ● 

● 

●● 

● ● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

Task 

2000 4000 6000 8000 

0 200 400 600 800 

dim 4 

dim 22 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● ● 

● 

● 

● ● 

● ● 

● 

● 

● 

● 

● ● 

● 

● ● 

● ● 

● ● 

● 

● 

● 

● ● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● ● 

● ● 

● ● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● ● 

● ● 

● ● ● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● ● ● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● ● ● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

●● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ●●●●●●●●●● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ●●●●●●●●●● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● ● 

● 

● 

● ● 

● 

● 

● 

● ● ● ● 

● 

● 

● 

● 

● 

● ● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● ● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● ● ● 

● 

● 

● 

● 

● ● ● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

●● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

●●● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● ● 

● ● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● ● 

● 

● 

● ● ● ● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● ● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● ● 

● ● ● ● 

● ● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● ● 

● 

● 

● 

● 

● ● 

● ● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● ● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ●● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● ● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● 

● ● ● 

● ● 

● ● 

● 

● 

● 

● 

●


An overview of our techniques is shown in Table 3.1. For scatterplots with unclassified 

data, we developed the Rotating Variance Measure which favors xy-plots with a high 

correlation between the two dimensions. For classified data, we propose measures that 

consider the class information while computing the ranking value of the images. We 

developed four methods: a Class Density Measure, a Class Separating Measure, a 1D-Histogram

Density Measure, and a 2D-Histogram Density Measure. Their goal is to find the

scatterplots that show the classes best separated. For parallel coordinates with

unclassified data, we propose a Hough Space Measure that searches for interesting patterns 

such as clustered lines in the views. For classified data, we propose two measures: the 

Overlap Measure that focuses on finding views with as little overlap as possible between 

the classes, so that the classes separate well, and the Similarity Measure that looks for 

correlations between the lines. All the measures, except the 1D and 2D-Histogram Density 

Measures, are computed directly over the visualization images and do not consider possible 

intra- and interclass overplotting of points. 

As example analysis tasks for unclassified data sets, we choose correlation search in 

scatterplots (Section 3.1.2) and cluster search (i.e. similar lines) in parallel coordinates 

(Section 3.1.4). If class information is given, the tasks are to find views where distinct 

clusters in the data set are also well separated in the visualization (Section 3.1.3) or show 

a high level of inter- and intraclass similarity (Section 3.1.5). 

3.1.2 Quality Measures for Scatterplots with Unclassified Data 

Our scatterplot measures aim to assess the distribution of the data regarding correlation 

and density of points and the separateness of classes. In this section, we therefore propose 

analysis functions to compute the correlation of points in scatterplots with unclassified 

data. Additionally, new methods to measure the density of the classes and for assessing 

the separateness of classes in scatterplots with classified data are proposed in the next 

Section 3.1.3. In the case of unclassified, but well separable data, class labels can be 

automatically assigned using clustering algorithms. 

Rotating Variance Measure 2 

High correlations are represented as long, skinny structures in the scatterplot visualization. 

Due to outliers even almost perfect correlations can lead to skewed distributions in the 

plot and attention needs to be paid to this fact. The Rotating Variance Measure (RVM) 

is aimed at finding linear and nonlinear correlations between the pairwise dimensions of a 

given data set. 

To compute the measure over the image representation, we first transform the discrete scatterplot visualization into a continuous density field. For each screen pixel $s$ at position $\mathbf{x} = (x, y)$, the distances to its $k$ nearest sample points $N_s$ in the visualization are computed. To obtain an estimate of the local density $\rho$ at a pixel $s$, we define $\rho = 1/r$, where $r$ is the radius of the enclosing sphere of the $k$ nearest neighbors of $s$, given by

\[ r = \max_{i \in N_s} \lVert \mathbf{x} - \mathbf{x}_i \rVert. \tag{3.1} \]

2 Implemented and described by our partners from Braunschweig: G. Albuquerque and M. Eisemann 

for the collaborative publication [133]. Adapted and slightly changed for the thesis by myself.


Figure 3.2: Scatterplot example (a) and its respective density image (b). For each pixel we compute the mass distribution along different directions and save the smallest value, here depicted by the blue line.

Choosing the $k$-th neighbor instead of the nearest one eliminates the influence of outliers. $k$ is chosen to be between 2 and $n - 1$, so that the minimum value of $r$ is mapped to 1. We used $k = 4$ throughout the applications in Section 3.1.6. Other density estimations could of course be used as well.
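The density estimate described above can be sketched in a few lines. This is a minimal illustration, not the implementation used for the thesis: the function name `knn_density_field`, the brute-force neighbor search, and the clamping of the radius to a minimum of 1 (one possible reading of "the minimum value of r is mapped to 1") are assumptions.

```python
import math

def knn_density_field(points, width, height, k=4):
    """Return a height x width grid holding 1 / r for every pixel, where r is
    the distance from the pixel to its k-th nearest sample point."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    field = []
    for y in range(height):
        row = []
        for x in range(width):
            # Brute-force search: sort the distances to all sample points.
            dists = sorted(math.hypot(x - px, y - py) for px, py in points)
            # Assumption: clamp r so that the smallest radius maps to 1.
            r = max(dists[k - 1], 1.0)
            row.append(1.0 / r)
        field.append(row)
    return field
```

For realistic image sizes a spatial index (for example a k-d tree) would replace the brute-force search; the quadratic loop is kept here only for readability.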

Visualizations containing high correlations should generally have corresponding density fields with a small band of larger values, while views with lower correlation should have a density field consisting of many local maxima spread in the image. We can estimate this amount of spread for every pixel by computing the normalized mass distribution, taking $s$ samples along different lines $l_\theta$ centered at the corresponding pixel positions $\mathbf{x}_{l_\theta}$ and with length equal to the image width, see Figure 3.2. For these sampled lines we compute the weighted distribution for each pixel position $\mathbf{x}_i$:

\[ \nu_i^{\theta} = \frac{\sum_{j=1}^{s} p_{s_j}^{l_\theta} \, \lVert \mathbf{x}_i - \mathbf{x}_{s_j} \rVert}{\sum_{j=1}^{s} p_{s_j}^{l_\theta}} \tag{3.2} \]

\[ \nu_i = \min_{\theta \in [0, 2\pi]} \nu_i^{\theta} \tag{3.3} \]

where $p_{s_j}^{l_\theta}$ is the $j$-th sample along line $l_\theta$ and $\mathbf{x}_{s_j}$ is its corresponding position in the image.

For pixels positioned at a maximum of a density image conveying a real correlation, the distribution value will be very small if the line is orthogonal to the local main direction of the correlation at the current position, in comparison to other positions in the image. Note that such a line can be found even in non-linear correlations. On the other hand, pixels in density images conveying low correlation will always have only large $\nu$ values.

For each column in the image, we compute the minimum value and sum up the result. The final RVM value is therefore defined as:

\[ RVM = \frac{1}{\sum_{x} \min_{y} \nu(x, y)}, \tag{3.4} \]

where $\nu(x, y)$ is the mass distribution value at pixel position $(x, y)$.
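To make the computation in Equations 3.2 to 3.4 concrete, the following sketch evaluates the mass distribution along sampled lines and accumulates the column-wise minima. It is an illustration under stated assumptions rather than the original implementation: the names `mass_distribution` and `rvm` are hypothetical, samples are looked up at the nearest pixel, samples falling outside the image are skipped, and angles are sampled over half a turn only, since a line centered at the pixel covers both directions.

```python
import math

def mass_distribution(density, x, y, theta, n_samples):
    """Density-weighted mean distance of the samples on a line through (x, y)."""
    h, w = len(density), len(density[0])
    num = den = 0.0
    for j in range(n_samples):
        # Signed offset along the line, from -w/2 to +w/2 (line length = image width).
        t = (j / (n_samples - 1) - 0.5) * w
        sx = int(round(x + t * math.cos(theta)))
        sy = int(round(y + t * math.sin(theta)))
        if 0 <= sx < w and 0 <= sy < h:  # assumption: skip samples outside the image
            p = density[sy][sx]
            num += p * abs(t)
            den += p
    return num / den if den > 0 else float("inf")

def rvm(density, n_angles=8, n_samples=9):
    """Inverse of the summed column-wise minima of the mass distribution."""
    h, w = len(density), len(density[0])
    angles = [math.pi * a / n_angles for a in range(n_angles)]
    total = 0.0
    for x in range(w):
        total += min(
            min(mass_distribution(density, x, y, th, n_samples) for th in angles)
            for y in range(h)
        )
    return 1.0 / total if total > 0 else float("inf")
```

On a small test image with a single bright diagonal band this sketch returns a larger value than on a uniform image, which matches the intended behavior. Skipping out-of-image samples biases border pixels toward small values, so a production version would need a more careful boundary treatment.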


3.1.3 Quality Measures for Scatterplots with Classified Data 

Most of the known techniques calculate the quality of a projection without taking the class 

distribution into account. In classified data plots we can search for the class distribution in 

the projection, where good views should show good class separation, i.e. minimal overlap 

of classes. 

In this section, we propose three approaches to rank the scatterplots of multivariate 

classified data sets, in order to determine the best views of the high-dimensional structures. 

Class Density Measure 3 

The Class Density Measure (CDM) evaluates orthogonal projections, i.e. scatterplots, 

according to their separation properties. Therefore, CDM computes a score for each 

candidate plot that reflects the separation properties of the classes considering also the 

density of each class. The candidate plots are then ranked according to their score, so 

that the user can start investigating highly ranked plots in the exploration process. 

In case we are given only the visualization without the data, we assume that every 

color used in the visualization represents one class. We therefore separate the classes 

first into distinct images, so that each image contains only the information of one of the 

classes. Please note that the overplotting of classes influences the computation of the 

measure. If the data is available, this is no longer a problem since all the classes can be 

plotted separately in one image. Since a continuous representation for each class-image is 

necessary to compute the overlap between the classes, we estimate a continuous, smooth 

density function based on local neighborhoods. For each screen pixel $s$, the distances to its

$k$ nearest neighbors $N_s$ of the same class are computed, and the local density is derived

as described in Section 3.1.2.

Having these continuous density functions available for each class, we estimate the mutual overlap by computing the absolute difference between each pair of density images and summing up the result:

\[ CDM = \sum_{k=1}^{M-1} \sum_{l=k+1}^{M} \sum_{i=1}^{P} \lvert p_k^i - p_l^i \rvert, \tag{3.5} \]

with $M$ being the number of density images, i.e. classes, $p_k^i$ the $i$-th pixel value in the density image computed for class $k$, and $P$ the number of pixels. If the range of the pixel values is normalized to $[0, 1]$, the range of the CDM is between 0 and $P$ for two classes ($M = 2$). This value is large if the densities at each pixel differ as much as possible, i.e. if one class has a high density value compared to all others. Consequently, the visualization with the least overlap of the classes will be given the highest value. Another property of this measure is that it assesses not only well separated but also dense clusters, which ease the interpretability of the data in the visualization. Note that non-overlapping classes in scatterplots produce different density images using our algorithm. Even if the clusters are similar, their density images are different, which results in a high value for the CDM.
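The pairwise sum of absolute pixel differences (Equation 3.5) translates directly into code. The sketch below assumes that the per-class density images are already available as equally sized nested lists; the function name `class_density_measure` is hypothetical.

```python
from itertools import combinations

def class_density_measure(density_images):
    """Sum of absolute pixel differences over all pairs of class density images."""
    total = 0.0
    for img_k, img_l in combinations(density_images, 2):
        for row_k, row_l in zip(img_k, img_l):
            total += sum(abs(a - b) for a, b in zip(row_k, row_l))
    return total
```

With two classes and pixel values in [0, 1], identical density images give 0, and the value grows as the classes occupy different pixels.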

3 Implemented and described by our partners from Braunschweig, G. Albuquerque and M. Eisemann, 



Class Separating Measure 4 

The CDM introduced before finds views with little overlap between classes and dense clusters in high-dimensional data sets. The CDM is computed over density images with a rapid falloff function. The local density $\rho$ was defined in Section 3.1.2 as $\rho = 1/r$. By changing this function, we are able to control the balance between the property of separation and dense clustering. Choosing a function that increases with $r$ can yield better separated clusters, but with a lower clustering property.

In our experiments, we found that using $\rho = r$ instead of $\rho = 1/r$ provides a good trade-off between class separability and clustering. As an extension to the CDM, we therefore propose the Class Separating Measure (CSM). The main difference between these two measures is in the computation of the continuous representation of the scatterplot, henceforth termed distance field for the CSM (with $\rho = r$) and density image for the CDM (with $\rho = 1/r$).

To compute a distance field, the local distance at a screen pixel $s$ is defined as $r$, where $r$ is the radius of the enclosing sphere of the $k$ nearest neighbors of $s$, as described earlier in Section 3.1.2. Once we have the distance field of each class, the CSM is computed as the sum of the absolute differences between them (note that for the CDM the inverse of the distance was used):

\[ CSM = \sum_{k=1}^{M-1} \sum_{l=k+1}^{M} \sum_{i=1}^{P} \lvert p_k^i - p_l^i \rvert, \tag{3.6} \]

with $M$ being the number of distance field images, i.e. classes, $p_k^i$ the $i$-th pixel value in the distance field computed for class $k$, and $P$ the number of pixels.
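The only change relative to the density-based measure is the per-class field, which stores the radius itself instead of its inverse. A minimal sketch, assuming the raw points of each class are available and using hypothetical function names and a brute-force neighbor search:

```python
import math
from itertools import combinations

def distance_field(points, width, height, k=1):
    """Grid of distances from each pixel to its k-th nearest point of one class."""
    return [
        [sorted(math.hypot(x - px, y - py) for px, py in points)[k - 1]
         for x in range(width)]
        for y in range(height)
    ]

def class_separating_measure(points_by_class, width, height, k=1):
    """Sum of absolute differences between all pairs of class distance fields."""
    fields = [distance_field(pts, width, height, k) for pts in points_by_class]
    total = 0.0
    for f_k, f_l in combinations(fields, 2):
        for row_k, row_l in zip(f_k, f_l):
            total += sum(abs(a - b) for a, b in zip(row_k, row_l))
    return total
```

In a one-row image of width 5, two single-point classes at opposite ends score 12, while the same classes placed next to each other score 5, which illustrates the bias toward large distances between clusters.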

Comparing the CSM and the CDM, the Class Separating Measure has a bias towards large distances between clusters, while the Class Density Measure has a bias towards dense clusters. We consider separation and density of the clusters as two different user tasks. Frequently, views with well separated clusters are not necessarily the ones with dense clusters. When a view presents both properties simultaneously, it is assigned a high value by both measures and therefore receives a similar rank from each. The user has the opportunity to choose a measure according to the task, or even to combine both measures in order to find projections supporting both tasks. A comparison between the Class Separating and Class Density Measures with a real example is presented in Section 3.1.6.

Histogram Density Measures 5 

The Histogram Density Measures (1D and 2D-HDM) are density measures for scatterplots 

that extend the previously presented approaches by including non-orthogonal views in 

the ranked result lists. They consider the class distribution of the data points using 

histograms. Since we are interested in plots that show good class separations, HDM looks 

for corresponding histograms that show significant separation properties given by pure 

histogram bins. To determine the best low-dimensional embedding of the high-dimensional 

data using HDM, a two step computation is conducted. 


for the collaborative publication [132]. Adapted and slightly changed for the thesis by myself. 

5 Implemented and described by myself.


First, all 2D scatterplots of the data set are ranked with the 1D-HDM in order to find, among the 1D linear projections, the dimensions in which the classes are best separated. We rank each projection by the entropy of its 1D projection, which is divided into small equidistant parts called histogram bins. $p_c$ is the number of points of class $c$ in one bin. The entropy, i.e. the average information content of that bin, is calculated as:

\[ H(p) = -\sum_{c} \frac{p_c}{\sum_{c} p_c} \log_2 \frac{p_c}{\sum_{c} p_c}. \tag{3.7} \]

$H(p)$ is 0 if a bin contains only points of one class, and $\log_2 M$ if it contains equal numbers of points of all $M$ classes. Each projection is ranked with the 1D-HDM:

\[ HDM_{1D} = 100 - \frac{1}{Z} \sum_{x} \Big( \sum_{c} p_c \, H(p) \Big) \tag{3.8} \]

\[ \phantom{HDM_{1D}} = 100 - \frac{1}{Z} \sum_{x} \Big( \sum_{c} p_c \Big( -\sum_{c} \frac{p_c}{\sum_{c} p_c} \log_2 \frac{p_c}{\sum_{c} p_c} \Big) \Big), \tag{3.9} \]

where $\frac{1}{Z}$ is a normalization factor to obtain ranking values between 0 and 100, with 100 as the best value:

\[ \frac{1}{Z} = \frac{100}{\log_2 M \, \sum_{x} \sum_{c} p_c}. \tag{3.10} \]
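The entropy-based ranking of Equations 3.7 to 3.10 can be sketched as follows. The function names, the equal-width binning over the value range, and the default of 10 bins are assumptions made for illustration; the double sum in the normalization factor is simply the total number of points.

```python
import math

def bin_entropy(counts):
    """Entropy of the class distribution inside one histogram bin (Eq. 3.7)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def hdm_1d(values, labels, n_bins=10):
    """1D Histogram Density Measure of a projection: 100 is best, 0 is worst."""
    classes = sorted(set(labels))
    m = len(classes)
    if m < 2:
        return 100.0  # a single class is trivially "pure"
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero bin width
    hist = [[0] * m for _ in range(n_bins)]
    for v, lab in zip(values, labels):
        b = min(int((v - lo) / width), n_bins - 1)
        hist[b][classes.index(lab)] += 1
    # Each bin's entropy is weighted by the number of points it holds.
    weighted = sum(sum(counts) * bin_entropy(counts) for counts in hist)
    z_inv = 100.0 / (math.log2(m) * len(values))
    return 100.0 - z_inv * weighted
```

A projection in which every bin holds a single class scores 100, and one in which every bin holds equal shares of all classes scores 0.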

Figure 3.3: 2D view and rotated projection axes. The projection on the rotated plane has less 

overlap, and the structures of the data can be seen even in the projection. This is not possible for 

a projection on the original axes. 

In some data sets, paraxial projections are not able to show the structure of high-dimensional data. In these cases, a simple rotation of the projection axes can improve the quality of the measure. In Figure 3.3 we show an example where a rotation improves the projection quality. While the paraxial projection of these classes cannot show these structures on the axes, the rotated (dotted) projection axes have less overlap for a projection on the $x'$ axis. Consequently, we rotate the projection plane and compute the


1D-HDM for different angles $\theta$. For each plot we choose the best 1D-HDM value out of the different rotation angles. We experimentally found $\theta = 9m$ degrees, with $m \in [0, 20)$, to work well for all our data sets. Figure 3.4 sketches this first step, showing how we measure different rotations for one plot (represented by the distribution histograms) to find the best measure value representing the visual quality of the plot.


Figure 3.4: First step of the HDM approach: each plot is ranked for different rotations with the 1D-HDM. The best measure value is taken for the plot.
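The rotation step can be illustrated by projecting the points of one scatterplot onto an axis rotated in 9 degree increments and keeping the best score. This is a sketch under assumptions: `best_rotation` is a hypothetical name, the projection is taken onto the rotated x axis only, and the compact `hdm_1d` helper is an illustrative restatement of the entropy-based measure, not the original code.

```python
import math
from collections import Counter, defaultdict

def hdm_1d(values, labels, n_bins=10):
    """Compact entropy-based 1D-HDM: 100 means every bin holds one class."""
    m = len(set(labels))
    if m < 2:
        return 100.0
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    bins = defaultdict(Counter)
    for v, lab in zip(values, labels):
        bins[min(int((v - lo) / width), n_bins - 1)][lab] += 1
    weighted = 0.0
    for counts in bins.values():
        n = sum(counts.values())
        # n * H(bin) equals -sum_c c * log2(c / n)
        weighted -= sum(c * math.log2(c / n) for c in counts.values())
    return 100.0 - 100.0 * weighted / (math.log2(m) * len(values))

def best_rotation(points, labels, n_bins=10):
    """Return (best score, angle in degrees) over the angles 0, 9, ..., 171."""
    best = (-1.0, 0)
    for m in range(20):
        theta = math.radians(9 * m)
        proj = [x * math.cos(theta) + y * math.sin(theta) for x, y in points]
        best = max(best, (hdm_1d(proj, labels, n_bins), 9 * m))
    return best
```

For two classes lying on parallel diagonal lines, the projection on the original x axis mixes the classes completely, whereas a rotated axis separates them and reaches the maximum score.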

Second, a subset of the best ranked dimensions is chosen to be further investigated in higher dimensions. All the combinations of the selected dimensions enter a PCA computation. PCA [83] transforms a high-dimensional data set with correlated dimensions into a lower-dimensional data set with uncorrelated dimensions, called principal components. For more properties of PCA please refer back to Section 2.1.2.

For every combination of selected dimensions, after the PCA is computed, the first two components of the PCA are plotted to be ranked by the 2D-HDM (see Figure 3.5). The 2D-HDM is an extended version of the 1D-HDM, for which a 2-dimensional histogram is computed on the scatterplot. The quality is measured exactly as for the 1D-HDM, by summing up the weighted entropies of the bins. The measure is normalized between 0 and 100, with 100 for the best visualization of the data points, where each bin contains points of only one class. In addition, the bin neighborhood is considered here: for each bin we sum the information $p_c$ of the bin itself and of its direct neighborhood, labeled as $u_c$. Consequently, the 2D-HDM is:

\[ HDM_{2D} = 100 - \frac{1}{Z} \sum_{x,y} \Big( \sum_{c} u_c \Big( -\sum_{c} \frac{u_c}{\sum_{c} u_c} \log_2 \frac{u_c}{\sum_{c} u_c} \Big) \Big) \tag{3.11} \]

with the adapted normalization factor:

\[ \frac{1}{Z} = \frac{100}{\log_2 M \, \sum_{x,y} \big( \sum_{c} u_c \big)}. \tag{3.12} \]
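A sketch of the two-dimensional variant follows. It assumes an equal-width grid of `n_bins` by `n_bins` cells and interprets the "direct neighborhood" as the surrounding 3 by 3 block of bins clipped at the border; both choices and the function name `hdm_2d` are assumptions for illustration.

```python
import math

def hdm_2d(points, labels, n_bins=8):
    """2D Histogram Density Measure with neighborhood-summed class counts."""
    classes = sorted(set(labels))
    m = len(classes)
    if m < 2:
        return 100.0
    xs, ys = zip(*points)

    def index(v, lo, hi):
        width = (hi - lo) / n_bins or 1.0  # avoid a zero bin width
        return min(int((v - lo) / width), n_bins - 1)

    # hist[row][col][class] holds the per-bin class counts p_c.
    hist = [[[0] * m for _ in range(n_bins)] for _ in range(n_bins)]
    for (x, y), lab in zip(points, labels):
        row = index(y, min(ys), max(ys))
        col = index(x, min(xs), max(xs))
        hist[row][col][classes.index(lab)] += 1

    weighted = norm = 0.0
    for by in range(n_bins):
        for bx in range(n_bins):
            # u_c: counts of the bin plus its direct neighbors (assumed 3x3 block).
            u = [0] * m
            for ny in range(max(by - 1, 0), min(by + 2, n_bins)):
                for nx in range(max(bx - 1, 0), min(bx + 2, n_bins)):
                    for c in range(m):
                        u[c] += hist[ny][nx][c]
            n = sum(u)
            norm += n
            weighted -= sum(c * math.log2(c / n) for c in u if c > 0)
    return 100.0 - 100.0 * weighted / (math.log2(m) * norm)
```

Summing over the neighborhood penalizes plots in which bins of different classes touch each other, even if every individual bin is pure.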


Figure 3.5: Second step of the HDM approach: PCA is computed on the k best selected dimensions 

and on all the possible subsets greater than 3 dimensions. The first two components are plotted 

in scatterplots, that are ranked with the 2D-HDM. The best measure value indicates the best 

scatterplot where the class information is separated.


3.1.4 Quality Measures for Parallel Coordinates with Unclassified Data 

When analyzing parallel coordinates plots, we focus on the detection of plots that either 

show significant correlation between attribute dimensions or good clustering properties 

in certain attribute ranges. There exist a number of analytical approaches for parallel 

coordinates to generate dimension orderings that try to fulfill these tasks [9, 159]. However, 

they often do not generate an optimal parallel plot for correlation and clustering properties, 

because of local effects that are not taken into account by most analytical functions. We

therefore present analysis functions that take into account not only the properties of the

data, but also the properties of the resulting plot.

Hough Space Measure 6 

Our analysis is based on finding patterns like clustered lines with similar positions and directions. Our algorithm for detecting these clusters is based on the Hough transform [73]. Straight lines in the image space can be described as $y = ax + b$. The main idea of the Hough transform is to define a straight line according to its parameters, i.e. the slope $a$ and the intercept $b$. Due to a practical difficulty (the slope of vertical lines is infinite), the normal representation of a line is used instead:

\[ \rho = x \cos\theta + y \sin\theta, \tag{3.13} \]

where $\rho$ is the length of the normal from the origin to the line and $\theta$ is the angle between this normal and the x-axis. Using this representation, for each non-background pixel in the visualization, we have a distinct sinusoidal curve in the $\rho\theta$-plane, also called Hough or accumulator space. An intersection of these curves indicates that the corresponding pixels belong to the line defined by the parameters $(\rho_i, \theta_i)$ in the original space. Figure 3.6 shows two synthetic examples of parallel coordinates and their respective Hough spaces: Figure 3.6(a) presents two well defined line clusters and is more interesting for the cluster identification task than Figure 3.6(b), where no line cluster can be identified. Note that the bright areas in the $\rho\theta$-plane represent the clusters of lines with similar $\rho$ and $\theta$.

To reduce the bias towards long lines, e.g. diagonal lines, we scale the pairwise visualization images to an $n \times n$ resolution, usually $512 \times 512$. The accumulator space is quantized into a $w \times h$ cell grid, where $w$ and $h$ control the sensitivity to the similarity of the lines. We use $50 \times 50$ grids for computing the results presented in Section 3.1.6 and in Section 3.1.7. A lower value for $w$ and $h$ reduces the sensitivity of the algorithm, because lines with a slightly different $\rho$ and $\theta$ are mapped to the same accumulator cells.
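The voting scheme behind the accumulator space can be sketched as follows. This is a generic Hough transform for a binary image given as nested lists, not the original implementation; the function name, the sampling of the angle over half a turn, and the mapping of the normal length onto the grid rows are assumptions.

```python
import math

def hough_accumulator(image, w=50, h=50):
    """Quantized Hough space: h rows for the normal length, w columns for the angle."""
    n_rows, n_cols = len(image), len(image[0])
    # Largest possible |rho| is the image diagonal; fall back to 1 for a 1x1 image.
    rho_max = math.hypot(n_cols - 1, n_rows - 1) or 1.0
    acc = [[0] * w for _ in range(h)]
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if not value:  # background pixels do not vote
                continue
            for t in range(w):
                theta = math.pi * t / w
                rho = x * math.cos(theta) + y * math.sin(theta)
                # Map rho from [-rho_max, rho_max] to a row index in [0, h - 1].
                r = int((rho + rho_max) / (2 * rho_max) * (h - 1) + 0.5)
                acc[r][t] += 1
    return acc
```

Every foreground pixel casts one vote per angle column, so pixels lying on a common line pile up in a single cell; a coarser grid merges lines with slightly different parameters into the same cell.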

Based on our definition, good visualizations must contain few well defined clusters, which are represented by accumulator cells with high values. To identify these cells, we compute the median value $m$ as an adaptive threshold that divides the accumulator function $h(x)$ into two equal parts:

\[ \frac{\sum h(x)}{2} = \sum g(x), \quad \text{where} \quad g(x) = \begin{cases} x & \text{if } x \le m; \\ m & \text{else.} \end{cases} \tag{3.14} \]


Figure 3.6: Synthetic examples of parallel coordinates and their respective Hough spaces: (a) presents two well defined line clusters and is more interesting for the cluster identification task than (b), where no line cluster can be identified. Note that the bright areas in the $\rho\theta$-plane represent the clusters of lines with similar $\rho$ and $\theta$.

Using the median value, only a few clusters are selected in an accumulator space with high 

contrast between the cells (see Figure 3.6(a)) while in a uniform accumulator space many 

clusters are selected (see Figure 3.6(b)). This adaptive threshold is not only necessary to 

select possible line clusters in the accumulator space, but also to avoid the influence of 

outliers and occlusion between the lines. In the occlusion case, a point that belongs to 

two or more lines is computed just once in the accumulator space. 

The final quality value for a 2D visualization is computed as the number of accumulator 
cells n_cells that have a higher value than m, normalized by the total number of cells (w · h) 
to the interval [0, 1]: 

\[ s_{i,j} = 1 - \frac{n_{cells}}{w \cdot h}, \tag{3.15} \]

where i, j are the indices of the respective dimensions, and the computed measure s_{i,j} 
presents higher values for images containing well defined line clusters (similar lines) and 
lower values for images containing lines in many different directions and positions. 
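
The threshold and the pairwise score can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the threshold m is found by bisection so that the clipped accumulator sums to half of the original sum, as in Equation 3.14, and the function names and the tolerance are hypothetical choices for this example rather than the thesis implementation.

```python
import numpy as np

def adaptive_threshold(acc, tol=1e-9):
    """Find m such that sum(min(h(x), m)) equals sum(h(x)) / 2 (bisection, assumed method)."""
    values = np.asarray(acc, dtype=float).ravel()
    target = values.sum() / 2.0
    lo, hi = 0.0, float(values.max())
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if np.minimum(values, mid).sum() < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2.0

def hough_space_score(acc):
    """Score s = 1 - n_cells / (w * h), with n_cells the cells above the threshold m."""
    acc = np.asarray(acc, dtype=float)
    m = adaptive_threshold(acc)
    n_cells = int((acc > m).sum())
    return 1.0 - n_cells / acc.size
```

An accumulator with two strong peaks yields a score close to 1 because only two cells exceed m, whereas a uniform accumulator yields 0 because every cell exceeds m.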

Having combined the pairwise visualizations, we can now compute the overall quality 

measure by summing up the respective pairwise measurements. This overall quality 

measure of a parallel visualization containing n dimensions is: 

\[ HSM = \sum_{a_i \in I} s_{a_i, a_{i+1}}, \tag{3.16} \]

where I is a vector containing any possible combination of the n dimensions indices. In this 

way we can measure the quality of any given visualization by using parallel coordinates. 

Exhaustively computing all n-dimensional combinations in order to choose the best/worst 

ones requires a very long computation time and becomes infeasible for a large n. In these 

cases, to search for the best n-dimensional combinations in a feasible time, an algorithm 

for solving a Traveling Salesman Problem is used, e.g. the A*-Search algorithm [66] or others 

[12]. Instead of exhaustively combining all possible pairwise visualizations, these kinds of 

algorithms would compose only the best overall visualization.
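
For a small number of dimensions, the exhaustive variant can be written down directly. The sketch below is only an illustration: it evaluates Equation 3.16 for every ordering of k axes and keeps the best one. It deliberately uses brute force rather than the A*-Search mentioned above, and the names `hsm` and `best_ordering` as well as the nested-list score table `s` are assumptions of this example.

```python
from itertools import permutations

def hsm(order, s):
    """Sum the pairwise scores of neighboring axes for one axis ordering (Equation 3.16)."""
    return sum(s[a][b] for a, b in zip(order, order[1:]))

def best_ordering(s, k):
    """Brute-force search over all orderings of k out of len(s) dimensions."""
    n = len(s)
    best = max(permutations(range(n), k), key=lambda order: hsm(order, s))
    return best, hsm(best, s)
```

The number of orderings grows factorially with k, which is why this variant is only practical for small k.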


3.1.5 Quality Measures for Parallel Coordinates with Classified Data 

While analyzing parallel coordinates visualizations with class information, we consider 

two main issues. First, in good parallel coordinates visualizations, the lines that belong 

to a determined class must be quite similar (inclination and position similarity). Second, 

visualizations where the classes can be separately observed and that contain less overlap 

are also considered to be good. We developed two measures for classified parallel 

coordinates that take these matters into account: the Similarity Measure that encourages 

inner class similarities, and the Overlap Measure that analyzes the overlap between 

classes. Both are based on the Hough Space Measure for unclassified data presented in the 

previous Section 3.1.4. 

Similarity Measure 7 

The Similarity Measure (SM) is a direct extension of the HSM presented before for unclassified 
data. For visualizations containing class information, the different classes are usually 
represented by different colors. We separate the classes into distinct images, containing 
only the pixels in the respective class color, and compute a quality measure s_k for each 
class, using Equation 3.15. Thereafter, an overall quality value SM is computed as the 
sum of all class quality measures: 

\[ SM = \sum_{k} s_k. \tag{3.17} \]

Using this measure, we encourage visualizations with strong inner class similarities and 

slightly penalize overlapping classes. Note that due to class overlap, some classes 

have many missing pixels, which results in a lower s_k value compared to other visualizations 

where less or no overlap between the classes exists. 

Overlap Measure 8 

In order to penalize overlap between classes, we analyze the difference between the classes 
in the Hough space (see Section 3.1.4). As in the SM, for the Overlap Measure we also 
separate the classes into different images and compute the Hough transform over each image. 
Once we have a Hough space h for each class, we compute the quality measure as the sum 
of the absolute difference between the classes: 

\[ OM = \sum_{k=1}^{M-1} \sum_{l=k+1}^{M} \sum_{i=1}^{P} \left| h_k^i - h_l^i \right|. \tag{3.18} \]

Here M is the number of Hough space images, i.e. classes respectively and P is the number 

of pixels in each image. The measure value is high if the Hough spaces are disjoint, i.e. if 

there is no large overlap between the classes. Therefore, the visualization with the smallest 

overlap between the classes receives the highest measure values. 






Another valuable use of this measure is to encourage or search for similarities between 
different classes. In this case, the overlap between the classes is desired, and the previously 
computed measure can be inverted to compute suitable quality values: 

\[ OM_{inv} = \frac{1}{OM}. \tag{3.19} \]
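
Given one Hough accumulator per class, the Overlap Measure and its inverse reduce to a few array operations. The following sketch assumes that all accumulators have the same shape; the function names are hypothetical, and returning infinity for identical Hough spaces (OM = 0) is a choice made for this example that the text does not specify.

```python
from itertools import combinations
import numpy as np

def overlap_measure(hough_spaces):
    """Sum of absolute cell-wise differences over all pairs of class Hough spaces."""
    spaces = [np.asarray(h, dtype=float) for h in hough_spaces]
    return float(sum(np.abs(a - b).sum() for a, b in combinations(spaces, 2)))

def inverted_overlap_measure(hough_spaces):
    """Inverse measure 1 / OM, rewarding overlap between the classes."""
    om = overlap_measure(hough_spaces)
    # Assumption: identical Hough spaces (OM == 0) are mapped to infinity.
    return float("inf") if om == 0 else 1.0 / om
```

Disjoint Hough spaces give a large OM and therefore a small inverted value, while strongly overlapping ones give the opposite.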

3.1.6 Application on Real Data Sets 

To evaluate our measures we tested them on a variety of different real data sets. We applied 

our Class Density Measure (CDM), Class Separating Measure (CSM), Histogram Density 

Measure (HDM), Similarity Measure (SM), and Overlap Measure (OM) on classified data 

to find views that try to either separate or show similarities between the classes. For 

unclassified data, we applied our Rotating Variance Measure (RVM) and Hough Space 

Measure (HSM) in order to find linear or non-linear correlations and clusters in the data 

sets, respectively. 

Except for the HDM, we chose to present only relative measures, i.e. all calculated 

values are scaled so that the best found visualization is assigned 100 and the worst 0. 

This scaling is intended to ease the interpretability of the measure by the user. For 

the HDM, we chose to present the unchanged measure values, as the HDM allows an 

easy direct interpretation, with a value of 100 being the best and 0 being the worst 

possible constellation. If not otherwise stated, our examples are proofs of concept, and 

interpretations of some of the results should be provided by domain experts. 
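
The relative scaling can be illustrated with a short Python sketch. It assumes a linear min-max scaling, which is one plausible reading of the description above; the function name and the handling of the degenerate case where all values are equal are assumptions of this example.

```python
def to_relative(values):
    """Linearly rescale measure values so the best view gets 100 and the worst gets 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: if all views score the same, report 100 for each of them.
        return [100.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]
```

Because the scaling is relative to the views found in one data set, the resulting values are not comparable across data sets.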

Data Sets 

We used the data sets summarized in Table 3.2 to show the measures’ properties. In this 

table we present some information about the data. More details about the data sources 

and the dimension names of each data set can be found in Appendix A. 

Table 3.2: Overview of the data sets used to show the measures’ properties. 

data set name                        records   dimensions 9   classes   source 
Cars                                 7404      22             2         partners 
Olives                               572       8              9         [163] 
Parkinson’s Disease                  195       11             0         [95, 96] 
Wine                                 178       13             3         [53] 
Wisconsin Diagnostic Breast Cancer   569       30             2         [131] 

Cars contains 7404 cars listed with 23 different attributes, including price, power, fuel 
consumption, width, height and others, automatically collected from a national second-hand 
car selling website 10 . We chose the attribute fuel as a class label, dividing the data 
into two classes, benzine and diesel. Our goal is to find the similarities and differences 
between these. 

9 The number of dimensions does not include the class attribute. 

10 Collected by another institute from Braunschweig and provided to our partners there.


Best ranked views using RVM 

100 - (dim9,dim12) 97 - (dim2,dim3) 75 - (dim2,dim4) 

Worst ranked views using RVM 

0 - (dim6,dim8) 0.3 - (dim7,dim8) 5.6 - (dim2,dim8) 

Figure 3.7: Results for the Parkinson’s Disease data set using our RVM measure (Section 3.1.2). 

While clumpy low-correlation bearing views are punished (bottom row), views containing higher 

correlation between the variables are preferred (top row). 

Olives is a classified data set with 572 olive oil samples from nine different regions in 

Italy [163]. For each sample the normalized concentrations of eight fatty acids are given. 

The large number of classes (regions) poses a challenging task to the algorithms trying to 

find views in which all classes are well separated. 

Parkinson’s Disease is a data set composed of 195 biomedical voice measures from 

31 people, 23 of whom have Parkinson’s disease [95, 96]. Each of the 12 dimensions is 

a particular voice measure. The voice recordings from these individuals have been taken 

with the goal to discriminate healthy people from those with Parkinson’s disease. 

Wine is a classified data set with 178 instances and 13 attributes describing chemical 

properties of Italian wines derived from three different cultivars. 

Wisconsin Diagnostic Breast Cancer (WDBC) data set consists of 569 samples with 

30 real-valued dimensions each [131]. The data is classified into malign and benign cells. 

The task is to find the best separating dimensions showing the two classes. 


First we show the results for RVM on the Parkinson’s Disease data set 11 . The three best 
and the three worst ranked scatterplots by the RVM are shown in Figure 3.7, presenting 
the RVM value above each plot. High correlations have been measured in the plots 
(dim9,dim12), (dim2,dim3), as well as (dim2,dim4). However, visualizations containing 
low correlation received a low value, as shown in the second row of this figure presenting 
the worst ranked views and their measure values. This example demonstrates that our 
target pattern, the correlated dimensions, is correctly identified by the RVM measure. 

11 For easier reading of this paragraph, we renamed the original dimension names. Please refer to 
Appendix A Table A.3 for the original dimension names.

Best ranked views using CDM 


Worst ranked views using CDM 


Figure 3.8: Results for the Olives data set using our CDM measure (Section 3.1.3). The different 

colors depict the different classes (regions) of the data set. While it is impossible for this data set 

to find views completely separating all classes, our CDM measure still found views where most of 

the classes are mutually separated (top row). In the worst ranked views the classes clearly overlap 

with each other (bottom row). 

Best ranked PCA-views using HDM approach 

85.45 - PCA(dim(4,5,8)) 84.98 - PCA(dim(1,2,4,5)) 84.9 - PCA(all data dims) 

Figure 3.9: Results for the Olives data set using our HDM measure (Section 3.1.3). The best 

ranked plot is the PCA of dim(4,5,8) revealing a good view on all the classes, the second best is 

the PCA of dim(1,2,4) and the third is the PCA on all 8 dimensions. The differences between 

the last two are small because the variance in the additional dimensions for the 3rd eigenvector, 

relative to the 2nd, is small. The difference between the last two views and the first view is 

clearly visible (e.g. looking at the yellow class).


In Figure 3.8, we show the results for the Olives data set 12 using our CDM measure. 

Even though a view separating all nine different olive classes does not exist, the CDM 

reliably chooses three views that separate the data quite well in the dimensions (dim4, 

dim5 ), (dim1,dim5 ) as well as (dim1,dim4 ). The bottom row of this figure presents the 

worst ranked projections. We can see that in these cases it is impossible to identify any 

class structure in the views. 

We also applied our HDM technique to this data set. First the 1D-HDM tries to 

identify the best separating dimensions of the data set, as presented in Section 3.1.3. 

The dimensions dim1, dim2, dim4, dim5 and dim8 were ranked as the best separating 

dimensions by the 1D-HDM. We computed all subsets of these dimensions, computed the 

PCA on these subsets, and ranked the views of the first two PCA components with the 

2D-HDM. In the best ranked views, presented in Figure 3.9, the different classes are well 

separated. Compared to the upper row in Figure 3.8, the visualization utilizes the screen 

space better, which is due to the PCA transformation. 
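
The subset enumeration and projection step can be sketched as follows. This is an illustrative sketch, not the thesis implementation: PCA is computed via a singular value decomposition of the centered data, only subsets with at least two dimensions are considered (an assumption), and the ranking function `score` is a hypothetical stand-in for the 2D-HDM, which is passed in by the caller.

```python
from itertools import combinations
import numpy as np

def pca_2d(data):
    """Project the rows of data onto their first two principal components."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def rank_pca_views(data, dims, score):
    """Rank the 2D PCA views of every subset of dims (size >= 2) by a given score.

    `score` is a caller-supplied function (placeholder for the 2D-HDM).
    """
    results = []
    for size in range(2, len(dims) + 1):
        for subset in combinations(dims, size):
            view = pca_2d(data[:, list(subset)])
            results.append((score(view), subset))
    results.sort(key=lambda item: item[0], reverse=True)
    return results
```

For the five dimensions selected by the 1D-HDM this enumerates 26 subsets, so the exhaustive evaluation remains cheap.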

Best ranked views using CSM 


Worst ranked views using CSM 


Figure 3.10: Results for the Wine data set using our CSM measure (Section 3.1.3). The best ranked 

plots present a large distance between the centers of the class clusters while the worst ranked views 

show only cluttered data. 

Comparing our CSM and CDM measures, we can observe that they present distinct 
results on the same data sets. Applying the CSM to the Wine data set 13 reveals views 
that present a good separation between the classes. The best ranked plots are shown in 
the upper row of Figure 3.10: (dim7,dim13), (dim7,dim10), and (dim7,dim12). They 
present a large distance between the centers of the class clusters. The worst ranked views, 
in contrast, show only cluttered data. In comparison, the result of the CDM measure on 
the Wine data set is depicted in Figure 3.11. The best ranked plots (dim7,dim10), 
(dim1,dim7), and (dim7,dim13) present more dense clusters, as expected from the ranking 
criteria of this measure. Note that the second best ranked view, (dim1,dim7) (with CDM 
= 89), is not considered good using the CSM measure, getting a lower rank and quality 
value (CSM = 58). Comparing Figure 3.10 and Figure 3.11, we can observe that the CSM 
favors large distances between the clusters while the CDM assigns high values to views 
that present dense but separated clusters, even if the distances between them are much 
smaller. 

12 For easier reading of this paragraph, we renamed the original dimension names. Please refer 
to Appendix A Table A.2 for the original dimension names. 

13 Please refer to Appendix A Table A.4 for the original dimension names. 

Worst ranked views using CDM 

Figure 3.11: Results for the Wine data set using our CDM measure (Section 3.1.3). Note that the 
second best ranked view, (dim1,dim7) (with CDM = 89), is not considered good using the CSM 
measure (CSM = 58). 

There are cases when just looking at the best ranked and the worst ranked plots is 

not enough. By arranging all the scatterplots in a scatterplot matrix the analyst has 

the possibility to look at all orthogonal views of a data set at once. In our system the 

scatterplots are shown in the upper right half of the SPLOM while the other half is used 

to display the quality values of each plot. To guide the analysis the user can fade out 

lower ranked views, which helps to focus on those with a higher probability of information 

bearing content. One drawback is that this SPLOM does not scale to a very large number 
of dimensions because the number of scatterplots grows quadratically. Figure 3.12 shows an 
example. Both SPLOMs show the WDBC data set 14 , but the upper SPLOM shows the 
results for the RVM while the bottom SPLOM shows the results for the CDM measure. 
The threshold for both SPLOMs was set to 0.95 15 , so all plots with a lower rank have 
been faded out. As can be seen in the enlarged detail, different views come into focus 
depending on the chosen measure. While the RVM considers plots with a high degree of 
correlation as more important, the CDM focuses on separating the designated classes, here 
the malign and benign cells. Which pattern is more important depends on the user task. 

14 Please refer to Appendix A Table A.5 for details about the original dimension names of the data set. 

15 Please note that the SPLOM shows the measure values between 0 and 1 while all the other results 
presented before were on a scale from 0 to 100. 

Figure 3.12: Results on the WDBC data set for the RVM (top) and the CDM (bottom). In this 

example, views with a quality value of less than 0.95 have been faded out. This way, many irrelevant 

views are faded out, reducing the number of plots that the user has to inspect in more detail 

to a more manageable number.



To demonstrate the value of our approaches for parallel coordinates, we present the best 

and worst ranked visualizations by our measures on different data sets. The corresponding 

visualizations are shown in Figure 3.13, 3.14 and 3.15. For a better comparability the 

visualizations have been cropped after the display of the 4th dimension. In all experiments 

we used a size of 50 × 50 for the Hough accumulator. The algorithms are quite robust 

with respect to the size, and using more cells generally only increases computation time 

but has little influence on the result. 

Figure 3.13 shows the ranked results for the Parkinsons Disease data set 16 using our 

Hough Space Measure. 

The HSM algorithm prefers views with more similarity in the distance and inclination 

of the different lines, resulting in the prominent small band in the visualization of the 

Parkinsons Disease data set. This is similar to clusters in the projected views of these 

dimensions, here between dim3 and dim12 as well as dim6 and dim11. 

best ranked views using HSM: 100, 97, 97 

worst ranked views using HSM: 0, 0.7, 1.1 

Figure 3.13: Results for the non-classified version of the Parkinsons Disease data set. Best and 
worst ranked visualizations using our HSM measure for non-classified data (ref. Section 3.1.4). Top 
row: The three best ranked visualizations and their respective normalized measures. Well defined 
clusters in the data set are favored. Bottom row: The three worst ranked visualizations. The large 
amount of spread hampers interpretation. Note that the user task related to this measure is 
not to find possible correlation between the dimensions but to detect well separated clusters. 

Applying our Similarity Measure to the Cars data set we can see that there seem to be 
barely any good views to split the clusters of the data set (see Figure 3.14). We verified 
this by exhaustively looking at all pairwise projections. However, the only dimension 
where the classes can be mostly separated and at least some form of cluster can be reliably 
found is dim6, in which cars using diesel (represented in red) generally have a lower value 
compared to benzine (represented in black). Figure 3.14 shows the best ranked results in 
the top row. Additionally, the similarity of the majority in dim15, dim18 and dim3 can be 
detected. Obviously, cars using diesel are cheaper; this might be due to the age of the diesel 
cars, but age was unfortunately not included in the database. On the other hand, the 
worst ranked views using the SM (see Figure 3.14, bottom row) are barely interpretable; 
at least we were unable to extract any useful information. 

In Figure 3.15 the results for our Overlap Measure applied to the WDBC data set are 
shown. This result is very promising. In the top row, showing the best plots, the malign 
and benign cells are well separated. It seems that the dimensions dim22 (radius (worst)), dim9 
(concave points (mean)), dim24 (perimeter (worst)), dim29 (concave points (mean)) and 
dim25 (area (worst)) separate the two classes well. 

16 Please refer to Appendix A Table A.3 for the original dimension names. 

best ranked views using SM: 100, 98, 98 (axis orders shown: 17 6 15 18; 17 6 20 18; 3 20 18 15) 

worst ranked views using SM: 0, 0.1, 0.2 (axis orders shown: 9 1 19 12; 5 19 1 9; 9 1 12 19) 

Figure 3.14: Results of the SM for the Cars data set. Cars using benzine are shown in black, 
diesel in red. Best and worst ranked visualizations using our Hough Similarity Measure (Section 
3.1.5) for parallel coordinates. Top row: The three best ranked visualizations and their respective 
normalized measures. Bottom row: The three worst ranked visualizations. 

best ranked views using OM: 100, 99, 99 (axis orders shown: 25 9 24 29; 22 9 24 29; 25 9 22 29) 

worst ranked views using OM: 0, 0.1, 0.2 (axis orders shown: 17 18 31 21; 13 31 18 17; 13 17 18 31) 

Figure 3.15: Results of the OM for the WDBC data set. Malign nuclei are colored black while 
healthy nuclei are red. Best and worst ranked visualizations using our Overlap Measure (Section 
3.1.5) for parallel coordinates. Top row: The three best ranked visualizations. Beyond good similarity, 
which corresponds to clusters, visualizations that minimize the overlap between the classes are 
favored, so that the difference between malign and benign cells becomes clearer. Bottom 
row: The three worst ranked visualizations. The overlap of the data complicates the analysis and 
the information is useless for the task of discriminating malign and benign cells. 


3.1.7 Evaluation of the Measures’ Performance Using Synthetic Data 

The work presented by Johansson and Johansson [82] introduces a system for dimensionality 

reduction by combining user-defined quality metrics using weighted functions to 

preserve as many important structures as possible in the reduced data set. The analyzed 

structures are clustering properties, outliers and dimension correlations. We used the synthetic 

data set presented in their paper to test our Hough Space Measure. This contains 

1320 data items and 100 variables, of which 14 contain significant structures. 

The HSM algorithm prefers views with more similarity in the distance and inclination 

of the different lines. We computed our HSM on this synthetic data set and present the 

result in Figure 3.16. Here we can see the best ranked 4-dimensional parallel coordinates 

plots for clustered data points in the top row and the worst ranked plots in the bottom. 

At the top, the clusters of lines are clearly visible in contrast to the bottom where no 

structures are visible. The five dimensions that are in the best plots are dimensions A, 

C, G, I, J. Four out of five dimensions are also determined by [82] as the best dimensions 

for clustering. They use user-defined quality measures for their system to determine the 

best dimensions according to different criteria. Our resulting dimensions are a subset of 

their best 9 dimensions for showing clustered data points. This indicates that our 

measures rank plots in a way similar to how users would rank them. 

best ranked views using HSM: 100, 99.3, 98.8 

worst ranked views using HSM: 0, 0, 0.2 

Figure 3.16: Results of the HSM for the synthetic data set from [82] presenting the best and worst 
ranked visualizations using our HSM measure for non-classified data (ref. Section 3.1.4). Top 
row: The three best ranked visualizations and their respective normalized measures. Well defined 
clusters in the data set are favored. Bottom row: The three worst ranked visualizations. The large 
amount of spread hampers interpretation. Note that the user task related to this measure is 
not to find high correlation between the dimensions but to detect well separated clusters. 

To show the effectiveness of our scatterplot measures and to explain their differences, we 
analyzed their results on a self-generated synthetic data set, synthetic2. We created a 
10-dimensional data set with two classes. By selecting just two classes, we aim to show 
the fundamental differences between the measures that allow hidden patterns to be detected. 
In three dimensions we hid target patterns to test how these projections are ranked by 
the measures. The patterns were created as follows: the first pattern in subspace (2, 5) 
contains two classes with means at m_1 = (6, 14) and m_2 = (13, 6), each containing 500 
samples from a multivariate normal distribution with 

\[ C_1 = \begin{pmatrix} 3 & 2.7 \\ 2.7 & 3 \end{pmatrix} \]

the covariance matrix of the variables. In dimension 6 we defined two classes with means 
at m_3 = 6 and m_4 = 13, respectively, with 500 random samples of a normal distribution 
and with standard deviation std = 1.5 for each class. With this definition of the dimensions, 
three patterns occur in subspaces (2, 5), (2, 6) and (5, 6). 

In the other 7 dimensions we defined random patterns. These are developed systematically 
by taking, for every dimension, the mean m_d = 10 and 1000 samples from a normal 
distribution, starting from a standard deviation std = 0.5 and increasing it by 0.5 for 
each dimension. Therefore, the last random dimension has std = 3.5. 
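
A data set with these properties can be generated with a few lines of NumPy. The sketch below follows the description above, with several assumptions stated plainly: both classes in subspace (2, 5) share the covariance matrix C_1, the seven noise dimensions are the remaining ones in ascending order, and the function name and random seed are arbitrary choices for this example, so the output will not reproduce the exact samples used in this thesis.

```python
import numpy as np

def make_synthetic2(seed=0):
    """Generate a 1000 x 10 data set with target patterns in dimensions 2, 5 and 6."""
    rng = np.random.default_rng(seed)
    n_per_class = 500
    cov = np.array([[3.0, 2.7], [2.7, 3.0]])   # C_1, assumed for both classes
    means_25 = [(6.0, 14.0), (13.0, 6.0)]      # m_1 and m_2 in subspace (2, 5)
    means_6 = [6.0, 13.0]                      # m_3 and m_4 in dimension 6
    data = np.empty((2 * n_per_class, 10))
    labels = np.repeat([0, 1], n_per_class)
    for c in (0, 1):
        rows = slice(c * n_per_class, (c + 1) * n_per_class)
        xy = rng.multivariate_normal(means_25[c], cov, size=n_per_class)
        data[rows, 1] = xy[:, 0]               # dimension 2 (1-based)
        data[rows, 4] = xy[:, 1]               # dimension 5
        data[rows, 5] = rng.normal(means_6[c], 1.5, size=n_per_class)  # dimension 6
    # Remaining dimensions: mean 10, std 0.5, 1.0, ..., 3.5 (order is an assumption)
    noise_dims = [0, 2, 3, 6, 7, 8, 9]
    for k, d in enumerate(noise_dims):
        data[:, d] = rng.normal(10.0, 0.5 * (k + 1), size=2 * n_per_class)
    return data, labels
```

Calling `data, labels = make_synthetic2()` returns the 1000 records together with their class labels, so the classified and unclassified measures can be evaluated on the same samples.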

Figure 3.17: Matrix for the synthetical data set with scatterplots above the main diagonal and 

parallel coordinate plots below. 

In Figure 3.17, we present the scatterplot matrix of the synthetical data set showing the 

scatterplots above the main diagonal and the parallel coordinate plots under the diagonal. 

We ranked all these plots with our measures for scatterplots and parallel coordinates. 

The results are presented in Figure 3.18. For every measure we show a point chart containing 

the sorted measure results. The target patterns are marked red in each plot. It 

can be seen that all measures ranked as best plot one of the target patterns. 

The scatterplot measures for classified data CDM and CSM found all the three target 
patterns as the best projections of the data set. This confirms our assumption that these 
measures search for the projections with the best class separability and the most dense 
classes. The RVM designed for data sets without classes was computed on the same data 
set with no class information. (Note that this means that RVM was measured on plots 
like in Figure 3.17 that have no different colors for the data points.) The best ranked 
scatterplot by RVM is (2, 5) having the most dense target pattern. RVM is aimed to find 
the scatterplots with the highest correlations. We can see that in subspace (2, 5) is the 
target pattern with the highest correlation. The second target pattern in (2, 6) shows two 
clusters with high correlation, and is also found by the RVM. 

[Figure 3.18 panels: RVM, CDM, CSM and 1D-HDM (left column); HSM, OM and SM (right column).] 

Figure 3.18: Results of the 7 measures for classified and unclassified data. The left column shows 
the result for the scatterplot measures and the right column for the parallel coordinates measures. 
The ranks are sorted decreasing and the target patterns are marked with red crosses. 

[Figure 3.19 axes: Comp.1 (horizontal) and Comp.2 (vertical).] 

Figure 3.19: Scatterplot of the first two components of the PCA over dimensions 2, 5 and 6. 

The 1D-HDM ranked best all the target patterns with a result of 100. This synthetical 

data set is unfortunately inapplicable to test the 2D-HDM because the patterns are 

defined along the data dimensions and therefore the 1D-HDM finds the best projection. 

Computing the PCA and searching for a better projection of the principal components is 

not necessary because the value of 100 cannot be improved. Applying the PCA to the 

best dimensions selected by the 1D-HDM (2, 5 and 6), we obtain the plot shown in Figure 

3.19. These best components of the PCA are also ranked with 100 by the 2D-HDM. 

Note that the resulting plot is not visually better than the orthogonal projection (2, 5) 

and no additional information can be obtained through the PCA. 

The parallel coordinates measures are designed to target different patterns. HSM ranks 

best parallel coordinates plots for unclassified data with similar positions and directions, 

i.e. clusters. For classified data, SM looks for these clusters taking the classes into account, 

and OM is designed to find parallel coordinates plots having classes with the least overlap. 

In the point charts of the right column of Figure 3.18, we see that all the measures 

for parallel coordinates ranked best one of our target patterns. HSM analyzed the data 

with no class information and ranked as best plot (5, 6) where two classes are visible. OM 

ranked also (5, 6) as the best because this plot has the smallest overlap between the two 

classes. SM ranked two target patterns in top 3: (5, 6) as the best, and (2, 6) as third 

best, presenting lines in the two classes with almost the same positions and directions. 

This evaluation is only a starting point for an evaluation of every possible parameter 

combination. In the future, a complete statistical analysis of the correlation between the 

measures and the correlation to the ground truth will be necessary. In the following, we 

briefly outline the basic steps for the future evaluation process:


1. Define ground truth. The ground truth should be generated in a synthetic data 

set having two independent variables, such as the density and separability of classes. 

2. Vary the number of classes. The synthetic data sets have to have different 

numbers of classes. 

3. Vary the number of dimensions. The synthetic data sets have to have different 

numbers of dimensions. They should simulate different types of high-dimensional 

data: small data sets – 2 to 9 dimensions, medium data sets – 10 to 49 dimensions, 

and large data sets – 50 to 100 dimensions. 

4. Statistical analysis. Make a statistical analysis of the correlation between the 

measures and a correlation to the ground truth. 

3.1.8 Conclusion and Future Work 

In this section, we presented several methods to aid and potentially speed up the visual 
exploration process for different visualization techniques. In particular, we automated the 
ranking of scatterplot and parallel coordinates visualizations for classified and unclassified 
data for the purpose of correlation and cluster separation. In the next section, a ground 
truth is generated by letting users choose the most relevant visualizations from a manageable 
test set. To validate our methods, we compare them to the automatically generated 
ranking. Some limitations are recognized as it is not always possible to find good separating 
views due to a growing number of classes and due to some multivariate relations. 
This is a general problem and not related to our techniques. 

The limitations of the approach presented above are of course determined by the 
task, data complexity, and the measures applied to find the requested patterns. Tasks 
might be of different types, such as finding outliers, significant patterns, different types 
of correlations between the dimensions etc. The complexity of the data can be described 
by the number of dimensions, the number of contained classes, and the clarity of patterns 
(noise, over-plotting, and distribution of the data). This complexity strongly influences 
the ability of measures to detect the required patterns. There are a number of measures 
in the domain of high-dimensional data visualization assessing different types of tasks and 
different applicability levels for different data sets. Creating a data-task-measure 
taxonomy for our domain is out of the scope of this thesis, but we strongly recommend 
this for future research. In Section 4.2, we will also present the results of a data-measure 
taxonomy with the focus on one task, namely the class separation in visualization. 

Our current approach is therefore to describe systematically the functionality of the 

presented measures as a function of their ability to detect hidden patterns in the data for 

a particular task. Our results have to be handled accordingly. 

The comparison to other existing measures should be considered in future work. Furthermore, 

issues such as over-plotting need to be part of the study since they are currently

disregarded. Scalability concerns will need to be addressed in future research under the 

constraint of data complexity and heuristics to reduce the search space for target patterns.


3.2 Quality Measures and Human Perception – An Empirical Study 

Quality measures have been devised to automatically extract interesting visual representations 

out of a large number of available candidates in the exploration of high-dimensional 

databases. The measures make it possible, for instance, to search within a large set of scatterplots

(e.g., in a scatterplot matrix) and select the views that contain the best separation among 

clusters. The rationale behind these techniques is that automatic selection of “best” views 

is not only useful but also necessary when the number of potential projections exceeds the 

limit of human interpretation. While useful as a concept in general, such metrics have so far

received limited validation in terms of human perception. In this chapter, we present a

perceptual study investigating the relationship between human interpretation of clusters 

in 2D scatterplots and the measures automatically extracted out of them. Specifically 

we compare a series of selected metrics and analyze how they predict human detection 

of clusters. A thorough discussion of results follows with reflections on their impact and 

directions for future research. 

Our empirical evaluation is based on a user study where users had to select projections 

of attribute-combinations well suited for classifying the data under inspection. The study 

then compares the scores of the selected scatterplots with the score obtained by the selected 

quality measures to analyze their correlation. The outcome of the study primarily serves

to validate the assumption that the selection of views best ranked by quality measures is a

viable way to simulate the selection of users. Furthermore, the study makes it possible to compare

the performance of the measures employed and kick-start a quality measures benchmark 

process, where metrics are compared against a baseline represented by the results obtained. 

In summary, the main contributions of this section are: 

• A validation of the hypothesis that quality measures can simulate the selection of 

best views by human beings; 

• A comparison among a set of promising and established measures; 

• The provision of a first benchmark framework, through which it is possible to compare 

new quality metrics. 

The rest of the chapter is organized as follows. Section 3.2.1 describes the measures 

employed in the study in detail. Section 3.2.2 describes the whole experiment design and

Section 3.2.3 presents the results. Section 3.2.4 discusses the results obtained in the study 

offering a vision on how they can be interpreted and exploited in the future. Section 3.2.5

describes how to set up a framework for user based evaluation of quality

metrics as suggested in this section. Finally, Section 3.2.6 provides the conclusions. 

3.2.1 Measures 

For this study we have selected quality metrics from [129] and from Section 3.1.3 ([133]) 

that were developed specifically for scatterplots with classified data. In both cases the authors

propose automatic analysis methods to extract potentially relevant visual structures 

from a set of candidate visualizations. 

Our study is based on the Class Density Measure (CDM) and the Histogram Density 

Measure (HDM) presented in Section 3.1.3. These two measures were also described

in [133].


In [129] Sips et al. also present similar work. They provide measures for ranking 

scatterplots with classified and unclassified data. They propose two additional quantitative 

measures on class consistency: one based on the distance to the cluster centroids, and 

another based on the entropies of the spatial distributions of classes. The paper also 

describes an initial small user study where user selections are compared to the outcomes of

the proposed methods. From this work we adopt the Class Consistency Measure (CCM). 

The authors present a measure called Class Density Measure that, although having the 

same name as our measure presented in Section 3.1.3, differs from our Class Density

Measure. It is in fact similar to the HDM measure and is therefore not included in the 

analysis. 

For a better overview the metrics are summarized in Table 3.3. 

Table 3.3: Overview of the analyzed measures with the reference for additional details. 

Measure                                    Reference

Distance Consistency Measure (DCM)         [129]

1D Histogram Density Measure (1D-HDM)      Section 3.1.3 & [133]

2D Histogram Density Measure (2D-HDM)      Section 3.1.3 & [133]

Class Density Measure (CDM)                Section 3.1.3 & [133]

The following is based on the assumption that each cluster in the data is uniquely 

labeled (either manually or through some form of n-dimensional clustering algorithm) and 

that for each point it is possible to know to which cluster it pertains. Finally, in the 

visualizations shown here, and those used in the experiment, each cluster is colored with 

a unique hue.

We will not provide extensive formal specifications and details on the metrics. For 

additional details and further discussions on their limits and capabilities please refer to 

the original papers [129] and [133], and the previous Section 3.1.3. 

Distance Consistency Measure 

The Distance Consistency Measure (DCM) presented by Sips et al. in [129] is based 

on the distance of data points to their cluster centroid. The measure assumes the calculation 

of a clustering model in the n-dimensional space and computes a specific value for 

a given 2D projection by projecting points and centroids on the selected 2D space. 

More precisely, the algorithm is based on the calculation of how many points violate 

the distance to centroid measure. For any given point the distance to its centroid in the 

n-dimensional space must always be lower than the distance to any other cluster centroid. 

However, when data is projected on a specific 2D space, this property can be violated. For 

a given projection, the measure is therefore calculated as the proportion of data points 

that violate the centroid distance measure. 

The Distance Consistency Measure (DCM) based on the centroid distance is consequently 

calculated as follows: 

\[
\mathrm{DCM} = \frac{\left|\{\, x' \in v(X) : CD(x', centr'(c_{clabel(x)})) \neq \mathit{true} \,\}\right|}{k} \quad [129] \qquad (3.20)
\]

where x' is the 2D projection of the data point x, centr'(c_{clabel(x)}) is the projection

of the centroid of the class of x (clabel(x)), and k is the number of data points.

CD(x', centr'(c_{clabel(x)})) is the centroid distance function, which is true when the distance

of a point to its own class centroid is minimal in comparison to the distance to all other

centroids. In other words, the percentage of points that do not satisfy this property is

calculated.
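
The counting step of Equation 3.20 can be illustrated with a minimal Python sketch. It is not the implementation from [129]; the function names are hypothetical, and it assumes a linear (e.g., axis-parallel) projection, so that the projected class centroid equals the centroid of the projected points.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equally sized tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def distance_consistency_violations(points_2d, labels):
    """Fraction of projected points that lie closer to another class centroid
    than to their own (the quantity counted in Equation 3.20)."""
    classes = sorted(set(labels))
    # Assumption: for a linear projection, the projected n-D centroid equals
    # the centroid of the projected points, so we compute it directly in 2D.
    centroids = {
        c: centroid([p for p, l in zip(points_2d, labels) if l == c])
        for c in classes
    }
    violations = 0
    for p, l in zip(points_2d, labels):
        own = math.dist(p, centroids[l])
        if any(math.dist(p, centroids[c]) < own for c in classes if c != l):
            violations += 1
    return violations / len(points_2d)

# Toy example: one point of class "a" sits inside the region of class "b".
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1), (9, 0)]
toy_labels = ["a", "a", "b", "b", "a"]
toy_score = distance_consistency_violations(toy_points, toy_labels)
```

A lower value means fewer violations, i.e., a projection in which the classes are more consistently grouped around their own centroids.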

Histogram Density Measure (1D and 2D) 

The Histogram Density Measure (HDM) approach presented in Section 3.1.3 describes

two quality measures for scatterplots with class information. 

For computing the 1D Histogram Density Measure (1D-HDM), data is projected 

onto an axis and a histogram is calculated to describe the distribution of the data points

over it. Since there are points pertaining to different classes (i.e., clusters), the measure is

based on the analysis of the amount of overlap among points of different classes in the same

histogram bin. The measure is intended to isolate plots that show good class separations. 

Consequently, HDM looks for corresponding histograms that show significant separation, 

and this property holds when the histogram bins contain only points of one class. 

In order to measure this property, the approach uses entropy and axes rotation. Several 

instances of the same 2D projection are computed, each with a different rotation factor.

For each one an average entropy value is computed and the best rank among the rotations

is selected as the measure’s value. The computation of the entropy values is explained in 

Section 3.1.3 in more detail. 

The 2D Histogram Density Measure (2D-HDM) is an extended version of the 1D- 

HDM, for which a 2-dimensional histogram on the scatterplot is computed, that is each 

bin represents a small square over the 2D projection and the bin count is the number of 

data points falling within the square. The quality is measured similarly to the 1D-HDM 

by summing up a weighted sum of the entropy of each bin. The measure is normalized 

between 0 and 100, having 100 for the best data points visualization when each bin contains 

points of only one class. 

In addition to the 1D-HDM, the bin neighborhood is also taken into account in 2D- 

HDM. For each bin the information of points p_c in the bin and the direct neighbors labeled

as u_c are summed up. The full equation explaining the calculation in detail can be found

in Section 3.1.3 and in the original paper [133]. 
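
To make the idea of an entropy-based histogram measure more concrete, the following Python sketch computes a simplified bin-purity score. It is only an illustration under stated assumptions and not the 2D-HDM formula of Section 3.1.3: the neighborhood term is omitted, each bin's entropy is weighted by its share of points, and the normalization to the range 0 to 100 is our own choice. All names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def bin_entropy(class_counts):
    """Shannon entropy (base 2) of the class distribution inside one bin."""
    total = sum(class_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts.values() if c > 0)

def histogram_purity_score(points_2d, labels, bins=10):
    """Simplified 2D histogram score: 100 if every occupied bin holds points
    of a single class, lower the more the classes mix inside bins."""
    xs = [p[0] for p in points_2d]
    ys = [p[1] for p in points_2d]

    def index(value, lo, hi):
        # Map a coordinate to a bin index; the maximum falls into the last bin.
        if hi == lo:
            return 0
        return min(int((value - lo) / (hi - lo) * bins), bins - 1)

    grid = defaultdict(Counter)
    for (x, y), label in zip(points_2d, labels):
        cell = (index(x, min(xs), max(xs)), index(y, min(ys), max(ys)))
        grid[cell][label] += 1

    n = len(points_2d)
    n_classes = len(set(labels))
    max_entropy = math.log2(n_classes) if n_classes > 1 else 1.0
    # Weight each bin's entropy by the fraction of points it contains.
    weighted = sum(sum(c.values()) / n * bin_entropy(c) for c in grid.values())
    return 100.0 * (1.0 - weighted / max_entropy)
```

With two well separated classes every occupied bin is pure and the score is 100; if both classes occupy exactly the same bins in equal proportion, the score drops to 0.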

The HDM measure extended to 2D can also find projections where classes form two

concentric circles of different diameters. In this case, a 1D projection will always have a

big overlap of the classes, even if these circles do not overlap in 2D or nD.

Class Density Measure 

The Class Density Measure (CDM) was also presented in detail in Section 3.1.3. This 

measure evaluates the scatterplots according to their separation properties of classes. The 

goal is to identify those plots that show minimal overlap between the classes. 

In order to compute the overlap between the classes, the method uses a continuous 

representation where the points belonging to the same cluster form a separate image. For 

each class we have a distinct image for which a continuous and smooth density function 

based on local neighborhoods is calculated. For each pixel p the distance to its k-th nearest 

neighbors N p of the same class is computed and the local density is calculated over the


sphere with radius equal to the maximum distance. 

Having these continuous density functions available for each class, the mutual overlap 

can be estimated by computing the absolute difference between each pair of density functions and

summing up the results. Section 3.1.3 gives more details about the computation formulas. The

value of the metric is high if the densities at each pixel differ as much as possible, i.e., if

one class has a higher density value compared to all others. Therefore, the visualization

with the least overlap of the classes will be given the highest value. A property of this

measure is that it not only estimates separate clusters well, but also estimates clusters

where density difference is noticeable. This is a great advantage since it can ease the

interpretation of the data in the visualization. 
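
The principle behind the CDM can be sketched in a few lines of Python. This is a rough, unnormalized illustration and not the formula from Section 3.1.3: we assume a k-nearest-neighbor density estimate of the form k / (pi * r^2), evaluated at pixel centers, and sum the absolute pairwise differences over all pixels. Function names and the small radius floor are our own assumptions.

```python
import math

def knn_density(pixel, points, k):
    """k-nearest-neighbor density estimate at a pixel for one class."""
    dists = sorted(math.dist(pixel, p) for p in points)
    # Radius of the circle reaching the k-th nearest neighbor of this class;
    # a small floor avoids division by zero (assumption for this sketch).
    radius = max(dists[min(k, len(dists)) - 1], 1e-9)
    return k / (math.pi * radius ** 2)

def class_density_score(points_by_class, width, height, k=2):
    """Sum over all pixels of the absolute pairwise density differences."""
    classes = list(points_by_class)
    score = 0.0
    for px in range(width):
        for py in range(height):
            pixel = (px + 0.5, py + 0.5)  # evaluate at the pixel center
            dens = [knn_density(pixel, points_by_class[c], k) for c in classes]
            for i in range(len(dens)):
                for j in range(i + 1, len(dens)):
                    score += abs(dens[i] - dens[j])
    return score
```

If two classes have identical point positions, their density images coincide and the score is zero; the more the density images differ, the higher the score.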

3.2.2 Empirical Evaluation 

The following section describes the empirical evaluation of the described measures for projection 

quality. The aim of this evaluation is to assess the degree to which these measures 

reflect users’ perception of a high quality projection. Our method consists therefore of a 

user study for creating a baseline and a series of measures that all judge the quality of a 

set of scatterplots. The results show the correlation computation between all the measures 

with the user graded quality. 

Hypotheses 

The hypotheses for the analyses were defined by the features of the four different automatic

measures. 

H1. We expect lowest correlation of the 1D-HDM measure with users’ selection since this 

measure takes only one dimensional projection for computing the separation quality 

of the data into account. 

H2. Higher correlation results are expected by the 2D-HDM measure because this extends 

its 1D version by creating a 2D histogram and considers direct neighborhoods of each 

data point for the quality computation. 

H3. The perceived quality of a projection may even be influenced by the density of

clusters having a minimal overlap, as suggested by the CDM. Here we expect a 

strong correlation with the measures’ rank. 

H4. Finally, we expect high correlation with users’ selection, when the consistency of 

clusters is computed, which is expressed by the quality of separation of the clusters. 

This is assessed by the DCM as described previously.

In general, we expect a significant positive correlation of all these measures with users’

selection. However, these measures are also expected to vary in their approximation of

users’ perception, which is expressed by the coefficient of determination (R²) of the

regression.


Participants 

Participants were 18 undergraduate students from the faculty of natural sciences. All had 

extensive experience in working with computers and scatterplots. Students participated 

in the experiment voluntarily and received no reward for participating in the experiment.

Data and Plot Selection 

To conduct the empirical evaluation, we took the UCI wine data set 17 containing the results 

of a chemical analysis of three wine types grown in a specific area of Italy. These types are 

represented in the 178 samples with the results of 13 chemical analyses recorded for each 

sample. The 13 attributes of the data set were pairwise combined into 78 scatterplots. 

The quality of these scatterplots was then computed by the four different measures. The

data did not contain any special cases of cluster constellation, nor did it have outliers or 

hidden data points. 
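
The number of scatterplots follows directly from the number of unordered attribute pairs, 13 * 12 / 2 = 78. A short Python check, using placeholder attribute names rather than the actual column names of the data set:

```python
from itertools import combinations

# Placeholder names; the real data set has 13 chemical attributes.
attributes = [f"attribute_{i + 1}" for i in range(13)]

# Every unordered pair of attributes defines one scatterplot.
scatterplot_axes = list(combinations(attributes, 2))

print(len(scatterplot_axes))  # 13 * 12 / 2 = 78
```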

The number of scatterplot representations to be used in the user study was 18, in 

order to keep the performance time reasonably small, to allow a one-page representation 

of all the scatterplots at once in a reasonable size, so that all data points can be seen. 

The selection of the 18 scatterplots was conducted along the distribution of the measures’ 

quality assignment, described as follows: 

1. The quality values of the measures were normalized between 0 and 1, and assigned to

one quantile. 

2. The scatterplots were sampled in such a way that the distribution between the 

number of projections in higher and lower quantiles was approximately the same

for all measures. 

3. As a result, the distribution of quality values in each quantile was 4±1. 

These selected scatterplots were ordered in six columns and three rows and then printed 

using a high quality color printer. The order of the scatterplots was permuted by the Latin-square

method, resulting in 18 different settings, one for each participant. An example

of the set of scatterplots used in the experiment is shown in Figure 3.20. Two original 

experiment forms are attached in Appendix A.2 – Figure A.1 and Figure A.2. 
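
One simple way to obtain 18 orderings with the Latin-square property is a cyclic shift, sketched below in Python. The thesis does not state which construction was used, so the cyclic scheme and the plot identifiers are assumptions for illustration only.

```python
def cyclic_latin_square(items):
    """Row i is the item list rotated by i positions, so every item appears
    exactly once in every row and exactly once in every column."""
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]

# Hypothetical identifiers for the 18 selected scatterplots.
plots = [f"plot_{i + 1}" for i in range(18)]

# One ordering per participant; each ordering fills a 6 x 3 sheet.
orders = cyclic_latin_square(plots)
```

Across the 18 participants, each scatterplot then occupies each of the 18 sheet positions exactly once, which balances position effects.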

Task 

Participants were confronted with a scenario around the wine data set. They were acting 

in this scenario as a wine-consultant for three different types of wines. They were told

that their challenge is to analyze a large amount of attributes describing the wines, such as 

color saturation, alcohol content, etc. Participants were requested to select projections of 

attribute-combinations that are well suited for classifying the three different types of wines.

This task had to be carried out using a selected set of scatterplot views showing attributes 

in a pair-wise manner. At first, participants were asked to select the five best

projections for separating wine types and then order them using numbers between 1 and 5 

(1 indicating the absolute best representation, and 5 the worst out of the five best quality 

scatterplots). 

17 Source at UCI: www.archive.ics.uci.edu/ml/datasets/Wine


Figure 3.20: Projections of scatterplots used in the experiment. Participants had to select the best 

five projections and order them by their quality. The order of the scatterplots was permuted for 

each participant separately using the Latin-Square method. 

Procedure 

The experiment consisted of two parts. In the first part, participants had to read a short 

description of the scenario, the task and fill out a short standardized form on general 

questions (such as age, study stage, experience with computers and scatterplots) 18 . In 

the second main part of the experiment, participants had to perform the task by selecting 

and ordering the five best representations that classified three wine types 19 . Clearly, the 

best suited scatterplot is the one that allows a clear distinction of the three wine types 

by the two attributes. Participants’ effectiveness mainly depended on their ability to read

and interpret scatterplots. The group of participants was quite homogeneous with regard 

to age and previous education. Expectedly, their performance did not show significant 

deviations or anomalies. This was assured by computing that none of the scores is above 

or below the triple standard deviation. In order not to be biased towards any of the 

measures, participants were not directed on how to define a high quality projection, nor 

how to look for dense or consistent clusters. 

3.2.3 Results 

A linear regression analysis was carried out using the Pearson coefficient for assessing

the correlation between users’ classification and the measures’ quality assignment of the 

selected projections. In order to make the measures comparable, we normalized the assigned 

quality measures individually for the projections between 0 and 1. From the users’

answers we computed the probability of selecting a projection by counting the number of 

times each projection was selected. These probabilities were weighted with the averaged 

ranks assigned by the participants. This resulted in a sequential order of the projections 

reflecting users’ quality preferences. The dependent variable of the statistical evaluation 

18 Appendix A.2 contains this general question form (in German) in Section A.2.1. 

19 Appendix A.2 contains two examples of the experiment form (Figure A.1 and Figure A.2).


was the user rankings, and each of the four measures was one independent variable in separate 

computations. The results show significant positive correlation for all four measures 

(p


2D-HDM and DCM assigned the best quality to the projection exactly as did the users. 

CDM assigned for this projection 99% quality (rank 2), and 1D-HDM only 68% quality 

(rank 4). The projection of users’ highest quality is shown in Figure 3.22(a). 

The highest quality projection selected by CDM and 1D-HDM is shown in Figure 3.22(b). 

This projection shows a clear and very dense cluster for one of the wine types, however, it 

also shows a high overlap for the other two types. Users assigned rank 4 for this projection. 

In the users’ eyes the worst quality projection was the one showing high density of all three

wine types but also a high overlap, as shown in Figure 3.22(c). This was also confirmed by 

three measures, except by the CDM measure that still assigned a quality of 26.3% (rank 

11) to this projection. 
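
The analysis steps described at the beginning of this section, normalizing each measure to the range 0 to 1 and correlating it with the user-derived ranking, can be sketched in plain Python. The helper names are hypothetical, and the sketch makes no claim about how the users' scores were weighted; it only shows the normalization and the Pearson coefficient. For a simple linear regression with one independent variable, R² equals the squared Pearson coefficient.

```python
import math

def min_max_normalize(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def r_squared(xs, ys):
    """Coefficient of determination of a simple linear regression of ys on xs."""
    return pearson_r(xs, ys) ** 2
```

In this setting, `xs` would hold the normalized quality values of one measure for the 18 projections and `ys` the corresponding user-derived scores, with one such computation per measure.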


(a) Users’ highest quality ranked projection was confirmed by DCM and 2D-HDM quality measures.

(b) Highest quality ranked projection by CDM and 1D-HDM measures.

(c) Users’ lowest quality ranked projection was confirmed by DCM, 2D-HDM and also by 1D-HDM quality measures.

Figure 3.22: Correlation of measures with users’ classification for highest and one lowest quality 

projection. 

Interesting is also the phenomenon that none of the users selected 8 of the 18 projections 

21 . CDM, however, still assigned 65% quality to one of these projections as shown in 

Figure 3.23(a). The highest quality assignment to one of these 8 projections was 58% by 

1D-HDM, 50% by DCM, and only 40% by 2D-HDM. Surprisingly, the projection shown 

in Figure 3.23(b) was selected by a user and ranked between the best five, but all the 

measures ranked it second to last, or even last by CDM. 


(a) Not selected by any 

user, but ranked by CDM 

with 65. 

(b) Selected by a user, 

ranked by all the measures 

second to last, and 

by CDM last. 

Figure 3.23: Surprising study results. 

21 Figure A.3 in Appendix A.2.3 shows the 8 projections that were not selected by any user.


In summary, 2D-HDM, closely followed by DCM, reflected users’ quality assignment best by matching the highest and lowest quality rankings accurately and by having the highest R² value of the correlation. These results should not, however, be taken to indicate that density (CDM) is unimportant for quality assignments. Rather, they should motivate combining and improving these measures so that they can sufficiently support users in their task.

3.2.4 Discussion 

In the following section we examine the results of the experiment in more detail, discussing some of their potential implications and ideas for further research. As noted in the results, the measures diverge depending on whether they take into account the density of the clusters or the amount of overlap among them. 2D-HDM together with DCM reflected users’ preference for high quality projections better than the others. Intuitively, both density and overlap should play a role in the perception of clusters; nonetheless, the results of our experiment seem to suggest that separation is more important. Future research will need to address this issue to establish whether a combination of measures based on both density and separation can outperform the others.

Another open issue, not investigated in this study, is the influence that different cluster shapes might have on user perception and, at the same time, on the proposed measures. Current results do not allow us to differentiate between cluster shapes, even though the images with highly ranked clusters contain circular shapes.

In relation to this last observation, it is worth noting that the major factor involved in the separation of clusters is the proximity of the points. This is of course not surprising, as the Gestalt Laws of Grouping suggest that proximity is the strongest visual feature used by the visual system to extract patterns out of images. Nonetheless, we believe it is worth running new studies investigating the relationship between the other laws of grouping (e.g., closure, similarity, continuation), users’ perception, and additional quality metrics. Along these lines, Section 4.2 presents the results of a qualitative analysis of cluster separation factors. There, different plots showing a variety of data sets were analyzed manually to identify what kinds of patterns are formed by clusters and how these are identified by current metrics.

Our experimental task is focused on the perception of clusters. However, it is important to acknowledge that the perception of clusters in n-dimensional data spaces is not the only useful task. For instance, the detection of outliers is also relevant: here it is necessary not only to find suitable metrics but also to run studies similar to ours in order to understand the relationship between user perception and the metric. The same idea can and should be repeated for several user tasks, visual patterns, and metrics. We consider our study only a starting point in this direction; nonetheless, it introduces a well-reasoned experimental design procedure that can be repeated to explore all we have outlined above. For this reason, in the following section we briefly summarize the common elements of our study design so that it can be repeated in future experiments.

Finally, we point out that the current study focuses exclusively on the correlation and 

comparison of what metrics and users detect, with an underlying assumption that users’ 

perception represents a sort of optimum. This assumption requires additional investigation 

as computational methods might be able to detect interesting patterns that users cannot 

necessarily perceive visually.


3.2.5 Guidelines 

In the following, we briefly outline the basic steps to repeat in new user studies, following 

the same schema used in this study. Our motivation is the desire to facilitate the design 

of similar studies and to promote the production of related studies on the perception of 

visual patterns and their formalization in computable metrics. 

1. Select a visualization technique. The first necessary element is the selection of a specific visualization technique. In our examples we have used scatterplots, which are among the most widely used techniques in visualization. Future studies might include other high-dimensional visualization techniques like the ones presented in Section 2.2, e.g., treemaps, parallel coordinates, line charts, etc.

2. Select a visual feature. In this phase it is necessary to think in terms of what particular features can be detected in the visualization technique under inspection. Note that some concepts recur across several visualizations but need a redefinition for each specific case (e.g., clustering in scatterplots and in parallel coordinates).

3. Formalize the feature. This is a fundamental step in our design schema. Once a specific feature has been selected, it is necessary to formalize it in a way that it can be computed through an algorithm. In this phase it is advisable to produce more than one measure in order to capture several aspects of the same feature. This also makes it possible to compare the performance of the selected measures in the study and to acquire additional information on the visual processes involved in the perception of the feature.

4. Run a rank-based study. Once the feature has been formalized it is possible to 

run a study where the users have to rank the images in terms of the selected feature. 

When the images have been ranked it is possible to compare the ranks given by the 

metrics and the ones provided by the users (as suggested in our method and design 

of the study). 

5. Study and refine. The results of the algorithms can be compared to the results obtained by the users, who represent the reference against which all measures are evaluated. The goal of this phase is not only to determine which of the metrics performs best, but also to reason about the results in order to (1) hunt for interesting insights about how users perceive the selected feature and (2) design better metrics able to capture the desired feature with more accuracy.
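The comparison of user ranks and metric ranks in steps 4 and 5 can be illustrated with a small sketch. It uses Spearman's rank correlation, which is a swapped-in choice for illustration and not necessarily the statistic behind the R² values reported in this study. The function names and the example scores are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five projections (not data from the study).
user_scores = [5, 4, 3, 2, 1]
metric_scores = [0.9, 0.8, 0.4, 0.5, 0.1]
rho = spearman(user_scores, metric_scores)
```

A value of `rho` close to 1 would indicate that the metric orders the projections much like the users do; each candidate metric can be scored this way against the same user ranking.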


To conclude the research presented in Section 3.2, we would like to recall the contributions mentioned at the beginning. Through a user-centered evaluation design, we showed that some quality measures are more able, and some less able, to reflect users’ perception. However, there is still a question as to what extent users are able to preselect good quality projections of their multidimensional data in an efficient and unbiased manner. Our results indicate that further development is needed to find the ultimate automatic quality measure. Nevertheless, we have created the first quality benchmark framework with which different metrics can be compared. Another question regarding the future development of similar studies is whether the accumulation of several similar experiments on different visualization techniques and features can be joined to create a uniform model or a better understanding of how visualization works and how visual patterns can be formalized. While the answer to this question is not clear at the moment, it is evident that, at the very least, every single study has the potential to improve the understanding and the utilization of the selected technique.

In future work, the same techniques can be applied to other visualization methods, e.g., parallel coordinates, to evaluate the correlation between the specific quality metrics and user perception. Since the current work focuses exclusively on cluster detection, different visual patterns such as outliers could be investigated. As mentioned in Section 3.2.4, it is also important to analyze how well users perform in finding interesting patterns.

4 A Systematization of Quality Metrics in High-Dimensional Data Visualization

“Nothing has such power to broaden the mind as the ability to investigate systematically and truly all that comes under thy observation in life.”

Marcus Aurelius

Contents

4.1 Quality Metrics in High-Dimensional Data Visualization

4.1.1 Background

4.1.2 Methodology

4.1.3 Quality Metrics Pipeline

4.1.4 Systematic Analysis

4.1.5 Examples

4.1.6 Findings

4.1.7 Directions for Further Research

4.1.8 Limitations

4.2 Visual Cluster Separation Factors: Sketching a Taxonomy

4.2.1 Introduction

4.2.2 Method

4.2.3 Visual Cluster Separation Taxonomy

4.2.4 Discussion and Further Research

In a number of recent papers, different quality metrics have been proposed to automate the demanding search through large spaces of alternative visualizations (e.g., alternative projections or orderings), allowing the user to concentrate on the most promising visualizations suggested by the quality metrics. Over the last decade, this approach has witnessed a remarkable development; however, few reflections exist on how these methods are related to each other and how the approach can be developed further. For this purpose, in Section 4.1 we provide an overview of approaches that use quality metrics in high-dimensional data visualization and propose a systematization based on a thorough literature review. We carefully analyze the papers and derive a set of factors for discriminating the quality metrics, visualization techniques, and the process itself. The process is described through a reworked version of the well-known information visualization pipeline. We demonstrate the usefulness of our model by applying it to several existing approaches that use quality metrics, and we provide reflections on implications of our model for future research.

Another aspect worth investigating in the context of quality metrics is their ability to detect different types of structures in high-dimensional data. In Section 4.2 we present the results of an in-depth qualitative evaluation of two cluster separation measures. This evaluation concentrates on scatterplot visualizations (2D, 3D, and SPLOM) and the most popular task, clustering. The qualitative data study converged into a taxonomy of visual cluster separation factors for scatterplots, and we briefly report on the results in this section. Beyond that, the outcome of the study is used to describe possible next steps in the field that we deem important to advance the research in this area.

Parts of this chapter appeared in the following publications [27, 122]. 

4.1 Quality Metrics in High-Dimensional Data Visualization 

The extraction of relevant and meaningful information out of high-dimensional data is notoriously complex and cumbersome. The curse of dimensionality is a popular way of stigmatizing the whole set of troubles encountered in high-dimensional data analysis: finding relevant projections, selecting meaningful dimensions, and getting rid of noise are only a few of them. Multi-dimensional data visualization also carries its own set of challenges, above all the limited capability of any technique to scale to more than a handful of data dimensions.

Researchers have been trying to solve these problems through a number of automatic data analysis and visualization approaches that cover the whole spectrum of possibilities, from fully automatic to fully interactive. Visualization researchers discovered early on that searching for interesting patterns in this kind of data can be done through a mixed approach, in which the machine, guided by quality metrics, automatically searches through a large number of potentially interesting projections, and the user interactively steers the process and explores the output through visualization.

The pioneering work of Friedman and Tukey in 1974 introduced the idea with their projection pursuit method [54]. They recognized the limits of human beings in exploring the exponential set of projections and tackled the high-dimensionality issue by letting an algorithm discover interesting linear projections in 1D (histograms) and 2D (scatterplots) and letting the user evaluate the corresponding output.

During the last few years, the use of this paradigm has attracted growing interest, and an increasing number of techniques have been published in key data visualization conferences and journals. Quality metrics have been used for very disparate goals, such as searching for interesting projections, reducing clutter, and finding meaningful abstractions. However, the initial idea of quality metrics has been elaborated and expanded so much further and in so many different directions that it is hard to come up with a coherent and unified picture of them. A reader of one of these papers may well appreciate the value of a single technique without having a way to place it into a larger context. Also, researchers who might want to approach this area of investigation for the first time and develop new techniques may have a hard time appreciating the whole spectrum of possibilities and directions related to the use of quality metrics.

In this section, we take first steps towards filling this gap. We provide a systematization of the use of quality metrics in high-dimensional data analysis through a literature review. We analyzed numerous papers containing quality metrics and went through an iterative process that led to the definition of a number of factors and a quality metrics pipeline, which is inspired by the traditional information visualization pipeline [36].

The extracted factors and the pipeline have the following interrelated goals: 

1. putting the existing methods into a common framework; 

2. easing the generation of new research in the field; 

3. spotting relevant gaps to bridge with future research. 

In this section, we provide an extensive explanation of the methodology we followed, the results we obtained, and their practical use. In particular, we go through a number of selected examples to demonstrate how existing approaches can be described through the proposed models. We also identify a number of interesting gaps and give guidelines on how to carry out new research in this area. To the best of our knowledge, despite the numerous techniques that can be categorized under the umbrella of quality-metrics-driven visualization, this is the first attempt in this direction.

Definitions 

In order to make the goal and scope of our work clear, we provide some initial definitions. 

Information Visualization Pipeline: a reference model that describes how to transform data into visualizations through a series of processing steps, as defined in [36].

Quality Metric: a metric calculated at any stage of the information visualization 

pipeline that captures properties useful to extract meaningful information about the data. 

(Please note that we use the terms metric and measure as synonyms in this thesis.) 

High-Dimensional Data: any data set with a dimensionality that is too high to 

easily extract meaningful relations across the whole set of dimensions. In the context of 

this thesis, any dimensionality higher than 10 is considered high-dimensional. 

Our focus is on the analysis of methods that apply quality metrics at any stage of the 

information visualization pipeline as a way to facilitate the detection and presentation of 

interesting patterns in high-dimensional data. 

Examples 

We first discuss a few short examples of the approaches covered in our review to familiarize the reader with the concepts presented in this section and to give a feeling for their heterogeneity. They cover a broad selection of the factors, denoted with italics, which will be presented in detail in Section 4.1.4.

Example 1 

We start with a familiar example presented in Section 3.1.6 and published in [133], where high-dimensional data sets are analyzed by computing an interestingness score for every scatterplot generated with all the possible combinations of axis pairs from the original data. The score is calculated by running image processing algorithms on top of each scatterplot in order to detect images with clusters in the visualization. The system returns a list of scatterplots, such as those presented in Figure 4.1, sorted in order of relevance according to the chosen quality measure.

[Figure 4.1 panel labels: 100, 97, 84]

Figure 4.1: (Top row of Figure 3.8) Ranking projections according to the Class Density Measure, favoring projections with minimal overlap between predefined classes (i.e., the colors) [133].
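The scoring loop described in Example 1 can be sketched as follows. This is a minimal illustration, not the image-based method of [133]: the score used here, `centroid_gap`, is a hypothetical stand-in that works in data space and measures the smallest distance between class centroids in a 2D projection. All names and the toy data are assumptions made for the example.

```python
from itertools import combinations
from math import dist


def centroid_gap(rows, labels, i, j):
    """Toy stand-in score: smallest distance between class centroids on axes (i, j)."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append((row[i], row[j]))
    centroids = [
        (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
        for pts in groups.values()
    ]
    return min(dist(a, b) for a, b in combinations(centroids, 2))


def rank_axis_pairs(rows, labels, score=centroid_gap):
    """Score every unordered axis pair and return (score, pair) tuples, best first."""
    n_dims = len(rows[0])
    scored = [
        (score(rows, labels, i, j), (i, j))
        for i, j in combinations(range(n_dims), 2)
    ]
    return sorted(scored, reverse=True)


# Hypothetical 3-dimensional data set with two predefined classes.
rows = [(0, 0, 0), (0, 1, 1), (5, 0, 3), (5, 1, 4)]
labels = ["a", "a", "b", "b"]
ranking = rank_axis_pairs(rows, labels)
```

For n dimensions the loop evaluates n(n-1)/2 scatterplots, so the ranked list is what keeps the user from having to inspect each one manually.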

Example 2 

Peng et al. in [112] provide algorithms to reorder the axes of multidimensional data visualizations 

(parallel coordinates, scatterplot matrices, glyphs, recursive patterns) in order 

to reduce clutter and make interesting patterns more clearly visible. For each visualization 

a specific quality metric calculated in the data space is used to find the best ordering. In 

Figure 4.2, we present an example on scatterplot matrix reordering. 

Figure 4.2: Clutter reduction achieved through axes reordering in a scatterplot matrix (initial 

visualization on the left, reordered on the right) [112]. 

Example 3 

Johansson et al. in [80] study the abstraction obtained by applying sampling or aggregation algorithms on top of parallel coordinates and provide quality metrics to judge when the abstraction disrupts relevant patterns in the data. In Figure 4.3 we show an example from their work, where on the left the data set containing 16384 items is displayed with parallel coordinates. On the right side they display an image targeting a visual quality of 0.95 (on a scale from 0 to 1) by displaying only 987 items. The image quality is calculated by a screen metric using distance transforms.

(left) The original data set containing 16384 items. (right) Targeting a visual quality of 0.95 retains 987 items.

Figure 4.3: Data abstraction algorithm based on sampling, aiming at reducing data size while preserving relevant patterns. Original visualization on the left with 16384 data items. Sampled visualization on the right with 987 items and a visual quality of 0.95 [80].

All the approaches have in common that they use quality metrics in the context of high-dimensional data visualization; nonetheless, they can differ in a variety of aspects. For instance, in Example 1 the purpose is to find interesting projections, in Example 2 the purpose is to reduce clutter, whereas the purpose in Example 3 is to find the right abstraction level. The approaches can also differ in a number of other aspects, such as the visualization techniques employed, the space in which the quality metrics are calculated, or the level of interaction they provide.

Therefore the questions are: 

Q1. How can we put all the approaches into a common framework which is able to highlight commonalities and differences?

Q2. What are the main factors through which we can describe them? 

Q3. How can we learn from the approaches and build on top of them to systematically 

move the idea of quality-metrics driven visualization forward? 

These are the main questions that motivate our work, and in the following sections we 

will provide the results of our investigation. 

4.1.1 Background 

While other areas also deal with quality metrics (see Section 2.3.2), we decided to focus on the use of quality metrics in high-dimensional data exploration only. Our initial data gathering process included a broader class of papers, including those cited in Section 2.3.2. However, we soon realized that there is no all-encompassing model that is able to synthesize the relevant aspects and, at the same time, is useful in practice. For this reason, here we focus only on the use of quality metrics in high-dimensional data.


There exist a number of research papers which try to categorize existing work in the 

visualization area. We briefly mention some recent ones to put our work in a larger context. 

In Rethinking Visualization [138] Tory and Möller provide a taxonomy to describe scientific 

and information visualization under the same structure. Ellis and Dix organize a large 

number of existing clutter reduction techniques into a clutter reduction taxonomy [49]. 

Yi et al. review a large number of visualization systems to better understand the role of 

interaction in visualization [160]. Segel and Heer analyze a large body of storytelling

visualizations to identify common design patterns [123]. All these papers share with our 

work the need of putting some order into a complex aspect of data visualization by starting 

from a detailed analysis of what researchers and practitioners have proposed in the past. 

Since our proposed systematization uses a data visualization pipeline as the basis for 

the analysis of quality metrics, we deem it important to briefly discuss existing data processing

pipelines. The information visualization pipeline has been presented by Card et al. [36] 

and is widely accepted as the standard processing model for information visualization. The 

pipeline transforms data going through the following stages: raw data, table data, visual 

structures and views. At each stage an operator is applied, respectively: data transformation, 

visual mapping, and view transformation. The Data State Reference model [39] is 

largely based on the information visualization pipeline and classifies visualizations according 

to how they use the operators in the pipeline. In this regard it is similar to our work 

in that we also use elements of the pipeline to classify the papers we have analyzed. The 

KDD pipeline [51] was developed in the early nineties to describe the data processing

stages involved in knowledge discovery. The data goes through several stages (selection, 

pre-processing, transformation, data mining, interpretation/evaluation) leading to a final 

stage of knowledge generation. While we took inspiration from this model, as quality metrics 

involve automatic computation and visualization, we decided not to use it as a basis 

for our work because visualization does not explicitly appear in the intermediary steps of 

the process. Keim et al. [88] and Bertini et al. [23] present alternative pipelines that show 

how automated data analysis algorithms can be included in the data visualization process. 

These papers are also sources of inspiration for our work as they focus on the integration 

of automated algorithms and data visualization. 

4.1.2 Methodology 

We followed an iterative data gathering, coding, and modeling approach inspired by the methods used in grounded theory analysis [130]. We started from a small set of papers about quality metrics we knew from our own experience and used this initial list to derive a first set of descriptive factors. After that, we expanded the list by analyzing the references contained in the first set of papers and by searching in relevant visualization venues. In particular, we used Google Scholar 1 to search for references to and from the collected papers. We also tried to expand our list by keyword search, but this did not produce satisfactory results, mainly because many quality metrics papers do not mention the term “quality metrics” in their text.

At this stage, we decided to narrow down the scope of our study and focus on quality metrics for high-dimensional data analysis. We discarded the papers that (1) did not explicitly address high-dimensional data, and (2) did not propose quality metrics systems or algorithms. For instance, we discarded a number of interesting papers on the use of quality metrics for generic data visualizations [79], for graph drawing [45], or the discussions on generic aspects of quality metrics [26].

1 http://scholar.google.com/

The first two authors 2 went independently through the current list of papers, completed a table with the current version of the classification, and took notes on necessary modifications/additions to accommodate new aspects discovered during the analysis. After this first phase, the two lists and the notes were compared in order to reach a consensus on table factors and paper coding. The third author 3 played the devil’s advocate role at this stage to confirm that the factors were explanatory, understandable, and relevant. A third set of additional papers was gathered and coded at this point to test the classification further.

We proceeded then to the definition of a visualization pipeline able to capture the 

data visualization processes described in the papers. We started from the traditional 

information visualization pipeline [36] because it is widely known and helps capture key

elements of quality-metrics-driven visualizations (details in Section 4.1.3). 

We generated the quality metrics pipeline iteratively using the set of gathered papers 

and the descriptive table with quality metrics factors as reference. In particular, (1) we 

built a first draft of the new pipeline; (2) we went through the whole list of papers and 

checked whether the pipeline was able to describe every aspect involved in the process; (3) 

where discrepancies were found, we refined the pipeline accordingly. As a final step, we 

double-checked that every paper in the list could be described by a specific instance of the 

pipeline. As in the first phase, we let one of the authors who was not involved in the model generation phase 3 again play devil’s advocate and refine the model at intermediary steps. The work on the pipeline also led to small adjustments that produced the final version of the quality metrics table (Table 4.2).

It is important to note that, while we followed a systematic approach, there is no

guarantee that this is the only way to describe quality metrics and their use. Many of 

the elements introduced in the proposed models are the result of our own experience 

and are thus necessarily subjective. Nonetheless, the usefulness of the proposed model is 

demonstrated by its ability to describe the whole set of papers and to identify relevant 

gaps interesting for future research. 

4.1.3 Quality Metrics Pipeline 

We briefly recall the main elements of Card et al.’s pipeline [36] and then move on to the description of our extensions.

The original purpose of the infovis pipeline was to model the main steps required to transform data into interactive visualizations. The quality metrics pipeline in Figure 4.4 preserves its main elements: processing steps (horizontal arrows), stages (boxes), and user feedback (with a few naming differences that we explain below). Data transformation transforms data into the desired format. Visual mapping maps data structures into visual structures (visualization axes, marks, graphical properties). View transformation creates rendered views out of the visual structures. The whole set of transformations is influenced by the user, who can decide at any time to transform the data (e.g., filter), use different visual structures, and navigate the visualization through different viewpoints.

2 Enrico Bertini and myself. 

3 Daniel Keim.


[Figure 4.4 layout: a Quality-Metrics-Driven Automation layer above the pipeline Source Data → (Data Transformation) → Transformed Data → (Visual Mapping) → Visual Structures → (Rendering / View Transformation) → Views]

Figure 4.4: Quality metrics pipeline. The pipeline provides an additional layer named quality-metrics-driven automation on top of the traditional information visualization pipeline [36]. The layer obtains information from the stages of the pipeline (the boxes) and influences the processes of the pipeline through the metrics it calculates. The user is always in control.

The infovis pipeline captures extremely well the key elements of interactive visualization across a variety of domains and visual techniques. However, when we focus on the visualization of high-dimensional data patterns, a practical problem arises. While the whole set of processes is still valid, the number of possible combinations at each step is so high that it is impractical to find the most effective ones interactively. An example in the spirit of Mackinlay’s seminal analysis [99] helps to clarify the problem: if the original data has dimensionality n = 10 (still quite a low number) and the number of available visual parameters is k = 4 (e.g., a scatterplot with the following visual primitives: x-axis, y-axis, size, and color; see Figure 4.5), the number of alternative mappings at the visual mapping stage is already more than 5000 (k-permutations, i.e., the number of sequences without repetition: $\frac{n!}{(n-k)!}$).

Figure 4.5: Mapping a 10 dimensional data set to a scatterplot with four visual primitives (x-axis, 

y-axis, size, and color) has over 5000 possible alternative mappings. 
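The count quoted for Figure 4.5 can be checked with a few lines of arithmetic. The sketch below evaluates the k-permutation formula n!/(n-k)! for n = 10 and k = 4; the function name `count_mappings` is introduced here only for illustration.

```python
from math import factorial, perm


def count_mappings(n, k):
    """Number of ways to assign k distinct visual primitives to n data dimensions."""
    return factorial(n) // factorial(n - k)


n, k = 10, 4
mappings = count_mappings(n, k)  # 10 * 9 * 8 * 7
assert mappings == perm(n, k)    # math.perm computes the same quantity
print(mappings)                  # 5040
```

The result, 5040, is consistent with the "more than 5000" alternatives stated in the text, and it grows quickly: the same formula gives 116280 for n = 20.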

The main function of quality metrics algorithms is to aid the user in the selection of 

promising combinations. Typically, the algorithms search through large sets of possibilities 

and suggest one or more solutions to be evaluated by the user. To describe these steps we 

created an additional layer in Figure 4.4 that we call quality-metrics-driven automation, 

which depicts how quality metrics fit into the process. The metrics draw information from 

the stages of the pipeline (green upwards arrows) and influence the processing steps (blue 

downwards arrows) with their computation. The user remains in control of the whole 

process letting the machine perform the computationally hard tasks. We named the new 

pipeline the quality metrics pipeline. 

The generation of alternatives and their evaluation is at the core of the method. Regardless of the purpose, all the systems we have encountered follow a common general pattern:


1. Create alternatives (projections, mappings, etc.); 

2. Evaluate alternatives (rank views, orderings, etc); 

3. Produce a final representation (ranked list of views, small multiples, etc.). 

As we will show in Section 4.1.5, systems with disparate purposes can be described by 

this same model. 
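The three-step pattern above (create, evaluate, produce a final representation) can be expressed as a small generic routine. This is a sketch under stated assumptions: `quality_driven_search`, its parameters, and the toy scores are hypothetical names chosen for illustration and do not correspond to any of the reviewed systems.

```python
def quality_driven_search(create, evaluate, top_k=3):
    """Generic create / evaluate / present loop.

    create   -- callable returning an iterable of alternatives (views, orderings, ...)
    evaluate -- callable mapping one alternative to a numeric quality score
    top_k    -- how many of the best alternatives to present to the user
    """
    # Steps 1 and 2: create the alternatives and score each of them.
    scored = [(evaluate(alternative), alternative) for alternative in create()]
    # Step 3: produce the final representation, here a ranked list.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy usage: three hypothetical projections with made-up quality scores.
toy_scores = {"xy": 0.2, "xz": 0.9, "yz": 0.5}
best = quality_driven_search(lambda: toy_scores.keys(), toy_scores.get, top_k=2)
```

Keeping `create` and `evaluate` as separate arguments mirrors the observation that systems with different purposes differ mainly in what they generate and how they score it, while the surrounding loop stays the same.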

Processing 

In the following we provide details about specific features of the processing steps of the 

quality metrics pipeline. 

1. Data Transformation (source data → transformed data). In the original pipeline 

the main role of this step is to put the data into a tabular format, hence the original 

name tabular data of its output. Since here we focus on high-dimensional data, 

we assume the source data to be already in a tabular format and we rename it into 

transformed data. At this stage data transformation is responsible for the generation 

of alternative data subsets or derivations. Common operations include: feature 

selection, projection, aggregation, and sampling. 

2. Visual Mapping (transformed data → visual structures). Visual mapping is the 

core stage of the pipeline where data dimensions are mapped to visual features to 

form visual structures. Distinct mappings of data features to visual features provide 

alternatives that can again be evaluated in terms of quality metrics. The most 

common type of operation at this stage is the generation of orderings, i.e., assigning 

data dimensions to visualization axes in different orders. In general, alternatives can 

be generated by considering the full set of visual features (e.g., color, size, shape). 

3. Rendering/View Transformation (visual structures → views). Rendering transforms 

visual structures into views by specifying graphical properties that turn these 

structures into pixels. We added the word Rendering to the pipeline to emphasize 

the role of the image space; many quality metrics are thus calculated directly in the 

image space considering the pixels generated in the visualization process. At this 

stage alternative views of the same structures can be generated automatically. Surprisingly, 

as we discuss in Section 4.1.6, this stage is, in the context of our inquiry, 

rarely used. 

Quality Metrics Computation 

Quality metrics can draw information from any of the stages of the pipeline. As we describe 

later in Section 4.1.4 quality metrics can be calculated in the data space, image space or 

a combination of the two. Metrics calculated at the view stage draw information from the 

rendered image, whereas the others draw information from the data space (and elements of 

the visual structures in a few cases). Many different kinds of metrics are possible. Our 

analysis of quality metrics features in Section 4.1.4 provides numerous additional details.


Quality Metrics Influence 

As described above, quality metrics algorithms generate alternatives and organize them 

into a final representation. At the data processing stage they can for instance generate 

1D, 2D, or nD projections (e.g., [52, 59, 126]), data samples (e.g., [24, 80]), or alternative 

aggregates (e.g., [42]). At the visual mapping stage the layer generates alternative orderings 

or mappings between data and visual properties (e.g., [112, 120]). At the view stage 

the layer can generate modifications of the current view like changing the point of view, 

highlighting specific items, or distorting the visual space (e.g., [8]). 

User Influence 

The quality metrics layer is not meant to replace the user with the machine. 

While the users can always influence all the stages of the pipeline, their main responsibility 

becomes to steer the process, e.g., by setting quality metrics parameters, and to explore 

the resulting views. It is worth noting that the process is not necessarily a linear flow 

through the steps. As will be evident from the examples in Section 4.1.5, in many cases 

complex iteration takes place. 

4.1.4 Systematic Analysis 

Through our paper review we identified two main areas of investigation. First, we classify 

the papers according to quality metrics criteria that help explain their key features. 

Second, we provide a more detailed categorization of the visualization techniques we have 

come across. 

Quality Metrics 

We identified a number of factors that describe the methods encountered through the 

literature review. Each factor has a number of possible values and each paper can assume 

one or more of these values (see Table 4.2). 

In the following, we describe the main factors we extracted from our analysis. 

What is measured 

This factor describes what is measured by the quality metric. In our analysis we have 

grouped the metrics in the following categories: 

Clustering metrics measure the extent to which the visualization or the data contain 

groupings, that is, well-separated clusters that can be easily identified. Clustering is loosely 

defined because we have encountered many alternative approaches. It is worth keeping in 

mind that by clustering we mean here any measure in the data or image space which 

is able to capture groupings. 

Correlation relates to two or more data dimensions and captures the extent to which 

systematic changes to one dimension are accompanied by changes in other dimensions. 

Simple Pearson correlation between two variables is one of the most commonly used 

measures in this category, but global correlation among multiple data dimensions is also 

used [82].


Outlier metrics capture the extent to which the data segment under inspection contains 

elements that behave differently from the large majority of the data, i.e., outliers. 

Complex patterns metrics capture aspects that cannot be easily categorized as any of 

the classes described above. We detected a number of papers with such measures and 

grouped all of them in this class. An example is Graph-Theoretic Scagnostics [151], a 

technique where it is possible to characterize scatterplots with features like “stringy” or 

“skinny”. 

Image quality refers to metrics where the purpose is not necessarily to find specific 

patterns but more to identify the degree of organization of a visualization or, as some of 

the papers call it, the amount of clutter. 

Feature preservation metrics focus on the comparison between a reference state and 

the representation in the visualization, or between the features in the data and the visualization, 

with the intent to preserve the features of interest as much as possible. A 

subset of these papers focuses on classified data, searching for projections where the original 

classes are well separated [129, 133]. In the same category we can find papers that 

measure the information loss due to data abstraction techniques such as sampling and 

aggregation [24, 42, 80]. 

It is worth noting that in this categorization we classified the techniques according 

to their main target. This, however, does not prevent a metric of one type from also detecting 

patterns of another type. For instance, clustering and correlation, as well as complex 

patterns and image quality, may have such an overlap. 

Where it is measured (data/image space) 

In our review we have found a completely mixed set of approaches with respect to where 

the metrics are calculated: data space or image space. Metrics calculated in data space 

detect data features directly in the data without using information from the view that 

will be used to display the results. For instance, the Rank-by-Feature technique [126] 

ranks 1D and 2D projections according to a number of statistical properties calculated 

only in data space. Metrics calculated in image space bypass the analysis of the data and 

work directly on the rendered image. Often these methods employ sophisticated image 

processing techniques like our work presented in Section 3.1.2 and [133] where interesting 

scatterplots are ranked using a Hough transformation. A mixed-space approach, where 

both data and image space are used at the same time, is also possible. We found 

two distinct cases. Bertini and Santucci [24] present a measure to compare features in 

the data space to features in the image space, with the intent of preserving as many 

data features as possible in the final image. Peng et al. [112] measure clutter in relation to 

the ordering of visualization axes: these calculations need data features (outliers, correlations) 

and visualization features (e.g., axes adjacency) at the same time. Please note that 

the entries in Table 4.2, where both data and image space are present, do not necessarily 

imply the use of the aforementioned mixed approach. More often, they simply mean that 

alternative approaches co-exist in the context of the same paper. 
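
The distinction between the three options can be made concrete with a toy example. The sketch below is our own simplification and not the measure of [24] or [112]: it renders a scatterplot into a small binary pixel grid, counts the lit pixels (image space), counts the distinct records (data space), and relates the two (mixed). All function names and the 8 x 8 resolution are assumptions made for illustration.

```python
def rasterize(xs, ys, width=8, height=8):
    """Render a scatterplot into a binary width x height pixel grid (toy renderer)."""
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    image = [[0] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = min(int((x - x0) / ((x1 - x0) or 1) * width), width - 1)
        row = min(int((y - y0) / ((y1 - y0) or 1) * height), height - 1)
        image[row][col] = 1
    return image

def lit_pixels(image):
    """Image-space measure: number of pixels that received at least one mark."""
    return sum(sum(row) for row in image)

def distinct_points(xs, ys):
    """Data-space measure: number of distinct (x, y) records."""
    return len(set(zip(xs, ys)))

def overplotting(xs, ys, width=8, height=8):
    """Mixed measure: share of distinct records that do not get a pixel of their own."""
    image = rasterize(xs, ys, width, height)
    return 1 - lit_pixels(image) / distinct_points(xs, ys)

xs = [0.0, 0.1, 0.2, 5.0, 9.9, 10.0]
ys = [0.0, 0.1, 0.2, 5.0, 9.9, 10.0]
print(overplotting(xs, ys))  # 6 distinct records share 3 pixels -> 0.5
```

The data-space count is independent of the rendering, whereas the image-space count changes with the resolution of the view; the mixed measure needs both.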

Purpose 

Purpose describes the main reason for using quality metrics, that is, the goal to 

be achieved with the metric. We identified the following purposes. 

Projection aims at finding subsets of the original dimensions in which interesting patterns 

reside, e.g., analyzing all the possible 2D projections of a multidimensional data set 

by checking whether interesting groupings exist in a scatterplot.


Ordering aims at finding, where possible, an ordering of the visualization axes that 

eases the visual detection of interesting patterns. Parallel coordinates is a classical example 

where the order of the axes greatly influences the chances of detecting interesting patterns 

in the data. 
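
As an illustration of ordering as a search problem, the following sketch tries every axis order for a small data set and keeps the one that maximizes the summed absolute correlation between neighboring axes. Both the scoring function and the exhaustive search are our assumptions; the reviewed papers use their own metrics and, for more than a handful of dimensions, heuristics, because the number of orders grows factorially.

```python
from itertools import permutations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def best_axis_order(table):
    """Return the axis order whose adjacent axes are most strongly correlated."""
    def score(order):
        # Only neighboring axes are directly comparable in parallel coordinates.
        return sum(abs(pearson(table[a], table[b])) for a, b in zip(order, order[1:]))
    return max(permutations(table), key=score)

table = {
    "a": [1, 2, 3, 4, 5],
    "noise": [3, 1, 5, 2, 4],
    "b": [2, 4, 6, 8, 11],  # strongly correlated with "a"
}
print(best_axis_order(table))
```

In the returned order the axes `a` and `b` end up next to each other, so their relationship becomes visible as a bundle of nearly parallel lines.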

Abstraction aims at maintaining or controlling a certain degree of data representation 

quality when data reduction techniques are used to increase the scalability of a visualization. 

Sampling and aggregation are the two main types of abstraction techniques we 

encountered. For instance, in [42] the authors propose a data abstraction technique that 

makes it possible to measure the information loss due to abstraction and to find a trade-off between 

data loss and data reduction. 
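
The trade-off can be sketched as follows. The loss function below is a simple histogram comparison (half the L1 distance between normalized histograms) chosen by us for illustration; it is not the measure defined in [42], and the systematic sampling scheme and all names are likewise assumptions.

```python
def histogram(values, bins, lo, hi):
    """Normalized histogram of `values` over [lo, hi] with `bins` equal-width bins."""
    span = (hi - lo) or 1
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / span * bins), bins - 1)] += 1
    return [c / len(values) for c in counts]

def information_loss(original, abstracted, bins=4):
    """0.0 if both distributions match bin by bin, 1.0 if they do not overlap at all."""
    lo, hi = min(original), max(original)
    h_orig = histogram(original, bins, lo, hi)
    h_abs = histogram(abstracted, bins, lo, hi)
    return 0.5 * sum(abs(p - q) for p, q in zip(h_orig, h_abs))

def coarsest_sample(original, max_loss, bins=4):
    """Keep every step-th record, using the largest step whose loss stays acceptable."""
    best = original
    for step in range(1, len(original) + 1):
        sample = original[::step]
        if information_loss(original, sample, bins) <= max_loss:
            best = sample  # fewer records, loss still within the threshold
    return best

records = list(range(100))
sample = coarsest_sample(records, max_loss=0.05)
print(len(sample), information_loss(records, sample))
```

The threshold `max_loss` plays the role of the user-controlled parameter: raising it permits stronger data reduction at the price of a less faithful abstraction.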

Visual mapping aims at finding interesting mappings between the original data features 

and the visual features of the visualization technique. Features such as color, size or shape 

fall into this category. 

View optimization aims at modifying parameters of the view with the intent to produce 

better visualizations, in which, for example, data segments with a high degree of interest 

are highlighted. 

Interaction 

The last column of the table indicates which papers offer the possibility to interact with the 

quality-metrics-based automation. We extracted two main classes of interaction: threshold 

selection and metrics selection. With threshold selection we mean the possibility to set 

thresholds in the quality metrics computation mechanism (e.g., the data abstraction level 

in [42] or the density estimation smoothing parameter in [52]). With metrics selection we 

mean systems in which the user can either switch from one metric to another or combine 

them into an integrated one (e.g., [42, 82]). Please note that some of the papers may 

contain interaction capabilities and still be marked as not interactive because they do not 

provide direct interaction with the quality metrics mechanisms. 

Visualization 

The original table we have designed to classify the full set of papers (see Table 4.2 below) 

contains a rough categorization of visualization techniques into three main classes: scatterplots 

(SP), parallel coordinates (PC), and others (which include a fairly large number of 

different techniques). While this categorization helps in understanding how these techniques 

distribute over the whole set of papers (SP and PC account for 80% of the total), it does 

not say anything about key features of visualization techniques; especially those closely 

related to the usage of quality metrics. 

We define layout dimensionality as the number of data axes a visualization has. A 

data axis is the visualization feature that establishes what position a single visual mark 

takes in the visualization. For instance, scatterplots have dimensionality two because they 

can accommodate two spatial dimensions. 

The visualization techniques are classified into 1D, 2D, 3D, 4D, and nD, where nD 

stands for techniques that can accommodate an arbitrary number of dimensions (with 

obvious scalability limits when the number of dimensions grows too big). 

It is worth noticing that in general every visualization has an additional number of 

visual features to which data features can be mapped, e.g., color and size, but here we 

focus on the layout because it is the variable that most characterizes every visualization


technique and that has the biggest impact on the use of quality metrics. Table 4.1 shows 

the dimensionality of all the techniques we have identified in the review. 

The visualization techniques that are not in the nD class necessarily need an additional 

mechanism for the analysis of high-dimensional data. Typically, as discussed below, they 

are organized in a higher level structure that accommodates several projections. Those 

which can accommodate an arbitrary number of dimensions (nD) all need some kind of 

ordering mechanisms. 

Table 4.1: Visualization techniques categorized by their layout dimensionality (i.e., the number of 

axes of the visualization). 


Visualization technique      Layout Dimensionality

histogram                    1D

jigsaw map [150]             1D

scatterplot                  2D

pixel bar charts [87]        4D

dimensional stacking [91]    nD

matrix [22]                  nD

parallel coordinates [78]    nD

radvis [72]                  nD

scatterplot matrix [37]      nD

star glyphs [128]            nD

table lens [115]             nD

While not explicitly discussed in any of the reviewed papers, we have noticed that 

often a quality-metrics-driven approach needs some kind of (implicit or explicit) meta-visualization. 

By meta-visualization we mean a visualization of visualizations or, more 

specifically, a visualization layout strategy that organizes single visualizations into an organized 

form. For instance, when a quality-metrics-driven technique produces a number 

of interesting scatterplots as an output, there is the need to organize them into a schema 

that facilitates their comprehension and analysis (e.g., organized into a list sorted by interestingness). 

From our analysis we have identified the following main meta-visualization 

strategies: 

List: a layout strategy that organizes visualizations in an ordered linear fashion (often 

sorted to reflect quality metrics rankings); 

Matrix: a layout strategy that organizes visualizations in a grid format, where grid entries 

are organized according to some data features (e.g., columns and rows represent data 

dimensions); this layout is often also called Small Multiples, Trellis, Lattice, or Facets. 

It is worth noting that some basic visualization techniques can be considered meta-visualizations 

themselves. A notable example is the scatterplot matrix, which shows a set 

of scatterplots organized in a matrix layout. 

In general there is a strong interplay between visualizations and meta-visualizations. 

As mentioned above, techniques with a fixed dimensionality need to be organized in a 

meta-visualization. The meta-visualization influences the ordering of the visualizations


Table 4.2: Quality metrics papers classified according to quality metrics factors (sorted by purpose). 

Paper | Visualization technique (SP, PC, other) | What is measured | Where it is measured | Purpose | Interaction

A Projection Pursuit Algorithm for Exploratory Data Analysis - Friedman & Tukey [54] | SP | clustering | data | projection | 

A Rank-by-Feature Framework for Unsupervised Multidimensional Data Exploration Using Low Dimensional Projections - Seo & Shneiderman [126] | SP; other: histogram, matrix, list | clustering, correlation, outliers, complex patterns | data | projection | S 

Finding and Visualizing Relevant Subspaces for Clustering High-Dimensional Astronomical Data Using Connected Morphological Operators** [52] | SP; other: histogram | clustering | image | projection | T 

Graph-Theoretic Scagnostics - Wilkinson et al. [151] | SP | clustering, outliers, complex patterns | image | projection | 

Selecting good views of high-dimensional data using class consistency - Sips et al. [129] | SP | class pres. | data | projection | T 

Coordinating computational and visual 

approaches for interactive feature selection and 

multivariate clustering - Guo [59] 

Exploring High-D Spaces with Multiform Matrices 

and Small Multiples - MacEachern et al. [98] 

Improving the Visual Analysis of High-Dimensional 

Datasets Using Quality Measures - Albuquerque 

et al. [8] 

Interactive Hierarchical Dimension Ordering, 

Spacing and Filtering for Exploration of High 

Dimensional Datasets - Yang et al. [158] 

Interactive Dimensionality Reduction Through 

User-defined Combinations of Quality Metrics - 

Johansson & Johansson [82] 

PC 

matrix correlation data projection ordering 

pixel based 

vis., matrix, 

small multiples 

jigsaw map, 

radvis, table 

lens 

histogram, star 

glyphs 

Pargnostics: Image-Space Metrics for Parallel 

Coordinates - Dasgupta & Kosara [43] PC clustering correlation 

correlation data projection ordering 

clustering correlation outliers data image projection ordering 

correlation data projection ordering 

visual 


view 

optimization 

PC clustering correlation outliers data projection ordering S, T 

image 

quality 

image projection ordering S 

S, T 

Combining automated analysis and visualization 

techniques for effective exploration of highdimensional 

data - Tatu et al. [133] 

High-Dimensional Visual Analytics: Interactive 

Exploration Guided by Pairwise Views of Point 

Distributions - Wilkinson et al. [152] 

Clutter Reduction in Multi-Dimensional Data 

Visualization Using Dimension Reordering - Peng 

et al. [112] 

Similarity Clustering of Dimensions for an 

Enhanced Visualization of Multidimensional Data 

- Ankerst et al. [9] 

SP PC clustering correlation 

SP PC clustering outliers 

SP PC 

PC 

star glyphs, 

dim. stacking 

recursive 

pattern, circle 

segments 

Measuring Data Abstraction Quality in 

Multiresolution Visualizations - Cui et al. [42] SP PC histogram 

correlation outliers 

complex 

patterns 

complex 

patterns 

image 

quality 

class pres. data image projection ordering 

image projection ordering 

data image ordering 

correlation data ordering 

feature 

pres. 

data abstraction T 

Quality Metrics for 2D Scatterplot Graphics: 

Automatically Reducing Visual Clutter - Bertini & 

Santucci [24] 

A Screen Space Quality Method for Data 

Abstraction - Johansson & Cooper [80] PC 

SP clustering 

feature 

pres. 

feature 

pres. 

data image abstraction 

image sampling 

Enabling Automatic Clutter Reduction in Parallel 

Coordinate Plots - Ellis & Dix [48] PC 

image 

quality 

image sampling T 

Pixnostics: Towards measuring the value of 

visualization - Schneidewind et al. [120] 

jigsaw map, 

pixel bar chart 

correlation 

complex 

patterns 

data image 

visual 


** Ferdosi et al. 

Legend: SP = scatter plot (& matrix), PC = parallel coordinates, feature/class pres. = feature/class preservation, S = select metric, T = set threshold.


and in some cases also the content. For instance, the matrix layout requires that the 

visualization within a grid cell corresponds to the data values it represents. 

Finally, meta-visualizations can themselves be influenced by quality metrics. All the 

layout strategies have some degree of freedom in terms of reordering, and an optimal 

reordering (according to some given goal) can only be achieved by searching in the space 

of solutions (e.g., as presented in [112]). 

4.1.5 Examples 

In this section, we provide four selected examples from our review as a way to show 

how our proposed model can describe existing approaches in this area. We selected the 

examples in a way to cover as many interesting aspects as possible. In particular, we 

picked papers with different purposes because they guarantee a larger variety of features. 

For completeness we provide all the other quality metrics pipelines in Appendix A.3 in 

the same order the papers are listed in Table 4.2. 

The first example comes from our own work presented in Section 3.1.4 and Section 3.1.5, 

published in [133]. The main goal of this work is to find interesting projections of 

n-dimensional data using image processing techniques. The section presents several measures, 

but here we focus only on the part dealing with parallel coordinates and one specific 

metric, the Similarity Measure. 

The basic idea of the method is to generate all possible 2D combinations of the original 

dimensions and evaluate them in terms of their ability to form clusters in a 2-axis parallel 

coordinates representation. Every pair of axes is evaluated individually using a standard 

image processing technique (the Hough transform), which makes it possible to discriminate between 

uniform and chaotic distributions of line angles and positions (for details please refer back 

to Figure 3.6 for the Hough transform). Once interesting pairs have been extracted, they 

are joined together to form groups of parallel coordinates of a desired (user-defined) size 

(e.g., in Figure 3.14, groups of 4-dimensional parallel coordinates are formed). 
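
The idea behind the evaluation of a single axis pair can be sketched without rendering an image. The example below is a strong simplification and not the Similarity Measure of [133]: instead of applying a Hough transform to the rendered pixels, it votes each record's line directly into a coarse (slope, intercept) accumulator and reports how concentrated the votes are. Values are assumed to be normalized to [0, 1]; the function names, the bin count, and the `top` parameter are our own choices.

```python
from collections import Counter

def line_accumulator(left, right, bins=8):
    """Vote each record's line between two axes into a (slope, intercept) grid."""
    acc = Counter()
    for a, b in zip(left, right):
        slope = b - a  # lies in [-1, 1] for normalized values
        s_bin = min(int((slope + 1) / 2 * bins), bins - 1)
        i_bin = min(int(a * bins), bins - 1)  # intercept on the left axis
        acc[(s_bin, i_bin)] += 1
    return acc

def concentration(acc, top=2):
    """Share of lines that fall into the `top` most voted accumulator cells."""
    total = sum(acc.values())
    return sum(count for _, count in acc.most_common(top)) / total

# Two tight bundles of lines crossing each other.
clustered = line_accumulator([0.1, 0.12, 0.11, 0.9, 0.88, 0.91],
                             [0.8, 0.82, 0.81, 0.2, 0.18, 0.21])
# Lines with unrelated angles and positions.
chaotic = line_accumulator([0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
                           [0.9, 0.1, 0.7, 0.0, 0.5, 0.3])
print(concentration(clustered), concentration(chaotic))
```

A high concentration means that most lines share similar angles and positions, which is the property the ranking rewards; axis pairs with scattered votes are ranked low.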

Figure 4.6 presents the pipeline for this example. We can recognize three main elements: 

(A) all 2D parallel coordinates are generated in the data transformation phase; 

(B) all the alternatives are evaluated in the image space at the view stage; (C) the algorithm 

combines the interesting segments into a list of parallel coordinates (like those in 

Figure 3.14) using the visual mapping stage. 



Figure 4.6: Quality metrics pipeline for the first example from [133]: (A) generation of alternatives; 

(B) evaluation of alternatives (image space); (C) creation of the final representation. 

The technique uses parallel coordinates (PC) as principal visualization technique and 

a list as a meta-visualization. It measures clustering properties, in the image space, and 

its main purpose is to find interesting projections. Interaction with the metrics is very 

limited if not absent.


The second example comes from the work of Johansson and Johansson on interactive 

feature selection [82]. The technique ranks every single dimension for its importance using 

a combination of correlation, outlier, and clustering features calculated on the data. This 

ranking is used as the basis for an interactive threshold selection tool by which the user 

can decide how many dimensions to keep, weighing the choice against the corresponding 

information loss presented by the chart (see Figure 4.7). Once the user selects the desired 

number of dimensions the system presents the result with parallel coordinates and automatically 

finds a good ordering using the same data features calculated for ranking the 

dimensions. The user can also choose different weighting schemes to focus more on correlation, 

outliers or clusters. Figure 4.8 shows the results of clustering (top) and correlation 

(bottom). 
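
The selection step can be sketched as follows, assuming that each dimension already carries one quality value per metric. The weighted sum and the loss ratio (importance of the removed dimensions divided by the total importance) follow the description above only loosely; the input values, the metric names, and the function names are hypothetical.

```python
def importance(scores, weights):
    """Weighted sum of the per-metric quality values of one dimension."""
    return sum(weights[m] * scores[m] for m in weights)

def reduce_dimensions(quality, weights, keep):
    """Keep the `keep` most important dimensions and report the information loss."""
    ranked = sorted(quality, key=lambda d: importance(quality[d], weights), reverse=True)
    kept, removed = ranked[:keep], ranked[keep:]
    total = sum(importance(quality[d], weights) for d in quality)
    lost = sum(importance(quality[d], weights) for d in removed)
    return kept, (lost / total if total else 0.0)

# Hypothetical per-dimension quality values for three metrics.
quality = {
    "x1": {"correlation": 0.9, "outlier": 0.1, "cluster": 0.2},
    "x2": {"correlation": 0.2, "outlier": 0.8, "cluster": 0.1},
    "x3": {"correlation": 0.1, "outlier": 0.1, "cluster": 0.1},
}
equal_weights = {"correlation": 1.0, "outlier": 1.0, "cluster": 1.0}
kept, loss = reduce_dimensions(quality, equal_weights, keep=2)
print(kept, round(loss, 3))
```

Changing the weights changes the ranking, which corresponds to the user focusing on correlation, outliers, or clusters; evaluating `reduce_dimensions` for every value of `keep` yields the curve shown in the interactive chart.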

Figure 4.7: Interactive chart to select number of dimensions to keep vs. information loss [82]. 


Figure 4.8: Top: best ordering to enhance clustering. Bottom: best ordering to enhance correlation 

[82]. 


Figure 4.9 shows the pipeline for this example. Again we have three main elements: 

(A) every single dimension is ranked by the quality metrics directly from the source data. 

The reason why the source data is needed is that the importance measure of a single 

dimension is computed taking into account the full set of dimensions (see the paper for 



details); (B) the user selects the dimensions guided by the quality metrics, both the user 

and the quality metric influence the data transformation process; (C) the system finds 

the best ordering according to the weighting scheme proposed by the user producing one 

specific visual mapping. The view is presented to the user. 



Figure 4.9: Quality metrics pipeline for the second example from [82]: (A) dimensions ranked by 

their importance; (B) selection of number of dimensions to retain vs. information loss; (C) creation 

of the final mapping with ordering. 

This technique uses parallel coordinates as principal visualization. There is no meta-visualization 

to organize alternative results in a schema, but the interactive chart functions 

as a way to pilot the generation of alternatives. It measures clustering, correlation and 

outliers in the data space and its main purpose is to find interesting projections and 

orderings. Interaction plays a central role in the selection of the number of dimensions 

and in the weighting scheme. 

The third example is taken from the work of Cui et al. on data abstraction quality [42]. 

This paper proposes a technique to create abstracted visualizations in a user-controlled 

manner. The system features data abstraction metrics (Histogram Difference Measure 

and Nearest Neighbor Measure) and controllers to let the user find a trade-off between 

abstraction level and information loss. In particular, the data abstraction quality is calculated 

by comparing features of the original data to features in the sampled or aggregated 

data. 


Figure 4.10: Visual abstraction of a scatterplot matrix from [42]. 

Figure 4.11 shows the pipeline for this example. We have two main elements: (A)

the data abstraction quality measures are calculated by comparing the source data to the

transformed data; (B) the user selects the desired abstraction quality and receives feedback

on its quality by steering the data transformation process.

The paper applies the technique to scatterplots and parallel coordinates but it is generic

enough to be applied to many other techniques. There is no meta-visualization to organize

alternative results but similarly to the second example an interactive chart is used to

set an abstraction quality threshold (see Figure 4.12). It measures feature preservation, and its

main purpose is abstraction. Interaction plays a central role in the selection of the right

abstraction level.

[Pipeline diagram: stages Source, Transformed, Structures, View, Views; steps A and B marked.]

Figure 4.11: Quality metrics pipeline for example three from [42]: (A) data features compared

between the original data and the abstracted data; (B) instantiation of the desired abstraction

level guided by quality metrics.

[Figure 4.12 panel: 1D plots of quality measures.]

Figure 4.12: Visual abstraction chart with threshold setting for the abstraction level and feedback

on abstraction quality [42].
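The threshold-based choice of an abstraction level can be sketched as follows. This is a minimal illustration under our own assumptions: `choose_abstraction_level`, `toy_quality`, the candidate levels, and the default threshold are hypothetical and do not reproduce the interaction widget of [42].

```python
def choose_abstraction_level(levels, quality_at, threshold=0.90):
    """Return the smallest abstraction level whose quality reaches the threshold.

    Returns None when no candidate level is good enough.
    """
    for level in sorted(levels):
        if quality_at(level) >= threshold:
            return level
    return None


def toy_quality(level):
    # Hypothetical, monotonically increasing quality curve on (0, 1].
    return level ** 0.1


candidate_levels = [0.05, 0.10, 0.25, 0.50, 1.00]
print(choose_abstraction_level(candidate_levels, toy_quality))
```

In an interactive setting the user would move the threshold and observe which level is returned; the function only automates the lookup for a fixed threshold.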

As a fourth example we choose the paper from Yang et al. [158]. They use quality

metrics to support an interactive dimension management system for high-dimensional data. Their

interactive hierarchical dimension management system called DOSFA (Dimension Ordering,

Spacing, Filtering Approach) supports automatic and interactive dimension ordering,

filtering and spacing. An example can be seen in Figure 4.13 where on the left hand side,

the data is presented in an unchanged way, and on the right hand side the data is visualized

after DOSFA was applied and the data is ordered, spaced and filtered. Different

orders of dimensions can show different patterns of the data to the user. Depending

on the task, a similarity-oriented order or an importance-oriented order is needed.

An annotated pipeline with these steps is presented in Figure 4.14. It contains four

main steps: (A) a hierarchical structure of the dimensions is constructed, by grouping

similar dimensions into clusters and similar clusters into larger clusters; (B) in the data

transformation process dimensions are filtered based on their similarity

and importance; (C) the dimension ordering influences the mapping stage by determining the ordering of


Figure 4.13: Left: star glyphs representing original data set. Right: visualized data after DOSFA 

was applied [158]. 


[Pipeline diagram: stages Source, Transformed, Structures, View, Views; steps A, B, C, and D marked.]

Figure 4.14: Quality metrics pipeline for example four from [158]: (A) construct hierarchical 

structure of dimensions by clustering; (B) filter dimensions by similarity and importance; (C) map 

dimensions ordering to visualization; (D) influence the view according to the quality measured 

(spacing the parallel coordinates according to their similarity). The user can steer all these steps, 

after interacting with the clustered dimensions shown in an InterRing visualization.

the visualization’s dimensions, or by mapping the more important dimensions to more

prominent visualization positions or to more preattentive visual attributes;

(D) the quality of dimensions also influences the view transformation step by determining

the spacing between the dimensions in the parallel coordinates according to their similarity. 

All these “best” settings can be automatically calculated by the system and the result is 

presented to the user. It is also possible to present the dimension hierarchies to the user 

with an InterRing [159]. The user can interact with the InterRing, triggering the filtering,

ordering, and spacing for the final result. This is represented by the user-interaction arrows 

on the lower level of the pipeline. 

This paper applies quality metrics in data space to improve scatterplots, parallel coordinates 

and star glyphs for high-dimensional data. It measures correlation to find the 

best dimension ordering, projection, and view optimization for the data sets. The user can 

steer the process by influencing all the pipeline steps. 
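The ordering and spacing steps can be illustrated with a small Python sketch. It is a deliberately simplified substitute, not the DOSFA algorithm of [158]: instead of a dimension hierarchy it uses a greedy ordering on the absolute Pearson correlation, and the helper names (`pearson`, `similarity_order`, `axis_positions`) and the toy columns are our own assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long value lists (0.0 if undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def similarity_order(columns):
    """Greedy ordering: start with the first dimension, then always append
    the remaining dimension most correlated with the last one placed."""
    names = list(columns)
    order = [names[0]]
    remaining = set(names[1:])
    while remaining:
        last = order[-1]
        nxt = max(sorted(remaining),
                  key=lambda d: abs(pearson(columns[last], columns[d])))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def axis_positions(columns, order, min_gap=0.2):
    """Spacing: similar neighbouring axes get a small gap, dissimilar ones a large gap."""
    positions = [0.0]
    for a, b in zip(order, order[1:]):
        gap = min_gap + (1.0 - abs(pearson(columns[a], columns[b])))
        positions.append(positions[-1] + gap)
    return positions


# Toy data: "c" is perfectly anti-correlated with "a", "b" is almost linear in "a".
columns = {
    "a": [1, 2, 3, 4, 5],
    "noise": [3, 1, 4, 1, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 4, 3, 2, 1],
}
order = similarity_order(columns)
positions = axis_positions(columns, order)
print(order)
print([round(p, 2) for p in positions])
```

Filtering could be added in the same spirit, for example by dropping one of two dimensions whose absolute correlation exceeds a user-chosen threshold.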

These four examples cover many aspects discussed in the previous sections, especially 

metrics calculated in the data vs. image space, different purposes, different measure types,

different uses of the pipeline, and different interaction levels. Many of the papers we have

reviewed have similar elements and functions, nonetheless there are others that deviate 

considerably from these ones. While we cannot provide the full set of examples in this 

section, we discuss in Section 4.1.6 some findings that stem from the analysis of the whole 

set, including those with uncommon approaches and list all the quality metrics pipelines 

in Appendix A.3.


4.1.6 Findings 

In the following, we discuss some major trends we have observed during our analysis. 

From the visualization point of view we already discussed the role of meta-visualizations, 

that is, visualizations with the purpose to accommodate other visualizations. During the 

paper review we found very limited explicit discussions of this aspect that we deem extremely 

relevant. Many of the papers we have analyzed seem to assume that providing 

a simple list of interesting visualizations will automatically solve the user’s task. To the 

best of our knowledge, the only work that analyzes the issue explicitly and in great depth 

is the Trellis display [18], which organizes the display in a way to make patterns among 

views apparent. We believe a deeper investigation of this issue is needed. 

Interestingly, some of the papers we reviewed do take care of the navigation issue, that 

is, how to explore configurations automatically found by the algorithm. These papers usually 

provide an additional visualization that permits navigating from one configuration to

another. For instance, Johansson et al. provide a line chart visualization to interactively

show alternative projections in parallel coordinates [82]. Similarly, “hierarchical dimension

ordering” [158] uses the InterRing visualization to let the user navigate through

alternative subsets of dimensions organized in a hierarchical fashion. Finally, the Rank-by-Feature

framework [126] uses color-coded interactive lists and scatterplot matrices to

provide a preview of the statistical properties of each view.

We also noticed a lack of systematic approaches to the ordering problem - every paper 

proposes its own method. The whole topic of seriation, introduced in the early work of 

Bertin [22] and discussed in depth by Hahsler et al. [62], deserves deeper investigation and 

acknowledgment. Additionally, innovative ways of ordering data dimensions may exist, 

like the Eulerian tours and Hamiltonian decompositions presented by Hurley et al. [75],

which explore the possibility of repeating the axes to reduce dependency on a specific 

order. 

In Section 4.1.4, we listed a series of meta-visualizations that we have found, namely 

list and matrix (small multiples). We believe this list can be expanded if novel solutions 

are developed. A promising one we have noticed in a few papers, but not included in the 

review (because they are not specifically using quality metrics) is the idea of arranging 

iconic versions of the visualizations generated in a scatterplot view (e.g., using MDS or 

similar techniques). Such a technique is for instance proposed in the work of Yang et al. 

where pixel-based icons are laid out with an MDS projection in a scatterplot [156]. 

Another issue we noticed from our analysis is the limited use of the visual mapping 

and view transformation functions in the pipeline. More specifically, visual mapping is 

almost exclusively used as a way to generate alternative orderings, taking into account 

exclusively the mapping between the original data dimensions and the visualization axes. 

But alternative mappings can also be generated by linking data dimensions to the whole 

spectrum of visual features like color, size, shape, etc., as is common in several systems 

based on visual languages like ggplot2 [1], Tableau [3], and Protovis [2]. Pixnostics [120] is

the only technique in our review presenting this kind of a process supported by quality 

metrics. 

View transformation is also rarely used in the quality metrics pipeline. The only 

example we found is the use of quality metrics to automatically select focus area parameters 

in table lens [8]. The automatic selection of interesting points of view in 3D scatterplots,

for example, is one clear case where the use of quality metrics at the view transformation 

stage would be beneficial. Another one is the automatic highlighting of interesting items in


a view (e.g., visual boosting in pixel-based visualizations [109]). 

Finally, the purposes we have considered can be roughly classified into two broad higher 

level purposes: finding interesting visualizations and scaling visualizations to larger data 

sets. When considering these goals it is evident how clustering, correlation, outliers, and

complex patterns mostly support the first goal, whereas image quality and feature preservation

tend to support the second one. One interesting pending issue is whether the

use of quality metrics in high-dimensional data is confined to these two general purposes. 

One purpose, which to the best of our knowledge is totally unexplored, is the use of quality 

metrics to automatically or semi-automatically compare different visual techniques of the

same data. 

4.1.7 Directions for Further Research 

In the following, we present a selected set of research issues we deem important for the 

advancement of quality-metrics-driven data visualization. 

Evaluation and applications 

Surprisingly, none of the papers we have analyzed reported on user evaluation. While 

we are convinced that quality metrics are useful and need to be further developed, we 

also realize that the whole idea has not yet been tested. Usefulness is therefore one of 

the most important aspects to consider, followed by usability issues. To the best of our

knowledge, there are no studies reporting on the use of the quality metrics approach in 

real-world settings. Observational studies or even simple case studies would greatly improve

the approach and most likely direct research to specific issues hard to anticipate without 

observation. 

Perceptual tuning 

All the metrics that work in the image space try to simulate the human pattern recognition 

machinery to some extent. They try to partially substitute human vision with image

processing algorithms with the (implicit) assumption that algorithm rankings will match 

user rankings. This assumption needs a much deeper investigation. Our study presented 

in Section 3.2 and published in [134], where quality metrics rankings of clusters in scatterplots 

are compared to human rankings, represents a first step in this direction. In addition, 

it is necessary to validate and tune the image space metrics in a way that the parameters 

take models of human perception into account. Excellent examples of initial steps in this 

direction are in the following papers [81, 94, 116], where the perception of visual patterns 

has been tuned according to user studies aimed at modeling the way humans perceive them. 
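One simple way to quantify how well algorithm rankings match user rankings is a rank correlation coefficient. The sketch below implements Spearman's rank correlation in plain Python. It is meant as an illustration only; we do not claim that this is the statistic used in [134], and the helper names `ranks` and `spearman` as well as the example scores are our own assumptions.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5 if va and vb else 0.0


# Hypothetical scores for five scatterplot views.
metric_scores = [0.91, 0.40, 0.75, 0.10, 0.55]
human_scores = [5, 2, 3, 1, 4]
print(round(spearman(metric_scores, human_scores), 3))
```

A value close to 1 indicates that the metric orders the views much like the human judges do, while values near 0 or below signal that the implicit assumption does not hold for the tested views.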

Metrics systematization 

During our review we collected a very large number of alternative quality metrics, some 

calculated in data space, some in image space. While this proliferation of metrics is a sign

of the richness of this approach, it is currently very hard to compare them and understand 

which one is suitable for a given task. Some authors provide a number of metrics in the 

same environment letting the user choose which one to use. Nonetheless, we fear that 

this approach with limited guidance may not be effective for end users, especially, if there

is a lack of understanding of the level of redundancy between one metric and another. 

Similarly, given the above mentioned dichotomy, it is hard if not impossible to state which


approach yields the best results in which contexts. On a side note, the mixed approach 

of giving the user the possibility to combine several metrics into a composite one needs 

much more investigation, validation, and guidance. 

Scalability 

Image space and data space quality metrics have different scalability issues. Quality

metrics in image space have the advantage of being independent from the original data size, 

e.g., [42], that is, their computational complexity only depends on the screen dimensions. 

However, as data grows in size, virtually all visualizations experience some degree of 

degradation that may influence the discriminatory power of the metric. For instance, 

visualizations with a lot of clutter might hinder the discovery of the desired patterns. 

Quality metrics in data space, on the other hand, are expected to be more robust in terms 

of pattern detection, but their computation is directly affected by data size. A thorough

investigation of these issues and how to find a compromise between the two is clearly an 

interesting subject for future research. 

4.1.8 Limitations 

Our work has some important limitations to take into account; first of all its subjective 

nature. We are by no means suggesting this is the only way to describe the current state of 

quality metrics in high-dimensional visualization. There are no doubt a number of equally 

good alternative ways to describe it; this chapter provides a much-needed starting point. 

We encourage the reader to use this as a way to get inspiration for further research and 

to understand its status. 

Similarly, while we did our best to follow a thorough methodology (see Section 4.1.2), 

there might be relevant papers we overlooked. Even though we tried to be very broad 

and inclusive, our background heavily influences the review. Especially, given our focus 

on Computer Science we might have missed relevant literature from Statistics. However, 

we feel confident that at this point of our review any additional paper would not change 

the structure or the elements of our model. In other terms, the real goal of our review 

was not to include every possible paper on the discussed matter but more to have enough 

coverage to build a coherent and useful picture. 


We presented a systematic analysis of quality metrics as a way to support the exploration 

of high-dimensional data sets. Quality metrics have been used in a variety of contexts and 

purposes. With this work we started a collection of these disparate systems under one 

umbrella and provided a way to reason about their characteristic features. Specifically, 

we presented an analysis of the visualization techniques, the quality metrics, and the 

processing pipeline. The analysis has two main outcomes. First, it permits describing the

methods in detail and capturing their key components. Second, as shown in Section 4.1.6

and Section 4.1.7, it permits spotting interesting research gaps and promising directions

for future research. While we consider this work just an initial step, we hope it will spur 

new ideas and support researchers and practitioners in the development of interesting new 

applications and novel techniques.


4.2 Visual Cluster Separation Factors: Sketching a Taxonomy 4 

The quality metrics systematization presented in the previous section was followed by 

a qualitative analysis of concrete measures from this large pool. Here we turned our 

focus to the quality metrics for scatterplots that are designed to identify the visualizations

that best represent the clusters in classified data. That means they rank scatterplot views

that separate the data classes well higher than views with mixed classes. Our idea was

to use two of these metrics to identify the best visualizations independently of the

data. Simultaneously, we wanted to give advice as to whether it would be best to use

a 2D scatterplot, a 3D scatterplot, or a SPLOM for a specific data set. We therefore

computed the measures for different data sets, and surprisingly found that these are

not robust with respect to the different cluster shapes encountered in the analyzed data.

Led by this insight, we analyzed more deeply all the cases using open and axial coding 

of failure reasons building up a taxonomy of visual cluster separation factors. We named 

this process a qualitative evaluation. 

The next sections will sketch the methodology and the results of this evaluation by presenting 

introductory ideas in Section 4.2.1 that led to this work. Section 4.2.2 will present 

a short description of the methodology, followed by the taxonomy axes in Section 4.2.3 

and concluding in Section 4.2.4 with a discussion about the limitations of this work and 

possible future research making use of the developed taxonomy. 

4.2.1 Introduction 

An impressive number of quality measures, dimension reduction techniques, and visualizations 

for high-dimensional data have been developed in the past. The more exist, the 

harder it is for users to find the right choice for their tasks. The literature does not provide

any guidance on how to choose the right visualization or dimension reduction technique 

for the complex multidimensional data. Quality metrics were designed to filter the high 

number of representations and provide an interesting selection to the user. Sedlmair et 

al. [122] investigate to what extent the existing measures can accomplish this task. They

choose the 2D-HDM (Section 3.1.3) and the DCM [129] (also used in the empirical evaluation

in Section 3.2 and described in Section 3.2.1) as quality metrics to judge the different

data projections regarding their ability to represent the clusters in classified data sets.

The measures were designed for 2D scatterplots and extended by the authors to work

also on scatterplot matrices (SPLOMs) and 3D scatterplots. This decision is motivated

by the fact that scatterplots are a widely used technique to display high-dimensional projections,

and often SPLOMs are used to see more than two dimensions of the data. Since

analysts also quite often work with 3D scatterplots, this technique was also included in

the study. To obtain the lower-dimensional embeddings, different dimension reduction

techniques were used - the well-known PCA, robust PCA, MDS and t-SNE [143]. Initial

4 This chapter is based on the collaboration with UBC where I participated in a project on quality 

measures, led by Prof. T. Munzner and M. Sedlmair. The work resulted in a joint EuroVis publication

[122]. Since I was not in the lead in this project, this chapter briefly presents the methodology and

the results, and a deeper description can be gathered from the paper itself. Please note that the full 

taxonomy of cluster separation factors and data characteristics is not my contribution. Since I was part of 

the qualitative analysis I would like to recall the results in my thesis, and provide a personal outlook on 

further research ideas at the end of this chapter.


experiments on different data sets showed that by using these visualizations and projection

techniques, the measures are not able to detect different cluster shapes in the data

projections. Compared to a human judgement, they surprisingly provided mismatches by 

ranking visualizations high, when the human rank was low, and ranking visualizations 

low, when the human rank was high. This implies that good visualizations are sometimes 

missed and bad visualizations are ranked high, both cases that should be avoided. 

These surprising outcomes shifted the focus of the study from a guide for the user to the 

right choice of a visualization technique and a dimension reduction technique dependent on 

their data, to an in-depth analysis of different visual separability factors. A sketch of the

methodology of the systematic study of the differences between the computed measures

and the human judgement is presented in the next section. 
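The two kinds of mismatch described above can be made explicit with a short Python sketch. It is an illustration under our own assumptions: the function `find_mismatches`, the cut-off at the top half of the ranking, and the example views and scores are hypothetical and are not taken from the study in [122].

```python
def find_mismatches(measure_scores, human_labels, top_fraction=0.5):
    """Compare a measure-based ranking with binary human judgements.

    measure_scores: dict mapping a view name to its score (higher is better).
    human_labels:   dict mapping a view name to True if a human judged the
                    classes to be well separated in that view.
    Returns (ranked_high_but_judged_bad, ranked_low_but_judged_good).
    """
    ranked = sorted(measure_scores, key=measure_scores.get, reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    top = set(ranked[:cutoff])
    ranked_high_bad = [v for v in ranked if v in top and not human_labels[v]]
    ranked_low_good = [v for v in ranked if v not in top and human_labels[v]]
    return ranked_high_bad, ranked_low_good


# Hypothetical scores and judgements for four projections of one data set.
scores = {"pca": 0.9, "mds": 0.8, "tsne": 0.3, "rpca": 0.1}
judged_good = {"pca": True, "mds": False, "tsne": True, "rpca": False}
print(find_mismatches(scores, judged_good))
```

The first list contains bad visualizations that are ranked high, the second good visualizations that would be missed.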

4.2.2 Method 

To discover the divergences between human judgement and measure ranks, a qualitative 

data study was conducted. The first two authors of the paper 5 manually inspected over 

800 visualizations (combination of 75 data sets, 4 dimension reduction techniques, and 3 

visualizations - 2D, 3D scatterplot and SPLOM) and judged their quality in displaying 

data clusters. Their judgements were compared with the measure ranks and the mismatches 

were analyzed. “The investigators generated a detailed set of characteristics that 

influenced cluster separability in general, and specific reasons why the measures failed in 

the cases where they found a mismatch. Based on separability characteristics and failure 

reasons, we generate a higher-level taxonomy of factors, which we iteratively refined in 

multiple passes” [122]. This was done “not only by considering its explanatory clarity 

and power, but also by mapping the ranges where each measure was successful along the 

factor axes, and by placing some of the studied data sets along them. Figure 4.15 shows 

the measure success ranges on a simplified version of the taxonomy” [122]. The study 

consisted of four stages: (1) choosing variables for study; (2) generating data set instances 

[Figure 4.15 diagram: factor axes of the taxonomy with the measure success ranges marked. Within-Class Factors: Count, Size, Density, Clumpiness, Outlier, Isotropy, Curvature, Centroid. Between-Class Factors: Class/Point Count, Variance of Count, Variance of Size, Variance of Density, Mixture, Split, Variance of Shape, Inner-Outer Position, Class Separation. Measures: Centroid, Grid. Datasets: gaussian (synth., MDS, Fig. 5(a)); spambase (real, PCA, Fig. 5(b)); shuttle (real, MDS, Fig. 5(c)); fisheries (real, MDS, Fig. 5(d)); hiv (real, t-SNE, Fig. 5(e)); entangled (synth., t-SNE, Fig. 5(f)).]

Figure 4.15: Taxonomy of factors in visual cluster separation, where factor axes are marked to 

show the ranges where existing measures are successful; gaps represent failure cases. The centroid 

measure (CDM) is marked in blue and the grid (2D-HDM) is marked in red. All positions are 

approximate estimates. Marked along the factor axes are six data sets that are exemplified in the 

paper. (Used with permission by [122].) 

5 Michael Sedlmair and myself.


and computing measures; (3) open coding and measure evaluation; and (4) axial coding

and taxonomy building. Details can be found in the paper [122].

4.2.3 Visual Cluster Separation Taxonomy 

Class separation in a visualization is influenced by different characteristics of the data set.

Figure 4.16 presents the factors that affect visual cluster separation. These are grouped in

“Within-Class factors” that are determined by the structure or appearance of a single class 

and “Between-Class factors” that represent interactions between two or more classes [122]. 

[Figure 4.16 diagram. Influence categories: Scale, Point Distance, Shape, Position. Within-Class Factors: Count (few to many), Size (small to large), Density (sparse to dense), Clumpiness (equidistant, uniformly random, one dense spot, many dense spots, clumpy), Outlier (none to many), Isotropy (narrow to round), Curvature (curvy at one end), Centroid (evocative to misleading). Between-Class Factors: Class/Point Count (few classes, many points vs. many classes, few points), Variance of Count, Size, Density, and Shape (each similar to different), Mixture (random vs. equidistant vs. interwoven), Split (contiguous vs. split), Inner-Outer Position (non-existent vs. existent), Class Separation (full overlap, partial overlap, adjacent, separate, distant).]

Figure 4.16: A taxonomy of data characteristics with respect to class separation in scatterplots. 

Some factors are organized as axes (arrows) while others are binned. Between-Class factors often 

result from the variance of Within-Class factors (horizontal dependencies), and factors at the top 

can strongly influence factors below them (vertical dependencies). Class Separation is therefore 

dependent on all other factors (used with permission by [122]). 

In brief, we recall the characteristics determining these factor groups. Four categories

describe the two factor groups: scale, point distance, shape, and position,

which influence each other from first to last. Within these categories, the Within-Class

factors are: count, size, density, clumpiness, outlier, shape and centroid. These factors

describe the structure and appearance of single classes. Variance of the Within-Class

factors across multiple classes determines the Between-Class factors sketched on the right

side of the figure. In this study, the following combinations influencing the perceived 

cluster shapes were identified: class-point count, variance of (point) count, variance of 

(class) size, variance of (class) density, mixture (of classes), split (of classes), variance of 

shape, inner-outer position, class separation. Since the arrows indicate the influence of 

the factors, horizontally from left to right, and vertically from top to bottom, the factor 

positioned in the lower right corner, class separation, can be strongly influenced by all the 

other factors. 

The two quality measures have different strengths, so they perform differently while

encountering these different factors. Figure 4.15 marks along the factor axes the measures’


performances. This clearly shows the gaps where current measures cannot achieve good

results. The study identifies that the centroid factor is influenced by many other factors

and the centroid based measure (CDM) alone cannot identify all different constellations of

visual classes. CDM is vulnerable with respect to shape, clumpiness, outliers, variance of

count, of size, or of density, and inner-outer position [122]. Similarly, HDM also encountered

a number of problems while identifying classes in visualizations. The biggest issue is with

narrow, adjacent classes that coexist in the same grid cell and span over different cells.

The measure turned out to be sensitive to the grid size, despite previous results from the

literature. The most difficult factor was the class separation, with the measure failing on

overlapping classes. Depending on the grid, the classes were sometimes rated good,

even though they presented a high overlap or class split.

The goal of this taxonomy is to guide others in designing, using, and evaluating cluster

separability measures. Other researchers can test different data sets and map their features

onto the taxonomy axes. This will give an overview of the coverage of relevant factors by 

the particular measures and help in improving or developing more reliable measures in the 

future. 

4.2.4 Discussion and Further Research 

This study shows that so far measures were developed and validated on far too few and 

too simple data sets. The real world is much more complex, and since the data complexity 

rises, a more systematic development of the measures is needed. As we saw in the previous 

section, more aspects can be identified in real data sets, that are not covered yet by 

existing measures. In the following, we present a list of issues that emerged as a result of 

this study, and which we deem important for further research in the area of quality metrics. 

Taxonomy based evaluation and systematization 

A large number of metrics for cluster separation in scatterplots have been developed. They 

all try to discover good views displaying the data clusters. We believe that there are two 

main reasons why there is such a variety of measures for this task: the different strengths of the measures

and the lack of a unified picture of existing approaches. First, the measures have different

strengths according to the factors of the taxonomy. They cannot cover the entire spectrum 

of data characteristics, and therefore focus just on a subset of these. Using the metrics 

for the area that they cannot cope with will lead to wrong results. Therefore, guided by 

the taxonomy presented before, an evaluation of the existent metrics is needed that can 

help users to choose the right measure depending on their data. Second, the variety of 

measures makes the development of new ones difficult since a unifying picture is missing.

Guided by the taxonomy axes, the existing approaches can be evaluated and their ranges

of success can be marked. This analysis would provide a good systematization

of current approaches spotting the data characteristics that have to be addressed in the 

future and lead the researchers through the variety of approaches. 

Taxonomy based measure development 

After the gaps of existent measures are identified, new research can be conducted to cover 

the data characteristics missing so far. We believe that it is hard to develop one single 

measure to cover all these factors, but having different measures and being aware of their

coverage potential along these axes helps in avoiding false rankings in the future.

New taxonomies for different visualization techniques

While this taxonomy focuses on one prominent visualization technique, the scatterplot,

there are also metrics designed for other high-dimensional visualization techniques, as categorized

in Section 4.1.4. Different visualization techniques will need different factors to

characterize different patterns (e.g., cluster separation). Even though building such a taxonomy

is laborious, its benefits can improve the development of metrics for these techniques.

New taxonomies for different quality metric factors

We have seen in Section 4.1.4 that different patterns are quantified by measures, and a

systematization of the factors that influence them is missing for other factors too. Factors 

like correlation, outliers, complex patterns, image quality, or feature preservation are 

missing such a taxonomy. Having all these taxonomies – which would be the ideal case 

scenario – it would be possible to identify interrelations between different patterns and

how they are represented in visualizations. We believe that these insights can help in 

combining measures to identify more than one pattern. 

Metrics for dimension reduction properties 

Dimension reduction techniques are often used to reduce the dimensionality of the data 

sets before displaying them on the screen. The metrics are always applied on dimension 

reduced data sets, so artifacts introduced by these techniques cannot be excluded. A study

of how different data characteristics are maintained or obscured by these techniques can

be conducted by comparing different techniques, or the same technique with different

parameter settings on the same data set. As far as we know, there are no studies reporting

on this type of analysis, and we believe it to be an interesting topic for future research. Also,

quality measures can be designed to automatically detect structure changes caused by a parameter

or technique change. Properties like noise invariance, rotation invariance, and scalability with

respect to data points and dimensions can be explored by new quality metrics.

5 

Visual Subspace Analysis of High-Dimensional Data

Contents 

„Visual ideas combined with technology combined with personal interpretation 

equals photography. Each must hold it’s own; if it doesn’t, the thing 

collapses.” 

Arnold Newman 

5.1 Visual Exploration for Subspace Clustering

5.1.1 Motivation

5.1.2 Subspace Clustering Algorithms

5.1.3 Task Definition and Design Space for Visual Subspace Cluster Analysis

5.1.4 The ClustNails System

5.1.5 Use Case and System Comparison

5.1.6 Conclusions and Future Work

5.2 Visual Analytics of Subspace Search

5.2.1 Introduction

5.2.2 Subspace Analysis

5.2.3 Proposed Analytical Workflow

5.2.4 Application

5.2.5 Discussion and Possible Extensions

5.2.6 Conclusions

Subspace clustering addresses an important problem in clustering multidimensional

data. In sparse multidimensional data, many dimensions are irrelevant and obscure 

the cluster boundaries. Subspace clustering helps by mining the clusters present in only 

locally relevant subsets of dimensions. However, understanding the result of subspace

clustering is not trivial for analysts. In addition to the grouping information, relevant

sets of dimensions and overlaps between groups, both in terms of dimensions and records, 

need to be analyzed. In Section 5.1, we present an interactive visualization system called 

ClustNails to analyze, navigate, relate, and understand subspace clustering results. Real 

world data sets are used to demonstrate the functionality of the system. 

Additionally, high-dimensional data spaces often consist of combined features that measure

different properties, in which case the particular relationships between the various

properties may not be clear to the analysts a priori since they can only be revealed if appropriate

feature combinations (subspaces) of the data are taken into consideration. Considering

just a single subspace is, however, often not sufficient since different subspaces may show

complementary, conjoint, or contradicting relations between data items. Useful information

may consequently remain embedded in sets of subspaces of a given high-dimensional

input data space.

Relying on the notion of subspaces in Section 5.2, we propose a novel method for the 

visual analysis of high-dimensional data in which we employ an interestingness-guided 

subspace search algorithm to detect a candidate set of subspaces. Using properly defined

subspace similarity functions, we provide an interactive exploration environment to compare

and relate subspaces with respect to their topological similarities and dimension 

similarities. Real and synthetic data sets are used to demonstrate our approach. 

Parts of this chapter appeared in the following publications [135, 136]. 

5.1 Visual Exploration for Subspace Clustering 

In this section, we introduce a visual subspace cluster analysis system called ClustNails. It 

integrates several novel visualization techniques with various user interaction facilities to 

support the navigation and interpretation of subspace clustering results. We demonstrate 

the effectiveness of the proposed system by analyzing real world data sets and comparing

it to other existing visual subspace cluster analysis systems.

This section is organized as follows. In Section 5.1.1, we elaborate on the aspects that motivated

our research in this area. In Section 5.1.2, we introduce the subspace clustering

problem and point to important overview articles in this area. We also explain in Section

5.1.3 the challenges in designing effective visualization tools for subspace clustering

analysis tasks. In Section 5.1.4, we provide an overall view of the system as well as detailed 

visualization and ordering techniques. In Section 5.1.5, we validate the system with real 

world data sets and compare it with a state of the art system, and Section 5.1.6 concludes. 

5.1.1 Motivation 

Clustering is one of the most prominent techniques used to analyze large and complex data 

sets, and visualization is often helpful in understanding the output of a given clustering 

method. A clustering algorithm assesses the relationships among objects of a data set by 

organizing objects into clusters, such that objects within a cluster are similar to each other 

but dissimilar from objects in other clusters. Clustering has a wide range of applications

in areas such as business intelligence, pattern recognition, image or document analysis, 

and bioinformatics. With the fast development of modern technologies, vast amounts of 

high-dimensional data are generated. This poses new challenges for clustering that require 

specialized solutions. 

The need for subspace clustering stems from the well-known “curse of dimensionality”, 

that is, the enormous challenges that arise in data analysis whenever the data under 

analysis has a high number of dimensions. As the number of dimensions grows, relations 

among data points become more complex and interesting patterns become harder to uncover. 

Computation also becomes an issue as the number of combinations increases steeply

with data dimensionality.

The need for subspace clustering derives essentially from two distinct but related issues: 

(1) how similarity among data items changes as the number of data dimensions grows

and (2) the relevance of different dimensions in different clusters.

Several studies have analyzed the strange behavior similarity functions have in high-dimensional

data [28, 69]. In summary, they are organized around the problem of finding

the nearest and farthest points to a given query point and show that as the number of data

dimensions increases the difference between the two does not increase as fast as the distance

to the nearest point. That is:

\[
\lim_{d \to \infty} \frac{\mathit{dist}_{\max} - \mathit{dist}_{\min}}{\mathit{dist}_{\min}} = 0, \tag{5.1}
\]

meaning that the discrimination between the nearest and farthest points becomes irrelevant. 

In turn, this has the effect that a progressive degradation of the quality of data

clustering can be expected because distances between data points become progressively 

meaningless. 
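
The effect described by Equation (5.1) can be observed in a small simulation. The following is a minimal sketch, not taken from the cited studies: it assumes points drawn independently and uniformly from the unit hypercube and uses the Euclidean distance, and the function name `relative_contrast`, the sample size, and the seed are illustrative choices.

```python
import math
import random


def relative_contrast(num_points, dim, rng):
    """Return (dist_max - dist_min) / dist_min for one random query point."""
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(num_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance (assumption)
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
contrasts = {d: relative_contrast(200, d, rng) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"d={d:5d}  relative contrast={c:.3f}")
```

Under these assumptions the printed ratio shrinks as d grows, which is the behavior the limit expresses; other data distributions or distance functions may behave differently.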

The second problem is related to the fact that clusters are often present only in subsets 

of dimensions of the original data space, and this is of course more probable when the 

number of dimensions is high. These clusters might be hard to detect when considering the

whole data space because the irrelevant dimensions can introduce noise and fool the clustering algorithm. This

effect can be explained through a simple diagram like the one shown in Figure 5.1.

The figure shows the distribution of data points in a 3D space and illustrates the 

concept of a subspace cluster – given three dimensions x, y, and z, clusters may exist

in different subspaces. A standard clustering algorithm like k-means would have problems

finding the clusters because they are not clearly separated in the 3D space. But, 

when considering 2D projections of these data respectively on (x, y), (x, z) and (y, z) the 

clusters become apparent. Subspace clustering techniques aim to find these clusters that 

might otherwise remain hidden if a traditional clustering algorithm was applied. Subspace 

clustering gives for each cluster (1) the objects belonging to the cluster, and (2) the subset 

of dimensions that constitute the cluster. Based on the type of subspace clustering 

method, there exist two forms of output: a partitioning of the data into separate clusters 

and clusters allowing for overlapping elements. Overlap may also exist between the sets 

of dimensions constituting the clusters. 

Figure 5.1: Data projected in several subspaces. 

Designing effective visualizations to help analyze the clustering result is not trivial. In

addition to the cluster membership information, the relevant sets of dimensions and the

overlaps of memberships and dimensions need to be considered. Although a number of

techniques (e.g., parallel coordinates [55, 78], scatterplot matrices [17], heat maps [47]) 

exist for visualizing traditional clustering results, little research has been carried out for 

visualizing subspace clustering results. There is a need for effective systems that allow the

comparison and analysis of clusters in arbitrary subspace projections, supporting overview 

and in-depth study of the subspace clustering results. 

In this section, we present ClustNails, a novel visualization system for mining subspace 

clusters and analyzing the results. The system takes high-dimensional data as input, and 

applies a user-selectable subspace clustering algorithm from a set of algorithms, to group 

the objects into clusters. The system displays the subspace clustering results using two appropriately 

designed visual representations – Spikes and HeatNails. These representations 

support the interpretation of the result of subspace clustering algorithms by visualizing 

characteristics of the clustering results from different perspectives. Appropriate ordering

techniques are integrated with the visualization to help extract meaningful patterns

from the clustering results. 

The main contributions of this section are: 

• an integrated data analysis and visualization tool for mining patterns in multidimensional 

data using subspace clustering algorithms; 

• a characterization of subspace cluster analysis tasks and the resulting design space; 

• two novel visualization techniques, Spike and HeatNail, for analyzing subspace clustering 

results; 

• appropriate ordering techniques for pattern extraction. 

5.1.2 Subspace Clustering Algorithms 

Given a set X of data points in some multidimensional space D, a subspace clustering

algorithm aims to find a subset X_k of data points together with a subset D_k of dimensions

such that the points in X_k are closely clustered in the subspace spanned by the dimensions D_k.
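
As a toy illustration of this definition, the sketch below builds a point set X_k that is tight in two dimensions and spread out in a third, and then recovers D_k by thresholding the per-dimension standard deviation. This is only an assumption-laden example: the helper `relevant_dimensions`, the threshold `max_spread`, and the synthetic data are hypothetical and do not correspond to any particular subspace clustering algorithm.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed, chosen arbitrarily

# X_k: 50 points that are tight in dimensions 0 and 1 but uniform in dimension 2.
cluster = [
    (rng.gauss(0.3, 0.02), rng.gauss(0.7, 0.02), rng.random())
    for _ in range(50)
]


def relevant_dimensions(points, max_spread):
    """Return (dimensions with std. deviation below max_spread, all spreads)."""
    dims = range(len(points[0]))
    spread = {j: statistics.pstdev([p[j] for p in points]) for j in dims}
    return [j for j in dims if spread[j] < max_spread], spread


d_k, spread = relevant_dimensions(cluster, max_spread=0.1)
print("D_k =", d_k)
print({j: round(s, 3) for j, s in spread.items()})
```

The points are close to each other only when dimension 2 is ignored, so a full-space distance would largely hide the grouping.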

The most critical part of subspace clustering is the subspace generation. Given a 

d-dimensional space, there are 2^d possible subsets of dimensions. It is computationally

infeasible to examine each possible subset to find subspaces of interest for a predefined

pattern. Since this is clearly not a viable way, every algorithm is based on some kind

of heuristic that speeds up the search in such a huge combinatorial space. A number of

subspace clustering algorithms with strategies for narrowing down the search space have

been proposed in the past, and some of them are enumerated in Section 2.3.1. As suggested

by Parsons et al. [110], the existing algorithms can be categorized into bottom-up and 

top-down strategies. 

The bottom-up approaches exploit a so-called “downward closure property” (or

monotonicity property), which means that if a subspace S contains a cluster, then any subspace

T ⊆ S must also contain a cluster. The property is used for pruning: if a subspace T

does not have high enough density, then any superspace S with T ⊆ S can be excluded from

the search space. A common implementation of a bottom-up approach starts from

one-dimensional dense subspaces, iteratively considering an increasing number of dimensions

and combining the dense units that are adjacent until no more new dense units are found. 

A typical algorithm will have three major steps:

1. generate highly dense units (subspaces) using an Apriori-like approach;

2. assign cluster membership to each object;

3. remove outliers whose distance to the cluster center exceeds a critical value.
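
The first of these steps, together with the pruning based on the downward closure property, can be sketched as follows. This is a simplified, grid-based illustration and not the implementation of any specific published algorithm: it assumes data scaled to [0, 1), a fixed number of equal-width intervals per dimension (`bins`), and an absolute density threshold (`tau`); all function and parameter names are hypothetical. Steps 2 and 3 are not covered.

```python
from itertools import combinations


def cell(point, dim, bins):
    """Index of the interval that `point` falls into along dimension `dim`."""
    return min(int(point[dim] * bins), bins - 1)


def support(unit, points, bins):
    """Number of points inside `unit`, a frozenset of (dimension, interval) pairs."""
    return sum(all(cell(p, d, bins) == b for d, b in unit) for p in points)


def dense_units(points, bins, tau):
    """Return {k: set of dense k-dimensional units}, grown one dimension at a time."""
    level = set()
    for d in range(len(points[0])):
        for b in range(bins):
            unit = frozenset([(d, b)])
            if support(unit, points, bins) >= tau:
                level.add(unit)
    result, k = {}, 1
    while level:
        result[k] = level
        candidates = set()
        for u, v in combinations(level, 2):
            joined = u | v
            # Keep only joins that add exactly one new dimension.
            if len(joined) != k + 1 or len({d for d, _ in joined}) != k + 1:
                continue
            # Downward closure: every k-dimensional projection must be dense.
            if all(joined - {pair} in level for pair in joined):
                candidates.add(joined)
        level = {c for c in candidates if support(c, points, bins) >= tau}
        k += 1
    return result


points = [
    (0.10, 0.10, 0.1), (0.15, 0.12, 0.4), (0.12, 0.18, 0.6), (0.18, 0.15, 0.9),
    (0.60, 0.90, 0.3), (0.90, 0.40, 0.7),
]
result = dense_units(points, bins=4, tau=3)
for k, units in result.items():
    print(k, sorted(sorted(u) for u in units))
```

In this example the four clustered points are dense only in dimensions 0 and 1, so no three-dimensional candidate is ever generated; the pruning avoids counting supports for subspaces that cannot be dense.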

The top-down approach starts with an initial configuration where data is clustered using 

the full feature space with equally weighted dimensions. Each dimension is assigned a 

weight for each cluster to characterize the relevance of the dimension to the cluster. Subsequently 

the annotated clusters are re-clustered taking into account the weights assigned 

in the preceding step. Typically sampling techniques are used to improve performance as 

the approach involves multiple iterations of re-clustering in the full set of dimensions. 

All of these approaches require some kind of parametrization. Bottom-up approaches

generally require specifications of threshold densities and bin size. Top-down approaches 

require a specification of the desired number of clusters (similar to k-means) and the 

average number of dimensions included in a subspace. 

In this chapter we use Proclus, which is one of the most established algorithms and 

has demonstrated advantages over a number of subspace clustering techniques [102]. Proclus 

[4] takes a top-down approach and extends the traditional k-medoid clustering algorithm. 

The k-medoid algorithm starts with an initial partition and then iteratively assigns 

objects to medoids, computes the quality of clustering, and improves the partition and 

medoid. Proclus extends k-medoid by associating medoids with subspaces and improves 

both partitions and subspaces iteratively. 

Taking two input parameters, the number of clusters k and the average number of dimensions

l, the algorithm proceeds in three phases. (1) In the initialization phase the set of

k medoid candidates is selected, by picking a representative sample from the entire data 

and choosing the medoids from the representatives by using a greedy method. (2) In the 

iterative phase the medoids are improved and a subspace for each medoid is computed. 

This is done by going through the following steps. First a random set of k medoids is 

selected from the representatives and the optimal set of dimensions is determined for each 

medoid. Then all the objects are assigned to the nearest medoid. If the current clustering

is better than the previous one, then it is kept. The bad medoids are then determined and replaced

with random representatives, and these steps are repeated until the clustering no longer

changes. (3) In the last phase, the cluster refinement phase, once the best

medoids are found, the clustering is improved by determining optimal dimension sets for 

the medoids and reassigning the objects to clusters. Algorithm 1 presents the pseudocode 

from [4] describing the algorithmic steps in more detail. 

A number of reviews and surveys exist to compare and classify the subspace clustering 

approaches. The survey mentioned above by Parsons et al. [110] organizes the techniques 

in a hierarchy of algorithmic strategies and provides a small experiment on representative

algorithms of each class. Kriegel et al. present a more thorough systematization 

and updated survey [90], where the broader problem of clustering high-dimensional data 

is discussed. The recent work of Müller et al. [102] presents a systematic and unique 

evaluation of subspace clustering algorithms in terms of quality of generated output and 

performance. According to [102], Proclus is one of the best partitioning algorithms and 

has a good runtime compared to other techniques. We rely on this and use Proclus in our 

experiments.


Algorithm 1 PROCLUS(No. of Clusters: k, Avg. Dimensions: l)

{C_i is the i-th cluster}
{D_i is the set of dimensions associated with cluster C_i}
{M_current is the set of medoids in current iteration}
{M_best is the best set of medoids found so far}
{N is the final set of medoids with associated dimensions}
{A, B are constant integers}

/* 1. Initialization Phase: select set of k medoid candidates */
S = random sample of size A · k
M = GREEDY(S, B · k)

/* 2. Iterative Phase: improve medoids and compute subspace for each medoid */
BestObjective = ∞
M_current = random set of medoids {m_1, m_2, ..., m_k} ⊂ M
repeat
    /* Approximate the optimal set of dimensions */
    for each medoid m_i ∈ M_current do
        Let δ_i be the distance to nearest medoid from m_i
        L_i = points in sphere centered at m_i with radius δ_i
    end for
    L = {L_1, ..., L_k}
    (D_1, D_2, ..., D_k) = FindDimensions(k, l, L)
    {Form the clusters}
    (C_1, ..., C_k) = AssignPoints(D_1, ..., D_k)
    ObjectiveFunction = EvaluateClusters(C_1, ..., C_k, D_1, ..., D_k)
    if ObjectiveFunction < BestObjective then
        BestObjective = ObjectiveFunction
        M_best = M_current
        Compute the bad medoids in M_best
    end if
    Compute M_current by replacing the bad medoids in M_best with random points from M
until (termination criterion)

/* 3. Cluster Refinement Phase: improve quality of the partitions and subspaces */
L = {C_1, ..., C_k}
(D_1, D_2, ..., D_k) = FindDimensions(k, l, L)
(C_1, ..., C_k) = AssignPoints(D_1, ..., D_k)
N = (M_best, D_1, D_2, ..., D_k)
return N
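
To make one step of Algorithm 1 concrete, the computation of the localities L_i in the iterative phase can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the full-dimensional Euclidean distance and treats the sphere as closed (points at distance exactly δ_i are included); the function name `localities` is hypothetical, and FindDimensions, AssignPoints, and EvaluateClusters are not reproduced here.

```python
import math


def localities(points, medoids):
    """For each medoid m_i, collect the points within delta_i of m_i, where
    delta_i is the distance from m_i to its nearest other medoid."""
    result = []
    for i, m in enumerate(medoids):
        delta = min(
            math.dist(m, other) for j, other in enumerate(medoids) if j != i
        )
        result.append([p for p in points if math.dist(m, p) <= delta])
    return result


points = [(0, 0), (1, 1), (3, 0), (4, 0), (5, 1), (0, 10), (1, 9), (9, 9)]
medoids = [(0, 0), (4, 0), (0, 10)]
for m, loc in zip(medoids, localities(points, medoids)):
    print(m, "->", loc)
```

Each locality serves as the sample from which the relevant dimensions of the corresponding medoid are estimated, so localities of different medoids may overlap.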

5.1.3 Task Definition and Design Space for Visual Subspace Cluster Analysis 

Subspace cluster visualization remains a challenging task due to the multiple types of 

information contained in subspace clustering results such as subspaces, cluster membership 

of objects, and overlap between subspaces and clusters. Existing subspace visualization 

techniques have been detailed in Section 2.4.2. To develop effective visualization systems

for subspace cluster analysis, it is necessary to take into consideration the different tasks

that are involved in the data analysis and use them as a basis for exploring the design space.

Next, we describe the main tasks that an appropriate subspace cluster visualization technique

needs to address and, therefore, provide a generic and reusable characterization. We also

analyze the design space and provide: (1) a classification, and (2) a reasoned analysis of

common design alternatives, from which a baseline design space is derived. This analysis

serves as a baseline not only for the design of our proposed subspace cluster visualization

system, but also allows comparison with existing approaches and the identification of empty areas in this

design space for future work. 

Scope of Subspace Cluster Analysis 

Clustering abstracts a larger data set to a smaller number of groups that are presumably 

more amenable to analysis and interpretation. Standard clustering algorithms rely on a 

fixed set of dimensions used in the similarity function of the clustering algorithm. Typically, 

the selection of dimensions is done outside of the clustering algorithm. Subspace 

clustering methods, on the other hand, provide an extended output, including also the set 

of dimensions relevant to finding the groups, possibly described with weights indicating 

the importance of each dimension for the found result. Depending on the subspace method 

used, there can be an overlap between dimensions and records between the clusters. In 

principle, analysis of the subspace clustering can be done without considering the identified 

dimensions. In our work, we are interested in jointly analyzing the clustering results 

and the sets of selected dimensions, to provide enhanced analysis capabilities. 

Tasks 

Analyzing the properties and relationships within and among clusters is an important task

in cluster analysis. We break this general analysis task down into a series of subtasks:

T1 Reveal properties of individual clusters 

When analyzing clustering output it is necessary to understand the main features of 

each generated cluster. In particular, once the clustering output has been generated 

and a visualization is constructed to represent it, it is necessary to perceive the 

following information: 

T1.1 How many records does the cluster contain? 

T1.2 How many dimensions are involved and what are their weights? 

T1.3 How are the data values distributed in each of the contained dimensions? (homogeneity 

of cluster members, central and outlier elements, subgrouping of 

clusters)


T2 Enable cluster comparison 

Once the output has been considered and each cluster has been characterized visually, 

it is important to display the information in a way that meaningful comparisons can 

be made among the clusters. It is important to understand how similar (or distant) 

clusters are, which translates into: 

T2.1 How do clusters differ with respect to contained records and involved dimensions?

T2.2 Is there overlap between records and dimensions or are they distinct? 

T3 Indicate the quality of the generated cluster output 

Subspace clustering algorithms, as many methods that work on multidimensional 

spaces, are heavily based on heuristics and are dependent on parameterization. For 

this reason, clustering outputs are not always optimal. Even if research in subspace 

clustering has largely improved the clustering quality, it is still important to be able 

to judge the output quality by considering the following: 

T3.1 How good is the clustering quality produced by a given algorithm? 

T3.2 How sensitive is the output with respect to parameter variations? 

We take these task considerations as a baseline for developing the ClustNails system 

presented in the next section. While we have not formally evaluated the degree to which

ClustNails fulfills each of these criteria, we find that they are at the core of the functionality

that ClustNails offers.

Design Space 

In terms of the previously described tasks, the information entities of interest to be visualized 

are: Elements (data records, clusters, dimensions), Relationships (membership of 

records in clusters, clusters overlap with respect to records and dimensions), Attributes 

(cluster size, dimension distribution, dimension weight, etc.) 

We identify two main categories of visualization solutions for the representation of the 

subspace clustering output: Cluster-Centric (CC) and Data-Centric (DC). Cluster-centric 

solutions put their focus on the representation of the clusters first, with the intent to allow 

their comparison. Data-centric solutions put their focus on the representation of the data 

values with the intent to ease the interpretation of each cluster in terms of their internal 

distributions. 

There is a natural tension between these two extremes. Cluster-centric solutions scale 

much better in terms of number of data items and dimensions. Their higher level of 

abstraction allows an easier comparison between the cluster features, however, at the 

expense of limiting their interpretation. On the contrary, data-centric views ease cluster 

interpretation but do not scale very well with respect to data size and dimensionality. 

In our analysis of the design space, we explored several alternative visual designs and 

isolated some basic ones for both approaches. Discussing them briefly helps to better

motivate our proposed final solution. 

Record-Centric Designs 

In record-centric designs each visual item represents a record. A 2D scatterplot projection 

is often used as a way to identify clusters of data elements in traditional clustering,

however, it is not clear how to extend this design in a way that information about cluster 

dimensions is included. Parallel coordinates plots (PCP) could in principle be extended 

to represent subspace clusters by drawing lines between adjacent axes only when these 

belong to the cluster being drawn. But this generates complicated ordering problems with 

potential extreme cases where the polyline of a whole record might not be drawn at all 

because its axes are never adjacent. Also, PCP do not scale well to data of even moderate 

dimensionality, which in turn is the main focus of subspace clustering. Heat maps (or matrix/tabular 

representations) can be extended more easily by using different color scales for

included and not included dimensions. In addition, their design allows for easy reordering 

of records and dimensions so that the structure of the clusters can be more easily perceived. 

Cluster-Centric Designs 

In cluster-centric designs each visual item represents a cluster. A 2D scatterplot projection 

is possible, as the one presented in VISA [14] (see also Figure 5.7). The clusters are 

projected with MDS, or similar techniques, taking into account their similarity according 

to some predefined criteria (e.g., shared number of dimensions). This solution permits

grouping clusters according to their similarity, but their visibility and understanding is often

hindered by the amount of overlap the items have. A matrix comparing one cluster to

another in terms of their shared dimensions and records is also possible, but its effectiveness

depends on how well rows and columns are ordered; moreover, it is not necessarily the most

compact design. Finally, icons or glyphs can be used to provide a rich representation 

of each cluster in a way that every single icon can provide information about cluster 

dimensions, records and weights in an integrated fashion.
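
The matrix-based comparison mentioned above can be illustrated with a short sketch. It is only an example and not part of ClustNails: the Jaccard index is our assumption for quantifying the overlap of shared records and shared dimensions, and the cluster contents and the names `jaccard` and `overlap_matrix` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_matrix(sets):
    """Pairwise overlap between all sets, as a list of rows."""
    return [[jaccard(a, b) for b in sets] for a in sets]


# Three hypothetical subspace clusters: record identifiers and dimension names.
records = [{1, 2, 3, 4}, {3, 4, 5}, {6, 7}]
dimensions = [{"D0", "D3"}, {"D3", "D5"}, {"D0", "D3"}]

record_overlap = overlap_matrix(records)
dimension_overlap = overlap_matrix(dimensions)
for row in record_overlap:
    print([round(v, 2) for v in row])
```

In this example the first and third clusters share no records but live in the same subspace, a relation that is visible only when both matrices are inspected together; how readable such a matrix is still depends on the ordering of its rows and columns.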

In ClustNails we integrate the best of the two approaches in a multiple views user 

interface (see Figure 5.5). A cluster-centric view based on sorted icons provides support for 

cluster understanding and comparison (T1.1, T1.2, and T2.2). A data-centric view based 

on sorted and compressed heat maps provides support in interpreting and comparing the 

clusters in terms of their data distribution (T1.3, T2.1, T2.2). All of them in turn help 

interpret the quality of the generated output (T3.1 and T3.2). In the following section,

we describe the whole system and its views in detail. 

5.1.4 The ClustNails System 

ClustNails is designed as an interactive visualization tool for subspace clustering analysis. 

It integrates a number of subspace clustering algorithms with novel visual representations 

and ordering techniques to help analysts generate subspace clusters from multidimensional 

data and identify interesting patterns from the visualization models. We next provide 

an overview of the design and main functionalities of the system, as well as a detailed 

description of the visualization and ordering techniques applied. 

Overview 

ClustNails integrates the OpenSubspace library of Weka [106] that contains a range of 

subspace clustering algorithms including Clique, Doc, Fires, Proclus, MineClus, INSCY, 

P3c, Schism, Statpc, and Subclu. The system takes multidimensional data as input, 

clusters the objects using a user-selected subspace clustering algorithm, and displays the

clustering result in a multi-view user interface. A number of ordering functions allow the 

analyst to examine the results and compare clusters from different perspectives. Various 

user interactions are added to allow the user to select clustering algorithms, parameters, 

and the order of the clustering results in the visualization panels. A linking-and-brushing 

function is implemented such that dimensions/clusters of interest can be highlighted in 

different views. By placing the mouse cursor over an item (record, dimension, or cluster) 

in the visualization panel, the analyst can see detailed information of the item in a tooltip. 

[Figure 5.2 (diagram): in the data space, a high-dimensional data table (records x1 to xm, dimensions D0 to D9) is passed to a subspace clustering algorithm; in the visual space, the resulting subspace clusters are shown in the subspace cluster view, followed by cluster and dimension ordering.] 

Figure 5.2: Workflow of subspace cluster analysis using the ClustNails system. 

Figure 5.2 illustrates the workflow supported by our tool. Figure 5.2 (left) shows 

that the system loads a d-dimensional data set as input and a user-selected clustering 

algorithm computes the subspace clusters, provided as a list of clusters, each containing a 

subset of records and a subset of dimensions. Figure 5.2 (middle) shows that each cluster 

is quantified in terms of the number of instances and associated number of dimensions; 

this information, together with the records for each subspace cluster is visualized in a 

multiple view visualization panel, which includes a Spikes view for cluster-centric analysis 

(top), and a HeatNails view for record-centric analysis (bottom). Figure 5.2 (right) shows 

that the order of clusters, dimensions and records can be rearranged in each view for 

easy comparison between clusters. Next, we describe the different views and supported 

ordering strategies. 

Visualization Components 

Visualization of Clusters: the Spikes View 

The Spikes view is a cluster-oriented view and provides a matrix of thumbnails, each 

representing a subspace cluster. Each cluster is visualized in a circular area that contains 

radial spikes. The spikes represent the individual dimensions (the subspace) that define 

the given cluster, and the spike length is scaled according to the weight (importance) of a 

dimension for the cluster (see below for the definition). The radial dimension sequence is 

identical for each spike-glyph. The number of records in the cluster is represented by the 

area size of the inner circle. 

Subspace clustering algorithms provide as output a subset of dimensions $D_k$ for each 

cluster $SC_k$, as well as the set of instances (records) of this cluster $X_k$. Given a dimension 

$m$ within the set of dimensions $D_k$ in a subspace cluster $SC_k$, we define the weight of that 

dimension in that cluster as: 

$$w^k_m = \frac{\sum_{x^m_i \in X^m_k} |x^m_i - c^m_k|}{|X_k|}, \tag{5.2}$$

where $c^m_k$ is the center of the points in $X_k$ along the dimension $m$, $x^m_i$ the value in dimension 

$m$ of the point $x_i$ of this cluster and $|X_k|$ the number of elements in $SC_k$. The smaller 

$w^k_m$ is, the more compact are the points around the center in dimension $m$. This implies 

that dimensions with smaller weights have better clustered points and are defined as more 

important for a cluster. We normalize the weights $w^k_m$ for all dimensions of all clusters to 

the interval [0, 1] and map the corresponding values inversely to the length of the spike. 

The lower $w^k_m$ (the more important the dimension), the longer the corresponding spike. 

Note that owing to our definition of $w^k_m$, the relationship between weights and importance 

is inverse, and we reflect this by an inverse mapping between weights and size of the visual 

attribute (the spikes). Also note that in case the given subspace cluster algorithm natively 

outputs weights for each dimension, those weights can also be mapped to the spike length 

instead. 
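
As a minimal sketch of Equation 5.2 and the inverse spike mapping, the following Python snippet computes the weights of one cluster and converts the weights of all clusters into spike lengths. The function names, the NumPy array layout (records as rows, dimensions as columns), and the use of the per-dimension mean as the cluster center are assumptions made for illustration; they are not taken from the ClustNails implementation.

```python
import numpy as np


def dimension_weights(data, records, dims):
    """Mean absolute deviation from the cluster center per cluster dimension (Eq. 5.2).

    data    : 2D array, rows = records, columns = dimensions (assumed layout)
    records : indices of the records X_k of the cluster
    dims    : indices of the dimensions D_k of the cluster
    """
    sub = data[np.ix_(records, dims)]      # |X_k| x |D_k| block of the cluster
    center = sub.mean(axis=0)              # assumption: center = per-dimension mean
    return np.abs(sub - center).sum(axis=0) / len(records)


def spike_lengths(weights_per_cluster):
    """Normalize all weights jointly to [0, 1] and invert them.

    A low weight (compact, important dimension) yields a long spike.
    """
    all_w = np.concatenate(weights_per_cluster)
    lo, hi = all_w.min(), all_w.max()
    span = hi - lo if hi > lo else 1.0     # guard against identical weights
    return [1.0 - (w - lo) / span for w in weights_per_cluster]
```

Normalizing over the weights of all clusters at once, rather than per cluster, keeps spike lengths comparable between glyphs, which is what the text above describes.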

Figure 5.3: Two subspace clusters visualized as spikes. The clusters share common dimensions 

but the importance of the dimensions for the clusters is different. Dim29 and dim32 in the left 

cluster show shorter spikes than in the right cluster, as they are considered less important for the 

definition of that cluster according to our measure $w^k_m$. Furthermore, the left cluster has fewer 

dimensions and more objects than the right cluster. 

The visual representation for each subspace cluster is a circle in the Spikes view. Each 

spike in a circle represents a dimension contained in that subspace. The length of the 

spike represents the weight of the dimension for that particular cluster (the longer, the 

more important). The order of the dimensions is identical for each cluster. The area of 

the inner circles indicates the number of records within each cluster. Figure 5.3 illustrates 

the Spikes view. 

The resulting Spikes view allows users to quickly recognize overlapping dimensions 

by comparing the spike patterns of the different clusters. To support this comparison, the 

background is divided into pie segments and colored alternately with two colors (gray and light 

red). This supports the comparison of the spike angles in two different clusters. 

Visualization of Records: the HeatNails View 

The HeatNails view is an extended heat map displaying the data values and dimensions. 

Rows represent dimensions, and columns represent data items (records). Each HeatNail 

cell represents a data value of a record in one dimension. Data items are grouped by 

clusters. These clusters are aligned next to each other and separated by black lines. Data 

values are normalized globally and mapped to an appropriate color scale. A yellow-to-green 

color scale is used for dimensions that are members of the given cluster, while a 

gray scale is used for the remaining dimensions of the data set per cluster (see Figure 5.4


(bottom)). This allows for an effective visual perception of the distribution of values 

across dimensions, and the relation between dimensions and clusters with respect to their 

inclusion in the cluster definition. 

Figure 5.4: HeatNails visualization. Bottom: showing the distribution of dimension values for all 

dimensions (rows) and records (columns). Top: showing histograms for the values of all dimensions 

per cluster for comparison purposes. 

We also give a summary representation of the values of the dimensions occurring 

in the clusters. The distribution of dimension values of each cluster is discretized into a 

histogram and visualized by color (for dimensions included) and gray scales (for dimensions 

not included). This allows for easy comparison between clusters with respect to data 

values. Figure 5.4 (top) shows these histogram views. Finally, depending on the clustering 

algorithm, it is possible that records are members in multiple clusters. We illustrate this by 

marking the cluster IDs of multi-cluster members at the bottom of the display. In addition 

to the Spikes view, the HeatNails view also allows the quick recognition of overlapping 

dimensions across the clusters by means of the given color and gray-scale patterns. Both 

Spikes and HeatNails views incorporate linking-and-brushing functionality. Clicking on any 

set of dimensions/clusters of interest in one view highlights the same dimensions/clusters 

in all other views. 

Ordering Heuristics 

Ordering is implemented to support perception of structural similarities of clusters with respect 

to dimensions and value distributions. As ordering problems for clusters, dimensions, 

and records are typically complex NP-complete combinatorial optimization problems [9], 

we rely on heuristics to order dimensions, records, clusters, and values in the various displays. 

Our essential idea is to place similar or closely related objects together to help the 

analyst find interesting patterns. 

Dimension Ordering 

To find a global ordering of the dimensions, we compute a frequency value for each dimension, 

denoting the number of subspace clusters that are using this dimension. We order 

the list of dimensions by this frequency value starting the sequence of dimensions with the 

dimension that is most frequently used by the set of subclusters. The next positions are 

filled in the same way: the dimension that co-occurs most frequently with the previously 

positioned dimension is placed next. If a co-occurrence is not found, the most frequent 

dimension from the remaining dimensions is positioned next in the ordering vector. The


dimension ordering can be applied to both the Spikes view and HeatNails view. 
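
The dimension ordering heuristic described above can be sketched as follows. This is an illustrative reading of the text, not the ClustNails code: the function name, the input format (one collection of dimension indices per subspace cluster), and the tie-breaking rule (first candidate in input order wins) are assumptions.

```python
from collections import Counter
from itertools import combinations


def order_dimensions(cluster_dims, all_dims):
    """Greedy global dimension ordering.

    cluster_dims : list of iterables, the dimensions D_k of each subspace cluster
    all_dims     : iterable of all dimensions of the data set
    """
    # Frequency: number of subspace clusters that use a dimension.
    freq = Counter(d for dims in cluster_dims for d in set(dims))

    # Co-occurrence: number of clusters in which two dimensions appear together.
    cooc = Counter()
    for dims in cluster_dims:
        for a, b in combinations(sorted(set(dims)), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1

    remaining = list(all_dims)
    current = max(remaining, key=lambda d: freq[d])   # most frequent dimension first
    order = [current]
    remaining.remove(current)

    while remaining:
        # Prefer the dimension that co-occurs most often with the last placed one.
        best = max(remaining, key=lambda d: cooc[(current, d)])
        if cooc[(current, best)] == 0:
            # No co-occurrence found: fall back to the most frequent remaining one.
            best = max(remaining, key=lambda d: freq[d])
        order.append(best)
        remaining.remove(best)
        current = best
    return order
```

For example, with the clusters `[[2, 4], [2, 4, 0], [2, 1], [3]]` over dimensions 0 to 4, dimension 2 is placed first because it is used by three clusters, followed by dimension 4, which co-occurs with it twice.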

Subspace Cluster Ordering 

A useful visual representation of subspace clustering results should arrange similar subspaces 

next to each other to reduce visual search time by the user. We propose an ordering 

strategy that is formalized in the following. Using the dimension weights defined in Equation 

5.2, we propose a measure for the global interestingness $I^g_{SC_k}$ of a cluster $SC_k$: 

$$I^g_{SC_k} = \frac{\sum_{m \in D_k} w^k_m}{|D_k|}, \tag{5.3}$$

where $w^k_m$ is the weight of dimension $m \in D_k$ of $SC_k$, and $|D_k|$ is the number of dimensions 

in this subcluster. We define the global interestingness of a cluster $k$ as the average of the 

weights of the dimensions contained in this subcluster. This measure is used to determine 

the first cluster in the ordering. We then use the subspace cluster distance (eq. 5.4) 

employed in [14] to find the most similar cluster, which is placed next to the initial cluster. 

This distance function is a convex sum of subspace distance and object distance: 

$$\beta \left(1 - \frac{|D_i \cap D_j|}{|D_i \cup D_j|}\right) + (1 - \beta) \left(1 - \frac{|X_i \cap X_j|}{\min\{|X_i|, |X_j|\}}\right) \quad \text{[14]} \tag{5.4}$$

where $|D_i \cap D_j|$ is the number of common dimensions of the two subspaces $i$ and $j$, and 

$|X_i \cap X_j|$ the number of shared objects of the two subspaces. We continue this placement 

until all clusters are placed. 
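
The cluster ordering can be sketched in Python as a greedy chain built on Equation 5.4. Several details are assumptions because the text leaves them open: the starting cluster is taken to be the one with the lowest average weight (Equation 5.3), each next cluster is the one closest to the most recently placed cluster, and the default value of `beta` is arbitrary. All function and parameter names are hypothetical.

```python
def subspace_cluster_distance(dims_i, recs_i, dims_j, recs_j, beta=0.5):
    """Convex combination of subspace distance and object distance (Eq. 5.4).

    Assumes both clusters have at least one dimension and one record.
    """
    d_i, d_j = set(dims_i), set(dims_j)
    x_i, x_j = set(recs_i), set(recs_j)
    dim_dist = 1 - len(d_i & d_j) / len(d_i | d_j)
    obj_dist = 1 - len(x_i & x_j) / min(len(x_i), len(x_j))
    return beta * dim_dist + (1 - beta) * obj_dist


def order_clusters(clusters, interestingness, beta=0.5):
    """Greedy cluster ordering.

    clusters        : list of (dims, records) pairs, one per subspace cluster
    interestingness : list of I^g values (Eq. 5.3), one per cluster
    """
    remaining = list(range(len(clusters)))
    # Assumption: start with the lowest average weight (most compact cluster).
    current = min(remaining, key=lambda k: interestingness[k])
    order = [current]
    remaining.remove(current)
    while remaining:
        # Assumption: place the cluster most similar to the last placed one next.
        nxt = min(
            remaining,
            key=lambda k: subspace_cluster_distance(
                *clusters[current], *clusters[k], beta=beta
            ),
        )
        order.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return order
```

With `beta` close to 1 the ordering is driven by shared dimensions, with `beta` close to 0 by shared records.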

Record Ordering 

Two different types of record ordering strategies are implemented in HeatNails. One strategy 

is to order the records from min to max with respect to their values in the dimension 

that has the biggest variance, among all dimensions. A second strategy is to order the 

records according to the Euclidean distance across the contained dimensions of the given 

subspace, based on a selected starting record. The starting record, in turn, may either be 

user selected, or selected automatically as the record that shows the largest variance over 

all dimensions. 
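
Both record ordering strategies can be sketched with NumPy as shown below. The function names and the array layout (records as rows, dimensions as columns) are assumptions; the automatic starting record is interpreted here as the record whose values vary most across all dimensions, which is one possible reading of the text.

```python
import numpy as np


def order_by_max_variance_dim(data):
    """Strategy 1: sort records from min to max in the dimension with the largest variance."""
    dim = int(np.argmax(data.var(axis=0)))
    return np.argsort(data[:, dim], kind="stable")


def order_by_distance(data, dims, start=None):
    """Strategy 2: sort records by Euclidean distance to a starting record.

    dims  : dimensions of the given subspace
    start : index of the starting record; if None, it is chosen automatically
            (assumption: the record with the largest variance over all dimensions)
    """
    if start is None:
        start = int(np.argmax(data.var(axis=1)))
    sub = data[:, dims]
    dist = np.linalg.norm(sub - sub[start], axis=1)
    return np.argsort(dist, kind="stable")
```

Both functions return index arrays, so the same permutation can be applied to the columns of the heat map of a cluster.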

Value Ordering 

A value ordering facility is implemented in the HeatMap view and visible in the top 

summary row of the HeatNails view. In each row the distribution of values in a given 

dimension is shown. To that end, we sort the values from min to max, and bin them into 

a user-selectable number of bins. In this view the distribution of values per dimension 

and cluster is indicated in the form of a color-coded histogram. The histograms help in 

understanding the distribution of data values within each dimension, and may support 

finding out why a particular dimension was selected or not by the clustering algorithm. 
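
The binning step of the value ordering can be illustrated with a short sketch. It assumes that the globally normalized data values lie in [0, 1] and uses equal-width bins; the function name and both assumptions are for illustration only.

```python
import numpy as np


def value_histogram(values, n_bins, value_range=(0.0, 1.0)):
    """Sort the values of one dimension within one cluster and bin them.

    values      : 1D array of (assumed normalized) data values
    n_bins      : user-selectable number of bins
    value_range : assumed global value range after normalization
    """
    ordered = np.sort(values)                         # min-to-max value ordering
    counts, edges = np.histogram(ordered, bins=n_bins, range=value_range)
    return counts, edges
```

The resulting counts per bin could then be mapped to the color scale (member dimensions) or gray scale (non-member dimensions) of the corresponding histogram row.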

Summary and Discussion of the ClustNails System Design 

ClustNails is an integrated system for visual subspace cluster analysis. Its design features 

(1) a number of subspace clustering algorithms from which the user can choose and (2) a 

design of different visual representations for the most important aspects of the output of 

automatic subspace cluster analysis.


Regarding (1), we provide access to a number of state of the art algorithms as contained 

in the OpenSubspace library [106]. The list of integrated algorithms is extensive. 

Regarding (2), we composed a visual display of three aspects. The Spikes view is 

inspired by star glyphs and distinguishes clusters from each other, in terms of included 

dimensions. The radial basis shape in the Spikes view is visually dominant and allows fast 

perception of cluster properties. Sorting of the cluster glyphs by similarity relieves users 

(at least partially) of sequential visual search. The Spikes view is complemented by the 

HeatNails view, which is a dimension-oriented detail view that we provide in a coordinated 

view, below the cluster glyphs. The HeatNails view is based on the ideas of heat maps and 

the pixel-paradigm for showing the maximum possible information, allocating eventually 

only one pixel per record dimension (bottom view) or histogram bin per cluster dimension 

(top view). The overall layout of the three views follows an overview-first approach, from 

the most aggregate view at the top (the Spikes view of clusters) to the most detailed view 

(the HeatNails record view) on bottom. The histogram view showing the distribution of 

dimensions per cluster is located in the middle. 

We designed this integrated layout having the different subspace clustering output 

parameters in mind, and arranged them according to the level of detail provided. While 

we believe our system design is justified from these considerations, we recognize that 

other multidimensional visualization techniques do exist, which could be alternative views 

in our visualization layout. Parallel coordinates in conjunction with color-coding could be 

an option. A dedicated user study, as part of future work, could explore design alternatives 

and compare them with each other. 

5.1.5 Use Case and System Comparison 

We apply the ClustNails system to a real world data set, demonstrating its applicability 

and illustrating different types of analysis one can perform with it. Then we compare it 

with the state-of-the-art system VISA [14] to validate the effectiveness of the system and 

its design. 

Use Case: USDA Food Composition Data Set 

We analyzed the USDA Food Composition data set 1 that contains a full collection of 

raw and processed foods characterized by their composition in terms of nutrients. The 

data comprises more than 7000 records and 44 dimensions. We selected Proclus for the 

clustering task. As parameters we set the number of clusters to 15, and the average number 

of dimensions to 8. Figure 5.5 shows the result generated by the system with these settings. 

From Figure 5.5 we can see that clusters C11, C12, C13, and C14 (highlighted red) 

all share the same two dimensions water and calories, although the sizes of the clusters 

vary from 4 to 24 records. All the records share some common features: high water 

content and low calories. To gain more understanding of the clustering result, one 

can drill down to each record by checking the data table or detail-on-demand information 

displayed in tooltips upon mouse-over actions. It is not difficult to find out that these 

groups mostly consist of foods that are commonly regarded as “healthy”. Foods of similar 

nature, e.g., lima and mango beans, various types of low-fat dairy products, and soups are 

1 http://www.ars.usda.gov/

Figure 5.5: Visualization of the subspace clusters of the USDA Food Composition data set generated 

by Proclus. 

placed in the same groups, which means the clustering makes good sense. 

Using the value ordering function in the HeatNails, we can further explore the distribution 

of data values inside each cluster and look for interesting patterns (see Figure 5.6). 

We note that most of the data values in the dimensions not selected by Proclus have relatively 

large variance. This is not surprising as subspace clustering algorithms are typically 

designed to reduce the sparsity of data by discarding dimensions that have big variances. 

Figure 5.6: Sorted view (Value ordering function applied). 

Taking a look in the sorted view at how the same two dimensions are distributed along 

the other clusters, it is not difficult to identify clusters, like C10, which have similar trends 

over the two dimensions but have stronger patterns in other dimensions (exceptionally low 

values for both total lipids and proteins, discussed later), thus the two dimensions are not 

selected to characterize the cluster. These types of information are not only useful in 

helping to understand the cluster analysis result, but also add more transparency to the 

data mining algorithms, which are usually hidden from the user in black boxes. At a 

closer inspection, we can identify a cluster that also shares the two dimensions, but with 

an inverse trend, that is, low water content and high calories (C6). The detailed 

information reveals that this cluster represents a whole set of different candies (probably 

not the most recommendable food for a diet). 

Another interesting cluster is C10, which is characterized by an exceptionally low value 

for both total lipids and proteins. All the other records, excluding the ones in C1, have 

either consistently high values or higher variances in one of these two dimensions. They 

represent various kinds of beverages such as alcoholic beverages, teas, and fruit-based


toppings. C1 is characterized by the same trend but it forms a different cluster with 

exceptionally low values for other nutrients like various kinds of fats and vitamin B12. All 

the foods in C1 are again beverages. 

Comparing C10 to C1, one can notice that C10 has, in fact, a very similar distribution 

of values in the dimensions that are included in C1. This is a clear example in which the 

output of the algorithm is not optimal and a merge of these two would make sense. 

Comparison with VISA 

Figure 5.7: Visualization of the subspace clusters in VISA [14] framework discussed in Subsection 

5.1.5. Cluster view (left), record view (right). 

Figure 5.7 shows the representation in VISA [14] of the same subspace clusters as used 

for our above use case (same data set, same clustering result). As we can see, the 15 

clusters are projected in the cluster view to a 2D scatterplot using MDS based on their 

dimension similarity (left screenshot in Figure 5.7). Each cluster is represented by a circle 

scaled according to the cluster size. The record-centric view shows the result as a heat 

map (right screenshot in Figure 5.7), where rows represent records and columns represent 

dimensions. Different color codes are used in the heat map: black for unselected dimensions, 

brightness for interestingness, and hue for data values. We recognize the following 

benefits of the ClustNails design compared with VISA: 

• Overlap 

Circles of different sizes in the VISA MDS projection can cause occlusion problems 

and end up with over-cluttered displays. For example, only 9 out of 15 clusters 

are visible in the cluster view in Figure 5.7. The Spikes and HeatNails views avoid 

overlap. One may argue that scatterplots scale better, but in practice the number 

of clusters in a result is usually small, because a large number of clusters implies, 

in many cases, a poor performance of the clustering algorithm [90]. The scatterplot 

visualization, on the other hand, suffers from occlusion problems regardless of the 

number of clusters. Also, the ClustNails glyphs provide richer information for each 

cluster, as described next. 

• Richer information 

VISA shows only the number of records and dimensions of each cluster and maps the 

similarities between clusters to distances. The Spikes view in ClustNails extends this 

basic encoding by including additional information about each cluster, permitting a 

user to (1) draw richer information from the result and (2) detect and understand the

similarities between clusters more easily. Specifically, the spikes permit one to see the 

detailed dimensions and their corresponding importance in each subspace and thus 

to relate one cluster to another. The linking-and-brushing technique implemented 

in the Spikes view helps in highlighting the shared dimensions among clusters. 

• Ordering supports comparison 

The ClustNails ordering techniques place similar clusters, dimensions, and records 

close to each other. These techniques make it easier to detect similarities and dissimilarities 

between the clusters. No ordering technique is implemented in the 

current version of VISA; similarity of clusters can only be seen in the cluster view, 

represented by the 2D distance of clusters in the projection. 

• Scalability 

The heat map solution implemented in VISA is initially designed to display a limited 

number of records that belong to a small subset of clusters. The compression 

techniques we propose for the thumbnails view of HeatNails can scale up to a much 

larger number of records and thus are not limited to representing only a subset of the 

data. Subspace clustering algorithms can produce hundreds of subspace clusters for 

some parameter settings. To analyze and understand if the result makes sense, the 

clusters need to be displayed and compared. Our histogram views can be used to 

visualize this output; they can also be ordered linearly into more rows, or even a 

two-dimensional ordering heuristic can be developed to make the technique scale. 

• Non-member dimensions 

In VISA all data values in the unselected dimensions are colored in black; hence 

the information in these segments is missing from the visualization. This may be 

detrimental to data understanding as the information contained in those segments 

provides evidence of why the clustering algorithm did not select a given dimension 

to characterize the cluster. The algorithm choice can be justified if the visualization 

shows extreme values or has large variances in the unselected dimensions. Our design 

displays these dimensions in a gray scale so they can be used to understand the result. 

5.1.6 Conclusions and Future Work 

Subspace clustering addresses an important problem in clustering multidimensional data. 

The algorithms successfully reduce the noise in multidimensional data by showing clusters 

that exist only in subsets of dimensions in the data. Visualization of subspace clustering 

results is challenging. In addition to the information contained in traditional clustering 

results, subsets of dimensions that define clusters, and overlap between dimensions and 

records need to be represented in an understandable and uncluttered way. ClustNails 

was presented as an interactive data analysis and visualization tool for subspace clustering 

analysis. It provides several novel visualization and ordering techniques to help analysts 

extract subspace clusters from data and then analyze the results. The system implements 

linked and ordered cluster-centric (Spikes) and record-centric (HeatNails) views. We 

demonstrated the effectiveness of our system design in the analysis of real world data and 

a comparison with existing visual subspace cluster analysis systems. 

For future work one extension of the system is really needed – the support of parameter 

selection, which is a difficult problem given that each algorithm has its own parameters


and different settings may generate very different results. Another extension could be the 

development of a so called “agreement matrix” among a set of results that shows those 

parts that most results agree on. The agreement matrix could then be used to evaluate the 

quality of individual outputs and to help the analyst to understand the consensus made 

by different algorithms and parameter settings. Another future research direction might 

include improving the scalability of the ClustNails system. While we have not done a 

formal evaluation, we assume scalability is restricted to dozens of clusters and dimensions, 

depending on the resolution of the given display. Some results may contain hundreds of 

clusters and thousands of dimensions, for which scalable solutions are needed. 

5.2 Visual Analytics of Subspace Search 

Many methods are currently available for an explorative data analysis of high-dimensional 

data spaces. So far, proposed automatic approaches include dimensionality reduction 

and cluster analysis, whereas visual-interactive methods aim to provide effective visual 

mappings to show, relate, and navigate this data. 

As described before, analyzing high-dimensional data is notoriously difficult as interesting 

patterns may occur in any possible subspace. We address two important research 

directions to discover the patterns hidden in these data spaces. One was proposed in the 

previous section where a visual interactive system to analyze the result of subspace clustering 

algorithms for a better understanding was developed. The second direction goes 

one step back, before the clustering step, and identifies important subspaces where possible 

patterns may occur. Looking at one single interesting subspace is often not sufficient since 

different subspaces may show confirmatory, complementary, conjoint, or contradicting 

relations between data items. We propose a novel method for the visual analysis of interesting 

subspaces, pointing out these types of relations. Based on appropriately defined 

subspace similarity functions, we visualize the subspaces and provide navigation facilities 

to interactively explore large sets of subspaces. Our approach allows users to effectively 

compare and relate subspaces with respect to involved dimensions and clusters of objects. 

We apply our approach to synthetic and real data sets. We thereby demonstrate its 

support for understanding high-dimensional data from different perspectives, effectively 

yielding a more complete view on high-dimensional data. 

5.2.1 Introduction 

For large feature spaces, interesting patterns may often be located only in subspace projections 

of the data. As insights may not be hidden in only one single subspace, relevant 

analysis should consider also multiple subspaces and their interrelations. Especially, for 

high-dimensional data we can expect to have different views of the same data [58, 107], 

i.e., the same objects might group differently given different subspace perspectives (see 

Figure 5.8 for an illustration). The existence of alternative relevant subspaces may stem 

from the data description process, e.g., when during preprocessing, features (dimensions) 

describing different semantic properties of the data, are combined. For instance, in demographic 

analysis, households are often described by an array of many variables, combinations 

of which constitute different conceptual domains, such as wealth, mobility, or health. 

Likewise, it may be the combination of otherwise not semantically related dimensions, 

which by their combination result in interesting patterns. In the Data Mining community, 

a class of so-called Subspace Analysis algorithms has been proposed to cope with the problem 

of identifying interesting subspaces and clusters from a high-dimensional data set. To 

date, however, there has been a very limited focus on the presentation and interpretation 

of the generated output. Furthermore, subspace analysis often produces highly redundant 

results that need to be further manipulated in order to get meaningful results [101]. 

[Figure 5.8 (illustration): two panels, "traveling subspace" and "health subspace"; axis labels: income, blood pressure, traveling frequency, age.] 

Figure 5.8: Alternative data distributions and groupings from [103] in two different subspaces of 

a larger high-dimensional data space (domain here: demographic data analysis). Our proposed 

visual analysis method integrates the notion of alternative subspaces into the analysis process and 

links it to the task of comparative cluster analysis. 

We propose an initial step towards the use of visual analytics as a way to explore 

alternative views generated by subspace analysis algorithms. We define an analytical 

pipeline made of algorithmic and visual components that makes it possible to single out and explore 

alternative views in the data. After being analyzed by a subspace search algorithm, the 

data is structured and further processed in an interactive visualization environment to 

reduce redundancy. 

The main contribution of this section is the operative definition and implementation 

of this multistep pipeline that makes it possible to sift through an exponential number of subspace 

candidates and to reduce the problem to a handful of relevant views. More specifically, we 

1. introduce a mechanism to deal with subspace redundancy by defining topological and 

dimensional subspace similarity and by allowing flexible and interactive subspace 

aggregation; 

2. provide a well-reasoned interactive visualization environment that makes it possible to compare 

and assess alternative views by visually comparing topological and dimensional 

similarities and strike a balance between visual complexity and level of detail. 

We evaluate our method through two case studies. The first is based on synthetic 

data to check whether the tool does what it is supposed to do. The second is based on 

real-world data to demonstrate how the tool can help finding and interpreting alternative 

views in high-dimensional data. We believe these results show the potential of visual 

analytics in the context of automated mining algorithms. It furthermore shows how the 

use of visual analytics can enhance the understanding of the results of automated data 

analysis methods, and lead to new questions concerning more effective or more efficient 

algorithms. 

The remainder of this section is structured as follows. In Section 5.2.2, we discuss 

concepts underlying the class of subspace search algorithms that are important for our


approach. In Section 5.2.3, we then introduce our analytic methodology and suggested 

workflow, detailing the employed algorithmic and visual-interactive components. 

Section 5.2.4 demonstrates the application of our tool to synthetic and real data sets, 

showing its usefulness for the problem at hand. In Section 5.2.5, we discuss advantages 

and limitations of our methodology and conclude in Section 5.2.6. 

5.2.2 Subspace Analysis 

In this section, we discuss the challenges for visual subspace analysis in more detail and 

explain how we tackle these with our new interactive, explorative framework supported 

by subspace search algorithms. 

As is commonly known in subspace clustering, dealing with high-dimensional data in its 

subspace projections faces two main challenges. The first, serious challenge is a reasonable 

scalability with regard to the dimensionality of the data set. As for a $d$-dimensional data 

set the number of possible subspaces $S \subseteq \{1, \ldots, d\}$ is $\sum_{k=1}^{d} \binom{d}{k} = 2^d - 1$, many subspace 

clustering approaches do not scale well for very high-dimensional data. Every algorithm 

has to employ some strategy and heuristics to cope with such an exponential search space. 

The second, closely related challenge is dealing with high redundancy, that stems from 

the high similarity of the exponentially many subspaces. If two subspaces share a high 

proportion of dimensions, they are likely to exhibit a very similar clustering structure [58]. 

A large search result with high redundancy is, however, not beneficial for the user as it 

masks the complete information and is hard to interpret. 
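To make the size of the search space concrete, the following minimal sketch counts the non-empty subspaces for a given dimensionality. The function name `count_subspaces` is a hypothetical helper introduced only for this illustration.

```python
from math import comb


def count_subspaces(d: int) -> int:
    """Number of non-empty axis-parallel subspaces of a d-dimensional data set."""
    return sum(comb(d, k) for k in range(1, d + 1))


for d in (12, 18):
    # The closed form 2^d - 1 must agree with the explicit sum.
    assert count_subspaces(d) == 2**d - 1
    print(d, count_subspaces(d))
# 12 4095
# 18 262143
```

Even for a moderate 18-dimensional data set, more than a quarter of a million subspaces exist, which is why exhaustive enumeration is not an option.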

A core task in analysis of high-dimensional data is to apply a clustering method to 

reduce data complexity and identify groups of data for comparison. Different clustering algorithms
follow different clustering notions, e.g., there exist density- (e.g., DBSCAN [50])

or compactness-based (e.g., k-means) clustering methods, and their outcomes often depend 

crucially on non-intuitive parameter settings. Usually several clustering attempts 

are required until the user has a usable result. It is obvious that high runtimes of subspace 

clustering processes (see Section 2.4.2) are not tolerable for such a workflow. Consequently, 

we decided to start the visual data exploration one step before the actual clustering process 

and decouple subspace search and the actual clustering. Dedicated subspace search algorithms 

[16, 38, 84] have been designed to efficiently filter and rank the possible subspaces

according to specific quality criteria (or interestingness measures, see also below). After 

subspace search has taken place, an arbitrary clustering approach can be used to cluster 

in the identified subspaces. 

The use of subspace search for our purposes has several advantages: (1) It helps to 

effectively filter out those subspaces that, based on low interestingness, do not need to
be considered by the user. (2) Subspace search approaches are designed to reduce the
search space efficiently and they do not need to compute clusters. (3) Although
subspace search approaches themselves also rely on certain assumptions of what makes a
subspace interesting, these assumptions do not necessarily lead to very different subspaces
among different approaches. Therefore, the results are not as biased as they are for
different clustering algorithms, which enables the user to already obtain valuable results
with one subspace search approach. For example, the quality assessment based on the
k-NN distance [16] favors neither the DBSCAN nor the k-means clustering notion. And
(4) integrating the subspace search into the high-dimensional analysis offers the user the

opportunity to obtain a visual, intuitive overview of the clustering structure before even

starting the actual clustering. Thus, the user can assess the potential of the data to deliver 

valuable clustering results at all; decide which subspaces are to be clustered; decide which 

clustering notion to follow in each subspace (since the notion does not need to be the same
for all); and more easily determine meaningful parameter settings for clustering approaches.

Subspace search methods guide their search process by specific interestingness scores 

that are defined heuristically. For example, the method proposed in [38] considers as 

interestingness score the variation of the density of objects across a regular cell-based partitioning 

of a given subspace. The underlying assumption is that the higher the variation 

of density the higher the probability that the subspace shows a meaningful structure. As 

another example, the SURFING method [16] relies on the histogram of the k-nearest neighbor 

distances for all objects in a given subspace. It considers subspaces with non-uniform 

distance distributions more interesting (as they are an indication of the presence of strong 

clusterings). Here the underlying assumption is that for subspaces that show meaningful
structures (e.g., clusters), different k-NN distances will occur. These and other measures
aim at identifying subspaces that show a high “contrast” with respect to the distribution
of objects, thereby allowing the user to spot meaningful structure in the subspaces.

Subspace search methods also typically contain heuristic approaches for early abandoning 

uninteresting subspaces, as exhaustive search would be prohibitively expensive. 

SURFING for example is based on a bottom-up strategy for searching subspaces by increasing 

dimensionality. It is based on testing additional dimensions for subspaces already 

known to be interesting. The list of currently interesting subspaces is continuously pruned 

to keep only the most interesting subspaces and speed up the search. SURFING has no 

dimensionality bias, assumes no specific clustering structure, and in practice, it is parameter 

free. Due to these properties, we rely on this method in our proposed approach, using 

the implementation provided to us by the original authors, but other subspace search 

algorithms could be easily used as well. 

Overall, using the results of a subspace search algorithm as a starting point for our 

visualization has many advantages. Subspace search methods such as SURFING employ
efficient search strategies, tackling the efficiency challenge of subspace analysis. However,

they typically do not solve the challenge of high redundancy. Our proposed visual analytical 

workflow, which is introduced next, starts precisely at this point. 

5.2.3 Proposed Analytical Workflow 

We propose a carefully designed visual analytics workflow for subspace-based exploration 

of high-dimensional data, making use of algorithmic subspace search in combination with 

visual-interactive representations for user-based filtering and exploration. Our approach 

starts (1) with an automatic subspace search step, where a large number of interesting 

subspaces is selected by a subspace search algorithm. Current subspace search methods 

provide an algorithmic handling of the problem of finding interesting subspaces, yet they 

often produce too many subspaces that may also be redundant and thereby overwhelm the 

interactive analysis (see also Section 5.2.2). We therefore employ similarity-based grouping 

of subspaces (2) and perform the interactive exploration of interesting subspaces based on 

a few group representatives. Appropriate visual representations and interactions support 

the visual interactive analysis (3) for better understanding the subspace search results, 

including the support for comparative cluster analysis. 

Figure 5.9 depicts our proposed analytical workflow. We next detail the technical
design decisions made for each of the analysis steps, including discussion of alternatives.

[Figure 5.9 pipeline: Subspace Search (e.g., SURFING) → Interesting Subspaces → Subspace Grouping and Filtering (e.g., hierarchical clustering based on subspace similarity) → Redundancy Reduced View → Subspace Interaction (e.g., coloring clusters) → Cluster Colored View]

Figure 5.9: Our proposed analysis pipeline. A subspace selection algorithm is applied to automatically
identify a candidate set of interesting subspaces. A filtering step reduces the potentially large
and redundant set of automatically obtained subspaces to a user-selectable number of representing
subspaces. Visual-interactive user exploration then proceeds on the subspace representations. Subspace
analysis is also supported by comparative cluster views, allowing users to identify meaningful
similar, complementary or even conflicting clustering structures in the set of subspaces.

Generation of interesting subspace candidates 

To search for interesting subspaces of a high-dimensional data set, we propose to use a
subspace search algorithm. We employ automatic subspace search as a tool to serve our
main purpose, which is to explore high-dimensional data in an effective manner. The
advantages of choosing subspace search, and in particular SURFING, have already been
discussed in detail in Section 5.2.2. We observe that subspace search algorithms typically
output a huge number of subspaces that are often rather redundant with respect to the
reported interestingness index, and whose sets of involved dimensions show high overlap.
Since the examination of all subspaces is infeasible, a common approach is to filter the
subspaces based on a certain threshold. This, however, ignores the fact that the top
ranked subspaces might be only slight variations (i.e., high overlap of dimension sets)
of the same subspace and are therefore redundant to each other. Meanwhile, interesting
subspaces with substantially different dimension sets, as compared to the top ranked
results, could be found at much later ranking positions, and run the risk of being neglected
in the analysis. Therefore, we apply a grouping step based on an appropriately defined
notion of subspace similarity, as described next.

Similarity-based subspace grouping and filtering 

Given a large number of candidate subspaces, we apply hierarchical grouping and filtering 

to yield a smaller set of mutually sufficiently different, yet individually interesting groups

of subspaces for interactive analysis. Our filtering and grouping operation is based on a 

custom similarity function defined on pairs of subspaces according to two main criteria: 

(1) overlap of the sets of dimensions that constitute the respective subspaces, and (2) 

resemblance in the data topology given in the respective subspaces. 

Similarity based on dimension overlap 

Subspaces can be similar regarding their constituent dimensions. We use the Tanimoto 

Similarity [117] on bit vectors indicating the contained (active) dimensions in a respective 

subspace (1 denotes an active dimension, 0 the converse). The Tanimoto Similarity is then
computed as the fraction of dimensions contained in both subspaces (AND-ing of the bit
vectors), among the total number of different dimensions occurring in the subspaces (OR-ing
of the bit vectors).
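The bit-vector computation described above can be sketched as follows. This is a minimal illustration, not the implementation used in the tool; the helper names `to_bits` and `tanimoto` and the dimension labels in `ALL_DIMS` are assumptions made for the example.

```python
def to_bits(subspace: str, all_dims: str) -> int:
    """Encode a subspace such as "CFK" as an integer bit vector over all_dims."""
    return sum(1 << all_dims.index(d) for d in set(subspace))


def tanimoto(a: int, b: int) -> float:
    """Shared dimensions (AND) divided by distinct dimensions (OR)."""
    union = bin(a | b).count("1")
    if union == 0:
        raise ValueError("both subspaces are empty")
    return bin(a & b).count("1") / union


ALL_DIMS = "ABCDEFGHIJKL"  # hypothetical labels for a 12D data set
cfk = to_bits("CFK", ALL_DIMS)
cfjk = to_bits("CFJK", ALL_DIMS)
print(tanimoto(cfk, cfjk))        # 3 shared / 4 distinct = 0.75
print(1.0 - tanimoto(cfk, cfjk))  # corresponding distance = 0.25
```

Taking one minus the similarity gives a distance, which is the form needed for layouts based on pairwise distances.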


Similarity based on data topology 

We also compare subspaces with regard to their data distribution. Specifically, we consider 

the similarity of k-NN relationships in the respective subspaces. For efficiency reasons, we

compute the k-nearest neighborhood (k = 5) lists for a sample of 5% of the contained data 

points. The similarity between two subspaces is then evaluated as the average percentage 

of agreement of k-NN lists in the subspaces. This score measures the similarity of the 

k-NN topology of the data, where k is a parameter and can be adapted to the data sets 

at hand by the user. Note that other similarity measures are also possible in principle.
For instance, the data could be clustered and the similarity between subspaces evaluated
according to the resemblance of obtained clusterings by an appropriate measure such as
the Rand index [114].
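A minimal sketch of the k-NN-based topology similarity is given below. It assumes that "agreement" means the fraction of shared neighbors between the two k-NN lists of a sampled point, and it uses brute-force Euclidean neighbor search; both are assumptions of this illustration, as are all function names.

```python
import math
import random


def knn_list(points, query, k):
    """Indices of the k nearest neighbors of points[query] (brute force, Euclidean)."""
    others = [i for i in range(len(points)) if i != query]
    others.sort(key=lambda i: math.dist(points[query], points[i]))
    return set(others[:k])


def project(data, dims):
    """Restrict every record to the given dimension indices."""
    return [tuple(row[d] for d in dims) for row in data]


def topology_similarity(data, dims_a, dims_b, k=5, sample_frac=0.05, seed=0):
    """Average fraction of shared k-NN between two subspaces over a random sample."""
    a, b = project(data, dims_a), project(data, dims_b)
    rng = random.Random(seed)
    n_sample = max(1, round(sample_frac * len(data)))
    sample = rng.sample(range(len(data)), n_sample)
    overlap = [len(knn_list(a, q, k) & knn_list(b, q, k)) / k for q in sample]
    return sum(overlap) / len(overlap)


rng = random.Random(1)
data = [tuple(rng.random() for _ in range(4)) for _ in range(200)]
print(topology_similarity(data, (0, 1), (0, 1)))  # identical subspaces give 1.0
print(topology_similarity(data, (0, 1), (2, 3)))  # unrelated noise dimensions score low
```

Sampling only a fraction of the points keeps the cost manageable when many pairs of subspaces have to be compared.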

These two distance functions are the basis for the subspace grouping step in our analytical 

workflow as follows: 

1. Subspace grouping: We apply hierarchical agglomerative grouping of subspaces 

based on the topologic distance function using Ward’s minimum variance method [144]. 

Based on the dendrogram representation of the obtained hierarchical grouping, the 

user chooses the hierarchy depth level to select a number of groups. This way the 

user can easily decide how many clusters are desired for the analysis. 

2. Subspace filtering: Based on the previously achieved grouping of subspaces, we 

filter one subspace from each group as representative: for each group we consider 

the subspaces with the lowest dimensionality and choose the one that exhibits the 

highest interestingness score. We note that other rules for filtering representatives
are possible, but find that this rule is robust and effective for users, as it tries to
keep the dimensionality as low as possible.
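The filtering rule from step 2 can be sketched as follows, assuming that the group assignment from step 1 is already available. Subspaces are written as strings of dimension letters; the group labels and interestingness scores in the example are made up for illustration.

```python
from collections import defaultdict


def pick_representatives(subspaces, group_of, score):
    """Per group: keep the lowest-dimensional subspaces, then take the highest score."""
    groups = defaultdict(list)
    for s in subspaces:
        groups[group_of[s]].append(s)
    reps = {}
    for g, members in groups.items():
        min_dim = min(len(s) for s in members)
        candidates = [s for s in members if len(s) == min_dim]
        reps[g] = max(candidates, key=lambda s: score[s])
    return reps


# Hypothetical grouping and interestingness scores.
group_of = {"CFK": 0, "CFJK": 0, "CF": 0, "CD": 0, "GI": 1, "GIJ": 1}
score = {"CFK": 0.9, "CFJK": 0.95, "CF": 0.4, "CD": 0.6, "GI": 0.3, "GIJ": 0.8}
print(pick_representatives(list(group_of), group_of, score))
# {0: 'CD', 1: 'GI'}
```

Note that in group 0 the 2D subspace CD is chosen although CFJK has the highest score overall, because low dimensionality takes precedence over interestingness in this rule.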

These steps, together with both distance functions, take us further towards our goal of
understanding the different kinds of relationships between subspaces. Subspaces can complement,
confirm, or contradict each other, and being aware of these relations can be crucial
for further mining tasks.

                             contained dimensions
                             similar                not similar
data topology  similar       redundant              confirmatory
               not similar   dominant dimensions    complementary

Figure 5.10: Filtering cases that can be supported by our two defined subspace similarity functions.

Four basic cases can be identified, each of which might be relevant for a given subspace 

analysis task: 

1. Subspaces that are similar in both, their contained dimension sets and their data 

topology (redundant subspaces); 

2. Subspaces that are dissimilar in both, their contained dimensions and their data 

topology (complementary subspaces);


3. Subspaces that are similar with regard to data topology but dissimilar regarding
their contained dimensions (confirmatory subspaces: we confirm the same data relationships
in different subspaces); and

4. Subspaces that are similar with regard to their contained dimensions, but dissimilar 

regarding topology (this is generally not expected but could indicate the existence 

of one or a few dimensions that are by their nature very dominant for the data 

topology). 

Figure 5.10 illustrates these four basic filtering cases. 
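The four cases can be expressed as a simple decision on the two similarity values of a pair of subspaces. The sketch below assumes both similarities lie in [0, 1]; the thresholds of 0.5 are arbitrary placeholders and not values taken from our tool.

```python
def relation(dim_sim: float, topo_sim: float,
             dim_thr: float = 0.5, topo_thr: float = 0.5) -> str:
    """Map a pair of similarity values to one of the four filtering cases."""
    dims_similar = dim_sim >= dim_thr
    topo_similar = topo_sim >= topo_thr
    if dims_similar and topo_similar:
        return "redundant"
    if not dims_similar and not topo_similar:
        return "complementary"
    if topo_similar:
        return "confirmatory"
    return "dominant dimensions"


print(relation(0.75, 0.9))  # redundant
print(relation(0.0, 0.8))   # confirmatory
print(relation(0.8, 0.1))   # dominant dimensions
print(relation(0.1, 0.2))   # complementary
```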

Visual-interactive design 

After hierarchical aggregation and/or filtering of the potentially redundant set of subspaces
has taken place, we apply a set of analytical views for exploring and comparing the

subspaces. Our displays are based on (1) scatterplot-oriented representations of individual 

subspaces or groups of subspaces, (2) similarity-based or linear list layouts for sets of 

subspaces, and (3) additional informative views (parallel coordinates and color-coding for 

comparison of groups in data). 

The proposed design is the result of several iterations of alternative solutions in which 

we explored and compared several representations. Two design choices are worth discussing 

here: (1) the design of a visual representative for subspaces and (2) their layout. We 

decided to represent subspaces with scatterplots because they allow for the identification 

and comparison of groups in the data. More abstract representations (like simple colored 

marks) would require less space but would not allow the rich topological comparison 

provided by the scatterplots. In contrast, more complex representations such as
parallel coordinates would provide a direct representation of the dimensions included in
the subspace but would make their representation much more cluttered. As for the layout,
we tried several tree and graph layouts to make the relationship between the subspaces and
their shared dimensions explicit; however, we found that this rarely provides interesting
insights and makes the visualization too cluttered to be of any use.

Figure 5.11: Subspace representation by 2D scatterplots with dimension glyph. We can see the 

visual representations of two 5D subspaces (left) and one 4D subspace (right). 

To represent each subspace in a similar way, independent of its dimensionality, we 

decided to plot each subspace in a 2D scatterplot. The scatterplot representation can 

be generated by any appropriate projection technique such as PCA [83], MDS [41] or 

t-SNE [143], to name a few. We currently use MDS; however, we experimented with
other dimension reduction techniques and found that other techniques could be used alternatively.
To convey the involved subspace dimensions, we add an index glyph to the
respective scatterplot (see Figure 5.11).

Figure 5.12: (1) Linearly sorted view of subspaces for the 12D synthetic data set from [52]
showing the full result of SURFING, consisting of 296 subspaces. The selected subspace in this
view is shown in (2) a single subspace view to enable interaction and in (3) a parallel coordinates
view with the subspace dimensions as the first axes (highlighted), and all the other data dimensions
as the last axes.

The analytical views are combined and linked in an application that consists of the 

following components: 

Linearly sorted view of subspaces 

To obtain a first overview of the output of the subspace search algorithm, we present all the 

subspaces in a linear view. The MDS scatterplots representing the individual subspaces 

are sorted left-to-right and top-down according to the interestingness index provided by 

the subspace search method. This view is exclusively used as a detail view for groups of 

topologically similar subspaces. Figure 5.12(1) illustrates the subspaces of the synthetic 

data set, which is described later in Subsection 5.2.4. 

Subspace group view 

In this view, groups of subspaces that have been formed by hierarchical agglomerative 

grouping are shown. Each group is represented by one selected subspace from that group, 

using the filtering method as described in the previous subsection. Figure 5.13 shows the 

dendrogram provided by the hierarchical grouping algorithm of all 296 subspaces visible 

in the linearly sorted view. Each node in the dendrogram represents a cluster at a certain 

similarity. A larger image of the dendrogram can be seen in Appendix A.4. 

The user can navigate through this hierarchy (possible with the hierarchical navigation

buttons shown in Figure 5.16(6)) and specify a certain similarity threshold for clustering.

[Figure 5.13 dendrogram: the leaves are the 296 subspaces, each labeled by the letters of its dimensions (e.g., FL, CFGIJK, CDFGHIJKL); the axis is labeled "Distance (Similarity)" and ranges from 0 to 30; the plot carries the labels "hierarchical agglomerative grouping" and "synthetic dataset".]

Figure 5.13: Hierarchical agglomerative grouping of the 296 interesting subspaces. The red line
shows the threshold for 6 groups shown in the subspace group view. Each group is marked by a
colored rectangle. The colors are maintained in Figure 5.14.

This threshold is indicated by the red line in the figure showing the dendrogram, resulting 

in six groups visible in the subspace group view presented in Figure 5.14 and also illustrated
in the overview in Figure 5.16(1).

Figure 5.14: Subspace group view for the 12D synthetic data set with six subspace groups. 

The representative subspaces of each group are each visualized by an MDS plot, and 

shown side-by-side. A dimension histogram on top of each indicates the distribution 

of dimensions contained by the subspaces in that group, where the length of the bar 

encodes the frequency of the respective dimension. The last bar encodes the percentage of 

subspaces contained in this group. It is colored in orange to be easily distinguished from 

the others. 
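The quantities behind such a dimension histogram can be sketched as follows. The sketch assumes that each bar shows the share of subspaces in the group that contain the respective dimension; this normalization, the function name, and the example group are assumptions of this illustration.

```python
from collections import Counter


def dimension_histogram(group, all_dims, total_subspaces):
    """Relative frequency of each dimension in a group, plus the group's share."""
    counts = Counter(d for s in group for d in set(s))
    freq = {d: counts[d] / len(group) for d in all_dims}
    share = len(group) / total_subspaces
    return freq, share


# Hypothetical group of four subspaces out of 296 search results.
freq, share = dimension_histogram(["CFK", "CFJK", "CF", "CFL"], "ABCDEFGHIJKL", 296)
print(freq["C"], freq["K"], freq["J"])  # 1.0 0.5 0.25
print(share == 4 / 296)                 # True
```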

Each group of subspaces from the preceding view can be expanded and its member 

subspaces can be seen and compared in detail (as Figure 5.16(5) illustrates). This allows a
better understanding of the current similarity threshold, and allows the user to expand or further
collapse the group structure based on visually perceived similarity between subspaces. The
user can investigate how similar the distribution of dimensions is among different groups
of subspaces. To this end, a click on the dimension histogram icon of one particular group
will cross-highlight the dimensions of the selected group that are also contained in other
groups. In this example, the dimension glyph of the green group has been clicked. In summary,

the subspace group view allows a global comparison of non-redundant subspaces and 

their similarities concerning the contained data topology. 

Dimension-based subspace similarity view 

We also support the comparative analysis of all subspaces based on their similarity regarding 

the set of active dimensions. Consequently, a global MDS layout, based on the
Tanimoto distances between the subspaces, as described at the beginning of this section, is
generated. Figure 5.15 (respectively Figure 5.16(4)) illustrates the subspace similarity view.
For a high number of subspaces, this view can only provide an impression of the similarity
relationships, but more details become visible by zooming. The agglomerative grouping
based on the topologic distance function could be used to reduce the number of displayed
subspaces in this view. The subspace group view (based on data topology distance) and
dimension-similarity view (based on Tanimoto distance) are linked by color-coding (outer
frame coloring). Thereby, we can compare the similarity of subspaces by their topological
and dimension-overlap-based similarity.

Figure 5.15: Dimension-based subspace similarity MDS view of the 296 subspaces selected by the
subspace search algorithm.

Additional views and cluster comparison support 

We also integrated details-on-demand for each subspace by a parallel coordinates view
(see Figures 5.12(3) and 5.16(3)). Highlighting contained dimensions helps to understand
the difference between the subspaces in more detail. The subspace dimensions are the
first dimensions of the parallel coordinates view and are highlighted. The others are added in
random order, in a lighter gray. This enables the comparison to the rest of the data set,
and understanding the distribution of the subspace dimensions, compared to the rest of
the data.

Furthermore, interactive exploration of the subspaces is enhanced by a single subspace
view, providing an enlarged view of a selected subspace scatterplot (see Figures 5.12(2) and
5.16(2)). This view also allows the user to manually select clusters of objects with a
lasso tool. Cross-coloring of the selected points among the other subspaces and within the
parallel coordinates plot thus allows comparative exploration of grouping structures, a
core problem in making effective use of alternative subspaces.

Figure 5.16: All linked views: (1) Subspace group view for the 12D synthetic data set with six 

subspace groups. (2) Single subspace view showing the representative subspace for the first group. 

(3) Details-on-demand in the parallel coordinates view for the selected subspace. (4) The MDS 

layout of the subspace search results based on their dimension similarity. (5) Group detail view for 

the three (orange, green, purple) subspace groups. (6) Hierarchical navigation buttons. 

5.2.4 Application 

We now demonstrate the analytical capabilities of our proposed approach by application to
synthetic and real-world data in two scenarios. These two scenarios have different purposes.
First, we use synthetic data as a proof of concept and exemplify the suggested workflow.
We show that relevant subspaces can conveniently be identified. Then, we describe
an explorative setting in which interesting findings in alternative subspaces of a real-world
data set are obtained.

Application Scenario 1: Synthetic Data 

To show the power of the proposed approach, we used a 750-record sample of the first

12D synthetic data set presented in [52] (data set No. 2). This data set consists of four 

3D Gaussian clusters and two 6D Gaussian clusters. The remaining dimensions contain 

uniformly distributed random noise. The first step of our approach is to determine the 

interesting subspaces of the high-dimensional data set, by running automatic subspace 

search using SURFING (see Section 5.2.3). This subspace search returns a total of 296

subspaces identified as interesting, out of the 4095 possible subspaces. To get a first 

impression of these subspaces, we use the linearly sorted view of subspaces shown in 

Figure 5.12, relying on MDS representations of the data in the subspaces, and sorted by 

the interestingness score in decreasing order. 

The view shows the diversity of subspaces identified during the automatic step. The 

first elements in the first row of the view are very similar in terms of the point distribution 

(showing mostly scattered and spherical point distributions). However, at later positions, 

we also see other varieties of point distributions, including parallel stripe patterns, and 

stripes mixed with spherical patterns. In a normal (non-visual) analysis case, relying just 

on the subspaces ranked top by the interestingness score, the analyst might miss some of 

these different characteristics of the subspaces.

Judging by the shape of the MDS projection representations, the overview also confirms 

that the subspace search did return a lot of redundant subspaces. The next step is therefore 

to group the subspaces according to their similarity, allowing the user to abstract to a 

smaller number of relevant subspaces to compare them in detail. We used our similarity
function based on the data topology, creating a hierarchical agglomerative clustering using
Ward’s minimum variance method [144]. We found that this method gave
good results, in terms of providing clusters of subspaces that discriminate well from each
other. The obtained clustering dendrogram is shown in Figure 5.13. By setting a
similarity threshold, the user can reduce the number of subspaces
considerably in a meaningful way, as Figure 5.16(1) shows. The navigation buttons, as shown in
Figure 5.16(6), allow the user to move through each dendrogram level and to find the
desired level of redundancy. Here the dendrogram was cut at 0.73 (value range (0,1)).
As a result, six groups are found and visualized by their representatives. The number of
groups can be varied by selecting different similarity levels in the dendrogram hierarchy.
For this data we quickly found that six groups is the right level of detail for our further
investigation.

We investigate the components of each group of subspaces in more detail. Figure 5.16(5) 

shows the group detail view of the orange, green, and purple subspace groups as framed 

in Figure 5.16(1). Topologically similar subspaces are grouped together. In this way, the 

analyst is given an overview of the existing groups and, if needed, can further compare 

individual group components. 

On top of the scatterplots, a dimension histogram indicates the distribution of dimensions
for each group. The last bar of the histogram is marked in orange and represents
the percentage of subspaces contained in this group. It is scaled logarithmically, so that
this bar is also visible for groups with few elements. A click on the dimension histogram
of one group representative highlights its dimensions in all the other representatives. In
Figure 5.16(1) (enlarged in Figure 5.14) the green group was clicked. To understand why
the green- and gray-framed groups are split, we can consult the additional view in Figure
5.16(4). It shows an MDS layout of all interesting subspaces based on the dimension
overlap (Tanimoto) similarity. In this view, closeness of two subspaces corresponds to dimension
similarity. We see that the green- and gray-framed groups are located
on the far left side in the plot. This shows us that the subspaces are similar in terms of
dimensions but, being in different groups, they must differ in data topology
according to our similarity measure. The reason is that all the subspaces of the gray-framed
group contain dimension d12, while none of the subspaces in the green-framed
group contain this dimension, which is visible by the bars in the dimension histogram of
the gray-framed group (see Figure 5.16(1)). As it is not highlighted, it is not contained in
the marked green-framed group, and evidently this dimension is responsible for a different
data distribution.

We can also go one step further in detailed comparison of subspaces by cross-color-coding
clusters of points in the MDS representation. Our lasso tool allows the user to
manually mark clusters of points in the MDS subspace representation, which makes it possible to
cross-compare the groupings among different subspaces. For example, we manually marked
six separate clusters of points in the pink-framed subspace group (group number two in
Figure 5.16(1)) and assigned distinct colors. By analyzing the distribution of colors among
subspace group representatives, we see that other subspaces merge some of these clusters
and spread others. This is also true for the purple-framed group representative. The dark
blue and pink point clusters (the uppermost in the originally colored subspace) are clustered
in the purple subspace, but some of their points also became noise in this subspace.

Summing up, we can see how our visual analytics workflow helps to deal with the 

extensive number of possibly interesting subspaces in a natural overview-first based visual 

analytics workflow. In a first step, the SURFING approach reduced the number of 

subspaces of the 12 dimensional data set from 4095 to 296 interesting ones. Since this 

set of subspaces still showed a high redundancy, in our next step we grouped them using 

our topological similarity measure. Based on the grouped subspaces, further investigations 

could take place for comparing the relations and distributions among points of data within 

the subspaces. 

Application Scenario 2: Exploration/Discovery 

We will now demonstrate the exploratory functionalities of our proposed approach based on
a real data set. We again analyze the USDA Food Composition data set (see footnote 2), a full collection
of raw and processed foods characterized by their composition in terms of nutrients. The
database contains more than 7000 records and 44 dimensions. After removing missing
values and outliers and normalizing the data, 722 records (foods) remained, for which we
selected 18 dimensions of the data set that were interpretable.

From this input data set, application of the SURFING algorithm returned 216 interesting 

subspaces for further exploration. To obtain a first impression of this data, we 

investigated the linearly sorted view (see Figure 5.17 for a cut-out). Many subspaces, in 

particular those ranked with a high interestingness index, showed a rather skewed distribution 

of points in our projection representation, concentrating along the edges of the 

diagrams. Only later in the ranking did we observe projections showing more structure that could be meaningful. The red-framed subspace in Figure 5.17 seems to be very interesting, forming long, clear stripes. With the help of the single subspace view, we further investigated this subspace (Iron, Manganese, Vit D) by coloring each stripe with a different color and comparing the formation of these clusters across the other subspaces. Most of them seemed to be overspread by the cyan class (see Figure 5.17, right).

At the same time, it is clear that a high level of redundancy is still present, and a further 

grouping is deemed necessary. Therefore, we continued with our next analytical step, the 

subspace grouping by agglomerative hierarchical clustering. We obtained different groups of subspaces and found out that these clearly striped clusters only appear in subspaces containing Vit D.
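The grouping step can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in our system: the Jaccard distance over dimension sets merely stands in for the similarity measure (the topological measure is not reproduced here), average linkage and the stopping threshold max_distance are assumed choices, and the example subspaces are made up.

```python
def jaccard_distance(a, b):
    """Dimension-based distance: 1 - |a & b| / |a | b| for two non-empty sets."""
    return 1.0 - len(a & b) / len(a | b)


def group_subspaces(subspaces, distance, max_distance):
    """Average-linkage agglomerative grouping of subspaces.

    Starts with one group per subspace and repeatedly merges the two
    closest groups until the closest pair is farther apart than
    max_distance. Returns a list of groups (lists of subspaces).
    """
    groups = [[s] for s in subspaces]

    def linkage(g, h):
        # Average pairwise distance between the members of two groups.
        return sum(distance(a, b) for a in g for b in h) / (len(g) * len(h))

    while len(groups) > 1:
        d, i, j = min(
            (linkage(groups[i], groups[j]), i, j)
            for i in range(len(groups))
            for j in range(i + 1, len(groups))
        )
        if d > max_distance:
            break
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups


# Made-up example subspaces, each given as a set of dimension names.
subspaces = [
    frozenset({"Iron", "Manganese", "Vit D"}),
    frozenset({"Iron", "Vit D"}),
    frozenset({"Carbohydrate", "Fiber"}),
    frozenset({"Carbohydrate", "Fiber", "Protein"}),
]
groups = group_subspaces(subspaces, jaccard_distance, max_distance=0.5)
print(len(groups))  # 2
```

Passing the distance as a function keeps the grouping independent of the similarity notion, so a topological measure could be plugged in without changing the merge loop.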

We therefore reset the coloring and started a new interactive analysis step, beginning with this stage of our workflow. After testing different filtering thresholds and comparing the topological and the dimension-based similarity relations, we obtained 12 groups and considered this number suitable for subsequent analysis (see Figure 5.19(1)).

2 http://www.ars.usda.gov/

Figure 5.17: Linearly sorted view (cut-out) of subspaces for the 18D USDA Food Composition data set. The full SURFING result consists of 216 subspaces. We see a rather high level of redundancy. Subspaces exhibiting more structure are found in particular at the middle and end positions of the ranking. Relying only on the numerically top-ranked results, we would have omitted such interesting cases from the analysis.

Figure 5.18: (A) Interesting spotted subspace (Carbohydrate, Fiber) presenting two clusters. (B) Subspace (Carbohydrate, Lipid, Protein) in the same cluster group as (A), where the cluster structure changes. (C) Green-marked third cluster in the subspace from (B). (D) Subspace (Fiber, Protein, Vit D) of the orange-framed subspace group, where the alternative clustering of points is visible.

From the reduced number of representative subspaces, one particular subspace stood out to us (see Figure 5.19(1) for the group representatives and Figure 5.18(A) for the interesting spotted one). This subspace shows the most structure and allows us to discern two point clusters (pink and blue). We selected this specific subspace group (framed brown in Figure 5.19) for further analysis. Cross-coloring is used to highlight its group components, which are shown at the bottom of the figure. The subspaces of the group are visibly topologically similar; consequently, this subspace is a valid representative.

In addition, we observe that there are some subspaces in this group where the clustering changes. One example is shown in Figure 5.18(B). We assigned the green color to the points standing out on the left side since they seem to form a different structure. In the group view (see Figure 5.19(1)) we can see that this green cluster spreads over five of the 12 subspace group representatives. After a closer look at the components of the orange subspace group, we spotted a sharply defined green cluster (see Figure 5.18(D), highlighted in Figure 5.19(2)). By highlighting the dimensions of the orange group, we

can see that the brown group has a dominant dimension (Protein) that is not contained 

by any subspace of the orange group. We can therefore assume that this dimension is 

decisive for the clustering of the points. In the dimension-based similarity view (MDS 

Layout in Figure 5.19(3)), the subspaces of the brown and orange groups are far apart 

from each other, which supports our finding that the groups contain different dimensions.

Likewise we can see that the group components of the brown group are scattered across 

the MDS layout. This is due to the fact that the group subspaces are dissimilar in terms 

of their dimensions, but their topological similarity is dominated by the shared dimension 

(Protein). 
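The check for a dominant dimension described above can be sketched as follows. This is a hypothetical helper, not a function of our system: the name dominant_dimensions, the min_share threshold, and the group memberships in the example are all assumptions made for illustration.

```python
from collections import Counter


def dominant_dimensions(group, other_group, min_share=0.6):
    """Dimensions frequent in `group` and absent from `other_group`.

    A dimension counts as frequent if it occurs in at least `min_share`
    of the subspaces in `group`.
    """
    counts = Counter(dim for subspace in group for dim in subspace)
    frequent = {d for d, c in counts.items() if c / len(group) >= min_share}
    used_elsewhere = set().union(*other_group)
    return frequent - used_elsewhere


# Invented group memberships, for illustration only.
brown = [
    frozenset({"Carbohydrate", "Fiber"}),
    frozenset({"Carbohydrate", "Lipid", "Protein"}),
    frozenset({"Protein", "Iron"}),
    frozenset({"Protein", "Lipid"}),
]
orange = [
    frozenset({"Fiber", "Vit D"}),
    frozenset({"Iron", "Vit D"}),
]
result = dominant_dimensions(brown, orange)
print(result)  # {'Protein'}
```

A frequency threshold is used instead of a strict intersection because a group can contain subspaces that lack the dominant dimension, as with the two-dimensional subspace in Figure 5.18(A).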


Figure 5.19: (1) Grouped view of subspaces for the 18D USDA Food Composition Data Set with 12 

group representatives. (2) The brown and orange group components are shown in the components 

view. (3) MDS Layout of the total number of subspaces with cross-colored group representatives. 

Summing up, we demonstrated how our interactive exploratory workflow can be applied to real data. Unlike in the previous scenario, the cluster structure of real data sets is not known in advance, meaning that several interactive attempts are needed to investigate the vast number of interesting subspaces provided by the subspace search algorithm. With the help of the topological similarity functionalities, we could group the redundant clusters and take a closer look at their topological changes. The different linked views of our approach helped us to identify different subspaces that present alternative clusterings.

5.2.5 Discussion and Possible Extensions 

We now summarize the main goals of our system and discuss its limitations and possible extensions.


Summarizing the Main Goals of our Approach 

Our presented approach supports visual-interactive analysis of high-dimensional data from 

multiple perspectives based on the notion of automatic subspace search. The core assumption 

for our approach is that useful information could be extracted in a comparative way 

from several different subspaces residing in a larger high-dimensional data space. This

assumption is the major driving force behind subspace search and subspace clustering 

algorithms developed in the Data Mining community over the past few years. We exploit 

algorithmic subspace search in an encompassing visual-interactive system. Our approach 

is designed around Shneiderman’s Visual Information-Seeking Mantra [127], applied to the 

problem of analyzing potentially large sets of subspaces. Modern subspace search methods 

such as SURFING efficiently identify candidate subspaces that are expected to exhibit informative

structure without restriction on a specific nature of the structure. Specifically, 

interactively detecting and understanding relevant structures in subspaces is an explicit 

goal of our system. Our interactive support allows users to condense and compare subspaces, 

and even groups in data, whereby the analytical loop from the algorithmic search 

of subspaces to the sense-making by the user is closed. Subspace search algorithms are 

very useful as a starting point. Since the identification based on interestingness is performed 

heuristically, the search methods alone cannot solve the analytical problems at 

hand. To this end, capable visual-analytic systems need to be designed based on the output 

of the subspace search algorithm. We therefore designed, implemented, and applied an encompassing system based on a subspace search method (we used SURFING as an example). It allows users to explore high-dimensional data, taking into account the curse of dimensionality and the possibility of finding alternative clusters in different subspaces.

Limitations and Possible Extensions 

We identify the following limitations and improvement opportunities for our approach: 

• Computational scalability 

We designed and tested our system around data sets of moderately high dimensionality (tens of dimensions). For higher-dimensional data, we will have to deal with scalability

issues in (1) computational complexity of the subspace search and (2) scalability 

of the visual representation of subspaces. Regarding (1), the search space increases 

exponentially with dimensionality. Subspace search algorithms probably need more 

aggressive filtering mechanisms to keep the number of searched subspaces tractable. 

A dynamically adjustable threshold could be useful here. However, we still need 

to ensure that no relevant results are excluded. To this end, sensitivity analysis is 

needed. 

• Visual scalability 

Regarding (2), scalable visual representations are also needed for higher-dimensional

data. We need to scale with the number of subspaces and the representation of each 

subspace. Hierarchical grouping of subspaces is already included in our system to 

scale with the number of subspaces. The linearly sorted view per se does not scale 

with many subspaces, yet it can be restricted to the representative subspaces obtained 

from hierarchical grouping. Visual representation of subspaces takes place by 

projection to show the data points and an index view to show contained dimensions.


In particular, the latter will only scale for a limited number of dimensions. How to design set-oriented views to compare many sets of dimensions is a challenging problem that, if solved, would improve our tool.

• Projection-based subspace representation 

We currently represent the subspaces by MDS projections of the data residing in 

respective subspaces. However, projection typically induces a loss of information, which could be conveyed in our visualization, e.g., by showing the stress values in an overlay visualization [121]. In our experiments, MDS performed very well compared to PCA. Yet, it would be interesting to test other projections. Also, other subspace representations besides scatterplots are conceivable, similar in essence to Value-and-Relation displays [157]. Likewise, many different useful similarity notions could be employed to group and compare subspaces, such as notions based on stress measures, implicit clustering structures, relations to outliers, scagnostics features [151], etc. Testing them in different application domains is considered

valuable future work. We note that our analytical approach can easily accommodate 

alternative subspace search algorithms, representations, and filtering options. 

• Interpretable dimensions 

To relate subspaces and data groups in subspaces, it is important for the analyst to be 

aware of the meaning of the dimensions of the respective subspace. Our index-based 

glyph does not convey information about the type of dimension. More semantically 

meaningful dimension representations would be useful. Detail-on-demand functions 

could be added to help the user interpret the involved dimensions and properties of 

the data points more efficiently.

• Definition of interestingness and sensitivity to noise 

Subspace search algorithms heuristically identify subspaces as interesting based on 

certain properties of object relations. Based on the user and application, additional 

interestingness formulations are possible and should be supported. Following best 

practices in data analysis, we have applied a data cleaning step (outlier and missing 

value removal) to our tested data before we fed it into our system. The SURFING 

algorithm is not robust with respect to missing values, whereas it seems to be robust 

with respect to outliers. The original paper does not discuss this aspect, and we 

did not further investigate it. The projections used to represent data distributions 

in subspaces are sensitive to outliers and may generate clamped distributions if not 

pre-processed. We postpone the analysis of this problem to future work. 

• Automatic support for cluster comparison 

Adding automatic clustering of data points in subspaces would be useful as a postprocessing 

step. Equipped with automatic clustering, we can color-code the found 

clusters. This could lead to new visual-oriented interestingness measures useful for 

selecting interesting subspaces in the future. User interaction with the subspace 

search output could be a useful analytical feature for refinement. Allowing expert 

users to split or merge subspaces, or construct new subspaces by adding or removing 

dimensions, would be one option. 

• Usability and user adoption 

Our current system design targets users with expertise in data mining. End-user applications, e.g., in market segment analysis, could benefit from subspace analysis. However, we recognize that for end-users, the interface of our system would possibly need to be customized. Our experience in collaborating with data mining experts

showed that the tool can be useful not only for data exploration but also as an 

evaluation tool to assess the output generated by subspace analysis algorithms. 
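Regarding the stress values mentioned under projection-based subspace representation, the information loss of a projection can be quantified by comparing pairwise distances before and after projection. The sketch below computes one common normalized variant of stress. It is an assumption for illustration: the function name is hypothetical, and neither our system nor [121] is claimed to use exactly this normalization.

```python
from itertools import combinations
from math import dist, sqrt


def normalized_stress(original_points, projected_points):
    """Normalized stress between original and projected pairwise distances.

    Returns sqrt(sum((d_orig - d_proj)^2) / sum(d_orig^2)) over all point
    pairs; 0.0 means that all pairwise distances are preserved. At least
    two distinct original points are required.
    """
    squared_error = 0.0
    squared_total = 0.0
    for i, j in combinations(range(len(original_points)), 2):
        d_orig = dist(original_points[i], original_points[j])
        d_proj = dist(projected_points[i], projected_points[j])
        squared_error += (d_orig - d_proj) ** 2
        squared_total += d_orig ** 2
    return sqrt(squared_error / squared_total)


# Toy data: three points that lie in the plane z = 0.
points_3d = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 0.0, 0.0)]
flat = [(x, y) for x, y, _ in points_3d]   # dropping z loses nothing here
line = [(x,) for x, _, _ in points_3d]     # dropping y and z distorts distances

stress_flat = normalized_stress(points_3d, flat)
stress_line = normalized_stress(points_3d, line)
print(stress_flat < stress_line)  # True
```

A per-subspace value of this kind could drive an overlay or a color cue, so that projections with high stress are read with more caution.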

5.2.6 Conclusions 

We presented an encompassing visual-interactive system for subspace-based analysis in 

high-dimensional data. Subspace-based analysis can constitute a new paradigm for high-dimensional

data analysis since informative structures in the data can be found and compared 

in different subspaces of a larger high-dimensional input space. We defined, implemented,

and demonstrated an analytical workflow based on automatic subspace search. A 

larger set of automatically identified interesting subspaces is grouped for interactive exploration 

by the user. A custom subspace similarity function allows for comparing subspaces. 

Our approach is able to effectively pin down several interesting views and helps to come

up with specific findings regarding similarities of groups in data. We discussed a set of 

possible extensions of the system, which could be addressed as future work.


6 

Conclusion and Future Work 

„The important thing is not to stop questioning. Curiosity has its own reason 

for existing.” 

Albert Einstein 

Contents 

6.1 Summary of Contributions and Future Work

This chapter takes a step back from the concrete solutions presented in each chapter, positions the work in the big picture of pattern finding in high-dimensional data, points out the contributions of this thesis, and concludes the work. Future work is presented here with respect to the big picture, since each chapter contains specific conclusions and identifies particular further research directions.

All domains nowadays produce high-dimensional data sets that can hide numerous 

important patterns. Finding these patterns is a complex task, but at the same time it is 

of high significance. Using visual representation to show the numerical data, automation 

to compute the interesting elements, and interaction to navigate the information spaces 

can help to spot the valuable patterns in the data. Due to the high dimensionality of the data, none of these approaches alone is powerful enough, and since the results for certain data analysis questions are previously unknown in accuracy and complexity, systems combining visualization, automation, and interaction are needed to investigate this data.

6.1 Summary of Contributions and Future Work 

We presented two main research directions to search for interesting patterns in high-dimensional data: (1) reducing the data dimensionality by projections, ranking them, and visualizing the best, and (2) looking into different subspaces of the data and spotting the

interesting ones for further analysis, by comparing them in terms of dimensions, records 

and clusters. 

For (1) we presented new quality measures to judge the quality of projections automatically and a systematization of existing quality metrics for high-dimensional data. For our newly developed quality metrics, we chose correlation and clustering from the large spectrum of patterns. Two different types of metrics were presented in Chapter 3, namely data quality metrics and image quality metrics. Both were developed for two different visualization techniques: scatterplots and parallel coordinates. The new measures were applied to different synthetic and real data sets to demonstrate their properties. As these automatic measures should represent the user’s preference, we conducted an empirical evaluation of four state-of-the-art measures to evaluate their correspondence to the user’s preference.

This study helped us in developing guidelines for further metric development. To see the big picture regarding the recently proposed quality measures for high-dimensional data visualizations, we conducted a literature review and present in Chapter 4 a systematization

of the existing measures, identifying a number of characteristic factors and developing a 

quality metrics pipeline to illustrate the process. The goal is to put the existing methods 

into a common framework, thus easing the generation of new research in the field and 

identifying important gaps to bridge with future research. 

Learning from the outcome of these two chapters, the following main directions are 

indicated for future research: 

• developing new quality metrics for purposes like view optimization or visual mapping 

optimization; 

• developing new quality metrics for “non-standard” visualization techniques for high-dimensional data, such as pixel-based techniques, glyphs, etc.;

• running user evaluations to test the quality metric applicability in real world settings; 

• using quality metrics to explore projection techniques’ properties like noise or rotation 

invariance, scalability with respect to data points or data dimensions; 

• using quality metrics to select features to be used for building a model for data 

classifiers. 

For (2) we presented visual analytics approaches to understand the relations between 

subspaces that contain important patterns. We recognize four main research directions to 

use visualization and interaction in understanding patterns identified by subspace algorithms: 

1. Interactive subspace clustering result exploration 

Probably the simplest way to use visualization in conjunction with subspace algorithms is to visualize their results. The workflow in Figure 6.1 illustrates the necessary steps. Data is processed by a certain subspace algorithm, and the results are visualized, providing interactive facilities to explore them.





Figure 6.1: Interactive exploration of subspace clustering results. 

In Section 5.1 we presented ClustNails, a tool to visualize subspace clustering results, 

and support comparison of different subspace clusters regarding their data

distribution and dimension overlap. We proposed ordering strategies for dimensions 

and clusters to ease the cluster comparison. Brushing and linking of dimensions and 

clusters support the exploration.


2. Interactive subspace search result exploration 

One problem for all the subspace clustering algorithms is the exponential number of 

existing subspaces. To address this issue, we decoupled the process of clustering and 

subspace search, to restrict the number of interesting subspaces that are inspected 

for clusters (see Figure 6.2). 





Figure 6.2: Interactive exploration of subspace search results. 

In Section 5.2 (see Figure 6.2 for the specific workflow) we addressed this research 

direction and presented a subspace search approach to identify alternative views 

(valid groups of clusters) in the data. We first run a subspace search algorithm on the data to identify possibly interesting subspaces. Then we visualize them for a better understanding. Given the high redundancy of the spotted subspaces, due to the high number of shared dimensions, we propose grouping and filtering functions based on topological and dimension similarities. The groups can be navigated, and interactive lasso tools are available to manually mark clusters. Different views can be identified by comparing different subspaces according to the marked clusters. In future work, automatic clustering can be used to increase the number of different identified views.

3. Visual comparison of subspace clustering results. 



Figure 6.3: Visual comparison of subspace clustering results using visualization. 

One further research direction would be to use visualization to compare different results of subspace algorithms (see Figure 6.3). Different results can be obtained either by running different subspace clustering algorithms on the same data set, or by running one algorithm with different parameter settings. Visualization can then be used to:

• identify what we call a “common sense clustering”, meaning clusters that will probably pop out independent of the algorithm or parameters;


• compare different clustering results, and identify the role of parameters for specific algorithms;

• integrate user feedback gained from the visualization into the computation of clusters: clusters can be labeled with the user’s preference (like/dislike) or merged, or alternatives to a selected cluster can be computed.

4. Visually-assisted in-line steering of subspace clustering. 





Figure 6.4: Visually-assisted in-line steering of subspace clustering.

Another research direction, and probably the most complex one, could be the integration of visualization and user feedback into the running algorithmic process (see Figure 6.4). This is like opening the box of subspace clustering and providing a steerable clustering tool. The algorithm can be interrupted, and intermediate results can be visualized. The user’s preference can be integrated in a feedback loop to steer the algorithm.

In conclusion, we can say that the complexity and amount of high-dimensional data require a combination of the strengths of visualization, automation, and interaction to

discover interesting, unknown facets of these data sets. This thesis has addressed some 

of the relevant research questions in this field. At the same time, new questions arose 

that will hopefully motivate other researchers to develop applicable solutions in the near 

future.

List of Figures 

1.1 Multiple valid and interesting groupings of a high-dimensional data set [104].

1.2 Schematic overview of the interrelation of chapters in this thesis.

2.1 High-dimensional visualization techniques taken from [145]. A: Scatterplot matrix showing on the diagonal a histogram plot for each dimension. Selected points are marked in red in all plots. B: Parallel coordinates plot of a seven-dimensional data set. One polyline representing one data point is highlighted in red. C: Star glyphs in an MDS layout. D: Dense pixel displays representing a 14-dimensional data set.

2.2 (A) Scagnostics SPLOM having as axes scagnostics measures and showing each data scatterplot as a point in the measures scatterplot [152]. (B) Scagnostics indices used as quality measures to rank data scatterplots [152].

2.3 Visual interactive feature selection systems. A: Rank-by-Feature Framework presented in [125]. B: Feature selection supported by quality measures [82]. C: DimStiller for feature selection [76].

2.4 Interactive visual analysis systems for clustering in high-dimensional visualization. A: Interactive exploration of hierarchically clustered data along a dendrogram [124]. B: (a) Grouping icons to form clusters based on visual similarity. (b) User-defined grouping of icons [35].

2.5 Interactive visual analysis systems for classification in high-dimensional data. A: Visual classification from [11] illustrates the decision tree for DNA training data having 19 attributes, visualizing each attribute-value by a colored pixel arranged in bars. B: Decision tree construction system [142], representing the tree in a node-link diagram, displaying split points on the links and the split attributes on the node.

2.6 (a) VISA system [14]. Left: MDS projection for the global view of clusters. Right: Matrix of subspace clusters for in-depth view. (b) Heidi Matrix [141] over a subspace.

2.7 Visualization techniques applied in Ferdosi’s work [52]. Left: 1D subspace. Middle: 2D subspace. Right: Subspace with 3 or more dimensions.

3.1 Working steps for using quality metrics to rank high-dimensional visualizations according to a given task.

3.2 Scatterplot example and its respective density image. For each pixel we compute the mass distribution along different directions and save the smallest value, here depicted by the blue line.

3.3 2D view and rotated projection axes. The projection on the rotated plane has less overlap, and the structures of the data can be seen even in the projection. This is not possible for a projection on the original axes.

3.4 First step of the HDM approach: each plot is ranked for different rotations with the 1D-HDM. The best measure value is taken for the plot.

3.5 Second step of the HDM approach: PCA is computed on the k best selected dimensions and on all the possible subsets greater than 3 dimensions. The first two components are plotted in scatterplots, which are ranked with the 2D-HDM. The best measure value indicates the best scatterplot where the class information is separated.


3.6 Synthetic examples of parallel coordinates and their respective Hough spaces: (a) presents two well-defined line clusters and is more interesting for the cluster identification task than (b), where no line cluster can be identified. Note that the bright areas in the ρθ-plane represent the clusters of lines with similar ρ and θ.

3.7 Results for the Parkinson’s Disease data set using our RVM measure (Section 3.1.2). While clumpy low-correlation bearing views are punished (bottom row), views containing higher correlation between the variables are preferred (top row).

3.8 Results for the Olives data set using our CDM measure (Section 3.1.3). The different colors depict the different classes (regions) of the data set. While it is impossible for this data set to find views completely separating all classes, our CDM measure still found views where most of the classes are mutually separated (top row). In the worst ranked views the classes clearly overlap with each other (bottom row).

3.9 Results for the Olives data set using our HDM measure (Section 3.1.3). The best ranked plot is the PCA of dim(4,5,8) revealing a good view on all the classes, the second best is the PCA of dim(1,2,4) and the third is the PCA on all 8 dimensions. The differences between the last two are small because the variance in the additional dimensions for the 3rd eigenvector, relative to the 2nd, is not large. The difference between the last two views and the first view is clearly visible (e.g., looking at the yellow class).

3.10 Results for the Wine data set using our CSM measure (Section 3.1.3). The best ranked plots present a large distance between the centers of the class clusters while the worst ranked views show only cluttered data.

3.11 Results for the Wine data set using our CDM measure (Section 3.1.3). Note that the second best ranked view, (dim1,dim7) (with CDM = 89), is not considered good using the CSM measure (CSM = 58).

3.12 Results on the WDBC data set for the RVM (top) and the CDM (bottom). In this example, views with a quality value of less than 0.95 have been faded out. This way, many irrelevant views can be faded out, reducing the number of plots to be inspected by the user in more detail to a more manageable number.

3.13 Results for the non-classified version of the Parkinson’s Disease data set. Best and worst ranked visualizations using our HSM measure for non-classified data (ref. Section 3.1.4). Top row: The three best ranked visualizations and their respective normalized measures. Well-defined clusters in the data set are favored. Bottom row: The three worst ranked visualizations. The large amount of spread complicates interpretation. Note that the user task related to this measure is not to find possible correlation between the dimensions but to detect well-separated clusters.

3.14 Results of the SM for the Cars data set. Cars using gasoline are shown in black, diesel in red. Best and worst ranked visualizations using our Hough Similarity Measure (Section 3.1.5) for parallel coordinates. Top row: The three best ranked visualizations and their respective normalized measures. Bottom row: The three worst ranked visualizations.


3.15 Results of the OM for the WDBC data set. Malignant nuclei are colored black while healthy nuclei are red. Best and worst ranked visualizations using our Overlap Measure (Section 3.1.5) for parallel coordinates. Top row: The three best ranked visualizations. In addition to good similarity, which resembles clustering, visualizations that minimize the overlap between the classes are favored, so that the difference between malignant and benign cells becomes clearer. Bottom row: The three worst ranked visualizations. The overlap of the data complicates the analysis and the information is useless for the task of discriminating malignant and benign cells.

3.16 Results of the HSM for the synthetic data set from [82] presenting the best and worst ranked visualizations using our HSM measure for non-classified data (ref. Section 3.1.4). Top row: The three best ranked visualizations and their respective normalized measures. Well-defined clusters in the data set are favored. Bottom row: The three worst ranked visualizations. The large amount of spread complicates interpretation. Note that the user task related to this measure is not to find high correlation between the dimensions but to detect well-separated clusters.

3.17 Matrix for the synthetic data set with scatterplots above the main diagonal and parallel coordinate plots below.

3.18 Results of the 7 measures for classified and unclassified data. The left column shows the result for the scatterplot measures and the right column for the parallel coordinates measures. The ranks are sorted in decreasing order and the target patterns are marked with red crosses.

3.19 Scatterplot of the first two components of the PCA over dimensions 2, 5 and 6.

3.20 Projections of scatterplots used in the experiment. Participants had to select the best five projections and order them by their quality. The order of the scatterplots was permuted for each participant separately using the Latin-Square method.

3.21 Correlation of measures with users’ classification shows highest R² values for the 2D-HDM measure.

3.22 Correlation of measures with users’ classification for highest and one lowest quality projection.

3.23 Surprising study results.

4.1 (Top row of Figure 3.8) Ranking projections according to the Class Density 

Measure, favoring projections with minimal overlap between predefined 

classes (i.e., the colors) [133]. . . . . . . . . . . . . . . . . . . . . . . . . . . 68 

4.2 Clutter reduction achieved through axes reordering in a scatterplot matrix 

(initial visualization on the left, reordered on the right) [112]. . . . . . . . . 68 

4.3 Data abstraction algorithm based on sampling, aiming at reducing data size 

while preserving relevant patterns. Original visualization on the left with 

16384 data items. Sampled visualization on the right with 987 items and a 

visual quality of 0.95 [80]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69


4.4 Quality metrics pipeline. The pipeline provides an additional layer named 

quality metrics base automation on top of the traditional information visualization 

pipeline [36]. The layer obtains information from the stages of the 

pipeline (the boxes) and influences the processes of the pipeline through the 

metrics it calculates. The user is always in control. . . . . . . . . . . . . . . 72 

4.5 Mapping a 10 dimensional data set to a scatterplot with four visual primitives 

(x-axis, y-axis, size, and color) has over 5000 possible alternative 

mappings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 

4.6 Quality metrics pipeline for the first example from [133]: (A) generation of 

alternatives; (B) evaluation of alternatives (image space); (C) creation of 

the final representation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 

4.7 Interactive chart to select number of dimensions to keep vs. information 

loss [82]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 

4.8 Top: best ordering to enhance clustering. Bottom: best ordering to enhance 

correlation [82]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 

4.9 Quality metrics pipeline for the second example from [82]: (A) dimensions 

ranked by their importance; (B) selection of number of dimensions to retain 

vs. information loss; (C) creation of the final mapping with ordering. . . . . 81 

4.10 Visual abstraction of a scatterplot matrix from [42]. . . . . . . . . . . . . . 81 

4.11 Quality metrics pipeline for example three from [42]: (A) data features compared 

between the original data and the abstracted data; (B) instantiation 

of the desired abstraction level guided by quality metrics. . . . . . . . . . . 82 

4.12 Visual abstraction chart with threshold setting for the abstraction level and 

feedback on abstraction quality [42]. . . . . . . . . . . . . . . . . . . . . . . 82 

4.13 Left: star glyphs representing original data set. Right: visualized data after 

DOSFA was applied [158]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 

4.14 Quality metrics pipeline for example four from [158]: (A) construct hierarchical 

structure of dimensions by clustering; (B) filter dimensions by 

similarity and importance; (C) map dimensions ordering to visualization; 

(D) influence the view according to the quality measured (spacing the parallel 

coordinates according to their similarity). The user can steer all these 

steps, after interacting with the clustered dimensions shown in an InterRing 

visualization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 

4.15 Taxonomy of factors in visual cluster separation, where factor axes are 

marked to show the ranges where existing measures are successful; gaps 

represent failure cases. The centroid measure (CDM) is marked in blue and 

the grid (2D-HDM) is marked in red. All positions are approximate estimates. 

Marked along the factor axes are six data sets that are exemplified 

in the paper. (Used with permission by [122].) . . . . . . . . . . . . . . . . 88 

4.16 A taxonomy of data characteristics with respect to class separation in scatterplots. 

Some factors are organized as axes (arrows) while others are 

binned. Between-Class factors often result from the variance of Within- 

Class factors (horizontal dependencies), and factors at the top can strongly 

influence factors below them (vertical dependencies). Class Separation is 

therefore dependent on all other factors (used with permission by [122]). . . 89 

5.1 Data projected in several subspaces. . . . . . . . . . . . . . . . . . . . . . . 95 

5.2 Workflow of subspace cluster analysis using the ClustNails system. . . . . 102


5.3 Two subspace clusters visualized as spikes. The clusters share common dimensions 

but the importance of the dimensions for the clusters is different. 

Dim29 and dim32 in the left cluster show smaller spikes than in the right 

cluster, as they are considered less important for the definition of that cluster 

according to our measure wk m . Furthermore, the left cluster has fewer 

dimensions and more objects than the right cluster. . . . . . . . . . . . . . . 103 

5.4 HeatNails visualization. Bottom: showing the distribution of dimension 

values for all dimensions (rows) and records (columns). Top: showing histograms 

for the values of all dimensions per cluster for comparison purposes. 104 

5.5 Visualization of the subspace clusters of the USDA Food Composition data 

set generated by Proclus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 

5.6 Sorted view (Value ordering function applied). . . . . . . . . . . . . . . . . 107 

5.7 Visualization of the subspace clusters in VISA [14] framework discussed in 

Subsection 5.1.5. Cluster view (left), record view (right). . . . . . . . . . . . 108 

5.8 Alternative data distributions and groupings from [103] in two different subspaces 

of a larger high-dimensional data space (domain here: demographic 

data analysis). Our proposed visual analysis method integrates the notion 

of alternative subspaces into the analysis process and links it to the task of 

comparative cluster analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . 111 

5.9 Our proposed analysis pipeline. A subspace selection algorithm is applied 

to automatically identify a candidate set of interesting subspaces. A filtering 

step reduces the potentially large and redundant set of automatically 

obtained subspaces to a user-selectable number of representing subspaces. 

Visual-interactive user exploration then proceeds on the subspace representations. 

Subspace analysis is also supported by comparative cluster views, 

allowing users to identify meaningful similar, complementary or even conflicting 

clustering structures in the set of subspaces. . . . . . . . . . . . . . 114 

5.10 Filtering cases that can be supported by our two defined subspace similarity 

functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 

5.11 Subspace representation by 2D scatterplots with dimension glyph. We can 

see the visual representations of two 5D subspaces (left) and one 4D subspace 

(right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 

5.12 (1) Linearly sorted view of subspaces for the 12D synthetic data set from 

[52] showing the full result of SURFING, consisting of 296 subspaces. The 

selected subspace in this view is shown in a (2) single subspace view to 

enable interaction and in (3) a parallel coordinates view with the subspace 

dimensions as the first axes (highlighted), and all the other data dimensions 

as the last axes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 

5.13 Hierarchical agglomerative grouping of the 296 interesting subspaces. The 

red line shows the threshold for 6 groups shown in the subspace group view. 

Each group is marked by a colored rectangle. The colors are maintained in 

Figure 5.14. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 

5.14 Subspace group view for the 12D synthetic data set with six subspace groups. 118 

5.15 Dimension-based subspace similarity MDS view of the 296 subspaces selected 

by the subspace search algorithm. . . . . . . . . . . . . . . . . . . . . 119


5.16 All linked views: (1) Subspace group view for the 12D synthetic data set 

with six subspace groups. (2) Single subspace view showing the representative 

subspace for the first group. (3) Details-on-demand in the parallel 

coordinates view for the selected subspace. (4) The MDS layout of the subspace 

search results based on their dimension similarity. (5) Group detail 

view for the three (orange, green, purple) subspace groups. (6) Hierarchical 

navigation buttons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 

5.17 Linearly sorted view cut-out of subspaces for the 18D USDA Food Composition 

data set. The full result of SURFING, consisting of 216 subspaces. 

We see a rather high level of redundancy. Subspaces exhibiting more structure 

are found in particular at the mid and end positions in the ranking. 

Relying only on the numerically top ranked results, we would have omitted 

such interesting cases from the analysis. . . . . . . . . . . . . . . . . . . . . 123 

5.18 (A) Interesting spotted subspace (Carbohydrate, Fiber) presenting two clusters. 

(B) Subspace (Carbohydrate, Lipid, Protein) in the same cluster 

group of (A) where the cluster structure changes. (C) Green marked third 

cluster in subspace from (B). (D) Subspace (Fiber, Protein, Vit D) of orange 

color-framed subspace group, where the alternative clustering of points 

is visible. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 

5.19 (1) Grouped view of subspaces for the 18D USDA Food Composition Data 

Set with 12 group representatives. (2) The brown and orange group components 

are shown in the components view. (3) MDS Layout of the total 

number of subspaces with cross-colored group representatives. . . . . . . . . 124 

6.1 Interactive exploration of subspace clustering results. . . . . . . . . . . . . . 130 

6.2 Interactive exploration of subspace search results. . . . . . . . . . . . . . . . 131 

6.3 Visual comparison of subspace clustering results using visualization. . . . . 131 

6.4 Visual-assisted in-line steering of subspace clustering. . . . . . . . . . . . . . 132 

A.1 Empirical study experiment form version A. . . . . . . . . . . . . . . . . . . 153 

A.2 Empirical study experiment form version B. . . . . . . . . . . . . . . . . . . 154 

A.3 The eight projections that were never selected by a user as being on the 

scale 1 to 5 in terms of separability of classes among the 18 presented plots. 155 

A.4 Pipeline for “A Projection Pursuit Algorithm for Exploratory Data Analysis” 

by Friedman and Tukey [54]: (A) different 2D linear, but not axis-parallel, 

data projections are computed and evaluated by the quality metric; 

(B) the best projection direction is chosen by the quality metric, called “usefulness” 

index, that measures the quality of a projection axis and varies the 

projection direction so that the index is maximized. . . . . . . . . . . . . . 156


A.5 Pipeline for “A Rank-by-Feature Framework for Interactive Exploration of 

Multidimensional Data” by Seo and Shneiderman [126]: (A) generation of 

projections and each 1D and 2D projection is evaluated/ranked by a quality 

metric selected by the user; (B) best projections are presented; (C) present 

ranking scores in a color-coded grid (“Score Overview”), as well as a color-coded 

“Ordered List” for each projection. The user selects one view in the 

list or grid, and can also change dimension axes and then the view adapts. 

Please note: here we have a visualization of dimensions and quality metric 

scores, that are highly interactive, rather than a static projection of data 

records. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 

A.6 Pipeline for “Finding and Visualizing Relevant Subspaces for Clustering 

High-Dimensional Astronomical Data Using Connected Morphological Operators” 

by Ferdosi et al. [52]: (A) generation of projections, all above 3D 

are reduced with PCA; the user can change the smoothing parameter, which 

influences the number of projections; (B) evaluate each view; the user can 

select the view to inspect. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 

A.7 Pipeline for “Graph-Theoretic Scagnostics” by Wilkinson et al. [151]: (A) 

generation of projections; (B) all 2D views are ranked by several metrics; (C) 

once the metrics have been computed, they are used to create the SPLOM 

(rows and columns are the metrics) - projections are mapped as data points. 157 

A.8 Pipeline for “Selecting good views of high-dimensional data using class consistency” 

by Sips et al. [129]: (A) all 2D projections are ranked with the 

quality metric; (B) each view is associated with a quality metric computed 

in A; (C) view transformation decides which scatterplot to highlight (fade 

out) depending on the quality values and the set threshold. . . . . . . . . . 157 

A.9 Pipeline for “Coordinating computational and visual approaches for interactive 

feature selection and multivariate clustering” by Guo [59]: (A) all 2D 

projections are evaluated with the “minimum conditional entropy (MCE)”; 

(B) original dimensions are clustered to find an ordering according to their 

MCE value; (C) matrix ordered according to dimension clustering. The 

user can 1) select, add to, or subtract from a variable subset that is analyzed 

further; 2) move the threshold bar for the connecting edges, and 

clusters are automatically extracted and colored; 3) interact to link, brush 

and select elements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 

A.10 Pipeline for “Exploring High-D Spaces with Multiform Matrices and Small 

Multiples” by MacEachren et al. [98]: (A) automatic selection of potentially 

interesting subspaces of variables; the user can also manually select 

subspaces; (B) all 2D plots are ranked with a quality metric (conditional 

entropy based); (C) the matrix view is colored and ordered according to 

the quality metric value. The user can select a dimension subset to be 

visualized with other visualization techniques. . . . . . . . . . . . . . . . . . 158 

A.11 Pipeline for “Improving the Visual Analysis of High-dimensional Datasets 

Using Quality Measures” by Albuquerque et al. [8] for Jigsaw Maps: (A) 

mapping of dimension to 2D displays; (B) all 2D plots are ranked with a 

quality metric to select the best. . . . . . . . . . . . . . . . . . . . . . . . . 158



A.12 Pipeline for “Improving the Visual Analysis of High-dimensional Datasets 

Using Quality Measures” by Albuquerque et al. [8] for RadVis: (A) all views 

are ranked with a quality metric; (B) dimensions are ordered according to 

quality values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 


A.13 Pipeline for “Improving the Visual Analysis of High-dimensional Datasets 

Using Quality Measures” by Albuquerque et al. [8] for Table Lens: (A) 

quality metric is computed on the data; (B) user can select an area, marking 

dimensions and records; the view is then transformed according to the user 

interaction; (C) colors are mapped according to the quality metrics values 

for outliers and correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 

A.14 Pipeline for “Pargnostics: Screen-Space Metrics for Parallel Coordinates” 

by Dasgupta and Kosara [43]: (A) all 2D views are evaluated according to the 

metrics; (B) the best pairs are selected to compute the best ordering of dimensions. 

The user can also influence this decision by selecting interesting 

plots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 

A.15 Pipeline for “Combining automated analysis and visualization techniques 

for effective exploration of high-dimensional data” by Tatu et al. [133] for 

HDM: (A) all 2D data tables are evaluated according to the 1D-HDM; (B) 

create the best nD visible on the 2D plot (with PCA), evaluated by the 

2D-HDM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 

A.16 Pipeline for “High-Dimensional Visual Analytics: Interactive Exploration 

Guided by Pairwise Views of Point Distributions” by Wilkinson et al. [152]: 

(A) generation of projections; (B) all 2D views are evaluated according to 

quality metric; (C) a sorted/highlighted view is created using the metrics. 

The user can navigate through the ranked list, and sort and highlight plots 

in this and the SPLOM view. . . . . . . . . . . . . . . . . . . . . . . . . . . 159 

A.17 Pipeline for “Clutter Reduction in Multi-Dimensional Data Visualization 

Using Dimension Reordering” by Peng et al. [112]: (A) quality metric is 

computed on the data; (B) quality metric calculated also dependent on the 

visual abstraction; (C) best visual mapping (ordering) decided based on 

metric values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 

A.18 Pipeline for “Similarity Clustering of Dimensions for an Enhanced Visualization 

of Multidimensional Data” by Ankerst et al. [9]: (A) quality metric 

is computed on the data; (B) quality metric calculated also dependent on 

the visual abstraction; (C) best visual mapping (ordering) decided based 

on metric values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 

A.19 Pipeline for “Quality Metrics for 2D Scatterplot Graphics: Automatically 

Reducing Visual Clutter” by Bertini and Santucci [24]: (A) quality metric 

is computed on the data density and screen density and compared; (B) 

projection and sampling based on metric values. . . . . . . . . . . . . . . . 160 

A.20 Pipeline for “A Screen Space Quality Method for Data Abstraction” by Johansson 

and Cooper [80]: (A) sampled and original data tables are associated 

to quality metric computed on the views of sampled and original data; 

(B) the values are used to decide upon the sampling rate. . . . . . . . . . . 160


A.21 Pipeline for “Enabling Automatic Clutter Reduction in Parallel Coordinate 

Plots” by Ellis and Dix [48]: (A) pixel occlusion is measured in the view 

space; the user can move a window (lens) and sampling and measuring 

occlusion is done only in this window (B) the values of the quality metric 

are used to decide upon the sampling rate. . . . . . . . . . . . . . . . . . . . 161 

A.22 Pipeline for “Pixnostics: Towards Measuring the Value of Visualization” 

by Schneidewind et al. [120]: (A) a subset of dimensions is selected with 

standard mining techniques; (B) alternative mappings of selected data are 

evaluated on the screen space; (C) and (D) based on the quality value the 

best subset and mapping is determined. The user can decide to manually fix 

the mapping of some data features to visual features. . . . . . . . . . . . . 161 

A.23 Hierarchical agglomerative grouping of the 296 interesting subspaces. The 

red line shows the threshold for 6 groups shown in the subspace group view. 

Each group is marked by a colored rectangle. The colors are maintained in 

Figure 5.14. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162


List of Tables 

3.1 Overview and classification of our quality measures. . . . . . . . . . . . . . 31 

3.2 Overview of the data sets used to show the measures’ properties. . . . . . 41 

3.3 Overview of the analyzed measures with the reference for additional details. 55 

3.4 Results of the regression analysis. . . . . . . . . . . . . . . . . . . . . . . . . 60 

4.1 Visualization techniques categorized by their layout dimensionality (i.e., the 

number of axes of the visualization). . . . . . . . . . . . . . . . . . . . . . . 77 

4.2 Quality metrics papers classified according to quality metrics factors (sorted 

by purpose). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 

A.1 Dimension names for the Cars data set. . . . . . . . . . . . . . . . . . . . . 146 

A.2 Dimension names for the Olives data set [163]. . . . . . . . . . . . . . . . . 146 

A.3 Dimension names for the Parkinson’s Disease data set [95, 96]. . . . . . . . 147 

A.4 Dimension names for the Wine data set [53]. . . . . . . . . . . . . . . . . . 147 

A.5 Dimension names for the WDBC data set [131]. . . . . . . . . . . . . . . . . 148


A 

Appendix 

Contents 

A.1 Original Data Dimensions for Used Data Sets . . . . . . . . . . 145 

A.2 Empirical Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 

A.2.1 General Questions Form . . . . . . . . . . . . . . . . . . . . . . . 149 

A.2.2 Experiment Form . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 

A.2.3 Additional Experiment Results . . . . . . . . . . . . . . . . . . . 155 

A.3 Quality Metrics Pipelines for the Literature Review . . . . . . 156 

A.4 Hierarchical Grouping of Interesting Subspaces . . . . . . . . . 162 

A.1 Original Data Dimensions for Used Data Sets 

The Cars data set was collected by another institute from Braunschweig and provided to 

our partners there; its original dimensions are listed in Table A.1.


Table A.1: Dimension names for the Cars data set. 

original            renamed
TYPEOFMOTOR         dim0 (class)
MANUFACTURER        dim1
TYPE                dim2
PRICE               dim3
CYLINDERCAPACITY    dim4
POWER               dim5
RPM                 dim6
TORQUE              dim7
VMAX                dim8
ACCELERATION        dim9
FUELCONSUMPTION     dim10
CO2EMISSION         dim11
WEIGHT              dim12
LENGTH              dim13
WIDTH               dim14
HEIGHT              dim15
WHEELBASE           dim16
LOADCAPACITY        dim17
TRUNK               dim18
TOWINGCAPACITY      dim19
ROOFLOAD            dim20
TANKCAPACITY        dim21
TAXES               dim22

The Olives data set can be found at 

http://www2.chemie.uni-erlangen.de/publications/ANN-book/datasets/oliveoil/index.html 

and has the original dimensions enumerated in Table A.2. 

Table A.2: Dimension names for the Olives data set [163]. 

original       renamed
palmitic       dim1
palmitoleic    dim2
stearic        dim3
oleic          dim4
linoleic       dim5
linolenic      dim6
arachidic      dim7
eicosenoic     dim8
area           dim9 (class)


The Parkinson’s Disease data set can be found at 

http://archive.ics.uci.edu/ml/datasets/Parkinsons 

and has the original dimensions enumerated in Table A.3. 

Table A.3: Dimension names for the Parkinson’s Disease data set [95, 96]. 

original                                                                       renamed
status - health status of the subject (one) - Parkinson’s, (zero) - healthy    dim1 (class)
MDVP:Fo(Hz) - average vocal fundamental frequency                              dim2
MDVP:Fhi(Hz) - maximum vocal fundamental frequency                             dim3
MDVP:Flo(Hz) - minimum vocal fundamental frequency                             dim4
MDVP:Shimmer(dB) - measure of variation in amplitude                           dim5
HNR - measure of ratio of noise to tonal components in the voice               dim6
RPDE - nonlinear dynamical complexity measure                                  dim7
D2 - nonlinear dynamical complexity measure                                    dim8
DFA - signal fractal scaling exponent                                          dim9
spread1 - nonlinear measure of fundamental frequency variation                 dim10
spread2 - nonlinear measure of fundamental frequency variation                 dim11
PPE - nonlinear measure of fundamental frequency variation                     dim12

The Wine data set can be found at 

http://archive.ics.uci.edu/ml/datasets/Wine 

and has the original dimensions enumerated in Table A.4. 

Table A.4: Dimension names for the Wine data set [53]. 

original                        renamed
Alcohol                         dim1
Malic acid                      dim2
Ash                             dim3
Alcalinity of ash               dim4
Magnesium                       dim5
Total phenols                   dim6
Flavanoids                      dim7
Nonflavanoid phenols            dim8
Proanthocyanins                 dim9
Color intensity                 dim10
Hue                             dim11
OD280/OD315 of diluted wines    dim12
Proline                         dim13
cluster ID                      dim14 (class)


The Wisconsin Diagnostic Breast Cancer (WDBC) data set can be found at 

http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic) 

and has the original dimensions enumerated in Table A.5. [131] contains detailed 

descriptions of how these features are computed. 

Table A.5: Dimension names for the WDBC data set [131]. 

original                                                        renamed
diagnosis                                                       dim1 (class)
the following ten features:                                     dim2-dim31
  radius
  texture
  perimeter
  area
  smoothness (local variation in radius lengths)
  compactness (perimeter² / area - 1.0)
  concavity (severity of concave portions of the contour)
  concave points (number of concave portions of the contour)
  symmetry
  fractal dimension (“coastline approximation” - 1)

The mean, standard error, and “worst” or largest (mean of the three largest values) of 

these features were computed for each image, resulting in 30 features. For instance, dim2 

is Mean Radius, dim12 is Radius SE, dim22 is Worst Radius.
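
The renaming scheme described above can be sketched in a few lines of Python. This is only an illustration: the function name `wdbc_dim_name` is hypothetical, and the sketch assumes that the ten features appear within each block (mean, SE, worst) in the order of Table A.5, which is consistent with the dim2, dim12, and dim22 examples but is not stated explicitly.

```python
# Ten base features in the order of Table A.5 (assumed to be the order within
# each statistic block; the class label "diagnosis" is dim1).
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]

# Block order implied by the examples: dim2 = mean, dim12 = SE, dim22 = worst.
STATISTICS = ["mean", "SE", "worst"]


def wdbc_dim_name(dim: int) -> str:
    """Return a readable name for a renamed WDBC dimension (dim1..dim31)."""
    if dim == 1:
        return "diagnosis (class)"
    if not 2 <= dim <= 31:
        raise ValueError(f"dim must be in 1..31, got {dim}")
    # dim2..dim11 -> block 0, dim12..dim21 -> block 1, dim22..dim31 -> block 2
    block, offset = divmod(dim - 2, len(BASE_FEATURES))
    return f"{STATISTICS[block]} {BASE_FEATURES[offset]}"


if __name__ == "__main__":
    for d in (2, 12, 22):
        print(f"dim{d}: {wdbc_dim_name(d)}")
```

Under these assumptions, three blocks of ten features account for the 30 features mentioned in the text, and dim1 holds the class label.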


A.2 Empirical Study 

A.2.1 

General Questions Form 

On the next two pages we present the introductory form to our study. The participants 

were asked to fill in personal information on the first page and the experiment task was 

explained on the second page through an example. 

Please note that the study took place at the University of Konstanz; the original forms 

are therefore in German and are given here in English translation.

Personal Information 

(* Please tick as appropriate) 

Field of study: 

Number of semesters in this field: 

Gender*: male female 

Age: 

How often have you worked with data and their analysis*? 

(e.g., Excel spreadsheets, databases, etc.) 

Constantly Often Sometimes Rarely Never 

Software used: 

How often have you worked with the graphical representation of data*? 

(e.g., Excel charts, etc.) 

Constantly Often Sometimes Rarely Never 

Software used: 

Please read the instructions carefully and then work through the following sheet 

without interruption. 

Imagine you are a wine merchant with a large stock of wine bottles. Your wine bottles 

can be divided into three types of wine (aperitif, liqueur, and table wine). All wine 

bottles have undergone a series of standard analyses that provide information about 

their properties, such as alcohol content, hue, etc. The results of these analyses are 

shown in 18 scatterplots, each of which plots two properties (e.g., X–Y) against each 

other. Each wine bottle is represented by a point in the plot; the colors of the points 

stand for the three types of wine. Based on these plots, you have to determine which 

pair of properties is best suited for distinguishing the types of wine, as shown in the 

following example: 

[Example scatterplot with axes Property X and Property Y; legend: Type A, Type B, Type C.] 

Your task now is to select the plots that are well suited 

for distinguishing the types of wine! 

Please assign the numbers 1 to 5 to the 5 best plots (1 stands for the best plot). 

Enter the numbers in the boxes next to the plots. The boxes of the other plots may 

remain empty. 

While working on the next page, we ask you to stay quiet and focused. 

Thank you very much for your participation!


A.2.2 

Experiment Form 

On the next page we show two examples of the study forms for the participants of the 

empirical study described in Section 3.2. Every participant was shown the same plots, 

but ordered by a different permutation.
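
One way to generate such per-participant orderings is a Latin square, the method named in the caption of Figure 3.20. The sketch below is illustrative only: it assumes the simplest cyclic Latin square over the 18 plots, and the function names `cyclic_latin_square` and `plot_order_for` are hypothetical. The concrete square used in the study is not specified here.

```python
def cyclic_latin_square(n: int) -> list[list[int]]:
    """Build an n x n cyclic Latin square with entries 0..n-1.

    Row r is the sequence 0..n-1 rotated left by r positions, so every
    entry occurs exactly once in each row and once in each column.
    """
    return [[(row + col) % n for col in range(n)] for row in range(n)]


def plot_order_for(participant: int, n_plots: int = 18) -> list[int]:
    """Return the 1-based plot order shown to a given participant.

    Participants are numbered from 0; rows are reused cyclically if there
    are more participants than plots (an assumption of this sketch).
    """
    square = cyclic_latin_square(n_plots)
    return [plot + 1 for plot in square[participant % n_plots]]


if __name__ == "__main__":
    for p in range(3):
        print(f"participant {p}: {plot_order_for(p)}")
```

With this construction, every plot appears at every position equally often across any 18 consecutive participants, which balances simple position effects. Carry-over effects between neighboring plots would need a different design.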


[Form A; instruction, translated from German: “Assign the numbers 1 to 5 to the 5 best plots (1 for the best).”] 

Figure A.1: Empirical study experiment form version A.


[Form B; instruction, translated from German: “Assign the numbers 1 to 5 to the 5 best plots (1 for the best).”] 

Figure A.2: Empirical study experiment form version B.


A.2.3 

Additional Experiment Results 

These plots were never selected by any user as one of the five best plots (ranks 1 to 5) 

out of the 18 presented. 

Figure A.3: The eight projections that were never selected by a user as being on the scale 1 to 5 

in terms of separability of classes among the 18 presented plots.


A.3 Quality Metrics Pipelines for the Literature Review 

Here we attach all the quality metrics pipelines for all the papers from the taxonomy 

presented in Section 4.1 and summarized in Table 4.2 that are not part of the examples 

of this section. We ordered them in the same order that the papers are presented in the 

taxonomy’s table. 


Figure A.4: Pipeline for “A Projection Pursuit Algorithm for Exploratory Data Analysis” by Friedman 

and Tukey [54]: (A) different 2D linear, but not axis-parallel, data projections are computed 

and evaluated by the quality metric; (B) the best projection direction is chosen by the quality 

metric, called “usefulness” index, that measures the quality of a projection axis and varies the 

projection direction so that the index is maximized. 
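The search loop described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not Friedman and Tukey's algorithm: it searches a single 1D direction in 2D data, and the population variance stands in for the “usefulness” index. The names `project`, `variance_index`, and `pursue` are hypothetical.

```python
import math

def project(points, theta):
    """Project 2D points onto the unit direction at angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [x * c + y * s for x, y in points]

def variance_index(values):
    """Stand-in quality index (population variance), not the original index."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def pursue(points, index, theta=0.0, step=0.5, tol=1e-6):
    """Hill-climb over the projection angle until the step size drops below tol."""
    best = index(project(points, theta))
    while step > tol:
        for candidate in (theta + step, theta - step):
            score = index(project(points, candidate))
            if score > best:
                theta, best = candidate, score
                break
        else:
            step /= 2  # no neighbour improved: refine the search
    return theta % math.pi, best

# Points on the line y = x: the best direction should be close to 45 degrees.
points = [(t, t) for t in range(-3, 4)]
theta, score = pursue(points, variance_index)
```

Any other index with the same signature (a list of projected values in, a score out) could be passed to `pursue` in place of `variance_index`.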


Figure A.5: Pipeline for “A Rank-by-Feature Framework for Interactive Exploration of Multidimensional 

Data” by Seo and Shneiderman [126]: (A) projections are generated and each 1D and 

2D projection is evaluated/ranked by a quality metric selected by the user; (B) best projections 

are presented; (C) ranking scores are presented in a color-coded grid (“Score Overview”), as well as a 

color-coded “Ordered List” for each projection. The user selects one view in the list or grid, and 

can also change dimension axes, after which the view adapts. Please note: here we have a visualization 

of dimensions and quality metric scores that is highly interactive, rather than a static projection 

of data records.
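Steps (A) and (B) of this pipeline can be sketched as a ranking over all axis-parallel 2D projections. This is an illustrative sketch, not the Rank-by-Feature Framework itself: the function names and the toy data are hypothetical, and the absolute Pearson correlation is only one example of a metric a user might select.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def rank_projections(columns, metric):
    """Score every pair of dimensions and return (score, pair) tuples, best first."""
    scored = [(metric(columns[a], columns[b]), (a, b))
              for a, b in combinations(sorted(columns), 2)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored

# Toy data set (assumed): dimension name -> values per record.
data = {"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10], "z": [5, 1, 4, 2, 3]}
ranking = rank_projections(data, lambda a, b: abs(pearson(a, b)))
```

The head of `ranking` would drive the "best projections" display, while the full list of scores could be mapped to colors for an overview grid.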

Figure A.6: Pipeline for “Finding and Visualizing Relevant Subspaces for Clustering High- 

Dimensional Astronomical Data Using Connected Morphological Operators” by Ferdosi et al. [52]: 

(A) generation of projections, all above 3D are reduced with PCA; the user can change the smoothing 

parameter, which influences the number of projections; (B) each view is evaluated; the user can 

select the view to inspect. 


Figure A.7: Pipeline for “Graph-Theoretic Scagnostics” by Wilkinson et al. [151]: (A) generation 

of projections; (B) all 2D views are ranked by several metrics; (C) once the metrics have been 

computed, they are used to create the SPLOM (rows and columns are the metrics) - projections 

are mapped as data points. 


Figure A.8: Pipeline for “Selecting good views of high-dimensional data using class consistency” 

by Sips et al. [129]: (A) all 2D projections are ranked with the quality metric; (B) each view is 

associated with a quality metric computed in A; (C) view transformation decides which scatterplot 

to highlight (fade out) depending on the quality values and the set threshold.



Figure A.9: Pipeline for “Coordinating computational and visual approaches for interactive feature 

selection and multivariate clustering” by Guo [59]: (A) all 2D projections are evaluated with the 

“minimum conditional entropy (MCE)”; (B) original dimensions are clustered to find an ordering 

according to their MCE value; (C) matrix ordered according to dimension clustering. The user 

can 1) select, add to, or subtract from a variable subset that is analyzed further; 2) move the 

threshold bar for the connecting edges, and clusters are automatically extracted and colored; 3) 

interact to link, brush and select elements. 


Figure A.10: Pipeline for “Exploring High-D Spaces with Multiform Matrices and Small Multiples” 

by MacEachren et al. [98]: (A) automatic selection of potentially interesting subspaces of variables; 

the user can also manually select subspaces; (B) all 2D plots are ranked with a quality metric 

(conditional entropy based); (C) the matrix view is colored and ordered according to the quality 

metric value. The user can select a dimension subset to be visualized with other visualization 

techniques. 


Figure A.11: Pipeline for “Improving the Visual Analysis of High-dimensional Datasets Using 

Quality Measures” by Albuquerque et al. [8] for Jigsaw Maps: (A) mapping of dimension to 2D 

displays; (B) all 2D plots are ranked with a quality metric to select the best. 


Figure A.12: Pipeline for “Improving the Visual Analysis of High-dimensional Datasets Using 

Quality Measures” by Albuquerque et al. [8] for RadVis: (A) all views are ranked with a quality 

metric; (B) dimensions are ordered according to quality values.



Figure A.13: Pipeline for “Improving the Visual Analysis of High-dimensional Datasets Using 

Quality Measures” by Albuquerque et al. [8] for Table Lens: (A) quality metric is computed on 

the data; (B) user can select an area, marking dimensions and records; the view is then transformed 

according to the user interaction; (C) colors are mapped according to the quality metrics values 

for outliers and correlation. 


Figure A.14: Pipeline for “Pargnostics: Screen-Space Metrics for Parallel Coordinates” by Dasgupta 

and Kosara [43]: (A) all 2D views are evaluated according to the metrics; (B) the best pairs are 

selected to compute the best ordering of dimensions. The user can also influence this decision by 

selecting interesting plots. 


Figure A.15: Pipeline for “Combining automated analysis and visualization techniques for effective 

exploration of high-dimensional data” by Tatu et al. [133] for HDM: (A) all 2D data tables are 

evaluated according to the 1D-HDM; (B) create the best nD visible on the 2D plot (with PCA), 

evaluated by the 2D-HDM. 


Figure A.16: Pipeline for “High-Dimensional Visual Analytics: Interactive Exploration Guided by 

Pairwise Views of Point Distributions” by Wilkinson et al. [152]: (A) generation of projections; 

(B) all 2D views are evaluated according to quality metric; (C) a sorted/highlighted view is created 

using the metrics. The user can navigate through the ranked list, and sort and highlight plots in 

this and the SPLOM view.



Figure A.17: Pipeline for “Clutter Reduction in Multi-Dimensional Data Visualization Using Dimension 

Reordering” by Peng et al. [112]: (A) quality metric is computed on the data; (B) quality 

metric calculated also dependent on the visual abstraction; (C) best visual mapping (ordering) 

decided based on metric values. 


Figure A.18: Pipeline for “Similarity Clustering of Dimensions for an Enhanced Visualization 

of Multidimensional Data” by Ankerst et al. [9]: (A) quality metric is computed on the data; 

(B) quality metric calculated also dependent on the visual abstraction; (C) best visual mapping 

(ordering) decided based on metric values. 


Figure A.19: Pipeline for “Quality Metrics for 2D Scatterplot Graphics: Automatically Reducing 

Visual Clutter” by Bertini and Santucci [24]: (A) quality metric is computed on the data density 

and screen density and compared; (B) projection and sampling based on metric values. 


Figure A.20: Pipeline for “A Screen Space Quality Method for Data Abstraction” by Johansson 

and Cooper [80]: (A) sampled and original data tables are associated to quality metric computed 

on the views of sampled and original data; (B) the values are used to decide upon the sampling 

rate.



Figure A.21: Pipeline for “Enabling Automatic Clutter Reduction in Parallel Coordinate Plots” by 

Ellis and Dix [48]: (A) pixel occlusion is measured in the view space; the user can move a window 

(lens), and sampling and occlusion measurement are done only in this window; (B) the values of the 

quality metric are used to decide upon the sampling rate. 


Figure A.22: Pipeline for “Pixnostics: Towards Measuring the Value of Visualization” by Schneidewind 

et al. [120]: (A) a subset of dimensions is selected with standard mining techniques; (B) 

alternative mappings of selected data are evaluated on the screen space; (C) and (D) based on the 

quality value the best subset and mapping are determined. The user can decide to manually fix 

the mapping of some data features to visual features.


A.4 Hierarchical Grouping of Interesting Subspaces 

[Dendrogram: hierarchical agglomerative grouping, synthetic dataset. Axes: Subspaces; Distance (Similarity), 0 to 30.] 

Figure A.23: Hierarchical agglomerative grouping of the 296 interesting subspaces. The red line 

shows the threshold for 6 groups shown in the subspace group view. Each group is marked by a 

colored rectangle. The colors are maintained in Figure 5.14.
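The grouping shown in the dendrogram can be sketched with a small agglomerative procedure. The sketch rests on assumptions that are not taken from this appendix: subspaces are compared with the Jaccard distance on their dimension sets, groups are merged with average linkage, and the tree is cut at a target number of groups rather than at a distance threshold. The function names and the six example subspaces are illustrative only.

```python
def jaccard_distance(a, b):
    """Distance between two subspaces given as strings of dimension letters."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def agglomerate(items, k, dist=jaccard_distance):
    """Merge the two closest groups (average linkage) until k groups remain."""
    groups = [[item] for item in items]
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                total = sum(dist(a, b) for a in groups[i] for b in groups[j])
                d = total / (len(groups[i]) * len(groups[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        groups[i].extend(groups[j])  # merge group j into group i
        del groups[j]
    return groups

# A handful of subspace labels, each letter naming one dimension.
subspaces = ["CFG", "CFGH", "CDG", "CDGH", "BK", "FK"]
groups = agglomerate(subspaces, 2)
```

Cutting at a distance threshold instead, as the red line in the figure does, would amount to stopping the loop once the smallest inter-group distance exceeds that threshold.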

Bibliography 

[1] ggplot2. http://had.co.nz/ggplot2/. 

[2] Protovis. http://vis.stanford.edu/protovis/. 

[3] Tableau. http://www.tableausoftware.com/. 

[4] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park. Fast algorithms for 

projected clustering. In Proceedings of the ACM SIGMOD International Conference on 

Management of Data (SIGMOD ’99), pages 61–72. ACM, 1999. 

[5] C. C. Aggarwal and P. S. Yu. Redefining clustering for high-dimensional applications. IEEE 

Transactions on Knowledge and Data Engineering, 14(2):210–225, 2002. 

[6] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic Subspace Clustering of 

High Dimensional Data for Data Mining Applications. In Proceedings of the ACM SIGMOD 

International Conference on Management of Data (SIGMOD ’98), volume 27, pages 94–105. 

ACM, 1998. 

[7] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items 

in Large Databases. In Proceedings of the ACM SIGMOD International Conference on 


[8] G. Albuquerque, M. Eisemann, D. J. Lehmann, H. Theisel, and M. Magnor. Improving the 

Visual Analysis of High-dimensional Datasets Using Quality Measures. In Proceedings of 

the IEEE Symposium on Visual Analytics Science and Technology (VAST ’10), pages 19–26. 

IEEE CS Press, 2010. 

[9] M. Ankerst, S. Berchtold, and D. A. Keim. Similarity clustering of dimensions for an enhanced 

visualization of multidimensional data. In Proceedings of the IEEE Symposium Information 

Visualization (InfoVis ’98), pages 52–60. IEEE CS Press, 1998. 

[10] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. Optics: ordering points to identify 

the clustering structure. In Proceedings of the ACM SIGMOD International Conference on 


[11] M. Ankerst, M. Ester, and H. P. Kriegel. Towards an effective cooperation of the user and the 

computer for classification. In Proceedings of the ACM SIGKDD International Conference 

on Knowledge Discovery and Data Mining (KDD ’00), pages 179–188, 2000. 

[12] D. L. Applegate, R. E. Bixby, V. Chvatal, and W. J. Cook. The Traveling Salesman Problem: 

A Computational Study (Princeton Series in Applied Mathematics). Princeton University 

Press, 2007. 

[13] D. Asimov. The Grand Tour: A Tool for Viewing Multidimensional Data. Journal on 

Scientific and Statistical Computing, 6(1):128–143, 1985. 

[14] I. Assent, R. Krieger, E. Müller, and T. Seidl. VISA: Visual Subspace Clustering Analysis. 

ACM SIGKDD Explorations Newsletter - Special Issue on Visual Analytics, 9(2):5–12, 2007. 

[15] R. A. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press/Addison- 

Wesley, 1999. 

[16] C. Baumgartner, C. Plant, K. Kailing, H.-P. Kriegel, and P. Kröger. Subspace selection for 

clustering high-dimensional data. In Proceedings of the Fourth IEEE Conference on Data 

Mining (ICDM ’04), pages 11–18. IEEE CS Press, 2004. 

[17] R. Becker and W. Cleveland. Brushing scatterplots. Technometrics, 29:127–142, 1987. 

[18] R. A. Becker, W. S. Cleveland, and M.-J. Shyu. The visual design and control of trellis 

display. Journal of Computational and Graphical Statistics, 5(2):123–155, 1996.

[19] B. B. Bederson, J. D. Hollan, K. Perlin, J. Meyer, D. Bacon, and G. Furnas. Pad++: A 

Zoomable Graphical Sketchpad For Exploring Alternate Interface Physics. Journal of Visual 

Languages & Computing, 7(1):3–32, 1996. 

[20] R. Bellman. Dynamic Programming. Princeton University Press, 1st edition, 1957. 

[21] P. Berkhin. A Survey of Clustering Data Mining Techniques. Grouping Multidimensional 

Data, pages 25–71, 2006. 

[22] J. Bertin. Semiology of graphics. University of Wisconsin Press, 1983. 

[23] E. Bertini and D. Lalanne. Investigating and reflecting on the integration of automatic data 

analysis and visualization in knowledge discovery. ACM SIGKDD Explorations Newsletter, 

11:9–18, 2010. 

[24] E. Bertini and G. Santucci. Quality Metrics for 2D Scatterplot Graphics: Automatically 

Reducing Visual Clutter. In Proceedings Smart Graphics (SG), volume 3031, pages 77–89, 

2004. 

[25] E. Bertini and G. Santucci. Give chance a chance: modeling density to enhance scatter plot 

quality through random data sampling. Information Visualization, 5(2):95–110, 2006. 

[26] E. Bertini and G. Santucci. Visual Quality Metrics. In Proceedings of the 2006 AVI workshop 

on BEyond time and errors: noveL evaluation methods for Information Visualization 

(BELIV), pages 1–5. ACM, 2006. 

[27] E. Bertini, A. Tatu, and D. A. Keim. Quality Metrics in High-Dimensional Data Visualization: 

An Overview and Systematization. Proceedings of the IEEE Symposium on Information 

Visualization (InfoVis ’11), 17(12):2203–2212, 2011. 

[28] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When Is ”Nearest Neighbor” Meaningful? 

In Proceedings of the 7th International Conference on Database Theory (ICDT ’99), 

pages 217–235, 1999. 

[29] E. A. Bier, M. C. Stone, K. Pier, K. Fishkin, T. Baudel, M. Conway, W. Buxton, and 

T. DeRose. Toolglass and Magic Lenses: The See-Through Interface. In Conference Companion 

on Human Factors in Computing Systems (CHI ’94), pages 445–446. ACM, 1994. 

[30] T. Boogaerts, L.-C. Tranchevent, G. A. Pavlopoulos, J. Aerts, and J. Vandewalle. Visualizing 

high dimensional datasets using parallel coordinates: Application to gene prioritization. In 

IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE ’12), pages 

52–57. IEEE CS Press, 2012. 

[31] I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications. 

Springer, 2005. 

[32] R. Brath. Metrics for effective information visualization. In Proceedings of the IEEE Symposium 

Information Visualization (InfoVis ’97), pages 108–111, 1997. 

[33] S. Bremm, T. v. Landesberger, J. Bernard, and T. Schreck. Assisted descriptor selection 

based on visual comparative data analysis. Computer Graphics Forum, 30(3):891–900, 2011. 

[34] S. Bremm, T. v. Landesberger, M. Heß, T. Schreck, P. Weil, and K. Hamacher. Interactive 

visual comparison of multiple trees. In Proceedings of IEEE Symposium on Visual Analytics 

Science and Technology (VAST ’11), pages 31–40. IEEE CS Press, 2011. 

[35] N. Cao, D. Gotz, J. Sun, and H. Qu. DICON: Interactive Visual Analysis of Multidimensional 

Clusters. IEEE Transactions on Visualization and Computer Graphics (TVCG ’11), 

17:2581–2590, 2011. 

[36] S. K. Card, J. D. Mackinlay, and B. Shneiderman. Readings in information visualization: 

using vision to think. Morgan Kaufmann Publishers Inc., 1999.

[37] D. B. Carr, R. J. Littlefield, and W. L. Nicholson. Scatterplot Matrix Techniques for Large 

N. In Proceedings of the Seventeenth Symposium on the Interface of Computer Sciences and 

Statistics on Computer Science and Statistics, pages 297–306. Elsevier North-Holland, Inc., 

1986. 

[38] C.-H. Cheng, A. W. Fu, and Y. Zhang. Entropy-based subspace clustering for mining numerical 

data. In Proceedings of the fifth ACM SIGKDD International Conference on Knowledge 

Discovery and Data Mining (KDD ’99), pages 84–93. ACM, 1999. 

[39] E. H. Chi. A Taxonomy of Visualization Techniques Using the Data State Reference Model. 

In Proceedings of the IEEE Symposium on Information Visualization (InfoVis ’00), pages 

69–75. IEEE CS Press, 2000. 

[40] K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. 

Computational Linguistics, 16(1):22–29, 1990. 

[41] T. Cox and M. Cox. Multidimensional Scaling. Chapman & Hall, 1994. 

[42] Q. Cui, M. Ward, E. Rundensteiner, and J. Yang. Measuring Data Abstraction Quality in 

Multiresolution Visualizations. IEEE Transactions on Visualization and Computer Graphics 

(TVCG ’06), 12:709–716, 2006. 

[43] A. Dasgupta and R. Kosara. Pargnostics: Screen-Space Metrics for Parallel Coordinates. 

IEEE Transactions on Visualization and Computer Graphics (TVCG ’10), 16:1017–1026, 

2010. 

[44] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley-Interscience, 2nd edition, 

2001. 

[45] C. Dunne and B. Shneiderman. Improving graph drawing readability by incorporating readability 

metrics: A software tool for network analysts. Technical Report HCIL-2009-13, University 

of Maryland, 2009. 

[46] S. G. Eick and G. J. Wills. High Interaction Graphics. European Journal of Operations 

Research, 81(3):445–459, 1995. 

[47] M. Eisen, P. Spellman, P. Brown, and D. Botstein. Cluster analysis and display of genome-wide 

expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863– 

14868, 1998. 

[48] G. Ellis and A. Dix. Enabling Automatic Clutter Reduction in Parallel Coordinate Plots. 

IEEE Transactions on Visualization and Computer Graphics (TVCG ’06), 12:717–724, 2006. 

[49] G. Ellis and A. Dix. A taxonomy of clutter reduction for information visualisation. IEEE 

Transactions on Visualization and Computer Graphics (TVCG ’07), 13:1216–1223, 2007. 

[50] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering 

clusters in large spatial databases with noise. In Proceedings of the Second ACM SIGKDD 

International Conference on Knowledge Discovery and Data Mining (KDD ’96), pages 226– 

231. AAAI Press, 1996. 

[51] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful 

knowledge from volumes of data. Communications of the ACM, 39:27–34, 1996. 

[52] B. J. Ferdosi, H. Buddelmeijer, S. Trager, M. H. F. Wilkinson, and J. B. T. M. Roerdink. 

Finding and visualizing relevant subspaces for clustering high-dimensional astronomical data 

using connected morphological operators. In Proceedings of the IEEE Symposium on Visual 

Analytics Science and Technology (VAST ’10), pages 35–42. IEEE CS Press, 2010. 

[53] A. Frank and A. Asuncion. University of California Irvine (UCI) Machine Learning Repository, 

2010. 

[54] J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis. 

IEEE Transactions on Computers, 23:881–890, 1974.


[55] Y.-H. Fua, M. Ward, and E. Rundensteiner. Hierarchical parallel coordinates for exploration 

of large data sets. In Proceedings of the Conference on Visualization (VIS ’99), pages 43–50. 

IEEE CS Press, 1999. 

[56] K. Fukunaga. Introduction to statistical pattern recognition. Academic Press Professional, 

Inc., 2nd edition, 1990. 

[57] S. Guha, R. Rastogi, and K. Shim. Cure: an efficient clustering algorithm for large databases. 

In Proceedings of the ACM SIGMOD International Conference on Management of Data 

(SIGMOD ’98), pages 73–84. ACM, 1998. 

[58] S. Günnemann, E. Müller, I. Färber, and T. Seidl. Detection of orthogonal concepts in subspaces 

of high dimensional data. In Proceedings of the 18th ACM conference on Information 

and knowledge management (CIKM ’09), pages 1317–1326, 2009. 

[59] D. Guo. Coordinating computational and visual approaches for interactive feature selection 

and multivariate clustering. Information Visualization, 2(4):232–246, 2003. 

[60] D. Guo, J. Chen, A. M. MacEachren, and K. Liao. A visualization system for space-time 

and multivariate patterns (vis-stamp). IEEE Transactions on Visualization and Computer 

Graphics (TVCG ’06), 12(6):1461–1474, 2006. 

[61] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of 

Machine Learning Research - Special Issue on Variable and Feature Selection, (3):1157–1182, 

2003. 

[62] M. Hahsler, K. Hornik, and C. Buchta. Getting things in order: An introduction to the R 

package seriation. Journal of Statistical Software, 25(3):1–34, 2008. 

[63] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers 

Inc., 1st edition, 2000. 

[64] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers 

Inc., 2nd edition, 2006. 

[65] S. Haroz and K.-L. Ma. Natural visualization. In Proceedings of Eurographics Visualization 

Symposium, pages 43–50, 2006. 

[66] P. N. Hart, N. Nilsson, and B. Raphael. A Formal Basis for the Heuristic Determination of 

Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 

1968. 

[67] C. G. Healey, K. S. Booth, and J. T. Enns. High-speed visual estimation using preattentive 

processing. ACM Transactions on Computer-Human Interaction (TOCHI ’96), 3(2):107–135, 

1996. 

[68] C. G. Healey and J. T. Enns. Building perceptual textures to visualize multidimensional 

datasets. In Proceedings of the Conference on Visualization (VIS ’98), pages 111–118. IEEE 

CS Press, 1998. 

[69] A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in high 

dimensional spaces? In Proceedings of the 26th International Conference on Very Large 

Data Bases (VLDB ’00), pages 506–515. Morgan Kaufmann Publishers Inc., 2000. 

[70] A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Large Multimedia 

Databases with Noise. In Proceedings 4th International Conference on Knowledge Discovery 

in Databases (KDD ’98), pages 58–65, 1998. 

[71] A. Hinneburg and D. A. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality 

in high-dimensional clustering. In Proceedings of the 25th International Conference 

on Very Large Data Bases (VLDB ’99), pages 506–517. Morgan Kaufmann Publishers Inc., 

1999.


[72] P. Hoffman, G. Grinstein, and D. Pinkney. Dimensional anchors: a graphic primitive for 

multidimensional multivariate information visualizations. In Proceedings Workshop on New 

Paradigms in Information Visualization and Manipulation (NPIVM ’99), pages 9–16. 

[73] P. V. C. Hough. Method and means for recognizing complex patterns. US Patent, 3069654, 

1962. 

[74] P. J. Huber. Projection pursuit. The Annals of Statistics, 13(2):435–475, 1985. 

[75] C. B. Hurley and R. W. Oldford. Pairwise display of high-dimensional information via 

eulerian tours and hamiltonian decompositions. Journal of Computational and Graphical 

Statistics, 19(4):861–886, 2010. 

[76] S. Ingram, T. Munzner, V. Irvine, M. Tory, S. Bergner, and T. Möller. DimStiller: Workflows 

for dimensional analysis and reduction. In Proceedings of the IEEE Symposium on Visual 

Analytics Science and Technology (VAST ’10). IEEE CS Press, 2010. 

[77] A. Inselberg. The plane with parallel coordinates. The Visual Computer, 1(4):69–91, 1985. 

[78] A. Inselberg and B. Dimsdale. Parallel coordinates: a tool for visualizing multi-dimensional 

geometry. In Proceedings of the IEEE Conference on Visualization (VIS ’90). IEEE CS 

Press, 1990. 

[79] H. Jänicke and M. Chen. A Salience-based Quality Metric for Visualization. Computer 

Graphics Forum (Proc. EuroVis), 29(3):1183–1192, 2010. 

[80] J. Johansson and M. Cooper. A Screen Space Quality Method for Data Abstraction. Computer 


[81] J. Johansson, C. Forsell, M. Lind, and M. Cooper. Perceiving patterns in parallel coordinates: 

determining thresholds for identification of relationships. Information Visualization, 

7(2):152–162, 2008. 

[82] S. Johansson and J. Johansson. Interactive Dimensionality Reduction Through User-defined 

Combinations of Quality Metrics. IEEE Transactions on Visualization and Computer Graphics 

(TVCG ’09), 15:993–1000, 2009. 

[83] I. T. Jolliffe. Principal Component Analysis. Springer, 2nd edition, 2002. 

[84] K. Kailing, H.-P. Kriegel, P. Kröger, and S. Wanka. Ranking interesting subspaces for clustering 

high dimensional data. In Proceedings of the 7th European Conference on Principles 

and Practice of Knowledge Discovery in Databases (PKDD ’03), pages 241–252, 2003. 

[85] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster 

Analysis. Wiley-Interscience, 9th edition, 1990. 

[86] D. A. Keim, M. Ankerst, and M. Sips. Visual Data-Mining Techniques, pages 813–825. 

Kolam Publishing, 2004. 

[87] D. A. Keim, M. C. Hao, U. Dayal, and M. Hsu. Pixel bar charts: A visualization technique 

for very large multi-attribute data sets. Information Visualization, 1(1):20–34, 2002. 

[88] D. A. Keim, F. Mansmann, J. Schneidewind, J. Thomas, and H. Ziegler. Visual analytics: 

Scope and challenges. In S. J. Simoff, M. H. Böhlen, and A. Mazeika, editors, Visual Data 

Mining: Theory, Techniques and Tools for Visual Analytics, pages 76–90. Springer-Verlag, 

2008. 

[89] Y. Koren and L. Carmel. Visualization of labeled data using linear transformations. Proceedings 

of the IEEE Symposium on Information Visualization (InfoVis ’03), 0:16, 2003. 

[90] H.-P. Kriegel, P. Kröger, and A. Zimek. Clustering high-dimensional data: A survey on 

subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions 

on Knowledge Discovery from Data (TKDD ’09), 3(1):1–58, 2009.


[91] J. LeBlanc, M. O. Ward, and N. Wittels. Exploring N-dimensional databases. In Proceedings 

of the IEEE Conference on Visualization (VIS ’90). IEEE CS Press, 1990. 

[92] Y. K. Leung and M. D. Apperley. A review and taxonomy of distortion-oriented presentation 

techniques. ACM Transactions on Computer-Human Interaction, 1(2):126–160, 1994. 

[93] A. Lex, M. Streit, C. Partl, and D. Schmalstieg. Comparative analysis of multidimensional, 

quantitative data. IEEE Transactions on Visualization and Computer Graphics (TVCG ’10), 

16(6):1027–1035, 2010. 

[94] J. Li, J.-B. Martens, and J. J. van Wijk. Judging correlation from scatterplots and parallel 

coordinate plots. Information Visualization, 9(1):13–30, 2008. 

[95] M. A. Little, P. E. McSharry, E. J. Hunter, and L. O. Ramig. Suitability of dysphonia 

measurements for telemonitoring of parkinson’s disease. In IEEE Transactions on Biomedical 

Engineering, pages 1015–1022, 2009. 

[96] M. A. Little, P. E. Mcsharry, S. J. Roberts, D. A. E. Costello, and I. M. Moroz. Exploiting 

nonlinear recurrence and fractal scaling properties for voice disorder detection. BioMedical 

Engineering OnLine, 6(1):23, 2007. 

[97] H. Liu and H. Motoda. Computational Methods of Feature Selection. Chapman & Hall/CRC, 

2008. 

[98] A. MacEachren, X. Dai, F. Hardisty, D. Guo, and G. Lengerich. Exploring high-D spaces 

with multiform matrices and small multiples. In Proceedings of the IEEE Symposium on 

Information Visualization (InfoVis ’03), pages 31–38. IEEE CS Press, 2003. 

[99] J. Mackinlay. Automating the design of graphical presentations of relational information. 

ACM Transactions on Graphics, 5(2):110–141, 1986. 

[100] N. Miller, B. Hetzler, G. Nakamura, and P. Whitney. The need for metrics in visual information 

analysis. In Proceedings of the Workshop on New Paradigms in Information Visualization 

and Manipulation. ACM, 1997. 

[101] E. Müller, I. Assent, S. Günnemann, R. Krieger, and T. Seidl. Relevant subspace clustering: 

Mining the most interesting non-redundant concepts in high dimensional data. In Proceedings 

of the IEEE International Conference on Data Mining (ICDM ’09), pages 377–386, 2009. 

[102] E. Müller, S. Günnemann, I. Assent, and T. Seidl. Evaluating clustering in subspace projections 

of high dimensional data. In Proceedings of the International Conference on Very 

Large Data Bases (VLDB ’09), volume 2, pages 1270–1281, 2009. 

[103] E. Müller, S. Günnemann, I. Färber, and T. Seidl. Discovering multiple clustering solutions: 

Grouping objects in different views of the data. In Proceedings of the 10th IEEE Conference

on Data Mining (ICDM ’10), page 1220, 2010. 

[104] E. Müller, S. Günnemann, I. Färber, and T. Seidl. Discovering multiple clustering solutions: 

Grouping objects in different views of the data. In Tutorial at the 16th Pacific-Asia

Conference on Knowledge Discovery and Data Mining (PAKDD ’12), 2012. 

[105] T. Munzner. Visualization (Chapter 27). In Fundamentals of Computer Graphics, pages 675–707. AK

Peters, 3rd edition, 2009. 

[106] E. Müller, I. Assent, S. Günnemann, T. Jansen, and T. Seidl. Opensubspace: An open

source framework for evaluation and exploration of subspace clustering algorithms in weka. 

In Proceedings of the 1st Open Source in Data Mining Workshop (OSDM ’09) in conjunction 

with 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD ’09), 

pages 2–13, 2009. 

[107] D. Niu, J. G. Dy, and M. I. Jordan. Multiple non-redundant spectral clustering views. 

In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 

831–838. Omnipress, 2010.


[108] C. North. Toward measuring visualization insight. IEEE Computer Graphics and Applications, 

26(3):6–9, 2006. 

[109] D. Oelke, H. Janetzko, S. Simon, K. Neuhaus, and D. A. Keim. Visual Boosting in Pixel-based

Visualizations. Computer Graphics Forum (Proc. EuroVis), 30(3):871–880, 2011. 

[110] L. Parsons, E. Haque, and H. Liu. Subspace Clustering for High Dimensional Data: A 

Review. ACM SIGKDD Explorations Newsletter - Special Issue on Learning from Imbalanced 

Datasets, 6(1):90–105, 2004. 

[111] F. Paulovich, M. Oliveira, and R. Minghim. The Projection Explorer: A Flexible Tool 

for Projection-based Multidimensional Visualization. In Proceedings of the XX Brazilian 

Symposium on Computer Graphics and Image Processing (SIBGRAPI ’07), pages 27–36, 

Oct. 

[112] W. Peng, M. O. Ward, and E. A. Rundensteiner. Clutter Reduction in Multi-Dimensional 

Data Visualization Using Dimension Reordering. In Proceedings of the IEEE Symposium on 


[113] C. Plaisant, J.-D. Fekete, and G. Grinstein. Promoting insight-based evaluation of visualizations: 

From contest to benchmark repository. IEEE Transactions on Visualization and 

Computer Graphics (TVCG ’08), 14(1):120–134, 2008. 

[114] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the 

American Statistical Association, 66(336):846–850, 1971. 

[115] R. Rao and S. K. Card. The table lens: merging graphical and symbolic representations in 

an interactive focus + context visualization for tabular information. In Proceedings of the 

SIGCHI Conference on Human Factors in Computing Systems (CHI ’94). ACM, 1994. 

[116] R. A. Rensink and G. Baldridge. The perception of correlation in scatterplots. Computer 


[117] D. J. Rogers and T. T. Tanimoto. A Computer Program for Classifying Plants. Science, 

132(3434):1115–1118, 1960. 

[118] R. Rosenholtz, Y. Li, J. Mansfield, and Z. Jin. Feature congestion: a measure of display 

clutter. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems 

(CHI ’05), pages 761–770. ACM, 2005. 

[119] M. Schaefer, L. Zhang, T. Schreck, A. Tatu, J. A. Lee, M. Verleysen, and D. A. Keim. 

Improving projection-based data analysis by feature space transformations. In Proceedings 

of SPIE 8654, Visualization and Data Analysis (VDA ’13), volume 8654, pages 86540H– 

86540H–15, 2013. 

[120] J. Schneidewind, M. Sips, and D. A. Keim. Pixnostics: Towards measuring the value of visualization. 

In Proceedings of the IEEE Symposium on Visual Analytics Science and Technology 

(VAST ’06), pages 199–206. IEEE CS Press, 2006. 

[121] T. Schreck, T. von Landesberger, and S. Bremm. Techniques for precision-based visual 

analysis of projected data. Palgrave Macmillan Information Visualization, 9(3):181–193, 

2010. 

[122] M. Sedlmair, A. Tatu, T. Munzner, and M. Tory. A taxonomy of visual cluster separation 

factors. Computer Graphics Forum (Proc. EuroVis), 31(3):1335–1344, 2012. 

[123] E. Segel and J. Heer. Narrative visualization: Telling stories with data. IEEE Transactions 

on Visualization and Computer Graphics (TVCG ’10), 16:1139–1148, 2010. 

[124] J. Seo and B. Shneiderman. Interactively exploring hierarchical clustering results. Computer, 

35(7):80–86, 2002.


[125] J. Seo and B. Shneiderman. A rank-by-feature framework for unsupervised multidimensional 

data exploration using low dimensional projections. In Proceedings of IEEE Symposium on 


[126] J. Seo and B. Shneiderman. A rank-by-feature framework for interactive exploration of 

multidimensional data. Information Visualization, 4(2):96–113, 2005. 

[127] B. Shneiderman. The Eyes Have It: A Task by Data Type Taxonomy for Information 

Visualizations. In Proceedings of the IEEE Symposium on Visual Languages (VL), pages 

336–343. IEEE CS Press, 1996. 

[128] J. H. Siegel, E. J. Farrell, R. M. Goldwyn, and H. P. Friedman. The surgical implication of 

physiologic patterns in myocardial infarction shock. Surgery, 72:126–141, 1972. 

[129] M. Sips, B. Neubert, J. P. Lewis, and P. Hanrahan. Selecting good views of high-dimensional 

data using class consistency. Computer Graphics Forum (Proc. EuroVis), 28(3):831–838, 

2009. 

[130] A. Strauss and J. M. Corbin. Basics of Qualitative Research: Techniques and Procedures for 

Developing Grounded Theory. SAGE Publications, 1998. 

[131] W. Street, W. Wolberg, and O. Mangasarian. Nuclear feature extraction for breast tumor 

diagnosis. IS&T / SPIE International Symposium on Electronic Imaging: Science and 

Technology, 1905:861–870, 1993. 

[132] A. Tatu, G. Albuquerque, M. Eisemann, P. Bak, H. Theisel, M. Magnor, and D. A. Keim. 

Automated Visual Analysis Methods for an Effective Exploration of High-Dimensional Data.

IEEE Transactions on Visualization and Computer Graphics (TVCG ’11), 17(5):584–

597, 2011.

[133] A. Tatu, G. Albuquerque, M. Eisemann, J. Schneidewind, H. Theisel, M. Magnor, and 

D. Keim. Combining automated analysis and visualization techniques for effective exploration

of high dimensional data. Proceedings of the IEEE Symposium on Visual Analytics Science 

and Technology (VAST ’09), pages 59–66, 2009. 

[134] A. Tatu, P. Bak, E. Bertini, D. A. Keim, and J. Schneidewind. Visual quality metrics and 

human perception: an initial study on 2D projections of large multidimensional data. In 

Proceedings of the Working Conference on Advanced Visual Interfaces (AVI), pages 49–56. 

ACM, 2010. 

[135] A. Tatu, F. Maaß, I. Färber, E. Bertini, T. Schreck, T. Seidl, and D. Keim. Subspace 

Search and Visualization to Make Sense of Alternative Clusterings in High-Dimensional 

Data. Proceedings of the IEEE Symposium on Visual Analytics Science and Technology 

(VAST ’12), pages 63–72, 2012. 

[136] A. Tatu, L. Zhang, E. Bertini, T. Schreck, D. A. Keim, S. Bremm, and T. von Landesberger. 

ClustNails: Visual Analysis of Subspace Clusters. Tsinghua Science and Technology, Special 

Issue on Visualization and Computer Graphics, 17(4):419–428, 2012. 

[137] J. J. Thomas and K. A. Cook. Illuminating the Path: The Research and Development Agenda 

for Visual Analytics. National Visualization and Analytics Ctr, 2005. 

[138] M. Tory and T. Möller. Rethinking Visualization: A High-Level Taxonomy. In Proceedings 

of the IEEE Symposium on Information Visualization (InfoVis ’04), pages 151–158. IEEE 

CS Press, 2004. 

[139] E. R. Tufte. The visual display of quantitative information. Graphics Press, 1986. 

[140] J. Tukey and P. Tukey. Computer graphics and exploratory data analysis: An introduction. 

Proceedings of the Annual Conference and Exposition: Computer Graphics, 3:773–785, 1985.


[141] S. Vadapalli and K. Karlapalem. Heidi matrix: nearest neighbor driven high dimensional 

data visualization. In Proceedings of the ACM SIGKDD Workshop on Visual Analytics and 

Knowledge Discovery, pages 83–92, 2009. 

[142] S. van den Elzen and J. J. van Wijk. BaobabView: Interactive Construction and Analysis 

of Decision Trees. In Proceedings of the IEEE Symposium on Visual Analytics Science and 

Technology (VAST ’11), pages 151–160. IEEE CS Press, 2011. 

[143] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine 

Learning Research, 9:2579–2605, 2008.

[144] J. Ward. Hierarchical grouping to optimize an objective function. Journal of the American 

Statistical Association, 58:236–244, 1963. 

[145] M. Ward, G. Grinstein, and D. Keim. Interactive Data Visualization: Foundations, Techniques, 

and Applications. Taylor & Francis, 2010. 

[146] M. O. Ward. Xmdvtool: Integrating multiple methods for visualizing multivariate data. 

In Proceedings of the IEEE Symposium on Information Visualization (InfoVis ’94), pages 

326–333. IEEE CS Press, 1994. 

[147] M. O. Ward. A taxonomy of glyph placement strategies for multidimensional data visualization. 

Information Visualization, 1(3/4):194–210, 2002. 

[148] C. Ware. Information Visualization: Perception for Design. Morgan Kaufmann Publishers 

Inc., 2004. 

[149] C. Ware, H. Purchase, L. Colpoys, and M. McGill. Cognitive measurements of graph aesthetics. 

Information Visualization, 1:103–110, 2002. 

[150] M. Wattenberg. A note on space-filling visualizations and space-filling curves. In Proceedings 

of the IEEE Symposium on Information Visualization (InfoVis ’05). IEEE CS Press, 2005. 

[151] L. Wilkinson, A. Anand, and R. Grossman. Graph-theoretic scagnostics. In Proceedings of 

the IEEE Symposium on Information Visualization (InfoVis ’05), pages 157–164. IEEE CS 

Press, 2005. 

[152] L. Wilkinson, A. Anand, and R. Grossman. High-dimensional visual analytics: Interactive 

exploration guided by pairwise views of point distributions. IEEE Transactions on Visualization 

and Computer Graphics (TVCG ’06), 12:1363–1372, 2006. 

[153] A. Wismueller, M. Verleysen, M. Aupetit, and J. A. Lee. Recent Advances in Nonlinear 

Dimensionality Reduction, Manifold and Topological Learning. 18th European Symposium 

on Artificial Neural Networks - Computational Intelligence and Machine Learning (ESANN), 

pages 71–80, 2010. 

[154] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. 

The Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann Publishers, 

2nd edition, 2005. 

[155] R. Xu and D. C. Wunsch II. Survey of clustering algorithms. IEEE Transactions on Neural

Networks, 16(3):645–678, 2005. 

[156] J. Yang, D. Hubball, M. O. Ward, E. A. Rundensteiner, and W. Ribarsky. Value and relation 

display: Interactive visual exploration of large data sets with hundreds of dimensions. IEEE 

Transactions on Visualization and Computer Graphics (TVCG ’07), 13:494–507, 2007. 

[157] J. Yang, A. Patro, S. Huang, N. Mehta, M. O. Ward, and E. A. Rundensteiner. Value and 

Relation Display for Interactive Exploration of High Dimensional Datasets. In Proceedings of 

IEEE Symposium on Information Visualization (InfoVis ’04), pages 73–80. IEEE CS Press, 

2004.


[158] J. Yang, W. Peng, M. O. Ward, and E. A. Rundensteiner. Interactive Hierarchical Dimension 

Ordering, Spacing and Filtering for Exploration of High Dimensional Datasets. In Proceedings 

of the IEEE Symposium Information Visualization (InfoVis ’03). IEEE CS Press, 2003. 

[159] J. Yang, M. O. Ward, E. A. Rundensteiner, and S. Huang. Visual hierarchical dimension 

reduction for exploration of high dimensional datasets. In Proceedings of the Symposium on 

Data Visualization (VISSYM), pages 19–28. Eurographics Association, 2003. 

[160] J. S. Yi, Y. a. Kang, J. Stasko, and J. Jacko. Toward a deeper understanding of the role of 

interaction in information visualization. IEEE Transactions on Visualization and Computer 

Graphics (TVCG ’07), 13:1224–1231, 2007. 

[161] X. Yuan, Z. Wang, and C. Guo. Mds-tree and mds-matrix for high dimensional data visualization. 

In Proceedings of IEEE Symposium on Information Visualization (InfoVis ’11), 

2011. Poster abstract. 

[162] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: an efficient data clustering method for

very large databases. In Proceedings of the ACM SIGMOD International Conference on 

Management of Data (SIGMOD ’96), pages 103–114, New York, NY, USA, 1996. ACM. 

[163] J. Zupan, M. Novic, X. Li, and J. Gasteiger. Classification of multicomponent analytical 

data of olive oils using different neural networks. In Analytica Chimica Acta, volume 292,

pages 219–234, 1994.
