Visual Analytics of Patterns in High-Dimensional Data
Dissertation for the attainment of the academic degree of
Dr. rer. nat.
presented by
Andrada Tatu
to the
Section of Mathematics and Natural Sciences
Department of Computer and Information Science
Date of the oral examination: 12 July 2013
Referees:
Prof. Dr. Daniel A. Keim, Universität Konstanz
Prof. Dr. Oliver Deussen, Universität Konstanz
Prof. Dr. Giuseppe Santucci, Sapienza Università di Roma
For my loving parents.
Acknowledgements
This dissertation is the most important milestone in my academic career. One of the
joys of completion is to look back and remember all the mentors, friends, collaborators,
colleagues and family who have guided, supported, and inspired me along this fulfilling
journey.
First and foremost, I would like to express my deep appreciation to my advisor, Professor
Dr. Daniel Keim, who stirred my interest in Visual Analytics early on in my
studies. He has not only been a strong supporter of my work, but he has also allowed
me great freedom to develop my thesis. Without his guidance and persistent help, this
dissertation would not have been possible. As a part of his group, I was able to perfect
my research skills and draw appropriate conclusions.
In addition, I would like to thank my committee members, Professor Dr. Oliver Deussen
and Professor Dr. Giuseppe Santucci, for their encouraging and insightful comments and
their analytic questions, which prompted me to shape my ideas comprehensively.
I am especially grateful to Dr. Enrico Bertini and Dr. Tobias Schreck, who closely
accompanied my research during these years and motivated me to seek perfect solutions.
Many of the results reported here are joint efforts. Their recommendations and instructions
have enabled me to assemble and finish the dissertation effectively.
I would also like to express my gratitude to my collaborators for their guidance and
inspiration in these past years, and especially name Ines Färber, Professor Dr. Thomas
Seidl, Professor Dr. Tamara Munzner, Dr. Michael Sedlmair, Professor Dr. Melanie Tory,
Georgia Albuquerque, Dr. Martin Eisemann, Dr. Jörn Schneidewind and Dr. Peter Bak.
I am grateful to my colleagues for creating a pleasant working atmosphere. A special
thank you goes to Svenja Simon (for her friendship and tricky R programming sessions),
Miloš Krstajić (for supporting all my moods and encouraging me throughout these years),
Dr. Florian Mansmann (for getting me into the group and becoming a lovely friend),
David Spretke (for accompanying me from the first day of my Bachelor studies to the
last of my doctoral work as a friend and hard-working colleague), Dr. Andreas Stoffel (for
always keeping his door open and the helpful debugging sessions), Christian Rohrdantz
(for helpful suggestions and mental support during the writing phase and preparation
of my defense talk), Dr. Leishi Zhang (for the great collaboration during the ClustNails
project), Dr. Daniela Oelke (for initial paper writing suggestions and providing me the
thesis template), and Sabine Kuhr (for her support in administrative work). I am very
happy that, in many cases, my friendship with all of you has enriched my time beyond
our shared time in the office.
Special thanks go to my student assistant Fabian Maaß, who implemented parts of
the subspace visualization system and whose creativity shaped the research outcome.
This acknowledgement would not be complete without extending my sincere thanks
to our DBVIS support team, which really made my life easier by providing fast, anytime
technical support, computational power, and storage opportunities for my projects. I
would like to specially mention Florian Stoffel and Juri Buchmüller.
Special thanks go to Mrs. Anna Dowden-Williams from the Academic Staff Development
for proofreading most of my research papers and this thesis, which has profoundly
improved its overall composition.
My deepest appreciation and gratitude go, however, to my family, who have encouraged
my studies from the start and provided me with the moral and emotional support
needed through the entire process. They believed in my dream and helped me to fulfill it.
I will be forever grateful for your unconditional love and support.
I also gratefully acknowledge the financial support received from the German Research
Foundation (DFG) under the research grant DFG-611 within the DFG Priority Program
“Scalable Visual Analytics: Interactive Visual Analysis Systems of Complex Information
Spaces” (SPP 1335). I also recognize being an associated PhD student to the GK-1042
(PhD Graduate Program) “Explorative Analysis and Visualization of Large Information
Spaces”.
Abstract
Due to the technological progress over the last decades, today’s scientific and commercial
applications are capable of generating, storing, and processing massive amounts of
data. This influences the type of data generated, which in turn means that with each
data entry different aspects are combined and stored in one common database. Often
the describing attributes are numeric; we call data with more than a handful of attributes
(dimensions) high-dimensional. Making use of such data archives poses
new challenges for analysis techniques.
The work of this thesis centers around the question of finding interesting patterns
(meaningful information) in high-dimensional data sets. This task is highly challenging
because of the so-called curse of dimensionality, which expresses that data becomes
sparse as dimensionality increases. This phenomenon disturbs standard analysis techniques.
Automatic techniques have to deal with data complexity that not only increases their
runtime but also vitiates their computation functions (such as distance functions). Moreover,
exploring these data sets visually is hindered by the high number of dimensions that have
to be displayed on the two-dimensional screen space.
This thesis is motivated by the idea that searching for interesting patterns in this
kind of data can be done through a mixed approach of automation, visualization, and
interaction. The amount of patterns a visualization contains can be measured by so-called
quality metrics. These automated functions can then filter the large number of
high-dimensional visualizations and present the user with a pre-filtered, good subset for further
investigation. We propose quality metrics for scatterplots and parallel coordinates focusing
on different user tasks like identifying clusters and correlations. We also evaluate these
measures with regard to (1) their ability to identify clusters in a variety of real and
synthetic data sets and (2) their correlation with human perception of clusters in scatterplots.
A thorough discussion of the results follows, reflecting on their impact on directions for future
research.
As quality metrics were developed for a large number of different high-dimensional
visualization techniques, we present our reflections on how these methods are related to
each other and how the approach can be developed further. For this purpose, we provide
an overview of approaches that use quality metrics in high-dimensional data visualization
and propose a systematization based on a comprehensive literature review.
In high-dimensional data, patterns often exist only in a subset of the dimensions.
Subspace clustering techniques aim at finding these subspaces where clusters exist and
which might otherwise be hidden if a traditional clustering algorithm is applied. While
subspace clustering approaches tackle the sparsity problem in high-dimensional data well,
designing effective visualizations to help analyze the clustering result is not trivial. In
addition to the cluster membership information, the relevant sets of dimensions and the
overlaps of memberships and dimensions also need to be considered. Although a number
of techniques (for example, scatterplots, heat maps, dendrograms, hierarchical parallel
coordinates) exist for visualizing traditional clustering results, little research has been
done on visualizing subspace clustering results. Moreover, while extensive research has
been carried out with regard to designing subspace clustering algorithms, surprisingly
little attention has been paid to developing effective visualization tools for analyzing the
clustering result. Appropriate visualization techniques will not only help in monitoring the
clustering process but, with special mining techniques, they could also enable the domain
expert to guide and even to steer the subspace clustering process to reveal the patterns of
interest. Toward this goal, we envision a concept that combines subspace clustering algorithms
and interactive scalable visual exploration techniques. This work includes the task of
comparative visualization and feedback-guided computation of alternative clusterings.
Zusammenfassung (Summary)
Owing to the technological progress of recent decades, today’s commercial
applications are able to generate, store, and process huge amounts of data.
This development also influences the nature of the data generated, i.e.,
for each data entry different aspects are stored in the same database.
Often the describing attributes are numeric. I call data sets that contain more
than five such attributes (dimensions) high-dimensional.
Making valuable use of such data archives poses new challenges for analysis
techniques.
This dissertation addresses the question of how interesting patterns (meaningful
information) can be found in high-dimensional spaces. This
task is extremely challenging because of the curse of dimensionality.
This problem states that data are sparse in high-dimensional
space. Conventional analysis techniques are impaired by this. Automatic
methods must take the data complexity into account not only with regard to their runtime but also
with regard to their computation functions (e.g., distance functions). In addition,
the visual exploration of these data is impaired by the two-dimensionality of the
displays.
This dissertation builds on the concept that the search for interesting
patterns in high-dimensional data sets can be carried out with a combined approach of
automatic, visual, and interactive methods. How pronounced
the patterns in a visualization are can be measured by so-called quality measures.
These automatic functions can narrow down the large number of high-dimensional
visualizations and provide the user with a selected set for further
investigation. I propose quality measures for scatterplots
and parallel coordinates that focus on different tasks, such as identifying
groups or correlations. In addition, these techniques are
examined with regard to (1) their ability to identify clusters in various real and synthetic
data sets and (2) their correlation with human perception.
The detailed discussion of these results is followed by considerations for future
research.
Since many different quality measures have been developed for a range of further high-dimensional
visualizations, I present proposals for how they can be linked and developed
further. To this end, an overview of the different approaches is compiled,
based on a systematization that was derived from a comprehensive literature
review.
In high-dimensional space, some patterns exist only in different subspaces
of the data space. Subspace clustering algorithms were developed to find
subspaces in which clusters exist that would not be found by traditional clustering
algorithms. Although these algorithms can explore sparsely populated,
high-dimensional spaces well, developing effective
visualization techniques for analyzing these clustering results is not trivial.
In addition to the cluster membership of elements, the relevant attribute sets
of a cluster and the object and dimension overlaps of subspace clusters
must be displayed. Even though a number of techniques exist for visualizing
traditional clustering results (e.g., scatterplots, heat maps, dendrograms,
hierarchical parallel coordinates), there are only a few approaches for visualizing the
results of subspace clustering algorithms. Moreover, surprisingly few approaches
have been presented so far that support a visual analysis of subspace clustering
results, although much research has been done in the area of subspace clustering
algorithms. Appropriate visualization techniques,
supported by special methods for extracting information, would
not only make it possible to track the clustering results but would also help domain
experts steer the subspace clustering process so that relevant patterns
emerge. With this goal in mind, I present a concept that combines subspace
clustering algorithms with interactive, scalable visualizations. My
approaches are therefore devoted to the task of visualization for comparing alternative
clusterings that are steered by user feedback.
Contents
1 Introduction
1.1 Need for Visual Interactive Data Exploration
1.2 Contributions of the Thesis
1.3 Thesis Structure
2 High-Dimensional Data Analysis
2.1 Basic Techniques for High-Dimensional Data Analysis
2.1.1 Common Challenges with High-Dimensional Data
2.1.2 Feature Selection and Feature Extraction
2.2 Information Visualization Techniques for High-Dimensional Data
2.2.1 Information Visualization Techniques
2.2.2 Limitations while Visualizing High-Dimensional Data
2.3 Automated Techniques for High-Dimensional Data
2.3.1 Data Mining Techniques for High-Dimensional Data
2.3.2 Quality Measures for High-Dimensional Data Visualizations
2.4 Visual Analytics for High-Dimensional Data
2.4.1 Visual Interactive Systems for High-Dimensional Data Analysis
2.4.2 Subspace Cluster Analysis and Visualization
3 Quality Measures based Visual Analysis of High-Dimensional Data
3.1 Quality Measures for Scatterplots and Parallel Coordinates
3.1.1 Overview and Problem Description
3.1.2 Quality Measures for Scatterplots with Unclassified Data
3.1.3 Quality Measures for Scatterplots with Classified Data
3.1.4 Quality Measures for Parallel Coordinates with Unclassified Data
3.1.5 Quality Measures for Parallel Coordinates with Classified Data
3.1.6 Application on Real Data Sets
3.1.7 Evaluation of the Measures’ Performance Using Synthetic Data
3.1.8 Conclusion and Future Work
3.2 Quality Measures and Human Perception – An Empirical Study
3.2.1 Measures
3.2.2 Empirical Evaluation
3.2.3 Results
3.2.4 Discussion
3.2.5 Guidelines
3.2.6 Conclusion and Future Work
4 A Systematization of Quality Metrics in High-Dimensional Data Visualization
4.1 Quality Metrics in High-Dimensional Data Visualization
4.1.1 Background
4.1.2 Methodology
4.1.3 Quality Metrics Pipeline
4.1.4 Systematic Analysis
4.1.5 Examples
4.1.6 Findings
4.1.7 Directions for Further Research
4.1.8 Limitations
4.1.9 Conclusion and Future Work
4.2 Visual Cluster Separation Factors: Sketching a Taxonomy
4.2.1 Introduction
4.2.2 Method
4.2.3 Visual Cluster Separation Taxonomy
4.2.4 Discussion and Further Research
5 Visual Subspace Analysis of High-Dimensional Data
5.1 Visual Exploration for Subspace Clustering
5.1.1 Motivation
5.1.2 Subspace Clustering Algorithms
5.1.3 Task Definition and Design Space for Visual Subspace Cluster Analysis
5.1.4 The ClustNails System
5.1.5 Use Case and System Comparison
5.1.6 Conclusions and Future Work
5.2 Visual Analytics of Subspace Search
5.2.1 Introduction
5.2.2 Subspace Analysis
5.2.3 Proposed Analytical Workflow
5.2.4 Application
5.2.5 Discussion and Possible Extensions
5.2.6 Conclusions
6 Conclusion and Future Work
6.1 Summary of Contributions and Future Work
List of Figures
List of Tables
A Appendix
A.1 Original Data Dimensions for Used Data Sets
A.2 Empirical Study
A.2.1 General Questions Form
A.2.2 Experiment Form
A.2.3 Additional Experiment Results
A.3 Quality Metrics Pipelines for the Literature Review
A.4 Hierarchical Grouping of Interesting Subspaces
Bibliography
1 Introduction
“Everybody gets so much information all day long
that they lose their common sense.”
Gertrude Stein
Contents
1.1 Need for Visual Interactive Data Exploration
1.2 Contributions of the Thesis
1.3 Thesis Structure
1.1 Need for Visual Interactive Data Exploration
Today data is produced everywhere: everything is recorded, from production processes
in industry to employees’ working behavior and their personal data. Even animals
are equipped with sensors and all their movements are recorded over long periods of time,
the click behavior of internet users is traced, and supermarket purchases are stored for later
analysis. Since today’s technology allows for inexpensive and abundant storage space,
even more data will be stored in the near future. At the same time, these advantages
reveal the problem of how to handle the data most effectively. The gap between the
generated data and the understanding of it increases [154], which also poses a challenge
for analysis techniques; e.g., it is difficult to filter and extract relevant information since
not only the volume increases, but also the complexity.
Visualization has long been used as an effective tool to explore and make sense of data,
especially when analysts need to generate hypotheses about the information that is hidden
in the data. While some techniques and commercial products have proven to be useful in
providing effective solutions, there are still modern databases that can store data of such
complexity that it goes well beyond the limits of human understanding.
The goal of this thesis is pattern finding in high-dimensional or multidimensional data.
The methods presented here work with numerical data sets with a large number of objects
and a large number of dimensions, also called attributes. Depending on the application
area, a large number of objects can already start at hundreds and go up to thousands. The
same is true for the describing attributes, or features, of the objects. In this work we call
all data sets with more than one hundred objects and more than ten dimensions
high-dimensional data. An example of analysis tasks based on a customer database will be described
later in this section.
Classical data exploration requires the user to find interesting phenomena in the data
interactively, by starting with an initial visual representation. In [36] the authors suggest
that “the purpose of visualization is insight, not pictures”. The techniques for
high-dimensional data visualization can also incorporate automated analysis components to
reduce its complexity and to effectively guide the user during the interactive exploration
process. This process is called visual analytics. “Visual analytics strives to facilitate
the analytical reasoning process by creating software that maximizes human capacity to
perceive, understand, and reason about complex and dynamic data and situations” [137].
Patterns are also not a new concept when analyzing data. Witten and Frank expressed
this perfectly in [154]: “There is nothing new about this” (patterns). “People have been
seeking patterns in data since human life began. Hunters seek patterns in animal migration
behavior, farmers seek patterns in crop growth, politicians seek patterns in voter opinion,
and lovers seek patterns in their partners’ responses. A scientist’s job (like a baby’s) is
to make sense of data, to discover the patterns that govern how the physical world works
and encapsulate them in theories that can be used for predicting what will happen in new
situations.”
In large-scale multivariate data sets, purely interactive exploration becomes ineffective<br />
or even infeasible, since the number of possible representations grows rapidly with the<br />
number of dimensions. Methods are needed that help the user to automatically find<br />
effective and expressive visualizations. Effective and efficient analysis methods for large<br />
multidimensional data are necessary to understand the complexity of the information hidden<br />
in these databases. Data dimensionality is often the major limiting factor.<br />
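A back-of-the-envelope count illustrates how quickly the number of candidate views grows. The sketch below is an illustration only and not part of the thesis: the function name `count_views` is hypothetical, and it assumes that a "representation" is either an axis-parallel 2D scatterplot, a parallel coordinates axis ordering (counted up to mirroring), or a non-empty axis-parallel subspace.

```python
from math import comb, factorial


def count_views(d: int) -> dict:
    """Count candidate views for a data set with d >= 2 dimensions.

    Assumed view types (illustrative, not taken from the thesis):
    - scatterplots: unordered pairs of dimensions
    - axis_orderings: parallel coordinates orderings, mirrored orderings counted once
    - subspaces: non-empty subsets of the dimensions
    """
    if d < 2:
        raise ValueError("need at least two dimensions")
    return {
        "scatterplots": comb(d, 2),
        "axis_orderings": factorial(d) // 2,
        "subspaces": 2 ** d - 1,
    }


for d in (5, 10, 20):
    print(d, count_views(d))
```

Even at ten dimensions, the lower bound used for high-dimensional data in this work, there are already 45 scatterplots and 1023 subspaces, which motivates automatic support for selecting views.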
For automatic pattern detection, a typically employed paradigm is clustering:<br />
identifying groups of objects based on their mutual similarity. Unlike in traditional clustering<br />
scenarios, considering all features simultaneously is no longer effective for the aforementioned<br />
high-dimensional data, due to the so-called curse of dimensionality [28]. As dimensionality<br />
increases, the distances between any two objects become less discriminative.<br />
Moreover, the probability of many dimensions being irrelevant for the underlying cluster<br />
structure increases. In such data sets it can be observed that each object may participate<br />
in different groupings, meaning that objects may have different roles. In comparison, in<br />
classical clustering each object belongs to one cluster, and the data set is partitioned into<br />
a number <strong>of</strong> clusters. “For example, <strong>in</strong> customer segmentation, we observe for each customer<br />
multiple possible behaviors which should be detected as clusters. In other doma<strong>in</strong>s,<br />
such as sensor networks each sensor node can be assigned to multiple clusters accord<strong>in</strong>g to<br />
different environmental events. In gene expression analysis, objects should be detected in<br />
multiple clusters due to the various functions of each gene. In general, multiple groupings<br />
are desired as they characterize different views of the data” [103].<br />
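The loss of discriminative power of distances can be observed in a small experiment. The following sketch is an illustration under stated assumptions, not an experiment from the thesis: it draws points uniformly at random from the unit hypercube and reports the relative contrast (d_max - d_min) / d_min of the Euclidean distances from one query point. The function name `relative_contrast` and all parameter values are hypothetical.

```python
import math
import random


def relative_contrast(dim: int, n_points: int = 200, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min for distances from a random query point
    to n_points uniformly distributed points in the dim-dimensional unit cube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as the dimensionality grows: the nearest and the
# farthest neighbor end up at almost the same distance from the query.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

For uniformly distributed data the contrast drops sharply with growing dimensionality, which is why full-space distances become a poor basis for clustering such data.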
If we consider, for example, a customer database with a large number of customers<br />
(rows in the table) described by a large number of attributes (columns in the table), we<br />
may ask how these customers relate to each other and what kind of patterns (in this<br />
case, groups) can be identified in this database. Figure 1.1 shows a toy example<br />
of such multiple valid groupings for one database. We can have groups<br />
like “rich oldies”, “healthy sporties”, “unhealthy gamers”, “unemployed people”, “average<br />
people” and “sport professionals”¹. To facilitate the data analysis in this direction, we<br />
present in Chapter 5 visual interactive systems and new analysis methods to support the<br />
understanding and comparison of different groupings in high-dimensional data.<br />
Figure 1.1: Multiple valid and interesting groupings of a high-dimensional data set [104].<br />
¹ This image appeared in the tutorial slides of Müller et al. [104]; the accompanying story is my own.<br />
As already mentioned, this thesis is about visual analytics of patterns in high-dimensional<br />
data. To assist the analysis of such data sets, effective information visualization techniques<br />
provid<strong>in</strong>g a mapp<strong>in</strong>g <strong>of</strong> data properties to the screen, have been developed and are needed<br />
to make sense <strong>of</strong> the complex data at hand. The visualization <strong>of</strong> large complex <strong>in</strong>formation<br />
spaces typically <strong>in</strong>volves mapp<strong>in</strong>g high-dimensional data to lower-dimensional visual<br />
representations. The challenge for the analyst is to f<strong>in</strong>d an <strong>in</strong>sightful mapp<strong>in</strong>g, while the<br />
dimensionality <strong>of</strong> the data, and consequently the number <strong>of</strong> possible mapp<strong>in</strong>gs, <strong>in</strong>creases.<br />
As we will see later in Chapter 2, numerous expressive and effective low-dimensional<br />
visualizations for high-dimensional data sets have been proposed <strong>in</strong> the past, such as<br />
scatterplots and scatterplot matrices (SPLOM) [37], parallel coord<strong>in</strong>ates [78], glyph-based<br />
techniques [147], pixel-based displays [145] and geometrically transformed displays [86,<br />
145]. However, finding information-bearing and user-interpretable visual representations<br />
automatically remains a difficult task, since the number of possible<br />
representations can be large. In addition, it can be difficult to explain their relevance to the user.<br />
Finding relations, patterns, and trends over numerous dimensions is also difficult because<br />
the projection of n-dimensional objects onto 2D spaces necessarily entails some form<br />
of information loss. Projection techniques like multidimensional scaling (MDS) and principal<br />
component analysis (PCA) offer traditional solutions by creating data embeddings<br />
that try to preserve, as far as possible, the distances of the original multidimensional space<br />
in the 2D projection. These techniques, however, pose severe problems in terms of interpretation,<br />
as it is no longer possible to interpret the observed patterns in terms of the<br />
dimensions of the original data space.<br />
Mechanisms to measure the quality <strong>of</strong> the visualizations are therefore needed. In<br />
the past, quality measures have been developed for different areas like measures for data<br />
quality (outliers, missing values, sampling rate, level of detail), clustering quality (purity,<br />
F-measure (combining precision and recall), Rand index [114], silhouette coefficient [85],<br />
etc.), association rule quality (support and confidence [7], <strong>in</strong>formation ga<strong>in</strong> [40], etc.) or<br />
the distance distribution measure <strong>in</strong> SURFING [16], a subspace search algorithm described<br />
and used <strong>in</strong> Chapter 5 to filter data spaces and f<strong>in</strong>d <strong>in</strong>terest<strong>in</strong>g subspaces. For visualizations,<br />
a number <strong>of</strong> authors have started <strong>in</strong>troduc<strong>in</strong>g quality measures to quantify their<br />
importance. The rationale beh<strong>in</strong>d this method is that quality measures can help users<br />
reduce the search space by filter<strong>in</strong>g out views with low <strong>in</strong>formation content. In the ideal
system, users can select one or more measures and the system optimizes the visualization<br />
<strong>in</strong> such a way as to reflect the choice <strong>of</strong> the user. This thesis also contributes to the field<br />
<strong>of</strong> quality measures, and <strong>in</strong> Chapter 3 new measures are presented for scatterplot matrices<br />
and parallel coord<strong>in</strong>ates plots.<br />
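To make one of the clustering quality measures named above concrete, the following minimal sketch computes the Rand index between two groupings of the same objects. It follows the standard pair-counting definition (the fraction of object pairs on which both groupings agree); the function name `rand_index` and the example labels are illustrative assumptions and are not taken from the thesis.

```python
from itertools import combinations


def rand_index(labels_a, labels_b) -> float:
    """Fraction of object pairs that both groupings treat the same way,
    i.e. both put the pair in one cluster or both separate it."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both groupings must label the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two objects")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Identical groupings with renamed cluster labels agree on every pair.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Groupings that cut the data along different lines agree on few pairs.
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the measure only compares pairs, it does not depend on how the clusters are named, which makes it usable for comparing an automatic clustering with a reference grouping.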
However, there is one problem with these measures: the lack of empirical validation<br />
based on user studies. Such studies are needed to test the underlying assumption<br />
that the patterns captured by these measures correspond to the patterns captured by<br />
the human eye. Since many different patterns can be analyzed, in this thesis we start<br />
with clusters in visualizations: we compare some of the<br />
most promising quality measures for filtering visualizations that present clusters with<br />
human judgements of the same visualizations.<br />
The analysis of high-dimensional data is a ubiquitously relevant, yet notoriously difficult<br />
problem. Problems exist both in automatic data analysis and in the visualization<br />
of this kind of data. On the visual-interactive side, the limited number of available visual<br />
variables and the limited short-term memory of human analysts make it difficult to effectively<br />
visualize data with high numbers of dimensions. In Chapter 5 we tackle this problem from<br />
the visual-interactive side. We present a visual-interactive tool to make sense of clusters<br />
in different subspaces, as well as an approach to identify subspaces that might show<br />
complementary clusterings.<br />
In summary, the focus of this thesis is to contribute to both sides of pattern finding in<br />
high-dimensional data: the automatic and the visual-interactive part. We believe that both<br />
parts are needed together to solve the problem. We therefore present automatic<br />
mechanisms, namely quality measures, to reduce the number of candidate visualizations of<br />
high-dimensional data, and, on the other side, we visualize the relations between results to<br />
support the user in an interactive pattern-finding process.<br />
1.2 Contributions <strong>of</strong> the Thesis<br />
This dissertation provides visual analytics mechanisms for pattern f<strong>in</strong>d<strong>in</strong>g <strong>in</strong> high-dimensional<br />
data. In pursuing this goal, we make the following contributions:<br />
• Quality measures for scatterplots and parallel coord<strong>in</strong>ates plots are developed. <strong>Visual</strong><br />
quality metrics have recently been devised to automatically extract interesting visual<br />
projections out of a large number of available candidates in the exploration of high-dimensional<br />
databases. The metrics make it possible, for instance, to search within a large set of<br />
scatterplots (e.g., <strong>in</strong> a scatterplot matrix) and select the views that conta<strong>in</strong> the best<br />
separation among clusters. The rationale beh<strong>in</strong>d these techniques is that automatic<br />
selection <strong>of</strong> “best” views is not only useful but also necessary when the number <strong>of</strong><br />
potential projections exceeds the limit <strong>of</strong> human <strong>in</strong>terpretation (Chapter 3) [132,<br />
133].<br />
• Validating the measures through a perceptual study. We present a perceptual study<br />
<strong>in</strong>vestigat<strong>in</strong>g the relationship between human <strong>in</strong>terpretation <strong>of</strong> clusters <strong>in</strong> 2D scatterplots<br />
and the measures that were automatically extracted from these plots. Specifically,<br />
we compare a series <strong>of</strong> selected metrics and analyze how they predict human
detection <strong>of</strong> clusters. A thorough discussion <strong>of</strong> results follows with reflections on<br />
their impact and directions for future research (Chapter 3) [134].<br />
• A systematization <strong>of</strong> techniques that use quality metrics to help <strong>in</strong> the visual exploration<br />
<strong>of</strong> mean<strong>in</strong>gful patterns <strong>in</strong> high-dimensional data. We present reflections<br />
on how different quality measure methods are related to each other and how the<br />
approach can be developed further. For this purpose, we provide an overview <strong>of</strong> approaches<br />
that use quality metrics <strong>in</strong> high-dimensional data visualization and propose<br />
a systematization based on a thorough literature review. We carefully analyze the<br />
papers and derive a set <strong>of</strong> factors for discrim<strong>in</strong>at<strong>in</strong>g the quality metrics, visualization<br />
techniques, and the process itself. A quality metrics pipel<strong>in</strong>e is proposed to model<br />
all the encountered varieties <strong>of</strong> metrics (Chapter 4) [27].<br />
• A visual subspace cluster analysis system (ClustNails) to understand the result <strong>of</strong><br />
subspace clustering. In subspace clustering, in addition to the grouping information<br />
(clusters), the relevance <strong>of</strong> dimensions for particular groups and overlaps between<br />
groups, both <strong>in</strong> terms <strong>of</strong> dimensions and records, need to be analyzed. ClustNails <strong>in</strong>tegrates<br />
several novel visualization techniques with various user <strong>in</strong>teraction facilities<br />
to support navigat<strong>in</strong>g and <strong>in</strong>terpret<strong>in</strong>g the result <strong>of</strong> subspace cluster<strong>in</strong>g algorithms<br />
(Chapter 5) [136].<br />
• A novel method for the visual analysis of high-dimensional data that supports understanding<br />
the data from different perspectives and investigating alternative<br />
clusterings. We employ an interestingness-guided subspace search algorithm to detect<br />
a candidate set of interesting subspaces that may contain important patterns<br />
for further analysis. Based on appropriately defined subspace similarity functions,<br />
we visualize the subspaces and provide navigation facilities to interactively explore<br />
large sets of subspaces. Our approach allows users to effectively compare and relate<br />
subspaces, identifying complementary or contradicting relations among them and thus<br />
identifying alternative clusterings (Chapter 5) [135].<br />
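The idea behind the first contribution, ranking the scatterplots of a scatterplot matrix by how well they separate clusters, can be sketched as follows. This is a minimal illustration under loud assumptions: the score used here (smallest distance between class centroids divided by the mean distance of points to their own centroid) is a simple stand-in chosen for brevity and is not one of the measures proposed in Chapter 3. The names `separation_score` and `rank_views` and the toy data are hypothetical.

```python
import math
from itertools import combinations


def separation_score(points, labels) -> float:
    """Stand-in quality measure for one 2D view (requires at least two classes):
    smallest centroid-to-centroid distance over mean within-class spread."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    centroids = {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in groups.items()
    }
    within = sum(
        math.dist(point, centroids[label]) for point, label in zip(points, labels)
    ) / len(points)
    between = min(math.dist(a, b) for a, b in combinations(centroids.values(), 2))
    return between / (within + 1e-12)  # small constant avoids division by zero


def rank_views(rows, labels):
    """Score every axis pair (i, j) and return the views sorted best first."""
    n_dims = len(rows[0])
    scored = []
    for i, j in combinations(range(n_dims), 2):
        view = [(row[i], row[j]) for row in rows]
        scored.append(((i, j), separation_score(view, labels)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: dimensions 0 and 1 separate the two classes, dimension 2 is noise.
rows = [
    (1.0, 1.0, 5.0), (1.2, 0.9, 1.0), (0.9, 1.1, 3.0),
    (5.0, 5.0, 4.9), (5.1, 4.8, 1.1), (4.9, 5.2, 3.1),
]
labels = [0, 0, 0, 1, 1, 1]
for view, score in rank_views(rows, labels):
    print(view, round(score, 2))
```

In this toy data set the view spanned by dimensions 0 and 1 is ranked first, because the noisy third dimension inflates the within-class spread of the other two views. A system built on such a ranking would present only the top-ranked views to the user.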
1.3 Thesis Structure<br />
Having illustrated the problem and enumerated the contributions<br />
of this thesis in the previous sections, we structure the remainder of the thesis as follows.<br />
Chapter 2 provides a brief overview of important related work in the field of high-dimensional<br />
data analysis, covering three main areas. Section 2.1 introduces the common<br />
challenges when analyzing high-dimensional data and presents dimension reduction techniques<br />
that reduce the data complexity. Section 2.2 describes important visualization<br />
techniques for high-dimensional data. Section 2.3 introduces standard automatic techniques<br />
from the data mining community and presents quality measures, which are<br />
automated ranking functions that judge the quality of a visualization with respect to a<br />
given task. Section 2.4 presents some examples where the interplay between visualization,<br />
automation, and interaction is far more beneficial than any of these techniques alone.<br />
Chapter 3 proposes eight new quality metrics for different tasks and two visualization<br />
types: scatterplot matrices and parallel coordinates. The metrics are tested on a set of<br />
synthetic and real data sets to demonstrate their effect. To ensure that the metrics reflect the<br />
user’s perception, a selected subset of measures for scatterplot matrices is evaluated and<br />
compared with the user’s perception. We found that both perform similarly. Based on this<br />
study, we have formulated guidelines for further evaluation of existing metrics.<br />
Based on a literature review, Chapter 4 introduces a systematization of different quality<br />
measures for high-dimensional data visualization. Their relation is described through<br />
characteristic factors, such as visualization technique or purpose, in order to arrive at a coherent<br />
and unified picture of these techniques. By putting the existing methods into a<br />
common framework, we hope to ease the generation of new research in the field and to spot<br />
relevant gaps to bridge with future research. Subsequently, Section 4.2 briefly presents<br />
the results of a qualitative data analysis that led to a visual cluster separability taxonomy.<br />
These results are the basis for the follow-up discussion of relevant aspects that arise<br />
when analyzing clusters visually and of what future work needs to focus on.<br />
Chapter 5 presents two interactive systems that help to make sense of high-dimensional<br />
data sets with respect to different clusterings. Searching in subspaces is<br />
needed because automatic pattern search is done through clustering algorithms, and it is not feasible<br />
to search for clusters in the full space of high-dimensional data. Section 5.1 introduces a<br />
visual tool, ClustNails, to investigate subspace clustering results for different state-of-the-art<br />
subspace clustering algorithms. This tool is intended to support the interpretation of<br />
the result with respect to the subspace cluster relations. With this visual tool, questions<br />
such as how many objects and how many dimensions clusters contain, which dimensions<br />
overlap between clusters, or which objects are shared by several clusters can be answered.<br />
Section 5.2 goes one step further and presents an analytical approach to support the<br />
identification of alternative clusterings in these spaces. High dimensionality<br />
provides different facets of the data. For example, in a data set about people we might<br />
have clusters from the perspective of musical taste (rock music, classical music, jazz, etc.), but<br />
at the same time we might also have different groupings of the same people describing their<br />
level of sporting activity. Both views of this data are valid but provide different insights<br />
into the data. To discover such alternative clusterings in high-dimensional data, in this<br />
section we propose an analytical workflow that starts by searching the set of possible<br />
subspaces to identify interesting subspaces. We then group these subspaces according to<br />
their data similarity, providing filtering mechanisms for further interactive investigation.<br />
Supported by interaction, different clusterings of the data can be identified.<br />
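The grouping step of this workflow can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the method of Section 5.2: instead of the data similarity used in the thesis, it uses the Jaccard overlap of the dimension sets as a simple stand-in, and it groups subspaces greedily against the first member of each group. The names `jaccard` and `group_subspaces`, the threshold, and the attribute names are hypothetical.

```python
def jaccard(a, b) -> float:
    """Overlap of two dimension sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_subspaces(subspaces, threshold: float = 0.5):
    """Greedy single-pass grouping: a subspace joins the first group whose
    representative (its first member) is at least `threshold` similar."""
    groups = []
    for subspace in subspaces:
        for group in groups:
            if jaccard(subspace, group[0]) >= threshold:
                group.append(subspace)
                break
        else:
            groups.append([subspace])
    return groups


# Hypothetical candidate subspaces returned by a subspace search.
candidates = [
    ("age", "income"),
    ("age", "income", "savings"),
    ("sport", "bmi"),
    ("sport", "bmi", "diet"),
]
for group in group_subspaces(candidates):
    print(group)
```

Each resulting group collects subspaces that describe a similar facet of the data, so that one representative per group is enough for a first interactive inspection.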
Chapter 6 concludes the thesis and gives an overview of further research questions that<br />
we deem worth investigating in the future.<br />
A schematic overview <strong>of</strong> the chapter <strong>in</strong>terrelations is shown <strong>in</strong> Figure 1.2.
[Figure 1.2, diagram panels: Chapter 1: Introduction; Chapter 2: High Dimensional Data Analysis; Chapter 3: QM based Visual Analysis of HD Data; Chapter 4: A Model of HD Data Visualization; Chapter 5: Visual Subspace Analysis of HD Data; Chapter 6: Conclusion and Future Work]<br />
Figure 1.2: Schematic overview of the interrelation of chapters in this thesis.<br />
Parts of this thesis were published in:<br />
1. A. Tatu, G. Albuquerque, M. Eisemann, J. Schneidew<strong>in</strong>d, H. Theisel, M. Magnor,<br />
and D. Keim. Comb<strong>in</strong><strong>in</strong>g automated analysis and visualization techniques<br />
for effective exploration of high dimensional data. Proceedings of the IEEE<br />
Symposium on <strong>Visual</strong> <strong>Analytics</strong> Science and Technology (VAST), pages 59-66, 2009.<br />
The contributions: for this publication I took the lead on the computer science<br />
research part <strong>of</strong> the paper implement<strong>in</strong>g the data space measures and lead<strong>in</strong>g also<br />
the writ<strong>in</strong>g <strong>of</strong> the paper itself. G. Albuquerque and M. Eisemann implemented the<br />
image quality metrics and provided their description <strong>in</strong> the paper and some parts<br />
<strong>of</strong> the evaluation section with these metrics. The Histogram Density measures were<br />
programmed by myself. J. Schneidew<strong>in</strong>d gave advice for structur<strong>in</strong>g the paper and<br />
present<strong>in</strong>g the results. D. Keim accompanied the project with suggestions for improvements<br />
for application and text. H. Theisel and M. Magnor gave advice to the<br />
project. All parts of the paper were revised several times by me; thus, in this thesis<br />
I use the paper text without citation marks. G. Albuquerque’s thesis (title unknown<br />
at the time of my submission) might also contain some text passages of this paper for<br />
the parts of the project in which she took part.<br />
2. A. Tatu, G. Albuquerque, M. Eisemann, P. Bak, H. Theisel, M. Magnor, and D. A.<br />
Keim. Automated Visual Analysis Methods for an Effective Exploration<br />
<strong>of</strong> <strong>High</strong>-<strong>Dimensional</strong> <strong>Data</strong>. IEEE Transactions on <strong>Visual</strong>ization and Computer<br />
Graphics (TVCG), 17(5):pp. 584-597, May 2011.
The contributions: publication 1 was selected as one of the best papers of the VAST’09<br />
conference, and this publication is an invited extension of publication 1. As primary author, I<br />
was responsible for writ<strong>in</strong>g the paper, generat<strong>in</strong>g new use-cases, test<strong>in</strong>g our measures<br />
and describ<strong>in</strong>g further research directions <strong>in</strong> this area. G. Albuquerque implemented,<br />
described and tested the new CSM measure. P. Bak gave advice for structur<strong>in</strong>g<br />
the experiments and present<strong>in</strong>g the results. D. Keim accompanied the paper with<br />
suggestions for improvements for application and text. M. Eisemann, H. Theisel<br />
and M. Magnor gave advice to the paper. All parts of the paper were revised several<br />
times by me; thus, in this thesis I use the paper text without citation marks. G.<br />
Albuquerque’s thesis (title unknown at the time of my submission) might also contain<br />
some text passages of this paper for the parts of the project in which she took part.<br />
3. A. Tatu, P. Bak, E. Bert<strong>in</strong>i, D. A. Keim, and J. Schneidew<strong>in</strong>d. <strong>Visual</strong> quality<br />
metrics and human perception: an <strong>in</strong>itial study on 2D projections <strong>of</strong> large<br />
multidimensional data. In Proceed<strong>in</strong>gs <strong>of</strong> the Work<strong>in</strong>g Conference on Advanced<br />
<strong>Visual</strong> Interfaces (AVI), pages 49-56. ACM, 2010.<br />
The contributions: for this publication I took primary responsibility and additionally,<br />
I took the lead on the automatic evaluation. P. Bak took the lead on the human<br />
experiment. Together we compared the results and evaluated them statistically. E.<br />
Bert<strong>in</strong>i, D. Keim and J. Schneidew<strong>in</strong>d accompanied the paper with suggestions for<br />
improvements for experimental design and text. All parts of the paper were revised<br />
several times by me; thus, <strong>in</strong> this thesis I use the paper text without citation marks.<br />
4. D. J. Lehmann, G. Albuquerque, M. Eisemann, A. Tatu, D. A. Keim, H. Schumann,<br />
M. Magnor and H. Theisel. <strong>Visual</strong>isierung und Analyse multidimensionaler<br />
Datensätze. Informatik-Spektrum, Spr<strong>in</strong>ger Berl<strong>in</strong>/Heidelberg, 33(6):589-<br />
600, 2010.<br />
The contributions: this publication was authored by D. J. Lehmann. My contribution<br />
was to describe the use <strong>of</strong> quality metrics for high-dimensional data. This thesis was<br />
<strong>in</strong>spired by the discussions <strong>of</strong> this paper.<br />
5. E. Bert<strong>in</strong>i, A. Tatu, and D. A. Keim. Quality Metrics <strong>in</strong> <strong>High</strong>-<strong>Dimensional</strong><br />
<strong>Data</strong> <strong>Visual</strong>ization: An Overview and Systematization. Proceed<strong>in</strong>gs <strong>of</strong> the<br />
IEEE Symposium on Information <strong>Visual</strong>ization (InfoVis), 17(12):pages 2203-2212,<br />
Dec. 2011.<br />
The contributions: this publication was authored equally by E. Bert<strong>in</strong>i and myself.<br />
We decided to show this by enumerat<strong>in</strong>g our names alphabetically <strong>in</strong> the authors list.<br />
E. Bert<strong>in</strong>i and I conducted the literature review, came up with the systematization<br />
and description model <strong>of</strong> quality metrics, and described this process <strong>in</strong> this paper. D.<br />
Keim played the devil’s advocate to test our model and gave advice for improvement.<br />
All parts of the paper were written and revised several times by both leading authors.<br />
Thus, <strong>in</strong> this thesis I use the paper text without citation marks.<br />
6. M. Sedlmair, A. Tatu, T. Munzner, and M. Tory. A taxonomy <strong>of</strong> visual cluster<br />
separation factors. Computer Graphics Forum (EuroVis), 31(3pt4):1335-1344,<br />
June 2012.<br />
The contributions: M. Sedlmair took the lead <strong>in</strong> writ<strong>in</strong>g this publication. M.<br />
Sedlmair and I conducted the qualitative analysis <strong>of</strong> the over 800 plots, and labeled
all the cases with different keywords. Based on these, M. Sedlmair and T. Munzner<br />
came up with the taxonomy, and described it <strong>in</strong> the paper. I tested special cases like<br />
grid size <strong>in</strong>fluence dur<strong>in</strong>g the writ<strong>in</strong>g process <strong>of</strong> the paper. M. Tory accompanied the<br />
paper with suggestions for improvements for the analysis and taxonomy and revised<br />
the text. In this thesis, I describe the results presented <strong>in</strong> that paper, without us<strong>in</strong>g<br />
the text, and I provide further ideas for research <strong>in</strong> this area.<br />
7. A. Tatu, F. Maaß, I. Färber, E. Bertini, T. Schreck, T. Seidl, and D. Keim. Subspace<br />
Search and Visualization to Make Sense of Alternative Clusterings<br />
in High-Dimensional Data. IEEE Symposium on Visual Analytics Science and<br />
Technology (VAST), pages 63-72, 2012.<br />
The contributions: for this publication I took the lead on the project and the paper<br />
writing. F. Maaß implemented the subspace tool, advised by myself, E. Bertini, and T.<br />
Schreck. T. Schreck gave advice on structuring the paper and presenting the results<br />
by providing initial sections of the paper. I. Färber provided an initial section on<br />
subspace clustering. T. Seidl and D. Keim gave advice on the project. Major parts of<br />
the paper were written by myself, and all the other parts were revised several times<br />
by me. Thus, in this thesis I use the paper text without citation marks.<br />
8. A. Tatu, L. Zhang, E. Bertini, T. Schreck, D. A. Keim, S. Bremm, and T. von Landesberger.<br />
ClustNails: Visual Analysis of Subspace Clusters. Tsinghua Science<br />
and Technology, Special Issue on Visualization and Computer Graphics, 17(4):419-428,<br />
Aug. 2012.<br />
The contributions: for this publication I took the lead on the project and the paper<br />
writing. I implemented the subspace tool, supported for some components by L. Zhang.<br />
E. Bertini and T. Schreck gave advice on structuring the paper and presenting the results<br />
and provided initial sections that I shaped for the final submission. D. A. Keim, S.<br />
Bremm, and T. von Landesberger gave advice on the project. Major parts of the<br />
paper were written by myself, and I revised all the other parts written by my co-authors<br />
several times to shape the final paper version. Thus, in this thesis I use the paper<br />
text without citation marks.<br />
Other publications to which I contributed but that are not included in this thesis:<br />
1. M. Schaefer, L. Zhang, T. Schreck, A. Tatu, J. A. Lee, M. Verleysen and D. A.<br />
Keim. Improving projection-based data analysis by feature space transformations.<br />
In Proceedings of SPIE 8654, Visualization and Data Analysis, 2013.<br />
2. B. Bustos, D. A. Keim, D. Saupe, T. Schreck and A. Tatu. Methods and User<br />
Interfaces for Effective Retrieval in 3D Databases (in German). Datenbank<br />
- Spektrum - Zeitschrift fuer Datenbank Technologie und Information Retrieval,<br />
dpunkt.verlag, 7(20):23-32, 2007.<br />
2<br />
High-Dimensional Data Analysis<br />
“You can observe a lot by watching.”<br />
Yogi Berra<br />
Contents<br />
2.1 Basic Techniques for High-Dimensional Data Analysis<br />
2.1.1 Common Challenges with High-Dimensional Data<br />
2.1.2 Feature Selection and Feature Extraction<br />
2.2 Information Visualization Techniques for High-Dimensional Data<br />
2.2.1 Information Visualization Techniques<br />
2.2.2 Limitations while Visualizing High-Dimensional Data<br />
2.3 Automated Techniques for High-Dimensional Data<br />
2.3.1 Data Mining Techniques for High-Dimensional Data<br />
2.3.2 Quality Measures for High-Dimensional Data Visualizations<br />
2.4 Visual Analytics for High-Dimensional Data<br />
2.4.1 Visual Interactive Systems for High-Dimensional Data Analysis<br />
2.4.2 Subspace Cluster Analysis and Visualization<br />
High-dimensional data contains complex patterns, and different data analysis approaches<br />
have been developed in recent years to uncover the hidden<br />
patterns in this data. As outlined in the following, this thesis is related to a number of<br />
broader areas in data analysis and visualization of high-dimensional data.<br />
In this chapter, Section 2.1 describes the main challenges when dealing with high-dimensional<br />
data and some basic techniques to reduce its dimensionality. Section 2.2 gives<br />
an overview of existing visualization techniques for high-dimensional data and identifies<br />
the visualization challenges that arise due to the data complexity. Section 2.3 first presents a<br />
series of automated techniques from Data Mining for pattern analysis in high-dimensional<br />
data, focusing on clustering. Its second part presents mechanisms to quantify the quality<br />
of visualizations, called quality metrics. Due to the limitations of a purely visual-interactive<br />
solution or a solely automatic approach, in Section 2.4 we present works from<br />
related fields where the interplay of visualization and automation together with interactive<br />
features can provide better solutions to the tasks at hand. All examples in these sections<br />
are in the context of finding and understanding patterns in high-dimensional data.<br />
Parts of this chapter appeared in [27, 132, 133, 134, 135, 136].<br />
2.1 Basic Techniques for High-Dimensional Data Analysis<br />
2.1.1 Common Challenges with High-Dimensional Data<br />
Before presenting different techniques to analyze high-dimensional data sets, we discuss<br />
two common challenges in this area.<br />
The first issue is the so-called curse of dimensionality, which makes high-dimensional analysis<br />
problems notoriously difficult. This term was<br />
formulated by R. Bellman [20] in the context of dynamic programming and describes<br />
the fact that the data becomes sparse as dimensionality increases. In other words,<br />
in high-dimensional data everything tends to be nearly equidistant, making it hard to<br />
distinguish between objects. Additionally, many existing Data Mining algorithms<br />
have a complexity that is exponential in the number of data dimensions.<br />
With increasing dimensionality, these algorithms become computationally intractable and<br />
therefore inapplicable in many real applications.<br />
The second issue is that the meaning of similarity in a high-dimensional space is<br />
diminished. It was shown in [28] that as dimensionality increases, the distance to<br />
the nearest data point approaches the distance to the farthest data point. This problem<br />
influences the design of similarity functions for objects in high-dimensional spaces.<br />
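The distance concentration described above can be observed in a small experiment. The following is a minimal sketch, assuming uniformly distributed random points and the Euclidean distance; the function name `relative_contrast` and the sample sizes are illustrative choices and do not come from [28].

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for the distances from one random
    query point to n_points uniformly random points in [0, 1]^n_dims."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)

# The contrast between farthest and nearest neighbor shrinks as the
# dimensionality grows (assumption: uniform data, Euclidean distance).
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(500, d), 3))
```

A contrast close to zero means that the nearest and the farthest point are almost equally far away, which is the situation in which similarity loses its meaning.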
2.1.2 Feature Selection and Feature Extraction<br />
A simple, but sometimes very effective, way to deal with high-dimensional data is to reduce<br />
the number of dimensions by eliminating those that seem to be irrelevant.<br />
Dimension reduction can be achieved by either feature selection [61] or feature extraction<br />
[44]. Feature selection is the problem of selecting, from a large space of input features<br />
(or dimensions), a smaller number of features that optimize a measurable criterion, e.g.,<br />
the accuracy of a classifier [97].<br />
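Feature selection as defined above can be sketched as a search over subsets of dimensions. The example below is only an illustration under stated assumptions: it searches exhaustively, which is feasible only for few dimensions, and it uses a toy criterion (total variance) as a stand-in for a real criterion such as classifier accuracy. The names `select_features` and `total_variance` are hypothetical.

```python
from itertools import combinations

def select_features(rows, k, criterion):
    """Return the k column indices whose projection maximizes `criterion`."""
    n_dims = len(rows[0])
    best_subset, best_score = None, float("-inf")
    for subset in combinations(range(n_dims), k):
        projected = [[row[j] for j in subset] for row in rows]
        score = criterion(projected)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset

def total_variance(rows):
    """Toy criterion (assumption): sum of the per-column variances."""
    n = len(rows)
    total = 0.0
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        mean = sum(column) / n
        total += sum((v - mean) ** 2 for v in column) / n
    return total

data = [(1.0, 5.0, 0.1), (2.0, 5.1, 0.9), (3.0, 4.9, 0.2), (4.0, 5.0, 0.8)]
print(select_features(data, 2, total_variance))
```

The number of candidate subsets grows combinatorially with the number of dimensions, which is why practical methods replace the exhaustive loop with heuristics.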
Feature extraction methods reduce the dimensionality of the data by forming a new<br />
set of dimensions as linear or nonlinear combinations of the original dimensions. These<br />
synthetic dimensions represent most (or all) of the structure of the original data set<br />
using fewer attributes. Depending on the training data, the methods can be supervised<br />
or unsupervised. “Supervised methods rely on class labels and optimize the performance<br />
of a supervised learning algorithm, typically a classifier. Unsupervised methods rely on<br />
quality criteria measured from the output of an unsupervised learning method, typically a<br />
clustering algorithm. However, many algorithms have variations for both supervised and<br />
unsupervised learning” [119]. Most automatic feature selection methods rely on supervised<br />
information (e.g., class-labeled data) to perform the selection. Consequently, they are not<br />
directly applicable to the explorative analysis problem.<br />
To convey the fundamental principle of feature extraction techniques, the<br />
next paragraphs describe two traditional dimension reduction methods: principal<br />
component analysis (PCA) [83] and multidimensional scaling (MDS) [41].<br />
PCA tries to preserve the variance in the data and transforms the set of possibly<br />
correlated dimensions into a new set of linearly uncorrelated dimensions that are linear<br />
combinations of the original dimensions and are called principal components. The first<br />
component captures the largest variance of the original dimension set, the second component<br />
is linearly uncorrelated with the first and captures the largest possible remaining<br />
variance, and so on. The data set can be reduced by retaining only a smaller set of principal<br />
components as transformed dimensions.<br />
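The idea of the first principal component can be illustrated without any numerical library. The following is a minimal sketch, assuming power iteration on the sample covariance matrix as the eigenvector method; this is one of several ways to compute PCA and is not taken from [83]. The function name `first_principal_component` is hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Estimate the direction of largest variance and the 1D projection.

    Assumption: the start vector (all ones) is not orthogonal to the
    dominant eigenvector, otherwise power iteration fails.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Projecting the centered data onto v gives the reduced representation.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

direction, scores = first_principal_component(
    [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)])
print(direction, scores)
```

Keeping only `scores` reduces the two-dimensional toy data set to a single synthetic dimension that follows the main trend of the points.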
MDS tries to preserve the pairwise distances between the data points. There are many<br />
variants of MDS, depending on the distance function used [31]. The simplest version<br />
is linear MDS, also called classical scaling, and its solution is very closely related to<br />
PCA when using a Euclidean distance function.<br />
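Classical scaling can be sketched in a few lines: the squared distance matrix is double centered and the leading eigenvector of the result yields the coordinates. The code below is a minimal one-dimensional sketch, assuming Euclidean input distances (so that the centered matrix has no dominant negative eigenvalue) and power iteration as the eigenvector method; the name `classical_mds_1d` is hypothetical.

```python
import math

def classical_mds_1d(dist, iterations=500):
    """Embed objects on a line from a symmetric distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J with J = I - (1/n) * 11^T.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = math.sqrt(sum(x * x for x in w))
        v = [x / eigenvalue for x in w]
    # Coordinates are the eigenvector scaled by the square root of its eigenvalue.
    return [math.sqrt(eigenvalue) * x for x in v]

# Distances between three points located at 0, 1 and 3 on a line.
coords = classical_mds_1d([[0, 1, 3], [1, 0, 2], [3, 2, 0]])
print(coords)
```

The recovered coordinates are centered at zero and unique only up to reflection, but their pairwise distances match the input distances.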
All these techniques rely on the idea that the variation of the data can be explained by<br />
a smaller number of transformed features. Their main difference from feature selection<br />
methods is that, instead of choosing a subset of dimensions from the data, they<br />
create new dimensions defined as functions over all dimensions. They also do not consider<br />
class labels; their computation relies only on the data points.<br />
A general problem of these techniques is that the mapping is often not unique. The<br />
techniques have several parameters that influence the result, and the<br />
resulting dimensions are sometimes difficult to interpret: the original dimensions come<br />
from a specific domain and have a certain interpretation (like age, income, etc.), but their<br />
linear combinations can hardly be interpreted.<br />
Koren and Carmel propose a series of new methods for creating projections from high-dimensional<br />
data sets using linear transformations [89]. For non-labeled data, they propose<br />
a generalization of PCA, the normalized PCA, which normalizes the squared pairwise<br />
distances to reduce the dominance of the large distances that normally occurs in the standard<br />
PCA transformation. For labeled data, their methods integrate the class labels of<br />
the data into the computation, resulting in projections with a clearer separation between<br />
the classes. Compared to traditional PCA or MDS, these methods have the advantage that<br />
they also capture intra-cluster shapes.<br />
In addition to PCA and MDS presented above, more techniques have been developed<br />
that are based on linear or non-linear transformations of the original features to obtain a<br />
reduced set of synthetic dimensions. Detailed surveys can be found in [111, 153]. Another<br />
prominent group of techniques for dimension reduction, which we briefly recall<br />
at this point, relies on signal processing techniques that, when applied to a data vector,<br />
transform it into a numerically different vector [64]. Examples are the Discrete Fourier<br />
Transform, the Cosine Transform, and the Wavelet Transform. Since input and transformed data<br />
vectors have the same length, the data is reduced by a user-specified threshold that is used<br />
to truncate the transformed vector (e.g., the wavelet coefficients).<br />
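The truncation step can be illustrated with a naive Discrete Fourier Transform. This is a sketch under assumptions: the threshold is interpreted here as the number `k` of leading coefficients to keep, which is only one possible reading, and the quadratic-time transform is chosen for clarity rather than speed. The function names are hypothetical.

```python
import cmath

def dft(signal):
    """Naive O(n^2) Discrete Fourier Transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def truncate(coefficients, k):
    """Keep only the first k coefficients as the reduced representation."""
    return coefficients[:k]

signal = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
reduced = truncate(dft(signal), 3)
print([round(abs(c), 3) for c in reduced])
```

The transformed vector has the same length as the input, so the reduction comes entirely from the truncation.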
2.2 Information Visualization Techniques for High-Dimensional Data<br />
2.2.1 Information Visualization Techniques<br />
The representation of high-dimensional data is one of the main research challenges in<br />
visualization. Several techniques have been developed in recent years to deal with the<br />
problem of representing relations among many dimensions on a computer display, which<br />
is inherently two-dimensional. By exploiting further visual variables such as color and shape, data visualizations can<br />
go a bit beyond 2D, but they still face issues when representing<br />
high-dimensional data sets. Classic approaches include parallel coordinates, scatterplot<br />
matrices, glyph-based and pixel-oriented techniques [145]. Figure 2.1 shows some examples<br />
for these techniques taken from [145].<br />
Figure 2.1: High-dimensional visualization techniques taken from [145]. A: Scatterplot matrix<br />
showing on the diagonal a histogram plot for each dimension. Selected points are marked in red in<br />
all plots. B: Parallel coordinates plot of a seven-dimensional data set. One polyline representing<br />
one data point is highlighted in red. C: Star glyphs in an MDS layout. D: Dense pixel displays<br />
representing a 14-dimensional data set.<br />
Scatterplots and Scatterplot Matrices [37]<br />
2D scatterplots are among the most commonly used visualization techniques in data analysis.<br />
The data is represented by points in a rectangular box, with the value of one<br />
variable (dimension) determining the position on the horizontal axis and the value of the<br />
other variable determining the position on the vertical axis. To represent a data set of<br />
higher dimensionality, a common approach is to build a scatterplot matrix (SPLOM) [37].<br />
Figure 2.1A shows an example of such a matrix for a four-dimensional data set, where<br />
every pair of dimensions is represented in one scatterplot. The matrix shows every plot<br />
twice, being symmetrical with respect to the diagonal. Additionally, on the diagonal, dimension<br />
histograms show the value distribution for each dimension. Selected<br />
points are highlighted in red, and a purple rectangle indicates their region.<br />
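The structure of such a matrix can be written down directly. The following minimal sketch enumerates the cells of a SPLOM for a list of dimension names; the function name `splom_layout` and the tuple layout of the cells are illustrative assumptions, not part of [37].

```python
from itertools import combinations

def splom_layout(dim_names):
    """Describe each cell of a scatterplot matrix.

    Diagonal cells hold a histogram of one dimension; every other cell holds
    a scatterplot with the column dimension on the horizontal axis and the
    row dimension on the vertical axis.
    """
    d = len(dim_names)
    cells = {}
    for row in range(d):
        for col in range(d):
            kind = "histogram" if row == col else "scatterplot"
            cells[(row, col)] = (kind, dim_names[col], dim_names[row])
    # Each unordered pair appears twice in the matrix (mirrored at the diagonal).
    unique_pairs = list(combinations(dim_names, 2))
    return cells, unique_pairs

cells, pairs = splom_layout(["a", "b", "c", "d"])
print(len(cells), len(pairs))
```

For d dimensions there are d(d-1)/2 distinct scatterplots, so the screen space available per plot shrinks quadratically as dimensions are added.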
Parallel Coordinates [78]<br />
Another important visualization method for multivariate data sets is parallel coordinates.<br />
Parallel coordinates were first introduced by Inselberg [77] and are used in several tools,<br />
e.g., XmdvTool [146] and VIS-STAMP [60], for visualizing multivariate data. The basic<br />
idea is that each dimension<sup>1</sup> of the data is a vertical line, so the axes of the plot are a<br />
collection of parallel lines. Each data point is a polyline that crosses each dimension axis,<br />
intersecting it at the point's value in that dimension. Figure 2.1B shows an example of parallel coordinates<br />
for a seven-dimensional data set where one data point’s polyline is highlighted in<br />
red. In comparison to scatterplots, parallel coordinates can show data sets of higher<br />
dimensionality in one display. In a SPLOM, a higher-dimensional data set can be visualized<br />
by plotting every two-dimensional combination in one scatterplot. For both parallel coordinates<br />
and SPLOMs, the ordering is important: for parallel coordinates the order of axes<br />
(dimensions), and analogously for the SPLOM the order of rows and columns, since different<br />
orderings make different relations in the data visible. It is therefore important to decide the order<br />
in which the dimensions are presented to the user. The effectiveness of both techniques, however, is<br />
highly dependent on the dimensionality of the data under inspection. Because the available<br />
resolution decreases as the number of data dimensions increases, it becomes very difficult, if<br />
not impossible, to explore the whole set of available orderings manually. In Section 2.3.2,<br />
we describe the notion of quality metrics, which are mechanisms to automatically quantify<br />
the quality of the display, and in Section 3.1.4 we introduce new quality metrics to determine<br />
the best ordering in parallel coordinates with respect to a given task.<br />
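The construction of a polyline and the size of the ordering problem can be sketched as follows. This is a minimal illustration, assuming axes placed at integer positions and values normalized to [0, 1] per axis; the function names are hypothetical and do not come from any of the cited tools.

```python
import math

def polyline(point, mins, maxs):
    """Return the (axis position, normalized height) vertices of one data point."""
    vertices = []
    for axis, (value, lo, hi) in enumerate(zip(point, mins, maxs)):
        span = (hi - lo) or 1.0  # avoid division by zero for constant dimensions
        vertices.append((axis, (value - lo) / span))
    return vertices

def axis_orderings(n_dims):
    """Number of possible left-to-right axis orders."""
    return math.factorial(n_dims)

print(polyline((5.0, 0.0, 7.5), (0.0, 0.0, 5.0), (10.0, 1.0, 10.0)))
print(axis_orderings(7))
```

Even for the seven-dimensional example there are thousands of axis orders, which illustrates why a manual search over all orderings is impractical.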
Glyph-based techniques [147]<br />
“Glyphs are graphical entities that convey one or more data values via attributes such<br />
as shape, size, color, and position” [147]. A variety of glyphs has been proposed in the<br />
literature, including star glyphs, face glyphs, profile glyphs,<br />
and box glyphs. An overview of multivariate glyphs can be found in [147]. They all have<br />
in common that they use one graphical representation per object, but use different encodings<br />
for the object's attributes (e.g., length, area, color). Figure 2.1C shows star glyphs<br />
as an example. As the name suggests, each object is represented by a star-shaped glyph,<br />
where the value of each dimension is represented by the length of one of the evenly spaced rays. The<br />
ray ends are connected by a polyline.<br />
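The geometry of a star glyph follows directly from this description. The sketch below computes the ray end points for one object, assuming that the values are already normalized to [0, 1]; the function name `star_glyph` and its parameters are illustrative.

```python
import math

def star_glyph(values, center=(0.0, 0.0), radius=1.0):
    """Return the ray end points of a star glyph for one data object.

    Ray i points in direction 2*pi*i/n and its length is proportional to
    the (normalized) value of dimension i.
    """
    n = len(values)
    outline = []
    for i, value in enumerate(values):
        angle = 2.0 * math.pi * i / n
        outline.append((center[0] + radius * value * math.cos(angle),
                        center[1] + radius * value * math.sin(angle)))
    return outline

points = star_glyph([1.0, 0.5, 1.0, 0.5])
print(points)
```

Connecting consecutive end points, and the last one back to the first, gives the closed outline of the glyph.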
Pixel-oriented techniques [145]<br />
Pixel-oriented techniques “map each value to individual pixels and create a filled polygon to<br />
represent each dimension” [145]. In Figure 2.1D a 14-dimensional data set is represented<br />
by dense pixel displays showing each dimension in a separate rectangle and each data<br />
value as a colored pixel in the rectangle. The values are sorted according to the tenth<br />
dimension, which is marked with a black border. Here we can see several challenges for<br />
these techniques. One is the already mentioned ordering of data values to spot correlated<br />
dimensions; another is the ordering of dimensions to position similar dimensions close<br />
to each other on the screen. Using different colormaps can also reveal different patterns in<br />
the data; thus, choosing a suitable colormap for each data set and task<br />
is yet another challenge. Additionally, positioning the dimensions on the screen is not<br />
trivial, since different layouts, not only the grid layout, are possible.<br />
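The layout of a dense pixel display can be sketched as follows. This is a minimal illustration under assumptions: records are sorted by one chosen dimension, each dimension is normalized to [0, 1] as input for a colormap, and pixels are filled row by row; real pixel-oriented techniques use more elaborate arrangements. The function name `pixel_blocks` is hypothetical.

```python
def pixel_blocks(records, sort_dim, width):
    """Return one 2D block of normalized values per dimension.

    All blocks share the same record order (sorted by `sort_dim`), so the
    same pixel position refers to the same record in every block.
    """
    ordered = sorted(records, key=lambda r: r[sort_dim])
    blocks = []
    for d in range(len(records[0])):
        column = [r[d] for r in ordered]
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1.0  # constant dimensions map to 0.0
        shades = [(v - lo) / span for v in column]
        # Fill the rectangle row by row with `width` pixels per row.
        blocks.append([shades[i:i + width] for i in range(0, len(shades), width)])
    return blocks

records = [(3, 10), (1, 30), (2, 20), (4, 40)]
for block in pixel_blocks(records, sort_dim=0, width=2):
    print(block)
```

A dimension that is correlated with the sorting dimension shows a smooth gradient in its block, whereas an uncorrelated one looks noisy.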
1 We use the terms dimension and attribute (as well as feature, variable, column and axis) interchangeably<br />
in this thesis. We choose among them based on the context of the discussion, while attempting to be<br />
consistent with their use in the literature.<br />
2.2.2 Limitations while Visualizing High-Dimensional Data<br />
As shown above, there are different ways to represent high-dimensional data<br />
on the screen, and all of them bring a number of challenges. As already<br />
identified, there are challenges due to the scalability of the display, the ordering of displayed<br />
objects or dimensions, the positioning of objects on the screen, and the high number<br />
of possible visual mappings. Providing solutions for some of these problems would ease the<br />
exploration of high-dimensional data. With an appropriate sorting of dimensions and<br />
an appropriate mapping to visual variables, clutter can be reduced, and these visualization<br />
methods can provide an overview of high-dimensional data sets and help relate them [49]. The data<br />
dimensionality causes problems in the visual mapping stage: it is unclear which<br />
mapping is the best, i.e., which data dimension should be mapped to which visual variable.<br />
Because of the high number of possible mappings for a high-dimensional data set, automated<br />
methods are needed to restrict this number. One way is to judge the quality of these<br />
mappings by computing quality measures for the displayed data (see Chapter 3 for more<br />
details); another is to reduce the number of dimensions by dimensionality reduction techniques<br />
(see Section 2.1.2).<br />
Enriching Visualizations<br />
Static visualization techniques are not flexible enough to reveal complex high-dimensional<br />
patterns; thus, interaction is needed. Different solutions have been proposed to make<br />
visualizations interactive, supporting a dynamic use for high-dimensional data. These include<br />
brushing and linking [46], panning and zooming [19], focus-plus-context [92], and magic<br />
lenses [29].<br />
“Brushing and linking refers to the connecting of two or more views of the same data,<br />
such that a change to the representation in one view affects the representation in the<br />
other views as well. ... Panning and zooming refers to the actions of a movie camera<br />
that can scan sideways across a scene (panning) or move in for a closeup or back away to<br />
get a wider view (zooming). ... When zooming is used, the more detail is visible about<br />
a particular item, the less can be seen about the surrounding items. Focus-plus-context<br />
is used to partly alleviate this effect. The idea is to make one portion of the view – the<br />
focus of attention – larger, while simultaneously shrinking the surrounding objects. The<br />
farther an object is from the focus of attention, the smaller it is made to appear. ... Magic<br />
lenses are directly manipulable transparent windows that, when overlapped on some other<br />
data type, cause a transformation to be applied to the underlying data, thus changing<br />
its appearance” [15]. A full exemplification of these techniques is beyond the scope of this<br />
work; more details can be found in [15]<sup>2</sup>.<br />
Patterns that are visible only in subspaces of the original data space also need specialized<br />
visualizations to disclose the relations between the different subspaces from which<br />
they originate as well as their possible object overlap. In Chapter 5 we present a visual-interactive<br />
tool for this purpose.<br />
2 The cited descriptions of the techniques are from Chapter 10: User Interfaces and Visualization, by<br />
Marti Hearst. This chapter can also be found online at<br />
http://people.ischool.berkeley.edu/~hearst/irbook/10/node3.html#SECTION00122000000000000000f (last accessed on 03/13).<br />
2.3 Automated Techniques for High-Dimensional Data<br />
In this section, we present automated methods for analyzing high-dimensional data. Section<br />
2.3.1 discusses different data mining approaches to extract patterns from data. The<br />
focus is on clustering. We present general approaches, enumerate approaches that have<br />
been developed especially for coping with high-dimensional data, and present the difference<br />
between clustering in a dimension-reduced data set and subspace clustering. Besides<br />
automated pattern extraction, in Section 2.3.2 we introduce automation to judge the quality<br />
of visualizations, namely by quality metrics. Given the huge number of possible visual<br />
representations for high-dimensional data, the user is assisted in finding the right visual<br />
mapping or the right projection for the data. Our contribution to this area, consisting of<br />
new measures, a quality measures pipeline, and a systematization of existing measures, is<br />
outlined in Chapters 3 and 4.<br />
2.3.1 Data Mining Techniques for High-Dimensional Data<br />
Data Mining refers to extracting, or mining, knowledge (interesting patterns) from large<br />
amounts of data [64]. In order to extract these data patterns, different intelligent methods<br />
have been developed in the past. One important method, which is also the closest to<br />
this thesis, is clustering. Clustering takes the data set as input and groups the objects<br />
according to their similarity into different groups, called clusters. The goal is to maximize<br />
the similarity between objects of one group and to minimize the similarity between objects<br />
of different groups. That means that objects of one group are very similar to each<br />
other, while dissimilar to objects of other groups. The similarity is calculated on the full<br />
attribute space, using different distance functions, like Euclidean, Minkowski, or City-block<br />
distances.<br />
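To make the role of the distance function concrete, the following minimal sketch computes the three distances named above for two objects given as plain tuples of numeric attributes. The function names are our own illustrative choices and are not taken from any of the cited works.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length attribute vectors."""
    if len(a) != len(b):
        raise ValueError("both objects must have the same number of attributes")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def euclidean(a, b):
    """Euclidean distance: the Minkowski distance with p = 2."""
    return minkowski(a, b, 2)


def city_block(a, b):
    """City-block (Manhattan) distance: the Minkowski distance with p = 1."""
    return minkowski(a, b, 1)
```

Euclidean and City-block distances are thus special cases of the Minkowski distance, which is why a single parameterized function suffices in this sketch.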
State of the Art Clustering<br />
There are different criteria to classify the existing clustering algorithms. We roughly<br />
differentiate them into hierarchical clustering algorithms and partitioning clustering<br />
algorithms, and enumerate some of the best-known representatives. For further details,<br />
please refer to the surveys [21, 155] or the original papers of the algorithms.<br />
Hierarchical clustering organizes objects into groups that are themselves grouped<br />
into larger groups. This is done successively, building up a hierarchy of clusters. Representatives<br />
for this category, which we will also use later in Section 5.2, are hierarchical clusterings<br />
with different linkage methods, like single-linkage, complete-linkage, average-linkage, or<br />
minimum variance [144]. To handle large-scale data, new hierarchical algorithms that<br />
improve the clustering performance have appeared in recent years.<br />
One example is BIRCH [162], an algorithm that uses a height-balanced tree to<br />
store summaries of the original data and can achieve linear computational complexity.<br />
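As an illustration of how a linkage method drives the construction of the hierarchy, the sketch below implements a naive agglomerative clustering with single-linkage. It is a didactic version under our own assumptions (Euclidean distance, points as tuples, a hypothetical function name `single_linkage`) with cubic running time, not one of the scalable algorithms mentioned above.

```python
import math


def single_linkage(points, num_clusters):
    """Naive agglomerative clustering with single-linkage.

    Starts with one cluster per point and repeatedly merges the two clusters
    whose closest members are nearest to each other. Returns the final
    clusters (lists of point indices) and the list of merges performed.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")

    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(c1, c2):
        # Single-linkage: distance between the two closest members.
        return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    return clusters, merges
```

The recorded merges, read in order, correspond to the levels of the cluster hierarchy. Replacing `min` by `max` in `linkage` would give complete-linkage.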
The partitioning methods divide all the data objects into a fixed number of groups,<br />
without any hierarchical structure. Major representatives for this category are algorithms<br />
like the density-based DBSCAN [50] and OPTICS [10], or relocation methods like<br />
k-medoids and k-means methods [56].
18 Chapter 2. High-Dimensional Data Analysis<br />
Clustering in High Dimensions<br />
For high-dimensional data sets, the challenge is to design effective and efficient clustering<br />
algorithms that can cope with the high number of objects, dimensions, and the noise level<br />
of this kind of data. Therefore, a number of different algorithms were proposed to cluster<br />
this type of data.<br />
CURE [57] is a hierarchical clustering algorithm that can explore arbitrary cluster<br />
shapes and utilizes a random sample strategy to reduce computational complexity.<br />
Density-based clustering (DENCLUE) [70] is a well-known approach for density-based<br />
clustering of high-dimensional data. To make computations more feasible, the data is indexed<br />
using a B+-tree. The algorithm is built on the idea that the influence of each data<br />
point on its neighborhood can be modeled using a so-called influence function. The overall<br />
density of the data space can be modeled analytically as the sum of the influence functions<br />
of all data points. Clusters are then determined by identifying local maxima of<br />
the overall density function.<br />
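The idea of an overall density function can be sketched in a few lines. The example below assumes a Gaussian influence function with a bandwidth parameter `sigma` and locates a local density maximum with a simple fixed-point (mean-shift style) update. Both the function names and the choice of update rule are our own simplifications for illustration. The sketch omits the index structure and should not be read as the DENCLUE implementation of [70].

```python
import math


def gaussian_influence(x, y, sigma):
    """Influence of data point y at location x (Gaussian kernel, assumed here)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def density(x, data, sigma=1.0):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, p, sigma) for p in data)


def climb_to_local_maximum(start, data, sigma=1.0, max_iter=200, tol=1e-9):
    """Move from `start` uphill to a nearby local maximum of the density.

    Each step replaces x by the influence-weighted mean of the data points,
    which is a fixed-point iteration for stationary points of a Gaussian
    density estimate.
    """
    x = tuple(start)
    for _ in range(max_iter):
        weights = [gaussian_influence(x, p, sigma) for p in data]
        total = sum(weights)
        if total == 0.0:
            break  # numerically no influence at x; nothing to climb
        new_x = tuple(
            sum(w * p[d] for w, p in zip(weights, data)) / total
            for d in range(len(x))
        )
        converged = math.dist(x, new_x) < tol
        x = new_x
        if converged:
            break
    return x
```

Data points whose climbs end at the same local maximum would then be assigned to the same cluster.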
Although these algorithms can deal with large-scale data, they are sometimes not<br />
sufficient to analyze high-dimensional data. Due to the previously described<br />
curse of dimensionality, algorithms relying on distance functions can no longer<br />
perform well in high-dimensional spaces. To overcome this problem, dimension reduction<br />
(see Section 2.1.2) is used in cluster analysis to reduce the dimensionality of the data<br />
sets. However, dimensionality reduction methods cause some loss of information, and<br />
may destroy the interpretability of the results or even distort the real clusters. Moreover,<br />
such techniques do not actually remove any of the original attributes from the analysis.<br />
This is problematic when there are a large number of irrelevant attributes. The irrelevant<br />
information may mask the real clusters, even after transformation. Another way to tackle<br />
this problem is to use subspace clustering algorithms, which search for data clusters in<br />
different subsets of the dimensions of the same data set. Different subspaces may contain<br />
different meaningful clusters. The problem here is how to identify such subspace clusters efficiently.<br />
A large number of algorithms for subspace clustering have been developed in the past,<br />
and we briefly describe some representatives next. CLIQUE (CLustering<br />
In QUEst) [6] employs a bottom-up approach and searches all subspaces for rectangular<br />
cells with a high density of points. The clusters are generated by merging these<br />
rectangles. OptiGrid [71] is designed to obtain an optimal grid partitioning using cutting<br />
hyperplanes. It uses density estimations similar to DENCLUE to find the plane that<br />
separates two significantly dense half spaces and goes through a point of minimal density,<br />
using a set of linear projections. In Section 5.1 we use the k-medoid based algorithm<br />
PROCLUS (PROjected CLUstering) [4], one of the most robust algorithms for subspace<br />
clustering. It defines a cluster as a densely distributed subset of data objects in a subspace.<br />
ORCLUS (arbitrarily ORiented projected CLUster generation) [5] uses a similar approach<br />
but uses non-axis-parallel subspaces to find the clusters. Further elaborations on the<br />
problem of subspace clustering are described in Section 2.4.2 and Section 5.1.2.<br />
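The bottom-up search for dense cells can be illustrated with a strongly simplified sketch. It assumes data normalized to the range [0, 1), a regular grid with `bins` intervals per dimension, and a simple count threshold. It only goes from one-dimensional to two-dimensional subspaces and omits the merging of adjacent cells into clusters. All names are hypothetical, and the code is not the CLIQUE implementation of [6].

```python
from collections import Counter
from itertools import combinations


def dense_cells(data, dims, bins, threshold):
    """Grid cells of the subspace `dims` that contain at least `threshold` points.

    Assumes every attribute value lies in [0, 1).
    """
    counts = Counter(
        tuple(min(int(row[d] * bins), bins - 1) for d in dims) for row in data
    )
    return {cell for cell, n in counts.items() if n >= threshold}


def dense_subspaces_up_to_2d(data, bins, threshold):
    """Bottom-up search: only dimensions that are dense on their own are
    combined into two-dimensional candidate subspaces."""
    n_dims = len(data[0])
    result = {}
    for d in range(n_dims):
        cells = dense_cells(data, (d,), bins, threshold)
        if cells:
            result[(d,)] = cells

    one_dimensional = sorted(result)
    for (a,), (b,) in combinations(one_dimensional, 2):
        cells = dense_cells(data, (a, b), bins, threshold)
        if cells:
            result[(a, b)] = cells
    return result
```

The pruning step relies on the observation that a cell can only be dense in a subspace if its projections onto the lower-dimensional subspaces are dense as well, which is what makes the bottom-up strategy feasible.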
Other Data Mining Techniques<br />
In addition to clustering techniques, many other techniques have been developed in<br />
the past.<br />
They mainly mine frequent patterns, associations, correlations, or outliers
in data. A frequent pattern is a set of items that occur frequently together in a data set. This<br />
term was first proposed by [7] in the context of frequent itemsets and association rule<br />
mining. By mining frequent patterns, the goal is to identify regularities in the data, like<br />
products often purchased together in basket data analysis. Frequent patterns form the<br />
foundation for many essential data mining tasks, such as association analysis, correlation<br />
analysis, classification (associative classification) and cluster analysis (frequent pattern-based<br />
clustering). “Association analysis is the discovery of association rules showing<br />
attribute-value conditions that occur frequently together in a dataset” [63]. As mentioned<br />
in Section 1.1, support and confidence can characterize the quality of association rules. The<br />
rules are generated based on the frequent itemsets identified in the data. One problem,<br />
however, is that for low support and confidence levels the resulting set of association rules<br />
is very large. Using higher levels of support and confidence can remove useful rules, so<br />
a mechanism is needed to detect the right confidence level. Visualization can help to<br />
overcome this issue, and supports the user in identifying the right rules. In Section 3.1 we<br />
will present image-based quality measures to identify correlation among data attributes<br />
and attributes forming strong groups (clusters) in the data.<br />
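Support and confidence can be computed directly from a list of transactions. The following sketch uses the standard definitions (support as the fraction of transactions containing an itemset, confidence as the support of the whole rule divided by the support of its antecedent). The function names and the basket data are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    Returns 0.0 when the antecedent never occurs, to avoid a division by zero.
    """
    support_antecedent = support(transactions, antecedent)
    if support_antecedent == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support_antecedent


# Hypothetical basket data, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
```

For these baskets, the rule bread → milk has a support of 2/4 and a confidence of 2/3. Raising the minimum support above 0.5 would already discard it, which illustrates how thresholds can remove potentially useful rules.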
In classification analysis, the data is often classified (labeled), and a model is derived<br />
to distinguish these data classes. This model is trained on a subset of the data, called the<br />
training set. Another subset of the data, the so-called test set, is used to validate the<br />
model. The model can be represented by classification rules, decision trees, neural<br />
networks or mathematical formulas and is used to classify new data. However, users often<br />
need to predict missing values in the data, rather than class labels. When the predicted<br />
values are numerical, the process is called prediction. Our work on quality metrics with<br />
labeled data (see Section 3.1.3 and Section 3.1.5) can be seen as a complementary way<br />
to identify the attributes that can best distinguish the classes in the data relevant for<br />
building the classification model. Classification is also referred to as supervised learning,<br />
because the training set is used to teach how to classify new data. Clustering is referred to<br />
as unsupervised learning, since there are no class labels for training, and clusters or classes<br />
are established to group the data elements.<br />
In some applications, such as fraud detection, rare events can be of interest. The analysis<br />
of outlier data is referred to as outlier mining. Outliers can be detected, for example, by<br />
using statistical tests, but also by some quality metrics. Examples of quality metrics for<br />
outliers are marked in Table 4.2 in Chapter 4.<br />
2.3.2 Quality Measures for High-Dimensional Data Visualizations<br />
General Measures<br />
Quality metrics (or measures) in visualization have a long history. While in our work we<br />
focus only on their specific use in high-dimensional data analysis, they have a broader<br />
scope than we can describe here. Early attempts to calculate quality metrics can be<br />
traced back to the work of Tufte [139], who proposed metrics such as the data-ink<br />
ratio and the lie factor, which respectively optimize the use of the visualization space<br />
and reduce the distortions that a visualization may introduce. Later, in 1997, Richard Brath<br />
proposed a rich set of metrics to characterize the quality of business visualizations [32]<br />
and, around the same period, Miller et al. advocated the use of visualization metrics as a<br />
way to compare visualizations [100]. The graph drawing community developed its own set
of metrics, most notably aesthetic metrics such as those found in the foundational work of<br />
Ware et al. on cognitive measurements of graph aesthetics [149]. Later, the term quality<br />
metrics assumed a more specific meaning; in particular it appeared in the context of a<br />
number of papers related to clutter reduction and scalability [24, 26, 80, 82, 112].<br />
For the sake of completeness, it is worth mentioning that the word metric is also used in<br />
the context of information visualization user studies as a way to indicate how the elements<br />
of interest are measured (e.g., [108, 113]).<br />
Scatterplot Measures<br />
The idea of using measures calculated over the data or over the visualization space to select<br />
interesting projections was already proposed in foundational works like Projection<br />
Pursuit [54, 74] and Grand Tour [13]. Projection Pursuit searches for low-dimensional<br />
(one- or two-dimensional) projections that expose interesting structures, using a “Projection<br />
Pursuit Index” that considers inter-point distances and their variation. Grand Tour<br />
adopts a more interactive approach by allowing the user to easily navigate through many<br />
viewing directions, creating a movie-like presentation of the whole original space.<br />
More recently, several works appeared in the visualization community that propose different<br />
forms of quality measures. Examples are graph-theoretic measures for scatterplot<br />
matrices [151], measures over pixel-based visualizations [120], measures based on clutter<br />
reduction for visualizations [25, 112], and composite measures to find several data structures:<br />
outliers, correlations, and sub-clusters [82]. We present a systematization of works<br />
on quality measures in Chapter 4 and propose a quality measures pipeline to describe the<br />
process of these measures. Additionally, several factors are derived to characterize the<br />
measures in a common language, and implications for further research are raised. At this<br />
point, it seems important to provide a short description of the first two categories and<br />
postpone the details of the others to Chapter 4.<br />
First, the scagnostics measures [140] play an important role since they are a major<br />
source of inspiration for our work. The scagnostics method [140] was proposed as an<br />
alternative to Projection Pursuit to analyze structures in scatterplots. Since its authors never<br />
published the specifics of the method, Wilkinson et al. [151] took the opportunity to<br />
present the scagnostics ideas and apply them to high-dimensional data. They describe<br />
detailed graph-theoretic measures for scatterplots. This means that graphs and their properties<br />
(like convex hull, alpha hull, Minimum Spanning Tree (MST)) are used as the basis for<br />
computing scagnostics measures. Their scagnostics indices assess five aspects of the point<br />
distribution (outliers, shape, trend, density and coherence) through nine characteristic<br />
indices for the distribution of points in scatterplots: outlying, skewed, clumpy, convex,<br />
skinny, striated, stringy, straight, and monotonic. Originally, these indices were used to<br />
form a SPLOM of scagnostics, where each axis is a scagnostics measure. Here each data<br />
scatterplot is represented by a point according to its measures. The scagnostics SPLOM<br />
was used to spot unusual scatterplots regarding their data distribution (see Figure 2.2A).<br />
These indices were also used as ranking functions in data SPLOMs supporting different<br />
analysis tasks [152] as shown in Figure 2.2B.<br />
Second, the approach most similar to ours presented in Chapter 3 is Pixnostics, proposed<br />
by Schneidewind et al. [120]. They also use image-analysis techniques to rank the<br />
different lower-dimensional views of the data set and present only the best ranked to the<br />
user. The method not only provides valuable lower-dimensional projections to the<br />
user, but also optimized parameter settings for pixel-level visualizations. However, while<br />
their approach concentrates on pixel-level visualizations, we focus on scatterplots and<br />
parallel coordinates.<br />
Figure 2.2: (A) Scagnostics SPLOM having scagnostics measures as axes and showing each data<br />
scatterplot as a point in the measures scatterplot [152]. (B) Scagnostics indices used as quality<br />
measures to rank data scatterplots [152].<br />
We contribute to the field of quality metrics by proposing image-based and data-based<br />
measures for classified and non-classified data in scatterplots and parallel coordinates<br />
in Section 3.1. In Section 3.1.2 we present an image-based measure for non-classified<br />
scatterplots in order to quantify the structures and correlations between the respective<br />
dimensions. Our measure could, for example, be used as an additional index in a scagnostics<br />
matrix.<br />
Parallel to our work from Section 3.1, published in [133], Sips et al. [129] developed a<br />
class consistency visualization algorithm. Similar to our Histogram Density measures, the<br />
class consistency method proposes measures to rank 2D scatterplots. It filters the highest<br />
ranked scatterplots and presents them in an ordinary scatterplot matrix.<br />
Parallel Coordinates Measures<br />
Measures were not only used to rank a high number of visualizations regarding their<br />
structures, but also to optimize visualizations for high-dimensional data<br />
representation. One major factor handled by these measures is optimizing the ordering of<br />
elements (like axes or data points) in the visualization. Aiming at dimension reordering,<br />
Ankerst et al. [9] presented a method based on similarity clustering of dimensions, placing<br />
similar dimensions close to each other. Yang [159] developed a method to generate interesting<br />
projections also based on similarity between the dimensions. Similar dimensions<br />
are clustered and used to create a lower-dimensional projection of the data.<br />
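The principle of placing similar dimensions next to each other can be sketched with a simple greedy heuristic. The example assumes the absolute Pearson correlation as similarity between two dimensions and always appends the remaining dimension most similar to the last placed one. This heuristic and its function names are our own illustration and do not reproduce the methods of [9] or [159].

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long value lists (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def greedy_axis_order(columns):
    """Order dimensions so that neighboring axes are similar.

    `columns` holds one list of values per dimension. Starting with
    dimension 0, the most similar remaining dimension is appended next.
    """
    n = len(columns)
    similarity = [
        [abs(pearson(columns[i], columns[j])) for j in range(n)] for i in range(n)
    ]
    order = [0]
    remaining = set(range(1, n))
    while remaining:
        last = order[-1]
        nxt = max(sorted(remaining), key=lambda j: similarity[last][j])
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

A greedy pass like this does not guarantee an optimal arrangement, but it shows how a similarity measure between dimensions translates into an axis order for parallel coordinates.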
As an alternative to the methods for dimension reordering for parallel coordinates, we<br />
propose a method based on the structure present in the low-dimensional embeddings<br />
of the data set. Three different kinds of measures to rank these embeddings are presented
in Section 3.1.4 for class and non-class based visualizations.<br />
Evaluating Measures<br />
A common denominator of all these works is the total absence of user studies that inspect<br />
the relationship between human-detected and machine-detected data patterns. While it<br />
is certainly clear how these measures can help users deal with large data spaces, there<br />
are a number of open issues related to the human perception of the structures captured<br />
automatically by the suggested algorithms. In Section 3.2 we focus on the question of<br />
whether there is a correlation between what the human eye perceives and what the machine<br />
detects.<br />
Despite the lack of user studies specifically focused on the issues discussed above,<br />
there are a number of user studies focused on the detection of visual patterns which are<br />
worth mentioning here. A large literature exists on the detection of pre-attentive features,<br />
notably the work of Healey focused on visualization [67], and on Gestalt Laws [148], which<br />
are often taken as the basis for the detection of patterns from visual representations. Some<br />
more specific works focused on visualization are: [25] and [68] based on the perception of<br />
density in pixel-based scatterplots and in visualizations based on “pexels” (perceptual<br />
texture elements) respectively, [81] on the study of thresholds for the detection of patterns<br />
in parallel coordinates, and [65] on the correlation between visualization performance<br />
and similarity with natural images. The study presented in [118] is also relevant and very<br />
similar to ours presented in Section 3.2 in terms of experiment design. Users ranked a<br />
series of images in terms of their perception of the degree of clutter exposed by the image,<br />
and the study measured the degree of correlation between the user rank and the rank<br />
given by the suggested measure, named feature congestion.<br />
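One common way to quantify the agreement between a user ranking and a ranking produced by a measure is Spearman's rank correlation coefficient. The sketch below uses the textbook formula for rankings without ties. We do not claim that the cited study used this particular coefficient, and the function name is our own.

```python
def spearman_no_ties(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same items.

    Assumes both inputs are rank positions (1..n) without ties:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
    """
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("need two rankings of equal length with at least 2 items")
    squared_diffs = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * squared_diffs / (n * (n * n - 1))
```

A value close to 1 indicates that the measure orders the images almost as the users do, a value close to -1 indicates the reverse order, and values around 0 indicate no monotonic relationship.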
2.4 Visual Analytics for High-Dimensional Data<br />
2.4.1 Visual Interactive Systems for High-Dimensional Data Analysis<br />
As presented in the previous chapter, combining data visualization with interactive and<br />
automated components speeds up the analysis of high-dimensional data sets. As a consequence,<br />
many interactive systems have been developed recently to support the user in<br />
analyzing high-dimensional data sets. Since there is a large number of interactive systems<br />
in the literature, presenting a full summary would overload this section. Hence, in<br />
the following paragraphs, we identify only the four main domains related to this thesis<br />
and enumerate a selection of visual interactive systems for visual feature selection, visual<br />
clustering, visual classification, and dimension reordering.<br />
Visual Feature Selection<br />
Reducing high-dimensional data to a smaller subset of features that express the data<br />
characteristics is a crucial task in high-dimensional data analysis. Data features are therefore<br />
compared, for example by computing correlations, data variation, etc., to identify their<br />
importance in expressing the data characteristics. Since fully automated feature selection methods<br />
are often infeasible due to the data complexity and dimensionality, visual-interactive<br />
systems have been developed to deal with this problem. We illustrate three examples of<br />
such systems in Figure 2.3 with a short description, and point to more literature in this<br />
field in the next paragraphs.<br />
Figure 2.3: Visual interactive feature selection systems. A: Rank-by-Feature Framework presented<br />
in [125]. B: Feature selection supported by quality measures [82]. C: DimStiller for feature selection<br />
[76].<br />
Among existing works involving visual-interactive selection or comparison of features, the<br />
Rank-by-Feature Framework [125] (see Figure 2.3A) provides a sorted visual overview<br />
of the correlation among pairs of features. In [82], the selection of input features was<br />
supported by a measure of the interestingness of the visual view provided by candidate<br />
features (see Figure 2.3B). An interactive dimensionality reduction workflow was presented<br />
in [76], relying on visual approaches to guide users in selecting features (see Figure 2.3C).<br />
In [33] and [34], interactive visual comparison was proposed to relate data described<br />
in different given feature spaces based on 2D mappings and tree structures extracted from<br />
the different data spaces. Furthermore, in [93] a visual design based on network and heat<br />
map visualization was proposed to relate clusterings in different subsets of dimensions.<br />
In [159], dimensions are hierarchically clustered based on a simple value-oriented similarity<br />
measure. Based on this structure, user navigation can take place to identify interesting<br />
subspaces. In a recent work [161], the output of this simple search method was visualized<br />
by tree- and matrix-based views, where each dimension combination was represented by<br />
a single MDS plot.<br />
In summary, many of these methods are applicable to compare data regarding different
criteria. However, most of them assume the feature selection to be performed globally and<br />
do not take the subspace search problem directly into account. One focus of this thesis<br />
is to show that local selection of features is essential when analyzing patterns of high-dimensional<br />
data. The analysis is then performed in different subspaces of the data, and<br />
related work on visual analysis tools that deal especially with subspaces will be presented<br />
in the next subsection.<br />
Visual Clustering<br />
Identification and relation of groups of data is a key explorative data analysis task. Often,<br />
user interaction is needed to identify and revise the number and characteristics of data<br />
clusters found by automatic search methods. To this end, visual-interactive approaches are<br />
useful. Although many methods have been proposed, we can only highlight a few of them<br />
in an exemplary manner. In [124], interactive exploration of hierarchically clustered data<br />
along a dendrogram data structure is proposed to help users find the right level of clusters<br />
for their tasks (see Figure 2.4A). In [159], the parallel coordinates approach serves as a<br />
basic display to show data clustering results, allowing users to compare clusters along their<br />
high-dimensional data space. Also, 2D projections, possibly in conjunction with glyph-based<br />
representation of clusters, are widely employed; a recent example is [35] (see Figure 2.4B).<br />
Figure 2.4: Interactive visual analysis systems for clustering in high-dimensional visualization. A:<br />
Interactive exploration of hierarchically clustered data along a dendrogram [124]. B: (a) Grouping<br />
icons to form clusters based on visual similarity. (b) User-defined grouping of icons [35].<br />
These approaches to visualization and clustering in high-dimensional data spaces all<br />
have in common that they are based on a given full (or reduced) dimensionality of the<br />
input data set. Thereby, they show only a single perspective of the usually multi-faceted<br />
high-dimensional data, which might not be the most relevant one. As we will show in this<br />
thesis, it is also useful to explore high-dimensional data for patterns in different subsets<br />
of its full high-dimensional input space to increase potential data insight.<br />
Visual Classification<br />
Classification uses a model that distinguishes data classes, and that is created based on a<br />
labeled training data set, to label new data. The classification model can be represented<br />
by decision trees. With purely automatic approaches, problems like over-fitting the model<br />
or tree pruning are difficult to tackle [86]. Using visualization can help to overcome
2.4.1 <strong>Visual</strong> Interactive Systems for <strong>High</strong>-<strong>Dimensional</strong> <strong>Data</strong> Analysis 25<br />
these problems, for example by incorporating the user in the tree construction process.<br />
Ankerst et al. present in [11] a user-centered approach that combines the domain knowledge<br />
of users with the computational strengths of the computer to create rules that satisfy the<br />
user’s constraints and generate visualizations of these patterns. Additionally, the pattern<br />
recognition abilities of the human, supported by adequate data visualizations, can be used to<br />
increase the effectiveness of decision trees. In Figure 2.5A the visual classification shows the decision<br />
tree, visualizing each attribute value by a colored pixel arranged in bars. Each attribute<br />
bar is sorted, and the purest value distribution is selected as split attribute of the decision<br />
tree. This procedure is repeated until all leaves contain pure classes. The split is marked<br />
with a black vertical line, and the leaves are underlined with a black line. Compared to<br />
standard visualizations of decision trees, additional information is encoded in a compact<br />
way, namely: size of the nodes (number of training records for the corresponding node),<br />
quality of the split (visible in the purity of the resulting partitions), and class distribution<br />
(frequency and location of the training instances of all classes).<br />
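The idea of sorting each attribute and choosing the purest distribution as split attribute can be<br />
illustrated with a minimal sketch. It is not the implementation of [11]: the use of Gini impurity as<br />
the purity criterion, the restriction to binary splits on numeric attributes, and the names gini,<br />
best_split and choose_split_attribute are assumptions made only for illustration.<br />

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the list is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(records, labels, attribute):
    """Sort the records by one attribute and return (impurity, threshold)
    of the purest binary split along that attribute."""
    order = sorted(range(len(records)), key=lambda i: records[i][attribute])
    best = (float("inf"), None)
    for pos in range(1, len(order)):
        lo = records[order[pos - 1]][attribute]
        hi = records[order[pos]][attribute]
        if lo == hi:
            continue  # no split point between two equal values
        left = [labels[i] for i in order[:pos]]
        right = [labels[i] for i in order[pos:]]
        # Size-weighted impurity of the two resulting partitions.
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(order)
        if impurity < best[0]:
            best = (impurity, (lo + hi) / 2)
    return best


def choose_split_attribute(records, labels):
    """Pick the attribute whose best split yields the purest partitions."""
    scored = [(best_split(records, labels, a), a) for a in range(len(records[0]))]
    (impurity, threshold), attribute = min(scored, key=lambda s: s[0][0])
    return attribute, threshold, impurity


# Hypothetical toy data: attribute 0 separates the classes, attribute 1 does not.
records = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0)]
labels = ["a", "a", "b", "b"]
attribute, threshold, impurity = choose_split_attribute(records, labels)
```

Applying choose_split_attribute recursively to the two partitions until every leaf is pure would<br />
correspond to the repeated procedure described above; in the interactive setting, the user can<br />
override the automatically proposed attribute or threshold at each step.<br />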
Figure 2.5: Interactive visual analysis systems for classification in high-dimensional data. A: Visual<br />
classification from [11] illustrates the decision tree for DNA training data having 19 attributes,<br />
visualizing each attribute value by a colored pixel arranged in bars. B: Decision tree construction<br />
system [142], representing the tree in a node-link diagram, displaying split points on the links and<br />
the split attributes on the node.<br />
Figure 2.5B shows a recent example from [142] of an interactive system for decision<br />
tree construction. Here the authors have the same goal, i.e., to bring the domain-specific<br />
knowledge of the user into the construction of the tree. A tight integration of visualization,<br />
interaction and automation supports domain experts in growing, pruning, optimizing and<br />
analyzing decision trees [142]. Compared to the previous example, here the tree representation<br />
is a more classic one since the tree is represented by node-link diagrams. Internal<br />
and leaf nodes are represented by node glyphs, and each parent-child relationship is represented<br />
by a link from parent to child node. The advantage of this visual representation<br />
is that it allows for easier counting of the number of leaves while at the same time<br />
showing which nodes are on the same level [142]. The main view displays split points on<br />
the links, using width to encode the number of items and color to encode the class membership<br />
of the items. The split attribute is shown on the nodes of the tree. These are visualized<br />
as rectangles containing relevant information like split attribute, class distribution, split<br />
points, and class histogram. Additional linked views support the user in constructing and<br />
optimizing the decision tree.<br />
Dimension Reordering<br />
As already discussed, dimension ordering is a relevant component of high-dimensional data<br />
visualization and exploration, as different orderings can expose different patterns. Ankerst<br />
et al. introduced the problem of dimensional ordering as an optimization problem in [9]<br />
and demonstrated that it is an NP-complete problem that must thus be solved through<br />
heuristics. Peng et al. [112] apply dimension reordering to a series of n-dimensional<br />
visualization techniques to reduce clutter. Matrix-based visualizations, starting from the<br />
seminal work of Bertin [22], have also been heavily researched in terms of the patterns<br />
they can expose through reordering. In Section 5.1 we use dimensional reordering and<br />
cluster reordering to make relationships among dimensions and clusters apparent in our<br />
ClustNails system.<br />
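Because the ordering problem has to be approached heuristically, a simple greedy strategy may serve<br />
as an illustration. The sketch below is not the heuristic of [9] or [112]; it assumes that a symmetric<br />
pairwise similarity matrix between dimensions is already available, and the names greedy_order and<br />
adjacent_similarity as well as the example values are hypothetical.<br />

```python
def greedy_order(similarity, start=0):
    """Nearest-neighbour heuristic: begin with one dimension and repeatedly
    append the not yet placed dimension most similar to the last placed one."""
    remaining = [d for d in range(len(similarity)) if d != start]
    order = [start]
    while remaining:
        last = order[-1]
        nxt = max(remaining, key=lambda d: similarity[last][d])
        order.append(nxt)
        remaining.remove(nxt)
    return order


def adjacent_similarity(order, similarity):
    """Sum of similarities between neighbouring dimensions in an ordering."""
    return sum(similarity[a][b] for a, b in zip(order, order[1:]))


# Hypothetical similarity matrix for four dimensions (symmetric, 1.0 on the diagonal).
similarity = [
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.4, 0.8],
    [0.9, 0.4, 1.0, 0.3],
    [0.1, 0.8, 0.3, 1.0],
]
order = greedy_order(similarity)
score = adjacent_similarity(order, similarity)
```

The greedy result depends on the start dimension and is in general not optimal; it only shows how<br />
a heuristic can avoid enumerating all possible orderings.<br />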
In [59] Guo also addresses ways to integrate visual and computational measures for<br />
picking and ordering variables for display on parallel coordinates. He describes a human-centered<br />
exploration environment, which incorporates a coordinated suite of computational<br />
and visualization methods to explore high-dimensional data and find patterns in<br />
these spaces. The main difference between this approach and our approach presented in<br />
Section 3.1.4 is that Guo searches for locally defined patterns in subspaces, while our work<br />
concentrates on finding global patterns in a 2-dimensional projection of the data set.<br />
To summarize, ordering plays an important role in different areas, such as ordering axes<br />
of parallel coordinates, ordering as a way to reduce clutter in scatterplot matrices, and ordering<br />
to support similarity search in glyph-based visualizations or pixel-based displays.<br />
2.4.2 Subspace Cluster Analysis and Visualization<br />
As traditional full-space clustering is often not effective for revealing a meaningful clustering<br />
structure for high-dimensional data (see Section 2.3.1), in the emerging research<br />
field of subspace clustering [90] several approaches aim at discovering meaningful clusters<br />
in locally relevant subspaces. The problem of finding clusters in high-dimensional data<br />
can be divided into two sub-problems: subspace search and cluster search. The first one<br />
aims at finding the subspaces where clusters exist, the second one at finding the actual<br />
clusters. The large majority of existing algorithms consider the two problems simultaneously<br />
and produce a set of clusters, where each cluster is typically represented by a set of<br />
clustered objects (rows of the original data table) and the subset of relevant dimensions<br />
(columns of the original data table). Several methods have been proposed that differ in<br />
the clustering search strategy and in the constraints with respect to the overlap of clusters and<br />
dimensions [38, 84, 107]. Kriegel et al. [90] categorize these algorithms into four classes:<br />
(1) projected clustering; (2) “soft” projected clustering; (3) subspace clustering; (4) hybrid.<br />
The first two generate clusters that do not overlap, that is, every object belongs to<br />
only one cluster. Subspace clustering and hybrid may generate clusters that do overlap.<br />
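The representation of a subspace cluster as a set of objects plus a set of relevant dimensions, and<br />
the notion of overlap, can be made concrete with a small sketch. The class SubspaceCluster and the<br />
helper functions are hypothetical names chosen for illustration and are not taken from any of the<br />
cited algorithms.<br />

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubspaceCluster:
    objects: frozenset      # row indices of the clustered objects
    dimensions: frozenset   # column indices of the relevant dimensions


def object_overlap(a, b):
    """Objects that belong to both clusters."""
    return a.objects & b.objects


def dimension_overlap(a, b):
    """Dimensions that are relevant for both clusters."""
    return a.dimensions & b.dimensions


# Two hypothetical clusters sharing objects 2 and 3 and dimension 2.
c1 = SubspaceCluster(frozenset({0, 1, 2, 3}), frozenset({0, 2}))
c2 = SubspaceCluster(frozenset({2, 3, 4}), frozenset({2, 5}))
shared_objects = object_overlap(c1, c2)
shared_dimensions = dimension_overlap(c1, c2)

# A projected clustering result would require an empty object overlap
# for every pair of clusters; this pair violates that constraint.
is_disjoint = not shared_objects
```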
While extensive research has been carried out in designing subspace clustering algorithms,<br />
surprisingly little attention has been paid to developing visualization support for<br />
subspace clustering. To our knowledge only a few subspace cluster visualization systems<br />
exist.<br />
Figure 2.6: (a) VISA system [14]. Left: MDS projection for the global view of clusters. Right:<br />
Matrix of subspace clusters for in-depth view. (b) Heidi Matrix [141] over a subspace.<br />
The VISA system [14] implements both a global view and an in-depth view (see<br />
Figure 2.6(a)) to help interpret the subspace clustering result. In the global view, the<br />
subspace clusters are projected onto a 2D display using a multidimensional scaling (MDS)<br />
projection. The aim is to show the similarity between clusters in terms of the number<br />
of records and dimensions in each cluster. Each cluster is represented as a colored circle<br />
where color represents the number of dimensions and the size represents the number of<br />
instances. The in-depth view shows the detailed characteristics of the clustering result<br />
including data items in each cluster and their values using a matrix representation. It<br />
uses different color codes to visualize all characteristics of an object: black for unselected<br />
dimensions, brightness for areas of interest, and hue for value. The MDS projection in<br />
VISA provides a good overview of the clustering results. However, using circles of different<br />
sizes in the MDS projection in VISA can be problematic; the distance between two clusters<br />
can be obscured by the radius of the circles, and the overlap between clusters often causes<br />
a cluttered display. The in-depth view shows detailed characteristics of the clustering<br />
result, but as shown in Figure 2.6(a), both hue and brightness are relatively weak at<br />
showing differences/variations between numbers and values in unselected dimensions.<br />
Heidi Matrix [141] uses a complex arrangement of subspaces in a matrix representation.<br />
This matrix is based on the computation of the k-Nearest Neighbors (kNN) in<br />
each subspace (see Figure 2.6(b)). Rows and columns represent the data items, and each<br />
entry (i, j) in the matrix represents the number of subspaces in which i and j are neighbors.<br />
A categorical coloring scheme is used to color the cells according to the particular<br />
combination of subspaces in which two data items are neighbors. In addition, rows and<br />
columns are ordered according to the output generated by a clustering algorithm. The<br />
biggest advantage of Heidi Matrix is that it displays the full information of the data and<br />
the subspace clustering result. However, the rather abstract visual mapping scheme makes<br />
interpretation of the results difficult, and to the best of our knowledge its effectiveness has<br />
not been evaluated yet. The scalability of the visualization is another critical issue because<br />
it requires n × n display space, where n is the number of data items.<br />
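The counting that underlies such a matrix can be sketched as follows. This is only an illustration of<br />
the description above and not the implementation of [141]: the squared Euclidean distance, the<br />
reading of "neighbors" as "j is among the k nearest neighbors of i" (which makes the matrix<br />
asymmetric in general), and the names knn and neighbour_count_matrix are assumptions.<br />

```python
def knn(data, dims, i, k):
    """Indices of the k nearest neighbours of item i, using only the
    dimensions in dims (squared Euclidean distance)."""
    def dist(j):
        return sum((data[i][d] - data[j][d]) ** 2 for d in dims)

    others = [j for j in range(len(data)) if j != i]
    return set(sorted(others, key=dist)[:k])


def neighbour_count_matrix(data, subspaces, k):
    """Entry (i, j) counts the subspaces in which j is a kNN of i."""
    n = len(data)
    counts = [[0] * n for _ in range(n)]  # n x n cells: the scalability issue
    for dims in subspaces:
        for i in range(n):
            for j in knn(data, dims, i, k):
                counts[i][j] += 1
    return counts


# Hypothetical data set with four items and two dimensions.
data = [(0.0, 0.0), (1.0, 10.0), (10.0, 2.0), (12.0, 13.0)]
subspaces = [(0,), (1,), (0, 1)]
counts = neighbour_count_matrix(data, subspaces, k=1)
```

The categorical coloring described above would additionally need to record which subspaces<br />
contribute to each cell, not only how many.<br />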
Figure 2.7: Visualization techniques applied in Ferdosi’s work [52]. Left: 1D subspace. Middle:<br />
2D subspace. Right: Subspace with 3 or more dimensions.<br />
Ferdosi et al. [52] proposed an algorithm for finding interesting subspaces in astronomical<br />
data as well as a visual system for displaying the results. The algorithm identifies<br />
candidate subspaces from data and ranks those by a quality metric based on density estimation<br />
and morphological operators. The resulting subspaces are visualized in different<br />
forms: line graphs for 1-dimensional subspaces, 2D scatterplots for 2-dimensional subspaces,<br />
and principal component analysis (PCA) projections for subspaces with higher<br />
dimensionalities (see Figure 2.7). Ferdosi’s work provides some interesting insight into<br />
subsets of dimensions in astronomical data with a high density of data objects. However,<br />
the algorithm does not assign objects to subspaces. Hence, the subspace clustering information<br />
is partially missing from both the data mining and the visualization compared to<br />
VISA and Heidi Matrix, meaning there is no direct way of comparing subspaces.<br />
In all of the above-mentioned visualization systems, the visualization of overlapping<br />
dimensions and overlapping clusters is lacking. It is difficult to see and compare such<br />
overlapping information in the visual representations. In Section 5.1 we propose a visual<br />
tool to investigate subspace clustering results that also represents dimension and object<br />
overlap among clusters.<br />
We note that if we apply one of these subspace clustering visualizations, we immediately<br />
inherit two main challenges of this paradigm that are still considered open research issues,<br />
namely: the efficiency challenge (relating to subspace cluster search) and the redundancy<br />
challenge (relating to the typical redundancy of the outputs generated). In Section 5.2 the<br />
redundancy problem is addressed by our proposed analytical workflow.<br />
3 Quality Measures based Visual Analysis of High-Dimensional Data<br />
„Measure what is measurable, and make measurable what is not so.”<br />
Galileo Galilei<br />
Contents<br />
3.1 Quality Measures for Scatterplots and Parallel Coordinates<br />
3.1.1 Overview and Problem Description<br />
3.1.2 Quality Measures for Scatterplots with Unclassified Data<br />
3.1.3 Quality Measures for Scatterplots with Classified Data<br />
3.1.4 Quality Measures for Parallel Coordinates with Unclassified Data<br />
3.1.5 Quality Measures for Parallel Coordinates with Classified Data<br />
3.1.6 Application on Real Data Sets<br />
3.1.7 Evaluation of the Measures’ Performance Using Synthetic Data<br />
3.1.8 Conclusion and Future Work<br />
3.2 Quality Measures and Human Perception – An Empirical Study<br />
3.2.1 Measures<br />
3.2.2 Empirical Evaluation<br />
3.2.3 Results<br />
3.2.4 Discussion<br />
3.2.5 Guidelines<br />
3.2.6 Conclusion and Future Work<br />
Visual exploration of multivariate data typically requires projection onto lower-dimensional<br />
representations. The number of possible representations grows rapidly with<br />
the number of dimensions, and manual exploration quickly becomes ineffective or even<br />
infeasible. In this chapter, we propose automatic analysis methods to extract potentially<br />
relevant visual structures from a set of candidate visualizations. Based on these features,<br />
the visualizations are ranked in accordance with a specified user task. The user is provided<br />
with a manageable number of potentially useful candidate visualizations that can be used<br />
as a starting point for interactive data analysis. This can effectively ease the task of finding<br />
truly useful visualizations and potentially speed up the data exploration task. Therefore,<br />
in Section 3.1, we present quality measures for class-based as well as non-class-based<br />
scatterplots and parallel coordinates visualizations. The proposed analysis methods are<br />
evaluated on real and synthetic data sets, and the results are presented in Sections 3.1.6<br />
and 3.1.7. Section 3.2 presents an empirical study that compares the measures’ rankings with<br />
user perception. The study helped us to derive further factors that we must take into<br />
account when designing new measures that have to fit the users’ perception.<br />
Parts of this chapter appeared in the following publications [132, 133, 134] (see footnote 1).<br />
3.1 Quality Measures for Scatterplots and Parallel Coordinates<br />
In this section, we present an automated approach that supports the user in the exploration<br />
process of high-dimensional data. The basic idea is to generate different projections from<br />
the high-dimensional data set and to automatically identify potentially relevant visual or<br />
data structures from this set of possible candidates. These structures are used to determine<br />
the relevance of each projection to common predefined analysis tasks. The user may then<br />
use the projection with the highest relevance as the starting point of the visual interactive<br />
analysis. We present relevance measures for typical analysis tasks based on scatterplots<br />
and parallel coordinates. The experiments on class-labeled and non-class-labeled data<br />
sets demonstrate the potential of our quality measures to find interesting projections and<br />
visualizations and thus speed up the exploration process.<br />
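The basic idea of generating candidate projections and ranking them by relevance can be outlined<br />
in a few lines. The sketch below is deliberately generic: the absolute Pearson correlation is used<br />
only as a stand-in measure and is not one of the quality measures proposed in this chapter, and<br />
the names rank_projections and abs_correlation as well as the data are hypothetical.<br />

```python
from itertools import combinations
from math import sqrt


def abs_correlation(xs, ys):
    """Placeholder quality measure: absolute Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return abs(cov / (sx * sy)) if sx and sy else 0.0


def rank_projections(data, measure, top_k=3):
    """Score every axis-parallel 2D projection with the given measure and
    return the top_k (score, (i, j)) entries, best first."""
    n_dims = len(data[0])
    scored = []
    for i, j in combinations(range(n_dims), 2):
        xs = [row[i] for row in data]
        ys = [row[j] for row in data]
        scored.append((measure(xs, ys), (i, j)))
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]


# Hypothetical data set: dimensions 0 and 1 are almost linearly related.
data = [
    (1.0, 2.0, 5.0),
    (2.0, 4.1, 1.0),
    (3.0, 5.9, 4.0),
    (4.0, 8.0, 2.0),
]
ranked = rank_projections(data, abs_correlation)
```

Passing the measure as a function reflects that different analysis tasks call for different measures;<br />
only the scoring function changes, while the generate-score-rank loop stays the same.<br />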
3.1.1 Overview and Problem Description<br />
Increasing dimensionality and growing volumes of data lead to the necessity of effective exploration<br />
techniques to present the hidden information and structures of high-dimensional<br />
data sets. To support visual exploration, the high-dimensional data is commonly mapped<br />
to low-dimensional views, also called projections. Depending on the technique, exponentially<br />
many different low-dimensional views exist that cannot be analyzed manually.<br />
As already presented in Section 2.2.1, scatterplots and parallel coordinates plots are<br />
commonly used visualization techniques to deal with multivariate data sets. These low-dimensional<br />
embeddings of the high-dimensional data in a 2D view can be interpreted<br />
easily by the users. We have also seen that these techniques entail different challenges for<br />
high-dimensional data sets. For scatterplots, the high number of possible 2D projections<br />
for a high-dimensional data set is challenging. Since there are $\frac{n^2 - n}{2}$ different plots for an n-dimensional<br />
data set in a scatterplot matrix, an automatic analysis technique to preselect<br />
the important projections is useful and necessary.<br />
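The number of distinct scatterplots can be checked by enumerating all unordered pairs of<br />
dimensions. The function name scatterplot_pairs is a hypothetical helper used only for this check.<br />

```python
from itertools import combinations


def scatterplot_pairs(n):
    """All unordered pairs of distinct dimensions: one scatterplot per pair."""
    return list(combinations(range(n), 2))


n = 10
pairs = scatterplot_pairs(n)
n_plots = len(pairs)
# The enumeration agrees with the closed form (n^2 - n) / 2.
assert n_plots == (n * n - n) // 2
```

For 10 dimensions this already yields 45 plots, and the count grows quadratically with n.<br />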
For parallel coordinates one problem is the large number of possible arrangements of<br />
the dimension axes. It has been shown in [30] that for an n-dimensional data set $\frac{n+1}{2}$ permutations<br />
are needed to visualize all relations between dimensions, but there are $n!$ possible<br />
arrangements. An automated analysis of the visualizations can help find the best ordering<br />
out of all possible arrangements. We attempt to analyze the pairwise combinations of<br />
dimensions that are later assembled to find the best visualizations by reducing the visual<br />
analysis to $n^2$ visualizations. We propose ranking functions to judge the quality of a visual<br />
embedding. These ranking functions are called quality measures and automatically select<br />
the best visual representation with respect to a given task.<br />
Footnote 1: Please note that parts of the publications used here are slightly changed to adapt to the dissertation’s<br />
terminology. For readability, and because I was an author in a leading role for these publications, I decided<br />
not to quote these excerpts.<br />
The intense collaboration for [133] with G. Albuquerque and M. Eisemann from Braunschweig brought up<br />
new image quality measures that they implemented and described for our joint publication. I participated<br />
in some of the discussions. Together we ran experiments for the application section on real data sets and<br />
described them in the paper. The evaluation part on the synthetic data was completely designed by<br />
myself. I decided to include the full description of the metrics in my thesis for a better understanding<br />
of the experiments and the discussions about the outcome. Major parts of Section 3.1.2, Section 3.1.3,<br />
Section 3.1.4 and Section 3.1.5 are therefore credited to the aforementioned authors.<br />
[Figure: labels HD Data; Visual Mapping & Projection; Set of Visualizations; Quality Measures; Ranked Visualizations; panel "2D projections in scatterplots" (axes: dim 4, dim 22)]<br />
●<br />
●<br />
● ●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ● ●<br />
● ● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ● ●<br />
●<br />
●<br />
● ● ● ●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ● ●<br />
● ● ● ●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
● ● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
100 200 300 400 500 600<br />
0 200 400 600 800 1000<br />
dim 5<br />
dim 7<br />
[Figure: scatter plot, dim 5 vs. dim 22; axis tick ranges 100 to 600 and 0 to 800]<br />
[Figure residue: scatterplot of dim 6 (axis ticks 0 to 8000) against dim 22 (axis ticks 0 to 800)]<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
0 2000 4000 6000<br />
−300 −200 −100 0 100 200<br />
Comp.1<br />
Comp.2<br />
● ●<br />
● ●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ● ●●<br />
●●●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●●<br />
●<br />
● ●<br />
● ●●<br />
●<br />
● ●●<br />
● ●<br />
● ●●<br />
●<br />
● ●●<br />
●<br />
●<br />
● ●●<br />
●<br />
● ●●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ● ●<br />
●<br />
● ●<br />
●<br />
● ●●●<br />
●<br />
● ●●●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
● ●<br />
● ●<br />
● ● ● ●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●●<br />
●<br />
●<br />
●<br />
●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ● ● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
● ●<br />
● ●<br />
●<br />
● ●●<br />
●●<br />
●<br />
●<br />
●<br />
●●<br />
●<br />
●<br />
●<br />
● ●●<br />
● ●● ●<br />
● ●<br />
●<br />
●●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ● ●<br />
● ●<br />
● ● ● ●<br />
●<br />
●<br />
●<br />
●<br />
● ● ● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ● ● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
● ●<br />
● ● ●<br />
● ●<br />
●<br />
● ● ●<br />
● ●<br />
●<br />
●●●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
● ●<br />
● ●<br />
● ●<br />
● ●<br />
● ●<br />
● ●<br />
● ●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
● ●<br />
● ●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●●●●●●●●●<br />
● ●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
● ●<br />
● ●<br />
●<br />
● ●<br />
● ●●●●<br />
● ●<br />
● ● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ● ● ● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●●<br />
●<br />
● ● ●●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●●●●<br />
● ●<br />
●<br />
● ●●● ● ●<br />
●<br />
● ●●<br />
●<br />
● ●●<br />
● ●<br />
●<br />
●<br />
● ●●<br />
● ●<br />
●<br />
● ●<br />
● ●<br />
● ● ●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
● ●<br />
●<br />
●<br />
● ● ● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ● ●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
● ● ●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ● ●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
● ●<br />
●<br />
● ●<br />
● ●<br />
●<br />
● ●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●● ●<br />
●<br />
●<br />
●● ●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
● ●●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●●●<br />
●<br />
● ●<br />
●●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●●<br />
●<br />
●●<br />
● ●<br />
●●●<br />
●<br />
●●<br />
●<br />
●<br />
●●●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
0 2000 4000 6000<br />
−200 0 200 400 600<br />
Comp.1<br />
Comp.2<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ● ●<br />
● ●<br />
● ● ● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ● ●<br />
● ●<br />
● ● ● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ● ●<br />
●<br />
●<br />
● ● ● ●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
● ● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
● ●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●● ●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ● ● ●<br />
● ● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
[Figure residue: scatterplot panel labeled "Task" with axes "dim 4" and "dim 22"; the plotted points are not recoverable]
32 Chapter 3. Quality Measures based <strong>Visual</strong> Analysis <strong>of</strong> <strong>High</strong>-<strong>Dimensional</strong> <strong>Data</strong><br />
An overview of our techniques is shown in Table 3.1. For scatterplots with unclassified<br />
data, we developed the Rotating Variance Measure, which favors xy-plots with a high<br />
correlation between the two dimensions. For classified data, we propose measures that<br />
consider the class information while computing the ranking value of the images. We<br />
developed four methods: a Class Density Measure, a Class Separating Measure, a<br />
1D-Histogram Density Measure, and a 2D-Histogram Density Measure. Their goal is<br />
to find the scatterplots that show the classes best separated. For parallel coordinates with<br />
unclassified data, we propose a Hough Space Measure that searches for interesting patterns<br />
such as clustered lines in the views. For classified data, we propose two measures: the<br />
Overlap Measure, which focuses on finding views with as little overlap as possible between<br />
the classes, so that the classes separate well, and the Similarity Measure, which looks for<br />
correlations between the lines. All the measures, except the 1D- and 2D-Histogram Density<br />
Measures, are computed directly over the visualization images and do not consider possible<br />
intra- and interclass overplotting of points.<br />
As example analysis tasks for unclassified data sets, we choose correlation search in<br />
scatterplots (Section 3.1.2) and cluster search (i.e. similar lines) in parallel coordinates<br />
(Section 3.1.4). If class information is given, the tasks are to find views where distinct<br />
clusters in the data set are also well separated in the visualization (Section 3.1.3) or show<br />
a high level of inter- and intraclass similarity (Section 3.1.5).<br />
3.1.2 Quality Measures for Scatterplots with Unclassified Data<br />
Our scatterplot measures aim to assess the distribution of the data regarding correlation<br />
and density of points and the separateness of classes. In this section, we therefore propose<br />
analysis functions to compute the correlation of points in scatterplots with unclassified<br />
data. Additionally, new methods for measuring the density of the classes and for assessing<br />
the separateness of classes in scatterplots with classified data are proposed in<br />
Section 3.1.3. In the case of unclassified but well separable data, class labels can be<br />
automatically assigned using clustering algorithms.<br />
Rotating Variance Measure 2<br />
High correlations are represented as long, skinny structures in the scatterplot visualization.<br />
Because of outliers, even almost perfect correlations can lead to skewed distributions in the<br />
plot, and attention needs to be paid to this fact. The Rotating Variance Measure (RVM)<br />
is aimed at finding linear and nonlinear correlations between the pairwise dimensions of a<br />
given data set.<br />
To compute the measure over the image representation, we first transform the discrete<br />
scatterplot visualization into a continuous density field. For each screen pixel $s$ at<br />
position $\mathbf{x} = (x, y)$, the distances to its $k$ nearest sample points $N_s$ in the visualization<br />
are computed. To obtain an estimate of the local density $\rho$ at a pixel $s$, we define $\rho = 1/r$,<br />
where $r$ is the radius of the enclosing sphere of the $k$ nearest neighbors of $s$, given by<br />
$$ r = \max_{i \in N_s} \lVert \mathbf{x} - \mathbf{x}_i \rVert. $$ (3.1)<br />
2 Implemented and described by our partners from Braunschweig: G. Albuquerque and M. Eisemann<br />
for the collaborative publication [133]. Adapted and slightly changed for the thesis by myself.
Figure 3.2 (panels (a) and (b)): Scatterplot example and its respective density image. For each pixel we compute the<br />
mass distribution along different directions and save the smallest value, here depicted by the blue<br />
line.<br />
Choosing the k-th neighbor instead of the nearest eliminates the influence of outliers. $k$<br />
is chosen to be between 2 and $n - 1$, so that the minimum value of $r$ is mapped to 1. We<br />
used $k = 4$ throughout the application in Section 3.1.6. Other density estimations could of<br />
course be used as well.<br />
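The density estimate of Equation (3.1) can be sketched in a few lines of Python. This is a minimal, unoptimized illustration and not the implementation used for the thesis: the function name `knn_density_field`, the use of integer pixel coordinates, and the clamping of r to at least one pixel (one possible reading of "the minimum value of r is mapped to 1") are assumptions made for this example.

```python
import math


def knn_density_field(points, width, height, k=4):
    """Return a height x width grid of densities rho = 1/r.

    r is the distance from the pixel to its k-th nearest sample point,
    i.e. the radius of the sphere enclosing the k nearest neighbors.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    field = []
    for y in range(height):
        row = []
        for x in range(width):
            # Distances from this pixel to every sample point, ascending.
            dists = sorted(math.hypot(x - px, y - py) for px, py in points)
            r = dists[k - 1]
            # Assumption: clamp r to one pixel so that rho stays finite
            # when the pixel coincides with k or more samples.
            row.append(1.0 / max(r, 1.0))
        field.append(row)
    return field


# Two sample points on a 5 x 1 image, using the 2nd nearest neighbor.
example_field = knn_density_field([(0, 0), (4, 0)], width=5, height=1, k=2)
```

The brute-force sort per pixel keeps the sketch short; a spatial index would be the usual choice for realistic image sizes.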
Visualizations containing high correlations should generally have corresponding density<br />
fields with a narrow band of larger values, while views with lower correlation should have<br />
a density field consisting of many local maxima spread over the image. We can estimate<br />
this amount of spread for every pixel by computing the normalized mass distribution:<br />
we take $s$ samples along different lines $l_\theta$, each centered at the corresponding pixel position $\mathbf{x}_{l_\theta}$<br />
and with length equal to the image width (see Figure 3.2). For these sampled lines, we<br />
compute the weighted distribution for each pixel position $\mathbf{x}_i$:<br />
$$ \nu_i^{\theta} = \frac{\sum_{j=1}^{s} p_{l_\theta}^{s_j} \, \lVert \mathbf{x}_i - \mathbf{x}_{s_j} \rVert}{\sum_{j=1}^{s} p_{l_\theta}^{s_j}} $$ (3.2)<br />
$$ \nu_i = \min_{\theta \in [0, 2\pi]} \nu_i^{\theta} $$ (3.3)<br />
where $p_{l_\theta}^{s_j}$ is the $j$-th sample along line $l_\theta$ and $\mathbf{x}_{s_j}$ is its corresponding position in the image.<br />
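To make the mass distribution concrete, the following sketch evaluates it for a single pixel of a density image. It rests on several assumptions that are not taken from the text: the name `mass_distribution`, the discretization of the angle into `n_angles` directions over [0, pi) (a line through the pixel at angle theta coincides with the one at theta + pi), nearest-pixel lookup of the density, and ignoring samples that fall outside the image.

```python
import math


def mass_distribution(field, cx, cy, n_angles=16, n_samples=32):
    """Smallest density-weighted mean distance over all sampled lines.

    field is a list of rows of density values; (cx, cy) is the pixel
    the lines are centered at. Each line has length equal to the
    image width.
    """
    height, width = len(field), len(field[0])
    best = math.inf
    for a in range(n_angles):
        theta = math.pi * a / n_angles
        dx, dy = math.cos(theta), math.sin(theta)
        weighted, total = 0.0, 0.0
        for j in range(n_samples):
            # Signed offset along the line, from -width/2 to +width/2.
            t = (j / (n_samples - 1) - 0.5) * width
            x = int(round(cx + t * dx))
            y = int(round(cy + t * dy))
            if 0 <= x < width and 0 <= y < height:
                p = field[y][x]
                weighted += p * abs(t)  # density times distance to center
                total += p
        if total > 0:
            best = min(best, weighted / total)
    return best


# A horizontal ridge of high density through the middle of a 9 x 9 image.
ridge = [[1.0 if y == 4 else 0.0 for x in range(9)] for y in range(9)]
nu_center = mass_distribution(ridge, 4, 4)
```

For a pixel on the ridge, the line orthogonal to the ridge collects density only close to the pixel itself, which yields a small value; in a uniform density image every direction gives the same, larger value.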
For pixels positioned at a maximum of a density image that conveys a real correlation, the<br />
distribution value will be very small compared to other positions in the image if the line is<br />
orthogonal to the local main direction of the correlation at the current position.<br />
Note that such a line can be found even in nonlinear correlations. On the other hand,<br />
pixels in density images conveying low correlation will always have only large $\nu$ values.<br />
For each column in the image, we compute the minimum value and sum up the results.<br />
The final RVM value is therefore defined as:<br />
$$ \mathrm{RVM} = \frac{1}{\sum_{x} \min_{y} \nu(x, y)}, $$ (3.4)<br />
where $\nu(x, y)$ is the mass distribution value at pixel position $(x, y)$.
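The aggregation in Equation (3.4) is a column-wise minimum followed by a sum and a reciprocal. The sketch below assumes that the mass distribution values are already available as a list of rows; the function name `rotating_variance_measure` and the handling of a zero sum are choices made for this illustration only.

```python
import math


def rotating_variance_measure(nu):
    """RVM = 1 / sum over columns x of the minimum over rows y of nu(x, y).

    nu is a list of rows, so nu[y][x] is the value at pixel (x, y).
    """
    width = len(nu[0])
    column_minima = [min(row[x] for row in nu) for x in range(width)]
    total = sum(column_minima)
    # Assumption: treat a zero sum as an unbounded score instead of failing.
    return math.inf if total == 0 else 1.0 / total


# Column minima are 1.0 and 4.0, so the sum in the denominator is 5.0.
example_rvm = rotating_variance_measure([[2.0, 4.0], [1.0, 8.0]])
```

Smaller mass distribution values along the columns lead to a larger RVM, so views with a narrow, elongated density band rank higher.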
3.1.3 Quality Measures for Scatterplots with Classified Data<br />
Most of the known techniques calculate the quality of a projection without taking the class<br />
distribution into account. In classified data plots we can search for the class distribution in<br />
the projection, where good views should show good class separation, i.e. minimal overlap<br />
of classes.<br />
In this section, we propose three approaches to rank the scatterplots of multivariate<br />
classified data sets, in order to determine the best views of the high-dimensional structures.<br />
Class Density Measure 3<br />
The Class Density Measure (CDM) evaluates orthogonal projections, i.e. scatterplots,<br />
according to their separation properties. To this end, CDM computes a score for each<br />
candidate plot that reflects the separation properties of the classes while also considering the<br />
density of each class. The candidate plots are then ranked according to their score, so<br />
that the user can start investigating highly ranked plots in the exploration process.<br />
In case we are given only the visualization without the data, we assume that every<br />
color used <strong>in</strong> the visualization represents one class. We therefore separate the classes<br />
first <strong>in</strong>to dist<strong>in</strong>ct images, so that each image conta<strong>in</strong>s only the <strong>in</strong>formation <strong>of</strong> one <strong>of</strong> the<br />
classes. Please note that the overplott<strong>in</strong>g <strong>of</strong> classes <strong>in</strong>fluences the computation <strong>of</strong> the<br />
measure. If the data is available, this is no longer a problem s<strong>in</strong>ce all the classes can be<br />
plotted separately <strong>in</strong> one image. S<strong>in</strong>ce a cont<strong>in</strong>uous representation for each class-image is<br />
necessary to compute the overlap between the classes, we estimate a cont<strong>in</strong>uous, smooth<br />
density function based on local neighborhoods. For each screen pixel s, the distance to its<br />
k nearest neighbors N_s of the same class is computed and the local density is derived<br />
as described earlier <strong>in</strong> this section.<br />
Having these continuous density functions available for each class, we estimate the<br />
mutual overlap by summing the absolute pixel differences over each pair of density images:<br />
\[ \mathrm{CDM} = \sum_{k=1}^{M-1} \sum_{l=k+1}^{M} \sum_{i=1}^{P} \left| p_k^i - p_l^i \right|, \tag{3.5} \]<br />
with M being the number of density images, i.e. classes, p_k^i the i-th pixel<br />
value in the density image computed for class k, and P the number of pixels. If<br />
the pixel values are normalized to [0, 1], the CDM ranges between<br />
0 and P for two classes (M = 2). This value is large if the densities at each pixel<br />
differ as much as possible, i.e. if one class has a high density value compared to all others.<br />
Consequently, the visualization with the least overlap between the classes is given the<br />
highest value. In addition, the measure favors not only well separated<br />
but also dense clusters, which ease the interpretability of the data in the visualization. Note<br />
that non-overlapping classes in scatterplots produce different density images using our<br />
algorithm. If the clusters are similar, the density images are different, which results in a<br />
high value for the CDM measure.<br />
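The triple sum of Equation 3.5 can be sketched as below. This is only an illustration: the function name `pairwise_abs_difference` and the representation of each density image as a flat list of P pixel values are assumptions, and the density estimation step is taken as given.

```python
def pairwise_abs_difference(images):
    """Sum of absolute pixel differences over all unordered pairs of images (Eq. 3.5).

    images[k][i] is the i-th pixel value of the image computed for class k.
    """
    total = 0.0
    m = len(images)
    for k in range(m - 1):
        for l in range(k + 1, m):
            # Inner sum over the P pixels of the image pair (k, l).
            total += sum(abs(a - b) for a, b in zip(images[k], images[l]))
    return total
```

For two classes with pixel values normalized to [0, 1], fully disjoint images give the upper bound P, and identical images give 0, which matches the range stated above.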
3 Implemented and described by our partners from Braunschweig, G. Albuquerque and M. Eisemann,<br />
for the collaborative publication [133]. Adapted and slightly changed for the thesis by myself.
Class Separat<strong>in</strong>g Measure 4<br />
The CDM introduced before finds views with little overlap between classes and dense clusters<br />
in high-dimensional data sets. The CDM measure is computed over density images<br />
with a rapid falloff function. The local density ρ was defined in Section 3.1.2 as ρ = 1/r.<br />
By changing this function, we are able to control the balance between the property of<br />
separation and dense clustering. Choosing a function whose value increases with r can<br />
yield better separated clusters but with a lower clustering property.<br />
In our experiments, we found that using ρ = r instead of ρ = 1/r provides a good<br />
trade-off between class separability and clustering. As an extension of the CDM measure, we<br />
therefore propose the Class Separating Measure (CSM). The main difference between these<br />
two measures is in the computation of the continuous representation of the scatterplot,<br />
henceforth termed distance field for the CSM (with ρ = r), and density image for the<br />
CDM (with ρ = 1/r).<br />
To compute a distance field, the local distance at a screen pixel s is def<strong>in</strong>ed as r, where<br />
r is the radius <strong>of</strong> the enclos<strong>in</strong>g sphere <strong>of</strong> the k-nearest neighbors <strong>of</strong> s, as described earlier<br />
<strong>in</strong> Section 3.1.2. Once we have the distance field <strong>of</strong> each class, the CSM is computed as<br />
the sum of the absolute differences between them (note that for the CDM measure the<br />
inverse of the distance was used):<br />
\[ \mathrm{CSM} = \sum_{k=1}^{M-1} \sum_{l=k+1}^{M} \sum_{i=1}^{P} \left| p_k^i - p_l^i \right|, \tag{3.6} \]<br />
with M being the number of distance field images, i.e. classes, p_k^i the i-th<br />
pixel value in the distance field computed for class k, and P the number of pixels.<br />
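A brute-force sketch of the distance field and of Equation 3.6 is given below. It assumes that the enclosing sphere is centered at the pixel s, so that r is the distance from s to its k-th nearest point of the class, and that every class has at least k points. The names `distance_field` and `csm` are hypothetical, and a real implementation would likely use a spatial index instead of sorting all distances per pixel.

```python
import math

def distance_field(points, width, height, k):
    """Per-pixel radius r of the circle around the pixel enclosing its k nearest points."""
    field = []
    for y in range(height):
        for x in range(width):
            distances = sorted(math.dist((x, y), p) for p in points)
            field.append(distances[k - 1])  # distance to the k-th nearest point
    return field

def csm(class_points, width, height, k):
    """Class Separating Measure (Eq. 3.6) over the distance fields of all classes."""
    fields = [distance_field(pts, width, height, k) for pts in class_points]
    total = 0.0
    for a in range(len(fields) - 1):
        for b in range(a + 1, len(fields)):
            total += sum(abs(u - v) for u, v in zip(fields[a], fields[b]))
    return total
```

Replacing the stored value r by 1/r (with a guard against r = 0) would give the density image used by the CDM instead of the distance field.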
Compar<strong>in</strong>g the CSM and the CDM, the Class Separat<strong>in</strong>g Measure has a bias towards<br />
large distances between clusters while the Class Density Measure has a bias towards dense<br />
clusters. We consider separation and density of the clusters as two different user tasks.<br />
Views with well separated clusters are not necessarily the ones with dense clusters.<br />
When a view presents both properties simultaneously, it is assigned a high<br />
value by both measures and therefore receives a similar rank from each. The user can<br />
choose a measure according to the task, or even combine both measures,<br />
to find projections supporting both tasks. A comparison between the Class Separating<br />
and Class Density measures with a real example is presented <strong>in</strong> Section 3.1.6.<br />
Histogram Density Measures 5<br />
The Histogram Density Measures (1D and 2D-HDM) are density measures for scatterplots<br />
that extend the previously presented approaches by <strong>in</strong>clud<strong>in</strong>g non-orthogonal views <strong>in</strong><br />
the ranked result lists. They consider the class distribution <strong>of</strong> the data po<strong>in</strong>ts us<strong>in</strong>g<br />
histograms. S<strong>in</strong>ce we are <strong>in</strong>terested <strong>in</strong> plots that show good class separations, HDM looks<br />
for correspond<strong>in</strong>g histograms that show significant separation properties given by pure<br />
histogram b<strong>in</strong>s. To determ<strong>in</strong>e the best low-dimensional embedd<strong>in</strong>g <strong>of</strong> the high-dimensional<br />
data us<strong>in</strong>g HDM, a two step computation is conducted.<br />
4 Implemented and described by our partners from Braunschweig, G. Albuquerque and M. Eisemann,<br />
for the collaborative publication [132]. Adapted and slightly changed for the thesis by myself.<br />
5 Implemented and described by myself.
First, all 2D scatterplots of the data set are ranked with the 1D-HDM, which searches<br />
the 1D linear projections for the dimensions that separate the classes best.<br />
Each projection is ranked by the entropy of its 1D distribution,<br />
which is divided into small equidistant parts, called histogram bins. p_c is the number of points of<br />
class c in one bin. The entropy, i.e. the average information content of that bin, is calculated as:<br />
\[ H(p) = -\sum_{c} \frac{p_c}{\sum_{c} p_c} \log_2 \frac{p_c}{\sum_{c} p_c}. \tag{3.7} \]<br />
H(p) is 0 if a bin contains only points of one class, and log_2 M if it contains equal numbers of points<br />
of all M classes. Each projection is ranked with the 1D-HDM:<br />
\[ \mathrm{HDM}_{1D} = 100 - \frac{1}{Z} \sum_{x} \Big( \sum_{c} p_c \, H(p) \Big) \tag{3.8} \]<br />
\[ \phantom{\mathrm{HDM}_{1D}} = 100 - \frac{1}{Z} \sum_{x} \sum_{c} p_c \Big( -\sum_{c} \frac{p_c}{\sum_{c} p_c} \log_2 \frac{p_c}{\sum_{c} p_c} \Big), \tag{3.9} \]<br />
where 1/Z is a normalization factor that yields ranking values between 0 and 100, with<br />
100 as the best value:<br />
\[ \frac{1}{Z} = \frac{100}{\log_2 M \, \sum_{x} \sum_{c} p_c}. \tag{3.10} \]<br />
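Equations 3.7 to 3.10 can be sketched as follows, assuming that the histogram is already available as per-bin class counts and that M is at least 2 (otherwise log_2 M is zero). The names `bin_entropy` and `hdm` and the data layout `bins[x][c]` are illustrative assumptions.

```python
import math

def bin_entropy(counts):
    """Entropy H(p) of one histogram bin (Eq. 3.7); counts[c] = points of class c."""
    total = sum(counts)
    h = 0.0
    for p_c in counts:
        if p_c > 0:  # empty classes contribute nothing (0 * log 0 is taken as 0)
            share = p_c / total
            h -= share * math.log2(share)
    return h

def hdm(bins, num_classes):
    """Histogram Density Measure (Eqs. 3.8-3.10); 100 is best, 0 is worst."""
    n_points = sum(sum(counts) for counts in bins)
    # Each bin's entropy is weighted by the number of points in that bin.
    weighted = sum(sum(counts) * bin_entropy(counts) for counts in bins)
    inv_z = 100.0 / (math.log2(num_classes) * n_points)  # Eq. 3.10
    return 100.0 - inv_z * weighted
```

With two classes, bins that each contain points of a single class give the value 100, and bins that contain both classes in equal numbers give 0.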
Figure 3.3: 2D view and rotated projection axes. The projection on the rotated plane has less<br />
overlap, and the structures <strong>of</strong> the data can be seen even <strong>in</strong> the projection. This is not possible for<br />
a projection on the orig<strong>in</strong>al axes.<br />
In some data sets, paraxial projections are not able to show the structure of high-dimensional<br />
data. In these cases, a simple rotation of the projection axes can improve the<br />
quality of the measure. Figure 3.3 shows an example where a rotation improves<br />
the projection quality. While the paraxial projection of these classes cannot show these<br />
structures on the axes, the rotated (dotted) projection axes yield less overlap for a projection<br />
on the x′ axis. Consequently, we rotate the projection plane and compute the
1D-HDM for different angles θ. For each plot we choose the best 1D-HDM value out of<br />
the different rotation angles. We experimentally found θ = 9m degrees, with m ∈ [0, 20), to<br />
work well for all our data sets. Figure 3.4 sketches this first step, showing how we<br />
measure different rotations of one plot (represented by the distribution histograms) to<br />
find the best measure value representing the visual quality of the plot.<br />
[Figure 3.4: histograms for all rotations of one plot; panel labels: 1D-HDM, all rotations, best 1D-HDM; axis label: dim 8]<br />
Figure 3.4: First step of the HDM approach: each plot is ranked for different rotations with the<br />
1D-HDM. The best measure value is taken for the plot.<br />
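The rotation step can be sketched as follows: the points of one scatterplot are projected onto an axis rotated by θ = 9m degrees for m = 0, ..., 19, each projection is binned and scored with the 1D-HDM, and the best score is kept. The function names, the number of bins, and the integer class labels 0, ..., M−1 are assumptions for this sketch and are not taken from the original implementation.

```python
import math

def hdm_1d(values, labels, num_classes, num_bins):
    """1D-HDM of projected values, binned into equidistant histogram bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width if all values coincide
    bins = [[0] * num_classes for _ in range(num_bins)]
    for v, c in zip(values, labels):
        b = min(int((v - lo) / width), num_bins - 1)
        bins[b][c] += 1
    weighted = 0.0
    for counts in bins:
        total = sum(counts)
        for p_c in counts:
            if p_c > 0:
                weighted -= total * (p_c / total) * math.log2(p_c / total)
    return 100.0 - 100.0 * weighted / (math.log2(num_classes) * len(values))

def best_rotation(points, labels, num_classes, num_bins=10):
    """Return (best 1D-HDM value, angle in degrees) over theta = 9m, m in [0, 20)."""
    best = None
    for m in range(20):
        theta = math.radians(9 * m)
        # Projection of each 2D point onto the axis rotated by theta.
        projected = [x * math.cos(theta) + y * math.sin(theta) for x, y in points]
        score = hdm_1d(projected, labels, num_classes, num_bins)
        if best is None or score > best[0]:
            best = (score, 9 * m)
    return best
```

For two elongated, parallel clusters along a diagonal, the projections on the original axes overlap, while one of the rotated axes separates the classes completely.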
Second, a subset of the best ranked dimensions is chosen to be further investigated<br />
in higher dimensions. All combinations of the selected dimensions enter a PCA computation.<br />
PCA [83] transforms a high-dimensional data set with correlated dimensions into<br />
a lower-dimensional data set with uncorrelated dimensions, called principal components.<br />
For more properties of PCA, please refer back to Section 2.1.2.<br />
For every comb<strong>in</strong>ation <strong>of</strong> selected dimensions, after the PCA is computed, the first two<br />
components <strong>of</strong> the PCA are plotted to be ranked by the 2D-HDM (see Figure 3.5). The<br />
2D-HDM is an extended version <strong>of</strong> the 1D-HDM, for which a 2-dimensional histogram<br />
is computed on the scatterplot. The quality is measured exactly as for the 1D-HDM,<br />
by summing up the weighted entropies of all bins. The measure is normalized<br />
between 0 and 100, with 100 for the best visualization of the data points, where each bin<br />
contains points of only one class. The bin neighborhood is also taken into account:<br />
instead of the counts p_c of a bin alone, we sum the counts of the bin itself and of its direct neighborhood,<br />
labeled as u_c. Consequently, the 2D-HDM is:<br />
\[ \mathrm{HDM}_{2D} = 100 - \frac{1}{Z} \sum_{x,y} \sum_{c} u_c \Big( -\sum_{c} \frac{u_c}{\sum_{c} u_c} \log_2 \frac{u_c}{\sum_{c} u_c} \Big) \tag{3.11} \]<br />
with the adapted normalization factor:<br />
\[ \frac{1}{Z} = \frac{100}{\log_2 M \, \sum_{x,y} \big( \sum_{c} u_c \big)}. \tag{3.12} \]<br />
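A sketch of Equations 3.11 and 3.12 on a precomputed 2D histogram follows. It assumes that the "direct neighborhood" of a bin is its 8-neighborhood (clipped at the histogram border) and that the histogram contains at least one point; both are assumptions for illustration, as is the name `hdm_2d`.

```python
import math

def hdm_2d(grid, num_classes):
    """2D-HDM (Eqs. 3.11-3.12); grid[y][x][c] = points of class c in bin (x, y)."""
    rows, cols = len(grid), len(grid[0])
    weighted = 0.0  # sum over bins of (sum_c u_c) * entropy of u
    mass = 0.0      # sum over bins of sum_c u_c, used in the normalization
    for y in range(rows):
        for x in range(cols):
            # u_c: class counts of the bin itself plus its direct neighbors.
            u = [0] * num_classes
            for ny in range(max(0, y - 1), min(rows, y + 2)):
                for nx in range(max(0, x - 1), min(cols, x + 2)):
                    for c in range(num_classes):
                        u[c] += grid[ny][nx][c]
            total = sum(u)
            mass += total
            for u_c in u:
                if u_c > 0:
                    weighted -= total * (u_c / total) * math.log2(u_c / total)
    return 100.0 - 100.0 * weighted / (math.log2(num_classes) * mass)
```

Because neighboring bins are included, two pure bins of different classes that lie directly next to each other are still penalized, whereas pure bins separated by empty bins are not.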
[Figure 3.5: panel labels: k best dimensions that separate (selected with 1D-HDM), PCA(Subset), 2D-HDM, best 2D-HDM; axis label: Comp.2]<br />
Figure 3.5: Second step of the HDM approach: PCA is computed on the k best selected dimensions<br />
and on all the possible subsets greater than 3 dimensions. The first two components are plotted<br />
in scatterplots that are ranked with the 2D-HDM. The best measure value indicates the best<br />
scatterplot where the class information is separated.<br />
3.1.4 Quality Measures for Parallel Coord<strong>in</strong>ates with Unclassified <strong>Data</strong><br />
When analyz<strong>in</strong>g parallel coord<strong>in</strong>ates plots, we focus on the detection <strong>of</strong> plots that either<br />
show significant correlation between attribute dimensions or good cluster<strong>in</strong>g properties<br />
<strong>in</strong> certa<strong>in</strong> attribute ranges. There exist a number <strong>of</strong> analytical approaches for parallel<br />
coord<strong>in</strong>ates to generate dimension order<strong>in</strong>gs that try to fulfill these tasks [9, 159]. However,<br />
they <strong>of</strong>ten do not generate an optimal parallel plot for correlation and cluster<strong>in</strong>g properties,<br />
because of local effects that are not taken into account by most analytical functions. We<br />
therefore present analysis functions that take not only the properties of the data into<br />
account, but also the properties of the resulting plot.<br />
Hough Space Measure 6<br />
Our analysis is based on f<strong>in</strong>d<strong>in</strong>g patterns like clustered l<strong>in</strong>es with similar positions and<br />
directions. Our algorithm for detect<strong>in</strong>g these clusters is based on the Hough transform [73].<br />
Straight lines in the image space can be described as y = ax + b. The main idea of the<br />
Hough transform is to define a straight line according to its parameters, i.e. the slope a<br />
and the intercept b. Due to a practical difficulty (the slope of vertical lines is infinite),<br />
the normal representation of a line is used instead:<br />
\[ \rho = x \cos\theta + y \sin\theta, \tag{3.13} \]<br />
where ρ is the length of the normal from the origin to the line and θ is the angle between<br />
this normal and the x-axis. Using this representation, for each non-background pixel in<br />
the visualization, we have a distinct sinusoidal curve in the ρθ-plane, also called Hough<br />
or accumulator space. An intersection of these curves indicates that the corresponding<br />
pixels belong to the line defined by the parameters (ρ_i, θ_i) in the original space. Figure 3.6<br />
shows two synthetic examples of parallel coordinates and their respective Hough spaces:<br />
Figure 3.6(a) presents two well defined line clusters and is more interesting for the cluster<br />
identification task than Figure 3.6(b), where no line cluster can be identified. Note that<br />
the bright areas in the ρθ-plane represent the clusters of lines with similar ρ and θ.<br />
To reduce the bias towards long lines, e.g. diagonal lines, we scale the pairwise visualization<br />
images to an n × n resolution, usually 512 × 512. The accumulator space is<br />
quantized into a w × h cell grid, where w and h control the sensitivity to the similarity of the<br />
lines. We use 50 × 50 grids for computing the results presented in Section 3.1.6 and in<br />
Section 3.1.7. A lower value for w and h reduces the sensitivity of the algorithm because<br />
lines with a slightly different ρ and θ are mapped to the same accumulator cells.<br />
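The voting scheme behind Equation 3.13 can be sketched as follows. This is a plain textbook accumulator, not the implementation used for the results: the function name, the mapping of θ to the w columns and of ρ to the h rows, and the ρ range of ±√2·n are all assumptions made for illustration.

```python
import math

def hough_accumulator(pixels, size, w, h):
    """Vote each non-background pixel (x, y) into a quantized (theta, rho) grid.

    pixels: (x, y) positions in a size x size image; acc[r][t] holds the votes
    for the rho cell r and the theta cell t.
    """
    acc = [[0] * w for _ in range(h)]
    rho_max = math.hypot(size, size)  # |rho| can never exceed the image diagonal
    for x, y in pixels:
        for t in range(w):
            theta = math.pi * t / w  # theta sampled in [0, pi)
            rho = x * math.cos(theta) + y * math.sin(theta)  # Eq. 3.13
            r = int((rho + rho_max) / (2 * rho_max) * h)
            acc[min(r, h - 1)][t] += 1
    return acc
```

Pixels that lie on one straight line vote for the same (ρ, θ) cell, so that cell collects a high value, while smaller w and h merge lines with slightly different parameters into the same cell.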
Based on our definition, good visualizations must contain few, well defined clusters,<br />
which are represented by accumulator cells with high values. To identify these cells,<br />
we compute the median value m as an adaptive threshold that divides the accumulator<br />
function h(x) into two equal parts:<br />
\[ \frac{\sum h(x)}{2} = \sum g(x), \quad \text{where} \quad g(x) = \begin{cases} x & \text{if } x \le m; \\ m & \text{else.} \end{cases} \tag{3.14} \]<br />
6 Implemented and described by our partners from Braunschweig, G. Albuquerque and M. Eisemann,<br />
for the collaborative publication [133]. Adapted and slightly changed for the thesis by myself.<br />
Figure 3.6: Synthetic examples of parallel coordinates and their respective Hough spaces: (a)<br />
presents two well defined line clusters and is more interesting for the cluster identification task<br />
than (b), where no line cluster can be identified. Note that the bright areas in the ρθ-plane<br />
represent the clusters of lines with similar ρ and θ.<br />
Us<strong>in</strong>g the median value, only a few clusters are selected <strong>in</strong> an accumulator space with high<br />
contrast between the cells (see Figure 3.6(a)) while <strong>in</strong> a uniform accumulator space many<br />
clusters are selected (see Figure 3.6(b)). This adaptive threshold is not only necessary to<br />
select possible l<strong>in</strong>e clusters <strong>in</strong> the accumulator space, but also to avoid the <strong>in</strong>fluence <strong>of</strong><br />
outliers and occlusion between the l<strong>in</strong>es. In the occlusion case, a po<strong>in</strong>t that belongs to<br />
two or more l<strong>in</strong>es is computed just once <strong>in</strong> the accumulator space.<br />
The final quality value for a 2D visualization is computed from the number of accumulator<br />
cells n_cells that have a higher value than m, normalized by the total number of cells (w · h)<br />
to the interval [0, 1]:<br />
\[ s_{i,j} = 1 - \frac{n_{\text{cells}}}{w \cdot h}, \tag{3.15} \]<br />
where i, j are the indices of the respective dimensions, and the computed measure s_{i,j}<br />
presents higher values for images containing well defined line clusters (similar lines) and<br />
lower values for images containing lines in many different directions and positions.<br />
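One possible reading of Equations 3.14 and 3.15 is sketched below: the threshold m is chosen so that the accumulator values clipped at m sum to half of the total accumulator mass, and the quality is one minus the fraction of cells above m. This interpretation of the "median value", as well as the names `mass_threshold` and `pair_quality`, are assumptions of this sketch.

```python
def mass_threshold(values):
    """Find m such that sum(min(v, m) for v in values) equals sum(values) / 2 (Eq. 3.14)."""
    ordered = sorted(values)
    target = sum(ordered) / 2.0
    below = 0.0  # sum of the values already known to be <= m
    n = len(ordered)
    for i, v in enumerate(ordered):
        # If m were equal to v, the cells i..n-1 would each contribute v.
        if below + v * (n - i) >= target:
            return (target - below) / (n - i)
        below += v
    return 0.0

def pair_quality(acc):
    """Quality s_ij of one pairwise plot from its accumulator (Eq. 3.15)."""
    values = [v for row in acc for v in row]
    m = mass_threshold(values)
    n_cells = sum(1 for v in values if v > m)
    return 1.0 - n_cells / len(values)
```

Under this reading, an accumulator with a single strong cell selects only that cell and receives a high quality value, while a uniform accumulator selects every cell and receives the value 0.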
Hav<strong>in</strong>g comb<strong>in</strong>ed the pairwise visualizations, we can now compute the overall quality<br />
measure by summ<strong>in</strong>g up the respective pairwise measurements. This overall quality<br />
measure <strong>of</strong> a parallel visualization conta<strong>in</strong><strong>in</strong>g n dimensions is:<br />
\[ \mathrm{HSM} = \sum_{a_i \in I} s_{a_i, a_{i+1}}, \tag{3.16} \]<br />
where I is a vector containing any possible combination of the n dimension indices. In this<br />
way we can measure the quality of any given visualization using parallel coordinates.<br />
Exhaustively computing all n-dimensional combinations in order to choose the best/worst<br />
ones requires a very long computation time and becomes infeasible for a large n. In these<br />
cases, to search for the best n-dimensional combinations in a feasible time, an algorithm<br />
that solves a Traveling Salesman Problem is used, e.g. the A*-Search algorithm [66] or others<br />
[12]. Instead of exhaustively combining all possible pairwise visualizations, this kind of<br />
algorithm composes only the best overall visualization.<br />
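For a small number of dimensions, Equation 3.16 and the exhaustive search it refers to can be sketched as follows. The names `hsm` and `best_order` and the matrix `s` of pairwise qualities are assumptions; the brute-force loop over all permutations is exactly the approach that becomes infeasible for large n and that a Traveling Salesman solver is meant to replace.

```python
from itertools import permutations

def hsm(order, s):
    """Hough Space Measure of one axis ordering (Eq. 3.16).

    order: sequence of dimension indices; s[a][b]: pairwise quality of axes a and b.
    """
    return sum(s[a][b] for a, b in zip(order, order[1:]))

def best_order(s):
    """Exhaustively search all orderings of the n dimensions (feasible only for small n)."""
    n = len(s)
    return max(permutations(range(n)), key=lambda order: hsm(order, s))
```

Since n dimensions allow n! orderings, this sketch is only meant to make the objective explicit, not to serve as a practical search strategy.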
3.1.5 Quality Measures for Parallel Coord<strong>in</strong>ates with Classified <strong>Data</strong><br />
While analyz<strong>in</strong>g parallel coord<strong>in</strong>ates visualizations with class <strong>in</strong>formation, we consider<br />
two main issues. First, in good parallel coordinates visualizations, the lines that belong<br />
to a given class must be quite similar (inclination and position similarity). Second,<br />
visualizations where the classes can be observed separately and that contain less overlap<br />
are also considered to be good. We developed two measures for classified parallel<br />
coord<strong>in</strong>ates that take these matters <strong>in</strong>to account: the Similarity Measure that encourages<br />
<strong>in</strong>ner class similarities, and the Overlap Measure that analyzes the overlap between<br />
classes. Both are based on the Hough Space Measure for unclassified data presented <strong>in</strong> the<br />
previous Section 3.1.4.<br />
Similarity Measure 7<br />
The Similarity Measure (SM) is a direct extension <strong>of</strong> the HSM presented before for unclassified<br />
data. For visualizations containing class information, the different classes are usually<br />
represented by different colors. We separate the classes into distinct images, containing<br />
only the pixels in the respective class color, and compute a quality measure s_k for each<br />
class, using Equation 3.15. Thereafter, an overall quality value SM is computed as the<br />
sum of all class quality measures:<br />
\[ \mathrm{SM} = \sum_{k} s_k. \tag{3.17} \]<br />
Using this measure, we encourage visualizations with strong inner class similarities and<br />
slightly penalize overlapped classes. Note that due to the overlap between classes, some classes<br />
have many missing pixels, which results in a lower s_k value compared to other visualizations<br />
where less or no overlap between the classes exists.<br />
Overlap Measure 8<br />
In order to penalize overlap between classes, we analyze the difference between the classes<br />
in the Hough space (see Section 3.1.4). As in the SM, for the Overlap Measure we also<br />
separate the classes into different images and compute the Hough transform over each image.<br />
Once we have a Hough space h for each class, we compute the quality measure as the sum<br />
of the absolute differences between the classes:<br />
\[ \mathrm{OM} = \sum_{k=1}^{M-1} \sum_{l=k+1}^{M} \sum_{i=1}^{P} \left| h_k^i - h_l^i \right|. \tag{3.18} \]<br />
Here M is the number of Hough space images, i.e. classes, and P is the number<br />
of pixels in each image. The measure value is high if the Hough spaces are disjoint, i.e. if<br />
there is no large overlap between the classes. Therefore, the visualization with the smallest<br />
overlap between the classes receives the highest measure value.<br />
7 Implemented and described by our partners from Braunschweig, G. Albuquerque and M. Eisemann,<br />
for the collaborative publication [133]. Adapted and slightly changed for the thesis by myself.<br />
8 Implemented and described by our partners from Braunschweig, G. Albuquerque and M. Eisemann,<br />
for the collaborative publication [133]. Adapted and slightly changed for the thesis by myself.
Another valuable use of this measure is to encourage or search for similarities between<br />
different classes. In this case, the overlap between the classes is desired, and the previously<br />
computed measure can be inverted to compute suitable quality values:<br />
\[ \mathrm{OM}_{\text{inv}} = \frac{1}{\mathrm{OM}}. \tag{3.19} \]<br />
3.1.6 Application on Real <strong>Data</strong> Sets<br />
To evaluate our measures, we tested them on a variety of real data sets. We applied<br />
our Class Density Measure (CDM), Class Separat<strong>in</strong>g Measure (CSM), Histogram Density<br />
Measure (HDM), Similarity Measure (SM), and Overlap Measure (OM) on classified data<br />
to f<strong>in</strong>d views that try to either separate or show similarities between the classes. For<br />
unclassified data, we applied our Rotat<strong>in</strong>g Variance Measure (RVM) and Hough Space<br />
Measure (HSM) <strong>in</strong> order to f<strong>in</strong>d l<strong>in</strong>ear or non-l<strong>in</strong>ear correlations and clusters <strong>in</strong> the data<br />
sets, respectively.<br />
Except for the HDM, we chose to present only relative measures, i.e. all calculated<br />
values are scaled so that the best found visualization is assigned 100 and the worst 0.<br />
This scal<strong>in</strong>g is <strong>in</strong>tended to ease the <strong>in</strong>terpretability <strong>of</strong> the measure by the user. For<br />
the HDM, we chose to present the unchanged measure values, as the HDM allows an<br />
easy direct <strong>in</strong>terpretation, with a value <strong>of</strong> 100 be<strong>in</strong>g the best and 0 be<strong>in</strong>g the worst<br />
possible constellation. If not otherwise stated, our examples are pro<strong>of</strong>-<strong>of</strong>-concepts, and<br />
<strong>in</strong>terpretations <strong>of</strong> some <strong>of</strong> the results should be provided by doma<strong>in</strong> experts.<br />
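The relative scaling described above, with the best found visualization at 100 and the worst at 0, corresponds to a min-max normalization of the raw measure values. A minimal sketch follows; the function name `to_relative` and the treatment of the case where all values are equal are assumptions.

```python
def to_relative(scores):
    """Scale raw measure values so that the best view gets 100 and the worst gets 0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: if all views score the same, every view is reported as 100.
        return [100.0 for _ in scores]
    return [100.0 * (s - lo) / (hi - lo) for s in scores]
```

Such relative values are comparable only within one ranking, since they depend on the best and worst view found in that particular data set.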
<strong>Data</strong> Sets<br />
We used the data sets summarized in Table 3.2 to show the measures’ properties. In this<br />
table we present some information about the data. More details about the data sources<br />
and the dimension names of each data set can be found in Appendix A.<br />
Table 3.2: Overview of the data sets used to show the measures’ properties.<br />
Data set name | Records | Dimensions (footnote 9) | Classes | Source<br />
Cars | 7404 | 22 | 2 | partners<br />
Olives | 572 | 8 | 9 | [163]<br />
Parkinson’s Disease | 195 | 11 | 0 | [95, 96]<br />
Wine | 178 | 13 | 3 | [53]<br />
Wisconsin Diagnostic Breast Cancer | 569 | 30 | 2 | [131]<br />
Cars contains 7404 cars listed with 23 different attributes, including price, power, fuel<br />
consumption, width, height and others, automatically collected from a national second<br />
hand car selling website 10 . We chose the attribute fuel as a class label, which divides the data<br />
into two classes, benzine and diesel. Our goal is to find the similarities and differences<br />
between these.<br />
9 The number <strong>of</strong> dimensions doesn’t count the class attribute <strong>in</strong>.<br />
10 Collected by another <strong>in</strong>stitute from Braunschweig and provided to our partners there.
Best ranked views us<strong>in</strong>g RVM<br />
100 - (dim9,dim12) 97 - (dim2,dim3) 75 - (dim2,dim4)<br />
Worst ranked views us<strong>in</strong>g RVM<br />
0 - (dim6,dim8) 0.3 - (dim7,dim8) 5.6 - (dim2,dim8)<br />
Figure 3.7: Results for the Park<strong>in</strong>son’s Disease data set us<strong>in</strong>g our RVM measure (Section 3.1.2).<br />
While clumpy low-correlation bear<strong>in</strong>g views are punished (bottom row), views conta<strong>in</strong><strong>in</strong>g higher<br />
correlation between the variables are preferred (top row).<br />
Olives is a classified data set with 572 olive oil samples from nine different regions in<br />
Italy [163]. For each sample the normalized concentrations of eight fatty acids are given.<br />
The large number of classes (regions) poses a challenging task to the algorithms trying to<br />
find views in which all classes are well separated.<br />
Parkinson’s Disease is a data set composed of 195 biomedical voice measures from<br />
31 people, 23 of whom have Parkinson’s disease [95, 96]. Each of the 12 dimensions is<br />
a particular voice measure. The voice recordings from these individuals were taken<br />
with the goal of discriminating healthy people from those with Parkinson’s disease.<br />
Wine is a classified data set with 178 instances and 13 attributes describing chemical<br />
properties of Italian wines derived from three different cultivars.<br />
The Wisconsin Diagnostic Breast Cancer (WDBC) data set consists of 569 samples with<br />
30 real-valued dimensions each [131]. The data is classified into malignant and benign cells.<br />
The task is to find the best separating dimensions showing the two classes.<br />
Scatterplot Measures<br />
First we show the results for RVM on the Parkinson’s Disease data set 11 . The three best<br />
and the three worst ranked scatterplots by the RVM are shown in Figure 3.7, with<br />
the RVM value above each plot. High correlations have been measured in the plots<br />
(dim9,dim12), (dim2,dim3), as well as (dim2,dim4). However, visualizations containing<br />
11 For easier reading of this paragraph, we renamed the original dimension names. Please refer to<br />
Appendix A Table A.3 for the original dimension names.
3.1.6 Application on Real <strong>Data</strong> Sets 43<br />
low correlation received a low value, as shown in the second row of this figure, which presents<br />
the worst ranked views and their measure values. This example demonstrates that our<br />
target pattern, the correlated dimensions, is correctly identified by the RVM measure.<br />
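The values printed above each plot (100 for the best view, 0 for the worst) are consistent with a min-max rescaling of the raw measure values over all views. The following is a minimal sketch of such a ranking, not the RVM itself: the absolute Pearson correlation is used only as a stand-in score (the RVM is defined in Section 3.1.2), and the function names and the toy data are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_views(columns, score=lambda x, y: abs(pearson(x, y))):
    """Score every pair of dimensions and rescale the scores to 0..100."""
    raw = {(i, j): score(columns[i], columns[j])
           for i, j in combinations(sorted(columns), 2)}
    lo, hi = min(raw.values()), max(raw.values())
    span = (hi - lo) or 1.0  # avoid division by zero if all scores are equal
    scaled = {view: 100.0 * (s - lo) / span for view, s in raw.items()}
    return sorted(scaled.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): dim1 and dim2 are perfectly correlated, dim3 is not.
columns = {
    "dim1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "dim2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "dim3": [3.0, 1.0, 4.0, 1.0, 5.0],
}
ranking = rank_views(columns)
best_view, worst_view = ranking[0], ranking[-1]
```

Any other view-quality function can be passed as `score`; only the rescaling and sorting are illustrated here.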
Best ranked views using CDM: 100 - (dim4,dim5), 97 - (dim1,dim5), 84 - (dim1,dim4).<br />
Worst ranked views using CDM: 0 - (dim6,dim8), 15 - (dim6,dim7), 24 - (dim7,dim8).<br />
Figure 3.8: Results for the Olives data set using our CDM measure (Section 3.1.3). The different<br />
colors depict the different classes (regions) of the data set. While it is impossible for this data set<br />
to find views completely separating all classes, our CDM measure still found views where most of<br />
the classes are mutually separated (top row). In the worst ranked views the classes clearly overlap<br />
with each other (bottom row).<br />
Best ranked PCA-views using the HDM approach: 85.45 - PCA(dim(4,5,8)), 84.98 - PCA(dim(1,2,4,5)), 84.9 - PCA(all data dims).<br />
Figure 3.9: Results for the Olives data set using our HDM measure (Section 3.1.3). The best<br />
ranked plot is the PCA of dim(4,5,8), revealing a good view on all the classes; the second best is<br />
the PCA of dim(1,2,4) and the third is the PCA on all 8 dimensions. The differences between<br />
the last two are small because the variance in the additional dimensions for the 3rd eigenvector,<br />
relative to the 2nd, is not large. The difference between the last two views and the first view is<br />
clearly visible (e.g. looking at the yellow class).
In Figure 3.8, we show the results for the Olives data set 12 using our CDM measure.<br />
Even though a view separating all nine different olive classes does not exist, the CDM<br />
reliably chooses three views that separate the data quite well in the dimensions (dim4,<br />
dim5), (dim1,dim5) as well as (dim1,dim4). The bottom row of this figure presents the<br />
worst ranked projections. We can see that in these cases it is impossible to identify any<br />
class structure in the views.<br />
We also applied our HDM technique to this data set. First the 1D-HDM tries to<br />
identify the best separating dimensions of the data set, as presented in Section 3.1.3.<br />
The dimensions dim1, dim2, dim4, dim5 and dim8 were ranked as the best separating<br />
dimensions by the 1D-HDM. We computed all subsets of these dimensions, computed the<br />
PCA on these subsets, and ranked the views of the first two PCA components with the<br />
2D-HDM. In the best ranked views, presented in Figure 3.9, the different classes are well<br />
separated. Compared to the upper row in Figure 3.8, the visualization utilizes the screen<br />
space better, which is due to the PCA transformation.<br />
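The subset search described above can be sketched as follows. This is only an outline under stated assumptions: subsets with fewer than two dimensions are skipped because two PCA components are needed, and `project` and `score` are hypothetical placeholders for the PCA step and the 2D-HDM, which are not implemented here.

```python
from itertools import combinations


def candidate_subsets(dims, min_size=2):
    """All subsets of `dims` with at least `min_size` members."""
    return [subset
            for size in range(min_size, len(dims) + 1)
            for subset in combinations(dims, size)]


def rank_pca_views(dims, project, score):
    """Project each subset to 2D and rank the views by score (best first).

    `project` and `score` are placeholders for the PCA step and the 2D-HDM.
    """
    scored = [(score(project(subset)), subset) for subset in candidate_subsets(dims)]
    return sorted(scored, reverse=True)


# Dimensions ranked best by the 1D-HDM for the Olives data set.
best_dims = ("dim1", "dim2", "dim4", "dim5", "dim8")
subsets = candidate_subsets(best_dims)

# Dummy stand-ins so the sketch runs: "project" passes the subset through and
# "score" simply prefers smaller subsets. Neither reflects the real methods.
ranking = rank_pca_views(best_dims, project=lambda s: s, score=lambda s: -len(s))
```

With five selected dimensions this yields 26 candidate subsets, so the exhaustive search stays cheap as long as the 1D-HDM preselects only a few dimensions.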
Best ranked views using CSM: 100 - (dim7,dim13), 97 - (dim7,dim10), 93 - (dim7,dim12).<br />
Worst ranked views using CSM: 0 - (dim3,dim5), 0.05 - (dim1,dim5), 0.08 - (dim1,dim4).<br />
Figure 3.10: Results for the Wine data set using our CSM measure (Section 3.1.3). The best ranked<br />
plots present a large distance between the centers of the class clusters while the worst ranked views<br />
show only cluttered data.<br />
Comparing our CSM and CDM measures, we can observe that they produce distinct<br />
results on the same data sets. Applying the CSM to the Wine data set 13 reveals views<br />
that present a good separation between the classes. The best ranked plots are shown in<br />
the upper row of Figure 3.10: (dim7,dim13), (dim7,dim10), and (dim7,dim12). They<br />
present a large distance between the centers of the class clusters. The worst ranked views,<br />
in contrast, show only cluttered data. In comparison, the result for the CDM measure on<br />
12 For easier reading of this paragraph, we renamed the original dimension names. Please refer<br />
to Appendix A Table A.2 for the original dimension names.<br />
13 For easier reading of this paragraph, we renamed the original dimension names. Please refer<br />
to Appendix A Table A.4 for the original dimension names.
Best ranked views using CDM: 100 - (dim7,dim10), 89 - (dim1,dim7), 88 - (dim7,dim13).<br />
Worst ranked views using CDM: 0 - (dim3,dim5), 0.04 - (dim4,dim8), 0.07 - (dim8,dim9).<br />
Figure 3.11: Results for the Wine data set using our CDM measure (Section 3.1.3). Note that the<br />
second best ranked view, (dim1,dim7) (with CDM = 89), is not considered good using the CSM<br />
measure (CSM = 58).<br />
the Wine data set is depicted in Figure 3.11. The best ranked plots (dim7,dim10),<br />
(dim1,dim7), and (dim7,dim13) present denser clusters, as expected from the ranking<br />
criteria of this measure. Note that the second best ranked view, (dim1,dim7) (with CDM<br />
= 89), is not considered good by the CSM measure, which gives it a lower rank and quality<br />
value (CSM = 58). Comparing Figure 3.10 and Figure 3.11, we can observe that the CSM<br />
favors large distances between the clusters while the CDM assigns high values to views<br />
that present dense but separated clusters, even if the distances between them are much<br />
smaller.<br />
There are cases when just looking at the best ranked and the worst ranked plots is<br />
not enough. By arranging all the scatterplots in a scatterplot matrix (SPLOM), the analyst has<br />
the possibility to look at all orthogonal views of a data set at once. In our system the<br />
scatterplots are shown in the upper right half of the SPLOM while the other half is used<br />
to display the quality values of each plot. To guide the analysis the user can fade out<br />
lower ranked views, which helps to focus on those with a higher probability of<br />
information-bearing content. One drawback is that the SPLOM does not scale to a very large<br />
number of dimensions, because the number of scatterplots grows quadratically. Figure 3.12 shows an<br />
example. Both SPLOMs show the WDBC data set 14 , but the upper SPLOM shows the<br />
results for the RVM while the bottom SPLOM shows the results for the CDM measure.<br />
The threshold for both SPLOMs was set to 0.95 15 , so all plots with a lower rank have<br />
14 Please refer to Appendix A Table A.5 for details about the original dimension names of the data set.<br />
15 Please note that the SPLOM shows the measure values between 0 and 1 while all the other results<br />
presented before were on a scale from 0 to 100.
been faded out. As can be seen in the enlarged detail, different views come into focus<br />
depending on the chosen measure. While the RVM considers plots with a high degree of<br />
correlation as more important, the CDM focuses on separating the designated classes, here<br />
the malignant and benign cells. Which pattern is more important depends on the user task.<br />
Figure 3.12: Results on the WDBC data set for the RVM (top) and the CDM (bottom). In this<br />
example, views with a quality value of less than 0.95 have been faded out. In this way many irrelevant<br />
views are hidden, reducing the number of plots the user has to inspect in more detail<br />
to a more manageable number.
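Both the quadratic growth of the SPLOM and the effect of the threshold can be made concrete with a small sketch. The function names and the quality values below are hypothetical; only the threshold of 0.95 and the 30 dimensions of the WDBC data set are taken from the text.

```python
def splom_size(n_dims):
    """Number of distinct scatterplots in one half of a scatterplot matrix."""
    return n_dims * (n_dims - 1) // 2


def visible_views(quality, threshold=0.95):
    """Keep the views whose quality value reaches the threshold; fade the rest."""
    return {view: q for view, q in quality.items() if q >= threshold}


# WDBC has 30 dimensions, so half of the SPLOM already holds 435 plots.
n_plots = splom_size(30)

# Hypothetical quality values on the 0..1 scale used by the SPLOM.
quality = {("dim1", "dim2"): 0.97, ("dim1", "dim3"): 0.40, ("dim2", "dim3"): 0.95}
shown = visible_views(quality)
```

Doubling the number of dimensions roughly quadruples the number of plots, which is why fading out low ranked views matters more as the data set grows.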
Parallel Coord<strong>in</strong>ates Measures<br />
To demonstrate the value of our approaches for parallel coordinates, we present the best<br />
and worst ranked visualizations by our measures on different data sets. The corresponding<br />
visualizations are shown in Figures 3.13, 3.14 and 3.15. For better comparability the<br />
visualizations have been cropped after the 4th dimension. In all experiments<br />
we used a size of 50 × 50 for the Hough accumulator. The algorithms are quite robust<br />
with respect to this size; using more cells generally only increases computation time<br />
but has little influence on the result.<br />
Figure 3.13 shows the ranked results for the Parkinson’s Disease data set 16 using our<br />
Hough Space Measure (HSM).<br />
The HSM algorithm prefers views with more similarity in the distance and inclination<br />
of the different lines, resulting in the prominent small band in the visualization of the<br />
Parkinson’s Disease data set. This is similar to clusters in the projected views of these<br />
dimensions, here between dim3 and dim12 as well as dim6 and dim11.<br />
Best ranked views using HSM: 100, 97, 97.<br />
Worst ranked views using HSM: 0, 0.7, 1.1.<br />
Figure 3.13: Results for the non-classified version of the Parkinson’s Disease data set. Best and<br />
worst ranked visualizations using our HSM measure for non-classified data (ref. Section 3.1.4). Top<br />
row: The three best ranked visualizations and their respective normalized measures. Well defined<br />
clusters in the data set are favored. Bottom row: The three worst ranked visualizations. The large<br />
amount of spread complicates interpretation. Note that the user task related to this measure is<br />
not to find possible correlation between the dimensions but to detect well separated clusters.<br />
Applying our Similarity Measure (SM) to the Cars data set, we can see that there seem to be<br />
barely any good views that split the clusters of the data set (see Figure 3.14). We verified<br />
this by exhaustively looking at all pairwise projections. The only dimension<br />
where the classes can be mostly separated and at least some form of cluster can be reliably<br />
found is dim6, in which cars using diesel (represented in red) generally have a lower value<br />
compared to benzine (represented in black). Figure 3.14 shows the best ranked results in<br />
the top row. Additionally, the similarity of the majority in dim15, dim18 and dim3 can be<br />
detected. Apparently cars using diesel are cheaper; this might be due to the age of the diesel<br />
cars, but age was unfortunately not included in the database. On the other hand, the<br />
worst ranked views using the SM (see Figure 3.14, bottom row) are barely interpretable;<br />
at least we were unable to extract any useful information from them.<br />
In Figure 3.15 the results for our Overlap Measure applied to the WDBC data set are<br />
16 For easier reading of this paragraph, we renamed the original dimension names. Please refer<br />
to Appendix A Table A.3 for the original dimension names.
Best ranked views using SM: 100, 98, 98 (axis orderings shown: 17 6 15 18; 17 6 20 18; 3 20 18 15).<br />
Worst ranked views using SM: 0, 0.1, 0.2 (axis orderings shown: 9 1 19 12; 5 19 1 9; 9 1 12 19).<br />
Figure 3.14: Results of the SM for the Cars data set. Cars using benzine are shown in black,<br />
diesel in red. Best and worst ranked visualizations using our Hough Similarity Measure (Section<br />
3.1.5) for parallel coordinates. Top row: The three best ranked visualizations and their respective<br />
normalized measures. Bottom row: The three worst ranked visualizations.<br />
Best ranked views using OM: 100, 99, 99 (axis orderings shown: 25 9 24 29; 22 9 24 29; 25 9 22 29).<br />
Worst ranked views using OM: 0, 0.1, 0.2 (axis orderings shown: 17 18 31 21; 13 31 18 17; 13 17 18 31).<br />
Figure 3.15: Results of the OM for the WDBC data set. Malignant nuclei are colored black while<br />
healthy nuclei are red. Best and worst ranked visualizations using our Overlap Measure (Section<br />
3.1.5) for parallel coordinates. Top row: The three best ranked visualizations. Besides good similarity,<br />
which resembles clusters, visualizations that minimize the overlap between the classes are favored,<br />
so that the difference between malignant and benign cells becomes clearer. Bottom<br />
row: The three worst ranked visualizations. The overlap of the data complicates the analysis and<br />
the information is useless for the task of discriminating malignant and benign cells.<br />
shown. This result is very promising. In the top row, showing the best plots, the malignant<br />
and benign cells are well separated. It seems that the dimensions dim22 (radius (worst)), dim9<br />
(concave points (mean)), dim24 (perimeter (worst)), dim29 (concave points (worst)) and<br />
dim25 (area (worst)) separate the two classes well.
3.1.7 Evaluation <strong>of</strong> the Measures’ Performance Us<strong>in</strong>g Synthetic <strong>Data</strong> 49<br />
3.1.7 Evaluation <strong>of</strong> the Measures’ Performance Us<strong>in</strong>g Synthetic <strong>Data</strong><br />
The work presented by Johansson and Johansson [82] introduces a system for dimensionality<br />
reduction that combines user-defined quality metrics using weighted functions to<br />
preserve as many important structures as possible in the reduced data set. The analyzed<br />
structures are clustering properties, outliers and dimension correlations. We used the synthetic<br />
data set presented in their paper to test our Hough Space Measure. This data set contains<br />
1320 data items and 100 variables, of which 14 contain significant structures.<br />
The HSM algorithm prefers views with more similarity in the distance and inclination<br />
of the different lines. We computed our HSM on this synthetic data set and present the<br />
result in Figure 3.16. Here we can see the best ranked 4-dimensional parallel coordinates<br />
plots for clustered data points in the top row and the worst ranked plots in the bottom row.<br />
At the top, the clusters of lines are clearly visible, in contrast to the bottom where no<br />
structures are visible. The five dimensions that appear in the best plots are dimensions A,<br />
C, G, I, J. Four out of five dimensions are also determined by [82] as the best dimensions<br />
for clustering. They use user-defined quality measures for their system to determine the<br />
best dimensions according to different criteria. Our resulting dimensions are a subset of<br />
their best 9 dimensions for showing clustered data points. This supports the view that our<br />
measures rank plots in a way similar to how users would rank them.<br />
Best ranked views using HSM: 100, 99.3, 98.8.<br />
Worst ranked views using HSM: 0, 0, 0.2.<br />
Figure 3.16: Results of the HSM for the synthetic data set from [82] presenting the best and worst<br />
ranked visualizations using our HSM measure for non-classified data (ref. Section 3.1.4). Top<br />
row: The three best ranked visualizations and their respective normalized measures. Well defined<br />
clusters in the data set are favored. Bottom row: The three worst ranked visualizations. The large<br />
amount of spread complicates interpretation. Note that the user task related to this measure is<br />
not to find high correlation between the dimensions but to detect well separated clusters.<br />
To show the effectiveness of our scatterplot measures and to explain their differences, we<br />
analyzed their results on a self-generated synthetic data set, synthetic2. We created a<br />
10-dimensional data set with two classes. By selecting just two classes, we aim to show<br />
the fundamental differences between the measures that allow hidden patterns to be detected.<br />
In three dimensions we hid target patterns to test how these projections are ranked by<br />
the measures. The patterns were created as follows: the first pattern in subspace (2, 5)<br />
contains two classes with means at $m_1 = (6, 14)$ and $m_2 = (13, 6)$, each containing 500<br />
samples from a multivariate normal distribution with<br />
$C_1 = \begin{pmatrix} 3 & 2.7 \\ 2.7 & 3 \end{pmatrix}$<br />
the covariance matrix of the variables. In dimension 6 we defined two classes with means at $m_3 = 6$<br />
and $m_4 = 13$ respectively, with 500 random samples of a normal distribution and with standard
deviation std = 1.5 for each class. With this definition of the dimensions, three patterns<br />
occur in the subspaces (2, 5), (2, 6) and (5, 6).<br />
In the other 7 dimensions we defined random patterns. These are generated systematically<br />
by taking, for every dimension, the mean $m_d = 10$ and 1000 samples from a normal<br />
distribution, starting from a standard deviation std = 0.5 and increasing it by 0.5 for<br />
each dimension. Therefore, the last random dimension has std = 3.5.<br />
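A data set with these properties can be generated with a few lines of code. The sketch below uses only the parameters stated above; everything else is an assumption made for illustration, in particular the function name, the pairing of the class with mean 6 in dimension 6 with the class at (6, 14), the use of the same covariance matrix for both classes, and the assignment of the standard deviations 0.5 to 3.5 to the remaining dimensions in ascending order.

```python
import random
from math import sqrt


def make_synthetic2(seed=0):
    """Generate 1000 labelled rows with 10 dimensions (numbered 1..10)."""
    rng = random.Random(seed)
    # Cholesky factor of the covariance matrix [[3, 2.7], [2.7, 3]].
    l11 = sqrt(3.0)
    l21 = 2.7 / l11
    l22 = sqrt(3.0 - l21 ** 2)
    # Class label -> (mean in dims 2 and 5, mean in dim 6); pairing is assumed.
    classes = {1: ((6.0, 14.0), 6.0), 2: ((13.0, 6.0), 13.0)}
    # Which noise dimension gets which standard deviation is an assumption.
    noise_dims = [1, 3, 4, 7, 8, 9, 10]
    noise_std = {d: 0.5 * (k + 1) for k, d in enumerate(noise_dims)}

    rows = []
    for label, ((m2, m5), m6) in classes.items():
        for _ in range(500):
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            row = {d: rng.gauss(10.0, s) for d, s in noise_std.items()}
            row[2] = m2 + l11 * z1              # variance 3
            row[5] = m5 + l21 * z1 + l22 * z2   # variance 3, covariance 2.7
            row[6] = rng.gauss(m6, 1.5)
            rows.append((label, row))
    return rows


data = make_synthetic2()
```

The Cholesky factor is written out by hand so that the sketch needs no external library; with a numerical library, a multivariate normal sampler could be used instead.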
Figure 3.17: Matrix for the synthetic data set with scatterplots above the main diagonal and<br />
parallel coordinate plots below.<br />
In Figure 3.17, we present the scatterplot matrix of the synthetic data set showing the<br />
scatterplots above the main diagonal and the parallel coordinate plots below the diagonal.<br />
We ranked all these plots with our measures for scatterplots and parallel coordinates.<br />
The results are presented in Figure 3.18. For every measure we show a point chart containing<br />
the sorted measure results. The target patterns are marked red in each plot. It<br />
can be seen that every measure ranked one of the target patterns as the best plot.<br />
The scatterplot measures for classified data, CDM and CSM, found all three target<br />
patterns as the best projections of the data set. This confirms our assumption that these<br />
measures search for the projections with the best class separability and the densest<br />
classes. The RVM, designed for data sets without classes, was computed on the same data<br />
set with no class information. (Note that this means that RVM was measured on plots<br />
like those in Figure 3.17 that have no different colors for the data points.) The best ranked<br />
scatterplot by RVM is (2, 5), which has the densest target pattern. RVM is aimed to find
[Figure 3.18 panels. Scatterplot measures (left column): RVM, CDM, CSM, 1D-HDM. Parallel coordinates measures (right column): HSM, OM, SM.]<br />
Figure 3.18: Results of the 7 measures for classified and unclassified data. The left column shows<br />
the result for the scatterplot measures and the right column for the parallel coordinates measures.<br />
The ranks are sorted in decreasing order and the target patterns are marked with red crosses.
Figure 3.19: Scatterplot of the first two components of the PCA over dimensions 2, 5 and 6 (axes: Comp.1 and Comp.2).<br />
the scatterplots with the highest correlations. We can see that subspace (2, 5) contains the<br />
target pattern with the highest correlation. The second target pattern in (2, 6) shows two<br />
clusters with high correlation, and is also found by the RVM.<br />
The 1D-HDM ranked all the target patterns best, with a result of 100. This synthetic<br />
data set is unfortunately unsuitable for testing the 2D-HDM because the patterns are<br />
defined along the data dimensions and therefore the 1D-HDM already finds the best projection.<br />
Computing the PCA and searching for a better projection of the principal components is<br />
not necessary because the value of 100 cannot be improved. Applying the PCA to the<br />
best dimensions selected by the 1D-HDM (2, 5 and 6), we obtain the plot shown in Figure<br />
3.19. These best components of the PCA are also ranked with 100 by the 2D-HDM.<br />
Note that the resulting plot is not visually better than the orthogonal projection (2, 5)<br />
and no additional information can be obtained through the PCA.<br />
The parallel coordinates measures are designed to target different patterns. For unclassified<br />
data, HSM ranks best those parallel coordinates plots whose lines have similar positions and directions,<br />
i.e. clusters. For classified data, SM looks for these clusters taking the classes into account,<br />
and OM is designed to find parallel coordinates plots in which the classes overlap least.<br />
In the point charts in the right column of Figure 3.18, we see that all the measures<br />
for parallel coordinates ranked one of our target patterns best. HSM analyzed the data<br />
with no class information and ranked (5, 6), where two classes are visible, as the best plot. OM<br />
also ranked (5, 6) as the best because this plot has the smallest overlap between the two<br />
classes. SM ranked two target patterns in the top 3: (5, 6) as the best, and (2, 6) as third<br />
best, presenting lines in the two classes with almost the same positions and directions.<br />
This evaluation is only a starting point for an evaluation of every possible parameter<br />
combination. In the future, a complete statistical analysis of the correlation between the<br />
measures and the correlation to the ground truth will be necessary. In the following, we<br />
briefly outline the basic steps for the future evaluation process:
3.1.8 Conclusion and Future Work 53<br />
1. Define ground truth. The ground truth should be generated in a synthetic data<br />
set having two independent variables, such as the density and separability of classes.<br />
2. Vary the number of classes. The synthetic data sets have to have different<br />
numbers of classes.<br />
3. Vary the number of dimensions. The synthetic data sets have to have different<br />
numbers of dimensions. They should simulate different types of high-dimensional<br />
data: small data sets – 2 to 9 dimensions, medium data sets – 10 to 49 dimensions,<br />
and large data sets – 50 to 100 dimensions.<br />
4. Statistical analysis. Perform a statistical analysis of the correlation between the<br />
measures and of their correlation to the ground truth.<br />
3.1.8 Conclusion and Future Work<br />
In this section, we presented several methods to aid and potentially speed up the visual<br />
exploration process for different visualization techniques. In particular, we automated the<br />
ranking of scatterplot and parallel coordinates visualizations for classified and unclassified<br />
data for the purpose of correlation and cluster separation. In the next section, a ground<br />
truth is generated by letting users choose the most relevant visualizations from a manageable<br />
test set. To validate our methods, we compare this ground truth to the automatically generated<br />
ranking. Some limitations are recognized, as it is not always possible to find good separating<br />
views due to a growing number of classes and due to some multivariate relations.<br />
This is a general problem and not specific to our techniques.<br />
The limitations of the approach presented above are of course determined by the<br />
task, data complexity, and the measures applied to find the requested patterns. Tasks<br />
might be of different types, such as finding outliers, significant patterns, different types<br />
of correlations between the dimensions, etc. The complexity of the data can be described<br />
by the number of dimensions, the number of contained classes, and the clarity of patterns<br />
(noise, over-plotting, and distribution of the data). This complexity strongly influences<br />
the ability of measures to detect the required patterns. There are a number of measures<br />
in the domain of high-dimensional data visualization addressing different types of tasks, with<br />
different levels of applicability to different data sets. Creating a data-task-measure<br />
taxonomy for our domain is out of the scope of this thesis, but we strongly recommend<br />
this for future research. In Section 4.2, we will also present the results of a data-measure<br />
taxonomy with the focus on one task, namely the class separation in visualization.<br />
Our current approach is therefore to describe systematically the functionality of the<br />
presented measures as a function of their ability to detect hidden patterns in the data for<br />
a particular task. Our results have to be interpreted accordingly.<br />
The comparison to other existing measures should be considered in future work. Furthermore,<br />
issues such as over-plotting need to be part of the study since they are currently<br />
disregarded. Scalability concerns will need to be addressed in future research under the<br />
constraint of data complexity and heuristics to reduce the search space for target patterns.<br />
54 Chapter 3. Quality Measures based <strong>Visual</strong> Analysis <strong>of</strong> <strong>High</strong>-<strong>Dimensional</strong> <strong>Data</strong><br />
3.2 Quality Measures and Human Perception – An Empirical Study<br />
Quality measures have been devised to automatically extract interesting visual representations<br />
out of a large number of available candidates in the exploration of high-dimensional<br />
databases. The measures make it possible, for instance, to search within a large set of scatterplots<br />
(e.g., in a scatterplot matrix) and select the views that contain the best separation among<br />
clusters. The rationale behind these techniques is that automatic selection of “best” views<br />
is not only useful but also necessary when the number of potential projections exceeds the<br />
limit of human interpretation. While useful as a concept in general, such metrics have so far<br />
received limited validation in terms of human perception. In this section, we present a<br />
perceptual study investigating the relationship between human interpretation of clusters<br />
in 2D scatterplots and the measures automatically extracted out of them. Specifically,<br />
we compare a series of selected metrics and analyze how well they predict human detection<br />
of clusters. A thorough discussion of results follows with reflections on their impact and<br />
directions for future research.<br />
Our empirical evaluation is based on a user study in which users had to select projections<br />
of attribute combinations well suited for classifying the data under inspection. The study<br />
then compares the user scores of the selected scatterplots with the scores obtained from the selected<br />
quality measures to analyze their correlation. The outcome of the study primarily makes it possible<br />
to validate the assumption that selecting the views ranked best by quality measures is a<br />
viable way to simulate the selection of users. Furthermore, the study makes it possible to compare<br />
the performance of the measures employed and to kick-start a quality measures benchmark<br />
process, where metrics are compared against a baseline represented by the results obtained.<br />
In summary, the main contributions of this section are:<br />
• A validation of the hypothesis that quality measures can simulate the selection of<br />
best views by human beings;<br />
• A comparison among a set of promising and established measures;<br />
• The provision of a first benchmark framework, through which it is possible to compare<br />
new quality metrics.<br />
The rest of the section is organized as follows. Section 3.2.1 describes the measures<br />
employed in the study in detail. Section 3.2.2 describes the experiment design and<br />
Section 3.2.3 presents the results. Section 3.2.4 discusses the results obtained in the study,<br />
offering a vision of how they can be interpreted and exploited in the future. Section 3.2.5<br />
describes how to set up a framework for user-based evaluation of quality<br />
metrics as suggested in this section. Finally, Section 3.2.6 provides the conclusions.<br />
3.2.1 Measures<br />
For this study we have selected quality metrics from [129] and from Section 3.1.3 ([133])<br />
that were developed specifically for scatterplots with classified data. In both cases the authors<br />
propose automatic analysis methods to extract potentially relevant visual structures<br />
from a set of candidate visualizations.<br />
Our study is based on the Class Density Measure (CDM) and the Histogram Density<br />
Measure (HDM) presented in Section 3.1.3. These two measures were also described<br />
in [133].<br />
In [129], Sips et al. present similar work. They provide measures for ranking<br />
scatterplots with classified and unclassified data. They propose two additional quantitative<br />
measures of class consistency: one based on the distance to the cluster centroids, and<br />
another based on the entropies of the spatial distributions of classes. The paper also<br />
describes an initial small user study in which user selections are compared to the outcomes of<br />
the proposed methods. From this work we adopt the centroid-based class consistency measure,<br />
referred to in the following as the Distance Consistency Measure (DCM).<br />
The authors also present a measure called Class Density Measure that, although having the<br />
same name as our measure presented in Section 3.1.3, differs from our Class Density<br />
Measure. It is in fact similar to the HDM measure and is therefore not included in the<br />
analysis.<br />
For a better overview, the metrics are summarized in Table 3.3.<br />
Table 3.3: Overview of the analyzed measures with the reference for additional details.<br />
Measure | Reference<br />
Distance Consistency Measure (DCM) | [129]<br />
1D Histogram Density Measure (1D-HDM) | Section 3.1.3 and [133]<br />
2D Histogram Density Measure (2D-HDM) | Section 3.1.3 and [133]<br />
Class Density Measure (CDM) | Section 3.1.3 and [133]<br />
The following is based on the assumption that each cluster in the data is uniquely<br />
labeled (either manually or through some form of n-dimensional clustering algorithm) and<br />
that for each point it is possible to know to which cluster it belongs. Finally, in the<br />
visualizations shown here, and those used in the experiment, each cluster is colored with<br />
a unique hue.<br />
We will not provide extensive formal specifications and details on the metrics. For<br />
additional details and further discussions on their limits and capabilities please refer to<br />
the original papers [129] and [133], and the previous Section 3.1.3.<br />
Distance Consistency Measure<br />
The Distance Consistency Measure (DCM) presented by Sips et al. in [129] is based<br />
on the distance of data points to their cluster centroid. The measure assumes that<br />
a clustering model has been calculated in the n-dimensional space and computes a specific value for<br />
a given 2D projection by projecting points and centroids onto the selected 2D space.<br />
More precisely, the algorithm counts how many points violate<br />
the centroid distance property. For any given point, the distance to its centroid in the<br />
n-dimensional space must always be lower than the distance to any other cluster centroid.<br />
However, when data is projected onto a specific 2D space, this property can be violated. For<br />
a given projection, the measure is therefore calculated as the proportion of data points<br />
that violate the centroid distance property.<br />
The Distance Consistency Measure (DCM) based on the centroid distance is consequently<br />
calculated as follows:<br />
\[ \frac{\left|\{\, x' \in v(X) : CD(x', centr'(c_{clabel(x)})) \neq \text{true} \,\}\right|}{k} \quad \text{[129]} \qquad (3.20) \]<br />
where x' is the 2D projection of the data point x, centr'(c_{clabel(x)}) is the projection<br />
of the centroid of the class of x (clabel(x)), and k is the number of data points.<br />
CD(x', centr'(c_{clabel(x)})) is the centroid distance function, which states that the distance<br />
of a point to its class centroid is minimal in comparison to the distance to all other<br />
centroids. In other words, the percentage of points that do not satisfy this property is<br />
calculated.<br />
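The centroid distance check can be illustrated with a minimal Python sketch. It is not the implementation from [129]; the function name `dcm_violation_fraction` and the toy points are hypothetical, and the sketch assumes an axis-aligned projection, so that the projected centroid equals the mean of the projected points.<br />

```python
import numpy as np

def dcm_violation_fraction(points_2d, labels):
    """Fraction of projected points that are not closest to their own class centroid."""
    points_2d = np.asarray(points_2d, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Assumption: for a linear (axis-aligned) projection the projected n-D centroid
    # equals the mean of the projected points, so centroids are computed in 2D.
    centroids = np.array([points_2d[labels == c].mean(axis=0) for c in classes])
    # Distance of every point to every projected centroid: shape (k, n_classes).
    dists = np.linalg.norm(points_2d[:, None, :] - centroids[None, :, :], axis=2)
    nearest_class = classes[np.argmin(dists, axis=1)]
    # A point violates the property if another centroid is closer than its own.
    return float(np.mean(nearest_class != labels))

# Toy data (illustrative only): two well-separated classes.
separated = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(dcm_violation_fraction(separated, [0, 0, 0, 1, 1, 1]))  # 0.0
```

In this sketch a value of 0 means that no point violates the property, so lower values correspond to more consistent projections.<br />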
Histogram Density Measure (1D and 2D)<br />
The Histogram Density Measure (HDM) approach presented in Section 3.1.3 describes<br />
two quality measures for scatterplots with class information.<br />
For computing the 1D Histogram Density Measure (1D-HDM), data is projected<br />
onto an axis and a histogram is calculated to describe the distribution of the data points<br />
over it. Since there are points belonging to different classes (i.e., clusters), the measure is<br />
based on the analysis of the amount of overlap among points of different classes in the same<br />
histogram bin. The measure is intended to isolate plots that show good class separations.<br />
Consequently, HDM looks for corresponding histograms that show significant separation,<br />
and this property holds when the histogram bins contain only points of one class.<br />
In order to measure this property, the approach uses entropy and axes rotation. Several<br />
instances of the same 2D projection are computed, each with a different rotation factor.<br />
For each one an average entropy value is computed and the value of the best-ranked rotation<br />
is selected as the measure’s value. The computation of the entropy values is explained in<br />
more detail in Section 3.1.3.<br />
The 2D Histogram Density Measure (2D-HDM) is an extended version of the 1D-HDM,<br />
for which a 2-dimensional histogram on the scatterplot is computed; that is, each<br />
bin represents a small square over the 2D projection and the bin count is the number of<br />
data points falling within the square. The quality is measured similarly to the 1D-HDM<br />
by computing a weighted sum of the entropy of each bin. The measure is normalized<br />
between 0 and 100, with 100 assigned to the best visualization, in which each bin contains<br />
points of only one class.<br />
In contrast to the 1D-HDM, the 2D-HDM also takes the bin neighborhood into account.<br />
For each bin, the information of the points p_c in the bin and its direct neighbors is summed up<br />
as u_c. The full equation explaining the calculation in detail can be found<br />
in Section 3.1.3 and in the original paper [133].<br />
The HDM measure extended to 2D can also find projections where classes form, for example, two<br />
concentric circles of different diameters. In this case, a 1D projection will always have a<br />
large overlap of the classes, even if these circles do not overlap in 2D or nD.<br />
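The bin-entropy idea behind the 2D-HDM can be sketched as follows. This is a simplified illustration and not the measure from Section 3.1.3: it omits the bin neighborhood and the rotation step, and the weighting by bin count as well as the normalization to the range 0 to 100 are assumptions made for this sketch. The function name `bin_purity_score` is hypothetical.<br />

```python
import numpy as np

def bin_purity_score(points_2d, labels, bins=10):
    """Simplified 2D histogram score: 100 if every occupied bin holds a single class."""
    points_2d = np.asarray(points_2d, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Shared bin edges so that the per-class histograms are aligned.
    x_edges = np.linspace(points_2d[:, 0].min(), points_2d[:, 0].max(), bins + 1)
    y_edges = np.linspace(points_2d[:, 1].min(), points_2d[:, 1].max(), bins + 1)
    counts = np.stack([
        np.histogram2d(points_2d[labels == c, 0], points_2d[labels == c, 1],
                       bins=[x_edges, y_edges])[0]
        for c in classes
    ])                                            # shape (n_classes, bins, bins)
    totals = counts.sum(axis=0)
    occupied = totals > 0
    p = counts[:, occupied] / totals[occupied]    # class proportions per occupied bin
    log_p = np.zeros_like(p)
    np.log2(p, out=log_p, where=p > 0)            # treat 0 * log(0) as 0
    entropy = -(p * log_p).sum(axis=0)
    # Assumption: bins are weighted by their point count.
    weighted_entropy = (totals[occupied] * entropy).sum() / totals.sum()
    max_entropy = np.log2(len(classes))           # assumes at least two classes
    return float(100.0 * (1.0 - weighted_entropy / max_entropy))

# Two concentric rings (illustrative data): separable in 2D, overlapping in any 1D projection.
angles = np.linspace(0, 2 * np.pi, 60, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
points = np.vstack([ring, 3 * ring])
ring_labels = np.array([0] * 60 + [1] * 60)
print(bin_purity_score(points, ring_labels))
```

With the default of 10 bins per axis, no bin of the ring example contains points of both classes, which is the situation described above for concentric circles.<br />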
Class Density Measure<br />
The Class Density Measure (CDM) was also presented in detail in Section 3.1.3. This<br />
measure evaluates the scatterplots according to how well they separate the classes. The<br />
goal is to identify those plots that show minimal overlap between the classes.<br />
In order to compute the overlap between the classes, the method uses a continuous<br />
representation where the points belonging to the same cluster form a separate image. For<br />
each class we have a distinct image for which a continuous and smooth density function<br />
based on local neighborhoods is calculated. For each pixel p the distance to its k nearest<br />
neighbors N_p of the same class is computed and the local density is calculated over the<br />
sphere with radius equal to the maximum distance.<br />
Having these continuous density functions available for each class, the mutual overlap<br />
can be estimated by computing the sum of the absolute differences between each pair of<br />
density functions and summing up the results. Section 3.1.3 gives more details about the<br />
computation formulas. The<br />
value of the metric is high if the densities at each pixel differ as much as possible, i.e., if<br />
one class has a higher density value compared to all others. Therefore, the visualization<br />
with the least overlap of the classes will be given the highest value. A property of this<br />
measure is that it not only rates well-separated clusters highly, but also clusters<br />
whose density difference is noticeable. This is a great advantage since it can ease the<br />
interpretation of the data in the visualization.<br />
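A rough Python sketch of the density-difference idea is given below. It is not the CDM formula from Section 3.1.3: the density estimate k / (pi * r^2), the grid resolution, the missing normalization, and the names `knn_density_image` and `class_density_score` are assumptions introduced for illustration. Each class needs at least k points.<br />

```python
import numpy as np
from itertools import combinations

def knn_density_image(points, grid, k):
    """k-nearest-neighbor density estimate of one class at every grid position."""
    dists = np.linalg.norm(grid[:, None, :] - points[None, :, :], axis=2)
    r_k = np.sort(dists, axis=1)[:, k - 1]   # distance to the k-th nearest neighbor
    r_k = np.maximum(r_k, 1e-9)              # guard against division by zero
    return k / (np.pi * r_k ** 2)            # assumed density: k points per disc area

def class_density_score(points_2d, labels, resolution=32, k=3):
    """Sum of absolute density differences over all class pairs and grid positions."""
    points_2d = np.asarray(points_2d, dtype=float)
    labels = np.asarray(labels)
    xs = np.linspace(points_2d[:, 0].min(), points_2d[:, 0].max(), resolution)
    ys = np.linspace(points_2d[:, 1].min(), points_2d[:, 1].max(), resolution)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.column_stack([gx.ravel(), gy.ravel()])   # one row per "pixel"
    images = [knn_density_image(points_2d[labels == c], grid, k)
              for c in np.unique(labels)]
    return float(sum(np.abs(a - b).sum() for a, b in combinations(images, 2)))

# Illustrative data: the same point cloud used for both classes, then shifted apart.
cloud = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]], dtype=float)
cloud_labels = [0] * 5 + [1] * 5
overlapping = class_density_score(np.vstack([cloud, cloud]), cloud_labels)
separated = class_density_score(np.vstack([cloud, cloud + 10]), cloud_labels)
print(overlapping, separated)
```

Because the score in this sketch is not normalized, it is only meaningful for comparing projections of the same data set on the same grid.<br />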
3.2.2 Empirical Evaluation<br />
The following section describes the empirical evaluation of the described measures for projection<br />
quality. The aim of this evaluation is to assess the degree to which these measures<br />
reflect users’ perception of a high-quality projection. Our method therefore consists of a<br />
user study for creating a baseline and a series of measures that all judge the quality of a<br />
set of scatterplots. The results show the correlation of each of the measures<br />
with the user-graded quality.<br />
Hypotheses<br />
The hypotheses for the analyses were defined by the features of the four different automatic<br />
measures.<br />
H1. We expect the lowest correlation with users’ selection for the 1D-HDM measure, since this<br />
measure takes only one-dimensional projections into account when computing the separation<br />
quality of the data.<br />
H2. Higher correlation is expected for the 2D-HDM measure because it extends<br />
its 1D version by creating a 2D histogram and considers direct neighborhoods of each<br />
data point for the quality computation.<br />
H3. The perceived quality of a projection may also be influenced by the density of<br />
clusters having a minimal overlap, as suggested by the CDM. Here we expect a<br />
strong correlation with the measure’s rank.<br />
H4. Finally, we expect a high correlation with users’ selection when the consistency of<br />
clusters is computed, which is expressed by the quality of separation of the clusters.<br />
This is assessed by the DCM as described previously.<br />
In general, we expect a significant positive correlation of all these measures with users’<br />
selection. However, these measures are also expected to vary in their approximation of<br />
users’ perception, which is expressed by the coefficient of determination (R²) of the<br />
regression.<br />
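For reference, the coefficient of determination of a simple linear regression can be computed as in the following sketch. The function name `r_squared` and the numbers are illustrative placeholders, not data from the study.<br />

```python
import numpy as np

def r_squared(measure_scores, user_scores):
    """Coefficient of determination of a least-squares line user = a * measure + b."""
    x = np.asarray(measure_scores, dtype=float)
    y = np.asarray(user_scores, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)        # ordinary least-squares fit
    residuals = y - (slope * x + intercept)
    ss_res = (residuals ** 2).sum()               # unexplained variation
    ss_tot = ((y - y.mean()) ** 2).sum()          # total variation
    return float(1.0 - ss_res / ss_tot)

# Placeholder values: one quality score and one user score per scatterplot.
example_measure = [0.1, 0.4, 0.5, 0.9]
example_users = [2.0, 5.0, 4.0, 9.0]
print(r_squared(example_measure, example_users))
```

For a regression with a single predictor and an intercept, this value equals the squared Pearson correlation coefficient between the two score vectors.<br />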
Participants<br />
Participants were 18 undergraduate students from the faculty of natural sciences. All had<br />
extensive experience in working with computers and scatterplots. Students participated<br />
in the experiment voluntarily and received no reward for participating.<br />
<strong>Data</strong> and Plot Selection<br />
To conduct the empirical evaluation, we took the UCI wine data set (see footnote 17) containing the results<br />
of a chemical analysis of three wine types grown in a specific area of Italy. These types are<br />
represented in 178 samples, with the results of 13 chemical analyses recorded for each<br />
sample. The 13 attributes of the data set were combined pairwise into 78 scatterplots.<br />
The quality of these scatterplots was then computed by the four different measures. The<br />
data did not contain any special cases of cluster constellation, nor did it have outliers or<br />
hidden data points.<br />
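The number of scatterplots follows from the number of unordered attribute pairs, 13 · 12 / 2 = 78. A short sketch that enumerates these pairs (the function name `scatterplot_pairs` is hypothetical; attributes are referred to by index only):<br />

```python
from itertools import combinations

def scatterplot_pairs(n_attributes):
    """All unordered pairs of attribute indices, one pair per scatterplot."""
    return list(combinations(range(n_attributes), 2))

pairs = scatterplot_pairs(13)
print(len(pairs))  # 78
```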
The number of scatterplot representations to be used in the user study was 18, in<br />
order to keep the completion time reasonably short and to allow a one-page representation<br />
of all the scatterplots at once in a reasonable size, so that all data points can be seen.<br />
The selection of the 18 scatterplots followed the distribution of the measures’<br />
quality values, as follows:<br />
1. The quality values of the measures were normalized between 0 and 1, and each was assigned to<br />
one quantile.<br />
2. The scatterplots were sampled in such a way that the distribution of the<br />
number of projections in higher and lower quantiles was approximately the same<br />
for all measures.<br />
3. As a result, the number of quality values in each quantile was 4±1.<br />
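Step 1 above can be sketched as follows. The text does not state how many quantiles were used or how they were computed, so the use of four quantile bins derived from the normalized values is an assumption, as is the function name `normalize_and_bin`.<br />

```python
import numpy as np

def normalize_and_bin(scores, n_bins=4):
    """Min-max normalize scores to [0, 1] and assign each to a quantile bin (0 = lowest)."""
    scores = np.asarray(scores, dtype=float)
    low, high = scores.min(), scores.max()
    normalized = (scores - low) / (high - low)    # assumes high > low
    # Inner quantile boundaries, e.g. the 25%, 50% and 75% quantiles for n_bins=4.
    edges = np.quantile(normalized, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_index = np.searchsorted(edges, normalized, side="right")
    return normalized, bin_index

# Illustrative scores only, not values from the study.
normalized, bin_index = normalize_and_bin([0, 1, 2, 3, 4, 5, 6, 7])
print(bin_index.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Sampling the 18 scatterplots so that the bins are similarly populated for all four measures (step 2) would then operate on the `bin_index` values of each measure.<br />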
These selected scatterplots were arranged in six columns and three rows and then printed<br />
using a high-quality color printer. The order of the scatterplots was permuted by the Latin<br />
square method, resulting in 18 different settings, one for each participant. An example<br />
of the set of scatterplots used in the experiment is shown in Figure 3.20. Two original<br />
experiment forms are attached in Appendix A.2 (Figure A.1 and Figure A.2).<br />
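One simple way to obtain such orderings is a cyclic Latin square, sketched below. The text does not specify which Latin square construction was used, so the cyclic variant and the function name `cyclic_latin_square` are assumptions.<br />

```python
def cyclic_latin_square(n):
    """n orderings of n items; every item appears once in each row and each column."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]

# One ordering of the 18 scatterplot indices per participant.
orderings = cyclic_latin_square(18)
print(orderings[0][:6])  # [0, 1, 2, 3, 4, 5]
print(orderings[1][:6])  # [1, 2, 3, 4, 5, 6]
```

Each row can be laid out in three rows of six plots; across the 18 participants every scatterplot then occupies every page position exactly once.<br />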
Task<br />
Participants were presented with a scenario around the wine data set. In this scenario, they acted<br />
as a wine consultant for three different types of wines. They were told<br />
that their challenge was to analyze a large number of attributes describing the wines, such as<br />
color saturation, alcohol content, etc. Participants were requested to select projections of<br />
attribute combinations that are well suited for classifying the three different types of wines.<br />
This task had to be carried out using a selected set of scatterplot views showing attributes<br />
in a pairwise manner. First, participants were asked to select the five best<br />
projections for separating wine types and then to order them using numbers between 1 and 5<br />
(1 indicating the absolute best representation, and 5 the worst out of the five best quality<br />
scatterplots).<br />
17 Source at UCI: www.archive.ics.uci.edu/ml/datasets/Wine<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ● ● ●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
● ●● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
● ●<br />
●<br />
●●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●●<br />
●<br />
● ●<br />
● ●●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ● ● ●<br />
● ●<br />
● ● ●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ● ● ● ●<br />
●●<br />
● ●<br />
● ●●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
● ● ●<br />
● ●●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●●<br />
●<br />
● ●<br />
●<br />
● ● ● ● ●<br />
● ● ●<br />
● ●●<br />
●<br />
● ●●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●●<br />
●<br />
● ●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●●<br />
●<br />
● ● ●<br />
● ● ● ●<br />
● ● ● ●●<br />
●●<br />
●<br />
● ● ●<br />
●● ●<br />
●<br />
●<br />
● ● ●<br />
● ● ●●<br />
●<br />
●<br />
●<br />
●●<br />
● ●●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
● ●●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●●<br />
●<br />
● ●<br />
●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ● ●<br />
●<br />
●<br />
● ● ●<br />
● ●●<br />
● ● ● ● ●<br />
●<br />
● ●<br />
● ●<br />
● ●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●●<br />
●<br />
●<br />
● ● ● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
● ● ● ● ●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
● ● ●<br />
●<br />
● ● ●<br />
● ● ●<br />
●<br />
● ●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
● ●<br />
● ● ● ● ●<br />
●<br />
●●<br />
● ● ●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●●<br />
●<br />
● ● ●●●<br />
● ●<br />
●<br />
●<br />
●<br />
● ● ●<br />
● ● ●<br />
● ●<br />
●●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
● ● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●●<br />
● ●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
● ●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ● ●<br />
● ● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ● ● ●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
● ● ● ●<br />
●<br />
●<br />
● ●<br />
●<br />
● ●<br />
● ● ●<br />
●<br />
●<br />
●●<br />
●<br />
● ●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
● ●<br />
●<br />
● ●<br />
●<br />
● ●●<br />
● ●<br />
● ●<br />
●<br />
● ● ●<br />
● ●<br />
●<br />
● ● ● ●●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●●<br />
● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●●<br />
●<br />
● ●<br />
● ●<br />
●<br />
● ●<br />
● ●<br />
●<br />
●<br />
●<br />
●●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
● ● ● ● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●●<br />
●<br />
● ●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
●<br />
3.2.3 Results 59<br />
●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
● ● ●<br />
● ●<br />
●<br />
● ● ●<br />
● ●<br />
● ●<br />
●<br />
● ●<br />
● ●<br />
● ●<br />
● ●<br />
● ●<br />
● ●<br />
● ●<br />
● ● ●<br />
●<br />
●<br />
● ●<br />
●<br />
●<br />
Figure 3.20: Projections of scatterplots used in the experiment. Participants had to select the best<br />
five projections and order them by their quality. The order of the scatterplots was permuted for<br />
each participant separately using the Latin square method.<br />
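The per-participant permutation mentioned in the caption can be illustrated with a minimal sketch.<br />
It assumes a simple cyclic Latin square over the 18 projections; the exact construction used in the<br />
study is not stated here, and the function names are hypothetical.<br />

```python
def latin_square_orders(n_items):
    """Build a cyclic Latin square: row r is the sequence 0..n-1 shifted by r.

    Every item appears exactly once in each row (one presentation order)
    and exactly once in each column (one presentation position).
    """
    return [[(row + col) % n_items for col in range(n_items)]
            for row in range(n_items)]


def order_for_participant(participant_index, n_items):
    """Pick one row of the square; rows are reused cyclically (an assumption)
    when there are more participants than items."""
    square = latin_square_orders(n_items)
    return square[participant_index % n_items]


# 18 projections were shown in the experiment; indices stand in for the plots.
orders = latin_square_orders(18)
first_participant_order = order_for_participant(0, 18)
```

With such a square, each projection occupies each position equally often across participants, which<br />
is the property that counterbalances position effects.<br />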
Procedure<br />
The experiment consisted of two parts. In the first part, participants read a short<br />
description of the scenario and the task and filled out a short standardized form with general<br />
questions (such as age, study stage, and experience with computers and scatterplots)<sup>18</sup>. In<br />
the second, main part of the experiment, participants performed the task by selecting<br />
and ordering the five best representations for classifying the three wine types<sup>19</sup>. Clearly, the<br />
best-suited scatterplot is the one that allows a clear distinction of the three wine types<br />
by the two attributes. Participants’ effectiveness mainly depended on their ability to read<br />
and interpret scatterplots. The group of participants was quite homogeneous with regard<br />
to age and previous education. As expected, their performance did not show significant<br />
deviations or anomalies. We verified this by checking that none of the scores lay more than<br />
three standard deviations above or below the mean. In order not to bias participants towards any of the<br />
measures, they were not told how to define a high-quality projection, nor<br />
how to look for dense or consistent clusters.<br />
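The three-standard-deviation check described above can be sketched as follows. This is only an<br />
illustration: the score values are hypothetical, and the use of the population standard deviation<br />
is an assumption, since the text does not say which estimator was used.<br />

```python
from statistics import mean, pstdev


def outside_three_sigma(scores):
    """Return the scores lying more than three standard deviations from the mean."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population standard deviation
    return [s for s in scores if abs(s - mu) > 3 * sigma]


# Hypothetical per-participant scores, for illustration only.
example_scores = [0.71, 0.64, 0.80, 0.69, 0.75, 0.66]
flagged = outside_three_sigma(example_scores)  # empty when no score is anomalous
```

An empty result corresponds to the situation reported here, in which no participant's performance<br />
deviated anomalously from the group.<br />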
3.2.3 Results<br />
A linear regression analysis was carried out, using the Pearson coefficient to assess<br />
the correlation between the users’ classification and the measures’ quality assignment for the<br />
selected projections. To make the measures comparable, we normalized the quality values that<br />
each measure assigned to the projections individually to the range from 0 to 1. From the users’<br />
answers we computed the probability of selecting a projection by counting the number of<br />
times each projection was selected. These probabilities were weighted with the averaged<br />
ranks assigned by the participants. This resulted in a sequential order of the projections<br />
reflecting users’ quality preferences. The dependent variable of the statistical evaluation<br />
18 Appendix A.2 contains this general question form (in German) in Section A.2.1.<br />
19 Appendix A.2 contains two examples of the experiment form (Figure A.1 and Figure A.2).
was the user rankings, and each of the four measures was one independent variable in separate<br />
computations. The results show significant positive correlation for all four measures<br />
(p
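The evaluation pipeline described in this subsection can be sketched as follows. This is a minimal<br />
illustration under stated assumptions: all input values are hypothetical, the function names are<br />
invented for this sketch, and the way selection probabilities are weighted by the averaged ranks<br />
(<code>user_preference_scores</code>) is only one plausible reading, as the exact formula is not given.<br />

```python
from math import sqrt


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: there is no spread to rescale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def user_preference_scores(selection_counts, mean_ranks, n_participants, n_ranks=5):
    """Selection probability weighted by the averaged rank (assumed weighting)."""
    scores = []
    for count, mean_rank in zip(selection_counts, mean_ranks):
        probability = count / n_participants
        # Assumption: rank 1 (best) gets weight 1.0, rank n_ranks gets 1/n_ranks,
        # and projections that nobody selected get weight 0.
        weight = (n_ranks + 1 - mean_rank) / n_ranks if count > 0 else 0.0
        scores.append(probability * weight)
    return scores


def pearson_r(xs, ys):
    """Pearson correlation coefficient; assumes both inputs have non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values for four projections, for illustration only.
measure_quality = [12.0, 30.0, 21.0, 48.0]   # raw output of one quality measure
selection_counts = [2, 9, 5, 14]             # how often each projection was chosen
mean_ranks = [4.5, 2.0, 3.0, 1.5]            # averaged rank among those who chose it

normalized_quality = min_max_normalize(measure_quality)
user_scores = user_preference_scores(selection_counts, mean_ranks, n_participants=20)
r = pearson_r(normalized_quality, user_scores)
r_squared = r ** 2  # for a one-predictor linear regression, R^2 equals r^2
```

Running the same computation once per measure, with the user scores as the dependent variable,<br />
mirrors the separate regressions described in the text.<br />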
2D-HDM and DCM assigned the best quality to exactly the projection that the users ranked best.<br />
CDM assigned this projection 99% quality (rank 2), and 1D-HDM only 68% quality<br />
(rank 4). The projection that users ranked highest is shown in Figure 3.22(a).<br />
The highest quality projection selected by CDM and 1D-HDM is shown in Figure 3.22(b).<br />
This projection shows a clear and very dense cluster for one of the wine types; however, it<br />
also shows a high overlap for the other two types. Users assigned rank 4 to this projection.<br />
In the users’ eyes, the worst-quality projection was the one showing high density for all three<br />
wine types but also a high overlap, as shown in Figure 3.22(c). This was confirmed by<br />
three of the measures; the exception was CDM, which still assigned a quality of 26.3% (rank<br />
11) to this projection.<br />
(a) Users’ highest quality ranked projection was confirmed by DCM and 2D-HDM quality measures.<br />
(b) Highest quality ranked projection by CDM and 1D-HDM measures.<br />
(c) Users’ lowest quality ranked projection was confirmed by DCM, 2D-HDM and also by 1D-HDM quality measures.<br />
Figure 3.22: Correlation of measures with users’ classification for the highest and one lowest quality<br />
projection.<br />
It is also interesting that none of the users selected 8 of the 18 projections<sup>21</sup>.<br />
CDM, however, still assigned 65% quality to one of these projections, as shown in<br />
Figure 3.23(a). The highest quality assigned to any of these 8 projections was 58% by<br />
1D-HDM, 50% by DCM, and only 40% by 2D-HDM. Surprisingly, the projection shown<br />
in Figure 3.23(b) was selected by a user and ranked among the best five, but the<br />
measures ranked it second to last, and CDM even ranked it last.<br />
(a) Not selected by any user, but assigned 65% quality by CDM.<br />
(b) Selected by a user, ranked second to last by all the measures, and last by CDM.<br />
Figure 3.23: Surprising study results.<br />
21 Figure A.3 in Appendix A.2.3 shows the 8 projections that were not selected by any user.
In summary, 2D-HDM, closely followed by DCM, reflected users’ quality assignment<br />
best by matching the highest and lowest quality rankings accurately and by having the highest<br />
R<sup>2</sup> value of the correlation. These results should, however, not be taken to indicate that density (CDM)<br />
is unimportant for quality assignments. Rather, they should motivate combining and improving<br />
these measures so that they can sufficiently support users in their task.<br />
3.2.4 Discussion<br />
In this section we examine the results of the experiment in more detail, discussing<br />
some of their potential implications and ideas for further research. As noted in<br />
the results, the measures diverge depending on whether they take into account the density<br />
or the amount of overlap among the clusters. 2D-HDM, together with DCM, reflected users’<br />
preference for high-quality projections better than the others. Intuitively, both density<br />
and overlap should play a role in the perception of clusters; nonetheless, the results of our<br />
experiment seem to suggest that separation is more important. Future research will need<br />
to address this issue to establish whether a combination of measures based on both density<br />
and separation can outperform the others.<br />
Another open issue, not investigated in this study, is the influence that different cluster<br />
shapes might have on user perception and, at the same time, on the proposed measures.<br />
The current results do not allow us to differentiate between cluster shapes, even though<br />
the images with highly ranked clusters contain circular shapes.<br />
In relation to this last observation, it is worth noting that the major factor involved<br />
in the separation of clusters is the proximity of the points. This is of course not surprising,<br />
as the Gestalt Laws of Grouping suggest that proximity is the strongest visual feature<br />
used by the visual system to extract patterns from images. Nonetheless, we believe<br />
it is worth running new studies investigating the relationship between the other laws of<br />
grouping (e.g., closure, similarity, continuation), users’ perception, and additional<br />
quality metrics. Along these lines, Section 4.2 presents the results of a qualitative<br />
analysis of cluster separation factors. There, different plots showing a variety of data sets<br />
were analyzed manually to identify what kinds of patterns are formed by clusters and how<br />
these are identified by current metrics.<br />
Our experimental task focuses on the perception of clusters. However, it is<br />
important to acknowledge that the perception of clusters in n-dimensional data spaces is<br />
not the only useful task. For instance, for the detection of outliers it is necessary not only<br />
to find suitable metrics but also to run studies similar to ours in<br />
order to understand the relationship between user perception and the metric. The same<br />
idea can and should be repeated for several user tasks, visual patterns, and metrics.<br />
We consider our study only a starting point in this direction; nonetheless, it introduces a<br />
well-reasoned experimental design procedure that can be repeated to explore all that we have<br />
outlined above. For this reason, in the following section we briefly summarize the common<br />
elements of our study design so that it can be repeated in future experiments.<br />
Finally, we point out that the current study focuses exclusively on the correlation and<br />
comparison of what metrics and users detect, with an underlying assumption that users’<br />
perception represents a sort of optimum. This assumption requires additional investigation,<br />
as computational methods might be able to detect interesting patterns that users cannot<br />
necessarily perceive visually.
3.2.5 Guidel<strong>in</strong>es<br />
In the following, we briefly outline the basic steps to repeat in new user studies that follow<br />
the same schema used in this study. Our motivation is to facilitate the design<br />
of similar studies and to promote the production of related studies on the perception of<br />
visual patterns and their formalization in computable metrics.<br />
1. Select a visualization technique. The first necessary element is the selection of<br />
a specific visualization technique. In our examples we have used scatterplots,<br />
one of the most widely used techniques in visualization. Future studies might include other<br />
high-dimensional visualization techniques like the ones presented in Section 2.2, e.g.,<br />
treemaps, parallel coordinates, or line charts.<br />
2. Select a visual feature. In this phase it is necessary to think in terms of which<br />
particular features can be detected in the visualization technique under inspection.<br />
Note that some concepts recur across several visualizations but need to be redefined<br />
for each specific case (e.g., clustering in scatterplots and in parallel coordinates).<br />
3. Formalize the feature. This is a fundamental step in our design schema. Once<br />
a specific feature has been selected, it is necessary to formalize it in such a way that it<br />
can be computed by an algorithm. In this phase it is advisable to produce<br />
more than one measure in order to capture several aspects of the same feature. This<br />
also makes it possible to compare the performance of the selected measures in the study and<br />
to acquire additional information on the visual processes involved in the perception of<br />
the feature.<br />
4. Run a rank-based study. Once the feature has been formalized, it is possible to<br />
run a study in which the users rank the images in terms of the selected feature.<br />
When the images have been ranked, it is possible to compare the ranks given by the<br />
metrics with the ones provided by the users (as suggested in our method and in the design<br />
of the study).<br />
5. Study and refine. The results of the algorithms can be compared to the results<br />
obtained from the users, who represent the reference against which all measures are<br />
evaluated. The goal of this phase is not only to determine which of the metrics<br />
performs best, but also to reason about the results in order to (1) gain<br />
insights into how users perceive the selected feature and (2) design better metrics that<br />
capture the desired feature with more accuracy.<br />
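Steps 4 and 5 can be illustrated with a short sketch that turns quality scores into ranks and<br />
compares two rankings. Note that the study itself reports a Pearson-based linear regression;<br />
Spearman's rank correlation is swapped in here only as one common way to compare ranks, and all<br />
names and values are hypothetical.<br />

```python
from math import sqrt


def ranks_from_scores(scores):
    """Rank 1 goes to the highest score; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next item has the same score (a tie).
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


def pearson_r(xs, ys):
    """Pearson correlation coefficient; assumes both inputs have non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman_rho(scores_a, scores_b):
    """Spearman correlation: Pearson correlation computed on the two rank vectors."""
    return pearson_r(ranks_from_scores(scores_a), ranks_from_scores(scores_b))


# Hypothetical scores for five images: one metric and the aggregated user preference.
metric_scores = [0.95, 0.40, 0.70, 0.10, 0.55]
user_scores = [0.80, 0.35, 0.90, 0.05, 0.50]
rho = spearman_rho(metric_scores, user_scores)
```

Computing such a value for each candidate metric gives a simple basis for the comparison in step 5;<br />
images on which metric and user ranks disagree strongly are natural starting points for refinement.<br />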
3.2.6 Conclusion and Future Work<br />
To conclude the research presented in Section 3.2, we recall the contributions<br />
mentioned at the beginning. Through a user-centered evaluation design, we<br />
showed that some quality measures reflect users’ perception better than others.<br />
However, it remains an open question to what extent users are able to preselect good-quality<br />
projections of their multidimensional data in an efficient and unbiased manner.<br />
Our results indicate that further development is needed to find the ultimate automatic<br />
quality measure. Nevertheless, this work provides the first quality benchmark framework<br />
with which different metrics can be compared. Another question regarding<br />
the future development of similar studies is whether the results of several similar<br />
experiments on different visualization techniques and features can be joined to create a<br />
uniform model or a better understanding of how visualization works and how visual patterns<br />
can be formalized. While the answer to this question is not clear at the moment, it is evident<br />
that at the very least every single study has the potential to improve the understanding<br />
and the utilization of the selected technique.<br />
In future work, the same techniques can be applied to other visualization methods,<br />
e.g., parallel coordinates, to evaluate the correlation between the specific quality metrics<br />
and the user perception. Since the current work focuses exclusively on cluster detection,<br />
different visual patterns such as outliers could be investigated. As mentioned in Section 3.2.4,<br />
it is also important to analyze how well users perform in finding interesting patterns.<br />
4 A Systematization of Quality Metrics in High-Dimensional Data Visualization<br />
„Nothing has such power to broaden the mind as the ability to investigate<br />
systematically and truly all that comes under thy observation in life.”<br />
Marcus Aurelius<br />
Contents<br />
4.1 Quality Metrics in High-Dimensional Data Visualization<br />
4.1.1 Background<br />
4.1.2 Methodology<br />
4.1.3 Quality Metrics Pipeline<br />
4.1.4 Systematic Analysis<br />
4.1.5 Examples<br />
4.1.6 Findings<br />
4.1.7 Directions for Further Research<br />
4.1.8 Limitations<br />
4.1.9 Conclusion and Future Work<br />
4.2 Visual Cluster Separation Factors: Sketching a Taxonomy<br />
4.2.1 Introduction<br />
4.2.2 Method<br />
4.2.3 Visual Cluster Separation Taxonomy<br />
4.2.4 Discussion and Further Research<br />
In a number of recent papers, different quality metrics have been proposed to automate<br />
the demanding search through large spaces of alternative visualizations (e.g., alternative<br />
projections or orderings), allowing the user to concentrate on the most promising<br />
visualizations suggested by the quality metrics. Over the last decade, this approach has<br />
witnessed a remarkable development; however, few reflections exist on how these methods<br />
are related to each other and how the approach can be developed further. For this<br />
purpose, <strong>in</strong> Section 4.1 we provide an overview <strong>of</strong> approaches that use quality metrics <strong>in</strong><br />
high-dimensional data visualization and propose a systematization based on a thorough<br />
literature review. We carefully analyze the papers and derive a set <strong>of</strong> factors for discrim<strong>in</strong>at<strong>in</strong>g<br />
the quality metrics, visualization techniques, and the process itself. The process is<br />
described through a reworked version <strong>of</strong> the well-known <strong>in</strong>formation visualization pipel<strong>in</strong>e.<br />
We demonstrate the usefulness <strong>of</strong> our model by apply<strong>in</strong>g it to several exist<strong>in</strong>g approaches<br />
that use quality metrics, and we provide reflections on implications <strong>of</strong> our model for future<br />
research.<br />
Another aspect worth investigating in the context of quality metrics is<br />
their ability to detect different types of structures in high-dimensional data. In Section 4.2<br />
we present the results of an in-depth qualitative evaluation of two cluster separation measures.<br />
This evaluation concentrates on scatterplot visualizations (2D, 3D, and SPLOM)<br />
and the most popular task, clustering. The qualitative data study converged into a taxonomy<br />
of visual cluster separation factors for scatterplots, and we briefly report on the<br />
results in that section. Beyond that, the outcome of the study is used to describe possible<br />
next steps that we deem important to advance the research in this area.<br />
Parts <strong>of</strong> this chapter appeared <strong>in</strong> the follow<strong>in</strong>g publications [27, 122].<br />
4.1 Quality Metrics <strong>in</strong> <strong>High</strong>-<strong>Dimensional</strong> <strong>Data</strong> <strong>Visual</strong>ization<br />
The extraction <strong>of</strong> relevant and mean<strong>in</strong>gful <strong>in</strong>formation out <strong>of</strong> high-dimensional data is<br />
notoriously complex and cumbersome. The curse of dimensionality is a popular<br />
umbrella term for the whole set of troubles encountered in high-dimensional data analysis:<br />
finding relevant projections, selecting meaningful dimensions, and getting rid of noise<br />
are only a few of them. Multi-dimensional data visualization also carries its own set of<br />
challenges, above all the limited capability of any technique to scale to more than a<br />
handful of data dimensions.<br />
Researchers have been try<strong>in</strong>g to solve these problems through a number <strong>of</strong> automatic<br />
data analysis and visualization approaches that cover the whole spectrum <strong>of</strong> possibilities:<br />
from fully automatic to fully interactive. Visualization researchers discovered early<br />
on that searching for interesting patterns in this kind of data can be done through a mixed<br />
approach, where the machine, guided by quality metrics, automatically searches through a<br />
large number of potentially interesting projections, and the user interactively steers the<br />
process and explores the output through visualization.<br />
The pioneering work of Friedman and Tukey in 1974 introduced the idea with their<br />
projection pursuit method [54]. They recognized the limits of human beings in exploring<br />
the exponential set <strong>of</strong> projections and tackled the high-dimensionality issue by lett<strong>in</strong>g an<br />
algorithm discover <strong>in</strong>terest<strong>in</strong>g l<strong>in</strong>ear projections <strong>in</strong> 1D (histograms) and 2D (scatterplots)<br />
and lett<strong>in</strong>g the user evaluate the correspond<strong>in</strong>g output.<br />
During the last few years this paradigm has attracted growing interest,<br />
and an increasing number of techniques have been published in key data visualization<br />
conferences and journals. Quality metrics have been used for very disparate goals such as<br />
searching for interesting projections, reducing clutter, and finding meaningful abstractions.<br />
However, the initial idea of quality metrics has been elaborated and expanded so much<br />
and in so many different directions that it is hard to come up with a coherent and<br />
unified picture of them. A reader of one of these papers may well appreciate the value of<br />
a s<strong>in</strong>gle technique without hav<strong>in</strong>g a way to place it <strong>in</strong>to a larger context. Also, researchers<br />
who might want to approach this area <strong>of</strong> <strong>in</strong>vestigation for the first time and develop new<br />
techniques may have a hard time appreciat<strong>in</strong>g the whole spectrum <strong>of</strong> possibilities and<br />
directions related to the use <strong>of</strong> quality metrics.<br />
In this section, we take first steps towards filling this gap. We provide a systematization<br />
of the use of quality metrics in high-dimensional data analysis through a literature review.<br />
We analyzed numerous papers containing quality metrics and went through an iterative<br />
process that led to the definition of a number of factors and a quality metrics pipeline,<br />
which is inspired by the traditional information visualization pipeline [36].<br />
The extracted factors and the pipel<strong>in</strong>e have the follow<strong>in</strong>g <strong>in</strong>terrelated goals:<br />
1. putt<strong>in</strong>g the exist<strong>in</strong>g methods <strong>in</strong>to a common framework;<br />
2. eas<strong>in</strong>g the generation <strong>of</strong> new research <strong>in</strong> the field;<br />
3. spott<strong>in</strong>g relevant gaps to bridge with future research.<br />
In this section, we provide an extensive explanation of the methodology we followed,<br />
the results we obtained, and their practical use. In particular, we demonstrate through<br />
a number of selected examples how existing approaches can be described<br />
by the proposed models. We also identify a number of interesting gaps and give<br />
guidelines on how to carry out new research in this area. To the best of our knowledge,<br />
despite the numerous techniques that can be categorized under the umbrella of<br />
quality-metrics-driven visualization, this is the first attempt in this direction.<br />
Def<strong>in</strong>itions<br />
In order to make the goal and scope <strong>of</strong> our work clear, we provide some <strong>in</strong>itial def<strong>in</strong>itions.<br />
Information Visualization Pipeline: a reference model that describes how to transform<br />
data into visualizations through a series of processing steps, as defined in [36].<br />
Quality Metric: a metric calculated at any stage <strong>of</strong> the <strong>in</strong>formation visualization<br />
pipel<strong>in</strong>e that captures properties useful to extract mean<strong>in</strong>gful <strong>in</strong>formation about the data.<br />
(Please note that we use the terms metric and measure as synonyms <strong>in</strong> this thesis.)<br />
<strong>High</strong>-<strong>Dimensional</strong> <strong>Data</strong>: any data set with a dimensionality that is too high to<br />
easily extract mean<strong>in</strong>gful relations across the whole set <strong>of</strong> dimensions. In the context <strong>of</strong><br />
this thesis, any dimensionality higher than 10 is considered high-dimensional.<br />
Our focus is on the analysis <strong>of</strong> methods that apply quality metrics at any stage <strong>of</strong> the<br />
<strong>in</strong>formation visualization pipel<strong>in</strong>e as a way to facilitate the detection and presentation <strong>of</strong><br />
<strong>in</strong>terest<strong>in</strong>g patterns <strong>in</strong> high-dimensional data.<br />
Examples<br />
We first discuss a few short examples of the approaches covered in our review to familiarize<br />
the reader with the concepts presented in this section and to convey a feeling for their<br />
heterogeneity. They cover a broad selection of the factors, denoted in italics, which will<br />
be presented in detail in Section 4.1.4.<br />
Example 1<br />
We start with a familiar example presented in Section 3.1.6 and published in [133], where<br />
high-dimensional data sets are analyzed by computing an interestingness score for every<br />
scatterplot generated from all possible combinations of axis pairs of the original<br />
data. The score is calculated by running image processing algorithms on each scatterplot<br />
in order to detect plots that show clusters. The system returns a<br />
list of scatterplots, such as those presented in Figure 4.1, sorted in order of relevance according<br />
to the chosen quality measure.<br />
[Figure 4.1 panel title: Best ranked views using CDM; panel labels: 100, 97, 84]<br />
Figure 4.1: (Top row of Figure 3.8) Ranking projections according to the Class Density Measure,<br />
favoring projections with minimal overlap between predefined classes (i.e., the colors) [133].<br />
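The ranking step of Example 1 can be sketched as follows. The sketch assumes a generic scoring function that receives the two columns of an axis pair; the absolute Pearson correlation used here is only a stand-in chosen for brevity and is not the image-based Class Density Measure of [133]. All names and data values are hypothetical.<br />

```python
from itertools import combinations
from math import sqrt


def abs_pearson(xs, ys):
    """Stand-in interestingness score: absolute Pearson correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear structure
    return abs(cov / sqrt(var_x * var_y))


def rank_scatterplots(columns, score):
    """Score every axis pair and return ((axis_a, axis_b), score), best first."""
    scored = [
        ((a, b), score(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical three-dimensional data set.
data = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, -1.0, 1.0, -1.0],
}
ranking = rank_scatterplots(data, abs_pearson)
```

Because the scoring function is passed in as a parameter, a different quality measure could be substituted without changing the ranking loop.<br />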
Example 2<br />
Peng et al. <strong>in</strong> [112] provide algorithms to reorder the axes <strong>of</strong> multidimensional data visualizations<br />
(parallel coord<strong>in</strong>ates, scatterplot matrices, glyphs, recursive patterns) <strong>in</strong> order<br />
to reduce clutter and make <strong>in</strong>terest<strong>in</strong>g patterns more clearly visible. For each visualization<br />
a specific quality metric calculated in the data space is used to find the best ordering. In<br />
Figure 4.2, we present an example of scatterplot matrix reordering.<br />
Figure 4.2: Clutter reduction achieved through axes reorder<strong>in</strong>g <strong>in</strong> a scatterplot matrix (<strong>in</strong>itial<br />
visualization on the left, reordered on the right) [112].<br />
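Metric-driven axis reordering can be sketched as follows, assuming that a pairwise similarity score between dimensions has already been computed in the data space. The exhaustive search and the integer scores are illustrative assumptions, not the algorithms or clutter metrics of [112]; an exhaustive search is only feasible for a handful of axes.<br />

```python
from itertools import permutations


def best_axis_order(axes, similarity):
    """Return the axis order that maximizes the summed similarity of neighbours."""
    def total(order):
        # Sum the similarity of each pair of adjacent axes in this order.
        return sum(similarity[frozenset(pair)] for pair in zip(order, order[1:]))

    return max(permutations(axes), key=total)


# Hypothetical similarity scores (0 to 10) between four dimensions.
axes = ["w", "x", "y", "z"]
similarity = {
    frozenset(("w", "x")): 9,
    frozenset(("w", "y")): 1,
    frozenset(("w", "z")): 2,
    frozenset(("x", "y")): 8,
    frozenset(("x", "z")): 3,
    frozenset(("y", "z")): 7,
}
order = best_axis_order(axes, similarity)
```

An order and its reverse always receive the same total, so either of the two may be considered the best ordering.<br />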
Example 3<br />
Johansson et al. in [80] study the abstraction obtained by applying sampling or aggregation<br />
algorithms on top of parallel coordinates and provide quality metrics to judge when the<br />
abstraction disrupts relevant patterns in the data. In Figure 4.3 we show an example from<br />
their work, where on the left the data set containing 16384 items is displayed with parallel<br />
coordinates. On the right side they display an image targeting a visual quality of 0.95<br />
(on a scale from 0 to 1) by displaying only 987 items. The image quality is calculated by a<br />
screen metric using distance transforms.<br />
Figure 4.3: Data abstraction algorithm based on sampling, aiming at reducing data size while<br />
preserving relevant patterns. Original visualization on the left with 16384 data items. Sampled<br />
visualization on the right with 987 items and a visual quality of 0.95 [80].<br />
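The idea of reducing a data set until a target quality is just met can be sketched as follows. The sketch assumes a simple stand-in quality function (the share of occupied value bins that the sample preserves) and systematic sampling of every k-th item; it does not reproduce the distance-transform-based screen metric of [80]. All names are hypothetical.<br />

```python
def bin_coverage(sample, full, bins=10):
    """Stand-in quality in [0, 1]: share of occupied bins of `full` hit by `sample`."""
    lo, hi = min(full), max(full)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data

    def occupied(values):
        # Map each value to its bin index; the maximum falls into the last bin.
        return {min(int((v - lo) / width), bins - 1) for v in values}

    full_bins = occupied(full)
    return len(occupied(sample) & full_bins) / len(full_bins)


def smallest_sample(full, target, quality=bin_coverage):
    """Try ever denser systematic samples until the quality target is met."""
    for step in range(len(full), 0, -1):
        sample = full[::step]
        if quality(sample, full) >= target:
            return sample
    return list(full)  # only reached if the target cannot be met at all


# Hypothetical one-dimensional data set with 100 items.
data = list(range(100))
sample = smallest_sample(data, target=0.95)
```

With this toy data the search stops at every eleventh item, so ten of the 100 items suffice to cover all occupied bins.<br />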
All the approaches have in common that they use quality metrics in the context of<br />
high-dimensional data visualization; nonetheless, they differ in a variety of aspects.<br />
For instance, in Example 1 the purpose is to find interesting projections, in Example 2<br />
the purpose is to reduce clutter, whereas the purpose in Example 3 is to find the right<br />
abstraction level. The approaches also differ in a number of other aspects, such<br />
as the visualization techniques employed, the space in which the quality metrics are<br />
calculated, and the level of interaction they provide.<br />
Therefore the questions are:<br />
Q1. How can we put all the approaches into a common framework that is able to<br />
highlight commonalities and differences?<br />
Q2. What are the main factors through which we can describe them?<br />
Q3. How can we learn from the approaches and build on top of them to systematically<br />
move the idea of quality-metrics-driven visualization forward?<br />
These are the ma<strong>in</strong> questions that motivate our work, and <strong>in</strong> the follow<strong>in</strong>g sections we<br />
will provide the results <strong>of</strong> our <strong>in</strong>vestigation.<br />
4.1.1 Background<br />
While other areas also deal with quality metrics (see Section 2.3.2), we decided to focus<br />
on the use of quality metrics in high-dimensional data exploration only. Our initial data<br />
gathering process included a broader class of papers, including those cited in Section 2.3.2.<br />
However, we soon realized that there is no all-encompassing model that is able to synthesize the<br />
relevant aspects and, at the same time, is useful in practice. For this reason, here we<br />
focus only on the use of quality metrics in high-dimensional data.<br />
There exist a number <strong>of</strong> research papers which try to categorize exist<strong>in</strong>g work <strong>in</strong> the<br />
visualization area. We briefly mention some recent ones to put our work <strong>in</strong> a larger context.<br />
In Reth<strong>in</strong>k<strong>in</strong>g <strong>Visual</strong>ization [138] Tory and Möller provide a taxonomy to describe scientific<br />
and <strong>in</strong>formation visualization under the same structure. Ellis and Dix organize a large<br />
number <strong>of</strong> exist<strong>in</strong>g clutter reduction techniques <strong>in</strong>to a clutter reduction taxonomy [49].<br />
Yi et al. review a large number <strong>of</strong> visualization systems to better understand the role <strong>of</strong><br />
interaction in visualization [160]. Segel and Heer analyze a large body of storytelling<br />
visualizations to identify common design patterns [123]. All these papers share with our<br />
work the need to bring some order into a complex aspect of data visualization by starting<br />
from a detailed analysis of what researchers and practitioners have proposed in the past.<br />
Since our proposed systematization uses a data visualization pipeline as the basis for<br />
the analysis of quality metrics, we deem it important to briefly discuss existing data processing<br />
pipelines. The information visualization pipeline was presented by Card et al. [36]<br />
and is widely accepted as the standard processing model for information visualization. The<br />
pipeline transforms data through the following stages: raw data, data tables, visual<br />
structures, and views. Between consecutive stages an operator is applied, respectively: data transformation,<br />
visual mapping, and view transformation. The Data State Reference model [39] is<br />
largely based on the <strong>in</strong>formation visualization pipel<strong>in</strong>e and classifies visualizations accord<strong>in</strong>g<br />
to how they use the operators <strong>in</strong> the pipel<strong>in</strong>e. In this regard it is similar to our work<br />
<strong>in</strong> that we also use elements <strong>of</strong> the pipel<strong>in</strong>e to classify the papers we have analyzed. The<br />
KDD pipel<strong>in</strong>e [51] has been developed <strong>in</strong> the early n<strong>in</strong>eties to describe the data process<strong>in</strong>g<br />
stages <strong>in</strong>volved <strong>in</strong> knowledge discovery. The data goes through several stages (selection,<br />
pre-process<strong>in</strong>g, transformation, data m<strong>in</strong><strong>in</strong>g, <strong>in</strong>terpretation/evaluation) lead<strong>in</strong>g to a f<strong>in</strong>al<br />
stage <strong>of</strong> knowledge generation. While we took <strong>in</strong>spiration from this model, as quality metrics<br />
<strong>in</strong>volve automatic computation and visualization, we decided not to use it as a basis<br />
for our work because visualization does not explicitly appear <strong>in</strong> the <strong>in</strong>termediary steps <strong>of</strong><br />
the process. Keim et al. [88] and Bert<strong>in</strong>i et al. [23] present alternative pipel<strong>in</strong>es that show<br />
how automated data analysis algorithms can be <strong>in</strong>cluded <strong>in</strong> the data visualization process.<br />
These papers are also sources <strong>of</strong> <strong>in</strong>spiration for our work as they focus on the <strong>in</strong>tegration<br />
<strong>of</strong> automated algorithms and data visualization.<br />
4.1.2 Methodology<br />
We followed an iterative data gathering, coding, and modeling approach inspired by the<br />
methods used in grounded theory analysis [130]. We started from a small set of papers<br />
about quality metrics we knew from our own experience and used this initial list to derive a<br />
first set of descriptive factors. After that, we expanded the list by analyzing the references<br />
contained in the first set of papers and by searching in relevant visualization venues. In<br />
particular, we used Google Scholar 1 to search for references to and from the collected<br />
papers. We also tried to expand<br />
our list by targeted keyword search, but it did not produce satisfactory results, mainly because<br />
many quality metrics papers do not mention the term “quality metrics” in their text.<br />
At this stage, we decided to narrow down the scope of our study and focus on quality<br />
metrics for high-dimensional data analysis. We discarded the papers that (1) did not explicitly<br />
address high-dimensional data, and (2) did not propose quality metrics systems or<br />
algorithms. For instance, we discarded a number of interesting papers on the use of quality<br />
metrics for generic data visualizations [79], for graph drawing [45], or on<br />
generic aspects of quality metrics [26].<br />
1 http://scholar.google.com/<br />
The first two authors 2 went independently through the current list of papers,<br />
completed a table with the current version of the classification, and took notes on necessary<br />
modifications/additions to accommodate new aspects discovered during the analysis. After<br />
this first phase the two lists and the notes were compared in order to reach a consensus<br />
on table factors and paper coding. The third author 3 played the devil’s advocate role at<br />
this stage to confirm that the factors were explanatory, understandable, and relevant. A third<br />
set of additional papers was gathered and coded at this point to test the classification<br />
further.<br />
We then proceeded to the definition of a visualization pipeline able to capture the<br />
data visualization processes described in the papers. We started from the traditional<br />
information visualization pipeline [36] because it is widely known and helps capture key<br />
elements of quality-metrics-driven visualizations (details in Section 4.1.3).<br />
We generated the quality metrics pipel<strong>in</strong>e iteratively us<strong>in</strong>g the set <strong>of</strong> gathered papers<br />
and the descriptive table with quality metrics factors as reference. In particular, (1) we<br />
built a first draft <strong>of</strong> the new pipel<strong>in</strong>e; (2) we went through the whole list <strong>of</strong> papers and<br />
checked whether the pipel<strong>in</strong>e was able to describe every aspect <strong>in</strong>volved <strong>in</strong> the process; (3)<br />
where discrepancies were found, we ref<strong>in</strong>ed the pipel<strong>in</strong>e accord<strong>in</strong>gly. As a f<strong>in</strong>al step, we<br />
double-checked that every paper <strong>in</strong> the list could be described by a specific <strong>in</strong>stance <strong>of</strong> the<br />
pipeline. As in the first phase, we let one of the authors who was<br />
not involved in the model generation phase 3 again play devil’s advocate and refine the<br />
model at intermediary steps. The work on the pipeline also led to small adjustments<br />
that resulted in the final version of the quality metrics table (Table 4.2).<br />
It is important to note that, while we followed a systematic approach, there is no<br />
guarantee that this is the only way to describe quality metrics and their use. Many <strong>of</strong><br />
the elements <strong>in</strong>troduced <strong>in</strong> the proposed models are the result <strong>of</strong> our own experience<br />
and are thus necessarily subjective. Nonetheless, the usefulness <strong>of</strong> the proposed model is<br />
demonstrated by its ability to describe the whole set <strong>of</strong> papers and to identify relevant<br />
gaps <strong>in</strong>terest<strong>in</strong>g for future research.<br />
4.1.3 Quality Metrics Pipel<strong>in</strong>e<br />
We briefly recall the main elements of Card et al.’s pipeline [36] and then move<br />
on to the description of our extensions.<br />
The orig<strong>in</strong>al purpose <strong>of</strong> the <strong>in</strong>fovis pipel<strong>in</strong>e was to model the ma<strong>in</strong> steps required to<br />
transform data <strong>in</strong>to <strong>in</strong>teractive visualizations. The quality metrics pipel<strong>in</strong>e <strong>in</strong> Figure 4.4<br />
preserves its ma<strong>in</strong> elements: process<strong>in</strong>g steps (horizontal arrows), stages (boxes), and<br />
user feedback (with a few naming differences that we explain below). Data transformation<br />
transforms data into the desired format. Visual mapping maps data structures into visual<br />
structures (visualization axes, marks, graphical properties). View transformation creates<br />
rendered views out of the visual structures. The whole set of transformations is influenced<br />
by the user, who can decide at any time to transform the data (e.g., filter), use different<br />
visual structures, and navigate the visualization through different viewpoints.<br />
2 Enrico Bert<strong>in</strong>i and myself.<br />
3 Daniel Keim.
72 Chapter 4. A Systematization <strong>of</strong> Quality Metrics <strong>in</strong> <strong>High</strong>-<strong>Dimensional</strong> <strong>Data</strong> <strong>Visual</strong>ization<br />
[Figure 4.4 diagram: Source Data → Data Transformation → Transformed Data → Visual Mapping → Visual Structures → View Transformation / Rendering → Views, with a Quality-Metrics-Driven Automation layer spanning the top of the pipeline]<br />
Figure 4.4: Quality metrics pipeline. The pipeline provides an additional layer, named quality-metrics-driven<br />
automation, on top of the traditional information visualization pipeline [36]. The<br />
layer obtains information from the stages of the pipeline (the boxes) and influences the processes<br />
of the pipeline through the metrics it calculates. The user is always in control.<br />
The infovis pipeline captures extremely well the key elements of interactive visualization<br />
across a variety of domains and visual techniques. However, when we focus on<br />
the visualization of high-dimensional data patterns, a practical problem arises. While the<br />
whole set of processes is still valid, the number of possible combinations at each step is<br />
so high that it is impractical to find the most effective ones interactively. An example in<br />
the spirit of Mackinlay’s seminal analysis [99] helps to clarify the problem: if the original<br />
data has dimensionality n = 10 (still quite a low number) and the number of available<br />
visual parameters is k = 4 (e.g., a scatterplot with the following visual primitives: x-axis,<br />
y-axis, size, and color; see Figure 4.5), the number of alternative mappings at the visual<br />
mapping stage is already more than 5000 (k-permutations, i.e., the number of sequences<br />
without repetition: $\frac{n!}{(n-k)!} = \frac{10!}{6!} = 5040$).<br />
Figure 4.5: Mapping a 10-dimensional data set to a scatterplot with four visual primitives (x-axis,<br />
y-axis, size, and color) has over 5000 possible alternative mappings.<br />
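The count quoted above can be checked with a few lines of Python. This is only a minimal sketch: the dimension names d0 to d9 are hypothetical placeholders, and math.perm requires Python 3.8 or later.<br />

```python
from itertools import permutations
from math import perm

# Hypothetical dimension names (d0..d9); the four visual primitives come from the text.
dimensions = [f"d{i}" for i in range(10)]
primitives = ("x-axis", "y-axis", "size", "color")

# Closed form for k-permutations: n! / (n - k)!
n_mappings = perm(len(dimensions), len(primitives))

# Explicit enumeration: every ordered choice of four distinct dimensions is one mapping.
mappings = [dict(zip(primitives, combo))
            for combo in permutations(dimensions, len(primitives))]

print(n_mappings)     # 5040
print(len(mappings))  # 5040
print(mappings[0])    # the first mapping assigns d0..d3 to the four primitives
```

Enumerating the mappings is cheap here, but the count grows quickly with n and k, which is why the selection is hard to do interactively.<br />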
The main function of quality metrics algorithms is to aid the user in the selection of<br />
promising combinations. Typically, the algorithms search through large sets of possibilities<br />
and suggest one or more solutions to be evaluated by the user. To describe these steps we<br />
created an additional layer in Figure 4.4 that we call quality-metrics-driven automation,<br />
which depicts how quality metrics fit into the process. The metrics draw information from<br />
the stages of the pipeline (green upwards arrows) and influence the processing steps (blue<br />
downwards arrows) with their computation. The user remains in control of the whole<br />
process, letting the machine perform the computationally hard tasks. We named the new<br />
pipeline the quality metrics pipeline.<br />
The generation of alternatives and their evaluation is at the core of the<br />
method. Regardless of the purpose, all the systems we have encountered follow a common<br />
general pattern:<br />
1. Create alternatives (projections, mappings, etc.);<br />
2. Evaluate alternatives (rank views, orderings, etc.);<br />
3. Produce a final representation (ranked list of views, small multiples, etc.).<br />
As we will show in Section 4.1.5, systems with disparate purposes can be described by<br />
this same model.<br />
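The three-step pattern above can be sketched in a few lines of Python. This is an illustrative skeleton only, not the implementation of any of the reviewed systems: the function names, the toy data, and the placeholder metric (absolute sample covariance) are all assumptions made for the example.<br />

```python
from itertools import combinations

def create_alternatives(columns):
    """Step 1: enumerate all axis-aligned 2D projections (pairs of dimensions)."""
    return list(combinations(sorted(columns), 2))

def evaluate_alternatives(data, alternatives, metric):
    """Step 2: score every alternative with the chosen quality metric."""
    return [(metric(data[a], data[b]), (a, b)) for a, b in alternatives]

def final_representation(scored, top_k):
    """Step 3: produce a ranked list holding the top_k best-scoring views."""
    return [view for _score, view in sorted(scored, reverse=True)[:top_k]]

def abs_covariance(xs, ys):
    """Placeholder metric (an assumption): absolute sample covariance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return abs(cov)

# Toy tabular data with three hypothetical dimensions.
data = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, -1.0, 1.0, -1.0],
}

alternatives = create_alternatives(data)
scored = evaluate_alternatives(data, alternatives, abs_covariance)
ranked = final_representation(scored, top_k=2)
print(ranked)  # [('a', 'b'), ('b', 'c')]
```

Keeping the metric as a parameter mirrors the idea that the user steers the process: swapping the metric or changing top_k changes the final representation without touching the other steps.<br />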
Processing<br />
In the following we provide details about specific features of the processing steps of the<br />
quality metrics pipeline.<br />
1. Data Transformation (source data → transformed data). In the original pipeline<br />
the main role of this step is to put the data into a tabular format, hence the original<br />
name tabular data for its output. Since here we focus on high-dimensional data,<br />
we assume the source data to be already in a tabular format and we rename the output<br />
transformed data. At this stage data transformation is responsible for the generation<br />
of alternative data subsets or derivations. Common operations include feature<br />
selection, projection, aggregation, and sampling.<br />
2. Visual Mapping (transformed data → visual structures). Visual mapping is the<br />
core stage of the pipeline, where data dimensions are mapped to visual features to<br />
form visual structures. Distinct mappings of data features to visual features provide<br />
alternatives that can again be evaluated in terms of quality metrics. The most<br />
common type of operation at this stage is the generation of orderings, i.e., assigning<br />
data dimensions to visualization axes in different orders. In general, alternatives can<br />
be generated by considering the full set of visual features (e.g., color, size, shape).<br />
3. Rendering/View Transformation (visual structures → views). Rendering transforms<br />
visual structures into views by specifying graphical properties that turn these<br />
structures into pixels. We added the word Rendering to the pipeline to emphasize<br />
the role of the image space: many quality metrics are calculated directly in the<br />
image space, considering the pixels generated in the visualization process. At this<br />
stage alternative views of the same structures can be generated automatically. Surprisingly,<br />
as we discuss in Section 4.1.6, this stage is rarely used in the context of our<br />
inquiry.<br />
Quality Metrics Computation<br />
Quality metrics can draw information from any of the stages of the pipeline. As we describe<br />
later in Section 4.1.4, quality metrics can be calculated in the data space, the image space, or<br />
a combination of the two. Metrics calculated at the view stage draw information from the<br />
rendered image, whereas the others draw information from the data space (and, in a few cases,<br />
from elements of the visual structures). Many different kinds of metrics are possible. Our<br />
analysis of quality metrics features in Section 4.1.4 provides numerous additional details.<br />
Quality Metrics Influence<br />
As described above, quality metrics algorithms generate alternatives and organize them<br />
into a final representation. At the data processing stage they can, for instance, generate<br />
1D, 2D, or nD projections (e.g., [52, 59, 126]), data samples (e.g., [24, 80]), or alternative<br />
aggregates (e.g., [42]). At the visual mapping stage the layer generates alternative orderings<br />
or mappings between data and visual properties (e.g., [112, 120]). At the view stage<br />
the layer can generate modifications of the current view, such as changing the point of view,<br />
highlighting specific items, or distorting the visual space (e.g., [8]).<br />
User Influence<br />
The quality metrics layer is not meant to replace the user with the machine.<br />
While users can always influence all the stages of the pipeline, their main responsibility<br />
becomes to steer the process, e.g., by setting quality metrics parameters, and to explore<br />
the resulting views. It is worth noting that the process is not necessarily a linear flow<br />
through the steps. As will be evident from the examples in Section 4.1.5, in many cases<br />
complex iteration takes place.<br />
4.1.4 Systematic Analysis<br />
Through our paper review we identified two main areas of investigation. First, we classify<br />
the papers according to quality metrics criteria that help explain their key features.<br />
Second, we provide a more detailed categorization of the visualization techniques we have<br />
come across.<br />
Quality Metrics<br />
We identified a number of factors that describe the methods encountered in the<br />
literature review. Each factor has a number of possible values and each paper can assume<br />
one or more of these values (see Table 4.2).<br />
In the following, we describe the main factors we extracted from our analysis.<br />
What is measured<br />
This factor describes what is measured by the quality metric. In our analysis we have<br />
grouped the metrics into the following categories:<br />
Clustering metrics measure the extent to which the visualization or the data contain<br />
groupings, that is, well-separated clusters that can be easily identified. Clustering is loosely<br />
defined because we have encountered many alternative approaches. It is worth keeping in<br />
mind that by clustering we mean here any measure in the data or image space that<br />
is able to capture groupings.<br />
Correlation relates to two or more data dimensions and captures the extent to which<br />
systematic changes in one dimension are accompanied by changes in other dimensions.<br />
Simple Pearson correlation between two variables is one of the most commonly used<br />
measures in this category, but global correlation among multiple data dimensions is also<br />
used [82].<br />
Outlier metrics capture the extent to which the data segment under inspection contains<br />
elements that behave differently from the large majority of the data, i.e., outliers.<br />
Complex patterns metrics capture aspects that cannot be easily categorized as any of<br />
the classes described above. We detected a number of papers with such measures and<br />
grouped all of them in this class. An example is Graph-Theoretic Scagnostics [151], a<br />
technique that characterizes scatterplots with features like “stringy” or<br />
“skinny”.<br />
Image quality refers to metrics whose purpose is not necessarily to find specific<br />
patterns but rather to identify the degree of organization of a visualization or, as some of<br />
the papers call it, the amount of clutter.<br />
Feature preservation metrics focus on the comparison between a reference state and<br />
the representation in the visualization, or between the features in the data and the visualization,<br />
with the intent to preserve the features of interest as much as possible. A<br />
subset of these papers focuses on classified data, searching for projections where the original<br />
classes are well separated [129, 133]. In the same category we can find papers that<br />
measure the information loss due to data abstraction techniques such as sampling and<br />
aggregation [24, 42, 80].<br />
It is worth noticing that in this categorization we classified the techniques according<br />
to their main target. This, however, does not prevent a metric of one type from also detecting<br />
patterns of another type. For instance, clustering and correlation, as well as complex<br />
patterns and image quality, may have such an overlap.<br />
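As a concrete instance of the correlation category, the following sketch computes the Pearson coefficient for every pair of dimensions and ranks the pairs by its absolute value. The table contents and column names are made up for illustration, and treating a constant dimension as uncorrelated is a convention chosen here, not something prescribed by the reviewed papers.<br />

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant dimension counts as uncorrelated
    return sxy / sqrt(sxx * syy)

# Hypothetical data table; names and values are illustrative only.
table = {
    "height": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight": [50.0, 58.0, 69.0, 80.0, 93.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
}

# Rank all pairs of dimensions by the strength of their linear relationship.
ranking = sorted(
    ((abs(pearson(table[a], table[b])), a, b) for a, b in combinations(table, 2)),
    reverse=True,
)
for score, a, b in ranking:
    print(f"{a} vs {b}: |r| = {score:.3f}")
```

The absolute value is used because a strong negative correlation is as visually salient in a scatterplot as a strong positive one.<br />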
Where it is measured (data/image space)<br />
In our review we have found a completely mixed set of approaches with respect to where<br />
the metrics are calculated: data space or image space. Metrics calculated in data space<br />
detect data features directly in the data without using information from the view that<br />
will be used to display the results. For instance, the Rank-by-Feature technique [126]<br />
ranks 1D and 2D projections according to a number of statistical properties calculated<br />
only in data space. Metrics calculated in image space bypass the analysis of the data and<br />
work directly on the rendered image. Often these methods employ sophisticated image<br />
processing techniques, like our work presented in Section 3.1.2 and [133], where interesting<br />
scatterplots are ranked using a Hough transformation. A mixed-space approach, where<br />
both data and image space are used at the same time, is also possible. We found<br />
two distinct cases. Bertini and Santucci [24] present a measure to compare features in<br />
the data space to features in the image space, with the intent of preserving as many<br />
data features as possible in the final image. Peng et al. [112] measure clutter in relation to<br />
the ordering of visualization axes: these calculations need data features (outliers, correlations)<br />
and visualization features (e.g., axes adjacency) at the same time. Please note that<br />
the entries in Table 4.2 where both data and image space are present do not necessarily<br />
imply the use of the aforementioned mixed approach. More often, they simply mean that<br />
alternative approaches co-exist in the context of the same paper.<br />
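To make the notion of an image-space metric concrete, the sketch below rasterizes a scatterplot onto a pixel grid and measures how many points are hidden by overplotting. It is a deliberately simple stand-in and not the measure of [24] or the Hough-based ranking of [133]. The function names, the toy points, and the assumption that coordinates are already normalized to [0, 1] are all illustrative choices.<br />

```python
def rasterize(points, width, height):
    """Count how many points fall into each pixel of a width x height image.

    Assumes the (x, y) coordinates are already normalized to [0, 1].
    """
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        col = min(int(x * width), width - 1)
        row = min(int(y * height), height - 1)
        grid[row][col] += 1
    return grid

def overplotting_ratio(grid):
    """Fraction of points that share a pixel with an earlier point (hidden marks)."""
    total = sum(sum(row) for row in grid)
    visible = sum(1 for row in grid for count in row if count > 0)
    return (total - visible) / total if total else 0.0

# Toy points: the first two are close together in data space.
points = [(0.10, 0.10), (0.12, 0.11), (0.50, 0.50), (0.90, 0.20)]

coarse = overplotting_ratio(rasterize(points, 4, 4))
fine = overplotting_ratio(rasterize(points, 100, 100))
print(coarse)  # 0.25
print(fine)    # 0.0
```

The same data yield different scores at different resolutions, which is exactly what separates an image-space metric from a data-space one: the result depends on the rendered view, not only on the data.<br />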
Purpose<br />
Purpose describes the main reason for using quality metrics, that is, the goal to<br />
be achieved with the metric. We identified the following purposes.<br />
Projection aims at finding subsets of the original dimensions in which interesting patterns<br />
reside, e.g., analyzing all the possible 2D projections of a multidimensional data set<br />
by checking whether interesting groupings exist in a scatterplot.<br />
Ordering aims at finding, where possible, an ordering of the visualization axes that<br />
eases the visual detection of interesting patterns. Parallel coordinates is a classical example<br />
where the order of the axes greatly influences the chances of detecting interesting patterns<br />
in the data.<br />
Abstraction aims at maintaining or controlling a certain degree of data representation<br />
quality when data reduction techniques are used to increase the scalability of a visualization.<br />
Sampling and aggregation are the two main types of abstraction techniques we<br />
encountered. For instance, in [42] the authors propose a data abstraction technique that<br />
makes it possible to measure the information loss due to abstraction and to find a trade-off between<br />
data loss and data reduction.<br />
Visual mapping aims at finding interesting mappings between the original data features<br />
and the visual features of the visualization technique. Features such as color, size or shape<br />
fall into this category.<br />
View optimization aims at modifying parameters of the view with the intent to produce<br />
better visualizations, in which, for example, data segments with a high degree of interest<br />
are highlighted.<br />
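The ordering purpose described above can be illustrated with a brute-force search over axis orders for parallel coordinates. The scoring rule used here (the sum of absolute Pearson correlations between adjacent axes) is an assumption chosen for simplicity and is not the method of [112] or [9]. The data and all names are hypothetical.<br />

```python
from itertools import permutations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for constant input (assumption)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0

def ordering_score(order, data):
    """Sum of |r| over neighboring axes: higher means more related axes sit side by side."""
    return sum(abs(pearson(data[a], data[b])) for a, b in zip(order, order[1:]))

def best_ordering(data):
    """Exhaustively search all axis orders; an order and its mirror image are equivalent."""
    candidates = (p for p in permutations(data) if p[0] < p[-1])
    return max(candidates, key=lambda p: ordering_score(p, data))

# Hypothetical four-dimensional data set.
data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.0, 2.0, 3.0, 4.0, 6.0],
    "c": [2.0, 1.0, 4.0, 3.0, 5.0],
    "d": [5.0, 1.0, 4.0, 2.0, 3.0],
}

best = best_ordering(data)
print(best, round(ordering_score(best, data), 3))
```

Exhaustive search is feasible only for a handful of axes, since the number of orders grows factorially. For larger data sets some heuristic search would be needed.<br />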
Interaction<br />
The last column of the table indicates which papers offer the possibility to interact with the<br />
quality-metrics-based automation. We extracted two main classes of interaction: threshold<br />
selection and metrics selection. By threshold selection we mean the possibility to set<br />
thresholds in the quality metrics computation mechanism (e.g., the data abstraction level<br />
in [42] or the density estimation smoothing parameter in [52]). By metrics selection we<br />
mean systems in which the user can either switch from one metric to another or combine<br />
them into an integrated one (e.g., [42, 82]). Please note that some of the papers may<br />
contain interaction capabilities and still be marked as not interactive because they do not<br />
provide direct interaction with the quality metrics mechanisms.<br />
<strong>Visual</strong>ization<br />
The original table we have designed to classify the full set of papers (see Table 4.2 below)<br />
contains a rough categorization of visualization techniques into three main classes: scatterplots<br />
(SP), parallel coordinates (PC), and others (which include a fairly large number of<br />
different techniques). While this categorization helps in understanding how these techniques<br />
are distributed over the whole set of papers (SP and PC account for 80% of the total), it does<br />
not say anything about key features of visualization techniques, especially those closely<br />
related to the usage of quality metrics.<br />
We define layout dimensionality as the number of data axes a visualization has. A<br />
data axis is the visualization feature that establishes what position a single visual mark<br />
takes in the visualization. For instance, scatterplots have dimensionality two because they<br />
can accommodate two spatial dimensions.<br />
The visualization techniques are classified into 1D, 2D, 3D, 4D, and nD, where nD<br />
stands for techniques that can accommodate an arbitrary number of dimensions (with<br />
obvious scalability limits when the number of dimensions grows too large).<br />
It is worth noticing that in general every visualization has an additional number of<br />
visual features to which data features can be mapped, e.g., color and size, but here we<br />
focus on the layout because it is the variable that most characterizes every visualization<br />
technique and that has the biggest impact on the use of quality metrics. Table 4.1 shows<br />
the dimensionality of all the techniques we have identified in the review.<br />
The visualization techniques that are not in the nD class necessarily need an additional<br />
mechanism for the analysis of high-dimensional data. Typically, as discussed below, they<br />
are organized in a higher-level structure that accommodates several projections. Those<br />
that can accommodate an arbitrary number of dimensions (nD) all need some kind of<br />
ordering mechanism.<br />
Table 4.1: Visualization techniques categorized by their layout dimensionality (i.e., the number of<br />
axes of the visualization).<br />
Visualization | Layout Dimensionality<br />
histogram | 1D<br />
jigsaw map [150] | 1D<br />
scatterplot | 2D<br />
pixel bar charts [87] | 4D<br />
dimensional stacking [91] | nD<br />
matrix [22] | nD<br />
parallel coordinates [78] | nD<br />
radvis [72] | nD<br />
scatterplot matrix [37] | nD<br />
star glyphs [128] | nD<br />
table lens [115] | nD<br />
While not explicitly discussed in any of the reviewed papers, we have noticed that<br />
often a quality-metrics-driven approach needs some kind of (implicit or explicit) meta-visualization.<br />
By meta-visualization we mean a visualization of visualizations, or more<br />
specifically, a visualization layout strategy that arranges single visualizations in an organized<br />
form. For instance, when a quality-metrics-driven technique produces a number<br />
of interesting scatterplots as an output, there is the need to organize them into a schema<br />
that facilitates their comprehension and analysis (e.g., organized into a list sorted by interestingness).<br />
From our analysis we have identified the following main meta-visualization<br />
strategies:<br />
List: a layout strategy that organizes visualizations in an ordered linear fashion (often<br />
sorted to reflect quality metrics rankings);<br />
Matrix: a layout strategy that organizes visualizations in a grid format, where grid entries<br />
are organized according to some data features (e.g., columns and rows represent data<br />
dimensions); often also called Small Multiples, Trellis, Lattice, or Facets.<br />
It is worth noticing that some basic visualization techniques can be considered meta-visualizations<br />
themselves. A notable example is the scatterplot matrix, which shows a set<br />
of scatterplots organized in a matrix layout.<br />
In general there is a strong interplay between visualizations and meta-visualizations.<br />
As mentioned above, techniques with a fixed dimensionality need to be organized in a<br />
meta-visualization. The meta-visualization influences the ordering of the visualizations<br />
Table 4.2: Quality metrics papers classified according to quality metrics factors (sorted by purpose).<br />
Paper Title <strong>Visual</strong>ization technique What is measured<br />
SP PC other cluster<strong>in</strong>g correlation outliers complex<br />
patterns<br />
What is measured Where it is<br />
image<br />
quality<br />
feature<br />
pres.<br />
Where it is<br />
measured<br />
A Projection Pursuit Algorithm for Exploratory<br />
<strong>Data</strong> Analysis - Friedman & Tukey [54] SP cluster<strong>in</strong>g data projection<br />
data image projection order<strong>in</strong>g abstraction visual<br />
mapp<strong>in</strong>g<br />
space<br />
Purpose Inter-<br />
view<br />
optimization<br />
act-<br />
ion<br />
A Rank-by-Feature Framework for Unsupervised<br />
Multidimensional <strong>Data</strong> Exploration Us<strong>in</strong>g Low<br />
<strong>Dimensional</strong> Projections-Seo & Shneiderman[126]<br />
F<strong>in</strong>d<strong>in</strong>g and <strong>Visual</strong>iz<strong>in</strong>g Relevant Subspaces for<br />
Cluster<strong>in</strong>g <strong>High</strong>-<strong>Dimensional</strong> Astronomical <strong>Data</strong><br />
Us<strong>in</strong>g Connected Morphological Operators**[52]<br />
SP<br />
histogram,<br />
matrix, list<br />
cluster<strong>in</strong>g correlation outliers<br />
complex<br />
patterns<br />
data projection S<br />
SP histogram cluster<strong>in</strong>g image projection T<br />
Graph-Theoretic Scagnostics - Wilk<strong>in</strong>son et al.<br />
[151] SP cluster<strong>in</strong>g outliers<br />
complex<br />
patterns<br />
image projection<br />
Select<strong>in</strong>g good views <strong>of</strong> high-dimensional data<br />
us<strong>in</strong>g class consistency - Sips et al. [129] SP class pres. data projection T<br />
Coord<strong>in</strong>at<strong>in</strong>g computational and visual<br />
approaches for <strong>in</strong>teractive feature selection and<br />
multivariate cluster<strong>in</strong>g - Guo [59]<br />
Explor<strong>in</strong>g <strong>High</strong>-D Spaces with Multiform Matrices<br />
and Small Multiples - MacEachern et al. [98]<br />
Improv<strong>in</strong>g the <strong>Visual</strong> Analysis <strong>of</strong> <strong>High</strong>-<strong>Dimensional</strong><br />
<strong>Data</strong>sets Us<strong>in</strong>g Quality Measures - Albuquerque<br />
et al. [8]<br />
Interactive Hierarchical Dimension Order<strong>in</strong>g,<br />
Spac<strong>in</strong>g and Filter<strong>in</strong>g for Exploration <strong>of</strong> <strong>High</strong><br />
<strong>Dimensional</strong> <strong>Data</strong>sets - Yang et al. [158]<br />
Interactive <strong>Dimensional</strong>ity Reduction Through<br />
User-def<strong>in</strong>ed Comb<strong>in</strong>ations <strong>of</strong> Quality Metrics -<br />
Johansson & Johansson [82]<br />
PC<br />
matrix correlation data projection order<strong>in</strong>g<br />
pixel based<br />
vis., matrix,<br />
small multiples<br />
jigsaw map,<br />
radvis, table<br />
lens<br />
histogram, star<br />
glyphs<br />
Pargnostics: Image-Space Metrics for Parallel<br />
Coord<strong>in</strong>ates - Dasgupta & Kosara [43] PC cluster<strong>in</strong>g correlation<br />
correlation data projection order<strong>in</strong>g<br />
cluster<strong>in</strong>g correlation outliers data image projection order<strong>in</strong>g<br />
correlation data projection order<strong>in</strong>g<br />
visual<br />
mapp<strong>in</strong>g<br />
view<br />
optimization<br />
PC cluster<strong>in</strong>g correlation outliers data projection order<strong>in</strong>g S, T<br />
image<br />
quality<br />
image projection order<strong>in</strong>g S<br />
S, T<br />
Comb<strong>in</strong><strong>in</strong>g automated analysis and visualization<br />
techniques for effective exploration <strong>of</strong> highdimensional<br />
data - Tatu et al. [133]<br />
<strong>High</strong>-<strong>Dimensional</strong> <strong>Visual</strong> <strong>Analytics</strong>: Interactive<br />
Exploration Guided by Pairwise Views <strong>of</strong> Po<strong>in</strong>t<br />
Distributions - Wilk<strong>in</strong>son et al. [152]<br />
Clutter Reduction <strong>in</strong> Multi-<strong>Dimensional</strong> <strong>Data</strong><br />
<strong>Visual</strong>ization Us<strong>in</strong>g Dimension Reorder<strong>in</strong>g - Peng<br />
et al. [112]<br />
Similarity Cluster<strong>in</strong>g <strong>of</strong> Dimensions for an<br />
Enhanced <strong>Visual</strong>ization <strong>of</strong> Multidimensional <strong>Data</strong><br />
- Ankerst et al. [9]<br />
SP PC cluster<strong>in</strong>g correlation<br />
SP PC cluster<strong>in</strong>g outliers<br />
SP PC<br />
PC<br />
star glyphs,<br />
dim. stack<strong>in</strong>g<br />
recursive<br />
pattern, circle<br />
segments<br />
Measur<strong>in</strong>g <strong>Data</strong> Abstraction Quality <strong>in</strong><br />
Multiresolution <strong>Visual</strong>izations - Cui et al. [42] SP PC histogram<br />
correlation outliers<br />
complex<br />
patterns<br />
complex<br />
patterns<br />
image<br />
quality<br />
class pres. data image projection order<strong>in</strong>g<br />
image projection order<strong>in</strong>g<br />
data image order<strong>in</strong>g<br />
correlation data order<strong>in</strong>g<br />
feature<br />
pres.<br />
data abstraction T<br />
Quality Metrics for 2D Scatterplot Graphics:<br />
Automatically Reduc<strong>in</strong>g <strong>Visual</strong> Clutter - Bert<strong>in</strong>i &<br />
Santucci [24]<br />
A Screen Space Quality Method for <strong>Data</strong><br />
Abstraction - Johansson & Cooper [80] PC<br />
SP cluster<strong>in</strong>g<br />
feature<br />
pres.<br />
feature<br />
pres.<br />
data image abstraction<br />
image sampl<strong>in</strong>g<br />
Enabl<strong>in</strong>g Automatic Clutter Reduction <strong>in</strong> Parallel<br />
Coord<strong>in</strong>ate Plots - Ellis & Dix [48] PC<br />
image<br />
quality<br />
image sampl<strong>in</strong>g T<br />
Pixnostics: Towards measur<strong>in</strong>g the value <strong>of</strong><br />
visualization - Schneidew<strong>in</strong>d et al. [120]<br />
jigsaw map,<br />
pixel bar chart<br />
correlation<br />
complex<br />
patterns<br />
data image<br />
visual<br />
mapp<strong>in</strong>g<br />
** Ferdosi et al.<br />
Legend: SP = scatter plot (& matrix), PC = parallel coord<strong>in</strong>ates, feature/class pres. = feature/class preservation, S = select metric, T = set threshold.
4.1.5 Examples 79<br />
and in some cases also the content. For instance, the matrix layout requires that the<br />
visualization within a grid cell corresponds to the data values it represents.<br />
Finally, meta-visualizations can themselves be influenced by quality metrics. All the<br />
layout strategies have some degree of freedom in terms of reordering, and an optimal<br />
reordering (according to some given goal) can only be achieved by searching in the space<br />
of solutions (e.g., as presented in [112]).<br />
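To make the idea of searching the space of reorderings concrete, the following is a minimal sketch of an exhaustive search over axis orderings.<br />
It is not the method of [112]; the function name `best_ordering`, the toy pair scores, and the goal (maximizing the summed score of adjacent pairs) are assumptions chosen for illustration.<br />

```python
from itertools import permutations

def best_ordering(dims, pair_score):
    """Return the ordering of dims that maximizes the summed score of adjacent pairs.

    pair_score is a hypothetical, symmetric quality function on two dimensions.
    """
    best, best_total = None, float("-inf")
    for order in permutations(dims):
        # An ordering and its mirror image are visually equivalent: skip one of them.
        if order[0] > order[-1]:
            continue
        total = sum(pair_score(a, b) for a, b in zip(order, order[1:]))
        if total > best_total:
            best, best_total = list(order), total
    return best, best_total

# Hypothetical symmetric pair scores (for example, absolute correlations).
toy_scores = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.1,
    frozenset(("a", "d")): 0.2,
    frozenset(("b", "c")): 0.8,
    frozenset(("b", "d")): 0.3,
    frozenset(("c", "d")): 0.7,
}
order, total = best_ordering(
    ["a", "b", "c", "d"], lambda x, y: toy_scores[frozenset((x, y))]
)
```

The search visits n! orderings, so it is only feasible for a handful of dimensions; larger problems would need a heuristic in place of the exhaustive loop.<br />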
4.1.5 Examples<br />
In this section, we provide four selected examples from our review to show<br />
how our proposed model can describe existing approaches in this area. We selected the<br />
examples to cover as many interesting aspects as possible. In particular, we<br />
picked papers with different purposes because they guarantee a larger variety of features.<br />
For completeness, we provide all the other quality metrics pipelines in Appendix A.3, in<br />
the same order in which the papers are listed in Table 4.2.<br />
The first example comes from our own work presented in Section 3.1.4 and Section 3.1.5,<br />
published in [133]. The main goal of this work is to find interesting projections of<br />
n-dimensional data using image processing techniques. The section presents several measures,<br />
but here we focus only on the part dealing with parallel coordinates and one specific<br />
metric, the Similarity Measure.<br />
The basic idea of the method is to generate all possible 2D combinations of the original<br />
dimensions and evaluate them in terms of their ability to form clusters in a 2-axis parallel<br />
coordinates representation. Every pair of axes is evaluated individually using a standard<br />
image processing technique (the Hough transform), which makes it possible to discriminate between<br />
uniform and chaotic distributions of line angles and positions (for details on the Hough<br />
transform, please refer back to Figure 3.6). Once interesting pairs have been extracted, they<br />
are joined together to form groups of parallel coordinates of a desired (user-defined) size<br />
(e.g., in Figure 3.14, groups of 4-dimensional parallel coordinates are formed).<br />
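The two steps described above (score every pair of axes, then join good pairs into a group of a chosen size) can be sketched as follows.<br />
This is only an illustration under stated assumptions: `pair_score` is a simplified Hough-like stand-in that bins each line by slope and intercept and measures how concentrated the bins are, not the Similarity Measure of [133]; `chain_pairs` is a hypothetical greedy joining strategy that builds a single group; values are assumed to be normalized to [0, 1].<br />

```python
from itertools import combinations

def pair_score(col_a, col_b, bins=8):
    """Hough-like stand-in: concentration of lines in (slope, intercept) space.

    Between two adjacent axes, a record (a, b) is the line from (0, a) to (1, b),
    so its intercept is a and its slope is b - a. Values are assumed in [0, 1].
    """
    acc = {}
    for a, b in zip(col_a, col_b):
        slope_bin = min(int((b - a + 1.0) / 2.0 * bins), bins - 1)  # slope in [-1, 1]
        icpt_bin = min(int(a * bins), bins - 1)                     # intercept in [0, 1]
        key = (slope_bin, icpt_bin)
        acc[key] = acc.get(key, 0) + 1
    n = len(col_a)
    # Sum of squared cell shares: close to 1 when few cells hold most lines.
    return sum((count / n) ** 2 for count in acc.values())

def chain_pairs(scores, size):
    """Greedily join high-scoring pairs into one axis sequence of the given size."""
    (first, second), _ = max(scores.items(), key=lambda kv: kv[1])
    chain = [first, second]
    while len(chain) < size:
        candidates = []
        for (u, v), s in scores.items():
            for end, prepend in ((chain[0], True), (chain[-1], False)):
                other = v if u == end else (u if v == end else None)
                if other is not None and other not in chain:
                    candidates.append((s, other, prepend))
        if not candidates:
            break
        _, other, prepend = max(candidates, key=lambda c: c[0])
        if prepend:
            chain.insert(0, other)
        else:
            chain.append(other)
    return chain

# Toy data: columns "a" and "b" form two tight line bundles, "c" scatters them.
data = {
    "a": [0.1, 0.1, 0.1, 0.9, 0.9, 0.9],
    "b": [0.1, 0.1, 0.1, 0.9, 0.9, 0.9],
    "c": [0.1, 0.5, 0.9, 0.1, 0.5, 0.9],
    "d": [0.9, 0.9, 0.9, 0.1, 0.1, 0.1],
}
scores = {(p, q): pair_score(data[p], data[q])
          for p, q in combinations(sorted(data), 2)}
group = chain_pairs(scores, 3)
```

A real implementation would evaluate the rendered image of each axis pair rather than the raw values, since the measure discussed here works in image space.<br />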
Figure 4.6 presents the pipeline for this example. We can recognize three main elements:<br />
(A) all 2D parallel coordinates are generated in the data transformation phase;<br />
(B) all the alternatives are evaluated in the image space at the view stage; (C) the algorithm<br />
combines the interesting segments into a list of parallel coordinates (like those in<br />
Figure 3.14) using the visual mapping stage.<br />
[Figure 4.6 diagram. Pipeline labels: Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views. Quality-Metrics-Driven Automation layer with markers A, C, B.]<br />
Figure 4.6: Quality metrics pipeline for the first example from [133]: (A) generation of alternatives;<br />
(B) evaluation of alternatives (image space); (C) creation of the final representation.<br />
The technique uses parallel coordinates (PC) as the principal visualization technique and<br />
a list as a meta-visualization. It measures clustering properties in the image space, and<br />
its main purpose is to find interesting projections. Interaction with the metrics is very<br />
limited, if not absent.
80 Chapter 4. A Systematization <strong>of</strong> Quality Metrics <strong>in</strong> <strong>High</strong>-<strong>Dimensional</strong> <strong>Data</strong> <strong>Visual</strong>ization<br />
The second example comes from the work of Johansson and Johansson on interactive<br />
feature selection [82]. The technique ranks every single dimension by its importance using<br />
a combination of correlation, outlier, and clustering features calculated on the data. This<br />
ranking is used as the basis for an interactive threshold selection tool with which the user<br />
can decide how many dimensions to keep, weighing the choice against the corresponding<br />
information loss presented by the chart (see Figure 4.7). Once the user selects the desired<br />
number of dimensions, the system presents the result with parallel coordinates and automatically<br />
finds a good ordering using the same data features calculated for ranking the<br />
dimensions. The user can also choose different weighting schemes to focus more on correlation,<br />
outliers or clusters. Figure 4.8 shows the results of clustering (top) and correlation<br />
(bottom).<br />
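The ranking and the information-loss chart described above can be illustrated with a small sketch.<br />
The weighted-sum form of the importance value, the metric names, and the toy numbers are assumptions made for illustration; the exact definitions are given in [82].<br />

```python
def combined_importance(metrics, weights):
    """Weighted sum of per-dimension quality values (assumed to lie in [0, 1])."""
    return {
        dim: sum(weights[name] * value for name, value in values.items())
        for dim, values in metrics.items()
    }

def information_loss_curve(importance):
    """Rank dimensions and, for each k, give the share of total importance
    that is lost when only the k top-ranked dimensions are kept."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    total = sum(importance.values())  # assumed to be greater than zero
    curve = {}
    for k in range(1, len(ranked) + 1):
        lost = sum(importance[d] for d in ranked[k:])
        curve[k] = lost / total
    return ranked, curve

# Hypothetical per-dimension quality values for three metrics.
metrics = {
    "d1": {"correlation": 0.9, "outlier": 0.1, "cluster": 0.8},
    "d2": {"correlation": 0.2, "outlier": 0.7, "cluster": 0.1},
    "d3": {"correlation": 0.6, "outlier": 0.2, "cluster": 0.5},
    "d4": {"correlation": 0.1, "outlier": 0.1, "cluster": 0.0},
}
# Equal weights; raising one weight focuses the ranking on that structure.
weights = {"correlation": 1.0, "outlier": 1.0, "cluster": 1.0}
importance = combined_importance(metrics, weights)
ranked, curve = information_loss_curve(importance)
```

Plotting `curve` against k gives the kind of trade-off chart the user inspects when choosing how many dimensions to keep.<br />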
Figure 4.7: Interactive chart to select the number of dimensions to keep vs. information loss [82].<br />
Figure 4.8: Top: best ordering to enhance clustering. Bottom: best ordering to enhance correlation<br />
[82].<br />
Figure 4.9 shows the pipeline for this example. Again we have three main elements:<br />
(A) every single dimension is ranked by the quality metrics directly from the source data.<br />
The source data is needed because the importance measure of a single<br />
dimension is computed taking into account the full set of dimensions (see the paper for<br />
details); (B) the user selects the dimensions guided by the quality metrics, so both the user<br />
and the quality metrics influence the data transformation process; (C) the system finds<br />
the best ordering according to the weighting scheme proposed by the user, producing one<br />
specific visual mapping. The view is then presented to the user.<br />
[Figure 4.9 diagram. Pipeline labels: Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views. Quality-Metrics-Driven Automation layer with markers A, B, C.]<br />
Figure 4.9: Quality metrics pipeline for the second example from [82]: (A) dimensions ranked by<br />
their importance; (B) selection of number of dimensions to retain vs. information loss; (C) creation<br />
of the final mapping with ordering.<br />
This technique uses parallel coordinates as principal visualization. There is no meta-visualization<br />
to organize alternative results in a schema but the interactive chart functions<br />
as a way to pilot the generation of alternatives. It measures clustering, correlation and<br />
outliers in the data space and its main purpose is to find interesting projections and<br />
orderings. Interaction plays a central role in the selection of the number of dimensions<br />
and in the weighting scheme.<br />
The third example is taken from the work of Cui et al. on data abstraction quality [42].<br />
This paper proposes a technique to create abstracted visualizations in a user-controlled<br />
manner. The system features data abstraction metrics (Histogram Difference Measure<br />
and Nearest Neighbor Measure) and controllers to let the user find a trade-off between<br />
abstraction level and information loss. In particular, the data abstraction quality is calculated<br />
by comparing features of the original data to features in the sampled or aggregated<br />
data.<br />
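As an illustration of comparing a feature of the original data with the same feature of the abstracted data, the sketch below scores a sample by how well it preserves a one-dimensional histogram.<br />
It is only in the spirit of a histogram difference measure: the formula (one minus half the L1 distance between normalized histograms), the bin count, and the function names are assumptions, not the definitions used in [42].<br />

```python
def histogram(values, bins, lo, hi):
    """Normalized histogram of values over [lo, hi]; assumes hi > lo."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(max(int((v - lo) / width), 0), bins - 1)
        counts[idx] += 1
    n = len(values)
    return [c / n for c in counts]

def histogram_quality(original, abstracted, bins=10):
    """1.0 if both data sets have the same binned distribution, lower otherwise."""
    lo, hi = min(original), max(original)
    h_orig = histogram(original, bins, lo, hi)
    h_abs = histogram(abstracted, bins, lo, hi)
    return 1.0 - 0.5 * sum(abs(p - q) for p, q in zip(h_orig, h_abs))

original = [i / 100 for i in range(100)]
every_other = original[::2]   # sample that keeps the overall distribution
low_half = original[:50]      # sample that drops the upper half of the range
q_even = histogram_quality(original, every_other)
q_low = histogram_quality(original, low_half)
```

A user-controlled abstraction level would then be accepted or rejected by comparing such a score against a chosen threshold.<br />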
[Figure 4.10 panels: scatterplots of the original dataset (DAL=1.00) and of the abstracted dataset (DAL=0.08).]<br />
Figure 4.10: Visual abstraction of a scatterplot matrix from [42].<br />
Figure 4.11 shows the pipeline for this example. We have two main elements: (A)<br />
the data abstraction quality measures are calculated by comparing the source data to the<br />
transformed data; (B) the user selects the desired abstraction quality and receives feedback<br />
on its quality by steering the data transformation process.<br />
[Figure 4.11 diagram. Pipeline labels: Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views. Quality-Metrics-Driven Automation layer with markers A, B, A.]<br />
Figure 4.11: Quality metrics pipeline for example three from [42]: (A) data features compared<br />
between the original data and the abstracted data; (B) instantiation of the desired abstraction<br />
level guided by quality metrics.<br />
The paper applies the technique to scatterplots and parallel coordinates but it is generic<br />
enough to be applied to many other techniques. There is no meta-visualization to organize<br />
alternative results but, similarly to the second example, an interactive chart is used to<br />
set an abstraction quality threshold (see Figure 4.12). It measures feature preservation, and its<br />
main purpose is abstraction. Interaction plays a central role in the selection of the right<br />
abstraction level.<br />
[Figure 4.12 plot, panel title from [42]: 1D plots of quality measures.]<br />
Figure 4.12: Visual abstraction chart with threshold setting for the abstraction level and feedback<br />
on abstraction quality [42].<br />
y brush the structure formed As by cluster<strong>in</strong>g a fourth example we choose the paper from Yang et al. [158]. They use quality<br />
metrics to support 4.2anInteractive dimensionOperations<br />
management system for high-dimensional data. Their<br />
res<br />
<strong>in</strong>teractive hierarchical Several <strong>in</strong>teractive dimension operations management are supported system <strong>in</strong> this called system. DOSFA Users can (Dimension Order<strong>in</strong>g,<br />
[12, Spac<strong>in</strong>g,<br />
ctive selection via brush<strong>in</strong>g<br />
move the slider bar <strong>in</strong> Figure 1 or the DAL handle <strong>in</strong> Figure 2 to adjust<br />
the data abstraction level. After the DAL has been changed, the<br />
16] us<strong>in</strong>g aFilter<strong>in</strong>g Approach) supports automatic and <strong>in</strong>teractive dimension order<strong>in</strong>g,<br />
he data selected through filter<strong>in</strong>g brush<strong>in</strong>g isand called spac<strong>in</strong>g. systemAn willexample generate an canabstracted be seendataset <strong>in</strong> Figure and display 4.13 where it <strong>in</strong> theon data the left hand side,<br />
e rema<strong>in</strong><strong>in</strong>g data are called the the data unselected is presented visualization. <strong>in</strong> an unchanged The DALs forway, selected and and onunselected the right data hand can be side adjusted<br />
<strong>in</strong>dependently. Users can also modify the location <strong>of</strong> one <strong>of</strong> the<br />
the data is visualized<br />
several after quality DOSFA was applied and the data is ordered, spaced and filtered. Di erent<br />
t the DAL for the selected data as well as<br />
view <strong>of</strong> the data generates boundaries <strong>of</strong> the selected region by click<strong>in</strong>g the left mouse button on<br />
orders of dimensions can show different patterns of the data to the user. Depending on<br />
the task, a similarity-oriented order or an importance-oriented order is needed.<br />
An annotated pipeline with these steps is presented in Figure 4.14. It contains four<br />
main steps: (A) a hierarchical structure of the dimensions is constructed, by grouping<br />
similar dimensions into clusters and similar clusters into larger clusters; (B) in the data<br />
transformation process selected dimensions are filtered based on their similarity<br />
and importance; (C) the dimension ordering influences the mapping stage by determining the ordering of<br />
Figure 4.13: Left: star glyphs representing the original data set. Right: visualized data after DOSFA<br />
was applied [158].<br />
[Figure 4.14 diagram labels: Quality-Metrics-Driven Automation (steps A, B, C, D); Source Data; Data Transformation; Transformed Data; Visual Mapping; Visual Structures; View Transformation; Rendering; Views]<br />
Figure 4.14: Quality metrics pipeline for example four from [158]: (A) construct a hierarchical<br />
structure of dimensions by clustering; (B) filter dimensions by similarity and importance; (C) map<br />
the dimension ordering to the visualization; (D) influence the view according to the measured quality<br />
(spacing the parallel coordinates according to their similarity). The user can steer all these steps<br />
by interacting with the clustered dimensions shown in an InterRing visualization.<br />
the visualization’s dimensions, or by mapping the more important dimensions to more<br />
prominent visualization positions or to more preattentive visual attributes;<br />
(D) the quality of dimensions also influences the view transformation step by determining<br />
the spacing between the dimensions in the parallel coordinates according to their similarity.<br />
All these “best” settings can be calculated automatically by the system, and the result is<br />
presented to the user. It is also possible to present the dimension hierarchies to the user<br />
with an InterRing [159]. The user can interact with the InterRing, triggering the filtering,<br />
ordering, and spacing for the final result. This is represented by the user-interaction arrows<br />
on the lower level of the pipeline.<br />
This paper applies quality metrics in data space to improve scatterplots, parallel coordinates,<br />
and star glyphs for high-dimensional data. It measures correlation to find the<br />
best dimension ordering, projection, and view optimization for the data sets. The user can<br />
steer the process by influencing all the pipeline steps.<br />
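To make the idea of a correlation-driven dimension ordering more concrete, the following minimal sketch orders dimensions so that strongly correlated ones become neighbors. The greedy strategy, the function names, and the toy data are our own assumptions for illustration; this is not the algorithm of [158].<br />

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant dimension is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def order_dimensions(columns):
    """Greedy ordering (an assumed heuristic): start with the first dimension
    and repeatedly append the unused dimension whose absolute correlation
    with the most recently placed dimension is largest."""
    names = list(columns)
    order = [names[0]]
    remaining = names[1:]
    while remaining:
        last = columns[order[-1]]
        best = max(remaining, key=lambda name: abs(pearson(last, columns[name])))
        order.append(best)
        remaining.remove(best)
    return order


# Hypothetical toy data: four dimensions with five records each.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 1, 4, 3, 6],
    "c": [5, 4, 3, 2, 1],
    "d": [1, 3, 2, 5, 4],
}
dimension_order = order_dimensions(columns)
print(dimension_order)
```

A greedy pass is cheap but depends on the starting dimension; an exhaustive or seriation-based search would be needed to guarantee the best ordering.<br />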
These four examples cover many aspects discussed in the previous sections, especially<br />
metrics calculated in the data vs. image space, different purposes, different measure types,<br />
different uses of the pipeline, and different interaction levels. Many of the papers we have<br />
reviewed have similar elements and functions; nonetheless, there are others that deviate<br />
considerably from these. While we cannot provide the full set of examples in this<br />
section, we discuss in Section 4.1.6 some findings that stem from the analysis of the whole<br />
set, including those with uncommon approaches, and we list all the quality metrics pipelines<br />
in Appendix A.3.
4.1.6 Findings<br />
In the following, we discuss some major trends we have observed during our analysis.<br />
From the visualization point of view we already discussed the role of meta-visualizations,<br />
that is, visualizations whose purpose is to accommodate other visualizations. During the<br />
paper review we found very limited explicit discussion of this aspect, which we deem extremely<br />
relevant. Many of the papers we have analyzed seem to assume that providing<br />
a simple list of interesting visualizations will automatically solve the user’s task. To the<br />
best of our knowledge, the only work that analyzes the issue explicitly and in great depth<br />
is the Trellis display [18], which organizes the display in a way that makes patterns among<br />
views apparent. We believe a deeper investigation of this issue is needed.<br />
Interestingly, some of the papers we reviewed do take care of the navigation issue, that<br />
is, how to explore configurations automatically found by the algorithm. These papers usually<br />
provide an additional visualization that allows the user to navigate from one configuration to<br />
another. For instance, Johansson et al. provide a line chart visualization to interactively<br />
show alternative projections in parallel coordinates [82]. Similarly, “hierarchical dimension<br />
ordering” [158] uses the InterRing visualization to let the user navigate through<br />
alternative subsets of dimensions organized in a hierarchical fashion. Finally, the Rank-by-Feature<br />
framework [126] uses color-coded interactive lists and scatterplot matrices to<br />
provide a preview of the statistical properties of each view.<br />
We also noticed a lack of systematic approaches to the ordering problem; every paper<br />
proposes its own method. The whole topic of seriation, introduced in the early work of<br />
Bertin [22] and discussed in depth by Hahsler et al. [62], deserves deeper investigation and<br />
acknowledgment. Additionally, innovative ways of ordering data dimensions may exist,<br />
like the Eulerian tours and Hamiltonian decompositions presented by Hurley et al. [75],<br />
which explore the possibility of repeating the axes to reduce dependency on a specific<br />
order.<br />
In Section 4.1.4, we listed a series of meta-visualizations that we have found, namely<br />
list and matrix (small multiples). We believe this list can be expanded if novel solutions<br />
are developed. A promising one that we noticed in a few papers, but did not include in the<br />
review (because they do not specifically use quality metrics), is the idea of arranging<br />
iconic versions of the generated visualizations in a scatterplot view (e.g., using MDS or<br />
similar techniques). Such a technique is, for instance, proposed in the work of Yang et al.,<br />
where pixel-based icons are laid out with an MDS projection in a scatterplot [156].<br />
Another issue we noticed from our analysis is the limited use of the visual mapping<br />
and view transformation functions in the pipeline. More specifically, visual mapping is<br />
almost exclusively used as a way to generate alternative orderings, taking into account<br />
only the mapping between the original data dimensions and the visualization axes.<br />
But alternative mappings can also be generated by linking data dimensions to the whole<br />
spectrum of visual features like color, size, shape, etc., as is common in several systems<br />
based on visual languages like ggplot2 [1], Tableau [3], and Protovis [2]. Pixnostics [120] is<br />
the only technique in our review presenting this kind of process supported by quality<br />
metrics.<br />
View transformation is also rarely used in the quality metrics pipeline. The only<br />
example we found is the use of quality metrics to automatically select focus area parameters<br />
in table lens [8]. The automatic selection of interesting points of view in 3D scatterplots,<br />
for example, is one clear case where the use of quality metrics at the view transformation<br />
stage would be beneficial. Another one is the automatic highlighting of interesting items in<br />
a view (e.g., visual boosting in pixel-based visualizations [109]).<br />
Finally, the purposes we have considered can be roughly classified into two broad higher-level<br />
purposes: finding interesting visualizations and scaling visualizations to larger data<br />
sets. When considering these goals, it is evident that clustering, correlation, outliers, and<br />
complex patterns mainly support the first goal, whereas image quality and feature preservation<br />
tend to support the second one. One interesting open issue is whether the<br />
use of quality metrics in high-dimensional data is confined to these two general purposes.<br />
One purpose, which to the best of our knowledge is totally unexplored, is the use of quality<br />
metrics to automatically or semi-automatically compare different visual techniques applied to the<br />
same data.<br />
4.1.7 Directions for Further Research<br />
In the following, we present a selected set of research issues we deem important for the<br />
advancement of quality-metrics-driven data visualization.<br />
Evaluation and applications<br />
Surprisingly, none of the papers we have analyzed reported on user evaluation. While<br />
we are convinced that quality metrics are useful and need to be further developed, we<br />
also realize that the whole idea has not yet been tested. Usefulness is therefore one of<br />
the most important aspects to consider, followed by usability issues. To the best of our<br />
knowledge, there are no studies reporting on the use of the quality metrics approach in<br />
real-world settings. Observational studies or even simple case studies would greatly improve<br />
the approach and most likely direct research to specific issues that are hard to anticipate without<br />
observation.<br />
Perceptual tuning<br />
All the metrics that work in the image space try to simulate the human pattern recognition<br />
machinery to some extent. They try to partially substitute human vision with image<br />
processing algorithms, with the (implicit) assumption that algorithm rankings will match<br />
user rankings. This assumption needs a much deeper investigation. Our study presented<br />
in Section 3.2 and published in [134], where quality metrics rankings of clusters in scatterplots<br />
are compared to human rankings, represents a first step in this direction. In addition,<br />
it is necessary to validate and tune the image space metrics in a way that the parameters<br />
take models of human perception into account. Excellent examples of initial steps in this<br />
direction are the papers [81, 94, 116], where the detection of visual patterns<br />
has been tuned according to user studies aimed at modeling the way humans perceive them.<br />
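One simple way to quantify how well algorithm rankings match user rankings is a rank correlation. The sketch below computes Spearman's rank correlation with average ranks for ties; the scores are invented for illustration and are not taken from the study in [134], whose analysis may have used different statistics.<br />

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    ranks_a, ranks_b = average_ranks(a), average_ranks(b)
    mean = (len(a) + 1) / 2  # the mean of average ranks is always (n + 1) / 2
    cov = sum((x - mean) * (y - mean) for x, y in zip(ranks_a, ranks_b))
    var_a = sum((x - mean) ** 2 for x in ranks_a)
    var_b = sum((y - mean) ** 2 for y in ranks_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # all values tied: no ranking information
    return cov / (var_a * var_b) ** 0.5


# Hypothetical scores for four scatterplot views.
metric_scores = [0.9, 0.7, 0.4, 0.2]   # quality metric output
human_scores = [4, 3, 1, 2]            # e.g., mean user rating
agreement = spearman(metric_scores, human_scores)
print(round(agreement, 3))
```

A value close to 1 would indicate that the metric orders the views much like the users do; values near 0 or below would call the implicit assumption into question.<br />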
Metrics systematization<br />
During our review we collected a very large number of alternative quality metrics, some<br />
calculated in data space, some in image space. While this proliferation of metrics is a sign<br />
of the richness of this approach, it is currently very hard to compare them and understand<br />
which one is suitable for a given task. Some authors provide a number of metrics in the<br />
same environment, letting the user choose which one to use. Nonetheless, we fear that<br />
this approach, with its limited guidance, may not be effective for end users, especially if there<br />
is a lack of understanding of the level of redundancy between one metric and another.<br />
Similarly, given the above-mentioned dichotomy, it is hard if not impossible to state which<br />
approach yields the best results in which contexts. On a side note, the mixed approach<br />
of giving the user the possibility to combine several metrics into a composite one needs<br />
much more investigation, validation, and guidance.<br />
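As an illustration of what such a composite metric could look like, the following sketch rescales each metric to [0, 1] across the candidate views and combines them with user-chosen weights. The min-max normalization, the weighted mean, and the metric names are assumptions made for this example; none of the reviewed systems is claimed to work this way.<br />

```python
def min_max(values):
    """Rescale values linearly to [0, 1]; a constant metric carries no information."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def composite_scores(metric_table, weights):
    """Weighted mean of min-max normalized metrics, one score per view.

    metric_table maps a metric name to its raw scores (one per view);
    weights maps the same metric names to non-negative user weights.
    """
    total_weight = sum(weights[name] for name in metric_table)
    normalized = {name: min_max(vals) for name, vals in metric_table.items()}
    n_views = len(next(iter(metric_table.values())))
    return [
        sum(weights[name] * normalized[name][i] for name in metric_table) / total_weight
        for i in range(n_views)
    ]


# Hypothetical raw scores of two metrics for three views.
metric_table = {
    "clustering": [2, 8, 5],
    "correlation": [30, 10, 20],
}
weights = {"clustering": 2, "correlation": 1}
scores = composite_scores(metric_table, weights)
print([round(s, 3) for s in scores])
```

Normalizing first keeps a metric with a large numeric range from dominating the sum, but it does not address redundancy: two highly correlated metrics would still effectively be counted twice.<br />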
Scalability<br />
Image space and data space quality metrics have different scalability issues. Quality<br />
metrics in image space have the advantage of being independent of the original data size,<br />
e.g., [42]; that is, their computational complexity depends only on the screen dimensions.<br />
However, as data grows in size, virtually all visualizations experience some degree of<br />
degradation that may influence the discriminatory power of the metric. For instance,<br />
visualizations with a lot of clutter might hinder the discovery of the desired patterns.<br />
Quality metrics in data space, on the other hand, are expected to be more robust in terms<br />
of pattern detection, but their computation is directly affected by data size. A thorough<br />
investigation of these issues and of how to find a compromise between the two is clearly an<br />
interesting subject for future research.<br />
4.1.8 Limitations<br />
Our work has some important limitations to take into account; first of all its subjective<br />
nature. We are by no means suggesting this is the only way to describe the current state of<br />
quality metrics in high-dimensional visualization. There are no doubt a number of equally<br />
good alternative ways to describe it; this chapter provides a much-needed starting point.<br />
We encourage the reader to use this as a way to get inspiration for further research and<br />
to understand its status.<br />
Similarly, while we did our best to follow a thorough methodology (see Section 4.1.2),<br />
there might be relevant papers we overlooked. Even though we tried to be very broad<br />
and inclusive, our background heavily influences the review. In particular, given our focus<br />
on Computer Science, we might have missed relevant literature from Statistics. However,<br />
we feel confident that at this point of our review any additional paper would not change<br />
the structure or the elements of our model. In other words, the real goal of our review<br />
was not to include every possible paper on the discussed matter but rather to have enough<br />
coverage to build a coherent and useful picture.<br />
4.1.9 Conclusion and Future Work<br />
We presented a systematic analysis of quality metrics as a way to support the exploration<br />
of high-dimensional data sets. Quality metrics have been used in a variety of contexts and<br />
for a variety of purposes. With this work we started to collect these disparate systems under one<br />
umbrella and provided a way to reason about their characteristic features. Specifically,<br />
we presented an analysis of the visualization techniques, the quality metrics, and the<br />
processing pipeline. The analysis has two main outcomes. First, it makes it possible to describe the<br />
methods in detail and to capture their key components. Second, as shown in Section 4.1.6<br />
and Section 4.1.7, it helps to spot interesting research gaps and promising directions<br />
for future research. While we consider this work just an initial step, we hope it will spur<br />
new ideas and support researchers and practitioners in the development of interesting new<br />
applications and novel techniques.
4.2 Visual Cluster Separation Factors: Sketching a Taxonomy 4<br />
The quality metrics systematization presented in the previous section was followed by<br />
a qualitative analysis of concrete measures from this large pool. Here we turned our<br />
focus to the quality metrics for scatterplots that are designed to identify the visualizations<br />
that best represent the clusters in classified data. That means they rank scatterplot views<br />
that separate the data classes well higher than views with mixed classes. Our idea was<br />
to use two of these metrics to identify the best visualizations, independently of the<br />
data. Simultaneously, we wanted to give advice as to whether it would be best to use<br />
a 2D scatterplot, a 3D scatterplot, or a SPLOM for a specific data set. We therefore<br />
computed the measures for different data sets, and surprisingly found that they are<br />
not robust with respect to the different cluster shapes encountered in the analyzed data.<br />
Led by this insight, we analyzed all the cases more deeply, using open and axial coding<br />
of failure reasons to build up a taxonomy of visual cluster separation factors. We named<br />
this process a qualitative evaluation.<br />
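To illustrate what a class separation measure for a scatterplot can look like, the sketch below computes a simple centroid-based score: the fraction of points that lie closer to the centroid of their own class than to any other class centroid. This is a simplified stand-in written for this text, in the spirit of centroid-based measures; it is not claimed to reproduce the exact definitions of the two metrics used in the study.<br />

```python
def centroid_score(points, labels):
    """Fraction of 2D points whose nearest class centroid belongs to their own class."""
    # Accumulate coordinate sums and counts per class.
    sums = {}
    for (x, y), label in zip(points, labels):
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    centroids = {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

    consistent = 0
    for (x, y), label in zip(points, labels):
        nearest = min(
            centroids,
            key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
        )
        if nearest == label:
            consistent += 1
    return consistent / len(points)


# Hypothetical views of the same six labeled records.
labels = ["A", "A", "A", "B", "B", "B"]
separated_view = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
mixed_view = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]

separated = centroid_score(separated_view, labels)
mixed = centroid_score(mixed_view, labels)
print(separated, round(mixed, 3))
```

Because each class is summarized by a single centroid, a score of this kind can be expected to struggle with curved, split, or nested classes, which is the type of non-robustness discussed in this section.<br />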
The next sections sketch the methodology and the results of this evaluation. Section 4.2.1 presents<br />
the introductory ideas that led to this work. Section 4.2.2 gives<br />
a short description of the methodology, followed by the taxonomy axes in Section 4.2.3.<br />
Section 4.2.4 concludes with a discussion of the limitations of this work and<br />
possible future research making use of the developed taxonomy.<br />
4.2.1 Introduction<br />
An impressive number of quality measures, dimension reduction techniques, and visualizations<br />
for high-dimensional data have been developed in the past. The more that exist, the<br />
harder it is for users to find the right choice for their tasks. The literature does not provide<br />
any guidance on how to choose the right visualization or dimension reduction technique<br />
for complex multidimensional data. Quality metrics were designed to filter the large<br />
number of representations and provide an interesting selection to the user. Sedlmair et<br />
al. [122] investigate to what extent the existing measures can accomplish this task. They<br />
choose the 2D-HDM (Section 3.1.3) and the DCM [129] (also used in the empirical evaluation<br />
in Section 3.2 and described in Section 3.2.1) as quality metrics to judge the different<br />
data projections regarding their ability to represent the clusters in classified data sets.<br />
The measures were designed for 2D scatterplots and extended by the authors to work<br />
also on scatterplot matrices (SPLOMs) and 3D scatterplots. This decision is motivated<br />
by the fact that scatterplots are a widely used technique to display high-dimensional projections,<br />
and SPLOMs are often used to see more than two dimensions of the data. Since<br />
analysts also quite often work with 3D scatterplots, this technique was also included in<br />
the study. To obtain the lower-dimensional embeddings, different dimension reduction<br />
techniques were used: the well-known PCA, robust PCA, MDS, and t-SNE [143]. Initial<br />
4 This chapter is based on the collaboration with UBC where I participated in a project on quality<br />
measures, led by Prof. T. Munzner and M. Sedlmair. The work resulted in a joint EuroVis publication<br />
[122]. Since I was not in the lead in this project, this chapter presents the methodology and<br />
the results only briefly; a deeper description can be found in the paper itself. Please note that the full<br />
taxonomy of cluster separation factors and data characteristics is not my contribution. Since I was part of<br />
the qualitative analysis, I would like to recall the results in my thesis and provide a personal outlook on<br />
further research ideas at the end of this chapter.
experiments on different data sets showed that, with these visualizations and projection<br />
techniques, the measures are not able to detect different cluster shapes in the data<br />
projections. Compared to human judgement, they surprisingly produced mismatches by<br />
ranking visualizations high when the human rank was low, and ranking visualizations<br />
low when the human rank was high. This implies that good visualizations are sometimes<br />
missed and bad visualizations are ranked high, both cases that should be avoided.<br />
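The two kinds of mismatch can be made explicit with a small sketch that compares a measure score and a human score per visualization. The score range [0, 1], the thresholds, and the view names are hypothetical and chosen only for illustration; the study itself relied on manual inspection rather than fixed thresholds.<br />

```python
def find_mismatches(measure_scores, human_scores, high=0.7, low=0.3):
    """Label views where the measure and the human judgement disagree strongly.

    Both inputs map a view name to a score in [0, 1]; `high` and `low` are
    assumed cut-offs for "ranked high" and "ranked low".
    """
    mismatches = {}
    for view, measure in measure_scores.items():
        human = human_scores[view]
        if measure >= high and human <= low:
            mismatches[view] = "measure high, human low"   # bad view promoted
        elif measure <= low and human >= high:
            mismatches[view] = "measure low, human high"   # good view missed
    return mismatches


# Hypothetical scores for three projections of one data set.
measure_scores = {"pca_2d": 0.9, "mds_2d": 0.2, "tsne_2d": 0.8}
human_scores = {"pca_2d": 0.1, "mds_2d": 0.9, "tsne_2d": 0.85}
mismatches = find_mismatches(measure_scores, human_scores)
print(mismatches)
```

Views that end up in either category are the ones worth inspecting qualitatively to understand why the measure failed.<br />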
These surprising outcomes shifted the focus of the study from guiding the user to the<br />
right choice of a visualization technique and a dimension reduction technique for<br />
their data, to an in-depth analysis of different visual separability factors. A sketch of the<br />
methodology of the systematic study of the differences between the computed measures<br />
and the human judgement is presented in the next section.<br />
4.2.2 Method<br />
To discover the divergences between human judgement and measure ranks, a qualitative<br />
data study was conducted. The first two authors of the paper 5 manually inspected over<br />
800 visualizations (combinations of 75 data sets, 4 dimension reduction techniques, and 3<br />
visualizations: 2D scatterplot, 3D scatterplot, and SPLOM) and judged their quality in displaying<br />
data clusters. Their judgements were compared with the measure ranks and the mismatches<br />
were analyzed. “The investigators generated a detailed set of characteristics that<br />
influenced cluster separability in general, and specific reasons why the measures failed in<br />
the cases where they found a mismatch. Based on separability characteristics and failure<br />
reasons, we generate a higher-level taxonomy of factors, which we iteratively refined in<br />
multiple passes” [122]. This was done “not only by considering its explanatory clarity<br />
and power, but also by mapping the ranges where each measure was successful along the<br />
factor axes, and by placing some of the studied data sets along them. Figure 4.15 shows<br />
the measure success ranges on a simplified version of the taxonomy” [122]. The study<br />
consisted of four stages: (1) choosing variables for study; (2) generating data set instances<br />
[Figure 4.15 graphic. Within-Class Factors: Count, Size, Density, Clumpiness, Outlier, Shape (Isotropy, Curvature), Centroid. Between-Class Factors: Class/Point Count, Variance of Count, Variance of Size, Variance of Density, Mixture, Split, Variance of Shape, Inner-Outer Position, Class Separation. Legend, measures: Centroid, Grid. Legend, data sets (with the figure of [122] that shows each): gaussian (synth., MDS, Fig. 5(a)); spambase (real, PCA, Fig. 5(b)); shuttle (real, MDS, Fig. 5(c)); fisheries (real, MDS, Fig. 5(d)); hiv (real, t-SNE, Fig. 5(e)); entangled (synth., t-SNE, Fig. 5(f)).]<br />
Figure 4.15: Taxonomy of factors in visual cluster separation, where factor axes are marked to<br />
show the ranges where existing measures are successful; gaps represent failure cases. The centroid<br />
measure (CDM) is marked in blue and the grid (2D-HDM) is marked in red. All positions are<br />
approximate estimates. Marked along the factor axes are six data sets that are exemplified in the<br />
paper. (Used with permission by [122].)<br />
5 Michael Sedlmair and myself.
and computing measures; (3) open coding and measure evaluation; and (4) axial coding<br />
and taxonomy building. Details can be found in the paper [122].<br />
4.2.3 Visual Cluster Separation Taxonomy<br />
Class separation in a visualization is influenced by different characteristics of the data set.<br />
Figure 4.16 presents the factors that affect visual cluster separation. These are grouped into<br />
“Within-Class factors”, which are determined by the structure or appearance of a single class,<br />
and “Between-Class factors”, which represent interactions between two or more classes [122].<br />
[Figure 4.16 graphic. Influence categories: Scale, Point Distance, Shape, Position.<br />
Within-Class Factors: Count (few to many); Size (small to large); Density (sparse to dense); Clumpiness (equidistant, uniformly random, one dense spot, many dense spots, clumpy); Outlier (none to many); Shape: Isotropy (narrow to round) and Curvature (up to curvy); Centroid (evocative to misleading).<br />
Between-Class Factors: Class/Point Count (few classes, many points to many classes, few points); Variance of Count (similar to different); Variance of Size (similar to different); Variance of Density (similar to different); Mixture (random vs. equidistant vs. interwoven); Split (contiguous to split); Variance of Shape (similar to different); Inner-Outer Position (non-existent to existent); Class Separation (full overlap, partial overlap, adjacent, separate, distant).]<br />
Figure 4.16: A taxonomy <strong>of</strong> data characteristics with respect to class separation <strong>in</strong> scatterplots.<br />
Some factors are organized as axes (arrows) while others are b<strong>in</strong>ned. Between-Class factors <strong>of</strong>ten<br />
result from the variance <strong>of</strong> With<strong>in</strong>-Class factors (horizontal dependencies), and factors at the top<br />
can strongly <strong>in</strong>fluence factors below them (vertical dependencies). Class Separation is therefore<br />
dependent on all other factors (used with permission by [122]).<br />
In brief, we recall the characteristics that determine these factor groups. Four categories<br />
organize the two factor groups: scale, point distance, shape, and position,<br />
which influence each other from first to last. Within these categories, the Within-Class<br />
factors are count, size, density, clumpiness, outlier, shape, and centroid. These factors<br />
describe the structure and appearance of single classes. The variance of the Within-Class<br />
factors across multiple classes determines the Between-Class factors sketched on the right<br />
side of the figure. In this study, the following combinations influencing the perceived<br />
cluster shapes were identified: class-point count, variance of (point) count, variance of<br />
(class) size, variance of (class) density, mixture (of classes), split (of classes), variance of<br />
shape, inner-outer position, and class separation. Since the arrows indicate the influence of<br />
the factors, horizontally from left to right and vertically from top to bottom, the factor<br />
positioned in the lower right corner, class separation, can be strongly influenced by all the<br />
other factors.<br />
The two quality measures have different strengths, so they perform differently when<br />
encountering these different factors. Figure 4.15 marks the measures' performance along the factor axes.<br />
This clearly shows the gaps where current measures cannot achieve good<br />
results. The study finds that the centroid factor is influenced by many other factors<br />
and that the centroid-based measure (CDM) alone cannot identify all different constellations of<br />
visual classes. CDM is vulnerable with respect to shape, clumpiness, outliers, variance of<br />
count, of size, or of density, and inner-outer position [122]. Similarly, HDM encountered<br />
a number of problems while identifying classes in visualizations. The biggest issue is with<br />
narrow, adjacent classes that coexist in the same grid cell and span different cells.<br />
The measure turned out to be sensitive to the grid size, despite previous results from the<br />
literature. The most difficult factor was class separation, with the measure failing on<br />
overlapping classes. Depending on the grid, the classes were sometimes rated good<br />
even though they showed a high overlap or class split.<br />
The goal of this taxonomy is to guide others in designing, using, and evaluating cluster<br />
separability measures. Other researchers can test different data sets and map their features<br />
onto the taxonomy axes. This will give an overview of the coverage of relevant factors by<br />
the particular measures and help in improving or developing more reliable measures in the<br />
future.<br />
4.2.4 Discussion and Further Research<br />
This study shows that measures have so far been developed and validated on far too few and<br />
too simple data sets. Real-world data is much more complex, and as data complexity<br />
rises, a more systematic development of the measures is needed. As we saw in the previous<br />
section, real data sets exhibit further aspects that are not yet covered by<br />
existing measures. In the following, we present a list of issues that emerged as a result of<br />
this study and which we deem important for further research in the area of quality metrics.<br />
Taxonomy based evaluation and systematization<br />
A large number of metrics for cluster separation in scatterplots have been developed. They<br />
all try to discover good views displaying the data clusters. We believe that there are two<br />
main reasons why such a variety of measures exists for this task: the different strengths of the measures<br />
and the missing unified picture of existing approaches. First, the measures have different<br />
strengths with respect to the factors of the taxonomy. They cannot cover the entire spectrum<br />
of data characteristics and therefore focus on just a subset of these. Using a metric<br />
on data characteristics that it cannot cope with will lead to wrong results. Therefore, guided by<br />
the taxonomy presented before, an evaluation of the existing metrics is needed that can<br />
help users choose the right measure depending on their data. Second, the variety of<br />
measures makes the development of new ones difficult since a unifying picture is missing.<br />
Guided by the taxonomy axes, the existing approaches can be evaluated and their ranges<br />
of success can be marked on the axes. This analysis would provide a good systematization<br />
of current approaches, spot the data characteristics that have to be addressed in the<br />
future, and guide researchers through the variety of approaches.<br />
Taxonomy based measure development<br />
Once the gaps of existing measures are identified, new research can be conducted to cover<br />
the data characteristics missing so far. We believe that it is hard to develop one single<br />
measure that covers all these factors, but having different measures and being aware of their<br />
coverage potential along these axes helps to avoid false rankings in the future.<br />
New taxonomies for different visualization techniques<br />
While this taxonomy focuses on one prominent visualization technique, the scatterplot,<br />
there are also metrics designed for other high-dimensional visualization techniques, as categorized<br />
in Section 4.1.4. Different visualization techniques will need different factors to<br />
characterize different patterns (e.g., cluster separation). Even though building a taxonomy like this<br />
is laborious, it can benefit the development of metrics for these techniques.<br />
New taxonomies for different quality metric factors<br />
We have seen in Section 4.1.4 that different patterns are quantified by measures, and a<br />
systematization of the factors that influence them is missing for other factors too. Factors<br />
like correlation, outliers, complex patterns, image quality, or feature preservation<br />
still lack such a taxonomy. With all these taxonomies available, which would be the ideal<br />
scenario, it would be possible to identify interrelations between different patterns and<br />
how they are represented in visualizations. We believe that these insights can help in<br />
combining measures to identify more than one pattern.<br />
Metrics for dimension reduction properties<br />
Dimension reduction techniques are often used to reduce the dimensionality of the data<br />
sets before displaying them on the screen. The metrics are then always applied to dimension-reduced<br />
data sets, so artifacts introduced by these techniques cannot be ruled out. A study<br />
of how different data characteristics are maintained or obscured by these techniques can<br />
be conducted by comparing different techniques, or the same technique with different<br />
parameter settings, on the same data set. As far as we know, there are no studies reporting<br />
on this type of analysis, and we believe it to be an interesting topic for future research. In addition,<br />
quality measures can be designed to automatically detect structure changes caused by a change of parameter<br />
or technique. Properties like noise invariance, rotation invariance, and scalability with<br />
respect to data points and dimensions can be explored by new quality metrics.<br />
5<br />
<strong>Visual</strong> Subspace Analysis <strong>of</strong><br />
<strong>High</strong>-<strong>Dimensional</strong> <strong>Data</strong><br />
“Visual ideas combined with technology combined with personal interpretation<br />
equals photography. Each must hold it’s own; if it doesn’t, the thing<br />
collapses.”<br />
Arnold Newman<br />
Contents<br />
5.1 Visual Exploration for Subspace Clustering<br />
5.1.1 Motivation<br />
5.1.2 Subspace Clustering Algorithms<br />
5.1.3 Task Definition and Design Space for Visual Subspace Cluster Analysis<br />
5.1.4 The ClustNails System<br />
5.1.5 Use Case and System Comparison<br />
5.1.6 Conclusions and Future Work<br />
5.2 Visual Analytics of Subspace Search<br />
5.2.1 Introduction<br />
5.2.2 Subspace Analysis<br />
5.2.3 Proposed Analytical Workflow<br />
5.2.4 Application<br />
5.2.5 Discussion and Possible Extensions<br />
5.2.6 Conclusions<br />
Subspace clustering addresses an important problem in clustering multidimensional<br />
data. In sparse multidimensional data, many dimensions are irrelevant and obscure<br />
the cluster boundaries. Subspace clustering helps by mining the clusters present in only<br />
locally relevant subsets of dimensions. However, understanding the result of subspace<br />
clustering is not trivial for analysts. In addition to the grouping information, relevant<br />
sets of dimensions and overlaps between groups, both in terms of dimensions and records,<br />
need to be analyzed. In Section 5.1, we present an interactive visualization system called<br />
ClustNails to analyze, navigate, relate, and understand subspace clustering results. Real-world<br />
data sets are used to demonstrate the functionality of the system.<br />
Additionally, high-dimensional data spaces often consist of combined features that measure<br />
different properties. In this case, the particular relationships between the various<br />
properties may not be clear to analysts a priori, since they can only be revealed if appropriate<br />
feature combinations (subspaces) of the data are taken into consideration. Considering<br />
just a single subspace is, however, often not sufficient, since different subspaces may show<br />
complementary, conjoint, or contradicting relations between data items. Useful information<br />
may consequently remain embedded in sets of subspaces of a given high-dimensional<br />
input data space.<br />
Relying on the notion of subspaces, in Section 5.2 we propose a novel method for the<br />
visual analysis of high-dimensional data in which we employ an interestingness-guided<br />
subspace search algorithm to detect a candidate set of subspaces. Using properly defined<br />
subspace similarity functions, we provide an interactive exploration environment to compare<br />
and relate subspaces with respect to their topological similarities and dimension<br />
similarities. Real and synthetic data sets are used to demonstrate our approach.<br />
Parts of this chapter appeared in the publications [135, 136].<br />
5.1 <strong>Visual</strong> Exploration for Subspace Cluster<strong>in</strong>g<br />
In this section, we introduce a visual subspace cluster analysis system called ClustNails. It<br />
integrates several novel visualization techniques with various user interaction facilities to<br />
support the navigation and interpretation of subspace clustering results. We demonstrate<br />
the effectiveness of the proposed system by analyzing real-world data sets and comparing<br />
it to other existing visual subspace cluster analysis systems.<br />
This section is organized as follows. In Section 5.1.1, we elaborate on the aspects that motivated<br />
our research in this area. In Section 5.1.2, we introduce the subspace clustering<br />
problem and point to important overview articles in this area. In Section 5.1.3, we explain<br />
the challenges in designing effective visualization tools for subspace clustering<br />
analysis tasks. In Section 5.1.4, we provide an overall view of the system as well as detailed<br />
visualization and ordering techniques. In Section 5.1.5, we validate the system with real-world<br />
data sets and compare it with a state-of-the-art system, and Section 5.1.6 concludes.<br />
5.1.1 Motivation<br />
Cluster<strong>in</strong>g is one <strong>of</strong> the most prom<strong>in</strong>ent techniques used to analyze large and complex data<br />
sets, and visualization is <strong>of</strong>ten helpful <strong>in</strong> understand<strong>in</strong>g the output <strong>of</strong> a given cluster<strong>in</strong>g<br />
method. A cluster<strong>in</strong>g algorithm assesses the relationships among objects <strong>of</strong> a data set by<br />
organiz<strong>in</strong>g objects <strong>in</strong>to clusters, such that objects with<strong>in</strong> a cluster are similar to each other<br />
but dissimilar from objects in other clusters. Clustering has a wide range of applications<br />
in areas such as business intelligence, pattern recognition, image or document analysis,<br />
and bio<strong>in</strong>formatics. With the fast development <strong>of</strong> modern technologies, vast amounts <strong>of</strong><br />
high-dimensional data are generated. This poses new challenges for cluster<strong>in</strong>g that require<br />
specialized solutions.<br />
The need for subspace clustering stems from the well-known “curse of dimensionality”,<br />
that is, the enormous challenges that arise in data analysis whenever the data under<br />
analysis has a high number of dimensions. As the number of dimensions grows, relations<br />
among data points become more complex and interesting patterns become harder to uncover.<br />
Computation also becomes an issue as the number of combinations increases steeply<br />
with data dimensionality.<br />
The need for subspace clustering derives essentially from two distinct but related issues:<br />
(1) how similarity among data items changes as the number of data dimensions grows<br />
and (2) the relevance of different dimensions in different clusters.<br />
Several studies have analyzed the strange behavior of similarity functions in high-dimensional<br />
data [28, 69]. In summary, they are organized around the problem of finding<br />
the nearest and farthest points to a given query point and show that, as the number of data<br />
dimensions d increases, the difference between the two distances does not increase as fast as the distance<br />
to the nearest point. That is:<br />
\[ \lim_{d \to \infty} \frac{\mathit{dist}_{\max} - \mathit{dist}_{\min}}{\mathit{dist}_{\min}} = 0, \tag{5.1} \]<br />
where \(\mathit{dist}_{\max}\) and \(\mathit{dist}_{\min}\) are the distances from the query point to its farthest and nearest point,<br />
meaning that the discrimination between the nearest and farthest points becomes irrelevant.<br />
In turn, a progressive degradation of the quality of data<br />
clustering can be expected because distances between data points become progressively<br />
meaningless.<br />
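The effect stated in Equation 5.1 can be observed empirically. The following minimal sketch is not taken from the cited studies: it assumes points drawn uniformly at random from the unit hypercube and the Euclidean distance, and the function name relative_contrast as well as the sample sizes are our own choices for illustration.<br />

```python
import math
import random


def relative_contrast(num_points: int, num_dims: int, rng: random.Random) -> float:
    """Return (dist_max - dist_min) / dist_min for one random query point."""
    query = [rng.random() for _ in range(num_dims)]
    distances = []
    for _ in range(num_points):
        point = [rng.random() for _ in range(num_dims)]
        distances.append(math.dist(query, point))  # Euclidean distance
    dist_min, dist_max = min(distances), max(distances)
    return (dist_max - dist_min) / dist_min


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
contrasts = {d: relative_contrast(200, d, rng) for d in (2, 10, 100, 1000)}
for d, value in contrasts.items():
    print(f"d = {d:4d}: relative contrast = {value:.3f}")
```

Under these assumptions, the printed ratio should shrink as d grows, which is the behavior the limit describes.<br />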
The second problem is related to the fact that clusters are often present only in subsets<br />
of dimensions of the original data space, and this is of course more probable when the<br />
number of dimensions is high. These clusters might be hard to detect when considering the<br />
whole data space because the remaining dimensions can introduce noise and fool the clustering algorithm. This<br />
effect can be explained through a simple diagram like the one shown in Figure 5.1.<br />
The figure shows the distribution of data points in a 3D space and illustrates the<br />
concept of a subspace cluster: given three dimensions x, y, and z, clusters may exist<br />
in different subspaces. A standard clustering algorithm like k-means would have problems<br />
finding the clusters because they are not clearly separated in the 3D space. However,<br />
when considering 2D projections of these data on (x, y), (x, z), and (y, z), respectively, the<br />
clusters become apparent. Subspace clustering techniques aim to find these clusters, which<br />
might otherwise remain hidden if a traditional clustering algorithm were applied. Subspace<br />
clustering gives for each cluster (1) the objects belonging to the cluster and (2) the subset<br />
of dimensions that constitute the cluster. Depending on the type of subspace clustering<br />
method, there are two forms of output: a partitioning of the data into separate clusters,<br />
or clusters that allow for overlapping elements. Overlap may also exist between the sets<br />
of dimensions constituting the clusters.<br />
Figure 5.1: <strong>Data</strong> projected <strong>in</strong> several subspaces.<br />
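The two pieces of output of a subspace clustering, member objects and relevant dimensions per cluster, and the two kinds of overlap can be represented as follows. This is a minimal sketch with hypothetical names (SubspaceCluster, overlap) and made-up example clusters; it is not the data model of any particular system.<br />

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubspaceCluster:
    """One subspace cluster: its member objects and its relevant dimensions."""
    objects: frozenset  # ids of the records in the cluster
    dimensions: frozenset  # names of the dimensions that constitute the cluster


def overlap(a: SubspaceCluster, b: SubspaceCluster) -> tuple:
    """Return (shared objects, shared dimensions) of two subspace clusters."""
    return a.objects & b.objects, a.dimensions & b.dimensions


# Hypothetical result over the dimensions x, y, and z.
c1 = SubspaceCluster(frozenset({1, 2, 3, 4}), frozenset({"x", "y"}))
c2 = SubspaceCluster(frozenset({3, 4, 5}), frozenset({"y", "z"}))

shared_objects, shared_dims = overlap(c1, c2)
print(sorted(shared_objects), sorted(shared_dims))

# A partitioning output would have no shared objects between any two clusters.
is_partitioning = not shared_objects
```

In this made-up example the two clusters share objects as well as one dimension, so the result is of the overlapping kind rather than a partitioning.<br />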
Designing effective visualizations to help analyze the clustering result is not trivial. In<br />
addition to the cluster membership information, the relevant sets of dimensions and the<br />
overlaps of memberships and dimensions need to be considered. Although a number of<br />
techniques (e.g., parallel coordinates [55, 78], scatterplot matrices [17], heat maps [47])<br />
exist for visualizing traditional clustering results, little research has been carried out on<br />
visualizing subspace clustering results. There is a need for effective systems that allow the<br />
comparison and analysis of clusters in arbitrary subspace projections, supporting overview<br />
and in-depth study of the subspace clustering results.<br />
In this section, we present ClustNails, a novel visualization system for mining subspace<br />
clusters and analyzing the results. The system takes high-dimensional data as input and<br />
applies a subspace clustering algorithm, which the user selects from a set of algorithms, to group<br />
the objects into clusters. The system displays the subspace clustering results using two appropriately<br />
designed visual representations: Spikes and HeatNails. These representations<br />
support the interpretation of the result of subspace clustering algorithms by visualizing<br />
characteristics of the clustering results from different perspectives. Appropriate ordering<br />
techniques are integrated with the visualization to help extract meaningful patterns<br />
from the clustering results.<br />
The ma<strong>in</strong> contributions <strong>of</strong> this section are:<br />
• an <strong>in</strong>tegrated data analysis and visualization tool for m<strong>in</strong><strong>in</strong>g patterns <strong>in</strong> multidimensional<br />
data us<strong>in</strong>g subspace cluster<strong>in</strong>g algorithms;<br />
• a characterization <strong>of</strong> subspace cluster analysis tasks and the result<strong>in</strong>g design space;<br />
• two novel visualization techniques, Spike and HeatNail, for analyz<strong>in</strong>g subspace cluster<strong>in</strong>g<br />
results;<br />
• appropriate order<strong>in</strong>g techniques for pattern extraction.<br />
5.1.2 Subspace Cluster<strong>in</strong>g Algorithms<br />
Given a set X of data points in some multidimensional space D, a subspace clustering<br />
algorithm aims to find a subset \(X_k\) of data points together with a subset \(D_k\) of dimensions<br />
such that the points in \(X_k\) are closely clustered in the subspace defined by the dimensions \(D_k\).<br />
The most critical part of subspace clustering is the subspace generation. Given a<br />
d-dimensional space, there are \(2^d\) possible subsets of dimensions. It is computationally<br />
infeasible to examine each possible subset to find subspaces of interest for a predefined<br />
pattern. Every algorithm is therefore based on some kind<br />
of heuristic that speeds up the search in such a huge combinatorial space. A number of<br />
subspace clustering algorithms with strategies for narrowing down the search space have<br />
been proposed in the past, and some of them are enumerated in Section 2.3.1. As suggested<br />
by Parsons et al. [110], the existing algorithms can be categorized into bottom-up and<br />
top-down strategies.<br />
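To illustrate why exhaustive enumeration is infeasible, the following sketch lists all non-empty subsets of a small set of dimensions and prints how the count grows with the number of dimensions. The helper name all_subspaces is ours; counting the empty set as well gives the full number of subsets of d dimensions.<br />

```python
from itertools import combinations


def all_subspaces(dimensions):
    """Yield every non-empty subset of the given dimensions as a tuple."""
    for size in range(1, len(dimensions) + 1):
        yield from combinations(dimensions, size)


subspaces = list(all_subspaces(["x", "y", "z"]))
print(subspaces)

# Number of non-empty subspaces for larger dimensionalities (2**d - 1).
for d in (10, 20, 50):
    print(d, 2**d - 1)
```

Already for a few dozen dimensions the number of candidate subspaces exceeds what can be examined one by one, which is why heuristics are needed.<br />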
The bottom-up approaches exploit the so-called “downward closure property” (or<br />
monotonicity property), which means that if a subspace S contains a cluster, then any subspace<br />
\(T \subseteq S\) must also contain a cluster. The property is used for pruning: if a subspace T<br />
does not have high enough density, then any superspace S with \(T \subseteq S\) can be excluded from<br />
the search space. A common implementation of a bottom-up approach starts from one-dimensional<br />
dense subspaces, iteratively considering an increasing number of dimensions<br />
and combining the dense units that are adjacent until no more new dense units are found.<br />
A typical algorithm will have three major steps:<br />
1. generate high-density units (subspaces) using an Apriori-like approach;<br />
2. assign cluster membership to each object;<br />
3. remove outliers that have distance to the cluster center higher than the critical value.<br />
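The first of these three steps, together with the pruning based on the downward closure property, can be sketched as follows. This is a simplified illustration and not the implementation of any specific published algorithm: it assumes non-empty data with values in [0, 1], an equal-width grid with `bins` intervals per dimension, and it calls a subspace dense if at least one grid cell holds `threshold` points. All function and parameter names are hypothetical, and cluster assignment and outlier removal (steps 2 and 3) are omitted.<br />

```python
from collections import Counter
from itertools import combinations


def is_dense(points, subspace, bins, threshold):
    """True if some grid cell of the subspace holds at least `threshold` points."""
    cells = Counter(
        tuple(min(int(p[d] * bins), bins - 1) for d in subspace) for p in points
    )
    return max(cells.values()) >= threshold


def dense_subspaces(points, num_dims, bins=4, threshold=4):
    """Return all dense subspaces, found level by level (values assumed in [0, 1])."""
    level = [(d,) for d in range(num_dims) if is_dense(points, (d,), bins, threshold)]
    found = list(level)
    while level:
        known = set(level)
        candidates = set()
        for a, b in combinations(level, 2):
            merged = tuple(sorted(set(a) | set(b)))
            if len(merged) != len(a) + 1:
                continue  # the two subspaces differ in more than one dimension
            # Downward closure: keep the candidate only if every subspace with
            # one dimension less is already known to be dense.
            if all(sub in known for sub in combinations(merged, len(a))):
                candidates.add(merged)
        level = sorted(c for c in candidates if is_dense(points, c, bins, threshold))
        found.extend(level)
    return found


# Hypothetical data: the first four points are close in dimensions 0 and 1 only.
points = [
    (0.10, 0.10, 0.10), (0.12, 0.15, 0.40), (0.20, 0.05, 0.60), (0.15, 0.20, 0.90),
    (0.60, 0.40, 0.30), (0.90, 0.80, 0.05), (0.40, 0.90, 0.80), (0.80, 0.60, 0.55),
]
result = dense_subspaces(points, num_dims=3)
print(result)
```

Because dimension 2 is not dense on its own in this example, no subspace containing it is ever tested, which is the saving that the pruning provides.<br />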
The top-down approach starts with an <strong>in</strong>itial configuration where data is clustered us<strong>in</strong>g<br />
the full feature space with equally weighted dimensions. Each dimension is assigned a<br />
weight for each cluster to characterize the relevance <strong>of</strong> the dimension to the cluster. Subsequently<br />
the annotated clusters are re-clustered tak<strong>in</strong>g <strong>in</strong>to account the weights assigned<br />
<strong>in</strong> the preced<strong>in</strong>g step. Typically sampl<strong>in</strong>g techniques are used to improve performance as<br />
the approach <strong>in</strong>volves multiple iterations <strong>of</strong> re-cluster<strong>in</strong>g <strong>in</strong> the full set <strong>of</strong> dimensions.<br />
Both kinds of approaches require some form of parametrization. Bottom-up approaches<br />
generally require the specification of threshold densities and bin size. Top-down approaches<br />
require the specification of the desired number of clusters (similar to k-means) and the<br />
average number of dimensions included in a subspace.<br />
In this chapter we use Proclus, which is one <strong>of</strong> the most established algorithms and<br />
has demonstrated advantages over a number <strong>of</strong> subspace cluster<strong>in</strong>g techniques [102]. Proclus<br />
[4] takes a top-down approach and extends the traditional k-medoid cluster<strong>in</strong>g algorithm.<br />
The k-medoid algorithm starts with an <strong>in</strong>itial partition and then iteratively assigns<br />
objects to medoids, computes the quality <strong>of</strong> cluster<strong>in</strong>g, and improves the partition and<br />
medoid. Proclus extends k-medoid by associat<strong>in</strong>g medoids with subspaces and improves<br />
both partitions and subspaces iteratively.<br />
Taking two input parameters, the number of clusters k and the average number of dimensions<br />
l, the algorithm proceeds in three phases. (1) In the initialization phase, the set of<br />
k medoid candidates is selected by picking a representative sample from the entire data<br />
and choosing the medoids from the representatives using a greedy method. (2) In the<br />
iterative phase, the medoids are improved and a subspace for each medoid is computed.<br />
This is done through the following steps. First, a random set of k medoids is<br />
selected from the representatives and the optimal set of dimensions is determined for each<br />
medoid. Then all the objects are assigned to the nearest medoid. If the current clustering<br />
is better than the previous one, then it is kept. The bad medoids are then determined and replaced with random<br />
representatives, and these steps are repeated until the clustering<br />
does not change anymore. (3) In the last phase, the cluster refinement phase, once the best<br />
medoids are found, the clustering is improved by determining optimal dimension sets for<br />
the medoids and reassigning the objects to clusters. Algorithm 1 presents the pseudocode<br />
from [4] describing the algorithmic steps in more detail.<br />
A number of reviews and surveys exist that compare and classify the subspace clustering<br />
approaches. The survey mentioned above by Parsons et al. [110] organizes the techniques<br />
in a hierarchy of algorithmic strategies and provides a small experiment on representative<br />
algorithms of each class. Kriegel et al. present a more thorough systematization<br />
and updated survey [90], where the broader problem of clustering high-dimensional data<br />
is discussed. The recent work of Müller et al. [102] presents a systematic<br />
evaluation of subspace clustering algorithms in terms of the quality of the generated output and<br />
performance. According to [102], Proclus is one of the best partitioning algorithms and<br />
has a good runtime compared to other techniques. We rely on this and use Proclus in our<br />
experiments.<br />
98 Chapter 5. Visual Subspace Analysis of High-Dimensional Data<br />
Algorithm 1 PROCLUS(No. of Clusters: k, Avg. Dimensions: l)<br />
{C_i is the ith cluster}<br />
{D_i is the set of dimensions associated with cluster C_i}<br />
{M_current is the set of medoids in the current iteration}<br />
{M_best is the best set of medoids found so far}<br />
{N is the final set of medoids with associated dimensions}<br />
{A, B are constant integers}<br />
/* 1. Initialization Phase: select set of k medoid candidates */<br />
S = random sample of size A · k<br />
M = GREEDY(S, B · k)<br />
/* 2. Iterative Phase: improve medoids and compute subspace for each medoid */<br />
BestObjective = ∞<br />
M_current = random set of medoids {m_1, m_2, ..., m_k} ⊂ M<br />
repeat<br />
  /* Approximate the optimal set of dimensions */<br />
  for each medoid m_i ∈ M_current do<br />
    Let δ_i be the distance to the nearest medoid from m_i<br />
    L_i = points in sphere centered at m_i with radius δ_i<br />
  end for<br />
  L = {L_1, ..., L_k}<br />
  (D_1, D_2, ..., D_k) = FindDimensions(k, l, L)<br />
  {Form the clusters}<br />
  (C_1, ..., C_k) = AssignPoints(D_1, ..., D_k)<br />
  ObjectiveFunction = EvaluateClusters(C_1, ..., C_k, D_1, ..., D_k)<br />
  if ObjectiveFunction < BestObjective then<br />
    BestObjective = ObjectiveFunction<br />
    M_best = M_current<br />
    Compute the bad medoids in M_best<br />
  end if<br />
  Compute M_current by replacing the bad medoids in<br />
  M_best with random points from M<br />
until (termination criterion)<br />
/* 3. Cluster Refinement Phase: improve quality of the partitions and subspaces */<br />
L = {C_1, ..., C_k}<br />
(D_1, D_2, ..., D_k) = FindDimensions(k, l, L)<br />
(C_1, ..., C_k) = AssignPoints(D_1, ..., D_k)<br />
N = (M_best, D_1, D_2, ..., D_k)<br />
return N
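To make the locality step of the iterative phase more concrete, the following Python sketch computes, for each current medoid, the radius δ_i and the locality set L_i used in Algorithm 1. It is a minimal illustration under stated assumptions, not the implementation of [4]: the function names `localities` and `avg_distance_per_dimension` are hypothetical, Euclidean distance in the full space is assumed, and points lying exactly on the sphere boundary are included. The second helper only shows a per-dimension statistic of the kind a dimension-selection step such as FindDimensions could build on; the actual selection procedure is not reproduced here.

```python
import numpy as np

def localities(points, medoid_idx):
    """For each medoid m_i, return the indices of all points within
    distance delta_i of m_i, where delta_i is the distance from m_i to
    its nearest other medoid (the sets L_i in Algorithm 1)."""
    medoids = points[medoid_idx]
    # Pairwise medoid distances (assumption: Euclidean, all dimensions).
    gaps = np.linalg.norm(medoids[:, None, :] - medoids[None, :, :], axis=2)
    np.fill_diagonal(gaps, np.inf)  # a medoid is not its own neighbor
    delta = gaps.min(axis=1)
    sets = []
    for m, d in zip(medoids, delta):
        dist = np.linalg.norm(points - m, axis=1)
        sets.append(np.flatnonzero(dist <= d))  # boundary points included
    return sets, delta

def avg_distance_per_dimension(points, medoid, members):
    """Average absolute distance of the locality members to the medoid,
    computed separately along every dimension."""
    return np.abs(points[members] - medoid).mean(axis=0)
```

Dimensions along which the locality members stay close to their medoid are natural candidates for that medoid's subspace, which is the intuition behind selecting a dimension set per medoid.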
5.1.3 Task Definition and Design Space for Visual Subspace Cluster Analysis<br />
Subspace cluster visualization remains a challenging task due to the multiple types of<br />
information contained in subspace clustering results, such as subspaces, cluster membership<br />
of objects, and overlap between subspaces and clusters. Existing subspace visualization<br />
techniques have been detailed in Section 2.4.2. To develop effective visualization systems<br />
for subspace cluster analysis, it is necessary to take into consideration the different tasks<br />
that are involved in the data analysis and use them as a basis for exploring the design space.<br />
Next, we describe the main tasks that an appropriate subspace cluster visualization technique<br />
needs to address and thereby provide a generic and reusable characterization. We also<br />
analyze the design space and provide (1) a classification and (2) a reasoned analysis of<br />
common design alternatives, from which a baseline design space is derived. This analysis<br />
serves as a baseline not only for the design of our proposed subspace cluster visualization<br />
system, but also for comparing existing approaches and identifying empty areas in this<br />
design space for future work.<br />
Scope of Subspace Cluster Analysis<br />
Clustering abstracts a larger data set to a smaller number of groups that are presumably<br />
more amenable to analysis and interpretation. Standard clustering algorithms rely on a<br />
fixed set of dimensions used in the similarity function of the clustering algorithm. Typically,<br />
the selection of dimensions is done outside of the clustering algorithm. Subspace<br />
clustering methods, on the other hand, provide an extended output that also includes the set<br />
of dimensions relevant to finding the groups, possibly described with weights indicating<br />
the importance of each dimension for the found result. Depending on the subspace method<br />
used, clusters can overlap in both dimensions and records. In<br />
principle, analysis of the subspace clustering can be done without considering the identified<br />
dimensions. In our work, we are interested in jointly analyzing the clustering results<br />
and the sets of selected dimensions, to provide enhanced analysis capabilities.<br />
Tasks<br />
Analyzing the properties of clusters and the relationships within and among them are important tasks<br />
in cluster analysis. We break these general analysis tasks down into a series of subtasks:<br />
T1 Reveal properties of individual clusters<br />
When analyzing clustering output, it is necessary to understand the main features of<br />
each generated cluster. In particular, once the clustering output has been generated<br />
and a visualization is constructed to represent it, it is necessary to perceive the<br />
following information:<br />
T1.1 How many records does the cluster contain?<br />
T1.2 How many dimensions are involved and what are their weights?<br />
T1.3 How are the data values distributed in each of the contained dimensions? (homogeneity<br />
of cluster members, central and outlier elements, subgrouping of<br />
clusters)<br />
T2 Enable cluster comparison<br />
Once the output has been considered and each cluster has been characterized visually,<br />
it is important to display the information in a way that allows meaningful comparisons<br />
among the clusters. It is important to understand how similar (or distant)<br />
clusters are, which translates into:<br />
T2.1 How do clusters differ with respect to contained records and involved dimensions?<br />
T2.2 Is there overlap between records and dimensions or are they distinct?<br />
T3 Indicate the quality of the generated cluster output<br />
Subspace clustering algorithms, like many methods that work on multidimensional<br />
spaces, are heavily based on heuristics and are dependent on parameterization. For<br />
this reason, clustering outputs are not always optimal. Even though research in subspace<br />
clustering has greatly improved the clustering quality, it is still important to be able<br />
to judge the output quality by considering the following:<br />
T3.1 How good is the clustering quality produced by a given algorithm?<br />
T3.2 How sensitive is the output with respect to parameter variations?<br />
We take these task considerations as a baseline for developing the ClustNails system<br />
presented in the next section. While we have not formally evaluated the degree to which<br />
ClustNails fulfills each of these criteria, we find that they are at the core of the functionality<br />
that ClustNails offers.<br />
Design Space<br />
In terms of the previously described tasks, the information entities of interest to be visualized<br />
are: Elements (data records, clusters, dimensions), Relationships (membership of<br />
records in clusters, cluster overlap with respect to records and dimensions), and Attributes<br />
(cluster size, dimension distribution, dimension weight, etc.).<br />
We identify two main categories of visualization solutions for the representation of the<br />
subspace clustering output: Cluster-Centric (CC) and Data-Centric (DC). Cluster-centric<br />
solutions focus on the representation of the clusters first, with the intent to allow<br />
their comparison. Data-centric solutions focus on the representation of the data<br />
values, with the intent to ease the interpretation of each cluster in terms of its internal<br />
distributions.<br />
There is a natural tension between these two extremes. Cluster-centric solutions scale<br />
much better in terms of the number of data items and dimensions. Their higher level of<br />
abstraction allows an easier comparison between the cluster features, however, at the<br />
expense of limiting their interpretation. In contrast, data-centric views ease cluster<br />
interpretation but do not scale very well with respect to data size and dimensionality.<br />
In our analysis of the design space, we explored several alternative visual designs and<br />
isolated some basic ones for both approaches. Discussing them briefly helps<br />
motivate our proposed final solution.<br />
Record-Centric Designs<br />
In record-centric designs each visual item represents a record. A 2D scatterplot projection<br />
is often used as a way to identify clusters of data elements in traditional clustering;<br />
however, it is not clear how to extend this design in a way that includes information about cluster<br />
dimensions. Parallel coordinates plots (PCP) could in principle be extended<br />
to represent subspace clusters by drawing lines between adjacent axes only when these<br />
belong to the cluster being drawn. But this generates complicated ordering problems, with<br />
potential extreme cases where the polyline of a whole record might not be drawn at all<br />
because its axes are never adjacent. Also, PCP do not scale well to data of even moderate<br />
dimensionality, which in turn is the main focus of subspace clustering. Heat maps (or matrix/tabular<br />
representations) can be extended more easily by using different color scales for<br />
included and not included dimensions. In addition, their design allows for easy reordering<br />
of records and dimensions so that the structure of the clusters can be more easily perceived.<br />
Cluster-Centric Designs<br />
In cluster-centric designs each visual item represents a cluster. A 2D scatterplot projection<br />
is possible, such as the one presented in VISA [14] (see also Figure 5.7). The clusters are<br />
projected with MDS, or similar techniques, taking into account their similarity according<br />
to some predefined criteria (e.g., shared number of dimensions). This solution makes it possible to<br />
group clusters according to their similarity, but their visibility and understanding are often<br />
hindered by the amount of overlap the items have. A matrix comparing one cluster to<br />
another in terms of their shared dimensions and records is also possible, but its effectiveness<br />
depends on how well rows and columns are ordered, and it is not necessarily the most<br />
compact design. Finally, icons or glyphs can be used to provide a rich representation<br />
of each cluster in a way that every single icon can provide information about cluster<br />
dimensions, records, and weights in an integrated fashion.<br />
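The cluster-to-cluster matrix mentioned above can be illustrated with a short Python sketch. It is only an assumption-laden example: the text does not fix a similarity measure, so the Jaccard coefficient is used here as one plausible choice, and the cluster representation (a dict with the hypothetical keys `records` and `dims`) as well as the function names are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_matrix(clusters, key):
    """Pairwise overlap of all clusters with respect to `key`,
    which is either "records" or "dims" in this sketch."""
    return [[jaccard(ci[key], cj[key]) for cj in clusters] for ci in clusters]

# Two toy subspace clusters sharing one record and one dimension.
clusters = [
    {"records": [1, 2, 3], "dims": [0, 1]},
    {"records": [3, 4], "dims": [1, 2]},
]
record_overlap = overlap_matrix(clusters, "records")
dimension_overlap = overlap_matrix(clusters, "dims")
```

The resulting matrices are symmetric with ones on the diagonal; how readable they are as a visualization still depends on the row and column ordering, as noted above.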
In ClustNails we integrate the best of the two approaches in a multiple-view user<br />
interface (see Figure 5.5). A cluster-centric view based on sorted icons provides support for<br />
cluster understanding and comparison (T1.1, T1.2, and T2.2). A data-centric view based<br />
on sorted and compressed heat maps provides support in interpreting and comparing the<br />
clusters in terms of their data distribution (T1.3, T2.1, T2.2). Both views in turn help in<br />
interpreting the quality of the generated output (T3.1 and T3.2). In the following section,<br />
we describe the whole system and its views in detail.<br />
5.1.4 The ClustNails System<br />
ClustNails is designed as an interactive visualization tool for subspace clustering analysis.<br />
It integrates a number of subspace clustering algorithms with novel visual representations<br />
and ordering techniques to help analysts generate subspace clusters from multidimensional<br />
data and identify interesting patterns from the visualization models. We next provide<br />
an overview of the design and main functionalities of the system, as well as a detailed<br />
description of the visualization and ordering techniques applied.<br />
Overview<br />
ClustNails integrates the OpenSubspace library of Weka [106], which contains a range of<br />
subspace clustering algorithms including Clique, Doc, Fires, Proclus, MineClus, INSCY,<br />
P3c, Schism, Statpc, and Subclu. The system takes multidimensional data as input,<br />
clusters the objects using a user-selected subspace clustering algorithm, and displays the<br />
...<br />
...<br />
clustering result in a multi-view user interface. A number of ordering functions allow the<br />
analyst to examine the results and compare clusters from different perspectives. Various<br />
user interactions allow the user to select clustering algorithms, parameters,<br />
and the order of the clustering results in the visualization panels. A linking-and-brushing<br />
function is implemented such that dimensions/clusters of interest can be highlighted in<br />
different views. By placing the mouse cursor over an item (record, dimension, or cluster)<br />
in the visualization panel, the analyst can see detailed information about the item in a tooltip.<br />
[Figure 5.2 (diagram). Panel labels: DATA SPACE: high-dimensional data (records x1 ... Xm, dimensions D0 to D9), Subspace Clustering Algorithm, subspace cluster. VISUAL SPACE: Subspace Cluster Visualization, subspace cluster view, cluster and dimension ordering.]<br />
Figure 5.2: Workflow of subspace cluster analysis using the ClustNails system.<br />
Figure 5.2 illustrates the workflow supported by our tool. Figure 5.2 (left) shows<br />
that the system loads a d-dimensional data set as input and a user-selected clustering<br />
algorithm computes the subspace clusters, provided as a list of clusters, each containing a<br />
subset of records and a subset of dimensions. Figure 5.2 (middle) shows that each cluster<br />
is quantified in terms of the number of instances and the associated number of dimensions;<br />
this information, together with the records of each subspace cluster, is visualized in a<br />
multiple-view visualization panel, which includes a Spikes view for cluster-centric analysis<br />
(top) and a HeatNails view for record-centric analysis (bottom). Figure 5.2 (right) shows<br />
that the order of clusters, dimensions, and records can be rearranged in each view for<br />
easy comparison between clusters. Next, we describe the different views and the supported<br />
ordering strategies.<br />
Visualization Components<br />
Visualization of Clusters: the Spikes View<br />
The Spikes view is a cluster-oriented view and provides a matrix of thumbnails, each<br />
representing a subspace cluster. Each cluster is visualized in a circular area that contains<br />
radial spikes. The spikes represent the individual dimensions (the subspace) that define<br />
the given cluster, and the spike length is scaled according to the weight (importance) of a<br />
dimension for the cluster (see below for the definition). The radial dimension sequence is<br />
identical for each spike glyph. The number of records in the cluster is represented by the<br />
area of the inner circle.<br />
Subspace clustering algorithms provide as output a subset of dimensions $D_k$ for each<br />
cluster $SC_k$, as well as the set of instances (records) $X_k$ of this cluster. Given a dimension<br />
$m$ within the set of dimensions $D_k$ in a subspace cluster $SC_k$, we define the weight of that<br />
dimension in that cluster as:<br />
$$w_k^m = \frac{\sum_{x_i^m \in X_k^m} \left| x_i^m - c_k^m \right|}{|X_k|}, \qquad (5.2)$$<br />
where $c_k^m$ is the center of the points in $X_k$ along the dimension $m$, $x_i^m$ is the value in dimension<br />
$m$ of the point $x_i$ of this cluster, and $|X_k|$ is the number of elements in $SC_k$. The smaller<br />
$w_k^m$ is, the more compact the points are around the center in dimension $m$. This implies<br />
that dimensions with smaller weights have better clustered points and are regarded as more<br />
important for a cluster. We normalize the weights $w_k^m$ for all dimensions of all clusters to<br />
the interval [0, 1] and map the corresponding values inversely to the length of the spike.<br />
The lower $w_k^m$ (the more important the dimension), the longer the corresponding spike.<br />
Note that owing to our definition of $w_k^m$, the relationship between weights and importance<br />
is inverse, and we reflect this by an inverse mapping between weights and the size of the visual<br />
attribute (the spikes). Also note that if the given subspace clustering algorithm natively<br />
outputs weights for each dimension, those weights can be mapped to the spike length<br />
instead.<br />
Figure 5.3: Two subspace clusters visualized as spikes. The clusters share common dimensions,<br />
but the importance of the dimensions for the clusters is different. Dim29 and dim32 in the left<br />
cluster show shorter spikes than in the right cluster, as they are considered less important for the<br />
definition of that cluster according to our measure $w_k^m$. Furthermore, the left cluster has fewer<br />
dimensions and more objects than the right cluster.<br />
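The weight of Equation 5.2 and its inverse mapping to the spike length can be sketched in a few lines of Python with NumPy. This is an illustrative sketch rather than the ClustNails implementation: the function names are hypothetical, the cluster center is assumed to be the arithmetic mean of the cluster's records, and min-max scaling is assumed as the normalization to [0, 1], since the text does not specify either choice.

```python
import numpy as np

def dimension_weights(X_k, dims):
    """w_k^m for every m in dims: mean absolute deviation of the records
    of cluster k from the cluster center along dimension m (Eq. 5.2).
    X_k holds one record per row."""
    sub = X_k[:, dims]
    center = sub.mean(axis=0)  # assumption: center = arithmetic mean
    return np.abs(sub - center).mean(axis=0)

def spike_lengths(weights_per_cluster):
    """Normalize the weights of all dimensions of all clusters jointly
    to [0, 1] (assumption: min-max scaling) and invert them, so that a
    small weight (compact dimension) yields a long spike."""
    all_w = np.concatenate(weights_per_cluster)
    lo, span = all_w.min(), all_w.max() - all_w.min()
    lengths = []
    for w in weights_per_cluster:
        norm = (w - lo) / span if span > 0 else np.zeros_like(w)
        lengths.append(1.0 - norm)
    return lengths
```

With this mapping, the most compact dimension over all clusters receives spike length 1 and the most spread-out one receives length 0, which mirrors the inverse relationship between weight and importance described above.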
The visual representation for each subspace cluster is a circle in the Spikes view. Each<br />
spike in a circle represents a dimension contained in that subspace. The length of the<br />
spike reflects the weight of the dimension for that particular cluster (the longer the spike, the<br />
more important the dimension). The order of the dimensions is identical for each cluster. The area of<br />
the inner circles indicates the number of records within each cluster. Figure 5.3 illustrates<br />
the Spikes view.<br />
The resulting Spikes view allows users to quickly recognize overlapping dimensions<br />
by comparing the spike patterns of the different clusters. To support this comparison, the<br />
background is divided into pies and colored alternately with two colors (gray and light<br />
red). This supports the comparison of the spike angles in two different clusters.<br />
Visualization of Records: the HeatNails View<br />
The HeatNails view is an extended heat map displaying the data values and dimensions.<br />
Rows represent dimensions, and columns represent data items (records). Each HeatNail<br />
cell represents a data value of a record in one dimension. Data items are grouped by<br />
clusters. These clusters are aligned next to each other and separated by black lines. Data<br />
values are normalized globally and mapped to an appropriate color scale. A yellow-to-green<br />
color scale is used for dimensions that are members of the given cluster, while a<br />
gray scale is used for the remaining dimensions of the data set per cluster (see Figure 5.4<br />
(bottom)). This allows for an effective visual perception of the distribution of values<br />
across dimensions, and of the relation between dimensions and clusters with respect to their<br />
inclusion in the cluster definition.<br />
Figure 5.4: HeatNails visualization. Bottom: showing the distribution of dimension values for all<br />
dimensions (rows) and records (columns). Top: showing histograms for the values of all dimensions<br />
per cluster for comparison purposes.<br />
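The cell encoding of the HeatNails heat map can be sketched as follows in Python with NumPy. The sketch rests on assumptions that are not fixed by the text: "normalized globally" is read as min-max scaling over the entire data matrix, the function name `heatnails_cells` is hypothetical, and the actual color lookup (yellow-to-green versus gray) is left to whatever rendering library is used.

```python
import numpy as np

def heatnails_cells(data, cluster_records, cluster_dims):
    """Prepare one cluster block of the heat map.

    data            : array with one record per row, one dimension per column
    cluster_records : row indices of the records in the cluster
    cluster_dims    : column indices of the cluster's subspace dimensions

    Returns (values, in_subspace): `values` has one row per dimension and
    one column per record of the cluster, scaled to [0, 1] over the whole
    data set; `in_subspace` marks the rows that get the color scale
    (True) as opposed to the gray scale (False)."""
    lo, span = data.min(), data.max() - data.min()
    norm = (data - lo) / span if span > 0 else np.zeros_like(data, dtype=float)
    values = norm[cluster_records].T  # rows = dimensions, columns = records
    in_subspace = np.isin(np.arange(data.shape[1]), cluster_dims)
    return values, in_subspace
```

Placing the blocks of all clusters side by side, with a separator between them, yields the layout described above; a record that belongs to several clusters simply appears in several blocks.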
We also give a summary representation of the values of the dimensions occurring<br />
in the clusters. The distribution of dimension values of each cluster is discretized into a<br />
histogram and visualized by color (for dimensions included) and gray scales (for dimensions<br />
not included). This allows for easy comparison between clusters with respect to data<br />
values. Figure 5.4 (top) shows these histogram views. Finally, depending on the clustering<br />
algorithm, it is possible that records are members of multiple clusters. We illustrate this by<br />
marking the cluster IDs of multi-cluster members at the bottom of the display. Like<br />
the Spikes view, the HeatNails view also allows the quick recognition of overlapping<br />
dimensions across the clusters by means of the given color and gray-scale patterns. Both<br />
Spikes and HeatNails views incorporate linking-and-brushing functionality. Clicking on any<br />
set of dimensions/clusters of interest in one view highlights the same dimensions/clusters<br />
in all other views.<br />
Ordering Heuristics<br />
Ordering is implemented to support the perception of structural similarities of clusters with respect<br />
to dimensions and value distributions. As ordering problems for clusters, dimensions,<br />
and records are typically complex NP-complete combinatorial optimization problems [9],<br />
we rely on heuristics to order dimensions, records, clusters, and values in the various displays.<br />
Our essential idea is to place similar or closely related objects together to help the<br />
analyst find interesting patterns.<br />
Dimension Order<strong>in</strong>g<br />
To f<strong>in</strong>d a global order<strong>in</strong>g <strong>of</strong> the dimensions, we compute a frequency value for each dimension,<br />
denot<strong>in</strong>g the number <strong>of</strong> subspace clusters that are us<strong>in</strong>g this dimension. We order<br />
the list <strong>of</strong> dimensions by this frequency value start<strong>in</strong>g the sequence <strong>of</strong> dimensions with the<br />
dimension that is most frequently used by the set of subspace clusters. The next positions are<br />
filled greedily: the dimension that co-occurs most frequently with the previously<br />
positioned dimension is placed next. If no co-occurrence is found, the most frequent<br />
dimension among the remaining dimensions is positioned next in the ordering vector. The
dimension order<strong>in</strong>g can be applied to both the Spikes view and HeatNails view.<br />
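The greedy procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the ClustNails implementation: the function name `order_dimensions`, the representation of each subspace cluster as a set of dimension names, and the alphabetical tie-breaking are assumptions made for the example.<br />

```python
from collections import Counter
from itertools import combinations


def order_dimensions(clusters):
    """Greedy dimension ordering; `clusters` is a list of sets of dimension names."""
    # Frequency: number of subspace clusters that use each dimension.
    freq = Counter(d for dims in clusters for d in dims)
    # Co-occurrence: number of clusters that contain both dimensions of a pair.
    cooc = Counter()
    for dims in clusters:
        for a, b in combinations(sorted(dims), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1

    remaining = set(freq)
    # Start with the most frequently used dimension (ties broken by name, an assumption).
    current = min(remaining, key=lambda d: (-freq[d], d))
    order = [current]
    remaining.remove(current)

    while remaining:
        best = min(remaining, key=lambda d: (-cooc[(current, d)], d))
        if cooc[(current, best)] == 0:
            # No co-occurrence with the previous dimension: fall back to frequency.
            best = min(remaining, key=lambda d: (-freq[d], d))
        order.append(best)
        remaining.remove(best)
        current = best
    return order


example = [{"water", "calories"}, {"water", "calories", "fat"}, {"protein", "fat"}, {"sugar"}]
print(order_dimensions(example))  # ['calories', 'water', 'fat', 'protein', 'sugar']
```

Dimensions that no cluster uses never enter the ordering in this sketch; how such dimensions are positioned is not specified in the text.<br />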
Subspace Cluster Order<strong>in</strong>g<br />
A useful visual representation <strong>of</strong> subspace cluster<strong>in</strong>g results should arrange similar subspaces<br />
next to each other to reduce visual search time by the user. We propose an order<strong>in</strong>g<br />
strategy that is formalized in the following. Using the dimension weights defined in Equation<br />
5.2, we propose a measure for the global interestingness $I^g_{SC_k}$ of a cluster $SC_k$:<br />
\[ I^g_{SC_k} = \frac{\sum_{m \in D_k} w^k_m}{|D_k|}, \tag{5.3} \]<br />
where $w^k_m$ is the weight of dimension $m \in D_k$ of $SC_k$, and $|D_k|$ is the number of dimensions<br />
<strong>in</strong> this subcluster. We def<strong>in</strong>e the global <strong>in</strong>terest<strong>in</strong>gness <strong>of</strong> a cluster k as the average <strong>of</strong> the<br />
weights <strong>of</strong> the dimensions conta<strong>in</strong>ed <strong>in</strong> this subcluster. This measure is used to determ<strong>in</strong>e<br />
the first cluster <strong>in</strong> the order<strong>in</strong>g. We then use the subspace cluster distance (eq. 5.4)<br />
employed <strong>in</strong> [14] to f<strong>in</strong>d the most similar cluster, which is placed next to the <strong>in</strong>itial cluster.<br />
This distance function is a convex sum <strong>of</strong> subspace distance and object distance:<br />
\[ \beta \left( 1 - \frac{|D_i \cap D_j|}{|D_i \cup D_j|} \right) + (1 - \beta) \left( 1 - \frac{|X_i \cap X_j|}{\min\{|X_i|, |X_j|\}} \right) \quad \text{[14]} \tag{5.4} \]<br />
where $|D_i \cap D_j|$ is the number of common dimensions of the two subspaces $i$ and $j$, and<br />
$|X_i \cap X_j|$ the number of shared objects of the two subspaces. We continue this placement<br />
until all clusters are placed.<br />
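As a sketch of how Equations 5.3 and 5.4 could drive the ordering, the following Python code picks a start cluster and then repeatedly appends the closest unplaced cluster. Several details are assumptions rather than statements from the text: that the start cluster is the one with the highest global interestingness, that each next cluster is compared against the most recently placed one, the default value 0.5 for the mixing weight, and the hypothetical dictionary layout with the keys `weights` and `records`.<br />

```python
def interestingness(weights):
    """Eq. 5.3: mean weight of the dimensions contained in one subspace cluster."""
    return sum(weights.values()) / len(weights)


def cluster_distance(a, b, beta=0.5):
    """Eq. 5.4: convex sum of subspace distance and object distance."""
    dims_a, dims_b = set(a["weights"]), set(b["weights"])
    d_sub = 1 - len(dims_a & dims_b) / len(dims_a | dims_b)
    shared = len(a["records"] & b["records"])
    d_obj = 1 - shared / min(len(a["records"]), len(b["records"]))
    return beta * d_sub + (1 - beta) * d_obj


def order_clusters(clusters, beta=0.5):
    """Greedy ordering: most interesting cluster first, then nearest neighbours."""
    remaining = dict(clusters)
    # Assumption: the first cluster is the one with the highest interestingness.
    current = max(remaining, key=lambda k: interestingness(remaining[k]["weights"]))
    order = [current]
    last = remaining.pop(current)
    while remaining:
        # Assumption: similarity is measured against the most recently placed cluster.
        nxt = min(remaining, key=lambda k: cluster_distance(last, remaining[k], beta))
        order.append(nxt)
        last = remaining.pop(nxt)
    return order


clusters = {
    "C0": {"weights": {"water": 0.9, "calories": 0.7}, "records": {1, 2, 3}},
    "C1": {"weights": {"protein": 0.4, "fat": 0.2}, "records": {7, 8}},
    "C2": {"weights": {"water": 0.6, "calories": 0.5, "fat": 0.1}, "records": {2, 3, 4}},
}
print(order_clusters(clusters))  # ['C0', 'C2', 'C1']
```

In this toy input, C0 and C2 share two of three dimensions and two of three records, so both partial distances equal 1/3 and C2 is placed directly after C0.<br />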
Record Order<strong>in</strong>g<br />
Two different types of record ordering strategies are implemented in HeatNails. One strategy<br />
is to order the records from min to max with respect to their values in the dimension<br />
that has the largest variance among all dimensions. A second strategy is to order the<br />
records according to their Euclidean distance, across the dimensions contained in the given<br />
subspace, from a selected starting record. The starting record, in turn, may either be<br />
selected by the user or selected automatically as the record that shows the largest variance over<br />
all dimensions.<br />
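Both record ordering strategies can be illustrated with the Python standard library. The sketch below is one possible reading of the description, under stated assumptions: records are equal-length lists of numbers, population variance is used, "variance over all dimensions" of a record means the variance of that record's own values, and the function names are hypothetical.<br />

```python
import math
from statistics import pvariance


def order_by_max_variance_dimension(records):
    """Return record indices sorted min-to-max on the dimension with the largest variance."""
    n_dims = len(records[0])
    top = max(range(n_dims), key=lambda j: pvariance([r[j] for r in records]))
    return sorted(range(len(records)), key=lambda i: records[i][top])


def order_by_distance(records, subspace, start=None):
    """Return record indices sorted by Euclidean distance to a start record.

    `subspace` lists the column indices of the dimensions contained in the subspace.
    """
    if start is None:
        # Automatic choice (assumption): the record whose own values vary the most.
        start = max(range(len(records)), key=lambda i: pvariance(records[i]))

    def project(record):
        return [record[j] for j in subspace]

    origin = project(records[start])
    return sorted(range(len(records)), key=lambda i: math.dist(origin, project(records[i])))


records = [[1, 50, 3], [2, 10, 4], [9, 90, 8], [4, 30, 3]]
print(order_by_max_variance_dimension(records))        # [1, 3, 0, 2]
print(order_by_distance(records, subspace=[0, 2], start=0))  # [0, 1, 3, 2]
```

In practice the dimensions would usually be normalized first, since otherwise the dimension with the largest numeric range dominates both the variance and the distance.<br />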
Value Order<strong>in</strong>g<br />
A value order<strong>in</strong>g facility is implemented <strong>in</strong> the HeatMap view and visible <strong>in</strong> the top<br />
summary row <strong>of</strong> the HeatNails view. In each row the distribution <strong>of</strong> values <strong>in</strong> a given<br />
dimension is shown. To that end, we sort the values from m<strong>in</strong> to max, and b<strong>in</strong> them <strong>in</strong>to<br />
a user-selectable number <strong>of</strong> b<strong>in</strong>s. In this view the distribution <strong>of</strong> values per dimension<br />
and cluster is <strong>in</strong>dicated <strong>in</strong> the form <strong>of</strong> a color-coded histogram. The histograms help <strong>in</strong><br />
understand<strong>in</strong>g the distribution <strong>of</strong> data values with<strong>in</strong> each dimension, and may support<br />
f<strong>in</strong>d<strong>in</strong>g out why a particular dimension was selected or not by the cluster<strong>in</strong>g algorithm.<br />
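The sorting and binning step can be sketched as follows. The text does not say how bin boundaries are chosen, so equal-width bins between the minimum and maximum value are an assumption here, as is the function name `binned_histogram`.<br />

```python
def binned_histogram(values, n_bins):
    """Sort values min-to-max and count how many fall into each of n_bins equal-width bins."""
    ordered = sorted(values)
    lo, hi = ordered[0], ordered[-1]
    # Guard against a zero bin width when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for v in ordered:
        # The maximum value is clamped into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


print(binned_histogram([1, 1, 1, 9], 4))  # [3, 0, 0, 1]
```

The resulting counts per bin would then be mapped to color to obtain the color-coded histogram row for one dimension of one cluster.<br />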
Summary and Discussion <strong>of</strong> the ClustNails System Design<br />
ClustNails is an <strong>in</strong>tegrated system for visual subspace cluster analysis. Its design features<br />
(1) a number of subspace clustering algorithms from which the user can choose and (2) a<br />
design of different visual representations for the most important aspects of the output of<br />
automatic subspace cluster analysis.
Regard<strong>in</strong>g (1), we provide access to a number <strong>of</strong> state <strong>of</strong> the art algorithms as conta<strong>in</strong>ed<br />
<strong>in</strong> the OpenSubspace library [106]. The list <strong>of</strong> <strong>in</strong>tegrated algorithms is extensive.<br />
Regard<strong>in</strong>g (2), we composed a visual display <strong>of</strong> three aspects. The Spikes view is<br />
<strong>in</strong>spired by star glyphs and dist<strong>in</strong>guishes clusters from each other, <strong>in</strong> terms <strong>of</strong> <strong>in</strong>cluded<br />
dimensions. The radial basis shape <strong>in</strong> the Spikes view is visually dom<strong>in</strong>ant and allows fast<br />
perception of cluster properties. Sorting of the cluster glyphs by similarity relieves users<br />
(at least partially) of sequential visual search. The Spikes view is complemented by the<br />
HeatNails view, which is a dimension-oriented detail view that we provide in a coordinated<br />
view, below the cluster glyphs. The HeatNails view is based on the ideas of heat maps and<br />
the pixel paradigm for showing the maximum possible information, allocating in the extreme case<br />
only one pixel per record dimension (bottom view) or histogram bin per cluster dimension<br />
(top view). The overall layout of the three views follows an overview-first approach, from<br />
the most aggregate view at the top (the Spikes view of clusters) to the most detailed view<br />
(the HeatNails record view) at the bottom. The histogram view showing the distribution of<br />
dimensions per cluster is located <strong>in</strong> the middle.<br />
We designed this integrated layout with the different subspace clustering output<br />
parameters in mind, and arranged them according to the level of detail provided. While<br />
we believe our system design is justified from these considerations, we recognize that<br />
other multidimensional visualization techniques do exist, which could be alternative views<br />
<strong>in</strong> our visualization layout. Parallel coord<strong>in</strong>ates <strong>in</strong> conjunction with color-cod<strong>in</strong>g could be<br />
an option. A dedicated user study, as part <strong>of</strong> future work, could explore design alternatives<br />
and compare them with each other.<br />
5.1.5 Use Case and System Comparison<br />
We apply the ClustNails system to a real world data set, demonstrat<strong>in</strong>g its applicability<br />
and illustrating different types of analysis one can perform with it. Then we compare it<br />
with the state-of-the-art system VISA [14] to validate the effectiveness of the system and<br />
its design.<br />
Use Case: USDA Food Composition <strong>Data</strong> Set<br />
We analyzed the USDA Food Composition data set 1 that conta<strong>in</strong>s a full collection <strong>of</strong><br />
raw and processed foods characterized by their composition <strong>in</strong> terms <strong>of</strong> nutrients. The<br />
data comprises more than 7000 records and 44 dimensions. We selected Proclus for the<br />
cluster<strong>in</strong>g task. As parameters we set the number <strong>of</strong> clusters to 15, and the average number<br />
of dimensions to 8. Figure 5.5 shows the result generated by the system with these settings.<br />
From Figure 5.5 we can see that clusters C11, C12, C13, and C14 (highlighted red)<br />
all share the same two dimensions, water and calories, although the sizes of the clusters<br />
vary from 4 to 24 records. All the records share some common features: high water<br />
content and low calories. To gain more understanding of the clustering result, one<br />
can drill down to each record by checking the data table or detail-on-demand information<br />
displayed in tooltips upon mouse-over actions. It is not difficult to find out that these<br />
groups mostly consist <strong>of</strong> foods that are commonly regarded as “healthy”. Foods <strong>of</strong> similar<br />
nature, e.g., lima and mango beans, various types <strong>of</strong> low-fat dairy products, and soups are<br />
1 http://www.ars.usda.gov/
Figure 5.5: Visualization of the subspace clusters C0 to C14 of the USDA Food Composition data set generated<br />
by Proclus.<br />
placed in the same groups, which suggests that the clustering is meaningful.<br />
Us<strong>in</strong>g the value order<strong>in</strong>g function <strong>in</strong> the HeatNails, we can further explore the distribution<br />
<strong>of</strong> data values <strong>in</strong>side each cluster and look for <strong>in</strong>terest<strong>in</strong>g patterns (see Figure 5.6).<br />
We note that most <strong>of</strong> the data values <strong>in</strong> the dimensions not selected by Proclus have relatively<br />
large variance. This is not surpris<strong>in</strong>g as subspace cluster<strong>in</strong>g algorithms are typically<br />
designed to reduce the sparsity <strong>of</strong> data by discard<strong>in</strong>g dimensions that have big variances.<br />
Figure 5.6: Sorted view of clusters C0 to C14 (value ordering function applied).<br />
Looking in the sorted view at how the same two dimensions are distributed across<br />
the other clusters, it is not difficult to identify clusters, like C10, which have similar trends<br />
over the two dimensions but have stronger patterns in other dimensions (exceptionally low<br />
values for both total lipids and proteins, discussed later); thus the two dimensions are not<br />
selected to characterize the cluster. This type of information is not only useful in<br />
helping to understand the cluster analysis result, but also adds more transparency to the<br />
data mining algorithms, which are usually hidden from the user in black boxes. On<br />
closer inspection, we can identify a cluster that also shares the two dimensions, but with<br />
an inverse trend, that is, low water content and high calories (C6). The detailed<br />
information reveals that this cluster represents a whole set of different candies (probably<br />
not the most recommendable food for a diet).<br />
Another <strong>in</strong>terest<strong>in</strong>g cluster is C10, which is characterized by an exceptionally low value<br />
for both total lipids and prote<strong>in</strong>s. All the other records, exclud<strong>in</strong>g the ones <strong>in</strong> C1, have<br />
either consistently high values or higher variances in one of these two dimensions. The records in C10<br />
represent various kinds of beverages such as alcoholic beverages, teas, and fruit-based
toppings. C1 is characterized by the same trend but it forms a different cluster with<br />
exceptionally low values for other nutrients like various k<strong>in</strong>ds <strong>of</strong> fats and vitam<strong>in</strong> B12. All<br />
the foods <strong>in</strong> C1 are aga<strong>in</strong> beverages.<br />
Compar<strong>in</strong>g C10 to C1, one can notice that C10 has, <strong>in</strong> fact, a very similar distribution<br />
<strong>of</strong> values <strong>in</strong> the dimensions that are <strong>in</strong>cluded <strong>in</strong> C1. This is a clear example <strong>in</strong> which the<br />
output <strong>of</strong> the algorithm is not optimal and a merge <strong>of</strong> these two would make sense.<br />
Comparison with VISA<br />
Figure 5.7: Visualization of the subspace clusters in the VISA [14] framework discussed in Subsection<br />
5.1.5. Cluster view (left), record view (right).<br />
Figure 5.7 shows the representation <strong>in</strong> VISA [14] <strong>of</strong> the same subspace clusters as used<br />
for our above use case (same data set, same cluster<strong>in</strong>g result). As we can see, the 15<br />
clusters are projected <strong>in</strong> the cluster view to a 2D scatterplot us<strong>in</strong>g MDS based on their<br />
dimension similarity (left screenshot <strong>in</strong> Figure 5.7). Each cluster is represented by a circle<br />
scaled accord<strong>in</strong>g to the cluster size. The record-centric view shows the result as a heat<br />
map (right screenshot <strong>in</strong> Figure 5.7), where rows represent records and columns represent<br />
dimensions. Different color codes are used in the heat map: black for unselected dimensions,<br />
brightness for interestingness, and hue for data values. We recognize the following<br />
benefits of the ClustNails design compared with VISA:<br />
• Overlap<br />
Circles of different sizes in the VISA MDS projection can cause occlusion problems<br />
and end up with over-cluttered displays. For example, only 9 out <strong>of</strong> 15 clusters<br />
are visible <strong>in</strong> the cluster view <strong>in</strong> Figure 5.7. The Spikes and HeatNails views avoid<br />
overlap. One may argue that scatterplots scale better, but <strong>in</strong> practice the number<br />
<strong>of</strong> clusters <strong>in</strong> a result is usually small, because a large number <strong>of</strong> clusters implies,<br />
<strong>in</strong> many cases, a poor performance <strong>of</strong> the cluster<strong>in</strong>g algorithm [90]. The scatterplot<br />
visualization, on the other hand, suffers from occlusion problems regardless of the<br />
number <strong>of</strong> clusters. Also, the ClustNails glyphs provide richer <strong>in</strong>formation for each<br />
cluster, as described next.<br />
• Richer <strong>in</strong>formation<br />
VISA shows only the number <strong>of</strong> records and dimensions <strong>of</strong> each cluster and maps the<br />
similarities between clusters to distances. The Spikes view <strong>in</strong> ClustNails extends this<br />
basic encod<strong>in</strong>g by <strong>in</strong>clud<strong>in</strong>g additional <strong>in</strong>formation about each cluster, permitt<strong>in</strong>g a<br />
user to (1) draw richer <strong>in</strong>formation from the result and (2) detect and understand the
similarities between clusters more easily. Specifically, the spikes permit one to see the<br />
detailed dimensions and their correspond<strong>in</strong>g importance <strong>in</strong> each subspace and thus<br />
to relate one cluster to another. The l<strong>in</strong>k<strong>in</strong>g-and-brush<strong>in</strong>g technique implemented<br />
<strong>in</strong> the Spikes view helps <strong>in</strong> highlight<strong>in</strong>g the shared dimensions among clusters.<br />
• Order<strong>in</strong>g supports comparison<br />
The ClustNails order<strong>in</strong>g techniques place similar clusters, dimensions, and records<br />
close to each other. These techniques make it easier to detect similarities and dissimilarities<br />
between the clusters. No ordering technique is implemented in the<br />
current version of VISA; the similarity of clusters can only be seen in the cluster view,<br />
represented by the 2D distance of clusters in the projection.<br />
• Scalability<br />
The heat map solution implemented in VISA was originally designed to display a limited<br />
number of records that belong to a small subset of clusters. The compression<br />
techniques we propose for the thumbnails view of HeatNails can scale up to a much<br />
larger number of records and thus are not limited to representing only a subset of the<br />
data. Subspace clustering algorithms can produce hundreds of subspace clusters for<br />
some parameter settings. To analyze and understand whether the result makes sense, the<br />
clusters need to be displayed and compared. Our histogram views can be used to<br />
visualize this output; they can also be ordered linearly into multiple rows, or even a<br />
two-dimensional ordering heuristic can be developed to make the technique scale.<br />
• Non-member dimensions<br />
In VISA all data values <strong>in</strong> the unselected dimensions are colored <strong>in</strong> black; hence<br />
the <strong>in</strong>formation <strong>in</strong> these segments is miss<strong>in</strong>g from the visualization. This may be<br />
detrimental to data understand<strong>in</strong>g as the <strong>in</strong>formation conta<strong>in</strong>ed <strong>in</strong> those segments<br />
provides evidence <strong>of</strong> why the cluster<strong>in</strong>g algorithm did not select a given dimension<br />
to characterize the cluster. The algorithm choice can be justified if the visualization<br />
shows extreme values or has large variances <strong>in</strong> the unselected dimensions. Our design<br />
displays these dimensions <strong>in</strong> a gray scale so they can be used to understand the result.<br />
5.1.6 Conclusions and Future Work<br />
Subspace cluster<strong>in</strong>g addresses an important problem <strong>in</strong> cluster<strong>in</strong>g multidimensional data.<br />
The algorithms successfully reduce the noise <strong>in</strong> multidimensional data by show<strong>in</strong>g clusters<br />
that exist only <strong>in</strong> subsets <strong>of</strong> dimensions <strong>in</strong> the data. <strong>Visual</strong>ization <strong>of</strong> subspace cluster<strong>in</strong>g<br />
results is challeng<strong>in</strong>g. In addition to the <strong>in</strong>formation conta<strong>in</strong>ed <strong>in</strong> traditional cluster<strong>in</strong>g<br />
results, subsets of dimensions that define clusters, and overlap between dimensions and<br />
records, need to be represented in an understandable and uncluttered way. ClustNails<br />
was presented as an <strong>in</strong>teractive data analysis and visualization tool for subspace cluster<strong>in</strong>g<br />
analysis. It provides several novel visualization and order<strong>in</strong>g techniques to help analysts<br />
extract subspace clusters from data and then analyze the results. The system implements<br />
linked and ordered cluster-centric (Spikes) and record-centric (HeatNails) views. We<br />
demonstrated the effectiveness of our system design in the analysis of real-world data and in<br />
a comparison with existing visual subspace cluster analysis systems.<br />
For future work, one extension of the system is particularly needed: support for parameter<br />
selection, which is a difficult problem given that each algorithm has its own parameters
and different settings may generate very different results. Another extension could be the<br />
development of a so-called “agreement matrix” among a set of results that shows those<br />
parts that most results agree on. The agreement matrix could then be used to evaluate the<br />
quality of individual outputs and to help the analyst understand the consensus reached<br />
by different algorithms and parameter settings. Another future research direction might<br />
<strong>in</strong>clude improv<strong>in</strong>g the scalability <strong>of</strong> the ClustNails system. While we have not done a<br />
formal evaluation, we assume scalability is restricted to dozens <strong>of</strong> clusters and dimensions,<br />
depend<strong>in</strong>g on the resolution <strong>of</strong> the given display. Some results may conta<strong>in</strong> hundreds <strong>of</strong><br />
clusters and thousands <strong>of</strong> dimensions, for which scalable solutions are needed.<br />
5.2 <strong>Visual</strong> <strong>Analytics</strong> <strong>of</strong> Subspace Search<br />
Many methods are currently available for the exploratory analysis of high-dimensional<br />
data spaces. Automatic approaches proposed so far include dimensionality reduction<br />
and cluster analysis, whereas visual-interactive methods aim to provide effective visual<br />
mappings to show, relate, and navigate this data.<br />
As described before, analyzing high-dimensional data is notoriously difficult as interesting<br />
patterns may occur in any possible subspace. We address two important research<br />
directions to discover the patterns hidden in these data spaces. One was proposed in the<br />
previous section, where a visual interactive system for analyzing the results of subspace clustering<br />
algorithms was developed to support a better understanding. The second direction goes<br />
one step back, before the clustering step, and identifies important subspaces where possible<br />
patterns may occur. Looking at one single interesting subspace is often not sufficient since<br />
different subspaces may show confirmatory, complementary, conjoint, or contradictory<br />
relations between data items. We propose a novel method for the visual analysis of interesting<br />
subspaces, pointing out these types of relations. Based on appropriately defined<br />
subspace similarity functions, we visualize the subspaces and provide navigation facilities<br />
to interactively explore large sets of subspaces. Our approach allows users to effectively<br />
compare and relate subspaces with respect to involved dimensions and clusters of objects.<br />
We apply our approach to synthetic and real data sets. We thereby demonstrate its<br />
support for understanding high-dimensional data from different perspectives, effectively<br />
yielding a more complete view of high-dimensional data.<br />
5.2.1 Introduction<br />
For large feature spaces, <strong>in</strong>terest<strong>in</strong>g patterns may <strong>of</strong>ten be located only <strong>in</strong> subspace projections<br />
of the data. As insights may be hidden in more than one subspace, a relevant<br />
analysis should also consider multiple subspaces and their interrelations. Especially for<br />
high-dimensional data we can expect to have different views of the same data [58, 107],<br />
i.e., the same objects might group differently given different subspace perspectives (see<br />
Figure 5.8 for an illustration). The existence of alternative relevant subspaces may stem<br />
from the data description process, e.g., when features (dimensions)<br />
describing different semantic properties of the data are combined during preprocessing. For instance, in demographic<br />
analysis, households are often described by an array of many variables, combinations of which<br />
constitute different conceptual domains, such as wealth, mobility, or health.<br />
Likewise, it may be the combination of otherwise semantically unrelated dimensions<br />
that results in interesting patterns. In the Data Mining community,<br />
a class <strong>of</strong> so-called Subspace Analysis algorithms has been proposed to cope with the problem<br />
<strong>of</strong> identify<strong>in</strong>g <strong>in</strong>terest<strong>in</strong>g subspaces and clusters from a high-dimensional data set. To<br />
date, however, there has been a very limited focus on the presentation and <strong>in</strong>terpretation<br />
<strong>of</strong> the generated output. Furthermore, subspace analysis <strong>of</strong>ten produces highly redundant<br />
results that need to be further manipulated in order to get meaningful results [101].<br />
Figure 5.8: Alternative data distributions and groupings from [103] in two different subspaces of<br />
a larger high-dimensional data space (domain here: demographic data analysis). Panels: "traveling<br />
subspace" and "health subspace"; axis labels: income, blood pressure, traveling frequency, age. Our proposed<br />
visual analysis method integrates the notion of alternative subspaces into the analysis process and<br />
links it to the task of comparative cluster analysis.<br />
We propose an <strong>in</strong>itial step towards the use <strong>of</strong> visual analytics as a way to explore<br />
alternative views generated by subspace analysis algorithms. We def<strong>in</strong>e an analytical<br />
pipeline made of algorithmic and visual components that makes it possible to single out and explore<br />
alternative views <strong>in</strong> the data. After be<strong>in</strong>g analyzed by a subspace search algorithm, the<br />
data is structured and further processed <strong>in</strong> an <strong>in</strong>teractive visualization environment to<br />
reduce redundancy.<br />
The ma<strong>in</strong> contribution <strong>of</strong> this section is the operative def<strong>in</strong>ition and implementation<br />
of this multistep pipeline that makes it possible to sift through an exponential number of subspace<br />
candidates and to reduce the problem to a handful <strong>of</strong> relevant views. More specifically, we<br />
1. <strong>in</strong>troduce a mechanism to deal with subspace redundancy by def<strong>in</strong><strong>in</strong>g topological and<br />
dimensional subspace similarity and by allow<strong>in</strong>g flexible and <strong>in</strong>teractive subspace<br />
aggregation;<br />
2. provide a well-reasoned interactive visualization environment that makes it possible to compare<br />
and assess alternative views by visually compar<strong>in</strong>g topological and dimensional<br />
similarities and strike a balance between visual complexity and level <strong>of</strong> detail.<br />
We evaluate our method through two case studies. The first is based on synthetic<br />
data to check whether the tool does what it is supposed to do. The second is based on<br />
real-world data to demonstrate how the tool can help in finding and interpreting alternative<br />
views in high-dimensional data. We believe these results show the potential of visual<br />
analytics in the context of automated mining algorithms. They furthermore show how the<br />
use of visual analytics can enhance the understanding of the results of automated data<br />
analysis methods, and lead to new questions concerning more effective or more efficient<br />
algorithms.<br />
The rema<strong>in</strong>der <strong>of</strong> this section is structured as follows. In Section 5.2.2, we discuss<br />
concepts underly<strong>in</strong>g the class <strong>of</strong> subspace search algorithms that are important for our
approach. In Section 5.2.3, we then <strong>in</strong>troduce our analytic methodology and suggested<br />
workflow, detailing the algorithmic and visual-interactive components employed.<br />
Section 5.2.4 demonstrates the application <strong>of</strong> our tool to synthetic and real data sets,<br />
show<strong>in</strong>g its usefulness for the problem at hand. In Section 5.2.5, we discuss advantages<br />
and limitations <strong>of</strong> our methodology and conclude <strong>in</strong> Section 5.2.6.<br />
5.2.2 Subspace Analysis<br />
In this section, we discuss the challenges for visual subspace analysis in more detail and<br />
explain how we tackle these with our new interactive, explorative framework supported<br />
by subspace search algorithms.<br />
As is commonly known in subspace clustering, dealing with high-dimensional data in its<br />
subspace projections faces two main challenges. The first, serious challenge is achieving reasonable<br />
scalability with regard to the dimensionality of the data set. Since, for a d-dimensional data<br />
set, the number of possible non-empty subspaces $S \subseteq \{1,\dots,d\}$ is<br />
$\sum_{k=1}^{d} \binom{d}{k} = 2^d - 1$, many subspace<br />
clustering approaches do not scale well for very high-dimensional data. Every algorithm<br />
has to employ some strategy and heuristics to cope with such an exponential search space.<br />
The second, closely related challenge is dealing with high redundancy, which stems from<br />
the high similarity of the exponentially many subspaces. If two subspaces share a high<br />
proportion of dimensions, they are likely to exhibit a very similar clustering structure [58].<br />
A large search result with high redundancy is, however, not beneficial for the user, as it<br />
masks the overall picture and is hard to interpret.<br />
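The exponential growth of the search space can be checked with a few lines of Python. This is only an illustrative sketch; the function names are hypothetical and not part of the described tool.<br />

```python
from itertools import combinations
from math import comb


def count_subspaces(d: int) -> int:
    """Number of non-empty axis-parallel subspaces of a d-dimensional data set."""
    return sum(comb(d, k) for k in range(1, d + 1))


def enumerate_subspaces(d: int):
    """Yield every non-empty subset of the dimensions 1..d, ordered by size."""
    dims = range(1, d + 1)
    for k in range(1, d + 1):
        yield from combinations(dims, k)


for d in (3, 12, 30):
    # The explicit sum of binomial coefficients agrees with the closed form.
    assert count_subspaces(d) == 2 ** d - 1
    print(d, count_subspaces(d))
```

For a 12-dimensional data set this already gives 4095 candidate subspaces, which is why exhaustive enumeration quickly becomes impractical as d grows.<br />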
A core task in the analysis of high-dimensional data is to apply a clustering method to<br />
reduce data complexity and identify groups of data for comparison. Different clustering algorithms<br />
follow different clustering notions, e.g., there exist density-based (e.g., DBSCAN [50])<br />
and compactness-based (e.g., k-means) clustering methods, and their outcomes often depend<br />
crucially on non-intuitive parameter settings. Usually, several clustering attempts<br />
are required until the user has a usable result. The high runtimes of subspace<br />
clustering processes (see Section 2.4.2) are not tolerable for such a workflow. Consequently,<br />
we decided to start the visual data exploration one step before the actual clustering process<br />
and decouple subspace search from the actual clustering. Dedicated subspace search algorithms<br />
[16, 38, 84] have been designed to efficiently filter and rank the possible subspaces<br />
according to specific quality criteria (or interestingness measures, see also below). After<br />
subspace search has taken place, an arbitrary clustering approach can be used to cluster<br />
in the identified subspaces.<br />
The use of subspace search for our purposes has several advantages: (1) It helps to<br />
effectively filter out those subspaces that, owing to their low interestingness, do not need to<br />
be considered by the user. (2) Subspace search approaches are designed to reduce the<br />
search space efficiently, and they do not need to compute clusters. (3) Although<br />
subspace search approaches themselves also rely on certain assumptions of what makes a<br />
subspace interesting, these assumptions do not necessarily lead to very different subspaces<br />
among different approaches. Therefore, the results are not as biased as they are for<br />
different clustering algorithms, which enables the user to already obtain valuable results<br />
with one subspace search approach. For example, the quality assessment based on the<br />
k-NN distance [16] favors neither the DBSCAN nor the k-means clustering notion. And<br />
(4), integrating the subspace search into the high-dimensional analysis offers the user the<br />
opportunity to obtain a visual, intuitive overview of the clustering structure before even
5.2.3 Proposed Analytical Workflow 113<br />
starting the actual clustering. Thus, the user can assess the potential of the data to deliver<br />
valuable clustering results at all; decide which subspaces are to be clustered; decide which<br />
clustering notion to follow in each subspace (since the notion does not need to be the same<br />
for all); and more easily determine meaningful parameter settings for clustering approaches.<br />
Subspace search methods guide their search process by specific interestingness scores<br />
that are defined heuristically. For example, the method proposed in [38] considers as<br />
interestingness score the variation of the density of objects across a regular cell-based partitioning<br />
of a given subspace. The underlying assumption is that the higher the variation<br />
of density, the higher the probability that the subspace shows a meaningful structure. As<br />
another example, the SURFING method [16] relies on the histogram of the k-nearest neighbor<br />
distances for all objects in a given subspace. It considers subspaces with non-uniform<br />
distance distributions more interesting (as they are an indication of the presence of strong<br />
clusterings). Here the underlying assumption is that for subspaces that show meaningful<br />
structures (e.g., clusters), different k-NN distances will occur. These and other measures<br />
aim at identifying subspaces that show a high “contrast” with respect to the distribution<br />
of objects, thereby making it possible to spot meaningful structure in the subspaces.<br />
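The intuition behind k-NN-distance-based interestingness can be sketched as follows. Note that this is not the SURFING quality measure: as a loudly labeled stand-in, the sketch uses the coefficient of variation of the k-NN distances, and all function names are hypothetical.<br />

```python
import math
import statistics


def knn_distance(points, i, dims, k):
    """Distance from points[i] to its k-th nearest neighbor, using only `dims`."""
    dists = sorted(
        math.dist([points[i][d] for d in dims], [p[d] for d in dims])
        for j, p in enumerate(points)
        if j != i
    )
    return dists[k - 1]


def contrast_score(points, dims, k=1):
    """Coefficient of variation of the k-NN distances in the subspace `dims`.

    Assumption: a simplified stand-in for an interestingness score.
    Uniformly spread data yields a score near 0; clustered data yields
    a larger score because the k-NN distances differ between objects.
    """
    knn = [knn_distance(points, i, dims, k) for i in range(len(points))]
    mean = statistics.fmean(knn)
    return statistics.pstdev(knn) / mean if mean > 0 else 0.0


# Dimension 0 contains two tight groups and an isolated point;
# dimension 1 is evenly spaced.
points = [(0.0, 0.0), (0.1, 1.0), (0.2, 2.0), (5.0, 3.0), (9.0, 4.0), (9.1, 5.0)]
print(contrast_score(points, dims=[0]), contrast_score(points, dims=[1]))
```

In this toy example, the subspace consisting of dimension 0 receives a higher score than the evenly spaced dimension 1, mirroring the "contrast" idea described above.<br />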
Subspace search methods also typically contain heuristic approaches for abandoning<br />
uninteresting subspaces early, as exhaustive search would be prohibitively expensive.<br />
SURFING, for example, is based on a bottom-up strategy that searches subspaces by increasing<br />
dimensionality. It tests additional dimensions for subspaces already<br />
known to be interesting. The list of currently interesting subspaces is continuously pruned<br />
to keep only the most interesting subspaces and speed up the search. SURFING has no<br />
dimensionality bias, assumes no specific clustering structure, and, in practice, is parameter-free.<br />
Due to these properties, we rely on this method in our proposed approach, using<br />
the implementation provided to us by the original authors, but other subspace search<br />
algorithms could easily be used as well.<br />
Overall, using the results of a subspace search algorithm as a starting point for our<br />
visualization has many advantages. Subspace search methods such as SURFING employ<br />
efficient search strategies, tackling the efficiency challenge of subspace analysis. However,<br />
they typically do not solve the challenge of high redundancy. Our proposed visual analytical<br />
workflow, which is introduced next, starts precisely at this point.<br />
5.2.3 Proposed Analytical Workflow<br />
We propose a carefully designed visual analytics workflow for subspace-based exploration<br />
of high-dimensional data, making use of algorithmic subspace search in combination with<br />
visual-interactive representations for user-based filtering and exploration. Our approach<br />
starts (1) with an automatic subspace search step, where a large number of interesting<br />
subspaces is selected by a subspace search algorithm. Current subspace search methods<br />
provide an algorithmic handling of the problem of finding interesting subspaces, yet they<br />
often produce too many subspaces that may also be redundant and thereby overwhelm the<br />
interactive analysis (see also Section 5.2.2). We therefore employ similarity-based grouping<br />
of subspaces (2) and perform the interactive exploration of interesting subspaces based on<br />
a few group representatives. Appropriate visual representations and interactions support<br />
the visual interactive analysis (3) for better understanding the subspace search results,<br />
including the support for comparative cluster analysis.<br />
Figure 5.9 depicts our proposed analytical workflow. We next detail the technical<br />
design decisions made for each of the analysis steps, including discussion of alternatives.<br />
[Figure 5.9 (diagram): HD Data → Subspace Search (e.g., SURFING) → Interesting Subspaces →<br />
Subspace Grouping and Filtering (e.g., hierarchical clustering based on subspace similarity) →<br />
Redundancy-Reduced View → Subspace Interaction (e.g., coloring clusters) → Cluster-Colored View]<br />
Figure 5.9: Our proposed analysis pipeline. A subspace selection algorithm is applied to automatically<br />
identify a candidate set of interesting subspaces. A filtering step reduces the potentially large<br />
and redundant set of automatically obtained subspaces to a user-selectable number of representing<br />
subspaces. Visual-interactive user exploration then proceeds on the subspace representations. Subspace<br />
analysis is also supported by comparative cluster views, allowing users to identify meaningful<br />
similar, complementary or even conflicting clustering structures in the set of subspaces.<br />
Generation of interesting subspace candidates<br />
To search for interesting subspaces of a high-dimensional data set, we propose to use a<br />
subspace search algorithm. We employ automatic subspace search as a tool to serve our<br />
main purpose, which is to explore high-dimensional data in an effective manner. The<br />
advantages of choosing subspace search, and in particular SURFING, have already been<br />
discussed in detail in Section 5.2.2. We observe that subspace search algorithms typically<br />
output a huge number of subspaces that are often rather redundant with respect to the<br />
reported interestingness index, and whose sets of involved dimensions show high overlap.<br />
Since the examination of all subspaces is infeasible, a common approach is to filter the<br />
subspaces based on a certain threshold. This, however, ignores the fact that the top-ranked<br />
subspaces might be only slight variations (i.e., high overlap of dimension sets)<br />
of the same subspace and are therefore redundant to each other. Moreover, interesting<br />
subspaces with substantially different dimension sets, as compared to the top-ranked<br />
results, could be found at much later ranking positions and run the risk of being excluded<br />
from the analysis. Therefore, we apply a grouping step based on an appropriately defined<br />
notion of subspace similarity, as described next.<br />
Similarity-based subspace grouping and filtering<br />
Given a large number of candidate subspaces, we apply hierarchical grouping and filtering<br />
to yield a smaller set of mutually sufficiently different, yet individually interesting groups<br />
of subspaces for interactive analysis. Our filtering and grouping operation is based on a<br />
custom similarity function defined on pairs of subspaces according to two main criteria:<br />
(1) overlap of the sets of dimensions that constitute the respective subspaces, and (2)<br />
resemblance in the data topology given in the respective subspaces.<br />
Similarity based on dimension overlap<br />
Subspaces can be similar regarding their constituent dimensions. We use the Tanimoto<br />
Similarity [117] on bit vectors indicating the contained (active) dimensions in a respective<br />
subspace (1 denotes an active dimension, 0 an inactive one). The Tanimoto Similarity is then<br />
computed as the number of dimensions contained in both subspaces (AND-ing of the bit<br />
vectors) divided by the total number of different dimensions occurring in the subspaces (OR-ing<br />
of the bit vectors).<br />
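The dimension-overlap similarity described above can be sketched in a few lines. The helper names and the letter labels A to L for a 12-dimensional data set are assumptions made for illustration only; the convention of returning 1.0 for two empty bit vectors is likewise an assumption and never arises for non-empty subspaces.<br />

```python
def to_bits(subspace, all_dims):
    """Bit vector with 1 for each active dimension of `subspace`, else 0."""
    return [1 if d in subspace else 0 for d in all_dims]


def tanimoto_similarity(bits_a, bits_b):
    """|A AND B| / |A OR B| on two equally long 0/1 vectors."""
    both = sum(x & y for x, y in zip(bits_a, bits_b))
    either = sum(x | y for x, y in zip(bits_a, bits_b))
    return both / either if either else 1.0  # assumption for the empty case


def tanimoto_distance(bits_a, bits_b):
    """Distance derived from the similarity (1 - similarity)."""
    return 1.0 - tanimoto_similarity(bits_a, bits_b)


ALL_DIMS = "ABCDEFGHIJKL"  # hypothetical labels for a 12D data set
a = to_bits("CFGIK", ALL_DIMS)
b = to_bits("CFGIJK", ALL_DIMS)
print(tanimoto_similarity(a, b))  # 5 shared dimensions out of 6 distinct ones
```

Two subspaces that differ by a single dimension thus score close to 1, which is exactly the kind of redundancy the grouping step is meant to absorb.<br />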
Similarity based on data topology<br />
We also compare subspaces with regard to their data distribution. Specifically, we consider<br />
the similarity of k-NN relationships in the respective subspaces. For efficiency reasons, we<br />
compute the k-nearest neighbor (k = 5) lists for a sample of 5% of the contained data<br />
points. The similarity between two subspaces is then evaluated as the average percentage<br />
of agreement of the k-NN lists in the subspaces. This score measures the similarity of the<br />
k-NN topology of the data, where k is a parameter that the user can adapt to the data sets<br />
at hand. Note that other similarity measures are also possible in principle.<br />
For instance, the data could be clustered and the similarity between subspaces evaluated<br />
according to the resemblance of the obtained clusterings by an appropriate measure such as<br />
the Rand index [114].<br />
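A minimal sketch of the k-NN agreement score follows. It assumes that "agreement" means the fraction of shared neighbors between the two k-NN lists of a sampled point (ignoring neighbor order); the text does not fix this detail, and the function names are hypothetical.<br />

```python
import math
import random


def knn_list(points, i, dims, k):
    """Indices of the k nearest neighbors of points[i], using only `dims`."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(
        key=lambda j: math.dist(
            [points[i][d] for d in dims], [points[j][d] for d in dims]
        )
    )
    return set(others[:k])


def topology_similarity(points, dims_a, dims_b, k=5, sample_frac=0.05, seed=0):
    """Average k-NN list agreement of two subspaces over a random sample.

    Assumption: agreement per point = |kNN_a intersected with kNN_b| / k.
    """
    rng = random.Random(seed)
    n_sample = max(1, round(sample_frac * len(points)))
    sample = rng.sample(range(len(points)), n_sample)
    agreement = [
        len(knn_list(points, i, dims_a, k) & knn_list(points, i, dims_b, k)) / k
        for i in sample
    ]
    return sum(agreement) / len(agreement)
```

With the defaults, k = 5 and a 5% sample match the settings stated above; a score of 1 means the sampled points have identical neighborhoods in both subspaces.<br />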
These two distance functions are the basis for the subspace grouping step in our analytical<br />
workflow as follows:<br />
1. Subspace grouping: We apply hierarchical agglomerative grouping of subspaces<br />
based on the topological distance function using Ward’s minimum variance method [144].<br />
Based on the dendrogram representation of the obtained hierarchical grouping, the<br />
user chooses the hierarchy depth level to select a number of groups. This way the<br />
user can easily decide how many groups are desired for the analysis.<br />
2. Subspace filtering: Based on the previously obtained grouping of subspaces, we<br />
select one subspace from each group as its representative: for each group, we consider<br />
the subspaces with the lowest dimensionality and choose the one that exhibits the<br />
highest interestingness score. We note that other rules for selecting representatives<br />
are possible, but find that this rule is robust and effective for users, as it tries to<br />
keep the dimensionality as low as possible.<br />
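The representative-selection rule of step 2 can be sketched as follows, assuming each group is given as a list of (dimensions, interestingness score) pairs. The data layout, the function name, and the example scores are hypothetical.<br />

```python
def pick_representative(group):
    """Choose the representative of one subspace group.

    `group` is a list of (dims, score) pairs, where `dims` is a collection
    of dimension labels and `score` is the interestingness score.
    Rule: restrict to the lowest dimensionality, then take the highest score.
    """
    min_dim = min(len(dims) for dims, _ in group)
    lowest = [(dims, score) for dims, score in group if len(dims) == min_dim]
    return max(lowest, key=lambda item: item[1])


# Made-up scores, for illustration only.
group = [("CFGIK", 0.8), ("CFG", 0.6), ("CDG", 0.7), ("CFGH", 0.9)]
print(pick_representative(group))
```

Here the two 3-dimensional subspaces are the candidates, and the one with the higher score is returned, even though a 4-dimensional member has a higher score overall.<br />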
These steps, together with both distance functions, take us further towards our goal of<br />
understanding the different kinds of relationships between subspaces. Subspaces can complement,<br />
confirm, or contradict each other, and being aware of these relations can be crucial<br />
for further mining tasks.<br />
[Figure 5.10 (matrix of filtering cases)]<br />
data topology \ contained dimensions: similar | not similar<br />
similar: redundant | confirmatory<br />
not similar: dominant dimensions | complementary<br />
Figure 5.10: Filtering cases that can be supported by our two defined subspace similarity functions.<br />
Four basic cases can be identified, each of which might be relevant for a given subspace<br />
analysis task:<br />
1. Subspaces that are similar in both their contained dimension sets and their data<br />
topology (redundant subspaces);<br />
2. Subspaces that are dissimilar in both their contained dimensions and their data<br />
topology (complementary subspaces);
3. Subspaces that are similar with regard to data topology but dissimilar regarding<br />
their contained dimensions (confirmatory subspaces: we confirm the same data relationships<br />
in different subspaces); and<br />
4. Subspaces that are similar with regard to their contained dimensions but dissimilar<br />
regarding topology (this is generally not expected but could indicate the existence<br />
of one or a few dimensions that are by their nature very dominant for the data<br />
topology).<br />
Figure 5.10 illustrates these four basic filtering cases.<br />
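The four cases can be expressed as a small decision function over the two similarity values. The thresholds of 0.5 are purely hypothetical, since the text does not prescribe how "similar" is cut off; both inputs are assumed to lie in [0, 1].<br />

```python
def classify_pair(dim_sim, topo_sim, dim_threshold=0.5, topo_threshold=0.5):
    """Map a pair of subspaces to one of the four basic filtering cases.

    `dim_sim` is the dimension-overlap similarity, `topo_sim` the data
    topology similarity. The thresholds are illustrative assumptions.
    """
    dims_similar = dim_sim >= dim_threshold
    topo_similar = topo_sim >= topo_threshold
    if dims_similar and topo_similar:
        return "redundant"
    if not dims_similar and not topo_similar:
        return "complementary"
    if topo_similar:
        return "confirmatory"
    return "dominant dimensions"


print(classify_pair(0.9, 0.8))
print(classify_pair(0.1, 0.8))
```

In an interactive setting, such cut-offs would more plausibly be chosen by the user, for example through the dendrogram threshold, rather than fixed in code.<br />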
Visual-interactive design<br />
After hierarchical aggregation and/or filtering of the potentially redundant set of subspaces<br />
has taken place, we apply a set of analytical views for exploring and comparing the<br />
subspaces. Our displays are based on (1) scatterplot-oriented representations of individual<br />
subspaces or groups of subspaces, (2) similarity-based or linear list layouts for sets of<br />
subspaces, and (3) additional informative views (parallel coordinates and color-coding for<br />
comparison of groups in data).<br />
The proposed design is the result of several iterations of alternative solutions in which<br />
we explored and compared several representations. Two design choices are worth discussing<br />
here: (1) the design of a visual representative for subspaces and (2) their layout. We<br />
decided to represent subspaces with scatterplots because they allow for the identification<br />
and comparison of groups in the data. More abstract representations (like simple colored<br />
marks) would require less space but would not allow the rich topological comparison<br />
provided by the scatterplots. In contrast, more complex representations such as<br />
parallel coordinates would provide a direct representation of the dimensions included in<br />
the subspace but would make the display much more cluttered. As for the layout,<br />
we tried several tree and graph layouts to make the relationship between the subspaces and<br />
their shared dimensions explicit; however, we found that this rarely provides interesting<br />
insights and makes the visualization too cluttered to be of any use.<br />
Figure 5.11: Subspace representation by 2D scatterplots with dimension glyph. We can see the<br />
visual representations of two 5D subspaces (left) and one 4D subspace (right).<br />
To represent each subspace in a similar way, independent of its dimensionality, we<br />
decided to plot each subspace in a 2D scatterplot. The scatterplot representation can<br />
be generated by any appropriate projection technique such as PCA [83], MDS [41] or<br />
t-SNE [143], to name a few. We currently use MDS; however, we experimented with<br />
other dimension reduction techniques and found that they could be used as alternatives.<br />
To convey the involved subspace dimensions, we add an index glyph to the<br />
respective scatterplot (see Figure 5.11).<br />
Figure 5.12: (1) Linearly sorted view of subspaces for the 12D synthetic data set from [52]<br />
showing the full result of SURFING, consisting of 296 subspaces. The selected subspace in this<br />
view is shown in (2) a single subspace view to enable interaction and in (3) a parallel coordinates<br />
view with the subspace dimensions as the first axes (highlighted) and all the other data dimensions<br />
as the last axes.<br />
The analytical views are combined and linked in an application that consists of the<br />
following components:<br />
Linearly sorted view of subspaces<br />
To obtain a first overview of the output of the subspace search algorithm, we present all the<br />
subspaces in a linear view. The MDS scatterplots representing the individual subspaces<br />
are sorted left-to-right and top-down according to the interestingness index provided by<br />
the subspace search method. This view is exclusively used as a detail view for groups of<br />
topologically similar subspaces. Figure 5.12(1) illustrates the subspaces of the synthetic<br />
data set, which is described later in Subsection 5.2.4.<br />
Subspace group view<br />
In this view, groups of subspaces that have been formed by hierarchical agglomerative<br />
grouping are shown. Each group is represented by one selected subspace from that group,<br />
using the filtering method described in the previous subsection. Figure 5.13 shows the<br />
dendrogram provided by the hierarchical grouping algorithm for all 296 subspaces visible<br />
in the linearly sorted view. Each node in the dendrogram represents a cluster at a certain<br />
similarity. A larger image of the dendrogram can be seen in Appendix A.4.<br />
The user can navigate through this hierarchy (using the hierarchical navigation<br />
buttons shown in Figure 5.16(6)) and specify a certain similarity threshold for clustering.
[Figure 5.13 (dendrogram): hierarchical agglomerative grouping of the subspaces of the synthetic data set; leaves are labeled with the dimensions of each subspace (e.g., CFGIJK); axis: Distance (Similarity), 0 to 30]<br />
Figure 5.13: Hierarchical agglomerative grouping of the 296 interesting subspaces. The red line<br />
shows the threshold for 6 groups shown in the subspace group view. Each group is marked by a<br />
colored rectangle. The colors are maintained in Figure 5.14.<br />
This threshold is indicated by the red line in the figure showing the dendrogram, resulting<br />
in six groups visible in the subspace group view presented in Figure 5.14 and also illustrated<br />
in the overview in Figure 5.16(1).<br />
Figure 5.14: Subspace group view for the 12D synthetic data set with six subspace groups.<br />
The representative subspaces of the groups are each visualized by an MDS plot and<br />
shown side by side. A dimension histogram on top of each plot indicates the distribution<br />
of dimensions contained in the subspaces of that group, where the length of a bar<br />
encodes the frequency of the respective dimension. The last bar encodes the percentage of<br />
subspaces contained in this group. It is colored in orange to be easily distinguished from<br />
the others.<br />
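The data behind such a dimension histogram can be computed as sketched below. The sketch assumes that "frequency" means the fraction of the group's subspaces that contain a dimension; the function names, labels, and the example group are hypothetical.<br />

```python
from collections import Counter


def dimension_histogram(group, all_dims):
    """Relative frequency of each dimension among the subspaces of a group.

    `group` is a list of subspaces, each a collection of dimension labels.
    """
    counts = Counter(d for subspace in group for d in subspace)
    return {d: counts[d] / len(group) for d in all_dims}


def group_share(group, n_total):
    """Fraction of all subspaces that fall into this group (the last bar)."""
    return len(group) / n_total


ALL_DIMS = "ABCDEFGHIJKL"  # hypothetical labels for a 12D data set
group = ["CF", "CFG", "CFGH", "CD"]
hist = dimension_histogram(group, ALL_DIMS)
print({d: f for d, f in hist.items() if f > 0}, group_share(group, 296))
```

Dimensions that appear in every member of a group get a full-length bar, which makes the dimensions shared across a group easy to spot at a glance.<br />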
Each group of subspaces from the preceding view can be expanded, and its member<br />
subspaces can be seen and compared in detail (as Figure 5.16(5) illustrates). This allows a<br />
better understanding of the current similarity threshold and allows the user to expand or further<br />
collapse the group structure based on the visually perceived similarity between subspaces. The<br />
user can investigate how similar the distribution of dimensions is among different groups<br />
of subspaces. To this end, a click on the dimension histogram icon of one particular group<br />
will cross-highlight the dimensions of the selected group that are also contained in other<br />
groups. In this example, the dimension glyph of the green group has been clicked. In summary,<br />
the subspace group view allows a global comparison of non-redundant subspaces and<br />
their similarities concerning the contained data topology.<br />
Dimension-based subspace similarity view<br />
We also support the comparative analysis of all subspaces based on their similarity regarding<br />
the set of active dimensions. To this end, a global MDS layout based on the<br />
Tanimoto distances between the subspaces, as described at the beginning of this section, is<br />
generated. Figure 5.15 (see also Figure 5.16(4)) illustrates the subspace similarity view.<br />
For a high number of subspaces, this view can only provide an impression of the similarity<br />
relationships, but more details become visible by zooming. The agglomerative grouping<br />
based on the topological distance function could be used to reduce the number of displayed<br />
subspaces in this view. The subspace group view (based on data topology distance) and the<br />
dimension-similarity view (based on Tanimoto distance) are linked by color-coding (outer<br />
frame coloring). Thereby, we can compare subspaces by both their topological<br />
and their dimension-overlap-based similarity.<br />
Figure 5.15: Dimension-based subspace similarity MDS view of the 296 subspaces selected by the<br />
subspace search algorithm.<br />
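As an illustration of the dimension-based similarity, the following minimal Python sketch computes Tanimoto distances between subspaces represented as sets of dimension labels. It assumes the common set-based form of the Tanimoto coefficient (size of the intersection divided by size of the union); the definition used in this work is given at the beginning of the section and may differ in detail. The function name `tanimoto_distance` and the example subspaces are hypothetical.

```python
from itertools import combinations


def tanimoto_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two collections of dimension labels.

    Assumption: the set-based Tanimoto (Jaccard) form; two empty
    subspaces are treated as identical (distance 0).
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical subspaces, written as strings of single-letter dimension labels.
subspaces = ["CGK", "CGKL", "CHJK", "AB"]

# Pairwise distances; a table like this could serve as input to an MDS layout.
distances = {
    (s, t): tanimoto_distance(s, t) for s, t in combinations(subspaces, 2)
}

for (s, t), d in distances.items():
    print(f"{s:5s} {t:5s} {d:.2f}")
```

Subspaces that share most of their dimensions obtain a small distance and would therefore be placed close together in such a layout, while subspaces without common dimensions obtain the maximum distance of 1.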
Additional views and cluster comparison support<br />
We also integrated details-on-demand for each subspace through a parallel coordinates view<br />
(see Figures 5.12(3) and 5.16(3)). Highlighting the contained dimensions helps to understand<br />
the differences between the subspaces in more detail. The subspace dimensions are placed as the<br />
first dimensions of the parallel coordinates view and are highlighted. The others are added in<br />
random order, in a lighter gray. This enables a comparison with the rest of the data set<br />
and helps to understand the distribution of the subspace dimensions relative to the rest of<br />
the data.<br />
Furthermore, interactive exploration of the subspaces is enhanced by a single subspace<br />
view, providing an enlarged view of a selected subspace scatterplot (see Figures 5.12(2) and<br />
5.16(2)). This view also allows the user to manually select clusters of objects with a<br />
lasso tool. Cross-coloring of the selected points among the other subspaces and within the<br />
parallel coordinates plot thus allows comparative exploration of grouping structures, a<br />
core problem in making effective use of alternative subspaces.<br />
Figure 5.16: All linked views: (1) Subspace group view for the 12D synthetic data set with six<br />
subspace groups. (2) Single subspace view showing the representative subspace for the first group.<br />
(3) Details-on-demand in the parallel coordinates view for the selected subspace. (4) The MDS<br />
layout of the subspace search results based on their dimension similarity. (5) Group detail view for<br />
the three (orange, green, purple) subspace groups. (6) Hierarchical navigation buttons.<br />
5.2.4 Application<br />
We now demonstrate the analytical capabilities of our proposed approach by applying it to<br />
synthetic and real-world data in two scenarios. These two scenarios have different purposes.<br />
First, we use synthetic data as a proof of concept and exemplify the suggested workflow.<br />
We show that relevant subspaces can be identified conveniently. Then, we describe<br />
an explorative setting in which interesting findings in alternative subspaces of a real-world<br />
data set are obtained.<br />
Application Scenario 1: Synthetic Data<br />
To show the power of the proposed approach, we used a 750-record sample of the first<br />
12D synthetic data set presented in [52] (data set No. 2). This data set consists of four<br />
3D Gaussian clusters and two 6D Gaussian clusters. The remaining dimensions contain<br />
uniformly distributed random noise. The first step of our approach is to determine the<br />
interesting subspaces of the high-dimensional data set by running an automatic subspace<br />
search using SURFING (see Section 5.2.3). This subspace search returns a total of 296<br />
subspaces identified as interesting, out of the 4095 possible subspaces. To get a first<br />
impression of these subspaces, we use the linearly sorted view of subspaces shown in<br />
Figure 5.12, which relies on MDS representations of the data in the subspaces and is sorted by<br />
the interestingness score in decreasing order.<br />
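The figure of 4095 possible subspaces corresponds to the number of non-empty subsets of 12 dimensions, that is, 2^12 - 1. The short Python check below enumerates them explicitly; the dimension labels A to L are an assumption made for illustration only.

```python
from itertools import combinations

dimensions = "ABCDEFGHIJKL"  # 12 hypothetical dimension labels

# Enumerate every non-empty subset of the dimensions (axis-parallel subspaces).
all_subspaces = [
    "".join(combo)
    for size in range(1, len(dimensions) + 1)
    for combo in combinations(dimensions, size)
]

print(len(all_subspaces))        # number of candidate subspaces
print(2 ** len(dimensions) - 1)  # closed form for the same count
```

Each additional dimension roughly doubles this count, which is why an automatic search that narrows the candidates down to a few hundred subspaces is a useful first step.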
The view shows the diversity of subspaces identified during the automatic step. The<br />
first elements in the first row of the view are very similar in terms of their point distribution<br />
(showing mostly scattered and spherical point distributions). However, at later positions,<br />
we also see other kinds of point distributions, including parallel stripe patterns and<br />
stripes mixed with spherical patterns. In a conventional (non-visual) analysis that relies only<br />
on the subspaces ranked highest by the interestingness score, the analyst might miss some of<br />
these different characteristics of the subspaces.<br />
Judging by the shape of the MDS projections, the overview also confirms<br />
that the subspace search returned many redundant subspaces. The next step is therefore<br />
to group the subspaces according to their similarity, allowing the user to abstract to a<br />
smaller number of relevant subspaces and compare them in detail. We used our similarity<br />
function based on the data topology to create a hierarchical agglomerative clustering using<br />
Ward’s minimum variance method [144]. We found that this method gives<br />
good results in terms of providing clusters of subspaces that are well separated from each<br />
other. The resulting clustering dendrogram is shown in Figure 5.13. Figure 5.16(1) shows<br />
that, by setting a similarity threshold, the user can reduce the number of subspaces<br />
considerably and in a meaningful way. The navigation buttons, shown in<br />
Figure 5.16(6), allow the user to move through the dendrogram levels and to find the<br />
desired level of redundancy. Here the dendrogram was cut at 0.73 (value range (0,1)).<br />
As a result, six groups are found and visualized by their representatives. The number of<br />
groups can be varied by selecting different similarity levels in the dendrogram hierarchy.<br />
For this data set we quickly found that six groups is the right level of detail for our further<br />
investigation.<br />
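The grouping step can be sketched in a few lines of Python. The sketch below is not the implementation used in this work: it applies Ward's update rule (in its Lance-Williams form) to an arbitrary precomputed distance matrix and stops merging at a distance threshold, which corresponds to cutting the dendrogram at that height. The topological distance function is not reproduced here, so the example uses distances between made-up points on a line, and the threshold is given in the units of the input distances rather than on the normalized (0,1) scale mentioned above. The name `ward_groups` is hypothetical.

```python
def ward_groups(dist, threshold):
    """Agglomerative grouping with Ward's update rule on a distance matrix.

    dist: symmetric list of lists with pairwise distances between items.
    threshold: merging stops once the closest pair of groups is farther
        apart than this value (a cut of the dendrogram at that height).
    Returns a sorted list of groups, each a sorted list of item indices.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}
    # Squared distances between active groups, keyed by (smaller id, larger id).
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}

    def pair(a, b):
        return (a, b) if a < b else (b, a)

    next_id = n
    while len(members) > 1:
        (i, j), best = min(d2.items(), key=lambda item: item[1])
        if best ** 0.5 > threshold:
            break
        ni, nj = len(members[i]), len(members[j])
        updated = {}
        for k in members:
            if k in (i, j):
                continue
            nk = len(members[k])
            # Lance-Williams update for Ward's minimum variance method.
            updated[k] = (
                (ni + nk) * d2[pair(i, k)]
                + (nj + nk) * d2[pair(j, k)]
                - nk * best
            ) / (ni + nj + nk)
        # Drop all entries that involve the two merged groups.
        d2 = {p: v for p, v in d2.items() if i not in p and j not in p}
        members[next_id] = sorted(members.pop(i) + members.pop(j))
        for k, value in updated.items():
            d2[pair(k, next_id)] = value
        next_id += 1
    return sorted(members.values())


# Made-up example: six items on a line, grouped with a threshold of 1.0.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
dist = [[abs(p - q) for q in points] for p in points]
groups = ward_groups(dist, threshold=1.0)
print(groups)
```

Raising the threshold merges more groups and lowering it splits them, which mirrors moving up and down the dendrogram levels with the navigation buttons.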
We investigate the components of each group of subspaces in more detail. Figure 5.16(5)<br />
shows the group detail view of the orange, green, and purple subspace groups as framed<br />
in Figure 5.16(1). Topologically similar subspaces are grouped together. In this way, the<br />
analyst is given an overview of the existing groups and, if needed, can further compare<br />
individual group components.<br />
On top of the scatterplots, a dimension histogram indicates the distribution of dimensions<br />
for each group. The last bar of the histogram is marked in orange and represents<br />
the percentage of subspaces contained in this group. It is scaled logarithmically, so that<br />
this bar is also visible for groups with few elements. A click on the dimension histogram<br />
of one group representative highlights its dimensions in all the other representatives. In<br />
Figure 5.16(1) (enlarged in Figure 5.14) the green group was clicked. To understand why<br />
the green- and gray-framed groups are split, we can consult the additional view in Figure<br />
5.16(4). It shows an MDS layout of all interesting subspaces based on the dimension<br />
overlap (Tanimoto) similarity. In this view, closeness of two subspaces corresponds to<br />
dimension similarity. We see that the green- and gray-framed groups are both located<br />
on the far left side of the plot. This shows us that their subspaces are similar in terms of<br />
dimensions but, being in different groups, they must differ topologically<br />
according to our similarity measure. The reason is that all the subspaces of the gray-framed<br />
group contain dimension d12, while none of the subspaces in the green-framed<br />
group contain this dimension. This is visible in the bars of the dimension histogram of<br />
the gray-framed group (see Figure 5.16(1)): the bar for d12 is not highlighted, so d12 is not contained in<br />
the marked green-framed group. This dimension is evidently responsible for the different<br />
data distribution.<br />
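The information encoded in a dimension histogram can be sketched as follows. This is a minimal Python illustration under assumptions: the exact logarithmic scaling of the last bar is not specified in the text, so the `log1p`-based mapping below is only one plausible choice, and the function names and the two example groups are hypothetical.

```python
import math
from collections import Counter


def dimension_histogram(group, total_subspaces):
    """Summarize one subspace group for a histogram-style glyph.

    group: list of subspaces, each a list of dimension labels.
    total_subspaces: number of subspaces over all groups.
    Returns (frequencies, share, share_bar), where frequencies maps each
    dimension to the fraction of the group's subspaces containing it.
    """
    counts = Counter(dim for subspace in group for dim in set(subspace))
    frequencies = {dim: c / len(group) for dim, c in sorted(counts.items())}
    share = len(group) / total_subspaces
    # Assumed log scaling: maps a share in (0, 1] to a bar length in (0, 1]
    # so that small groups still get a visible bar.
    share_bar = math.log1p(100 * share) / math.log1p(100)
    return frequencies, share, share_bar


def shared_dimensions(selected, other):
    """Dimensions of the selected group that also occur in another group."""
    return sorted(set(selected) & set(other))


# Hypothetical groups: every "gray" subspace contains d12, no "green" one does.
green = [["d1", "d2"], ["d1", "d3"]]
gray = [["d1", "d12"], ["d2", "d12"]]

green_freq, green_share, green_bar = dimension_histogram(green, 4)
gray_freq, gray_share, gray_bar = dimension_histogram(gray, 4)

print(green_freq)
print(shared_dimensions(gray_freq, green_freq))  # d12 is absent from this list
```

The second function mirrors the cross-highlighting on click: dimensions that appear in both groups are returned, while a dimension such as d12, which occurs in only one of the groups, is left out.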
We can also go one step further in the detailed comparison of subspaces by cross-color-coding<br />
clusters of points in the MDS representation. Our lasso tool allows the user to<br />
manually mark clusters of points in the MDS subspace representation, which makes it possible to<br />
cross-compare the groupings among different subspaces. For example, we manually marked<br />
six separate clusters of points in the pink-framed subspace group (group number two in<br />
Figure 5.16(1)) and assigned distinct colors. By analyzing the distribution of colors among<br />
the subspace group representatives, we see that other subspaces merge some of these clusters<br />
and spread others. This is also true for the purple-framed group representative. The dark<br />
blue and pink point clusters (the uppermost in the originally colored subspace) are clustered<br />
in the purple subspace, but some of their points also became noise in this subspace.<br />
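What the cross-coloring shows visually can also be expressed as a small count table. The Python sketch below is only an illustration and not part of the described system: it assumes that each data point has a color assigned with the lasso tool in one subspace and a (hypothetical) cluster label in another subspace, with `None` marking points regarded as noise there.

```python
from collections import Counter, defaultdict


def color_spread(colors, other_clusters):
    """Count how colors assigned in one subspace fall into the clusters
    of another subspace.

    colors: dict point id -> color chosen with the lasso tool.
    other_clusters: dict point id -> cluster label in the other subspace
        (None marks a point regarded as noise there).
    Returns dict cluster label -> Counter of colors.
    """
    table = defaultdict(Counter)
    for point, color in colors.items():
        table[other_clusters.get(point)][color] += 1
    return dict(table)


# Hypothetical data: two colored clusters that end up in one cluster
# of the other subspace, with one pink point becoming noise.
colors = {1: "blue", 2: "blue", 3: "pink", 4: "pink", 5: "pink"}
other_clusters = {1: "a", 2: "a", 3: "a", 4: "a", 5: None}

table = color_spread(colors, other_clusters)
for cluster, counter in table.items():
    print(cluster, dict(counter))
```

A row that contains several colors indicates clusters that are merged in the other subspace, while a color that appears in several rows indicates a cluster that is spread out.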
Summing up, we can see how our approach helps to deal with the<br />
large number of possibly interesting subspaces in a natural, overview-first visual<br />
analytics workflow. In a first step, the SURFING approach reduced the number of<br />
subspaces of the 12-dimensional data set from 4095 to 296 interesting ones. Since this<br />
set of subspaces still showed high redundancy, in our next step we grouped them using<br />
our topological similarity measure. Based on the grouped subspaces, further investigations<br />
could take place to compare the relations and distributions among data points within<br />
the subspaces.<br />
Application Scenario 2: Exploration/Discovery<br />
We will now demonstrate the exploratory functionality of our proposed approach based on<br />
a real data set. We again analyze the USDA Food Composition data set (see footnote 2), a full collection<br />
of raw and processed foods characterized by their composition in terms of nutrients. The<br />
database contains more than 7000 records and 44 dimensions. After removing missing<br />
values and outliers and normalizing the data, 722 records (foods) remained, for which we<br />
selected 18 dimensions of the data set that were interpretable.<br />
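A preprocessing step of this kind could look like the following Python sketch. It is an assumption-laden illustration rather than the procedure actually used: missing values are represented by `None`, min-max scaling to [0, 1] stands in for the unspecified normalization, outlier removal is omitted because its criterion is not stated, and the function name, column names, and records are made up.

```python
def clean_and_normalize(records, columns):
    """Drop records with missing values, then min-max scale each column.

    records: list of dicts mapping column name -> number or None.
    columns: the columns (dimensions) to keep.
    Assumptions: None marks a missing value, and min-max scaling to
    [0, 1] stands in for the unspecified normalization.
    """
    complete = [
        [rec[c] for c in columns]
        for rec in records
        if all(rec.get(c) is not None for c in columns)
    ]
    if not complete:
        return []
    lows = [min(col) for col in zip(*complete)]
    highs = [max(col) for col in zip(*complete)]
    return [
        [
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ]
        for row in complete
    ]


# Made-up food records; the second one has a missing value and is dropped.
foods = [
    {"Protein": 10.0, "Iron": 1.0},
    {"Protein": 30.0, "Iron": None},
    {"Protein": 20.0, "Iron": 3.0},
]
cleaned = clean_and_normalize(foods, ["Protein", "Iron"])
print(cleaned)
```

Dropping incomplete records before scaling keeps the per-column minimum and maximum from being influenced by records that are discarded anyway.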
From this input data set, application of the SURFING algorithm returned 216 interesting<br />
subspaces for further exploration. To obtain a first impression of this data, we<br />
investigated the linearly sorted view (see Figure 5.17 for a cut-out). Many subspaces, in<br />
particular those ranked with a high interestingness index, showed a rather skewed distribution<br />
of points in our projection representation, concentrated along the edges of the<br />
diagrams. Only later in the ranking did we observe projections showing more structure<br />
that could be meaningful. The red-framed subspace in Figure 5.17 seems to be<br />
very interesting, forming long, clear stripes. With the help of the single subspace view, we<br />
further investigated this subspace (Iron, Manganese, Vit D) by coloring each stripe with a<br />
different color and compared the formation of these clusters across the other subspaces.<br />
Most of them seemed to be overspread by the cyan class (see Figure 5.17 right).<br />
At the same time, it is clear that a high level of redundancy is still present and that further<br />
grouping is necessary. Therefore, we continued with our next analytical step, the<br />
subspace grouping by agglomerative hierarchical clustering. We obtained different groups<br />
of subspaces and found that these clearly striped clusters only appear in subspaces<br />
containing Vit D.<br />
We therefore reset the coloring and started a new interactive analysis step, beginning<br />
with this stage of our workflow. After testing different filtering thresholds and comparing<br />
the topology-based and the dimension-based similarity relations, we obtained 12<br />
groups and considered this suitable for subsequent analysis (see Figure 5.19(1)).<br />
Figure 5.17: Linearly sorted view (cut-out) of subspaces for the 18D USDA Food Composition data<br />
set. The full result of SURFING consists of 216 subspaces. We see a rather high level of<br />
redundancy. Subspaces exhibiting more structure are found in particular at the middle and end<br />
positions of the ranking. Relying only on the numerically top-ranked results, we would have<br />
omitted such interesting cases from the analysis.<br />
2 http://www.ars.usda.gov/<br />
Figure 5.18: (A) Subspace (Carbohydrate, Fiber) spotted as interesting, presenting two clusters.<br />
(B) Subspace (Carbohydrate, Lipid, Protein) in the same cluster group as (A), where the cluster<br />
structure changes. (C) Third cluster, marked in green, in the subspace from (B). (D) Subspace<br />
(Fiber, Protein, Vit D) of the orange-framed subspace group, where the alternative clustering of<br />
points is visible.<br />
From the reduced number of representative subspaces, one particular subspace stood<br />
out to us (see Figure 5.19(1) for the group representatives and Figure 5.18(A) for the<br />
subspace in question). This subspace shows the most structure and allows us to discern<br />
two point clusters (pink and blue). We selected this specific subspace group (framed<br />
brown in Figure 5.19) for further analysis. Cross-coloring is used to highlight its group<br />
components, which are shown at the bottom of the figure. It is visible that the subspaces of this<br />
group are topologically similar; consequently, this subspace is a valid representative.<br />
In addition, we observe that there are some subspaces in this group where the clustering<br />
changes. One example is shown in Figure 5.18(B). We assigned the green color to the<br />
points standing out on the left side since they seem to form a different structure. In the<br />
group view (see Figure 5.19(1)) we can see that this green cluster spreads over five<br />
of the 12 subspace group representatives. After a closer look at the components of the<br />
orange subspace group, we spotted a sharply defined green cluster (see Figure 5.18(D),<br />
highlighted in Figure 5.19(2)). By highlighting the dimensions of the orange group, we<br />
can see that the brown group has a dominant dimension (Protein) that is not contained<br />
in any subspace of the orange group. We can therefore assume that this dimension is<br />
decisive for the clustering of the points. In the dimension-based similarity view (MDS<br />
layout in Figure 5.19(3)), the subspaces of the brown and orange groups are far apart<br />
from each other, which supports our finding that the groups contain different dimensions.<br />
Likewise, we can see that the group components of the brown group are scattered across<br />
the MDS layout. This is because the subspaces of this group are dissimilar in terms<br />
of their dimensions, while their topological similarity is dominated by the shared dimension<br />
(Protein).<br />
Figure 5.19: (1) Grouped view of subspaces for the 18D USDA Food Composition Data Set with 12<br />
group representatives. (2) The brown and orange group components are shown in the components<br />
view. (3) MDS layout of the total number of subspaces with cross-colored group representatives.<br />
Summing up, we demonstrated how our interactive exploratory workflow can be applied<br />
to real data. In contrast to the previous scenario, information about the clusters is not<br />
known for real data sets, meaning that several interactive attempts are needed to investigate<br />
the large number of interesting subspaces provided by the subspace search algorithm. With<br />
the help of the topological similarity functionality, we could group the redundant clusters<br />
and take a closer look at their topological changes. Using the different linked views of our<br />
approach helped us to identify different subspaces that present alternative clusterings.<br />
5.2.5 Discussion and Possible Extensions<br />
We now summarize the main goals of our system and then discuss limitations and possible<br />
extensions.<br />
Summarizing the Main Goals of our Approach<br />
Our approach supports visual-interactive analysis of high-dimensional data from<br />
multiple perspectives based on the notion of automatic subspace search. The core assumption<br />
of our approach is that useful information can be extracted in a comparative way<br />
from several different subspaces residing in a larger high-dimensional data space. This<br />
assumption is the major driving force behind subspace search and subspace clustering<br />
algorithms developed in the Data Mining community over the past few years. We exploit<br />
algorithmic subspace search in an encompassing visual-interactive system. Our approach<br />
is designed around Shneiderman’s Visual Information-Seeking Mantra [127], applied to the<br />
problem of analyzing potentially large sets of subspaces. Modern subspace search methods<br />
such as SURFING efficiently identify candidate subspaces that are expected to exhibit informative<br />
structure, without restriction to a specific kind of structure. Specifically,<br />
interactively detecting and understanding relevant structures in subspaces is an explicit<br />
goal of our system. Our interactive support allows users to condense and compare subspaces,<br />
and even groups in the data, whereby the analytical loop from the algorithmic search<br />
of subspaces to the sense-making by the user is closed. Subspace search algorithms are<br />
very useful as a starting point. However, since the identification based on interestingness is performed<br />
heuristically, the search methods alone cannot solve the analytical problems at<br />
hand. To this end, capable visual-analytic systems need to be designed based on the output<br />
of the subspace search algorithm. We therefore designed, implemented, and applied<br />
an encompassing system based on a subspace search method (we used<br />
SURFING as an example). It allows the user to explore high-dimensional data while taking into account the curse of<br />
dimensionality and the possibility of finding alternative clusters in different subspaces.<br />
Limitations and Possible Extensions<br />
We identify the following limitations and improvement opportunities for our approach:<br />
• Computational scalability<br />
We designed and tested our system around data sets of moderately high dimensionality<br />
(tens of dimensions). For higher-dimensional data, we will have to deal with scalability<br />
issues in (1) the computational complexity of the subspace search and (2) the scalability<br />
of the visual representation of subspaces. Regarding (1), the search space increases<br />
exponentially with dimensionality. Subspace search algorithms probably need more<br />
aggressive filtering mechanisms to keep the number of searched subspaces tractable.<br />
A dynamically adjustable threshold could be useful here. However, we still need<br />
to ensure that no relevant results are excluded. To this end, sensitivity analysis is<br />
needed.<br />
• Visual scalability<br />
Regarding (2), scalable visual representations are also needed for higher-dimensional<br />
data. We need to scale with the number of subspaces and with the representation of each<br />
subspace. Hierarchical grouping of subspaces is already included in our system to<br />
scale with the number of subspaces. The linearly sorted view per se does not scale<br />
to many subspaces, yet it can be restricted to the representative subspaces obtained<br />
from hierarchical grouping. Subspaces are represented visually by a<br />
projection that shows the data points and an index view that shows the contained dimensions.<br />
In particular, the latter will only scale to a limited number of dimensions. How<br />
to design set-oriented views to compare many sets of dimensions is a challenging<br />
problem that, if solved, would improve our tool.<br />
• Projection-based subspace representation<br />
We currently represent the subspaces by MDS projections of the data residing in the<br />
respective subspaces. However, projection typically induces a loss of information, which<br />
could be conveyed in our visualization, e.g., by showing the stress values in an<br />
overlay visualization [121]. In our experiments, MDS performed very well compared<br />
to PCA. Yet, it would be interesting to test other projections. Also, other<br />
subspace representations besides scatterplots are conceivable, in essence similar<br />
to Value-and-Relation displays [157]. Likewise, many different, useful similarity<br />
notions to group and compare subspaces could be employed, such as notions based on stress measures,<br />
implicit clustering structures, relations to outliers, scagnostics features [151], etc.<br />
Testing them in different application domains is considered<br />
valuable future work. We note that our analytical approach can easily accommodate<br />
alternative subspace search algorithms, representations, and filtering options.<br />
• Interpretable dimensions<br />
To relate subspaces and data groups in subspaces, it is important for the analyst to be<br />
aware of the meaning of the dimensions of the respective subspace. Our index-based<br />
glyph does not convey information about the type of dimension. More semantically<br />
meaningful dimension representations would be useful. Detail-on-demand functions<br />
could be added to help the user interpret the involved dimensions and properties of<br />
the data points more efficiently.<br />
• Definition of interestingness and sensitivity to noise<br />
Subspace search algorithms heuristically identify subspaces as interesting based on<br />
certain properties of object relations. Depending on the user and application, additional<br />
interestingness formulations are possible and should be supported. Following best<br />
practices in data analysis, we applied a data cleaning step (outlier and missing<br />
value removal) to our test data before we fed it into our system. The SURFING<br />
algorithm is not robust with respect to missing values, whereas it seems to be robust<br />
with respect to outliers. The original paper does not discuss this aspect, and we<br />
did not investigate it further. The projections used to represent data distributions<br />
in subspaces are sensitive to outliers and may generate clamped distributions if the data is not<br />
pre-processed. We postpone the analysis of this problem to future work.<br />
• Automatic support for cluster comparison<br />
Adding automatic clustering of data points in subspaces would be useful as a post-processing<br />
step. Equipped with automatic clustering, we can color-code the found<br />
clusters. This could lead to new visual-oriented interestingness measures useful for<br />
selecting interesting subspaces in the future. User interaction with the subspace<br />
search output could be a useful analytical feature for refinement. Allowing expert<br />
users to split or merge subspaces, or construct new subspaces by adding or removing<br />
dimensions, would be one option.<br />
• Usability and user adoption<br />
Our current system design targets users with expertise in data mining. End-user<br />
applications, e.g., in Market Segment analysis, could benefit from subspace analysis.
However, we recognize that for end-users, the interface of our system would possibly need to<br />
be customized. Our experience in collaborating with data mining experts<br />
showed that the tool can be useful not only for data exploration but also as an<br />
evaluation tool to assess the output generated by subspace analysis algorithms.<br />
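The stress overlay suggested in the first item of the list above needs a stress value per data point. The following is a minimal sketch of how such values could be computed for an arbitrary 2D projection. It is not the implementation used in our tool: the function names, the toy data, and the normalization by squared original distances are assumptions made for illustration only.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def projection_stress(high_dim, low_dim):
    """Return (global_stress, per_point_stress) for a given projection.

    Assumption: stress is normalized by the squared original distances;
    other normalizations (e.g., Kruskal's stress-1) are equally possible.
    """
    n = len(high_dim)
    sq_err = [0.0] * n  # squared distance error accumulated per point
    sq_ref = [0.0] * n  # squared original distances accumulated per point
    for i, j in combinations(range(n), 2):
        d_orig = euclidean(high_dim[i], high_dim[j])
        d_proj = euclidean(low_dim[i], low_dim[j])
        err = (d_orig - d_proj) ** 2
        for k in (i, j):
            sq_err[k] += err
            sq_ref[k] += d_orig ** 2
    total_err = sum(sq_err) / 2  # every pair was counted for both endpoints
    total_ref = sum(sq_ref) / 2
    global_stress = math.sqrt(total_err / total_ref) if total_ref else 0.0
    per_point = [math.sqrt(e / r) if r else 0.0 for e, r in zip(sq_err, sq_ref)]
    return global_stress, per_point


# Toy data (hypothetical): four 3D points; the "projection" drops the third axis.
points_3d = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 5.0)]
points_2d = [(x, y) for x, y, _ in points_3d]
stress, point_stress = projection_stress(points_3d, points_2d)
worst_point = max(range(len(point_stress)), key=point_stress.__getitem__)
```

In an overlay, `point_stress` could drive the color or size of each point, so that poorly preserved points (here the one with the large third coordinate) stand out.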
5.2.6 Conclusions<br />
We presented an encompassing visual-interactive system for subspace-based analysis in<br />
high-dimensional data. Subspace-based analysis can constitute a new paradigm for high-dimensional<br />
data analysis since informative structures in the data can be found and compared<br />
in different subspaces of a larger high-dimensional input space. We defined, implemented,<br />
and demonstrated an analytical workflow based on automatic subspace search. A<br />
larger set of automatically identified interesting subspaces is grouped for interactive exploration<br />
by the user. A custom subspace similarity function allows for comparing subspaces.<br />
Our approach is able to effectively pin down several interesting views and helps to come<br />
up with specific findings regarding similarities of groups in data. We discussed a set of<br />
possible extensions of the system, which could be addressed as future work.
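To illustrate what grouping subspaces by a similarity function can look like, the sketch below groups subspaces by the overlap of their dimension sets. It is only an illustration under stated assumptions: the Jaccard coefficient stands in for the custom similarity function of this chapter, and `group_subspaces`, its greedy strategy, and the threshold are hypothetical.

```python
def dimension_jaccard(a, b):
    """Jaccard similarity of two subspaces given as sets of dimension indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_subspaces(subspaces, threshold=0.5):
    """Greedy grouping by dimension overlap (illustrative stand-in only).

    A subspace joins the first group whose representative (its first member)
    is at least `threshold` similar; otherwise it starts a new group.
    """
    groups = []
    for subspace in subspaces:
        for group in groups:
            if dimension_jaccard(subspace, group[0]) >= threshold:
                group.append(subspace)
                break
        else:
            groups.append([subspace])
    return groups


# Hypothetical subspace search output: each subspace is a set of dimension indices.
candidates = [{1, 2, 3}, {1, 2, 4}, {7, 8}, {1, 2, 3, 4}, {7, 8, 9}]
groups = group_subspaces(candidates, threshold=0.5)
```

Here the five candidates collapse into two groups, one around dimensions 1 to 4 and one around dimensions 7 to 9, which is the kind of redundancy reduction the grouping step aims at.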
6<br />
Conclusion and Future Work<br />
“The important thing is not to stop questioning. Curiosity has its own reason<br />
for existing.”<br />
Albert Einstein<br />
Contents<br />
6.1 Summary of Contributions and Future Work<br />
This chapter takes a step back from the concrete solutions presented in each chapter,<br />
positions the work in the big picture of pattern finding in high-dimensional data,<br />
points out the contributions of this thesis, and concludes the work. Future work is presented<br />
here with respect to the big picture, since each chapter contains specific conclusions<br />
and identifies particular further research directions.<br />
Nearly all domains nowadays produce high-dimensional data sets that can hide numerous<br />
important patterns. Finding these patterns is a complex task, but at the same time it is<br />
of high significance. Using visual representation to show the numerical data, automation<br />
to compute the interesting elements, and interaction to navigate the information spaces<br />
can help to spot the valuable patterns in the data. Due to the high dimensionality of<br />
the data, none of these approaches alone is powerful enough, and since the results for certain<br />
data analysis questions are previously unknown in accuracy and complexity, systems<br />
combining visualization, automation, and interaction are needed to investigate this data.<br />
6.1 Summary of Contributions and Future Work<br />
We presented two main research directions to search for interesting patterns in high-dimensional<br />
data: (1) reducing the data dimensionality by projections, ranking them, and<br />
visualizing the best, and (2) looking into different subspaces of the data and spotting the<br />
interesting ones for further analysis, by comparing them in terms of dimensions, records,<br />
and clusters.<br />
For (1) we presented new quality measures to judge the quality of projections automatically<br />
and a systematization of existing quality metrics for high-dimensional data. For our<br />
newly developed quality metrics, we chose correlation and clustering from the large spectrum<br />
of patterns. Two different types of metrics were presented in Chapter 3, namely data<br />
quality metrics and image quality metrics. Both are developed for two different visualization<br />
techniques: scatterplots and parallel coordinates. The new measures are applied to
different synthetic and real data sets to demonstrate their properties. As these automatic<br />
measures should represent the user’s preference, we conducted an empirical evaluation of<br />
four state-of-the-art measures to evaluate their correspondence to the user’s preference.<br />
This study helped us in developing guidelines for further metric development. To see the<br />
big picture regarding the recently proposed quality measures for high-dimensional data visualizations,<br />
we conducted a literature review and present in Chapter 4 a systematization<br />
of the existing measures, identifying a number of characteristic factors and developing a<br />
quality metrics pipeline to illustrate the process. The goal is to put the existing methods<br />
into a common framework, thus easing the generation of new research in the field and<br />
identifying important gaps to bridge with future research.<br />
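The general idea behind direction (1), scoring every candidate view with a quality measure and presenting the best ones first, can be sketched in a few lines. The example below is not one of the measures developed in Chapter 3: the absolute Pearson correlation is used purely as a stand-in quality function, and the data and names are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant dimension carries no correlation pattern
    return cov / math.sqrt(var_x * var_y)


def rank_scatterplots(columns, quality=lambda xs, ys: abs(pearson(xs, ys))):
    """Score every axis-aligned 2D view and return them best-first.

    `quality` is a pluggable stand-in; any measure mapping two dimensions
    to a number could be substituted.
    """
    scored = [
        ((a, b), quality(columns[a], columns[b]))
        for a, b in combinations(sorted(columns), 2)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical three-dimensional data set, stored column-wise.
data = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.1, 3.9, 6.2, 8.0, 9.8],  # nearly linear in d1
    "d3": [5.0, 1.0, 4.0, 2.0, 3.0],  # only weakly related to d1
}
ranking = rank_scatterplots(data)
best_pair, best_score = ranking[0]
```

Keeping the quality function a parameter mirrors the separation between generating alternatives and evaluating them; a user could then inspect only the top-ranked views.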
Learning from the outcome of these two chapters, the following main directions are<br />
indicated for future research:<br />
• developing new quality metrics for purposes like view optimization or visual mapping<br />
optimization;<br />
• developing new quality metrics for “non-standard” visualization techniques for high-dimensional<br />
data like pixel-based techniques, glyphs, etc.;<br />
• running user evaluations to test the quality metric applicability in real-world settings;<br />
• using quality metrics to explore projection techniques’ properties like noise or rotation<br />
invariance, scalability with respect to data points or data dimensions;<br />
• using quality metrics to select features to be used for building a model for data<br />
classifiers.<br />
For (2) we presented visual analytics approaches to understand the relations between<br />
subspaces that contain important patterns. We recognize four main research directions for<br />
using visualization and interaction to understand patterns identified by subspace algorithms:<br />
1. Interactive subspace clustering result exploration<br />
Probably the simplest way to use visualization in conjunction with subspace algorithms<br />
is to visualize their results. The workflow in Figure 6.1 illustrates the required<br />
steps. Data is processed by a subspace algorithm, and the results are visualized<br />
with interactive facilities to explore them.<br />
[Diagram: HD Data → Subspace Clustering → Visualization]<br />
Figure 6.1: Interactive exploration of subspace clustering results.<br />
In Section 5.1 we presented ClustNails, a tool to visualize subspace clustering results<br />
and support the comparison of different subspace clusters regarding their data<br />
distribution and dimension overlap. We proposed ordering strategies for dimensions<br />
and clusters to ease the cluster comparison. Brushing and linking of dimensions and<br />
clusters support the exploration.
2. Interactive subspace search result exploration<br />
One problem for all subspace clustering algorithms is the exponential number of<br />
existing subspaces. To address this issue, we decoupled the process of clustering and<br />
subspace search, to restrict the number of interesting subspaces that are inspected<br />
for clusters (see Figure 6.2).<br />
[Diagram: HD Data → Subspace Search → Visualization → Clustering]<br />
Figure 6.2: Interactive exploration of subspace search results.<br />
In Section 5.2 (see Figure 6.2 for the specific workflow) we addressed this research<br />
direction and presented a subspace search approach to identify alternative views<br />
(valid groups of clusters) in the data. We first run a subspace search algorithm<br />
on the data to identify possibly interesting subspaces. Then we visualize them for<br />
a better understanding. Given the high redundancy of the spotted subspaces, due to<br />
the high number of shared dimensions, grouping and filtering functions based on<br />
topological and dimension similarities are proposed. The groups can be navigated,<br />
and interactive lasso tools are available to manually mark clusters. Different views<br />
can be identified by comparing different subspaces according to the marked clusters.<br />
As future work, automatic clustering could be used to increase the number of different<br />
identified views.<br />
3. Visual comparison of subspace clustering results.<br />
[Diagram: HD Data → Subspace Clustering (several runs) → Comparing Visualization]<br />
Figure 6.3: Visual comparison of subspace clustering results using visualization.<br />
One further research direction would be to use visualization to compare different<br />
results of subspace algorithms (see Figure 6.3). Different results can be obtained<br />
either by running different subspace clustering algorithms on the same data set, or<br />
one algorithm with different parameter settings. Visualization can then be used to:<br />
• identify what we call a “common sense clustering”, meaning clusters that will<br />
probably pop out independent of the algorithm or parameters;
• compare different clustering results, and identify the role of parameters for<br />
specific algorithms;<br />
• integrate the user feedback obtained by looking at the visualization into the computation<br />
of clusters. Clusters can be labeled with the user’s preference (like/dislike) or<br />
merged, or alternatives to a selected cluster can be computed.<br />
4. Visually-assisted in-line steering of subspace clustering.<br />
[Diagram: HD Data → Subspace Clustering → Visualization, with intermediate results shown in a second visualization and user feedback returned to the clustering]<br />
Figure 6.4: Visual-assisted in-line steering of subspace clustering.<br />
Another research direction, and probably the most complex one, could be the intermediate<br />
use of visualization and the integration of user feedback into the algorithmic process (see<br />
Figure 6.4). This is like opening the box of subspace clustering and providing a<br />
steerable clustering tool. The algorithm can be interrupted, and intermediate results<br />
could be visualized. The user’s preference can be integrated in a feedback loop to steer<br />
the algorithm.<br />
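The “common sense clustering” mentioned under direction 3 can be made concrete with a co-association count over several clustering results. The sketch below is one possible reading, not a method proposed in this thesis: it assumes each result assigns at most one label per data point (subspace clusterings may in fact overlap), and all function names and the toy label lists are hypothetical.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of clusterings in which each pair of points shares a cluster.

    `labelings` is a list of label lists, one per clustering result;
    None marks a point that the result left unassigned (noise).
    """
    n = len(labelings[0])
    shares = {}
    for i, j in combinations(range(n), 2):
        together = sum(
            1 for labels in labelings
            if labels[i] is not None and labels[i] == labels[j]
        )
        shares[(i, j)] = together / len(labelings)
    return shares


def consensus_groups(labelings, min_agreement=1.0):
    """Groups of points linked by pairs co-clustered in >= min_agreement of results."""
    n = len(labelings[0])
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), share in co_association(labelings).items():
        if share >= min_agreement:
            parent[find(i)] = find(j)
    groups = {}
    for point in range(n):
        groups.setdefault(find(point), []).append(point)
    return sorted(g for g in groups.values() if len(g) > 1)


# Three hypothetical results (different algorithms or parameter settings).
runs = [
    [0, 0, 0, 1, 1, 1],
    [5, 5, 5, 5, 7, 7],
    [2, 2, 2, 3, 3, None],
]
stable = consensus_groups(runs, min_agreement=1.0)
relaxed = consensus_groups(runs, min_agreement=0.6)
```

With full agreement required, only points 0 to 2 form a group; relaxing the threshold also recovers points 3 to 5. A comparison view could highlight exactly this difference between stable and parameter-dependent clusters.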
In conclusion, we can say that the complexity and amount of high-dimensional data<br />
require a combination of the strengths of visualization, automation, and interaction to<br />
discover interesting, unknown facets of these data sets. This thesis has addressed some<br />
of the relevant research questions in this field. At the same time, new questions arose<br />
that will hopefully motivate other researchers to develop applicable solutions in the near<br />
future.
List of Figures<br />
1.1 Multiple valid and interesting groupings of a high-dimensional data set [104].<br />
1.2 Schematic overview of the interrelation of chapters in this thesis.<br />
2.1 High-dimensional visualization techniques taken from [145]. A: Scatterplot<br />
matrix showing on the diagonal a histogram plot for each dimension. Selected<br />
points are marked in red in all plots. B: Parallel coordinates plot of<br />
a seven-dimensional data set. One polyline representing one data point is<br />
highlighted in red. C: Star glyphs in an MDS layout. D: Dense pixel displays<br />
representing a 14-dimensional data set.<br />
2.2 (A) Scagnostics SPLOM having as axes scagnostics measures and showing<br />
each data scatterplot as a point in the measures scatterplot [152]. (B)<br />
Scagnostics indices used as quality measures to rank data scatterplots [152].<br />
2.3 Visual interactive feature selection systems. A: Rank-by-Feature Framework<br />
presented in [125]. B: Feature selection supported by quality measures<br />
[82]. C: DimStiller for feature selection [76].<br />
2.4 Interactive visual analysis systems for clustering in high-dimensional visualization.<br />
A: Interactive exploration of hierarchically clustered data along<br />
a dendrogram [124]. B: (a) Grouping icons to form clusters based on visual<br />
similarity. (b) User-defined grouping of icons [35].<br />
2.5 Interactive visual analysis systems for classification in high-dimensional<br />
data. A: Visual classification from [11] illustrates the decision tree for DNA<br />
training data having 19 attributes, visualizing each attribute-value by a<br />
colored pixel arranged in bars. B: Decision tree construction system [142],<br />
representing the tree in a node-link diagram, displaying split points on the<br />
links and the split attributes on the node.<br />
2.6 (a) VISA system [14]. Left: MDS projection for the global view of clusters.<br />
Right: Matrix of subspace clusters for in-depth view. (b) Heidi Matrix [141]<br />
over a subspace.<br />
2.7 Visualization techniques applied in Ferdosi’s work [52]. Left: 1D subspace.<br />
Middle: 2D subspace. Right: Subspace with 3 or more dimensions.<br />
3.1 Working steps for using quality metrics to rank high-dimensional visualizations<br />
according to a given task.<br />
3.2 Scatterplot example and its respective density image. For each pixel we<br />
compute the mass distribution along different directions and save the smallest<br />
value, here depicted by the blue line.<br />
3.3 2D view and rotated projection axes. The projection on the rotated plane<br />
has less overlap, and the structures of the data can be seen even in the<br />
projection. This is not possible for a projection on the original axes.<br />
3.4 First step of the HDM approach: each plot is ranked for different rotations<br />
with the 1D-HDM. The best measure value is taken for the plot.<br />
3.5 Second step of the HDM approach: PCA is computed on the k best selected<br />
dimensions and on all the possible subsets greater than 3 dimensions. The<br />
first two components are plotted in scatterplots, which are ranked with the<br />
2D-HDM. The best measure value indicates the best scatterplot where the<br />
class information is separated.
3.6 Synthetic examples of parallel coordinates and their respective Hough spaces:<br />
(a) presents two well-defined line clusters and is more interesting for the<br />
cluster identification task than (b), where no line cluster can be identified.<br />
Note that the bright areas in the ρθ-plane represent the clusters of lines<br />
with similar ρ and θ.<br />
3.7 Results for the Parkinson’s Disease data set using our RVM measure (Section<br />
3.1.2). While clumpy low-correlation bearing views are punished (bottom<br />
row), views containing higher correlation between the variables are<br />
preferred (top row).<br />
3.8 Results for the Olives data set using our CDM measure (Section 3.1.3).<br />
The different colors depict the different classes (regions) of the data set.<br />
While it is impossible for this data set to find views completely separating<br />
all classes, our CDM measure still found views where most of the classes<br />
are mutually separated (top row). In the worst ranked views the classes<br />
clearly overlap with each other (bottom row).<br />
3.9 Results for the Olives data set using our HDM measure (Section 3.1.3). The<br />
best ranked plot is the PCA of dim(4,5,8) revealing a good view on all the<br />
classes, the second best is the PCA of dim(1,2,4) and the third is the PCA<br />
on all 8 dimensions. The differences between the last two are small because<br />
the variance in the additional dimensions for the 3rd eigenvector, relative<br />
to the 2nd, is not big. The difference between the last two views and the<br />
first view is clearly visible (e.g., looking at the yellow class).<br />
3.10 Results for the Wine data set using our CSM measure (Section 3.1.3). The<br />
best ranked plots present a large distance between the centers of the class<br />
clusters while the worst ranked views show only cluttered data.<br />
3.11 Results for the Wine data set using our CDM measure (Section 3.1.3). Note<br />
that the second best ranked view, (dim1,dim7) (with CDM = 89), is not<br />
considered good using the CSM measure (CSM = 58).<br />
3.12 Results on the WDBC data set for the RVM (top) and the CDM (bottom).<br />
In this example, views with a quality value of less than 0.95 have been<br />
faded out. This way many irrelevant views can be faded out, reducing the<br />
number of plots to be inspected by the user in more detail to a more<br />
manageable number.<br />
3.13 Results for the non-classified version of the Parkinson’s Disease data set.<br />
Best and worst ranked visualizations using our HSM measure for non-classified<br />
data (ref. Section 3.1.4). Top row: The three best ranked visualizations<br />
and their respective normalized measures. Well-defined clusters<br />
in the data set are favored. Bottom row: The three worst ranked visualizations.<br />
The large amount of spread hampers interpretation. Note<br />
that the user task related to this measure is not to find possible correlation<br />
between the dimensions but to detect well-separated clusters.<br />
3.14 Results of the SM for the Cars data set. Cars using gasoline are shown in<br />
black, diesel in red. Best and worst ranked visualizations using our Hough<br />
Similarity Measure (Section 3.1.5) for parallel coordinates. Top row: The<br />
three best ranked visualizations and their respective normalized measures.<br />
Bottom row: The three worst ranked visualizations.
3.15 Results of the OM for the WDBC data set. Malign nuclei are colored black<br />
while healthy nuclei are red. Best and worst ranked visualizations using<br />
our Overlap Measure (Section 3.1.5) for parallel coordinates. Top row: The<br />
three best ranked visualizations. Despite good similarity, which are similar<br />
to clusters, visualizations are favored that minimize the overlap between the<br />
classes, so that the difference between malign and benign cells becomes<br />
clearer. Bottom row: The three worst ranked visualizations. The overlap of<br />
the data complicates the analysis and the information is useless for the task<br />
of discriminating malign and benign cells.<br />
3.16 Results of the HSM for the synthetic data set from [82] presenting the best<br />
and worst ranked visualizations using our HSM measure for non-classified<br />
data (ref. Section 3.1.4). Top row: The three best ranked visualizations and<br />
their respective normalized measures. Well-defined clusters in the data set<br />
are favored. Bottom row: The three worst ranked visualizations. The large<br />
amount of spread hampers interpretation. Note that the user task related<br />
to this measure is not to find high correlation between the dimensions but<br />
to detect well-separated clusters.<br />
3.17 Matrix for the synthetic data set with scatterplots above the main diagonal<br />
and parallel coordinate plots below.<br />
3.18 Results of the 7 measures for classified and unclassified data. The left<br />
column shows the result for the scatterplot measures and the right column<br />
for the parallel coordinates measures. The ranks are sorted in decreasing order and<br />
the target patterns are marked with red crosses.<br />
3.19 Scatterplot of the first two components of the PCA over dimensions 2, 5<br />
and 6.<br />
3.20 Projections of scatterplots used in the experiment. Participants had to<br />
select the best five projections and order them by their quality. The order<br />
of the scatterplots was permuted for each participant separately using the<br />
Latin-Square method.<br />
3.21 Correlation of measures with users’ classification shows highest R² values<br />
for the 2D-HDM measure.<br />
3.22 Correlation of measures with users’ classification for highest and one lowest<br />
quality projection.<br />
3.23 Surprising study results.<br />
4.1 (Top row of Figure 3.8) Ranking projections according to the Class Density<br />
Measure, favoring projections with minimal overlap between predefined<br />
classes (i.e., the colors) [133].<br />
4.2 Clutter reduction achieved through axes reordering in a scatterplot matrix<br />
(initial visualization on the left, reordered on the right) [112].<br />
4.3 Data abstraction algorithm based on sampling, aiming at reducing data size<br />
while preserving relevant patterns. Original visualization on the left with<br />
16384 data items. Sampled visualization on the right with 987 items and a<br />
visual quality of 0.95 [80].
4.4 Quality metrics pipel<strong>in</strong>e. The pipel<strong>in</strong>e provides an additional layer named<br />
quality metrics base automation on top <strong>of</strong> the traditional <strong>in</strong>formation visualization<br />
pipel<strong>in</strong>e [36]. The layer obta<strong>in</strong>s <strong>in</strong>formation from the stages <strong>of</strong> the<br />
pipel<strong>in</strong>e (the boxes) and <strong>in</strong>fluences the processes <strong>of</strong> the pipel<strong>in</strong>e through the<br />
metrics it calculates. The user is always <strong>in</strong> control. . . . . . . . . . . . . . . 72<br />
4.5 Mapping a 10-dimensional data set to a scatterplot with four visual primitives<br />
(x-axis, y-axis, size, and color) has over 5000 possible alternative<br />
mappings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72<br />
4.6 Quality metrics pipel<strong>in</strong>e for the first example from [133]: (A) generation <strong>of</strong><br />
alternatives; (B) evaluation <strong>of</strong> alternatives (image space); (C) creation <strong>of</strong><br />
the f<strong>in</strong>al representation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79<br />
4.7 Interactive chart to select number <strong>of</strong> dimensions to keep vs. <strong>in</strong>formation<br />
loss [82]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80<br />
4.8 Top: best order<strong>in</strong>g to enhance cluster<strong>in</strong>g. Bottom: best order<strong>in</strong>g to enhance<br />
correlation [82]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80<br />
4.9 Quality metrics pipel<strong>in</strong>e for the second example from [82]: (A) dimensions<br />
ranked by their importance; (B) selection <strong>of</strong> number <strong>of</strong> dimensions to reta<strong>in</strong><br />
vs. <strong>in</strong>formation loss; (C) creation <strong>of</strong> the f<strong>in</strong>al mapp<strong>in</strong>g with order<strong>in</strong>g. . . . . 81<br />
4.10 <strong>Visual</strong> abstraction <strong>of</strong> a scatterplot matrix from [42]. . . . . . . . . . . . . . 81<br />
4.11 Quality metrics pipel<strong>in</strong>e for example three from [42]: (A) data features compared<br />
between the orig<strong>in</strong>al data and the abstracted data; (B) <strong>in</strong>stantiation<br />
<strong>of</strong> the desired abstraction level guided by quality metrics. . . . . . . . . . . 82<br />
4.12 <strong>Visual</strong> abstraction chart with threshold sett<strong>in</strong>g for the abstraction level and<br />
feedback on abstraction quality [42]. . . . . . . . . . . . . . . . . . . . . . . 82<br />
4.13 Left: star glyphs represent<strong>in</strong>g orig<strong>in</strong>al data set. Right: visualized data after<br />
DOSFA was applied [158]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83<br />
4.14 Quality metrics pipel<strong>in</strong>e for example four from [158]: (A) construct hierarchical<br />
structure <strong>of</strong> dimensions by cluster<strong>in</strong>g; (B) filter dimensions by<br />
similarity and importance; (C) map dimensions order<strong>in</strong>g to visualization;<br />
(D) <strong>in</strong>fluence the view accord<strong>in</strong>g to the quality measured (spac<strong>in</strong>g the parallel<br />
coord<strong>in</strong>ates accord<strong>in</strong>g to their similarity). The user can steer all these<br />
steps, after interacting with the clustered dimensions shown in an InterRing<br />
visualization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83<br />
4.15 Taxonomy <strong>of</strong> factors <strong>in</strong> visual cluster separation, where factor axes are<br />
marked to show the ranges where exist<strong>in</strong>g measures are successful; gaps<br />
represent failure cases. The centroid measure (CDM) is marked <strong>in</strong> blue and<br />
the grid (2D-HDM) is marked <strong>in</strong> red. All positions are approximate estimates.<br />
Marked along the factor axes are six data sets that are exemplified<br />
<strong>in</strong> the paper. (Used with permission by [122].) . . . . . . . . . . . . . . . . 88<br />
4.16 A taxonomy <strong>of</strong> data characteristics with respect to class separation <strong>in</strong> scatterplots.<br />
Some factors are organized as axes (arrows) while others are<br />
b<strong>in</strong>ned. Between-Class factors <strong>of</strong>ten result from the variance <strong>of</strong> With<strong>in</strong>-<br />
Class factors (horizontal dependencies), and factors at the top can strongly<br />
<strong>in</strong>fluence factors below them (vertical dependencies). Class Separation is<br />
therefore dependent on all other factors (used with permission by [122]). . . 89<br />
5.1 <strong>Data</strong> projected <strong>in</strong> several subspaces. . . . . . . . . . . . . . . . . . . . . . . 95<br />
5.2 Workflow <strong>of</strong> subspace cluster analysis us<strong>in</strong>g the ClustNails system. . . . . 102
5.3 Two subspace clusters visualized as spikes. The clusters share common dimensions,<br />
but the importance of the dimensions for the clusters is different.<br />
Dim29 and dim32 in the left cluster show smaller spikes than in the right<br />
cluster, as they are considered less important for the definition of that cluster<br />
according to our measure wk m . Furthermore, the left cluster has fewer<br />
dimensions and more objects than the right cluster. . . . . . . . . . . . . . . 103<br />
5.4 HeatNails visualization. Bottom: show<strong>in</strong>g the distribution <strong>of</strong> dimension<br />
values for all dimensions (rows) and records (columns). Top: show<strong>in</strong>g histograms<br />
for the values of all dimensions per cluster for comparison purposes. . . 104<br />
5.5 <strong>Visual</strong>ization <strong>of</strong> the subspace clusters <strong>of</strong> the USDA Food Composition data<br />
set generated by Proclus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107<br />
5.6 Sorted view (Value order<strong>in</strong>g function applied). . . . . . . . . . . . . . . . . 107<br />
5.7 <strong>Visual</strong>ization <strong>of</strong> the subspace clusters <strong>in</strong> VISA [14] framework discussed <strong>in</strong><br />
Subsection 5.1.5. Cluster view (left), record view (right). . . . . . . . . . . . 108<br />
5.8 Alternative data distributions and groupings from [103] in two different subspaces<br />
<strong>of</strong> a larger high-dimensional data space (doma<strong>in</strong> here: demographic<br />
data analysis). Our proposed visual analysis method <strong>in</strong>tegrates the notion<br />
<strong>of</strong> alternative subspaces <strong>in</strong>to the analysis process and l<strong>in</strong>ks it to the task <strong>of</strong><br />
comparative cluster analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . 111<br />
5.9 Our proposed analysis pipel<strong>in</strong>e. A subspace selection algorithm is applied<br />
to automatically identify a candidate set <strong>of</strong> <strong>in</strong>terest<strong>in</strong>g subspaces. A filter<strong>in</strong>g<br />
step reduces the potentially large and redundant set <strong>of</strong> automatically<br />
obtained subspaces to a user-selectable number of representative subspaces.<br />
<strong>Visual</strong>-<strong>in</strong>teractive user exploration then proceeds on the subspace representations.<br />
Subspace analysis is also supported by comparative cluster views,<br />
allow<strong>in</strong>g users to identify mean<strong>in</strong>gful similar, complementary or even conflict<strong>in</strong>g<br />
cluster<strong>in</strong>g structures <strong>in</strong> the set <strong>of</strong> subspaces. . . . . . . . . . . . . . 114<br />
5.10 Filter<strong>in</strong>g cases that can be supported by our two def<strong>in</strong>ed subspace similarity<br />
functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />
5.11 Subspace representation by 2D scatterplots with dimension glyph. We can<br />
see the visual representations <strong>of</strong> two 5D subspaces (left) and one 4D subspace<br />
(right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116<br />
5.12 (1) Linearly sorted view of subspaces for the 12D synthetic data set from<br />
[52] showing the full result of SURFING, consisting of 296 subspaces. The<br />
selected subspace in this view is shown in (2) a single subspace view to<br />
enable interaction and in (3) a parallel coordinates view with the subspace<br />
dimensions as the first axes (highlighted), and all the other data dimensions<br />
as the last axes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117<br />
5.13 Hierarchical agglomerative group<strong>in</strong>g <strong>of</strong> the 296 <strong>in</strong>terest<strong>in</strong>g subspaces. The<br />
red l<strong>in</strong>e shows the threshold for 6 groups shown <strong>in</strong> the subspace group view.<br />
Each group is marked by a colored rectangle. The colors are ma<strong>in</strong>ta<strong>in</strong>ed <strong>in</strong><br />
Figure 5.14. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118<br />
5.14 Subspace group view for the 12D synthetic data set with six subspace groups. . 118<br />
5.15 Dimension-based subspace similarity MDS view <strong>of</strong> the 296 subspaces selected<br />
by the subspace search algorithm. . . . . . . . . . . . . . . . . . . . . 119
5.16 All l<strong>in</strong>ked views: (1) Subspace group view for the 12D synthetic data set<br />
with six subspace groups. (2) S<strong>in</strong>gle subspace view show<strong>in</strong>g the representative<br />
subspace for the first group. (3) Details-on-demand <strong>in</strong> the parallel<br />
coord<strong>in</strong>ates view for the selected subspace. (4) The MDS layout <strong>of</strong> the subspace<br />
search results based on their dimension similarity. (5) Group detail<br />
view for the three (orange, green, purple) subspace groups. (6) Hierarchical<br />
navigation buttons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120<br />
5.17 Cut-out of the linearly sorted view of subspaces for the 18D USDA Food Composition<br />
data set. The full result of SURFING consists of 216 subspaces.<br />
We see a rather high level <strong>of</strong> redundancy. Subspaces exhibit<strong>in</strong>g more structure<br />
are found <strong>in</strong> particular at the mid and end positions <strong>in</strong> the rank<strong>in</strong>g.<br />
Rely<strong>in</strong>g only on the numerically top ranked results, we would have omitted<br />
such <strong>in</strong>terest<strong>in</strong>g cases from the analysis. . . . . . . . . . . . . . . . . . . . . 123<br />
5.18 (A) Interesting spotted subspace (Carbohydrate, Fiber) presenting two clusters.<br />
(B) Subspace (Carbohydrate, Lipid, Protein) in the same cluster<br />
group as (A), where the cluster structure changes. (C) Green marked third<br />
cluster in subspace from (B). (D) Subspace (Fiber, Protein, Vit D) of orange<br />
color-framed subspace group, where the alternative clustering of points<br />
is visible. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123<br />
5.19 (1) Grouped view <strong>of</strong> subspaces for the 18D USDA Food Composition <strong>Data</strong><br />
Set with 12 group representatives. (2) The brown and orange group components<br />
are shown <strong>in</strong> the components view. (3) MDS Layout <strong>of</strong> the total<br />
number <strong>of</strong> subspaces with cross-colored group representatives. . . . . . . . . 124<br />
6.1 Interactive exploration <strong>of</strong> subspace cluster<strong>in</strong>g results. . . . . . . . . . . . . . 130<br />
6.2 Interactive exploration <strong>of</strong> subspace search results. . . . . . . . . . . . . . . . 131<br />
6.3 <strong>Visual</strong> comparison <strong>of</strong> subspace cluster<strong>in</strong>g results us<strong>in</strong>g visualization. . . . . 131<br />
6.4 <strong>Visual</strong>-assisted <strong>in</strong>-l<strong>in</strong>e steer<strong>in</strong>g <strong>of</strong> subspace cluster<strong>in</strong>g. . . . . . . . . . . . . . 132<br />
A.1 Empirical study experiment form version A. . . . . . . . . . . . . . . . . . . 153<br />
A.2 Empirical study experiment form version B. . . . . . . . . . . . . . . . . . . 154<br />
A.3 The eight projections that were never selected by a user as being on the<br />
scale 1 to 5 in terms of separability of classes among the 18 presented plots. 155<br />
A.4 Pipeline for “A Projection Pursuit Algorithm for Exploratory Data Analysis”<br />
by Friedman and Tukey [54]: (A) different 2D linear, but not axis-parallel,<br />
data projections are computed and evaluated by the quality metric;<br />
(B) the best projection direction is chosen by the quality metric, called the “usefulness”<br />
index, which measures the quality of a projection axis; the projection<br />
direction is varied so that the index is maximized. . . . . . . . . . . . . . . 156<br />
A.5 Pipeline for “A Rank-by-Feature Framework for Interactive Exploration of<br />
Multidimensional Data” by Seo and Shneiderman [126]: (A) projections are<br />
generated, and each 1D and 2D projection is evaluated/ranked by a quality<br />
metric selected by the user; (B) the best projections are presented; (C) the<br />
ranking scores are presented in a color-coded grid (“Score Overview”), as well as in a<br />
color-coded “Ordered List” for each projection. The user selects one view in the<br />
list or grid, and can also change dimension axes, after which the view adapts.<br />
Please note: here we have a visualization of dimensions and quality metric<br />
scores, which is highly interactive, rather than a static projection of data<br />
records. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156<br />
A.6 Pipel<strong>in</strong>e for “F<strong>in</strong>d<strong>in</strong>g and <strong>Visual</strong>iz<strong>in</strong>g Relevant Subspaces for Cluster<strong>in</strong>g<br />
<strong>High</strong>-<strong>Dimensional</strong> Astronomical <strong>Data</strong> Us<strong>in</strong>g Connected Morphological Operators”<br />
by Ferdosi et al. [52]: (A) generation <strong>of</strong> projections, all above 3D<br />
are reduced with PCA; the user can change the smoothing parameter, which<br />
<strong>in</strong>fluences the number <strong>of</strong> projections; (B) evaluate each view; the user can<br />
select the view to <strong>in</strong>spect. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157<br />
A.7 Pipel<strong>in</strong>e for “Graph-Theoretic Scagnostics” by Wilk<strong>in</strong>son et al. [151]: (A)<br />
generation <strong>of</strong> projections; (B) all 2D views are ranked by several metrics; (C)<br />
once the metrics have been computed, they are used to create the SPLOM<br />
(rows and columns are the metrics) - projections are mapped as data po<strong>in</strong>ts. 157<br />
A.8 Pipel<strong>in</strong>e for “Select<strong>in</strong>g good views <strong>of</strong> high-dimensional data us<strong>in</strong>g class consistency”<br />
by Sips et al. [129]: (A) all 2D projections are ranked with the<br />
quality metric; (B) each view is associated with a quality metric computed<br />
<strong>in</strong> A; (C) view transformation decides which scatterplot to highlight (fade<br />
out) depend<strong>in</strong>g on the quality values and the set threshold. . . . . . . . . . 157<br />
A.9 Pipel<strong>in</strong>e for “Coord<strong>in</strong>at<strong>in</strong>g computational and visual approaches for <strong>in</strong>teractive<br />
feature selection and multivariate cluster<strong>in</strong>g” by Guo [59]: (A) all 2D<br />
projections are evaluated with the “m<strong>in</strong>imum conditional entropy (MCE)”;<br />
(B) orig<strong>in</strong>al dimensions are clustered to f<strong>in</strong>d an order<strong>in</strong>g accord<strong>in</strong>g to their<br />
MCE value; (C) matrix ordered accord<strong>in</strong>g to dimension cluster<strong>in</strong>g. The<br />
user can 1) select, add to, or subtract from a variable subset that is analyzed<br />
further; 2) move the threshold bar for the connect<strong>in</strong>g edges, and<br />
clusters are automatically extracted and colored; 3) <strong>in</strong>teract to l<strong>in</strong>k, brush<br />
and select elements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158<br />
A.10 Pipel<strong>in</strong>e for “Explor<strong>in</strong>g <strong>High</strong>-D Spaces with Multiform Matrices and Small<br />
Multiples” by MacEachren et al. [98]: (A) automatic selection <strong>of</strong> potentially<br />
<strong>in</strong>terest<strong>in</strong>g subspaces <strong>of</strong> variables; the user can also manually select<br />
subspaces; (B) all 2D plots are ranked with a quality metric (conditional<br />
entropy based); (C) the matrix view is colored and ordered accord<strong>in</strong>g to<br />
the quality metric value. The user can select a dimension subset to be<br />
visualized with other visualization techniques. . . . . . . . . . . . . . . . . . 158<br />
A.11 Pipel<strong>in</strong>e for “Improv<strong>in</strong>g the <strong>Visual</strong> Analysis <strong>of</strong> <strong>High</strong>-dimensional <strong>Data</strong>sets<br />
Us<strong>in</strong>g Quality Measures” by Albuquerque et al. [8] for Jigsaw Maps: (A)<br />
mapp<strong>in</strong>g <strong>of</strong> dimension to 2D displays; (B) all 2D plots are ranked with a<br />
quality metric to select the best. . . . . . . . . . . . . . . . . . . . . . . . . 158
A.12 Pipel<strong>in</strong>e for “Improv<strong>in</strong>g the <strong>Visual</strong> Analysis <strong>of</strong> <strong>High</strong>-dimensional <strong>Data</strong>sets<br />
Us<strong>in</strong>g Quality Measures” by Albuquerque et al. [8] for RadVis: (A) all views<br />
are ranked with a quality metric; (B) dimensions are ordered accord<strong>in</strong>g to<br />
quality values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158<br />
A.13 Pipel<strong>in</strong>e for “Improv<strong>in</strong>g the <strong>Visual</strong> Analysis <strong>of</strong> <strong>High</strong>-dimensional <strong>Data</strong>sets<br />
Us<strong>in</strong>g Quality Measures” by Albuquerque et al. [8] for Table Lens: (A)<br />
quality metric is computed on the data; (B) user can select an area, marking<br />
dimensions and records; the view is then transformed according to the user<br />
<strong>in</strong>teraction; (C) colors are mapped accord<strong>in</strong>g to the quality metrics values<br />
for outliers and correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 159<br />
A.14 Pipeline for “Pargnostics: Screen-Space Metrics for Parallel Coordinates”<br />
by Dasgupta and Kosara [43]: (A) all 2D views are evaluated according to the<br />
metrics; (B) the best pairs are selected to compute the best order<strong>in</strong>g <strong>of</strong> dimensions.<br />
The user can also <strong>in</strong>fluence this decision by select<strong>in</strong>g <strong>in</strong>terest<strong>in</strong>g<br />
plots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159<br />
A.15 Pipel<strong>in</strong>e for “Comb<strong>in</strong><strong>in</strong>g automated analysis and visualization techniques<br />
for effective exploration of high-dimensional data” by Tatu et al. [133] for<br />
HDM: (A) all 2D data tables are evaluated accord<strong>in</strong>g to the 1D-HDM; (B)<br />
create the best nD visible on the 2D plot (with PCA), evaluated by the<br />
2D-HDM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159<br />
A.16 Pipel<strong>in</strong>e for “<strong>High</strong>-<strong>Dimensional</strong> <strong>Visual</strong> <strong>Analytics</strong>: Interactive Exploration<br />
Guided by Pairwise Views <strong>of</strong> Po<strong>in</strong>t Distributions” by Wilk<strong>in</strong>son et al. [152]:<br />
(A) generation <strong>of</strong> projections; (B) all 2D views are evaluated accord<strong>in</strong>g to<br />
quality metric; (C) a sorted/highlighted view is created us<strong>in</strong>g the metrics.<br />
The user can navigate through the ranked list, and sort and highlight plots<br />
<strong>in</strong> this and the SPLOM view. . . . . . . . . . . . . . . . . . . . . . . . . . . 159<br />
A.17 Pipel<strong>in</strong>e for “Clutter Reduction <strong>in</strong> Multi-<strong>Dimensional</strong> <strong>Data</strong> <strong>Visual</strong>ization<br />
Us<strong>in</strong>g Dimension Reorder<strong>in</strong>g” by Peng et al. [112]: (A) quality metric is<br />
computed on the data; (B) quality metric calculated also dependent on the<br />
visual abstraction; (C) best visual mapp<strong>in</strong>g (order<strong>in</strong>g) decided based on<br />
metric values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160<br />
A.18 Pipel<strong>in</strong>e for “Similarity Cluster<strong>in</strong>g <strong>of</strong> Dimensions for an Enhanced <strong>Visual</strong>ization<br />
<strong>of</strong> Multidimensional <strong>Data</strong>” by Ankerst et al. [9]: (A) quality metric<br />
is computed on the data; (B) quality metric calculated also dependent on<br />
the visual abstraction; (C) best visual mapp<strong>in</strong>g (order<strong>in</strong>g) decided based<br />
on metric values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160<br />
A.19 Pipel<strong>in</strong>e for “Quality Metrics for 2D Scatterplot Graphics: Automatically<br />
Reduc<strong>in</strong>g <strong>Visual</strong> Clutter” by Bert<strong>in</strong>i and Santucci [24]: (A) quality metric<br />
is computed on the data density and screen density and compared; (B)<br />
projection and sampl<strong>in</strong>g based on metric values. . . . . . . . . . . . . . . . 160<br />
A.20 Pipel<strong>in</strong>e for “A Screen Space Quality Method for <strong>Data</strong> Abstraction” by Johansson<br />
and Cooper [80]: (A) sampled and orig<strong>in</strong>al data tables are associated<br />
to quality metric computed on the views <strong>of</strong> sampled and orig<strong>in</strong>al data;<br />
(B) the values are used to decide upon the sampl<strong>in</strong>g rate. . . . . . . . . . . 160
A.21 Pipel<strong>in</strong>e for “Enabl<strong>in</strong>g Automatic Clutter Reduction <strong>in</strong> Parallel Coord<strong>in</strong>ate<br />
Plots” by Ellis and Dix [48]: (A) pixel occlusion is measured <strong>in</strong> the view<br />
space; the user can move a window (lens), and sampling and measuring<br />
occlusion are done only in this window; (B) the values of the quality metric<br />
are used to decide upon the sampl<strong>in</strong>g rate. . . . . . . . . . . . . . . . . . . . 161<br />
A.22 Pipel<strong>in</strong>e for “Pixnostics: Towards Measur<strong>in</strong>g the Value <strong>of</strong> <strong>Visual</strong>ization”<br />
by Schneidew<strong>in</strong>d et al. [120]: (A) a subset <strong>of</strong> dimensions is selected with<br />
standard m<strong>in</strong><strong>in</strong>g techniques; (B) alternative mapp<strong>in</strong>gs <strong>of</strong> selected data are<br />
evaluated on the screen space; (C) and (D) based on the quality value the<br />
best subset and mapping are determined. The user can decide to manually fix the<br />
mapping of some data features to visual features. . . . . . . . . . . . . . . . 161<br />
A.23 Hierarchical agglomerative group<strong>in</strong>g <strong>of</strong> the 296 <strong>in</strong>terest<strong>in</strong>g subspaces. The<br />
red l<strong>in</strong>e shows the threshold for 6 groups shown <strong>in</strong> the subspace group view.<br />
Each group is marked by a colored rectangle. The colors are ma<strong>in</strong>ta<strong>in</strong>ed <strong>in</strong><br />
Figure 5.14. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
List <strong>of</strong> Tables<br />
3.1 Overview and classification <strong>of</strong> our quality measures. . . . . . . . . . . . . . 31<br />
3.2 Overview of the data sets used to show the measures’ properties. . . . . . 41<br />
3.3 Overview <strong>of</strong> the analyzed measures with the reference for additional details. 55<br />
3.4 Results <strong>of</strong> the regression analysis. . . . . . . . . . . . . . . . . . . . . . . . . 60<br />
4.1 <strong>Visual</strong>ization techniques categorized by their layout dimensionality (i.e., the<br />
number <strong>of</strong> axes <strong>of</strong> the visualization). . . . . . . . . . . . . . . . . . . . . . . 77<br />
4.2 Quality metrics papers classified accord<strong>in</strong>g to quality metrics factors (sorted<br />
by purpose). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78<br />
A.1 Dimension names for the Cars data set. . . . . . . . . . . . . . . . . . . . . 146<br />
A.2 Dimension names for the Olives data set [163]. . . . . . . . . . . . . . . . . 146<br />
A.3 Dimension names for the Park<strong>in</strong>son’s Disease data set [95, 96]. . . . . . . . 147<br />
A.4 Dimension names for the W<strong>in</strong>e data set [53]. . . . . . . . . . . . . . . . . . 147<br />
A.5 Dimension names for the WDBC data set [131]. . . . . . . . . . . . . . . . . 148
A<br />
Appendix<br />
Contents<br />
A.1 Orig<strong>in</strong>al <strong>Data</strong> Dimensions for Used <strong>Data</strong> Sets . . . . . . . . . . 145<br />
A.2 Empirical Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149<br />
A.2.1 General Questions Form . . . . . . . . . . . . . . . . . . . . . . . 149<br />
A.2.2 Experiment Form . . . . . . . . . . . . . . . . . . . . . . . . . . . 152<br />
A.2.3 Additional Experiment Results . . . . . . . . . . . . . . . . . . . 155<br />
A.3 Quality Metrics Pipel<strong>in</strong>es for the Literature Review . . . . . . 156<br />
A.4 Hierarchical Group<strong>in</strong>g <strong>of</strong> Interest<strong>in</strong>g Subspaces . . . . . . . . . 162<br />
A.1 Orig<strong>in</strong>al <strong>Data</strong> Dimensions for Used <strong>Data</strong> Sets<br />
The Cars data set was collected by another institute in Braunschweig and provided to<br />
our partners there. Its original dimensions are enumerated in Table A.1.<br />
Table A.1: Dimension names for the Cars data set.<br />
original | renamed<br />
TYPEOFMOTOR | dim0 (class)<br />
MANUFACTURER | dim1<br />
TYPE | dim2<br />
PRICE | dim3<br />
CYLINDERCAPACITY | dim4<br />
POWER | dim5<br />
RPM | dim6<br />
TORQUE | dim7<br />
VMAX | dim8<br />
ACCELERATION | dim9<br />
FUELCONSUMPTION | dim10<br />
CO2EMISSION | dim11<br />
WEIGHT | dim12<br />
LENGTH | dim13<br />
WIDTH | dim14<br />
HEIGHT | dim15<br />
WHEELBASE | dim16<br />
LOADCAPACITY | dim17<br />
TRUNK | dim18<br />
TOWINGCAPACITY | dim19<br />
ROOFLOAD | dim20<br />
TANKCAPACITY | dim21<br />
TAXES | dim22<br />
The Olives data set can be found at<br />
http://www2.chemie.uni-erlangen.de/publications/ANN-book/datasets/oliveoil/index.html<br />
and has the original dimensions enumerated in Table A.2.<br />
Table A.2: Dimension names for the Olives data set [163].<br />
original | renamed<br />
palmitic | dim1<br />
palmitoleic | dim2<br />
stearic | dim3<br />
oleic | dim4<br />
linoleic | dim5<br />
linolenic | dim6<br />
arachidic | dim7<br />
eicosenoic | dim8<br />
area | dim9 (class)<br />
The Parkinson’s Disease data set can be found at<br />
http://archive.ics.uci.edu/ml/datasets/Parkinsons and has the original dimensions<br />
enumerated in Table A.3.<br />
Table A.3: Dimension names for the Parkinson’s Disease data set [95, 96].<br />
original | renamed<br />
status - health status of the subject (one) - Parkinson’s, (zero) - healthy | dim1 (class)<br />
MDVP:Fo(Hz) - average vocal fundamental frequency | dim2<br />
MDVP:Fhi(Hz) - maximum vocal fundamental frequency | dim3<br />
MDVP:Flo(Hz) - minimum vocal fundamental frequency | dim4<br />
MDVP:Shimmer(dB) - measure of variation in amplitude | dim5<br />
HNR - measure of ratio of noise to tonal components in the voice | dim6<br />
RPDE - nonlinear dynamical complexity measure | dim7<br />
D2 - nonlinear dynamical complexity measure | dim8<br />
DFA - signal fractal scaling exponent | dim9<br />
spread1 - nonlinear measure of fundamental frequency variation | dim10<br />
spread2 - nonlinear measure of fundamental frequency variation | dim11<br />
PPE - nonlinear measure of fundamental frequency variation | dim12<br />
The W<strong>in</strong>e data set can be found at http://archive.ics.uci.edu/ml/datasets/<br />
W<strong>in</strong>e and has the orig<strong>in</strong>al dimensions enumerated <strong>in</strong> Table A.4.<br />
Table A.4: Dimension names for the Wine data set [53].<br />
original | renamed<br />
Alcohol | dim1<br />
Malic acid | dim2<br />
Ash | dim3<br />
Alcalinity of ash | dim4<br />
Magnesium | dim5<br />
Total phenols | dim6<br />
Flavanoids | dim7<br />
Nonflavanoid phenols | dim8<br />
Proanthocyanins | dim9<br />
Color intensity | dim10<br />
Hue | dim11<br />
OD280/OD315 of diluted wines | dim12<br />
Proline | dim13<br />
cluster ID | dim14 (class)
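The renaming convention of Tables A.3 and A.4 can be sketched in a few lines of Python. This is only an illustration, assuming the columns are supplied in table order; the helper name `rename_dimensions` and its `class_name` parameter are hypothetical and do not come from the thesis.

```python
def rename_dimensions(original_names, class_name):
    """Map original column names to dim1..dimN and mark the class column."""
    mapping = {}
    for index, name in enumerate(original_names, start=1):
        label = f"dim{index}"
        if name == class_name:
            label += " (class)"  # same marker as used in the tables
        mapping[name] = label
    return mapping


# Column names of the Wine data set in the order of Table A.4.
wine_columns = [
    "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium",
    "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins",
    "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline",
    "cluster ID",
]
wine_mapping = rename_dimensions(wine_columns, class_name="cluster ID")
```

Keeping the mapping as a dictionary makes it easy to translate a renamed dimension in a plot back to its original attribute name.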
148 Appendix A. Appendix<br />
The Wisconsin Diagnostic Breast Cancer (WDBC) data set can be found at<br />
http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic) and has<br />
the original dimensions enumerated in Table A.5. [131] contains detailed descriptions of<br />
how these features are computed.<br />
Table A.5: Dimension names for the WDBC data set [131].<br />
original | renamed<br />
diagnosis | dim1 (class)<br />
radius; texture; perimeter; area; smoothness (local variation in radius lengths); compactness (perimeter² / area - 1.0); concavity (severity of concave portions of the contour); concave points (number of concave portions of the contour); symmetry; fractal dimension (“coastline approximation” - 1) | dim2-dim31<br />
The mean, standard error, and “worst” or largest (mean of the three largest values) of<br />
these features were computed for each image, resulting in 30 features. For instance, dim2<br />
is Mean Radius, dim12 is Radius SE, dim22 is Worst Radius.
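The layout of the 30 WDBC features can be made explicit with a small lookup. The sketch below is an illustration only: it assumes that within each block of ten dimensions the features follow the order of Table A.5, which the text confirms only for radius (dim2, dim12, dim22). The function name `wdbc_dimension` is hypothetical.

```python
# Feature order as listed in Table A.5 (assumed to hold within each block).
FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
# Three statistics per feature: dim2-dim11, dim12-dim21, dim22-dim31.
STATISTICS = ["mean", "standard error", "worst"]


def wdbc_dimension(k):
    """Return (statistic, feature) for the renamed dimension dim<k>."""
    if k == 1:
        return ("class", "diagnosis")
    if not 2 <= k <= 31:
        raise ValueError("WDBC dimensions run from dim1 to dim31")
    block, offset = divmod(k - 2, len(FEATURES))
    return (STATISTICS[block], FEATURES[offset])
```

For example, `wdbc_dimension(12)` yields the standard error of the radius, matching the "Radius SE" example given in the text.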
A.2 Empirical Study<br />
A.2.1 General Questions Form<br />
On the next two pages we present the introductory form of our study. The participants<br />
were asked to fill in personal information on the first page, and the experiment task was<br />
explained on the second page through an example.<br />
Please note that the study took place at the University of Konstanz; the original forms<br />
were therefore in German and are reproduced here in English translation.<br />
Personal Information<br />
(* Please tick as applicable)<br />
Field of study:<br />
Number of semesters in the subject:<br />
Gender*: male female<br />
Age:<br />
How often have you worked with data and its analysis*?<br />
(e.g., Excel spreadsheets, databases, etc.)<br />
Constantly Often Sometimes Rarely Never<br />
Software used:<br />
How often have you worked with the graphical representation of data*?<br />
(e.g., Excel charts, etc.)<br />
Constantly Often Sometimes Rarely Never<br />
Software used:<br />
Please read the instructions carefully and then work through the following sheet<br />
without interruption.<br />
Imagine you are a wine merchant with a large stock of wine bottles. Your bottles<br />
can be divided into three types of wine (aperitif, liqueur, and table wine).<br />
All bottles have undergone a series of standard analyses that provide information<br />
about their properties, such as alcohol content, hue, etc. The results of these<br />
analyses are shown in 18 scatter plots, each of which plots two properties<br />
(e.g., X–Y) against each other. Each wine bottle is represented by a point in the plot;<br />
the colors of the points stand for the three types of wine. Based on these plots,<br />
you must determine which pair of properties is best suited for distinguishing<br />
the types of wine, as shown in the following example:<br />
[Example scatter plot: Property X against Property Y, with points colored by wine type A, B, and C]<br />
Your task now is to select the plots<br />
that are well suited for distinguishing the types of wine!<br />
Please assign the numbers 1 to 5 to the 5 best plots (1 stands for the<br />
best plot). Enter the numbers in the boxes next to the plots.<br />
The boxes of the other plots may be left empty.<br />
While working on the next page, we ask you to remain quiet and concentrated.<br />
Thank you very much for your participation!
A.2.2 Experiment Form<br />
On the next page we show two examples of the study forms for the participants of the<br />
empirical study described in Section 3.2. Every participant was shown the same plots,<br />
but ordered by a different permutation.
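Assigning every participant a different ordering of the same 18 plots can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the function name `assign_plot_orders`, the fixed seed, and the number of participants in the usage line are all assumptions.

```python
import math
import random


def assign_plot_orders(n_participants, n_plots=18, seed=0):
    """Draw one distinct random permutation of the plots per participant."""
    if n_participants > math.factorial(n_plots):
        raise ValueError("not enough distinct permutations")
    rng = random.Random(seed)  # fixed seed keeps the forms reproducible
    plots = list(range(1, n_plots + 1))
    seen = set()
    orders = []
    while len(orders) < n_participants:
        order = tuple(rng.sample(plots, k=n_plots))
        if order not in seen:  # enforce "a different permutation" each time
            seen.add(order)
            orders.append(order)
    return orders


# Hypothetical usage: forms for five participants.
orders = assign_plot_orders(n_participants=5)
```

Randomizing the order in this way reduces the chance that a plot's position on the form, rather than its content, influences how often it is chosen.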
Assign the numbers 1 to 5 to the 5 best plots (1 for the best). [Version A]<br />
Figure A.1: Empirical study experiment form version A.
Assign the numbers 1 to 5 to the 5 best plots (1 for the best). [Version B]<br />
Figure A.2: Empirical study experiment form version B.
A.2.3 Additional Experiment Results<br />
These plots were never ranked by any user among the five best (ranks 1 to 5) of the<br />
18 plots presented.<br />
Figure A.3: The eight projections that were never selected by a user for ranks 1 to 5<br />
in terms of separability of classes among the 18 presented plots.
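Identifying the plots that no participant ranked amounts to a set difference between all presented plots and the plots appearing on any form. The sketch below illustrates this with made-up answers; the function name `never_selected` and the two example forms are hypothetical and are not the study data.

```python
def never_selected(all_plots, rankings):
    """Return the plots that received no rank from any participant."""
    chosen = set()
    for ranking in rankings:
        chosen.update(ranking.keys())  # keys are plot numbers, values are ranks
    return sorted(set(all_plots) - chosen)


# Hypothetical answers of two participants: {plot number: rank 1..5}.
rankings = [
    {3: 1, 7: 2, 1: 3, 12: 4, 5: 5},
    {7: 1, 3: 2, 9: 3, 1: 4, 18: 5},
]
unselected = never_selected(range(1, 19), rankings)
```

With the real forms, the same computation would return the eight plot numbers shown in Figure A.3.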
A.3 Quality Metrics Pipelines for the Literature Review<br />
Here we attach the quality metrics pipelines for all papers from the taxonomy<br />
presented in Section 4.1 and summarized in Table 4.2 that are not part of the examples<br />
of this section. They are listed in the same order in which the papers are presented in the<br />
taxonomy’s table.<br />
[Pipeline diagram: Quality-Metrics-Driven Automation over the stages Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views; marked steps: B, A]<br />
Figure A.4: Pipeline for “A Projection Pursuit Algorithm for Exploratory Data Analysis” by Friedman<br />
and Tukey [54]: (A) different 2D linear, but not axis-parallel, data projections are computed<br />
and evaluated by the quality metric; (B) the best projection direction is chosen by the quality<br />
metric, called the “usefulness” index, which measures the quality of a projection axis; the<br />
projection direction is varied so that the index is maximized.<br />
[Pipeline diagram: Quality-Metrics-Driven Automation over the stages Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views; marked steps: B, A, C]<br />
Figure A.5: Pipeline for “A Rank-by-Feature Framework for Interactive Exploration of Multidimensional<br />
Data” by Seo and Shneiderman [126]: (A) projections are generated, and each 1D and<br />
2D projection is evaluated/ranked by a quality metric selected by the user; (B) the best projections<br />
are presented; (C) ranking scores are presented in a color-coded grid (“Score Overview”), as well as in a<br />
color-coded “Ordered List” for each projection. The user selects one view in the list or grid, and<br />
can also change dimension axes, whereupon the view adapts. Please note: here we have a visualization<br />
of dimensions and quality metric scores that is highly interactive, rather than a static projection<br />
of data records.
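Step (A) of this pipeline, scoring every 2D projection with a user-selected metric and sorting the result, can be sketched as below. This is an illustration under stated assumptions: the absolute Pearson correlation is used only as a stand-in metric, and the names `pearson`, `rank_projections`, and the toy `columns` data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def rank_projections(columns, metric):
    """Score every pair of dimensions and sort from best to worst."""
    scored = [
        ((a, b), metric(columns[a], columns[b]))
        for a, b in combinations(sorted(columns), 2)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: dim2 is a multiple of dim1, dim3 alternates in sign.
columns = {
    "dim1": [1.0, 2.0, 3.0, 4.0],
    "dim2": [2.0, 4.0, 6.0, 8.0],
    "dim3": [1.0, -1.0, 1.0, -1.0],
}
ranking = rank_projections(columns, lambda x, y: abs(pearson(x, y)))
```

Passing the metric as a function mirrors the idea that the user chooses the ranking criterion; any other score over two columns could be substituted.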
[Pipeline diagram: Quality-Metrics-Driven Automation over the stages Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views; marked steps: A, B]<br />
Figure A.6: Pipeline for “Finding and Visualizing Relevant Subspaces for Clustering High-Dimensional<br />
Astronomical Data Using Connected Morphological Operators” by Ferdosi et al. [52]:<br />
(A) generation of projections, all above 3D are reduced with PCA; the user can change the smoothing<br />
parameter, which influences the number of projections; (B) each view is evaluated; the user can<br />
select the view to inspect.<br />
[Pipeline diagram: Quality-Metrics-Driven Automation over the stages Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views; marked steps: A, C, B]<br />
Figure A.7: Pipeline for “Graph-Theoretic Scagnostics” by Wilkinson et al. [151]: (A) generation<br />
of projections; (B) all 2D views are ranked by several metrics; (C) once the metrics have been<br />
computed, they are used to create the SPLOM (rows and columns are the metrics) - projections<br />
are mapped as data points.<br />
[Pipeline diagram: Quality-Metrics-Driven Automation over the stages Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views; marked steps: A, C, B]<br />
Figure A.8: Pipeline for “Selecting good views of high-dimensional data using class consistency”<br />
by Sips et al. [129]: (A) all 2D projections are ranked with the quality metric; (B) each view is<br />
associated with a quality metric computed in A; (C) view transformation decides which scatterplots<br />
to highlight or fade out, depending on the quality values and the set threshold.
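Step (C), deciding per view whether it is highlighted or faded out, reduces to comparing each quality value against the threshold. The following sketch shows only that comparison; the function name `view_states`, the use of "greater than or equal" at the threshold, and the example quality values are assumptions, not details from the cited paper.

```python
def view_states(quality_by_view, threshold):
    """Mark each view as highlighted or faded out based on its quality value."""
    return {
        view: "highlight" if quality >= threshold else "fade"
        for view, quality in quality_by_view.items()
    }


# Hypothetical quality values for three scatterplots.
quality = {"dim1-dim2": 0.91, "dim1-dim3": 0.42, "dim2-dim3": 0.75}
states = view_states(quality, threshold=0.7)
```

Because the quality values are computed once in step (A), moving the threshold only requires re-running this cheap comparison.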
[Pipeline diagram: Quality-Metrics-Driven Automation over the stages Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views; marked steps: B, A, A, C; user interactions: 1, 2, 3]<br />
Figure A.9: Pipeline for “Coordinating computational and visual approaches for interactive feature<br />
selection and multivariate clustering” by Guo [59]: (A) all 2D projections are evaluated with the<br />
“minimum conditional entropy (MCE)”; (B) original dimensions are clustered to find an ordering<br />
according to their MCE value; (C) matrix ordered according to dimension clustering. The user<br />
can 1) select, add to, or subtract from a variable subset that is analyzed further; 2) move the<br />
threshold bar for the connecting edges, and clusters are automatically extracted and colored; 3)<br />
interact to link, brush and select elements.<br />
[Pipeline diagram: Quality-Metrics-Driven Automation over the stages Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views; marked steps: A, B, C]<br />
Figure A.10: Pipeline for “Exploring High-D Spaces with Multiform Matrices and Small Multiples”<br />
by MacEachren et al. [98]: (A) automatic selection of potentially interesting subspaces of variables;<br />
the user can also manually select subspaces; (B) all 2D plots are ranked with a quality metric<br />
(conditional entropy based); (C) the matrix view is colored and ordered according to the quality<br />
metric value. The user can select a dimension subset to be visualized with other visualization<br />
techniques.<br />
[Pipeline diagram: Quality-Metrics-Driven Automation over the stages Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views; marked steps: A, B]<br />
Figure A.11: Pipeline for “Improving the Visual Analysis of High-dimensional Datasets Using<br />
Quality Measures” by Albuquerque et al. [8] for Jigsaw Maps: (A) mapping of dimensions to 2D<br />
displays; (B) all 2D plots are ranked with a quality metric to select the best.<br />
[Pipeline diagram: Quality-Metrics-Driven Automation over the stages Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views; marked steps: B, A]<br />
Figure A.12: Pipeline for “Improving the Visual Analysis of High-dimensional Datasets Using<br />
Quality Measures” by Albuquerque et al. [8] for RadVis: (A) all views are ranked with a quality<br />
metric; (B) dimensions are ordered according to quality values.
[Pipeline diagram: Quality-Metrics-Driven Automation over the stages Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views; marked steps: A, C, B]<br />
Figure A.13: Pipeline for “Improving the Visual Analysis of High-dimensional Datasets Using<br />
Quality Measures” by Albuquerque et al. [8] for Table Lens: (A) quality metric is computed on<br />
the data; (B) user can select an area, marking dimensions and records; the view is then transformed<br />
according to the user interaction; (C) colors are mapped according to the quality metric values<br />
for outliers and correlation.<br />
[Pipeline diagram: Quality-Metrics-Driven Automation over the stages Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views; marked steps: B, A]<br />
Figure A.14: Pipeline for “Pargnostics: Screen-Space Metrics for Parallel Coordinates” by Dasgupta<br />
and Kosara [43]: (A) all 2D views are evaluated according to the metrics; (B) the best pairs are<br />
selected to compute the best ordering of dimensions. The user can also influence this decision by<br />
selecting interesting plots.
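How pairwise scores from step (A) can drive an ordering of dimensions in step (B) is sketched below with a simple greedy heuristic. This is explicitly not the optimization used in the cited paper; it is a stand-in that starts from the best-scoring pair and repeatedly attaches the dimension that scores best next to either end. The function name `greedy_axis_order` and the example scores are hypothetical.

```python
def greedy_axis_order(dimensions, pair_score):
    """Order axes so that adjacent pairs tend to have high scores (greedy)."""
    def score(a, b):
        return pair_score[frozenset((a, b))]

    # Start with the single best-scoring pair of dimensions.
    first, second = max(
        ((a, b) for a in dimensions for b in dimensions if a < b),
        key=lambda pair: score(*pair),
    )
    order = [first, second]
    remaining = set(dimensions) - {first, second}
    while remaining:
        candidates = []
        for dim in sorted(remaining):
            candidates.append((score(order[0], dim), "front", dim))
            candidates.append((score(order[-1], dim), "back", dim))
        _, side, dim = max(candidates, key=lambda c: c[0])
        if side == "front":
            order.insert(0, dim)
        else:
            order.append(dim)
        remaining.remove(dim)
    return order


# Hypothetical pairwise quality scores for four dimensions.
scores = {
    frozenset(("a", "b")): 0.9, frozenset(("b", "c")): 0.8,
    frozenset(("c", "d")): 0.7, frozenset(("a", "c")): 0.1,
    frozenset(("a", "d")): 0.2, frozenset(("b", "d")): 0.3,
}
order = greedy_axis_order(["a", "b", "c", "d"], scores)
```

A greedy chain is cheap but can miss the globally best ordering; an exact search over orderings would be needed to guarantee optimality.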
[Pipeline diagram: Quality-Metrics-Driven Automation over the stages Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views; marked steps: B, A]<br />
Figure A.15: Pipeline for “Combining automated analysis and visualization techniques for effective<br />
exploration of high-dimensional data” by Tatu et al. [133] for HDM: (A) all 2D data tables are<br />
evaluated according to the 1D-HDM; (B) create the best nD visible on the 2D plot (with PCA),<br />
evaluated by the 2D-HDM.<br />
[Pipeline diagram: Quality-Metrics-Driven Automation over the stages Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views; marked steps: A, C, B]<br />
Figure A.16: Pipeline for “High-Dimensional Visual Analytics: Interactive Exploration Guided by<br />
Pairwise Views of Point Distributions” by Wilkinson et al. [152]: (A) generation of projections;<br />
(B) all 2D views are evaluated according to the quality metric; (C) a sorted/highlighted view is created<br />
using the metrics. The user can navigate through the ranked list, and sort and highlight plots in<br />
this and the SPLOM view.
[Pipeline diagram: Quality-Metrics-Driven Automation over the stages Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views; marked steps: A, C, B]<br />
Figure A.17: Pipeline for “Clutter Reduction in Multi-Dimensional Data Visualization Using Dimension<br />
Reordering” by Peng et al. [112]: (A) quality metric is computed on the data; (B) quality<br />
metric calculated also dependent on the visual abstraction; (C) best visual mapping (ordering)<br />
decided based on metric values.<br />
[Pipeline diagram: Quality-Metrics-Driven Automation over the stages Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views; marked steps: A, C, B]<br />
Figure A.18: Pipeline for “Similarity Clustering of Dimensions for an Enhanced Visualization<br />
of Multidimensional Data” by Ankerst et al. [9]: (A) quality metric is computed on the data;<br />
(B) quality metric calculated also dependent on the visual abstraction; (C) best visual mapping<br />
(ordering) decided based on metric values.<br />
[Pipeline diagram. Stages: Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views. Quality-Metrics-Driven Automation markers: B, A, A.]<br />
Figure A.19: Pipeline for “Quality Metrics for 2D Scatterplot Graphics: Automatically Reducing<br />
Visual Clutter” by Bertini and Santucci [24]: (A) a quality metric is computed on the data density<br />
and the screen density, and the two are compared; (B) projection and sampling are based on the metric values.<br />
[Pipeline diagram. Stages: Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views. Quality-Metrics-Driven Automation markers: B, A, A.]<br />
Figure A.20: Pipeline for “A Screen Space Quality Method for Data Abstraction” by Johansson<br />
and Cooper [80]: (A) the sampled and original data tables are associated with a quality metric computed<br />
on the views of the sampled and original data; (B) the values are used to decide upon the sampling<br />
rate.
A.3. Quality Metrics Pipelines for the Literature Review 161<br />
[Pipeline diagram. Stages: Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views. Quality-Metrics-Driven Automation markers: B, A.]<br />
Figure A.21: Pipeline for “Enabling Automatic Clutter Reduction in Parallel Coordinate Plots” by<br />
Ellis and Dix [48]: (A) pixel occlusion is measured in the view space; the user can move a window<br />
(lens), and sampling and occlusion measurement are done only in this window; (B) the values of the<br />
quality metric are used to decide upon the sampling rate.<br />
[Pipeline diagram. Stages: Source Data, Data Transformation, Transformed Data, Visual Mapping, Visual Structures, View Transformation, Rendering, Views. Quality-Metrics-Driven Automation markers: A, D, C, B.]<br />
Figure A.22: Pipeline for “Pixnostics: Towards Measuring the Value of Visualization” by Schneidewind<br />
et al. [120]: (A) a subset of dimensions is selected with standard mining techniques; (B)<br />
alternative mappings of the selected data are evaluated in the screen space; (C) and (D) based on the<br />
quality value, the best subset and mapping are determined. The user can also decide to fix the mapping of some<br />
data features to visual features manually.
A.4 Hierarchical Grouping of Interesting Subspaces<br />
[Dendrogram: hierarchical agglomerative grouping, synthetic dataset. Axis labels: Subspaces; Distance (Similarity), with axis ticks from 0 to 30. Each leaf is one subspace, labeled by its dimension letters (for example CFGIJK, CDGIK, BCDF, CGHIL).]<br />
Figure A.23: Hierarchical agglomerative grouping of the 296 interesting subspaces. The red line<br />
shows the threshold for 6 groups shown in the subspace group view. Each group is marked by a<br />
colored rectangle. The colors are maintained in Figure 5.14.
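The grouping summarized in Figure A.23 can be illustrated with a small sketch. It is not the implementation used for the figure: the Jaccard distance on dimension sets, the average linkage, and the function names `jaccard_distance`, `average_linkage`, and `group_subspaces` are all assumptions made for illustration. Stopping the merging at a fixed number of groups plays the role of the threshold line in the dendrogram.

```python
from itertools import combinations


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Assumed subspace distance: 1 - |a ∩ b| / |a ∪ b| on dimension sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(c1: list, c2: list) -> float:
    """Mean pairwise distance between the members of two clusters."""
    total = sum(jaccard_distance(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def group_subspaces(names: list, n_groups: int) -> list:
    """Merge the two closest clusters until only n_groups remain."""
    # Each subspace name such as "CFGH" becomes the dimension set {C, F, G, H}.
    clusters = [[frozenset(name)] for name in names]
    while len(clusters) > n_groups:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    # Return each group as a sorted list of subspace names.
    return [sorted("".join(sorted(s)) for s in c) for c in clusters]


if __name__ == "__main__":
    # Hypothetical toy input, not the 296 subspaces of the figure.
    for group in group_subspaces(["CFG", "CFGH", "BD", "BF", "JK", "KL"], 3):
        print(group)
```

This naive version recomputes all pairwise linkages in every step, which is fine for a few hundred subspaces but would be replaced by a library routine for larger inputs.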
Bibliography<br />
[1] ggplot2. http://had.co.nz/ggplot2/.<br />
[2] Protovis. http://vis.stanford.edu/protovis/.<br />
[3] Tableau. http://www.tableausoftware.com/.<br />
[4] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park. Fast algorithms for<br />
projected clustering. In Proceedings of the ACM SIGMOD International Conference on<br />
Management of Data (SIGMOD ’99), pages 61–72. ACM, 1999.<br />
[5] C. C. Aggarwal and P. S. Yu. Redefining clustering for high-dimensional applications. IEEE<br />
Transactions on Knowledge and Data Engineering, 14(2):210–225, 2002.<br />
[6] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic Subspace Clustering of<br />
High Dimensional Data for Data Mining Applications. In Proceedings of the ACM SIGMOD<br />
International Conference on Management of Data (SIGMOD ’98), volume 27, pages 94–105.<br />
ACM, 1998.<br />
[7] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items<br />
in Large Databases. In Proceedings of the ACM SIGMOD International Conference on<br />
Management of Data (SIGMOD ’93), pages 207–216. ACM, 1993.<br />
[8] G. Albuquerque, M. Eisemann, D. J. Lehmann, H. Theisel, and M. Magnor. Improving the<br />
Visual Analysis of High-dimensional Datasets Using Quality Measures. In Proceedings of<br />
the IEEE Symposium on Visual Analytics Science and Technology (VAST ’10), pages 19–26.<br />
IEEE CS Press, 2010.<br />
[9] M. Ankerst, S. Berchtold, and D. A. Keim. Similarity clustering of dimensions for an enhanced<br />
visualization of multidimensional data. In Proceedings of the IEEE Symposium Information<br />
Visualization (InfoVis ’98), pages 52–60. IEEE CS Press, 1998.<br />
[10] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. Optics: ordering points to identify<br />
the clustering structure. In Proceedings of the ACM SIGMOD International Conference on<br />
Management of Data (SIGMOD ’99), pages 49–60. ACM, 1999.<br />
[11] M. Ankerst, M. Ester, and H. P. Kriegel. Towards an effective cooperation of the user and the<br />
computer for classification. In Proceedings of the ACM SIGKDD International Conference<br />
on Knowledge Discovery and Data Mining (KDD ’00), pages 179–188, 2000.<br />
[12] D. L. Applegate, R. E. Bixby, V. Chvatal, and W. J. Cook. The Traveling Salesman Problem:<br />
A Computational Study (Princeton Series in Applied Mathematics). Princeton University<br />
Press, 2007.<br />
[13] D. Asimov. The Grand Tour: A Tool for Viewing Multidimensional Data. Journal on<br />
Scientific and Statistical Computing, 6(1):128–143, 1985.<br />
[14] I. Assent, R. Krieger, E. Müller, and T. Seidl. VISA: Visual Subspace Clustering Analysis.<br />
ACM SIGKDD Explorations Newsletter - Special Issue on Visual Analytics, 9(2):5–12, 2007.<br />
[15] R. A. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press/Addison-<br />
Wesley, 1999.<br />
[16] C. Baumgartner, C. Plant, K. Kailing, H.-P. Kriegel, and P. Kröger. Subspace selection for<br />
clustering high-dimensional data. In Proceedings of the Fourth IEEE Conference on Data<br />
Mining (ICDM ’04), pages 11–18. IEEE CS Press, 2004.<br />
[17] R. Becker and W. Cleveland. Brushing scatterplots. Technometrics, 29:127–142, 1987.<br />
[18] R. A. Becker, W. S. Cleveland, and M.-J. Shyu. The visual design and control of trellis<br />
display. Journal of Computational and Graphical Statistics, 5(2):123–155, 1996.
[19] B. B. Bederson, J. D. Hollan, K. Perlin, J. Meyer, D. Bacon, and G. Furnas. Pad++: A<br />
Zoomable Graphical Sketchpad For Exploring Alternate Interface Physics. Journal of Visual<br />
Languages & Computing, 7(1):3–32, 1996.<br />
[20] R. Bellman. Dynamic Programming. Princeton University Press, 1st edition, 1957.<br />
[21] P. Berkhin. A Survey of Clustering Data Mining Techniques. Grouping Multidimensional<br />
Data, pages 25–71, 2006.<br />
[22] J. Bertin. Semiology of graphics. University of Wisconsin Press, 1983.<br />
[23] E. Bertini and D. Lalanne. Investigating and reflecting on the integration of automatic data<br />
analysis and visualization in knowledge discovery. ACM SIGKDD Explorations Newsletter,<br />
11:9–18, 2010.<br />
[24] E. Bertini and G. Santucci. Quality Metrics for 2D Scatterplot Graphics: Automatically<br />
Reducing Visual Clutter. In Proceedings Smart Graphics (SG), volume 3031, pages 77–89,<br />
2004.<br />
[25] E. Bertini and G. Santucci. Give chance a chance: modeling density to enhance scatter plot<br />
quality through random data sampling. Information Visualization, 5(2):95–110, 2006.<br />
[26] E. Bertini and G. Santucci. Visual Quality Metrics. In Proceedings of the 2006 AVI workshop<br />
on BEyond time and errors: noveL evaluation methods for Information Visualization<br />
(BELIV), pages 1–5. ACM, 2006.<br />
[27] E. Bertini, A. Tatu, and D. A. Keim. Quality Metrics in High-Dimensional Data Visualization:<br />
An Overview and Systematization. Proceedings of the IEEE Symposium on Information<br />
Visualization (InfoVis ’11), 17(12):2203–2212, 2011.<br />
[28] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When Is “Nearest Neighbor” Meaningful?<br />
In Proceedings of the 7th International Conference on Database Theory (ICDT ’99),<br />
pages 217–235, 1999.<br />
[29] E. A. Bier, M. C. Stone, K. Pier, K. Fishkin, T. Baudel, M. Conway, W. Buxton, and<br />
T. DeRose. Toolglass and Magic Lenses: The See-Through Interface. In Conference Companion<br />
on Human Factors in Computing Systems (CHI ’94), pages 445–446. ACM, 1994.<br />
[30] T. Boogaerts, L.-C. Tranchevent, G. A. Pavlopoulos, J. Aerts, and J. Vandewalle. Visualizing<br />
high dimensional datasets using parallel coordinates: Application to gene prioritization. In<br />
IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE ’12), pages<br />
52–57. IEEE CS Press, 2012.<br />
[31] I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications.<br />
Springer, 2005.<br />
[32] R. Brath. Metrics for effective information visualization. In Proceedings of the IEEE Symposium<br />
Information Visualization (InfoVis ’97), pages 108–111, 1997.<br />
[33] S. Bremm, T. v. Landesberger, J. Bernard, and T. Schreck. Assisted descriptor selection<br />
based on visual comparative data analysis. Computer Graphics Forum, 30(3):891–900, 2011.<br />
[34] S. Bremm, T. v. Landesberger, M. Heß, T. Schreck, P. Weil, and K. Hamacher. Interactive<br />
visual comparison of multiple trees. In Proceedings of IEEE Symposium on Visual Analytics<br />
Science and Technology (VAST ’11), pages 31–40. IEEE CS Press, 2011.<br />
[35] N. Cao, D. Gotz, J. Sun, and H. Qu. DICON: Interactive Visual Analysis of Multidimensional<br />
Clusters. IEEE Transactions on Visualization and Computer Graphics (TVCG ’11),<br />
17:2581–2590, 2011.<br />
[36] S. K. Card, J. D. Mackinlay, and B. Shneiderman. Readings in information visualization:<br />
using vision to think. Morgan Kaufmann Publishers Inc., 1999.
[37] D. B. Carr, R. J. Littlefield, and W. L. Nicholson. Scatterplot Matrix Techniques for Large<br />
N. In Proceedings of the Seventeenth Symposium on the Interface of Computer Sciences and<br />
Statistics on Computer Science and Statistics, pages 297–306. Elsevier North-Holland, Inc.,<br />
1986.<br />
[38] C.-H. Cheng, A. W. Fu, and Y. Zhang. Entropy-based subspace clustering for mining numerical<br />
data. In Proceedings of the fifth ACM SIGKDD International Conference on Knowledge<br />
Discovery and Data Mining (KDD ’99), pages 84–93. ACM, 1999.<br />
[39] E. H. Chi. A Taxonomy of Visualization Techniques Using the Data State Reference Model.<br />
In Proceedings of the IEEE Symposium on Information Visualization (InfoVis ’00), pages<br />
69–75. IEEE CS Press, 2000.<br />
[40] K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography.<br />
Computational Linguistics, 16(1):22–29, 1990.<br />
[41] T. Cox and M. Cox. Multidimensional Scaling. Chapman & Hall, 1994.<br />
[42] Q. Cui, M. Ward, E. Rundensteiner, and J. Yang. Measuring Data Abstraction Quality in<br />
Multiresolution Visualizations. IEEE Transactions on Visualization and Computer Graphics<br />
(TVCG ’06), 12:709–716, 2006.<br />
[43] A. Dasgupta and R. Kosara. Pargnostics: Screen-Space Metrics for Parallel Coordinates.<br />
IEEE Transactions on Visualization and Computer Graphics (TVCG ’10), 16:1017–1026,<br />
2010.<br />
[44] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley-Interscience, 2nd edition,<br />
2001.<br />
[45] C. Dunne and B. Shneiderman. Improving graph drawing readability by incorporating readability<br />
metrics: A software tool for network analysts. Technical Report HCIL-2009-13, University<br />
of Maryland, 2009.<br />
[46] S. G. Eick and G. J. Wills. High Interaction Graphics. European Journal of Operational<br />
Research, 81(3):445–459, 1995.<br />
[47] M. Eisen, P. Spellman, P. Brown, and D. Botstein. Cluster analysis and display of genome-wide<br />
expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–<br />
14868, 1998.<br />
[48] G. Ellis and A. Dix. Enabling Automatic Clutter Reduction in Parallel Coordinate Plots.<br />
IEEE Transactions on Visualization and Computer Graphics (TVCG ’06), 12:717–724, 2006.<br />
[49] G. Ellis and A. Dix. A taxonomy of clutter reduction for information visualisation. IEEE<br />
Transactions on Visualization and Computer Graphics (TVCG ’07), 13:1216–1223, 2007.<br />
[50] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering<br />
clusters in large spatial databases with noise. In Proceedings of the Second ACM SIGKDD<br />
International Conference on Knowledge Discovery and Data Mining (KDD ’96), pages 226–<br />
231. AAAI Press, 1996.<br />
[51] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful<br />
knowledge from volumes of data. Communications of the ACM, 39:27–34, 1996.<br />
[52] B. J. Ferdosi, H. Buddelmeijer, S. Trager, M. H. F. Wilkinson, and J. B. T. M. Roerdink.<br />
Finding and visualizing relevant subspaces for clustering high-dimensional astronomical data<br />
using connected morphological operators. In Proceedings of the IEEE Symposium on Visual<br />
Analytics Science and Technology (VAST ’10), pages 35–42. IEEE CS Press, 2010.<br />
[53] A. Frank and A. Asuncion. University of California Irvine (UCI) Machine Learning Repository,<br />
2010.<br />
[54] J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis.<br />
IEEE Transactions on Computers, 23:881–890, 1974.
[55] Y.-H. Fua, M. Ward, and E. Rundensteiner. Hierarchical parallel coordinates for exploration<br />
of large data sets. In Proceedings of the Conference on Visualization (VIS ’99), pages 43–50.<br />
IEEE CS Press, 1999.<br />
[56] K. Fukunaga. Introduction to statistical pattern recognition. Academic Press Professional,<br />
Inc., 2nd edition, 1990.<br />
[57] S. Guha, R. Rastogi, and K. Shim. Cure: an efficient clustering algorithm for large databases.<br />
In Proceedings of the ACM SIGMOD International Conference on Management of Data<br />
(SIGMOD ’98), pages 73–84. ACM, 1998.<br />
[58] S. Günnemann, E. Müller, I. Färber, and T. Seidl. Detection of orthogonal concepts in subspaces<br />
of high dimensional data. In Proceedings of the 18th ACM conference on Information<br />
and knowledge management (CIKM ’09), pages 1317–1326, 2009.<br />
[59] D. Guo. Coordinating computational and visual approaches for interactive feature selection<br />
and multivariate clustering. Information Visualization, 2(4):232–246, 2003.<br />
[60] D. Guo, J. Chen, A. M. MacEachren, and K. Liao. A visualization system for space-time<br />
and multivariate patterns (vis-stamp). IEEE Transactions on Visualization and Computer<br />
Graphics (TVCG ’06), 12(6):1461–1474, 2006.<br />
[61] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of<br />
Machine Learning Research - Special Issue on Variable and Feature Selection, (3):1157–1182,<br />
2003.<br />
[62] M. Hahsler, K. Hornik, and C. Buchta. Getting things in order: An introduction to the R<br />
package seriation. Journal of Statistical Software, 25(3):1–34, 2008.<br />
[63] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers<br />
Inc., 1st edition, 2000.<br />
[64] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers<br />
Inc., 2nd edition, 2006.<br />
[65] S. Haroz and K.-L. Ma. Natural visualization. In Proceedings of Eurographics Visualization<br />
Symposium, pages 43–50, 2006.<br />
[66] P. E. Hart, N. Nilsson, and B. Raphael. A Formal Basis for the Heuristic Determination of<br />
Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107,<br />
1968.<br />
[67] C. G. Healey, K. S. Booth, and J. T. Enns. High-speed visual estimation using preattentive<br />
processing. ACM Transactions on Computer-Human Interaction (TOCHI ’96), 3(2):107–135,<br />
1996.<br />
[68] C. G. Healey and J. T. Enns. Building perceptual textures to visualize multidimensional<br />
datasets. In Proceedings of the Conference on Visualization (VIS ’98), pages 111–118. IEEE<br />
CS Press, 1998.<br />
[69] A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in high<br />
dimensional spaces? In Proceedings of the 26th International Conference on Very Large<br />
Data Bases (VLDB ’00), pages 506–515. Morgan Kaufmann Publishers Inc., 2000.<br />
[70] A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Large Multimedia<br />
Databases with Noise. In Proceedings 4th International Conference on Knowledge Discovery<br />
in Databases (KDD ’98), pages 58–65, 1998.<br />
[71] A. Hinneburg and D. A. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality<br />
in high-dimensional clustering. In Proceedings of the 25th International Conference<br />
on Very Large Data Bases (VLDB ’99), pages 506–517. Morgan Kaufmann Publishers Inc.,<br />
1999.
[72] P. Hoffman, G. Grinstein, and D. Pinkney. Dimensional anchors: a graphic primitive for<br />
multidimensional multivariate information visualizations. In Proceedings Workshop on New<br />
Paradigms in Information Visualization and Manipulation (NPIVM ’99), pages 9–16.<br />
[73] P. V. C. Hough. Method and means for recognizing complex patterns. US Patent, 3069654,<br />
1962.<br />
[74] P. J. Huber. Projection pursuit. The Annals of Statistics, 13(2):435–475, 1985.<br />
[75] C. B. Hurley and R. W. Oldford. Pairwise display of high-dimensional information via<br />
eulerian tours and hamiltonian decompositions. Journal of Computational and Graphical<br />
Statistics, 19(4):861–886, 2010.<br />
[76] S. Ingram, T. Munzner, V. Irvine, M. Tory, S. Bergner, and T. Möller. DimStiller: Workflows<br />
for dimensional analysis and reduction. In Proceedings of the IEEE Symposium on Visual<br />
Analytics Science and Technology (VAST ’10). IEEE CS Press, 2010.<br />
[77] A. Inselberg. The plane with parallel coordinates. The Visual Computer, 1(4):69–91, 1985.<br />
[78] A. Inselberg and B. Dimsdale. Parallel coordinates: a tool for visualizing multi-dimensional<br />
geometry. In Proceedings of the IEEE Conference on Visualization (VIS ’90). IEEE CS<br />
Press, 1990.<br />
[79] H. Jänicke and M. Chen. A Salience-based Quality Metric for Visualization. Computer<br />
Graphics Forum (Proc. EuroVis), 29(3):1183–1192, 2010.<br />
[80] J. Johansson and M. Cooper. A Screen Space Quality Method for Data Abstraction. Computer<br />
Graphics Forum (Proc. EuroVis), 27(3):1039–1046, 2008.<br />
[81] J. Johansson, C. Forsell, M. Lind, and M. Cooper. Perceiving patterns in parallel coordinates:<br />
determining thresholds for identification of relationships. Information Visualization,<br />
7(2):152–162, 2008.<br />
[82] S. Johansson and J. Johansson. Interactive Dimensionality Reduction Through User-defined<br />
Combinations of Quality Metrics. IEEE Transactions on Visualization and Computer Graphics<br />
(TVCG ’09), 15:993–1000, 2009.<br />
[83] I. T. Jolliffe. Principal Component Analysis. Springer, 2nd edition, 2002.<br />
[84] K. Kailing, H.-P. Kriegel, P. Kröger, and S. Wanka. Ranking interesting subspaces for clustering<br />
high dimensional data. In Proceedings of the 7th European Conference on Principles<br />
and Practice of Knowledge Discovery in Databases (PKDD ’03), pages 241–252, 2003.<br />
[85] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster<br />
Analysis. Wiley-Interscience, 9th edition, 1990.<br />
[86] D. A. Keim, M. Ankerst, and M. Sips. Visual Data-Mining Techniques, pages 813–825.<br />
Kolam Publishing, 2004.<br />
[87] D. A. Keim, M. C. Hao, U. Dayal, and M. Hsu. Pixel bar charts: A visualization technique<br />
for very large multi-attribute data sets. Information Visualization, 1(1):20–34, 2002.<br />
[88] D. A. Keim, F. Mansmann, J. Schneidewind, J. Thomas, and H. Ziegler. Visual analytics:<br />
Scope and challenges. In S. J. Simoff, M. H. Böhlen, and A. Mazeika, editors, Visual Data<br />
Mining: Theory, Techniques and Tools for Visual Analytics, pages 76–90. Springer-Verlag,<br />
2008.<br />
[89] Y. Koren and L. Carmel. Visualization of labeled data using linear transformations. Proceedings<br />
of the IEEE Symposium on Information Visualization (InfoVis ’03), 0:16, 2003.<br />
[90] H.-P. Kriegel, P. Kröger, and A. Zimek. Clustering high-dimensional data: A survey on<br />
subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions<br />
on Knowledge Discovery from Data (TKDD ’09), 3(1):1–58, 2009.
[91] J. LeBlanc, M. O. Ward, and N. Wittels. Explor<strong>in</strong>g N-dimensional databases. In Proceed<strong>in</strong>gs<br />
<strong>of</strong> the IEEE Conference on <strong>Visual</strong>ization (VIS ’90). IEEE CS Press, 1990.<br />
[92] Y. K. Leung and M. D. Apperley. A review and taxonomy <strong>of</strong> distortion-oriented presentation<br />
techniques. ACM Transactions on Computer-Human Interaction, 1(2):126–160, 1994.<br />
[93] A. Lex, M. Streit, C. Partl, and D. Schmalstieg. Comparative analysis <strong>of</strong> multidimensional,<br />
quantitative data. IEEE Transactions on <strong>Visual</strong>ization and Computer Graphics (TVCG ’10),<br />
16(6):1027–1035, 2010.<br />
[94] J. Li, J.-B. Martens, and J. J. van Wijk. Judg<strong>in</strong>g correlation from scatterplots and parallel<br />
coord<strong>in</strong>ate plots. Information <strong>Visual</strong>ization, 9(1):13–30, 2008.<br />
[95] M. A. Little, P. E. McSharry, E. J. Hunter, and L. O. Ramig. Suitability <strong>of</strong> dysphonia<br />
measurements for telemonitor<strong>in</strong>g <strong>of</strong> Park<strong>in</strong>son’s disease. In IEEE Transactions on Biomedical<br />
Eng<strong>in</strong>eer<strong>in</strong>g, pages 1015–1022, 2009.<br />
[96] M. A. Little, P. E. McSharry, S. J. Roberts, D. A. E. Costello, and I. M. Moroz. Exploit<strong>in</strong>g<br />
nonl<strong>in</strong>ear recurrence and fractal scal<strong>in</strong>g properties for voice disorder detection. BioMedical<br />
Eng<strong>in</strong>eer<strong>in</strong>g OnL<strong>in</strong>e, 6(1):23, 2007.<br />
[97] H. Liu and H. Motoda, editors. Computational Methods <strong>of</strong> Feature Selection. Chapman & Hall/CRC,<br />
2008.<br />
[98] A. MacEachren, X. Dai, F. Hardisty, D. Guo, and G. Lengerich. Explor<strong>in</strong>g high-D spaces<br />
with multiform matrices and small multiples. In Proceed<strong>in</strong>gs <strong>of</strong> the IEEE Symposium on<br />
Information <strong>Visual</strong>ization (InfoVis ’03), pages 31–38. IEEE CS Press, 2003.<br />
[99] J. Mack<strong>in</strong>lay. Automat<strong>in</strong>g the design <strong>of</strong> graphical presentations <strong>of</strong> relational <strong>in</strong>formation.<br />
ACM Transactions on Graphics, 5(2):110–141, 1986.<br />
[100] N. Miller, B. Hetzler, G. Nakamura, and P. Whitney. The need for metrics <strong>in</strong> visual <strong>in</strong>formation<br />
analysis. In Proceed<strong>in</strong>gs <strong>of</strong> the Workshop on New Paradigms <strong>in</strong> Information <strong>Visual</strong>ization<br />
and Manipulation. ACM, 1997.<br />
[101] E. Müller, I. Assent, S. Günnemann, R. Krieger, and T. Seidl. Relevant subspace cluster<strong>in</strong>g:<br />
M<strong>in</strong><strong>in</strong>g the most <strong>in</strong>terest<strong>in</strong>g non-redundant concepts <strong>in</strong> high dimensional data. In Proceed<strong>in</strong>gs<br />
<strong>of</strong> the IEEE International Conference on <strong>Data</strong> M<strong>in</strong><strong>in</strong>g (ICDM ’09), pages 377–386, 2009.<br />
[102] E. Müller, S. Günnemann, I. Assent, and T. Seidl. Evaluat<strong>in</strong>g cluster<strong>in</strong>g <strong>in</strong> subspace projections<br />
<strong>of</strong> high dimensional data. In Proceed<strong>in</strong>gs <strong>of</strong> the International Conference on Very<br />
Large <strong>Data</strong> Bases (VLDB ’09), volume 2, pages 1270–1281, 2009.<br />
[103] E. Müller, S. Günnemann, I. Färber, and T. Seidl. Discover<strong>in</strong>g multiple cluster<strong>in</strong>g solutions:<br />
Group<strong>in</strong>g objects <strong>in</strong> different views <strong>of</strong> the data. In Proceed<strong>in</strong>gs <strong>of</strong> the 10th IEEE Conference<br />
on <strong>Data</strong> M<strong>in</strong><strong>in</strong>g (ICDM ’10), page 1220, 2010.<br />
[104] E. Müller, S. Günnemann, I. Färber, and T. Seidl. Discover<strong>in</strong>g multiple cluster<strong>in</strong>g solutions:<br />
Group<strong>in</strong>g objects <strong>in</strong> different views <strong>of</strong> the data. In Tutorial at the 16th Pacific-Asia<br />
Conference on Knowledge Discovery and <strong>Data</strong> M<strong>in</strong><strong>in</strong>g (PAKDD ’12), 2012.<br />
[105] T. Munzner. <strong>Visual</strong>ization (Chapter 27). In Fundamentals <strong>of</strong> Graphics, pages 675–707. AK<br />
Peters, 3rd edition, 2009.<br />
[106] E. Müller, I. Assent, S. Günnemann, T. Jansen, and T. Seidl. Opensubspace: An open<br />
source framework for evaluation and exploration <strong>of</strong> subspace cluster<strong>in</strong>g algorithms <strong>in</strong> weka.<br />
In Proceed<strong>in</strong>gs <strong>of</strong> the 1st Open Source <strong>in</strong> <strong>Data</strong> M<strong>in</strong><strong>in</strong>g Workshop (OSDM ’09) <strong>in</strong> conjunction<br />
with 13th Pacific-Asia Conference on Knowledge Discovery and <strong>Data</strong> M<strong>in</strong><strong>in</strong>g (PAKDD ’09),<br />
pages 2–13, 2009.<br />
[107] D. Niu, J. G. Dy, and M. I. Jordan. Multiple non-redundant spectral cluster<strong>in</strong>g views.<br />
In Proceed<strong>in</strong>gs <strong>of</strong> the 27th International Conference on Mach<strong>in</strong>e Learn<strong>in</strong>g (ICML), pages<br />
831–838. Omnipress, 2010.
[108] C. North. Toward measur<strong>in</strong>g visualization <strong>in</strong>sight. IEEE Computer Graphics and Applications,<br />
26(3):6–9, 2006.<br />
[109] D. Oelke, H. Janetzko, S. Simon, K. Neuhaus, and D. A. Keim. <strong>Visual</strong> Boost<strong>in</strong>g <strong>in</strong> Pixel-based<br />
<strong>Visual</strong>izations. Computer Graphics Forum (Proc. EuroVis), 30(3):871–880, 2011.<br />
[110] L. Parsons, E. Haque, and H. Liu. Subspace Cluster<strong>in</strong>g for <strong>High</strong> <strong>Dimensional</strong> <strong>Data</strong>: A<br />
Review. ACM SIGKDD Explorations Newsletter - Special Issue on Learn<strong>in</strong>g from Imbalanced<br />
<strong>Data</strong>sets, 6(1):90–105, 2004.<br />
[111] F. Paulovich, M. Oliveira, and R. M<strong>in</strong>ghim. The Projection Explorer: A Flexible Tool<br />
for Projection-based Multidimensional <strong>Visual</strong>ization. In Proceed<strong>in</strong>gs <strong>of</strong> the XX Brazilian<br />
Symposium on Computer Graphics and Image Process<strong>in</strong>g (SIBGRAPI ’07), pages 27–36,<br />
Oct. 2007.<br />
[112] W. Peng, M. O. Ward, and E. A. Rundenste<strong>in</strong>er. Clutter Reduction <strong>in</strong> Multi-<strong>Dimensional</strong><br />
<strong>Data</strong> <strong>Visual</strong>ization Us<strong>in</strong>g Dimension Reorder<strong>in</strong>g. In Proceed<strong>in</strong>gs <strong>of</strong> the IEEE Symposium on<br />
Information <strong>Visual</strong>ization (InfoVis ’04), pages 89–96. IEEE CS Press, 2004.<br />
[113] C. Plaisant, J.-D. Fekete, and G. Gr<strong>in</strong>ste<strong>in</strong>. Promot<strong>in</strong>g <strong>in</strong>sight-based evaluation <strong>of</strong> visualizations:<br />
From contest to benchmark repository. IEEE Transactions on <strong>Visual</strong>ization and<br />
Computer Graphics (TVCG ’08), 14(1):120–134, 2008.<br />
[114] W. M. Rand. Objective criteria for the evaluation <strong>of</strong> cluster<strong>in</strong>g methods. Journal <strong>of</strong> the<br />
American Statistical Association, 66(336):846–850, 1971.<br />
[115] R. Rao and S. K. Card. The table lens: merg<strong>in</strong>g graphical and symbolic representations <strong>in</strong><br />
an <strong>in</strong>teractive focus + context visualization for tabular <strong>in</strong>formation. In Proceed<strong>in</strong>gs <strong>of</strong> the<br />
SIGCHI Conference on Human Factors <strong>in</strong> Comput<strong>in</strong>g Systems (CHI ’94). ACM, 1994.<br />
[116] R. A. Rens<strong>in</strong>k and G. Baldridge. The perception <strong>of</strong> correlation <strong>in</strong> scatterplots. Computer<br />
Graphics Forum (Proc. EuroVis), 29(3):1203–1210, 2010.<br />
[117] D. J. Rogers and T. T. Tanimoto. A Computer Program for Classify<strong>in</strong>g Plants. Science,<br />
132(3434):1115–1118, 1960.<br />
[118] R. Rosenholtz, Y. Li, J. Mansfield, and Z. J<strong>in</strong>. Feature congestion: a measure <strong>of</strong> display<br />
clutter. In Proceed<strong>in</strong>gs <strong>of</strong> the SIGCHI Conference on Human Factors <strong>in</strong> Comput<strong>in</strong>g Systems<br />
(CHI ’05), pages 761–770. ACM, 2005.<br />
[119] M. Schaefer, L. Zhang, T. Schreck, A. Tatu, J. A. Lee, M. Verleysen, and D. A. Keim.<br />
Improv<strong>in</strong>g projection-based data analysis by feature space transformations. In Proceed<strong>in</strong>gs<br />
<strong>of</strong> SPIE 8654, <strong>Visual</strong>ization and <strong>Data</strong> Analysis (VDA ’13), volume 8654, pages 86540H–<br />
86540H–15, 2013.<br />
[120] J. Schneidew<strong>in</strong>d, M. Sips, and D. A. Keim. Pixnostics: Towards measur<strong>in</strong>g the value <strong>of</strong> visualization.<br />
In Proceed<strong>in</strong>gs <strong>of</strong> the IEEE Symposium on <strong>Visual</strong> <strong>Analytics</strong> Science and Technology<br />
(VAST ’06), pages 199–206. IEEE CS Press, 2006.<br />
[121] T. Schreck, T. von Landesberger, and S. Bremm. Techniques for precision-based visual<br />
analysis <strong>of</strong> projected data. Palgrave Macmillan Information <strong>Visual</strong>ization, 9(3):181–193,<br />
2010.<br />
[122] M. Sedlmair, A. Tatu, T. Munzner, and M. Tory. A taxonomy <strong>of</strong> visual cluster separation<br />
factors. Computer Graphics Forum (Proc. EuroVis), 31(3):1335–1344, 2012.<br />
[123] E. Segel and J. Heer. Narrative visualization: Tell<strong>in</strong>g stories with data. IEEE Transactions<br />
on <strong>Visual</strong>ization and Computer Graphics (TVCG ’10), 16:1139–1148, 2010.<br />
[124] J. Seo and B. Shneiderman. Interactively explor<strong>in</strong>g hierarchical cluster<strong>in</strong>g results. Computer,<br />
35(7):80–86, 2002.
[125] J. Seo and B. Shneiderman. A rank-by-feature framework for unsupervised multidimensional<br />
data exploration us<strong>in</strong>g low dimensional projections. In Proceed<strong>in</strong>gs <strong>of</strong> IEEE Symposium on<br />
Information <strong>Visual</strong>ization (InfoVis ’04), pages 65–72. IEEE CS Press, 2004.<br />
[126] J. Seo and B. Shneiderman. A rank-by-feature framework for <strong>in</strong>teractive exploration <strong>of</strong><br />
multidimensional data. Information <strong>Visual</strong>ization, 4(2):96–113, 2005.<br />
[127] B. Shneiderman. The Eyes Have It: A Task by <strong>Data</strong> Type Taxonomy for Information<br />
<strong>Visual</strong>izations. In Proceed<strong>in</strong>gs <strong>of</strong> the IEEE Symposium on <strong>Visual</strong> Languages (VL), pages<br />
336–343. IEEE CS Press, 1996.<br />
[128] J. H. Siegel, E. J. Farrell, R. M. Goldwyn, and H. P. Friedman. The surgical implication <strong>of</strong><br />
physiologic patterns <strong>in</strong> myocardial <strong>in</strong>farction shock. Surgery, 72:126–141, 1972.<br />
[129] M. Sips, B. Neubert, J. P. Lewis, and P. Hanrahan. Select<strong>in</strong>g good views <strong>of</strong> high-dimensional<br />
data us<strong>in</strong>g class consistency. Computer Graphics Forum (Proc. EuroVis), 28(3):831–838,<br />
2009.<br />
[130] A. Strauss and J. M. Corb<strong>in</strong>. Basics <strong>of</strong> Qualitative Research: Techniques and Procedures for<br />
Develop<strong>in</strong>g Grounded Theory. SAGE Publications, 1998.<br />
[131] W. Street, W. Wolberg, and O. Mangasarian. Nuclear feature extraction for breast tumor<br />
diagnosis. IS&T / SPIE International Symposium on Electronic Imag<strong>in</strong>g: Science and<br />
Technology, 1905:861–870, 1993.<br />
[132] A. Tatu, G. Albuquerque, M. Eisemann, P. Bak, H. Theisel, M. Magnor, and D. A. Keim.<br />
Automated <strong>Visual</strong> Analysis Methods for an Effective Exploration <strong>of</strong> <strong>High</strong>-<strong>Dimensional</strong> <strong>Data</strong>.<br />
IEEE Transactions on <strong>Visual</strong>ization and Computer Graphics (TVCG ’11), 17(5):584–597,<br />
2011.<br />
[133] A. Tatu, G. Albuquerque, M. Eisemann, J. Schneidew<strong>in</strong>d, H. Theisel, M. Magnor, and<br />
D. Keim. Comb<strong>in</strong><strong>in</strong>g automated analysis and visualization techniques for effective exploration<br />
<strong>of</strong> high dimensional data. Proceed<strong>in</strong>gs <strong>of</strong> the IEEE Symposium on <strong>Visual</strong> <strong>Analytics</strong> Science<br />
and Technology (VAST ’09), pages 59–66, 2009.<br />
[134] A. Tatu, P. Bak, E. Bert<strong>in</strong>i, D. A. Keim, and J. Schneidew<strong>in</strong>d. <strong>Visual</strong> quality metrics and<br />
human perception: an <strong>in</strong>itial study on 2D projections <strong>of</strong> large multidimensional data. In<br />
Proceed<strong>in</strong>gs <strong>of</strong> the Work<strong>in</strong>g Conference on Advanced <strong>Visual</strong> Interfaces (AVI), pages 49–56.<br />
ACM, 2010.<br />
[135] A. Tatu, F. Maaß, I. Färber, E. Bert<strong>in</strong>i, T. Schreck, T. Seidl, and D. Keim. Subspace<br />
Search and <strong>Visual</strong>ization to Make Sense <strong>of</strong> Alternative Cluster<strong>in</strong>gs <strong>in</strong> <strong>High</strong>-<strong>Dimensional</strong><br />
<strong>Data</strong>. Proceed<strong>in</strong>gs <strong>of</strong> the IEEE Symposium on <strong>Visual</strong> <strong>Analytics</strong> Science and Technology<br />
(VAST ’12), pages 63–72, 2012.<br />
[136] A. Tatu, L. Zhang, E. Bert<strong>in</strong>i, T. Schreck, D. A. Keim, S. Bremm, and T. von Landesberger.<br />
ClustNails: <strong>Visual</strong> Analysis <strong>of</strong> Subspace Clusters. Ts<strong>in</strong>ghua Science and Technology, Special<br />
Issue on <strong>Visual</strong>ization and Computer Graphics, 17(4):419–428, 2012.<br />
[137] J. J. Thomas and K. A. Cook. Illum<strong>in</strong>at<strong>in</strong>g the Path: The Research and Development Agenda<br />
for <strong>Visual</strong> <strong>Analytics</strong>. National <strong>Visual</strong>ization and <strong>Analytics</strong> Ctr, 2005.<br />
[138] M. Tory and T. Möller. Reth<strong>in</strong>k<strong>in</strong>g <strong>Visual</strong>ization: A <strong>High</strong>-Level Taxonomy. In Proceed<strong>in</strong>gs<br />
<strong>of</strong> the IEEE Symposium on Information <strong>Visual</strong>ization (InfoVis ’04), pages 151–158. IEEE<br />
CS Press, 2004.<br />
[139] E. R. Tufte. The visual display <strong>of</strong> quantitative <strong>in</strong>formation. Graphics Press, 1986.<br />
[140] J. Tukey and P. Tukey. Computer graphics and exploratory data analysis: An <strong>in</strong>troduction.<br />
Proceed<strong>in</strong>gs <strong>of</strong> the Annual Conference and Exposition: Computer Graphics, 3:773–785, 1985.
[141] S. Vadapalli and K. Karlapalem. Heidi matrix: nearest neighbor driven high dimensional<br />
data visualization. In Proceed<strong>in</strong>gs <strong>of</strong> the ACM SIGKDD Workshop on <strong>Visual</strong> <strong>Analytics</strong> and<br />
Knowledge Discovery, pages 83–92, 2009.<br />
[142] S. van den Elzen and J. J. van Wijk. BaobabView: Interactive Construction and Analysis<br />
<strong>of</strong> Decision Trees. In Proceed<strong>in</strong>gs <strong>of</strong> the IEEE Symposium on <strong>Visual</strong> <strong>Analytics</strong> Science and<br />
Technology (VAST ’11), pages 151–160. IEEE CS Press, 2011.<br />
[143] L. van der Maaten and G. H<strong>in</strong>ton. <strong>Visual</strong>iz<strong>in</strong>g data us<strong>in</strong>g t-SNE. Journal <strong>of</strong> Mach<strong>in</strong>e<br />
Learn<strong>in</strong>g Research, 9:2579–2605, 2008.<br />
[144] J. Ward. Hierarchical group<strong>in</strong>g to optimize an objective function. Journal <strong>of</strong> the American<br />
Statistical Association, 58:236–244, 1963.<br />
[145] M. Ward, G. Gr<strong>in</strong>ste<strong>in</strong>, and D. Keim. Interactive <strong>Data</strong> <strong>Visual</strong>ization: Foundations, Techniques,<br />
and Applications. Taylor & Francis, 2010.<br />
[146] M. O. Ward. Xmdvtool: Integrat<strong>in</strong>g multiple methods for visualiz<strong>in</strong>g multivariate data.<br />
In Proceed<strong>in</strong>gs <strong>of</strong> the IEEE Symposium on Information <strong>Visual</strong>ization (InfoVis ’94), pages<br />
326–333. IEEE CS Press, 1994.<br />
[147] M. O. Ward. A taxonomy <strong>of</strong> glyph placement strategies for multidimensional data visualization.<br />
Information <strong>Visual</strong>ization, 1(3/4):194–210, 2002.<br />
[148] C. Ware. Information <strong>Visual</strong>ization: Perception for Design. Morgan Kaufmann Publishers<br />
Inc., 2004.<br />
[149] C. Ware, H. Purchase, L. Colpoys, and M. McGill. Cognitive measurements <strong>of</strong> graph aesthetics.<br />
Information <strong>Visual</strong>ization, 1:103–110, 2002.<br />
[150] M. Wattenberg. A note on space-fill<strong>in</strong>g visualizations and space-fill<strong>in</strong>g curves. In Proceed<strong>in</strong>gs<br />
<strong>of</strong> the IEEE Symposium on Information <strong>Visual</strong>ization (InfoVis ’05). IEEE CS Press, 2005.<br />
[151] L. Wilk<strong>in</strong>son, A. Anand, and R. Grossman. Graph-theoretic scagnostics. In Proceed<strong>in</strong>gs <strong>of</strong><br />
the IEEE Symposium on Information <strong>Visual</strong>ization (InfoVis ’05), pages 157–164. IEEE CS<br />
Press, 2005.<br />
[152] L. Wilk<strong>in</strong>son, A. Anand, and R. Grossman. <strong>High</strong>-dimensional visual analytics: Interactive<br />
exploration guided by pairwise views <strong>of</strong> po<strong>in</strong>t distributions. IEEE Transactions on <strong>Visual</strong>ization<br />
and Computer Graphics (TVCG ’06), 12:1363–1372, 2006.<br />
[153] A. Wismueller, M. Verleysen, M. Aupetit, and J. A. Lee. Recent Advances <strong>in</strong> Nonl<strong>in</strong>ear<br />
<strong>Dimensional</strong>ity Reduction, Manifold and Topological Learn<strong>in</strong>g. 18th European Symposium<br />
on Artificial Neural Networks - Computational Intelligence and Mach<strong>in</strong>e Learn<strong>in</strong>g (ESANN),<br />
pages 71–80, 2010.<br />
[154] I. H. Witten and E. Frank. <strong>Data</strong> M<strong>in</strong><strong>in</strong>g: Practical Mach<strong>in</strong>e Learn<strong>in</strong>g Tools and Techniques.<br />
The Morgan Kaufmann Series <strong>in</strong> <strong>Data</strong> Management Systems. Morgan Kaufmann Publishers,<br />
2nd edition, 2005.<br />
[155] R. Xu and D. C. Wunsch II. Survey <strong>of</strong> cluster<strong>in</strong>g algorithms. IEEE Transactions on Neural<br />
Networks, 16(3):645–678, 2005.<br />
[156] J. Yang, D. Hubball, M. O. Ward, E. A. Rundenste<strong>in</strong>er, and W. Ribarsky. Value and relation<br />
display: Interactive visual exploration <strong>of</strong> large data sets with hundreds <strong>of</strong> dimensions. IEEE<br />
Transactions on <strong>Visual</strong>ization and Computer Graphics (TVCG ’07), 13:494–507, 2007.<br />
[157] J. Yang, A. Patro, S. Huang, N. Mehta, M. O. Ward, and E. A. Rundenste<strong>in</strong>er. Value and<br />
Relation Display for Interactive Exploration <strong>of</strong> <strong>High</strong> <strong>Dimensional</strong> <strong>Data</strong>sets. In Proceed<strong>in</strong>gs <strong>of</strong><br />
IEEE Symposium on Information <strong>Visual</strong>ization (InfoVis ’04), pages 73–80. IEEE CS Press,<br />
2004.
[158] J. Yang, W. Peng, M. O. Ward, and E. A. Rundenste<strong>in</strong>er. Interactive Hierarchical Dimension<br />
Order<strong>in</strong>g, Spac<strong>in</strong>g and Filter<strong>in</strong>g for Exploration <strong>of</strong> <strong>High</strong> <strong>Dimensional</strong> <strong>Data</strong>sets. In Proceed<strong>in</strong>gs<br />
<strong>of</strong> the IEEE Symposium on Information <strong>Visual</strong>ization (InfoVis ’03). IEEE CS Press, 2003.<br />
[159] J. Yang, M. O. Ward, E. A. Rundenste<strong>in</strong>er, and S. Huang. <strong>Visual</strong> hierarchical dimension<br />
reduction for exploration <strong>of</strong> high dimensional datasets. In Proceed<strong>in</strong>gs <strong>of</strong> the Symposium on<br />
<strong>Data</strong> <strong>Visual</strong>ization (VISSYM), pages 19–28. Eurographics Association, 2003.<br />
[160] J. S. Yi, Y. a. Kang, J. Stasko, and J. Jacko. Toward a deeper understand<strong>in</strong>g <strong>of</strong> the role <strong>of</strong><br />
<strong>in</strong>teraction <strong>in</strong> <strong>in</strong>formation visualization. IEEE Transactions on <strong>Visual</strong>ization and Computer<br />
Graphics (TVCG ’07), 13:1224–1231, 2007.<br />
[161] X. Yuan, Z. Wang, and C. Guo. MDS-tree and MDS-matrix for high dimensional data visualization.<br />
In Proceed<strong>in</strong>gs <strong>of</strong> IEEE Symposium on Information <strong>Visual</strong>ization (InfoVis ’11),<br />
2011. Poster abstract.<br />
[162] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: an efficient data cluster<strong>in</strong>g method for<br />
very large databases. In Proceed<strong>in</strong>gs <strong>of</strong> the ACM SIGMOD International Conference on<br />
Management <strong>of</strong> <strong>Data</strong> (SIGMOD ’96), pages 103–114, New York, NY, USA, 1996. ACM.<br />
[163] J. Zupan, M. Novic, X. Li, and J. Gasteiger. Classification <strong>of</strong> multicomponent analytical<br />
data <strong>of</strong> olive oils us<strong>in</strong>g different neural networks. In Analytica Chimica Acta, volume 292,<br />
pages 219–234, 1994.