25.10.2016 Views

SAP HANA Predictive Analysis Library (PAL)


3.1.2 Agglomerate Hierarchical Clustering

Hierarchical clustering is a widely used clustering method that finds natural groups within a set of data. The idea is to organize the data into a hierarchy, that is, a binary tree of subgroups. A hierarchical clustering is either agglomerate or divisive, depending on the method of hierarchical decomposition.

The implementation in PAL follows the agglomerate approach, which merges clusters with a bottom-up strategy. Initially, each data point is treated as its own cluster. The algorithm then greedily merges the two least dissimilar clusters into a larger cluster, and repeats. Therefore, the input data must be numeric (except when Gower Distance is used for category attributes), and a measure of dissimilarity between sets of data is required, which is defined by the following two parameters:

● An appropriate metric (a measure of distance between pairs of data points)
● A linkage criterion, which specifies how the distance between two groups is derived from the pairwise distances of their members
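The bottom-up merging described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the PAL implementation: the function names `agglomerate` and `single_linkage` are hypothetical, Euclidean distance stands in for the metric, and single linkage stands in for the linkage criterion.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def single_linkage(a, b, points):
    """Assumed linkage: smallest pairwise distance between members of a and b."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def agglomerate(points, n_clusters=1, linkage=single_linkage):
    """Greedy bottom-up merging. Returns (clusters, merge history).

    Clusters are tuples of point indices. This naive version rescans
    all cluster pairs on every step, so it only suits small inputs.
    """
    clusters = [(i,) for i in range(len(points))]  # each point starts alone
    merges = []
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest dissimilarity.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: linkage(pair[0], pair[1], points),
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b))  # the merge history encodes the binary tree
    return clusters, merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
clusters, merges = agglomerate(points, n_clusters=2)
```

Running the merge loop down to `n_clusters=1` yields the full hierarchy in `merges`; stopping earlier cuts the tree at the requested number of clusters.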

An advantage of hierarchical clustering is that it does not require the number of clusters to be specified as input. The hierarchical structure can also be used for data summarization and visualization.

The agglomerate hierarchical clustering function in PAL currently supports eight distance metrics and seven linkage criteria.

Support for Category Attributes<br />

If the input data has category attributes, you must set the DISTANCE_FUNC parameter to Gower Distance so that the distance matrix can be calculated. Gower Distance is calculated in the following way.

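Suppose that the items $X_i$ and $X_j$ have $K$ attributes. The distance between $X_i$ and $X_j$ is the weighted average of the per-attribute scores $S_{ijk}$:

\[ d(X_i, X_j) = \frac{\sum_{k=1}^{K} W_k S_{ijk}}{\sum_{k=1}^{K} W_k} \]

For continuous attributes,

\[ S_{ijk} = \frac{|X_{ik} - X_{jk}|}{R_k} \]

where $R_k$ is the range of values for the $k$th variable.

For category attributes,

\[ S_{ijk} = \begin{cases} 0 & \text{if } X_{ik} = X_{jk} \\ 1 & \text{otherwise} \end{cases} \]

The weight $W_k$ is set by the user; the default is 1.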
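As a worked illustration of the Gower calculation, the sketch below computes the distance between two mixed-type records. It is not PAL code: the function name `gower_distance` and the convention of marking a category attribute with `None` in `ranges` are assumptions made for this example. It also assumes that the per-attribute scores are combined as a weighted average (the standard Gower definition) and that the continuous score uses the absolute difference.

```python
def gower_distance(x, y, ranges, weights=None):
    """Gower distance between two records of equal length.

    ranges[k] is the value range R_k of a continuous attribute, or
    None to mark attribute k as a category attribute (an assumed
    convention for this sketch).
    """
    if weights is None:
        weights = [1.0] * len(x)  # default weight W_k = 1
    total = 0.0
    for xk, yk, rk, wk in zip(x, y, ranges, weights):
        if rk is None:
            # Category attribute: 0 if the values match, 1 otherwise.
            s = 0.0 if xk == yk else 1.0
        elif rk == 0:
            # Assumption: a constant column contributes no distance.
            s = 0.0
        else:
            # Continuous attribute: absolute difference scaled by the range.
            s = abs(xk - yk) / rk
        total += wk * s
    return total / sum(weights)


x = (10.0, "red", 3.0)
y = (20.0, "blue", 3.0)
ranges = (40.0, None, 6.0)  # second attribute is a category attribute
d = gower_distance(x, y, ranges)
```

Here the three scores are 10/40 = 0.25, 1, and 0, so with unit weights the distance is 1.25 / 3.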

Prerequisites

● The first column of the input data is an ID column, and the other columns are of integer, double, varchar, or nvarchar data type.
● The input data does not contain null values. The algorithm issues an error when it encounters a null value.
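The prerequisites above can be checked on the client side before the data is loaded. The following is a minimal sketch, assuming the rows are available as Python tuples of the form (ID, attribute, attribute, ...); the helper `find_input_problems` is hypothetical and not part of PAL, and Python `int`, `float`, and `str` stand in for the integer, double, and varchar/nvarchar column types.

```python
ALLOWED_TYPES = (int, float, str)  # stand-ins for integer, double, (n)varchar


def find_input_problems(rows):
    """Return (row index, column index, reason) for each violation found."""
    problems = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if value is None:
                # Null values are rejected in every column, including the ID.
                problems.append((r, c, "null value"))
            elif c > 0 and (isinstance(value, bool)
                            or not isinstance(value, ALLOWED_TYPES)):
                # bool is excluded explicitly because it subclasses int.
                problems.append((r, c, "unsupported type"))
    return problems


rows = [
    (1, 0.5, "A"),
    (2, None, "B"),   # null attribute value
    (3, 1.5, b"C"),   # bytes is not an allowed attribute type
]
problems = find_input_problems(rows)
```

An empty result means the rows satisfy both prerequisites under the stated type mapping.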
