25.10.2016 Views

SAP HANA Predictive Analysis Library (PAL)

sap_hana_predictive_analysis_library_pal_en


3.6 Preprocessing Algorithms

The records in a business database are usually not directly ready for predictive analysis, for the following reasons:

● Some data comes in large amounts, which may exceed the capacity of an algorithm.

● Some data contains noisy observations, which may hurt the accuracy of an algorithm.

● Some attributes are badly scaled, which can make an algorithm unstable.

To address these challenges, PAL provides several convenient algorithms for data preprocessing.

3.6.1 Binning

Binning data is a common requirement prior to running certain predictive algorithms. It generally reduces the complexity of the model, for example, the model in a decision tree.

Binning methods replace a value by a "bin number" defined by all elements of its neighborhood, that is, the bin it belongs to. The ordered values are distributed into a number of bins. Because binning methods consult the neighborhood of values, they perform local smoothing.

Note

Binning can only be used on a table with only one attribute.

Binning Methods

There are four binning methods:

● Equal widths based on the number of bins

  Specify an integer to determine the number of equal-width bins and calculate the range values by:

  BinWidth = (MaxValue - MinValue) / K

  where MaxValue is the largest value in the column, MinValue is the smallest value in the column, and K is the number of bins.

  For example, according to this rule:

  ○ MinValue + BinWidth > Values in Bin 1 ≥ MinValue

  ○ MinValue + 2 * BinWidth > Values in Bin 2 ≥ MinValue + BinWidth

● Equal bin widths defined as a parameter

  Specify the bin width and calculate the start and end of bin intervals by:

  Start of bin intervals = Minimum data value – 0.5 * Bin width

  End of bin intervals = Maximum data value + 0.5 * Bin width

  For example, assuming the data has a range from 6 to 38 and the bin width is 10:

  Start of bin intervals = 6 – 0.5 * 10 = 1

  End of bin intervals = 38 + 0.5 * 10 = 43
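The two rules above can be sketched in a few lines of plain Python. This is an illustration of the arithmetic only, not the PAL procedure itself: the function names `equal_width_bins_by_count` and `bin_interval_bounds` are hypothetical, the sample data is made up to span the 6 to 38 range of the example, and placing the maximum value in the last bin is an assumption that the text does not state.

```python
def equal_width_bins_by_count(values, k):
    """Return (1-based bin numbers, bin width) for K equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # bin width = (MaxValue - MinValue) / K
    if width == 0:
        # All values are identical, so everything falls into bin 1.
        return [1] * len(values), width
    # Bin n holds values with MinValue + (n-1)*width <= v < MinValue + n*width.
    # Assumption: the maximum value is clamped into bin K, because the
    # half-open rule alone would place it in a non-existent bin K + 1.
    bins = [min(int((v - lo) / width) + 1, k) for v in values]
    return bins, width


def bin_interval_bounds(values, bin_width):
    """Return (start, end) of the bin intervals for a user-supplied bin width."""
    start = min(values) - 0.5 * bin_width
    end = max(values) + 0.5 * bin_width
    return start, end


if __name__ == "__main__":
    data = [6, 10, 14, 22, 30, 38]  # made-up sample spanning 6..38

    bins, width = equal_width_bins_by_count(data, k=4)
    print(width)  # 8.0
    print(bins)   # [1, 1, 2, 3, 4, 4]

    print(bin_interval_bounds(data, bin_width=10))  # (1.0, 43.0)
```

The last call reproduces the worked example: with data ranging from 6 to 38 and a bin width of 10, the intervals start at 1 and end at 43.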
