25.10.2016 Views

SAP HANA Predictive Analysis Library (PAL)


3.2.10 Naive Bayes

Naive Bayes is a classification algorithm based on Bayes' theorem. It estimates the class-conditional probability by assuming that the attributes are conditionally independent of one another given the class.

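Given the class label $y$ and a dependent feature vector $x_1$ through $x_n$, Bayes' theorem states the following relationship:

$$P(y \mid x_1, \ldots, x_n) = \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)}$$

Using the naive independence assumption that

$$P(x_i \mid y, x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = P(x_i \mid y)$$

for all $i$, this relationship is simplified to

$$P(y \mid x_1, \ldots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)}$$

Since $P(x_1, \ldots, x_n)$ is constant given the input, we can use the following classification rule:

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$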

We can use maximum a posteriori (MAP) estimation to estimate $P(y)$ and $P(x_i \mid y)$. The former is then the relative frequency of class $y$ in the training set.
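The rule of choosing the class y that maximizes P(y) multiplied by the product of the P(x_i | y) terms can be illustrated with a minimal Python sketch. It is not PAL code: the function names `fit_counts` and `predict` and the toy data are hypothetical, only discrete attributes are handled, and no smoothing is applied. Logarithms are summed instead of multiplying probabilities, which is a common way to avoid numeric underflow and does not change the maximizing class.

```python
import math
from collections import Counter, defaultdict


def fit_counts(rows, labels):
    """Collect class counts and per-class, per-attribute value counts."""
    class_counts = Counter(labels)
    # value_counts[(label, attribute_index)][value] -> number of occurrences
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts


def predict(row, class_counts, value_counts):
    """Return the class y that maximizes P(y) * prod_i P(x_i | y)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        # P(y): relative frequency of the class in the training set.
        score = math.log(n_label / total)
        for i, value in enumerate(row):
            count = value_counts[(label, i)][value]
            if count == 0:
                # The unsmoothed estimate is 0, so the whole product is 0.
                score = -math.inf
                break
            # P(x_i | y): ratio of counts within the class.
            score += math.log(count / n_label)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set with two discrete attributes.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
class_counts, value_counts = fit_counts(rows, labels)
prediction = predict(("rainy", "mild"), class_counts, value_counts)
```

If every class receives a zero probability for some attribute value, this sketch returns `None`, which is the situation that smoothing is meant to prevent.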

The different Naive Bayes classifiers differ mainly in the assumptions they make regarding the distribution of $P(x_i \mid y)$.

For continuous attributes, the attribute data are fitted to a Gaussian distribution, from which $P(x_i \mid y)$ is obtained.
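As a rough illustration of the Gaussian case, the sketch below fits a mean and a variance to the values of one continuous attribute within one class and evaluates the resulting density for a new value. The helper names are hypothetical, and using the population (maximum-likelihood) variance is an assumption; the text does not say which variance estimator PAL applies.

```python
import math
import statistics


def fit_gaussian(values):
    """Estimate mean and variance of one attribute within one class."""
    mean = statistics.fmean(values)
    # Population variance (divides by the number of values); an assumption.
    variance = statistics.pvariance(values, mu=mean)
    return mean, variance


def gaussian_density(x, mean, variance):
    """Gaussian density used in place of P(x_i | y) for a continuous attribute."""
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance)


# Hypothetical attribute values observed for a single class.
mean, variance = fit_gaussian([4.0, 6.0])
density_at_mean = gaussian_density(5.0, mean, variance)
```

The density is undefined when all values in a class are identical (variance 0), so a practical implementation would need to guard against that case.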

For discrete attributes, the ratio of counts is used as $P(x_i \mid y)$. However, if a category does not occur in the training set, $P(x_i \mid y)$ becomes 0, while the actual probability is merely small rather than 0. This introduces errors into the prediction. To handle this issue, PAL introduces Laplace smoothing. $P(x_i \mid y)$ is then given by:

$$P(x_i \mid y) = \frac{x_i + \alpha}{N + \alpha d}$$

In this formula, $x_i$ is the number of training records in class $y$ that have the given category, $N$ is the number of training records in class $y$, and $d$ is the number of distinct categories of the attribute.

This is a type of shrinkage estimator, as the resulting estimate lies between the empirical estimate $x_i / N$ and the uniform probability $1/d$. $\alpha > 0$ is the smoothing parameter, also called the Laplace control value in the following discussion.
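A small sketch of Laplace (additive) smoothing makes the shrinkage behaviour concrete. It assumes the usual form (count + α) / (N + α·d), where count is the number of class records with the category, N the number of class records, and d the number of categories. The function name and the default α = 1.0 are illustrative choices, not PAL defaults.

```python
def laplace_estimate(count, total, num_categories, alpha=1.0):
    """Smoothed estimate (count + alpha) / (total + alpha * num_categories)."""
    if alpha <= 0:
        raise ValueError("alpha must be greater than 0")
    return (count + alpha) / (total + alpha * num_categories)


# Hypothetical counts for one attribute with three categories in one class.
counts = [7, 3, 0]
total = sum(counts)
smoothed = [laplace_estimate(c, total, len(counts)) for c in counts]
```

With these counts, the category that never occurred receives a small non-zero probability, the most frequent category is pulled from 7/10 toward 1/3, and the smoothed values still sum to 1.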

Despite its simplicity, Naive Bayes works quite well in areas like document classification and spam filtering, and it requires only a small amount of training data to estimate the parameters necessary for classification.

The Naive Bayes algorithm in PAL includes two functions: NBCTRAIN, which generates the training model, and NBCPREDICT, which makes predictions based on the training model.
