Machine Learning Module (2011/2012) - Sweet

UNIVERSITY OF AVEIRO 

DEPARTMENT OF ELECTRONICS TELECOMMUNICATIONS AND INFORMATICS 

UC II - Machine Learning Module (2011/2012) 

Practical Exercises nº 1 

Part I - K Nearest Neighbour (KNN) and Naive Bayes classifiers 

Fig. 1 

Classification problems are described by the block diagram in Fig. 1. Each entry of the vector x corresponds to a different attribute (or feature) of the object, and label is the name of its class.

Ex. 1. Table 1 describes train arrivals divided into two classes (on time and late), together with several attributes (features) recorded for each arrival.

Table 1 

a) Map the data set onto the general model for a classification problem:

• What are the possible values for label? 

• How many attributes? What is the type of each attribute/feature of the data set? 

• What is the dimension of the classification problem? 

• What is the size of the training data set? 

b) Estimate the following parameters of a Naive Bayes Classifier: 

• A-priori probabilities: $P(C_i)$, where $C_i$ is the class label.

• Conditional probabilities (likelihood): $P(\text{attribute} = \text{value} \mid C_i)$

Organize the parameter values into a table. 

c) The goal is to classify the following new object (instance):

x = (weekday, winter, high, heavy). What is its class?

d) Estimate $P(x = \text{new object} \mid C_i)$. Explain the assumptions made in this step.

e) What is the final decision about the new instance? 

f) Apply a Laplace correction (if necessary).
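The steps b) to f) can be sketched in Python as follows. This is only an illustration under stated assumptions: the records below are hypothetical and are NOT the data of Table 1, the attribute order (day, season, wind, rain) is assumed, and the function names `fit` and `posterior` are made up for this sketch. The Laplace correction adds `alpha` to every count so that an unseen value never yields a zero probability.

```python
from collections import Counter, defaultdict

# Hypothetical toy records (NOT the data of Table 1).
# Assumed attribute order: (day, season, wind, rain) -> label
train = [
    (("weekday", "winter", "high", "heavy"), "late"),
    (("weekday", "winter", "normal", "none"), "on time"),
    (("weekday", "summer", "none", "slight"), "on time"),
    (("saturday", "winter", "high", "heavy"), "late"),
    (("sunday", "summer", "normal", "none"), "on time"),
    (("weekday", "autumn", "high", "slight"), "on time"),
]

def fit(records):
    """Count classes, attribute values per class, and the domain of each attribute."""
    class_counts = Counter(label for _, label in records)
    value_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    domains = defaultdict(set)           # attribute index -> observed values
    for x, label in records:
        for j, v in enumerate(x):
            value_counts[(j, label)][v] += 1
            domains[j].add(v)
    return class_counts, value_counts, domains

def posterior(x, model, alpha=1.0):
    """Normalised P(C_i | x) under the naive (conditional independence) assumption."""
    class_counts, value_counts, domains = model
    n = sum(class_counts.values())
    scores = {}
    for label, nc in class_counts.items():
        score = nc / n  # a-priori probability P(C_i)
        for j, v in enumerate(x):
            # Laplace-corrected estimate of P(attribute_j = v | C_i)
            score *= (value_counts[(j, label)][v] + alpha) / (nc + alpha * len(domains[j]))
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

model = fit(train)
post = posterior(("weekday", "winter", "high", "heavy"), model)
decision = max(post, key=post.get)
print(post, decision)
```

Setting `alpha=0` gives the uncorrected estimates, which makes the effect of the correction easy to compare.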

Ex. 2. The following table describes 20 Portuguese teenagers by weight and height. Given a new object x, where x1 = 60 is the weight and x2 = 165 is the height, decide whether x is a boy or a girl, using:

a) KNN classifier with K = 1 and Euclidean distance.

b) KNN classifier with K = 3 and Euclidean distance.

c) Naive Bayes classifier. Formalize the complete model (see the Appendix).
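Items a) and b) can be sketched in Python as below. The six training points are hypothetical and are NOT the 20 teenagers of the exercise table; they were chosen only so that K = 1 and K = 3 give different answers. The helper name `knn_classify` is an assumption of this sketch.

```python
import math
from collections import Counter

# Hypothetical sample (NOT the exercise table): (weight, height, label)
train = [
    (52.0, 158.0, "girl"),
    (55.0, 162.0, "girl"),
    (58.0, 164.0, "girl"),
    (61.0, 166.0, "boy"),
    (66.0, 172.0, "boy"),
    (70.0, 176.0, "boy"),
]

def knn_classify(x, data, k):
    """Majority vote among the k training points closest to x (Euclidean distance)."""
    nearest = sorted(data, key=lambda row: math.dist(x, row[:2]))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

x = (60.0, 165.0)
label_k1 = knn_classify(x, train, k=1)
label_k3 = knn_classify(x, train, k=3)
print(label_k1, label_k3)
```

With these made-up points the single nearest neighbour is a boy, while two of the three nearest neighbours are girls, so the decision depends on K.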

Ex. 2 Spam detection 

The following data set of messages (Table 1) is classified as spam and not spam (ham). A new message M = "today is secret" arrives, and we want to calculate the probability that M is a spam message by applying the Bayes rule: P(Spam | M) = ? Create a "bag of words" and count the occurrences of words in spam and not-spam messages. Apply Laplace smoothing if necessary.
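A minimal Python sketch of this computation follows. It assumes add-k smoothing with k = 1 applied to both the class priors and the word likelihoods (the exercise does not fix these choices). The messages of Table 1 are copied into the code so that it is self-contained, and all function names are illustrative.

```python
from collections import Counter

# Messages copied from Table 1
spam = ["offer is secret", "click secret link", "secret sport link"]
ham = ["play sport today", "went play sport", "secret sport vents",
       "sport is today", "sport costs money"]

def bag_of_words(messages):
    """Word -> number of occurrences over all messages of one class."""
    return Counter(word for msg in messages for word in msg.lower().split())

spam_counts, ham_counts = bag_of_words(spam), bag_of_words(ham)
vocabulary = set(spam_counts) | set(ham_counts)

def joint(message, counts, n_class, n_total, k=1):
    """Smoothed prior times the product of smoothed word likelihoods."""
    score = (n_class + k) / (n_total + 2 * k)  # two classes: spam and ham
    n_words = sum(counts.values())
    for word in message.lower().split():
        score *= (counts[word] + k) / (n_words + k * len(vocabulary))
    return score

def p_spam(message, k=1):
    """P(Spam | message) by the Bayes rule."""
    n_total = len(spam) + len(ham)
    s = joint(message, spam_counts, len(spam), n_total, k)
    h = joint(message, ham_counts, len(ham), n_total, k)
    return s / (s + h)

p = p_spam("today is secret")
print(round(p, 4))
```

Calling `p_spam("today is secret", k=0)` shows why smoothing is needed here: the word "today" never occurs in a spam message, so the unsmoothed spam score is zero.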

Table 1

Spam

1. Offer is secret

2. Click secret link

3. Secret sport link

Not spam (Ham)

1. Play sport today

2. Went play sport

3. Secret sport vents

4. Sport is today

5. Sport costs money

Further Questions

• Estimation of conditional probabilities with qualitative attributes. For instance, $P(\text{attribute} = \text{value} \mid C_i)$ assumes that value exists in the training set. What if value does not occur in the instances of class $C_i$? Can you propose a solution? (Tip: Laplace correction.)

• How can KNN (K nearest neighbours) be applied to a data set with qualitative attributes? What difficulties need to be overcome?

• Comment on the following: "in a two-class problem, the number of neighbours k of the KNN classifier should be an odd number".

• Comment on the following: "when using the Euclidean distance in KNN, it is convenient to have attributes on a similar scale". Explain with an example.

• Comment on the following: "the KNN classifier needs the training set during the test phase, while Naive Bayes does not need the training set during the test phase".
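As one possible illustration of the question about attribute scales, the sketch below compares the nearest neighbour of the same object when height is expressed in metres and in centimetres. All values are hypothetical and the helper name `nearest` is an assumption of this sketch.

```python
import math

def nearest(x, points):
    """Name of the point with the smallest Euclidean distance to x."""
    return min(points, key=lambda name: math.dist(x, points[name]))

# Hypothetical values: (weight in kg, height in m)
x_m = (60.0, 1.65)
points_m = {"A": (61.0, 1.95), "B": (64.0, 1.66)}

# Same objects with height converted to cm
x_cm = (x_m[0], x_m[1] * 100)
points_cm = {name: (w, h * 100) for name, (w, h) in points_m.items()}

nearest_m = nearest(x_m, points_m)     # weight differences dominate the distance
nearest_cm = nearest(x_cm, points_cm)  # height differences dominate the distance
print(nearest_m, nearest_cm)
```

The objects are identical in both cases; only the unit of one attribute changes, yet the nearest neighbour switches from A to B.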

Appendix 

Fig. 2. Plots of the teenager data set

The Health System collected the data of the students and concluded that both attributes follow a 

Gaussian distribution 

$$P(A_i \mid C_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \, e^{-\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2}} \qquad (1)$$

Eq. (1) serves as a likelihood function for the attribute $A_i$ with respect to the class $C_j$. Its parameters, the mean ($\mu_{ij}$) and the standard deviation ($\sigma_{ij}$), are given in the following table. The values in the table can be used to estimate the likelihood function of each attribute.
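Eq. (1) can be evaluated with a few lines of Python, sketched below. The means, standard deviations and priors in `params` and `priors` are hypothetical placeholders and are NOT the values of the table; they should be replaced by the table values before drawing any conclusion. The function names are illustrative.

```python
import math

def gaussian_likelihood(a, mu, sigma):
    """Eq. (1): Gaussian density with mean mu and standard deviation sigma at value a."""
    return math.exp(-(a - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Hypothetical parameters (NOT the table values):
# class -> [(mu, sigma) for weight, (mu, sigma) for height]
params = {
    "boy": [(65.0, 8.0), (172.0, 7.0)],
    "girl": [(56.0, 7.0), (162.0, 6.0)],
}
priors = {"boy": 0.5, "girl": 0.5}  # assumed equal a-priori probabilities

def classify(x):
    """Naive Bayes decision: prior times the product of per-attribute likelihoods."""
    scores = {}
    for c in params:
        score = priors[c]
        for a, (mu, sigma) in zip(x, params[c]):
            score *= gaussian_likelihood(a, mu, sigma)
        scores[c] = score
    return max(scores, key=scores.get), scores

label, scores = classify((60.0, 165.0))
print(label, scores)
```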

Part II Introduction to Rapid Miner (RM) 

Ex. 3.

3.1 Download the Pima Indian data set from the UCI Machine Learning Repository and extract it into a local folder (for example ml2012). It is accessible from

http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/ 

This data set contains records of 768 patients, labelled according to whether or not they developed diabetes within five years.

The attributes in the data set are: 

1) Pregnant: number of times pregnant 

2) PlasmaGlucose: the plasma glucose concentration measured using a two-hour oral glucose 

tolerance test. 

3) DiastolicBP: diastolic blood pressure (mmHg) 

4) TriceptsSFT: triceps skin fold thickness (mm)

5) SerumInsulin: two-hour serum insulin (mu U/ml)

6) BMI: body mass index (weight in kg / (height in m)^2)

7) DPF: diabetes pedigree function 

8) Age: age of the patient (years) 

9) Class: diabetes onset within five years (0 or 1) 
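Outside Rapid Miner, the attribute list above can be attached to the raw file with a short Python sketch. It assumes that the file is comma-separated and has no header row; the two sample rows are illustrative stand-ins for the file's contents, and `read_records` is a hypothetical helper name.

```python
import csv
import io

# Attribute names as listed above, in file column order
COLUMNS = ["Pregnant", "PlasmaGlucose", "DiastolicBP", "TriceptsSFT",
           "SerumInsulin", "BMI", "DPF", "Age", "Class"]

# Two illustrative rows in the assumed layout (comma-separated, no header)
sample = io.StringIO("6,148,72,35,0,33.6,0.627,50,1\n1,85,66,29,0,26.6,0.351,31,0\n")

def read_records(handle):
    """Attach the attribute names to each row and convert the values to numbers."""
    return [dict(zip(COLUMNS, map(float, row))) for row in csv.reader(handle) if row]

records = read_records(sample)
print(records[0]["BMI"], int(records[0]["Class"]))
```

To read the downloaded file instead of the sample, the same function can be given a handle opened with `open("pima-indians-diabetes.data", newline="")`.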

3.2. Run Rapid Miner (RM) 5 and create a New Local Repository (into a new folder, for example 

ML2012\RM_repository) to organize all your data sets and processes. 

3.3. (1st way to import data into RM). From the menu File-Import Data-Import CSV File, follow the instructions of the wizard to import pima-indians-diabetes.data.

At step 2 choose a proper column separator. At step 4 give suggestive names to the attributes and specify correctly their role (label or regular attribute). At step 5 choose a name for the data set, for example DiabetesData, and store it in your Local Rapid Miner Repository.

3.4. (2nd way to import data into RM). Create a new process (ImportData) to import the data from the pima-indians-diabetes.data file into the repository, using:

a) Operator ReadCSV to load data from a data file. Be careful to choose a proper column separator and to indicate correctly (yes/no) whether the first row contains the attribute names.

b) Operator Store to add the data set into your RM repository. Choose a proper name. 

3.5. Create a new process (DataDescription1) to perform the following tasks: 

a) Use the Retrieve operator to load the DiabetesData from the repository or directly move the 

database icon into your process. 

b) Run the process, switch to the Results Perspective (Workspace) and explore the Data View and Plot View options. Choose the Quartile option to analyze the distribution of some attributes (for example Pregnant and Age). Choose Scatter matrix (with plots-class) to observe the data distribution.

3.6. Create a new process (DataDescription2) to perform the following tasks: 

a) Retrieve the DiabetesData from the repository. 

b) Use the Filter Examples operator to remove the records with missing values in attributes 

TriceptsSFT and BMI. Tip: use a logical expression such as (a = 0 and b = 0). 
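The filtering logic of step b) can be sketched in Python as below. The sketch assumes that a value of 0 encodes a missing measurement. Whether a record is dropped when either attribute is 0 or only when both are 0 is left open by the tip, so the rule is a parameter here (`any` by default). The records and the helper name `has_missing` are hypothetical.

```python
# Hypothetical records using the attribute names of step 3.1;
# 0 is assumed to encode a missing value
records = [
    {"TriceptsSFT": 35.0, "BMI": 33.6},
    {"TriceptsSFT": 0.0, "BMI": 23.3},
    {"TriceptsSFT": 29.0, "BMI": 0.0},
    {"TriceptsSFT": 0.0, "BMI": 0.0},
]

def has_missing(record, attributes=("TriceptsSFT", "BMI"), rule=any):
    """rule=any: incomplete if ANY attribute is 0; rule=all: only if ALL are 0."""
    return rule(record[a] == 0 for a in attributes)

kept_any = [r for r in records if not has_missing(r)]
kept_all = [r for r in records if not has_missing(r, rule=all)]
print(len(kept_any), len(kept_all))
```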

Ex. 4. Create a new process (tean_classif) to solve the teenager classification problem of Ex. 2 using RM.

Suggestion: Import the training data (file tean_training.txt from moodle), create and import the testing data [60 165], and use the operators Retrieve, k-NN, Naive Bayes and Apply Model.

Remarks: Do not forget that RM is free software (errors are possible). It is better to add the attribute names directly in the original data files.
