
We again redefine our word search regular expression; if you were doing this in a real application, I recommend centralizing it. It is important that words are extracted in the same way for training and testing:

    word_search_re = re.compile(r"[\w']+")

Next, we create the function that loads our model from a given filename:

    def load_model(model_filename):

The model parameters will take the form of a dictionary of dictionaries, where the first key is a word and the inner dictionary maps each gender to a probability. We use defaultdicts, which will return zero if a value isn't present:

        model = defaultdict(lambda: defaultdict(float))

We then open the model file and parse each line:

        with open(model_filename) as inf:
            for line in inf:

Each line is split into two sections, separated by whitespace. The first is the word itself and the second is a dictionary of probabilities. We run eval on each to recover the actual value, which was stored using repr in the previous code:

                word, values = line.split(maxsplit=1)
                word = eval(word)
                values = eval(values)

We then store the values for the word in our model:

                model[word] = values
        return model

Next, we load our actual model. You may need to change the model filename; it will be in the output directory of the last MapReduce job:

    model_filename = os.path.join(os.path.expanduser("~"),
                                  "models", "part-00000")
    model = load_model(model_filename)

As an example, we can see the difference in usage of the word i (all words are turned into lowercase in the MapReduce jobs) between males and females:

    model["i"]["male"], model["i"]["female"]
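The snippets above can be collected into one self-contained script. This is a minimal sketch: the model path matches the assumed location above, and the file format (one repr'd word and probability dictionary per line) is the MapReduce output described earlier:

    import os
    import re
    from collections import defaultdict

    # Tokenize identically for training and testing.
    word_search_re = re.compile(r"[\w']+")

    def load_model(model_filename):
        # Outer key: word; inner dictionary maps gender to probability.
        model = defaultdict(lambda: defaultdict(float))
        with open(model_filename) as inf:
            for line in inf:
                # Each line holds a repr'd word and a repr'd dictionary,
                # separated by whitespace. Note that eval trusts the file's
                # contents, so only load model files you produced yourself.
                word, values = line.split(maxsplit=1)
                model[eval(word)] = eval(values)
        return model

    model_filename = os.path.join(os.path.expanduser("~"),
                                  "models", "part-00000")
    model = load_model(model_filename)
    print(model["i"]["male"], model["i"]["female"])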

Next, we create a function that can use this model for prediction. We won't use the scikit-learn interface for this example; we just create a function instead. Our function takes the model and a document as its parameters and returns the most likely gender:

    def nb_predict(model, document):

We start by creating a dictionary to map each gender to the computed probability:

        probabilities = defaultdict(lambda: 1)

We extract each of the words from the document:

        words = word_search_re.findall(document)

We then iterate over the words and find the probability for each gender in the dataset:

        for word in set(words):
            probabilities["male"] += np.log(model[word].get("male", 1e-15))
            probabilities["female"] += np.log(model[word].get("female", 1e-15))

We then sort the genders by their value, take the highest value, and return that as our prediction:

        most_likely_genders = sorted(probabilities.items(),
                                     key=itemgetter(1), reverse=True)
        return most_likely_genders[0][0]

It is important to note that we used np.log to compute the probabilities. Probabilities in Naive Bayes models are often quite small. Multiplying small values, which is necessary in many statistical calculations, can lead to an underflow error, where the computer's precision isn't good enough and the whole value simply becomes 0. In this case, it would cause the likelihoods for both genders to be zero, leading to incorrect predictions.

To get around this, we use log probabilities. For two values a and b, log(a × b) is equal to log(a) + log(b). The log of a small probability is a negative value, but a relatively large one. For instance, log(0.00001) is about -11.5. This means that rather than multiplying actual probabilities and risking an underflow error, we can sum the log probabilities and compare the values in the same way: higher numbers still indicate a higher likelihood.
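To see this underflow concretely, the short sketch below multiplies 100 small probabilities directly and compares the result with the equivalent sum of log probabilities; the probability value is invented purely for illustration:

    import numpy as np

    probs = [1e-5] * 100  # 100 hypothetical per-word probabilities

    # Direct multiplication underflows to exactly 0.0: the true value,
    # 1e-500, is far below the smallest positive double.
    product = 1.0
    for p in probs:
        product *= p
    print(product)  # 0.0

    # Summing the logs stays comfortably within floating-point range.
    log_sum = sum(np.log(p) for p in probs)
    print(log_sum)  # about -1151.3; comparisons still behave the same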

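Putting the walkthrough together, the prediction function looks as follows. The example document at the end is hypothetical and assumes the model loaded in the previous section; note that starting every gender's score at 1 adds the same constant to both log sums, so it does not change which gender wins:

    import re
    from collections import defaultdict
    from operator import itemgetter

    import numpy as np

    word_search_re = re.compile(r"[\w']+")

    def nb_predict(model, document):
        # Both genders start from the same constant; only the
        # difference between the two sums matters.
        probabilities = defaultdict(lambda: 1)
        words = word_search_re.findall(document)
        for word in set(words):
            # Fall back to a tiny probability when a gender entry is missing.
            probabilities["male"] += np.log(model[word].get("male", 1e-15))
            probabilities["female"] += np.log(model[word].get("female", 1e-15))
        # The highest (least negative) log sum is our prediction.
        most_likely_genders = sorted(probabilities.items(),
                                     key=itemgetter(1), reverse=True)
        return most_likely_genders[0][0]

    # Hypothetical usage, with `model` loaded as shown earlier:
    print(nb_predict(model, "i was just thinking about this today"))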