We again redefine our word search regular expression; if you were doing this in a real application, I recommend centralizing it. It is important that words are extracted in the same way for training and testing:

    word_search_re = re.compile(r"[\w']+")

Next, we create the function that loads our model from a given filename:

    def load_model(model_filename):

The model parameters take the form of a dictionary of dictionaries, where the first key is a word and the inner dictionary maps each gender to a probability. We use defaultdicts, which return zero if a value isn't present:

        model = defaultdict(lambda: defaultdict(float))

We then open the model file and parse each line:

        with open(model_filename) as inf:
            for line in inf:

Each line is split into two sections, separated by whitespace. The first is the word itself and the second is a dictionary of probabilities. For each, we run eval to get the actual value, which was stored using repr in the previous code:

                word, values = line.split(maxsplit=1)
                word = eval(word)
                values = eval(values)

We then assign the values to the word in our model:

                model[word] = values
        return model

Next, we load our actual model. You may need to change the model filename; it will be in the output directory of the last MapReduce job:

    model_filename = os.path.join(os.path.expanduser("~"), "models",
                                  "part-00000")
    model = load_model(model_filename)

As an example, we can see the difference in usage of the word i (all words are turned into lowercase in the MapReduce jobs) between males and females:

    model["i"]["male"], model["i"]["female"]
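To see the repr/eval round trip in action, here is a minimal, self-contained sketch of load_model run against a mock model file. The exact contents of part-00000 depend on the earlier MapReduce job, so the two sample lines below (and the tab separator) are made-up illustrations of the format, not real output:

```python
import os
import re
import tempfile
from collections import defaultdict

word_search_re = re.compile(r"[\w']+")

def load_model(model_filename):
    # Missing words return an empty inner dict, whose missing
    # genders in turn return 0.0.
    model = defaultdict(lambda: defaultdict(float))
    with open(model_filename) as inf:
        for line in inf:
            word, values = line.split(maxsplit=1)
            # Both pieces were written with repr, so eval recovers
            # them. ast.literal_eval would be a safer drop-in if the
            # file could come from an untrusted source.
            word = eval(word)
            values = eval(values)
            model[word] = values
    return model

# Write a tiny mock model file in the assumed repr-based format.
with tempfile.NamedTemporaryFile("w", suffix=".txt",
                                 delete=False) as f:
    f.write("'i'\t{'male': 0.6, 'female': 0.4}\n")
    f.write("'hello'\t{'male': 0.5, 'female': 0.5}\n")
    path = f.name

model = load_model(path)
print(model["i"]["male"], model["i"]["female"])  # 0.6 0.4
os.remove(path)
```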
Next, we create a function that can use this model for prediction. We won't use the scikit-learn interface for this example; instead, we just create a function. Our function takes the model and a document as parameters and returns the most likely gender:

    def nb_predict(model, document):

We start by creating a dictionary to map each gender to the computed probability:

        probabilities = defaultdict(lambda: 1)

We extract each of the words from the document:

        words = word_search_re.findall(document)

We then iterate over the words and find the probability for each gender in the dataset:

        for word in set(words):
            probabilities["male"] += np.log(model[word].get("male", 1e-15))
            probabilities["female"] += np.log(model[word].get("female", 1e-15))

We then sort the genders by their value, get the highest value, and return that as our prediction:

        most_likely_genders = sorted(probabilities.items(),
                                     key=itemgetter(1), reverse=True)
        return most_likely_genders[0][0]

It is important to note that we used np.log to compute the probabilities. Probabilities in Naive Bayes models are often quite small. Multiplying small values, which is necessary in many statistical calculations, can lead to an underflow error, where the computer's precision isn't good enough and the whole value simply becomes zero. In this case, it would cause the likelihoods for both genders to be zero, leading to incorrect predictions. To get around this, we use log probabilities. For two values a and b, log(a × b) is equal to log(a) + log(b). The log of a small probability is a negative value, but a relatively large one. For instance, log(0.00001) is about -11.5. This means that rather than multiplying actual probabilities and risking an underflow error, we can sum the log probabilities and compare the values in the same way (higher numbers still indicate a higher likelihood).
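The underflow problem is easy to demonstrate directly. The sketch below multiplies 100 hypothetical word probabilities of 0.00001 each (the true product, 1e-500, is far below what a float64 can hold) and then redoes the comparison with log probabilities; the word counts and probability values are invented purely for illustration:

```python
import numpy as np

# Multiplying 100 probabilities of 1e-5 should give 1e-500, but
# float64 cannot represent anything that small, so it underflows.
probs = [1e-5] * 100
product = 1.0
for p in probs:
    product *= p
print(product)        # 0.0 -- underflowed to exactly zero

# Summing log probabilities keeps the value representable:
log_sum = sum(np.log(p) for p in probs)
print(log_sum)        # about -1151.3

# Comparisons still work: a class whose words are more probable
# (1e-4 each) gets a larger (less negative) log sum.
other = sum(np.log(p) for p in [1e-4] * 100)
print(other > log_sum)  # True
```

This is exactly why nb_predict sums np.log values per gender instead of multiplying raw probabilities.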