Learning Data Mining with Python
Appendix

Chapter 4 – Recommending Movies Using Affinity Analysis

New datasets
http://www2.informatik.uni-freiburg.de/~cziegler/BX/
There are many recommendation-based datasets that are worth investigating, each with its own issues. For example, the Book-Crossing dataset contains more than 278,000 users and over a million ratings. Some of these ratings are explicit (the user did give a rating), while others are implicit. The weighting given to these implicit ratings probably shouldn't be as high as for explicit ratings.

The music website www.last.fm has released a great dataset for music recommendation: http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/.

There is also a joke recommendation dataset! See here: http://eigentaste.berkeley.edu/dataset/.

The Eclat algorithm
http://www.borgelt.net/eclat.html
The Apriori algorithm implemented in this chapter is easily the most famous of the association rule mining algorithms, but it isn't necessarily the best. Eclat is a more modern algorithm that can be implemented relatively easily.

Chapter 5 – Extracting Features with Transformers

Adding noise
In this chapter, we covered removing noise to improve features; however, for some datasets, improved performance can be obtained by adding noise. The reason for this is simple: it helps stop overfitting by forcing the classifier to generalize its rules a little (although too much noise will make the model too general). Try implementing a Transformer that can add a given amount of noise to a dataset. Test it out on some of the datasets from the UCI Machine Learning repository and see whether it improves test-set performance.
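One possible sketch of such a noise-adding Transformer, following the scikit-learn estimator API (the class name and its parameters are our own choices, not anything defined by the library):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class NoiseAdder(BaseEstimator, TransformerMixin):
    """Add Gaussian noise to every feature; `scale` controls how much."""

    def __init__(self, scale=0.1, random_state=None):
        self.scale = scale
        self.random_state = random_state

    def fit(self, X, y=None):
        # Nothing to learn from the data; we only perturb it in transform().
        return self

    def transform(self, X):
        rng = np.random.RandomState(self.random_state)
        X = np.asarray(X, dtype=float)
        return X + rng.normal(loc=0.0, scale=self.scale, size=X.shape)
```

Because it subclasses TransformerMixin and BaseEstimator, it gains fit_transform and works inside a Pipeline, so you can grid search over scale alongside your classifier's parameters.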
Vowpal Wabbit
http://hunch.net/~vw/
Vowpal Wabbit is a great project, providing very fast feature extraction for text-based problems. It comes with a Python wrapper, allowing you to call it from within Python code. Test it out on large datasets, such as the one we used in Chapter 12, Working with Big Data.

Chapter 6 – Social Media Insight Using Naive Bayes

Spam detection
http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
Using the concepts in this chapter, you can create a spam detection method that is able to view a social media post and determine whether or not it is spam. Try this out by first creating a dataset of spam/not-spam posts, then implementing the text mining algorithms, and finally evaluating them. One important consideration with spam detection is the false-positive/false-negative ratio. Many people would prefer to have a couple of spam messages slip through, rather than miss out on a legitimate message because the filter was too aggressive in stopping the spam. In order to tune your method for this, you can use a grid search with the f1-score as the evaluation criterion. See the above link for information on how to do this.

Natural language processing and part-of-speech tagging
http://www.nltk.org/book/ch05.html
The techniques we used in this chapter were quite lightweight compared to some of the linguistic models employed in other areas. For example, part-of-speech tagging can help disambiguate word forms, allowing for higher accuracy. The book that comes with NLTK has a chapter on this, linked above. The whole book is well worth reading too.
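Returning to the spam detection idea above, the grid search with f1 scoring could be sketched as follows. The toy posts, labels, and parameter grid here are invented purely for illustration, and we assume the modern sklearn.model_selection import (older scikit-learn versions exposed GridSearchCV from sklearn.grid_search):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical miniature dataset: 1 = spam, 0 = legitimate.
posts = [
    "win cash now", "free prize claim your reward",
    "cheap pills buy now", "click here for free money",
    "are we still meeting for lunch", "the report is attached",
    "great photo from the weekend", "see you at the game tonight",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag-of-words features feeding a Naive Bayes classifier.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Search the smoothing parameter, scoring each candidate with the f1-score
# so that false positives and false negatives are balanced.
grid = GridSearchCV(pipeline, {"nb__alpha": [0.1, 1.0, 10.0]},
                    scoring="f1", cv=2)
grid.fit(posts, labels)
print(grid.best_params_, grid.best_score_)
```

On a real dataset you would use far more posts and folds; the point is only that scoring="f1" makes the search optimize the precision/recall trade-off rather than raw accuracy.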