
Learning Data Mining with Python


Chapter 4 – Recommending Movies Using Affinity Analysis

New datasets

http://www2.informatik.uni-freiburg.de/~cziegler/BX/

There are many recommendation-based datasets that are worth investigating, each with its own issues. For example, the Book-Crossing dataset contains more than 278,000 users and over a million ratings. Some of these ratings are explicit (the user gave a rating), while others are implicit. The weighting given to these implicit ratings probably shouldn't be as high as for explicit ratings.

The music website www.last.fm has released a great dataset for music recommendation: http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/.

There is also a joke recommendation dataset! See here: http://eigentaste.berkeley.edu/dataset/.

The Eclat algorithm

http://www.borgelt.net/eclat.html

The Apriori algorithm implemented in this chapter is easily the most famous of the association rule mining algorithms, but it isn't necessarily the best. Eclat is a more modern algorithm that can be implemented relatively easily.

Chapter 5 – Extracting Features with Transformers

Adding noise

In this chapter, we covered removing noise to improve features; however, improved performance can be obtained for some datasets by adding noise. The reason for this is simple: it helps stop overfitting by forcing the classifier to generalize its rules a little (although too much noise will make the model too general). Try implementing a Transformer that can add a given amount of noise to a dataset. Test it out on some of the datasets from the UCI ML repository and see if it improves test-set performance.
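One possible shape for such a Transformer is sketched below. The class name `NoiseAdder` and the `noise_level` parameter are illustrative choices, not from the chapter; it scales Gaussian noise by each feature's standard deviation so that the amount of noise is proportionate across features:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class NoiseAdder(BaseEstimator, TransformerMixin):
    """Add Gaussian noise to each feature, scaled by that feature's
    standard deviation (estimated during fit)."""

    def __init__(self, noise_level=0.1, random_state=None):
        self.noise_level = noise_level
        self.random_state = random_state

    def fit(self, X, y=None):
        # Record the per-feature scale so noise is proportionate
        self.scale_ = np.std(X, axis=0)
        return self

    def transform(self, X):
        rng = np.random.RandomState(self.random_state)
        noise = rng.normal(0, 1, size=np.shape(X))
        return X + noise * self.scale_ * self.noise_level
```

Because it follows the scikit-learn Transformer interface, it can be dropped into a Pipeline before a classifier, making it easy to compare test-set performance with and without added noise.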

Vowpal Wabbit

http://hunch.net/~vw/

Vowpal Wabbit is a great project, providing very fast feature extraction for text-based problems. It comes with a Python wrapper, allowing you to call it from within Python code. Test it out on large datasets, such as the one we used in Chapter 12, Working with Big Data.

Chapter 6 – Social Media Insight Using Naive Bayes

Spam detection

http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

Using the concepts in this chapter, you can create a spam detection method that is able to view a social media post and determine whether or not it is spam. Try this out by first creating a dataset of spam/not-spam posts, implementing the text mining algorithms, and then evaluating them.

One important consideration with spam detection is the false-positive/false-negative ratio. Many people would prefer to have a couple of spam messages slip through, rather than miss out on a legitimate message because the filter was too aggressive in stopping the spam. In order to tune your method for this, you can use a Grid Search with the f1-score as the evaluation criterion. See the above link for information on how to do this.

Natural language processing and part-of-speech tagging

http://www.nltk.org/book/ch05.html

The techniques we used in this chapter were quite lightweight compared to some of the linguistic models employed in other areas. For example, part-of-speech tagging can help disambiguate word forms, allowing for higher accuracy. The book that comes with NLTK has a chapter on this, linked above. The whole book is well worth reading too.
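A minimal sketch of the Grid Search idea above, using the modern `sklearn.model_selection` import path. The tiny inline dataset and the parameter grid are purely illustrative; a real spam corpus would be far larger:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up dataset: 1 = spam, 0 = not spam
posts = ["win free money now", "click here for a free prize",
         "lunch at noon tomorrow?", "see you at the meeting",
         "free free free cash", "are we still on for tonight?"]
labels = [1, 1, 0, 0, 1, 0]

pipeline = Pipeline([("vect", CountVectorizer()),
                     ("clf", MultinomialNB())])

# Optimise the f1-score rather than plain accuracy, so that false
# positives (legitimate posts flagged as spam) are penalised
grid = GridSearchCV(pipeline,
                    param_grid={"clf__alpha": [0.1, 1.0, 10.0]},
                    scoring="f1", cv=2)
grid.fit(posts, labels)
print(grid.best_params_)
```

The key detail is the `scoring="f1"` argument: swapping the scorer changes which parameter combination the Grid Search prefers, which is how you trade false positives against false negatives.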

