
Learning Data Mining with Python


Appendix

Local n-grams

https://github.com/robertlayton/authorship_tutorials/blob/master/LNGTutorial.ipynb

Another form of classifier is local n-gram, which involves choosing the best features per-author, not globally for the entire dataset. I wrote a tutorial on using local n-grams for authorship attribution, available at the above link.

Chapter 10 – Clustering News Articles

Evaluation

The evaluation of clustering algorithms is a difficult problem. On the one hand, we can sort of tell what good clusters look like; on the other hand, if we really know that, we should label some instances and use a supervised classifier! Much has been written on this topic. One slideshow on the topic that is a good introduction to the challenges follows:

http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf

In addition, a very comprehensive (although now a little dated) paper on this topic is here: http://web.itu.edu.tr/sgunduz/courses/verimaden/paper/validity_survey.pdf.

The scikit-learn package does implement a number of the metrics described in those links, with an overview here: http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation.

Using some of these, you can start evaluating which parameters need to be used for better clusterings. Using a Grid Search, we can find parameters that maximize a metric, just like in classification.

Temporal analysis

The code we developed in this chapter can be rerun over many months. By adding some tags to each cluster, you can track which topics stay active over time, getting a longitudinal viewpoint of what is being discussed in the world news.

To compare the clusters, consider a metric such as the adjusted mutual information score, which was linked to the scikit-learn documentation earlier. See how the clusters change after one month, two months, six months, and a year.
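The two ideas above, searching for parameters that maximize a clustering metric and comparing clusterings over time with the adjusted mutual information score, can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic (make_blobs stands in for the vectorized news articles), the silhouette score is one possible choice of metric, the search is a plain loop rather than scikit-learn's grid search classes, and the names best_n_clusters and make_month are hypothetical helpers rather than code from the chapter.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score, silhouette_score

# Assumption: four well-separated synthetic "topics" stand in for real documents.
CENTERS = [[0, 0], [10, 0], [0, 10], [10, 10]]


def make_month(seed):
    """Generate one hypothetical month of 2-D document vectors."""
    X, _ = make_blobs(n_samples=200, centers=CENTERS, cluster_std=0.5,
                      random_state=seed)
    return X


def best_n_clusters(X, candidates, random_state=14):
    """Try each candidate k and return (best_k, scores) by silhouette score."""
    scores = {}
    for k in candidates:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores


month_one, month_two = make_month(1), make_month(2)

# Parameter search: pick the number of clusters that maximizes the metric.
best_k, scores = best_n_clusters(month_one, range(2, 8))

# Temporal comparison: fit one model per month, then label the SAME documents
# with both models, because the score compares two labelings of one sample set.
model_one = KMeans(n_clusters=best_k, n_init=10, random_state=14).fit(month_one)
model_two = KMeans(n_clusters=best_k, n_init=10, random_state=14).fit(month_two)
ami = adjusted_mutual_info_score(model_one.predict(month_two),
                                 model_two.predict(month_two))

print("best k:", best_k)
print("AMI between the two months' models:", round(ami, 3))
```

A score near 1 suggests the two models carve up the documents in much the same way, while a score near 0 suggests the cluster structure has changed. On real news data, the shared set of documents and the candidate parameter values would need to be chosen to suit the corpus.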

Next Steps…

Real-time clusterings

The k-means algorithm can be iteratively trained and updated over time, rather than run as discrete analyses at given time frames. Cluster movement can be tracked in a number of ways. For instance, you can track which words are popular in each cluster and how much the centroids move per day. Keep the API limits in mind: you probably only need to do one check every few hours to keep your algorithm up to date.

Chapter 11: Classifying Objects in Images Using Deep Learning

Keras and Pylearn2

Other deep learning libraries that are worth looking at, if you are going further with deep learning in Python, are Keras and Pylearn2. They are both based on Theano and have different usages and features.

Keras can be found here: https://github.com/fchollet/keras/.

Pylearn2 can be found here: http://deeplearning.net/software/pylearn2/.

Neither is a stable platform at the time of writing, although Pylearn2 is the more stable of the two. That said, they both do what they do very well and are worth investigating for future projects.

Another library called Torch is very popular but, at the time of writing, it doesn't have Python bindings (see http://torch.ch/).

Mahotas

Another package for image processing is Mahotas (http://luispedro.org/software/mahotas/), which includes better and more complex image processing techniques that can help achieve better accuracy, although they may come at a high computational cost. However, many image processing tasks are good candidates for parallelization. More techniques on image classification can be found in the research literature, with this survey paper as a good start:

http://ijarcce.com/upload/january/22-A%20Survey%20on%20Image%20Classification.pdf
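Returning to the real-time clustering idea described earlier in this section, the following is a minimal sketch of updating k-means incrementally and measuring how far each centroid moves between checks. It uses the classic sequential k-means update in pure Python for clarity, which is an assumption of this sketch and not the chapter's own code. The class OnlineKMeans and the function centroid_shift are hypothetical names. In practice, scikit-learn's MiniBatchKMeans offers a partial_fit method for the same purpose.

```python
import math


class OnlineKMeans:
    """Sequential k-means: each new point nudges its nearest centroid."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        # Treat every seed centroid as if it had already absorbed one point.
        self.counts = [1] * len(self.centroids)

    def _nearest(self, point):
        """Index of the centroid closest to the point (Euclidean distance)."""
        distances = [math.dist(point, c) for c in self.centroids]
        return distances.index(min(distances))

    def update(self, points):
        """Fold a new batch of points into the existing centroids."""
        for point in points:
            i = self._nearest(point)
            self.counts[i] += 1
            rate = 1.0 / self.counts[i]  # running-mean step size
            self.centroids[i] = [c + rate * (x - c)
                                 for c, x in zip(self.centroids[i], point)]
        return self


def centroid_shift(before, after):
    """Distance each centroid moved between two snapshots."""
    return [math.dist(b, a) for b, a in zip(before, after)]


# Hypothetical usage: two seed centroids, then one small batch of new points.
model = OnlineKMeans([(0.0, 0.0), (10.0, 10.0)])
before = [list(c) for c in model.centroids]
model.update([(1.0, 1.0), (9.0, 9.0)])
shifts = centroid_shift(before, model.centroids)
print(model.centroids, shifts)
```

Calling update once per check (for example, every few hours) and logging the shifts gives a simple record of which clusters are drifting. Real document vectors would be high-dimensional and sparse, so a library implementation is likely the better choice outside of a demonstration.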

