
Learning Data Mining with Python


Appendix

Other image datasets are available at:
http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html

Many image datasets are available from academic and industry sources. The linked website lists a number of these datasets along with some of the best-performing algorithms for each. Implementing some of the better algorithms will require a significant amount of custom code, but the payoff can be well worth the effort.

Chapter 12 – Working with Big Data

Courses on Hadoop

Both Yahoo and Google have great tutorials on Hadoop, which range from beginner to quite advanced levels. They don't specifically address using Python, but learning the Hadoop concepts and then applying them in Pydoop or a similar library can yield great results.

Yahoo's tutorial: https://developer.yahoo.com/hadoop/tutorial/
Google's tutorial: https://cloud.google.com/hadoop/what-is-hadoop

Pydoop

Pydoop is a Python library for running Hadoop jobs. It also has a great tutorial, which can be found here: http://crs4.github.io/pydoop/tutorial/index.html.

Pydoop also works with HDFS, the Hadoop Distributed File System, although you can get that functionality in mrjob as well. Pydoop will give you a bit more control over running some jobs.

Recommendation engine

Building a large recommendation engine is a good test of your big data skills. A great blog post by Mark Litwintschik covers an engine using Apache Spark, a big data technology: http://tech.marksblogg.com/recommendation-engine-sparkpython.html.
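The Hadoop concepts mentioned under Courses on Hadoop centre on the MapReduce pattern: a map step emits key-value pairs, the framework groups them by key, and a reduce step combines each group. The following is a minimal sketch of that pattern in plain Python, run in memory on a word count task. It does not use Hadoop, Pydoop, or mrjob, and the function names (map_words, shuffle, reduce_counts, word_count) are hypothetical names chosen for illustration, not part of any of those libraries.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    In a real Hadoop job the framework does this between map and reduce;
    here it is simulated with an in-memory dictionary.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    mapped = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data big ideas", "data mining with Python"]
    print(word_count(sample))
```

In a real Hadoop job, the map and reduce functions would run across many machines and the grouping would be handled by the framework, so only the two user-written steps would carry over to a library such as Pydoop or mrjob.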

Next Steps…

More resources

Kaggle competitions: www.kaggle.com/

Kaggle runs data mining competitions regularly, often with monetary prizes. Testing your skills on Kaggle competitions is a fast and effective way to learn to work with real-world data mining problems. The forums are friendly, sharing environments; often, you will see code released for a top-10 entry during the competition!

Coursera: www.coursera.org

Coursera offers many courses on data mining and data science. Many of the courses are specialized, covering topics such as big data and image processing. A great general one to start with is Andrew Ng's famous course: https://www.coursera.org/learn/machine-learning/. It is a bit more advanced than this book and would be a great next step for interested readers.

For neural networks, check out this course: https://www.coursera.org/course/neuralnets.

If you complete all of these, try out the course on probabilistic graphical models at https://www.coursera.org/course/pgm.

