Learning Data Mining with Python
Appendix

Other image datasets are available at:
http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html

There are many image datasets available from academic and industry-based sources. The linked website lists a number of datasets and some of the best-performing algorithms on them. Implementing some of the better algorithms will require significant amounts of custom code, but the payoff can be well worth the effort.

Chapter 12 – Working with Big Data

Courses on Hadoop

Both Yahoo and Google have great tutorials on Hadoop, which go from beginner to quite advanced levels. They don't specifically address using Python, but learning the Hadoop concepts and then applying them in Pydoop or a similar library can yield great results.

Yahoo's tutorial: https://developer.yahoo.com/hadoop/tutorial/
Google's tutorial: https://cloud.google.com/hadoop/what-is-hadoop

Pydoop

Pydoop is a Python library for running Hadoop jobs. It also has a great tutorial, which can be found here: http://crs4.github.io/pydoop/tutorial/index.html.

Pydoop also works with HDFS, the Hadoop Distributed File System, although you can get that functionality in mrjob as well. Pydoop will give you a bit more control over running some jobs.

Recommendation engine

Building a large recommendation engine is a good test of your big data skills. A great blog post by Mark Litwintschik covers an engine using Apache Spark, a big data technology:
http://tech.marksblogg.com/recommendation-engine-sparkpython.html
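To make the recommendation engine idea concrete before scaling it up, here is a minimal pure-Python sketch of the map, shuffle, and reduce pattern applied to counting item co-occurrences. It is not the method used in the linked blog post and it does not call Hadoop, Pydoop, mrjob, or Spark. The toy `RATINGS` data and every function name are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: user -> set of items that user liked.
RATINGS = {
    "alice": {"A", "B", "C"},
    "bob": {"A", "B"},
    "carol": {"B", "C"},
    "dave": {"A", "D"},
}


def map_pairs(user, items):
    """Map step: emit ((item_x, item_y), 1) for each pair one user liked."""
    for x, y in combinations(sorted(items), 2):
        yield (x, y), 1


def shuffle(mapped):
    """Shuffle step: group emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts for one item pair."""
    return key, sum(values)


def cooccurrence(ratings):
    """Run map, shuffle, and reduce in memory over all users."""
    mapped = (kv for user, items in ratings.items()
              for kv in map_pairs(user, items))
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


def recommend(item, counts, top_n=2):
    """Rank other items by how often they co-occur with `item`."""
    scores = defaultdict(int)
    for (x, y), n in counts.items():
        if x == item:
            scores[y] += n
        elif y == item:
            scores[x] += n
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda other: (-scores[other], other))[:top_n]


if __name__ == "__main__":
    counts = cooccurrence(RATINGS)
    print(recommend("A", counts))  # prints ['B', 'C']
```

In a real big data setting, the map and reduce functions would be handed to a framework that distributes them across machines; only the in-memory `shuffle` stand-in would disappear.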
Next Steps…

More resources

Kaggle competitions: www.kaggle.com/
Kaggle runs data mining competitions regularly, often with monetary prizes. Testing your skills on Kaggle competitions is a fast and effective way to learn to work with real-world data mining problems. The forums are friendly, sharing environments; often, you will see code released for a top-10 entry during the competition!

Coursera: www.coursera.org
Coursera contains many courses on data mining and data science. Many of the courses are specialized, such as those on big data and image processing. A great general one to start with is Andrew Ng's famous course: https://www.coursera.org/learn/machine-learning/. It is a bit more advanced than this book and would be a great next step for interested readers.

For neural networks, check out this course: https://www.coursera.org/course/neuralnets.

If you complete all of these, try out the course on probabilistic graphical models at https://www.coursera.org/course/pgm.