Chapter 3

Using pandas to load the dataset

The pandas library is a library for loading, managing, and manipulating data. It handles data structures behind the scenes and supports analysis methods, such as computing the mean.

When doing multiple data mining experiments, you will find that you write many of the same functions again and again, such as reading files and extracting features. Each time this reimplementation happens, you run the risk of introducing bugs. Using a high-quality library such as pandas significantly reduces the amount of work needed to write these functions and also gives you more confidence, because you are relying on well-tested code.

Throughout this book, we will use pandas extensively, introducing use cases as we go.

We can load the dataset using the read_csv function:

    import pandas as pd
    dataset = pd.read_csv(data_filename)

The result of this is a pandas DataFrame, which has some useful functions that we will use later on. Looking at the resulting dataset, we can see some issues. Type the following and run the code to see the first five rows of the dataset:

    dataset.head()

[Output: the first five rows of the dataset]

This is actually a usable dataset, but it contains some problems that we will fix up soon.
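As a rough, self-contained sketch of the loading step, the following uses a small made-up CSV held in memory instead of data_filename. The column names and values are hypothetical and are not the book's dataset; the point is only to show what read_csv returns and how to inspect it.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the file at data_filename.
csv_text = """Date,Team,Points
2013-10-29,A,87
2013-10-29,B,95
2013-10-30,C,101
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)    # class of the object returned by read_csv
print(sample.head())            # up to the first five rows (here, all three)
print(sample["Points"].mean())  # one of the built-in analysis methods
print(sample.dtypes)            # without parse_dates, Date is not a datetime column
```

Note that the Date column is loaded as text here, which is the same kind of issue the next section deals with.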
Predicting Sports Winners with Decision Trees

Cleaning up the dataset

After looking at the output, we can see a number of problems:

• The date is just a string and not a date object
• The first row is blank
• From visually inspecting the results, the headings aren't complete or correct

These issues come from the data, and we could fix them by altering the data itself. However, in doing this, we could forget the steps we took or misapply them; that is, we couldn't replicate our results. As with the previous section, where we used pipelines to track the transformations we made to a dataset, we will use pandas to apply transformations to the raw data itself.

The pandas.read_csv function has parameters to fix each of these issues, which we can specify when loading the file. We can also change the headings after loading the file, as shown in the following code:

    dataset = pd.read_csv(data_filename, parse_dates=["Date"],
                          skiprows=[0,])
    dataset.columns = ["Date", "Score Type", "Visitor Team",
                       "VisitorPts", "Home Team", "HomePts", "OT?", "Notes"]

The results have significantly improved, as we can see if we print out the resulting data frame:

    dataset.head()

[Output: the first five rows of the cleaned dataset]

Even in well-compiled data sources such as this one, you need to make some adjustments. Different systems have different nuances, resulting in data files that are not quite compatible with each other.
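To see what these parameters do in isolation, here is a minimal sketch on a made-up, in-memory file. The raw text, team names, and scores are hypothetical and use fewer columns than the real dataset. It assumes an unwanted first line and incomplete headings, similar in spirit to the problems listed above.

```python
import io

import pandas as pd

# Hypothetical raw text: an unwanted first line, then incomplete headings.
raw_text = """,,,,
Date,Visitor,PTS,Home,PTS
2013-10-29,Team A,87,Team B,97
2013-10-30,Team C,101,Team D,98
"""

# skiprows=[0] drops the first line of the file, so the second line is
# used as the header; parse_dates converts the Date column to datetimes.
games = pd.read_csv(io.StringIO(raw_text), parse_dates=["Date"], skiprows=[0])

# Replace the incomplete headings with descriptive ones.
games.columns = ["Date", "Visitor Team", "VisitorPts", "Home Team", "HomePts"]

print(games.dtypes)  # Date should now be a datetime column
print(games.head())
```

Because every fix lives in the loading code rather than in a hand-edited file, rerunning the script on the raw data reproduces the same cleaned DataFrame.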