
Friday, 9 July 2010

Data cleaning

Hadley Wickham

Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University


1. Intro to data cleaning
2. What you can fix
3. What you can’t fix


Data cleaning


“Happy families are all alike; every unhappy family is unhappy in its own way.”

—Leo Tolstoy


“Clean datasets are all alike; every messy dataset is messy in its own way.”

—Hadley Wickham


No magic bullet

No sequence of steps that will work in every case.

Often will need to clean the data repeatedly, as you discover new problems during the analysis process.

You know the basic tools - I’ll show you some new ways of applying them.


What you can fix:


Rectangular


Observations in rows


Variables in columns




library(reshape)
library(stringr)
options(stringsAsFactors = FALSE)
note_raw


Your turn

What are the variables in country_population.csv and tb_notification.csv? How are they arranged in rows and columns? Can you form the variables into two groups?

(Hint: you might need to look at the data dictionary)


Identifier variable         Measured variable
Index of random variable    Random variable
Dimension                   Measure
Experimental design         Measurement
Key                         Value


Molten data

Molten data keeps the identifier variables in columns and stacks all measured variables into rows, one row per identifier combination and measured variable. Sometimes called “long” form.

Easy to pour into new shapes, graphics and models

reshape::melt(data, id.vars, measure.vars)
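A minimal sketch of melting, assuming the reshape package loaded earlier in these slides is installed. The data frame `wide` and its column names are made up for illustration; they are not the TB data.

```r
library(reshape)

# Hypothetical wide table: one row per country, one column per year.
wide <- data.frame(
  country = c("A", "B"),
  y1999 = c(10, 20),
  y2000 = c(12, 25),
  stringsAsFactors = FALSE
)

# country identifies the observation; the year columns are measured.
molten <- melt(wide, id.vars = "country",
               measure.vars = c("y1999", "y2000"))

# One row per (country, variable) pair, with the number in `value`.
print(molten)
```

Each former column name ends up as a level of `variable`, so the year becomes data instead of structure.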


Your turn

Melt the data. What are the id variables present in this data? What are the measured variables?


note


Implicit vs. explicit missings

                Male    Female
Pregnant                10
Not-pregnant    20      15

Sex       Pregnant    Count
Male      No          20
Female    Yes         10
Female    No          15
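A sketch of turning the implicit missing into an explicit one, using base R only. The column names and the approach (build every combination with expand.grid(), then merge) are illustrative choices, not taken from the slides.

```r
# Long form from the slide: the Male/Yes combination is simply absent.
counts <- data.frame(
  sex = c("Male", "Female", "Female"),
  pregnant = c("No", "Yes", "No"),
  count = c(20, 10, 15),
  stringsAsFactors = FALSE
)

# Every combination of the identifier variables.
all_combos <- expand.grid(
  sex = unique(counts$sex),
  pregnant = unique(counts$pregnant),
  stringsAsFactors = FALSE
)

# Left join: combinations with no matching row get count = NA.
explicit <- merge(all_combos, counts, all.x = TRUE)
print(explicit)
```

After the merge the missing cell is a visible NA row rather than a row that is not there.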


Your turn

What’s missing? How does this data set compare to our original list of variables? Think about the operations you’d need to perform to create any extra pieces.


note$variable


# Combine into one file
rate


# Why this form?
# Because it works well with all our existing tools
# Easy to summarise
totals


Concise
(each fact represented once)

Consistent
(if repeated, each fact is the same)


TB

This data is already very concise, and therefore it’s also consistent.

Did you spot the exception?


note_total


In general

Whenever there is inconsistency, you are going to have to make some tradeoff to ensure concision.

Detecting inconsistency is not always easy, but you now have the basic tools.
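One way such a check might look, as a sketch in base R: a fact stored twice (a reported total and the parts it should equal) is recomputed and compared. The data frames and column names are hypothetical.

```r
# Hypothetical data: cases by group, plus a separately reported total.
parts <- data.frame(
  country = c("A", "A", "B", "B"),
  sex = c("m", "f", "m", "f"),
  cases = c(5, 7, 3, 4),
  stringsAsFactors = FALSE
)
reported <- data.frame(
  country = c("A", "B"),
  total = c(12, 8),
  stringsAsFactors = FALSE
)

# Recompute the total from the parts and compare with the reported one.
computed <- aggregate(cases ~ country, data = parts, FUN = sum)
check <- merge(reported, computed, by = "country")
check$consistent <- check$total == check$cases

# Rows where the same fact disagrees with itself.
inconsistent <- check[!check$consistent, ]
print(inconsistent)
```

Which of the two values to trust is the tradeoff; the check only tells you where to look.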


What you can’t fix:


Complete

Correct


Correct

Can’t restore correct values without original data but can remove clearly incorrect values

Options:

Remove entire row
Mark incorrect value as missing
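The two options side by side, as a small base R sketch. The data and the plausibility rule for `height` are invented for illustration.

```r
# Hypothetical measurements; heights of -1 and 25 metres are clearly wrong.
people <- data.frame(
  id = 1:4,
  height = c(1.7, -1, 1.8, 25)
)

# Assumed plausibility rule, chosen only for this example.
bad <- people$height <= 0 | people$height > 3

# Option 1: remove the entire row.
dropped <- people[!bad, ]

# Option 2: keep the row, mark the incorrect value as missing.
marked <- people
marked$height[bad] <- NA

print(dropped)
print(marked)
```

Marking as missing keeps the other columns of the row available for analysis; removing the row does not.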


What is a missing value?

In R, written as NA. Has special behaviour:

NA + 3 = ?
NA > 2 = ?
mean(c(2, 7, 10, NA)) = ?
NA == NA = ?

Use is.na() to see if a value is NA
Many functions have an na.rm argument
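The questions above can be checked at the console. A short sketch; the variable names are only there so the results can be inspected.

```r
# Any arithmetic or comparison involving NA is itself NA.
a <- NA + 3
b <- NA > 2
m <- mean(c(2, 7, 10, NA))
e <- NA == NA   # also NA: two unknown values may or may not be equal

# is.na() is the reliable test for missingness.
flags <- is.na(c(a, b, m, e))

# na.rm = TRUE drops missing values before computing.
m_clean <- mean(c(2, 7, 10, NA), na.rm = TRUE)

print(flags)
print(m_clean)
```

Because NA == NA is NA rather than TRUE, comparing with == cannot be used to find missing values.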


General strategy

To find incorrect values you need to be creative, combining graphics and data processing.

I’m going to try and give you a flavour of this process with another data set.


Diamonds data

~54,000 round diamonds from http://www.diamondse.info/

Carat, colour, clarity, cut
Total depth, table, depth, width, height
Price


[Diagram of a diamond, labelling x, z and the table width]

depth = z / diameter
table = table width / x * 100


Your turn

Look at histograms and scatterplots of x, y, z from the diamonds dataset (it’s included in ggplot2)

Which values are clearly incorrect? Which values might we be able to correct?


Plots

# diamonds and qplot() both come from ggplot2
library(ggplot2)

# Histograms of each dimension
qplot(x, data = diamonds, binwidth = 0.1)
qplot(y, data = diamonds, binwidth = 0.1)
qplot(z, data = diamonds, binwidth = 0.1)

# Pairwise scatterplots
qplot(x, y, data = diamonds)
qplot(x, z, data = diamonds)
qplot(y, z, data = diamonds)


y_big 10
z_big 6
x_zero
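One way to flag and blank out suspicious measurements, as a sketch. The data frame `d` is made up, and the comparisons (`> 10`, `> 6`, `== 0`) are assumptions suggested by the names y_big, z_big and x_zero; setting flagged values to NA rather than correcting them is also an assumption.

```r
# Hypothetical measurements in mm; rows 2 to 4 each hide one bad value.
d <- data.frame(
  x = c(4.0, 0.0, 6.5, 5.1),
  y = c(4.1, 5.0, 58.9, 5.2),
  z = c(2.5, 3.1, 4.0, 31.8)
)

# Logical flags, one per kind of problem (thresholds are assumed).
y_big <- d$y > 10
z_big <- d$z > 6
x_zero <- d$x == 0

# Mark the flagged values as missing, leaving the rest of each row intact.
clean <- d
clean$x[x_zero] <- NA
clean$y[y_big] <- NA
clean$z[z_big] <- NA

print(clean)
```

Keeping the flags as separate logical vectors makes it easy to inspect the affected rows with d[y_big, ] before changing anything.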


Your turn

Fix the incorrect values and replot scatterplots of x, y, and z. Are all the unusual values gone?


diamonds$x[x_zero]


New variables

When cleaning, derived variables will be very important.

e.g. if qplot(a, b) is a straight line through the origin, then qplot(a, a / b) will be a flat line.

For this example, we can also approximate the volume of a diamond, and use what we know about the density of diamonds.
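A sketch of the density idea in base R. The rows are invented, the volume is approximated crudely as the bounding box x * y * z, and the 25% tolerance around the median is an arbitrary assumption for illustration.

```r
# Hypothetical diamonds: carat weight and dimensions in mm.
d <- data.frame(
  carat = c(0.23, 0.21, 0.29, 1.00),
  x = c(3.95, 3.89, 4.20, 6.50),
  y = c(3.98, 3.84, 4.23, 6.40),
  z = c(2.43, 2.31, 2.63, 40.0)
)

# Derived variables: bounding-box volume and carats per cubic mm.
d$volume <- d$x * d$y * d$z
d$density <- d$carat / d$volume

# All diamonds are the same material, so density should be roughly
# constant. Flag rows that stray far from the typical value.
typical <- median(d$density)
bad <- abs(d$density / typical - 1) > 0.25

print(d[bad, ])
```

A single wrong dimension shows up clearly in the derived variable even when it is not extreme enough to stand out on its own.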


Your turn

Using what you know about the density of diamonds, clean this data up some more.


diamonds


diamonds


ad 0.007)
qplot(density, data = diamonds[!bad, ])
subset(diamonds, bad)
qplot(carat, density, data = diamonds[!bad, ])
qplot(y, x/y, data = diamonds[!bad, ])


Summary


Clean data is:

Rectangular
(observations in rows, variables in columns)
Consistent
Concise
Complete
Correct




This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
