Data cleaning - Hadley Wickham

July 2010 

Friday, 9 July 2010 

Data cleaning 

Hadley Wickham 

Assistant Professor / Dobelman Family Junior Chair 

Department of Statistics / Rice University


1. Intro to data cleaning 

2. What you can fix 

3. What you can’t fix


Data cleaning

“Happy families are all alike; every unhappy family is unhappy in its own way.”

—Leo Tolstoy 

“Clean datasets are all alike; every messy dataset is messy in its own way.”

—Hadley Wickham 


No magic bullet 

No sequence of steps will work in every case.

You will often need to clean the data repeatedly, as you discover new problems during the analysis process.

You know the basic tools; I’ll show you some new ways of applying them.


What you can fix:


Rectangular


Observations in rows

Variables in columns

library(reshape) 

library(stringr) 

options(stringsAsFactors = FALSE) 

note_raw


Your turn 

What are the variables in country_population.csv and tb_notification.csv? How are they arranged in rows and columns? Can you form the variables into two groups?

(Hint: you might need to look at the data dictionary.)


Identifier variable        Measured variable
Index of random variable   Random variable
Dimension                  Measure
Experimental design        Measurement
Key                        Value


Molten data 

Molten data keeps each id variable in its own column and stacks all measured variables into two columns: one naming the variable, one holding its value. Sometimes called “long” form.

Easy to pour into new shapes, graphics and models.

reshape::melt(data, id.vars, measure.vars)
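
A minimal sketch of melting, assuming the reshape package loaded earlier is installed. The data frame and its column names (country, year, cases_m, cases_f) are invented for illustration and are not taken from the TB files:

```r
library(reshape)

# Hypothetical wide data: one column per measured variable.
wide <- data.frame(
  country = c("A", "B"),
  year    = c(2008, 2008),
  cases_m = c(10, 20),
  cases_f = c(15, 25),
  stringsAsFactors = FALSE
)

# id.vars identify each observation; measure.vars are stacked into
# a `variable` column (the name) and a `value` column (the number).
molten <- melt(wide,
               id.vars = c("country", "year"),
               measure.vars = c("cases_m", "cases_f"))

molten
```

Each original row contributes one molten row per measured variable, so two rows with two measured variables give four rows.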


Your turn 

Melt the data. What are the id variables present in this data? What are the measured variables?

note

Implicit vs. explicit missings

               Male   Female
Pregnant                10
Not-pregnant    20      15

Sex      Pregnant   Count
Male     No         20
Female   Yes        10
Female   No         15
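
The long table above simply has no row for the Male / Yes combination, so that missing value is implicit. One way to make it explicit, sketched in base R (the object names are hypothetical), is to build every combination of the id variables and join the counts onto it:

```r
# Long-form counts, with the Male / Yes combination absent.
counts <- data.frame(
  sex      = c("Male", "Female", "Female"),
  pregnant = c("No", "Yes", "No"),
  count    = c(20, 10, 15),
  stringsAsFactors = FALSE
)

# Every combination of the id variables.
all_cells <- expand.grid(
  sex      = c("Male", "Female"),
  pregnant = c("Yes", "No"),
  stringsAsFactors = FALSE
)

# Keep every combination; unmatched ones get count = NA.
complete <- merge(all_cells, counts, all.x = TRUE)
complete
```

Whether the new cell should stay NA or become 0 depends on what the data means, so that choice is left to the analyst.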


Your turn 

What’s missing? How does this data set compare to our original list of variables? Think about the operations you’d need to perform to create any extra pieces.

note$variable

# Combine into one file 

rate

# Why this form? 

# Because it works well with all our existing tools 

# Easy to summarise 

totals
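
One way a summary of molten data might be computed, sketched here with base R's aggregate() rather than the reshape tools the slides rely on. The data frame and column names are hypothetical:

```r
# Hypothetical molten data: one row per (country, variable) pair.
molten <- data.frame(
  country  = c("A", "A", "B", "B"),
  variable = c("cases_m", "cases_f", "cases_m", "cases_f"),
  value    = c(10, 15, 20, 25),
  stringsAsFactors = FALSE
)

# Sum the value column within each country.
country_totals <- aggregate(value ~ country, data = molten, FUN = sum)
country_totals
```

Because every measurement sits in the same value column, the same one-line call works no matter how many measured variables there are.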


Concise (each fact represented once)

Consistent (if repeated, each fact is the same)


TB 

This data is already very concise, and therefore it’s also consistent.

Did you spot the exception?

note_total


In general

Whenever there is inconsistency, you are going to have to make some tradeoff to ensure concision.

Detecting inconsistency is not always easy, but you now have the basic tools.
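
As a small illustration of detecting inconsistency, assume a data frame that stores both the parts and a recorded total, so the same fact appears twice. The column names and values below are made up:

```r
# Hypothetical data: the total is stored alongside its parts.
note <- data.frame(
  country = c("A", "B", "C"),
  cases_m = c(10, 20, 30),
  cases_f = c(5, 15, 25),
  total   = c(15, 35, 60),   # the last total does not match its parts
  stringsAsFactors = FALSE
)

# Recompute the total from the parts and compare with the recorded one.
recomputed   <- note$cases_m + note$cases_f
inconsistent <- note$total != recomputed

note[inconsistent, ]
```

The check only flags the disagreement; deciding which copy of the fact to trust is the tradeoff described above.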


What you can’t fix:


Complete 

Correct


Correct 

You can’t restore correct values without the original data, but you can remove clearly incorrect values.

Options: 

Remove entire row 

Mark incorrect value as missing


What is a missing value?

In R, a missing value is written as NA. It has special behaviour:

NA + 3 = ? 

NA > 2 = ? 

mean(c(2, 7, 10, NA)) = ? 

NA == NA ? 

Use is.na() to see if a value is NA 

Many functions have na.rm argument
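
The questions above can be checked directly at the R console. A short sketch using only base R:

```r
# Arithmetic and comparisons involving NA give NA.
na_sum     <- NA + 3                   # NA
na_greater <- NA > 2                   # NA
na_mean    <- mean(c(2, 7, 10, NA))    # NA
na_equal   <- NA == NA                 # NA, not TRUE

# Use is.na() rather than == to test for missingness.
found <- is.na(c(2, 7, 10, NA))        # FALSE FALSE FALSE TRUE

# na.rm = TRUE drops missing values before computing.
clean_mean <- mean(c(2, 7, 10, NA), na.rm = TRUE)
```

Here clean_mean is the mean of 2, 7 and 10 only.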


General strategy 

To find incorrect values you need to be creative, combining graphics and data processing.

I’m going to try and give you a flavour of this process with another data set.


Diamonds data 

~54,000 round diamonds from http://www.diamondse.info/

Carat, colour, clarity, cut

Total depth, table, depth, width, height

Price


[Figure: diagram of a diamond with x, z, and table width labelled]

depth = z / diameter

table = table width / x * 100


Your turn 

Look at histograms and scatterplots of x, y, z from the diamonds dataset (it’s included in ggplot2).

Which values are clearly incorrect? Which values might we be able to correct?


Plots 

qplot(x, data = diamonds, binwidth = 0.1) 

qplot(y, data = diamonds, binwidth = 0.1) 

qplot(z, data = diamonds, binwidth = 0.1) 

qplot(x, y, data = diamonds) 

qplot(x, z, data = diamonds) 

qplot(y, z, data = diamonds)

y_big <- diamonds$y > 10

z_big <- diamonds$z > 6

x_zero <- diamonds$x == 0


Your turn 

Fix the incorrect values and replot scatterplots of x, y, and z. Are all the unusual values gone?

diamonds$x[x_zero]
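
One possible way to carry out the "mark incorrect value as missing" option, sketched on a small made-up data frame so it runs without ggplot2. The cut-offs 10 and 6 follow the y_big and z_big fragments above; the data values and the toy name are assumptions:

```r
# Toy measurements in the style of the diamonds x, y, z columns.
toy <- data.frame(
  x = c(3.95, 0.00, 4.05, 5.20),
  y = c(3.98, 4.10, 58.9, 5.25),
  z = c(2.43, 2.50, 2.48, 31.8)
)

# Flag values that cannot be right: zero widths and implausibly large sizes.
x_zero <- toy$x == 0
y_big  <- toy$y > 10
z_big  <- toy$z > 6

# Replace only the flagged cells, keeping the rest of each row.
toy$x[x_zero] <- NA
toy$y[y_big]  <- NA
toy$z[z_big]  <- NA

toy
```

Setting single cells to NA keeps the other measurements in the row, which removing the entire row would throw away.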


New variables 

When cleaning, derived variables will be very important.

e.g. if qplot(a, b) is a straight line through the origin, then qplot(a, a / b) will be a flat line.
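
A small numeric illustration of why the ratio helps, in base R. The vectors, the planted error and the 0.1 tolerance are all hypothetical:

```r
a <- 1:10
b <- 2 * a        # b is proportional to a, so a / b is constant
b[7] <- 1.4       # a planted error: a decimal-point slip for 14

# On the ratio scale every good point sits at the same level.
ratio <- a / b

# Points far from the typical ratio are suspects.
suspect <- which(abs(ratio - median(ratio)) > 0.1)
suspect
```

In a plot of a against a / b the good points form a flat line and the planted error sits far above it, which is easier to see than a small deviation from a sloped line.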

For this example, we can also approximate the volume of a diamond, and use what we know about the density of diamonds.


Your turn 

Using what you know about the density of diamonds, clean this data up some more.

diamonds

diamonds

ad 0.007) 

qplot(density, data = diamonds[!bad, ]) 

subset(diamonds, bad) 

qplot(carat, density, data = diamonds[!bad, ]) 

qplot(y, x/y, data = diamonds[!bad, ]) 
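
A hedged sketch of the density idea on a toy data frame, so it runs without ggplot2. Volume is approximated by the bounding box x * y * z, and carat divided by that volume stands in for density. The upper cut-off 0.007 appears in the fragment above; the lower cut-off 0.005 and the data values are assumptions made for illustration:

```r
# Toy rows in the style of the diamonds data.
toy <- data.frame(
  carat = c(0.23, 0.21, 1.00, 0.50),
  x     = c(3.95, 3.89, 6.50, 5.10),
  y     = c(3.98, 3.84, 6.50, 5.10),
  z     = c(2.43, 2.31, 1.00, 31.8)
)

# Bounding-box volume and carat per unit volume.
toy$volume  <- toy$x * toy$y * toy$z
toy$density <- toy$carat / toy$volume

# Flag rows whose density is implausibly low or high.
# 0.007 is from the slide; 0.005 is an assumed lower bound.
bad <- toy$density < 0.005 | toy$density > 0.007

toy[bad, ]
```

All diamonds are made of the same material, so carat per unit volume should be nearly constant, and rows far from the bulk point to a mistyped dimension or weight.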


Summary


Clean data is: 

Rectangular (observations in rows, variables in columns)

Consistent 

Concise 

Complete 

Correct

This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
