11.08.2013 Views

Introduction to plyr - Hadley Wickham

Introduction to plyr - Hadley Wickham

Introduction to plyr - Hadley Wickham

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

July 2010<br />

Friday, 9 July 2010<br />

<strong>Introduction</strong> <strong>to</strong> <strong>plyr</strong><br />

<strong>Hadley</strong> <strong>Wickham</strong><br />

Assistant Professor / Dobelman Family Junior Chair<br />

Department of Statistics / Rice University


Friday, 9 July 2010<br />

1. US baby names data<br />

2. Slice & dice revision<br />

3. Group-wise operation<br />

4. Challenges<br />

5. More <strong>plyr</strong> functions, including an<br />

example of llply and ldply


Friday, 9 July 2010<br />

Baby names<br />

Top 1000 male and female baby<br />

names in the US, from 1880 <strong>to</strong><br />

2008.<br />

258,000 records (1000 * 2 * 129)<br />

But only four variables: year,<br />

name, sex and prop.<br />

CC BY http://www.flickr.com/pho<strong>to</strong>s/the_light_show/2586781132


Friday, 9 July 2010<br />

Getting started<br />

library(<strong>plyr</strong>)<br />

library(stringr)<br />

options(stringsAsFac<strong>to</strong>rs = FALSE)<br />

bnames


head(bnames, 15)<br />

year name percent sex<br />

1 1880 John 0.081541 boy<br />

2 1880 William 0.080511 boy<br />

3 1880 James 0.050057 boy<br />

4 1880 Charles 0.045167 boy<br />

5 1880 George 0.043292 boy<br />

6 1880 Frank 0.027380 boy<br />

7 1880 Joseph 0.022229 boy<br />

8 1880 Thomas 0.021401 boy<br />

9 1880 Henry 0.020641 boy<br />

10 1880 Robert 0.020404 boy<br />

11 1880 Edward 0.019965 boy<br />

12 1880 Harry 0.018175 boy<br />

13 1880 Walter 0.014822 boy<br />

14 1880 Arthur 0.013504 boy<br />

15 1880 Fred 0.013251 boy<br />

Friday, 9 July 2010<br />

> tail(bnames, 15)<br />

year name percent sex<br />

257986 2008 Neveah 0.000130 girl<br />

257987 2008 Amaris 0.000129 girl<br />

257988 2008 Hadassah 0.000129 girl<br />

257989 2008 Dania 0.000129 girl<br />

257990 2008 Hailie 0.000129 girl<br />

257991 2008 Jamiya 0.000129 girl<br />

257992 2008 Kathy 0.000129 girl<br />

257993 2008 Laylah 0.000129 girl<br />

257994 2008 Riya 0.000129 girl<br />

257995 2008 Diya 0.000128 girl<br />

257996 2008 Carleigh 0.000128 girl<br />

257997 2008 Iyana 0.000128 girl<br />

257998 2008 Kenley 0.000127 girl<br />

257999 2008 Sloane 0.000127 girl<br />

258000 2008 Elianna 0.000127 girl


Friday, 9 July 2010<br />

Slicing<br />

and dicing


Friday, 9 July 2010<br />

subset(df, subset)<br />

Revision<br />

transform(df, var1 = expr1, ...)<br />

summarise(df, var1 = expr1, ...)<br />

arrange(df, var1, ...)


Friday, 9 July 2010<br />

Your turn<br />

Extract your name from the dataset. Plot<br />

the trend over time.<br />

What geom should you use? Do you<br />

need any extra aesthetics?


hadley


Friday, 9 July 2010<br />

Brains<strong>to</strong>rm<br />

Thinking about the data, what are some<br />

of the trends that you might want <strong>to</strong><br />

explore? What additional variables would<br />

you need <strong>to</strong> create? What other data<br />

sources might you want <strong>to</strong> use?<br />

Pair up and brains<strong>to</strong>rm for 2 minutes.


Friday, 9 July 2010<br />

Some of my ideas<br />

• First/last letter<br />

• Length<br />

• Number/percent<br />

of vowels<br />

• Biblical names?<br />

• Hurricanes?<br />

• Rank<br />

• Ecdf (how many<br />

babies have a<br />

name in the <strong>to</strong>p<br />

2, 3, 5, 100 etc)


letter


names


Friday, 9 July 2010<br />

Create a new variable that contains the<br />

first three (or four, or five) letters of each<br />

name. How many names start the same<br />

as yours? Plot the trend over time.<br />

Remember <strong>to</strong> use the group aesthetic if<br />

necessary.<br />

Your turn


names$first3


Friday, 9 July 2010<br />

Per-group<br />

operations


Friday, 9 July 2010<br />

What about group-wise transformations<br />

or summaries? e.g. what if we want <strong>to</strong><br />

compute the rank of a name within a sex<br />

and year?<br />

Group-wise<br />

This task is easy if we have a single year<br />

& sex, but hard otherwise.


Friday, 9 July 2010<br />

What about group-wise transformations<br />

or summaries? e.g. what if we want <strong>to</strong><br />

compute the rank of a name within a sex<br />

and year?<br />

Group-wise<br />

This task is easy if we have a single year<br />

& sex, but hard otherwise.<br />

Take two minutes <strong>to</strong> sketch out an approach


one


# Split<br />

pieces


x y<br />

a 2<br />

a 4<br />

b 0<br />

b 5<br />

c 5<br />

c 10<br />

Friday, 9 July 2010


x y<br />

a 2<br />

a 4<br />

b 0<br />

b 5<br />

c 5<br />

c 10<br />

Friday, 9 July 2010<br />

Split<br />

x y<br />

a 2<br />

a 4<br />

x y<br />

b 0<br />

b 5<br />

x y<br />

c 5<br />

c 10


x y<br />

a 2<br />

a 4<br />

b 0<br />

b 5<br />

c 5<br />

c 10<br />

Friday, 9 July 2010<br />

Split<br />

x y<br />

a 2<br />

a 4<br />

x y<br />

b 0<br />

b 5<br />

x y<br />

c 5<br />

c 10<br />

Apply<br />

3<br />

2.5<br />

7.5


x y<br />

a 2<br />

a 4<br />

b 0<br />

b 5<br />

c 5<br />

c 10<br />

Friday, 9 July 2010<br />

Split<br />

x y<br />

a 2<br />

a 4<br />

x y<br />

b 0<br />

b 5<br />

x y<br />

c 5<br />

c 10<br />

Apply<br />

3<br />

2.5<br />

7.5<br />

Combine<br />

x y<br />

a 2<br />

b 2.5<br />

c 7.5


# Or equivalently<br />

bnames


# Or equivalently<br />

bnames


Friday, 9 July 2010<br />

In a similar way, we can use ddply() for<br />

group-wise summaries.<br />

There are many base R functions for<br />

special cases. When available, they can<br />

be much faster, but you have <strong>to</strong> know<br />

they exist, and have <strong>to</strong> remember how <strong>to</strong><br />

use them.<br />

Summaries


Friday, 9 July 2010<br />

ddply + transform =<br />

group-wise transformation<br />

ddply + summary =<br />

per-group summaries<br />

ddply + subset =<br />

per-group subsets


Friday, 9 July 2010<br />

Challenges


Friday, 9 July 2010<br />

Warmups<br />

Which names were most popular in 1999?<br />

Work out the average usage for each<br />

name.<br />

List the all time <strong>to</strong>p 10 names.


# Which names were most popular in 1999?<br />

subset(bnames, year == 1999 & rank == 1)<br />

subset(bnames, year == 1999 & prop == max(prop))<br />

# Average usage<br />

overall


Friday, 9 July 2010<br />

Challenge 1<br />

For each name, find the year in which it<br />

was most popular, and the rank in that<br />

year. (Hint: you might find which.max<br />

useful).<br />

Print all names that have been the most<br />

popular name at least once.


most_pop


Friday, 9 July 2010<br />

Challenge 2<br />

What name has been in the <strong>to</strong>p 10 most<br />

often?<br />

(Hint: you'll have <strong>to</strong> do this in three steps.<br />

Think about what they are before starting)


<strong>to</strong>p10


Friday, 9 July 2010<br />

More <strong>plyr</strong><br />

functions


Friday, 9 July 2010<br />

Many problems involve splitting up a large<br />

data structure, operating on each piece<br />

and joining the results back <strong>to</strong>gether:<br />

split-apply-combine


Friday, 9 July 2010<br />

How you split up depends on the type of<br />

input: arrays, data frames, lists<br />

How you combine depends on the type of<br />

output: arrays, data frames, lists,<br />

nothing


array<br />

data frame<br />

list<br />

n replicates<br />

function<br />

arguments<br />

Friday, 9 July 2010<br />

array data frame list nothing<br />

aaply adply alply a_ply<br />

daply ddply dlply d_ply<br />

laply ldply llply l_ply<br />

raply rdply rlply r_ply<br />

maply mdply mlply m_ply


array<br />

data frame<br />

list<br />

n replicates<br />

function<br />

arguments<br />

Friday, 9 July 2010<br />

array data frame list nothing<br />

apply adply alply a_ply<br />

daply aggregate by d_ply<br />

sapply ldply lapply l_ply<br />

replicate rdply replicate r_ply<br />

mapply mdply mapply m_ply


Friday, 9 July 2010<br />

Case study


Friday, 9 July 2010<br />

Each country has its own GSHS spss file.<br />

How can we easily combine them all<br />

<strong>to</strong>gether?<br />

GSHS<br />

Run the following R code and try and<br />

figure out what each line does.<br />

Look at 04-gshs.r for answers.


library(foreign)<br />

library(<strong>plyr</strong>)<br />

library(stringr)<br />

files


Friday, 9 July 2010<br />

Resources


Friday, 9 July 2010<br />

Resources<br />

http://had.co.nz/<strong>plyr</strong>/<strong>plyr</strong>-<br />

intro-090510.pdf<br />

http://groups.google.com/group/<br />

manipulatr/


Friday, 9 July 2010


This work is licensed under the Creative<br />

Commons Attribution-Noncommercial 3.0 United<br />

States License. To view a copy of this license,<br />

visit http://creativecommons.org/licenses/by-nc/<br />

3.0/us/ or send a letter <strong>to</strong> Creative Commons,<br />

171 Second Street, Suite 300, San Francisco,<br />

California, 94105, USA.<br />

Friday, 9 July 2010

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!