Syllabus - Kosuke Imai's Homepage - Princeton University


POL 245: Visualizing Data<br />

Syllabus<br />

Summer 2014<br />

Jonathan Olmsted (Instructor)<br />

Bella Wang (Preceptor)<br />

Kosuke Imai (Course Head)<br />

Department of Politics, Princeton University<br />

In this course, we consider ways to illustrate compelling stories hidden in a blizzard of data. Equal parts<br />

art, programming, and statistical reasoning, data visualization is a critical tool for anyone doing analysis.<br />

In recent years, data analysis skills have become essential for those pursuing careers in policy advocacy and<br />

evaluation, business consulting and management, or academic research in the fields of education, health,<br />

medicine, and social science. This course introduces students to the powerful R programming language and<br />

the basics of creating data-analytic graphics in R. From there, we use real datasets to explore topics ranging<br />

from network data (like social interactions on Facebook or trade between countries) to geographical data<br />

(like county-level election returns in the US or the spatial distribution of insurgent attacks in Afghanistan).<br />

No prior background in statistics or programming is required or expected.<br />

Contact Information<br />

Jonathan Olmsted (Instructor): Corwin 029; jolmsted@princeton.edu; office hours Fri 2:00pm – 3:30pm<br />

Bella Wang (Preceptor): Corwin 026; bxw@princeton.edu; office hours Mon 2:00pm – 3:30pm<br />

Kosuke Imai (Course Head): Corwin 036; kimai@princeton.edu; office hours by appointment<br />

During our office hours we may be in either the office that is listed above or in Corwin 023 (just across<br />

the hall) which has space for more students. If these office hours do not fit your schedule, you should feel<br />

free to contact us directly via email. Also, do not forget about Piazza and QuantLab (see below) where<br />

you can ask questions and receive answers back immediately.<br />

Logistics<br />

Lectures. Monday and Wednesday, 9:00am–9:50am, Sherrerd Hall 101. Lecture slides will be posted on<br />

Blackboard by 10:00pm the night before the lecture. Students should bring their own printed copy of the<br />

slides to the lecture. The schedule during the first week deviates from this; details are below in the Course<br />

Outline section.<br />

Precepts. Tuesday and Thursday, 9:00am–10:20am, Friend Center 005 and 007. These are computer<br />

labs, and so there is no need to bring a personal laptop (although it is fine for you to do so). The schedule<br />

during the first week deviates from this; details are below in the Course Outline section.<br />

Review Sessions/Guest Speakers. Friday, 9:00am–10:20am, Visualization Laboratory in Lewis Library<br />

346. These sessions occur during the second week of FSI through the final week. The first four of<br />

these five sessions involve guest speakers from various industries where data visualization is used. The<br />

schedule during the first week deviates from this; details are below in the Course Outline section.<br />

Lunch with Guest Speaker. Friday, 12:00pm–1:30pm, Prospect House. Students will sign up to have<br />

lunch with one of the four guest speakers at the beginning of the course. During the selected week, students<br />

and the course team will meet with the guest speaker during a casual, catered lunch. The schedule during<br />

the first week deviates from this; details are below in the Course Outline section.<br />

Course Requirements<br />

• Precept participation (15%): There will be twelve precepts in the computer lab. Students must<br />

actively participate in these precepts. During each of the six weeks, there will be an assignment due<br />

before the first precept of that week. This assignment asks students to work on a small set of Precept<br />

Questions and submit their answers the night before the first precept. Details on these assignments<br />

are below. This is an individual assessment with limited collaboration.<br />

• Problem sets (40%): There will be four problem sets during the semester. These problem sets<br />

provide an opportunity for students to conduct analysis and visualization of a particular kind according<br />

to the week’s material. Each problem set will be equally weighted. Details on these assignments<br />

are below. This is an individual assessment with no collaboration.<br />

• Midterm Exam (25%): There will be a take-home midterm exam after the first three weeks of<br />

the course. This exam will provide students an opportunity to demonstrate proficiency in the introductory<br />

material that the subsequent topical weeks will build on. This is an individual assessment<br />

with no collaboration.<br />

• Final Project (20%): The final evaluation is a group data analysis project. Students will be<br />

assigned to groups. Each group will be assigned a dataset from those provided by the four guest<br />

speakers throughout the course. Using the assigned dataset, students will give a 5-minute presentation<br />

to the class focused on a compelling relationship or story from the dataset using no more than 3<br />

figures. A full set of details regarding the final project will be provided later in the course. This is<br />

a group assessment with collaboration allowed only within the assigned groups.<br />

Submission of Computer Code via Blackboard Folders<br />

For the Precept Questions on the Precept Handouts, answers to Problem Sets, and answers to<br />

the Midterm exam, students are required to turn in their computer code via assignment folders on<br />

Blackboard. For each assignment, you will submit your code to the appropriate folder as a single file<br />

named xxxHandoutX.R or xxxPSetX.R where xxx is your NetID and X is the handout/problem set number.<br />

For example, it might be kimaiHandout2.R or jolmstedPSet3.R for handout 2 or problem set 3.<br />

Precept Materials, Format, and Policy<br />

Precept Handouts. We will distribute a Precept Handout as preparation for the first precept of<br />

each week. It contains all the materials we expect you to learn before that session. Mastery of these<br />

materials is required for Problem Sets, Precept Worksheets, and the Midterm Exam. These<br />

handouts will be posted on Friday afternoon and are due the night before Tuesday’s precept. You will want<br />

paper copies of Precept Handouts to use as reference.<br />

Before coming to Tuesday’s precept. Before coming to Tuesday’s precept, you should work through<br />

the Precept Handout and answer the small set of programming questions, called Precept Questions,<br />

that are included in the handout. You are allowed to collaborate in only a limited fashion with others. See<br />

below for details. These questions are designed to check whether you understood the materials covered in<br />

the Precept Handout and facilitate problem-solving during precept sessions. You must turn in your<br />

computer code by 10:00 pm the night before precept (according to Submission of Computer Code).<br />

The assessment is graded only on a submitted-or-not basis, but only “good faith” submissions will count<br />

for full credit.<br />

During precept. The precept will begin by briefly reviewing the handout. Come to precept with<br />

questions if you have any. We will give special attention to issues that seemed problematic for students in<br />

the precept questions. The remainder of the precept time will then be spent working on the new material<br />

and solving Precept Exercises if time permits. Precept Worksheets will be distributed via Blackboard.<br />

You should work through these exercises on your own whether we work on them in the precept or<br />

not. These exercises are a good model for the kinds of problems asked on the Problem Sets and the<br />

Midterm Exam. Solutions to the Precept Exercises will be posted to Blackboard before the following<br />

precept. You will want paper copies of the Precept Worksheets as a reference.<br />

Problem Set/Midterm Exam Format<br />

Each week will end with the posting of either a problem set or the midterm exam. These assignments will<br />

be posted on Thursday afternoons via Blackboard.<br />

Submission. Hard copies of your problem sets must be turned in at the beginning of the Tuesday precept<br />

that they are due. Electronic submission via Blackboard of your computer code must also be done by the<br />

beginning of the Tuesday precept at 9:00am.<br />

Collaboration Policy<br />

The assignments in this course are designated as individual or group assessments. The degree of permissible<br />

collaboration depends on the kind of assignment:<br />

1. Ungraded problems/exercises<br />

2. Precept Questions from Precept Handouts<br />

3. Problem Sets and the Midterm Exam<br />

4. Final Project<br />

Assignment Types<br />

Ungraded problems. For ungraded problems, like those on the Precept Worksheets or old problem<br />

set questions, students are encouraged to interact with each other, the instruction team, and QuantLab tutors<br />

in discussing their approaches and solutions. This includes conceptual discussion and actual computer<br />

code. However, for all other assignments, this degree of collaboration is not appropriate!<br />

Precept Questions. Some collaboration on the Precept Questions from Precept Handouts is allowed.<br />

Students may discuss approaches to solving these problems with other students. They may also ask<br />

questions of QuantLab tutors, the instruction team, and on Piazza. However, when it comes time to write<br />

the code that is submitted for precept participation credit, all code must be each student’s own work.<br />

Problem Sets and the Midterm Exam. Problem Sets and the Midterm Exam do not allow<br />

for any collaboration with other students whatsoever. Students may direct clarifying questions about<br />

problem set and midterm questions to the instruction team (Jonathan Olmsted, Bella Wang, and Kosuke<br />

Imai) through Piazza. This allows all students to benefit from clarifications equally. Clarifying questions<br />

about the problem sets may not be asked of QuantLab tutors, however.<br />

Final Project. The Final Project is a group visualization project. Students may fully collaborate within<br />

their assigned groups, but may not discuss their group’s work or another group’s work with other students<br />

in the course.<br />

Plagiarism Policy<br />

Violations of the above collaboration policy will be treated as instances of plagiarism. This course will<br />

follow a modified version of the guidelines used for computer science classes here at Princeton. Please<br />

take these guidelines seriously. In the past, plagiarism cases have typically resulted in a one-year suspension from<br />

Princeton.<br />

Programming necessitates that you reach your own understanding of the problem and discover a path<br />

to its solution.<br />

Do not, under any circumstances, copy another person’s code. Incorporating someone<br />

else’s code into your program in any form is a violation of academic regulations. Abetting plagiarism or<br />

unauthorized collaboration by sharing your code is also prohibited. Sharing code in digital form is an<br />

especially egregious violation: do not e-mail your code to anyone.<br />

Novices often have the misconception that copying and mechanically transforming a program (by<br />

rearranging independent code, renaming variables, or similar operations) makes it something different.<br />

Actually, identifying plagiarized source code is easier than you might think. For example, there exists<br />

computer software that can detect plagiarism.<br />

This policy supplements the University’s academic regulations, making explicit what constitutes a<br />

violation for this course. Princeton’s Rights, Rules, Responsibilities handbook asserts:<br />

The only adequate defense for a student accused of an academic violation is that the work<br />

in question does not, in fact, constitute a violation. Neither the defense that the student was<br />

ignorant of the regulations concerning academic violations nor the defense that the student was<br />

under pressure at the time the violation was committed is considered an adequate defense.<br />

If you have any questions about these matters, please consult a member of the instruction team.<br />

Textbooks<br />

In addition to the Precept Handout and Precept Worksheets which serve as an introduction to,<br />

discussion of, and reference for the material in this course, there will be assigned readings each week. Some<br />

readings may be distributed (whether as a hard copy or via Blackboard) as appropriate. In the first few<br />

weeks as we learn statistical concepts, readings will come from:<br />

• Freedman, David, Robert Pisani, and Roger Purves. (2007). Statistics. W. W. Norton & Company.<br />

This book will be referred to as FPP hereafter.<br />

In addition, the following two books may be helpful references although they are completely optional<br />

for this self-contained course.<br />

• Verzani, John. (2005). Using R for Introductory Statistics. Chapman & Hall.<br />

• Verzani, John. (2011). Getting Started with RStudio. Safari Books Online (Series).<br />

Verzani (2005) is available on reserve at Firestone Library: http://goo.gl/tYj5B. Verzani (2011)<br />

is available online through the Princeton University Library website: http://goo.gl/zKLqzj. You can<br />

access it from any computer on the Princeton network.<br />

Finally, at the beginning of the Precept Handouts and Precept Worksheets there will be<br />

references to any corresponding Verzani (2005) sections. In addition, throughout the materials, there will<br />

be pointers to specific sections in Verzani (2011). When you see these Verzani (2011) references, you should<br />

read the corresponding (brief) section online.<br />

Statistical Software<br />

In this course, we use the open-source statistical software R (http://www.r-project.org). Recently, a<br />

New York Times article (“Data Analysts Captivated by R’s Power” January 6, 2009) featured R as,<br />

a popular programming language used by a growing number of data analysts inside corporations<br />

and academia. It is becoming their lingua franca [...] whether being used to set ad prices, find<br />

new drugs more quickly or fine-tune financial models. Companies as diverse as Google, Pfizer,<br />

Merck, Bank of America, the InterContinental Hotels Group and Shell use it. [...] “The great<br />

beauty of R is that you can modify it to do all sorts of things,” said Hal Varian, chief economist<br />

at Google. “And you have a lot of prepackaged stuff that’s already available, so you’re standing<br />

on the shoulders of giants.”<br />

R can be more powerful than other statistical software such as SPSS, STATA and SAS, but it can also<br />

be more difficult to learn. A variety of resources will be made available for POL 245 students in order to<br />

learn R as efficiently as possible.<br />

To help make using R easier, we’ll be using RStudio (http://www.rstudio.com/)—a user-interface<br />

that simplifies many common operations.<br />

Computing Resources<br />

In the event that you need to install R and RStudio on your own personal laptop, this will be covered<br />

in Precept Handout 0 which is due before the first lecture. However, many students may find the<br />

additional computer clusters around campus convenient to use. In these cases, the R and RStudio software<br />

is already installed on the machines in several computer clusters.<br />

• Frist 200-level computer clusters<br />

• Butler D033<br />

Be sure to check the hours of the building’s availability before making plans to use a specific location.<br />

Getting Help<br />

POL 245 will be a challenging course, and several additional resources are available to facilitate efficient<br />

learning about data analysis and data visualization. We encourage you to take advantage of them whenever<br />

you have questions about the course materials or are struggling with problem sets.<br />

• QuantLab: In addition to the instruction team, there are QuantLab tutors available who can<br />

help you work through the course materials. The tutors are Politics graduate students who have<br />

experience with R and statistics. Unless otherwise noted, QuantLab will be held on Sunday, Monday,<br />

and Wednesday nights from 7:00pm – 10:00pm. Shorter sessions of QuantLab will also be held on<br />

Tuesday and Thursday afternoons from 2:00pm – 3:30pm. Tutoring sessions will be held in Butler<br />

College D026 and D027.<br />

• Office hours: Each of us will hold weekly office hours, starting the first week. In addition to these<br />

office hours, we have an open door policy—you can stop by our Corwin offices at any time without<br />

an appointment and we will answer your questions so long as we are available. You may also e-mail<br />

to set up an appointment with any of us outside of our office hours.<br />

• Discussion board: In addition to office hours and individual appointments, we will be available<br />

online to answer any questions you may have about the course materials and the problem sets. We<br />

use the Piazza discussion forum, which will be linked on the Blackboard course page or accessible directly<br />

at http://piazza.com. Before posting your question, please review previous posts to make sure that<br />

a similar question has not been answered. In accordance with the collaboration policy described above,<br />

you should not directly post your code for a problem set. You should frame your questions in general<br />

terms rather than trying to have us debug your code directly. You may subscribe to the Discussion<br />

Forum so that you receive your fellow students’ questions and answers to those questions. You<br />

should also feel free to respond to questions that you can answer. Piazza also has a free smartphone<br />

application if you are interested.<br />

Additional Accommodations<br />

Students must register with the Office of Disability Services (ODS) (ods@princeton.edu; 609-258-8840) for<br />

disability verification and determination of eligibility for reasonable academic accommodations. Requests<br />

for academic accommodations for this course need to be made at the beginning of the semester, or as soon<br />

as possible for newly approved students, and again at least two weeks in advance of any needed accommodations<br />

in order to make arrangements to implement the accommodations. Please make an appointment<br />

to meet with me in order to maintain confidentiality in addressing your needs. No accommodations will<br />

be given without authorization from ODS, or without advance notice.<br />

Course Outline<br />

Week 1: July 14 – July 18, Causality<br />

During Week 1, we will learn how to make causal claims from datasets. We learn the distinction between<br />

randomized experiments and observational studies. In addition, we introduce R, programming with data,<br />

and familiarize students with basic skills in data analysis by considering experimental data. These tools<br />

will be applied to topics like evaluating strategies for increasing voter turnout and the effect of class size<br />

on educational achievement.<br />

Note, the first session on Tuesday, July 15 will be a lecture in Sherrerd Hall 101 and not a precept.<br />

The second session on Wednesday, July 16 will be a precept in the Friend Center. The third session on<br />

Thursday, July 17 will be a lecture in Sherrerd Hall. And, for Friday, July 18 we will hold a precept in the<br />

Friend Center. These changes affect only the first week of the course.<br />

Handout 0: posted Monday, Jul. 14; due Tuesday, Jul. 15 (no submission)<br />

Handout 1-A: posted Tuesday, Jul. 15; due Wednesday, Jul. 16<br />

Handout 2-A: posted Friday, Jul. 18<br />

Problem Set 1: posted Thursday, Jul. 17<br />

Required Readings:<br />

FPP, Chapter 1: Controlled Experiments.<br />

FPP, Chapter 2: Observational Studies.<br />

FPP, Chapter 3: The Histogram.<br />

FPP, Chapter 4: The Average and the Standard Deviation.<br />

(only Sections 4.1, 4.2, and 4.3 for this week)<br />

Week 2: July 21 – July 25, Clustering<br />

During Week 2, we will further learn methods for describing data and how to identify “clusters” of data.<br />

The main example we discuss is ideological polarization in the United States. In addition, we will look at<br />

the distribution of income, democracy, and natural resource endowments across different countries.<br />

Handout 2-A: due Monday, Jul. 21<br />

Handout 3-A: posted Friday, Jul. 25<br />

Problem Set 1: due Tuesday, Jul. 22<br />

Problem Set 2: posted Thursday, Jul. 24<br />

Required Readings:<br />

FPP, Chapter 4: The Average and the Standard Deviation.<br />

(only Sections 4.4, 4.5, 4.6, and 4.7 for this week)<br />

FPP, Chapter 7: Plotting Points and Lines.<br />

Week 3: July 28 – August 1, Prediction and Forecasting<br />

In Week 3, we continue extending general R programming, data-analytic, and data visualization skills.<br />

Forecasting federal and state-level electoral outcomes will be used as a substantive motivation for these<br />

new tools.<br />

Handout 3-A: due Monday, Jul. 28<br />

Handout 4-A: posted Friday, Aug. 1<br />

Problem Set 2: due Tuesday, Jul. 29<br />

Midterm Exam: posted Thursday, Jul. 31<br />

Required Readings:<br />

FPP, Chapter 8: Correlation.<br />

FPP, Chapter 9: More about Correlation.<br />

FPP, Chapter 10: Regression.<br />

FPP, Chapter 11: The R.M.S. Error for Regression.<br />

FPP, Chapter 12: The Regression Line.<br />

Week 4: August 4 – August 8, Text as Data<br />

For Week 4, we start to cover how text can be analyzed as data. These methods allow researchers to process<br />

large amounts of unstructured text and extract meaningful measures of politically important concepts like<br />

partisanship. We use these methods to analyze the Federalist Papers and the press releases of legislators.<br />

Handout 4-A: due Monday, Aug. 4<br />

Handout 5-A: posted Friday, Aug. 8<br />

Midterm Exam: due Tuesday, Aug. 5<br />

Problem Set 3: posted Thursday, Aug. 7<br />

Week 5: August 11 – August 15, Network Data<br />

Week 5 focuses on analyzing data with a network structure. With the rise of social networks, network data<br />

is increasingly common. We use algorithms to describe the networks generated by bill (co-)sponsorship in<br />

Congress and the citation of Supreme Court opinions.<br />

Handout 5-A: due Monday, Aug. 11<br />

Handout 6-A: posted Friday, Aug. 15<br />

Problem Set 3: due Tuesday, Aug. 12<br />

Problem Set 4: posted Thursday, Aug. 14<br />

Week 6: August 18 – August 22, Maps and Spatial Data<br />

During the final week, Week 6, we introduce mapping and visualizing geo-coded data. With this, we will<br />

be able to answer questions about how US voters separate themselves by ideology or to what extent the<br />

country is split into “red” or “blue” states.<br />

Handout 6-A: due Monday, Aug. 18<br />

Problem Set 4: due Tuesday, Aug. 19<br />

Final Project: posted Thursday, Aug. 21; due Wednesday, Aug. 27<br />

POL 245 Schedule<br />

Due to the condensed nature of the course, the semester and weekly schedules (including assignments) can<br />

seem very full. To help with this, we have a Google Calendar with events, due dates, and sessions. The<br />

URL for this calendar is http://goo.gl/D3pKXL. We also outline a typical week of assignments in POL<br />

245 below:<br />

Sun: 7:00pm-10:00pm QuantLab

Mon: 9:00am-9:50am Lecture; 7:00pm-10:00pm QuantLab; 10:00pm Submit Handout code

Tue: 9:00am Submit Problem Set code; 9:00am Submit Problem Set write-up; 9:00am-10:20am Precept; 2:00pm-3:30pm QuantLab

Wed: 9:00am-9:50am Lecture; 7:00pm-10:00pm QuantLab

Thur: 9:00am-10:20am Precept; 12:00pm New Problem Set posted; 2:00pm-3:30pm QuantLab

Fri: 9:00am-10:20am Guest Speaker; 12:00pm New Handout posted

Sat: no scheduled events

Guest Speakers<br />

For four of the six weeks during the course, we will have guest speakers during our Friday sessions. In<br />

addition to their presentation from 9:00am – 10:20am in the Lewis Library Visualization Lab, they will be<br />

on campus for lunch with a small group of students. Students will sign up for lunch with a specific speaker<br />

at the beginning of the course.<br />

FSI Week 2 (7/25): Archie Tse, Deputy Graphics Director, New York Times. “Telling Stories with Data”

FSI Week 3 (8/1): Eurry Kim, Advertising Researcher, Facebook. “Visualizing the Social TV Networks of Facebook”

FSI Week 4 (8/8): Bruce Willsie, President/CEO, L2. “Big Data, High Speeds, New Insights: How High Speed Data Visualization is Changing Politics”

FSI Week 5 (8/15): Sondra Scott, Head of Macro and Industry Trends, Wood McKenzie Ltd. “Energy Fundamentals Analysis: How data drives insight”
