Introduction to Panel Data Analysis


Youngki Shin 

Department of Economics 

Email: yshin29@uwo.ca 

Statistics and Data Series at Western 

November 21, 2012 


Motivation

More observations mean more information.

More observations with a certain structure mean much more information: pooled cross sections and panel data.

How can we extract additional information from pooled cross sections or panel data?

Example

Effect of an Incinerator on Housing Prices

With cross-sectional data from 1981, we have

\[ \widehat{rprice} = \underset{(3{,}093.0)}{101{,}307.5} - \underset{(5{,}827.71)}{30{,}688.27}\, nearinc, \qquad n = 142. \]

With another cross section from 1978, when there was no incinerator, we have

\[ \widehat{rprice} = \underset{(2{,}653.79)}{82{,}517.23} - \underset{(4{,}744.59)}{18{,}824.37}\, nearinc, \qquad n = 179. \]

Therefore, the true effect of the incinerator is not −30,688.27 but

−30,688.27 − (−18,824.37) = −11,863.90.
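The arithmetic above can be checked in a few lines. This is only a sketch: the two slope values are the ones reported on the slide, while the variable names are made up for illustration.

```python
# Slope on nearinc from each cross-sectional regression (values from the slide).
gamma_1981 = -30688.27  # after the incinerator was built
gamma_1978 = -18824.37  # before the incinerator existed

# Difference-in-differences: the 1981 price gap net of the pre-existing 1978 gap.
did_effect = gamma_1981 - gamma_1978
print(round(did_effect, 2))
```

The 1978 gap captures the fact that houses near the site were already cheaper, so only the change in the gap is attributed to the incinerator.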

Outline 

Data Structure 

Policy Evaluation with Pooled Cross Sections 

Three Approaches in Panel Data Estimation 

First Difference (FD) Estimator 

Fixed Effect (FE) Estimator 

Random Effect (RE) Estimator 

Empirical Application: Smoking on Birth Outcomes 

Concluding Remarks 



Data Structure

A set of pooled cross sections is obtained by sampling randomly from a large population at different time points.

A (typical) panel data set follows the same individuals over time.

For example, suppose that I sample three individuals from this room at two time points:

Time     Pooled                 Panel
t = 1    John, Jane, Evelyn     Eric, Andrew, Rachel
t = 2    Kyle, Justin, Lisa     Eric, Andrew, Rachel


A Snapshot of Data

Table: Pooled Data

year    rprice    nearinc    y81
1978    60000     0          0
1978    54000     1          0
1978    38000     1          0
...
1981    82000     1          1
1981    52000     0          1
1981    97000     0          1

Table: Panel Data

id    year    inf    unem
12    1950    7.3    3.5
12    1951    9.1    2.7
16    1950    5.3    5.4
16    1951    4.6    6.7
...
43    1950    7.1    4.2
43    1951    8.5    3.2
47    1950    6.7    5.4
47    1951    2.6    9.4


There are also very useful panel structures other than the individual-time combination.

1 Twins data: i indexes the pair of twins, and t indexes the individual within the pair. This controls for unobserved genetic factors.

2 School data: students are sampled from many schools (or classrooms). Then i indexes the school, and t indexes the student in school i.


Examples of pooled cross sections:

Current Population Survey (CPS), USA

Examples of panel data:

Labor Market Activity Survey (LMAS), Canada

Panel Study of Income Dynamics (PSID), USA

National Longitudinal Survey of Youth (NLSY), USA

A time series of provincial (or country) level data, e.g. the inflation and unemployment rates of 50 countries in 1950–2010.

It is usually easier to collect pooled cross sections than panel data.



Difference-in-Difference Estimator 

Terminology: 

Treatment Group: those who are affected by a policy (a treatment) 

Control Group: those who are not. 

The object of policy evaluation is to measure the (mean) difference of 

outcomes between the treatment group and the control group. This 

measure is also called the average treatment effect. 

Consider that you are testing the effect of a new drug. How can you 

design the experiment? Randomization. 

Recall the incinerator and housing prices example. Is randomization 

possible? 




Consider an example of a drug test:

\[ blprs_i = \beta_0 + \beta_1 treat_i + u_i \]

If you randomized the control and treatment groups well, i.e. $Cov(treat_i, u_i) = 0$, then you can estimate the effect of the drug from a single cross section.

In policy evaluation in the social sciences, $treat_i$ and $u_i$ are easily correlated:

\[ \log(wage_i) = \beta_0 + \beta_1 jbtrn_i + u_i \]



Pooled cross sections help us evaluate the policy effect correctly by measuring the difference twice (before and after the policy implementation).

Recall the two regressions in the incinerator example:

\[ rprice = \gamma_0 + \gamma_1 nearinc + u \quad \text{in years 1978 and 1981} \]

\[ \hat\delta_1 = \hat\gamma_{1,81} - \hat\gamma_{1,78} = \left( \overline{rprice}_{81,nr} - \overline{rprice}_{81,fr} \right) - \left( \overline{rprice}_{78,nr} - \overline{rprice}_{78,fr} \right) \]

Here nr and fr denote houses near to and far from the incinerator site.

If treatment were perfectly randomized, the second term would be 0.

This estimator is called the Difference-in-Difference estimator.

Policy Evaluation with a Pooled Cross Section

The effect can be estimated by a single regression with dummy variables:

\[ rprice = \beta_0 + \delta_0\, y81 + \beta_1\, nearinc + \delta_1\, y81 \cdot nearinc + u \]

This result may not be intuitive at first. Just follow the logic:

                           Before (y81 = 0)    After (y81 = 1)        After − Before
Control (nearinc = 0)      β₀                  β₀ + δ₀                δ₀
Treatment (nearinc = 1)    β₀ + β₁             β₀ + δ₀ + β₁ + δ₁      δ₀ + δ₁
Treatment − Control        β₁                  β₁ + δ₁                δ₁

Therefore, $\delta_1$ in the above regression gives the same estimate as the Difference-in-Difference estimator.
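The equivalence between the interaction coefficient and the difference of group means can be checked numerically. The sketch below uses simulated data with made-up coefficients (not the housing data from the slides) and a small hand-written OLS solver, so every name and number in it is an illustrative assumption.

```python
import random

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(k):
            if r != c:
                A[r] = [vr - A[r][c] * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

rng = random.Random(0)
rows, price = [], []
for _ in range(400):
    y81 = rng.randint(0, 1)
    nearinc = rng.randint(0, 1)
    # Hypothetical data-generating process; the numbers are illustrative only.
    p = (80000 + 20000 * y81 - 19000 * nearinc
         - 12000 * y81 * nearinc + rng.gauss(0, 5000))
    rows.append([1.0, y81, nearinc, y81 * nearinc])
    price.append(p)

# Regression with the interaction term: coefficients are (b0, d0, b1, d1).
b0, d0, b1, d1 = ols(rows, price)

def cell_mean(year, near):
    vals = [p for r, p in zip(rows, price) if r[1] == year and r[2] == near]
    return sum(vals) / len(vals)

# Difference-in-differences computed directly from the four cell means.
did = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(d1, did)
```

Because the regression has one parameter per cell of the 2×2 table, its fitted values are exactly the four cell means, which is why the two numbers agree up to rounding error.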


Panel Data and the First Difference (FD) Estimator

In panel data, we follow the same individual over time. This specific structure enables us to conduct a better analysis.

Specifically, we can control for certain types of omitted variables called unobserved heterogeneity.

Let us think about an example:

\[ \log(wage_{it}) = \beta_0 + \delta_0\, d2_t + \beta_1 educ_{it} + \underbrace{a_i + u_{it}}_{v_{it}} \]

Notation: now we have two subscripts, i and t.

Both $a_i$ and $u_{it}$ are unobservables, called a fixed effect and an idiosyncratic error, respectively.


For simplicity, consider a two-period model:

\[ y_{it} = \beta_0 + \delta_0\, d2_t + \beta_1 x_{it} + a_i + u_{it}, \qquad t = 1, 2. \]

Pooled OLS does not work well since $a_i$ is usually correlated with $x_{it}$, i.e. $Cov(v_{it}, x_{it}) \neq 0$, where $v_{it} = a_i + u_{it}$.

A simple solution is the First-Difference (FD) estimator.

\[ y_{i2} = (\beta_0 + \delta_0) + \beta_1 x_{i2} + a_i + u_{i2} \qquad (t = 2) \]
\[ y_{i1} = \beta_0 + \beta_1 x_{i1} + a_i + u_{i1} \qquad (t = 1) \]

Taking a difference gives

\[ y_{i2} - y_{i1} = \delta_0 + \beta_1 (x_{i2} - x_{i1}) + (u_{i2} - u_{i1}) \]

or

\[ \Delta y_i = \delta_0 + \beta_1 \Delta x_i + \Delta u_i. \]
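A small simulation may help show why differencing matters. The sketch below assumes a hypothetical two-period data-generating process in which $x_{it}$ is correlated with $a_i$; the parameter values and names are invented for illustration.

```python
import random

def slope(x, y):
    """OLS slope from a regression of y on x with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(1)
beta1, delta0 = 2.0, 0.5  # hypothetical true values
x_all, y_all, dx, dy = [], [], [], []
for _ in range(5000):
    a = rng.gauss(0, 1)               # unobserved heterogeneity a_i
    x1 = a + rng.gauss(0, 1)          # x_it is correlated with a_i
    x2 = a + rng.gauss(0, 1)
    y1 = 1.0 + beta1 * x1 + a + rng.gauss(0, 1)
    y2 = 1.0 + delta0 + beta1 * x2 + a + rng.gauss(0, 1)
    x_all.extend([x1, x2])
    y_all.extend([y1, y2])
    dx.append(x2 - x1)
    dy.append(y2 - y1)

pooled = slope(x_all, y_all)  # biased: a_i is left in the error term
fd = slope(dx, dy)            # a_i is differenced away
print(pooled, fd)
```

In this design the pooled slope settles well above the true value of 2.0, while the FD slope stays close to it, because the intercept of the differenced regression absorbs $\delta_0$ and $a_i$ drops out.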


(Pooled) OLS works in the new regression,

\[ \Delta y_i = \delta_0 + \beta_1 \Delta x_i + \Delta u_i, \]

if

1 $\Delta u_i$ and $\Delta x_i$ are uncorrelated;

2 $\Delta x_i$ has some variation.

The second condition is violated if $x_{it}$ does not change over time, e.g. gender or race. Then $\Delta x_i = 0$.

Even in the wage equation example,

\[ \log(wage_{it}) = \beta_0 + \delta_0\, d2_t + \beta_1 educ_{it} + a_i + u_{it}, \]

most of the working population does not increase their years of education, so $\Delta educ_i = 0$ for most individuals.


More than Two Time Periods

When panel data contain more than two time periods, we can still apply the FD estimator to control for unobserved heterogeneity.

A sufficient condition for the estimator to be valid is

\[ Cov(x_{it}, u_{is}) = 0 \quad \text{for all } t \text{ and } s. \]

This condition is violated when

1 Future regressors react to the past dependent variable (feedback);

2 Regressors contain a lagged dependent variable;

3 An important (i.e. related to $x_{it}$) time-varying regressor is omitted.

Take differences between adjacent time periods and run the following regression when t = 1, 2, and 3:

\[ \Delta y_{it} = \alpha_0 + \alpha_3\, d3_t + \beta_1 \Delta x_{it} + \Delta u_{it} \quad \text{for } t = 2, 3. \]

Additional Remarks on FD Estimator 

Due to the expansion over the time dimension, serial correlation may 

arise. 

Also, we cannot exclude the heteroskedasticity problem. 

Since we use the OLS estimator, we can apply the White correction or 

the HAC estimation method as before. 


Fixed Effect Estimator

Consider a simple error component model again:

\[ y_{it} = \beta_1 x_{it} + a_i + u_{it}, \qquad t = 1, \dots, T \text{ and } i = 1, \dots, n. \]

We assume that the idiosyncratic error $u_{it}$ is ‘innocuous’ in the sense that

\[ E(u_{it} \mid X_i) = 0 \quad \text{or} \quad E(u_{it} \mid x_{it}) = 0. \]

However, the individual fixed effect $a_i$ could be arbitrarily correlated with $x_{it}$.

We already know that the FD estimator cancels out the unobserved heterogeneity $a_i$.


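There is a different way to cancel out unobserved heterogeneity.

First, fix the individual i and take an average over time:

\[ \bar y_i = \beta_1 \bar x_i + a_i + \bar u_i, \]

where

\[ \bar y_i = \frac{1}{T} \sum_{t=1}^{T} y_{it}, \quad \bar x_i = \frac{1}{T} \sum_{t=1}^{T} x_{it}, \quad \text{and} \quad \bar u_i = \frac{1}{T} \sum_{t=1}^{T} u_{it}. \]

The point is

\[ \bar a_i = \frac{1}{T} \sum_{t=1}^{T} a_i = \frac{1}{T}\, T a_i = a_i. \]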


Now, take the difference between the two equations:

\[ y_{it} = \beta_1 x_{it} + a_i + u_{it}, \qquad t = 1, 2, \dots, T, \]
\[ \bar y_i = \beta_1 \bar x_i + a_i + \bar u_i. \]

Then, what we have is

\[ y_{it} - \bar y_i = \beta_1 (x_{it} - \bar x_i) + (u_{it} - \bar u_i), \qquad t = 1, 2, \dots, T, \]

or

\[ \ddot y_{it} = \beta_1 \ddot x_{it} + \ddot u_{it}, \qquad t = 1, 2, \dots, T. \]

We may apply pooled OLS to the last equation.


The FE estimator uses information from within-group (i) variation:

\[ \ddot y_{i1} = y_{i1} - \bar y_i, \quad \ddot y_{i2} = y_{i2} - \bar y_i, \quad \dots, \quad \ddot y_{iT} = y_{iT} - \bar y_i. \]

For this reason, the FE estimator is also called the within estimator.

This can be readily extended to a multiple regression model:

\[ \ddot y_{it} = \beta_1 \ddot x_{1it} + \beta_2 \ddot x_{2it} + \dots + \beta_k \ddot x_{kit} + \ddot u_{it} \]
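The within transformation is short enough to write out by hand. The following sketch simulates a hypothetical panel (the dimensions, the true slope, and the way $x_{it}$ depends on $a_i$ are all assumptions made for illustration) and runs pooled OLS without an intercept on the time-demeaned data.

```python
import random

rng = random.Random(2)
n, T, beta1 = 2000, 4, 1.5  # hypothetical panel dimensions and true slope
num = den = 0.0
for _ in range(n):
    a = rng.gauss(0, 1)                             # fixed effect a_i
    x = [a + rng.gauss(0, 1) for _ in range(T)]     # x_it correlated with a_i
    y = [beta1 * xt + a + rng.gauss(0, 1) for xt in x]
    x_bar, y_bar = sum(x) / T, sum(y) / T
    # Time-demean within individual i, then accumulate the pooled OLS sums.
    for xt, yt in zip(x, y):
        num += (xt - x_bar) * (yt - y_bar)
        den += (xt - x_bar) ** 2

beta_fe = num / den  # pooled OLS (no intercept) on the demeaned data
print(beta_fe)
```

No intercept is needed in the last step because every demeaned variable has mean zero within each individual, and hence in the pooled sample.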


FD vs. FE

If T = 2, the FD estimator and the FE estimator are identical:

\[ \ddot y_{i2} \equiv y_{i2} - \bar y_i = y_{i2} - \left( \frac{y_{i1} + y_{i2}}{2} \right) = \frac{y_{i2}}{2} - \frac{y_{i1}}{2} \equiv \frac{1}{2} \Delta y_{i2}. \]

Therefore,

\[ \ddot y_{i2} = \beta_1 \ddot x_{i2} + \ddot u_{i2} \iff \frac{1}{2} \Delta y_{i2} = \beta_1 \frac{1}{2} \Delta x_{i2} + \frac{1}{2} \Delta u_{i2} \iff \Delta y_{i2} = \beta_1 \Delta x_{i2} + \Delta u_{i2}. \]

However, they differ in a finite sample if T > 2. Unless there is a unit root (or severe serial correlation) problem, it is better to use the FE estimator.
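The T = 2 equivalence can also be confirmed numerically. The sketch below uses simulated data with a made-up slope of 0.8 and, as in the model on this slide, fits both regressions without an intercept or time dummy.

```python
import random

rng = random.Random(3)
fe_num = fe_den = fd_num = fd_den = 0.0
for _ in range(500):
    a = rng.gauss(0, 1)
    x1, x2 = a + rng.gauss(0, 1), a + rng.gauss(0, 1)
    y1 = 0.8 * x1 + a + rng.gauss(0, 1)   # 0.8 is a hypothetical true slope
    y2 = 0.8 * x2 + a + rng.gauss(0, 1)
    xb, yb = (x1 + x2) / 2, (y1 + y2) / 2
    # FE: two demeaned observations per individual (t = 1, 2).
    fe_num += (x1 - xb) * (y1 - yb) + (x2 - xb) * (y2 - yb)
    fe_den += (x1 - xb) ** 2 + (x2 - xb) ** 2
    # FD: one differenced observation per individual.
    fd_num += (x2 - x1) * (y2 - y1)
    fd_den += (x2 - x1) ** 2

beta_fe, beta_fd = fe_num / fe_den, fd_num / fd_den
print(beta_fe, beta_fd)
```

Each demeaned observation is plus or minus half of the corresponding difference, so both the numerator and the denominator of the FE slope are exactly half of their FD counterparts and the ratio is unchanged.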

Random Effect Estimator

In the random effect model

\[ y_{it} = \beta_0 + \beta_1 x_{it} + a_i + u_{it}, \]

we assume that

\[ Cov(x_{it}, a_i) = 0. \]

Then we are back in the ‘nice’ world where we don’t need to cancel out $a_i$. Just use pooled OLS?

No. There is a serial correlation problem.


Serial Correlation in the RE model

We have two components in the error term:

\[ v_{it} = a_i + u_{it} \]

Suppose that $u_{it}$ is totally innocuous again:

\[ Cov(a_i, u_{it}) = Cov(u_{it}, u_{is}) = 0 \quad \text{for } t \neq s. \]

Now, we calculate $Corr(v_{it}, v_{is})$ and show that it is not zero:

\[ Var(v_{it}) = Var(a_i + u_{it}) = \sigma_a^2 + \sigma_u^2 \]
\[ Cov(v_{it}, v_{is}) = E\big((a_i + u_{it})(a_i + u_{is})\big) = E(a_i^2 + a_i u_{is} + a_i u_{it} + u_{it} u_{is}) = E(a_i^2) = \sigma_a^2 \]

Therefore,

\[ Corr(v_{it}, v_{is}) = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_u^2} \neq 0. \]

Any inference based on pooled OLS would be incorrect.
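As a rough check of the correlation formula, one can simulate composite errors for two periods of the same individual. The standard deviations below are hypothetical and chosen so that the theoretical correlation is 0.2.

```python
import random

rng = random.Random(4)
sigma_a, sigma_u = 1.0, 2.0  # hypothetical standard deviations
v1, v2 = [], []
for _ in range(50000):
    a = rng.gauss(0, sigma_a)
    v1.append(a + rng.gauss(0, sigma_u))  # v_it
    v2.append(a + rng.gauss(0, sigma_u))  # v_is for the same individual, s != t

def corr(p, q):
    """Sample correlation coefficient of two equal-length lists."""
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    var_p = sum((a - mp) ** 2 for a in p)
    var_q = sum((b - mq) ** 2 for b in q)
    return cov / (var_p * var_q) ** 0.5

theory = sigma_a**2 / (sigma_a**2 + sigma_u**2)  # sigma_a^2 / (sigma_a^2 + sigma_u^2)
print(corr(v1, v2), theory)
```

The sample correlation should land near the theoretical value even though the idiosyncratic errors themselves are independent across periods, since the shared $a_i$ is the only source of dependence.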

However, we know how to fix this problem. Do GLS!

We want to transform the original model into

\[ \tilde y_{it} = \beta_0 + \beta_1 \tilde x_{it} + \tilde v_{it}, \]

where $\tilde v_{it}$ no longer has serial correlation.


We multiplied by ρ and took a difference when there was AR(1) serial correlation. In this case, we multiply by

\[ \lambda = 1 - \left[ \frac{\sigma_u^2}{\sigma_u^2 + T \sigma_a^2} \right]^{1/2} \]

and take a difference as

\[ y_{it} - \lambda \bar y_i = \beta_0 (1 - \lambda) + \beta_1 (x_{it} - \lambda \bar x_i) + v_{it} - \lambda \bar v_i. \]

We can show that $\tilde v_{it}$ $(= v_{it} - \lambda \bar v_i)$ is not serially correlated.

In practice, λ is unknown and must be replaced by an estimate $\hat\lambda$.

This specific GLS estimator is called the Random Effect (RE) estimator.
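The claim that quasi-demeaning removes the serial correlation can be verified with a little matrix algebra. The sketch below assumes known variance components (the values are hypothetical) and checks that the covariance matrix of the transformed errors for one individual is diagonal.

```python
T, sigma_u2, sigma_a2 = 3, 1.0, 2.0  # hypothetical values

lam = 1 - (sigma_u2 / (sigma_u2 + T * sigma_a2)) ** 0.5

# Covariance matrix of (v_i1, ..., v_iT): sigma_a^2 in every entry,
# plus sigma_u^2 on the diagonal.
omega = [[sigma_a2 + (sigma_u2 if t == s else 0.0) for s in range(T)]
         for t in range(T)]
# Quasi-demeaning matrix: row t maps v_i to v_it - lam * v_bar_i.
M = [[(1.0 if t == s else 0.0) - lam / T for s in range(T)]
     for t in range(T)]

def matmul(A, B):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# M is symmetric, so Var(v_tilde) = M * omega * M.
var_tilde = matmul(matmul(M, omega), M)
for row in var_tilde:
    print([round(v, 10) for v in row])
```

With this choice of λ the off-diagonal entries vanish and each diagonal entry equals $\sigma_u^2$, so the transformed errors are homoskedastic and serially uncorrelated, which is what GLS requires.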


The RE estimator is something between pooled OLS and the FE estimator. Note that the equation

\[ y_{it} - \lambda \bar y_i = \beta_0 (1 - \lambda) + \beta_1 (x_{it} - \lambda \bar x_i) + v_{it} - \lambda \bar v_i \]

becomes pooled OLS when λ = 0 and the FE estimator when λ = 1.

The λ is always between 0 and 1 in the RE model.

As T → ∞, the FE and RE estimators are equivalent since λ → 1.
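The two limiting cases are easy to see by evaluating λ directly. The helper name `re_lambda` and the variance values below are illustrative assumptions, not part of the slides.

```python
def re_lambda(sigma_u2, sigma_a2, T):
    """Quasi-demeaning weight: 1 - sqrt(sigma_u^2 / (sigma_u^2 + T * sigma_a^2))."""
    return 1 - (sigma_u2 / (sigma_u2 + T * sigma_a2)) ** 0.5

# With no individual effect (sigma_a^2 = 0), lambda is 0: pooled OLS.
print(re_lambda(1.0, 0.0, 5))

# As T grows with sigma_a^2 > 0, lambda approaches 1: the FE estimator.
for T in (2, 10, 100, 10000):
    print(T, re_lambda(1.0, 1.0, T))
```

A larger $\sigma_a^2$ relative to $\sigma_u^2$ also pushes λ toward 1, so RE behaves more like FE when most of the error variance comes from the individual effect.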


RE vs. FE

If you believe that there is an obvious endogenous fixed factor, $a_i$, in your model, you should use the FE estimator.

Otherwise, the RE estimator will tell you more: coefficients on time-invariant regressors, efficiency, etc.

Keep in mind that the RE estimator is not even consistent if $Cov(x_{it}, a_i) \neq 0$.

We can test whether $Cov(x_{it}, a_i) = 0$ or not.


Hausman Test

The idea of the Hausman test is simple. The hypotheses are

\[ H_0: Cov(x_{it}, a_i) = 0 \]
\[ H_1: Cov(x_{it}, a_i) \neq 0 \]

Under $H_0$, both RE and FE are consistent:

\[ \hat\beta_{RE} \xrightarrow{p} \beta, \qquad \hat\beta_{FE} \xrightarrow{p} \beta. \]

Thus, we can expect that $\hat\beta_{RE} \approx \hat\beta_{FE}$.

However, under $H_1$, only $\hat\beta_{FE}$ is consistent. Therefore, we reject $H_0$ if the difference between $\hat\beta_{RE}$ and $\hat\beta_{FE}$ is large enough.
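"Large enough" is usually made precise with a quadratic form in the difference of the two estimates. The statistic below is the standard textbook form rather than something stated on the slide; here k denotes the number of coefficients being compared (those on time-varying regressors), and the variance estimates are assumed to come from the FE and RE fits.

```latex
H = (\hat\beta_{FE} - \hat\beta_{RE})'
    \left[ \widehat{Var}(\hat\beta_{FE}) - \widehat{Var}(\hat\beta_{RE}) \right]^{-1}
    (\hat\beta_{FE} - \hat\beta_{RE})
    \;\xrightarrow{d}\; \chi^2_k
    \quad \text{under } H_0
```

The variance difference appears because RE is efficient under $H_0$, so $H_0$ is rejected when H exceeds the relevant $\chi^2_k$ critical value.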



“Infants born to women who smoke during pregnancy have a lower average 

birthweight... Low birthweight is associated with increased risk for 

neonatal, perinatal, and infant morbidity and mortality.” 

(Women and Smoking: A Report of the Surgeon General, 2001, requoted from 

Abrevaya (2006)) 



The direct medical costs: According to the estimates of Lewit et al. 

(1995), the low-birthweight (LBW) infants (less than 10% of births) 

account for more than 1/3 of health care costs during the first year of life. 

The long-term costs: 

“Hack et al. (1995) find that LBW babies have developmental problems in 

cognition, attention and neuromotor functioning that persist until 

adolescence.” (Abrevaya (2006)) 



How to Estimate

The OLS estimates would be biased in the negative direction due to endogeneity.

IV estimation?

[Table: Comparison between OLS and IV estimates, from Abrevaya (2006)]


The fixed-effect (FE) estimation can be used if panel data are available.

Abrevaya (2006) constructed a pseudo panel data set and showed that the FE estimate is smaller in magnitude than the OLS estimate.

\[ y_{ib} = x_{ib}' \beta + \gamma s_{ib} + c_i + u_{ib} \]

where i is the mother's id and b is the birth order of the baby from mother i.

The estimates of γ by OLS and FE are −243.27 (3.20) and −144.04 (4.75), respectively.


Concluding Remarks

Pooled cross sections are very similar to a single cross section, but observations across different time points help evaluate the correct policy effect.

Extra information contained in panel data enables us to control for the individual fixed effect by the FD and FE estimators.

If the fixed effect is not correlated with the regressors, we can apply the RE estimator, which is a GLS estimator.

Panel data are not restricted to the individual-time structure.

Stata Commands

Load the data set filename.dta.

First, we need to set an id variable and a time variable. Check the relevant variable names.

xtset id time

Now type xtsum.

The command for the FE estimator is

xtreg dep x1 x2 x3 ..., fe

The command for the RE estimator is

xtreg dep x1 x2 x3 ..., re
