Causal Inference - South Africa Government Online

REPUBLIC OF SOUTH AFRICA 

GOVERNMENT-WIDE MONITORING & IMPACT EVALUATION SEMINAR 

Session I 

Causal Inference 

Sebastian Martinez 

June 2006 

Slides by Sebastian Galiani, Paul Gertler and Sebastian Martinez 

ORGANIZED BY THE WORLD BANK AFRICA IMPACT EVALUATION INITIATIVE 

IN COLLABORATION WITH HUMAN DEVELOPMENT NETWORK 

AND WORLD BANK INSTITUTE

Motivation 

• Objective in evaluation is to estimate 

the CAUSAL effect of intervention 

(treatment) t on outcome Y 

– What is the effect of a cash transfer on 

household consumption? 

• For causal inference we must 

understand the data generation 

process 

– For impact evaluation, this means 

understanding the behavioral process 

that generates the data 

• how benefits are assigned

Technical Group 

• Causal Inference 

• Experimental design/randomization 

• Quasi-experiments 

– Regression Discontinuity 

– Double differences (Diff in diff) 

– Matching 

– Instrumental Variables 

• Sampling and Data

Causal Analysis 

• The aim of standard statistical analysis, typified by 

likelihood and other estimation techniques, is to 

infer parameters of a distribution from samples 

drawn from that distribution. 

• With the help of such parameters, one can: 

1. Infer association among variables, 

2. Estimate the likelihood of past and future events, 

3. As well as update the likelihood of events in light of new 

evidence or new measurement.


• These tasks are managed well by standard 

statistical analysis as long as experimental 

conditions remain the same. 

• Causal analysis goes one step further: 

– Its aim is to infer aspects of the data generation 

process. 

– With the help of such aspects, one can deduce 

not only the likelihood of events under static 

conditions, but also the dynamics of events 

under changing conditions.


• This capability includes: 

1. Predicting the effects of interventions 

2. Predicting the effects of spontaneous changes 

3. Identifying causes of reported events 

• This distinction implies that causal and 

associational concepts do not mix.


The word cause is not in the vocabulary of standard 

probability theory. 

• All that probability theory allows us to say is that two 

events are mutually correlated, or dependent – 

meaning that if we find one, we can expect to 

encounter the other. 

• Scientists seeking causal explanations for 

complex phenomena or rationales for policy 

decisions must therefore supplement the language 

of probability with a vocabulary for causality.


• Two languages for causality have 

been proposed: 

1. Structural equation modeling (SEM) (Haavelmo 1943). 

2. The Neyman-Rubin potential outcome model (RCM) (Neyman, 1923; Rubin, 1974).

The Rubin Causal Model 

• Define the population by U. Each unit in U 

is denoted by u. 

• Each u ∈ U has an associated value Y(u) of the variable of interest Y, which we call the response variable. 

• Let A be a second variable defined on U. 

We call A an attribute of the units in U.


• The key notion is the potential for 

exposing or not exposing each unit to the 

action of a cause: 

• Each unit has to be potentially exposable 

to any one of the causes. 

• Thus, Rubin takes the position that causes 

are only those things that could be 

treatments in hypothetical experiments. 

• An attribute cannot be a cause in an 

experiment, because the notion of potential 

exposability does not apply to it.


• For simplicity, we assume that there are just 

two causes or levels of treatment. 

• Let D be a variable that indicates the cause 

to which each unit in U is exposed: 

\[
D = \begin{cases} t & \text{if unit } u \text{ is exposed to treatment} \\ c & \text{if unit } u \text{ is exposed to control} \end{cases}
\]

In a controlled study, D is constructed by the 

experimenter. In an uncontrolled study, it is 

determined by factors beyond the 

experimenter’s control.


• The values of Y are potentially affected by 

the particular cause, t or c, to which the 

unit is exposed. 

• Thus, we need two response variables: 

\[ Y_t(u), \quad Y_c(u) \]

• $Y_t$ is the value of the response that would be observed if the unit were exposed to t, and 

• $Y_c$ is the value that would be observed on the same unit if it were exposed to c.


• Let D also be expressed as a binary 

variable: 

D = 1 if D = t and D = 0 if D = c 

• Then, the outcome of each individual can 

be written as: 

\[ Y(u) = D\, Y_1(u) + (1 - D)\, Y_0(u) \]
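The rule that links the two potential outcomes to the single observed outcome can be sketched in a few lines of Python. This is an illustration only, not part of the original slides: the `Unit` class, the `observed_outcome` function, and the consumption figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One unit u with both potential outcomes (known only in a simulation)."""
    y1: float  # Y_1(u): outcome if exposed to treatment
    y0: float  # Y_0(u): outcome if exposed to control
    d: int     # D: 1 if treated, 0 if control


def observed_outcome(unit: Unit) -> float:
    """Observed outcome: Y(u) = D * Y_1(u) + (1 - D) * Y_0(u)."""
    return unit.d * unit.y1 + (1 - unit.d) * unit.y0


# Hypothetical households: consumption with and without a cash transfer.
households = [Unit(y1=120.0, y0=100.0, d=1), Unit(y1=90.0, y0=85.0, d=0)]
observed = [observed_outcome(h) for h in households]
```

The treated household reveals only its treated outcome and the control household only its control outcome.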


• Definition: For every unit u, treatment {$D_u = 1$ instead of $D_u = 0$} causes the effect 

\[ \delta_u = Y_1(u) - Y_0(u) \]

• This definition of a causal effect assumes that the treatment 

status of one individual does not affect the potential outcomes 

of other individuals. 

• Fundamental Problem of Causal Inference: It is impossible to observe the values of $Y_1(u)$ and $Y_0(u)$ on the same unit and, therefore, it is impossible to observe the effect of t on u. 

• Another way to express this problem is to say that we cannot infer the effect of treatment because we do not have the counterfactual evidence, i.e., what would have happened in the absence of treatment.
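A small Python sketch may make the missing counterfactual concrete. It assumes a simulated setting in which both potential outcomes are known to the simulator; the function names `analyst_view` and `unit_effect` are hypothetical and chosen only for this illustration.

```python
from typing import Optional, Tuple


def analyst_view(y1: float, y0: float, d: int) -> Tuple[Optional[float], Optional[float]]:
    """Return what an analyst can see: the counterfactual outcome is None."""
    return (y1, None) if d == 1 else (None, y0)


def unit_effect(y1: Optional[float], y0: Optional[float]) -> Optional[float]:
    """Unit-level effect Y_1(u) - Y_0(u); None when either outcome is missing."""
    if y1 is None or y0 is None:
        return None
    return y1 - y0


# The simulator knows both outcomes; the analyst sees only one of them.
true_effect = unit_effect(120.0, 100.0)
seen_effect = unit_effect(*analyst_view(120.0, 100.0, d=1))
```

Whatever the treatment status, `seen_effect` is `None`: the unit-level effect cannot be computed from the data that is actually observed.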


• Given that the causal effect for a single unit u 

cannot be observed, we aim to identify the 

average causal effect for the entire population or 

for sub-populations. 

• The average treatment effect ATE of t (relative to 

c) over U (or any sub-population) is given by: 

\[ \text{ATE} = E[Y_1(u) - Y_0(u)] = E[Y_1(u)] - E[Y_0(u)] = \delta = \bar{Y}_1 - \bar{Y}_0 \tag{1} \]


• The statistical solution replaces the impossible-to-observe causal effect of t on a specific unit with the possible-to-estimate average causal effect of t over a population of units. 

• Although $E(Y_1)$ and $E(Y_0)$ cannot both be calculated, they can be estimated. 

• Most econometric methods attempt to construct consistent estimates of $\bar{Y}_1$ and $\bar{Y}_0$ from observational data.


• Consider the following simple estimator of 

ATE: 

\[ \hat{\delta} = [\hat{Y}_1 \mid D = 1] - [\hat{Y}_0 \mid D = 0] \tag{2} \]

• Note that equation (1) is defined for the 

whole population, whereas equation (2) 

represents an estimator to be evaluated on a 

sample drawn from that population
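The standard estimator is a difference in sample means between the treated and the control units. A minimal Python sketch, with a hypothetical function name and made-up data:

```python
from statistics import mean
from typing import Sequence


def difference_in_means(y: Sequence[float], d: Sequence[int]) -> float:
    """Mean observed outcome of treated units minus that of control units."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    if not treated or not control:
        raise ValueError("need at least one treated and one control unit")
    return mean(treated) - mean(control)


# Hypothetical sample: observed outcomes and treatment indicators.
y = [120.0, 110.0, 85.0, 95.0]
d = [1, 1, 0, 0]
estimate = difference_in_means(y, d)  # 115.0 - 90.0
```

The function uses observed outcomes only, so it can be evaluated on any sample. Whether it recovers the ATE depends on how treatment was assigned.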

• Let π equal the proportion of the population 

that would be assigned to the treatment 

group. 

• Decomposing ATE, we have: 

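\[
\begin{aligned}
\delta &= \pi\, \delta_{\{D=1\}} + (1-\pi)\, \delta_{\{D=0\}} \\
&= \pi\, [(\bar{Y}_1 - \bar{Y}_0) \mid D = 1] + (1-\pi)\, [(\bar{Y}_1 - \bar{Y}_0) \mid D = 0] \\
&= \big[\pi\, [\bar{Y}_1 \mid D = 1] + (1-\pi)\, [\bar{Y}_1 \mid D = 0]\big] - \big[\pi\, [\bar{Y}_0 \mid D = 1] + (1-\pi)\, [\bar{Y}_0 \mid D = 0]\big] \\
&= \bar{Y}_1 - \bar{Y}_0
\end{aligned}
\]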
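The decomposition of the ATE into a π-weighted average of the effect among the treated and the effect among the controls can be checked numerically. The sketch below uses a made-up population in which both potential outcomes are known, which is only possible in a simulation; all variable names are hypothetical.

```python
from statistics import mean

# Hypothetical population: (Y_1(u), Y_0(u), D) for each unit.
units = [(12.0, 10.0, 1), (15.0, 11.0, 1), (9.0, 8.0, 0), (7.0, 7.0, 0), (10.0, 6.0, 0)]

# ATE: average of the unit-level effects over the whole population.
ate = mean([y1 - y0 for y1, y0, _ in units])

# pi: share of the population assigned to treatment.
pi = mean([d for _, _, d in units])

# Average effects within the treated and the control groups.
delta_treated = mean([y1 - y0 for y1, y0, d in units if d == 1])
delta_control = mean([y1 - y0 for y1, y0, d in units if d == 0])

# Weighted average of the two group-level effects.
decomposed = pi * delta_treated + (1 - pi) * delta_control
```

The two quantities agree up to floating-point rounding. The group-level effects themselves differ here, which matters for the bias of the standard estimator.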

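• If we assume that 

\[ [\bar{Y}_1 \mid D = 1] = [\bar{Y}_1 \mid D = 0] \quad \text{and} \quad [\bar{Y}_0 \mid D = 1] = [\bar{Y}_0 \mid D = 0] \]

then 

\[
\begin{aligned}
\delta &= \big[\pi\, [\bar{Y}_1 \mid D = 1] + (1-\pi)\, [\bar{Y}_1 \mid D = 1]\big] - \big[\pi\, [\bar{Y}_0 \mid D = 0] + (1-\pi)\, [\bar{Y}_0 \mid D = 0]\big] \\
&= [\bar{Y}_1 \mid D = 1] - [\bar{Y}_0 \mid D = 0]
\end{aligned}
\]

which is consistently estimated by its sample analog estimator: 

\[ \hat{\delta} = [\hat{Y}_1 \mid D = 1] - [\hat{Y}_0 \mid D = 0] \]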

• Thus, a sufficient condition for the standard estimator to consistently estimate the true ATE is that: 

\[ [\bar{Y}_1 \mid D = 1] = [\bar{Y}_1 \mid D = 0] \quad \text{and} \quad [\bar{Y}_0 \mid D = 1] = [\bar{Y}_0 \mid D = 0] \]

In this situation, the average outcome under the treatment and the average outcome under the control do not differ between the treatment and control groups. 

In order to satisfy these conditions, it is sufficient that treatment assignment D be uncorrelated with the potential outcome distributions of $Y_1$ and $Y_0$. 

The principal way to achieve this uncorrelatedness is through random assignment of treatment.
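A small simulation can illustrate why random assignment matters. The sketch below is an assumption-laden toy example, not taken from the slides: outcomes without treatment are drawn from a normal distribution, the treatment effect is a constant 20 for every unit, and the "self-selected" rule (units with above-average untreated outcomes take the treatment) is one hypothetical form of non-random assignment.

```python
import random
from statistics import mean


def naive_estimate(y, d):
    """Difference in mean observed outcomes, treated minus control."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return mean(treated) - mean(control)


def observe(d, y1, y0):
    """Observed outcome for each unit: D * Y_1 + (1 - D) * Y_0."""
    return [di * a + (1 - di) * b for di, a, b in zip(d, y1, y0)]


rng = random.Random(0)          # fixed seed so the run is reproducible
n = 20_000
y0 = [rng.gauss(100.0, 10.0) for _ in range(n)]   # assumed untreated outcomes
y1 = [v + 20.0 for v in y0]                        # assumed constant effect of 20
true_ate = mean([a - b for a, b in zip(y1, y0)])

# Random assignment: D is independent of the potential outcomes.
d_random = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
# Self-selection: units with high Y_0 end up treated.
d_selected = [1 if v > 100.0 else 0 for v in y0]

est_random = naive_estimate(observe(d_random, y1, y0), d_random)
est_selected = naive_estimate(observe(d_selected, y1, y0), d_selected)
```

Under random assignment the estimate should land close to the true ATE, while under this selection rule it overstates the effect because the treated units would have had higher outcomes even without treatment.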

• In most circumstances, there is simply no 

information available on how those in the 

control group would have reacted if they had 

received the treatment instead. 

• This is the basis for an important insight into 

the potential biases of the standard 

estimator (2). 

• After a bit of algebra, it can be shown that: 

\[
\hat{\delta} = \delta + \underbrace{\big([\bar{Y}_0 \mid D = 1] - [\bar{Y}_0 \mid D = 0]\big)}_{\text{Baseline Difference}} + \underbrace{(1-\pi)\big(\delta_{\{D=1\}} - \delta_{\{D=0\}}\big)}_{\text{Treatment Heterogeneity}}
\]
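The identity "standard estimator = ATE + baseline difference + (1 − π) × (effect among treated − effect among controls)" holds exactly for group means, so it can be verified on a tiny made-up population in which both potential outcomes are known. The data and variable names below are hypothetical.

```python
from statistics import mean

# Hypothetical population: (Y_1(u), Y_0(u), D) for each unit.
units = [(20.0, 12.0, 1), (18.0, 10.0, 1), (22.0, 14.0, 1), (9.0, 6.0, 0), (11.0, 8.0, 0)]
treated = [(y1, y0) for y1, y0, d in units if d == 1]
control = [(y1, y0) for y1, y0, d in units if d == 0]

pi = len(treated) / len(units)                       # share treated
ate = mean([y1 - y0 for y1, y0, _ in units])         # true average effect
delta_treated = mean([y1 - y0 for y1, y0 in treated])
delta_control = mean([y1 - y0 for y1, y0 in control])

# Standard estimator: uses only outcomes that would actually be observed.
naive = mean([y1 for y1, _ in treated]) - mean([y0 for _, y0 in control])

# The two bias terms.
baseline_difference = mean([y0 for _, y0 in treated]) - mean([y0 for _, y0 in control])
treatment_heterogeneity = (1 - pi) * (delta_treated - delta_control)
```

In this example the naive estimate is 13 while the ATE is 6: a baseline difference of 5 and a heterogeneity term of 2 account for the gap.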

• This equation specifies the two sources of bias that need to be eliminated from estimates of causal effects from observational studies. 

1. Selection Bias: Baseline difference. 

2. Treatment Heterogeneity. 

• Most of the methods available only deal with 

selection bias, simply assuming that the 

treatment effect is constant in the population 

or by redefining the parameter of interest in 

the population.

Treatment on the Treated 

• ATE is not always the parameter of 

interest. 

• In a variety of policy contexts, it is the 

average treatment effect for the treated 

that is of substantive interest: 

\[
\begin{aligned}
\text{TOT} &= E[Y_1(u) - Y_0(u) \mid D = 1] \\
&= E[Y_1(u) \mid D = 1] - E[Y_0(u) \mid D = 1]
\end{aligned}
\]

Treatment on the Treated 

• The standard estimator (2) consistently 

estimates TOT if: 

\[ [\bar{Y}_0 \mid D = 1] = [\bar{Y}_0 \mid D = 0] \]
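A made-up example can show the weaker condition at work: when the two groups have the same average outcome without treatment, the standard estimator recovers the TOT even if the treated gain more from treatment than the controls would. All numbers and names below are hypothetical.

```python
from statistics import mean

# Hypothetical (Y_1(u), Y_0(u)) pairs. Both groups have mean Y_0 equal to 11,
# but the treated gain more from treatment than the controls would.
treated = [(18.0, 10.0), (22.0, 12.0)]
control = [(12.0, 11.0), (14.0, 11.0)]

tot = mean([y1 - y0 for y1, y0 in treated])                # effect on the treated
ate = mean([y1 - y0 for y1, y0 in treated + control])      # effect on everyone

# Standard estimator from observable outcomes only.
naive = mean([y1 for y1, _ in treated]) - mean([y0 for _, y0 in control])
```

Here the naive estimate equals the TOT (9) but not the ATE (5.5), because no baseline difference exists while treatment effects are heterogeneous.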

Structural Equation Modeling 

• Structural equation modeling was 

originally developed by geneticists 

(Wright 1921) and economists 

(Haavelmo 1943).

Structural Equations 

• Definition: An equation 

y = β x + ε (3) 

is said to be structural if it is to be interpreted as 

follows: 

• In an ideal experiment where we control X to x and 

any other set Z of variables (not containing X or Y) 

to z, the value y of Y is given by β x + ε, where ε is 

not a function of the settings x and z. 

• This definition is in the spirit of Haavelmo (1943), 

who explicitly interpreted each structural equation as 

a statement about a hypothetical controlled 

experiment.

• Thus, to the often asked question, “Under what 

conditions can we give causal interpretation to 

structural coefficients?” 

• Haavelmo would have answered: Always! 

• According to the founding father of SEM, the conditions that make the equation y = β x + ε structural are precisely those that make the causal connection between X and Y have no other value but β, and that ensure that nothing about the statistical relationship between x and ε can ever change this interpretation of β.

• The average causal effect: The average 

causal effect on Y of treatment level x is 

the difference in the conditional 

expectations: 

E(Y|X = x) – E(Y|X = 0) 

• In the context of dichotomous interventions 

(x = 1), this causal effect is called the 

average treatment effect (ATE).

Representing Interventions 

• Consider the structural model M: 

\[
\begin{aligned}
z &= f_z(w) \\
x &= f_x(z, \nu) \\
y &= f_y(x, u)
\end{aligned}
\]

• We represent an intervention in the model through a mathematical operator denoted do(x). 

• do(x) simulates physical interventions by deleting certain functions from the model, replacing them by a constant X = x, while keeping the rest of the model unchanged.

• To emulate an intervention $do(x_0)$ that holds X constant (at $X = x_0$) in model M, replace the equation for x with $x = x_0$, and obtain a new model, $M_{x_0}$: 

\[
\begin{aligned}
z &= f_z(w) \\
x &= x_0 \\
y &= f_y(x, u)
\end{aligned}
\]

• The joint distribution associated with the modified model, denoted $P(z, y \mid do(x_0))$, describes the post-intervention (“experimental”) distribution. 

• From this distribution, one is able to assess treatment efficacy by comparing aspects of this distribution at different levels of $x_0$.
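The intervention operator (do(x) in Pearl's notation) can be sketched as a function that solves the structural model while optionally overriding the equation for x. The particular functional forms for `f_z`, `f_x`, and `f_y` below are invented for illustration; the slides leave them unspecified.

```python
from typing import Dict, Optional


def f_z(w: float) -> float:
    return 2.0 * w            # hypothetical functional form


def f_x(z: float, nu: float) -> float:
    return z + nu             # hypothetical functional form


def f_y(x: float, u: float) -> float:
    return 3.0 * x + u        # hypothetical functional form


def solve(w: float, nu: float, u: float, do_x: Optional[float] = None) -> Dict[str, float]:
    """Solve model M, or the modified model M_{x0} when do_x is given."""
    z = f_z(w)
    # Intervention: delete the equation for x and replace it by a constant.
    x = f_x(z, nu) if do_x is None else do_x
    y = f_y(x, u)             # the rest of the model is left unchanged
    return {"z": z, "x": x, "y": y}


natural = solve(w=1.0, nu=0.5, u=0.25)
intervened = solve(w=1.0, nu=0.5, u=0.25, do_x=1.0)
```

Under the intervention, z keeps its natural value because its equation is untouched, while x no longer responds to z or ν.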

• Definition: The interpretation of a structural 

equation as a statement about the behavior of Y 

under a hypothetical intervention yields a simple 

definition for the structural parameters. 

The meaning of β in the equation y = β x + ε is simply 

\[ \beta = \frac{\partial}{\partial x} E[Y \mid do(x)] \]
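As a rough numerical illustration, the sketch below simulates a linear structural equation with an assumed β = 2 and deliberately lets x depend on ε in the observational regime. Setting x by intervention and differencing the average outcome recovers β, while the ordinary regression slope of y on x does not (with the variances assumed here, the population slope is β + 0.5). The data-generating choices and function names are assumptions made for this example only.

```python
import random
from statistics import mean

BETA = 2.0                      # assumed structural coefficient
rng = random.Random(1)          # fixed seed for reproducibility
n = 10_000

eps = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Observational regime: x is correlated with eps (assumed confounding).
x_obs = [e + rng.gauss(0.0, 1.0) for e in eps]
y_obs = [BETA * x + e for x, e in zip(x_obs, eps)]


def mean_y_do(x0: float) -> float:
    """E[Y | do(x0)]: set x to x0 for every unit, leaving eps untouched."""
    return mean([BETA * x0 + e for e in eps])


def ols_slope(x, y) -> float:
    """Least-squares slope of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Finite-difference version of the derivative of E[Y | do(x)] with respect to x.
beta_do = (mean_y_do(1.0) - mean_y_do(0.0)) / (1.0 - 0.0)
beta_ols = ols_slope(x_obs, y_obs)
```

The gap between `beta_ols` and `beta_do` reflects the statistical relationship between x and ε, which does not alter the structural meaning of β.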

Counterfactual Analysis in Structural 

Models 

• Consider again model $M_{x_0}$. Call the solution of Y the potential response of Y to $x_0$. 

• We denote it as $Y_{x_0}(u, \nu, w)$. 

• This entity can be given a counterfactual interpretation, for it stands for the way an individual with characteristics (u, ν, w) would respond, had the treatment been $x_0$, rather than the $x = f_x(z, \nu)$ actually received by the individual.

• In our example, 

\[ Y_{x_0}(u, \nu, w) = Y_{x_0}(u) = y = f_y(x_0, u) \]

• This interpretation of counterfactuals, cast as 

solutions to modified systems of equations, provides 

the conceptual and formal link between structural 

equation modeling and the Rubin potential-outcome 

framework. 

• It assures us that the end results of the two 

approaches will be the same. 

• Thus, the choice of model is strictly a matter of 

convenience or insight.
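The link between the two frameworks can be sketched in code: solving the modified structural model at x = 1 and x = 0 for the same individual yields that individual's two potential outcomes in the Rubin sense. The functional form of `f_y` is a hypothetical choice made for this illustration.

```python
def f_y(x: float, u: float) -> float:
    """Structural equation for y; this linear form is assumed for illustration."""
    return 3.0 * x + u


def potential_response(x0: float, u: float) -> float:
    """Potential response of Y to x0 for an individual with characteristic u."""
    return f_y(x0, u)


u = 0.25                                 # one individual's unobserved characteristic
y1 = potential_response(1.0, u)          # plays the role of Y_1(u)
y0 = potential_response(0.0, u)          # plays the role of Y_0(u)
delta_u = y1 - y0                        # unit-level causal effect
```

Both responses come from the same equation and the same u, which is what makes them counterfactuals for a single individual.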

References 

• Judea Pearl (2000): Causality: Models, Reasoning 

and Inference, CUP. Chapters 1, 5 and 7. 

• Trygve Haavelmo (1944): “The probability 

approach in econometrics”, Econometrica 12, pp. 

iii-vi+1-115. 

• Arthur Goldberger (1972): “Structural Equations 

Methods in the Social Sciences”, Econometrica 

40, pp. 979-1002. 

• Donald B. Rubin (1974): “Estimating causal effects 

of treatments in randomized and nonrandomized 

experiments”, Journal of Educational Psychology 

66, pp. 688-701. 

• Paul W. Holland (1986): “Statistics and Causal 

Inference”, Journal of the American Statistical 

Association 81, pp. 945-970, with discussion.
