
Count Regression Introduction

Paul Johnson

December 6, 2011


Welcome

This is not ready yet.


Outline

1 Motivation
2 Poisson
3 Poisson: λ
    Graph the Poisson Distribution to Get a "Feeling"
4 Negative Binomial
    Gamma distributed heterogeneity
    Gamma distribution background
    NB estimation
    Overdispersion
5 Zero inflated models
6 Additional Readings




Motivation

"Count" means:

integer valued: 0, 1, 2, ...
never negative (0 or greater).

If the expected number of observed cases is large, and the count data are drawn from the Poisson distribution, then OLS might be OK. The Poisson is then not all that different from a Normal distribution, so the Normal case can sometimes be thought of as an approximation.

But if the expected count is small, the Poisson distribution is not even a little bit like a Normal distribution.


Alternatives might not be as good

There are alternatives:

tobit (OLS, but with a truncation at 0)
ordinal logit/probit


Generalized Linear Model Approach

Recall: we asserted that the dependent variable $y_i$ is the sum of a "predictable part" and a "random part": $y_i = b_0 + b_1 x_i + e_i$.

If $e_i$ is "Normal with a mean of 0 and standard deviation $\sigma_e$", then $y_i$ is "Normal with a mean of $b_0 + b_1 x_i$ and a standard deviation of $\sigma_e$."

So OLS with an assumed Normal error implies

$$y_i \sim N(X_i b, \sigma_e^2)$$

The symbol "$\sim$" means "is distributed as" or "is drawn from". $X_i b$ is shorthand matrix notation for the "linear predictor", $b_0 + b_1 x1_i + b_2 x2_i$.


What's the big point here?

We think of the predictor as determining a parameter in a distribution from which observations are drawn.


You can use any distribution you want

$y_i$ can be drawn from any distribution you want.

Let the properties of that distribution depend on input variables and parameters.

For a "count" model, all you absolutely need is an integer-valued distribution for which $y_i \geq 0$. Two possibilities:

Poisson
Negative Binomial




Poisson: λ

The Poisson is a "one parameter" distribution

The parameter is usually called λ, and that parameter determines both the expected value and the variance.

$$Pr(y_i | input_i) = \frac{exp(-\lambda)\, \lambda^{y_i}}{y_i!}$$
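As a quick sanity check of that formula (an addition here, not part of the original slides), one can evaluate the PMF by hand in R and compare it against the built-in dpois(); the value λ = 4 is an arbitrary choice for illustration.

lambda <- 4
y <- 0:10
# the PMF written out directly, exactly as in the formula above
manual <- exp(-lambda) * lambda^y / factorial(y)
all.equal(manual, dpois(y, lambda))   # TRUE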


Relabel λ as "input" for Interpretation

Instead of the Greek letter λ, let's call it what we mean: "input".

$$Pr(y_i | input_i) = \frac{exp(-input_i)\, input_i^{y_i}}{y_i!}$$

For any $y_i$ you put in here, this tells you how likely you are to count that many "things" if the input is "input".

When I write "input", I mean the combined impact of parameters and variables.

Input is not necessarily simply $X_i b$. In fact, we usually have to "translate" or "curve" the linear predictor so it fits "within boundaries."

So "input" is typically some function that depends on $X_i b$; for generality, $g(X_i b)$.


Graph the Poisson Distribution to Get a "Feeling"

Poisson Sample, small lambda

[Figure: density histograms of Poisson samples for lambda = 5, 10, 50, and 200.]


Poisson Sample, large lambda

[Figure: density histograms of Poisson samples for lambda = 5, 10, 50, and 200.]
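The exact script behind these figures is not included here, but a minimal sketch along the following lines reproduces the four panels (the sample size of 1000 is an arbitrary choice):

# draw Poisson samples at several lambda values and plot density histograms
par(mfrow = c(2, 2))
for (lambda in c(5, 10, 50, 200)) {
  y <- rpois(1000, lambda)
  hist(y, freq = FALSE, main = paste("lambda =", lambda), xlab = "y")
}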


Noteworthy

1 The expected value of the Poisson is λ = "input".
2 The variance of the Poisson is λ = "input".
3 The shape changes and gets "more Normal" as "input" gets bigger.

Implication: if your count data has high values, then the OLS Normal model may serve about as well as a Poisson model. However, there are 2 problems:

1 Nonlinearity.
2 Heteroskedasticity.


Nonlinear Transformation of $X_i b$ Required for Poisson

The $input_i$ must be positive! We are considering a "count variable," something that is never negative, and the expected value of a Poisson variable has to be positive. Since the expected value equals the value of $input_i$, $X_i b$ cannot serve as $input_i$, because it may be negative.

All kinds of transformations have been considered to make sure the input is positive. A common way is to exponentiate the linear predictor, because exp(anything) is positive:

$$input_i = exp(X_i b)$$

Now, that results in the stupid looking exp(exp) appearance of the Poisson regression model:

$$Pr(y|Xb) = \frac{exp(-exp(Xb))\, (exp(Xb))^y}{y!}$$

or it looks slightly less ugly (not much) if we write:

$$Pr(y|Xb) = \frac{exp(-e^{Xb})\, (e^{Xb})^y}{y!}$$


King called it the "Exponential Poisson" model; others call it the "log link"

If

$$input_i = exp(X_i b)$$

then

$$log(input_i) = X_i b$$

In the Generalized Linear Model literature, they think of the transformation as happening on the left hand side, so they call it the link function. The exponential on the right is therefore the "inverse link" function.


Estimation: straightforward ML

Adjust the b's to maximize the product of the probabilities of the observations:

$$L(b; y, X) = Pr(y_1|Xb) * Pr(y_2|Xb) * \cdots * Pr(y_N|Xb)$$

Usually, one would take logs and maximize the log likelihood, which is a sum.
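To make the idea concrete, here is a small sketch (not from the slides) that maximizes the Poisson log likelihood directly with optim() on simulated data; in practice glm(..., family = poisson) does the same job via Fisher scoring. The coefficients 0.5, 0.3, and -0.2 are made-up "true" values.

set.seed(42)
x1 <- rnorm(100); x2 <- rnorm(100)
X <- cbind(1, x1, x2)
b.true <- c(0.5, 0.3, -0.2)
y <- rpois(100, lambda = drop(exp(X %*% b.true)))   # log link: input = exp(Xb)

# negative log likelihood: the sum of log Poisson probabilities, negated
negLL <- function(b) -sum(dpois(y, lambda = drop(exp(X %*% b)), log = TRUE))
fit <- optim(c(0, 0, 0), negLL, method = "BFGS")
fit$par   # should land near c(0.5, 0.3, -0.2)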


Interpretation

Recall the expected value of $y_i$ given the input is just the input itself:

$$E(y_i|X_i) = exp(X_i \hat{b})$$

So if the k'th variable changes, the impact is

$$\frac{\partial E(y_i|X_i)}{\partial x_k} = \hat{b}_k * E(y_i|X_i) = \hat{b}_k * exp(X_i \hat{b})$$

Long discusses the calculation of the percent change in expected y, i.e.

$$\frac{E(y_i|X_i, x_k + \delta)}{E(y_i|X_i, x_k)} = exp(\hat{b}_k * \delta)$$
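For instance (an illustration added here, using the x1 coefficient from the Poisson GLM fit shown a few slides below), the factor-change interpretation works like this:

b.x1 <- 0.15659      # estimated coefficient on x1 in the Poisson fit
exp(b.x1 * 1)        # about 1.17: a one-unit rise in x1 multiplies
                     # the expected count by ~1.17, i.e. a 17% increase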


See poisson-1.R for this example

[Figure: "Ugly Poisson Data" — scatterplot of y (ranging 0 to 6) against x1 (ranging 30 to 70).]


Poisson GLM Fit

m1 <- glm(y ~ x1 + x2, data = dat, family = poisson(link = log))
summary(m1)

Call:
glm(formula = y ~ x1 + x2, family = poisson(link = log), data = dat)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.41361  -0.36576  -0.07740  -0.02218   1.90164

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.32000    1.49137   0.885    0.376
x1           0.15659    0.03117   5.024 5.05e-07 ***
x2          -0.13446    0.01869  -7.193 6.34e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 147.494 on 99 degrees of freedom
Residual deviance:  33.771 on 97 degrees of freedom
AIC: 79.4

Number of Fisher Scoring iterations: 7


Just for Curiosity, fit OLS

lm1 <- lm(y ~ x1 + x2, data = dat)
summary(lm1)

Call:
lm(formula = y ~ x1 + x2, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0386 -0.5489 -0.0799  0.2319  4.7082

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.965354   0.655034   1.474  0.14379
x1           0.029317   0.010552   2.778  0.00656 **
x2          -0.020878   0.004155  -5.025  2.3e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.949 on 97 degrees of freedom
Multiple R-squared: 0.2388, Adjusted R-squared: 0.2231
F-statistic: 15.21 on 2 and 97 DF, p-value: 1.793e-06


See poisson-1.R for this example

plot(x1, y, main = "Ugly Poisson Data")
library(rockchalk)
newdat <- expand.grid(x1 = plotSeq(dat$x1, length.out = 50), x2 = mean(dat$x2))
newdat$p1 <- predict(m1, newdata = newdat, type = "response")
lines(newdat$x1, newdat$p1, lwd = 3, col = "red")
newdat$lmp1 <- predict(lm1, newdata = newdat)
lines(newdat$x1, newdat$lmp1, lwd = 3, col = "green")
legend("topleft", legend = c("Exp. Poisson", "OLS"), lwd = c(3, 3), col = c("red", "green"))


See poisson-1.R for this example ...

[Figure: "Ugly Poisson Data" — the same scatterplot of y against x1, now with the Exp. Poisson (red) and OLS (green) fitted lines overlaid.]


From x2's point of view

plot(x2, y, main = "Ugly Poisson Data, Again")
newdat <- expand.grid(x1 = mean(dat$x1), x2 = plotSeq(dat$x2, length.out = 50))
newdat$p1 <- predict(m1, newdata = newdat, type = "response")
plot(y ~ x2, data = dat)
lines(newdat$x2, newdat$p1, lwd = 3, col = "red")
lm1 <- lm(y ~ x1 + x2, data = dat)
newdat$lmp1 <- predict(lm1, newdata = newdat)
lines(newdat$x2, newdat$lmp1, lwd = 3, col = "green")
legend("topright", legend = c("Exp. Poisson", "OLS"), lwd = c(3, 3), col = c("red", "green"))


From x2's point of view ...

[Figure: scatterplot of y against x2 (ranging 60 to 140) with the Exp. Poisson (red) and OLS (green) fitted lines overlaid.]




Negative Binomial

Poisson Weaknesses

1 Poisson is a "one parameter" model. The variance is not separately under our control. Maybe we could find a two parameter distribution with a better-suited variance parameter.

2 To repeat the same point: the Poisson may not fit the data because the variance predicted by the Poisson may be too small for the observed data.


Negative Binomial Derivation: Overdispersion

The Negative Binomial can be described in a number of ways. I think the "extra randomness" interpretation is the simplest.

Add to $input_i$ an additional random error that causes "heterogeneity" (sometimes the term "frailty" is used) in the outputs for cases that have the same observed values of $X_i$. Suppose the Poisson process has an expected value:

$$new\ input_i = input_i * \delta_i$$

Note that if $\delta_i = 1$, then this thing just degenerates back to the original Poisson model.


Log Link and Multiplicative Error

In the most common version of the Poisson model, we use the "log link":

$$input_i = exp(X_i b)$$

Supplement it with an additional error term $u_i$:

$$new\ input_i = exp(X_i b + u_i)$$


Multiplicative = Additive

Easy:

$$new\ input_i = exp(X_i b + u_i) = exp(X_i b) \times exp(u_i)$$

So one can either think of the new error as an additive bit of noise in the linear predictor ($+u_i$) or a multiplicative effect applied to the transformed linear predictor ($\times \delta_i = exp(u_i)$).

Obviously, we can convert "back and forth":

$$u_i = log(\delta_i)$$


Vital to Pick the $\delta_i$ Distribution Properly

It is necessary to assume that this new noise is "neutral", in the sense that it causes more uncertainty, but it does not change the average outcome.

That is true if

$$E[\delta_i] = 1 \implies E(exp(u_i)) = 1$$

or, equivalently,

$$E[u_i] = 0$$

"On average" the extra error term has "no effect".


Output is a Conditional Poisson Model

The maximum likelihood estimation has to be amended to incorporate a new likelihood component for each case.

Hence, our theory says that GIVEN $X_i$ and an additional perturbation $u_i$, the probability model is a Poisson process:

$$P(y_i|X_i, u_i) = \frac{exp(-new\ input_i) \times new\ input_i^{y_i}}{y_i!}$$

The input on the right side includes the additional frailty.


Gamma distributed heterogeneity

Gamma is the Most Common Frailty Distribution

The Gamma is the common probability distribution for $\delta_i = exp(u_i)$.

The full Gamma distribution has two parameters, but we are going to simplify so that we only need to worry about one, v = shape, which determines the variance. This simplification of the gamma can be done in several ways, which will be outlined later.

The key thing is this: if $\delta_i$ is drawn from a "properly selected" gamma distribution, then $E(\delta_i) = 1$ and $Var[\delta_i] = 1/(\text{some parameter we choose})$.


Gamma distribution background

Gamma Density Illustration

The Gamma describes the probability of a continuous variable on $[0, \infty)$. It can look like a "ski slope" or it can be single-peaked.

[Figure: Gamma Distribution]


Gamma PDF

Two parameters, shape and scale. In some books, the scale parameter is replaced by a parameter called rate, which equals 1/scale.

If $\delta_i$ is Gamma distributed, the probability density function is:

$$f(\delta_i) = \frac{1}{scale^{shape}\, \Gamma(shape)}\, \delta_i^{(shape-1)}\, e^{-(\delta_i / scale)}$$
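As a check (added here, with arbitrary parameter values), the written-out density matches R's dgamma(), which uses the same shape/scale parameterization:

shape <- 2; scale <- 1.5; d <- 0.8
# the density written out directly, as in the formula above
manual <- 1 / (scale^shape * gamma(shape)) * d^(shape - 1) * exp(-d / scale)
all.equal(manual, dgamma(d, shape = shape, scale = scale))   # TRUE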


What is that Gamma function?

The function Γ(shape) is the Gamma function (which is a complicated math thing I've never looked into very much). It is

$$\Gamma(s) = \int_0^{\infty} t^{s-1} e^{-t}\, dt, \quad s > 0$$

If you pick s as an integer, Γ(s) is very easy to calculate:

$$\Gamma(s) = (s - 1)!, \quad s = 1, 2, \ldots$$

So the value of Γ(1) = 1. And Γ(2) = 1. And Γ(20) is some impossibly huge number.

[Figure: the Gamma function, gamma(x), plotted for x from 0 to 5.]


Adjust the Gamma PDF to create the right kind of heterogeneity

The two parameter Gamma probability distribution has these interesting properties:

$$E(\delta_i) = shape * scale$$

$$Var(\delta_i) = shape * scale^2$$

Simplify: set scale = 1/shape. The expected value and variance are then

$$E[\delta_i] = shape/shape = 1$$

$$Var[\delta_i] = shape/shape^2 = 1/shape$$
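A quick simulation (added here; 100000 draws is an arbitrary choice) confirms that with scale = 1/shape the frailty draws have mean 1 and variance 1/shape:

shape <- 4
delta <- rgamma(100000, shape = shape, scale = 1 / shape)
mean(delta)   # close to 1
var(delta)    # close to 1/shape = 0.25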


Suppose shape is the Same for All Observations

The shape parameter is assumed to exist; we need to estimate it.

In this formulation, it is easy to see that if shape is very large, then the variance of $\delta_i$ is very small. The extra heterogeneity has only a minor effect, and, in fact, as shape tends to ∞, the distribution of $\delta_i$ collapses around 1.0.


Another Derivation that Ends Up at the Same Place

Fix the scale = 1. Draw a random variable $m_i$ from gamma(shape, 1). Then the probability density formula with scale = 1 simplifies to:

$$f(m_i) = \frac{1}{\Gamma(shape)}\, m_i^{(shape-1)}\, e^{-m_i}, \quad shape > 0$$

If shape = 1, then this is an exponential distribution (because Γ(1) = 1).

The expected value and variance are:

$$E[m_i] = shape$$

and

$$Var[m_i] = shape$$


Not a Better Formulation, Just Different

The advantage of this formulation is that we can easily see what we need to do to convert $m_i$ into our final result:

$$\delta_i = \frac{m_i}{shape}$$

Notice that after dividing each draw by shape, we have a variable with just the same properties as in the other formulation:

$$E(\delta_i) = E\left(\frac{m_i}{shape}\right) = \frac{1}{shape} E(m_i) = \frac{shape}{shape} = 1$$

and also

$$V(\delta_i) = V\left(\frac{m_i}{shape}\right) = \frac{1}{shape^2} V(m_i) = \frac{shape}{shape^2} = \frac{1}{shape}$$

If you go back and forth between books, you get a headache because no two books seem to write this down in exactly the same way. But I'm pretty sure I've written it down correctly.


Illustrate $m_i/shape$

[Figure: histograms of gamma/shape draws for shape = 0.5, 1, and 5 (scale = 1), with z ranging 0 to 5.]


Illustrate $log(m_i/shape)$

[Figure: histograms of log(gamma/shape) for shape = 0.5, 1, and 5 (scale = 1), with log(z) ranging -10 to 2.]


About that "shape" parameter

[Figure: frequency histograms of raw gamma draws for shape = 0.5, 1, and 5 (scale = 1).]
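A sketch (assumed, not the author's script) of how the $\delta_i = m_i/shape$ histograms above can be generated:

par(mfrow = c(3, 1))
for (shape in c(0.5, 1, 5)) {
  z <- rgamma(1000, shape = shape, scale = 1) / shape   # delta = m/shape
  hist(z, freq = FALSE, xlim = c(0, 5),
       main = paste("gamma/shape, shape =", shape, "scale = 1"), xlab = "z")
}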


NB estimation

Estimating

Fitting is an iterative, two-stage process:

The shape estimate is chosen.
Then the slope parameters are estimated.
Repeat until estimates converge to stable values.

The MASS package for R provides a procedure, glm.nb(), which uses maximum likelihood to estimate the b's and the shape parameter. (In Venables & Ripley, p. 207, the "shape" parameter is called θ.)
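A usage sketch (assuming the same hypothetical dat data frame used in the Poisson examples):

library(MASS)
nb1 <- glm.nb(y ~ x1 + x2, data = dat)
summary(nb1)   # reports "Theta", MASS's name for the shape parameter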


The Negative Binomial Distribution "Pops Out"

If you start with a Poisson model, and then add multiplicative Gamma-distributed noise,

$$Y\, |\, \delta_i \sim Poisson(input_i * \delta_i)$$

the result is known (in probability theory) to be a Negative Binomial distribution:

$$f_y(y|shape, input) = \frac{\Gamma(shape + y)}{\Gamma(shape)\, y!} \cdot \frac{input^y\, shape^{shape}}{(input + shape)^{shape+y}}$$

(Venables and Ripley, 4th ed., p. 206)

$$E(y_i) = input$$

$$Var(y_i) = input + input^2/shape$$

Note that as shape goes to ∞, the variance of $y_i$ is just $input_i$, meaning the original Poisson model is back! But for other values of the shape parameter, the variance of $y_i$ is greater than in the Poisson model.
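This mixture result can be verified by simulation (a sketch added here; input = 3 and shape = 2 are arbitrary values): Poisson draws with gamma frailty should behave like negative binomial draws with size = shape and mu = input.

set.seed(1)
input <- 3; shape <- 2
delta <- rgamma(100000, shape = shape, scale = 1 / shape)
y.mix <- rpois(100000, lambda = input * delta)      # Poisson with frailty
y.nb  <- rnbinom(100000, size = shape, mu = input)  # direct NB draws
c(mean(y.mix), mean(y.nb))   # both close to input = 3
c(var(y.mix), var(y.nb))     # both close to input + input^2/shape = 7.5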


Overdispersion

The results indicate one surprise: the expected value of y is the same in the Poisson and the NB model. However, the variance is different. In the NB model, the variance is

$$Var(y_i|X) = exp(X_i b)\left(1 + \frac{exp(X_i b)}{v_i}\right)$$

Estimates from a Poisson model are inefficient and have bad standard errors if the data is really produced by a heterogeneous process of the NB sort.

Note the Poisson model is really "nested" inside the NB model. If we do a significance test of $H_0: \alpha = 0$ and cannot reject it, then it means we ought to go back to the Poisson. Long p. 237 discusses other tests. See the R package pscl for a test that can be used.



Zero inflated models

The Poisson or NB models might not match the data because they don't have enough observed 0's.

The "fix" is to think of the probability process as a two step thing. First, the observed y is either 0 or a number $y_i$; whether it is observed or not is modeled by any dichotomous regression model, such as logit or probit. Second, if it is observed, the count is given by one of the models above.

All kinds of details flow forth if you get into writing out one of these models. Should the predictors in the dichotomous regression be the same ones that are used in the Poisson or NB regression? Should we insist the predictive part of the probit model is proportional to the count model?

Now, how can a probability process give back a 0? Either through the failure of the probit stage or a predicted 0 from the count stage, so in the Poisson case

$$P(y_i = 0|X_i) = \psi_i + (1 - \psi_i) * exp(-exp(Xb))$$

(Write out the Poisson for y = 0 to understand the last term.)

And the probability of any other value is given by the regular Poisson, multiplied by $(1 - \psi_i)$:

$$P(y_i|X_i) = (1 - \psi_i) * Poisson(Xb)$$
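A usage sketch of a zero-inflated fit (assuming the same hypothetical dat as before): pscl's zeroinfl() takes a two-part formula, count model | zero model, and the two parts may use different predictors.

library(pscl)
zip1 <- zeroinfl(y ~ x1 + x2 | x1 + x2, data = dat, dist = "poisson")
summary(zip1)   # count-model coefficients plus the zero-inflation (logit) part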




Additional Readings

For more reading on Count models, consult the following, probably in this order:

Scott Long, Regression Models for Categorical and Limited Dependent Variables, Chapter 8, "Count Outcomes".

Gary King. 1988. "Statistical Models for Political Science Event Counts: Bias in Conventional Procedures and Evidence for the Exponential Poisson Regression Model." American Journal of Political Science 32(3): 838-863.

Gary King. 1989. "Variance Specification in Event Count Models: From Restrictive Assumptions to a Generalized Estimator." American Journal of Political Science 33(3): 762-784.

Cameron and Trivedi. 1998. Regression Analysis of Count Data. Cambridge University Press.
