Survival Analysis

Survival Analysis: 

The Study of Lifetimes and their Distributions 

Bruce L. Jones 

Department of Statistical and Actuarial Sciences 

The University of Western Ontario 

PLCS/RDC Statistics and Data Series at Western 

April 20, 2011

Outline 

• What is survival analysis? 

• What do we mean by lifetimes? 

• What are the challenges presented by lifetime data? 

• How can we characterize lifetime distributions? 

• What are the well-known parametric and non-parametric survival models? 

• How can we investigate/model the impact of fixed and time-varying covariates on a lifetime distribution? 

What is survival analysis? 

Survival analysis deals with 

• modelling or understanding the behaviour of the lifetime distribution for a population, 

• statistical tests for differences between the lifetime distributions of two or more populations, 

• modelling/assessing the impact of explanatory variables on a lifetime distribution. 

What do we mean by lifetimes? 

• Lifetime refers to the time until a specified event, not necessarily the end of a life. Examples: 

– time until death 

– time until retirement 

– time until termination of employment 

– time until a health-related event 

– time until marriage/divorce/remarriage 

– time until default of a bond or mortgage 

• The methods of survival analysis can even be applied when the quantity of interest is not a time; it can be any positive-valued random variable. 

What are the challenges presented by lifetime data? 

Suppose that $T_i$ is the lifetime of subject $i$. 

Our observation of $T_i$ could be censored or truncated. 

• Censoring 

If $T_i$ is in the interval $(a_i, b_i)$, we observe only that $a_i < T_i < b_i$. 
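As a rough illustration of how censoring and truncation arise together, the sketch below turns one employee's history into a triple (l, r, delta) measured in years since hire. The function name `observe`, the decimal-year encoding, and the sample employees are hypothetical. It assumes the usual convention that a subject enters the data only if still employed at some point during the observation window (left truncation), and that a subject still employed when the window closes is right censored.

```python
def observe(hire_year, exit_year, window_start, window_end):
    """Return (l, r, delta) in years since hire, or None if never observed.

    exit_year is None for an employee who has not terminated.
    All dates are decimal calendar years (an assumption of this sketch).
    """
    # Hired after the window closed, or gone before it opened: never in the data.
    if hire_year >= window_end:
        return None
    if exit_year is not None and exit_year <= window_start:
        return None
    # Left truncation time: how long the employee had already survived
    # when observation began.
    l = max(0.0, window_start - hire_year)
    if exit_year is not None and exit_year <= window_end:
        return (l, exit_year - hire_year, 1)   # event observed
    return (l, window_end - hire_year, 0)      # right censored at window end


# Observation window covering calendar years 1999 and 2000.
long_server = observe(1990.0, None, 1999.0, 2001.0)
new_hire = observe(1999.5, 2000.75, 1999.0, 2001.0)
left_early = observe(1985.0, 1997.0, 1999.0, 2001.0)
```

The first employee contributes a left-truncated, right-censored observation; the third never appears in the data at all, which is why ignoring truncation would bias the estimated lifetime distribution toward long lifetimes.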

Example: Data on Time until Termination of Employment 

Event of Interest: Termination of employment, voluntary (the employee quits) or involuntary (the employee is fired), but not retirement. 

Lifetime: Time from date of hire to date of termination. 

Data: Date of birth, date of hire, sex, and salary for about 400 individuals who were employees of a company during 1999 or 2000, and the dates of all terminations during 1999 and 2000. 

Questions: 

• How does the distribution of the time until termination behave? 

• Does the time until termination depend on the age at hire? How? 

• Does the time until termination depend on the sex of the employee? How? 

• Does the time until termination depend on the salary of the employee? How? 


Summary of Data 

                             1999                     2000 
                     Females  Males  Both     Females  Males  Both 
Existing Employees     164     139    303       173     148    321 
New Hires               30      19     49        27      25     52 
Terminations            30      15     45        36      16     52 


[Figure: Number of Employees. Vertical axis: number (0 to 120); horizontal axis: years since hire (0 to 30).] 

How can we characterize lifetime distributions? 

Let T be a continuous, positive-valued random variable that represents the 

time until some event. 

The distribution of T can be characterized in terms of 

$S(t) = \Pr(T > t)$ (survival function) 

$F(t) = \Pr(T \le t)$ (distribution function) 

$f(t) = F'(t)$ (probability density function) 

$h(t) = -\dfrac{S'(t)}{S(t)} = \dfrac{f(t)}{S(t)}$ (hazard function) 

$H(t) = \int_0^t h(u)\,du$ (cumulative hazard function) 

$r(t) = E[T - t \mid T > t] = \int_t^\infty (u - t)\,\dfrac{f(u)}{S(t)}\,du$ (mean residual lifetime) 
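The relationships among these functions can be checked numerically. The sketch below uses an exponential lifetime with a hypothetical rate `LAM = 0.5`, for which the hazard is constant, the cumulative hazard is linear, and the mean residual lifetime equals 1/λ at every t. The trapezoidal integration and its truncation point are choices made for this illustration only.

```python
import math

LAM = 0.5  # hypothetical exponential rate


def S(t):
    return math.exp(-LAM * t)          # survival function


def f(t):
    return LAM * math.exp(-LAM * t)    # density, f = F' = -S'


def h(t):
    return f(t) / S(t)                 # hazard function


def H(t):
    return -math.log(S(t))             # cumulative hazard, since S = exp(-H)


def mean_residual_life(t, upper=80.0, n=80000):
    """Trapezoidal approximation of the integral of (u - t) f(u) / S(t)
    over u from t to t + upper (the infinite tail is truncated)."""
    step = upper / n
    total = 0.0
    for k in range(n + 1):
        u = t + k * step
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * (u - t) * f(u) / S(t)
    return total * step


hazard_at_3 = h(3.0)
cum_hazard_at_3 = H(3.0)
mrl_at_2 = mean_residual_life(2.0)
```

For the exponential case the three values should be close to λ, 3λ, and 1/λ respectively, reflecting the lack of memory of this distribution.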

Well-known parametric survival models 

Exponential Distribution 

The hazard function is constant. 

$S(t) = e^{-\lambda t}$, $t \ge 0$, $\lambda > 0$. 

Gompertz Distribution 

$S(t) = \exp\left\{-\dfrac{B}{\ln c}\,(c^t - 1)\right\}$, $t \ge 0$, $B > 0$, $c > 1$. 

The hazard function increases exponentially. 

Well-known parametric survival models 

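Weibull Distribution 

$S(t) = e^{-(t/\theta)^{\tau}}$, $t \ge 0$, $\theta, \tau > 0$. 

The hazard function is increasing if $\tau > 1$, decreasing if $\tau < 1$, and is constant if $\tau = 1$. 

Lognormal Distribution 

$S(t) = 1 - \Phi\left(\dfrac{\log t - \mu}{\sigma}\right)$, $t > 0$, $-\infty < \mu < \infty$, $\sigma > 0$. 

Loglogistic Distribution 

$S(t) = \dfrac{1}{1 + (\lambda t)^{\alpha}}$, $t \ge 0$, $\lambda, \alpha > 0$. 

The hazard function is decreasing if $\alpha \le 1$. If $\alpha > 1$, the hazard function is increasing for $t < \dfrac{(\alpha - 1)^{1/\alpha}}{\lambda}$ and decreasing for $t > \dfrac{(\alpha - 1)^{1/\alpha}}{\lambda}$. 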
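The hazard shapes described for these models can be checked with a short numerical sketch. The hazard is obtained from each survival function by a central difference of log S(t). The parameter values are hypothetical, and the loglogistic form S(t) = 1/(1 + (λt)^α) is an assumed parameterization, chosen because it reproduces the turning point (α − 1)^(1/α)/λ.

```python
import math


def gompertz_S(t, B, c):
    return math.exp(-(B / math.log(c)) * (c ** t - 1.0))


def weibull_S(t, theta, tau):
    return math.exp(-((t / theta) ** tau))


def loglogistic_S(t, lam, alpha):
    # Assumed parameterization (see lead-in).
    return 1.0 / (1.0 + (lam * t) ** alpha)


def hazard(S, t, eps=1e-6):
    """Numerical hazard h(t) = -d/dt log S(t), by a central difference."""
    return -(math.log(S(t + eps)) - math.log(S(t - eps))) / (2.0 * eps)


# Gompertz: the hazard should equal B * c**t, i.e. grow exponentially.
gompertz_h = hazard(lambda t: gompertz_S(t, 0.001, 1.1), 10.0)

# Weibull with tau = 1: the hazard should be the constant 1 / theta.
weibull_h = [hazard(lambda t: weibull_S(t, 4.0, 1.0), t) for t in (1.0, 5.0)]

# Loglogistic with alpha = 2, lam = 1: the hazard should peak at t = 1.
peak = (2.0 - 1.0) ** (1.0 / 2.0) / 1.0
ll_h = [hazard(lambda t: loglogistic_S(t, 1.0, 2.0), t) for t in (0.5, peak, 2.0)]
```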

What is a nonparametric model? 

A nonparametric model has the following characteristics: 

• It does not impose a structure on the distribution. 

• The number of parameters is determined by the data. 

• It exhibits features that are apparent in the data. 


Why use a nonparametric model? 

• To explore the data graphically 

• As a model for calculations 

• To check the fit of a parametric model 


The Kaplan-Meier Estimator 

We can obtain a nonparametric model for a lifetime distribution using the Kaplan-Meier (KM) estimator of the survival function. 

Suppose we have (left truncated and right censored) data on n subjects. 

Let $t_1 < t_2 < \cdots < t_m$ be the distinct observed event times. 
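A minimal sketch of the standard product-limit form of the KM estimator, adapted to left-truncated and right-censored data, is given below. It assumes each subject is recorded as (l, r, delta), with l the entry time, r the event or censoring time, and delta the event indicator; a subject counts as at risk at an event time t when l < t ≤ r. The function name and the toy data are hypothetical, and the notation may differ from the original slide.

```python
def kaplan_meier(data):
    """Product-limit estimate of S(t) at each distinct event time.

    data: list of (l, r, delta) triples.
    Returns a list of (event_time, survival_estimate) pairs.
    """
    event_times = sorted({r for (l, r, d) in data if d == 1})
    surv = 1.0
    curve = []
    for t in event_times:
        # Risk set: entered before t and not yet failed or censored.
        at_risk = sum(1 for (l, r, d) in data if l < t <= r)
        events = sum(1 for (l, r, d) in data if r == t and d == 1)
        surv *= 1.0 - events / at_risk
        curve.append((t, surv))
    return curve


toy = [(0, 2, 1), (0, 3, 0), (1, 4, 1), (0, 4, 1), (2, 5, 0)]
km = kaplan_meier(toy)
```

In the toy data the subject entering at time 2 is not at risk for the event at time 2, so the first factor is 1 − 1/4; at time 4 three subjects are at risk and two fail, giving a second factor of 1 − 2/3.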


[Figure: Survival Function of Time until Termination. Separate curves for females and males; vertical axis: probability (0 to 1); horizontal axis: years since entry (0 to 35).] 

The Kaplan-Meier Estimator 

Since 

$S(t) = \exp\{-H(t)\}$, 

and therefore 

$H(t) = -\log S(t)$, 

we can use the KM estimates to obtain estimates of the $H(t)$ values. 
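The transformation from survival estimates to cumulative hazard estimates is a one-line calculation. In the sketch below, the list `km_curve` of (time, survival estimate) pairs is hypothetical; points where the estimate has reached zero are skipped because the logarithm is undefined there.

```python
import math


def cumulative_hazard(km_curve):
    """Map (t, S_hat) pairs to (t, H_hat) pairs using H = -log S."""
    return [(t, -math.log(s)) for (t, s) in km_curve if s > 0.0]


km_curve = [(2, 0.75), (4, 0.25)]   # hypothetical KM estimates
H_hat = cumulative_hazard(km_curve)
```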


[Figure: Cumulative Hazard Function of Time until Termination. Separate curves for females and males; vertical axis from 0 to 4; horizontal axis from 0 to 35.] 

Fitting a parametric model by maximum likelihood 

Suppose we observe n subjects. For subject $i$, we observe $(l_i, r_i, \delta_i)$, where $l_i$ is the left truncation time, $r_i$ is the event or right censoring time, and $\delta_i$ is 1 if an event is observed and 0 otherwise. 

The likelihood function is given by 

$$L(\theta) = \prod_{i=1}^{n} \frac{[f(r_i; \theta)]^{\delta_i}\,[S(r_i; \theta)]^{1-\delta_i}}{S(l_i; \theta)}.$$ 

$L$ is maximized with respect to $\theta$ to obtain $\hat{\theta}$, the vector of MLEs. 
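For the exponential model the maximization can be done in closed form, which makes it a convenient check on the likelihood. With f(t) = λe^{−λt} and S(t) = e^{−λt}, the log-likelihood is Σ[δ_i log λ − λ(r_i − l_i)], so the MLE is the number of events divided by the total time under observation. The sketch below uses hypothetical toy data; other parametric models generally need a numerical optimizer instead.

```python
import math


def exponential_mle(data):
    """Closed-form MLE of the exponential rate from (l, r, delta) triples."""
    events = sum(d for (l, r, d) in data)
    exposure = sum(r - l for (l, r, d) in data)   # time actually observed
    return events / exposure


def log_likelihood(lam, data):
    """Log of L(theta) for the exponential model with left truncation."""
    ll = 0.0
    for (l, r, d) in data:
        # delta*log f(r) + (1 - delta)*log S(r) simplifies to delta*log(lam) - lam*r
        ll += d * math.log(lam) - lam * r
        # dividing by S(l) adds lam*l on the log scale
        ll += lam * l
    return ll


toy = [(0, 2, 1), (0, 3, 0), (1, 4, 1), (0, 4, 1), (2, 5, 0)]
lam_hat = exponential_mle(toy)   # 3 events over 15 units of exposure
```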



[Figure: untitled plot with separate curves for females and males; vertical axis from 0 to 4; horizontal axis from 0 to 35.] 

Modelling the Impact of Covariates 

Lifetime distributions may depend on factors related to the individuals or the conditions under which they are observed. In modelling lifetime distributions we frequently wish to reflect the impact of these factors. 

Information about relevant factors is expressed in terms of explanatory variables, also known as covariates. For each individual, we observe a vector $x = (x_1, \ldots, x_p)'$, which gives the values of $p$ covariates. 

We seek a model which captures the impact of $x$ on the lifetime distribution. 

Modelling the Impact of Covariates 

The most common types of models are 

1. Accelerated failure time (AFT) models: 

$T = T_0 \exp\{u(x)\}$, 

2. Proportional hazards (PH) models: 

$h(t \mid x) = h_0(t) \exp\{u(x)\}$, 

where normally $u(x)$ is a linear function of the components of $x$ with coefficients that are parameters to be estimated. 

Accelerated failure time (AFT) models 

$T = T_0 \exp\{u(x)\}$, 

where 

$u(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$. 

Therefore, if $Y = \log T$ and we write $\log T_0 = bZ$ for a scale parameter $b > 0$ and a random error $Z$, then 

$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + bZ$. 

Popular choices for the distribution of $Z$ are 

• Normal ⇒ T ∼ lognormal 

• Logistic ⇒ T ∼ loglogistic 

• Extreme value ⇒ T ∼ Weibull 
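The link between the error distribution and the lifetime distribution can be verified numerically for the extreme value case. The sketch below assumes the minimum extreme value form, Pr(Z > z) = exp(−e^z), and drops the covariates so that `beta0` stands for the whole linear predictor. Under these assumptions, log T = β₀ + bZ gives a Weibull lifetime with θ = e^{β₀} and τ = 1/b. The parameter values are hypothetical.

```python
import math


def aft_survival(t, beta0, b):
    """Pr(T > t) when log T = beta0 + b*Z and Pr(Z > z) = exp(-exp(z))."""
    z = (math.log(t) - beta0) / b
    return math.exp(-math.exp(z))


def weibull_S(t, theta, tau):
    return math.exp(-((t / theta) ** tau))


beta0, b = 1.2, 0.5                      # hypothetical values
theta, tau = math.exp(beta0), 1.0 / b    # implied Weibull parameters

# Largest disagreement between the two survival functions on a grid of times.
max_gap = max(
    abs(aft_survival(t, beta0, b) - weibull_S(t, theta, tau))
    for t in (0.5, 1.0, 2.0, 5.0, 10.0)
)
```

Increasing the linear predictor by one unit multiplies θ by e, stretching the time scale; this is the sense in which covariates "accelerate" or "decelerate" failure in an AFT model.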

Proportional hazards (PH) models 

$h(t \mid x) = h_0(t) \exp\{u(x)\}$, 

where 

$u(x) = \beta_1 x_1 + \cdots + \beta_p x_p$. 

When a nonparametric model is used for the baseline hazard function, $h_0(t)$, the PH model is a semiparametric model. 

Established methods allow us to make inferences about $\beta_1, \ldots, \beta_p$ while ignoring the baseline hazard function. 
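One widely used method of this kind is Cox's partial likelihood; the slides do not name the method, so its use here is an assumption. Each observed event contributes the ratio of the failing subject's relative hazard to the sum of relative hazards over everyone at risk at that time, so $h_0(t)$ cancels. The sketch below evaluates the partial log-likelihood for a single fixed covariate, assuming no tied event times; the function name and toy data are hypothetical.

```python
import math


def partial_log_likelihood(beta, data):
    """Cox partial log-likelihood for one covariate.

    data: list of (l, r, delta, x); a subject is at risk at time t
    when l < t <= r. Assumes no tied event times.
    """
    total = 0.0
    for (l_i, r_i, d_i, x_i) in data:
        if d_i == 1:
            risk_sum = sum(
                math.exp(beta * x_j)
                for (l_j, r_j, d_j, x_j) in data
                if l_j < r_i <= r_j
            )
            total += beta * x_i - math.log(risk_sum)
    return total


toy = [(0, 1, 1, 1), (0, 2, 1, 0), (0, 3, 0, 1)]
pl_at_zero = partial_log_likelihood(0.0, toy)
```

At β = 0 every subject at risk is equally likely to fail, so the toy value is −log 3 − log 2. Maximizing this function over β gives the coefficient estimate without any assumption about the baseline hazard.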


Model: $h(t \mid x_1) = h_0(t) \exp\{\beta_1 x_1\}$ 

Covariate: $x_1 = I(\text{sex} = \text{male})$ 

Estimates: 

covariate   coef      exp(coef)   se(coef)   z        p-value 
x_1         -0.5670   0.5672      0.2217     -2.557   0.0105 

Conclusion: 

The hazard function for males is significantly lower than that for females. According to the fitted model, the hazard function for males equals 0.5672 times the hazard function for females. 
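The columns of the estimates table are linked by simple arithmetic, which the sketch below reproduces from the reported coefficient and standard error. It assumes the z statistic is a Wald statistic with a two-sided normal p-value, which is consistent with the reported figures. The 95% confidence interval for the hazard ratio is an addition not shown on the slide, and the function name is hypothetical.

```python
import math


def wald_summary(coef, se):
    """Hazard ratio, Wald z, two-sided normal p-value, and 95% CI for exp(coef)."""
    z = coef / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    hazard_ratio = math.exp(coef)
    ci = (math.exp(coef - 1.96 * se), math.exp(coef + 1.96 * se))
    return hazard_ratio, z, p_value, ci


# Coefficient and standard error for x_1 = I(sex = male) from the table.
hr, z, p, ci = wald_summary(-0.5670, 0.2217)
```

The interval for the hazard ratio lies entirely below 1, which agrees with the conclusion that the male hazard is significantly lower at the 5% level.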


Model: $h(t \mid x) = h_0(t) \exp\{\beta_1 x_1 + \beta_2 x_2\}$ 

Covariates: $x_1 = I(\text{sex} = \text{male})$ and $x_2$ = age at hire 

Estimates: 

covariate   coef       exp(coef)   se(coef)   z        p-value 
x_1         -0.52248   0.59305     0.22316    -2.341   0.0192 
x_2         -0.02072   0.97949     0.01159    -1.788   0.0737 

Conclusion: 

The hazard function decreases by about 2% for each additional year of age at hire. This effect is significant at the 10% level but not at the 5% level. 

Fixed and time-varying covariates 

So far, we have considered only fixed covariates: sex and age at hire do not change during the time the individual is employed. 

Time-varying covariates can change over time. 

In our example, calendar year is such a covariate, and we observe individuals in both 1999 and 2000. Salary is also a time-varying covariate. 

The PH model easily accommodates time-varying covariates: 

$h(t \mid x(t)) = h_0(t) \exp\{\beta_1 x_1(t) + \cdots + \beta_p x_p(t)\}$ 


Suppose a female employee is hired on July 1, 1999 at age 45 and quits on October 1, 2000. 

The data for this employee can be specified as two observations of the form $(l_i, r_i, \delta_i, x_{1i}, x_{2i}, x_{3i})$, where 

$l_i$ = left truncation time for observation $i$ 

$r_i$ = termination or right censoring time for observation $i$ 

$\delta_i$ = termination indicator for observation $i$ 

$x_{1i} = I(\text{sex}_i = \text{male})$ 

$x_{2i}$ = (age at hire)$_i$ 

$x_{3i}$ = calendar year of observation $i$. 

That is, 

(0, 0.5, 0, 0, 45, 1999) 

and 

(0.5, 1.25, 1, 0, 45, 2000). 
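The splitting of one employment spell into calendar-year observations can be sketched as follows. The function names are hypothetical, and dates are simplified to (year, month) pairs taken at the first of the month, so that elapsed time is counted in twelfths of a year. Each episode except the last is right censored at the year boundary, and the next episode is left truncated at the same time.

```python
def years_between(start, end):
    """Elapsed years between two (year, month) pairs, each at the 1st of the month."""
    return (end[0] - start[0]) + (end[1] - start[1]) / 12.0


def split_by_calendar_year(hire, exit_date, terminated, male, age_at_hire):
    """Return one (l, r, delta, x1, x2, x3) row per calendar year of the spell."""
    rows = []
    year = hire[0]
    left = 0.0
    while True:
        next_january = (year + 1, 1)
        is_last = exit_date <= next_january
        right = years_between(hire, exit_date if is_last else next_january)
        # Only the final episode can carry the termination event.
        delta = 1 if (is_last and terminated) else 0
        rows.append((left, right, delta, male, age_at_hire, year))
        if is_last:
            return rows
        left = right
        year += 1


# Female employee hired July 1, 1999 at age 45 who quits October 1, 2000.
rows = split_by_calendar_year((1999, 7), (2000, 10), True, 0, 45)
```

Because the hazard at time t depends only on the covariate values at t, the two rows together carry the same information as the original spell, with the calendar-year covariate allowed to change at the boundary.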


Model: $h(t \mid x(t)) = h_0(t) \exp\{\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3(t)\}$ 

Covariates: $x_1 = I(\text{sex} = \text{male})$, $x_2$ = age at hire, and $x_3(t) = I(\text{calendar year at time } t = 2000)$ 

Estimates: 

covariate   coef       exp(coef)   se(coef)   z        p-value 
x_1         -0.51777   0.59585     0.22324    -2.319   0.0204 
x_2         -0.02073   0.97948     0.01158    -1.789   0.0735 
x_3(t)       0.17245   1.18821     0.20621     0.836   0.4030 

Conclusion: 

The hazard function in 2000 did not differ significantly from that in 1999. 


Model: $h(t \mid x(t)) = h_0(t) \exp\{\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3(t)\}$ 

Covariates: $x_1 = I(\text{sex} = \text{male})$, $x_2$ = age at hire, and $x_3(t) = \log_{10}(\text{salary for the calendar year at time } t)$ 

Estimates: 

covariate   coef        exp(coef)   se(coef)   z        p-value 
x_1         -0.156276   0.855323    0.257055   -0.608   0.54322 
x_2         -0.006394   0.993626    0.011913   -0.537   0.59145 
x_3(t)      -1.516312   0.219520    0.534350   -2.838   0.00454 

Conclusion: 

Salary has a large and statistically significant effect on the termination rate. Once salary is in the model, sex and age at hire are no longer significant, which suggests that salaries may be related to sex and age at hire. 
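Because salary enters the model on a base-10 logarithmic scale, exp(coef) for $x_3(t)$ is the hazard ratio associated with a tenfold increase in salary. The sketch below converts the reported coefficient into the hazard ratio for any salary multiple k, using exp{β₃ log₁₀ k}. The function name is hypothetical, and the calculation holds the other covariates fixed.

```python
import math

BETA_3 = -1.516312  # reported coefficient of log10(salary)


def hazard_ratio_for_salary_multiple(k, beta=BETA_3):
    """Hazard ratio comparing salary k*s with salary s, other covariates equal."""
    return math.exp(beta * math.log10(k))


tenfold = hazard_ratio_for_salary_multiple(10.0)   # matches exp(coef) in the table
doubling = hazard_ratio_for_salary_multiple(2.0)   # roughly 0.63
```

Under the fitted model, then, doubling an employee's salary is associated with a termination hazard a little under two thirds as large.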


[Figure: KM Estimates of the Survival Function of Salary. Vertical axis: probability (0 to 1); horizontal axis: salary (0 to 1,500,000). Median male salary: $86,553; median female salary: $43,080.] 

Summary 

• Survival analysis provides effective tools for analyzing lifetime data. 

• We can easily handle left truncation and right censoring. 

• We can explore parametric, nonparametric, and semiparametric models. 

• Covariates may be fixed or time-varying. 
