
Survival Analysis

Presentation Slides - University of Western Ontario


Survival Analysis:
The Study of Lifetimes and their Distributions

Bruce L. Jones
Department of Statistical and Actuarial Sciences
The University of Western Ontario

PLCS/RDC Statistics and Data Series at Western
April 20, 2011


Outline

• What is survival analysis?
• What do we mean by lifetimes?
• What are the challenges presented by lifetime data?
• How can we characterize lifetime distributions?
• What are the well-known parametric and non-parametric survival models?
• How can we investigate/model the impact of fixed and time-varying covariates on a lifetime distribution?


What is survival analysis?

Survival analysis deals with

• modelling or understanding the behaviour of the lifetime distribution for a population,
• statistical tests for differences between the lifetime distributions of two or more populations,
• modelling/assessing the impact of explanatory variables on a lifetime distribution.


What do we mean by lifetimes?

• Lifetime refers to the time until a specified event, not necessarily the end of a life. Examples:
  – time until death
  – time until retirement
  – time until termination of employment
  – time until a health-related event
  – time until marriage/divorce/remarriage
  – time until default of a bond or mortgage
• The methods of survival analysis can even be applied when the quantity of interest is not a time; it can be any positive-valued random variable.


What are the challenges presented by lifetime data?

Suppose that $T_i$ is the lifetime of subject $i$.

Our observation of $T_i$ could be censored or truncated.

• Censoring
  If $T_i$ is in the interval $(a_i, b_i)$, we observe only that $a_i < T_i < b_i$.


Example: Data on Time until Termination of Employment

Event of Interest: Termination of employment, voluntary (the employee quits) or involuntary (the employee is fired), but not retirement.

Lifetime: Time from date of hire to date of termination.

Data: Date of birth, date of hire, sex, and salary for about 400 individuals who were employees of a company during 1999 or 2000, and the date of all terminations during 1999 and 2000.

Questions:

• How does the distribution of the time until termination behave?
• Does the time until termination depend on the age at hire? How?
• Does the time until termination depend on the sex of the employee? How?
• Does the time until termination depend on the salary of the employee? How?


Example: Data on Time until Termination of Employment

Summary of Data

                      1999                      2000
                      Females  Males  Both      Females  Males  Both
Existing Employees    164      139    303       173      148    321
New Hires             30       19     49        27       25     52
Terminations          30       15     45        36       16     52


Example: Data on Time until Termination of Employment

[Figure: Number of Employees. Vertical axis: number; horizontal axis: years since hire.]


How can we characterize lifetime distributions?

Let $T$ be a continuous, positive-valued random variable that represents the time until some event.

The distribution of $T$ can be characterized in terms of

$S(t) = \Pr(T > t)$   (survival function)

$F(t) = \Pr(T \le t)$   (distribution function)

$f(t) = F'(t)$   (probability density function)

$h(t) = -\dfrac{S'(t)}{S(t)} = \dfrac{f(t)}{S(t)}$   (hazard function)

$H(t) = \int_0^t h(u)\,du$   (cumulative hazard function)

$r(t) = E[T - t \mid T > t] = \int_t^\infty (u - t)\,\dfrac{f(u)}{S(t)}\,du$   (mean residual lifetime)
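The relationships among these functions can be checked numerically. The sketch below is only an illustration: it assumes a Weibull lifetime with arbitrarily chosen parameters (theta = 2, tau = 1.5), uses a simple trapezoidal rule, and truncates the infinite integral in the mean residual lifetime at a hypothetical upper limit.

```python
import math

def survival(t, theta=2.0, tau=1.5):
    """S(t) for an illustrative Weibull lifetime: exp(-(t/theta)**tau)."""
    return math.exp(-((t / theta) ** tau))

def density(t, theta=2.0, tau=1.5):
    """f(t) = F'(t) = -S'(t) for the same Weibull lifetime."""
    return (tau / theta) * (t / theta) ** (tau - 1) * survival(t, theta, tau)

def hazard(t):
    """h(t) = f(t) / S(t)."""
    return density(t) / survival(t)

def integrate(func, a, b, n=20000):
    """Composite trapezoidal rule on [a, b] with n subintervals."""
    step = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for k in range(1, n):
        total += func(a + k * step)
    return total * step

def cumulative_hazard(t):
    """H(t) = integral of h(u) du from 0 to t."""
    return integrate(hazard, 0.0, t)

def mean_residual_lifetime(t, upper=60.0):
    """r(t) = integral of (u - t) f(u) / S(t) du from t to infinity.

    The integral is truncated at `upper` (an assumption of this sketch),
    beyond which the chosen Weibull density is negligible.
    """
    return integrate(lambda u: (u - t) * density(u), t, upper) / survival(t)

# H(t) should agree with -log S(t), and r(0) with the mean of T.
h_numeric = cumulative_hazard(1.0)
h_closed_form = -math.log(survival(1.0))
mean_lifetime = mean_residual_lifetime(0.0)
```

Here `h_numeric` and `h_closed_form` agree to several decimal places, and `mean_lifetime` matches the Weibull mean theta * Gamma(1 + 1/tau).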


Well-known parametric survival models

Exponential Distribution

$S(t) = e^{-\lambda t}$, $t \ge 0$, $\lambda > 0$.

The hazard function is constant.

Gompertz Distribution

$S(t) = \exp\left\{-\dfrac{B}{\ln c}\,(c^t - 1)\right\}$, $t \ge 0$, $B > 0$, $c > 1$.

The hazard function increases exponentially.
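The two hazard shapes can be confirmed from the survival functions alone, since $h(t) = -\frac{d}{dt}\log S(t)$. The following sketch uses a central finite difference; the parameter values (lambda = 0.1, B = 0.0005, c = 1.1) are arbitrary illustrations, not values from the slides.

```python
import math

def exponential_survival(t, lam):
    """S(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

def gompertz_survival(t, B, c):
    """S(t) = exp(-(B / ln c) * (c**t - 1))."""
    return math.exp(-(B / math.log(c)) * (c ** t - 1.0))

def numerical_hazard(survival, t, eps=1e-6):
    """Approximate h(t) = -d/dt log S(t) with a central difference."""
    upper = math.log(survival(t + eps))
    lower = math.log(survival(t - eps))
    return -(upper - lower) / (2.0 * eps)

# Exponential: the hazard equals lam at every t.
exp_hazard = numerical_hazard(lambda t: exponential_survival(t, 0.1), 5.0)

# Gompertz: differentiating the cumulative hazard gives h(t) = B * c**t.
gompertz_hazard = numerical_hazard(lambda t: gompertz_survival(t, 0.0005, 1.1), 40.0)
gompertz_closed_form = 0.0005 * 1.1 ** 40
```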


Well-known parametric survival models

Weibull Distribution

$S(t) = e^{-(t/\theta)^{\tau}}$, $t \ge 0$, $\theta, \tau > 0$.

The hazard function is increasing if $\tau > 1$, decreasing if $\tau < 1$, and is constant if $\tau = 1$.

Lognormal Distribution

$S(t) = 1 - \Phi\left(\dfrac{\log t - \mu}{\sigma}\right)$, $t > 0$, $-\infty < \mu < \infty$, $\sigma > 0$.

Loglogistic Distribution

$S(t) = \dfrac{1}{1 + (\lambda t)^{\alpha}}$, $t \ge 0$, $\alpha, \lambda > 0$.

The hazard function is decreasing if $\alpha \le 1$. If $\alpha > 1$, the hazard function is increasing for $t < \dfrac{(\alpha - 1)^{1/\alpha}}{\lambda}$ and decreasing for $t > \dfrac{(\alpha - 1)^{1/\alpha}}{\lambda}$.
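The hazard shapes described above can be spot-checked numerically. This is a sketch under stated assumptions: the Weibull hazard is derived from the survival function given on the slide, while the hazard with parameters alpha and lambda is assumed to be that of a loglogistic distribution with $S(t) = 1/(1 + (\lambda t)^{\alpha})$, a parameterization consistent with the turning point $(\alpha - 1)^{1/\alpha}/\lambda$. All parameter values are illustrative.

```python
def weibull_hazard(t, theta, tau):
    """h(t) = (tau / theta) * (t / theta)**(tau - 1), from S(t) = exp(-(t/theta)**tau)."""
    return (tau / theta) * (t / theta) ** (tau - 1)

def loglogistic_hazard(t, lam, alpha):
    """h(t) for the assumed loglogistic form S(t) = 1 / (1 + (lam * t)**alpha)."""
    return alpha * lam ** alpha * t ** (alpha - 1) / (1.0 + (lam * t) ** alpha)

def loglogistic_hazard_peak(lam, alpha):
    """Turning point of the hazard when alpha > 1: (alpha - 1)**(1/alpha) / lam."""
    return (alpha - 1.0) ** (1.0 / alpha) / lam

# Weibull: increasing for tau > 1, decreasing for tau < 1, constant for tau = 1.
increasing = weibull_hazard(1.0, 2.0, 1.5) < weibull_hazard(2.0, 2.0, 1.5)
decreasing = weibull_hazard(1.0, 2.0, 0.5) > weibull_hazard(2.0, 2.0, 0.5)
constant = weibull_hazard(1.0, 2.0, 1.0) == weibull_hazard(2.0, 2.0, 1.0)

# Loglogistic with alpha > 1: the hazard is highest at the turning point.
peak = loglogistic_hazard_peak(0.5, 3.0)
at_peak = loglogistic_hazard(peak, 0.5, 3.0)
before_peak = loglogistic_hazard(0.9 * peak, 0.5, 3.0)
after_peak = loglogistic_hazard(1.1 * peak, 0.5, 3.0)
```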


What is a nonparametric model?

A nonparametric model has the following characteristics:

• It does not impose a structure on the distribution.
• The number of parameters is determined by the data.
• It exhibits features that are apparent in the data.


Why use a nonparametric model?

• To explore the data graphically
• As a model for calculations
• To check the fit of a parametric model


The Kaplan-Meier Estimator

We can obtain a nonparametric model for a lifetime distribution using the Kaplan-Meier (KM) estimator of the survival function.

Suppose we have (left truncated and right censored) data on $n$ subjects.

Let $t_1$ …
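For readers who want to see the estimator in action, here is a minimal sketch of the standard product-limit form, $\hat{S}(t) = \prod_{t_j \le t} (1 - d_j / Y_j)$, where $d_j$ is the number of events at the distinct event time $t_j$ and $Y_j$ is the number of subjects at risk. This is the textbook formulation, assumed here rather than taken from the slide; the record layout `(l, r, delta)` and the toy data are hypothetical.

```python
def kaplan_meier(records):
    """Product-limit estimate from left truncated, right censored data.

    records: list of (l, r, delta) with l the left truncation (entry) time,
    r the event or censoring time, and delta = 1 if an event was observed.
    Returns a list of (event_time, estimated_survival) pairs.
    """
    event_times = sorted({r for (_, r, delta) in records if delta == 1})
    estimate = 1.0
    curve = []
    for t in event_times:
        # A subject is at risk at t if it entered before t and is still observed at t.
        at_risk = sum(1 for (l, r, _) in records if l < t <= r)
        events = sum(1 for (_, r, delta) in records if r == t and delta == 1)
        estimate *= 1.0 - events / at_risk
        curve.append((t, estimate))
    return curve

# Toy data (hypothetical): six subjects, two of them entering late.
records = [
    (0.0, 2.0, 1),
    (0.0, 3.0, 0),
    (1.0, 4.0, 1),
    (0.0, 4.0, 1),
    (2.5, 5.0, 0),
    (0.0, 6.0, 0),
]
curve = kaplan_meier(records)
```

With these toy records, five subjects are at risk at time 2 (one event) and four at time 4 (two events), so the estimate steps to 0.8 and then to 0.4.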


Example: Data on Time until Termination of Employment

[Figure: Survival Function of Time until Termination, with separate curves for females and males. Vertical axis: probability; horizontal axis: years since entry.]




The Kaplan-Meier Estimator

Since

$S(t) = \exp\{-H(t)\}$,

and therefore

$H(t) = -\log S(t)$,

we can use the KM estimates to obtain estimates of the $H(t)$ values.
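As a small illustration of this transformation, the sketch below converts survival estimates into cumulative hazard estimates. The survival values are hypothetical placeholders, not estimates from the employment data.

```python
import math

def cumulative_hazard_from_survival(km_curve):
    """Apply H(t) = -log S(t) to each (time, survival) pair.

    Points with an estimated survival of zero are skipped, because the
    cumulative hazard estimate is unbounded there.
    """
    return [(t, -math.log(s)) for (t, s) in km_curve if s > 0.0]

# Hypothetical KM output: (event time, estimated survival probability).
km_curve = [(2.0, 0.8), (4.0, 0.4)]
hazard_curve = cumulative_hazard_from_survival(km_curve)
```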


Example: Data on Time until Termination of Employment

[Figure: Cumulative Hazard Function of Time until Termination, with separate curves for females and males. Vertical axis labelled "probability" (range 0 to 4); horizontal axis: years since entry.]


Fitting a parametric model by maximum likelihood

Suppose we observe $n$ subjects. For subject $i$, we observe $(l_i, r_i, \delta_i)$, where $l_i$ is the left truncation time, $r_i$ is the event or right censoring time, and $\delta_i$ is 1 if an event is observed and 0 otherwise.

The likelihood function is given by

$L(\theta) = \prod_{i=1}^{n} \dfrac{[f(r_i; \theta)]^{\delta_i}\,[S(r_i; \theta)]^{1-\delta_i}}{S(l_i; \theta)}$.

$L$ is maximized with respect to $\theta$ to obtain $\hat{\theta}$, the vector of MLEs.
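To make the likelihood concrete, the sketch below evaluates its logarithm for a generic model and then specializes to the exponential distribution, chosen here only because the maximizer has a closed form: with $f(t) = \lambda e^{-\lambda t}$ and $S(t) = e^{-\lambda t}$, the log-likelihood is $\sum_i \delta_i \log\lambda - \lambda \sum_i (r_i - l_i)$, so $\hat{\lambda}$ is the number of events divided by the total time under observation. The function names and toy records are hypothetical.

```python
import math

def log_likelihood(theta, records, log_density, log_survival):
    """log L(theta) for left truncated, right censored records (l, r, delta)."""
    total = 0.0
    for (l, r, delta) in records:
        total += delta * log_density(r, theta)         # event observed at r
        total += (1 - delta) * log_survival(r, theta)  # right censored at r
        total -= log_survival(l, theta)                # condition on survival to l
    return total

def exp_log_density(t, lam):
    """log f(t) for the exponential distribution."""
    return math.log(lam) - lam * t

def exp_log_survival(t, lam):
    """log S(t) for the exponential distribution."""
    return -lam * t

def exponential_mle(records):
    """Closed-form MLE: observed events divided by total exposure."""
    events = sum(delta for (_, _, delta) in records)
    exposure = sum(r - l for (l, r, _) in records)
    return events / exposure

# Toy data (hypothetical): 3 events over 20.5 units of exposure.
records = [
    (0.0, 2.0, 1), (0.0, 3.0, 0), (1.0, 4.0, 1),
    (0.0, 4.0, 1), (2.5, 5.0, 0), (0.0, 6.0, 0),
]
lam_hat = exponential_mle(records)
best = log_likelihood(lam_hat, records, exp_log_density, exp_log_survival)
lower = log_likelihood(0.9 * lam_hat, records, exp_log_density, exp_log_survival)
higher = log_likelihood(1.1 * lam_hat, records, exp_log_density, exp_log_survival)
```

For distributions without a closed-form maximizer, the same `log_likelihood` function could be passed to a numerical optimizer.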


Example: Data on Time until Termination of Employment

[Figure: Cumulative Hazard Function of Time until Termination, with separate curves for females and males. Vertical axis labelled "probability" (range 0 to 4); horizontal axis: years since entry.]




Modelling the Impact of Covariates

Lifetime distributions may depend on factors related to the individuals or the conditions under which they are observed. In modelling lifetime distributions we frequently wish to reflect the impact of these factors.

Information about relevant factors is expressed in terms of explanatory variables, also known as covariates. For each individual, we observe a vector $x = (x_1, \ldots, x_p)'$, which gives the values of $p$ covariates.

We seek a model which captures the impact of $x$ on the lifetime distribution.


Modelling the Impact of Covariates

The most common types of models are

1. Accelerated failure time (AFT) models:

   $T = T_0 \exp\{u(x)\}$,

2. Proportional hazards (PH) models:

   $h(t \mid x) = h_0(t) \exp\{u(x)\}$,

where normally $u(x)$ is a linear function of the components of $x$ with coefficients that are parameters to be estimated.


Accelerated failure time (AFT) models

$T = T_0 \exp\{u(x)\}$,

where

$u(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$.

Therefore, if $Y = \log T$, then

$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + bZ$,

where $bZ = \log T_0$.

Popular choices for the distribution of $Z$ are

• Normal ⇒ $T \sim$ lognormal
• Logistic ⇒ $T \sim$ loglogistic
• Extreme value ⇒ $T \sim$ Weibull
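The extreme value case can be verified directly. The sketch below assumes that $Z$ follows the standard (minimum) extreme value distribution, with $\Pr(Z > z) = \exp(-e^{z})$; under that assumption $T$ is Weibull with $\theta = e^{u(x)}$ and $\tau = 1/b$, and changing a covariate rescales every quantile of $T$ by the same factor. The coefficient values are hypothetical.

```python
import math

def linear_predictor(x, beta0, beta):
    """u(x) = beta0 + beta_1 x_1 + ... + beta_p x_p."""
    return beta0 + sum(b_j * x_j for b_j, x_j in zip(beta, x))

def aft_survival_extreme_value(t, x, beta0, beta, b):
    """Pr(T > t | x) when log T = u(x) + b Z and Pr(Z > z) = exp(-exp(z))."""
    z = (math.log(t) - linear_predictor(x, beta0, beta)) / b
    return math.exp(-math.exp(z))

def weibull_survival(t, theta, tau):
    """S(t) = exp(-(t/theta)**tau)."""
    return math.exp(-((t / theta) ** tau))

def aft_median(x, beta0, beta, b):
    """Median of T: solve exp(-exp(z)) = 0.5, giving z = log(log 2)."""
    return math.exp(linear_predictor(x, beta0, beta) + b * math.log(math.log(2.0)))

# Hypothetical coefficients: beta0 = 1.0, beta_1 = 0.5, scale b = 0.8.
beta0, beta, b = 1.0, [0.5], 0.8
aft_value = aft_survival_extreme_value(2.0, [1.0], beta0, beta, b)
weibull_value = weibull_survival(2.0, math.exp(1.5), 1.0 / b)
median_ratio = aft_median([1.0], beta0, beta, b) / aft_median([0.0], beta0, beta, b)
```

In this example `median_ratio` equals exp(0.5): a unit increase in the covariate stretches the time scale by the factor $e^{\beta_1}$, which is the sense in which failure time is "accelerated" or decelerated.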


Proportional hazards (PH) models

$h(t \mid x) = h_0(t) \exp\{u(x)\}$,

where

$u(x) = \beta_1 x_1 + \cdots + \beta_p x_p$.

When a nonparametric model is used for the baseline hazard function, $h_0(t)$, the PH model is a semiparametric model.

Established methods allow us to make inferences about $\beta_1, \ldots, \beta_p$ while ignoring the baseline hazard function.
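One such established method is Cox's partial likelihood, which is named here as an assumption since the slide does not say which method was used. The sketch below handles a single covariate: it maximizes the log partial likelihood $\sum_{i:\,\delta_i = 1} \left[\beta x_i - \log \sum_{j \in R(r_i)} e^{\beta x_j}\right]$ by Newton-Raphson, where the risk set $R(t)$ contains subjects with $l_j < t \le r_j$. Tied event times are treated with the Breslow approximation. The record layout and toy data are hypothetical.

```python
import math

def score_and_information(beta, records):
    """First derivative and negative second derivative of the log partial likelihood.

    records: list of (l, r, delta, x) with a single covariate x.
    """
    score, info = 0.0, 0.0
    for (_, t, delta, x_event) in records:
        if delta != 1:
            continue
        # Covariates of everyone at risk just before time t.
        risk = [x for (l, r, _, x) in records if l < t <= r]
        weights = [math.exp(beta * x) for x in risk]
        s0 = sum(weights)
        s1 = sum(w * x for w, x in zip(weights, risk))
        s2 = sum(w * x * x for w, x in zip(weights, risk))
        score += x_event - s1 / s0
        info += s2 / s0 - (s1 / s0) ** 2
    return score, info

def fit_cox(records, iterations=25):
    """Newton-Raphson from beta = 0; returns (beta_hat, standard_error)."""
    beta = 0.0
    for _ in range(iterations):
        score, info = score_and_information(beta, records)
        beta += score / info
    _, info = score_and_information(beta, records)
    return beta, 1.0 / math.sqrt(info)

# Toy data (hypothetical): x = 1 marks one group, x = 0 the other.
records = [
    (0.0, 1.0, 1, 0), (0.0, 2.0, 1, 1), (0.0, 3.0, 1, 0), (0.0, 4.0, 0, 1),
    (0.0, 5.0, 1, 0), (0.0, 6.0, 1, 1), (0.0, 7.0, 0, 1), (0.0, 8.0, 1, 0),
]
beta_hat, se_hat = fit_cox(records)
final_score, _ = score_and_information(beta_hat, records)
```

Note that the baseline hazard $h_0(t)$ never appears in the computation, which is what allows inference about the coefficients while ignoring it.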


Example: Data on Time until Termination of Employment

Model: $h(t \mid x_1) = h_0(t) \exp\{\beta_1 x_1\}$

Covariate: $x_1 = I(\text{sex} = \text{male})$

Estimates:

covariate   coef      exp(coef)   se(coef)   z        p-value
$x_1$       -0.5670   0.5672      0.2217     -2.557   0.0105

Conclusion:

The hazard function for males is significantly lower than that for females. According to the fitted model, the hazard function for males equals 0.5672 times the hazard function for females.
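The columns of the estimates table are linked by simple formulas: exp(coef) is the hazard ratio, z is coef divided by se(coef), and the p-value is the two-sided tail probability of a standard normal distribution. The sketch below, which assumes this usual Wald test, reproduces the reported values from the coefficient and its standard error.

```python
import math

def wald_summary(coef, se):
    """Return (hazard_ratio, z, two_sided_p_value) for one coefficient."""
    z = coef / se
    # For a standard normal Z, Pr(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return math.exp(coef), z, p_value

# Coefficient and standard error for x_1 from the table above.
hazard_ratio, z, p_value = wald_summary(-0.5670, 0.2217)
```

Up to rounding, this gives a hazard ratio of 0.5672, z of -2.557, and a p-value of 0.0105, matching the table.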


Example: Data on Time until Termination of Employment

Model: $h(t \mid x) = h_0(t) \exp\{\beta_1 x_1 + \beta_2 x_2\}$

Covariates: $x_1 = I(\text{sex} = \text{male})$ and $x_2$ = age at hire

Estimates:

covariate   coef       exp(coef)   se(coef)   z        p-value
$x_1$       -0.52248   0.59305     0.22316    -2.341   0.0192
$x_2$       -0.02072   0.97949     0.01159    -1.788   0.0737

Conclusion:

The hazard function decreases by about 2% for each additional year of age at hire. This effect is significant at the 10% level but not at the 5% level.


Fixed and time-varying covariates

So far, we have considered only fixed covariates: sex and age at hire do not change during the time the individual is employed.

Time-varying covariates can change over time.

In our example, calendar year is such a covariate, and we observe individuals in both 1999 and 2000. Salary is also a time-varying covariate.

The PH model easily accommodates time-varying covariates:

$h(t \mid x(t)) = h_0(t) \exp\{\beta_1 x_1(t) + \cdots + \beta_p x_p(t)\}$


Example: Data on Time until Termination of Employment

Suppose a female employee is hired on July 1, 1999 at age 45 and quits on October 1, 2000.

The data for this employee can be specified as two observations of the form $(l_i, r_i, \delta_i, x_{1i}, x_{2i}, x_{3i})$, where

$l_i$ = left truncation time for observation $i$
$r_i$ = termination or right censoring time for observation $i$
$\delta_i$ = termination indicator for observation $i$
$x_{1i} = I(\text{sex}_i = \text{male})$
$x_{2i}$ = (age at hire)$_i$
$x_{3i}$ = calendar year of observation $i$.

That is,

(0, 0.5, 0, 0, 45, 1999)

and

(0.5, 1.25, 1, 0, 45, 2000).
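The splitting of one employment spell into per-year observations can be automated. The sketch below is one possible way to do it, assuming that dates fall on the first of a month and that time is measured in whole months divided by 12, which reproduces the values 0.5 and 1.25 used above. The function name and argument layout are hypothetical.

```python
def split_by_calendar_year(hire, end, terminated, sex_male, age_at_hire):
    """Split one spell into records (l, r, delta, x1, x2, x3), one per calendar year.

    hire, end: (year, month) pairs, taken to mean the first day of that month.
    terminated: True if the spell ended in a termination, False if censored.
    """
    def years_since_hire(year, month):
        return ((year - hire[0]) * 12 + (month - hire[1])) / 12.0

    observations = []
    start = hire
    for year in range(hire[0], end[0] + 1):
        # The segment stops at the next January 1 or at the end of the spell.
        segment_end = min((year + 1, 1), end)
        if segment_end <= start:
            continue
        is_last = segment_end == end
        observations.append((
            years_since_hire(*start),
            years_since_hire(*segment_end),
            int(terminated and is_last),
            int(sex_male),
            age_at_hire,
            year,
        ))
        start = segment_end
    return observations

# The employee above: hired July 1, 1999 at age 45, quits October 1, 2000.
observations = split_by_calendar_year((1999, 7), (2000, 10), True, False, 45)
```

The first record is right censored at the end of 1999, and the second is left truncated at the same point, so the employee contributes to the risk set in each year with the correct covariate value.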


Example: Data on Time until Termination of Employment

Model: $h(t \mid x(t)) = h_0(t) \exp\{\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3(t)\}$

Covariates: $x_1 = I(\text{sex} = \text{male})$, $x_2$ = age at hire, and $x_3(t) = I(\text{calendar year at time } t = 2000)$

Estimates:

covariate   coef       exp(coef)   se(coef)   z        p-value
$x_1$       -0.51777   0.59585     0.22324    -2.319   0.0204
$x_2$       -0.02073   0.97948     0.01158    -1.789   0.0735
$x_3(t)$     0.17245   1.18821     0.20621     0.836   0.4030

Conclusion:

The hazard function in 2000 did not differ significantly from that in 1999.


Example: Data on Time until Termination of Employment

Model: $h(t \mid x(t)) = h_0(t) \exp\{\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3(t)\}$

Covariates: $x_1 = I(\text{sex} = \text{male})$, $x_2$ = age at hire, and $x_3(t) = \log_{10}(\text{salary for the calendar year at time } t)$

Estimates:

covariate   coef        exp(coef)   se(coef)   z        p-value
$x_1$       -0.156276   0.855323    0.257055   -0.608   0.54322
$x_2$       -0.006394   0.993626    0.011913   -0.537   0.59145
$x_3(t)$    -1.516312   0.219520    0.534350   -2.838   0.00454

Conclusion:

Salary has a large and statistically significant effect on the termination rate, and salaries may be related to sex and age at hire.
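Because salary enters the model on a base-10 logarithmic scale, the coefficient describes the effect of multiplying salary by a factor: a salary that is $k$ times higher multiplies the hazard by $\exp\{\beta_3 \log_{10} k\}$. The short sketch below works this out for the fitted coefficient; the function name is hypothetical.

```python
import math

def hazard_ratio_for_salary_multiple(coef_log10_salary, multiple):
    """Hazard ratio when salary is multiplied by `multiple`, all else equal."""
    return math.exp(coef_log10_salary * math.log10(multiple))

beta_3 = -1.516312  # coefficient of log10(salary) from the table above

tenfold = hazard_ratio_for_salary_multiple(beta_3, 10.0)  # equals exp(coef)
doubling = hazard_ratio_for_salary_multiple(beta_3, 2.0)
```

A tenfold salary increase corresponds to the reported exp(coef) of about 0.2195, while doubling the salary multiplies the hazard by roughly 0.63.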


Example: Data on Time until Termination of Employment

[Figure: KM Estimates of the Survival Function of Salary. Vertical axis: probability; horizontal axis: salary. Annotations: median male salary: $86,553; median female salary: $43,080.]


Summary

• Survival analysis provides effective tools for analyzing lifetime data.
• We can easily handle left truncation and right censoring.
• We can explore parametric, nonparametric, and semiparametric models.
• Covariates may be fixed or time-varying.
