Survival Analysis
Presentation Slides - University of Western Ontario
Survival Analysis:<br />
The Study of Lifetimes and their Distributions<br />
Bruce L. Jones<br />
Department of Statistical and Actuarial Sciences<br />
The University of Western Ontario<br />
PLCS/RDC Statistics and Data Series at Western<br />
April 20, 2011
Outline<br />
• What is survival analysis?<br />
• What do we mean by lifetimes?<br />
• What are the challenges presented by lifetime data?<br />
• How can we characterize lifetime distributions?<br />
• What are the well-known parametric and non-parametric<br />
survival models?<br />
• How can we investigate/model the impact of fixed and<br />
time-varying covariates on a lifetime distribution?<br />
What is survival analysis?<br />
Survival analysis deals with<br />
• modelling or understanding the behaviour of the lifetime distribution for<br />
a population,<br />
• statistical tests for differences between the lifetime distributions of two<br />
or more populations,<br />
• modelling/assessing the impact of explanatory variables on a lifetime<br />
distribution.<br />
What do we mean by lifetimes?<br />
• Lifetime refers to the time until a specified event, not necessarily the<br />
end of a life. Examples:<br />
– time until death<br />
– time until retirement<br />
– time until termination of employment<br />
– time until a health-related event<br />
– time until marriage/divorce/remarriage<br />
– time until default of a bond or mortgage<br />
• The methods of survival analysis can even be applied when the quantity<br />
of interest is not a time - it can be any positive-valued random<br />
variable.<br />
What are the challenges presented by lifetime data?<br />
Suppose that $T_i$ is the lifetime of subject $i$.<br />
Our observation of $T_i$ could be censored or truncated.<br />
• Censoring<br />
If $T_i$ is in the interval $(a_i, b_i)$, we observe only that $a_i < T_i < b_i$.<br />
Example: Data on Time until Termination of Employment<br />
Event of Interest: Termination of employment, voluntary (the employee<br />
quits) or involuntary (the employee is fired), but not retirement.<br />
Lifetime: Time from date of hire to date of termination.<br />
Data: Date of birth, date of hire, sex, and salary for about 400 individuals<br />
who were employees of a company during 1999 or 2000, and the date of<br />
all terminations during 1999 and 2000.<br />
Questions:<br />
• How does the distribution of the time until termination behave?<br />
• Does the time until termination depend on the age at hire? How?<br />
• Does the time until termination depend on the sex of the employee?<br />
How?<br />
• Does the time until termination depend on the salary of the employee?<br />
How?<br />
Example: Data on Time until Termination of Employment<br />
Summary of Data<br />
                      1999                     2000<br />
Sex                   Females  Males  Both     Females  Males  Both<br />
Existing Employees    164      139    303      173      148    321<br />
New Hires             30       19     49       27       25     52<br />
Terminations          30       15     45       36       16     52<br />
Example: Data on Time until Termination of Employment<br />
[Figure: Number of Employees. Horizontal axis: years since hire (0 to 30); vertical axis: number (0 to 120).]<br />
How can we characterize lifetime distributions?<br />
Let T be a continuous, positive-valued random variable that represents the<br />
time until some event.<br />
The distribution of T can be characterized in terms of<br />
\[
\begin{aligned}
S(t) &= \Pr(T > t) && \text{survival function} \\
F(t) &= \Pr(T \le t) && \text{distribution function} \\
f(t) &= F'(t) && \text{probability density function} \\
h(t) &= -\frac{S'(t)}{S(t)} = \frac{f(t)}{S(t)} && \text{hazard function} \\
H(t) &= \int_0^t h(u)\,du && \text{cumulative hazard function} \\
r(t) &= E[T - t \mid T > t] = \int_t^\infty (u - t)\,\frac{f(u)}{S(t)}\,du && \text{mean residual lifetime}
\end{aligned}
\]
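As a numerical illustration of how these functions relate to one another, the following minimal Python sketch uses a hypothetical linear hazard h(t) = a + bt. The values of a and b and all function names are assumptions made for illustration and do not come from the slides. The sketch recovers H(t) by numerical integration, checks that f(t) = -S'(t), and evaluates r(t) from its integral definition.

```python
import math

# Hypothetical hazard h(t) = a + b*t; a and b are illustrative values only.
def hazard(t, a=0.1, b=0.02):
    return a + b * t

def cumulative_hazard(t, a=0.1, b=0.02):
    # Closed form of the integral of h(u) from 0 to t.
    return a * t + 0.5 * b * t * t

def survival(t):
    # S(t) = exp{-H(t)}
    return math.exp(-cumulative_hazard(t))

def density(t):
    # f(t) = h(t) S(t), a rearrangement of h(t) = f(t) / S(t)
    return hazard(t) * survival(t)

def integrate(g, lo, hi, n=20000):
    """Composite trapezoidal rule for the integral of g over [lo, hi]."""
    step = (hi - lo) / n
    total = 0.5 * (g(lo) + g(hi))
    for k in range(1, n):
        total += g(lo + k * step)
    return total * step

def mean_residual_lifetime(t, upper=200.0):
    # r(t) = integral from t to infinity of (u - t) f(u) / S(t) du,
    # truncated at `upper`, where S(u) is negligible for this hazard.
    return integrate(lambda u: (u - t) * density(u), t, upper) / survival(t)

t = 5.0
numeric_H = integrate(hazard, 0.0, t)
eps = 1e-5
numeric_f = -(survival(t + eps) - survival(t - eps)) / (2 * eps)

print("H(t): numeric", numeric_H, "closed form", cumulative_hazard(t))
print("f(t): -S'(t)", numeric_f, "h(t)S(t)", density(t))
print("r(t):", mean_residual_lifetime(t))
```

Any one of S, F, f, h, or H determines the others, which is why the sketch only needs the hazard as input.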
Well-known parametric survival models<br />
Exponential Distribution<br />
The hazard function is constant.<br />
$S(t) = e^{-\lambda t}$, $t \ge 0$, $\lambda > 0$.<br />
Gompertz Distribution<br />
$S(t) = \exp\left\{-\frac{B}{\ln c}\,(c^t - 1)\right\}$, $t \ge 0$, $B > 0$, $c > 1$.<br />
The hazard function increases exponentially.<br />
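The hazard statements for these two models can be checked numerically. The sketch below is a minimal illustration, assuming hypothetical parameter values and function names. It computes h(t) = -d/dt log S(t) by a central difference and compares the Gompertz result with B c^t, the form obtained by differentiating the survival function above.

```python
import math

def exponential_survival(t, lam):
    return math.exp(-lam * t)

def gompertz_survival(t, B, c):
    return math.exp(-(B / math.log(c)) * (c ** t - 1.0))

def numeric_hazard(survival, t, eps=1e-6):
    """Approximate h(t) = -d/dt log S(t) with a central difference."""
    return -(math.log(survival(t + eps)) - math.log(survival(t - eps))) / (2 * eps)

# Hypothetical parameter values, chosen only for illustration.
lam, B, c = 0.2, 0.0003, 1.09

for t in (10.0, 40.0, 70.0):
    h_exp = numeric_hazard(lambda u: exponential_survival(u, lam), t)
    h_gom = numeric_hazard(lambda u: gompertz_survival(u, B, c), t)
    # The exponential hazard should stay at lam; the Gompertz hazard should track B * c**t.
    print(t, round(h_exp, 6), round(h_gom, 6), round(B * c ** t, 6))
```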
Well-known parametric survival models<br />
Weibull Distribution<br />
$S(t) = e^{-(t/\theta)^{\tau}}$, $t \ge 0$, $\theta, \tau > 0$.<br />
The hazard function is increasing if $\tau > 1$, decreasing if $\tau < 1$, and is<br />
constant if $\tau = 1$.<br />
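Lognormal Distribution<br />
$S(t) = 1 - \Phi\left(\frac{\log t - \mu}{\sigma}\right)$, $t > 0$, $-\infty < \mu < \infty$, $\sigma > 0$.<br />
Loglogistic Distribution<br />
$S(t) = \frac{1}{1 + \lambda t^{\alpha}}$, $t \ge 0$, $\lambda, \alpha > 0$.<br />
The hazard function is decreasing if $\alpha \le 1$. If $\alpha > 1$, the hazard function is increasing for $t < \left(\frac{\alpha - 1}{\lambda}\right)^{1/\alpha}$ and decreasing for $t > \left(\frac{\alpha - 1}{\lambda}\right)^{1/\alpha}$.<br />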
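The shape statements on this slide can be illustrated with a short Python sketch. It assumes the Weibull hazard h(t) = (τ/θ)(t/θ)^(τ-1), which follows from the Weibull survival function above, and a loglogistic model of the form S(t) = 1/(1 + λ t^α), which is the parameterization consistent with the turning point ((α - 1)/λ)^(1/α) stated above. Function names and parameter values are hypothetical.

```python
import math

def weibull_hazard(t, theta, tau):
    # h(t) = f(t) / S(t) for S(t) = exp{-(t/theta)^tau}
    return (tau / theta) * (t / theta) ** (tau - 1.0)

def loglogistic_hazard(t, lam, alpha):
    # Assumed form: S(t) = 1 / (1 + lam * t^alpha)
    return alpha * lam * t ** (alpha - 1.0) / (1.0 + lam * t ** alpha)

def loglogistic_turning_point(lam, alpha):
    # Time at which the hazard peaks when alpha > 1.
    return ((alpha - 1.0) / lam) ** (1.0 / alpha)

# Weibull: increasing (tau > 1), constant (tau = 1), decreasing (tau < 1).
for tau in (2.0, 1.0, 0.5):
    print("tau =", tau, [round(weibull_hazard(t, 2.0, tau), 4) for t in (1.0, 2.0, 3.0)])

# Loglogistic with alpha > 1: the hazard rises to the turning point, then falls.
lam, alpha = 0.5, 2.0
t_star = loglogistic_turning_point(lam, alpha)
print("turning point:", t_star)
print([round(loglogistic_hazard(t, lam, alpha), 4) for t in (t_star - 0.5, t_star, t_star + 0.5)])
```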
What is a nonparametric model?<br />
A nonparametric model has the following characteristics:<br />
• It does not impose a structure on the distribution.<br />
• The number of parameters is determined by the data.<br />
• It exhibits features that are apparent in the data.<br />
Why use a nonparametric model?<br />
• To explore the data graphically<br />
• As a model for calculations<br />
• To check the fit of a parametric model<br />
The Kaplan-Meier Estimator<br />
We can obtain a nonparametric model for a lifetime distribution using the<br />
Kaplan-Meier (KM) estimator of the survival function.<br />
Suppose we have (left truncated and right censored) data on n subjects.<br />
Let t 1
Example: Data on Time until Termination of Employment<br />
[Figure: Survival Function of Time until Termination, plotted for females and males. Horizontal axis: years since entry (0 to 35); vertical axis: probability (0.0 to 1.0).]<br />
The Kaplan-Meier Estimator<br />
Since<br />
S(t) = exp{−H(t)},<br />
and therefore<br />
H(t) =− log S(t),<br />
we can use the KM estimates to obtain estimates of the H(t) values.<br />
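A minimal Python sketch of the KM product-limit calculation with left truncation and right censoring, followed by the conversion H(t) = -log S(t). The record layout (entry, exit, event), the function names, and the data are hypothetical. The risk set at an event time t is taken to be the subjects with entry < t <= exit, which is one common convention and is assumed here.

```python
import math

def kaplan_meier(records):
    """records: list of (entry, exit, event) with event = 1 if the event was observed.

    Returns a list of (event_time, estimated survival probability just after that time).
    """
    event_times = sorted({r for (l, r, d) in records if d == 1})
    estimates = []
    s = 1.0
    for t in event_times:
        # Subjects under observation just before t (delayed entry handled via l < t).
        at_risk = sum(1 for (l, r, d) in records if l < t <= r)
        events = sum(1 for (l, r, d) in records if r == t and d == 1)
        s *= 1.0 - events / at_risk
        estimates.append((t, s))
    return estimates

def cumulative_hazard_estimates(km):
    # H(t) = -log S(t), applied to each KM estimate (undefined if S reaches 0).
    return [(t, -math.log(s)) for (t, s) in km if s > 0.0]

# Hypothetical data: (entry, exit, event)
records = [(0.0, 2.0, 1), (0.0, 3.0, 0), (1.0, 4.0, 1), (2.5, 5.0, 1), (0.0, 5.0, 0)]

km = kaplan_meier(records)
for (t, s), (_, H) in zip(km, cumulative_hazard_estimates(km)):
    print(t, round(s, 4), round(H, 4))
```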
Example: Data on Time until Termination of Employment<br />
[Figure: Cumulative Hazard Function of Time until Termination, plotted for females and males. Horizontal axis: years since entry (0 to 35); vertical axis, labelled "probability" in the plot: 0 to 4.]<br />
Fitting a parametric model by maximum likelihood<br />
Suppose we observe $n$ subjects. For subject $i$, we observe $(l_i, r_i, \delta_i)$,<br />
where $l_i$ is the left truncation time, $r_i$ is the event or right censoring time,<br />
and $\delta_i$ is 1 if an event is observed and 0 otherwise.<br />
The likelihood function is given by<br />
\[
L(\theta) = \prod_{i=1}^{n} \frac{[f(r_i; \theta)]^{\delta_i}\,[S(r_i; \theta)]^{1 - \delta_i}}{S(l_i; \theta)}.
\]
$L$ is maximized with respect to $\theta$ to obtain $\hat{\theta}$, the vector of MLEs.<br />
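To make the likelihood concrete, here is a minimal Python sketch for the exponential model, where the log-likelihood reduces to (number of events) log λ - λ Σ(r_i - l_i) and the maximizer has the closed form λ̂ = (number of events) / Σ(r_i - l_i). The data and function names are hypothetical, and the exponential model is chosen only because its MLE can be written down directly.

```python
import math

def exponential_loglik(lam, records):
    """Log of L(lam) for records (l, r, delta) under S(t) = exp(-lam * t)."""
    total = 0.0
    for l, r, delta in records:
        log_f_r = math.log(lam) - lam * r   # log f(r; lam)
        log_S_r = -lam * r                  # log S(r; lam)
        log_S_l = -lam * l                  # log S(l; lam), the truncation adjustment
        total += delta * log_f_r + (1 - delta) * log_S_r - log_S_l
    return total

def exponential_mle(records):
    # Closed form: observed events divided by total time under observation.
    events = sum(delta for _, _, delta in records)
    exposure = sum(r - l for l, r, _ in records)
    return events / exposure

# Hypothetical data: (left truncation time, event or censoring time, event indicator)
records = [(0.0, 2.0, 1), (0.0, 3.0, 0), (1.0, 4.0, 1), (2.5, 5.0, 1), (0.0, 5.0, 0)]

lam_hat = exponential_mle(records)
print("lambda-hat:", lam_hat)
print("log-likelihood at lambda-hat:", exponential_loglik(lam_hat, records))
print("log-likelihood nearby:", exponential_loglik(0.9 * lam_hat, records),
      exponential_loglik(1.1 * lam_hat, records))
```

For models without a closed-form MLE, the same log-likelihood function would be passed to a numerical optimizer instead.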
Example: Data on Time until Termination of Employment<br />
[Figure: Cumulative Hazard Function of Time until Termination, plotted for females and males. Horizontal axis: years since entry (0 to 35); vertical axis, labelled "probability" in the plot: 0 to 4.]<br />
Modelling the Impact of Covariates<br />
Lifetime distributions may depend on factors related to the individuals or<br />
the conditions under which they are observed. In modelling lifetime distributions<br />
we frequently wish to reflect the impact of these factors.<br />
Information about relevant factors is expressed in terms of explanatory variables,<br />
also known as covariates. For each individual, we observe a vector<br />
$x = (x_1, \ldots, x_p)'$, which gives the values of $p$ covariates.<br />
We seek a model which captures the impact of x on the lifetime distribution.<br />
Modelling the Impact of Covariates<br />
The most common types of models are<br />
1. Accelerated failure time (AFT) models:<br />
$T = T_0 \exp\{u(x)\}$,<br />
2. Proportional hazards (PH) models:<br />
$h(t \mid x) = h_0(t) \exp\{u(x)\}$,<br />
where normally u(x) is a linear function of the components of x<br />
with coefficients that are parameters to be estimated.<br />
Accelerated failure time (AFT) models<br />
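$T = T_0 \exp\{u(x)\}$,<br />
where<br />
$u(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$.<br />
Therefore, if $Y = \log T$, then<br />
$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + bZ$,<br />
where $\log T_0$ is written as $bZ$ for a scale parameter $b$ and a random variable $Z$.<br />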
Popular choices for the distribution of Z are<br />
• Normal ⇒ T ∼ lognormal<br />
• Logistic ⇒ T ∼ loglogistic<br />
• Extreme value ⇒ T ∼ Weibull<br />
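The name "accelerated failure time" can be read off a short derivation. Writing $S_0$ for the survival function of $T_0$ and $t_q^{(0)}$ for its $q$-th quantile (both symbols are introduced here for illustration and are not on the slide), the relation T = T_0 exp{u(x)} implies that the covariates rescale the time axis:

```latex
\begin{aligned}
S(t \mid x) &= \Pr\bigl(T_0 \exp\{u(x)\} > t\bigr)
             = \Pr\bigl(T_0 > t \exp\{-u(x)\}\bigr)
             = S_0\bigl(t \exp\{-u(x)\}\bigr), \\
t_q(x) &= t_q^{(0)} \exp\{u(x)\}.
\end{aligned}
```

So every quantile of the lifetime, including the median, is multiplied by the same factor exp{u(x)}.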
Proportional hazards (PH) models<br />
$h(t \mid x) = h_0(t) \exp\{u(x)\}$,<br />
where<br />
$u(x) = \beta_1 x_1 + \cdots + \beta_p x_p$.<br />
When a nonparametric model is used for the baseline hazard function,<br />
$h_0(t)$, the PH model is a semiparametric model.<br />
Established methods allow us to make inferences about $\beta_1, \ldots, \beta_p$ while<br />
ignoring the baseline hazard function.<br />
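One such method is Cox's partial likelihood, in which the baseline hazard cancels out of each term. The sketch below is a minimal illustration for a single covariate, assuming no tied event times and a risk set defined by entry < t <= exit so that left truncation is respected. The data, the function names, and the use of a golden-section search (workable here because the log partial likelihood is concave in β) are assumptions for illustration, not the method used to produce the estimates in these slides.

```python
import math

def partial_loglik(beta, records):
    """Log partial likelihood for records (entry, exit, event, x), assuming no ties."""
    total = 0.0
    for (l_i, r_i, d_i, x_i) in records:
        if d_i == 1:
            # Sum of exp(beta * x) over subjects at risk at the event time r_i.
            risk = sum(math.exp(beta * x) for (l, r, d, x) in records if l < r_i <= r)
            total += beta * x_i - math.log(risk)
    return total

def golden_section_max(fn, lo, hi, tol=1e-8):
    """Maximize a unimodal function of one variable on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - ratio * (b - a)
    d = a + ratio * (b - a)
    while b - a > tol:
        if fn(c) > fn(d):
            b = d
        else:
            a = c
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
    return 0.5 * (a + b)

# Hypothetical data: (entry, exit, event, x) with x a 0/1 covariate.
records = [
    (0.0, 1.0, 1, 0),
    (0.0, 2.0, 1, 1),
    (0.0, 3.0, 1, 0),
    (0.5, 4.0, 0, 1),
    (1.5, 5.0, 1, 1),
    (0.0, 6.0, 1, 0),
    (2.5, 7.0, 0, 1),
]

beta_hat = golden_section_max(lambda b: partial_loglik(b, records), -5.0, 5.0)
print("beta-hat:", beta_hat, "hazard ratio:", math.exp(beta_hat))
```

Note that h_0(t) never appears in the code: only the ordering of the event times and the composition of each risk set matter.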
Example: Data on Time until Termination of Employment<br />
Model: $h(t \mid x_1) = h_0(t) \exp\{\beta_1 x_1\}$<br />
Covariate: $x_1 = I(\text{sex} = \text{male})$<br />
Estimates:<br />
covariate   coef      exp(coef)   se(coef)   z        p-value<br />
x_1         -0.5670   0.5672      0.2217     -2.557   0.0105<br />
Conclusion:<br />
The hazard function for males is significantly lower than that for females.<br />
According to the fitted model, the hazard function for males equals 0.5672<br />
times the hazard function for females.<br />
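The columns of the estimates table are linked by simple arithmetic: exp(coef) is the hazard ratio, z is coef divided by se(coef), and the p-value is the two-sided tail probability of a standard normal distribution at |z|. The following Python sketch recomputes them from the reported coef and se(coef). The function name is hypothetical, and the normal (Wald) approximation for the p-value is assumed, which is consistent with the z column.

```python
import math

def wald_summary(coef, se):
    """Return (hazard ratio, z statistic, two-sided normal p-value)."""
    z = coef / se
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)) for the standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return math.exp(coef), z, p_value

# Values reported for x_1 = I(sex = male) in the table above.
hazard_ratio, z, p_value = wald_summary(-0.5670, 0.2217)
print(round(hazard_ratio, 4), round(z, 3), round(p_value, 4))
```

With the values from the table, the sketch should agree with the reported exp(coef), z, and p-value up to rounding.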
Example: Data on Time until Termination of Employment<br />
Model: $h(t \mid x) = h_0(t) \exp\{\beta_1 x_1 + \beta_2 x_2\}$<br />
Covariates: $x_1 = I(\text{sex} = \text{male})$ and $x_2$ = age at hire<br />
Estimates:<br />
covariate   coef       exp(coef)   se(coef)   z        p-value<br />
x_1         -0.52248   0.59305     0.22316    -2.341   0.0192<br />
x_2         -0.02072   0.97949     0.01159    -1.788   0.0737<br />
Conclusion:<br />
The hazard function decreases with age at hire (by about 2% per year of age),<br />
an effect that is significant at the 10% level but not at the 5% level.<br />
Fixed and time-varying covariates<br />
So far, we have considered only fixed covariates - sex and age at hire do<br />
not change during the time the individual is employed.<br />
Time-varying covariates can change over time.<br />
In our example, calendar year is such a covariate, and we observe individuals<br />
in both 1999 and 2000. Salary is also a time-varying covariate.<br />
The PH model easily accommodates time-varying covariates:<br />
$h(t \mid x(t)) = h_0(t) \exp\{\beta_1 x_1(t) + \cdots + \beta_p x_p(t)\}$<br />
Example: Data on Time until Termination of Employment<br />
Suppose a female employee is hired on July 1, 1999 at age 45 and quits<br />
on October 1, 2000.<br />
The data for this employee can be specified as two observations of the<br />
form $(l_i, r_i, \delta_i, x_{1i}, x_{2i}, x_{3i})$, where<br />
$l_i$ = left truncation time for observation $i$<br />
$r_i$ = termination or right censoring time for observation $i$<br />
$\delta_i$ = termination indicator for observation $i$<br />
$x_{1i}$ = I(sex$_i$ = male)<br />
$x_{2i}$ = (age at hire)$_i$<br />
$x_{3i}$ = calendar year of observation $i$.<br />
That is,<br />
(0, 0.5, 0, 0, 45, 1999)<br />
and<br />
(0.5, 1.25, 1, 0, 45, 2000).<br />
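The splitting of one employee's history into one observation per calendar year can be sketched in Python as follows. The function name, its arguments, the use of decimal years for dates (so that July 1, 1999 becomes 1999.5 and October 1, 2000 becomes 2000.75), and the default observation window are all assumptions made for illustration.

```python
def split_by_calendar_year(hire, exit_, terminated, male, age_at_hire,
                           window=(1999.0, 2001.0)):
    """Split an employment spell into calendar-year observations.

    hire, exit_: decimal years; exit_ is the termination or last-observed date.
    Returns tuples (l, r, delta, x1, x2, x3), with l and r measured in years since hire.
    """
    rows = []
    start = max(hire, window[0])   # observation begins at hire or at the window start
    stop = min(exit_, window[1])   # observation ends at exit or at the window end
    year = int(start)
    while start < stop:
        end = min(float(year + 1), stop)
        # The event is recorded only on the piece that ends at the termination date.
        delta = 1 if (terminated and end == exit_) else 0
        rows.append((start - hire, end - hire, delta, male, age_at_hire, year))
        start = end
        year += 1
    return rows

# The employee from this slide: female, hired July 1, 1999 at age 45, quits October 1, 2000.
for row in split_by_calendar_year(1999.5, 2000.75, True, 0, 45):
    print(row)
```

An employee hired before 1999 would get a first observation with a positive left truncation time, for example `split_by_calendar_year(1990.25, 2001.0, False, 1, 30)` starts at 8.75 years since hire.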
Example: Data on Time until Termination of Employment<br />
Model: $h(t \mid x(t)) = h_0(t) \exp\{\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3(t)\}$<br />
Covariates: $x_1 = I(\text{sex} = \text{male})$, $x_2$ = age at hire,<br />
and $x_3(t) = I(\text{calendar year at time } t = 2000)$<br />
Estimates:<br />
covariate   coef       exp(coef)   se(coef)   z        p-value<br />
x_1         -0.51777   0.59585     0.22324    -2.319   0.0204<br />
x_2         -0.02073   0.97948     0.01158    -1.789   0.0735<br />
x_3(t)       0.17245   1.18821     0.20621     0.836   0.4030<br />
Conclusion:<br />
The hazard function in 2000 did not differ significantly from that in 1999.<br />
Example: Data on Time until Termination of Employment<br />
Model: $h(t \mid x(t)) = h_0(t) \exp\{\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3(t)\}$<br />
Covariates: $x_1 = I(\text{sex} = \text{male})$, $x_2$ = age at hire,<br />
and $x_3(t) = \log_{10}(\text{salary for the calendar year at time } t)$<br />
Estimates:<br />
covariate   coef        exp(coef)   se(coef)   z        p-value<br />
x_1         -0.156276   0.855323    0.257055   -0.608   0.54322<br />
x_2         -0.006394   0.993626    0.011913   -0.537   0.59145<br />
x_3(t)      -1.516312   0.219520    0.534350   -2.838   0.00454<br />
Conclusion:<br />
Salary has a large and statistically significant effect on the termination rate,<br />
and salaries may be related to sex and age at hire.<br />
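Because the salary covariate enters as log base 10 of salary, exp(coef) = 0.219520 is the hazard ratio associated with a tenfold increase in salary. The hazard ratio for any other multiple m follows as exp(coef × log10(m)). The Python sketch below applies this to the fitted coefficient; the function name is hypothetical, and the calculation holds the other covariates fixed.

```python
import math

def hazard_ratio_for_salary_multiple(coef_log10_salary, multiple):
    """Hazard ratio when salary is multiplied by `multiple`, other covariates fixed."""
    return math.exp(coef_log10_salary * math.log10(multiple))

coef = -1.516312  # fitted coefficient of log10(salary) from the table above

print("tenfold increase:", hazard_ratio_for_salary_multiple(coef, 10.0))
print("doubling:", hazard_ratio_for_salary_multiple(coef, 2.0))
print("10% raise:", hazard_ratio_for_salary_multiple(coef, 1.1))
```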
Example: Data on Time until Termination of Employment<br />
[Figure: KM Estimates of the Survival Function of Salary. Horizontal axis: salary (0 to 1,500,000); vertical axis: probability (0.0 to 1.0). Median male salary: $86,553; median female salary: $43,080.]<br />
Summary<br />
• Survival analysis provides effective tools for analyzing<br />
lifetime data.<br />
• We can easily handle left truncation and right censoring.<br />
• We can explore parametric, nonparametric, and<br />
semiparametric models.<br />
• Covariates may be fixed or time-varying.<br />