Slides - Åbo Akademi


Software Safety<br />

Lecture 8: System Reliability<br />

Elena Troubitsyna<br />

Anton Tarasyuk<br />

<strong>Åbo</strong> <strong>Akademi</strong> University

System Dependability<br />

A. Avizienis, J.-C. Laprie and B. Randell:<br />

Dependability and its Threats (2004):<br />

Dependability is the ability of a system to deliver a service that<br />

can be justifiably trusted or, alternatively, as the ability of a system<br />

to avoid service failures that are more frequent or more severe than<br />

is acceptable<br />

Dependability attributes: reliability, availability, maintainability,<br />

safety, confidentiality, integrity

Reliability Definition<br />

Reliability: the ability of a system to deliver correct service under<br />

given conditions for a specified period of time<br />

Reliability is generally measured by the probability that a system<br />

(or system component) S can perform a required function under<br />

given conditions for the time interval [0, t]:<br />

R(t) = P {S not failed over time [0, t]}

More Definitions<br />

Failure rate (or failure intensity): the number of failures per time<br />

unit (hour, second, ...) or per natural unit (transaction, run, ...)<br />

• can be considered as an alternative way of expressing reliability<br />

Availability: the ability of a system to be in a state to deliver<br />

correct service under given conditions at a specified instant of time<br />

• usually measured by the average (over time) probability that a<br />

system is operational in a specified environment<br />

Maintainability: the ability of a system to be restored to a state in<br />

which it can deliver correct service<br />

• usually measured by the probability that maintenance of the<br />

system will restore it within a given time period

Safety-critical Systems<br />

• Pervasive use of computer-based systems in many critical<br />

infrastructures<br />

• flight/traffic control, driverless trains/cars operation, nuclear<br />

plant monitoring, robotic surgery, military applications, etc.<br />

• Extremely high reliability requirements for safety-critical<br />

systems<br />

• avionics domain example: failure rate ≤ $10^{-9}$ failures per hour,<br />

i.e., more than a hundred thousand years of operation without<br />

encountering a failure
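To put the avionics figure in perspective, the sketch below converts a failure rate into an expected time to failure in years. It assumes a constant failure rate, so that the mean time to failure is simply the reciprocal of the rate; the function name and the 8760-hour year are illustrative choices, not part of the lecture.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def mttf_years(failure_rate_per_hour: float) -> float:
    """Mean time to failure in years, assuming a constant failure rate."""
    return 1.0 / failure_rate_per_hour / HOURS_PER_YEAR


years = mttf_years(1e-9)
print(f"{years:,.0f} years")  # roughly 114,000 years
```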

Hardware vs. Software<br />

• Hardware for safety-critical systems is very reliable and its<br />

reliability is being improved<br />

• Software is not as reliable as hardware, however, its role in<br />

safety-critical systems increases<br />

• The division between hardware and software reliability is<br />

somewhat artificial [J. D. Musa, 2004]<br />

• Many concepts of software reliability engineering are adapted<br />

from the mature and successful techniques of hardware reliability


Hardware Failures<br />

The system is said to have a failure when its actual behaviour<br />

deviates from the intended one specified in design documents<br />

Underlying faults: the largest part of hardware failures is caused by<br />

physical wear-out or physical defect of a component<br />

• transient faults: the faults that may occur and then disappear after<br />

some period of time<br />

• permanent faults: the faults that remain in the system until they<br />

are repaired<br />

• intermittent faults: the reoccurring transient faults

Failure Rate<br />

The failure rate of a system (or system component) is the mean<br />

number of failures within a given period of time<br />

The failure rate is a random variable (gives us the probability that<br />

a system, which has been operating over time t, fails over the next<br />

time unit)<br />

The failure rate of a component normally varies with time: λ(t)

Failure Rate (ctd.)<br />

Classification of failures (time-dependent):<br />

• Early failure period<br />

• Constant failure rate period<br />

• Wear-out failure period

Failure Rate (ctd.)<br />

Let T be the random variable measuring the uptime of some<br />

system (or system component) S:<br />

$R(t) = P\{T > t\} = 1 - F(t) = \int_t^{\infty} f(x)\,dx$,<br />

where<br />

– $F(t)$ is the cumulative distribution function of T<br />

– $f(t)$ is the (failure) probability density function of T<br />

The failure rate of the system can be defined as:<br />

$\lambda(t) = \frac{f(t)}{R(t)}$
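As an illustration, the failure rate can be evaluated as the ratio f(t)/R(t), the form derived in the appendix. The sketch below does this for a Weibull distribution, chosen here only because a single shape parameter reproduces the decreasing, constant and increasing failure-rate periods; the scale of 1000 hours and the sample times are made-up values.

```python
import math


def weibull_pdf(t: float, shape: float, scale: float) -> float:
    """Failure density f(t) of a Weibull distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1) * math.exp(-((t / scale) ** shape))


def weibull_reliability(t: float, shape: float, scale: float) -> float:
    """Reliability R(t) = P{T > t} of a Weibull distribution."""
    return math.exp(-((t / scale) ** shape))


def failure_rate(t: float, shape: float, scale: float) -> float:
    """Failure rate computed as f(t) / R(t)."""
    return weibull_pdf(t, shape, scale) / weibull_reliability(t, shape, scale)


# shape < 1: early failures, shape = 1: constant rate, shape > 1: wear-out
for shape in (0.5, 1.0, 2.0):
    rates = [failure_rate(t, shape, 1000.0) for t in (100.0, 500.0, 900.0)]
    print(shape, [f"{r:.2e}" for r in rates])
```

With shape 1 the Weibull distribution reduces to the exponential one, so the computed rate stays at 1/scale for every t.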


Most Important Distributions<br />

Discrete distributions:<br />

• binomial distribution<br />

• Poisson distribution<br />

Continuous distributions:<br />

• exponential distribution<br />

• normal distribution<br />

• lognormal distribution<br />

• Weibull distribution<br />

• gamma distribution<br />

• Pareto distribution

Repair Rate<br />

The repair rate expresses the probability that a system, failed for a<br />

time t, recovers its ability to perform its function in the next time<br />

unit<br />

The repair rate of the system can be defined as:<br />

$\mu(t) = \frac{g(t)}{1 - M(t)}$,<br />

where<br />

– $g(t)$ is the (repair) probability density function and<br />

– $M(t)$ is the system maintainability

Reliability Parameters<br />

MTTF: Mean Time to Failure, i.e., the expected time that a system will<br />

operate before the first failure occurs (the term MTBF – Mean Time<br />

Between Failures – is often used for repairable systems)<br />

MTTR: Mean Time to Repair, i.e., the average time taken to repair a<br />

system that has failed<br />

MTTR includes the time taken to detect the failure, locate the fault,<br />

repair and reconfigure the system.<br />

$\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt \qquad \mathrm{MTTR} = \int_0^{\infty} (1 - M(t))\,dt$<br />

The availability of a repairable system is defined as<br />

$A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$
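A minimal sketch of the availability calculation, assuming the standard steady-state form A = MTTF / (MTTF + MTTR). The numbers are invented for illustration only.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability, assuming A = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: fails on average every 1000 hours, takes 2 hours to repair
a = availability(1000.0, 2.0)
print(f"availability = {a:.4f}")
```

Shortening the repair time raises availability even when the failure rate stays the same, which is why MTTR matters as much as MTTF for repairable systems.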



Exponential Distribution<br />

The distribution gives:<br />

• failure density: $f(t) = \lambda e^{-\lambda t}$<br />

• reliability function: $R(t) = e^{-\lambda t}$<br />

• constant failure and repair rates: λ(t) = λ and µ(t) = µ<br />

• $\mathrm{MTTF} = 1/\lambda$<br />

• $\mathrm{MTTR} = 1/\mu$<br />

The probability of a system working correctly throughout a given<br />

period of time decreases exponentially with the length of this time<br />
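The MTTF value for the exponential distribution follows from the general definition of MTTF as the integral of R(t) once the exponential reliability function is substituted. A short worked derivation, assuming λ > 0:

```latex
\[
\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt
              = \int_0^{\infty} e^{-\lambda t}\,dt
              = \left[-\frac{1}{\lambda}\, e^{-\lambda t}\right]_0^{\infty}
              = \frac{1}{\lambda}
\]
```

The same calculation with $1 - M(t) = e^{-\mu t}$ gives MTTR = 1/µ.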


Exponential Distribution (cont.)<br />

The exponential distribution is often used because it makes<br />

calculations much simpler<br />

During the useful-life stage, the failure rate is related to the<br />

reliability of the component or system by the exponential failure law<br />

Describes neither the case of early failures (decreasing failure rate)<br />

nor the case of worn-out components (increasing failure rate)

MTTF Example<br />

A system with a constant failure rate of 0.001 failures per hour has<br />

an MTTF of 1000 hours.<br />

This does not mean that the system will operate correctly<br />

for 1000 hours!<br />

The reliability of such a system at time t is: $R(t) = e^{-\lambda t}$<br />

Assume that $t = \mathrm{MTTF} = 1/\lambda$:<br />

$R(t) = e^{-1} \approx 0.37$<br />

Any given system has only a 37% chance of functioning correctly<br />

for an amount of time equal to the MTTF (i.e., a 63% chance of<br />

failing in this period).
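The 37% figure can be reproduced numerically. This is a small sketch under the constant-failure-rate assumption used in the example; the function name is illustrative.

```python
import math


def reliability(t_hours: float, failure_rate: float) -> float:
    """R(t) = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-failure_rate * t_hours)


lam = 0.001          # failures per hour, as in the example
mttf = 1.0 / lam     # 1000 hours

print(f"R(MTTF)      = {reliability(mttf, lam):.3f}")      # about 0.368
print(f"R(MTTF / 10) = {reliability(mttf / 10, lam):.3f}")  # about 0.905
```

The second line shows that even for a mission lasting only a tenth of the MTTF, the chance of surviving without failure is about 90%, not 100%.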

Failure Rate Estimation<br />

The failure rate is often assumed to be constant (this assumption<br />

simplifies calculation)<br />

Often no other assumption can be made because of the small<br />

number of available event data<br />

In this case, an estimator of the failure rate is given by:<br />

$\lambda = \frac{N_f}{T_f}$,<br />

where<br />

– $N_f$ – number of failures observed during operation<br />

– $T_f$ – cumulative operating time

Example: Failure Rate Calculation<br />

Ten identical components are each tested until they either fail or reach<br />

1000 hours, at which time the test is terminated for that component.<br />

The results are:<br />

Component Hours Failure<br />

Component 1 1000 No failure<br />

Component 2 1000 No failure<br />

Component 3 467 Failed<br />

Component 4 1000 No failure<br />

Component 5 630 Failed<br />

Component 6 590 Failed<br />

Component 7 1000 No failure<br />

Component 8 285 Failed<br />

Component 9 648 Failed<br />

Component 10 882 Failed<br />

Totals 7502 6<br />

The estimated failure rate is:<br />

$\lambda = \frac{6 \text{ failures}}{7502 \text{ hours}} \approx 799.8 \cdot 10^{-6}$ failures/hour,<br />

or 799.8 failures for every million hours of operation.<br />

12 April 2007 8
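The calculation in this example can be sketched in a few lines. The data are the ten test results from the table; the tuple layout and function name are illustrative choices.

```python
# (hours on test, failed during test?) for components 1..10, taken from the table
results = [
    (1000, False), (1000, False), (467, True), (1000, False), (630, True),
    (590, True), (1000, False), (285, True), (648, True), (882, True),
]


def estimate_failure_rate(results):
    """Estimate lambda as number of failures / cumulative operating time."""
    failures = sum(1 for _, failed in results if failed)
    total_hours = sum(hours for hours, _ in results)
    return failures / total_hours


rate = estimate_failure_rate(results)
print(f"{rate * 1e6:.1f} failures per million hours")  # 799.8
```

Components that survive to the 1000-hour cut-off still contribute their operating time to the denominator, even though they add no failures to the numerator.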

Combinational Models<br />

Allow overall reliability of the system to be calculated from the<br />

reliability of its components<br />

Physical separation and fault isolation<br />

• greatly reduce the complexity of the reliability models<br />

• redundancy to achieve fault tolerance<br />

Distinguish between two different situations:<br />

• Failure of any component causes system failure – series<br />

model<br />

• Several components must fail simultaneously to cause a<br />

malfunction – parallel model

Series Systems<br />

A configuration in which the failure of any system component causes<br />

failure of the entire system<br />

For a series system that consists of N (independent) components, the<br />

overall system failure rate is<br />

$\lambda = \sum_{i=1}^{N} \lambda_i$<br />

and, consequently,<br />

$R(t) = \prod_{i=1}^{N} R_i(t)$
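A minimal sketch of the series model, assuming independent components with constant failure rates. The three rates and the 1000-hour mission time are made-up values; the check at the end confirms that multiplying component reliabilities agrees with using the summed failure rate.

```python
import math


def series_failure_rate(rates):
    """Overall failure rate of a series system: sum of component rates."""
    return sum(rates)


def series_reliability(reliabilities):
    """Overall reliability of a series system: product of component reliabilities."""
    return math.prod(reliabilities)


rates = [1e-4, 2e-4, 5e-4]   # hypothetical failures per hour
t = 1000.0                   # hypothetical mission time in hours

component_r = [math.exp(-lam * t) for lam in rates]
r_product = series_reliability(component_r)
r_from_sum = math.exp(-series_failure_rate(rates) * t)
print(f"{r_product:.4f} {r_from_sum:.4f}")
```

A series system is never more reliable than its weakest component, since every factor in the product is at most 1.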


Parallel Systems<br />

Redundant system – failure of one component<br />

does not lead to failure of the entire system<br />

The system will remain operational if at least<br />

one of the parallel elements is functioning<br />

correctly<br />

The reliability of a parallel system is determined by first considering the<br />

probability of failure (unreliability) of each individual component and then<br />

that of the overall system:<br />

$R(t) = 1 - \prod_{i=1}^{N} (1 - R_i(t))$
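The parallel formula can be sketched as follows, assuming independent components; the reliability values are illustrative.

```python
import math


def parallel_reliability(reliabilities):
    """System works if at least one component works:
    1 minus the product of the component unreliabilities."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


two_units = parallel_reliability([0.9, 0.9])
three_units = parallel_reliability([0.9, 0.8, 0.7])
print(f"{two_units:.3f} {three_units:.3f}")  # 0.990 0.994
```

Each added parallel component multiplies the system unreliability by that component's unreliability, so the gain from further redundancy shrinks quickly.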


Series-Parallel Combinations<br />

The most common in practice.<br />

Any such system may be simplified, i.e.,<br />

reduced to a single component<br />

A series-parallel arrangement:<br />

(a) The original arrangement<br />

(b) The result of combining parallel<br />

modules<br />

(c) The result of combining the series<br />

modules<br />

The overall reliability of the system is<br />

represented by that of module 13
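The reduction procedure can be sketched by composing the series and parallel formulas. The layout below (two modules in parallel, followed in series by a third) is a hypothetical arrangement, not the one in the lecture's figure, and the reliabilities are invented.

```python
import math


def series_reliability(reliabilities):
    """All components must work."""
    return math.prod(reliabilities)


def parallel_reliability(reliabilities):
    """At least one component must work."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# Hypothetical layout: modules 1 and 2 in parallel, then module 3 in series
r1, r2, r3 = 0.9, 0.9, 0.95

# Step 1: combine the parallel modules into one equivalent module
parallel_block = parallel_reliability([r1, r2])
# Step 2: combine the resulting series chain into a single module
system = series_reliability([parallel_block, r3])
print(f"{system:.4f}")  # 0.9405
```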

Triple Modular Redundancy<br />

Reminder: a TMR system consists of three<br />

parallel modules<br />

The reliability of a single module is $R_m(t)$<br />

The reliability of a TMR system that consists of three identical modules is<br />

$R_{TMR}(t) = R_m^3(t) + 3R_m^2(t)(1 - R_m(t)) = 3R_m^2(t) - 2R_m^3(t)$<br />

Reliability of the TMR arrangement may be worse than the reliability of<br />

an individual module!
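A quick numerical check shows where TMR stops helping. The sketch evaluates the TMR formula for a few module reliabilities and assumes, as the formula does, that the voter itself never fails. Algebraically, the difference between TMR and single-module reliability is R(2R − 1)(1 − R), so TMR is worse whenever the module reliability is below 0.5.

```python
def tmr_reliability(r_m: float) -> float:
    """Reliability of a 2-out-of-3 majority system with a perfect voter."""
    return 3 * r_m**2 - 2 * r_m**3


for r_m in (0.3, 0.5, 0.9):
    print(f"module {r_m:.2f} -> TMR {tmr_reliability(r_m):.3f}")
# module 0.30 -> TMR 0.216  (worse than a single module)
# module 0.50 -> TMR 0.500  (break-even)
# module 0.90 -> TMR 0.972  (better than a single module)
```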

M-of-N Arrangement<br />

A system consists of N identical modules<br />

At least M modules must function correctly in order to prevent a<br />

system failure<br />

$R_{M\text{-of-}N}(t) = \sum_{i=0}^{N-M} \binom{N}{i} R_m^{N-i}(t)\,(1 - R_m(t))^{i}$
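The M-of-N formula translates directly into code. This sketch assumes identical, independent modules; the summation index i counts the number of failed modules, which may be at most N − M.

```python
from math import comb


def m_of_n_reliability(m: int, n: int, r_m: float) -> float:
    """Probability that at least m of n identical modules work.
    The index i is the number of failed modules (0 .. n - m)."""
    return sum(
        comb(n, i) * r_m ** (n - i) * (1.0 - r_m) ** i
        for i in range(n - m + 1)
    )


print(f"{m_of_n_reliability(2, 3, 0.9):.3f}")  # 0.972, same as TMR
print(f"{m_of_n_reliability(1, 3, 0.9):.3f}")  # 0.999, pure parallel
print(f"{m_of_n_reliability(3, 3, 0.9):.3f}")  # 0.729, pure series
```

The special cases M = 1 and M = N recover the parallel and series models, and 2-of-3 recovers TMR.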

Software Reliability<br />

• Software is, in general, logically more complex<br />

• Software failures are design failures<br />

• caused by design faults (which are human in origin)<br />

• design = all software development steps (from requirements to<br />

deployment and maintenance)<br />

• harder to identify, measure and detect<br />

• Software does not wear out

Failure rate<br />

The term failure intensity is often used instead of failure rate (to<br />

avoid confusion)<br />

Typically, software is under permanent development and<br />

maintenance, which leads to jumps in the overall failure rate<br />

• May increase after major program/environmental changes<br />

• May decrease due to the improvements and bug-fixes

Software Reliability (ctd.)<br />

• Software, unlike hardware, can be fault-free (theoretically :))<br />

• some formal methods can guarantee the correctness of<br />

software (proof-based verification, model checking, etc.)<br />

• Correctness of software does not ensure its reliability!<br />

• software can satisfy the specification document, yet the<br />

specification document itself might already be faulty<br />

• No independence assumption, i.e., copies of software will fail<br />

together<br />

• most hardware fault tolerance mechanisms are ineffective for<br />

software<br />

• design diversity instead of component redundancy<br />

(e.g., N-version programming)
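To make the idea of N-version programming concrete, here is a minimal sketch of a majority voter over independently written variants. The three "versions" are toy stand-ins invented for this example (one deliberately contains a design fault); real N-version systems also have to deal with inexact agreement, timing, and correlated faults, none of which is modelled here.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of versions,
    or raise ValueError if no such value exists."""
    value, count = Counter(outputs).most_common(1)[0]
    if count > len(outputs) // 2:
        return value
    raise ValueError("no majority among version outputs")


# Three hypothetical, independently developed versions of an absolute-value routine
def version_a(x):
    return abs(x)


def version_b(x):
    return x if x >= 0 else -x


def version_c(x):
    return x  # design fault: wrong for negative inputs


versions = (version_a, version_b, version_c)
result = majority_vote([v(-4) for v in versions])
print(result)  # 4
```

The voter masks the fault in `version_c` only because the other two variants do not share it. Identical copies of one program would all fail on the same input, which is why diversity rather than replication is needed.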

Design diversity<br />

Each variant of software is generated by a separate (independent)<br />

team of developers<br />

• higher probability of generating a correct variant<br />

• independent design faults in different variants<br />

Costly, yet leads to an effective reliability improvement<br />

Not as efficient as N-modular redundancy in hardware reliability<br />

engineering [J. C. Knight and N. G. Leveson, 1986]

Appendix: Failure Rate Proof<br />

Proof:<br />

$$
\begin{aligned}
\lambda(t) &= \lim_{\Delta t \to 0} \frac{1}{\Delta t} \cdot P\{A_S^{[t,t+\Delta t]} \mid \neg A_S^{[0,t]}\} \\
&= \lim_{\Delta t \to 0} \frac{1}{\Delta t} \cdot \frac{P\{A_S^{[t,t+\Delta t]} \cap \neg A_S^{[0,t]}\}}{P\{\neg A_S^{[0,t]}\}} \\
&= \lim_{\Delta t \to 0} \frac{1}{\Delta t} \cdot \frac{P\{A_S^{[0,t+\Delta t]}\} - P\{A_S^{[0,t]}\}}{R(t)} \\
&= \lim_{\Delta t \to 0} \frac{R(t) - R(t+\Delta t)}{\Delta t\, R(t)} = \frac{f(t)}{R(t)},
\end{aligned}
$$

where<br />

– $A_S^{[x,y]} \,\hat{=}\,$ S failed over time [x, y]<br />

– $\neg A_S^{[x,y]} \,\hat{=}\,$ S did not fail over time [x, y]
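The limit in the proof can be checked numerically. The sketch below evaluates the finite-difference expression (R(t) − R(t+Δt)) / (Δt · R(t)) for an exponential reliability function and shrinking Δt; the rate 0.001 and the time point 500 hours are arbitrary illustrative values. For the exponential distribution the limit should equal the constant rate λ.

```python
import math


def reliability(t: float, lam: float) -> float:
    """Exponential reliability function R(t) = exp(-lambda * t)."""
    return math.exp(-lam * t)


def conditional_failure_rate(t: float, dt: float, lam: float) -> float:
    """Probability of failing in [t, t + dt] given survival up to t, per unit time."""
    return (reliability(t, lam) - reliability(t + dt, lam)) / (dt * reliability(t, lam))


lam = 0.001
for dt in (100.0, 1.0, 0.01):
    print(dt, conditional_failure_rate(500.0, dt, lam))
```

As Δt decreases, the printed values approach 0.001 from below, in line with the final step of the derivation.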
