10.07.2015 Views

Availability and Reliability Theory

Availability and Reliability Theory

Availability and Reliability Theory

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

<strong>Availability</strong> <strong>and</strong> <strong>Reliability</strong> <strong>Theory</strong><strong>and</strong> the ExpectationsBehind the Numbers


PurposeProvide an underst<strong>and</strong>ing of reliability <strong>and</strong>availability which enables IT decision makers tomake sound management decisions.© 2003 APC corporation.


NCPI Science CenterMission:To advance the state of knowledge about the design <strong>and</strong>operation of Network-Critical Physical Infrastructure (NCPI) inboth the industry <strong>and</strong> user communities. To providetechniques, guidelines, <strong>and</strong> tools to empower users to make themost effective planning decisions regarding their NCPIinvestments, maximizing availability <strong>and</strong> agility while reducingTCO.© 2003 APC corporation.


<strong>Availability</strong> / <strong>Reliability</strong> Theories <strong>and</strong>ApplicationUsing numbers tomake sound decisions


© 2003 APC corporation.99.999% - Når oppetid er kritisk


Why use <strong>Reliability</strong> Analysis? Answer how <strong>and</strong> where dollars shouldbe allocated to increase reliability – may bedifferent than availability answer Value appears the same throughout theorganization Most useful for systems with highest $risk in first outage cycle Most applicable industries: Data centerapplications, chip fabrications© 2003 APC corporation.


© 2003 APC corporation.Key Terminology<strong>Reliability</strong> – The probability that an entity experiences no failuresunder given conditions for a given time interval<strong>Availability</strong> – The probability that an entity is operating undergiven conditions at a given instant of time<strong>Availability</strong> categories: steady state / inherent, achieved, operationalMean Time Between Failure (MTBF) – The average operating timebetween failures. MTBF ≅ MTTF when MTBF >> MTTR- At time = MTBF; 63% of population has failed Mean Time To Recover (MTTR) – The average time to restore orrecover1 Failure Rate (λ) – The inverse of MTBF or (ExponentialMTBFDistribution only)1 Recovery Rate (µ) – The inverse of MTTR or (ExponentialDistribution only)MTTRSource: <strong>Reliability</strong> Engineering H<strong>and</strong>book


Bathtub CurveEarly failureperiodNormal operatingperiodWear outperiodRateoffailureConstant failure rate region0Time© 2003 APC corporation.


© 2003 APC corporation.<strong>Availability</strong> vs. Mean Time to Failure


Key <strong>Reliability</strong> Equations<strong>Reliability</strong> –−λ tR = eex. MTBF = 1000hrst=0hrsR=e−0.001*0=1t= 1000 hrsR=e−0.001*1000=0.368© 2003 APC corporation.


What makes calculated probabilities credible? Assumptions- Without assumptions the results mean nothing- I.e. Human error, spare parts on h<strong>and</strong>, constant failure rate Methods- Comparing two or more similar architectures is effective inevaluating the relative differences between them.Helps to avoid the vender specific definitions of “failure”Helps mitigate the assumption of constant failure rate- Confidence limits depend on the size of the population,you are studyingwww.weibull.com/hotwire/issue4/relbasics4.htmUseful when calculating the availability or reliability of individualcomponents. Data from Resistor batch much easier to gather then DataCenter batch Data- The results are only as good as the data© 2003 APC corporation.


What makes calculated probabilities credible? Models- The framework upon which the results are based.- Visually models can be represented using RBDs butultimately models are the equations used to calculateavailability or reliability- Equations used depend on the failure distribution of thecomponentsExponential Distribution vs. Weibull DistributionEx. TransformerEx. Capacitor (Polypropylene film)R(t)= e−λ t⎡ ⎛ t ⎞R(t)= exp⎢−⎜ ⎟⎢⎣⎝θ⎠β⎤⎥⎥⎦θ =Life AtNamePlate⎛ Rated Cap Voltage ⎞Rating ×⎜⎟⎝ UPS Applied Cap Voltage ⎠Vscale× 2⎛ Rated Temp−Tempat steady state⎜⎝10⎞⎟⎠© 2003 APC corporation.


<strong>Reliability</strong> Block DiagramsSeries/ Parallel Combination ModelA 1 = A 2 = 0.99B 1 = B 2 = 0.999C 1 = 0.99999<strong>Availability</strong> calculations:A A = 1 - [ (1-A 1 ) * (1-A 2 ) ]A C = C 1A B = 1 - [ (1-B 1 ) * (1-B 2 ) ]A System = A A * A B * A CAssumptions:- 5 repairpersons available- Remaining components in operation during repair- Components exhibit constant failure rate- Human error has not been accounted for- Independence of failures© 2003 APC corporation.


Markov ChainsRewards are assigned based on theproportion of the data center that isoperationalRepresents the state of the facilityas a continuumIncludes both normal <strong>and</strong> degradedstatesCombined with RBDs, this canrepresent a data center- Ex. If there are 20 sub-panels, powermight be available in 19 of them, but not inthe 20 th . For this state, a reward of 0.95would be assigned. If power wererestored to the 20 th sub-panel, then thereward would be 1.0.Markov of Simple Parallel System© 2003 APC corporation.


Fault TreesLogical – easy to underst<strong>and</strong>Composed mainly of Events, ANDgates, OR gatesEverything leads to top failureeventAnalysis highly dependent ontree structureCut set – a set of events thatleads to the top eventMinimal cut set – a set of eventsin which all must fail in order toreach top event© 2003 APC corporation.


Probabilistic Analysis Methodologies


APC’s <strong>Availability</strong> MethodologyUsed for comparing two or more electricalarchitectures, logic based on system successDetailed approach, from utility to loadCombination of <strong>Reliability</strong> block diagrams <strong>and</strong> Markovchains.<strong>Reliability</strong> block diagrams are used to representsubsystems of the architecture.- For example, the PDU is in series with the UPS, so 2series blocks would be drawn to illustrate this logic.Markov chains are then used to represent the DataCenter state based on the number of sub-systems thatare operational.© 2003 APC corporation.


Where does the data come from?- IEEE gold book– Power Systems <strong>Reliability</strong>- Power Quality Magazine- 3 rd Party Vendor Data- <strong>Reliability</strong> Analysis Center- ASHRAE© 2003 APC corporation.


3 rd Party feedback of availability methodologyAli Mosleh, Professor <strong>and</strong> Director of <strong>Reliability</strong> EngineeringProgram, University of Maryl<strong>and</strong>- “In my judgment the methodology is sound for theintended objective, namely a comparative analysis of twoarchitectures..”Joanne Bechta Dugan, Ph.D., Professor at University ofVirginia- “[I have] found the analysis credible <strong>and</strong> the methodologysound.. The combination of <strong>Reliability</strong> Block Diagrams(RBD) <strong>and</strong> Markov reward models (MRM) is an excellentchoice that allows the flexibility <strong>and</strong> accuracy of the MRMto be combined with the simplicity of the RBD.”© 2003 APC corporation.


MTech’s PRA MethodologyBased on nuclear industry PRAInvasive data gathering, analysis can “st<strong>and</strong> alone”,logic based on system failurePrimarily uses Fault trees sometime combined withhuman factorsDetermine minimal cut sets from fault tree logicalstructureCalculate probability of each minimal cut set duringmissionRank component contributions to failureDetermine sensitivity of results to componentfailure rate© 2003 APC corporation.


Where does the data come from?- IEEE gold book– Power Systems <strong>Reliability</strong>- Nuclear industry- 3 rd Party studies <strong>and</strong> data- ASHRAE- MTech database© 2003 APC corporation.


Learning from the real worldReal customer issues


© 2003 APC corporation.


© 2003 APC corporation.Source: Uptime Institute


© 2003 APC corporation. Source: Uptime Institute


Capacitor Bank DesignsFailed cap –pressureinterrupteropenedRigid bus using threadedlugsFlexible bususing fast-onconnectors© 2003 APC corporation.


Power ReactorFacility: KEWAUNEERegion: 3 State: WIUnit: [1] [ ] [ ]RX Type: [1] W-2-LPNRC Notified By: ETHAN TREPTOWHQ OPS Officer: JOHN MacKINNONEmergency Class: NON EMERGENCY10 CFR Section:50.72(b)(3)(v)(D) - ACCIDENT MITIGATIONEvent Number: 41226Notification Date: 11/26/2004Notification Time: 03:22 [ET]Event Date: 11/26/2004Event Time: 00:25 [CST]Last Update Date: 11/26/2004Person (Organization):DAVE PASSEHL (R3)UnitSCRAMCodeRXCRITInitialPWRInitial RX ModeCurrentPWRCurrent RX ModeEvent Text1NN0Intermediate Shutdown0Intermediate ShutdownSAFETY INJECTION ACCUMULATOR ISOLATION VALVES FOUND CLOSED AND THEIR BREAKERS LOCKEDOFF."During plant startup following a Refueling Outage, the Reactor Coolant System was pressurized greater than 1000psig with the Safety Injection Accumulator Isolation Valves (SI-20A <strong>and</strong> SI-20B) closed <strong>and</strong>their breakers locked off. This is contrary to the plant Technical Specificationrequirement to open the valves <strong>and</strong> lock out their breakers prior to the ReactorCoolant System exceeding 1000 psig. The Safety Injection Accumulators are required to inject intothe Reactor Coolant System to mitigate the consequences to a LOCA. This is conservatively being reported under10CFR50.72(b)(3)(v)(D) as "Any event or condition that at the time of discovery could have prevented the fulfillmentof the safety function of structures or systems that are needed to mitigate the consequences of an accident.""At the time of discovery, Reactor Coolant System pressure was approximately 1090 psig <strong>and</strong> Reactor CoolantSystem temperature was approximately 440 deg. Fahrenheit. Approximately three minutes after the condition wasdiscovered, the SI Accumulator Isolation Valves were opened <strong>and</strong> their power breakers were locked out."The STA discovered the problem while reviewing Technical Specification's <strong>and</strong> Plant Conditions.The NRC Resident Inspector was notified of this event by the licensee.© 2003 APC corporation.


Power ReactorFacility: INDIAN POINTRegion: 1 State: NYUnit: [2] [3] [ ]RX Type: [2] W-4-LP,[3] W-4-LPNRC Notified By: BRIAN ROKESHQ OPS Officer: BILL HUFFMANEmergency Class: NON EMERGENCY10 CFR Section:26.73 - FITNESS FOR DUTYEvent Number: 41353Notification Date: 01/24/2005Notification Time: 16:07 [ET]Event Date: 01/24/2005Event Time: 10:00 [EST]Last Update Date: 01/24/2005Person (Organization):WILLIAM COOK (R1)UnitSCRAMCodeRXCRITInitialPWRInitial RX ModeCurrentPWRCurrent RX Mode2NY100Power Operation100Power Operation3NY100Power Operation100Power OperationEvent TextFITNESS FOR DUTY REPORTA non-licensed plant employee was determined to be under the influence ofalcohol during a test conducted for cause. The employee's access to the plant hasbeen terminated. Contact the Headquarters Operations Officer for additional details.The licensee has notified the NRC Resident Inspector.© 2003 APC corporation.


Power ReactorFacility: ROBINSONRegion: 2 State: SCUnit: [2] [ ] [ ]RX Type: [2] W-3-LPNRC Notified By: BILL STOVERHQ OPS Officer: JOHN KNOKEEmergency Class: UNUSUAL EVENT10 CFR Section:50.72(a) (1) (i) - EMERGENCY DECLAREDEvent Number: 41296Notification Date: 12/27/2004Notification Time: 01:02 [ET]Event Date: 12/27/2004Event Time: 00:31 [EST]Last Update Date: 12/27/2004Person (Organization):CAUDLE JULIAN (R2)PETER WILSON (IRD)MICHAEL JOHNSON (NRR)STEINDURF () AKERS (DHS)UnitSCRAMCodeRXCRITInitialPWRInitial RX ModeCurrentPWRCurrent RX ModeEvent Text2NY100Power Operation100Power OperationUNUSUAL EVENT DECLARED DUE TO FIRE INSIDE PROTECTED AREA LASTING GREATER THAN 10 MINUTESFire within the Protected Area lasting more than ten minutes. The Unusual Event was declared at 00:31 EST. The fire was locatedin the small arms locker in an out-building within the Protected Area. The fire was extinguished at 00:49 EST. Darlington CountyFire Department responded."Onsite security detected smoke <strong>and</strong> heard ammunition rounds discharging from alocked door small arms locker located in the Protected Area. Security reported the fire <strong>and</strong> onsite fire personnel responded.At 00:49 EST the fire was declared extinguished <strong>and</strong> a fire watch was set. No personnel were injured in this event. Darlington Fire Department wascalled <strong>and</strong> responded to the site, however, by the time they arrived the fire was put out <strong>and</strong> they, therefore, did not enter the Protected Area.Security believes this to be an isolated event with no evidence of malevolent actions.The Licensee notified state <strong>and</strong> local agencies <strong>and</strong> the NRC Resident Inspector.* * * UPDATE ON 12/27/04 AT 02:03 EST FROM B. STOVER TO J. KNOKE * * *At 01:30 on 12/27/04 the Licensee terminated the Unusual Event. The Licensee indicated the fire has been out for 41 minutes. All on site small armslockers were inspected by security <strong>and</strong> no evidence of any damage or tampering was found. The LLEA was not called for this event, however, theydid respond when the Darlington Fire Department responded. The LLEA did not enter the Protected Area. As a precautionary followup measure theLicensee requested the Darlington County Sheriff's office send an arson investigator to the scene.The Licensee notified state <strong>and</strong> local agencies <strong>and</strong> the NRC Resident Inspector. Notified R2DO (Julian), NRR(Johnson), IRD (Wilson), FEMA (Steindurf), <strong>and</strong> DHS (Akers).© 2003 APC corporation.


Where do failures occur?Causes of failure in data centerUPS Failure:20%Balance ofSystem:10%PDU <strong>and</strong> circuitbreakers:30%Other circuitbreakers:40%© 2003 APC corporation.Source: MTech


© 2003 APC corporation.Questions?

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!