EDITOR’S FOREWORD

Uptime. Downtime.

By Joe Pavlat, Editorial Director, CompactPCI & AdvancedTCA

For decades, designers and customers of embedded computer systems have attempted to establish reliability by estimating Mean Time Between Failures (MTBF). This is almost always a calculation, usually based on the failure rates of individual components as established by tables like MIL-HDBK-217. These types of methods provide an estimate of how long a system should operate before failure:

■ If the components were used within their design margins
■ If the components used were good
■ If the circuit assemblies were properly manufactured

A lot of ifs. Most MTBF calculators are largely silent about things like software, which is usually the most failure-prone element of any modern, complex embedded computer. And new failure mechanisms are beginning to appear. One good example is the increasing susceptibility of very small geometry integrated circuits to logic faults and failures due to radiation sources such as solar neutrons that practically cannot be shielded against (see the October 2004 CompactPCI and AdvancedTCA Systems Editor’s Foreword). And MTBF is just a statistical estimate. A system with a 30,000-hour calculated MTBF may fail after a few hours of operation. MTBF’s sister calculation, Mean Time To Repair (MTTR), also assumes that the right replacement part or assembly is available and is good, and that the individual performing the repair or replacement is skilled in the process. All in all, a lot of big fat ifs and guesstimates. So, are MTBF and MTTR calculations useful? Almost certainly. Are they enough? Increasingly, the answer is no.

Is there a better way?

The telecommunications industry has, for years, used availability in addition to MTBF as a better measure of a system’s overall reliability and robustness.
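To put rough numbers on these ideas, here is a minimal Python sketch. It assumes the common steady-state definition of availability as MTBF / (MTBF + MTTR) and a constant failure rate (the exponential model) for the early-failure estimate; neither assumption comes from this foreword. The 30,000-hour MTBF is the figure used above, while the 4-hour MTTR and all function names are hypothetical and chosen only for illustration.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up.

    Assumes the textbook definition MTBF / (MTBF + MTTR).
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_minutes(avail):
    """Expected downtime per year, in minutes, for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60.0


def prob_failure_within(hours, mtbf_hours):
    """Probability of at least one failure in the first `hours` of operation.

    Assumes a constant failure rate (exponential model), which is an
    illustrative simplification, not a claim about any real system.
    """
    return 1.0 - math.exp(-hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = 30_000.0  # calculated MTBF quoted in the text
    mttr = 4.0       # hypothetical repair time, NOT from the text

    a = availability(mtbf, mttr)
    print(f"availability:            {a:.6f}")
    print(f"downtime per year:       {annual_downtime_minutes(a):.1f} min")
    print(f"P(failure in first 100h): {prob_failure_within(100, mtbf):.4f}")

    # Downtime budgets for two commonly quoted availability targets.
    for target in (0.99999, 0.999999):
        minutes = annual_downtime_minutes(target)
        print(f"{target:.6f} -> {minutes:.2f} min/year ({minutes * 60:.1f} s)")
```

Under these assumptions, a 30,000-hour MTBF with a 4-hour repair time works out to roughly 70 minutes of downtime a year, and the chance of a failure in the first 100 hours is about 0.3%. Small, but not zero.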
Numbers like 5-nines (99.999% uptime, or about 5 minutes of downtime a year) or 6-nines (99.9999% uptime, or about 30 seconds of downtime a year) are often used as measures of availability. Availability requires a somewhat different mindset when compared to MTBF thinking. Highly available systems are generally architected in very different ways from traditional systems. They usually have multiple, redundant resources such as processors, power supplies, and storage. Specialized hardware and software combine to detect failures, switch out bad resources, and subsequently switch in good ones. Downtimes are often measured in seconds or minutes, not hours or days. Of course, it is almost always desirable to replace failed resources with good ones for continued redundancy, and features like hot swap and system management help repair personnel keep the still-running system ready for the next failure. The term 24x7 is being replaced in the communications world by 3600 by forever, which is a better measure of real-world requirements. Downtimes need to be measured in minutes at most, not days.

Designers of military electronics should be interested in high-availability architectures. Traditional military systems have achieved a level of reliability by robust packaging and careful component selection, but usually have simple single-resource architectures without the capability of failure tolerance and automatic repair. Additional forces are in play that should cause military electronics designers to take a few chapters from the telecommunications equipment design handbook and start to think about availability instead of just MTBF.
For example, today many necessary components are of commercial grade, including almost all silicon. That’s not necessarily a bad thing. One good aspect of this trend is that complex silicon gets cheaper every year, permitting the duplication of many functions for redundancy. Also, today’s net-centric warfare environment is largely about information technology and communications. Many of the lessons about making those types of systems highly available have already been learned in the telecom world, including different methodologies for software robustness than those used in military systems. Sure, environmental extremes and operating temperature requirements will often make some military electronics systems specialized, but the underlying architectures and components developed for the much larger communications marketplace should be considered wherever possible.

Keeping modern military electronics systems operating 3600 by forever will be absolutely necessary in the future as war planners and warfighters make rapid decisions based on real-time information. Putting the best heads together from both the telecom and military electronics worlds would be a great opportunity to further the state of the art for both and to face common challenges, such as better cooling technologies, for the future.

Joe Pavlat, Editorial Director

8 / CompactPCI and AdvancedTCA Systems / June 2005
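The redundancy argument in this foreword can be illustrated with a short calculation. The sketch below assumes that redundant units fail independently and that failover is instantaneous and always succeeds, which real systems only approximate; the function names and the 99.9% starting figure are hypothetical.

```python
import math


def parallel_availability(unit_availability, copies):
    """Availability of `copies` redundant units when any one working unit suffices.

    Assumes independent failures and perfect, instantaneous failover
    (idealized assumptions made for illustration only).
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - unit_availability) ** copies


def count_nines(avail):
    """Express an availability below 1.0 as a number of nines (0.999 -> about 3)."""
    if not 0.0 <= avail < 1.0:
        raise ValueError("availability must be in [0, 1)")
    return -math.log10(1.0 - avail)


if __name__ == "__main__":
    single = 0.999  # hypothetical availability of one non-redundant resource
    for copies in (1, 2, 3):
        combined = parallel_availability(single, copies)
        print(f"{copies} unit(s): {combined:.9f} (~{count_nines(combined):.1f} nines)")
```

With those idealized assumptions, duplicating a 3-nines resource yields roughly 6-nines. In practice, common-mode failures and imperfect failure detection reduce the gain.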