Saturday, March 10, 2012

Measurement of Fault Tolerance

Coverage is the conditional probability that the system will recover automatically within the required time interval, given that an error has occurred. Coverage can be computed from the probabilities of successful error detection and successful error recovery.

Coverage = Prob(successful error detection) x Prob(successful error recovery)
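As a rough illustration of the formula above, the following Python sketch multiplies the two probabilities. The function name `coverage` and the sample values 0.98 and 0.95 are hypothetical, chosen only for the example; in practice the probabilities would have to come from testing.

```python
def coverage(p_detect: float, p_recover: float) -> float:
    """Coverage = Prob(successful detection) x Prob(successful recovery)."""
    for p in (p_detect, p_recover):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_detect * p_recover


# Hypothetical inputs: 98% of errors are detected, and 95% of the
# detected errors are recovered within the required time interval.
c = coverage(0.98, 0.95)
print(f"coverage = {c:.3f}")
```

With these assumed inputs the coverage is about 0.931, so roughly 7% of errors would not be recovered automatically.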

Obtaining the probabilities used to compute coverage is difficult and requires extensive stability testing and fault insertion testing.

MTTF (Mean Time To Failure) is the average time from the start of operation until the first failure occurs.

MTTR (Mean Time To Repair) is the average time required to restore a failed component to operation. This includes travel time to the site to perform the repair, system initialization, and any recovery actions that must be executed.

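MTBF (Mean Time Between Failures) is the average time from the start of operation until the component has failed and been repaired, that is, the length of one full cycle of operation and repair. In other words, MTBF = MTTF + MTTR.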

MTBF is used when the component is repairable; MTTF is used when the component is not repairable.

Reliability is the probability that a system performs according to specification for a specified period of time. In other words, there are no failures within that period.

Reliability over a period t = e ^ -(t/MTTF), assuming a constant failure rate

Failure rate = 1/MTTF

FIT (Failures In Time) = number of failures in 1x10^9 hours (1 billion hours) of operation
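The three quantities above are all derived from MTTF, which a short Python sketch can make concrete. It assumes a constant failure rate, so that reliability over a period of t hours is exp(-t/MTTF). The function names and the MTTF of 1,000,000 hours are hypothetical and used only for illustration.

```python
import math


def failure_rate(mttf_hours: float) -> float:
    """Failures per hour, assuming a constant failure rate."""
    return 1.0 / mttf_hours


def reliability(t_hours: float, mttf_hours: float) -> float:
    """Probability of no failure during t_hours (exponential model)."""
    return math.exp(-t_hours / mttf_hours)


def fit(mttf_hours: float) -> float:
    """Expected number of failures in 1x10^9 hours of operation."""
    return 1e9 / mttf_hours


mttf = 1_000_000.0  # hypothetical MTTF in hours
print("failure rate per hour:", failure_rate(mttf))
print("FIT:", fit(mttf))
print("reliability over one year (8760 h):", reliability(8760, mttf))
```

Under this model the reliability over a period equal to the MTTF itself is e^-1, or about 37%, which is a reminder that MTTF is an average and not a guaranteed lifetime.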

Availability is the percentage of time that a system is able to perform its designed function. 

Availability = MTTF/(MTTF+MTTR) = MTTF/MTBF.
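To see how MTTF and MTTR combine, here is a minimal Python sketch of the availability formula. The function names, the MTTF of 999 hours, the MTTR of 1 hour, and the conversion to downtime per 8760-hour year are assumptions made for the example.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / MTBF, as a fraction between 0 and 1."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected hours per year during which the system is unavailable."""
    return (1.0 - avail) * HOURS_PER_YEAR


a = availability(999.0, 1.0)  # hypothetical MTTF and MTTR in hours
print(f"availability = {a:.3%}")
print(f"downtime = {downtime_hours_per_year(a):.2f} hours/year")
```

With these assumed values the availability is 99.9%, which corresponds to about 8.76 hours of downtime per year. Halving the MTTR improves availability almost as much as doubling the MTTF, which is why fast recovery matters as much as infrequent failure.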

Note that availability is a percentage and reliability is a probability.

Dependability is a measure of how far a system can be trusted to perform its desired function. The attributes of dependability are reliability, availability, safety and security. Safety refers to the non-occurrence of catastrophic failures, that is, failures whose consequences outweigh the benefits of the system.

The capacity of a system represents a tradeoff between the system's cost and its dependability under load. When the load is higher than the capacity, the system can suffer a total or partial failure, or degraded performance.
