Saturday, March 10, 2012

RAID

0 = Disk striping for performance
1 = Disk mirroring with slight performance improvement
2 = Hamming encoding.  Data is encoded on disk with an error-correcting code, so data on a single disk can be corrected when an error occurs.
3 = Virtual disk blocks.  Data is striped across multiple disks at the byte level, with a single dedicated disk for parity information.  Data can be reconstructed from the parity.
4 = Dedicated parity disk.  Data is striped in blocks, and the parity of all data is stored on a separate disk.  Data can be reconstructed from the parity.
5 = Striped parity.  Parity is striped across all disks along with the data, so no single disk holds all of the parity.
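
The parity-based reconstruction used by levels 3, 4 and 5 can be sketched in a few lines. This is a minimal illustration that assumes XOR parity over equal-length blocks; the function name xor_blocks and the sample data are hypothetical, not taken from any RAID implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, each holding one 4-byte block.
data = [b"ABCD", b"EFGH", b"IJKL"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data)

# Simulate losing disk 1: XOR of the surviving blocks and the parity
# gives back the missing block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])  # True
```

The same XOR recovers any single missing block, which is why one parity block protects against exactly one disk failure per stripe.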

Measurement of Fault Tolerance

Coverage is the conditional probability that the system will recover automatically within the required time interval given that an error has occurred.  Coverage can be computed from the probability associated with detection and recovery.

Coverage = Prob(successful error detection) x Prob(successful error recovery)
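
As a worked example of the coverage formula, here is a small sketch. The probabilities 0.99 and 0.95 are made-up values for illustration only; real figures would have to come from testing.

```python
def coverage(p_detect, p_recover):
    """Coverage = P(successful detection) * P(successful recovery)."""
    for p in (p_detect, p_recover):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    return p_detect * p_recover

# Hypothetical figures: 99% of errors detected, 95% of detected errors recovered.
print(f"{coverage(0.99, 0.95):.4f}")  # 0.9405
```

Because the two probabilities multiply, coverage can never be higher than the weaker of the two.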

Obtaining the probabilities used to compute coverage is difficult and requires extensive stability testing and fault insertion testing.

MTTF (Mean Time to Fail) is the average time from the start of operation until the time when the first failure happens.

MTTR (Mean Time to Repair) is the average time required to restore a failed component to operation.  This includes travel time to the site to perform the repair, system initialization, and any recovery actions to be executed.

MTBF (Mean Time Between Failures) is the average time from one failure to the next, covering both the operating period and the repair that follows it.  In other words, MTBF = MTTF + MTTR.

MTBF is used when the component is repairable, and MTTF is used when the component is not repairable.

Reliability is the probability that a system can perform according to specification for a specific period of time.  In other words, there are no failures within the specified time.

Reliability(t) = e ^ -(t/MTTF), where t is the length of the period and the failure rate is assumed to be constant

Failure rate = 1/MTTF
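
The relationship between MTTF, failure rate and reliability can be sketched as follows. This assumes the usual constant-failure-rate (exponential) model, R(t) = e^(-t/MTTF); the function names and the 1,000-hour MTTF are illustrative assumptions.

```python
import math

def failure_rate(mttf_hours):
    """Failure rate (failures per hour) for a given MTTF in hours."""
    return 1.0 / mttf_hours

def reliability(t_hours, mttf_hours):
    """Probability of running t_hours without failure, assuming a constant failure rate."""
    return math.exp(-t_hours * failure_rate(mttf_hours))

mttf = 1000.0  # hypothetical MTTF in hours
print(f"{reliability(100.0, mttf):.3f}")   # 0.905
print(f"{reliability(1000.0, mttf):.3f}")  # 0.368
```

Under this model, a component running for a period equal to its MTTF has only about a 37% chance of surviving without a failure.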

FIT (Failures in Time) = number of failures in 1x10^9 hours (1 billion hours) of operation
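
Since FIT is just the failure rate scaled to a billion hours, converting between FIT and MTTF is a one-liner each way. A minimal sketch, with hypothetical function names and an example MTTF of one million hours:

```python
HOURS_PER_FIT_WINDOW = 1e9  # FIT counts failures per billion hours

def fit_from_mttf(mttf_hours):
    """FIT = failure rate (1/MTTF) multiplied by 1e9 hours."""
    return HOURS_PER_FIT_WINDOW / mttf_hours

def mttf_from_fit(fit):
    """Inverse conversion: MTTF in hours for a given FIT value."""
    return HOURS_PER_FIT_WINDOW / fit

print(fit_from_mttf(1_000_000))  # 1000.0
```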

Availability is the percentage of time that a system is able to perform its designed function. 

Availability = MTTF/(MTTF+MTTR) = MTTF/MTBF.
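
A short worked example of the availability formula. The MTTF and MTTR figures are made up, and the 8,760-hour year is an assumption used only to turn the percentage into expected downtime.

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years

def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR), as a fraction between 0 and 1."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical component: fails on average every 999 hours, takes 1 hour to repair.
a = availability(999.0, 1.0)
downtime = (1.0 - a) * HOURS_PER_YEAR

print(f"{a:.1%}")          # 99.9%
print(f"{downtime:.2f}")   # 8.76
```

Note that availability can be improved either by making failures rarer (raising MTTF) or by making repairs faster (lowering MTTR).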

Note that availability is a percentage and reliability is a probability.

Dependability is a measure of a system's trustworthiness to be relied upon to perform the desired function.  The attributes of dependability are reliability, availability, safety and security.  Safety refers to the non-occurrence of catastrophic failures, whose consequences are greater than the benefits.

The capacity of a system represents a tradeoff between the system's cost and its dependability under load.  When the load is higher than the capacity, the system can fail totally or partially, or run with degraded performance.

Saturday, March 3, 2012

Failure Perception

A fail-silent failure is one in which the failing unit presents either the correct result or no result at all.  A fail-silent failure is the easiest type of failure to tolerate because the only observed failure is that the failing unit has stopped working.  The reason for the failure is unknown, but the failing element is identified and the failure is contained and does not spread to other parts of the system.

A crash-failure is one where the unit stops after the first fail-silent failure. 

A fail-stop failure is a crash-failure that is visible to the rest of the system.

Consistent failures are seen as the same kind of failure by all observers in the system.  Inconsistent failures are ones that appear different to different observers.  These are also called two-faced failures, malicious failures or Byzantine failures.  These are the most difficult to isolate or correct.  An example of a consistent failure is a system that reports 1 in answer to all questions.  An example of an inconsistent failure is a system that reports 1 to user 1 and 2 to the rest of the users, or that routes all traffic to a certain network address and none to another.

A common design principle for fault tolerance is to assume only one failure at any one time.  However, many failures have occurred when this assumption was invalid.

Tolerating n fail-silent failures requires n+1 units.  Tolerating n consistent failures requires 2n+1 units.  Tolerating n malicious failures requires 3n+1 units.  The computer system in the Space Shuttle is designed to tolerate 2 simultaneous failures, which must be consistent but need not be fail-silent, so it requires 2x2+1 = 5 computers.
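
The three redundancy rules can be captured in a small lookup. This is only a sketch of the arithmetic; the function name units_required and the string labels are hypothetical.

```python
# Multiplier k in the rule "k*n + 1 units tolerate n simultaneous failures".
MULTIPLIER = {
    "fail-silent": 1,  # n + 1
    "consistent": 2,   # 2n + 1
    "malicious": 3,    # 3n + 1 (Byzantine)
}

def units_required(n_failures, failure_kind):
    """Number of redundant units needed to tolerate n_failures of the given kind."""
    if n_failures < 0:
        raise ValueError("n_failures must be non-negative")
    return MULTIPLIER[failure_kind] * n_failures + 1

# The Space Shuttle case from the text: 2 consistent failures.
print(units_required(2, "consistent"))  # 5
```

For comparison, tolerating the same 2 failures would take 3 units if they were fail-silent and 7 if they were malicious.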

Fault, Error and Failure

Failure = delivered service no longer complies with the specification (the agreed description of the system's expected function and service)

Error = part of the system state that is liable to lead to subsequent failure

Fault = adjudged or hypothesized cause of an error

Failures are detected by observers or users of the system.  Failures depend on the definition of the agreed-upon correct operation of the system.  If there is no specification of what a system should do, there cannot be a failure.  The same failure can result from different errors.

An error is incorrect system behavior from which a failure may occur.  Errors can be categorized into 2 types: value errors and timing errors.  A value error might be an incorrect discrete value or an incorrect system state.  A timing error can include total non-performance or a race condition.  Errors can be detected before they become failures.  An error is a manifestation of a fault.  Errors are the way that we can look into the system to discover whether a fault is present.

A fault is the defect in a system that can cause an error.  Neither the software nor the observer is aware of the presence of a fault until an error occurs.  A latent fault is a fault that is lying dormant and has not yet caused any error.  A latent fault becomes an active fault when the triggering circumstances arise.