Reliability AMP
∝ = 1 / Mean Time Between Failure (MTBF)
Reliability is the probability that a failure will not occur during a specific span of time.
Probability of failure = 1 − reliability
The overall failure rate for a complex item is the sum of the failure rates of all of the
individual components in the item. This applies to situations where failure of one
component causes the entire item to fail. The calculation is analogous to a series
electronic circuit.
∝ = ∝1 + ∝2 + … + ∝n = ∑ ∝k (summed over k = 1 … N)
The overall failure rate for items with full redundant overlap is the inverse of the sum of
the MTBFs of all of the individual redundant items. This applies to situations where every
component in the item must fail before the item fails. The calculation is analogous to a
parallel electronic circuit.
∝ = 1 / (1/∝1 + 1/∝2 + … + 1/∝n) = 1 / (∑ 1/∝k, summed over k = 1 … N)
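The two relationships above can be sketched in a few lines. The component failure rates below are hypothetical values chosen only to make the arithmetic easy to check:

```python
def series_failure_rate(rates):
    # Series: failure of any single component fails the whole item,
    # so the overall rate is the sum of the component rates.
    return sum(rates)

def redundant_failure_rate(rates):
    # Full redundancy: every component must fail before the item fails;
    # the overall rate is the inverse of the sum of the component MTBFs,
    # where each MTBF is the reciprocal of that component's failure rate.
    return 1 / sum(1 / r for r in rates)

rates = [1e-4, 2e-4, 2e-4]  # three hypothetical components, failures per hour

print(series_failure_rate(rates))     # ≈ 5e-4  -> MTBF about 2,000 hours
print(redundant_failure_rate(rates))  # ≈ 5e-5  -> MTBF about 20,000 hours
```

Note that redundancy lowers the overall failure rate below that of the best single component, while a series arrangement raises it above that of the worst.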
Failure rate for silicon and carbon devices doubles for each 5 °C temperature rise.
Electronic devices operating at 60 °C will fail 64 times more frequently than the same
kind of items operating at 30 °C. This relationship holds true above 25 °C.
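The doubling rule reduces to a single exponential multiplier; the function name is illustrative, and the 25 °C reference comes from the rule's stated lower limit:

```python
def thermal_failure_multiplier(temp_c, ref_c=25):
    # Failure rate doubles for every 5 °C rise above the reference;
    # the rule applies above 25 °C.
    return 2 ** ((temp_c - ref_c) / 5)

# Ratio between operation at 60 °C and at 30 °C: a 30 °C difference
# means six doublings, i.e. 2**6.
print(thermal_failure_multiplier(60) / thermal_failure_multiplier(30))  # 64.0
```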
Transportation reliability is similar, but values are expressed in terms of distance, such
as faults per mile or faults per kilometer.
Failure rate can be expressed in terms of the number of cycles. Thermal shock caused
by heating and cooling can induce failure when power is cycled on and off. Most
mechanical switches are built to operate 10,000 cycles before failure, which is about 30
years for a cycle rate of 1 action per day.
Distance, cycle, and decay reliability all make separate contributions that affect the
overall failure rate.
Availability
Availability is generally used with systems that incorporate periodic maintenance.
Availability is the probability that an item will operate correctly during a period of time
when used at random times during that period.
Availability = Available Time / Total Time
Available time is the time while the system is fully operational. Down time is the time
while the system is unavailable for normal use; it consists of the time while periodic
maintenance is being performed and the time while the system is faulted.
Availability calculations are meaningful for items with replaceable parts only when
failure modes have adequate coverage.
Coverage > Availability
Readiness
Readiness is meaningful when the item does not require down time for periodic
maintenance. This is a useful measurement for items that incorporate automatic
recovery or condition based maintenance.
Readiness is the probability that an item will operate as expected when used at any
random time while the item is in the correct mode of operation.
Mean Time To Recover from manual actions is generally measured or estimated. The
following is an example of the kind of values that could be used for estimating the
mechanical portion of the recovery time associated with replacing a failed circuit card.
Static wrist strap: 120 seconds
Small cables: disconnect 15 seconds; reconnect 60 seconds
Circuit card: remove 30 seconds; insert 120 seconds
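Tallying the example estimates above gives a rough mechanical MTTR for the card swap (the step names are descriptive labels, and the times are the example estimates, not measurements):

```python
# Mechanical portion of Mean Time To Recover for replacing a failed
# circuit card, in seconds, using the example values above.
steps = {
    "attach static wrist strap": 120,
    "disconnect small cables": 15,
    "remove circuit card": 30,
    "insert circuit card": 120,
    "reconnect small cables": 60,
}
mttr_mechanical = sum(steps.values())
print(mttr_mechanical)  # 345 seconds, a little under 6 minutes
```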
Coverage > Readiness
Coverage
Maintenance coverage evaluates the proportion of faults detected by CBM and PMS.
A rough estimate of coverage can be made by observing the ratio between operational
failures and maintenance actions.
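One way to read that ratio is as the fraction of all faults that maintenance caught before they surfaced in service. The interpretation and the counts below are illustrative assumptions, not from the text:

```python
# Hypothetical counts for one period of operation.
maintenance_detected = 47   # faults found by CBM / PMS actions
operational_failures = 3    # faults that surfaced in service instead

# Rough coverage estimate: fraction of all faults caught by maintenance.
coverage = maintenance_detected / (maintenance_detected + operational_failures)
print(coverage)  # 0.94
```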
Introduction
We are not going to overlook dispatch reliability, however. This is a distinct
part of the reliability program we discuss in the following pages. But we must
make the distinction and understand the difference. We must also realize that
not all delays are caused by maintenance or equipment, even though
maintenance is the center of attention during such a delay. Nor can we only
investigate equipment, maintenance procedures, or personnel for those
discrepancies that have caused a delay. As you will see through later
discussions, dispatch reliability is a subset of overall reliability.
Types of Reliability
The term reliability can be used in various respects. You can talk
about the overall reliability of an airline's activity, the reliability of a component
or system, or even the reliability of a process, function, or person. Here,
however, we will discuss reliability in reference to the maintenance program
specifically. There are four types of reliability one can talk about related to the
maintenance activity. They are
(a) statistical reliability,
(b) historical reliability,
(c) event-oriented reliability, and
(d) dispatch reliability.
about 30 data points the statistical calculations are not very significant. Another case of improper
use of statistics was given as an example presented in an aviation industry seminar on reliability.
The airline representative used this as an example of why his airline was going to stop using
statistical reliability. Here is his example: "We use weather radar only 2 months of the year. When we
calculate the mean value of failure rates and the alert level in the conventional manner [discussed in
detail later in this chapter], we find that we are always on alert." This, of course, is not true. The
gentleman was correct in defining an error in this method, and he was correct in determining that,
at least in this one case, statistics was not a valid approach. Figure 1 shows why.
Figure 1 Comparison of alert level calculation methods
The top curve in Fig. 1 shows the 2 data points for data collected when the
equipment was in service. It also shows 10 zero data points for those months
when the equipment was not used and no data were collected (12-month
column). These zeros are not valid statistical data points. They do not represent
zero failures; they represent "no data" and therefore should not be used in the
calculation. Using these data, however, has generated a mean value (lower,
dashed line) of 4.8 and an alert level at two standard deviations above the mean
(upper, solid line) of 27.6. One thing to understand about mathematics is that
the formulas will work, and will produce numerical answers, whether or not the
input data are correct. Garbage in, garbage out.

The point is, you only have two valid data points here, shown in the bottom
curve of Fig. 1 (2-month data). The only meaningful statistic here is the average
of the two numbers, 29 (dashed line). One can calculate a standard deviation
(SD) here using the appropriate formula or a calculator, but the parameter has
no meaning for just two data points. The alert level set by using this calculation
is 37.5 (solid line). For this particular example, statistical reliability is not
usable, but historical reliability is quite useful. We will discuss that subject in
the next section.
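The arithmetic behind Fig. 1 can be reproduced in a few lines. The two in-service monthly counts (26 and 32) are assumed values, chosen only so that their average is 29 and the resulting levels match the figure:

```python
import statistics

# Hypothetical failure counts for the two months the weather radar
# was actually in service (assumed values averaging 29, as in Fig. 1).
in_service = [26, 32]

# Improper approach: pad the ten out-of-service months with zeros.
padded = in_service + [0] * 10
mean_12 = statistics.mean(padded)                  # ≈ 4.8
alert_12 = mean_12 + 2 * statistics.stdev(padded)  # ≈ 27.6 -> always "on alert"

# Proper approach: use only the months with real data.
mean_2 = statistics.mean(in_service)                 # 29.0
alert_2 = mean_2 + 2 * statistics.stdev(in_service)  # ≈ 37.5
```

The zeros drag the mean far below every real observation, so the in-service months always sit above the alert level; dropping them gives a mean and alert level that actually describe the equipment.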
Historical reliability
Event-oriented reliability
Dispatch reliability
time). In other words, the airline dispatched 96 percent of its flights on time.

The use of dispatch reliability at the airlines is, at times, misinterpreted. The
passengers are concerned with timely dispatch for obvious reasons. To respond
to FAA pressures on dispatch rate, airlines often overreact. Some airline
maintenance reliability programs track only dispatch reliability; that is, they only
track and investigate problems that resulted in a delay or a cancellation of a
flight. But this is only part of an effective program, and dispatch reliability
involves more than just maintenance. An example will bear this out.

The aircraft pilot in command is 2 hours from his arrival station when he
experiences a problem with the rudder controls. He writes up the problem in the
aircraft logbook and reports it by radio to the flight following unit at the base.
Upon arrival at the base, the maintenance crew meets the plane and checks the
log for discrepancies. They find the rudder control write-up and begin
troubleshooting and repair actions. The repair takes a little longer than the
scheduled turnaround time and, therefore, causes a delay. Since maintenance is
at work and the rudder is the problem, the delay is charged to maintenance and
the rudder system would be investigated for the cause of the delay.

This is an improper response. Did maintenance cause the delay? Did the rudder
equipment cause the delay? Or was the delay caused by poor airline
procedures? To put it another way: could a change of airline procedures
eliminate the delay? Let us consider the events as they happened and how we
might change them for the better. If the pilot and the flight operations
organization knew about the problem 2 hours before landing, why wasn't
maintenance informed at the same time? If they had been informed, they could
have spent the time prior to landing in studying the problem and performing
some troubleshooting analysis. It is quite possible, then, that when the airplane
landed, maintenance could have met it with a fix in hand. Thus, this delay could
have been prevented by procedural changes. The procedure should be changed
to avoid such delays in the future.

While the maintenance organization and the airline could benefit from this
advance warning of problems, it will not always eliminate delays. The important
thing to remember is that if a delay is caused by procedure, it should be
attributed to procedure and it should be avoided in the future by altering the
procedure. That is what a reliability program is about: detecting where the
problems are and correcting them, regardless of who or what is to blame.
Another fallacy in overemphasizing dispatch delay is that some airlines will investigate each delay
(as they should), but if an equipment
problem is involved, the investigation may or may not take into account other
similar failures that did not cause delays. For example, if you had 12 write-ups
of rudder problems during the month and only one of these caused a delay, you
actually have two problems to investigate: (a) the delay, which could be caused
by problems other than the rudder equipment and (b) the 12 rudder write-ups
that may, in fact, be related to an underlying maintenance problem. One must
understand that dispatch delay constitutes one problem and the rudder system
malfunction constitutes another. They may indeed overlap but they are two
different problems. The delay is an event-oriented reliability problem that must
be investigated on its own; the 12 rudder problems (if this constitutes a high
failure rate) should be addressed by the statistical (or historical) reliability
program. The investigation of the dispatch delays should look at the whole
operation. Equipment problems—whether or not they caused delays—should be
investigated separately.
A Reliability Program
A reliability program for our purposes is, essentially, a set of rules and practices
for managing and controlling a maintenance program. The main function of a
reliability program is to monitor the performance of the vehicles and their
associated equipment and call attention to any need for corrective action. The
program has two additional functions: (a) to monitor the effectiveness of those
corrective actions and (b) to provide data to justify adjusting the maintenance
intervals or maintenance program procedures whenever those actions are
appropriate.
Data collection
We will list 10 data types that can be collected, although they may not necessarily be collected by
all airlines. Other items may be added at the airline's discretion. The data collection process gives
the reliability department the information needed to observe the effectiveness of the maintenance
program. Those items that are doing well might be eliminated from the program simply because the
data show that there are no problems. On the other hand, items not being tracked may need to be
added to the program because there are serious problems
related to those systems. Basically, you collect the data needed to stay on top of
your operation. The data types normally collected are as follows:
We will discuss each of these in detail below.
Flight time and flight cycles.
Most reliability calculations are "rates" and are based on flight hours or flight cycles;
e.g., 0.76 failures per 1000 flight hours or 0.15 removals per 100 flight cycles.
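Such rates are just event counts normalized to a base exposure. A minimal sketch, with hypothetical counts chosen to reproduce the two example rates:

```python
def rate_per(events, exposure, base=1000):
    # Normalize an event count to a rate per `base` units of exposure
    # (flight hours or flight cycles).
    return events * base / exposure

# Hypothetical counts:
print(rate_per(38, 50_000))           # 0.76 failures per 1000 flight hours
print(rate_per(12, 8_000, base=100))  # 0.15 removals per 100 flight cycles
```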
Cancellations and delays over 15 minutes.
Some operators collect data on all such events, but maintenance is concerned
primarily with those that are maintenance related. The 15-minute time frame is
used because that amount of time can usually be made up in flight. Longer
delays may cause schedule interruptions or missed connections, thus the need
for rebooking. This parameter is usually converted to a "dispatch rate" for the
airline as discussed above.
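The conversion to a dispatch rate is straightforward; the counts below are hypothetical, picked to reproduce the 96 percent example mentioned earlier:

```python
def dispatch_reliability(departures, delays_over_15_min, cancellations):
    # Fraction of departures with no delay over 15 minutes and no
    # cancellation; usually quoted as a percentage.
    return (departures - delays_over_15_min - cancellations) / departures

# 1000 departures with 35 delays over 15 minutes and 5 cancellations:
print(dispatch_reliability(1000, 35, 5))  # 0.96, i.e. 96 percent
```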
These discrepancies may not be as serious as those the flight crew deals with, but passenger
comfort and the ability of the cabin crew to perform their duties may be affected. These items may
include cabin safety inspection, operational check of cabin emergency lights, first aid kits, and fire
extinguishers. If any abnormality is found, these items are written up by the flight crew in the
maintenance logbook as a discrepancy item.
Component failures.
Any problems found during shop maintenance visits are tallied for the reliability program. This
refers to major components within the black boxes (avionics) or parts and components within
mechanical systems.
Critical failures.
Failures involving a loss of function or secondary damage that could have a direct adverse effect on
operating safety.
This alert level is based on a statistical analysis of the event rates of the previous year, offset by 3
months. The mean value of the failure rates and the standard deviation from the mean are
calculated, and an alert level is set at one to three standard deviations above that mean rate (more
on setting and adjusting alert levels later). This value, the upper control limit (UCL), is commonly
referred to as the alert level. However, there is an additional calculation that can be made to
smooth the curve and help eliminate "false alerts." This is the 3-month rolling average, or trend
line. The position of these two lines (the monthly rate and the 3-month average) relative to the UCL
is used to determine alert status.
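The alert-level and trend-line calculations described above can be sketched as follows. The monthly rates are hypothetical, and `k` stands in for the "one to three standard deviations" choice; the function names are illustrative:

```python
import statistics

def alert_level(monthly_rates, k=2):
    # UCL = mean of the prior year's monthly event rates plus k sample
    # standard deviations (k is typically between 1 and 3).
    return statistics.mean(monthly_rates) + k * statistics.stdev(monthly_rates)

def rolling_3_month(monthly_rates):
    # 3-month rolling average, used to smooth the curve and help
    # eliminate false alerts.
    return [statistics.mean(monthly_rates[i - 2:i + 1])
            for i in range(2, len(monthly_rates))]

rates = [4, 5, 3, 6, 4, 5, 4, 7, 5, 4, 6, 5]  # hypothetical monthly rates
ucl = alert_level(rates, k=2)
trend = rolling_3_month(rates)
# A month is "on alert" when its rate, or the trend line, exceeds the UCL.
```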