Reliability AMP


Reliability

Maintenance is closely associated with reliability because maintenance is required to
restore capability that has been lost due to failure.

Electronic devices decay in a way that is mathematically equivalent to radioactive
decay processes for unstable atoms.

Electronic failure is governed by random processes, where Mean Time Between
Failure (MTBF) identifies the average number of hours until failure occurs. Lambda (λ)
identifies the number of failures expected per hour.

$$\lambda = \frac{1}{\text{Mean Time Between Failure}}$$

Reliability is the probability that a failure will not occur during a specific span of time.

$$\text{Reliability} = e^{-\lambda \times \text{time}}$$

$$\text{Probability of Failure} = 1 - \text{Reliability}$$
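
The relationships above can be sketched in a few lines of Python. This is only an illustration: the MTBF and time span are hypothetical values chosen for the example, and the function names are not from the text.

```python
import math

def failure_rate(mtbf_hours: float) -> float:
    """Failure rate lambda (failures per hour) as the reciprocal of MTBF."""
    return 1.0 / mtbf_hours

def reliability(mtbf_hours: float, span_hours: float) -> float:
    """Probability of no failure during span_hours: e^(-lambda * time)."""
    return math.exp(-failure_rate(mtbf_hours) * span_hours)

# Hypothetical item: MTBF of 10,000 hours, used for 1,000 hours.
example_reliability = reliability(10_000.0, 1_000.0)
example_failure_probability = 1.0 - example_reliability
print(f"reliability = {example_reliability:.4f}")
print(f"probability of failure = {example_failure_probability:.4f}")
```

Note that when the time span equals the MTBF, the reliability is e^(-1), or about 37 percent, so an item is more likely than not to have failed by the time it reaches its MTBF.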

Failure rate calculations rely on logarithmic math, and working with the failure rate λ
simplifies them in a way that is very similar to the type of analysis used for
electronic circuits.

Overall failure rate for a complex item is the sum of all the failure rates for all of the
individual components in the item. This applies to situations where failure of one
component causes the entire item to fail. The type of calculation is similar to a series
electronic circuit.

$$\lambda = \lambda_1 + \lambda_2 + \dots + \lambda_N = \sum_{k=1}^{N} \lambda_k$$

Overall failure rate for items with full redundant overlap is the inverse of the sum of
MTBF for all of the individual redundant items. This applies to situations where all of the
components in the item must all fail before the item fails. The type of calculation is
similar to a parallel electronic circuit.

$$\lambda = \frac{1}{\dfrac{1}{\lambda_1} + \dfrac{1}{\lambda_2} + \dots + \dfrac{1}{\lambda_N}} = \frac{1}{\sum_{k=1}^{N} \dfrac{1}{\lambda_k}}$$
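
A minimal sketch of the two combination rules, following the formulas exactly as the text states them (sum of rates for the series case, inverse of the summed MTBFs for the fully redundant case). The component failure rates are hypothetical, and the function names are illustrative only.

```python
def series_failure_rate(rates):
    """Any single component failure fails the item: rates add."""
    return sum(rates)

def redundant_failure_rate(rates):
    """All components must fail before the item fails.

    As defined in the text: the inverse of the sum of the individual
    MTBFs, where each MTBF is 1 / rate.
    """
    return 1.0 / sum(1.0 / rate for rate in rates)

# Hypothetical component failure rates, in failures per hour.
component_rates = [1e-4, 2e-4, 5e-4]

series_rate = series_failure_rate(component_rates)
redundant_rate = redundant_failure_rate(component_rates)
print(f"series:    {series_rate:.6e} failures/hour")
print(f"redundant: {redundant_rate:.6e} failures/hour")
```

As with resistors, the series result is always larger than the largest individual rate, and the redundant result is always smaller than the smallest one.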

A reliability block diagram is used to construct a model for large items. This provides
traceability when funding and manpower requirements are identified using reliability
calculations.

Failure rate for silicon and carbon devices doubles for each 5 °C temperature rise.
Electronic devices operating at 60 °C will fail 64 times more frequently than the same
kind of items operating at 30 °C. This relationship holds true above 25 °C.
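
The doubling rule can be checked with a short calculation. This sketch assumes only what the text states (a doubling per 5 °C rise); the function and parameter names are illustrative.

```python
def temperature_factor(reference_c: float, operating_c: float,
                       doubling_step_c: float = 5.0) -> float:
    """Failure-rate multiplier when moving from reference_c to operating_c,
    assuming the rate doubles for every doubling_step_c degrees of rise."""
    return 2.0 ** ((operating_c - reference_c) / doubling_step_c)

# 30 °C to 60 °C is six 5 °C steps, so the rate doubles six times.
factor_30_to_60 = temperature_factor(30.0, 60.0)
print(factor_30_to_60)
```

Six doublings give 2^6 = 64, which matches the figure quoted in the text.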

Transportation reliability is similar, but values are expressed in terms of distance, such
as faults per mile or faults per kilometer.

Failure rate can be expressed in terms of the number of cycles. Thermal shock caused
by heating and cooling can induce failure when power is cycled on and off. Most
mechanical switches are built to operate 10,000 cycles before failure, which is about 27
years for a cycle rate of 1 action per day.

Distance, cycle, and decay reliability all have separate contributions that affect the
overall failure rate.

Availability
Availability is generally used with systems that incorporate periodic maintenance.
Availability is the probability that an item will operate correctly during a period of time
when used at random times during that period.

$$\text{Availability} = \frac{\text{Available Time}}{\text{Total Time}}$$

$$\text{Total Time} = \text{Available Time} + \text{Down Time}$$

$$\text{Down Time} = \text{Maintenance Time} + \text{Faulted Time}$$

Available time is the time while the system is fully operational. Down time is the time
while the system is unavailable for normal use, and this consists of the time while
periodic maintenance is being performed and the amount of time while the system is
faulted.
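
The availability definitions above translate directly into code. The hour figures below are hypothetical and only illustrate the arithmetic.

```python
def availability(available_hours: float, maintenance_hours: float,
                 faulted_hours: float) -> float:
    """Availability = available time / total time, where
    total time = available time + down time and
    down time = maintenance time + faulted time."""
    down_hours = maintenance_hours + faulted_hours
    total_hours = available_hours + down_hours
    return available_hours / total_hours

# Hypothetical year (8,760 hours): 8,000 hours operational,
# 500 hours of periodic maintenance, 260 hours faulted.
yearly_availability = availability(8_000.0, 500.0, 260.0)
print(f"availability = {yearly_availability:.3f}")
```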

Availability calculations are meaningful for items with replaceable parts only when
failure modes have adequate coverage.

$$\text{Coverage} > \text{Availability}$$

Readiness
Readiness is meaningful when the item does not require down time for periodic
maintenance. This is a useful measurement for items that incorporate automatic
recovery or condition based maintenance.

Readiness is the probability that an item will operate as expected when used at any
random time while the item is in the correct mode of operation.

$$\text{Readiness} = 1 - \lambda \times \text{Mean Time To Recover}$$

Mean Time To Recover from manual actions is generally measured or estimated. The
following is an example of the kind of values that could be used for estimating the
mechanical portion of the recovery time associated with replacing a failed circuit card.
• Static wrist strap: 120 seconds
• Bolts and screws with captive nut: remove 15 seconds; replace 30 seconds
• Bolts and screws with loose nut: remove 30 seconds; replace 60 seconds
• Small cables: disconnect 15 seconds; reconnect 60 seconds
• Circuit card: remove 30 seconds; insert 120 seconds
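
The estimating values above can be combined into a recovery-time estimate and then fed into the readiness formula. This is a rough sketch: the per-action times come from the list, but the job make-up (four captive fasteners, two small cables, one card), the failure rate, and the assumption that the wrist strap is put on once per job are all hypothetical.

```python
# Per-action times in seconds from the list above: (remove, replace).
ACTION_SECONDS = {
    "captive_fastener": (15, 30),
    "loose_nut_fastener": (30, 60),
    "small_cable": (15, 60),
    "circuit_card": (30, 120),
}
WRIST_STRAP_SECONDS = 120  # assumed to be needed once per job

def mechanical_recovery_seconds(action_counts):
    """Sum the remove and replace times for every action in the job."""
    total = WRIST_STRAP_SECONDS
    for action, count in action_counts.items():
        remove, replace = ACTION_SECONDS[action]
        total += count * (remove + replace)
    return total

def readiness(failure_rate_per_hour, recovery_seconds):
    """Readiness = 1 - lambda * mean time to recover (both in hours)."""
    return 1.0 - failure_rate_per_hour * recovery_seconds / 3600.0

# Hypothetical card replacement job.
card_job = {"captive_fastener": 4, "small_cable": 2, "circuit_card": 1}
job_seconds = mechanical_recovery_seconds(card_job)

# Hypothetical failure rate of 1e-4 failures per hour.
job_readiness = readiness(1e-4, job_seconds)
print(f"mechanical recovery time = {job_seconds} s")
print(f"readiness = {job_readiness:.6f}")
```

This covers only the mechanical portion of the recovery; fault isolation, parts retrieval, and testing would add to the mean time to recover.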

Readiness calculations are meaningful for items with replaceable parts only when
failure modes have adequate coverage.

$$\text{Coverage} > \text{Readiness}$$

Coverage
Maintenance coverage evaluates the proportion of faults detected by CBM and PMS.

$$\text{Coverage} = \frac{\text{Faults Detected by CBM} + \text{Faults Detected by PMS}}{\text{Total Possible Faults}}$$

A rough estimate of coverage can be made by observing the ratio between operational
failures and maintenance actions.

$$\text{Coverage} \approx \frac{\text{Total Faults Excluding Operational Failure}}{\text{Total Faults Including Operational Failure}}$$

Availability calculations, readiness calculations, and related claims are only valid if
coverage exceeds the availability or readiness being claimed.
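
Both coverage formulas, and the validity check against a claimed availability, can be sketched as follows. All fault counts and the claimed availability are hypothetical; the rough estimate treats "faults excluding operational failure" as the faults found through maintenance actions.

```python
def coverage(cbm_detected: int, pms_detected: int, total_possible: int) -> float:
    """Proportion of possible faults detected by CBM and PMS."""
    return (cbm_detected + pms_detected) / total_possible

def coverage_estimate(maintenance_faults: int, operational_failures: int) -> float:
    """Rough estimate: faults excluding operational failures divided by
    faults including operational failures."""
    return maintenance_faults / (maintenance_faults + operational_failures)

def claim_is_valid(coverage_value: float, claimed_value: float) -> bool:
    """A claimed availability or readiness is only valid if coverage exceeds it."""
    return coverage_value > claimed_value

# Hypothetical figures.
design_coverage = coverage(cbm_detected=60, pms_detected=30, total_possible=100)
observed_coverage = coverage_estimate(maintenance_faults=95, operational_failures=5)
print(design_coverage, observed_coverage)
print(claim_is_valid(observed_coverage, claimed_value=0.99))
```

In this example the estimated coverage of 0.95 does not support a claimed availability of 0.99.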

Contents

Module 21.8 Reliability
    Introduction
    Types of Reliability
A Reliability Program
    Elements of a Reliability Program
    Other Functions of the Reliability Program
    Administration and Management of the Reliability Program

Module 21.8 Reliability

Introduction

Reliability equals consistency. It can be defined as the probability that an item
will perform a required function, under specified conditions without failure, for a
specified amount of time according to its intended design. The reliability
program is a valuable means of achieving better operational performance in an
aircraft maintenance environment, and it is designed to decrease maintenance-
related issues and increase flight safety. The intent of this program is to deal
systematically with problems as they arise instead of trying to cure immediate
symptoms. This program is normally customized, depending on the operators, to
accurately reflect the specific operation's requirements. Although the word
reliability has many meanings, in this book we will define the terms that have
specialized meanings to aviation maintenance and engineering. In the case of
reliability, we first must discuss one important difference in the application of
the term. There are two main approaches to the concept of reliability in the
aviation industry. One looks essentially at the whole airline operation or the M&E
operation within the whole, and the other looks at the maintenance program in
particular. There is nothing wrong with either of these approaches, but they
differ somewhat, and that difference must be understood. The first approach is
to look at the overall airline reliability. This is measured essentially by dispatch
reliability; that is, by how often the airline achieves an on-time departure of its
scheduled flights. Airlines using this approach track delays. Reasons for the
delay are categorized as maintenance, flight operations, air traffic control (ATC),
etc. and are logged accordingly. The M&E organization is concerned only with
those delays caused by maintenance. Very often, airlines using this approach to
reliability overlook any maintenance problems (personnel or equipment related)
that do not cause delays, and they track and investigate only those problems
that do cause delays. This is only partially effective in establishing a good
maintenance program.
The second approach (which we should actually call the primary approach) is to consider reliability
as a program specifically designed to address the problems of maintenance whether or not they
cause delays and provide analysis of and corrective actions for those items to improve the overall
reliability of the equipment. This contributes to the dispatch reliability, as well as to the overall
operation.

We are not going to overlook the dispatch reliability, however. This is a distinct
part of the reliability program we discuss in the following pages. But we must
make the distinction and understand the difference. We must also realize that
not all delays are caused by maintenance or equipment even though
maintenance is the center of attention during such a delay. Nor can we only
investigate equipment, maintenance procedures, or personnel for those
discrepancies that have caused a delay. As you will see through later
discussions, dispatch reliability is a subset of overall reliability.

Types of Reliability

The term reliability can be used in various respects. You can talk
about the overall reliability of an airline's activity, the reliability of a component
or system, or even the reliability of a process, function, or person. Here,
however, we will discuss reliability in reference to the maintenance program
specifically. There are four types of reliability one can talk about related to the
maintenance activity. They are
(a) statistical reliability,
(b) historical reliability,
(c) event-oriented reliability, and
(d) dispatch reliability.

Although dispatch reliability is a special case of event-oriented reliability, we will
discuss it separately due to its significance.

Statistical reliability

Statistical
reliability is based upon collection and analysis of failure, removal, and repair
rates of systems or components. From this point on, we will refer to these
various types of maintenance actions as "events." Event rates are calculated on
the basis of events per 1000 flight hours or events per 100 flight cycles. This
normalizes the parameter for the purpose of analysis. Other rates may be used
as appropriate.
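
The normalization described above is a simple scaling. In this sketch the event counts, flight hours, and flight cycles are hypothetical values chosen for illustration.

```python
def rate_per_1000_flight_hours(events: int, flight_hours: float) -> float:
    """Events normalized to a 1000-flight-hour basis."""
    return events * 1000.0 / flight_hours

def rate_per_100_flight_cycles(events: int, flight_cycles: float) -> float:
    """Events normalized to a 100-flight-cycle basis."""
    return events * 100.0 / flight_cycles

# Hypothetical fleet month: 19 failures in 25,000 flight hours,
# and 6 removals in 4,000 flight cycles.
failure_rate = rate_per_1000_flight_hours(19, 25_000.0)
removal_rate = rate_per_100_flight_cycles(6, 4_000.0)
print(f"{failure_rate:.2f} failures per 1000 flight hours")
print(f"{removal_rate:.2f} removals per 100 flight cycles")
```

Normalizing this way lets months with different amounts of flying be compared on the same scale.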
Many airlines use statistical analysis, but some give the statistics more credence than they
deserve. For example, airlines with 10 or more aircraft tend to use the statistical approach, but
most teachers and books on statistics tell us that for any data set with fewer than about 30 data
points the statistical calculations are not very significant. Another case of improper use of
statistics was presented at an aviation industry seminar on reliability. The airline representative
used it as an example of why his airline was going to stop using statistical reliability. Here is his
example: "We use weather radar only 2 months of the year. When we calculate the mean value of
failure rates and the alert level in the conventional manner [discussed in detail later in this
chapter] we find that we are always on alert. This, of course, is not true." The gentleman was
correct in identifying an error in this method, and he was correct in determining that, at least in
this one case, statistics was not a valid approach. Figure 1 shows why.
Figure 1 Comparison of alert level calculation methods

The top curve in Fig. 1 shows the 2 data points for data collected when the
equipment was in service. It also shows 10 zero data points for those months
when the equipment was not used and no data were collected (12-month
column). These zeros are not valid statistical data points. They do not represent
zero failures; they represent "no data" and therefore should not be used in the
calculation. Using these data, however, has generated a mean value (lower,
dashed line) of 4.8 and an alert level at two standard deviations above the mean
(upper, solid line) of 27.6. One thing to understand about mathematics is that
the formulas will work, will produce numerical answers, whether or not the input
data are correct. Garbage in, garbage out. The point is, you only have two valid
data points here shown in the bottom curve of Fig. 1 (2-month data). The only
meaningful statistic here is the average of the two numbers, 29 (dashed line).
One can calculate a standard deviation (SD) here using the appropriate formula
or a calculator, but the parameter has no meaning for just two data points. The
alert level set by using this calculation is 37.5 (solid line). For this particular
example, statistical reliability is not usable, but historical reliability is quite
useful. We will discuss that subject in the next section.
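
The figures quoted for Fig. 1 can be reproduced with Python's standard statistics module, using the fleet failure counts of 26 and 32 that the text gives for the two in-service months. The sketch assumes the sample standard deviation (n - 1 in the denominator), which is the choice that reproduces the quoted alert levels.

```python
import statistics

def mean_and_alert_level(data, sd_multiplier=2.0):
    """Return (mean, mean + sd_multiplier * sample standard deviation)."""
    mean = statistics.mean(data)
    sample_sd = statistics.stdev(data)
    return mean, mean + sd_multiplier * sample_sd

in_service_months = [26, 32]                   # the only valid data points
padded_year = in_service_months + [0] * 10     # "no data" months wrongly entered as zero

padded_mean, padded_alert = mean_and_alert_level(padded_year)
valid_mean, valid_alert = mean_and_alert_level(in_service_months)

print(f"12-month column: mean = {padded_mean:.1f}, alert level = {padded_alert:.1f}")
print(f"2-month data:    mean = {valid_mean:.1f}, alert level = {valid_alert:.1f}")
```

With the padded data the alert level (about 27.6) sits below one of the two real monthly values and barely above the other, which is why the airline found itself "always on alert" whenever the radar was actually in use.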

Historical reliability

Historical reliability is simply a comparison of current event rates with those of
past experience. In the example of Fig. 1, the data collected show fleet failures of
26 and 32 for the 2 months the equipment was in service. Is that good or bad?
Statistics will not tell you but history will. Look at last year's data for the same
equipment, same time period. Use the previous year's data also, if available. If
current rates compare favorably with past experience, then everything is okay;
if there is a significant difference in the data from one year to the next, that
would be an indication of a possible problem. That is what a reliability program
is all about: detecting and subsequently resolving problems.
Historical reliability can be used in other instances, also. The most common one is when new
equipment is being introduced (components, systems, engines, aircraft) and there is no previous
data available on event rates, no information on what sort of rates to expect. What is "normal" and
what constitutes "a problem" for this equipment? In historical reliability we merely collect the
appropriate data and literally "watch what happens." When sufficient data are collected to
determine the "norms," the equipment can be added to the statistical reliability program.

Historical reliability can also be used by airlines wishing to establish a
statistically based program. Data on event rates kept for 2 or 3 years can be
tallied or plotted graphically and analyzed to determine what the normal or
acceptable rates would be (assuming no significant problems were incurred).
Guidelines can then be established for use during the next year. This will be
covered in more detail in the reliability program section below.

Event-oriented reliability

Event-oriented reliability is concerned with one-time events such as bird strikes,
hard landings, overweight landings, in-flight engine shutdowns, lightning strikes,
ground or flight interruption, and other accidents or incidents. These are events
that do not occur on a daily basis in airline operations and, therefore, produce
no usable statistical or historical data. Nevertheless, they do occur from time to
time, and each occurrence must be investigated to determine the cause and to
prevent or reduce the possibility of recurrence of the problem. In ETOPS
operations, certain events associated with this program differ from conventional
reliability programs, and they do rely on historical data and alert levels to
determine if an investigation is necessary to establish whether a problem can
be reduced or eliminated by changing the maintenance program. Events that
are related to ETOPS flights are designated by the FAA as actions to be tracked
by an "event-oriented reliability program" in addition to any statistical or
historical reliability program. Not all the events are investigated, but everything
is continually monitored in case a problem arises.

Dispatch reliability

Dispatch reliability is a measure of the overall effectiveness of the airline
operation with respect to on-time departure. It receives considerable attention
from regulatory authorities, as well as from airlines and passengers, but it is
really just a special form of the event-oriented reliability approach. It is a simple
calculation based on 100 flights. This makes it convenient to relate dispatch rate
in percent. An example of the dispatch rate calculation follows.
If eight delays and cancellations are experienced in 200 flights, that would mean that there were
four delays per 100 flights, or a 4 percent delay rate. A 4 percent delay rate would translate to a 96
percent dispatch rate (100 percent - 4 percent delayed = 96 percent dispatched on time). In other
words, the airline dispatched 96 percent of its flights on time.

The use of dispatch reliability at the airlines is, at times, misinterpreted. The
passengers are concerned with timely dispatch for obvious reasons. To respond
to FAA pressures on dispatch rate, airlines often overreact. Some airline
maintenance reliability programs track only dispatch reliability; that is, they only
track and investigate problems that resulted in a delay or a cancellation of a
flight. But this is only part of an effective program, and dispatch reliability
involves more than just maintenance. An example will bear this out.

The aircraft pilot in command is 2 hours from his arrival station when he experiences a
problem with the rudder controls. He writes up the problem in the aircraft
logbook and reports it by radio to the flight following unit at the base. Upon
arrival at the base, the maintenance crew meets the plane and checks the log
for discrepancies. They find the rudder control write-up and begin
troubleshooting and repair actions. The repair takes a little longer than the
scheduled turnaround time and, therefore, causes a delay. Since maintenance is
at work and the rudder is the problem, the delay is charged to maintenance and
the rudder system would be investigated for the cause of the delay. This is an
improper response.

Did maintenance cause the delay? Did the rudder
equipment cause the delay? Or was the delay caused by poor airline
procedures? To put it another way: could a change of airline procedures
eliminate the delay? Let us consider the events as they happened and how we
might change them for the better. If the pilot and the flight operations
organization knew about the problem 2 hours before landing, why wasn't
maintenance informed at the same time? If they had been informed, they could
have spent the time prior to landing in studying the problem and performing
some troubleshooting analysis. It is quite possible, then, that when the airplane
landed, maintenance could have met it with a fix in hand. Thus, this delay could
have been prevented by procedural changes. The procedure should be changed
to avoid such delays in the future.

While the maintenance organization and the
airline could benefit from this advance warning of problems, it will not always
eliminate delays. The important thing to remember is that if a delay is caused
by procedure, it should be attributed to procedure and it should be avoided in
the future by altering the procedure. That is what a reliability program is about:
detecting where the problems are and correcting them, regardless of who or
what is to blame.
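
The dispatch rate arithmetic given earlier in this section (eight delays and cancellations in 200 flights) can be sketched as a small function. The function name is illustrative only.

```python
def dispatch_rate_percent(delays_and_cancellations: int, flights: int) -> float:
    """Percentage of flights dispatched on time.

    The delay rate is expressed per 100 flights (a percentage) and
    subtracted from 100 percent.
    """
    delay_rate_percent = 100.0 * delays_and_cancellations / flights
    return 100.0 - delay_rate_percent

# The example from the text: 8 delays and cancellations in 200 flights.
example_dispatch_rate = dispatch_rate_percent(8, 200)
print(f"dispatch rate = {example_dispatch_rate:.0f} percent")
```

The number alone says nothing about why each delay occurred, which is the point of the rudder example: the cause still has to be attributed correctly before any corrective action is chosen.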
Another fallacy in overemphasizing dispatch delay is that some airlines will investigate each delay
(as they should), but if an equipment problem is involved, the investigation may or may not take
into account other
similar failures that did not cause delays. For example, if you had 12 write-ups
of rudder problems during the month and only one of these caused a delay, you
actually have two problems to investigate: (a) the delay, which could be caused
by problems other than the rudder equipment and (b) the 12 rudder write-ups
that may, in fact, be related to an underlying maintenance problem. One must
understand that dispatch delay constitutes one problem and the rudder system
malfunction constitutes another. They may indeed overlap but they are two
different problems. The delay is an event-oriented reliability problem that must
be investigated on its own; the 12 rudder problems (if this constitutes a high
failure rate) should be addressed by the statistical (or historical) reliability
program. The investigation of the dispatch delays should look at the whole
operation. Equipment problems—whether or not they caused delays—should be
investigated separately.

A Reliability Program

A reliability program for our purposes is, essentially, a set of rules and practices
for managing and controlling a maintenance program. The main function of a
reliability program is to monitor the performance of the vehicles and their
associated equipment and call attention to any need for corrective action. The
program has two additional functions: (a) to monitor the effectiveness of those
corrective actions and (b) to provide data to justify adjusting the maintenance
intervals or maintenance program procedures whenever those actions are
appropriate.

Elements of a Reliability Program

A good reliability program consists of seven basic elements as well as a number
of procedures and administrative functions. The basic elements (discussed in
detail below) are (a) data collection; (b) problem area alerting; (c) data display;
(d) data analysis; (e) corrective actions; (f) follow-up analysis; and (g) a monthly
report. We will look at each of these seven program elements in more detail.

Data collection

We will list 10 data types that can be collected, although they may not necessarily be collected by
all airlines. Other items may be added at the airline's discretion. The data collection process gives
the reliability department the information needed to observe the effectiveness of the maintenance
program. Those items that are doing well might be eliminated from the program simply because the
data show that there are no problems. On the other hand, items not being tracked may need to be
added to the program because there are serious problems related to those systems. Basically, you
collect the data needed to stay on top of your operation. The data types normally collected are as
follows:

1. Flight time and cycles for each aircraft
2. Cancellations and delays over 15 minutes
3. Unscheduled component removals
4. Unscheduled engine removals
5. In-flight shutdowns of engines
6. Pilot reports or logbook write-ups
7. Cabin logbook write-ups
8. Component failures (shop maintenance)
9. Maintenance check package findings
10. Critical failures

We will discuss each of these in detail below.

Flight time and flight cycles.

Most reliability calculations are "rates" and are based on flight hours or flight cycles;
e.g., 0.76 failures per 1000 flight hours or 0.15 removals per 100 flight cycles.

Cancellations and delays over 15 minutes.

Some operators collect data on all such events, but maintenance is concerned
primarily with those that are maintenance related. The 15-minute time frame is
used because that amount of time can usually be made up in flight. Longer
delays may cause schedule interruptions or missed connections, thus the need
for rebooking. This parameter is usually converted to a "dispatch rate" for the
airline as discussed above.

Unscheduled component removals.

This is the unscheduled maintenance mentioned earlier and is definitely a
concern of the reliability program. The rate at which aircraft components are
removed may vary widely depending on the equipment or system involved. If
the rate is not acceptable, an investigation should be made and some sort of
corrective action must be taken. Components that are removed and replaced on
schedule, such as hard time (HT) items and certain on condition (OC) items, are
not included here, but these data may be collected to aid in justifying a change
in the HT or OC interval schedule.

Unscheduled removals of engines.

This is the same as component removals, but obviously an engine removal
constitutes a considerable amount of time and manpower; therefore, these data
are tallied separately.

In-flight shutdown (IFSD) of engines.


This malfunction is probably one of the most serious in aviation, particularly if
the airplane only has two engines (or one). The FAA requires a report of IFSD
within 72 hours. The report must include the cause and the corrective action.
The ETOPS operators are required to track IFSDs and respond to excessive rates
as part of their authorization to fly ETOPS. However, non-ETOPS operators also
have to report shutdowns and should also be tracking and responding to high
rates through the reliability program.

Pilot reports or logbook write-ups.

These are malfunctions or degradations in airplane systems noted by the flight
crew during flight. Tracking is usually by ATA Chapter numbers using two, four,
or six digits. This allows pinpointing of the problems to the system, subsystem,
or component level as desired. Experience will dictate what levels to track for
specific equipment.

Cabin logbook write-ups.

These discrepancies may not be as serious as those the flight crew deals with, but passenger
comfort and the ability of the cabin crew to perform their duties may be affected. These items may
include cabin safety inspection, operational check of cabin emergency lights, first aid kits, and fire
extinguishers. If any abnormality is found, these items are written up by the flight crew in the
maintenance logbook as a discrepancy item.

Component failures.

Any problems found during shop maintenance visits are tallied for the reliability program. This
refers to major components within the black boxes (avionics) or parts and components within
mechanical systems.

Maintenance check package findings.

Systems or components found to be in need of repair or adjustment during normal scheduled
maintenance checks (non-routine items) are tracked by the reliability program.

Critical failures.

Failures involving a loss of function or secondary damage that could have a direct adverse effect on
operating safety.

Problem detection: an alerting system


The data collection system allows the operator to compare present performance with past
performance in order to judge the effectiveness of maintenance and the maintenance program. An
alerting system should be in place to quickly identify those areas where the performance is
significantly different from normal. These are items that might need to be investigated for possible
problems. Standards for event rates are set according to analysis of past performances and
deviations from these standards.

This alert level is based on a statistical analysis of the event rates of the previous year, offset by 3
months. The mean value of the failure rates and the standard deviation from the mean are
calculated, and an alert level is set at one to three standard deviations above that mean rate (more
on setting and adjusting alert levels later). This value, the upper control limit (UCL), is commonly
referred to as the alert level. However, there is an additional calculation that can be made to
smooth the curve and help eliminate "false alerts." This is the 3-month rolling average, or trend
line. The position of these two lines (the monthly rate and the 3-month average) relative to the UCL
is used to determine alert status.
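
The alert level calculation described above can be sketched as follows. Several choices here are assumptions rather than details taken from the text: the sample standard deviation is used, the multiplier defaults to 2, the baseline is read as the 12 months ending 3 months before the latest month, and the monthly rates are invented for illustration.

```python
import statistics

def upper_control_limit(monthly_rates, sd_multiplier=2.0, window=12, offset=3):
    """Mean + sd_multiplier * SD over `window` months of event rates,
    ending `offset` months before the most recent month."""
    end = len(monthly_rates) - offset
    start = end - window
    if start < 0:
        raise ValueError("need at least window + offset months of data")
    baseline = monthly_rates[start:end]
    return statistics.mean(baseline) + sd_multiplier * statistics.stdev(baseline)

def rolling_average(monthly_rates, months=3):
    """Average of the most recent `months` monthly rates (the trend line)."""
    return statistics.mean(monthly_rates[-months:])

# Hypothetical event rates per 1000 flight hours: 12 baseline months
# followed by the 3 most recent months.
rates = [0.5, 0.7] * 6 + [0.6, 0.8, 1.0]

ucl = upper_control_limit(rates)
latest_rate = rates[-1]
trend = rolling_average(rates)

print(f"UCL = {ucl:.3f}")
print(f"latest monthly rate = {latest_rate:.3f} (above UCL: {latest_rate > ucl})")
print(f"3-month average     = {trend:.3f} (above UCL: {trend > ucl})")
```

In this made-up data the latest monthly rate is above the UCL while the 3-month average is still below it, which is the kind of situation the rolling average is meant to put in context.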
