Aviation Maintenance Management
2013
Tariq Siddiqui
[email protected]
Part of the Aviation Safety and Security Commons, and the Maintenance Technology Commons
This Book is brought to you for free and open access by Scholarly Commons. It has been accepted for inclusion in
Publications by an authorized administrator of Scholarly Commons. For more information, please contact
[email protected].
01-Kinnison_CH01 p01-14.qxd 8/14/12 4:45 PM Page 1
Part 1
Fundamentals of Maintenance
“... maintenance is a science since its execution
relies, sooner or later, on most or all of the
sciences. It is an art because seemingly identical
problems regularly demand and receive varying
approaches and actions and because some
managers, foremen, and mechanics display
greater aptitude for it than others show or even
attain. It is above all a philosophy because it is
a discipline that can be applied intensively,
modestly, or not at all, depending upon a wide
range of variables that frequently transcend
more immediate and obvious solutions.”
LINDLEY R. HIGGINS
Maintenance Engineering Handbook;
McGraw-Hill, NY, 1990.
Chapter 1
Thermodynamics Revisited
Nearly all engineering students have to take a course in thermodynamics in
their undergraduate years. For some students, such as aerodynamicists and power
plant engineers, thermodynamics is a requirement of their major field. Others,
such as electrical engineers, take the course simply because it is required for
graduation. Of course, thermodynamics and numerous other courses are
“required” for all engineers because these courses apply
to the various theories of science and engineering that must be understood to
effectively apply the “college learning” to the real world. After all, that is what
engineering is all about—bridging the gap between theory and reality.
There is one concept in thermodynamics that often puzzles students. That
concept is labeled entropy. The academic experts in the thermodynamics field
got together one day (as one thermo professor explained) to create a classical
thermodynamic equation describing all the energy of a system—any system.
When they finished, they had an equation with numerous terms; and all
but one of these terms were easily explainable. They identified the terms for heat
energy, potential energy, kinetic energy, etc., but one term remained. They
were puzzled about the meaning of this term. They knew they had done the work
correctly; the term had to represent energy. So, after considerable pondering by
these experts, the mysterious term was dubbed “unavailable energy”—energy
that is unavailable for use. This explanation satisfied the basic law of thermo-
dynamics that energy can neither be created nor destroyed; it can only be trans-
formed. And it helped to validate their equation.
Let us shed a little more light on this. Energy is applied to create a system
by manipulating, processing, and organizing various elements of the universe.
More energy is applied to make the system do its prescribed job. And whenever
the system is operated, the sum total of its output energy is less than the total
energy input. While some of this can be attributed to heat loss through friction
and other similar, traceable actions, there is still an imbalance of energy. Defining
entropy as the “unavailable energy” of a system rectifies that imbalance.
The late Dr. Isaac Asimov, biophysicist and prolific writer of science fact and
science fiction,1 had the unique ability to explain the most difficult science to
the layperson in simple, understandable terms. Dr. Asimov said that if you
want to understand the concept of entropy in practical terms, think of it as the
difference between the theoretically perfect system you have on the drawing
board and the actual, physical system you have in hand. In other words, we can
design perfect systems on paper, but we cannot build perfect systems in the real
world. The difference between that which we design and that which we can build
constitutes the natural entropy of the system.
1. Dr. Asimov wrote over 400 books during his lifetime.
The fact that the saw blade has width and that the act of sawing creates a
kerf in the wood wider than the saw blade itself, constitutes the entropy of this
system. And no matter how thin you make the saw blade, the fact that it has
width will limit the number of cuts that can be made. Even a laser beam has
width. This is a rather simple example, but you can see that the real world is not
the same as the theoretical one that scientists and some engineers live in. Nothing
is perfect.
While it is the engineer’s job to minimize the entropy of a system during design,
the mechanic’s job is to combat the natural, continual increase in the entropy
of the system during its operational lifetime.
To summarize, it is the engineer’s responsibility to design the system with as
high a degree of perfection (low entropy) as possible within reasonable limits.
The mechanic’s responsibility is to remove and replace parts, troubleshoot sys-
tems, isolate faults in systems by following the fault isolation manual (FIM, dis-
cussed in Chap. 4), and restore systems for their intended use.
Figure 1-1 The difference between theory and practice. (Perfection, up to 100 percent, plotted against time; the attainable level of perfection falls below 100 percent, the gap being the entropy.)
Figure 1-2 Restoration of system perfection. (Perfection, or reliability, plotted against time: the designed-in level of perfection sits below 100 percent, the gap being the entropy; curve segments a, b, and c show the effects of scheduled maintenance, performed at the marked point, and unscheduled maintenance.)
Reliability
The level of perfection we have been talking about can also be referred to as the
reliability of the system. The designed-in level of perfection is known as the
inherent reliability of that system. This is as good as the system gets during real
world operation. No amount of maintenance can increase system reliability any
higher than this inherent level. However, it is desirable for the operator to
maintain this level of reliability (or this level of perfection) at all times. We will
discuss reliability and maintenance in more detail in Chap. 19. But there is one
more important point to cover—redesign of the equipment.
Redesign
Figure 1-3 shows the original curve of our theoretical system, curve A. The
dashed line shows the system’s original level of perfection. Our system, however,
has now been redesigned to a higher level of perfection; that is, a higher level
of reliability with a corresponding decrease in total entropy. During this
redesign, new components, new materials, or new techniques may have been
used to reduce the natural entropy of the system. In some cases, a reduction in
man-made entropy may result because the designer applied tighter tolerances,
attained improved design skills, or changed the design philosophy.

Figure 1-3 Effects of redesign on system reliability. (Reliability, up to 100 percent, plotted against time: curve A is the original system at its designed-in level of reliability; curves B, C, and D show improved reliability, i.e., reduced entropy, due to redesign for the three cases discussed in the text; the point at which scheduled maintenance is done is marked.)
Although the designers have reduced the entropy of the system, the system
will still deteriorate. It is quite possible that the rate of deterioration will change
from the original design depending upon numerous factors; thus, the slope of
the curve may increase, decrease, or stay the same. Whichever is the case, the
maintenance requirements of the system could be affected in some way.
If the decay is steeper, as in (B) in Fig. 1-3, the point at which preventive main-
tenance needs to be performed might occur sooner, and the interval between sub-
sequent actions would be shorter. The result is that maintenance will be needed
more often. In this case, the inherent reliability is increased, but more mainte-
nance is required to maintain that level of reliability (level of perfection). Unless
the performance characteristics of the system have been improved, this redesign
may not be acceptable. A decision must be made to determine if the perform-
ance improvement justifies more maintenance and thus an increase in main-
tenance costs.
Conversely, if the decay rate is the same as before, as shown in curve C of
Fig. 1-3, or less steep, as shown in curve D, then the maintenance interval would
be increased and the overall amount of preventive maintenance might be
reduced. The question to be considered, then, is this: Does the reduction of
maintenance justify the cost of the redesign? This question, of course, is a matter
for the designers to ponder, not the maintenance people.
One of the major factors in redesign is cost. Figure 1-4 shows the graphs of
two familiar and opposing relationships. The upper curve is logarithmic. It rep-
resents the increasing perfection attained with more sophisticated design efforts.
Figure 1-4 (Perfection, up to 100 percent, plotted against design effort: the increase in perfection is logarithmic, while the increase in cost is exponential.)
The closer we get to perfection (top of the illustration) the harder it is to make
a substantial increase. (We will never get to 100 percent.) The lower curve
depicts the cost of those ongoing efforts to improve the system. This, unfortu-
nately, is an exponential curve. The more we try to approach perfection, the more
it is going to cost us. It is obvious, then, that the designers are limited in their
goal of perfection, not just by entropy but also by costs. The combination of
these two limitations is basically responsible for our profession of maintenance.
2. Nowlan, F. Stanley, and Howard F. Heap, Reliability-Centered Maintenance, National Technical Information Service, Washington, DC, 1978.
reasons: poor design, improper parts, or incorrect usage. Once the bugs are worked
out and the equipment settles into its pattern, the failure rate levels off or rises
only slightly over time. That is, until the later stages of the component’s life. The
rapid rise shown in curve A near the end of its life is an indication of wear out.
The physical limit of the component’s materials has been reached.
Curve B exhibits no infant mortality but shows a level, or slightly rising fail-
ure rate characteristic throughout the component’s life until a definite wear-out
period is exhibited toward the end.
Curve C depicts components with a slightly increasing failure rate with no
infant mortality and no discernible wear-out period, but at some point, it
becomes unusable.
Curve D shows a low failure rate when new (or just out of the shop), which
rises to some steady level and holds throughout most of the component’s life.
Curve E is an ideal component: no infant mortality and no wear-out period,
just steady (or slightly rising) failure rate throughout its life.
Curve F shows components with an infant mortality followed by a level or
slightly rising failure rate and no wear-out period.
The United Airlines study showed that only about 11 percent of the items
included in the experiment (those shown in curves A, B, and C of Table 1-1)
would benefit from setting operating limits or from applying a repeated check
of wear conditions. The other 89 percent would not. Thus, time of failure or
deterioration beyond useful levels could be predicted on only 11 percent of the
items (curves A, B, and C of Table 1-1). The other 89 percent (depicted by
curves D, E, and F of Table 1-1) would require some other approach. The
implication of this variation is that the components with definite life limits
and/or wear-out periods will benefit from scheduled maintenance. They will
not all come due for maintenance or replacement at the same time, however,
but they can be scheduled; and the required maintenance activity can be
spread out over the available time, thus avoiding peaks and valleys in the
workload. The other 89 percent, unfortunately, will have to be operated to fail-
ure before replacement or repair is done. This, being unpredictable, would
result in the need for maintenance at odd times and at various intervals; i.e.,
unscheduled maintenance.
These characteristics of failure make it necessary to approach maintenance
in a systematic manner, to reduce peak periods of unscheduled maintenance.
The industry has taken this into consideration and has employed several tech-
niques in the design and manufacturing of aircraft and systems to accommo-
date the problem. These are discussed in the next section.
These flight crew personnel also determine how long (1, 3, 10, or 30 days) they
can tolerate this condition. Although this is determined in general terms prior
to delivering the airplane, the flight crew on board makes the final decision based
on actual conditions at the time of dispatch. The pilot in command (PIC) can,
based on existing circumstances, decide not to dispatch until repairs are made
or can elect to defer maintenance per the airline’s MEL. Maintenance must
abide by that decision.
Associated with the MEL is a dispatch deviation guide (DDG) that contains
instructions for the line maintenance crew when the deviation requires some
maintenance action that is not necessarily obvious to the mechanic. A dispatch
deviation guide is published by the airplane manufacturer to instruct the
mechanic on these deviations. The DDG contains such information as tying up
cables and capping connectors from removed units, opening and placarding cir-
cuit breakers to prevent inadvertent power-up of certain equipment during
flight, and any other maintenance action that needs to be taken for precau-
tionary reasons. Similar to the MEL is a configuration deviation list (CDL).
This list provides information on dispatch of the airplane in the event that cer-
tain panels are missing or when other configuration differences not affecting
safety are noted. The nonessential equipment and furnishings (NEF) items list
contains the most commonly deferred items that do not affect the airworthiness
or safety of flight of the aircraft. This list is also a part of the MEL system.
Although failures on these complex aircraft can occur at random and can
come at inopportune times, these three management actions—redundancy of
design, line replaceable units, and minimum dispatch requirements—can help
to smooth out the workload and reduce service interruptions.
Chapter 18
Reliability
Introduction
Reliability equals consistency. It can be defined as the probability that an item
will perform a required function, under specified conditions and without failure,
for a specified amount of time according to its intended design. The reliability
program is a valuable means of achieving better operational performance in an
aircraft maintenance environment, and it is designed to decrease maintenance-
related issues and increase flight safety. The intent of this program is to deal
systematically with problems as they arise instead of trying to cure immediate
symptoms. Each operator normally customizes the program to reflect its own
specific operational requirements. Although the word
reliability has many meanings, in this book we will define the terms that have
specialized meanings to aviation maintenance and engineering. In the case of
reliability, we first must discuss one important difference in the application of
the term.
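The definition above can be made concrete under the common constant-failure-rate (exponential) assumption. This chapter does not itself invoke that model, so the sketch below is illustrative only, and the failure-rate figure is hypothetical:

```python
import math

def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability of performing without failure for `hours`,
    assuming a constant failure rate (exponential model)."""
    return math.exp(-failure_rate_per_hour * hours)

# Hypothetical component with 0.0005 failures per flight hour:
r = reliability(0.0005, 100)   # roughly 0.95 for a 100-hour period
```

The exponential model is only one way to formalize "probability of performing without failure for a specified time"; real components may follow the wear-out or infant-mortality patterns discussed in Chap. 1.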
There are two main approaches to the concept of reliability in the aviation
industry. One looks essentially at the whole airline operation or the M&E oper-
ation within the whole, and the other looks at the maintenance program in par-
ticular. There is nothing wrong with either of these approaches, but they differ
somewhat, and that difference must be understood.
The first approach is to look at the overall airline reliability. This is measured
essentially by dispatch reliability; that is, by how often the airline achieves an
on-time departure1 of its scheduled flights. Airlines using this approach track
delays. Reasons for the delay are categorized as maintenance, flight operations,
air traffic control (ATC), etc. and are logged accordingly. The M&E organization
is concerned only with those delays caused by maintenance.
1. On-time departure means that the aircraft has been “pushed back” from the gate within 15 minutes of the scheduled departure time.
Very often, airlines using this approach to reliability overlook any mainte-
nance problems (personnel or equipment related) that do not cause delays, and
they track and investigate only those problems that do cause delays. This is only
partially effective in establishing a good maintenance program.
The second approach (which we should actually call the primary approach)
is to consider reliability as a program specifically designed to address the prob-
lems of maintenance—whether or not they cause delays—and provide analysis
of and corrective actions for those items to improve the overall reliability of the
equipment. This contributes to the dispatch reliability, as well as to the overall
operation.
We are not going to overlook the dispatch reliability, however. This is a dis-
tinct part of the reliability program we discuss in the following pages. But we
must make the distinction and understand the difference. We must also real-
ize that not all delays are caused by maintenance or equipment even though
maintenance is the center of attention during such a delay. Nor can we only
investigate equipment, maintenance procedures, or personnel for those dis-
crepancies that have caused a delay. As you will see through later discussions,
dispatch reliability is a subset of overall reliability.
Types of Reliability
The term reliability can be used in various respects. You can talk about the over-
all reliability of an airline’s activity, the reliability of a component or system, or
even the reliability of a process, function, or person. Here, however, we will dis-
cuss reliability in reference to the maintenance program specifically.
There are four types of reliability one can talk about related to the mainte-
nance activity. They are (a) statistical reliability, (b) historical reliability, (c) event-
oriented reliability, and (d) dispatch reliability. Although dispatch reliability is
a special case of event-oriented reliability, we will discuss it separately due to
its significance.
Statistical reliability
Statistical reliability is based upon collection and analysis of failure, removal,
and repair rates of systems or components. From this point on, we will refer to
these various types of maintenance actions as “events.” Event rates are calcu-
lated on the basis of events per 1000 flight hours or events per 100 flight cycles.
This normalizes the parameter for the purpose of analysis. Other rates may be
used as appropriate.
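The normalization just described can be sketched as follows; the function name and the fleet numbers are illustrative, not taken from the text:

```python
def event_rate(events: int, flight_hours: float, per: float = 1000.0) -> float:
    """Normalize a raw event count to events per `per` flight hours.
    Use per=100.0 with flight cycles for a per-100-cycles rate."""
    return events * per / flight_hours

# Hypothetical fleet: 19 component removals in 25,000 flight hours
rate = event_rate(19, 25_000)   # 0.76 removals per 1000 flight hours
```

Normalizing this way lets fleets of different sizes and utilizations be compared on the same scale.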
Many airlines use statistical analysis, but some give the statistics more
credence than they deserve. For example, airlines with 10 or more aircraft
tend to use the statistical approach, yet most teachers and books on statistics
tell us that statistical calculations are not very significant for any data set
with fewer than about 30 data points. Another case of improper use of statistics
was presented at an aviation industry seminar on reliability.
The airline representative used this as an example of why his airline was going
to stop using statistical reliability. Here is his example.
We use weather radar only two months of the year. When we calculate the mean
value of failure rates and the alert level in the conventional manner [discussed in
detail later in this chapter] we find that we are always on alert. This, of course, is
not true.
The gentleman was correct in defining an error in this method, and he was
correct in determining that—at least in this one case—statistics was not a valid
approach. Figure 18-1 shows why.
The top curve in Fig. 18-1 shows the two data points for data collected when
the equipment was in service. It also shows 10 zero data points for those months
when the equipment was not used and no data were collected (12-month column).
These zeros are not valid statistical data points. They do not represent zero fail-
ures; they represent “no data” and therefore should not be used in the calcula-
tion. Using these data, however, has generated a mean value (lower, dashed line)
of 4.8 and an alert level at two standard deviations above the mean (upper, solid
line) of 27.6.
One thing to understand about mathematics is that the formulas will work,
will produce numerical answers, whether or not the input data are correct.
Garbage in, garbage out. The point is, you only have two valid data points here
shown in the bottom curve of Fig. 18-1 (2-month data). The only meaningful sta-
tistic here is the average of the two numbers, 29 (dashed line). One can calcu-
late a standard deviation (SD) here using the appropriate formula or a calculator,
but the parameter has no meaning for just two data points. The alert level set
by using this calculation is 37.5 (solid line). For this particular example, statistical
reliability is not usable, but historical reliability is quite useful. We will
discuss that subject in the next section.

Figure 18-1 Weather radar failure-rate data, calculated over all 12 months and over the 2 in-service months:

               12-month   2-month
    Aug             0         —
    Sep            26        26
    Oct            32        32
    Nov             0         —
    Dec             0         —
    Sum            58        58
    n              12         2
    Avg.          4.8      29.0
    Std. Dev.    11.4       4.2

(The accompanying chart plots the monthly failure rate for months 1 through 12, with the mean value and alert level lines drawn for each calculation.)
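The alert-level arithmetic described above can be reproduced in a few lines; this is a hedged sketch using Python’s statistics module, not the airline’s own tooling:

```python
import statistics

def alert_level(rates, k=2.0):
    """Mean, sample standard deviation, and an alert level set at
    k standard deviations above the mean."""
    mean = statistics.mean(rates)
    sd = statistics.stdev(rates)  # sample standard deviation
    return mean, sd, mean + k * sd

# Treating the 10 out-of-service months as zero-failure data points:
m12, sd12, alert12 = alert_level([0] * 10 + [26, 32])
# mean ~4.8, SD ~11.4, alert ~27.6: both real observations (26 and 32)
# sit above the alert level, so the fleet looks permanently "on alert."

# Using only the two valid in-service data points:
m2, sd2, alert2 = alert_level([26, 32])
# mean 29.0, SD ~4.2, alert ~37.5; but with n = 2 the SD
# (and therefore the alert level) has no statistical meaning.
```

Running both cases side by side shows exactly the "garbage in, garbage out" effect the seminar speaker complained about.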
Historical reliability
Historical reliability is simply a comparison of current event rates with those
of past experience. In the example of Fig. 18-1, the data collected show fleet fail-
ures of 26 and 32 for the 2 months the equipment was in service. Is that good
or bad? Statistics will not tell you but history will. Look at last year’s data for
the same equipment, same time period. Use the previous year’s data also, if
available. If current rates compare favorably with past experience, then every-
thing is okay; if there is a significant difference in the data from one year to the
next, that would be an indication of a possible problem. That is what a relia-
bility program is all about: detecting and subsequently resolving problems.
Historical reliability can be used in other instances, also. The most common one
is when new equipment is being introduced (components, systems, engines, air-
craft) and there is no previous data available on event rates, no information on
what sort of rates to expect. What is “normal” and what constitutes “a problem”
for this equipment? In historical reliability we merely collect the appropriate
data and literally “watch what happens.” When sufficient data are collected to
determine the “norms,” the equipment can be added to the statistical reliability
program.
Historical reliability can also be used by airlines wishing to establish a sta-
tistically based program. Data on event rates kept for 2 or 3 years can be tal-
lied or plotted graphically and analyzed to determine what the normal or
acceptable rates would be (assuming no significant problems were incurred).
Guidelines can then be established for use during the next year. This will be cov-
ered in more detail in the reliability program section below.
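A minimal sketch of such a historical comparison follows; the tolerance band and the prior-year figures are assumptions for illustration, not values from the text:

```python
def historical_check(current_rate: float, past_rates: list[float],
                     band: float = 0.25) -> tuple[float, bool]:
    """Compare the current event rate with past experience; flag it when
    it exceeds the historical average by more than `band` (a fraction)."""
    baseline = sum(past_rates) / len(past_rates)
    return baseline, current_rate > baseline * (1.0 + band)

# Hypothetical: this October's failure count vs. the same month in prior years
baseline, flagged = historical_check(32, [26, 28])
# baseline 27.0; 32 falls within the assumed 25 percent band, so not flagged
```

Once a few years of such data establish the norms, the same equipment can graduate to the statistical program described above.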
Event-oriented reliability
Event-oriented reliability is concerned with one-time events such as bird strikes,
hard landings, overweight landings, in-flight engine shutdowns, lightning strikes,
ground or flight interruptions, and other accidents or incidents. These are events
that do not occur on a daily basis in airline operations and, therefore, produce
no usable statistical or historical data. Nevertheless, they do occur from time
to time, and each occurrence must be investigated to determine the cause and
to prevent or reduce the possibility of recurrence of the problem.
In ETOPS2 operations, certain events are handled differently from conventional
reliability programs: they rely on historical data and alert levels to determine
whether an investigation is necessary to establish whether a problem can be
reduced or eliminated by changing the maintenance program.
2. Requirements for extended range operations with two-engine airplanes (ETOPS) are outlined in FAA Advisory Circular AC 120-42B and are also discussed in Appendix E of this book.
Events that are related to ETOPS flights are designated by the FAA as actions
to be tracked by an “event-oriented reliability program” in addition to any sta-
tistical or historical reliability program. Not all the events are investigated, but
everything is continually monitored in case a problem arises.
Dispatch reliability
Dispatch reliability is a measure of the overall effectiveness of the airline oper-
ation with respect to on-time departure. It receives considerable attention from
regulatory authorities, as well as from airlines and passengers, but it is really
just a special form of the event-oriented reliability approach. It is a simple
calculation normalized to 100 flights, which makes it convenient to express the
dispatch rate as a percentage. An example of the dispatch rate calculation follows.
If eight delays and cancellations are experienced in 200 flights, that would mean
that there were four delays per 100 flights, or a 4 percent delay rate. A 4 percent
delay rate would translate to a 96 percent dispatch rate (100 percent − 4 percent
delayed = 96 percent dispatched on time). In other words, the airline dispatched
96 percent of its flights on time.
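The calculation in the example above reduces to a one-liner; a sketch:

```python
def dispatch_rate(delays_and_cancellations: int, flights: int) -> float:
    """On-time dispatch rate in percent: 100 minus the delay rate."""
    return 100.0 - 100.0 * delays_and_cancellations / flights

# 8 delays and cancellations in 200 flights -> 4 percent delay rate
rate = dispatch_rate(8, 200)   # 96.0 percent dispatched on time
```

Note that the count covers delays and cancellations from all causes; charging each one to maintenance, flight operations, ATC, or procedure is the separate investigative step discussed below.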
The use of dispatch reliability at the airlines is, at times, misinterpreted.
The passengers are concerned with timely dispatch for obvious reasons. To
respond to FAA pressures on dispatch rate, airlines often overreact. Some air-
line maintenance reliability programs track only dispatch reliability; that is,
they only track and investigate problems that resulted in a delay or a cancel-
lation of a flight. But this is only part of an effective program, and dispatch
reliability involves more than just maintenance. An example will bear this out.
The aircraft pilot in command is 2 hours from his arrival station when he expe-
riences a problem with the rudder controls. He writes up the problem in the air-
craft logbook and reports it by radio to the flight following unit at the base. Upon
arrival at the base, the maintenance crew meets the plane and checks the log
for discrepancies. They find the rudder control write-up and begin trou-
bleshooting and repair actions. The repair takes a little longer than the sched-
uled turnaround time and, therefore, causes a delay. Since maintenance is at
work and the rudder is the problem, the delay is charged to maintenance and
the rudder system would be investigated for the cause of the delay.
This is an improper response. Did maintenance cause the delay? Did the
rudder equipment cause the delay? Or was the delay caused by poor airline pro-
cedures? To put it another way: could a change of airline procedures eliminate
the delay? Let us consider the events as they happened and how we might
change them for the better.
If the pilot and the flight operations organization knew about the problem
2 hours before landing, why wasn’t maintenance informed at the same time? If
they had been informed, they could have spent the time prior to landing in
studying the problem and performing some troubleshooting analysis. It is quite
possible, then, that when the airplane landed, maintenance could have met it
with a fix in hand. Thus, this delay could have been prevented by procedural
changes. The procedure should be changed to avoid such delays in the future.
While the maintenance organization and the airline could benefit from this
advance warning of problems, it will not always eliminate delays. The impor-
tant thing to remember is that if a delay is caused by procedure, it should be
attributed to procedure and it should be avoided in the future by altering the
procedure. That is what a reliability program is about: detecting where the
problems are and correcting them, regardless of who or what is to blame.
Another fallacy in overemphasizing dispatch delay is that some airlines will
investigate each delay (as they should), but if an equipment problem is involved,
the investigation may or may not take into account other similar failures that
did not cause delays. For example, if you had 12 write-ups of rudder problems
during the month and only one of these caused a delay, you actually have two
problems to investigate: (a) the delay, which could be caused by problems other
than the rudder equipment and (b) the 12 rudder write-ups that may, in fact,
be related to an underlying maintenance problem. One must understand that
dispatch delay constitutes one problem and the rudder system malfunction
constitutes another. They may indeed overlap but they are two different prob-
lems. The delay is an event-oriented reliability problem that must be investi-
gated on its own; the 12 rudder problems (if this constitutes a high failure
rate) should be addressed by the statistical (or historical) reliability program.
The investigation of the dispatch delays should look at the whole operation.
Equipment problems—whether or not they caused delays—should be investi-
gated separately.
A Reliability Program
A reliability program for our purposes is, essentially, a set of rules and practices
for managing and controlling a maintenance program. The main function of a reli-
ability program is to monitor the performance of the vehicles and their associated
equipment and call attention to any need for corrective action. The program has
two additional functions: (a) to monitor the effectiveness of those corrective actions
and (b) to provide data to justify adjusting the maintenance intervals or mainte-
nance program procedures whenever those actions are appropriate.
Data collection
We will list 10 data types that can be collected, although they may not necessarily
be collected by all airlines. Other items may be added at the airline’s discretion.
The data collection process gives the reliability department the information
needed to observe the effectiveness of the maintenance program. Those items that
are doing well might be eliminated from the program simply because the data
show that there are no problems. On the other hand, items not being tracked may
need to be added to the program because there are serious problems related to
those systems. Basically, you collect the data needed to stay on top of your oper-
ation. The data types normally collected are as follows:
Flight time and flight cycles. Most reliability calculations are “rates” and are
based on flight hours or flight cycles; e.g., 0.76 failures per 1000 flight hours or
0.15 removals per 100 flight cycles.
Cancellations and delays over 15 minutes. Some operators collect data on all such
events, but maintenance is concerned primarily with those that are maintenance
related. The 15-minute time frame is used because that amount of time can usu-
ally be made up in flight. Longer delays may cause schedule interruptions or
missed connections, thus the need for rebookings. This parameter is usually con-
verted to a “dispatch rate” for the airline as discussed above.
Unscheduled component removals. This is the unscheduled maintenance
mentioned earlier and is definitely a concern of the reliability program. The rate
at which aircraft components are removed may vary widely depending on the
equipment or system involved. If the rate is not acceptable, an investigation
should be made and some sort of corrective action must be taken. Components
that are removed and replaced on schedule—e.g., HT items and certain OC
items—are not included here, but these data may be collected to aid in justify-
ing a change in the HT or OC interval schedule.
Unscheduled removals of engines. This is the same as component removals, but
obviously an engine removal constitutes a considerable amount of time and
manpower; therefore, these data are tallied separately.
In-flight shutdown (IFSD) of engines. This malfunction is probably one of the most
serious in aviation, particularly if the airplane only has two engines (or one).
The FAA requires a report of IFSD within 72 hours.3 The report must include
the cause and the corrective action. The ETOPS operators are required to track
IFSDs and respond to excessive rates as part of their authorization to fly ETOPS.
However, non-ETOPS operators also have to report shutdowns and should also
be tracking and responding to high rates through the reliability program.
Pilot reports or logbook write-ups. These are malfunctions or degradations in
airplane systems noted by the flight crew during flight. Tracking is usually by
ATA Chapter numbers using two, four, or six digits. This allows pinpointing of
the problems to the system, subsystem, or component level as desired.
Experience will dictate what levels to track for specific equipment.
Cabin logbook write-ups. These discrepancies may not be as serious as those
the flight crew deals with, but passenger comfort and the ability of the cabin
crew to perform their duties may be affected. These items may include the cabin
safety inspection and operational checks of cabin emergency lights, first aid
kits, and fire extinguishers. If any abnormality is found, these items are
written up by the flight crew in the maintenance logbook as discrepancy items.
Component failures. Any problems found during shop maintenance visits are
tallied for the reliability program. This refers to major components within the
black boxes (avionics) or parts and components within mechanical systems.
Maintenance check package findings. Systems or components found to be in need
of repair or adjustment during normal scheduled maintenance checks (non-
routine items) are tracked by the reliability program.
Critical failures. Failures involving a loss of function or secondary damage that
could have a direct adverse effect on operating safety.
3. See Federal Aviation Regulation 121.703, Mechanical Reliability Report.
[Figure 18-2: Monthly event rates from Jan-99 through Mar-01 plotted against the mean value and the UCL (mean + 2 SD); the final months are the offset points carried into the next data year.]
[Figure 18-3: Event rates for one data year (Jan through Dec) plotted against the mean and the alert level, with the 3-month rolling average shown as a dashed line.]
In Fig. 18-3, it is easy to see the pattern as we look at the year's events. But
in reality, you will see only 1 month at a time plus the preceding months;
information on what is going to happen the next month is not available to you.
When the event rate goes above the alert level (as in February), it is not nec-
essarily a serious matter. But if the rate stays above the alert level for 2 months
in succession, then it may warrant an investigation. The preliminary investi-
gation may indicate a seasonal variation or some other one-time cause, or it may
suggest the need for a more detailed investigation. More often than not, it can
be taken for what it was intended to be—an “alert” to a possible problem. The
response would be to wait and see what happens next month. In Fig. 18-3, the
data show that, in the following month (March) the rate went below the line;
thus, no real problem exists. In other words, when the event rate penetrates the
alert level, it is not an indication of a problem; it is merely an “alert” to the pos-
sibility of a problem. Reacting too quickly usually results in unnecessary time
and effort spent in investigation. This is what we call a “false alert.”
If experience shows that the event rate for a given item varies widely from
month to month above and below the UCL as in Fig. 18-3—and this is common
for some equipment—many operators use a 3-month rolling average. This is
shown as the dashed line in Fig. 18-3. For the first month of the new data year,
the 3-month average is determined by using the offset data points in Fig. 18-2.
(Actually, only 2 months offset is needed, but we like to keep things on a quar-
terly basis.) The purpose for the offset is to ensure that the plotted data for the
new year do not contain any data points that were used to determine the mean
and alert levels we use for comparison.
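The offset bookkeeping described above might be sketched as follows. The monthly rates are invented; the two points carried over from the prior data year seed the first rolling averages of the new year:

```python
# 3-month rolling average using a 2-month offset from the prior data year.
# All monthly event rates are hypothetical.

offset = [0.42, 0.47]                    # Nov and Dec of the prior data year
new_year = [0.39, 0.55, 0.41, 0.36]      # Jan through Apr of the new data year

series = offset + new_year
rolling = []
for i in range(len(offset), len(series)):
    window = series[i - 2 : i + 1]       # current month plus the two preceding
    rolling.append(sum(window) / 3)

for month, avg in zip(["Jan", "Feb", "Mar", "Apr"], rolling):
    print(f"{month}: 3-month average = {avg:.3f}")
```

Only the two carried-over months are strictly needed for the first averages; as the text notes, operators may keep a full quarter's offset for convenience.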
While the event rate swings above and below the alert level, the 3-month
rolling average (dashed line) stays below it—until October. This condition—
event rate and 3-month average above the UCL—indicates a need to watch the
activity more closely. In this example, the event rate went back down below the
UCL in November, but the 3-month average stayed above the alert level. This
is an indication that the problem should be investigated.
[Figure 18-4: Two data sets, (A) and (B), plotted about nearly equal means; the points in (A) are widely scattered while those in (B) cluster tightly.]
Figure 18-4 shows the difference between two data sets. The data points in
(A) are widely scattered or distributed about the mean while those in (B) are all very
close together around the mean. Note that the averages of these two data sets are
nearly equal but the standard deviations are quite different. Figure 18-5 shows the
bell-shaped distribution curve. One, two, and three standard deviations in each
case are shown on the graph. You can see here that, at one SD only 68 percent of
the valid failure rates are included. At two standard deviations above the mean,
you still have not included all the points in the distribution. In fact, two stan-
dard deviations above and below the mean encompass only 95.5 percent of the
points under the curve; i.e., just over 95 percent of the valid failure rates. This
is why we do not consider an event rate in this range a definite problem. If it
remains above this level in the following month it may suggest a possible problem.
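The mean, standard deviation, and alert level (UCL = mean + 2 SD) can be computed with the Python standard library; the 12 monthly rates below are hypothetical:

```python
import statistics

# Hypothetical monthly event rates (per 1000 flight hours) for one data year.
rates = [0.31, 0.44, 0.38, 0.52, 0.29, 0.41,
         0.36, 0.48, 0.33, 0.45, 0.39, 0.42]

mean = statistics.mean(rates)
sd = statistics.pstdev(rates)   # population standard deviation
ucl = mean + 2 * sd             # alert level at two standard deviations

print(f"mean = {mean:.3f}, SD = {sd:.3f}, UCL = {ucl:.3f}")
```

Because roughly 95.5 percent of normally distributed rates fall within two SDs of the mean, a single excursion above this UCL is treated as an alert rather than a confirmed problem.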
[Figure 18-5 Standard bell-shaped curve, showing bands at ±1σ, ±2σ, and ±3σ containing 68.26, 95.46, and 99.73 percent of the area under the curve. (Source: The Standard Handbook for Aeronautical and Astronautical Engineers, New York, NY: McGraw-Hill, 2003.)]
On the other hand, if the event rate data you are working with had a small stan-
dard deviation, it would be difficult to distinguish between two and three SDs.
In this case, the alert level should be set at three SDs.
This alert level system can be overdone at times. The statistics used are not
exact: we assume that the event rates always follow the bell-shaped distribution
and that our data and calculations are always accurate, but this may not be
true. These alert levels are merely guidelines for identifying what should be
investigated and what can be tolerated. Use of the alert level is not rocket
science, but it helps ease the workload in organizations with large fleets and
small reliability staffs.
Some airlines, using only event rates, will investigate perhaps the 10 highest
rates; but this does not always include the most important or the most signifi-
cant equipment problems. The alert level approach allows you to prioritize these
problems and work the most important ones first.
Data display
Several methods for displaying data are utilized by the reliability department to
study and analyze the data they collect. Most operators have personal computers
available so that data can easily be displayed in tabular and graphical forms. The
data are presented as events per 100 or 1000 flight hours or flight cycles. Some,
such as delays and cancellations, are presented as events per 100 departures. The
value of 100 allows easy translation of the rate into a percentage.
Tabular data allow the operator to compare event rates with other data on
the same sheet. It also allows the comparison of quarterly or yearly data (see
Table 18-1). Graphs, on the other hand, allow the operator to view the month-
to-month performance and note, more readily, those items that show increasing
TABLE 18-1 Pilot Reports per 100 Landings (by ATA Chapter)
NOTE: Alert status codes: CL = clear from alert; YE = yellow alert; AL = red alert; RA = remains in alert; WA = watch.
rates and appear to be heading for alert status (see Fig. 18-3). This is a great
help in analysis. Some of the data collected may be compared on a monthly basis,
by event, or by sampling.
Table 18-1 is a listing of pilot reports (PIREPS) or maintenance logbook
entries recorded by a typical airline for 1 month of operation for a fleet of air-
craft. The numbers are examples only and do not represent any particular oper-
ator, aircraft, or fleet size. For these data, a tally is kept by ATA Chapter, and
event rates are calculated as PIREPS per 100 landings. The chart shows data
for the current month (August 99) and the two previous months along with the
3-month rolling average. The alert level or UCL and the mean value of event
rate, calculated as discussed in the text, are also included. Seven of these ATA
Chapters have alert indications noted in the last column.
Chapter 21 has had an event rate above the UCL for 2 months running (July,
August); therefore, this represents a yellow alert (YE). Depending on the sever-
ity of the problem, this may or may not require an immediate investigation.
Chapter 24, however, is different. In June the event rate was high, 1.15. If this
were the first time for such a rate, it would have been listed in the report for
that month as a watch (WA). The rate went down in July but has gone up again
in August. In the current report, then, it is a full alert condition: it is not only
above the alert level, it has been above it in 2 of the last 3 months, and it
appears somewhat erratic. It is left as an exercise for the student to analyze
the other alert
status items. What about ATA Chapter 38?
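One plausible way to mechanize the status codes from Table 18-1 is sketched below. These rules are inferred from the discussion of Chapters 21 and 24 (WA for a first excursion, YE for two consecutive months above the UCL, AL for 2 of the last 3 months above) and are not the book's official definitions:

```python
def alert_status(last3_above):
    """Classify the current month from the last 3 months of UCL excursions.

    last3_above: three booleans, oldest first; True where the event rate
    exceeded the alert level (UCL). These rules are a plausible reading of
    the text, not an official definition; RA ("remains in alert") would
    require tracking prior alert state and is omitted here.
    """
    m1, m2, m3 = last3_above
    if not m3:
        return "CL"      # below the alert level this month: clear
    if m2:
        return "YE"      # two consecutive months above: yellow alert
    if m1:
        return "AL"      # above in 2 of the last 3 months, erratic: red alert
    return "WA"          # first excursion above the UCL: watch

# ATA 21 pattern (July and August above) and ATA 24 pattern (June and
# August above, July below), per the discussion in the text:
print(alert_status((False, True, True)))   # YE
print(alert_status((True, False, True)))   # AL
```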
Data analysis
Whenever an item goes into alert status, the reliability department does a pre-
liminary analysis to determine if the alert is valid. If it is valid, a notice of the
on-alert condition is sent to engineering for a more detailed analysis. The engi-
neering department is made up of experienced people who know maintenance
and engineering. Their job relative to these alerts is to troubleshoot the prob-
lem, determine the required action that will correct the problem, and issue an
engineering order (EO) or other official paperwork that will put this solution in
place.
At first, this may seem like a job for maintenance. After all, troubleshooting
and corrective action is their job. But we must stick with our basic philosophy
from Chap. 7 of separating the inspectors from the inspected. Engineering can
provide an analysis of the problem that is free of any unit bias and is free
to look at all possibilities. A unit looking into its own processes, procedures, and
personnel may not be so objective. The engineering department should provide
analysis and corrective action recommendations to the airline Maintenance
Program Review Board (discussed later) for approval and initiation.
Note: Appendix C discusses the troubleshooting process that applies to engi-
neers as well as mechanics; and Appendix D outlines additional procedures for
reliability and engineering alert analysis efforts.
Corrective action
Corrective actions can vary from one-time efforts correcting a deficiency in a pro-
cedure to the retraining of mechanics to changes in the basic maintenance pro-
gram. The investigation of these alert conditions commonly results in one or
more of the following actions: (a) modifications of equipment; (b) change in or
correction to line, hangar, or shop processes or practices; (c) disposal of defec-
tive parts (or their suppliers); (d) training of mechanics (refresher or upgrade);
(e) addition of maintenance tasks to the program; or (f) decreases in maintenance
intervals for certain tasks. Engineering then produces an engineering order for
implementation of whatever action is applicable. Engineering also tracks the
progress of the order and offers assistance as needed. Completion of the cor-
rective action is noted in the monthly reliability report (discussed later).
Continual monitoring by reliability determines the effectiveness of the selected
corrective action.
Corrective actions should be completed within 1 month of issuance of the EO.
Completion may be deferred if circumstances warrant, but action should be
completed as soon as possible to make the program effective. Normally, the
Maintenance Program Review Board (MPRB) will require justification in writ-
ing for extensions of this period; the deferral, and the reason for deferral, will
be noted in the monthly report.
Follow-up analysis
The reliability department should follow up on all actions taken relative to
on-alert items to verify that the corrective action taken was indeed effective.
This should be reflected in decreased event rates. If the event rate does not
improve after action has been taken, the alert is reissued and the investiga-
tion and corrective action process is repeated, with engineering taking a dif-
ferent approach to the problem. If the corrective action involves lengthy
modifications to numerous vehicles, the reduction in the event rate may not
be noticeable for some time. In these cases, it is important to continue mon-
itoring the progress of the corrective action in the monthly report along with
the ongoing event rate until corrective action is completed on all vehicles.
Then follow-up observation is employed to judge the effectiveness (wisdom)
of the action. If no significant change is noted in the rates within a reason-
able time after a portion of the fleet has been completed, the problem and the
corrective action should be reanalyzed.
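The effectiveness check described above can be sketched as a simple before/after comparison of mean event rates (the rates are invented for illustration):

```python
import statistics

# Hypothetical monthly event rates before and after the corrective action
# (EO) was completed across the fleet.
before = [0.58, 0.61, 0.55, 0.63]   # months preceding the corrective action
after = [0.41, 0.38, 0.44]          # months after fleet completion

improved = statistics.mean(after) < statistics.mean(before)
print("corrective action appears effective" if improved
      else "reissue the alert and reanalyze")
```

A real program would also check whether the drop exceeds normal month-to-month scatter (e.g., against the standard deviation) before declaring success.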
Data reporting
A reliability report is issued monthly. Some organizations issue quarterly and
yearly reports in summary format. The most useful report, however, is the
monthly. This report should not contain an excessive amount of data and graphs
without a good explanation of what this information means to the airline and
to the reader of the report. The report should concentrate on the items that have
just gone on alert, those items under investigation, and those items that are in
or have completed the corrective action process. The progress of any items that
are still being analyzed or implemented will also be noted in the report, show-
ing status of the action and percent of fleet completed if applicable. These items
should remain in the monthly report until all action has been completed and the
reliability data show positive results.
Other information, such as a list of alert levels (by ATA Chapter or by item)
and general information on fleet reliability will also be included in the monthly
report. Such items as dispatch rates, reasons for delays and/or cancellations,
flight hours and cycles flown and any significant changes in the operation that
affect the maintenance activity would also be included. The report should be
organized by fleet; that is, each airplane model would be addressed in a sepa-
rate section of the report.
The monthly reliability report is not just a collection of graphs, tables, and
numbers designed to dazzle higher-level management. Nor is it a document
left on the doorstep of others, such as QA or the FAA, to see if they can detect
any problems you might have. This monthly report is a working tool for main-
tenance management. Besides providing operating statistics, such as the
number of aircraft in operation, the number of hours flown, and so forth, it also
provides management with a picture of what problems are encountered (if any)
and what is being done about those problems. It also tracks the progress and
effectiveness of the corrective action. The responsibility for writing the report
rests with the reliability department, not engineering.
The establishment of alert levels and the determination of what data to track
are basic functions of the reliability section. Collecting data is the
responsibility of various M&E organizations,
such as line maintenance (flight hours and cycles, logbook reports, etc.); overhaul
shops (component removals); hangar (check packages); and materiel (parts
usage). Some airlines use a central data collection unit for this, located in M&E
administration, or some other unit such as engineering or reliability. Other air-
lines have provisions for the source units to provide data to the reliability depart-
ment on paper or through the airline computer system. In either case, reliability
is responsible for collecting, collating, and displaying these data and performing
the preliminary analysis to determine alert status.
The reliability department analyst, in conjunction with the MCC, keeps a
watchful eye on the aircraft fleet and its systems for any repeat maintenance
discrepancies. The analyst reviews reliability reports and items daily,
including aircraft daily maintenance, time-deferred maintenance items, MEL
items, and other out-of-service events involving any type of repeat mechanical
discrepancy.
The analyst plans a sequence of repair procedures when an aircraft has repeated
a maintenance discrepancy three times or more and routine fixes have been
exhausted. The analyst is normally in contact with the MCC and local aircraft
maintenance management to coordinate a plan of attack with the aircraft
manufacturer's maintenance help desk, ensuring proper tracking and
documentation of the actual maintenance discrepancy and of the corrective
action planned or maintenance performed. This type of communication is needed
for airlines to run a successful maintenance operation and to keep aircraft
maintenance downtime to a minimum. Such coordination most often occurs when a
new type of aircraft is added to the airline's fleet, or when maintenance needs
help fixing a recurring problem.
f. Manager of engineering
g. Manager of reliability
3. Adjunct members are representatives of affected M&E departments
a. Engineering supervisors (by ATA Chapter or specialty)
b. Airplane maintenance (line, hangar)
c. Overhaul shops (avionics, hydraulics, etc.)
d. Production planning and control
e. Materiel
f. Training
The head of MPE is the one who deals directly with the regulatory authority,
so as chairman of the Maintenance Program Review Board, he or she would coor-
dinate any recommended changes requiring regulatory approval.
The MPRB meets monthly to discuss the overall status of the maintenance reli-
ability and to discuss all items that are on alert. The permanent members, or their
designated assistants, attend every meeting; the advisory members attend those
meetings where items that relate to their activities will be discussed. Items
coming into alert status for the recent month are discussed first to determine if
a detailed investigation by engineering is needed. Possible problems and solu-
tions may be offered. If engineering is engaged in or has completed investigation
of certain problems, these will be discussed with the MPRB members. Items that
are currently in work are then discussed to track and analyze their status and
to evaluate the effectiveness of the corrective action. If any ongoing corrective
actions involve long-term implementation, such as modifications to the fleet that
must be done at the “C” check interval, the progress and effectiveness of the cor-
rective action should be studied to determine (if possible) whether or not the
chosen action appears to be effective. If not, a new approach would be discussed
and subsequently implemented by a revision to the original engineering order.
Other activities of the MPRB include the establishment of alert levels and the
adjustment of these levels as necessary for effective management of problems.
The rules governing the reliability program are developed with approval by the
MPRB. Rules relating to the change of maintenance intervals, alert levels, and
all other actions addressed by the program must be approved by the MPRB. The
corrective actions and the subsequent EOs developed by the engineering depart-
ment are also approved by the MPRB before they are issued.
The air carrier may use this AC's provisions, along with its own or other
maintenance information, to standardize, develop, implement, and update the
FAA-approved minimum schedule of maintenance and/or inspection requirements,
which becomes a final written report for each type certificate holder.
The MRB revision issued by the manufacturer is sent to the fleet mainte-
nance manager (FMM) or dedicated maintenance person assigned by the air
carrier. In some cases, this is the director of maintenance (DOM). The FMM/
DOM interfaces with the aircraft maintenance and production department to
advise them about the MRB program updates and revisions. The air carrier nor-
mally tracks each revision by fleet type to ensure the corrective action plan has
been recommended to bring the maintenance production department into
compliance. The MRB process runs concurrently with the continuous analysis and
surveillance system (CASS) and with reliability-centered maintenance (RCM), and
it is applied using the maintenance steering group MSG-3 system. MSG-3
originated with the Air Transport Association of America (ATA). The ATA coding
system (detailed in Chap. 5) divides the aircraft into distinct ATA units, and
each ATA unit is analyzed for regulatory purposes; the results retrieved from
the system are then passed on to an aviation industry steering group/committee.
After the data have been reviewed by the steering committee and approved by the
regulatory board for the MRB, the results are published as part of the aircraft
maintenance manual.
This document also includes detailed discussion of the data collection, problem
investigation, corrective action implementation, and follow-up actions. It also
includes an explanation of the methods used to determine alert levels; the
rules relative to changing maintenance process (HT, OC, CM), or MPD task
intervals; when to initiate an investigation; definitions of MPRB activities and
responsibilities; and the monthly report format. The document also includes
such administrative elements as responsibility for the document, revision
status, a distribution list, and approval signatures.
The reliability program document is a control document and thus contains
a revision status sheet and a list of effective pages, and it has a limited distri-
bution within the airline. It is usually a separate document but can be included
as part of the TPPM.
FAA interaction
It is customary, in the United States, to invite the FAA to sit in on the MPRB
meetings as a nonvoting member. (They have, in a sense, their own voting
power.) Since each U.S. airline has a principal maintenance inspector (PMI)
assigned and usually on site, it is convenient for the FAA to attend these
meetings. Airlines outside the United States, which may not have an on-site
regulatory representative, may not find it as easy to comply. But the invitation should
be extended nevertheless. This lets the regulatory authority know that the air-
line is attending to its maintenance problems in an orderly and systematic
manner and gives the regulatory people an opportunity to provide any assistance
that may be required.