Reliability Engineering Methods and Applications

Reliability Engineering

Advanced Research in Reliability and System Assurance Engineering


Series Editor: Mangey Ram, Professor, Graphic Era (Deemed to be University),
Dehradun, India

Modeling and Simulation Based Analysis in Reliability Engineering


Edited by Mangey Ram

Reliability Engineering
Theory and Applications
Edited by Ilia Vonta and Mangey Ram

System Reliability Management


Solutions and Technologies
Edited by Adarsh Anand and Mangey Ram

Reliability Engineering
Methods and Applications
Edited by Mangey Ram

For more information about this series, please visit:
https://www.crcpress.com/Reliability-Engineering-Theory-and-Applications/Vonta-Ram/p/book/9780815355175
Reliability Engineering
Methods and Applications

Edited by
Mangey Ram
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2020 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper

International Standard Book Number-13: 978-1-138-59385-5 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize
to copyright holders if permission to publish in this form has not been obtained. If any copyright material
has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the
CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.

Library of Congress Cataloging‑in‑Publication Data

Names: Ram, Mangey, editor.


Title: Reliability engineering : methods and applications / edited by Mangey Ram.
Other titles: Reliability engineering (CRC Press : 2019)
Description: Boca Raton, FL : CRC Press/Taylor & Francis Group, 2018.
Series: Advanced research in reliability and system assurance engineering | Includes
bibliographical references and index.
Identifiers: LCCN 2019023663 (print) | LCCN 2019023664 (ebook) | ISBN
9781138593855 (hardback) | ISBN 9780429488009 (ebook)
Subjects: LCSH: Reliability (Engineering)
Classification: LCC TA169 .R439522 2019 (print) | LCC TA169 (ebook) | DDC
620/.00452--dc23
LC record available at https://lccn.loc.gov/2019023663

Visit the Taylor & Francis Web site at


http://www.taylorandfrancis.com

and the CRC Press Web site at


http://www.crcpress.com
Contents
Preface
Acknowledgments
Editor
Contributors

Chapter 1 Preventive Maintenance Modeling: State of the Art
Sylwia Werbińska-Wojciechowska

Chapter 2 Inspection Maintenance Modeling for Technical Systems: An Overview
Sylwia Werbińska-Wojciechowska

Chapter 3 Application of Stochastic Processes in Degradation Modeling: An Overview
Shah Limon, Ameneh Forouzandeh Shahraki, and Om Prakash Yadav

Chapter 4 Building a Semi-automatic Design for Reliability Survey with Semantic Pattern Recognition
Christian Spreafico and Davide Russo

Chapter 5 Markov Chains and Stochastic Petri Nets for Availability and Reliability Modeling
Paulo Romero Martins Maciel, Jamilson Ramalho Dantas, and Rubens de Souza Matos Júnior

Chapter 6 An Overview of Fault Tree Analysis and Its Application in Dual Purposed Cask Reliability in an Accident Scenario
Maritza Rodriguez Gual, Rogerio Pimenta Morão, Luiz Leite da Silva, Edson Ribeiro, Claudio Cunha Lopes, and Vagner de Oliveira

Chapter 7 An Overview on Failure Rates in Maintenance Policies
Xufeng Zhao and Toshio Nakagawa

Chapter 8 Accelerated Life Tests with Competing Failure Modes: An Overview
Kanchan Jain and Preeti Wanti Srivastava

Chapter 9 European Reliability Standards
Miguel Angel Navas, Carlos Sancho, and Jose Carpio

Chapter 10 Time-Variant Reliability Analysis Methods for Dynamic Structures
Zhonglai Wang and Shui Yu

Chapter 11 Latent Variable Models in Reliability
Laurent Bordes

Chapter 12 Expanded Failure Modes and Effects Analysis: A Different Approach for System Reliability Assessment
Perdomo Ojeda Manuel, Rivero Oliva Jesús, and Salomón Llanes Jesús

Chapter 13 Reliability Assessment and Probabilistic Data Analysis of Vehicle Components and Systems
Zhigang Wei

Chapter 14 Maintenance Policy Analysis of a Marine Power Generating Multi-state System
Thomas Markopoulos and Agapios N. Platis

Chapter 15 Vulnerability Discovery and Patch Modeling: State of the Art
Avinash K. Shrivastava, P. K. Kapur, and Misbah Anjum

Chapter 16 Signature Reliability Evaluations: An Overview of Different Systems
Akshay Kumar, Mangey Ram, and S. B. Singh

Index
Preface
The theory, methods, and applications of reliability analysis have developed significantly over the last 60 years and have been recognized in many publications. Awareness of the importance of each reliability measure of the system and its fields is therefore very important to a reliability specialist.
This book, Reliability Engineering: Methods and Applications, is a collection of different models, methods, and unique approaches for dealing with the different technological aspects of reliability engineering. Earlier approaches and models have been studied in depth to bring out better and more advanced system reliability techniques for the different phases of the working of the components. Scope for future developments and research is also suggested.
The main areas studied are presented chapter by chapter as follows:
Chapter 1 provides the review and analysis of preventive maintenance modeling
issues. The discussed preventive maintenance models are classified into two main
groups for one-unit and multi-unit systems.
Chapter 2 provides a literature review of the most commonly used optimal inspection maintenance models, which apply an appropriate inspection strategy according to the complexity of the system (single- or multi-stage, etc.) and to the requirements of quality, production, minimum costs, and reduced frequency of failures.
Chapter 3 presents the application of stochastic processes in degradation modeling
to assess product/system performances. Among the continuous stochastic processes,
the Wiener, Gamma, and inverse Gaussian processes are discussed and applied for
degradation modeling of engineering systems using accelerated degradation data.
Chapter  4 presents a novel approach for analysis of Failure Modes and Effect
Analysis (FMEA)-related documents through a semi-automatic procedure involving
semantic tools. The aim of this work is reducing the time of analysis and improving
the level of detail of the analysis through the introduction of an increased number of
considered features and relations among them.
Chapter 5 studies the reliability and availability modeling of a system through
Markov chains and stochastic Petri nets.
Chapter 6 discusses the fault tree analysis technique for the calculation of reliability and risk measurement in the transportation of radioactive materials. This study aims at reducing the risk of environmental contamination caused by human errors.
Chapter 7 surveys the failure rate functions of replacement times and of random and periodic replacement models, together with their properties, to support a theoretical understanding of complex maintenance models.
Chapter 8 highlights the design of accelerated life tests with competing failure modes, which give rise to competing risk analysis. This design helps predict product reliability accurately, quickly, and economically.
Chapter 9 presents an analysis, classification, and orientation of content to encourage researchers, organizations, and professionals to use IEC standards as applicable procedures and/or as reference guides. These standards provide methods and mathematical metrics known worldwide.
Chapter  10 discusses the time-variant reliability analysis methods for real-life
dynamic structures under uncertainties and vibratory systems having high nonlinear
performance. These methods satisfy the accuracy requirements by considering the
time correlation.
Chapter 11 presents a few reliability or survival analysis models involving latent variables. The latent variable model considers missing information, heterogeneity of observations, measurement errors, etc.
Chapter 12 highlights the failure mode and effects analysis technique that estimates the system reliability when the components are dependent on each other and there is common cause failure, as in redundant systems, using the logical algorithm.
Chapter 13 provides an overview of the current state-of-the-art reliability assessment approaches, including testing and probabilistic data analysis approaches, for vehicle components and systems and for vehicle exhaust components and systems. The new concepts include a fatigue S-N curve transformation technique and a variable transformation technique in a damage-cycle diagram.
Chapter 14 is an attempt to develop a semi-Markov model of a ship’s electric power generation system and to use multi-state systems theory to develop an alternative aspect of maintenance policy, indicating the importance of human capital management in relation to cost management optimization.
Chapter 15 discusses the quantitative models proposed in the software security literature, called vulnerability discovery models, for predicting the total number of vulnerabilities detected, identified, or discovered during the operational phase of the software. This work also describes the modeling framework of the vulnerability discovery models and vulnerability patching models.
Chapter 16 discusses the signature and its related measures, such as mean time to failure, expected cost, and the Barlow-Proschan index, with the help of the reliability function and the universal generating function, also using Owen’s method, for a coherent system that has independent and identically distributed elements.
Through this book, engineers and academicians will gain knowledge and help in understanding reliability engineering and its overviews. The book gives readers a broad overview of the past, current, and future trends of reliability methods and applications.

Mangey Ram
Graphic Era (Deemed to be University), India
Acknowledgments
The Editor acknowledges CRC Press for this opportunity and professional support. My special thanks to Ms. Cindy Renee Carelli, Executive Editor, CRC Press/Taylor & Francis Group, for the excellent support she provided me to complete this book. Thanks to Ms. Erin Harris, Editorial Assistant to Ms. Cindy Renee Carelli, for her follow-up and aid. Also, I would like to thank all the chapter authors and reviewers for their availability for this work.

Mangey Ram
Graphic Era (Deemed to be University), India

Editor
Dr. Mangey Ram received a PhD degree with a major in Mathematics and a minor in
Computer Science from G. B. Pant University of Agriculture and Technology,
Pantnagar, India. He has been a Faculty Member for over 11 years and has taught
several core courses in pure and applied mathematics at undergraduate, postgraduate,
and doctorate levels. He is currently a Professor at Graphic Era (Deemed to be
University), Dehradun, India. Before joining Graphic Era, he was a Deputy Manager
(Probationary Officer) with Syndicate Bank for a short period. He is Editor-in-Chief
of International Journal of Mathematical, Engineering and Management Sciences
and the guest editor and member of the editorial board of various journals. He is
a regular reviewer for international journals, including IEEE, Elsevier, Springer,
Emerald, John Wiley, Taylor & Francis, and many other publishers. He has published
150-plus research publications in IEEE, Taylor & Francis, Springer, Elsevier,
Emerald, World Scientific, and many other national and international journals of
repute and presented his works at national and international conferences. His fields
of research are reliability theory and applied mathematics. Dr. Ram is a Senior
Member of the IEEE, life member of Operational Research Society of India, Society
for Reliability Engineering, Quality and Operations Management in India, Indian
Society of Industrial and Applied Mathematics, member of International Association
of Engineers in Hong Kong, and Emerald Literati Network in the UK. He has been
a member of the organizing committee of a number of international and national
conferences, seminars, and workshops. He was conferred with the Young Scientist
Award by the Uttarakhand State Council for Science and Technology, Dehradun,
in 2009. He was awarded the Best Faculty Award in 2011; the Research Excellence
Award in 2015; and the Outstanding Researcher Award in 2018 for his significant
contribution in academics and research at Graphic Era (Deemed to be University)
in Dehradun, India.

Contributors
Misbah Anjum
Amity Institute of Information Technology
Amity University
Noida, India

Laurent Bordes
Laboratory of Mathematics and its Applications - IPRA, UMR 5142
University of Pau and Pays Adour - CNRS - E2S UPPA
Pau, France

Jose Carpio
Department of Electrical, Electronic and Control Engineering
Spanish National Distance Education University
Madrid, Spain

Jamilson Ramalho Dantas
Departamento de Ciência da Computação
Centro de Informática da UFPE - CIN
Recife, Pernambuco, Brasil
and
Departamento de Ciência da Computação
Universidade Federal do Vale do São Francisco - UNIVASF
Campus Salgueiro
Salgueiro, Pernambuco, Brasil

Maritza Rodriguez Gual
Department of Reactor Technology Service (SETRE)
Centro de Desenvolvimento da Tecnologia Nuclear - CDTN
Belo Horizonte, Brazil

Kanchan Jain
Department of Statistics
Panjab University
Chandigarh, India

Rivero Oliva Jesús
Departamento de Engenharia Nuclear
Universidade Federal do Rio de Janeiro (UFRJ)
Rio de Janeiro, Brazil

Salomón Llanes Jesús
GAMMA SA
La Habana, Cuba

P. K. Kapur
Amity Centre for Interdisciplinary Research
Amity University
Noida, India

Akshay Kumar
Department of Mathematics
Graphic Era Hill University
Dehradun, India

Shah Limon
Industrial & Manufacturing Engineering
North Dakota State University
Fargo, North Dakota

Claudio Cunha Lopes
Department of Reactor Technology Service (SETRE)
Centro de Desenvolvimento da Tecnologia Nuclear - CDTN
Belo Horizonte, Brazil

Paulo Romero Martins Maciel
Departamento de Ciência da Computação
Centro de Informática da UFPE - CIN
Recife, Pernambuco, Brasil

Perdomo Ojeda Manuel
Instituto Superior de Tecnologías y Ciencias Aplicadas
Universidad de La Habana (UH)
La Habana, Cuba

Thomas Markopoulos
Department of Financial and Management Engineering
University of the Aegean
Chios, Greece

Rubens de Souza Matos Júnior
Coordenadoria de Informática
Instituto Federal de Educação, Ciência e Tecnologia de Sergipe, IFS
Lagarto, Sergipe, Brasil

Rogerio Pimenta Morão
Department of Reactor Technology Service (SETRE)
Centro de Desenvolvimento da Tecnologia Nuclear - CDTN
Belo Horizonte, Brazil

Toshio Nakagawa
Department of Business Administration
Aichi Institute of Technology
Toyota, Japan

Miguel Angel Navas
Department of Electrical, Electronic and Control Engineering
Spanish National Distance Education University
Madrid, Spain

Vagner de Oliveira
Department of Reactor Technology Service (SETRE)
Centro de Desenvolvimento da Tecnologia Nuclear - CDTN
Belo Horizonte, Brazil

Agapios N. Platis
Department of Financial and Management Engineering
University of the Aegean
Chios, Greece

Mangey Ram
Department of Mathematics; Computer Science & Engineering
Graphic Era (Deemed to be University)
Dehradun, India

Edson Ribeiro
Centro de Desenvolvimento da Tecnologia Nuclear - CDTN
Belo Horizonte, Brazil

Davide Russo
Department of Management, Information and Production Engineering
University of Bergamo
Bergamo, Italy

Carlos Sancho
Department of Electrical, Electronic and Control Engineering
Spanish National Distance Education University
Madrid, Spain

Ameneh Forouzandeh Shahraki
Civil & Industrial Engineering
North Dakota State University
Fargo, North Dakota

Avinash K. Shrivastava
Department: QT, IT and Operations
International Management Institute
Kolkata, West Bengal, India

Luiz Leite da Silva
Department of Reactor Technology Service (SETRE)
Centro de Desenvolvimento da Tecnologia Nuclear - CDTN
Belo Horizonte, Brazil

S. B. Singh
Department of Mathematics, Statistics & Computer Science
G. B. Pant University of Agriculture & Technology
Pantnagar, India

Christian Spreafico
Department of Management, Information and Production Engineering
University of Bergamo
Dalmine, Italy

Preeti Wanti Srivastava
Department of Operational Research
University of Delhi
New Delhi, India

Zhonglai Wang
School of Mechanical and Electrical Engineering
University of Electronic Science and Technology of China
Chengdu, China

Zhigang Wei
Tenneco Inc.
Grass Lake, Michigan

Sylwia Werbińska-Wojciechowska
Department of Operation and Maintenance of Logistic, Transportation and Hydraulic Systems
Faculty of Mechanical Engineering
Wroclaw University of Science and Technology
Wrocław, Poland

Om Prakash Yadav
Civil & Industrial Engineering
North Dakota State University
Fargo, North Dakota

Shui Yu
School of Mechanical and Electrical Engineering
University of Electronic Science and Technology of China
Chengdu, China

Xufeng Zhao
College of Economics and Management
Nanjing University of Aeronautics and Astronautics
Nanjing, China

1 Preventive Maintenance Modeling: State of the Art
Sylwia Werbińska-Wojciechowska

CONTENTS
1.1 Introduction
1.2 Preventive Maintenance Modeling for Single-Unit Systems
1.3 Preventive Maintenance Modeling for Multi-unit Systems
1.4 Conclusions and Directions for Further Research
References

1.1 INTRODUCTION
Preventive maintenance (PM) is an important part of facilities management in many of today’s companies. The goal of a successful PM program is to establish consistent practices designed to improve the performance and safety of the operated equipment. Recently, this type of maintenance strategy has been applied widely in many technical systems, such as production, transport, or critical infrastructure systems.
Many studies have been devoted to PM modeling since the 1960s. One of the first surveys of maintenance policies for stochastically failing equipment, in which PM models are investigated, is given in [1]. In this work, the author investigated PM for known and uncertain distributions of time to failure. Pierskalla and Voelker [2] prepared another excellent survey of maintenance models for proper scheduling and optimizing maintenance actions, which Valdez-Flores and Feldman [3] later updated. Other valuable surveys summarize the research and practice in this area in different ways (e.g., [4–18]). In turn, the comparison between time-based maintenance and condition-based maintenance is the focus of, e.g., works [19,20].
In this chapter, the author focuses on the review and summary of recent PM policies developed and presented in the literature. The adopted main classification of maintenance models is based on developments given in [15–18]. The classification includes two main groups of maintenance strategies: those for single-unit systems and those for multi-unit systems. The main scheme for classification of PM models for technical systems is presented in Figure 1.1.

[Figure 1.1 (diagram not reproduced). Recoverable box labels: Preventive maintenance (PM) for technical systems; PM for single-unit systems (age-based PM policies, periodic PM policies, sequential PM policies, failure limit policies, repair limit policies with repair cost limit and repair time limit policies, extended PM models for single-unit systems); PM for multi-unit systems (basic models for systems without components dependence; basic models for systems with components dependence: group maintenance policy, opportunistic maintenance policy, cannibalization maintenance; hybrid PM models: inspection maintenance modeling, spare parts provisioning policy, dynamic reliability maintenance).]

FIGURE 1.1 The classification for preventive maintenance models for technical system.
(Own contribution based on Wang, H., European Journal of Operational Research, 139,
469–489, 2002; Werbińska-Wojciechowska, S., Technical System Maintenance, Delay-time-
based modeling, Springer, London, UK, 2019; Werbińska-Wojciechowska, S., Multicomponent
technical systems maintenance models: State of art (in Polish), in Siergiejczyk, M. (ed.),
Technical Systems Maintenance Problems: Monograph (in Polish), Publication House of
Warsaw University of Technology, Warsaw, Poland, pp. 25–57, 2014.)

Many well-known research papers focus on PM models dedicated to the optimization of single-unit system performance. The well-known maintenance models for single-unit systems are age-dependent PM and periodic PM models. In these areas, the most frequently used replacement models are based on age replacement and block replacement policies. The basic references in this area are [3,15,22,23]. Comparisons of maintenance policies are presented, e.g., in works [24–29].
According to Cho and Parlar [4], “multi component maintenance models are concerned with optimal maintenance policies for a system consisting of several units of machines or many pieces of equipment, which may or may not depend on each other.” In 1986, Thomas [30] presented a classification of optimal maintenance strategies for multi-unit systems. He focuses on models based on one of three types of dependence that occur between system elements: economic, failure, and structural. According to the author, economic dependence implies that an opportunity for a group replacement of several components costs less than separate replacements of the individual components. Stochastic dependence, also called failure or probabilistic dependence, occurs if the condition of components influences the lifetime distribution of other components. Structural dependence means that components structurally form a part, so that maintenance of a failed component implies maintenance of working components. These definitions are adopted in this chapter.
Literature reviews that are compatible with the research findings given in [30] are provided, e.g., in works [5,31–33]. A more comprehensive discussion of maintenance from an application point of view can be found in [34,35]. For other recent references, see, e.g., [8,18,23]. A detailed review of the most commonly used PM policies for single- and multi-unit systems is presented in subchapters 1.2 and 1.3.

1.2 PREVENTIVE MAINTENANCE MODELING FOR SINGLE-UNIT SYSTEMS
First, the PM models for single-unit systems are investigated. Here a unit may be perceived as a component, an assembly, a subsystem, or even the whole system (treated as a complex system). The main classification for maintenance models of such systems is given in Figure 1.2. Comparisons of different PM policies are given in works [22,24,25,28,29,36–38].
One of the most commonly used PM policies for single-unit systems is the age replacement policy (ARP), which was developed in the early 1960s [39]. Under this policy, a unit is always replaced at its age T or at failure, whichever occurs first [40].
The issues of ARP modeling have been extensively studied in the literature since the 1990s. The main extensions developed for this maintenance policy apply to minimal repair, imperfect maintenance performance, shock modeling, or inspection action implementation. Accordingly, in the known maintenance models, the PM at T and corrective maintenance (CM) at failure may be minimal, imperfect, or perfect. The main optimization criteria are based on maintenance cost structure. Therefore, in the case of the simple ARP, the expected cost per unit of time for an infinite time span is given as [39,41]:

$$
C(T) = \frac{c_r F(T) + c_p \bar{F}(T)}{\int_0^T \bar{F}(t)\,dt} \tag{1.1}
$$

where:
$C(T)$ is the long-run expected cost per unit time
$c_p$ is the cost of preventive replacement of a unit
$c_r$ is the cost of failed unit replacement
$F(t)$ is the probability distribution function of system/unit lifetime, with $\bar{F}(t) = 1 - F(t)$
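
As a rough numerical illustration of Equation (1.1), the following Python sketch evaluates C(T) and searches a grid of candidate ages for the cheapest one. The Weibull lifetime, the cost values, the grid, and all function names are assumptions chosen for illustration; they are not taken from [39,41].

```python
import math

def weibull_cdf(t, shape, scale):
    """F(t) for a Weibull lifetime (an assumed distribution)."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def arp_cost_rate(T, c_p, c_r, cdf, steps=2000):
    """Expected cost per unit time of the simple ARP, Equation (1.1)."""
    h = T / steps
    # Survival function 1 - F(t) sampled on [0, T]
    surv = [1.0 - cdf(i * h) for i in range(steps + 1)]
    # Trapezoidal approximation of the integral of the survival function
    mean_cycle = h * (sum(surv) - 0.5 * (surv[0] + surv[-1]))
    F_T = cdf(T)
    return (c_r * F_T + c_p * (1.0 - F_T)) / mean_cycle

def best_age_on_grid(c_p, c_r, cdf, grid):
    """Return (T, C(T)) with the lowest cost rate among the grid ages."""
    return min(((T, arp_cost_rate(T, c_p, c_r, cdf)) for T in grid),
               key=lambda pair: pair[1])

if __name__ == "__main__":
    # Hypothetical inputs: Weibull shape 2, scale 1, c_p = 1, c_r = 5
    cdf = lambda t: weibull_cdf(t, shape=2.0, scale=1.0)
    grid = [0.05 + 0.01 * k for k in range(296)]
    T_opt, cost = best_age_on_grid(c_p=1.0, c_r=5.0, cdf=cdf, grid=grid)
    print(f"T* ~ {T_opt:.2f}, C(T*) ~ {cost:.3f}")
```

With an increasing failure rate (Weibull shape above 1) and c_r greater than c_p, the grid search returns a finite age. For an exponential lifetime, C(T) decreases in T, so preventive replacement brings no cost benefit.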

[Figure 1.2 (diagram not reproduced). Recoverable box labels: Preventive maintenance (PM) for single-unit systems; ARP models; BRP models; sequential PM models; limit PM models, including the repair-time limit policy and the repair-cost limit policy. Attributes listed under the boxes include minimal repair implementation, perfect/imperfect repair, shock modeling, cost/availability/reliability constraints, inspection policy, finite/infinite time horizon, new/used unit maintenance modeling, negligible/non-negligible downtime, hybrid models, dynamic reliability models, different modeling approaches, and mixed PM models.]

FIGURE 1.2 The classification for PM models for single-unit systems.

The first investigated group of ARP models applies to minimal repair implementation. Minimal repair is defined herein as “the repair that put the failed item back into operation with no significant effect on its remaining life time” [39]. A simple ARP model with minimal repair is given in [42], where the author investigates a one-unit system that is replaced at first failure after age T. All failures that happen before the age T are minimally repaired. The model is based on the optimization of the mean cost rate function. An extension of this model is given in [43,44], where the authors develop the ARP with minimal repair and general random repair cost. This research is continued in [45], where the author introduces a model for determining the optimal number of minimal repairs before replacement. The main assumptions are compatible with [43,44] and incorporate minimal repair, replacement, and general random repair cost.
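
A possible way to put numbers on the "replace at first failure after age T" policy is sketched below. The cost-rate expression is an assumption of this sketch, not a formula quoted from [42]: it applies the renewal-reward argument with a Weibull lifetime, so that the expected number of minimal repairs before T equals the cumulative hazard H(T) and the expected cycle length equals T plus the mean residual life at T. The cost names c_m and c_r and all function names are hypothetical.

```python
import math

def weibull_survival(t, shape, scale):
    """Survival function of an assumed Weibull lifetime."""
    return math.exp(-((t / scale) ** shape))

def weibull_cum_hazard(t, shape, scale):
    """Cumulative hazard H(t); expected number of minimal repairs in [0, t]."""
    return (t / scale) ** shape

def mean_residual_life(T, shape, scale, horizon=50.0, steps=50000):
    """Expected remaining life of a unit that has reached age T."""
    h = horizon * scale / steps
    # Trapezoidal rule on [T, T + horizon * scale]; the tail beyond is neglected
    vals = [weibull_survival(T + i * h, shape, scale) for i in range(steps + 1)]
    integral = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return integral / weibull_survival(T, shape, scale)

def cost_rate_first_failure_after(T, c_m, c_r, shape, scale):
    """Assumed long-run cost rate: minimal repairs before T, replacement
    at the first failure after T (renewal-reward ratio)."""
    expected_cycle_cost = c_m * weibull_cum_hazard(T, shape, scale) + c_r
    expected_cycle_length = T + mean_residual_life(T, shape, scale)
    return expected_cycle_cost / expected_cycle_length
```

For an exponential lifetime with unit rate the ratio reduces to (c_m T + c_r)/(T + 1), which gives a quick sanity check. Very large values of T make the survival probability underflow to zero, so the sketch is only meant for moderate ages.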
A similar problem is analyzed later in [46], where the authors investigate PM with
Bayesian imperfect repair. In the given PM model, the failure that occurred (for the
unit age Ty < T) can be either minimally repaired or perfectly repaired with random
probabilities. The expected cost per unit time is investigated for the infinite-horizon
case and the one-replacement-cycle case.
The implementation of a Bayesian approach for determining the optimal replacement strategy is also given in [47]. In this paper, the authors present a fully Bayesian analysis of the optimal replacement problem for the block replacement protocol with minimal repair and the simple age replacement protocol. The optimal replacement strategies are obtained by maximizing the expected utility with uncertainty analysis.
The ARP with minimal repair is usually investigated using maintenance cost constraints for optimization. However, a few PM models are developed based on availability optimization. For example, in [48] the authors investigate the steady-state availability of an imperfect repair model for repairable two-state items. The authors use renewal theory to provide analytical solutions for single- and multi-component systems.
In another work [49], the author introduces an ARP with non-negligible downtimes. In this work, the author develops sufficient conditions for the existence of a global minimum of the asymptotic expected cost rate under the ARP.
The introduction of periodic testing or inspections in ARP performance is given in [50]. The author introduces an ARP for components whose failures can occur randomly but are detected only by periodic testing or inspections. The developed model includes finite repair and maintenance times and cost contributions due to inspection (or testing), repair, maintenance, and loss of production (or accidents). The analytical solution encompasses general cost rate and unavailability equations. Inspection maintenance and PM optimization problems are continued in [51], where the authors focus on the issues of random failure and replacement time implementation.
In [52], the authors introduce replacement policies for a unit that is running suc-
cessive works with cycle times. In the paper, three replacement policies are defined
that are scheduled at continuous and discrete times:

• Continuous age replacement: The unit is replaced before failure at a
  planned time T
• Discrete age replacement: The unit is replaced before failure at completion
  of the N_wc-th working cycle
• Age replacement with overtime: The unit is replaced before failure at the
  first completion of some working cycle over the planned time T
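
The three policies above differ only in when the planned replacement is scheduled. The following minimal sketch computes that planned time for each policy when no failure occurs first. The function name, the dictionary keys, the example cycle durations, and the treatment of a cycle that ends exactly at T as "over" T are illustrative assumptions, not taken from [52].

```python
from itertools import accumulate


def planned_replacement_times(cycle_times, T, N):
    """Planned (failure-free) replacement time under each of the three policies.

    cycle_times: durations of successive working cycles (hypothetical data)
    T: planned replacement time
    N: index of the working cycle after which the unit is replaced (1-based)
    """
    # Completion time of each working cycle.
    completions = list(accumulate(cycle_times))
    # Continuous age replacement: replace exactly at T.
    continuous = T
    # Discrete age replacement: replace at completion of the N-th cycle, if observed.
    discrete = completions[N - 1] if 1 <= N <= len(completions) else None
    # Age replacement with overtime: first cycle completion at or after T
    # (ties counted as "over" T; this is an assumption of the sketch).
    overtime = next((c for c in completions if c >= T), None)
    return {"continuous": continuous, "discrete": discrete, "overtime": overtime}


times = planned_replacement_times([2.0, 3.0, 1.5, 4.0], T=6.0, N=2)
print(times)  # {'continuous': 6.0, 'discrete': 5.0, 'overtime': 6.5}
```

In this example the overtime policy lets the third working cycle finish at 6.5 instead of interrupting it at T = 6.0.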

Analytical equations of the expected cost rate with numerical solutions are provided.
The authors also present the comparison of given replacement policies.
Another extension of ARP modeling is given in [53], where the authors investigate
the problem of PM uncertainty by assuming that the quality of PM actions is a random
variable with a defined probability distribution. Following this, the authors analyze an
age reduction PM model and a failure rate PM model. Under the age reduction PM
model, it is assumed that each PM reduces operational stress to the existing time units
previous to the PM intervention, where the restoration interval is less than or equal to
the PM interval. The optimization criterion also is based on maintenance cost structure.
The issues of warranty policy are investigated in [54]. The author in this work
investigates a general age-replacement model that incorporates minimal repair,
planned replacement, and unplanned replacement for a product under a renewing
free-replacement warranty policy. The main assumptions of the ARP are compatible
with [43,44]. It is assumed that all product failures that cause minimal repair
can be detected instantly and repaired instantaneously by the user, who is therefore
responsible for all minimal repairs before and after the warranty expires. For a
product with an increasing failure rate function, a unique optimal replacement age
is shown to exist such that the long-run expected cost rate is minimized. The optimal
replacement ages for products with and without warranty also are compared
analytically.
The  warranty policy problem is analyzed in  [55], where the authors propose
an age-dependent failure-repair model to analyze the warranty costs of products.
In  this paper, the authors consider four typical warranty policies (fixed warranty,
renewing warranty, mixture of minimal and age-reducing repairs, and partial rebate
warranty).
The last group of ARP models applies to PM strategies based on the implementation
of shock models. The simple age-based policy with shock model is presented
in [56]. In this work, the authors introduce three main cumulative damage models:
(1) a model in which a unit is subjected to shocks and suffers some damage due to
them, (2) a model that includes periodic inspections, and (3) a model in which the
amount of damage increases linearly with time. For the defined shock models, optimal
replacement policies are derived for the expected cost rate minimization.
The extension of the given models is presented in [57], where the authors study
the mean residual life of a technical object as a measure used in the age replacement
model assessment. The analytical solution is supplied with a new U-statistic test pro-
cedure for testing the hypothesis that the life is exponentially distributed against the
alternative that the life distribution has a renewal-increasing mean residual property.
Another development of general replacement models of systems subject to shocks
is presented in [58], where the authors introduce the fatal and nonfatal shocks occur-
rence. The fatal shock causes the system total breakdown and the system is replaced,
whereas the nonfatal shock weakens the system and makes it more expensive to run.
Following this, the authors focus on finding the optimal T that minimizes the long-
run expected cost per unit time.
Another extension of the ARP with shock models is to introduce the minimal repair
performance. Following this, in [59] the authors extend the generalized replacement
policy given in [58] by introducing minimal repair of minor failures. Moreover, in the
given PM model, the cost of minimal repair of the system is age dependent.
Later, in [60], the authors introduce an extended ARP policy with minimal repairs
and a cumulative damage model implementation. Under the developed maintenance
policy, the fatal shocks are removed by minimal repairs and the minor shocks increase
the system failure rate by a certain amount. Without external shocks, the failure rate
of the system also increases with age due to the aging process. The optimality criteria
also are focused on the long-run expected cost per unit time. This model is extended
later in [61], where the authors consider the ARP with minimal repair for an extended
cumulative damage model with maintenance at each shock. According to the devel-
oped PM policy, when the total damage does not exceed a predetermined failure level,
the system undergoes maintenance at each shock. When the total damage has reached
a given failure level, the system fails and undergoes minimal repair at each failure.
The system is replaced at periodic times T or at Nth failure, whichever occurs first.
To sum up, many authors usually discuss ARPs of single-unit systems analyti-
cally. The main models that address this maintenance strategy also should be sup-
plemented by works that investigate the problem of ARP modeling with the use of
semi-Markov processes (see, e.g.,  [62,63]), TTT-plotting (see, e.g.,  [64]), heuristic
models (see, e.g.,  [65]), or approximate methods implementation (see, e.g.,  [66]).
The authors in [67] introduce the new stochastic order for ARP based on the com-
parison of the Laplace transform of the time to failure for two different lifetime
distributions. The comparison of ARP models for a finite horizon case based on a
renewal process application and a negative exponential and Weibull failure-time dis-
tribution is presented in [68]. The additional interesting problems in ARP modeling
may be connected with spare provisioning policy implementation (see, e.g., [69]) or
multi-state systems investigation (see, e.g., [62,70,71]).
The quick overview of the given ARPs is presented in Table 1.1.
Another popular PM policy for single-unit systems is block replacement policy
(BRP). For the given maintenance policy, it is assumed that all units in a system are
replaced at periodic intervals regardless of their individual age in kT time moments,
where k = 1, 2, 3, and so on. The maintenance problem usually is aimed at finding
the optimal cycle length T either to minimize total maintenance and operational
costs or to maximize system availability. The simple BRP, when the maintenance
times are negligible, is based on the optimization of the expected long-run mainte-
nance cost per unit time as a function of T, given as [72]:

\[ C(T) = \frac{c_r N(T) + c_p}{T} \tag{1.2} \]

where:
N(t) is the expected number of failures/renewals in the time interval (0, t)
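
Equation (1.2) can be evaluated numerically once N(T) is available. The sketch below approximates N(T) by discretizing the renewal equation N(t) = F(t) + ∫ N(t − x) dF(x) on a uniform grid and then scans candidate cycle lengths T. It reads c_r as the cost of a failure replacement and c_p as the cost of a planned replacement, which is an interpretation of the symbols; the Weibull lifetime, the cost values, the grid sizes, and all function names are illustrative assumptions and are not taken from [72].

```python
import math


def renewal_function(cdf, horizon, steps):
    """Approximate N(t) on a uniform grid over [0, horizon].

    Uses a Riemann-Stieltjes discretization of the renewal equation
    N(t) = F(t) + integral_0^t N(t - x) dF(x).
    """
    h = horizon / steps
    F = [cdf(i * h) for i in range(steps + 1)]
    N = [0.0] * (steps + 1)
    for i in range(1, steps + 1):
        conv = sum(N[i - j] * (F[j] - F[j - 1]) for j in range(1, i + 1))
        N[i] = F[i] + conv
    return N


def brp_cost_rate(cdf, T, c_r, c_p, steps=200):
    """Expected long-run cost per unit time, C(T) = (c_r * N(T) + c_p) / T."""
    N = renewal_function(cdf, T, steps)
    return (c_r * N[-1] + c_p) / T


def weibull_cdf(t, shape=2.0, scale=1.0):
    """Weibull lifetime distribution (assumed here only for illustration)."""
    return 1.0 - math.exp(-((t / scale) ** shape))


# Coarse scan over candidate cycle lengths T (hypothetical costs c_r = 5, c_p = 1).
candidates = [0.1 * k for k in range(2, 21)]
costs = {T: brp_cost_rate(weibull_cdf, T, c_r=5.0, c_p=1.0) for T in candidates}
best_T = min(costs, key=costs.get)
print(f"best T on the grid: {best_T:.1f}, cost rate: {costs[best_T]:.3f}")
```

For an exponential lifetime the renewal function is linear in t, so C(T) decreases in T and no finite optimum exists; an interior minimum such as the one found here requires an increasing failure rate.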
TABLE 1.1
Summary of PM Policies for Single-Unit Systems
Type of Maintenance Policy | Planning Horizon | Optimality Criterion | Modeling Method | Typical References
ARP | Infinite (∞) | The long-run expected cost per time unit | Bayesian approach | [47]
ARP | Infinite (∞) | The long-run expected cost per unit time, availability function | Analytical | [38]
ARP | Infinite (∞) | The long-run expected cost per time unit | Analytical | [39,40–42,44,53,54,60,118]
ARP | Infinite (∞) | The expected cost rate | Analytical | [45,49,51,56,59,61,66,119]
ARP | Infinite (∞) | The mean cost rate | Analytical | [120]
ARP | Infinite (∞) | The total cost rate, the expected unavailability | Analytical | [50]
ARP | Infinite (∞) | The expected replacement cost rate | Analytical | [52]
ARP | Infinite (∞) | The expected warranty cost | Analytical | [55]
ARP | Infinite (∞) | The steady-state availability function | Analytical | [48]
ARP | Infinite (∞) | The survival function | Analytical | [121]
ARP | Infinite (∞) | The mean time to failure | Analytical (Laplace transform) | [67]
ARP | Infinite (∞) | The long-run expected cost per unit time, availability, lifetime, and reliability functions | Multi-attribute value model | [122]
ARP | Infinite (∞) | The expected long-run cost rate | Heuristic model | [65]
ARP | Infinite (∞) | The expected long-run cost rate | Semi-Markov decision process | [63]
ARP | Infinite (∞) | The expected long-run cost rate | Semi-Markov process | [62]
ARP | Infinite (∞) | The long-run average cost per unit time | Proportional hazard model and TTT-plotting | [64]
ARP | Infinite (∞) | The total system costs | Simulation model | [69]
ARP | Infinite (∞) | State-age-dependent policy | Multi-phase Markovian model | [71]
ARP | Infinite (∞) | Mean residual life | Analytical/simulation | [57]
ARP | Infinite (∞) | The expected cost of operating the system over a time interval | Analytical | [123]
ARP | Infinite (∞) | The expected long-run cost per unit time, the total discounted cost | Analytical | [78]
ARP | Infinite (∞)/finite | The expected cost rate per unit time | Analytical | [46]
ARP | Infinite (∞)/finite | The long-run expected cost per unit time | Analytical | [43,58,124]
ARP | Finite | Expected cumulative cost | Analytical | [68]
ARP | Finite | Customer's expected discounted maintenance cost | Continuous-time Markov process | [70]
BRP | Infinite (∞) | The long-run expected cost per time unit | Analytical | [72,74–80,83,125–127]
BRP | Infinite (∞) | The long-run expected cost per time unit | Analytical/semi-Markov processes | [81]
BRP | Finite | The long-run expected cost per time unit | Analytical | [7]
Sequential PM policy | Infinite (∞) | Mean maintenance costs | Analytical | [41]
Sequential PM policy | Infinite (∞) | Expected cost rate | Analytical | [88]
Sequential PM policy | Infinite (∞) | Expected costs per unit time | Analytical | [90]
Sequential PM policy | Infinite (∞) | Total expected maintenance costs | Genetic algorithm | [92]
Sequential PM policy | Infinite (∞) | Mean cost rate | Bayesian approach | [93]
Sequential PM policy | Infinite (∞)/finite | Expected cost rate till replacement | Analytical | [89]
Sequential PM policy | Finite | Expected cost till replacement | Analytical | [7]
Sequential PM policy | Finite | Expected profit | Genetic algorithm | [91]
Failure limit policy (Failure rate through wear/accumulated damage or stress) | Infinite (∞) | Total expected long-run cost per unit time | Analytical | [94]
Failure limit policy (Failure rate) | Infinite (∞) | Cost rate | Analytical | [95,96]
Failure limit policy (Failure rate) | Infinite (∞) | Availability function | Analytical | [128]
Failure limit policy (Degradation ratio) | Infinite (∞) | Total expected long-run cost per unit time/availability function | Analytical | [129]
Failure limit policy (Failure rate) | Infinite (∞) | Unit-cost life of a system | Genetic algorithms | [98]
Failure limit policy (Age) | Finite | Total costs function | Analytical (branching algorithm) | [97]
Repair-time limit policy | Infinite (∞) | Expected cost per unit time | Markov renewal process | [101]
Repair-time limit policy | Infinite (∞) | Expected cost per unit time | Analytical | [100]
Repair-time limit policy | Infinite (∞) | The total expected costs per unit time | Graphical approach (TTT) | [102,107,112]
Repair-time limit policy | Infinite (∞) | The expected total discounted cost | Graphical approach | [106]
Repair-time limit policy | Infinite (∞) | The expected cost per unit time | Lorenz curve | [105]
Repair-time limit policy | Infinite (∞) | The long-run average profit rate/the total discounted profit | Analytical/nonparametric algorithms | [104]
Repair-cost limit policy | Infinite (∞) | Cost rate | Analytical | [108,110,111]
Repair-cost limit policy | Infinite (∞) | Mean cost rate | Analytical | [109,115]
Repair-cost limit policy | Infinite (∞) | Mean cost rate | Markov renewal process | [117]
Repair-cost limit policy | Infinite (∞) | The long-term cost per unit time | Analytical | [113]
Repair-cost limit policy | Infinite (∞) | The long-run total maintenance cost rate | Analytical | [114]
Repair-cost limit policy | Infinite (∞) | Total expected cost per unit time | Graphical approach (TTT) | [103]
Repair-cost limit policy | Infinite (∞) | The expected average cost per unit time | Optimal stopping theory | [130]
Repair-cost limit policy | Infinite (∞) | The long-run average expected maintenance cost per unit time | Semi-Markov decision process | [131]
Repair-cost limit policy | Finite | The expected cost of servicing | Analytical | [132]

The  main advantage of this policy is its simplicity. However, the main drawback
of simple block replacement policy is that at planned replacement times practically
new items might be replaced and a major portion of the useful life of these units is
wasted. Thus, to overcome this disadvantage, various modifications have been intro-
duced in the literature. The main extensions for the simple BRP include minimal
repair implementation, finite/infinite time horizon, shock modeling use, and inspec-
tion maintenance performance.
The introduction of minimal repair was first analyzed in the 1970s
(see, e.g., [41,73]). Later, in [74], the author considers a BRP with minimal repair at
failure for a used unit of age T_ax. In the given model, the item is preventively replaced
by a new one at times kT, k = 1, 2, 3, and so on. If the system fails in [(k−1)T, kT−Δδ],
then the item is either replaced by a new one or repaired minimally. If the failure
occurs in [kT−Δδ, kT], then the item is either replaced by a used one with age varying
from Δδ to T or repaired minimally. The choice is random with age-dependent
probability. The cost structure also is age-dependent. For the given assumptions, the
author defines the expected long-run cost per unit time function. This maintenance
model is extended later in [75] for single and multi-unit cases.
An interesting model is introduced in [76], where the authors investigate an optimal
maintenance model for repairable systems under two types of failures with different
maintenance costs. The model assumes that periodic visual inspections are performed
to detect potential failures of type I. For the given assumptions, the
total expected costs are estimated.
The presented models are developed for an infinite time span. In [7], finite
replacement models are considered. Taking into account that the working time of
a unit is given by a specified value T_wo, the long-run expected costs per unit time
are estimated.
Another extension of the simple BRP applies to shock modeling implementation.
For example, in [77] the authors investigate a system subjected to shocks, which occur
independently and according to a Poisson process with intensity rate λ_s. The shocks
may be either nonlethal with probability p_s or lethal with probability (1−p_s).
Later, the extension of the given model is presented in [78]. In the given paper, the author
analyzes a system subject to shocks that arrive according to a Non-Homogeneous
Poisson (NHP) process. As shocks occur, the system has two types of failures:

• Type I (minor) failure: Removed by minimal repair
• Type II (catastrophic) failure: Removed by unplanned replacement

The probability of the type II failure is dependent on the number of shocks suffered
since the last replacement. The author derives the expressions for the expected
long-run cost per unit time and the total α-discounted cost for each policy. This model
is later extended in [79], where the authors consider a BRP model for a system
subjected to shock occurrence and with minimal repair at failure for a used unit of age
T_ax. The proposed solution was based on assumptions given in [74].
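
For the homogeneous shock model of [77] described above, the standard thinning property of the Poisson process gives a closed form: if each shock is lethal with probability (1 − p_s) independently of the arrival process, the lethal shocks form a Poisson process with rate λ_s(1 − p_s), so the probability of no lethal shock up to time t is exp(−λ_s(1 − p_s)t). The sketch below compares this expression with a small seeded simulation. The parameter values, function names, and the simulation itself are illustrative assumptions and do not reproduce the cost model of [77].

```python
import math
import random


def survival_probability(t, lam_s, p_s):
    """P(no lethal shock in (0, t)) for a thinned Poisson shock process."""
    return math.exp(-lam_s * (1.0 - p_s) * t)


def simulate_time_to_lethal_shock(lam_s, p_s, rng):
    """Draw shocks at rate lam_s until one of them is lethal; return its time."""
    t = 0.0
    while True:
        t += rng.expovariate(lam_s)   # exponential inter-arrival time
        if rng.random() >= p_s:       # lethal with probability 1 - p_s
            return t


# Hypothetical parameters: 2 shocks per unit time, 75% of them nonlethal.
rng = random.Random(42)
runs = 20000
survived = sum(simulate_time_to_lethal_shock(2.0, 0.75, rng) > 1.0 for _ in range(runs))
estimate = survived / runs
print(f"simulated: {estimate:.3f}, closed form: {survival_probability(1.0, 2.0, 0.75):.3f}")
```

The same closed form does not carry over directly to the non-homogeneous model of [78], where the lethality probability depends on the number of shocks since the last replacement.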
The time-dependent cost structure is investigated in [80], where the authors deter-
mine a replacement time for a system with the use of counting process whose jump
size is of one unit magnitude.
To sum up, many authors discuss BRPs of single-unit systems due to their sim-
plicity. The  main models that address this maintenance strategy also should be
supplemented by works that investigate the problem of imperfect maintenance (see,
e.g., [81,82]), joint preventive maintenance with production inventory control policy
(see, e.g., [83]), risk at failure investigation (see, e.g., [84]), or estimation issues (see,
e.g.,  [72]). The  examples of BRP implementation apply to transportation systems
maintenance (see, e.g.,  [85]), aircraft component maintenance (see, e.g.,  [86]), or
preventive maintenance for milling assemblies (see, e.g., [87]). The quick overview
of the given BRPs is presented in Table 1.1.
Another PM policy applied in the area of maintenance of single-unit systems
is sequential PM policy. Under this PM policy a unit is preventively maintained at
unequal time intervals. The unequal time interval usually is related to the age of the
system or is predetermined as in periodic maintenance policies [15].
One of the first works where the author considers sequential PM policy is [88].
In this work, the sequential preventive maintenance for a system with minimal repair
at failure is investigated. The policy assumes that the system is replaced at constant
time intervals and at the Nth failure. This model is later investigated in [7], where the
author proposes the simple sequential PM policy with imperfect maintenance for a
finite time span.
Another interesting model of the sequential PM policy is presented in [89], where
the authors introduce a shock model and a cumulative damage model. In this article,
two replacement policies are developed—a periodic PM and a sequential PM pol-
icy with minimal repair at failure and imperfect PM. The solutions are obtained for
finite and infinite time spans. These problems are investigated later in [90], where
the authors adopt improvement factors in the hazard rate function for modeling the
imperfect PM performance. The  model is presented for an infinite time-horizon.
The main characteristic of the given model is connected with considering the age-
dependent minimal repair cost and the stochastic failure type.
In [91], the authors present a sequential imperfect PM policy for a degradation
system. This model extends assumptions given in [88]. The developed model is based
on maximal/equal cumulative-hazard rate constraints. The optimization is obtained
using a genetic algorithm. Later, the random adjustment-reduction maintenance
model with imperfect maintenance policy for a finite time span is presented in [92].
The authors also use the genetic algorithm implementation.
The Bayesian approach implementation in the sequential PM problem is presented
in [93]. The authors determine the optimal PM schedules for a hybrid sequential PM
policy, where the age reduction PM model and the hazard rate PM model are com-
bined. Under such a hybrid PM model, each PM action reduces the effective age of
the system to a certain value and also adjusts the slope of the hazard rate (slows down
the degradation process of the maintained system).
Sequential PM policies are practical for most units that need more frequent main-
tenance with increasing age. The quick overview of the main known sequential PM
models is given in Table 1.1.
The last group of PM policies applies to predefined limit level policies. The first
of these, the failure limit policy, depends on the failure model assumed for operated
units. Under this policy, PM is performed only when the defined state variable, which
describes the state of the unit at age T (e.g., failure rate), reaches a predetermined
level and failures that occur are repaired.
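
As a simple illustration of a failure limit rule, suppose the monitored state variable is the failure rate and the lifetime follows a Weibull distribution with shape β > 1 and scale η, so that h(t) = (β/η)(t/η)^(β−1) is increasing. Solving h(t) = L for the predetermined level L gives the age at which PM is triggered, t = η(Lη/β)^(1/(β−1)). The Weibull assumption, the parameter values, and the function names below are illustrative and are not taken from the cited works.

```python
def weibull_hazard(t, shape, scale):
    """Weibull failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def age_at_hazard_limit(limit, shape, scale):
    """Age at which an increasing Weibull failure rate first reaches `limit`."""
    if shape <= 1.0:
        # With shape <= 1 the failure rate is constant or decreasing,
        # so a rising limit is never reached from below.
        raise ValueError("shape must exceed 1 for an increasing failure rate")
    return scale * (limit * scale / shape) ** (1.0 / (shape - 1.0))


# Hypothetical unit: shape 2, scale 100 time units, PM when h(t) reaches 0.01.
pm_age = age_at_hazard_limit(limit=0.01, shape=2.0, scale=100.0)
print(pm_age, weibull_hazard(pm_age, shape=2.0, scale=100.0))
```

With these values the failure rate is h(t) = t/5000, so the limit 0.01 is reached at an age of about 50 time units.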
One of the first works that investigates the optimal replacement model with the use
of the failure limit policy is in [94]. The author in this work presents the replacement
policy based on the failure model defined for an operating unit. In this model, a unit
state at age T is defined by a random variable. The replacement is performed either at
failure or when the unit state reaches or exceeds a given level, whichever occurs first.
Model optimization is based on the average long-run cost per unit time estimation.
This problem is investigated later in [95]. The author in his work introduces a PM
model with the monotone hazard function affected by system degradation. The author
develops a hazard model and achieves a cost optimization of system operation.
The imperfect repair in failure limit policy is introduced in [96]. The authors in
their work consider two types of PM (simple PM and preventive replacement) and
two types of corrective maintenance (minimal repair and corrective replacement).
The developed cost-rate model is based on adjustment of the failure rate after simple
PM with the use of a concept of improvement factor. The expected costs are the sum
of average costs of both types of PM and average cost of downtime. This problem is
addressed further in [97]. The authors in their work propose a cost model for two
types of PM (as in [96]) and one type of corrective maintenance (corrective replace-
ment) that considers inflationary trends over a finite time horizon.
The PM scheduling for a system with deteriorated components also is analyzed
in [98]. The authors consider a PM policy compatible with those presented in [97],
but the degraded behavior of maintained components is modeled by a dynamic reli-
ability equation. The optimal solution, based on unit-cost life estimation, is obtained
with the use of genetic algorithms.
Another example of PM modeling under the failure limit policy is presented
in [99], where the authors focus on system availability optimization. In the presented
model system failure rate is reduced after each PM and depends on age and on the
number of performed PM actions.
Maintenance models under the failure limit policy are summarized in
Table 1.1.
The second group of PM policies based on predefined limit levels are repair limit
policies. In the known literature, there are two types of repair limit policies: a repair
cost limit policy and a repair time limit policy [13]. Under the repair cost limit policy,
when a unit fails, a repair cost is estimated and repair is undertaken if the estimated
cost is less than a predetermined limit. Otherwise, the unit is replaced. For the repair
time limit policy, the decision variable is the time of repair. If the time of corrective
repair is greater than the specified time T_rmax, the unit is replaced. Otherwise, the unit
is repaired [15,100].
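
The two repair limit rules, one based on the estimated repair cost and one based on the repair time, reduce to simple threshold decisions at each failure. The sketch below encodes them as stated in the paragraph above; the function names, the argument names, and the example values are hypothetical.

```python
def repair_cost_limit_decision(estimated_cost, cost_limit):
    """Repair if the estimated repair cost is below the limit; otherwise replace."""
    return "repair" if estimated_cost < cost_limit else "replace"


def repair_time_limit_decision(repair_time, t_rmax):
    """Replace if the corrective repair time exceeds t_rmax; otherwise repair."""
    return "replace" if repair_time > t_rmax else "repair"


# Hypothetical failures: (estimated repair cost, repair time).
failures = [(120.0, 3.0), (480.0, 9.5), (300.0, 6.0)]
for cost, duration in failures:
    by_cost = repair_cost_limit_decision(cost, cost_limit=300.0)
    by_time = repair_time_limit_decision(duration, t_rmax=8.0)
    print(f"cost={cost:>6}: {by_cost:<8} time={duration:>4}: {by_time}")
```

Note the asymmetry at the boundary: a repair cost exactly equal to the limit leads to replacement, whereas a repair time exactly equal to the limit still leads to repair.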
The first models on repair limit policies are presented in [100,101]. The modeling
methods are based on Markov renewal process use. Later, in [102], the authors dis-
cuss the optimal repair limit replacement policy based on a graphical approach with
the use of the Total Time on Test (TTT) concept. This graphical approach is used
in [103] to determine the optimal repair limit replacement policy.
Another extension of the simple repair time limit policy is imperfect maintenance
implementation. In this implementation, known models are presented in [104–107].
The implemented modeling methods are based on using the TTT concept and Lorenz
statistics.
The second type of repair limit policy relies on estimating the repair cost at a system
failure and is defined as a repair-cost limit policy. One of the first studies that investigates
a general maintenance model with replacements and minimal repair as a
base for repair limit replacement policy is  [108]. The  author presents three basic
maintenance policies (based on age-dependent PM and periodic PM) and two basic
repair limit replacement policies. In the first repair-cost limit replacement policy, the
author assumes that a system is replaced by the new one if the random repair cost
exceeds a given repair cost limit; otherwise, it is minimally repaired. This problem
is later investigated in [109], in which the minimal repairs follow a Non-Homogeneous
Poisson Process (NHPP).
The problem of imperfect maintenance is introduced in [110], whereas in [111]
the authors investigate the problem of imperfect estimation of repair cost (imperfect
inspection case).
The implementation of a graphical method (TTT concept) in the repair-cost limit
replacement problem with imperfect repair is presented in  [112]. In  the presented
model, the authors introduce the imperfect repair (according to [110]) and a lead time
for failed unit replacement. The solution is based on the assumption of negligible
replacement time and uses the renewal reward process.
The  cumulative damage model for systems subjected to shocks is presented
in [113]. The author introduces a periodical replacement policy with the concept of
repair cost limit under a cumulative damage model and solves it analytically for an
infinite time span.
Another interesting approach to the repair-cost limit replacement policies is pre-
sented in [114]. The author proposes the total repair-cost limit replacement policy,
where a system is replaced by the new one as soon as its total repair cost reaches
or exceeds a given level. The presented problem is later investigated and extended
in [115,116], where the authors introduce two types of failures (repairable and non-
repairable) and propose a mixed maintenance policy similar to the one presented
in [117].
The current repair limit policies and their extensions are summarized in Table 1.1.

1.3 PREVENTIVE MAINTENANCE MODELING FOR MULTI-UNIT SYSTEMS
In this subchapter, the PM models for multi-unit systems are investigated. In this
research area, models can be distinguished for systems with component dependence
and for systems without it. For systems without component dependence, simple
age- and block-maintenance models can be implemented. When component
dependence can be identified in a system, three main types of maintenance policies
may be used:

• Group maintenance policy
• Opportunistic maintenance policy
• Cannibalization maintenance

First, the group maintenance policies may be used. Under such a policy, a group of
items is replaced at the same time to take advantage of economies of scale.
Opportunity-based replacement models are based on the rule that replacement is
performed at the time when an opportunity arrives, such as scheduled downtime,
planned shutdown of the machines, or failure of a system in close proximity to the
item of interest.
In the situation when one machine is inoperative due to lack of components and
at the same time one or more other machines are inoperative due to the lack of dif-
ferent components, maintenance personnel may cannibalize operative components
from one or more machines to repair the other or others. This practice is common in
systems that are composed of sufficiently identical component parts (see, e.g., [34]).
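
The effect of cannibalization described above can be illustrated with a small counting sketch. It assumes that all machines are identical, that each machine needs exactly one working component of every required type, and that components can be moved between machines at no cost and without delay; under these assumptions the number of machines that can be made operative is limited by the scarcest component type. The data and function names are hypothetical.

```python
from collections import Counter


def operative_without_cannibalization(machines, required):
    """Count machines that already hold every required working component."""
    return sum(1 for components in machines if required <= components)


def operative_with_cannibalization(machines, required):
    """Upper limit on operative machines when working components are pooled.

    Assumes interchangeable components and one component of each type per machine.
    """
    working = Counter()
    for components in machines:
        working.update(components & required)
    return min(working[c] for c in required)


required = {"pump", "motor", "valve"}
# Working component types currently installed in each machine (hypothetical fleet).
machines = [
    {"pump", "motor"},           # valve failed
    {"motor", "valve"},          # pump failed
    {"pump", "motor", "valve"},  # fully operative
]
print(operative_without_cannibalization(machines, required))  # 1
print(operative_with_cannibalization(machines, required))     # 2
```

Here moving the working valve from the second machine to the first restores one machine, raising the number of operative machines from one to two.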
The main classification for these types of PM models is given in Figure 1.3.
Following is a detailed review of the most commonly used maintenance policies.
First, maintenance policies for multi-unit systems without component dependence
are reviewed. In these systems two PM policies usually are used—ARP and BRP.
One of the first works that applies the simple age replacement policy is [133].
The author proposes the simple ARP model for an n_k-out-of-n
warm stand-by system, where the lifetime of components is exponentially distributed.
The optimal maintenance policy for n failure-independent but non-identical
machines in series is given in [134]. The solution is obtained with the use of nonlin-
ear programming models.
The  maintenance models with the use of ARP for multi-unit systems mostly
implement minimal repair, a shock-modeling approach, and hybrid PM.
The minimal repair is introduced in [135]. In this paper, the model assumes that a
system is replaced at age T. When the system fails before age T, it is either replaced
or minimally repaired depending on the random repair cost at failure. The  model
considers finite and infinite time spans and is solved with a Bayesian approach
implementation.

FIGURE 1.3  The classification for PM models for multi-unit systems. [Diagram: basic models for multi-unit systems without component dependence (ARP models, BRP models) and basic models for multi-unit systems with component dependence (opportunistic maintenance models, group maintenance models, cannibalization maintenance).]

Another interesting extension of the simple ARP is shock-modeling implementation.
This problem is investigated in [136,137]. In [136], the authors introduce a
maintenance model for a two-unit system subjected to shocks and with a failure
rate interaction. The two types of shocks (minor and catastrophic) stem from a non-
homogeneous pure birth process and their occurrence is dependent on the number of
shocks that have occurred since the last replacement. In [137], this model is extended
by a spare parts availability investigation.
The hybrid ARP applies mostly to opportunity-based maintenance implementa-
tion. This problem is investigated in [138], where maintenance opportunities arise
according to a Poisson process. The problem of opportunity-based ARP also is inves-
tigated in [139–141].
In the available literature, ARP models can be found that apply to a repair priority
problem (see [142]), a machine repair problem (see [143]), or production systems
maintenance (see [144]). The quick overview of the given ARPs is presented in Table 1.2.
The second group of PM policies for multi-unit systems without economic depen-
dence applies to BRPs. Various BRPs are investigated in [145]. The author analyzes
a two-unit system in a series reliability structure.
The  maintenance problems of a two-unit parallel system also are investigated
in  [146]. In  this article, the authors introduce a replacement model with minimal
repair at minor failure. The  analyzed system is based on structural dependence.
The significant development of this model is given in [147], where the authors focus
on periodic replacement for an n-unit parallel system subject to common cause shock
failures. In this model, two types of failures are considered:

• Independent failures of one component in the system
• Failures of many components of the system at the same time, not necessarily
  independent

The summary of optimum replacement policies for an n-unit system in parallel is given
in [148]. The authors compare four replacement policies—a simple BRP and a mixed
BRP. This work is the basis for other authors to introduce many extensions of the BRPs
for multi-unit systems. The analysis of a system with non-identical components is given
in [149]. Imperfect maintenance is introduced in [150]. Moreover, the periodic replace-
ment with minimal repair at failure for a multi-unit system is considered in [151]. In this
work, the author investigates a simple model of BRP with minimal repair, when repair
costs depend on system age and the number of performed minimal repairs.
The problem of minimal repair performance is investigated in [152], where the
authors introduce a periodical inspection for a two-unit parallel system. This model
considers the detection capacity of inspections (perfect/imperfect), minimal repairs,
and failure interactions to examine dependence between subsystems. The  investi-
gation is continued in  [153], where the authors examine issues analyzed in  [152]
and [150].
The main maintenance models focus on optimization of the cycle length T between
performance of preventive maintenance actions. A number of research works also deal
with the problem of cyclically scheduling maintenance activities assuming a fixed cycle
length. In [154], the authors formulate a maintenance scheduling problem to maintain a
set of machines for a given determined T. The study presents a completely deterministic
approach to decide for each period t ∈ T which machine to service (if any) such that
total servicing costs and operating costs are minimized. The solution is obtained with the
use of a branch and price algorithm. Other interesting maintenance problems concern the
investigation of uncertain lifetimes of system units (see [155]), the introduction of repairable
and non-repairable failures of a system (see [156]), lives of heterogeneous components
of a system (see [157]), the implementation of an ergodic Markov environment (see [158]),
or nearly optimal and optimal PM assessment for real-life systems (see [128,159]).
A quick overview of the given BRPs is presented in Table 1.2.

TABLE 1.2
Summary of Age and Block Replacement Policies for Multi-unit Systems
| Type of Maintenance Policy | Planning Horizon | Optimality Criterion | Modeling Method | Typical References |
|---|---|---|---|---|
| ARP | Infinite (∞) | The expected long-run costs per unit time | Analytical | [133,138–141,144] |
| ARP | Infinite (∞) | The expected long-run costs per unit time | Nonlinear programming | [134] |
| ARP | Infinite (∞) | The expected cost rate | Analytical | [136,137,143] |
| ARP | Infinite (∞) | Average loss rate | Renewal process/geometric process/Markov process | [142] |
| ARP | Infinite (∞)/finite | The expected long-run costs per unit time | Renewal reward theory/Bayesian approach | [119] |
| BRP | Infinite (∞) | The expected long-run cost per unit time | Analytical/simulation | [145,149] |
| BRP | Infinite (∞) | The expected long-run cost per unit time | Analytical (hybrid PM) | [152,157] |
| BRP | Infinite (∞) | The expected long-run cost per unit time | Analytical (expected and critical value models) | [155] |
| BRP | Infinite (∞) | The expected long-run cost per unit time | Markov processes | [158] |
| BRP | Infinite (∞) | The expected long-run cost per unit time | Embedded Markov chain | [153] |
| BRP | Infinite (∞) | The expected long-run cost per unit time | Analytical | [75,146–151] |
| BRP | Infinite (∞) | The expected long-run cost per unit time, system availability | Analytical | [160] |
| BRP | Infinite (∞) | System availability | Analytical | [150,161] |
| BRP | Infinite (∞) | System availability and reliability | Analytical (matrix Laplace transformations) | [156] |
| BRP | Infinite (∞) | Total operating and servicing cost | Branch and price algorithm | [154] |
| BRP | Infinite (∞) | System reliability | Simulation | [162] |

For technical systems, where component dependence can be defined, group
maintenance policies may be used to optimize system performance. Such a policy
is based on performing a maintenance activity on a group of components.
According to [15], group maintenance is performed either when a fixed time
interval has expired or when a fixed number of units have failed, whichever
comes first. Group replacement policies are classified into two main groups of
models: static maintenance models and dynamic maintenance models.
In the group of static maintenance models, four main classes of group replacement
policies can be defined. A T-age policy assumes that system replacement is
performed after every T units of time. An m-failure policy calls for replacing
a system at the time of the mth failure. The (m, T)-policy combines features of the
T-age policy and the m-failure policy: system replacement is performed at the time
of the mth failure or at time T, whichever occurs first. The T-policy refers
to the assumptions of block replacement.
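The decision rules of the first three classes can be written down directly. The following is a minimal sketch, assuming that the failure times of the units within one replacement cycle are known and that failed units wait for the group replacement. The function name and the policy labels are hypothetical.

```python
def group_replacement_time(lifetimes, policy, T=None, m=None):
    """Return the time at which the whole group is replaced in one cycle.

    lifetimes: failure times of the individual units, measured from the
               last group replacement (assumed known for illustration).
    """
    failure_times = sorted(lifetimes)
    if policy == "T-age":
        return T                          # replace after every T time units
    if policy == "m-failure":
        return failure_times[m - 1]       # replace at the m-th failure
    if policy == "(m, T)":
        # replace at the m-th failure or at time T, whichever occurs first
        return min(T, failure_times[m - 1])
    raise ValueError(f"unknown policy: {policy}")
```

For example, with unit lifetimes of 3.0, 7.5, 1.2, and 9.0 time units, T = 5, and m = 2, the m-failure policy and the (m, T)-policy both trigger at time 3.0, whereas raising m to 3 makes the (m, T)-policy trigger at T.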
The presented classes of maintenance models are based on the assumption that the
failure distribution of a system is known with certainty. However, in practice the
failure distribution of a system is usually unknown or known with uncertain parameters.
For this case, Bayesian group replacement policies have been proposed.
Considering the planning aspect, group maintenance models can be classified as
stationary or dynamic. In stationary models, a long-term stable situation is assumed
during which the rules for maintenance do not change over the planning horizon.
The models in this overview are mostly of this type. However, stationary models
cannot incorporate dynamically changing information that arises during operation,
such as varying deterioration of components or unexpected opportunities.
To account for such short-term circumstances, dynamic models have been proposed
that can adapt the long-term plan according to information becoming available in the
short term. This situation yields a dynamic grouping policy [163].
The main extensions of the group maintenance apply to minimal repair perfor-
mance, shock modeling, or periodic inspection implementation.
Additional replacement problems that are investigated in grouping maintenance
models apply to risk management (see [164]), continuous deteriorating process imple-
mentation (see  [165]), or joint optimization of production scheduling (see  [166]).
In [164], the author analyzes the correlation among potential human error, grouping
maintenance, and major accident risk. In [165], the authors introduce a novel
stochastic Petri net and genetic algorithm-based approach to solve maintenance
modeling and optimization problems. The authors in [166] present a Bayesian approach to
develop a joint optimization model connecting group PM with production schedul-
ing of a series system.
Group maintenance models are investigated widely in the literature. A review is
presented in Table 1.3.
TABLE 1.3
Summary of Group Maintenance Policies for Deteriorating Multi-unit Systems
| Planning Horizon | Type of Group Maintenance | Optimality Criterion | Modeling Method | Typical References |
|---|---|---|---|---|
| Infinite (∞) | Static (T-policy) | The long-run cost per unit time | Analytical | [167,168] |
| | | The expected cost per unit time | | [169] |
| | | The expected cost rate | | [148] |
| | | System maintenance cost in a unit time | | [170] |
| | | Stationary availability | | [161] |
| | | Expected discounted cost to go | Control theory of jump process/dynamic programming | [171] |
| | Static (T-age policy) | The long run expected cost per unit of time | Analytical | [172,173] |
| | Static (T-age policy, m-failure policy, (m, T)-policy) | The expected cost per unit time | | [174,175] |
| | Static | The long run expected cost per unit of time | Bayesian approach | [164,176,177] |
| | | The long-run average maintenance cost per unit time | Markov processes | [35] |
| | | | Discrete-time Markov decision chains/simulation | [178] |
| | | Total maintenance possession time and cost | Petri-net and GA-based approach | [165] |
| | | Total maintenance costs | Random-key genetic algorithm | [166] |
| Finite rolling horizon | Dynamic | The long-term tentative plan | Dynamic programming | [179] |
| | | The economic profit of group | Heuristic approach based on genetic algorithm and MULTIFIT algorithm | [180] |
| | | The economic profit of group | Heuristic approach based on GA | [181] |
| | | Penalty cost function, total maintenance cost savings over the scheduling interval | Analytical | [163] |

Another group of maintenance policies for multi-unit systems with component
dependence is opportunity-based maintenance. During performance processes of a
multi-unit system, some maintenance opportunities may occur due to breakdowns
of units in a series configuration. In most cases opportunities cannot be predicted in
advance and, because of their random occurrence, opportunistic maintenance mod-
els can be used for effective maintenance planning. Types of opportunistic mainte-
nance policies considered in this chapter are based mainly on [182] and include four
main groups of maintenance policies:

• Age-based opportunity maintenance models
• Failure-based opportunity maintenance models
• Opportunity and condition-based maintenance models
• Mixed PM models that consider implementation of different types of maintenance
policies

The detailed classification and review of the given opportunity-based maintenance
policies is presented in Table 1.4.
The main extensions of opportunity-based maintenance models apply to minimal
repair performance, imperfect maintenance implementation, data uncertainty inves-
tigation, finite horizon case, or shock modeling. The  main applications are main-
tenance of production systems (see  [183–185]) or offshore wind turbine systems
(see [186]).
A few papers deal with an opportunistic maintenance policy under a multi-criteria
perspective. The main research studies apply to production system performance
(see [187]) and a power plant (see [188]).
Worth mentioning also is a group of risk-based opportunistic maintenance
models. This  modeling problem is considered in  [189]. The authors develop a
reliability model for a system that releases signals as it degrades. These released
signals are used to inform opportunistic maintenance. They assume that system
vulnerability to shock occurrence is dependent on its deterioration level. The risk-
based opportunistic maintenance model also is analyzed in [190]. In [190], the
authors present the model that uses risk evaluation of system shutdown caused by
component failure. The proposed approach is based on the analysis of fault cou-
pling features of a complex mechanical system considering age and risk factors.
In  this research area, the issues of dynamic opportunistic maintenance policy
optimization are analyzed. For example, in [191], the authors develop a dynamic
opportunistic maintenance policy for a continuously monitored multi-unit series
system with imperfect maintenance. The model is based on short-term optimization.
It is assumed also that a unit’s hazard rate distribution in the current maintenance
cycle can be directly derived through condition-based predictive maintenance.
This problem is later investigated in [192], where the authors present a dynamic
opportunistic condition-based maintenance strategy that is based on real-time
predictions of the remaining useful life of components with stochastic and economic
dependencies.

TABLE 1.4
Summary of Opportunity-Based Maintenance Policies for Deteriorating Multi-unit Systems
| Planning Horizon | Maintenance Model | Optimality Criterion | Modeling Method | Typical References |
|---|---|---|---|---|
| Infinite (∞) | Age-based | Expected total discounted time/expected total discounted value of good time minus costs, total discounted good time vs cost ratio | Analytical | [203] |
| | | Cost rate | | [168] |
| | | Expected long-run cost per unit time | Analytical (deterministic problem) | [204] |
| | | Optimal production stops | Odds algorithm-based approach | [198] |
| | | One-step cost function | Discrete-time Markov chain | [205] |
| | | Total expected maintenance cost per unit per day | Simulation | [206] |
| | | Total maintenance cost | | [207] |
| | | The expected cost per unit time | Monte Carlo simulation | [184] |
| | | | MC simulation and Bootstrap technique | [208] |
| Finite | | Total maintenance cost in a given time period | Shortest path algorithm | [209] |
| | | The total maintenance cost | Linear programming | [194] |
| | | The cumulative maintenance cost in a given time horizon | Monte Carlo simulation | [210] |
| | | The average cost per unit time | Heuristic approach | [211] |
| Infinite (∞) | Failure-based | Expected system cost rate | Analytical | [212] |
| | | The long-run mean cost rate | | [213] |
| | | Long-run expected system maintenance cost per unit time | | [214] |
| | | Number of failures | Analytical/coupling technique | [215] |
| | | The total maintenance cost rate | Dynamic simulation | [190] |
| | | Signals of failure state and degradation state of a component | Signal model/simulation | [189] |
| | | System availability | MAM | [188] |
| Finite | | The expected total maintenance cost | Analytical | [185] |
| | | The total maintenance cost | Simulation | [216] |
| | | | Genetic algorithm | [217] |
| | | | MAM-APB model | [187] |
| | | Survival function | Expert judgment | [202] |
| | | The total maintenance cost | Genetic algorithm | [196] |
| Infinite (∞) | Condition-based | The long-run expected maintenance cost rate | Simulation | [218] |
| | | The long-run average maintenance cost rate | Markov decision process | [219] |
| | | The long-run average maintenance cost per blade and per time unit | Analytical | [220] |
| | | | | [186] |
| Finite | | Cumulative OM cost saving | | [191] |
| | | The long-term average maintenance cost | | [192] |
| | | The expected total cost per unit time | Dynamic Bayesian networks | [201] |
| Infinite (∞) | Mixed PM | Joint stationary probability | A deterioration state space partition method | [182] |
| | | Optimal total cost | Discrete-event simulation model | [200] |
| Finite | | The expected cost incurred in a cycle | Analytical | [221] |
| | | The total maintenance cost per unit time | | [193] |
| | | Average net benefit over failure replacement policy | Genetic algorithm | [195] |
| | | The expected maintenance cost | Dynamic programming | [197] |
| – | | Components proximity measure | Fuzzy approach | [199] |

In  [193], the authors propose a dynamic opportunistic PM optimization policy
for multi-unit series systems that integrates two PM techniques: periodic PM and
sequential PM policies. Whenever one unit reaches its reliability threshold level, the
whole system has to stop and at that time PM opportunities arise for other units of
the system. The optimal PM policy is determined by maximizing the cost saving for
short-term cumulative opportunistic maintenance of the whole system.
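The threshold mechanism described for [193] can be illustrated with a simplified sketch. It assumes Weibull reliability for each unit, R(t) = exp(-(t/scale)^shape), and replaces the cost-saving optimization of the cited model with a plain time-window rule: units that would reach their own threshold soon after the system stop are maintained at the same stop. The function names, the window rule, and the parameters are assumptions made for illustration.

```python
import math


def time_to_threshold(scale, shape, threshold):
    """Age at which a Weibull unit's reliability drops to the threshold."""
    return scale * (-math.log(threshold)) ** (1.0 / shape)


def opportunistic_group(units, threshold, window):
    """Return the system stop time and the units maintained at that stop.

    units: mapping of unit name to (scale, shape) Weibull parameters.
    window: units due within this time after the stop join the group
            (an assumed rule, not the criterion used in [193]).
    """
    due = {name: time_to_threshold(scale, shape, threshold)
           for name, (scale, shape) in units.items()}
    stop_time = min(due.values())   # first unit to hit its threshold
    group = sorted(name for name, t in due.items()
                   if t - stop_time <= window)
    return stop_time, group
```

With a reliability threshold of 0.9, a unit with scale 100 and shape 2 reaches the threshold at roughly 32.5 time units; any other unit due within the chosen window after that point is grouped into the same stop.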
Moreover, some research studies are based on the implementation of lin-
ear programming (see  [194]), genetic algorithms (see  [195,196]), dynamic pro-
gramming (see  [197]), theory of optimal stopping (see  [198]), fuzzy modeling
approach (see [199]), and simulations (see [200]). A generalized modeling method
for maintenance optimization of single- and multi-unit systems is given in [182].
Moreover, a Bayesian perspective in opportunistic maintenance is investigated
in  [201], where the authors propose a PM policy for multi-component systems
based on dynamic Bayesian networks (DBN)—Hazard and Operability Study
(HAZOP) model. The use of expert judgment to parameterize a model for degra-
dation, maintenance, and repair is provided in [202].
The last group of PM models for multi-unit systems with component dependence
applies to cannibalization maintenance. Cannibalization in maintenance occurs
“when a failed unit in a system is replaced with a functioning component from another
system that is failed for some other reason” [222]. The key issue in cannibalization
is how to use the components of failed units to maximize the number of working
units. Thus, cannibalization actions often are used in systems with large costs
associated with the maintenance and operation of their critical components (e.g., critical
infrastructures, transport systems, and production systems).
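The key issue can be made concrete with a small counting sketch. It assumes identical units that each need exactly one working component of every type, components that are fully interchangeable between units, no spare parts, and no cost or time for the swap itself. Under these assumptions the scarcest component type limits the number of units that can be made operational. The function and the data are hypothetical.

```python
from collections import Counter


def max_working_units(failed_components_per_unit, component_types):
    """Maximum number of operational units after unrestricted cannibalization.

    failed_components_per_unit: one collection of failed component types
                                per unit in the fleet.
    component_types: every component type a unit needs to operate.
    """
    n_units = len(failed_components_per_unit)
    failed_count = Counter()
    for failed in failed_components_per_unit:
        failed_count.update(set(failed))
    # The component type with the most failures is the bottleneck.
    worst = max((failed_count[c] for c in component_types), default=0)
    return n_units - worst
```

For a fleet of four units in which two engines and two radars have failed across three of the units, only one unit works before any swap, while cannibalization raises the number of working units to two.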
In the recent literature, a significant amount of research is available on the use
of mathematical modeling to analyze the effects of cannibalization. For a literature
survey, see [18,223,224].
Following  [222,225], this research can be separated into the three main
approaches [18]:

• Reliability-based models
• Inventory-based maintenance models
• Simulation (queuing) maintenance models

The detailed classification and review of the given cannibalization maintenance
policies is presented in Table 1.5.

TABLE 1.5
Summary of Cannibalization Maintenance Policies for Deteriorating
Multi-unit Systems
| Optimality Criterion | Approach | Modeling Method | Typical References |
|---|---|---|---|
| System minimum condition | Reliability-based | Analytical (allocation model) | [226] |
| Cannibalized structure function | | | [227] |
| Four measures: expected system state, defectives per failed machine, MTTCF (a), total cannibalizations | | Analytical (allocation model)/simulation | [228] |
| The survival function of number of units of equipment available for use at the end of given time period | | Analytical | [229] |
| System reliability for mission | | Nonlinear programming | [225] |
| Total profit resulting from a component reusing | | Simulation | [230] |
| Reasons for product returns | | Case study | [223] |
| Expected number of inoperative machines | | Markov process | [34] |
| The average total maintenance investments | Simulation-based | A closed-network, discrete-event simulation | [222] |
| Average total maintenance costs/average fleet readiness | | | [231] |
| NORS rate | Inventory-based | NORS model | [232] |
| Optimal portfolio, optimal stock level | | Allocation problem – heuristic approach | [233] |
| The expected availability objective function | | DRIVE model | [224] |
| Aircraft availability | | Analytical (AAM model) | [234] |
| Cannibalization rates | | Analytical | [235] |
| Cannibalization rates | | Performance indicators analysis | [236] |
| Product cannibalization | | Statistical data analysis | [237] |
| e.g., Inter-Squadron cannibalization | | Balanced Scorecard | [238] |

(a) MTTCF: mean time to complete failure

1.4  CONCLUSIONS AND DIRECTIONS FOR FURTHER RESEARCH


In this chapter, the literature on the most commonly used preventive maintenance
models for single- and multi-unit systems is reviewed. The literature was selected
using Google Scholar as a search engine together with the ScienceDirect, JStor,
SpringerLink, and SAGEJournals databases. The author primarily searched the relevant
literature based on keywords, abstracts, and titles. The following main terms and/or
a combination of them were used for searching the literature: preventive maintenance,
maintenance model, time-based maintenance.
The selection methodology was based on searching for the defined keywords
and then choosing the models that satisfy the main reviewing criteria. For example,
a Google search for the keyword preventive maintenance returned about
260 million hits. In the ScienceDirect database, this keyword had about
68,440  hits. Comparing the obtained search results to the main required criteria,
such as age-based maintenance model, block-based maintenance model, mainte-
nance optimization for multi-unit system, and periodic maintenance, the author
focused on the most frequently used inspection models published from 1964 to 2015.
Preventive maintenance issues have been investigated by various researchers and
practitioners for over 60  years. Thus, it is impossible to present all of the known
models that appeared during the period under consideration. A few of the other
problems that are investigated in the literature but omitted in this chapter are:

• Spare part optimization issues (see [239,240])
• Data uncertainty (see [241,242])
• Maintenance decision-making issues (see [243])

Moreover, the given literature overview leads to the following main
conclusions:

• The most commonly used mathematical methods for analyzing maintenance
scheduling problems include applied probability theory, renewal reward
processes, and Markov decision theory. When the functional relationship
between the system’s input and output parameters cannot be described
analytically, various maintenance models have been developed that apply
linear and nonlinear programming, dynamic programming, simulation pro-
cesses, genetic algorithms, Bayesian approach, and heuristic approaches,
which were only mentioned in the presented overview.
• The investigated maintenance models usually are based on a cost criterion to
obtain the optimal maintenance parameters. However, maintenance actions
are focused on improving system dependability. Thus, for complex systems, where
various types of components have different maintenance cost and different
reliability importance in the system, it is more appropriate to analyze the opti-
mal maintenance policy under cost and reliability constraints simultaneously.
• Many maintenance models consider the grouping of maintenance activities
on a long-term basis with an infinite horizon. In practice, planning horizons
are usually finite for a number of reasons: information is only available
over the short term, a modification of the system changes the maintenance
problem completely, and some events are unpredictable.
• In most of the existing literature on maintenance theory, the maintenance time
is assumed to be negligible. This assumption makes availability modeling
impossible or unrealistic.
• Most maintenance models are based on the assumption of fully available
logistic support when it is needed. Thus, in the modeling approach, it is
assumed that whenever a system component is to be replaced, a new com-
ponent is immediately available. However, considering real life situations,
the number of spare parts is usually limited and the procurement lead-time
is non-negligible. This  situation implies that the maintenance policy and
spare provisioning policy should be modeled and optimized jointly.
• Another problem applies to data availability and reliability. Maintenance
and replacement decisions are based on the information available, such as
the failure data of the equipment under consideration, maintenance per-
formance times, and type and number of necessary support resources.
Sufficient data rarely exist for estimating parameters in a complex model,
and if data do exist, they are often unreliable. This  situation makes the
application of mathematical models to support maintenance and replace-
ment decisions less obvious.
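The conclusion about negligible maintenance time can be illustrated with the standard steady-state availability ratio, mean uptime divided by the sum of mean uptime and mean downtime. The sketch below is a generic illustration with hypothetical numbers, not a model taken from the reviewed papers.

```python
def steady_state_availability(mean_uptime, mean_downtime):
    """Long-run fraction of time the system is operational."""
    return mean_uptime / (mean_uptime + mean_downtime)
```

With a mean uptime of 990 hours and a mean downtime of 10 hours the availability is 0.99, whereas setting the downtime to zero always returns 1.0, so a model that neglects maintenance time cannot distinguish between policies on the basis of availability.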

In summary, traditional PM programs often require very time-consuming, manual
data and rely heavily on “tribal knowledge” estimates or require in-depth knowledge
and analysis of each individual piece of equipment on an ongoing basis to stay
up-to-date. Thus, based on the author's main conclusions and following the global
trends in maintenance (see [244,245] for recent reports), future interest will most
likely focus on more advanced maintenance optimization models that are based
on the use of digital technologies.

REFERENCES
1. McCall, J. J. (1965). Maintenance policies for stochastically failing equipment: A survey.
Management Science 11(5): 493–524.
2. Pierskalla, W. P. and Voelker, J. A. (1976). A survey of maintenance models: The con-
trol and surveillance of deteriorating systems. Naval Research Logistics Quarterly 23:
353–388.
3. Valdez-Flores, C. and Feldman, R. (1989). A survey of preventive maintenance mod-
els for stochastically deteriorating single-unit systems. Naval Research Logistics 36:
419–446.
4. Cho, I. D. and Parlar, M. (1991). A survey of maintenance models for multi-unit sys-
tems. European Journal of Operational Research 51(1): 1–23.
5. Dekker, R., Wildeman, R. E., and Van Der Duyn Schouten, F. A. (1997). A  review
of multi-component maintenance models with economic dependence. Mathematical
Methods of Operations Research 45: 411–435.
6. Mazzuchi, T. A., Van Noortwijk, J. M., and Kallen, M. J. (2007). Maintenance optimi-
zation. Technical Report, TR-2007-9.
7. Nakagawa, T. and Mizutani, S. (2009). A  summary of maintenance policies for a
finite interval. Reliability Engineering and System Safety 94: 89–96. doi:10.1016/
j.ress.2007.04.004.
8. Nicolai, R. P. and Dekker, R. (2007). A review of multi-component maintenance models.
In: Aven, T. and Vinnem, J. M. (eds.) Risk, Reliability and Societal Safety: Proceedings
of European Safety and Reliability Conference ESREL 2007, Stavanger, Norway, June
25–27, 2007, Leiden, the Netherlands: Taylor & Francis Group: pp. 289–296.
9. Nowakowski, T. and Werbińska, S. (2009). On problems of multi-component system
maintenance modelling. International Journal of Automation and Computing 6(4):
364–378.
10. Pham, H. and Wang, H. (1996). Imperfect maintenance. European Journal of
Operational Research 94: 425–438.
11. Pophaley, M. and Ways, R. K. (2010). Plant maintenance management practices in
automobile industries: A  retrospective and literature review. Journal of Industrial
Engineering and Management 3(3): 512–541. doi:10.3926/jiem..v3n3.p512-541.
12. Popova, E. and Popova, I. (2014). Replacement strategies. Wiley StatsRef: Statistics
Reference Online.
13. Sarkar, A., Behera, D. K., and Kumar, S. (2012). Maintenance policies of single
and multi-unit systems in the past and present. International Journal of Current
Engineering and Technology 2(1): 196–205.
14. Vasili, M., Hond, T. S., Ismail, N., and Vasili, M. (2011). Maintenance optimization
models: A review and analysis. In: Proceedings of the 2011 International Conference
on Industrial Engineering and Operations Management, January 22–24, 2011, Kuala
Lumpur, Malaysia: pp. 1131–1138.
15. Wang, H. (2002). A survey of maintenance policies of deteriorating systems. European
Journal of Operational Research 139(3): 469–489. doi:10.1016/S0377-2217(01)00197-7.
16. Wang, H. and Pham, H. (2003). Optimal imperfect maintenance models. In: Pham,
H. (ed.) Handbook of Reliability Engineering, London, UK: Springer-Verlag London
Limited: pp. 397–414.
17. Wang, H. and Pham, H. (1997). A survey of reliability and availability evaluation of
complex networks using Monte Carlo techniques. Microelectronics Reliability 37(2):
187–209. doi:10.1016/S0026-2714(96)00058-3.
18. Werbińska-Wojciechowska, S. (2019). Technical System Maintenance. Delay-Time-
Based Modelling. London, UK: Springer.
19. Ahmad, R. and Kamaruddin, S. (2012). An overview of time-based and condition-
based maintenance in industrial application. Computers and Industrial Engineering
63: 135–149. doi:10.1016/j.cie.2012.02.002.
20. Geurts, J. H. J. (1983). Optimal age replacement versus condition based replacement:
Some theoretical and practical considerations. Journal of Quality Technology 15(4):
171–179.
21. Werbińska-Wojciechowska, S. (2014). Multicomponent technical systems mainte-
nance models: State of art (in Polish). In: Siergiejczyk, M. (ed.) Technical Systems
Maintenance Problems: Monograph (in Polish), Warsaw, Poland: Publication House of
Warsaw University of Technology: pp. 25–57.
22. Barlow, R. E. and Proschan, F. (1964). Comparison of replacement policies, and
renewal theory implications. The  Annals of Mathematical Statistics 35(2): 577–589.
doi:10.1214/aoms/1177703557.
23. Wang, H. and Pham, H. (2006). Reliability and Optimal Maintenance, London, UK:
Springer-Verlag.
24. Aven, T. and Dekker, R. (1997). A  useful framework for optimal replacement
models. Reliability Engineering and System Safety 58(1): 61–67. doi:10.1016/
S0951-8320(97)00055-0.
25. Block, H. W., Langberg, N. A., and Savits, T. H. (1990). Maintenance comparisons:
Block policies. Journal of Applied Probability 27: 649–657. doi:10.2307/3214548.
26. Block, H. W., Langberg, N. A., and Savits, T. H. (1990). Comparisons for maintenance
policies involving complete and minimal repair. Lecture Notes-Monograph Series
16(Topics in Statistical Dependence): 57–68.
27. Christer, A. H. and Keddie, E. (1985). Experience with a stochastic replacement model.
Journal of Operational Research Society 36(1): 25–34.
28. Frostig, E. (2003). Comparison of maintenance policies with monotone failure rate dis-
tributions. Applied Stochastic Models in Business and Industry 19: 51–65. doi:10.1002/
asmb.485.
29. Langberg, N. A. (1988). Comparisons of replacement policies. Journal of Applied
Probability 25: 780–788.
30. Thomas, L. C. (1986). A survey of maintenance and replacement models for
maintainability and reliability of multi-item systems. Reliability Engineering 16(4): 297–309.
31. Aboulfath, F. (1995). Optimal maintenance schedules for a fleet of vehicles under the
constraint of the single repair facility. MSc Thesis. Toronto, ON: University of Toronto.
32. Nicolai, R. P. and Dekker, R. (2006). Optimal maintenance of multicomponent sys-
tems: A review. Economic Institute Report.
33. Lamberts, S. W. J. and Nicolai, R. P. (2008). Maintenance Models for Systems Sub-
ject to Measurable Deterioration. Rotterdam, the Netherlands: Rozenberg Publishers,
University Dissertations.
34. Fisher, W. W. (1990). Markov process modelling of a maintenance system with spares,
repair, cannibalization and manpower constraints. Mathematical Computer Modelling
13(7): 119–125.
35. Gurler, U. and Kaya, A. (2002). A  maintenance policy for a system with multi-state
components: An approximate solution. Reliability Engineering and System Safety 76:
117–127. doi:10.1016/S0951-8320(01)00125-9.
36. Block, H. W., Langberg, N. A., and Savits, T. H. (1993). Repair replacement policies.
Journal of Applied Probability 30: 194–206. doi:10.2307/3214632.
37. Park, M. and Pham, H. (2016). Cost models for age replacement policies and block
replacement policies under warranty. Applied Mathematical Modelling 40(9–10):
5689–5702. doi:10.1016/j.apm.2016.01.022.
38. Scarf, P. A., Dwight, R., and Al-Musrati, A. (2005). On reliability criteria and the
implied cost of failure for a maintained component. Reliability Engineering and System
Safety 89: 199–207. doi:10.1016/j.ress.2004.08.019.
39. Chowdhury, C. H. (1988). A systematic survey of the maintenance models. Periodica
Polytechnica. Mechanical Engineering 32(3–4): 253–274.
40. Glasser, G. J. (1967). The age replacement problem. Technometrics 9(1): 83–91.
41. Rakoczy, A. and Żółtowski, J. (1977). About the issues on technical object renewal
principles definition (in Polish). In: Proceedings of Winter School on Reliability,
Szczyrk, Poland: pp. 175–191.
42. Yun, W. Y. (1989). An age replacement policy with increasing minimal repair cost.
Microelectronics Reliability 29(2): 153–157.
43. Sheu, S.-H. (1991). A general age replacement model with minimal repair and general
random repair cost. Microelectronics Reliability 31(5): 1009–1017.
44. Sheu, S.-H. and Liou, C.-T. (1992). An age replacement policy with minimal repair and
general random repair cost. Microelectronics Reliability 32(9): 1283–1289.
45. Sheu, S.-H. (1993). A generalized model for determining optimal number of minimal
repairs before replacement. European Journal of Operational Research 69: 38–49.
46. Lim, J. H., Qu, J., and Zuo, M. J. (2016). Age replacement policy based on imperfect
repair with random probability. Reliability Engineering and System Safety 149: 24–33.
doi:10.1016/j.ress.2015.10.020.
47. Mazzuchi, T. A. and Soyer, R. (1996). A Bayesian perspective on some replacement
strategies. Reliability Engineering and System Safety 51: 295–303.
48. Cha, J. H. and Kim, J. J. (2002). On the existence of the steady state availability of
imperfect repair model. Sankhya: The Indian Journal of Statistics 64, series B. Pt.
1: 76–81.
49. Dagpunar, J. S. (1994). Some necessary and sufficient conditions for age replacement
with non-zero downtimes. Journal of Operational Research Society 45(2): 225–229.
50. Vaurio, J. K. (1999). Availability and cost functions for periodically inspected pre-
ventively maintained units. Reliability Engineering and System Safety 63: 133–140.
doi:10.1016/S0951-8320(98)00030-1.
51. Nakagawa, T., Zhao, X., and Yun, W. Y. (2011). Optimal age replacement and inspection
policies with random failure and replacement times. International Journal of Reliability,
Quality and Safety Engineering 18(5): 405–416. doi:10.1142/S0218539311004159.
52. Zhao, X., Mizutani, S., and Nakagawa, T. (2015). Which is better for replacement poli-
cies with continuous or discrete scheduled times? European Journal of Operational
Research 242: 477–486. doi:10.1016/j.ejor.2014.11.018.
53. Wu, S. and Clements-Croome, D. (2005). Preventive maintenance models with random
maintenance quantity. Reliability Engineering and System Safety 90: 99–105.
54. Chien, Y.-H. (2008). A general age-replacement model with minimal repair under
renewing free-replacement warranty. European Journal of Operational Research 186:
1046–1058. doi:10.1016/j.ejor.2007.02.030.
55. Dimitrov, B., Chukova, S., and Khalil, Z. (2004). Warranty costs: An age-dependent
failure/repair model. Naval Research Logistics 51(7): 959–976. doi:10.1002/nav.20037.
56. Ito, K. and Nakagawa, T. (2011). Comparison of three cumulative damage models.
Quality Technology and Quantitative Management 8(1): 57–66.
doi:10.1080/16843703.2011.11673246.
57. Sepehrifar, M. B., Khorshidian, K., and Jamshidian, A. R. (2015). On renewal
increasing mean residual life distributions: An age replacement model with hypoth-
esis testing application. Statistics and Probability Letters 96: 117–122. doi:10.1016/
j.spl.2014.09.009.
58. Sheu, S.-H. (1992). A general replacement of a system subject to shocks. Microelectronics
Reliability 32(5): 657–662.
59. Sheu, S.-H., Griffith, W. S., and Nakagawa, T. (1995). Extended optimal replace-
ment model with random repair cost. European Journal of Operational Research 85:
636–649.
60. Lai, M.-T. and Leu, B.-Y. (1996). An economic discrete replacement policy for a shock
damage model with minimal repairs. Microelectronics Reliability 36(10): 1347–1355.
61. Qian, C., Nakamura, S., and Nakagawa, T. (2003). Replacement and minimal repair pol-
icies for a cumulative damage model with maintenance. Computers and Mathematics
with Applications 46: 1111–1118.
62. Lam, C. T. and Yeh, R. H. (1994). Optimal replacement policies for multi-state deterio-
rating systems. Naval Research Logistics 41(3): 303–315.
63. Segawa, Y., Ohnishi, M., and Ibaraki, T. (1992). Optimal minimal-repair and replace-
ment problem with age dependent cost structure. Computers and Mathematics with
Applications 24(1/2): 91–101.
64. Kumar, D. and Westberg, U. (1997). Maintenance scheduling under age replacement
policy using proportional hazards model and TTT-plotting. European Journal of
Operational Research 99: 507–515.
65. Mahdavi, M. and Mahdavi, M. (2009). Optimization of age replacement policy using
reliability based heuristic model. Journal of Scientific and Industrial Research 68:
668–673.
66. Zhao, X., Al-Khalifa, K. N., and Nakagawa, T. (2015). Approximate method for opti-
mal replacement, maintenance, and inspection policies. Reliability Engineering and
System Safety 144: 68–73. doi:10.1016/j.ress.2015.07.005.
67. Kayid, M., Izadkhah, S., and Alshami, S. (2016). Laplace transform ordering of time
to failure in age replacement models. Journal of the Korean Statistical Society 45(1):
101–113.
68. Christer, A. H. (1986). Comments on finite-period applications of age-based replace-
ment models. IMA Journal of Mathematics in Management 1: 111–124.
69. Kabir, A. B. M. Z. and Farrash, S. H. A. (1996). Simulation of an integrated age replacement
and spare provisioning policy using SLAM. Reliability Engineering and System
Safety 52: 129–138.
70. Wu, S. and Zuo, M. J. (2010). Linear and nonlinear preventive maintenance models.
IEEE Transactions on Reliability 59(1): 242–249. doi:10.1109/TR.2010.2041972.
71. Yeh, R. H. (1997). State-age-dependent maintenance policies for deteriorating systems
with Erlang sojourn time distributions. Reliability Engineering and System Safety 58:
55–60.
72. Crowell, J. I. and Sen, P. K. (1989). Estimation of optimal block replacement policies.
Mimeo series/the Institute of Statistics, the Consolidated University of North Carolina,
Department of Statistics, available at: stat.ncsu.edu.
73. Rakoczy, A. (1980). Simulation method for technical object’s optimal preventive main-
tenance time assessment (in Polish). In: Proceedings of Winter School on Reliability,
Szczyrk, Poland: pp. 143–152.
74. Sheu, S.-H. (1994). Extended block replacement policy with used item and general ran-
dom minimal repair cost. European Journal of Operational Research 79(3): 405–416.
75. Sheu, S.-H. (1991). Periodic replacement with minimal repair at failure and general ran-
dom repair cost for a multi-unit system. Microelectronics Reliability 31(5): 1019–1025.
76. Colosimo, E. A., Santos, W. B., Gilardoni, G. L., and Motta, S. B. (2006). Optimal
maintenance time for repairable systems under two types of failures. In: Soares, C. G.
and Zio, E. (eds.) Safety and Reliability for Managing Risk: Proceedings of European
Safety and Reliability Conference ESREL 2006, Estoril, Portugal, September 18–22,
2006, Leiden, the Netherlands: Taylor & Francis Group.
77. Lai, M.-T. and Yuan, J. (1993). Cost-optimal periodical replacement policy for a system
subjected to shock damage. Microelectronics Reliability 33(8): 1159–1168.
78. Sheu, S.-H. (1998). A generalized age and block replacement of a system subject to
shocks. European Journal of Operational Research 108: 345–362.
79. Sheu, S.-H. and Griffith, W. S. (2002). Extended block replacement policy with shock
models and used items. European Journal of Operational Research 140: 50–60.
doi:10.1016/S0377-2217(01)00224-7.
80. Abdel-Hameed, M. (1986). Optimum replacement of a system subject to shocks.
Journal of Applied Probability 23: 107–114.
81. Abdel-Hameed, M. (1995). Inspection, maintenance and replacement models. Computers
and Operations Research 22(4): 435–441. doi:10.1016/0305-0548(94)00051-9.
82. Zhao, X., Qian, C., and Nakagawa, T. (2017). Comparisons of replacement policies with
periodic times and repair numbers. Reliability Engineering and System Safety 168:
161–170. doi:10.1016/j.ress.2017.05.015.
83. Berthaut, F., Gharbi, A., and Dhouib, K. (2011). Joint modified block replacement and
production/inventory control policy for a failure-prone manufacturing cell. Omega 39:
642–654. doi:10.1016/j.omega.2011.01.006.
84. Drobiszewski, J. and Smalko, Z. (2006). The equable maintenance strategy. Journal of
KONBiN 2: 375–383.
85. Pilch, R., Smolnik, M., Szybka, J., and Wiązania, G. (2014). Concept of preventive
maintenance strategy for a chosen example of public transport vehicles (in Polish). In:
Siergiejczyk, M. (ed.) Maintenance Problems of Technical Systems, Warsaw, Poland:
Publication House of Warsaw University of Science and Technology: pp. 171–182.
86. Kustroń, K. and Cieślak, Ł. (2012). The optimization of replacement time for non-repairable
aircraft component. Journal of KONBiN 2(22): 45–58.
87. Pilch, R. (2017). Determination of preventive maintenance time for milling assemblies
used in coal mills. Journal of Machine Construction and Maintenance 1(104): 81–86.
88. Nakagawa, T. (1986). Periodic and sequential preventive maintenance policies. Journal
of Applied Probability 23: 536–542.
89. Nakagawa, T. and Mizutani, S. (2008). Periodic and sequential imperfect preven-
tive maintenance policies for cumulative damage models. In: Pham, H. (ed.) Recent
Advances in Reliability and Quality in Design, London, UK: Springer.
90. Sheu, S.-H., Chang, C. C., and Chen, Y.-L. (2012). An extended sequential imper-
fect preventive maintenance model with improvement factors. Communications in
Statistics: Theory and Methods 41(7): 1269–1283. doi:10.1080/03610926.2010.542852.
91. Liu, Y., Li, Y., Huang, H.-Z., and Kuang, Y. (2011). An optimal sequential preventive
maintenance policy under stochastic maintenance quality. Structure and Infrastructure
Engineering: Maintenance, Management, Life-Cycle Design and Performance 7(4):
315–322.
92. Peng, W., Liu, Y., Zhang, X., and Huang, H.-Z. (2015). Sequential preventive main-
tenance policies with consideration of random adjustment-reduction features.
Eksploatacja i Niezawodność: Maintenance and Reliability 17(2): 306–313.
93. Kim, H. S., Sub Kwon, Y., and Park, D. H. (2006). Bayesian method on sequential
preventive maintenance problem. The Korean Communications in Statistics 13(1):
191–204.
94. Bergman, B. (1978). Optimal replacement under a general failure model. Advances in
Applied Probability 10: 431–451.
95. Canfield, R. V. (1986). Cost optimization of periodic preventive maintenance. IEEE
Transactions on Reliability R-35(1): 78–81. doi:10.1109/TR.1986.4335355.
96. Lie, C. H. and Chun, Y. H. (1986). An algorithm for preventive maintenance policy.
IEEE Transactions on Reliability R-35(1): 71–75.
97. Jayabalan, V. and Chaudhuri, D. (1992). Cost optimization of maintenance scheduling
for a system with assured reliability. IEEE Transactions on Reliability 41(1): 21–25.
doi:10.1109/24.126665.
98. Tsai, Y.-T., Wang, K.-S., and Teng, H.-Y. (2001). Optimizing preventive maintenance
for mechanical components using genetic algorithms. Reliability Engineering and
System Safety 74: 89–97. doi:10.1016/S0951-8320(01)00065-5.
99. Chan, J.-K. and Shaw, L. (1993). Modeling repairable systems with failure rates that
depend on age and maintenance. IEEE Transactions on Reliability 42(4): 566–571.
doi:10.1109/24.273583.
100. Nakagawa, T. and Osaki, S. (1974). The optimum repair limit replacement policies.
Operational Research Quarterly 25(2): 311–317.
101. Okumoto, K. and Osaki, S. (1976). Repair limit replacement policies with lead time.
Zeitschrift für Operations Research 20: 133–142.
102. Koshimae, H., Dohi, T., Kaio, N., and Osaki, S. (1996). Graphical/statistical approach to
repair limit replacement policies. Journal of the Operations Research Society of Japan 39(2): 230–246.
103. Dohi, T., Kaio, N., and Osaki, S. (2000). A graphical method to repair-cost limit
replacement policies with imperfect repair. Mathematical and Computer Modelling 31:
99–106. doi:10.1016/S0895-7177(00)00076-5.
104. Dohi, T., Ashioka, A., Kaio, N., and Osaki, S. (2006). Statistical estimation algorithms
for repairs-time limit replacement scheduling under earning rate criteria. Computers
and Mathematics with Applications 51: 345–356. doi:10.1016/j.camwa.2005.11.004.
105. Dohi, T., Ashioka, A., Kaio, N., and Osaki, S. (2003). The optimal repair-time limit
replacement policy with imperfect repair: Lorenz transform approach. Mathematical
and Computer Modelling 38: 1169–1176. doi:10.1016/S0895-7177(03)90117-8.
106. Dohi, T., Kaio, N., and Osaki, S. (2003). A new graphical method to estimate the
optimal repair-time limit with incomplete repair and discounting. Computers and
Mathematics with Applications 46: 999–1007. doi:10.1016/S0898-1221(03)90114-3.
107. Dohi, T., Matsushima, N., Kaio, N., and Osaki, S. (1996). Nonparametric repair-limit
replacement policies with imperfect repair. European Journal of Operational Research
96: 260–273.
108. Beichelt, F. (1992). A general maintenance model and its application to repair
limit replacement policies. Microelectronics Reliability 32(8): 1185–1196.
doi:10.1016/0026-2714(92)90036-K.
109. Bai, D. S. and Yun, W. Y. (1986). An age replacement policy with minimal repair cost
limit. IEEE Transactions on Reliability R-35(4): 452–454.
110. Yun, W. Y. and Bai, D. S. (1987). Cost limit replacement policy under imperfect repair.
Reliability Engineering 19: 23–28.
111. Yun, W. Y. and Bai, D. S. (1988). Repair cost limit replacement policy under imperfect
inspection. Reliability Engineering and System Safety 23: 59–64.
112. Dohi, T., Takeita, K., and Osaki, S. (2000). Graphical method for determining/estimating
optimal repair-limit replacement policies. International Journal of Reliability, Quality
and Safety Engineering 7(1): 43–60.
113. Lai, M.-T. (2014). Optimal replacement period with repair cost limit and cumulative
damage model. Eksploatacja i Niezawodność: Maintenance and Reliability 16(2):
246–252.
114. Beichelt, F. (1999). A general approach to total repair cost limit replacement policies.
ORiON 15(1/2): 67–75.
115. Chang, C.-C., Sheu, S.-H., and Chen, Y.-L. (2013). Optimal replacement model with
age-dependent failure type based on a cumulative repair-cost limit policy. Applied
Mathematical Modelling 37: 308–317. doi:10.1016/j.apm.2012.02.031.
116. Chang, C.-C., Sheu, S.-H., and Chen, Y.-L. (2013). Optimal number of minimal repairs
before replacement based on a cumulative repair-cost limit policy. Computers and
Industrial Engineering 59: 603–610. doi:10.1016/j.cie.2010.07.005.
117. Kapur, P. K. and Garg, R. B. (1989). Optimal number of minimal repairs before replacement
with repair cost limit. Reliability Engineering and System Safety 26: 35–46.
118. Chien, Y.-H. and Sheu, S.-H. (2006). Extended optimal age-replacement policy with
minimal repair of a system subject to shocks. European Journal of Operational
Research 174: 169–181. doi:10.1016/j.ejor.2005.01.032.
119. Sheu, S.-H. (1999). Extended optimal replacement model for deteriorating systems.
European Journal of Operational Research 112: 503–516.
120. Chang, C.-C. (2014). Optimum preventive maintenance policies for systems subject to
random working times, replacement, and minimal repair. Computers and Industrial
Engineering 67: 185–194. doi:10.1016/j.cie.2013.11.011.
121. Martorell, S., Sanchez, A., and Serradell, V. (1999). Age-dependent reliability model
considering effects of maintenance and working conditions. Reliability Engineering
and System Safety 64: 19–31.
122. Jiang, R. and Ji, P. (2002). Age replacement policy: A multi-attribute value
model. Reliability Engineering and System Safety 76: 311–318. doi:10.1016/
S0951-8320(02)00021-2.
123. Sheu, S.-H. and Chien, Y.-H. (2004). Optimal age-replacement policy of a system sub-
ject to shocks with random lead-time. European Journal of Operational Research 159:
132–144. doi:10.1016/S0377-2217(03)00409-0.
124. Legat, V., Zaludowa, A. H., Cervenka, V., and Jurca, V. (1996). Contribution to opti-
mization of preventive replacement. Reliability Engineering and System Safety 51:
259–266.
125. Nakagawa, T. and Kowada, M. (1983). Analysis of a system with minimal repair and its
application to replacement policy. European Journal of Operational Research 12(2):
176–182.
126. Park, D. H., Jung, G. M., and Yum, J. K. (2000). Cost minimization for periodic main-
tenance policy of a system subject to slow degradation. Reliability Engineering and
System Safety 68(2): 105–112. doi:10.1016/S0951-8320(00)00012-0.
127. Sheu, S.-H., Chen, Y.-L., Chang, C. H.-C. H., and Zhang, Z. G. (2016). A note on a
two variable block replacement policy for a system subject to non-homogeneous pure
birth shocks. Applied Mathematical Modelling 40(5–6): 3703–3712.
doi:10.1016/j.apm.2015.10.001.
128. Bukowski, L. (1980). Optimization of technical systems maintenance policy (case
study of metallurgical production line) (in Polish). In: Proceedings of Winter School on
Reliability. Katowice, Poland: Centre for Technical Progress: pp. 47–62.
129. Zhao, Y. X. (2003). On preventive maintenance policy of a critical reliability level for
system subject to degradation. Reliability Engineering and System Safety 79: 301–308.
doi:10.1016/S0951-8320(02)00201-6.
130. Jiang, X., Cheng, K., and Makis, V. (1998). On the optimality of repair-cost-limit poli-
cies. Journal of Applied Probability 35: 936–949.
131. Segawa, Y. and Ohnishi, M. (2000). The average optimality of a repair-limit replace-
ment policy. Mathematical and Computer Modelling 31: 327–334.
132. Murthy, D. N. P. and Nguyen, D. G. (1988). An optimal repair cost limit policy for ser-
vicing warranty. Mathematical and Computer Modelling 11: 595–599.
133. Frees, E. W. (1986). Optimizing costs on age replacement policies. Stochastic Processes
and their Applications 21: 195–212.
134. Maillart, L. M. and Fang, X. (2006). Optimal maintenance policies for serial, multi-
machine systems with non-instantaneous repairs. Naval Research Logistics 53(8):
804–813.
135. Sheu, S.-H., Yeh, R. H., Lin, Y.-B., and Juang, M.-G. (1999). A Bayesian perspective
on age replacement with minimal repair. Reliability Engineering and System Safety 65:
55–64.
136. Sheu, S.-H., Sung, C. H.-K., Hsu, T.-S., and Chen, Y.-C. H. (2013a). Age replacement
policy for a two-unit system subject to non-homogeneous pure birth shocks. Applied
Mathematical Modelling 37: 7027–7036. doi:10.1016/j.apm.2013.02.022.
137. Sheu, S.-H., Zhang, Z. G., Chien, Y.-H., and Huang, T.-H. (2013). Age replacement pol-
icy with lead-time for a system subject to non-homogeneous pure birth shocks. Applied
Mathematical Modelling 37: 7717–7725. doi:10.1016/j.apm.2013.03.017.
138. Dekker, R. and Dijkstra, M. C. (1992). Opportunity-based age replacement:
Exponentially distributed times between opportunities. Naval Research Logistics 39:
175–190.
139. Iskandar, B. P. and Sandoh, H. (2000). An extended opportunity-based age replacement
policy. RAIRO Operations Research 34: 145–154.
140. Jhang, J. P. and Sheu, S. H. (1999). Opportunity-based age replacement policy with
minimal repair. Reliability Engineering and System Safety 64: 339–344.
141. Satow, T. and Osaki, S. (2003). Opportunity-based age replacement with different
intensity rates. Mathematical and Computer Modelling 38: 1419–1426. doi:10.1016/
S0895-7177(03)90145-2.
142. Leung, F. K. N., Zhang, Y. L., and Lai, K. K. (2011). Analysis for a two-dissimilar-
component cold standby repairable system with repair priority. Reliability Engineering
and System Safety 96: 1542–1551. doi:10.1016/j.ress.2011.06.004.
143. Armstrong, M. J. (2002). Age repair policies for the machine repair problem. European
Journal of Operational Research 138: 127–141. doi:10.1016/S0377-2217(01)00135-7.
144. Van Dijkhuizen, G. C. and Van Harten, A. (1998). Two-stage generalized age mainte-
nance of a queue-like production system. European Journal of Operational Research
108: 363–378.
145. Scarf, P. A. and Deara, M. (2003). Block replacement policies for a two-component
system with failure dependence. Naval Research Logistics 50: 70–87. doi:10.1002/
nav.10051.
146. Yusuf, I. and Ali, U. A. (2012). Structural dependence replacement model for parallel
system of two units. Journal of Basic and Applied Science 20(4): 324–326.
147. Lai, M.-T. and Yuan, J. (1991). Periodic replacement model for a parallel system subject
to independent and common cause shock failures. Reliability Engineering and System
Safety 31(3): 355–367.
148. Yasui, K., Nakagawa, T., and Osaki, S. (1988). A summary of optimum replacement
policies for a parallel redundant system. Microelectronics Reliability 28(4): 635–641.
149. Jodejko, A. (2008). Maintenance problems of technical systems composed of hetero-
geneous elements. In: Proceedings of Summer Safety and Reliability Seminars, June
22–28, 2008, Gdańsk-Sopot, Poland: pp. 187–194.
150. Sheu, S.-H., Lin, Y.-B., and Liao, G.-L. (2006). Optimum policies for a system with
general imperfect maintenance. Reliability Engineering and System Safety 91(3): 362–
369. doi:10.1016/j.ress.2005.01.015.
151. Sheu, S.-H. (1990). Periodic replacement when minimal repair costs depend on the age
and the number of minimal repairs for a multi-unit system. Microelectronics Reliability
30(4): 713–718.
152. Zequeira, R. I. and Berenguer, C. (2005). A block replacement policy for a periodically
inspected two-unit parallel standby safety system. In: Kołowrocki, K. (ed.) Advances in
Safety and Reliability: Proceedings of the European Safety and Reliability Conference
(ESREL 2005), Gdynia-Sopot-Gdańsk, Poland, June 27–30, 2005, Leiden, the
Netherlands: A. A. Balkema: pp. 2091–2098.
153. Park, J. H., Lee, S. C., Hong, J. W., and Lie, C. H. (2009). An optimal block preventive
maintenance policy for a multi-unit system considering imperfect maintenance.
Asia-Pacific Journal of Operational Research 26(6): 831–847.
doi:10.1142/S021759590900250X.
154. Grigoriev, A., Van De Klundert, J., and Spieksma, F. C. R. (2006). Modeling and solv-
ing the periodic maintenance problem. European Journal of Operational Research
172: 783–797. doi:10.1016/j.ejor.2004.11.013.
155. Ke, H. and Yao, K. (2016). Block replacement policy with uncertain lifetimes. Reliability
Engineering and System Safety 148: 119–124. doi:10.1016/j.ress.2015.12.008.
156. Wells, C. H. E. (2014). Reliability analysis of a single warm-standby system subject
to repairable and non-repairable failures. European Journal of Operational Research
235: 180–186. doi:10.1016/j.ejor.2013.12.027.
157. Scarf, P. A. and Cavalcante, C. A. V. (2010). Hybrid block replacement and inspection
policies for a multi-component system with heterogeneous component lives. European
Journal of Operational Research 206: 384–394. doi:10.1016/j.ejor.2010.02.024.
158. Anisimov, V. V. (2005). Asymptotic analysis of stochastic block replacement policies
for multi-component systems in a Markov environment. Operations Research Letters
33: 26–34. doi:10.1016/j.orl.2004.03.009.
159. Caldeira, D. J., Taborda, C. J., and Trigo, T. P. (2012). An optimal preventive main-
tenance policy of parallel-series systems. Journal of Polish Safety and Reliability
Association Summer Safety and Reliability Seminars 3(1): 29–34.
160. Duarte, A. C., Craveiro Taborda, J. C., Craveiro, A., and Trigo, T. P. (2005). Optimization
of the preventive maintenance plan of a series components system. In: Kołowrocki,
K. (ed.) Advances in Safety and Reliability: Proceedings of the European Safety and
Reliability Conference (ESREL 2005), Gdynia-Sopot-Gdańsk, Poland, June 27–30,
2005, Leiden, the Netherlands: A. A. Balkema.
161. Chelbi, A., Ait-Kadi, D., and Aloui, H. (2007). Availability optimization for multi-
component systems subjected to periodic replacement. In: Aven, T. and Vinnem, J. M.
(eds.) Risk, Reliability and Societal Safety: Proceedings of European Safety and
Reliability Conference ESREL 2007, Stavanger, Norway, June 25–27, 2007, Leiden,
the Netherlands: Taylor & Francis Group.
162. Okulewicz, J. and Salamonowicz, T. (2008). Preventive maintenance with imperfect
repairs of a system with redundant objects. In: Proceedings of Summer Safety
and Reliability Seminars SSARS 2008, June 22–28, 2008, Gdańsk-Sopot, Poland:
pp. 279–286.
163. Do Van, P., Barros, A., Berenguer, C. H., and Bouvard, K. (2013). Dynamic group-
ing maintenance with time limited opportunities. Reliability Engineering and System
Safety 120: 51–59. doi:10.1016/j.ress.2013.03.016.
164. Okoh, P. (2015). Maintenance grouping optimization for the management of risk
in offshore riser system. Process Safety and Environmental Protection 98: 33–39.
doi:10.1016/j.psep.2015.06.007.
165. Zhang, T., Cheng, Z., Liu, Y.-J., and Guo, B. (2012). Maintenance scheduling for multi-
unit system: A stochastic Petri-net and genetic algorithm based approach. Eksploatacja
i Niezawodność: Maintenance and Reliability 14(3): 256–264.
166. Xiao, L., Song, S., Chen, X., and Coit, D. W. (2016). Joint optimization of production
scheduling and machine group preventive maintenance. Reliability Engineering and
System Safety 146: 68–78. doi:10.1016/j.ress.2015.10.013.
167. Sandve, K. and Aven, T. (1999). Cost optimal replacement of monotone, repairable
systems. European Journal of Operational Research 116: 235–248.
168. Zequeira, R. I. and Berenguer, C. (2004). Maintenance cost analysis of a two-
component parallel system with failure interaction. In: Proceedings of Reliability
and Maintainability, 2004 Annual Symposium: RAMS, January 26–29, 2004, IEEE,
pp. 220–225. doi:10.1109/RAMS.2004.1285451.
169. Sheu, S.-H. and Jhang, J.-P. (1996). A generalized group maintenance policy. European
Journal of Operational Research 96: 232–247.
170. Bai, Y., Jia, X., and Cheng, Z. (2011). Group optimization models for multi-component
system compound maintenance tasks. Eksploatacja i Niezawodność: Maintenance and
Reliability 1: 42–47.
171. Haurie, A. and L’Ecuyer, P. L. (1982). A stochastic control approach to group preventive
replacement in a multicomponent system. IEEE Transactions on Automatic Control
AC-27(2): 387–393.
172. Lai, M.-T. and Chen, Y.-C. H. (2006). Optimal periodic replacement policy for a
two-unit system with failure rate interaction. International Journal of Advanced
Manufacturing Technology 29: 367–371.
173. Shafiee, M. and Finkelstein, M. (2015). An optimal age-based group maintenance pol-
icy for multi-unit degrading systems. Reliability Engineering and System Safety 134:
230–238. doi:10.1016/j.ress.2014.09.016.
174. Popova, E. and Wilson, J. G. (1999). Group replacement policies for parallel systems whose
components have phase distributed failure times. Annals of Operations Research 91: 163–189.
175. Ritchken, P. and Wilson, J. G. (1990). (m, T) group maintenance policies. Management
Science 36(5): 632–639.
176. Popova, E. (2004). Basic optimality results for Bayesian group replacement policies.
Operations Research Letters 32: 283–287.
177. Sheu, S.-H., Yeh, R. H., Lin, Y.-B., and Juang, M.-G. (2001). A Bayesian approach to an
adaptive preventive maintenance model. Reliability Engineering and System Safety 71:
33–44. doi:10.1016/S0951-8320(00)00072-7.
178. Dekker, R. and Roelvink, I. F. K. (1995). Marginal cost criteria for preventive replacement
of a group of components. European Journal of Operational Research 84: 467–480.
179. Wildeman, R. E., Dekker, R., and Smit, A. C. J. M. (1997). A dynamic policy for group-
ing maintenance activities. European Journal of Operational Research 99: 530–551.
180. Do, P., Vu, H. C., Barros, A., and Berenguer, C. H. (2015). Maintenance grouping for
multi-component systems with availability constraints and limited maintenance teams.
Reliability Engineering and System Safety 142: 56–67. doi:10.1016/j.ress.2015.04.022.
181. Vu, H. C., Do, P., Barros, A., and Berenguer, C. H. (2014). Maintenance grouping strat-
egy for multi-component systems with dynamic contexts. Reliability Engineering and
System Safety 132: 233–249. doi:10.1016/j.ress.2014.08.002.
182. Zhang, X. and Zeng, J. (2015). A general modelling method for opportunistic maintenance
modelling of multi-unit systems. Reliability Engineering and System Safety 140:
176–190. doi:10.1016/j.ress.2015.03.030.
183. Zequeira, R. I., Valdes, J. E., and Berenguer, C. (2008). Optimal buffer inventory
and opportunistic preventive maintenance under random production capacity availability.
International Journal of Production Economics 111: 686–696.
doi:10.1016/j.ijpe.2007.02.037.
184. Laggoune, R., Chateauneuf, A., and Aissani, D. (2009). Opportunistic policy for opti-
mal preventive maintenance of a multi-component system in continuous operating
units. Computers and Chemical Engineering 33: 1499–1510.
185. Hou, W. and Jiang, Z. (2013). An opportunistic maintenance policy of multi-unit series
production system with consideration of imperfect maintenance. Applied Mathematics
and Information Sciences 7(1L): 283–290.
186. Shafiee, M., Finkelstein, M., and Berenguer, C. H. (2015). An opportunistic condition-
based maintenance policy for offshore wind turbine blades subjected to degradation
and environmental shocks. Reliability Engineering and System Safety 142: 463–471.
doi:10.1016/j.ress.2015.05.001.
187. Xia, T., Jin, X., Xi, L., and Ni, J. (2015). Production-driven opportunistic mainte-
nance for batch production based on MAM-APB scheduling. European Journal of
Operational Research 240: 781–790. doi:10.1016/j.ejor.2014.08.004.
188. Cavalcante, C. A. V. and Lopes, R. S. (2015). Multi-criteria model to support the defini-
tion of opportunistic maintenance policy: A study in a cogeneration system. Energy 80:
32–80.
189. Bedford, T., Dewan, I., Meilijson, I., and Zitrou, A. (2011). The signal model: A model
for competing risks of opportunistic maintenance. European Journal of Operational
Research 214: 665–673. doi:10.1016/j.ejor.2011.05.016.
190. Hu, J. and Zhang, L. (2014). Risk based opportunistic maintenance model for complex
mechanical systems. Expert Systems with Applications 41(6): 3105–3115.
doi:10.1016/j.eswa.2013.10.041.
191. Zhou, X., Xi, L., and Lee, J. (2006). A dynamic opportunistic maintenance policy
for continuously monitored systems. Journal of Quality in Maintenance Engineering
12(3): 294–305. doi:10.1108/13552510610685129.
192. Shi, H. and Zeng, J. (2016). Real-time prediction of remaining useful life and preventive
opportunistic maintenance strategy for multi-component systems considering stochastic
dependence. Computers and Industrial Engineering 93: 192–204.
doi:10.1016/j.cie.2015.12.016.
193. Zhou, X., Lu, Z.-Q., Xi, L.-F., and Lee, J. (2010). Opportunistic preventive maintenance
optimization for multi-unit series systems with combing multi-preventive maintenance
techniques. Journal of Shanghai Jiaotong University 15(5): 513–518.
194. Gustavsson, E., Patriksson, M., Stromberg, A.-B., Wojciechowski, A., and Onnheim,
M. (2014). Preventive maintenance scheduling of multi-component systems with
interval costs. Computers and Industrial Engineering 76: 390–400.
doi:10.1016/j.cie.2014.02.009.
195. Haque, S. A., Zohrul Kabir, A. B. M., and Sarker, R. A. (2003). Optimization model for
opportunistic replacement policy using genetic algorithm with fuzzy logic controller.
Proceedings of the Congress on Evolutionary Computation 4: 2837–2843.
196. Samhouri, M. S., Al-Ghandoor, A., Fouad, R. H., and Alhaj Ali, S. M. (2009). An intel-
ligent opportunistic maintenance (OM) system: A genetic algorithm approach. Jordan
Journal of Mechanical and Industrial Engineering 3(4): 246–251.
197. Kececioglu, D. and Sun, F.-B. (1995). A general discrete-time dynamic programming
model for the opportunistic replacement policy and its application to ball-bearing sys-
tems. Reliability Engineering and System Safety 47: 175–185.
198. Iung, B., Levrat, E., and Thomas, E. (2007). Odds algorithm-based opportunistic maintenance
task execution for preserving product conditions. Annals of the CIRP 56(1):
13–16.
199. Derigent, W., Thomas, E., Levrat, E., and Iung, B. (2009). Opportunistic maintenance
based on fuzzy modelling of component proximity. CIRP Annals – Manufacturing
Technology 58: 29–32.
200. Assid, M., Gharbi, A., and Hajji, A. (2015). Production planning and opportunistic pre-
ventive maintenance for unreliable one-machine two-products manufacturing systems.
IFAC-PapersOnLine 48(3): 478–483. doi:10.1016/j.ifacol.2015.06.127.
201. Hu, J., Zhang, L., and Liang, W. (2012). Opportunistic predictive maintenance for
complex multi-component systems based on DBN-HAZOP model. Process Safety and
Environmental Protection 90: 376–386.
202. Bedford, T. and Alkabi, B. M. (2009). Modelling competing risks and opportunistic
maintenance with expert judgement. In: Martorell, S., Guedes Soares, C., and
Barnett, J. (eds.) Safety, Reliability and Risk Analysis: Theory, Methods and Applications:
Proceedings of European Safety and Reliability Conference ESREL 2008, Valencia,
Spain, September 22–25, 2008, Leiden, the Netherlands: Taylor & Francis Group:
pp. 515–521.
203. Radner, R. and Jorgenson, D. W. (1963). Opportunistic replacement of a single part in
the presence of several monitored parts. Management Science 10(1): 70–84.
204. Epstein, S. and Wilamowsky, Y. (1985). Opportunistic replacement in a deterministic
environment. Computers and Operations Research 12(3): 311–322.
205. Van Der Duyn Schouten, F. A. and Vanneste, S. G. (1990). Analysis and computation
of (n, N)-strategies for maintenance of a two-component system. European Journal of
Operational Research 48: 260–274.
206. Ding, S.-H. and Kamaruddin, S. (2012). Selection of optimal maintenance policy
by using fuzzy multi criteria decision making method. In: Proceedings of the 2012
International Conference on Industrial Engineering and Operations Management,
July 3–6, 2012, Istanbul, Turkey: pp. 435–443.
207. Sarker, B. R. and Ibn Faiz, T. (2016). Minimizing maintenance cost for offshore wind
turbines following multi-level opportunistic preventive strategy. Renewable Energy 85:
104–113. doi:10.1016/j.renene.2015.06.030.
208. Laggoune, R., Chateauneuf, A., and Aissani, D. (2010). Impact of few failure data
on the opportunistic replacement policy for multi-component systems. Reliability
Engineering and System Safety 95: 108–119. doi:10.1016/j.ress.2009.08.007.
209. Gunn, E. A. and Diallo, C. (2015). Optimal opportunistic indirect grouping of preven-
tive replacements in multicomponent systems. Computers and Industrial Engineering
90: 281–291. doi:10.1016/j.cie.2015.09.013.
210. Zhou, X., Huang, K., Xi, L., and Lee, J. (2015). Preventive maintenance modeling
for multi-component systems with considering stochastic failures and disassembly
sequence. Reliability Engineering and System Safety 142: 231–237.
doi:10.1016/j.ress.2015.05.005.
211. Hopp, W. J. and Kuo, Y.-L. (1998). Heuristics for multicomponent joint replacement:
Applications to aircraft engine maintenance. Naval Research Logistics 45: 435–458.
212. Fard, N. and Zheng, X. (1991). An approximate method for non-repairable systems
based on opportunistic replacement policy. Reliability Engineering and System Safety
33: 277–288.
213. Zheng, X. and Fard, N. (1991). A maintenance policy for repairable systems based on
opportunistic failure-rate tolerance. IEEE Transactions on Reliability 40(2): 237–244.
214. Pham, H. and Wang, H. (1999). Optimal (τ,T) opportunistic maintenance of a k-out-
of-n:G system with imperfect PM and partial failure. Naval Research Logistics 47:
223–239.
215. Cui, L. and Li, H. (2006). Opportunistic maintenance for multi-component shock
models. Mathematical Methods of Operations Research 63(3): 493–511.
doi:10.1007/s00186-005-0058-9.
216. Tambe, P. P. and Kularni, M. S. (2013). An opportunistic maintenance decision of a
multi-component system considering the effect of failures on quality. In: Proceedings
of the World Congress on Engineering 2013, Vol. 1, July 3–5, 2013, London, UK: WCE
2013: pp. 1–6.
217. Tambe, P. P., Mohite, S., and Kularni, M. S. (2013). Optimisation of opportunistic main-
tenance of a multi-component system considering the effect of failures on quality and
production schedule: A case study. International Journal of Advanced Manufacturing
Technology 69(5): 1743–1756.
218. Huynh, T. K., Barros, A., and Berenguer, C.H. (2013). A  reliability-based opportu-
nistic predictive maintenance model for k-out-of-n deteriorating systems. Chemical
Engineering Transactions 33: 493–498.
219. Cheng, Z., Yang, Z., Tan, L., and Guo, B. (2011). Optimal inspection and maintenance
policy for the multi-unit series system. In: Proceedings of 9th International Conference
on Reliability, Maintainability and Safety (ICRMS) 2011, June 12–15, 2011, Guiyang,
China: pp. 811–814.
220. Cheng, Z., Yang, Z., and Guo, B. (2013). Optimal opportunistic maintenance model of
multi-unit systems. Journal of Systems Engineering and Electronics 24(5): 811–817.
doi:10.1109/JSEE.2013.00094.
221. Taghipour, S. and Banjevic, D. (2012). Optimal inspection of a complex system sub-
ject to periodic and opportunistic inspections and preventive replacements. European
Journal of Operational Research 220: 649–660. doi:10.1016/j.ejor.2012.02.002.
222. Ormon, S. W. and Cassady, C. R. (2004). Cannibalization policies for a set of paral-
lel machines. In: Reliability and Maintainability, 2004 Annual Symposium: RAMS,
January 26–29, 2004, Colorado Springs, CO: pp. 540–545.
223. Nowakowski, T. and Plewa, M. (2009). Cannibalization: Technical system maintenance
method (in Polish). In: Proceedings of XXXVII Winter School on Reliability, Warsaw,
Poland: Szczyrk, Publication House of Warsaw University of Technology: pp. 230–238.
224. Sherbrooke, C. C. (2004). Optimal Inventory Modeling of Systems: Multi-echelon
Techniques. Boston, MA: Kluwer Academic Publishers.
225. Lv, X.-Z., Fan, B.-X., Gu, Y., and Zhao, X.-H. (2013). Selective maintenance model
considering cannibalization and its solving algorithm. In: Proceedings of 2013
International conference on Quality, Reliability, Risk, Maintenance, and Safety
Engineering (WR2MSE), IEEE: pp. 717–723.
226. Simon, R. M. (1970). Cannibalization policies for multicomponent systems. SIAM
Journal on Applied Mathematics 19(4): 700–711.
227. Baxter, L. A. (1988). On the theory of cannibalization. Journal of Mathematical
Analysis and Applications 136: 290–297. doi:10.1016/0022-247X(88)90131-X.
228. Khalifa, D., Hottenstein, M., and Aggarwal, S. (1977). Technical note: Cannibalization
policies for multistate systems. Operations Research 25(6): 1032–1039.
229. Byrkett, D. L. (1985). Units of equipment available using cannibalization for repair-part
support. IEEE Transactions on Reliability R-34(1): 25–28.
230. Jodejko-Pietruczuk, A. and Plewa, M. (2012). The model of reverse logistics, based on
reliability theory with elements’ rejuvenation. Logistics and Transport 2(15): 27–35.
231. Salman, S., Cassady, C. R., Pohl, E. A., and Ormon, S. W. (2007). Evaluating the
impact of cannibalization on fleet performance. Quality and Reliability Engineering
International 23: 445–457. doi:10.1002/qre.826.
232. Sherbrooke, C. C. (1971). An evaluator for the number of operationally ready aircraft in
a multilevel supply system. Operations Research 19(3): 618–635.
233. Shah, J. and Avittathur, B. (2007). The  retailer multi-item inventory problem with
demand cannibalization and substitution. International Journal of Production
Economics 106: 104–114. doi:10.1016/j.ijpe.2006.04.004.
234. Gaver, D. P., Isaacson, K. E., and Abell, J. B. (1993). Estimating aircraft recoverable
spares requirements with cannibalization of designated items. Santa Monica, CA:
RAND Corporation. https://www.rand.org/pubs/reports/R4213.html.
235. Hoover, J., Jondrow, J. M., Trost, R. S., and Ye, M. (2002). A  model to study:
Cannibalization, FMC, and customer waiting time. Alexandria, VA: CNA.
236. Albright, T. L., Geber, C. A., and Juras, P. (2014). How naval aviation uses the Balanced
Scorecard. Strategic Finance 10: 21–28.
237. Meenu, G. (2011). Identification of factors affecting product cannibalization in Indian
automobile sector. IJCEM International Journal of Computational Engineering and
Management 12: 2230–7893.
238. Curtin, N. P. (2001). Military Aircraft: Cannibalizations Adversely Affect Personnel
and Maintenance. Washington, DC: US General Accounting Office.
239. Cheng, Y.-H. and Tsao, H.-L. (2010). Rolling stock maintenance strategy selection,
spares parts’ estimation, and replacements’ interval calculation. International Journal
of Production Economics 128: 404–412. doi:10.1016/j.ijpe.2010.07.038.
240. Garg, J. (2013). Maintenance: Spare Parts Optimization. M2 Research Intern Theses,
Ecole Centrale de Paris, Capgemini Consulting.
241. Ondemir, O. and Gupta, S. M. (2014). A  multi-criteria decision making model for
advanced repair-to-order and disassembly-to-order system. European Journal of
Operational Research 233: 408–419. doi:10.1016/j.ejor.2013.09.003.
242. Silver, E. A. and Fiechter, C.-N. (1995). Preventive maintenance with limited historical
data. European Journal of Operational Research 82: 125–144.
243. Nguyen, K.-A., Do, P., and Grall, A. (2015). Multi-level predictive maintenance for
multi-component systems. Reliability Engineering and System Safety 144: 83–94.
doi:10.1016/j.ress.2015.07.017.
244. Predictive maintenance 4.0. Predict the unpredictable. PWC, Mainnovation,
Pricewaterhouse Coopers B.V. 2017.
245. Predictive maintenance and the smart factory. Deloitte Development LLC. 2017.
2 Inspection Maintenance Modeling for Technical Systems: An Overview
Sylwia Werbińska-Wojciechowska

CONTENTS
2.1 Introduction
2.2 Inspection Maintenance Modeling for Single-Unit Systems
    2.2.1 Inspection Maintenance for Two-State Systems
    2.2.2 Inspection Maintenance for Multi-state Systems
2.3 Inspection Maintenance Modeling for Multi-unit Systems
    2.3.1 Inspection Maintenance for Standby Systems
    2.3.2 Inspection Maintenance for Operating Systems
2.4 Hybrid Inspection Models
2.5 Other Inspection Maintenance Models
2.6 Conclusions and Directions for Further Research
References

2.1 INTRODUCTION
All equipment breaks down from time to time, requiring materials and tradespeople
to repair it and causing negative consequences, such as loss in production or
transportation delays. To reduce the number of these breakdowns, planned main-
tenance actions are implemented. One of the most familiar planned maintenance
actions is inspection.
Currently, inspection and inspection policy development play an important role in
various technical systems, and thus they attract a lot of attention in the literature. In many
situations there are no apparent symptoms indicating a forthcoming failure. In such
systems with non-self-announcing failures (also called unrevealed faults or latent
faults), the typical preventive maintenance policies cannot be used [1]. Instead,
inspection actions are introduced to maintain such systems. Examples of these
systems include protective devices, emergency devices, and standby units (see [1,2]).
The  main purpose of an inspection is to determine the state of equipment
based on the chosen indicators, such as bearing wear, gauge readings, and quality
of a product [3]. Following this, the main definition of inspection can be derived.
According to EN 13306:2018 standard [4], inspection is defined as “examination
for conformity by measuring, observing, or testing the relevant characteristics of an
item.” The authors of [5] extend this definition, stating that inspection is
“measuring, examining, testing, and gauging one or more characteristics of a product
or service and comparing the results with specified requirements to determine
whether conformity is achieved for each characteristic.”
The main benefits obtained from inspection performance include detection and
correction of minor defects before major breakdown occurs. Consequently, the
inspection maintenance optimization is strictly connected with system’s deteriora-
tion processes, which are generally stochastic. Thus, the condition of a system is
revealed only by its inspection. In other words, inspection models usually assume
that the state of the system is completely unknown unless an inspection is performed.
Following this, the knowledge about the true status of an inspected system gives the
possibility to take appropriate maintenance actions. However, execution of frequent
inspections incurs substantial cost. Conversely, infrequent inspections result in a
higher cost for system downtime because of longer intervals between performance
of these maintenance actions. Following this, to determine an inspection policy, the
correct balance between the number of inspections and the resulting output accord-
ing to the defined optimization criteria (e.g., maximization of profit, minimization of
downtime, and maximization of availability) must be sought.
Moreover, inspection schemes may be periodic and non-periodic (sequential) [6].
In  this chapter, the focus is on periodic inspection maintenance modeling issues.
More information about non-periodic inspection maintenance modeling may be
found in [1,7].
Early inspection maintenance models were developed in 1959 by R.E. Barlow
and L.C. Hunter in their work Mathematical models for system reliability (according
to [8]). The standard decision problem asks the following question: an undetected
failure causes an economic loss that increases with time, whereas inspections
are costly too, so what is the most cost-efficient way to schedule inspections
in time? Many extensions and modifications of the standard inspection model have
been developed and investigated, and they have been surveyed over the last five decades.
One of the first research works that surveys inspection models is [9], where the
authors focus on the inspection and replacement problems of single and multi-unit
systems. The summary of optimal scheduling of replacement and inspection of sto-
chastically failing equipment is developed in [10]. Later, in [11] the authors review
the research studies that appeared between 1965 and 1976. In this work, the authors
present the discrete time maintenance models in which a unit (or units) is monitored
and a decision is made to repair, replace, and/or restock the unit(s). In [3], the author
gives a state-of-the-art review of the literature related to optimal inspection model-
ing of failing systems. The surveyed research papers were published in the 1960s
and 1970s. In 1989, the authors in [12] present a survey on the research published
after [11]. In this work, the authors focus on single-unit systems (one-unit and complex
systems), providing a section on inspection models. The authors indicate that the
main differences between the developed models are the time horizon, the available
information, the nature of the cost functions, the model’s objective, and the system’s
constraints. The focus on multi-unit systems inspection problems is given in [13].
In [14], the authors present a literature review on inspection maintenance models.
The authors focus on inspection models with different types of inspection information
(perfect or not) and different costs of inspections (costly or costless inspection
information). The same year, the author in [15] reviews recent developments in the
methodology for solving inspection problems. The author focuses on the most important
issues that need further development (e.g., fallible tests performance).
In 2002, the authors in work [16] review classical maintenance models including
inspection strategies. They focus on the models developed in the 1960s and 1970s that
are based on the general inspection policy discussed by R. E. Barlow and F. Proschan
in Mathematical Theory of Reliability. The  author also investigates the standard
inspection policies in [17].
Later, in 2012 the authors in [8] review the main inspection models for systems.
They present the two main maintenance models—an inspection without replacement
and an inspection with replacement. The first group of inspection models includes
solutions for three situations: lifetime distribution is known, lifetime distribution is
partially known, and lifetime distribution is unknown.
In  the second group of maintenance models, the assumption of inspection-
replacement process is introduced. The  next year, the authors in  [18] present the
three classes of inspection problems: (1) inspection frequencies for equipment that is
in continuous operation and subject to breakdown, (2) inspection intervals for equip-
ment used only in emergency conditions, and (3) condition monitoring of equipment.
The recent literature review on inspection maintenance also is provided in [19], where
the author focuses on inspection maintenance for single-unit and multi-unit systems.
Moreover, some recent research works are dedicated to comparing the problems
with various maintenance policies. The  main comparisons between optimum and
nearly optimum inspection policies are given in [20,21], where authors refer to the
models developed by R. E. Barlow and F. Proschan as standard optimal policies.
In [22], three sub-optimal inspection policies are proposed and compared:
periodic policy, mean residual life policy, and constant hazard policy. The review
and comparison of known classical optimum-checking policies is given in  [23].
Comparisons for inspection and repair policies are analyzed in [24–26].
In summary, based on the developed literature reviews, the existing inspection
models can be classified in many ways. One classification is given in [15], where the
author defines five main groups of optimal inspection models: imperfect inspec-
tion models, inspection with replacement policies, inspection policies with delayed
symptoms of failure, inspection models for stand-by systems, and Bayesian models.
More general classifications divide existing maintenance models into inspection
models for two-state and multi-state systems ([27]), or inspection models
for single- and multi-unit systems ([28,29]). According to [1], inspection models are
classified by the type of maintained system: protective devices (safety systems)
or standby units, and operating devices.
In this chapter, the proposed classification divides the known models into four main
groups of inspection strategies: single-unit systems, multi-unit systems, hybrid
inspection models, and models dedicated to solving other maintenance problems
(e.g., case studies). The main scheme for classification of inspection models for
technical systems is given in Figure 2.1.

[Figure 2.1: tree diagram. Root: inspection maintenance models for technical systems.
Branches: inspection models for single-unit systems, inspection models for multi-unit
systems, hybrid inspection models, and other inspection models.]

FIGURE 2.1  Inspection maintenance models for technical systems – the main classification.
(Own contribution based on Tang, T., Failure finding interval optimization for periodically
inspected repairable systems, PhD Thesis, University of Toronto, 2012; Beichelt, F.,
Nav. Res. Logist. Q., 28, 375–381, 1981; Cazorla, D.M. and R. Perez-Ocon, Eur. J. Oper. Res.,
190, 494–508, 2008; Boland, P.J. and E. El-Neweihi, Comput. Oper. Res., 22, 383–390, 1995.)

2.2 INSPECTION MAINTENANCE MODELING FOR SINGLE-UNIT SYSTEMS
In this section, the author investigates a one-unit stochastically failing or deteriorat-
ing system in which only actual inspection can detect a system’s failure. Following
Figure 2.1, inspection models for two-state, single-unit systems are investigated first.

2.2.1  Inspection Maintenance for Two-State Systems


The first inspection model formulated by R. E. Barlow and F. Proschan [7] is called a
pure inspection model for a system and is characterized by the following assumptions:

• The system has two states (functioning and failed)
• The system’s condition is known only through inspections
• Inspections are perfect in the sense that a failure will be identified at inspection
• Inspections do not degrade or rejuvenate the system
• The system cannot fail or age during an inspection
• Inspection actions take negligible time

For the given assumptions, the expected total cost is obtained according to the formula:

$$C(T_{in}) = \sum_{n=0}^{\infty} \int_{t_{in}^{n}}^{t_{in}^{n+1}} \left[ c_{in1}(n+1) + c_{in2}\left(t_{in}^{n+1} - x\right) \right] dF(x) \tag{2.1}$$

where:
$C(T_{in})$  Long-run expected cost per unit time
$c_{in1}$  Cost of first inspection action performance
$c_{in2}$  Cost of second (and subsequent) inspection action performance
$F(x)$  Probability distribution function of system/unit lifetime
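
As a rough numerical illustration of Equation (2.1), the sketch below evaluates the sum for a periodic schedule with inspection times n·T and searches a grid of candidate intervals for the cheapest one. It assumes the usual Barlow–Proschan reading of the cost terms, in which the first cost is charged per inspection and the second per unit of time a failure stays undetected. The exponential lifetime, the parameter values, and all function names are illustrative assumptions and do not come from the source.

```python
import math

def expected_cost_periodic(T, c1, c2, lam, tol=1e-12):
    """Evaluate the sum in Eq. (2.1) for inspection times t_n = n*T.

    Assumptions (not from the source): F(x) = 1 - exp(-lam*x), c1 is charged
    per inspection, c2 per unit time between failure and its detection.
    """
    total = 0.0
    n = 0
    while True:
        a, b = n * T, (n + 1) * T
        surv_a = math.exp(-lam * a)
        # Probability that the failure falls in (a, b].
        p = surv_a - math.exp(-lam * b)
        # Closed form of the integral of (b - x) dF(x) over (a, b].
        delay = T * surv_a - p / lam
        total += c1 * (n + 1) * p + c2 * delay
        if surv_a < tol:  # the remaining tail is negligible
            return total
        n += 1

def closed_form(T, c1, c2, lam):
    """Geometric-series closed form of the same sum (exponential lifetime only)."""
    return (c1 + c2 * T) / (1.0 - math.exp(-lam * T)) - c2 / lam

def best_interval(c1, c2, lam, candidates):
    """Return the candidate interval with the lowest expected cost."""
    return min(candidates, key=lambda T: expected_cost_periodic(T, c1, c2, lam))

if __name__ == "__main__":
    c1, c2, lam = 5.0, 2.0, 0.01  # hypothetical cost and failure-rate values
    T_best = best_interval(c1, c2, lam, range(1, 101))
    print(T_best, expected_cost_periodic(T_best, c1, c2, lam))
```

The grid search mirrors the trade-off described in the introduction: short intervals inflate the inspection term, while long intervals inflate the undetected-failure term.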

The main extensions of this pure inspection model address perfect/imperfect
inspection performance, known/unknown system lifetime distribution, the use of
cost or reliability optimization criteria, and the implementation of shock modeling.
One of the first extensions of the given pure inspection model considers the
finite-horizon case. In [30], the author analyzes a model that is based on
the selection of the best maintenance strategy for the object’s reliability state. In [31],
the author analyzes the problem of determining an optimum checking schedule over
the finite horizon with cost considerations.
In [32,33], a heuristic approach for determining the optimal inspection interval is
investigated. The authors in [33] assume that the optimal interval between inspec-
tions depends on a likelihood of malfunction, a cost of inspection, and a cost of
treatment. The developed model is examined later to analyze the relation of subjects’
judgments to the model description. Later, in [32], the author focuses on the develop-
ment of a mathematical model for determining a periodic inspection schedule in a
preventive maintenance program for a single machine.
The second, frequently investigated extension of the basic inspection model
covers the situation when no or only partial information on the lifetime distribution of
a system is available. One of the first works that investigates this issue is [34].
The author of this work assumes that the system lifetime distribution is unknown. To
find the optimal inspection policy parameters, the author uses minimax inspection
strategies with respect to cost criteria. This model is later extended in [35] and [36].
Another interesting problem concerns the analysis of imperfect inspection
performance. For example, in [37] the authors develop an imperfect inspection policy
for systems subject to multiple correlated degradation processes. In [38], the author
presents the problem of finding the optimum inspection procedure for a system
whose time to failure is exponentially distributed. The problem is formulated as a
continuous-time Markovian decision process with two states (before and after failure)
and provides a basis for the extended model given in [35].
A work worth noting is [39], where the authors introduce an optimal inspection
policy that is based on implementation of a failure detection zone. The idea is similar
to the delay-time approach (see [19]) or the Fault Trees with Time Dependencies modeling
approach (see [40]). In this model, if an inspection is conducted in a pre-specified time
zone, a failure will be noticed before it occurs. Otherwise, the failure will remain
undetected. An analytical algorithm for finding the optimal inspection interval
is given considering cost and availability criteria.
Another interesting problem is presented in [41], where the authors propose a
model in which the ith test increases a remaining failure rate without changing the
form of the conditional lifetime distribution. The solution algorithms for finding
the best testing times are developed for two cases of uniform and exponential fail-
ure time distributions.
The  problem of determination of an optimal inspection policy when inspec-
tions may be harmful to a maintained unit is continued also in [42]. The author in
this work develops a hazardous-inspection model where every performed test may
impair the tested unit. The proposed model is developed based on a Markov decision
process implementation and the emphasis is put on maximization of the expected
lifetime of the inspected unit. A non-Markovian case is analyzed in [43]. The author
in this work develops two inspection policies: one-test and two-test. The two-stage
inspection procedure is dedicated to expensive devices and is based on perform-
ing a fallible test first and an error-free test whenever the first test reports a failure.
The models are based on the assumptions of arbitrary failure distributions, general
optimality conditions, and algorithms for reduction of the infinite horizon optimiza-
tion to two dimensions. This inspection problem is continued later in [44].
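
The two-test procedure described above can be illustrated with a small simulation. The sketch below assumes a fallible first test that misses an existing failure with probability `beta` and raises a false alarm with probability `alpha`, followed by an error-free confirmation test only when the first test reports a failure. These parameters, the periodic schedule, the exponential lifetime, and the function names are illustrative assumptions and are not taken from [43].

```python
import random

def simulate_two_test(T, lam, alpha, beta, rng):
    """Simulate one unit until its failure is confirmed.

    Assumptions (hypothetical): inspections every T time units, exponential
    lifetime with rate lam, false-alarm probability alpha, miss probability
    beta (must be < 1 so that the loop terminates).
    Returns (detection_delay, n_fallible_tests, n_error_free_tests).
    """
    failure_time = rng.expovariate(lam)
    n_fallible = 0
    n_error_free = 0
    k = 0
    while True:
        k += 1
        t = k * T
        n_fallible += 1
        failed = failure_time <= t
        if failed:
            alarm = rng.random() >= beta  # detected unless the test misses
        else:
            alarm = rng.random() < alpha  # false alarm on a working unit
        if alarm:
            n_error_free += 1  # the error-free test settles the true state
            if failed:
                return t - failure_time, n_fallible, n_error_free

def mean_detection_delay(T, lam, alpha, beta, runs=20000, seed=0):
    """Monte Carlo estimate of the mean time a failure stays undetected."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        total += simulate_two_test(T, lam, alpha, beta, rng)[0]
    return total / runs

if __name__ == "__main__":
    print(mean_detection_delay(10.0, 0.05, alpha=0.05, beta=0.0))
    print(mean_detection_delay(10.0, 0.05, alpha=0.05, beta=0.3))
```

Under these assumptions, a larger miss probability lengthens the undetected period, while a larger false-alarm probability only increases the number of error-free tests performed.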
The problem of imperfect inspections with the implementation of multiple post
repair inspections and accidents during inspection is analyzed in [45]. The authors
in this model propose an inspection policy for single- and two-unit systems, where
a repairman is called immediately to repair a failed unit. The analytical solutions
are provided for various measures of reliability such as mean time to system failure,
steady-state availability, busy period of repairman for repair, and inspection per unit
time by using semi-Markov processes and regenerative point techniques.
Another interesting model is given in  [46]. The  author in this work considers
the problem of the optimal choice of periodic inspection intervals for a renewable
equipment without preventive replacement performance. The model is based on two
optimization criteria: minimization of maintenance costs and maximization of sys-
tem availability. The author develops an approximate method for inspection interval
calculations and proves that the obtained solutions are very close to the exact ones.
The extended inspection models with imperfect testing also are investigated
in [47–51]. The continuation of inspection modeling with availability constraints,
given in [51], is presented in [52]. The authors in this work analyze the instantaneous
availability of a system maintained under periodic inspection with the use of random
walk models. Two cases are analyzed: deterministic and stochastic.
Some summary and extensions of the models presented in  [52] are given also
in [53]. In this work, the authors focus on periodic inspection, developing five basic
models with availability requirements. All the inspection models are based on dif-
ferent approaches to the determination of inspection times. In a later work [54], the
authors also extend the inspection models given in [51]. The main extension is based
on the assumption that periodic inspections take place at fixed time points after repair
or replacement in case of failure. The  implementation of minimal repairs before
replacement or perfect repair is analyzed in [55]. The authors in this work propose a
minimal repair model with periodic inspection and constant repair time. The instan-
taneous availability of the proposed model is derived by a set of recursive formulas,
providing the introduction to optimization of system reliability characteristics.
Recently, in  [56] the authors focus on the availability of a system under peri-
odic inspection with perfect repair/replacement and non-negligible downtime due
to repair/replacement for a detected failure and due to inspection. The  model is
an extension of the works given in [51,54,57]. The authors in this work analyze a
calendar-based inspection policy and an age-based inspection policy.
The last group of inspection policies for two-state, single-unit systems applies
shock models. One of the first works focused on random shock modeling
for systems with non-self-announcing failures is [58]. The authors of this work
consider a periodic inspection model for a system subject to randomly occurring
shocks that follow a Poisson process and cumulatively damage the system.
This model is investigated and extended later in [59,60].
The new inspection policy considers random shock magnitudes and times between
shock arrivals and focuses on optimization of the availability criterion.
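
A cumulative shock model under periodic inspection can be sketched with a short Monte Carlo simulation. The version below assumes Poisson shock arrivals, exponentially distributed shock magnitudes, failure once the accumulated damage reaches a fixed threshold, and instantaneous renewal at the first inspection after failure. These modeling choices, the parameter names, and the function name are illustrative assumptions and do not reproduce the formulations in the cited works.

```python
import math
import random

def simulate_availability(T, rate, mean_damage, threshold, horizon, rng):
    """Estimate the fraction of time the system is up.

    Assumptions (hypothetical): shocks arrive as a Poisson process with the
    given rate, each shock adds exponential damage with mean mean_damage,
    the system fails when cumulative damage reaches threshold, the failure is
    found at the next inspection (every T time units after each renewal), and
    inspection and renewal take negligible time.
    """
    clock = 0.0
    up_time = 0.0
    while clock < horizon:
        damage = 0.0
        failure_time = 0.0  # hidden failure time measured from the renewal
        while damage < threshold:
            failure_time += rng.expovariate(rate)         # time to next shock
            damage += rng.expovariate(1.0 / mean_damage)  # shock magnitude
        # The failure is revealed at the first inspection at or after it.
        detection_time = math.ceil(failure_time / T) * T
        up_time += failure_time
        clock += detection_time
    return up_time / clock

if __name__ == "__main__":
    for T in (1.0, 5.0, 20.0):
        a = simulate_availability(T, 1.0, 1.0, 5.0, 10000.0, random.Random(0))
        print(T, round(a, 3))
```

In this sketch, availability falls as the inspection interval grows, because a failed system waits longer before its state is revealed.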
Another extension of the model presented in [58] is given in [61]. The authors in
this work incorporate a more general deterioration process that includes both shock
degradation and graceful degradation (continuous accumulation of damage). With
the use of regenerative arguments and considering a constant rate of graceful deg-
radation occurrence, an expression for the limiting average availability is derived.
The maintenance models for systems with two failure modes—type I failure rela-
tive to non-maintainable failure mode, and type II failure relative to periodically
maintainable failure mode—are developed in [62–65].
In  2006, a model with three types of inspections is introduced in  [66]. In  this
article, the authors assume that a system can fail because of three competing failure
types: I, II, and III. Partial inspections detect type I failures without error. Failures
of type II can be detected by imperfect inspections. Type III failures are detectable
only by perfect inspections. If the system is found to have failed in an inspection, a
perfect repair is made.
The  summary of the main known models published in the recent literature is
presented in Table 2.1. The author considers a few main criteria for summarizing
this review:

• The problem category (the main model characteristic that distinguishes it)
• Planning horizon (investigating infinite or finite case)
• Assumption about the quality of performed inspections in a maintained system
• Type of introduced failure modes (for shock modeling)
• Used optimality criterion (cost or reliability constraints)
• Modeling method that is used in order to optimize the inspection policy
• Model’s reference with the year of its publication

2.2.2  Inspection Maintenance for Multi-state Systems


In  some systems, such as critical infrastructure where the safety issues are very
important, reliability analysis carried out in relation to two-state technical objects
usually is insufficient (see [19] for a review). The solution to this problem is to con-
sider a technical object in terms of a minimum of three reliability states, where a
third state is the state of partial failure.
The known inspection models for multi-state deteriorating single-unit systems
may be classified into two main groups: models for systems with perfect/imperfect
inspection and models for systems subjected to shocks. The main
directions of research in these model groups are described below.
One of the first developed inspection models for multi-state units is given in [79].
In  this work, the author presents a Markovian model, which is focused on proper
scheduling of inspections and preventive repairs considering minimization of the
total expected cost per time unit. The main assumptions include performance of peri-
odic inspections, implementation of perfect repair and inspection actions, and ran-
dom holding times of systems.
48

TABLE 2.1
Summary of Inspection Policies for Two-State, Single-Unit Systems

Problem Category | Planning Horizon | Quality of Performed Inspections | Failure Modes | Optimization Criterion | Modeling Method/Checking Procedures | Type of References | Publication Years
--- | --- | --- | --- | --- | --- | --- | ---
Original algorithm | Infinite | Perfect | n/a | Expected cost per unit of time | Analytical | [67] | 1980
Original algorithm | Infinite | Perfect | n/a | Expected cost per unit of time | Analytical/optimal | [68] | 1984
Original algorithm | Infinite | Perfect | n/a | Expected profit per unit of time | Analytical/heuristic approach | [32] | 1996
Original algorithm | Infinite | Perfect | n/a | Expected cost function | Heuristic approach | [33] | 1992
Original algorithm | Infinite | Perfect | n/a | Expected total cost | Analytical | [69] | 2005
Original algorithm | Finite | Perfect | n/a | Expected costs of loss | Discrete dynamic programming | [30] | 1980
One-parameter optimization model | Infinite | Perfect | n/a | Average total cost per time unit | | [60] | 1998
Model with unknown or partially unknown system lifetime probability (a) | Infinite | Perfect | n/a | Expected loss cost per time unit | Analytical | [34] | 1981
Model with known or unknown slp (a) | Infinite | Perfect | n/a | Total expected cost | Analytical | [36] | 2006
Model with unknown slp (a) | Infinite | Perfect/imperfect | n/a | Total expected cost | Analytical | [35] | 2001
Model with known slp (a) | Infinite | Imperfect | n/a | Cost per unit of time | Analytical | [70] | 2002
Model with known slp (a) | Infinite | Imperfect | n/a | Long-run expected cost per unit time/availability function | Renewal reward process/nonlinear programming | [71] | 1995
Model with known slp (a) | Infinite | Imperfect | n/a | Long-run expected cost per unit time | Renewal reward process | [72] | 2003
Model with known slp (a) | Infinite | Imperfect | n/a | Long-run expected cost per unit time | Renewal theory, Wiener process | [37] | 2016
Model with known slp (a) | Infinite | Imperfect | n/a | Total cost over a lifetime | Continuous-time Markovian decision process | [38] | 1982
Model with known slp (a) | Infinite | Imperfect | n/a | Expected cost per time unit | Markovian model | [73] | 1998
Model with known slp (a) | Infinite | Fallible/error-free tests | n/a | Long-run cost per unit time | Dynamic programming | [43] | 1993
Model with known slp (a) | Infinite | Fallible tests | n/a | Long-run cost per unit time | Analytical | [44] | 1993
Model with known slp (a) | Infinite | Fallible tests | n/a | Mean loss per unit time | Analytical | [41] | 1979
Model with known slp (a) | Infinite | Fallible tests | n/a | Expected lifetime of the unit | Markov decision process | [42] | 1979
Model with known slp (a) | Infinite/finite | Failure detection zone | n/a | Long-run cost per unit time | Analytical | [39] | 2015
Model with known slp (a) | Finite | Imperfect | n/a | Expected sum of discounted cost | Markov decision process + quasi-Bayes approach + dynamic programming | [74] | 2008
Optimization model | Infinite | Perfect | n/a | Limiting average availability and long-run inspection rate | Analytical | [72] | 2000
Optimization model | Infinite | Perfect | n/a | Limiting average availability | Analytical | [51] | 2000
Optimization model | Infinite | Perfect | n/a | Long-run average cost per unit time | Analytical | [75] | 2012
Optimization model | Infinite | Perfect | n/a | Average availability and the long-run average cost rate | Analytical | [76] | 2014
Optimization model | Infinite | Imperfect | n/a | Expected operational readiness of a system | Analytical | [77] | 1963
Optimization model | Infinite | Imperfect | n/a | System stationary availability | Analytical | [78] | 2008
Optimization model | Infinite | Imperfect | n/a | Measures of system reliability | Semi-Markov process + regenerative point technique | [45] | 2005
Optimization model | Infinite | Imperfect | n/a | Stationary availability coefficient and total expected cost per one renewal period | Analytical | [46] | 2009
Optimization model | Infinite | Imperfect | n/a | Limiting average availability and the long-run average cost per unit time | Analytical | [49] | 2012
Optimization model | Finite/infinite | Perfect | n/a | Limiting average availability, long-run inspection rate, instantaneous availability, instantaneous inspection rate | Analytical | [53] | 2004
Optimization model | Finite/infinite | Perfect | n/a | Limiting average availability, instantaneous availability | Analytical | [54] | 2005
Optimization model | Finite/infinite | Perfect | n/a | Limiting average availability, instantaneous availability | Analytical | [56] | 2013
Optimization model | Finite | Perfect | n/a | Instantaneous availability | Analytical (random walk model) | [52] | 2001
Optimization model | Finite | Perfect | n/a | Instantaneous availability | Analytical | [55] | 2013
Optimization model | Finite | Imperfect | n/a | Long-run average cost per unit time or cost-rate over the time to retirement | Analytical | [48] | 2013
Shock model | Infinite | Perfect | Random shocks arriving according to a Poisson process | Time-stationary availability | Analytical (renewal process) | [58–60] | 1994, 1998, 2000
Shock model | Infinite | Perfect | Random shocks (a Poisson process) and graceful degradation | Limiting average availability | Analytical (renewal process) | [61] | 2002
Shock model | Infinite | Perfect | Two dependent failure modes: maintainable and non-maintainable | Expected maintenance cost per unit time | Analytical (renewal process) | [64] | 2006
Shock model | Infinite | Partial, perfect, and imperfect | Three competing failure modes: I, II, III | Cost rate function | Analytical (renewal process) | [66] | 2006
Shock model | Infinite | Perfect | Two failure modes: minor failure and catastrophic failures | Expected cost per unit of time | Analytical (renewal process) | [62,63] | 2006
Shock model | Infinite | Perfect | Two failure modes: minor failure and catastrophic failures | Expected net cost rate | Analytical (renewal process) | [65] | 2015

(a) slp: information about system lifetime probability



Another application of Markovian modeling to the maintenance of multi-state,
single-unit systems is given in [80]. The authors in this work use non-homogeneous
Markovian techniques to model systems with tolerable down times.
Partially observable processes are also examined in [81]. The author
in this paper presents a model of a system that deteriorates according to a
discrete-time Markov process and whose operation and repair costs increase with the
deterioration state number. He proposes a monotonic four-region policy with cost
considerations, where the decision process adopts a countable state space and a finite
action space. This problem is continued in [82], where the authors
propose a semi-Markov decision algorithm operating on the class of control-limit
rules. It is extended later in [83], where the authors allow for delayed
replacement and investigate the discounted cost structure.
The semi-Markov processes are applied in [84]. The author in this work develops
a maintenance model for systems with five states that constitute all possible cycles,
which begin with inspections. The  solution is based on reliability characteristics
assessment (asymptotic availability, reliability function).
Moreover, the maintenance inspection issues of production multi-state systems
and processes are analyzed in [85–88].
The second investigated problem concerns shock modeling. One of the first works
that considers inspection policies for multi-state, single-unit systems with shock
modeling is given in [89]. This model is extended later in [90], where
the author determines an optimal inspection policy for a system whose deterioration
process is assumed to be an increasing pure jump Markov process. Later, in [91]
the authors develop an optimal inspection-replacement policy for an item subject to
cumulative damage. In this model, a unit fails depending on the accumulated damage
caused by gradual damage. The authors calculate the optimal damage limit according
to the long-run expected cost rate criterion using the renewal reward theory.
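The renewal reward argument behind such a criterion can be illustrated numerically: the long-run expected cost rate equals the expected cost of one renewal cycle divided by the expected cycle length, so it can be estimated as accumulated cost over accumulated time across many simulated cycles. The Python sketch below is not the model of [91]. The unit inspection period, the exponential damage increments, the failure level, and all cost values are hypothetical assumptions chosen only for illustration.

```python
import random


def cost_rate(limit, fail_level=10.0, mean_incr=1.0, c_insp=1.0,
              c_prev=20.0, c_fail=100.0, cycles=20000, seed=0):
    """Estimate the long-run cost rate of a damage-limit replacement policy.

    All parameter values are illustrative assumptions, not data from the text.
    """
    rng = random.Random(seed)
    total_cost = 0.0
    total_time = 0
    for _ in range(cycles):
        damage = 0.0
        while True:
            # Damage accumulated during one inspection period (assumed exponential).
            damage += rng.expovariate(1.0 / mean_incr)
            total_time += 1
            total_cost += c_insp
            if damage >= fail_level:
                # Failure found at inspection: corrective replacement ends the cycle.
                total_cost += c_fail
                break
            if damage >= limit:
                # Damage limit exceeded: preventive replacement ends the cycle.
                total_cost += c_prev
                break
    # Renewal reward theorem: cost rate = E[cycle cost] / E[cycle length].
    return total_cost / total_time


if __name__ == "__main__":
    for limit in (4.0, 6.0, 8.0, 10.0):
        print(limit, round(cost_rate(limit), 3))
```

Under these assumed values, a damage limit set well below the failure level gives a lower estimated cost rate than running the unit to failure, which is the trade-off that the optimal damage limit resolves.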
The problem of imperfect inspections and imperfect repairs is investigated in [92].
The model considers a system subject to external and internal failures whose
deterioration level is known by means of inspections. Moreover, the authors assume
two types of repair, minimal and perfect, depending on the deterioration level and
each following a different phase-type distribution. The solution is based on a
generalized Markov process, with a phase-type renewal process as a special case.
Another extension of  [89] is given in  [93], where the authors propose a state-
dependent maintenance policy for a multi-state continuous-time Markovian dete-
riorating system subject to aging and fatal shocks. The  model incorporates the
assumptions of state-dependent cost structure, imperfect repair, and perfect inspec-
tions, and is based on implementation of periodic inspections.
The availability of periodically inspected systems subjected to shocks is analyzed
in [94]. In this model, the authors analyze a system whose deterioration process is
modulated by a continuous-time Markov chain and additional damage is induced by
a Poisson shock process.
The  summary of the main known models published in the recent literature is
presented in Table  2.2. The  author applies the same classification criteria as in
Section 2.2.1.
TABLE 2.2
Summary of Inspection Policies for Multi-state, Single-Unit Systems

Problem Category | Planning Horizon | Quality of Performed Inspections | Failure Modes | Optimization Criterion | Modeling Method/Checking Procedures | Type of References | Publication Years
--- | --- | --- | --- | --- | --- | --- | ---
Optimization model | Infinite | Perfect | n/a | Discounted and average cost | Discrete-time Markov process | [81] | 1976
Optimization model | Infinite | Perfect | n/a | Discounted and average cost | Markov decision process | [88] | 1978
Optimization model | Infinite | Perfect | n/a | Total expected cost per time unit | Markovian model | [79] | 1976
Optimization model | Infinite | Perfect | n/a | Long-run expected average cost per unit time | Markov renewal theory | [85] | 1997
Optimization model | Infinite | Perfect | n/a | Expected long-run discounted cost | Semi-Markov decision process | [83] | 1992
Optimization model | Infinite | Imperfect | n/a | Long-run expected cost per unit time | Analytical | [76] | 2014
Optimization model | Infinite | Imperfect | n/a | Expected total discounted cost | Discrete-time Markov chain | [86] | 1986
Optimization model | Infinite | Imperfect | n/a | Reliability function | Semi-Markov processes | [77] | 1962
Inspection with CBM modeling | Infinite | Imperfect | n/a | Operational reliability | Analytical | [50] | 2013
Optimization model | Finite | Perfect | n/a | Average cost | Semi-Markov decision model | [82] | 1984
Shock model | Infinite | Perfect | Cumulative damage attributed to shocks occurrence (Poisson process) | Long-run average cost per unit time | Analytical (renewal reward theorem) | [89] | 1980
Shock model | Infinite | Perfect | Deterioration level assumed as increasing pure jump Markov process | Long-run average cost per unit time | Markov process/control-limit policy | [90] | 1987
Shock model | Infinite | Perfect | Cumulative damage caused by gradual damage | Expected long-run cost rate | Analytical (renewal reward theorem) | [91] | 1997
Shock model | Infinite | Perfect | Poisson shock process | Limiting average availability | Continuous-time Markov chain | [94] | 2006
Shock model | Infinite | Perfect | Fatal shocks occurrence | Expected long-run cost rate | Continuous-time Markov process | [93] | 2001
Shock model | Infinite | Perfect/imperfect | Internal and external failures occurrence | Total costs per unit time | Generalized Markov process | [92] | 2008
2.3 INSPECTION MAINTENANCE MODELING FOR MULTI-UNIT SYSTEMS
The general classification of the main inspection policies investigated for
multi-component systems considers the type of hidden failures. According to [39], there
are two types of hidden failures:

• Type I: protective devices or standby units. The function of these devices is
  to protect the main system in case of failures.
• Type II: operating devices. They are operating systems, and their failure
  will cause direct loss.

Models for protective devices and standby units are discussed first.

2.3.1  Inspection Maintenance for Standby Systems


Standby units are characteristic of many engineering systems. Spare components,
or systems, that are not in continuous operation are examples of this sort
of unit [129]. The main function of the spare unit is to replace the component in
use when the latter fails so that the system is restored to operating condition as
soon as possible. However, standby units also deteriorate and fail, and their
failures remain undiscovered until the next attempt to use them unless some test or
inspection is carried out (unrevealed failures).
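The effect of unrevealed failures on a periodically inspected standby unit can be illustrated with a standard textbook result. Assume, purely for illustration, an exponential time to failure with rate `rate`, instantaneous and perfect inspections every `interval` time units, and immediate renewal of a unit found failed. The mean availability over one inspection interval is then (1 − e^(−λT))/(λT). The function name and the numerical values below are hypothetical.

```python
import math


def mean_availability(rate, interval):
    """Mean availability of a standby unit between two perfect inspections.

    Assumes an exponential lifetime, instantaneous inspections, and immediate
    renewal of a failed unit at inspection (illustrative assumptions only).
    """
    x = rate * interval
    # (1 - exp(-x)) / x, written with expm1 for accuracy at small x.
    return -math.expm1(-x) / x


if __name__ == "__main__":
    for interval in (10.0, 50.0, 100.0):
        print(interval, round(mean_availability(0.01, interval), 4))
```

Under these assumptions a longer inspection interval always lowers the mean availability, which is why the models reviewed in this section weigh inspection cost against the time a failure stays hidden.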
Many models dedicated to the inspection of standby systems were
developed in the 1970s and 1980s. For example, a two-unit repairable system is
analyzed in [95]. In this work, the first unit is operative and the other is in cold standby.
The author considers two types of failure situations: (1) a failure of an
active element is detected instantaneously but a failure of a standby unit is revealed
at inspection epochs only and (2) a failure of both the active and the standby units
is revealed at the time of an inspection only. An extension of this model is
presented in [96], where the authors discuss a two-unit cold standby redundant
system with repair, inspection, and preventive maintenance. The model is based on
the assumption of arbitrary distributions of failure, inspection, repair, and
preventive repair times.
The reliability analysis of a two-unit cold standby system served by a single
repair facility is given in [97]. In this work, the authors
assume that a single repair facility handles inspection, replacement, preparation,
and repair. Moreover, failure, delivery, replacement, and inspection times have
exponential distributions, whereas all other time distributions are general.
A similar problem is analyzed in [98], where the authors investigate a two-unit
warm standby system with minor (internal) and major (external) repair. Another
extension of these works applies to the analysis of two non-identical units. Using the
regenerative point technique, various pointwise and steady-state reliability charac-
teristics of system effectiveness are obtained.
Later, a warm standby n-system with operational and repair times following
phase-type distributions is considered in [99]. The analyzed system is governed by
a level-dependent quasi-birth-and-death process and the general Markov model is
provided. The main reliability characteristics that are calculated include availability
and rate of occurrence of failures.
Another extension of the inspection model developed in [97] is given in [100].
In  this work, the authors consider a reliability model for a two-unit cold standby
system with a single server. In  the work, various reliability measures of system
effectiveness are obtained by using a semi-Markov process and a regenerative point
technique. Later, this model is extended in [101], where the authors investigate two
non-identical units, where the first unit goes for repair, inspection, and post repair
(when needed), whereas the second unit is as good as new after repair. The priority
in operation is given to the first unit (lower running costs), while the priority in repair
is given to the second unit (less time consuming). The model also is based on various
calculations of reliability characteristics with the use of regenerative point technique
and Monte Carlo simulation.
Moreover, an extension of [100] is given in [102]. The authors in this work study
two dissimilar (automatic and manual) cold standby systems. An inspection policy is
introduced for the automatic machine to detect its failures. The model solution
is based on the estimation of various measures of reliability and of the profit
incurred by the system using a semi-Markov process and a regenerative point technique.
The problem of time-dependent unavailability of periodically tested aging com-
ponents under various testing and repair policies is analyzed in [103,104].
Maintenance of multi-component systems whose components may be
either in operating condition or in standby mode is investigated in [70]. The authors
in this work define an inspection policy along with a preventive maintenance (PM)
procedure and imperfect testing for a series system. The cost optimization is
performed using renewal theory.
The shock model implementation is considered in [105]. The authors in this work
consider a parallel redundant system consisting of n components. Considering the
assumption that the arrival rate of shocks and the failure probabilities of compo-
nents may depend on an external Markovian environment, the authors propose
several state-dependent maintenance policies based on system availability and cost
functions.
Failure interaction between components is considered in [106]. The authors in this
work investigate a two-component cold standby system under periodic inspections.
They assume that a failure of one component can, with a constant probability, modify
the failure probability of the component still operating, and they obtain the system
reliability function for the case of staggered inspections. The failure interaction
scheme is similar to the shock model used in studies of common cause failures (known
as the β-factor model).
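For orientation, the β-factor idea mentioned above can be sketched in its common textbook form, which is not the inspection model of [106]. Each component failure rate λ is split into an independent part (1 − β)λ and a common cause part βλ that fails all components at once. The Python sketch below evaluates the resulting reliability of a two-unit active parallel system without repair or inspection. The system structure, the function name, and the numerical values are assumptions made only for illustration.

```python
import math


def parallel_reliability(t, rate, beta):
    """Reliability at time t of a two-unit parallel system, beta-factor model.

    Assumes exponential lifetimes, no repair and no inspection (illustrative only).
    """
    # Probability that one unit has had no independent failure by time t.
    independent = math.exp(-(1.0 - beta) * rate * t)
    # Probability that no common cause failure has occurred by time t.
    common = math.exp(-beta * rate * t)
    # The system survives if at least one unit survives and no common cause event occurs.
    return (1.0 - (1.0 - independent) ** 2) * common


if __name__ == "__main__":
    for beta in (0.0, 0.1, 0.5, 1.0):
        print(beta, round(parallel_reliability(100.0, 0.01, beta), 4))
```

With β = 0 the expression reduces to the reliability of two independent units in parallel, and with β = 1 it reduces to that of a single unit, so redundancy brings no benefit when all failures are common cause failures.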
Research on testing policies for two-unit parallel standby systems with
non-identical components is continued in [107]. The authors in
this work propose an optimal testing policy for a system under the criteria of
availability and maintenance costs. The analytical solution is provided in the context of
recognition of common cause failure.
Moreover, a comparison of various inspection models for redundant systems is
given in [108]. In this work, the authors compare four models of
two- and three-component systems using discrete Markov chains. The first model
applies to active redundancy without component repair, the second to
active redundancy with component repair, and the third and fourth to
standby redundancy without and with component repair, respectively.

2.3.2  Inspection Maintenance for Operating Systems


Inspection models for multi-unit operating systems include two main groups of research
works: test procedure searching models and optimal inspection models. The first group
of models is focused on the development of the best maintenance scheduling order,
answering the question: In what order should the components be tested to satisfy the
time requirements? The second group of inspection models focuses on searching for the
optimal maintenance policy under cost and/or reliability criteria.
One of the first research works on optimum test procedure models is given in [109].
The author in this work focuses on searching for test procedures that maximize the
probability of locating a failed component within the given time. The solution is pro-
vided using renewal theory and dynamic programming. Later, the authors in [110]
study the problem of scheduling activities of several types under time constraints.
The developed model is focused on finding an optimal schedule that specifies the
periods to execute each of the activity types to minimize the long-run average cost
per period. The discrete time maintenance problem of n machines is solved for finite
and infinite time horizon cases.
The  implementation of an imperfect inspection case into a maintenance man-
agement model is presented in [111]. The authors in this work analyze a two-stage
inspection process that considers detection and sizing activities. The purpose of this
study is to develop a method that simulates deterioration, inspection, repair, and
failure of structures over time using Markov matrices.
Another inspection model that includes an imperfect inspection problem is given
in [112]. The authors present a model for determining optimal inspection plans for
critical multi-characteristic components. The inspection is performed in stages by
inspectors who may commit two kinds of error: false acceptance and false rejection.
This problem is continued later in [113] and the extension of
this model is given in [114]. The model is focused on finding the optimal number of
inspections necessary to minimize the total cost per accepted component.
Imperfect inspections are also analyzed in [115,116].
In [116], the authors investigate an imperfect inspection model focused on
testing and on the estimation of model parameters. The probability of failure
detection is constant, and the solution is based on a Markov chain and simulation
modeling. In [115], the authors develop a maintenance policy for pipelines subjected
to corrosion, including predictive degradation modeling, time-dependent reliability
assessment, inspection uncertainty, and expected cost optimization. The solution is
obtained with the use of Bayesian modeling. The influence of type I and type II
inspection errors on maintenance costs is investigated in [117].
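A constant detection probability has a simple consequence that can be checked numerically. If each inspection detects an existing failure with the same probability p, independently of earlier inspections, the number of inspections needed to detect the failure is geometrically distributed. The Python sketch below illustrates this consequence only. It is not the Markov chain model of [116], and the independence assumption, the function names, and the value p = 0.8 are hypothetical.

```python
def prob_detected_within(p_detect, n_inspections):
    """Probability that an existing failure is found within n inspections.

    Assumes independent inspections with constant detection probability.
    """
    return 1.0 - (1.0 - p_detect) ** n_inspections


def expected_inspections_to_detect(p_detect):
    """Mean number of inspections until detection (geometric distribution)."""
    return 1.0 / p_detect


if __name__ == "__main__":
    p = 0.8  # assumed detection probability per inspection
    for n in (1, 2, 3):
        print(n, round(prob_detected_within(p, n), 4))
    print(expected_inspections_to_detect(p))
```

Under this assumption the expected time a failure remains hidden grows in proportion to 1/p, which indicates why inspection errors feed directly into maintenance costs.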
The second group of models applies to the problem of optimization of inspection
policy parameters. In this area, one of the preliminary models is given in [118].
The author in this work develops an optimal inspection and replacement model
for a coherent system with components having exponential life-time distributions.
The solution is based on the implementation of a semi-Markov decision process
framework.
One of the extensions of this model is presented in [119], where the author develops
an optimal inspection strategy under two optimality criteria: the long-run average net
income and the total expected discounted net income. The author considers a
multi-unit machine in a series-reliability structure, under the assumption that only one
unit can be tested during the inspection process. This problem is investigated later
in [120], where the author gives an example to demonstrate that the previously
presented characterization of the optimal inspection policy for series systems is not
correct in the discounted case.
Another extension of the optimal inspection model given in [118] applies to the
investigation of reliability characteristics. For example, in [121] the author presents
an analytical method that gives upper and lower bounds for the reliability of
systems subject to inspections at Poisson random times. This model is extended later
in [122] by providing the exact expression of the reliability function, its Laplace
transform, and the Mean Time To Failure (MTTF) of the system.
Later, perfect and minimal repair policies in a reliability model are considered
in [123]. The author in this work considers two-unit systems with stochastic depen-
dence and two types of failures (soft and hard failures), providing analytical reliabil-
ity and cost models. The practical application is based on the optimization of steam
turbine system maintenance.
The issues of structural reliability are considered in [124], where the authors ana-
lyze the optimal time interval for inspection and maintenance of offshore structures.
The structural reliability is expressed here by means of closed-form mathematical
formulas that are incorporated into the cost-benefit analysis.
Inspection maintenance policies for multi-state systems can also be found in the
literature. For example, in [125] the authors focus on a periodic inspection
maintenance model for a system with several multi-state components over a finite
time horizon. The degradation process of the components is modeled by a
non-homogeneous continuous-time Markov chain, and particle swarm optimization
is used to optimize the maintenance threshold and inspection intervals under cost
constraints. Later, in [126] an optimization model of an inspection-based PM policy
is developed for three-state mechanical components subject to competing failure
modes, which integrates continuous degradation and discrete shock effects. Periodic
inspection of series systems with revealed and unrevealed failures is considered
in [127]. This model extends the one given in [118] by introducing the probability
that a failure is revealed. The simple maintenance model for n independent components in
series is based on renewal theory.
Series-parallel systems are considered in  [128]. The  authors propose a general
preventive maintenance model used to optimize the maintenance cost. The model is
developed using a simulation approach and a parallel simulation algorithm for avail-
ability analysis. A special ratio-criterion is based on a Birnbaum importance factor.
The optimization is performed using a genetic algorithm technique.
The  summary of the main known models published in the recent literature is
presented in Table 2.3. The author considers the same classification criteria as in the
previous sections.
TABLE 2.3
Summary of Inspection Policies for Multi-unit Systems

System Type | Standby Unit Type | Planning Horizon | Quality of Performed Inspections | Optimization Criterion | Modeling Method/Checking Procedures | Type of References | Publication Years
--- | --- | --- | --- | --- | --- | --- | ---
Standby system | Cold standby | Infinite | Perfect | Main unreliability characteristics | Analytical (regenerative point technique) | [104] | 1997
Standby system | Cold standby | Infinite | Perfect | Reliability function, MTTF | Analytical (renewal theory) | [95] | 1970
Standby system | Cold standby | Infinite | Perfect | Expected loss due to system unavailability per time unit, the average system unavailability per cycle | Analytical (renewal theory) | [107] | 2012
Standby system | Cold standby | Infinite | Perfect | Main reliability characteristics, the expected total profit per unit of time | Semi-Markov process and regenerative point technique | [102] | 2016
Standby system | Cold standby | Infinite | Perfect | Main reliability characteristics, the profit function | Semi-Markov process and regenerative point technique | [100] | 2011
Standby system | Cold standby | Infinite | Perfect | Main reliability characteristics, the expected total profit per unit of time | Regenerative point technique, MC simulation, Bayesian setup | [101] | 2012
Standby system | Warm standby | Infinite | Perfect | Main reliability characteristics, the total cost of a system per unit of time | Generalized Markov process | [99] | 2008
Standby system | Warm standby | Infinite | Perfect/imperfect | Total cost per unit of time | Analytical (renewal theory) | [129] | 2002
Standby system | Cold/warm standby | Infinite | Perfect | Limiting average availability, the expected cost rate | Analytical (renewal theory), Markov jump process | [105] | 2009
Standby system | Cold standby | Finite/infinite | Perfect | Main reliability characteristics | Analytical (regenerative point technique) | [97] | 1995
Standby system | Warm standby | Finite/infinite | Perfect | Main reliability characteristics, the expected total profit in (0,t] and per unit of time | Analytical (regenerative point technique) | [98] | 1995
Standby system | Cold standby | Finite/infinite | Perfect | Main unreliability characteristics | Analytical (regenerative point technique) | [103] | 1999
Standby system | Cold standby | Finite | Perfect | Distribution function of time to the first system down and the mean time to the first system down | Analytical (renewal theory) | [96] | 1970
Standby system | Warm standby | Finite | Perfect | Average unavailability in inspection interval | Analytical | [106] | 2005
Operating system | n/a | Infinite | Perfect | Long-run expected cost per unit time | Semi-Markov decision framework | [118] | 1987
Operating system | n/a | Infinite | Perfect | Long-run average net income and total expected discounted net income | Renewal theory | [119] | 1989
Operating system | n/a | Infinite | Perfect | Expected cost of operation per unit of time | Renewal theory | [126] | 2016
Operating system | n/a | Infinite | Perfect | Average total cost of maintenance for unit of time | Renewal theory | [127] | 2009
Operating system | n/a | Infinite | Perfect | Total expected discounted net income | Analytical | [120] | 1991
Operating system | n/a | Finite/infinite | Perfect | Long-run average cost per period | Analytical | [110] | 1998
Operating system | n/a | Finite | Perfect | Probability that the failed component is checked out before given time period | Renewal theory and dynamic programming | [109] | 1964
Operating system | n/a | Finite | Imperfect | Total cost of inspection | Analytical | [114] | 2008
Operating system | n/a | Finite | Imperfect | Expected annual total cost | Markov model and event-based decision theory | [111] | 2010
Operating system | n/a | Finite | Imperfect | Expected total cost per accepted component | Analytical and Bayes theorem | [112,113] | 1995, 2002
Operating system | n/a | Finite | Perfect | Expected total cost | Analytical | [24] | 1995
Operating system | n/a | Finite | Perfect | Failure distribution parameters | Nonlinear programming | [130] | 2014
Operating system | n/a | Finite | Perfect | Total inspection cost | Particle swarm optimization algorithm | [131] | 2012
Operating system | n/a | Finite | Perfect | Availability function | Analytical | [132] | 2011
Operating system | n/a | Finite | Perfect | Sum of inspection, repair and risk cost | Simulation modeling | [133] | 1999
Operating system | n/a | Finite | Perfect | Reliability characteristics | Renewal theory | [121] | 1999
Operating system | n/a | Finite | Perfect | Reliability function, MTTF | Analytical | [122] | 2002
Operating system | n/a | Finite | Perfect | Expected cost incurred in the inspection for each cycle | Analytical | [123] | 2016
Operating system | n/a | Finite | Perfect | System availability function, inspection cost | GA and MC simulation | [128] | 2003
Operating system | n/a | Finite | Perfect | Maintenance cost rate in a renewal cycle | Non-homogeneous continuous Markov chain | [125] | 2015
Operating system | n/a | Finite | Perfect/imperfect | Total expected social cost | Markov decision process and quasi-Bayes approach | [74] | 2008
Operating system | n/a | Finite | Imperfect | Expected total cost function | Analytical (cost-benefit analysis) | [124] | 2014
Operating system | n/a | Finite | Imperfect | Expected cost incurred in a cycle | Analytical (and Bayes theory) | [115] | 2013
Operating system | n/a | Finite | Imperfect | Probability functions | Analytical and three-state Markov chain | [116] | 1993
2.4  HYBRID INSPECTION MODELS


In the investigation of hybrid inspection models, two main groups of models can be
defined:

• Risk-based inspection models (RBI)
• Inspection models with preventive maintenance policy implementation

The first group of models focuses on “designing and optimization of an inspection
scheme based on the performance of a risk assessment progress using historical
database, analytical methods, experience and engineering judgment” [134]. In this
approach, risk assessment is used as a valuable tool to assign priorities among inspec-
tion and maintenance activities by analyzing the likelihood of failure and its conse-
quences [135,136]. This approach is predominantly used in the oil and gas industries
(see [134,136–139]), but some implementations also may be found for marine sys-
tems (see [135]), nuclear power plants (see [140–142]), or railway systems (see [143]).
A basic overview on RBI is given in [6].
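As a rough illustration of how risk assessment can set inspection priorities, the sketch below scores each item as likelihood of failure times consequence and sorts the items by that score. The class name, the field names, and all numbers are hypothetical and are not taken from the cited RBI works; real RBI schemes typically use richer risk matrices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """Hypothetical inspection candidate; all fields are illustrative."""
    name: str
    failure_probability: float  # assumed likelihood of failure per year, in [0, 1]
    consequence_cost: float     # assumed cost incurred if the failure occurs

    @property
    def risk(self) -> float:
        # Risk score: likelihood times consequence (expected loss per year).
        return self.failure_probability * self.consequence_cost


def rank_by_risk(assets):
    """Return the assets ordered from highest to lowest risk score."""
    return sorted(assets, key=lambda a: a.risk, reverse=True)


assets = [
    Asset("pipeline segment A", 0.02, 500_000.0),
    Asset("pressure vessel B", 0.001, 20_000_000.0),
    Asset("pump C", 0.10, 30_000.0),
]
ranking = rank_by_risk(assets)
for asset in ranking:
    print(f"{asset.name}: risk = {asset.risk:,.0f}")
```

In this made-up data set the item with the lowest failure probability ranks first because its consequence dominates, which is the kind of trade-off a risk-based scheme is meant to expose.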
The second group covers models that address a range of different problems. For
example, some models build the inspection policy around maintenance-free operating
periods (see [73]). A model that combines a standard age replacement policy (ARP)
with a maintenance procedure for unrevealed failures is given in [70]. A policy for
a unit that is inspected and maintained preventively at periodic intervals is given
in [144], where the author develops two maintenance models as extensions of the
well-known ARP and of an inspection model with constant checking time.
An inspection-repair-replacement (IRR) policy is introduced in [71,72]. In these
works, the authors assume that a system is inspected at pre-assigned times to
distinguish between the up and down states. If the system is found in the down state
during an inspection, a repair action is taken (perfect repair according to [71] or
minimal repair according to [72]). Moreover, periodic preventive replacement is
performed. The aim is to determine an optimal IRR policy that keeps the availability
of the system high enough at any time while minimizing cost. The models are based on
renewal reward processes.
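The renewal reward argument can be illustrated with a deliberately simplified case that is not the model of [71] or [72]. Assume an exponential lifetime with rate lam, inspections every T time units with negligible duration, a hidden failure that is detected at the next inspection, and an instantaneous perfect repair. A renewal cycle then lasts T times the number of inspections until detection, so the limiting availability is E[uptime] / E[cycle length] = (1 - exp(-lam*T)) / (lam*T). If the only cost is a fixed cost per inspection, the cheapest policy that meets an availability target is the largest feasible T. The function names and parameter values below are hypothetical.

```python
import math


def limiting_availability(failure_rate, interval):
    """Long-run availability under the simplifying assumptions in the lead-in."""
    x = failure_rate * interval
    if x == 0.0:
        return 1.0
    # (1 - exp(-x)) / x, written with expm1 for accuracy at small x.
    return -math.expm1(-x) / x


def largest_interval(failure_rate, target, tol=1e-9):
    """Largest inspection interval whose availability still meets the target.

    Availability decreases monotonically in the interval, so bisection applies.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target availability must lie strictly between 0 and 1")
    lo, hi = 0.0, 1.0 / failure_rate
    while limiting_availability(failure_rate, hi) >= target:
        hi *= 2.0  # expand until the upper bound violates the target
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if limiting_availability(failure_rate, mid) >= target:
            lo = mid
        else:
            hi = mid
    return lo


lam = 0.01           # assumed failure rate per hour
best_T = largest_interval(lam, target=0.95)
cost_rate = 40.0 / best_T   # assumed cost of 40 per inspection
print(best_T, limiting_availability(lam, best_T), cost_rate)
```

Repair and replacement costs are left out on purpose; including them would turn the problem into a genuine trade-off rather than a boundary solution.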
Simple and hybrid inspection policies that aim to guarantee a high level of
availability are investigated in [175]. First, simple periodic inspection is
analyzed. To overcome its weaknesses and to use information about the remaining
life of the system, quantile-based inspections are introduced. This inspection
policy is valid for systems with an increasing failure rate. Later, a hybrid
inspection policy is developed that selects the maintenance action (periodic
inspections or quantile-based inspections) according to the type of lifetime
distribution: increasing failure rate or decreasing failure rate. Analytical
solutions and numerical examples are provided for the limiting average availability
and the long-run inspection rate.
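One common reading of a quantile-based schedule, assumed here rather than taken from the cited work, is that inspection times are placed so that the probability of surviving from one inspection to the next is a constant p. For a Weibull lifetime with survival function R(t) = exp(-(t/scale)^shape), this gives R(t_k) = p^k and therefore t_k = scale * (-k ln p)^(1/shape). The sketch below uses hypothetical parameter values.

```python
import math


def weibull_survival(t, shape, scale):
    """Survival function R(t) of a Weibull lifetime."""
    return math.exp(-((t / scale) ** shape))


def quantile_inspection_times(shape, scale, p, count):
    """Inspection times t_k chosen so that R(t_k) = p**k (assumed definition)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    step = -math.log(p)
    return [scale * (k * step) ** (1.0 / shape) for k in range(1, count + 1)]


# Increasing failure rate (shape > 1): inspections become more frequent with age.
times = quantile_inspection_times(shape=2.0, scale=1000.0, p=0.9, count=5)
intervals = [b - a for a, b in zip([0.0] + times, times)]
print([round(t, 1) for t in times])
print([round(d, 1) for d in intervals])
```

With shape equal to 1 (constant failure rate) the same rule reduces to equally spaced, that is periodic, inspections, which is consistent with the text restricting the quantile-based policy to increasing failure rates.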
A randomly failing single-unit system whose failures may be self-announcing or
not self-announcing is considered in [78]. The system is submitted to inspection
when its age reaches Tyin units of time. The model includes imperfect inspection
and preventive replacement, and it builds on the basic ARP strategy for the case of
self-announcing failures. The objective is to determine the inspection and
preventive maintenance interval that maximizes the stationary availability of the
system.
Hybrid inspection models have also been developed for multi-unit systems.
A block inspection and replacement policy is presented in [106], where the
authors introduce periodic inspection for a two-unit parallel system. The model
considers the detection capacity of inspections (perfect/imperfect), minimal repairs,
and failure interactions that capture the dependence between subsystems.
The model developed in [146] continues the investigation of issues analyzed in
[106] and [147]. The authors consider a multi-unit system composed of identical
units subject to periodic imperfect PM and to periodic inspection carried out every
T_in time units. During an inspection, units are checked to ascertain whether they
are working. Failed units are replaced by new ones at inspection time. Assuming
negligible PM times, the authors estimate the average cost per unit time.
Another problem is presented in [148], where the authors consider periodic and
opportunistic inspections of a system with hard-type and soft-type components.
Failures of soft-type components can be detected only at inspections. Thus, the
system can operate with a soft failure, but its performance may be reduced.
Hard-type component failures are self-announcing and create an opportunity for an
additional (opportunistic) inspection of all soft-type components. Moreover, the
system is also inspected periodically. Based on these assumptions, two optimization
models are discussed using a simulation modeling approach and cost criteria. This
problem is continued in [149].
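The effect of opportunistic inspections can be explored with a small Monte Carlo sketch. It is not the simulation model of [148]. It assumes one soft-type and one hard-type component with exponential lifetimes, instantaneous replacement, perfect inspections, and calendar-based periodic inspections. It estimates the fraction of time the soft component spends failed but undetected. All names and rates are hypothetical.

```python
import math
import random


def hidden_downtime_fraction(soft_rate, hard_rate, interval, horizon,
                             opportunistic, seed=0):
    """Estimate the share of time a soft failure stays undetected."""
    rng = random.Random(seed)
    soft_fail = rng.expovariate(soft_rate)  # time of the next hidden soft failure
    next_hard = rng.expovariate(hard_rate) if opportunistic else math.inf
    next_periodic = interval
    hidden = 0.0
    while True:
        t = min(next_periodic, next_hard)   # next inspection epoch of either kind
        if t >= horizon:
            break
        if soft_fail <= t:
            # The soft failure is found now; the component is replaced at once.
            hidden += t - soft_fail
            soft_fail = t + rng.expovariate(soft_rate)
        if t == next_periodic:
            next_periodic += interval
        else:
            # A self-announcing hard failure triggered this inspection.
            next_hard = t + rng.expovariate(hard_rate)
    if soft_fail < horizon:
        hidden += horizon - soft_fail       # failure still hidden at the horizon
    return hidden / horizon


periodic_only = hidden_downtime_fraction(0.01, 0.02, 50.0, 500_000.0, False, seed=1)
with_opportunistic = hidden_downtime_fraction(0.01, 0.02, 50.0, 500_000.0, True, seed=1)
print(periodic_only, with_opportunistic)
```

For the periodic-only case the estimate can be checked against the closed form 1 - (1 - exp(-lam*T)) / (lam*T), which holds under the same exponential assumptions.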
Opportunistic inspection is also considered in [150]. The authors investigate an
nk-out-of-n system with hidden failures under periodic inspection. The model
assumes that every system failure presents an additional opportunity for
inspection. The objective is to find the optimal periodic inspection policy and the
optimal maintenance action at each inspection for the entire system. Three types of
maintenance are considered: minimal repair, preventive replacement, and corrective
replacement. The inspection maintenance model uses a genetic algorithm and cost
criteria. An extension of this model is presented in [151], where the authors focus
on an nk-out-of-n system with components whose failures follow a Non-Homogeneous
Poisson Process (NHPP). This model does not optimize the maintenance action, which
depends on the component state (age dependent). However, it includes an inventory
policy that supports the inspection policy by ensuring that the required spares are
available when needed (at inspection times). The modeling approach is based on a
simulation model.
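Two building blocks behind such models can be sketched under stated assumptions. The first is the standard reliability of a k-out-of-n structure (written nk-out-of-n in the text) with independent, identical components. The second assumes a power-law NHPP with cumulative intensity (t/scale)^shape, for which the number of failures in an interval is Poisson distributed, and sizes the spares stock for one inspection interval at a given service level. The power-law form, the function names, and the numbers are illustrative choices, not the formulation of [150] or [151].

```python
import math


def k_out_of_n_reliability(k, n, p):
    """Probability that at least k of n independent components work,
    where each component works with probability p."""
    return sum(math.comb(n, j) * p**j * (1.0 - p) ** (n - j)
               for j in range(k, n + 1))


def expected_nhpp_failures(start, end, shape, scale):
    """Expected failures in (start, end] for a power-law NHPP (assumed form)."""
    return (end / scale) ** shape - (start / scale) ** shape


def spares_for_service_level(mean_demand, service_level):
    """Smallest stock s such that P(Poisson demand <= s) >= service_level."""
    if not 0.0 < service_level < 1.0:
        raise ValueError("service_level must lie strictly between 0 and 1")
    term = math.exp(-mean_demand)   # P(demand = 0)
    cdf, stock = term, 0
    while cdf < service_level:
        stock += 1
        term *= mean_demand / stock  # Poisson recursion for P(demand = stock)
        cdf += term
    return stock


print(k_out_of_n_reliability(2, 3, 0.9))
demand = expected_nhpp_failures(0.0, 100.0, shape=2.0, scale=100.0)
print(demand, spares_for_service_level(demand, 0.95))
```

With a shape parameter above 1 the expected demand per interval grows with system age, so a fixed spares level that is adequate early in life may become insufficient later.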

2.5  OTHER INSPECTION MAINTENANCE MODELS


A review of the literature on inspection maintenance also reveals other issues
(not mentioned in the previous subsections). The most commonly investigated
ones include:

• Production planning and quality control (see [152–155])
• Cumulative damage modeling (see [156,157])
• Joint optimization of inventory policy with inspection maintenance modeling
(see [158,159])
• Safety and reliability in maintenance (see [6,160–165])

Some case studies address the optimization of inspection schedules for different
systems. For example, the optimization of inspection policy has been studied for
railway carriers (see [166]), nuclear power plants (see [161,167,168]),
tunnel lighting systems (see [169]), a scale that weighs products in the final stage of
the manufacturing process (see [170,171]), sewing machines (see [172]), and wooden
pole structures (see [173]). Other inspection problems that have been investigated
concern the optimization of the periodic inspection of aircraft (see [130]), maintenance of
transport systems with a subjective estimation approach (see [174]), investigations of
system reliability structure (see [175]), inspection frequency of safety-related control
systems of machinery (see [132,176]), optimization of inspection and maintenance
decisions for infrastructure facilities (see [74]), inspection issues of hydraulic
components (see [133]), safety-related control systems (see [132]), and multi-stage
inspection problems (see [131]). Simulation modeling is investigated in [177].
The inspection of production processes and systems, together with the related
maintenance issues, has also been widely investigated. Research in this area
focuses mostly on computer-aided inspection planning systems (see [178] for the
state of the art) or on maintenance and inspection models for production inventory
systems (see [179–183]). Authors in this area are also interested in developing
inspection policies for systems in storage to provide high reliability (see [184–189]).

2.6  CONCLUSIONS AND DIRECTIONS FOR FURTHER RESEARCH


In this chapter, the author provides a literature review of the most commonly
used optimal inspection maintenance models. The literature was selected using
Google Scholar as a search engine together with ScienceDirect, JSTOR, SpringerLink,
and SAGE Journals. The author primarily searched the relevant literature based on
keywords, abstracts, and titles. In addition, the articles found were searched for
relevant references. The following main terms and/or combinations of them were used
in the search: inspection maintenance, inspection model, and inspection
maintenance optimization.
The selection methodology was based on searching for the defined keywords and
then choosing the models that satisfy the main reviewing criteria. For example,
a Google search for the keyword “inspection maintenance” returned about
260 million hits, and the ScienceDirect database returned about 98,500 hits for the
same keyword. After the search results were compared against the main required
criteria, such as periodic inspection, maintenance optimization, and technical
system, 122 inspection models published from 1962 to 2016 (see Figure 2.2) became
the focus of this chapter.

[Chart: segment labels 2%, 6%, 10%, 59%, 23%; legend 1962–1969, 1970–1979, 1980–1989, 1990–1999, 2000–2016]

FIGURE 2.2  Models distribution in relation to the period of their publication.

Due to the plethora of available publications on inspection maintenance, it was
not possible to present all the known models from this research area. The most
investigated topics that are not included in this chapter are:

• Sequential inspection maintenance modeling (see [17,23,57])
• Condition-based maintenance with inspection modeling issues (see [190])
• Delay-time modeling (see [19])

This literature overview lets the author draw the following main conclusions:

• The mathematical methods most commonly applied to inspection maintenance
scheduling problems include applied probability theory, renewal theory,
Markov decision theory, and the Genetic Algorithm (GA) technique. However, many
inspection maintenance problems are too complex (e.g., shocks modeling and
information uncertainty) to be solved analytically. Thus, in practice, simulation
and Bayesian approaches can be widely used.
• Most research on periodic inspections for hidden failures assumes that the
times for inspection are negligible. However, in some cases the inspection
time cannot be ignored because of its influence on system reliability
characteristics. In such cases, a policy derived under this assumption is not optimal.
• Many inspection maintenance models are based on simplified assumptions
of infinite planning horizon, the steady-state conditions, perfect repair pol-
icy, available spare parts, and so on. These assumptions often are not valid
for performance of real-life systems.
• Due to the complexity of models developed for inspection maintenance,
in many cases the optimal checking procedures are difficult to compute.
In such situations, nearly optimal methods or algorithms should be implemented.
Such algorithms usually are developed for the single-unit case.
• The widely known inspection maintenance models focus on inspection actions
that only give information about the state of the tested system (up state or
down state). No models have been developed that give additional information about
the signals of forthcoming failures (the occurrence of defects); thus, this type of
maintenance model is not sufficient for systems in which such symptoms may be
diagnosed.
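The conclusion about non-negligible inspection times can be made concrete with a small sketch under simplifying assumptions that are not drawn from any single cited model: an exponential lifetime with rate lam, an inspection after every T units of operating time, each inspection taking d time units during which the unit is unavailable and does not age, and instantaneous renewal once a failure is found. The limiting availability is then A(T) = (1 - exp(-lam*T)) / (lam*(T + d)). For d = 0 it increases as T shrinks, so no finite optimum exists, whereas for d > 0 an interior optimum appears. Function names and values are hypothetical.

```python
import math


def availability(failure_rate, interval, duration):
    """Limiting availability with inspection duration (assumptions in lead-in)."""
    return -math.expm1(-failure_rate * interval) / (
        failure_rate * (interval + duration))


def optimal_interval(failure_rate, duration, tol=1e-10):
    """Interval maximizing availability when duration > 0.

    The sign of dA/dT equals the sign of g(T), which is strictly decreasing,
    so its single root can be found by bisection.
    """
    if duration <= 0.0:
        raise ValueError("duration must be positive for an interior optimum")

    def g(t):
        return (failure_rate * math.exp(-failure_rate * t) * (t + duration)
                + math.expm1(-failure_rate * t))

    lo, hi = 0.0, 1.0 / failure_rate
    while g(hi) > 0.0:
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


lam, d = 0.01, 0.5   # assumed failure rate and inspection duration
best = optimal_interval(lam, d)
print(best, availability(lam, best, d))
print(availability(lam, 1.0, d))   # very frequent inspection now costs availability
```

For small lam*d the optimum is close to sqrt(2*d/lam), which gives a quick plausibility check on the numerical result.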

REFERENCES
1. Tang T (2012) Failure finding interval optimization for periodically inspected repair-
able systems. PhD Thesis, University of Toronto.
2. Keller JB (1982) Optimum inspection policies. Management Science 28(4): 447–450.
3. Sheriff YS (1982) Reliability analysis: Optimal inspection & maintenance schedules of
failing equipment. Microelectronics Reliability 22(1): 59–115.
4. PN-EN 13306:2018 Maintenance—Maintenance terminology, The Polish Committee
for Standardization, Warsaw.
5. Gulati R, Kahn J, Baldwin R (2010) The professional’s guide to maintenance and reli-
ability terminology. Reliabilityweb.com.
6. Peters R (2014) Reliable Maintenance Planning, Estimating, and Scheduling. Gulf
Professional Publishing.
7. Barlow RE, Hunter LC, Proschan F (1963) Optimum checking procedures. Journal
of the Society for Industrial and Applied Mathematics 11(4): 1078–1095. https://www.
jstor.org/stable/2946496.
8. Beichelt F, Tittmann P (eds.) (2012) Reliability and Maintenance. Networks and
Systems. CRC Press.
9. Radner R, Jorgenson DW (1962) Optimal replacement and inspection of stochasti-
cally failing equipment. In: Arrow KJ, Karlin S, Scarf H (eds.) Studies in Applied
Probability and Management Science, Stanford University Press: 184–206.
10. Jorgenson DW, McCall JJ (1963) Optimal scheduling of replacement and inspection.
Operations Research 11(5): 732–746.
11. Pierskalla WP, Voelker JA (1976) A survey of maintenance models: The control and
surveillance of deteriorating systems. Naval Research Logistics Quarterly 23: 353–388.
12. Valdez-Flores C, Feldman R (1989) A survey of preventive maintenance models for sto-
chastically deteriorating single-unit systems. Naval Research Logistics 36: 419–446.
13. Cho ID, Parlar M (1991) A  survey of maintenance models for multi-unit systems.
European Journal of Operational Research 51(1): 1–23.
14. Thomas LC, Gaver DP, Jacobs PA  (1991) Inspection models and their application.
IMA Journal of Mathematics Applied in Business and Industry 3: 283–303.
15. Parmigiani G (1991) Scheduling inspections in reliability. Institute of Statistics and
Decision Sciences Discussion Paper no.  92–A11:1–21, Duke University. https://stat.
duke.edu/research/papers/1992-11 (accessed 17 October 2018).
16. Osaki S (ed.) (2002) Stochastic Models in Reliability and Maintenance,
Springer-Verlag, Berlin, Germany.
17. Nakagawa T (2005) Maintenance Theory of Reliability. Springer.
18. Jardine AKS, Tsang AHC (2013) Maintenance, replacement and reliability. Theory and
Applications. CRC Press.
19. Werbińska-Wojciechowska S (2019) Technical System Maintenance. Delay-Time-Based
Modeling. Springer.
20. Kaio N, Osaki S (1989) Comparison of inspection policies. Journal of the Operational
Research Society 40(5): 499–503. Palgrave Macmillan Journals.
21. Kaio N, Osaki S (1988) Inspection policies: Comparisons and modifications. Revue
française d’automatique, d’informatique et de recherche opérationnelle. Recherche
opérationnelle 22(4): 387–400.
22. Munford AG (1981) Comparison among certain inspection policies. Management
Science 27(3): 260–267.
23. Jiang R, Jardine AKS (2005) Two optimization models of the optimum inspection prob-
lem. The Journal of the Operational Research Society 56(10): 1176–1183. doi:10.1057/
palgrave.jors.2601885.
24. Boland PJ, El-Neweihi E (1995) Expected cost comparisons for inspec-
tion and repair policies. Computers and Operations Research 22(4): 383–390.
doi:10.1016/0305-0548(94)00047-C.
25. Hu T, Wei Y (2001) Multivariate stochastic comparisons of inspection and repair poli-
cies. Statistics and Probability Letters 51: 315–324.
26. McCall JJ (1963) Operating characteristics of opportunistic replacement and inspection
policies. Management Science 10(1): 85–97.
27. Choi KM (1997) Semi-Markov and delay time models of maintenance. PhD thesis,
University of Salford, UK.
28. Chelbi A, Ait-Kadi D (2009) Inspection strategies for randomly failing systems. In:
Ben-Daya M, Duffuaa SO, Raouf A, Knezevic J, Ait-Kadi D (eds.) Handbook of
Maintenance Management and Engineering. Springer, London, UK.
29. Lee C (1999) Applications of delay time theory to maintenance practice of complex
plant. PhD thesis, University of Salford, UK.
30. Bobrowski D (1980) Optimisation of technical object maintenance with inspections (in
Polish). In: Proceedings of Winter School on Reliability, Center for Technical Progress,
Katowice, Poland: 31–46.
31. Viscolani B (1991) A  note on checking schedules with finite horizon. Operations
Research 25(2): 203–208. doi:10.1051/ro/1991250202031.
32. Hariga MA (1996) A maintenance inspection model for a single machine with general
failure distribution. Microelectronics Reliability 36(3): 353–358.
33. Klatzky RL, Messick DM, Loftus J (1992) Heuristics for determining the optimal inter-
val between checkups. Psychological Science 3(5): 279–284.
34. Beichelt F (1981) Minimax inspection strategies for single unit systems. Naval Research
Logistics Quarterly 28(3): 375–381.
35. Leung FKN (2001) Inspection schedules when the lifetime distribution of a single-
unit system is completely unknown. European Journal of Operational Research 132:
106–115. doi:10.1016/S0377-2217(00)00115-6.
36. Okumura S (2006) Determination of inspection schedules of equipment by variational
method. Mathematical Problems in Engineering, Hindawi Publishing Corporation,
Article ID 95843: 1–16.
37. Liu B, Zhao X, Yeh R-H, Kuo W (2016) Imperfect inspection policy for systems with
multiple correlated degradation processes. IFAC-PapersOnLine 49–12: 1377–1382.
38. Sengupta B (1982) An exponential riddle. Journal of Applied Probability 19(3):
737–740.
39. Guo H, Szidarovszky F, Gerokostopoulos A, Niu P (2015) On determining optimal
inspection interval for minimizing maintenance cost. In: Proceedings of 2015 Annual
Reliability and Maintainability Symposium (RAMS), IEEE: 1–7.
40. Magott J, Nowakowski T, Skrobanek P, Werbinska-Wojciechowska S (2010) Logistic
system modeling using fault trees with time dependencies—Example of tram network.
In: Bris R, Guedes Soares C, Martorell S (eds.) Reliability, Risk and Safety: Theory and
Applications. Vol. 3, Taylor & Francis, London, UK: 2293–2300.
41. Wattanapanom N, Shaw L (1979) Optimal inspection schedules for failure detection in
a model where tests hasten failures. Operations Research 27(2): 303–317.
42. Butler DA (1979) A hazardous-inspection model. Management Science 25(1): 79–89.
43. Parmigiani G (1993) Optimal inspection and replacement policies with age-dependent fail-
ures and fallible tests. The Journal of the Operational Research Society 44(11): 1105–1114.
44. Parmigiani G (1993) Optimal scheduling of fallible inspections. DP no. 92–38: 1–30,
https://stat.duke.edu/research/papers/1992-38 (accessed 17 October 2018).
45. Rizwan SM, Chauhan H, Taneja G (2005) Stochastic analysis of systems with accident
and inspection. Emirates Journal for Engineering Research 10(2): 81–88.
46. Hryniewicz O (2009) Optimal inspection intervals for maintainable equipment. In:
Martorell S, Guedes-Soares C, Barnett J (eds.) Safety, Reliability and Risk Analysis:
Theory, Methods and Applications, Taylor & Francis Group, London.
47. Berrade MD (2012) A  two-phase inspection policy with imperfect testing. Applied
Mathematical Modelling 36: 108–114. doi:10.1016/j.apm.2011.05.035.
48. Berrade MD, Cavalcante CAV, Scarf PA (2013) Modelling imperfect inspection over
a finite horizon. Reliability Engineering and System Safety 111: 18–29. doi:10.1016/j.
ress.2012.10.003.
49. Berrade MD, Cavalcante CAV, Scarf PA (2012) Maintenance scheduling of a protec-
tion system subject to imperfect inspection and replacement. European Journal of
Operational Research 218: 716–725. doi:10.1016/j.ejor.2011.12.003.
50. Berrade MD, Scarf PA, Cavalcante CAV, Dwight RA (2013) Imperfect inspection and
replacement of a system with a defective state: A cost and reliability analysis. Reliability
Engineering and System Safety 120: 80–87. doi:10.1016/j.ress.2013.02.024.
51. Sarkar J, Sarkar S (2000) Availability of a periodically inspected system under perfect
repair. Journal of Statistical Planning and Inference 91: 77–90.
52. Cui L, Xie M (2001) Availability analysis of periodically inspected systems with
random walk model. Journal of Applied Probability 38: 860–871. doi:10.1017/
S0021900200019082.
53. Cui L, Xie M, Loh H-T (2004) Inspection schemes for general systems. IIE Transactions
36: 817–825. doi:10.1080/07408170490473006.
54. Cui L, Xie M (2005) Availability of a periodically inspected system with random
repair or replacement times. Journal of Statistical Planning and Inference 131: 89–100.
doi:10.1016/j.jspi.2003.12.008.
55. Yang J, Gang T, Zhao Y (2013) Availability of a periodically inspected system main-
tained through several minimal repairs before a replacement of a perfect repair. Hindawi
Publishing Corporation, Abstract and Applied Analysis, Article ID 741275: 1–6.
56. Tang T, Lin D, Banjevic D, Jardine AKS (2013) Availability of a system subject to
hidden failure inspected at constant intervals with non-negligible downtime due to
inspection and downtime due to repair/replacement. Journal of Statistical Planning
and Inference 143: 176–185. doi:10.1016/j.jspi.2012.05.011.
57. Luss H, Kander Z (1974) Inspection policies when duration of checkings is non-negligible.
Operational Research Quarterly 25(2): 299–309.
58. Wortman MA, Klutke G-A, Ayhan A (1994) A maintenance strategy for systems sub-
jected to deterioration governed by random shocks. IEEE Transactions on Reliability
43(3): 439–445.
59. Chelbi A, Ait-Kadi D (2000) Generalized inspection strategy for randomly failing sys-
tems subjected to random shocks. International Journal of Production Economics 64:
379–384. doi:10.1016/S0925-5273(99)00073-0.
60. Chelbi A, Ait-Kadi D (1998) Inspection and predictive maintenance strategies.
International Journal of Computer Integrated Manufacturing 11(3): 226–231.
doi:10.1080/095119298130750.
61. Klutke G-A, Yang Y (2002) The availability of inspected systems subject to shocks and
graceful degradation. IEEE Transactions on Reliability 51(3): 371–374.
62. Badia FG, Berrade MD (2006) Optimal inspection of a system with two types of fail-
ures under age dependent minimal repair. Monografias del Seminario Matematico
Garcia de Galdeano 33: 207–214.
63. Badia FG, Berrade MD (2006) Optimum maintenance of a system under two types of
failure. International Journal of Materials and Structural Reliability 4(1): 27–37.
64. Zequeira RI, Berenguer C (2006) An inspection and imperfect maintenance model
for a system with two competing failure modes. In: Proceedings of the 6th IFAC
Symposium: Supervision and Safety of Technical Processes: 932–937.
65. Sheu S-H, Tsai H-N, Wang F-K, Zhang ZG (2015) An extended optimal replacement
model for a deteriorating system with inspections. Reliability Engineering and System
Safety 139: 33–49. doi:10.1016/j.ress.2015.01.014.
66. Zequeira RI, Berenguer C (2006) Optimal scheduling of non-perfect inspections.
IMA Journal of Management Mathematics 17: 187–207. doi:10.1093/imaman/dpi037.
67. Nakagawa T, Yasui K (1980) Approximate calculation of optimal inspection times.
Journal of the Operational Research Society 31: 851–853.
68. Aven T (1984) Optimal inspection when the system is repaired upon detection of fail-
ure. Microelectronics Reliability 24(5): 961–963.
69. Yeh RH, Chen HD, Wang C-H (2005) An inspection model with discount factor for
products having Weibull lifetime. International Journal of Operations Research 2(1):
77–81.
70. Badia FG, Berrade MD, Campos CA (2002) Optimal inspection and preventive main-
tenance of units with revealed and unrevealed failures. Reliability Engineering and
System Safety 78: 157–163.
71. Yeh L (1995) An optimal inspection-repair-replacement policy for standby systems.
Journal of Applied Probability 32(1): 212–223.
72. Yang Y, Klutke G-A (2000) Improved inspection schemes for deteriorating equipment.
Probability in the Engineering and Informational Sciences 14(4): 445–460.
73. Dagg RA, Newby M (1998) Optimal overhaul intervals with imperfect inspection and
repair. IMA Journal of Mathematics Applied in Business and Industry 9: 381–391.
74. Durango-Cohen PL, Madanat SM (2008) Optimization of inspection and maintenance
decisions for infrastructure facilities under performance model uncertainty: A Quasi-
Bayes approach. Transportation Research Part A: Policy and Practice 42(8): 1074–
1085. doi:10.1016/j.tra.2008.03.004.
75. Cheng GQ, Li L (2012) A geometric process repair model with inspections and its
optimisation. International Journal of Systems Science 43(9): 1650–1655.
doi:10.1080/00207721.2010.549586.
76. Wang W, Zhao F, Peng R (2014) A  preventive maintenance model with a two-level
inspection policy based on a three-stage failure process. Reliability Engineering and
System Safety 121: 207–220. doi:10.1016/j.ress.2013.08.007.
77. Weiss GH (1963) Optimal periodic inspection programs for randomly failing equip-
ment. Journal of Research of the National Bureau of Standards—B. Mathematics and
Mathematical Physics 67B(4): 223–228.
78. Chelbi A, Ait-Kadi D, Aloui H (2008) Optimal inspection and preventive maintenance
policy for systems with self-announcing and non-self-announcing failures. Journal of
Quality in Maintenance Engineering 14(1): 34–45, doi:10.1108/13552510810861923.
79. Luss H (1976) Maintenance policies when deterioration can be observed by
inspections. Operations Research 24(2): 359–366.
80. Becker G, Camarinopoulos L, Ziouas G (1994) A  Markov type model for systems
with tolerable down times. The Journal of the Operational Research Society 45(10):
1168–1178. doi:10.2307/2584479.
81. Rosenfield D (1976) Markovian deterioration with uncertain information. Operations
Research 24(1): 141–155.
82. Tijms HC, Van Der Duyn Schouten FA (1984) A Markov decision algorithm for optimal
inspections and revisions in a maintenance system with partial information. European
Journal of Operational Research 21: 245–253. Elsevier.
83. Kawai H, Koyanagi J (1992) An optimal maintenance policy of a discrete time Markovian
deterioration system. Computers & Mathematics with Applications 24(1/2): 103–108.
84. Weiss GH (1962) A  problem in equipment maintenance. Management Science 8(3):
266–277.
85. Fung J, Makis V (1997) An inspection model with generally distributed restoration and
repair times. Microelectronics Reliability 37(3): 381–389.
86. Ohnishi M, Kawai H, Mine H (1986) An optimal inspection and replacement policy for
a deteriorating system. Journal of Applied Probability 23(4): 973–988.
87. Wang GJ, Zhang YL (2014) Geometric process model for a system with inspections
and preventive repair. Computers and Industrial Engineering 75: 13–19.
doi:10.1016/j.cie.2014.06.007.
88. White III ChC (1978) Optimal inspection and repair of a production process subject to
deterioration. The Journal of the Operational Research Society 29(3): 235–243.
89. Zuckerman D (1980) Inspection and replacement policies. Journal of Applied
Probability 17(1): 168–177.
90. Abdel-Hameed M (1987) Inspection and maintenance policies of devices subject to
deterioration. Advances in Applied Probability 19(4): 917–931.
91. Kong MB, Park KS (1997) Optimal replacement of an item subject to cumulative dam-
age under periodic inspections. Microelectronics Reliability 37(3): 467–472.
92. Delia M-C, Rafael P-O (2008) A  maintenance model with failures and inspection
following Markovian arrival processes and two repair modes. European Journal of
Operational Research 186: 694–707. doi:10.1016/j.ejor.2007.02.009.
93. Chiang JH, Yuan J (2001) Optimal maintenance policy for a Markovian system
under periodic inspection. Reliability Engineering and System Safety 71: 165–172.
doi:10.1016/S0951-8320(00)00093-4.
94. Kharoufeh JP, Finkelstein DE, Mixon DG (2006) Availability of periodically inspected
systems with Markovian wear and shocks. Journal of Applied Probability 43(2): 303–
317. doi:10.1239/jap/1152413724.
95. Mazumdar M (1970) Reliability of two-unit redundant repairable systems when failures
are revealed by inspections. SIAM Journal on Applied Mathematics 19(4): 637–647.
96. Osaki S, Asakura T (1970) A two-unit standby redundant system with repair and pre-
ventive maintenance. Journal of Applied Probability 7(3): 641–648.
97. Mahmoud MAW, Mohie El-Din MM, El-Said Moshref M (1995) Reliability analysis of
a two-unit cold standby system with inspection, replacement, proviso of rest, two types
of repair and preparation time. Microelectronics Reliability 35(7): 1063–1072.
98. Pandey D, Tyagi SK, Jacob M (1995) Profit evaluation of a two-unit system with inter-
nal and external repairs, inspection and post repair. Microelectronics Reliability 35(2):
259–264.
99. Cazorla DM, Perez-Ocon R (2008) An LDQBD process under degradation, inspection,
and two types of repair. European Journal of Operational Research 190: 494–508.
doi:10.1016/j.ejor.2007.04.056.
100. Kumar J (2011) Cost-benefit analysis of a redundant system with inspection and priority
subject to degradation. IJCSI International Journal of Computer Science Issues 8(6/2):
314–321.
101. Kishan R, Jain D (2012) A two non-identical unit standby system model with repair,
inspection and post-repair under classical and Bayesian viewpoints. Journal of
Reliability and Statistical Studies 5(2): 85–103.
102. Bhatti J, Chitkara AK, Kakkar MK (2016) Stochastic analysis of dis-similar standby
system with discrete failure, inspection and replacement policy. Demonstratio
Mathematica 49(2): 224–235.
103. Vaurio JK (1999) Availability and cost functions for periodically inspected preventively
maintained units. Reliability Engineering and System Safety 63: 133–140. doi:10.1016/
S0951-8320(98)00030-1.
104. Vaurio JK (1997) On time-dependent availability and maintenance optimization of
standby units under various maintenance policies. Reliability Engineering and System
Safety 56: 79–89. doi:10.1016/S0951-8320(96)00132-9.
105. Kenzin M, Frostig E (2009) M out of n inspected systems subject to shocks in random
environment. Reliability Engineering and System Safety 94: 1322–1330. doi:10.1016/j.
ress.2009.02.005.
106. Zequeira RI, Berenguer C (2005) On the inspection policy of a two-component parallel
system with failure interaction. Reliability Engineering and System Safety 88: 99–107.
doi:10.1016/j.ress.2004.07.009.
107. Lee BL, Wang M (2012) Approximately optimal testing policy for two-unit parallel
standby systems. International Journal of Applied Science and Engineering 10(3):
263–272.
108. Mendes AA, Coit DW, Duarte Ribeiro JL (2014) Establishment of the optimal time
interval between periodic inspections for redundant systems. Reliability Engineering
and System Safety 131: 148–165. doi:10.1016/j.ress.2014.06.021.
109. Greenberg H (1964) Optimum test procedure under stress. Operations Research 12(5):
689–692.
110. Anily S, Glass CA, Hassin R (1998) The scheduling of maintenance service. Discrete
Applied Mathematics 82(1–3): 27–42. doi:10.1016/S0166-218X(97)00119-4.
111. Sheils E, O’Connor A, Breysse D, Schoefs F, Yotte S (2010) Development of a
two-stage inspection process for the assessment of deteriorating infrastructure. Reliability
Engineering and System Safety 95: 182–194. doi:10.1016/j.ress.2009.09.008.
112. Duffuaa S, Al-Najjar HJ (1995) An optimal complete inspection plan for critical
multicharacteristic components. Journal of the Operational Research Society 46(8):
930–942.
113. Duffuaa S, Khan M (2002) An optimal repeat inspection plan with several classifi-
cations. Journal of the Operational Research Society 53(9): 1016–1026. doi:10.1057/
palgrave.jors.2601392.
114. Duffuaa S, Khan M (2008) A general repeat inspection plan for dependent multicharac-
teristic critical components. European Journal of Operational Research 191: 374–385.
doi:10.1016/j.ejor.2007.02.033.
115. Sahraoui Y, Khelif R, Chateauneuf A (2013) Maintenance planning under imperfect
inspections of corroded pipelines. International Journal of Pressure Vessels and
Piping 104: 76–82. doi:10.1016/j.ijpvp.2013.01.009.
116. Srivastava MS, Wu Y (1993) Estimation and testing in an imperfect-inspection model.
IEEE Transactions on Reliability 42(2): 280–286. doi:10.1109/24.229501.
117. Godziszewski J (2001) The impact of errors of the first and second types made dur-
ing inspections on the costs of maintenance of a homogeneous equipment park (in
Polish). In: Proceedings of XIX Winter School on Reliability—Computer Aided
Dependability Analysis, Publishing House of Institute for Sustainable Technologies,
Radom: 89–100.
118. Aven T (1987) Optimal inspection and replacement of a coherent system.
Microelectronics Reliability 27(3): 447–450. doi:10.1016/0026-2714(87)90460-4.
119. Zuckerman D (1989) Optimal inspection policy for a multi-unit machine. Journal of
Applied Probability 26: 543–551.
120. Qiu Y (1991) A note on optimal inspection policy for stochastically deteriorating series
systems. Journal of Applied Probability 28: 934–939.
121. Dieulle L (1999) Reliability of a system with Poisson inspection times. Journal of
Applied Probability 36(4): 1140–1154.
122. Dieulle L (2002) Reliability of several component sets with inspections at random
times. European Journal of Operational Research 139: 96–114.
123. Rezaei E (2017) A  new model for the optimization of periodic inspection intervals
with failure interaction: A case study for a turbine rotor. Case Studies in Engineering
Failure Analysis 9: 148–156. doi:10.1016/j.csefa.2015.10.001.
124. Tolentino D, Ruiz SE (2014) Influence of structural deterioration over time on the opti-
mal time interval for inspection and maintenance of structures. Engineering Structures
61: 22–30. doi:10.1016/j.engstruct.2014.01.012.
125. Lu Z, Chen M, Zhou D (2015) Periodic inspection maintenance policy with a general
repair for multi-state systems. In: Proceedings of Chinese Automation Congress (CAC):
2116–2121.
126. Zhang J, Huang X, Fang Y, Zhou J, Zhang H, Li J (2016) Optimal inspection-based pre-
ventive maintenance policy for three-state mechanical components under competing
failure modes. Reliability Engineering and System Safety 152: 95–103. doi:10.1016/j.
ress.2016.02.007.
127. Carvalho M, Nunes E, Telhada J (2009) Optimal periodic inspection of series sys-
tems with revealed and unrevealed failures. In: Safety, Reliability and Risk Analysis:
Theory, Methods and Applications—Proceedings of the Joint Esrel and SRA-Europe
Conference, CRC Press: 587–592.
128. Bris R, Chatelet E, Yalaoui F (2003) New method to minimize the preventive main-
tenance cost of series-parallel systems. Reliability Engineering and System Safety 82:
247–255. doi:10.1016/S0951-8320(03)00166-2.
129. Badia FG, Berrade MD, Campos CA  (2002) Maintenance policy for multivariate
standby/operating units. Applied Stochastic Models in Business and Industry 18:
147–155.
130. Huang J, Song Y, Ren Y, Gao Q (2014) An optimization method of aircraft periodic inspec-
tion and maintenance based on the zero-failure data analysis. In: Proceedings of 2014
IEEE Chinese Guidance, Navigation and Control Conference, Yantai, China: 319–323.
131. Azadeh A, Sangari MS, Amiri AS (2012) A particle swarm algorithm for inspection
optimization in serial multi-stage process. Applied Mathematical Modelling 36: 1455–
1464. doi:10.1016/j.apm.2011.09.037.
132. Dzwigarek M, Hryniewicz O (2011) Frequency of periodical inspections of safety-
related control systems of machinery—Practical recommendations for determining
methods. In: Proceedings of Summer Safety and Reliability Seminars, SSARS 2011,
Gdańsk-Sopot, Poland: 17–26.
133. Alfares H (1999) A simulation model for determining inspection frequency. Computers
and Industrial Engineering 36: 685–696. doi:10.1016/S0360-8352(99)00159-X.
134. Bai Y, Bai Q (2014) Subsea Pipeline Integrity and Risk Management. Elsevier.
doi:10.1016/C2011-0-00113-8.
135. Bai Y, Jin W-L (2015) Marine Structural Design. Elsevier.
136. Zhaoyang T, Jianfeng L, Zongzhi W, Jianhu Z, Weifeng H (2011) An evaluation of main-
tenance strategy using risk-based inspection. Safety Science 49: 852–860. doi:10.1016/j.
ssci.2011.01.015.
137. Hagemeijer PM, Kerkveld G (1998) A methodology for risk-based inspection of pres-
surized systems. Proceedings of the Institution of Mechanical Engineers, Part E:
Journal of Process Mechanical Engineering 212(1): 37–47.
138. Hagemeijer PM, Kerkveld G (1998) Application of risk-based inspection for pressurized
HC production systems in a Brunei petroleum company. Proceedings of the Institution of
Mechanical Engineers, Part E: Journal of Process Mechanical Engineering 212(1):
49–54.
139. Wang J, Matellini B, Wall A, Phipps J (2012) Risk-based verification of large off-
shore systems. Proceedings of the Institution of Mechanical Engineers, Part
M: Journal of Engineering for the Maritime Environment 226(3): 273–298.
doi:10.1177/1475090211430302.
140. Jovanovic A  (2003) Risk-based inspection and maintenance in power and process
plants in Europe. Nuclear Engineering and Design 226: 165–182.
141. Kallen MJ, Van Noortwijk JM (2005) Optimal maintenance decisions under imper-
fect inspection. Reliability Engineering and System Safety 90: 177–185. doi:10.1016/j.
ress.2004.10.004.
142. You J-S, Kuo H-T, Wu W-F (2006) Case studies of risk-informed inservice inspection
of nuclear piping systems. Nuclear Engineering and Design 236: 35–46.
143. Podofillini L, Zio E, Vatn J (2006) Risk-informed optimisation of railway tracks
inspection and maintenance procedures. Reliability Engineering and System Safety 91:
20–35. doi:10.1016/j.ress.2004.11.009.
144. Nakagawa T (1980) Replacement models with inspection and preventive maintenance.
Microelectronics and Reliability 20: 427–433.
145. Yeh L (2003) An inspection-repair-replacement model for a deteriorating system with
unobservable state. Journal of Applied Probability 40: 1031–1042.
146. Park JH, Lee SC, Hong JW, Lie CH (2009) An optimal block preventive maintenance
policy for a multi-unit system considering imperfect maintenance. Asia-Pacific Journal
of Operational Research 26(6): 831–847.
147. Sheu S-H, Lin Y-B, Liao G-L (2006) Optimum policies for a system with general imper-
fect maintenance. Reliability Engineering and System Safety 91(3): 362–369.
148. Taghipour S, Banjevic D (2012) Optimum inspection interval for a system under peri-
odic and opportunistic inspections. IIE Transactions 44: 932–948.
doi:10.1080/0740817X.2011.618176.
149. Taghipour S, Banjevic D (2012) Optimal inspection of a complex system subject to
periodic and opportunistic inspections and preventive replacements. European Journal
of Operational Research 220: 649–660. doi:10.1016/j.ejor.2012.02.002.
150. Babishin V, Taghipour S (2016) Joint optimal maintenance and inspection for a k-out-
of-n system. International Journal of Advanced Manufacturing Technology 87(5):
1739–1749. doi:10.1109/RAMS.2016.7448039.
151. Bjarnason ETS, Taghipour S (2014) Optimizing simultaneously inspection interval and
inventory levels (s, S) for a k-out-of-n system. In: 2014 Reliability and Maintainability
Symposium, Colorado Springs, CO: 1–6. doi:10.1109/RAMS.2014.6798463.
152. Chen C-T, Chen Y-W, Yuan J (2003) On dynamic preventive maintenance policy
for a system under inspection. Reliability Engineering and System Safety 80: 41–47.
doi:10.1016/S0951-8320(02)00238-7.
153. Chen Y-C (2013) An optimal production and inspection strategy with preventive main-
tenance error and rework. Journal of Manufacturing Systems 32: 99–106. doi:10.1016/j.
jmsy.2012.07.010.
154. Duffuaa S, El-Ga’aly A (2013) A multi-objective mathematical optimization model for
process targeting using 100% inspection policy. Applied Mathematical Modelling 37:
1545–1552. doi:10.1016/j.apm.2012.04.008.
155. Wang H, Wang W, Peng R (2017) A two-phase inspection model for a single compo-
nent system with three-stage degradation. Reliability Engineering and System Safety
158: 31–40.
156. Feng Q, Peng H, Coit DW (2010) A  degradation-based model for joint optimiza-
tion of burn-in, quality inspection, and maintenance: A light display device applica-
tion. International Journal of Advanced Manufacturing Technology 50: 801–808.
doi:10.1007/s00170-010-2532-7.
157. Tsai H-N, Sheu S-H, Zhang ZG (2016) A  trivariate optimal replacement policy for
a deteriorating system based on cumulative damage and inspections. Reliability
Engineering and System Safety 160: 122–135. doi:10.1016/j.ress.2016.10.031.
158. Bjarnason ETS, Taghipour S, Banjevic D (2014) Joint optimal inspection and inven-
tory for a k-out-of-n system. Reliability Engineering and System Safety 131: 203–215.
doi:10.1016/j.ress.2014.06.018.
159. Panagiotidou S (2014) Joint optimization of spare parts ordering and maintenance
policies for multiple identical items subject to silent failures. European Journal of
Operational Research 235: 300–314. doi:10.1016/j.ejor.2013.10.065.
160. Bukowski JV (2001) Modeling and analyzing the effects of periodic inspection on
the performance of safety-critical systems. IEEE Transactions on Reliability 50(3):
321–329. doi:10.1109/24.974130.
161. Ellingwood BR, Mori Y (1997) Reliability-based service life assessment of con-
crete structures in nuclear power plants: Optimum inspection and repair. Nuclear
Engineering and Design 175: 247–258.
162. Estes AC, Frangopol DM (2000) An optimized lifetime reliability-based inspection
program for deteriorating structures. In: Proceedings of the 8th ASCE Joint Specialty
Conference on Probabilistic Mechanics and Structural Reliability, Notre Dame, IN.
163. Faber MH, Sorensen JD (2002) Indicators for inspection and maintenance planning of
concrete structures. Structural Safety 24: 377–396. doi:10.1016/S0167-4730(02)00033-4.
164. Onoufriou T, Frangopol DM (2002) Reliability-based inspection optimization of com-
plex structures: A brief retrospective. Computers and Structures 80: 1133–1144.
165. Woodcock K (2014) Model of safety inspection. Safety Science 62: 145–156.
166. Ten Wolde M, Ghobbar AA (2013) Optimizing inspection intervals—Reliability and
availability in terms of a cost model: A  case study on railway carriers. Reliability
Engineering and System Safety 114: 137–147. doi:10.1016/j.ress.2012.12.013.
167. Ali SA, Bagchi G (1998) Risk-informed in service inspection. Nuclear Engineering
and Design 181: 221–224.
168. Garnero M-A, Beaudouin F, Delbos J-P (1998) Optimization of bearing-inspection
intervals. IEEE Proceedings of Annual Reliability and Maintainability Symposium:
332–338.
169. Aoki K, Yamamoto K, Kobayashi K (2007) Optimal inspection and replacement policy
using stochastic method for deterioration prediction. In: Proceedings of 11th World
Conference on Transport Research, Berkeley, CA: 1–13.
170. Sandoh H, Igaki N (2003) Optimal inspection policies for a scale. Computers and
Mathematics with Applications 46: 1119–1127.
171. Sandoh H, Igaki N (2001) Inspection policies for a scale. Journal of Quality in
Maintenance Engineering 7(3): 220–231.
172. Guduru RKR, Shaik SH, Yaramala S (2018) A dynamic optimization model for multi-
objective maintenance of sewing machine. International Journal of Pure and Applied
Mathematics 118(20): 33–43.
173. Gravito FM, Dos Santos Filho N (2003) Inspection and maintenance of wooden poles
structures. Global ESMO 2003, Orlando, Florida: 151–155.
174. Jazwinski J, Zurek J (2000) Principles of determining the maintenance set of the condi-
tion of the transport system with the use of expert opinions (in Polish). In: Proceeding of
XXVIII Winter School on Reliability—Decision Problems in Dependability Engineering,
Publishing House of Institute for Sustainable Technologies, Radom, Poland: 118–125.
175. Salamonowicz T (2007) Maintenance strategy for systems in k-out-of-n reliability
structure (in Polish). In: Proceedings of XXXV Winter School on Reliability—Problems
of Systems Dependability, Publishing House of Institute for Sustainable Technologies,
Radom: 414–420.
176. Dzwigarek M, Hryniewicz O (2012) Periodical inspection frequency of protection sys-
tems of machinery—Case studies (in Polish). Journal of KONBiN 3(23): 109–120.
177. Landowski B, Woropay M (2003) Simulation of exploitation processes of technical
objects preventively maintained (in Polish). In: Proceedings of XXXI Winter School on
Reliability—Forecasting Methods in Dependability Engineering, Publishing House of
Institute for Sustainable Technologies, Radom: 297–308.
178. Zhao F, Xu X, Xie SQ (2009) Computer-aided inspection planning—The state of the
art. Computers in Industry 60: 453–466. doi:10.1016/j.compind.2009.02.002.
179. Ballou DP, Pazer HL (1982) The impact of inspector fallibility on the inspection pol-
icy in serial production systems. Management Science 28(4): 387–399. doi:10.1287/
mnsc.28.4.387.
180. Darwish MA, Ben-Daya M (2007) Effect of inspection errors and preventive mainte-
nance on a two-stage production inventory system. International Journal of Production
Economics 107: 301–313. doi:10.1016/j.ijpe.2006.09.008.
181. Lee HL, Rosenblatt MJ (1987) Simultaneous determination of production cycle and
inspection schedules in a production system. Management Science 33(9): 1125–1136.
182. Meyer RR, Rothkopf MH, Smith SA (1979) Reliability and inventory in a production-
storage system. Management Science 25(8): 799–807.
183. Tirkel I (2016) Efficiency of Inspection based on out of control detection in wafer
fabrication. Computers and Industrial Engineering 99: 458–464. doi:10.1016/j.
cie.2016.05.022.
184. Ito K, Nakagawa T (2000) Optimal inspection policies for a storage system with degra-
dation at periodic tests. Mathematical and Computer Modelling 31: 191–195.
185. Ito K, Nakagawa T (1995) An optimal inspection policy for a storage system with high
reliability. Microelectronics Reliability 36(6): 875–882.
186. Ito K, Nakagawa T (1995) An optimal inspection policy for a storage system with three
types of hazard rate functions. Journal of the Operations Research Society of Japan
38(4): 423–431.
187. Ito K, Nakagawa T, Nishi K (1995) Extended optimal inspection policies for a system
in storage. Mathematical and Computer Modelling 22(10–12): 83–87.
188. Martinez EC (1984) Storage reliability with periodic test. IEEE Proceedings of Annual
Reliability and Maintainability Symposium: 181–185.
189. Su Ch, Zhang Y-J, Cao B-X (2012) Forecast model for real time reliability of stor-
age system based on periodic inspection and maintenance data. Eksploatacja i
Niezawodnosc – Maintenance and Reliability 14(4): 342–348.
190. Neves ML, Santiago LP, Maia CA  (2011) A  condition-based maintenance policy
and input parameters estimation for deteriorating systems under periodic inspection.
Computers and Industrial Engineering 61: 503–511. doi:10.1016/j.cie.2011.04.005.
3 Application of Stochastic Processes in Degradation Modeling: An Overview
Shah Limon, Ameneh Forouzandeh Shahraki, and Om Prakash Yadav

CONTENTS
3.1 Introduction
3.2 Continuous State Stochastic Processes
3.2.1 Wiener Process
3.2.2 Gamma Process
3.2.3 Inverse Gaussian Process
3.2.4 Case Example: Degradation Analysis with a Continuous State Stochastic Process
3.2.5 Selection of Appropriate Continuous State Stochastic Process
3.3 Discrete State Stochastic Processes
3.3.1 Markovian Structure
3.3.2 Semi-Markov Process
3.4 Summary and Conclusions
References

3.1 INTRODUCTION
Most engineering systems age during their life cycle, and operating conditions and
external stresses accelerate that aging. The aging process reflects the propagation of
a failure mechanism, which leads to a decline in product performance and ultimately to
product failure. To reduce downtime and ensure safe operation, it is desirable to
estimate the product's lifetime and reliability accurately so that appropriate
maintenance policies can be executed. Knowledge of the product's deterioration
characteristics and their root causes is therefore a valuable source of information for
assessing product performance and reliability through degradation modeling
(Limon et al. 2017a; Shahraki et al. 2017). In degradation modeling, failure is declared
when the degradation measure reaches a predefined threshold value, and the time at which
this happens is the time-to-failure. Further, the degradation approach provides more
accurate reliability estimates than traditional failure-time approaches.
In traditional deterministic models, system behavior is defined by a set of equations
that describe with certainty how the system performance will evolve over time. In
reality, however, system performance is subject to variation and uncertainty, which
makes the system behave probabilistically. This has increased the importance of
stochastic processes for modeling the probabilistic degradation behavior of engineering
systems. A stochastic process is a collection of random variables, indexed by time, that
represents the random changes of a system over time. Stochastic processes can be divided
into two broad categories: discrete state and continuous state stochastic processes.
The continuous state stochastic processes, mostly members of the Lévy family such as
the Wiener process, Gamma process, and Inverse Gaussian process, have been used
successfully to model the degradation of systems (Ye et al. 2013; Limon et al. 2017b;
Limon et al. 2018). These processes have independent increments, which implies the
Markov property and suits many engineering degradation phenomena. Further, the
time-to-failure has an explicit expression through the first-passage-time concept, which
gives continuous state stochastic processes a clear advantage in degradation modeling
for reliability assessment.
On the other hand, discrete state stochastic processes are used when the overall status
of the degradation process can be divided into a finite number of discrete levels
ranging from perfect functioning to complete failure. Each state corresponds to a
certain level of performance of the system under operation. Discrete state stochastic
processes are used in degradation modeling because handling only a limited number of
states is simple and practical (Moghaddass and Zuo 2014; Shahraki and Yadav 2018). The
system state may change at discrete or continuous times, which leads to different
models. Moreover, in applications where the system's history and age influence its
future state, aging Markovian and semi-Markov processes are used as extensions of
Markov processes.
The remainder of this chapter is organized as follows. Section 3.2 presents the
different types of continuous state stochastic processes, degradation modeling with
those processes, and selection of appropriate stochastic process. Section 3.3 describes
the discrete state stochastic processes with case examples. Finally, Section 3.4 sum-
marizes the application of stochastic processes in degradation modeling to evaluate
the system reliability.

3.2 CONTINUOUS STATE STOCHASTIC PROCESSES

A continuous state stochastic process represents the system's changes as a continuous
function of time, and its well-behaved sample paths are convenient for further analysis.
The commonly used continuous state stochastic processes are members of the Lévy family,
such as the Wiener process, Gamma process, and Inverse Gaussian process. The use of Lévy
processes in degradation modeling rests on the assumption that every degradation process
is the cumulative result of small, independent degradation increments. Besides capturing
the temporal variation of the degradation process, these processes have
well-established mathematical properties that are useful for explaining degradation
behavior. They also have the Markov property, expressed mathematically as:

\[
\Pr\left(X_{t_i} \mid X_{t_{i-1}}, X_{t_{i-2}}, X_{t_{i-3}}, \ldots, X_{t_1}\right) = \Pr\left(X_{t_i} \mid X_{t_{i-1}}\right)
\]

This implies that, given the current state of the degradation, the next state does not
depend on the earlier degradation history. This property is also intuitive and practical
for many deterioration processes. The following sections describe each stochastic
process for degradation modeling.

3.2.1 Wiener Process
The basic Wiener process can be expressed as:

\[
Y(t) = \mu \Lambda(t) + \sigma B\left(\Lambda(t)\right) \tag{3.1}
\]

Here B(·) is the standard Brownian motion, µ and σ are the drift and volatility
parameters, respectively, Λ(·) is the timescale function, and Y(t) is the characteristic
indicator that represents the system behavior. If Y(t) follows the Wiener stochastic
process, then it has the following mathematical properties:

1. y(0) = 0
2. y(t) follows a normal distribution N(µΛ(t), σ²Λ(t))
3. y(t) has independent increments over every time interval Δt (Δt = t_i − t_{i−1})
4. The independent increment Δy(t) = y_i − y_{i−1} follows the normal distribution
   N(µΔΛ(t), σ²ΔΛ(t)) with probability density function (PDF):

\[
f\left(\Delta y(t)\right) = \frac{1}{\sigma\sqrt{2\pi\,\Delta\Lambda(t)}}
\exp\left(-\frac{\left(\Delta y - \mu\,\Delta\Lambda(t)\right)^2}{2\sigma^2\,\Delta\Lambda(t)}\right) \tag{3.2}
\]
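As an illustration of properties 1 to 4, a Wiener degradation path can be simulated by summing independent normal increments. The sketch below is a minimal example rather than the authors' implementation: the function name `simulate_wiener_path`, the power-law timescale Λ(t) = t^c, and all parameter values are assumptions chosen only for illustration.

```python
import math
import random


def simulate_wiener_path(mu, sigma, c, times, rng):
    """Simulate Y(t) = mu*Lambda(t) + sigma*B(Lambda(t)) with Lambda(t) = t**c.

    `times` must be positive and increasing; the path starts at y(0) = 0.
    """
    path = [0.0]          # property 1: y(0) = 0
    prev_lam = 0.0        # Lambda(0) = 0 for the assumed power-law timescale
    for t in times:
        lam = t ** c
        d_lam = lam - prev_lam
        # property 4: increment ~ N(mu * dLambda, sigma^2 * dLambda)
        dy = rng.gauss(mu * d_lam, sigma * math.sqrt(d_lam))
        path.append(path[-1] + dy)
        prev_lam = lam
    return path


rng = random.Random(0)                  # fixed seed for reproducibility
times = [50, 100, 150, 200, 250]        # hypothetical inspection times
path = simulate_wiener_path(mu=0.02, sigma=0.01, c=0.5, times=times, rng=rng)
```

Setting `sigma` to zero reduces the path to the deterministic drift µΛ(t), which is a quick way to check the timescale logic.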

The Wiener process is the mathematical model of Brownian motion, the random movement of
particles suspended in a fluid that results from their collisions with the fluid's
molecules. This random movement of small particles is closely analogous to the random
increments of a deterioration path. The Wiener process also has other properties that
are well suited to modeling degradation behavior. For example, the degradation process
can be viewed as the accumulation of many small environmental effects, and by the
central limit theorem the increment produced by these small effects can be approximated
by a normal distribution. Environmental effects such as temperature, shocks, and
humidity are most often independent, so the resulting degradation increments over
disjoint time intervals are also independent. For these reasons, the Wiener process is a
versatile model for describing many degradation phenomena. In a Wiener process, the
drift parameter µ represents the degradation rate, and the timescale function Λ(·)
captures the nonlinearity of the degradation process.
Manufacturers often use the accelerated degradation test (ADT) to assess reliability
metrics quickly during the product design stage. In an ADT, product samples are
subjected to stress levels higher than the normal operating conditions in order to
expedite the degradation process. The effect of stress on product degradation and
lifetime can be explained by several physics-based or empirical reaction rate models.
For example, the effect of temperature on product deterioration can be captured by the
Arrhenius model. Several well-established reaction rate models are given below, where
d(s) is the rate of degradation at stress level s, and a1 and a2 are constant
coefficients that depend on the material or product type (Nelson 2004):
\[
d(s) =
\begin{cases}
a_1 e^{-a_2/T}, & \text{Arrhenius model } (s = T) \\
a_1 V^{a_2}, & \text{Power law model } (s = V) \\
a_1 e^{a_2 W}, & \text{Exponential model } (s = W)
\end{cases}
\tag{3.3}
\]
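The three reaction rate models of Equation 3.3 translate directly into small functions. The sketch below is illustrative only; the function names and the coefficient values used in the usage lines are hypothetical and are not taken from the chapter.

```python
import math


def arrhenius_rate(a1, a2, temperature):
    """Arrhenius model: d(s) = a1 * exp(-a2 / T), with s = T (absolute temperature)."""
    return a1 * math.exp(-a2 / temperature)


def power_law_rate(a1, a2, voltage):
    """Power law model: d(s) = a1 * V**a2, with s = V."""
    return a1 * voltage ** a2


def exponential_rate(a1, a2, stress):
    """Exponential model: d(s) = a1 * exp(a2 * W), with s = W."""
    return a1 * math.exp(a2 * stress)


# Hypothetical coefficients: a higher temperature gives a higher degradation rate.
rate_at_300K = arrhenius_rate(a1=1.0, a2=1000.0, temperature=300.0)
rate_at_350K = arrhenius_rate(a1=1.0, a2=1000.0, temperature=350.0)
```

In each model a1 scales the rate and a2 controls how strongly the rate responds to the stress.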

Because the magnitudes of the stress measurement units may differ significantly in a
multi-stress scenario, it is important to use standardized stresses so that the
measurement units do not influence the analysis. The transformed stress level is given
as (Park and Yum 1997):

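\[
S_k =
\begin{cases}
\dfrac{1/S_0' - 1/S_k'}{1/S_0' - 1/S_M'}, & \text{for Arrhenius model} \\[2ex]
\dfrac{\log\left(S_k'\right) - \log\left(S_0'\right)}{\log\left(S_M'\right) - \log\left(S_0'\right)}, & \text{for Power law model} \\[2ex]
\dfrac{S_k' - S_0'}{S_M' - S_0'}, & \text{for Exponential model}
\end{cases}
\tag{3.4}
\]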
where S′0, S′k, and S′M represent the operational, applied accelerated, and maximum
stress levels in their original units, and Sk is the corresponding transformed stress,
so that Sk = 0 at the operating stress and Sk = 1 at the maximum stress. A
multiple-stress degradation test with possible interaction effects between the stresses
is considered. The nonlinear behavior of the degradation is described by the power law
function Λ(t) = t^c, where c is a constant. Assuming that both Wiener parameters are
stress dependent, the log-likelihood function can be written as:

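\[
L(\theta) = \sum_{i=1}^{n}\sum_{j=1}^{m}\sum_{k=1}^{p}
\left[
-\frac{1}{2}\log\left(2\pi\left(t_{ijk}^{c} - t_{(i-1)jk}^{c}\right)\right)
-\frac{1}{2}\log\left(\sigma^{2}\right)
-\frac{\left(\Delta y_{ijk} - \mu(s)\left(t_{ijk}^{c} - t_{(i-1)jk}^{c}\right)\right)^{2}}
{2\sigma^{2}\left(t_{ijk}^{c} - t_{(i-1)jk}^{c}\right)}
\right]
\tag{3.5}
\]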
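To make the structure of Equation 3.5 concrete, the sketch below evaluates the log-likelihood for a single stress level, so that µ and σ² are treated as constants. This simplification, the function names, and the tiny data set are assumptions for illustration only. For a fixed exponent c, setting the derivatives of the log-likelihood to zero gives closed-form estimates of µ and σ², which the second function computes.

```python
import math


def timescale_increments(times, c):
    """Increments of the power-law timescale, t_i**c - t_(i-1)**c."""
    return [times[i] ** c - times[i - 1] ** c for i in range(1, len(times))]


def wiener_log_likelihood(mu, sigma2, c, times, paths):
    """Sum of normal log-densities of the increments (one stress level only)."""
    d_lams = timescale_increments(times, c)
    total = 0.0
    for y in paths:
        for i, d_lam in enumerate(d_lams, start=1):
            dy = y[i] - y[i - 1]
            total += (-0.5 * math.log(2 * math.pi * d_lam)
                      - 0.5 * math.log(sigma2)
                      - (dy - mu * d_lam) ** 2 / (2 * sigma2 * d_lam))
    return total


def wiener_mle_given_c(c, times, paths):
    """Closed-form maximizers of the log-likelihood in mu and sigma2 for fixed c."""
    d_lams = timescale_increments(times, c)
    pairs = [(y[i] - y[i - 1], d_lam)
             for y in paths for i, d_lam in enumerate(d_lams, start=1)]
    mu_hat = sum(dy for dy, _ in pairs) / sum(d for _, d in pairs)
    sigma2_hat = sum((dy - mu_hat * d) ** 2 / d for dy, d in pairs) / len(pairs)
    return mu_hat, sigma2_hat


# Hypothetical degradation paths observed at four common times.
times = [0.0, 1.0, 2.0, 3.0]
paths = [[0.0, 1.1, 1.9, 3.2], [0.0, 0.8, 2.1, 2.9]]
mu_hat, sigma2_hat = wiener_mle_given_c(1.0, times, paths)
best_ll = wiener_log_likelihood(mu_hat, sigma2_hat, 1.0, times, paths)
```

The exponent c could then be chosen by evaluating `best_ll` over a grid of candidate values; the stress-dependent case requires a numerical optimizer instead of these closed forms.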
The maximum likelihood estimation (MLE) method can be applied to estimate the model
parameters of this function. Under the Wiener process, the time to failure is the first
passage time of the degradation path to the threshold D, and it follows the inverse
Gaussian (IG) distribution with the PDF:

\[
f_{IG}(y; a, b) = \left(\frac{b}{2\pi y^{3}}\right)^{1/2}
\exp\left(-\frac{b\,(y - a)^{2}}{2a^{2}y}\right) \tag{3.6}
\]

Here, a and b are the IG distribution parameters. The mean time to failure can then be
written as:

\[
\xi_w = \left(\frac{D - y_0}{\mu(s)}\right)^{1/c} \tag{3.7}
\]

The reliability function can be approximated with:

\[
R(t) \approx \Phi\left(\frac{D - y_0 - \mu(s)\,t^{c}}{\sqrt{\sigma^{2}(s)\,t^{c}}}\right) \tag{3.8}
\]
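Equations 3.7 and 3.8 can be evaluated with a few lines of code. The following sketch uses only the Python standard library; the function names and parameter values are hypothetical, and µ and σ are passed in as the already-evaluated µ(s) and σ(s) at the stress level of interest.

```python
import math
from statistics import NormalDist


def wiener_mean_life(D, y0, mu, c):
    """Mean time to failure, Eq. 3.7: ((D - y0) / mu)**(1/c)."""
    return ((D - y0) / mu) ** (1.0 / c)


def wiener_reliability(t, D, y0, mu, sigma, c):
    """Approximate reliability, Eq. 3.8, for t > 0 with Lambda(t) = t**c."""
    lam = t ** c
    z = (D - y0 - mu * lam) / math.sqrt(sigma ** 2 * lam)
    return NormalDist().cdf(z)


# Hypothetical parameters: threshold 0.5, no initial degradation.
D, y0, mu, sigma, c = 0.5, 0.0, 0.05, 0.02, 0.5
t_mean = wiener_mean_life(D, y0, mu, c)
r_at_mean = wiener_reliability(t_mean, D, y0, mu, sigma, c)
```

At the time given by Equation 3.7 the numerator of Equation 3.8 vanishes, so the approximate reliability there is one half, which gives a simple consistency check between the two formulas.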

3.2.2 Gamma Process
The gamma process represents degradation as cumulative damage in which deterioration
accumulates gradually over time. If the random variable Y(t) represents the
deterioration, then the gamma process, a continuous-time stochastic process, has the
following mathematical properties (O’Connor 2012):

1. y(0) = 0
2. y(t) follows a gamma distribution Ga(αt, β)
3. y(t) has independent increments over each time interval Δt (Δt = t_i − t_{i−1})
4. The independent increment Δy(t) = y_i − y_{i−1} also follows the gamma distribution
   Ga(αΔt, β); with the power-law timescale t^c, its PDF is:

\[
f\left(\Delta y(t)\right) = \frac{\beta^{\alpha\left(t_i^{c} - t_{i-1}^{c}\right)}}
{\Gamma\left(\alpha\left(t_i^{c} - t_{i-1}^{c}\right)\right)}\,
\Delta y^{\alpha\left(t_i^{c} - t_{i-1}^{c}\right) - 1} e^{-\beta \Delta y} \tag{3.9}
\]

where α > 0 and β > 0 represent the gamma shape and scale parameters, respectively,
c is a nonlinearity parameter, and Γ(·) is the gamma function, Γ(a) = ∫₀^∞ x^(a−1) e^(−x) dx.
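A gamma degradation path with the increment density of Equation 3.9 can be simulated as a running sum of independent gamma increments. The sketch below is an illustration under stated assumptions: the function name and parameter values are hypothetical, and because β enters Equation 3.9 through e^(−βΔy), the standard library call `random.gammavariate(shape, scale)` is given the scale 1/β.

```python
import random


def simulate_gamma_path(alpha, beta, c, times, rng):
    """Simulate a gamma process path with timescale t**c, starting at y(0) = 0."""
    path = [0.0]
    prev_lam = 0.0
    for t in times:
        lam = t ** c
        shape = alpha * (lam - prev_lam)      # alpha * (t_i**c - t_(i-1)**c)
        # gammavariate takes (shape, scale); Eq. 3.9 uses beta as a rate
        path.append(path[-1] + rng.gammavariate(shape, 1.0 / beta))
        prev_lam = lam
    return path


rng = random.Random(42)                      # fixed seed for reproducibility
times = [1, 2, 3, 4, 5]                      # hypothetical inspection times
alpha, beta, c = 2.0, 4.0, 1.0               # hypothetical parameters
example_path = simulate_gamma_path(alpha, beta, c, times, rng)

# Monte Carlo check: the mean of Y(t) should be close to alpha * t**c / beta.
finals = [simulate_gamma_path(alpha, beta, c, times, rng)[-1] for _ in range(2000)]
mean_final = sum(finals) / len(finals)
expected_mean = alpha * times[-1] ** c / beta
```

Every increment is positive, so each simulated path is monotonically increasing, which is the property that makes the gamma process suitable for cumulative damage.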

Now, considering the accelerated test with both gamma parameters dependent on the
stresses and their interaction effect, the likelihood function can be written as:

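\[
L(\theta) = \prod_{i=1}^{n}\prod_{j=1}^{m}\prod_{k=1}^{p}
\frac{\beta(s)^{\alpha(s)\left(t_{ijk}^{c} - t_{(i-1)jk}^{c}\right)}}
{\Gamma\left(\alpha(s)\left(t_{ijk}^{c} - t_{(i-1)jk}^{c}\right)\right)}\,
\Delta y_{ijk}^{\alpha(s)\left(t_{ijk}^{c} - t_{(i-1)jk}^{c}\right) - 1}
e^{-\Delta y_{ijk}\,\beta(s)}
\tag{3.10}
\]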

The MLE method, implemented with numerical optimization software, can be used to
maximize this function. Assuming that failure occurs when the degradation path reaches
the threshold D, the time to failure ξ is the time at which the degradation path first
crosses D, and the reliability function at time t is:

\[
R(t) = P(t < t_D) = 1 - \frac{\Gamma\left(\alpha t^{c}, D_\beta\right)}{\Gamma\left(\alpha t^{c}\right)} \tag{3.11}
\]

where D_β = (D − y0)β, y0 is the initial degradation value, and
Γ(a, x) = ∫ₓ^∞ z^(a−1) e^(−z) dz is the upper incomplete gamma function. The cumulative
distribution function (CDF) of t_D is given as:

\[
F(t) = \frac{\Gamma\left(\alpha t^{c}, D_\beta\right)}{\Gamma\left(\alpha t^{c}\right)} \tag{3.12}
\]
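Although Equations 3.11 and 3.12 have no simple closed form, they can be evaluated numerically. The sketch below is one possible approach and not the chapter's method: it computes the regularized lower incomplete gamma function with a standard power series, and uses the identity that one minus the regularized upper incomplete gamma function equals the regularized lower one. The function names and parameter values are hypothetical.

```python
import math


def regularized_lower_gamma(a, x, tol=1e-14, max_terms=10000):
    """P(a, x) = gamma(a, x) / Gamma(a), computed by its power series (a > 0)."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < tol * total:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def gamma_reliability(t, alpha, beta, c, D, y0=0.0):
    """R(t) of Eq. 3.11: 1 - Gamma(alpha*t**c, D_beta) / Gamma(alpha*t**c)."""
    d_beta = (D - y0) * beta
    return regularized_lower_gamma(alpha * t ** c, d_beta)


def gamma_failure_cdf(t, alpha, beta, c, D, y0=0.0):
    """F(t) of Eq. 3.12, the complement of the reliability."""
    return 1.0 - gamma_reliability(t, alpha, beta, c, D, y0)


# Hypothetical parameters.
r_early = gamma_reliability(1.0, alpha=2.0, beta=4.0, c=1.0, D=2.5)
r_late = gamma_reliability(10.0, alpha=2.0, beta=4.0, c=1.0, D=2.5)
```

The series converges for any positive argument but needs more terms as the second argument grows, so a library routine for the incomplete gamma function would be preferable in production code.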

Because of the incomplete gamma function, the CDF is analytically inconvenient to
evaluate. To deal with this issue, Park and Padgett (2005) proposed approximating the
time-to-failure ξ with a Birnbaum-Saunders (BS) distribution having the following CDF:

\[
F_{BS}(t) \approx \Phi\left[\frac{1}{a}\left(\sqrt{\frac{t^{c}}{b}} - \sqrt{\frac{b}{t^{c}}}\right)\right] \tag{3.13}
\]

where a = 1/√ω_β and b = ω_β/α. Under the BS approximation, the expected failure time
can be estimated as:

\[
\xi_G = \left(\frac{\omega_\beta}{\alpha} + \frac{1}{2\alpha}\right)^{1/c} \tag{3.14}
\]

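The Birnbaum-Saunders approximation of Equations 3.13 and 3.14 can be sketched as follows. The chapter does not define ω_β explicitly; the code assumes ω_β = (D − y0)β, which is the value obtained when the gamma process is replaced by a normal variable with the same mean and variance. That assumption, the function names, and the parameter values are all illustrative.

```python
import math
from statistics import NormalDist


def bs_parameters(alpha, beta, D, y0=0.0):
    """Return (a, b) with a = 1/sqrt(omega) and b = omega/alpha.

    Assumption (not stated in the text): omega = (D - y0) * beta.
    """
    omega = (D - y0) * beta
    return 1.0 / math.sqrt(omega), omega / alpha


def bs_failure_cdf(t, alpha, beta, c, D, y0=0.0):
    """Approximate failure-time CDF of Eq. 3.13 for t > 0."""
    a, b = bs_parameters(alpha, beta, D, y0)
    lam = t ** c
    z = (math.sqrt(lam / b) - math.sqrt(b / lam)) / a
    return NormalDist().cdf(z)


def bs_mean_life(alpha, beta, c, D, y0=0.0):
    """Expected failure time of Eq. 3.14."""
    omega = (D - y0) * beta
    return (omega / alpha + 1.0 / (2.0 * alpha)) ** (1.0 / c)


# Hypothetical parameters: omega = 10, so b = 5 and the median of t**c is 5.
cdf_at_median = bs_failure_cdf(5.0, alpha=2.0, beta=4.0, c=1.0, D=2.5)
mean_life = bs_mean_life(alpha=2.0, beta=4.0, c=1.0, D=2.5)
```

The parameter b is the median of t^c under the BS distribution, so the CDF equals one half there, while the mean in Equation 3.14 is slightly larger than b.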
3.2.3 Inverse Gaussian Process
Suppose the system's behavior is represented by the IG process. If Y(t) denotes the
system's performance characteristic at time t, then the IG process has the following
properties (Wang and Xu 2010):

1. y(0) = 0 with probability one
2. y(t) has independent increments over each time interval Δt (Δt = t_i − t_{i−1})
3. The independent increment Δy(t) = y_i − y_{i−1} follows the IG distribution
   IG(µΔΛ(t), λΔΛ(t)²) with PDF:

\[
f\left(\Delta y \mid \mu\,\Delta\Lambda(t), \lambda\,\Delta\Lambda(t)^{2}\right) =
\left(\frac{\lambda\,\Delta\Lambda(t)^{2}}{2\pi\,\Delta y^{3}}\right)^{1/2}
\exp\left(-\frac{\lambda\left(\Delta y - \mu\,\Delta\Lambda(t)\right)^{2}}{2\mu^{2}\,\Delta y}\right)
\tag{3.15}
\]

Here µ and λ denote the mean and scale parameters, and Λ(t) is the shape function. The
mean of Y(t) is µΛ(t) and its variance is µ³Λ(t)/λ. The shape function is nonlinear, and
a power law, Λ(t) = t^c, is chosen in this work to represent the nonstationary process.
By the properties of the IG process and Equation 3.15, the likelihood function of the
degradation increments is:

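\[
L(\theta) = \prod_{i=1}^{n}\prod_{j=1}^{m}\prod_{k=1}^{p}
\sqrt{\frac{\lambda_{ijk}\left(t_{ijk}^{c} - t_{(i-1)jk}^{c}\right)^{2}}{2\pi\,\Delta y_{ijk}^{3}}}\;
\exp\left(-\frac{\lambda_{ijk}\left(\Delta y_{ijk} - \mu_{ijk}\left(t_{ijk}^{c} - t_{(i-1)jk}^{c}\right)\right)^{2}}
{2\mu_{ijk}^{2}\,\Delta y_{ijk}}\right)
\tag{3.16}
\]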
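In practice the likelihood of Equation 3.16 is usually handled on the log scale. The sketch below evaluates the log-likelihood for a single stress level, so that µ and λ are constants; this restriction, the function names, and the data are assumptions made for illustration. All increments must be positive, as required by the IG distribution.

```python
import math


def ig_increment_log_pdf(dy, d_timescale, mu, lam):
    """Log of the IG increment density (Eq. 3.15) for dy > 0."""
    return (0.5 * math.log(lam * d_timescale ** 2 / (2 * math.pi * dy ** 3))
            - lam * (dy - mu * d_timescale) ** 2 / (2 * mu ** 2 * dy))


def ig_log_likelihood(mu, lam, c, times, paths):
    """Log of Eq. 3.16 for one stress level with timescale t**c."""
    total = 0.0
    for y in paths:
        for i in range(1, len(times)):
            d_timescale = times[i] ** c - times[i - 1] ** c
            total += ig_increment_log_pdf(y[i] - y[i - 1], d_timescale, mu, lam)
    return total


# Hypothetical monotone degradation paths.
times = [0.0, 1.0, 2.0, 3.0]
paths = [[0.0, 1.1, 1.9, 3.2], [0.0, 0.8, 2.1, 2.9]]
ll = ig_log_likelihood(mu=1.0, lam=5.0, c=1.0, times=times, paths=paths)
```

A numerical optimizer can then maximize `ig_log_likelihood` over µ, λ, and c, or over the coefficients of stress-dependent forms of µ and λ.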

Suppose Y(t) is a monotonic degradation process and the lifetime ξ_D is defined as the
first passage time at which the degradation reaches the threshold value D. If the
initial degradation is y0, then y(t) − y0 follows the IG distribution. Therefore, the
CDF of ξ_D can be written as:

\[
F\left(\xi_D \mid D, \mu t^{c}, \lambda (t^{c})^{2}\right) =
\Phi\left[\sqrt{\frac{\lambda}{D - y_0}}\left(t^{c} - \frac{D - y_0}{\mu}\right)\right]
- e^{2\lambda t^{c}/\mu}\,
\Phi\left[-\sqrt{\frac{\lambda}{D - y_0}}\left(t^{c} + \frac{D - y_0}{\mu}\right)\right]
\tag{3.17}
\]

where Φ(·) is the CDF of the standard normal distribution. When µΛ(t) and t are large,
Y(t) can be approximated by the normal distribution with mean µΛ(t) and variance
µ³Λ(t)/λ. Therefore, the CDF of ξ_D can also be approximated by the following equation
(Ye and Chen 2014):

\[
F\left(\xi_{IG} \mid D, \mu t^{c}, \lambda (t^{c})^{2}\right) =
\Phi\left(\frac{\mu(s)\,t^{c} - D}{\sqrt{\mu(s)^{3}\,t^{c}/\lambda}}\right)
\tag{3.18}
\]


The approximate mean lifetime is:

\[
\xi_{IG} = \left(\frac{D}{\mu(s)}\right)^{1/c} \tag{3.19}
\]
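The failure-time expressions for the IG process can be compared numerically. The sketch below implements the exact CDF of Equation 3.17, a normal approximation computed directly as P(Y(t) ≥ D) with mean µt^c and variance µ³t^c/λ, and the mean lifetime of Equation 3.19. The function names and parameter values are hypothetical, and the standard normal CDF is written with `math.erfc` to keep accuracy in the lower tail, where Equation 3.17 multiplies a very small probability by a large exponential.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF, accurate in the lower tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def ig_failure_cdf(t, D, y0, mu, lam, c):
    """Exact first-passage CDF of Eq. 3.17 for t > 0.

    The exponential factor can overflow for very large 2*lam*t**c/mu.
    """
    big_lam = t ** c
    d = D - y0
    root = math.sqrt(lam / d)
    return (std_normal_cdf(root * (big_lam - d / mu))
            - math.exp(2.0 * lam * big_lam / mu)
            * std_normal_cdf(-root * (big_lam + d / mu)))


def ig_failure_cdf_normal(t, D, mu, lam, c):
    """Normal approximation of P(Y(t) >= D)."""
    big_lam = t ** c
    return std_normal_cdf((mu * big_lam - D) / math.sqrt(mu ** 3 * big_lam / lam))


def ig_mean_life(D, mu, c):
    """Approximate mean lifetime of Eq. 3.19."""
    return (D / mu) ** (1.0 / c)


# Hypothetical parameters.
exact = ig_failure_cdf(5.0, D=5.0, y0=0.0, mu=1.0, lam=2.0, c=1.0)
approx = ig_failure_cdf_normal(5.0, D=5.0, mu=1.0, lam=2.0, c=1.0)
mean_life = ig_mean_life(D=5.0, mu=1.0, c=1.0)
```

With these small parameter values the two CDFs differ noticeably, which is consistent with the statement that the normal approximation is intended for large µΛ(t) and t.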

3.2.4 Case Example: Degradation Analysis with a Continuous State Stochastic Process
To demonstrate the proposed method, light-emitting diodes (LEDs) are taken as a case
study example. LEDs have become very popular because of their very low energy
consumption, low cost, and long life (Narendran and Gu 2005). As a solid-state lighting
source, LEDs are increasingly used in many sectors, such as communications, medical
services, backlighting, signposts, and general lighting. Unlike traditional lamps,
which fail catastrophically, LEDs usually lose light output gradually over their useful
life and therefore experience soft failure. It is thus reasonable to take the light
intensity of the LEDs as the degrading performance characteristic in this study.
The experimental data on the degradation of LEDs are taken from the literature (Chaluvadi 2008). Table 3.1 provides the details of the experimental setup of the LEDs and the degradation data from the test. Two different combinations of constant accelerated stresses were used to accelerate the lumen degradation of the LEDs. At each stress level, twelve samples were assigned, and the light intensity of each sample LED was measured at room temperature every 50 hours up to 250 hours. The operating stress is defined as 30 mA, and 50 percent degradation of the initial light intensity is considered to be the failure threshold value.

TABLE 3.1
Accelerated Degradation Test Dataset of LEDs

Stress Level   Sample   Degradation Measurement (lux) at time (hrs)
                        0      50      100     150     200     250
40 mA          1        1      0.866   0.787   0.76    0.716   0.68
40 mA          2        1      0.821   0.714   0.654   0.617   0.58
40 mA          3        1      0.827   0.703   0.64    0.613   0.593
40 mA          4        1      0.798   0.683   0.623   0.6     0.59
40 mA          5        1      0.751   0.667   0.628   0.59    0.54
40 mA          6        1      0.837   0.74    0.674   0.63    0.613
40 mA          7        1      0.73    0.65    0.607   0.583   0.58
40 mA          8        1      0.862   0.676   0.627   0.6     0.597
40 mA          9        1      0.812   0.65    0.606   0.593   0.573
40 mA          10       1      0.668   0.633   0.593   0.573   0.565
40 mA          11       1      0.661   0.642   0.594   0.58    0.553
40 mA          12       1      0.765   0.617   0.613   0.597   0.56
35 mA          1        1      0.951   0.86    0.776   0.7     0.667
35 mA          2        1      0.933   0.871   0.797   0.743   0.73
35 mA          3        1      0.983   0.924   0.89    0.843   0.83
35 mA          4        1      0.966   0.882   0.851   0.814   0.786
35 mA          5        1      0.958   0.89    0.84    0.81    0.8
35 mA          6        1      0.94    0.824   0.774   0.717   0.706
35 mA          7        1      0.882   0.787   0.75    0.7     0.693
35 mA          8        1      0.867   0.78    0.733   0.687   0.673
35 mA          9        1      0.89    0.8     0.763   0.723   0.713
35 mA          10       1      0.962   0.865   0.814   0.745   0.742
35 mA          11       1      0.975   0.845   0.81    0.75    0.741
35 mA          12       1      0.924   0.854   0.8     0.733   0.715

Source: Chaluvadi, V.N.H., Accelerated life testing of electronic revenue meters, PhD dissertation, Clemson University, Clemson, SC, 2008.

FIGURE 3.1  LED degradation data at a different stress level.
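The dataset in Table 3.1 can be inspected before any model is fitted. The sketch below copies the tabulated values and, as an assumption made here for illustration, measures degradation as one minus the relative light intensity. It then computes the increments between consecutive readings and the lowest intensity reached at 250 hours, which can be compared with the 50 percent failure threshold.

```python
# Relative light intensity from Table 3.1, measured at 0, 50, ..., 250 hours.
TIMES_H = [0, 50, 100, 150, 200, 250]
INTENSITY = {
    "40 mA": [
        [1, 0.866, 0.787, 0.76, 0.716, 0.68], [1, 0.821, 0.714, 0.654, 0.617, 0.58],
        [1, 0.827, 0.703, 0.64, 0.613, 0.593], [1, 0.798, 0.683, 0.623, 0.6, 0.59],
        [1, 0.751, 0.667, 0.628, 0.59, 0.54], [1, 0.837, 0.74, 0.674, 0.63, 0.613],
        [1, 0.73, 0.65, 0.607, 0.583, 0.58], [1, 0.862, 0.676, 0.627, 0.6, 0.597],
        [1, 0.812, 0.65, 0.606, 0.593, 0.573], [1, 0.668, 0.633, 0.593, 0.573, 0.565],
        [1, 0.661, 0.642, 0.594, 0.58, 0.553], [1, 0.765, 0.617, 0.613, 0.597, 0.56],
    ],
    "35 mA": [
        [1, 0.951, 0.86, 0.776, 0.7, 0.667], [1, 0.933, 0.871, 0.797, 0.743, 0.73],
        [1, 0.983, 0.924, 0.89, 0.843, 0.83], [1, 0.966, 0.882, 0.851, 0.814, 0.786],
        [1, 0.958, 0.89, 0.84, 0.81, 0.8], [1, 0.94, 0.824, 0.774, 0.717, 0.706],
        [1, 0.882, 0.787, 0.75, 0.7, 0.693], [1, 0.867, 0.78, 0.733, 0.687, 0.673],
        [1, 0.89, 0.8, 0.763, 0.723, 0.713], [1, 0.962, 0.865, 0.814, 0.745, 0.742],
        [1, 0.975, 0.845, 0.81, 0.75, 0.741], [1, 0.924, 0.854, 0.8, 0.733, 0.715],
    ],
}
FAILURE_THRESHOLD = 0.5  # 50 percent of the initial light intensity

def degradation_increments(path):
    """Increments of the degradation y = 1 - intensity between consecutive readings."""
    degradation = [1.0 - value for value in path]
    return [b - a for a, b in zip(degradation, degradation[1:])]

all_paths = [path for paths in INTENSITY.values() for path in paths]
all_increments = [inc for path in all_paths for inc in degradation_increments(path)]
lowest_final_intensity = min(path[-1] for path in all_paths)
```

With these values, every increment is positive and no sample reaches the failure threshold within 250 hours, so the lifetime has to be extrapolated from a fitted degradation model.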
Figure 3.1 shows the nonlinear nature of the LED degradation paths, which justifies the assumption of a non-stationary continuous state stochastic process. The nonlinear likelihood function with multiple model parameters makes parameter estimation challenging. Maximum likelihood estimation (MLE) in the statistical software R has been used to solve these complex equations. The built-in “mle” function, which uses the Nelder-Mead algorithm (optim) to optimize the likelihood function, is used to estimate the model parameters. After the model parameters for each stochastic process have been estimated, the lifetime and reliability under any given set of operating conditions can be estimated. The parameter and lifetime estimates for the different stochastic process models are provided in Table 3.2.
The results show that the Wiener process yields a markedly larger lifetime estimate than the Gamma and IG processes. Figure 3.2 illustrates the reliability estimates obtained with the different stochastic process models. As with the lifetimes, the reliability plots show a higher estimate for the Wiener process.

TABLE 3.2
Parameter and Lifetime Estimates with Different Degradation Models

Model    γ0        γ1       δ0        δ1         c        Lifetime
Wiener   −4.3516   0.9483   −3.8413   0.1570     0.4569   3002.26
Gamma    −0.7636   0.0954   4.2685    −0.08528   0.5802   1812.28
IG       −5.1956   0.9481   −6.1025   0.1185     0.6097   1611.15

FIGURE 3.2  Reliability estimates using various continuous stochastic processes.

3.2.5 Selection of Appropriate Continuous State Stochastic Process


The appropriate selection of the stochastic process is very important because effec-
tive degradation modeling depends on the appropriate choice of the process. The reli-
ability estimation and its accuracy also are dependent on the appropriate stochastic
process selection. From the LED case study example, it is observed that the lifetime
and reliability estimates differ among three continuous state stochastic processes.
There  are several criteria to choose an appropriate stochastic process for specific
degradation cases which are discussed next.
Graphical analysis is a very common method to check data patterns and behavior. Figure 3.3 shows the histogram and CDF graphs that compare the fit of the three stochastic processes. Both graphs suggest that the Gamma process provides the best fit for the LED degradation data, whereas the Wiener process fits the data worst. Quantile-quantile (Q-Q) plots and probability plots are also very useful graphical techniques to check model fitness. These plots lead to the same conclusion for the LED data (see Figure 3.4).
Besides graphical methods, there are stronger statistical methods for checking model fitness, such as goodness-of-fit tests. Several parametric or nonparametric methods are available to compare model fitness, such as the KS (Kolmogorov-Smirnov) statistic, the CVM (Cramer-von Mises) statistic, the AD (Anderson-Darling) statistic, AIC (Akaike’s Information Criterion), and BIC (Bayesian Information Criterion). All these statistics and criteria are used to select the best-fitted model. Table 3.3 provides the goodness-of-fit values for the three stochastic processes fitted to the LED data. The Gamma process has the smallest value in all cases and the Wiener process has the largest. This observation implies that the Gamma process is the most suitable and the Wiener process the least suitable model for the LED degradation data. This result explains the large discrepancy between the lifetime and reliability estimates of the Wiener process and those of the other two degradation models. The physical degradation phenomenon is also consistent with these goodness-of-fit results. Because LEDs degrade monotonically over time, the data most closely follow the assumptions of the monotonic and nonnegative Gamma process, and then those of the IG process. Because of the clear monotonic behavior of the LED data, the degradation does not follow the Wiener process. All the model fitness statistics and criteria also indicate that the Wiener process fits the LED data poorly. Further, the poorly fitted Wiener process resulted in a much lower estimate of the nonlinear constant c (see Table 3.2), which represents a slower degradation rate than the actual one. This misrepresentation of the degradation increments and the underestimated degradation rate cause the Wiener degradation model to overestimate the lifetime and reliability. This case example clearly shows the importance of choosing the right stochastic process for assessing the system’s degradation behavior.

FIGURE 3.3  Graphical model fitness of LED degradation data.

FIGURE 3.4  Q-Q and probability plots of degradation data.

TABLE 3.3
Goodness-of-fit Statistics for Stochastic Processes

Goodness-of-fit Statistic   Wiener      Gamma      Inverse Gaussian
KS statistic                0.1802      0.0708     0.1590
CVM statistic               1.27821     0.1159     0.5977
AD statistic                7.1927      0.6224     2.9771
AIC                         −315.2034   −407.427   −389.4947
BIC                         −309.6285   −401.852   −383.9197

3.3  DISCRETE STATE STOCHASTIC PROCESSES


This section presents and discusses different stochastic processes used to model the
discrete state degradation process. Unlike the Wiener process, Gamma process, and
IG process models, a finite state stochastic process evolves through a finite number
of states. In a continuous state degradation process, the degradation process is mod-
eled as a continuous variable. When the degradation process exceeds a predefined
threshold, the item is considered failed. However, most engineering systems consist
of components that have a range of performance levels from perfect functioning to
complete failure. In  the discrete-state space, the overall status of the degradation
process is divided into several discrete levels with different performances ranging
from perfect functioning to complete failure. It is important to highlight here that as the number of states approaches infinity, the discrete-state space and the continuous-state space become equivalent to each other.
In general, it is assumed that the degradation process {X(t), t ≥ 0} evolves on a finite state space S = {0, 1, …, M − 1, M}, with 0 corresponding to the perfectly healthy state, M representing the failed state of the monitored system, and the others being intermediate states. At time t = 0, the process is in the perfect state, and as time passes it moves to degraded states. A state transition diagram used for modeling the degradation process is shown in Figure 3.5. Each node represents a state of the degradation process and each branch between two nodes represents the transition between the corresponding states. A system can degrade according to three types of transitions: transition to the neighbor state (Type 1), transition to any intermediate state (Type 2), and transition to the failure state (Type 3). Type 1 transitions, from one state to the next degraded state, are typical of degradation mechanisms driven by cumulative damage and are called minor degradation. Type 2 and Type 3 transitions are called major degradation.
In the context of modeling degradation process, this section focuses on cases in
which there is no intervention in the degradation process; i.e., once the process tran-
sits to a degradation state, the previous state is not visited again.

FIGURE 3.5  A multi-state degradation process with minor and major degradation.

The discrete state stochastic process used to model the degradation process can
be divided into different categories depending on the continuous or discrete nature
of the time variable, and Markovian and non-Markovian property (Moghaddass and
Zuo 2014).
From a time viewpoint, the multistate degradation process can evolve according to a discrete-time stochastic process or a continuous-time stochastic process. In the discrete-time type, transitions between states occur only at specific time points; in the continuous-time type, transitions can occur at any time. With respect to the dependency of degradation transitions on the history of the degradation process, the multistate degradation process can be divided into Markovian and non-Markovian degradation processes. When the degradation transition between two states depends only on the current state, that is, when the degradation process is independent of its history, the degradation model follows the Markovian structure. On the other hand, in a multistate degradation process with a non-Markovian structure, the transition between two states may depend on other factors, such as previous states, the age of the system, and how long the system has been in its current state. The following sections provide a detailed discussion of the Markovian structure and the semi-Markov process with suitable examples.

3.3.1  Markovian Structure


A stochastic process $\{X(t) \mid t \ge 0\}$ is called a Markov process if, for any $t_0 < t_1 < t_2 < \cdots < t_{n-1} < t_n < t$, the conditional distribution of $X(t)$ for given values of $X(t_0), X(t_1), \ldots, X(t_n)$ depends only on $X(t_n)$:

$$
\Pr\{X(t) \le x \mid X(t_n) = x_n, X(t_{n-1}) = x_{n-1}, \ldots, X(t_1) = x_1, X(t_0) = x_0\} = \Pr\{X(t) \le x \mid X(t_n) = x_n\} \tag{3.20}
$$

This applies to a Markov process with discrete-state space or continuous-state space. A Markov process with discrete-state space is known as a Markov chain. If the time space is discrete, then it is a discrete-time Markov chain; otherwise, it is a continuous-time Markov chain.

A discrete-time Markov chain is a sequence of random variables $X_0, X_1, \ldots, X_n, \ldots$ that satisfy the following equation for every n (n = 1, 2, …):

$$
\Pr(X_n = x_n \mid X_0 = x_0, X_1 = x_1, \ldots, X_{n-1} = x_{n-1}) = \Pr(X_n = x_n \mid X_{n-1} = x_{n-1}) \tag{3.21}
$$

If the state of the Markov chain at time step n is $x_n$, we denote it as $X_n = x_n$. Equation 3.21 implies that the future behavior of the chain depends only on its current state and is independent of its past behavior. Therefore, the probability that the Markov chain goes from state i into state j in one step, which is called the one-step transition probability, is $p_{ij} = \Pr(X_n = j \mid X_{n-1} = i)$. For a time-homogeneous Markov chain, the transition probability between two states does not depend on n, i.e., $p_{ij} = \Pr(X_n = j \mid X_{n-1} = i) = \Pr(X_1 = j \mid X_0 = i) = \text{constant}$. The one-step transition probabilities can be condensed into a transition probability matrix for a discrete-time Markov chain with M + 1 states as follows:

$$
P = \begin{bmatrix}
p_{00} & p_{01} & \cdots & p_{0M} \\
p_{10} & p_{11} & \cdots & p_{1M} \\
\vdots & \vdots & \ddots & \vdots \\
p_{M0} & p_{M1} & \cdots & p_{MM}
\end{bmatrix} \tag{3.22}
$$

The sum of each row in P is one and all elements are non-negative. As the discrete-time Markov chain is used to model the degradation process of an item, the transition probability matrix P is in upper-triangular form ($p_{ij} = 0$ for $i > j$) to reflect the system deterioration without considering maintenance or repair. Moreover, for the failure state M, which is also known as an absorbing state, $p_{MM} = 1$ and $p_{Mj} = 0$ for $j = 0, 1, \ldots, M - 1$.
Given the transition probability matrix P and the initial conditions of the Markov chain, $p(0) = [p_0(0), p_1(0), \ldots, p_M(0)]$, we can compute the state probabilities at step n, $p(n) = [p_0(n), p_1(n), \ldots, p_M(n)]$, where $p_j(n) = \Pr\{X_n = j\}$, $j = 0, 1, \ldots, M$, is the probability that the chain is in state j after n transitions. For many applications such as reliability estimation and prognostics, state probabilities are of utmost interest.
Based on the Chapman-Kolmogorov equation, the probability of a process moving from state i to state j after n steps (transitions) can be calculated by multiplying the matrix P by itself n times (Ross 1995). Thus, assuming that p(0) is the initial state vector, the row vector of the state probabilities after the nth step is given as:

$$
p(n) = p(0)\,P^{n} \tag{3.23}
$$

For most systems, because the system is in perfect condition at the beginning of its mission, the initial state vector is given as $p(0) = [1, 0, 0, \ldots, 0]$.
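Equation 3.23 can be evaluated by repeated vector-matrix multiplication. The following is a minimal sketch for a three-state degradation chain; the transition matrix is hypothetical and chosen only to be upper triangular with an absorbing failure state.

```python
def vec_mat(p, P):
    """Row vector times matrix."""
    return [sum(p[i] * P[i][j] for i in range(len(p))) for j in range(len(P[0]))]

def state_probabilities(p0, P, n):
    """Equation 3.23: p(n) = p(0) P^n, computed by n successive vector-matrix products."""
    p = list(p0)
    for _ in range(n):
        p = vec_mat(p, P)
    return p

# Hypothetical upper-triangular transition matrix; state 2 is the absorbing failure state.
P = [
    [0.90, 0.08, 0.02],
    [0.00, 0.85, 0.15],
    [0.00, 0.00, 1.00],
]
p0 = [1.0, 0.0, 0.0]
p10 = state_probabilities(p0, P, 10)
reliability_10 = p10[0] + p10[1]  # probability of not having failed after 10 steps
```

Because state 2 is absorbing, the probability of being in that state can only grow with n, so the reliability computed this way is non-increasing in the number of steps.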
When the transition from the current state i to a lower (more degraded) state j can take place at any instant of time, the continuous-time Markov chain is used to model the degradation process. In analogy with discrete-time Markov chains, a stochastic process $\{X(t) \mid t \ge 0\}$ is a continuous-time Markov chain if the following equation holds for every $t_0 < t_1 < \cdots < t_{n-1} < t_n$ (n is a positive integer):

$$
\Pr(X(t_n) = x_n \mid X(t_0) = x_0, \ldots, X(t_{n-1}) = x_{n-1}) = \Pr(X(t_n) = x_n \mid X(t_{n-1}) = x_{n-1}) \tag{3.24}
$$

Equation 3.24 is analogous to Equation 3.21. Thus, most of the properties of the continuous-time Markov process are similar to those of the discrete-time Markov process. The probability of the continuous-time Markov chain going from state i into state j during $\Delta t$, which is called the transition probability, is $\Pr(X(t + \Delta t) = j \mid X(t) = i) = \pi_{ij}(t, \Delta t)$. These probabilities satisfy $\pi_{ij}(t, \Delta t) \ge 0$ and $\sum_{j=0}^{M} \pi_{ij}(t, \Delta t) = 1$.
For a time-homogeneous continuous-time Markov chain, the transition probability between two states does not depend on t but only on the length of the time interval $\Delta t$. Moreover, the transition rate $\lambda_{ij}(t)$ from state i to state j ($i \ne j$) at time t is defined as $\lambda_{ij}(t) = \lim_{\Delta t \to 0} \pi_{ij}(t, \Delta t)/\Delta t$, which does not depend on t and is constant for a homogeneous Markov process.
Like the discrete-time case, it is important to get the state probabilities for calculating the availability and reliability measures for the system. The state probabilities of X(t) are:

$$
p_j(t) = \Pr\{X(t) = j\}, \quad j = 0, 1, \ldots, M, \quad \text{for } t \ge 0, \quad \text{and} \quad \sum_{j=0}^{M} p_j(t) = 1 \tag{3.25}
$$

Knowing the initial condition, and based on the theorem of total probability and the Chapman-Kolmogorov equation, the state probabilities are obtained from the following system of differential equations (Trivedi 2002; Ross 1995):

$$
p'_j(t) = \frac{dp_j(t)}{dt} = \sum_{\substack{i=0 \\ i \ne j}}^{M} p_i(t)\,\lambda_{ij} - p_j(t) \sum_{\substack{i=0 \\ i \ne j}}^{M} \lambda_{ji}, \quad j = 0, 1, \ldots, M \tag{3.26}
$$

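Equation 3.26 can be written in matrix notation as:

$$
\frac{dp(t)}{dt} = p(t)\,\lambda, \quad p(t) = [p_0(t), p_1(t), \ldots, p_M(t)], \quad
\lambda = \begin{bmatrix}
\lambda_{00} & \lambda_{01} & \cdots & \lambda_{0M} \\
\lambda_{10} & \lambda_{11} & \cdots & \lambda_{1M} \\
\vdots & \vdots & \ddots & \vdots \\
\lambda_{M0} & \lambda_{M1} & \cdots & \lambda_{MM}
\end{bmatrix} \tag{3.27}
$$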
In the transition rate matrix, $\lambda_{jj} = -\sum_{i \ne j} \lambda_{ji}$ and $\sum_{j=0}^{M} \lambda_{ij} = 0$ for $0 \le i \le M$. As the continuous-time Markov chain is used to model the degradation process, the transition rate matrix $\lambda$ is in upper-triangular form ($\lambda_{ij} = 0$ for $i > j$) to reflect the degradation process without considering maintenance or repair. Since state M is an absorbing state, all the transition rates from this state are equal to zero: $\lambda_{Mj} = 0$ for $j = 0, 1, \ldots, M - 1$.
Several numerical and analytical methods are available to solve the system in Equation 3.27, such as the enumerative method (Liu and Kapur 2007), the recursive approach (Sheu and Zhang 2013), and the Laplace-Stieltjes transform (Lisnianski and Levitin 2003).

Example 3.3.1.1 

Consider a system that can have four possible states, S = {0, 1, 2, 3}, where state 0 indicates that the system is in as good as new condition, states 1 and 2 are intermediate degraded conditions, and state 3 is the failure state. The system has only minor failures; i.e., there is no jump between different states without passing through all intermediate states. The transition rate matrix is given as:

$$
\lambda = \begin{bmatrix}
\lambda_{00} & \lambda_{01} & \lambda_{02} & \lambda_{03} \\
\lambda_{10} & \lambda_{11} & \lambda_{12} & \lambda_{13} \\
\lambda_{20} & \lambda_{21} & \lambda_{22} & \lambda_{23} \\
\lambda_{30} & \lambda_{31} & \lambda_{32} & \lambda_{33}
\end{bmatrix}
= \begin{bmatrix}
-3 & 3 & 0 & 0 \\
0 & -2 & 2 & 0 \\
0 & 0 & -1 & 1 \\
0 & 0 & 0 & 0
\end{bmatrix}
$$

Here $\lambda_{33} = 0$ shows that state 3 is an absorbing state. If the system is in the best state at the beginning ($p(0) = [p_0(0), p_1(0), p_2(0), p_3(0)] = [1, 0, 0, 0]$), the goal is to compute the system reliability at time t > 0.

Solution 3.3.1.1:  For  the multi-state systems, the reliability measure can be
based on the ability of the system to meet the customer demand W (required
performance level). Therefore, the state space can be divided into two subsets
of acceptable states in which their performance level is higher than or equal to
the demand level and unacceptable states. The reliability of the system at time
t is the summation of probabilities of all acceptable states. All the unacceptable
states can be regarded as failed states, and the failure probability is a sum of
probabilities of all the unacceptable states.
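First, find the state probabilities at time t for each state by solving the following differential equations, which follow from Equation 3.26:

$$
\begin{cases}
\dfrac{dp_0(t)}{dt} = -\lambda_{01}\,p_0(t) \\[2mm]
\dfrac{dp_1(t)}{dt} = \lambda_{01}\,p_0(t) - \lambda_{12}\,p_1(t) \\[2mm]
\dfrac{dp_2(t)}{dt} = \lambda_{12}\,p_1(t) - \lambda_{23}\,p_2(t) \\[2mm]
\dfrac{dp_3(t)}{dt} = \lambda_{23}\,p_2(t)
\end{cases}
$$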
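Using the Laplace-Stieltjes transforms and inverse Laplace-Stieltjes transforms (Lisnianski et al. 2010), the state probabilities at time t are found as:

$$
\begin{cases}
p_0(t) = e^{-\lambda_{01} t} \\[2mm]
p_1(t) = \dfrac{\lambda_{01}}{\lambda_{01} - \lambda_{12}}\left(e^{-\lambda_{12} t} - e^{-\lambda_{01} t}\right) \\[2mm]
p_2(t) = -\dfrac{\lambda_{12}\lambda_{01}\left[(\lambda_{01} - \lambda_{12})e^{-\lambda_{23} t} + (\lambda_{23} - \lambda_{01})e^{-\lambda_{12} t} + (\lambda_{12} - \lambda_{23})e^{-\lambda_{01} t}\right]}{(\lambda_{12} - \lambda_{23})(\lambda_{01} - \lambda_{12})(\lambda_{23} - \lambda_{01})} \\[2mm]
p_3(t) = 1 - p_2(t) - p_1(t) - p_0(t)
\end{cases}
$$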

FIGURE 3.6  System state probabilities: Example 3.3.1.1.

The plot of the state probabilities is shown in Figure 3.6. As shown, the probability of being in state 0 decreases with time and the probability of being in state 3 increases with time.
Then the reliability of the system at time t is calculated, based on the demand level, by summing the probabilities of all acceptable states:

$$
\begin{cases}
\text{If the acceptable states are } 0, 1, 2: & R_1(t) = p_0(t) + p_1(t) + p_2(t) \\
\text{If the acceptable states are } 0, 1: & R_2(t) = p_0(t) + p_1(t) \\
\text{If the acceptable state is } 0: & R_3(t) = p_0(t)
\end{cases}
$$

The plots of the system reliability for all three cases are shown in Figure 3.7.

FIGURE 3.7  System reliability for various cases.
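Example 3.3.1.1 can be cross-checked numerically. The sketch below integrates Equation 3.27 for the given rate matrix with a classical fourth-order Runge-Kutta scheme (a generic ODE method chosen here for illustration, not the transform method used in the solution) and compares the result with the closed-form state probabilities of this four-state chain, written with outflow rates λ01 = 3, λ12 = 2, and λ23 = 1.

```python
import math

RATES = [
    [-3.0, 3.0, 0.0, 0.0],
    [0.0, -2.0, 2.0, 0.0],
    [0.0, 0.0, -1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]

def derivative(p):
    """Right-hand side of Equation 3.27: dp/dt = p * lambda."""
    n = len(p)
    return [sum(p[i] * RATES[i][j] for i in range(n)) for j in range(n)]

def solve_rk4(p0, t_end, steps=2000):
    """Integrate dp/dt = p * lambda with the classical fourth-order Runge-Kutta scheme."""
    h = t_end / steps
    p = list(p0)
    for _ in range(steps):
        k1 = derivative(p)
        k2 = derivative([x + 0.5 * h * k for x, k in zip(p, k1)])
        k3 = derivative([x + 0.5 * h * k for x, k in zip(p, k2)])
        k4 = derivative([x + h * k for x, k in zip(p, k3)])
        p = [x + h / 6.0 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p

def closed_form(t, l01=3.0, l12=2.0, l23=1.0):
    """Closed-form state probabilities of the four-state chain."""
    p0 = math.exp(-l01 * t)
    p1 = l01 / (l01 - l12) * (math.exp(-l12 * t) - math.exp(-l01 * t))
    num = ((l01 - l12) * math.exp(-l23 * t) + (l23 - l01) * math.exp(-l12 * t)
           + (l12 - l23) * math.exp(-l01 * t))
    p2 = -l12 * l01 * num / ((l12 - l23) * (l01 - l12) * (l23 - l01))
    return [p0, p1, p2, 1.0 - p0 - p1 - p2]

t = 1.0
numeric = solve_rk4([1.0, 0.0, 0.0, 0.0], t)
analytic = closed_form(t)
R1, R2, R3 = sum(analytic[:3]), sum(analytic[:2]), analytic[0]
```

The three reliability values correspond to the three demand levels of the example; a stricter demand level (fewer acceptable states) can only lower the reliability.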



Let $\tau_i$ denote the time that the degradation process spends in state i. According to the Markov property in Equation 3.24, $\tau_i$ does not depend on the past states of the process, so the following equation holds:

$$
P(\tau_i > t + \Delta t \mid \tau_i > t) = h(\Delta t) \tag{3.28}
$$

The function $h(\Delta t)$ in Equation 3.28 depends only on $\Delta t$, and not on the elapsed time t. The only continuous probability distribution that satisfies Equation 3.28 is the exponential distribution. In the discrete-time case, the requirement in Equation 3.28 leads to the geometric distribution.
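The memoryless requirement in Equation 3.28 can be illustrated numerically. In the sketch below, the conditional survival probability is computed for an exponential sojourn time and, for contrast, for a Weibull sojourn time with shape parameter 3; the rate, scale, and shape values are hypothetical.

```python
import math

def exponential_survival(t, rate):
    """P(tau > t) for an exponential sojourn time."""
    return math.exp(-rate * t)

def weibull_survival(t, scale, shape):
    """P(tau > t) for a Weibull sojourn time."""
    return math.exp(-((t / scale) ** shape))

def conditional_survival(survival, t, dt):
    """P(tau > t + dt | tau > t) for a given survival function."""
    return survival(t + dt) / survival(t)

dt = 0.5
exp_values = [conditional_survival(lambda u: exponential_survival(u, 2.0), t, dt)
              for t in (0.0, 1.0, 3.0)]
weib_values = [conditional_survival(lambda u: weibull_survival(u, 1.0, 3.0), t, dt)
               for t in (0.0, 1.0, 2.0)]
```

For the exponential case the conditional probability is the same for every t, whereas for the Weibull case with an increasing hazard it drops as t grows, which is why such sojourn times cannot be represented by a homogeneous Markov chain.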
In a Markovian degradation structure, the transition between two states at time t depends only on the two states involved and is independent of the history of the process before time t (memoryless property). The fixed transition probabilities/rates and the geometric/exponential sojourn time distributions limit the use of a Markov chain to model the degradation process of real systems. For the degradation process of some systems, the probability of making the transition from one state to a more degraded state may increase with age, and the probability of remaining in the current state will decrease. That is, $p_{ii}(t + \Delta t) \le p_{ii}(t)$ and $\sum_{j=i+1}^{n} p_{ij}(t + \Delta t) \ge \sum_{j=i+1}^{n} p_{ij}(t)$. Therefore, the transition probabilities and transition rates are not constant over time, and an extension of the Markovian model, called the aging Markovian deterioration model, is used to include this aging effect.
For the discrete-time aging Markovian model, P(t) is the one-step transition probability matrix at time t and $p_{ij}(t)$ represents the transition probability from state i to state j at time t. As shown in Chen and Wu (2007), each row of P(t) represents a state probability distribution, given the current state i, that forms a bell-shaped distribution. Let $N_i$ satisfy $p_{i,N_i}(t) = \max_j \{p_{ij}(t),\ j = 0, 1, \ldots, M\}$, so that $N_i$ is the state with the peak transition probability in the bell-shaped distribution. Then:

$$
P_i^{L}(t) \equiv \sum_{j=0}^{N_i} p_{ij}(t); \quad P_i^{R}(t) \equiv \sum_{j=N_i+1}^{M} p_{ij}(t) \tag{3.29}
$$

$P_i^{L}(t)$ and $P_i^{R}(t)$ are the left-hand side and right-hand side cumulative probabilities, respectively. Since $\sum_{j=0}^{M} p_{ij}(t) = 1$, $P_i^{R}(t) = 1 - P_i^{L}(t)$. For $j \le N_i$, $p_{ij}(t+1) \le p_{ij}(t)$, and for $j > N_i$, $p_{ij}(t+1) \ge p_{ij}(t)$. As the system becomes older, $P_i^{L}$ decreases while $P_i^{R}$ increases; therefore:

$$
P_i^{L}(t) \ge P_i^{L}(t+1); \quad P_i^{R}(t) \le P_i^{R}(t+1) \tag{3.30}
$$

Then P(t + 1) can be modified as:

$$
p_{ij}(t+1) \equiv p_{ij}(t)\,\frac{P_i^{L}(t+1)}{P_i^{L}(t)} \quad \forall j \le N_i; \qquad
p_{ij}(t+1) \equiv p_{ij}(t)\,\frac{P_i^{R}(t+1)}{P_i^{R}(t)} \quad \forall j > N_i \tag{3.31}
$$

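The aging factor δ (0 ≤ δ < 1) is defined by Chen and Wu (2007) as $\delta = P_i^{R}(t+1)/P_i^{R}(t) - 1$, which can be estimated from historical data. Because $P_i^{L}(t+1) = 1 - (1 + \delta)P_i^{R}(t) = P_i^{L}(t) - \delta P_i^{R}(t)$, Equation 3.31 is represented as:

$$
p_{ij}(t+1) \equiv p_{ij}(t)\left(1 - \delta\,\frac{P_i^{R}(t)}{P_i^{L}(t)}\right) \quad \forall j \le N_i; \qquad
p_{ij}(t+1) \equiv p_{ij}(t)\,(1 + \delta) \quad \forall j > N_i \tag{3.32}
$$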

Starting with the initial transition probability matrix P(0), the values of P(t), which change over time, can be calculated according to Equation 3.32.
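One aging step for a single row of the transition matrix can be sketched as follows. The sketch assumes that entries to the right of the peak are scaled by (1 + δ) and that entries up to the peak are scaled by the factor that keeps the row sum equal to one, namely 1 − δ·P_i^R(t)/P_i^L(t). The row values and the aging factor are hypothetical.

```python
def age_row(row, delta):
    """One aging step for a single row of P(t), following Equations 3.29 to 3.32."""
    peak = max(range(len(row)), key=lambda j: row[j])   # N_i: index of the largest entry
    left = sum(row[: peak + 1])                         # P_i^L(t)
    right = sum(row[peak + 1:])                         # P_i^R(t)
    if right == 0.0:
        return list(row)                                # nothing to shift (absorbing row)
    left_factor = 1.0 - delta * right / left            # keeps the row sum equal to one
    return [p * left_factor if j <= peak else p * (1.0 + delta)
            for j, p in enumerate(row)]

# Hypothetical row for state i = 1 of a four-state chain and a hypothetical aging factor.
row_t = [0.0, 0.80, 0.15, 0.05]
delta = 0.10
row_next = age_row(row_t, delta)
```

Applying `age_row` to every row of P(t) gives P(t + 1); repeating the step from P(0) yields the whole sequence of time-dependent transition matrices, provided δ is small enough that the left-hand factor stays non-negative.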
For  the continuous-time aging Markovian model, which is called the non-
homogeneous continuous-time Markov process, the amount of time that the sys-
tem spends in each state before proceeding to the degraded state does not follow
the exponential distribution. Usually, the transition times are assumed to obey
Weibull distribution because of its flexibility, which allows considering hazard
functions both increasing and decreasing over time, at different speeds.
To get the state probabilities at each time t, we have to solve the Chapman-Kolmogorov equations as:

$$
\frac{dp_j(t)}{dt} = \sum_{\substack{i=0 \\ i \ne j}}^{M} p_i(t)\,\lambda_{ij}(t) - p_j(t) \sum_{\substack{i=0 \\ i \ne j}}^{M} \lambda_{ji}(t), \quad j = 0, 1, \ldots, M \tag{3.33}
$$
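Equation 3.33 can be written in the matrix form as:

$$
\frac{dp(t)}{dt} = p(t)\,\lambda(t), \quad p(t) = [p_0(t), \ldots, p_M(t)], \quad
\lambda(t) = \begin{bmatrix}
\lambda_{00}(t) & \lambda_{01}(t) & \cdots & \lambda_{0M}(t) \\
\lambda_{10}(t) & \lambda_{11}(t) & \cdots & \lambda_{1M}(t) \\
\vdots & \vdots & \ddots & \vdots \\
\lambda_{M0}(t) & \lambda_{M1}(t) & \cdots & \lambda_{MM}(t)
\end{bmatrix} \tag{3.34}
$$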
The transition rate matrix λ(t) has the same properties as the transition matrix in Equation 3.27. To find the state probabilities at time t, many methods have been used to solve Equation 3.34, such as the state–state integration method (Liu and Kapur 2007) and the recursive approach (Sheu and Zhang 2013). Equation 3.34 can be recursively solved from state 0 to state M as follows:

$$
p_0(t) = \exp\left(\int_0^t \lambda_{00}(s)\,ds\right) \tag{3.35}
$$

$$
p_j(t) = \sum_{i=0}^{j-1} \int_0^t p_i(\tau_{i+1})\,\lambda_{ij}(\tau_{i+1})\,\exp\left(\int_{\tau_{i+1}}^{t} \lambda_{jj}(s)\,ds\right) d\tau_{i+1}, \quad j = 1, \ldots, M - 1 \tag{3.36}
$$

$$
p_M(t) = 1 - \sum_{j=0}^{M-1} p_j(t) \tag{3.37}
$$

The initial conditions are assumed to be $p(0) = [p_0(0) = 1, p_1(0) = 0, \ldots, p_M(0) = 0]$.

Example 3.3.1.2 

(Sheu and Zhang 2013; Shu et al. 2015) Assume that a system degrades through five different possible states, S = {0, 1, 2, 3, 4}, where state 0 is the best state and state 4 is the worst state. The time $T_{ij}$ spent in each state i before moving to the next state j follows the Weibull distribution $T_{ij} \sim \text{Weibull}(1/(i - 0.5j), 3)$ with scale parameter $\alpha_{ij} = 1/(i - 0.5j)$ and shape parameter $\beta = 3$. The nonhomogeneous continuous-time Markov process is used to model the degradation process. The transition rate from state i to state j at time t is $\lambda_{ij}(t) = 3t^{2}/(i - 0.5j)^{3}$ $\forall i, j \in S$, $i > j$. Based on the demand level, states 3 and 4 are unacceptable states. The goal is to compute the system reliability at time t (0 < t < 4).

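Solution 3.3.1.2: The transient degradation rate matrix is:

$$
\lambda(t) = \begin{bmatrix}
\lambda_{00}(t) & \lambda_{01}(t) & \lambda_{02}(t) & \lambda_{03}(t) & \lambda_{04}(t) \\
0 & \lambda_{11}(t) & \lambda_{12}(t) & \lambda_{13}(t) & \lambda_{14}(t) \\
0 & 0 & \lambda_{22}(t) & \lambda_{23}(t) & \lambda_{24}(t) \\
0 & 0 & 0 & \lambda_{33}(t) & \lambda_{34}(t) \\
0 & 0 & 0 & 0 & 0
\end{bmatrix}
= \begin{bmatrix}
-0.419945t^{2} & 0.1920t^{2} & 0.1111t^{2} & 0.06997t^{2} & 0.046875t^{2} \\
0 & -0.6781t^{2} & 0.3750t^{2} & 0.1920t^{2} & 0.1111t^{2} \\
0 & 0 & -1.2639t^{2} & 0.8889t^{2} & 0.3750t^{2} \\
0 & 0 & 0 & -3t^{2} & 3t^{2} \\
0 & 0 & 0 & 0 & 0
\end{bmatrix}
$$

The initial conditions are $p_0(0) = 1$ and $p_j(0) = 0$ for $j = 1, 2, \ldots, M$.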
The state probabilities can be obtained using Equations 3.36 and 3.37 as:

$$
\begin{aligned}
p_0(t) &= e^{-0.14t^{3}} \\
p_1(t) &= -0.7439\left(e^{-0.226t^{3}} - e^{-0.14t^{3}}\right) \\
p_2(t) &= 0.0139e^{-0.4213t^{3}} + 0.4623e^{-0.14t^{3}} - 0.4761e^{-0.226t^{3}} \\
p_3(t) &= -0.005109e^{-t^{3}} + 0.24178e^{-0.14t^{3}} - 0.24381e^{-0.226t^{3}} + 0.007117e^{-0.4213t^{3}} \\
p_4(t) &= 1 - p_0(t) - p_1(t) - p_2(t) - p_3(t) \\
&= 1 - 2.44798e^{-0.14t^{3}} + 1.46381e^{-0.226t^{3}} - 0.021017e^{-0.4213t^{3}} + 0.005109e^{-t^{3}}
\end{aligned}
$$

The system state probabilities are shown in Figure 3.8.
As states 3 and 4 are unacceptable states, the reliability of the system at time t is $R_s(t) = p_0(t) + p_1(t) + p_2(t)$. Figure 3.9 shows the system reliability as a function of time.
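The closed-form result of Example 3.3.1.2 can be checked by integrating Equation 3.34 directly. The sketch below uses the numerical coefficients of the rate matrix as listed in the solution and a classical fourth-order Runge-Kutta scheme, which is a generic ODE method chosen here for illustration rather than the recursive approach of Equations 3.35 to 3.37. As a side observation, the listed coefficients equal 3/α³ with α = 2.5, 3, 3.5, 4 in the first row, which is what the stated rate formula gives if its indices count states from 4 (best) down to 0 (worst); this reading is an inference from the numbers, not something stated in the example.

```python
import math

# Coefficients of t^2 in the transition rate matrix of Example 3.3.1.2.
COEFF = [
    [-0.419945, 0.1920, 0.1111, 0.06997, 0.046875],
    [0.0, -0.6781, 0.3750, 0.1920, 0.1111],
    [0.0, 0.0, -1.2639, 0.8889, 0.3750],
    [0.0, 0.0, 0.0, -3.0, 3.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]

def derivative(t, p):
    """Equation 3.34: dp/dt = p(t) * lambda(t), with lambda_ij(t) = COEFF[i][j] * t^2."""
    n = len(p)
    return [t * t * sum(p[i] * COEFF[i][j] for i in range(n)) for j in range(n)]

def solve_rk4(t_end, steps=4000):
    """Integrate from p(0) = [1, 0, 0, 0, 0] with the classical Runge-Kutta scheme."""
    h = t_end / steps
    t, p = 0.0, [1.0, 0.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        k1 = derivative(t, p)
        k2 = derivative(t + 0.5 * h, [x + 0.5 * h * k for x, k in zip(p, k1)])
        k3 = derivative(t + 0.5 * h, [x + 0.5 * h * k for x, k in zip(p, k2)])
        k4 = derivative(t + h, [x + h * k for x, k in zip(p, k3)])
        p = [x + h / 6.0 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
        t += h
    return p

def reliability_closed_form(t):
    """R_s(t) = p0 + p1 + p2 from the closed-form expressions of the example."""
    e1 = math.exp(-0.14 * t**3)
    e2 = math.exp(-0.226 * t**3)
    e3 = math.exp(-0.4213 * t**3)
    p0 = e1
    p1 = -0.7439 * (e2 - e1)
    p2 = 0.0139 * e3 + 0.4623 * e1 - 0.4761 * e2
    return p0 + p1 + p2

t = 1.0
p_numeric = solve_rk4(t)
R_numeric = sum(p_numeric[:3])
R_closed = reliability_closed_form(t)
```

The two reliability values agree up to the rounding of the printed coefficients, which supports the closed-form expressions of the example.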
The aging Markovian models used to overcome the limitations of Markov chain
structures can be framed as a semi-Markov process. Semi-Markovian structures
consider the history of the degradation process and consider arbitrary sojourn time
distributions at each state. Semi-Markovian models as an extension of Markovian-
based models will be explained in the next section.

FIGURE 3.8  System state probabilities: Example 3.3.1.2.

FIGURE 3.9  System reliability as a function of time.

3.3.2 Semi-Markov Process
The  semi-Markov process can be applied to model the degradation process of
some systems whose degradation process cannot be captured by a Markov process.
For example, Ng and Moses (1998) used the semi-Markov process to model bridge
degradation behavior. They described the semi-Markov process in terms of a transi-
tion matrix and a holding time or sojourn time matrix. A transition matrix has a set

of transition probabilities between states that describe the embedded Markov chain.
The holding time matrix has a set of probabilities obtained from the probability den-
sity function of the holding times between states.
For Markov models, the transition probability of going from one state to another
does not depend on how the item arrived at the current state or how long it has been
there. However, semi-Markov models relax this condition to allow the time spent in
a state to follow an arbitrary probability distribution. Therefore, the process stays in
a particular state for a random duration that depends on the current state and on the
next state to be visited (Ross 1995).
To describe the semi-Markov process $X \equiv \{X(t) : t \ge 0\}$, consider the degradation process of a system with finite state space $S = \{0, 1, 2, \ldots, M\}$ (M + 1 is the total number of possible states). The process visits some state $i \in S$ and spends a random amount of time there that depends on the next state it will visit, $j \in S$, $i \ne j$. Let $T_n$ denote the time of the nth transition of the process, and let $X(T_n)$ be the state of the process after the nth transition. The process transitions from state i to state $j \ne i$ with probability $p_{ij} = P(X(T_{n+1}) = j \mid X(T_n) = i)$. Given that the next state is j, the sojourn time from state i to state j has a CDF, $F_{ij}$. For a semi-Markov process, the sojourn times can follow any distribution, and $p_{ij}$ is also the transition probability of the embedded Markov chain.
The one-step transition probability of the semi-Markov process transiting to state j within a time interval less than or equal to t, provided that it starts from state i, is expressed as (Cinlar 1975):

$$
Q_{ij}(t) = \Pr\left(X(T_{n+1}) = j,\ T_{n+1} - T_n \le t \mid X(T_n) = i\right), \quad t \ge 0 \tag{3.38}
$$

The random time between successive transitions, $T_{n+1} - T_n$ (the sojourn time), has the CDF:

$$
F_{ij}(t) = \Pr\left(T_{n+1} - T_n \le t \mid X(T_{n+1}) = j,\ X(T_n) = i\right) \tag{3.39}
$$

If the sojourn time in a state depends only on the currently visited state, then the unconditional sojourn time distribution in state i is $F_{ij}(t) = F_i(t) = \sum_{j \in S} Q_{ij}(t)$. The matrix of transition probabilities of the semi-Markov process, $Q(t) = [Q_{ij}(t)]$, $i, j \in S$, which is called the semi-Markov kernel, is the essential quantity of a semi-Markov process and satisfies the relation:

$$
Q_{ij}(t) = p_{ij}\,F_{ij}(t) \tag{3.40}
$$

Equation 3.40 indicates that the transition of the semi-Markov model has two steps. Figure 3.10 shows a sample degradation path of a system. The system is in state i at the initial time instant and transits to the next worse state j with transition probability $p_{ij}$. As the process is a monotone non-increasing function when maintenance is not considered, j = i + 1 with probability one. Before moving into the next state j, the process waits for a random time with CDF $F_{ij}(t)$. This continues until the process enters state M, which is an absorbing state. For this example, the transition probability matrix is given as:

$$
P = \begin{bmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{bmatrix} \tag{3.41}
$$

FIGURE 3.10  A sample degradation process.

When the semi-Markov process is used to model the degradation process, the initial state of the process, the transition probability matrix P, and the matrix F(t) must be known. Another way of defining the semi-Markov process is through the kernel matrix and the initial state probabilities.
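The relation between the transition probability matrix P, the sojourn time distributions F_ij(t), and the kernel Q_ij(t) in Equation 3.40 can be sketched in a few lines. The three-state system below, with both minor and major degradation and Weibull sojourn times, is entirely hypothetical.

```python
import math

def weibull_cdf(t, scale, shape):
    """CDF of a Weibull distribution, used here as the sojourn time distribution F_ij."""
    return 1.0 - math.exp(-((t / scale) ** shape)) if t > 0.0 else 0.0

# Hypothetical three-state example: 0 (new), 1 (degraded), 2 (failed, absorbing).
P = [
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
# Hypothetical Weibull (scale, shape) parameters of the conditional sojourn times F_ij.
SOJOURN = {(0, 1): (100.0, 2.0), (0, 2): (300.0, 1.5), (1, 2): (50.0, 2.5)}

def kernel(i, j, t):
    """Equation 3.40: Q_ij(t) = p_ij * F_ij(t)."""
    if (i, j) not in SOJOURN:
        return 0.0
    return P[i][j] * weibull_cdf(t, *SOJOURN[(i, j)])

def sojourn_cdf(i, t):
    """Unconditional sojourn time CDF in state i: F_i(t) = sum over j of Q_ij(t)."""
    return sum(kernel(i, j, t) for j in range(len(P)))

q01 = kernel(0, 1, 80.0)
f0 = sojourn_cdf(0, 80.0)
```

As t grows, Q_ij(t) approaches p_ij and F_i(t) approaches one for every non-absorbing state, so the kernel carries both the embedded chain and the sojourn time information.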
Like previous models, it is important to find the state probabilities of the semi-Markov process. The probability that a semi-Markov process will be in state j at time t ≥ 0 given that it entered state i at time zero, $\pi_{ij}(t) \equiv \Pr\{X(t) = j \mid X(0) = i\}$, is found as follows (Howard 1960; Kulkarni 1995):

$$
\pi_{ij}(t) = \delta_{ij}\left[1 - F_i(t)\right] + \sum_{k \in S} \int_0^t q_{ik}(\vartheta)\,\pi_{kj}(t - \vartheta)\,d\vartheta \tag{3.42}
$$

$$
q_{ik}(\vartheta) = \frac{dQ_{ik}(\vartheta)}{d\vartheta} \tag{3.43}
$$

$$
\delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases} \tag{3.44}
$$

In general, it is difficult to obtain the transition functions, even when the kernel
matrix is known. Equation 3.42 can be solved using numerical methods such as
the quadrature method (Blasi et al. 2004; Corradi et al. 2004), Laplace and inverse
Laplace transforms (Dui et al. 2015), or simulation methods (Sánchez-Silva and
Klutke 2016).
Moreover, the stationary distribution $\pi = (\pi_j;\ j \in S)$ of the semi-Markov process is defined, when it exists, as:

$$\pi_j := \lim_{t \to \infty} \pi_{ij}(t) = \frac{\upsilon_j w_j}{\sum_{i=0}^{M} \upsilon_i w_i} \tag{3.45}$$

where $\upsilon_j$ for $j \in S$ denotes the stationary probability of the embedded Markov chain, satisfying $\upsilon_j = \sum_{i=0}^{M} \upsilon_i p_{ij}$ and $\sum_{i=0}^{M} \upsilon_i = 1$, and $w_j$ for $j \in S$ is the expected sojourn time in state $j$.
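
As a rough numerical illustration of Equation 3.45, the sketch below first computes the stationary vector of the embedded Markov chain and then weights it by the expected sojourn times. The three-state repairable system, its transition probabilities, and its sojourn times are hypothetical values chosen only so that a stationary distribution exists; the function names are likewise assumptions of this sketch.

```python
def embedded_stationary(P, iterations=5000):
    """Stationary vector v of the embedded Markov chain (v = v P) by power iteration."""
    n = len(P)
    v = [1.0 / n] * n
    for _ in range(iterations):
        # Iterate on the "lazy" chain (P + I) / 2: it has the same stationary
        # vector as P but cannot oscillate when P is periodic.
        v = [0.5 * v[j] + 0.5 * sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
    total = sum(v)
    return [x / total for x in v]

def semi_markov_stationary(P, w):
    """Equation 3.45: pi_j = v_j * w_j / sum_i (v_i * w_i)."""
    v = embedded_stationary(P)
    weighted = [vj * wj for vj, wj in zip(v, w)]
    total = sum(weighted)
    return [x / total for x in weighted]

# Hypothetical repairable system: 0 = good, 1 = degraded, 2 = failed (then repaired).
P = [
    [0.0, 1.0, 0.0],
    [0.3, 0.0, 0.7],
    [1.0, 0.0, 0.0],
]
w = [10.0, 5.0, 2.0]  # assumed expected sojourn times in states 0, 1, 2

pi = semi_markov_stationary(P, w)
print([round(x, 4) for x in pi])  # → [0.6098, 0.3049, 0.0854]
```

For this example the embedded chain visits states 0 and 1 equally often, yet the process spends about 61% of the time in state 0 because its sojourn time is twice as long.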
For some systems, degradation transitions between two states may depend on
the states involved in the transitions, the time spent at the current state (t), the time
at which the system reached the current state (s), and/or the total age of the system (t+s).
As another extension, a nonhomogeneous semi-Markov process is used for modeling
the degradation of such systems, in which degradation transitions can follow an
arbitrary distribution.
The associated non-homogeneous semi-Markov kernel is defined by:

$$Q_{ij}(s, t) = \Pr\big(X(T_{n+1}) = j,\ T_{n+1} \le t \mid X(T_n) = i,\ T_n = s\big), \quad t \ge 0 \tag{3.46}$$

In the non-homogeneous semi-Markov process, the state probabilities are defined and obtained using the following equation:

$$\pi_{ij}(t) = \Pr\{X(t) = j \mid X(0) = i\} = \delta_{ij}\,[1 - F_i(t, s)] + \sum_{k \in S} \int_s^t q_{ik}(s, \vartheta)\, \pi_{kj}(t - \vartheta)\, d\vartheta \tag{3.47}$$

The obtained state probabilities can be used to find different availability and reli-
ability indexes.

Example 3.3.2

Consider a system (or a component) whose possible states during its evolution in
time are S = {0,1, 2}. Denote by U = {0,1} the subset of working states of the system
and by D = {2} the failure state. In this system, both minor and major failures are
possible. The state transition diagram is shown in Figure 3.11.
The holding times are normally distributed, i.e., $F_{ij} \sim N(\mu_{ij}, \sigma_{ij})$. Therefore, the CDF of the holding time from state $i$ to state $j$ is:

$$F_{ij}(t) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \int_0^t e^{-\frac{(u - \mu_{ij})^2}{2\sigma_{ij}^2}}\, du \quad \forall i, j \in S$$

FIGURE 3.11  State transition diagram for semi-Markov model.

The goal is to find the system reliability at time t given the best state is the initial
state of the system.

Solution 3.3.2: As the system is in state 0 at the beginning, the probability that it has
reached the failure state 2 by time t is $\pi_{02}(t)$, so the reliability of the system at
time t is $R(t) = 1 - \pi_{02}(t)$.
First, we find the kernel matrix of the semi-Markov process $Q(t) = [Q_{ij}(t)],\ i, j \in S$:

$$Q(t) = \begin{bmatrix} 0 & Q_{01}(t) & Q_{02}(t) \\ 0 & 0 & Q_{12}(t) \\ 0 & 0 & 0 \end{bmatrix}$$

$Q_{01}(t)$ is the probability that the process transitions from state 0 to state 1 within a time interval less than or equal to t. It can be determined as the probability that the time of transition from state 0 to 1 ($T_{01}$) is less than or equal to t and that the competing transition from state 0 to 2 ($T_{02}$) has not yet occurred at that time:

$$Q_{01}(t) = \Pr(T_{01} \le t \text{ and } T_{02} > T_{01}) = \int_0^t [1 - F_{02}(u)]\, dF_{01}(u)$$

Other values of the kernel matrix are obtained as:

$$Q_{02}(t) = \Pr(T_{02} \le t \text{ and } T_{01} > T_{02}) = \int_0^t [1 - F_{01}(u)]\, dF_{02}(u)$$

$$Q_{12}(t) = \Pr(T_{12} \le t) = F_{12}(t)$$

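According to Equation 3.42, the following system of equations has to be solved to obtain $\pi_{02}(t)$, from which the system reliability follows as $R(t) = 1 - \pi_{02}(t)$:

$$\begin{cases}
\pi_{02}(t) = \displaystyle\int_0^t q_{01}(\vartheta)\, \pi_{12}(t - \vartheta)\, d\vartheta + \int_0^t q_{02}(\vartheta)\, \pi_{22}(t - \vartheta)\, d\vartheta \\[6pt]
\pi_{12}(t) = \displaystyle\int_0^t q_{12}(\vartheta)\, \pi_{22}(t - \vartheta)\, d\vartheta \\[6pt]
\pi_{22}(t) = 1
\end{cases}$$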

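A quadrature-based evaluation of this example can be sketched in a few lines of Python. The sketch applies Equation 3.42 to the kernel of this example, including the contribution of the direct transition from state 0 to state 2, and takes the reliability as the probability that the failure state has not yet been reached, R(t) = 1 − π02(t). The parameter values (µij, σij), the trapezoidal rule with 2000 steps, and all function names are assumptions made here for illustration; the chapter does not give numerical values.

```python
import math

# Hypothetical holding-time parameters (mu_ij, sigma_ij); not taken from the chapter.
PARAMS = {(0, 1): (10.0, 2.0), (0, 2): (15.0, 3.0), (1, 2): (8.0, 2.0)}

def density(u, i, j):
    """Normal density f_ij(u) of the holding time from state i to state j."""
    mu, sigma = PARAMS[(i, j)]
    return math.exp(-((u - mu) ** 2) / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cdf(t, i, j):
    """F_ij(t): the normal density integrated from 0 to t, as in the example."""
    if t <= 0.0:
        return 0.0
    mu, sigma = PARAMS[(i, j)]
    return std_normal_cdf((t - mu) / sigma) - std_normal_cdf(-mu / sigma)

def trapezoid(f, a, b, n=2000):
    """Composite trapezoidal rule on [a, b] with n sub-intervals."""
    if b <= a:
        return 0.0
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + k * h) for k in range(1, n)) + 0.5 * f(b))

def q01(u):
    # Kernel density of the 0 -> 1 transition: state 2 not yet reached at time u.
    return (1.0 - cdf(u, 0, 2)) * density(u, 0, 1)

def q02(u):
    # Kernel density of the direct 0 -> 2 transition.
    return (1.0 - cdf(u, 0, 1)) * density(u, 0, 2)

def pi12(t):
    # pi_12(t) is the integral of q12 times pi_22 = 1, which equals F_12(t).
    return cdf(t, 1, 2)

def pi02(t):
    via_state_1 = trapezoid(lambda u: q01(u) * pi12(t - u), 0.0, t)
    direct = trapezoid(q02, 0.0, t)  # integral of q02 times pi_22 = 1
    return via_state_1 + direct

def reliability(t):
    return 1.0 - pi02(t)

for t in (0.0, 5.0, 15.0, 25.0):
    print(f"R({t:4.1f}) = {reliability(t):.4f}")
```

With these assumed parameters the computed reliability starts at one and decreases monotonically toward zero, as expected for a system whose only absorbing state is the failure state.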

All these models presented are based on the assumption that the degradation
process is directly observable. However, in many cases, the degradation level is
not directly observable due to the complexity of the degradation process or the
nature of the product type. Therefore, to deal with indirectly observed states,
models such as hidden Markov models (HMM) and hidden semi-Markov models
(HSMM) have been developed. The HMM deals with two different stochastic
processes: the unobservable degradation process and the measurable characteristics
(which depend on the actual degradation process). In HMMs, finding a stochastic
relationship between the unobservable degradation process and the output
signals of the observation process is a critical prerequisite for condition monitoring
and reliability analysis. The details of HMMs are beyond the scope of
this chapter; interested readers can refer to Shahraki et al. (2017) and Si et al. (2011)
for more details.

3.4  SUMMARY AND CONCLUSIONS


This chapter presented the application of stochastic processes in degradation modeling
to assess product/system performance. The stochastic processes are categorized
into continuous-state and discrete-state processes. Among the continuous-state
stochastic processes, the Wiener, Gamma, and IG processes are discussed and
applied for degradation modeling of engineering systems using accelerated degradation
data. The lifetime and reliability estimation approaches are also derived
based on stochastic degradation models. To assess product performance accurately,
appropriate selection of the stochastic process is crucial. Graphical and
statistical methods are presented to assist in selecting the best-fitted
degradation model for a case-specific situation.
In addition, discrete-state stochastic processes have been discussed and applied
to model the degradation of systems whose degraded states take values from a
discrete space. The discrete- and continuous-time Markov chain models are used to
model the degradation process when the state transitions happen at discrete or
continuous times, respectively. In Markov chain models, the next state of the system
depends only on the current state and not on the history of the system (memoryless
property), which limits their application for some systems. As extensions of the Markovian
model, aging Markovian deterioration and semi-Markov models are applied to capture
the influence of the age and the history on the future states. After the degradation
process has been modeled with a proper model, the system reliability is calculated
for systems that degrade with time.

REFERENCES
Blasi, A., Janssen, J. and Manca, R., 2004. Numerical treatment of homogeneous and non-
homogeneous semi-Markov reliability models. Communications in Statistics, Theory
and Methods 33(3): 697–714.
Chaluvadi, V. N. H., 2008. Accelerated life testing of electronic revenue meters. PhD disser-
tation, Clemson, SC: Clemson University.
Chen, A. and Wu, G.S., 2007. Real-time health prognosis and dynamic preventive main-
tenance policy for equipment under aging Markovian deterioration. International
Journal of Production Research 45(15): 3351–3379.

Cinlar, E., 1975. Introduction to Stochastic Processes. Englewood Cliffs, NJ: Prentice-Hall.
Corradi, G., Janssen, J. and Manca, R., 2004. Numerical treatment of homogeneous semi-
Markov processes in transient case—a straightforward approach. Methodology and
Computing in Applied Probability 6(2): 233–246.
Dui, H., Si, S., Zuo, M. J. and Sun, S., 2015. Semi-Markov process-based integrated impor-
tance measure for multi-state systems. IEEE Transactions on Reliability 64(2): 754–765.
Howard, R., 1960. Dynamic Programming and Markov Processes, Cambridge, MA: MIT
Press.
Kulkarni, V. G. 1995. Modeling and Analysis of Stochastic Systems, London, UK: Chapman
and Hall.
Limon, S., Yadav, O. P. and Liao, H., 2017a. A  literature review on planning and analysis
of accelerated testing for reliability assessment. Quality and Reliability Engineering
International 33(8): 2361–2383.
Limon, S., Yadav, O. P. and Nepal, B., 2017b. Estimation of product lifetime considering
gamma degradation process with multi-stress accelerated test data. IISE Annual
Conference Proceedings, pp. 1387–1392.
Limon, S., Yadav, O. P. and Nepal, B., 2018. Remaining useful life prediction using ADT data
with Inverse Gaussian process model. IISE Annual Conference Proceedings, pp. 1–6.
Lisnianski, A., Frenkel, I. and Ding, Y., 2010. Multi-state System Reliability Analysis and
Optimization for Engineers and Industrial Managers, Berlin, Germany: Springer
Science & Business Media.
Lisnianski, A. and Levitin, G., 2003. Multi-state System Reliability: Assessment,
Optimization, and Applications, Singapore: World scientific.
Liu, Y. W. and Kapur, K. K. C., 2007. Customer’s cumulative experience measures for reli-
ability of non-repairable aging multi-state systems. Quality Technology & Quantitative
Management 4(2): 225–234.
Moghaddass, R. and Zuo, M. J., 2014. An integrated framework for online diagnostic and
prognostic health monitoring using a multistate deterioration process. Reliability
Engineering & System Safety 124: 92–104.
Narendran, N. and Gu, Y., 2005. Life of led-based white light sources. Journal of Display
Technology 1: 167–171.
Nelson, W., 2004. Accelerated Testing: Statistical Models, Test Plans and Data Analysis (2nd
ed.), New York: John Wiley & Sons.
Ng, S. K. and Moses, F., 1998. Bridge deterioration modeling using semi-Markov theory.
A. A. Balkema Uitgevers B. V, Structural Safety and Reliability 1: 113–120.
O’Connor, P. D. D. T. and Kleyner, A., 2012. Practical Reliability Engineering (5th ed.),
Chichester, UK: Wiley.
Park, C. and Padgett, W. J., 2005. Accelerated degradation models for failure based on geo-
metric Brownian motion and gamma processes. Lifetime Data Analysis 11: 511–527.
Park, J. I. and Yum, B. J., 1997. Optimal design of accelerated degradation tests for estimating
mean lifetime at the use condition. Engineering Optimization 28: 199–230.
Ross, S., 1995. Stochastic Processes, New York: Wiley.
Sánchez-Silva, M. and Klutke, G. A., 2016. Reliability and Life-cycle Analysis of
Deteriorating Systems (Vol. 182). Cham, Switzerland: Springer International
Publishing.
Shahraki, A. F. and Yadav, O. P., 2018. Selective maintenance optimization for multi-
state systems operating in dynamic environments. In  2018  Annual Reliability and
Maintainability Symposium (RAMS). IEEE: pp. 1–6.
Shahraki, A. F., Yadav, O. P. and Liao, H., 2017. A  review on degradation modelling and its
engineering applications. International Journal of Performability Engineering 13(3): 299.
Sheu, S. H. and Zhang, Z. G., 2013. An optimal age replacement policy for multi-state
systems. IEEE Transactions on Reliability 62(3): 722–735.

Sheu, S. H., Chang, C. C., Chen, Y. L. and Zhang, Z. G., 2015. Optimal preventive mainte-
nance and repair policies for multi-state systems. Reliability Engineering  & System
Safety, 140, 78–87.
Si, X. S., Wang, W., Hu, C. H. and Zhou, D. H., 2011. Remaining useful life estimation:
A review on the statistical data driven approaches. European Journal of Operational
Research 213(1): 1–14.
Trivedi, K., 2002. Probability and Statistics with Reliability, Queuing and Computer Science
Applications, New York: Wiley.
Wang, X. and Xu, D., 2010. An inverse Gaussian process model for degradation data.
Technometrics 52: 188–197.
Ye, Z. S. and Chen, N., 2014. The  inverse Gaussian process as a degradation model.
Technometrics 56: 302–311.
Ye, Z. S., Wang, Y., Tsui, K. L. and Pecht, M., 2013. Degradation data analysis using Wiener
processes with measurement errors. IEEE Transactions on Reliability 62: 772–780.
4  Building a Semi-automatic Design for Reliability Survey with Semantic Pattern Recognition
Christian Spreafico and Davide Russo

CONTENTS
4.1 Introduction
4.2 Research Methodology and Pool Definition
    4.2.1 Definition of the Electronic Pool
    4.2.2 Definition of the Features of Analysis
        4.2.2.1 Goals
        4.2.2.2 Strategies (FMEA Interventions)
        4.2.2.3 Integrations
4.3 Semi-automatic Analysis
4.4 Results and Discussion
4.5 Conclusions
References

4.1 INTRODUCTION
Almost 70 years after its introduction, Failure Modes and Effects Analysis (FMEA)
has been applied in a large series of cases from different sectors, such as automotive,
electronics, construction and services, and has become a standard procedure in many
companies for quality control and for the design of new products. FMEA also has a
great following in the scientific community, as testified by the vast multitude of related
documents from scientific and patent literature; to date, more than 3,600 papers
in Scopus DB and 146 patents in Espacenet DB come up by just searching for
FMEA without synonyms, with a trend of constant growth over the years.
The majority of those contributions deal with FMEA modifications involving
the procedure and the integrations with new methods and tools to enlarge the field
of application and to improve the efficiency of the analysis, such as by reducing the
required time and by finding more results.
The surveys proposed in the literature, which have been performed according to
different criteria of data gathering and classification, can play a fundamental role
in orienting the reader among the many contributions.


In [1] the authors analyzed scientific papers about the description and review of
basic principles, the types, the improvements, the computer automation codes, the
combination with other techniques, and specific applications of FMEA.
The literature survey in [2] analyzes the FMEA applications for enhancing service
reliability by determining how FMEA is focused on profit and supply chain-oriented
service business practices. The significant contribution consists in comparing what
previously was mentioned about FMEA research opportunities and in observing how
FMEA is related to enhancement in Risk Priority Number (RPN), reprioritization,
versatility of its application in service supply chain framework and non-profit service
sector, as well as in combination with other quality control tools, which are proposed
for further investigations.
In  [3], the authors studied 62  methodologies about risk analysis by separat-
ing them into three different phases (identification, evaluation, and hierarchiza-
tion) and by studying their inputs (plan or diagram, process and reaction, products,
probability and frequency, policy, environment, text, and historical knowledge),
the implemented techniques to analyze risk (qualitative, quantitative, deterministic,
and probabilistic), and their output (management, list, probabilistic, and
hierarchization).
In  [4], the authors analyzed the innovative proposed approaches to overcome
the limitations of the conventional RPN method within 75 FMEA papers published
between 1992 and 2012 by identifying which shortcomings attract the most attention,
which approaches are the most popular, and the inadequacy of approaches.
Other authors focused on analyzing specific applications of the FMEA approach.
In [5] the authors studied how 78 companies in the motor industry in the United Kingdom
apply FMEA by identifying some common difficulties such as time constraints, poor
organizational understanding of the importance of FMEA, inadequate training, and
lack of management commitment.
However, despite the results achieved by these surveys, no overview considers
all the proposals presented, including patents, and analyzes them at a higher level than
“simple” document counting within the cataloging classes and tools used.
To fulfill this aim, a previous survey [6] considerably increased the number
of analyzed documents by also including patents. In addition, the analysis of the
content was improved by carrying out the analysis on two related levels: followed
strategies of intervention (e.g., reduce time of application) and integrated tools
(e.g., fuzzy logic). Although the results achieved are remarkable, the main limita-
tions of this analysis are the onerous amount of time required along with the number
of correlations between different aspects (e.g., problems and solutions, methods and
tools, etc.).
This chapter proposes a semi-automatic semantic analysis of documents
related to FMEA modifications, followed by a manual review that summarizes
each of them through a simple sentence made of a causal chain including the
declaration of the goals, the strategies followed (FMEA modifications), and the
integrations with methods/tools.
This chapter is organized as follows. Section 4.2 presents the research methodology
and the pool definition, Section 4.3 describes the semi-automatic analysis,
Section 4.4 presents and discusses the results, and Section 4.5 draws conclusions.

4.2  RESEARCH METHODOLOGY AND POOL DEFINITION


The first step of this work is the definition of the pool of documents to be analyzed,
which is the same pool of documents proposing FMEA modifications used in [6].
This pool counts 286 documents: 177 scientific papers (165 from academia and
12 from industry) and 109 patents (23 from academia and 86 from industry).
Figure 4.1 shows the time distribution for patents and for scientific publications.
The number of patents is increasing, except for the last period, which does not include
all potential patents because patents are not disclosed for the first 18 months.

4.2.1 Definition of the Electronic Pool


In order to automatically process the collected documents through available tools
for semantic analysis, an XML file was manually created for each document. Each file
was named with a unique ID and compiled according to a rigid structure in which
each part of the original document was inserted within a specific text field (e.g., Title,
Abstract, Introduction, State_of_the_Art, Proposal).
The objective of this classification is to separate the original proposals of each
document, stored within the field Proposal, from the previous ones, reported within the
field State_of_the_Art, so as not to distort the survey with redundant results, and to
make it possible to process the different parts separately to achieve specific
purposes (e.g., keywords investigation). In addition, the ID
allows the content to be traced back to the specific document.
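
The structure of such a record can be sketched with Python's standard `xml.etree.ElementTree` module. The field names come from the list above; the root tag `Document`, the `id` attribute, the sample ID, and the sample texts are hypothetical, since the chapter does not report the exact schema that was used.

```python
import xml.etree.ElementTree as ET

# Text fields named in the chapter; the root tag and the "id" attribute are assumptions.
FIELDS = ["Title", "Abstract", "Introduction", "State_of_the_Art", "Proposal"]

def build_record(doc_id, sections):
    """Create one XML record with a unique ID and one element per text field."""
    root = ET.Element("Document", attrib={"id": doc_id})
    for field in FIELDS:
        child = ET.SubElement(root, field)
        child.text = sections.get(field, "")
    return root

def field_text(xml_string, field):
    """Return (document ID, text of the requested field) for one record."""
    root = ET.fromstring(xml_string)
    return root.get("id"), root.findtext(field, default="")

record = build_record("DOC_001", {
    "Title": "A modified FMEA (placeholder title)",
    "State_of_the_Art": "Traditional FMEA ranks failures with the RPN.",
    "Proposal": "A fuzzy risk index replaces the RPN.",
})
xml_string = ET.tostring(record, encoding="unicode")

# Only the Proposal field is queried, so prior art quoted by the document
# does not pollute the survey; the ID links the result back to its source.
doc_id, proposal = field_text(xml_string, "Proposal")
print(doc_id, "->", proposal)
```

Keeping the prior art and the original proposal in separate fields is what allows a query to be restricted to one of them.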

4.2.2 Definition of the Features of Analysis


An additional preliminary activity deals with the definition of the features to be
analyzed. Since one purpose of the proposed method is to perform a deeper analy-
sis by relating different aspects, the features deliberately consider heterogeneous
aspects (goal, strategies of interventions, and integrations) and they work at different
levels of detail (e.g., goals and sub-goals, methods and tools).
Some features have been hypothesized a priori by considering previous
FMEA surveys, while others iteratively emerged during the analysis.
In the following discussion, the features are presented in detail.

FIGURE 4.1  (a) Time distribution (priority date) of the collected documents and (b) composition of the final set of documents (papers vs. patents and academia vs. industry).

4.2.2.1 Goals
These features deal with the targets that the authors proposing the analyzed
FMEA modifications want to achieve through them. All of them focus on improving
the main aspects related to the applicability of the method (e.g., reducing the required
input, improving the expected output, ameliorating the approach of the involved actors):

• Reduce FMEA time/costs of application by applying the modified FMEA
version to reduce the number of participants (e.g., experts) and the time required
to gather the useful information and perform the analysis ([9], [30], [36], [37],
[47], [52], [56], [64], [72], [78], [80], [90], [99], [132]).
• Reduce production time/costs of the considered product by using FMEA
modifications to find and prevent possible faults during production that can
cause delays or extra costs, without modifying the product design ([35],
[43], [57], [63], [88], [89], [93], [109], [110], [119], [121], [126]).
• Improve design of the product by applying a modified FMEA during the design
process in order to specifically change the design of the product and
make it more robust (i.e., robust design), better able to meet the requirements
or to not dissatisfy them (i.e., product re-design), easier to manufacture
(i.e., design for manufacturing) through a radical change of the product’s
shape and components, or easier to repair (i.e., design for maintenance)
([15], [19], [23], [24], [25], [27], [39], [40], [49], [58], [61], [62], [65],
[69], [70], [76], [79], [87], [92], [94], [96], [100], [103], [104], [107], [114]).
• Analyze complex systems, if the modified version of FMEA has been
specifically improved to manage products with a high number of components
and functionalities ([26], [31], [32], [82], [98], [118], [117], [124], [128]).
• Ameliorate human approach, if the modified version of FMEA is able to
improve the user interface, reduce its tediousness, and better involve the user in
a more pro-active approach ([10], [13], [16], [22], [28], [29], [33], [34], [41], [42],
[46], [48], [50], [51], [53], [54], [55], [59], [60], [66], [68], [71], [73], [74], [77],
[81], [83], [84], [85], [86], [105], [106], [108], [111], [112], [113], [115], [116],
[122], [123], [125], [130], [131]).

4.2.2.2  Strategies (FMEA Interventions)


These features investigate the strategies of intervention on FMEA  structure, or the
parts/steps of the traditional procedure that are modified by the considered documents:

• Improve/automate Bill of Material (BoM) determination to provide criteria
to (1) identify the parts (e.g., sub-assemblies and single components) and
their useful features and attributes and (2) facilitate the management of the
parts and their relations.
• Improve/automate function determination by suggesting modalities to
identify and describe product requirements, functions and sub-functions,
and associate them to the related parts.
• Improve/automate failure determination to increase the number of consid-
ered failure modes, effects and causes, identify their relations, and improve
their representation by introducing supporting models.

• Improve/automate Risk Analysis by overcoming the main limitations
of traditional indexes by providing explanations about their uses or new
complementary or alternative methods ([14], [18], [20], [21], [44], [45], [55],
[75], [91], [95], [120]).
• Improve/automate problem solving by improving the decision making and
solving phase.

4.2.2.3 Integrations
The following kinds of integrations have been collected:

• Templates (e.g., tables and matrices) to organize and manage the bill of
material, the list of functions and faults, and the related risk.
• Database (DB) containing information about product parts, functions,
historical failures, risk, and the related economic quantifications. They are
used to automatically or manually gather the content for the analysis.
• Tools for fault analysis (Fault A.) including Fault Tree Analysis (FTA),
Fishbone diagram and Root Cause Analysis (RCA) ([17], [38]).
• Interactive graphical interfaces or software that directly involve user inter-
actions through graphical elements and representations (e.g., plant schemes
and infographics) for data entry and visualization.
• Artificial Intelligence (AI) based tools involving Semantic Recognition and
Bayesian Networks ([12], [67], [102], [125], [127], [129], [133]).

Other considered integrations are function analysis (FA), fuzzy logic, Monte Carlo
method, quality function deployment (QFD), hazard and operability study (HAZOP),
ontologies, theory of inventive problem solving (TRIZ), guidelines, automatic mea-
surements (AM) methods, brainstorming techniques, and cognitive maps (C Map).

4.3  SEMI-AUTOMATIC ANALYSIS


At this point, the defined features have been semi-automatically investigated within
the collected pool using software for semantic analysis. The first step of the
procedure deals with the manual translation of each considered feature into one or more
search queries consisting of single keywords (e.g., noun, verb, adjective).
For each keyword, the software provides its main linguistic relations with other
terms found within the specific sentences of the documents through semantic analysis.
The kinds of relations differ depending on the linguistic nature of the keyword used. If a substantive (e.g., FMEA) is used, then the following can be identified: the modifiers, i.e., adjectives or substantives acting as adjectives (e.g., traditional FMEA, fuzzy FMEA, cost-based FMEA); nouns and verbs modified by the keyword (e.g., FMEA table, FMEA sheet); verbs with the keyword used as object (e.g., executing FMEA, evaluate FMEA); verbs with the keyword used as subject (e.g., FMEA is …, FMEA generates …); substantives linked to the keyword through AND/OR relations (e.g., FMEA and QFD, FMEA and risk); and prepositional phrases (e.g., … of FMEA, … through FMEA). When a verb is used as keyword, the following can be identified: its modifiers (e.g., effectively improve), the objects (e.g., improve quality, improve design), the subjects (e.g., QFD improves), and other particles used before or after the verb (e.g., improve and evaluate).

TABLE 4.1
Keywords Used to Explain the Features Through the Queries
Generic terms (nouns) | Generic terms (verbs) | FMEA terms | Methods/tools
FMEA, Human, Approach, Design, Production, Maintenance, Time, Costs, Problem | Improve, Anticipate, Ameliorate, Automatize, Analyze, Reduce, Eliminate, Solve | Failures, Modes, Effects, Cause, Risk, Solving, Decision making | Fuzzy, TRIZ, Database, Artificial Intelligence, QFD, Function Analysis, etc.
In this way, by using the restricted number of keywords, reported in Table 4.1, all
the features can be easily investigated.
Thus, the translation of a generic feature (e.g., ameliorate human approach)
depends on the manual formulation of a keyword (e.g., ameliorate), the automatic
processing, and the manual search for the most suitable relations to express the
feature itself (e.g., ameliorate + human approach).
However, since the features can be expressed in a variety of ways, increasing
the number of alternative keywords also increases the number of pertinent documents
identified (recall). This is achieved by expanding the synonyms
(e.g., improve in addition to ameliorate) and by searching for the alternative forms that
can be used to express the feature (e.g., Reduce Tediousness and Reduce Subjectivity
for Improve Human Approach).
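
The effect of keyword expansion on recall can be illustrated with a small Python sketch. Here, plain co-occurrence of a verb and an object within a sentence stands in for the syntactic relations extracted by the semantic-analysis software, which is not named in the chapter; the query structure, the sample text, and the function names are hypothetical.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def matching_sentences(text, verbs, objects):
    """Return the sentences that contain one of the verbs and one of the objects."""
    hits = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        # \w* lets "improve" also match inflected forms such as "improves".
        has_verb = any(re.search(r"\b" + re.escape(v) + r"\w*", lowered) for v in verbs)
        has_object = any(o in lowered for o in objects)
        if has_verb and has_object:
            hits.append(sentence)
    return hits

# Hypothetical Proposal text of one document.
text = (
    "The method reduces the tediousness of the analysis. "
    "Costs are not discussed. "
    "The tool improves the human approach to FMEA."
)

# Feature "Ameliorate Human Approach" expressed with a single keyword pair ...
narrow_hits = matching_sentences(text, ["ameliorate"], ["human approach"])
# ... and after expanding synonyms and alternative forms of the feature.
expanded_hits = matching_sentences(
    text,
    ["ameliorate", "improve", "reduce"],
    ["human approach", "tediousness", "subjectivity"],
)
print(len(narrow_hits), len(expanded_hits))
```

In this toy text the single keyword pair retrieves no sentence, while the expanded query retrieves two, which then still have to be checked manually for pertinence.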
The search for specific terms, such as the names of the integrated tools (e.g., fuzzy,
TRIZ, QFD), can instead be carried out according to different strategies: (1) including
them within the keywords, (2) using verbs (e.g., introduce, integrate) and searching
for the tools among the objects (e.g., Introduce fuzzy logic), (3) using the modifiers
of FMEA (e.g., fuzzy FMEA), and (4) searching for the relations between FMEA and
linguistic particles (e.g., FMEA and TRIZ).
Then, for each interesting relation identified, the software provides the list of the
related sentences for each document, which are manually checked in order to evaluate
their adherence to the investigated feature.
At this point, each selected sentence is summarized through a triad consisting of
subject + verb + object.
Table 4.2 shows the steps followed to define the triads for the paper in [7].
All the identified triads are then collected within a table (as shown in Table 4.3), where the data for each document (row) are organized according to the features (columns). Each cell reports the subject of a triad (e.g., The improved failure modes) related to a given feature, which has been redefined by using the verb and the object of the triad (e.g., ameliorates human approach).

TABLE 4.2
Example of the Strategy Used to Build the Triads (considered document: [7])
Investigated Feature | Used Keyword | Syntactic Parser | Related Sentence | Triad (Subject + Verb + Object)
Ameliorate Human Approach | Improve | Improve + Human Approach | The objective of this paper is to propose a new approach for simplifying FMEA by determining the failures in a more practical way by better involving the problem solver in a more pro-active and creative approach | The improved Failure Modes Determination ameliorates human approach
Improve Failure Modes determination | Improve | Improve + Failure Modes | Perturbed Functional Analysis is proposed in order to improve the capability of determine Failure Modes | Perturbed Function Analysis improves Failure Modes determination
Introduce TRIZ | TRIZ | TRIZ + Perturbed Function Analysis (Modifier) | Specifically, an inedited version of TRIZ function analysis, called “Perturbed Function Analysis” is proposed | The authors propose the Perturbed Function Analysis
Source: Spreafico, C. and Russo, D., Can TRIZ functional analysis improve FMEA? Advances in Systematic Creativity Creating and Managing Innovations, Palgrave Macmillan, Cham, Switzerland, pp. 87–100, 2019.

TABLE 4.3
An Extract from the Table of Comparison of the Documents and the Triads
Document | Goal: Ameliorate Human Approach | … | Strategy: Improve Failures Determination | … | Methods/Tools: Introduce Perturbed Function Analysis | …
[7] | The improved failure modes | … | Perturbed Function Analysis | … | The authors | …
… | … | … | … | … | … | …

FIGURE 4.2  Example of a causal chain constituted by goal, strategy, and method/tool. [Diagram: the document Spreafico and Russo (2019) linked to three nodes, Ameliorate Human Approach (goal, node N+1), Improve Failures Determination (strategy, node N), and Perturbed Function Analysis (method/tool, node N−1), connected by WHY? and HOW? relations.]

Therefore, the identified subjects are used as links to build the causal chains,
starting from the last ones, which relate to the integrations with methods and tools.
For example, the causal chain resulting from the previous example (Table 4.3) is: the
authors introduce the Perturbed Function Analysis (METHOD/TOOL) IN ORDER
TO Improve the failure identification (STRATEGY) IN ORDER TO Ameliorate
Human Approach (GOAL).
When the causal chain is read in this manner, its underlying logic is the following:
each node explains the existence of the previous one (WHY?) and
represents a way to obtain the next one (HOW?).
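
The construction of such a chain from the triads can be sketched as a small data structure. The triads below are those reported for the document in [7]; the `Triad` class, the level ordering, and the helper functions `build_chain`, `as_sentence`, `why`, and `how` are illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass

# Position of each level when the chain is read from method/tool to goal.
LEVEL_ORDER = {"METHOD/TOOL": 0, "STRATEGY": 1, "GOAL": 2}

@dataclass(frozen=True)
class Triad:
    feature: str   # investigated feature (column of the comparison table)
    level: str     # "METHOD/TOOL", "STRATEGY" or "GOAL"
    subject: str
    verb: str
    obj: str

def build_chain(triads):
    """Order the triads from the method/tool up to the goal."""
    return sorted(triads, key=lambda t: LEVEL_ORDER[t.level])

def as_sentence(chain):
    """Join the nodes of the chain with IN ORDER TO."""
    return " IN ORDER TO ".join(f"{t.feature} ({t.level})" for t in chain)

def why(chain, index):
    """Motivation of a node: the next node toward the goal, if any."""
    return chain[index + 1] if index + 1 < len(chain) else None

def how(chain, index):
    """Way to realize a node: the previous node toward the method/tool, if any."""
    return chain[index - 1] if index > 0 else None

triads = [
    Triad("Ameliorate Human Approach", "GOAL",
          "The improved Failure Modes Determination", "ameliorates", "human approach"),
    Triad("Improve Failures Determination", "STRATEGY",
          "Perturbed Function Analysis", "improves", "Failure Modes determination"),
    Triad("Introduce Perturbed Function Analysis", "METHOD/TOOL",
          "The authors", "propose", "the Perturbed Function Analysis"),
]

chain = build_chain(triads)
print(as_sentence(chain))
print("WHY the method?", why(chain, 0).feature)
print("HOW the goal?", how(chain, 2).feature)
```

Note how the subject of each triad names the node one step closer to the method/tool, which is what allows the triads to be linked into a chain.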
Figure 4.2 shows this example, which is the simplest causal chain that can be built:
it consists of only three nodes arranged in sequence, namely one goal
(i.e., Ameliorate Human Approach), one strategy (i.e., Improve Failure Determination),
and one integration with methods or tools (i.e., the Perturbed Function Analysis).
However, the structure of the causal chain can be more complex because the number
of nodes can increase and their arrangement can change from series to
parallel or to a mix of both.
In the first case (nodes in series), each intermediate node is preceded (on the left)
by another node expressing its motivation (WHY? relation) and followed by another
representing a way to realize it (HOW? relation). Several goals can be connected in
the same way through their hierarchy: e.g., the goal “reduce the number of experts”
can be preceded by the more generic goal “reduce FMEA costs.” The same reasoning is
valid for the strategies and the integrations with methods/tools. In particular, we
stratified the latter into four hierarchical levels: (1) theories and logics (e.g.,
fuzzy logic), (2) methods (e.g., TRIZ), (3) tools, which can be included in the
methods (e.g., FA is part of TRIZ), and (4) knowledge sources (e.g., costs DB).

[Figure 4.3: causal chain for patent CN202887188. Goals: Reduce FMEA time/costs; Analyze
complex systems. Strategies: Automate failures determination; Automate risk analysis.
Methods/tools: Fuzzy logic, fed by a Failures DB and a Risks DB.]

FIGURE 4.3  Example of a complex causal chain obtained from the patent. (From Ming, X.
et al., System capable of achieving failure mode and effects analysis (FMEA) data
multi-dimension processing, CN202887188, filed June 4, 2012, and issued April 17, 2013.
Representation is courtesy of the authors.)

In the second case (nodes in parallel), two or more nodes can concurrently provide
a motivation for a previous node or offer alternative ways to realize the subsequent
node.
As an example of a more complex causal chain, consider the Chinese patent [8].
Table 4.4 shows an extract from the table of comparison for this document: as can be
seen, the relations between the included subjects and the features are more complex
and interlaced than in the example shown in Table 4.3.
Figure 4.3 represents the causal chain obtained for this document. In this case, the
two nodes Reduce FMEA Time/Costs and Analyze Complex Systems represent the two
main independent goals pursued by this contribution. The two nodes Automate
Failure Determination and Automate Risk Analysis are the two strategies followed
both to Reduce FMEA Time/Costs and to Analyze Complex Systems. Finally, the
node Fuzzy Logic represents a high-level integration that realizes the two strategies,
while a failure DB and a risk DB provide the knowledge for the fuzzy logic-based
reasoning in two different ways: the first is used to Automate Failure Determination
(through fuzzy logic) and the second to Automate Risk Analysis (through fuzzy logic).
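As a rough sketch of how the row of Table 4.4 for document [8] maps onto the chain of Figure 4.3, the row can be stored as a mapping from each feature to the subjects of its triads. The labels are normalized by hand (e.g., the feature "Introduce Fuzzy Logic" becomes the node "Fuzzy Logic"), and the function names `how_edges`, `why_edges`, and `triads` are assumptions made for this illustration, not the authors' implementation.

```python
from collections import defaultdict

AUTHORS = "The authors"  # terminal subject in the table, not a node of the chain

# Row of Table 4.4 for document [8]: feature -> subjects of the triads (labels normalized).
row = {
    "Reduce FMEA Time/Costs": ["Automate Failure Determination", "Automate Risk Analysis"],
    "Analyze Complex Systems": ["Automate Failure Determination", "Automate Risk Analysis"],
    "Automate Failure Determination": ["Fuzzy Logic"],
    "Automate Risk Analysis": ["Fuzzy Logic"],
    "Fuzzy Logic": ["Failure DB", "Risk DB"],
    "Failure DB": [AUTHORS],
    "Risk DB": [AUTHORS],
}

# Part of the causal chain each feature belongs to.
part = {
    "Reduce FMEA Time/Costs": "GOAL",
    "Analyze Complex Systems": "GOAL",
    "Automate Failure Determination": "STRATEGY",
    "Automate Risk Analysis": "STRATEGY",
    "Fuzzy Logic": "METHOD/TOOL",
    "Failure DB": "METHOD/TOOL",
    "Risk DB": "METHOD/TOOL",
}


def how_edges(table_row):
    """HOW? relation: each feature points to the subjects that realize it."""
    return {
        feature: [s for s in subjects if s != AUTHORS]
        for feature, subjects in table_row.items()
    }


def why_edges(how):
    """WHY? relation: the inverse of the HOW? relation."""
    why = defaultdict(list)
    for node, children in how.items():
        for child in children:
            why[child].append(node)
    return dict(why)


def triads(how, parts):
    """Enumerate (goal, strategy, integration) triples along the HOW? edges."""
    return [
        (goal, strategy, integration)
        for goal in how if parts[goal] == "GOAL"
        for strategy in how[goal]
        for integration in how[strategy]
    ]


how = how_edges(row)
why = why_edges(how)
for triad in triads(how, part):
    print(" <- ".join(triad))
```

Nodes with more than one entry in `how` or `why` are the ones arranged in parallel. This plain graph does not record that the failure DB serves only failure determination and the risk DB only risk analysis; capturing that would require labeling the edges or storing whole paths.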

4.4  RESULTS AND DISCUSSION


The proposed methodology has been tested in two distinct phases. During
phase 1 (automatic semantic analysis), all the documents in the selected pool were
processed, because the number of linguistic synonyms and relations found by the
tool's semantic parsing algorithm depends strongly on the number of sentences
analyzed. During phase 2 (manual review and causal chain building), only a
restricted set of documents was considered, so that the methodology could be tested
within a limited time despite the burden of the manual operations required.
To obtain a significant sample, the documents were selected based on type (papers
or patents), date of publication, kind of source (for papers: journal or
proceedings), and nationality (for patents). The resulting sample counts 127
documents, consisting of 80 papers and 47 patents.
After the sample was processed, the features were investigated, and the documents
were classified, one causal chain was built for each document. A chain usually
consists of more than four nodes, including at least one for each part (goal, strategy,
and integration). The total number of causal chains equals the number of analyzed
documents (127), since their correspondence is one-to-one: for each document there
is only one causal chain and vice versa.

TABLE 4.4
An Extract from the Table of Comparison of the Documents and the Triads: Row for Document [8]

Part            Feature                           Subjects of the triads
Goal            Reduce FMEA Time/Costs            Automate Failure Determination; Automate Risk Analysis
Goal            Analyze Complex Systems           Automate Failure Determination; Automate Risk Analysis
Strategy        Automate Failure Determination    Fuzzy logic
Strategy        Automate Risk Analysis            Fuzzy logic
Methods/Tools   Introduce Fuzzy Logic             Failure DB; Risk DB
Methods/Tools   Introduce Failure DB              The authors
Methods/Tools   Introduce Risk DB                 The authors

Source: Ming, X. et al., System capable of achieving failure mode and effects analysis (FMEA) data
multi-dimension processing, CN202887188, filed June 4, 2012, and issued April 17, 2013.
In general, the most pursued goals are Improve Design and Improve Human
Approach, which together appear in 61 percent of the triads, while the most
considered strategies relate to failure determination (automate and improve),
followed by Automate Risk Analysis.
Among the integrations with methods and tools, fuzzy logic and databases are
the most common, with 37 and 28 occurrences within the causal chains, respectively,
followed by the interface with 23 occurrences.
More detailed considerations are possible by analyzing the relations between goals
and strategies. In fact, the two most common strategies are used differently:
those for failure determination are implemented to realize all the goals, while those
for Improving Risk Analysis are considered especially to Improve Human Approach
but practically ignored for other purposes (i.e., Improve Design and Analyze
Complex Systems).
Other considerations can be made by comparing the couplings between multiple
goals, strategies, and tools.
Among the combinations of goals, the most frequent are Improve Design with
Improve Human Approach (8 occurrences), Improve Design with Analyze Complex
Systems (7 occurrences), and Improve Human Approach with Reduce Production
Time/Costs (7 occurrences).
Among the combinations of strategies, the most frequent are Automate Failure
Determination with Automate Risk Analysis (12 occurrences) and Automate Failure
Determination with Improve Risk Analysis (7 occurrences).
Finally, the analysis of the multiple integrations revealed that the most common
coupling is between fuzzy logic and DBs, with 6 occurrences.
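The pair counts reported above can be obtained by tallying, for each document, every unordered pair of labels found in its causal chain. The sketch below shows one way to do this; the function name `pair_counts` and the four toy documents are hypothetical and do not reproduce the chapter's dataset or its figures.

```python
from collections import Counter
from itertools import combinations


def pair_counts(labels_by_doc):
    """Count in how many documents each unordered pair of labels occurs together."""
    counts = Counter()
    for labels in labels_by_doc.values():
        # Sort and deduplicate so that (A, B) and (B, A) fall into the same bucket.
        for pair in combinations(sorted(set(labels)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy data: goals extracted from the causal chains of four documents.
goals_by_doc = {
    "doc_a": ["Improve Design", "Improve Human Approach"],
    "doc_b": ["Improve Design", "Analyze Complex Systems"],
    "doc_c": ["Improve Human Approach", "Improve Design"],
    "doc_d": ["Improve Design"],
}

counts = pair_counts(goals_by_doc)
for pair, n in counts.most_common():
    print(pair, n)
```

The same function applies unchanged to strategies or to integrations by passing a different mapping.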
A deeper analysis can be done by considering the causal chains. Among the
different possibilities, the most significant is the comparison of the common
triads, i.e., the combinations of three nodes: goal, strategy, and integration. This
gives a synthetic but sufficiently significant indication of how the authors are
working to improve FMEA.
Figure 4.4 shows the tree map of the common triads: the five main areas are the
goals; their internal subdivisions (colored) represent the strategies, which in turn
are divided among the integrations, where the document indexes are reported (please
refer to the legend).
For example, the graph shows that the three documents [11,97,101] propose
modified versions of FMEA based on the same common triad: they aim to Improve
the Design phase by improving the determination of the failures through the
introduction of databases (DB). Other goals, strategies, or integrations
differentiate the three contributions.
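The nesting behind such a tree map (goal, then strategy, then integration, then document indexes) can be sketched as a three-level dictionary. The function name `group_by_triad` is an assumption made here; the triad shared by documents [11,97,101] comes from the text, while the entry `doc_x` is a hypothetical document added only to show a second branch.

```python
def group_by_triad(doc_triads):
    """Nest document indexes under goal -> strategy -> integration."""
    tree = {}
    for doc, triads in doc_triads.items():
        for goal, strategy, integration in triads:
            leaf = (
                tree.setdefault(goal, {})
                .setdefault(strategy, {})
                .setdefault(integration, [])
            )
            leaf.append(doc)
    return tree


shared = ("Improve Design", "Improve Failure Determination", "DB")
doc_triads = {
    "11": [shared],
    "97": [shared],
    "101": [shared],
    # Hypothetical document, not taken from the chapter's sample.
    "doc_x": [("Improve Human Approach", "Improve Risk Analysis", "Fuzzy Logic")],
}

tree = group_by_triad(doc_triads)
for goal, strategies in tree.items():
    for strategy, integrations in strategies.items():
        for integration, docs in integrations.items():
            print(f"{goal} / {strategy} / {integration}: {docs}")
```

The number of documents in each leaf gives the area of the corresponding cell in a tree map.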
Analysis of the common triads shows that the most widespread ones involve the
goal Improve Human Approach: Improve Human Approach / Improve Risk
Analysis / Fuzzy (8 documents), Improve Human Approach / Improve Function
Determination / Interface (5 documents), and Improve Human Approach / Improve
BoM Determination / Interface (5 documents).

FIGURE 4.4  Main solutions proposed in papers and patents to improve FMEA, represented
through triads (goal, strategy, and method/tool).
By considering the triads, some more interesting observations can be made about
the integrations. In general, their distribution is quite heterogeneous in relation
to strategies and goals. In fact, fuzzy logic has almost always been introduced to
Improve and Automate Risk Analysis in pursuit of all the goals, while it has been
used to improve or automate Failure Determination only in order to Improve
Design and not for other purposes.
Another case is represented by the interfaces, which have been introduced to
improve almost all the strategies and goals.
Other integrations instead are related almost exclusively to the same strategy for
achieving each goal. This is the case of the databases, used to Automate Risk Analysis
and secondarily to Automate Failures Determination, and of the guidelines, which
generally are used to Automate or to Improve Risk Analysis.

4.5 CONCLUSIONS
In this chapter, a method for performing semi-automatic semantic analysis of
FMEA documents has been presented and applied to a pool of 127 documents,
consisting of papers and patents, selected from international journals, conference
proceedings, and international patents.
As a result, each document has been summarized through a specific causal chain
including the goals it considers (i.e., Improve Design, Improve Human Approach,
Reduce FMEA Time/Costs, Reduce Production Time/Costs, Analyze Complex
Systems), its strategies of intervention (Improve/Automate BoM, Function, and
Failures Determination, Risk Analysis, and Problem Solving), and the integrated
methods, tools, and knowledge sources.
The main output of this work is summarized in an infographic in the style of a
treemap diagram, which compares all the considered documents on the basis of the
common elements in their causal chains and highlights the most popular directions
of intervention at different levels of detail (i.e., strategies, methods, and tools)
in relation to the objective pursued.
Compared to previous surveys, the elements of novelty are the considerable
reduction in the time required, the number of sources analyzed, and the depth of
the analysis, i.e., the ability to determine the relationships between the different
parameters of the analysis within the causal chain. These could positively impact
scientific research in the sector.
The main limitations of the approach lie in the complexity of the manual
operations required to define the electronic pool and to create part of the relations
within the causal chains, which will be partly addressed in future developments by
automating the method.

REFERENCES
1. Bouti, A., and Kadi, D. A. 1994. A  state-of-the-art review of FMEA/FMECA.
International Journal of Reliability Quality and Safety Engineering 1(04): 515–543.
2. Sutrisno, A., and Lee, T. J. 2011. Service reliability assessment using failure mode and
effect analysis (FMEA): Survey and opportunity roadmap. International Journal of
Engineering Science and Technology 3(7): 25–38.
3. Tixier, J., Dusserre, G., Salvi, O., and Gaston, D. 2002. Review of 62 risk analysis meth-
odologies of industrial plants. Journal of Loss Prevention in the Process Industries
15(4): 291–303.
4. Liu, H. C., Liu, L., and Liu, N. 2013. Risk evaluation approaches in failure mode and
effects analysis: A literature review. Expert Systems with Applications 40(2): 828–838.
5. Dale, B. G., and Shaw, P. 1990. Failure mode and effects analysis in the UK motor
industry: A state‐of‐the‐art study. Quality and Reliability Engineering International
6(3): 179–188.
6. Spreafico, C., Russo, D., and Rizzi, C. 2017. A state-of-the-art review of FMEA/FMECA
including patents. Computer Science Review 25: 19–28.
7. Spreafico, C., and Russo, D. 2019. Case: Can TRIZ functional analysis improve FMEA?
In Advances in Systematic Creativity, pp. 87–100. Palgrave Macmillan, Cham.
8. Ming, X., Zhu, B., Liang, Q., Wu, Z., Song, W., Xia, R., and Kong, F. 2013. System
capable of achieving failure mode and effects analysis (FMEA) data multi-dimension
processing. CN202887188, filed June 4, 2012, and issued April 17, 2013.
9. Ahmadi, M., Behzadian, K., Ardeshir, A., and Kapelan, Z. 2017. Comprehensive risk
management using fuzzy FMEA  and MCDA  techniques in highway construction
projects. Journal of Civil Engineering and Management 23(2): 300–310.
10. Almannai, B., Greenough, R., and Kay, J. 2008. A decision support tool based on QFD
and FMEA for the selection of manufacturing automation technologies. Robotics and
Computer-Integrated Manufacturing 24(4): 501–507.
11. Arcidiacono, G., and Campatelli, G. 2004. Reliability improvement of a diesel engine
using the FMETA approach. Quality and Reliability Engineering International 20(2):
143–154.
12. Augustine, M., Yadav, O. P., Jain, R., and Rathore, A. 2009. Modeling physical systems
for failure analysis with rate cognitive maps. Industrial Engineering and Engineering
Management. IEEM 2009 IEEE International Conference 1758–1762.
13. Lai, J., Zhang, H., and Huang, B. 2011. The object-FMA based test case generation
approach for GUI software exception testing. Proceedings of 2011 9th International
Conference on Reliability, Maintainability and Safety 717–723.
14. Banghart, M., and Fuller, K. 2014. Utilizing confidence bounds in Failure Mode Effects
Analysis (FMEA) hazard risk assessment. Aerospace Conference, 2014 IEEE 1–6.
15. Bertelli, C. R., and Loureiro, G. 2015. Quality problems in complex systems even con-
sidering the application of quality initiatives during product development. ISPE CE
40–51.
16. Bevilacqua, M., Braglia, M., and Gabbrielli, R. 2000. Monte Carlo simulation
approach for a modified FMECA in a power plant. Quality and Reliability Engineering
International 16(4): 313–324.
17. Bluvband, Z., Polak, R., and Grabov, P. 2005. Bouncing failure analysis (BFA):
The  unified FTA-FMEA  methodology. Reliability and Maintainability Symposium
Proceedings Annual 463–467.
18. Bowles, J. B., and Peláez, C. E. 1995. Fuzzy logic prioritization of failures in a system
failure mode, effects and criticality analysis. Reliability Engineering & System Safety
50(2): 203–213.

19. Braglia, M., Fantoni, G., and Frosolini, M. 2007. The house of reliability. International
Journal of Quality & Reliability Management 24(4): 420–440.
20. Braglia, M., Frosolini, M., and Montanari, R. 2003. Fuzzy TOPSIS approach for
failure mode, effects and criticality analysis. Quality and Reliability Engineering
International 19(5): 425–443.
21. Doskocil, D. C., and Offt, A. M. 1993. Method for fault diagnosis by assessment
of confidence measure. CA2077772, filed September 9, 1992, and issued April 25,
1993.
22. Draber, S. 2000. Method for determining the reliability of technical systems.
CA2300546, filed March 7, 2000, and issued September 8, 2000.
23. Chang, K. H., and Wen, T. C. 2010. A novel efficient approach for DFMEA combining
2–tuple and the OWA operator. Expert Systems with Applications 37(3): 2362–2370.
24. Chen, L. H., and Ko, W. C. 2009. Fuzzy linear programming models for new product
design using QFD with FMEA. Applied Mathematical Modelling 33(2): 633–647.
25. Chin, K. S., Chan, A., and Yang, J. B. 2008. Development of a fuzzy FMEA based prod-
uct design system. The International Journal of Advanced Manufacturing Technology
36(7–8): 633–649.
26. Zhang, L., Liang, W., and Hu, J. 2011. Modeling method of early warning model of
mixed failures and early warning model of mixed failures. CN102262690, filed June 7,
2011, and issued November 30, 2011.
27. Pan, L., Chin, X., Liu, X., Wang, W., Chen, C., Luo, J., Peng, X. et al., 2012. Intelligent
integrated fault diagnosis method and device in industrial production process.
CN102637019, filed February 10, 2011, and issued August 15, 2012.
28. Ming, X., Zhu, B., Liang, Q., Wu, Z., Song, W., Xia, R., and Kong, F. 2012. System
for implementing multidimensional processing on failure mode and effect analysis
(FMEA) data, and processing method of system. CN102810112, filed June 4, 2012, and
issued December 5, 2012.
29. Li, G., Zhang, J., and Cui, C. 2012. FMEA (Failure Mode and Effects Analysis) pro-
cess auxiliary and information management method based on template model and text
matching. CN102831152, filed June 28, 2012, and issued December 19, 2012.
30. Li, R., Xu, P., and Xu, Y. 2012. Accidence safety analysis method for nuclear fuel repro-
cessing plant. CN102841600, filed August 24, 2012, and issued December 26, 2012.
31. Jia, Y., Shen, G., Jia, Z., Zhang, Y., Wang, Z., and Chen, B. 2013. Reliability com-
prehensive design method of three kinds of functional parts. CN103020378, filed
December 26, 2012, and issued April 3, 2013.
32. Chen, Y., Zhang, X., Gao, L., and Kang, R. 2014. Newly-developed aviation electronic
product hardware comprehensive FMECA method. CN103760886, filed December 2,
2013, and issued April 30, 2014.
33. Liu, Y., Deng, Z., Liu, S., Chen, X., Pang, B., Zhou, N., and Chen, Y. 2014. Method
for evaluating risk of simulation system based on fuzzy FMEA. CN103902845, filed
April 25, 2014, and issued July 2, 2014.
34. He, C., Zhao, H., Liu, X., Zong, Z., Li, L., Jiang, J., and Zhu, J. 2014. Data mining-based
hardware circuit FMEA (Failure Mode and Effects Analysis) method. CN104198912,
filed July 24, 2014, and issued December 10, 2014.
35. Xu, H., Wang, Z., Ren, Y., Yang, D., and Liu, L. 2015. Failure knowledge storage and
push method for FMEA  (failure mode and effects analysis) process. CN104361026,
filed October 22, 2014, and issued February 18, 2015.
36. Tang, Y., Sun, Q., and Lü, Z. 2015. Failure diagnosis modeling method based on design-
ing data analysis. CN104504248, filed December 5, 2014, and issued April 8, 2015.
37. David, P., Idasiak, V., and Kratz, F. 2010. Reliability study of complex physical systems
using SysML. Reliability Engineering & System Safety 95(4): 431–450.

38. Demichela, M., Piccinini, N., Ciarambino, I., and Contini, S. 2004. How to avoid the
generation of logic loops in the construction of fault trees. Reliability Engineering &
System Safety 84(2): 197–207.
39. Deshpande, V. S., and Modak, J. P. 2002. Application of RCM to a medium scale
industry. Reliability Engineering & System Safety 77(1): 31–43.
40. Doble, M. 2005. Six Sigma and chemical process safety. International Journal of Six
Sigma and Competitive Advantage 1(2): 229–244.
41. Van Bossuyt, D., Hoyle, C., Tumer, I. Y., and Dong, A. 2012. Risk attitudes in risk-
based design: Considering risk attitude using utility theory in risk-based design. AI
EDAM 26(4): 393–406.
42. Ebrahimipour, V., Rezaie, K., and Shokravi, S. 2010. An ontology approach to support
FMEA studies. Expert Systems with Applications 37(1): 671–677.
43. Draber, C. D. 2000. Method for determining the reliability of technical systems.
EP1035454, filed March 8, 1999, and issued September 8, 2000.
44. Eubanks, C. F., Kmenta, S., and Ishii, K. 1996. System behavior modeling as a basis
for advanced failure modes and effects analysis. ASME Computers in Engineering
Conference, Irvine, CA, pp. 1–8.
45. Eubanks, C. F., Kmenta, S., and Ishii, K. 1997. Advanced failure modes and effects
analysis using behavior modeling. ASME Design Engineering Technical Conferences,
Sacramento, CA, pp. 14–17.
46. Gandhi, O. P., and Agrawal, V. P. 1992. FMEA—A  diagraph and matrix approach.
Reliability Engineering & System Safety 35(2): 147–158.
47. Hartini, S., Nugroho, W. P., and Subekti, K. R. 2010. Design of Equipment Rack with
TRIZ Method to Reduce Searching Time in Change Over Activity (Case Study: PT.
Jans2en Indonesia). Proceedings of the Apchi Ergo Future.
48. Hassan, A., Siadat, A., Dantan, J. Y., and Martin, P. 2010. Conceptual process plan-
ning–an improvement approach using QFD, FMEA, and ABC methods. Robotics and
Computer-Integrated Manufacturing 26(4): 392–401.
49. Hu, C. M., Lin, C. A., Chang, C. H., Cheng, Y. J., and Tseng, P. Y. 2014. Integration with
QFDs, TRIZ and FMEA for control valve design. Advanced Materials Research Trans
Tech Publications 1021: 167–180.
50. Jenab, K., Khoury, S., and Rodriguez, S. 2015. Effective FMEA  analysis or not.
Strategic Management Quarterly 3(2): 25–36.
51. Jong, C. H., Tay, K. M., and Lim, C. P. 2013. Application of the fuzzy failure mode and
effect analysis methodology to edible bird nest processing. Computers and Electronics
in Agriculture 96: 90–108.
52. Koizumi, A., Shimokawa, K., and Isaki, Y. 2003. FMEA system. JP2003036278, filed
July 25, 2001, and issued February 7, 2003.
53. Wada, T., Miyamoto, Y., Murakami, S., Sugaya, A., Ozaki, Y., Sawai, T., Matsumoto, S.
et al., 2003. Diagnosis rule structuring method based on failure mode analysis, diagnosis
rule creating program, and failure diagnosis device. JP2003228485, filed February 6,
2002, and issued August 15, 2003.
54. Yatake, H., Konishi, H., and Onishi, T. 2009. FMEA sheet creation support system and
creation support program. JP2011008355, filed June 23, 2009.
55. Suzuki, K., Hayata, A., and Yoshioka, M. 2009. Reliability analysis device and method.
JP2011113217, issued November 25, 2009.
56. Kawai, M., Hirai, K., and Aryoshi, T. 1990. FMEA simulation method for analyzing
circuit. JPH0216471, filed July 4, 1988, and issued January 19, 1990.
57. Sonoda, Y., and Kageyama, T. 1992. Plant diagnostic device. JPH086635, filed May 9,
1990, and issued January 23, 1992.

58. Kim, J. H., Kim, I. S., Lee, H. W., and Park, B. O. 2012. A Study on the Role of TRIZ
in DFSS. SAE International Journal of Passenger Cars-Mechanical Systems 5(2012–
01–0068): 22–29.
59. Kimura, F., Hata, T., and Kobayashi, N. 2002. Reliability-centered maintenance plan-
ning based on computer-aided FMEA. Proceeding of the 35th CIRP-International
Seminar on Manufacturing Systems 506–511.
60. Kmenta, S., and Ishii, K. 2000. Scenario-based FMEA: A life cycle cost perspective.
Proceedings of ASME Design Engineering Technical Conference, Baltimore, MD.
61. Kmenta, S., and Ishii, K. 2004. Scenario-based failure modes and effects analysis using
expected cost. Journal of Mechanical Design 126(6): 1027–1035.
62. Kmenta, S., and Ishii, K. 1998. Advanced FMEA using meta behavior modeling for
concurrent design of products and controls. Proceedings of the 1998 ASME Design
Engineering Technical Conferences.
63. Kmenta, S., Cheldelin, B., and Ishii, K. 2003. Assembly FMEA: A simplified method
for identifying assembly errors. ASME 2003 International Mechanical Engineering
Congress and Exposition 315–323.
64. Lee, M. S., and Lee, S. H. 2013. Real-time collaborated enterprise asset management
system based on condition-based maintenance and method thereof. KR20130065800,
filed November 30, 2011, and issued June 24, 2013.
65. Choi, S. H., Kim, G. H., Cho, C. H., and Kim, Y. G. 2013. Reliability centered
maintenance method for power generation facilities. KR20130118644, filed April 20, 2012,
and issued December 12, 2013.
66. Lim, S. S., and Lee, J. Y. 2014. Intelligent failure asset management system for railway
car. KR20140036375, filed September 12, 2012, and issued March 3, 2014.
67. Ku, C., Chen, Y. S., and Chung, Y. K. 2008. An intelligent FMEA system implemented
with a hierarchy of back-propagation neural networks. Cybernetics and Intelligent
Systems IEEE Conference 203–208.
68. Kutlu, A. C., and Ekmekçioğlu, M. 2012. Fuzzy failure modes and effects analysis by
using fuzzy TOPSIS-based fuzzy AHP. Expert Systems with Applications 39(1): 61–67.
69. Laaroussi, A., Fiès, B., Vankeisbelckt, R., and Hans, J. 2007. Ontology-aided
FMEA  for construction products. Bringing ITC knowledge to work. Proceedings of
W78 Conference 26(29): 6.
70. Lee, B. H. 2001. Using FMEA models and ontologies to build diagnostic models. AI
EDAM 15(4): 281–293.
71. Lindahl, M. 1999. E-FMEA—a new promising tool for efficient design for environment.
Proceedings of Environmentally Conscious Design and Inverse Manufacturing 734–739.
72. Liu, H. T. 2009. The extension of fuzzy QFD: From product planning to part deploy-
ment. Expert Systems with Applications 36(8): 11131–11144.
73. Liu, J., Martínez, L., Wang, H., Rodríguez, R. M., and Novozhilov, V. 2010. Computing
with words in risk assessment. International Journal of Computational Intelligence
Systems 3(4): 396–419.
74. Liu, H. C., Liu, L., Liu, N., and Mao, L. X. 2013. Risk evaluation in failure mode
and effects analysis with extended VIKOR method under fuzzy environment. Expert
Systems with Applications 40(2): 828–838.
75. Grantham, K. 2007. Detailed risk analysis for failure prevention in conceptual design:
RED (Risk in early design) based probabilistic risk assessments.
76. Mader, R., Armengaud, E., Grießnig, G., Kreiner, C., Steger, C., and Weiß, R. 2013.
OASIS: An automotive analysis and safety engineering instrument. Reliability
Engineering & System Safety 120: 150–162.
77. Mandal, S., and Maiti, J. 2014. Risk analysis using FMEA: Fuzzy similarity value and
possibility theory-based approach. Expert Systems with Applications 41(7): 3527–3537.

78. Montgomery, T. A., and Marko, K. A. 1997. Quantitative FMEA automation.
Proceedings of Reliability and Maintainability Symposium 226–228.
79. Moratelli, L., Tannuri, E. A., and Morishita, H. M. 2008. Utilization of FMEA during
the preliminary design of a dynamic positioning system for a Shuttle Tanker. ASME
27th International Conference on Offshore Mechanics and Arctic Engineering, Estoril,
Portugal, pp. 787–796.
80. Ormsby, A. R. T., Hunt, J. E., and Lee, M. H. 1991. Towards an automated FMEA assis-
tant. Applications of Artificial Intelligence in Engineering VI, Springer, the Netherlands
739–752.
81. Ozarin, N. 2008. What’s wrong with bent pin analysis, and what to do about it.
Reliability and Maintainability Symposium, Washington, DC: IEEE Computer Society,
pp. 386–392.
82. Pelaez, C. E., and Bowles, J. B. 1995. Applying fuzzy cognitive-maps knowledge-
representation to failure modes effects analysis. Proceedings of Reliability and
Maintainability Symposium: 450–456.
83. Pang, L. M., Tay, K. M., and Lim, C. P. 2016. Monotone fuzzy rule relabeling for the
zero-order TSK fuzzy inference system. IEEE Transactions on Fuzzy Systems 24(6):
1455–1463.
84. Kim, J. H., Jeong, H. Y., and Park, J. S. 2009. Development of the FMECA process
and analysis methodology for railroad systems. International Journal of Automotive
Technology 10(6): 753.
85. Petrović, D. V., Tanasijević, M., Milić, V., Lilić, N., Stojadinović, S., and Svrkota, I.
2014. Risk assessment model of mining equipment failure based on fuzzy logic. Expert
Systems with Applications 41(18): 8157–8164.
86. Price, C. J. 1996. Effortless incremental design FMEA. Reliability and Maintainability
Symposium Proceedings. IEEE International Symposium on Product Quality and
Integrity 43–47.
87. Regazzoni, D., and Russo, D. 2011. TRIZ tools to enhance risk management. Procedia
Engineering 9: 40–51.
88. Rhee, S. J., and Ishii, K. 2003. Using cost based FMEA to enhance reliability and ser-
viceability. Advanced Engineering Informatics 17(3–4): 179–188.
89. Rhee, S. J., and Ishii, K. 2002. Life cost-based FMEA incorporating data uncertainty.
ASME International Design Engineering Technical Conferences and Computers and
Information in Engineering Conference 309–318.
90. Russomanno, D. J., Bonnell, R. D., and Bowles, J. B. 1994. Viewing computer-aided
failure modes and effects analysis from an artificial intelligence perspective. Integrated
Computer-Aided Engineering 1(3): 209–228.
91. Shahin, A. 2004. Integration of FMEA and the Kano model: An exploratory examina-
tion. International Journal of Quality & Reliability Management 21(7): 731–746.
92. Sharma, R. K., and Sharma, P. 2010. System failure behavior and maintenance decision
making using, RCA, FMEA and FM. Journal of Quality in Maintenance Engineering
16(1): 64–88.
93. Sharma, R. K., Kumar, D., and Kumar, P. 2005. Systematic failure mode effect analy-
sis (FMEA) using fuzzy linguistic modelling. International Journal of Quality  &
Reliability Management 22(9): 986–1004.
94. Sharma, R. K., Kumar, D., and Kumar, P. 2007. Modeling and analysing system failure
behaviour using RCA, FMEA and NHPPP models. International Journal of Quality &
Reliability Management, 24(5): 525–546.
95. Sharma, R. K., Kumar, D., and Kumar, P. 2008. Fuzzy modeling of system behav-
ior for risk and reliability analysis. International Journal of Systems Science 39(6):
563–581.

96. Su, C. T., and Chou, C. J. 2008. A systematic methodology for the creation of Six Sigma
projects: A  case study of semiconductor foundry. Expert Systems with Applications
34(4): 2693–2703.
97. Suganthi, S., and Kumar, D. 2010. FMEA without fear and tear. Management of
Innovation and Technology (ICMIT) IEEE International Conference 1118–1123.
98. Ming Tan, C. 2003. Customer-focused build-in reliability: A case study. International
Journal of Quality & Reliability Management 20(3): 378–397.
99. Meng Tay, K., and Peng Lim, C. 2006. Fuzzy FMEA with a guided rules reduction
system for prioritization of failures. International Journal of Quality  & Reliability
Management 23(8): 1047–1066.
100. Teng, S. H., and Ho, S. Y. 1996. Failure mode and effects analysis: An integrated
approach for product design and process control. International Journal of Quality &
Reliability Management 13(5): 8–26.
101. Teoh, P. C., and Case, K. 2004. Failure modes and effects analysis through knowledge
modelling. Journal of Materials Processing Technology 153: 253–260.
102. Throop, D. R., Malin, J. T., and Fleming, L. D. 2001. Automated incremental design
FMEA. IEEE Aerospace Conference. Proceedings 7: 7–3458.
103. Johnson, T., Azzaro, S., and Cleary, D. 2004. Method, system and computer
product for integrating case-based reasoning data and failure modes, effects and corrective
action data. US2004103121, filed November 25, 2002, and issued May 27, 2004.
104. Johnson, T. L., Cuddihy, P. E., and Azzaro, S. H. 2004. Method, system and computer
product for performing failure mode and effects analysis throughout the product life
cycle. US2004225475, filed November 25, 2002, and issued November 11, 2004.
105. Chandler, F. T., Valentino, W. D., Philippart, M. F., Relvini, K. M., Bessette, C.  I.
and  Shedd, N. P. 2004. Human factors process failure modes and effects analysis
(hf pfmea) software tool. US2004256718, filed April 15, 2004, and issued December 23,
2004.
106. Liddy, R., Maeroff, B., Craig, D., Brockers, T., Oettershagen, U., and Davis, T.
2005. Method to facilitate failure modes and effects analysis. US2005138477, filed
November 25, 2003, and issued June 23, 2005.
107. Lonh, K. J., Tyler, D. A., Simpson, T. A., and Jones, N. A. 2006. Method for predict-
ing performance of a future product. US2006271346, filed May 31, 2005, and issued
November 30, 2006.
108. Mosleh, A., Wang, C., and Groen, F. J. 2007. System and methods for assessing risk
using hybrid causal logic. US2007011113, filed March  17, 2006, and issued July  11,
2007.
109. Coburn, J. A., and Weddle, G. B. 2009. Facility risk assessment systems and methods.
US20090138306, filed September 25, 2008, and issued May 28, 2009.
110. Singh, S., Holland, S. W. and Bandyopadhyay, P. 2012. Graph matching system for
comparing and merging fault models. US2012151290, filed December  9, 2010,  and
issued June 14, 2012.
111. Harsh, J. K., Walsh, D. E., and Miller, E. M. 2012. Risk reports for product
quality planning and management. US2012254044, filed March 30, 2012, and issued
October 4, 2012.
112. Abhulimen, K. E. 2012. Design of computer-based risk and safety management sys-
tem of complex production and multifunctional process facilities-application to fpso’s,
US2012317058, filed June 13, 2011, and issued December 13, 2012.
113. Oh, K. P. 2013. Spreadsheet-based templates for supporting the systems engineering
process. US2013013993, filed August 24, 2011, and issued January 10, 2013.
114. Chang, Y. 2014. Product quality improvement feedback method. US20140081442, filed
September 18, 2012, and issued March 20, 2014.
126 Reliability Engineering

115. Barnard, R. F., Dohanich, S. L., and Heinlein, P., D. 1996. System for failure mode and
effects analysis. US5586252, filed May 24, 1994, and issued December 17, 1996.
116. Williams, E., and Rudoff, A. 2006. System and method for performing automated sys-
tem management. US7120559, filed June 29, 2004, and issued October 10, 2006.
117. Williams, E., and Rudoff, A. 2008. System and method for automated problem diagno-
sis. US7379846, filed June 29, 2004, and issued May 27, 2008.
118. Williams, E., and Rudoff A., 2009. System and method for providing a data structure
representative of a fault tree. US7516025, filed June 29, 2004, and issued April 7, 2009.
119. Dreimann, M., Ehlers, P., Goerisch, A., Maeckel, O., Sporer, R., and Sturm, A. 2007.
Method for analyzing risks in a technical project. US8744893, filed April 11, 2006, and
issued November 1, 2007.
120. Vahdani, B., Salimi, M., and Charkhchian, M. 2015. A new FMEA method by inte-
grating fuzzy belief structure and TOPSIS to improve risk evaluation process.
The International Journal of Advanced Manufacturing Technology 77(1–4): 357–368.
121. Wang, C. S., and Chang, T. R. 2010. Systematic strategies in design process for inno-
vative product development. Industrial Engineering and Engineering Management
Proceedings: 898–902.
122. Wang, M. H. 2011. A cost-based FMEA decision tool for product quality design and
management. IEEE Intelligence and Security Informatics Proceedings 297–302.
123. Wirth, R., Berthold, B., Krämer, A., and Peter, G. 1996. Knowledge-based support of
system analysis for the analysis of failure modes and effects. Engineering Applications
of Artificial Intelligence 9(3): 219–229.
124. Selvage, C. 2007. Look-across system. WO2007016360, filed July 28, 2006, and issued
February 28, 2007.
125. Bovey, R. L., and Senalp, E., T. 2010. Assisting with updating a model for diagnosing
failures in a system, WO2010038063, filed September 30, 2009, and issued April 8,
2010.
126. Snooke, N. A. 2010. Assisting failure mode and effects analysis of a system,
WO2010142977, filed June 4, 2010, and issued December 16, 2010.
127. Snooke, N. A. 2012. Automated method for generating symptoms data for diagnostic
systems, WO2012146908, filed April 12, 2012, and issued November 1, 2012.
128. Xiao, N., Huang, H. Z., Li, Y., He, L., and Jin, T. 2011. Multiple failure modes analysis
and weighted risk priority number evaluation in FMEA. Engineering Failure Analysis
18(4): 1162–1170.
129. Yang, C., Letourneau, S., Zaluski, M., and Scarlett, E. 2010. APU FMEA validation
and its application to fault identification. ASME International Design Engineering
Technical Conferences and Computers and Information in Engineering Conference
959–967.
130. Zafiropoulos, E. P., and Dialynas, E. N. 2005. Reliability prediction and failure mode
effects and criticality analysis (FMECA) of electronic devices using fuzzy logic.
International Journal of Quality & Reliability Management 22(2): 183–200.
131. Yang, Z., Bonsall, S., and Wang, J. 2008. Fuzzy rule-based Bayesian reasoning approach
for prioritization of failures in FMEA. IEEE Transactions on Reliability 57(3), 517–528.
132. Zhao, X., and Zhu, Y. 2010. Research of FMEA knowledge sharing method based on
ontology and the application in manufacturing process. Database Technology and
Applications (DBTA), 2nd International Workshop 1–4.
133. Zhou, J., and Stalhaane, T. 2004. Using FMEA for early robustness analysis of Web-
based systems. In Computer Software and Applications Conference Proceedings (2):
28–29.
5 Markov Chains and Stochastic Petri Nets for Availability and Reliability Modeling
Paulo Romero Martins Maciel, Jamilson Ramalho Dantas, and Rubens de Souza Matos Júnior

CONTENTS
5.1 Introduction
5.2 A Glance at History
5.3 Background
5.3.1 Markov Chains
5.3.2 Stochastic Petri Nets
5.4 Availability and Reliability Models for Computer Systems
5.4.1 Common Structures for Computational Systems Modeling
5.4.1.1 Cold, Warm, and Hot Standby Redundancy
5.4.1.2 Active-Active and k-out-of-n Redundancy Mechanisms
5.4.2 Examples of Models for Computational Systems
5.4.2.1 Markov Chains
5.4.2.2 SPN Models
5.5 Final Comments
Acknowledgment
References

5.1 INTRODUCTION
Due to the ubiquitous provision of services on the internet, dependability has become
an attribute of prime concern in hardware/software development, deployment, and
operation. Providing fault-tolerant services is inherently related to the adoption of
redundancy, which can be exploited either in time or in space. Replication of services
is usually provided through hosts distributed across the world so that whenever the
service, the underlying host, or the network fails, another service is ready to take
over. Dependability of a system can be understood as the ability to deliver a specified
functionality that can be justifiably trusted. Functionality might be a set of roles or
services (functions) observed by an outside agent (a human being, another system,
etc.) that interacts with the system at its interfaces; the specified functionality of
a system is what the system is intended for.
Two fundamental dependability attributes are reliability and availability. Reliability
and availability metrics may be estimated with combinatorial models such as reliability
block diagrams and fault trees. These models, however, lack the modeling capacity to
represent dynamic redundancies. State-based models such as Markov chains and stochastic
Petri nets have higher modeling power, but the computational cost of evaluating them is
usually an issue to be considered. This chapter studies the reliability and availability
modeling of a system through Markov chains and stochastic Petri nets.
The remainder of this chapter is divided into four sections. Section 5.2 gives a glance
at some key authors and papers of the area. Section 5.3 introduces background concepts
on Markov chains and stochastic Petri nets. Section 5.4 presents some availability and
reliability models for computer systems. Section 5.5 closes the chapter.

5.2  A GLANCE AT HISTORY


This section provides a summary of early work related to dependability and briefly
describes some seminal efforts as well as their relations with currently prevalent
methods. This account is undoubtedly incomplete; nonetheless, the intent is to present
key events, people, and noteworthy research related to what is now called
dependability modeling [28].
Dependability is related to disciplines such as fault tolerance and reliability.
The concept of dependable computing first appeared in the 1820s when Charles
Babbage set out to conceive and build a mechanical calculating engine to eliminate
the risk of human errors [1,2]. In his book, On the Economy of Machinery and
Manufactures, he remarks, “The first objective of every person who attempts to make an
article of consumption is, or ought to be, to produce it in perfect form” [3]. In the
nineteenth century, reliability theory advanced from probability and statistics as a
way to support estimating maritime and life insurance rates. In the early twentieth
century, methods were proposed to estimate survivorship of railroad equipment [4,5].
The first IEEE (formerly AIEE and IRE) public document to mention reliability
is “Answers to Questions Relative to High Tension Transmission,” which records the
meeting of the Board of Directors of the American Institute of Electrical Engineers
held on September 26, 1902 [6]. In 1905, H. G. Stott and H. R. Stuart discussed
“Time-Limit Relays and Duplication of Electrical Apparatus to Secure Reliability
of Services” at New York [4] and Pittsburg [5]. In these works, the concept of
reliability was chiefly qualitative. In 1907, A. A. Markov began the study of a notable
sort of chance process, in which the outcome of a given experiment can affect the
outcome of the next experiment. This sort of process is now called a Markov chain [7].
Markov’s classic textbook, Calculus of Probabilities, was published four times in
Russian and was translated into German [9]. In 1926, 20 years after Markov’s initial
discoveries, a paper by Russian mathematician S. N. Bernstein used the term “Markov
chain” [8]. In the 1910s, A. K. Erlang studied telephone traffic planning for reliable
service provisioning [10].
The first generation of electronic computers was entirely undependable; hence,
many techniques were investigated for improving their reliability, including design
strategies and evaluation methods. Many methods were then proposed for improving
system dependability, such as error control codes, replication of components,
comparison monitoring, and diagnostic routines. The leading researchers during that
period were Shannon [13], Von Neumann [14], and Moore [15], who proposed and developed
theories for building reliable systems from redundant, less reliable components. These
theories were the forerunners of the statistical and probabilistic techniques that form
the groundwork of modern dependability theory [17].
In the 1950s, reliability became a subject of great interest because of the
Cold War efforts, failures of American and Soviet rockets, and failures of the first
commercial jet, the British de Havilland Comet [18,19]. Epstein and Sobel’s 1953
paper on the exponential distribution was a landmark contribution [20]. In 1954, the
first Symposium on Reliability and Quality Control (now the IEEE Transactions
on Reliability) was held in the United States, and in 1958 the First All-Union
Conference on Reliability was held in Moscow [7,21]. In 1957, S. J. Einhorn and
F. B. Thiess applied Markov chains for modeling system intermittence [22], and in
1960 P. M. Anselone employed Markov chains for evaluating the availability of radar
systems [23]. In 1961, Birnbaum, Esary, and Saunders published a pioneering paper
introducing coherent structures [24].
Reliability models may be classified as combinatorial (non-state-space) models and
state-space models. Reliability Block Diagrams (RBD) and Fault Trees (FT) are
combinatorial models and the most widely adopted models in reliability evaluation.
RBD is probably the oldest combinatorial technique for reliability analysis. Fault Tree
Analysis (FTA) was initially developed in 1962 at Bell Laboratories by H. A. Watson to
analyze the Minuteman I Intercontinental Ballistic Missile Launch Control System.
Afterward, in 1962, Boeing and AVCO expanded the use of FTA to the entire
Minuteman II [25]. In 1965, W. H. Pierce unified the Shannon, Von Neumann, and Moore
theories of masking and redundancy as the concept of failure tolerance [26]. In 1967,
A. Avizienis combined masking methods with error detection, fault diagnosis, and
recovery into the concept of fault-tolerant systems [27].
The formation of the IEEE Computer Society Technical Committee on Fault-Tolerant
Computing (now Dependable Computing and Fault Tolerance TC) in 1970 and
of IFIP Working Group 10.4 on Dependable Computing and Fault Tolerance in 1980
was an essential means for defining a consistent set of concepts and terminology. In the
early 1980s, Laprie coined the term dependability to cover concepts such as reliability,
availability, safety, confidentiality, maintainability, security, and integrity [1,29].
In the late 1970s, several works proposed mappings from Petri nets to Markov
chains [30,32,47]. These models have been extensively adopted as high-level models for
the automatic generation of Markov chains and for discrete event simulation. Natkin was
the first to apply what is now generally called stochastic Petri nets (SPNs) to the
dependability evaluation of systems [33].
5.3 BACKGROUND
This section provides a very brief introduction to Continuous Time Markov Chains
(CTMCs) and SPNs, which are the formalisms adopted to model availability and
reliability in this chapter.

5.3.1  Markov Chains


Markov chains have been applied in many areas of science and engineering. They have
been widely adopted for performance and dependability evaluation in manufacturing,
logistics, communication, computer systems, and so forth [34]. The name Markov chain
comes from the Russian mathematician Andrei Andreevich Markov. Markov was born on
June 14, 1856, in Ryazan, Russia, and died on July 20, 1922, in Saint Petersburg [35].
The references list many books on Markov chains [36–40]. These books cover
Markov chain theory and applications in different depths and styles.
A stochastic process is defined as a family of random variables {Xi(t): t ∈ T}
indexed by some parameter t. Each random variable Xi(t) is defined on some
probability space. The parameter t usually represents time, so Xi(t) denotes the value
assumed by the random variable at time t. T is called the parameter space and is a subset
of ℝ (the set of real numbers).
If T is discrete, that is, T = {0, 1, 2, ...}, the process is classified as a discrete-time
parameter stochastic process. On the other hand, if T is continuous, that is, T = {t:
0 ≤ t < ∞}, the process is a continuous-time parameter stochastic process. In a CTMC,
a change of state may occur at any point in time. A CTMC is a continuous-time,
discrete-state-space stochastic process, that is, the state values are discrete, but the
parameter t ranges continuously over [0, ∞).
A CTMC can be represented by a state-transition diagram in which the vertices rep-
resent states and the arcs between vertices i and j are labeled with the respective transi-
tion rates, that is, λij, i ≠ j. Consider a chain composed of three states, s0, s1, and s2, and
their transition rates, α, β, γ, and λ. The model transitions from s0 to s1 with rate α; from
state s1, the model transitions to state s0 with rate β, and to state s2 with rate γ. When in
state s2, the model transitions to state s1 with rate λ. The rate matrix, Q, is:

\[
Q = \begin{pmatrix}
-\alpha & \alpha & 0 \\
\beta & -(\beta + \gamma) & \gamma \\
0 & \lambda & -\lambda
\end{pmatrix}
\]

For time homogeneous CTMCs:

\[
\frac{d\Pi(t)}{dt} = \Pi(t) \cdot Q, \tag{5.1}
\]
that has the following solution [12,16]:

\[
\Pi(t) = \Pi(0)\, e^{Qt} = \Pi(0) \left( I + \sum_{k=1}^{\infty} \frac{(Qt)^{k}}{k!} \right). \tag{5.2}
\]


In many cases, however, the instantaneous behavior, Π(t), of the Markov chain is
more than is needed; the steady-state probabilities, Π = lim t→∞ Π(t), are often
sufficient. Hence, consider the system of differential equations presented in
Equation 5.1. If the steady-state distribution exists, then:

\[
\frac{d\Pi(t)}{dt} = 0
\]

Consequently, to calculate the steady-state probabilities, it is only necessary to
solve the system:

\[
\Pi \cdot Q = 0, \qquad \sum_{\forall i} \pi_i = 1. \tag{5.3}
\]
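
As a minimal sketch of Equations 5.2 and 5.3, the Python code below builds the rate matrix of the three-state chain described earlier in this section, solves Π · Q = 0 together with the normalization condition by Gauss-Jordan elimination, and approximates Π(t) with the truncated series of Equation 5.2 applied over small time steps. The numeric rates, the function names, and the step-splitting scheme are illustrative assumptions and are not taken from the chapter.

```python
import math

def generator(n, rates):
    """Build an n x n rate matrix Q from {(i, j): rate}, with i != j."""
    Q = [[0.0] * n for _ in range(n)]
    for (i, j), rate in rates.items():
        Q[i][j] = rate
    for i in range(n):
        Q[i][i] = -sum(Q[i])  # diagonal = minus the sum of the off-diagonal rates
    return Q

def steady_state(Q):
    """Solve pi . Q = 0 with sum(pi) = 1 (Equation 5.3) by Gauss-Jordan elimination."""
    n = len(Q)
    # Work on the transposed system and replace the last balance equation
    # by the normalization condition.
    A = [[Q[j][i] for j in range(n)] + [0.0] for i in range(n)]
    A[n - 1] = [1.0] * n + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

def transient(pi0, Q, t, terms=20):
    """Approximate Pi(t) = Pi(0) e^{Qt} (Equation 5.2) with a truncated series,
    applied over several small steps to keep the series well conditioned."""
    n = len(Q)
    steps = max(1, math.ceil(2 * t * max(abs(Q[i][i]) for i in range(n))))
    h = t / steps
    pi = list(pi0)
    for _ in range(steps):
        term, acc = list(pi), list(pi)
        for k in range(1, terms + 1):
            # term_k = term_{k-1} . Q h / k  (row vector times matrix)
            term = [sum(term[i] * Q[i][j] for i in range(n)) * h / k for j in range(n)]
            acc = [a + b for a, b in zip(acc, term)]
        pi = acc
    return pi

# Hypothetical rates (per hour) for the chain s0 <-> s1 <-> s2.
ALPHA, BETA, GAMMA, LAM = 1.0, 2.0, 0.5, 3.0
Q = generator(3, {(0, 1): ALPHA, (1, 0): BETA, (1, 2): GAMMA, (2, 1): LAM})
pi_inf = steady_state(Q)
pi_10 = transient([1.0, 0.0, 0.0], Q, t=10.0)
print("steady state:", [round(p, 4) for p in pi_inf])
print("Pi(10):      ", [round(p, 4) for p in pi_10])
```

With these assumed rates, solving the balance equations by hand gives Π = (12/19, 6/19, 1/19), and Π(10) is already very close to that limit.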

5.3.2 Stochastic Petri Nets


The first SPN extensions were proposed independently by Symons, Natkin, and
Molloy [30,31,32]. Afterward, many other stochastic extensions were introduced. Marsan
et al. extended the basic SPNs by considering stochastic timed transitions and immediate
transitions [41]. This model was named Generalized Stochastic Petri Nets (GSPN) [43].
Later on, Marsan and Chiola proposed an extension that also supported deterministic
timed transitions [42], which was named Deterministic Stochastic Petri Nets
(DSPN) [46]. Many other extensions followed, among them extended Deterministic
Stochastic Petri Nets (eDSPN) [44,45] and Stochastic Reward Nets (SRN) [48].
The SPN considered here is a very general stochastic extension of Place-Transition
nets. Its modeling capacity is well beyond that presented by Symons, Natkin, and
Molloy. The original SPN considered only exponential distributions. GSPNs adopted,
besides exponential distributions, immediate transitions. These models share the
memoryless property also present in untimed Petri nets, since the reachable markings
depend only on the current Petri net marking.
Stochastic Petri Nets. Let SPN = (P, T, I, O, H, M0, Atts) be an SPN, where P, T,
I, O, and M0 are defined as for Place-Transition nets, that is, P is the set of places, T
is the set of transitions, I is the input matrix, O is the output matrix, and M0 is the
initial marking. The set of transitions, T, is, however, divided into immediate
transitions (Tim), timed exponentially distributed transitions (Texp), deterministic
timed transitions (Tdet), and timed generically distributed transitions (Tg):

\[
T = T_{im} \cup T_{exp} \cup T_{det} \cup T_{g}.
\]

Immediate transitions are graphically represented by thin black rectangles, timed
exponentially distributed transitions are depicted by white rectangles, deterministic
timed transitions are represented by thick black rectangles, and timed generically
distributed transitions are denoted by gray rectangles. The matrices I and O represent
the input and output arcs of transitions. These matrices may be marking dependent, that
is, the arc weights may depend on the current marking:

\[
I = (i_{p,t})_{|P| \times |T|}, \qquad i_{p,t}: MD \times RS_{SPN} \to \mathbb{N},
\]

and

\[
O = (o_{p,t})_{|P| \times |T|}, \qquad o_{p,t}: MD \times RS_{SPN} \to \mathbb{N},
\]

where MD = {true, false} is a set that specifies whether the arc between p and t is
marking dependent or not, and RSSPN is the reachability set of the net SPN. If the arc is
marking dependent, the arc weight depends on the current marking M ∈ RSSPN. Otherwise,
it is constant.

\[
H = (h_{p,t})_{|P| \times |T|}, \qquad h_{p,t}: MD \times RS_{SPN} \to \mathbb{N}
\]

is a matrix of inhibitor arcs. These arcs may also be marking dependent in the same
way: if the arc between p and t is marking dependent, its weight depends on the current
marking M ∈ RSSPN; otherwise, it is constant.

Atts = (Π, Dist, MDF, W, G, Policy, Concurrency) is the set of attributes
assigned to transitions, where:

• Π: T → ℕ is a function that assigns a firing priority to each transition.
The larger the number, the higher the firing priority. Immediate transitions
have higher priorities than timed transitions, and timed deterministic
transitions have higher priorities than random timed transitions, that is,
π(ti) > π(tj) > π(tk), ti ∈ Tim, tj ∈ Tdet, and tk ∈ Texp ∪ Tg.
• Dist: Texp ∪ Tg → F is a function that assigns a non-negative probability
distribution function to each random delay transition. F is the set of such functions.
• MDF: T → MD is a function that defines whether the probability distribution
functions assigned to the delays of transitions are marking dependent or not.
MD = {true, false}.
• W: Texp ∪ Tdet ∪ Tim → ℝ+ is a function that assigns a non-negative real
number to exponential, deterministic, and immediate transitions. For exponential
transitions, these values correspond to the parameter values of the
exponential distributions (rates). In the case of deterministic transitions,
they are the deterministic delays assigned to transitions. In the case of
immediate transitions, they denote the weights assigned to transitions.
• G: T → ℕ^|P| is a partial function that assigns a guard expression to
transitions. The guards are evaluated by GE: (T → ℕ^|P|) → {true, false}, which
results in true or false. The guard expressions are Boolean formulas composed
of predicates over the marking of places. A transition may be enabled only if
its guard function evaluates to true. It is worth noting that not every
transition needs to be guarded.
• Policy: T → {prd, prs}, where prd denotes pre-emptive repeat different
(restart), and prs is pre-emptive resume (continue). The timers of transitions
with prd are discarded and new values are generated in the new marking.
The timers of transitions with prs hold their present values.
• Concurrency: T − Tim → {sss, iss} is a function that assigns to each timed
transition a timing semantics, where sss denotes single server semantics
and iss is infinite server semantics.

SPNs are usually evaluated through numerical methods. However, if the state space
is too large or infinite, or if non-phase-type distributions must be represented,
simulation may be the only evaluation option. With simulation, there are no
fundamental restrictions on the models that can be evaluated. Nevertheless, simulation
does have practical constraints, since the amount of computer time and memory
needed to run a simulation can be prohibitively large. Therefore, the general advice is
to pursue an analytical model wherever possible, even if simplifications and/or
decomposition are required.
For a detailed introduction to SPNs, refer to [43,45].

5.4 AVAILABILITY AND RELIABILITY MODELS FOR COMPUTER SYSTEMS
Dependability aspects deserve great attention for assuring the quality of service
provided by a computer system. Dependability studies aim to determine reliability,
availability, security, and safety metrics for the infrastructure under analysis [50].
RBD [51], FT [53], and Petri nets are, as well as Markov chains, widely used to capture
the system behavior and allow the description and prediction of dependability metrics.
The  most basic dependability aspects of a system are the failure and repair
events, which may bring the system to different configurations and operational
modes. The steady-state availability is a common measure extracted from depend-
ability models. Reliability, downtime, uptime, and mean time to system failure are
other metrics usually obtained as output from a dependability analysis in computer
systems.
The  combined analysis of performance and dependability aspects, so-called
performability analysis, is another frequent necessity when dealing with computer
systems, since many of them may continue working after partial failures. Such grace-
fully degrading systems [54] require specific methods to achieve an accurate evalu-
ation of their metrics. Markov reward models constitute an essential framework for
performability analysis. In this context, the hierarchical modeling approach is also
a useful alternative in which distinct models may be used to represent the depend-
ability relationships of the system in the upper level and performance aspects in the
lower level, or vice versa [49,55,58].
For all kinds of Markov chain or SPN analyses, an important assumption must
be kept in mind: state sojourn times or firing delays, respectively, are exponentially
distributed. The behavior of events in many computer systems may fit other probability
distributions better, but in some of these situations the exponential distribution
is a fair approximation, enabling the use of Markovian models. When the exponential
distribution is not a reasonable approximation, SPN extensions that support
non-exponential distributions may be used. Such a deviation from Markovian assumptions
requires the adoption of simulation for the model solution [57,59–61]. It is also
possible to adapt transitions to represent other distributions by employing phase
approximation or moment matching, as shown in [36,52]. Such techniques allow the
modeling of events described by distributions such as Weibull, hypoexponential,
hyperexponential, Erlang, and Cox [13,16].
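
To illustrate the idea of moment matching, the sketch below uses one common two-moment rule, which is an assumption here and not necessarily the procedure of [36,52]: when the coefficient of variation CV = σ/μ of a delay is at most 1, the delay is approximated by an Erlang distribution with k ≈ 1/CV² exponential phases, each with rate k/μ, so that the fitted mean is μ and the fitted standard deviation is μ/√k. The function names and the sample values are hypothetical.

```python
def erlang_match(mean, std):
    """Return (k, rate): number of exponential phases and per-phase rate
    of an Erlang distribution matching the given mean and, approximately, std."""
    cv2 = (std / mean) ** 2
    if cv2 > 1.0:
        # A CV above 1 is usually handled with a hyperexponential fit instead.
        raise ValueError("CV > 1: an Erlang approximation is not suitable")
    k = max(1, round(1.0 / cv2))
    return k, k / mean

def erlang_moments(k, rate):
    """Mean and standard deviation of an Erlang distribution with k phases."""
    return k / rate, (k ** 0.5) / rate

# Hypothetical repair time: mean 2 h, standard deviation 1 h (CV = 0.5).
k, rate = erlang_match(2.0, 1.0)
fitted_mean, fitted_std = erlang_moments(k, rate)
print(k, rate, fitted_mean, fitted_std)
```

In a CTMC or SPN, such a fit corresponds to replacing one non-exponential delay by a chain of k exponential stages, at the cost of a larger state space.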

5.4.1 Common Structures for Computational Systems Modeling


Consider a single-component repairable system. This system may be either operational
or in failure. If the time to failure (TTF) and the time to repair (TTR) are
exponentially distributed with rates λ and µ, respectively, the CTMC shown in
Figure 5.1a is its availability model. The state U (Up) represents the operational
state, and the state D (Down) denotes the faulty system. If the system is operational,
it may fail. The system failure is represented by the transition from state U to state
D. The faulty system may be restored to its operational state by a repair. The repair is
represented by the transition from state D to state U. The rate matrix, Q, is presented
in Figure 5.1b.
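The instantaneous availability and unavailability are the instantaneous probabilities
of being in state U and state D, respectively. Assuming the system is operational at
t = 0:

\[
A(t) = \pi_U(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu) t} \tag{5.4}
\]

and

\[
UA(t) = \pi_D(t) = \frac{\lambda}{\lambda + \mu} - \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu) t}, \tag{5.5}
\]

such that πU(t) + πD(t) = 1.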
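
The instantaneous measures of Equations 5.4 and 5.5 can be evaluated directly. The sketch below assumes that the system starts in state U, takes the unavailability as the complement of A(t), and uses hypothetical rates (a mean time to failure of 1000 h and a mean time to repair of 10 h); the function names are illustrative.

```python
import math

def availability(t, lam, mu):
    """Instantaneous availability A(t) of the two-state model, starting in U."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

def unavailability(t, lam, mu):
    """Instantaneous unavailability UA(t) = 1 - A(t)."""
    s = lam + mu
    return (lam / s) * (1.0 - math.exp(-s * t))

LAM = 1.0 / 1000.0  # hypothetical failure rate (per hour)
MU = 1.0 / 10.0     # hypothetical repair rate (per hour)

for t in (0.0, 1.0, 10.0, 100.0):
    print(t, availability(t, LAM, MU), unavailability(t, LAM, MU))
```

A(t) starts at 1 and decays toward µ/(λ + µ) with time constant 1/(λ + µ), so with these rates the steady-state value is reached within a few tens of hours.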


FIGURE 5.1 Single component system: (a) Availability model and (b) Rate matrix.

If t → ∞, then the steady-state availability and unavailability are obtained,
respectively:

\[
A = \pi_U = \frac{\mu}{\lambda + \mu} \tag{5.6}
\]

and

\[
UA = \pi_D = \frac{\lambda}{\lambda + \mu}, \tag{5.7}
\]

such that πU + πD = 1. The steady-state measures can be obtained also by solving:

\[
\Pi \cdot Q = 0, \qquad \pi_U + \pi_D = 1,
\]

where Π = (πU, πD). The downtime in a period T is DT = πD × T. For a period of
1 year (365 days), T is 8760 h, or 525,600 min.

Now assume a CTMC that represents only the system failure. This model has two states,
U and D, and only one transition. This transition represents the system failure; that
is, when the system is operational (U), it may fail, and this event is represented by
the transition from state U to state D, with failure rate λ. Solving:

\[
\frac{d\Pi(t)}{dt} = \Pi(t) \cdot Q,
\]

where Π(t) = (πU(t), πD(t)) and πU(t) + πD(t) = 1, yields πU(t) = e^(−λt) and
πD(t) = 1 − e^(−λt). The system reliability is:

\[
R(t) = \pi_U(t) = e^{-\lambda t} \tag{5.8}
\]

and the unreliability is:

\[
UR(t) = \pi_D(t) = 1 - e^{-\lambda t}. \tag{5.9}
\]

It is worth mentioning that UR(t) = F(t), where F(t) is the cumulative distribution
function of the time to failure. Consequently, as the mean time to failure is the
integral of R(t) over [0, ∞), we have:

\[
MTTF = \int_{0}^{\infty} R(t)\, dt = \int_{0}^{\infty} e^{-\lambda t}\, dt = \frac{1}{\lambda}.
\]

The mean time to failure (MTTF) also can be computed from the rate matrix
Q [56,65].
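
As a small worked example of the steady-state and reliability measures above, the following sketch computes the steady-state availability, the annual downtime DT = πD × T, the reliability R(t), and a numerical check that the integral of R(t) is 1/λ. The rates (MTTF = 1000 h, MTTR = 10 h), the function names, and the trapezoidal integration settings are illustrative assumptions.

```python
import math

HOURS_PER_YEAR = 8760.0  # 365 days

def steady_state_availability(lam, mu):
    """A = mu / (lam + mu), Equation 5.6."""
    return mu / (lam + mu)

def annual_downtime_hours(lam, mu):
    """DT = pi_D * T for T = 1 year, with pi_D from Equation 5.7."""
    return lam / (lam + mu) * HOURS_PER_YEAR

def reliability(t, lam):
    """R(t) = exp(-lam * t), Equation 5.8."""
    return math.exp(-lam * t)

def mttf_numeric(lam, steps=20000, horizon_factor=50.0):
    """Trapezoidal approximation of the integral of R(t) over [0, horizon_factor / lam]."""
    h = horizon_factor / lam / steps
    total = 0.5 * (reliability(0.0, lam) + reliability(steps * h, lam))
    total += sum(reliability(k * h, lam) for k in range(1, steps))
    return total * h

LAM = 1.0 / 1000.0  # hypothetical failure rate (per hour)
MU = 1.0 / 10.0     # hypothetical repair rate (per hour)

print("A        =", steady_state_availability(LAM, MU))
print("downtime =", annual_downtime_hours(LAM, MU), "h/year")
print("R(100 h) =", reliability(100.0, LAM))
print("MTTF     =", mttf_numeric(LAM), "h (closed form:", 1.0 / LAM, "h)")
```

The numerical integral is included only as a sanity check of the closed form MTTF = 1/λ; for an exponential time to failure the closed form should be used directly.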

5.4.1.1  Cold, Warm, and Hot Standby Redundancy


Systems with stringent dependability requirements demand methods for detecting,
correcting, avoiding, and tolerating faults and failures. A failure in a large-scale
system can mean catastrophic losses. Many techniques have been proposed and adopted
to address dependability issues in computer systems in such a way that failures can
be tolerated and circumvented. Many of those techniques are based on redundancy,
i.e., the replication of components so that they work for a common purpose, ensuring
data security and availability even in the event of some component failure. Three
replication techniques deserve special attention due to their extensive use in
clustered server infrastructures [28]:

• Cold Standby: The backup nodes, or modules, are turned off while on standby
and will only be activated if the primary node fails. One positive point
of this technique is that the secondary node has low energy consumption.
While in standby mode, the reliability of the unit is preserved, i.e., it will
not fail, or at least its mean time to failure is expected to be much higher
than that of a fully active component. On the other hand, the secondary node needs
significant time to be activated, and clients who were accessing information
on the primary node lose all information with the failure of the primary
node and must redo much of the work when the secondary node activates.
• Hot Standby: This type can be considered the most transparent of the
replication modes. The replicated modules are synchronized with the operating
module; thereby, the active and standby cluster participants are seen by
the end user as a single resource. After a node fails, the secondary node
is activated automatically, and the users accessing the primary node will
now access the secondary node without noticing the change of equipment.
• Warm Standby: This technique tries to balance the costs and the recovery
time delay of the cold and hot standby techniques. The secondary node is on
standby, but not completely turned off, so it can be activated faster than
in the cold standby technique, as soon as a monitor detects the failure of
the primary node. The replicated node is synchronized partially with the
operating node, so users who were accessing information on the operating
node may lose some information that was being written close to the moment
when the primary node failed. It is common to assume that in such a state
the standby component has higher reliability than when receiving the
workload (i.e., properly working).

Figure 5.2 depicts an example SPN for a cold-standby server system, comprising two
servers (S1 and S2). Two places (S1_Up and S1_Down) represent the operational
status of the primary server, indicating when it is working or has failed,
respectively. Three places (S2_Up, S2_Down, and S2_Waiting) represent the operational
status of the spare server, indicating when it is working, failed, or waiting for
activation in case of a primary server failure.
Notice that in the initial state of the cold-standby model, both places S1_Up and
S2_Waiting have one token, denoting that the primary server is up and the spare server
is in standby mode. The activation of the spare server occurs when the transition
S1_Fail fires, consuming the token from S1_Up. Once the place S1_Up is empty, the
transition S2_Switch_On becomes enabled, due to the inhibitor arc that connects it to
S1_Up. Hence, S2_Switch_On fires, removing the token from S2_Waiting and putting
one token in place S2_Up. This represents the switchover process from
the primary server to the secondary server, which takes an activation time specified
by the S2_Switch_On firing delay.
The repair of the primary server is represented by firing the S1_Repair transition.
The places S1_Down and S2_Up become empty, and S1_Up receives one token again.
As previously mentioned, the times to failure of the primary and secondary servers
differ because the spare server is preserved from the effects of wear and tear when
it is shut off or in standby mode. The availability can be numerically obtained
from the expression:

\[
A = P\{(\#S1\_Up = 1) \lor (\#S2\_Up = 1)\}
\]
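
The switchover behavior described above can be traced with a small untimed token game. This is only a sketch: the arcs are read from the text, the return of the token to S2_Waiting when S1_Repair fires is an assumption (Figure 5.2 is not reproduced here), S2_Fail and S2_Repair are omitted, and firing delays are ignored. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Transition:
    name: str
    inputs: dict    # place -> tokens consumed
    outputs: dict   # place -> tokens produced
    inhibitors: dict = field(default_factory=dict)  # place -> inhibitor arc weight

    def enabled(self, marking):
        has_tokens = all(marking[p] >= w for p, w in self.inputs.items())
        not_inhibited = all(marking[p] < w for p, w in self.inhibitors.items())
        return has_tokens and not_inhibited

    def fire(self, marking):
        if not self.enabled(marking):
            raise ValueError(f"{self.name} is not enabled")
        new = dict(marking)
        for p, w in self.inputs.items():
            new[p] -= w
        for p, w in self.outputs.items():
            new[p] += w
        return new

s1_fail = Transition("S1_Fail", {"S1_Up": 1}, {"S1_Down": 1})
s2_switch_on = Transition("S2_Switch_On", {"S2_Waiting": 1}, {"S2_Up": 1},
                          inhibitors={"S1_Up": 1})
# Assumption: the repair of S1 sends S2 back to the waiting condition.
s1_repair = Transition("S1_Repair", {"S1_Down": 1, "S2_Up": 1},
                       {"S1_Up": 1, "S2_Waiting": 1})

def available(marking):
    """Availability predicate: (#S1_Up = 1) or (#S2_Up = 1)."""
    return marking["S1_Up"] == 1 or marking["S2_Up"] == 1

m0 = {"S1_Up": 1, "S1_Down": 0, "S2_Up": 0, "S2_Down": 0, "S2_Waiting": 1}
m1 = s1_fail.fire(m0)        # primary fails; switchover not yet completed
m2 = s2_switch_on.fire(m1)   # spare takes over
m3 = s1_repair.fire(m2)      # primary repaired; spare back on standby
for label, m in (("m0", m0), ("m1", m1), ("m2", m2), ("m3", m3)):
    print(label, m, "available:", available(m))
```

The inhibitor arc is what keeps S2_Switch_On disabled while S1_Up holds a token; the marking m1, in which neither S1_Up nor S2_Up is marked, is the unavailable interval whose duration is set by the S2_Switch_On firing delay.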

Figure 5.3 depicts an example CTMC for a warm-standby server system, originally
shown in [49]. This model has many similarities to the SPN model for the
cold-standby system, despite the distinct semantics and notation. It might be
interesting to verify that both approaches can be used interchangeably, mainly when the
state-space size is not a major concern.

FIGURE 5.2 SPN for cold standby redundancy. (Places: S1_Up, S1_Down, S2_Up, S2_Down,
S2_Waiting; transitions: S1_Fail, S1_Repair, S2_Fail, S2_Repair, S2_Switch_On.)

FIGURE 5.3 CTMC for warm standby redundancy.

The CTMC has five states: UW, UF, FF, FU, and FW, and considers one primary
and one spare server. The first letter in each state indicates the primary server
status, and the second letter indicates the secondary server status. The letter U
stands for Up and active, F means Failed, and W indicates the Waiting condition (i.e.,
the server is up but in standby, waiting for activation). The shaded states represent
system failure (i.e., the system is not operational). In state UW, the primary
server (S1) is functional and the secondary server (S2) is in standby.
When S1 fails, the system goes to state FW, where the secondary server has not yet
detected the S1 failure. FU represents the state where S2 has left the waiting
condition and assumed the active role, whereas S1 is failed. If S2 fails before taking
the active role, or before the repair of S1, the system goes to state FF, in which both
servers have failed. For this model, we consider a setup where the primary server
repair has priority over the secondary server repair. Therefore, when both servers
have failed (state FF), there is only one possible outgoing transition: from FF
to UF. If S2 fails when S1 is up, the system goes to state UF and returns to state
UW when the S2 repair is accomplished. Otherwise, if S1 also fails, the system
transitions to state FF. The failure rate of S1 and S2, when they are active, is
denoted by λ1. The rate λ2 denotes the failure rate of the secondary server when it
is inactive. The repair rate assigned to both S1 and S2 is µ. The rate α represents
the switchover rate (i.e., the reciprocal of the mean time to activate the secondary
server after a failure of S1).
The warm standby system availability is computed from the CTMC model by
summing up the steady-state probabilities for the UW, UF, and FU states, which denote
the cases where the system is operational. Therefore, A = π_UW + π_UF + π_FU. System
unavailability might be computed as U = 1 − A, but also as U = π_FF + π_FW.
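
As a rough illustration of this computation, the sketch below builds the balance equations of the five-state chain and solves them with plain Gaussian elimination. All rate values are hypothetical, and two transitions (FW to UW and the return of FU to UW on repair of S1) are assumptions read into the prose rather than taken from Figure 5.3.

```python
# Steady-state availability of the five-state warm-standby CTMC (sketch).
# All rates are illustrative assumptions, not values from the chapter.
LAMBDA1 = 1 / 15000.0   # failure rate of an active server (1/h), hypothetical
LAMBDA2 = 1 / 30000.0   # failure rate of the idle spare (1/h), hypothetical
MU = 1 / 24.0           # repair rate (1/h), hypothetical
ALPHA = 60.0            # switchover rate, i.e. 1-minute mean activation, hypothetical

STATES = ["UW", "UF", "FF", "FU", "FW"]

RATES = {
    ("UW", "FW"): LAMBDA1,  # S1 fails, spare not yet active
    ("UW", "UF"): LAMBDA2,  # idle spare fails
    ("FW", "FU"): ALPHA,    # spare takes over
    ("FW", "FF"): LAMBDA2,  # spare fails before taking over
    ("FW", "UW"): MU,       # assumed: S1 repaired before the switchover completes
    ("FU", "FF"): LAMBDA1,  # active spare fails before S1 is repaired
    ("FU", "UW"): MU,       # assumed: S1 repaired, spare returns to standby
    ("FF", "UF"): MU,       # S1 repair has priority
    ("UF", "UW"): MU,       # S2 repaired
    ("UF", "FF"): LAMBDA1,  # S1 fails while S2 is still down
}


def steady_state(states, rates):
    """Solve pi * Q = 0 with sum(pi) = 1 by Gauss-Jordan elimination."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    # Row j is the balance equation of state j; the last column is the right-hand side.
    a = [[0.0] * (n + 1) for _ in range(n)]
    for (src, dst), rate in rates.items():
        a[idx[dst]][idx[src]] += rate   # inflow into dst from src
        a[idx[src]][idx[src]] -= rate   # outflow from src
    # One balance equation is redundant; replace it by the normalisation condition.
    a[n - 1] = [1.0] * n + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(n):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * y for x, y in zip(a[row], a[col])]
    return {s: a[idx[s]][n] / a[idx[s]][idx[s]] for s in states}


pi = steady_state(STATES, RATES)
availability = pi["UW"] + pi["UF"] + pi["FU"]
unavailability = pi["FF"] + pi["FW"]
```

Changing ALPHA or LAMBDA2 in this sketch is one way to explore how the warm standby model degenerates toward the cold and hot standby variants.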
A CTMC model for a cold standby system can be created with minor adjustments
to the warm standby model, described as follows. The switchover rate (α) must be
reduced to reflect a longer activation time. The transition from UW to
the UF state should be removed if the spare server is not assumed to fail while
inactive. If such a failure is possible, the failure rate (λ2) should be adjusted to
match the longer mean time to failure expected for a spare server that is partially
or entirely turned off.
A CTMC model for a hot standby system also can be derived from the warm
standby model by increasing the value of the switchover rate (α) to reflect a smaller
activation time, or even by removing state FW to allow a direct transition from UW
to FU if the switching time from primary to spare server is negligible. In either
case, the failure rate of the spare server (λ2) should be replaced by the same rate
as the primary server, since the mean time to failure is expected to be the same for
both components.

5.4.1.2  Active-Active and k-out-of-n Redundancy Mechanisms


Active-active redundancy means that two operational units share the workload, but
the workload can be served with acceptable quality by a single unit.
The concept of active-active redundancy can be generalized by assuming that a
system may depend strictly only on a subset of its components. Consider a system
composed of n identical and independent components that is operational if at least
k out of its n components are working correctly. This sort of redundancy is named
k-out-of-n.
Combinatorial models, such as RBD [62], are widely used for representing
k-out-of-n arrangements, but such arrangements can also be modeled and analyzed with
CTMC models with equivalent accuracy and even more flexibility [28,57]. Figure 5.4
depicts an example of a CTMC model for a 3-out-of-5 redundant server system.

FIGURE 5.4  CTMC for 3-out-of-5 redundancy.

In such a CTMC, the 5U state represents that all five servers are operational.
The failure rate of a single server is denoted by λ, whereas the repair rate is denoted
by µ. The transition from state 5U to state 4U occurs with rate 5λ: the server
failures are statistically independent and exponentially distributed, so the time
until the first of the five failures is itself exponential, with a rate equal to the
sum of the individual rates. Similarly, the model goes from state 4U to state 3U
with a rate of 4λ, because only four operational servers remain. If the model is in
state 3U, another failure (at rate 3λ) brings it to the Down state, which represents
that the whole system is no longer operational; the remaining servers are turned
off, and hence no other failure can occur. Only the repair of at least one
server can bring the system to an operational state again. This model considers that
only one server can be repaired at a time, which may be the case in many companies
where the maintenance team has a limited number of members or limited equipment
for the repair. For this reason, the repair occurs with rate µ for all transitions
outgoing from the Down, 3U, and 4U states.
The availability for such a system may be computed as:

\[ A = P\{5U\} + P\{4U\} + P\{3U\} \tag{5.10} \]

\[ A = 1 - \frac{60\lambda^{3}}{60\lambda^{3} + 20\lambda^{2}\mu + 5\lambda\mu^{2} + \mu^{3}} \]

The capacity-oriented availability (COA) estimates how much service the
system can deliver, taking failure states into account [63,64]:

\[ COA = \frac{5\,P\{5U\} + 4\,P\{4U\} + 3\,P\{3U\}}{5} \tag{5.11} \]

\[ COA = \frac{\mu\,(12\lambda^{2} + 4\lambda\mu + \mu^{2})}{60\lambda^{3} + 20\lambda^{2}\mu + 5\lambda\mu^{2} + \mu^{3}} \]


The mean time to failure is:

\[ MTTF = \frac{400\lambda^{4} + 275\lambda^{3}\mu + 107\lambda^{2}\mu^{2} + 13\lambda\mu^{3} + \mu^{4}}{60\lambda^{3}\,(20\lambda^{2} + 5\lambda\mu + \mu^{2})} \]

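
The measures above can be cross-checked numerically. The following sketch uses the birth-death structure of the 3-out-of-5 chain; the values of λ and µ are hypothetical. It also computes the mean time to reach Down in two ways. Under these assumptions, the printed MTTF expression appears to correspond to starting from the steady-state distribution restricted to the up states, whereas starting from 5U gives (47λ² + 8λµ + µ²)/(60λ³).

```python
# Numerical check of the 3-out-of-5 CTMC (states 5U, 4U, 3U, Down).
LAM = 1 / 1000.0   # per-server failure rate (1/h), hypothetical
MU = 1 / 10.0      # repair rate with a single repair crew (1/h), hypothetical

FAIL = [5 * LAM, 4 * LAM, 3 * LAM]   # rates for 5U->4U, 4U->3U, 3U->Down


def steady_state(fail, mu):
    """Birth-death balance: pi[i+1] = pi[i] * fail[i] / mu, then normalise."""
    weights = [1.0]
    for rate in fail:
        weights.append(weights[-1] * rate / mu)
    total = sum(weights)
    return [w / total for w in weights]   # [P5U, P4U, P3U, PDown]


def mean_times_to_failure(fail, mu):
    """Mean time to reach Down from 5U, 4U and 3U, with Down made absorbing."""
    hop = []   # hop[i]: mean time to move one level down from level i
    for i, rate in enumerate(fail):
        back = mu * hop[i - 1] if i > 0 else 0.0   # time lost to repairs going back up
        hop.append((1.0 + back) / rate)
    return [sum(hop[i:]) for i in range(len(fail))]


p5, p4, p3, p_down = steady_state(FAIL, MU)
availability = p5 + p4 + p3
coa = (5 * p5 + 4 * p4 + 3 * p3) / 5

t5, t4, t3 = mean_times_to_failure(FAIL, MU)
mttf_from_5u = t5
mttf_steady_start = (p5 * t5 + p4 * t4 + p3 * t3) / availability

# Closed forms printed in the text, evaluated for comparison.
den = 60 * LAM**3 + 20 * LAM**2 * MU + 5 * LAM * MU**2 + MU**3
availability_closed = 1 - 60 * LAM**3 / den
mttf_closed = ((400 * LAM**4 + 275 * LAM**3 * MU + 107 * LAM**2 * MU**2
                + 13 * LAM * MU**3 + MU**4)
               / (60 * LAM**3 * (20 * LAM**2 + 5 * LAM * MU + MU**2)))
```

The same two helper functions work for any k-out-of-n chain with a single repair crew by changing the FAIL list.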
5.4.2 Examples of Models for Computational Systems
To demonstrate how to analyze the availability and reliability of computing systems,
the example architecture presented in Figure 5.5 is used. The system is composed of
a switch/router and a server subsystem. The system fails if the switch/router
fails or if the server subsystem fails. The server subsystem comprises two servers, S1
and S2. S1 is the main server, and S2 is the spare server. They are configured in a
cold standby mechanism; that is, S2 starts as soon as S1 fails. The start-up time of
S2 may be configured according to the adopted switching mechanism. If the start-up
time of S2 is equal to zero, the switching is perfect.
To compute the availability and reliability of such a system, a modeling strategy
consisting of Markov chains and SPN models is used.

5.4.2.1  Markov Chains


5.4.2.1.1  Availability CTMC Model
The architecture described in Figure 5.5 enables availability analysis through a
heterogeneous modeling approach. Many formalisms may be used to compute such
metrics. However, the redundancy mechanism used in the system requires the use
of state-based models, such as Markov chains or SPNs. Therefore, this example
depicts the use of CTMC models to compute availability and reliability measures.
Figure 5.6 represents the CTMC availability model. The CTMC represents the
detailed behavior of the system, which employs redundancy; the start-up time of S2
is assumed to be zero. The CTMC has six states, each written as a tuple: (D,S2,D),
(S1,S2,D), (S1,S2,SR), (D,S2,SR), (D,D,D), and (D,D,SR). It considers one primary and
one spare server, S1 and S2 respectively, and one switch/router.

FIGURE 5.5  A simple example. (Elements shown: clients, switch/router, and servers
S1 and S2.)

FIGURE 5.6  CTMC availability model. (States: (S1,S2,SR), (D,S2,SR), (D,D,SR),
(S1,S2,D), (D,S2,D), (D,D,D); transition rates: λ_S1, λ_S2, λ_SR, µ_S1, µ_S2, µ_SR, µ_S.)

Each state name comprises three parts. The first represents server one (S1),
the second denotes server two (S2), and the third describes the switch/router
component (SR). S1 denotes that server S1 is running and operational, S2 denotes
that server S2 is running and operational, and SR denotes that the switch/router is
running and operational. The letter D represents the failure state of the
corresponding component. The initial state (S1,S2,SR) represents that the primary
server (S1) is running and operational, the secondary server (S2) is the spare
server, and the switch/router (SR) is functional. When S1 fails, the system goes
from (S1,S2,SR) to (D,S2,SR); when S1 is repaired, the system returns to the initial
state. Once in state (D,S2,SR), the system may go to state (D,S2,D) through the SR
failure, or to state (D,D,SR) through the S2 failure. In both cases, the system may
return to the previous state through the SR repair or the S2 repair, respectively.
Once state (D,D,SR) is reached, the system may go to state (D,D,D) with the SR
failure, or return to the initial state (S1,S2,SR) when the repair is accomplished
(i.e., the repair of both servers S1 and S2). The failure rates of S1, S2, and SR are
denoted by λ_S1, λ_S2, and λ_SR, respectively, and the corresponding repair rates by
µ_S1, µ_S2, and µ_SR. The rate µ_S denotes the repair rate when the two servers are
in a failure state.
The CTMC that represents the architecture enables obtaining a closed-form equation
to compute the availability (see Equation 5.12). It is important to stress that
µ_S1 = µ_S2 = µ_SR = µ and λ_S1 = λ_S2 = λ.

\[ A = \frac{\mu\,\bigl(\mu(\mu + \mu_{S}) + \lambda(\mu + 2\mu_{S})\bigr)}{(\lambda_{SR} + \mu)\,\bigl(\lambda^{2} + \mu(\mu + \mu_{S}) + \lambda(\mu + 2\mu_{S})\bigr)} \tag{5.12} \]

5.4.2.1.2  Reliability CTMC Model
Figure 5.7 depicts the CTMC reliability model for this architecture. The main
characteristic of reliability models is the absence of repair from the system
failure state; i.e., once the system reaches the failure state, no repair is
considered. This makes it easier to compute the system mean time to failure and,
subsequently, the reliability metric. The reliability model has three states:
(S1,S2,SR), (D,S2,SR), and Down. The initial state (S1,S2,SR) represents all
components running. If S1 fails, the system goes to state (D,S2,SR); that is, even
with the failure of server S1, the system continues operating with the secondary
server (S2). When S1 is repaired, the system returns to the initial state. The
transition from (S1,S2,SR) to Down, taken when SR fails, represents the system
failure; the system is then offline and cannot provide the service. Once in state
(D,S2,SR), the system may go to the Down state with the S2 failure rate or the SR
failure rate. The Down state is absorbing, and the probability of not having
reached it by time t gives the reliability. The up states of the system are
(S1,S2,SR) and (D,S2,SR).
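
The transient behavior of this three-state chain can be sketched in closed form, because only two states are transient. The example below is one possible way to do it, not the procedure used by the authors: it takes the rates from Table 5.1, builds the 2x2 generator restricted to the up states, and uses its two eigenvalues to evaluate R(t). Function and variable names are illustrative.

```python
import math

# Rates taken from Table 5.1 (per hour).
LAM_S1 = LAM_S2 = 1 / 15000.0
LAM_SR = 1 / 20000.0
MU_S1 = 1 / 24.0

# Generator restricted to the transient (up) states, in the order
# (S1,S2,SR), (D,S2,SR):  B = [[a, b], [c, d]]
a = -(LAM_S1 + LAM_SR)
b = LAM_S1
c = MU_S1
d = -(MU_S1 + LAM_S2 + LAM_SR)


def reliability(t):
    """R(t) = [1, 0] * exp(B t) * [1, 1]^T, via the eigenvalues of the 2x2 matrix B."""
    trace, det = a + d, a * d - b * c
    root = math.sqrt(trace * trace - 4.0 * det)   # real, since b * c > 0
    r1, r2 = (trace + root) / 2.0, (trace - root) / 2.0
    # exp(B t) = (e^{r1 t} (B - r2 I) - e^{r2 t} (B - r1 I)) / (r1 - r2);
    # only the sum of its first row is needed.
    return (math.exp(r1 * t) * (a - r2 + b)
            - math.exp(r2 * t) * (a - r1 + b)) / (r1 - r2)


# Mean time to absorption from (S1,S2,SR): -[1, 0] * B^{-1} * [1, 1]^T
mttf = (b - d) / (a * d - b * c)

r_4000 = reliability(4000.0)
unreliability_4000 = 1.0 - r_4000
```

With these inputs the value of r_4000 agrees with the reliability reported in Table 5.2 to about five decimal places.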

FIGURE 5.7  CTMC reliability model. (Transitions: (S1,S2,SR) to (D,S2,SR) at rate λ_S1
and back at rate µ_S1; (S1,S2,SR) to Down at rate λ_SR; (D,S2,SR) to Down at rate
λ_SR + λ_S2.)

5.4.2.1.3 Results
Table 5.1 presents the values of the failure and repair rates, which are the
reciprocals of the MTTF and mean time to repair (MTTR) of each component represented
in Figures 5.6 and 5.7. Those values were estimated and were used to compute the
availability and reliability metrics.
It is important to stress that µ_S is half of µ_S1 (i.e., the mean repair time is
doubled), since just one maintenance team is considered for repairing both servers.
The availability and reliability measures were computed for the architecture
described in Figure 5.5, using the mentioned input parameters. The results are shown
in Table 5.2, including steady-state availability, number of nines, annual downtime,
reliability, and unreliability, considering 4,000 h of activity.
The downtime provides a view of how much time the system is unavailable to its
users over 1 year. The downtime value of 10.514278 h means that the system
accumulates about 10.5 hours of total outage per year, which indicates that the
system can be improved. At 4,000 h of activity, the system has a reliability a little
over 80 percent.
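
The availability figures can be reproduced by evaluating Equation 5.12 with the rates of Table 5.1. The sketch below does this and derives the number of nines and the annual downtime. The 8,760 h year is an assumption, since the text does not state which convention was used, so the last digits of the downtime may differ slightly from Table 5.2.

```python
import math

# Rates from Table 5.1 (per hour).
LAM_SR = 1 / 20000.0   # switch/router failure rate
LAM = 1 / 15000.0      # failure rate of each server
MU = 1 / 24.0          # repair rate of a single component
MU_S = 1 / 48.0        # repair rate when both servers are down


def availability(lam, lam_sr, mu, mu_s):
    """Closed-form steady-state availability, Equation 5.12."""
    servers = mu * (mu + mu_s) + lam * (mu + 2 * mu_s)
    return mu * servers / ((lam_sr + mu) * (lam ** 2 + servers))


A = availability(LAM, LAM_SR, MU, MU_S)
nines = -math.log10(1 - A)                 # number of nines
downtime_h_per_year = (1 - A) * 8760       # assumes a 365-day year
```

Re-running the function with other rates, for instance a larger switch/router MTTF, shows which component dominates the downtime.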

5.4.2.2  SPN Models


5.4.2.2.1  Availability SPN Model
An SPN model may be used to represent the same system already analyzed with the
CTMC model discussed in the previous section, and to obtain availability and reli-
ability measures similarly.

TABLE 5.1
Input Parameters
Variable                 Value (h⁻¹)
λ_SR                     1/20,000
λ_S1 = λ_S2              1/15,000
µ_S1 = µ_S2 = µ_SR       1/24
µ_S                      1/48

TABLE 5.2
CTMC Results
Measure                  Value
Availability             0.9987997
Number of nines          2.9207247
Downtime (h/yr)          10.514278
Reliability (4,000 h)    0.8183847
Unreliability (4,000 h)  0.1816152

The model represents the switch/router component and two redundant servers, S1
and S2. The servers are configured in cold standby; that is, S2 starts as soon as S1
fails. The start-up time of S2 is represented by the S2_SwitchingOn transition.
Figure 5.8 shows the SPN model adopted to estimate the availability and downtime
of the system with cold standby redundancy.
The markings of the places SR_OK and S1_OK denote the operational states of
the switch/router and the S1 server. The marking of place S2_OFF indicates the
waiting state before the activation of the S2 server. When the place S2_OK is
marked, the server S2 is operational and in use. The places SR_F, S1_F, and S2_F
indicate the failure states of these components. When the main module (S1) fails,
the transition S2_SwitchingOn is enabled. Its firing represents the start of the
spare server (S2) in the operational state. The delay of this transition is the
Mean Time to Activate (MTA).
The following statements are adopted for estimating availability and unavailability:

A = P{(#SR_OK = 1) AND ((#S1_OK = 1) OR (#S2_OK = 1))}

UA = 1 − P{(#SR_OK = 1) AND ((#S1_OK = 1) OR (#S2_OK = 1))}
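
To make the meaning of these statements concrete, the sketch below evaluates the same predicate over a small marking distribution. The probabilities are made-up placeholders, not results of the SPN in Figure 5.8; in practice they would come from the steady-state solution produced by an SPN tool.

```python
# Evaluating the availability statement over a marking distribution (sketch).
# The probabilities are hypothetical placeholders, not outputs of the real model.

def system_up(marking):
    """(#SR_OK = 1) AND ((#S1_OK = 1) OR (#S2_OK = 1))"""
    return marking["SR_OK"] == 1 and (marking["S1_OK"] == 1 or marking["S2_OK"] == 1)


MARKING_PROBS = [
    ({"SR_OK": 1, "S1_OK": 1, "S2_OK": 0}, 0.9970),  # primary serving
    ({"SR_OK": 1, "S1_OK": 0, "S2_OK": 1}, 0.0018),  # spare serving
    ({"SR_OK": 1, "S1_OK": 0, "S2_OK": 0}, 0.0001),  # spare still switching on
    ({"SR_OK": 0, "S1_OK": 1, "S2_OK": 0}, 0.0011),  # switch/router down
]

# A is the probability mass of the markings that satisfy the predicate.
A = sum(prob for marking, prob in MARKING_PROBS if system_up(marking))
UA = 1 - A
```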

FIGURE 5.8  Availability SPN model. (Places: SR_OK, SR_F, S1_OK, S1_F, S2_OK, S2_F,
S2_OFF; transitions: SRF, SRR, S1F, S1R, S2F, S2R, S2_SwitchingOn.)


5.4.2.2.2  Reliability SPN Model


Figure 5.9 shows the SPN reliability model for the architecture presented in
Figure 5.5. The main difference between the models of Figures 5.8 and 5.9 concerns
repair: the reliability model considers only the time until the first system
failure, so the system as a whole is not repaired after it fails. The model
represents an active/active redundancy: when both the S1 and S2 servers have failed,
the immediate transition Failure_sys is enabled and may fire, marking the place
System_OFF with a token. The following expressions are adopted for estimating
reliability and unreliability, respectively:

R(t) = 1 − P{(#SR_F = 1) ∨ (#System_OFF = 1)}(t)

UR(t) = P{(#SR_F = 1) ∨ (#System_OFF = 1)}(t)

FIGURE 5.9  SPN reliability model. (Places: SR_OK, SR_F, S1_OK, S1_F, S2_OK, S2_F,
System_OFF; transitions: SRF, S1F, S1R, S2F, S2R, Failure_sys.)


5.4.2.2.3 Results
Table 5.3 presents the values of mean time to failure (MTTF) and mean time to
repair (MTTR) used for computing availability and reliability metrics for the SPN
models. We computed the availability and reliability measures using the mentioned
input parameters. The results are shown in Table 5.4, including steady-state
availability, number of nines, annual downtime, reliability, and unreliability,
considering 4,000 h of activity. The switching time considered is 10 minutes
(about 0.17 h), which is enough for the system start-up and software loading.
This SPN model enables the computation of the reliability function of this system
over time, which is plotted in Figure 5.10, considering the baseline setup of
parameters shown in Table 5.3, and also a scenario with improved values for the
switch/router MTTF (30,000 h) and both servers' MTTR (8 h). It is noticeable that, in
the baseline setup, the system reliability reaches 0.50 at around 15,000 h, and after
60,000 h (about 7 years), the system reliability is almost zero. When the improved
version of the system is considered, the reliability has a smoother decay, reaching
0.50 only at around 25,000 h, and approaching zero only near 100,000 h. For the
sake of comparison, the reliability at 4,000 h is 0.8840773 in the improved setup,
whereas in the baseline setup it is 0.818385. Such an analysis might be valuable for
systems administrators to make decisions regarding system maintenance and
replacement of components to avoid failures that would cause significant damage to
revenue, customer satisfaction, or other corporate goals.

TABLE 5.3
Input Parameters for SPN Models
Transition           Value (h)   Description
SRF                  20,000      Switch/Router MTTF
S1F = S2F            15,000      Servers MTTF
SRR = S1R = S2R      24          MTTR
S2_SwitchingOn       0.17        MTA

TABLE 5.4
SPN Results
Measure                  Value
Availability             0.998799
Number of nines          2.920421
Downtime (h/yr)          10.521636
Reliability (4,000 h)    0.818385
Unreliability (4,000 h)  0.181615

FIGURE 5.10  Reliability function for the example system. (Reliability, from 0 to 1,
versus time, from 0 to 100,000 h; curves: baseline setup and improved setup.)

5.5  FINAL COMMENTS


The process of analytical modeling for computational systems must consider a
variety of strategies and the characteristics of each available formalism. The choice
of one type of model may involve accuracy issues, expressive power, accessible
software tools, and the complexity of the target system.
The concepts and examples presented in the chapter should be viewed as an
introduction to, and motivation for, possible methods to select when studying
computing reliability and availability metrics. The conciseness and power of SPNs
can be especially useful when complexity grows and many details must be
represented. Nevertheless, CTMCs remain an option that provides enough resources
for performing many kinds of analyses. Other modeling formalisms, such as FTs, RBDs,
Reliability Graphs, and Stochastic Automata Networks, are also significantly
important and enable different views of the same dependability concepts approached
here.
Information systems now control almost every aspect of daily life. The knowledge
and framework presented here may be increasingly required as regulatory agencies
and large corporate customers demand estimates of bounds on how dependable their
systems are.

ACKNOWLEDGMENT
This work was supported by a grant of contract number W911NF1810413 from the
U.S. Army Research Office (ARO).

REFERENCES
1. Laprie, J.C. Dependable Computing and Fault Tolerance: Concepts and terminology.
Proceedings of the 15th IEEE International Symposium on Fault-Tolerant Computing.
1985.
2. Schaffer, S. Babbage’s Intelligence: Calculating Engines and the Factory System.
Critical Inquiry. 1994, Vol. 21, No. 1, 203–227.
3. Blischke, W.R., Murthy, D.P.  [ed.]. Case Studies in Reliability and Maintenance.
Hoboken, NJ: John Wiley & Sons, 2003. p. 661.
4. Stott, H.G. Time-Limit Relays and Duplication of Electrical Apparatus to Secure
Reliability of Services at New York. Transactions of the American Institute of Electrical
Engineers, 1905, Vol. 24, 281–282.
5. Stuart, H.R. Time-Limit Relays and Duplication of Electrical Apparatus to Secure
Reliability of Services at Pittsburg. Transactions of the American Institute of Electrical
Engineers, 1905, vol. XXIV, 281–282.
6. Board of Directors of the American Institute of Electrical Engineers. Answers to
Questions Relative to High Tension Transmission. s.l.: IEEE, 1902.
7. Ushakov, Igor. Is Reliability Theory Still Alive? Reliability: Theory & Applications.
2007, Vol. 2, No. 1, Mar. 2017, p. 10.
8. Bernstein, S. Sur l’extension du théorème limite du calcul des probabilités aux sommes
de quantités dépendantes. Mathematische Annalen. 1927, Vol. 97, 1–59.
http://eudml.org/doc/182666.
9. Basharin, G.P., Langville, A.N., Naumov, V.A. The  Life and Work of A.A. Markov.
Linear Algebra and Its Applications. 2004, Vol. 386, 3–26. doi:10.1016/j.laa.2003.12.041.
10. Principal Works of A. K. Erlang—The  Theory of Probabilities and Telephone
Conversations. First published in Nyt Tidsskrift for Matematik B. 1909, Vol. 20, 33–39.
11. Kotz, S., Nadarajah, S. (2000). Extreme Value Distributions: Theory and Applications.
Imperial College Press. doi:10.1142/9781860944024.
12. Kolmogoroff, A. Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung [in
German]  [Springer-Verlag]. Mathematische Annalen. 1931, Vol.  104, 415–458.
doi:10.1007/BF01457949.
13. Shannon, C.E. A Mathematical Theory of Communication. The Bell System Technical
Journal. 1948, Vol. 27, 379–423, 623–656.
14. Neumann, J.V. Probabilistic Logics and the Synthesis of Reliable Organisms from
Unreliable Components. Automata studies, 1956, Vol. 34, 43–98.
15. Moore, E.F. Gedanken-Experiments on Sequential Machines. The Journal of Symbolic
Logic. 1958, Vol. 23, No. 1, 60.
16. Cox, D. A Use of Complex Probabilities in the Theory of Stochastic Processes.
Mathematical Proceedings of the Cambridge Philosophical Society. 1955, Vol. 51,
No. 2, 313–319. doi:10.1017/S0305004100030231.
17. Avizienis, A. Toward Systematic Design of Fault-Tolerant Systems. IEEE Computer.
1997, Vol. 30, No. 4, 51–58.
18. Barlow, R.E. Mathematical Theory of Reliability. New York: John Wiley & Sons, 1967.
SIAM series in applied mathematics.
19. Barlow, R.E., Mathematical Reliability Theory: From the Beginning to the Present
Time. Proceedings of the Third International Conference on Mathematical Methods
In Reliability, Methodology and Practice. Trondheim, Norway, 2002.
20. Epstein, B., Sobel, M. Life Testing. Journal of the American Statistical Association.
1953, Vol. 48, No. 263, 486–502.
21. Gnedenko, B., Ushakov, I. A., & Ushakov, I. (1995). Probabilistic reliability engineering.
John Wiley & Sons.
22. Einhorn, S.J., Thiess, F.B. Intermittence as a Stochastic Process. NYU-RCA Working
Conference on Theory of Reliability. Ardsley-on-Hudson, NY, 1957.
23. Anselone, P.M. Persistence of an Effect of a Success in a Bernoulli Sequence. Journal
of the Society for Industrial and Applied Mathematics. 1960, Vol. 8, No. 2, 272–279.
24. Birnbaum, Z.W., Esary, J.D., Saunders, S.C. Multi-component Systems and Structures
and Their Reliability. Technometrics. 1961, Vol. 3, No. 1, 55–77.
25. Ericson, C. Fault Tree Analysis—A  History. Proceedings of the 17th International
Systems Safety Conference. Orlando, FL, 1999.
26. Pierce, W.H. Failure-Tolerant Computer Design. New  York: Academic Press, 1965,
65–69.
27. Avizienis A., Laprie J.C., Randell, B. Fundamental Concepts of Computer System
Dependability. IARP/IEEE-RAS Workshop on Robot Dependability: Technological
Challenge of Dependable Robots in Human Environments—Seoul, Korea, 2001
28. Maciel, P., Trivedi, K., Matias, R., Kim, D. Dependability Modeling. Performance and
Dependability in Service Computing: Concepts, Techniques and Research Directions
ed. Hershey, PA: IGI Global, 2011.
29. Laprie, J.C. Dependability: Basic Concepts and Terminology. New York:
Springer-Verlag, 1992.
30. Natkin, S.O. (1980). Les reseaux de Petri stochastiques et leur application a l’evaluation
des systemes informatiques. Conservatoire National des Arts et Metiers. PhD thesis.
CNAM. Paris, France.
31. Molloy, M.K. (1982). On The Integration of Delay and Throughput Measures in
Distributed Processing Models. PhD thesis. UCLA. Los Angeles, CA.
32. Symons, F.J.W. Modelling and Analysis of Communication Protocols using Numerical
Petri Nets. PhD Thesis, University of Essex, also Dept of Elec. Eng. Science
Telecommunications Systems Group Report No. 152, 1978.
33. Chiola, G., Franceschinis, G., Gaeta, R., Ribaudo, M. GreatSPN 1.7: Graphical Editor
and Analyzer for Timed and Stochastic Petri Nets. Performance Evaluation. Vol. 25,
No. 1–2, 47–68, 1995.
34. Haverkort, B.R. Markovian Models for Performance and Dependability Evaluation.
Lectures on Formal Methods and Performance Analysis. Berlin, Germany: Springer,
2001.
35. Seneta, E. Markov and the Creation of the Markov Chains. School of Mathematics and
Statistics, University of Sydney, NSW, Australia, 2006.
36. Trivedi, K.S. Probability and Statistics with Reliability, Queuing, and Computer
Science Applications, 2nd ed. Hoboken, NJ: John Wiley & Sons, 2001.
37. Parzen, E. Stochastic Processes. Dover Publications. San Francisco, CA, 1962.
38. Stewart, W.J. Probability, Markov Chains, Queues and Simulation. Princeton, NJ:
Princeton University Press, 2009.
39. Ash, R.B. Basic Probability Theory. New York: John Wiley & Sons, 1970.
40. Feller, W. An Introduction to Probability Theory and Its Applications. Vols. I, II.
New York: John Wiley & Sons, 1968.
41. Marsan, M.A., Conte, G., Balbo, G. A Class of Generalized Stochastic Petri Nets for the
Performance Evaluation of Multiprocessor Systems. ACM Transactions on Computer
System. 1984, Vol. 2, No. 2, 93–122.
42. Ajmone Marsan, M., Chiola, G. On Petri Nets with deterministic and exponentially
distributed firing times. In G. Rozenberg, editor, Advances in Petri Nets 1987, Lecture
Notes in Computer Science 266, pp. 132–145. Springer-Verlag, 1987.
43. Marsan, M.A., Balbo, G., Conte, G., Donatelli, S., Franceschinis, G. Modelling with
Generalized Stochastic Petri Nets. Wiley, 1995.

44. German, R., Lindemann, C. Analysis of Stochastic Petri Nets by the Method of
Supplementary Variables. Performance Evaluation. 1994, Vol. 20, No. 1, 317–335.
45. German, R. Performance Analysis of Communication Systems with Non-Markovian
Stochastic Petri Nets. New York: John Wiley & Sons, 2000.
46. Lindemann, C. (1998). Performance modelling with deterministic and stochastic Petri
nets. ACM sigmetrics performance evaluation review, 26(2), 3.
47. Molloy, M.K. Performance Analysis Using Stochastic Petri Nets. IEEE Transactions
on Computers. 1982, Vol. 9, 913–917.
48. Muppala, J., Ciardo, G., Trivedi, K.S. Stochastic Reward Nets for Reliability Prediction.
Communications in Reliability, Maintainability and Serviceability. 1994, Vol. 1, 9–20.
49. Matos, R., Dantas, J., Araujo, J., Trivedi, K.S., Maciel, P. Redundant Eucalyptus Private
Clouds: Availability Modeling and Sensitivity Analysis. Journal Grid Computing.
2017, Vol. 15, No. 1, 1–23.
50. Malhotra, M., Trivedi, K.S. Power-hierarchy of Dependability-Model Types. IEEE
Transactions on Reliability. 1994, Vol. 43, No. 3, 493–502.
51. Shooman, M.L. The  Equivalence of Reliability Diagrams and Fault-Tree Analysis.
IEEE Transactions on Reliability. 1970, Vol. 19, No. 2, 74–75.
52. Watson, J.R., Desrochers, A.A. Applying Generalized Stochastic Petri Nets to
Manufacturing Systems Containing Nonexponential Transition Functions. IEEE
Transactions on Systems, Man, and Cybernetics. 1991, Vol. 21, No. 5, 1008–1017.
53. O’Connor P, Kleyner A. Practical Reliability Engineering. John Wiley & Sons; 2012
Jan 30.
54. Beaudry, M.D. Performance-Related Reliability Measures for Computing Systems.
IEEE Transactions on Computers. 1978, Vol. 6, 540–547.
55. Dantas, J., Matos, R., Araujo, J., Maciel, P. Eucalyptus-based Private Clouds:
Availability Modeling and Comparison to the Cost of a Public Cloud. Computing. 2015,
Vol. 97, No. 11, 1121–1140.
56. Buzacott, J.A. Markov Approach to Finding Failure Times of Repairable Systems.
IEEE Transactions on Reliability. 1970, Vol. 19, No. 4, 128–134.
57. Maciel, P., Matos, R., Silva, B., Figueiredo, J., Oliveira, D., Fe, I., Maciel, R.,
Dantas, J. Mercury: Performance and Dependability Evaluation of Systems with
Exponential, Expolynomial, and General Distributions. In: The 22nd IEEE Pacific Rim
International Symposium on Dependable Computing (PRDC 2017). January 22–25,
2017. Christchurch, New Zealand.
58. Guedes, E., Endo, P., Maciel, P. An Availability Model for Service Function Chains
with VM Live Migration and Rejuvenation. Journal of Convergence Information
Technology. Volume 14 Issue 2, April, 2019. Pages 42–53.
59. Silva, B., Matos, R., Callou, G., Figueiredo, J., Oliveira, D., Ferreira, J., Dantas, J.,
Junior, A.L., Alves, V., Maciel, P. Mercury: An Integrated Environment for Performance
and Dependability Evaluation of General Systems. Proceedings of Industrial Track at
45th Dependable Systems and Networks Conference (DSN-2015). 2015. Rio de Janeiro,
Brazil.
60. Silva, B., Maciel, P., Tavares, E., Araujo, C., Callou, G., Souza, E., Rosa, N. et  al.
ASTRO: A  Tool for Dependability Evaluation of Data Center Infrastructures. IEEE
International Conference on Systems, Man, and Cybernetics, 2010, Istanbul, Turkey.
IEEE Proceeding of SMC, 2010.
61. Silva, B., Callou, G., Tavares, E., Maciel, P., Figueiredo, J., Sousa, E., Araujo, C.,
Magnani, F., Neves, F. Astro: An Integrated Environment for Dependability and
Sustainability Evaluation. Sustainable Computing: Informatics and Systems. 2013
Mar 1;3(1):1–7.
62. Kuo, W., Zuo, M.J. Optimal Reliability Modeling—Principles and Applications. Wiley,
2003. p. 544.

63. Heimann, D., Mittal, N., Trivedi, K. Dependability Modeling for Computer systems.
Proceedings Annual Reliability and Maintainability Symposium, 1991. IEEE, Orlando,
FL, pp. 120–128.
64. Matos, R., Maciel, P.R.M., Machida, F., Kim, D.S., Trivedi, K.S. Sensitivity Analysis
of Server Virtualized System Availability. IEEE Transactions on Reliability. 2012,
Vol. 61, No. 4, 994–1006.
65. Reinecke, P., Bodrog, L., Danilkina, A. Phase-Type Distributions. Resilience
Assessment and Evaluation of Computing Systems, Berlin, Germany: Springer, 2012.
6  An Overview of Fault Tree Analysis and Its Application in Dual Purposed Cask
Reliability in an Accident Scenario

Maritza Rodriguez Gual, Rogerio Pimenta Morão, Luiz Leite da Silva, Edson Ribeiro,
Claudio Cunha Lopes, and Vagner de Oliveira

CONTENTS
6.1 Introduction
6.2 Overview of Fault Tree Analysis
6.3 Minimal Cut Sets
6.4 Description of Dual Purpose Cask
6.5 Construct the Fault Tree of Dual Purpose Cask
6.6 Results
6.7 Conclusion
References

6.1 INTRODUCTION
Spent nuclear fuel is generated from the operation of nuclear reactors and must be
safely managed following its removal from reactor cores. The Nuclear Technology
Development Center (Centro de Desenvolvimento da Tecnologia Nuclear–CDTN),
Belo Horizonte, Brazil constructed a dual-purpose metal cask in scale 1:2 for the
transport and dry storage of spent nuclear fuel (SNF) that will be generated by
research reactors, both plate-type material testing reactor (MTR) and TRIGA fuel
rods. The CDTN is connected to the Brazilian National Nuclear Energy Commission
(Comissão Nacional de Energia Nuclear—CNEN).
The dual purpose cask (DPC) development was supported by International
Atomic Energy Agency (IAEA) Projects RLA4018, RLA4020, and RLA3008.
The project began in 2001 and finished in 2006. Five Latin American countries
participated: Argentina, Brazil, Chile, Peru, and Mexico. The cask is classified as a
Type B package according to the IAEA Regulations for the Safe Transport of Radioactive
Material (IAEA, TS-R-1, 2009). The RLA4018 cask was designed and constructed
in compliance with the IAEA Transport Regulations, which establish the standards
for the packages used in the transport of radioactive materials under both normal
and accident conditions.
The general safety requirements cover, among other issues, package tie-down,
lifting, decontamination, securing and closing devices, and material resistance to
the radiation, thermal, and pressure conditions likely to be found during transportation.
The regulations establish requirements that guarantee that fissile material is
packaged and shipped in such a manner that it remains subcritical under the conditions
prevailing during routine transport and in accidents.

6.2  OVERVIEW OF FAULT TREE ANALYSIS


Different techniques are widely applied in risk analyses of industrial processes and
operating equipment, such as Failure Modes and Effects Analysis (FMEA) and its
extension Failure Mode, Effects, and Criticality Analysis (FMECA) (Gual et al.,
2014; Perdomo and Salomon, 2016), fault tree analysis (FTA) (Vesely et al., 1981),
What-if (Gual et al., 2017), Layer of Protection Analysis (LOPA) (Troncoso, 2018a),
and Hazard and Operability Study (HAZOP) (Troncoso, 2018b). Different methods of
solving a fault tree, such as an advanced combinatorial method (Rivero et al., 2018),
have also been applied. All techniques have advantages and limitations. The technique
chosen depends on the documentation available and the objectives to be achieved.
FTA is one of the most popular and visual techniques for identifying risks in
design, operation, reliability, and safety.
The traditional FTA technique was selected to evaluate reliability and risk assess-
ment of a DPC constructed at CDTN for the transport and storage of spent nuclear
fuel that will be generated by research reactors.
FTA techniques were first developed at Bell Telephone Laboratories in the early
1960s. Since then, FTA techniques have been adopted readily by a wide range
of industries, such as power, rail transport, oil, nuclear, chemistry, and medicine,
as one of the primary methods of performing reliability and safety analysis. FTA is
a top-down, deductive failure analysis in which an undesired state of a system is
analyzed using Boolean logic to combine a series of lower-level events. This analysis
method is used mainly in the fields of safety engineering and reliability engineering
to determine the probability of a safety accident or a specific system-level
(functional) failure. In 1981, the U.S. Nuclear Regulatory Commission (NRC) issued
the Fault Tree Handbook, NUREG-0492 (Vesely et al., 1981). In 2002, the National
Aeronautics and Space Administration (NASA) published the Fault Tree Handbook
with Aerospace Applications (Stamatelatos et al., 2002). Today, FTA is used widely
in all major fields of engineering.
FTA defined in NUREG-0492 is “An analytical technique, whereby an undesired
state of the system is specified (usually a state that is critical from a safety stand-
point), and the system is then analyzed in the context of its environment and opera-
tion to find all credible ways in which the undesired event can occur.”
A fault tree can always be translated into an entirely equivalent set of minimal cut
sets (MCSs), which can be considered the root causes of the top event
(Vesely et al., 1981). The FTA begins by identifying the multiple-cause combinations
that lead to the top event. These combinations are connected by an
AND gate (the output occurs only if all inputs occur), indicating that the events
must occur simultaneously, or by an OR gate (the output occurs if any input
occurs). The fundamental laws of Boolean algebra (Whitesitt,
1995) (see Table 6.1) are applied to reduce all possible cause combinations to
the smallest cut sets (Vesely et al., 1981) that could cause the top event to occur.
Eventually, all cause combinations associated with each basic event can be simplified
and presented in a fault tree diagram.

TABLE 6.1
Fundamental Laws of Boolean Algebra
Law | AND Form Representation | OR Form Representation
Commutative | x·y = y·x | x + y = y + x
Associative | x·(y·z) = (x·y)·z | x + (y + z) = (x + y) + z
Distributive | x·(y + z) = x·y + x·z | x + y·z = (x + y)·(x + z)
Idempotent | x·x = x | x + x = x
Absorption | x·(x + y) = x | x + x·y = x

The fault tree gates are systematically substituted by their inputs, applying the
Boolean algebra laws in several stages until the top event Boolean expression contains
only basic events. The final form of the Boolean equation is an irreducible logical
union of minimal sets of events that are necessary and sufficient to cause the top event,
called MCSs. The original fault tree is thus mathematically transformed into
an equivalent MCS fault tree. The transformation process also ensures that any single
event that appears repeatedly in various branches of the fault tree is not counted
twice.
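
The substitution procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool used in this study: the three-gate tree, the event names E1 to E3, and the helper functions `cut_sets` and `minimize` are all hypothetical.

```python
from itertools import product

# Hypothetical fault tree (NOT the DPC tree): gate name -> (gate type, inputs).
# Any name that is not a key of GATES is treated as a basic event.
GATES = {
    "TOP": ("OR", ["G1", "E3"]),
    "G1": ("AND", ["E1", "G2"]),
    "G2": ("OR", ["E2", "E3"]),
}


def cut_sets(node, gates):
    """Substitute gates by their inputs until only basic events remain."""
    if node not in gates:                       # basic event
        return [frozenset([node])]
    kind, inputs = gates[node]
    families = [cut_sets(child, gates) for child in inputs]
    if kind == "OR":                            # union of the input families
        return [cs for family in families for cs in family]
    # AND: merge one cut set from every input, for every combination
    return [frozenset().union(*combo) for combo in product(*families)]


def minimize(sets):
    """Idempotent law (drop duplicates) and absorption law (drop supersets)."""
    unique = set(sets)
    minimal = [s for s in unique if not any(other < s for other in unique)]
    return sorted(minimal, key=lambda s: (len(s), sorted(s)))


mcs = minimize(cut_sets("TOP", GATES))
print([sorted(s) for s in mcs])
```

In this toy tree the expansion yields E1·E2, E1·E3, and E3; the absorption law removes E1·E3 because E3 alone already causes the top event.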
Fault trees graphically represent the interaction of failures and other events within
a system. Basic events at the bottom of the fault tree are linked via logic symbols
(known as gates) to one or more top events. These top events represent identified
hazards or system failure modes.
A fault tree diagram (FTD) is a logic block diagram that displays the state of
a system (top event) in terms of the states of its components or subsystems (basic
events). The basic events are determined through a top-down approach. The relationships
between the causes and the top event (failure) are represented by
standard logic symbols (AND, OR, etc.).
FTA involves five steps:

• Define the top event to study
• Obtain an understanding of the system (with a functional diagram or design,
for example)
• Construct the fault tree
• Analyze the fault tree qualitatively or quantitatively
• Evaluate the results and propose recommendations.

FTA is a simple, clear, and direct visual method for effectively analyzing and estimating
possible accidents or incidents and their causes. FTA is useful for prioritizing the
preventive maintenance of the equipment that contributes the most to the failure.
It is also a quality assurance (QA) tool. The overall success of the FTA depends on
the skill and experience of the analyst.
Qualitative analysis by FTA is an alternative for reliability assessment when historical
data on undesirable events (failures) are incomplete or unavailable
for probabilistic (quantitative) calculation.
FTA can be used for quantitative assessment of reliability if probability values
are available.
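
As a hedged illustration of such a quantitative assessment, the sketch below computes the top event probability from a list of minimal cut sets, once exactly by inclusion-exclusion and once with the rare-event approximation. The cut sets and the probability values are hypothetical and are not data from this study; basic events are assumed to be statistically independent.

```python
from itertools import combinations

# Hypothetical minimal cut sets and basic-event probabilities (illustrative only).
MCS = [{"E3"}, {"E1", "E2"}]
PROB = {"E1": 0.01, "E2": 0.02, "E3": 0.001}


def prob_all(events, prob):
    """Probability that all events in the set occur (independence assumed)."""
    p = 1.0
    for event in events:
        p *= prob[event]
    return p


def top_event_exact(mcs, prob):
    """Inclusion-exclusion over the union of the cut sets."""
    total = 0.0
    for k in range(1, len(mcs) + 1):
        sign = 1.0 if k % 2 else -1.0
        for combo in combinations(mcs, k):
            total += sign * prob_all(set().union(*combo), prob)
    return total


def top_event_rare(mcs, prob):
    """Rare-event approximation: sum of the cut-set probabilities."""
    return sum(prob_all(cs, prob) for cs in mcs)


exact = top_event_exact(MCS, PROB)
approx = top_event_rare(MCS, PROB)
print(f"exact = {exact:.7f}, rare-event approximation = {approx:.7f}")
```

The rare-event approximation is an upper bound on the exact value and is close to it when the basic-event probabilities are small. Inclusion-exclusion grows exponentially with the number of cut sets, so it is only practical for small trees.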
For a large or very complex system that includes many pieces of equipment and
components, FTA can be time consuming, and a complex fault tree must be analyzed
with a specialized computer program. However, there are still several practical cases in
which fault trees are convenient, as in the case study solved here.
This methodology (Vesely et al., 1981) is applicable to all fault trees, regardless of
size or complexity, that satisfy the following conditions:

• All failures are binary in nature (either success or failure; ON/OFF).
Partial failures do not exist.
• Transition between working and failed states occurs instantaneously (no
time delay is considered).
• All component failures are statistically independent.
• The failure rate of each equipment item is constant.
• The repair rate for each equipment item is constant.
• After repair, the system will be as good as old, not as good as new (i.e.,
the repaired component is returned to the same state, with the same failure
characteristics that it would have had if the failure had not occurred; repair
is not considered to be a renewal process).
• The fault tree for system failure is the same as the repair tree (i.e., repair
of the failed component results in the immediate return to their normal
state of all higher intermediate events that failed as a result of the failed
component).

The biggest advantage of using FTA is that it starts from a top event that is
selected by the user for a specific interest, and the tree developed will identify the
root causes.

6.3  MINIMAL CUT SETS


There  are several methods for determining MCS. In  this case study, the classic
Boolean reduction is used as was described previously.
The adjective minimal means that all events in the cut set are essential. If just one
of the events in the set is recovered, the system returns to the success state, and
when that event fails again, it causes the system failure.
Cut sets with fewer events are generally more likely to occur since only a few events
must fail simultaneously. Therefore, these MCSs have a higher importance.
The MCSs can be ranked by their number and by their order (i.e., cut set size).
The order of a cut set is the number of elements in it. First-order MCSs can be
obtained directly, and higher-order MCSs result from AND gates: an AND gate
increases the order of the MCSs, whereas an OR gate increases the number of MCSs.
The lower the order, the more critical is the cut set; a low-order cut set is only
acceptable if its failure is of very low probability.
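
The effect of each gate type on the order and the number of cut sets can be seen in a short worked example. The basic events A, B, and C below are hypothetical and are not taken from the DPC fault tree.

```latex
% OR gate: each input is a cut set on its own, so the NUMBER of MCSs grows.
T_{\mathrm{OR}} = A + B + C
  \;\Rightarrow\; \{A\},\ \{B\},\ \{C\}
  \quad \text{(three MCSs of order 1)}

% AND gate: all inputs are needed together, so the ORDER of the MCS grows.
T_{\mathrm{AND}} = A \cdot B \cdot C
  \;\Rightarrow\; \{A, B, C\}
  \quad \text{(one MCS of order 3)}

% Mixed case: the distributive law turns the product into a sum of products.
T = (A + B) \cdot C = A \cdot C + B \cdot C
  \;\Rightarrow\; \{A, C\},\ \{B, C\}
  \quad \text{(two MCSs of order 2)}
```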

6.4  DESCRIPTION OF DUAL PURPOSE CASK


The difference between conventional transport and storage packages for radioactive
material and the DPC is that, in addition to the cask body, the DPC has primary and
secondary lids, an internal basket, and external shock absorbers (see Figure 6.1).
Figure 6.2 shows a photograph of the DPC constructed at CDTN.
The RLA4018 cask consists of a robust cylindrical body provided with an internal
cavity to accommodate a basket holding the spent fuel elements, a double lid (primary
and secondary), and two external impact limiters (top and bottom). The impact limiters
are structures made of an external stainless-steel skin and an energy-absorbing
filling material.
The body and the primary lid are sandwich-like structures with stainless steel on
the outside and lead on the inside for shielding purposes. The primary lid is provided
with a double metallic sealing system. Bolts are used to fix the primary lid to the
cask body.
The main function of the secondary lid is to protect the primary lid against impacts.
The internal basket is a square array stiffened by spacers and provided with a
bottom plate and feet. It is typically designed to hold 21 MTR fuel elements. Each
MTR-type fuel element has 21 plates of U3O8 oxide fuel clad in aluminum.
The fuel elements are transferred into a stainless-steel basket and transported dry.
Fuel elements are kept in interim storage under dry conditions.

FIGURE 6.1  Spent fuel transport and storage RLA4018 design by CDTN.

FIGURE 6.2  Photograph of spent fuel transport and storage RLA4018 design by CDTN.

The cask is provided with four lifting trunnions; two in the top half and two in its
bottom half so that the cask can be easily rotated. The cask is vertically held down
by four bottom screwed trunnions.
The process of loading spent fuel consists of submerging the transport cask in
the reactor pool while spent fuel is transferred into the basket. The water is then
drained and the cask is dried to eliminate residual amounts of water in the cavity
and to ensure subcritical conditions.
The cask has one draining port for vacuum drying, while its primary lid is provided
with another port for helium gas filling. After the water is drained, the cask primary
lid is installed on the cask body. Next, a vacuum drying system is connected to
the cask to remove the remaining moisture.
The shock absorbers protect the whole cask during the 9-meter
drop test prescribed for this type of package. They consist of a thin external
stainless-steel shell encasing an energy-absorbing material. Different materials
have been used by cask designers for this purpose, the most common being
polyurethane foam, solid wood and wood composites, aluminum honeycomb, and
aluminum foam. The currently selected cushioning material is high-density rigid
polyurethane foam.
It is important to note that the accelerometer base is not in the final cask. It is used
only to measure the acceleration range during the impact tests.
Type B packages are designed to withstand very severe accidents in all the modes
of transport without any unacceptable loss of containment or shielding.
The transport regulations and storage safety requirements to consider in the
DPC package design (IAEA, 2014), under routine conditions of transport (RCT),
normal conditions of transport (NCT), and accident conditions of transport
(ACT), are:

• Containment of radioactive materials
• Shielding (control of external radiation levels)
• Prevention of nuclear criticality (a self-sustaining nuclear chain reaction)
• Prevention of damage caused by heat dissipation
• Structural integrity
• Stored spent fuel retrievability
• Aging

Aging effects in DPCs are considered because the casks are expected to be used for
spent fuel interim storage for up to 20 years.
The objective of the regulations is to protect people and the environment from the
effects of radiation during the transport of radioactive material.
Normal conditions that a spent fuel transport package must be able to resist
include hot and cold environments, changes in pressure, vibration, water spray,
impact, puncture, and compression.
To show that it can resist accident conditions, a package must pass impact, fire,
and water immersion tests.
Reports from the United States (Nuclear Monitor 773, 2013) and the United
Kingdom (Jones and Harvey, 2014) include descriptions of various accidents and
incidents involving the transport of radioactive materials, which occurred until 2014,
but none resulted in a release of radioactive material or a fatality due to radiation
exposure. For this reason, this study is important.

6.5  CONSTRUCT THE FAULT TREE OF DUAL PURPOSE CASK


The fault tree is a directed acyclic graph consisting of two types of nodes: events
(represented as circles) and gates.
An event is an occurrence within the system, typically the failure of a component
or sub-system.
Events can be divided into:

• Basic events (BEs), which occur on their own
• Intermediate events (IEs), which are caused by other events

The root, called the top event (TE), is the undesired event of the tree.
Rectangles represent the top event and intermediate events.
Circles represent basic events.
A logic OR gate, which is equivalent to the Boolean symbol +, represents a situation
in which any one of the input events alone is enough to cause the
output event (system fault). OR gates increase the number of cut sets and often lead
to single-component cut sets.
A logic AND gate, which is equivalent to the Boolean symbol ·, represents a situation
in which all the input events shown below the gate are required for the
output event (system fault) shown above the gate. AND gates increase
the number of components (order) in a cut set.
The analysis was performed according to the following steps:

• Definition of the system failure event of interest, known as the top event, as
environmental contamination.
• Identification of contributing events (basic or intermediate), which might
directly cause the top event to occur.

6.6 RESULTS
The specific case study analyzed to apply FTA is titled Environmental contamina-
tion (Top Event).
The fault tree was constructed by a multidisciplinary team
of nuclear, electrical, and mechanical engineers. Working
in multidisciplinary teams makes it possible to analyze the weak points of the design.
The fault tree diagram is shown in Figure 6.3.
The  basic events that led to the top event, Environmental contamination, are
shown in Table 6.2 with the symbols given.
Table 6.3 describes the symbols for the Intermediary Events on the FTD.
The Boolean algebra analysis of the fault tree is shown in Table 6.4.
The MCSs are listed in Table 6.5.
Events B1, B2, B3, B4, B5, B7, B8, B9, and B10 are associated with human errors.
Hence, B6 is susceptible to human error.
Boolean algebra laws reduced the amount of cause combinations and the redun-
dancy of basic events.
MCSs can be used to understand the structural vulnerability of a system. If the
order of an MCS is high, the system (or the top event in the fault tree) is less
vulnerable to that combination of events. In addition, numerous MCSs indicate
that the system has higher vulnerability. Cut sets can be used to detect single-point
failures (one independent element of a system that causes an immediate hazard to
occur and/or causes the whole system to fail).
Two first-order and five second-order MCSs were found.
• 1st order: The occurrence of a single BE implies the occurrence of the top or
undesired event.
• 2nd order: The simultaneous occurrence of two BEs results in the loss of continuity
of operation of the system.

FIGURE 6.3  FTD of RLA4018 DPC before the MCS analysis.

TABLE 6.2
Description of Symbols for the Basic Events on the Fault Tree Diagram
Number | Basic Events | Symbols
1 | Containment failure | B1
2 | Failure in inspection, control, in testing program | B2
3 | Vehicle collision | B3
4 | Fire of oil | B4
5 | Deficiencies in component | B5
6 | Operator errors | B6
7 | Contaminated water in reactor pool | B7
8 | Improper equipment to closure of screw | B8
9 | Error in tightening torque calculation | B9
10 | Material aging | B10

TABLE 6.3
Description of Symbols for the Intermediary Events on the FTD
Number | Intermediary Events | Symbols
1 | Contamination outside of cask | G2
2 | Vehicle fire | G3
3 | Internal lid screws with incorrect torque | G4
4 | Collision and other accidents | G5
5 | Containment failure | G6
6 | Deficiencies in decontamination equipment and/or contamination detection | F1
7 | Failure in screw closure | F2
6 Failure in screw closure F2

TABLE 6.4
Minimal Cut Set Determination Steps
Step | Boolean Expression for Top Event (G1) of Figure 6.3
1 | G1 = G2 + G6 + G3 + G4 + (G5·G3)
2 | G1 = (B7·B6) + (B2·B6) + (B2·B6) + (B2·B6) + (B2·B6) + B8 + B9 + (B1·B2) + (B10·B2) + (B3·B4) + (B2·B6) + (B5·B6·B3·B4)
3 | G1 = (B7·B6) + (B2·B6) + B8 + B9 + (B1·B2) + (B10·B2) + (B3·B4)
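
The reduction from Step 2 to Step 3 in Table 6.4 can be reproduced with a short script. The sketch below takes the product terms of Step 2 as printed and applies only the idempotent and absorption laws of Table 6.1; the helper names `reduce_terms`, `event_number`, and `label` are illustrative and not part of the original study.

```python
# Product terms of Step 2 in Table 6.4, one set of basic events per term.
STEP2_TERMS = [
    {"B7", "B6"}, {"B2", "B6"}, {"B2", "B6"}, {"B2", "B6"}, {"B2", "B6"},
    {"B8"}, {"B9"}, {"B1", "B2"}, {"B10", "B2"},
    {"B3", "B4"}, {"B2", "B6"}, {"B5", "B6", "B3", "B4"},
]


def event_number(event):
    """Numeric part of a basic-event symbol, e.g. 'B10' -> 10."""
    return int(event[1:])


def reduce_terms(terms):
    """Return the minimal terms after idempotence and absorption."""
    unique = {frozenset(term) for term in terms}             # x + x = x
    minimal = [t for t in unique
               if not any(other < t for other in unique)]    # x + x·y = x
    return sorted(minimal,
                  key=lambda t: (len(t), sorted(event_number(e) for e in t)))


def label(term):
    """Readable product term, e.g. 'B2·B10'."""
    return "·".join(sorted(term, key=event_number))


mcs = reduce_terms(STEP2_TERMS)
labels = [label(t) for t in mcs]
first_order = [lab for lab, t in zip(labels, mcs) if len(t) == 1]
second_order = [lab for lab, t in zip(labels, mcs) if len(t) == 2]
print("G1 =", " + ".join(labels))
print("first order:", first_order)
print("second order:", second_order)
```

The repeated B2·B6 terms collapse into one by idempotence, and B5·B6·B3·B4 is absorbed by B3·B4, which leaves the seven terms of Step 3.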

TABLE 6.5
List of Minimal Cut Sets
Number | Minimal Cut Sets | Cause
1 | B8 | Improper equipment to closure of screw
2 | B9 | Error in tightening torque calculation
3 | (B6,B7) | Operator errors and contaminated water in reactor pool
4 | (B3,B4) | Vehicle collision and fire of oil
5 | (B2,B6) | Failure in inspection, control, in testing program and operator errors
6 | (B1,B2) | Containment failure and failure in inspection, control, in testing program
7 | (B10,B2) | Failure in inspection, control, in testing program and material aging

Based on this, it is necessary to act early to prevent the occurrence of the top event
and to deal first with the most critical causes, i.e., those that form the first-order
(lowest-order) MCSs, B8 and B9. The system is relatively safe in that the
first-order MCSs are few; however, each of them alone is enough to cause the top event.
For this tree, seven root causes (MCSs) were found and, according to the MCSs, two of
these causes are critical; they can happen independently of the others and cause the
top event.
Human error must be considered in inspection, control, the testing program,
decontamination, contamination detection, manufacturing, tightening torque
calculation, the use of improper equipment for closure of the screws, and the
preparation of operators for transportation, as well as during loading and
unloading of spent nuclear fuel.
Corrective actions are required to minimize the probability of fault occurrence,
such as:

• Make sure the operator is well trained and qualified
• Create a preventive/predictive maintenance planning and scheduling
• Build a QA program

The diagrams created in fault tree methods are, in general, more easily understood
by specialists outside probabilistic safety analysis (PSA), and therefore they can
greatly assist in the documentation of the event model (IAEA-TECDOC-1267, 2002).
A PSA fault tree is a powerful tool that can be used to confirm assumptions that
are commonly made in deterministic calculations about the availability of systems,
for example, to determine the potential for common cause failures or the minimum
system requirements, to identify important single failures, and to determine
the adequacy of technical specifications (IAEA-SSG-2, 2009).
Risk assessment is addressed by the IAEA in the
Safety Analysis Report (SAR), and an assessment of PSA (IAEA-GSR-4, 2016) is
included in the SAR.
The risk assessment for spent nuclear fuel transportation and storage is part of
the SAR of the CDTN. The constructed DPC is not yet licensed in Brazil. The SAR is
an important document for the entire licensing process.
This study will form part of a future SAR of the CDTN and a safety operation
manual for the DPC because it provides pertinent information.

6.7 CONCLUSION
The FTA of the DPC was established on the basis of the environmental contamina-
tion scenario of the DPC in this chapter.
The main causes include the use of improper equipment for closure of the screws
and errors in the calculation of the tightening torque. Appropriate precautionary
measures can be taken to decrease the probability of their occurrence.
The results revealed that a large proportion of undesired events were the result of
human errors. Proposed corrective actions have been implemented to minimize such
incidents.
This evaluation identified the weak points existing in the DPC and
provided theoretical support for avoiding the loss of DPC integrity.
Despite all the advantages previously discussed, it is important to note that this
study is an initial work that must continue because other possible undesired events
must be studied.
This is the first work at CDTN on FTA for DPCs; it will contribute to many
future studies of this system, which will involve the quantitative derivation of
probabilities.
This study provides an organized record of the basic events that contribute to the
environmental contamination from a DPC. It also provides information pertinent to
future SARs of nuclear installations of CDTN (in Portuguese, RASIN) and to an
operation manual for DPCs.

REFERENCES
Gual MR, Rival RR, Oliveira V, Cunha CL. 2017. Prevention of human factors and Reliability
analysis in operating of sipping device on IPR-R1 TRIGA  reactor, a case study. In.
Human Factors and Reliability Engineering for Safety and Security in Critical
Infrastructures, eds. Felice F. and Petrillo A., pp. 155–170, Springer Series in Reliability
Engineering, Piscataway, NJ.
Gual MR, Perdomo OM, Salomon J, Wellesley J, Lora A. 2014. ASeC software applica-
tion based on FMEA  in a mechanical sample positioning system on a radial chan-
nel for irradiations in a nuclear research reactor with continuous full-power operation.
International Journal of Ecosystems and Ecology Science 4(1):81–88.
International Atomic Energy Agency—IAEA. 2009. Regulations for the Safe Transport of
Radioactive Material. Safety Requirements No. TS-R-1, Vienna, Austria.
IAEA-TECDOC-1267. 2002. Procedures for Conducting Probabilistic Safety Assessment
for Non-reactor Nuclear Facilities. International Atomic Energy Agency Vienna
International Centre, Vienna, Austria.
IAEA  Specific Safety Guide No.  SSG-2. 2009. Deterministic Safety Analysis for Nuclear
Power Plants Safety. International Atomic Energy Agency Vienna International
Centre, Vienna, Austria.
IAEA  General Safety Requirements GSR-4. 2016. Safety Assessment for Facilities and
Activities, IAEA-GSR-4, Vienna, Austria.
Jones AL, Harvey MP. 2014. Radiological Consequences Resulting from Accidents and
Incidents Involving the Transport of Radioactive Materials in the UK: 2012 Review,
PHE-CRCE-014. Education Public Health England, August, HPA-RPD-034.
Nuclear Monitor 773. 2013. Nuclear transport accidents and incidents.
https://www.wiseinternational.org/nuclear-monitor/773/nuclear-transport-accidents-and-incidents
(accessed May 30, 2018).
Perdomo OM, Salomon LJ. 2016. Expanded failure mode and effects analysis: Advanced
approach for reliability assessments. Revista Cubana de Ingeniería VII (2):5–14.
Rivero JJ, Salomón LJ, Perdomo OM, Torres VA. 2018. Advanced combinatorial method for
solving complex fault trees. Annals of Nuclear Energy 120:666–681.
Stamatelatos M, Vesely WE, Dugan J, Fragola J. 2002. Fault Tree Handbook with Aerospace
Applications. NASA  Office of Safety and Mission Assurance. NASA  Headquarters.
Washington, DC. August.
Troncoso M. 2018a. Estudio LOPA para la Terminal de Petrolíferos Veracruz (In Spanish)
Internal Task: 9642-18-2017 (408).
Troncoso M. 2018b. Estudio HAZOP para la Terminal de Petrolíferos Puebla. (In Spanish)
Internal Task: 9642-18-2015 (408).
Vesely W, Goldberg F, Roberts N, Haasl D. 1981. Fault tree handbook. Technical Report
NUREG-0492, Office of Nuclear Regulatory Research U.S. Nuclear Regulatory
Commission (NRC). Washington, DC.
Whitesitt J. Eldon. 1995. Boolean Algebra and Its Applications. Courier Corporation, 182 p.,
Dover Publications, Inc., New York.
7 An Overview on Failure Rates in Maintenance Policies
Xufeng Zhao and Toshio Nakagawa

CONTENTS
7.1 Introduction
7.2 Inequalities of Failure Rates
7.3 Age and Random Replacement Policies
7.4 Periodic and Random Replacement Policies
7.5 Periodic Replacement Policies with Failure Numbers
7.6 Conclusions
Appendices
Acknowledgment
References

7.1 INTRODUCTION
Aging describes how an operating unit improves or deteriorates with its age and is
usually measured in terms of the failure rate function [1,2]. The failure rate is
the most important quantity in reliability theory, and its properties were investigated
in [2–4]. For an age replacement model, it has been supposed that an operating
unit is replaced preventively at time $T$ ($0 < T \le \infty$) or correctively at failure time
$X$ ($X > 0$), whichever occurs first, in which the random variable $X$ has a general
distribution $F(t) \equiv \Pr\{X \le t\}$ for $t \ge 0$ with finite mean
$\mu \equiv \int_0^\infty \overline{F}(t)\,dt$. The expected
cost rate for the age replacement policy was given in [2,4]:

$$C(T) = \frac{c_T + (c_F - c_T)F(T)}{\int_0^T \overline{F}(t)\,dt}, \tag{7.1}$$

where:
$c_T$ = preventive replacement cost at time $T$,
$c_F$ = corrective replacement cost at failure time $X$,
$c_F > c_T$.

To obtain an optimum $T^*$ that minimizes Equation (7.1), it was assumed that $F(t)$ has a
density function $f(t) \equiv dF(t)/dt$, i.e., $F(t) = \int_0^t f(u)\,du$, where
$\overline{\phi}(t) \equiv 1 - \phi(t)$ for any
function $\phi(t)$. Then, for $F(t) < 1$, the failure rate $h(t)$ is defined as [2]:

$$h(t) \equiv \frac{f(t)}{\overline{F}(t)}, \tag{7.2}$$

where $h(t)\Delta t \approx \Pr\{t < X \le t + \Delta t \mid X > t\}$ for small $\Delta t > 0$
represents the probability that an operating unit with age $t$ will fail during the
interval $(t, t + \Delta t]$. Therefore, an optimum $T^*$ that
minimizes $C(T)$ is a solution of:

$$\int_0^T \overline{F}(t)\,[h(T) - h(t)]\,dt = \frac{c_T}{c_F - c_T}. \tag{7.3}$$

It has been shown [4] that if $h(t)$ increases strictly to $h(\infty) \equiv \lim_{t \to \infty} h(t)$ and
$h(\infty) > c_F / [\mu (c_F - c_T)]$, then a finite and unique $T^*$ exists, and the optimum cost rate
is given by the failure rate $h(t)$ as:

$$C(T^*) = (c_F - c_T)\,h(T^*). \tag{7.4}$$

Equations (7.3) and (7.4) indicate the optimum time T * decreases while the expected
cost rate C (T * ) increases with the failure rate h(t ). This result means if we know
more about the properties of the failure rate, we can make better replacement deci-
sions for an operating unit in an economical way.
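
To make Equations (7.1), (7.3), and (7.4) concrete, the sketch below solves Equation (7.3) numerically for one assumed case. The Weibull-type survival function exp(-t^2) with failure rate h(t) = 2t and the costs c_T = 1 and c_F = 5 are hypothetical choices made only for illustration; Simpson's rule and bisection are used so that no external library is needed.

```python
import math

C_T, C_F = 1.0, 5.0          # hypothetical preventive and corrective costs


def surv(t):
    """Assumed survival function: Weibull with shape 2 and scale 1."""
    return math.exp(-t * t)


def h(t):
    """Failure rate f(t)/surv(t) of the assumed distribution."""
    return 2.0 * t


def int_surv(T, n=2000):
    """Simpson's rule for the integral of surv over [0, T] (n must be even)."""
    step = T / n
    total = surv(0.0) + surv(T)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * surv(i * step)
    return total * step / 3.0


def cost_rate(T):
    """Expected cost rate C(T) of Equation (7.1)."""
    return (C_T + (C_F - C_T) * (1.0 - surv(T))) / int_surv(T)


def lhs(T):
    """Left side of Equation (7.3), which equals h(T)*int_surv(T) - F(T)."""
    return h(T) * int_surv(T) - (1.0 - surv(T))


def optimum(lo=1e-6, hi=10.0, tol=1e-10):
    """Bisection on Equation (7.3); lhs is increasing because h is increasing."""
    target = C_T / (C_F - C_T)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lhs(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


T_star = optimum()
print(f"T* = {T_star:.4f}")
print(f"C(T*) from (7.1) = {cost_rate(T_star):.4f}")
print(f"(c_F - c_T) h(T*) from (7.4) = {(C_F - C_T) * h(T_star):.4f}")
```

The two printed cost values should agree up to numerical error, which is a quick consistency check between Equations (7.1), (7.3), and (7.4) for this assumed distribution.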
We recently proposed several new replacement models such as random replacement,
replacement first, replacement last, replacement overtime, and replacement
middle [5–8]. In these models, extended types of failure rates appear and
play important roles in obtaining optimum replacement times analytically.
It is therefore of interest to survey the reliability properties of these failure rates
and their further applications to recent maintenance models.
The standard failure rate h(t) has been defined in Equation (7.2). In Section 7.2, we
formulate several extended failure rates in inequality forms by integrating h(t) with
the replacement policy at time T, and we show examples in which these failure rates
appear in replacement models. In Section 7.3, when the replacement time T and work
number N become the decision variables, we introduce the failure rates that are found
in age and random replacement models. In Section 7.4, the failure rates in periodic
replacement models with minimal repairs are given and shown in periodic and random
replacement models. In Section 7.5, the failure rates and their inequalities are
shown for the model where replacement is done at failure number K.
The recent models of replacement first, replacement last, and replacement overtime
are surveyed for these failure rates in the following sections. In addition, we
give an appendix with the proofs for these extended failure rates.

7.2  INEQUALITIES OF FAILURE RATES


We give the cumulative hazard function $H(t)$, i.e., $H(t) \equiv \int_0^t h(u)\,du$, and obtain:

$$\overline{F}(t) = \exp\left[-\int_0^t h(u)\,du\right] = e^{-H(t)}, \quad \text{i.e.,} \quad H(t) = -\log \overline{F}(t). \tag{7.5}$$

This equation means that the functions $F(t)$, $h(t)$, and $H(t)$ determine each other.


Suppose the failure distribution F (t ) has an increasing failure rate (IFR) property,
and its failure rate h(t ) increases with t from h(0) to h(∞) ≡ lim t →∞ h(t ), which might
be infinity. Then we have the following inequalities [2,9]: For 0 < T < ∞:
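Inequality I:

$$\frac{\int_0^T F(t)\,dt}{\int_0^T\left[\int_0^t \overline{F}(u)\,du\right]dt}
\le \frac{F(T)}{\int_0^T \overline{F}(t)\,dt}
\le \frac{H(T)}{T}
\le h(T)
\le \frac{\overline{F}(T)}{\int_T^\infty \overline{F}(t)\,dt}
\le \frac{\int_T^\infty \overline{F}(t)\,dt}{\int_T^\infty\left[\int_t^\infty \overline{F}(u)\,du\right]dt}. \tag{7.6}$$
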

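Inequality II:

$$\frac{F(T)}{\int_0^T \overline{F}(t)\,dt}
\le \frac{\int_T^\infty F(t)\,dt}{\int_T^\infty\left[\int_0^t \overline{F}(u)\,du\right]dt}
\le \frac{1}{\mu}
\le \frac{\int_0^T \overline{F}(t)\,dt}{\int_0^T\left[\int_t^\infty \overline{F}(u)\,du\right]dt}
\le \frac{\overline{F}(T)}{\int_T^\infty \overline{F}(t)\,dt}. \tag{7.7}$$
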

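Generally, repeating the integral calculations in Equation (7.6), we obtain:
Inequality III:

$$\begin{aligned}
&\frac{\int_0^T\left\{\int_0^{t_n}\cdots\left[\int_0^{t_1} f(u)\,du\right]\cdots dt_{n-1}\right\}dt_n}
      {\int_0^T\left\{\int_0^{t_n}\cdots\left[\int_0^{t_1} \overline{F}(u)\,du\right]\cdots dt_{n-1}\right\}dt_n}
 \le \cdots \le \frac{\int_0^T f(t)\,dt}{\int_0^T \overline{F}(t)\,dt} \le h(T) \\
&\quad \le \frac{\int_T^\infty f(t)\,dt}{\int_T^\infty \overline{F}(t)\,dt}
 \le \cdots \le
 \frac{\int_T^\infty\left\{\int_{t_n}^\infty\cdots\left[\int_{t_1}^\infty f(u)\,du\right]\cdots dt_{n-1}\right\}dt_n}
      {\int_T^\infty\left\{\int_{t_n}^\infty\cdots\left[\int_{t_1}^\infty \overline{F}(u)\,du\right]\cdots dt_{n-1}\right\}dt_n}
 \quad (n = 1, 2, \cdots).
\end{aligned} \tag{7.8}$$
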

All of the above functions increase with $T$ and become the constant $h(T) = \lambda$ for $T \ge 0$ when
$F(t) = 1 - e^{-\lambda t}$.
We next give some other applications of the failure rates in Equations (7.2), (7.6),
and (7.7) to replacement policies planned at time T when h(t ) increases with t from
h(0) to h(∞).
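
As a numerical sanity check of the ordering in Inequality I, the sketch below evaluates its six ratios for one assumed IFR distribution, the survival function exp(-t^2) with h(t) = 2t and H(t) = t^2. The distribution and the function name `inequality_terms` are illustrative assumptions; the integrals are written in closed form with the error function, and the reading of each ratio is stated in the comments.

```python
import math

HALF_SQRT_PI = math.sqrt(math.pi) / 2.0


def inequality_terms(T):
    """Six ratios of Inequality I for the survival function exp(-t^2), in order."""
    F = 1.0 - math.exp(-T * T)           # distribution function at T
    Fbar = math.exp(-T * T)              # survival function at T
    A = HALF_SQRT_PI * math.erf(T)       # integral of Fbar over [0, T]
    B = HALF_SQRT_PI * math.erfc(T)      # integral of Fbar over [T, inf)
    AA = T * A - F / 2.0                 # int_0^T [int_0^t Fbar(u) du] dt
    BB = Fbar / 2.0 - T * B              # int_T^inf [int_t^inf Fbar(u) du] dt
    return [
        (T - A) / AA,    # int_0^T F(t) dt over the double integral on [0, T]
        F / A,           # F(T) / int_0^T Fbar(t) dt
        T,               # H(T) / T, since H(t) = t^2
        2.0 * T,         # h(T)
        Fbar / B,        # Fbar(T) / int_T^inf Fbar(t) dt
        B / BB,          # int_T^inf Fbar(t) dt over the double integral on [T, inf)
    ]


for T in (0.25, 0.5, 1.0, 1.5, 2.0):
    terms = inequality_terms(T)
    ordered = all(a <= b for a, b in zip(terms, terms[1:]))
    print(T, [round(x, 4) for x in terms], "ordered" if ordered else "NOT ordered")
```

A check at a handful of points is not a proof, but it helps catch a misread term or a swapped integration limit.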

Example 7.1

[4, p. 8] Suppose the unit only produces profit per unit of time when it is operating
without failure, and it is replaced preventively at time T (0 < T < ∞). Then the aver-
age time for operating profit during [0, T ] is:

$$l_0(T) = T \times \bar F(T) + 0 \times F(T) = T\bar F(T).$$

Optimum time $T_0$ to maximize $l_0(T)$ satisfies:

$$h(T) = \frac{1}{T}. \tag{7.9}$$

When $F(t) = 1 - e^{-\lambda t}$, optimum $T_0 = 1/\lambda$ means replacement should be made at the mean failure time.
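For a non-exponential case, Equation (7.9) usually has to be solved numerically. The following sketch does this by bisection for an assumed failure rate h(t) = 2t (survival function exp(-t^2)); the helper `bisect_root` and the bracketing interval are illustrative choices, not part of the original example.

```python
import math

def bisect_root(g, lo, hi, tol=1e-10):
    """Find a root of g on [lo, hi] by bisection, assuming g(lo) < 0 < g(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Assumed example: h(t) = 2t, so the survival function is exp(-t^2).
rate = lambda t: 2.0 * t
profit_time = lambda T: T * math.exp(-T * T)  # l_0(T) = T * survival(T)

# Equation (7.9): h(T) = 1/T, written as a root of h(T) - 1/T.
t0 = bisect_root(lambda t: rate(t) - 1.0 / t, 1e-6, 10.0)
```

For this particular rate the condition reduces to 2T^2 = 1, so the numerical root can be compared with 1/sqrt(2).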

Example 7.2 

[4, p. 8] Suppose there is one spare unit available for replacement, the operating
unit is replaced preventively with the spare one at time T (0 < T ≤ ∞), and the spare
unit should operate until failure. When both units have an identical failure distri-
bution F(t ) with mean µ , the mean time to failure of either unit is:

$$l_1(T) = \int_0^T t\,dF(t) + \bar F(T)(T + \mu) = \int_0^T \bar F(t)\,dt + \mu \bar F(T).$$

Optimum time $T_1$ to maximize $l_1(T)$ satisfies:

$$h(T) = \frac{1}{\mu}. \tag{7.10}$$

Next, suppose there are unlimited spare units available for replacement and each
unit has an identical failure distribution F(t ). When preventive replacement is
planned at time T (0 < T ≤ ∞), the mean time to failure of any unit is:

$$l(T) = \int_0^T t\,dF(t) + \bar F(T)[T + l(T)], \quad \text{i.e.,} \quad \frac{1}{l(T)} = \frac{F(T)}{\int_0^T \bar F(t)\,dt}, \tag{7.11}$$

which increases strictly with $T$ from $h(0)$ to $1/\mu$.


An Overview on Failure Rates in Maintenance Policies 169

Example 7.3 

[4, p. 8] The failure distribution of an operating unit with age T (0 ≤ T < ∞) is:

$$F(t; T) \equiv \Pr\{T < X \le T + t \mid X > T\} = \frac{F(t + T) - F(T)}{\bar F(T)}, \tag{7.12}$$

which is also called failure rate. The mean time to failure is:

$$\int_0^\infty [1 - F(t; T)]\,dt = \frac{1}{\bar F(T)}\int_0^\infty \bar F(t + T)\,dt = \frac{1}{\bar F(T)}\int_T^\infty \bar F(t)\,dt, \tag{7.13}$$

which decreases with $T$ from $\mu$ to $1/h(\infty)$.

7.3  AGE AND RANDOM REPLACEMENT POLICIES


Suppose the unit operates for jobs with successive working times $Y_j$ ($j = 1, 2, \ldots$), where the random variables $Y_j$ are independent and have an identical distribution $G(t) \equiv \Pr\{Y_j \le t\} = 1 - e^{-\theta t}$ ($0 < 1/\theta < \infty$). When the unit is replaced preventively at time $T$ or at working number $N$, we give the following inequalities of the extended failure rates: For $0 < T < \infty$ and $N = 0, 1, 2, \ldots$:
Inequality IV:

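$$
\frac{\int_0^T (\theta t)^N e^{-\theta t}\,dF(t)}{\int_0^T (\theta t)^N e^{-\theta t}\bar F(t)\,dt}
\le \frac{\int_0^T t^N\,dF(t)}{\int_0^T t^N \bar F(t)\,dt}
\le h(T)
\le \frac{\int_T^\infty (\theta t)^N e^{-\theta t}\,dF(t)}{\int_T^\infty (\theta t)^N e^{-\theta t}\bar F(t)\,dt}
\le \frac{\int_T^\infty t^N\,dF(t)}{\int_T^\infty t^N \bar F(t)\,dt}. \tag{7.14}
$$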

Inequality V:

T T

∫ e −θ t dF (t )
∫ t dF (t ) ≤ h(T ) ≤ F (T ) N
F (T )
0
T ≤ ≤ T
0
T ∞

∫e0
−θ t
F (t )dt
∫ F (t )dt ∫ t F (t )dt
0 ∫ F (t )dt 0
N

∞ ∞ (7.15)


∫ T

t N dF (t )

∫ T

e−θ t dF (t )
.
∫t ∫e
N −θ t
F (t )dt F (t )dt
T T

Inequality VI:
T t T ∞ t

∫0
T

(θ t ) N [ e −θ u dF (u)]dt
t
0

∫ 0
T
e −θ t dF
F (t )

∫ T
T ∫
(θ t ) N [ e −θ u dF (u)]dt

0

∫ (θ t ) [∫ e −θ u
∫e ∫ (θ t ) [∫ e
N −θ t N −θ u
F (u)du]dt F (t )dt F (u)du]dt
0 0 0 0 t

T ∞ ∞


∫ 0
T
∫ (θ t ) N [
t

e −θ u dF (u)]dt

∫e T

−θ t
dF (t )
(7.16)
∫ (θ t ) [∫ e ∫e
N −θ u −θ t
F (u)du]dt F (t )dt
0 t T

∞ ∞


∫ (θ t ) [∫ e
T

N

t

−θ u
dF (u)]dt
.
∫ (θ t ) [∫ e N −θ u
F (u)du]dt
T t

Note that all these functions increase with T and N .


Furthermore, we obtain:
Inequality VII:

 t −θ u  T

∫ ∫
T T

0 ∫ ≤ T t
0 0
e dF (u)  dt
e −θ t F (t )dt

 ≤
∫e
0
T
−θ t
dF (t )

−θ t    −θ u 
T t

∫ 0
e  F (u)du  dt
 0  ∫  e F (u)du  dt
0  0  ∫ ∫ ∫e
0
−θ t
F ( t ) dt

∫e
−θ t
dF (t )
F (T ) F (T )
≤ T ≤ h(T ) ≤ T
∞ ≤ ∞

∫ F (t )dt
0 ∫e T
−θ t
F (t )dt
∫ T
F (t )dt

∞ ∞ ∞


∫ ∫ T

[
t

e −θ u dF (u)]dt

∫e ∞
T
−θ t


F ( t ) dt
, (7.17)
∫ [∫ e T t
−θ u
F (u)du]dt
∫ e [∫ T
−θ t

t
F (u)du]dt

where all these functions increase with T .

Example 7.4 

[5,  p. 30] Suppose when the random time Y has an exponential distribution
Pr{Y ≤ t } = 1− e −θ t , the unit is replaced preventively at time T (0 < T ≤ ∞) or at time
Y , whichever occurs first. Then the expected cost rate is:

$$C(T) = \frac{c_T + (c_F - c_T)\int_0^T e^{-\theta t}\,dF(t)}{\int_0^T e^{-\theta t}\bar F(t)\,dt}, \tag{7.18}$$

where:
cT = replacement cost at time T or at time Y,
cF = replacement cost at failure with cF > cT .

Optimum time T to minimize C (T ) satisfies:

$$\int_0^T e^{-\theta t}\bar F(t)[h(T) - h(t)]\,dt = \frac{c_T}{c_F - c_T}. \tag{7.19}$$
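The optimum time in Equation (7.19) rarely has a closed form. The sketch below solves it by bisection and evaluates the cost rate (7.18) for comparison. The failure rate h(t) = 2t, the parameter values, the midpoint-rule integrator, and all function names are assumptions chosen only for illustration.

```python
import math

def integrate(g, a, b, steps=2000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    dx = (b - a) / steps
    return sum(g(a + (i + 0.5) * dx) for i in range(steps)) * dx

theta, c_T, c_F = 1.0, 1.0, 5.0      # assumed parameter values
rate = lambda t: 2.0 * t             # assumed increasing failure rate h(t)
surv = lambda t: math.exp(-t * t)    # matching survival function exp(-H(t))

def lhs(T):
    """Left-hand side of Equation (7.19)."""
    return integrate(lambda t: math.exp(-theta * t) * surv(t) * (rate(T) - rate(t)), 0.0, T)

def cost_rate(T):
    """Expected cost rate C(T) of Equation (7.18), using dF(t) = h(t) surv(t) dt."""
    num = c_T + (c_F - c_T) * integrate(
        lambda t: math.exp(-theta * t) * rate(t) * surv(t), 0.0, T)
    den = integrate(lambda t: math.exp(-theta * t) * surv(t), 0.0, T)
    return num / den

def solve_optimum(target, lo=1e-6, hi=10.0, tol=1e-9):
    """Bisection on lhs(T) = target; lhs is nondecreasing in T when h increases."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lhs(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

T_opt = solve_optimum(c_T / (c_F - c_T))
```

Evaluating `cost_rate` slightly to either side of `T_opt` gives a quick sanity check that the root of (7.19) is indeed a minimizer for these assumed inputs.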

Example 7.5 

[5, p. 44] Suppose the unit is replaced preventively at time $T$ ($0 < T \le \infty$) or at working number $N$ ($N = 1, 2, \ldots$), i.e., at $Y_1 + Y_2 + \cdots + Y_N$, whichever occurs first. Denoting $G^{(j)}(t) \equiv \Pr\{Y_1 + Y_2 + \cdots + Y_j \le t\}$ ($j = 1, 2, \ldots$) and $G^{(0)}(t) \equiv 1$ for $t > 0$, the expected cost rate is:

$$C(T, N) = \frac{c_T + (c_F - c_T)\int_0^T [1 - G^{(N)}(t)]\,dF(t)}{\int_0^T [1 - G^{(N)}(t)]\bar F(t)\,dt}. \tag{7.20}$$

When $G(t) = 1 - e^{-\theta t}$, $G^{(N)}(t) = \sum_{j=N}^{\infty} [(\theta t)^j / j!]e^{-\theta t}$ ($N = 0, 1, 2, \ldots$). Optimum time $T$ to minimize $C(T, N)$ satisfies:

$$\sum_{j=0}^{N-1}\int_0^T \frac{(\theta t)^j}{j!}e^{-\theta t}\bar F(t)[h(T) - h(t)]\,dt = \frac{c_T}{c_F - c_T}, \tag{7.21}$$

and optimum number $N$ to minimize $C(T, N)$ satisfies:

$$\frac{\int_0^T (\theta t)^N e^{-\theta t}\,dF(t)}{\int_0^T (\theta t)^N e^{-\theta t}\bar F(t)\,dt}\sum_{j=0}^{N-1}\int_0^T \frac{(\theta t)^j}{j!}e^{-\theta t}\bar F(t)\,dt - \sum_{j=0}^{N-1}\int_0^T \frac{(\theta t)^j}{j!}e^{-\theta t}\,dF(t) \ge \frac{c_T}{c_F - c_T}. \tag{7.22}$$

Example 7.6 

[5, p. 46] Suppose the unit is replaced preventively at time $T$ ($0 < T \le \infty$) or at working number $N$ ($N = 1, 2, \ldots$), whichever occurs last. Then the expected cost rate is:

$$C(T, N) = \frac{c_T + (c_F - c_T)\left\{F(T) + \int_T^\infty [1 - G^{(N)}(t)]\,dF(t)\right\}}{\int_0^T \bar F(t)\,dt + \int_T^\infty [1 - G^{(N)}(t)]\bar F(t)\,dt}. \tag{7.23}$$

When $G(t) = 1 - e^{-\theta t}$, optimum time $T$ to minimize $C(T, N)$ satisfies:

$$\int_0^T \bar F(t)[h(T) - h(t)]\,dt - \sum_{j=0}^{N-1}\int_T^\infty \frac{(\theta t)^j}{j!}e^{-\theta t}\bar F(t)[h(t) - h(T)]\,dt = \frac{c_T}{c_F - c_T}, \tag{7.24}$$

and optimum number $N$ to minimize $C(T, N)$ satisfies:

$$\frac{\int_T^\infty (\theta t)^N e^{-\theta t}\,dF(t)}{\int_T^\infty (\theta t)^N e^{-\theta t}\bar F(t)\,dt}\left[\int_0^T \bar F(t)\,dt + \sum_{j=0}^{N-1}\int_T^\infty \frac{(\theta t)^j}{j!}e^{-\theta t}\bar F(t)\,dt\right] - F(T) - \sum_{j=0}^{N-1}\int_T^\infty \frac{(\theta t)^j}{j!}e^{-\theta t}\,dF(t) \ge \frac{c_T}{c_F - c_T}. \tag{7.25}$$

Example 7.7 

[5, p. 34] Suppose the unit is replaced preventively at the next working time over time $T$, e.g., at time $Y_{j+1}$ for $Y_j < T \le Y_{j+1}$. When $G(t) = 1 - e^{-\theta t}$, the expected cost rate is:

$$C(T) = \frac{c_F - (c_F - c_T)\int_T^\infty \theta e^{-\theta (t - T)}\bar F(t)\,dt}{\int_0^T \bar F(t)\,dt + \int_T^\infty e^{-\theta (t - T)}\bar F(t)\,dt}, \tag{7.26}$$


where:
cT = replacement cost over time T,
cT < c F .

Optimum time T to minimize C (T ) satisfies:


$$\frac{\int_T^\infty e^{-\theta t}\,dF(t)}{\int_T^\infty e^{-\theta t}\bar F(t)\,dt}\int_0^T \bar F(t)\,dt - F(T) = \frac{c_T}{c_F - c_T}. \tag{7.27}$$

Example 7.8 

[5, p. 9] Suppose the unit is replaced preventively at the end of the next working
number over time T or at working number N (N = 1,2,), whichever occurs first.
Then the expected cost rate is:

C (T , N) =
 N −1 N −1 

∑[(θ T )

cF − (cF − cT ) − [(θ T ) j / j!]e −θ T
∫ θ e −θ t F(t )dt − / j!]e −θ T
∫ θe
j −θ t
F(t )dt 
 j =0 
T T
j =0
N −1 .
∑∫
T ∞

j =0
{ [(θ t ) j / j!]e −θ t F(t )dt + [(θ T ) j / j!]e −θ T
0 ∫
T
e −θ t F(t )dt }

(7.28)

Optimum time T to minimize C (T , N) satisfies:

∫e −θ t
dF(t ) N −1 N −1
(θ t ) j −θ t (θ t ) j −θ t
∑∫ ∑∫
T T
cT
T

e F(t )dt − e dF(t ) = , (7.29)
j! j! cF − cT
∫e −θ t 0 0
F(t )dt j =0 j =0
T


and optimum number N to minimize C (T , N) satisfies:

T
 ∞ 
∫ 0  t ∫
(θ t )N −1  e −θ udF(u) dt

N −1  
T ∞

∫ 0
−θ u
(θ t )  e F(u)du  dt
 t ∫ 
N −1
(7.30)
 (θ T ) j (θ t ) j −θ t 

∞ T

∫e ∫
−θ t
×  F(t )dt + e F(t )dt 
j =0  j! T 0 j! 
N −1
 (θ T ) j (θ t ) j −θ t 
∑ 
∞ T
cT



j =0
j! ∫ T
e −θ t dF(t ) +

0 j!
e dF(t ) ≥
 cF − cT
.


Example 7.9 

[6, p. 13] Suppose the unit is replaced preventively at the end of the next working
number over time T or at working number N (N = 1,2,), whichever occurs last.
Then the expected cost rate is:

C (T , N) =
 ∞ ∞ 
∑[(θ T )

∫ θ[(θ t )N −1 / (N − 1)!]e −θ t F(t )dt − / j!]e −θ T


∫ θe −θ t
cF − (cF − cT )  j
F(t )dt 
 
T T
j =N
N −1 ∞
.

∫ F(t)dt + ∑∫ [(θ t)

∑[(θ T )
T ∞

∫e
−θ t −θ T −θ t
j
/ j!]e F(t )dt + j
/ j!]e F(t )dt
0 T T
j =0 j =N
 (7.31)

Optimum time T to minimize C (T , N) satisfies:

∫e −θ t
dF(t )  N −1 
(θ t ) j −θ t
∑∫

 T
T
∞  ∫ F(t )dt +
j!
e F(t )dt 
∫e F(t )dt  
−θ t 0 T
j =0
T (7.32)
N −1
(θ t ) −θ t
∑∫
∞ j
cT
− F(T ) − e dF(t ) = ,
T j! cF − cT
j =0

and optimum number N to minimize C (T , N) satisfies:


∞ ∞

∫ (θ t) [∫ e N −θ u
dF(u)]dt  N −1
(θ t ) j −θ t
∑∫
T ∞
T

t


F(u)du]dt 
∫ F(t )dt +
j!
e F(t )dt
∫ (θ t) [∫ e N −θ u 0 T
j =0
T t

(θ T )
∞ 
+∑
j ∞
(7.33)
e −θ T
∫e F(t )dt 
−θ t
j! T 
j =N 

 (θ T ) j −θ T (θ t ) j −θ t 
∑ 
∞ ∞
cT



j =N
j!
e
∫ T
e −θ t dF(t ) −
∫ T j!
e dF(t ) ≥
 cF − cT
.


7.4  PERIODIC AND RANDOM REPLACEMENT POLICIES


Suppose the unit operates for jobs with working times $Y_j$ defined in Section 7.3 and undergoes minimal repairs at failures. When the unit is replaced at time $T$ or at working number $N$, we give the following inequalities of the extended failure rates: For $0 < T < \infty$ and $N = 0, 1, 2, \ldots$:
Inequality VIII:
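$$
\frac{\int_0^T e^{-\theta t}h(t)\,dt}{\int_0^T e^{-\theta t}\,dt}
\le \frac{\int_0^T (\theta t)^N e^{-\theta t}h(t)\,dt}{\int_0^T (\theta t)^N e^{-\theta t}\,dt}
\le \frac{\int_0^T t^N h(t)\,dt}{\int_0^T t^N\,dt}
\le h(T)
\le \int_0^\infty \theta e^{-\theta t}h(t + T)\,dt
\le \frac{\int_T^\infty (\theta t)^N e^{-\theta t}h(t)\,dt}{\int_T^\infty (\theta t)^N e^{-\theta t}\,dt}. \tag{7.34}
$$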
Inequality IX:
T
 t  ∞
 t

∫ ∫ ∫ ∫ e h(u)du  dt
T
(θ t ) N  e −θ u h(u)du  dt

−θ u
e −θ t h(t )dt (θ t ) N 
0  0  ≤ 0 ≤
T  0
T
T
 t
 ∞
 t


0  0 ∫
(θ t ) N  e −θ u du  dt
 0

e dtθ t

T
(θ t ) N ∫ ∫  ∫ e du  dt
 0
−θ u


T
 ∞ 
∫ ∫
(θ t ) N  e −θ u h(u)du  dt ∞
 t  ≤ e −θ t h(t + T )dt

0
≤ (7.35)
T
 ∞

∫ ∫
0
(θ t ) N  e−θ u du  dt
0  t 

 ∞



T t ∫
(θ t ) N  e −θ u h(u)du  dt
  .
N  
∞ ∞

∫ ∫
−θ u
(θ t )  e du  dt
T  t 
Note that all these functions increase with T and N .

Example 7.10 

[5, p. 65] Suppose the unit is replaced at time T (0 < T ≤ ∞) or at time Y , whichever
occurs first. Then the expected cost rate is:

$$C(T) = \frac{c_T + c_M\int_0^T e^{-\theta t}h(t)\,dt}{\int_0^T e^{-\theta t}\,dt}, \tag{7.36}$$


where:
cT = replacement cost at time T or at time Y,
cM = cost of minimal repair at each failure.

Optimum time T to minimize C (T ) satisfies:

$$\int_0^T e^{-\theta t}[h(T) - h(t)]\,dt = \frac{c_T}{c_M}. \tag{7.37}$$
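Equation (7.37) can be checked on a case with a closed form. For the assumed failure rate h(t) = 2t and θ = 1, the left-hand side integrates to 2(T − 1 + e^{−T}). The sketch below solves (7.37) numerically and compares the result with that expression; the cost ratio, integrator, and names are illustrative assumptions.

```python
import math

def integrate(g, a, b, steps=2000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    dx = (b - a) / steps
    return sum(g(a + (i + 0.5) * dx) for i in range(steps)) * dx

theta = 1.0                 # assumed rate of the exponential working time
cost_ratio = 2.0            # assumed value of c_T / c_M
rate = lambda t: 2.0 * t    # assumed increasing failure rate h(t)

def lhs(T):
    """Left-hand side of Equation (7.37)."""
    return integrate(lambda t: math.exp(-theta * t) * (rate(T) - rate(t)), 0.0, T)

lo, hi = 1e-6, 10.0
while hi - lo > 1e-9:       # bisection; lhs is nondecreasing in T
    mid = 0.5 * (lo + hi)
    if lhs(mid) < cost_ratio:
        lo = mid
    else:
        hi = mid
T_opt = 0.5 * (lo + hi)

# Closed form valid only for the assumed h(t) = 2t and theta = 1.
closed_form = 2.0 * (T_opt - 1.0 + math.exp(-T_opt))
```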

Example 7.11 

[5, p. 77] Suppose the unit is replaced at time $T$ ($0 < T \le \infty$) or at working number $N$ ($N = 1, 2, \ldots$), whichever occurs first. Then, the expected cost rate is:

$$C(T, N) = \frac{c_T + c_M\int_0^T [1 - G^{(N)}(t)]h(t)\,dt}{\int_0^T [1 - G^{(N)}(t)]\,dt}. \tag{7.38}$$

When $G(t) = 1 - e^{-\theta t}$, optimum time $T$ to minimize $C(T, N)$ satisfies:

$$\sum_{j=0}^{N-1}\int_0^T \frac{(\theta t)^j}{j!}e^{-\theta t}[h(T) - h(t)]\,dt = \frac{c_T}{c_M}, \tag{7.39}$$

and optimum number $N$ to minimize $C(T, N)$ satisfies:

$$\frac{\int_0^T (\theta t)^N e^{-\theta t}h(t)\,dt}{\int_0^T (\theta t)^N e^{-\theta t}\,dt}\sum_{j=0}^{N-1}\int_0^T \frac{(\theta t)^j}{j!}e^{-\theta t}\,dt - \sum_{j=0}^{N-1}\int_0^T \frac{(\theta t)^j}{j!}e^{-\theta t}h(t)\,dt \ge \frac{c_T}{c_M}. \tag{7.40}$$

Example 7.12 

[5, p. 79] Suppose the unit is replaced at time $T$ ($0 \le T \le \infty$) or at working number $N$ ($N = 0, 1, 2, \ldots$), whichever occurs last. Then, the expected cost rate is:

$$C(T, N) = \frac{c_T + c_M\left\{H(T) + \int_T^\infty [1 - G^{(N)}(t)]h(t)\,dt\right\}}{T + \int_T^\infty [1 - G^{(N)}(t)]\,dt}. \tag{7.41}$$

When $G(t) = 1 - e^{-\theta t}$, optimum time $T$ to minimize $C(T, N)$ satisfies:

$$\int_0^T [h(T) - h(t)]\,dt - \sum_{j=0}^{N-1}\int_T^\infty \frac{(\theta t)^j}{j!}e^{-\theta t}[h(t) - h(T)]\,dt = \frac{c_T}{c_M}, \tag{7.42}$$

and optimum number $N$ to minimize $C(T, N)$ satisfies:

$$\frac{\int_T^\infty (\theta t)^N e^{-\theta t}h(t)\,dt}{\int_T^\infty (\theta t)^N e^{-\theta t}\,dt}\left[T + \sum_{j=0}^{N-1}\int_T^\infty \frac{(\theta t)^j}{j!}e^{-\theta t}\,dt\right] - H(T) - \sum_{j=0}^{N-1}\int_T^\infty \frac{(\theta t)^j}{j!}e^{-\theta t}h(t)\,dt \ge \frac{c_T}{c_M}. \tag{7.43}$$

Example 7.13 

[6, p. 39] Suppose the unit is replaced at the end of the next working number over
time T. When G(t ) = 1− e −θ t , the expected cost rate is:

$$C(T) = \frac{c_T + c_M\left[H(T) + \int_0^\infty e^{-\theta t}h(t + T)\,dt\right]}{T + 1/\theta}. \tag{7.44}$$

Optimum time T to minimize C (T ) satisfies:

$$\int_0^T \left\{\int_0^\infty \theta e^{-\theta u}[h(u + T) - h(t)]\,du\right\}dt = \frac{c_T}{c_M}. \tag{7.45}$$

Example 7.14 

[6, p. 41] Suppose the unit is replaced at the next working number over time $T$ ($0 < T \le \infty$) or at working number $N$ ($N = 1, 2, \ldots$), whichever occurs first. Then, the expected cost rate is:

$$C(T, N) = \frac{c_T + c_M\left\{\int_0^T [1 - G^{(N)}(t)]h(t)\,dt + \sum_{j=0}^{N-1}\int_0^T \left[\int_T^\infty \bar G(u - t)h(u)\,du\right]dG^{(j)}(t)\right\}}{\int_0^T [1 - G^{(N)}(t)]\,dt + (1/\theta)[1 - G^{(N)}(T)]}. \tag{7.46}$$

When $G(t) = 1 - e^{-\theta t}$, optimum time $T$ to minimize $C(T, N)$ satisfies:

$$\sum_{j=0}^{N-1}\int_0^T \frac{(\theta t)^j}{j!}e^{-\theta t}\left\{\int_0^\infty \theta e^{-\theta u}[h(u + T) - h(t)]\,du\right\}dt = \frac{c_T}{c_M}, \tag{7.47}$$

and optimum number N to minimize C (T , N) satisfies:

T
 ∞

∫ (θ t) ∫ e h(u)du dt 
N −1 −θ u
N −1
θ (θ t ) j −θ t (θ T ) j −θ T 
∑ ∫
T
0 t
T
e dt + e 
j! j!
∫ (θ t) e dt 
N −1 −θ t 0
j =0
0
 (7.48)
N −1
 (θ t ) j −θ t (θ T ) j  c
∑∫
T ∞


j =0 

0 j!
e h(t )dt +
j! ∫ T
e −θ t h(t )dt  ≥ T .
 cM

Example 7.15 

[6, p.  44] Suppose the unit is replaced at the next working number over time
T (0 ≤ T ≤ ∞) or at working number N (N = 0,1,2,), whichever occurs last. Then,
the expected cost rate is:

 ∞ ∞
  
∑∫ ∫ G(u − t)h(u)du dG
T ∞


( j)
cT + cM H(T ) + [1− G (N) (t )]h(t )dt + (t )
 T
j =N
0 T 
C (T , N) = ∞
. (7.49)
T+
∫ [1− G
T
(N )
(t )]dt + (1/ θ )G (N) (T )

When G(t ) = 1− e −θ t , optimum time T to minimize C (T , N) satisfies:



T
 ∞

∫ ∫ θ e
−θ u
[h(u + T ) − h(t )]du  dt
0 0 
N −1
(7.50)
(θ t ) j −θ t  
∑∫
∞ ∞
c
+
j =0
T j!
e 
 ∫ 0
θ e −θ u[h(u + T ) − h(t )]du  dt = T ,
 cM

and optimum number N to minimize C (T , N) satisfies:


∞ ∞

∫ ∫e −θ u
(θ t )N −1[ h(u)du]dt 
(θ T ) j −θ T 
N −1
T

t 1+ θ T +
 ∑ j!
e  − H(T )

∫ T
(θ t )N −1e −θ t dt  j =0 
(7.51)
N −1
∞  (θ t ) −θ t
j
(θ T )  c
∑∫
∞ j ∞

∫e ∫
−θ t
− h(t + T )dt −  e h(t )dt − e −θ t h(t + T )dt  ≥ T .
0
j =0  T j! j! 0  cM

7.5  PERIODIC REPLACEMENT POLICIES WITH FAILURE NUMBERS


It is assumed that failures occur according to a non-homogeneous Poisson process with mean value function $H(t) \equiv \int_0^t h(u)\,du$. Then the probability that $k$ failures occur in $[0, t]$ is [9, p. 27]:

$$p_k(t) \equiv \frac{H(t)^k}{k!}e^{-H(t)} \quad (k = 0, 1, 2, \ldots),$$

and the probability that $k$ or more failures occur in $[0, t]$ is $P_k(t) = \sum_{j=k}^{\infty} p_j(t)$. Note that $\bar P_k(t) \equiv 1 - P_k(t) = \sum_{j=0}^{k-1} p_j(t)$. Suppose the unit undergoes minimal repair at failures and is replaced at time $T$ or at failure number $K$. We give the following inequalities of the extended failure rates: For $0 < T < \infty$ and $K = 0, 1, 2, \ldots$:
Inequality X:
$$
\frac{F(T)}{\int_0^T \bar F(t)\,dt}
\le \frac{\int_0^T p_K(t)h(t)\,dt}{\int_0^T p_K(t)\,dt}
\le h(T)
\le \frac{\bar F(T)}{\int_T^\infty \bar F(t)\,dt}
\le \frac{\int_T^\infty p_K(t)h(t)\,dt}{\int_T^\infty p_K(t)\,dt}. \tag{7.52}
$$
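The Poisson probabilities of this section and the first three terms of Inequality X can be evaluated directly. The sketch below does so for an assumed failure rate h(t) = 2t (so H(t) = t^2) with T = 1 and K = 2; the function names and the midpoint-rule integrator are illustrative assumptions rather than anything prescribed by the text.

```python
import math

def integrate(g, a, b, steps=2000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    dx = (b - a) / steps
    return sum(g(a + (i + 0.5) * dx) for i in range(steps)) * dx

rate = lambda t: 2.0 * t                    # assumed failure rate h(t)
cum_hazard = lambda t: t * t                # mean value function H(t)
surv = lambda t: math.exp(-cum_hazard(t))   # survival function exp(-H(t))

def p(k, t):
    """Probability of exactly k failures in [0, t]."""
    return cum_hazard(t) ** k / math.factorial(k) * math.exp(-cum_hazard(t))

def p_bar(k, t):
    """Probability of fewer than k failures in [0, t]."""
    return sum(p(j, t) for j in range(k))

def extended_rate(K, T):
    """Ratio of the integrals of p_K(t) h(t) and p_K(t) over [0, T]."""
    num = integrate(lambda t: p(K, t) * rate(t), 0.0, T)
    den = integrate(lambda t: p(K, t), 0.0, T)
    return num / den

T, K = 1.0, 2
lower = (1.0 - surv(T)) / integrate(surv, 0.0, T)   # F(T) / integral of survival
middle = extended_rate(K, T)
upper = rate(T)                                     # h(T)
```

With K = 0 the middle term reduces to the lower bound, since p_0(t) is the survival function.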

Inequality XI:
T T

∫ 0
pK (t )h(t )dt
T
≤ T
∫ p (t)h(t)dt 0

K
F (T )
∞ ∞

∫ 0
pK (t )dt
∫ p (t)[∫ F(u)du / F(t)]dt ∫ F(t)dt
0
K
t T

∞ ∞


∫ T
pK (t )h(t )dt

≤ ∞
∫ T
pK (t )h(t )dt

. (7.53)
∫ T
pK (t )dt
∫ T
pK (t )[

t
F(u)du / F(t )]dt

Inequality XII:
$$
\frac{\int_0^T p_K(t)h(t)\,dt}{\int_0^T p_K(t)\,dt}
\le \frac{1}{\int_0^\infty p_K(t)\,dt}
\le \frac{\int_T^\infty p_K(t)h(t)\,dt}{\int_T^\infty p_K(t)\,dt}, \tag{7.54}
$$

1

∫ p (t )h(t )dt0
K
≤ ∞
1
µ T
  ∞

∫ p (t )  ∫ F (u)du / F (t ) dt ∫
0
K
t 0
pK +1(t )dt


(7.55)


∫ T
pK (t )h(t )dt
.

 ∞

∫T
pK ( t ) 
 ∫ t
F (u)du / F (t )  dt

Example 7.16 

[2, p. 104] Suppose the unit is replaced at failure number $K$ ($K = 1, 2, \ldots$). Then, the expected cost rate is:

$$C(K) = \frac{c_K + c_M K}{\int_0^\infty \bar P_K(t)\,dt}, \tag{7.56}$$

where $c_K$ = replacement cost at failure number $K$. Optimum $K$ to minimize $C(K)$ satisfies:

$$\frac{1}{\int_0^\infty p_K(t)\,dt}\int_0^\infty \bar P_K(t)\,dt - K \ge \frac{c_K}{c_M}. \tag{7.57}$$

Example 7.17 

[10] Suppose the unit is replaced at time $T$ ($0 < T \le \infty$) or at failure number $K$ ($K = 1, 2, \ldots$), whichever occurs first. Then, the expected cost rate is:

$$C(T, K) = \frac{c_T + c_M\int_0^T \bar P_K(t)h(t)\,dt}{\int_0^T \bar P_K(t)\,dt}, \tag{7.58}$$

where $c_T$ = replacement cost at time $T$ or at failure number $K$. Optimum $T$ to minimize $C(T, K)$ satisfies:

$$h(T)\int_0^T \bar P_K(t)\,dt - \int_0^T \bar P_K(t)h(t)\,dt = \frac{c_T}{c_M}, \tag{7.59}$$

and optimum $K$ to minimize $C(T, K)$ satisfies:

$$\frac{\int_0^T p_K(t)h(t)\,dt}{\int_0^T p_K(t)\,dt}\int_0^T \bar P_K(t)\,dt - \int_0^T \bar P_K(t)h(t)\,dt \ge \frac{c_T}{c_M}. \tag{7.60}$$

Example 7.18 

[10] Suppose the unit is replaced at time $T$ ($0 \le T \le \infty$) or at failure number $K$ ($K = 0, 1, 2, \ldots$), whichever occurs last. Then, the expected cost rate is:

$$C(T, K) = \frac{c_T + c_M\left[H(T) + \int_T^\infty \bar P_K(t)h(t)\,dt\right]}{T + \int_T^\infty \bar P_K(t)\,dt}. \tag{7.61}$$

Optimum $T$ to minimize $C(T, K)$ satisfies:

$$h(T)\left[T + \int_T^\infty \bar P_K(t)\,dt\right] - H(T) - \int_T^\infty \bar P_K(t)h(t)\,dt = \frac{c_T}{c_M}, \tag{7.62}$$

and optimum $K$ to minimize $C(T, K)$ satisfies:

$$\frac{\int_T^\infty p_K(t)h(t)\,dt}{\int_T^\infty p_K(t)\,dt}\left[T + \int_T^\infty \bar P_K(t)\,dt\right] - H(T) - \int_T^\infty \bar P_K(t)h(t)\,dt \ge \frac{c_T}{c_M}. \tag{7.63}$$

Example 7.19 

[6, p. 47] Suppose the unit is replaced at the first failure over time $T$ ($0 \le T < \infty$). Then, the expected cost rate is:

$$C(T) = \frac{c_T + c_M[H(T) + 1]}{T + \int_T^\infty e^{-[H(t) - H(T)]}\,dt}, \tag{7.64}$$

where $c_T$ = replacement cost over time $T$. Optimum $T$ to minimize $C(T)$ satisfies:

$$TQ(T) - H(T) = \frac{c_T}{c_M}, \tag{7.65}$$

where:

$$Q(T) \equiv \frac{\bar F(T)}{\int_T^\infty \bar F(t)\,dt}.$$
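To see Equation (7.65) at work, the sketch below takes the assumed failure rate h(t) = 2t, for which the survival function is exp(−t^2) and the tail integral in Q(T) can be written with the complementary error function. The cost values and function names are assumptions for illustration; the identity that the integral in the denominator of (7.64) equals 1/Q(T) follows from the survival function being exp(−H(t)).

```python
import math

c_T, c_M = 1.0, 1.0              # assumed costs
cum_hazard = lambda t: t * t     # H(t) for the assumed rate h(t) = 2t

def Q(T):
    """Q(T) = survival(T) / integral of survival over [T, infinity)."""
    tail = 0.5 * math.sqrt(math.pi) * math.erfc(T)  # integral of exp(-t^2) over [T, inf)
    return math.exp(-T * T) / tail

def condition(T):
    """Left-hand side of Equation (7.65) minus c_T / c_M."""
    return T * Q(T) - cum_hazard(T) - c_T / c_M

def cost_rate(T):
    """Expected cost rate of Equation (7.64); its tail integral equals 1 / Q(T)."""
    return (c_T + c_M * (cum_hazard(T) + 1.0)) / (T + 1.0 / Q(T))

lo, hi = 1e-9, 5.0
while hi - lo > 1e-12:           # bisection; condition(T) increases with T here
    mid = 0.5 * (lo + hi)
    if condition(mid) < 0:
        lo = mid
    else:
        hi = mid
T_opt = 0.5 * (lo + hi)
```

Comparing `cost_rate` at `T_opt` with nearby values gives a quick check that the root of (7.65) minimizes (7.64) for these assumed inputs.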


Example 7.20 

[6, p. 47] Suppose the unit is replaced at failure number $K$ ($K = 1, 2, \ldots$) or at the first failure over time $T$ ($0 \le T < \infty$), whichever occurs first. Then, the expected cost rate is:

$$C(T, K) = \frac{c_T + c_M\left[\int_0^T \bar P_K(t)h(t)\,dt + \bar P_K(T)\right]}{\int_0^T \bar P_K(t)\,dt + \bar P_K(T)\int_T^\infty e^{-[H(t) - H(T)]}\,dt}. \tag{7.66}$$

Optimum $T$ to minimize $C(T, K)$ satisfies:

$$Q(T)\int_0^T \bar P_K(t)\,dt - \int_0^T \bar P_K(t)h(t)\,dt = \frac{c_T}{c_M}, \tag{7.67}$$

and optimum $K$ to minimize $C(T, K)$ satisfies:

$$\frac{\int_0^T p_{K-1}(t)h(t)\,dt}{\int_0^T p_{K-1}(t)[h(t)/Q(t)]\,dt}\left[\int_0^T \bar P_K(t)\,dt + \bar P_K(T)\int_T^\infty e^{-[H(t) - H(T)]}\,dt\right] - \int_0^T \bar P_K(t)h(t)\,dt - \bar P_K(T) \ge \frac{c_T}{c_M}. \tag{7.68}$$

Example 7.21 

[6, p. 50] Suppose the unit is replaced at failure number $K$ ($K = 0, 1, 2, \ldots$) or at the first failure over time $T$ ($0 \le T < \infty$), whichever occurs last. Then, the expected cost rate is:

$$C(T, K) = \frac{c_T + c_M\left[H(T) + \int_T^\infty \bar P_K(t)h(t)\,dt + P_K(T)\right]}{T + \int_T^\infty \bar P_K(t)\,dt + P_K(T)\int_T^\infty e^{-[H(t) - H(T)]}\,dt}. \tag{7.69}$$

Optimum $T$ to minimize $C(T, K)$ satisfies:

$$Q(T)\left[T + \int_T^\infty \bar P_K(t)\,dt\right] - \left[H(T) + \int_T^\infty \bar P_K(t)h(t)\,dt\right] = \frac{c_T}{c_M}, \tag{7.70}$$

and optimum $K$ to minimize $C(T, K)$ satisfies:

$$\frac{\int_T^\infty p_{K-1}(t)h(t)\,dt}{\int_T^\infty p_{K-1}(t)[h(t)/Q(t)]\,dt}\left[T + \int_T^\infty \bar P_K(t)\,dt + P_K(T)\int_T^\infty e^{-[H(t) - H(T)]}\,dt\right] - \left[H(T) + \int_T^\infty \bar P_K(t)h(t)\,dt + P_K(T)\right] \ge \frac{c_T}{c_M}. \tag{7.71}$$

7.6 CONCLUSIONS
We surveyed several extended failure rates that appeared in the recent age, random,
and periodic replacement models. The reliability properties of these extended failure
rates would be helpful in obtaining optimum maintenance times for complex sys-
tems. We also gave the inequalities of the failure rates, which would help greatly to
compare their optimum replacement policies.
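There are some pairs of failure rates for which we cannot give inequalities. For example:

1. $\dfrac{F(T)}{\int_0^T \bar F(t)\,dt}$ and $\dfrac{\int_0^T (\theta t)^N e^{-\theta t}\,dF(t)}{\int_0^T (\theta t)^N e^{-\theta t}\bar F(t)\,dt}$.

2. $\dfrac{\bar F(T)}{\int_T^\infty \bar F(t)\,dt}$ and $\dfrac{\int_T^\infty (\theta t)^N e^{-\theta t}\,dF(t)}{\int_T^\infty (\theta t)^N e^{-\theta t}\bar F(t)\,dt}$.

3. $h(T)$ and $\dfrac{1}{\int_0^\infty p_K(t)\,dt}$.

However, it can be shown that:

1. $\int_0^T (\theta t)^N e^{-\theta t}\,dF(t) \big/ \int_0^T (\theta t)^N e^{-\theta t}\bar F(t)\,dt$ increases with $N$ from:

$$\frac{\int_0^T e^{-\theta t}\,dF(t)}{\int_0^T e^{-\theta t}\bar F(t)\,dt} \le \frac{F(T)}{\int_0^T \bar F(t)\,dt} \quad \text{to} \quad h(T) \ge \frac{F(T)}{\int_0^T \bar F(t)\,dt}.$$

2. $\int_T^\infty (\theta t)^N e^{-\theta t}\,dF(t) \big/ \int_T^\infty (\theta t)^N e^{-\theta t}\bar F(t)\,dt$ increases with $N$ from:

$$\frac{\int_T^\infty e^{-\theta t}\,dF(t)}{\int_T^\infty e^{-\theta t}\bar F(t)\,dt} \le \frac{\bar F(T)}{\int_T^\infty \bar F(t)\,dt} \quad \text{to} \quad h(\infty) \ge \frac{\bar F(T)}{\int_T^\infty \bar F(t)\,dt}.$$

3. $1 \big/ \int_0^\infty p_K(t)\,dt$ increases with $K$ from $1/\mu \ge h(0)$ to $h(\infty)$.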

APPENDICES
Assuming that the failure rate h(t ) increases with t from h(0) to h(∞), we complete
the following proofs.

APPENDIX 7.1 
Prove that for $0 < T < \infty$:

$$\frac{H(T)}{T} = \frac{1}{T}\int_0^T h(t)\,dt$$

increases with $T$ from $h(0)$ to $h(\infty)$ and:

$$\frac{F(T)}{\int_0^T \bar F(t)\,dt} \le \frac{H(T)}{T} \le h(T). \tag{A7.1}$$

Proof. Note that:

$$\lim_{T \to 0}\frac{H(T)}{T} = \lim_{T \to 0}\frac{1}{T}\int_0^T h(t)\,dt = h(0), \qquad \lim_{T \to \infty}\frac{H(T)}{T} = \lim_{T \to \infty}\frac{1}{T}\int_0^T h(t)\,dt = h(\infty),$$

and

$$\frac{d[H(T)/T]}{dT} = \frac{1}{T^2}[Th(T) - H(T)] = \frac{1}{T^2}\int_0^T [h(T) - h(t)]\,dt \ge 0.$$

It follows that $H(T)/T$ increases with $T$ from $h(0)$ to $h(\infty)$. In particular, $Th(T) \ge H(T)$, which is the right-hand inequality of Equation (A7.1).

For the left-hand inequality of Equation (A7.1), letting:

$$L(T) \equiv H(T)\int_0^T \bar F(t)\,dt - TF(T),$$

we have $L(0) = 0$ and:

$$\frac{dL(T)}{dT} = h(T)\int_0^T \bar F(t)\,dt + H(T)\bar F(T) - F(T) - Tf(T) = \int_0^T [h(T) - h(t)][F(T) - F(t)]\,dt \ge 0,$$

which completes the proof of Equation (A7.1).
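The chain (A7.1), together with the outer terms of Inequality I in Equation (7.6), can be spot-checked numerically. The sketch below does this for an assumed distribution with survival function exp(−t^2) (failure rate 2t) at T = 1. The midpoint-rule integrator, the truncation of infinite integrals at a finite upper bound, and all names are illustrative assumptions.

```python
import math

UPPER = 8.0  # assumed stand-in for infinity; exp(-64) is negligible

def integrate(g, a, b, steps=400):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    if b <= a:
        return 0.0
    dx = (b - a) / steps
    return sum(g(a + (i + 0.5) * dx) for i in range(steps)) * dx

# Assumed example distribution: survival exp(-t^2), failure rate 2t, H(t) = t^2.
surv = lambda t: math.exp(-t * t)
cdf = lambda t: 1.0 - surv(t)
rate = lambda t: 2.0 * t
cum_hazard = lambda t: t * t

def inequality_one_terms(T):
    """Return the six terms of Inequality I (7.6), left to right."""
    head = integrate(surv, 0.0, T)        # integral of survival over [0, T]
    tail = integrate(surv, T, UPPER)      # integral of survival over [T, infinity)
    head_double = integrate(lambda t: integrate(surv, 0.0, t), 0.0, T)
    tail_double = integrate(lambda t: integrate(surv, t, UPPER), T, UPPER)
    return [
        integrate(cdf, 0.0, T) / head_double,
        cdf(T) / head,
        cum_hazard(T) / T,
        rate(T),
        surv(T) / tail,
        tail / tail_double,
    ]

terms = inequality_one_terms(1.0)
```

The middle three entries of `terms` are exactly the quantities compared in (A7.1).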



APPENDIX 7.2
Prove that for 0 < T < ∞:

T∫ F (t )dt (A7.2)

 t

∫ ∫ F (u)du  dt
T 0

increases with T to 1/µ:


T

∫ F (t )dt (A7.3)
0
T
 ∞

∫  ∫ F (u)du  dt
0 t

increases with T from 1/µ , and:


∞ T

∫ T

∫ F (t )dt . (A7.4)
F (t )dt
0
 ∞ t
   T ∞

∫ ∫ F (u)du  dt ∫  ∫ F (u)du  dtt


T 0 0 t

Proof. Note that from Equation (A7.1):


lim ∞
∫ F (t )dt = lim F (T ) = 1 ,
T
t T
µ
∫ [∫ F (u)du]dt ∫ F (t )dt
T →∞ T →∞

T 0 0

lim T
∫ F (t )dt = lim F (T ) = 1 .
0
∞ ∞
µ
∫ [∫ F (u)du]dt ∫ F (t )dt
T →0 T →0

0 t T

T
Furthermore, because F (T ) /
∫ F (t )dt increases with T:
0


∫ F (t )dt ≥ F (T ) .

T
t T

∫ [∫ F (u)du]dt ∫ F (t )dt
T 0 0


Similarly, because F (T ) /
∫ T
F (t )dt increases with T:

∫ F (t )dt ≤ F (T ) .
T
0
∞ ∞

∫ [∫ F (u)du]dt ∫ F (t )dt
0 t T

∞ ∞ t
Differentiating
∫ F (t )dt / ∫ [∫ F (u)du]dt with respect to T:
T T 0


 t  ∞ T
− F (T )
∫ ∫ T
 0 F ( u ) du  dt +
  ∫ T
F ( t ) dt
∫ F ( t ) dt
0

 ∞

T ∞
 t
  ∫ F (t )dt − F (T )  ≥ 0,
=
∫ F (t )dt ∫ ∫  0 F (u)du  dt  ∞
T
t T
  
∫ [∫ F (u)du]dt ∫ F (t )dt 
0 T

 T 0 0

which proves that Equation (A7.2) increases with T to 1/ µ .


Similarly, we can prove Equation (A7.3) increases with T from 1/ µ and then com-
plete the proof of Equation (A7.4).

APPENDIX 7.3 
For $0 < T < \infty$ and $N = 0, 1, 2, \ldots$:

$$\frac{\int_0^T t^N\,dF(t)}{\int_0^T t^N \bar F(t)\,dt} \tag{A7.5}$$

increases with $T$ from $h(0)$ and increases with $N$ from $F(T) / \int_0^T \bar F(t)\,dt$ to $h(T)$:

$$\frac{\int_0^T (\theta t)^N e^{-\theta t}\,dF(t)}{\int_0^T (\theta t)^N e^{-\theta t}\bar F(t)\,dt} \tag{A7.6}$$

increases with $T$ from $h(0)$ and increases with $N$ from $\int_0^T e^{-\theta t}\,dF(t) / \int_0^T e^{-\theta t}\bar F(t)\,dt$ to $h(T)$:

$$\frac{\int_0^T t^N\,dF(t)}{\int_0^T t^N \bar F(t)\,dt} \ge \frac{\int_0^T (\theta t)^N e^{-\theta t}\,dF(t)}{\int_0^T (\theta t)^N e^{-\theta t}\bar F(t)\,dt}. \tag{A7.7}$$

Proof. Note that:

$$\lim_{T \to 0}\frac{\int_0^T t^N\,dF(t)}{\int_0^T t^N \bar F(t)\,dt} = \lim_{T \to 0}\frac{\int_0^T (\theta t)^N e^{-\theta t}\,dF(t)}{\int_0^T (\theta t)^N e^{-\theta t}\bar F(t)\,dt} = h(0),$$

$$\lim_{N \to \infty}\frac{\int_0^T t^N\,dF(t)}{\int_0^T t^N \bar F(t)\,dt} = \lim_{N \to \infty}\frac{\int_0^T (\theta t)^N e^{-\theta t}\,dF(t)}{\int_0^T (\theta t)^N e^{-\theta t}\bar F(t)\,dt} = h(T).$$

Differentiating $\int_0^T t^N\,dF(t) / \int_0^T t^N \bar F(t)\,dt$ with respect to $T$, the numerator of the derivative is:

$$T^N f(T)\int_0^T t^N \bar F(t)\,dt - T^N \bar F(T)\int_0^T t^N\,dF(t) = T^N \bar F(T)\int_0^T t^N \bar F(t)[h(T) - h(t)]\,dt \ge 0,$$

from which it follows that Equation (A7.5) increases with $T$ from $h(0)$. Forming:

$$\frac{\int_0^T t^{N+1}\,dF(t)}{\int_0^T t^{N+1}\bar F(t)\,dt} - \frac{\int_0^T t^N\,dF(t)}{\int_0^T t^N \bar F(t)\,dt},$$

and denoting:

$$L(T) \equiv \int_0^T t^{N+1}\,dF(t)\int_0^T t^N \bar F(t)\,dt - \int_0^T t^N\,dF(t)\int_0^T t^{N+1}\bar F(t)\,dt,$$

we have $L(0) = 0$ and

$$\frac{dL(T)}{dT} = T^{N+1}f(T)\int_0^T t^N \bar F(t)\,dt + T^N \bar F(T)\int_0^T t^{N+1}\,dF(t) - T^N f(T)\int_0^T t^{N+1}\bar F(t)\,dt - T^{N+1}\bar F(T)\int_0^T t^N\,dF(t)$$

$$= T^N \bar F(T)\int_0^T t^N \bar F(t)(T - t)[h(T) - h(t)]\,dt \ge 0,$$

from which it follows that Equation (A7.5) increases with $N$ to $h(T)$.

Similarly, we can prove Equation (A7.6) increases with $T$ from $h(0)$ and increases with $N$ to $h(T)$.

From Equation (A7.7), letting:

$$L(T) \equiv \int_0^T t^N\,dF(t)\int_0^T (\theta t)^N e^{-\theta t}\bar F(t)\,dt - \int_0^T t^N \bar F(t)\,dt\int_0^T (\theta t)^N e^{-\theta t}\,dF(t),$$

we have $L(0) = 0$ and

$$\frac{dL(T)}{dT} = T^N f(T)\int_0^T (\theta t)^N e^{-\theta t}\bar F(t)\,dt + (\theta T)^N e^{-\theta T}\bar F(T)\int_0^T t^N\,dF(t) - T^N \bar F(T)\int_0^T (\theta t)^N e^{-\theta t}\,dF(t) - (\theta T)^N e^{-\theta T}f(T)\int_0^T t^N \bar F(t)\,dt$$

$$= T^N \bar F(T)\int_0^T (\theta t)^N \bar F(t)(e^{-\theta t} - e^{-\theta T})[h(T) - h(t)]\,dt \ge 0,$$

which completes the proof of Equation (A7.7).

APPENDIX 7.4 
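For $0 < T < \infty$ and $N = 0, 1, 2, \ldots$:

$$\frac{\int_T^\infty t^N\,dF(t)}{\int_T^\infty t^N \bar F(t)\,dt}$$

increases with $T$ to $h(\infty)$ and increases with $N$ from $\bar F(T) / \int_T^\infty \bar F(t)\,dt$ to $h(\infty)$:

$$\frac{\int_T^\infty (\theta t)^N e^{-\theta t}\,dF(t)}{\int_T^\infty (\theta t)^N e^{-\theta t}\bar F(t)\,dt}$$

increases with $T$ to $h(\infty)$ and increases with $N$ from $\int_T^\infty e^{-\theta t}\,dF(t) / \int_T^\infty e^{-\theta t}\bar F(t)\,dt$ to $h(\infty)$, and:

$$\frac{\int_T^\infty t^N\,dF(t)}{\int_T^\infty t^N \bar F(t)\,dt} \ge \frac{\int_T^\infty (\theta t)^N e^{-\theta t}\,dF(t)}{\int_T^\infty (\theta t)^N e^{-\theta t}\bar F(t)\,dt}.$$

Proof. Appendix 7.4 can be proved by arguments similar to those of Appendix 7.3.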

APPENDIX 7.5 
Prove that:
T ∞

∫ ∫e − θu
(θt ) N [ dF (u)]dt
0 t
T ∞ (A7.8)
∫ (θt ) [∫ e
0
N

t
− θu
F (u)du]dt

∞ ∞

∫ e − θt dF (t ) /
∫e − θt
increases with T from F (t )dt and increases with N to:
∞ ∞ 0 0

∫e ∫e
− θt − θt
dF ( t ) / F (t )dt ,
T T

∞ t


∫ T

∫ (θt ) N [ e − θu dF (u)]dt
t
0
(A7.9)
∫ (θt ) [∫ e − θu
N
F (u)du]dt
T 0

∞ ∞
e − θt dF (t ) /
∫ ∫e − θt
increases with T to F (t )dt and increases with N to
∞ ∞

∫ ∫
− θt − θt 0 0
e dF (t ) / e F (t )dt , and:
0 0

T ∞ ∞ t


∫ 0
T
∫ (θt ) N [
t

e − θu dF (u)]dt

∫ T

∫(θt ) N [ e − θu dF (u)]dt
t
0
. (A7.10)
∫ (θt ) [∫ e N − θu
∫ (θt ) [∫ e − θu
N
F (u)du]dt F (u)du]dt
0 t T 0

Proof. Note that:

T ∞ ∞

∫ ∫ e − θu dF (u)]dt
∫e − θt
(θt ) N [ dF ( t )
0 t 0
lim T ∞ = ∞ ,
∫ (θt ) [∫ e ∫e
T →0 N − θu − θt
F (u)du]dt F (t )dt
0 t 0

T ∞ ∞

lim
∫ 0
T
∫ (θt ) N [
t

e − θu dF (u)]dt
=
∫e T

− θt
dF ( t )
,
∫ (θt ) [∫ e ∫e
N →∞ N − θu − θt
F (u)du]dt F (t )dt
0 t T

and:


 t  ∞
 t 

lim ∞
T ∫ ∫
(θt ) N  e − θu dF (u)  dt
 0  = lim
∫ T  0 ∫
(θt ) N  e − θu dF (u)  dt

 t
 ∞
 t

∫ ∫ ∫ ∫
T →∞ − θu N →∞
N
(θt )  e F (u)du  dt (θt ) N  e − θu F (u)du  dt
T  0  T  0 

∫e
− θt
dF ( t )
0
= ∞ .
∫e 0
− θt
F (t )dt

∞ ∞

∫ e − θt dF (t ) /
∫e − θt
Furthermore, because F (t )dt increases with T :
T T

T ∞ ∞

∫ (θt ) [∫ e ∫e
N − θu − θt
dF (u)]dt F (t )
dF

0
T
t
∞ ≤ T
∞ .
∫ (θt ) [∫ e − θu
∫e
N − θt
F (u)du]dt F (t )dt
0 t T

T ∞ T ∞

∫ ∫ e − θu dF (u)]dt /
∫ ∫e − θu
Differentiating (θt ) N [ (θt ) N [ F (u)du]dt with respect to T:
0 t 0 t

∞ T
 ∞

∫ e − θt dF (t )
∫ ∫e − θu
(θT ) N (θt ) N  F (u)du  dt
T 0  t 
∞ T
 ∞

− (θT ) N
∫ T
e − θt F (t )dt
∫ 0
(θt ) N 
 ∫e t
− θu
dF (u)  dt

∞ T
 ∞

= (θT ) N
∫ T
e − θt F (t )dt
∫0
(θt ) N 
 ∫e t
− θu
F ( u ) du  dt

 ∞ T ∞

∫ ∫ ∫e
− θu
 e − θt dF (t ) (θt ) N [ dF (u)]dt 

× T
∞ − 0
T
t
∞  ≥ 0,

 ∫e T
− θt
F (t )dt
∫ (θt ) [∫ e 0
N

t
− θu
F (u)du]dt 


∞ ∞

∫ e − θt dF (t ) /
∫e − θt
which follows that Equation (A7.8) increases with T from F (t )dt.
0 0
Forming:

T ∞ T ∞

∫ ∫ (θt ) N +1[ e − θu dF (u)]dt


∫ ∫e − θu
(θt ) N [ dF (u)]dt

0
T
t
∞ − 0
T
t
∞ ,
∫ (θt ) [∫ e
0
N +1

t
− θu
F (u)du]dt
∫ (θt ) [∫ e 0
N

t
− θu
F (u)du]dt

and denoting:

T
 ∞
 T
 ∞

∫ (θt ) N +1 
∫ e − θu dF (u)  dt
∫ ∫e − θu
L(T ) ≡ (θt ) N  F (u)du  dt
0  t  0  t 
T
 ∞
 T
 ∞

∫ ∫ e − θu dF (u)  dt
∫ (θt ) N +1 
∫e
− θu
− (θt ) N  F (u)du  dt ,
0  t  0  t 

we have L(0) = 0 and:

dL(T ) ∞ T
 ∞

= (θT ) N +1
∫ e − θt dF (t )
∫ ∫e − θu
(θt ) N  F (u)du  dt
dT T 0  t 
∞ T
 ∞

+ (θT ) N
∫T
e − θt F (t )dt
∫ 0
(θt ) N +1 
 ∫e t
− θu
dF (u)  dt

∞ T
 ∞

− (θT ) N
∫ T
e− θt dF (t )
∫ 0
(θt ) N +1 
 ∫e t
− θu
F (u)du  dt

∞ T
 ∞

− (θT ) N +1
∫ T
e − θt F (t )dt
∫ 0
(θt ) N 
 ∫e t
− θu
dF (u)  dt

∞ T
  ∞

=
∫T
e − θt F (t )dt
∫ 0 
N N
(θT ) (θt ) (θT − θt ) 
 ∫et
− θu
F (u)du 

 ∞ ∞

∫ ∫e
− θu
 e − θu dF (u) dF ( u )  

×  T
∞ − t
∞   dt ≥ 0,

 ∫e
T
− θu
F (u)duu
∫e t
− θu
F ( u ) du  
 

∞ ∞

∫ e − θt dF (t ) /
∫e − θt
which follows that Equation (A7.8) increases with N to F (t )dt .
T ∞ T ∞

∫ ∫e
− θt − θt
Similarly, we can prove Equation (A7.9) increases with T to e dF ( t ) /
∞ ∞ 0 0

∫e ∫e
− θt − θt
F (t )dt and increases with N to dF ( t ) / F (t )dt and complete the proof of
0 0
Equation (A7.10).

APPENDIX 7.6 
Prove that:
T t


∫ 0
T
∫ (θt ) N [ e − θu dF (u)]dt
t
0
(A7.11)
∫ (θt ) [∫ e
0
N

0
− θu
F (u)du]dt

T T

∫e ∫e
− θt − θt
increases with T from h(0) and increases with N to dF ( t ) / F (t )dt , and:
0 0

∞ ∞

∫ ∫e − θu
(θt ) N [ dF (u)]dt
T t
∞ ∞ (A7.12)
∫ (θt ) [∫ e − θu
N
F (u)du]dt
T t

increases with T to h(∞) and increases with N to h(∞).



Proof. Note that:


T
 t 
∫ ∫
T ∞
(θt ) N  e − θu dF (u)  dt
∫e ∫e
− θt − θt
dF (t ) dF (t )
0  0  ≤ 0
≤ T
T ∞
N  
T t

∫ ∫ ∫e ∫e
− θu − θt − θt
(θt )  e F (u)du  dt F (t )dt F (t )dt
0  0  0 T


 ∞ 

≤ ∞
T ∫ t ∫
(θtt ) N  e − θu dF (u)  dt
  .
N  

∫ ∫
− θu
(θt )  e F (u)du  dt
T  t 

Using the similar discussions of Appendices 7.5 and 7.6 can be proved.
Similarly, we can prove that the failure rates in VIII and IX increase with T and
N and obtain the inequalities for 0 < T < ∞ and N = 0,1,2,.

APPENDIX 7.7 
Prove that for $0 < T < \infty$ and $K = 0, 1, 2, \ldots$:

$$\frac{\int_0^T p_K(t)h(t)\,dt}{\int_0^T p_K(t)\,dt} \tag{A7.13}$$

increases with $T$ from $h(0)$ to $1 / \int_0^\infty p_K(t)\,dt$ and increases with $K$ from $F(T) / \int_0^T \bar F(t)\,dt$ to $h(T)$, and:

$$\frac{F(T)}{\int_0^T \bar F(t)\,dt} \le \frac{\int_0^T p_K(t)h(t)\,dt}{\int_0^T p_K(t)\,dt} \le h(T) \le \frac{\bar F(T)}{\int_T^\infty \bar F(t)\,dt}, \tag{A7.14}$$

$$\frac{\int_0^T p_K(t)h(t)\,dt}{\int_0^T p_K(t)\,dt} \le \frac{1}{\int_0^\infty p_K(t)\,dt}. \tag{A7.15}$$

Proof. Note that:

$$\lim_{T \to 0}\frac{\int_0^T p_K(t)h(t)\,dt}{\int_0^T p_K(t)\,dt} = h(0), \qquad \lim_{T \to \infty}\frac{\int_0^T p_K(t)h(t)\,dt}{\int_0^T p_K(t)\,dt} = \frac{1}{\int_0^\infty p_K(t)\,dt},$$

$$\lim_{K \to 0}\frac{\int_0^T p_K(t)h(t)\,dt}{\int_0^T p_K(t)\,dt} = \frac{F(T)}{\int_0^T \bar F(t)\,dt}, \qquad \lim_{K \to \infty}\frac{\int_0^T p_K(t)h(t)\,dt}{\int_0^T p_K(t)\,dt} = h(T).$$

Differentiating $\int_0^T p_K(t)h(t)\,dt / \int_0^T p_K(t)\,dt$ with respect to $T$, the numerator of the derivative is:

$$p_K(T)h(T)\int_0^T p_K(t)\,dt - p_K(T)\int_0^T p_K(t)h(t)\,dt = p_K(T)\int_0^T p_K(t)[h(T) - h(t)]\,dt \ge 0,$$

from which it follows that Equation (A7.13) increases with $T$.

Taking the difference of $\int_0^T p_K(t)h(t)\,dt / \int_0^T p_K(t)\,dt$ between $K + 1$ and $K$, and letting:

$$L(T) \equiv \int_0^T p_{K+1}(t)h(t)\,dt\int_0^T p_K(t)\,dt - \int_0^T p_K(t)h(t)\,dt\int_0^T p_{K+1}(t)\,dt,$$

we have $L(0) = 0$ and:

$$L'(T) = p_{K+1}(T)\int_0^T p_K(t)[h(T) - h(t)]\,dt - p_K(T)\int_0^T p_{K+1}(t)[h(T) - h(t)]\,dt$$

$$= \frac{H(T)^K}{(K + 1)!}e^{-H(T)}\int_0^T p_K(t)[h(T) - h(t)][H(T) - H(t)]\,dt \ge 0,$$

from which it follows that Equation (A7.13) increases with $K$.

From these results, we can obtain Equations (A7.14) and (A7.15).

APPENDIX 7.8 
Prove that for 0 ≤ T < ∞ and K = 0,1,2,:


∫ T
pK (t )h(t )dt
∞ (A7.16)
∫ T
pK (t )dt

∞ ∞
increases with T from1/
to h(∞), and:

0
pK (t )dt to h(∞) and increases with K from F (T ) /

T
F (t )dt

∞ ∞


F (T )

∫ T
pK (t )h(t )dt
∞ , ∞
1

∫ T
pK (t )h(t )dt
∞ . (A7.17)
∫T
F (t )dt
∫ T
pK (t )dt

0
pK (t )dt
∫ T
pK (t )dt

Proof. Note that:


∞ ∞

lim
∫ T
pK (t )h(t )dt
∞ = ∞
1
, lim
∫ T
pK (t )h(t )dt
∞ = h(∞),
∫ ∫ ∫
T →0 T →∞
pK (t )dt pK (t )dt pK (t )dt
T 0 T

∞ ∞

lim
∫ T
pK (t )h(t )dt
∞ = ∞
F (T )
, lim
∫ T
pK (t )h(t )dt
∞ = h(∞).
∫ ∫ ∫
T →0 K →∞
pK (t )dt F (t )dt pK (t )dt
T T T

By similar methods used in Appendix 7.7, we can easily prove Appendix 7.8.

APPENDIX 7.9 
Prove that for 0 < T < ∞ and K = 0,1,2,:
T

T
∫ p (t )h(t )dt
0
K
(A7.18)
∫ 0
pK (t )[h(t ) / Q(t )]dt

increases with T from 1/µ to 1/ ∫ ∞ pK +1(t )dt and increases with K from:
0
T ∞ ∞
F (T ) / ∫0 h(t )[ ∫t F (u)du]dt to F (T ) / ∫T F (t )dt , and:


1
≤ T
∫ p (t )h(t )dt
0
K
≤ ∞
1
, (A7.19)
µ
∫ 0
pK (t )[h(t ) / Q(t )]dt
∫ 0
pK +1(t )dt

T
∫ p (t )h(t )dt
0
K
≤ ∞
F (T )
, (A7.20)

0
pK (t )[h(t ) / Q(t )]dt
∫ T
F (t )dt

Proof. Note that:


T

lim T
∫ p (t )h(t )dt
0
K
= lim ∞
F (T )
=
1
,
µ
∫ p (t )[h(t ) / Q(t )]dt ∫
T →0 T →0
K
F (t )dt
0 T

lim T
∫ p (t )h(t )dt
0
K
= ∞
1
= ∞
1
,
∫ ∫ ∫
T →∞
pK (t )[h(t ) / Q(t )]dt pK (t )[h(t ) / Q(t )]dt pK +1(t )dt
0 0 0

lim T
∫ p (t )h(t )dt
0
K
= T
F (T )
∞ ,
∫ p (t )[h(t ) / Q(t )]dt ∫ h(t )[∫
K →0
K
F (u)du]dt
0 0 t

lim T
∫ p (t )h(t )dt
0
K
=
F (T )
∞ .
∫ ∫
K →∞
pK (t )[h(t ) / Q(t )]dt F (t )dt
0 T

Differentiating $\int_0^T p_K(t)h(t)\,dt \big/ \int_0^T p_K(t)[h(t)/Q(t)]\,dt$ with respect to T, the numerator of the derivative is:

$$p_K(T)h(T)\int_0^T p_K(t)\frac{h(t)}{Q(t)}\,dt - p_K(T)\frac{h(T)}{Q(T)}\int_0^T p_K(t)h(t)\,dt = p_K(T)\frac{h(T)}{Q(T)}\int_0^T p_K(t)\frac{h(t)}{Q(t)}\,[Q(T)-Q(t)]\,dt \ge 0,$$

which implies that Equation (A7.18) increases with T.

Taking the difference of $\int_0^T p_K(t)h(t)\,dt \big/ \int_0^T p_K(t)[h(t)/Q(t)]\,dt$ with respect to K, and letting:

$$L(T) \equiv \int_0^T p_{K+1}(t)h(t)\,dt\int_0^T p_K(t)\frac{h(t)}{Q(t)}\,dt - \int_0^T p_K(t)h(t)\,dt\int_0^T p_{K+1}(t)\frac{h(t)}{Q(t)}\,dt,$$

we have L(0) = 0, and:

$$L'(T) = \frac{H(T)^K}{(K+1)!}\,e^{-H(T)}h(T)\int_0^T p_K(t)h(t)\left[\frac{1}{Q(t)}-\frac{1}{Q(T)}\right][H(T)-H(t)]\,dt \ge 0,$$

which implies that Equation (A7.18) increases with K.

From these results, we can obtain Equations (A7.19) and (A7.20).

APPENDIX 7.10 
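Prove that for 0 ≤ T < ∞ and K = 0, 1, 2, …:

$$\frac{\int_T^\infty p_K(t)h(t)\,dt}{\int_T^\infty p_K(t)[h(t)/Q(t)]\,dt} \tag{A7.21}$$

increases with T from $1/\int_0^\infty p_{K+1}(t)\,dt$ to $h(\infty)$ and increases with K from $\overline{F}(T)/\int_T^\infty h(t)\bigl[\int_t^\infty \overline{F}(u)\,du\bigr]dt$ to $h(\infty)$, and:

$$\frac{\int_T^\infty p_K(t)h(t)\,dt}{\int_T^\infty p_K(t)\,dt} \le \frac{\int_T^\infty p_K(t)h(t)\,dt}{\int_T^\infty p_K(t)h(t)\bigl[\int_t^\infty \overline{F}(u)\,du/\overline{F}(t)\bigr]dt},$$

$$\frac{1}{\int_0^\infty p_{K+1}(t)\,dt} \le \frac{\int_T^\infty p_K(t)h(t)\,dt}{\int_T^\infty p_K(t)h(t)\bigl[\int_t^\infty \overline{F}(u)\,du/\overline{F}(t)\bigr]dt}.$$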

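Proof. Note that:

$$\lim_{T\to 0}\frac{\int_T^\infty p_K(t)h(t)\,dt}{\int_T^\infty p_K(t)[h(t)/Q(t)]\,dt} = \frac{1}{\int_0^\infty p_{K+1}(t)\,dt},$$

$$\lim_{T\to\infty}\frac{\int_T^\infty p_K(t)h(t)\,dt}{\int_T^\infty p_K(t)[h(t)/Q(t)]\,dt} = h(\infty),$$

$$\lim_{K\to 0}\frac{\int_T^\infty p_K(t)h(t)\,dt}{\int_T^\infty p_K(t)[h(t)/Q(t)]\,dt} = \frac{\overline{F}(T)}{\int_T^\infty h(t)\bigl[\int_t^\infty \overline{F}(u)\,du\bigr]dt},$$

$$\lim_{K\to\infty}\frac{\int_T^\infty p_K(t)h(t)\,dt}{\int_T^\infty p_K(t)[h(t)/Q(t)]\,dt} = \lim_{T\to\infty}Q(T) = h(\infty).$$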

Using h(t ) / Q(t ) ≤ 1 and similar methods in Appendix 7.9, we can easily prove
Appendix 7.10.

ACKNOWLEDGMENT
This  work is supported by National Natural Science Foundation of China
(NO. 71801126), Natural Science Foundation of Jiangsu Province (NO. BK20180412),
Aeronautical Science Foundation of China (NO. 2018ZG52080), and Fundamental
Research Funds for the Central Universities (NO. NR2018003).

REFERENCES
1. Lai, C.D., Xie, M. Concepts and applications of stochastic aging in reliability.
In Pham, H. (ed.), Handbook of Reliability Engineering. London, UK: Springer, 2003:
pp. 165–180.
2. Barlow, R.E., Proschan, F. Mathematical Theory of Reliability. New  York: John
Wiley & Sons, 1965.
3. Finkelstein, M. Failure Rate Modeling for Reliability and Risk. London, UK: Springer,
2008.
4. Nakagawa, T. Maintenance Theory of Reliability. London, UK: Springer, 2005.
5. Nakagawa, T. Random Maintenance Policies. London, UK: Springer, 2014.
6. Nakagawa, T., Zhao, X. Maintenance Overtime Policies in Reliability Theory. Cham,
Switzerland: Springer, 2015.
7. Zhao, X., Al-Khalifa, K.N., Hamouda, A.M.S., Nakagawa, T. What is middle mainte-
nance policy? Quality and Reliability Engineering International, 2016, 32, 2403–2414.
8. Zhao, X., Al-Khalifa, K.N., Hamouda, A.M.S., Nakagawa, T. Age replacement models:
A summary with new perspectives and methods. Reliability Engineering and System
Safety, 2017, 161, 95–105.
9. Nakagawa, T. Stochastic Processes with Applications to Reliability Theory. London,
UK: Springer, 2011.
10. Zhao, X., Qian, C., Nakagawa, T. Comparisons of maintenance policies with peri-
odic times and repair numbers. Reliability Engineering and System Safety, 2017, 168,
161–170.
8 Accelerated Life Tests with Competing Failure Modes: An Overview
Kanchan Jain and Preeti Wanti Srivastava

CONTENTS
8.1 Introduction
    8.1.1 Accelerated Life Test Models
    8.1.2 An Accelerated Life Test Procedure
    8.1.3 Competing Failure Modes
8.2 Accelerated Life Tests with Independent Causes of Failures
    8.2.1 Constant-Stress Accelerated Life Test with Independent Causes of Failures
        8.2.1.1 Model Illustration
    8.2.2 Step-Stress Accelerated Life Test with Independent Causes of Failures
        8.2.2.1 Model Illustration
    8.2.3 Modified Ramp-Stress Accelerated Life Test with Independent Causes of Failures
        8.2.3.1 Model Illustration
8.3 Accelerated Life Tests with Dependent Causes of Failures
    8.3.1 Copulas and Their Properties
    8.3.2 Constant-Stress Accelerated Life Test Based on Copulas
        8.3.2.1 Model Illustration
    8.3.3 Constant-Stress Partially Accelerated Life Test Based on Copulas
        8.3.3.1 Model Illustration
    8.3.4 Step-Stress Accelerated Life Test Based on Copulas
8.4 Bayesian Approach to Accelerated Life Test with Competing Failure Mode
8.5 Conclusion
References


8.1 INTRODUCTION
Testing systems or components with a long expected lifetime under normal operating
conditions requires a long test period and many test units, which is costly and
impractical. In such situations, accelerated life test (ALT) methods are used, which
induce failure or degradation of systems or components in a shorter time. Failure
data can then be obtained within a reasonable period without changing the failure
mechanisms.
ALTs were introduced by Chernoff (1962) and Bessler et al. (1962). They are used
during the Design and Development, Design Verification, and Process Validation stages
of a product life cycle. Designing optimal test plans is a critical step in ensuring
that ALTs predict product reliability accurately, quickly, and economically.
In ALT, systems or components are:

• Subjected to more severe conditions than those experienced at normal
conditions (accelerated stress). Stress factors can be temperature, voltage,
mechanical load, thermal cycling, humidity, and vibration.
• Put in operation more vigorously at normal operating conditions (accelerated
failure). Products such as home appliances and vehicle tires are put to
accelerated failure.

For accurate prediction of reliability, the types of stress to which systems or
components are subjected and the failure mechanisms must be understood.
The different types of stress loading are:

1. Constant
2. Step
3. Ramp-step
4. Triangular-cyclic
5. Ramp-soak-cyclic
6. Sinusoidal-cyclic

8.1.1 Accelerated Life Test Models


In engineering applications, several ALT models have been proposed and used suc-
cessfully. Accelerated Failure Time (AFT) models are the most widely used ALT
models.

• Partially accelerated life test model: DeGroot and Goel (1979) introduced
Partially Accelerated Life Test (PALT) models, wherein the items are run at
normal as well as accelerated conditions.
A PALT model consists of a life distribution and an acceleration factor for
extrapolating accelerated data results to the normal operating condition when
the life-stress relationship cannot be specified. The acceleration factor is the
ratio of a reliability measure (for example, mean life) at the use condition to
that at the accelerated condition. It provides a quantitative estimate of the
relationship between the accelerated condition and the use condition.
• Fully accelerated life test model: Introduced by Bhattacharya and Soejoeti
(1989), a fully ALT consists of testing the items at accelerated condition
only. A fully ALT model consists of:

1. A life distribution that represents the scatter in product life
2. A relationship between life and stress

Some of the stress-life relationships used in the literature (Nelson 1990; Yang 2007;
Elsayed 2012; Srivastava 2017) are:

• Life-Temperature models described by Arrhenius and Eyring relationships
• Life-Voltage model described by Inverse Power relationships
• Life-Usage Rate relationship
• Temperature-Humidity model
• Temperature-Nonthermal model

8.1.2 An Accelerated Life Test Procedure


An ALT is undertaken in the design and development phase as well as in the verifica-
tion and validation phases of a product life cycle.
Nelson (1990) gave a comprehensive presentation of statistical models and meth-
ods for accelerated tests.

8.1.3 Competing Failure Modes


Many products have more than one cause of failure; each cause is referred to as a
failure mode or failure mechanism. Examples include:

• The Turn, Phase, or Ground insulation failing in motors
• A ball or the race failing in ball bearing assemblies
• A semiconductor device that fails at a junction or a lead
• A cylindrical fatigue specimen failing in the cylindrical portion, in the fillet
(or radius), or in the grip
• A solar lighting device with capacitor failure and controller failure as two
modes of failure

In these examples, the assessment of each risk factor in the presence of other risk
factors is necessary and gives rise to competing risks analysis. For such an analysis,
each complete observation must be composed of the failure time and the correspond-
ing cause of failure. The  causes of failure can be independent or dependent upon
each other.
The procedure underlying an ALT is shown in the following flowchart (Figure 8.1).

FIGURE 8.1  Flow chart.

8.2 ACCELERATED LIFE TESTS WITH INDEPENDENT CAUSES OF FAILURES
Let n identical units be put on test and suppose that a unit fails due to one of r (≥ 2)
fatal risk factors. Let $T_j$ be the lifetime of the unit due to the jth risk factor, with
cumulative distribution function (CDF) $G_j(t)$ and probability density function (PDF)
$g_j(t)$. The overall failure time of a test unit is $T = \min\{T_1, T_2, \ldots, T_r\}$ with CDF:

$$F(t) = 1 - \prod_{j=1}^{r}\bigl(1 - G_j(t)\bigr), \tag{8.1}$$

and PDF:

$$f(t) = \sum_{j=1}^{r} h_j(t)\prod_{k=1}^{r}\bigl(1 - G_k(t)\bigr), \tag{8.2}$$

where $h_j(t)$, the hazard rate corresponding to the jth risk factor, is defined as:

$$h_j(t) = \frac{g_j(t)}{1 - G_j(t)}. \tag{8.3}$$

Let C be the indicator variable for the cause of failure. Then the joint distribution of
(T, C) is given by:

$$f_{T,C}(t, j) = g_j(t)\prod_{k \ne j}\bigl(1 - G_k(t)\bigr). \tag{8.4}$$

$f_{T,C}(t, j)$ is used in the formulation of the likelihood function, which is used to
estimate model parameters and obtain optimal plans using the Fisher Information
Matrix. Fisher information measures the amount of information that an observable
random variable X carries about an unknown parameter θ upon which the likelihood
function depends.
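As a minimal sketch of the competing risks setup above, the following Python code simulates independent latent failure times, records the overall failure time T = min{T_1, …, T_r} together with its cause, and evaluates the joint density of (T, C). Exponential latent lifetimes are an assumption made here purely for illustration; the function names and the rate values are hypothetical.

```python
import math
import random

def simulate_competing_risks(rates, n, seed=0):
    """Simulate n units; cause j has an exponential latent lifetime with rate rates[j] (assumed)."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        latent = [rng.expovariate(lam) for lam in rates]
        t = min(latent)              # overall failure time T = min{T_1, ..., T_r}
        cause = latent.index(t)      # indicator C of the cause that produced the failure
        sample.append((t, cause))
    return sample

def joint_density(t, j, rates):
    """f_{T,C}(t, j) = g_j(t) * prod_{k != j} (1 - G_k(t)) for exponential latent lifetimes."""
    return rates[j] * math.exp(-sum(rates) * t)

def overall_pdf(t, rates):
    """f(t) = sum_j h_j(t) * prod_k (1 - G_k(t)); the exponential hazards are constant."""
    return sum(rates) * math.exp(-sum(rates) * t)
```

With rates 0.5 and 1.5, for instance, about a quarter of the simulated failures should be attributed to the first cause, and summing the joint density over the causes recovers the overall PDF.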

8.2.1 Constant-Stress Accelerated Life Test with Independent
Causes of Failures
In a constant-stress ALT (CSALT) set-up, sub-groups of test specimens are allocated
to different test chambers and, in each test chamber, the test units are subjected to
different but fixed stress levels. The  experiment is terminated according to a pre-
specified censoring scheme. Each unit is tested under the same temperature for a
fixed duration of time. For example, 10 units are tested for 100 hours at 310 K, 10
different units are tested for 100 hours at 320 K, and another 10 different units are
tested for 100 hours at 330 K.
Figure 8.2 exhibits the constant-stress patterns.

FIGURE 8.2  Constant-stress loading.



McCool (1978) presented a technique for finding interval estimates for Weibull
parameters of a primary failure mode when there is a secondary failure mode with
the same (but unknown) Weibull shape parameter. Moeschberger and David (1971)
and David and Moeschberger (1978) gave an expression for the likelihood of com-
peting risk data under censoring and fixed experimental conditions. Large sample
properties of maximum likelihood estimators (MLEs) were discussed for Weibull
and log-normal distributions. Herman and Patell (1971) discussed the MLEs under
competing causes of failure.
Klein and Basu (1981, 1982a) analyzed ALT for more than one failure mode.
For  independent competing failure modes for each stress level, the authors found
MLEs with life times as exponential or Weibull, with common or different shape
parameters under Type-I, Type-II, or progressively censored data. Using a general
stress function, Klein and Basu (1982b) obtained estimates of model parameters
under various censoring schemes. A dependent competing risk model was proposed
by considering a bivariate Weibull distribution as the joint distribution of two com-
peting risks.
Nelson (1990) and Craiu and Lee (2005) analyzed ALTs under competing causes
of failure for semiconductor devices, ball bearing assemblies, and insulation sys-
tems. Kim and Bai (2002) analyzed ALT data with two competing risks taking a
mixture of two Weibull distributions and location parameters as linear functions of
stress.
Pascual (2007) considered the problem of planning ALT when the respective
times to failure of competing risks are independently distributed as Weibull with a
common, known shape parameter.
Shi et al. (2013) proposed a CSALT with competing risks for failure from expo-
nential distribution under progressive Type-II hybrid censoring. They obtained the
MLE and Bayes estimators of the parameter and proved their equivalence under
certain circumstances. A  Monte Carlo simulation demonstrated the accuracy and
effectiveness of the estimations.
Yu et al. (2014) proposed an accelerated testing plan with high and low tempera-
tures as multiple failure modes for a complicated device. They gave the reliability
function of the product and established the efficiency of the plan through a numerical
example.
Wu and Huang (2017) considered planning of two or more level CSALTs with
competing risk data from Type-II progressive censoring assuming exponential
distribution.

8.2.1.1  Model Illustration


In this section, the CSALT with competing failure modes proposed by Wu and Huang
(2017) is described briefly for illustration.
Consider a CSALT with L levels of stress and let $y_l$ be the lth stress level, l = 1,
2, …, L. Each unit is run at a constant stress and may fail due to any of J failure modes.
Assume that at $y_l$, the latent failure times $X_{i1l}, X_{i2l}, \ldots, X_{iJl}$ are independent and
exponentially distributed with hazard rates $\lambda_{jl}$ (> 0), i = 1, 2, …, n, l = 1, 2, …, L, and
j = 1, 2, …, J. The failure time of the ith test unit is:

$$X_{il} = \min\{X_{i1l}, X_{i2l}, \ldots, X_{iJl}\}.$$

It is assumed that at the lth stress level, the mean lifetime of a test unit is a log-linear
function of standardized stress:

$$\log\left(\frac{1}{\lambda_{jl}}\right) = \beta_{0j} + \beta_{1j}s_l, \tag{8.5}$$

where $-\infty < \beta_{0j} < \infty$ and $\beta_{1j} < 0$ are unknown design parameters.

The standardized stress, $s_l$, is:

$$s_l = \frac{y_l - y_D}{y_L - y_D}, \quad 0 \le s_l \le 1, \quad l = 1, 2, \ldots, L,$$

where $y_1 < y_2 < \cdots < y_L$ are the L ordered stress levels and $y_D$ is the stress at normal
operating condition. The log-linear function is a common choice of life-stress relationship
because it includes the power law and the Arrhenius law as special cases.
The failure density and failure distribution of the ith unit under the jth risk are,
respectively:

$$f(x_{il}, j) = \lambda_{jl}\,e^{-\lambda_{+l}x_{il}}, \quad x_{il} > 0, \tag{8.6}$$

$$F(x_{il}, j) = \frac{\lambda_{jl}}{\lambda_{+l}}\left(1 - e^{-\lambda_{+l}x_{il}}\right), \quad x_{il} > 0, \quad \lambda_{+l} = \sum_{j=1}^{J}\lambda_{jl}. \tag{8.7}$$

The failure distribution at failure time $x_{il}$ is:

$$F(x_{il}) = 1 - e^{-\lambda_{+l}x_{il}}, \quad x_{il} > 0. \tag{8.8}$$

The authors used a progressive Type-II censoring scheme. Under this scheme,
$n_l$ units are tested at stress level $s_l$ with $\sum_{l=1}^{L} n_l = n$. For each stress level l, $m_l$
failures are observed. The data are collected as follows:
When the first failure time, $X_{(1)l}$, and its cause of failure, $\delta_{1l}$, are observed, $r_{1l}$ of
the surviving units are selected randomly and removed. When the second failure
time, $X_{(2)l}$, and its cause of failure, $\delta_{2l}$, are observed, $r_{2l}$ of the surviving units are
selected randomly and removed, and so on until the $m_l$th failure is observed. For
simplicity, $X_{il}$ is used instead of $X_{(i)l}$. The Type-II progressively censored data with
competing risks at stress level $s_l$ are:

$$(X_{1l}, \delta_{1l}, r_{1l}),\ (X_{2l}, \delta_{2l}, r_{2l}),\ \ldots,\ (X_{m_l l}, \delta_{m_l l}, r_{m_l l}),$$

where:
$X_{1l} < X_{2l} < \cdots < X_{m_l l}$ are the $m_l$ observed lifetimes,
$\delta_{1l}, \delta_{2l}, \ldots, \delta_{m_l l}$ are the observed causes of failure,
$r_{1l}, r_{2l}, \ldots, r_{m_l l}$ are the numbers of censored units.

The likelihood function under the Type-II progressive censoring scheme is:

$$L = \prod_{l=1}^{L}\prod_{i=1}^{m_l}\left(\prod_{j=1}^{J}\lambda_{jl}^{I_{ijl}}\right)e^{-\lambda_{+l}x_{il}(r_{il}+1)}, \tag{8.9}$$

where $I_{ijl} = 1$ if $\delta_{il} = j$ and zero otherwise, and, from Equation 8.5:

$$\lambda_{jl} = e^{-\beta_{0j} - \beta_{1j}s_l} \quad \text{and} \quad \lambda_{+l} = \sum_{j=1}^{J}e^{-\beta_{0j} - \beta_{1j}s_l}.$$

Using the likelihood function, the authors have used D-optimality, variance optimal-
ity, and A-optimality criteria to obtain the optimal stress level as well as the optimal
sample allocation at each stress level. They used the real data set from Nelson (1990)
on times to failure of the Class-H insulation system in motors to explain the pro-
posed method. The design temperature is 180°C. The insulation systems are tested
at high temperatures of 190°C, 220°C, 240°C, and 260°C. Turn, Phase, and Ground
are three causes of failure.
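The likelihood in Equation 8.9 can be sketched in a few lines of Python. The code below is an illustrative sketch, not the authors' implementation: the data layout (a dictionary keyed by standardized stress) and the function names are assumptions chosen for this example, and failure modes are indexed from zero.

```python
import math

def hazard_rates(beta0, beta1, s):
    """lambda_jl = exp(-beta0_j - beta1_j * s_l) for every failure mode j (Equation 8.5)."""
    return [math.exp(-b0 - b1 * s) for b0, b1 in zip(beta0, beta1)]

def log_likelihood(data, beta0, beta1):
    """Logarithm of Equation 8.9.

    data maps each standardized stress s_l to a list of (x, cause, r) triples:
    observed failure time, index of the failure mode, and number of units
    removed at that failure (hypothetical layout for this sketch).
    """
    total = 0.0
    for s, observations in data.items():
        lam = hazard_rates(beta0, beta1, s)
        lam_plus = sum(lam)                      # lambda_{+l}
        for x, cause, r in observations:
            # log(lambda_jl) for the observed cause, minus lambda_{+l} * x * (r + 1)
            total += math.log(lam[cause]) - lam_plus * x * (r + 1)
    return total
```

A function of this form could be passed to a numerical optimizer to obtain the MLEs of the β parameters, or differentiated twice to approximate the Fisher information used by the optimality criteria.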

8.2.2 Step-Stress Accelerated Life Test with Independent Causes of Failures
Step-stress loading requires one test chamber. The stress on a specimen is increased
step-by-step wherein at each step it is subject to constant stress for a specified period.
The experiment is terminated according to a pre-specified censoring scheme.
Figure 8.3 exhibits the step-stress loading scheme.
To model data from step-stress test, the life distribution under step-stressing
must be related to the distribution under a constant stress. Such a model, known as
Cumulative Exposure Model (CEM), was put forward by Nelson (1980).
The CEM assumes that the remaining life of a unit depends only on the current
cumulative fraction failed and the current stress, regardless of how the fraction was
accumulated. At the current stress, survivors fail according to the CDF for that stress
but starting at the previously accumulated fraction failed. Under the CEM, the
step-stress life distribution is

$$G(w) = \begin{cases} G_1(w), & \tau_0 \le w < \tau_1, \\ G_i(w - \tau_{i-1} + s_{i-1}), & \tau_{i-1} \le w < \tau_i, \quad i = 2, \ldots, k-1, \\ G_k(w - \tau_{k-1} + s_{k-1}), & \tau_{k-1} \le w < \infty, \end{cases} \tag{8.10}$$

where $s_0 = \tau_0 = 0$ and $s_i$ (i > 0) is the solution of $G_{i+1}(s_i) = G_i(\tau_i - \tau_{i-1} + s_{i-1})$, i = 1, 2, …, k − 1.
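The recursion behind the CEM can be illustrated with a short Python sketch. It assumes, purely for illustration, that the life distribution at step i is exponential with mean `thetas[i-1]`, so the equivalent start times have a closed form; the function and argument names are hypothetical.

```python
import math

def equivalent_start_times(thetas, taus):
    """Return [s_0, s_1, ..., s_{k-1}] for exponential step distributions (assumed).

    s_i solves G_{i+1}(s_i) = G_i(tau_i - tau_{i-1} + s_{i-1}); for exponential G_i
    this matches the accumulated exposure (time divided by mean) across steps.
    thetas = [theta_1, ..., theta_k], taus = [tau_1, ..., tau_{k-1}].
    """
    s = [0.0]                       # s_0 = 0
    prev_tau = 0.0                  # tau_0 = 0
    for i, tau in enumerate(taus):
        exposure = (tau - prev_tau + s[-1]) / thetas[i]
        s.append(thetas[i + 1] * exposure)
        prev_tau = tau
    return s

def cem_cdf(w, thetas, taus):
    """Step-stress CDF G(w) under the cumulative exposure model."""
    s = equivalent_start_times(thetas, taus)
    change_points = [0.0] + list(taus)
    step = sum(1 for tau in taus if w >= tau)      # zero-based index of the current step
    shifted = w - change_points[step] + s[step]
    return 1.0 - math.exp(-shifted / thetas[step])
```

For means 10, 5, 2 and stress change times 4 and 6, the CDF at w = 7 equals 1 − exp(−(4/10 + 2/5 + 1/2)), and the CDF is continuous at each change time.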


FIGURE 8.3  Step-stress loading (stress versus time).

Khamis and Higgins (1998) formulated the Weibull CEM, which is based on the
time transformation of exponential CEM. Bai and Chun (1991) studied optimum
simple step-stress accelerated life tests (SSALTs) with competing causes of failure
when the distributions of each failure cause were independent and exponential.
Balakrishnan and Han (2008) and Han and Balakrishnan (2010) considered an
exponential SSALT with competing risks using Type-I and Type-II censored data
respectively. Donghoon and Balakrishnan (2010) studied inferential problem for
exponential distribution under time constraint. Using time-censored data, Liu and
Qiu (2011) devised a multiple-step SSALT with independent competing risks.
Srivastava et al. (2014) considered simple SSALT under Type-I censoring using
the Khamis-Higgins model (an alternative to the Weibull CEM) with competing
causes of failure. The Khamis-Higgins model is based on time transformation of the
exponential model. The life distribution of each failure cause, which is independent
of the others, is assumed to be Weibull with the log of characteristic life as a linear
function of the stress level.
Haghighi (2014) studied a step-stress test under competing risks and degradation
measurements and estimated the reliability function.

8.2.2.1  Model Illustration


In this section, the design of an SSALT plan with competing failure modes is explained
using the methodology adopted by Srivastava et al. (2014).
Consider an SSALT with two causes of failure. There are two independent
potential failure times for n test specimens corresponding to two causes of failure.
The failure time of a unit is the lowest of its potential failure times. Two stress
levels, $x_1$ and $x_2$ ($x_1 < x_2$), are used and $x_0$ is the stress level under normal operating
condition. For any level of stress, i, the lifetime under each failure cause j follows
a Weibull distribution with shape parameter δ (known) and scale parameter $\theta_{ij}$, i,
j = 1, 2. Hence:

$$G_j(w) = 1 - \exp\left(-\frac{w^{\delta}}{\theta_{ij}}\right), \quad 0 \le w < \infty. \tag{8.11}$$

The characteristic lives (the 63.2nd percentiles of the distributions) of the two
potential failure times are log-linear functions of stress:

$$\log\theta_{ij}^{1/\delta} = \alpha_j + \beta_j x_i; \quad i = 0, 1, 2; \quad j = 1, 2. \tag{8.12}$$

$\alpha_j$, $\beta_j$ (< 0) are unknown parameters depending on the nature of the product and
the test method, and δ is known. That $\theta_{ij}^{1/\delta}$ is the characteristic life of the
distribution in Equation 8.11 can be shown as follows:

$$G_j(\xi_p) = p \;\Rightarrow\; \xi_p = \bigl(-\theta_{ij}\log(1-p)\bigr)^{1/\delta} \;\Rightarrow\; \theta_{ij}^{1/\delta} \approx \xi_{0.632}, \quad i, j = 1, 2. \tag{8.13}$$
For each failure cause, Weibull CEM is assumed. Failure times and failure causes of
test units are observed jointly and continuously.
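From the CEM and Weibull distributed life assumptions, the CDF of failure
cause j = 1, 2 under a simple time step-stress test is the Khamis-Higgins model
given by:

$$G_j(w) = G_j(w; \theta_{1j}, \theta_{2j}) = \begin{cases} 1 - \exp\left\{-\dfrac{w^{\delta}}{\theta_{1j}}\right\} & \text{if } 0 < w < \tau, \\[2ex] 1 - \exp\left\{-\dfrac{\tau^{\delta}}{\theta_{1j}} - \dfrac{w^{\delta} - \tau^{\delta}}{\theta_{2j}}\right\} & \text{if } \tau \le w < \infty. \end{cases} \tag{8.14}$$

Since only the smaller of $W_1$ and $W_2$ is observed, let the overall failure time of a test
unit be

$$W = \min\{W_1, W_2\}.$$

The CDF and PDF of W are

$$F(w) = F(w; \theta) = 1 - (1 - G_1(w))(1 - G_2(w)) = \begin{cases} 1 - \exp\left\{-\left(\dfrac{1}{\theta_{11}} + \dfrac{1}{\theta_{12}}\right)w^{\delta}\right\} & \text{if } 0 < w < \tau, \\[2ex] 1 - \exp\left\{-\left(\dfrac{1}{\theta_{11}} + \dfrac{1}{\theta_{12}}\right)\tau^{\delta} - \left(\dfrac{1}{\theta_{21}} + \dfrac{1}{\theta_{22}}\right)(w^{\delta} - \tau^{\delta})\right\} & \text{if } \tau \le w < \infty, \end{cases} \tag{8.15}$$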
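$$f(w) = f(w; \theta) = \begin{cases} \delta w^{\delta-1}\left(\dfrac{1}{\theta_{11}} + \dfrac{1}{\theta_{12}}\right)\exp\left\{-\left(\dfrac{1}{\theta_{11}} + \dfrac{1}{\theta_{12}}\right)w^{\delta}\right\} & \text{if } 0 < w < \tau, \\[2ex] \delta w^{\delta-1}\left(\dfrac{1}{\theta_{21}} + \dfrac{1}{\theta_{22}}\right)\exp\left\{-\left(\dfrac{1}{\theta_{11}} + \dfrac{1}{\theta_{12}}\right)\tau^{\delta} - \left(\dfrac{1}{\theta_{21}} + \dfrac{1}{\theta_{22}}\right)(w^{\delta} - \tau^{\delta})\right\} & \text{if } \tau \le w < \infty, \end{cases} \tag{8.16}$$

respectively, where $\theta = (\theta_1, \theta_2)$ with $\theta_i = (\theta_{i1}, \theta_{i2})$ for i = 1, 2. Furthermore, let C
denote the indicator for the cause of failure. For j, j′ = 1, 2 and j′ ≠ j, the joint PDF
of (W, C) is given by:

$$f_{W,C}(w, j) = g_j(w)(1 - G_{j'}(w)) = \begin{cases} \dfrac{\delta w^{\delta-1}}{\theta_{1j}}\exp\left\{-\left(\dfrac{1}{\theta_{11}} + \dfrac{1}{\theta_{12}}\right)w^{\delta}\right\} & \text{if } 0 < w < \tau, \\[2ex] \dfrac{\delta w^{\delta-1}}{\theta_{2j}}\exp\left\{-\left(\dfrac{1}{\theta_{11}} + \dfrac{1}{\theta_{12}}\right)\tau^{\delta} - \left(\dfrac{1}{\theta_{21}} + \dfrac{1}{\theta_{22}}\right)(w^{\delta} - \tau^{\delta})\right\} & \text{if } \tau \le w < \infty. \end{cases} \tag{8.17}$$

The relative risk imposed on a test unit before τ due to failure cause j is
denoted by

$$\pi_{1j} = \Pr[C = j \mid 0 < W < \tau] = \frac{\theta_{1j}^{-1}}{\theta_{11}^{-1} + \theta_{12}^{-1}}, \quad j = 1, 2. \tag{8.18}$$

Similarly, the relative risk after τ due to the cause j is denoted by

$$\pi_{2j} = \Pr[C = j \mid W \ge \tau] = \frac{\theta_{2j}^{-1}}{\theta_{21}^{-1} + \theta_{22}^{-1}}, \quad j = 1, 2. \tag{8.19}$$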

These equations are simply the proportion of failure rates in the given time frame.
It follows from Equations 8.11 through 8.13 that W and C are independent given the
time frame in which a failure has occurred. For j = 1,2, let

n1j = number of units failing before τ due to failure cause j,


n2j = number of units failing after τ due to failure cause j.
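Under the assumption of the CEM, the likelihood function of θ based on the Type-I
censored sample is:

$$\begin{aligned} L(\theta) ={}& \prod_{i=1}^{n_{11}}\left[\frac{\delta w_i^{\delta-1}}{\theta_{11}}\,e^{-w_i^{\delta}/\theta_{1\bullet}}\right]\prod_{i=1}^{n_{12}}\left[\frac{\delta w_i^{\delta-1}}{\theta_{12}}\,e^{-w_i^{\delta}/\theta_{1\bullet}}\right] \\ &\times \prod_{i=1}^{n_{21}}\left[\frac{\delta w_i^{\delta-1}}{\theta_{21}}\,e^{-\frac{\tau^{\delta}}{\theta_{1\bullet}} - \frac{w_i^{\delta} - \tau^{\delta}}{\theta_{2\bullet}}}\right]\prod_{i=1}^{n_{22}}\left[\frac{\delta w_i^{\delta-1}}{\theta_{22}}\,e^{-\frac{\tau^{\delta}}{\theta_{1\bullet}} - \frac{w_i^{\delta} - \tau^{\delta}}{\theta_{2\bullet}}}\right] \\ &\times \exp\left\{-n_c\left[\frac{T^{\delta} - \tau^{\delta}}{\theta_{2\bullet}} + \frac{\tau^{\delta}}{\theta_{1\bullet}}\right]\right\}, \end{aligned} \tag{8.20}$$

where

$$\frac{1}{\theta_{1\bullet}} = \frac{1}{\theta_{11}} + \frac{1}{\theta_{12}}, \qquad \frac{1}{\theta_{2\bullet}} = \frac{1}{\theta_{21}} + \frac{1}{\theta_{22}},$$

$$n_{1\bullet} = n_{11} + n_{12}, \qquad n_{2\bullet} = n_{21} + n_{22}, \qquad n = n_{1\bullet} + n_{2\bullet} + n_c,$$

T is the censoring time, $n_c$ is the number of units censored at T, and n is fixed and known.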

The authors estimated the model parameters and obtained the optimum plan for the
time-censored SSALT, which minimizes the sum, over all causes of failure, of the
asymptotic variances of the MLEs of the log characteristic life at design stress. The
inferential procedures involving design parameters were also studied.
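The Khamis-Higgins model of this section can be sketched in Python as follows: the cause-specific CDF of Equation 8.14, the relative risks of Equations 8.18 and 8.19, and the logarithm of the likelihood in Equation 8.20. This is a minimal sketch under stated assumptions, not the authors' code: the function names, the zero-based cause index, and the layout `theta[i][j]` (scale at step i+1 for cause j+1) are hypothetical choices made for illustration.

```python
import math

def kh_cdf(w, theta1, theta2, delta, tau):
    """Cause-specific Khamis-Higgins CDF: scale theta1 before tau, theta2 afterwards."""
    if w < tau:
        return 1.0 - math.exp(-w ** delta / theta1)
    return 1.0 - math.exp(-tau ** delta / theta1 - (w ** delta - tau ** delta) / theta2)

def relative_risks(theta_row):
    """pi_ij = (1 / theta_ij) / sum_k (1 / theta_ik) for one stress step."""
    inverse = [1.0 / th for th in theta_row]
    total = sum(inverse)
    return [v / total for v in inverse]

def log_likelihood(failures, n_censored, theta, delta, tau, censor_time):
    """Log of the Type-I censored likelihood; failures is a list of (w, cause) pairs."""
    inv1 = sum(1.0 / th for th in theta[0])    # 1 / theta_{1.}
    inv2 = sum(1.0 / th for th in theta[1])    # 1 / theta_{2.}
    ll = 0.0
    for w, cause in failures:
        step = 0 if w < tau else 1
        ll += math.log(delta) + (delta - 1.0) * math.log(w) - math.log(theta[step][cause])
        if step == 0:
            ll -= w ** delta * inv1
        else:
            ll -= tau ** delta * inv1 + (w ** delta - tau ** delta) * inv2
    # each censored unit contributes the survival probability at the censoring time
    ll -= n_censored * (tau ** delta * inv1 + (censor_time ** delta - tau ** delta) * inv2)
    return ll
```

Maximizing this function over the scale parameters, or over α_j and β_j after substituting Equation 8.12, would give the MLEs on which the optimum plan is based.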

8.2.3 Modified Ramp-Stress Accelerated Life Test with Independent Causes of Failures

Modified ramp-stress loading, proposed by Srivastava and Gupta (2015), requires
one test chamber. The stress is increased at a low constant rate from the normal
operating stress level, $s_0$, to a stress level $s_1$. Thereafter, it is increased at a
higher constant rate until the termination of the experiment.
Modified ramp-stress ALT is designed using a generalized formulation of the
CEM wherein the stress is $s(t) = k_1 t$, where $k_1$ is the rate of increase of stress.
Figure 8.4 shows this stress pattern.

FIGURE 8.4  Modified ramp-stress loading.

Consider the life distribution $F_0(t; s, x)$ that depends on constant stress, $s$ (voltage,
for example), and other variables through the scale parameter $\alpha(s, x)$, which is a
function of $s$, $x$, and coefficients:

$$F_0(t; s, x) = G\left(\frac{t}{\alpha(s, x)}\right), \tag{8.21}$$

where the scale parameter is set equal to unity in the assumed CDF, G(⋅). Then

$$F_0(t_I) = G(\varepsilon), \tag{8.22}$$

where

$$t_I = \Delta_1 + \Delta_2 + \cdots + \Delta_i + \cdots + \Delta_I \tag{8.23}$$

is the time after I steps in step-stress testing with step i at stress level $s_i$ for a time

$$\Delta_i = t_i - t_{i-1}, \tag{8.24}$$

with $t_0 = 0$, and

$$\varepsilon = \frac{\Delta_1}{\alpha(s_1, x)} + \frac{\Delta_2}{\alpha(s_2, x)} + \cdots + \frac{\Delta_i}{\alpha(s_i, x)} + \cdots + \frac{\Delta_I}{\alpha(s_I, x)} \tag{8.25}$$

is the cumulative exposure for the failure mode.

When $\alpha(s, x) = \alpha(s(t), x)$, that is, the stress $s$ is a function of time, then, using
Equation 8.25, the corresponding cumulative exposure function is given by

$$\varepsilon(t) = \lim_{\Delta_i \to 0}\left[\frac{\Delta_1}{\alpha(s_1, x)} + \frac{\Delta_2}{\alpha(s_2, x)} + \cdots + \frac{\Delta_i}{\alpha(s_i, x)} + \cdots + \frac{\Delta_I}{\alpha(s_I, x)}\right] = \int_0^t \frac{du}{\alpha(s(u), x)}, \tag{8.26}$$

and

$$F_0(t; s(t), x) = G(\varepsilon(t)) \tag{8.27}$$

is the generalized formulation of the CEM.

8.2.3.1  Model Illustration


Srivastava and Gupta (2017) explored formulation of the optimum time-censored
ALT model under modified ramp-stress loading when different failure causes
have independent exponential life distributions. Their procedure is explained as
follows.
Suppose each unit fails by one of two fatal risk factors and the time to failure
by each competing risk has an independent exponential life distribution obeying the
linear CEM. For failure cause j, the cumulative exposure ε(t) at time t, for 0 < t ≤ τ1
and τ1 < t ≤ η, is obtained under the stress s(⋅) as:

$$\varepsilon(t) = \int_0^t \frac{dy}{\theta(s(y))} = \int_0^t \frac{dy}{e^{\gamma_{0j}}\left(\dfrac{s_0}{s(y)}\right)^{\gamma_{1j}}} = \int_0^t \frac{dy}{e^{\gamma_{0j}}\left(\dfrac{s_0}{s_0 + \beta_1 y}\right)^{\gamma_{1j}}} = \frac{e^{-\gamma_{0j}}s_0^{-\gamma_{1j}}\left((s_0 + \beta_1 t)^{1+\gamma_{1j}} - s_0^{1+\gamma_{1j}}\right)}{\beta_1(1 + \gamma_{1j})} = W_{1j}(t), \quad 0 < t \le \tau_1, \tag{8.28}$$

and

$$\varepsilon(t) = \int_0^t \frac{dy}{\theta(s(y))} = \varepsilon(\tau_1) + \int_{\tau_1}^t \frac{dy}{e^{\gamma_{0j}}\left(\dfrac{s_0}{s_1 + \beta_2(y - \tau_1)}\right)^{\gamma_{1j}}} = \varepsilon(\tau_1) + \frac{e^{-\gamma_{0j}}s_0^{-\gamma_{1j}}\left((s_1 + \beta_2(t - \tau_1))^{1+\gamma_{1j}} - s_1^{1+\gamma_{1j}}\right)}{\beta_2(1 + \gamma_{1j})} = \varepsilon(\tau_1) + W_{2j}(t), \quad \tau_1 < t \le \eta. \tag{8.29}$$

Then the CDF of failure cause j (j = 1, 2) under modified ramp-stress is:

$$G_j(t) = J(\varepsilon(t)), \tag{8.30}$$

where:
J(⋅) is the exponential CDF with mean θ set equal to one and
ε(t) is the cumulative exposure (damage) function.
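The closed forms W_1j(t) and W_2j(t) can be checked numerically against the integral definition of the cumulative exposure. The Python sketch below does this under the inverse power law θ(s) = e^{γ0}(s_0/s)^{γ1} that appears in Equations 8.28 and 8.29, taking s_1 = s_0 + β_1 τ_1 as the stress reached at τ_1. The function names and the parameter values used to exercise it are hypothetical, and the midpoint rule is only one possible quadrature choice.

```python
import math

def theta(stress, gamma0, gamma1, s0):
    """Mean life at a given stress under the inverse power law exp(gamma0) * (s0 / stress)**gamma1."""
    return math.exp(gamma0) * (s0 / stress) ** gamma1

def stress_at(y, s0, beta1, beta2, tau1):
    """Modified ramp stress: rate beta1 up to tau1, rate beta2 afterwards."""
    if y <= tau1:
        return s0 + beta1 * y
    return s0 + beta1 * tau1 + beta2 * (y - tau1)

def exposure_closed_form(t, gamma0, gamma1, s0, beta1, beta2, tau1):
    """W_1j(t) for t <= tau1 and W_1j(tau1) + W_2j(t) for t > tau1 (requires gamma1 != -1)."""
    scale = math.exp(-gamma0) * s0 ** (-gamma1)
    p = 1.0 + gamma1

    def w1(u):
        return scale * ((s0 + beta1 * u) ** p - s0 ** p) / (beta1 * p)

    if t <= tau1:
        return w1(t)
    s1 = s0 + beta1 * tau1
    w2 = scale * ((s1 + beta2 * (t - tau1)) ** p - s1 ** p) / (beta2 * p)
    return w1(tau1) + w2

def exposure_numeric(t, gamma0, gamma1, s0, beta1, beta2, tau1, steps=20000):
    """Midpoint-rule approximation of the integral of 1 / theta(s(y)) over [0, t]."""
    h = t / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * h
        total += h / theta(stress_at(y, s0, beta1, beta2, tau1), gamma0, gamma1, s0)
    return total

def cause_cdf(t, gamma0, gamma1, s0, beta1, beta2, tau1):
    """G_j(t) = 1 - exp(-epsilon(t)), the unit-mean exponential CDF of the exposure."""
    return 1.0 - math.exp(-exposure_closed_form(t, gamma0, gamma1, s0, beta1, beta2, tau1))
```

The numerical integral mirrors the limiting step-stress sum of the generalized CEM, so agreement between the two functions is a quick sanity check on the closed forms.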
Hence, the life distribution under modified ramp-stress loading corresponding to
failure cause j (j = 1, 2) is:

$$G_j(t) \equiv G_j(t; \gamma_{0j}, \gamma_{1j}) = \begin{cases} 1 - \exp\{-W_{1j}(t)\}, & \text{if } 0 < t < \tau_1, \\ 1 - \exp\{-W_{1j}(\tau_1) - W_{2j}(t)\}, & \text{if } \tau_1 \le t < \infty, \end{cases} \tag{8.31}$$

and the failure density is:

$$g_j(t) \equiv g_j(t; \gamma_{0j}, \gamma_{1j}) = \begin{cases} \exp\{-W_{1j}(t)\}\,W'_{1j}(t), & \text{if } 0 < t < \tau_1, \\ \exp\{-W_{1j}(\tau_1) - W_{2j}(t)\}\,W'_{2j}(t), & \text{if } \tau_1 \le t < \infty. \end{cases} \tag{8.32}$$

Let $T = \min\{T_1, T_2\}$ denote the overall failure time of a test unit. Then its CDF and
PDF, respectively, are

$$F(t) \equiv F(t; \gamma_{0j}, \gamma_{1j}) = 1 - (1 - G_1(t))(1 - G_2(t)) = \begin{cases} 1 - \exp\{-W_{11}(t) - W_{12}(t)\}, & \text{if } 0 < t < \tau_1, \\ 1 - \exp\{-W_{11}(\tau_1) - W_{21}(t) - W_{12}(\tau_1) - W_{22}(t)\}, & \text{if } \tau_1 \le t < \infty, \end{cases} \tag{8.33}$$

$$f(t) \equiv f(t; \gamma_{0j}, \gamma_{1j}) = \begin{cases} \exp\{-W_{11}(t) - W_{12}(t)\}\,\bigl(W'_{11}(t) + W'_{12}(t)\bigr), & \text{if } 0 < t < \tau_1, \\ \exp\{-W_{11}(\tau_1) - W_{21}(t) - W_{12}(\tau_1) - W_{22}(t)\}\,\bigl(W'_{21}(t) + W'_{22}(t)\bigr), & \text{if } \tau_1 \le t < \infty. \end{cases} \tag{8.34}$$

Furthermore, let the indicator for the cause of failure be denoted by C. Then, under
the assumptions, for j, j′ = 1, 2 and j′ ≠ j, the joint PDF of (T, C) is given by:

$$f_{T,C}(t, j) = g_j(t)(1 - G_{j'}(t)) = \begin{cases} \exp\{-W_{1j}(t)\}\exp\{-W_{1j'}(t)\}\,W'_{1j}(t), & \text{if } 0 < t < \tau_1, \\ \exp\{-W_{1j}(\tau_1) - W_{2j}(t)\}\exp\{-W_{1j'}(\tau_1) - W_{2j'}(t)\}\,W'_{2j}(t), & \text{if } \tau_1 \le t < \infty. \end{cases} \tag{8.35}$$
The relative risk imposed on a test unit before τ1 due to failure cause j, for j = 1, 2, is
denoted by:

$$\pi_{1j} = \Pr[C = j \mid 0 < T < \tau_1] = \frac{\int_0^{\tau_1}\exp\{-W_{1j}(t)\}\exp\{-W_{1j'}(t)\}\,W'_{1j}(t)\,dt}{1 - \exp\{-W_{11}(\tau_1) - W_{12}(\tau_1)\}}. \tag{8.36}$$

Similarly, the relative risk after τ1 due to the cause j, for j = 1, 2, is denoted by:

$$\pi_{2j} = \Pr[C = j \mid T \ge \tau_1] = \frac{\int_{\tau_1}^{\infty}\exp\{-W_{1j}(\tau_1) - W_{2j}(t)\}\exp\{-W_{1j'}(\tau_1) - W_{2j'}(t)\}\,W'_{2j}(t)\,dt}{\exp\{-W_{11}(\tau_1) - W_{12}(\tau_1)\}}. \tag{8.37}$$

Define, for j = 1, 2:

$n_{1j}$ is the number of units that fail before τ1 due to the failure cause j,
$n_{2j}$ is the number of units that fail after τ1 due to the failure cause j.

Using Equation 8.35, the likelihood function is:

$$\begin{aligned} L(\gamma_{0j}, \gamma_{1j}) ={}& \prod_{j=1}^{2}\prod_{i=1}^{n_{1j}} f(t_i, c_i)\,\prod_{j=1}^{2}\prod_{i=1}^{n_{2j}} f(t_i, c_i)\,(1 - F(\eta))^{n_c} \\ ={}& \prod_{i=1}^{n_{11}}\bigl[\exp\{-W_{11}(t_i)\}\exp\{-W_{12}(t_i)\}\,W'_{11}(t_i)\bigr]\prod_{i=1}^{n_{12}}\bigl[\exp\{-W_{12}(t_i)\}\exp\{-W_{11}(t_i)\}\,W'_{12}(t_i)\bigr] \\ &\times \prod_{i=1}^{n_{21}}\bigl[\exp\{-W_{11}(\tau_1) - W_{21}(t_i)\}\exp\{-W_{12}(\tau_1) - W_{22}(t_i)\}\,W'_{21}(t_i)\bigr] \\ &\times \prod_{i=1}^{n_{22}}\bigl[\exp\{-W_{12}(\tau_1) - W_{22}(t_i)\}\exp\{-W_{11}(\tau_1) - W_{21}(t_i)\}\,W'_{22}(t_i)\bigr] \\ &\times \exp\bigl\{-n_c\bigl[W_{11}(\tau_1) + W_{21}(\eta) + W_{12}(\tau_1) + W_{22}(\eta)\bigr]\bigr\}, \end{aligned} \tag{8.38}$$

where $n = n_{11} + n_{12} + n_{21} + n_{22} + n_c$, $n_c$ is the number of units surviving beyond η,
and n is fixed and known.



The model parameters have been estimated, and the optimal plan determines the
relevant experimental variables, namely, the stress rate and the stress rate change
point(s), using the D-optimality criterion. This criterion consists of finding the
optimal stress rate and the optimal stress rate change point by maximizing the
base-10 logarithm of the determinant of the Fisher information matrix. It is
motivated by the fact that the volume of the joint confidence region of the model
parameters is inversely proportional to the square root of the determinant of the
Fisher information matrix.
The method developed has been explained using a numerical example. The results
of sensitivity analysis show that the plan is robust to small deviations from the true
values of baseline parameters.
Srivastava and Gupta (2018) also formulated the triangular cyclic-stress ALT
plan with independent competing failure modes.

8.3 ACCELERATED LIFE TESTS WITH DEPENDENT CAUSES OF FAILURES
Competing failure modes are usually dependent. The literature on dependent
competing failure modes is sparse in engineering but is available in biostatistics
and econometrics. Copula models have become increasingly popular for
modeling multivariate survival data. Carriere (1994) and Escarela and Carriere
(2003) modeled the dependence between two failure times with a two-dimensional
copula. Carriere (1994) used a bivariate Gaussian copula to model the effect of
completely eliminating one of two competing causes of death on human mortality.
Escarela and Carriere (2003) fitted the bivariate Frank copula to
a prostate cancer data set. Ancha and Yincai (2012) introduced copulas in
reliability and analyzed ALT data with multiple dependent failure modes. Bunea
and Mazzuchi (2014) applied the copula-based ALT competing risk model
to Nelson's motorettes data.

8.3.1 Copulas and Their Properties


Copulas help to model dependence between two failure modes. The  dependence
structure relates the known marginal distributions of failure modes to their bivari-
ate distribution (Nelsen 2006). The kind of dependence structure depends upon the
choice of an appropriate copula.
A probabilistic way to define the copula is provided by Sklar (1959).
Let X, Y be random variables with continuous distribution functions $F_1$, $F_2$ and survival
functions $\bar F_1 = 1 - F_1$ and $\bar F_2 = 1 - F_2$, respectively. The joint distribution and survival
functions are H(x, y) and S(x, y), respectively. There exists a unique two-dimensional copula
C such that for all (x, y) in $R^2$:

$$H(x, y) = C(F_1(x), F_2(y)),$$

and conversely, if C is a two-dimensional copula and $F_1$, $F_2$ are distribution
functions, then H is a two-dimensional distribution function with marginals $F_1$, $F_2$.

Survival Copula

$$
\begin{aligned}
S(x, y) = P[X > x, Y > y] &= 1 - F_1(x) - F_2(y) + H(x, y) \\
&= \bar F_1(x) + \bar F_2(y) - 1 + C(F_1(x), F_2(y)) \\
&= \bar F_1(x) + \bar F_2(y) - 1 + C(1 - \bar F_1(x), 1 - \bar F_2(y)) \\
&= \hat C(\bar F_1(x), \bar F_2(y))
\end{aligned}
\tag{8.39}
$$

with $\hat C(u, v) = u + v - 1 + C(1 - u, 1 - v)$.


Sklar's Theorem leads to the following relationships:

1. $P[X \le x, Y > y] = F_1(x) - C(F_1(x), F_2(y))$,
2. $P[X > x, Y \le y] = F_2(y) - C(F_1(x), F_2(y))$,
3. $P[X \le x \mid Y \le y] = \dfrac{C(F_1(x), F_2(y))}{F_2(y)}$,
4. $P[X \le x \mid Y > y] = \dfrac{F_1(x) - C(F_1(x), F_2(y))}{1 - F_2(y)}$,
5. $P[X \le x \mid Y = y] = C_{1|2}(F_1(x), F_2(y)) = \left.\dfrac{\partial C(u, v)}{\partial v}\right|_{u = F_1(x),\, v = F_2(y)}$,
6. $P[X > x \mid Y > y] = \dfrac{\hat C(\bar F_1(x), \bar F_2(y))}{\bar F_2(y)}$.
There are many types of copula functions, such as the Gaussian copula, Student's
t-copula, Frank copula, Clayton copula, and Gumbel copula. Different copulas
produce different dependence structures, so the dependence structure is determined
by the choice of copula.
The Gumbel-Hougaard copula is given by:

$$C(u, v) = \exp\left[-\left((-\log_e u)^{\theta} + (-\log_e v)^{\theta}\right)^{1/\theta}\right] \tag{8.40}$$

where $\theta \in [1, \infty)$ characterizes the association between the two variables; $\theta = 1$
corresponds to independence.


The Gumbel-Hougaard copula is one-parametric and symmetrical. The Gumbel-
Hougaard copula belongs to the family of Archimedean copulas, used widely
because they can be constructed easily and many families belong to this class.
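
As a minimal sketch of Equation (8.40) and of the survival copula defined with Equation (8.39), the following functions evaluate both numerically. The function names are illustrative choices, not part of any cited work, and the inputs are assumed to lie strictly between 0 and 1.

```python
import math

def gumbel_hougaard(u, v, theta):
    """Gumbel-Hougaard copula C(u, v) of Equation (8.40), theta >= 1."""
    if theta < 1:
        raise ValueError("theta must be at least 1")
    a = (-math.log(u)) ** theta
    b = (-math.log(v)) ** theta
    return math.exp(-((a + b) ** (1.0 / theta)))

def survival_copula(copula, u, v):
    """Survival copula C_hat(u, v) = u + v - 1 + C(1 - u, 1 - v)."""
    return u + v - 1.0 + copula(1.0 - u, 1.0 - v)

# theta = 1 reduces to the independence copula u * v
independent = gumbel_hougaard(0.3, 0.7, 1.0)

# A large theta approaches the upper bound min(u, v)
strongly_dependent = gumbel_hougaard(0.3, 0.7, 50.0)
```

The two evaluations at the end bracket the range of dependence the parameter can express: the product u*v at theta = 1 and, in the limit, min(u, v).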

8.3.2 Constant-Stress Accelerated Life Test Based on Copulas


Ancha and Yincai (2012) proposed the CSALT using the Gumbel-Hougaard copula
(see Equation 8.40) with exponential marginals. Bai et al. (2018) considered a
dependent-competing risk model under a constant-stress setting using the Bivariate
Pareto copula function with Lomax marginals and Type-II progressive censoring.

8.3.2.1  Model Illustration


The methodology used by Ancha and Yincai (2012) is described as follows.
Consider a k-level CSALT with two competing failure modes. At each stress
level $s_i$, i = 1, 2, …, k, $n_i$ systems are tested until $r_i$ of them fail. The failure
data are $(t_{i1}, c_{i1}), (t_{i2}, c_{i2}), \ldots, (t_{i r_i}, c_{i r_i})$, where $c_{il}$ takes a value in the set {1, 2};
$c_{il} = 1$ and $c_{il} = 2$ indicate failure caused by failure modes 1 and 2, respectively.
Each failure mode has an exponential life distribution with hazard rate $\lambda_{ij}$,
i = 1, 2, …, k; j = 1, 2, under stress $s_i$. Thus, Equations (8.39) and (8.40) give

$$S_i(t) = C\left(e^{-\lambda_{i1} t}, e^{-\lambda_{i2} t}\right) = \exp\left\{-\left(\lambda_{i1}^{\theta} + \lambda_{i2}^{\theta}\right)^{1/\theta} t\right\}.$$
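
The closed form for the system survival function can be checked numerically. The sketch below, with hypothetical function names and arbitrary parameter values, evaluates the Gumbel-Hougaard copula at the two exponential survival probabilities and compares the result with an exponential survival function whose rate is $(\lambda_{i1}^{\theta} + \lambda_{i2}^{\theta})^{1/\theta}$.

```python
import math

def gumbel_hougaard(u, v, theta):
    """Gumbel-Hougaard copula of Equation (8.40)."""
    a = (-math.log(u)) ** theta
    b = (-math.log(v)) ** theta
    return math.exp(-((a + b) ** (1.0 / theta)))

def system_survival(t, lam1, lam2, theta):
    """S_i(t) obtained by plugging exponential survival functions into the copula."""
    return gumbel_hougaard(math.exp(-lam1 * t), math.exp(-lam2 * t), theta)

def system_failure_rate(lam1, lam2, theta):
    """Rate of the minimum lifetime implied by the closed form."""
    return (lam1 ** theta + lam2 ** theta) ** (1.0 / theta)

# Arbitrary illustrative values
lam1, lam2, theta, t = 0.1, 0.3, 2.0, 2.0
via_copula = system_survival(t, lam1, lam2, theta)
via_closed_form = math.exp(-system_failure_rate(lam1, lam2, theta) * t)
```

Under these assumptions the system lifetime is itself exponential, with a rate between max(lam1, lam2) (strong dependence) and lam1 + lam2 (independence, theta = 1).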

Under stress level $s_i$, the stress-life relationship is modeled using the log-linear
equation:

$$\log(\mu_{ij}) = \alpha_j + \beta_j \phi(s_i), \tag{8.41}$$

where:
$\mu_{ij} = 1/\lambda_{ij}$, $\alpha_j$ and $\beta_j$ are unknown parameters,
$\phi(s)$ is a given function of stress s.

This is a general formulation which contains the Arrhenius and inverse power law
models as special cases. Define:

$$
\delta_j(c_{il}) =
\begin{cases}
1 & \text{if } c_{il} = j \\
0 & \text{if } c_{il} \ne j
\end{cases}
\qquad j = 1, 2 \tag{8.42}
$$
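
A minimal sketch of the log-linear relationship (8.41) and the indicator (8.42) follows. It uses the standard choices phi(s) = 1/s for the Arrhenius model (s an absolute temperature) and phi(s) = log(s) for the inverse power law; the function names and parameter values are hypothetical.

```python
import math

def mean_life(stress, alpha, beta, phi):
    """Mean life mu = exp(alpha + beta * phi(stress)), Equation (8.41)."""
    return math.exp(alpha + beta * phi(stress))

def arrhenius(temperature_kelvin):
    """phi(s) = 1/s, the Arrhenius form for absolute temperature."""
    return 1.0 / temperature_kelvin

def inverse_power(stress):
    """phi(s) = log(s), which gives mu = exp(alpha) * s**beta."""
    return math.log(stress)

def cause_indicator(j, cause):
    """delta_j(c_il) of Equation (8.42): 1 if the observed cause equals j."""
    return 1 if cause == j else 0

# Illustrative values only
mu_arrhenius = mean_life(400.0, -5.0, 4000.0, arrhenius)   # exp(-5 + 4000/400)
mu_power = mean_life(2.0, 0.0, -1.5, inverse_power)        # 2 ** -1.5
```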
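Then the likelihood function due to failure mode 1 under stress level $s_i$ is

$$
\begin{aligned}
L_{i1} ={}& \prod_{l=1}^{r_i} \left[ \left\{ \lim_{\Delta t \to 0} \frac{P[(T_1 < T_2) \cap (t_{il} \le T_1 < t_{il} + \Delta t)]}{\Delta t} \right\}^{\delta_1(c_{il})} \left\{ \lim_{\Delta t \to 0} \frac{P[(T_2 < T_1) \cap (t_{il} \le T_2 < t_{il} + \Delta t)]}{\Delta t} \right\}^{1 - \delta_1(c_{il})} \right] \\
&\times \left\{ P[T_1 > t_{i r_i}, T_2 > t_{i r_i}] \right\}^{n_i - r_i} \\
={}& \left( \frac{\lambda_{i1}}{\lambda_{i2}} \right)^{\theta g_{i1}} \lambda_{i2}^{\theta r_i} \left( \lambda_{i1}^{\theta} + \lambda_{i2}^{\theta} \right)^{r_i (1/\theta - 1)} \exp\left\{ -\left( \lambda_{i1}^{\theta} + \lambda_{i2}^{\theta} \right)^{1/\theta} \left[ \sum_{l=1}^{r_i} t_{il} + (n_i - r_i)\, t_{i r_i} \right] \right\}
\end{aligned}
\tag{8.43}
$$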
and that due to failure mode 2 under stress level si is

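$$
L_{i2} = \left( \frac{\lambda_{i2}}{\lambda_{i1}} \right)^{\theta g_{i2}} \lambda_{i1}^{\theta r_i} \left( \lambda_{i1}^{\theta} + \lambda_{i2}^{\theta} \right)^{r_i (1/\theta - 1)} \exp\left\{ -\left( \lambda_{i1}^{\theta} + \lambda_{i2}^{\theta} \right)^{1/\theta} \left[ \sum_{l=1}^{r_i} t_{il} + (n_i - r_i)\, t_{i r_i} \right] \right\}
\tag{8.44}
$$

where:
$g_{ij} = \sum_{l=1}^{r_i} \delta_j(c_{il})$
$g_{i1} = r_i - g_{i2}$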

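Therefore, the log-likelihood function under the stress $s_i$ is:

$$\log L_i = \log L_{i1} + \log L_{i2}$$

$$
\Rightarrow \log L_i = 2\left[ \theta g_{i1} \log \lambda_{i1} + \theta (r_i - g_{i1}) \log \lambda_{i2} + r_i \left( \frac{1}{\theta} - 1 \right) \log\left( \lambda_{i1}^{\theta} + \lambda_{i2}^{\theta} \right) - \left( \lambda_{i1}^{\theta} + \lambda_{i2}^{\theta} \right)^{1/\theta} TTT_i \right]
\tag{8.45}
$$

where $TTT_i = \sum_{l=1}^{r_i} t_{il} + (n_i - r_i)\, t_{i r_i}$.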
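
The log-likelihood for one stress level can be sketched in a few lines. The function below is an illustration under stated assumptions rather than the authors' code: it takes the rates $\lambda_{i1}$, $\lambda_{i2}$ directly (instead of the regression parameters of Equation 8.41), includes the factor $r_i$ on the $(1/\theta - 1)$ term as the product over the $r_i$ observed failures requires, and keeps the factor 2 that arises from adding $\log L_{i1}$ and $\log L_{i2}$, which does not change the maximizer. All names are hypothetical.

```python
import math

def log_likelihood_level(times, causes, n_units, lam1, lam2, theta):
    """log L_i in the form of Equation (8.45) for one stress level.

    times   : the r_i observed failure times (Type-II censoring)
    causes  : cause labels (1 or 2) matching `times`
    n_units : number of units n_i placed on test at this level
    """
    r = len(times)
    g1 = sum(1 for c in causes if c == 1)                 # g_i1
    ttt = sum(times) + (n_units - r) * max(times)         # total time on test
    s = lam1 ** theta + lam2 ** theta
    return 2.0 * (
        theta * g1 * math.log(lam1)
        + theta * (r - g1) * math.log(lam2)
        + r * (1.0 / theta - 1.0) * math.log(s)
        - s ** (1.0 / theta) * ttt
    )

# Made-up data: 5 units on test, 3 observed failures
example = log_likelihood_level([1.0, 2.0, 4.0], [1, 2, 1], 5, 0.05, 0.1, 1.0)
```

With theta = 1 the expression reduces to twice the usual log-likelihood for two independent exponential competing risks, which gives a simple check of the implementation.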


Ancha and Yincai (2012) estimated the model parameters and compared, via
simulation, the results for dependent and independent failure modes. They
applied CSALT to the data set on insulation systems of electromotors from Klein and
Basu (1981) using the Gumbel copula with exponential marginals. The original data
consist of three failure modes: turn failure, phase failure, and ground failure. The two
normal operating temperatures are 323 K and 423 K, and the four accelerated
temperatures are 453 K, 463 K, 493 K, and 513 K. They obtained the MLEs of the mean
lifetimes at the normal operating conditions of 323 K and 423 K.

8.3.3 Constant-Stress Partially Accelerated Life Test Based on Copulas


Srivastava and Gupta (2019) designed constant-stress PALT (CSPALT) using the
Gumbel-Hougaard copula. The formulation is based on tampered failure rate model
under a constant-stress set-up (Srivastava and Sharma 2014) and Type-I censoring.
The tampered failure rate (TFR) model assumes that changing the acceleration
factor in different test chambers has a multiplicative effect on initial failure rate
function. Thus, for (m + 1) test chambers including the one in which items are tested
under normal operating condition, the TFR model is:

$$
h^{*}(y) =
\begin{cases}
h(y), & \text{under use condition} \\
h_j(y) = A_j h_{j-1}(y) = \prod_{i=1}^{j} A_i\, h(y), & \text{at the } j\text{th stress level},\ j = 1, 2, \ldots, m
\end{cases}
\tag{8.46}
$$

where the acceleration factor $A_i$ (> 1), i = 1, 2, …, m, is assumed to be a parameter of
the model. This contrasts with a fully ALT model wherein a regression model on the
accelerating variable is specified (see also Srivastava [2017]).
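
The multiplicative structure of Equation (8.46) can be sketched as follows. The function names, the constant baseline hazard, and the acceleration factors are hypothetical values chosen only to show how the cumulative product scales the baseline hazard from chamber to chamber.

```python
from itertools import accumulate
from operator import mul

def tfr_multipliers(acceleration_factors):
    """Cumulative products A_1, A_1*A_2, ..., one per stress level 1..m."""
    return list(accumulate(acceleration_factors, mul))

def tfr_hazard(baseline_hazard, y, level, acceleration_factors):
    """Hazard at age y in chamber `level` (0 = use condition), Equation (8.46)."""
    if level == 0:
        return baseline_hazard(y)
    return tfr_multipliers(acceleration_factors)[level - 1] * baseline_hazard(y)

# Illustrative values: constant baseline hazard, three accelerated chambers
factors = [2.0, 1.5, 3.0]
multipliers = tfr_multipliers(factors)                       # [2.0, 3.0, 9.0]
hazard_level_2 = tfr_hazard(lambda y: 0.01, 5.0, 2, factors)
```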

8.3.3.1  Model Illustration


The methodology used is described as follows.
Under the partially accelerated environmental condition, using the constant-stress
tampered failure rate model with m = 1 (Srivastava and Sharma 2014) and
the fact that the exponential distribution has a constant hazard rate $\lambda_j$, the CDF of $T_j$,
j = 1, 2, is:

$$
G_j(t) =
\begin{cases}
1 - e^{-\int_0^t h(u)\,du} = 1 - e^{-\lambda_j t}, & \text{under normal operating condition} \\
1 - e^{-\int_0^t A\,h(u)\,du} = 1 - e^{-A \lambda_j t}, & \text{under accelerated condition}
\end{cases}
\tag{8.47}
$$

with pdf $g_j(t)$.


The  joint survival probability for the case of two dependent competing failure
modes is

S ( t ) = P [T > t ] = P [ min{T1, T2} > t ] = P [T1 > t , T2 > t ] (8.48)

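Under the tampered failure rate model and the Gumbel-Hougaard copula with
exponential marginals, S(t) is given as

$$
S(t) =
\begin{cases}
C\left(e^{-\lambda_1 t}, e^{-\lambda_2 t}\right) = e^{-(\lambda_1^{\theta} + \lambda_2^{\theta})^{1/\theta}\, t}, & \text{under normal operating condition} \\
C\left(e^{-A\lambda_1 t}, e^{-A\lambda_2 t}\right) = e^{-A(\lambda_1^{\theta} + \lambda_2^{\theta})^{1/\theta}\, t}, & \text{under accelerated condition.}
\end{cases}
\tag{8.49}
$$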

The probabilities of failure of a unit under the different failure modes over different
intervals are required for the formulation of the likelihood function:

• The probability that a product fails under failure mode j in chamber 1
(normal operating conditions) is calculated using Equation (8.49) and
Section 8.3.1(5) as:

$$
\begin{aligned}
\lim_{\Delta t \to 0} \frac{P[(T_j < T_k) \cap (t < T_j \le t + \Delta t)]}{\Delta t}
&= \left.\frac{\partial C(u, v)}{\partial u}\right|_{u = G_1(t),\, v = G_2(t)} g_j(t)\,dt \\
&= \left(\lambda_1^{\theta} + \lambda_2^{\theta}\right)^{1/\theta - 1} \lambda_j^{\theta}\, e^{-(\lambda_1^{\theta} + \lambda_2^{\theta})^{1/\theta}\, t}\,dt, \quad (j, k) = 1, 2,\ j \ne k
\end{aligned}
\tag{8.50}
$$

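• The probability that a product fails under failure mode j in chamber 2
(accelerated condition) is:

$$
\begin{aligned}
\lim_{\Delta t \to 0} \frac{P[(T_j < T_k) \cap (t < T_j \le t + \Delta t)]}{\Delta t}
&= \left.\frac{\partial C(u, v)}{\partial u}\right|_{u = G_1(t),\, v = G_2(t)} g_j(t)\,dt \\
&= A \left(\lambda_1^{\theta} + \lambda_2^{\theta}\right)^{1/\theta - 1} \lambda_j^{\theta}\, e^{-A(\lambda_1^{\theta} + \lambda_2^{\theta})^{1/\theta}\, t}\,dt, \quad (j, k) = 1, 2,\ j \ne k
\end{aligned}
\tag{8.51}
$$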

As $n\Phi_1$ and $n\Phi_2$ test units are allocated to the normal operating condition and
the accelerated condition, respectively, the likelihood function of $\lambda_1$, $\lambda_2$, and A with
censoring time $\eta$, using Equations 8.50 and 8.51, is:

L(λ1, λ2 ,A) = L1L2 , (8.52)


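where:

$$
L_1 = \prod_{i=1}^{n\Phi_1}
\left\{ \left(\lambda_1^{\theta} + \lambda_2^{\theta}\right)^{1/\theta - 1} \lambda_1^{\theta}\, e^{-(\lambda_1^{\theta} + \lambda_2^{\theta})^{1/\theta}\, t_i} \right\}^{\delta_{11}}
\left\{ \left(\lambda_1^{\theta} + \lambda_2^{\theta}\right)^{1/\theta - 1} \lambda_2^{\theta}\, e^{-(\lambda_1^{\theta} + \lambda_2^{\theta})^{1/\theta}\, t_i} \right\}^{\delta_{12}}
\left\{ e^{-(\lambda_1^{\theta} + \lambda_2^{\theta})^{1/\theta}\, \eta} \right\}^{1 - \delta_{11} - \delta_{12}}
$$

$$
L_2 = \prod_{i=1}^{n\Phi_2}
\left\{ A \left(\lambda_1^{\theta} + \lambda_2^{\theta}\right)^{1/\theta - 1} \lambda_1^{\theta}\, e^{-A(\lambda_1^{\theta} + \lambda_2^{\theta})^{1/\theta}\, t_i} \right\}^{\delta_{21}}
\left\{ A \left(\lambda_1^{\theta} + \lambda_2^{\theta}\right)^{1/\theta - 1} \lambda_2^{\theta}\, e^{-A(\lambda_1^{\theta} + \lambda_2^{\theta})^{1/\theta}\, t_i} \right\}^{\delta_{22}}
\left\{ e^{-A(\lambda_1^{\theta} + \lambda_2^{\theta})^{1/\theta}\, \eta} \right\}^{1 - \delta_{21} - \delta_{22}}
$$

where:

$$
\delta_{mj} =
\begin{cases}
1, & \text{cause of failure is } j \text{ in chamber } m \ (m = 1\text{: normal operating condition; } m = 2\text{: accelerated condition}) \\
0, & \text{otherwise.}
\end{cases}
$$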

Define $\Phi_m$ as the proportion of units allocated to chamber m, m = 1, 2, with $\Phi_1 + \Phi_2 = 1$.
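
The likelihood in Equation (8.52) is easier to work with on the log scale. The sketch below is an illustration under stated assumptions, not the authors' implementation: each unit is recorded as a hypothetical (time, cause) pair with cause 0 marking a unit still running at the censoring time eta, and theta is treated as a given value.

```python
import math

def chamber_log_likelihood(records, lam1, lam2, theta, eta, accel=1.0):
    """Log of L_1 (accel = 1) or L_2 (accel = A) from Equation (8.52).

    records: one (time, cause) pair per unit; cause is 1 or 2 for an observed
             failure and 0 for a unit censored at time eta.
    """
    s = lam1 ** theta + lam2 ** theta
    rate = accel * s ** (1.0 / theta)        # hazard rate of min(T1, T2)
    total = 0.0
    for time, cause in records:
        if cause == 0:                       # censored: survival up to eta
            total -= rate * eta
        else:                                # failure from mode `cause`
            lam_j = lam1 if cause == 1 else lam2
            total += (math.log(accel)
                      + (1.0 / theta - 1.0) * math.log(s)
                      + theta * math.log(lam_j)
                      - rate * time)
    return total

def log_likelihood(normal, accelerated, lam1, lam2, accel, theta, eta):
    """log L = log L_1 + log L_2 for the two test chambers."""
    return (chamber_log_likelihood(normal, lam1, lam2, theta, eta)
            + chamber_log_likelihood(accelerated, lam1, lam2, theta, eta, accel))

# Made-up data: two units per chamber, censoring time eta = 10
normal_units = [(2.0, 1), (10.0, 0)]
accelerated_units = [(1.0, 2), (10.0, 0)]
value = log_likelihood(normal_units, accelerated_units, 0.1, 0.2, 3.0, 1.0, 10.0)
```

Maximizing this function over lam1, lam2, and accel with a general-purpose optimizer would give the MLEs; that step is omitted here.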

The authors estimated the model parameters and obtained the optimal plan, which
consists in finding the optimal allocation $n_1 = n\Phi_1$ of units to the first test chamber
(normal operating conditions) using the D-optimality criterion. The method has been
explained using a numerical example, and a sensitivity analysis was carried out.

8.3.4 Step-Stress Accelerated Life Test Based on Copulas


Zhou et al. (2018) have addressed the statistical analysis of an SSALT in the presence
of dependent competing failure modes. The dependence structure among distributions
of life times is constructed by the copula function with an unknown copula parameter.
Under the CEM for SSALT with two assumed copulas, namely the Gumbel and
Clayton copulas, an expectation maximization (EM) algorithm is developed to obtain
MLEs of model parameters and the missing information principle is used to obtain
their standard errors (SEs). SSALT is applied to the Y11X-1419 type of Aerospace
Electrical Connector composed of contact element, insulator, and mechanical connec-
tion. Three kinds of failure modes—contact failure, insulation failure, and mechani-
cal connection failure have been considered. For assessing the storage reliability of
electrical connectors, the data are collected in an SSALT accelerated by temperature
because it is the most important environmental factor which affects the storage reli-
ability of the electromechanical components. They used the MLE method to estimate
the parameters of the candidate copula functions, Akaike’s information criterion to
select optimal copula functions, and verified strong dependence among failure modes.
The results of the case studies show that the method proposed is valid and effective for
the statistical analysis of SSALT with dependent competing failure modes.
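
The copula selection step mentioned above relies on Akaike's information criterion, AIC = 2k - 2 log L, where k is the number of estimated parameters and the model with the smallest value is preferred. A minimal sketch follows; the maximized log-likelihoods and parameter counts are made-up numbers for illustration and are not results from Zhou et al. (2018).

```python
def aic(log_likelihood, n_params):
    """Akaike's information criterion: 2k - 2 log L (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fits: (maximized log-likelihood, number of parameters)
fits = {"Gumbel": (-152.3, 5), "Clayton": (-155.9, 5)}

scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
best = min(scores, key=scores.get)
```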

8.4 BAYESIAN APPROACH TO ACCELERATED LIFE TEST WITH COMPETING FAILURE MODE
Zhang and Mao (1998) and Bunea and Mazzuchi (2005, 2006) considered the analysis
of ALT with competing failure modes from a Bayesian viewpoint. Bunea and
Mazzuchi (2005, 2006) considered two Bayesian models: an exponential-gamma
model and a model with an ordered Dirichlet prior distribution. Tan et al. (2009)
proposed a Bayesian method for analyzing incomplete data obtained from CSALT when
there are two or more competing failure modes.

8.5 CONCLUSION
This chapter is a brief review on formulation of ALT models with competing failure
modes—independent or dependent. The stress loading factors used in the literature are
constant, step-stress, modified ramp-stress, and triangular cyclic. In case of dependent
failure modes, dependence is described through copulas. In the literature, ALT models
have been designed by various authors using the classical approach or the Bayesian
approach. Various authors carried out data analysis using different censoring schemes
such as time-censoring, failure censoring, progressive censoring, and determined opti-
mal plans. The methods developed also were explained using numerical examples.

REFERENCES
Ancha, X. and Yincai, T. (2012). Statistical analysis of competing failure modes in acceler-
ated life testing based on assumed copulas. Chinese Journal of Applied Probability and
Statistics, 28, 51–62.
Bai, D.S. and Chun, Y.R. (1991). Optimum simple step-stress accelerated life tests with com-
peting causes of failure. IEEE Transactions on Reliability, 40 (5), 622–627.

Bai, X., Shi, Y., Liu, Y., and Liu, B. (2018). Statistical analysis of dependent competing risks
model in constant stress accelerated life testing with progressive censoring based on
copula function. Statistical Theory and Related Fields, 2 (1), 48–57.
Balakrishnan, N. and Han, D. (2008). Exact inference for simple step-stress model with com-
peting risks for failure from exponential distribution under Type-II censoring. Journal
of Statistical Planning and Inference, 138, 4172–4186.
Bessler, S., Chernoff, H., and Marshall, A.W. (1962). An optimal sequential accelerated life
test. Technometrics, 4 (3), 367–379.
Bhattacharya, G.K. and Soejoeti, Z. (1989). A  tampered failure rate model for step-stress
accelerated life test. Communications in Statistics—Theory and Methods, 18 (5),
1627–1643.
Bunea, C. and Mazzuchi, T.A. (2005). Bayesian accelerated life testing under competing
failure modes. Proceedings of Annual Reliability and Maintainability Symposium,
Alexandria, VA, 152–157.
Bunea, C. and Mazzuchi, T.A. (2006). Competing failure modes in accelerated life testing.
Journal of Statistical Planning and Inference, 136, 1608–1620.
Bunea, C. and Mazzuchi, T.A. (2014). Accelerated Life Tests: Analysis with Competing
Failure Modes. Wiley StatsRef: Statistics Reference Online, pp. 1–12.
Carriere, J. (1994). Dependent decrement theory. Transactions, Society of Actuaries, XLVI, 45–65.
Chernoff, H. (1962). Optimal accelerated life designs for estimation, accelerated life test.
Technometrics, 4 (3), 381–408.
Craiu, R.V. and Lee, T.C.M. (2005). Model selection for the competing-risks model with and
without masking. Technometrics, 47 (4), 457–467.
David, H.A. and Moeschberger, M.L. (1978). The  Theory of Competing Risks. Griffin,
London, UK.
DeGroot, M.H. and Goel, P.K. (1979). Bayesian estimation and optimal designs in partially
accelerated life testing. Naval Research Logistics Quarterly, 26 (2), 223–235.
Donghoon, H. and Balakrishnan, N. (2010). Inference for a simple step-stress model with
competing risks for failure from the exponential distribution under time constraint.
Computational Statistics & Data Analysis, 54 (9), 2066–2081.
Elsayed, A.E. (2012). Reliability Engineering. John Wiley & Sons, Hoboken, NJ.
Escarela, G. and Carriere, J. (2003). Fitting competing risks with an assumed copula.
Statistical Methods in Medical Research, 12 (4), 333–349.
Haghighi, F. (2014). Accelerated test planning with independent competing risks and concave
degradation path. International Journal of Performability Engineering, 10 (1), 15–22.
Han, D. and Balakrishnan, N. (2010). Inference for a simple step-stress model with competing
risks for failure from the exponential distribution under time constraint. Computational
Statistics and Data Analysis, 54, 2066–2081.
Herman, R.J. and Patell Rusi, K.N. (1971). Maximum likelihood estimation for multi-risk
model. Technometrics, 13 (2), 385–396. doi:10.1080/00401706.1971.10488792.
Khamis, I.H. and Higgins, J.J. (1998). New model for step-stress testing. IEEE Transactions
on Reliability, 47 (2), 131–134.
Kim, C.M. and Bai, D.S. (2002). Analysis of accelerated life test data under two failure modes.
International Journal of Reliability, Quality and Safety Engineering, 9, 111–125.
Klein, J.P. and Basu, A.P. (1981). Weibull accelerated life tests when there are competing causes
of failure. Communications in Statistics Theory and Methods, 10 (20), 2073–2100.
Klein, J.P. and Basu, A.P. (1982a). Accelerated life testing under competing exponential fail-
ure distributions. IAPQR Transactions, 7 (1), 1–20.
Klein, J.P. and Basu, A.P. (1982b). Accelerated life tests under competing Weibull causes of
failure. Communications in Statistics—Theory and Methods, 11 (20), 2271–2286.
Liu, X. and Qiu, W.S. (2011). Modeling and planning of step-stress accelerated life tests with
independent competing risks. IEEE Transactions on Reliability, 60 (4), 712–720.

McCool, J. (1978). Competing risk and multiple comparison analysis for bearing fatigue tests.
Tribology Transactions, 21, 271–284.
Moeschberger, M.L. and David, H.A. (1971). Life tests under competing causes of failure and
the theory of competing risks. Biometrics, 27 (4), 909–933.
Nelsen, R.B. (2006). An Introduction to Copulas, 2nd ed. Springer Science + Business Media,
New York.
Nelson, W.B. (1990). Accelerated Testing: Statistical Models, Test Plans, and Data Analysis.
John Wiley & Sons, Hoboken, NJ.
Pascual, F.G. (2007). Accelerated life test planning with independent Weibull competing
risks with known shape parameter. IEEE Transactions on Reliability, 56 (1), 85–93.
Shi, Y., Jin, L., Wei, C., and Yue, H. (2013). Constant-stress accelerated life test with compet-
ing risks under progressive type-II hybrid censoring. Advanced Materials Research,
712–715, 2080–2083.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de
l'Institut de statistique de l'Université de Paris, 8, 229–231.
Srivastava, P.W. (2017). Optimum Accelerated Life Testing Models with Time-varying
Stresses. World Scientific Publishing Europe, London, UK.
Srivastava, P.W. and Gupta, T. (2015). Optimum time-censored modified ramp-stress ALT
for the Burr Type XII distribution with warranty: A  goal programming approach.
International Journal of Reliability, Quality and Safety Engineering, 22 (3), 23.
Srivastava, P.W. and Gupta, T. (2017). Optimum modified ramp-stress ALT plan with compet-
ing causes of failure. International Journal of Quality and Reliability Management, 34
(5), 733–746.
Srivastava, P.W. and Gupta, T. (2018). Optimum triangular cyclic-stress ALT plan with
independent competing causes of failure. International Journal of Reliability and
Applications, 19 (1), 43–58.
Srivastava, P.W. and Gupta, T. (2019). Copula based constant-stress PALT using tampered
failure rate model with dependent competing risks. International Journal of Quality
and Reliability Management, 36 (4), 510–525.
Srivastava, P.W. and Sharma, D. (2014). Optimum time-censored constant-stress PALTSP for
the Burr Type XII distribution using tampered failure rate model. Journal of Quality
and Reliability Engineering, 2014, 564049, 13. doi:10.1155/2014/564049.
Srivastava, P.W., Shukla, R., and Sen, K. (2014). Optimum simple step-stress test with
competing risks for failure using Khamis-Higgins model under Type-I censoring.
International Journal of Operational Research/Nepal, 3, 75–88.
Tan, Y., Zhang, C., and Cen, X. (2009). Bayesian analysis of incomplete data from
accelerated life testing with competing failure modes. 8th International
Conference on Reliability, Maintainability and Safety, pp. 1268–1272. doi:10.1109/
ICRMS.2009.5270049.
Wu, S.-J. and Huang, S.-R. (2017). Planning two or more level constant-stress accelerated life
tests with competing risks. Reliability Engineering and System Safety, 158, 1–8.
Yang, G. (2007). Life Cycle Reliability Engineering. John Wiley & Sons, Hoboken, NJ.
Yu, Z., Ren, Z., Tao, J., and Chen, X. (2014). Accelerated testing with multiple failure modes
under several temperature conditions. Mathematical Problems in Engineering, 839042,
8. doi:10.1155/2014/839042.
Zhang, Z. and Mao, S. (1998). Bayesian estimator for the exponential distribution with the
competing causes of failure under accelerated life test. Chinese Journal of Applied
Probability and Statistics, 14 (1), 91–98.
Zhou, Y., Lu, Z., Shi, Y., and Cheng, K. (2018). The copula-based method for statistical
analysis of step-stress accelerated life test with dependent competing failure modes.
Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and
Reliability, 1–18. doi:10.1177/1748006X18793251.
9 European Reliability Standards
Miguel Angel Navas, Carlos Sancho,
and Jose Carpio

CONTENTS
9.1 Introduction
9.2 Classification of the Dependability Standards of the International Electrotechnical Commission
9.3 Management Procedures
9.4 Establishment of Requirements
9.5 Test Methods
9.6 Method Selection
9.7 Reliability Evaluation Methods
9.8 Statistical Methods for the Evaluation of Reliability
9.9 Conclusions
References

9.1 INTRODUCTION
The International Electrotechnical Commission (IEC) is a standardization organization
in the fields of electrical, electronic, and related technologies. It is made up
of the national standardization bodies of its member countries. The IEC includes
85 countries, including those of the European Union, Japan, and the United States,
among others.
The IEC has a Technical Committee, TC 56, whose current name is Dependability.
The purpose of TC 56 is to prepare international standards for reliability (in its
broadest sense), applicable in all technological areas. Reliability can be expressed in
terms of essential supporting attributes such as availability and maintainability.
The standards provide systematic methods and tools for evaluating and managing the
reliability of equipment, services, and systems throughout their life cycles. As of
June 2018, TC 56 has 57 published standards in this area.
The standards cover generic aspects of reliability and maintenance program
management, tests and analytical techniques, software and system reliability, life
cycle costs, technical risk analysis, and project risk management. The list includes
standards related to product issues, from component reliability to guidance on
reliability in systems engineering; standards related to process issues, from technological
risk analysis to integrated logistics support; and standards related to management
issues, from reliability program management to obsolescence management.


9.2 CLASSIFICATION OF THE DEPENDABILITY STANDARDS OF THE INTERNATIONAL ELECTROTECHNICAL COMMISSION
The set of standards issued by the IEC covers a large part of the maintenance
processes with proven methods and metrics, backed by the rigor and
scientific level applied in their preparation and by very demanding review processes
prior to publication. That is why IEC standards must be one of the essential sources
adopted by maintenance engineers in their academic, scientific, and business
activities.
A classification of the 57 current standards is presented, grouped according to
their main field of application, noting that many of them are complementary and others
are alternatives. It is therefore necessary to analyze fully the process to which the
standards are to be applied in order to make an appropriate selection
of them.
In Table 9.1, the 57 standards are classified into 6 clusters according to their main
field of application:

• Management procedures: There are 19 standards that cover different processes
for application in the field of maintenance (design, life cycle, maintainability,
logistics, risk, etc.), which include and develop the necessary procedures for
their adoption and implementation on the assets to be maintained.
• Establishment of requirements: The eight standards include procedures for
the specification of the reliability, maintainability, and availability require-
ments that the systems must comply with to be established from the design
phase.
• Test methods: These 11 standards develop the application procedures of dif-
ferent tests for application to the systems to obtain real operating data, and
thus to evaluate practically the behavior of the systems.
• Method selection: There are five standards that assist in establishing mea-
surement metrics and in selecting the most appropriate methods for evalu-
ating the reliability of each system based on quantitative and qualitative
selection criteria.

TABLE 9.1
Classification of the Dependability Standards Issued by IEC

Cluster                              Number of Standards
Management procedures                19
Establishment of requirements        8
Test methods                         11
Method selection                     5
Reliability evaluation methods       9
Statistical methods for reliability  5

• Reliability evaluation methods: Each of the nine standards presents an alternative
method for evaluating the reliability of a system, with a different approach.
It is therefore necessary to properly choose the method to be applied, taking
into account the specific characteristics of each system or equipment, since
each method is more appropriate for a certain type of item.
• Statistical methods for the evaluation of reliability: These five standards
must be applied together, since the selection of the specific statistical
method depends on whether the system is repairable or not. All of them are
strongly linked and must be used in an integrated manner.

9.3  MANAGEMENT PROCEDURES


These 19 standards provide maintenance managers with multiple tools to perform a
comprehensive management of their activities with procedures of proven academic
and business validation.
Table 9.2 presents the classification of the management procedures issued by the
IEC in the field of Dependability:

• Maintenance strategies: These eight standards offer maintenance engineers
basic strategies to adopt in the management of activities and
operational processes.
• Data processing: Two specific standards apply to the collection,
analysis, and presentation of the operating data of the systems.
• Risk: These three standards are for the implementation of risk management
procedures.
• Logistics: Logistics is one of the processes with the most influence on the results of
maintenance management, and two standards have been developed for its
treatment.
• Improvement processes: One standard has been developed to improve the
reliability of the systems in operation.
• Life cycle: There are three standards that address the life cycle of systems
and the impact on maintenance costs.

TABLE 9.2
Classification of the Management Procedures Standards Issued by IEC

Cluster                  Number of Standards
Maintenance strategies   8
Data processing          2
Risk                     3
Logistics                2
Improvement processes    1
Life cycle               3

The eight standards of maintenance strategies are:

• IEC 60300-1:2014: Dependability management—Part 1: Guidance for
management and application (Edition 3.0) establishes a framework for
dependability management. It  provides guidance on dependability man-
agement of products, systems, processes, or services involving hardware,
software, and human aspects or any integrated combinations of these
elements. It presents guidance on planning and implementation of depend-
ability activities and technical processes throughout the life cycle taking
into account other requirements such as those relating to safety and the
environment. This  standard gives guidelines for management and their
technical personnel to assist them to optimize dependability.
• IEC 60300-3-10:2001: Dependability management—Part 3-10: Application
guide—Maintainability (Edition 1.0) can be used to implement a maintain-
ability program covering the initiation, development, and in-service phases
of a product, which form part of the tasks in IEC 60300-2. It provides guid-
ance on how the maintenance aspects of the tasks should be considered to
achieve optimum maintainability.
• IEC 60300-3-11:2009: Dependability management—Part 3-11: Application
guide—Reliability centered maintenance (Edition 2.0) provides guidelines
for the development of failure management policies for equipment and struc-
tures using reliability centered maintenance (RCM) analysis techniques.
This part serves as an application guide and is an extension of IEC 60300-
3-10, IEC 60300-3-12 and IEC 60300-3-14. Maintenance activities recom-
mended in all three standards, which relate to preventive maintenance, may
be implemented using this standard.
• IEC 61907:2009: Communication network dependability engineering
(Edition 1.0) gives guidance on dependability engineering of communica-
tion networks. It establishes a generic framework for network dependability
performance, provides a process for network dependability implementation,
and presents criteria and methodology for network technology designs,
performance evaluation, security consideration, and quality of service
measurement to achieve network dependability performance objectives.
This standard is applicable to network equipment developers and suppliers,
network integrators, and providers of network service functions for plan-
ning, evaluation, and implementation of network dependability.
• IEC 62508:2010: Guidance on human aspects of dependability (Edition 1.0)
provides guidance on the human aspects of dependability, and the
human-centered design methods and practices that can be used throughout
the whole system life cycle to improve dependability performance. This stan-
dard describes qualitative approaches.
• IEC 62628:2012: Guidance on software aspects of dependability
(Edition 1.0) addresses the issues concerning software aspects of depend-
ability and gives guidance on achievement of dependability in software
performance influenced by management disciplines, design processes, and
application environments. It establishes a generic framework on software
dependability requirements, provides a software dependability process for
system life cycle applications, presents assurance criteria and methodology
for software dependability design and implementation, and provides practi-
cal approaches for performance evaluation and measurement of dependabil-
ity characteristics in software systems.
• IEC 62673:2013: Methodology for communication network dependability
assessment and assurance (Edition 1.0) describes a generic methodology
for dependability assessment and assurance of communication networks
from a network life cycle perspective. It presents the network dependabil-
ity assessment strategies and methodology for analysis of network topol-
ogy, evaluation of dependability of service paths, and optimization of
network configurations to achieve network dependability performance and
dependability of service. It also addresses the network dependability assur-
ance strategies and methodology for application of network health check,
network outage control, and test case management to enhance and sustain
dependability performance in network service operation. This standard is
applicable to network service providers, network designers and developers,
and network maintainers and operators for assurance of network depend-
ability performance and assessment of dependability of service.
• IEC TS 62775:2016: Application guidelines—Technical and financial pro-
cesses for implementing asset management systems (Edition 1.0), which is a
Technical specification, shows how the IEC dependability suite of standards,
systems engineering, and the IFRS and IAS standards can support the
requirements of asset management, as described by the ISO 5500x suite of
standards.

The most relevant aspects of the two standards of data processing are summarized
as follows:

• IEC 60300-3-2:2004: Dependability management—Part 3-2: Application
guide—Collection of dependability data from the field (Edition 2.0)
provides guidelines for the collection of data relating to reliability, main-
tainability, availability, and maintenance support performance of items
operating in the field. It deals in general terms with the practical aspects
of data collection and presentation and briefly explores the related topics of
data analysis and presentation of results. Emphasis is on the need to incor-
porate the return of experience from the field in the dependability process
as a main activity. The data are classified according to the attributes
in Table 9.3.
• IEC 60706-3:2006: Maintainability of equipment—Part 3: Verification
and collection, analysis and presentation of data (Edition 2.0) addresses
the collection, analysis, and presentation of maintainability-related data,
which may be required during, and at the completion of, design and during
item production and operation.

TABLE 9.3
Attributes of the Collection of Dependability Data from the Field

Attribute           Values
Respect to time     Continuous, discontinuous, etc.
Number of data      Complete or limited
Type of population  Finite, infinite, or hypothetical
Sample size         No sampling, random sampling, or stratified sampling
Type of data        Qualitative or quantitative
Data censoring      Uncensored, censored on one side, or interval censored
Data validation     At origin, by supervisor, etc.
Data screening      Without screening or with screening standards

The most important aspects of the three standards dedicated to risk management are
summarized as follows:

• IEC/ISO 31010:2009: Risk management—Risk assessment techniques
(Edition 1.0) is a dual logo IEC/ISO supporting standard for ISO 31000 and
provides guidance on selection and application of systematic techniques for
risk assessment. This standard is not intended for certification, regulatory,
or contractual use.
• IEC 62198:2013: Managing risk in projects—Application guidelines
(Edition 2.0) provides principles and generic guidelines on managing risk
and uncertainty in projects. In particular it describes a systematic approach to
managing risk in projects based on ISO 31000, Risk management—Principles
and guidelines. Guidance is provided on the principles for managing risk in
projects, the framework and organizational requirements for implementing
risk management, and the process for conducting effective risk management.
This standard is not intended for the purpose of certification.
• IEC TR 63039:2016: Probabilistic risk analysis of technological
systems—Estimation of final event rate at a given initial state (Edition 1.0)
provides guidance on probabilistic risk analysis (hereinafter referred to as
risk analysis) for the systems composed of electrotechnical items and is
applicable (but not  limited) to all electrotechnical industries where risk
analyses are performed.

The two standards for logistics management are:

• IEC 60300-3-12:2011: Dependability management—Part 3-12:
Application guide—Integrated logistic support (Edition 2.0) is an appli-
cation guide for establishing an integrated logistic support (ILS) man-
agement system. It  is intended to be used by a wide range of suppliers
including large and small companies wishing to offer a competitive and
quality item that is optimized for the purchaser and supplier for the com-
plete life cycle of the item. It also includes common practices and logistic
data analyses that are related to ILS.
• IEC 62550:2017: Spare parts provisioning (Edition 1.0) describes require-
ments for spare parts provisioning as a part of supportability activities
that affect dependability performance so that continuity of operation of
products, equipment, and systems for their intended application can be
sustained. This document is intended for use by a wide range of suppliers,
maintenance support organizations, and users and can be applied to
all items.

The existing standard for improvement processes is:

• IEC 61160:2005: Design review (Edition 2.0). This International Standard
makes recommendations for the implementation of design review as a
means of verifying that the design input requirements have been met and
stimulating the improvement of the product’s design. The intention is for it to
be applied during the design and development phase of a product’s life cycle.

And finally, the three standards developed for the life cycle are:

• IEC 60300-3-3:2017: Dependability management—Part 3-3: Application
guide—Life cycle costing (Edition 3.0) establishes a general introduction to
the concept of life cycle costing and covers all applications. Although costs
incurred over the life cycle consist of many contributing elements, this doc-
ument particularly highlights the costs associated with the dependability of
an item. This standard forms part of an overall dependability management
program as described in IEC 60300-1. Guidance is provided on life cycle
costing for use by managers, engineers, finance staff, and contractors; it is
also intended to assist those who may be required to specify and commis-
sion such activities when undertaken by others.
• IEC 60300-3-15:2009: Dependability management—Part 3-15: Application
guide—Engineering of system dependability (Edition 1.0) provides guidance
for an engineering system’s dependability and describes a process for real-
ization of system dependability through the system life cycle. This standard
is applicable to new system development and for enhancement of existing
systems involving interactions of system functions consisting of hardware,
software, and human elements.
• IEC 62402:2007: Obsolescence management—Application guide (Edition
1.0). This  International Standard gives guidance for establishing a
framework for obsolescence management and for planning a cost-effective
obsolescence management process that is applicable through all phases of
the product life cycle.

9.4  ESTABLISHMENT OF REQUIREMENTS


Eight standards support the establishment of requirements in the design phase, so
that systems and equipment are assigned indicators and target values, such as
reliability, availability, and maintainability, and the degree of compliance with
them can be checked afterward:

• IEC 60300-3-4:2007: Dependability management—Part 3-4: Application
guide—Guide to the specification of dependability requirements (Edition
2.0) gives guidance on specifying required dependability characteristics in
product and equipment specifications, together with specifications of pro-
cedures and criteria for verification. The guide includes advice on specify-
ing quantitative and qualitative reliability, maintainability, and availability
requirements. The main changes from the previous edition are that the concept
of systems has been included and the need to specify the dependability
of the system, and not just the physical equipment, has been stressed; the
need for verification and validation of the requirement has been included;
differentiation has been made between requirements, which can be measured,
verified, and validated, and goals, which cannot; and the content on
availability, maintainability, and maintenance support has been updated
and expanded to a level of detail similar to that for reliability.
• IEC 60300-3-14:2004: Dependability management—Part 3-14:
Application guide—Maintenance and maintenance support (Edition
1.0) describes a framework for maintenance and maintenance support
and the various minimal common practices that should be undertaken.
The  guide outlines in a generic manner the management, processes, and
techniques related to maintenance and maintenance support that are nec-
essary to achieve adequate dependability to meet the operational needs
of the customer. It is applicable to items, which include all types of prod-
ucts, equipment, and systems (hardware and associated software). Most of
these require a certain level of maintenance to ensure that their required
functionality, dependability, capability, economic, safety, and regulatory
requirements are achieved.
• IEC 60300-3-16:2008: Dependability management—Part 3-16: Appli-
cation guide—Guidelines for specification of maintenance support ser-
vices (Edition 1.0) describes a framework for the specification of services
related to the maintenance support of products, systems, and equipment
that are carried out during the operation and maintenance phase. The pur-
pose of this standard is to outline, in a generic manner, the development of
agreements for maintenance support services as well as guidelines for the
management and monitoring of these agreements by the company and the
service provider.
• IEC 60706-2:2006: Maintainability of equipment—Part 2: Maintainability
requirements and studies during the design and development phase
(Edition 2.0). This part of IEC 60706 examines the maintainability requirements
and related design and use parameters, and discusses some activities
necessary to achieve the required maintainability characteristics and their
relationship to planning of maintenance. It describes the general approach
in reaching these objectives and shows how maintainability character-
istics should be specified in a requirements document or contract. It  is
not intended to be a complete guide on how to specify or to contract for
maintainability. Its purpose is to define the range of considerations when
maintainability characteristics are included as requirements for the devel-
opment or the acquisition of an item.
• IEC 61014:2003: Programmes for reliability growth (Edition 2.0) specifies
requirements and gives guidelines for the exposure and removal of weak-
nesses in hardware and software items for the purpose of reliability growth.
It applies when the product specification calls for a reliability growth pro-
gram of equipment (electronic, electromechanical, and mechanical hard-
ware as well as software) or when it is known that the design is unlikely
to meet the requirements without improvement. The  main changes with
respect to the previous edition are: a subclause on planning reliability
growth in the design phase, a subclause on management aspects covering
both reliability growth in design and the test phase, and a clause on reli-
ability growth in the field.
• IEC 62347:2006: Guidance on system dependability specifications
(Edition 1.0). This International Standard gives guidance on the prepara-
tion of system dependability specifications. It provides a process for system
evaluation and presents a procedure for determining system dependabil-
ity requirements. This  International Standard is not  intended for certifi-
cation or to perform conformity assessment for contractual purposes. It is
not  intended to change any rights or obligations provided by applicable
statutory or regulatory requirements.
• IEC 62741:2015: Demonstration of dependability requirements—The
dependability case (Edition 1.0) gives guidance on the content and applica-
tion of a dependability case and establishes general principles for the prepa-
ration of a dependability case. This standard is written in a basic project
context where a customer orders a system that meets dependability require-
ments from a supplier and then manages the system until its retirement.
The  methods provided in this standard may be modified and adapted to
other situations as needed. The  dependability case is normally produced
by the customer and supplier but can also be used and updated by other
organizations.
• IEC 62853:2018: Open systems dependability (Edition 1.0) provides guid-
ance in relation to a set of requirements placed upon system life cycles in
order for an open system to achieve open systems dependability. This document
is applicable to life cycles of products, systems, processes, or services
involving hardware, software, and human aspects or any integrated combi-
nations of these elements. For open systems, security is especially impor-
tant since the systems are particularly exposed to attack. This document can
be used to improve the dependability of open systems and to provide assur-
ance that the process views specific to open systems achieve their expected
outcomes. It helps an organization define the activities and tasks that need
to be undertaken to achieve dependability objectives in an open system,
including dependability related communication, dependability assessment,
and evaluation of dependability throughout system life cycles.

9.5  TEST METHODS


Eleven standards provide methodological support for the use of tests and test
procedures to obtain data in a controlled manner and to serve as a basis for
estimating the operational indicators that systems and equipment will achieve:

• IEC 60605-2:1994: Equipment reliability testing—Part 2: Design of test
cycles (Edition 1.0) applies to the design of operating and environmental
test cycles referred to in 8.1 and 8.2 of IEC 605-1.
• IEC 60706-5:2007: Maintainability of equipment—Part 5: Testability
and diagnostic testing (Edition 2.0) provides guidance for the early con-
sideration of testability aspects in design and development, and assists in
determining effective test procedures as an integral part of operation and
maintenance.
• IEC 61070:1991: Compliance test procedures for steady-state availability
(Edition 1.0) specifies techniques for availability performance testing of
frequently maintained items when the availability performance measure
used is either steady-state availability or steady-state unavailability.
• IEC 61123:1991: Reliability testing—Compliance test plans for success
ratio (Edition 1.0) specifies procedures for applying and preparing compli-
ance test plans for success ratio or failure ratio. The procedures are based
on the assumption that each trial is statistically independent.
• IEC 61124:2012: Reliability testing—Compliance tests for constant failure
rate and constant failure intensity (Edition 3.0) gives a number of opti-
mized test plans, the corresponding operating characteristic curves, and
expected test times. In  addition, the algorithms for designing test plans
using a spreadsheet program are given, together with guidance on how to
choose test plans. This  standard specifies procedures to test an observed
value of failure rate, failure intensity, mean time to failure (MTTF), and
mean operating time between failures (MTBF). It  provides an extensive
number of statistical tests.
• IEC 61163-1:2006: Reliability stress screening—Part 1: Repairable
assemblies manufactured in lots (Edition 2.0) describes particular meth-
ods to apply and optimize reliability stress screening processes for lots
of repairable hardware assemblies in cases where the assemblies have an
unacceptably low reliability in the early failure period, and when other
methods such as reliability growth program and quality control techniques
are not applicable.
• IEC 61163-2:1998: Reliability stress screening—Part 2: Electronic compo-
nents (Edition 1.0) provides guidance on reliability stress screening techniques
and procedures for electronic components. It is intended for use by (1) component
manufacturers as a guideline, (2) component users as a guideline to
negotiate with component manufacturers on stress screening requirements or
plan a stress screening process in house due to reliability requirements, and
(3) subcontractors who provide stress screening as a service.
• IEC 61164:2004: Reliability growth—Statistical test and estimation meth-
ods (Edition 2.0) gives models and numerical methods for reliability growth
assessments based on failure data, which were generated in a reliability
improvement program. These procedures deal with growth, estimation,
confidence intervals for product reliability, and goodness-of-fit tests.
In Table 9.4, the types of model developed are classified.
• IEC 62309:2004: Dependability of products containing reused
parts—Requirements for functionality and tests (Edition 1.0) introduces the
concept to check the reliability and functionality of reused parts and their
usage within new products. It also provides information and criteria about
the tests/analysis required for products containing such reused parts, which
are declared “qualified-as-good-as-new” relative to the designed life of the
product. The purpose of this standard is to ensure by tests and analysis that
the reliability and functionality of a new product containing reused parts is
comparable to a product with only new parts.
• IEC 62429:2007: Reliability growth—Stress testing for early failures in
unique complex systems (Edition 1.0). This  International Standard gives
guidance for reliability growth during final testing or acceptance testing of
unique complex systems. It gives guidance on accelerated test conditions
and criteria for stopping these tests.
• IEC 62506:2013: Methods for product accelerated testing (Edition 1.0)
provides guidance on the application of various accelerated test techniques
for measurement or improvement of product reliability. Identification
of potential failure modes that could be experienced in the use of a
product/item and their mitigation is instrumental to ensure dependability
of an item. The object of the methods is to either identify potential design
weakness or provide information on item dependability, or to achieve nec-
essary reliability/availability improvement, all within a compressed or
accelerated period of time. This standard addresses accelerated testing of
non-repairable and repairable systems.
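
The compliance tests for constant failure rate mentioned above rest on operating characteristic curves. As a rough illustration only, the following Python sketch assumes a constant failure rate and a time-terminated plan that accepts the item when at most a given number of failures occurs in a given accumulated test time. The function name and the plan values are illustrative assumptions and are not taken from IEC 61124.

```python
import math

def acceptance_probability(failure_rate, test_time, max_failures):
    """Probability of observing at most max_failures failures in test_time.

    Assumes a constant failure rate, so the number of failures in the
    accumulated test time follows a Poisson distribution.
    """
    m = failure_rate * test_time  # expected number of failures
    return sum(math.exp(-m) * m ** k / math.factorial(k)
               for k in range(max_failures + 1))

# Hypothetical plan: 1000 h of accumulated test time, accept on 2 failures or fewer.
for rate in (0.0005, 0.001, 0.002, 0.004):
    print(f"failure rate {rate:.4f}/h -> P(accept) = "
          f"{acceptance_probability(rate, 1000.0, 2):.3f}")
```

Evaluating the function over a range of failure rates traces one operating characteristic curve; the acceptance probability falls as the true failure rate rises.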

TABLE 9.4
Classification of the Types of Model Developed in IEC 61164

Type of Model    Continuous Time  Discrete Time
Classic design   Section 6.1      —
Bayesian design  Section 6.2      —
Classic tests    Section 7.1      Section 7.2
Bayesian tests   —                —

9.6  METHOD SELECTION


There are five key standards for method selection. The selection of methods to
implement in a reliability program is a highly individualized process, so much so
that no generic recommendation of one or more specific methods is possible.
The choice of the appropriate method should be made jointly by experts in
reliability and in systems engineering. The selection should be made at the
beginning of program development, and its applicability should be reviewed.
These standards help in selecting the most appropriate method for a system or
equipment.

• IEC 60300-3-1:2003: Dependability management—Part 3-1: Application
guide—Analysis techniques for dependability—Guide on methodology
(Edition 2.0) gives a general overview of commonly used dependability
analysis techniques. It describes the usual methodologies, their advantages
and disadvantages, data input, and other conditions for using various
techniques. This  standard is an introduction to selected methodologies
and is intended to provide the necessary information for choosing the most
appropriate analysis methods.

The 12 methods included are briefly explained in Annex A of the standard, with
reference to the IEC standard developed for each method, if any. The standard
also includes a guide for selecting the appropriate analysis method, taking into
account the characteristics of the system or equipment:

• Complexity of the system
• Novelty of the system
• Quantitative analysis or qualitative analysis
• Single failure or multiple failures
• Behavior dependent on time or a sequence
• Existence of dependent events
• Bottom-up or top-down analysis
• Suitability for reliability allocation
• Level of mastery required
• Acceptance and common use
• Need for tool support
• Credibility checks
• Availability of tools
• Standardization, referencing the seven methods with specific IEC standards

• IEC 60300-3-5:2001: Dependability management—Part 3-5: Application
guide—Reliability test conditions and statistical test principles (Edition
1.0) provides guidelines for the planning and performing of reliability
tests and the use of statistical methods to analyze test data. It describes
the tests related to repaired and non-repaired items together with tests for
constant and non-constant failure intensity and constant and non-constant
failure rate.

This standard establishes the methods and conditions for reliability tests and prin-
ciples for the performance of statistical tests. It  includes a detailed guide for the
selection of the statistical methods used to analyze the data coming from reliability
tests of repairable or non-repairable elements.
The requirements for correctly specifying the reliability test to be executed
are established so that all the variables that may affect the test are identified and
bounded before the statistical estimation and hypothesis testing methods are applied.
The standard then addresses the analysis of test data. For non-repairable
elements, parametric methods are proposed: a fit to the exponential distribution
when the failure rate λ(t) is constant, and a fit to the Weibull distribution when λ(t) shows a trend.
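
To make the distinction between the two parametric models concrete, the following minimal Python sketch fits both to a complete (uncensored) sample of failure times by maximum likelihood. It is an illustration under stated assumptions, not the procedure of IEC 60605-4 or IEC 61649: the function names, the bisection solver, and the sample data are all hypothetical.

```python
import math

def fit_exponential(times):
    """Maximum likelihood estimate of a constant failure rate (complete data)."""
    return len(times) / sum(times)

def fit_weibull(times, tol=1e-10):
    """Maximum likelihood estimates (shape beta, scale eta) for complete data."""
    if min(times) == max(times):
        raise ValueError("at least two distinct failure times are needed")
    n = len(times)
    t_max = max(times)
    x = [t / t_max for t in times]  # rescale to avoid overflow in v ** beta
    mean_log = sum(math.log(v) for v in x) / n

    def score(beta):
        # Profile likelihood equation for the shape; increasing in beta.
        s0 = sum(v ** beta for v in x)
        s1 = sum(v ** beta * math.log(v) for v in x)
        return s1 / s0 - 1.0 / beta - mean_log

    lo, hi = 1e-3, 1.0
    while score(hi) < 0:          # expand the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol:          # bisection on the shape parameter
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    beta = 0.5 * (lo + hi)
    eta = t_max * (sum(v ** beta for v in x) / n) ** (1.0 / beta)
    return beta, eta

# Hypothetical failure times in hours.
sample = [120.0, 340.0, 410.0, 560.0, 630.0, 790.0]
beta, eta = fit_weibull(sample)
print("exponential rate:", fit_exponential(sample))
print("Weibull shape and scale:", beta, eta)
```

A fitted shape close to 1 is consistent with a constant failure rate, whereas a shape above or below 1 indicates an increasing or decreasing failure rate, respectively.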
The statistical nature of failures in repairable elements is described as a
stochastic point process (SPP). The failure intensity z(t) refers exclusively to repairable
elements. The failure intensity of a single repairable element can be
estimated from the successive times between failures, as the number
of failures per unit of time or of another variable.
In this case, the failures of each element occur sequentially, which is what
defines an SPP. It is important to maintain the traceability of the sequence of times
between failures. If the times between failures are independent and exponentially
distributed, the failure intensity is constant, and the number of failures per unit
of time can be modeled by a homogeneous Poisson process (HPP).
Where there is a trend in the failure intensity (increasing or decreasing), a
non-homogeneous Poisson process (NHPP) can be applied. In many such cases the
power-law process (PLP), a particular NHPP, yields a model from which the trend can be
estimated. See the classification in Table 9.5.
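
The difference between the HPP and the PLP can be sketched in a few lines of Python. The example below assumes a single repairable item observed up to a fixed time T and a power-law intensity of the form z(t) = λβt^(β−1); it uses the usual maximum likelihood estimators for time-terminated data. The function names and failure times are hypothetical, and the sketch is not a substitute for the procedures of IEC 61710.

```python
import math

def hpp_intensity(failure_times, T):
    """Constant failure intensity estimate: failures per unit of observed time."""
    return len(failure_times) / T

def plp_mle(failure_times, T):
    """Maximum likelihood estimates (lam, beta) of a power-law process.

    failure_times are cumulative times of failure (0 < t <= T) of one
    repairable item observed until time T (time-terminated observation).
    """
    n = len(failure_times)
    beta = n / sum(math.log(T / t) for t in failure_times)
    lam = n / T ** beta
    return lam, beta

def plp_intensity(t, lam, beta):
    """Failure intensity z(t) of the power-law process."""
    return lam * beta * t ** (beta - 1.0)

# Hypothetical cumulative failure times (hours) over 1000 h of observation.
times = [600.0, 800.0, 900.0, 950.0]
lam, beta = plp_mle(times, 1000.0)
print("HPP intensity:", hpp_intensity(times, 1000.0))
print("PLP beta:", beta, "intensity at 1000 h:", plp_intensity(1000.0, lam, beta))
```

Under these assumptions, an estimated β above 1 points to an increasing failure intensity, β below 1 to a decreasing one, and β close to 1 to the HPP case.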
The following standards apply to the estimation of reliability in non-repairable
elements according to IEC 60300-3-5:

• Tests of the constant failure rate hypothesis: IEC 60605-6
• Point estimation and confidence intervals for the exponential distribution:
IEC 60605-4
• Goodness-of-fit test for the Weibull distribution: IEC 61649
• Point estimation and confidence intervals for the Weibull distribution: IEC 61649
• Point estimation and confidence intervals for the binomial distribution: ISO
11453

TABLE 9.5
Appropriate Models for Data Analysis According to IEC 60300-3-5

Item            Trend         Appropriate Model
Non-repairable  Constant      Exponential distribution
Non-repairable  Non-constant  Weibull distribution
Repairable      Constant      Homogeneous Poisson process (HPP)
Repairable      Non-constant  Non-homogeneous Poisson process (NHPP)

The following standards apply to estimating reliability in repairable items:

• Tests for constant failure intensity: IEC 60605-6
• Point estimation and confidence intervals for the exponential distribution:
IEC 60605-4
• Estimation of the parameters and statistical tests for the PLP: IEC
61710

• IEC 60319:1999: Presentation and specification of reliability data for electronic
components (Edition 3.0) describes the information needed for characterizing
reliability of a component and also the detailed requirements for
reporting reliability data. It gives guidance to component users as to how
they should specify their reliability requirements to component manufactur-
ers. The data, derived from laboratory tests, should enable circuit and equip-
ment designers to evaluate the reliability of circuits and systems.
• IEC 61703:2016: Mathematical expressions for reliability, availability,
maintainability and maintenance support terms (Edition 2.0), to account
for mathematical constraints, splits the items between the individual items
considered as a whole (e.g., individual components) and the systems made
of several individual items. It provides general considerations for the math-
ematical expressions for systems as well as individual items but the individ-
ual items that are easier to model are analyzed in more detail with regard to
their repair aspects. This standard is mainly applicable to hardware depend-
ability, but many terms and their definitions may be applied to items con-
taining software.

This standard provides the definitions related to reliability as well as the mathematical
expressions that should be used in the calculations of the main variables. In this
standard, the following classes of elements are considered separately:

• Non-repairable items
• Repairable items with zero time to restoration
• Repairable items with non-zero time to restoration

For non-repairable items, repairable items with zero time to restoration, and
repairable items with non-zero time to restoration, the standard develops and
formulates the mathematical expressions for:

• Reliability: R(t)
• Instantaneous failure rate: λ(t) (non-repairable items)
• Instantaneous failure intensity: z(t) (repairable items)
• Average failure rate: λ̄(t1, t2) (non-repairable items)
• Average failure intensity: z̄(t1, t2) (repairable items)
• Mean Time To Failure: MTTF (non-repairable items)
• Mean Up Time: MUT (repairable items)
• Mean Time Between Failures: MTBF (repairable items)
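
For orientation, the quantities listed above are linked by the following textbook relations. They are given here as a sketch in commonly used notation rather than as a quotation from IEC 61703: f(t) denotes the probability density of the time to failure (an assumed symbol), and the overbars on the interval averages are likewise an assumed notation.

```latex
\begin{align}
R(t) &= \exp\!\left(-\int_0^t \lambda(u)\,du\right), &
\lambda(t) &= \frac{f(t)}{R(t)} = -\frac{1}{R(t)}\,\frac{dR(t)}{dt},\\
\bar{\lambda}(t_1,t_2) &= \frac{1}{t_2-t_1}\int_{t_1}^{t_2}\lambda(t)\,dt, &
\bar{z}(t_1,t_2) &= \frac{1}{t_2-t_1}\int_{t_1}^{t_2} z(t)\,dt,\\
\mathrm{MTTF} &= \int_0^{\infty} R(t)\,dt .
\end{align}
```

For a constant failure rate λ these reduce to R(t) = exp(−λt) and MTTF = 1/λ.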

Likewise, for repairable items with non-zero time to restoration,
mathematical expressions are included for calculating instantaneous, average,
and asymptotic availability and unavailability, as well as maintainability, average
repair rate, and average repair time.
• IEC 62308:2006: Equipment reliability—Reliability assessment methods
(Edition 1.0). This International Standard describes early reliability assessment
methods for items based on field data and test data for components and mod-
ules. It is applicable to mission, safety and business critical, high integrity,
and complex items. It contains information on why early reliability estimates
are required and how and where the assessment would be used.

9.7  RELIABILITY EVALUATION METHODS


Each of these nine standards develops a specific method for evaluating the reliability
of a system or equipment. The  selection of the most appropriate method and for-
mulation must be made under the criteria specified in the selection standards of the
previous section:
• IEC 60812:2006: Analysis techniques for system reliability: Procedure for
failure mode and effects analysis (FMEA) (Edition 2.0). This International
Standard describes Failure Mode and Effects Analysis (FMEA) and Failure
Mode, Effects, and Criticality Analysis (FMECA) and gives guidance as
to how they may be applied to achieve various objectives by providing the
procedural steps necessary to perform analysis, identify appropriate terms,
define basic principles, and provide examples of the necessary worksheets
or other tabular forms.
FMEA is a bottom-up, qualitative reliability analysis method that is
particularly suitable for the study of material, component, and equipment failures and
their effects on the next higher functional level. Iterating this step (identifying
the single failure modes and evaluating their effects on the next higher level)
identifies all the single failure modes of the system.
The  FMEA  lends itself to the analysis of systems of different technologies
(electrical, mechanical, hydraulic, software, etc.) with simple functional structures.
The FMECA analysis extends the FMEA to include the criticality analysis, quan-
tifying the effects of the failures in terms of probability of occurrence and their
severity. The severity of the effects is assigned with respect to a specific scale.
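
As a rough illustration of how a criticality ranking might be tabulated, the Python sketch below scores each failure mode by multiplying an occurrence probability by a severity class. The data, the four-level severity scale, and the simple product score are assumptions made for this example; IEC 60812 describes its own approaches, and a criticality matrix is often preferred because a product score can hide rare but severe failure modes.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    item: str
    mode: str
    probability: float  # assumed probability of occurrence over the mission
    severity: int       # assumed scale: 1 (minor) to 4 (catastrophic)

def criticality(fm):
    """Illustrative criticality score: occurrence probability times severity."""
    return fm.probability * fm.severity

def rank(modes):
    """Failure modes ordered from most to least critical."""
    return sorted(modes, key=criticality, reverse=True)

# Hypothetical worksheet entries.
worksheet = [
    FailureMode("pump", "seal leak", 0.02, 2),
    FailureMode("valve", "stuck closed", 0.005, 4),
    FailureMode("sensor", "drift", 0.05, 1),
]
for fm in rank(worksheet):
    print(f"{fm.item:8s} {fm.mode:14s} criticality = {criticality(fm):.3f}")
```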
Both FMECA and FMEA are normally carried out when a certain risk is foreseen
in the program at the start of the development of a process or product.
The factors that can be considered are new technology, new processes, new designs,
or changes in the environment, loads, or regulations. These analyses can be
performed on components or systems that are part of products, processes, or
manufacturing equipment. They can also be carried out on software systems.
The FMECA and FMEA methods generally comprise the following steps:

• Identification of how the component of a system should work
• Identification of its potential failure modes, causes, and effects
• Identification of the risk related to failure modes and their effects
• Identification of the recommended actions to eliminate or reduce the risk
• Follow-up activities to close the recommended actions

• IEC 61025:2006: Fault tree analysis (FTA) (Edition 2.0) describes
FTA and provides guidance on its application to perform an analysis, iden-
tifies appropriate assumptions, events and failure modes, and provides iden-
tification standards and symbols.

The FTA is a top-down approach to the analysis of the reliability of a product. It seeks
the identification and analysis of the conditions and factors that cause, or contribute
to, the occurrence of a defined undesired event and that may affect the operation,
safety, economy, or other specified characteristics of the product.
The FTA also can be performed to provide a prediction model of the reliability
of a system and allow cost-benefit studies in the design phase of a product. Used as a
tool for the detection and quantitative evaluation of a cause of failure, the FTA repre-
sents an efficient method that identifies and evaluates the modes and causes of failure
of known or suspected effects.
Given the known unfavorable effects and the ability to find the respective
modes and causes of failure, the FTA allows the timely mitigation of potential
failure modes and thus the improvement of product reliability in the design
phase. Built to represent hardware and software architecture in addition to analyzing
functionality, and developed down to basic events, the FTA becomes a systematic
reliability modeling technique that considers the complex interactions between parts
of a system by modeling functional or fault dependencies, events that trigger
failures, and common cause events, and that allows the representation of networks.
To estimate the reliability and availability of a system using the FTA technique,
methods such as Boolean reduction and cut set analysis are used.
The basic data required are the failure rates of the components, repair rates,
probability of occurrence of failure modes, etc. FTA has a dual application: as a
means to identify the cause of a known failure and as a tool for analyzing failure
modes and modeling and predicting reliability. The key elements of a fault tree are
events, gates, and cut sets.
The  gates represent results and the events represent entrances to the gates.
The symbolic representation of some specific gates may vary from one textbook or
analysis software to another; however, the representation of the basic gates is clearly
universal (see Table 9.6).
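
As a rough illustration of the quantitative side of FTA, the sketch below evaluates a small fault tree in two ways: directly through its AND/OR gates, and by inclusion-exclusion over its minimal cut sets. The tree (TOP = A OR (B AND C)), the basic-event probabilities, and all function names are hypothetical, and the basic events are assumed to be statistically independent.

```python
from itertools import combinations


def and_gate(probs):
    """Output probability of an AND gate with independent inputs."""
    p = 1.0
    for q in probs:
        p *= q
    return p


def or_gate(probs):
    """Output probability of an OR gate with independent inputs."""
    none_occur = 1.0
    for q in probs:
        none_occur *= (1.0 - q)
    return 1.0 - none_occur


def top_event_from_cut_sets(cut_sets, p_basic):
    """Exact top-event probability by inclusion-exclusion over minimal cut sets."""
    total = 0.0
    for k in range(1, len(cut_sets) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for combo in combinations(cut_sets, k):
            # All basic events that must occur for every cut set in combo to occur
            events = set().union(*combo)
            total += sign * and_gate(p_basic[e] for e in events)
    return total


# Hypothetical basic-event probabilities (illustrative values only)
p_basic = {"A": 0.01, "B": 0.05, "C": 0.02}

# Gate-by-gate evaluation of TOP = A OR (B AND C)
p_top_gates = or_gate([p_basic["A"], and_gate([p_basic["B"], p_basic["C"]])])

# Same tree expressed through its minimal cut sets {A} and {B, C}
p_top_cuts = top_event_from_cut_sets([{"A"}, {"B", "C"}], p_basic)

print(p_top_gates, p_top_cuts)
```

Both routes give the same value here because no basic event is repeated in the tree; when events are shared between branches, the cut set route remains exact while naive gate-by-gate multiplication does not.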

TABLE 9.6
Symbols That Are Used in the Representation of the FTA Method
FTA Symbols    Event or Gate
               Higher or intermediate event
               Basic event
               Undeveloped event
               Transfer gate
               OR gate
               AND gate

• IEC 61078:2016: Analysis techniques for dependability – Reliability block
diagram and Boolean methods (Edition 3.0). This International Standard
describes the requirements to apply when RBDs are used in dependability
analysis; the procedures for modeling the dependability of a system with
reliability block diagrams; how to use RBDs for qualitative and quantitative
analysis; the procedures for using the RBD model to calculate availability,
failure frequency, and reliability measures for different types of systems
with constant (or time-dependent) probabilities of block success/failure,
and for non-repaired or repaired blocks; and some theoretical aspects and
limitations in performing calculations for availability, failure frequency,
and reliability measures.

The RBD analysis is a method of analyzing a system by graphically representing
the logical structure of the system in terms of subsystems or components. This allows
the success paths of the system to be represented by the way in which the blocks
(subsystems/components) connect logically (see Figure 9.1).

FIGURE 9.1  Elementary models for RBD analysis.



Block diagrams are among the first tasks that are completed during the definition
of the product. They should be built as part of the initial conceptual development.
They should be started as soon as the program definition exists, completed as part of
the requirements analysis, and continuously extended to a greater level of detail, as
the data becomes available to make decisions and perform cost-benefit studies.
To construct and evaluate an RBD, the following steps are taken:

• Establish the definition of the success of the system
• Divide the system into appropriate functional blocks for the purpose of
reliability analysis (some blocks can represent substructures of the system that,
in turn, can be represented by other RBDs (system reduction))
• Carry out the qualitative and quantitative analysis (several methods exist for
the quantitative evaluation of an RBD. Depending on the type of structure
(reducible or irreducible), simple Boolean techniques, truth tables, or analysis
of cut sets or path sets can be used to predict reliability and system
availability values from the basic component data)

More complex models, in which the same block appears more than once in the diagram,
can be evaluated by using:

• The total probability theorem
• Boolean truth tables
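
The two techniques just listed can be sketched on a bridge structure, a common textbook example of an irreducible RBD. Block 3 is shared by two of the four success paths (1-4, 2-5, 1-3-5, 2-3-4). The layout, the block reliabilities, and the function names are assumptions made for illustration, and the blocks are assumed independent.

```python
from itertools import product


def series(*r):
    """Reliability of independent blocks in series."""
    out = 1.0
    for x in r:
        out *= x
    return out


def parallel(*r):
    """Reliability of independent blocks in parallel (active redundancy)."""
    q = 1.0
    for x in r:
        q *= (1.0 - x)
    return 1.0 - q


def bridge_total_probability(r1, r2, r3, r4, r5):
    """Condition on the shared block 3 (total probability theorem)."""
    # Block 3 works: (1 || 2) in series with (4 || 5)
    given_up = series(parallel(r1, r2), parallel(r4, r5))
    # Block 3 failed: path 1-4 in parallel with path 2-5
    given_down = parallel(series(r1, r4), series(r2, r5))
    return r3 * given_up + (1.0 - r3) * given_down


def bridge_truth_table(r):
    """Sum the probabilities of all block states in which a success path exists."""
    paths = [(0, 3), (1, 4), (0, 2, 4), (1, 2, 3)]  # zero-based block indices
    total = 0.0
    for state in product([True, False], repeat=5):
        if any(all(state[i] for i in path) for path in paths):
            prob = 1.0
            for up, ri in zip(state, r):
                prob *= ri if up else (1.0 - ri)
            total += prob
    return total


r = [0.9, 0.9, 0.9, 0.9, 0.9]  # hypothetical block reliabilities
r_bridge_tp = bridge_total_probability(*r)
r_bridge_tt = bridge_truth_table(r)
print(r_bridge_tp, r_bridge_tt)
```

The truth table grows as 2^n with the number of blocks, so it is practical only for small diagrams; conditioning on shared blocks keeps the calculation compact.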

• IEC 61165:2006: Application of Markov techniques (Edition 2.0).
This International Standard provides guidance on the application of Markov
techniques to model and analyze a system and estimate reliability,
availability, maintainability, and safety measures. This standard is applicable to
all industries where systems, which exhibit state-dependent behavior, have
to be analyzed. The Markov techniques covered by this standard assume
constant time-independent state transition rates. Such techniques often are
called homogeneous Markov techniques.

The Markov model is a probabilistic method that allows the statistical dependence
of the failure or repair characteristics of the individual components to be adapted
to the state of the system. Therefore, the Markov model can account for
order-dependent component failures and for transition rates that change as a result
of stress or other factors. For this reason, Markov analysis is an adequate method
for evaluating the reliability of functionally complex system structures and complex
repair and maintenance strategies.
The method is based on the theory of Markov chains. For reliability applications,
the normal reference model is the time-homogeneous Markov model, which requires
transition rates (failure and repair) to be constant. At the expense of an increase
in the state space, non-exponential transitions can be approximated by a sequence
of exponential transitions. For this model, general and efficient numerical
techniques are available; the only limitation on their application is the dimension
of the state space.

The representation of the behavior of the system by means of a Markov model
requires the determination of all possible states of the system, preferably
represented graphically by means of a state transition diagram. In addition, the
(constant) transition rates from one state to another have to be specified (failure
or repair rates of a component, event rates, etc.). The typical result of a Markov
model is the probability of being in a given set of states (normally this probability
is the measure of availability).
The appropriate field of application of this technique is when the transition rates
(failure or repair) depend on the state of the system or vary with the load, the
stress level, the structure of the system (e.g., standby), the maintenance policy, or
other factors. In particular, the structure of the system (cold or hot standby, spare
parts) and the maintenance policy (single or multiple repair crews) induce
dependencies that cannot be considered with other, less computationally intensive
techniques. Typical applications are predictions of reliability/availability. For the
application of this methodology, the following key steps must be taken into account:

• Definition of the state space of the system
• Assignment of transition rates between states (independent of time)
• Definition of the output measures (group of states that lead to a system failure)
• Generation of the mathematical model (matrix of transition rates) and
resolution of Markov models by using an appropriate software package
• Analysis of results
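
The key steps above can be sketched for the simplest case, a single repairable item with two states (operating and failed). The rates, the explicit Euler integration scheme, and all names are assumptions for illustration; a real study would typically rely on a dedicated solver. The numerical result is compared with the well-known closed-form availability of the two-state model.

```python
import math


def transition_rate_matrix(lam, mu):
    """Rate matrix Q for states 0 = operating, 1 = failed (constant rates)."""
    return [[-lam, lam],
            [mu, -mu]]


def solve_state_probabilities(Q, p0, t_end, steps=20000):
    """Integrate dP/dt = P * Q with explicit Euler steps (illustrative only)."""
    n = len(Q)
    p = list(p0)
    dt = t_end / steps
    for _ in range(steps):
        dp = [sum(p[i] * Q[i][j] for i in range(n)) for j in range(n)]
        p = [p[j] + dt * dp[j] for j in range(n)]
    return p


# Hypothetical rates per hour
lam, mu = 1e-3, 1e-1
t_end = 20.0

Q = transition_rate_matrix(lam, mu)
p = solve_state_probabilities(Q, [1.0, 0.0], t_end)

# Output measure: availability = probability of being in the operating state
availability_numeric = p[0]
availability_exact = (mu / (lam + mu)
                      + lam / (lam + mu) * math.exp(-(lam + mu) * t_end))
availability_steady = mu / (lam + mu)

print(availability_numeric, availability_exact, availability_steady)
```

Larger models follow the same pattern with a bigger rate matrix, which is where the state-space dimension becomes the limiting factor.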

In Figure 9.2, the white circles represent operational states, while the gray circles
represent non-operational states. λx are the failure transition rates from one state
to another and μx are the repair transition rates from one state to another.

FIGURE 9.2  State transition diagram in Markov analysis.

• IEC 61709:2017: Electric components – Reliability – Reference conditions
for failure rates and stress models for conversion (Edition 3.0) gives
guidance on the use of failure rate data for reliability prediction of electric
components used in equipment. The method presented in this document
uses the concept of reference conditions, which are the typical values of
stresses that are observed by components in the majority of applications.
Reference conditions are useful since they provide a known standard basis
from which failure rates can be modified to account for differences in
environment from the environments taken as reference conditions. Each user
can use the reference conditions defined in this document or use their own.
Using failure rates stated at reference conditions allows realistic
reliability predictions to be made in the early design phase.

The stress models described herein are generic and can be used as a basis for conver-
sion of failure rate data given at these reference conditions to actual operating condi-
tions when needed and this simplifies the prediction approach. Conversion of failure
rate data is only possible within the specified functional limits of the components.
This document also gives guidance on how a database of component failure data can
be constructed to provide failure rates that can be used with the included stress models.
Reference conditions for failure rate data are specified so that data from differ-
ent sources can be compared on a uniform basis. If failure rate data are given in
accordance with this document, then additional information on the specified condi-
tions can be dispensed with. This document does not provide base failure rates for
components—rather it provides models that allow failure rates obtained by other
means to be converted from one operating condition to another operating condition.
The prediction methodology described in this document assumes that the parts are
being used within their useful life.
This international standard is intended for the prediction of reliability of compo-
nents used in equipment and focuses on organizations with their own data, describing
how to establish and use such data to make predictions of reliability. The failure rate
of a component under operating conditions is calculated as follows:

$$\lambda = \lambda_{ref}\,\pi_U\,\pi_I\,\pi_T\,\pi_E\,\pi_S\,\pi_{ES} \qquad (9.1)$$

with:
λref is the failure rate in the reference conditions
πU is the dependence factor with voltage
πI is the dependence factor with current
πT is the dependence factor with temperature
πE is the environmental application factor
πS is the dependence factor with switching frequency
πES is the dependence factor with electrical stress

Therefore, the failure rate of an equipment made up of n components under operating
conditions is calculated as the sum of the component failure rates:

$$\lambda_{Equip} = \sum_{i=1}^{n} \lambda_i \qquad (9.2)$$

The standard develops specific stress models and values of the π factors applicable to
the different types of components that must be used to convert the reference failure
rates to failure rates in the operating conditions. The π factors are modifiers of the
failure rate associated with a specific condition or stress. They provide a measure of
the modification of the failure rate as a consequence of changes in the stress or
condition.
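
A minimal sketch of Equations 9.1 and 9.2 follows. The reference failure rates and π factor values are invented for illustration and are not taken from IEC 61709; in practice the factors come from the stress models of the standard for each component type.

```python
from math import prod, isclose


def operating_failure_rate(lambda_ref, pi_factors):
    """Equation 9.1: reference failure rate multiplied by all pi factors."""
    return lambda_ref * prod(pi_factors.values())


# Hypothetical components (failure rates per hour, pi factors dimensionless)
components = [
    {"lambda_ref": 2e-9,
     "pi": {"U": 1.2, "I": 1.0, "T": 2.5, "E": 1.0, "S": 1.0, "ES": 1.0}},
    {"lambda_ref": 5e-9,
     "pi": {"U": 1.0, "I": 1.0, "T": 1.5, "E": 1.0, "S": 1.0, "ES": 2.0}},
]

# Equation 9.2: the equipment failure rate is the sum over its components
lambda_components = [operating_failure_rate(c["lambda_ref"], c["pi"])
                     for c in components]
lambda_equip = sum(lambda_components)

print(lambda_components, lambda_equip)
```

A factor of 1.0 means the component operates at the reference condition for that stress, so only the stresses that differ from the reference change the result.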

• IEC 61882:2016: Hazard and operability studies (HAZOP studies) –
Application guide (Edition 2.0) provides a guide for HAZOP studies of
systems using guide words. It gives guidance on application of the technique
and on the HAZOP study procedure, including definition, preparation,
examination sessions and resulting documentation, and follow-up.
Documentation examples illustrating HAZOP studies, as well as a broad set
of examples encompassing various applications, are provided.

A HAZOP study is a detailed process of identification of hazards and operational
problems carried out by a team. A HAZOP deals with the identification of potential
deviations from the design intent, examination of their possible causes, and
evaluation of their consequences.
The basis of a HAZOP is a guide word examination, which constitutes a deliberate
search for deviations from the design intent. The design intent covers the behavior
of a system, its elements, and the characteristics desired, or specified, by the
designer. To facilitate the examination, the system is divided into parts so that the
design intent of each of the parts can be defined properly.
The design intent of a given part of a system is expressed in terms of elements
that convey the essential features of that part and that represent its natural
divisions. The elements can be stages or steps of a procedure, individual signals and
elements of equipment in a control system, equipment or components in a process or
electronic system, and so on.
The identification of the deviations from the design intent is obtained through a
process of questions using predetermined guide words. The role of the guide word
is to stimulate imaginative thinking, to focus the study, and to provoke ideas and
discussion, thus maximizing the opportunities to achieve a more complete study.
The HAZOP is well suited to the later stages of detailed design, to examine
operational capabilities, and to changes made to existing facilities. The best time
to conduct a HAZOP study is just before the design is frozen. The HAZOP studies
consist of four basic sequential steps:

• Definition of scope, objectives, responsibilities, and team
• Preparation of the study, recording format, and data collection
• Examination, dividing the system into parts and identifying problems,
causes, and consequences (identify protection mechanisms and measures)
• Documentation and follow-up, with a report of initial conclusions, the
preventive and corrective actions taken, and a final results report

• IEC 62502:2010: Analysis techniques for dependability – Event tree
analysis (ETA) (Edition 1.0) specifies the consolidated basic principles of
ETA and provides guidance on modeling the consequences of an initiating
event as well as analyzing these consequences qualitatively and
quantitatively in the context of dependability and risk-related measures.

The event tree considers a number of possible consequences of an initiating event or
system failure. Thus, the event tree can be combined very efficiently with the fault
tree. The root of an event tree can be seen as the top event of a fault tree. This
combination is sometimes called cause-consequence analysis. To evaluate the
seriousness of certain consequences that derive from the initiating event, all
possible consequence paths should be identified and investigated, and their
probabilities determined.
Event tree analysis is used when it is essential to investigate all the possible
paths of consequent events, their sequences, and the consequences or most probable
results of the initiating event. After an initiating event, there are some first
events or subsequent consequences that may follow. The probability associated with
the occurrence of a specific path (sequence of events) is the product of the
conditional probabilities of all the events of that path.
The key elements in the application of the event tree are (Figure 9.3):

• The initiator (initiating event)
• Subsequent events
• Consequences of the events
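
The product rule for path probabilities can be sketched with a two-stage tree laid out like Figure 9.3: three final states reached through one or two subsequent events. The branch probabilities below are hypothetical placeholders for the symbolic values shown in the figure.

```python
def path_probability(conditional_probs):
    """Probability of one path: product of the conditional branch probabilities."""
    p = 1.0
    for q in conditional_probs:
        p *= q
    return p


# Hypothetical conditional probabilities of the YES branch at each stage
p_yes_first = 0.9    # first subsequent event occurs, given the initiating event
p_yes_second = 0.8   # second subsequent event occurs, given the first one did

outcomes = {
    "State 1": path_probability([p_yes_first, p_yes_second]),
    "State 2": path_probability([p_yes_first, 1.0 - p_yes_second]),
    "State 3": path_probability([1.0 - p_yes_first]),
}

for state, prob in outcomes.items():
    print(state, round(prob, 4))
```

Because the branches at each node are mutually exclusive and exhaustive, the outcome probabilities add up to one, which is a useful check on any event tree.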

• IEC 62551:2012: Analysis techniques for dependability – Petri net
techniques (Edition 1.0) provides guidance on a Petri-net based methodology
for dependability purposes. It supports modeling a system, analyzing
the model, and presenting the analysis results. This methodology is oriented
to dependability-related measures with all the related features, such as
reliability, availability, production availability, maintainability, and safety.

A Petri net is a graphical tool for the representation and analysis of complex logical
interactions between the components or events of a system. The typical complex
interactions that are naturally included in the language of the Petri net are
concurrency, conflict, synchronization, mutual exclusion, and resource limitation.

FIGURE 9.3  General outline of an event tree. (Columns: initiating event, subsequent
events, consequence of events. Branch probabilities: YES = 0,a and NOT = 0,b at the
first event; YES = 0,c and NOT = 0,d at the second event. Outcomes: State 1 with
probability 0,a × 0,c; State 2 with probability 0,a × 0,d; State 3 with
probability 0,b.)

FIGURE 9.4  Untimed Petri net symbols. (Symbols shown: place identifier, transition
identifier, arc symbols, relationship symbols, mark symbol.)

The static structure of the system modeled is represented by a Petri net graph,
which is composed of three primary elements (see Figure 9.4):

• Nodes or places: usually drawn as circles, that represent the conditions in
which the system can be found
• Transitions: usually drawn as bars, that represent events that can change
one condition into another
• Arcs: drawn as arrows that connect nodes with transitions and transitions
with nodes and that represent the admissible logical connections between
conditions and events

A condition is valid in a given situation if the corresponding node is marked; that is,
it contains at least one “•” mark (drawn as a black dot). The dynamics of the system
are represented by the movement of the marks in the graph. A transition is enabled if
each of its input nodes contains at least one mark.
An enabled transition can fire; firing removes a mark from each input node and
places a mark on each output node. The distribution of the marks in the nodes is
called the marking.
Starting from an initial marking, the application of the enabling and firing
rules produces all the possible markings, which constitute the reachability set of
the Petri net. This reachability set provides all the states that the system can
reach from the initial state.
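
The enabling and firing rules, and the construction of the set of reachable markings, can be sketched as follows. The net is a hypothetical example with two identical items sharing one repair crew (a resource limitation); place and transition names are invented for illustration, and all arcs are assumed to have weight one.

```python
from collections import deque

# Places (indices into a marking tuple): 0 = up, 1 = down, 2 = crew free, 3 = in repair
# Each transition is (input places, output places)
transitions = {
    "fail":         ([0], [1]),
    "start_repair": ([1, 2], [3]),
    "end_repair":   ([3], [0, 2]),
}


def enabled(marking, transition):
    """A transition is enabled if every input place holds at least one mark."""
    inputs, _ = transition
    return all(marking[p] >= 1 for p in inputs)


def fire(marking, transition):
    """Firing removes one mark from each input place and adds one to each output."""
    inputs, outputs = transition
    m = list(marking)
    for p in inputs:
        m[p] -= 1
    for p in outputs:
        m[p] += 1
    return tuple(m)


def reachability_set(initial, transitions):
    """Breadth-first exploration of all markings reachable from the initial one."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        m = queue.popleft()
        for t in transitions.values():
            if enabled(m, t):
                nxt = fire(m, t)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return seen


initial_marking = (2, 0, 1, 0)  # two items up, one repair crew free
reachable = reachability_set(initial_marking, transitions)
print(sorted(reachable))
```

The exploration terminates here because the number of marks is conserved; an unbounded net would need a limit on the search.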
Standard Petri nets do not include the notion of time. However, many extensions
have appeared in which timing aspects are superimposed on the Petri net. If a
(constant) firing rate is assigned to each transition, then the dynamics of the Petri
net can be analyzed by a continuous-time Markov chain whose state space is isomorphic
with the reachability set of the corresponding Petri net.
The Petri net can be used as a high-level language to generate Markov models, and
some tools used for reliability analysis are based on this methodology. Petri nets
also provide a natural environment for simulation.
The use of Petri nets is recommended when complex logical interactions must be
considered (concurrency, conflict, synchronization, mutual exclusion, and resource
limitation). In addition, a Petri net usually is an easier and more natural language
to use in describing a Markov model.

The key element of a Petri net analysis is the description of the structure of the
system and its dynamic behavior in terms of primary elements (nodes, transitions,
arcs, and marks) typical of the Petri net language. This step requires the use of ad
hoc software tools.

• IEC 62740:2015: Root cause analysis (RCA) (Edition 1.0) describes the
basic principles of root cause analysis (RCA) and specifies the steps that a
process for RCA should include. This standard identifies a number of attri-
butes for RCA techniques that assist with the selection of an appropriate
technique. It describes each RCA technique and its relative strengths and
weaknesses. RCA is used to analyze the root causes of focus events with
both positive and negative outcomes, but it is most commonly used for the
analysis of failures and incidents.

Causes for such events can be varied in nature, including design processes and
techniques, organizational characteristics, human aspects, and external events.
An RCA can be used for investigating the causes of non-conformances in quality
(and other) management systems as well as for failure analysis (e.g., in maintenance
or equipment testing). An RCA is used to analyze focus events that have occurred;
therefore, this standard only covers a posteriori analyses.
It  is recognized that some of the RCA  techniques with adaptation can be used
proactively in the design and development of items and for causal analysis during
risk assessment; however, this standard focuses on the analysis of events that have
occurred. The intent of this standard is to describe a process for performing RCA and
to explain the techniques for identifying root causes. These techniques are not designed
to assign responsibility or liability, which is outside the scope of this standard.

9.8 STATISTICAL METHODS FOR THE EVALUATION OF RELIABILITY
These five standards should be used together, since they are all complementary. In a
first phase it is necessary to classify the nature of the system or equipment subjected
to the analysis in relation to its maintainability, since the methods are different if it
is a repairable or non-repairable item. The IEC 60300-3-5: 2001 standard includes
a complete procedure for the adequate selection of the most appropriate statistical
method in each case.

• IEC 60605-6:2007: Equipment reliability testing – Part 6: Tests for the validity
and estimation of the constant failure rate and constant failure intensity
(Edition 3.0) specifies procedures to verify the assumption of a constant failure
rate or constant failure intensity. These procedures are applicable whenever
it is necessary to verify these assumptions. This may be a requirement, or it
may serve to assess the behavior over time of the failure rate or the failure
intensity. The major technical changes with respect to the previous edition
concern the inclusion of corrected formulas for tests previously included in a
corrigendum, and the addition of new methods for the analysis of multiple items.

The standard develops the tests to check the hypothesis of constant failure rate λ(t)
for non-repairable elements, and the tests to check the hypothesis of constant failure
intensity z(t) for repairable elements.
In Section 6.2 of the standard the U-test (Laplace test) is developed to analyze
whether the non-repairable equipment under study has a trend in its failure rate.
The standard also includes three graphical methods of trend testing in Sections 6.3
through 6.5, as support for the analyst in assessing whether the non-repairable
elements under study show a trend or not.
In Section 7.2 of the standard, the procedure is developed to check whether a
repairable element has a constant failure intensity z(t), based on the calculation of
the U-test (Laplace test).
For testing completed by time:

$$U = \frac{\sum_{i=1}^{r} T_i - r\,\frac{T^*}{2}}{T^*\sqrt{\frac{r}{12}}} \qquad (9.3)$$

For testing completed by failure:

$$U = \frac{\sum_{i=1}^{r-1} T_i - (r-1)\,\frac{T_r}{2}}{T_r\sqrt{\frac{r-1}{12}}} \qquad (9.4)$$

with:
r is the total number of failures
T* is the total time of the test completed by time
Tr is the total time of the test completed by failure
Ti is the cumulative test time at the ith failure

Under the zero growth hypothesis (i.e., the failure times follow a HPP), the U
statistic is approximately distributed according to a standard normal distribution
with mean 0 and standard deviation 1. The U-test can be used to test whether there is
evidence of reliability growth, positive or negative, independent of the reliability
growth model.
A two-sided test for positive or negative growth with significance level α has
critical values u_{1−α/2} and −u_{1−α/2}, where u_{1−α/2} is the (1−α/2)100 percent
percentile of the standard normal distribution. If −u_{1−α/2} < U < u_{1−α/2}, then
there is no evidence of positive or negative growth of the reliability at a
significance level α. In this case, the hypothesis of an exponential distribution of
times between successive failures of the HPP is accepted with significance level α:

$$-u_{1-\alpha/2} < U < u_{1-\alpha/2} \qquad (9.5)$$

For the significance levels required in each test, the appropriate critical values of
the percentile table of the standard normal distribution should be chosen according
to Table 9.7.

TABLE 9.7
Critical Values for a Level of Significance α
α        u_{1−α/2} Value
0.025    2.24
0.050    1.96
0.100    1.64
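
A minimal sketch of the Laplace U-test for a single repairable item follows, covering both the time-terminated and the failure-terminated forms (the latter summing over the first r − 1 failures, as in the usual statement of the test). The failure times are invented for illustration, and the critical value is taken from the standard normal distribution through Python's `statistics.NormalDist` rather than from a table.

```python
from math import sqrt
from statistics import NormalDist


def laplace_u_time_terminated(failure_times, t_star):
    """U statistic for a test stopped at time t_star (cumulative failure times)."""
    r = len(failure_times)
    return (sum(failure_times) - r * t_star / 2) / (t_star * sqrt(r / 12))


def laplace_u_failure_terminated(failure_times):
    """U statistic for a test stopped at the last failure (uses the first r-1 times)."""
    r = len(failure_times)
    t_r = failure_times[-1]
    return (sum(failure_times[:-1]) - (r - 1) * t_r / 2) / (t_r * sqrt((r - 1) / 12))


def no_trend(u, alpha):
    """Two-sided decision: True if the constant-intensity hypothesis is not rejected."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return -critical < u < critical


# Hypothetical cumulative failure times in hours, test stopped at 1000 h
evenly_spread = [100, 300, 500, 700, 900]
bunched_late = [600, 800, 900, 950, 990]

u_even = laplace_u_time_terminated(evenly_spread, 1000)
u_late = laplace_u_time_terminated(bunched_late, 1000)

print(u_even, no_trend(u_even, 0.05))
print(u_late, no_trend(u_late, 0.05))
```

A large positive U means failures cluster toward the end of the observation period (deteriorating reliability); a large negative U means they cluster toward the beginning (reliability growth).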

In Section 7.3 of the standard, the procedure is developed to check whether a set
of repairable elements of the same characteristics has a constant failure intensity
z(t), based on the calculation of the U-test:

$$U = \frac{\sum_{i=1}^{k}\sum_{j=1}^{r_i} T_{ij} - 0.5\left(r_1 T_1^* + r_2 T_2^* + \dots + r_k T_k^*\right)}{\sqrt{\frac{1}{12}\left(r_1 T_1^{*2} + r_2 T_2^{*2} + \dots + r_k T_k^{*2}\right)}} \qquad (9.6)$$

with:
ri is the total number of failures to consider from the ith item
Ti* is the total time of the test for the ith item
Tij is the time accumulated at the jth failure of the ith item
k is the total number of items

As in the case of Section 7.2 of the standard, a two-sided test for positive or negative
growth with significance level α has critical values u_{1−α/2} and −u_{1−α/2}, where
u_{1−α/2} is the (1−α/2)100 percent percentile of the standard normal distribution.
In Section 7.4 of the standard, the graphical procedure M(t) plot is developed to
check whether one or a set of repairable elements of the same characteristics has
constant failure intensity. It is a more qualitative than quantitative test.

• IEC 60605-4:2001: Equipment reliability testing – Part 4: Statistical
procedures for exponential distribution – Point estimates, confidence
intervals, prediction intervals and tolerance intervals (Edition 2.0) provides
statistical methods for evaluating point estimates, confidence intervals,
prediction intervals, and tolerance intervals for the failure rate of items whose
time to failure follows an exponential distribution.

This standard develops the statistical procedure for the exponential distribution and
allows estimating the value of the constant failure rate λ(t) for non-repairable
elements and the value of the constant failure intensity z(t) for repairable elements.
It also includes the formulation for the calculation of confidence intervals,
tolerance intervals, and so on.

This standard must be applied as a complement to IEC 60605-6: if the U-test
accepts the hypothesis of an exponential distribution of the times between
successive failures (or a HPP), the value of the constant failure rate λ(t) or
constant failure intensity z(t) can be calculated directly.
For testing completed by time and non-repairable items, the point estimate of the
failure rate is:

$$\hat{\lambda} = \frac{r}{T^*} \qquad (9.7)$$

For tests terminated by failure:

$$\hat{\lambda} = \frac{r}{T^*} \qquad (9.8)$$

with:
r is the total number of failures in test
T* is the total time of the test completed by time or by failure

For testing completed by time and repairable elements, the point estimate of the
failure intensity is:

$$\hat{z} = \frac{r}{T^*} \qquad (9.9)$$

For tests terminated by failure:

$$\hat{z} = \frac{r}{T^*} \qquad (9.10)$$

with:
r is the total number of failures in test
T* is the total time of the test completed by time or by failure

The standard includes the calculation of two-sided confidence intervals, for example,
for tests completed by time for repairable items:

$$z_{L2} = \lambda_{L2} = \frac{\chi^2_{\alpha/2}(2r)}{2T^*} \qquad (9.11)$$

$$z_{U2} = \lambda_{U2} = \frac{\chi^2_{1-\alpha/2}(2r+2)}{2T^*} \qquad (9.12)$$

with:
χ² is the fractile table value of the χ² distribution, with the number of degrees of
freedom given in parentheses, for the 90 percent confidence interval (α = 0.10).
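
The point estimate and the two-sided limits of Equations 9.11 and 9.12 can be sketched as below. Since the Python standard library has no χ² quantile function, the sketch swaps in the Wilson-Hilferty approximation in place of the tabulated fractiles; this is an assumption of the example, not the method of the standard, and exact tables or a statistics package should be preferred for real work. The test data are hypothetical.

```python
from statistics import NormalDist


def chi2_quantile_wh(p, dof):
    """Approximate chi-square p-quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3


def failure_rate_estimates(r, t_star, confidence=0.90):
    """Point estimate and two-sided limits for a time-terminated test."""
    alpha = 1.0 - confidence
    point = r / t_star
    lower = chi2_quantile_wh(alpha / 2, 2 * r) / (2 * t_star)
    upper = chi2_quantile_wh(1 - alpha / 2, 2 * r + 2) / (2 * t_star)
    return point, lower, upper


# Hypothetical test: 5 failures in 1000 accumulated hours
point, lower, upper = failure_rate_estimates(r=5, t_star=1000.0)
print(point, lower, upper)
```

The interval is asymmetric around the point estimate, which is typical when the number of observed failures is small.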

In  addition, the standard allows for prediction intervals for failures for a
future period in Section 9.6 and a procedure for assigning tolerance intervals in
Section 9.7.

• IEC 61649:2008: Weibull analysis (Edition 2.0) provides methods for ana-
lyzing data from a Weibull distribution using continuous parameters such as
time to failure, cycles to failure, mechanical stress, and so on. This standard
is applicable whenever data on strength parameters such as times to fail-
ure, cycles, and stress are available for a random sample of items operating
under test conditions or in-service to estimate measures of reliability per-
formance of the population from which these items were drawn. The main
changes with respect to the previous edition are as follows: the title has been
shortened and simplified to read “Weibull analysis” and provision of meth-
ods for both analytical and graphical solutions has been added.

In non-repairable items, when the failure rate λ(t) does not have a constant behavior
over time, usually the Weibull distribution is tried:

$$f(t) = \beta\alpha(\alpha t)^{\beta-1} e^{-(\alpha t)^{\beta}} \qquad (9.13)$$

$$R(t) = e^{-(\alpha t)^{\beta}} \qquad (9.14)$$

$$\lambda(t) = \beta\alpha(\alpha t)^{\beta-1} \qquad (9.15)$$

where:
α is the scale parameter
β is the shape parameter
f(t) is the probability density function of the failure
R(t) is the reliability function
λ(t) is the failure rate

The Weibull distribution can model data regardless of whether the failure rate is
increasing (β > 1), decreasing (β < 1), or constant (β = 1). The Weibull distribution
is flexible and can be adapted to a wide variety of data.
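
A small sketch of Equations 9.13 through 9.15 follows, using the same parameterization (α multiplies t). The parameter values are hypothetical. It also checks the identity f(t) = λ(t)·R(t) and shows how the shape parameter controls the trend of the failure rate.

```python
from math import exp


def weibull_pdf(t, alpha, beta):
    """Probability density f(t), Equation 9.13."""
    return beta * alpha * (alpha * t) ** (beta - 1) * exp(-((alpha * t) ** beta))


def weibull_reliability(t, alpha, beta):
    """Reliability function R(t), Equation 9.14."""
    return exp(-((alpha * t) ** beta))


def weibull_failure_rate(t, alpha, beta):
    """Failure rate (hazard) lambda(t), Equation 9.15."""
    return beta * alpha * (alpha * t) ** (beta - 1)


alpha = 0.01  # hypothetical scale parameter, per hour

for beta in (0.5, 1.0, 2.0):
    early = weibull_failure_rate(10.0, alpha, beta)
    late = weibull_failure_rate(100.0, alpha, beta)
    trend = ("constant" if abs(early - late) < 1e-15
             else "increasing" if late > early else "decreasing")
    print(beta, early, late, trend)
```

With β = 1 the failure rate reduces to the constant α, which is the exponential case covered by IEC 60605-4.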
The standard contemplates the Weibull distribution with two and three param-
eters, graphical methods, and goodness of fit. It also includes a section for the inter-
pretation of the resulting probability graph.
It  develops computational methods for the point estimation of parameters by
means of maximum likelihood estimation (MLE), confidence intervals, as well as
the Weibayes approach, and the “sudden death” method.

• IEC 61710:2013: Power law model – Goodness-of-fit tests and estimation
methods (Edition 2.0) specifies procedures to estimate the parameters of the power
law model, to provide confidence intervals for the failure intensity, to provide
prediction intervals for the times to future failures, and to test the
goodness-of-fit of the power law model to data from repairable items. It is assumed
that the time to failure data have been collected from an item or some identical
items operating under the same conditions (e.g., environment and load).

This standard develops the statistical procedure for an NHPP by means of the PLP and
allows estimating the value of the failure intensity z(t) for tests of one or more
repairable items in tests terminated by time or by failure. It also allows the
estimation of z(t) in tests for groups of failures in time intervals.
This standard must be applied as a complement to IEC 60605-6: if the U-test
rejects the hypothesis of constant failure intensity, there is a trend (increasing or
decreasing failure intensity) and the PLP may be applicable:

$$E[N(t)] = \lambda t^{\beta} \qquad (9.16)$$

Then z(t) can be calculated as follows:

$$z(t) = \frac{d}{dt}E[N(t)] = \lambda\beta t^{\beta-1} \qquad (9.17)$$

with:
E[N(t)] is the expected accumulated number of failures up to time t
λ is the scale parameter
β is the shape parameter

The methods of estimating z(t) differ according to the type of test carried out:

• One or more repairable items observed over the same time period (the statistics
of Section 7.2.1 of the standard are applied)
• Multiple repairable items observed in different time intervals (the statistics
of Section 7.2.2 of the standard are applied)
• Groups of failures in time intervals (the statistics of Section 7.2.3 of the
standard are applied)

For one or multiple repairable items observed over the same period of time
(Section 7.2.1), the following summation is calculated:

$$S_1 = \sum_{j=1}^{N} \ln\left(\frac{T^*}{t_j}\right); \text{ for tests terminated by time} \qquad (9.18)$$

$$S_2 = \sum_{j=1}^{N} \ln\left(\frac{t_N}{t_j}\right); \text{ for tests terminated by failure} \qquad (9.19)$$

with:
T* is the total time of the test completed by time
tN is the total time of the test completed by failure
tj is the cumulative test time at the jth failure

The unbiased estimate of the shape parameter β is calculated:

$$\hat{\beta} = \frac{N-1}{S_1}; \text{ for tests terminated by time} \qquad (9.20)$$

$$\hat{\beta} = \frac{N-2}{S_2}; \text{ for tests terminated by failure} \qquad (9.21)$$

The estimate of the scale parameter λ is calculated:

$$\hat{\lambda} = \frac{N}{k\,(T^*)^{\hat{\beta}}}; \text{ for tests terminated by time} \qquad (9.22)$$

$$\hat{\lambda} = \frac{N}{k\,(t_N)^{\hat{\beta}}}; \text{ for tests terminated by failure} \qquad (9.23)$$

with:
N is the total number of failures accumulated in test
k is the total number of test items

The failure intensity z(t), therefore, according to the PLP is:

$$\hat{z}(t) = \hat{\lambda}\hat{\beta}\,t^{\hat{\beta}-1} \qquad (9.24)$$
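
The estimation procedure of Equations 9.18 through 9.24 can be sketched for items observed over the same period. The function names and the failure times are hypothetical; passing `t_star` selects the time-terminated formulas, otherwise the last failure time is taken as the end of the test.

```python
from math import log


def plp_estimates(failure_times, k=1, t_star=None):
    """Estimate (lambda, beta) of the power law process.

    failure_times: cumulative times of all N failures, in increasing order.
    k: number of items observed over the same period.
    t_star: end of a time-terminated test; None for a failure-terminated test.
    """
    n = len(failure_times)
    if t_star is not None:
        t_end = t_star
        s = sum(log(t_end / t) for t in failure_times)   # S1, Equation 9.18
        beta = (n - 1) / s                                # Equation 9.20
    else:
        t_end = failure_times[-1]
        s = sum(log(t_end / t) for t in failure_times)   # S2, Equation 9.19
        beta = (n - 2) / s                                # Equation 9.21
    lam = n / (k * t_end ** beta)                         # Equations 9.22 and 9.23
    return lam, beta


def failure_intensity(t, lam, beta):
    """Equation 9.24: estimated failure intensity at time t."""
    return lam * beta * t ** (beta - 1)


# Hypothetical cumulative failure times (hours), one item, test stopped at 1000 h
times = [50, 120, 210, 330, 480, 700]
lam_hat, beta_hat = plp_estimates(times, k=1, t_star=1000.0)

print(lam_hat, beta_hat, failure_intensity(1000.0, lam_hat, beta_hat))
```

An estimated β below 1 indicates a decreasing failure intensity (reliability growth), while a value above 1 indicates deterioration.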

For multiple repairable items observed in different time intervals (Section 7.2.2),
the shape parameter β is calculated iteratively from:

$$\frac{N}{\hat{\beta}} + \sum_{i=1}^{N}\ln t_i - N\,\frac{\sum_{j=1}^{k} T_j^{\hat{\beta}}\ln T_j}{\sum_{j=1}^{k} T_j^{\hat{\beta}}} = 0 \qquad (9.25)$$

And the estimate of the scale parameter λ is calculated:

$$\hat{\lambda} = \frac{N}{\sum_{j=1}^{k} T_j^{\hat{\beta}}} \qquad (9.26)$$

with:
N is the total number of failures accumulated in the test
k is the total number of items
ti is time to the ith failure (i = 1, 2, …, N)
Tj is the total time of observation for item j = 1, 2, …, k

The goodness-of-fit test given in IEC 61710 (2013) is the Cramér–von Mises statistic
C2, with M = N and T = T* for testing completed based on time, and M = N − 1 and
T = T N for tests completed to failure:

2
M
 t j  β  2 j − 1  

1
C =
2
+   −    (9.27)
12 M j =1  T   2 M  

A critical value $C^2_{0.90}(M)$ is selected from the tabulated values at a significance level of 10 percent. If $C^2$ exceeds the critical value, $C^2 > C^2_{0.90}(M)$, then the hypothesis that the PLP model fits the test data must be rejected.
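The statistic of Equation 9.27 is straightforward to compute once β has been estimated. The sketch below uses hypothetical names; the critical value is not computed here and must still be read from the table in the standard.

```python
def cramer_von_mises(failure_times, beta, total_time, time_terminated=True):
    """Cramer-von Mises statistic C^2 of Equation 9.27.

    total_time is T* for a test completed on time, or t_N (the last
    failure time) for a test completed to failure.
    """
    times = sorted(failure_times)
    # M = N for time-terminated tests, M = N - 1 for failure-terminated tests.
    m = len(times) if time_terminated else len(times) - 1
    c2 = 1.0 / (12 * m)
    for j in range(1, m + 1):
        c2 += ((times[j - 1] / total_time) ** beta - (2 * j - 1) / (2 * m)) ** 2
    return c2

# Hypothetical check: failure times placed exactly at the expected
# quantiles give the smallest possible value, 1 / (12 M).
print(cramer_von_mises([1.0, 3.0, 5.0, 7.0], beta=1.0, total_time=8.0))
```

The computed value would then be compared with the tabulated $C^2_{0.90}(M)$ to decide whether the PLP hypothesis is rejected.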
In  Annex C of IEC 61710 of 2013 a Bayesian estimate for PLP is included.
The methods in the main body of this standard are based on the classical approach to statistical estimation. This means that the PLP parameters λ and β are assumed to be fixed but unknown, and a classical method such as maximum likelihood is used to estimate both parameters from the observed accumulated times to failure of a repairable item or items.
An alternative is the Bayesian approach, which treats the PLP parameters λ and β as unobserved random variables. This affects the stages of the estimation process. A Bayesian approach to estimating the PLP can be summarized in the following steps:

1. Choose a probability distribution that reflects the degree of knowledge of each of the parameters, λ and β, before collecting any data. This distribution is called the a priori distribution.
2. Collect the observed data of the accumulated failure times for the repairable items in question.
3. Estimate the PLP parameters from the a posteriori distribution, which is calculated using Bayes' theorem and reflects what is known about the parameters after observing the data.

• IEC 61650:1997: Reliability data analysis techniques - Procedures for comparison of two constant failure rates and two constant failure (event) intensities (Edition 1.0) specifies procedures to compare two observed failure rates, failure intensities, or rates/intensities of relevant events. The procedures are used to determine whether an apparent difference between the two sets of observations can be considered statistically significant. Numerical methods and a graphical procedure are prescribed. Simple practical examples are provided to illustrate how the procedures can be applied.

9.9 CONCLUSIONS
The IEC standards published in the field of reliability provide maintenance engineers
with tools, procedures, and methods to deal with a large part of the management and
control activities that they have to develop, in a standardized and auditable manner and
that have the support from official, business, and scientific community organizations.
However, these standards are not being used systematically or widely by maintenance organizations, nor are they cross-referenced in indexed scientific publications.
It is estimated that this lack of knowledge and use may have part of its origin in
the form of organization and structuring of the contents published in the different
standards, which hinder their understanding and practical application.
In this chapter an attempt has been made to classify and present the 57 published
standards grouped according to their main field of application to improve their dis-
semination, understanding, and orientation of use:

• Management procedures: They are standards for the management of different maintenance activities: design, life cycle, maintainability, logistics, risk, etc.
• Establishment of requirements: These standards develop procedures for
the specification of design requirements for systems and equipment such as
reliability, maintainability, availability, etc.
• Test methods: They are norms that present the procedures for the design
and application of different tests in order to evaluate the behavior of the
systems and equipment in operation.
• Method selection: These standards establish metrics for the selection of the
most appropriate method for evaluating the reliability of a system or equipment.
• Reliability evaluation methods: They are the standards that develop each
one of the methods for the evaluation of the reliability of a system or equip-
ment, with a different estimation approach.
• Statistical methods for the evaluation of reliability: They are the standards
developed by the statistical method for evaluating the reliability of repair-
able and non-repairable systems or equipment (these standards must be
applied jointly and in an integrated manner).

Researchers, organizations, and maintenance professionals are encouraged to use these standards as procedures or as a reference guide, particularly in the field of reliability assessment, because they provide mathematical methods and metrics that have the consensus and support of international standardization bodies.

REFERENCES
IEC/ISO 31010:2009 Edition 1.0, Risk Management: Risk Assessment Techniques,
International Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 60300-1:2014 Edition 3.0, Dependability Management: Part 1: Guidance for
Management and Application, International Electrotechnical Commission (IEC),
Geneva, Switzerland.
IEC 60300-3-1:2003 Edition 2.0, Dependability Management: Part 3-1: Application Guide—
Analysis Techniques for Dependability—Guide on Methodology, International
Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 60300-3-2:2004 Edition 2.0, Dependability Management: Part 3-2: Application Guide—
Collection of Dependability Data from the Field, International Electrotechnical
Commission (IEC), Geneva, Switzerland.
IEC 60300-3-3:2017 Edition 3.0, Dependability Management: Part 3-3: Application
Guide—Life Cycle Costing, International Electrotechnical Commission (IEC), Geneva,
Switzerland.
IEC 60300-3-4:2007 Edition 2.0, Dependability Management: Part 3-4: Application
Guide—Guide to the Specification of Dependability Requirements, International
Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 60300-3-5:2001 Edition 1.0, Dependability Management: Part 3-5: Application
Guide—Reliability Test Conditions and Statistical Test Principles, International
Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 60300-3-10:2001 Edition 1.0, Dependability Management: Part 3-10: Application
Guide—Maintainability, International Electrotechnical Commission (IEC), Geneva,
Switzerland.
IEC 60300-3-11:2009 Edition 2.0, Dependability Management: Part 3-11: Application
Guide—Reliability Centred Maintenance, International Electrotechnical Commission
(IEC), Geneva, Switzerland.
IEC 60300-3-12:2011 Edition 2.0, Dependability Management: Part 3-12: Application
Guide—Integrated Logistic Support, International Electrotechnical Commission
(IEC), Geneva, Switzerland.
IEC 60300-3-14:2004 Edition 1.0, Dependability Management: Part 3-14: Application
Guide—Maintenance and Maintenance Support, International Electrotechnical
Commission (IEC), Geneva, Switzerland.
IEC 60300-3-15:2009 Edition 1.0, Dependability Management: Part 3-15: Application Guide—
Engineering of System Dependability, International Electrotechnical Commission (IEC),
Geneva, Switzerland.
IEC 60300-3-16:2008 Edition 1.0, Dependability Management: Part 3-16: Application
Guide—Guidelines for Specification of Maintenance Support Services, International
Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 60319:1999 Edition 3.0, Presentation and Specification of Reliability Data for Electronic
Components, International Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 60605-2:1994 Edition 1.0, Equipment Reliability Testing: Part 2: Design of Test Cycles,
International Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 60605-4:2001 Edition 2.0, Equipment Reliability Testing: Part 4: Statistical Procedures
for Exponential Distribution—Point Estimates, Confidence Intervals, Prediction
Intervals and Tolerance Intervals, International Electrotechnical Commission (IEC),
Geneva, Switzerland.
IEC 60605-6:2007 Edition 3.0, Equipment Reliability Testing: Part 6: Tests for the Validity
and Estimation of the Constant Failure Rate and Constant Failure Intensity,
International Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 60706-2:2006 Edition 2.0, Maintainability of Equipment: Part 2: Maintainability
Requirements and Studies During the Design and Development Phase, International
Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 60706-3:2006 Edition 2.0, Maintainability of Equipment: Part 3: Verification and
Collection, Analysis and Presentation of Data, International Electrotechnical
Commission (IEC), Geneva, Switzerland.
IEC 60706-5:2007 Edition 2.0, Maintainability of Equipment: Part 5: Testability and
Diagnostic Testing, International Electrotechnical Commission (IEC), Geneva,
Switzerland.
IEC 60812:2006 Edition 2.0, Analysis Techniques for System Reliability—Procedure for
Failure Mode and Effects Analysis (FMEA), International Electrotechnical Commission
(IEC), Geneva, Switzerland.
IEC 61014:2003 Edition 2.0, Programmes for Reliability Growth, International Electrotechnical
Commission (IEC), Geneva, Switzerland.

IEC 61025:2006 Edition 2.0, Fault Tree Analysis (FTA), International Electrotechnical
Commission (IEC), Geneva, Switzerland.
IEC 61070:1991 Edition 1.0, Compliance Test Procedures for Steady-State Availability,
International Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 61078:2016 Edition 3.0, Reliability Block Diagrams, International Electrotechnical
Commission (IEC), Geneva, Switzerland.
IEC 61123:1991 Edition 1.0, Reliability Testing: Compliance Test Plans for Success Ratio,
International Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 61124:2012 Edition 3.0, Reliability Testing: Compliance Tests for Constant Failure Rate
and Constant Failure Intensity, International Electrotechnical Commission (IEC),
Geneva, Switzerland.
IEC 61160:2005 Edition 2.0, Design Review, International Electrotechnical Commission
(IEC), Geneva, Switzerland.
IEC 61163-1:2006 Edition 2.0, Reliability Stress Screening: Part 1: Repairable Assemblies
Manufactured in Lots, International Electrotechnical Commission (IEC), Geneva,
Switzerland.
IEC 61163-2:1998 Edition 1.0, Reliability Stress Screening: Part 2: Electronic Components,
International Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 61164:2004 Edition 2.0, Reliability Growth: Statistical Test and Estimation Methods,
International Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 61165:2006 Edition 2.0, Application of Markov Techniques, International Electrotechnical
Commission (IEC), Geneva, Switzerland.
IEC 61649:2008 Edition 2.0, Weibull Analysis, International Electrotechnical Commission
(IEC), Geneva, Switzerland.
IEC 61650:1997 Edition 1.0, Reliability Data Analysis Techniques: Procedures for
Comparison of Two Constant Failure Rates and Two Constant Failure (event)
Intensities, International Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 61703:2016 Edition 2.0, Mathematical Expressions for Reliability, Availability,
Maintainability and Maintenance Support Terms, International Electrotechnical
Commission (IEC), Geneva, Switzerland.
IEC 61709:2017 Edition 3.0, Electric Components: Reliability: Reference Conditions for
Failure Rates and Stress Models for Conversion, International Electrotechnical
Commission (IEC), Geneva, Switzerland.
IEC 61710:2013 Edition 2.0, Power Law Model: Goodness-of-fit Tests and Estimation
Methods, International Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 61882:2016 Edition 2.0, Hazard and Operability Studies (HAZOP studies): Application
Guide, International Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 61907:2009 Edition 1.0, Communication Network Dependability Engineering,
International Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 62198:2013 Edition 2.0, Managing Risk in Projects: Application Guidelines, International
Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 62308:2006 Edition 1.0, Equipment Reliability: Reliability Assessment Methods,
International Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 62309:2004 Edition 1.0, Dependability of Products Containing Reused Parts:
Requirements for Functionality and Tests, International Electrotechnical Commission
(IEC), Geneva, Switzerland.
IEC 62347:2006 Edition 1.0, Guidance on System Dependability Specifications, International
Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 62402:2007 Edition 1.0, Obsolescence Management: Application Guide, International
Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 62429:2007 Edition 1.0, Reliability Growth: Stress Testing for Early Failures in Unique
Complex Systems, International Electrotechnical Commission (IEC), Geneva, Switzerland.

IEC 62502:2010  Edition 1.0, Analysis Techniques for Dependability: Event Tree Analysis
(ETA), International Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 62506:2013 Edition 1.0, Methods for Product Accelerated Testing, International
Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 62508:2010 Edition 1.0, Guidance on Human Aspects of Dependability, International
Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 62550:2017 Edition 1.0, Spare Parts Provisioning, International Electrotechnical
Commission (IEC), Geneva, Switzerland.
IEC 62551:2012 Edition 1.0, Analysis Techniques for Dependability: Petri Net Techniques,
International Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 62628:2012 Edition 1.0, Guidance on Software Aspects of Dependability, International
Electrotechnical Commission (IEC), Geneva, Switzerland.
IEC 62673:2013 Edition 1.0, Methodology for Communication Network Dependability
Assessment and Assurance, International Electrotechnical Commission (IEC), Geneva,
Switzerland.
IEC 62740:2015 Edition 1.0, Root Cause Analysis (RCA), International Electrotechnical
Commission (IEC), Geneva, Switzerland.
IEC 62741:2015 Edition 1.0, Demonstration of Dependability Requirements:
The Dependability Case, International Electrotechnical Commission (IEC), Geneva,
Switzerland.
IEC TS 62775:2016 Edition 1.0, Application Guidelines: Technical and Financial Processes
for Implementing Asset Management Systems, International Electrotechnical
Commission (IEC), Geneva, Switzerland.
IEC 62853:2018 Edition 1.0, Open Systems Dependability, International Electrotechnical
Commission (IEC), Geneva, Switzerland.
IEC TR 63039:2016 Edition 1.0, Probabilistic Risk Analysis of Technological Systems:
Estimation of Final Event Rate at a Given Initial State, International Electrotechnical
Commission (IEC), Geneva, Switzerland.
10 Time-Variant Reliability
Analysis Methods for
Dynamic Structures
Zhonglai Wang and Shui Yu

CONTENTS
10.1 Introduction
10.2 Time-Variant Reliability
10.3 The Proposed Three Time-Variant Reliability Analysis Methods
10.3.1 Failure Processes Decomposition Method
10.3.1.1 The FMTP Search for the Time-Variant Limit State Function
10.3.1.2 Failure Processes Decomposition Based on Taylor Expansion
10.3.1.3 Failure Processes Decomposition Based on Case Classification
10.3.1.4 Kernel Density Estimation Method for the Decomposed Model
10.3.2 The Combination of the Extreme Value Moment and Improved Maximum Entropy (EVM-IME) Method
10.3.2.1 Determination of the Extreme Value Moments by the Sparse Grid Technique
10.3.2.2 The Improved Maximum Entropy Method Based on the Raw Moments
10.3.3 Probability Density Function of the First-Passage Time Point Method
10.3.3.1 Time-Variant Reliability Model Based on PDF of F-PTP
10.3.3.2 Establish f(τ) by Using the Maximum Entropy Method Combined with the Moment Method
10.4 Examples and Discussions
10.4.1 Numerical Example
10.4.2 A Corroded Simple Supported Beam Under Random Loadings
10.5 Conclusions
Appendix: Discretization of Random Processes
References


10.1 INTRODUCTION
Reliability analysis aims to estimate the probability that products perform their intended function under the specified working conditions during their lifecycle. For highly reliable products, it is difficult to collect enough data to conduct reliability analysis using statistics-based methods. Because it starts from the failure mechanism of the product, the physics-based method is a proper choice for reliability analysis with insufficient data. Traditional physics-based static (time-invariant) reliability analysis methods have been developed extensively, such as the First Order Reliability Method (FORM) [1], the Second Order Reliability Method (SORM) [2], the moment-based method [3], and surrogate models [4]; these only consider the static performance or simplify the dynamic performance to a static one. For most products, the performance is usually dynamic because of various time-varying loadings, working conditions, and inherent motion. Time-invariant reliability analysis methods have shown poor capability in satisfying the accuracy requirements for time-varying and highly nonlinear performance functions of products [5]. Such engineering requirements have therefore fostered the development of several time-variant reliability analysis methods.
Time-variant reliability analysis aims to estimate the probability that products successfully complete the intended function during a given time interval. There are typically two categories of time-variant reliability analysis methods: simulation and analytical. Typical analytical time-variant reliability analysis methods include the Gamma process method [6], extreme value method [7], composite limit state method [8], compound random processes method [9], and crossing-rate based methods [10,11]. In the Gamma process, extreme value, and compound random processes methods, system parameters or performance functions are usually assumed to follow a certain distribution, and this model approximation can produce a high model error. When handling highly nonlinear limit state functions, the computational accuracy of the composite limit state method may be unsatisfactory. After the crossing-rate method was first proposed [10,11], many crossing-based methods were developed further: e.g., the differential Gaussian process method [12], the rectangular wave renewal process method [13], the Laplace integration method [14], the PHI2 method [15], and the PHI2+ method [16]. The differential Gaussian process method, rectangular wave renewal process method, and Laplace integration method are suitable mainly for the crossing-rate calculation of some specific random processes. The PHI2 and PHI2+ methods, which build on the crossing-rate method, use the parallel reliability framework to improve the computational accuracy and further broaden the application range of the crossing-rate methods. However, the PHI2 and PHI2+ methods show lower computational accuracy when dealing with the time-variant reliability analysis of non-monotonic systems [16].
The other branch of time-variant reliability analysis is the simulation methods. The typical simulation methods are Monte Carlo simulation (MCS), importance sampling (IS), and subset simulation (SS). MCS is a direct and easy-to-use method, regardless of the dimension and nonlinearity of the limit state function, but its computational cost is usually prohibitive for high-reliability estimation of complicated engineering systems with an implicit limit state function. The combination of MCS and analytical methods could improve the computational efficiency [17]. The IS technique can improve the computational efficiency by introducing an importance density function, but difficulties exist in acquiring prior information on the failure domain and determining the proper importance sampling density [18]. The SS method is another branch of the simulation methods that transforms a small failure probability into the product of some larger conditional failure probabilities. Subset simulation with Markov Chain Monte Carlo (SS/MCMC) and subset simulation with splitting (SS/S) are the two typical sampling branches [19]. However, the nonlinearity of the limit state function affects the computational accuracy [20].
In this chapter, the expression of the time-variant reliability is described in Section 10.2. Three time-variant reliability analysis methods are elaborated in Section 10.3. In Section 10.4, two examples are used to illustrate and compare the proposed methods. Conclusions are provided in Section 10.5.

10.2  TIME-VARIANT RELIABILITY


Time-variant reliability is defined as the probability that products complete their
intended performance under the practical working conditions during the given time
interval. The typical expression of the time-variant reliability is provided as:

$$R_t(t_{lb}, t_{ub}) = \Pr\left\{ g\left(\mathbf{d}, \mathbf{X}, \mathbf{Y}(t), t\right) > 0, \ \forall t \in [t_{lb}, t_{ub}] \right\} \tag{10.1}$$

where:
g(•) is the time-variant limit state function for a certain structure
d denotes the vector of deterministic design variables
X defines the vector of random design variables and parameters
Y(t) expresses the vector of time-variant random design variables and parameters, which are actually stochastic processes
tlb and tub are lower and upper boundaries of the time interval

When Y(t) is a stochastic process with a known autocorrelation function, the stochastic process can be decomposed into the general stochastic processes Y(N, t) with the stochastic process discretization method [21], where N = [N1, ..., Nr] is a vector of independent standard normal random variables. The decomposed process of a scalar Gaussian process with the mean value m(t), standard deviation σ(t), and exponential autocorrelation coefficient function ρ(t1, t2) is provided in the appendix to this chapter. Therefore, the time-variant reliability can be rewritten as:

Rt ( tlb , tub ) = Pr { g ( d, Z, t ) > 0, ∀t ∈ [tlb , tub ]} (10.2)

where Z = [ X, N ].
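As a point of reference for Equation 10.2, a crude Monte Carlo estimate can be sketched by checking the limit state function on a discrete time grid for each sampled realization of Z. This is a generic baseline and not one of the three methods proposed in this chapter; the limit state function, the sampler, and all names below are hypothetical.

```python
import random

def mcs_time_variant_reliability(g, sample_z, t_lb, t_ub,
                                 n_samples=20000, n_steps=50, seed=0):
    """Estimate Pr{g(Z, t) > 0 for all t in [t_lb, t_ub]} on a time grid."""
    rng = random.Random(seed)
    grid = [t_lb + (t_ub - t_lb) * i / n_steps for i in range(n_steps + 1)]
    safe = 0
    for _ in range(n_samples):
        z = sample_z(rng)
        # A realization is safe only if g stays positive at every grid point.
        if all(g(z, t) > 0 for t in grid):
            safe += 1
    return safe / n_samples

# Hypothetical limit state: g(Z, t) = Z1 - t with Z1 standard normal,
# so the structure is safe on [0, 1] only if Z1 > 1.
def g_demo(z, t):
    return z[0] - t

def sample_demo(rng):
    return [rng.gauss(0.0, 1.0)]

r_hat = mcs_time_variant_reliability(g_demo, sample_demo, 0.0, 1.0)
print(r_hat)
```

The time grid is itself an approximation: a failure that occurs only between two grid points is missed, so the grid must be fine enough for the process at hand.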
To simplify the computational process, normalization is conducted for the time interval $[t_{lb}, t_{ub}]$:

$$T = \frac{t - t_{lb}}{t_{ub} - t_{lb}} \tag{10.3}$$

Since t ∈ [tlb , tub ], T belongs to [ 0,1] in Equation  10.3. With the normalization in
Equation 10.3, the expression of the time-variant reliability in Equation (10.2) can
be rewritten as:

$$R_T(0,1) = \Pr\left\{ g^T(\mathbf{d}, \mathbf{Z}, T) > 0, \ \forall T \in [0,1] \right\} \tag{10.4}$$

10.3 THE PROPOSED THREE TIME-VARIANT RELIABILITY ANALYSIS METHODS
In this section, three developed time-variant reliability analysis methods are elaborated. The first method is called the failure processes decomposition (FPD) method, which has an advantage in handling time-variant reliability problems with high-order time parameters [21]. The second method is based on the combination of the extreme value moment and improved maximum entropy (EVM-IME), which can effectively deal with time-variant reliability problems with multiple failure modes and temporal parameters [7]. The third method is based on the probability density function (PDF) estimation of the first-passage time point (P-FTP).

10.3.1 Failure Processes Decomposition Method


In this subsection, the procedure of the time-variant reliability analysis based on the FPD method is illustrated. For more details, please refer to [21]. First, the time point where the mean value of the limit state function is minimal (FMTP) is searched. With the acquired FMTP, the time-variant limit state function with high-order temporal parameters is then transformed to a quadratic function of time, which is called the first-stage failure processes decomposition. Based on the property of the quadratic function and the reliability criterion, the time-variant reliability is transformed to the time-invariant system reliability, which is called the second-stage FPD. Finally, the kernel density estimation (KDE) method is implemented to calculate the time-invariant system reliability. For a clear illustration, the flowchart of the FPD method is provided in Figure 10.1.

10.3.1.1  The FMTP Search for the Time-Variant Limit State Function


The expression of the time-variant reliability based on the extreme value theory can
be given by:

$$R_T(0,1) = \Pr\left\{ g^T_{\min}(\mathbf{d}, \mathbf{Z}, T) > 0, \ T \in [0,1] \right\} \tag{10.5}$$

FIGURE 10.1  Flowchart of the FPD method.

For a trajectory of the stochastic process representing the time-variant uncertain limit state function, $g^T(d, Z, T)$ is actually a deterministic function of $T$. The minimal value $g^T_{\min}(d, Z, T)$ can be achieved when $T = T^*$. The time point $T^*$ where the mean value of the limit state function $g^T(\mathbf{d}, \mathbf{Z}, T)$ is minimal can be searched with the following optimization model, and $T^*$ is the so-called FMTP.

$$\begin{cases} \text{find: } T \\ \text{minimize: } g^T(\mathbf{d}, \mathbf{Z}, T) \\ \text{subject to: } T \in [0,1] \end{cases} \tag{10.6}$$
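The one-dimensional search of Equation 10.6 can be sketched as a coarse grid scan followed by a golden-section refinement. The chapter does not state which optimizer is used, so this choice is an assumption; `g_mean` is a hypothetical callable that returns the limit state function evaluated at the mean values of Z for a given normalized time T.

```python
import math

def find_fmtp(g_mean, n_grid=200, n_refine=40):
    """Locate T* in [0, 1] that minimizes g_mean (illustrative search only)."""
    # Coarse scan guards against picking a purely local minimum.
    best_i = min(range(n_grid + 1), key=lambda i: g_mean(i / n_grid))
    lo = max(0.0, (best_i - 1) / n_grid)
    hi = min(1.0, (best_i + 1) / n_grid)
    # Golden-section refinement inside the bracketing cell.
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(n_refine):
        x1 = hi - inv_phi * (hi - lo)
        x2 = lo + inv_phi * (hi - lo)
        if g_mean(x1) < g_mean(x2):
            hi = x2
        else:
            lo = x1
    return 0.5 * (lo + hi)

# Hypothetical mean limit state function with its minimum at T = 0.3.
t_star = find_fmtp(lambda T: (T - 0.3) ** 2 + 1.0)
print(t_star)
```

For an expensive, implicit limit state function the grid size would have to be reduced, at the price of a higher risk of missing the global minimum.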


10.3.1.2  Failure Processes Decomposition Based on Taylor Expansion


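In the first-stage FPD, the second-order Taylor expansion is performed for the time-variant limit state function at the FMTP:

$$\begin{aligned} g^T(\mathbf{d}, \mathbf{Z}, T) &\approx g^T(\mathbf{d}, \mathbf{Z}, T^*) + \left.\frac{\partial g^T}{\partial T}\right|_{T=T^*} (T - T^*) + \frac{1}{2} \left.\frac{\partial^2 g^T}{\partial T^2}\right|_{T=T^*} (T - T^*)^2 \\ &= aT^2 + bT + c \end{aligned} \tag{10.7}$$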

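where

$$a = \frac{1}{2} \left.\frac{\partial^2 g^T}{\partial T^2}\right|_{T=T^*}$$

$$b = \left.\frac{\partial g^T}{\partial T}\right|_{T=T^*} - T^* \left.\frac{\partial^2 g^T}{\partial T^2}\right|_{T=T^*}$$

$$c = g^T(\mathbf{d}, \mathbf{Z}, T^*) - T^* \left.\frac{\partial g^T}{\partial T}\right|_{T=T^*} + \frac{T^{*2}}{2} \left.\frac{\partial^2 g^T}{\partial T^2}\right|_{T=T^*}$$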
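When analytical derivatives are not available, the coefficients a, b, and c of Equation 10.7 can be approximated for one realization of Z with central finite differences. This is a sketch under that assumption; the step size, the function name, and the example trajectory are hypothetical, and the trajectory is assumed to be evaluable slightly outside [0, 1] if T* lies on a boundary.

```python
def quadratic_coefficients(g, t_star, h=1e-3):
    """Coefficients of a*T**2 + b*T + c from a Taylor expansion of g at t_star."""
    g0 = g(t_star)
    d1 = (g(t_star + h) - g(t_star - h)) / (2.0 * h)         # first derivative
    d2 = (g(t_star + h) - 2.0 * g0 + g(t_star - h)) / h**2   # second derivative
    a = 0.5 * d2
    b = d1 - t_star * d2
    c = g0 - t_star * d1 + 0.5 * t_star**2 * d2
    return a, b, c

# Hypothetical trajectory that is already quadratic, so the expansion
# should recover its coefficients (2, -3, 5) up to rounding error.
a, b, c = quadratic_coefficients(lambda T: 2.0 * T**2 - 3.0 * T + 5.0, t_star=0.4)
print(a, b, c)
```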

When the second derivative of the approximate limit state function with respect to $T$ equals 0, the approximate limit state function is a monotonic function of $T$. Therefore, $g^T_{\min}(\mathbf{d}, \mathbf{Z}, T) = g^T(\mathbf{d}, \mathbf{Z}, 0)$ or $g^T_{\min}(\mathbf{d}, \mathbf{Z}, T) = g^T(\mathbf{d}, \mathbf{Z}, 1)$, and the reliability can be obtained for this case:

$$R_T(0,1) = \Pr\left\{ \min\left[ g^T(\mathbf{d}, \mathbf{Z}, 0), \ g^T(\mathbf{d}, \mathbf{Z}, 1) \right] > 0 \right\} \tag{10.8}$$

10.3.1.3  Failure Processes Decomposition Based on Case Classification


Since the trajectory $g^T(d, Z, T)$ of the stochastic process $g^T(\mathbf{d}, \mathbf{Z}, T)$ is a quadratic function of $T$, the stochastic process $g^T(\mathbf{d}, \mathbf{Z}, T)$ could be considered as a collection of quadratic functions. According to the property of a quadratic function and the safety criterion, three cases are classified to represent the safety of the structure in the second-stage FPD, shown in Figure 10.2a–c, respectively.
In Figure 10.2, the green curve represents the shape of the quadratic function $g^T(\mathbf{d}, \mathbf{Z}, T)$, the dashed red line is the symmetric axis $T_s = -\frac{b}{2a}$, and the green dots denote $T = 0$ and $T = 1$, respectively. The corresponding properties for the three cases are provided in Table 10.1.
For Case 1, Case 2, and Case 3, three events C1, C2, and C3 are provided:

$$C_1 = \left\{ T_s < 0, \ g^T(\mathbf{d}, \mathbf{Z}, 0) > 0, \ g^T(\mathbf{d}, \mathbf{Z}, 1) > 0 \right\} = \left\{ -\frac{b}{2a} < 0, \ c > 0, \ a + b + c > 0 \right\} \tag{10.9}$$

FIGURE  10.2  The  geometrical relationship between g T ( d, Z, T ) and T. (a) case 1 of the
safety situation, (b) case 2 of the safety situation, (c) case 3 of the safety situation.

TABLE 10.1
Properties for the Three Cases

Cases     Position of Ts     Location of Minimum Point
Case 1    Ts < 0             T = 0 or 1
Case 2    0 ≤ Ts < 1         T = 0 or Ts or 1
Case 3    Ts ≥ 1             T = 0 or 1
$$C_2 = \left\{ 0 < -\frac{b}{2a} < 1, \ c - \frac{b^2}{4a} > 0, \ c > 0, \ a + b + c > 0 \right\} \tag{10.10}$$

$$C_3 = \left\{ -\frac{b}{2a} > 1, \ c > 0, \ a + b + c > 0 \right\} \tag{10.11}$$

Because the three events $C_1 \sim C_3$ are mutually exclusive, the PDF of the time-invariant system reliability transformed from the time-variant reliability can be expressed by:

$$f(\varsigma) = f_1(\varsigma) + f_2(\varsigma) + f_3(\varsigma) \tag{10.12}$$

where $f_1(\varsigma)$, $f_2(\varsigma)$, and $f_3(\varsigma)$ denote the PDF of Case 1, Case 2, and Case 3 occurring, respectively. Therefore, the time-variant reliability can be calculated by the numerical integration:

$$R_T(0,1) = \int_0^{+\infty} \left[ f_1(\varsigma) + f_2(\varsigma) + f_3(\varsigma) \right] d\varsigma \tag{10.13}$$

The KDE method will be employed to calculate the PDF for each case.
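The case classification of Equations 10.9 to 10.11 can be sketched for a single quadratic trajectory as follows. The function name and return labels are hypothetical. The boundary values Ts = 0 and Ts = 1 are assigned following Table 10.1, which is an assumption because Equations 10.10 and 10.11 use strict inequalities; the linear case a = 0 is left to Equation 10.8.

```python
def safe_event(a, b, c):
    """Return 'C1', 'C2' or 'C3' if the matching safe event occurs, else None."""
    if a == 0:
        raise ValueError("a == 0: use the endpoint check of Equation 10.8 instead")
    ts = -b / (2.0 * a)                      # symmetric axis of the quadratic
    ends_safe = c > 0 and a + b + c > 0      # g(0) > 0 and g(1) > 0
    if ts < 0:
        return "C1" if ends_safe else None
    if ts < 1:
        vertex = c - b * b / (4.0 * a)       # value of the quadratic at Ts
        return "C2" if ends_safe and vertex > 0 else None
    return "C3" if ends_safe else None

# Hypothetical coefficient triples (a, b, c).
print(safe_event(1.0, 1.0, 1.0))    # axis left of the interval
print(safe_event(1.0, -1.0, 1.0))   # axis inside the interval
print(safe_event(1.0, -3.0, 3.0))   # axis right of the interval
```

Counting how often each event occurs over many sampled trajectories would give a crude frequency-based check on the KDE-based estimate described next.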

10.3.1.4  Kernel Density Estimation Method for the Decomposed Model


According to the analysis in Section 10.3.1.3, there are five failure modes for the system, which can be summarized as:

$$\begin{cases} g_1 = \dfrac{b}{2a} \\ g_2 = a + b + c \\ g_3 = c \\ g_4 = c - \dfrac{b^2}{4a} \\ g_5 = 1 + \dfrac{b}{2a} \end{cases} \tag{10.14}$$

Because the procedure for calculating the system reliability is similar for each case, Case 1 is taken as an example. The limit state function for the event $C_1$ is:

$$G_{C_1}(\mathbf{Z}) = \min\left[ g_1, g_2, g_3 \right] = \min\left[ \frac{b}{2a}, \ a + b + c, \ c \right] \tag{10.15}$$

$M$ samples are directly drawn from the limit state function $G_{C_1}(\mathbf{Z})$, and the vector of samples can be obtained as $G_{C_1} = \left\{ \zeta_1^{C_1}, \zeta_2^{C_1}, \cdots, \zeta_M^{C_1} \right\}$. Then the PDF $f_{C_1}(\zeta)$ for the event $C_1$ is:

$$f_{C_1}(\zeta) = \frac{1}{M h_{C_1}} \sum_{i=1}^{M} K\left( \frac{\zeta - \zeta_i^{C_1}}{h_{C_1}} \right) \tag{10.16}$$

where:
$K(\cdot)$ is the kernel function; the Gaussian kernel function is considered in this model, i.e., $K(u) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{u^2}{2} \right)$
$h$ is the bandwidth of the kernel function

The bandwidth of the kernel function is important for the prediction accuracy and the optimal value of $h$ is:

$$h_{opt} = \left( \frac{4}{3M} \right)^{0.2} \sigma(G) \tag{10.17}$$

PDF $f_1(\varsigma)$ for Case 1 can be estimated by:

$$f_1(\varsigma) = f_{C_1}(\zeta) = \frac{1}{M h_{C_1}} \sum_{i=1}^{M} K\left( \frac{\zeta - \zeta_i^{C_1}}{h_{C_1}} \right) \tag{10.18}$$

With the same procedure, PDFs are estimated for events $C_2$ and $C_3$ based on the KDE method. Using the estimated PDFs, the time-invariant system reliability is obtained:

$$R_T(0,1) = \int_0^{+\infty} \left[ f_1(\varsigma) + f_2(\varsigma) + f_3(\varsigma) \right] d\varsigma = \frac{1}{M} \sum_{k=1}^{3} \sum_{i=1}^{M} \Phi\left( \frac{\zeta_i^{C_k}}{h_{C_k}} \right) \tag{10.19}$$
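The inner sum of Equation 10.19 for one event follows from integrating the Gaussian kernel estimate of Equation 10.16 over the positive half-line. The sketch below computes that contribution for a single sample set with the bandwidth of Equation 10.17. The names are hypothetical, σ(G) is taken as the sample standard deviation (an assumption), and the way samples are allocated to the events C1 to C3 follows [21] and is not reproduced here.

```python
import math
import statistics

def std_normal_cdf(x):
    """Standard normal CDF, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def silverman_bandwidth(samples):
    """Bandwidth of Equation 10.17 with sigma(G) as the sample standard deviation."""
    m = len(samples)
    return (4.0 / (3.0 * m)) ** 0.2 * statistics.stdev(samples)

def kde_prob_positive(samples):
    """Integral over [0, +inf) of the Gaussian KDE built from the samples."""
    h = silverman_bandwidth(samples)
    return sum(std_normal_cdf(z / h) for z in samples) / len(samples)

# Hypothetical samples of a limit state value: symmetric about zero,
# so half of the estimated probability mass lies above zero.
print(kde_prob_positive([-1.0, 1.0]))
```

Because each kernel is integrated in closed form, no numerical quadrature of the estimated PDF is needed.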

10.3.2 The Combination of the Extreme Value Moment and Improved Maximum Entropy (EVM-IME) Method
In this subsection, the second method, called EVM-IME, is elaborated. For more details, please refer to [7]. The accuracy of the maximum entropy method is first improved by introducing the scaling function. The extreme value moments of the limit state functions are then estimated using the sparse grid stochastic collocation method. The PDF and the corresponding time-variant reliability are finally estimated. The flowchart of the EVM-IME method is provided in Figure 10.3.

FIGURE 10.3  Flowchart of the EVM-IME method.

10.3.2.1 Determination of the Extreme Value Moments by the Sparse Grid Technique
Several numerical approximation methods could be employed for the multi-dimensional integration, such as full factorial numerical integration, the univariate dimension reduction method, and the sparse grid-based stochastic collocation method. The full factorial numerical integration method can obtain accurate results, but it suffers from the curse of dimensionality and is suitable mainly for low-dimensional problems. In the univariate dimension reduction method, a larger error may be produced because the interaction terms are neglected in the additive decomposition process. The sparse grid-based stochastic collocation method can avoid the curse of dimensionality of the full factorial numerical integration method and has higher accuracy than the univariate dimension reduction method. Therefore, the sparse grid-based stochastic collocation method is used in this section. To simplify the computation process, the limit state function is transformed, based on the Rosenblatt transformation, into a function of mutually independent standard normal random variables only.
The time-variant reliability is then expressed as:

$$R_T(0,1) = \Pr\left\{ g^{N}(\mathbf{d}, \mathbf{N}, T) > 0,\ \forall T \in [0,1] \right\} \tag{10.20}$$

where $\mathbf{N} = [N_1, N_2, \ldots, N_k]$ is a vector of mutually independent standard normal random variables. The lth raw moments of the extreme values in the time-variant reliability model can be expressed as:

$$M_l = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} \left\{ g^{N}(\mathbf{d}, \mathbf{N}, T) \right\}^{l} \varphi(x_1) \cdots \varphi(x_k)\, dx_1 \cdots dx_k \tag{10.21}$$

where ϕ (⋅) is the PDF of a standard normal random variable.


Based on the Smolyak algorithm, the sparse grid quadrature method is employed to compute the multivariate integral, and the lth raw moments of the extreme values can be expressed by:

$$M_l = \sum_{\mathbf{i} \in H(q,k)} (-1)^{q+k-|\mathbf{i}|} \binom{k-1}{q+k-|\mathbf{i}|} \times \sum_{j_1=1}^{m_{i_1}} \cdots \sum_{j_k=1}^{m_{i_k}} \left[ g^{N}\!\left(\mathbf{d}, x_{j_1}^{i_1}, \cdots, x_{j_k}^{i_k}, T_{opt}\right) \right]^{l} p_{j_1}^{i_1} \cdots p_{j_k}^{i_k} \tag{10.22}$$

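where the abscissas and weights are $x_{j_i}^{i_i} = \sqrt{2}\,\xi_{j_i}^{i_i}$ and $p_{j_i}^{i_i} = \frac{1}{\sqrt{\pi}}\,\zeta_{j_i}^{i_i}$, and $\xi_{j_i}^{i_i}$ and $\zeta_{j_i}^{i_i}$ are the abscissas and weights in the Gauss-Hermite quadrature formula; $j_i = 1, \ldots, m_{i_i}$; the multi-index $\mathbf{i} = (i_1, \ldots, i_k) \in \mathbb{N}_+^k$; and the set $H(q,k)$ is defined by:

$$H(q,k) = \left\{ \mathbf{i} = (i_1, \ldots, i_k) \in \mathbb{N}_+^k,\ \mathbf{i} \ge 1 : q + 1 \le \sum_{r=1}^{k} i_r \le q + k \right\} \tag{10.23}$$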

The level q affects the computational accuracy of the sparse grid built on $H(q,k)$, and its selection is based on the nonlinearity of the limit state function. To balance computational accuracy and efficiency, $2 \le q \le 4$, with $m_i = 1$ for $i = 1$ and $m_i = 2^{i-1} + 1$ for $i > 1$, is recommended in engineering applications.
In Equation 10.22, the input variable nodes $x_{j_1}^{i_1}, \ldots, x_{j_k}^{i_k}$ are generated by the sparse grid technique. The value $g^{N}(\mathbf{d}, x_{j_1}^{i_1}, \ldots, x_{j_k}^{i_k}, T_{opt}) = \min_{T} g^{N}(\mathbf{d}, x_{j_1}^{i_1}, \cdots, x_{j_k}^{i_k}, T)$ is regarded as an extreme value, and the corresponding optimization model is provided by:

$$\begin{cases} \text{find: } T \\ \text{minimize: } g^{N}\!\left(\mathbf{d}, x_{j_1}^{i_1}, \cdots, x_{j_k}^{i_k}, T\right) \\ \text{subject to: } T \in [0,1] \end{cases} \tag{10.24}$$
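The Smolyak combination in Equations 10.22 and 10.23 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the Gauss-Hermite table is hard-coded for 1, 3, and 5 nodes (so it covers q ≤ 2 only), the test limit state is invented, and a coarse grid search over T stands in for the optimization of Equation 10.24.

```python
import itertools
import math

SQRT_PI = math.sqrt(math.pi)
SQRT_10 = math.sqrt(10.0)

# Gauss-Hermite abscissas xi and weights zeta (weight function exp(-x^2))
# for m = 1, 3, 5 nodes, in closed form; enough for levels i = 1, 2, 3.
GAUSS_HERMITE = {
    1: ([0.0], [SQRT_PI]),
    3: ([-math.sqrt(1.5), 0.0, math.sqrt(1.5)],
        [SQRT_PI / 6.0, 2.0 * SQRT_PI / 3.0, SQRT_PI / 6.0]),
    5: ([-math.sqrt((5 + SQRT_10) / 2), -math.sqrt((5 - SQRT_10) / 2), 0.0,
         math.sqrt((5 - SQRT_10) / 2), math.sqrt((5 + SQRT_10) / 2)],
        [SQRT_PI * (7 - 2 * SQRT_10) / 60, SQRT_PI * (7 + 2 * SQRT_10) / 60,
         SQRT_PI * 8 / 15,
         SQRT_PI * (7 + 2 * SQRT_10) / 60, SQRT_PI * (7 - 2 * SQRT_10) / 60]),
}


def level_rule(level):
    """One-dimensional rule: nodes x = sqrt(2)*xi and weights p = zeta/sqrt(pi)."""
    m = 1 if level == 1 else 2 ** (level - 1) + 1
    xi, zeta = GAUSS_HERMITE[m]
    return [math.sqrt(2.0) * v for v in xi], [w / SQRT_PI for w in zeta]


def smolyak_raw_moment(func, k, q, order=1):
    """Raw moment of the given order of func(N), N being k independent standard normals."""
    total = 0.0
    for idx in itertools.product(range(1, q + 2), repeat=k):
        size = sum(idx)
        if not q + 1 <= size <= q + k:       # membership in H(q, k)
            continue
        coeff = (-1) ** (q + k - size) * math.comb(k - 1, q + k - size)
        rules = [level_rule(i) for i in idx]
        for js in itertools.product(*(range(len(r[0])) for r in rules)):
            x = [rules[d][0][j] for d, j in enumerate(js)]
            weight = math.prod(rules[d][1][j] for d, j in enumerate(js))
            total += coeff * weight * func(x) ** order
    return total


def g_extreme(x):
    """Hypothetical limit state 3 + x1 + 0.5*x2 - T, minimized over a grid of T in [0, 1]."""
    return min(3.0 + x[0] + 0.5 * x[1] - t / 10.0 for t in range(11))


m1 = smolyak_raw_moment(g_extreme, k=2, q=2, order=1)
m2 = smolyak_raw_moment(g_extreme, k=2, q=2, order=2)
```

For higher levels, the nodes and weights could instead be taken from a library routine such as `numpy.polynomial.hermite.hermgauss`, followed by the same scaling by √2 and 1/√π.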

10.3.2.2 The Improved Maximum Entropy Method Based on the Raw Moments
The  maximum entropy method can obtain a relatively accurate result of the PDF
based on the known moments. The  typical formulation for the PDF of the time-
invariant limit state function is defined as:


$$\begin{cases} \text{find: } p(x) \\ \text{maximize: } H = -\int p(x) \ln p(x)\, dx \\ \text{subject to: } \int x^{i} p(x)\, dx = M_i, \quad i = 0, 1, \cdots, l \end{cases} \tag{10.25}$$

where:
p( x ) is the PDF of the time-invariant limit state function g ( X )
H is the entropy of the PDF p( x )
M i is the ith raw moment and M0 = 1
l is the number of the given moment constraints, which is defined to be 4 here

For this optimization problem, Lagrangian multipliers $\lambda_i$, $i = 0, 1, \ldots, 4$, are introduced to construct the Lagrangian function L:

$$L = -\int p(x) \ln p(x)\, dx - \sum_{i=0}^{4} \lambda_i \left[ \int x^{i} p(x)\, dx - M_i \right] \tag{10.26}$$

The optimal solution satisfies $\frac{\delta L}{\delta p(x)} = 0$, and therefore the analytical expression of $p(x)$ is obtained as:

$$p(x) = \exp\left( -\sum_{i=0}^{4} \lambda_i x^{i} \right) \tag{10.27}$$

The objective function to be minimized, based on the Kullback–Leibler (K–L) divergence between the true PDF and the estimated PDF, is:

$$I(\lambda) = \lambda_0 + \sum_{i=1}^{4} \lambda_i M_i \tag{10.28}$$

where $\lambda_0 = \ln \int \exp\left( -\sum_{i=1}^{4} \lambda_i x^{i} \right) dx$. The optimization with equality constraints in Equation 10.25 can be converted into an unconstrained optimization:

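$$\begin{cases} \text{find: } \lambda_1, \lambda_2, \lambda_3, \lambda_4 \\ \text{minimize: } I = \ln\left[ \int \exp\left( -\sum_{i=1}^{4} \lambda_i x^{i} \right) dx \right] + \sum_{i=1}^{4} \lambda_i M_i \end{cases} \tag{10.29}$$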

With the obtained raw moments, the PDF $p(x)$ of the limit state function $g(X)$ can be acquired from the optimization model in Equation 10.29:

$$p(x) = \exp\left\{ -\ln\left[ \int \exp\left( -\sum_{i=1}^{4} \lambda_i x^{i} \right) dx \right] - \sum_{i=1}^{4} \lambda_i x^{i} \right\} \tag{10.30}$$

Reliability is then calculated based on the PDF $p(x)$ from the maximum entropy method:

$$R = \int_0^{+\infty} p(x)\, dx \tag{10.31}$$

The reliability results from this method may not be accurate because of the truncation error of the integral. To address this issue, a monotonic scaling function is introduced to improve the computational accuracy of the maximum entropy method. The truncation error of the numerical integration can be greatly reduced by changing the definition domain of the PDF from an infinite interval to a finite interval.
The scaling function is expressed as:

$$g_s(X) = -\frac{1 - \exp\left( \dfrac{g(X)}{c} \right)}{1 + \exp\left( \dfrac{g(X)}{c} \right)} \tag{10.32}$$

where c is a conversion coefficient and c > 0.

The scaling function $g_s(X)$ is a monotonically increasing function of the limit state function $g(X)$, with $\lim_{g(X) \to \infty} g_s(X) = 1$ and $\lim_{g(X) \to -\infty} g_s(X) = -1$. Therefore, the definition domain is changed from the infinite interval $(-\infty, +\infty)$ to the finite interval $[-1, 1]$. The coefficient c controls the relationship between $g(X)$ and $g_s(X)$, as shown in Figure 10.4: the greater the coefficient c, the gentler the curve. In this subsection, $c = \mu_{g(X)}$ is chosen.
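As a side note, the scaling function of Equation 10.32 is algebraically equal to tanh(g/(2c)), since (e^u − 1)/(e^u + 1) = tanh(u/2). The short Python sketch below uses this identity, which avoids overflow of the exponential for large g/c; the function name is hypothetical.

```python
import math


def scaling_function(g, c):
    """Equation 10.32, evaluated through the identity -(1 - e^u)/(1 + e^u) = tanh(u/2)."""
    if c <= 0.0:
        raise ValueError("the conversion coefficient c must be positive")
    return math.tanh(g / (2.0 * c))
```

Because the mapping is increasing and sends g = 0 to g_s = 0, the safe region g > 0 corresponds to g_s in (0, 1), which is why the reliability integral can later be taken over [0, 1].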

FIGURE 10.4  Conversion relationship between g(x) and g s (x) for different c.



With the scaling function $g_s(X)$, the unconstrained optimization can be rewritten as:

$$\begin{cases} \text{find: } \lambda_1, \lambda_2, \lambda_3, \lambda_4 \\ \text{minimize: } I = \ln\left[ \int_{-1}^{1} \exp\left( -\sum_{i=1}^{4} \lambda_i x^{i} \right) dx \right] + \sum_{i=1}^{4} \lambda_i M_i^{s} \end{cases} \tag{10.33}$$

and the corresponding reliability is:

$$R = \int_0^{1} p(x)\, dx \tag{10.34}$$

where $M_i^{s}$ is the ith raw moment of $g_s(X)$ and $p(x)$ is the PDF obtained from the moments $M_i^{s}$.
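The unconstrained problem of Equation 10.33 is convex in the multipliers, with gradient $M_i^s - E_p[x^i]$ and Hessian equal to the covariance matrix of the monomials under p. The following sketch, which assumes NumPy is available, minimizes it with a damped Newton iteration on a trapezoid grid. The solver choice, the grid size, and the function names are assumptions for illustration and are not the chapter's implementation.

```python
import numpy as np


def fit_max_entropy(moments, n_grid=2001, tol=1e-8, max_iter=200):
    """Minimize I(lambda) of Equation 10.33 on [-1, 1] by a damped Newton iteration."""
    x = np.linspace(-1.0, 1.0, n_grid)
    w = np.full(n_grid, x[1] - x[0])        # trapezoid quadrature weights
    w[0] *= 0.5
    w[-1] *= 0.5
    m = np.asarray(moments, dtype=float)
    powers = np.vstack([x ** i for i in range(1, m.size + 1)])
    lam = np.zeros(m.size)

    def objective(v):
        return np.log(np.sum(w * np.exp(-v @ powers))) + v @ m

    for _ in range(max_iter):
        u = w * np.exp(-lam @ powers)
        prob = u / u.sum()                   # quadrature-weighted density values
        mean = powers @ prob
        grad = m - mean                      # dI/dlambda_i = M_i - E[x^i]
        if np.max(np.abs(grad)) < tol:
            break
        hess = (powers * prob) @ powers.T - np.outer(mean, mean)
        step = np.linalg.solve(hess, -grad)
        t, f0 = 1.0, objective(lam)
        while objective(lam + t * step) > f0 and t > 1e-8:
            t *= 0.5                         # backtrack until I does not increase
        lam = lam + t * step
    pdf = np.exp(-lam @ powers)
    pdf /= np.sum(w * pdf)                   # normalization, i.e. exp(-lambda_0)
    return lam, x, pdf


def reliability(x, pdf):
    """Equation 10.34: integral of p(x) over [0, 1] by the trapezoid rule."""
    keep = x >= -1e-12
    xs, ps = x[keep], pdf[keep]
    return float(np.sum(0.5 * (ps[1:] + ps[:-1]) * np.diff(xs)))


# Check with the raw moments of a uniform variable on [-1, 1] (0, 1/3, 0, 1/5):
# the fitted PDF should be nearly flat and the reliability close to 0.5.
lam, grid, pdf = fit_max_entropy([0.0, 1.0 / 3.0, 0.0, 0.2])
r_uniform = reliability(grid, pdf)
```

In the EVM-IME method the four input moments would be the raw moments of the scaled extreme value obtained from the sparse grid; the uniform moments above serve only as a sanity check.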
For the time-variant reliability analysis, the scaling function can be expressed by:

$$g_s(\mathbf{d}, \mathbf{N}, T) = -\frac{1 - \exp\left( \dfrac{g^{T}(\mathbf{d}, \mathbf{N}, T)}{c} \right)}{1 + \exp\left( \dfrac{g^{T}(\mathbf{d}, \mathbf{N}, T)}{c} \right)} \tag{10.35}$$

where c is the conversion coefficient.

The conversion coefficient c is taken as the minimum, over the time interval, of the mean value of the limit state function $g^{T}(\mathbf{d}, \mathbf{N}, T)$, and the corresponding optimization model is provided by:

$$\begin{cases} \text{find: } c \\ \text{minimize: } c = \mu_g(\mathbf{d}, \mathbf{N}, T) \\ \text{subject to: } T \in [0,1] \end{cases} \tag{10.36}$$

where $\mu_g(\mathbf{d}, \mathbf{N}, T)$ is the mean value of $g^{T}(\mathbf{d}, \mathbf{N}, T)$.

10.3.3 Probability Density Function of the First-Passage Time Point Method
In  this subsection, the third method for the time-variant reliability analysis based
on the PDF of the first-passage time point (F-PTP) is discussed. The  mean value
function of the time-variant limit state function is obtained first using the sparse
grid based stochastic collocation method. The expression of the first-passage time is
then built based on the second-order Taylor expansion. With the combination of the
fourth central moments and the maximum entropy methods, the PDF of the F-PTP
is obtained and the time-variant reliability can be calculated with the integration.

10.3.3.1  Time-Variant Reliability Model Based on PDF of F-PTP


According to the reliability criterion, the reliability situation and failure situation can
be described by the relationship between the realization g T ( d, Z , T ) of the stochastic
process representing the time-variant limit state function and the time interval [0,1],
shown in Figure 10.5. For example, there are three intersections t1, t2 , and t3 , between
g T ( d, Z , T ) and horizontal coordinate axes, where t1 is the F-PTP. When t1 ∈ [ 0,1],
the failure occurs, shown in Figure 10.5a. When t1 > 1, the system operates success-
fully, shown in Figure 10.5b.
Actually, the F-PTP, t1, is a function of random input vector Z, denoted as t1 = t ( Z ) .
Therefore, the failure probability during the time interval [0,1] can be expressed as:

$$P(0,1) = \Pr\left\{ t_1 \in [0,1] \right\} \tag{10.37}$$

If the PDF $f(\tau)$ of the F-PTP function $t_1 = t(Z)$ is available, the failure probability can be calculated by:

$$P(0,1) = \Pr\{ 0 < t_1 < 1 \} = \int_0^{1} f(\tau)\, d\tau \tag{10.38}$$

10.3.3.2 Establish f(τ) by Using the Maximum Entropy Method Combined with the Moment Method
The mean value function of the time-variant random limit state function $g^{T}(\mathbf{d}, \mathbf{Z}, T)$ can be expressed by:

$$\mu(T) = \int g^{T}(\mathbf{d}, \mathbf{Z}, T)\, p(\mathbf{Z})\, d\mathbf{Z} \tag{10.39}$$

FIGURE 10.5  The geometrical relationship between $g^{T}(\mathbf{d}, \mathbf{Z}, T)$ and T: (a) the safe situation on the life cycle [0,T], (b) the failure situation on the life cycle [0,T].

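The mean value function $\mu(T)$ can be calculated by the sparse grid-based stochastic collocation method, which is elaborated in Section 10.3.2. The F-PTP, $T^{*}$, of the mean value function can be acquired via an optimization model. The second-order Taylor expansion is used to approximate the time-variant limit state function at $T^{*}$:

$$g^{T}(\mathbf{d}, \mathbf{Z}, T) \approx g^{T}(\mathbf{d}, \mathbf{Z}, T^{*}) + \left. \frac{\partial g^{T}}{\partial T} \right|_{T = T^{*}} (T - T^{*}) + \frac{1}{2} \left. \frac{\partial^{2} g^{T}}{\partial T^{2}} \right|_{T = T^{*}} (T - T^{*})^{2} = A T^{2} + B T + C \tag{10.40}$$

where:

$$A = \frac{1}{2} \left. \frac{\partial^{2} g^{T}}{\partial T^{2}} \right|_{T = T^{*}},$$

$$B = \left. \frac{\partial g^{T}}{\partial T} \right|_{T = T^{*}} - T^{*} \left. \frac{\partial^{2} g^{T}}{\partial T^{2}} \right|_{T = T^{*}},$$

$$C = g^{T}(\mathbf{d}, \mathbf{Z}, T^{*}) - T^{*} \left. \frac{\partial g^{T}}{\partial T} \right|_{T = T^{*}} + \frac{(T^{*})^{2}}{2} \left. \frac{\partial^{2} g^{T}}{\partial T^{2}} \right|_{T = T^{*}}.$$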

The approximate limit state function is a quadratic function of T. Therefore, the F-PTP function of the limit state function is $t_1 = t(Z) = \frac{-B - \sqrt{B^{2} - 4AC}}{2A}$. The first four raw moments of $t_1$ can also be computed by the sparse grid-based stochastic collocation method and are denoted as $M_i^{F}$, $i = 1, 2, 3, 4$. The maximum entropy method is then used to establish $f(\tau)$, and the corresponding optimization model is:

$$\begin{cases} \text{find: } f(\tau) \\ \text{maximize: } H = -\int f(\tau) \ln f(\tau)\, d\tau \\ \text{subject to: } \int \tau^{i} f(\tau)\, d\tau = M_i^{F}, \quad i = 0, 1, \cdots, l \end{cases} \tag{10.41}$$
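The step from the Taylor expansion to the first-passage time can be illustrated with a small Python sketch. Given the value and the first two time derivatives of the limit state function at T*, it forms A, B, and C as in Equation 10.40 and returns the root with the minus sign in front of the square root, which is the root where the quadratic crosses zero with a negative slope. The function names and the handling of the degenerate cases (A = 0 or no real root) are assumptions added for illustration.

```python
import math


def quadratic_coefficients(g_star, dg_star, d2g_star, t_star):
    """Coefficients A, B, C of the second-order Taylor expansion at t_star."""
    a = 0.5 * d2g_star
    b = dg_star - t_star * d2g_star
    c = g_star - t_star * dg_star + 0.5 * t_star ** 2 * d2g_star
    return a, b, c


def first_passage_time(a, b, c):
    """Root t1 = (-B - sqrt(B^2 - 4AC)) / (2A); None when the quadratic never reaches zero."""
    if a == 0.0:
        return -c / b if b != 0.0 else None   # linear case (added assumption)
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None
    return (-b - math.sqrt(disc)) / (2.0 * a)


# Hypothetical realization: g(T) = T^2 - 2.5*T + 1 expanded at T* = 0.4,
# where g = 0.16, dg/dT = -1.7 and d2g/dT2 = 2.
a, b, c = quadratic_coefficients(0.16, -1.7, 2.0, 0.4)
t1 = first_passage_time(a, b, c)
```

Evaluating `first_passage_time` at each sparse grid node of Z gives the samples of t1 whose weighted powers form the moments $M_i^F$.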

10.4  EXAMPLES AND DISCUSSIONS


In this section, two examples are employed to demonstrate the computational effi-
ciency and accuracy of the three time-variant reliability analysis methods. For the
accuracy comparison, the results from the MCS method are provided as the bench-
mark. The error on the failure probability is:

$$Err = \frac{\left| P - P_{MCS} \right|}{P_{MCS}} \times 100\% \tag{10.42}$$

where P is the failure probability from the three proposed methods, PMCS is failure
probability from the MCS method.

10.4.1 Numerical Example
The time-variant limit state function is:

$$g(\mathbf{d}, \mathbf{X}, t) = x_1^{2} x_2 - 5 x_3 t + (x_4 + 1) \exp\left( x_5 t^{2} \right) - \Delta \tag{10.43}$$

where $\Delta$ is the threshold; $X_i$ ($i = 1, 2, 3, 4, 5$) are independent, normally distributed input random variables with $X_i \sim N(5, 0.5^{2})$; and the time interval is defined as $t \in [0,1]$.
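A crude Monte Carlo benchmark for Equation 10.43 can be sketched as follows. This is not the chapter's MCS implementation: the minimum over time is found on a fixed grid instead of by an optimization algorithm, and the sample size, grid size, seed, and function names are illustrative assumptions.

```python
import math
import random


def limit_state(x, t, delta):
    """Time-variant limit state function of Equation 10.43."""
    x1, x2, x3, x4, x5 = x
    return x1 ** 2 * x2 - 5.0 * x3 * t + (x4 + 1.0) * math.exp(x5 * t ** 2) - delta


def mcs_failure_probability(delta, n_samples=20000, n_steps=20, seed=1):
    """Fraction of sampled trajectories whose minimum over t in [0, 1] is negative."""
    rng = random.Random(seed)
    times = [j / n_steps for j in range(n_steps + 1)]
    failures = 0
    for _ in range(n_samples):
        x = [rng.gauss(5.0, 0.5) for _ in range(5)]   # X_i ~ N(5, 0.5^2)
        if min(limit_state(x, t, delta) for t in times) < 0.0:
            failures += 1
    return failures / n_samples


p_f_70 = mcs_failure_probability(delta=70.0)
```

With a fixed seed the same input samples are reused for every threshold, so the estimated failure probability is non-decreasing in Δ by construction.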
Table 10.2 provides the failure probability results from the MCS method, the FPD method, the EVM-IME method, and the F-PTP method for different thresholds. The relative errors of the FPD method (Err1), the EVM-IME method (Err2), and the F-PTP method (Err3) are also given in Table 10.2.
From Table 10.2, the errors of the proposed methods stay below 13% for all thresholds (below 5% for the EVM-IME method). Among the three proposed methods, the order of the computational accuracy is approximately the EVM-IME method > the FPD method > the F-PTP method.
For this example, $10^5$ trajectories are generated in the MCS method. An optimization algorithm is used to estimate the minimum value of each trajectory of the limit state function, and nearly 20 function calls are needed to find the global minimum of a given trajectory. Therefore, about $2 \times 10^6$ function calls are used in the MCS method. Similarly, nearly 20 function calls are used to obtain FMTP, and 10,000 samples are used to build the kernel density function. Therefore, 10,020 function calls are used in the FPD method. In the EVM-IME method, q = 2 is set and 61 samples are drawn for each limit state function using the sparse grid stochastic collocation technique. For each limit state function, there are nearly 20 function calls via the optimization. Therefore, there are 20 + 20 × 61 + 20 = 1,260 function calls in the

TABLE 10.2
Failure Probability Results for Example 1
∆     MCS       FPD       EVM-IME   F-PTP     Err1 (%)   Err2 (%)   Err3 (%)
70    0.00823   0.00897   0.00854   0.00893   8.99       3.77       8.51
72    0.01149   0.01099   0.01126   0.01117   4.35       2.00       2.79
74    0.01449   0.01506   0.01458   0.01626   3.93       0.6211     12.22
76    0.01798   0.01900   0.01880   0.01936   5.67       4.56       7.68
78    0.02312   0.02453   0.02386   0.02103   6.88       3.20       9.04

EVM-IME method. In the F-PTP method, 101 function calls are used. Therefore,
the order of the computational efficiency is the F-PTP method  >  the EVM-IME
method > the FPD method.

10.4.2 A Corroded Simple Supported Beam Under Random Loadings


A  corroded simple supported beam subjected to random loadings is used as an
engineering case to illustrate the three proposed methods. As shown in Figure 10.6,
the parameters for the beam are L = 5 m, b0 = 0.2 m, and h0 = 0.04 m. The uniform
load p and the time-varying random loading F ( t ) are simultaneously applied to the
beam. The uniform load p is:

$$p = \sigma b_0 h_0 \ (\mathrm{N/m}) \tag{10.44}$$

where:

$\sigma = 78{,}500\ \mathrm{N/m^{3}}$.

The time-varying random loading $F(t)$ is a Gaussian process with a mean value of 3,500 N, a standard deviation of 700 N, and the autocorrelation function $\rho(t_1, t_2) = \exp\left( -\left( \frac{t_2 - t_1}{0.0833} \right)^{2} \right)$. The time-varying random loading $F(t)$ can be expressed by the random process discretization method:

$$F(t) = 3{,}500 + 700 \sum_{i=1}^{r} \frac{N_i}{\sqrt{\tau_i}}\, \phi_i^{T} \rho(t) \tag{10.45}$$

where:
$\tau_i$, $\phi_i^{T}$, and $\rho(t)$ are obtained for each time interval as described in the appendix of this chapter
$N_i$ are independent standard normal random variables

Under the combined effect of $F(t)$ and p, the bending moment at the mid-span section is:

$$M(t) = \frac{F(t) L}{4} + \frac{p L^{2}}{8}. \tag{10.46}$$
4 8

FIGURE 10.6  A corroded simple supported beam under random loadings.



Corrosion is assumed to occur isotropically around the cross-section of the beam and to grow linearly with time. The remaining area of the cross-section is then:

A ( t ) = b ( t ) × h ( t ) (10.47)

where $b(t) = b_0 - 2\kappa t$, $h(t) = h_0 - 2\kappa t$, and $\kappa = 0.00003$ m/year is a parameter that controls the corrosion velocity. The ultimate bending moment is:

$$M_u(t) = \frac{A(t)\, h(t)\, f_y}{4}. \tag{10.48}$$

where $f_y$ is the steel yield stress.
The time-variant limit state function can be written as:

g ( X, Y ( t ) , t ) = M u ( t ) − M ( t ) (10.49)
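Equations 10.44 and 10.46 to 10.49 can be assembled into one limit state function, as in the sketch below. The function and constant names are hypothetical, the load F(t) is passed in as a plain number for one time instant, and the yield stress is assumed to be given in Pa, which means reading the mean value 240 in Table 10.3 as 240 MPa (the table itself states no unit).

```python
L_SPAN = 5.0        # beam length L in m
KAPPA = 3.0e-5      # corrosion rate kappa in m/year
SIGMA = 78500.0     # unit weight sigma; with b0*h0 in m^2 this yields p in N/m


def beam_limit_state(t, load, fy, b0=0.2, h0=0.04):
    """g = M_u(t) - M(t) for the corroded beam; t in years, load in N, fy in Pa (assumed)."""
    b = b0 - 2.0 * KAPPA * t                 # remaining width b(t)
    h = h0 - 2.0 * KAPPA * t                 # remaining height h(t)
    p = SIGMA * b0 * h0                      # uniform load, Equation 10.44
    moment = load * L_SPAN / 4.0 + p * L_SPAN ** 2 / 8.0   # Equation 10.46
    ultimate = b * h * h * fy / 4.0          # Equation 10.48 with A(t) = b(t) * h(t)
    return ultimate - moment


g_start = beam_limit_state(0.0, load=3500.0, fy=240.0e6)
g_end = beam_limit_state(20.0, load=3500.0, fy=240.0e6)
```

At the mean parameter values the margin is positive and shrinks with time, because corrosion reduces the section while the mean load stays constant.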

In this case, the time intervals [0,15] and [0,20] years are considered. The information on the random parameters is given in Table 10.3.
The results for the four methods are provided in Table 10.4. Table 10.4 shows that the order of accuracy remains the same as in Example 1. Since the expression of the limit state function has little impact on the computational efficiency, the order of computational efficiency also remains the same.

TABLE 10.3
Information of Random Parameters in Example 2
Parameter   Distribution Type   Mean    Standard Deviation
f_y         Lognormal           240     24
b0          Lognormal           0.2     0.01
h0          Lognormal           0.04    0.004
F(t)        Gaussian process    3,500   700

TABLE 10.4
Time-Variant Reliability Results in Example 2
Time Interval   MCS       FPD       EVM-IME   F-PTP     Err1 (%)   Err2 (%)   Err3 (%)
[0,20]          0.00178   0.00182   0.00175   0.00174   2.28       1.69       2.25
[0,15]          0.00121   0.00119   0.00122   0.00116   1.65       0.826      4.13

10.5 CONCLUSIONS
In this chapter, three time-variant reliability analysis methods, the FPD method, the EVM-IME method, and the F-PTP method, are discussed. From the procedures and examples, the following conclusions can be drawn: (1) the three methods have high computational accuracy, which satisfies engineering requirements; (2) the three methods have high computational efficiency, which makes them feasible for solving complex engineering problems; (3) the order of the computational accuracy is approximately the EVM-IME method > the FPD method > the F-PTP method; and (4) the order of the computational efficiency is the F-PTP method > the EVM-IME method > the FPD method.
In further research, intelligent techniques will be used for time-variant reliability analysis to further improve the computational efficiency while maintaining high computational accuracy.

APPENDIX: DISCRETIZATION OF RANDOM PROCESSES


Consider a Gaussian process Y(t) with mean value m(t), standard deviation σ(t), and
autocorrelation coefficient function ρ(t1, t2). In the time interval [0, T], r time points
ti, i = 1, …, r are selected to decompose the process and t1 = 0, tr = T. The Gaussian
process Y(t) is decomposed into:

$$Y(t) = m(t) + \sigma(t) \sum_{i=1}^{r} \frac{\xi_i}{\sqrt{\tau_i}}\, \phi_i^{T} \rho(t) \tag{A10.1}$$

where $\xi_i \sim N(0, 1)$, $i = 1, \ldots, r$, are independent standard normal random variables; $(\tau_i, \phi_i)$ are the eigenvalues and eigenvectors of the correlation matrix C with $C_{ij} = \rho(t_i, t_j)$, $i, j = 1, \ldots, r$; and $\rho(t) = [\rho(t, t_1), \ldots, \rho(t, t_r)]^{T}$ is a time-variant vector. The corresponding decomposition error is given by:

$$e(t) = 1 - \sum_{i=1}^{r} \frac{\left( \phi_i^{T} \rho(t) \right)^{2}}{\tau_i} \tag{A10.2}$$
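The discretization in Equations A10.1 and A10.2 can be sketched with NumPy as shown below. The squared-exponential autocorrelation, its correlation length of 0.1, the 11 equally spaced time points, and all function names are assumptions chosen for illustration; all r eigenpairs are retained.

```python
import numpy as np


def correlation(t1, t2, length=0.1):
    """Hypothetical squared-exponential autocorrelation coefficient function."""
    return np.exp(-((t2 - t1) / length) ** 2)


def discretize(time_points, length=0.1):
    """Eigenvalues tau_i and eigenvectors phi_i (columns) of C_ij = rho(t_i, t_j)."""
    t = np.asarray(time_points, dtype=float)
    c = correlation(t[:, None], t[None, :], length)
    tau, phi = np.linalg.eigh(c)
    return t, tau, phi


def sample_path(t_eval, t, tau, phi, mean, std, rng, length=0.1):
    """One realization of Y(t) from Equation A10.1 at the times t_eval."""
    t_eval = np.asarray(t_eval, dtype=float)
    rho = correlation(t_eval[:, None], t[None, :], length)   # row k holds rho(t_k)^T
    xi = rng.standard_normal(tau.size)
    return mean + std * (rho @ phi) @ (xi / np.sqrt(tau))


def decomposition_error(t_eval, t, tau, phi, length=0.1):
    """Decomposition error e(t) of Equation A10.2 at the times t_eval."""
    t_eval = np.asarray(t_eval, dtype=float)
    rho = correlation(t_eval[:, None], t[None, :], length)
    return 1.0 - np.sum((rho @ phi) ** 2 / tau, axis=1)


t, tau, phi = discretize(np.linspace(0.0, 1.0, 11))
path = sample_path(np.linspace(0.0, 1.0, 101), t, tau, phi,
                   mean=3500.0, std=700.0, rng=np.random.default_rng(0))
err_nodes = decomposition_error(t, t, tau, phi)
err_mid = decomposition_error(t[:-1] + 0.05, t, tau, phi)
```

At the selected time points the error vanishes when all eigenpairs are kept, so the error between the points indicates whether r is large enough for the chosen correlation length.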

REFERENCES
1. Du X. Unified uncertainty analysis by the first order reliability method. Journal of
Mechanical Design, 2008, 130(9): 091401.
2. Wang Z, Huang HZ, Liu Y. A  unified framework for integrated optimization under
uncertainty. Journal of Mechanical Design, 2010, 132(5): 051008.
3. Zhao YG, Ono T. Moment methods for structural reliability. Structural Safety, 2001,
23(1): 47–75.
4. Xiao NC, Zuo MJ, Zhou C. A new adaptive sequential sampling method to construct
surrogate models for efficient reliability analysis. Reliability Engineering  & System
Safety, 2018, 169: 330–338.

5. Yu S, Wang Z, Zhang K. Sequential time-dependent reliability analysis for the lower


extremity exoskeleton under uncertainty. Reliability Engineering  & System Safety,
2018, 170: 45–52.
6. Van Noortwijk JM, van Der Weide JAM, Kallen MJ, Pandey MD. Gamma processes
and peaks-over-threshold distributions for time-dependent. Reliability Engineering &
System Safety, 2007, 92(12): 1651–1658.
7. Yu S, Wang Z, Meng D. Time-variant reliability assessment for multiple failure modes
and temporal parameters. Structural and Multidisciplinary Optimization, 2018, 58(4):
1705–1717.
8. Majcher M, Mourelatos Z, Tsianika V. Time-dependent reliability analysis using a
modified composite limit state approach. SAE International Journal of Commercial
Vehicles, 2017, 10(2017-01-0206): 66–72.
9. Gnedenko BV, Belyayev YK, Solovyev AD. Mathematical Methods of Reliability
Theory. Burlington, NY: Academic Press, 2014.
10. Jiang C, Wei X, Huang Z, Liu J. An outcrossing rate model and its efficient calculation
for time-dependent system reliability analysis. Journal of Mechanical Design, 2017,
139(4): 041402.
11. Yan M, Sun B, Liao B et al. FORM and out-crossing combined time-variant reliability
analysis method for ship structures. IEEE Access, 2018, 6: 9723–9732.
12. Sundar VS, Manohar CS. Time variant reliability model updating in instrumented
dynamical systems based on Girsanov’s transformation. International Journal of Non-
linear Mechanics, 2013, 52: 32–40.
13. Breitung K, Rackwitz R. Nonlinear combination of load processes. Journal of
Structural Mechanics, 1982, 10(2): 145–166.
14. Zayed A, Garbatov Y, Soares CG. Time variant reliability assessment of ship struc-
tures with fast integration techniques. Probabilistic Engineering Mechanics, 2013, 32:
93–102.
15. Andrieu-Renaud C, Sudret B, Lemaire M. The PHI2 method: A way to compute time-
variant reliability. Reliability Engineering & System Safety, 2004, 84(1): 75–86.
16. Singh A, Mourelatos Z P. On the time-dependent reliability of non-monotonic, non-
repairable systems. SAE International Journal of Materials and Manufacturing, 2010,
3(1): 425–444.
17. Wang Z, Zhang X, Huang HZ et al. A simulation method to estimate two types of time-
varying failure rate of dynamic systems. Journal of Mechanical Design, 2016, 138(12):
121404.
18. Singh A, Mourelatos Z, Nikolaidis E. Time-dependent reliability of random dynamic
systems using time-series modeling and importance sampling. SAE International
Journal of Materials and Manufacturing, 2011, 4(1): 929–946.
19. Wang Z, Mourelatos ZP, Li J, Baseski I, Singh A. Time-dependent reliability of
dynamic systems using subset simulation with splitting over a series of correlated time
intervals. Journal of Mechanical Design, 2014, 136(6): 061008.
20. Ching J, Au SK, Beck JL. Reliability estimation for dynamical systems subject to sto-
chastic excitation using subset simulation with splitting. Computer Methods in Applied
Mechanics and Engineering, 2005, 194(12–16): 1557–1579.
21. Yu S, Wang Z. A novel time-variant reliability analysis method based on failure pro-
cesses decomposition for dynamic uncertain structures. Journal of Mechanical Design,
2018, 140(5): 051401.
11 Latent Variable Models in Reliability
Laurent Bordes

CONTENTS
11.1 Introduction
11.2 Latent Variable Model for Handling Incomplete Data
11.2.1 Right Censoring
11.2.2 Partly Observed Current Status Data
11.2.3 Competing Risks Models
11.3 Latent Variable Model for Handling Heterogeneity
11.3.1 Frailty Models
11.3.2 Finite Mixture Models
11.3.3 Cure Models
11.3.4 Excess Hazard Rate Models
11.4 Latent Variable or Process Models for Handling Specific Phenomena
11.4.1 Gamma Degradation Model with Random Initial Time
11.4.2 Gamma Degradation Model with Frailty Scale Parameter
11.4.3 Bivariate Gamma Degradation Models
11.5 Concluding Remarks
References

11.1 INTRODUCTION
A latent variable is a variable that is not directly observable and is assumed to affect the response variables. Many statistical models involve latent variables; such models are called latent variable models. Surprisingly, few monographs are specifically dedicated to latent variable models (see, e.g., [1–4]). Latent variables are typically encountered in econometric, reliability, and survival statistical models, with different aims. A latent variable may represent the effect of unobservable covariates or factors, which allows accounting for the unobserved heterogeneity between subjects. It may also account for measurement errors, assuming that the latent variables represent the “true” outcomes and the manifest variables represent their “disturbed” versions. Finally, it may summarize different measurements of the same (directly) unobservable characteristics (e.g., quality of life), so that sample units may be easily ordered or classified based on these traits (represented by the latent variables). Hence, latent variable models now have a wide range of applications, especially in the presence of repeated observations, longitudinal/panel data, and multilevel data.


In this chapter, we select a few latent variable models that have proved to be useful in the domain of reliability. We do not claim to give an exhaustive view of such models, but we try to show that these models lead to various estimation methodologies, which require various mathematical tools when large sample properties are to be derived. The basic mathematical tools are empirical processes theory (see [5–7]), martingale methods for counting processes (see [8]), and Expectation-Maximization (EM) algorithms for parametric models (see [9]).
This chapter is organized into three parts. The first part, Section 11.2, is devoted to incomplete data, including right censored data, partly right and left censored data, and competing risk data. The second part, Section 11.3, considers models that account for heterogeneity in data, including frailty models, finite mixture models, cure models, and excess risk models. The last part, Section 11.4, deals with models for time-dependent phenomena. There we consider degradation processes for which the latent variable is either a random duration, as in the Gamma degradation process with random initial time, or a frailty scale parameter. We also consider bivariate degradation processes obtained from a trivariate construction that requires a third latent Gamma process.

11.2 LATENT VARIABLE MODEL FOR HANDLING INCOMPLETE DATA
11.2.1 Right Censoring
Let T be a duration of interest with probability density function (PDF) $f_T$, survival function $S_T$, hazard rate function $\lambda_T = f_T / S_T$, and cumulative hazard rate function $\Lambda_T$. Let C be a censoring time with PDF $f_C$, survival function $S_C$, hazard rate function $\lambda_C = f_C / S_C$, and cumulative hazard rate function $\Lambda_C$. One of the basic latent variable models in reliability (or survival analysis) is right censoring: the duration of interest T is right censored if, instead of observing T, we observe the couple $(X, \Delta)$ where $X = T \wedge C$ and $\Delta = 1(T \le C)$. Here $T \wedge C = \min(T, C)$, and $1(A)$ denotes the indicator function, equal to 1 if A is true and 0 otherwise. Assuming that the random variables T and C are independent and defining for $x \ge 0$ and $\delta \in \{0, 1\}$:

H δ ( x ) = Pr( X ≤ x; ∆ = δ ),

it is straightforward to check that:

dH1 ( x ) = fT ( x )SC ( x )dx

and:

H ( x ) ≡ H0 ( x ) + H1 ( x ) = Pr( X ≤ x ) = 1 − ST ( x )SC ( x ).

Thus, we observe that:

$$\Lambda_T(x) = \int_0^{x} \frac{dH_1(y)}{\bar{H}(y)} \tag{11.1}$$

where $\bar{H}(x) = 1 - H(x)$. Then, given n independent identically distributed (i.i.d.) copies $\{(X_i, \Delta_i)\}_{1 \le i \le n}$ of $(X, \Delta)$, we naturally derive the well-known Nelson–Aalen [10,11] estimator based on the following empirical processes:

$$H_{1n}(x) \equiv \frac{1}{n} \sum_{i=1}^{n} 1(X_i \le x; \Delta_i = 1)$$

and:

$$\bar{H}_n(x) \equiv \frac{1}{n} \sum_{i=1}^{n} 1(X_i \ge x).$$

Indeed, replacing $H_1$ and $\bar{H}$ by their empirical counterparts just defined, we obtain:

$$\hat{\Lambda}_T(x) = \int_0^{x} \frac{dH_{1n}(y)}{\bar{H}_n(y)} = \frac{1}{n} \sum_{i=1}^{n} \frac{\Delta_i\, 1(X_i \le x)}{\bar{H}_n(X_i)}.$$

The most important consequence of this latent variable representation is that $\hat{\Lambda}_T$ can be written as a functional of the two-dimensional empirical process $x \mapsto (H_{1n}(x), \bar{H}_n(x))$. The powerful tools of empirical processes theory (continuous mapping theorem, functional delta-method, etc.) then allow the asymptotic properties of $(H_{1n}, \bar{H}_n)$ to be transferred to $\hat{\Lambda}_T$. This example is handled in several textbooks, such as van der Vaart and Wellner, van der Vaart, and Kosorok (see [5–7]). Since the Kaplan–Meier [12] estimator is linked to the Nelson–Aalen estimator by the product-limit operator, it is also a functional of the empirical process $x \mapsto (H_{1n}(x), \bar{H}_n(x))$, and again its asymptotic properties can be derived by the tools mentioned previously. Another point is that the latent variable representation of $(X, \Delta)$ is also useful for simulating data and trying alternative estimation methods.
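The latent variable representation translates directly into a simulation, as in the following Python sketch. It draws the latent pair (T, C), keeps only the observable couple (X, Δ), and evaluates the Nelson–Aalen estimator as the sum, over uncensored observations not exceeding x, of one over the number of observations still at risk. The exponential distributions, their rates, and the function names are assumptions for illustration.

```python
import bisect
import random


def simulate_right_censored(n, rate_t=1.0, rate_c=0.5, seed=0):
    """Draw latent (T, C) and return the observable couples (X, Delta)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t = rng.expovariate(rate_t)    # latent duration of interest T
        c = rng.expovariate(rate_c)    # latent censoring time C
        data.append((min(t, c), 1 if t <= c else 0))
    return data


def nelson_aalen(data, x):
    """Nelson-Aalen estimate of the cumulative hazard Lambda_T at x."""
    xs = sorted(obs for obs, _ in data)
    n = len(xs)
    total = 0.0
    for obs, delta in data:
        if delta == 1 and obs <= x:
            at_risk = n - bisect.bisect_left(xs, obs)   # number of X_j >= X_i
            total += 1.0 / at_risk
    return total


sample = simulate_right_censored(3000)
estimate_at_one = nelson_aalen(sample, 1.0)
```

With T exponential of rate 1, the true cumulative hazard is Λ_T(x) = x, so the estimate at x = 1 should be close to 1 for a sample of this size.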

11.2.2 Partly Observed Current Status Data


Let T denote the random lifetime of interest. We consider the case where instead of
T we observe independent copies of a finite nonnegative duration X and of a discrete
variable A∈{0,1,2}, such that:

X = T if A=0

X < T if A =1
X > T if A = 2


The aim is to estimate the distribution of T based on n i.i.d. copies $\{(X_i, A_i)\}_{1 \le i \le n}$ of $(X, A)$. Let us point out that the limit case where the event A = 2 (resp. A = 1) has zero probability corresponds to the usual random right-censoring (resp. left-censoring) setup discussed in the previous section. If A = 0 has zero probability, we obtain current status data, which means that each observation is either a right censoring time or a left censoring time. This observation scheme is a special case of the grouped, censored, and truncated data of Turnbull [13]. Note that the Non-Parametric Maximum Likelihood Estimator (NPMLE) for the distribution of partly interval censored duration data has been studied by Huang [14]. Many papers focus on the derivation of the NPMLE in situations where the observation scheme does not allow an explicit version of this quantity to be obtained, in contrast to the right censoring case of the previous section.
To derive an explicit version of the distribution of interest, the authors in [15] pro-
pose the following latent variable model for ( X , A). Let us introduce a non-negative
variable C and a Bernoulli variable ∆ such that:

 X = T and A = 0 if T ≤ C and ∆ = 1

 X = C and A = 1 if C<T

 X = C and A = 2 if T ≤ C and ∆ = 0

and, for identifiability, it is assumed that the random variables T, C, and $\Delta$ are independent. As in the previous section, the distribution functionals of T and C are indexed by T and C, respectively, while $\Pr(\Delta = 1) = p \in [0, 1]$. Note that p = 1 corresponds to the right censoring case and that p = 0 corresponds to current status data. However, for identifiability it is assumed that $p \in (0, 1]$, which guarantees that a proportion of the durations of interest will be observed. For the sake of simplicity we assume that T and C admit PDFs. Thus, defining:

Ha ( x ) = Pr( X ≤ x; A = a),

for a = 0,1,2 it is easy to check that:

$$\begin{cases} dH_0(x) = p\, S_C(x) f_T(x)\, dx \\ dH_1(x) = f_C(x) S_T(x)\, dx \\ dH_2(x) = (1 - p) f_C(x) F_T(x)\, dx \end{cases}$$

and, using the notation $\bar{H}_a(x) = \int_{[x, +\infty)} dH_a(y)$, simple calculations lead to:

$$\bar{H}_0(x) + p \bar{H}_1(x) = p\, S_T(x) S_C(x)$$


from which we obtain the following representation for the hazard rate function $\lambda_T$:

$$\lambda_T(x)\, dx = \frac{dH_0(x)}{\bar{H}_0(x) + p \bar{H}_1(x)}.$$

In addition we have:

$$p = \frac{\bar{H}_0(0)}{\bar{H}_0(0) + \bar{H}_2(0)} = \Pr(\Delta = 1 \mid T \le C) \equiv \frac{\Pr(A = 0)}{\Pr(A = 0) + \Pr(A = 2)}.$$

Then, given n i.i.d. copies $\{(X_i, A_i)\}_{1 \le i \le n}$ of $(X, A)$, [15] derives a Nelson–Aalen type estimator of $\Lambda_T$ defined by:

$$\hat{\Lambda}_T(x) = \int_{[0, x]} \frac{dH_{0n}(y)}{\bar{H}_{0n}(y) + \hat{p}\, \bar{H}_{1n}(y)}$$

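where:

$$H_{an}(x) = \frac{1}{n} \sum_{i=1}^{n} 1(X_i \le x; A_i = a) \quad \text{and} \quad \bar{H}_{an}(x) = \frac{1}{n} \sum_{i=1}^{n} 1(X_i \ge x; A_i = a)$$

and:

$$\hat{p} = \frac{\sum_{i=1}^{n} 1(A_i = 0)}{\sum_{i=1}^{n} 1(A_i \ne 1)}.$$

Alternatively, the estimator $\hat{\Lambda}_T(x)$ can be written:

$$\hat{\Lambda}_T(x) = \sum_{i=1}^{n} \frac{1(A_i = 0)\, 1(X_i \le x)}{\sum_{j=1}^{n} \left\{ 1(X_j \ge X_i; A_j = 0) + \hat{p}\, 1(X_j \ge X_i; A_j = 1) \right\}}.$$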

In addition to being explicit, this estimator can be written as a functional of the three-dimensional empirical process $x \mapsto (H_{0n}(x), H_{1n}(x), H_{2n}(x))$, which allows us to derive its asymptotic behavior by standard empirical processes tools.
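The explicit form of the estimator makes it easy to try on simulated data. The Python sketch below generates the latent triple (T, C, Δ), builds the observable pair (X, A), and evaluates the estimated proportion and the Nelson–Aalen type estimator using suffix counts over the sorted observations. The exponential distributions, the value p = 0.6, and the function names are assumptions for illustration, and ties between observations are assumed not to occur.

```python
import random


def simulate_partly_observed(n, p=0.6, seed=0):
    """Draw latent (T, C, Delta) and return the observable pairs (X, A)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t = rng.expovariate(1.0)
        c = rng.expovariate(1.0)
        delta = 1 if rng.random() < p else 0
        if c < t:
            data.append((c, 1))      # X = C, A = 1: T is only known to exceed X
        elif delta == 1:
            data.append((t, 0))      # X = T, A = 0: T is observed exactly
        else:
            data.append((c, 2))      # X = C, A = 2: T is only known to be below X
    return data


def estimate_cumulative_hazard(data, x):
    """Return (p_hat, Lambda_hat(x)) for the partly observed current status model."""
    n_a0 = sum(1 for _, a in data if a == 0)
    n_not1 = sum(1 for _, a in data if a != 1)
    p_hat = n_a0 / n_not1
    ordered = sorted(data)           # ascending in X
    n = len(ordered)
    risk0 = [0] * (n + 1)            # risk0[k]: number of j >= k with A_j = 0
    risk1 = [0] * (n + 1)            # risk1[k]: number of j >= k with A_j = 1
    for k in range(n - 1, -1, -1):
        risk0[k] = risk0[k + 1] + (ordered[k][1] == 0)
        risk1[k] = risk1[k + 1] + (ordered[k][1] == 1)
    total = 0.0
    for k, (obs, a) in enumerate(ordered):
        if a == 0 and obs <= x:
            total += 1.0 / (risk0[k] + p_hat * risk1[k])
    return p_hat, total


sample = simulate_partly_observed(4000)
p_hat, hazard_at_half = estimate_cumulative_hazard(sample, 0.5)
```

Since T is simulated as exponential with rate 1, the estimate at x = 0.5 can be compared with the true value 0.5, and the estimated proportion with the value of p used in the simulation.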

11.2.3 Competing Risks Models


Latent variable models are useful for modeling missing data phenomena. Let us
consider a very simple example of a latent variable model useful for handling the
fact that sometimes the cause of failure is unknown. In reliability, competing risk
models correspond to series systems that fail whenever one of the components
is down. Thus, if the system is made of p components and if $X_i$ denotes
the lifetime of the ith component, the lifetime of the whole system is nothing but
$X = \min_{1\le j\le p} X_j = X_1 \wedge \cdots \wedge X_p$, and we denote by $S_X$ its reliability function. Let us
consider several model assumptions (A1, A2, and so on):

A1. $X_1,\dots,X_p$ are i.i.d., and we write S for the common reliability function of these
random variables. Because the reliability function $S_X$ of X satisfies:

$$S_X(x) = \Pr(X \ge x) = [S(x)]^{p},$$

we have $S(x) = [S_X(x)]^{1/p}$. Thus, based on n i.i.d. copies $(X_i)_{1\le i\le n}$ of X and
defining the observable empirical process:

$$\bar H_n(x) = \frac{1}{n}\sum_{j=1}^{n} \mathbf{1}(X_j \ge x)$$

it is straightforward to estimate S by:

$$\hat S(x) = [\bar H_n(x)]^{1/p},$$

and the asymptotic properties of $\hat S$ are inherited from those of $\bar H_n$.


A2. $X_1,\dots,X_p$ are i.n.i.d. (independent but not identically distributed) and $S_i$
denotes the reliability function of $X_i$. Because $S_X(x) = \prod_{j=1}^{p} S_j(x)$, the
reliability functions $S_1,\dots,S_p$ cannot be recovered from $S_X$. This
identifiability issue disappears if in addition to X we observe the
cause of failure, that is, $\Delta = \sum_{k=1}^{p} k\,\mathbf{1}(X = X_k)$, which is well defined if
$\Pr(\exists k \ne k' : X_k = X_{k'}) = 0$; this holds when the distributions of $X_1,\dots,X_p$ are
absolutely continuous with respect to the Lebesgue measure. Based on n i.i.d.
copies $\{(X_i, \Delta_i)\}_{1\le i\le n}$ of (X, Δ) and defining for $\delta \in \{1,\dots,p\}$ the observable
empirical processes:

$$H_{\delta,n}(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(X_i \le x;\, \Delta_i = \delta) \quad\text{and}\quad \bar H_{\delta,n}(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(X_i \ge x;\, \Delta_i = \delta)$$

the j-th cumulative hazard rate function can be consistently estimated by:

$$\hat\Lambda_j(x) = \int_0^x \frac{dH_{j,n}(y)}{\bar H_n(y)},$$

where $\bar H_n = \sum_{k=1}^{p} \bar H_{k,n}$ and the asymptotic properties of $(\hat\Lambda_1,\dots,\hat\Lambda_p)$ are
inherited from those of $(H_{1,n},\dots,H_{p,n})$.
A3. We consider now the case where the cause of failure Δ may be missing
completely at random in the previous i.n.i.d. setup. Considering that $X_1,\dots,X_p$
are absolutely continuous with respect to the Lebesgue measure, we denote by
$\lambda_j$ the failure rate of $X_j$ for 1 ≤ j ≤ p. We introduce a binary random
variable A, independent of (X, Δ), and we observe (X, D) ≡ (X, AΔ) instead of
(X, Δ), so that the cause of failure is known only when A = 1; otherwise
it is unknown. We write α ∈ (0,1] for the probability of the event {A = 1}. Note
that if α = 0, the failure causes are never observed and the model
parameters are no longer identifiable. Let us define the sub-reliability functions
$\bar H_d(x) = \Pr(X \ge x;\, D = d)$ for 0 ≤ d ≤ p. It is straightforward to obtain:

$$-d\bar H_j(x) = \alpha\,\lambda_j(x) S_X(x)\,dx \quad\text{for } 1 \le j \le p, \tag{11.2}$$

$$-d\bar H_0(x) = (1-\alpha)\sum_{j=1}^{p} \lambda_j(x) S_X(x)\,dx. \tag{11.3}$$

Again, based on n i.i.d. copies $\{(X_i, D_i)\}_{1\le i\le n}$ of (X, D), we can define for
d = 0,…,p the observable empirical processes:

$$H_{d,n}(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(X_i \le x;\, D_i = d), \qquad \bar H_n(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(X_i \ge x),$$

and a natural estimator for α is $\hat\alpha = n^{-1}\sum_{i=1}^{n} \mathbf{1}(D_i > 0)$.
Using Equation 11.2, we see that for 1 ≤ j ≤ p the cumulative hazard rate
functions $\Lambda_j$ are consistently estimated by:

$$\hat\Lambda_j(x) = \frac{1}{\hat\alpha}\int_{[0,x]} \frac{dH_{j,n}(y)}{\bar H_n(y)} = \frac{1}{\hat\alpha}\sum_{i=1}^{n} \frac{\mathbf{1}(X_i \le x;\, D_i = j)}{\sum_{k=1}^{n} \mathbf{1}(X_k \ge X_i)}.$$

Neglecting the information coming from Equation 11.3 may be harmful.
In [16], the authors propose the following strategy. Let us define $\Lambda_0 = \sum_{j=1}^{p} \Lambda_j$.
Using Equation 11.3, $\Lambda_0$ can be estimated by:

$$\hat\Lambda_0(x) = \frac{1}{1-\hat\alpha}\int_{[0,x]} \frac{dH_{0,n}(y)}{\bar H_n(y)} = \frac{1}{1-\hat\alpha}\sum_{i=1}^{n} \frac{\mathbf{1}(X_i \le x;\, D_i = 0)}{\sum_{k=1}^{n} \mathbf{1}(X_k \ge X_i)}.$$

Let us define $\Lambda^* = (\Lambda_0,\dots,\Lambda_p)$ and $\hat\Lambda^* = (\hat\Lambda_0,\dots,\hat\Lambda_p)$. First they show that
$\sqrt{n}\,(\hat\Lambda^* - \Lambda^*)$ converges weakly to a centered Gaussian process G in $(D[0,\tau])^{p+1}$,
where [0, τ] is the study interval and G has a covariance function that satisfies
$\mathbb{E}[G(x)G(y)^T] = \Sigma(x, y)$ for $(x, y) \in [0,\tau]^2$. Then the authors look for a linear
transformation of $\hat\Lambda^*$ that gives an optimal estimator of $\Lambda = (\Lambda_1,\dots,\Lambda_p)$.
To this end, they define $\mathcal{H}$ as the set of p × (p + 1) real valued matrices such that
$Ha = a^*$ for all $a^* = (a_1,\dots,a_p)^T \in \mathbb{R}^p$ and $a = (a^{*T}, \sum_{j=1}^{p} a_j)^T \in \mathbb{R}^{p+1}$.
Then, for a consistent estimator $\hat\Sigma(x)$ of $\Sigma(x) \equiv \Sigma(x, x)$, the authors define:

$$\hat H(x) = \arg\min_{H \in \mathcal{H}} \operatorname{trace}\left(H \hat\Sigma(x) H^T\right)$$

where a closed-form expression is available for $\hat H(x)$ and where $\hat H(x)$ has to
be calculated at points $X_i \in [0,\tau]$ such that $D_i > 0$. Then $\tilde\Lambda(x) = \hat H(x)\hat\Lambda^*(x)$
is a new estimator of Λ(x), asymptotically T-optimal in the sense that
among all the estimators obtained by linear transformation of $\hat\Lambda^*$, this one
has the smallest asymptotic variance trace.
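The first-stage estimators of assumption A3 (the estimator of α and the Nelson–Aalen type estimators of Λ_0,…,Λ_p) are explicit sums and can be sketched as follows. This is only an illustration in plain Python: the function name is hypothetical, and the optimal linear combination of [16] is not implemented here.

```python
def missing_cause_cumulative_hazards(x_obs, d_obs, p, x):
    """Return (alpha_hat, [Lambda_0_hat(x), ..., Lambda_p_hat(x)]).

    d_obs[i] in {1, ..., p} is an observed cause; 0 means the cause is missing.
    """
    n = len(x_obs)
    alpha_hat = sum(1 for d in d_obs if d > 0) / n
    if alpha_hat == 0.0:
        raise ValueError("no cause observed: the model is not identifiable")

    # Raw Nelson-Aalen type sums, one per value of D, before rescaling.
    raw = [0.0] * (p + 1)
    for x_i, d_i in zip(x_obs, d_obs):
        if x_i <= x:
            at_risk = sum(1 for x_k in x_obs if x_k >= x_i)
            raw[d_i] += 1.0 / at_risk

    # Cause-specific hazards are rescaled by alpha_hat (Equation 11.2).
    hazards = [r / alpha_hat for r in raw]
    # The all-cause hazard Lambda_0 uses the observations with missing cause,
    # rescaled by 1 - alpha_hat (Equation 11.3); undefined if nothing is missing.
    hazards[0] = raw[0] / (1.0 - alpha_hat) if alpha_hat < 1.0 else None
    return alpha_hat, hazards
```

Comparing the first entry of the returned list with the sum of the others gives a quick check of how much information the observations with missing cause carry.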

11.3 LATENT VARIABLE MODEL FOR HANDLING HETEROGENEITY
11.3.1 Frailty Models
Frailty models were introduced in the biostatistical literature in [17] to
account for missing covariates or heterogeneity and have been extensively discussed
in [18–20]. These models can be viewed as an extension of the Cox proportional
hazard model in the sense that the hazard rate function of the duration of interest
is generally assumed to depend on an unobservable random quantity that acts
multiplicatively on it. These unobservable random quantities, varying from one
individual to another, are called frailties.
In [21] frailty models are described as random effects models for time variables,
where the random effect (the frailty) has a multiplicative effect on the hazard rate
function.
In the univariate case we consider now, if T is the duration of interest with
hazard rate function $\lambda_T$ (cumulative hazard function $\Lambda_T$ and survival function
$S_T$), then we assume that we observe X having a random hazard rate function
$\lambda_X(x) \equiv \lambda_{X|Z}(x) = Z\lambda_T(x)$ where Z is an unobserved positive random variable. Z
is considered as a random mixture variable, varying across the population. It also
means that two individuals have the same hazard rate function up to an unknown
factor, as in the famous proportional hazard model. Nearly all arguments in favor
of assuming a Gamma distribution for Z are based on mathematical and
computational aspects. However, for identification reasons the condition $\mathbb{E}(Z) = 1$ is
required, so that Z ∼ Γ(α, α) for α > 0, where the PDF f of the Gamma distribution
Γ(α, β) is defined by:

$$f(z) \equiv f_{\Gamma(\alpha,\beta)}(z) = \frac{\beta^{\alpha} z^{\alpha-1}}{\Gamma(\alpha)}\, e^{-\beta z}\,\mathbf{1}(z > 0).$$

Using the Bayes inversion formula it is easy to show that conditionally on X = x, the
frailty Z is distributed according to $\Gamma(\alpha + 1, \alpha + \Lambda_T(x))$. We also derive the unconditional
PDF $f_X$ of X since:

$$f_X(x) = \int_0^{+\infty} \lambda_{X|Z}(x \mid z)\, S_{X|Z}(x \mid z)\, f_Z(z)\,dz = \frac{\alpha^{\alpha+1}\lambda_T(x)}{(\alpha + \Lambda_T(x))^{\alpha+1}}.$$
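The closed form above follows from a standard Gamma integral. As a sketch of the intermediate step, using only the definitions already introduced, namely $S_{X|Z}(x \mid z) = \exp(-z\Lambda_T(x))$ and Z ∼ Γ(α, α):

$$f_X(x) = \int_0^{+\infty} z\lambda_T(x)\, e^{-z\Lambda_T(x)}\, \frac{\alpha^{\alpha} z^{\alpha-1}}{\Gamma(\alpha)}\, e^{-\alpha z}\,dz = \frac{\alpha^{\alpha}\lambda_T(x)}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha+1)}{(\alpha + \Lambda_T(x))^{\alpha+1}} = \frac{\alpha^{\alpha+1}\lambda_T(x)}{(\alpha + \Lambda_T(x))^{\alpha+1}},$$

where the last equality uses Γ(α + 1) = αΓ(α). The same computation without the factor $z\lambda_T(x)$ gives the unconditional reliability function $S_X(x) = \left(\alpha / (\alpha + \Lambda_T(x))\right)^{\alpha}$.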

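Thus, if the model is fully parametric (i.e., if $\lambda_T$ belongs to a parametric family of
hazard functions $\mathcal{H} = \{\lambda(\cdot \mid \theta);\ \theta \in \Theta \subset \mathbb{R}^d\}$), then (α, θ) are estimated by:

$$(\hat\alpha, \hat\theta) = \arg\max_{(\alpha,\theta) \in (0,+\infty)\times\Theta}\ \prod_{i=1}^{n} \frac{\alpha^{\alpha+1}\lambda(X_i \mid \theta)}{(\alpha + \Lambda(X_i \mid \theta))^{\alpha+1}}.$$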

The asymptotic properties of the estimators of α and θ are studied in [8] using the
theory of martingales for counting processes in the right censoring setup. The
semi-parametric joint estimation of $(\alpha, \Lambda_T)$ has been studied in [22,23]. Frailty models are
an interesting way to account for population heterogeneity. By introducing a known
correlation structure between the frailty random variables, it is possible to construct
homogeneity tests based on an approximation of the score function (see, e.g., [24]).
In the case where Z is a positive discrete random variable taking values in
$\{z_1,\dots,z_d\} \subset (0,+\infty)$ for some integer d ≥ 2, with $\Pr(Z = z_i) = p_i \in (0,1)$, the
reliability function $S_X$ of X is defined by:

$$S_X(x) = \sum_{i=1}^{d} p_i\, [S_T(x)]^{z_i}.$$

It means that the PDF $f_X$ is a convex linear combination of d PDFs that are nothing
but the conditional PDFs of X given $Z = z_i$. This model is a special case of the finite
mixture models that we discuss in the next section.

11.3.2 Finite Mixture Models


Finite mixture models have been discussed widely in the literature; for an
overview of the theory and applications of modeling via finite mixture distributions, we refer
to [9]. Basically a duration of interest T has a finite mixture distribution if its PDF
can be written:

$$f_T(x) = \sum_{j=1}^{d} p_j f_j(x)$$

where the $p_j$'s are non-negative and sum to one and the $f_j$'s are PDFs. A latent variable
representation of T is possible in the sense that if $T_1,\dots,T_d$ and Z are d + 1 random
variables such that $T_j$ has PDF $f_j$ for 1 ≤ j ≤ d, $Z \in \{1,\dots,d\}$ with $\Pr(Z = j) = p_j$ for
1 ≤ j ≤ d, then if Z and $(T_1,\dots,T_d)$ are independent, T and $T_Z$ have the same PDF and
thus the same distribution. T can be seen as the lifetime of an individual chosen at
random within d populations, where the proportion of individuals coming from the
ith population is $p_i$ and the lifetimes coming from the ith population are
homogeneous with PDF $f_i$. Sometimes we are interested in estimating the distributions of
the d sub-populations, that is, the distributions of the $T_i$'s.
Of course, if the latent variable Z is observed and if $S_j$ denotes the reliability
function of $T_j$ for 1 ≤ j ≤ d, then based on n i.i.d. copies $\{(T_i, Z_i)\}_{1\le i\le n}$ of (T, Z):

$$\hat S_j(x) = \frac{\sum_{i=1}^{n} \mathbf{1}(T_i \ge x;\, Z_i = j)}{\sum_{i=1}^{n} \mathbf{1}(Z_i = j)} \quad\text{and}\quad \hat p_j = \frac{\sum_{i=1}^{n} \mathbf{1}(Z_i = j)}{n},$$

are obviously consistent nonparametric estimators of $S_j$ and $p_j$ for 1 ≤ j ≤ d.


The problem becomes more intricate when the component information Z is no
longer available. The first problem we have to face is an identifiability issue: can we
recover $p = (p_1,\dots,p_d)$ and $f = (f_1,\dots,f_d)$ from the knowledge of $f_T$? Without
additional constraints on the $f_i$'s the answer is generally negative; indeed, for d = 2
suppose that there exist two PDFs $f_1$ and $f_2$ such that:

$$f_T = p f_1 + (1 - p) f_2$$

with p ∈ (0,1), $f_1 = \alpha g_1 + (1-\alpha) g_2$ and $f_2 = \beta g_1 + (1-\beta) g_2$ where $g_i \ne f_j$ for
1 ≤ i, j ≤ 2. Then the PDF $f_T$ admits another representation:

$$f_T = (p\alpha + (1-p)\beta)\, g_1 + (p(1-\alpha) + (1-p)(1-\beta))\, g_2$$

which shows that semi-parametric identifiability fails. It is not possible to obtain
identifiability in the semi-parametric setup without additional constraints on the
sub-distribution functions $f_i$. See [25] for a discussion of this problem in the setup of
right-censored data. In the case of a mixture of parametric lifetime distributions, that
is when:

$$\forall j \in \{1,\dots,d\},\quad f_j \in \mathcal{F} = \{f(\cdot \mid \theta);\ \theta \in \Theta \subset \mathbb{R}^k\},$$

we have $f_T(x) \equiv f_T(x; p, \theta)$ with $\theta = (\theta_1,\dots,\theta_d)$ and for all $x \in \mathbb{R}$:

$$f_T(x; p, \theta) = \sum_{j=1}^{d} p_j f(x \mid \theta_j).$$

Hence, the identifiability condition becomes: $f_T(x; p, \theta) = f_T(x; p', \theta')$ for all $x \in \mathbb{R}$
implies (p, θ) = (p′, θ′). Classical identifiability conditions may be found in [26,27].
Additional identifiability conditions related to mixtures of classical parametric
lifetime distributions are given in [28].
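The non-identifiability argument for d = 2 can be checked numerically. The sketch below is an illustration only: the choice of exponential densities for g_1 and g_2 and the numerical values of p, α and β are assumptions made for the example, and all function names are hypothetical.

```python
import math

def exp_pdf(rate):
    """PDF of an exponential distribution with the given rate."""
    return lambda x: rate * math.exp(-rate * x) if x > 0 else 0.0

def mix(weight, f, g):
    """Two-component mixture weight * f + (1 - weight) * g."""
    return lambda x: weight * f(x) + (1.0 - weight) * g(x)

# Illustrative choices (not from the text): g1, g2 exponential, arbitrary weights.
g1, g2 = exp_pdf(1.0), exp_pdf(3.0)
p, alpha, beta = 0.4, 0.7, 0.2

f1 = mix(alpha, g1, g2)          # f1 = alpha g1 + (1 - alpha) g2
f2 = mix(beta, g1, g2)           # f2 = beta g1 + (1 - beta) g2
f_T = mix(p, f1, f2)             # first representation of f_T

w = p * alpha + (1.0 - p) * beta
f_T_alt = mix(w, g1, g2)         # second representation, in terms of g1 and g2

max_gap = max(abs(f_T(x) - f_T_alt(x)) for x in (0.1, 0.5, 1.0, 2.0, 5.0))
```

Both representations give the same density at every point, so the data alone cannot tell (p, f_1, f_2) apart from (w, g_1, g_2) unless the component family is restricted.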

In the case of right censoring and left truncation, instead of observing T
we observe (L, X, Δ) where $X = T_Z \wedge C = T \wedge C \ge L$ and Δ = 1(T ≤ C), with C a
right censoring time and L a left truncation time, both independent of the label
random variable Z and the lifetime T. The authors in [29] have shown that it is possible
to use the EM–algorithm to estimate the unknown model parameters based on n
i.i.d. realizations of (L, X, Δ). However, in the discussion of this paper, [30]
mentioned that the EM–algorithm may be trapped by a local maximum, and the
stochastic EM–algorithm was proposed in [31] as an alternative estimation method.
Here we recall the stochastic EM–algorithm principle and we show that it can be
easily extended to the case of parametric mixtures when data are right censored
and left truncated. Let us write $l = (l_1,\dots,l_n)$, $x = (x_1,\dots,x_n)$ and $\delta = (\delta_1,\dots,\delta_n)$
where $(l, x, \delta) = ((l_1, x_1, \delta_1),\dots,(l_n, x_n, \delta_n))$ are n i.i.d. realizations of (L, X, Δ) and for
1 ≤ i ≤ n we have $x_i = t_i \wedge c_i$. Let us write $t = (t_1,\dots,t_n)$. For the sake of simplicity we
denote, for 1 ≤ k ≤ d, by $S(\cdot \mid \theta_k)$ the reliability function of $T_k$ and by $\lambda(\cdot \mid \theta_k)$ its hazard rate
function; then it is not difficult to check that for 1 ≤ k ≤ d we have:

$$h(k, l, x, \delta; p, \theta) = \Pr(Z = k \mid (L, X, \Delta) = (l, x, \delta)) = \frac{p_k\, \lambda(x \mid \theta_k)^{\delta}\, S(x \mid \theta_k) / S(l \mid \theta_k)}{\sum_{j=1}^{d} p_j\, \lambda(x \mid \theta_j)^{\delta}\, S(x \mid \theta_j) / S(l \mid \theta_j)}.$$

It is important to note that the above probability does not depend on the distribution
of L and C, thus it is possible to estimate both p and θ following the method of [25].
Like the EM–algorithm, the stochastic EM–algorithm is an iterative algorithm
which requires an initial value for the unknown parameter θ, say $\theta^0$,
and which allows us to obtain iterates $(p^s, \theta^s)_{s\ge 1}$. Indeed, let $(p^s, \theta^s)$ be the current
value of the unknown parameters; the next value $(p^{s+1}, \theta^{s+1})$ is derived in the
following way:

1. Expectation step: For i = 1,…,n and j = 1,…,d calculate:

$$p_{ij}^{s} = h(j, l_i, x_i, \delta_i; p^{s}, \theta^{s}).$$

2. Stochastic step: For i = 1,…,n simulate a realization $z_i^s$ of a random variable
taking the value $j \in \{1,\dots,d\}$ with probability $p_{ij}^s$, and define for j = 1,…,d:

$$\mathcal{X}_j^{s} = \{i \in \{1,\dots,n\};\ z_i^{s} = j\}.$$

3. Maximization step: For j = 1,…,d we set:

$$p_j^{s+1} = \frac{\operatorname{Card}(\mathcal{X}_j^{s})}{n},$$

and for j = 1,…,d we have:

$$\theta_j^{s+1} = \arg\max_{\theta \in \Theta}\ \ell_j(\theta \mid (l, x, \delta)),$$

where:

$$\ell_j(\theta \mid (l, x, \delta)) = \sum_{i \in \mathcal{X}_j^{s}} \left\{ \delta_i \log \lambda(x_i \mid \theta) - \int_{l_i}^{x_i} \lambda(x \mid \theta)\,dx \right\}.$$

In the special case of a mixture of exponential distributions, that is when
$\mathcal{F} = \{x \mapsto f(x \mid \theta) = \theta\exp(-\theta x)\mathbf{1}(x > 0);\ \theta \in (0,+\infty)\}$, it is straightforward to see that
for j = 1,…,d we have:

$$\theta_j^{s+1} = \frac{\sum_{i \in \mathcal{X}_j^{s}} \delta_i}{\sum_{i \in \mathcal{X}_j^{s}} (x_i - l_i)}.$$

Obtaining an initial guess $\theta^0$ may be a tricky problem; see [25] for discussion and
comments about the initialization of the stochastic EM–algorithm. There are several
ways to construct the final estimate based on K iterations of the algorithm. The most
classical one, because the sequence $(p^s, \theta^s)_{s\ge 1}$ is a Markov chain, consists in taking
the ergodic mean of the iterates, that is:

$$\hat p = \frac{1}{K}\sum_{s=1}^{K} p^{s} \quad\text{and}\quad \hat\theta = \frac{1}{K}\sum_{s=1}^{K} \theta^{s}.$$

The asymptotic properties of these estimators have been studied in [32]. It may be
more stable to replace the current value of the parameters obtained at step s of the
preceding stochastic EM–algorithm by the average of the estimates obtained along
the first s − 1 iterations. However, this comes at the cost of losing the Markov
property of $(p^s, \theta^s)_{s\ge 1}$.
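For the exponential mixture, the three steps have closed forms and the whole procedure fits in a short script. The following is a minimal sketch in plain Python, not the code of [25] or [31]: the function names, the burn-in period before averaging, and the rule of keeping the previous rate when a component receives no event are assumptions made for the illustration.

```python
import math
import random

def membership_probs(l, x, delta, p, theta):
    """h(k, l, x, delta; p, theta) for exponential components with rates theta."""
    weights = [
        p_k * (th_k ** delta) * math.exp(-th_k * (x - l))
        for p_k, th_k in zip(p, theta)
    ]
    total = sum(weights)
    return [w / total for w in weights]

def stochastic_em(data, p0, theta0, n_iter=300, burn_in=100, seed=0):
    """data: list of (l_i, x_i, delta_i). Returns averaged (p, theta)."""
    rng = random.Random(seed)
    d, n = len(p0), len(data)
    p, theta = list(p0), list(theta0)
    p_sum, theta_sum = [0.0] * d, [0.0] * d
    for s in range(n_iter):
        groups = [[] for _ in range(d)]
        for l, x, delta in data:
            probs = membership_probs(l, x, delta, p, theta)  # expectation step
            z = rng.choices(range(d), weights=probs)[0]      # stochastic step
            groups[z].append((l, x, delta))
        for j, group in enumerate(groups):                   # maximization step
            p[j] = len(group) / n
            events = sum(delta for _, _, delta in group)
            exposure = sum(x - l for l, x, _ in group)
            if events > 0 and exposure > 0:
                theta[j] = events / exposure
            # otherwise keep the previous rate (assumption of this sketch)
        if s >= burn_in:
            for j in range(d):
                p_sum[j] += p[j]
                theta_sum[j] += theta[j]
    kept = n_iter - burn_in
    return [v / kept for v in p_sum], [v / kept for v in theta_sum]
```

Note that a component whose simulated group becomes empty gets weight zero and can never be revisited, which is one reason why the choice of the initial value matters in practice.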

11.3.3 Cure Models
Cure models are special cases of duration models; Boag [33] was among the first
to consider a population of patients containing a cured fraction. He used a mixture
model to fit a data set from a follow-up study of breast cancer patients and estimated the
cured fraction by the maximum likelihood method. As previously stated, the specificity
of cure models comes from the fact that a fraction of subjects in the population will
never experience the event of interest. This is the reason why most cure
models are special cases of mixture models where the time of interest T has the
following distribution:

$$T \sim (1 - p) P_0 + p\,\delta_{\infty},$$

where p ∈ [0,1], $P_0$ is the probability measure of a non-negative random variable
and $\delta_{\infty}$ is the Dirac measure at {+∞}. It means that if Y is a Bernoulli latent
random variable with probability of success p and $T_0$ a non-negative random variable,
independent of Y, with cumulative distribution function $F_0(t) = P_0([0, t])$, then
T has the same distribution as $(1 - Y) \times T_0 + Y \times \{+\infty\}$. Hence, the distribution of
T is improper in the sense that its cumulative distribution function $F_T$ satisfies
$F_T(t) = (1 - p) P_0([0, t]) + p\,\delta_{\infty}([0, t]) = (1 - p) F_0(t) \to 1 - p$ as t tends to +∞. The
parameter p corresponds to the fraction of cured patients.
Now considering right censoring means that we observe (X, Δ) = (T ∧ C, 1(T ≤ C))
instead of T, where C is a random or deterministic right censoring time. In addition
to (X, Δ), an $\mathbb{R}^p$-valued covariate vector $Z = (Z_1,\dots,Z_p)$ may be observed, and if we
consider that $T_0$ and C are independent conditionally on Z and that conditionally on
Z = z we have Y ∼ B(p(z)), then the conditional cumulative distribution function of T
given Z is defined by:

$$F_{T|Z}(t \mid z) = (1 - p(z)) P_0([0, t] \mid z) + p(z)\,\delta_{\infty}([0, t]) = (1 - p(z)) F_0(t \mid z).$$

Because Pr(C < +∞) = 1, the event {T = +∞} will never be observed since X ≤ C
with probability one. Concerning the probability of being cured, a logistic regression
model is generally assumed (see [34]):

$$p(z \mid \gamma_0, \gamma) = \frac{\exp(\gamma_0 + \gamma^T z)}{1 + \exp(\gamma_0 + \gamma^T z)}.$$

Concerning the distribution of $T_0$ conditionally on Z = z, parametric and
semi-parametric approaches are available. A review of the most standard models and
software is available in [35]. Let us look at the general principle of implementation
of a stochastic EM–algorithm for a cure model. First let us write θ for the model
parameter, which may include functional parameters, and $(x_i, \delta_i, z_i)_{1\le i\le n}$ for the
observed data:

Step 1: Find an initial guess $\theta^{(0)}$ for θ.
Step 2: Update the current value $\theta^{(k)}$ to $\theta^{(k+1)}$:
a. For $i \in \{1,\dots,n\}$ simulate a realization $y_i^{(k)}$ using the law of Y
conditionally on $(X, \Delta, Z) = (x_i, \delta_i, z_i)$ and $\theta = \theta^{(k)}$;
b. Based on the augmented data $(x_i, \delta_i, y_i^{(k)}, z_i)_{1\le i\le n}$ calculate $\theta^{(k+1)}$ by
an appropriate method.
Step 3: Based on the iterates $\theta^{(1)}, \theta^{(2)},\dots,\theta^{(K)}$ obtained by repeatedly using
Step 2 above, derive an estimate $\hat\theta$ of θ.

Writing $S_0(x \mid z)$ for the survival function of $T_0$ conditionally on Z = z, it is
easy to check that:

$$q_\theta(x, \delta, z) \equiv P_\theta\left(Y = 1 \mid (X, \Delta, Z) = (x, \delta, z)\right) = \frac{p(z \mid \gamma_0, \gamma)(1 - \delta)}{p(z \mid \gamma_0, \gamma) + (1 - p(z \mid \gamma_0, \gamma))\, S_0(x \mid z)}.$$

It is important to note that this conditional probability does not depend on the distri-
bution of the censoring variable. This fact is essential because it allows considering
the distribution of C as a nuisance parameter in the model.

Example 11.1

Let us assume that the covariate Z is real-valued and that conditionally on Z = z
the random time $T_0$ follows an exponential proportional hazards model with
conditional hazard rate function defined by $\lambda_0(z \mid \beta) = \exp(\beta_0 + \beta_1 z)$. Then setting
$\gamma = (\gamma_0, \gamma_1)$, $\beta = (\beta_0, \beta_1)$, and $\theta = (\gamma_0, \gamma_1, \beta_0, \beta_1) \in \mathbb{R}^4$, we have:

$$q_\theta(x, \delta, z) = \frac{(1 - \delta)\exp(\gamma_0 + \gamma_1 z)}{\exp(\gamma_0 + \gamma_1 z) + \exp\left(-x e^{\beta_0 + \beta_1 z}\right)}.$$

Thus, given $\theta^{(k)} = (\gamma_0^{(k)}, \gamma_1^{(k)}, \beta_0^{(k)}, \beta_1^{(k)})$, the kth iterate of θ, for the simulation Step 2a
we have for 1 ≤ i ≤ n:

$$y_i^{(k)} \sim B\left(q_{\theta^{(k)}}(x_i, \delta_i, z_i)\right),$$

while the updating Step 2b is as follows:

$$\gamma^{(k+1)} = \arg\max_{\gamma \in \mathbb{R}^2} \sum_{i=1}^{n} \left\{ y_i^{(k)}(1 - \delta_i)\log p(z_i \mid \gamma) + \left(1 - y_i^{(k)}\right)\log\left(1 - p(z_i \mid \gamma)\right) \right\},$$

and:

$$\beta^{(k+1)} = \arg\max_{\beta \in \mathbb{R}^2} \sum_{i=1}^{n} \left\{ \left(1 - y_i^{(k)}\right)\delta_i(\beta_0 + \beta_1 z_i) - \left(1 - y_i^{(k)}\right) x_i e^{\beta_0 + \beta_1 z_i} \right\}.$$

Assuming that K iterates have been obtained, the final estimate of Step 3 may be
obtained by averaging the iterates, that is $\hat\theta = K^{-1}\sum_{k=1}^{K} \theta^{(k)}$.
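The simulation Step 2a of Example 11.1 only requires the conditional cure probability and one Bernoulli draw per subject. The following is a minimal sketch in plain Python under the assumptions of the example (logistic cure probability, exponential proportional hazards for the uncured); the function names are hypothetical, and the maximization Step 2b, which needs a numerical optimizer, is not shown.

```python
import math
import random

def cure_probability(z, gamma0, gamma1):
    """Logistic model p(z | gamma) for the probability of being cured."""
    return 1.0 / (1.0 + math.exp(-(gamma0 + gamma1 * z)))

def q_theta(x, delta, z, theta):
    """Conditional probability that Y = 1 (cured) given (x, delta, z)."""
    gamma0, gamma1, beta0, beta1 = theta
    p = cure_probability(z, gamma0, gamma1)
    # Survival function of T0 given z under the exponential PH model.
    s0 = math.exp(-x * math.exp(beta0 + beta1 * z))
    return (1 - delta) * p / (p + (1.0 - p) * s0)

def simulate_cure_labels(data, theta, rng):
    """Step 2a: draw y_i ~ Bernoulli(q_theta(x_i, delta_i, z_i)) for each subject."""
    return [
        1 if rng.random() < q_theta(x, delta, z, theta) else 0
        for x, delta, z in data
    ]
```

A subject with an observed event (δ = 1) is never labeled as cured, while a subject censored late is labeled as cured with a probability close to one.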

11.3.4 Excess Hazard Rate Models


Excess hazard rate models are used in cancer epidemiology studies to evaluate the
excess of risk due to the disease. Generally, considering that an individual is
diagnosed at age a > 0, his or her risk or hazard function $\lambda_{obs}(t)$ at time t ≥ 0 is $\lambda_{pop}(a + t) + \lambda_{exc}(t)$,
where $\lambda_{pop}$ is the known population risk given by life tables while $\lambda_{exc}$ is an unknown
additional risk due to the disease. In addition, the probability p that the individual
does not die from the disease is generally nonzero, resulting in an improper excess
risk function $\lambda_{exc}$ connected to p through the relationship:

$$p = \exp\left(-\int_0^{+\infty} \lambda_{exc}(s)\,ds\right).$$

Of course, in such a model the population risk and the excess risk may depend on
covariates and data are generally incomplete including, for instance, right
censoring. For example, a proportional hazards model on the excess risk function
allows us to include covariate effects (see [36] for an efficient semi-parametric
estimator).
Let us see that it is possible to obtain a latent variable representation for a time to
event T whose hazard rate function is $\lambda_{obs}$. Indeed, let us introduce the random
variable A corresponding to the age at which the individual is diagnosed. Then let
Z be a Bernoulli random variable with probability of success p ∈ [0,1], $T_\infty = +\infty$, $T_p$
a positive random variable with hazard rate function $\lambda_{pop}$, and $T_0$ a positive random
variable with hazard rate function $\lambda_0$. Assume, in addition, that conditionally on A
the random variables Z, $T_p$, and $T_0$ are independent; then conditionally on A = a, the
hazard rate of the random variable:

$$\{Z \times T_\infty + (1 - Z) \times T_0\} \wedge \{T_p - A\}$$

is $\lambda_{obs}$ whenever we have for all t ≥ 0:

$$\Lambda_0(t) = \int_0^t \lambda_0(s)\,ds = -\log\left(\frac{e^{-\Lambda_{exc}(t)} - p}{1 - p}\right),$$

where $\Lambda_{exc}(t) = \int_0^t \lambda_{exc}(s)\,ds$. It is interesting to note that the excess hazard rate model is
close to the competing risk model. Indeed, if $T_1 = Z \times T_\infty + (1 - Z) \times T_0$ and $T_2 = T_p - A$,
we observe the smallest lifetime $T = T_1 \wedge T_2$ and the lack of information about the
component failure (here $\mathbf{1}(T_1 \le T_2)$ is not observed) is compensated by the
assumption that conditionally on A, the distribution of $T_2$ is known.
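A short verification of this representation may help. It is a sketch that additionally assumes $T_p > a$ on the event A = a (an assumption not stated above), so that $T_2 = T_p - a$ is a positive residual lifetime, and it writes $S_{pop}$ for the survival function associated with $\lambda_{pop}$. With the stated form of $\Lambda_0$:

$$\Pr(T_1 > t) = p + (1 - p)\,e^{-\Lambda_0(t)} = p + \left(e^{-\Lambda_{exc}(t)} - p\right) = e^{-\Lambda_{exc}(t)},$$

$$\Pr(T_2 > t \mid A = a,\ T_p > a) = \frac{S_{pop}(a + t)}{S_{pop}(a)}.$$

Hence $T_1$ has hazard rate $\lambda_{exc}(t)$ and $T_2$ has hazard rate $\lambda_{pop}(a + t)$; by conditional independence the hazard rate of $T_1 \wedge T_2$ is the sum $\lambda_{exc}(t) + \lambda_{pop}(a + t) = \lambda_{obs}(t)$. Letting t tend to +∞ in the first display recovers $p = \exp(-\int_0^{+\infty}\lambda_{exc}(s)\,ds)$.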
There is a large amount of literature about parametric, semi-parametric, and
non-parametric estimation of these models. In addition, a major difficulty comes
from the heterogeneity of the observed $T_p$, which generally depends on covariates
that include the age at diagnosis. See, for example, [37] for a recent discussion about
this issue.
Here, for simplicity, we consider that $\lambda_{pop}$ is homogeneous; more precisely,
it does not depend on the age at diagnosis. Let $S_{obs}$ (resp. $S_{exc}$ and $S_{pop}$) be
the survival function associated with the hazard rate function $\lambda_{obs}$ (resp. $\lambda_{exc}$ and $\lambda_{pop}$).
It is straightforward to check that if A = a:

$$S_{obs}(t) = S_{exc}(t) \times S_{pop}(t + a) \quad\text{for all } t \ge 0.$$

Hence, based on n i.i.d. copies $(T^{(i)})_{i=1,\dots,n}$ of T and assuming that all the individuals
are diagnosed at the same age a, the empirical estimator of $S_{obs}$ is defined by:

$$\hat S_{obs}(t) = \frac{1}{n}\sum_{i=1}^{n} Y_i(t),$$

where $Y_i(t) = \mathbf{1}(T^{(i)} \ge t)$ and thus $S_{exc}$ is naturally estimated by:

$$\hat S_{exc}(t) = \frac{\hat S_{obs}(t)}{S_{pop}(a + t)}.$$

In this very simple case the asymptotic properties of $\hat S_{exc}$ are easy to obtain. Suppose
now that the age at diagnosis varies from one individual to another, and let us write $a_i$
for the age at diagnosis of the ith individual. It is well known (see [8]) that the intensity
process of the counting process $N(t) = \sum_{i=1}^{n} \mathbf{1}(T^{(i)} \le t)$ is:

$$\sum_{i=1}^{n} Y_i(t)\left(\lambda_{exc}(t) + \lambda_{pop}(a_i + t)\right).$$

By a method-of-moments approach we derive an estimator $\hat\Lambda_{exc}$ defined by:

$$\hat\Lambda_{exc}(t) = \int_0^t \frac{\sum_{i=1}^{n}\left\{dN_i(s) - Y_i(s)\lambda_{pop}(a_i + s)\,ds\right\}}{\sum_{i=1}^{n} Y_i(s)},$$

where $N_i(t) = \mathbf{1}(T^{(i)} \le t)$. This estimator is known as the Ederer II estimator of the cumulative excess risk
function (see [38] for more general non-parametric estimators of the cumulative
excess risk function). Note that in the case of right censoring, that is, if instead of
observing $(T^{(i)}, a_i)_{i=1,\dots,n}$ we observe $(X^{(i)}, \Delta_i, a_i)_{i=1,\dots,n}$ where $\Delta_i = 1$ if $X^{(i)} = T^{(i)}$,
and $\Delta_i = 0$ if $X^{(i)} < T^{(i)}$, then this estimator is still valid, simply replacing $N_i(t)$ by
$N_i(t) = \mathbf{1}(T^{(i)} \le t;\, \Delta_i = 1)$ and $Y_i(t)$ by $\mathbf{1}(X^{(i)} \ge t)$ for 1 ≤ i ≤ n.
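Because the risk set is constant between two consecutive observed times, the integral defining the estimator reduces to a finite sum. The sketch below, in plain Python, illustrates this under right censoring; the function name is hypothetical, the population cumulative hazard is passed as a user-supplied function (an assumption of this sketch, standing in for a life table), and the computation stops once the risk set is empty.

```python
def ederer2_cumulative_excess(times, events, ages, cum_pop_hazard, t):
    """Ederer II type estimate of the cumulative excess hazard at time t.

    times[i]  : observed time X_i (event or censoring)
    events[i] : 1 if X_i is an event time, 0 if censored
    ages[i]   : age a_i at diagnosis
    cum_pop_hazard(u) : integral of lambda_pop from 0 to age u
    """
    # Interval end points: distinct observed times before t, then t itself.
    cuts = sorted({x for x in times if x < t} | {t})
    total, lower = 0.0, 0.0
    for upper in cuts:
        # Individuals at risk on the whole interval (lower, upper].
        risk = [i for i, x in enumerate(times) if x >= upper]
        if not risk:
            break  # estimator not defined beyond the last observation
        n_risk = len(risk)
        # Expected population hazard accumulated by the risk set.
        pop = sum(
            cum_pop_hazard(ages[i] + upper) - cum_pop_hazard(ages[i] + lower)
            for i in risk
        )
        # Observed events at the right end point of the interval.
        deaths = sum(1 for i in risk if times[i] == upper and events[i] == 1)
        total += (deaths - pop) / n_risk
        lower = upper
    return total
```

With a constant population hazard the result equals the Nelson–Aalen estimate of the observed cumulative hazard minus the population hazard multiplied by t, which gives a simple sanity check.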

11.4 LATENT VARIABLE OR PROCESS MODELS FOR HANDLING SPECIFIC PHENOMENA
In this section, we give a few examples of time-dependent models that describe
the degradation of a system and where the latent variable may depend on time.
There are a large number of stochastic degradation models; here we focus on Gamma
processes. To be more specific, $X = (X_t)_{t\ge 0}$ is a Gamma process with scale
parameter b > 0 and continuous and non-decreasing shape function $a : \mathbb{R}_+ \to \mathbb{R}_+$ with
a(0) = 0 if X is a random process with independent Gamma distributed
increments with common scale parameter b > 0 such that $X_0 = 0$ almost surely and
$X_t - X_s \sim \Gamma(a(t) - a(s), b)$ for every 0 ≤ s < t, where for α > 0 and β > 0 we denote by
Γ(α, β) the Gamma distribution with PDF:

$$f_{\Gamma(\alpha,\beta)}(x) = \frac{\beta^{\alpha} x^{\alpha-1}\exp(-\beta x)}{\Gamma(\alpha)}\,\mathbf{1}(x \ge 0).$$

Note that if the shape function satisfies a(t) = at, then the Gamma process X is
homogeneous since for s ≥ 0 and t ≥ 0, the distribution of $X_{t+s} - X_t$ is nothing but the
Γ(as, b) distribution, which hence does not depend on t.
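The definition through independent Gamma increments gives a direct way to simulate a path on a time grid. The sketch below uses only the Python standard library; the function name is hypothetical. Note that in the PDF above b multiplies x in the exponential, whereas `random.gammavariate` takes a scale argument, so the code passes 1/b.

```python
import random

def simulate_gamma_process(times, shape, rate, rng):
    """Simulate X at the given increasing times (times[0] should be 0).

    shape : non-decreasing function a(t) with a(0) = 0
    rate  : parameter b of the Gamma process (b multiplies x in the PDF)
    """
    path = [0.0]
    for t_prev, t_next in zip(times, times[1:]):
        da = shape(t_next) - shape(t_prev)
        # Increment X_t - X_s ~ Gamma(a(t) - a(s), b); gammavariate uses scale = 1 / b.
        increment = rng.gammavariate(da, 1.0 / rate) if da > 0 else 0.0
        path.append(path[-1] + increment)
    return path

# Usage: homogeneous process with a(t) = 2 t and b = 4 on the grid 0, 1, ..., 5.
example_path = simulate_gamma_process(
    [0, 1, 2, 3, 4, 5], lambda t: 2.0 * t, 4.0, random.Random(0)
)
```

The simulated paths are non-decreasing by construction, and averaging many paths at time t should give a value close to a(t)/b.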

11.4.1 Gamma Degradation Model with Random Initial Time


In [39] a stochastic model is introduced for a component that deteriorates over time.
The deterioration is due to defects that appear one by one and then propagate
independently over time. The model is intended for passive components within electric
power plants, where (measurable) flaw indications first initiate (one at a time) and then grow over time.
The available data come from inspections at discrete times, where only the largest
flaw indication is measured together with the total number of indications on each
component. As a consequence the model of [39] can be seen as a competing
degradation model. Let us describe here a simpler model with a single degradation trajectory
that initiates at the random time T, which is a latent variable. For example, we may
consider that $Y_t$ is the length at time t of a crack that appears at time T ≥ 0, and
that at time t + T the length of the crack is $X_t$, where $X = (X_t)_{t\ge 0}$ is a Gamma
process with scale parameter b > 0 and continuous and non-decreasing shape
function $a(\cdot;\theta_1) : \mathbb{R}_+ \to \mathbb{R}_+$ where $a(0;\theta_1) = 0$ with $\theta_1 \in \Theta_1 \subset \mathbb{R}^p$. Then $Y_t = X_{(t-T)^+}$ where
$x^+ = \max(0, x)$. Assuming that T has a PDF $f_T(\cdot;\theta_2)$ with $\theta_2 \in \Theta_2 \subset \mathbb{R}^q$, the random
variable $Y_t$ has the PDF:

$$f_{Y_t}(y;\theta) = \left(1 - F_T(t;\theta_2)\right)\delta_0(y) + \int_0^t \frac{b^{a(t-s;\theta_1)}\, y^{a(t-s;\theta_1)-1}\exp(-by)}{\Gamma(a(t-s;\theta_1))}\, f_T(s;\theta_2)\,ds$$

with respect to the sum of the Dirac measure $\delta_0$ at 0 and the Lebesgue measure
dy on $\mathbb{R}$, where $\theta = (\theta_1, \theta_2, b)$. When N i.i.d. copies $(Y^{(k)})_{k=1,\dots,N}$ of the delayed
degradation process $Y = (Y_t)_{t\ge 0}$ are observed at times $0 = t_{k0} < t_{k1} < \cdots < t_{kn_k}$ for
k = 1,…,N, it is possible to derive the joint distribution of $(Y^{(k)}_{t_{k1}},\dots,Y^{(k)}_{t_{kn_k}})$ to apply
a maximum likelihood principle. However, due to numerical instabilities the
maximization of the associated log-likelihood function is a tricky problem. An
alternative is the estimation method based on the pseudo-likelihood (or composite likelihood);
see, e.g., [40]. It simply consists in maximizing:

$$\ell(\theta) = \sum_{k=1}^{N}\sum_{i=1}^{n_k} \log f_{Y_{t_{ki}}}(y_{ki};\theta),$$

where for 1 ≤ k ≤ N and 1 ≤ i ≤ $n_k$, $y_{ki}$ is the observation of $Y^{(k)}_{t_{ki}}$. In other words, the
pseudo-likelihood method consists in proceeding as if the random variables $Y^{(k)}_{t_{ki}}$ were
independent; this simplifies the calculation of the log-likelihood at the price of a loss
of efficiency. See [39] for an application to competing degradation processes.

11.4.2 Gamma Degradation Model with Frailty Scale Parameter


In [41] fatigue crack propagation is considered where the crack growth is
described by a non-homogeneous Gamma process whose scale parameter is a
frailty variable. Let us consider a degradation process $X = (X_t)_{t\ge 0}$ observed at times
$0 = t_0 < t_1 < \cdots < t_n$. Let $a : \mathbb{R}_+ \to \mathbb{R}_+$ be a continuous and non-decreasing shape
function with a(0) = 0 and B a non-negative random variable with PDF $f_B$. Here we
assume that conditionally on B = b > 0 the random process X is a Gamma process
with shape function a and scale parameter b. It means that conditionally on B = b > 0
the increments $\Delta X_i = X_{t_i} - X_{t_{i-1}}$ for i = 1,…,n are independent with joint PDF:

$$f_{\Delta X_1,\dots,\Delta X_n \mid B}(\delta x_1,\dots,\delta x_n \mid b) = \prod_{i=1}^{n} \frac{(\delta x_i)^{\Delta a_i - 1}\, b^{\Delta a_i}\exp(-b\,\delta x_i)}{\Gamma(\Delta a_i)},$$

where $\Delta a_i = a(t_i) - a(t_{i-1})$. As a consequence the unconditional distribution of
$\Delta X_i = X_{t_i} - X_{t_{i-1}}$ for i = 1,…,n when B has a PDF $f_B$ is given by:

$$f_{\Delta X_1,\dots,\Delta X_n}(\delta x_1,\dots,\delta x_n) = \prod_{i=1}^{n} \int_0^{+\infty} \frac{(\delta x_i)^{\Delta a_i - 1}\, b^{\Delta a_i}\exp(-b\,\delta x_i)}{\Gamma(\Delta a_i)}\, f_B(b)\,db.$$

For the special case of Gamma frailties, that is when B ∼ Γ(α, β), we obtain:

$$f_{\Delta X_1,\dots,\Delta X_n}(\delta x_1,\dots,\delta x_n) = \prod_{i=1}^{n} \frac{1}{\delta x_i + \beta}\left(\frac{\delta x_i}{\delta x_i + \beta}\right)^{\Delta a_i - 1}\left(\frac{\beta}{\delta x_i + \beta}\right)^{\alpha} \frac{\Gamma(\Delta a_i + \alpha)}{\Gamma(\Delta a_i)\Gamma(\alpha)}.$$

Now suppose that the shape function a depends on a Euclidean parameter
$\theta \in \Theta \subset \mathbb{R}^p$. Then, based on N i.i.d. copies $(X^{(k)})_{k=1,\dots,N}$ of X and writing $\delta x_{ij}$ for the
observation of $\Delta X_i^{(j)} = X^{(j)}_{t_i} - X^{(j)}_{t_{i-1}}$ for 1 ≤ j ≤ N and 1 ≤ i ≤ n, the likelihood function
is therefore defined by:

$$L(\theta, \alpha, \beta) = \prod_{j=1}^{N}\prod_{i=1}^{n} \frac{1}{\delta x_{ij} + \beta}\left(\frac{\delta x_{ij}}{\delta x_{ij} + \beta}\right)^{\Delta a_i(\theta) - 1}\left(\frac{\beta}{\delta x_{ij} + \beta}\right)^{\alpha} \frac{\Gamma(\Delta a_i(\theta) + \alpha)}{\Gamma(\Delta a_i(\theta))\Gamma(\alpha)},$$

where $\Delta a_i(\theta) = a(t_i;\theta) - a(t_{i-1};\theta)$.

11.4.3 Bivariate Gamma Degradation Models


In [42] the intervention scheduling of a railway track is discussed based on the
observation of two dependent randomly increasing deterioration indicators modeled
through a bivariate Gamma process $Y = (Y_t^{(1)}, Y_t^{(2)})_{t\ge 0}$ constructed by trivariate
reduction (see [43]). As we will see next, this construction is based, on the one hand,
on the property that the sum of two independent Gamma processes with common scale
is still a Gamma process and, on the other hand, on the fact that the components of the bivariate
process share a common latent Gamma process, which induces correlation
between the two marginal processes.
Now let us consider three independent Gamma processes $X^{(i)}$ for 0 ≤ i ≤ 2 with
scale parameter one and shape functions $\alpha_i : \mathbb{R}_+ \to \mathbb{R}_+$. The bivariate Gamma
process Y is defined by:

$$\begin{cases}
Y_t^{(1)} = \left(X_t^{(0)} + X_t^{(1)}\right)/b_1 \\
Y_t^{(2)} = \left(X_t^{(0)} + X_t^{(2)}\right)/b_2
\end{cases}$$

where $b_1$ and $b_2$ are two positive scale parameters. As a consequence Y has
independent increments and for i = 1, 2 the marginal process $(Y_t^{(i)})_{t\ge 0}$ is a Gamma process
with scale parameter $b_i$ and shape function $\alpha_0 + \alpha_i$. In addition it is straightforward
to check that we have for i = 1, 2:

$$\mathbb{E}\left(Y_t^{(i)}\right) = \frac{\alpha_0(t) + \alpha_i(t)}{b_i} \quad\text{and}\quad \operatorname{var}\left(Y_t^{(i)}\right) = \frac{\alpha_0(t) + \alpha_i(t)}{b_i^2},$$

and:

$$\operatorname{cov}\left(Y_t^{(1)}, Y_t^{(2)}\right) = \frac{\alpha_0(t)}{b_1 b_2}.$$

Now let us consider that the process Y is homogeneous (i.e., $\alpha_i(t) = \alpha_i t$ for
0 ≤ i ≤ 2) and observed at times $0 = t_0 < t_1 < t_2 < \cdots < t_n$, and let us define the
increments $\Delta t_j = t_j - t_{j-1}$ and $\Delta Y_j = (\Delta Y_j^{(1)}, \Delta Y_j^{(2)}) = (Y^{(1)}_{t_j} - Y^{(1)}_{t_{j-1}},\ Y^{(2)}_{t_j} - Y^{(2)}_{t_{j-1}})$
for 1 ≤ j ≤ n. It is easy to check that the bivariate Gamma process Y can be parametrized
equivalently by $(\alpha_0, \alpha_1, \alpha_2, b_1, b_2)$ or $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$ where for 1 ≤ j ≤ n:

$$\mu_i = \frac{\mathbb{E}\left(\Delta Y_j^{(i)}\right)}{\Delta t_j} = \frac{\alpha_0 + \alpha_i}{b_i} \quad\text{for } i = 1, 2,$$

$$\sigma_i^2 = \frac{\operatorname{var}\left(\Delta Y_j^{(i)}\right)}{\Delta t_j} = \frac{\alpha_0 + \alpha_i}{b_i^2} \quad\text{for } i = 1, 2,$$

$$\rho = \frac{\operatorname{cov}\left(\Delta Y_j^{(1)}, \Delta Y_j^{(2)}\right)}{\Delta t_j} = \frac{\alpha_0}{b_1 b_2}.$$
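Then, by the method of moments, $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$ is estimated by $(\hat\mu_1, \hat\mu_2, \hat\sigma_1^2, \hat\sigma_2^2, \hat\rho)$ where:

$$\hat\mu_i = \frac{\sum_{j=1}^{n}\left(Y^{(i)}_{t_j} - Y^{(i)}_{t_{j-1}}\right)}{\sum_{j=1}^{n} \Delta t_j} \quad\text{for } i = 1, 2,$$

$$\hat\sigma_i^2 = \frac{\sum_{j=1}^{n}\left(Y^{(i)}_{t_j} - Y^{(i)}_{t_{j-1}} - \hat\mu_i \Delta t_j\right)^2}{\sum_{j=1}^{n} \Delta t_j - \dfrac{\sum_{j=1}^{n} (\Delta t_j)^2}{\sum_{j=1}^{n} \Delta t_j}} \quad\text{for } i = 1, 2,$$

$$\hat\rho = \frac{\sum_{j=1}^{n}\left(\Delta Y_j^{(1)} - \hat\mu_1 \Delta t_j\right)\left(\Delta Y_j^{(2)} - \hat\mu_2 \Delta t_j\right)}{\sum_{j=1}^{n} \Delta t_j - \dfrac{\sum_{j=1}^{n} (\Delta t_j)^2}{\sum_{j=1}^{n} \Delta t_j}}.$$

These are unbiased estimators of $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$. Now, since $b_i = \mu_i/\sigma_i^2$ and $\alpha_0 = \rho\, b_1 b_2$, we have:

$$\begin{cases}
\alpha_0 = \dfrac{\rho\,\mu_1\mu_2}{\sigma_1^2\sigma_2^2} \\[2mm]
b_i = \dfrac{\mu_i}{\sigma_i^2} \quad\text{for } i = 1, 2 \\[2mm]
\alpha_i = \dfrac{\mu_i^2}{\sigma_i^2} - \dfrac{\rho\,\mu_1\mu_2}{\sigma_1^2\sigma_2^2} \quad\text{for } i = 1, 2
\end{cases}$$

we obtain:

$$\begin{cases}
\hat\alpha_0 = \dfrac{\hat\rho\,\hat\mu_1\hat\mu_2}{\hat\sigma_1^2\hat\sigma_2^2} \\[2mm]
\hat b_i = \dfrac{\hat\mu_i}{\hat\sigma_i^2} \quad\text{for } i = 1, 2 \\[2mm]
\hat\alpha_i = \dfrac{\hat\mu_i^2}{\hat\sigma_i^2} - \dfrac{\hat\rho\,\hat\mu_1\hat\mu_2}{\hat\sigma_1^2\hat\sigma_2^2} \quad\text{for } i = 1, 2
\end{cases}$$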

In [42] an alternative estimation method is proposed based on the maximum likelihood principle. Indeed, based on marginal observations $(\Delta Y_j^{(i)})_{1 \le j \le n}$, parameters $b_i$ are estimated using the maximum likelihood principle. Then, considering increments $\Delta X_j^{(0)} = X_{t_j}^{(0)} - X_{t_{j-1}}^{(0)}$ for $1 \le j \le n$ as hidden data, the authors develop an EM algorithm to estimate $(\alpha_0, \alpha_1, \alpha_2)$.
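
The moment method for the homogeneous bivariate Gamma process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the dictionary keys are hypothetical, the data are assumed to be observed levels of both marginal processes at common times, and the inversion uses the relations that follow from the moments given earlier ($b_i = \mu_i/\sigma_i^2$, $\alpha_0 = \rho\, b_1 b_2$, $\alpha_0 + \alpha_i = \mu_i^2/\sigma_i^2$).

```python
from typing import Dict, Sequence


def moment_estimates(t: Sequence[float], y1: Sequence[float],
                     y2: Sequence[float]) -> Dict[str, float]:
    """Moment estimates of (mu1, mu2, sigma1^2, sigma2^2, rho).

    t holds the observation times t_0 < ... < t_n; y1 and y2 hold the
    observed levels of the two marginal processes at those times.
    """
    if not (len(t) == len(y1) == len(y2)) or len(t) < 3:
        # With a single increment the common denominator below is zero.
        raise ValueError("need at least two increments of equal-length data")
    dt = [b - a for a, b in zip(t, t[1:])]
    if any(d <= 0 for d in dt):
        raise ValueError("observation times must be strictly increasing")
    dy1 = [b - a for a, b in zip(y1, y1[1:])]
    dy2 = [b - a for a, b in zip(y2, y2[1:])]

    total = sum(dt)
    denom = total - sum(d * d for d in dt) / total
    mu1 = sum(dy1) / total
    mu2 = sum(dy2) / total
    s1 = sum((a - mu1 * d) ** 2 for a, d in zip(dy1, dt)) / denom
    s2 = sum((a - mu2 * d) ** 2 for a, d in zip(dy2, dt)) / denom
    rho = sum((a - mu1 * d) * (b - mu2 * d)
              for a, b, d in zip(dy1, dy2, dt)) / denom
    return {"mu1": mu1, "mu2": mu2, "s1": s1, "s2": s2, "rho": rho}


def moments_to_parameters(mu1: float, mu2: float, s1: float, s2: float,
                          rho: float) -> Dict[str, float]:
    """Map (mu1, mu2, sigma1^2, sigma2^2, rho) to (alpha0, alpha1, alpha2, b1, b2)."""
    b1 = mu1 / s1                  # b_i = mu_i / sigma_i^2
    b2 = mu2 / s2
    alpha0 = rho * b1 * b2         # from rho = alpha0 / (b1 * b2)
    alpha1 = mu1 * b1 - alpha0     # mu_i^2 / sigma_i^2 = alpha0 + alpha_i
    alpha2 = mu2 * b2 - alpha0
    return {"alpha0": alpha0, "alpha1": alpha1, "alpha2": alpha2,
            "b1": b1, "b2": b2}
```

Plugging the output of `moment_estimates` into `moments_to_parameters` gives the plug-in estimates of the original parameters. Nothing in this sketch forces the estimated shape parameters to be positive, so a practical implementation would need to check the result.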

11.5  CONCLUDING REMARKS


In this chapter we presented a panel of latent variable models that are useful in reliability and survival analysis studies. We showed that a large variety of parametric, semi-parametric, or nonparametric estimation methods can be used to estimate the model parameters. In addition, when direct calculation of the likelihood function is mathematically too complicated, or numerically unachievable, estimation methods based on EM or stochastic EM algorithms, or estimation methods based on the pseudo-likelihood principle, may be interesting alternatives to classical estimation methods.

REFERENCES
1. B. Everitt. An Introduction to Latent Variable Models. Springer Monographs on
Statistics and Applied Probability, Chapman & Hall, London, UK, 2011.
2. D. Bartholomew, M. Knott, and I. Moustaki. Latent Variable Models and Factor
Analysis: A Unified Approach. Wiley Series in Probability and Statistics, 3rd ed. John
Wiley & Sons, Chichester, UK, 2011.
3. A.A. Beaujean. Latent Variable Modeling Using R: A Step-by-Step Guide. Taylor &
Francis Group, New York, 2014.
4. J.C. Loehlin and A.A. Beaujean. Latent Variable Models: An Introduction to Factor,
Path, and Structural Equation Analysis, 5th ed. Taylor & Francis Group, New York,
2016.
5. A.W. van der Vaart and J.A. Wellner. Weak Convergence and Empirical Processes.
Springer Series in Statistics, New York, 1996.
6. A.W. van der Vaart. Asymptotic Statistics. Cambridge University Press, New  York,
1998.
7. M. Kosorok. Introduction to Empirical Processes and Semiparametric Inference.
Springer Series in Statistics, New York, 2006.
8. P.K. Andersen, O. Borgan, R.D. Gill, and N. Keiding. Statistical Models Based on
Counting Processes. Springer Series in Statistics, New York, 1993.
9. G. McLachlan and D. Peel. Finite Mixture Models. John Wiley  & Sons, New  York,
2000.
10. W. Nelson. Theory and applications of hazard plotting for censored failure data.
Technometrics, 14(4):945–966, 1972.
11. O. Aalen. Nonparametric inference for a family of counting processes. The Annals of
Statistics, 6(4):701–726, 1978.
12. E.L. Kaplan and P. Meier. Nonparametric estimation from incomplete observations.
Journal of the American Statistical Association, 53(282):457–481, 1958.
13. B.W. Turnbull. The  empirical distribution function with arbitrary grouped, censored
and truncated data. Journal of the Royal Statistical Society, Series B, 38:290–295,
1976.
14. J. Huang. Asymptotic properties of nonparametric estimation based on partly interval–
censored data. Statistica Sinica, 9:501–519, 1999.
15. V. Patilea and J.M. Rolin. Product-limit estimators of the survival function for two
modified forms of current-status data. Bernoulli, 12:801–819, 2006.
16. L. Bordes, J.Y. Dauxois, and P. Joly. Semiparametric inference of competing risks data
with additive hazards and missing cause of failure under MCAR or MAR assumptions.
Electronic Journal of Statistics, 8:41–95, 2014.
17. J.W. Vaupel, K.G. Manton, and E. Stallard. The impact of heterogeneity in individual
frailty on the dynamics of mortality. Demography, 16(3):439–454, 1979.
18. P. Hougaard. Analysis of Multivariate Survival Data. Springer, New York, 2000.
19. L. Duchateau and P. Janssen. The  Frailty Model. Statistics for Biology and Health,
Springer, New York, 2008.
20. A. Wienke. Frailty Models in Survival Analysis. CRC Press, Boca Raton, FL, 2010.
21. P. Hougaard. Frailty models for survival data. Lifetime Data Analysis, 1:255–273, 1995.
22. S.A. Murphy. Consistency in a proportional hazard model incorporating a random
effect. Annals of Statistics, 22:712–731, 1994.
23. S.A. Murphy. Asymptotic theory for the frailty model. Annals of Statistics, 23:182–198,
1995.
24. D. Commenges and H. Jacmin-Gadda. Generalized score test of homogeneity based
on correlated random effects models. Journal of the Royal Statistical Society B,
59:157–171, 1997.
25. L. Bordes and D. Chauveau. Stochastic EM algorithms for parametric and semiparametric
mixture models for right-censored lifetime data. Computational Statistics,
31(4):1513–1538, 2016.
26. H. Teicher. Identifiability of mixtures of product measures. Annals of Mathematical
Statistics, 38:1300–1302, 1967.
27. S.J. Yakowitz. On the identifiability of finite mixtures. Annals of Mathematical Statistics,
39:209–214, 1968.
28. N. Atienza, J. Garcia-Heras, and J.M. Munoz Pichardo. A new condition for identifi-
ability of finite mixture distributions. Metrika, 63(2):215–221, 2006.
29. N. Balakrishnan and D. Mitra. EM–based likelihood inference for some lifetime distri-
butions based on left truncated and right censored data and associated model discrimi-
nation. South African Statistical Journal, 48:125–171, 2014.
30. T.H.K. Ng and Z. Ye. Comment: EM–based likelihood inference for some lifetime
distributions based on left truncated and right censored data and associated model dis-
crimination. South African Statistical Journal, 48:177–180, 2014.
31. L. Bordes and D. Chauveau. Comment: EM–based likelihood inference for some life-
time distributions based on left truncated and right censored data and associated model
discrimination. South African Statistical Journal, 48:197–200, 2014.
32. S.F. Nielsen. The  stochastic EM algorithm: Estimation and asymptotic results.
Bernoulli, 6(3):457–489, 2000.
33. J.W. Boag. Maximum likelihood estimates of the proportion of patients cured by cancer
therapy. Journal of the Royal Statistical Society Series B, 11:15–45, 1949.
34. V. Farewell. A model for binary variable with time-censored observations. Biometrika,
64:43–46, 1977.
35. Y. Peng and J.M.G. Taylor. Cure models. In J. Klein, H. van Houwelingen, J.G. Ibrahim,
and T.H. Scheike, editors, Handbook of Survival Analysis, Chapter  6, pp.  113–134.
Handbooks of Modern Statistical Methods Series, Chapman & Hall, Boca Raton, FL,
2014.
36. P. Sasieni. Proportional excess hazards. Biometrika, 83(1):127–141, 1996.
37. P. Sasieni and A.R. Brentnall. On standardized relative survival. Biometrics, 73(2):473–
482, 2016.
38. M.P. Perme, J. Stare, and J. Estève. On estimation in relative survival. Biometrics,
68(1):113–120, 2012.
39. L. Bordes, S. Mercier, E. Remy, and E. Dautrême. Partially observed competing degra-
dation processes: Modeling and inference. Applied Stochastic Models in Business and
Industry, 32(5):677–696, 2016.
40. D. Cox and N. Reid. A note on pseudolikelihood constructed from marginal densities.
Biometrika, 91:729–737, 2004.
41. M. Guida and F. Penta. A gamma process model for the analysis of fatigue crack growth
data. Engineering Fracture Mechanics, 142:21–49, 2015.
42. S. Mercier, C. Meier-Hirmer, and M. Roussignol. Bivariate gamma wear processes for
track geometry modelling, with application to intervention scheduling. Structure and
Infrastructure Engineering, 8(4):357–366, 2012.
43. F.A. Buijs, J.W. Hall, J.M. van Noortwijk, and P.B. Sayers. Time-dependent reliability
analysis of flood defences using gamma processes. In G. Augusti, G.I. Schueller, and
M. Ciampoli, editors, Safety and Reliability of Engineering Systems and Structures,
pp. 2209–2216; Proceedings of the Ninth International Conference on Structural Safety
and Reliability (ICOSSAR), Rome, Italy, June 19–23, 2005, Mill-Press, Rotterdam, the
Netherlands.
12 Expanded Failure Modes and Effects Analysis
A Different Approach for System Reliability Assessment
Perdomo Ojeda Manuel, Rivero Oliva Jesús, and Salomón Llanes Jesús

CONTENTS
12.1 Background for Developing the Expanded Failure Modes and Effects Analysis
12.2 Some Distinctive Features of the FMEAe Methodology
12.3 Criticality Analysis of Failure Modes by Applying the Component Reliability Model Approach
    12.3.1 Indexes Used in the Criticality Analysis of Components-Failure Modes
        12.3.1.1 Component Risk Index
        12.3.1.2 System Risk Index
        12.3.1.3 Index of Relative Importance of the Component-Failure Mode i
    12.3.2 Treatment of Redundant Components
12.4 Procedure for Treating the Common Cause Failures in FMEAe
    12.4.1 List of Components with Potential to Generate Common Cause Failures
    12.4.2 Classification of Common Cause Failures into Groups by Their Degree of Dependency
    12.4.3 Assignment of Postulated Generic β Factors
    12.4.4 Correction of the β Factor of the Common Cause Failure Events, According to the Degree of Redundancy
12.5 Analysis of Importance by Component Type
12.6 Reliability Assessment of a Generic Fire Quenching System Applying FMEAe
    12.6.1 General Assumptions and Other Considerations for the Analysis
    12.6.2 Preparing the Worksheet for the Analysis and Reliability Assessment in FMEAe
    12.6.3 Criticality Analysis of System of Figure 12.2 Applying the FMEAe Approach
12.7 Final Remarks
References

12.1 BACKGROUND FOR DEVELOPING THE EXPANDED FAILURE MODES AND EFFECTS ANALYSIS
There is significant experience in the field of systems reliability analysis. Here the Fault Tree Analysis (FTA) technique has played an important role because of its power to emphasize aspects that exert an enormous influence on the reliability of redundant systems, specifically those designed to operate with high availability and safety requirements.
These analyses constitute a key support tool in decision-making processes, which, in turn, are crucial for activities or processes in industrial facilities or services that involve significant hazards [1].
To determine the dominant contributors of the risk or the reliability of a system,
detailed information needs to be adequately processed so that the proposed objective
can be accomplished.
However, the following difficulties are frequently present:

• Not all the necessary information for a correct decision is always available
• An important part of the information may be available, but not organized
and processed in an appropriate manner

The solution of these key problems can be achieved by:

• Gathering the raw data of the facility, processing them adequately, and pre-
paring a database oriented to reliability and safety, so that a specialized
computer tool of reliability and risk analysis can use it in a proper manner
• Training and qualifying specialists and managers in the use of these data-
bases and specialized computer programs so that the data can be used cor-
rectly and decisions can produce the expected results

Training and qualifying the staff of an industry in the use of specialized programs in
this field is not a major problem, or at least its solution can be ready in the short term,
because there is currently a significant amount of experience in that field.
Nevertheless, collecting and handling the data and developing computerized databases ready to be used in risk and reliability studies are time-consuming tasks. In addition, the sample of available data should be sufficiently representative of the processes that are going to be modeled (e.g., failure rates of component-failure modes and average repair times).
Moreover, inaccuracies in the definition of the component boundaries and in the way the raw data are described, among other aspects, introduce uncertainties into the data to be processed. The degree of uncertainty can be so high that, for example, the generic databases available for use in Probabilistic Safety Assessment (PSA) show differences of up to 2 orders of magnitude in the failure rates of the same failure mode and type of equipment [2,3].
In addition, technological advances in some fields are so rapid that, by the time data have been collected and organized for use, there may already be new designs that differ somewhat from those from which the data were collected.
Given these problems, there is a need to look for ways to reach useful results even when data are partially or almost totally lacking. Hence, qualitative analysis tools, such as Failure Modes and Effects Analysis (FMEA) and Hazard and Operability (HAZOP), need to become competitive with the powerful quantitative tools, such as FTA, whose results depend to a large extent on statistical data.
However, the "analysis tool-analyst team" system must be able to fulfill the objective of a detailed reliability or safety study with the lowest possible cost, the shortest execution time, and the least associated uncertainty. This result can be achieved by providing the analysis team with a tool that covers an exhaustive spectrum of safety-reliability aspects to be evaluated, the methods for their evaluation, and some useful analysis options.
The matrix shown in Table 12.1 presents a comparison among a set of reliability
and risk analysis techniques widely used in industrial applications [4–9]. The com-
pared attributes are based on important characteristics that an effective analysis tool
should meet to support the decisions regarding safety and availability of facilities
and services with potential risks associated with their operation.
It also includes, as a reference for comparison, the techniques most frequently used in risk studies: FTA and Event Tree Analysis (ETA) [10–22], given their benefits and strengths in this subject.

TABLE 12.1
Comparative Matrix of Reliability and Risk Analysis Techniques
| Items to Compare                                 | HAZOP | FMEA | Checklist | What If? | SR | PreHA | ETA | FTA |
|--------------------------------------------------|-------|------|-----------|----------|----|-------|-----|-----|
| Completeness                                     | ++    | ++   | −         | −        | −  | −     | +   | ++  |
| Structured approach                              | ++    | ++   | −         | −        | −  | −     | ++  | ++  |
| Flexibility of application                       | −     | ++   | ++        | ++       | ++ | ++    | +   | +   |
| Objectivity                                      | +     | +    | −         | −        | −  | −     | +   | +   |
| Independence on quantitative data (a)            | +     | +    | ++        | ++       | ++ | ++    | ++  | −   |
| Capability of modeling dependences               | −     | −    | −         | −        | −  | −     | ++  | ++  |
| Independence on the analysis team expertise (b)  | −     | +    | ++        | +        | +  | −     | −−  | −−  |
| Quickness in obtaining results                   | −     | −    | +         | +        | +  | +     | +   | −   |

(a) In achieving quantitative results.
(b) Refers to the skill in using the technique.

In Table 12.1, the symbols used to rate the characteristics of each technique mean:

++ High
+ Moderate
− Low
−− Very low or none

The acronyms used in Table 12.1 mean:

HAZOP: Hazard and Operability Analysis
FMEA: Failure Mode and Effect Analysis
SR: Safety Review
PreHA: Preliminary Hazard Analysis
ETA: Event Tree Analysis
FTA: Fault Tree Analysis

The characteristics used to compare the techniques in Table 12.1 have been defined in a positive sense, considering the benefits of each technique for achieving an exhaustive reliability or risk analysis of a system. This means that the techniques with the greatest number of "++" and "+" ratings are the best candidates for that kind of analysis.
From the previous matrix, the FMEA technique emerged as one of the best candidates for improvement, by extending it to include some important analytical advantages of FTA such as the functional dependence and common cause failure (CCF) analyses.
On the other hand, in a review of recently published works on the FMEA methodology [23–26], none was found that treats dependency analysis within FMEA/Failure Mode, Effects, and Criticality Analysis (FMECA).
Thus, an analytical tool has been developed that keeps the best characteristics of qualitative techniques such as FMEA and adds some of the greatest strengths of the quantitative ones. These strengths, along with some other important features that have been included in the expanded FMEA (FMEAe) methodology, are described in the next sections.

12.2 SOME DISTINCTIVE FEATURES OF THE FMEAe METHODOLOGY
The FMEA technique is recognized as a powerful analysis tool because it combines the structure and completeness of the method with descriptive capacities that improve its integrity, giving the analyst the flexibility to describe more completely all the characteristics of the system from either the design or the operation standpoint. Thus, FMEA analyzes how these characteristics may influence the system reliability, or the risk they induce, and ranks the importance of the system's postulated failure modes, which allows the corrective measures to reduce risk or increase reliability to be optimized [4–10].
However, an important recognized limitation of the method is its inability to perform dependence analysis, an aspect well modeled by the FTA technique. Common cause failures and human errors have been among the most frequent causes of accidents in complex industrial facilities, hence the importance of being able to treat them adequately in these studies.
An important part of the insufficiencies or limitations found in the qualitative techniques presented in Table 12.1 have been resolved in FMEAe. The most significant improvements in the FMEAe methodology, in comparison with the traditional FMEA¹, were introduced through several procedures for:

• Determining common cause basic events, estimating their probabilities, and including them in the list of failure modes for the criticality analysis
• Analyzing the joint importance of components of the same type to determine those types of components with the largest contribution to risk or unavailability, thereby supporting the standardization of the corrective measures
• Estimating the risk or reliability of the system under analysis by means of a global parameter called System Risk Index (IRS)

In FMEAe, these analyses are carried out through algorithms that identify and compare strings. These strings include the information in the fields of the traditional FMEA worksheet, together with other fields that have been added to enlarge and complete the information about the design and functioning of the components involved in the analysis [27,28].
Figure 12.1 presents the FMEAe worksheet in the ASeC computer code, showing the CCF modes included at the end of the list (those whose code begins with CM), which form part of the criticality analysis. These events are generated automatically by algorithms that handle the information the analyst provides for the worksheet using the first two tables: Datos Técnicos (Engineering-related Data) and Datos de Fallos y Efectos (Failures and Effects Data). The third table of the worksheet, Observaciones (Remarks), serves to complete ideas or descriptions about the failure mode, mode of operation, mode of control of the components, or any consideration made for the analysis.

FIGURE 12.1  FMEAe worksheet in ASeC computer code.

¹ Not included in the computer codes considered in the state of the art of this methodology.

In the lower left corner of the worksheet, a panel shows the quantity and codes of the component-failure modes that are the precursors of the CCFs, classified by their degree of dependence (G1 to G3), where G3 represents the highest degree of dependence, G2 an intermediate degree, and G1 the lowest degree. In this example, ten precursors (five pairs) having a G3 dependence degree were determined. As can be observed, there were no precursors of the G1 degree in this example because all five pairs of precursors share the attributes of the G3 degree. The methodology developed for the automated determination of CCFs in FMEAe is described in more detail in Section 12.4.
Once the worksheet fields have been filled in, and if data are available, it is advisable to start the criticality analysis by determining the CCFs so that their influence on the results is not missed. This is especially important for redundant systems, as can be verified later (Section 12.6) through the example of application of the FMEAe methodology to a fire quenching system.
After determining the CCFs, the criticality analysis is carried out by one of two approaches: the Component Reliability Model or the Risk Matrix. The first approach, which is discussed in this chapter, uses models similar to those included in the FTA technique to estimate the probability of basic events. Here, the calculated reliability parameter (probability of loss of functional capacity) is one of the factors used to determine the criticality of the failure modes, together with the degree of severity of their effects. There are three types of effects, each requiring a separate analysis: the environmental effects (EA), the effects on safety or health (ES), and the effects on the system availability (ED). The example shown in Figure 12.1 presents a case in which the failure modes affected only the system availability.
Another distinctive feature of the FMEAe worksheet, which can be observed in Figure 12.1, is the way the information is presented to the analyst. The worksheet contains three tables that present all the relevant information for the analysis on a single screen page, so that the user can access all the aspects at once without the need to scroll to another page.

12.3 CRITICALITY ANALYSIS OF FAILURE MODES BY APPLYING THE COMPONENT RELIABILITY MODEL APPROACH
The Component Reliability Model Approach is the preferred method when the failure rates, the average repair times, and other parameters of the component-failure modes that depend on statistical data are known or can be estimated. As in the analyses carried out by the FTA technique, it is assumed here that the times to failure and the repair times of the component-failure modes follow the exponential distribution, so that the failure rates and average repair times remain approximately constant in time while the component is in its useful life.
To model the reliability behavior of the components through the loss of their functional capacity, two parameters are considered. One is the probability of failure (p), which characterizes the reliability of components that must operate during a given mission time; the other is the unavailability (q), for standby components that must change their position or state at the time of the demand.
Together with the reliability parameter, the effects caused by the failure modes are considered to form a matrix (probability of occurrence vs. severity of the effects of each failure mode). This matrix is affected, in turn, by a weighting (quality) factor that considers the way the equipment is commanded, that is, auto-actuated or manual mode from either a remote panel or locally (in the field).
The rest of the characteristics that influence the functional capacity of the component, such as the degree of redundancy and the control mode (periodic testing, continuous monitoring, or non-controlled component), are already included implicitly in the reliability model of each component-failure mode and in the severity of the effect, based on the information the analyst enters in the worksheet.
There are five degrees of severity for each of the three kinds of effects considered in FMEAe, which are described in Tables 12.2 through 12.4.
This approach assumes that once the failure mode has occurred, the effect will take place. If more than one effect of the same kind (environment-related, safety/health-related, or facility's availability-related) can occur, the one with the highest severity is chosen.

TABLE 12.2
Severity of the Environmental Effect of a Failure Mode (EA)
| Qualitative Classification of the Effect | Associated Value | Meaning |
|------------------------------------------|------------------|---------|
| Low | 1 | There are no impacts on the facility's site. Only minor internal effects are considered. Corrective measures are not required. |
| Serious | 2 | There are minor impacts outside the facility boundaries, which demand some cleaning procedures, considering a recovery time of 1 week or less. There is presence of smoke, noise, and bad smells. The local traffic is affected by the evacuation. |
| Severe | 3 | There are minor impacts outside the facility boundaries, which require some cleaning processes with a recovery time of at least 1 month. Possible wounded or injured people. |
| Very severe | 4 | There are serious impacts outside the facility boundaries. Reversible damages are considered with a recovery time of up to 6 months. Moderate impacts on animal and plant life. Temporary disabilities of people. |
| Catastrophic | 5 | There are significant impacts outside the facility boundaries, with a recovery time of more than 6 months. Irreversible damages to animal and plant life are considered. Possible deaths or permanent disabilities of people are considered. |

TABLE 12.3
Severity of the Effect of a Failure Mode on the Safety or Health (ES)
| Qualitative Classification of the Effect | Associated Value | Meaning |
|------------------------------------------|------------------|---------|
| Low | 1 | Local minor effects (including first aid procedures). There are no disabling damages. |
| Serious | 2 | Appreciable internal effects. Temporary injuries and disabilities. |
| Severe | 3 | Important internal effects. Some permanently injured and disabled people. The occurrence of up to 1 death is considered possible. |
| Very severe | 4 | Very important internal effects. Several permanently injured and disabled people. Up to 4 deaths are possible. |
| Catastrophic | 5 | Catastrophic internal effects. Multiple permanent affectations. Numerous deaths (5 cases or more). |

TABLE 12.4
Severity of the Effect of a Failure Mode on the Facility's Availability (ED)
| Qualitative Classification of the Effect | Associated Value | Meaning |
|------------------------------------------|------------------|---------|
| Low | 1 | There is no effect on production/functioning. Additional maintenance tasks during shutdown could be required. |
| Serious | 2 | Loss of important redundancy/reserve. An unplanned shutdown within 72 hours could be required. Recovery time of up to 1 month is considered. |
| Severe | 3 | Immediate shutdown is required. Recovery time of 1–3 months is considered. |
| Very severe | 4 | Immediate shutdown is required. Recovery time of 3–6 months is considered. |
| Catastrophic | 5 | Immediate shutdown is required. Recovery time of more than 6 months. |

Any clarification needed to support the analysis (for example, an analysis hypothesis, the basis of causes and effects, or the assignment of parameters whose certainty is not proven) is made in the Remarks table of the worksheet for each failure mode analyzed. Finally, the corrective measures derived from all the information collected and from the criticality analysis are incorporated in the Recommendations field of the Results sheet.

12.3.1 Indexes Used in the Criticality Analysis of Components-Failure Modes

The criticality analysis in FMEAe uses some factors related to the following subjects:

• Probability of occurrence of the failure mode
• Severity of the induced effects
• Mode of control (periodic testing, continuous monitoring, etc.)
• Mode of command (auto-activated or manually activated)
• Degree of redundancy
• Mechanisms of common cause failures

Next, a set of semi-quantitative indices is defined for the criticality analysis of the component-failure modes.

12.3.1.1  Component Risk Index
The Component Risk Index (IRCi) gives a measure of the importance of each component-failure mode within the system function according to the three kinds of effects (ED, ES, and EA), so that there can be three kinds of risk related to the system function due to the occurrence of a failure mode i: IRCdi, IRCsi, and IRCai. Following the Component Reliability Model Approach, the expressions of these risk indexes are:

$$\mathrm{IRCd}_i = q_i \cdot \mathrm{ED}_i \cdot \mathrm{FPm}_i \qquad (12.1)$$

$$\mathrm{IRCs}_i = q_i \cdot \mathrm{ES}_i \cdot \mathrm{FPm}_i \qquad (12.2)$$

$$\mathrm{IRCa}_i = q_i \cdot \mathrm{EA}_i \cdot \mathrm{FPm}_i \qquad (12.3)$$

where:
qi is the probability of failure or the unavailability of the component-failure mode i
EDi, ESi, and EAi are the severity degrees of the three kinds of effects (availability-related, safety-related, and environmental, respectively) induced by the component-failure mode i
FPmi is the weighting (quality) factor that considers the way the equipment that experiences the failure mode i is commanded when it is demanded for operation, that is, auto-actuated or in manual mode, from either a remote panel or locally in the field

FPmi takes the following values: 1 for auto-activated components; 3 for components commanded in remote manual mode (from a control room); and 5 for components commanded manually in the field (by a hand switch located near the equipment).
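
Expressions (12.1) through (12.3) share one form, so a single helper is enough to illustrate them. The following is a minimal sketch, not the ASeC implementation: the function name, the command-mode labels, and the decision to accept a severity of 0 (meaning "no effect of that kind") are assumptions made for the example.

```python
# Weighting factor FPm by command mode, as listed in the text.
# The dictionary keys are hypothetical labels chosen for this sketch.
FPM_BY_COMMAND_MODE = {
    "auto": 1,            # auto-activated component
    "remote_manual": 3,   # commanded manually from a control room
    "local_manual": 5,    # commanded manually in the field
}


def component_risk_index(q: float, severity: int, command_mode: str) -> float:
    """Return IRC = q * severity * FPm for one component-failure mode.

    q is the probability of failure or the unavailability; severity is the
    ED, ES, or EA degree (1 to 5 per Tables 12.2 to 12.4; 0 is accepted here
    as an assumption for "no effect of this kind").
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a probability between 0 and 1")
    if severity not in range(0, 6):
        raise ValueError("severity must be an integer from 0 to 5")
    return q * severity * FPM_BY_COMMAND_MODE[command_mode]
```

For instance, `component_risk_index(2.0e-3, 3, "local_manual")` evaluates the availability-related index IRCd for a failure mode with q = 2.0E-3 and ED = 3 on equipment operated by hand in the field.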
A qualitative scale to classify the criticality of each failure mode is established starting from a limiting quantitative goal for the IRC value, defined by the following criteria:

1. A value of q = 1.0E-3, which corresponds to systems with high safety and availability requirements, as in industries that follow good practices such as nuclear power plants. It means that the system may be unavailable for roughly 8 to 9 hours per year (1.0E-3 × 8,760 h ≈ 8.8 h), considering that its availability is assessed for a typical year of operation, that is, 12 months (between planned shutdowns for maintenance).
2. Neither environmental effects nor effects on the health of people are present, and there is only a slight effect on the system availability (EA = 0; ES = 0; ED = 1).
3. A weighting factor FPm = 1 (neutral) is considered, corresponding to the best engineering practice in which all the components are activated on an automatic signal.

In this way, the criticality scale starts with the target minimal value of IRC = 1.0E-03 and increases geometrically by a factor of 5 until reaching the postulated upper limit of IRC = 1.2E-01 (1.0E-03 × 5³ ≈ 1.2E-01), above which the criticality of the component-failure mode is considered extreme (extremely critical component-failure mode). The scale is as follows:

• IRC ≤ 1.0E-3: The risk index of the component-failure mode tends to excellence
• 1.0E-3 < IRC ≤ 5.0E-3: The risk index of the component-failure mode moves away from the target in a tolerable range
• 5.0E-3 < IRC ≤ 2.5E-2: The risk index of the component-failure mode has been degraded
• 2.5E-2 < IRC ≤ 1.2E-1: The risk index of the component-failure mode is critical
• 1.2E-1 < IRC: The risk index of the component-failure mode is extremely critical
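
The IRC scale above amounts to a lookup against four upper bounds. The sketch below is one possible way to encode it; the function name and the short class labels are hypothetical abbreviations of the descriptions in the list, and the bounds are copied from the text.

```python
from bisect import bisect_left

# Upper bounds of the first four IRC ranges, taken from the scale in the text.
IRC_UPPER_BOUNDS = [1.0e-3, 5.0e-3, 2.5e-2, 1.2e-1]

# Short labels (hypothetical wording) for the five ranges, in increasing order.
IRC_CLASSES = [
    "tends to excellence",
    "tolerable",
    "degraded",
    "critical",
    "extremely critical",
]


def classify_irc(irc: float) -> str:
    """Return the criticality class of a component-failure mode.

    bisect_left keeps a value equal to a bound in the lower range, which
    matches the "<=" used on the upper end of each range in the text.
    """
    if irc < 0.0:
        raise ValueError("IRC cannot be negative")
    return IRC_CLASSES[bisect_left(IRC_UPPER_BOUNDS, irc)]
```

Using a sorted list of bounds keeps the range limits in one place, so the scale can be adjusted without touching the lookup logic.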

12.3.1.2  System Risk Index


System Risk Index (IRS) is a measure of the average behavior of the functional
capability of the system through the IRC index. It is directly related to the system’s
reliability. Then it considers the contribution of all its component-failure modes,
through their IRC index, and it is estimated by expression (12.4):


\[ \mathrm{IRS} = \frac{\sum_{i=1}^{n} \mathrm{IRC}_i}{n} \tag{12.4} \]

where:
IRCi is the risk index of the component-failure mode i
n is the total of component-failure modes of the analyzed system

Following similar criteria to those defined for the IRC, it is postulated that the IRS
target value is 1.0E-3. From this goal, a scale like that proposed for the IRC is estab-
lished, but with more intermediate ranges for a finer classification of the system reli-
ability. The scale is as follows:

• IRS ≤ 1.0E-3: The risk index of the system tends to excellence


• 1.0E-3 < IRS ≤ 2.25E-3: The risk index of the system moves away from
excellence in a tolerable range
• 2.25E-3 < IRS ≤ 5.0E-3: The risk index of the system presents an incipient
degradation
• 5.0E-3 < IRS ≤ 1.14E-2: The risk index of the system is degraded
• 1.14E-2 < IRS ≤ 2.56E-2: The  risk index of the system approaches the
critical zone
• 2.56E-2 < IRS ≤ 5.76E-2: The risk index of the system is in the critical zone
• 5.76E-2 < IRS ≤ 1.29E-1: The risk index of the system is very critical
• 1.29E-1 < IRS: The risk index of the system is extremely critical
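
A minimal sketch of expression 12.4 together with the IRS scale above, assuming the IRC values of all component-failure modes are already available as plain numbers. The names `system_risk_index` and `classify_irs` and the shortened labels are hypothetical.

```python
from bisect import bisect_left

# Upper bounds of the first seven IRS ranges, taken from the scale above.
IRS_BOUNDS = [1.0e-3, 2.25e-3, 5.0e-3, 1.14e-2, 2.56e-2, 5.76e-2, 1.29e-1]

# Shortened labels (hypothetical wording) for the eight ranges of the scale.
IRS_LABELS = [
    "tends to excellence",
    "tolerable",
    "incipient degradation",
    "degraded",
    "approaches the critical zone",
    "critical zone",
    "very critical",
    "extremely critical",
]


def system_risk_index(irc_values):
    """Expression 12.4: arithmetic mean of the IRC values of all failure modes."""
    irc_values = list(irc_values)
    if not irc_values:
        raise ValueError("at least one component-failure mode is required")
    return sum(irc_values) / len(irc_values)


def classify_irs(irs):
    """Return the scale label for an IRS value (upper bounds are inclusive)."""
    return IRS_LABELS[bisect_left(IRS_BOUNDS, irs)]


if __name__ == "__main__":
    irs = system_risk_index([4.0e-3, 2.0e-3, 3.0e-3])
    print(f"IRS = {irs:.2e}: {classify_irs(irs)}")
```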

12.3.1.3  Index of Relative Importance of the Component-Failure Mode i


The Index of Relative Importance of the Component-failure Mode i (IIRi) gives the
relative contribution or weight of the loss of the functional capacity of the com-
ponent due to the failure mode i (IRCi) to the IRS. It  allows knowing how much
the IRC value of the corresponding component-failure mode deviates from the IRS
value either in excess or in defect. The greatest benefit of criticality analysis can be
obtained when the results of both indices, (IRCi and IIRi), are combined to make
decisions. The IIRi is calculated according to the expression 12.5:

\[ \mathrm{IIR}_i = \frac{\mathrm{IRC}_i}{\mathrm{IRS}} \tag{12.5} \]
where:
IRCi is the risk index of the component-failure mode i
IRS is the risk index of the system

To classify the relative importance of the component-failure modes, the values of
IIRi are ranked according to the following scale. It is recommended to use these
values for decision making together with the IRC values of the respective
component-failure mode.

• IIRi > 10: Too important deviation in excess


• 5 < IIRi ≤ 10: Important deviation in excess
• 2.5 < IIRi ≤ 5: Appreciable deviation in excess
• 1 < IIRi ≤ 2.5: Light deviation in excess
• IIRi ≤ 1: No deviation or deviation in defect
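
Expression 12.5 and the ranking scale above can be sketched as follows. The function names and the dictionary-based input are assumptions made for illustration; the labels reuse the wording of the scale.

```python
def relative_importance(irc_by_mode):
    """Expression 12.5: IIR_i = IRC_i / IRS, with IRS the mean of all IRC values.

    irc_by_mode maps a (hypothetical) failure-mode identifier to its IRC value.
    """
    irs = sum(irc_by_mode.values()) / len(irc_by_mode)
    return {mode: irc / irs for mode, irc in irc_by_mode.items()}


def classify_iir(iir):
    """Return the deviation class of an IIR value according to the scale above."""
    if iir > 10:
        return "too important deviation in excess"
    if iir > 5:
        return "important deviation in excess"
    if iir > 2.5:
        return "appreciable deviation in excess"
    if iir > 1:
        return "light deviation in excess"
    return "no deviation or deviation in defect"


if __name__ == "__main__":
    # Illustrative IRC values only; they are not taken from the chapter.
    iir = relative_importance({"X": 4.0e-3, "Y": 1.0e-3, "Z": 1.0e-3})
    for mode, value in iir.items():
        print(f"{mode}: IIR = {value:.2f} ({classify_iir(value)})")
```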

In this way, those component-failure modes that are very critical or critical (according
to their IRCi values) with too important or important deviations in excess will receive
the highest priority for proposing corrective measures to diminish their criticality.

12.3.2 Treatment of Redundant Components


The traditional FMEA technique does not make a global assessment of the system as
such; it is restricted to the individual analysis of each of its component-failure
modes through their criticality (FMECA).
As previously mentioned, FMEAe defines an overall reliability/risk index, IRS, that
permits assessment of the system as a whole. However, for a system with
redundancies, this way of assessing the reliability or risk distorts the value of IRS
and the contribution of each redundant component to it. Thus, to avoid excessively
conservative contributions of the IRC values of failure modes of redundant components
to IRS, a procedure has been developed that considers the contribution of redundant
components to the reliability or risk of the system as a function of their degree of
redundancy.
To perform the weighting process within this procedure, the redundant components
must first be identified. To achieve this identification, two additional fields are
added to the Datos Técnicos table in the FMEAe worksheet (see Figure 12.1):

• Degree of redundancy (Reserv. field)


• Redundancy coupling train (C field)

Thus, each group of redundant components is identified with a unique integer value
in the “C” cell of the respective component-failure mode, and the following attri-
butes must coincide for that group, which represent table fields in the worksheet of
the FMEAe (see Figure 12.1):

• Component function (Function field).


• The mode of operation (Estado field).
• The failure mode (Modo de Fallo field).
• The mode of control (Control field).

After the group of redundant components has been identified, their unavailability
values or probabilities of failure (represented by qi) are weighted (penalized),
as follows:

1. Case of groups with n identical redundant elements (lack of diversity). If
there is a lack of equipment diversity in a group of n redundant components,
then all its components have the same generic code and the same model
(Cod. G. and Modelo fields in the worksheet tables in Figure 12.1,
respectively). Under this condition, the original qi value is raised to a power
equal to the degree of redundancy n and the result is divided by n. Finally,
the result is assigned as the new unavailability value qpi (weighted
unavailability) of each redundant component of the group; this new value
replaces qi in expressions 12.1, 12.2, or 12.3.

The general expression for this case is:

\[ qp(n) = \frac{q^{n}}{n} \tag{12.6} \]

where:
qp(n) is the weighted unavailability/probability of failure mode of redundant
components with redundancy degree n
n is the degree of redundancy of the redundant component group
q is the original unavailability/probability of failure mode of the redundant
component group with degree of redundancy n

2. Case of groups with n redundant diverse components. This refers to those
groups whose redundant elements share the same redundancy attributes but are
not identical, that is, they differ in the data of the Cod. G. and Modelo
fields in the FMEAe worksheet tables (see Figure 12.1).
The general expression for this case is:

\[ qp(n) = \frac{\prod_{i=1}^{n} q_i}{n} \tag{12.7} \]


The following example shows the usefulness of this weighting procedure.
Consider a system consisting of three identical components A, B and
C, where A and B are arranged in parallel, and C is arranged in series
with them. Each of them has the same value of unavailability, q = 1E-3.
Assuming independence between components, the system unavailability
(Qs) can be estimated as follows:

Qs = q(A × B) + q(C)
   = [q(A) × q(B)] + [q(C)]
   = 1E-6 + 1E-3
   = 1.001E-3

If the reliability analysis of that system were performed applying
FMEAe, without considering the weighting procedure and applying
expression 12.4, the IRS index would be (assuming no effects for
simplicity):

IRS = (IRC[A] + IRC[B] + IRC[C]) / N
    = (3E-3) / 3
    = 1.0E-3

where:
IRC[A], IRC[B], and IRC[C] are the component risk indexes of components
A, B, and C, respectively, assuming no effects
N is the total number of component-failure modes analyzed (assumed three
for this example)

As previously established, the IIR index (relative importance of failure modes in
FMEAe) for each component is calculated applying expression 12.5:

IIR[A] = IRC[A] / IRS
       = 1.0E-3 / 1.0E-3
       = 1
IIR[B] = IRC[B] / IRS
       = 1.0E-3 / 1.0E-3
       = 1
IIR[C] = IRC[C] / IRS
       = 1.0E-3 / 1.0E-3
       = 1

These results reveal an inconsistency: the redundant components A and B appear to
have the same contribution to the IRS as component C. Unlike the failure of C, the
occurrence of failure mode A or of failure mode B alone is not a sufficient condition
for the system to fail. This result therefore means that the previous procedure
(without weighting of the q values of redundant components) assigns an excessively
conservative contribution to components A and B.
The previous problem is solved by applying the weighting procedure in estimat-
ing the q values of the failure modes, in the case of redundant components, as estab-
lished by expression 12.7. Thus, the contribution of the risk index of each of the
failure modes of the previous example to the IRS, is estimated as follows:

qp(A,B) = (1E-3)^2 / 2
        = 5E-7

where qp(A,B) is the weighted unavailability of each component A and B, estimated
by expression 12.7.
According to the procedure, the qp substitutes the original q value of the redun-
dant components involved, in this case, A  and B, so that the new values of q are
q(A) = 5E-7, q(B) = 5E-7, while q(C) keeps its value: q(C) = 1E-3, because the latter
is not a redundant component.

Now the IRS can be estimated again, but using the new values of q for A and B,
by expression 12.4:

IRS = (IRC[A] + IRC[B] + IRC[C]) / N
    = (5E-7 + 5E-7 + 1E-3) / 3
    = 1.001E-3 / 3
    = 3.34E-4

As can be observed, the new IRS value (3.34E-4) is smaller than the formerly esti-
mated value without weighting the q values of the redundant components (1.0E-3),
which is considered more realistic because it considers the expected effect of the
redundancy on the system reliability.
The new IIR indexes for the component-failure modes A, B, and C are now
estimated again, but substituting the modified values of q and IRS, applying
expression 12.5:

IIR[A] = IRC[A] / IRS
       = 5E-7 / 3.34E-4
       = 0.001497
IIR[B] = IRC[B] / IRS
       = 5E-7 / 3.34E-4
       = 0.001497
IIR[C] = IRC[C] / IRS
       = 1E-3 / 3.34E-4
       = 2.99

Finally, it should be noted that the new values obtained for IRS and IIR[A], IIR[B],
and IIR[C] are more representative of the system reliability, with a lower value of IRS
and a more realistic relative importance or contribution to the IRS of the three
component-failure modes (IIR[A] = IIR[B] << IIR[C]). This result proves the usefulness
of the weighting procedure for treating redundant components included in FMEAe.
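
The worked example can be reproduced with a short script. This is a sketch under the same simplification used in the text (no effects, so the IRC of each failure mode equals its q value); the function names are hypothetical.

```python
from math import prod


def weighted_unavailability(q_values):
    """Expression 12.7: product of the q_i of a redundant group divided by its degree n.

    For identical components this reduces to expression 12.6 (q**n / n).
    """
    n = len(q_values)
    return prod(q_values) / n


def irs_and_iir(irc_by_mode):
    """Expressions 12.4 and 12.5 applied to a mapping of failure mode -> IRC."""
    irs = sum(irc_by_mode.values()) / len(irc_by_mode)
    return irs, {mode: irc / irs for mode, irc in irc_by_mode.items()}


# Data of the example: A and B in parallel (redundant), C in series with them.
q = {"A": 1e-3, "B": 1e-3, "C": 1e-3}
redundant_group = ["A", "B"]

# Without weighting: every failure mode contributes equally.
irs_plain, iir_plain = irs_and_iir(q)

# With weighting: A and B receive the penalized value qp, C keeps its q.
qp = weighted_unavailability([q[name] for name in redundant_group])
q_weighted = dict(q)
for name in redundant_group:
    q_weighted[name] = qp
irs_weighted, iir_weighted = irs_and_iir(q_weighted)

if __name__ == "__main__":
    print(f"qp(A,B)          = {qp:.1e}")
    print(f"IRS (unweighted) = {irs_plain:.2e}")
    print(f"IRS (weighted)   = {irs_weighted:.2e}")
    for name in q:
        print(f"IIR[{name}] = {iir_weighted[name]:.4g}")
```

The script carries the unrounded IRS through the division, so its IIR values can differ in the last digit from the hand calculation above, which uses the rounded value 3.34E-4.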

12.4 PROCEDURE FOR TREATING THE COMMON CAUSE FAILURES IN FMEAe

Multiple failures due to common causes are less likely to occur than single failures
due to independent causes, but when they occur, they defeat the design efforts to
achieve high levels of reliability by means of redundancy. For that reason, it is
important for system designers and operators to be alert to these issues, even when
data for specific estimation of their failure rates or probabilities of occurrence
are lacking.

In this way, the objective of treating the CCFs within a structured method like
FMEAe is to avoid letting their effects on results go unnoticed in cases where their
occurrence is possible, regardless of whether specific quantitative data reflecting
their occurrence are available. Thus, FMEAe employs generic data as a first
approximation, although the analysts can later update them if the experience
warrants it or depending on the existence of local defense measures.
The set of procedures developed to include the CCFs in an automated way
within the worksheet of FMEAe summarizes criteria and steps of other procedures
collected in the specialized literature [29–34], and the efforts to update and
improve the methodology and general approaches of concern [34]. To achieve this,
the typical structure of the traditional FMEA worksheet had to be modified and some
new fields were included in the tables, as shown in Figure 12.1, to facilitate the
analysis and comparison of the pertinent information.
The algorithms of these procedures include the following general tasks/steps:

• The procedure starts with a list of generic components, based on engineering
judgment, operational experience, and previously published studies [35–38],
for which knowledge justifies this type of analysis, given their functional
characteristics and their failure mechanisms or effects. However, other
components could be included in this list if specific operational data or
experience justifies it.
• From the list, some candidates are selected to form part of the CCF analysis.
Those components that match the condition of being active redundant
components are, in principle, potential candidates. The FMEAe approach
defines up to three degrees of dependency (depth of dependence) as a function
of the number of coincident attributes among all those that can be shared.
Group G1 contains those components that only share internal attributes such
as the mode of operation or command, the failure mode, the mode of control,
etc. Group G2 comprises the components that share the same attributes as G1
plus one of the two external attributes (e.g., sharing the same room or the
same working conditions). Finally, group G3 includes the components that
share all attributes, both internal and external. Then, depending on the
degree of dependency, a value is assigned to the generic β parameter, which is
directly proportional to that degree of dependency.
• The assignment of generic β parameters modifies the failure rate or the
probability of the component-failure modes involved, which are candidates
for CCF, whose value is adjusted as a function of the redundancy degree
of the common cause events group, applying the following expression taken
from [39]:
\[ \beta_k = \prod_{i=2}^{k} \frac{2^{\,i-2} - 1.0 + \beta_2}{2^{\,i-2}} \tag{12.8} \]


where:
βk is the beta factor to characterize the failure of k components from the
same generic group of size m
β2 is the generic beta factor to characterize the failure of two components
from the same group (assumed here as β2 = 0.1, which is the average
of the values of β factors estimated for the component-failure modes
involved in previous CCF studies [30,31])
• Finally, the resulting CCFs are added in the worksheet, so that they are
part of the criticality analysis together with the rest of the single failure
modes.
The next Sections 12.4.1 through 12.4.4 offer some details of these tasks.
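
As an illustration, expression 12.8 can be evaluated with a few lines of code. The sketch below assumes the product form β_k = ∏ (2^(i−2) − 1 + β2) / 2^(i−2) for i = 2..k, as read from the expression above; the function name `beta_k` is hypothetical.

```python
def beta_k(k, beta2=0.1):
    """Beta factor for the failure of k components, following expression 12.8.

    beta2 is the generic beta factor for double redundancy (0.1 by default,
    the value assumed in the text).
    """
    if k < 2:
        raise ValueError("a common cause failure involves at least two components")
    result = 1.0
    for i in range(2, k + 1):
        denominator = 2 ** (i - 2)
        result *= (denominator - 1.0 + beta2) / denominator
    return result


if __name__ == "__main__":
    for k in (2, 3, 4):
        print(f"beta_{k} = {beta_k(k):.6f}")
```

For k = 2 the product has a single term equal to β2, so the generic value is recovered; each additional redundant component multiplies the result by a factor smaller than one, which reflects the lower likelihood of failing a larger group.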

12.4.1 List of Components with Potential to Generate Common Cause Failures

According to the operational experience in industries with the highest safety
and availability requirements, and the recommendations resulting from important
works within the field of dependent failure analysis for nuclear power facilities
[12,29–38], a list of components with credible potential to generate CCFs has
been established for use in FMEAe.
Most represent active-type components or, in some cases, passive ones with active
failure modes (e.g., check valves). Other components, such as batteries, have been
included based on the operational experience recommendations of the preceding
references, although they do not have a macroscopic movement.
This preliminary list has a practical basis, considering the generalized absence
of data that would allow a complete CCF study. Hence, it does not mean that the
occurrence of CCFs in other types of components is excluded; however, their
inclusion would considerably complicate the models and the expected results would
introduce high levels of unwarranted uncertainty.
A two-digit identifier accompanying each component of the following list is used
by FMEAe to make the CCF analysis an easier task. That code is an excerpt of the
3-digit code used in references [2,3].

• BT: Battery
• DG: Emergency diesel generator
• KA: Circuit breaker general
• KB: Circuit breaker bus bar
• KG: Circuit breaker generator
• MA: Motor electrical
• PD: Diesel driven pump
• PM: Motor driven pump
• PT: Turbine driven pump
• QB: Blower fan
• QC: Compressor
• RT: Relay time delay

• RC: Relay control


• VA: Air operated valve
• VC: Check valve
• VD: Solenoid operated valve
• VM: Motor operated valve

12.4.2 Classification of Common Cause Failures into Groups by Their Degree of Dependency

The dependency degrees are used in FMEAe to qualify the depth with which the
dependency mechanisms defined in [30,31,34] could act. Therefore, the ratio of fail-
ures due to common causes among the totality of causes is a value directly propor-
tional to that dependency degree. Following is a general description of the procedure
for classifying CCFs by their degree of dependency:

• Degree of dependency G1: The condition that must be met by the components
included in the previous list to be considered as precursors of potential CCF
events of degree 1 (G1) is that they are identical redundant components. That
is, their internal attributes must coincide exactly. This analysis is carried
out in FMEAe through an algorithm of identification and comparison of strings,
which includes the following fields of the worksheet.
In the Datos Técnicos (Engineering-related data) table (see Figure 12.1),
the following fields must coincide for all the components involved:
• Redund: It indicates the redundancy degree; those component-failure
modes with values equal to or greater than 200% are of interest
(double degree of redundancy or higher)
• C: It is the redundancy coupling train (component-failure modes with
the same value of C belong to the same redundant group, whose degree of
redundancy can be checked at Redund.)
• Modelo: Here appears the identification code of the component
manufacturer’s model
• Función: It indicates the component function within the system
• Estado: It indicates the component state and the mode in which the
component is commanded (e.g., normally open, normally closed,
auto-activated, or manually activated)
In the Datos de Fallas y Efectos table (data related to failure modes, causes,
and effects; see Figure 12.1), the strings filled in the following fields must
coincide for all the components involved:
• Modo de Fallo: It indicates the failure mode of the component (e.g., fail
to open or fail to close)
• Cod. G: It is the two-digit generic code that identifies the component
type (e.g., manual operated valve, air operated valve, or motor
driven pump)

• Controles: It indicates the mode of control of the component-failure
modes (e.g., non-controlled, periodically tested, or continuously
monitored)
• Degree of dependency G2: The conditions that should be met by the components
included in the list of Section 12.4.1 to be considered as precursors
of potential CCF events of degree 2 (G2) are the following:
• Meet the same conditions of the CCFs of degree G1
• Meet one of the following two external conditions (fields of the Datos
Técnicos table; see Figure 12.1):
– The components must operate under the same working conditions;
that is, the strings in the field Condiciones de trabajo must coincide
exactly
– The  components must be located inside the same room or very
close to each other; that is, the information in the field Local must
coincide exactly
• Degree of dependency G3: The conditions that should be met by the components
included in the list of Section 12.4.1 to be considered as precursors
of potential CCF events of degree 3 (G3) are the following:
• Meet the same conditions of the CCFs of degree G1
• Meet both of the following external conditions (fields of the Datos Técnicos
table; see Figure 12.1):
– The components must operate under the same working conditions;
that is, the strings in the field Condiciones de trabajo must coincide
exactly
– The  components must be located inside the same room or very
close to each other; that is, the information in the field Local must
coincide exactly
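
The classification rules above can be sketched as a comparison of worksheet fields. This is an illustrative sketch, not the FMEAe algorithm itself: it assumes each worksheet row is available as a dictionary keyed by the field names used in the text, that Redund is stored as an integer percentage, and that the function name `dependency_degree` is hypothetical.

```python
# Generic codes of the candidate list in Section 12.4.1.
CCF_CANDIDATE_CODES = {
    "BT", "DG", "KA", "KB", "KG", "MA", "PD", "PM", "PT",
    "QB", "QC", "RT", "RC", "VA", "VC", "VD", "VM",
}

# Internal attributes that must coincide exactly for degree G1.
INTERNAL_FIELDS = (
    "Redund", "C", "Modelo", "Función", "Estado",
    "Modo de Fallo", "Cod. G", "Controles",
)

# External attributes: one shared gives G2, both shared give G3.
EXTERNAL_FIELDS = ("Condiciones de trabajo", "Local")


def dependency_degree(row_a, row_b):
    """Return "G1", "G2", "G3", or None for two worksheet rows given as dicts."""
    if row_a["Cod. G"] not in CCF_CANDIDATE_CODES:
        return None
    if row_a["Redund"] < 200:  # at least double redundancy is required
        return None
    if any(row_a[field] != row_b[field] for field in INTERNAL_FIELDS):
        return None
    shared_external = sum(row_a[field] == row_b[field] for field in EXTERNAL_FIELDS)
    return ("G1", "G2", "G3")[shared_external]


if __name__ == "__main__":
    # Hypothetical rows for two redundant motor driven pumps.
    pump_1 = {
        "Redund": 200, "C": 1, "Modelo": "X-100", "Función": "Water supply",
        "Estado": "Standby", "Modo de Fallo": "Fail to start", "Cod. G": "PM",
        "Controles": "Periodically tested",
        "Condiciones de trabajo": "Normal", "Local": "Pump room",
    }
    pump_2 = dict(pump_1)
    print(dependency_degree(pump_1, pump_2))
```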

12.4.3 Assignment of Postulated Generic 𝜷 Factors


After the CCFs have been determined, the common-cause failure rates for the
respective components are calculated, and then they are assigned to the correspond-
ing cell of the field Tasa F. (failure rate) in the Datos de Fallas y Efectos table of the
worksheet (see Figure 12.1).
The common-cause failure rate for each component involved is determined from
the value of the failure rate of the precursor single component-failure mode and the
generic β factor assigned. The latter refers to the β2 parameter of expression 12.8
and is chosen from the following list, depending on the corresponding degree of
dependency (G1, G2, or G3):

• G1: β2 = 0.1
• G2: β2 = 0.15
• G3: β2 = 0.2

The values of β2 listed are based on the following criteria:

• The basic degree of dependency depth corresponds to the sharing of internal
conditions (G1).
• The basic redundancy degree is the double redundancy (2 × 100% of the
component nominal functional capacity), from which the β factors were
estimated in specialized studies such as [30,31], whose average value β = 0.1
is indicated in Table 1.1 of these references.
• Starting from this value of β2 = 0.1, it is assumed that the addition of any
other external attribute to a CCF of degree G1 produces a linear increase
of its β factor by 0.05. This increase is postulated according to the range of
values of β factors appearing in Table 1.1 of [30,31], so that the maximum
postulated value does not exceed the maximum value of that range.
• In this way, the CCF of components with double redundancy of degree G2
will have a β2 = 0.15 and for those of G3 a β2 = 0.2.
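
The linear rule above can be written compactly. In the sketch below, `generic_beta2` follows the postulated values directly, while `ccf_rate` assumes the usual beta-factor relation (common-cause rate = β × single failure rate); that relation is an assumption of this sketch, since the text only states that the rate is determined from both quantities. Both function names are hypothetical.

```python
def generic_beta2(shared_external_attributes):
    """Generic beta factor for double redundancy: 0.1 plus 0.05 per shared external attribute.

    0 shared external attributes corresponds to G1, 1 to G2, and 2 to G3.
    """
    if shared_external_attributes not in (0, 1, 2):
        raise ValueError("FMEAe considers at most two external attributes")
    return round(0.1 + 0.05 * shared_external_attributes, 2)


def ccf_rate(single_failure_rate, beta):
    """Common-cause failure rate, ASSUMING the standard beta-factor model."""
    return beta * single_failure_rate


if __name__ == "__main__":
    for degree, shared in (("G1", 0), ("G2", 1), ("G3", 2)):
        beta = generic_beta2(shared)
        # 1.0e-5 per hour is an illustrative failure rate, not chapter data.
        print(f"{degree}: beta2 = {beta}, CCF rate = {ccf_rate(1.0e-5, beta):.2e}")
```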

12.4.4 Correction of the 𝜷 Factor of the Common Cause Failure Events, According to the Degree of Redundancy

• The values of the β parameter listed in Section 12.4.3 are representative of
CCFs generated by components with double redundancy.
• Since operational experience indicates that as the redundancy level increases,
so does the probability that the components involved survive the CCF, using
these values for CCFs generated by components with higher redundancy is
assumed to be a very conservative approach. To correct this, expression 12.8
in Section 12.4 is used.
• Finally, these corrected values of CCF rates are added to the end of the list
of tables in the FMEAe worksheet for revision and completion (e.g., to write
the pertinent clarifications in the Remarks table or to modify some values
according to pertinent engineering judgment or expert criteria). After
the CCFs have been added to the FMEAe, the criticality analysis can be
performed so that it includes the influence of this kind of event, as
FTA normally does in a system reliability analysis.

12.5  ANALYSIS OF IMPORTANCE BY COMPONENT TYPE


This analysis depends on the results of the criticality analysis treated previously in
Section 12.3 and it reveals the types of component or groups of components (e.g.,
motor driven pumps, diesel driven pumps, motor operated valves, check valves, etc.)
with the highest criticality that support the decision making process in improving
system reliability. The steps of the procedure are:

• After the IRCi values have been calculated, they are grouped by component
types, according to their generic code (field Cod. G in the Datos de Fallas y
Efectos table of the FMEAe worksheet).

• Within each group k, the average values of IRC[k]i are calculated.
• Finally, the averaged IRC[k]i values are sorted in descending order and the
IIC[k] indexes are computed as follows:

\[ \mathrm{IIC}[k] = \frac{\sum_{i=1}^{N_k} \mathrm{IRC}_i^{k}}{N_k} \tag{12.9} \]

where:
IIC[k] is the importance index of component type k (average IRC value
within a given group k of generic components)
N_k is the total number of component-failure modes belonging to the group k
IRC_i^k is the risk index of the component-failure mode i belonging to group k

• Thus, the most important component types which engender the most crit-
ical failure modes are determined; that is, the types of components that
most contribute to the risk can be known and, therefore, unique correc-
tive actions for similar components can be typified or, otherwise, important
design changes can be proposed.
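
The steps above can be sketched as a group-and-average operation. The input format (pairs of generic code and IRC value) and the function name `importance_by_type` are assumptions made for illustration.

```python
from collections import defaultdict


def importance_by_type(failure_modes):
    """Expression 12.9: average IRC per generic component type, sorted descending.

    failure_modes is an iterable of (generic_code, irc) pairs, where the
    generic code corresponds to the Cod. G field of the worksheet.
    """
    groups = defaultdict(list)
    for generic_code, irc in failure_modes:
        groups[generic_code].append(irc)
    iic = {code: sum(values) / len(values) for code, values in groups.items()}
    return sorted(iic.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Illustrative IRC values only; they are not taken from the chapter.
    modes = [("PM", 4.0e-3), ("PM", 2.0e-3), ("VM", 1.0e-3), ("VC", 5.0e-4)]
    for code, value in importance_by_type(modes):
        print(f"IIC[{code}] = {value:.2e}")
```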

12.6 RELIABILITY ASSESSMENT OF A GENERIC FIRE QUENCHING SYSTEM APPLYING FMEAe

Figure 12.2 shows the simplified drawing of a hypothetical fire quenching system
whose reliability is assessed applying the FMEAe approach.
The  design and operational information on which the reliability assessment is
performed is summarized as follows:

1. The system stays in standby state and it is activated automatically through
a signal of fire event from the instrumentation and control (I&C) circuits.
2. The standby positions/states of each component are indicated in Table 12.5.
3. The odd I&C circuit generates a signal to activate motor driven pump PM1
which is set in automatic position by its hand switch (HS), and this same
signal closes motor operated valve MV3 and opens MV1.

FIGURE 12.2  Simplified drawing of a fire quenching system.



TABLE 12.5
State/Position of Components of the System of Figure 12.2

| No. | Component ID | Component Description | Standby State/Position | Demand State/Position | Control |
|-----|--------------|-----------------------|------------------------|-----------------------|---------|
| 1 | TK | Water storage tank | Full level | Empty after mission fulfilled | Continuously monitored |
| 2 | V1 | Manual operated valve for isolating the odd train | Normally open | Normally open | Periodically tested |
| 3 | V2 | Manual operated valve for isolating the even train | Normally open | Normally open | Periodically tested |
| 4 | PM1 | Motor driven pump. Odd train | Automatic | Running for 4 hours | Periodically tested |
| 5 | PM2 | Motor driven pump. Even train | Standby | Running for 4 hours if PM1 fails | Periodically tested |
| 6 | VC1 | Check valve. Odd train | Normally closed | Open while PM1 is running | Periodically tested |
| 7 | VC2 | Check valve. Even train | Normally closed | Open while PM2 is running | Periodically tested |
| 8 | MV1 | Motor operated valve at discharge of the odd train | Normally closed | Full open | Periodically tested |
| 9 | MV2 | Motor operated valve at discharge of the even train | Normally closed | Full open if odd train is failed | Periodically tested |
| 10 | MV3 | Motor operated valve for testing the odd train | Normally open | Full closed | Periodically tested |
| 11 | MV4 | Motor operated valve for testing the even train | Normally open | Full closed if PM2 is running | Periodically tested |
| 12 | SP | Sprinkler | Empty | Cooling water flowing | Non-controlled |
| 13 | Power6KV | Support system for power supply of both PM1 and PM2 | Energized | Energized | Continuously monitored |
| 14 | Power380V | Support system for power supply of all MOVs | Energized | Energized | Continuously monitored |
| 15 | Odd-IC | I&C circuit for auto-activation of the odd train | Energized | Energized | Continuously monitored |
| 16 | Even-IC | I&C circuit for auto-activation of the even train | Energized | Energized | Continuously monitored |

4. Under signal of fire event, if the flow is not established, a signal for activa-
tion of the even train is produced, which starts PM2 (set in standby position
of its HS), closes MV4, and opens MV2.
5. The odd train is tested every 720 hours by circulating the cooling water in
recirculation mode through MV3 to the water storage tank TK, and 15 days
later, the even train is tested by starting PM2 and recirculating cooling
water through MV4 to the water storage tank TK.
6. The motor operated valves (MOVs) MV3 and MV4 are tested monthly by clos-
ing them, following the same procedure used for testing each train. When the full
closed position is verified, the valves are opened again and stay in that position.
7. In a way like the former case, the MOVs MV1 and MV2 are tested monthly
by opening them. When the full open position is verified, the valves are
closed again and stay in that position.
8. The motor driven pumps, PM1 and PM2, are powered from the same 6000
volts alternating current (6 kV AC) bus bar.
9. All MOVs, MV1, MV2, MV3 and MV4 are powered from the same 380 V
AC bus bar.

12.6.1 General Assumptions and Other Considerations for the Analysis


Following is a set of general assumptions made for the analysis concerning data and
modeling to gain simplicity for achieving the analysis purposes.

1. Only two types of support systems were considered: power supply and I&C
circuits for auto-activating the fire quenching system.
2. The position of the HS of the active components in the case of the odd train
is set to automatic. This means that under a real demand condition, the gen-
erated signal will act on the components of the odd train.
3. The  position of HS for the active components of the even train is set in
standby, which means that they will act only on the condition of coinci-
dence of PM1 failed and fire alarm signal present.
4. The only human errors considered for this example refer to “V1 in wrong
position on demand” (it fails to remain open on demand) due to human error
after maintenance of the odd train; and “V2 in wrong position on demand”
(it fails to remain open on demand) due to human error after maintenance
of the even train. Both human actions are considered as independent events.
5. The pumps and valves are in the same room.
6. The boundary of pumps and MOVs includes the respective circuit breakers
so that the interface for power supply is considered the BB-6KV bus bar for
PM1 and PM2; and BB-380V bus bar for all MOVs from MV1 to MV4.
7. The interfaces for I&C circuit are assumed to be RC-101 for odd-IC circuit
and RC-102 for even-IC circuit.
8. All quantitative data for component-failure modes were taken from generic
databases starting from [2,3,39].
9. For simplicity, only one failure mode for each component was considered,
except for the motor driven pumps. In  the case of the manual operated

valves, V1 and V2, the hardware cause for “fail to remain in position” fail-
ure mode was neglected and only the human error was considered instead.
10. The  sprinklers also were excluded from the assessment because they are
passive components with very low failure rates.

12.6.2 Preparing the Worksheet for the Analysis and Reliability Assessment in FMEAe

Figure 12.3 presents the FMEAe worksheet with all the essential information filled
in the respective cells, before the CCF events are determined.
It can be observed from Figure 12.3 that 17 component-failure modes are included
in the analysis, according to the system drawing of Figure 12.2, the assumptions, and
other information of interest.
After the fields of the tables are filled, the next step is to proceed with the critical-
ity analysis. To prove how the risk profile of the system is modified due to the inclu-
sion of CCF events in the reliability assessment by means of FMEAe, the results of
both cases are compared, which are presented in Figures 12.5 and 12.6. Figure 12.4
shows the modified FMEAe worksheet after the CCFs were determined.
The worksheet in Figure 12.4 shows an increase in the number of component-failure
modes with respect to Figure 12.3. Thus, after determining CCF events, the list of
component-failure modes that participate in the criticality analysis comprises 22
elements, because five CCFs were added (those whose code begins with CM-). Since the
degree of redundancy is two, five CCFs of double failure were added, as indicated in
the panel located in the lower-left corner of the worksheet. They were classified as
degree G3 because they all share the complete set of attributes to be considered
(internal and external attributes).

FIGURE 12.3  FMEAe worksheet showing the last five component-failure modes of the list
(before determining the CCF events).

FIGURE 12.4  FMEAe worksheet modified after determining CCF events showing the last
5 out of 22 component-failure modes.

FIGURE 12.5  System reliability profile estimated by FMEAe without CCF contributions.

FIGURE 12.6  List of component importance ranked by the F-V importance measure
estimated by the ARCON code without CCF contributions.

Then, the analysts must verify all the information concerned in the worksheet
before the criticality analysis is made to avoid inconsistent results. The data
accompanying the single failure modes generating the CCFs are transferred to
the latter. Some of them, like failure rates, need to be recalculated, which is done
automatically by the FMEAe algorithms. However, the analysts still need to enter
some data in the worksheet, such as the expected effects of each CCF-related failure
mode; in this case, these are the effects related to system availability (ED), whose
degrees of severity are indicated in Table 12.3. For this example, the effects
of each of the five CCF-related failure modes were assigned a severe degree of
severity (3), which means in the FMEAe approach: Immediate shutdown is required.
Recovery time of 1–3 months is considered. After doing this, the criticality
analysis can be performed.

12.6.3 Criticality Analysis of System of Figure 12.2 Applying the FMEAe Approach

According to the data introduced by the analysts, several indexes are estimated by
the criticality analysis algorithms (through the corresponding expressions of
Sections 12.3 and 12.4). Figure 12.5 shows the FMEAe Result sheet, which gives the
value of IRS and the criticality ranking of the component-failure modes without
CCF contribution.
Despite the low value of IRS (9.39E-5), which means that, according to the data
used, the system presents a high degree of reliability and a low degree of risk,
Figure 12.5 shows the component-failure modes that dominate the system reliability,
ranked by their IIRi values. These include, in decreasing order of importance, the
single failures of both pumps to start on demand (PM1.S and PM2.S), the failure of
both power supply support systems, and the human errors on the valves V1 and V2
(V1.M and V2.M).
Figure  12.6 represents the results of the FTA  for the system using the same
set of data and assumptions made for FMEAe analysis by means of the ARCON
code [40,41], which is used here as a way of comparison between the FTA and
FMEAe approaches. The  system failure probability estimated is Ps  =  1.61E-3,
which can be considered within the reliability target that should be established
for safety systems at industrial facilities with high requirements for safety and
availability. The  reliability profile is shown in Figure  12.6 ranked by Fussell-
Vesely (F-V) importance measure that represents the relative contribution of each
component-failure mode to the system’s probability of not  fulfilling its safety
function.
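
As a rough illustration of how an F-V ranking of this kind is formed, the sketch below computes the Fussell-Vesely measure from a list of minimal cut sets using the rare-event approximation (system failure probability taken as the sum of the cut set probabilities). The event names, cut sets, and probabilities are hypothetical placeholders, not the data of the example system, and the ARCON code may evaluate the measure differently.

```python
import math

def fussell_vesely(cut_sets, probs):
    """Return {event: F-V importance} under the rare-event approximation.

    cut_sets: list of sets of basic-event names (minimal cut sets).
    probs:    dict mapping each basic-event name to its failure probability.
    """
    # Probability of a cut set = product of its basic-event probabilities
    # (assumes independent basic events).
    cs_probs = [math.prod(probs[e] for e in cs) for cs in cut_sets]
    # Rare-event approximation of the system failure probability.
    p_sys = sum(cs_probs)
    # F-V importance: share of p_sys carried by cut sets containing the event.
    return {
        e: sum(p for cs, p in zip(cut_sets, cs_probs) if e in cs) / p_sys
        for e in probs
    }

# Hypothetical system: a redundant pair (A, B) plus a single event C.
example_probs = {"A": 0.01, "B": 0.02, "C": 0.001}
example_cut_sets = [{"A", "B"}, {"C"}]
fv = fussell_vesely(example_cut_sets, example_probs)
for event, value in sorted(fv.items(), key=lambda kv: -kv[1]):
    print(f"{event}: F-V = {value:.3f}")
```

In this toy case the single event C outranks the redundant pair, which mirrors the general tendency of cut-set-based rankings to favor low-order cut sets.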
From Figures 12.5 and 12.6, it can be seen that the same group of component-failure
modes dominates the reliability profile. The distinctive feature between both
approaches lies in the fact that FMEAe adds the failure effects, which, in turn,
take the redundancy into account. Therefore, the single events Power380V and
Power6KV are placed higher in the ranking made by FMEAe. The same is made by F-V
in the FTA performed with ARCON. In that sense, the FMEAe approach can be used to
follow the regulatory issues more closely than FTA. Nevertheless, the list of the
12 most important failure modes coincides in both approaches.
When the CCFs are included in the analysis, whatever the method used, the
reliability profiles can change dramatically, depending on the system's redundancy
degree. In the case of the example used herein, the global values were not affected
because of the relatively low values of unavailability used for the system's
components, and the system redundancy itself, but the importance profile of the
component-failure modes changes slightly, as shown in Figures 12.7 and 12.8.
The approach of minimal cut sets (MCSs) is responsible for the major difference
between the results obtained by FMEAe and the ARCON code and, once again, the
inclusion of the effects in the former strengthens that difference even more.

FIGURE 12.7  Reliability profile determined by FMEAe with CCF contributions.

The value of IRS increases fivefold when the CCFs are included in the analysis
(see Figures 12.5 and 12.7), while the system failure probability estimated by
ARCON after the inclusion of CCFs is P = 4.55E-3; that is, it increases 2.8 times.
The reliability profiles coincide in both cases, with slight differences that are
based on the same criteria explained previously. In this case, the most dominant
failure modes in both approaches were the CCFs of motor-operated pumps to start,
followed in the case of FMEAe by another two CCF events involving the failure to
open on demand of both MV1 and MV2, and the failure to close on demand of MV3 and
MV4, as shown in Figure 12.7.
On the other hand, the ranking of values estimated by FTA gives more priority
to the single failures of the pumps PM1 and PM2 to start on demand than to the CCFs
of the MOVs to open and to close on demand, as shown in Figure 12.8 and as stated
before. This difference stems from the MCS approach used in FTA, as opposed to the
failure mode and effect approach. Nevertheless, both approaches coincide in
estimating the most important contributors to the system reliability, and
therefore, they can be equally useful for decision making, despite their known
major differences.
Finally, to complete the results from the FMEAe approach in evaluating the system
reliability, an analysis of importance by component type can be done, as indicated
in Section 12.5. The results of that kind of analysis for this example system
are presented in Figure 12.9.
Figure 12.9 shows that the component type of greatest importance was the PM type
(motor-driven pumps), which resulted in a medium value of importance according
to the FMEAe postulated scale. This classification is quite logical because this
type of component was the one with the highest values of criticality among all the
system components, due to either single or common cause failures. This result
supports the measures to be taken to improve the system reliability profile, even
though the system reliability can be considered acceptable.

FIGURE 12.8  List of component importance ranked by the F-V importance measure
estimated by the ARCON code with CCF contributions.

FIGURE 12.9  Results of the analysis of importance by component type using FMEAe.

12.7  FINAL REMARKS


The new approach developed in FMEAe raises some important points to consider for
system reliability assessments. These are summarized as follows:

• FMEAe does not replace the reliability assessments made by other powerful
quantitative techniques, such as FTA, but it can be used as an alternative
method when the effects of failures need to be considered.
• FMEAe can be considered as an advanced variant of FMEA/FMECA,
which solves some major recognized disadvantages of the latter regarding
the dependency analysis.
• Unlike the traditional method of FMEA/FMECA, FMEAe can provide a
global assessment of the system reliability by means of the new index IRS,
which uses a postulated scale based on good practices criteria.
• Through the new index of relative importance of component-failure modes
(IIR), the analysts can support their decisions on corrective measures to be
proposed based on the analysis results.
• FMEAe keeps the strength of the qualitative methods regarding the descriptive
potentialities and all the useful information that can be managed in the
same analysis environment.
• FMEAe has been applied to reliability assessments of other systems, such as
the fuel supply system of the ATR-42 aircraft and other designs of cooling
systems as presented in [28], giving reliability profiles that are highly
comparable with the same profiles estimated by FTA.
• All the studies performed so far have demonstrated that when CCF events are
included in the analysis, both approaches (FTA and FMEAe) give results that also
are considered highly comparable.

REFERENCES
1. IAEA. INSAG-12. Basic Safety Principles for Nuclear Power Plants. IAEA  Safety
Series No. 75-INSAG-3, Rev. 1. IAEA, Vienna, Austria, 1999.
2. IAEA. IAEA-TECDOC-478. Component Reliability Data for Use in Probabilistic
Safety Assessment. IAEA, Vienna, Austria, 1988.
3. IAEA. IAEA-TECDOC-508. Survey of Ranges of Component Reliability Data for Use
in Probabilistic Safety Assessment. IAEA, Vienna, Austria, 1989.
4. PHA5-Pro Software. Trial Version. DYADEM International LTD, USA, 1994–2000.
5. Hazard Review Leader. Trial Version 4.0.106. ABS Consulting, 2000–2003.
6. Dinámica Heurística, S.A. de C.V. Software SCRI-HAZOP. SCRI-FMEA, SCRI-What/
If. NL, México, 2004.
7. FMEA Pro 6. FMEA-PRO 6 World’s Most Powerful FMEA Tool Risknowlogy Risk,
Safety & Reliability. Dyadem International LTD, 2003.
8. Relex FMECA. Relex Software Corporation, 2003.

9. MIL-STD-1629A. Military standard. Procedures for Performing a Failure Mode,
Effects and Criticality Analysis. DOE, Washington, DC, 1980.
10. US NRC. NUREG/CR-2300. Probabilistic Risk Analysis Procedures Guide. USNRC,
Washington, DC, 1983.
11. IAEA. IAEA  Safety Series No.  50-P-4. Procedures for Conducting PSA of Nuclear
Power Plants. IAEA, Vienna, Austria, 1992.
12. US NRC. NUREG/CR-4550. Vol.  1, Rev. 1. Analysis of Core Damage Frequency:
Internal Events Methodology. US NRC, Washington, DC, 1991.
13. US NRC. NUREG/CR-2815. Probabilistic Safety Analysis Procedures Guide. US
NRC, Rev. 1, Washington, DC, 1985.
14. US NRC. NUREG/CR-2728. Interim Reliability Evaluation Programme (IREP)
Procedures Guide. US NRC, Washington, DC, 1983.
15. CNE.APS.IF.103. Análisis Probabilístico de Seguridad. Informe Final de la Fase I.
Volúmenes I y II. Rev. 1, Central Nuclear Embalse, Córdoba, Argentina, 2003.
16. Análisis Probabilista de Seguridad de la Central Electro Nuclear de Juraguá. Reporte
Preliminar del estudio de 15 sucesos iniciadores, GDA/APS, ISCTN-CNSN, Habana,
Cuba, 1995.
17. US NRC. NUREG/CR-2787. Interim Reliability Evaluation Programme (IREP).
Analysis of the Arkansas Nuclear One -Unit 1- Nuclear Power Plant (NPP). US NRC,
Washington, DC, 1982.
18. US NRC. NUREG/CR-1659. Vol. 1. Reactor Safety Study Methodology Applications
Program. Study of Sequoyah PWR Unit 1. US NRC, Washington, DC, 1981.
19. US NRC. NUREG/CR-1659. Vol. 3. Reactor Safety Study Methodology Applications
Program. Study of Calvert Cliffs PWR Unit 2. US NRC, Washington, DC, 1982.
20. Análisis Probabilístico de Seguridad de la Central Nucleoeléctrica de Laguna Verde.
Unidad 1. Informe Ejecutivo. México, 5 de Diciembre de 1989.
21. EPRI NP-1804-SR. The German Risk Study for Nuclear Power Plants. Ministry for
Research & Technology, TUV Rheinland, Koeln, Germany, 1979.
22. WASH-1400-MR (NUREG-75/014). Reactor Safety Study: An Assessment of Accidents
Risks in US Commercial Nuclear Power Plants. United States Nuclear Regulatory
Commission, Washington, DC, 1975.
23. Estorilo, C., and Posso, R. 2010. The reduction of irregularities in the use of process
FMEA. International Journal of Quality and Reliability Management Vol. 27, No. 6,
pp. 721–733.
24. Arabian Hoseynabadi, H., Oraee, H., and Tavner, P. J. 2010. Failure mode and effects
analysis (FMEA) for wind turbines. Electrical Power and Energy Systems Vol.  32,
pp. 817–824.
25. Sawant, A., Dietrich, S., Svatos, M., and Keal, P. 2010. Failure mode and effect anal-
ysis-based quality assurance for dynamic MLC tracking systems. Medical Physics
Vol. 37, No. 12, p. 6466.
26. Yang, F., Cao, N., Young, L. et  al. 2015. Validating FMEA  output against incident
learning data: A study in stereotactic body radiation therapy. Medical Physics Vol. 42,
p. 2777.
27. Perdomo, M., Salomon, J., Rivero, J. et al. 2010. ASeC, An advanced system for opera-
tional safety and risk assessment of industrial facilities with high reliability require-
ments. IBP3090_10. Rio Oil  & Gas 2010 Expo and Conference, September 13–16,
2010.
28. Perdomo, M., and Salomón, J. 2016. Análisis de modos y efectos de falla expandido:
Enfoque avanzado de evaluación de fiabilidad. Revista Cubana de Ingeniería Vol. VII,
No. 2, pp. 45–54.
29. US NRC. NUREG/CR-4780. Procedures for Treating CCF in Safety and Reliability
Studies. US NRC, Washington, DC, Vol. 1, 1988, Vol. 2, 1989.

30. EPRI NP-3967. Classification and Analysis of Reactor Operating Experience Involving
Dependent Events. EPRI, Palo Alto, CA, 1985.
31. EPRI TR-100382. A  Data Base of Common-Cause Events for Risk and Reliability
Applications. EPRI, Palo Alto, CA, 1992.
32. US NRC. NUREG/CR-5460. A  Cause-Defense Approach to the Understanding and
Analysis of Common Cause Failures. SNLs, JBF Associates, NUS Corporation,
Washington, DC, 1990.
33. US NRC. NUREG/CR-5801. Procedure for Analysis of Common-Cause Failures in
Probabilistic Safety Analysis. USNRC, Washington, DC, 1993.
34. US NRC. NUREG/CR-6268, Rev. 1. Common-Cause Failure Database and Analysis
System: Event Data Collection, Classification, and Coding. INL, Washington, DC,
2007.
35. US NRC. NUREG/CR-6819, Vol.  1. Common-Cause Failure Event Insights: Diesel
Generators. INEEL, Washington, DC, 2003.
36. US NRC. NUREG/CR-6819, Vol.  2. Common-Cause Failure Event Insights: Motor
Operated Valves. INEEL, Washington, DC, 2003.
37. US NRC. NUREG/CR-6819, Vol. 3. Common-Cause Failure Event Insights: Pumps.
INEEL, Washington, DC, 2003.
38. US NRC. NUREG/CR-6819, Vol. 4. Common-Cause Failure Event Insights: Circuit
Breakers. INEEL, Washington, DC, 2003.
39. OREDA. Offshore Reliability Data, 4th ed. Det Norske Veritas, Høvik, Norway, 2002.
40. Mosquera, G., Rivero, J., Salomón, J. et  al. 1995. Disponibilidad y Confiabilidad
de Sistemas Industriales. El sistema ARCON. Anexo B, pp.  137–140. Ediciones
Universitarias UGMA, Barcelona, Venezuela, Mayo.
41. Salomón, J. Manual de Usuario Práctico del Código ARCONWIN Ver 7.2. Registro de
autor, CENDA, La Habana, Cuba, 2015.
13 Reliability Assessment and Probabilistic Data Analysis of Vehicle Components and Systems
Zhigang Wei

CONTENTS
13.1 Introduction
13.2 Reliability of Vehicle Components and Systems
13.3 Fatigue S-N Curve Transformation Technique
13.4 Representation of Reliability Testing Methods in the Damage-Cycle Diagram
13.5 Probabilistic Data Analysis
13.5.1 Binomial Reliability Demonstration
13.5.2 Life Testing
13.5.3 Bayesian Statistics for Sample Size Reduction
13.6 System Reliability
13.6.1 Series System Model
13.6.2 Parallel System Model
13.6.3 Mixtures Model
13.7 Conclusions
References

13.1 INTRODUCTION
Fatigue-related durability and reliability performance is a major concern in the
design of vehicle components and systems [1]. Durability describes the ability of
a product to sustain required performance over time or cycles without undesirable
failure. Reliability is defined as the ability of a system or component to perform
its required functions under stated conditions for a specified period. Both the
load/stress experienced by a vehicle component or system and the strength of the
component or system being studied are random variables and normally follow
stochastic and probabilistic processes. Therefore, probability distribution
functions can be used to characterize the load/stress and strength of the vehicle
components and systems for a given cycle. The stress-strength interference model is
a fundamental probability-based method for reliability analysis [2] (Figure 13.1a),
and it can be applied to fatigue-related reliability assessment if both the
stress/load distribution and the fatigue strength distribution at a given common
cycle are known. Another approach, which is similar to the stress-strength
interference model but is more commonly used in practical reliability assessment,
is the life-based demand-capability model (Figure 13.1b). In contrast to the
stress-strength interference model, it requires that the life distributions of
demand and capability at a given stress or load level be known in advance.

FIGURE 13.1  Reliability assessment based on (a) stress-strength interference model and
(b) demand-capability in terms of fatigue cycle. (Adapted from Wei, Z. et al., Reliability
analysis based on stress-strength interface model, Wiley Encyclopedia of Electrical and
Electronics Engineering, Chichester, UK, Wiley, 2018.)

To make the stress-strength interference model applicable, the stress distribution,
with probability density function (PDF) f_P(P), and the strength distribution,
f_S(S), must be available. Similarly, to make the life-based demand-capability
model applicable, the demand distribution f_N(D) and the capability distribution
f_N(C) must be provided in advance. How to obtain a representative stress
distribution and a life demand distribution is a challenging topic. A simplified
method is often used in practice. For example, instead of using the whole set of
life demand information, a single life demand point is set as a target, which
represents a chosen percentile of usage (e.g., the 95th). Corresponding to the life
demand point, a single capability point, which represents a certain reliability and
confidence (RC) level, for example, R90C90 (90% reliability and 90% confidence), is
obtained from the life capability and compared with the demand point [1]. A safety
factor can then be defined as the ratio of the life capability to the life demand.
The stress-strength interference model can be simplified in a similar way. How to
obtain a fatigue life distribution and a fatigue strength distribution from a given
set of stress-cycle (S-N) fatigue data is one of the main focuses of this chapter.
The relationship between these two distributions for a given set of fatigue data is
a key to accomplishing reliability assessments; however, it is often unclear. To
reveal the relationship, a new fatigue S-N curve transformation technique, which is
based on the fundamental statistics definition and some reasonable assumptions, is
introduced in this chapter.
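
The two simplified quantities described above can be sketched numerically. The example below assumes, purely for illustration, that stress P and strength S are independent and normally distributed, in which case the interference reliability P(S > P) has a closed form; the chapter does not prescribe this distribution choice, and all numerical values and function names are hypothetical.

```python
import math

def standard_normal_cdf(z):
    """CDF of the standard normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interference_reliability(mu_s, sigma_s, mu_p, sigma_p):
    """P(strength > stress) for independent normal strength S and stress P.

    S - P is then normal with mean mu_s - mu_p and standard deviation
    sqrt(sigma_s**2 + sigma_p**2), so the reliability is a single CDF value.
    """
    z = (mu_s - mu_p) / math.hypot(sigma_s, sigma_p)
    return standard_normal_cdf(z)

def life_safety_factor(life_capability, life_demand):
    """Ratio of a single life capability point to a single life demand point."""
    return life_capability / life_demand

# Hypothetical inputs (e.g., stresses in MPa, lives in cycles).
reliability = interference_reliability(mu_s=500.0, sigma_s=40.0,
                                       mu_p=350.0, sigma_p=30.0)
factor = life_safety_factor(life_capability=3.0e5, life_demand=1.5e5)
print(f"interference reliability = {reliability:.5f}")
print(f"life-based safety factor = {factor:.2f}")
```

For non-normal distributions the same reliability would generally have to be obtained by numerical integration or simulation rather than from a closed form.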

Numerous testing methods are available for product durability validation and reli-
ability demonstration, and such methods include life testing (test-to-failure), binomial
testing (pass or fail), and degradation testing [1,3]. The test-to-failure method tests
a component to the occurrence of failure under a specified loading. The binomial
(Bogey) testing method is used often in reliability demonstration in which the cus-
tomers’ specifications must be met for acceptance into service. The  degradation
testing is used to test a product to a certain damage level, which is often at a level
far below complete failure. Additionally, the associated accelerated testing meth-
ods [3,4] (i.e., accelerated life testing, accelerated binomial testing, and accelerated
degradation testing) are used often to shorten the development time and reduce the
associated cost while not  significantly sacrificing the accuracy of the assessment.
All these methods are treated separately, and their relationships are not clear, which
impedes the wide and proper applications of these methods and their combinations.
In  this chapter, a unified framework of the reliability assessment method is pre-
sented in a damage-cycle (D-N) diagram [5], which consists of the following major
constituents: (1) test data, either test-to-failure, binomial, degradation, or combined,
for estimating the continuous probabilistic distribution function, (2) damage accu-
mulation rules, such as the linear or nonlinear damage accumulation rules, for data
interpolation and extrapolation, and (3) a variable transformation technique, which
converts a probabilistic distribution of a variable into a probabilistic distribution of
another variable.
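
To give a feel for the binomial (pass/fail) demonstration mentioned above, the sketch below uses the widely known zero-failure ("success-run") relation R^n <= 1 - C to size a test. This relation is standard binomial statistics brought in here as an assumption for illustration; it is not derived in this part of the chapter, and the function names are hypothetical.

```python
import math

def success_run_sample_size(reliability, confidence):
    """Smallest n such that n passes with zero failures demonstrate
    the given reliability at the given confidence (R**n <= 1 - C)."""
    n = math.log(1.0 - confidence) / math.log(reliability)
    return math.ceil(n)

def demonstrated_confidence(reliability, n_passed):
    """Confidence achieved when n_passed samples all pass (zero failures)."""
    return 1.0 - reliability ** n_passed

n_r90c90 = success_run_sample_size(0.90, 0.90)
print("samples needed for R90C90 with zero failures:", n_r90c90)
print("confidence actually demonstrated:",
      round(demonstrated_confidence(0.90, n_r90c90), 4))
```

The sample size grows quickly as the reliability target approaches 1, which is one motivation for the accelerated and Bayesian approaches discussed in this chapter.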
In addition to these two transformation techniques, the probabilistic analysis of
data with large sample size using two- and three-parameter Weibull distribution
functions, the uncertainty for data with small sample size, and the sample size
reduction approaches based on Bayesian statistics also are investigated.
Furthermore, the basic assumptions and theories in assessing the reliability of
systems are provided to complement these two basic transformation techniques. It
should be noted that software reliability of modern vehicle components and systems
is very important [3], especially now that vehicle-to-vehicle (V2V),
vehicle-to-infrastructure (V2I), and autonomous vehicles are mainstream topics in
the automotive industry. However, only fatigue-related reliability is considered in
this chapter because of space limitations.
This chapter is organized as follows:
Section 13.2 provides a brief and general background about the reliability assess-
ments of vehicle components and systems with an emphasis on vehicle exhaust
components and systems. Section 13.3 presents a fatigue S-N curve transformation
technique in which distributions of load/stress and life can be properly selected
based on data pattern and converted to each other when necessary. Section  13.4
introduces a variable transformation technique in a damage-cycle (D-N) diagram,
which is a tool that can effectively interpret the commonly used fatigue-testing
methods and seamlessly reveal the interrelationship among these testing methods.
Section 13.5 provides some basic methods for processing data with probabilistic
distributions, with special attention to the differences between the two-parameter
and three-parameter Weibull distribution functions in terms of predictability and
applicability. Uncertainty analysis on data with small sample size and the potential
capability of Bayesian statistics in sample size reduction also are discussed in
Section 13.5. Section 13.6 provides the basic concepts on reliability assessment of
systems. Pertinent examples are provided in each section to demonstrate the
concepts and techniques developed. Finally, Section 13.7 summarizes this chapter
with several key observations.

13.2  RELIABILITY OF VEHICLE COMPONENTS AND SYSTEMS


A  vehicle usually consists of several systems, such as powertrain, chassis,
body, electrical, and exhaust systems. Each system can be further divided
into subsystems and their constituent components. During vehicle operation,
vehicle components and systems are subjected almost invariably to road load
and engine vibration. With the increased mileage demand of the vehicle life
(e.g.,  10  years/150,000  miles), the durability and reliability performance of
the  vehicles is an important factor in vehicle design and development. Some
vehicle systems may be subject to various other operating environments and con-
ditions. For example, vehicle engine and systems are constantly exposed to high
temperature and corrosive environments [6].
Based on the temperature level, the associated failure mechanisms, and the related
analysis approaches, the failure type can be categorized into three groups: (1)
isothermal fatigue, (2) anisothermal fatigue, and (3) high-temperature
thermal-mechanical fatigue (TMF) [7]. In isothermal fatigue, the temperature remains
relatively low and constant. In anisothermal fatigue, the temperature varies rather
than staying at a single fixed value. The applied temperature in isothermal and
anisothermal fatigue should be low enough to avoid triggering other failure
mechanisms such as creep and oxidation, which are time-dependent failure
mechanisms. Corrosion in vehicle exhaust systems is usually caused by salt,
condensate, urea, and other corrosive agents. Creep begins at a temperature of
approximately half the absolute melting temperature (in kelvins or degrees Rankine)
of the metal [6]. By contrast, fatigue is essentially a cycle-dependent failure
mechanism. The temperature in high-temperature TMF is high enough to trigger creep
and oxidation.
Product durability and reliability validation testing and the associated life
assessment are becoming routine processes in the development of exhaust components
and systems. In product validation, how to handle temperature effects is still a
controversial issue. Generally, there are two approaches: (1) cold-testing and
(2) hot-testing. To reduce cost and shorten the product development cycle, the hot
gas in a vehicle is often bypassed during the road load data acquisition (RLDA)
process in cold-testing; hence, room or near-room temperature information is
collected during RLDA. For consistent performance evaluation, subsequent calibration
testing and component bench testing also are conducted in the same cold
conditions [8]. With the cold-testing information, the performance of the component
or system at high operating temperatures can be estimated by introducing a
temperature factor, which is used to compensate for the temperature effects. With
the temperature factor, a product designed under cold-testing conditions can be
reliable if the dominating factors are properly accounted for in the temperature
factor. After these factors are identified, quantified, and applied to the RLDA load
data, rainflow cycle counting can be performed and combined with the linear Miner's
damage rule. Miner's rule predicts that failure occurs when the accumulated damage
is greater than or equal to 1 [8].
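
The linear damage summation can be sketched as follows. The load levels, counted cycles, and cycles-to-failure values are hypothetical stand-ins for the output of a rainflow count and an S-N curve lookup; neither of those steps is implemented here.

```python
def miner_damage(cycle_counts, cycles_to_failure):
    """Linear (Miner) damage sum: D = sum(n_i / N_i) over all load levels.

    cycle_counts:      counted cycles n_i at each load level.
    cycles_to_failure: fatigue life N_i at the same load level.
    """
    return sum(n / big_n for n, big_n in zip(cycle_counts, cycles_to_failure))

# Hypothetical result of a rainflow count for one repetition of a load block.
counted_cycles = [20000, 5000, 100]        # n_i
lives_at_level = [1.0e6, 1.0e5, 1.0e3]     # N_i from an S-N curve

damage_per_block = miner_damage(counted_cycles, lives_at_level)
failure_predicted = damage_per_block >= 1.0    # Miner's failure criterion
blocks_to_failure = 1.0 / damage_per_block     # repetitions until D reaches 1

print(f"damage per block   = {damage_per_block:.3f}")
print(f"failure predicted  = {failure_predicted}")
print(f"blocks to failure  = {blocks_to_failure:.2f}")
```

Because the rule is linear, it ignores load-sequence effects; the nonlinear accumulation rules mentioned elsewhere in the chapter address that limitation.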
As the name implies, in hot-testing all parts of the RLDA, calibration, and
bench testing are conducted in service or equivalent high-temperature conditions.
The fatigue life can be assessed in service conditions, and no temperature
correction factor is required in the fatigue life assessment. The hot-testing method
is still evolving [8] and, without loss of generality, only the cold-testing related
topics are addressed in this chapter.

13.3  FATIGUE S-N CURVE TRANSFORMATION TECHNIQUE


The fatigue S-N data in a (2D) fatigue S-N plot characterize the capability of the
material in fatigue failure resistance. The higher the location of the data in the plot,
the higher the resistance of the material to fatigue failure. Fatigue data often show
large scatters in life as well as in load/stress due to a wide variety of intrinsic uncer-
tainties, such as material, loading, and manufacturing uncertainties. Figure 13.2 sche-
matically shows the major characteristics of a fatigue S-N mean curve, its lower and
upper bounds, and the life distributions around the mean curve. A fatigue S-N curve
can be roughly divided into three regimes: Regime-I, Regime-II, and Regime-III,
which represent low-cycle fatigue, medium-cycle fatigue, and high-cycle fatigue,
respectively. In  many engineering applications, the Regime-II for medium-cycle
fatigue is of significant interest, and the mean curve in Regime-II often can be treated
using a linear approximation in an appropriate plot, such as log-log plot. Mean curve
is used often to characterize the general trend of a material in fatigue failure resis-
tance. However, in many applications, such as product validation, quality control,
and life management, the scatter of the fatigue life around the mean is also of signifi-
cant importance.
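
The linear treatment of Regime-II in a log-log plot can be sketched as an ordinary least-squares fit of log10(N) against log10(S), with the residual standard deviation serving as a simple measure of scatter in life. Treating log life as the dependent variable is an assumption of this sketch, and the data points are made up for illustration.

```python
import math

def fit_sn_loglog(stress, cycles):
    """Least-squares fit of log10(N) = intercept + slope * log10(S).

    Returns (intercept, slope, s), where s is the residual standard
    deviation of log10(N) about the fitted mean curve.
    """
    x = [math.log10(s) for s in stress]
    y = [math.log10(n) for n in cycles]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # n - 2 degrees of freedom: two fitted parameters.
    s = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return intercept, slope, s

# Hypothetical medium-cycle fatigue data (stress amplitude, cycles to failure).
stress_levels = [400.0, 400.0, 300.0, 300.0, 200.0, 200.0]
cycles_to_failure = [2.1e4, 3.5e4, 1.2e5, 2.0e5, 1.5e6, 2.6e6]

intercept, slope, scatter = fit_sn_loglog(stress_levels, cycles_to_failure)
print(f"log10(N) = {intercept:.3f} + ({slope:.3f}) * log10(S)")
print(f"standard deviation of log10(N) about the mean curve = {scatter:.3f}")
```

A negative fitted slope is expected, since life decreases as the stress level increases.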
There are two basic ways to describe the statistical variability of the fatigue S-N
data in the linear Regime-II:

1. Life distribution as a function of load or stress, f[Nf(S)] (Figure 13.3)
2. Load/stress (strength) distribution as a function of fatigue cycle, f[S(Nf)]
(Figure 13.3)

FIGURE 13.2  A schematic of a general fatigue S-N curve.

FIGURE 13.3  Cycle and load based probabilistic distributions for the same set of fatigue
S-N data.

The life distribution is used much more commonly than the strength distribution
in fatigue data analysis. However, the strength distribution has many unique charac-
teristics and important applications, such as:

1. Relatively invariant to the levels of load/stress for some engineering
materials [9]
2. The load/stress-based safety factor is more reasonable to assess the margin
of safety of a product as a unifier across the whole range of the fatigue
regimes
3. The probabilistic distribution of stress/load (strength) at a given cycle to
failure is an essential part of the stress-strength interference model based
reliability analysis
4. The strength distribution makes the stress-strength interference model
possible at the system level

Although the life distribution and strength distribution can be obtained directly
from, respectively, the horizontal offset method and the vertical offset method, the
recommended standard fatigue life data analysis is the horizontal offset
method [10]. How to transform the life distribution to load/stress distribution is
an open challenge. In addition, the vertical offset method is not always feasible,
but the load/stress distribution is often desirable. For example:

1. The raw fatigue S-N data, such as the data plotted in literature and reports,
are not always available. However, the values of the fit parameters based on
the horizontal offset approach are often provided.
2. The patterns of some fatigue S-N data lead to inaccurate fitting results if the
vertical offset approach is used, whereas the data patterns match the
horizontal offset method well [10]. This situation often is the case for
two-stress level fatigue data.
3. Fatigue S-N data are available only at one stress level while the slope of the
fatigue S-N curve is known already based on the historical data.

In all these cases, a new technique is required to transform the distribution of
life to the distribution of strength or vice versa. Such a technique is described
below.

FIGURE 13.4  Schematic of probabilistic distributions of x and y.

The only assumption of this new technique is expressed mathematically in
Equation 13.1, which states that the amplitude of the PDF of y at a given x level,
f[y(x)], is proportional to the amplitude of the PDF of x, f[x(y)], at that point
(Figure 13.4).

$$ f[y(x)] = K f[x(y)] \tag{13.1} $$

where K is a constant, which is determined by satisfying the basic probability law
(Equation 13.2):

$$ \int_{-\infty}^{+\infty} f(y)\, dy = 1 \tag{13.2} $$

Equation 13.1 indicates implicitly that the peak of the PDF of the strength
distribution corresponds to that of the PDF of the life distribution, and the valley
of the PDF of the strength distribution corresponds to that of the PDF of the life
distribution, for single-mode probabilistic distributions (Figure 13.4). This
assumption makes sense intuitively based on the observations of a wide variety of
fatigue data. The lognormal (normal) distribution is used below to demonstrate the
transformation technique.
The selection of probabilistic distribution functions is a critical issue in reliability
assessment. The real distribution of fatigue life at a given stress level is essentially
unknown. However, the two-parameter Weibull and lognormal distribution functions
are commonly used in probabilistic fatigue life assessments [11]. In the automotive
industry, the two-parameter Weibull often is preferred in fatigue life assessments
because of its simplicity and the seemingly meaningful interpretation of its shape parameter.
Years of experience and data collection show that both functions empirically fit the
fatigue data equally well as far as the mean behavior is concerned [11]. The pairs of
fit parameters for the two distribution functions are, respectively, µ (mean)/σ
(standard deviation) and η (scale)/β (shape). The bell-shaped normal PDF and the
corresponding cumulative distribution function (CDF) are expressed in Equations 13.3a
and 13.3b, respectively:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right] \tag{13.3a}$$

$$F(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right] \tag{13.3b}$$

where erf() is the error function.


The Weibull PDF and the corresponding CDF are listed in Equations 13.4a and
13.4b, respectively:

βx
β −1
  x β 
f ( x) =   exp  −    (13.4a)
η η    η  

  x β 
F ( x) = −
1 exp  −    (13.4b)
  η  

With an added threshold parameter, a, Equation 13.4 can be generalized to the
three-parameter Weibull function in Equation 13.5:

$$f(x) = \frac{\beta}{\eta}\left(\frac{x-a}{\eta}\right)^{\beta-1} \exp\left[-\left(\frac{x-a}{\eta}\right)^{\beta}\right] \tag{13.5a}$$

$$F(x) = 1 - \exp\left[-\left(\frac{x-a}{\eta}\right)^{\beta}\right]; \quad 0 < a \le x < \infty,\ \eta, \beta > 0 \tag{13.5b}$$
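As a minimal sketch (not part of the original chapter), Equations 13.3b, 13.4, and 13.5 can be coded with the Python standard library. The function names, the numerical values, and the convention that a = 0 recovers the two-parameter form are assumptions made here for illustration.

```python
import math


def normal_cdf(x, mu, sigma):
    """Normal CDF, Equation 13.3b."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def weibull_pdf(x, eta, beta, a=0.0):
    """Weibull PDF, Equation 13.5a; a = 0 gives Equation 13.4a."""
    if x <= a:
        return 0.0  # simplification: density taken as zero at or below the threshold
    z = (x - a) / eta
    return (beta / eta) * z ** (beta - 1.0) * math.exp(-(z ** beta))


def weibull_cdf(x, eta, beta, a=0.0):
    """Weibull CDF, Equation 13.5b; a = 0 gives Equation 13.4b."""
    if x <= a:
        return 0.0
    return 1.0 - math.exp(-(((x - a) / eta) ** beta))


# Illustrative (hypothetical) parameters, not taken from the chapter.
eta, beta, a = 1.0e5, 2.0, 2.0e4
print(weibull_cdf(a + eta, eta, beta, a))  # fraction failed one characteristic life past the threshold
print(normal_cdf(0.0, 0.0, 1.0))           # CDF at the mean
```

At x = a + η the Weibull CDF equals 1 − exp(−1), about 0.632, for any shape parameter, which is why η is usually called the characteristic life.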

The normal distribution is used in this section to show the fatigue S-N curve trans-
formation technique. Based on Equations 13.1 through 13.3a, the PDF of the normal
distribution function f ( y ) as a function of y can be written as Equation 13.6:

$$\int_{-\infty}^{+\infty} K f[x(y)]\,dy = K \int_{-\infty}^{+\infty} \frac{1}{\sigma_x(y)\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left[\frac{x-\mu_x(y)}{\sigma_x(y)}\right]^{2}\right\} dy = 1 \tag{13.6}$$

where both the mean µ_x and the standard deviation σ_x in Equation 13.6 are assumed to be
functions of y. When σ_x is assumed to be a constant and a linear relationship
holds for the mean curve, as in Equation 13.7:

$$\mu_x = a + by \tag{13.7}$$

Equation 13.6 can be simplified considerably, and after rearrangement it can be transformed
to Equation 13.8:

$$K \int_{-\infty}^{+\infty} \frac{1}{\sigma_x\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left[\frac{y - \dfrac{a-x}{-b}}{-\sigma_x/b}\right]^{2}\right\} dy = 1 \tag{13.8}$$

Based on the linear assumption in Equation 13.7, the term (a − x)/(−b) in Equation 13.8
is actually the mean of y and can be expressed as µ_y = (a − x)/(−b), which is
essentially Equation 13.7 but in a different format. The unknown K can be solved from
Equation 13.8 and the result is expressed in Equation 13.9:

$$K = -b \tag{13.9}$$
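The step from Equation 13.8 to Equation 13.9 can be made explicit. The following is a sketch of the normalization, assuming b < 0 (life decreases as stress increases) and writing σ_y = −σ_x/b, which is then positive; this shorthand is introduced here only to expose the standard normal integral.

$$
K \int_{-\infty}^{+\infty} \frac{1}{\sigma_x\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{y-\mu_y}{\sigma_y}\right)^{2}\right] dy
= K\,\frac{\sigma_y}{\sigma_x} \int_{-\infty}^{+\infty} \frac{1}{\sigma_y\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{y-\mu_y}{\sigma_y}\right)^{2}\right] dy
= K\,\frac{\sigma_y}{\sigma_x} = -\frac{K}{b} = 1
\;\Longrightarrow\; K = -b
$$

The remaining integral is that of a normal PDF in y and equals one, so the constant K only has to cancel the ratio of the two standard deviations.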

For  a linear fatigue S-N curve in a log-log plot with x = log( N ) and y = log( S ),
assume that the distribution of the cycles to failure at a stress level follows a normal
distribution in a log-log plot, then Equation 13.8 simply becomes Equation 13.10:

   
$$f[S(N)] = \frac{1}{\sqrt{2\pi}\,(-\sigma_x/b)} \exp\left\{-\frac{\left[\log(S) - \dfrac{a-\log(N)}{-b}\right]^{2}}{2\,(-\sigma_x/b)^{2}}\right\} \tag{13.10}$$

Clearly, the probabilistic distribution as a function of strength is still a normal
distribution with a new mean (see Equation 13.11a), which is essentially Equation 13.7,
and a new standard deviation in Equation 13.11b:

$$\mu_y = \frac{a - \log(N)}{-b} \tag{13.11a}$$

$$\sigma_y = -\frac{\sigma_x}{b} \tag{13.11b}$$
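The transformation can be checked numerically. The sketch below assumes the horizontal-offset parameters reported in Table 13.2 (a = 15.9, b = −3.937, σ_x = 0.178) and a life of 10^5 cycles, which is a value chosen here for illustration only. It verifies that the transformed density K f[x(y)] with K = −b integrates to one over y.

```python
import math

# Horizontal-offset fit from Table 13.2: mean of log10(N) = a + b*log10(S).
a, b, sigma_x = 15.9, -3.937, 0.178


def life_pdf(x, y):
    """Normal PDF of x = log10(N) at y = log10(S), Equations 13.3a and 13.7."""
    mu_x = a + b * y
    return math.exp(-0.5 * ((x - mu_x) / sigma_x) ** 2) / (sigma_x * math.sqrt(2.0 * math.pi))


def strength_pdf(y, x):
    """PDF of y = log10(S) at x = log10(N), Equation 13.1 with K = -b (Equation 13.9)."""
    return -b * life_pdf(x, y)


x = math.log10(1.0e5)      # assumed life for the check
mu_y = (a - x) / (-b)      # Equation 13.11a
sigma_y = -sigma_x / b     # Equation 13.11b

# Trapezoidal integration of the strength PDF over mu_y +/- 8 sigma_y.
n = 4000
lo, hi = mu_y - 8.0 * sigma_y, mu_y + 8.0 * sigma_y
h = (hi - lo) / n
area = sum(strength_pdf(lo + i * h, x) for i in range(1, n)) * h
area += 0.5 * h * (strength_pdf(lo, x) + strength_pdf(hi, x))

print(area, mu_y, sigma_y)
```

The standard deviation obtained this way rounds to the STD_S value of 0.045 listed in Table 13.2.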

Example 13.1  Two-stress level fatigue data

Table 13.1 lists a set of fatigue S-N data of welded exhaust components made of a
steel. Tests are conducted by controlling the applied load and only two load levels
are tested with six data points at each load level. The fatigue data show a wide
scatter band because many factors, such as material inhomogeneity and welding
quality, are involved in the failure of the exhaust components. Since the data in
Figure 13.5 belong to the “standard horizontal pattern” [12], the horizontal offsets
method, which is the ASTM standard recommended method [10], should provide
a reasonable fit curve. The fit curves with the horizontal offset method as well as
the vertical offset method are plotted in Figure 13.5, and the fit parameters are
listed in Table 13.2. The results of the horizontal offset methods are very different
from those of the vertical offset methods, which provide a poor fit to the set of
data. Based on the linear assumption in the log-log plot and the estimated mean
curve from the horizontal offset method, the standard deviation of the strength
can be calculated, and the results are listed in Table 13.2. This example belongs
to type (f) listed in Section 13.3. To accurately assess the reliability of a system,
the reliability of each constituent component must be accurately assessed as well.
However, the reliability assessment of each component often is conducted with
limited sample size and under certain testing conditions because of budget and
testing constraints, which brings significant uncertainty in test results and their
interpretations. Example 13.1 indicates that the obtained results (the mean and the
standard deviation) could be inaccurate and even misleading if the load/stress dis-
tribution is obtained directly from fitting the data with the vertical offsets method.
By contrast, the load/stress distribution as obtained by transforming the life distri-
bution, which is obtained by fitting the data with the horizontal offsets method,
is logically sound and meaningful and should therefore lead to a more accurate
system reliability assessment.

TABLE 13.1
Fatigue Cycles to Failure at the Two Stress Levels
Load, lbs   No. 1    No. 2     No. 3     No. 4     No. 5     No. 6
520         86188    130708    153282    168718    177465    304998
620         45823    55775     73715     89524     108583    135140

FIGURE  13.5  Vertical and horizontal offsets methods for fatigue data of an automotive
exhaust component.

TABLE 13.2
Calculated Fit Parameters with a = −log(C)/h and b = 1/h for the
Power Law S = CN^h
Method   C (a)             h (b)              STD_N    STD_S (Equation 13.11b)
Vert.    2213.1 (28.6)     −0.117 (−8.547)    0.262    —
Hori.    10889.3 (15.9)    −0.254 (−3.937)    0.178    0.045
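The entries of Table 13.2 can be approximately reproduced from the data in Table 13.1 with ordinary least squares. The sketch below is one possible implementation, assuming base-10 logarithms and a residual standard deviation with n − 2 degrees of freedom; the helper name least_squares is hypothetical. In the horizontal offsets method log N is regressed on log S (Equation 13.7), whereas in the vertical offsets method log S is regressed on log N.

```python
import math

# Table 13.1: cycles to failure at the two load levels (lbs).
data = {
    520: [86188, 130708, 153282, 168718, 177465, 304998],
    620: [45823, 55775, 73715, 89524, 108583, 135140],
}
log_s = [math.log10(load) for load, cycles in data.items() for _ in cycles]
log_n = [math.log10(n) for cycles in data.values() for n in cycles]


def least_squares(u, v):
    """Fit v = c0 + c1*u; return (c0, c1, residual std with n - 2 degrees of freedom)."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    c1 = s_uv / s_uu
    c0 = v_bar - c1 * u_bar
    sse = sum((vi - c0 - c1 * ui) ** 2 for ui, vi in zip(u, v))
    return c0, c1, math.sqrt(sse / (n - 2))


# Horizontal offsets: log N = a + b*log S (Equation 13.7).
a_h, b_h, std_n_h = least_squares(log_s, log_n)
h_h = 1.0 / b_h                 # exponent of the power law S = C*N**h
c_h = 10.0 ** (-a_h / b_h)      # coefficient of the power law
std_s_h = -std_n_h / b_h        # Equation 13.11b

# Vertical offsets: log S = log C + h*log N.
log_c_v, h_v, std_s_v = least_squares(log_n, log_s)
c_v = 10.0 ** log_c_v
std_n_v = -std_s_v / h_v        # scatter in log S expressed as scatter in log N

print(a_h, b_h, std_n_h, std_s_h)
print(c_v, h_v, std_n_v)
```

The two regressions give clearly different slopes for the same twelve points, which is the discrepancy between the vertical and horizontal offsets methods discussed in Example 13.1.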

13.4 REPRESENTATION OF RELIABILITY TESTING
METHODS IN THE DAMAGE-CYCLE DIAGRAM
The reliability testing methods (life testing, Bogey testing, and degradation testing)
are treated as three different methods in practice. To better understand and fully use
these testing methods, a general framework in which the three testing methods can
be evaluated in a consistent way is required. The damage-cycle (D-N) diagram [5]
is a tool to bring all these three reliability testing methods together in the same
framework. Figure 13.6 schematically shows the three reliability testing methods
in the (D-N) diagram. In  Figure  13.6, the horizontal axis represents the applied
fatigue cycle, N, while the vertical axis represents damage, D. The intersection of
the two axes, where the applied cycle N = 0 and D = 0, represents the beginning of
the testing process. The D is always an increasing function of N because a dam-
age process is often assumed to be an irreversible process. When the applied cycle
N = N f , D = 1 indicating a complete failure of the product. The dashed lines shown
in Figure 13.6 represent the evolution trajectory of the damage bounds (lower and
upper), which can be linear or nonlinear depending on the assumption of the dam-
age process.
The PDF and CDF obtained from fitting the test-to-failure data can be described
as f[N_f(D = 1)] and F[N_f(D = 1)]. In Figure 13.6 the probabilistic distributions
are assumed to be representative of the population so that the uncertainty caused by
the sample size can be ignored. The obtained probabilistic distribution of failure can
be compared against the established reliability criterion to assess the reliability
performance of the product. The most appropriate reliability criterion is the reliability
function R = 1 − F. For example, a product specification R99 states that 99% of the
product is expected to pass a specified target. The uncertainty of the fatigue behavior
of a population also can be described by f[D(N)] and F[D(N)], which are the
PDF and CDF of the damage at a specific applied cycle, respectively, and are also shown
in Figure 13.6. Compared with the life data from life testing, the damage
distribution for D < 1 is more difficult to obtain. In practice, instead of
the detailed continuous distribution, a discrete assessment (i.e., pass or fail), which
is exactly the measure used in binomial testing, often is used. Mathematically,
all products with D < 1 are characterized as “pass” and all others are characterized
as “fail.”

FIGURE 13.6  The representation of population as revealed in life testing, binomial testing,
and degradation testing in the (D-N) diagram.

The distribution of the cycles at any given damage level below D = 1, i.e., f[N(D_d)],
and the damage distribution at any applied cycle below N_F, i.e., f[D(N_d)],
both shown in Figure 13.6, can be represented in the (D-N) diagram.
The lowercase d stands for degradation.
Intuitively, close interrelationships should exist among the distributions shown in
Figure 13.6. The key to revealing these interrelationships is to find
the relationship between two different variables for a given damage evolution equation.
Mathematically, the problem is equivalent to seeking a target distribution function,
F_Y(y), for a given initial distribution function, F_X(x), and transformation functions,
y = ϕ(x) and x = ψ(y). In fact, closed-form solutions can be obtained by the following
well-developed procedure [13], which is described briefly here.
The target distribution function, F_Y(y), can be expressed as:

$$F_Y(y) = P(Y \le y) = P[\varphi(X) \le y] \tag{13.12}$$

First, consider the case where y = ϕ(x) is a strictly monotone increasing function.
Then x = ψ(y) is a unique inverse function and:

$$F_Y(y) = \int_{-\infty}^{\psi(y)} f_X(x)\,dx = F_X[\psi(y)] \tag{13.13}$$

The target PDF f_Y(y) of Y is obtained as:

$$f_Y(y) = \frac{dF_Y(y)}{dy} = f_X[\psi(y)]\,\frac{d\psi(y)}{dy} \tag{13.14}$$

For cases where ϕ(x) is a strictly monotone decreasing function:

$$F_Y(y) = \int_{\psi(y)}^{\infty} f_X(x)\,dx = 1 - F_X[\psi(y)] \tag{13.15}$$

$$f_Y(y) = \frac{dF_Y(y)}{dy} = -f_X[\psi(y)]\,\frac{d\psi(y)}{dy} \tag{13.16}$$

The two cases in Equations 13.14 and 13.16 can be combined as:

$$f_Y(y) = \frac{dF_Y(y)}{dy} = f_X[\psi(y)]\left|\frac{d\psi(y)}{dy}\right| \tag{13.17}$$
The variable transformation technique shown in Equations 13.14 through 13.17 is the
essential part of the unified framework for representing these three reliability testing
methods. With this technique, the distribution of cycles to failure can be calculated
easily from the damage distribution at a given cycle or the cycle distribution at a
given damage with the help of a damage evolution equation, which can be either
linear or nonlinear. In reverse, if the final life distribution is known, then the distri-
bution of damage at any given cycle and the cycle distribution at any given damage
level can be calculated in the same manner. In practice, the PDF, f[N_f(D = 1)], and
CDF, F[N_f(D = 1)], can be estimated by fitting the life testing data. It is noted that
Equation 13.17 is obtained by assuming that the transformation functions are either
monotone increasing or decreasing, which is the case for the fatigue-based reliabil-
ity analysis. For complex cases where the assumptions of monotone increasing or
decreasing are not valid, a more general theoretical framework as provided in [13]
can be followed.
Corresponding to the three commonly used testing methods, there are three
accelerated testing methods: accelerated life testing, accelerated binomial
testing, and accelerated degradation testing. For example, accelerated
fatigue life testing can be achieved by increasing stress/load levels. At least
two higher stress levels (lower and upper levels) often are introduced to conduct
accelerated fatigue life testing. Then the design parameters at the service stress level
are estimated from the accelerated testing through extrapolation. With probabilistic
distribution functions (i.e., f[N_F(S_U)] and f[N_F(S_L)]) at the two higher stress
levels, S_U and S_L, the probabilistic distribution f[N_F(S_S)] of the life at the service
stress level, S_S, can be obtained by extrapolating data obtained from
the higher stress levels. It should be noted that in accelerated testing data analysis,
the farther the accelerated stress level is from the normal stress level, the larger the
uncertainty in the extrapolation [4]. All these testing methods can be interpreted in
a single D-N diagram for one-stress level testing (Figure 13.6) and the S-N curve for
multiple-stress level testing.

Example 13.2 The Damage Distribution at a Specific Given
Applied Cycle N Obtained from the Weibull Life
Distribution with the Linear Damage Rule, D = N/N_f

For the linear damage rule, Equation 13.17 is applied with D = ϕ(N_f) = N/N_f,
N_f = ψ(D) = N/D, and dψ(D)/dD = −N/D². The damage distribution at a given
applied cycle, N, can be obtained from the variable transformation technique.
For example, the two-parameter Weibull distribution, f[N_f(D = 1)] and
F[N_f(D = 1)], shown in Equation 13.4 can be transformed into Equation 13.18:

$$f(D) = \frac{\beta}{\eta}\,\frac{N}{D^{2}}\left(\frac{N}{D\eta}\right)^{\beta-1} \exp\left[-\left(\frac{N}{D\eta}\right)^{\beta}\right], \quad D \ge 0 \tag{13.18a}$$

$$F(D) = \exp\left[-\left(\frac{N}{D\eta}\right)^{\beta}\right], \quad D \ge 0 \tag{13.18b}$$

Clearly, when D = 1, the complementary part (i.e., 1 − F[D(1)]) of Equation 13.18 is
exactly the same as Equation 13.4, F(N_f).
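Equation 13.18 can be checked with a few lines of code. The sketch below uses hypothetical parameters (η = 10^5 cycles, β = 2, N = 5 × 10^4 cycles) that are not taken from the chapter. It confirms that 1 − F(D = 1) equals the Weibull life CDF at N and that a finite-difference derivative of Equation 13.18b matches Equation 13.18a.

```python
import math


def weibull_cdf(x, eta, beta):
    """Two-parameter Weibull CDF of cycles to failure, Equation 13.4b."""
    return 1.0 - math.exp(-((x / eta) ** beta))


def damage_pdf(d, n, eta, beta):
    """PDF of damage D at applied cycle n, Equation 13.18a."""
    z = n / (d * eta)
    return (beta / eta) * (n / d ** 2) * z ** (beta - 1.0) * math.exp(-(z ** beta))


def damage_cdf(d, n, eta, beta):
    """CDF of damage D at applied cycle n, Equation 13.18b."""
    return math.exp(-((n / (d * eta)) ** beta))


# Hypothetical parameters chosen for illustration only.
eta, beta, n = 1.0e5, 2.0, 5.0e4

# Fraction of the population with D >= 1 (already failed) at cycle n.
frac_failed = 1.0 - damage_cdf(1.0, n, eta, beta)

# Central-difference derivative of the CDF at D = 0.8.
d, h = 0.8, 1.0e-6
slope = (damage_cdf(d + h, n, eta, beta) - damage_cdf(d - h, n, eta, beta)) / (2.0 * h)

print(frac_failed, weibull_cdf(n, eta, beta))
print(slope, damage_pdf(d, n, eta, beta))
```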

Example 13.3 The Cycle Distribution at Damage D Obtained
from the Weibull Life Distribution with
the Linear Damage Rule, D = N/N_f

For the same two-parameter Weibull distribution of cycles to failure (Equation 13.4),
the transformed distribution of cycles at a given damage D is Equation 13.19:

$$f(N) = \frac{1}{D}\,\frac{\beta}{\eta}\left(\frac{N}{D\eta}\right)^{\beta-1} \exp\left[-\left(\frac{N}{D\eta}\right)^{\beta}\right], \quad N \ge 0 \tag{13.19a}$$

$$F(N) = 1 - \exp\left[-\left(\frac{N}{D\eta}\right)^{\beta}\right], \quad N \ge 0 \tag{13.19b}$$

The distribution function is still a two-parameter Weibull function with a shape
parameter β and a scale parameter of Dη, which is proportional to the scale
parameter η with a factor D.
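A small Monte Carlo experiment illustrates the same result. The sketch below assumes hypothetical values (η = 10^5, β = 2, D = 0.5) and a fixed random seed. It samples cycles to failure, scales them by D according to the linear damage rule, and compares the empirical CDF with Equation 13.19b at one cycle count.

```python
import math
import random

# Hypothetical parameters chosen for illustration only.
eta, beta, d = 1.0e5, 2.0, 0.5

rng = random.Random(0)
# random.weibullvariate(alpha, beta): alpha is the scale, beta is the shape.
lives = [rng.weibullvariate(eta, beta) for _ in range(20000)]
cycles_at_d = [d * nf for nf in lives]  # N = D * N_f under the linear damage rule


def cycle_cdf(n):
    """CDF of cycles needed to reach damage d, Equation 13.19b."""
    return 1.0 - math.exp(-((n / (d * eta)) ** beta))


n_check = 4.0e4
empirical = sum(c <= n_check for c in cycles_at_d) / len(cycles_at_d)
predicted = cycle_cdf(n_check)

print(empirical, predicted)
```

The agreement is only statistical, so the two numbers differ by an amount that shrinks as the number of samples grows.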

13.5  PROBABILISTIC DATA ANALYSIS


In practice it is difficult and often impossible to get the distribution of the popula-
tion because of limited sample size. A  schematic of reliability testing methods
with limited sample size based on the corresponding population distributions
(Figure 13.6) is illustrated in Figure 13.7. The dashed distribution lines indicate
the true population distributions are essentially unknown. The circles on the hori-
zontal line, D = 1, represent the cycles to failure of the samples. The solid circles
stand for the samples below a specified cycle N , and the hollow circles stand for
the samples above the cycle N. The diamonds on the vertical line at the specific
cycle N represent the pass/fail status of the samples. The solid diamonds stand for
pass (i.e., D < 1) and the hollow diamonds stand for fail (i.e., D > 1). The
binomial testing and life testing based on the damage-cycle diagram in Figure 13.7
are addressed in the following subsections.

FIGURE 13.7  The representation of unknown population and samples in the (D-N) diagram.



13.5.1  Binomial Reliability Demonstration


Bernoulli (or binomial) trials can be used to describe an independent random event
that has only two possible outcomes: success or failure. The discrete probability
distribution generated from Bernoulli trials is the binomial distribution. Suppose an
experiment is repeated n times, where p is the probability of success (reliability
R = p). The confidence level with which the reliability R is demonstrated (based on the
binomial CDF) can be presented in the form of [3]:

$$C = 1 - \sum_{i=0}^{r} \frac{n!}{i!\,(n-i)!}\, R^{\,n-i} (1-R)^{i} \tag{13.20}$$

where:
R is reliability
C is confidence level
r is the number of failed items

When r = 0 (no failure), Equation 13.20 reduces to the simple equation for successful run
testing (Equation 13.21):

$$C = 1 - R^{n} \tag{13.21}$$
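Equations 13.20 and 13.21 translate directly into code. The sketch below is a minimal illustration; the function names are hypothetical, and the R90C90 target is used only as an example. The sample size for a successful run test is obtained by solving Equation 13.21 for n and rounding up.

```python
import math


def confidence(n, r, reliability):
    """Confidence level from Equation 13.20 for n tested items with r failures."""
    tail = sum(
        math.comb(n, i) * reliability ** (n - i) * (1.0 - reliability) ** i
        for i in range(r + 1)
    )
    return 1.0 - tail


def success_run_sample_size(reliability, conf):
    """Smallest n satisfying 1 - R**n >= C (Equation 13.21 solved for n)."""
    return math.ceil(math.log(1.0 - conf) / math.log(reliability))


n_r90c90 = success_run_sample_size(0.90, 0.90)
print(n_r90c90, confidence(n_r90c90, 0, 0.90))
print(confidence(n_r90c90, 1, 0.90))  # one failure lowers the demonstrated confidence
```

For R = 0.90 and C = 0.90 the required sample size is 22, which shows how quickly the test burden grows with the reliability target.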

The binomial test methods have been used widely in the automotive industry.
However, the sample size required for achieving high confidence and reliability is
significant. Based on the assumption that the probabilistic distribution follows the
two-parameter Weibull distribution (Equation 13.4), a general accelerated testing
procedure (Equation 13.22) can be developed by following the Lipson equality [14]:

$$C = 1 - R^{\,n (L_1 L_2)^{\beta}} \tag{13.22a}$$

$$R = (1 - C)^{1/\left[n (L_1 L_2)^{\beta}\right]} \tag{13.22b}$$

$$n = \frac{\ln(1 - C)}{(L_1 L_2)^{\beta} \ln R} \tag{13.22c}$$

where:
L1 = t2/t1 is the life test ratio
L2 = η1/η2 is the load test ratio, indicating that the change in the characteristic life is
caused by the change in load

In Equation 13.22, the shape parameter β is assumed to be a constant for simplicity,
even though a formula similar to Equation 13.22 can be derived when β1 ≠ β2. Based
on Equation 13.22c, (L1L2)^β times fewer test units are needed than would be required
by using the conventional successful run approach. In addition, the larger the value of
the shape parameter β, the greater the effect of the ratios on sample size reduction. The ratio
η1/η2 can be estimated from historical data or expert opinions. Equation 13.22 can be
reduced to the extended test method when the effect of L2 is ignored (i.e., η1 = η2) [3].
From Equation 13.22, with the same confidence and reliability, the sample size
reduction can be achieved in three ways:

Way-1: extend or increase the test time at the same stress/load level [3]
Way-2: increase the load/stress level and eventually reduce the characteristic
life η
Way-3: combine Way-1 and Way-2

Figure 13.8a and b illustrate Way-1 and Way-2, respectively.
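The three ways of reducing the sample size can be compared with Equation 13.22c. The sketch below uses hypothetical ratios (L1 = 2, L2 = 1.5) and a hypothetical shape parameter (β = 2) for an R90C90 target; the function name is an assumption made for illustration.

```python
import math


def accelerated_sample_size(reliability, conf, life_ratio=1.0, load_ratio=1.0, beta=1.0):
    """Sample size from Equation 13.22c, rounded up to a whole number of test units."""
    factor = (life_ratio * load_ratio) ** beta
    return math.ceil(math.log(1.0 - conf) / (factor * math.log(reliability)))


# Hypothetical test plans for an R90C90 target.
baseline = accelerated_sample_size(0.90, 0.90)                                  # successful run
way_1 = accelerated_sample_size(0.90, 0.90, life_ratio=2.0, beta=2.0)           # extended time
way_2 = accelerated_sample_size(0.90, 0.90, load_ratio=1.5, beta=2.0)           # increased load
way_3 = accelerated_sample_size(0.90, 0.90, life_ratio=2.0, load_ratio=1.5, beta=2.0)

print(baseline, way_1, way_2, way_3)
```

With L1 = L2 = 1 the function falls back to the conventional successful run sample size, so the reduction in the other plans comes entirely from the factor (L1L2)^β.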

13.5.2 Life Testing
The two-parameter Weibull distribution function often is used in life testing and
almost exclusively used in extended time testing, which can be considered
an accelerated testing method that appropriately extends the testing time but
significantly reduces the number of test samples, as shown in Equation 13.22c in Section 13.5.1.
However, fatigue data from a wide variety of sources indicate that the three-parameter
Weibull distribution function with a threshold parameter at the left tail
is more appropriate for fatigue life data with large sample sizes [14]. The uncertainties
introduced by the assumptions about the underlying probabilistic distribution
significantly affect the interpretation of the test data and the assessment of the
performance of the accelerated binomial testing methods; therefore, the selection
of a probabilistic model is critically important. In product validation and reliability
demonstration, designs targeting the low percentiles of the fatigue life at the left tail
are required [11]. Therefore, when the left tail of a distribution is a concern, the
characteristics of the left tail of a selected model need to be thoroughly examined using
test data with a large sample size and checked against the physical mechanisms.
For test data with
a small sample size, the benefit of using the three-parameter Weibull distribution is
not clear because the third fit parameter (threshold) brings significant uncertainty into
the data analysis and often results in abnormal values of the fit parameters. However,
meaningful results can be obtained for data with even very small sample sizes if
Bayesian statistics are used and historical data are available. The following three
examples demonstrate these three aspects, respectively.

FIGURE 13.8  Schematic of accelerated binomial (Bogey) testing procedure through (a)
extended testing time and (b) increased load/stress level as represented in the (S-N) diagram
with D = 1.

Example 13.4  Fatigue Data of 2024-T4 with Relatively Large Sample Size

A set of high-cycle fatigue data at room temperature [9] with a sample size of 30
for 2024-T4, which is a commonly used aluminum alloy, is selected for fitting the
Weibull distribution functions. The probability plots estimated using Minitab for the
two- and three-parameter Weibull functions are shown in Figure 13.9a and b.
The values of the fit parameters for the set of test data also are listed in Figure 13.9.
The three-parameter Weibull distribution has a much better fit in terms of visual
examination and the Anderson-Darling (AD) statistic value. The AD values for
the two- and the three-parameter Weibull distribution functions are, respectively,
1.246 and 0.526. A lower value of the AD statistic indicates a better data fit.
An important observation from the data shown in Figure 13.9 is that the values
of the shape parameters for the two- and three-parameter Weibull functions are,
respectively, 1.74758 and 0.908975, a change from β > 1 to β < 1. The values of the
scale parameters are, respectively, 2092213 and 1510218 cycles. It is noted that
one of the advantages of the Weibull function over other distribution functions is
supposed to be its capability to distinguish among several possible failure
mechanisms [3]: β < 1 for infantile or early-life failure, β = 1 for constant failure rate, and
β > 1 for wear-out failure. Clearly, the characterization of data based on β can be
significantly compromised by the fact that two- and three-parameter Weibull functions
can lead to very different conclusions for the same set of data. All the fitting
parameters are listed in Table 13.3. The results indicate that the three-parameter Weibull
distribution function with a threshold parameter at the left tail is more appropriate
for fatigue life data with large sample sizes. By contrast, the two-parameter Weibull
with a long left tail (zero at the left end) does not reflect the intrinsic incubation
time caused by the fatigue crack initiation and propagation.
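The fitted parameters can be cross-checked against the mean and standard deviation reported in the Minitab output of Figure 13.9. The sketch below uses the closed-form Weibull moments, mean = a + η Γ(1 + 1/β) and standard deviation = η [Γ(1 + 2/β) − Γ²(1 + 1/β)]^(1/2), which are standard results rather than formulas stated in the chapter; the function names are hypothetical.

```python
import math


def weibull_mean(eta, beta, a=0.0):
    """Mean of a Weibull distribution with scale eta, shape beta, threshold a."""
    return a + eta * math.gamma(1.0 + 1.0 / beta)


def weibull_std(eta, beta):
    """Standard deviation of a Weibull distribution (independent of the threshold)."""
    g1 = math.gamma(1.0 + 1.0 / beta)
    g2 = math.gamma(1.0 + 2.0 / beta)
    return eta * math.sqrt(g2 - g1 * g1)


# Fit parameters listed in Table 13.3.
mean_2p = weibull_mean(2092213.0, 1.74758)
std_2p = weibull_std(2092213.0, 1.74758)
mean_3p = weibull_mean(1510218.0, 0.908975, a=452578.0)
std_3p = weibull_std(1510218.0, 0.908975)

print(mean_2p, std_2p)  # compare with Mean 1863512 and StDev 1100375 in Figure 13.9a
print(mean_3p, std_3p)  # compare with Mean 2033108 and StDev 1741252 in Figure 13.9b
```

Both models are fitted to the same 30 points, yet the implied means and standard deviations differ noticeably, which is consistent with the sensitivity to model choice described in the example.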

[Figure 13.9: Minitab probability plots for 2024-T4 (complete data, LSXY estimates); probability versus cycles to failure, N.
(a) Weibull: shape 1.74758, scale 2092213, mean 1863512, StDev 1100375, median 1696378, IQR 1496580, failures 30, censored 0, AD* 1.246, correlation 0.961.
(b) 3-parameter Weibull: shape 0.908975, scale 1510218, threshold 452578, mean 2033108, StDev 1741252, median 1461658, IQR 1779718, failures 30, censored 0, AD* 0.526, correlation 0.997.]

FIGURE 13.9  Probability plots of (a) two-parameter Weibull distribution and (b) three-
parameter Weibull distribution for a set of 2024-T4.

TABLE 13.3
The Values of the Fit Parameters of Two-Parameter (2P) and Three-Parameter
(3P) Weibull Distribution Functions for the High-Cycle Fatigue Data of 2024-T4
Distribution Function   Parameter      Value
2P-Weibull              Shape          1.74758
                        Scale          2092213
                        AD statistic   1.246
3P-Weibull              β              0.908975
                        η              1510218
                        δ              452578
                        AD statistic   0.526

Example 13.5  Fatigue Data with Small Sample Size

The probability plot and the corresponding estimated parameters obtained
using Minitab from six fatigue failure data points are shown in Figure 13.10. The
shape parameter β = 2.37219 and the scale parameter η = 75035.5 are
obtained directly from the probability plot. In Figure 13.10, the 90% (confidence)
lower bound is shown with a large scatter band, indicating the uncertain nature
of the calculated values of the fit parameters caused by the small sample size.
The value of the test data at any given reliability and confidence levels (RxxCyy)
can be obtained readily. The smaller the sample size, the wider the scatter band
and the larger the uncertainty. Clearly, obtaining accurate parameter estimates
from a small sample size is a big challenge.

[Figure 13.10: Minitab probability plot (Weibull, 90% lower bound; complete data, LSXY estimates); percent versus cycles to failure. Table of statistics: shape 2.37219, scale 75035.5, mean 66504.0, StDev 29826.6, median 64293.5, IQR 41734.1, failures 6, censored 0, AD* 2.133, correlation 0.972.]

FIGURE 13.10  Probability plots obtained from cycles to failure.

It should be noted that even though the suitability of the three-parameter
Weibull distribution in fatigue life testing and associated product validation is
obvious from Figure 13.9 for data with a relatively large sample size, its
application to data with a small sample size is not recommended because of the
high possibility of unstable solutions with the third fit parameter introduced in the
three-parameter Weibull distribution. Instead, a two-parameter Weibull distribution
is recommended because, although it cannot provide accurate information
about the tails of the distribution, it does provide reliable information about the mean,
which is often useful. To obtain accurate parameter estimates of three- or
multiple-parameter Weibull distributions with limited sample sizes, Bayesian statistics, which
uses historical data, can be considered.
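For the two-parameter fit of this example, the life at a given reliability follows from inverting Equation 13.4b. The sketch below computes point estimates only; it does not reproduce the 90% lower confidence bound of Figure 13.10, and the function name is hypothetical.

```python
import math


def weibull_life_at_reliability(reliability, eta, beta):
    """Cycles N at which R(N) = exp(-(N/eta)**beta) equals the given reliability."""
    return eta * (-math.log(reliability)) ** (1.0 / beta)


# Fit parameters from the table of statistics in Figure 13.10.
eta, beta = 75035.5, 2.37219

median_life = weibull_life_at_reliability(0.50, eta, beta)
r90_life = weibull_life_at_reliability(0.90, eta, beta)

print(median_life)  # compare with Median 64293.5 in Figure 13.10
print(r90_life)     # point estimate of the life at 90% reliability
```

An RxxCyy value additionally requires a confidence bound on the fitted line, which is where the small sample size widens the scatter band.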

13.5.3 Bayesian Statistics for Sample Size Reduction


When the sample size is extremely small, the disadvantages of the traditional
Frequentist method (based on the current test data only) are obvious: (1) significant
loss in certainty and confidence and (2) high sensitivity to the specific
pattern of the test data. However, reliability assessment based on an extremely
small sample size, such as 3, 2, or 1, often is desired. To overcome these drawbacks
of the Frequentist method, a Bayesian statistics-based approach has been
developed [11].
Bayes's rule in its modern form can be expressed as:

$$p(\theta \mid x) = \frac{l(\theta; x)\,p(\theta)}{\int_{0}^{1} l(\theta; x)\,p(\theta)\,d\theta} \tag{13.23}$$

where p(θ | x) is the posterior PDF for the parameter θ given the data x, and p(θ) is the prior
PDF for the parameter θ. l(θ; x) is the likelihood function, which is defined as
l(θ; x) = ∏_{k=1}^{n} f(θ; x_k), where x_k is the kth experimental observation and f(θ; x_k) is the
PDF of cycles to failure. The denominator in Equation 13.23 is simply a normalizing
factor which ensures that the posterior PDF integrates to one. The Bayesian process
is shown schematically in Figure 13.11. The posterior distribution usually is narrower
than the prior distribution, and results with improved confidence and accuracy
can be obtained by analyzing the posterior distribution.
Two key steps in applying Bayesian statistics to construct a reliability
RxxCyy are (1) obtaining prior distributions from the historical data and (2) implementing
efficient numerical algorithms for Equation 13.23.
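A grid-based evaluation is one simple way to implement Equation 13.23 numerically. The sketch below is an illustration under stated assumptions, not the algorithm used in [11]: θ is taken to be the mean of log10 cycles to failure, the scatter σ is treated as known, the prior is normal, a single test result is available, and the integration range is a finite grid around the prior. All numerical values are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


# Hypothetical inputs: known scatter, prior from historical data, one new test result.
sigma = 0.18
prior_mu, prior_sigma = 5.0, 0.30
observations = [5.25]  # log10(cycles to failure)

# Grid over theta covering the prior mean +/- 6 prior standard deviations.
m = 2001
lo = prior_mu - 6.0 * prior_sigma
hi = prior_mu + 6.0 * prior_sigma
step = (hi - lo) / (m - 1)
grid = [lo + i * step for i in range(m)]


def likelihood(theta):
    """l(theta; x) as the product of the PDFs of the observations."""
    return math.prod(normal_pdf(x, theta, sigma) for x in observations)


unnormalized = [likelihood(t) * normal_pdf(t, prior_mu, prior_sigma) for t in grid]
evidence = sum(unnormalized) * step            # denominator of Equation 13.23
posterior = [u / evidence for u in unnormalized]

post_mean = sum(t * p for t, p in zip(grid, posterior)) * step
post_std = math.sqrt(sum((t - post_mean) ** 2 * p for t, p in zip(grid, posterior)) * step)

print(post_mean, post_std, prior_sigma)
```

Even with one observation the posterior standard deviation is smaller than that of the prior, which is the narrowing sketched in Figure 13.11.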

FIGURE 13.11  Schematic of the basic concept of Bayesian statistics procedure.

Example 13.6  Bayesian Statistics for Design Curve Construction

A large amount of reliable historical fatigue test data for welded structures has
been systematically collected and analyzed, and the associated probabilistic
distributions of the mean and standard deviation of the failure cycles have been
successfully obtained [15]. An advanced acceptance-rejection resampling algorithm
and a Monte Carlo simulation procedure have been implemented.
Figure 13.12a shows a mean life-standard deviation (Mean-STD) plot (in log-log)
of 110 sets of fatigue failure data of a type of welded exhaust component. Based
on the data pattern shown in Figure 13.12a, a probabilistic distribution and the
values of the corresponding fit parameters have been obtained. With the probabilistic
distribution of the historical data, Bayes's rule (Equation 13.23), and an advanced
numerical algorithm, a design curve can be constructed with only one data point at each of
the two stress levels, as shown in Figure 13.12b. It should be
noted that a design curve cannot be constructed with the traditional probability
plot even with two data points at each stress level. The advantage of Bayesian
statistics is clearly demonstrated by this example.

FIGURE 13.12  (a) The STD-Mean plot based on historical weld fatigue data and (b) the
R90C90 design curve constructed with only one data point at each of the two stress levels by
using a Bayesian statistics procedure.

13.6  SYSTEM RELIABILITY


The reliability of a system can be cascaded into the reliability of its components.
A system is often complex and the reliability of a system often can be idealized with
the following simple models and their combinations [3].

13.6.1 Series System Model


A system is called a series system if its life is the smallest of all the potential
times (or cycles) to failure. Such a system fails when the first failure mode occurs.
Mathematically:

$$F(x) = 1 - \prod_{i=1}^{n} \left[1 - F_i(x)\right] \tag{13.24a}$$

or, equivalently:

$$R(x) = \prod_{i=1}^{n} R_i(x) \tag{13.24b}$$

Equation 13.24b is referred to as the product rule of reliability since it establishes
that the reliability of a series system is the product of the individual component
reliabilities. A series system model also is called a competing risks model if bi-modal
or multiple failure mechanisms are involved.

13.6.2 Parallel System Model


A parallel system is a system that fails when all components fail. Mathematically:

$$F(x) = \prod_{i=1}^{n} F_i(x) \tag{13.25a}$$

or, equivalently:

$$R(x) = 1 - \prod_{i=1}^{n} \left[1 - R_i(x)\right] \tag{13.25b}$$

The parallel system model represents statistically the polar opposite of the series
system model but with F(x) and R(x) interchanged. Like the product rule of
reliability, Equation 13.25a can be referred to as the product rule of unreliability since it
establishes that the unreliability of a parallel system is the product of the individual
component unreliabilities. A parallel system model also is called a dominant modal
model [4] if bi-modal or multiple failure mechanisms are involved.

13.6.3  Mixtures Model


$$F(x) = \sum_{i=1}^{n} p_i F_i(x); \quad \sum_{i=1}^{n} p_i = 1; \quad 0 \le p_i \le 1 \tag{13.26a}$$

or, equivalently:

$$R(x) = \sum_{i=1}^{n} p_i R_i(x); \quad \sum_{i=1}^{n} p_i = 1; \quad 0 \le p_i \le 1 \tag{13.26b}$$

where F(x) is the CDF and, again, R(x) is the reliability or survival function. p_i is the
proportion or probability of occurrence for failure mechanism i. F(x) and R(x) have the
same mathematical structure.
With these system models, the reliability analysis can be conducted like that used
for component analysis. An example of using Equation 13.24 to calculate system
reliability from the known component reliability is provided as follows. Suppose a
series system consisting of five identical components is subjected to constant
amplitude vibrational loading. What would be the system reliability if the reliability of
each of the five identical components is 0.98 as calculated from the stress-strength
interference model? The system reliability as calculated from Equation 13.24b is
simply R = 0.98^5 = 0.90.
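The three system models reduce to a few lines of code when the component reliabilities at a given x are known. The sketch below is a minimal illustration; the function names and the mixture proportions are hypothetical, and the components are assumed to be independent as in Equations 13.24 and 13.25.

```python
import math


def series_reliability(reliabilities):
    """Equation 13.24b: product of the component reliabilities."""
    return math.prod(reliabilities)


def parallel_reliability(reliabilities):
    """Equation 13.25b: one minus the product of the component unreliabilities."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


def mixture_reliability(proportions, reliabilities):
    """Equation 13.26b: proportion-weighted sum of the reliabilities."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to one")
    return sum(p * r for p, r in zip(proportions, reliabilities))


components = [0.98] * 5
r_series = series_reliability(components)      # the five-component example above
r_parallel = parallel_reliability(components)
r_mixture = mixture_reliability([0.3, 0.7], [0.90, 0.80])  # hypothetical two-mechanism case

print(round(r_series, 2), r_parallel, r_mixture)
```

For the same five components the series arrangement loses about ten percentage points of reliability, while the parallel arrangement is far more reliable than any single component.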

13.7 CONCLUSIONS
This chapter introduces several recently developed new methodologies for fatigue
associated reliability assessment of vehicle components and systems. The  most
important two of these methodologies are the fatigue S-N curve transformation
technique and a variable transformation technique. In principle, these methodologies
can be applied to the reliability assessment of other similar engineering components
and systems. With these new methodologies, the current S-N data analysis and reli-
ability testing methods can be interpreted in a new unified probabilistic framework.
The importance of selecting between two-parameter and three-parameter Weibull
distributions in a probabilistic analysis of data with large sample size has been illustrated
with examples. The uncertainty introduced in test data with small sample size and
the benefits of using a Bayesian statistics approach in cost reduction also have been
demonstrated with examples.

REFERENCES
1. Lee YL, Pan J, Hathaway R, Barkey M. Fatigue Testing and Analysis: Theory and
Practice. Boston, MA: Elsevier Butterworth-Heinemann; 2005.
2. Wei Z, Hamilton J, Ling J, Pan J. Reliability analysis based on stress-strength
interference model, Wiley Encyclopedia of Electrical and Electronics Engineering, Chichester,
UK: Wiley; 2018.
3. O’Connor PDT, Kleyner A. Practical Reliability Engineering, 5th ed. Chichester, UK:
Wiley; 2012.
4. Nelson WB. Accelerated Testing: Statistical Models, Test Plans, and Data Analysis.
Hoboken, NJ: John Wiley & Sons; 2004.
5. Wei Z, Start M, Hamilton J, Luo L. A  unified framework for representing product
validation testing methods and conducting reliability analysis. SAE Technical Paper
2016-01-0269.
6. Wei Z, Kotrba A, Goehring T, Mioduszewski M, Luo L, Rybarz M, Ellinghaus K,
Pieszkalla M. Chapter 18: Failure mechanisms and modes analysis of automotive
exhaust components and systems, pp. 392–432, Handbook of Materials Failure
Analysis, Abdel Salam Hamdy Makhlouf (Ed.). Amsterdam, the Netherlands: Elsevier;
2015.
7. Wei Z, Luo L, Voltenburg R, Seitz M, Hamilton J, Rebandt R. Consideration of tem-
perature effects in thermal-fatigue performance assessment of components with stress
raisers, SAE Technical Paper 2017-01-0352.
8. Seitz M, Hamilton J, Voltenburg R, Luo L, Wei Z, Rebandt R. Practical and techni-
cal challenges of the exhaust system fatigue life assessment process at elevated tem-
perature, ASTM Selected Technical Papers (STP) 1598, Zhigang Wei, Kamran Nikbin,
D. Gary Harlow, Peter C. McKeighan (Eds.). ASTM International; 2016.
9. Shen CL. The statistical analysis of fatigue data. PhD dissertation, Tucson, AZ:
The University of Arizona; 1994.
10. Standard practice for statistical analysis of linear or linearized stress-life (S-N) and
strain-life (ε-N) fatigue data, ASTM Designation: E739-10.
11. Wei Z, Luo L, Yang F, Lin B, Konson D. Product durability/reliability design and vali-
dation based on test data analysis, pp. 379–413, Quality and Reliability Management
and Its Applications, Hoang Pham (Ed.). Springer; 2016.
12. Wei Z, Yang F, Cheng H, Maleki S, Nikbin K. Engineering failure data analysis:
Revisiting the standard linear approach. Engineering Failure Analysis, 2013; 30:
27–42.
13. Elishakoff I. Probabilistic Theory of Structures, 2nd ed. Mineola, NY: Dover
Publications; 1999.
14. Wei Z, Mandapati R, Nayaki R, Hamilton J. Accelerated reliability demonstra-
tion methods based on three-parameter Weibull distribution. SAE Technical Paper
2017-01-0202.
15. Wei Z, Zhu G, Gao L, Luo L. Failure modes effect and fatigue data analysis of
welded components and its applications in product validation. SAE Technical Paper,
2016-01-0374.
14 Maintenance Policy Analysis of a Marine Power Generating Multi-state System
Thomas Markopoulos and Agapios N. Platis

CONTENTS
14.1 Introduction
14.2 Reliability Assessment and Multi-state Systems
14.3 Description of Ship’s Electric Power Generation System
14.4 Semi-Markov Model Development
14.5 Multi-state System Analysis
14.6 Maintenance Policy and Implications
14.7 Conclusions
Acknowledgment
Appendix
References

14.1 INTRODUCTION
This study analyzes the reliability performance of a marine power
generation system, together with its attached auxiliary systems, and develops an alternative
maintenance policy. Its main scope is to present the methodology
and to conduct a reliability analysis of the marine electric power system, focusing
on the mathematical modeling rather than on the technical details of the underlying
electrical and mechanical systems. This approach leads to generic inferences
that are applicable to most systems and provide the big picture of the problem
and its solution. Nevertheless, the authors refer to certain technical issues to
help the reader understand the basic principles of a marine electrical power
generating system with its attached auxiliary systems.
This chapter is organized as follows. The remainder of this section gives a short description of,
and general information on, the marine power generating system as part of the ship,
along with some related references. Section 14.2 briefly presents reliability assessment
and multi-state systems. Section 14.3 describes a typical electric
power generation system and its reliability characteristics. Section 14.4 presents the
development of the semi-Markov model. Section 14.5 describes the auxiliary diesel
engine system driving the electric power generators and gives a reliability analysis of the
multi-state system, including the probabilities related to its operation. Section 14.6
outlines the maintenance policy and its implications, with ideas on how
stochastic analysis and its inferences could contribute to real-world management issues;
it also gives empirical results on the availability of the
power generating system under different system configurations. Finally, Section 14.7
summarizes the maintenance policy and suggests some ideas for further research.
The design of a vessel follows certain basic principles given as guidelines by organizations
such as the International Maritime Organization (IMO) and the Marine Technology Society
(MTS), covering all sectors of a shipbuilding project and all systems of the vessel.
Consequently, such guidelines exist for the electrical power generating system as part of
the whole vessel, both as a design philosophy (MTS DP Technical Committee; MSC/Circular 1994)
and for all essential calculations (IMO MEPC 1-CIRC 866 2014). Currently,
due to environmental and economic issues, major challenges arise
concerning ship technology (MUNIN D6.7 2014). There is increasing pressure for
greater energy efficiency, lower environmental impact, and improved safety. The IMO has developed
regulations (IMO 2016) for quantifying a ship’s efficiency, providing guidelines
for all essential calculations (MEPC 61/inf.18 2010; MEPC.1-Circ.681–2 2009; MEPC.1-
Circ.684 2009). One major problem for designers in ship technology is
system efficiency, and the ship’s energy in particular is a sector where challenges
arise continuously. Climate change and the problem of greenhouse gas emissions
drive research toward more efficient energy systems on ships and intensify the demand
for improved safety levels and environmental protection as a condition of competitiveness.
This effort can be quantified using indices such as the
Energy Efficiency Design Index (EEDI) and the Energy Efficiency Operational Indicator
(EEOI). Presumably, diesel engine driven electric power generating systems are subject to
these regulations. Previous research (Prousalidis et al. 2011) has shown that the evolution
of ship technology leads to new trends. Regarding energy efficiency, vessel
and energy management involves research on the optimization of routes and vessel speed,
which implies optimization of power systems and their management and ultimately offers
advantages through extensive electrification of ship systems. All these challenges and
trends could increase the complexity of the systems and the requirements on
the technical background and skills of crew members. Unfortunately, the improvements
mentioned do not guarantee full ship safety, and there is always some probability that
unpredictable incidents will happen (Mindykowski and Tarasiuk 2015). Since electric power
is essential to the normal operation of a ship, the ship’s electric power system
is dedicated to meeting its electric load requirements according to the type of
mission during the different phases of operation, such as overseas voyage, cargo charging
and discharging, berthing, etc. According to international regulations, in the case of an
electrical system failure the usual and anticipated consequence is a blackout (Brocken
2016), which is an initiating event for a deadship condition. The term
“deadship” denotes a condition under which the main propulsion plant, boilers, and auxiliaries
are not in operation and, in restoring the propulsion, no stored energy for starting the
propulsion plant, the main source of electrical power, and other essential auxiliaries should
be assumed available. It is assumed that the means are available to start the emergency
generator at all times (IMO 2005).

Research on blackout incidents shows that many different factors can
cause a blackout on a ship, such as human error, control equipment failure,
automation failure, electrical failure, lack of fuel, mechanical failure, and other causes
(Miller 2012). This leads to certain questions, such as:

• Do the available electric power generators meet the ship’s power requirements?
• What is the probability of a total system failure?
• What would be the financial cost of the system failure?

All these questions are closely related to the reliability of the electric power system,
which in the case of a vessel can be managed through architectural strategies
such as the use of multiple power sources, sectioning of the distribution
grid (Stevens et al. 2015), and the use of auxiliary safety subsystems, such as earthing
and protection systems (Maes 2013). More specifically, the primary and standby
generators are driven by diesel engines whose technical characteristics and
attributes depend on the requirements and the mission of the vessel, such as:

• Load acceptance and rejection
• Starting time
• Load up time and emergency loading ramp
• Time on hot standby
• Minimum load and part load ratings
• Black start requirements

The subsystems of generators are:

• Excitation system
• Lubrication system
• Cooling system
• Facilities for alarms, monitoring, and protection
• Neutral earthing

The importance of the marine electric power system and its components is
easily understood by considering electric failures that led to marine accidents, such
as that of RMS Queen Mary 2 (MAIB 2011). The main tasks associated with the system
can be summarized as follows (Patel 2012):

• The optimal system configuration
• Load analysis and selection of the necessary equipment (e.g., generators
and electric motors)
• The power distribution system
• Optimization of the routing cables
• Fault current analysis and the necessary safety devices
• Optimization of the power monitoring system

Since a ship operates autonomously at sea, and usually when moored as well, the
design of the power system faces major challenges in meeting the established
standards and other requirements. Ship designers must consider the electrical power
requirements during each phase of the ship’s operation. A major concept affecting the
use of electric energy on a ship is the quality of power. According to the established
standards, the term “power quality” refers to “a wide variety of electromagnetic
phenomena that characterize the voltage and
current at a given time and at a given location on the power system” (IEEE 1159–1995).
Poor electric power supply quality on a ship has several direct and indirect consequences:
it leads to problems and distortions that can result
in system failures and a reduced level of reliability.
These problems could be summarized as follows (Prousalidis et al. 2008):

• Harmonics
• Short duration voltage events
• Voltage unbalance

According to other research, the operation of the electric power generating system
could be summarized by two major groups of parameters:

• Parameters of voltage and currents in all the points of the analyzed system
• Parameters describing a risk of loss of power supply continuity

Attempting to evaluate the levels of quality and to deal with these problems,
researchers have developed certain quality indices concerning voltage and frequency
deviations (Prousalidis et al. 2011). The importance of those indices becomes obvious
when their limit values and the established standards for electric power quality
assessment in ship networks (Table 14.1) are considered. The usual
causes of power quality problems on ships are human factors, the assigned loads,
overloading, and technical failures (Mindykowski 2014). It should be noted that the
quality of electric power is addressed in two stages: assessment and improvement.
Improvement is possible through technical solutions and through investment
in staff and human capital (Mindykowski 2016). Technical solutions refer
to new distribution systems such as the Zonal Electrical Distribution System (ZEDS)
or hybrid technology solutions (Shagar et al. 2017). The need for electrical power
differs from phase to phase of operation, depending on the devices and systems that
are necessary for the normal operation of the ship. According to expert opinion,
the cargo charging and discharging phases are the most demanding and stressful for
the electrical power generating system of a ship. Thus, reliability modeling and
analysis of the system in these phases provides valuable inferences about the
safety of a ship.

TABLE 14.1
Standards Concerning Power Quality of a Ship
# | Standard | Range
1 | IEEE Std. 45:2002 | IEEE Recommended Practice for Electrical Installations on Shipboard
2 | IEC 60092-101:2002 | Electrical installations in ships. Definitions and general requirements
3 | STANAG 1008:2004 | Characteristics of Shipboard Electrical Power Systems in Warships of the North Atlantic Treaty Navies, NATO, Edition 9, 2004
4 | American Bureau of Shipping, ABS, 2008 | Rules of building and classing, steel vessels
5 | Rules of international ship classification societies, e.g., PRS/25/P/2006 | Technical Requirements for Shipboard Power Electronic Systems

Source: Mindykowski, J., Power quality on ships: Today and tomorrow’s challenges, International
Conference and Exposition on Electrical and Power Engineering (EPE 2014), Iasi,
Romania, 2014.

14.2  RELIABILITY ASSESSMENT AND MULTI-STATE SYSTEMS


The term “reliability” refers generally to the capability of a system or element
to perform its assigned task. The reliability analysis of a marine electric power
system starts from the elements of the system, continues to the subsystems, and
finally examines the whole system (Wu et al. 2013). Multi-state systems (MSS)
theory covers a wide range of applications in reliability analysis and has seen significant
theoretical advances as well (Lisnianski et al. 2010). An MSS operates by passing
through a finite number of states, which together form its state space (Lisnianski et al.
2010), each describing a different condition of the system (Eryilmaz 2015) and, consequently,
a different rate of output. This finite number of states marks the difference
between an MSS and a binary system, which operates in two states only (on-off)
(Levitin and Xing 2018). The complexity of an MSS depends on the number of its
subsystems, whereas its availability depends on the availability of these subsystems
(Markopoulos and Platis 2018). Depending on the requirements set, the structure of an
MSS gives reliability research the flexibility to handle both theoretical
problems and applications. Reliability is commonly understood as the capability of an
element or a system to operate normally without failures or interruptions. In the
case of an MSS, reliability can be defined as the capability of the system to operate among
specific states that lie within acceptable limits of operation according to the established
requirements. In general mathematical form, the set of states of an MSS subsystem
is written as:

S_j = \{s_{j1}, s_{j2}, \ldots, s_{ji}, \ldots, s_{jk}\} \qquad (14.1)

where:
s_{ji} is the state representing a specific level of performance of the subsystem j
i \in \{1, 2, \ldots, k\} indexes the set of states of each subsystem

Introducing the factor of time into the model, the state of the MSS over time is a
random variable, so that S(t) is a stochastic process (Lisnianski et al. 2010) characterized
by parameters such as its mean and variance. The function describing the reliability of the
MSS can be defined as:

R(t, w) = P\{S(t) \ge w\} \qquad (14.2)
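
Equation 14.2 can be illustrated at a single time instant with a short sketch. It assumes that w denotes a required performance (demand) level and that the probability of each performance state at time t is already known; the output levels and probabilities below are hypothetical and serve only to show the summation.

```python
def mss_reliability(state_probs, performance, demand):
    """P{S(t) >= w}: total probability of the states whose output meets demand w."""
    return sum(p for p, g in zip(state_probs, performance) if g >= demand)


# Hypothetical snapshot at one time t: output levels (kW) and their probabilities.
performance = [0, 875, 1750]
state_probs = [0.01, 0.09, 0.90]

print(round(mss_reliability(state_probs, performance, 875), 2))   # 0.99
print(round(mss_reliability(state_probs, performance, 1750), 2))  # 0.9
```

Raising the demand level can only shrink the set of acceptable states, so R(t, w) is non-increasing in w.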

According to the literature, one of the major application fields of MSS is reliability
assessment, and more specifically that of electrical and electronic systems such as
power generation and communication systems (Lisnianski et al. 2012). To assess
the expected performance of a complex or composite system, it is necessary to
determine the states of the system and the sojourn time in each state (Barbu and
Karagrigoriou 2018). This implies the use of the semi-Markov methodology,
which offers more flexibility than ordinary two-state (operation or failure)
Markov models of binary systems. The price of this flexibility is the
complexity of the system and the resulting difficulty in understanding it and evaluating
its performance (Yingkui and Jing 2012). The flexibility of MSS brings further
advantages. Since the focus is on acceptable and non-acceptable states,
the analysis is closer to real-world problems (Liu and Kapur 2006) than that of ordinary
simple systems, which focus on “time to failure.” This leads to more accurate
assessments (Lisnianski et al. 2012) and reduces the time needed to analyze
the model (Billinton and Li 2007).

14.3 DESCRIPTION OF SHIP’S ELECTRIC POWER GENERATION SYSTEM
The flexibility described above makes MSS capable of describing
many systems, whether technical or not. Major research fields for MSS analysis are
electrical power generation and distribution systems (Markopoulos and Platis 2018)
and telecommunications.
The present analysis extends reliability analysis with the
theoretical tool of MSS to a marine electrical power generating system, taking into account
the particularities that distinguish it from ordinary terrestrial power generating systems.
Depending on the phase of operation, such systems should meet all minimum energy and
power requirements without remaining in “out of order” status, even if some of their
elements fail during the repair process. Thus, this chapter considers as failure of the
system all those levels of output that do not meet the minimum requirements for
normal operation in each operational phase of the ship.
According to the existing standards (DNV 2011), three general requirements on
the structure of the ship’s systems must be met. First,
an electric power generation station should be arranged (DNV 2011, B301). Second,
depending on the ship and the operational phase, there is a minimum
required number of independent electric power sources capable of meeting the
load requirements for normal operation of the ship without the use of emergency power
generators (DNV 2011, B302). Third, the electric power
generation system should be able to be restored within 45 seconds (DNV 2011, B303)
using the existing automatic control switching (ACS) system. A typical
electric power system on a ship consists of operating components for power generation,
energy transmission, and energy distribution to all energy-consuming devices.
Common configurations have three main generators and one emergency
unit (Wärtsilä 2014), where the main system consists of two primary generators, one
secondary generator, and the switchboard (Mindykowski 2016), or a set of four generators,
consisting of two main (primary) generators and two standby ones (Mennis and Platis 2013).
Considering the IEEE standard (IEEE 45-2002), as shown in Figure 14.1, a
typical electrical power generation system of a large cargo ship
consists of four generator units dedicated to serving the ordinary loads during the different
phases of the ship’s use. An emergency generator unit exists in case of a total failure
(blackout) of all four main generators. The capacity of the emergency
generator is lower than that of the main ones, since it serves only basic loads
such as emergency lighting and basic instruments and devices of the ship, such as
internal communication and basic electronic systems (Patel 2012). In addition, a number of
batteries exist to serve the ship in case of a total blackout. The four-generator case
is the generic one, covering more complex systems.

FIGURE 14.1  Large cargo ship power system with emergency generator and battery backup
based on Standard IEEE 45-2002. (Based on Patel, M.R., Shipboard Electrical Power
Systems, CRC Press, Boca Raton, FL, 2012.)
Starting the description of the system, we examine the four-generator case,
assuming that the system consists of two primary generators and two standby generators,
as shown in Figure 14.1 and in Table 14.2 (Patel 2012). When generators start up
automatically, they need a specific time to reach their operational parameters,
such as the voltage and the frequency of their output current. All generators are
controlled by an automatic control system, which activates the standby generator or
generators when necessary.
According to the same standard (IEEE 45-2002), we assume that when a generator
startup failure takes place there are two ways of activation: automatic switching
by the automatic control system and manual switching by the crew. The automatic
switching time to activate the standby generators is 45 seconds.
The switching time for manual activation depends on the current position of
the crew members in the ship; for the current analysis we assume it is 5 minutes,
the time needed to reach the machinery room from anywhere in the ship. Concerning
the nominal power of the generators (e.g., Wärtsilä 2014), we assume that the output of
each main and standby generator is 875 kW and the output of the emergency generator
is 200 kW (Table 14.3). In ordinary use of the marine power generating
system, the standby generators remain in cold mode, ready to operate in case of a
primary generator failure. That is, the standby generators are not running;
only some of their essential subsystems run so that they can respond whenever
necessary. All these generators are driven by auxiliary diesel engines that are also
subject to failures, repairs, and maintenance. There are certain failure
modes (shown in Table 14.4) for each subsystem, describing the type of occurrence
and its effect on the normal operation of the whole system. Due to the standby status
of the secondary systems, their failures are likely to remain latent and to be
revealed only when a main generator fails. The timing of such a failure, combined
with the status of the whole system, can be critical, especially
when the specific generator is the last one available because all other main and standby
generators have failed.

TABLE 14.2
Basic Parts of Ship’s Electric Power Generating System (Four Generators)
Number of main generators | 2
Number of standby generators | 2
Number of emergency generators | 1
Automatic control system (ACS) | 1

TABLE 14.3
Output Power of the Generators
Main generator #1 | 875 kW
Main generator #2 | 875 kW
Standby generator #1 | 875 kW
Standby generator #2 | 875 kW
Emergency generator | 200 kW

TABLE 14.4
Failure Modes of Standby Generator
Occurrence Type | Effect: Prevents the Operation | Effect: Does Not Prevent the Operation
Monitored | Monitored Critical | Monitored Non-critical
Latent | Latent Critical | Latent Non-critical

Source: Alzbutas, R., Energetika, 4, 27–33, 2003.
Note that when the ship is at anchor with no additional electric systems
in operation, one generator meets all load requirements. During additional
operations such as cargo charging and discharging, one more generator is considered
necessary (Mennis and Platis 2013). A general block diagram of the whole system
is shown in Figure 14.2: in case of a primary generator failure, the automatic
control system normally switches to one of the two standby generators, or to the
emergency one if all main generators have failed. We assume that, following
the switching sequence, the ACS activates the first available secondary generator,
and that this activation fails with probability γ.
Considering the block diagram of Figure 14.2, the next step is to construct the
Markov model diagrams for each phase of the vessel’s operation. Since it is an ordinary
electric power structure, we can follow the guidelines of previous research
(Mennis and Platis 2013), adapting them to the requirements of the current analysis.

[Block diagram; blocks shown: Primary #1, Primary #2, Standby #1, Standby #2, Emergency, Automatic Control, Output.]

FIGURE 14.2  Block diagram of the power generating system. Use of primary #2 generator
depends on the phase of the operation (e.g., it is necessary only during port phase).

Although the operational combinations of generators remain the same, we add
some assumptions concerning the repair of generators and the automatic
control system, which is assumed to have a certain level of availability.
Specifically, when all primary generators fail, the priority is first to switch to a secondary
generator, if one is available, and only then to repair a generator or to complete the
maintenance that is in progress. According to other research on electric power systems (Wu et al.
2013), the reliability of the ACS exceeds 5,500 hours. Since it is a purely electronic
system, we suppose that only replacement of modules or rearrangement of cables is
possible or necessary on board. The engine crew assigned to operate and maintain
all the mechanical and electric power systems ensures that the systems will be
activated, even if manually; thus, we can consider the probability of a failed
generator start-up to be close to zero. Manual activation of a standby generator
takes 5 minutes. Major maintenance work on generators
takes place when the ship is in the shipyard. Note that
the requirements for electrical power differ between the phases of
operation. A model describing the states of the electric power generating system
with respect to the operational sequence is shown in Figure 14.3.
Because of the complexity of the system, a short
description of the model is necessary. The term “phases” is closely
related to the minimum available power needed to meet the operational requirements.
There are three phases. The first is the journey phase, which lasts 7 days on average,
when the vessel departs a port and follows a route to another one. During this
phase, one primary generator provides the necessary level of power for
all operational systems. The port phase follows and lasts 3 days, while
the ship is in port. The necessary machinery operates to load and unload cargo
according to the type of the ship and its cargo. This phase is therefore the most demanding
in terms of power, and at least two primary generators are necessary for
cargo handling. Finally, the maintenance phase follows the port phase and lasts
2 days. During this phase the engine crew conducts all necessary maintenance
work, both corrective (repairing generators that have failed) and preventive
(conducting inspections or overhaul maintenance). The maintenance
covers all four (or three) primary generators and includes routine inspections with
ad hoc repairs and overhaul maintenance according to the manufacturer’s maintenance
plan.

[Cycle diagram; labels shown: Journey (7 days), Port (3 days), Maintenance (2 days), Idle.]

FIGURE 14.3  Phases of a ship’s operation.

14.4  SEMI-MARKOV MODEL DEVELOPMENT


The semi-Markov model supports a probabilistic study of
the system: it is used to assess the probability that the system runs in a specific output
mode (state) and the time it remains in that state. To develop the semi-Markov
model, it is necessary to determine and solve the system of steady-state
equations. We assume that the sojourn time in each state depends on the rate
that leads the system to another state, through either failure or repair, and
that the corresponding times are exponentially distributed.
The transition from one state to another has two components:
the probability of transition between the two
states and the time spent in each state. The parameter
that governs the jump from one state to another is the failure or repair rate of the
affected generators or subsystems. The transition probability matrix has
the following form:

P = \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,13} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,13} \\ \vdots & \vdots & \ddots & \vdots \\ p_{13,1} & p_{13,2} & \cdots & p_{13,13} \end{bmatrix} \qquad (14.3)

To calculate the steady-state probabilities, it is necessary to solve the following
equation system (Trivedi 2002):

v = vP \qquad (14.4)

where P is the transition probability matrix of Equation 14.3 and v is the
steady-state probability vector of the discrete-time Markov chain:

v = [v_1, v_2, \ldots, v_{13}] \qquad (14.5)

The solution of Equation 14.4 is feasible under the restriction (Trivedi 2002):

\sum_{i=1}^{13} v_i = 1 \qquad (14.6)
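
As a minimal sketch of how Equations 14.4 and 14.6 can be solved numerically, the following uses repeated multiplication (power iteration) on a small, hypothetical three-state chain. The 13-state matrix of this chapter is not reproduced here, and power iteration is only one possible solution method, swapped in for illustration; it assumes an irreducible, aperiodic chain.

```python
def stationary_vector(P, tol=1e-12, max_iter=100_000):
    """Solve v = vP with sum(v) = 1 by repeated multiplication (power iteration)."""
    n = len(P)
    v = [1.0 / n] * n  # any probability vector works as a starting point
    for _ in range(max_iter):
        nxt = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    raise RuntimeError("power iteration did not converge")


# Hypothetical 3-state transition matrix (each row sums to 1).
P = [
    [0.90, 0.10, 0.00],
    [0.50, 0.40, 0.10],
    [0.00, 0.60, 0.40],
]

v = stationary_vector(P)
print([round(x, 4) for x in v])  # [0.8108, 0.1622, 0.027]
```

Because each row of P sums to 1, the iterates remain probability vectors, so the normalization of Equation 14.6 is preserved automatically. For a periodic chain the iteration may fail to converge, and a direct linear solve would be needed instead.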

The general formula of the mean sojourn time is (Trivedi 2002):

h_i = \int_0^{\infty} [1 - H_i(t)] \, dt \qquad (14.7)

which after the integration of exponential distributions is:

h_i = \frac{1}{\sum_i \lambda_i + \sum_j \mu_j} \qquad (14.8)

where λ and μ are the failure and repair rates, respectively. Since the manual
and automatic switching times are considered constant, the mean sojourn time in the
corresponding states is:

h_i = t_{man} \quad \text{and} \quad h_i = t_{aut} \qquad (14.9)
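
A short numerical sketch of Equations 14.8 and 14.9 follows. The generator rates are those listed in Table 14.5 of this chapter and the switching times are those assumed in Section 14.3; the example state (one generator running while another is under repair) is hypothetical and chosen only to show the calculation.

```python
def mean_sojourn_time(failure_rates, repair_rates):
    """Equation 14.8: h = 1 / (sum of failure rates + sum of repair rates)."""
    total = sum(failure_rates) + sum(repair_rates)
    if total <= 0:
        raise ValueError("at least one positive exit rate is required")
    return 1.0 / total


LAMBDA_GEN = 0.000453  # generator failure rate per hour (Table 14.5)
MU_GEN = 0.017241      # generator repair rate per hour (Table 14.5)

# Hypothetical state: one generator running (may fail), one under repair.
h_state = mean_sojourn_time([LAMBDA_GEN], [MU_GEN])
print(round(h_state, 1))  # 56.5

# Equation 14.9: constant switching times, expressed in hours.
T_AUT = 45 / 3600  # automatic switching, 45 seconds
T_MAN = 5 / 60     # manual switching, 5 minutes
print(T_AUT, round(T_MAN, 4))  # 0.0125 0.0833
```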

Equation 14.8 is a general expression: it implies that the transition of
the system from one state to another depends on the combination of all probable
failures and repairs between the two states. The state probabilities of the semi-Markov
model are given by the following formula:

\pi_i = \frac{v_i h_i}{\sum_j v_j h_j} \qquad (14.10)
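
Equation 14.10 weights each embedded-chain probability by the mean time spent in the corresponding state and then renormalizes. The sketch below uses hypothetical values of v and h for a three-state example; it is not the chapter's 13-state model.

```python
def semi_markov_probabilities(v, h):
    """Equation 14.10: pi_i = v_i * h_i / sum_j (v_j * h_j)."""
    weighted = [vi * hi for vi, hi in zip(v, h)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical embedded-chain probabilities and mean sojourn times (hours).
v = [0.5, 0.3, 0.2]
h = [100.0, 10.0, 0.0125]

pi = semi_markov_probabilities(v, h)
print([round(p, 5) for p in pi])
```

A state that is visited often but left almost immediately, such as a switching state lasting seconds, therefore receives a very small long-run probability.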

The matrix equation is:

V \cdot P_{semi} = U \Leftrightarrow V = U \cdot P_{semi}^{-1} \qquad (14.11)

where U is the vector:

U = [1_{(1)}, 0_{(2)}, \ldots, 0_{(13)}] \qquad (14.12)

and V is the vector that is combined with the set of mean sojourn times to
calculate the final steady-state probabilities. Considering the general model of
Figure 14.3, the one-step transition probability matrix is given by Table A14.4 of the
Appendix to this chapter. A typical operation cycle of a ship, as previously
mentioned, consists of three phases: the system runs for 7 days in the journey
phase, 3 days in the port phase, and 2 days in the maintenance phase, for a total
of 12 days per cycle and approximately 30 cycles per year. Failures of
the operating components of the system occur at random;
thus, they can be assumed to follow a Poisson
distribution with mean failure rate λ, whereas the time to repair follows
an exponential distribution with mean 1/μ, so that the repair rate is μ. A series
of state diagrams can describe the system. The number of possible states of the
system depends on its complexity. As mentioned previously, the main electric power generation
system consists of two primary generators and two secondary
(or standby) ones. Their outputs are identical, at 875 kW each. There is also a
fifth generator (the emergency generator), which provides a lower power level of 200 kW
for auxiliary loads (Wärtsilä 2014), thus providing
a certain level of reliability in the case of a total blackout of all main generators.
Maintenance Policy Analysis of a Marine Power Generating MSS 373

TABLE 14.5
Failure and Repair Rates

System            MTTF (Hrs)   Failure Rate       MTTR          Failure Rate   Repair Rate
                               (per 10^6 Hrs)     (man-hours)   (per hour)     (per hour)
Prim. Gen. 1      2,208.04     452.89             58.00         0.000453       0.017241
Prim. Gen. 2      2,208.04     452.89             58.00         0.000453       0.017241
Standby Gen. 1    2,208.04     452.89             58.00         0.000453       0.017241
Standby Gen. 2    2,208.04     452.89             58.00         0.000453       0.017241
Autom. Control    2,828.97     353.49             0.0833        0.000353       N/A

According to the available data (OREDA 2002), the failure rates and the repair times are given in three forms: min, mean, and max. Examining the worst-case scenario, we assume the highest rate of failure and the longest repair time, expressed in man-hours, for each case. The failure rates and times to repair for all five generators are shown in Table 14.5. The failure rates are expressed in failures per 10^6 hours and the repair times in man-hours, considering a basic crew of six in the engine room.
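
The relation between the columns of Table 14.5 can be checked with a short Python sketch. It assumes exponentially distributed times, so each rate is the reciprocal of the corresponding mean time; the function name `rate_per_hour` is illustrative only.

```python
def rate_per_hour(mean_time_hours):
    """Rate of an exponentially distributed event with the given mean time."""
    return 1.0 / mean_time_hours


# Values for one generator, taken from Table 14.5.
mttf_hours = 2208.04
mttr_hours = 58.00

failure_rate = rate_per_hour(mttf_hours)       # failures per hour
repair_rate = rate_per_hour(mttr_hours)        # repairs per hour
failure_rate_per_million = failure_rate * 1e6  # failures per 10^6 hours

print(round(failure_rate, 6), round(repair_rate, 6), round(failure_rate_per_million, 2))
```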
The automatic control system (ACS) is responsible for activating a standby generator when a primary generator fails. In this study, we assume that the automatic system connected to the marine generators (Wu et al. 2013) has the reliability parameters and characteristics shown in Table A14.1 of the Appendix. The system used in our study consists of three serial subsystems (Figure 14.4): Sys1, the main switch, with a failure rate λ_SYS1 = 59.9998 × 10^−6/hr; Sys2, the excitation system, with a failure rate λ_SYS2 = 18.7 × 10^−6/hr; and Sys3, the main switching system, with a failure rate λ_SYS3 = 353.4859 × 10^−6/hr, which is the total rate of Table A14.1. Because the system is serial, its failure rate is the sum of the failure rates of its components, that is, λ_SYSTEM = 432.1857 × 10^−6/hr. Since it is an electronic system, its repair after a failure consists of replacing a module or rearranging the cables and contacts so that a secondary generator starts up when a primary one fails. The time the crew needs to repair the system manually is taken as a mean time to repair (MTTR) of 5 minutes, or 0.0833 hours. Considering the structure of the whole power system, the automatic control system is vital for its normal operation. Consequently, it is necessary to calculate the probability of switching from a failed generator to a standby one, even if the probability of failing to switch is close to zero. This probability is identical to the availability (A) of the control system and is expressed by the following formula:

$$A = \frac{MTBF - MTTR}{MTBF} \Leftrightarrow A = 1 - \gamma = 0.999964 \tag{14.13}$$

FIGURE 14.4  Layout of ACS blocks (SYS1, SYS2, and SYS3 connected in series).



where 1 − γ is the probability that the ACS works normally when a primary generator fails.
The probabilities of normal ACS operation impose certain difficulties on the development of the models and on the calculation process. To deal with this issue, the calculations are based on the expected time (t_switch) of the system response, whose reciprocal is the rate of response:

$$t_{switch} = \gamma \cdot t_{man} + (1 - \gamma) \cdot t_{aut} \tag{14.14}$$

Equation (14.14) represents the weighted average of the switching time either manu-
ally or automatically.
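
Equations 14.13 and 14.14 can be reproduced with the following Python sketch. The total ACS failure rate and the 5-minute MTTR are taken from the text. Two assumptions are made here and are not stated in the chapter: the manual switching time t_man is set equal to the 5-minute MTTR, and the automatic switching time `T_AUT_HOURS` is a hypothetical 10 seconds.

```python
ACS_FAILURE_RATE = 432.1857e-6  # per hour, total rate of the serial ACS (from the text)
MTTR_HOURS = 5.0 / 60.0         # 5 minutes (from the text)


def availability(failure_rate, mttr):
    """Equation 14.13: A = (MTBF - MTTR) / MTBF with MTBF = 1 / failure_rate."""
    mtbf = 1.0 / failure_rate
    return (mtbf - mttr) / mtbf


def expected_switch_time(gamma, t_man, t_aut):
    """Equation 14.14: weighted average of manual and automatic switching times."""
    return gamma * t_man + (1.0 - gamma) * t_aut


A = availability(ACS_FAILURE_RATE, MTTR_HOURS)
gamma = 1.0 - A

T_MAN_HOURS = MTTR_HOURS     # assumption: manual switching takes the 5-minute MTTR
T_AUT_HOURS = 10.0 / 3600.0  # hypothetical automatic switching time (10 seconds)

t_switch = expected_switch_time(gamma, T_MAN_HOURS, T_AUT_HOURS)
switch_rate = 1.0 / t_switch  # rate of response, per hour

print(round(A, 6), t_switch, switch_rate)
```

Because γ is of the order of 10^−5, the expected switching time stays very close to the automatic time under these assumptions.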

14.5  MULTI-STATE SYSTEM ANALYSIS


The  electric power generating machine as an MSS can be assumed in operating
mode (acceptable level of operation) when it operates in specific states which meet
the operational requirements of each phase with the assumption that the power out-
put for all other states is lower than the required one. In this case all latter states are
considered failure states (Levitin and Lisnianski 1999). As mentioned in previous
sections, the electric power generating system is driven by auxiliary diesel engines.
Based on the available information1 about the maintenance schedule and manu-
facturer’s instructions for the auxiliary systems, there are certain restrictions and/
or limitations concerning the operation and handling of the whole system. Thus,
the typical crew in an engine room consists of three engineers, two assistants, and
one electric expert. These persons are responsible for the normal operation of all
electromechanical systems during the ship’s operation. In addition, they are respon-
sible for the minor routine inspections and maintenance and the major ones such
as overhauls whenever necessary. According to the available information, some
typical rates of failure and repair times for auxiliary diesel engines are provided
in Table A14.2 of the Appendix and a typical maintenance routine is summarized
in Table A14.3 of the Appendix as well. During overhaul maintenance, parts that are still new are checked for good condition. Most parts are either replaced with new ones at 8,000, 16,000, 24,000, and 32,000 hours or confirmed to be in good condition. Exceptions to this rule are the fuel system check (fuel injection pump) at 2,000 hours, the lubricating system and cooling water system (thermostatic valves) at 12,000 hours, and the supercharging system (cleaning of the charge air cooler) at 4,000 hours. Obviously, the overhaul maintenance is scheduled according
to the manufacturer’s instructions but there is always the probability of discrepan-
cies due to the quality of fuel and lubricating oil. The planned maintenance assures
that no major failures will take place during the time between overhaul maintenances. Apart from the maintenance previously described, there are additional minor maintenance steps at shorter intervals, such as a daily pressure check of the air filters and the supercharging compressor, and a weekly check of the functionality of the control system and the compressed air system. Furthermore, the monthly check

1 Information given by the marine engineer expert based on major engine manufacturer’s data.

is adopted for other elements such as the centrifugal oil filter and the compressed air system. A major factor affects the maintenance of certain subsystems in the auxiliary engines: due to crew and other restrictions, overhaul maintenance and all minor inspections (daily, weekly, and monthly) take place during the maintenance phase. Each inspection or maintenance action covers the respective ones of lower levels; for example, when the monthly inspection takes place, the respective weekly and daily inspections are omitted. Engine crew members repair failures of the auxiliary engines when they appear.
Concerning the detailed analysis of the Markov model for each operational phase, each state represents specific conditions of the system's operation. The label of each state encodes the operational phase (1st character), the number of active primary generators (2nd character), the number of active secondary generators (3rd character), and whether one generator, primary or secondary, is in the maintenance process (4th character). Transitions and their rates for all states of the model are provided in Table A14.4 of the Appendix to this chapter. Starting with the maintenance phase, shown in Figure 14.5, the system enters this phase on leaving the port phase. The possible states are all those with one primary generator active (M,1,3,0 – M,1,2,0 – M,1,1,0 – M,1,0,0). These states represent the preparation of the maintenance process. The rate of maintenance is four generators per 48 hours (2 days of maintenance). During this phase, if a primary generator fails, then a secondary generator is activated either automatically (by the ACS) or manually by the crew. This situation refers to the states M,0,3,0 – M,0,2,0 – M,0,1,0 – M,0,2,1, and M,0,1,1.
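
The four-character state labels can be decoded mechanically, as in the following Python sketch. The names `State`, `parse_label`, and `is_operational` are hypothetical. The operating criterion (at least one active primary generator in the maintenance and journey phases, at least two in the port phase) is one reading of the phase requirements described in this section, not a rule stated in this exact form.

```python
from typing import NamedTuple


class State(NamedTuple):
    phase: str           # 'M' (maintenance), 'J' (journey), or 'P' (port)
    primary: int         # number of active primary generators
    secondary: int       # number of active secondary generators
    in_maintenance: int  # 1 if one generator is in the maintenance process, else 0


def parse_label(label):
    """Split a label such as 'M,1,3,0' into its four fields."""
    phase, primary, secondary, maint = (part.strip() for part in label.split(","))
    return State(phase, int(primary), int(secondary), int(maint))


def is_operational(state):
    """Assumed criterion: the port phase needs two primary generators, the others one."""
    required = 2 if state.phase == "P" else 1
    return state.primary >= required


for label in ["M,1,3,0", "M,0,2,1", "J,1,0,0", "P,2,1,0", "P,1,2,0"]:
    state = parse_label(label)
    print(label, state, is_operational(state))
```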

FIGURE 14.5  Markov model of the electric power generating system (4-Gen): maintenance phase.

If the failure happens while scheduled maintenance is in progress, the crew first completes the maintenance, because this takes less time than a repair. If a secondary generator and a primary one operate normally, the crew then starts to repair the failed generator. If a secondary generator fails while maintenance is in progress, the crew follows the same steps as for the failure of a primary one. When all generators fail, the crew repairs one to recover normal power for maintenance. In this phase, the system is considered to be in normal operation when at least one primary generator operates normally, which covers the states M,1,3,0 – M,1,2,0 – M,1,1,0 – M,1,0,0 – M,1,2,1 – M,1,1,1 – M,1,0,1, and it fails when it falls into any of the other states.
Next is the journey phase, shown in Figure 14.6, which the system enters on leaving the maintenance phase. The crew's strategy for repairing or maintaining the generators is the same as in the maintenance phase, with the difference that no generator is under maintenance. The possible states are all those with one primary generator active (J,1,3,0 – J,1,2,0 – J,1,1,0 – J,1,0,0). The activation of a secondary generator after the failure of a primary one follows the same steps through the ACS, and normal operation includes the states J,1,3,0 – J,1,2,0 – J,1,1,0 – J,1,0,0. This phase has an additional characteristic: whereas the journey phase requires at least one primary generator, the transition to the next phase, the port phase, requires at least two primary generators. Thus, there are three

FIGURE  14.6  Markov model of the electric power generating system (4-Gen)—journey
phase.

FIGURE 14.7  Markov model of the electric power generating system (4-Gen)—port phase.

additional states in the journey phase (J,2,2,0 – J,2,1,0 – J,2,0,0) aiming to assure the
activation of the second primary generator to prepare the system for the next phase
requiring increased power.
The next and last phase is the port phase, shown in Figure 14.7, which the system enters after the journey phase. The possible states are P,2,2,0 – P,2,1,0 – P,2,0,0. The repair strategy for the activation of secondary generators using the ACS is the same as that of the journey phase.
Following the semi-Markov methodology as described in formulas (14.3) through
(14.13), we can construct the transition matrix easily using Table  A14.4 of the
Appendix to this chapter followed by the one step probability matrix and the matrix
of mean sojourn times. The V vector after calculations and the mean sojourn times
are shown in Table A14.6 of the Appendix to this chapter and the final matrix of
the steady-state probabilities in Table A14.8. As previously mentioned, systems with four generators are usual in large vessels. The analysis of the model with four generators shows that the availability of the system is high and does not vary seriously when the number of crew members changes, which suggests that investment in backup systems can reduce the need for crew members.
At this point it would be worthwhile to investigate the sensitivity of the system's structure to the number of crew members and backup systems. One test is to assume fewer generators for the system (e.g., three generators), as shown in Figures 14.8 through 14.10.

FIGURE 14.8  Markov model of the electric power generating system (3-Gen): maintenance phase.

FIGURE 14.9  Markov model of the electric power generating system (3-Gen)—journey phase.

FIGURE 14.10  Markov model of the electric power generating system (3-Gen)—port phase.

Given that the power requirements are the same for each phase in both configurations (three and four generators), the systems differ only in the number of secondary generators. Following the same semi-Markov methodology as for the four-generator configuration, we obtain simpler diagrams. The Markov models of the phases for the system with three generators are shown in Figures 14.8 through 14.10, and all transitions are shown in Table A14.5 of the Appendix to this chapter. In the maintenance phase (Figure 14.8), five out of ten states represent normal operation (M,1,2,0 – M,1,1,0 – M,1,0,0 – M,1,1,1 – M,1,0,1), while all others are considered failure states. The system then leaves the maintenance phase and enters the journey phase, shown in Figure 14.9. The crew's strategy for repairing or maintaining the generators is the same as in the maintenance phase, with the difference that no generator is under maintenance. The possible states are all those with one primary generator active (J,1,2,0 – J,1,1,0 – J,1,0,0). The activation of a secondary generator after the failure of a primary one follows the same steps through the ACS, and normal operation includes the states J,1,2,0 – J,1,1,0 – J,1,0,0. As in the four-generator configuration, this phase includes preparation states (J,2,1,0 and J,2,0,0) through which the system passes to the port phase (Figure 14.10). Implementing the semi-Markov methodology, we can construct the transition matrix using the transitions of Table A14.5 in the Appendix to this chapter, followed by the one-step probability matrix and the matrix of mean sojourn times.

TABLE 14.6
Steady-State Probabilities of the Semi-Markov Model

                  Configuration
State of          4 Gen           3 Gen
Unavailability    2.893424E-06    1.024447E-05

The V vector after calculations and the mean sojourn times are shown in Table A14.6 of the Appendix to this chapter and the matrix of the steady-state probabilities in Table A14.9. A summary of the probabilities of the states of normal operation and of unavailability (as shown in Table A14.14 of the Appendix to this chapter) for both system configurations (three and four generators) and a crew of six is shown in Table 14.6.
The  differences between probabilities in state of normal operation and failure
are in line with the general reliability theory concerning the use of backup systems.

14.6  MAINTENANCE POLICY AND IMPLICATIONS


The maintenance policy of each operation system is a wide concept. The form and
characteristics of the maintenance policy depend on the nature of the system, its
complexity, and the requirements it is called to meet. It  is also well known that
maintenance aims to keep the system at a sufficient level of availability through the
management of its parts and subsystems, according to the established requirements
along with the reduction or minimization of the required time and cost. Concerning
the general maintenance methodologies, they are classified in three basic groups
which are (Chowdhury 1988):

1. Replacement and/or repair on failure
2. Planned maintenance (repair or replacement)
3. Condition-based maintenance

Considering the first group, this method is applied once a failure takes place.
Following this strategy, it can be handled as a stochastic renewal process and the
implied cost can be expressed by the following formula:

$$C_{RR} = C_R + C_D \tag{14.15}$$

where:
$C_{RR}$ is the total cost of maintenance
$C_R$ is the repair cost
$C_D$ is the indirect cost while the system is not operative

Compared with the alternative of planned maintenance, the repair on failure policy
is preferred when:

$$C_{RR} \le C_{PM} \tag{14.16}$$

where $C_{PM}$ is the total cost of planned maintenance at predetermined intervals, which refers to the second group. Replacement as a concept is included in the wider concept of maintenance. Appropriate actions could refer to age, periodic, or block replacement (Nakagawa 2006). All the above maintenance parameters make up preventive maintenance, which plays a crucial role in the case of a ship because of its particularities. According to the planned maintenance policy and depending on the conditions, replacement takes place in blocks, when the parts have reached their age or operational limit, using the optimal number of failures, or using cycle time. Under the block replacement/repair policy, each part or subsystem is replaced at times kT with k = 1, 2, … or at failure, whichever comes first (Barlow and Proschan 1996). In this case, the interval that minimizes the total cost is (Chowdhury 1988):

$$T_0\, m(T_0) - M(T_0) = \frac{C_2}{C_1} \tag{14.17}$$

where:
$T_0$ is the optimum time interval
$M(T_0)$ is the renewal function
$m(T_0)$ is the renewal density
$C_1$ is the expected cost of failure
$C_2$ is the expected cost for exchanging non-failed item
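
Equation 14.17 generally has to be solved numerically. The Python sketch below does so by bisection for one special case in which the renewal function has a closed form: times between failures that follow an Erlang distribution with two stages of rate `lam`. This distribution, the cost values, and all function names are assumptions made for illustration; the chapter does not specify a failure distribution for this equation.

```python
import math


def renewal_function(t, lam):
    """M(t) for Erlang-2 inter-failure times with stage rate lam (closed form)."""
    return lam * t / 2.0 - 0.25 + 0.25 * math.exp(-2.0 * lam * t)


def renewal_density(t, lam):
    """m(t) = dM/dt for the same Erlang-2 assumption."""
    return lam / 2.0 * (1.0 - math.exp(-2.0 * lam * t))


def optimality_gap(t, lam, cost_ratio):
    """Left side minus right side of Equation 14.17."""
    return t * renewal_density(t, lam) - renewal_function(t, lam) - cost_ratio


def optimal_interval(lam, c1, c2, tol=1e-10):
    """Bisection for T0; returns math.inf when no finite optimum exists."""
    ratio = c2 / c1
    # Under the Erlang-2 assumption the left side increases from 0 towards 1/4.
    if ratio >= 0.25:
        return math.inf
    lo, hi = 0.0, 1.0 / lam
    while optimality_gap(hi, lam, ratio) < 0.0:
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if optimality_gap(mid, lam, ratio) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Mean time to failure of 2,208.04 hours (Table 14.5); Erlang-2 mean is 2 / lam.
lam_demo = 2.0 / 2208.04
# Hypothetical costs: a failure costs ten times a planned exchange.
T0 = optimal_interval(lam_demo, c1=10.0, c2=1.0)
print(T0)
```

Under this assumed distribution, block replacement pays off only when the planned exchange costs less than a quarter of a failure; otherwise the sketch returns infinity, meaning repair on failure is preferable.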

The  study of the maintenance problem also is related to the MSS methodology.
In general, two major categories of maintenance are followed (Liu and Huang 2010).
The corrective one is conducted when a system failure takes place and the preven-
tive one is conducted when the user’s intention is to keep its performance within
the desired limits during specific periods of operation. Concerning the ships and
shipping industry, the application of the corrective maintenance refers to onboard
repairing activities whereas the preventive one refers to repairs in shipyards during
major overhauls. Due to the existing restrictions to repair failed systems on board,
the corrective maintenance presents inherent difficulties. One major challenge is the optimization of the maintenance policy through a combination of maintenance policies that achieves the ship's unobstructed operation. Depending on the management policy of the ship's owner, the maintenance policy (consisting of corrective and preventive maintenance) of the ship's subsystems should consider the expected time between failures and organize the transportation assignments of each ship so as to minimize the cost. Thus, minor failures that can be repaired by the crew members would not affect the ship's transportation capability, whereas the repair of major failures should be scheduled during the overhaul inspections and repairs in the shipyard.
One important parameter that affects the development of the maintenance pol-
icy is the time horizon. This  horizon determines the strategy of the maintenance
policy management. In the case of a long or medium term, the maintenance policy
could focus more on planning and spare parts’ inventory management, while the
short-term horizon focuses more on monitoring and control (Ben-Daya et al. 2000).

If the complexity of the system is high, the replacement policy should focus on block replacement rather than on other policies such as age replacement, provided that the first is preferable to the second (Barlow and Proschan 1996). The basic analysis of the maintenance policy refers to the failures of the parts, subsystems, and systems and to their distributions. Nevertheless, it does not provide universal answers concerning maintenance, because further questions could arise, such as what the optimal maintenance policy is under the specific conditions of the system. The term "optimal" refers not only to the minimization of cost but also to the maximization of availability (Barlow and Proschan 1996). Both objectives follow the same principles as optimization problems with more than two decision variables.
Considering the findings from previous sections, we could propose certain basic
principles concerning a ship’s operation. One example is the maintenance of the
electrical power system and the results of this study as a way of thinking could be
expanded to other subsystems and finally to the whole ship. The electric system
of a ship as a complex system consists of many different parts that are subject to
deterioration and a possible gradual degradation. Since it is a complicated system,
it could be an MSS operating in different output levels. As described in previous
sections, a marine electric power generating system consists of many different
main and emergency generators. The  system follows a typical configuration of
four main generators and one emergency, whereas the set of the four main consists
of two primary and two standby generators. Considering the generating system as
an MSS, in general, whenever a system’s performance falls under the threshold of
acceptance it is assumed failed, thus maintenance actions should take place (Liu
and Huang 2010). Since there is always the probability of transition of the MSS
from one state to another, the restoration of the system is subject to a factor of ran-
domness, because one subsystem could fail during the restoration of a previously
failed one.
As maintenance policy depends on the strategic goals of the decision makers, it is closely related to the minimization of the maintenance cost. Although cost is a single concept, it can be described from different aspects that lead to the same result. Concerning a ship, a sufficient level of maintenance implies direct and indirect cost savings. Direct cost savings refer to reduced needs for repairs, reduced man-hours, and reduced losses due to equipment being out of use. Indirect cost savings refer to meeting contractual requirements and avoiding penalties due to delays. Another valid assessment of this cost is the reliability associated cost (RAC), which is expressed by the formula (Lisnianski et al. 2010):

RAC = OC + RC + PC (14.18)

where:
OC is the operational cost, including the fuel cost when it comes to power systems driven by auxiliary diesel engines
RC is the repair cost, including the repair and maintenance cost in man-hours and spare parts
PC is the penalty cost when the system's failure leads to delays of the operation

TABLE 14.7
Probabilities of Operation at Acceptable Level (min power requirements)

                 4 Generators      3 Generators
Crew Members     Unavailability    Unavailability
1                1.400095E-05      2.204020E-04
2                4.306027E-06      5.963794E-05
3                3.288975E-06      2.927190E-05
4                3.031658E-06      1.834814E-05
5                2.936747E-06      1.314772E-05
6                2.893424E-06      1.024447E-05
7                2.870731E-06      8.447592E-06
8                2.857641E-06      7.252148E-06
9                2.849532E-06      6.413197E-06
10               2.844226E-06      5.799750E-06
11               2.840598E-06      5.336362E-06
12               2.838030E-06      4.976965E-06

According to our findings, the probability that the system reaches a non-acceptable level of operation varies with the crew size from 1.400095E-05 to 2.838030E-06 in the four-generator configuration and from 2.204020E-04 to 4.976965E-06 in the three-generator configuration (Table 14.7). To understand the sensitivity of availability to changes in the crew, this chapter developed all models of the previous sections for different crews, from 1 up to 12 members. The final probabilities for each phase of the auxiliary engines' operation and for different crews are shown in Table 14.7.
Alternatively, the availabilities of both configurations are shown in Figure 14.11.
There is an obvious difference between the availability of the two systems showing
the possible interaction between systems and manpower.
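
To make the magnitudes of Table 14.7 more tangible, the following Python sketch converts each unavailability into expected hours of unacceptable operation per year and compares the two configurations. The conversion assumes 8,760 operating hours per year, which is an illustrative assumption and not a figure from this chapter.

```python
HOURS_PER_YEAR = 8760.0  # assumption: continuous operation over one year

# Unavailability from Table 14.7: crew size -> (4 generators, 3 generators).
UNAVAILABILITY = {
    1: (1.400095e-05, 2.204020e-04),
    2: (4.306027e-06, 5.963794e-05),
    3: (3.288975e-06, 2.927190e-05),
    4: (3.031658e-06, 1.834814e-05),
    5: (2.936747e-06, 1.314772e-05),
    6: (2.893424e-06, 1.024447e-05),
    7: (2.870731e-06, 8.447592e-06),
    8: (2.857641e-06, 7.252148e-06),
    9: (2.849532e-06, 6.413197e-06),
    10: (2.844226e-06, 5.799750e-06),
    11: (2.840598e-06, 5.336362e-06),
    12: (2.838030e-06, 4.976965e-06),
}


def annual_downtime_hours(unavailability):
    """Expected hours per year below the acceptable level of operation."""
    return unavailability * HOURS_PER_YEAR


# Ratio of 3-generator to 4-generator unavailability for each crew size.
ratios = {crew: u3 / u4 for crew, (u4, u3) in UNAVAILABILITY.items()}

for crew, (u4, u3) in UNAVAILABILITY.items():
    print(crew, annual_downtime_hours(u4), annual_downtime_hours(u3), round(ratios[crew], 2))
```

The ratio between the two configurations shrinks steadily as the crew grows, which is the interaction between backup systems and manpower discussed in the text.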

FIGURE 14.11  System availability according to the number of crew members (series: 3 GEN and 4 GEN; horizontal axis: crew members, 1 to 12; vertical axis: availability, 9.9965E-01 to 1.0001E+00).



Concerning all models previously presented, certain implications arise for the maintenance policy and its cost minimization. More specifically, there are different ways to improve the availability of the auxiliary engines and to reduce their reliability cost. Among the proposed solutions, presumably the lower the cost and the simpler the action, the more preferable the solution is.
According to formula (14.18), the structure of the reliability cost for the marine generator system is shown in Figure 14.12. Considering this cost and its minimization, there are two possible cases concerning the ship: either it is in the planning and construction phase or it is already in operation. If it is in the planning or construction phase, adjustments to the equipment are possible, such as changes in the number or the power level of the generators and the attached auxiliary engines and systems. On the other hand, if it is in operation, the management team should search for alternative options to reduce or optimize the maintenance cost. This study, which aims to identify alternative maintenance policy options, focuses on the potential adjustment of parameters based on Figure 14.12.
Supplies, transportation, and fuel costs are related to geographical and financial factors that are subject to changes beyond planning and construction. Other costs (e.g., spare parts cost) are related to planning and construction decisions. For a ready-made ship (the case to which this study refers), only costs related to its use are adjustable, such as the penalty cost and the man-hours cost, which could be adjusted to deal with the problem of maintenance cost optimization. The penalty cost due to delays is related to the reliability of the ship and its capability to operate on time and to increase its utility and efficiency. The labor cost is related to the number of crew members in the engine room. Obviously, the more crew members, the higher the maintainability and the shorter the time to repair, subject to the trade-off between the cost and benefits of each personnel change.
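
The trade-off between labor cost and penalty cost can be sketched in Python as follows. The unavailability values are those of the three-generator configuration in Table 14.7. The annual wage per crew member, the penalty per hour of unacceptable operation, the 8,760-hour year, and the function names are all hypothetical values introduced only to show the mechanics of the comparison.

```python
HOURS_PER_YEAR = 8760.0  # assumption: continuous operation over one year

# Three-generator unavailability by crew size (Table 14.7).
UNAVAILABILITY_3GEN = {
    1: 2.204020e-04, 2: 5.963794e-05, 3: 2.927190e-05, 4: 1.834814e-05,
    5: 1.314772e-05, 6: 1.024447e-05, 7: 8.447592e-06, 8: 7.252148e-06,
    9: 6.413197e-06, 10: 5.799750e-06, 11: 5.336362e-06, 12: 4.976965e-06,
}


def expected_annual_cost(crew, unavailability, wage_per_member, penalty_per_hour):
    """Labor cost plus expected penalty cost for one year (illustrative model)."""
    labor = crew * wage_per_member
    penalty = unavailability * HOURS_PER_YEAR * penalty_per_hour
    return labor + penalty


def best_crew_size(unavailability_by_crew, wage_per_member, penalty_per_hour):
    """Crew size with the lowest expected annual cost."""
    return min(
        unavailability_by_crew,
        key=lambda n: expected_annual_cost(
            n, unavailability_by_crew[n], wage_per_member, penalty_per_hour
        ),
    )


# Hypothetical cost figures (not from the chapter).
WAGE = 30_000.0     # annual cost per crew member
PENALTY = 50_000.0  # cost per hour below the acceptable level of operation

best = best_crew_size(UNAVAILABILITY_3GEN, WAGE, PENALTY)
print(best)
```

With these made-up figures the second crew member pays for itself and the third does not; a higher penalty-to-wage ratio moves the optimum towards larger crews.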

FIGURE 14.12  Structure of dependability cost (block labels: Reliability Cost; Cost of Fuel; Repair Cost; Penalty Cost; Man-hours Cost; Spare Parts Cost; Supplies and Transportation).



14.7 CONCLUSIONS
This  study evaluated the reliability performance using MSS theory of a marine
electric power generation unit that consists of four generators (one or two primary
and two or three secondary) driven by auxiliary diesel engines. An additional
alternative analysis of a system including three generators (one or two primary and
one or two secondary) was conducted to investigate the differences between two
system configurations and to identify possible alternatives concerning the decision
making.
One main characteristic of the system is that it uses a single type of generator. This strategy has a managerial aspect: it simplifies the system's management, in general the scheduling of supplies and maintenance. The analysis of the generators as an MSS shows that the probability of operation at a non-acceptable level of output is low, implying high availability, which is in line with the operational requirements of a ship. This is made possible by the increased number of generators (primary, secondary, and emergency) and by the expected failure rates of the subsystems (generators), which tend to reach low levels as technology progresses. Due to technological limitations, however, continuously lowering failure rates could be difficult. There are two options for the continuous improvement of the systems. The first is to add backup systems (the configuration with four generators), and the second is to increase repair rates, that is, maintainability (the configuration with three generators), by using additional highly qualified personnel in the engine room.
The analysis of the system shows that, apart from the ordinary aspect of maintenance focusing on materials, another aspect focusing on manpower and human capital also exists. Considering the reliability parameters of the systems and the modeling process, the empirical results show that the efficiency of the systems is very high. A comparison of both system configurations shows that, as the engine crew increases, the probability of normal operation tends to be almost the same for both options. This implies an interaction between system and crew. The finding that the level of reliability increases with the number of crew members is valid within limits that depend on the ratio of the penalty cost to the wage cost. Obviously, these limits differ depending on the specifications and operational requirements of each system. The analysis conducted in this chapter is a theoretical approach to the optimization of the maintenance problem in a specific sector of the ship. The model could focus on specific devices and subsystems or could be expanded appropriately to include more systems or integrated blocks. All the above findings could be a starting point for further research that combines additional mathematical methods, such as Monte Carlo simulation, to reduce the uncertainty of the models.

ACKNOWLEDGMENT
We would like to thank Dr. J. Dagkinis, whose expert knowledge on marine engi-
neering issues was most helpful.

APPENDIX

TABLE A14.1
Failure Rates for Automatic Control System EEA-22

Type of Failure                           MTBF (Days)    Failure Rate (per 10^6 Hrs)
5C147                                        1,818.38       22.9142
Input 5C15                                   8,998.89        4.6302
Power 5C3                                      301.02      138.4184
Start 5C21                                   2,324.97       17.9214
Stop 5C27                                    3,929.78       10.6028
General control 6C109                        2,270.22       18.3536
Stop cascade 6C103                          10,659.16        3.9090
Voltage monitor 5C153                          671.80       62.0226
V/f monitor 6C39                             4,030.83       10.3370
Additional reference value 6C67              5,374.96        7.7520
Closing pulse EVG23                          8,726.73        4.7746
Power SNT23                                 19,007.65        2.1921
Relay output RAG23                           2,102.36       19.8190
Frequency presetting FAG23                   9,780.68        4.2601
Active power measuring WMG23                 3,499.05       11.9080
Frequency controller FRG23                   9,063.09        4.5974
Load distribution LAG23                      4,592.13        9.0735
Total Rate                                     117.87      353.4859

TABLE A14.2
Failure and Repair Rates for Engine Drivers of Power Generators

Type of Failure           MTBF (Days)   Failure Rate       Time to         Repair Rate
                                        (per 10^6 Hrs)     Repair (Hrs)    (per 10^6 Hrs)
Fuel oil filters           20           2,083.33           1.5             27,777.78
Oil filters                20           2,083.33           1.5             27,777.78
Air filters                20           2,083.33           1.5             27,777.78
Water filters              30           1,388.89           1.5             27,777.78
Fuel injector              60             694.44           1.5             27,777.78
Leaking of gasket          90             462.96           1.5             27,777.78
Piping system              60             694.44           1.5             27,777.78
Water pump                 60             694.44           3.0             13,888.89
Fuel pump                  30           1,388.89           3.0             13,888.89
Fuel injector             180             231.48           3.0             13,888.89
Dirty water cooler         60             694.44           3.0             13,888.89
Dirty oil cooler           60             694.44           3.0             13,888.89
Cover gasket damage       365             114.16           3.0             13,888.89
Exhaust inlet valve       365             114.16           7.0              5,952.38
Cracking of cyl. heads    240             173.61           7.0              5,952.38
Piston ring damage        365             114.16           7.0              5,952.38
Turbo charger             365             114.16           7.0              5,952.38

TABLE A14.3
General Plan of Maintenance
Overhaul Maintenance Interval (Hours of Operation)
Action—Description Daily Weekly Monthly 1,000 2,000 4,000 8,000 12,000 16,000 20,000 24,000 28,000 30,000
Major fasteners—retightening X X X X
Major bearing—inspection X X X
Resilient mounts—inspect-retighten X
Cylinder and rod—inspection X X
Crankshaft—gears—inspection X X X
Valve mechanism X
Control system X X X
Fuel system X X
Lubricating oil system X X X
Cooling water system X X X
Compressed air system X X X
Supercharging system X X X

TABLE A14.4
States of the Marine Power Generating System (4 Generators)a
State FR TO Rate State FR TO Rate State FR TO Rate

M,1,3,0 S1 S2 λs J,1,3,0 S15 S16 λs P,2,2,0 S26 S1 λp→m


S5 m S15 S18 λ S26 S27 λs
S8 λ S15 S23 1/tswitch S26 S29 λ
S15 λm→j J,1,2,0 S16 S15 μ P,2,1,0 S27 S2 λp→m
M,1,2,0 S2 S1 μ S16 S17 λs S27 S26 μ
S3 λs S16 S19 λ S27 S28 λs
S5 m S16 S24 1/tswitch S27 S30 λ
S9 λ J,1,1,0 S17 S16 μ P,2,0,0 S28 S3 λp→m
S16 λm→j S17 S20 λ S28 S27 μ
M,1,1,0 S3 S2 μ S17 S21 λs S28 S31 λ
S6 m S17 S25 1/tswitch P,1,2,0 S29 S2 λp→m
S10 λ J,0,3,0 S18 S16 1/tswitch S29 S27 1/tswitch
S17 λm→j S18 S19 λs S29 S30 λs
M,1,0,0 S4 S3 μ J,0,2,0 S19 S17 1/tswitch S29 S32 λ
S11 λ S19 S20 λs P,1,1,0 S30 S3 λp→m
S21 λm→j J,0,1,0 S20 S21 1/tswitch S30 S31 λs
M,1,2,1 S5 S1 1/M S20 S22 λs S30 S28 1/tswitch
S6 λs J,1,0,0 S21 S17 μ S30 S33 λ
S12 λ S21 S22 λ P,1,0,0 S31 S4 λp→m
M,1,1,1 S6 S2 1/M J,0,0,0 S22 S21 μ S31 S28 μ
S6 S7 λs J,2,2,0 S23 S26 λj→p S31 S34 λ
S6 S13 λ J,2,1,0 S24 S27 λj→p P,0,2,0 S32 S30 1/tswitch
M,1,0,1 S7 S3 1/M J,2,0,0 S25 S28 λj→p S32 S33 λs
S14 λ P,0,1,0 S33 S30 1/tswitch
M,0,3,0 S8 S2 1/tswitch S33 S34 λs
S8 S9 λs P,0,0,0 S34 S31 μ
M,0,2,0 S9 S3 1/tswitch
S10 λs
M,0,1,0 S10 S4 1/tswitch
S11 λs
M,0,0,0 S11 S4 μ
M,0,2,1 S12 S6 1/tswitch
S12 S13 λs
M,0,1,1 S13 S7 1/tswitch
S13 S14 λs
M,0,0,1 S14 S4 1/M

a The transition to another state is either through failure or repair of a generator or automatic control
system.

TABLE A14.5
States of the Marine Power Generating System (3 Generators)a
State FR TO Rate State FR TO Rate State FR TO Rate
M,1,2,0 S1 S2 λs J,1,2,0 S11 S12 λs P,2,1,0 S19 S1 λp→m
S4 m S13 λ S20 λs
S6 λ S17 1/tswitch S21 λ
S11 λm→j J,1,1,0 S12 S11 μ P,2,0,0 S20 S2 λp→m
M,1,1,0 S2 S1 μ S14 λ S19 μ
S3 λs S15 λs S22 λ
S4 m S18 1/tswitch P,1,1,0 S21 S2 λp→m
S7 λ J,0,2,0 S13 S12 1/tswitch S20 1/tswitch
S12 λm→j S14 λs S22 λs
M,1,0,0 S3 S2 μ J,0,1,0 S14 S15 1/tswitch S23 λ
S8 λ S16 λs P,1,0,0 S22 S3 λp→m
S15 λm→j J,1,0,0 S15 S12 μ S20 μ
M,1,1,1 S4 S1 1/M S16 λ S24 λ
S5 λs J,0,0,0 S16 S15 μ P,0,1,0 S23 S22 1/tswitch
S9 λ J,2,1,0 S17 S19 λj→p S24 λs
M,1,0,1 S5 S2 1/M J,2,0,0 S18 S20 λj→p P,0,0,0 S24 S22 μ
S10 λ
M,0,2,0 S6 S2 1/tswitch
S7 λs
M,0,1,0 S7 S3 1/tswitch
S3 μ
M,0,0,0 S8 S5 1/tswitch
S10 λs
M,0,1,1 S9 S3 1/M
S2 λs
M,0,0,1 S10 S4 m

a The transition to another state is either through failure or repair of a generator or automatic control
system.

TABLE A14.6
V-Vector and Mean Sojourn Times (4 Generators)
State vi hi State vi hi
1 2.9110E-01 1.4128E+00 18 1.1644E-06 1.2502E-02
2 1.6476E-03 1.0933E+00 19 5.1062E-09 1.2502E-02
3 7.2131E-06 1.0938E+00 20 2.2407E-11 1.2502E-02
4 1.0464E-08 1.4137E+00 21 7.4578E-09 4.8228E+00
5 8.5464E-02 4.8123E+00 22 1.6289E-11 4.8333E+00
6 3.7416E-04 4.8123E+00 23 2.0564E-01 7.0000E+00
7 1.6319E-06 4.8228E+00 24 9.0063E-04 7.0000E+00
8 1.8626E-04 1.2502E-02 25 3.9522E-06 7.0000E+00
9 8.1683E-07 1.2502E-02 26 2.0620E-01 2.9919E+00
10 3.5778E-09 1.2502E-02 27 1.4607E-03 1.8480E+00
11 6.7201E-12 4.8333E+00 28 6.3973E-06 1.8495E+00
12 1.8626E-04 1.2502E-02 29 2.7939E-04 1.2451E-02
13 8.1650E-07 1.2502E-02 30 1.2257E-06 1.2451E-02
14 3.5691E-09 4.8333E+00 31 5.3700E-09 1.8495E+00
15 2.0564E-01 1.2502E-02 32 1.5754E-09 1.2502E-02
16 9.0297E-04 1.2470E-02 33 6.9200E-12 1.2502E-02
17 3.9625E-06 1.2470E-02 34 4.4978E-12 4.8333E+00

TABLE A14.7
V-Vector and Mean Sojourn Times (3 Generators)
State vi hi State vi hi
1 2.9111E-01 1.4128E+00 13 1.1644E-06 1.2502E-02
2 1.6468E-03 1.0933E+00 14 5.1164E-09 1.2502E-02
3 3.2053E-06 1.4137E+00 15 2.2809E-06 4.8228E+00
4 8.5465E-02 4.8123E+00 16 4.9820E-09 4.8333E+00
5 3.7253E-04 4.8228E+00 17 2.0564E-01 7.0000E+00
6 1.8626E-04 1.2502E-02 18 9.0243E-04 7.0000E+00
7 8.1641E-07 1.2503E-02 19 2.0620E-01 2.9919E+00
8 2.0523E-09 4.8333E+00 20 1.4605E-03 1.8495E+00
9 1.8626E-04 1.2502E-02 21 2.7940E-04 1.2451E-02
10 8.1473E-07 4.8333E+00 22 1.2276E-06 1.8495E+00
11 2.0564E-01 1.2502E-02 23 1.5754E-09 1.2502E-02
12 9.0477E-04 1.2470E-02 24 1.0282E-09 1.0000E+00
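
The two columns of Table A14.7 can be combined into steady-state probabilities. The sketch below assumes the standard semi-Markov result: if v is the stationary vector of the embedded Markov chain and h_i is the mean sojourn time in state i, then the long-run probability of state i is v_i h_i divided by the sum of all such products. The function name is illustrative. With the values transcribed here, the dominant states come out close to the largest-crew column of Table A14.13; the crew size underlying Table A14.7 is not stated in this appendix.

```python
# v_i and h_i for states 1..24, transcribed from Table A14.7 (3 generators)
v = [
    2.9111e-01, 1.6468e-03, 3.2053e-06, 8.5465e-02, 3.7253e-04, 1.8626e-04,
    8.1641e-07, 2.0523e-09, 1.8626e-04, 8.1473e-07, 2.0564e-01, 9.0477e-04,
    1.1644e-06, 5.1164e-09, 2.2809e-06, 4.9820e-09, 2.0564e-01, 9.0243e-04,
    2.0620e-01, 1.4605e-03, 2.7940e-04, 1.2276e-06, 1.5754e-09, 1.0282e-09,
]
h = [
    1.4128e+00, 1.0933e+00, 1.4137e+00, 4.8123e+00, 4.8228e+00, 1.2502e-02,
    1.2503e-02, 4.8333e+00, 1.2502e-02, 4.8333e+00, 1.2502e-02, 1.2470e-02,
    1.2502e-02, 1.2502e-02, 4.8228e+00, 4.8333e+00, 7.0000e+00, 7.0000e+00,
    2.9919e+00, 1.8495e+00, 1.2451e-02, 1.8495e+00, 1.2502e-02, 1.0000e+00,
]

def semi_markov_steady_state(v, h):
    """Weight each embedded-chain probability by its mean sojourn time and normalise."""
    weights = [vi * hi for vi, hi in zip(v, h)]
    total = sum(weights)
    return [w / total for w in weights]

pi = semi_markov_steady_state(v, h)
for state in (1, 4, 17, 19):
    print(f"S{state}: {pi[state - 1]:.5e}")
```

States with long sojourn times (for example S17, with h = 7) dominate the result even when their embedded-chain probability is comparable to that of short-lived states such as S11.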

TABLE A14.8
Steady-State Probabilities of Electric Power Generating System
(4 Generators)
Probability
Phase State Crew of 6
Maintenance S1 M,1,3,0 1.4147949E-01
S2 M,1,2,0 1.2390664E-03
S3 M,1,1,0 1.0848216E-05
S4 M,1,0,0 2.5590879E-08
S5 M,1,2,1 1.4147978E-01
S6 M,1,1,1 1.2387772E-03
S7 M,1,0,1 1.0802759E-05
S8 M,0,3,0 8.0109202E-07
S9 M,0,2,0 7.0204378E-09
S10 M,0,1,0 6.1465043E-11
S11 M,0,0,0 1.1230434E-10
S12 M,0,2,1 8.0109365E-07
S13 M,0,1,1 7.0188000E-09
S14 M,0,0,1 4.7324521E-08
Journey S15 J,1,3,0 8.8442730E-04
S16 J,1,2,0 7.7457435E-06
S17 J,1,1,0 6.7974990E-08
S18 J,0,3,0 5.0078470E-09
S19 J,0,2,0 4.3886679E-11
S20 J,0,1,0 3.8513977E-13
S21 J,1,0,0 1.2428462E-07
S22 J,0,0,0 5.4411190E-10
S23 J,2,2,0 4.9517823E-01
S24 J,2,1,0 4.3367313E-03
S25 J,2,0,0 3.8058227E-05
Port S26 P,2,2,0 2.1221906E-01
S27 P,2,1,0 1.8575841E-03
S28 P,2,0,0 1.6281384E-05
S29 P,1,2,0 1.1966432E-06
S30 P,1,1,0 1.0487945E-08
S31 P,1,0,0 1.6892712E-08
S32 P,0,2,0 6.7756912E-12
S33 P,0,1,0 5.9423717E-14
S34 P,0,0,0 7.3956051E-11

TABLE A14.9
Steady-State Probabilities of Electric Power Generating System (3 Generators)
Probability
Phase State Crew of 6
Maintenance S1 M,1,2,0 1.4148529E-01
S2 M,1,1,0 1.2381783E-03
S3 M,1,0,0 3.8518151E-06
S4 M,1,1,1 1.4148464E-01
S5 M,1,0,1 1.2334183E-03
S6 M,0,2,0 8.0112483E-07
S7 M,0,1,0 7.0154487E-09
S8 M,0,0,0 1.6863003E-08
S9 M,0,1,1 8.0112119E-07
S10 M,0,0,1 5.4033344E-06
Journey S11 J,1,2,0 8.8446355E-04
S12 J,1,1,0 7.7642464E-06
S13 J,0,2,0 5.0080522E-09
S14 J,0,1,0 4.3991448E-11
S15 J,1,0,0 1.8685111E-05
S16 J,0,0,0 8.1802426E-08
S17 J,2,1,0 4.9519852E-01
S18 J,2,0,0 4.3470908E-03
Port S19 P,2,1,0 2.1222833E-01
S20 P,2,0,0 1.8595168E-03
S21 P,1,1,0 1.1966955E-06
S22 P,1,0,0 1.9305769E-06
S23 P,0,1,0 6.7759872E-12
S24 P,0,0,0 8.7434201E-10

TABLE A14.10
Steady-State Probabilities for Different Crews (4 Generators)
Probabilities (1 of 2)
State Crew 1 2 3 4 5 6
M,1,3,0 1 1.35245E-01 1.38982E-01 1.40230E-01 1.40855E-01 1.41230E-01 1.41479E-01
M,1,2,0 2 7.10546E-03 3.65104E-03 2.45599E-03 1.85026E-03 1.48420E-03 1.23907E-03
M,1,1,0 3 3.73045E-04 9.58536E-05 4.29911E-05 2.42938E-05 1.55917E-05 1.08482E-05
M,1,0,0 4 1.11179E-06 2.72044E-07 1.16200E-07 6.26382E-08 3.84186E-08 2.55909E-08
M,1,2,1 5 1.35246E-01 1.38982E-01 1.40231E-01 1.40855E-01 1.41230E-01 1.41480E-01
M,1,1,1 6 7.10492E-03 3.65068E-03 2.45566E-03 1.84996E-03 1.48391E-03 1.23878E-03
M,1,0,1 7 3.63724E-04 9.46614E-05 4.26364E-05 2.41436E-05 1.55141E-05 1.08028E-05
M,0,3,0 8 7.65792E-07 7.86949E-07 7.94018E-07 7.97555E-07 7.99677E-07 8.01092E-07
M,0,2,0 9 4.02372E-08 2.06775E-08 1.39109E-08 1.04812E-08 8.40845E-09 7.02044E-09
M,0,1,0 10 2.11250E-09 5.42864E-10 2.43505E-10 1.37617E-10 8.83315E-11 6.14650E-11
M,0,0,0 11 2.92597E-08 3.58011E-09 1.01957E-09 4.12243E-10 2.02297E-10 1.12304E-10
M,0,2,1 12 7.65794E-07 7.86951E-07 7.94020E-07 7.97556E-07 7.99679E-07 8.01094E-07
M,0,1,1 13 4.02342E-08 2.06755E-08 1.39091E-08 1.04794E-08 8.40677E-09 7.01880E-09
M,0,0,1 14 9.55521E-06 1.24354E-06 3.73441E-07 1.58617E-07 8.15479E-08 4.73245E-08
J,1,3,0 15 8.45455E-04 8.68813E-04 8.76618E-04 8.80522E-04 8.82865E-04 8.84427E-04
J,1,2,0 16 4.44182E-05 2.28236E-05 1.53530E-05 1.15665E-05 9.27816E-06 7.74574E-06
J,1,1,0 17 2.33895E-06 6.00907E-07 2.69475E-07 1.52259E-07 9.77078E-08 6.79750E-08
J,0,3,0 18 4.78718E-09 4.91944E-09 4.96363E-09 4.98573E-09 4.99900E-09 5.00785E-09
J,0,2,0 19 2.51534E-10 1.29261E-10 8.69609E-11 6.55207E-11 5.25635E-11 4.38867E-11
J,0,1,0 20 1.32452E-11 3.40322E-12 1.52633E-12 8.62498E-13 5.53544E-13 3.85140E-13
J,1,0,0 21 3.23649E-05 3.96043E-06 1.12799E-06 4.56127E-07 2.23855E-07 1.24285E-07
J,0,0,0 22 8.50149E-07 5.20156E-08 9.87656E-09 2.99535E-09 1.17603E-09 5.44112E-10
J,2,2,0 23 4.73358E-01 4.86436E-01 4.90806E-01 4.92992E-01 4.94304E-01 4.95178E-01
J,2,1,0 24 2.48691E-02 1.27786E-02 8.59595E-03 6.47592E-03 5.19471E-03 4.33673E-03
J,2,0,0 25 1.30955E-03 3.36440E-04 1.50875E-04 8.52475E-05 5.47052E-05 3.80582E-05
P,2,2,0 26 2.02868E-01 2.08473E-01 2.10345E-01 2.11282E-01 2.11844E-01 2.12219E-01
P,2,1,0 27 1.06571E-02 5.47545E-03 3.68290E-03 2.77433E-03 2.22527E-03 1.85758E-03
P,2,0,0 28 5.60425E-04 1.43969E-04 6.45574E-05 3.64737E-05 2.34044E-05 1.62814E-05
P,1,2,0 29 1.14391E-06 1.17552E-06 1.18608E-06 1.19136E-06 1.19453E-06 1.19664E-06
P,1,1,0 30 6.01056E-08 3.08880E-08 2.07803E-08 1.56572E-08 1.25612E-08 1.04879E-08
P,1,0,0 31 7.24062E-07 1.77306E-07 7.59545E-08 4.10781E-08 2.52784E-08 1.68927E-08
P,0,2,0 32 6.47713E-12 6.65607E-12 6.71586E-12 6.74577E-12 6.76372E-12 6.77569E-12
P,0,1,0 33 3.40369E-13 1.74933E-13 1.17701E-13 8.86932E-14 7.11628E-14 5.94237E-14
P,0,0,0 34 1.90194E-08 2.32870E-09 6.65049E-10 2.69756E-10 1.32802E-10 7.39561E-11

TABLE A14.11
Steady-State Probabilities for Different Crews (4 Generators)
Probabilities (2 of 2)
State Crew 7 8 9 10 11 12
M,1,3,0 1 1.41658E-01 1.41792E-01 1.41896E-01 1.41979E-01 1.42048E-01 1.42104E-01
M,1,2,0 2 1.06343E-03 9.31411E-04 8.28553E-04 7.46157E-04 6.78668E-04 6.22377E-04
M,1,1,0 3 7.98124E-06 6.11720E-06 4.83750E-06 3.92115E-06 3.24256E-06 2.72606E-06
M,1,0,0 4 1.80554E-08 1.32918E-08 1.01108E-08 7.89429E-09 6.29624E-09 5.11149E-09
M,1,2,1 5 1.41658E-01 1.41792E-01 1.41896E-01 1.41980E-01 1.42048E-01 1.42105E-01
M,1,1,1 6 1.06315E-03 9.31136E-04 8.28284E-04 7.45894E-04 6.78411E-04 6.22126E-04
M,1,0,1 7 7.95212E-06 6.09728E-06 4.82316E-06 3.91040E-06 3.23423E-06 2.71943E-06
M,0,3,0 8 8.02103E-07 8.02861E-07 8.03451E-07 8.03922E-07 8.04308E-07 8.04630E-07
M,0,2,0 9 6.02595E-09 5.27842E-09 4.69602E-09 4.22948E-09 3.84734E-09 3.52861E-09
M,0,1,0 10 4.52259E-11 3.46670E-11 2.74177E-11 2.22265E-11 1.83820E-11 1.54556E-11
M,0,0,0 11 6.79228E-11 4.37567E-11 2.95897E-11 2.07948E-11 1.50791E-11 1.12227E-11
M,0,2,1 12 8.02104E-07 8.02863E-07 8.03452E-07 8.03924E-07 8.04310E-07 8.04631E-07
M,0,1,1 13 6.02436E-09 5.27687E-09 4.69450E-09 4.22799E-09 3.84589E-09 3.52719E-09
M,0,0,1 14 2.98631E-08 2.00374E-08 1.40907E-08 1.02828E-08 7.73242E-09 5.96046E-09
J,1,3,0 15 8.85543E-04 8.86380E-04 8.87031E-04 8.87552E-04 8.87978E-04 8.88333E-04
J,1,2,0 16 6.64780E-06 5.82250E-06 5.17951E-06 4.66443E-06 4.24254E-06 3.89065E-06
J,1,1,0 17 5.00057E-08 3.83233E-08 3.03037E-08 2.45615E-08 2.03094E-08 1.70732E-08
J,0,3,0 18 5.01417E-09 5.01891E-09 5.02259E-09 5.02554E-09 5.02795E-09 5.02996E-09
J,0,2,0 19 3.76699E-11 3.29969E-11 2.93561E-11 2.64396E-11 2.40508E-11 2.20583E-11
J,0,1,0 20 2.83358E-13 2.17183E-13 1.71753E-13 1.39223E-13 1.15133E-13 9.67977E-14
J,1,0,0 21 7.51762E-08 4.84345E-08 3.27563E-08 2.30225E-08 1.66962E-08 1.24275E-08
J,0,0,0 22 2.82101E-10 1.59033E-10 9.56038E-11 6.04751E-11 3.98703E-11 2.72037E-11
J,2,2,0 23 4.95803E-01 4.96272E-01 4.96636E-01 4.96928E-01 4.97166E-01 4.97365E-01
J,2,1,0 24 3.72201E-03 3.25994E-03 2.89993E-03 2.61155E-03 2.37534E-03 2.17832E-03
J,2,0,0 25 2.79975E-05 2.14567E-05 1.69666E-05 1.37516E-05 1.13710E-05 9.55905E-06
P,2,2,0 26 2.12487E-01 2.12688E-01 2.12844E-01 2.12969E-01 2.13071E-01 2.13156E-01
P,2,1,0 27 1.59415E-03 1.39615E-03 1.24188E-03 1.11831E-03 1.01709E-03 9.32677E-04
P,2,0,0 28 1.19767E-05 9.17819E-06 7.25718E-06 5.88176E-06 4.86331E-06 4.08820E-06
P,1,2,0 29 1.19815E-06 1.19929E-06 1.20017E-06 1.20087E-06 1.20145E-06 1.20193E-06
P,1,1,0 30 9.00255E-09 7.88604E-09 7.01619E-09 6.31940E-09 5.74869E-09 5.27269E-09
P,1,0,0 31 1.19558E-08 8.82792E-09 6.73458E-09 5.27269E-09 4.21643E-09 3.43167E-09
P,0,2,0 32 6.78424E-12 6.79065E-12 6.79564E-12 6.79963E-12 6.80289E-12 6.80561E-12
P,0,1,0 33 5.10131E-14 4.46912E-14 3.97659E-14 3.58205E-14 3.25890E-14 2.98938E-14
P,0,0,0 34 4.48649E-11 2.89865E-11 1.96568E-11 1.38509E-11 1.00687E-11 7.51153E-12

TABLE A14.12
Steady-State Probabilities for Different Crews (3 Generators)
Probabilities (1 of 2)
State Crew 1 2 3 4 5 6
M,1,2,0 1 1.35479E-01 1.39041E-01 1.40256E-01 1.40869E-01 1.41238E-01 1.41485E-01
M,1,1,0 2 7.11002E-03 3.64893E-03 2.45415E-03 1.84884E-03 1.48309E-03 1.23818E-03
M,1,0,0 3 2.74263E-05 1.34534E-05 8.65137E-06 6.24126E-06 4.80228E-06 3.85182E-06
M,1,1,1 4 1.35472E-01 1.39037E-01 1.40254E-01 1.40867E-01 1.41237E-01 1.41485E-01
M,1,0,1 5 6.93487E-03 3.60483E-03 2.43476E-03 1.83805E-03 1.47623E-03 1.23342E-03
M,0,2,0 6 7.67116E-07 7.87283E-07 7.94163E-07 7.97634E-07 7.99726E-07 8.01125E-07
M,0,1,0 7 4.02633E-08 2.06657E-08 1.39006E-08 1.04732E-08 8.40221E-09 7.01545E-09
M,0,0,0 8 7.20424E-07 1.76694E-07 7.57503E-08 4.09858E-08 2.52289E-08 1.68630E-08
M,0,1,1 9 7.67076E-07 7.87264E-07 7.94152E-07 7.97626E-07 7.99721E-07 8.01121E-07
M,0,0,1 10 1.82183E-04 4.73554E-05 2.13254E-05 1.20755E-05 7.75960E-06 5.40333E-06
J,1,2,0 11 8.46916E-04 8.69182E-04 8.76778E-04 8.80609E-04 8.82919E-04 8.84464E-04
J,1,1,0 12 4.46181E-05 2.28945E-05 1.53956E-05 1.15966E-05 9.30121E-06 7.76425E-06
J,0,2,0 13 4.79545E-09 4.92152E-09 4.96453E-09 4.98623E-09 4.99931E-09 5.00805E-09
J,0,1,0 14 2.52666E-10 1.29662E-10 8.72020E-11 6.56912E-11 5.26941E-11 4.39914E-11
J,1,0,0 15 7.97707E-04 1.95675E-04 8.38996E-05 4.54015E-05 2.79510E-05 1.86851E-05
J,0,0,0 16 2.09539E-05 2.56996E-06 7.34615E-07 2.98148E-07 1.46841E-07 8.18024E-08
J,2,1,0 17 4.74176E-01 4.86642E-01 4.90895E-01 4.93041E-01 4.94334E-01 4.95199E-01
J,2,0,0 18 2.49811E-02 1.28183E-02 8.61979E-03 6.49278E-03 5.20762E-03 4.34709E-03
P,2,1,0 19 2.03219E-01 2.08562E-01 2.10384E-01 2.11304E-01 2.11858E-01 2.12228E-01
P,2,0,0 20 1.06905E-02 5.48495E-03 3.68807E-03 2.77777E-03 2.22777E-03 1.85952E-03
P,1,1,0 21 1.14590E-06 1.17602E-06 1.18630E-06 1.19148E-06 1.19461E-06 1.19670E-06
P,1,0,0 22 1.38134E-05 6.75649E-06 4.34056E-06 3.12977E-06 2.40744E-06 1.93058E-06
P,0,1,0 23 6.48835E-12 6.65892E-12 6.71711E-12 6.74646E-12 6.76416E-12 6.77599E-12
P,0,0,0 24 6.25597E-09 3.05995E-09 1.96580E-09 1.41745E-09 1.09031E-09 8.74342E-10

TABLE A14.13
Steady-State Probabilities for Different Crews (3 Generators)
Probabilities (2 of 2)
State Crew 7 8 9 10 11 12
M,1,2,0 1 1.41662E-01 1.41795E-01 1.41898E-01 1.41981E-01 1.42049E-01 1.42106E-01
M,1,1,0 2 1.06271E-03 9.30808E-04 8.28044E-04 7.45722E-04 6.78293E-04 6.22051E-04
M,1,0,0 3 3.18093E-06 2.68457E-06 2.30418E-06 2.00458E-06 1.76339E-06 1.56571E-06
M,1,1,1 4 1.41662E-01 1.41795E-01 1.41898E-01 1.41981E-01 1.42049E-01 1.42105E-01
M,1,0,1 5 1.05920E-03 9.28102E-04 8.25882E-04 7.43944E-04 6.76797E-04 6.20768E-04
M,0,2,0 6 8.02126E-07 8.02878E-07 8.03464E-07 8.03933E-07 8.04317E-07 8.04637E-07
M,0,1,0 7 6.02188E-09 5.27504E-09 4.69317E-09 4.22704E-09 3.84524E-09 3.52678E-09
M,0,0,0 8 1.19365E-08 8.81467E-09 6.72503E-09 5.26555E-09 4.21092E-09 3.42729E-09
M,0,1,1 9 8.02123E-07 8.02876E-07 8.03462E-07 8.03932E-07 8.04316E-07 8.04636E-07
M,0,0,1 10 3.97767E-06 3.05001E-06 2.41278E-06 1.95628E-06 1.61809E-06 1.36060E-06
J,1,2,0 11 8.85569E-04 8.86399E-04 8.87046E-04 8.87563E-04 8.87987E-04 8.88341E-04
J,1,1,0 12 6.66313E-06 5.83549E-06 5.19071E-06 4.67422E-06 4.25120E-06 3.89837E-06
J,0,2,0 13 5.01431E-09 5.01901E-09 5.02267E-09 5.02560E-09 5.02800E-09 5.03001E-09
J,0,1,0 14 3.77567E-11 3.30704E-11 2.94195E-11 2.64951E-11 2.40998E-11 2.21020E-11
J,1,0,0 15 1.32282E-05 9.76991E-06 7.45489E-06 5.83785E-06 4.66926E-06 3.80088E-06
J,0,0,0 16 4.96391E-08 3.20791E-08 2.17581E-08 1.53347E-08 1.11501E-08 8.32006E-09
J,2,1,0 17 4.95817E-01 4.96282E-01 4.96644E-01 4.96934E-01 4.97171E-01 4.97369E-01
J,2,0,0 18 3.73059E-03 3.26721E-03 2.90621E-03 2.61703E-03 2.38018E-03 2.18264E-03
P,2,1,0 19 2.12494E-01 2.12693E-01 2.12848E-01 2.12972E-01 2.13074E-01 2.13158E-01
P,2,0,0 20 1.59570E-03 1.39741E-03 1.24293E-03 1.11920E-03 1.01786E-03 9.33341E-04
P,1,1,0 21 1.19819E-06 1.19931E-06 1.20019E-06 1.20089E-06 1.20146E-06 1.20194E-06
P,1,0,0 22 1.59411E-06 1.34523E-06 1.15454E-06 1.00438E-06 8.83513E-07 7.84461E-07
P,0,1,0 23 6.78445E-12 6.79081E-12 6.79576E-12 6.79973E-12 6.80297E-12 6.80568E-12
P,0,0,0 24 7.21957E-10 6.09244E-10 5.22883E-10 4.54877E-10 4.00137E-10 3.55278E-10

TABLE A14.14
States of Operation and Unavailability for 4 and 3 Generators
State 4 Generators State 3 Generators
M,1,3,0 1 Operation M,1,2,0 1 Operation
M,1,2,0 2 Operation M,1,1,0 2 Operation
M,1,1,0 3 Operation M,1,0,0 3 Operation
M,1,0,0 4 Operation M,1,1,1 4 Operation
M,1,2,1 5 Operation M,1,0,1 5 Operation
M,1,1,1 6 Operation M,0,2,0 6 Unavailability
M,1,0,1 7 Operation M,0,1,0 7 Unavailability
M,0,3,0 8 Unavailability M,0,0,0 8 Unavailability
M,0,2,0 9 Unavailability M,0,1,1 9 Unavailability
M,0,1,0 10 Unavailability M,0,0,1 10 Unavailability
M,0,0,0 11 Unavailability J,1,2,0 11 Operation
M,0,2,1 12 Unavailability J,1,1,0 12 Operation
M,0,1,1 13 Unavailability J,0,2,0 13 Unavailability
M,0,0,1 14 Unavailability J,0,1,0 14 Unavailability
J,1,3,0 15 Operation J,1,0,0 15 Operation
J,1,2,0 16 Operation J,0,0,0 16 Unavailability
J,1,1,0 17 Operation J,2,1,0 17 Operation
J,0,3,0 18 Unavailability J,2,0,0 18 Operation
J,0,2,0 19 Unavailability P,2,1,0 19 Operation
J,0,1,0 20 Unavailability P,2,0,0 20 Operation
J,1,0,0 21 Operation P,1,1,0 21 Unavailability
J,0,0,0 22 Unavailability P,1,0,0 22 Unavailability
J,2,2,0 23 Operation P,0,1,0 23 Unavailability
J,2,1,0 24 Operation P,0,0,0 24 Unavailability
J,2,0,0 25 Operation
P,2,2,0 26 Operation
P,2,1,0 27 Operation
P,2,0,0 28 Operation
P,1,2,0 29 Unavailability
P,1,1,0 30 Unavailability
P,1,0,0 31 Unavailability
P,0,2,0 32 Unavailability
P,0,1,0 33 Unavailability
P,0,0,0 34 Unavailability
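
The classification in Table A14.14 can be combined with the steady-state probabilities to obtain a system-level figure. The sketch below assumes that unavailability is the sum of the steady-state probabilities of the states marked "Unavailability", and uses the 3-generator, crew-of-6 values from Table A14.9; the variable names are illustrative and not taken from the chapter.

```python
# Steady-state probabilities for states S1..S24 (3 generators, crew of 6), from Table A14.9
p = [
    1.4148529e-01, 1.2381783e-03, 3.8518151e-06, 1.4148464e-01,
    1.2334183e-03, 8.0112483e-07, 7.0154487e-09, 1.6863003e-08,
    8.0112119e-07, 5.4033344e-06, 8.8446355e-04, 7.7642464e-06,
    5.0080522e-09, 4.3991448e-11, 1.8685111e-05, 8.1802426e-08,
    4.9519852e-01, 4.3470908e-03, 2.1222833e-01, 1.8595168e-03,
    1.1966955e-06, 1.9305769e-06, 6.7759872e-12, 8.7434201e-10,
]

# State numbers marked "Unavailability" for 3 generators in Table A14.14
unavailable = {6, 7, 8, 9, 10, 13, 14, 16, 21, 22, 23, 24}

unavailability = sum(p[s - 1] for s in unavailable)
availability = sum(p[s - 1] for s in range(1, 25) if s not in unavailable)

print(f"unavailability = {unavailability:.4e}")
print(f"availability   = {availability:.8f}")
```

Summing the two classes separately, rather than taking one as the complement of the other, also shows how closely the tabulated probabilities add up to one.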

REFERENCES
Alzbutas, R. (2003). Diesel generators reliability data analysis and testing interval optimiza-
tion. Energetika 4:27–33.
Barbu, V.S., Karagrigoriou, A. (2018). Modeling and inference for multi-state systems,
In: Lisnianski A., Frenkel I., Karagrigoriou A. (eds) Recent Advances in Multi-state
Systems Reliability: Springer Series in Reliability Engineering. Springer, Cham,
Switzerland. doi:10.1007/978-3-319-63423-4_16.
Barlow, R., Proschan, F. (1996). Mathematical Theory of Reliability. John Wiley  & Sons,
New York.
Ben-Daya, M., Duffuaa, S., Raouf, A. (2000). Maintenance Modelling and Optimization.
Springer Science and Media, New York.
Billinton, R., Li, Y. (2007). Incorporating multi state unit models in composite system ade-
quacy assessment. European Transactions on Electrical Power 17:375–386.
Brocken, E.M. (2016). Improving the Reliability of Ship Machinery: A  Step Towards
Unmanned Shipping. Delft University of Technology, Delft, the Netherlands.
Chowdhury, C. (1988). A  systematic survey of the maintenance models. Periodica
Polytechnica Engineering Mechanical Engineering 32(3–4):253–274.
Det Norske Veritas DNV (2011). Machinery Systems General, in Rules for Classification of
Ships, Høvik, Norway.
Eryilmaz, S. (2015). Assessment of a multi-state system under a shock model. Applied
Mathematics and Computation 269:1–8.
IEEE 1159–1995: IEEE Recommended Practice for Monitoring Electric Power Quality, 1995.
IEEE 45–2002: IEEE Recommended Practice for Electrical Installations on Shipboard, 2002.
IMO (2005). Unified Interpretations to SOLAS Chapters II-1 and XII and to the Technical
Provisions for Means of Access for Inspections, London, UK. http://imo.udhb.gov.tr/
dosyam/EKLER/SOLAS__BOLUM_II_1_EK(21).pdf.
IMO Study on the optimization of energy consumption as part of implementation of a Ship
Energy Efficiency Management Plan (SEEMP) 2016.
IMO MEPC 1-CIRC 866 (E). (2014). Guidelines on the Method of Calculation of the Attained
Energy Efficiency Design Index (EEDI) For  New Ships, As Amended (Resolution
Mepc.245(66), As Amended By Resolutions Mepc.263(68) And Mepc.281(70), January
2017.
Levitin, G., Lisnianski, A. (1999). Joint redundancy and maintenance optimization for multi-
state series-parallel systems. Reliability Engineering & System Safety 64(1):33–42.
Levitin, G., Xing, L. (2018). Dynamic performance of series parallel multi-state sys-
tems with standby subsystems or repairable binary elements, In: Lisnianski  A.,
Frenkel I., Karagrigoriou A. (eds) Recent Advances in Multi-state Systems
Reliability: Springer Series in Reliability Engineering. Springer, Cham, Switzerland.
doi:10.1007/978-3-319-63423-4_16.
Lisnianski, A., Frenkel, I., Ding, Y. (2010). Multi State Systems Reliability and Optimization
for Engineers and Industrial Managers. Springer, London, UK.
Lisnianski, A., Elmakias, D., Laredo, D., Haim, H.B. (2012). A multi-state Markov model
for a short-term reliability analysis of a power generating unit. Reliability Engineering
& System Safety 98:1–6.
Liu, Y., Huang, H.Z. (2010). Optimal replacement policy for multi-state system under imper-
fect maintenance. IEEE Transactions on Reliability 59(3):483–495.
Liu, Y.W., Kapur, K.C. (2006). Reliability measures for dynamic multi state non repairable
systems and their applications to system performance evaluation. IIE Transaction
38(6):511–520.
Maes, W. (2013). Marine Electrical Knowledge. Antwerp Maritime Academy, Antwerp,
Belgium.
MAIB (2011). Report on the investigation of the catastrophic failure of a capacitor in the aft
harmonic filter room on board RMS Queen Mary 2 while approaching Barcelona on
23 September 2010. Marine Accident Investigation Branch. http://www.maib.gov.uk/
publications/investigation_reports/2011/qm2.cfm (last accessed May 7, 2018).
Markopoulos, T., Platis, A. (2018). Reliability analysis of a modified IEEE 6 BUS RBTS,
In: Lisnianski A., Frenkel I., Karagrigoriou A. (eds) Recent Advances in Multi-state
Systems Reliability. Springer Series in Reliability Engineering. Springer, Cham,
Switzerland. doi:10.1007/978-3-319-63423-4_16.
Mennis, E., Platis, A. (2013). Availability assessment of diesel generator system of a ship:
A case study. International Journal of Performability Engineering 9(5):561–567.
MEPC 61/inf.18: Reduction of GHG Emissions from Ships—Marginal abatement costs and
cost-effectiveness of energy-efficiency measures, October 2010.
MEPC.1-Circ.681–2: Interim Guidelines on the Method of Calculation of the Energy
Efficiency Design Index for New Ships, August 2009.
MEPC.1-Circ.684: Guidelines for Voluntary Use of the Ship Energy Efficiency Operational
Indicator (EEOI), August 2009.
Miller, T. (2012). Risk focus: Loss of power. http://www.ukpandi.com/fileadmin/uploads/
uk-pi/Documents/Brochures/Risk%20Focus%20-%20Loss%20of%20Power.pdf.
Mindykowski, J. (2014). Power quality on ships: Today and tomorrow’s challenges.
International Conference and Exposition on Electrical and Power Engineering (EPE
2014), Iasi, Romania.
Mindykowski, J. (2016). Case study—Based overview of some contemporary challenges to
power quality in ship systems. Inventions 1(2):12.
Mindykowski, J., Tarasiuk, T. (2015). Problems of power quality in the wake of ship
technology development. Ocean Engineering 107:108–117.
MSC/Circular.645-Guidelines for Vessels with Dynamic Positioning Systems-(adopted on 6
June 1994).
MTS DP Technical Committee. DP Vessel Design Philosophy Guidelines Part II.
MUNIN. D6.7: Maintenance indicators and maintenance management principles for autono-
mous engine room, 2014.
Nakagawa, T. (2006). Maintenance Theory of Reliability. Springer Science  & Business
Media, London, UK.
OREDA (2002). Offshore Reliability Data Handbook, 4th ed. OREDA, Trondheim, Norway.
Patel, M.R. (2012). Shipboard Electrical Power Systems. CRC Press, Boca Raton, FL.
Prousalidis, J., Styvaktakis, E., Hatzilau, I.K., Kanellos, F., Perros, S., Sofras, E. (2008).
Electric power supply quality in ship systems: An overview. International Journal of
Ocean Systems Management 1(1):68–83.
Prousalidis, J.M., Tsekouras, J.G., Kanellos, F. (2011). New challenges emerged from the
development of more efficient electric energy generation units. From Electric Ship
Technologies Symposium (ESTS), IEEE. doi:10.1109/ESTS.2011.5770901.
Shagar, V., Jayasinghe, S.G., Enshaei, H. (2017). Effect of load changes on hybrid shipboard
power systems and energy storage as a potential solution: A review. Inventions 2:21.
Stevens, B., Dubey, A., Santoso, S. (2015). On improving reliability of shipboard power sys-
tem. IEEE Transactions on Power Systems 30(4):1905–1906.
Trivedi, K.S. (2002). Probability and Statistics with Reliability, Queuing and Computer
Science Applications. Wiley, New York.
Wärtsilä (2014). WSD 42111K, Aframax Tanker for Oil and Products. Wärtsilä Corporation,
Helsinki, Finland.
Wu, Z., Yao, Y., Wang, D. (2013). The reliability modeling of marine power station. Applied
Mechanics and Materials 427–429:404–407.
Yingkui, G., Jing, L. (2012). Multi state system reliability: A  new and systematic review.
Procedia Engineering 29:531–536.
15 Vulnerability Discovery and Patch Modeling: State of the Art
Avinash K. Shrivastava, P. K. Kapur, and Misbah Anjum

CONTENTS
15.1 Introduction
  15.1.1 Vulnerability
15.2 Literature Review
  15.2.1 Anderson Thermodynamic Model
  15.2.2 Alhazmi Malaiya Logistic Model
  15.2.3 Rescorla Quadratic and Rescorla Exponential Models
    15.2.3.1 Rescorla Quadratic Model
    15.2.3.2 Rescorla Exponential Model
  15.2.4 Vulnerability Discovery Model Using Stochastic Differential Equation
  15.2.5 Effort-Based Vulnerability Discovery Model
  15.2.6 User-Dependent Vulnerability Discovery Model
  15.2.7 Vulnerability Discovery Model for Open and Closed Source
  15.2.8 Coverage Based Vulnerability Discovery Modeling
  15.2.9 Vulnerability Patching Model
    15.2.9.1 One-Dimension Vulnerability Patching Model
    15.2.9.2 Two-Dimensional Vulnerability Patch Modeling
  15.2.10 Vulnerability Discovery and Patching Model
15.3 Vulnerability Discovery in Multi-version Software Systems
  15.3.1 User Dependent Multi-version Vulnerability Discovery Modeling
15.4 Conclusion and Future Directions
References

15.1 INTRODUCTION
With the continual evolution of information technology (IT) infrastructures, the
related vulnerabilities and exploitations are increasing because of security issues
raised during the operational phase. Today, no software system is free from
weaknesses or vulnerabilities, whether it is a system for personal use or for a
large-scale organization. According to the National Vulnerability Database (NVD), a
total of 16,555 security vulnerabilities were reported in 2018 (the highest figure thus
far). This statistic suggests that vulnerability assessment remains the most neglected
security technology today. Thus, there is a need to quantify the discovered software
vulnerabilities with mathematical models so that security can be improved without
increasing penetration costs. Considerable work has already been done on modeling
vulnerabilities with respect to time (Alhazmi & Malaiya 2005a, 2005b; Kimura 2006;
Kim et al. 2007; Okamura et al. 2013; Joh and Malaiya 2014; Kapur et al. 2015;
Sharma et al. 2016; Kansal et al. 2017a, 2017b; Movahedi et al. 2018). Section 15.1.1
briefly discusses the vulnerability life cycle. Section 15.2 reviews the literature on
vulnerability discovery models (VDMs) and describes the modeling frameworks of
VDMs based on different sets of assumptions, followed by vulnerability patching
models (VPMs). Section 15.3 discusses multi-version vulnerability discovery
modeling, and Section 15.4 presents the conclusion and future research directions.

15.1.1 Vulnerability
One of the best definitions of software vulnerability is given by Schultz et al. (1990)
who defined it as follows: “A vulnerability is defined as a defect which enables an
attacker to bypass security measures.” To assess the value of vulnerability finding,
we must examine the events surrounding discovery and disclosure. Schneier (2000)
described the lifecycle of a vulnerability in six phases: Introduction, Discovery,
Private Exploitation, Disclosure, Public Exploitation, and Fix Release. These events
do not necessarily occur strictly in this order. Disclosure and Fix Release often occur
together, especially when a manufacturer discovers a vulnerability and releases the
announcement along with a patch (Figure 15.1).

FIGURE 15.1  Lifecycle of a vulnerability.

Expectation of a more secure software system requires longer testing, which results
in higher cost, a delayed release, and an increased selling price. However, due to
strong market competition, the release cannot be delayed and the price of the
software cannot be increased. Therefore, a trade-off between testing and launch time
is required. Several authors have proposed quantitative models in the existing
literature. These models can help developers allocate resources for security testing,
scheduling, and the development of security patches (Alhazmi & Malaiya 2005a,
2005b; Kimura 2006; Kim et al. 2007; Okamura et al. 2013; Joh et al., 2014; Kapur
et al. 2015; Sharma et al. 2016; Younis et al., 2016; Kansal et al. 2017a, 2017b). In
addition, developers can use VDMs to assess risk and estimate the redundancy needed
in resources and procedures to deal with potential breaches. These measures help to
determine the resources needed to test a specific part of the software. The prime
objective of this study is to understand the mathematical models pertaining to the
vulnerability discovery and patching phenomena.

15.2  LITERATURE REVIEW


In the past few decades, various researchers have considered software security to be
analogous to software reliability and developed vulnerability discovery models along
similar lines (Alhazmi & Malaiya 2005a, 2005b; Woo et al. 2006; Younis et al. 2011;
Narang et al. 2017). Anderson et al. (2002) examined and measured security in open
and closed systems by proposing a thermodynamic VDM. They modeled the discovery
rate based on the mean time between failure (MTBF) and defined the model by analogy
with the software reliability growth model (SRGM). They concluded that there is no
difference between open and closed systems because both behave similarly in the
long run. Rescorla (2003, 2005) argued that vulnerability finding is a better approach
if it is carried out by white hat users. He evaluated the economic effectiveness of
finding and fixing rediscovered vulnerabilities for the developing organizations,
especially when the vulnerabilities are identified by black hat users. He fitted the
non-homogeneous Poisson process (NHPP) reliability growth model to the observed
vulnerability data to evaluate the vulnerability discovery rate over time. He proposed
two statistical models, the Rescorla exponential (RE) model and the Rescorla linear
(RL), or Rescorla quadratic (RQ), model, which were later found to be inadequate
because they cannot predict the behavior of all empirical data sets.
Ozment and Schechter (2006) stated that vulnerability modeling is analogous to
SRGM, with the aim of increasing the reliability of systems regardless of their
operating environments. Ozment (2007) examined the OpenBSD operating system data
set and stated that some vulnerabilities within the data set are dependent. However,
he did not apply any VDM to the data set and instead considered engineering tools for
measuring software security. Alhazmi (2007) developed a logistic VDM (known as the
Alhazmi Malaiya Logistic [AML] model) that quantitatively evaluates the vulnerability
trend over time. He also proposed a new metric, the vulnerability density, which is
analogous to defect density. Software with a high vulnerability density is at major
risk. The author divided the discovery process into three phases: learning, linear,
and saturation. Later, he proposed an effort-based VDM (known as the Alhazmi Malaiya
Effort-based [AME] model) that captures changes in the environment with respect to
effort instead of time. However, these models depend solely on discovery time, which
seems inappropriate because various operational factors may influence the
vulnerability discovery process. Kim et al. (2007) extended the single-version work of
Alhazmi and others by developing a new VDM for multiple versions (known as the
multi-version vulnerability discovery model [MVDM]). They showed that the behavior of
the vulnerability discovery rate for multiple versions differs from that of
single-version modeling for open-source and commercial software systems. Joh et al.
(2008) developed a VDM that follows
the Weibull distribution and is known as the Joh-Weibull (JW) model. The model
represents the asymmetric nature of the vulnerability discovery rate that arises from
skewness in the probability density function. However, this model is also exclusively
dependent on discovery time. Bass et al. (1969) scrutinized the factors that motivate
vulnerability discoverers to spend effort on finding vulnerabilities. According to the
study, discoverers are mostly attracted by bug bounty programs, which have become
their main source of encouragement. However, the authors did not model the
vulnerability discovery process. Massacci and Nguyen (2014) proposed a methodology to
validate the performance of empirical VDMs. The methodology focuses on two
quantitative metrics: quality and predictive capability. Quality is measured on the
basis of good fit and inconclusive fit, while predictive accuracy is measured over
current and future horizons. However, they did not propose any mathematical model.
Joh et al. (2014) examined the relationship between the performance of S-shaped
vulnerability discovery models and the skewness in some vulnerability data sets,
applying Weibull, Beta, Gamma, and Normal distributions. Anand et al. (2017) proposed
an approach to quantify the discovered vulnerabilities across various software
versions. The authors examined their approach using Windows and Windows Server
operating systems. Zhu et al. (2017) proposed a mathematical model that predicts
software vulnerabilities and used the estimated parameters to develop a new risk
model. The authors also determined the severity of vulnerabilities using the logistic
function and the binomial distribution. However, this model is also exclusively
dependent on discovery time. Wai et al. (2018) proposed two new algorithms, mean fit
and trend fit, to predict the vulnerability discovery rate from past vulnerability
data. Recently, Movahedi et al. (2018) used a clustering approach to group
vulnerabilities into clusters, applied NHPP-based software reliability models to
predict the number of vulnerabilities in each cluster, and combined the predictions to
estimate the total number of vulnerabilities in the system.
In the next section, we briefly discuss the VDMs proposed in the literature so far.

15.2.1 Anderson Thermodynamic Model


The pioneering work in developing a VDM was carried out by Ross Anderson
(2002), resulting in a model known as the Anderson Thermodynamic (AT) model.
The model assumes that (1) as soon as a vulnerability is encountered it is removed
with certainty and (2) no new vulnerabilities are introduced while fixing an existing
one. Let Ω(t) be the number of vulnerabilities remaining after t tests and p(t) be
the probability that a test fails; then, according to the AT model, p(t) is given by:

$$ p(t) = \frac{k}{\gamma t} \tag{15.1} $$

where:
k is a constant
γ is a factor that accounts for the lower failure rate during beta testing by users
compared with alpha testing

On solving Equation 15.1, we get the cumulative number of vulnerabilities as follows:

$$ \Omega(t) = \frac{k}{\gamma}\,\ln(Ct) \tag{15.2} $$

where C is the constant of integration. This model is applicable only when t ≥ 1.
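
As a minimal sketch (not part of the original model description), the AT model can be
evaluated numerically. Equation 15.2 is read here as the integral of Equation 15.1,
that is, Ω(t) = (k/γ) ln(Ct). The function names and the parameter values are
illustrative assumptions, not fitted values.

```python
import math

def at_failure_probability(t, k, gamma):
    """Equation 15.1: probability that a test at time t fails."""
    return k / (gamma * t)

def at_cumulative(t, k, gamma, C):
    """Equation 15.2 read as (k / gamma) * ln(C * t); defined for t >= 1."""
    if t < 1:
        raise ValueError("the AT model is applicable only when t >= 1")
    return (k / gamma) * math.log(C * t)

# Hypothetical parameter values, chosen only for illustration.
k, gamma, C = 20.0, 2.0, 1.5
for t in (1, 10, 100):
    print(t, at_failure_probability(t, k, gamma), at_cumulative(t, k, gamma, C))
```

The logarithmic growth means the curve never saturates, which is one reason the
later models in this section introduce an explicit total number of vulnerabilities.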

15.2.2 Alhazmi Malaiya Logistic Model


Alhazmi et al. (2005) proposed an S-shaped VDM and called it the AML model. The
model assumes that the rate of change of the cumulative number of vulnerabilities Ω
depends on both the number of vulnerabilities already discovered and the number that
remain undetected. According to the authors, vulnerability discovery passes through
three phases: learning, linear, and saturation (shown in Figure 15.2).
Under the assumptions of the AML model, we get the following differential
equation:

$$ \frac{d\Omega}{dt} = A\,\Omega\,(B - \Omega) \tag{15.3} $$

where Ω is the cumulative number of vulnerabilities, t is the calendar time, and A and
B are empirical constants to be determined from recorded data. After solving
Equation 15.3 we get:

$$ \Omega(t) = \frac{B}{B C e^{-ABt} + 1} \tag{15.4} $$

where C is the constant of integration and B is the total number of vulnerabilities in
the system.
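
The relation between Equations 15.3 and 15.4 can be checked numerically. The sketch
below is illustrative only: the function names and the values of A, B, and C are
assumptions, and the check is a simple central finite difference.

```python
import math

def aml_cumulative(t, A, B, C):
    """Equation 15.4: cumulative vulnerabilities under the AML model."""
    return B / (B * C * math.exp(-A * B * t) + 1.0)

def aml_rate(omega, A, B):
    """Equation 15.3: discovery rate as a function of the current count."""
    return A * omega * (B - omega)

# Hypothetical parameter values, chosen only for illustration.
A, B, C = 0.002, 100.0, 0.5
t, h = 30.0, 1e-5

# Central difference of Eq. 15.4 compared with the right-hand side of Eq. 15.3.
numeric = (aml_cumulative(t + h, A, B, C) - aml_cumulative(t - h, A, B, C)) / (2 * h)
analytic = aml_rate(aml_cumulative(t, A, B, C), A, B)
print(numeric, analytic)
```

At t = 0 the curve starts from B/(BC + 1) rather than zero, so C controls how far
along the learning phase the data set begins.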

15.2.3 Rescorla Quadratic and Rescorla Exponential Models


Rescorla (2005) proposed two models: Rescorla Quadratic and Rescorla Exponential.
These models are described in the following sections.

FIGURE 15.2  The basic 3-phase s-shaped model proposed by Alhazmi and Malaiya.


15.2.3.1  Rescorla Quadratic Model


According to this model, the failure rate ω(t) takes a linear form given by:

$$ \omega(t) = Bt + K \tag{15.5} $$

where B and K are constants. On integrating Equation 15.5, we get the cumulative
number of vulnerabilities:

$$ \Omega(t) = \frac{Bt^{2}}{2} + Kt \tag{15.6} $$

Since Ω(t) = 0 at t = 0, the constant of integration is zero.

15.2.3.2  Rescorla Exponential Model


Rescorla used an exponential distribution to fit the vulnerability data, which gives:

$$ \omega(t) = B\lambda e^{-\lambda t} \tag{15.7} $$

where B represents the total number of vulnerabilities in the system and λ is the rate
constant. On integrating Equation 15.7, we get the cumulative number of
vulnerabilities as:

$$ \Omega(t) = B\left(1 - e^{-\lambda t}\right) \tag{15.8} $$
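
The two Rescorla mean value functions behave very differently at large t, which a
short sketch makes visible. The function names and parameter values below are
illustrative assumptions. Note that B is a slope in the quadratic model but the total
number of vulnerabilities in the exponential model.

```python
import math

def rq_cumulative(t, B, K):
    """Equation 15.6: Rescorla quadratic model (B is the slope of the rate)."""
    return B * t * t / 2.0 + K * t

def re_cumulative(t, B, lam):
    """Equation 15.8: Rescorla exponential model (B is the total count)."""
    return B * (1.0 - math.exp(-lam * t))

# Hypothetical parameter values, chosen only for illustration.
for t in (0, 12, 60, 240):
    print(t, rq_cumulative(t, 0.01, 0.5), re_cumulative(t, 50.0, 0.05))
```

The quadratic curve grows without bound, whereas the exponential curve saturates at
B, so the two can only agree over a limited observation window.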

Kapur et al. (2015) applied two SRGMs (the Kapur & Garg (1992) model and the
two-stage Erlang logistic model) to vulnerability data sets and compared their results
with those of the AML model. They claimed that the results are equivalent to those
obtained from the AML model. Shrivastava et al. (2015) used a stochastic differential
equation to develop a stochastic VDM based on the AML model and found that their
model gives better results than the AML model. The formulation of the model
follows.

15.2.4 Vulnerability Discovery Model Using Stochastic Differential Equation

Shrivastava et al. (2015) extended the AML model using a stochastic differential
equation (SDE). Let b(t) be the vulnerability removal rate per remaining vulnerability
in the software, σ the constant magnitude of the irregular fluctuation, and γ(t)
standard Gaussian white noise. Keeping all the assumptions of the AML model and adding
the assumption that the vulnerability discovery process is a stochastic process with a
continuous state space, we get the following differential equation:

$$ \frac{dN(t)}{dt} = b(t)\,[B - N(t)] \tag{15.9} $$

where N(t) is the number of vulnerabilities discovered by time t. Now, assuming
irregular variations in b(t), Equation 15.9 can be extended to the following SDE:

$$ \frac{dN(t)}{dt} = \{b(t) + \sigma\gamma(t)\}\,\{B - N(t)\} \tag{15.10} $$

We extend the previous equation to the following SDE of Itô type:

$$ dN(t) = \left\{b(t) - \tfrac{1}{2}\sigma^{2}\right\}\{B - N(t)\}\,dt + \sigma\,[B - N(t)]\,dW(t) \tag{15.11} $$

where W(t) is a Brownian motion, or Wiener process. After solving Equation 15.11 using
the Itô formula, we get:

$$ N(t) = B - (B - k)\,\exp\left(-\int_{0}^{t} b(s)\,ds - \sigma W(t)\right) \tag{15.12} $$

Therefore, the mean number of vulnerabilities will be:

$$ \Omega(t) = E[N(t)] = B\left[1 - \frac{\left(\frac{B-k}{k}\right) e^{-\left(Bbt - \frac{1}{2}\sigma^{2}t\right)}}{1 + \left(\frac{B-k}{k}\right) e^{-Bbt}}\right] \tag{15.13} $$

Using the previous equation, we can predict the number of vulnerabilities in the
software.
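
The effect of the noise term can be illustrated by sampling Equation 15.12 directly.
The sketch below makes a simplifying assumption that is not in the source: b(t) is
taken as a constant b, so the mean is B − (B − k) exp(−(b − σ²/2) t) rather than the
logistic-rate expression of Equation 15.13. Function names, parameter values, and the
random seed are likewise illustrative.

```python
import math
import random

def sample_N(t, B, k, b, sigma, rng):
    """One draw of N(t) from Eq. 15.12 with constant b; W(t) ~ Normal(0, t)."""
    w = rng.gauss(0.0, math.sqrt(t))
    return B - (B - k) * math.exp(-b * t - sigma * w)

def mean_N(t, B, k, b, sigma):
    """E[N(t)] for constant b, using E[exp(-sigma*W(t))] = exp(sigma^2 * t / 2)."""
    return B - (B - k) * math.exp(-(b - 0.5 * sigma ** 2) * t)

# Hypothetical parameter values, chosen only for illustration.
rng = random.Random(0)
B, k, b, sigma, t = 100.0, 10.0, 0.3, 0.1, 5.0
n = 20000
mc_mean = sum(sample_N(t, B, k, b, sigma, rng) for _ in range(n)) / n
print(mc_mean, mean_N(t, B, k, b, sigma))
```

The Monte Carlo average and the closed-form mean should agree to within sampling
error, which shows how the fluctuation σ slows the expected approach to B.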

15.2.5 Effort-Based Vulnerability Discovery Model


Alhazmi and Malaiya (2008) proposed an effort-based VDM in which the effort E is
measured as follows:

$$ E = \sum_{i=0}^{n} \left(U_i \times P_i\right) \tag{15.14} $$

Here U_i denotes the number of users working on all systems in time period i and
P_i is the percentage of those users using the system under study. Assuming that the
vulnerability detection rate is proportional to the effort and to the remaining number
of vulnerabilities, the effort-based VDM is given as follows:

$$ \Omega(t) = B\left(1 - e^{-\lambda_{vu} E}\right) \tag{15.15} $$

where λ_vu denotes the failure intensity.
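
A small sketch of the effort-based model follows. It assumes that Equation 15.14
accumulates the product of U_i and P_i over periods, with P_i expressed as a fraction
between 0 and 1. The monthly user counts, market shares, and parameter values are
invented for illustration.

```python
import math

def effort(users, shares):
    """Equation 15.14 read as the sum of U_i * P_i (P_i as a fraction)."""
    return sum(u * p for u, p in zip(users, shares))

def ame_cumulative(E, B, lam_vu):
    """Equation 15.15: cumulative vulnerabilities as a function of effort E."""
    return B * (1.0 - math.exp(-lam_vu * E))

# Hypothetical data: total users (millions) and share of the system per period.
users = [100.0, 120.0, 150.0, 180.0]
shares = [0.05, 0.10, 0.20, 0.25]

E = effort(users, shares)
print(E, ame_cumulative(E, B=80.0, lam_vu=0.02))
```

Because the independent variable is accumulated usage rather than calendar time, a
product with few users accrues effort slowly and its curve stays flat for longer.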


15.2.6 User-Dependent Vulnerability Discovery Model


Kansal et al. (2017a) developed a VDM that accounts for the number of users, assuming
that vulnerability discovery depends on the reporting done by the users who buy the
software. The notation used in this section, apart from that described previously, is
as follows:

Notation   Description
S̄          Actual number of software buyers
S(t)       Cumulative number of potential software users at time t

The vulnerability intensity defined by Kansal et al. (2017a) is given as:

$$ \frac{d\Omega}{dt} = \left(\frac{d\Omega}{dI}\right)\cdot\left(\frac{dI}{dS}\right)\cdot\left(\frac{dS}{dt}\right) \tag{15.16} $$

The three components on the right-hand side of Equation 15.16 are described using
the following assumptions:

1. The vulnerability discovery rate is dependent on the number of instructions
executed, which is represented mathematically as:

$$ \frac{d\Omega}{dI} = \left(x + y\cdot\frac{\Omega}{B}\right)\cdot(B - \Omega) \tag{15.17} $$

where:
(B − Ω) are the remaining vulnerabilities residing in the software
x is the rate with which unique vulnerabilities are detected
y is the rate with which the dependent vulnerabilities are detected through the
support rate of Ω/B.

2. The number of instructions executed by every user is constant, which is
given as:

$$ \frac{dI}{dS} = k \tag{15.18} $$

3. The rate at which people buy the software is given by:

$$ \frac{dS}{dt} = \left(\alpha + \beta\cdot\frac{S}{\bar{S}}\right)\cdot\left(\bar{S} - S\right) \tag{15.19} $$

where:
(S̄ − S) is the remaining number of users who have yet to buy the software
α and β are the rates at which innovators and imitators, respectively, buy the software

After solving Equation 15.19 with the initial condition S(0) = 0, we get:

$$ S(t) = \bar{S}\cdot\frac{1 - \exp\left(-(\alpha+\beta)\cdot t\right)}{1 + \left(\frac{\beta}{\alpha}\right)\cdot\exp\left(-(\alpha+\beta)\cdot t\right)} \tag{15.20} $$

Now from Equations 15.17 through 15.20, the vulnerability discovery rate becomes:

$$ \frac{d\Omega}{dt} = \left(x + y\cdot\frac{\Omega}{B}\right)\cdot(B - \Omega)\cdot k\cdot\frac{dS}{dt} \tag{15.21} $$

On solving Equation 15.21 with the initial condition Ω = 0 at S = 0, we get:

$$ \Omega(t) = B\cdot\frac{\left(1 + h\cdot\exp\left(-(x+y)\cdot S(t)\right)\right)^{k} - (1+h)\cdot\exp\left(-(x+y)\cdot S(t)\cdot k\right)}{\left(1 + h\cdot\exp\left(-(x+y)\cdot S(t)\right)\right)^{k}} \tag{15.22} $$

where h = x/y and A = x + y.
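
The chain of assumptions above can be followed numerically. The sketch below is a
hedged illustration: it evaluates the buyer curve of Equation 15.20 and then
integrates dΩ/dS = (dΩ/dI)(dI/dS) from Equations 15.17 and 15.18 with a classical
fourth-order Runge-Kutta scheme, rather than relying on a closed form. Function names
and all parameter values are assumptions.

```python
import math

def buyers(t, S_bar, alpha, beta):
    """Equation 15.20: cumulative buyers at time t."""
    e = math.exp(-(alpha + beta) * t)
    return S_bar * (1.0 - e) / (1.0 + (beta / alpha) * e)

def d_omega_d_S(omega, B, x, y, k):
    """Equations 15.17 and 15.18 combined: dΩ/dS = (dΩ/dI) * (dI/dS)."""
    return (x + y * omega / B) * (B - omega) * k

def omega_at(t, B, x, y, k, S_bar, alpha, beta, steps=2000):
    """Integrate dΩ/dS from S = 0 to S(t) with RK4, starting from Ω = 0."""
    S_end = buyers(t, S_bar, alpha, beta)
    h = S_end / steps
    omega = 0.0
    for _ in range(steps):
        k1 = d_omega_d_S(omega, B, x, y, k)
        k2 = d_omega_d_S(omega + 0.5 * h * k1, B, x, y, k)
        k3 = d_omega_d_S(omega + 0.5 * h * k2, B, x, y, k)
        k4 = d_omega_d_S(omega + h * k3, B, x, y, k)
        omega += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return omega

# Hypothetical parameter values, chosen only for illustration.
for t in (5.0, 20.0, 60.0):
    print(t, buyers(t, 1000.0, 0.01, 0.3),
          omega_at(t, 50.0, 0.0005, 0.002, 1.0, 1000.0, 0.01, 0.3))
```

Integrating over S rather than t makes the user dependence explicit: discovery stalls
when adoption stalls, regardless of how much calendar time passes.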

15.2.7 Vulnerability Discovery Model for Open and Closed Source


Sharma et al. (2016) proposed a VDM using a Gamma distribution function and
claimed that their model has better prediction capabilities for open and closed source
software. The failure density function for the Gamma distribution is given as:

$$ f(t) = \frac{1}{\Gamma(\alpha)\,\beta}\left(\frac{t}{\beta}\right)^{\alpha-1} e^{-t/\beta};\quad t \ge 0,\ \alpha, \beta > 0 \tag{15.23} $$

where α and β denote the shape and scale parameters, respectively. The cumulative
distribution function of the Gamma distribution, used for vulnerability prediction, is
given by:

$$ F(t;\alpha,\beta) = \int_{0}^{t} f(u;\alpha,\beta)\,du = \frac{\gamma(\alpha,\ t/\beta)}{\Gamma(\alpha)} \tag{15.24} $$

where γ(·, ·) is the lower incomplete gamma function. So,

$$ \Omega(t) = B \cdot F(t;\alpha,\beta) \tag{15.25} $$
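
A minimal sketch of Equation 15.25 follows, assuming β is a scale parameter as in
Equation 15.23, so that the Gamma distribution function is the regularized lower
incomplete gamma function evaluated at t/β. The series used below is the standard
power series for that function; the function names and parameter values are
illustrative assumptions.

```python
import math

def regularized_lower_gamma(a, x, max_terms=1000):
    """P(a, x) = gamma(a, x) / Gamma(a) via its power series (a > 0, x >= 0)."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def gamma_vdm(t, B, alpha, beta):
    """Equation 15.25 with F(t; alpha, beta) = P(alpha, t / beta)."""
    return B * regularized_lower_gamma(alpha, t / beta)

# Hypothetical parameter values, chosen only for illustration.
for t in (1.0, 5.0, 20.0, 60.0):
    print(t, gamma_vdm(t, B=100.0, alpha=2.5, beta=6.0))
```

With α = 1 the curve reduces to the exponential form B(1 − exp(−t/β)), while larger
α delays the peak of the discovery rate and produces an S-shaped curve.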

15.2.8 Coverage Based Vulnerability Discovery Modeling


Kansal et al. (2018) proposed a coverage-based VDM in which they assumed that the
vulnerability discovery rate is defined by:

$$ \frac{d\Omega}{dt} = \left(\frac{d\Omega}{dC}\right)\cdot\left(\frac{dC}{dI}\right)\cdot\left(\frac{dI}{dX}\right)\cdot\left(\frac{dX}{dt}\right) \tag{15.26} $$

where C, I, and X are the operational coverage, executed instructions, and
operational effort, respectively. The four components on the right-hand side are
defined as:

1. Component 1: The vulnerability discovery rate is assumed to be directly
proportional to the operational coverage rate of the remaining vulnerabilities and
inversely proportional to the uncovered proportion of the software, and is given by:

$$ \frac{d\Omega}{dC} = A_1\cdot\left(\frac{c'}{p - c}\right)\cdot(B - \Omega) \tag{15.27} $$

where c is the coverage rate.

2. Component 2: The coverage rate with respect to the number of instructions
executed is considered constant and is given by:

$$ \frac{dC}{dI} = \phi_1 \tag{15.28} $$

3. Component 3: The rate at which instructions are executed per unit of operational
effort is assumed to be constant and is given by:

$$ \frac{dI}{dX} = \phi_2 \tag{15.29} $$

4. Component 4: The rate of operational effort is directly proportional to the
remaining resources, where vulnerability discoverers and time are the resources
considered as operational effort spent on vulnerability discovery, and is given by:

$$ \frac{dX}{dt} = \beta(t)\cdot\left(\alpha - X(t)\right) \tag{15.30} $$

where β(t) is the time-dependent rate at which operational resources are
consumed and α is the total amount of effort required for vulnerability discovery.
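
The four components can be combined numerically before any closed form is derived.
The sketch below rests on several assumptions that are not in the source: β(t) is a
constant, c′ is read as dc/dX, and the coverage function is taken as
c(X) = p(1 − exp(−A2·X)). It integrates the product in Equation 15.26 with a classical
Runge-Kutta scheme and compares the result with the closed form stated in
Equation 15.32. All names and values are illustrative.

```python
import math

# Hypothetical parameter values, chosen only for illustration.
A1, PHI1, PHI2 = 0.8, 1.0, 1.2   # constants of Eqs. 15.27 to 15.29
B, P = 60.0, 1.0                 # total vulnerabilities; p in Eq. 15.27
ALPHA, BETA = 10.0, 0.15         # total effort; constant consumption rate
A2 = 0.3                         # rate of the assumed coverage function

def coverage(X):
    """Assumed coverage function c(X) = p * (1 - exp(-A2 * X)), so c(0) = 0."""
    return P * (1.0 - math.exp(-A2 * X))

def coverage_slope(X):
    """c'(X) = dc/dX for the assumed coverage function."""
    return P * A2 * math.exp(-A2 * X)

def derivs(X, omega):
    """dX/dt from Eq. 15.30 and dΩ/dt as the product in Eq. 15.26."""
    dX = BETA * (ALPHA - X)
    d_omega = (A1 * coverage_slope(X) / (P - coverage(X)) * (B - omega)
               * PHI1 * PHI2 * dX)
    return dX, d_omega

def integrate(t_end, steps=4000):
    """Classical RK4 from X(0) = 0 and Ω(0) = 0 up to t_end."""
    h = t_end / steps
    X, omega = 0.0, 0.0
    for _ in range(steps):
        k1 = derivs(X, omega)
        k2 = derivs(X + 0.5 * h * k1[0], omega + 0.5 * h * k1[1])
        k3 = derivs(X + 0.5 * h * k2[0], omega + 0.5 * h * k2[1])
        k4 = derivs(X + h * k3[0], omega + h * k3[1])
        X += h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
        omega += h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
    return X, omega

def closed_form(X):
    """Equation 15.32 evaluated at effort level X."""
    return B * (1.0 - (1.0 - coverage(X) / P) ** (A1 * PHI1 * PHI2))

X_end, omega_end = integrate(20.0)
print(X_end, omega_end, closed_form(X_end))
```

Agreement between the integrated value and the closed form indicates that, under
these assumptions, the chain rule in Equation 15.26 and the solution in
Equation 15.32 describe the same curve.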

Using Equations 15.27 through 15.30, from Equation 15.26 we have:

$$ \frac{d\Omega}{dt} = \left[A_1\cdot\left(\frac{c'}{p - c}\right)\cdot(B - \Omega)\right]\cdot\phi_1\cdot\phi_2\cdot\frac{dX}{dt} \tag{15.31} $$

On solving Equation 15.31 under Ω(0) = c(0) = 0, we get:

$$ \Omega(X(t)) = B\cdot\left[1 - \left(1 - \frac{c(X(t))}{p}\right)^{A_1\cdot\phi_1\cdot\phi_2}\right] \tag{15.32} $$

In the model described above, the authors considered various effort functions X(t) to
obtain the final model for vulnerability prediction; they used the Weibull and the
logistic effort functions. They also considered various operational coverage
functions. For example, if the operational effort is assumed to follow the Weibull
distribution, the mean value function (MVF), that is, the VDM, becomes:

$$ \Omega(t) = B\cdot\left[1 - \left(e^{-A_2\cdot\left(\alpha\cdot\left(1 - e^{-\beta\cdot t^{k}}\right)\right)^{h}}\right)^{A_1\cdot\phi_1\cdot\phi_2}\right] \tag{15.33} $$

15.2.9 Vulnerability Patching Model


VDM predicts the cumulative number of vulnerabilities against calendar time, soft-
ware buyers, operational coverage, and operational effort. In contrast, vulnerability
patch modeling (VPM) observes the cumulative number of patches with calendar time
only. This  research attempts to predict the vulnerability discovery rate, number of
vulnerabilities discovered, vulnerability patching rate, and number of vulnerabilities
patched. It provides an idea for quantifying security risks in terms of vulnerabilities.
These quantitative models may help in understanding the behavior of software vulner-
abilities under different assumptions. One of the major assumptions is that the failures
are caused randomly. The tools of vulnerability modeling known as VDMs and VPMs
may help consumers to estimate and predict the system risk, lower the patch develop-
ment cost and time, increase productivity, and reduce exploitability. In this section, we
will discuss the VPM (Kansal et al. [2016a, 2016b]) that determines the intensity with
which discovered vulnerabilities are fixed or patched. It is assumed that the developed
vulnerability model follows the NHPP properties to fulfill one of the considerations
that the vulnerabilities are successfully removed when patches are applied.
412 Reliability Engineering

15.2.9.1  One-Dimension Vulnerability Patching Model


The notation used in this section, apart from that described previously, is as follows:

Notation   Description
ρ(r)       Expected number of patches released with respect to patching resources r
A          Vulnerability patching rate
r          Patching resources
t          Patching time or patch release time
v          Vulnerabilities reported/discovered
d          Vulnerabilities disclosed
B          Actual potential number of patches released
C          Integration constant
Δ, δ       Intermediate variables

This model focuses on determining the number of patches successfully released or
installed over time. It comprises three components: directly patched vulnerabilities,
indirectly patched vulnerabilities, and unsuccessfully patched vulnerabilities.
The first component covers patches that vendors release without customer involvement
(the vulnerabilities are discovered directly by the development team and no beta
customers are involved). In this case, vendors/developers are free to use the maximum
resources because managers face no external pressure, so the probability of success of
these patches is taken as 1. The second component covers patches that are developed
and released in response to vulnerability reports. These patches are developed under
pressure and resource constraints, so they may fail. The probability of success of
these patches is therefore denoted as (1 − σ), where σ represents the unsuccessful
patching rate, which forms the third component. This last component covers patches
that are unavailable, especially for zero-day vulnerabilities, for which the
probability of failure is highest and is denoted as σ.
Mathematically, the model can be presented as:

$$ \frac{d\rho}{dt} = A\cdot(B - \rho) + C\cdot(1 - \sigma)\cdot\frac{\rho}{B}\cdot(B - \rho) - \sigma\cdot\rho \tag{15.34} $$

where A represents the proportion of patches that are released or installed
successfully without disruption, while C represents the proportion of patches that are
released because of the vulnerability reports submitted to vendors.
Solving the above equation under the initial condition ρ(t = 0) = 0, we get:

$$ \rho(t) = \frac{\tilde{B}\cdot\left[1 - e^{-(\tilde{A} + \tilde{C})\cdot t}\right]}{1 + \frac{\tilde{C}}{\tilde{A}}\cdot e^{-(\tilde{A} + \tilde{C})\cdot t}} \tag{15.35} $$

where

$$ \tilde{B} = \frac{B\cdot(\Delta + \delta)}{2\cdot C\cdot(1 - \sigma)},\quad \tilde{A} = \frac{\Delta - \delta}{2},\quad \tilde{C} = \frac{\Delta + \delta}{2},\quad \delta = C\cdot(1 - \sigma) - A - \sigma,\quad \Delta = \sqrt{\delta^{2} + 4\cdot C\cdot(1 - \sigma)\cdot A} $$

and Δ, δ, Ã, B̃, and C̃ are the notations used for intermediate
variables.
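
The closed form of the one-dimensional VPM can be checked against a direct numerical
integration of Equation 15.34. In the sketch below the intermediate variables are
written out as δ = C(1 − σ) − A − σ and Δ = sqrt(δ² + 4C(1 − σ)A), which is the form
that makes the closed form satisfy the differential equation; function names and the
parameter values are illustrative assumptions.

```python
import math

def patch_rate(rho, A, B, C, sigma):
    """Right-hand side of Equation 15.34."""
    return A * (B - rho) + C * (1.0 - sigma) * (rho / B) * (B - rho) - sigma * rho

def patches_closed_form(t, A, B, C, sigma):
    """Equation 15.35 with the intermediate variables written out."""
    c = C * (1.0 - sigma)
    delta = c - A - sigma
    big_delta = math.sqrt(delta ** 2 + 4.0 * c * A)
    B_t = B * (big_delta + delta) / (2.0 * c)
    A_t = (big_delta - delta) / 2.0
    C_t = (big_delta + delta) / 2.0
    e = math.exp(-(A_t + C_t) * t)
    return B_t * (1.0 - e) / (1.0 + (C_t / A_t) * e)

def patches_numeric(t_end, A, B, C, sigma, steps=2000):
    """Classical RK4 integration of Equation 15.34 from rho(0) = 0."""
    h = t_end / steps
    rho = 0.0
    for _ in range(steps):
        k1 = patch_rate(rho, A, B, C, sigma)
        k2 = patch_rate(rho + 0.5 * h * k1, A, B, C, sigma)
        k3 = patch_rate(rho + 0.5 * h * k2, A, B, C, sigma)
        k4 = patch_rate(rho + h * k3, A, B, C, sigma)
        rho += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return rho

# Hypothetical parameter values, chosen only for illustration.
A, B, C, sigma = 0.05, 100.0, 0.4, 0.1
for t in (2.0, 10.0, 40.0):
    print(t, patches_closed_form(t, A, B, C, sigma), patches_numeric(t, A, B, C, sigma))
```

Because of the −σρ term, the curve saturates at B̃, which is below B whenever σ > 0:
some patches never succeed.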

15.2.9.2  Two-Dimensional Vulnerability Patch Modeling


Kansal and Kapur (2019) proposed a two-dimensional VPM that considers the relation
between the number of vulnerabilities discovered and the number of patches released.
They used the Cobb-Douglas production function to relate the dependent (output) and
independent (input) variables, defined as:

$$ r \cong v^{\alpha}\cdot t^{1-\alpha},\quad 0 \le \alpha \le 1 \tag{15.36} $$

where r refers to the patching resources, v to the quantifiable vulnerabilities, t to
the patching time, and α to the degree of impact on the vulnerability patching
process. The model development is similar to that in Section 15.2.9.1; the only change
is to replace t with r, which gives the final equation:

$$ \rho(r) = \rho(v, t) = \frac{\tilde{B}\cdot\left[1 - e^{-(\tilde{A} + \tilde{C})\cdot v^{\alpha}\cdot t^{1-\alpha}}\right]}{1 + \frac{\tilde{C}}{\tilde{A}}\cdot e^{-(\tilde{A} + \tilde{C})\cdot v^{\alpha}\cdot t^{1-\alpha}}} \tag{15.37} $$

with Ã, B̃, and C̃ the intermediate variables of the one-dimensional model.

15.2.10 Vulnerability Discovery and Patching Model


In the operational phase, two processes, vulnerability discovery and patching, occur
simultaneously. Vulnerability discovery is carried out by software users, while
patching is carried out by the software developers. Here, the vulnerability discovery
process is described by the distribution V(t) and the vulnerability removal process
(i.e., the patching process) by P(t). The intensity with which vulnerabilities are
discovered is calculated as:

$$ \frac{d\Omega(t)}{dt} = \frac{v(t)}{1 - V(t)}\,\left[B - \Omega(t)\right] \tag{15.38} $$

where Ω(t) is the number of vulnerabilities expected to be discovered until time t,
v(t)/(1 − V(t)) is the vulnerability discovery rate, dV(t)/dt = v(t), and B is the
potential number of discovered vulnerabilities. Solving Equation 15.38 under the
initial condition Ω(t = 0) = 0, we get:

$$ \Omega(t) = B\cdot V(t) \tag{15.39} $$

After accounting for the number of vulnerabilities discovered, the next step for
developers is to develop patches. Hence, we have considered the vulnerability patching
time in our model together with the vulnerability discovery process. The intensity
with which discovered vulnerabilities are patched can be calculated as:

$$ \frac{d\rho(t)}{dt} = \frac{[v * p](t)}{1 - [V \otimes P](t)}\,\left[B - \rho(t)\right] \tag{15.40} $$

where

$$ \frac{[v * p](t)}{1 - [V \otimes P](t)} \tag{15.41} $$

is the vulnerability patching rate, wherein d[V ⊗ P](t)/dt = [v ∗ p](t).
The symbol [v ∗ p](t) denotes the convolution of v and p, and [V ⊗ P](t) denotes the
Stieltjes convolution of V and P. Solving Equation 15.40 under the initial condition
ρ(t = 0) = 0, we get:

$$ \rho(t) = B\cdot(V \otimes P)(t) \tag{15.42} $$

where B is the potential number of patched vulnerabilities and ρ(t) is the number of
vulnerabilities expected to be patched by time t. Equations 15.40 and 15.42 describe
the generalized modeling approach, in which the first step is vulnerability discovery
and the second step is vulnerability patching. Here, we have considered that V(t)
follows the exponential distribution, which is represented as:

$$ V(t) = 1 - \exp(-A\cdot t) \tag{15.43} $$

where A represents the vulnerability discovery rate. The cumulative vulnerability
discovery model as proposed by Rescorla et al. (2002) can be derived from
Equations 15.39 and 15.43. The mean value function becomes:

$$ \Omega(t) = B\cdot\left(1 - \exp(-A\cdot t)\right) $$

Subsequently, the VPM is formulated with P(t) represented by the logistic distribution
function, since patching is a more complex process than discovery:

$$ P(t) = \frac{1 - \exp(-A\cdot t)}{1 + C\cdot\exp(-A\cdot t)} \tag{15.44} $$

where A represents the vulnerability patching rate with learning and C represents the
shape parameter.
To obtain a simple mathematical form for the proposed model, we have assumed that the
discovery rate A in Equation 15.43 is the same as the patching rate with learning in
Equation 15.44. In other words, we have considered that the discovery rate and the
patching rate are the same.

The Stieltjes convolution in Equation 15.42 is calculated as:

$$ V(t) \otimes P(t) = \int_{0}^{t} P(t - x)\,dV(x) \tag{15.45} $$

Equation 15.45 captures the time delay between vulnerability discovery and the
patching process, where the vulnerability discovery time is denoted as x and the
vulnerability patching time as t − x. The model also shows that the numbers of
vulnerabilities discovered and patched need not be the same at any given time,
although they may become similar as time tends to infinity.
Thus, from Equations 15.43 and 15.44, Equation 15.45 can be rewritten as:

$$ V(t) \otimes P(t) = \int_{0}^{t} \left[\frac{1 - \exp\left(-A\cdot(t - x)\right)}{1 + C\cdot\exp\left(-A\cdot(t - x)\right)}\right]\cdot\left(A\cdot\exp(-A\cdot x)\right)\,dx \tag{15.46} $$

On solving Equation 15.46, we get:

$$ \rho(t) = B\cdot\left[1 - \exp(-A\cdot t) + (1 + C)\cdot\exp(-A\cdot t)\,\ln\left(\frac{(1 + C)\cdot\exp(-A\cdot t)}{1 + C\cdot\exp(-A\cdot t)}\right)\right] \tag{15.47} $$

Equation 15.47 is then used for predicting the number of vulnerabilities discovered
and patched.
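
The step from Equation 15.46 to Equation 15.47 can be checked by evaluating the
convolution integral numerically. The sketch below uses composite Simpson's rule; the
function names and the values of A and C are illustrative assumptions.

```python
import math

def discovery_density(x, A):
    """v(x) = dV/dx for the exponential V(t) of Equation 15.43."""
    return A * math.exp(-A * x)

def patch_cdf(t, A, C):
    """Equation 15.44: logistic patching distribution P(t)."""
    return (1.0 - math.exp(-A * t)) / (1.0 + C * math.exp(-A * t))

def convolution_numeric(t, A, C, n=2000):
    """Equation 15.46 by composite Simpson's rule (n must be even)."""
    h = t / n
    def f(x):
        return patch_cdf(t - x, A, C) * discovery_density(x, A)
    total = f(0.0) + f(t)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(i * h)
    return total * h / 3.0

def convolution_closed(t, A, C):
    """The bracketed term of Equation 15.47, i.e. rho(t) / B."""
    e = math.exp(-A * t)
    return 1.0 - e + (1.0 + C) * e * math.log((1.0 + C) * e / (1.0 + C * e))

# Hypothetical parameter values, chosen only for illustration.
A, C = 0.2, 3.0
for t in (1.0, 10.0, 40.0):
    print(t, convolution_numeric(t, A, C), convolution_closed(t, A, C))
```

The patched fraction always lies below the discovered fraction V(t) at the same time,
which reflects the delay between discovery and patch release.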

15.3 VULNERABILITY DISCOVERY IN MULTI-VERSION SOFTWARE SYSTEMS

Software is generally released in several versions, each upgrading the previous one by
adding new functionality. These upgrades add advanced features and provide a better
user experience. As no software is free of bugs, vulnerabilities continue to be
discovered in each version. Vulnerability discovery in multiple versions was modeled
by Kim et al. (2007) using the AML model, under the assumption that a new version is
developed by keeping the previous functionality and adding new features to the base
code. Even if the base code is reduced in the new version, a vulnerability found in
the common code is counted in the older version only when predicting the number of
vulnerabilities of each version (see Figure 15.3).

FIGURE 15.3  Multi-version software vulnerability discovery model.

The cumulative number of vulnerabilities Ω(t) in each version of software is
given by:

$$ \Omega(t) = \frac{B}{B C e^{-ABt} + 1} + \alpha\,\frac{B'}{B' C' e^{-A'B'(t - \varepsilon)} + 1} \tag{15.48} $$

where the parameter α indicates shared components such as shared code and shared
functionality, and ε denotes the time lag between the release dates of the two
versions. Equation 15.48 is referred to as the multi-version vulnerability discovery
model (MVDM). It can be generalized to n versions as:

$$ \Omega(t) = \sum_{i=1}^{n} \alpha_i\,\frac{B_i'}{B_i' C_i' e^{-A_i' B_i'(t - \varepsilon_i)} + 1} \tag{15.49} $$


Following the assumptions of Kim et  al. (2007), Anand et  al. (2017) developed a
framework for predicting the number of vulnerabilities in multi-versions of software
and proposed a similar model and showed that the results are equivalent to those
obtained from the model proposed by Kim et al. (2007).
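
A minimal sketch of the MVDM follows. It treats each version as an AML curve shifted
by its release lag and weighted by its shared-code fraction, with the first version
given α = 1 and ε = 0 so that the two-version case matches Equation 15.48. The tuple
layout and all parameter values are illustrative assumptions.

```python
import math

def aml(t, A, B, C):
    """Single-version AML curve (Equation 15.4)."""
    return B / (B * C * math.exp(-A * B * t) + 1.0)

def mvdm(t, versions):
    """Equation 15.49; versions is a list of (alpha, A, B, C, eps) tuples."""
    return sum(alpha * aml(t - eps, A, B, C) for alpha, A, B, C, eps in versions)

# Hypothetical parameter values, chosen only for illustration.
versions = [
    (1.0, 0.002, 100.0, 0.5, 0.0),   # first version: alpha = 1, no release lag
    (0.6, 0.003, 80.0, 0.8, 12.0),   # second version, released 12 periods later
]
for t in (6.0, 24.0, 60.0):
    print(t, mvdm(t, versions))
```

Each later release adds a second S-shaped wave on top of the first, so the combined
curve can show more than one growth phase.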

15.3.1 User Dependent Multi-version Vulnerability Discovery Modeling


Narang et al. (2017) developed a user-dependent model to predict the number of
vulnerabilities in multiple versions of software. Their multi-version model resembles
the effort-based multi-version framework developed in the software reliability
literature for predicting the number of faults in multiple releases of software
(Kapur et al. 2011). Let t1 and t2 be the release times of the first and second
versions, respectively. The cumulative number of vulnerabilities detected in the first
version of software is given by:

$$ \Omega_1(S_1(t)) = B_1\cdot F_1(S_1(t)) \tag{15.50} $$

where F1(S1(t)) represents the user-dependent vulnerability discovery function.
For predicting the number of vulnerabilities in the next version, Narang et al. (2017)
assumed that vulnerabilities of the previous version that are removed in the current
version should be counted in the newer version. The mathematical form for the next
version is given by:

$$ \Omega_2(S_2(t)) = B_2\cdot F_2(S_2(t - t_1)) + B_1\left(1 - F_1(S_1(t_1))\right)\cdot F_2(S_2(t - t_1)) \tag{15.51} $$

where B1(1 − F1(S1(t1))) is the number of leftover vulnerabilities of the first
version, and F1(S1(t)) and F2(S2(t)) are the vulnerability discovery functions of the
older and newer versions.
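
Equations 15.50 and 15.51 can be illustrated with concrete choices for the user
curves and discovery functions. Both choices below are assumptions made only for this
sketch and are not taken from the source: the user curves S1 and S2 follow the form of
Equation 15.20, and the discovery functions are exponential in the number of users.

```python
import math

def user_curve(t, S_bar, alpha, beta):
    """Cumulative users at time t (form of Equation 15.20); zero before release."""
    if t <= 0.0:
        return 0.0
    e = math.exp(-(alpha + beta) * t)
    return S_bar * (1.0 - e) / (1.0 + (beta / alpha) * e)

def discovery_fraction(S, rate):
    """Hypothetical discovery function F(S) = 1 - exp(-rate * S)."""
    return 1.0 - math.exp(-rate * S)

def omega_v1(t, B1, S1, F1):
    """Equation 15.50: vulnerabilities found in the first version."""
    return B1 * F1(S1(t))

def omega_v2(t, t1, B1, B2, S1, S2, F1, F2):
    """Equation 15.51: second version, including leftovers from the first."""
    leftover = B1 * (1.0 - F1(S1(t1)))
    return (B2 + leftover) * F2(S2(t - t1))

# Hypothetical parameter values, chosen only for illustration.
B1, B2, t1 = 60.0, 40.0, 12.0
S1 = lambda t: user_curve(t, 1000.0, 0.02, 0.4)
S2 = lambda t: user_curve(t, 1500.0, 0.03, 0.5)
F1 = lambda S: discovery_fraction(S, 0.003)
F2 = lambda S: discovery_fraction(S, 0.002)

for t in (12.0, 18.0, 36.0):
    print(t, omega_v1(min(t, t1), B1, S1, F1), omega_v2(t, t1, B1, B2, S1, S2, F1, F2))
```

The second version starts from zero at its release time t1 and converges to B2 plus
the vulnerabilities left undiscovered in the first version.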

15.4  CONCLUSION AND FUTURE DIRECTIONS


In this chapter, we have discussed various vulnerability discovery models proposed
in the literature under different sets of assumptions. We have tried to cover all
the major contributions to the quantitative assessment of software security. This
field is attracting growing interest from researchers in software engineering because
of its sensitivity and its significance to real-life activities that depend on the
smooth functioning of software systems. In a vulnerability data set, irrespective of
software version, newer versions of software may share some vulnerabilities with
previous versions. With this in mind, a few researchers have applied VDMs to predict
the number of vulnerabilities in multiple versions of software. The literature on
software patching models is also presented in this chapter. In the future, researchers
may classify vulnerabilities by version to check and compare improvements in
vulnerability discovery rates. Further, the impact of incentives such as bug bounties
on vulnerability discovery processes can be analyzed. Owing to the lack of information
related to patching, we were not able to explicitly calculate the total number of
patched vulnerabilities with respect to effort; this remains an important question for
future research. In this chapter, we have not covered the cost models proposed in the
literature for determining optimal vulnerability discovery and patch release times.
This is another area of software security that is still in its initial phase.
Vulnerability prioritization is a further direction for researchers; a few attempts
have been made, but this research is also in its early stages. Prioritizing
vulnerabilities with respect to categorization could be another very interesting
area.

REFERENCES
Alhazmi, O. (2007). Assessing vulnerabilities in software systems: A quantitative approach.
Thesis, Colorado State University.
Alhazmi, O.H.,  & Malaiya, Y.K. (2005a). Modeling the vulnerability discovery pro-
cess. In  16th IEEE International Symposium on Software Reliability Engineering
(ISSRE’05) (pp. 10–pp). IEEE.
Alhazmi, O.H., & Malaiya, Y.K. (2005b). Quantitative vulnerability assessment of systems
software. IEEE, pp. 615–620.
Alhazmi, O.H.,  & Malaiya, Y.K. (2008). Application of vulnerability discovery models to
major operating systems. IEEE Transactions on Reliability, 57(1), 14–22.
Anand, A., Das, S., Aggrawal, D., & Klochkov, Y. (2017). Vulnerability discovery modelling
for software with multi-versions. In Advances in Reliability and System Engineering
(pp. 255–265). Cham, Switzerland: Springer International Publishing.
Anderson, R. (2002). Security in open versus closed systems: The dance of Boltzmann, Coase
and Moore. Technical report, Cambridge University.
Bass, F.M. (1969). A new-product growth model for consumer durables. Management Science,
15, 215–227.
Joh, H., & Malaiya, Y.K. (2014). Modeling skewness in vulnerability discovery. Quality
and Reliability Engineering International, 30(8), 1445–1459.
Kansal, Y., & Kapur P.K. (2019). Two-dimensional vulnerability patching model. In: Kapur,
P., Klochkov, Y., Verma, A., Singh, G. (Eds.), System Performance and Management
Analytics: Asset Analytics (Performance and Safety Management) (pp. 321–331).
Singapore: Springer.
Kansal, Y., Kapur, P.K., & Kumar, U. (2018). Coverage based vulnerability discovery model-
ing to optimize disclosure time using multi-attribute approach. Quality and Reliability
Engineering International, 35(1), 62–73. doi:10.1002/qre.2380.
Kansal, Y., Kapur, P.K., Kumar, U.,  & Kumar, D. (2017a). User-dependent vulnerability
discovery model and its interdisciplinary nature. International Journal of Life Cycle
Reliability and Safety Engineering, 6(1), 23–29.
Kansal, Y., Kapur, P.K., Kumar, U.,  & Kumar, D. (2017b). Effort and coverage dependent
vulnerability discovery modeling In: IEEE Xplore, International Conference on
Telecommunication and Networking (TELNET), Noida.
Kansal, Y., Kumar, D., & Kapur, P.K. (2016a). Assessing optimal patch release time for vul-
nerable software systems. In  IEEE Xplore, International Conference on Innovation
and Challenges in Cyber Security (ICICCS-INBUSH), Noida, pp. 308–314.
Kansal, Y., Kumar, D., & Kapur, P.K. (2016b). Vulnerability patch modeling. International
Journal of Reliability, Quality and Safety Engineering, 23(6), 1640013.
Kapur, P.K., & Garg, R.B. (1992). A software reliability growth model for an error-removal
phenomenon. Software Engineering Journal, 7(4), 291–294.
Kapur, P.K., Pham, H., Gupta, A., & Jha, P.C. (2011). Software Reliability Assessment with
OR Applications. London, UK: Springer.
Kapur, P.K., Yadavalli, V.S.S., & Shrivastava, A.K. (2015). A comparative study of vulnerabil-
ity discovery modeling and software reliability growth modeling. In The IEEE Xplore
Proceedings of International Conference on Futuristic Trends in Computational
Analysis and Knowledge Management, Amity University, Greater Noida, February
25–27, pp. 246–251.
Kim, J., Malaiya, Y.K., & Ray, I. (2007). Vulnerability discovery in multi-version software
systems. In  10th IEEE High Assurance Systems Engineering Symposium. HASE’07,
pp. 141–148.
Kimura, M. (2006). Software vulnerability: Definition, modelling, and practical evaluation
for e-mail transfer software. International Journal of Pressure Vessels and Piping,
83(4), 256–261.
Massacci, F., & Nguyen, V.H. (2014). An empirical methodology to evaluate vulnerability
discovery models. IEEE Transactions on Software Engineering, 40(12), 1147–1162.
Movahedi, Y., Cukier, M., Andongabo, A.,  & Gashi, I. (2018). Cluster-based vulnerability
assessment of operating systems and web browsers. Computing, 1–22. doi:10.1007/
s00607-018-0663-0.
Narang, S., Kapur, P.K., Damodaran, D., & Shrivastava, A.K. (2017). User-based multi-upgra-
dation vulnerability discovery model. In 6th International Conference on Reliability,
Infocom Technologies and Optimization (Icrito 2017) (Trends and Future directions) to
be held during September 20–22, 2017, Amity University Uttar Pradesh.
Narang, S., Kapur, P.K., Damodaran, D., & Shrivastava, A.K. (2018). Bi-criterion problem to
determine optimal vulnerability discovery and patching time. International Journal of
Quality Reliability and Safety Engineering, 25(1), 1850002.
Vulnerability Discovery and Patch Modeling 419

Okamura, H., Tokuzane, M., & Dohi, T. (2013). Quantitative security evaluation for software
system from vulnerability database. International Journal of Software Engineering &
Applications, 6(3), 15.
Ozment, A., & Schechter, S.E. (2006). Milk or wine: Does software security improve with
age? Proceedings of the 15th Conference on Usenix Security Symposium, Berkeley,
CA.
Ozment, J.A. (2007). Vulnerability discovery & software security. PhD thesis, University of
Cambridge.
Rescorla, E. (2003). Security holes. Who cares? In Proceedings of the 12th Conference on
USENIX Security Symposium, pp. 75–90.
Rescorla, E. (2005). Is finding security holes a good idea? IEEE Security & Privacy, 3(1),
14–19.
Schneier, B. (2000). Full disclosure and the window of vulnerability, Crypto-Gram
(September 15, 2000). www.counterpane.com/cryptogram-0009.html#1.
Schultz, E.E., Brown, D.S.,  & Longstaff, T.A. (1990). Responding to Computer Security
Incidents, Lawrence Livermore National Laboratory, 165. http://ftp.cert.dfn.de/pub/
docs/csir/ ihg.ps.gz, July 23.
Sharma, R., Sibbal, R., & Shrivastava, A.K. (2016). Vulnerability discovery modeling for open
and closed source software. International Journal of Secure Software Engineering,
7(4), 19–38.
Shrivastava, A.K., Sharma, R.,  & Kapur, P.K. (2015). Vulnerability discovery model
for a software system using stochastic differential equation. In  The  IEEE Xplore
Proceedings of International Conference on Futuristic Trends in Computational
Analysis and Knowledge Management, Amity University, Greater Noida, February
25–27, pp. 199–205.
Wai, F.K., Yong, L.W., Divakaran, D.M. & Thing, V.L.L. (2018). Predicting vulnerability dis-
covery rate using past versions of a software. In 2018 IEEE International Conference
on Service Operations and Logistics, and Informatics (SOLI), Singapore, pp. 220–225.
Woo, S., Alhazmi, O.,  & Malaiya, Y. (2006). Assessing vulnerabilities in apache and IIS
HTTP servers. In 2006 2nd IEEE International Symposium on Dependable, Autonomic
and Secure Computing IEEE, pp. 103–110.
Younis, A., Joh, H.,  & Malaiya, Y. (2011). Modeling learningless vulnerability discovery
using a folded distribution. Proceedings of SAM, 11, 617–623.
Younis, A., Malaiya, Y.K., & Ray, I. (2016). Assessing vulnerability exploitability risk using
software properties. Software Quality Journal, 24, 159–202.
16 Signature Reliability Evaluations: An Overview of Different Systems
Akshay Kumar, Mangey Ram, and S. B. Singh

CONTENTS
16.1 Introduction
16.2 Algorithms Used in Signature Reliability
16.2.1 Algorithm for Computing the Signature Using Reliability Function
16.2.2 The Algorithm to Assess the Expected Lifetime of the System by Using Minimum Signature
16.2.3 Algorithm for Obtaining the Barlow-Proschan Index for the System
16.2.4 Algorithm to Determine the Expected Value of the System
16.2.5 Algorithm for Obtaining the Reliability of the Sliding Window System
16.3 Illustrations
16.4 Conclusion
References

16.1 INTRODUCTION
In recent years, substantial efforts have been made to develop reliability
theory, including signature and fuzzy reliability theories, and to apply it to
a wide range of real-life problems. Barlow and Proschan (1975) discussed an
importance measure for the elements of a coherent system and expressed its
fundamental characteristics in terms of the fault tree. This importance measure
is a useful tool for evaluating the minimum cut sets, the system reliability,
and the minimum cost of a fault tree system using the Monte Carlo method and
life distributions. They also discussed a method for computing importance in
terms of the hazard rate for series-parallel and complex systems. Owen (1975)
discussed multi-linear extensions of compound games and evaluated the Banzhaf
value by differentiating the multi-linear extension of the game on the unit
cube. The proposed algorithm can be applied to the presidential election game
and the Electoral College game.

Samaniego (1985) presented the failure rate of an arbitrary coherent system whose
elements have independent and identically distributed (i.i.d.) lifetimes with a
common continuous distribution F. Various examples of coherent systems were
given, including a closure theorem for k-out-of-n:F systems with i.i.d. elements,
and various characteristics of the s-coherent system were obtained. Owen (1988)
defined the theory of multi-linear extensions of games and discussed its
properties in real-life situations based on Shapley's game-theoretic value. This
study showed that game theory is a very useful tool for solving many real-life
problems. Shapley introduced his value for games in 1953, by which players could
compute their utility scales and then improve their play. Boland et al. (1990)
considered a consecutive k-out-of-n:F system, which consists of n ordered
elements and fails if at least k consecutive elements fail. They presented
several examples of consecutive k-out-of-n:F systems in oil pipelines,
telecommunications, and circuitry. They also computed the reliability of
consecutive k-out-of-n:F systems whose elements are independent of each other.
They then developed a system with positive dependence between adjoining elements
and showed that its reliability was lower for k ≥ (n + 1)/2. Yu et al. (1994)
investigated multi-state coherent systems (MSCS) under the assumption that the
states of the system and of its elements form totally ordered sets. They
discussed a new MSCS, the generalized multi-state coherent system, analyzed some
properties of this generalized model, and defined a new approach for computing
the signature of an MSCS.
Ushakov (1986, 1994) discussed the key role that reliability engineering plays in
real life. He reviewed system reliability and applied it to engineering systems,
introduced basic techniques of probabilistic and statistical reliability along
with more recent results, and presented various techniques and applications of
reliability theory in real-life systems. Kochar et al. (1999) discussed different
techniques and properties for comparing coherent systems having i.i.d. lifetime
elements. All of their comparisons rely on representing the system's lifetime
distribution as a function of the system's signature. The signature of a coherent
system is based on the probability that the system fails upon the ith element
failure. They compared system signatures by means of stochastic ordering, hazard
rate ordering, and likelihood ratio ordering and presented an approach to the
coherent system. Levitin (2001) considered redundancy optimization for a
multi-state system that requires a fixed amount of resources for its work and
receives them from resource-generating subsystems. The suggested algorithm
evaluated the optimal system structure and the system availability. The system
productivity, availability, and cost were evaluated from the performance of each
element. The main goal of the study was to minimize the investment cost and to
represent the total demand by a demand curve based on system probability.
A genetic algorithm was used to solve the universal generating function (UGF)
based problem and to compute the system availability and the optimal structure
when the working elements of the subsystem had a maximum performance rate under
a given demand distribution. Boland (2001) studied the characteristics of the
signature of a coherent system having i.i.d. lifetime elements. He concluded
that the signature is a widely useful tool for comparing the properties of
different systems and discussed
simple and indirect majority systems. The signature was described through the
probability that the system lifetime equals the ith order statistic of the
element lifetimes, and its computation was based on the path sets and ordered
cut sets of the system. Levitin (2002) proposed the linear multi-state sliding
window system, which generalizes the consecutive k-out-of-r-from-n:F system to
the multi-state case. The considered system consists of n linearly ordered
multi-state elements. Each element could have two states: total failure or
completely working. The system fails if the sum of the performance rates of any
r consecutive elements is lower than the allocated weight. The author evaluated
various characteristics of the linear multi-state sliding window system and
suggested an algorithm that finds the order of elements maximizing the system
reliability. A genetic algorithm was used as the optimization tool, with a UGF
technique for the reliability computation. Levitin (2003a) introduced a linear
multi-state sliding window system consisting of n linearly ordered multi-state
elements, in which the system performance was judged against a given weight.
The author presented an approach for calculating the reliability of the sliding
window system (SWS) subject to common supply failures (CSFs) in common supply
groups (CSGs). He also described a method for finding the optimal distribution
of elements among the CSGs with respect to system reliability. The optimization
results were computed with the help of the UGF technique and a genetic
algorithm. Levitin (2003b)
proposed a multi-state system that generalized the consecutive k-out-of-r-from-n:F
system. The considered linear multi-state SWS consisted of n ordered multi-state
elements, and every element could have two states. In this study, he evaluated
the system reliability, mean time to failure (MTTF), and cost of the considered
system using the extended universal moment generating function. Boland and
Samaniego (2004) described the characteristic of a system called its
“signature.” They related the signature to other well-known measures of system
reliability and found that the signature was useful for comparing different
systems. They provided different stochastic comparisons between systems and
signature-based comparisons of coherent systems. They investigated the
signatures of different systems having i.i.d. lifetime elements and evaluated
the expected lifetime and expected cost using the system reliability function
and order statistics. Belzunce and Shaked (2004) reviewed and studied the
properties of failure profiles in coherent systems. The authors presented system
reliability based on path sets and cut sets and discussed the relationship
between the elements and the properties of failure profiles. They derived an
expression for the density function of the lifetime distribution of a coherent
system with independent elements. They also compared the lifetimes of two
systems in the likelihood ratio order using failure profiles and obtained
likelihood ratio bounds on the lifetimes of coherent systems whose element
lifetimes are independent but not identically distributed. Navarro and Rychlik
(2007) studied the structure functions and the MTTF of coherent systems with
exchangeable elements whose lifetime distribution function depends on the
signature. They discussed exchangeable elements with an absolutely continuous
joint distribution, for which the lifetime of any coherent system can be
represented through the order statistics with weights identical to the
signature. They assessed expectation bounds for exchangeable exponential
elements and expressed the bounds in terms of the parent marginal reliability
function for all coherent systems with three and four exchangeable elements
with exponential lifetime
distribution. Navarro et al. (2007a) introduced various properties of coherent
systems with dependent elements based on the signature. They presented
hyper-minimal and hyper-maximal distributions and evaluated the distributions
and bounds of series, parallel, and k-out-of-n systems. They also studied
applications of coherent systems with multivariate exponential lifetime
distributions. Navarro et al. (2007b) surveyed coherent systems with
exchangeable element lifetimes, based on order statistics, and the concept of
signatures for systems with i.i.d. elements. They discussed the representation
of the lifetime of a coherent system as a generalized mixture of the
distributions of series, parallel, and k-out-of-n systems. They also
characterized the behavior of the hazard rate on the basis of series systems
and order statistics.
Samaniego (2007) discussed the properties of series, parallel, and k-out-of-n
systems with i.i.d. lifetime elements based on their signatures. In this study,
he treated system signatures and characterization and preservation theorems
with the help of structure functions and order statistics, and showed
applications of the system signature in network reliability and reliability
economics. Navarro et al. (2008) discussed applications and extensions of
coherent systems in various engineering problems. The authors reviewed the
signature-based representation and preservation theorems for systems whose
elements have i.i.d. lifetimes, based on structural reliability. They showed
that the distribution of the system's lifetime can be written as a mixture of
the distributions of k-out-of-n systems. Finally, they evaluated the signatures
and expected lifetimes of binary and multi-state systems (MSS) with the help of
reliability functions and order statistics. Bhattacharya and Samaniego (2008)
reviewed and studied the optimal allocation of independent elements with given
reliabilities to specific locations within a given coherent system. They gave a
sufficient condition on the system structure under which the highest possible
system reliability is achieved, and evaluated the optimal allocation of
elements in series, parallel, and series-parallel systems. They also examined
problems of long-standing interest in reliability theory for coherent systems,
that is, monotone systems in which every element is relevant, and obtained
solutions based on signatures using order statistics. Li and Zhang (2008)
investigated the
coherent systems having i.i.d. elements, studied their properties by stochastic
methods for comparing system lifetime distributions, and computed the signature
of a coherent system with the help of order statistics. They also discussed how
the signature of a coherent system with i.i.d. elements can be evaluated using
reliability functions and order statistics. Navarro and Rubio (2009) studied
signature-based coherent systems and their characteristics in various
engineering fields using stochastic orderings. They discussed the signatures of
systems with 2, 3, and 4 elements using minimal path sets and order statistics.
They suggested an algorithm, based on the minimal path sets, for calculating
the signature, system moments, system reliability function, and expected values
of a system with n elements, and they compared systems in the i.i.d. case.
Eryilmaz et al. (2009) discussed the reliability properties of the consecutive
k-within-m-out-of-n:F system with exchangeable elements based on its survival
function. They obtained bounds on the system reliability using a Monte Carlo
estimator and moving order statistics. The system signature was also
discussed with the help of simulation of the Samaniego signature, and system
characteristics were defined for the coherent system. The multivariate Pareto
distribution was used to evaluate the results for the system with exchangeable
elements. Eryilmaz (2010) expressed the reliability functions of consecutive
systems with exchangeable element lifetimes as mixtures of the reliabilities of
order statistics. He also showed that reliability and stochastic ordering
results for consecutive k-systems can be obtained from these mixture
representations. Consecutive k-systems can be applied in oil pipelines, vacuum
systems in accelerators, telecom networks, and spacecraft relay stations.
Navarro and Rychlik (2010) discussed bounds on the expected lifetimes of
coherent and mixed systems based on elements with independent lifetimes and
compared these bounds. In the i.i.d. case, they obtained sharper inequalities
that depend on a concentration measure connected to the Gini dispersion index.
For i.i.d. lifetimes, the bounds were expressed in terms of the expected
lifetimes of series systems of smaller sizes and the expected lifetime of a
single unit. Da Costa Bueno (2011) determined the importance measure of a
coherent system in terms of its signature and described the properties of the
dynamic system signature, the Barlow-Proschan importance, and the element
importance under compensator transforms in the case of deterministic
compensators, for i.i.d. element lifetimes.
Eryilmaz et al. (2011) discussed the reliability properties of
m-consecutive-k-out-of-n:F systems with exchangeable elements and derived exact
recurrence relations for the signature of the system. They used order
statistics and the lifetime distribution to describe system reliability
metrics. They also computed the minimal and maximal signatures of the system
with i.i.d. elements and its MTTF, and obtained stochastic ordering results for
the m-consecutive-k-out-of-n:F system. Lisnianski and Frenkel (2011) studied MSS
reliability evaluation on the basis of signature, optimization, and statistical
inference. They discussed the role of the signature in dynamic reliability and
in non-parametric inference for the lifetime distribution, as well as the role
of coherent systems in various engineering problems. They also presented
various methods for the signature-based representation of a coherent system
using order statistics, Markov processes, and multiple-valued logic, and
computed MSS reliability, expected lifetime, and cost. Mahmoudi and Asadi (2011)
evaluated the properties of the dynamic signature of a coherent system. They
reviewed the concept of the signature and its advantages for the stochastic
comparison of coherent systems. They considered a coherent system, described
its various characteristics and measures in real-life situations, evaluated its
reliability based on partial information, and obtained the failure probability
of the coherent system over its lifetime. Triantafyllou and Koutras (2011)
proposed a 2-within consecutive k-out-of-n:F system consisting of exchangeable
elements. They studied the system through its signature and gave some
stochastic comparisons involving the reliability function and the element
lifetimes. They presented several stochastic orderings for the 2-within
consecutive k-out-of-n:F system via its signature and discussed the
preservation of the increasing failure rate (IFR) property for the proposed
system. A 2-within consecutive k-out-of-n:F system is used in telecommunication,
oil pipeline, and vacuum systems in accelerators. Balakrishnan et al. (2012)
presented an overview of current theory relating to signatures and its use in
the study of dynamic reliability, systems with i.i.d. elements, and
non-parametric inference for the element lifetime distribution. They introduced
various properties of the signature of a coherent system. The authors discussed
various methods and algorithms for obtaining the system reliability, expected
lifetime, Barlow-Proschan index, and expected cost rate using order statistics
and the reliability functions of coherent systems. Eryilmaz (2012) investigated
the number of elements that have failed at the time of system failure. The
author considered coherent systems such as the linear consecutive
k-within-m-out-of-n:F and m-consecutive-k-out-of-n:F systems and obtained their
expected lifetime, the expected value of X (the number of failed elements), and
the system reliability using lifetime distributions and order statistics.
Da Costa Bueno (2013)
introduced the multi-state monotone system using decomposition methods and
evaluated the signature of a coherent system in the classical case through
exchangeability properties. The system reliability function was obtained for
monotone systems with i.i.d. elements using the Samaniego signature. The work
also studied the signatures of binary systems and MSS with the help of the
proposed theorem. Marichal and Mathonet (2013) showed that the Samaniego
signature of a coherent system with i.i.d. lifetime elements can be computed
from Boland's formula, which is expressed through the structure function. They
obtained the signature, the tail signature, and the Barlow-Proschan index of
the coherent system from derivatives of the reliability function. To compute
the signature of a coherent system from its structure function, they used
Owen's method. Various engineering problems from real-life situations were
discussed, and various methods and algorithms for determining the system
signature were provided. Da et al. (2014) studied the signature of a k-out-of-n
coherent system consisting of n elements. They computed the signature and the
minimal signature of binary coherent systems and derived them for combinations
of elements. The authors gave several numerical examples illustrating the
characteristics of coherent systems with i.i.d. elements based on the minimal
path sets, along with applications in engineering fields. They also obtained
the signature from order statistics and suggested algorithms.
Eryilmaz (2014) noted that the signature is an effective tool not only for
investigating binary coherent systems but also for applications in network
systems. For evaluating the signature of systems composed of modules in series
and in parallel, he derived a simple method based on the signatures and minimal
signatures of the modules with the help of system structure functions. A simple
statistical approach was also given for comparing system signatures, which
depended on the coherent system and on the computation for its series and
parallel modules. Eryilmaz (2015) defined mixture representations for 3-state
systems with 3-state elements and studied the reliability modeling of 3-state
systems consisting of 3-state s-independent elements. The system and its
elements could have three states: perfect functioning, partial performance, and
complete failure. The study showed how the survival functions of the system in
different state subsets can be represented. A Markov process was used for
analyzing the signature of 3-state consecutive-k-out-of-n:G systems consisting
of s-independent elements. Lindqvist and Samaniego (2015) noted that the
signature of a coherent system is a very useful tool in the study of systems
with i.i.d. lifetime elements. The signature of a coherent system with n
elements is a vector whose kth entry is the probability that the kth element
failure causes the system failure. They evaluated the dynamic signature of
binary and complex systems with minimal repair, called the conditional dynamic
signature of the system, with the help of stochastic methods and minimal path
sets. Eryilmaz and Tuncel (2015) studied k-out-of-n systems consisting of n
linearly ordered elements (linear and circular). They discussed the signature
with the help of simulation, allowing the system to have various numbers of
elements. After obtaining signature-based expressions for the structure
function, the MTTF, and the mean number of failed elements, they provided
various applications in engineering fields. Franko
and Tutuncu (2016) computed the signature-based reliability of the weighted
k-out-of-n:G system with repairable i.i.d. lifetime elements. They studied the
reliability and some reliability indices of the repairable weighted
k-out-of-n:G system and examined several uncertainties via the signature. Such
systems are widely used in engineering, for example in solar fields and
military systems. They computed the signature of the considered system, which
depends on the weights of the elements, using stochastic methods and path sets,
and calculated the Birnbaum and Barlow-Proschan element importance measures
through the suggested algorithms. Chahkandi et al. (2016) discussed a repairable
coherent system and examined Samaniego's notion of the signature for i.i.d.
lifetime elements in that setting. Whereas Samaniego's signature is defined for
i.i.d. random lifetimes, here the failures of the elements were modeled by
Poisson processes with an identical intensity function. The authors assumed
that the reliability function of the coherent system depends on a mixture of
probabilities and on the number of repairable elements. They determined the
reliability function of the series system using an algorithm based on
stochastic order statistics. Samaniego and Navarro (2016) studied coherent
systems with heterogeneous elements and the properties used to compare them.
They used various methods for comparing coherent systems having either
independent or dependent elements. In the independent case, the survival
signature method of Coolen and Coolen-Maturi was used. Kumar and Singh (2017a,
2017b, 2017c) evaluated the signature, expected cost, MTTF, and Barlow-Proschan
index of various engineering systems with the help of reliability functions and
UGF techniques. Bisht and Singh (2019) discussed the signature of complex bridge
networks with binary-state nodes using UGF techniques. They computed the
signature of each node in series, parallel, and complex forms of the network
system.
16.2  ALGORITHMS USED IN SIGNATURE RELIABILITY


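16.2.1 Algorithm for Computing the Signature Using Reliability Function
Step 1: Determine the signature of the system using reliability functions (see
Boland, 2001).

$$A_a = \frac{1}{\binom{s}{s-a+1}} \sum_{\substack{H \subseteq [s] \\ |H| = s-a+1}} \varphi(H) \;-\; \frac{1}{\binom{s}{s-a}} \sum_{\substack{H \subseteq [s] \\ |H| = s-a}} \varphi(H) \tag{16.1}$$

where $\varphi$ is the structure function of the system with $s$ elements, $[s] = \{1, \ldots, s\}$, and $A_a$ is the $a$th component of the signature.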

Calculate the reliability polynomial of SWS

$$H(P) = \sum_{j=1}^{s} C_j \binom{s}{j} P^{j} q^{s-j},$$

where $C_j = \sum_{i=s-j+1}^{s} V_i$, $j = 1, 2, \ldots, s$, and $q = 1 - P$.

Step 2: Evaluate the tail signature of the system, i.e., the $(s+1)$-tuple $\bar{V} = (\bar{V}_0, \ldots, \bar{V}_s)$
with

$$\bar{V}_a = \sum_{i=a+1}^{s} V_i = \frac{1}{\binom{s}{s-a}} \sum_{|H| = s-a} \varphi(H) \tag{16.2}$$

Step 3: Calculate the reliability function in polynomial form, for a Taylor
expansion at v = 1, by:

$$P(v) = v^{s} H\!\left(\frac{1}{v}\right) \tag{16.3}$$

Step 4: Compute the tail signature of the system (Equation 16.2) with the help
of the reliability function by (see Marichal and Mathonet, 2013):

$$\bar{V}_a = \frac{(s-a)!}{s!}\, D^{a} P(1), \quad a = 0, 1, \ldots, s \tag{16.4}$$

Step 5: Obtain the signature from the tail signature:

$$V_a = \bar{V}_{a-1} - \bar{V}_a, \quad a = 1, 2, \ldots, s \tag{16.5}$$
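
As a rough illustration of Steps 2 and 5, the following Python sketch computes the tail signature by averaging the structure function over all element subsets of a given size (Equation 16.2) and then takes successive differences (Equation 16.5). The function names `tail_signature` and `signature`, and the convention that the structure function `phi` receives the set of working elements and returns 0 or 1, are assumptions made for this example. The enumeration grows exponentially with the number of elements, so it is only meant for small systems.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def tail_signature(n, phi):
    """Tail signature of an n-element system (Equation 16.2).

    phi is a hypothetical structure function: it takes a frozenset of
    working element indices and returns 1 if the system works, else 0.
    Entry a is the average of phi over all subsets of size n - a.
    """
    tail = []
    for a in range(n + 1):
        size = n - a
        total = sum(phi(frozenset(h)) for h in combinations(range(n), size))
        tail.append(Fraction(total, comb(n, size)))
    return tail


def signature(n, phi):
    """Signature as successive differences of the tail signature (Equation 16.5)."""
    tail = tail_signature(n, phi)
    return [tail[a - 1] - tail[a] for a in range(1, n + 1)]


# Example chosen for this sketch: a 2-out-of-3:G system,
# which works when at least two of its three elements work.
two_out_of_three = signature(3, lambda working: int(len(working) >= 2))
```

For the 2-out-of-3:G example, the result is the signature (0, 1, 0): the system always fails at the second element failure.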

16.2.2 The Algorithm to Assess the Expected Lifetime of the System by Using Minimum Signature

Step 1: Determine the MTTF of the system whose elements have i.i.d.
exponentially distributed lifetimes with mean $\mu = 1$.
Step 2: Assess $E(T)$ of the system, which has i.i.d. elements (see Navarro,
2009):

$$E(T) = \mu \sum_{i=1}^{n} \frac{C_i}{i} \tag{16.6}$$

where $C = (C_1, C_2, \ldots, C_n)$ is the vector of coefficients obtained with the help of the
minimal signature.
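
Equation 16.6 can be sketched in a few lines of Python. The function name `expected_lifetime` is hypothetical, and the two-element parallel system used as input (reliability polynomial $2P - P^2$, hence minimal signature (2, -1)) is an example chosen here rather than one taken from the chapter.

```python
from fractions import Fraction


def expected_lifetime(minimal_signature, mu=1):
    """E(T) = mu * sum_i C_i / i for i.i.d. exponential elements (Equation 16.6).

    minimal_signature holds the integer coefficients C_1, ..., C_n.
    """
    return mu * sum(
        Fraction(c, i) for i, c in enumerate(minimal_signature, start=1)
    )


# Assumed example: two elements in parallel, reliability 2P - P^2.
parallel_two = expected_lifetime([2, -1])
```

With unit-mean exponential elements, the parallel pair gives 2/1 - 1/2 = 3/2, which is the mean of the larger of two such lifetimes.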

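16.2.3 Algorithm for Obtaining the Barlow-Proschan Index for the System

Compute the Barlow-Proschan index of the i.i.d. elements with the help of the
reliability function as (see Shapley, 1953; Owen, 1975, 1988):

$$I_{BP}^{(a)} = \int_{0}^{1} (\partial_a H)(v)\, dv, \quad a = 1, 2, \ldots, n \tag{16.7}$$

where $H$ is the reliability function of the system and $\partial_a H$ is its partial derivative with respect to the reliability of element $a$, evaluated with all element reliabilities equal to $v$.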
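
The integral in Equation 16.7 can be evaluated exactly without symbolic algebra. Expanding $\partial_a H$ over the subsets $S$ of the other elements gives terms $v^{|S|}(1-v)^{n-1-|S|}\,[\varphi(S \cup \{a\}) - \varphi(S)]$, and each monomial integrates to $|S|!\,(n-1-|S|)!/n!$, which is the Shapley-value weight. The sketch below uses that expansion; the function name `barlow_proschan` and the set-based structure function `phi` are assumptions for illustration.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def barlow_proschan(n, phi):
    """Barlow-Proschan index of each element for i.i.d. lifetimes.

    phi is a hypothetical structure function taking a frozenset of working
    element indices and returning 1 (system works) or 0 (system failed).
    """
    index = []
    for a in range(n):
        others = [e for e in range(n) if e != a]
        total = Fraction(0)
        for k in range(n):
            # Integral of v^k (1 - v)^(n-1-k) over [0, 1].
            weight = Fraction(factorial(k) * factorial(n - 1 - k), factorial(n))
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Element a is critical when adding it switches the system on.
                total += weight * (phi(s | {a}) - phi(s))
        index.append(total)
    return index


# Five elements in series: the system works only if all five work.
series_five = barlow_proschan(5, lambda working: int(len(working) == 5))
```

For the five-element series system, every element receives the index 1/5, and the indices sum to 1.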

16.2.4 Algorithm to Determine the Expected Value of the System (Eryilmaz, 2012)

Step 1: Calculate the expected value $E(X)$ of the number of failed elements at
the time of system failure using the signature:

$$E(X) = \sum_{i=1}^{n} i V_i$$

Step 2: Evaluate $E(X)$ and $E(X)/E(T)$ of the system.
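
A minimal Python sketch of these two steps follows. The function names `expected_failures` and `cost_rate` are hypothetical, and the 2-out-of-3:G system with unit-mean exponential elements (signature (0, 1, 0), expected lifetime 5/6) is an example assumed here for illustration.

```python
from fractions import Fraction


def expected_failures(signature):
    """E(X) = sum_i i * V_i, with V_i the ith signature component."""
    return sum(i * v for i, v in enumerate(signature, start=1))


def cost_rate(signature, expected_lifetime):
    """Ratio E(X) / E(T) from Step 2."""
    return Fraction(expected_failures(signature)) / Fraction(expected_lifetime)


# Assumed example: 2-out-of-3:G system with unit-mean exponential elements.
# Its minimal signature is (0, 3, -2), so E(T) = 3/2 - 2/3 = 5/6.
e_x = expected_failures([0, 1, 0])
ratio = cost_rate([0, 1, 0], Fraction(5, 6))
```

Here E(X) = 2, since the system always fails at the second element failure, and the ratio E(X)/E(T) is 12/5.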

16.2.5 Algorithm for Obtaining the Reliability of the Sliding Window System (Levitin, 2005)

Step 1: Obtain the UGF of each individual element and initialize $F = 0$ and $U_{1-r}(z) = z^{0,\,b_0}$.
Step 2: Repeat Steps 3 and 4 for i = 1, 2, ..., K.
Step 3: Compute $U_{i-r+1}(z) = U_{i-r}(z) \underset{\varphi}{\otimes} U_i(z)$.
Step 4: If i ≥ r, find all terms of $U_{i-r+1}(z)$ that correspond to window failure and add their value $\alpha_f(U_{i-r+1}(z))$ to F.
Step 5: Find the reliability of the SWS as R = 1 − F.
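
The recursion can be sketched compactly if each UGF is stored as a dictionary that maps the performance values of the last r elements to a probability. This is a simplified illustration for two-state elements, not Levitin's implementation: the function name `sws_reliability`, the dictionary representation, and the failure test "window sum below W" are assumptions of this sketch.

```python
from fractions import Fraction


def sws_reliability(perf, probs, r, w):
    """Reliability of a sliding window system with two-state elements.

    perf[i]  : performance rate of element i when it works (0 when failed)
    probs[i] : probability that element i works
    r, w     : window length and minimum total performance per window
    """
    # Step 1: F = 0 and the initial UGF holds r zeros with probability 1.
    ugf = {(0,) * r: Fraction(1)}
    failure = Fraction(0)
    # Step 2: loop over the elements i = 1..K.
    for i, (g, p) in enumerate(zip(perf, probs), start=1):
        # Step 3: combine with the UGF of element i by shifting the window.
        shifted = {}
        for window, prob in ugf.items():
            for value, q in ((g, p), (0, 1 - p)):
                key = window[1:] + (value,)
                shifted[key] = shifted.get(key, 0) + prob * q
        ugf = shifted
        # Step 4: once a full window exists, move failing terms into F.
        if i >= r:
            for window in [k for k in ugf if sum(k) < w]:
                failure += ugf.pop(window)
    # Step 5: R = 1 - F.
    return 1 - failure


half = Fraction(1, 2)
r_sws = sws_reliability([1, 2, 3, 4], [half] * 4, r=3, w=4)
```

Removing the failing terms as soon as they are found keeps each system state from being counted twice in F.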

16.3 ILLUSTRATIONS
Case 1: Consider a series system of five elements, as shown in Figure 16.1. The structure function of the series system is:

$$R(P) = \prod_{j=1}^{n} R_j$$

$$R(P) = R_1 R_2 R_3 R_4 R_5 \tag{16.8}$$

FIGURE 16.1  Series system.

When the elements are identically distributed ($R_j = P$ for every j), the reliability function R(P) of the series system with i.i.d. elements becomes:

$$R(P) = P^{5}.$$

1. Signature of a series system

Using Owen's method for the system, express the reliability function in terms of v as:

$$H(v) = v^{5}. \tag{16.9}$$

With the help of Equations 16.3 and 16.9, the reliability function can be written as:

$$P(v) = v^{5} H\!\left(\frac{1}{v}\right) = 1.$$

Now, obtain the tail signature $\bar{V}$ of the series system by using Equation 16.4:

$$\bar{V}_0 = 1,\ \bar{V}_1 = 0,\ \bar{V}_2 = 0,\ \bar{V}_3 = 0,\ \bar{V}_4 = 0,\ \bar{V}_5 = 0.$$

$$\bar{V} = (1, 0, 0, 0, 0, 0).$$

Calculate the signature V of the series system from Equation 16.5, which has one entry per element:

$$V = (1, 0, 0, 0, 0).$$

2. Barlow-Proschan index of the series system

From Equations 16.8 and 16.7, we obtain the Barlow-Proschan index of the first element of the series system:

$$I_{BP}^{(1)} = \int_{0}^{1} (\partial_1 H)(v)\, dv = \int_{0}^{1} v^{4}\, dv = \frac{1}{5}.$$

Similarly, we obtain the Barlow-Proschan index $I_{BP}^{(K)}$ of every element, K = 1, 2, ..., 5:

$$I_{BP} = \left(\frac{1}{5}, \frac{1}{5}, \frac{1}{5}, \frac{1}{5}, \frac{1}{5}\right).$$

3. The expected lifetime of the series system

The minimal signature M of the series system is given by the coefficients of Equation 16.9:

Minimal signature M = (0, 0, 0, 0, 1).

Using step 2 of Algorithm 16.2.2 with µ = 1, the expected lifetime of the series system is:

$$E(t) = \frac{1}{5} = 0.2 \tag{16.10}$$

4. Expected cost rate

Using step 1 of Algorithm 16.2.4, the expected value of the series system is:

$$E(X) = 1. \tag{16.11}$$

Using Equations 16.10 and 16.11, the expected cost rate is:

$$E(X)/E(t) = 1/0.2 = 5.$$

Case 2: Consider a parallel system of five elements, as shown in Figure 16.2. The reliability function of the parallel system is:

$$R(P) = 1 - \prod_{j=1}^{n} (1 - R_j)$$

$$R(P) = 1 - [(1 - R_1)(1 - R_2)(1 - R_3)(1 - R_4)(1 - R_5)]. \tag{16.12}$$

Because the elements are identically distributed ($R_i = R$), the reliability function R(P) of the parallel system with i.i.d. elements can be written as:

$$R(P) = 5R - 10R^{2} + 10R^{3} - 5R^{4} + R^{5}.$$

FIGURE 16.2  Parallel system.



The reliability function can be expressed in terms of P as:

$$H(P) = 5P - 10P^{2} + 10P^{3} - 5P^{4} + P^{5}. \tag{16.13}$$
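
The polynomial in Equation 16.13 follows from the binomial theorem applied to Equation 16.12 with every element reliability set to P. The short derivation below is added as a check and assumes nothing beyond i.i.d. elements:

```latex
\begin{aligned}
H(P) &= 1 - (1 - P)^{5}
      = 1 - \sum_{k=0}^{5} \binom{5}{k} (-P)^{k} \\
     &= \sum_{k=1}^{5} (-1)^{k+1} \binom{5}{k} P^{k}
      = 5P - 10P^{2} + 10P^{3} - 5P^{4} + P^{5}.
\end{aligned}
```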

1. Signature of a parallel system

With the help of Owen's method for the system, the reliability function of the parallel system from Equation 16.13 is:

$$H(P) = 5P - 10P^{2} + 10P^{3} - 5P^{4} + P^{5}. \tag{16.14}$$

Using Equations 16.3 and 16.14, the reliability function can be expressed in terms of v as:

$$P(v) = v^{5} H\!\left(\frac{1}{v}\right) = 1 - 5v + 10v^{2} - 10v^{3} + 5v^{4}.$$

Now, obtain the tail signature $\bar{V}$ of the proposed system by using Equation 16.4:

$$\bar{V}_0 = 1,\ \bar{V}_1 = 1,\ \bar{V}_2 = 1,\ \bar{V}_3 = 1,\ \bar{V}_4 = 1,\ \bar{V}_5 = 0.$$

$$\bar{V} = (1, 1, 1, 1, 1, 0).$$

Therefore, the signature V of the parallel system from Equation 16.5, which has one entry per element, is:

$$V = (0, 0, 0, 0, 1).$$
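
As a worked check of the tail signature, the derivatives of $P(v) = 1 - 5v + 10v^{2} - 10v^{3} + 5v^{4}$ at v = 1 can be written out explicitly. The computation below uses Equation 16.4 with the factor (s − a)!/s! of Marichal and Mathonet (2013) and s = 5:

```latex
\begin{aligned}
D^{0}P(1) &= 1,   & \bar{V}_0 &= \tfrac{5!}{5!}\cdot 1 = 1, \\
D^{1}P(1) &= -5 + 20 - 30 + 20 = 5, & \bar{V}_1 &= \tfrac{4!}{5!}\cdot 5 = 1, \\
D^{2}P(1) &= 20 - 60 + 60 = 20,     & \bar{V}_2 &= \tfrac{3!}{5!}\cdot 20 = 1, \\
D^{3}P(1) &= -60 + 120 = 60,        & \bar{V}_3 &= \tfrac{2!}{5!}\cdot 60 = 1, \\
D^{4}P(1) &= 120,                   & \bar{V}_4 &= \tfrac{1!}{5!}\cdot 120 = 1, \\
D^{5}P(1) &= 0,                     & \bar{V}_5 &= 0.
\end{aligned}
```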

2. Barlow-Proschan index of the parallel system

From Equation 16.12 and Algorithm 16.2.3, we calculate the Barlow-Proschan index of the first element of the parallel system:

$$I_{BP}^{(1)} = \int_{0}^{1} (\partial_1 H)(v)\, dv = \int_{0}^{1} (1 - 4v + 6v^{2} - 4v^{3} + v^{4})\, dv = \frac{1}{5}.$$

Similarly, we compute the Barlow-Proschan index $I_{BP}^{(K)}$ of the remaining elements, K = 1, 2, ..., 5:

$$I_{BP} = \left(\frac{1}{5}, \frac{1}{5}, \frac{1}{5}, \frac{1}{5}, \frac{1}{5}\right).$$

3. The expected lifetime of the parallel system

From the reliability function, the minimal signature M of the system is given by the coefficients of Equation 16.13:

Minimal signature M = (5, −10, 10, −5, 1).

The expected lifetime of the parallel system, assessed from Equation 16.6, is:

$$E(t) = 2.28 \tag{16.15}$$
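
The value 2.28 can be reproduced by hand from Equation 16.6 with µ = 1, taking the coefficients of Equation 16.13 as the vector C = (5, −10, 10, −5, 1). The arithmetic below is added as a check:

```latex
E(T) = \frac{5}{1} - \frac{10}{2} + \frac{10}{3} - \frac{5}{4} + \frac{1}{5}
     = \frac{137}{60} \approx 2.28,
\qquad\text{which equals}\qquad
1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \frac{1}{5}.
```

The second form is the mean of the largest of five independent exponential lifetimes with unit mean, as expected for a parallel system.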

4. Expected cost rate

Using step 1 of Algorithm 16.2.4, the expected value of the parallel system is:

$$E(X) = 5 \tag{16.16}$$

Therefore, the expected cost rate is:

$$E(X)/E(t) = 5/2.28 = 2.19298.$$

Case 3: Consider an SWS with n = 4 elements, r = 3, and W = 4, as shown in Figure 16.3. Each element has two states: complete success and complete failure. Suppose the performance rates of elements 1 to 4 are 1, 2, 3, and 4, respectively.
The UGF of each element of the system in Figure 16.3 is:

$$U_j(z) = P_j z^{j} + (1 - P_j) z^{0}$$

where j = 1, 2, 3, 4, $P_j$ is the probability that element j works, and the exponents j and 0 are its performance rates in the working and failed states.
Therefore, the UGFs $U_j(z)$ (j = 1, 2, 3, 4) of the elements are:

$$U_1(z) = P_1 z^{1} + (1 - P_1) z^{0}$$

$$U_2(z) = P_2 z^{2} + (1 - P_2) z^{0}$$

$$U_3(z) = P_3 z^{3} + (1 - P_3) z^{0}$$

$$U_4(z) = P_4 z^{4} + (1 - P_4) z^{0}.$$

FIGURE 16.3  Sliding window system.



From Algorithm 16.2.5, the UGF of the SWS is built up element by element, starting from the first element.

For i = 1:

$$U_0(z) = \varphi(U_{-1}(z), U_1(z))$$

$$U_0(z) = \varphi\left(z^{0,(0,0,0)},\ P_1 z^{1} + (1 - P_1) z^{0}\right) = P_1 z^{(0,0,0,1)} + (1 - P_1) z^{(0,0,0,0)}$$

For i = 2:

$$U_1(z) = \varphi(U_0(z), U_2(z)) = \varphi\left(P_1 z^{(0,0,0,1)} + (1 - P_1) z^{(0,0,0,0)},\ P_2 z^{2} + (1 - P_2) z^{0}\right)$$

$$= P_1 P_2 z^{(0,0,1,2)} + P_1 (1 - P_2) z^{(0,0,1,0)} + P_2 (1 - P_1) z^{(0,0,0,2)} + (1 - P_1)(1 - P_2) z^{(0,0,0,0)}$$

For i = 3:

$$\begin{aligned}
U_2(z) = \varphi(U_1(z), U_3(z)) ={}& P_1 P_2 P_3 z^{(0,1,2,3)} + P_1 (1 - P_2) P_3 z^{(0,1,0,3)} + P_2 (1 - P_1) P_3 z^{(0,0,2,3)} \\
&+ (1 - P_1)(1 - P_2) P_3 z^{(0,0,0,3)} + P_1 P_2 (1 - P_3) z^{(0,1,2,0)} \\
&+ P_1 (1 - P_2)(1 - P_3) z^{(0,1,0,0)} + P_2 (1 - P_1)(1 - P_3) z^{(0,0,2,0)} \\
&+ (1 - P_1)(1 - P_2)(1 - P_3) z^{(0,0,0,0)}
\end{aligned}$$

Since the condition i ≥ r now holds, the terms of $U_2(z)$ whose total performance is below W are collected in the unreliability F:

$$\begin{aligned}
F ={}& (1 - P_1)(1 - P_2) P_3 + P_1 P_2 (1 - P_3) + P_1 (1 - P_2)(1 - P_3) \\
&+ P_2 (1 - P_1)(1 - P_3) + (1 - P_1)(1 - P_2)(1 - P_3)
\end{aligned} \tag{16.17}$$

For i = 4:

$$U_3(z) = \varphi(U_2(z), U_4(z)) = \varphi\left(P_1 P_2 P_3 z^{(0,1,2,3)} + P_1 (1 - P_2) P_3 z^{(0,1,0,3)} + (1 - P_1) P_2 P_3 z^{(0,0,2,3)},\ P_4 z^{4} + (1 - P_4) z^{0}\right)$$

$$\begin{aligned}
={}& P_1 P_2 P_3 P_4 z^{(1,2,3,4)} + P_1 (1 - P_2) P_3 P_4 z^{(1,0,3,4)} + (1 - P_1) P_2 P_3 P_4 z^{(0,2,3,4)} \\
&+ P_1 P_2 P_3 (1 - P_4) z^{(1,2,3,0)} + P_1 (1 - P_2) P_3 (1 - P_4) z^{(1,0,3,0)} + (1 - P_1) P_2 P_3 (1 - P_4) z^{(0,2,3,0)}
\end{aligned}$$

$$F = P_1 (1 - P_2) P_3 (1 - P_4) \tag{16.18}$$

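Now, adding the unreliability contributions of Equations 16.17 and 16.18 and using R = 1 − F, we obtain the reliability R of the SWS as:

$$R = P_2 P_3 + P_1 P_3 P_4 - P_1 P_2 P_3 P_4 \tag{16.19}$$

For i.i.d. elements ($P_j = P$), the reliability function R of the SWS becomes:

$$R(P) = P^{2} + P^{3} - P^{4}.$$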
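
The step from Equations 16.17 and 16.18 to Equation 16.19 can be spelled out as follows. The labels $F_{16.17}$ and $F_{16.18}$ are introduced here only to tell the two contributions apart. The eight state probabilities of elements 1 to 3 sum to one, so one minus the five failing terms of Equation 16.17 leaves the three surviving terms:

```latex
\begin{aligned}
1 - F_{16.17} &= P_1 P_2 P_3 + P_1 (1 - P_2) P_3 + (1 - P_1) P_2 P_3
               = P_1 P_3 + P_2 P_3 - P_1 P_2 P_3, \\
R &= 1 - F_{16.17} - F_{16.18}
   = P_1 P_3 + P_2 P_3 - P_1 P_2 P_3 - P_1 (1 - P_2) P_3 (1 - P_4) \\
  &= P_2 P_3 + P_1 P_3 P_4 - P_1 P_2 P_3 P_4 .
\end{aligned}
```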

1. Signature of the sliding window system

We obtain the reliability function in terms of v from Owen's method for the system as:

$$H(v) = v^{2} + v^{3} - v^{4} \tag{16.20}$$

Now, using Equations 16.3 and 16.20, the reliability function is:

$$P(v) = v^{4} H\!\left(\frac{1}{v}\right) = -1 + v + v^{2}.$$

Now calculate the tail signature $\bar{V}$ of the SWS using step 4 of Algorithm 16.2.1. With s = 4, $P(1) = 1$, $D^{1}P(1) = 3$, $D^{2}P(1) = 2$, and all higher derivatives equal to zero, the factor $(s-a)!/s!$ gives:

$$\bar{V}_0 = 1,\ \bar{V}_1 = \frac{3!}{4!}\cdot 3 = \frac{3}{4},\ \bar{V}_2 = \frac{2!}{4!}\cdot 2 = \frac{1}{6},\ \bar{V}_3 = 0,\ \bar{V}_4 = 0.$$

The tail signature $\bar{V}$ of the SWS is:

$$\bar{V} = \left(1, \frac{3}{4}, \frac{1}{6}, 0, 0\right).$$

Now, the signature of the SWS from step 5 of Algorithm 16.2.1 is:

$$V = \left(\frac{1}{4}, \frac{7}{12}, \frac{1}{6}, 0\right).$$

2. Barlow-Proschan index of the sliding window system

From Equation 16.19 and Algorithm 16.2.3, we obtain the Barlow-Proschan index of the first element of the SWS:

$$I_{BP}^{(1)} = \int_{0}^{1} (v^{2} - v^{3})\, dv = \frac{1}{12}.$$

Similarly, we compute the Barlow-Proschan index $I_{BP}^{(K)}$ of the remaining elements, K = 1, 2, ..., 4:

$$I_{BP} = \left(\frac{1}{12}, \frac{1}{4}, \frac{7}{12}, \frac{1}{12}\right).$$

3. The expected lifetime of the sliding window system

From the reliability function, the minimal signature M of the system is given by the coefficients of Equation 16.20:

Minimal signature M = (0, 1, 1, −1).

Then, the expected lifetime of the SWS can be determined by using Equation 16.6:

$$E(t) = \frac{1}{2} + \frac{1}{3} - \frac{1}{4} = \frac{7}{12} \approx 0.58. \tag{16.21}$$

4. Expected cost rate

From step 1 of Algorithm 16.2.4, the expected value of the SWS is:

$$E(X) = 1\cdot\frac{1}{4} + 2\cdot\frac{7}{12} + 3\cdot\frac{1}{6} + 4\cdot 0 = \frac{23}{12} \approx 1.92 \tag{16.22}$$

The expected cost rate, using Equations 16.21 and 16.22, is:

$$E(X)/E(t) = \frac{23/12}{7/12} = \frac{23}{7} \approx 3.29.$$
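
As an independent cross-check of the SWS signature, the tail signature can also be obtained directly from Equation 16.2 by counting, for each subset size, how many subsets of working elements keep the system working. The sketch below is illustrative: the function names `sws_works` and `tail_signature_by_counting` are hypothetical, and the structure function assumes that every window of r = 3 consecutive elements must deliver a total performance of at least W = 4.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def sws_works(working, perf=(1, 2, 3, 4), r=3, w=4):
    """Structure function: True if every window of r consecutive
    elements has total performance at least w (elements are 0-indexed)."""
    g = [perf[j] if j in working else 0 for j in range(len(perf))]
    return all(sum(g[k:k + r]) >= w for k in range(len(perf) - r + 1))


def tail_signature_by_counting(phi, s):
    """Equation 16.2: share of working subsets of size s - a, a = 0..s."""
    tail = []
    for a in range(s + 1):
        good = sum(1 for subset in combinations(range(s), s - a)
                   if phi(set(subset)))
        tail.append(Fraction(good, comb(s, s - a)))
    return tail


tail = tail_signature_by_counting(sws_works, 4)
sig = [tail[a - 1] - tail[a] for a in range(1, len(tail))]  # Equation 16.5
print(tail, sig)
```

Counting subsets avoids the polynomial step altogether, so agreement between the two routes is a useful sanity test for small systems.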

16.4 CONCLUSION
In this chapter, we discussed the signature and its related measures, namely the tail signature, expected cost rate, mean time to failure, and Barlow-Proschan index, obtained with the help of the reliability function and Owen's method. We also evaluated the reliability function of the SWS by using the UGF. Finally, we computed the signatures of series, parallel, and sliding window systems with the help of the given algorithms.

REFERENCES
Balakrishnan, N., Navarro, J., & Samaniego, F. J. (2012). Signature representation and preservation results for engineered systems and applications to statistical inference. In Recent Advances in System Reliability, Springer, London, UK, pp. 1–22.
Barlow, R. E., & Proschan, F. (1975). Importance of system elements and fault tree events. Stochastic Processes and Their Applications, 3(2), 153–173.
Belzunce, F., & Shaked, M. (2004). Failure profiles of coherent systems. Naval Research Logistics (NRL), 51(4), 477–490.
Bhattacharya, D., & Samaniego, F. J. (2008). On the optimal allocation of elements within coherent systems. Statistics & Probability Letters, 78(7), 938–943.
Bisht, S., & Singh, S. B. (2019). Signature reliability of binary state node in complex bridge network using universal generating function. International Journal of Quality & Reliability Management, 36(2), 186–201.
Boland, P. J. (2001). Signatures of indirect majority systems. Journal of Applied Probability, 38(2), 597–603.

Boland, P. J., Proschan, F., & Tong, Y. L. (1990). Linear dependence in consecutive k-out-of-n: F systems. Probability in the Engineering and Informational Sciences, 4(3), 391–397.
Boland, P. J., & Samaniego, F. J. (2004). The signature of a coherent system and its applications in reliability. In Mathematical Reliability: An Expository Perspective, Springer US, pp. 3–30.
Chahkandi, M., Ruggeri, F., & Suárez-Llorens, A. (2016). A generalized signature of repairable coherent systems. IEEE Transactions on Reliability, 65(1), 434–445.
Da Costa Bueno, V. (2011). A coherent system element importance under its signatures representation. American Journal of Operations Research, 1(3), 172.
Da Costa Bueno, V. (2013). A multistate monotone system signature. Statistics & Probability Letters, 83(11), 2583–2591.
Da, G., Xia, L., & Hu, T. (2014). On computing signatures of k-out-of-n systems consisting of modules. Methodology and Computing in Applied Probability, 16(1), 223–233.
Eryılmaz, S. (2010). Mixture representations for the reliability of consecutive-k systems. Mathematical and Computer Modelling, 51(5), 405–412.
Eryilmaz, S. (2012). The number of failed elements in a coherent system with exchangeable elements. IEEE Transactions on Reliability, 61(1), 203–207.
Eryilmaz, S. (2014). On signatures of series and parallel systems consisting of modules with arbitrary structures. Communications in Statistics-Simulation and Computation, 43(5), 1202–1211.
Eryilmaz, S. (2015). Mixture representations for three-state systems with three-state elements. IEEE Transactions on Reliability, 64(2), 829–834.
Eryilmaz, S., Kan, C., & Akici, F. (2009). Consecutive k-within-m-out-of-n: F system with exchangeable elements. Naval Research Logistics (NRL), 56(6), 503–510.
Eryilmaz, S., Koutras, M. V., & Triantafyllou, I. S. (2011). Signature based analysis of m-consecutive k-out-of-n: F systems with exchangeable elements. Naval Research Logistics (NRL), 58(4), 344–354.
Eryilmaz, S., & Tuncel, A. (2015). Computing the signature of a generalized k-out-of-n system. IEEE Transactions on Reliability, 64(2), 766–771.
Franko, C., & Tütüncü, G. Y. (2016). Signature based reliability analysis of repairable weighted k-out-of-n: G systems. IEEE Transactions on Reliability, 65(2), 843–850.
Kochar, S., Mukerjee, H., & Samaniego, F. J. (1999). The signature of a coherent system and its application to comparisons among systems. Naval Research Logistics (NRL), 46(5), 507–523.
Kumar, A., & Singh, S. B. (2017a). Signature reliability of linear multi-state sliding window system. International Journal of Quality & Reliability Management, 35(10), 2403–2413.
Kumar, A., & Singh, S. B. (2017b). Computations of signature reliability of coherent system. International Journal of Quality & Reliability Management, 34(6), 785–797.
Kumar, A., & Singh, S. B. (2017c). Signature reliability of sliding window coherent system. In Mathematics Applied to Engineering, Elsevier International Publisher, London, UK, pp. 83–95.
Levitin, G. (2001). Redundancy optimization for multi-state system with fixed resource-requirements and unreliable sources. IEEE Transactions on Reliability, 50(1), 52–59.
Levitin, G. (2002). Optimal allocation of elements in a linear multi-state sliding window system. Reliability Engineering & System Safety, 76(3), 245–254.
Levitin, G. (2003a). Common supply failures in linear multi-state sliding window systems. Reliability Engineering & System Safety, 82(1), 55–62.
Levitin, G. (2003b). Linear multi-state sliding-window systems. IEEE Transactions on Reliability, 52(2), 263–269.
Levitin, G. (2005). The Universal Generating Function in Reliability Analysis and Optimization, Springer, London, UK, p. 442. doi:10.1007/1-84628-245-4.

Li, X., & Zhang, Z. (2008). Some stochastic comparisons of conditional coherent systems. Applied Stochastic Models in Business and Industry, 24(6), 541–549.
Lindqvist, B. H., & Samaniego, F. J. (2015). On the signature of a system under minimal repair. Applied Stochastic Models in Business and Industry, 31(3), 297–306.
Lisnianski, A., & Frenkel, I. (Eds.). (2011). Recent Advances in System Reliability: Signatures, Multi-state Systems and Statistical Inference. Springer Science & Business Media, London, UK.
Mahmoudi, M., & Asadi, M. (2011). The dynamic signature of coherent systems. IEEE Transactions on Reliability, 60(4), 817–822.
Marichal, J. L., & Mathonet, P. (2013). Computing system signatures through reliability functions. Statistics & Probability Letters, 83(3), 710–717.
Navarro, J., & Rubio, R. (2009). Computations of signatures of coherent systems with five elements. Communications in Statistics-Simulation and Computation, 39(1), 68–84.
Navarro, J., Ruiz, J. M., & Sandoval, C. J. (2007a). Properties of coherent systems with dependent elements. Communications in Statistics: Theory and Methods, 36(1), 175–191.
Navarro, J., & Rychlik, T. (2007). Reliability and expectation bounds for coherent systems with exchangeable elements. Journal of Multivariate Analysis, 98(1), 102–113.
Navarro, J., & Rychlik, T. (2010). Comparisons and bounds for expected lifetimes of reliability systems. European Journal of Operational Research, 207(1), 309–317.
Navarro, J., Rychlik, T., & Shaked, M. (2007b). Are the order statistics ordered? A survey of recent results. Communications in Statistics: Theory and Methods, 36(7), 1273–1290.
Navarro, J., Samaniego, F. J., Balakrishnan, N., & Bhattacharya, D. (2008). On the application and extension of system signatures in engineering reliability. Naval Research Logistics (NRL), 55(4), 313–327.
Owen, G. (1975). Multilinear extensions and the Banzhaf value. Naval Research Logistics Quarterly, 22(4), 741–750.
Owen, G. (1988). Multilinear extensions of games. The Shapley Value Essays in Honor of Lloyd S Shapley, Cambridge University Press, New York, pp. 139–151.
Samaniego, F. J. (1985). On closure of the IFR class under formation of coherent systems. IEEE Transactions on Reliability, 34(1), 69–72.
Samaniego, F. J. (2007). System Signatures and Their Applications in Engineering Reliability. Springer Science & Business Media, London, UK, p. 110.
Samaniego, F. J., & Navarro, J. (2016). On comparing coherent systems with heterogeneous elements. Advances in Applied Probability, 48(1), 88–111.
Shapley, L. S. (1953). A value for n-person games. In: Contributions to the Theory of Games, Vol. 2. In: Annals of Mathematics Studies, vol. 28. Princeton University Press, Princeton, NJ, pp. 307–317.
Triantafyllou, I. S., & Koutras, M. V. (2011). Signature and IFR preservation of 2-within-consecutive k-out-of-n: F systems. IEEE Transactions on Reliability, 60(1), 315–322.
Ushakov, I. (1986). Universal generating function. Journal of Computer Science and Systems Biology, 24, 118–129.
Ushakov, I. A. (Ed.). (1994). Handbook of Reliability Engineering. John Wiley & Sons, New York.
Yu, K., Koren, I., & Guo, Y. (1994). Generalized multistate monotone coherent systems. IEEE Transactions on Reliability, 43(2), 242–250.
