Loss Data Analytics Aug 2020
Contents

Preface
  Acknowledgements
  Contributors
  Reviewers
  Other Collaborators
  Version
  For our Readers

2 Frequency Modeling
  2.1 Frequency Distributions
    2.1.1 How Frequency Augments Severity Information
  2.2 Basic Frequency Distributions
    2.2.1 Foundations
    2.2.2 Moment and Probability Generating Functions
    2.2.3 Important Frequency Distributions
  2.3 The (a, b, 0) Class
  2.4 Estimating Frequency Distributions

20 Glossary
Preface
Book Description
The online text will be freely available to a worldwide audience. The online ver-
sion will contain many interactive objects (quizzes, computer demonstrations,
interactive graphs, video, and the like) to promote deeper learning. Moreover, a
subset of the book will be available in pdf format for low-cost printing. The on-
line text will be available in multiple languages to promote access to a worldwide
audience.
This book will be useful in actuarial curricula worldwide. It will cover the loss
data learning objectives of the major actuarial organizations. Thus, it will be
suitable for classroom use at universities as well as for use by independent learn-
ers seeking to pass professional actuarial examinations. Moreover, the text will
also be useful for the continuing professional development of actuaries and other
professionals in insurance and related financial risk management industries.
Project Goal
The project goal is to have the actuarial community author our textbooks in
a collaborative fashion. To get involved, please visit our Open Actuarial Text-
books Project Site.
Acknowledgements
Edward Frees acknowledges the John and Anne Oros Distinguished Chair for
Inspired Learning in Business which provided seed money to support the project.
Frees and his Wisconsin colleagues also acknowledge a Society of Actuaries Cen-
ter of Excellence Grant that provided funding to support work in dependence
modeling and health initiatives. Wisconsin also provided an education innova-
tion grant that provided partial support for the many students who have worked
on this project.
We acknowledge the Society of Actuaries for permission to use problems from
their examinations.
We thank Rob Hyndman, Monash University, for allowing us to use his excellent
style files to produce the online version of the book.
We thank Yihui Xie and his colleagues at RStudio for the R bookdown package
that allows us to produce this book.
Contributors
The project goal is to have the actuarial community author our textbooks in a
collaborative fashion. The following contributors have taken a leadership role
in developing Loss Data Analytics.
• Jianxi Su is a contributor whose research interests include risk manage-
ment and pricing. During his PhD candidature, Jianxi also worked as
a research associate at the Model Validation and ORSA Implementation
team of Sun Life Financial (Toronto office).
• Tim Verdonck is associate professor at the University of Antwerp. He
has a degree in Mathematics and a PhD in Science: Mathematics, ob-
tained at the University of Antwerp. During his PhD he successfully took
the Master in Insurance and the Master in Financial and Actuarial Engi-
neering, both at KU Leuven. His research focuses on the adaptation and
application of robust statistical methods for insurance and finance data.
• Krupa Viswanathan is an Associate Professor in the Risk, Insurance
and Healthcare Management Department in the Fox School of Business,
Temple University. She is an Associate of the Society of Actuaries. She
teaches courses in Actuarial Science and Risk Management at the under-
graduate and graduate levels. Her research interests include corporate
governance of insurance companies, capital management, and sentiment
analysis. She received her Ph.D. from The Wharton School of the Univer-
sity of Pennsylvania.
Reviewers
Our goal is to have the actuarial community author our textbooks in a collabora-
tive fashion. Part of the writing process involves many reviewers who generously
donated their time to help make this book better. They are:
• Yair Babad
• Chunsheng Ban, Ohio State University
• Vytaras Brazauskas, University of Wisconsin - Milwaukee
• Yvonne Chueh, Central Washington University
• Chun Yong Chew, Universiti Tunku Abdul Rahman (UTAR)
• Eren Dodd, University of Southampton
• Gordon Enderle, University of Wisconsin - Madison
• Rob Erhardt, Wake Forest University
• Runhuan Feng, University of Illinois
• Brian Hartman, Brigham Young University
• Liang (Jason) Hong, University of Texas at Dallas
• Fei Huang, Australian National University
• Hirokazu (Iwahiro) Iwasawa
• Himchan Jeong, University of Connecticut
• Min Ji, Towson University
• Paul Herbert Johnson, University of Wisconsin - Madison
• Dalia Khalil, Cairo University
• Samuel Kolins, Lebanon Valley College
• Andrew Kwon-Nakamura, Zurich North America
• Ambrose Lo, University of Iowa
Other Collaborators
• Alyaa Nuval Binti Othman, Aisha Nuval Binti Othman, and Khairina
(Rina) Binti Ibraham were three of many students at the University of
Wisconsin-Madison who helped with the text over the years.
• Maggie Lee, Macquarie University, and Anh Vu (then at University of
New South Wales) contributed the end of the section quizzes.
• Jeffrey Zheng, Temple University, Lu Yang (University of Amsterdam),
and Paul Johnson, University of Wisconsin-Madison, led the work on the
glossary.
Version
• This is Version 1.1, August 2020. Edited by Edward (Jed) Frees and
Paul Johnson.
• Version 1.0, January 2020, was edited by Edward (Jed) Frees.
You can also access pdf and epub versions of the text (current and older) at
our Offline versions of the text site.

For our Readers

The text comes in several formats, including a pdf version (for Adobe Acrobat)
and an EPUB version suitable for mobile devices. Data for running our examples
are available at the same site.
In developing this book, we are emphasizing the online version, which has many
great features such as a glossary, and code and solutions to examples that can
be revealed interactively. For example, you will find that the statistical code is
hidden and can be seen by clicking on clearly marked terms.
We hide the code because we don’t want to insist that you use the R statistical
software (although we like it). Still, we encourage you to try some statistical
code as you read the book – we have opted to make it easy to learn R as you
go. We have set up a separate R Code for Loss Data Analytics site to explain
more of the details of the code.
Like any book, we have a set of notations and conventions. It will probably save
you time if you regularly visit our Appendix Chapter 19 to get used to ours.
Freely available, interactive textbooks represent a new venture in actuarial ed-
ucation and we need your input. Although a lot of effort has gone into the
development, we expect hiccoughs. Please let your instructor know about oppor-
tunities for improvement, write us through our project site, or contact chapter
contributors directly with suggested improvements.
Chapter 1

Introduction to Loss Data Analytics
To begin, it is probably easiest to think about an insurance policy that covers the contents of
an apartment or house that you are renting (known as renters insurance) or the
contents and property of a building that is owned by you or a friend (known as
homeowners insurance). Another common example is automobile insurance. In
the event of an accident, this policy may cover damage to your vehicle, damage
to other vehicles in the accident, as well as medical expenses of those injured in
the accident.
One way to think about the nature of insurance is who buys it. Renters, home-
owners, and auto insurance are examples of personal insurance in that these
are policies issued to people. Businesses also buy insurance, such as coverage
on their properties, and this is known as commercial insurance. The seller, an
insurance company, is also known as an insurer. Even insurance companies need
insurance; this is known as reinsurance.
Another way to think about the nature of insurance is the type of risk being
covered. In the U.S., policies such as renters and homeowners are known as
property insurance whereas a policy such as auto that covers medical damages
to people is known as casualty insurance. In the rest of the world, these are both
known as non-life or general insurance, to distinguish them from life insurance.
Both life and non-life insurances are important components of the world econ-
omy. The Insurance Information Institute (2016) estimates that direct insurance
premiums in the world for 2014 were 2,654,549 for life and 2,123,699 for non-life;
these figures are in millions of U.S. dollars. The total represents 6.2% of the
world gross domestic product (GDP). Put another way, life accounts for 55.5%
of insurance premiums and 3.4% of world GDP whereas non-life accounts for
44.5% of insurance premiums and 2.8% of world GDP. Both life and non-life
represent important economic activities.
Insurance may not be as entertaining as the sports industry (another industry
that depends heavily on data) but it does affect the financial livelihoods of
many. By almost any measure, insurance is a major economic activity. As
noted earlier, on a global level, insurance premiums comprised about 6.2% of
the world GDP in 2014 (Insurance Information Institute, 2016). As examples,
premiums accounted for 18.9% of GDP in Taiwan (the highest in the study)
and represented 7.3% of GDP in the United States. On a personal level, almost
everyone owning a home has insurance to protect themselves in the event of a
fire, hailstorm, or some other calamitous event. Almost every country requires
insurance for those driving a car. In sum, although not particularly entertaining,
insurance plays an important role in the economies of nations and the lives of
individuals.
Businesses of all kinds use data to understand their customers, manage their
operations, forecast financial trends, and so on. These represent general areas of activities
that are not specific to the insurance industry. Although each industry has its
own data nuances and needs, the collection, analysis and use of data is an ac-
tivity shared by all, from the internet giants to a small business, by public and
governmental organizations, and is not specific to the insurance industry. You
will find that the data collection and analysis methods and tools introduced in
this text are relevant for all.
In any data-driven industry, analytics is a key to deriving and extracting in-
formation from data. But what is analytics? Making data-driven business
decisions has been described as business analytics, business intelligence, and
data science. These terms, among others, are sometimes used interchangeably
and sometimes refer to distinct applications. Business intelligence may focus
on processes of collecting data, often through databases and data warehouses,
whereas business analytics utilizes tools and methods for statistical analyses of
data. In contrast to these two terms that emphasize business applications, the
term data science can encompass broader data related applications in many sci-
entific domains. For our purposes, we use the term analytics to refer to the
process of using data to make decisions. This process involves gathering data,
understanding concepts and models of uncertainty, making general inferences,
and communicating results.
When introducing data methods in this text, we focus on losses that arise from,
or related to, obligations in insurance contracts. This could be the amount of
damage to one’s apartment under a renter’s insurance agreement, the amount
needed to compensate someone that you hurt in a driving accident, and the
like. We call this type of obligation an insurance claim. With this focus, we
are able to introduce and directly use generally applicable statistical tools and
techniques.
(Figure: a timeline marking the occurrence of insured events at times t1, …, t6.)
Armed with insurance data, the end goal is to use data to make decisions. We
will learn more about methods of analyzing and extrapolating data in future
chapters. To begin, let us think about why we want to do the analysis. We
take the insurance company’s viewpoint (not the insured person) and introduce
ways of bringing money in, paying it out, managing costs, and making sure
that we have enough money to meet obligations. The emphasis is on insurance-
specific operations rather than on general business activities such as advertising,
marketing, and human resources management.
Specifically, in many insurance companies, it is customary to aggregate detailed
insurance processes into larger operational units; many companies use these
functional areas to segregate employee activities and areas of responsibilities.
Actuaries, other financial analysts, and insurance regulators work within these
units and use data for the following activities:
1. Initiating Insurance. At this stage, the company makes a decision as
to whether or not to take on a risk (the underwriting stage) and assign
an appropriate premium (or rate). Insurance analytics has its actuarial
roots in ratemaking, where analysts seek to determine the right price for
the right risk.
2. Renewing Insurance. Many contracts, particularly in general insurance,
have relatively short durations such as 6 months or a year. Although
there is an implicit expectation that such contracts will be renewed, the
insurer has the opportunity to decline coverage and to adjust the premium.
Analytics is also used at this policy renewal stage where the goal is to retain
profitable customers.
3. Claims Management. Analytics has long been used in (1) detecting
and preventing claims fraud, (2) managing claim costs, including identi-
fying the appropriate support for claims handling expenses, as well as (3)
understanding excess layers for reinsurance and retention.
4. Loss Reserving. Analytic tools are used to provide management with an
appropriate estimate of future obligations and to quantify the uncertainty
of those estimates.
5. Solvency and Capital Allocation. Deciding on the requisite amount
of capital and on ways of allocating capital among alternative investments
are also important analytics activities. Companies must understand how
much capital is needed so that they have sufficient flow of cash available
to meet their obligations at the times they are expected to materialize
(solvency). This is an important question that concerns not only company
managers but also customers, company shareholders, regulatory authori-
ties, as well as the public at large. Related to issues of how much capital is
needed is the question of how to allocate capital among differing financial projects.
Claims history can provide information about a policyholder’s risk appetite. For
example, in personal lines it is common to use a variable to indicate whether
or not a claim has occurred in the last three years. As another example, in
a commercial line such as worker’s compensation, one may look to a policy-
holder’s average claim frequency or severity over the last three years. Claims
history can reveal information that is otherwise hidden (to the insurer) about
the policyholder.
Insurance managers sometimes use the phrase claims leakage to mean dollars
lost through claims management inefficiencies. There are many ways in which
analytics can help manage the claims process; cf. Gorman and Swenson (2013).
Historically, the most important has been fraud detection. The claim adjusting
process involves reducing information asymmetry (the claimant knows what
happened; the company knows some of what happened). Mitigating fraud is an
important part of the claims management process.
Fraud detection is only one aspect of managing claims. More broadly, one can
think about claims management as consisting of the following components:
• Claims triage. Complex claims benefit from early intervention, which can
mitigate the claim, the medical treatment, and the overall costs with an earlier
return-to-work.
• Claims processing. The goal is to use analytics to identify routine situa-
tions that are anticipated to have small payouts. More complex situations
may require more experienced adjusters and legal assistance to appropri-
ately handle claims with high potential payouts.
• Adjustment decisions. Once a complex claim has been identified and
assigned to an adjuster, analytic driven routines can be established to aid
subsequent decision-making processes. Such processes can also be helpful
for adjusters in developing case reserves, an estimate of the insurer’s future
liability. This is an important input to the insurer’s loss reserves, described
in Section 1.2.4.
In addition to the insured’s reimbursement for losses, the insurer also needs to be
concerned with another source of revenue outflow, expenses. Loss adjustment
expenses are part of an insurer’s cost of managing claims. Analytics can be
used to reduce expenses directly related to claims handling (allocated) as well
as general staff time for overseeing the claims processes (unallocated). The
insurance industry has high operating costs relative to other portions of the
financial services sectors.
In addition to claims payments, there are many other ways in which insurers
use data to manage their products. We have already discussed the need for
analytics in underwriting, that is, risk classification at the initial acquisition
and renewal stages. Insurers are also interested in which policyholders elect to
renew their contracts and, as with other products, monitor customer loyalty.
Analytics can also be used to manage the portfolio, or collection, of risks that an
insurer has acquired. As described in Chapter 10, after the contract has been
agreed upon with an insured, the insurer may still modify its net obligation
by entering into a reinsurance agreement. This type of agreement is with a
reinsurer, an insurer of an insurer. It is common for insurance companies to
purchase insurance on their portfolios of risks to gain protection from unusual
events, just as people and other companies do.
Setting aside money for unpaid claims is known as loss reserving; in some juris-
dictions, reserves are also known as technical provisions. We saw in Figure 1.1
several times at which a company summarizes its financial position; these times
are known as valuation dates. Claims that arise prior to valuation dates have
either been paid, are in the process of being paid, or are about to be paid; claims
in the future of these valuation dates are unknown. A company must estimate
these outstanding liabilities when determining its financial strength. Accurately
determining loss reserves is important to insurers for many reasons.
1. Loss reserves represent an anticipated claim that the insurer owes its cus-
tomers. Under-reserving may result in a failure to meet claim liabilities.
Conversely, an insurer with excessive reserves may present a conservative
estimate of surplus and thus portray a weaker financial position than it
truly has.
2. Reserves provide an estimate for the unpaid cost of insurance that can be
used for pricing contracts.
3. Loss reserving is required by laws and regulations. The public has a strong
interest in the financial strength and solvency of insurers.
4. In addition to regulators, other stakeholders such as insurance company
management, investors, and customers make decisions that depend on
company loss reserves. Whereas regulators and customers appreciate con-
servative estimates of unpaid claims, managers and investors seek more
unbiased estimates to represent the true financial health of the company.
Loss reserving is a topic where there are substantive differences between life
and general (also known as property and casualty, or non-life) insurance. In life
insurance, the severity (amount of loss) is often not a source of uncertainty as
payouts are specified in the contract. The frequency, driven by mortality of the
insured, is a concern. However, because of the lengthy time for settlement of
life insurance contracts, the time value of money uncertainty as measured from
issue to date of payment can dominate frequency concerns. For example, for
an insured who purchases a life contract at age 20, it would not be unusual for
the contract to still be open in 60 years' time, when the insured celebrates his
or her 80th birthday. See, for example, Bowers et al. (1986) or Dickson et al.
(2013) for introductions to reserving for life insurance. In contrast, for most
lines of non-life business, severity is a major source of uncertainty and contract
durations tend to be shorter.
In this section, we use the Wisconsin Property Fund as a case study. You learn
how to:
• Describe how data generating events can produce data of interest to in-
surance analysts.
Let us illustrate the kind of data under consideration and the goals that we
wish to achieve by examining the Local Government Property Insurance Fund
(LGPIF), an insurance pool administered by the Wisconsin Office of the Insur-
ance Commissioner. The LGPIF was established to provide property insurance
for local government entities that include counties, cities, towns, villages, school
districts, and library boards. The fund insures local government property such
as government buildings, schools, libraries, and motor vehicles. It covers all
property losses except those resulting from flood, earthquake, wear and tear,
extremes in temperature, mold, war, nuclear reactions, and embezzlement or
theft by an employee.
The fund covers over a thousand local government entities who pay approxi-
mately 25 million dollars in premiums each year and receive insurance coverage
of about 75 billion dollars. State government buildings are not covered; the LGPIF is
for local government entities that have separate budgetary responsibilities and
who need insurance to moderate the budget effects of uncertain insurable events.
Coverage for local government property has been made available by the State
of Wisconsin since 1911, thus providing a wealth of historical data.
In this illustration, we restrict consideration to claims from coverage of building
and contents; we do not consider claims from motor vehicles and specialized
equipment owned by local entities (such as snow plowing machines). We also
consider only claims that are closed, with obligations fully met.
To illustrate, in 2010 there were 1,110 policyholders in the property fund who
experienced a total of 1,377 claims. Table 1.1 shows the distribution. Almost
two-thirds (0.637) of the policyholders did not have any claims and an additional
18.8% had only one claim. The remaining 17.5% (=1 - 0.637 - 0.188) had more
than one claim; the policyholder with the highest number recorded 239 claims.
The average number of claims for this sample was 1.24 (=1377/1110).
Table 1.1. 2010 Claims Frequency Distribution

Number        0      1      2      3      4      5      6      7      8   9 or more    Sum
Policies    707    209     86     40     18     12      9      4      6      19      1,110
Claims        0    209    172    120     72     60     54     28     48     617      1,377
Proportion  0.637  0.188  0.077  0.036  0.016  0.011  0.008  0.004  0.005  0.017     1.000
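Summaries like this are straightforward to reproduce. Below is a minimal R sketch that recomputes the proportions and the average claim count directly from the entries of Table 1.1; the variable names are our own.

```r
# Claim-count categories and policy counts, taken from Table 1.1; the 19
# policyholders with "9 or more" claims are kept as a grouped category.
n_claims   <- 0:8
n_policies <- c(707, 209, 86, 40, 18, 12, 9, 4, 6)
n_policies_9plus <- 19
total_policies <- sum(n_policies) + n_policies_9plus        # 1,110
round(c(n_policies, n_policies_9plus) / total_policies, 3)  # proportions row
1377 / total_policies                                       # average claims, 1.24
```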
Minimum   First Quartile   Median   Mean     Third Quartile   Maximum
167       2,226            4,951    56,330   11,900           12,920,000
(Figure: histograms of claim severity, in dollars and on a logarithmic scale; vertical axes show frequency.)
Table 1.3 shows that the average claim varies over time, especially with the high
2010 value (that we saw was due to a single large claim).1 The total number
of policyholders is steadily declining and, conversely, the coverage is steadily
increasing. The coverage variable is the amount of coverage of the property
and contents. Roughly, you can think of it as the maximum possible payout
of the insurer. For our immediate purposes, the coverage is our first rating
variable. Other things being equal, we would expect that policyholders with
larger coverage have larger claims. We will make this vague idea much more
precise as we proceed, and also justify this expectation with data.
For a different look at the 2006-2010 data, Table 1.4 summarizes the distribution
of our two outcomes, frequency and claims amount. In each case, the average
exceeds the median, suggesting that the two distributions are right-skewed. In
addition, the table summarizes our continuous rating variables, coverage and
deductible amount. The table also suggests that these variables also have right-
skewed distributions.
Table 1.4. Summary of Claim Frequency and Severity, Deductibles,
and Coverages
Table 1.5 describes the rating variables considered in this chapter. Hopefully,
these are variables that you think might naturally be related to claims outcomes.
You can learn more about them in Frees et al. (2016). To handle the skewness,
we henceforth focus on logarithmic transformations of coverage and deductibles.
1 Note that the average severity in Table 1.3 differs from that reported in Table 1.2. This is
because the former includes policyholders with zero claims whereas the latter does not. This
is an important distinction that we will address in later portions of the text.
Table 1.5. Description of Rating Variables

Variable        Description
EntityType      Categorical variable that is one of six types: (Village, City, County, Misc, School, or Town)
LnCoverage      Total building and content coverage, in logarithmic millions of dollars
LnDeduct        Deductible, in logarithmic dollars
AlarmCredit     Categorical variable that is one of four types: (0, 5, 10, or 15), for automatic smoke alarms in main rooms
NoClaimCredit   Binary variable to indicate no claims in the past two years
Fire5           Binary variable to indicate the fire class is below 5 (the range of fire class is 0 to 10)
Table 1.7 shows the claims experience by alarm credit. It underscores the dif-
ficulty of examining variables individually. For example, when looking at the
experience for all entities, we see that policyholders with no alarm credit have
on average lower frequency and severity than policyholders with the highest
(15%, with 24/7 monitoring by a fire station or security company) alarm credit.
In particular, when we look at the entity type School, the frequency is 0.422
and the severity 25,523 for no alarm credit, whereas for the highest alarm level
it is 2.008 and 85,140, respectively. This may simply imply that entities with
more claims are the ones that are likely to have an alarm system. Summary
tables do not examine multivariate effects; for example, Table 1.6 ignores the
effect of size (as we measure through coverage amounts) that affect claims.
Table 1.7. Claims Summary by Entity Type and Alarm Credit (AC)
In addition to the continuous rating variables (coverage and deductible), we have
binary rating variables (no claims credit and fire class) and two categorical rating
variables (entity type and alarm credit). Subsequent chapters will explain how to analyze and model
the distribution of these variables and their relationships. Before getting into
these technical details, let us first think about where we want to go. General in-
surance company functional areas are described in Section 1.2; we now consider
how these areas might apply in the context of the property fund.
Initiating Insurance
Because this is a government sponsored fund, we do not have to worry about
selecting good or avoiding poor risks; the fund is not allowed to deny a cover-
age application from a qualified local government entity. If we do not have to
underwrite, what about how much to charge?
We might look at the most recent experience in 2010, where the total fund claims
were approximately 28.16 million USD (= 1377 claims×20452 average severity).
Dividing that among 1,110 policyholders suggests a rate of 25,370 (≈
28,160,000/1110). However, 2010 was a bad year; using the same method, our
premium would be much lower based on 2009 data. This swing in premiums
would defeat the primary purpose of the fund, to allow for a steady charge that
local property managers could utilize in their budgets.
Having a single price for all policyholders is nice but hardly seems fair. For
example, Table 1.6 suggests that schools have higher aggregate claims than
other entities and so should pay more. However, simply doing the calculation
on an entity by entity basis is not right either. For example, we saw in Table
1.7 that had we used this strategy, entities with a 15% alarm credit (for good
behavior, having top alarm systems) would actually wind up paying more.
So, we have the data for thinking about the appropriate rates to charge but need
to dig deeper into the analysis. We will explore this topic further in Chapter 7
on premium calculation fundamentals. Selecting appropriate risks is introduced
in Chapter 8 on risk classification.
Renewing Insurance
Although property insurance is typically a one-year contract, Table 1.3 suggests
that policyholders tend to renew; this is typical of general insurance. For re-
newing policyholders, in addition to their rating variables we have their claims
history and this claims history can be a good predictor of future claims. For
example, Table 1.6 shows that policyholders without a claim in the last two
years had much lower claim frequencies than those with at least one accident
(0.310 compared to 1.501); a lower predicted frequency typically results in a
lower premium. This is why it is common for insurers to use variables such as
NoClaimCredit in their rating. We will explore this topic further in Chapter 9
on experience rating.
Claims Management
Of course, the main story line of the 2010 experience was the large claim of over
12 million USD, nearly half the amount of claims for that year. Are there ways
that this could have been prevented or mitigated? Are there ways for the fund to
purchase protection against such large unusual events? Another unusual feature
of the 2010 experience noted earlier was the very large frequency of claims (239)
for one policyholder. Given that there were only 1,377 claims that year, this
means that a single policyholder had 17.4% of the claims. These extreme
features of the data suggest opportunities for managing claims, the subject of
Chapter 10.
Loss Reserving
In our case study, we look only at the one year outcomes of closed claims (the op-
posite of open). However, like many lines of insurance, obligations from insured
events to buildings such as fire, hail, and the like, are not known immediately
and may develop over time. Other lines of business, including those where there
are injuries to people, take much longer to develop. Chapter 11 introduces this
concern and loss reserving, the discipline of determining how much the insurance
company should retain to meet its obligations.
Chapter 2

Frequency Modeling
A final section concludes the chapter with R code for the plots depicted in Section 2.4.

2.1 Frequency Distributions
In this section, you learn how to summarize the importance of frequency mod-
eling in terms of
• contractual,
• behavioral,
• database, and
• regulatory/administrative motivations.
The expected cost for insurance can be determined as the expected number of
claims times the amount per claim, that is, expected value of frequency times
severity. The focus on claim count allows the insurer to consider those factors
which directly affect the occurrence of a loss, thereby potentially generating a
claim.
For these reasons, among others, it is common in insurance analytics to model
frequency and severity as separate processes.

2.2 Basic Frequency Distributions
In this section, we introduce the distributions that are commonly used in actu-
arial practice to model count data. The claim count random variable is denoted
by 𝑁 ; by its very nature it assumes only non-negative integer values. Hence
the distributions below are all discrete distributions supported on the set of
non-negative integers {0, 1, …}.
2.2.1 Foundations
Since 𝑁 is a discrete random variable taking values in {0, 1, …}, the most natural
full description of its distribution is through the specification of the probabilities
with which it assumes each of the non-negative integer values. This leads us to
the concept of the probability mass function (pmf) of $N$, denoted as $p_N(\cdot)$ and
defined as follows:

$$p_N(k) = \Pr(N = k), \qquad k = 0, 1, \ldots$$

The distribution function of $N$, denoted by $F_N(\cdot)$, is then given by

$$F_N(x) = \begin{cases} \sum_{k=0}^{\lfloor x \rfloor} \Pr(N = k), & x \ge 0; \\ 0, & \text{otherwise.} \end{cases}$$
In the above, ⌊⋅⌋ denotes the floor function; ⌊𝑥⌋ denotes the greatest integer
less than or equal to 𝑥. This expression also suggests the descriptor cumula-
tive distribution function, a commonly used alternative way of expressing the
distribution function. We also note that the survival function of 𝑁 , denoted
by $S_N(\cdot)$, is defined as the complement of $F_N(\cdot)$, i.e., $S_N(\cdot) = 1 - F_N(\cdot)$.
Clearly, the latter is another characterization of the distribution of 𝑁 .
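The relationships among these three functions are easy to check numerically. The following R sketch illustrates them, using a Poisson pmf with mean 2 purely as a concrete example (the choice of distribution is ours, not prescribed by the text):

```r
# pmf, distribution function, and survival function for a discrete count
# variable, illustrated with a Poisson(2); any count pmf would do.
k    <- 0:10
pmf  <- dpois(k, lambda = 2)           # p_N(k) = Pr(N = k)
cdf  <- cumsum(pmf)                    # F_N(k), summing the pmf up to k
surv <- 1 - cdf                        # S_N(k) = 1 - F_N(k)
all.equal(cdf, ppois(k, lambda = 2))   # consistency check: TRUE
```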
Often one is interested in quantifying a certain aspect of the distribution and
not in its complete description. This is particularly useful when comparing
distributions. A center of location of the distribution is one such aspect, and
there are many different measures that are commonly used to quantify it. Of
these, the mean is the most popular; the mean of $N$, denoted by $\mu_N$, is defined
as

$$\mu_N = \sum_{k=0}^{\infty} k\, p_N(k).$$
We note that 𝜇𝑁 is the expected value of the random variable 𝑁 , i.e. 𝜇𝑁 = E[𝑁 ].
This leads to a general class of measures, the moments of the distribution; the
𝑟-th raw moment of 𝑁 , for 𝑟 > 0, is defined as E[𝑁 𝑟 ] and denoted by 𝜇′𝑁 (𝑟).
We remark that the prime ′ here does not denote differentiation. Rather, it is
commonly used notation to distinguish a raw moment from a central moment, as will be
introduced in Section 3.1.1. For $r > 0$, we have

$$\mu'_N(r) = \mathrm{E}[N^r] = \sum_{k=0}^{\infty} k^r\, p_N(k).$$
The second central moment, $\mathrm{E}[(N - \mu_N)^2]$, is the variance, denoted by $\mathrm{Var}[N]$;
its square root is the standard deviation, $\sqrt{\mathrm{Var}[N]}$. Note that the variance is
well defined as, by its definition as the average squared deviation from the mean,
it is non-negative; $\mathrm{Var}[N]$ is denoted by $\sigma^2_N$. Note that these two measures take
values in $[0, \infty]$.
2.2.2 Moment and Probability Generating Functions

The moment generating function (mgf) of $N$, denoted by $M_N(\cdot)$, is defined as
$M_N(t) = \mathrm{E}[e^{tN}]$ for all $t$ for which the expectation is finite. The following
theorem records why the mgf is such a useful tool.

Theorem 2.1. Let $N$ be a count random variable such that $\mathrm{E}[e^{t^* N}]$ is finite for some $t^* > 0$.
We have the following:

a. All moments of $N$ are finite, i.e., $\mathrm{E}[N^r] < \infty$, $r \ge 0$.

b. The mgf uniquely specifies the distribution of $N$.

c. The moments can be recovered from the derivatives of the mgf at zero:

$$\left.\frac{\mathrm{d}^m}{\mathrm{d}t^m} M_N(t)\right|_{t=0} = \mathrm{E}[N^m], \qquad m \ge 1.$$
Another reason that the mgf is very useful as a tool is that for two independent
random variables 𝑋 and 𝑌 , with their mgfs existing in a neighborhood of 0,
the mgf of 𝑋 + 𝑌 is the product of their respective mgfs, that is, 𝑀𝑋+𝑌 (𝑡) =
𝑀𝑋 (𝑡)𝑀𝑌 (𝑡), for small 𝑡.
A generating function related to the mgf is the probability generating function
(pgf), a useful tool for random variables taking values in the non-negative
integers. For a random variable $N$, we denote its pgf by $P_N(\cdot)$ and define it
as follows2:

$$P_N(s) = \mathrm{E}[s^N], \qquad s \ge 0.$$

2 Here we use the convention $0^0 = 1$.
Moreover, if the pgf exists on an interval [0, 𝑠∗ ) with 𝑠∗ > 1, then the mgf
𝑀𝑁 (⋅) exists on (−∞, log(𝑠∗ )), and hence uniquely specifies the distribution of
𝑁 by Theorem 2.1. (As a reminder, throughout this text we use log as the
natural logarithm, not the base ten (common) logarithm or other version.) The
following result for pgf is an analog of Theorem 2.1, and in particular justifies
its name.
Theorem 2.2. Let $N$ be a count random variable such that $\mathrm{E}[(s^*)^N]$ is finite
for some $s^* > 1$. We have the following:

a. All moments of $N$ are finite, i.e., $\mathrm{E}[N^r] < \infty$, $r \ge 0$.

b. The pmf can be recovered from the pgf:

$$p_N(m) = \begin{cases} P_N(0), & m = 0; \\ \left(\dfrac{1}{m!}\right) \left.\dfrac{\mathrm{d}^m}{\mathrm{d}s^m} P_N(s)\right|_{s=0}, & m \ge 1. \end{cases}$$

2.2.3 Important Frequency Distributions
Binomial Distribution
We begin with the binomial distribution which arises from any finite sequence
of identical and independent experiments with binary outcomes. The most
canonical of such experiments is the (biased or unbiased) coin tossing experiment
with the outcome being heads or tails. So if 𝑁 denotes the number of heads
in a sequence of 𝑚 independent coin tossing experiments with an identical coin
which turns heads up with probability 𝑞, then the distribution of 𝑁 is called
the binomial distribution with parameters (𝑚, 𝑞), with 𝑚 a positive integer
and 𝑞 ∈ [0, 1]. Note that when 𝑞 = 0 (resp., 𝑞 = 1) then the distribution is
degenerate with 𝑁 = 0 (resp., 𝑁 = 𝑚) with probability 1. Clearly, its support
when $q \in (0, 1)$ equals $\{0, 1, \ldots, m\}$ with pmf given by3

$$p_k = \binom{m}{k} q^k (1-q)^{m-k}, \qquad k = 0, \ldots, m,$$

where

$$\binom{m}{k} = \frac{m!}{k!\,(m-k)!}.$$

3 In the following we suppress the reference to $N$ and denote the pmf by the sequence $\{p_k\}_{k \ge 0}$.
The reason for its name is that the pmf takes values among the terms that arise
from the binomial expansion of $(q + (1 - q))^m$. This realization then leads to
the following expression for the pgf of the binomial distribution:

$$P_N(z) = \sum_{k=0}^{m} z^k \binom{m}{k} q^k (1-q)^{m-k} = \sum_{k=0}^{m} \binom{m}{k} (zq)^k (1-q)^{m-k} = (1 + q(z-1))^m.$$
Note that the above expression for the pgf confirms the fact that the binomial
distribution is the m-convolution of the Bernoulli distribution, which is the
binomial distribution with 𝑚 = 1 and pgf (1 + 𝑞(𝑧 − 1)). By “m-convolution,”
we mean that we can write 𝑁 as the sum of 𝑁1 , … , 𝑁𝑚 . Here, 𝑁𝑖 are iid
Bernoulli variates. Also, note that the mgf of the binomial distribution is given
by (1 + 𝑞(𝑒𝑡 − 1))𝑚 .
The mean and variance of the binomial distribution can be found in a few
different ways. To emphasize the key property that it is a 𝑚-convolution of
the Bernoulli distribution, we derive below the moments using this property.
We begin by observing that the Bernoulli distribution with parameter 𝑞 assigns
probability of 𝑞 and 1 − 𝑞 to 1 and 0, respectively. So its mean equals 𝑞 (=
0 × (1 − 𝑞) + 1 × 𝑞); note that its raw second moment equals its mean as 𝑁 2 = 𝑁
with probability 1. Using these two facts we see that the variance equals 𝑞(1−𝑞).
Moving on to the binomial distribution with parameters 𝑚 and 𝑞, using the fact
that it is the 𝑚-convolution of the Bernoulli distribution, we write 𝑁 as the
sum of 𝑁1 , … , 𝑁𝑚 , where 𝑁𝑖 are iid Bernoulli variates, as above. Now using
the moments of Bernoulli and linearity of the expectation, we see that
$$\mathrm{E}[N] = \mathrm{E}\left[\sum_{i=1}^{m} N_i\right] = \sum_{i=1}^{m} \mathrm{E}[N_i] = mq.$$
Also, using the fact that the variance of the sum of independent random
variables is the sum of their variances, we see that
$$\mathrm{Var}[N] = \mathrm{Var}\left[\sum_{i=1}^{m} N_i\right] = \sum_{i=1}^{m} \mathrm{Var}[N_i] = mq(1-q).$$
Alternate derivations of the above moments are suggested in the exercises. One
important observation, especially from the point of view of applications, is that
the mean is greater than the variance unless 𝑞 = 0.
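As a numerical check of these moment formulas, the R sketch below computes $\mathrm{E}[N]$ and $\mathrm{Var}[N]$ directly from the binomial pmf; the parameter values $m = 10$ and $q = 0.3$ are our own illustrative choices.

```r
# Check E[N] = mq and Var[N] = mq(1-q) from the binomial pmf.
m <- 10; q <- 0.3
k <- 0:m
pmf <- dbinom(k, size = m, prob = q)
mean_N <- sum(k * pmf)                  # should equal mq = 3
var_N  <- sum(k^2 * pmf) - mean_N^2     # should equal mq(1-q) = 2.1
c(mean_N, m * q, var_N, m * q * (1 - q))
```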
Poisson Distribution
After the binomial distribution, the Poisson distribution (named after the French
polymath Siméon Denis Poisson) is probably the most well known of discrete
distributions. This is partly due to the fact that it arises naturally as the
distribution of the count of the random occurrences of a type of event in a
certain time period, if the rate of occurrences of such events is a constant. It
also arises as the asymptotic limit of the binomial distribution with 𝑚 → ∞
and 𝑚𝑞 → 𝜆.
The Poisson distribution is parametrized by a single parameter usually denoted
by $\lambda$, which takes values in $(0, \infty)$. Its pmf is given by

$$p_k = \frac{e^{-\lambda} \lambda^k}{k!}, \qquad k = 0, 1, \ldots$$
It is easy to check that the above specifies a pmf as the terms are clearly non-
negative, and that they sum to one follows from the infinite Taylor series expan-
sion of $e^\lambda$. More generally, we can derive its pgf, $P_N(\cdot)$, as follows:

$$P_N(z) = \sum_{k=0}^{\infty} p_k z^k = \sum_{k=0}^{\infty} \frac{e^{-\lambda} \lambda^k z^k}{k!} = e^{-\lambda} e^{\lambda z} = e^{\lambda(z-1)}, \qquad \forall z \in \mathbb{R}.$$
Towards deriving its mean, we note that for the Poisson distribution

$$k\, p_k = \begin{cases} 0, & k = 0; \\ \lambda\, p_{k-1}, & k \ge 1. \end{cases}$$

This yields

$$\mathrm{E}[N] = \sum_{k \ge 0} k\, p_k = \lambda \sum_{k \ge 1} p_{k-1} = \lambda \sum_{j \ge 0} p_j = \lambda.$$
In fact, more generally, using either a generalization of the above or using The-
orem 2.1, we see that
$$\mathrm{E}\left[\prod_{i=0}^{m-1} (N - i)\right] = \left.\frac{\mathrm{d}^m}{\mathrm{d}s^m} P_N(s)\right|_{s=1} = \lambda^m, \qquad m \ge 1.$$
Negative Binomial Distribution

The third important count distribution is the negative binomial. To introduce
it, recall the binomial series:

$$(1 + x)^s = 1 + sx + \frac{s(s-1)}{2!} x^2 + \cdots, \qquad s \in \mathbb{R};\ |x| < 1.$$
If we define $\binom{s}{k}$, the generalized binomial coefficient, by

$$\binom{s}{k} = \frac{s(s-1)\cdots(s-k+1)}{k!},$$
then we have

$$(1 + x)^s = \sum_{k=0}^{\infty} \binom{s}{k} x^k, \qquad s \in \mathbb{R};\ |x| < 1.$$
If we let $s = -r$, then we see that the above yields

$$(1 - x)^{-r} = 1 + rx + \frac{(r+1)r}{2!} x^2 + \cdots = \sum_{k=0}^{\infty} \binom{r+k-1}{k} x^k, \qquad r \in \mathbb{R};\ |x| < 1.$$
In particular, the binomial series shows that if we define

$$p_k = \binom{r+k-1}{k} \left(\frac{1}{1+\beta}\right)^r \left(\frac{\beta}{1+\beta}\right)^k, \qquad k = 0, 1, \ldots,$$

for $r > 0$ and $\beta \ge 0$, then it defines a valid pmf. The distribution so defined is
called the negative binomial distribution with parameters $(r, \beta)$ with $r > 0$ and
$\beta \ge 0$. Moreover, the binomial series also implies that the pgf of this distribution
is given by

$$P_N(z) = (1 - \beta(z-1))^{-r}, \qquad |z| < 1 + \frac{1}{\beta},\ \beta \ge 0.$$
We note that when 𝛽 > 0, we have Var[𝑁 ] > E[𝑁 ]. In other words, this
distribution is overdispersed (relative to the Poisson); similarly, when 𝑞 > 0 the
binomial distribution is said to be underdispersed (relative to the Poisson).
Finally, we observe that the Poisson distribution also emerges as a limit of
negative binomial distributions. Towards establishing this, let 𝛽𝑟 be such that
as $r$ approaches infinity $r\beta_r$ approaches $\lambda > 0$. Then the mgfs of
negative binomial distributions with parameters $(r, \beta_r)$ satisfy

$$\lim_{r \to \infty} \left(1 - \beta_r (e^t - 1)\right)^{-r} = e^{\lambda(e^t - 1)},$$

with the right hand side of the above equation being the mgf of the Poisson
distribution with parameter $\lambda$.4
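This convergence is easy to see numerically. In R, dnbinom is parametrized by size = r and prob = 1/(1 + β), which matches the pmf above; the sketch below holds rβ_r = λ fixed and lets r grow, with λ = 2 an illustrative choice of ours.

```r
# Negative binomial pmfs with r * beta fixed at lambda approach the Poisson.
lambda <- 2
k <- 0:10
for (r in c(2, 20, 200)) {
  beta <- lambda / r
  nb <- dnbinom(k, size = r, prob = 1 / (1 + beta))
  cat("r =", r, " max |NB - Poisson| =", max(abs(nb - dpois(k, lambda))), "\n")
}
```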
2.3 The (a, b, 0) Class

In the previous section we studied three distributions, namely the binomial, the
Poisson and the negative binomial distributions. In the case of the Poisson, to
derive its mean we used the fact that

$$k\, p_k = \lambda\, p_{k-1}, \qquad k \ge 1,$$

which can be rewritten as

$$\frac{p_k}{p_{k-1}} = \frac{\lambda}{k}, \qquad k \ge 1.$$
4 For the theoretical basis underlying the above argument, see Billingsley (2008).
Interestingly, all three of these distributions satisfy, for appropriate constants
$a$ and $b$, the recurrence relation

$$\frac{p_k}{p_{k-1}} = a + \frac{b}{k}, \qquad k \ge 1; \tag{2.1}$$

this raises the question of whether there are any other distributions that satisfy
this seemingly general recurrence relation. Note that the ratio on the left, being
the ratio of two probabilities, is non-negative.
From the above development we see that not only does the recurrence (2.1) tie
these three distributions together, but also it characterizes them. For this reason
these three distributions are collectively referred to in the actuarial literature
as the (a,b,0) class of distributions, with 0 referring to the starting point of the
recurrence. Note that the value of 𝑝0 is implied by (𝑎, 𝑏) since the probabilities
have to sum to one. Of course, (2.1) as a recurrence relation for 𝑝𝑘 makes the
computation of the pmf efficient by removing redundancies. Later, we will see
that it does so even in the case of compound distributions with the frequency
distribution belonging to the (𝑎, 𝑏, 0) class - this fact is the more important
motivating reason to study these three distributions from this viewpoint.
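The recursion is easy to verify numerically. For the Poisson we saw that $a = 0$ and $b = \lambda$; the R sketch below checks (2.1) for an illustrative $\lambda = 3$ (the same check works for the binomial and negative binomial with their own $(a, b)$ values).

```r
# Verify the (a,b,0) recursion p_k / p_{k-1} = a + b/k for a Poisson,
# for which a = 0 and b = lambda; lambda = 3 is illustrative.
lambda <- 3
k <- 1:10
ratios <- dpois(k, lambda) / dpois(k - 1, lambda)
all.equal(ratios, 0 + lambda / k)   # TRUE
```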
Example 2.3.1. A discrete probability distribution has the following properties:

$$p_k = c \left(1 + \frac{2}{k}\right) p_{k-1}, \qquad k = 1, 2, 3, \ldots$$

$$p_1 = \frac{9}{256}.$$

Determine the expected value of this discrete random variable.
Solution: Since the pmf satisfies the (𝑎, 𝑏, 0) recurrence relation we know that
the underlying distribution is one among the binomial, Poisson, and negative
binomial distributions. Since the ratio of the parameters (i.e. 𝑏/𝑎) equals 2, we
know that it is negative binomial and that $r = 3$. Moreover, since for a negative
binomial $p_1 = r\beta(1+\beta)^{-(r+1)}$, we have

$$\frac{9}{256} = 3\,\frac{\beta}{(1+\beta)^4} \implies \frac{3}{(1+3)^4} = \frac{\beta}{(1+\beta)^4} \implies \beta = 3.$$

Finally, since the mean of a negative binomial is $r\beta$, the mean of the given
distribution equals $9$.
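The solution can also be checked numerically. With $r = 3$ and $\beta = 3$, the negative binomial has $p_0 = (1+\beta)^{-r} = 1/64$ and the recursion constant is $c = \beta/(1+\beta) = 0.75$; the R sketch below runs the recursion and sums $k\,p_k$ (truncating at $k = 100$, which is harmless here given the geometric decay of the tail).

```r
# Numerical check of Example 2.3.1: p_k = c (1 + 2/k) p_{k-1},
# with c = beta/(1+beta) = 0.75 and p_0 = (1+beta)^(-r) = 1/64.
c0 <- 0.75
p  <- numeric(101)
p[1] <- 1 / 64                         # p[k+1] stores p_k
for (k in 1:100) p[k + 1] <- c0 * (1 + 2 / k) * p[k]
p[2]                                   # p_1 = 9/256 = 0.03515625
sum((0:100) * p)                       # mean, approximately r * beta = 9
```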
2.4 Estimating Frequency Distributions

In this section we discuss fitting the frequency distributions introduced above
via maximum likelihood. Recall that for the Poisson distribution with mean $\theta$,
the pmf is

$$p_\theta(x) = e^{-\theta} \frac{\theta^x}{x!}, \qquad x = 0, 1, \ldots,$$

while for the binomial distribution with parameters $m$ and $q$, the pmf is

$$p_\theta(x) = \binom{m}{x} q^x (1-q)^{m-x}, \qquad x = 0, 1, \ldots, m.$$
The maximum likelihood estimator (mle) for $\theta$ is any maximizer of the likeli-
hood; in a sense, the mle chooses the set of parameter values that best explains
the observed data. Appendix Section 15.2.2 reviews the foundations of
maximum likelihood estimation with more mathematical details in Appendix
Chapter 17.
Special Case: Three Bernoulli Outcomes. To illustrate, consider a sample
of size 𝑛 = 3 from a Bernoulli distribution (binomial with 𝑚 = 1) with values
0, 1, 0. The likelihood in this case is easily checked to equal
$$L(q) = q(1-q)^2,$$
and the plot of the likelihood is given in Figure 2.1. As shown in the plot, the
maximum value of the likelihood equals 4/27 and is attained at 𝑞 = 1/3, and
hence the maximum likelihood estimate for 𝑞 is 1/3 for the given sample. In
this case one can resort to algebra to show that
$$q(1-q)^2 = \left(q - \frac{1}{3}\right)^2 \left(q - \frac{4}{3}\right) + \frac{4}{27},$$
and conclude that the maximum equals 4/27, and is attained at 𝑞 = 1/3 (using
the fact that the first term is non-positive in the interval [0, 1]).
But as is apparent, this way of deriving the mle using algebra does not gener-
alize. In general, one resorts to calculus to derive the mle - note that for some
likelihoods one may have to resort to other optimization methods, especially
when the likelihood has many local extrema. It is customary to equivalently
maximize the logarithm of the likelihood5 𝐿(⋅), denoted by 𝑙(⋅), and look at
the set of zeros of its first derivative6 𝑙′ (⋅). In the case of the above likelihood,
𝑙(𝑞) = log(𝑞) + 2 log(1 − 𝑞), and
$$l'(q) = \frac{\mathrm{d}}{\mathrm{d}q}\, l(q) = \frac{1}{q} - \frac{2}{1-q}.$$
The unique zero of 𝑙′ (⋅) equals 1/3, and since 𝑙″ (⋅) is negative, we have 1/3 is the
unique maximizer of the likelihood and hence its maximum likelihood estimate.
5 The set of maximizers of 𝐿(⋅) are the same as the set of maximizers of any strictly increas-
ing function of 𝐿(⋅), and hence the same as those for 𝑙(⋅).
6 A slight benefit of working with 𝑙(⋅) is that constant terms in 𝐿(⋅) do not appear in 𝑙′ (⋅).
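The same maximum can be found numerically; a minimal R sketch for the three-observation Bernoulli likelihood, using the built-in optimizer:

```r
# Likelihood of the Bernoulli sample (0, 1, 0): L(q) = q (1 - q)^2.
L <- function(q) q * (1 - q)^2
opt <- optimize(L, interval = c(0, 1), maximum = TRUE)
opt$maximum     # approximately 1/3, the mle of q
opt$objective   # approximately 4/27, the maximum of the likelihood
```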
More generally, suppose the parametric family has pmf $p_\theta(\cdot)$ with parameter
$\theta$ taking values in a set $\Theta$. The likelihood is then

$$L(\theta) = \prod_{i=1}^{n} p_\theta(x_i),$$

where $x_1, \ldots, x_n$ are the observed values. The mle of $\theta$, denoted as $\hat{\theta}_{\mathrm{MLE}}$, is a
function which maps the observations to an element of the set of maximizers of
$L(\cdot)$, namely

$$\{\theta \mid L(\theta) = \max_{\eta \in \Theta} L(\eta)\}.$$
Note the above set is a function of the observations, even though this dependence
is not made explicit. In the case of the three distributions that we study, and
quite generally, the above set is a singleton with probability tending to one (with
increasing sample size). In other words, for many commonly used distributions
and when the sample size is large, the likelihood estimate is uniquely defined
with high probability.
In the following, we assume that we have observed 𝑛 iid random variables
𝑋1 , 𝑋2 , … , 𝑋𝑛 from the distribution under consideration, even though the para-
metric value is unknown. Also, 𝑥1 , 𝑥2 , … , 𝑥𝑛 will denote the observed values.
We note that in the case of count data, and data from discrete distributions in
general, the likelihood can alternately be represented as
$$L(\theta) = \prod_{k \ge 0} \left(p_\theta(k)\right)^{m_k},$$

where $m_k$ denotes the number of observations equal to $k$.
Note that this transformation retains all of the data, compiling it in a stream-
lined manner. For large 𝑛 it leads to compression of the data in the sense of
sufficiency. Below, we present expressions for the mle in terms of {𝑚𝑘 }𝑘≥1 as
well.
Special Case: Poisson Distribution. In this case, as noted above, the
likelihood is given by
$$L(\lambda) = \left(\prod_{i=1}^{n} x_i!\right)^{-1} e^{-n\lambda}\, \lambda^{\sum_{i=1}^{n} x_i}.$$

Maximizing the log-likelihood shows that the mle of $\lambda$ is the sample mean:

$$\hat{\lambda}_{\mathrm{MLE}} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$
Note that the sample mean can also be computed as

$$\bar{x} = \frac{1}{n} \sum_{k \ge 1} k \cdot m_k.$$
It is noteworthy that in the case of the Poisson, the exact distribution of 𝜆̂ MLE is
available in closed form - it is a scaled Poisson - when the underlying distribution
is a Poisson. This is so as the sum of independent Poisson random variables is a
Poisson as well. Of course, for large sample size one can use the ordinary Central
Limit Theorem (CLT) to derive a normal approximation. Note that the latter
approximation holds even if the underlying distribution is any distribution with
a finite second moment.
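In R, fitting the Poisson by maximum likelihood therefore amounts to taking a sample mean. The sketch below uses a simulated count vector; the data, seed, and true λ are our own illustration.

```r
# Poisson mle is the sample mean; x is an illustrative simulated sample.
set.seed(2020)
x <- rpois(100, lambda = 3)
lambda_hat <- mean(x)                  # mle of lambda
# Equivalent computation from the counts m_k of each observed value k:
m_k <- table(x)
sum(as.numeric(names(m_k)) * m_k) / length(x)   # equals lambda_hat
```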
Special Case: Binomial Distribution. Unlike the case of the Poisson distri-
bution, the parameter space in the case of the binomial is 2-dimensional. Hence
the optimization problem is a bit more challenging. We begin by observing that
the likelihood is given by
$$L(m, q) = \left(\prod_{i=1}^{n} \binom{m}{x_i}\right) q^{\sum_{i=1}^{n} x_i}\, (1-q)^{nm - \sum_{i=1}^{n} x_i}.$$
This implies that the log-likelihood is

$$l(m, q) = \sum_{i=1}^{n} \log \binom{m}{x_i} + \left(\sum_{i=1}^{n} x_i\right) \log(q) + \left(nm - \sum_{i=1}^{n} x_i\right) \log(1-q)$$
$$= \sum_{i=1}^{n} \log \binom{m}{x_i} + n\bar{x} \log(q) + n(m - \bar{x}) \log(1-q),$$

where $\bar{x} = n^{-1} \sum_{i=1}^{n} x_i$. Note that since $m$ takes only non-negative integer
values, we cannot use multivariate calculus to find the optimal values. Nevertheless,
we can use single variable calculus to show that

$$\hat{q}_{\mathrm{MLE}} \times \hat{m}_{\mathrm{MLE}} = \bar{x}. \tag{2.2}$$
To see this, note that

$$\frac{\partial}{\partial q}\, l(m, q) = \frac{n\bar{x}}{q} - \frac{n(m - \bar{x})}{1 - q},$$

and that

$$\frac{\partial^2}{\partial q^2}\, l(m, q) = -\frac{n\bar{x}}{q^2} - \frac{n(m - \bar{x})}{(1-q)^2} \le 0.$$
The above implies that for any fixed value of $m$, the maximizing value of $q$
satisfies $mq = \bar{x}$, and hence we establish equation (2.2).
With equation (2.2), the task reduces to the search for $\hat{m}_{\mathrm{MLE}}$, which
is a maximizer of

$$L\left(m, \frac{\bar{x}}{m}\right). \tag{2.3}$$

Note that the likelihood would be zero for values of $m$ smaller than $\max_{1 \le i \le n} x_i$,
and hence $\hat{m}_{\mathrm{MLE}} \ge \max_{1 \le i \le n} x_i$.
Towards specifying an algorithm to compute 𝑚̂ MLE , we first point out that for
some data sets 𝑚̂ MLE could equal ∞, indicating that a Poisson distribution
would render a better fit than any binomial distribution. This is so as the bino-
mial distribution with parameters (𝑚, 𝑥/𝑚) approaches the Poisson distribution
with parameter 𝑥 with 𝑚 approaching infinity. The fact that some data sets
prefer a Poisson distribution should not be surprising since in the above sense
60 CHAPTER 2. FREQUENCY MODELING
the set of Poisson distribution is on the boundary of the set of binomial distri-
butions. Interestingly, Olkin et al. (1981) show that if the sample mean
is less than or equal to the sample variance then 𝑚̂ MLE = ∞; otherwise, there
exists a finite 𝑚 that maximizes equation (2.3).
In Figure 2.2 below we display the plot of 𝐿 (𝑚, 𝑥/𝑚) for three different samples
of size 5; they differ only in the value of the sample maximum. The first sample
of (2, 2, 2, 4, 5) has the ratio of sample mean to sample variance greater than
1 (1.875), the second sample of (2, 2, 2, 4, 6) has the ratio equal to 1.25 which
is closer to 1, and the third sample of (2, 2, 2, 4, 7) has the ratio less than 1
(0.885). For these three samples, as shown in Figure 2.2, 𝑚̂ MLE equals 7, 18 and
∞, respectively. Note that the limiting value of 𝐿 (𝑚, 𝑥/𝑚) as 𝑚 approaches
infinity equals
$$\left(\prod_{i=1}^{n} x_i!\right)^{-1} \exp(-n\bar{x})\, \bar{x}^{\,n\bar{x}}. \tag{2.4}$$
Also, note that Figure 2.2 shows that the mle of 𝑚 is non-robust, i.e. changes
in a small proportion of the data set can cause large changes in the estimator.
This suggests the following algorithm for computing $\hat{m}_{\mathrm{MLE}}$:

• Step 1. If the sample mean is less than or equal to the sample variance,
then set $\hat{m}_{\mathrm{MLE}} = \infty$. The mle-suggested distribution is a Poisson distribution
with $\hat{\lambda} = \bar{x}$.

• Step 2. If the sample mean is greater than the sample variance, then
compute $L(m, \bar{x}/m)$ for $m$ values greater than or equal to the sample
maximum until $L(m, \bar{x}/m)$ is close to the value of the Poisson likelihood
given in (2.4). The value of $m$ that corresponds to the maximum value of
$L(m, \bar{x}/m)$ among those computed equals $\hat{m}_{\mathrm{MLE}}$.
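One way to implement this search in R is sketched below, profiling $\log L(m, \bar{x}/m)$ over a grid of $m$ values; the data vector is the first sample used in Figure 2.2, and the grid cap of 200 is our own illustrative choice.

```r
# Search for the binomial mle of m by profiling L(m, xbar/m).
x    <- c(2, 2, 2, 4, 5)               # first sample from Figure 2.2
xbar <- mean(x)
loglik_m <- function(m) sum(dbinom(x, size = m, prob = xbar / m, log = TRUE))
m_grid <- max(x):200                   # start at the sample maximum
ll <- sapply(m_grid, loglik_m)
m_grid[which.max(ll)]                  # m-hat = 7 for this sample
sum(dpois(x, lambda = xbar, log = TRUE))  # limiting Poisson log-likelihood (2.4)
```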
Special Case: Negative Binomial Distribution. In this case, the likelihood
can be written in the following form:

$$L(r, \beta) = \left(\prod_{i=1}^{n} \binom{r + x_i - 1}{x_i}\right) (1 + \beta)^{-n(r + \bar{x})}\, \beta^{n\bar{x}}.$$
The above implies that the log-likelihood is given by

$$l(r, \beta) = \sum_{i=1}^{n} \log \binom{r + x_i - 1}{x_i} - n(r + \bar{x}) \log(1 + \beta) + n\bar{x} \log(\beta),$$
and hence

$$\frac{\partial}{\partial \beta}\, l(r, \beta) = -\frac{n(r + \bar{x})}{1 + \beta} + \frac{n\bar{x}}{\beta}.$$
Equating the above to zero, we get

$$\hat{r}_{\mathrm{MLE}} \times \hat{\beta}_{\mathrm{MLE}} = \bar{x}.$$

This reduces the two-dimensional optimization to maximizing $l(r, \bar{x}/r)$ with
respect to $r$, with the maximizing $r$ being its mle and $\hat{\beta}_{\mathrm{MLE}} = \bar{x}/\hat{r}_{\mathrm{MLE}}$.
In Levin et al. (1977) it is shown that if the sample variance is greater than
the sample mean then there exists a unique 𝑟 > 0 that maximizes 𝑙(𝑟, 𝑥/𝑟) and
hence a unique mle for 𝑟 and 𝛽. Also, they show that if 𝜎̂ 2 ≤ 𝑥, then the
negative binomial likelihood will be dominated by the Poisson likelihood with
𝜆̂ = 𝑥. In other words, a Poisson distribution offers a better fit to the data. The
guarantee in the case of 𝜎̂ 2 > 𝜇̂ permits us to use some algorithm to maximize
𝑙(𝑟, 𝑥/𝑟). Towards an alternate method of computing the likelihood, we note
that
$$l(r, \bar{x}/r) = \sum_{i=1}^{n} \sum_{j=1}^{x_i} \log(r - 1 + j) - \sum_{i=1}^{n} \log(x_i!) - n(r + \bar{x}) \log(r + \bar{x}) + nr \log(r) + n\bar{x} \log(\bar{x}),$$
which yields

$$\left(\frac{1}{n}\right) \frac{\partial}{\partial r}\, l(r, \bar{x}/r) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{x_i} \frac{1}{r - 1 + j} - \log(r + \bar{x}) + \log(r).$$
We note that, in the above expressions for the terms involving a double sum-
mation, the inner sum equals zero if 𝑥𝑖 = 0. The maximum likelihood estimate
for 𝑟 is a root of the last expression and we can use a root finding algorithm to
compute it. Also, we have
$$\left(\frac{1}{n}\right) \frac{\partial^2}{\partial r^2}\, l(r, \bar{x}/r) = \frac{\bar{x}}{r(r + \bar{x})} - \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{x_i} \frac{1}{(r - 1 + j)^2}.$$
A simple but quickly converging iterative root finding algorithm is Newton's
method, which incidentally the Babylonians are believed to have used for com-
puting square roots. Under this method, an initial approximation is selected
for the root and new approximations for the root are successively generated
until convergence. Applying Newton's method to our problem results in the
following algorithm:
Step i. Choose an approximate solution, say $r_0$. Set $k$ to $0$.

Step ii. Define $r_{k+1}$ as

$$r_{k+1} = r_k - \frac{\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{x_i} \frac{1}{r_k - 1 + j} - \log(r_k + \bar{x}) + \log(r_k)}{\frac{\bar{x}}{r_k(r_k + \bar{x})} - \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{x_i} \frac{1}{(r_k - 1 + j)^2}}.$$

Step iii. If $r_{k+1} \approx r_k$, then report $r_{k+1}$ as the maximum likelihood estimate; else
increment $k$ by $1$ and repeat Step ii.
For example, we simulated a sample of five observations, 41, 49, 40, 27, 23, from
the negative binomial with parameters $r = 10$ and $\beta = 5$. We choose the starting
value of $r$ by matching moments, i.e., setting

$$r\beta = \hat{\mu} \quad \text{and} \quad r\beta(1 + \beta) = \hat{\sigma}^2,$$

where $\hat{\mu}$ represents the estimated mean and $\hat{\sigma}^2$ the estimated variance. This
leads to a starting value for $r$ of $23.14286$. The iterates of $r$ from Newton's
method are

$$21.39627,\ 21.60287,\ 21.60647,\ 21.60647;$$

the rapid convergence seen above is typical of Newton's method. Hence in
this example, $\hat{r}_{\mathrm{MLE}} \approx 21.60647$ and $\hat{\beta}_{\mathrm{MLE}} \approx 1.66616$.
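A direct R implementation of this algorithm, reproducing the numbers above, is sketched below; the function names are our own.

```r
# Newton's method for the negative binomial profile log-likelihood l(r, xbar/r).
x <- c(41, 49, 40, 27, 23)
n <- length(x); xbar <- mean(x)
score <- function(r)                   # (1/n) first derivative in r
  mean(sapply(x, function(xi) sum(1 / (r - 1 + seq_len(xi))))) -
    log(r + xbar) + log(r)
curv <- function(r)                    # (1/n) second derivative in r
  xbar / (r * (r + xbar)) -
    mean(sapply(x, function(xi) sum(1 / (r - 1 + seq_len(xi))^2)))
sigma2 <- var(x) * (n - 1) / n         # variance with divisor n
r <- xbar^2 / (sigma2 - xbar)          # moment-based start: 23.14286
repeat {
  r_new <- r - score(r) / curv(r)
  if (abs(r_new - r) < 1e-7) break
  r <- r_new
}
c(r_mle = r_new, beta_mle = xbar / r_new)   # about 21.60647 and 1.66616
```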
To summarize our discussion of MLE for the (𝑎, 𝑏, 0) class of distributions, in Fig-
ure 2.3 below we plot the maximum value of the Poisson likelihood, 𝐿(𝑚, 𝑥/𝑚)
for the binomial, and 𝐿(𝑟, 𝑥/𝑟) for the negative binomial, for the three sam-
ples of size 5 given in Table 2.1. The data was constructed to cover the three
orderings of the sample mean and variance. As shown in Figure 2.3, and
supported by theory, if 𝜇̂ < 𝜎̂ 2 then the negative binomial results in a higher
maximum likelihood value; if 𝜇̂ = 𝜎̂ 2 the Poisson has the highest likelihood
value; and finally in the case that 𝜇̂ > 𝜎̂ 2 the binomial gives a better fit than
the others. So before fitting frequency data with an (𝑎, 𝑏, 0) distribution, it
is best to start by examining the ordering of 𝜇̂ and 𝜎̂ 2 . We again emphasize
that the Poisson is on the boundary of the negative binomial and binomial
distributions. So in the case that 𝜇̂ ≥ 𝜎̂ 2 (𝜇̂ ≤ 𝜎̂ 2 , resp.) the Poisson yields
a better fit than the negative binomial (binomial, resp.), which is indicated by
𝑟 ̂ = ∞ (𝑚̂ = ∞, respectively).
2.5 Other Frequency Distributions

In this section, you learn how to:
• Define the (a,b,1) class of frequency distributions and discuss the impor-
tance of the recursive relationship underpinning this class of distributions
• Interpret zero truncated and modified versions of the binomial, Poisson,
and negative binomial distributions
• Compute probabilities using the recursive relationship
There are clearly infinitely many other count distributions, and more impor-
tantly the above distributions by themselves do not cater to all practical needs.
In particular, one feature of some insurance data is that the proportion of zero
counts can be out of place with the proportion of other counts to be explainable
by the above distributions. In the following we modify the above distributions
to allow for arbitrary probability for zero count irrespective of the assignment of
relative probabilities for the other counts. Another feature of a data set which
is naturally comprised of homogeneous subsets is that while the above distribu-
tions may provide good fits to each subset, they may fail to do so to the whole
data set. Later we naturally extend the (𝑎, 𝑏, 0) distributions to be able to cater
to, in particular, such data sets.
A natural extension of the $(a, b, 0)$ class is the collection of distributions whose
probabilities satisfy the recurrence

$$\frac{p_k}{p_{k-1}} = a + \frac{b}{k}, \qquad k \ge 2. \tag{2.5}$$
Note that since the recursion starts with 𝑝1 , and not 𝑝0 , we refer to this super-
class of (𝑎, 𝑏, 0) distributions by (a,b,1). To understand this class, recall that
each valid pair of values for 𝑎 and 𝑏 of the (𝑎, 𝑏, 0) class corresponds to a unique
vector of probabilities {𝑝𝑘 }𝑘≥0 . If we now look at the probability vector {𝑝𝑘̃ }𝑘≥0
given by
$$\tilde{p}_k = \frac{1 - \tilde{p}_0}{1 - p_0} \cdot p_k, \qquad k \ge 1,$$
where 𝑝0̃ ∈ [0, 1) is arbitrarily chosen, then since the relative probabilities for
positive values according to {𝑝𝑘 }𝑘≥0 and {𝑝𝑘̃ }𝑘≥0 are the same, we have {𝑝𝑘̃ }𝑘≥0
satisfies recurrence (2.5). This, in particular, shows that the class of (𝑎, 𝑏, 1)
distributions is strictly wider than that of (𝑎, 𝑏, 0).
In the above, we started with a pair of values for 𝑎 and 𝑏 that led to a valid (𝑎, 𝑏, 0) distribution, and then looked at the (𝑎, 𝑏, 1) distributions that corresponded to this (𝑎, 𝑏, 0) distribution. We now argue that the (𝑎, 𝑏, 1) class allows for a larger set of permissible values of 𝑎 and 𝑏 than the (𝑎, 𝑏, 0) class. Recall from Section 2.3 that in the case of 𝑎 < 0 we did not use the fact that the recurrence (2.1) started at 𝑘 = 1, and hence the set of pairs (𝑎, 𝑏) with 𝑎 < 0 that are permissible for the (𝑎, 𝑏, 0) class is identical to those that are permissible for the (𝑎, 𝑏, 1) class. The same conclusion is easily drawn for pairs with 𝑎 = 0. In the case that 𝑎 > 0, instead of the constraint 𝑎 + 𝑏 > 0 for the (𝑎, 𝑏, 0) class we now have the weaker constraint 𝑎 + 𝑏/2 > 0 for the (𝑎, 𝑏, 1) class. With the parametrization 𝑏 = (𝑟 − 1)𝑎 as used in Section 2.3, instead of 𝑟 > 0 we now have the weaker constraint 𝑟 > −1. In particular, we see that while zero modifying an (𝑎, 𝑏, 0) distribution leads to a distribution in the (𝑎, 𝑏, 1) class, the conclusion does not hold in the other direction.
A zero modification of a count distribution 𝐹 that assigns zero probability to the zero count is called a zero truncation of 𝐹. Hence, the zero truncated version of the probabilities {p_k}_{k≥0} is given by

p̃_k = { 0,               k = 0;
         p_k/(1 − p_0),   k ≥ 1. }
Solution. For the Poisson distribution as a member of the (𝑎, 𝑏, 0) class, we have 𝑎 = 0 and 𝑏 = λ = 2. Thus, we may use the recursion p_k = λ p_{k−1}/k = 2 p_{k−1}/k for each type, after determining the starting probabilities. The calculation of probabilities for 𝑘 ≤ 3 is shown in Table 2.2.
Table 2.2. Calculation of Probabilities for k ≤ 3

k   p_k                              p_k^T                        p_k^M
0   p_0 = e^{−λ} = 0.135335          0                            0.6
1   p_1 = p_0(0 + λ/1) = 0.270671    p_1/(1−p_0) = 0.313035       ((1−p_0^M)/(1−p_0)) p_1 = 0.125214
2   p_2 = p_1(0 + λ/2) = 0.270671    p_2/(1−p_0) = 0.313035       ((1−p_0^M)/(1−p_0)) p_2 = 0.125214
3   p_3 = p_2(0 + λ/3) = 0.180447    p_3/(1−p_0) = 0.208690       ((1−p_0^M)/(1−p_0)) p_3 = 0.083476
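These probabilities can be generated recursively in R; a minimal sketch, under the stated values λ = 2 and p_0^M = 0.6:

# Sketch: Poisson (lambda = 2) probabilities with their zero-truncated (pT)
# and zero-modified (pM, with p0^M = 0.6) versions, as in Table 2.2.
lambda <- 2
p0M <- 0.6
p <- numeric(4)
p[1] <- exp(-lambda)                           # p_0
for (k in 1:3) p[k + 1] <- lambda * p[k] / k   # (a,b,0) recursion with a = 0
pT <- c(0, p[-1] / (1 - p[1]))                 # zero truncated
pM <- c(p0M, (1 - p0M) * p[-1] / (1 - p[1]))   # zero modified
round(cbind(k = 0:3, p, pT, pM), 6)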
F(x) = Σ_{i=1}^{k} α_i · F_i(x).   (2.6)
The above expression can be seen as a direct application of the Law of Total
Probability. As an example, consider a population of drivers split broadly into
two sub-groups, those with at most five years of driving experience and those with more than five years of experience. Let α denote the proportion of drivers with at most five years of experience, and F_{≤5} and F_{>5} denote the distributions of the count of claims in a year for a driver in each group, respectively. Then the distribution of the claim count of a randomly selected driver is given by

F(x) = α F_{≤5}(x) + (1 − α) F_{>5}(x).

More generally, with 𝐼 denoting the randomly selected subgroup, the variance of the mixture count N_I satisfies

Var[N_I] = E[Var[N_I | I]] + Var[E[N_I | I]] = Σ_{i=1}^{k} α_i Var[N_i] + Var[E[N_I | I]].   (2.7)
Moreover, simulating from a mixture is straightforward: first draw the subgroup, then draw from its component distribution, taking advantage of efficient simulation schemes that may exist for the component distributions.
Solution.
1. Using the Law of Total Probability, we can write the required probability as Pr(N_I = 3), with 𝐼 denoting the group of the randomly selected individual, with 1, 2 and 3 signifying the groups Children, Adult Non-Smoker, and Adult Smoker, respectively. Now, by conditioning, we get
In the above example, the number of subgroups 𝑘 was equal to three. In general, 𝑘 can be any natural number, but when 𝑘 is large it is parsimonious, from a modeling point of view, to take the following infinitely-many-subgroups approach. To motivate this approach, let the 𝑖-th subgroup be such that its component distribution F_i is given by G_{θ̃_i}, where G_• is a parametric family of distributions with parameter space Θ ⊆ R^d. With this assumption, the distribution function 𝐹 of a randomly drawn observation from the population is given by

F(x) = Σ_{i=1}^{k} α_i G_{θ̃_i}(x),   for all x ∈ R,

which can equivalently be written as

F(x) = E[G_ϑ̃(x)],   for all x ∈ R,

where ϑ̃ takes the value θ̃_i with probability α_i, for i = 1, …, k. The above makes it clear that when 𝑘 is large, one could instead model ϑ̃ as a continuous random variable.
To illustrate this approach, suppose we have a population of drivers, with the distribution of claims for an individual driver being Poisson. Each person has their own (personal) expected number of claims λ: smaller values for good drivers, and larger values for others. There is a distribution of λ in the population; a popular and convenient choice for modeling this distribution is a gamma distribution with parameters (α, θ) (the gamma distribution will be introduced formally in Section 3.2.1). With these specifications, it turns out that the resulting distribution of 𝑁, the claims of a randomly chosen driver, is a negative binomial with parameters (r = α, β = θ). This can be shown in many ways, but a straightforward argument is as follows:
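A sketch of the standard mixing argument, assuming the gamma mixing density in the form f(λ) = λ^{α−1} e^{−λ/θ} / (Γ(α) θ^α):

Pr(N = k) = ∫_0^∞ (e^{−λ} λ^k / k!) · (λ^{α−1} e^{−λ/θ} / (Γ(α) θ^α)) dλ
          = (Γ(k + α) / (k! Γ(α) θ^α)) · (θ/(1 + θ))^{k+α}
          = C(k + α − 1, k) (1/(1 + θ))^α (θ/(1 + θ))^k,

which is the negative binomial pmf with r = α and β = θ.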
In the above, we have discussed three basic frequency distributions, along with their extensions through zero modification/truncation and through mixtures of these distributions. Nevertheless, these classes remain parametric and hence, by their very nature, a small subset of the class of all possible frequency distributions (that is, the set of distributions on the non-negative integers). Hence, even though we have talked about methods for estimating the unknown parameters, the fitted distribution need not be a good representation of the underlying distribution if the latter is far from the class of distributions used for modeling. In fact, it can be shown that the maximum likelihood estimator converges to a value such that the corresponding distribution is a Kullback-Leibler projection of the underlying distribution onto the class of distributions used for modeling. Below we present one testing method, Pearson's chi-square statistic, to check the goodness of fit of the fitted distribution. For more details on the Pearson chi-square test, at an introductory mathematical statistics level, we refer the reader to Section 9.1 of Hogg et al. (2015).
If we fit a Poisson distribution, then the mle for λ, the Poisson mean, is the sample mean, which is given by

N̄ = (0 · 6996 + 1 · 455 + 2 · 28 + 3 · 4 + 4 · 0) / 7483 = 0.06989.
Now if we use Poisson (𝜆̂ 𝑀𝐿𝐸 ) as the fitted distribution, then a tabular compar-
ison of the fitted counts and observed counts is given by Table 2.5 below, where
𝑝𝑘̂ represents the estimated probabilities under the fitted Poisson distribution.
While the fit seems reasonable, a tabular comparison falls short of a statistical test of the hypothesis that the underlying distribution is indeed Poisson. Pearson's chi-square statistic is a goodness-of-fit measure that can be used for this purpose. To explain this statistic, let us suppose that a dataset of size 𝑛 is grouped into 𝐾 cells, with m_k/n and p̂_k, for k = 1, …, K, being the observed and estimated probabilities of an observation belonging to the 𝑘-th cell, respectively. The Pearson chi-square test statistic is then given by

Σ_{k=1}^{K} (m_k − n p̂_k)² / (n p̂_k).
The motivation for the above statistic derives from the fact that

Σ_{k=1}^{K} (m_k − n p_k)² / (n p_k)

has a limiting chi-square distribution with K − 1 degrees of freedom when the p_k are the true cell probabilities; estimating parameters from the data reduces the degrees of freedom accordingly.
For the Singaporean auto data the Pearson’s chi-square statistic equals 41.98
using the full data mle for 𝜆. Using the limiting distribution of chi-square with
5 − 1 − 1 = 3 degrees of freedom, we see that the value of 41.98 is way out in
the tail (99-th percentile is below 12). Hence we can conclude that the Poisson
distribution provides an inadequate fit for the data.
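For readers following along in R, a minimal sketch of this computation, assuming the observed counts from the tabular summary and collecting all counts of four or more into the final cell:

# Sketch: Pearson chi-square test of the Poisson fit for the Singapore data.
m <- c(6996, 455, 28, 4, 0)                  # observed counts for k = 0,...,4
n <- sum(m)                                  # 7483
lambda_hat <- weighted.mean(0:4, m)          # full-data mle, 0.06989
p_hat <- dpois(0:3, lambda_hat)
p_hat <- c(p_hat, 1 - sum(p_hat))            # final cell: Pr(N >= 4)
chi_sq <- sum((m - n * p_hat)^2 / (n * p_hat))
chi_sq                                       # approximately 41.98
qchisq(0.99, df = 5 - 1 - 1)                 # 99th percentile, about 11.34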
In the above, we started with the cells as given in the above tabular summary. In practice, a relevant question is how to define the cells so that the chi-square distribution is a good approximation to the finite-sample distribution of the statistic. A rule of thumb is to define the cells in such a way that at least 80%, if not all, of the cells have expected counts greater than 5. Also, since a larger number of cells results in higher power of the test, a simple rule of thumb is to maximize the number of cells subject to each cell having at least 5 observations.
2.8 Exercises
Theoretical Exercises
Exercise 2.1. Derive an expression for 𝑝𝑁 (⋅) in terms of 𝐹𝑁 (⋅) and 𝑆𝑁 (⋅).
Exercise 2.2. A measure of center of location must be equi-variant with
respect to shifts, or location transformations. In other words, if 𝑁1 and 𝑁2 are
two random variables such that 𝑁1 +𝑐 has the same distribution as 𝑁2 , for some
constant 𝑐, then the difference between the measures of the center of location
of 𝑁2 and 𝑁1 must equal 𝑐. Show that the mean satisfies this property.
Exercise 2.3. Measures of dispersion should be invariant with respect to shifts
and scale equi-variant. Show that standard deviation satisfies these properties
by doing the following:
• Show that for a random variable 𝑁 , its standard deviation equals that of
𝑁 + 𝑐, for any constant 𝑐.
• Show that for a random variable 𝑁 , its standard deviation equals 1/𝑐
times that of 𝑐𝑁 , for any positive constant 𝑐.
Exercise 2.4. Let 𝑁 be a random variable with probability mass function
given by
p_N(k) = { (6/π²)(1/k²),   k ≥ 1;
           0,              otherwise. }
Show that the mean of 𝑁 is ∞.
Exercise 2.5. Let 𝑁 be a random variable with a finite second moment. Show
that the function 𝜓(⋅) defined by
ψ(x) = E(N − x)²,   x ∈ R,
is minimized at 𝜇𝑁 without using calculus. Also, give a proof of this fact using
derivatives. Conclude that the minimum value equals the variance of 𝑁 .
Exercise 2.6. Derive the first two central moments of the (𝑎, 𝑏, 0) distributions
using the methods mentioned below:
• For the binomial distribution, derive the moments using only its pmf, then
its mgf, and then its pgf.
• For the Poisson distribution, derive the moments using only its mgf.
• For the negative binomial distribution, derive the moments using only its
pmf, and then its pgf.
Exercise 2.7. Let 𝑁1 and 𝑁2 be two independent Poisson random variables
with means 𝜆1 and 𝜆2 , respectively. Identify the conditional distribution of 𝑁1
given 𝑁1 + 𝑁2 .
Exercise 2.8. (Non-Uniqueness of the MLE) Consider the following para-
metric family of densities indexed by the parameter 𝑝 taking values in [0, 1]:
Using the corresponding zero-modified claim count distribution with 𝑝0𝑀 = 0.1,
calculate 𝑝1𝑀 .
No. of Accidents   0     1     2    3   4   5
No. of Days        209   111   33   7   5   2
You use a chi-square test to measure the fit of a Poisson distribution with mean
0.60. The minimum expected number of observations in any group should be
5. The maximum number of groups should be used. Determine the value of the
chi-square statistic.
Additional Exercises
Here is a set of exercises that guides the viewer through some of the theoretical foundations of Loss Data Analytics. Each tutorial is based on one or more questions from the professional actuarial examinations, typically the Society of Actuaries Exam C/STAM.
Frequency Distribution Guided Tutorials
Contributors
• N.D. Shyamalkumar, The University of Iowa, and Krupa
Viswanathan, Temple University, are the principal authors of the
initial version of this chapter. Email: [email protected] for
chapter comments and suggested improvements.
• Chapter reviewers include: Chunsheng Ban, Paul Johnson, Hirokazu
(Iwahiro) Iwasawa, Dalia Khalil, Tatjana Miljkovic, Rajesh Sahasrabud-
dhe, and Michelle Xia.
In this section, you learn how to define some basic distributional quantities:
• moments,
• percentiles, and
• generating functions.
3.1.1 Moments
Let 𝑋 be a continuous random variable with probability density function (pdf )
𝑓𝑋 (𝑥) and distribution function 𝐹𝑋 (𝑥). The k-th raw moment of 𝑋, denoted
by 𝜇′𝑘 , is the expected value of the k-th power of 𝑋, provided it exists. The first
raw moment 𝜇′1 is the mean of 𝑋 usually denoted by 𝜇. The formula for 𝜇′𝑘 is
given as
μ′_k = E(X^k) = ∫_0^∞ x^k f_X(x) dx.
For example, suppose 𝑋 has a gamma distribution with pdf

f_X(x) = (x/θ)^α e^{−x/θ} / (x Γ(α))

for x > 0. For α > 0, the k-th raw moment is

μ′_k = E(X^k) = ∫_0^∞ x^{k+α−1} e^{−x/θ} / (Γ(α) θ^α) dx = Γ(k + α) θ^k / Γ(α).
Given Γ (𝑟 + 1) = 𝑟Γ (𝑟) and Γ (1) = 1, then 𝜇′1 = E (𝑋) = 𝛼𝜃, 𝜇′2 = E (𝑋 2 ) =
(𝛼 + 1) 𝛼𝜃2 , 𝜇′3 = E (𝑋 3 ) = (𝛼 + 2) (𝛼 + 1) 𝛼𝜃3 , and Var (𝑋) = (𝛼 + 1)𝛼𝜃2 −
(𝛼𝜃)2 = 𝛼𝜃2 .
The skewness of the gamma distribution is

Skewness = E[(X − μ′_1)³] / (Var X)^{3/2} = (μ′_3 − 3 μ′_2 μ′_1 + 2 μ′_1³) / (Var X)^{3/2}
         = ((α+2)(α+1)αθ³ − 3(α+1)α²θ³ + 2α³θ³) / (αθ²)^{3/2}
         = 2 / α^{1/2}.
3.1.2 Quantiles
Quantiles can also be used to describe the characteristics of the distribution of
𝑋. When the distribution of 𝑋 is continuous, for a given fraction 0 ≤ 𝑝 ≤ 1 the
corresponding quantile is the solution of the equation
𝐹𝑋 (𝜋𝑝 ) = 𝑝.
For example, the middle point of the distribution, 𝜋0.5 , is the median. A per-
centile is a type of quantile; a 100𝑝 percentile is the number such that 100 × 𝑝
percent of the data is below it.
Example 3.1.1. Actuarial Exam Question. Let 𝑋 be a continuous random variable with density function f_X(x) = θ e^{−θx}, for x > 0 and 0 elsewhere. If the median of this distribution is 1/3, find θ.
Solution.
The distribution function is F_X(x) = 1 − e^{−θx}. So, F_X(π_{0.5}) = 1 − e^{−θ π_{0.5}} = 0.5. As π_{0.5} = 1/3, we have F_X(1/3) = 1 − e^{−θ/3} = 0.5 and θ = 3 log 2.
Section 4.1.1 will extend the definition of quantiles to include distributions that
are discrete, continuous, or a hybrid combination.
The moment generating function (mgf) of 𝑋 is defined as M_X(t) = E(e^{tX}) for all 𝑡 for which the expected value exists. The mgf is a real function whose k-th derivative at zero is equal to the k-th raw moment of 𝑋. In symbols, this is

d^k/dt^k M_X(t) |_{t=0} = E(X^k).
Then,

M_X(−b²) = b / (b + b²) = 1 / (1 + b) = 0.2.

Thus, b = 4.
The mgf of 𝑆 is

M_S(t) = E(e^{tS}) = E(e^{t Σ_{i=1}^n X_i}) = E(∏_{i=1}^n e^{t X_i}) = ∏_{i=1}^n M_{X_i}(t) = ∏_{i=1}^n (1 − θt)^{−α_i} = (1 − θt)^{−Σ_{i=1}^n α_i},

using the independence of the X_i. This indicates that the distribution of 𝑆 is gamma with parameters Σ_{i=1}^n α_i and θ.
This is a demonstration of how we can use the uniqueness property of the mo-
ment generating function to determine the probability distribution of a function
of random variables.
We can find the mean and variance from the properties of the gamma distribution. Alternatively, by finding the first and second derivatives of M_S(t) at zero, we can show that

E(S) = ∂M_S(t)/∂t |_{t=0} = αθ,   where α = Σ_{i=1}^n α_i,

and

E(S²) = ∂²M_S(t)/∂t² |_{t=0} = (α + 1) α θ².
One can also use the moment generating function to compute the probability generating function, via P_X(z) = E(z^X) = M_X(log z).
In this section, you learn how to define and apply four fundamental severity
distributions:
• gamma,
• Pareto,
• Weibull, and
• generalized beta distribution of the second kind.
The distribution function of the gamma random variable is

F_X(x) = Γ(α; x/θ) = (1/Γ(α)) ∫_0^{x/θ} t^{α−1} e^{−t} dt,

where Γ(α; ·) denotes the incomplete gamma function. The 𝑘-th raw moment of the gamma distributed random variable for any positive 𝑘 is given by

E(X^k) = Γ(α + k) θ^k / Γ(α).
Figure 3.1: Gamma Densities. The left-hand panel is with shape=2 and
varying scale. The right-hand panel is with scale=100 and varying shape.
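Figure 3.1 can be reproduced with a few lines of base R; a sketch, with parameter values taken from the figure legend:

# Sketch: gamma densities as in Figure 3.1.
x <- seq(0, 1000, length.out = 500)
par(mfrow = c(1, 2))
plot(x, dgamma(x, shape = 2, scale = 100), type = "l",
     ylab = "Gamma Density")                      # left panel: shape = 2 fixed
for (s in c(150, 200, 250))
  lines(x, dgamma(x, shape = 2, scale = s), lty = 2)
plot(x, dgamma(x, shape = 2, scale = 100), type = "l",
     ylab = "Gamma Density")                      # right panel: scale = 100 fixed
for (a in c(3, 4, 5))
  lines(x, dgamma(x, shape = a, scale = 100), lty = 2)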
The mean and variance are given by E (𝑋) = 𝛼𝜃 and Var (𝑋) = 𝛼𝜃2 , respec-
tively.
Since all moments exist for any positive 𝑘, the gamma distribution is considered
a light tailed distribution, which may not be suitable for modeling risky assets
as it will not provide a realistic assessment of the likelihood of severe losses.
The pdf of the Pareto distribution is

f_X(x) = α θ^α / (x + θ)^{α+1},   x > 0, α > 0, θ > 0.   (3.1)
The two panels in Figure 3.2 demonstrate the effect of the scale and shape parameters on the Pareto density function. There are other formulations of the Pareto distribution, including a one-parameter version given in Appendix Section 18.2. Henceforth, when we refer to the Pareto distribution, we mean the version given through the pdf in equation (3.1).
The distribution function of the Pareto distribution is given by
F_X(x) = 1 − (θ / (x + θ))^α,   x > 0, α > 0, θ > 0.
It can be easily seen that the hazard function of the Pareto distribution is a decreasing function in 𝑥, another indication that the distribution is heavy tailed. The hazard function is defined as the instantaneous potential for the event of interest to occur within a very narrow time frame; it reveals information about the tail of a distribution and is often used to model data in survival analysis. Again using the analogy of a population, when the hazard function decreases over time the population dies off at a decreasing rate, resulting in a heavier tail for the distribution.
The 𝑘-th raw moment of the Pareto distributed random variable exists if and only if 𝑘 < α.
Figure 3.2: Pareto Densities. The left-hand panel is with scale=2000 and
varying shape. The right-hand panel is with shape=3 and varying scale.
Figure 3.3: Weibull Densities. The left-hand panel is with shape=3 and
varying scale. The right-hand panel is with scale=100 and varying shape.
It can be easily seen that the shape parameter 𝛼 describes the shape of the
hazard function of the Weibull distribution. The hazard function is a decreasing
function when 𝛼 < 1 (heavy tailed distribution), constant when 𝛼 = 1 and
increasing when 𝛼 > 1 (light tailed distribution). This behavior of the hazard
function makes the Weibull distribution a suitable model for a wide variety of
phenomena such as weather forecasting, electrical and industrial engineering,
insurance modeling, and financial risk analysis.
The 𝑘-th raw moment of the Weibull distributed random variable is given by
E(X^k) = θ^k Γ(1 + k/α).
a. Pr(X ≥ 12) = S_X(12) = e^{−(12/33.33)^{1.2}} = 0.746.
b. Let 𝑌 be the number of patients who die within one year of diagnosis. Then,
𝑌 ∼ 𝐵𝑖𝑛 (10, 0.254) and Pr (𝑌 ≤ 2) = 0.514.
The 99th percentile, π_{0.99}, solves

S_X(π_{0.99}) = exp{ −(π_{0.99}/33.33)^{1.2} } = 0.01.
The GB2 (generalized beta of the second kind) distribution has pdf

f_X(x) = (x/θ)^{α₂/σ} / ( x σ B(α₁, α₂) [1 + (x/θ)^{1/σ}]^{α₁+α₂} )   for x > 0,   (3.2)

where the beta function B(α₁, α₂) is defined as

B(α₁, α₂) = ∫_0^1 t^{α₁−1} (1 − t)^{α₂−1} dt.
The GB2 provides a model for heavy as well as light tailed data. It includes the exponential, gamma, Weibull, Burr, Lomax, F, chi-square, Rayleigh, lognormal and log-logistic as special or limiting cases. For example, by setting the parameters σ = α₁ = α₂ = 1, the GB2 reduces to the log-logistic distribution. When σ = 1 and α₂ → ∞, it reduces to the gamma distribution, and when α₁ = 1 and α₂ → ∞, it reduces to the Weibull distribution.
A GB2 random variable can be constructed as follows. Suppose that G₁ and G₂ are independent random variables, where G_i has a gamma distribution with shape parameter α_i and scale parameter 1. Then, one can show that the random variable

X = θ (G₁/G₂)^σ

has a GB2 distribution with the pdf summarized in equation (3.2). This theoretical result has several implications. For example, when the moments exist, one can show that the 𝑘-th raw moment of the GB2 distributed random variable is given by

E(X^k) = θ^k B(α₁ + kσ, α₂ − kσ) / B(α₁, α₂),   for −α₁ < kσ < α₂.
Using the chain rule for differentiation, the pdf of interest 𝑓𝑌 (𝑦) can be written
as
f_Y(y) = (1/c) f_X(y/c).
Suppose that 𝑋 belongs to a certain set of parametric distributions and define
a rescaled version 𝑌 = 𝑐𝑋, 𝑐 > 0. If 𝑌 is in the same set of distributions
then the distribution is said to be a scale distribution. When a member of a
scale distribution is multiplied by a constant 𝑐 (𝑐 > 0), the scale parameter for
this scale distribution meets two conditions:
• The parameter is changed by multiplying by 𝑐;
• All other parameters remain unchanged.
Example 3.3.1. Actuarial Exam Question. Losses of Eiffel Auto Insurance
are denoted in Euro currency and follow a lognormal distribution with 𝜇 = 8
and 𝜎 = 2. Given that 1 euro = 1.3 dollars, find the set of lognormal parameters
which describe the distribution of Eiffel’s losses in dollars.
Solution.
Let 𝑋 and 𝑌 denote the aggregate losses of Eiffel Auto Insurance in euro cur-
rency and dollars respectively. As 𝑌 = 1.3𝑋, we have,
F_Y(y) = Pr(Y ≤ y) = Pr(1.3X ≤ y) = Pr(X ≤ y/1.3) = F_X(y/1.3).
As |dx/dy| = 1/1.3, the pdf of interest f_Y(y) is

f_Y(y) = (1/1.3) f_X(y/1.3)
       = (1/1.3) · (1.3/(y σ √(2π))) exp{ −(1/2) ((log(y/1.3) − μ)/σ)² }
       = (1/(y σ √(2π))) exp{ −(1/2) ((log y − (log 1.3 + μ))/σ)² }.
Then 𝑌 follows a lognormal distribution with parameters log 1.3 + 𝜇 = 8.26 and
𝜎 = 2.00. If we let 𝜇 = log(𝑚) then it can be easily seen that 𝑚 = 𝑒𝜇 is the
scale parameter which was multiplied by 1.3 while 𝜎 is the shape parameter that
remained unchanged.
Solution.
Let X ~ Ga(α, θ) and Y = cX. As |dx/dy| = 1/c, then

f_Y(y) = (1/c) f_X(y/c) = (y/(cθ))^α / (y Γ(α)) · exp(−y/(cθ)).
We can see that 𝑌 ∼ 𝐺𝑎(𝛼, 𝑐𝜃) indicating that gamma is a scale distribution
and 𝜃 is a scale parameter.
Using the same approach, you can demonstrate that the other distributions introduced in Section 3.2 are also scale distributions. In actuarial modeling, working with a scale distribution is very convenient because it allows one to incorporate the effect of inflation and to accommodate changes in the currency unit.
f_Y(y) = (1/τ) y^{(1/τ)−1} f_X(y^{1/τ}).

On the other hand, if τ < 0, then the distribution function of 𝑌 is given by

F_Y(y) = Pr(X^τ ≤ y) = Pr(X ≥ y^{1/τ}) = 1 − F_X(y^{1/τ}),

and

f_Y(y) = |1/τ| y^{(1/τ)−1} f_X(y^{1/τ}).
Solution.
As 𝑋 follows the exponential distribution with mean 𝜃, we have
f_X(x) = (1/θ) e^{−x/θ},   x > 0.
Also,

|dx/dy| = (1/τ) y^{(1/τ)−1}.

Thus,

f_Y(y) = (1/τ) y^{(1/τ)−1} f_X(y^{1/τ}) = (1/(τθ)) y^{(1/τ)−1} e^{−y^{1/τ}/θ} = (α/β) (y/β)^{α−1} e^{−(y/β)^α},

where α = 1/τ and β = θ^τ. Then, 𝑌 follows the Weibull distribution with shape parameter α and scale parameter β.
To see this relationship, we first note that 2G₁ has a gamma distribution with shape parameter α₁ and scale parameter 2. Readers with some background in applied statistics may recognize this to be a chi-square distribution with 2α₁ degrees of freedom. The ratio of independent chi-squares, each divided by its degrees of freedom, has an F-distribution. That is,

(2G₁/(2α₁)) / (2G₂/(2α₂)) = (α₂/α₁) (G₁/G₂)

has an F-distribution with (2α₁, 2α₂) degrees of freedom.
3.3.4 Exponentiation
The normal distribution is a very popular model for a wide range of applications, and when the sample size is large it can serve as an approximate distribution for other models. If the random variable 𝑋 has a normal distribution with mean μ and variance σ², then Y = e^X has a lognormal distribution with parameters μ and σ². The lognormal random variable has a lower bound of zero, is positively skewed, and has a long right tail. A lognormal distribution is commonly used to describe distributions of financial assets such as stock prices. It is also used in fitting claim amounts for automobile as well as health insurance. This is an example of another type of transformation, which involves exponentiation.
In general, consider the transformation Y = e^X. Then, the distribution function of 𝑌 is given by

F_Y(y) = Pr(e^X ≤ y) = Pr(X ≤ log y) = F_X(log y),

so that, taking derivatives, the pdf is

f_Y(y) = (1/y) f_X(log y).

If 𝑋 is normal with mean μ and variance σ², this becomes

f_Y(y) = (1/y) f_X(log y) = (1/(y σ √(2π))) exp{ −(1/2) ((log y − μ)/σ)² }.
If instead 𝑋 is uniform on (0, c), then

f_Y(y) = (1/y) f_X(log y) = 1/(cy).

Since 0 < x < c, we have 1 < y < e^c.
Two-point Mixture
If the underlying phenomenon is diverse and can actually be described as two
phenomena representing two subpopulations with different modes, we can con-
struct the two-point mixture random variable 𝑋. Given random variables 𝑋1
and X₂, with pdfs f_{X₁}(x) and f_{X₂}(x) respectively, the pdf of 𝑋 is the weighted average of the component pdfs f_{X₁}(x) and f_{X₂}(x). The pdf and distribution function of 𝑋 are given by

f_X(x) = a f_{X₁}(x) + (1 − a) f_{X₂}(x)

and

F_X(x) = a F_{X₁}(x) + (1 − a) F_{X₂}(x),
for 0 < 𝑎 < 1, where the mixing parameters 𝑎 and (1 − 𝑎) represent the propor-
tions of data points that fall under each of the two subpopulations respectively.
This weighted average can be applied to a number of other distribution related quantities. The k-th raw moment and moment generating function of 𝑋 are given by

E(X^k) = a E(X₁^k) + (1 − a) E(X₂^k)

and

M_X(t) = a M_{X₁}(t) + (1 − a) M_{X₂}(t),

respectively.
Example 3.3.5. Actuarial Exam Question. A collection of insurance poli-
cies consists of two types. 25% of policies are Type 1 and 75% of policies are
Type 2. For a policy of Type 1, the loss amount per year follows an exponential
distribution with mean 200, and for a policy of Type 2, the loss amount per year
follows a Pareto distribution with parameters 𝛼 = 3 and 𝜃 = 200. For a policy
chosen at random from the entire collection of both types of policies, find the
probability that the annual loss will be less than 100, and find the average loss.
Solution.
The two types of losses are the random variables X₁ and X₂. X₁ has an exponential distribution with mean 200, so F_{X₁}(100) = 1 − e^{−100/200} = 0.393. X₂ has a Pareto distribution with parameters α = 3 and θ = 200, so F_{X₂}(100) = 1 − (200/(100 + 200))³ = 0.704. Hence, F_X(100) = (0.25 × 0.393) + (0.75 × 0.704) = 0.626.
The average loss is given by

E(X) = 0.25 E(X₁) + 0.75 E(X₂) = (0.25 × 200) + (0.75 × 100) = 125,

where E(X₂) = θ/(α − 1) = 200/2 = 100.
k-point Mixture
In case of finite mixture distributions, the random variable of interest 𝑋 has
a probability 𝑝𝑖 of being drawn from homogeneous subpopulation 𝑖, where
𝑖 = 1, 2, … , 𝑘 and 𝑘 is the initially specified number of subpopulations in our mix-
ture. The mixing parameter 𝑝𝑖 represents the proportion of observations from
subpopulation 𝑖. Consider the random variable 𝑋 generated from 𝑘 distinct sub-
populations, where subpopulation 𝑖 is modeled by the continuous distribution
𝑓𝑋𝑖 (𝑥). The probability distribution of 𝑋 is given by
f_X(x) = Σ_{i=1}^{k} p_i f_{X_i}(x),

where 0 < p_i < 1 and Σ_{i=1}^{k} p_i = 1.
This model is often referred to as a finite mixture or a 𝑘-point mixture. The distribution function, 𝑟-th raw moment, and moment generating function of the 𝑘-point mixture are given as

F_X(x) = Σ_{i=1}^{k} p_i F_{X_i}(x),

E(X^r) = Σ_{i=1}^{k} p_i E(X_i^r), and

M_X(t) = Σ_{i=1}^{k} p_i M_{X_i}(t),

respectively.
Example 3.3.6. Actuarial Exam Question. 𝑌1 is a mixture of 𝑋1 and 𝑋2
with mixing weights 𝑎 and (1 − 𝑎). 𝑌2 is a mixture of 𝑋3 and 𝑋4 with mixing
weights 𝑏 and (1 − 𝑏). 𝑍 is a mixture of 𝑌1 and 𝑌2 with mixing weights 𝑐 and
(1 − 𝑐).
Then, 𝑍 is a mixture of X₁, X₂, X₃ and X₄, with mixing weights ca, c(1 − a), (1 − c)b and (1 − c)(1 − b), respectively. It can be easily seen that the mixing weights sum to one.
The distribution function, 𝑘-th raw moment, and moment generating function of the continuous mixture are given by

F_X(x) = ∫_{−∞}^{∞} F_X(x | θ) g_Θ(θ) dθ,

E(X^k) = ∫_{−∞}^{∞} E(X^k | θ) g_Θ(θ) dθ,

M_X(t) = E(e^{tX}) = ∫_{−∞}^{∞} E(e^{tX} | θ) g_Θ(θ) dθ,

respectively.
The 𝑘-th raw moment of the mixture distribution can be rewritten as

E(X^k) = ∫_{−∞}^{∞} E(X^k | θ) g_Θ(θ) dθ = E[E(X^k | Θ)].
Using the law of iterated expectations (see Appendix Chapter 16), we can define
the mean and variance of 𝑋 as
E (𝑋) = E [E (𝑋 |Θ )]
and
Var (𝑋) = E [Var (𝑋 |Θ )] + Var [E (𝑋 |Θ )] .
E(X) = E[E(X|Λ)] = E(Λ) = 1 and Var(X) = Var[E(X|Λ)] + E[Var(X|Λ)] = Var(Λ) + E(Λ) = 1 + 1 = 2.
f_X(x) = { ∫_0^x (1/50) e^{−θ/5} dθ = (1/10)(1 − e^{−x/5}),                    0 ≤ x ≤ 10;
           ∫_{x−10}^x (1/50) e^{−θ/5} dθ = (1/10)(e^{−(x−10)/5} − e^{−x/5}),   10 < x < ∞. }
One can use this to derive the mean and variance of the unconditional distribution. Alternatively, start with the conditional mean and variance of 𝑋, given by

E(X | θ) = (θ + θ + 10)/2 = θ + 5

and

Var(X | θ) = ((θ + 10) − θ)²/12 = 100/12,
respectively. With these, the unconditional mean and variance of 𝑋 are given by

E(X) = E[E(X | Θ)] = E(Θ) + 5 = 10

and

Var(X) = E[Var(X | Θ)] + Var[E(X | Θ)] = E(100/12) + Var(Θ + 5) = 8.33 + Var(Θ) = 33.33.
A deductible also helps mitigate the potential moral hazard arising from having insurance. Moral hazard occurs when the insured takes more risks, increasing the chances of loss due to perils insured against, knowing that the insurer will incur the cost (e.g., a policyholder with collision insurance may be encouraged to drive recklessly). The larger the deductible, the less the insured pays in premiums for an insurance policy.
Let 𝑋 denote the loss incurred by the insured and 𝑌 denote the amount of the claim paid by the insurer. Speaking of the benefit paid to the policyholder, we differentiate between two variables: the payment per loss and the payment per payment. The payment per loss variable, denoted by Y^L or (X − d)_+, is left censored because values of 𝑋 that are less than 𝑑 are set equal to zero. This variable is defined as

Y^L = (X − d)_+ = { 0,      X ≤ d;
                    X − d,  X > d. }

Y^L is often referred to as the left censored and shifted variable because the values below 𝑑 are not ignored and all losses are shifted by the value 𝑑.
On the other hand, the payment per payment variable, denoted by Y^P, is defined only when there is a payment. Specifically, Y^P equals X − d on the event {X > d}, denoted as Y^P = X − d | X > d. Another way of expressing this that is commonly used is

Y^P = { undefined,  X ≤ d;
        X − d,      X > d. }

Here, Y^P is often referred to as the left truncated and shifted variable or the excess loss variable, because claims smaller than 𝑑 are not reported and values above 𝑑 are shifted by 𝑑.
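To make the distinction concrete, a small simulation sketch in R, assuming (as an illustration only) exponential ground-up losses with mean 1000 and a deductible of 100:

# Sketch: payment per loss vs payment per payment under a deductible d.
set.seed(2020)
d <- 100
x <- rexp(100000, rate = 1/1000)   # simulated ground-up losses, mean 1000
yL <- pmax(x - d, 0)               # payment per loss, (X - d)_+
yP <- yL[x > d]                    # payment per payment, defined only if X > d
mean(yL)                           # approx 1000 * exp(-0.1) = 904.84
mean(yP)                           # approx 1000, by the memoryless property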
Even when the distribution of 𝑋 is continuous, the distribution of 𝑌 𝐿 is a hybrid
combination of discrete and continuous components. The discrete part of the
distribution is concentrated at 𝑌 = 0 (when 𝑋 ≤ 𝑑) and the continuous part
is spread over the interval 𝑌 > 0 (when 𝑋 > 𝑑). For the discrete part, the
probability that no payment is made is the probability that losses fall below the
deductible; that is,
Pr (𝑌 𝐿 = 0) = Pr (𝑋 ≤ 𝑑) = 𝐹𝑋 (𝑑) .
f_{Y^L}(y) = { F_X(d),      y = 0;
               f_X(y + d),  y > 0. }
We can see that the payment per payment variable is the payment per loss
variable (𝑌 𝑃 = 𝑌 𝐿 ) conditional on the loss exceeding the deductible (𝑋 > 𝑑);
that is,

f_{Y^P}(y) = f_X(y + d) / (1 − F_X(d)),   y > 0.

The corresponding distribution functions are

F_{Y^L}(y) = { F_X(d),      y = 0;
               F_X(y + d),  y > 0, }

and

F_{Y^P}(y) = (F_X(y + d) − F_X(d)) / (1 − F_X(d)),   y > 0.
The expected payment per loss can be expressed as

E(Y^L) = E(X − d)_+ = ∫_d^∞ [1 − F_X(x)] dx.

This could be easily proved by starting with the initial definition of E(Y^L) and using integration by parts.
We have seen that the deductible 𝑑 imposed on an insurance policy is the amount
of loss that has to be paid out of pocket before the insurer makes any payment.
The deductible 𝑑 imposed on an insurance policy reduces the insurer’s payment.
The loss elimination ratio (LER) is the percentage decrease in the expected
payment of the insurer as a result of imposing the deductible. It is defined as
LER = (E(X) − E(Y^L)) / E(X).
A less common type of policy deductible is the franchise deductible. The franchise deductible applies to the policy in the same way as an ordinary deductible, except that when the loss exceeds the deductible 𝑑, the full loss is
covered by the insurer. The payment per loss and payment per payment vari-
ables are defined as
Y^L = { 0,  X ≤ d;
        X,  X > d, }

and

Y^P = { undefined,  X ≤ d;
        X,          X > d, }
respectively.
Example 3.4.1. Actuarial Exam Question. A claim severity distribution
is exponential with mean 1000. An insurance company will pay the amount of
each claim in excess of a deductible of 100. Calculate the variance of the amount
paid by the insurance company for one claim, including the possibility that the
amount paid is 0.
Solution.
Let 𝑌 𝐿 denote the amount paid by the insurance company for one claim.
Y^L = (X − 100)_+ = { 0,        X ≤ 100;
                      X − 100,  X > 100. }

Then

E(Y^L) = ∫_{100}^∞ (x − 100) f_X(x) dx = 1000 e^{−100/1000}

and

E[(Y^L)²] = ∫_{100}^∞ (x − 100)² f_X(x) dx = 2 × 1000² e^{−100/1000}.

So,

Var(Y^L) = 2 × 1000² e^{−100/1000} − (1000 e^{−100/1000})² = 990,944.
Alternatively, using the per payment variable and the memoryless property of the exponential, Y^P is exponential with mean 1000, so that

E[(Y^L)²] = E[(Y^P)²] S_X(100) = 2 × 1000² e^{−100/1000}.
The relationship between 𝑋 and 𝑌 𝑃 can also be used when dealing with the
uniform or the Pareto distributions. You can easily show that if 𝑋 is uniform
over the interval (0, 𝜃) then 𝑌 𝑃 is uniform over the interval (0, 𝜃 − 𝑑) and if 𝑋
is Pareto with parameters 𝛼 and 𝜃 then 𝑌 𝑃 is Pareto with parameters 𝛼 and
𝜃 + 𝑑.
So, E(Y^P) = ∫_4^{10} (x − 4)(0.02x) dx / (1 − F_X(4)) = 2.88/0.84 = 3.43.

Note that we divide by S_X(4) = 1 − F_X(4), as Y^P is defined only on the event that the loss exceeds 4.
LER = (E(X) − E(Y^L)) / E(X) = (θ − θ e^{−d/θ}) / θ = 1 − e^{−d/θ} = 0.7.
Then,

LER′ = (θ − θ exp(−4d/(3θ))) / θ = 1 − exp(−4d/(3θ)) = 1 − (e^{−d/θ})^{4/3} = 1 − 0.3^{4/3} = 0.8.
Y = X ∧ u = { X,  X ≤ u;
              u,  X > u. }
It can be seen that the distinction between 𝑌 𝐿 and 𝑌 𝑃 is not needed under
limited policy as the insurer will always make a payment.
Using the definitions of (𝑋 − 𝑢)+ and (𝑋 ∧ 𝑢), it can be easily seen that the
expected payment without any coverage modification, 𝑋, is equal to the sum of
the expected payments with deductible 𝑢 and limit 𝑢. That is, 𝑋 = (𝑋 − 𝑢)+ +
(𝑋 ∧ 𝑢).
When a loss is subject to a deductible 𝑑 and a limit 𝑢, the per-loss variable 𝑌 𝐿
is defined as
Y^L = { 0,      X ≤ d;
        X − d,  d < X ≤ u;
        u − d,  X > u. }
Hence, 𝑌 𝐿 can be expressed as 𝑌 𝐿 = (𝑋 ∧ 𝑢) − (𝑋 ∧ 𝑑).
Even when the distribution of 𝑋 is continuous, the distribution of 𝑌 is a hybrid
combination of discrete and continuous components. The discrete part of the
distribution is concentrated at 𝑌 = 𝑢 (when 𝑋 > 𝑢), while the continuous part
is spread over the interval 𝑌 < 𝑢 (when 𝑋 ≤ 𝑢). For the discrete part, the
probability that the benefit paid is 𝑢, is the probability that the loss exceeds
the policy limit 𝑢; that is,
Pr (𝑌 = 𝑢) = Pr (𝑋 > 𝑢) = 1 − 𝐹 𝑋 (𝑢) .
For the continuous part of the distribution 𝑌 = 𝑋, hence the pdf of 𝑌 is given
by
f_Y(y) = { f_X(y),      0 < y < u;
           1 − F_X(u),  y = u. }

The distribution function of 𝑌 is

F_Y(y) = { F_X(y),  0 < y < u;
           1,       y ≥ u. }
The raw moments of 𝑌 can be found directly using the pdf of 𝑋 as follows
E(Y^k) = E[(X ∧ u)^k] = ∫_0^u x^k f_X(x) dx + ∫_u^∞ u^k f_X(x) dx = ∫_0^u x^k f_X(x) dx + u^k [1 − F_X(u)].

Alternatively, the moments can be expressed as

E[(X ∧ u)^k] = ∫_0^u k x^{k−1} [1 − F_X(x)] dx.

In particular, the expected value is

E(Y) = E(X ∧ u) = ∫_0^u [1 − F_X(x)] dx.
This could be easily proved if we start with the initial definition of E (𝑌 ) and
use integration by parts. Alternatively, see the following justification of this
limited expectation result.
E[(X ∧ u)^k] = E[ ∫_0^{X∧u} k x^{k−1} dx ]
             = E[ ∫_0^u k x^{k−1} I(X > x) dx ]
             = ∫_0^u k x^{k−1} E[I(X > x)] dx
             = ∫_0^u k x^{k−1} [1 − F_X(x)] dx.
This approach uses the Fubini-Tonelli theorem to exchange the expectation and
integration. Note that it does not make any continuity assumptions about the
distribution of 𝑋.
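As a numerical check of the last formula, a short R sketch assuming (for illustration only) an exponential loss with mean 1000 and a limit u = 500:

# Sketch: E(min(X, u)) computed two ways.
u <- 500; theta <- 1000
# direct definition: E(min(X, u))
lhs <- integrate(function(x) pmin(x, u) * dexp(x, 1/theta), 0, Inf)$value
# survival-function form: integral of S(x) from 0 to u
rhs <- integrate(function(x) pexp(x, 1/theta, lower.tail = FALSE), 0, u)$value
c(lhs, rhs)   # both approximately 1000 * (1 - exp(-0.5)) = 393.47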
f_X(x) = { x(4 − x)/9,  0 < x < 3;
           0,           elsewhere. }

With a policy limit of 1,

Y = X ∧ 1 = { X,  X ≤ 1;
              1,  X > 1. }

So E(Y) = E(X ∧ 1) = ∫_0^1 x²(4 − x)/9 dx + 1 · ∫_1^3 x(4 − x)/9 dx = 0.935.
Y^L = { 0,         X ≤ d;
        α(X − d),  d < X ≤ u;
        α(u − d),  X > u. }
The maximum amount paid by the insurer in this case is 𝛼 (𝑢 − 𝑑), while 𝑢 is
the maximum covered loss.
We have seen in Section 3.4.2 that when a loss is subject to both a deductible 𝑑 and a limit 𝑢, the per-loss variable can be expressed as Y^L = (X ∧ u) − (X ∧ d). With coinsurance, this becomes Y^L = α[(X ∧ u) − (X ∧ d)].
The 𝑘-th raw moment of 𝑌 𝐿 is given by
E[(Y^L)^k] = ∫_d^u [α(x − d)]^k f_X(x) dx + [α(u − d)]^k [1 − F_X(u)].
With inflation at rate 𝑟, the per-loss variable becomes

Y^L = { 0,                X ≤ d/(1+r);
        α[(1 + r)X − d],  d/(1+r) < X ≤ u/(1+r);
        α(u − d),         X > u/(1+r). }

The first moment is

E(Y^L) = α(1 + r) [ E(X ∧ u/(1+r)) − E(X ∧ d/(1+r)) ],
and

E[(Y^L)²] = α²(1 + r)² { E[(X ∧ u/(1+r))²] − E[(X ∧ d/(1+r))²] − 2(d/(1+r)) [E(X ∧ u/(1+r)) − E(X ∧ d/(1+r))] },

respectively.
The formulas given for the first and second moments of 𝑌 𝐿 are general. Under
full coverage, 𝛼 = 1, 𝑟 = 0, 𝑢 = ∞, 𝑑 = 0 and E (𝑌 𝐿 ) reduces to E (𝑋). If only
an ordinary deductible is imposed, 𝛼 = 1, 𝑟 = 0, 𝑢 = ∞ and E (𝑌 𝐿 ) reduces to
E (𝑋) − E (𝑋 ∧ 𝑑). If only a policy limit is imposed 𝛼 = 1, 𝑟 = 0, 𝑑 = 0 and
E (𝑌 𝐿 ) reduces to E (𝑋 ∧ 𝑢).
Example 3.4.5. Actuarial Exam Question. The ground up loss random
variable for a health insurance policy in 2006 is modeled with 𝑋, a random
variable with an exponential distribution having mean 1000. An insurance policy
pays the loss above an ordinary deductible of 100, with a maximum annual
payment of 500. The ground up loss random variable is expected to be 5%
larger in 2007, but the insurance in 2007 has the same deductible and maximum
payment as in 2006. Find the percentage increase in the expected cost per
payment from 2006 to 2007.
Solution.
We define the amount per loss Y^L in both years as

Y^L_2006 = { 0,        X ≤ 100;
             X − 100,  100 < X ≤ 600;
             500,      X > 600, }

and

Y^L_2007 = { 0,            X ≤ 95.24;
             1.05X − 100,  95.24 < X ≤ 571.43;
             500,          X > 571.43. }
So,

E(Y^L_2006) = E(X ∧ 600) − E(X ∧ 100)
            = 1000(1 − e^{−600/1000}) − 1000(1 − e^{−100/1000})
            = 356.026.

Further,

E(Y^L_2007) = 1.05 [E(X ∧ 571.43) − E(X ∧ 95.24)]
            = 1.05 [1000(1 − e^{−571.43/1000}) − 1000(1 − e^{−95.24/1000})]
            = 361.659.

Dividing by the probability of a payment in each year gives the expected costs per payment:

E(Y^P_2006) = 356.026 / e^{−100/1000} = 393.469,
E(Y^P_2007) = 361.659 / e^{−95.24/1000} = 397.797.

Because E(Y^P_2007)/E(Y^P_2006) − 1 = 0.011, there is an increase of 1.1% from 2006 to 2007.
Due to the policy limit, the cost per payment event grew by only 1.1% between
2006 and 2007 even though the ground up losses increased by 5% between the
two years.
3.4.4 Reinsurance
In Section 3.4.1 we introduced the policy deductible feature of the insurance
contract. In this feature, there is a contractual arrangement under which an
insured transfers part of the risk by securing coverage from an insurer in return
for an insurance premium. Under that policy, the insured must pay all losses
up to the deductible, and the insurer only pays the amount (if any) above the
deductible. We now introduce reinsurance, a mechanism of insurance for in-
surance companies. Reinsurance is a contractual arrangement under which an
insurer transfers part of the underlying insured risk by securing coverage from
another insurer (referred to as a reinsurer) in return for a reinsurance premium.
Although reinsurance involves a relationship among three parties, the original insured, the insurer (often referred to as the ceding insurer or cedent), and the reinsurer, the parties to the reinsurance agreement are only the primary insurer and the reinsurer. There is no contractual agreement between the original insured and the reinsurer. Though many different types of reinsurance contracts exist, a common form is excess of loss coverage. In such contracts, the primary insurer must make all required payments to the insured until the primary insurer's total payments reach a fixed reinsurance deductible. The reinsurer is then only responsible for paying losses above the reinsurance deductible. The maximum amount retained by the primary insurer in the reinsurance agreement (the reinsurance deductible) is called the retention.
E(Y_insurer) = 0.85 E(X ∧ 5000) = 0.85 · (θ/(α−1)) [1 − (θ/(5000+θ))^{α−1}] = 0.85 · (3600/(5−1)) [1 − (3600/(5000+3600))^{5−1}] = 741.51.

Without the policy limit, the expected insurer payment would be

0.85 E(X) = 0.85 · θ/(α−1) = 0.85 · 3600/(5−1) = 765.

For the second moment of the unlimited variable, we use the table of distributions to get

0.85² E(X²) = 0.85² · 2θ²/((α−1)(α−2)) = 0.85² · 2(3600²)/((5−1)(5−2)) = 1,560,600.

Thus, the variance is 1,560,600 − 765² = 975,375. Alternatively, you can use the formula

0.85² Var(X) = 0.85² · αθ²/((α−1)²(α−2)) = 0.85² · 5(3600²)/((5−1)²(5−2)) = 975,375.

Taking square roots, the standard deviation is √975,375 ≈ 987.6108.
With independent observations, the joint pdf may be written as the product of the individual pdfs. Thus, we define the likelihood to be
L(θ) = ∏_{i=1}^{n} f(x_i).   (3.3)
From the notation, note that we consider this to be a function of the parameters in θ, with the data {x₁, …, x_n} held fixed. The maximum likelihood estimator is the value of the parameters in θ that maximizes L(θ).
From calculus, we know that maximizing a function produces the same results
as maximizing the logarithm of a function (this is because the logarithm is a
monotone function). Because we get the same results, to ease computational
considerations, it is common to consider the logarithmic likelihood, denoted
as
l(θ) = log L(θ) = Σ_{i=1}^{n} log f(x_i).   (3.4)
F(x) = 1 − (500/x)^α,   x > 500.
Naturally, there are many problems where it is not practical to use hand calcula-
tions for optimization. Fortunately there are many statistical routines available
such as the R function optim.
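For instance, the maximization above can be carried out numerically with optim(); a sketch, where the data vector x is hypothetical because the original sample is not reproduced here:

# Sketch: numerical mle for the distribution F(x) = 1 - (500/x)^alpha.
x <- c(600, 700, 900, 1200, 1500)      # hypothetical losses, each above 500
negloglik <- function(alpha) {
  # negative log-likelihood; log f(x) = log(alpha) + alpha*log(500) - (alpha+1)*log(x)
  -sum(log(alpha) + alpha * log(500) - (alpha + 1) * log(x))
}
fit <- optim(par = 1, fn = negloglik, method = "Brent",
             lower = 0.001, upper = 20)
fit$par                                # numerical mle of alpha
length(x) / sum(log(x / 500))          # closed-form mle for comparison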
[Figure: the log-likelihood as a function of α, maximized near α = 2.45.]
Numerical optimization confirms our hand calculation, where the maximum likelihood estimator is α̂_MLE = 2.453125.
The likelihood function is

L(θ) = ∏_{i=1}^{4} f_{X_i}(x_i) = θ⁴ e^{−θ Σ_{i=1}^{4} 1/x_i} / ∏_{i=1}^{4} x_i².
The log-likelihood function, log L(θ), is the sum of the individual logarithms:

log L(θ) = 4 log θ − θ Σ_{i=1}^{4} 1/x_i − 2 Σ_{i=1}^{4} log x_i.
Taking the derivative with respect to θ,

d log L(θ)/dθ = 4/θ − Σ_{i=1}^{4} 1/x_i.

The maximum likelihood estimator θ̂ is the solution to

4/θ̂ − Σ_{i=1}^{4} 1/x_i = 0.

Thus, θ̂ = 4 / Σ_{i=1}^{4} (1/x_i) = 10,667.
The second derivative,

d² log L(θ)/dθ² = −4/θ²,

is negative, confirming that θ̂ is a maximum.
Each observation is from a lognormal distribution with pdf

f_X(x) = (1/(x σ √(2π))) exp( −(1/2) ((log x − μ)/σ)² ),

where x > 0.
The likelihood function, 𝐿 (𝜇, 𝜎), is the product of the pdf for each data point.
L(μ, σ) = ∏_{i=1}^{6} f_{X_i}(x_i) = (1 / (σ⁶ (2π)³ ∏_{i=1}^{6} x_i)) exp( −(1/2) Σ_{i=1}^{6} ((log x_i − μ)/σ)² ).
Taking a logarithm yields the loglikelihood function, log 𝐿 (𝜇, 𝜎), which is the
sum of the individual logarithms.
log L(μ, σ) = −6 log σ − 3 log(2π) − Σ_{i=1}^{6} log x_i − (1/2) Σ_{i=1}^{6} ((log x_i − μ)/σ)².
The partial derivatives are

∂ log L(μ, σ)/∂μ = (1/σ²) Σ_{i=1}^{6} (log x_i − μ),
∂ log L(μ, σ)/∂σ = −6/σ + (1/σ³) Σ_{i=1}^{6} (log x_i − μ)².
The maximum likelihood estimators of μ and σ, denoted by μ̂ and σ̂, are the solutions to the equations

(1/σ̂²) Σ_{i=1}^{6} (log x_i − μ̂) = 0,
−6/σ̂ + (1/σ̂³) Σ_{i=1}^{6} (log x_i − μ̂)² = 0.

These yield

μ̂ = (Σ_{i=1}^{6} log x_i)/6 = 9.38   and   σ̂² = (Σ_{i=1}^{6} (log x_i − μ̂)²)/6 = 5.12.
To check that these estimates maximize, and do not minimize, the likelihood, you may also wish to compute the second partial derivatives. These are

∂² log L(μ, σ)/∂μ² = −6/σ²,
∂² log L(μ, σ)/∂μ∂σ = −(2/σ³) Σ_{i=1}^{6} (log x_i − μ),

and

∂² log L(μ, σ)/∂σ² = 6/σ² − (3/σ⁴) Σ_{i=1}^{6} (log x_i − μ)².
Two follow-up questions rely on large sample properties that you may have
seen in an earlier course. Appendix Chapter 17 reviews the definition of the
likelihood function, introduces its properties, reviews the maximum likelihood
estimators, extends their large-sample properties to the case where there are
multiple parameters in the model, and reviews statistical inference based on
maximum likelihood estimators. In the solutions of these examples we derive the
asymptotic variance of maximum-likelihood estimators of the model parameters.
We use the delta method to derive the asymptotic variances of functions of these
parameters.
The maximum likelihood estimate of the probability g(θ) = 1 − e^{−9000/θ} is

g(θ̂) = 1 − e^{−9000/10667} = 0.57.

We use the delta method to approximate the variance of g(θ̂). With

dg(θ)/dθ = −(9000/θ²) e^{−9000/θ},

we obtain

V̂ar[g(θ̂)] = ( −(9000/θ̂²) e^{−9000/θ̂} )² V̂(θ̂) = 0.0329.
a. To derive the covariance matrix of the mle we need to find the expectations
of the second derivatives. Since the random variable 𝑋 is from a lognormal
distribution with parameters 𝜇 and 𝜎, then log 𝑋 is normally distributed with
mean 𝜇 and variance 𝜎2 .
E( ∂² log L(μ, σ)/∂μ² ) = E( −6/σ² ) = −6/σ²,

E( ∂² log L(μ, σ)/∂μ∂σ ) = −(2/σ³) Σ_{i=1}^{6} E(log x_i − μ) = −(2/σ³) Σ_{i=1}^{6} (μ − μ) = 0,

and

E( ∂² log L(μ, σ)/∂σ² ) = 6/σ² − (3/σ⁴) · 6σ² = −12/σ².

Using the negatives of these expectations, we obtain the Fisher information matrix

[ 6/σ²   0
  0      12/σ² ].
The covariance matrix Σ is the inverse of the Fisher information matrix:

Σ = [ σ²/6   0
      0      σ²/12 ].

Substituting the estimate σ̂² = 5.12 gives

Σ̂ = [ 0.8533   0
      0        0.4267 ].
b. The 95% confidence interval for μ is given by 9.38 ± 1.96 √0.8533 = (7.57, 11.19).

The 95% confidence interval for σ² is given by 5.12 ± 1.96 √0.4267 = (3.84, 6.40).
c. The mean of 𝑋 is exp(μ + σ²/2). Then, the maximum likelihood estimate of

g(μ, σ) = exp(μ + σ²/2)

is

g(μ̂, σ̂) = exp(μ̂ + σ̂²/2) = 153,277.
We use the delta method to approximate the variance of the mle g(μ̂, σ̂). The partial derivatives are

∂g(μ, σ)/∂μ = exp(μ + σ²/2)   and   ∂g(μ, σ)/∂σ = σ exp(μ + σ²/2).

Using the delta method, the approximate variance of g(μ̂, σ̂) is given by

V̂ar(g(μ̂, σ̂)) = [∂g/∂μ  ∂g/∂σ] Σ̂ [∂g/∂μ  ∂g/∂σ]ᵀ |_{μ=μ̂, σ=σ̂}
             = [153,277  346,826] [ 0.8533  0
                                    0       0.4267 ] [153,277  346,826]ᵀ
             ≈ 7.1374 × 10^{10}.
The 95% confidence interval for exp(μ + σ²/2) is given by 153,277 ± 1.96 √(7.1374 × 10^{10}).
[Figure: histogram of log expenditures with fitted exponential, gamma, Pareto, lognormal, and GB2 densities overlaid.]
L(θ) = ∏_{j=1}^{k} [F_X(c_j | θ) − F_X(c_{j−1} | θ)]^{n_j},
where 𝑐0 is the smallest possible observation (often set to zero) and 𝑐𝑘 is the
largest possible observation (often set to infinity).
Example 3.5.5. Actuarial Exam Question. For a group of policies, you are given that losses follow the distribution function F_X(x) = 1 − θ/x, for θ < x < ∞. Further, a sample of 20 losses resulted in the following:
log L(θ) = 9 log(10 − θ) + 6 log θ + 5 log θ − 9 log 10 + 6 log 15 − 6 log 250 − 5 log 25
         = 9 log(10 − θ) + 11 log θ + constant.

Taking the derivative,

d log L(θ)/dθ = −9/(10 − θ) + 11/θ.

The maximum likelihood estimator θ̂ is the solution to the equation

−9/(10 − θ̂) + 11/θ̂ = 0,

which yields 11(10 − θ̂) = 9θ̂, so that θ̂ = 110/20 = 5.5.
When observations are censored at a policy limit 𝑢, the likelihood is

L(θ) = [∏_{i=1}^{r} f_X(x_i)] [S_X(u)]^m,
where 𝑟 is the number of known loss amounts below the limit 𝑢 and 𝑚 is the
number of loss amounts larger than the limit 𝑢.
Example 3.5.6. Actuarial Exam Question. The random variable 𝑋 has
survival function:
S_X(x) = θ⁴ / (θ² + x²)².
Two values of 𝑋 are observed to be 2 and 4. One other value exceeds 4. Calculate
the maximum likelihood estimate of 𝜃.
Solution.
The contributions of the two observations 2 and 4 are 𝑓𝑋 (2) and 𝑓𝑋 (4) respec-
tively. The contribution of the third observation, which is only known to exceed
4 is 𝑆𝑋 (4). The likelihood function is thus given by
The pdf is

f_X(x) = 4xθ⁴ / (θ² + x²)³.

Thus,

L(θ) = f_X(2) f_X(4) S_X(4) = (8θ⁴/(θ²+4)³) · (16θ⁴/(θ²+16)³) · (θ⁴/(θ²+16)²) = 128 θ^{12} / ((θ²+4)³ (θ²+16)⁵).

So,

log L(θ) = log 128 + 12 log θ − 3 log(θ² + 4) − 5 log(θ² + 16),
and

d log L(θ)/dθ = 12/θ − 6θ/(θ² + 4) − 10θ/(θ² + 16).
Setting the derivative to zero, θ̂ satisfies

12/θ̂ − 6θ̂/(θ̂² + 4) − 10θ̂/(θ̂² + 16) = 0,

or, after clearing denominators, θ̂⁴ − 26θ̂² − 192 = 0, which gives θ̂² = 32 and θ̂ = √32 ≈ 5.657.
For data truncated from below at 𝑑, the likelihood is

L(θ) = ∏_{i=1}^{k} f_X(x_i) / S_X(d),
Solution.
The contributions of the different observations can be summarized as follows:
• For the exact losses: f_X(x).
• For censored observations: S_X(25).
• For truncated observations: f_X(x)/S_X(5).
Given that ground up losses smaller than 5 are omitted from the data set,
the contribution of all observations should be conditional on exceeding 5. The
likelihood function becomes
L(α) = ( ∏_{i=1}^{8} f_X(x_i) / [S_X(5)]⁸ ) · [S_X(25)/S_X(5)]².
For the single-parameter Pareto the probability density and distribution func-
tions are given by
f_X(x) = α θ^α / x^{α+1}   and   F_X(x) = 1 − (θ/x)^α,
for 𝑥 > 𝜃, respectively. Then, the likelihood is given by
L(α) = α⁸ 5^{10α} / ( ∏_{i=1}^{8} x_i^{α+1} · 25^{2α} ).
Thus, the log-likelihood is

log L(α) = 8 log α − (α + 1) Σ_{i=1}^{8} log x_i + 10α log 5 − 2α log 25.
Taking the derivative,

d log L(α)/dα = 8/α − Σ_{i=1}^{8} log x_i + 10 log 5 − 2 log 25.
With this, the maximum likelihood estimator α̂ is the solution to the equation

8/α̂ − Σ_{i=1}^{8} log x_i + 10 log 5 − 2 log 25 = 0,

which yields

α̂ = 8 / (Σ_{i=1}^{8} log x_i − 10 log 5 + 2 log 25) = 8 / ((log 7 + log 9 + ⋯ + log 20) − 10 log 5 + 2 log 25) = 0.785.
The mean of the single-parameter Pareto is finite for α > 1 (see Appendix Section 18.2). Since α̂ = 0.785 < 1, the estimated mean is infinite.
Exercises
Here is a set of exercises that guides the viewer through some of the theoretical foundations of Loss Data Analytics. Each tutorial is based on one or more questions from the professional actuarial examinations, typically the Society of Actuaries Exam C/STAM.
Severity Distribution Guided Tutorials
Chapter Preview. Chapters 2 and 3 have described how to fit parametric mod-
els to frequency and severity data, respectively. This chapter begins with the
selection of models. To compare alternative parametric models, it is helpful to
summarize data without reference to a specific parametric distribution. Section
4.1 describes nonparametric estimation, how we can use it for model compar-
isons and how it can be used to provide starting values for parametric procedures.
The process of model selection is then summarized in Section 4.2. Although our
focus is on data from continuous distributions, the same process can be used for
discrete versions or data that come from a hybrid combination of discrete and
continuous distributions.
Model selection and estimation are fundamental aspects of statistical modeling.
To provide a flavor as to how they can be adapted to alternative sampling
schemes, Section 4.3.1 describes estimation for grouped, censored and truncated
data (following the Section 3.5 introduction). To see how they can be adapted
to alternative models, the chapter closes with Section 4.4 on Bayesian inference,
an alternative procedure where the (typically unknown) parameters are treated
as random variables.
Moment Estimators
We learned how to define moments in Section 2.2.2 for frequency and Section
3.1.1 for severity. In particular, the 𝑘-th moment, E [𝑋 𝑘 ] = 𝜇′𝑘 , summarizes
many aspects of the distribution for different choices of 𝑘. Here, 𝜇′𝑘 is sometimes
called the 𝑘th population moment to distinguish it from the 𝑘th sample moment,
(1/n) Σ_{i=1}^{n} X_i^k.

When k = 1, this sample moment is an estimator for μ and is called the sample mean, denoted with a bar on top of the random variable:

X̄ = (1/n) Σ_{i=1}^{n} X_i.
Another type of summary measure of interest is the 𝑘-th central moment, E [(𝑋−
𝜇)𝑘 ] = 𝜇𝑘 . (Sometimes, 𝜇′𝑘 is called the 𝑘-th raw moment to distinguish it from
the central moment 𝜇𝑘 .). A nonparametric, or sample, estimator of 𝜇𝑘 is
(1/n) Σ_{i=1}^{n} (X_i − X̄)^k.
When k = 2, dividing by n − 1 instead of n yields the usual sample variance, s² = (n − 1)^{−1} Σ_{i=1}^{n} (X_i − X̄)²; the difference matters little when you have a large sample size n, as is common in insurance applications. The sample variance estimator s² is unbiased in the sense that E[s²] = σ², a desirable property particularly when interpreting the results of an analysis.
F_n(x) = (1/n) Σ_{i=1}^{n} I(X_i ≤ x) = (number of observations less than or equal to x) / n.
As 𝐹𝑛 (⋅) is based on only observations and does not assume a parametric fam-
ily for the distribution, it is nonparametric and also known as the empirical
distribution function. It is also known as the empirical cumulative distribution
function and, in R, one can use the ecdf(.) function to compute it.
Example 4.1.1. Toy Data Set. To illustrate, consider a fictitious, or “toy,”
data set of 𝑛 = 10 observations. Determine the empirical distribution function.
𝑖 1 2 3 4 5 6 7 8 9 10
𝑋𝑖 10 15 15 15 20 23 23 23 23 30
You should check that the sample mean is 𝑋 = 19.7 and that the sample variance
is 𝑠2 = 34.45556. The corresponding empirical distribution function is
F_n(x) = { 0,    x < 10;
           0.1,  10 ≤ x < 15;
           0.4,  15 ≤ x < 20;
           0.5,  20 ≤ x < 23;
           0.9,  23 ≤ x < 30;
           1,    x ≥ 30, }
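These computations are immediate in R; a minimal sketch for the toy data:

# Sketch: summary statistics and empirical distribution function.
x <- c(10, 15, 15, 15, 20, 23, 23, 23, 23, 30)
mean(x)                       # 19.7
var(x)                        # 34.45556 (the n - 1 version)
Fn <- ecdf(x)                 # empirical distribution function
Fn(c(10, 15, 20, 23, 30))     # 0.1, 0.4, 0.5, 0.9, 1.0
plot(Fn)                      # reproduces Figure 4.1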
[Figure 4.1: Empirical distribution function of the toy data set.]
Recall that the first quartile is the number such that approximately 25% of the data is below it, and the third quartile is the number such that approximately 75% of the data is below it. A 100p percentile is the number such that 100 × p percent of the data is below it.
To generalize this concept, consider a distribution function 𝐹 (⋅), which may or
may not be continuous, and let 𝑞 be a fraction so that 0 < 𝑞 < 1. We want
to define a quantile, say 𝑞𝐹 , to be a number such that 𝐹 (𝑞𝐹 ) ≈ 𝑞. Notice that
when 𝑞 = 0.5, 𝑞𝐹 is the median; when 𝑞 = 0.25, 𝑞𝐹 is the first quartile, and so
on. In the same way, when 𝑞 = 0, 0.01, 0.02, … , 0.99, 1.00, the resulting 𝑞𝐹 is
a percentile. So, a quantile generalizes the concepts of median, quartiles, and
percentiles.
To be precise, for a given 0 < q < 1, define the 𝑞-th quantile q_F to be any number that satisfies

F(q_F −) ≤ q ≤ F(q_F).   (4.1)
Here, the notation 𝐹 (𝑥−) means to evaluate the function 𝐹 (⋅) as a left-hand
limit.
To get a better understanding of this definition, let us look at a few special
cases. First, consider the case where 𝑋 is a continuous random variable so that
the distribution function 𝐹 (⋅) has no jump points, as illustrated in Figure 4.2.
In this figure, a few fractions, 𝑞1 , 𝑞2 , and 𝑞3 are shown with their corresponding
quantiles 𝑞𝐹 ,1 , 𝑞𝐹 ,2 , and 𝑞𝐹 ,3 . In each case, it can be seen that 𝐹 (𝑞𝐹 −) = 𝐹 (𝑞𝐹 )
so that there is a unique quantile. Because we can find a unique inverse of the
distribution function at any 0 < 𝑞 < 1, we can write 𝑞𝐹 = 𝐹 −1 (𝑞).
[Figure 4.2: A continuous distribution function with fractions q₁, q₂, q₃ and their corresponding quantiles.]
Figure 4.3 shows three cases for distribution functions. The left panel corre-
sponds to the continuous case just discussed. The middle panel displays a jump
point similar to those we already saw in the empirical distribution function
130 CHAPTER 4. MODEL SELECTION AND ESTIMATION
of Figure 4.1. For the value of 𝑞 shown in this panel, we still have a unique
value of the quantile 𝑞𝐹 . Even though there are many values of 𝑞 such that
𝐹 (𝑞𝐹 −) ≤ 𝑞 ≤ 𝐹 (𝑞𝐹 ), for a particular value of 𝑞, there is only one solution to
equation (4.1). The right panel depicts a situation in which the quantile cannot
be uniquely determined for the 𝑞 shown as there is a range of 𝑞𝐹 ’s satisfying
equation (4.1).
[Figure 4.3: Quantiles for three cases of distribution functions: continuous (left), a jump point with a unique quantile (middle), and a flat region where the quantile is not unique (right).]
The smoothed empirical quantile is defined as

π̂_q = (1 − h) X_{(j)} + h X_{(j+1)},

where j = ⌊(n + 1)q⌋, h = (n + 1)q − j, and X_{(1)}, …, X_{(n)} are the ordered values (known as the order statistics) corresponding to X₁, …, X_n. (Recall that the brackets ⌊·⌋ denote the floor function, the greatest integer value.) Note that π̂_q is simply a linear interpolation between X_{(j)} and X_{(j+1)}.
Example 4.1.3. Toy Data Set: Continued. Determine the 50th and 20th
smoothed percentiles.
Solution. Take n = 10 and q = 0.5. Then, j = ⌊(11)(0.5)⌋ = ⌊5.5⌋ = 5 and h = (11)(0.5) − 5 = 0.5, so the 0.5-th smoothed empirical quantile is

π̂_{0.5} = (1 − 0.5) X_{(5)} + (0.5) X_{(6)} = 0.5(20) + (0.5)(23) = 21.5.

For q = 0.2, j = ⌊(11)(0.2)⌋ = 2 and h = (11)(0.2) − 2 = 0.2, so

π̂_{0.2} = (1 − 0.2) X_{(2)} + (0.2) X_{(3)} = 0.8(15) + (0.2)(15) = 15.
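This smoothed empirical quantile appears to correspond to type = 6 of R's quantile() function (a correspondence worth verifying for your application); a sketch for the toy data:

# Sketch: smoothed empirical quantiles via quantile() with type = 6.
x <- c(10, 15, 15, 15, 20, 23, 23, 23, 23, 30)
quantile(x, probs = c(0.5, 0.2), type = 6)   # 21.5 and 15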
Density Estimators
Discrete Variable. When the random variable is discrete, estimating the
probability mass function 𝑓(𝑥) = Pr(𝑋 = 𝑥) is straightforward. We simply use
the sample average, defined to be
f_n(x) = (1/n) Σ_{i=1}^{n} I(X_i = x).

Grouped Variable. For data grouped into intervals [c_{j−1}, c_j), a density estimator is

f_n(x) = n_j / (n (c_j − c_{j−1}))   for c_{j−1} ≤ x < c_j,

where n_j is the number of observations (X_i) that fall into the interval [c_{j−1}, c_j).
Continuous Variable (not grouped). Extending this notion to instances
where we observe individual data, note that we can always create arbitrary
groupings and use this formula. More formally, let 𝑏 > 0 be a small positive
constant, known as a bandwidth, and define a density estimator to be
f_n(x) = (1/(2nb)) Σ_{i=1}^{n} I(x − b < X_i ≤ x + b).   (4.2)
Snippet of Theory. The idea is that the estimator f_n(x) in equation (4.2) is the average over n iid realizations of a random variable with mean

E[(1/(2b)) I(x − b < X ≤ x + b)] = (1/(2b)) (F(x + b) − F(x − b)) → F′(x) = f(x)

as b → 0. More generally, the kernel density estimator is

f_n(x) = (1/(nb)) Σ_{i=1}^{n} w((x − X_i)/b),   (4.3)

where w(·) is a kernel function; some common choices appear in the following table.
Kernel         w(x)
Uniform        (1/2) I(−1 < x ≤ 1)
Triangle       (1 − |x|) × I(|x| ≤ 1)
Epanechnikov   (3/4)(1 − x²) × I(|x| ≤ 1)
Gaussian       φ(x)
Here, 𝜙(⋅) is the standard normal density function. As we will see in the following
example, the choice of bandwidth 𝑏 comes with a bias-variance tradeoff between
matching local distributional features and reducing the volatility.
Example 4.1.4. Property Fund. Figure 4.4 shows a histogram (with shaded
gray rectangles) of logarithmic property claims from 2010. The (blue) thick
curve represents a Gaussian kernel density where the bandwidth was selected
automatically using an ad hoc rule based on the sample size and volatility of
these data. For this dataset, the bandwidth turned out to be 𝑏 = 0.3255. For
comparison, the (red) dashed curve represents the density estimator with a
bandwidth equal to 0.1 and the green smooth curve uses a bandwidth of 1. As
anticipated, the smaller bandwidth (0.1) indicates taking local averages over less
4.1. NONPARAMETRIC INFERENCE 133
data so that we get a better idea of the local average, but at the price of higher
volatility. In contrast, the larger bandwidth (1) smooths out local fluctuations,
yielding a smoother curve that may miss perturbations in the local average. For
actuarial applications, we mainly use the kernel density estimator to get a quick
visual impression of the data. From this perspective, you can simply use the
default ad hoc rule for bandwidth selection, knowing that you have the ability
to change it depending on the situation at hand.
[Figure 4.4: Histogram of logarithmic property claims with Gaussian kernel density estimates for bandwidths b = 0.3255 (default), b = 0.1, and b = 1.0.]
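In R, the kernel density estimates of Figure 4.4 can be produced with density(); a sketch, where logclaims is a placeholder for the 2010 logarithmic property claims:

# Sketch: kernel density estimates with three bandwidths.
hist(logclaims, freq = FALSE, main = "", xlab = "Log Expenditures")
lines(density(logclaims), lwd = 2)            # default (ad hoc) bandwidth
lines(density(logclaims, bw = 0.1), lty = 2)  # small bandwidth: more volatile
lines(density(logclaims, bw = 1.0), lty = 3)  # large bandwidth: smoother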
In the same way, one can define a kernel estimator of the distribution function:

F̃_n(x) = (1/n) Σ_{i=1}^{n} W((x − X_i)/b),

where W(·) is the integral of a kernel. For example, for the uniform kernel,

W(y) = { 0,          y < −1;
         (y + 1)/2,  −1 ≤ y < 1;
         1,          y ≥ 1. }
You study five lives to estimate the time from the onset of a disease to death.
The times to death are:
2 3 3 3 7
Using a triangular kernel with bandwidth 2, calculate the density function esti-
mate at 2.5.
Solution. For the kernel density estimate, we have
f_n(x) = (1/(nb)) Σ_{i=1}^{n} w((x − X_i)/b),
where here n = 5, b = 2, and w is the triangular kernel. The relevant computations are:

X_i   (x − X_i)/b            w((x − X_i)/b)
2     (2.5 − 2)/2 = 1/4      (1 − 1/4)(1) = 3/4
3     (2.5 − 3)/2 = −1/4     (1 − |−1/4|)(1) = 3/4
7     (2.5 − 7)/2 = −2.25    (1 − |−2.25|)(0) = 0

Since the value 3 appears three times in the sample, the density estimate at 2.5 is

f_n(2.5) = (1/(5 × 2)) (3/4 + 3 × (3/4) + 0) = 0.3.
Plug-in Principle
One way to create a nonparametric estimator of some quantity is to use the
analog or plug-in principle where one replaces the unknown cdf 𝐹 with a known
estimate such as the empirical cdf 𝐹𝑛 . So, if we are trying to estimate E [g(𝑋)] =
E_F[g(X)] for a generic function g, then we define a nonparametric estimator to be E_{F_n}[g(X)] = n^{−1} Σ_{i=1}^{n} g(X_i).
To see how this works, as a special case of g consider the payment per loss random variable Y^L = (X − d)_+ and the loss elimination ratio introduced in Section 3.4.1. We can express the latter as

LER(d) = E(X ∧ d) / E(X).
We use a sample of 432 closed auto claims from Boston from Derrig et al. (2001).
Losses are recorded for payments due to bodily injuries in auto accidents. Losses
are not subject to deductibles but are limited by various maximum coverage
amounts that are also available in the data. It turns out that only 17 out of 432
(≈ 4%) were subject to these policy limits and so we ignore these data for this
illustration.
The average loss paid is 6906 in U.S. dollars. Figure 4.5 shows other aspects of
the distribution. Specifically, the left-hand panel shows the empirical distribu-
tion function, the right-hand panel gives a nonparametric density plot.
Figure 4.5: Bodily Injury Claims. The left-hand panel gives the empirical
distribution function. The right-hand panel presents a nonparametric density
plot.
The impact of bodily injury losses can be mitigated by the imposition of limits
or purchasing reinsurance policies (see Section 10.3). To quantify the impact
of these risk mitigation tools, it is common to compute the loss elimination
ratio (LER) as introduced in Section 3.4.1. The distribution function is not
available and so must be estimated in some way. Using the plug-in principle, a
nonparametric estimator can be defined as
$$LER_n(d) = \frac{n^{-1}\sum_{i=1}^{n}\min(X_i, d)}{n^{-1}\sum_{i=1}^{n} X_i} = \frac{\sum_{i=1}^{n}\min(X_i, d)}{\sum_{i=1}^{n} X_i}.$$
Figure 4.6 shows the estimator 𝐿𝐸𝑅𝑛 (𝑑) for various choices of 𝑑. For example,
at 𝑑 = 1, 000, we have 𝐿𝐸𝑅𝑛 (1000) ≈ 0.1442. Thus, imposing a limit of 1,000
means that expected retained claims are 14.42 percent lower when compared to
expected claims with a zero deductible.
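The plug-in estimator is straightforward to compute. A minimal sketch, assuming a vector X that holds the retained bodily injury losses:

LERn <- function(d, X) sum(pmin(X, d)) / sum(X)   # plug-in loss elimination ratio
LERn(1000, X)                                     # approximately 0.1442 for these data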
Figure 4.6: LER for Bodily Injury Claims. The figure presents the loss elimination ratio (LER) as a function of the deductible 𝑑.
Nonparametric estimators also provide a benchmark to assess how well a parametric distribution/model represents the data.
Also, as the sample size increases, the empirical distribution converges almost
surely to the underlying population distribution (by the strong law of large num-
bers). Thus the empirical distribution is a good proxy for the population. The
comparison of parametric to nonparametric estimators may alert the analyst to
deficiencies in the parametric model and sometimes point ways to improving
the parametric specification. Procedures geared towards assessing the validity
of a model are known as model diagnostics.
We have already seen the technique of overlaying graphs for comparison pur-
poses. To reinforce the application of this technique, Figure 4.7 compares the
empirical distribution to two parametric fitted distributions. The left panel
shows the distribution functions of claims distributions. The dots forming an
“S-shaped” curve represent the empirical distribution function at each observa-
tion. The thick blue curve gives corresponding values for the fitted gamma
distribution and the light purple is for the fitted Pareto distribution. Because
the Pareto is much closer to the empirical distribution function than the gamma,
this provides evidence that the Pareto is the better model for this data set. The
right panel gives similar information for the density function and provides a
consistent message. Based (only) on these figures, the Pareto distribution is the
clear choice for the analyst.
Figure 4.7: Nonparametric Versus Fitted Parametric Distributions. The left panel compares distribution functions and the right panel compares densities of log claims; in each panel, the empirical estimate is shown alongside the fitted gamma and Pareto distributions.
For another way to compare the appropriateness of two fitted models, consider
the probability-probability (pp) plot. A 𝑝𝑝 plot compares cumulative probabili-
ties under two models. For our purposes, these two models are the nonparamet-
ric empirical distribution function and the parametric fitted model. Figure 4.8
shows 𝑝𝑝 plots for the Property Fund data introduced in Section 1.3. The fitted
gamma is on the left and the fitted Pareto is on the right, compared to the same
empirical distribution function of the data. The straight line represents equality
between the two distributions being compared, so points close to the line are
desirable. As seen in earlier demonstrations, the Pareto is much closer to the
empirical distribution than the gamma, providing additional evidence that the
Pareto is the better model.
Figure 4.8: Probability-Probability (𝑝𝑝) Plots. The horizontal axes give the empirical distribution function; the vertical axes give the fitted gamma (left panel) and Pareto (right panel) distribution functions.
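A pp plot can be produced directly from the empirical distribution function. A sketch, assuming X holds the claims and that alpha and theta are the fitted Pareto parameters:

Fn <- ecdf(X)                                           # empirical distribution function
Fpareto <- function(x) 1 - (theta / (x + theta))^alpha  # fitted Pareto cdf
xs <- sort(X)
plot(Fn(xs), Fpareto(xs), xlab = "Empirical DF", ylab = "Pareto DF")
abline(0, 1)                                            # the line of equality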
A 𝑞𝑞 plot compares the quantiles of two distributions, evaluated over a grid of probabilities (e.g., 0, 0.001, 0.002, … , 0.999, 1.000), depending on the application. In Figure 4.9, for
each point on the aforementioned grid, the horizontal axis displays the empir-
ical quantile and the vertical axis displays the corresponding fitted parametric
quantile (gamma for the upper two panels, Pareto for the lower two). Quan-
tiles are plotted on the original scale in the left panels and on the log scale in
the right panels to allow us to see where a fitted distribution is deficient. The
straight line represents equality between the empirical distribution and fitted
distribution. From these plots, we again see that the Pareto is an overall bet-
ter fit than the gamma. Furthermore, the lower-right panel suggests that the
Pareto distribution does a good job with large claims, but provides a poorer fit
for small claims.
Figure 4.9: Quantile-Quantile (𝑞𝑞) Plots. The horizontal axis gives the empirical quantiles at each observation; in the right-hand panels they are graphed on a logarithmic basis. The vertical axis gives the quantiles from the fitted distributions; gamma quantiles are in the upper panels, Pareto quantiles are in the lower panels.
Example. The graph below gives a 𝑝𝑝 plot of a fitted distribution compared to a sample.

[𝑝𝑝 plot: sample on the horizontal axis, fitted distribution on the vertical axis, both on the unit interval.]

Comment on the fit.
Solution. The tail of the fitted distribution is too thick on the left, too thin
on the right, and the fitted distribution has less probability around the median
than the sample. To see this, recall that the 𝑝𝑝 plot graphs the cumulative
distribution of two distributions on its axes (empirical on the x-axis and fitted
on the y-axis in this case). For small values of 𝑥, the fitted model assigns greater
probability to being below that value than occurred in the sample (i.e. 𝐹 (𝑥) >
𝐹𝑛 (𝑥)). This indicates that the model has a heavier left tail than the data. For
large values of 𝑥, the model again assigns greater probability to being below that
value and thus less probability to being above that value (i.e. 𝑆(𝑥) < 𝑆𝑛 (𝑥)).
This indicates that the model has a lighter right tail than the data. In addition,
as we go from 0.4 to 0.6 on the horizontal axis (thus looking at the middle 20%
of the data), the 𝑝𝑝 plot increases from about 0.3 to 0.4. This indicates that
the model puts only about 10% of the probability in this range.
Maximum likelihood estimation is typically carried out with iterative numerical routines that require starting values to begin the recursive process. Although many problems are robust to the choice of the
starting values, for some complex situations, it can be important to have a start-
ing value that is close to the (unknown) optimal value. Method of moments and
percentile matching are techniques that can produce desirable estimates without
a serious computational investment and can thus be used as a starting value for
computing maximum likelihood.
Method of Moments
Under the method of moments, we approximate the moments of the parametric
distribution using the empirical (nonparametric) moments described in Section
4.1.1. We can then algebraically solve for the parameter estimates.
Example 4.1.9. Property Fund. For the 2010 property fund, there are
𝑛 = 1, 377 individual claims (in thousands of dollars) with
$$m_1 = \frac{1}{n}\sum_{i=1}^{n} X_i = 26.62259 \quad \text{and} \quad m_2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 = 136{,}154.6.$$
Fit the parameters of the gamma and Pareto distributions using the method of
moments.
Solution. To fit a gamma distribution, we have $\mu_1 = \alpha\theta$ and $\mu_2' = \alpha(\alpha+1)\theta^2$. Equating these to the sample moments and solving, easy algebra shows that
$$\hat{\alpha} = \frac{m_1^2}{m_2 - m_1^2} = \frac{26.62259^2}{136154.6 - 26.62259^2} = 0.005232809$$
$$\hat{\theta} = \frac{m_2 - m_1^2}{m_1} = \frac{136154.6 - 26.62259^2}{26.62259} = 5{,}087.629.$$
Similarly, for the Pareto distribution, we have $\mu_1 = \theta/(\alpha - 1)$ and $\mu_2' = 2\theta^2/\{(\alpha-1)(\alpha-2)\}$. Easy algebra shows that
$$\alpha = 1 + \frac{\mu_2'}{\mu_2' - 2\mu_1^2} \quad \text{and} \quad \theta = (\alpha - 1)\mu_1,$$
so that
$$\hat{\alpha} = 1 + \frac{136154.6}{136154.6 - 2(26.62259)^2} = 2.010521$$
$$\hat{\theta} = (2.010521 - 1) \cdot 26.62259 = 26.9027.$$
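These method of moments calculations are easily coded. A sketch, assuming X holds the 1,377 claims:

m1 <- mean(X); m2 <- mean(X^2)
# gamma method of moments estimators
alpha_gamma <- m1^2 / (m2 - m1^2)
theta_gamma <- (m2 - m1^2) / m1
# Pareto method of moments estimators
alpha_pareto <- 1 + m2 / (m2 - 2 * m1^2)
theta_pareto <- (alpha_pareto - 1) * m1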
As the above example suggests, there is flexibility with the method of moments.
For example, we could have matched the second and third moments instead
of the first and second, yielding different estimators. Furthermore, there is no
guarantee that a solution will exist for each problem. For data that are censored
or truncated, matching moments is possible for a few problems but, in general,
this is a more difficult scenario. Finally, for distributions where the moments do
not exist or are infinite, method of moments is not available. As an alternative,
one can use the percentile matching technique.
Percentile Matching
Under percentile matching, we approximate the quantiles or percentiles of the
parametric distribution using the empirical (nonparametric) quantiles or per-
centiles described in Section 4.1.1.
Example 4.1.10. Property Fund. For the 2010 property fund, we illus-
trate matching on quantiles. In particular, the Pareto distribution is intuitively
pleasing because of the closed-form solution for the quantiles. Recall that the
distribution function for the Pareto distribution is
$$F(x) = 1 - \left(\frac{\theta}{x + \theta}\right)^{\alpha}.$$
Easy algebra shows that we can express the quantile as
$$x_q = \theta\left[(1 - q)^{-1/\alpha} - 1\right], \quad 0 < q < 1.$$
Determine estimates of the Pareto distribution parameters using the 25th and
95th empirical quantiles.
Solution. The 25th percentile (the first quartile) turns out to be 0.78853 and the 95th percentile is 50.98293 (both in thousands of dollars). With two equations and two unknowns, the estimates $\hat{\alpha}$ and $\hat{\theta}$ can be determined by solving the two quantile equations simultaneously.
Calculate the estimate of 𝜃 by percentile matching, using the 40th and 80th
empirically smoothed percentile estimates.
Solution. With 11 observations, we have $j = \lfloor(n+1)q\rfloor = \lfloor 12(0.4)\rfloor = \lfloor 4.8\rfloor = 4$ and $h = (n+1)q - j = 12(0.4) - 4 = 0.8$. By interpolation, the 40th empirically smoothed percentile estimate is $\hat{\pi}_{0.4} = (1-h)X_{(j)} + hX_{(j+1)} = 0.2(86) + 0.8(90) = 89.2$.
Similarly, for the 80th empirically smoothed percentile estimate, we have $12(0.8) = 9.6$, so the estimate is $\hat{\pi}_{0.8} = 0.4(200) + 0.6(210) = 206$.
Using the loglogistic cumulative distribution, we need to solve the following two equations for the parameters $\hat{\theta}$ and $\hat{\gamma}$:
$$0.4 = \frac{(89.2/\hat{\theta})^{\hat{\gamma}}}{1 + (89.2/\hat{\theta})^{\hat{\gamma}}} \quad \text{and} \quad 0.8 = \frac{(206/\hat{\theta})^{\hat{\gamma}}}{1 + (206/\hat{\theta})^{\hat{\gamma}}}.$$
In terms of odds, these equations are $(89.2/\hat{\theta})^{\hat{\gamma}} = 2/3$ and $(206/\hat{\theta})^{\hat{\gamma}} = 4$. Taking the ratio eliminates $\hat{\theta}$, giving $(206/89.2)^{\hat{\gamma}} = 6$, so that $\hat{\gamma} = \ln 6 / \ln(206/89.2) = 2.1407$ and $\hat{\theta} = 206/4^{1/\hat{\gamma}} = 107.8$.
Like the method of moments, percentile matching is almost too flexible in the
sense that estimators can vary depending on different percentiles chosen. For ex-
ample, one actuary may use estimation on the 25th and 95th percentiles whereas
another uses the 20th and 80th percentiles. In general, estimated parameters will differ and there is no compelling reason to prefer one over the other. Also, as
with the method of moments, percentile matching is appealing because it pro-
vides a technique that can be readily applied in selected situations and has an
intuitive basis. Although most actuarial applications use maximum likelihood
estimators, it can be convenient to have alternative approaches such as method
of moments and percentile matching available.
4.2 Model Selection

This section underscores the idea that model selection is an iterative process in which models are cyclically (re)formulated and tested for appropriateness before using them for inference. After an overview, we describe the model selection process based on:
• an in-sample or training dataset,
• an out-of-sample or test dataset, and
• a method that combines these approaches known as cross-validation.
Begin by summarizing the data graphically and numerically; some displays and rescalings (e.g., logarithmic) may present some difficulties. For discrete data, tables are often preferred. Determine sample moments, such as the mean and variance, as
well as selected quantiles, including the minimum, maximum, and the median.
For discrete data, the mode (or most frequently occurring value) is usually
helpful.
These summaries, as well as your familiarity with industry practice, will suggest
one or more candidate parametric models. Generally, start with the simpler
parametric models (for example, one parameter exponential before a two param-
eter gamma), gradually introducing more complexity into the modeling process.
Critique the candidate parametric model numerically and graphically. For the
graphs, utilize the tools introduced in Section 4.1.2 such as 𝑝𝑝 and 𝑞𝑞 plots. For
the numerical assessments, examine the statistical significance of parameters
and try to eliminate parameters that do not provide additional information.
Likelihood Ratio Tests. For comparing model fits, if one model is a subset
of another, then a likelihood ratio test may be employed; the general approach
to likelihood ratio testing is described in Sections 15.4.3 and 17.3.2.
Goodness of Fit Statistics. Generally, models are not proper subsets of one
another so overall goodness of fit statistics are helpful for comparing models.
Information criteria are one type of goodness of fit statistic. The most widely used
examples are Akaike’s Information Criterion (AIC) and the (Schwarz) Bayesian
Information Criterion (BIC); they are widely cited because they can be readily
generalized to multivariate settings. Section 15.4.4 provides a summary of these
statistics.
For selecting the appropriate distribution, statistics that compare a parametric
fit to a nonparametric alternative, summarized in Section 4.1.2, are useful for
model comparison. For discrete data, a goodness of fit statistic (as described in
Section 2.7) is generally preferred as it is more intuitive and simpler to explain.
Random Split of the Data. Unfortunately, rarely will two sets of data be
available to the investigator. However, we can implement the validation process
by splitting the data set into training and test subsamples, respectively. Figure
4.11 illustrates this splitting of the data.
Figure 4.11: Model Validation. An original sample of size $n$ is randomly split into a training subsample of size $n_1$ and a test subsample of size $n_2$.
One uses the training sample to develop an estimate of g, say, ĝ, and then calibrates the distance from the observed outcomes to the predictions using a criterion of the form
$$\sum_{i} \mathrm{d}(y_i, \hat{g}_i).$$
Here, “d” is some measure of distance and the sum over $i$ is taken over the test data. In many regression applications, it is common to use squared Euclidean distance of the form $\mathrm{d}(y_i, g) = (y_i - g)^2$. In actuarial applications, Euclidean distance $\mathrm{d}(y_i, g) = |y_i - g|$ is often preferred because of the skewed nature of the data
(large outlying values of 𝑦 can have a large effect on the measure). Chapter
7 describes another measure, the Gini index, that is useful in actuarial appli-
cations particularly when there is a large proportion of zeros in claims data
(corresponding to no claims).
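The following sketch illustrates the splitting and distance calculation, assuming y holds the outcomes and using an assumed 70/30 split:

n <- length(y)
train <- sample(n, size = floor(0.7 * n))   # indices of the training subsample
ghat <- mean(y[train])                      # a simple estimate from the training set
sum(abs(y[-train] - ghat))                  # absolute-distance criterion on the test set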
Selecting a Distribution. Still, our focus so far has been to select a distribu-
tion for a data set that can be used for actuarial modeling without additional
inputs 𝑥1 , … , 𝑥𝑘 . Even in this more fundamental problem, the model validation
approach is valuable. If we base all inference on only in-sample data, then there
is a tendency to select more complicated models than needed. For example, we
might select a four parameter GB2, generalized beta of the second kind, distri-
bution when only a two parameter Pareto is needed. Information criteria such
as AIC and BIC include penalties for model complexity and so provide some protection, but using a test sample is the best guarantee to achieve parsimonious models. From a quote often attributed to Albert Einstein, we want to use the simplest model possible, but no simpler.
For example, in a study of employee turnover, workers employed in the first year but not the second have left sometime during the year.
With an exact departure date (individual data), we could compute the amount
of time that they were with the firm. Without the departure date (grouped
data), we only know that they departed sometime during a year-long interval.
Formalizing this idea, suppose there are 𝑘 groups or intervals delimited by
boundaries 𝑐0 < 𝑐1 < ⋯ < 𝑐𝑘 . For each observation, we only observe the interval
into which it fell (e.g. (𝑐𝑗−1 , 𝑐𝑗 )), not the exact value. Thus, we only know the
number of observations in each interval. The constants {𝑐0 < 𝑐1 < ⋯ < 𝑐𝑘 } form
some partition of the domain of 𝐹(⋅). Then the probability of an observation 𝑋𝑖 falling in the 𝑗th interval is
$$\Pr\left(X_i \in (c_{j-1}, c_j]\right) = F(c_j) - F(c_{j-1}).$$
Now, define 𝑛𝑗 to be the number of observations that fall in the 𝑗th interval,
(𝑐𝑗−1 , 𝑐𝑗 ]. Thus, the likelihood function (with respect to the parameter(s) 𝜃) is
$$L(\theta) = \prod_{j=1}^{k}\left\{F(c_j) - F(c_{j-1})\right\}^{n_j},$$
so that the log-likelihood is
$$l(\theta) = \log L(\theta) = \sum_{j=1}^{k} n_j \log\left\{F(c_j) - F(c_{j-1})\right\}.$$
Censored Data
Censoring occurs when we record only a limited value of an observation. The
most common form is right-censoring, in which we record the smaller of the
“true” dependent variable and a censoring value. Using notation, let 𝑋 represent
an outcome of interest, such as the loss due to an insured event or time until an
event. Let 𝐶𝑈 denote the censoring amount. With right-censored observations,
we record 𝑋𝑈∗ = min(𝑋, 𝐶𝑈 ) = 𝑋 ∧𝐶𝑈 . We also record whether or not censoring
has occurred. Let 𝛿𝑈 = 𝐼(𝑋 ≤ 𝐶𝑈 ) be a binary variable that is 0 if censoring
occurs and 1 if it does not, that is, 𝛿𝑈 indicates whether or not 𝑋 is uncensored.
For an example that we saw in Section 3.4.2, 𝐶𝑈 may represent the upper limit
of coverage of an insurance policy (we used 𝑢 for the upper limit in that section).
The loss may exceed the amount 𝐶𝑈 , but the insurer only has 𝐶𝑈 in its records
as the amount paid out and does not have the amount of the actual loss 𝑋 in
its records.
Similarly, with left-censoring, we record the larger of a variable of interest
and a censoring variable. If 𝐶𝐿 is used to represent the censoring amount, we
record 𝑋𝐿∗ = max(𝑋, 𝐶𝐿 ) along with the censoring indicator 𝛿𝐿 = 𝐼(𝑋 > 𝐶𝐿 ).
As an example, you got a brief introduction to reinsurance (insurance for insur-
ers) in Section 3.4.4 and will see more in Chapter 10. Suppose a reinsurer will
cover insurer losses greater than 𝐶𝐿 ; this means that the reinsurer is responsi-
ble for the excess of 𝑋𝐿∗ over 𝐶𝐿 . Using notation, the loss of the reinsurer is
$Y = X_L^* - C_L$. To see this, first consider the case where the policyholder loss $X < C_L$. Then, the insurer will pay the entire claim and the reinsurer pays $Y = C_L - C_L = 0$. If instead $X \ge C_L$, then $X_L^* = X$ and the reinsurer pays $Y = X - C_L$.
Truncated Data
Censored observations are recorded for study, although in a limited form. In
contrast, truncated outcomes are a type of missing data. An outcome is poten-
tially truncated when the availability of an observation depends on the outcome.
In insurance, it is common for observations to be left-truncated at $C_L$ when the amount is
$$Y = \begin{cases} \text{we do not observe } X & X \le C_L \\ X & X > C_L. \end{cases}$$
Similarly, an observation is right-truncated at $C_U$ when the amount is
$$Y = \begin{cases} X & X \le C_U \\ \text{we do not observe } X & X > C_U. \end{cases}$$
[Timeline: a loss 𝑋 shown on a calendar-time axis relative to the origin 0, the left-truncation point 𝐶𝐿, and the censoring point 𝐶𝑈.]
For right-censored data, the contribution of an observation to the likelihood is
$$\begin{cases} 1 - F(C_U) & \text{if } \delta = 0 \\ f(x) & \text{if } \delta = 1 \end{cases} = \{f(x)\}^{\delta}\{1 - F(C_U)\}^{1-\delta}.$$
The likelihood for a sample is
$$L(\theta) = \prod_{i=1}^{n}\{f(x_i)\}^{\delta_i}\{1 - F(C_{U_i})\}^{1-\delta_i} = \prod_{\delta_i = 1} f(x_i) \prod_{\delta_i = 0}\{1 - F(C_{U_i})\},$$
with potential censoring times $\{C_{U_1}, \ldots, C_{U_n}\}$. Here, the notation “$\prod_{\delta_i=1}$” means to take the product over uncensored observations, and similarly for “$\prod_{\delta_i=0}$.”
On the other hand, truncated data are handled in likelihood inference via condi-
tional probabilities. Specifically, we adjust the likelihood contribution by divid-
ing by the probability that the variable was observed. To summarize, we have
the following contributions to the likelihood function for six types of outcomes: exact values contribute the density $f(x)$; right-censored observations contribute $1 - F(C_U)$; left-censored observations contribute $F(C_L)$; interval-censored observations contribute $F(c_j) - F(c_{j-1})$; and left- or right-truncated observations have their contributions divided by $1 - F(C_L)$ or $F(C_U)$, respectively. For data with exact, right-, left-, and interval-censored values, the likelihood is
$$L(\theta) = \prod_{E} f(x_i)\prod_{R}\{1 - F(C_{U_i})\}\prod_{L} F(C_{L_i})\prod_{I}\{F(c_{j_i}) - F(c_{j_i - 1})\},$$
where “$\prod_E$” is the product over observations with Exact values, and similarly for $R$ight-, $L$eft- and $I$nterval-censoring.
For right-censored and left-truncated data, the likelihood is
$$L(\theta) = \prod_{E}\frac{f(x_i)}{1 - F(C_{L_i})}\prod_{R}\frac{1 - F(C_{U_i})}{1 - F(C_{L_i})},$$
and similarly for other combinations. To get further insights, consider the corresponding log-likelihood
$$l(\theta) = \sum_{E}\left\{\log f(x_i) - \log(1 - F(C_{L_i}))\right\} + \sum_{R}\left\{\log(1 - F(C_{U_i})) - \log(1 - F(C_{L_i}))\right\}.$$
Special Case: Exponential Distribution. Suppose that losses have an exponential distribution with mean $\theta$ and are right-censored and left-truncated; each likelihood contribution is divided by $1 - F(C_{L_i})$, the probability that the observed variable exceeds the lower truncation limit. Writing $x_i^{**} = \min(x_i, C_{U_i}) - C_{L_i}$ for the observed amount in excess of the truncation limit and letting $\delta_i$ indicate censoring, the log-likelihood is
$$l(\theta) = -\sum_{i=1}^{n}\left((1 - \delta_i)\log\theta + \frac{x_i^{**}}{\theta}\right). \tag{4.5}$$
Taking derivatives with respect to the parameter $\theta$ and setting it equal to zero yields the maximum likelihood estimator
$$\hat{\theta} = \frac{1}{n_u}\sum_{i=1}^{n} x_i^{**},$$
where $n_u = \sum_i(1 - \delta_i)$ is the number of uncensored observations.
For instance, with three uncensored observations and $\sum_i x_i^{**} = 700$, the log-likelihood is $l(\theta) = -3\log\theta - 700/\theta$, so that
$$l'(\theta) = -3\theta^{-1} + 700\theta^{-2} = 0 \Rightarrow \hat{\theta} = \frac{700}{3} = 233.33.$$
Example 4.3.3. Actuarial Exam Question. You are given the following
information about a random sample:
(i) The sample size equals five.
(ii) The sample is from a Weibull distribution with 𝜏 = 2.
(iii) Two of the sample observations are known to exceed 50, and the remaining
three observations are 20, 30, and 45.
Calculate the maximum likelihood estimate of 𝜃.
Solution. Using the Weibull density $f(x) = (2x/\theta^2)e^{-(x/\theta)^2}$ and survival function $S(x) = e^{-(x/\theta)^2}$, the likelihood function is
$$L(\theta) = f(20)\,f(30)\,f(45)\,[S(50)]^2 \propto \theta^{-6}\exp\left(-\frac{20^2 + 30^2 + 45^2 + 2(50^2)}{\theta^2}\right) = \theta^{-6} e^{-8325/\theta^2}.$$
The log-likelihood is $l(\theta) = -6\log\theta - 8325\,\theta^{-2}$ plus a constant, so
$$l'(\theta) = \frac{-6}{\theta} + \frac{16650}{\theta^3} = 0 \Rightarrow \hat{\theta} = \left(\frac{16650}{6}\right)^{1/2} = 52.6783.$$
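The same estimate can be obtained numerically. A sketch of maximum likelihood with right-censored observations for this example:

nloglik <- function(theta) {
  obs <- c(20, 30, 45)   # exact observations
  cens <- c(50, 50)      # observations right-censored at 50
  -(sum(dweibull(obs, shape = 2, scale = theta, log = TRUE)) +
    sum(pweibull(cens, shape = 2, scale = theta, lower.tail = FALSE, log.p = TRUE)))
}
optimize(nloglik, interval = c(1, 500))   # minimum near theta = 52.68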
Grouped Data
As we have seen in Section 4.3.1, observations may be grouped (also referred
to as interval censored) in the sense that we only observe them as belonging in
one of 𝑘 intervals of the form (𝑐𝑗−1 , 𝑐𝑗 ], for 𝑗 = 1, … , 𝑘. At the boundaries, the
empirical distribution function is defined in the usual way:
number of observations ≤ 𝑐𝑗
𝐹𝑛 (𝑐𝑗 ) = .
𝑛
Ogive Estimator. For other values of 𝑥 ∈ (𝑐𝑗−1 , 𝑐𝑗 ), we can estimate the dis-
tribution function with the ogive estimator, which linearly interpolates between
𝐹𝑛 (𝑐𝑗−1 ) and 𝐹𝑛 (𝑐𝑗 ), i.e. the values of the boundaries 𝐹𝑛 (𝑐𝑗−1 ) and 𝐹𝑛 (𝑐𝑗 ) are
connected with a straight line. This can formally be expressed as
$$F_n(x) = \frac{c_j - x}{c_j - c_{j-1}}F_n(c_{j-1}) + \frac{x - c_{j-1}}{c_j - c_{j-1}}F_n(c_j) \quad \text{for } c_{j-1} \le x < c_j.$$
The corresponding density estimator is
$$f_n(x) = F_n'(x) = \frac{F_n(c_j) - F_n(c_{j-1})}{c_j - c_{j-1}} \quad \text{for } c_{j-1} < x < c_j.$$
Example 4.3.4. Actuarial Exam Question. You are given the following information regarding claim sizes for 100 claims:

Claim Size          Number of Claims
(0, 1000]           16
(1000, 3000]        22
(3000, 5000]        25
(5000, 10000]       18
over 10000          19

Using the ogive, calculate the estimate of the probability that a randomly chosen claim is between 2000 and 6000.
Solution. At the boundaries, the empirical distribution function is defined in
the usual way, so we have
𝐹100 (1000) = 0.16, 𝐹100 (3000) = 0.38, 𝐹100 (5000) = 0.63, 𝐹100 (10000) = 0.81
For other claim sizes, the ogive estimator linearly interpolates between these values:
$$F_{100}(2000) = \frac{3000 - 2000}{3000 - 1000}(0.16) + \frac{2000 - 1000}{3000 - 1000}(0.38) = 0.27$$
$$F_{100}(6000) = \frac{10000 - 6000}{10000 - 5000}(0.63) + \frac{6000 - 5000}{10000 - 5000}(0.81) = 0.666.$$
Thus, the probability that a claim is between 2000 and 6000 is $F_{100}(6000) - F_{100}(2000) = 0.666 - 0.27 = 0.396$.
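The ogive is simply a linear interpolation of the empirical distribution function at the boundaries, so the calculation can be checked with approxfun:

bounds <- c(0, 1000, 3000, 5000, 10000)
Fn <- c(0, 0.16, 0.38, 0.63, 0.81)       # empirical df at the boundaries
ogive <- approxfun(bounds, Fn)           # linear interpolation
ogive(6000) - ogive(2000)                # 0.666 - 0.27 = 0.396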
To motivate estimators for censored data, we first return to the “usual” case without censoring. Here, the empirical distribution
function 𝐹𝑛 (𝑥) is an unbiased estimator of the distribution function 𝐹 (𝑥). This
is because 𝐹𝑛 (𝑥) is the average of indicator variables each of which are unbiased,
that is, E [𝐼(𝑋𝑖 ≤ 𝑥)] = Pr(𝑋𝑖 ≤ 𝑥) = 𝐹 (𝑥).
Now suppose the random outcome is censored on the right by a limiting amount,
say, 𝐶𝑈 , so that we record the smaller of the two, 𝑋 ∗ = min(𝑋, 𝐶𝑈 ). For values
of 𝑥 that are smaller than 𝐶𝑈 , the indicator variable still provides an unbiased
estimator of the distribution function before we reach the censoring limit. That
is, E [𝐼(𝑋 ∗ ≤ 𝑥)] = 𝐹 (𝑥) because 𝐼(𝑋 ∗ ≤ 𝑥) = 𝐼(𝑋 ≤ 𝑥) for 𝑥 < 𝐶𝑈 . In the
same way, E [𝐼(𝑋 ∗ > 𝑥)] = 1 − 𝐹 (𝑥) = 𝑆(𝑥). But, for 𝑥 > 𝐶𝑈 , 𝐼(𝑋 ∗ ≤ 𝑥) is in
general not an unbiased estimator of 𝐹 (𝑥).
As an alternative, consider two random variables that have different censor-
ing limits. For illustration, suppose that we observe 𝑋1∗ = min(𝑋1 , 5) and
𝑋2∗ = min(𝑋2 , 10) where 𝑋1 and 𝑋2 are independent draws from the same dis-
tribution. For 𝑥 ≤ 5, the empirical distribution function 𝐹2 (𝑥) is an unbiased
estimator of 𝐹 (𝑥). However, for 5 < 𝑥 ≤ 10, the first observation cannot be used
for the distribution function because of the censoring limitation. Instead, the
strategy developed by Kaplan and Meier (1958) is to use $S_2(5)$ as an estimator of $S(5)$ and then to use the second observation to estimate the survival function conditional on survival to time 5, $\Pr(X > x | X > 5) = S(x)/S(5)$. Specifically, for $5 < x \le 10$, the estimator of the survival function is
$$\hat{S}(x) = S_2(5) \times I(X_2^* > x).$$
Extending this idea to general right-censored samples leads to the Kaplan-Meier product-limit estimator. Let $t_1 < t_2 < \cdots$ denote the distinct points at which an uncensored observation occurs, let $s_j$ be the number of uncensored observations at $t_j$, and let $R_j$ be the size of the risk set at $t_j$ (the number of observations, censored or uncensored, with value at least $t_j$). The product-limit estimator of the distribution function is
$$\hat{F}(x) = \begin{cases} 0 & x < t_1 \\ 1 - \prod_{j: t_j \le x}\left(1 - \frac{s_j}{R_j}\right) & x \ge t_1. \end{cases} \tag{4.6}$$
For example, if $x$ is smaller than the smallest uncensored loss, then $x < t_1$ and $\hat{F}(x) = 0$. As another example, if $x$ falls between the second and third smallest uncensored losses, then $x \in (t_2, t_3]$ and $\hat{F}(x) = 1 - \left(1 - \frac{s_1}{R_1}\right)\left(1 - \frac{s_2}{R_2}\right)$.
As usual, the corresponding estimate of the survival function is $\hat{S}(x) = 1 - \hat{F}(x)$.
Example. Consider the following ten losses, where a plus sign indicates that the observation is right-censored at that value:

4  4  5+  5+  5+  8  10+  10+  12  15

The quantities needed for the product-limit estimator are:

𝑗    𝑡𝑗    𝑠𝑗    𝑅𝑗
1    4     2     10
2    8     1     5
3    12    1     2
4    15    1     1

For example, the estimate of the survival function at 11 is
$$\hat{S}(11) = \prod_{j: t_j \le 11}\left(1 - \frac{s_j}{R_j}\right) = \prod_{j=1}^{2}\left(1 - \frac{s_j}{R_j}\right) = \left(1 - \frac{2}{10}\right)\left(1 - \frac{1}{5}\right) = (0.8)(0.8) = 0.64.$$
[Figure: Kaplan-Meier estimate of the survival function for these data, a step function decreasing from 1.0 at the uncensored loss times.]
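These calculations are implemented in the R survival package. A sketch for the ten losses above, where the event indicator is 0 for censored (+) observations:

library(survival)
time  <- c(4, 4, 5, 5, 5, 8, 10, 10, 12, 15)
event <- c(1, 1, 0, 0, 0, 1, 0, 0, 1, 1)   # 1 = uncensored, 0 = censored
km <- survfit(Surv(time, event) ~ 1)
summary(km)                                # survival is 0.64 on the interval (8, 12]
plot(km)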
When data may also be left-truncated, let $d_i$ denote the left-truncation (entry) point of observation $i$, $x_i$ an uncensored value, and $u_i$ a right-censored value. The risk set becomes
$$R_j = \sum_{i=1}^{n} I(x_i \ge t_j) + \sum_{i=1}^{n} I(u_i \ge t_j) - \sum_{i=1}^{n} I(d_i \ge t_j).$$
With this new definition of the risk set, the product-limit estimator of the dis-
tribution function is as in equation (4.6).
Greenwood’s Formula. Greenwood (1926) derived the formula for the estimated variance of the product-limit estimator:
$$\widehat{Var}(\hat{F}(x)) = (1 - \hat{F}(x))^2\sum_{j: t_j \le x}\frac{s_j}{R_j(R_j - s_j)}.$$
Nelson-Aalen Estimator. An alternative to the product-limit estimator is
$$\hat{F}_{NA}(x) = \begin{cases} 0 & x < t_1 \\ 1 - \exp\left(-\sum_{j: t_j \le x}\frac{s_j}{R_j}\right) & x \ge t_1. \end{cases}$$
Note that the above expression is a result of the Nelson-Aalen estimator of the cumulative hazard function
$$\hat{H}(x) = \sum_{j: t_j \le x}\frac{s_j}{R_j}$$
and the relationship between the survival function and the cumulative hazard function, $\hat{S}_{NA}(x) = e^{-\hat{H}(x)}$.
Example. Actuarial Exam Question. You are given the following data, where $d_i$ denotes the left-truncation point, $x_i$ an observed (uncensored) value, and $u_i$ a right-censored value:

Observation (𝑖)   1    2    3    4    5    6    7    8    9    10
𝑑𝑖                0    0    0    0    0    0    0    1.3  1.5  1.6
𝑥𝑖                0.9  −    1.5  −    −    1.7  −    2.1  2.1  −
𝑢𝑖                −    1.2  −    1.5  1.6  −    1.7  −    −    2.3

Calculate the Kaplan-Meier product-limit estimate, $\hat{S}(1.6)$.
Solution. Recall that the risk set is $R_j = \sum_{i=1}^{n}\{I(x_i \ge t_j) + I(u_i \ge t_j) - I(d_i \ge t_j)\}$. Then

𝑗    𝑡𝑗    𝑠𝑗    𝑅𝑗            $\hat{S}(t_j)$
1    0.9   1     10 − 3 = 7    $1 - \frac{1}{7} = \frac{6}{7}$
2    1.5   1     8 − 2 = 6     $\frac{6}{7}\left(1 - \frac{1}{6}\right) = \frac{5}{7}$
3    1.7   1     5 − 0 = 5     $\frac{5}{7}\left(1 - \frac{1}{5}\right) = \frac{4}{7}$
4    2.1   2     3             $\frac{4}{7}\left(1 - \frac{2}{3}\right) = \frac{4}{21}$

The Kaplan-Meier estimate is therefore $\hat{S}(1.6) = \frac{5}{7}$.
Example. Follow-Up. Return to the ten losses considered in the previous example.
a) Using the Nelson-Aalen estimator, calculate the probability that the loss on a policy exceeds 11, $\hat{S}_{NA}(11)$.
b) Calculate Greenwood’s approximation to the variance of the product-limit estimate $\hat{S}(11)$.
Solution. As before, the relevant quantities are:

𝑗    𝑡𝑗    𝑠𝑗    𝑅𝑗
1    4     2     10
2    8     1     5
3    12    1     2
4    15    1     1

a) The Nelson-Aalen estimate of $S(11)$ is $\hat{S}_{NA}(11) = e^{-\hat{H}(11)} = e^{-0.4} = 0.67$, since
$$\hat{H}(11) = \sum_{j: t_j \le 11}\frac{s_j}{R_j} = \sum_{j=1}^{2}\frac{s_j}{R_j} = \frac{2}{10} + \frac{1}{5} = 0.2 + 0.2 = 0.4.$$
b) From earlier work, the Kaplan-Meier estimate of $S(11)$ is $\hat{S}(11) = 0.64$. Then Greenwood’s estimate of the variance of the product-limit estimate of $S(11)$ is
$$\widehat{Var}(\hat{S}(11)) = (\hat{S}(11))^2\sum_{j: t_j \le 11}\frac{s_j}{R_j(R_j - s_j)} = (0.64)^2\left(\frac{2}{10(8)} + \frac{1}{5(4)}\right) = 0.0307.$$
4.4 Bayesian Inference

Bayes’ rule relates the distribution of the parameters given the data to the sampling distribution and the prior:
$$\Pr(parameters | data) = \frac{\Pr(data | parameters) \times \Pr(parameters)}{\Pr(data)},$$
where
• Pr(𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠) is the distribution of the parameters, known as the prior
distribution.
• Pr(𝑑𝑎𝑡𝑎|𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠) is the sampling distribution. In a frequentist context,
it is used for making inferences about the parameters and is known as the
likelihood.
• Pr(𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠|𝑑𝑎𝑡𝑎) is the distribution of the parameters having observed
the data, known as the posterior distribution.
• Pr(𝑑𝑎𝑡𝑎) is the marginal distribution of the data. It is generally obtained
by integrating (or summing) the joint distribution of data and parameters
over parameter values.
Why Bayes? There are several advantages of the Bayesian approach. First, we
can describe the entire distribution of parameters conditional on the data. This
allows us, for example, to provide probability statements regarding the likeli-
hood of parameters. Second, the Bayesian approach provides a unified approach
for estimating parameters. Some non-Bayesian methods, such as least squares,
require a separate approach to estimate variance components. In contrast, in
Bayesian methods, all parameters can be treated in a similar fashion. This is
convenient for explaining results to consumers of the data analysis. Third, this
approach allows analysts to blend prior information known from other sources
with the data in a coherent manner. This topic is developed in detail in the cred-
ibility Chapter 9. Fourth, Bayesian analysis is particularly useful for forecasting
future responses.
Gamma - Poisson Special Case. To develop intuition, we consider the
gamma-Poisson case that holds a prominent position in actuarial applications.
The idea is to consider a set of random variables 𝑋1 , … , 𝑋𝑛 where each 𝑋𝑖 could
represent the number of claims for the 𝑖th policyholder. Assume that claims of
all policyholders follow the same Poisson distribution, so that 𝑋𝑖 has a Poisson distribution
with parameter 𝜆. This is analogous to the likelihood that we first saw in Chap-
ter 2. In a non-Bayesian (or frequentist) context, the parameter 𝜆 is viewed
as an unknown quantity that is not random (it is said to be “fixed”). In the
Bayesian context, the unknown parameter 𝜆 is viewed as uncertain and is mod-
eled as a random variable. In this special case, we use the gamma distribution
to reflect this uncertainty, the prior distribution.
Think of the following two-stage sampling scheme to motivate our probabilistic
set-up.
1. In the first stage, the parameter 𝜆 is drawn from a gamma distribution.
2. In the second stage, for that value of 𝜆, there are 𝑛 draws from the same
(identical) Poisson distribution that are independent, conditional on 𝜆.
Conditional on $\lambda$, the sampling distribution is Poisson,
$$\Pr(X = x | \lambda) = \frac{\lambda^x}{\Gamma(x+1)}e^{-\lambda},$$
and the prior distribution of $\lambda$ is gamma,
$$f(\lambda) = \frac{\lambda^{\alpha-1}}{\theta^{\alpha}\Gamma(\alpha)}\exp(-\lambda/\theta).$$
Because the prior density integrates to one, we have the useful identity
$$\int_0^{\infty} f(\lambda)\,d\lambda = 1 \implies \theta^{\alpha}\Gamma(\alpha) = \int_0^{\infty}\lambda^{\alpha-1}\exp(-\lambda/\theta)\,d\lambda.$$
In this section, we use small examples that can be done by hand in order to
focus on the foundations. For practical implementation, analysts rely heavily
on simulation methods using modern computational methods such as Markov
Chain Monte Carlo (MCMC) simulation. We will get an exposure to simulation
techniques in Chapter 6, but more intensive techniques such as MCMC require
yet more background. See Hartman (2016) for an introduction to computational
Bayesian methods from an actuarial perspective.
However, we may be very uncertain (or have no clue) about the distribution of 𝜃; the Bayesian machinery allows improper priors with
$$\int \pi(\theta)\,d\theta = \infty,$$
provided that the resulting posterior distribution is well defined:
$$\pi(\theta|x) = \frac{f(x, \theta)}{f(x)} = \frac{f(x|\theta)\pi(\theta)}{f(x)}.$$
The idea is to update your knowledge of the distribution of 𝜃 (𝜋(𝜃)) with the
data 𝑥. Making statements about potential values of parameters is an important
aspect of statistical inference.
For example, a $100(1-\alpha)\%$ credibility interval $[a, b]$ for $\theta$ satisfies $\Pr(a \le \theta \le b | \mathbf{x}) \ge 1 - \alpha$.
Minimizing expected loss is a rigorous method for providing a single “best guess”
about a likely value of a parameter, comparable to a frequentist estimator of
the unknown (fixed) parameter.
To get the exact posterior density, we integrate the above function over its range (0.6, 0.8):
$$\int_{0.6}^{0.8}\left(q^4 - q^5\right)dq = \left.\frac{q^5}{5} - \frac{q^6}{6}\right|_{0.6}^{0.8} = 0.014069 \Rightarrow \pi(q|1, 0) = \frac{q^4 - q^5}{0.014069}.$$
Then
$$\Pr(0.7 < q < 0.8 | 1, 0) = \int_{0.7}^{0.8}\frac{q^4 - q^5}{0.014069}\,dq = 0.5572.$$
Example. Actuarial Exam Question. You are given:
(i) The prior distribution of the parameter Θ has probability density function
$$\pi(\theta) = \frac{1}{\theta^2}, \quad 1 < \theta < \infty.$$
(ii) Given Θ = 𝜃, claim sizes follow a Pareto distribution with parameters 𝛼 = 2 and 𝜃.
A claim of 3 is observed. Calculate the posterior probability that Θ exceeds 2.
Solution: The posterior density, given an observation of 3, is
$$\pi(\theta|3) = \frac{f(3|\theta)\pi(\theta)}{\int_1^{\infty} f(3|\theta)\pi(\theta)\,d\theta} = \frac{\frac{2\theta^2}{(3+\theta)^3}\cdot\frac{1}{\theta^2}}{\int_1^{\infty} 2(3+\theta)^{-3}\,d\theta} = \frac{2(3+\theta)^{-3}}{\left.-(3+\theta)^{-2}\right|_1^{\infty}} = 32(3+\theta)^{-3}, \quad \theta > 1.$$
Then
$$\Pr(\Theta > 2 | 3) = \int_2^{\infty} 32(3+\theta)^{-3}\,d\theta = \left.-16(3+\theta)^{-2}\right|_2^{\infty} = \frac{16}{25} = 0.64.$$
To forecast future observations, the posterior predictive density of a new observation $y$ given data $x$ is
$$f(y|x) = \int f(y|\theta)\pi(\theta|x)\,d\theta.$$
Example. Actuarial Exam Question. The annual number of claims given Θ = 𝜃 and the distribution of Θ are as follows:

Number of Claims    0     1    2
Probability         2𝜃    𝜃    1 − 3𝜃

𝜃              0.05   0.30
Probability    0.80   0.20

Two claims are observed in Year 1. Calculate the Bayesian prediction of the number of claims in Year 2.
Solution. Start with the posterior distribution of the parameter
$$\Pr(\theta|X) = \frac{\Pr(X|\theta)\Pr(\theta)}{\sum_{\theta}\Pr(X|\theta)\Pr(\theta)},$$
so
$$\Pr(\theta = 0.05|X = 2) = \frac{\Pr(X=2|\theta=0.05)\Pr(\theta=0.05)}{\Pr(X=2|\theta=0.05)\Pr(\theta=0.05) + \Pr(X=2|\theta=0.3)\Pr(\theta=0.3)} = \frac{(1 - 3\times 0.05)(0.8)}{(1 - 3\times 0.05)(0.8) + (1 - 3\times 0.3)(0.2)} = \frac{68}{70}.$$
Thus, $\Pr(\theta = 0.3|X = 2) = 1 - \Pr(\theta = 0.05|X = 2) = \frac{2}{70}$.
Given $\theta$, the expected number of claims is $0(2\theta) + 1(\theta) + 2(1 - 3\theta) = 2 - 5\theta$. Thus, the Bayesian prediction of the number of claims in Year 2 is
$$\frac{68}{70}\left\{2 - 5(0.05)\right\} + \frac{2}{70}\left\{2 - 5(0.3)\right\} = \frac{68}{70}(1.75) + \frac{2}{70}(0.5) = 1.714.$$
Example. Actuarial Exam Question. You are given:
(i) Given Θ = 𝜃, annual losses for a policy have probability density function $f(x|\theta) = \frac{\theta}{(x+\theta)^2}$, $x > 0$.
(ii) For half of the company’s policies 𝜃 = 1, while for the other half 𝜃 = 3.
For a randomly selected policy, losses in Year 1 were 5. Calculate the posterior probability that losses for this policy in Year 2 will exceed 8.
Solution. We are given the prior distribution of 𝜃 as $\Pr(\theta = 1) = \Pr(\theta = 3) = \frac{1}{2}$, the conditional distribution $f(x|\theta)$, and the fact that we observed $X_1 = 5$. The goal is to find the predictive probability $\Pr(X_2 > 8 | X_1 = 5)$.
The posterior probabilities are
$$\Pr(\theta = 1|X_1 = 5) = \frac{f(5|\theta=1)\Pr(\theta=1)}{f(5|\theta=1)\Pr(\theta=1) + f(5|\theta=3)\Pr(\theta=3)} = \frac{\frac{1}{36}\left(\frac{1}{2}\right)}{\frac{1}{36}\left(\frac{1}{2}\right) + \frac{3}{64}\left(\frac{1}{2}\right)} = \frac{\frac{1}{72}}{\frac{1}{72} + \frac{3}{128}} = \frac{16}{43}$$
$$\Pr(\theta = 3|X_1 = 5) = 1 - \Pr(\theta = 1|X_1 = 5) = \frac{27}{43}.$$
Note that the conditional probability that losses exceed 8 is
$$\Pr(X_2 > 8|\theta) = \int_8^{\infty} f(x|\theta)\,dx = \int_8^{\infty}\frac{\theta}{(x+\theta)^2}\,dx = \left.-\frac{\theta}{x+\theta}\right|_8^{\infty} = \frac{\theta}{8+\theta}.$$
The predictive probability is therefore
$$\Pr(X_2 > 8|X_1 = 5) = \Pr(X_2 > 8|\theta=1)\Pr(\theta=1|X_1=5) + \Pr(X_2 > 8|\theta=3)\Pr(\theta=3|X_1=5) = \frac{1}{8+1}\left(\frac{16}{43}\right) + \frac{3}{8+3}\left(\frac{27}{43}\right) = 0.2126.$$
Example. Actuarial Exam Question. You are given:
(i) The probability that an insured will have at least one loss during any year is 𝑝.
(ii) The prior distribution for 𝑝 is uniform on [0, 0.5].
(iii) An insured is observed for 8 years and has at least one loss every year.
Calculate the posterior probability that the insured will have at least one loss during Year 9.
Solution. To ease notation, let x = (1, 1, 1, 1, 1, 1, 1, 1) represent the data indicating that the insured has at least one loss every year for 8 years. Conditional on knowing 𝑝, this has probability $p^8$. With this, the posterior probability density is proportional to
$$\pi(p|\mathbf{x}) \propto p^8 \times 1, \quad 0 < p < 0.5,$$
so that normalizing gives $\pi(p|\mathbf{x}) = 9(0.5^{-9})p^8$. Thus, the posterior probability that the insured will have at least one loss during Year 9 is
$$\Pr(X_9 = 1|\mathbf{x}) = \int_0^{0.5}\Pr(X_9 = 1|p)\,\pi(p|\mathbf{x})\,dp = \int_0^{0.5} p\left\{9(0.5^{-9})p^8\right\}dp = 9(0.5^{-9})(0.5^{10})/10 = 0.45.$$
One randomly chosen risk has three claims during Years 1-6. Calculate the
posterior probability of a claim for this risk in Year 7.
Solution. The probabilities are from a binomial distribution with 6 trials in
which 3 successes were observed.
$$\Pr(3|\text{I}) = \binom{6}{3}(0.1^3)(0.9^3) = 0.01458$$
$$\Pr(3|\text{II}) = \binom{6}{3}(0.2^3)(0.8^3) = 0.08192$$
$$\Pr(3|\text{III}) = \binom{6}{3}(0.4^3)(0.6^3) = 0.27648$$
$$\pi(\theta|x) = \frac{f(x|\theta)\pi(\theta)}{f(x)} \propto f(x|\theta)\pi(\theta).$$
Posterior is proportional to likelihood × prior.
For conjugate distributions, the posterior and the prior belong to the same
family of distributions. The following illustration looks at the gamma-Poisson
special case, the most well-known in actuarial applications.
Special Case – Gamma-Poisson - Continued. Assume a Poisson(𝜆) model
distribution and that 𝜆 follows a gamma(𝛼, 𝜃) prior distribution. Then, the
posterior distribution of 𝜆 given the data follows a gamma distribution with
new parameters 𝛼𝑝𝑜𝑠𝑡 = ∑𝑖 𝑥𝑖 + 𝛼 and 𝜃𝑝𝑜𝑠𝑡 = 1/(𝑛 + 1/𝜃).
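A sketch of this conjugate updating in R, with illustrative (assumed) prior parameters and claim counts:

alpha <- 2; theta <- 0.5            # assumed prior gamma parameters
x <- c(0, 1, 0, 2, 1)               # assumed observed claim counts
n <- length(x)
alpha_post <- sum(x) + alpha        # posterior shape
theta_post <- 1 / (n + 1 / theta)   # posterior scale
alpha_post * theta_post             # posterior mean of lambda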
For example, suppose the posterior mean of $\lambda$ is 0.15 after observing $X_1 = 1$ in year 1 and 0.2 after additionally observing year 2 (so that $X_1 + X_2 = 4$). Because the posterior distribution is gamma with parameters $\alpha_{new}$ and $\theta_{new}$, the posterior means satisfy, for year 1,
$$0.15 = (X_1 + \alpha)\times\frac{1}{n + 1/\theta} = (1 + \alpha)\times\frac{1}{1 + 1/\theta},$$
so $0.15(1 + 1/\theta) = 1 + \alpha$. For year 2, we have
$$0.2 = (X_1 + X_2 + \alpha)\times\frac{1}{n + 1/\theta} = (4 + \alpha)\times\frac{1}{2 + 1/\theta}.$$
These two equations determine the prior parameters $\alpha$ and $\theta$.
Closed-form expressions mean that results can be readily interpreted and easily
computed; hence, conjugate distributions are useful in actuarial practice. Two
other special cases used extensively are:
• The uncertainty of parameters is summarized using a beta distribution and
the outcomes have a (conditional on the parameter) binomial distribution.
• The uncertainty about the mean of the normal distribution is summarized
using a normal distribution and the outcomes are conditionally normally
distributed.
Additional results on conjugate distributions are summarized in the Appendix
Section 16.3.
Contributors
• Edward W. (Jed) Frees and Lisa Gao, University of Wisconsin-
Madison, are the principal authors of the initial version of this chapter.
Email: jf re e s @ b u s . w i s c . e d u for chapter comments and suggested
improvements.
• Chapter reviewers include: Vytaras Brazauskas, Yvonne Chueh, Eren
Dodd, Hirokazu (Iwahiro) Iwasawa, Joseph Kim, Andrew Kwon-
Nakamura, Jiandong Ren, and Di (Cindy) Xu.
Chapter 5

Aggregate Loss Models
5.1 Introduction
The objective of this chapter is to build a probability model to describe the
aggregate claims by an insurance system occurring in a fixed time period. The
insurance system could be a single policy, a group insurance contract, a business
line, or an entire book of an insurer’s business. In this chapter, aggregate claims
refer to either the number or the amount of claims from a portfolio of insurance
contracts. However, the modeling framework can be readily applied in the more
general setup.
Consider an insurance portfolio of 𝑛 individual contracts, and let 𝑆 denote the
aggregate losses of the portfolio in a given time period. There are two approaches
to modeling the aggregate losses 𝑆, the individual risk model and the collective
risk model. The individual risk model emphasizes the loss from each individual
contract and represents the aggregate losses as:
$$S_n = X_1 + X_2 + \cdots + X_n,$$
where $X_i$ represents the loss from the $i$th contract. Under this model, the number of contracts $n$ is known in advance
and thus is a fixed number rather than a random variable. For the individual
risk model, one usually assumes the 𝑋𝑖 ’s are independent. Because of different
contract features such as coverage and exposure, the 𝑋𝑖 ’s are not necessarily
identically distributed. A notable feature of the distribution of each 𝑋𝑖 is the
probability mass at zero corresponding to the event of no claims.
The collective risk model represents the aggregate losses in terms of a frequency
distribution and a severity distribution:
𝑆𝑁 = 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑁 .
Here, one thinks of a random number of claims 𝑁 that may represent either
the number of losses or the number of payments. In contrast, in the individual
risk model, we use a fixed number of contracts 𝑛. We think of 𝑋1 , 𝑋2 , … , 𝑋𝑁
as representing the amount of each loss. Each loss may or may not correspond
to a unique contract. For instance, there may be multiple claims arising from
a single contract. It is natural to think about 𝑋𝑖 > 0 because if 𝑋𝑖 = 0
then no claim has occurred. Typically we assume that conditional on 𝑁 = 𝑛,
𝑋1 , 𝑋2 , … , 𝑋𝑛 are iid random variables. The distribution of 𝑁 is known as
the frequency distribution, and the common distribution of 𝑋 is known as the
severity distribution. We further assume 𝑁 and 𝑋 are independent. With the
collective risk model, we may decompose the aggregate losses into the frequency
(𝑁 ) process and the severity (𝑋) model. This flexibility allows the analyst to
comment on these two separate components. For example, sales growth due to
lower underwriting standards could lead to higher frequency of losses but might
not affect severity. Similarly, inflation or other economic forces could have an
impact on severity but not on frequency.
5.2 Individual Risk Model

Suppose that the individual losses in the individual risk model
$$S_n = X_1 + X_2 + \cdots + X_n$$
are independent. Then
$$\mathrm{E}(S_n) = \sum_{i=1}^{n}\mathrm{E}(X_i), \qquad \mathrm{Var}(S_n) = \sum_{i=1}^{n}\mathrm{Var}(X_i),$$
$$P_{S_n}(z) = \prod_{i=1}^{n}P_{X_i}(z), \qquad M_{S_n}(t) = \prod_{i=1}^{n}M_{X_i}(t),$$
where 𝑃𝑆𝑛 (⋅) and 𝑀𝑆𝑛 (⋅) are the probability generating function (pgf ) and the
moment generating function (mgf ) of 𝑆𝑛 , respectively. The distribution of each
𝑋𝑖 contains a probability mass at zero, corresponding to the event of no claims
from the 𝑖th contract. One strategy to incorporate the zero mass in the distri-
bution is to use the two-part framework:
$$X_i = I_i \times B_i = \begin{cases} 0, & \text{if } I_i = 0 \\ B_i, & \text{if } I_i = 1. \end{cases}$$
Here, 𝐼𝑖 is a Bernoulli variable indicating whether or not a loss occurs for
the 𝑖th contract, and 𝐵𝑖 is a random variable with nonnegative support rep-
resenting the amount of losses of the contract given loss occurrence. Assume
that 𝐼1 , … , 𝐼𝑛 , 𝐵1 , … , 𝐵𝑛 are mutually independent. Denote Pr(𝐼𝑖 = 1) = 𝑞𝑖 ,
𝜇𝑖 = E(𝐵𝑖 ), and 𝜎𝑖2 = Var(𝐵𝑖 ). It can be shown (see Technical Supplement
5.A.1 for details) that
$$\mathrm{E}(S_n) = \sum_{i=1}^{n} q_i\mu_i$$
$$\mathrm{Var}(S_n) = \sum_{i=1}^{n}\left(q_i\sigma_i^2 + q_i(1 - q_i)\mu_i^2\right)$$
$$P_{S_n}(z) = \prod_{i=1}^{n}\left(1 - q_i + q_i P_{B_i}(z)\right)$$
$$M_{S_n}(t) = \prod_{i=1}^{n}\left(1 - q_i + q_i M_{B_i}(t)\right).$$
$$\mathrm{E}(S_{300}) = \sum_{i=1}^{300} q_i\mu_i = 100\{0.05(200)\} + 200\{0.06(150)\} = 2{,}800$$
$$\mathrm{Var}(S_{300}) = \sum_{i=1}^{300}\left(q_i\sigma_i^2 + q_i(1 - q_i)\mu_i^2\right) \quad \text{since the } X_i\text{'s are independent}$$
$$= 100\left\{0.05\left(\frac{400^2}{12}\right) + 0.05(1 - 0.05)200^2\right\} + 200\left\{0.06\left(\frac{300^2}{12}\right) + 0.06(1 - 0.06)150^2\right\} = 600{,}467.$$
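A sketch that verifies these moments directly from the formulas:

q  <- c(rep(0.05, 100), rep(0.06, 200))               # claim probabilities
mu <- c(rep(200, 100), rep(150, 200))                 # conditional claim means
s2 <- c(rep(400^2 / 12, 100), rep(300^2 / 12, 200))   # conditional claim variances
sum(q * mu)                                           # E(S_300) = 2,800
sum(q * s2 + q * (1 - q) * mu^2)                      # Var(S_300) = 600,467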
The individual risk model can also be used for claim frequency. If 𝑋𝑖 denotes
the number of claims from the 𝑖th contract, then 𝑆𝑛 is interpreted as the total
number of claims from the portfolio. In this case, the above two-part frame-
work still applies since there is a probability mass at zero for contracts that
do not experience any claims. Assume 𝑋𝑖 belongs to the (𝑎, 𝑏, 0) class with
pmf denoted by $p_{ik} = \Pr(X_i = k)$ for $k = 0, 1, \ldots$ (see Section 2.3). Let $X_i^T$ denote the associated zero-truncated distribution in the $(a, b, 1)$ class with pmf $p_{ik}^T = p_{ik}/(1 - p_{i0})$ for $k = 1, 2, \ldots$ (see Section 2.5.1). Using the relationship
between their probability generating functions (see Technical Supplement 5.A.2
for details):
𝑃𝑋𝑖 (𝑧) = 𝑝𝑖0 + (1 − 𝑝𝑖0 )𝑃𝑋𝑖𝑇 (𝑧),
we can write 𝑋𝑖 = 𝐼𝑖 ×𝐵𝑖 with 𝑞𝑖 = Pr(𝐼𝑖 = 1) = Pr(𝑋𝑖 > 0) = 1−𝑝𝑖0 and 𝐵𝑖 =
𝑋𝑖𝑇 . Notice that in this case, we have a zero-modified distribution since the 𝐼𝑖
variable covers the modified probability mass at zero with 𝑞𝑖 = Pr(𝐼𝑖 = 1), while
the 𝐵𝑖 = 𝑋𝑖𝑇 covers the discrete non-zero frequency portion. See Section 2.5.1
for the relationship between zero-truncated and zero-modified distributions.
Example. Consider a portfolio of 100 policies in which the claim frequency for each policy follows a zero-modified Poisson distribution: 40 low-risk policies each have probability 0.03 of at least one claim, with Poisson parameter 𝜆 = 1, while 60 high-risk policies each have probability 0.05, with 𝜆 = 2. Find the expected value and variance of the claim frequency for the entire portfolio.
Solution. For each policy, we can write the zero-modified Poisson claim frequency $N_i$ as $N_i = I_i \times B_i$, where $I_i$ indicates whether policy $i$ has at least one claim and $B_i$ is the claim count given that at least one claim occurs.
For the low-risk policies, we have 𝑞𝑖 = 0.03 and for the high-risk policies, we
have 𝑞𝑖 = 0.05. Further, 𝐵𝑖 = 𝑁𝑖𝑇 , the zero-truncated version of 𝑁𝑖 . Thus, we
have
$$\mu_i = \mathrm{E}(B_i) = \mathrm{E}(N_i^T) = \frac{\lambda}{1 - e^{-\lambda}}$$
$$\sigma_i^2 = \mathrm{Var}(B_i) = \mathrm{Var}(N_i^T) = \frac{\lambda\left[1 - (\lambda + 1)e^{-\lambda}\right]}{(1 - e^{-\lambda})^2}.$$
Using $n = 100$, let the portfolio claim frequency be $S_{100} = \sum_{i=1}^{100} N_i$. Using the formulas above, the expected claim frequency of the portfolio is
$$\mathrm{E}(S_{100}) = \sum_{i=1}^{100} q_i\mu_i = 40\left[0.03\left(\frac{1}{1 - e^{-1}}\right)\right] + 60\left[0.05\left(\frac{2}{1 - e^{-2}}\right)\right] = 40(0.03)(1.5820) + 60(0.05)(2.3130) = 8.8375,$$
and the variance is
$$\mathrm{Var}(S_{100}) = \sum_{i=1}^{100}\left(q_i\sigma_i^2 + q_i(1 - q_i)\mu_i^2\right) = 40\left[0.03\left(\frac{1 - 2e^{-1}}{(1 - e^{-1})^2}\right) + 0.03(0.97)(1.5820^2)\right] + 60\left[0.05\left(\frac{2\left[1 - 3e^{-2}\right]}{(1 - e^{-2})^2}\right) + 0.05(0.95)(2.3130^2)\right] = 23.7214.$$
Note that equivalently, we could have calculated the mean and variance of an individual policy directly using the relationship between the zero-modified and zero-truncated Poisson distributions (see Section 2.5.1).
To understand the distribution of the aggregate loss, one could use the central limit theorem to approximate the distribution of $S_n$ for large $n$. Denote $\mu_{S_n} = \mathrm{E}(S_n)$ and $\sigma_{S_n}^2 = \mathrm{Var}(S_n)$, and let $Z \sim N(0, 1)$, a standard normal random variable with cdf $\Phi$. Then the cdf of $S_n$ can be approximated as follows:
$$F_{S_n}(s) = \Pr(S_n \le s) = \Pr\left(\frac{S_n - \mu_{S_n}}{\sigma_{S_n}} \le \frac{s - \mu_{S_n}}{\sigma_{S_n}}\right) \approx \Pr\left(Z \le \frac{s - \mu_{S_n}}{\sigma_{S_n}}\right) = \Phi\left(\frac{s - \mu_{S_n}}{\sigma_{S_n}}\right).$$
For small 𝑛, the distribution of 𝑆𝑛 is likely skewed, and the normal approxima-
tion would be a poor choice. To examine the aggregate loss distribution, we go
back to first principles. Specifically, the distribution can be derived recursively.
Define 𝑆𝑘 = 𝑋1 + ⋯ + 𝑋𝑘 , 𝑘 = 1, … , 𝑛.
For 𝑘 = 1:
𝐹𝑆1 (𝑠) = Pr(𝑆1 ≤ 𝑠) = Pr(𝑋1 ≤ 𝑠) = 𝐹𝑋1 (𝑠).
For 𝑘 = 2, … , 𝑛:
$$F_X^{*2}(x) = \Pr(X_1 + X_2 \le x) = \mathrm{E}_{X_2}\left[\Pr(X_1 \le x - X_2 | X_2)\right] = \mathrm{E}_{X_2}\left[F(x - X_2)\right] = \begin{cases}\int_0^x F(x - y)f(y)\,dy & \text{for continuous } X_i\text{'s} \\ \sum_{y \le x} F(x - y)f(y) & \text{for discrete } X_i\text{'s}.\end{cases}$$
Recall $F(0) = 0$.
When the 𝑋𝑖 ’s are independent and belong to the same family of distributions,
there are some simple cases where 𝑆𝑛 has a closed form. This makes it easy
to compute Pr(𝑆𝑛 ≤ 𝑥). This property is known as closed under convolution,
meaning the distribution of the sum of independent random variables belongs
to the same family of distributions as that of the component variables, just with
different parameters. Table 5.1 provides a few examples.
Special Case: Gamma. Suppose $X_i \sim Gam(\alpha_i, \theta)$, $i = 1, \ldots, n$, are independent. Then
$$M_{S_n}(t) = \prod_{i=1}^{n}(1 - \theta t)^{-\alpha_i} = (1 - \theta t)^{-\sum_{i=1}^{n}\alpha_i},$$
which is the mgf of a gamma random variable with parameters $(\sum_{i=1}^{n}\alpha_i, \theta)$. Thus, $S_n \sim Gam(\sum_{i=1}^{n}\alpha_i, \theta)$.
Special Case: Negative Binomial. Suppose $X_i \sim NB(\beta, r_i)$ are independent. Then
$$P_{S_n}(z) = \mathrm{E}\left[z^{S_n}\right] = \mathrm{E}\left[z^{X_1}\right]\cdots\mathrm{E}\left[z^{X_n}\right] \quad \text{from the independence of } X_i\text{'s}$$
$$= \prod_{i=1}^{n}P_{X_i}(z) = \prod_{i=1}^{n}\left[1 - \beta(z-1)\right]^{-r_i} = \left[1 - \beta(z-1)\right]^{-\sum_{i=1}^{n} r_i},$$
so that $S_n \sim NB(\beta, \sum_{i=1}^{n} r_i)$.
5.3 Collective Risk Model

In the collective risk model $S_N = X_1 + \cdots + X_N$, assume that, conditional on $N$, the claim amounts are iid with mean $\mu$ and variance $\sigma^2$, so that
$$\mathrm{E}(S|N) = \mathrm{E}(X_1 + \cdots + X_N|N) = \mu N$$
$$\mathrm{Var}(S|N) = \mathrm{Var}(X_1 + \cdots + X_N|N) = \sigma^2 N.$$
Using the law of iterated expectations from Appendix Section 16.2, the mean of the aggregate loss is
$$\mathrm{E}(S_N) = \mathrm{E}\left[\mathrm{E}(S|N)\right] = \mu\,\mathrm{E}(N).$$
Using the law of total variance from Appendix Section 16.2, the variance of the aggregate loss is
$$\mathrm{Var}(S_N) = \mathrm{E}\left[\mathrm{Var}(S|N)\right] + \mathrm{Var}\left[\mathrm{E}(S|N)\right] = \sigma^2\,\mathrm{E}(N) + \mu^2\,\mathrm{Var}(N).$$
Special Case: Poisson Frequency. If $N \sim Poi(\lambda)$, then
$$\mathrm{E}(N) = \mathrm{Var}(N) = \lambda, \qquad \mathrm{E}(S_N) = \lambda\,\mathrm{E}(X), \qquad \mathrm{Var}(S_N) = \lambda(\sigma^2 + \mu^2) = \lambda\,\mathrm{E}(X^2).$$
Example. Suppose the number of claims $N$ satisfies $\mathrm{E}(N) = \mathrm{Var}(N) = 12$ and individual claim amounts equal 1, 2, or 3 with probabilities 1/2, 1/3, and 1/6, respectively. Then
$$\mu = \mathrm{E}(X) = 1\left(\frac{1}{2}\right) + 2\left(\frac{1}{3}\right) + 3\left(\frac{1}{6}\right) = \frac{5}{3}$$
$$\sigma^2 = \mathrm{E}(X^2) - \left[\mathrm{E}(X)\right]^2 = \frac{10}{3} - \frac{25}{9} = \frac{5}{9}$$
$$\Rightarrow \mathrm{Var}(S_N) = \left(\frac{5}{9}\right)(12) + \left(\frac{5}{3}\right)^2(12) = 40.$$
Now, recall that the probability generating function (pgf) of $N$ is $P_N(z) = \mathrm{E}(z^N)$. Conditioning on $N$, the mgf of $S_N$ is $M_{S_N}(t) = \mathrm{E}\left[\mathrm{E}(e^{tS_N}|N)\right] = \mathrm{E}\left[\{M_X(t)\}^N\right]$; denoting $M_X(t) = z$ and substituting into the pgf, it is shown that $M_{S_N}(t) = P_N(M_X(t))$. Differentiating,
$$M_{S_N}'(t) = \frac{\partial}{\partial t}P_N(M_X(t)) = P_N'(M_X(t))\,M_X'(t),$$
and recall $M_X(0) = 1$, $M_X'(0) = \mathrm{E}(X) = \mu$, $P_N'(1) = \mathrm{E}(N)$. So,
$$\mathrm{E}(S_N) = M_{S_N}'(0) = P_N'(1)\,M_X'(0) = \mathrm{E}(N)\,\mu.$$
Similarly, one could use the relation $\mathrm{E}(S_N^2) = M_{S_N}''(0)$ to derive the variance expression $\mathrm{Var}(S_N) = \sigma^2\,\mathrm{E}(N) + \mu^2\,\mathrm{Var}(N)$ given above.
Example. Actuarial Exam Question. The number of cash prizes $N$ and the amount $X$ of each cash prize have the following distributions:

𝑛    Pr(𝑁 = 𝑛)
1    0.8
2    0.2

𝑥       Pr(𝑋 = 𝑥)
0       0.2
100     0.7
1000    0.1
Your budget for prizes equals the expected aggregate cash prizes plus the stan-
dard deviation of aggregate cash prizes. Calculate your budget.
Solution. We need to calculate the mean and standard deviation of the aggregate (sum of) cash prizes. The moments of the frequency distribution $N$ are
$$\mathrm{E}(N) = 1(0.8) + 2(0.2) = 1.2, \qquad \mathrm{E}(N^2) = 1^2(0.8) + 2^2(0.2) = 1.6, \qquad \mathrm{Var}(N) = \mathrm{E}(N^2) - \left[\mathrm{E}(N)\right]^2 = 0.16,$$
and those of the severity distribution $X$ are
$$\mu = \mathrm{E}(X) = 0(0.2) + 100(0.7) + 1000(0.1) = 170, \qquad \sigma^2 = \mathrm{E}(X^2) - \mu^2 = 107{,}000 - 170^2 = 78{,}100.$$
Thus, the mean and variance of the aggregate cash prizes are
$$\mathrm{E}(S_N) = \mu\,\mathrm{E}(N) = 170(1.2) = 204, \qquad \mathrm{Var}(S_N) = \sigma^2\,\mathrm{E}(N) + \mu^2\,\mathrm{Var}(N) = 78{,}100(1.2) + 170^2(0.16) = 98{,}344,$$
so the budget is $\mathrm{E}(S_N) + \sqrt{\mathrm{Var}(S_N)} = 204 + 313.6 = 517.6$.
Example. Suppose the number of claims $N$ is geometric with $\beta = 4$ and individual claim amounts equal 1, 2, 3, or 4, each with probability 1/4. Calculate $F_{S_N}(3)$.
Solution. Conditioning on the number of claims,
$$F_{S_N}(3) = \Pr\left(\sum_{i=1}^{N} X_i \le 3\right) = \sum_{n=0}^{\infty}\Pr\left(\sum_{i=1}^{n} X_i \le 3 \,\Big|\, N = n\right)\Pr(N = n) = \sum_{n=0}^{3} F^{*n}(3)\,p_n = p_0 + F^{*1}(3)\,p_1 + F^{*2}(3)\,p_2 + F^{*3}(3)\,p_3.$$
The geometric frequency probabilities are
$$p_n = \frac{1}{1+\beta}\left(\frac{\beta}{1+\beta}\right)^n = \frac{1}{5}\left(\frac{4}{5}\right)^n.$$
The required convolutions are
$$F^{*1}(3) = \Pr(X \le 3) = \frac{3}{4}$$
$$F^{*2}(3) = \sum_{y \le 3} F^{*1}(3 - y)f(y) = \frac{1}{4}\left[F^{*1}(2) + F^{*1}(1)\right] = \frac{1}{4}\left[\Pr(X \le 2) + \Pr(X \le 1)\right] = \frac{1}{4}\left(\frac{2}{4} + \frac{1}{4}\right) = \frac{3}{16}$$
$$F^{*3}(3) = \Pr(X_1 + X_2 + X_3 \le 3) = \Pr(X_1 = X_2 = X_3 = 1) = \left(\frac{1}{4}\right)^3.$$
Notice that we did not need to recursively calculate 𝐹 ∗3 (3) by recognizing that
each 𝑋 ∈ {1, 2, 3, 4}, so the only way of obtaining 𝑋1 + 𝑋2 + 𝑋3 ≤ 3 is to have
𝑋1 = 𝑋2 = 𝑋3 = 1. Additionally, for 𝑛 ≥ 4, 𝐹 ∗𝑛 (3) = 0 since it is impossible
for the sum of 4 or more 𝑋’s to be less than 3. For 𝑛 = 0, 𝐹 ∗0 (3) = 1 since
the sum of 0 𝑋’s is 0, which is always less than 3. Laying out the probabilities
systematically and combining,
$$F_{S_N}(3) = p_0 + F^{*1}(3)p_1 + F^{*2}(3)p_2 + F^{*3}(3)p_3 = \frac{1}{5} + \frac{3}{4}\left(\frac{4}{25}\right) + \frac{3}{16}\left(\frac{16}{125}\right) + \frac{1}{64}\left(\frac{64}{625}\right) = 0.3456.$$
When $\mathrm{E}(N)$ and $\mathrm{Var}(N)$ are known, one may also use a type of central limit theorem to approximate the distribution of $S_N$, as in the individual risk model. That is,
$$\frac{S_N - \mathrm{E}(S_N)}{\sqrt{\mathrm{Var}(S_N)}}$$
approximately follows the standard normal distribution $N(0, 1)$. From this type of central limit theorem, the approximation works well if $\mathrm{E}(N)$ is sufficiently large.
Example. Suppose the number of claims $N$ has a Poisson distribution with mean 25 and individual claim amounts are uniformly distributed on (5, 95), so that
$$\mathrm{E}(N) = \mathrm{Var}(N) = 25, \qquad \mathrm{E}(X) = \frac{5 + 95}{2} = 50 = \mu, \qquad \mathrm{Var}(X) = \frac{(95 - 5)^2}{12} = 675 = \sigma^2.$$
Then for $S_N$,
$$\mathrm{E}(S_N) = \mu\,\mathrm{E}(N) = 50(25) = 1{,}250$$
$$\mathrm{Var}(S_N) = \sigma^2\,\mathrm{E}(N) + \mu^2\,\mathrm{Var}(N) = 675(25) + 50^2(25) = 79{,}375.$$
Stop-Loss Insurance. Insurance on the aggregate loss $S_N$ with a deductible $d$ is called stop-loss insurance; its expected cost, $\mathrm{E}[(S_N - d)_+]$, can be computed as
$$\mathrm{E}(S_N - d)_+ = \begin{cases}\int_d^{\infty}(s - d)f_{S_N}(s)\,ds & \text{for continuous } S_N \\ \sum_{s > d}(s - d)f_{S_N}(s) & \text{for discrete } S_N\end{cases} = \mathrm{E}(S_N) - \mathrm{E}(S_N \wedge d).$$
𝑥 𝑓(𝑥)
5 0.2
10 0.3
20 0.5
The number of projects and the number of overtime hours are independent. You
will get paid for overtime hours in excess of 15 hours in the week. Calculate the
expected number of overtime hours for which you will get paid in the week.
Solution. The number of projects in a week requiring overtime work has distri-
bution 𝑁 ∼ 𝐺𝑒𝑜(𝛽 = 2), while the number of overtime hours worked per project
has distribution 𝑋 as described above. The aggregate number of overtime hours
in a week is $S_N$, and we are therefore looking for $\mathrm{E}[(S_N - 15)_+] = \mathrm{E}(S_N) - \mathrm{E}(S_N \wedge 15)$. The mean aggregate hours are $\mathrm{E}(S_N) = \mathrm{E}(N)\,\mathrm{E}(X) = 2\left[5(0.2) + 10(0.3) + 20(0.5)\right] = 2(14) = 28$. For the limited expected value, the relevant probabilities are
$$\Pr(S_N = 0) = \Pr(N = 0) = \frac{1}{1 + \beta} = \frac{1}{3}$$
$$\Pr(S_N = 5) = \Pr(X = 5, N = 1) = 0.2\left(\frac{2}{9}\right) = \frac{0.4}{9}$$
$$\Pr(S_N = 10) = \Pr(X = 10, N = 1) + \Pr(X_1 = X_2 = 5, N = 2) = 0.3\left(\frac{2}{9}\right) + (0.2)(0.2)\left(\frac{4}{27}\right) = 0.0726$$
$$\Pr(S_N \ge 15) = 1 - \left(\frac{1}{3} + \frac{0.4}{9} + 0.0726\right) = 0.5496$$
$$\Rightarrow \mathrm{E}(S_N \wedge 15) = 0\left(\frac{1}{3}\right) + 5\left(\frac{0.4}{9}\right) + 10(0.0726) + 15(0.5496) = 9.193.$$
Therefore,
$$\mathrm{E}(S_N - 15)_+ = \mathrm{E}(S_N) - \mathrm{E}(S_N \wedge 15) = 28 - 9.193 = 18.807.$$
Thus, for aggregate losses on the nonnegative integers, stop-loss premiums at successive integer deductibles $j$ and $j+1$ satisfy
$$\mathrm{E}\left[(S_N - (j+1))_+\right] - \mathrm{E}\left[(S_N - j)_+\right] = \left\{\mathrm{E}(S_N) - \mathrm{E}(S_N \wedge (j+1))\right\} - \left\{\mathrm{E}(S_N) - \mathrm{E}(S_N \wedge j)\right\} = \mathrm{E}(S_N \wedge j) - \mathrm{E}\left[S_N \wedge (j+1)\right].$$
We can write
$$\mathrm{E}\left[S_N \wedge (j+1)\right] = \sum_{x=0}^{j} x f_{S_N}(x) + (j+1)\Pr(S_N \ge j+1) = \sum_{x=0}^{j-1} x f_{S_N}(x) + j\Pr(S_N = j) + (j+1)\Pr(S_N \ge j+1).$$
Similarly,
$$\mathrm{E}(S_N \wedge j) = \sum_{x=0}^{j-1} x f_{S_N}(x) + j\Pr(S_N \ge j).$$
Subtracting, $\mathrm{E}[S_N \wedge (j+1)] - \mathrm{E}(S_N \wedge j) = \Pr(S_N \ge j+1)$, so that
$$\mathrm{E}\left[(S_N - (j+1))_+\right] = \mathrm{E}\left[(S_N - j)_+\right] - \Pr(S_N \ge j+1),$$
as required.
Example 5.3.8. One has a closed-form expression for the aggregate loss dis-
tribution by assuming a geometric frequency distribution and an exponential
severity distribution.
Assume that claim count 𝑁 is geometric with mean E(𝑁 ) = 𝛽, and that claim
amount 𝑋 is exponential with E(𝑋) = 𝜃. Recall that the pgf of 𝑁 and the mgf
of 𝑋 are:
$$P_N(z) = \frac{1}{1 - \beta(z-1)}, \qquad M_X(t) = \frac{1}{1 - \theta t}.$$
Thus, the mgf of the aggregate loss $S_N$ can be expressed in two ways (for details, see Technical Supplement 5.A.3):
$$M_{S_N}(t) = P_N\left[M_X(t)\right] = \frac{1}{1 - \beta\left(\frac{1}{1 - \theta t} - 1\right)} = 1 + \frac{\beta}{1+\beta}\left(\left[1 - \theta(1+\beta)t\right]^{-1} - 1\right) \tag{5.1}$$
$$= \frac{1}{1+\beta}(1) + \frac{\beta}{1+\beta}\left(\frac{1}{1 - \theta(1+\beta)t}\right), \tag{5.2}$$
where equation (5.1) has the form $P_{N^*}(M_{X^*}(t))$ with
$$P_{N^*}(z) = 1 + \frac{\beta}{1+\beta}(z - 1), \qquad M_{X^*}(t) = \frac{1}{1 - \theta(1+\beta)t}.$$
More generally, with exponential severities the cdf of $S_N$ is
$$F_S(s) = p_0 + \sum_{n=1}^{\infty} p_n F_X^{*n}(s) = 1 - \sum_{n=1}^{\infty} p_n\sum_{j=0}^{n-1}\frac{1}{j!}\left(\frac{s}{\theta}\right)^j e^{-s/\theta} = 1 - e^{-s/\theta}\sum_{j=0}^{\infty}\frac{1}{j!}\left(\frac{s}{\theta}\right)^j \bar{P}_j,$$
where $\bar{P}_j = p_{j+1} + p_{j+2} + \cdots = \Pr(N > j)$ is the “survival function” of the claims count distribution.
Thus, the Tweedie distribution can be thought of as a mixture of zero and a positive-valued distribution, which makes it a convenient tool for modeling insurance claims and for calculating pure premiums. The mean and variance of the Tweedie compound Poisson model are:
$$\mathrm{E}(S_N) = \lambda\frac{\alpha}{\gamma} \quad \text{and} \quad \mathrm{Var}(S_N) = \lambda\frac{\alpha(1+\alpha)}{\gamma^2}.$$
Re-expressing the parameters in terms of the mean $\mu$, a dispersion parameter $\phi$, and a power index $p \in (1, 2)$, the Tweedie density can be written as
$$f_{S_N}(s) = \exp\left[\frac{1}{\phi}\left(-\frac{s}{(p-1)\mu^{p-1}} - \frac{\mu^{2-p}}{2-p}\right) + C(s;\phi)\right],$$
where
$$C(s;\phi) = \begin{cases} 0 & \text{if } s = 0 \\ \log \displaystyle\sum_{n \ge 1}\left\{\frac{(1/\phi)^{1/(p-1)}\, s^{(2-p)/(p-1)}}{(2-p)(p-1)^{(2-p)/(p-1)}}\right\}^n \frac{1}{n!\,\Gamma\left[n(2-p)/(p-1)\right] s} & \text{if } s > 0. \end{cases}$$
This allows us to use the Tweedie distribution with generalized linear models to
model claims. It is also worth mentioning the two limiting cases of the Tweedie
model: 𝑝 → 1 results in the Poisson distribution and 𝑝 → 2 results in the gamma
distribution. Thus, the Tweedie model accommodates the situations in between
the gamma and Poisson distributions, which makes intuitive sense as it is the
Poisson sum of gamma random variables.
5.4 Computing the Aggregate Claims Distribution

5.4.1 Recursive Method

When the frequency distribution belongs to the $(a, b, 0)$ class and the severity distribution is discrete with maximum claim size $m$, the aggregate loss probabilities can be computed recursively:
$$f_{S_N}(s) = \frac{1}{1 - a f_X(0)}\left\{\sum_{x=1}^{s \wedge m}\left(a + \frac{bx}{s}\right)f_X(x)\,f_{S_N}(s - x)\right\}.$$
Special Case: Poisson Frequency. With $a = 0$ and $b = \lambda$, this reduces to
$$f_{S_N}(s) = \frac{\lambda}{s}\left\{\sum_{x=1}^{s \wedge m} x f_X(x)\,f_{S_N}(s - x)\right\}.$$
Example. Follow-Up. Return to the setting where the claim count is geometric with $\beta = 4$ (so that $a = \beta/(1+\beta) = 4/5$ and $b = 0$) and the severity distribution is
$$f_X(x) = \frac{1}{4}, \quad x = 1, 2, 3, 4.$$
Because $f_X(0) = 0$, the recursion becomes
$$f_{S_N}(x) = \frac{1}{1 - a f_X(0)}\sum_{y=1}^{x \wedge m}(a + 0)f_X(y)\,f_{S_N}(x - y) = \frac{4}{5}\sum_{y=1}^{x \wedge 4} f_X(y)\,f_{S_N}(x - y).$$
Specifically, we have
$$f_{S_N}(0) = \Pr(N = 0) = p_0 = \frac{1}{5}$$
$$f_{S_N}(1) = \frac{4}{5}\sum_{y=1}^{1} f_X(y)f_{S_N}(1-y) = \frac{4}{5}f_X(1)f_{S_N}(0) = \frac{4}{5}\left(\frac{1}{4}\right)\left(\frac{1}{5}\right) = \frac{1}{25}$$
$$f_{S_N}(2) = \frac{4}{5}\sum_{y=1}^{2} f_X(y)f_{S_N}(2-y) = \frac{4}{5}\left[f_X(1)f_{S_N}(1) + f_X(2)f_{S_N}(0)\right] = \frac{4}{5}\left[\frac{1}{4}\left(\frac{1}{25} + \frac{1}{5}\right)\right] = \frac{6}{125}$$
$$f_{S_N}(3) = \frac{4}{5}\left[f_X(1)f_{S_N}(2) + f_X(2)f_{S_N}(1) + f_X(3)f_{S_N}(0)\right] = \frac{4}{5}\left[\frac{1}{4}\left(\frac{6}{125} + \frac{1}{25} + \frac{1}{5}\right)\right] = \frac{36}{625} = 0.0576$$
$$\Rightarrow F_{S_N}(3) = f_{S_N}(0) + f_{S_N}(1) + f_{S_N}(2) + f_{S_N}(3) = 0.3456.$$
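A sketch of the recursion in R for this example:

a <- 4/5; b <- 0                    # geometric frequency: a = beta/(1+beta), b = 0
fx <- c(0, rep(1/4, 4))             # severity pmf on 0, 1, ..., 4 (no mass at zero)
fs <- numeric(4)
fs[1] <- 1/5                        # f_S(0) = Pr(N = 0)
for (s in 1:3) {
  y <- 1:min(s, 4)
  fs[s + 1] <- sum((a + b * y / s) * fx[y + 1] * fs[s - y + 1]) / (1 - a * fx[1])
}
fs                                  # 0.2000 0.0400 0.0480 0.0576
sum(fs)                             # F_S(3) = 0.3456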
5.4.2 Simulation
The distribution of aggregate loss can be evaluated using Monte Carlo simula-
tion. You can get a broad introduction to simulation procedures in Chapter 6.
For aggregate losses, the idea is that one can calculate the empirical distribution
of 𝑆𝑁 using a random sample. The expected value and variance of the aggregate
loss can also be estimated using the sample mean and sample variance of the
simulated values.
We now summarize simulation procedures for aggregate loss models. Let 𝑚 be
the size of the generated random sample of aggregate losses.
1. Individual Risk Model: 𝑆𝑛 = 𝑋1 + ⋯ + 𝑋𝑛
• Let 𝑗 = 1, … , 𝑚 be a counter. Start by setting 𝑗 = 1.
• Generate each individual loss realization 𝑥𝑖𝑗 for 𝑖 = 1, … , 𝑛. For
example, this can be done using the inverse transformation method
(Section 6.2).
• Calculate the aggregate loss 𝑠𝑗 = 𝑥1𝑗 + ⋯ + 𝑥𝑛𝑗 .
• Repeat the above two steps for 𝑗 = 2, … , 𝑚 to obtain a size-𝑚 sample
of 𝑆𝑛 , i.e. {𝑠1 , … , 𝑠𝑚 }.
2. Collective Risk Model: 𝑆𝑁 = 𝑋1 + ⋯ + 𝑋𝑁
• Let 𝑗 = 1, … , 𝑚 be a counter. Start by setting 𝑗 = 1.
• Generate the number of claims 𝑛𝑗 from the frequency distribution 𝑁 .
• Generate the individual claim amounts $x_{1j}, \ldots, x_{n_j j}$ from the severity distribution $X$.
• Calculate the aggregate loss $s_j = x_{1j} + \cdots + x_{n_j j}$.
• Repeat the above steps for $j = 2, \ldots, m$ to obtain a size-$m$ sample of $S_N$, i.e. $\{s_1, \ldots, s_m\}$.
Given the simulated sample, the empirical distribution of the aggregate loss is
$$\hat{F}_S(s) = \frac{1}{m}\sum_{i=1}^{m} I(s_i \le s),$$
where 𝐼(⋅) is an indicator function. The empirical distribution 𝐹𝑆̂ (𝑠) will con-
verge to 𝐹𝑆 (𝑠) almost surely as the sample size 𝑚 → ∞.
The above procedure assumes that the probability distributions, including the
parameter values, of the frequency and severity distributions are known. In
practice, one would need to first assume these distributions, estimate their pa-
rameters from data, and then assess the quality of model fit using various model
validation tools (see Chapter 4). For instance, the assumptions in the collective
risk model suggest a two-stage estimation where one model is developed for the
number of claims 𝑁 from data on claim counts, and another model is developed
for the severity of claims 𝑋 from data on the amount of claims.
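For instance, the aggregate losses in the earlier Poisson-uniform example can be simulated as follows (a sketch; the seed is an assumption, and results vary with it):

set.seed(2020)                          # assumed seed for reproducibility
m <- 10000
S <- replicate(m, {
  n <- rpois(1, lambda = 25)            # simulate the number of claims
  sum(runif(n, min = 5, max = 95))      # simulate and sum the claim amounts
})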
mean(S) # Compare to theoretical value of 1,250
[1] 1248.09
var(S) # Compare to theoretical value of 79,375
[1] 77441.22
mean(S>2000) # Proportion of simulated observations s_j that are > 2000
[1] 0.0062
# Compare to normal approximation method of 0.003884
Using simulation, we estimate the mean and variance of the aggregate claims
to be approximately 1248 and 77441 respectively, compared to the theoretical
values of 1,250 and 79,375. In addition, we estimate the probability that aggre-
gate losses exceed 2000 to be 0.0062, compared to the normal approximation
estimate of 0.003884.
[Figure: Histogram of the simulated aggregate losses with an overlaid normal density; the horizontal axis gives the aggregate loss S and the vertical axis the density.]
The simulated losses are slightly more right-skewed than the normal distribution,
with a longer right tail. This explains why the normal approximation estimate
of Pr(𝑆𝑁 > 2000) is lower than the simulated estimate.
5.5 Effects of Coverage Modifications

5.5.1 Impact of Exposure on Frequency

Suppose that $X_1, \ldots, X_n$ are iid claim counts, one for each of $n$ exposure units, and let $S = X_1 + \cdots + X_n$ be the total claim count. Then the pgf of $S$ is
$$P_S(z) = \mathrm{E}(z^S) = \mathrm{E}\left(z^{\sum_{i=1}^{n} X_i}\right) = \prod_{i=1}^{n}\mathrm{E}(z^{X_i}) = \left[P_X(z)\right]^n.$$
Special Case: Poisson. If $X_i \sim Poi(\lambda)$, its pgf is $P_X(z) = e^{\lambda(z-1)}$. Then the pgf of $S$ is
$$P_S(z) = \left[e^{\lambda(z-1)}\right]^n = e^{n\lambda(z-1)}.$$
So $S \sim Poi(n\lambda)$. That is, the sum of $n$ independent Poisson random variables, each with mean $\lambda$, has a Poisson distribution with mean $n\lambda$.
Special Case: Negative Binomial. If $X_i \sim NB(\beta, r)$, its pgf is $P_X(z) = [1 - \beta(z-1)]^{-r}$, so that $P_S(z) = [1 - \beta(z-1)]^{-nr}$. So $S \sim NB(\beta, nr)$.
Example 5.5.1. Assume that the number of claims for each vehicle is Poisson
with mean 𝜆. Given the following data on the observed number of claims for
each household, calculate the MLE of 𝜆.
If the exposure changes from $n_1$ to $n_2$ units, the pgf of the total claim count under the new exposure can be written in terms of the old:
$$P_{S_{n_2}}(z) = \left[P_X(z)\right]^{n_2} = \left[P_X(z)^{n_1}\right]^{n_2/n_1} = P_{S_{n_1}}(z)^{n_2/n_1}.$$
5.5.2 Impact of Deductibles on Claim Frequency

Now consider imposing a per-claim deductible $d$. Let $N^L$ denote the number of losses and, for the $j$th loss $X_j$, define the payment indicator
$$I_j = \begin{cases} 1 & \text{if } X_j > d \\ 0 & \text{otherwise.} \end{cases}$$
Then we establish
$$N^P = I_1 + I_2 + \cdots + I_{N^L},$$
that is, the total number of payments is equal to the number of losses above the
deductible level. Given that 𝐼𝑗 ’s are independent Bernoulli random variables
with probability of success 𝑣 = Pr(𝑋 > 𝑑), the sum of a fixed number of such
variables is then a binomial random variable. Thus, conditioning on 𝑁 𝐿 , 𝑁 𝑃
has a binomial distribution, i.e. 𝑁 𝑃 |𝑁 𝐿 ∼ 𝐵𝑖𝑛(𝑁 𝐿 , 𝑣), where 𝑣 = Pr(𝑋 > 𝑑).
This implies that
$$\mathrm{E}\left(z^{N^P}|N^L\right) = \left[1 + v(z-1)\right]^{N^L}.$$
So the pgf of $N^P$ is
$$P_{N^P}(z) = \mathrm{E}_{N^P}\left(z^{N^P}\right) = \mathrm{E}_{N^L}\left[\mathrm{E}_{N^P}\left(z^{N^P}|N^L\right)\right] = \mathrm{E}_{N^L}\left[(1 + v(z-1))^{N^L}\right] = P_{N^L}\left(1 + v(z-1)\right).$$
Thus, we can write the pgf of 𝑁 𝑃 as the pgf of 𝑁 𝐿 , evaluated at a new argument
𝑧 ∗ = 1 + 𝑣(𝑧 − 1). That is, 𝑃𝑁 𝑃 (𝑧) = 𝑃𝑁 𝐿 (𝑧 ∗ ).
Special Cases:
• $N^L \sim Poi(\lambda)$. The pgf of $N^L$ is $P_{N^L}(z) = e^{\lambda(z-1)}$. Thus the pgf of $N^P$ is
$$P_{N^P}(z) = e^{\lambda(1 + v(z-1) - 1)} = e^{\lambda v(z-1)},$$
so $N^P \sim Poi(\lambda v)$.
• $N^L \sim NB(\beta, r)$. The pgf of $N^L$ is $P_{N^L}(z) = [1 - \beta(z-1)]^{-r}$, so that $P_{N^P}(z) = [1 - \beta v(z-1)]^{-r}$.
So $N^P \sim NB(\beta v, r)$. This means the number of payments has the same distribution as the number of losses, but with parameters $\beta v$ and $r$.
With this, we can assess the second payment distribution 𝑁2𝑃 (under deductible
𝑑2 = 100) as being Poisson with mean 𝜆2 = 𝜆𝑣2 , where
$$v_2 = \Pr(X > 100) = \left(\frac{150}{100 + 150}\right)^4 = \left(\frac{3}{5}\right)^4$$
$$\Rightarrow \lambda_2 = \lambda v_2 = 0.4\left(\frac{6}{5}\right)^4\left(\frac{3}{5}\right)^4 = 0.1075.$$
Example 5.5.3. Follow-Up. Now suppose instead that the loss frequency
is 𝑁 𝐿 ∼ 𝑁 𝐵(𝛽, 𝑟) and for deductible 𝑑1 = 30, the payment frequency 𝑁1𝑃 is
negative binomial with mean 0.4. Find the mean of the payment frequency 𝑁2𝑃
for deductible 𝑑2 = 100.
Solution. Because the loss frequency 𝑁 𝐿 is negative binomial, we can relate the
parameter 𝛽 of the 𝑁 𝐿 distribution and the parameter 𝛽1 of the first payment
distribution 𝑁1𝑃 using 𝛽1 = 𝛽𝑣1 , where
$$v_1 = \Pr(X > 30) = \left(\frac{150}{30 + 150}\right)^4 = \left(\frac{5}{6}\right)^4.$$
Thus, the mean of $N_1^P$ and the mean of $N^L$ are related via $\mathrm{E}(N_1^P) = \beta_1 r = \beta v_1 r = v_1\,\mathrm{E}(N^L)$, so that $\mathrm{E}(N^L) = 0.4/v_1$ and the mean of the second payment frequency is $\mathrm{E}(N_2^P) = v_2\,\mathrm{E}(N^L) = 0.4\left(\frac{6}{5}\right)^4\left(\frac{3}{5}\right)^4 = 0.1075$, the same as in the Poisson case.
$$p_k^M = c\,p_k^0, \quad \text{for } k = 1, 2, 3, \ldots, \quad \text{with } c = \frac{1 - p_0^M}{1 - p_0^0},$$
where $p_k^0$ is the pmf of the unmodified distribution. In the case that $p_0^M = 0$, we call this a zero-truncated distribution, or $ZT$. For other arbitrary values of $p_0^M$, this is a zero-modified, or $ZM$, distribution. The pgf for the modified distribution is
$$P^M(z) = 1 - c + c\,P^0(z),$$
For example, if $N^L$ follows a zero-modified negative binomial distribution, then
$$P_{N^L}(z) = 1 - \frac{1 - p_0^M}{1 - (1+\beta)^{-r}} + \frac{1 - p_0^M}{1 - (1+\beta)^{-r}}\left[1 - \beta(z-1)\right]^{-r}.$$
Evaluating this pgf at $1 + v(z-1)$ gives
$$P_{N^P}(z) = 1 - \frac{1 - p_0^M}{1 - (1+\beta)^{-r}} + \frac{1 - p_0^M}{1 - (1+\beta)^{-r}}\left[1 - \beta v(z-1)\right]^{-r}.$$
So the number of payments is also a ZM-negative binomial distribution
with parameters 𝛽𝑣, 𝑟, and 𝑝0𝑀 . Similarly, the probability at zero can be
evaluated using Pr(𝑁 𝑃 = 0) = 𝑃𝑁 𝑃 (0).
Example. Suppose the number of losses follows a zero-modified Poisson distribution with $\lambda = 3$ and $p_0^M = 0.5$, and that ground-up losses follow a Pareto distribution with $\alpha = 3$ and $\theta = 50$. With a deductible of 30, the probability of a payment is
$$v = \Pr(X > 30) = \left(\frac{1}{1 + (30/50)}\right)^3 = 0.2441,$$
so the payment frequency $N^P$ is zero-modified Poisson with $\lambda^* = \lambda v = 3(0.2441) = 0.7324$ and the same $p_0^M$. Its mean and variance are
$$\mathrm{E}(N^P) = (1 - p_0^M)\frac{\lambda^*}{1 - e^{-\lambda^*}} = 0.5\left(\frac{0.7324}{1 - e^{-0.7324}}\right) = 0.7053$$
$$\mathrm{Var}(N^P) = (1 - p_0^M)\frac{\lambda^*\left[1 - (\lambda^* + 1)e^{-\lambda^*}\right]}{(1 - e^{-\lambda^*})^2} + p_0^M(1 - p_0^M)\left(\frac{\lambda^*}{1 - e^{-\lambda^*}}\right)^2 = 0.5\,\frac{0.7324(1 - 1.7324e^{-0.7324})}{(1 - e^{-0.7324})^2} + 0.5^2\left(\frac{0.7324}{1 - e^{-0.7324}}\right)^2 = 0.7244.$$
= 0.7244.
Recall the notation $N^L$ for the number of losses. With ground-up loss amount $X$ and policy deductible $d$, we use $N^P$ for the number of payments (as defined in the previous section 5.5.2). Also, define the amount of payment on a per-loss basis as
$$X^L = \begin{cases} 0, & \text{if } X < \frac{d}{1+r} \\ \alpha\left[(1+r)X - d\right], & \text{if } \frac{d}{1+r} \le X < \frac{u}{1+r} \\ \alpha(u - d), & \text{if } X \ge \frac{u}{1+r}, \end{cases}$$
and the amount of payment on a per-payment basis as
$$X^P = \begin{cases} \text{undefined}, & \text{if } X < \frac{d}{1+r} \\ \alpha\left[(1+r)X - d\right], & \text{if } \frac{d}{1+r} \le X < \frac{u}{1+r} \\ \alpha(u - d), & \text{if } X \ge \frac{u}{1+r}. \end{cases}$$
In the above, 𝑟, 𝑢, and 𝛼 represent the inflation rate, policy limit, and coinsur-
ance, respectively. Hence, aggregate costs (payment amounts) can be expressed
either on a per loss or per payment basis:
$$S = X_1^L + \cdots + X_{N^L}^L = X_1^P + \cdots + X_{N^P}^P.$$
(Recall that when we introduced the per-loss and per-payment bases in Section
3.4, we used another letter 𝑌 to distinguish losses from insurance payments, or
claims. At this point in our development, we use the letter 𝑋 to reduce notation
complexity.)
We are now ready to apply the fundamentals of collective risk models. For instance, we have:
$$\mathrm{E}(S) = \mathrm{E}(N^L)\,\mathrm{E}(X^L) = \mathrm{E}(N^P)\,\mathrm{E}(X^P)$$
$$\mathrm{Var}(S) = \mathrm{E}(N^L)\,\mathrm{Var}(X^L) + \left[\mathrm{E}(X^L)\right]^2 \mathrm{Var}(N^L) = \mathrm{E}(N^P)\,\mathrm{Var}(X^P) + \left[\mathrm{E}(X^P)\right]^2 \mathrm{Var}(N^P)$$
$$M_S(z) = P_{N^L}\left[M_{X^L}(z)\right] = P_{N^P}\left[M_{X^P}(z)\right].$$
Severity Probability
40 0.25
80 0.25
120 0.25
200 0.25
You expect severity to increase 50% with no change in frequency. You decide
to impose a per claim deductible of 100. Calculate the expected total claim
payment 𝑆 after these changes.
Solution. The cost per loss with a 50% increase in severity and a 100 deductible per claim is $X^L = \max(0,\, 1.5X - 100)$, taking the values 0, 20, 80, and 200, each with probability 0.25. Thus $\mathrm{E}(X^L) = 75$ and the expected total claim payment is $\mathrm{E}(S) = \mathrm{E}(N)\,\mathrm{E}(X^L) = 300 \times 75 = 22{,}500$.
Example 5.5.6. Follow-Up. What is the variance of the total claim payment,
Var (𝑆)?
Solution. On a per loss basis, we have
$$\mathrm{Var}(S) = \mathrm{E}(N)\,\mathrm{Var}(X^L) + \left[\mathrm{E}(X^L)\right]^2 \mathrm{Var}(N).$$
On a per payment basis, the frequency and severity of payments are
$$\mathrm{E}(N^P) = \mathrm{E}(N^L)\Pr(1.5X \ge 100) = 300\left(\frac{3}{4}\right) = 225,$$
$$\mathrm{E}(X^P) = \frac{\mathrm{E}(X^L)}{\Pr(1.5X \ge 100)} = \frac{75}{(3/4)} = 100.$$
Alternative Method: Using the Per Payment Basis. We can also use the per
payment basis to find the expected aggregate amount paid after the modifi-
cations. With the deductible of 100, the probability that a payment occurs
is Pr(𝑋 > 100) = 𝑒−100/200 . For the per payment severity, plugging in the
expression for E(𝑋 𝐿 ) from the original example, we have
Putting this together, we produce the same answer using the per payment basis
as the per loss basis from earlier
Contributors
• Peng Shi and Lisa Gao, University of Wisconsin-Madison, are the princi-
pal authors of the initial version of this chapter. Email: [email protected]
for chapter comments and suggested improvements.
• Chapter reviewers include: Vytaras Brazauskas, Mark Maxwell, Jiadong
Ren, Sherly Paola Alfonso Sanchez, and Di (Cindy) Xu.
$$\mathrm{E}(S_n) = \sum_{i=1}^n \mathrm{E}(X_i) = \sum_{i=1}^n \mathrm{E}(I_i \times B_i) = \sum_{i=1}^n \mathrm{E}(I_i)\,\mathrm{E}(B_i) \quad \text{from the independence of } I_i\text{'s and } B_i\text{'s}$$
$$= \sum_{i=1}^n \Pr(I_i = 1)\,\mu_i \quad \text{since the expectation of an indicator variable is the probability it equals 1}$$
$$= \sum_{i=1}^n q_i\,\mu_i.$$
For the variance of the aggregate loss under the individual risk model,
$$\mathrm{Var}(S_n) = \sum_{i=1}^n \mathrm{Var}(X_i) \quad \text{from the independence of } X_i\text{'s}$$
$$= \sum_{i=1}^n \left( \mathrm{E}\left[\mathrm{Var}(X_i \mid I_i)\right] + \mathrm{Var}\left[\mathrm{E}(X_i \mid I_i)\right] \right) \quad \text{from the conditional variance formulas}$$
$$= \sum_{i=1}^n \left( q_i \sigma_i^2 + q_i(1-q_i)\mu_i^2 \right).$$
Here, $\mathrm{E}\left[\mathrm{Var}(X_i \mid I_i)\right] = q_i \sigma_i^2$ because $\mathrm{Var}(X_i \mid I_i = 1) = \sigma_i^2$ and $\mathrm{Var}(X_i \mid I_i = 0) = 0$, and
$$\mathrm{Var}\left[\mathrm{E}(X_i \mid I_i)\right] = q_i(1-q_i)\mu_i^2,$$
using the Bernoulli variance shortcut since E(𝑋𝑖 |𝐼𝑖 ) = 0 when 𝐼𝑖 = 0 (prob-
ability Pr(𝐼𝑖 = 0) = 1 − 𝑞𝑖 ) and E(𝑋𝑖 |𝐼𝑖 ) = 𝜇𝑖 when 𝐼𝑖 = 1 (probability
Pr(𝐼𝑖 = 1) = 𝑞𝑖 ).
For the probability generating function of the aggregate loss under the individual
risk model,
$$P_{S_n}(z) = \prod_{i=1}^n P_{X_i}(z) \quad \text{from the independence of } X_i\text{'s}$$
$$= \prod_{i=1}^n \mathrm{E}\left(z^{X_i}\right) = \prod_{i=1}^n \mathrm{E}\left(z^{I_i \times B_i}\right) = \prod_{i=1}^n \mathrm{E}\left[\mathrm{E}\left(z^{I_i \times B_i} \mid I_i\right)\right] \quad \text{from the law of iterated expectations}$$
$$= \prod_{i=1}^n \left[ \mathrm{E}\left(z^{I_i \times B_i} \mid I_i = 0\right)\Pr(I_i = 0) + \mathrm{E}\left(z^{I_i \times B_i} \mid I_i = 1\right)\Pr(I_i = 1) \right]$$
$$= \prod_{i=1}^n \left[ (1)(1-q_i) + P_{B_i}(z)\, q_i \right] = \prod_{i=1}^n \left( 1 - q_i + q_i P_{B_i}(z) \right).$$
Lastly, for the moment generating function of the aggregate loss under the
individual risk model,
$$M_{S_n}(t) = \prod_{i=1}^n M_{X_i}(t) \quad \text{from the independence of } X_i\text{'s}$$
$$= \prod_{i=1}^n \mathrm{E}\left(e^{t X_i}\right) = \prod_{i=1}^n \mathrm{E}\left(e^{t(I_i \times B_i)}\right)$$
$$= \prod_{i=1}^n \mathrm{E}\left[\mathrm{E}\left(e^{t(I_i \times B_i)} \mid I_i\right)\right] \quad \text{from the law of iterated expectations}$$
$$= \prod_{i=1}^n \left[ \mathrm{E}\left(e^{t(I_i \times B_i)} \mid I_i = 0\right)\Pr(I_i = 0) + \mathrm{E}\left(e^{t(I_i \times B_i)} \mid I_i = 1\right)\Pr(I_i = 1) \right]$$
$$= \prod_{i=1}^n \left[ (1)(1-q_i) + M_{B_i}(t)\, q_i \right] = \prod_{i=1}^n \left( 1 - q_i + q_i M_{B_i}(t) \right).$$
$$P_N(z) = \frac{1}{1 - \beta(z-1)} \qquad \text{and} \qquad M_X(t) = \frac{1}{1 - \theta t}.$$
$$M_{S_N}(t) = P_N\left[M_X(t)\right] = \frac{1}{1 - \beta\left(\frac{1}{1-\theta t} - 1\right)}$$
$$= \frac{1}{1 - \beta\left(\frac{\theta t}{1-\theta t}\right)} + 1 - 1 = 1 + \frac{\beta\left(\frac{\theta t}{1-\theta t}\right)}{1 - \beta\left(\frac{\theta t}{1-\theta t}\right)}$$
$$= 1 + \frac{\beta\theta t}{(1-\theta t) - \beta\theta t} = 1 + \frac{\beta\theta t}{1 - \theta t(1+\beta)} \cdot \frac{1+\beta}{1+\beta}$$
$$= 1 + \frac{\beta}{1+\beta}\left[\frac{\theta(1+\beta)t}{1 - \theta(1+\beta)t}\right]$$
$$= 1 + \frac{\beta}{1+\beta}\left[\frac{1}{1 - \theta(1+\beta)t} - 1\right],$$
which gives the expression (5.1). For the alternate expression of the mgf (5.2),
we continue from where we just left off:
$$M_{S_N}(t) = 1 + \frac{\beta}{1+\beta}\left[\frac{\theta(1+\beta)t}{1 - \theta(1+\beta)t}\right]$$
$$= \frac{1+\beta}{1+\beta} + \frac{\beta}{1+\beta}\left[\frac{\theta(1+\beta)t}{1 - \theta(1+\beta)t}\right]$$
$$= \frac{1}{1+\beta} + \frac{\beta}{1+\beta} + \frac{\beta}{1+\beta}\left[\frac{\theta(1+\beta)t}{1 - \theta(1+\beta)t}\right]$$
$$= \frac{1}{1+\beta} + \frac{\beta}{1+\beta}\left[1 + \frac{\theta(1+\beta)t}{1 - \theta(1+\beta)t}\right]$$
$$= \frac{1}{1+\beta} + \frac{\beta}{1+\beta}\left[\frac{1}{1 - \theta(1+\beta)t}\right].$$
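This final expression says that $S_N$ is a two-point mixture: with probability $1/(1+\beta)$ it is 0, and with probability $\beta/(1+\beta)$ it follows an exponential distribution with mean $\theta(1+\beta)$. A minimal R simulation check, with hypothetical parameter values and seed:

set.seed(2020)                            # hypothetical seed
beta <- 2; theta <- 100                   # hypothetical parameter values
N <- rgeom(1e5, prob = 1 / (1 + beta))    # geometric frequency with mean beta
S <- sapply(N, function(n) sum(rexp(n, rate = 1 / theta)))
mean(S == 0)                              # close to 1/(1+beta) = 1/3
mean(S[S > 0])                            # close to theta*(1+beta) = 300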
Chapter 6

Simulation and Resampling
step n   B_n                              U_n
0        B_0 = 1
1        B_1 = (3 × 1 + 2) mod 15 = 5     U_1 = 5/15
2        B_2 = (3 × 5 + 2) mod 15 = 2     U_2 = 2/15
3        B_3 = (3 × 2 + 2) mod 15 = 8     U_3 = 8/15
4        B_4 = (3 × 8 + 2) mod 15 = 11    U_4 = 11/15
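A linear congruential generator of this form is simple to code directly; here is a minimal R sketch reproducing the table (the modulus 15 is recovered from $U_n = B_n/15$):

m <- 15                              # modulus of the generator
B <- numeric(5); B[1] <- 1           # B_0 = 1
for (n in 2:5) B[n] <- (3 * B[n - 1] + 2) %% m
B                                    # 1 5 2 8 11, matching the table
B[-1] / m                            # the pseudo-uniform values U_1, ..., U_4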
Three Uniform Random Variates: 0.92424, 0.53718, 0.46920
𝑋𝑖 = 𝐹 −1 (𝑈𝑖 ) .
Here, recall from Section 4.1.1 that we introduced the inverse of the distribution function, $F^{-1}$, and referred to it also as the quantile function. Specifically, it is defined to be
$$F^{-1}(y) = \inf_x\, \{F(x) \ge y\}.$$
Recall that inf stands for infimum, or the greatest lower bound; it is essentially the smallest value of $x$ that satisfies the inequality $\{F(x) \ge y\}$. The result is that the sequence $\{X_i\}$ is approximately iid with distribution function $F$ if the $\{U_i\}$ are iid with uniform on (0,1) distribution function.
The inverse transform result is available when the underlying random variable
is continuous, discrete or a hybrid combination of the two. We now present a
series of examples to illustrate its scope of applications.
Example 6.1.3. Generating Exponential Random Numbers. Suppose
that we would like to generate observations from an exponential distribution
with scale parameter 𝜃 so that 𝐹 (𝑥) = 1 − 𝑒−𝑥/𝜃 . To compute the inverse
transform, we can use the following steps:
$$y = F(x) \Leftrightarrow y = 1 - e^{-x/\theta} \Leftrightarrow -\theta \ln(1-y) = x = F^{-1}(y).$$
In the same way, suppose we wish to generate observations from a Pareto distribution with parameters $\alpha$ and $\theta$, so that $F(x) = 1 - \left(\frac{\theta}{x+\theta}\right)^\alpha$. Then
$$y = F(x) \Leftrightarrow 1 - y = \left(\frac{\theta}{x+\theta}\right)^\alpha \Leftrightarrow (1-y)^{-1/\alpha} = \frac{x+\theta}{\theta} = \frac{x}{\theta} + 1 \Leftrightarrow \theta\left((1-y)^{-1/\alpha} - 1\right) = x = F^{-1}(y).$$
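In R, these inverse transforms can be applied directly to uniform variates; a minimal sketch with hypothetical parameter values (the Pareto quantile function is coded by hand, since it is not in base R):

u <- runif(1000)
theta <- 1000; alpha <- 3                  # hypothetical parameter values
x_exp <- -theta * log(1 - u)               # exponential draws via F^{-1}(u)
x_par <- theta * ((1 - u)^(-1/alpha) - 1)  # Pareto draws via F^{-1}(u)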
Pr(𝑋 ≤ 𝑥) = Pr(𝐹 −1 (𝑈 ) ≤ 𝑥)
= Pr(𝐹 (𝐹 −1 (𝑈 )) ≤ 𝐹 (𝑥))
= Pr(𝑈 ≤ 𝐹 (𝑥)) = 𝐹 (𝑥)
as required. The key step is that 𝐹 (𝐹 −1 (𝑢)) = 𝑢 for each 𝑢, which is clearly
true when 𝐹 is strictly increasing.
[Figure 6.1: cumulative distribution function F(x) of the discrete random variable.]
A graph of the cumulative distribution function in Figure 6.1 shows that the
quantile function can be written as
$$F^{-1}(y) = \begin{cases} 0 & 0 < y \le 0.85 \\ 1 & 0.85 < y \le 1.0. \end{cases}$$
Thus, using the inverse transform, we may define
$$X = \begin{cases} 0 & 0 < U \le 0.85 \\ 1 & 0.85 < U \le 1.0. \end{cases}$$
[Figure 6.2: cumulative distribution function F(x) of a discrete random variable on {1, 2, 3, 4, 5}.]
Using the graph of the distribution function in Figure 6.2, with the inverse
transform we may define
$$X = \begin{cases} 1 & 0 < U \le 0.1 \\ 2 & 0.1 < U \le 0.3 \\ 3 & 0.3 < U \le 0.4 \\ 4 & 0.4 < U \le 0.8 \\ 5 & 0.8 < U \le 1.0. \end{cases}$$
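A minimal R sketch of this discrete inverse transform, using findInterval on the cumulative probabilities:

u <- runif(1000)
cum_probs <- c(0.1, 0.3, 0.4, 0.8, 1.0)   # cumulative probabilities for outcomes 1..5
x <- findInterval(u, cum_probs, left.open = TRUE) + 1   # maps (0, .1] to 1, etc.
table(x) / length(x)                      # long-run proportions near .1, .2, .1, .4, .2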
For general discrete random variables there may not be an ordering of outcomes.
For example, a person could own one of five types of life insurance products and
we might use the following algorithm to generate random outcomes:
Both algorithms produce (in the long-run) the same probabilities, e.g.,
Pr(whole life) = 0.1, and so forth. So, neither is incorrect. You should be
aware that there is more than one way to accomplish a goal. Similarly, you
could use an alternative algorithm for ordered outcomes (such as failure times
1, 2, 3, 4, or 5, above).
Example 6.1.7. Generating Random Numbers from a Hybrid Dis-
tribution. Consider a random variable that is 0 with probability 70% and is
exponentially distributed with parameter 𝜃 = 10, 000 with probability 30%. In
an insurance application, this might correspond to a 70% chance of having no
insurance claims and a 30% chance of a claim - if a claim occurs, then it is
exponentially distributed. The distribution function, depicted in Figure 6.3, is
given as
$$F(x) = \begin{cases} 0 & x < 0 \\ 1 - 0.3\exp(-x/10000) & x \ge 0. \end{cases}$$
From Figure 6.3, we can see that the inverse transform for generating random
variables with this distribution function is
$$X = F^{-1}(U) = \begin{cases} 0 & 0 < U \le 0.7 \\ -10000 \ln\left(\frac{1-U}{0.3}\right) & 0.7 < U < 1. \end{cases}$$
For discrete and hybrid random variables, the key is to draw a graph of the
distribution function that allows you to visualize potential values of the inverse
function.
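For the hybrid example above, the inverse transform is easy to code; a minimal R sketch:

u <- runif(1e4)
x <- ifelse(u <= 0.7, 0, -10000 * log((1 - u) / 0.3))  # 70% zeros, else exponential
mean(x == 0)        # close to 0.7
mean(x[x > 0])      # close to the exponential mean 10000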
[Figure 6.3: distribution function of the hybrid random variable, with a jump of 0.7 at zero.]
$$\bar{h}_R = \frac{1}{R}\sum_{i=1}^R h(X_i)$$
$$s_{h,R}^2 = \frac{1}{R-1}\sum_{i=1}^R \left(h(X_i) - \bar{h}_R\right)^2.$$
From the independence, the standard error of the estimate is $s_{h,R}/\sqrt{R}$. This can be made as small as we like by increasing the number of replications $R$.
Example 6.1.8. Portfolio Management. In Section 3.4, we learned how
to calculate the expected value of policies with deductibles. For an example of
something that cannot be done with closed form expressions, we now consider
two risks. This is a variation of a more complex example that will be covered
as Example 10.3.6.
We consider two property risks of a telecommunications firm:
• 𝑋1 - buildings, modeled using a gamma distribution with mean 200 and
scale parameter 100.
[Figure: expected insurer claims versus the number of simulations R.]
(Recall that 1.96 is the 97.5th percentile from the standard normal distribu-
tion.) Replacing E [ℎ(𝑋)] and Var [ℎ(𝑋)] with estimates, you continue your
simulation until
$$\frac{.01\,\bar{h}_R}{s_{h,R}/\sqrt{R}} \ge 1.96$$
or equivalently
$$R \ge 38{,}416\, \frac{s_{h,R}^2}{\bar{h}_R^2}. \qquad (6.1)$$
The choice of the $h(\cdot)$ function and the distribution of $X$ can play a role in the precision of a Monte Carlo estimate.
Consider the following question: what is $\Pr[X > 2]$ when $X$ has a Cauchy distribution, with density $f(x) = \left(\pi(1+x^2)\right)^{-1}$, on the real line? The true value is
$$\Pr[X > 2] = \int_2^\infty \frac{dx}{\pi(1+x^2)}.$$
One can use an R numerical integration function (which usually works well on
improper integrals)
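For instance, a minimal sketch using base R's integrate:

integrate(function(x) 1 / (pi * (1 + x^2)), lower = 2, upper = Inf)  # approx 0.1475836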
which is equal to 0.14758.
Approximation 1. Alternatively, one can use simulation techniques to approx-
imate that quantity. From calculus, you can check that the quantile function
of the Cauchy distribution is 𝐹 −1 (𝑦) = tan (𝜋(𝑦 − 0.5)). Then, with simulated
uniform (0,1) variates, $U_1, \ldots, U_R$, we can construct the estimator
$$\hat{p}_1 = \frac{1}{R}\sum_{i=1}^R \mathrm{I}\left(F^{-1}(U_i) > 2\right) = \frac{1}{R}\sum_{i=1}^R \mathrm{I}\left(\tan\left(\pi(U_i - 0.5)\right) > 2\right).$$
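A minimal R sketch of this estimator, which produces output like that shown below (exact values depend on the random seed):

R <- 1e6
u <- runif(R)
p1 <- mean(tan(pi * (u - 0.5)) > 2)   # proportion of Cauchy draws exceeding 2
p1
sqrt(p1 * (1 - p1) / R)               # estimated standard error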
[1] 0.147439
[1] 0.0003545432
With one million simulations, we obtain an estimate of 0.14744 with standard error 0.000355. One can prove that the variance of $\hat{p}_1$ is of order $0.127/R$.
Approximation 2. With other choices of $h(\cdot)$ and $F(\cdot)$ it is possible to reduce uncertainty even using the same number of simulations $R$. To begin, one can use the symmetry of the Cauchy distribution to write $\Pr[X > 2] = 0.5 \cdot \Pr[|X| > 2]$. With this, we can construct a new estimator,
$$\hat{p}_2 = \frac{1}{2R}\sum_{i=1}^R \mathrm{I}\left(|F^{-1}(U_i)| > 2\right).$$
Two further estimators arise from rewriting the probability as an integral over a bounded interval:
$$\hat{p}_3 = \frac{1}{2} - \frac{1}{R}\sum_{i=1}^R h_3(2U_i), \quad \text{where } h_3(x) = \frac{2}{\pi(1+x^2)},$$
$$\hat{p}_4 = \frac{1}{R}\sum_{i=1}^R h_4(U_i/2), \quad \text{where } h_4(x) = \frac{1}{2\pi(1+x^2)}.$$
[Figure: density (left panel, vertical axis 0 to 0.3) and cumulative distribution function (right panel, vertical axis 0 to 1) of the sample.]
data: x
D = 0.097037, p-value = 0.3031
alternative hypothesis: two-sided
However, for many distributions of actuarial interest, pre-built programs are
not available. We can use simulation to test the relevance of the test statis-
tic. Specifically, to compute the 𝑝-value, let us generate thousands of random
samples from a 𝐿𝑁 (1, 0.4) distribution (with the same size), and compute em-
pirically the distribution of the statistic,
# D(x, Fcdf) computes the Kolmogorov-Smirnov statistic, assumed defined earlier,
# e.g., D <- function(x, Fcdf) as.numeric(ks.test(x, Fcdf)$statistic)
ns <- 1e4
d_KS <- rep(NA, ns)
# compute the test statistic for a large number (ns) of samples simulated from LN(1, 0.4)
for(s in 1:ns) d_KS[s] <- D(rlnorm(n, 1, .4), function(x) plnorm(x, 1, .4))
# empirical p-value: proportion of simulated statistics exceeding the observed one
mean(d_KS > D(x, function(x) plnorm(x, 1, .4)))
[1] 0.2843
[Figure 6.6: simulated distribution of the Kolmogorov-Smirnov test statistic.]
Here, the statistic exceeded the empirical value (0.09704) in 28.43%
of the scenarios, while the theoretical 𝑝-value is 0.3031. For both the simulation
and the theoretical 𝑝-values, the conclusions are the same; the data do not
provide sufficient evidence to reject the hypothesis of a lognormal distribution.
Although only an approximation, the simulation approach works in a variety
of distributions and test statistics without needing to develop the nuances of
the underpinning theory for each situation. We summarize the procedure for
developing simulated distributions and p-values as follows:
1. Draw a sample of size n, say, $X_1, \ldots, X_n$, from a known distribution function $F$. Compute a statistic of interest, denoted as $\hat{\theta}(X_1, \ldots, X_n)$. Call this $\hat{\theta}^r$ for the rth replication.
2. Repeat this $r = 1, \ldots, R$ times to get a sample of statistics, $\hat{\theta}^1, \ldots, \hat{\theta}^R$.
3. From the sample of statistics in Step 2, $\{\hat{\theta}^1, \ldots, \hat{\theta}^R\}$, compute a summary measure of interest, such as a p-value.
The resampling algorithm is the same as introduced in Section 6.1.4 except that
now we use simulated draws from a sample. It is common to use {𝑋1 , … , 𝑋𝑛 } to
denote the original sample and let {𝑋1∗ , … , 𝑋𝑛∗ } denote the simulated draws. We
draw them with replacement so that the simulated draws will be independent
from one another, the same assumption as with the original sample. For each
sample, we also use n simulated draws, the same number as the original sample
size. To distinguish this procedure from the simulation, it is common to use
B (for bootstrap) to be the number of simulated samples. We could also write
$\{X_1^{(b)}, \ldots, X_n^{(b)}\}$, $b = 1, \ldots, B$, to clarify this.
There are two basic resampling methods, model-free and model-based, also known, respectively, as nonparametric and parametric. In the nonparametric approach,
no assumption is made about the distribution of the parent population. The
simulated draws come from the empirical distribution function 𝐹𝑛 (⋅), so each
draw comes from {𝑋1 , … , 𝑋𝑛 } with probability 1/n.
In contrast, for the parametric approach, we assume that we have knowledge
of the distribution family F. The original sample 𝑋1 , … , 𝑋𝑛 is used to estimate
parameters of that family, say, 𝜃.̂ Then, simulated draws are taken from the
̂ Section 6.2.4 discusses this approach in further detail.
𝐹 (𝜃).
Nonparametric Bootstrap
The idea of the nonparametric bootstrap is to use the inverse transform method
on 𝐹𝑛 , the empirical cumulative distribution function, depicted in Figure 6.7.
[Figure 6.7: empirical distribution function, with y = F(x) on the vertical axis and x = F^{-1}(y) on the horizontal axis.]
• …
• if 𝑦 ∈ ((𝑛 − 1)/𝑛, 1) (with probability 1/𝑛) we draw the largest value
(max{𝑥𝑖 }).
Using the inverse transform method with 𝐹𝑛 means sampling from {𝑥1 , ⋯ , 𝑥𝑛 },
with probability 1/𝑛. Generating a bootstrap sample of size 𝐵 means sampling
from {𝑥1 , ⋯ , 𝑥𝑛 }, with probability 1/𝑛, with replacement. See the following
illustrative R code.
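In R, this is a single call to sample with replacement; a minimal sketch with a hypothetical sample:

x <- c(2, 5, 6, 6, 17)     # hypothetical original sample
B <- 1000
# each bootstrap replicate: resample n values with replacement, compute the statistic
theta_star <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))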
In this section, we focus on three summary measures, the bias, the standard deviation, and the mean square error (MSE). Table 6.2 summarizes these three measures. Here, $\bar{\theta}^*$ is the average of $\{\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*\}$.
The bootstrap standard deviation gives a measure of precision. For one appli-
cation of standard deviations, we can use the normal approximation to create
a confidence interval. For example, the R function boot.ci produces the nor-
mal confidence intervals at 95%. These are produced by creating an interval
of twice the length of 1.95994 bootstrap standard deviations, centered about
the bias-corrected estimator (1.95994 is the 97.5th quantile of the standard nor-
mal distribution). For example, the lower normal 95% CI at 𝑑 = 14000 is
(0.97678 − 0.00018) − 1.95994 ∗ 0.00701 = 0.96286. We further discuss bootstrap
confidence intervals in the next section.
$$\hat{\theta}_3 = \frac{1}{B}\sum_{b=1}^B \exp(\bar{x}_b^*),$$
where $\bar{x}_b^*$ is the mean of the $b$th bootstrap sample.
To implement this, we have the following code.
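A sketch of the call, reconstructed from the output echoed below (sample_x holds the observations):

library(boot)
results <- boot(data = sample_x,
                statistic = function(y, indices) exp(mean(y[indices])),
                R = 1000)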
Then, you can plot(results) and print(results) to see the following.
Call:
boot(data = sample_x, statistic = function(y, indices) exp(mean(y[indices])),
R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 19.13463 0.2536551 3.909725
[Figure: output of plot(results), showing a histogram of the bootstrap replicates t* (left) and a quantile plot of t* (right).]
This results in three estimators, the raw estimator 𝜃1̂ = 19.135, the second-order
correction 𝜃2̂ = 18.733, and the bootstrap estimator 𝜃3̂ = 19.388.
How does this work with differing sample sizes? We now suppose that the $x_i$'s are generated from a lognormal distribution $LN(0,1)$, so that $\mu = \exp(0+1/2) = 1.648721$ and $\theta = \exp(1.648721) = 5.200326$. We use simulation to draw the samples but then act as if they were a realized set of observations. See the following illustrative code.
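A minimal sketch of the comparison for the raw and bootstrap estimators (the second-order correction is omitted; the seed is hypothetical):

library(boot)
set.seed(2020)                                   # hypothetical seed
for (n in seq(20, 100, by = 20)) {
  x <- rlnorm(n, meanlog = 0, sdlog = 1)         # simulated "observed" sample
  raw <- exp(mean(x))                            # raw estimator of theta
  bs <- boot(x, function(y, i) exp(mean(y[i])), R = 1000)
  theta3 <- mean(bs$t)                           # bootstrap estimator
  cat(n, raw, theta3, "\n")
}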
The results of the comparison are summarized in Figure 6.10. This figure shows
that the bootstrap estimator is closer to the true parameter value for almost
all sample sizes. The bias of all three estimators decreases as the sample size
increases.
[Figure 6.10: comparison of the estimators across sample sizes from 20 to 100.]
(2𝜃 ̂ − 𝑞𝑈 , 2𝜃 ̂ − 𝑞𝐿 ) , (6.2)
where $q_L$ and $q_U$ are lower and upper 2.5% quantiles from the bootstrap sample $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$.
To see where this comes from, start with the idea that $(q_L, q_U)$ provides a 95% interval for $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$. So, for a random $\hat{\theta}_b^*$, there is a 95% chance that $q_L \le \hat{\theta}_b^* \le q_U$. Reversing the inequalities and adding $\hat{\theta}$ to each side gives the 95% interval
$$\hat{\theta} - q_U \le \hat{\theta} - \hat{\theta}_b^* \le \hat{\theta} - q_L.$$
So, $(\hat{\theta} - q_U, \hat{\theta} - q_L)$ is a 95% interval for $\hat{\theta} - \hat{\theta}_b^*$. The bootstrap approximation idea says that this is also a 95% interval for $\theta - \hat{\theta}$. Adding $\hat{\theta}$ to each side gives the 95% interval in equation (6.2).
Many alternative bootstrap intervals are available. The easiest to explain is the
percentile bootstrap interval which is defined as (𝑞𝐿 , 𝑞𝑈 ). However, this has the
drawback of potentially poor behavior in the tails which can be of concern in
some actuarial problems of interest.
Example 6.2.3. Bodily Injury Claims and Risk Measures. To see how
the bootstrap confidence intervals work, we return to the bodily injury auto
claims considered in Example 6.2.1. Instead of the loss elimination ratio, sup-
pose we wish to estimate the 95th percentile 𝐹 −1 (0.95) and a measure defined
as
𝑇 𝑉 𝑎𝑅0.95 [𝑋] = E[𝑋|𝑋 > 𝐹 −1 (0.95)].
This measure is called the tail value-at-risk; it is the expected value of 𝑋 condi-
tional on 𝑋 exceeding the 95th percentile. Section 10.2 explains how quantiles
and the tail value-at-risk are the two most important examples of so-called risk
measures. For now, we will simply think of these as measures that we wish
to estimate. For the percentile, we use the nonparametric estimator 𝐹𝑛−1 (0.95)
defined in Section 4.1.1. For the tail value-at-risk, we use the plug-in principle
to define the nonparametric estimator
$$TVaR_{n,0.95}[X] = \frac{\sum_{i=1}^n X_i\, \mathrm{I}\left(X_i > F_n^{-1}(0.95)\right)}{\sum_{i=1}^n \mathrm{I}\left(X_i > F_n^{-1}(0.95)\right)}.$$
In this expression, the denominator counts the number of observations that
exceed the 95th percentile 𝐹𝑛−1 (0.95). The numerator adds up losses for those
observations that exceed 𝐹𝑛−1 (0.95). Table 6.4 summarizes the estimator for
selected fractions.
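A minimal R sketch of these plug-in estimators, assuming a loss vector X:

tvar_np <- function(X, frac = 0.95) {
  qhat <- quantile(X, probs = frac, type = 1)   # nonparametric percentile estimate
  c(quantile = as.numeric(qhat),
    TVaR = mean(X[X > qhat]))                   # average of losses above the percentile
}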
Table 6.4. Bootstrap Estimates of Quantiles at Selected Fractions
Fraction  NP Estimate  Bootstrap Bias  Bootstrap SD  Normal 95% CI         Basic 95% CI          Percentile 95% CI
0.50      6500.00      -128.02         200.36        (6235.32, 7020.72)    (6300.00, 7000.00)    (6000.00, 6700.00)
0.80      9078.40      89.51           200.27        (8596.38, 9381.41)    (8533.20, 9230.40)    (8926.40, 9623.60)
0.90      11454.00     55.95           480.66        (10455.96, 12340.13)  (10530.49, 12415.00)  (10493.00, 12377.51)
0.95      13313.40     13.59           667.74        (11991.07, 14608.55)  (11509.70, 14321.00)  (12305.80, 15117.10)
0.98      16758.72     101.46          1273.45       (14161.34, 19153.19)  (14517.44, 19326.95)  (14190.49, 19000.00)
For example, when the fraction is 0.50, we see that the lower and upper 2.5% quantiles of the bootstrap simulations are $q_L = 6000$ and $q_U = 6700$, respectively.
These form the percentile bootstrap confidence interval. With the nonpara-
metric estimator 6500, these yield the lower and upper bounds of the basic
confidence interval 6300 and 7000, respectively. Table 6.4 also shows bootstrap
estimates of the bias, standard deviation, and a normal confidence interval, con-
cepts introduced in Section 6.2.2.
Table 6.5 shows similar calculations for the tail value-at-risk. In each case,
we see that the bootstrap standard deviation increases as the fraction increases.
This is because there are fewer observations to estimate quantiles as the fraction
increases, leading to greater imprecision. Confidence intervals also become wider.
Interestingly, there does not seem to be the same pattern in the estimates of
the bias.
Table 6.5. Bootstrap Estimates of TVaR at Selected Risk Levels
Fraction  NP Estimate  Bootstrap Bias  Bootstrap SD  Normal 95% CI         Basic 95% CI          Percentile 95% CI
0.50      9794.69      -120.82         273.35        (9379.74, 10451.27)   (9355.14, 10448.87)   (9140.51, 10234.24)
0.80      12454.18     30.68           481.88        (11479.03, 13367.96)  (11490.62, 13378.52)  (11529.84, 13417.74)
0.90      14720.05     17.51           718.23        (13294.82, 16110.25)  (13255.45, 16040.72)  (13399.38, 16184.65)
0.95      17072.43     5.99            1103.14       (14904.31, 19228.56)  (14924.50, 19100.88)  (15043.97, 19220.36)
0.98      20140.56     73.43           1587.64       (16955.40, 23178.85)  (16942.36, 22984.40)  (17296.71, 23338.75)
[Figure: nonparametric and parametric (LN) bootstrap densities of the coefficient of variation.]
Results in Table 6.6 are consistent with the results for the uncensored subsample
in Table 6.4. In Table 6.6, we note the difficulty in estimating quantiles at large
fractions due to the censoring. However, for moderate size fractions (0.50, 0.80,
and 0.90), the Kaplan-Meier nonparametric (KM NP) estimates of the quantile
are consistent with those in Table 6.4. The bootstrap standard deviation is smaller at the 0.50 level (corresponding to the median) but larger at the 0.80 and 0.90 levels.
The censored data analysis summarized in Table 6.6 uses more data than the
uncensored subsample analysis in Table 6.4 but also has difficulty extracting
information for large quantiles.
6.3 Cross-Validation
In this section, you learn how to:
• Compare and contrast cross-validation to simulation techniques and boot-
strap methods.
• Use cross-validation techniques for model selection
• Explain the jackknife method as a special case of cross-validation and
calculate jackknife estimates of bias and standard errors
Overlap exists among these methods, but nonetheless it is helpful to think about the broad goals associated with each statistical method.
To discuss cross-validation, let us recall from Section 4.2 some of the key ideas
of model validation. When assessing, or validating, a model, we look to perfor-
mance measured on new data, or at least not those that were used to fit the
model. A classical approach, described in Section 4.2.3, is to split the sample in
two: a subpart (the training dataset) is used to fit the model and the other one
(the testing dataset) is used to validate. However, a limitation of this approach
is that results depend on the split; even though the overall sample is fixed, the
split between training and test subsamples varies randomly. A different train-
ing sample means that model estimated parameters will differ. Different model
parameters and a different test sample means that validation statistics will dif-
fer. Two analysts may use the same data and same models yet reach different
conclusions about the viability of a model (based on different random splits), a
frustrating situation.
[Figure: Kolmogorov-Smirnov statistic by cross-validation fold (1 through 8) for the Pareto and gamma models.]
$$\hat{\theta}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^n \hat{\theta}_{-i},$$
the average of the leave-one-out estimates $\hat{\theta}_{-i}$ (the statistic computed with the $i$th observation removed). These values can be used to create estimates of the bias of the statistic $\hat{\theta}$,
$$Bias_{jack} = (n-1)\left(\hat{\theta}_{(\cdot)} - \hat{\theta}\right), \qquad (6.3)$$
as well as its standard deviation,
$$s_{jack} = \sqrt{\frac{n-1}{n}\sum_{i=1}^n \left(\hat{\theta}_{-i} - \hat{\theta}_{(\cdot)}\right)^2}. \qquad (6.4)$$
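A minimal leave-one-out sketch of equations (6.3) and (6.4) for a generic statistic:

jack <- function(x, stat = mean) {
  n <- length(x)
  theta_i <- sapply(1:n, function(i) stat(x[-i]))         # leave-one-out estimates
  theta_dot <- mean(theta_i)
  bias <- (n - 1) * (theta_dot - stat(x))                 # equation (6.3)
  se <- sqrt((n - 1) / n * sum((theta_i - theta_dot)^2))  # equation (6.4)
  c(bias = bias, se = se)
}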
Table 6.7 summarizes the results of the jackknife estimation. It shows that
jackknife estimates of the bias and standard deviation of the loss elimination ra-
tio E [min(𝑋, 𝑑)]/E [𝑋] are largely consistent with the bootstrap methodology.
Moreover, one can use the standard deviations to construct normal based con-
fidence intervals, centered around a bias-corrected estimator. For example, at
𝑑 = 14000, we saw in Example 4.1.11 that the nonparametric estimate of LER
is 0.97678. This has an estimated bias of 0.00010, resulting in the (jackknife)
bias-corrected estimator 0.97688. The 95% confidence intervals are produced by
creating an interval of twice the length of 1.96 jackknife standard deviations,
centered about the bias-corrected estimator (1.96 is the approximate 97.5th
quantile of the standard normal distribution).
Table 6.7. Jackknife Estimates of LER at Selected Deductibles
Discussion. One of the many interesting things about the leave-one-out special
case is the ability to replicate estimates exactly. That is, when the size of the
fold is only one, then there is no additional uncertainty induced by the cross-
validation. This means that analysts can exactly replicate work of one another,
an important consideration.
Jackknife statistics were developed to understand precision of estimators, pro-
ducing estimators of bias and standard deviation in equations (6.3) and (6.4).
This crosses into goals that we have associated with bootstrap techniques, not
cross-validation methods. This demonstrates how statistical techniques can be
used to achieve different goals.
Repeat this process many (say 𝐵) times. Take an average over the results and
choose the model based on the average evaluation statistic.
Example 6.3.4. Wisconsin Property Fund. Return to Example 6.3.1 where
we investigate the fit of the gamma and Pareto distributions on the property
fund data. We again compare the predictive performance using the Kolmogorov-
Smirnov (KS) statistic but this time using the bootstrap procedure to split the
data between training and testing samples. The following provides illustrative
code.
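A minimal sketch of the bootstrap validation loop, assuming a positive loss vector claims; the Pareto distribution function and likelihood are coded by hand, since they are not in base R:

ppareto <- function(q, alpha, theta) 1 - (theta / (q + theta))^alpha
negll <- function(par, x)            # Pareto negative log-likelihood
  -sum(log(par[1]) + par[1] * log(par[2]) - (par[1] + 1) * log(x + par[2]))
B <- 100; n <- length(claims)
ks_p <- ks_g <- rep(NA, B)
for (b in 1:B) {
  id <- unique(sample(1:n, replace = TRUE))          # training indices
  train <- claims[id]; test <- claims[-id]           # held-out losses
  fp <- optim(c(2, mean(train)), negll, x = train,
              method = "L-BFGS-B", lower = c(0.01, 0.01))$par
  fg <- MASS::fitdistr(train, "gamma")$estimate
  ks_p[b] <- ks.test(test, function(q) ppareto(q, fp[1], fp[2]))$statistic
  ks_g[b] <- ks.test(test, function(q) pgamma(q, fg["shape"], fg["rate"]))$statistic
}
c(Pareto = mean(ks_p), gamma = mean(ks_g))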
We did the sampling using 𝐵 = 100 replications. The average KS statistic
for the Pareto distribution was 0.058 compared to the average for the gamma
distribution, 0.262. This is consistent with earlier results and provides another
piece of evidence that the Pareto is a better model for these data than the
gamma.
$$F^\star(x) = \Pr(X \le x \mid a < X \le b) = \frac{F(x) - F(a)}{F(b) - F(a)}, \quad \text{for } a < x \le b.$$
Using the inverse transform method in Section 6.1.2, we compute the rescaled uniform draw
$$\tilde{U} = (1-U)\cdot F(a) + U\cdot F(b)$$
and then use $F^{-1}(\tilde{U})$. With this approach, each draw counts.
This can be related to the importance sampling mechanism: we draw more frequently in regions where we expect to have quantities that have some interest. This transform can be considered as a "change of measure."
In Example 6.4.1., the inverse of the normal distribution is readily available (in
R, the function is qnorm). However, for other applications, this is not the case.
Then, one simply uses numerical methods to determine 𝑋 ⋆ as the solution of
the equation 𝐹 (𝑋 ⋆ ) = 𝑈̃ where 𝑈̃ = (1 − 𝑈 ) ⋅ 𝐹 (𝑎) + 𝑈 ⋅ 𝐹 (𝑏). See the following
illustrative code.
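A minimal sketch for the normal case, followed by the generic numerical fallback (the truncation points are hypothetical):

a <- 1; b <- 3                                  # hypothetical truncation points
u <- runif(1000)
u_tilde <- (1 - u) * pnorm(a) + u * pnorm(b)    # rescaled uniforms
x_star <- qnorm(u_tilde)                        # draws from N(0,1) restricted to (a, b]
# without a closed-form inverse, solve F(x) = u_tilde numerically, e.g.
# uniroot(function(x) F(x) - u_tilde[1], interval = c(a, b))$root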
This section is being written and is not yet complete nor edited. It
is here to give you a flavor of what will be in the final version.
The idea of Monte Carlo techniques relies on the law of large numbers (which ensures the convergence of the average towards the integral) and the central limit theorem (which is used to quantify uncertainty in the computations). Recall that if $(X_i)$ is an iid sequence of random variables with distribution $F$, then
$$\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n h(X_i) - \int h(x)\,dF(x)\right) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \sigma^2), \quad \text{as } n \to \infty,$$
for some variance $\sigma^2 > 0$. Actually, the ergodic theorem can be used to weaken the assumptions behind the previous result, since it is not necessary to have independence of the variables. More precisely, if $(X_i)$ is a Markov process with invariant measure $\mu$, then under some additional technical assumptions we can obtain
$$\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n h(X_i) - \int h(x)\,d\mu(x)\right) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \sigma_\star^2), \quad \text{as } n \to \infty.$$
$$R = \frac{f(x^\star_{t+1})}{f(x_t)},$$
and we move to the candidate value with probability $r = \min(R, 1)$; otherwise, we stay at $x_t$. Here $r$ is called the acceptance ratio: we accept the new value with probability $r$ (the smaller of 1 and $R$, since $R$ can exceed 1).
For instance, assume that 𝑓(⋅|𝑥𝑡 ) is uniform on [𝑥𝑡 − 𝜀, 𝑥𝑡 + 𝜀] for some 𝜀 > 0,
and where 𝑓 (our target distribution) is the 𝒩(0, 1). We will never draw from
𝑓, but we will use it to compute our acceptance ratio at each step.
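A sketch of this random-walk Metropolis step, using the names vec and innov referred to below (epsilon and the chain length are hypothetical):

eps <- 1; n_iter <- 1000
vec <- numeric(n_iter); vec[1] <- 0
for (t in 2:n_iter) {
  innov <- runif(1, -eps, eps)                  # uniform innovation on [-eps, eps]
  cand <- vec[t - 1] + innov                    # candidate value
  r <- min(1, dnorm(cand) / dnorm(vec[t - 1]))  # acceptance ratio for the N(0,1) target
  vec[t] <- if (runif(1) < r) cand else vec[t - 1]
}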
In the code above, vec contains values of 𝑥 = (𝑥1 , 𝑥2 , ⋯), innov is the innova-
tion.
[Figure: trajectory of the simulated chain, sim[,1] versus sim[,2].]
The construction of the sequence (MCMC algorithms are iterative) can be visu-
alized below
Contributors
• Arthur Charpentier, Université du Quebec á Montreal, and Edward
W. (Jed) Frees, University of Wisconsin-Madison, are the principal au-
thors of the initial version of this chapter. Email: [email protected]
and/or [email protected] for chapter comments and sug-
gested improvements.
• Chapter reviewers include Yvonne Chueh and Brian Hartman. Write Jed or Arthur to add your name here.
Chapter 7

Premium Foundations
This chapter explains how you can think about determining the appropriate
price for an insurance product. As described in Section 1.2, one of the core
actuarial functions is ratemaking, where the analyst seeks to determine the
right price for a risk.
As this is a core function, let us first take a step back to define terms. A price
is a quantity, usually of money, that is exchanged for a good or service. In
insurance, we typically use the word premium for the amount of money charged
for insurance protection against contingent events. The amount of protection
varies by risk being insured. For example, in homeowners insurance the amount
of insurance protection depends on the value of the house. In life insurance, the
amount of protection depends on a policyholder’s financial status (e.g. income
and wealth) as well as a perceived need for financial security. So, it is common to
express insurance prices as a unit of the protection being purchased, for example,
a price per thousand dollars of coverage on a home or benefit in the event of
death. These prices/premiums are known as rates because they are expressed
in standardized units.
Because costs are unknown at the time of sale, insurance pricing differs from
common economic approaches. This chapter squarely addresses the uncertain
nature of costs by introducing traditional actuarial approaches that determine
prices as a function of insurance costs. As we will see, this pricing approach is
sufficient for some insurance markets such as personal automobile or homeown-
ers where the insurer has a portfolio of many independent risks. However, there
are other insurance markets where actuarial prices only provide an input to gen-
eral market prices. To reinforce this distinction, actuarial cost-based premiums
are sometimes known as technical prices. From the perspective of economists,
corporate decisions such as pricing are to be evaluated with reference to their
impact on the firm’s market value. This objective is more comprehensive than
the static notion of profit maximization. That is, you can think of the value
of the firm as the capitalized value of all future expected profits. Decisions im-
pacting this value in turn affect all groups having claims on the firm, including
stockholders, bondholders, policyowners (in the case of mutual companies), and
so forth.
The Expense term can be split into those that vary by premium (such as
sales commissions) and those that do not (such as building costs and employee
salaries). The term UW Profit is a residual that stands for underwriting profit. It may also include a cost of capital (for example, an annual dividend to company investors). Because fixed expenses and costs of capital are difficult to
interpret for individual contracts, we think of the equation (7.1) relationship as
holding over the sum of many contracts (a portfolio) and work with it in aggre-
gate. Then, in Section 7.2 we use this approach to help us think about setting
premiums, for example by setting profit objectives. Specifically, Sections 7.2.1
and 7.2.2 introduce two prevailing methods used in practice for determining
premiums, the pure premium and the loss ratio methods.
The Loss in equation (7.1) is random and so, as a baseline, we use the expected
costs to determine rates. There are several ways to motivate this perspective
that we expand upon in Section 7.3. For now, we will suppose that the insurer
enters into many contracts with risks that are similar except, by pure chance,
in some cases the insured event occurs and in others it does not. The insurer
is obligated to pay the total amount of claim payments for all contracts. If
risks are similar, then all policyholders are equally likely to contribute to the
total loss. So, from this perspective, it makes sense to look at the average claim
payment over many insureds. From probability theory, specifically the law of
large numbers, we know that the average of iid risks is close to the expected
amount, so we use the expectation as a baseline pricing principle.
Nonetheless, by using expected losses, we essentially assume that the uncertainty
is non-existent. If the insurer sells enough independent policies, this may be a
reasonable approximation. However, there will be other cases, such as a single
contract issued to a large corporation to insure all of its buildings against fire
damage, where the use of only an expectation for pricing is not sufficient. So,
Section 7.3 also summarizes alternative premium principles that incorporate
uncertainty into our pricing. Note that an emphasis of this text is estimation of
the entire distribution of losses so the analyst is not restricted to working only
with expectations.
The aggregate methods derived from equation (7.1) focus on collections of ho-
mogeneous risks that are similar except for the occurrence of random losses. In
statistical language that we have introduced, this is a discussion about risks
that have identical distributions. Naturally, when examining risks that insur-
ers work with, there are many variations in the risks being insured including
the features of the contracts and the people being insured. Section 7.4 extends
pricing considerations to heterogeneous collections of risks.
Section 7.5 introduces development and trending. When developing rates, we
want to use the most recent loss experience because the goal is to develop rates
that are forward looking. However, at contract initiation, recent loss experience
is often not known; it may be several years until it is fully realized. So, this
section introduces concepts needed for incorporating recent loss experience into
our premium development. Development and trending of experience is related to
but also differs from the idea of experience rating that suggests that experience
reveals hidden information about the insured and so should be incorporated in
our forward thinking viewpoint. Chapter 9 discusses this idea in more detail.
The final section of this chapter introduces methods for selecting a premium.
This is done by comparing a premium rating method to losses from a held-out
portfolio and selecting the method that produces the best match with the held-
out data. For a typical insurance portfolio, most policies produce zero losses,
that is, do not have a claim. Because the distribution of held-out losses is
a combination of (a large number of) zeros and continuous amounts, special
techniques are useful. Section 7.6 introduces concepts of concentration curves
and corresponding Gini statistics to help in this selection.
So, when premiums are determined using the pure premium method, we either
take the average loss (loss cost) or use the frequency-severity approach.
To get a bit closer to applications in practice, we now return to equation (7.1)
that includes expenses. Equation (7.1) also refers to UW Profit for underwriting
profit. When rescaled by premiums, this is known as the profit loading. Because
claims are uncertain, the insurer must hold capital to ensure that all claims are
paid. Holding this extra capital is a cost of doing business; investors in the company need to be compensated for this, hence the extra loading.
We now decompose Expenses into those that vary by premium, Variable, and
those that do not, Fixed so that Expenses = Variable + Fixed. Thinking of
variable expenses and profit as a fraction of premiums, we define
$$V = \frac{\text{Variable}}{\text{Premium}} \qquad \text{and} \qquad Q = \frac{\text{UW Profit}}{\text{Premium}}.$$
With these definitions, equation (7.1) can be rearranged as
$$\text{Premium} = \frac{\text{Losses} + \text{Fixed}}{1 - V - Q}. \qquad (7.2)$$
In words, the premium is the sum of losses and fixed expenses, grossed up by the variable expense and profit loadings.
Example. CAS Exam 5, 2004, Number 13. Determine the indicated rate
per exposure unit, given the following information:
• Frequency per exposure unit = 0.25
• Severity = $100
• Fixed expense per exposure unit = $10
• Variable expense factor = 20%
• Profit and contingencies factor = 5%
Solution. Under the pure premium method, the indicated rate is
$$\frac{\text{frequency} \times \text{severity} + \text{fixed expense}}{1 - V - Q} = \frac{0.25 \times 100 + 10}{1 - 0.20 - 0.05} = \frac{35}{0.75} = 46.67.$$
From the example, note that the rates produced by the pure premium method
are commonly known as indicated rates.
From our development, note also that the profit is associated with the under-
writing aspect of the contract and not investments. Premiums are typically
paid at the beginning of a contract and insurers receive investment income from
holding this money. However, due in part to the short-term nature of the con-
tracts, investment income is typically ignored in pricing. This builds a bit of
conservatism into the process that insurers welcome. It is probably most rel-
evant in the very long “tail” lines such as workers’ compensation and medical
malpractice. In these lines, it can sometimes take 20 years or even longer to set-
tle claims. But, these are also the most volatile lines with some claim amounts
being large relative to the rest of the distribution. The mitigating factor is that
these large claim amounts tend to be far in the future and so are less extreme
when viewed in a discounted sense.
$$\text{Loss Ratio} = \frac{\text{Loss}}{\text{Premium}}.$$
This adjustment factor is then applied to current rates to get new indicated
rates.
To see how this works in a simple context, let us return to equation (7.1) but now
ignore expenses to get Premium = Losses + UW Profit. Dividing by premiums
yields
$$\frac{\text{UW Profit}}{\text{Premium}} = 1 - LR = 1 - \frac{\text{Loss}}{\text{Premium}}.$$
Suppose that we have in mind a new “target” profit loading, say 𝑄𝑡𝑎𝑟𝑔𝑒𝑡 . As-
suming that losses, exposure, and other things about the contract stay the same,
then to achieve the new target profit loading we adjust the premium. Use the
ICF for the indicated change factor that is defined through the expression
$$ICF = \frac{\text{Loss}}{\text{Premium} \times (1 - Q_{target})} = \frac{LR}{1 - Q_{target}}.$$
So, for example, if we have a current loss ratio = 85% and a target profit load-
ing 𝑄𝑡𝑎𝑟𝑔𝑒𝑡 = 0.20, then 𝐼𝐶𝐹 = 0.85/0.80 = 1.0625, meaning that we increase
premiums by 6.25%.
Now let’s see how this works with expenses in equation (7.1). We can use the
same development as in Section 7.2.1 and so start with equation (7.2), solve for
the profit loading to get
$$Q = 1 - \frac{\text{Loss} + \text{Fixed}}{\text{Premium}} - V.$$
Premium
We interpret the quantity Fixed /Premium + V as the “operating expense ratio.”
Now, fix the profit percentage Q at a target and adjust premiums through the
“indicated change factor” 𝐼𝐶𝐹
$$Q_{target} = 1 - \frac{\text{Loss} + \text{Fixed}}{\text{Premium} \times ICF} - V.$$
Solving for 𝐼𝐶𝐹 yields
$$ICF = \frac{\text{Loss} + \text{Fixed}}{\text{Premium} \times (1 - V - Q_{target})} = \frac{\text{Loss Ratio} + \text{Fixed Expense Ratio}}{1 - V - Q_{target}}. \qquad (7.3)$$
This means that the overall average rate level should be increased by 10%.
We later provide a comparison of the pure premium and loss ratio methods
in Section 7.5.3. As inputs, that section will require discussions of trended
exposures and on-level premiums defined in Section 7.5.
$$H_{Exp}(X) = \frac{-1}{\alpha_{Exp}} \log\left(1 - \alpha_{Exp}\,\theta\right).$$
To see the relationship between $H_{Exp}(X)$ and $H_{Var}(X)$, we choose $\alpha_{Exp} = 2\alpha_{Var}$. With an approximation from calculus ($\log(1-x) = -x - x^2/2 - x^3/3 - \cdots$), we write
$$H_{Exp}(X) = \theta + \frac{\alpha_{Exp}}{2}\theta^2 + \cdots = \theta + \alpha_{Var}\,\theta^2 + \cdots \approx H_{Var}(X).$$
Table 7.2. Properties of Premium Principles

Description           Definition
Nonnegative loading   H(X) ≥ E[X]
Additivity            H(X_1 + X_2) = H(X_1) + H(X_2), for independent X_1, X_2
Scale invariance      H(cX) = cH(X), for c ≥ 0
Consistency           H(c + X) = c + H(X)
No rip-off            H(X) ≤ max{X}
This is simply a subset of the many properties quoted in the actuarial literature.
For example, the review paper of Young (2014) lists 15 properties. See also the
properties described as coherent axioms that we introduce for risk measures in
Section 10.3.
Some of the properties listed in Table 7.2 are mild in the sense that they will
nearly always be satisfied. For example, the no rip-off property indicates that
the premium charge will be smaller than the largest or “maximal” value of
the loss $X$ (here, we use the notation $\max\{X\}$ for this maximal value, which is defined as an "essential supremum" in mathematics). Other properties may not
be so mild. For example, for a portfolio of independent risks, the actuary may
want the additivity property to hold. It is easy to see that this property holds for
the expected value, variance, and exponential premium principles but not for the
standard deviation principle. Another example is the consistency property that
does not hold for the expected value principle when the risk loading parameter
𝛼 is positive.
The scale invariance principle is known as homogeneity of degree one in eco-
nomics. For example, it allows us to work in different currencies (e.g., from
dollars to Euros) as well as a host of other applications and will be discussed
further in the following Section 7.4. Although a generally accepted principle,
we note that this principle does not hold for a large value of 𝑋 that may border
on a surplus constraint of an insurer; if an insurer has a large probability of
becoming insolvent, then that insurer may not wish to use linear pricing. It
is easy to check that this principle holds for the expected value and standard
deviation principles, although not for the variance and exponential principles.
As noted in Section 7.1, there are many variations in the risks being insured,
the features of the contracts, and the people being insured. As an example, you
might have a twin brother or sister who works in the same town and earns a
roughly similar amount of money. Still, when it comes to selecting choices in
rental insurance to insure contents of your apartment, you can imagine differ-
ences in the amount of contents to be insured, choices of deductibles for the
amount of risk retained, and perhaps different levels of uncertainty given the
relative safety of your neighborhoods. People and risks that they insure are
different.
When thinking about a collection of different (heterogeneous) risks, one option
is to price all risks the same. This is common in government sponsored programs
for flood or health insurance. However, it is also common to have different prices
where the differences are commensurate with the risk being insured.
Table 7.3 provides a few examples. We remark that this table refers to “earned”
car and house years, concepts that will be explained in Section 7.5.
Table 7.3. Commonly used Exposures in Different Types of Insurance
Table 7.4 provides just a few examples. In many jurisdictions, the personal
insurance market (e.g., auto and homeowners) is very competitive - using 10 or
20 variables for rating purposes is not uncommon.
In this case, the rating factors AOI and Terr produce nine cells. Note that one might combine the cell "territory one with a low amount of insurance" with another cell because there are only 7 policies in that cell. Doing so is perfectly acceptable - considerations of this sort are one of the main jobs of the analyst. An
outline on selecting variables is in Chapter 8, including Technical Supplement
TS 8.B. Alternatively, you can also think about reinforcing information about
the cell (Terr 1, Low AOI ) by “borrowing” information from neighboring cells
(e.g., other territories with the same AOI, or other amounts of AOI within Terr
1). This is the subject of credibility that is introduced in Chapter 9.
$$\text{Relativity}_j = \frac{(\text{Loss/Exposure})_j}{(\text{Loss/Exposure})_{Base}}.$$
Thus, losses and expenses per unit of exposure are 23.2% higher for risks with
a high amount of insurance compared to those with a medium amount. These
relativities do not control for territory.
The introduction of rating factors allows the analyst to create cells that define
small collections of risks – the goal is to choose the right combination of rating
factors so that all risks within a cell may be treated the same. In statistical
terminology, we want all risks within a cell to have the same distribution (subject
to rescaling by an exposure variable). This is the foundation of insurance pricing.
All risks within a cell have the same price per exposure yet risks from different
cells may have different prices.
Said another way, insurers are allowed to charge different rates for different
risks; discrimination of risks is legal and routinely done. Nonetheless, the basis
of discrimination, the choice of risk factors, is the subject of extensive debate.
The actuarial community, insurance management, regulators, and consumer ad-
vocates are all active participants in this debate. Technical Supplement TS 7.A
describes these issues from a regulatory perspective.
In addition to statistical criteria for assessing the significance of a rating factor, analysts must pay attention to business concerns of the company (e.g., is it expensive to implement a rating factor?), social criteria (is a variable under the
control of a policyholder?), legal criteria (are there regulations that prohibit
the use of a rating factor such as gender?), and other societal issues. These
questions are largely beyond the scope of this text. Nonetheless, because they
are so fundamental to pricing of insurance, a brief overview is given in Chapter
8, including Technical Supplement TS 8.B.
Ratemaking is closely tied to an insurer's ongoing financial reporting activities; financial reports are commonly created at least annually and oftentimes
quarterly. At any given financial reporting date, information about recent poli-
cies and claims will be ongoing and necessarily incomplete; this section intro-
duces concepts for projecting risk information so that it is useful for ratemak-
ing purposes. Information about the risks, such as exposures, premium, claim
counts, losses, and rating factors, is typically organized into three databases:
• policy database - contains information about the risk being insured, the
policyholder, and the contract provisions
• claims database - contains information about each claim; these are linked
to the policy database.
• payment database - contains information on each claims transaction, typically payments but possibly also changes to case reserves. These are linked to the claims database.
With these detailed databases, it is straightforward (in principle) to sum up
policy level detail to aggregate information needed for financial reports. This
section describes various summary measures commonly used.
[Diagram: timeline of one-year policies A through D, written 1 Jan, 1 April, 1 July, and 1 Oct 2019, against calendar markers at 1/1/2019 and 1/1/2020.]
Policy  Effective     Written Exposure     Earned Exposure      Unearned Exposure    In-Force Exposure
        Date          1/1/2019  1/1/2020   1/1/2019  1/1/2020   1/1/2019  1/1/2020   1/1/2020
A       1 Jan 2019    1.00      0.00       1.00      0.00       0.00      0.00       0.00
B       1 April 2019  1.00      0.00       0.75      0.25       0.25      0.00       1.00
C       1 July 2019   1.00      0.00       0.50      0.50       0.50      0.00       1.00
D       1 Oct 2019    1.00      0.00       0.25      0.75       0.75      0.00       1.00
Total                 4.00      0.00       2.50      1.50       1.50      0.00       3.00
Solution.
Only earned premium differs from written premium and inforce premium and
therefore needs to be computed. Thus, earned premium at Dec 31, 2002, equals
$900 × 10/12 = $750. Answer E.
• Accident date - the date of the occurrence which gave rise to the claim.
This is also known as the date of loss or the occurrence date.
• Report date - the date the insurer receives notice of the claim. Claims
not currently known by the insurer are referred to as unreported claims
or incurred but not reported (IBNR) claims.
Until the claim is settled, the reported claim is considered an open claim. Once
the claim is settled, it is categorized as a closed claim. In some instances, further
activity may occur after the claim is closed, and the claim may be re-opened.
Recall that a claim is the amount paid or payable to claimants under the terms
of insurance policies. Further, we have
• Paid losses are those losses for a particular period that have actually been
paid to claimants.
• Where there is an expectation that payment will be made in the future,
a claim will have an associated case reserve representing the estimated
amount of that payment.
• Reported losses, also known as case incurred, equal Paid Losses + Case Reserves.
The ultimate loss is the amount of money required to close and settle all claims
for a defined group of policies.
$$LR_{experience} = \frac{\text{experience losses}}{\text{experience period earned exposure} \times \text{current rate}}.$$
Here, we think of the experience period earned exposure × current rate as the
experience premium.
Using equation (7.2), we can write the indicated change factor as
$$ICF = \frac{LR_{experience}}{LR_{target}}. \qquad (7.4)$$
Comparing equation (7.3) to (7.4), we see that the latter offers more flexibility
to explicitly incorporate trended experience. As the loss ratio method is based
on rate changes, this flexibility is certainly warranted.
Comparison of Methods
Assuming that exposures, premiums, and claims have been trended to be repre-
sentative of a period that rates are being developed for, we are now in a position
to compare the pure premium and loss ratio methods for ratemaking. We start
with the observation that for the same data inputs, these two approaches pro-
duce the same results. That is, they are algebraically equivalent. However, they
rely on different inputs:
Comparing the pure premium and loss ratio methods, we note that:
• The pure premium method requires well-defined, responsive exposures.
• The loss ratio method cannot be used for new business because it produces
indicated rate changes.
$$LR_{target} = \frac{1 - V - Q}{1 + G} = \frac{1 - \text{premium related expense factor} - \text{profit and contingencies factor}}{1 + \text{ratio of non-premium related expenses to losses}} = \frac{1 - 0.23 - 0.05}{1 + 0.07} = 0.673.$$
Here, the ratio of non-premium related expenses to losses is $G = \frac{21000}{300000} = 0.07$. Thus, the (new) indicated rate level change is
$$\frac{LR_{experience}}{LR_{target}} - 1 = \frac{0.60}{0.673} - 1 = -10.8\%.$$
(b) Using the pure premium method with equation (7.2),
$$Premium_{experience} = \frac{\text{Losses} + \text{Fixed}}{1 - Q - V} = \frac{300000 + 21000}{1 - 0.23 - 0.05} = 445833.33.$$
Thus, the indicated rate level change is $\frac{445833.33}{500000} - 1 = -10.8\%$.
(c) Comparing the methods:
• The loss ratio method is preferable when the exposure unit is not available.
• The loss ratio method is preferable when the exposure unit is not reasonably consistent between risks.
• The pure premium method is preferable for a new line of business.
• The pure premium method is preferable where on-level premiums are difficult to calculate.
For a portfolio of insurance contracts, insurers collect premiums and pay out
losses. After making adjustments for expenses and profit considerations, tools
for comparing distributions of premiums and losses can be helpful when selecting
a premium calculation principle.
line of equality; if each policyholder has the same loss, then the loss distribution
would be at this line. The Gini index, twice the area between the Lorenz curve
and the 45 degree line, is 37.6 percent for this data set.
[Figure: density of losses (left) and Lorenz curve with the marked point (0.60, 0.30) (right).]
Performance Curve
It is convenient to first sort the set of policies based on premiums (from smallest
to largest) and then compute the premium and loss distributions. The premium
distribution is
$$\hat{F}_P(s) = \frac{\sum_{i=1}^n P_i\, \mathrm{I}(P_i \le s)}{\sum_{i=1}^n P_i}, \qquad (7.5)$$
and the loss distribution is
$$\hat{F}_L(s) = \frac{\sum_{i=1}^n y_i\, \mathrm{I}(P_i \le s)}{\sum_{i=1}^n y_i}, \qquad (7.6)$$
where I(⋅) is the indicator function, returning a 1 if the event is true and zero
otherwise. For a given value 𝑠, 𝐹𝑃̂ (𝑠) gives the proportion of premiums less than
or equal to 𝑠, and 𝐹𝐿̂ (𝑠) gives the proportion of losses for those policyholders
with premiums less than or equal to 𝑠. The graph (𝐹𝑃̂ (𝑠), 𝐹𝐿̂ (𝑠)) is known as
a performance curve.
Example – Loss Distribution. Suppose we have 𝑛 = 5 policyholders with
experience as follows. The data have been ordered by premiums.
Variable i             1   2   3    4    5
Premium P(x_i)         2   4   5    7    16
Cumulative Premiums    2   6   11   18   34
Loss y_i               2   5   6    6    17
Cumulative Loss        2   7   13   19   36
Figure 7.4 compares the Lorenz to the performance curve. The left-hand panel
shows the Lorenz curve. The horizontal axis is the cumulative proportion of
policyholders (0, 0.2, 0.4, 0.6, 0.8, 1.0) and the vertical axis is the cumulative
proportion of losses (0, 2/36, 7/36, 13/36, 19/36, 36/36). For the Lorenz curve,
you first order by the loss size (which turns out to be the same order as premi-
ums for this simple dataset). This figure shows a large separation between the
distributions of losses and policyholders.
The right-hand panel shows the performance curve. Because observations are
sorted by premiums, the first point after the origin (reading from left to right)
is (2/34, 2/36). The second point is (6/34, 7/36), with the pattern continu-
ing. From the figure, we see that there is little separation between losses and
premiums.
The performance curve can be helpful to the analyst who thinks about forming
profitable portfolios for the insurer. For example, suppose that 𝑠 is chosen to
represent the 95th percentile of the premium distribution. Then, the horizontal
axis, 𝐹𝑃̂ (𝑠), represents the fraction of premiums for this portfolio and the vertical
axis, 𝐹𝐿̂ (𝑠), the fraction of losses for this portfolio. When developing premium
principles, analysts wish to avoid unprofitable situations and make profits, or
at least break even.
The expectation of the denominator in equation (7.6) is $\sum_{i=1}^n \mathrm{E}[y_i] = \sum_{i=1}^n \mu_i$. Thus, if the premium principle is chosen such that $P_i = \mu_i$, then we anticipate the premium and loss distributions to be close to one another.
[Figure 7.4: Lorenz curve (left panel) and performance curve (right panel).]
Gini Statistic
The classic Lorenz curve shows the proportion of policyholders on the horizontal
axis and the loss distribution function on the vertical axis. The performance
curve extends the classical Lorenz curve in two ways, (1) through the ordering
of risks and prices by prices and (2) by allowing prices to vary by observation.
We summarize the performance curve in the same way as the classic Lorenz curve, using a Gini statistic defined as twice the area between the curve and a 45 degree line. The analyst seeks performance curves that come close to the 45 degree line; these have the least separation between the loss and premium distributions and therefore small Gini statistics.
Specifically, the Gini statistic can be calculated as follows. Suppose that the
empirical performance curve is given by {(𝑎0 = 0, 𝑏0 = 0), (𝑎1 , 𝑏1 ), … , (𝑎𝑛 =
1, 𝑏𝑛 = 1)} for a sample of 𝑛 observations. Here, we use 𝑎𝑗 = 𝐹𝑃̂ (𝑃𝑗 ) and
𝑏𝑗 = 𝐹𝐿̂ (𝑃𝑗 ). Then, the empirical Gini statistic is
$$\widehat{Gini} = 2\sum_{j=0}^{n-1} (a_{j+1} - a_j)\left\{\frac{a_{j+1}+a_j}{2} - \frac{b_{j+1}+b_j}{2}\right\}$$
$$= 1 - \sum_{j=0}^{n-1} (a_{j+1} - a_j)(b_{j+1} + b_j). \qquad (7.7)$$
To understand the formula for the Gini statistic, here is a sketch of a parallelogram connecting points $(a_1, b_1)$, $(a_2, b_2)$, and a 45 degree line. You can use basic geometry to check that the area of the figure is
$$Area = (a_2 - a_1)\left\{\frac{a_2+a_1}{2} - \frac{b_2+b_1}{2}\right\}.$$
The definition of the Gini statistic in equation (7.7) is simply twice the sum of the parallelograms. The second equality in equation (7.7) is the result of some straightforward algebra.
[Sketch: parallelogram bounded by the points (a1, b1), (a2, b2) and the 45 degree line through (a1, a1) and (a2, a2).]
Example – Loss Distribution: Continued. The Gini statistic for the Lorenz
curve (left-hand panel of Figure 7.4) is 34.4 percent. In contrast, the Gini
statistic for performance curve (right-hand panel) is 1.7 percent.
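A quick numerical check of equation (7.7) on the five-point example, ordering by premium:

P <- c(2, 4, 5, 7, 16); y <- c(2, 5, 6, 6, 17)  # the example data, ordered by premium
a <- c(0, cumsum(P) / sum(P))                   # premium distribution points
b <- c(0, cumsum(y) / sum(y))                   # loss distribution points
1 - sum(diff(a) * (head(b, -1) + tail(b, -1)))  # Gini statistic, small in magnitude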
Recall for the gamma distribution that the mean equals the shape times the scale or, for our example, 5 times the scale parameter. So, you can check that the maximum likelihood estimates are simply the average experience.
For our base premium, we assume a common distribution among all states. For
these simulated data, the average in-sample loss is 𝑃1 =221.36.
As an alternative, we use averages that are state-specific; these averages form
our premiums 𝑃2 . Because this illustration uses means that vary by states, we
anticipate this alternative rating procedure to be preferred to the community
rating procedure.
Out of sample claims were generated from the same gamma distribution as the
in-sample model, with 100 observations for each state. The following R code
shows how to calculate the performance curves.
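A minimal sketch of the performance-curve construction, assuming vectors P (premiums) and y (out-of-sample losses):

ord <- order(P)                          # sort policies by premium
a <- c(0, cumsum(P[ord]) / sum(P))       # premium distribution, equation (7.5)
b <- c(0, cumsum(y[ord]) / sum(y))       # loss distribution, equation (7.6)
plot(a, b, type = "l", xlab = "Premium Distn", ylab = "Loss Distn")
abline(0, 1, lty = 2)                    # 45 degree line for reference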
For these data, the Gini statistics are 19.6 percent for the flat rate premium
and -0.702 percent for the state-specific alternative. This indicates that the
state-specific alternative procedure is strongly preferred to the base community
rating procedure.
Discussion
In insurance claims modeling, standard out-of-sample validation measures are
not the most informative due to the high proportions of zeros (corresponding
to no claim) and the skewed fat-tailed distribution of the positive values. In
contrast, the Gini statistic works well with many zeros (see the demonstration
in (Frees et al., 2014)).
[Figure: performance curves for the community (flat) rating procedure and the state-specific rating procedure.]
The value of the performance curves and Gini statistics has been recently advanced in the paper of Denuit et al. (2019). Properties of an extended version, dealing with relativities for new premiums, were developed by Frees et al. (2011)
and Frees et al. (2014). In these articles you can find formulas for the standard
errors and additional background information.
Market Conduct
To help protect consumers, regulators impose administrative rules on the behav-
ior of market participants. These rules, known as market conduct regulation,
provide systems of regulatory controls that require insurers to demonstrate that
they are providing fair and reliable services, including rating, in accordance with
the statutes and regulations of a jurisdiction.
1. Product regulation serves to protect consumers by ensuring that insurance
policy provisions are reasonable and fair, and do not contain major gaps
in coverage that might be misunderstood by consumers and leave them
unprotected.
2. The insurance product is the insurance contract (policy) and the coverage
it provides. Insurance contracts are regulated for these reasons:
a. Insurance policies are complex legal documents that are often difficult
to interpret and understand.
b. Insurers write insurance policies and sell them to the public on a
“take it or leave it” basis.
Market conduct includes rules for intermediaries such as agents (who sell in-
surance to individuals) and brokers (who sell insurance to businesses). Market
conduct also includes competition policy regulation, designed to ensure an effi-
cient and competitive marketplace that offers low prices to consumers.
Rate Regulation
Rate regulation helps guide the development of premiums and so is the focus
of this chapter. As with other aspects of market conduct regulation, the in-
tent of these regulations is to ensure that insurers not take unfair advantage of
consumers. Rate (and policy form) regulation is common worldwide.
The amount of regulatory scrutiny varies by insurance product. Rate regulation
is uncommon in life insurance. Further, in non-life insurance, most commercial
lines and reinsurance are free from regulation. Rate regulation is common in
Chapter 8

Risk Classification
Chapter Preview. This chapter motivates the use of risk classification in in-
surance pricing and introduces readers to Poisson regression as a prominent
example of risk classification. In Section 8.1 we explain why insurers need to
incorporate various risk characteristics, or rating factors, of individual policy-
holders in pricing insurance contracts. In Section 8.2, we introduce Poisson
regression as a pricing tool to achieve such premium differentials. The con-
cept of exposure is also introduced in this section. As most rating factors are
categorical, we show in Section 8.3 how the multiplicative tariff model can be
incorporated into a Poisson regression model in practice, along with numerical
examples for illustration.
8.1 Introduction
for a single period, say annually, the gross insurance premium based on the
equivalence principle is stated as
As this dataset contains random counts, we try to fit a Poisson distribution for
each level.
As introduced in Section 2.2.3, the probability mass function of the Poisson with
mean 𝜇 is given by
$$\Pr(Y = y) = \frac{\mu^y e^{-\mu}}{y!}, \quad y = 0, 1, 2, \ldots \tag{8.1}$$
$$\hat{\mu} = \left(\frac{n_1}{n_1+n_2}\right)\hat{\mu}^{(1)} + \left(\frac{n_2}{n_1+n_2}\right)\hat{\mu}^{(2)} = 0.0792, \tag{8.2}$$
$$\mu = \beta_0 + \beta_1 x_1 \tag{8.3}$$
$$\log \mu = \beta_0 + \beta_1 x_1, \tag{8.4}$$
$$x_1 = \begin{cases} 1 & \text{if smoker,} \\ 0 & \text{otherwise.} \end{cases} \tag{8.5}$$
We generally prefer the log linear relation in (8.4) to the linear one in (8.3) to prevent producing negative $\mu$ values, which can happen when there are many different risk factors and levels. The setup in (8.4) and (8.5) then results in different Poisson frequency parameters depending on the level in the risk factor:
$$\log \mu = \begin{cases} \beta_0 + \beta_1 & \text{if smoker,} \\ \beta_0 & \text{otherwise.} \end{cases} \tag{8.6}$$
$$\log \mu = \beta_1 x_1 + \beta_2 x_2, \tag{8.7}$$
where
$$x_2 = \begin{cases} 1 & \text{if non-smoker,} \\ 0 & \text{otherwise.} \end{cases}$$
The numerical result of (8.6) is the same as that of (8.8), as all coefficients are given as numbers in actual estimation. The former setup is more common in most texts, and we also stick to it here.
With this Poisson regression model we can readily understand how the coefficients $\beta_0$ and $\beta_1$ are linked to the expected loss frequency in each level. According to (8.6), the Poisson mean of the smokers, $\mu^{(1)}$, is given by
$$\mu^{(1)} = e^{\beta_0 + \beta_1} = e^{\beta_1}\,\mu^{(2)},$$
where $\mu^{(2)} = e^{\beta_0}$ is the Poisson mean for the non-smokers. This relation between the
smokers and non-smokers suggests a useful way to compare the risks embedded
in different levels of a given risk factor. That is, the proportional increase in
the expected loss frequency of the smokers compared to that of the non-smokers
is simply given by a multiplicative factor 𝑒𝛽1 . Put another way, if we set the
expected loss frequency of the non-smokers as the base value, the expected loss
frequency of the smokers is obtained by applying 𝑒𝛽1 to the base value.
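As a quick check of this interpretation, here is a hedged sketch using simulated data (not the book's dataset): fitting a Poisson regression on a smoker indicator and exponentiating the slope recovers the multiplicative factor $e^{\beta_1}$.

```r
# Sketch with simulated data: exp(beta_1) is the smoker/non-smoker relativity.
set.seed(1)
smoker <- rbinom(5000, 1, 0.3)                    # x1 = 1 for smokers
y <- rpois(5000, lambda = exp(-2.5 + 0.7 * smoker))
fit <- glm(y ~ smoker, family = poisson(link = "log"))
exp(coef(fit)["smoker"])                          # close to e^0.7 = 2.01
```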
Dealing with multi-level case
We can readily extend the two-level case to a multi-level one where 𝑙 different
levels are involved for a single rating factor. For this we generally need 𝑙 − 1
indicator variables to formulate
respectively, we have 𝑘 = (2 − 1) × (3 − 1) × (4 − 1) = 6.
The condition inside the expectation in equation (8.10) indicates that the loss
frequency 𝜇𝑖 is the model expected response to the given set of risk factors or
explanatory variables. In principle the conditional mean E (𝑦𝑖 |x𝑖 ) in (8.10) can
take different forms depending on how we specify the relationship between x
and 𝑦. The standard choice for Poisson regression is to adopt the exponential
function, as we mentioned previously, so that
$$\mu_i = \mathrm{E}(y_i|\mathbf{x}_i) = e^{\mathbf{x}_i'\boldsymbol{\beta}}, \quad y_i \sim Pois(\mu_i), \quad i = 1, \ldots, n. \tag{8.11}$$
to reveal the relationship when the right side is set as the linear form $\mathbf{x}_i'\boldsymbol{\beta}$. Again, we see that the mapping works well as both sides of (8.12), $\log \mu_i$ and $\mathbf{x}_i'\boldsymbol{\beta}$, can now cover all real values. This is the formulation of Poisson regression, assuming that all policyholders have the same unit period of exposure.
When the exposures differ among the policyholders, however, as is the case in
most practical cases, we need to revise this formulation by adding an exposure
component as an additional term in (8.12).
is 0.2, it does not mean much without the specification of the exposure such
as, in this case, per month or per year. In fact, all premiums and losses need
the exposure precisely specified and must be quoted accordingly; otherwise all
subsequent statistical analyses and predictions will be distorted.
In the previous section we assumed the same unit of exposure across all policy-
holders, but this is hardly realistic in practice. In health insurance, for example,
two different policyholders with different lengths of insurance coverage (e.g., 3
months and 12 months, respectively) could have recorded the same number of
claim counts. As the expected number of claim counts would be proportional
to the length of coverage, we should not treat these two policyholders’ loss ex-
periences identically in the modeling process. This motivates the need for the concept of exposure in Poisson regression.
The Poisson distribution in (8.1) is parametrized via its mean. To understand
the exposure, we alternatively parametrize the Poisson pmf in terms of the rate
parameter 𝜆, based on the definition of the Poisson process:
$$\Pr(Y = y) = \frac{(\lambda t)^y e^{-\lambda t}}{y!}, \quad y = 0, 1, 2, \ldots \tag{8.13}$$
with E (𝑌 ) = Var (𝑌 ) = 𝜆𝑡. Here 𝜆 is known as the rate or intensity per unit
period of the Poisson process and 𝑡 represents the length of time or exposure, a
known constant value. For given 𝜆 the Poisson distribution (8.13) produces a
larger expected loss count as the exposure 𝑡 gets larger. Clearly, (8.13) reduces
to (8.1) when 𝑡 = 1, which means that the mean and the rate become the same
for an exposure of 1, the case we considered in the previous subsection.
In principle, the exposure does not need to be measured in units of time and may represent different things depending on the problem at hand. For example:
1. In health insurance, the rate may be the occurrence of a specific disease
per 1,000 people and the exposure is the number of people considered in
the unit of 1,000.
2. In auto insurance, the rate may be the number of accidents per year of a
driver and the exposure is the length of the observed period for the driver
in the unit of year.
3. For workers compensation that covers lost wages resulting from an em-
ployee’s work-related injury or illness, the rate may be the probability of
injury in the course of employment per dollar and the exposure is the
payroll amount in dollars.
4. In marketing, the rate may be the number of customers who enter a store
per hour and the exposure is the number of hours observed.
5. In civil engineering, the rate may be the number of major cracks on the
paved road per 10 kms and the exposure is the length of road considered in units of 10 kms.
6. In credit risk modelling, the rate may be the number of default events per
1000 firms and the exposure is the number of firms under consideration in
the unit of 1,000.
Actuaries may be able to use different exposure bases for a given insurable loss.
For example, in auto insurance, both the number of kilometers driven and the
number of months covered by insurance can be used as exposure bases. Here the
former is more accurate and useful in modelling the losses from car accidents, but
more difficult to measure and manage for insurers. Thus, a good exposure base
may not be the theoretically best one due to various practical constraints. As a
rule, an exposure base must be easy to determine, accurately measurable, legally
and socially acceptable, and free from potential manipulation by policyholders.
Incorporating exposure in Poisson regression
As exposures affect the Poisson mean, constructing Poisson regressions requires
us to carefully separate the rate and exposure in the modelling process. Focusing
on the insurance context, let us denote the rate of the loss event of the 𝑖th
policyholder by 𝜆𝑖 , the known exposure (the length of coverage) by 𝑚𝑖 and the
expected loss count under the given exposure by 𝜇𝑖 . Then the Poisson regression
formulation in (8.11) and (8.12) should be revised in light of (8.13) as
$$\mu_i = \mathrm{E}(y_i|\mathbf{x}_i) = m_i \lambda_i = m_i e^{\mathbf{x}_i'\boldsymbol{\beta}}, \quad y_i \sim Pois(\mu_i), \quad i = 1, \ldots, n, \tag{8.14}$$
which gives
$$\log \mu_i = \mathbf{x}_i'\boldsymbol{\beta} + \log m_i. \tag{8.15}$$
Adding log 𝑚𝑖 in (8.15) does not pose a problem in fitting as we can always
specify this as an extra explanatory variable, as it is a known constant, and fix
its coefficient to 1. In the literature the log of exposure, log 𝑚𝑖 , is commonly
called the offset.
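In R's `glm` this is done by supplying `offset(log(m))` in the model formula; a minimal sketch with a hypothetical exposure vector `m`:

```r
# Sketch: Poisson regression with log exposure as an offset; the offset's
# coefficient is fixed at 1, so the model estimates the rate per unit exposure.
set.seed(2)
m  <- runif(1000, 0.25, 1)                        # exposures, e.g. policy years
x1 <- rbinom(1000, 1, 0.5)
y  <- rpois(1000, lambda = m * exp(-2 + 0.5 * x1))
fit <- glm(y ~ x1 + offset(log(m)), family = poisson)
coef(fit)                                         # recovers about (-2, 0.5)
```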
8.2.4 Exercises
1. Regarding Table 8.1 answer the following.
(a) Verify the mean values in the table.
(c) Produce the fitted Poisson counts for each smoking status in the
table.
• Age band of the driver: Young (age < 25), middle (25 ≤ age < 60) and
old age (age ≥ 60). We use index 𝑘 = 1, 2 and 3, respectively, for this
rating factor.
From this classification rule, we may create an organized table or list, such
as the one shown in Table 8.2, collected from all policyholders. Clearly there
are 2 × 3 = 6 different risk classes in total. Each row of the table shows a
combination of different risk characteristics of individual policyholders. Our
goal is to compute six different premiums for each of these combinations. Once
the premium for each row has been determined using the given exposure and
claim counts, the insurer can replace the last two columns in Table 8.2 with
a single column containing the computed premiums. This new table then can
serve as a manual to determine the premium for a new policyholder given rating
factors during the underwriting process. In non-life insurance, a table (or a set of
tables) or list that contains each set of rating factors and the associated premium
is referred to as a tariff. Each unique combination of the rating factors in a tariff
is called a tariff cell; thus, in Table 8.2 the number of tariff cells is six, the same as the number of risk classes.
Let us now look at the loss information in Table 8.2 more closely. The exposure
in each row represents the sum of the length of insurance coverages, or in-force
times, in years, of all the policyholders in that tariff cell. Similarly, the claim count in each row is the number of claims in that cell. Naturally the exposures
and claim counts vary due to the different number of drivers across the cells, as
well as different in-force time periods among the drivers within each cell.
In light of the Poisson regression framework, we denote the exposure and claim
count of cell (𝑗, 𝑘) as 𝑚𝑗𝑘 and 𝑦𝑗𝑘 , respectively, and define the claim count per
unit exposure as
$$z_{jk} = \frac{y_{jk}}{m_{jk}}, \quad j = 1, 2; \; k = 1, 2, 3.$$
For example, $z_{12} = 8/208.5 = 0.03837$, meaning that a policyholder in tariff cell (1,2) would, on average, have 0.03837 accidents if insured for a full year. The set of $z_{jk}$ values then corresponds to the rate parameter in the Poisson distribution (8.13) as they are the event occurrence rates per unit exposure. That is, we have $z_{jk} = \hat{\lambda}_{jk}$, where $\lambda_{jk}$ is the Poisson rate parameter. Producing $z_{jk}$ values, however, does not do much beyond comparing the average loss frequencies across risk classes. To fully exploit the dataset, for the remainder of the chapter we will construct a pricing model from Table 8.2 using Poisson regression.
We comment that actual loss records used by insurers typically include many
more risk factors, in which case the number of cells grows exponentially. The
tariff would then consist of a set of tables, instead of one, separated by some of
the basic rating factors, such as sex or territory.
$$\lambda_{jk} = f_0 \times f_{1j} \times f_{2k}. \tag{8.16}$$
Here $\{f_{1j}, j = 1, 2\}$ are the parameters associated with the two levels in the first
rating factor, car type, and {𝑓2𝑘 , 𝑘 = 1, 2, 3} associated with the three levels
in the age band, the second rating factor. For instance, the Poisson rate for a
mid-aged policyholder with a Type B vehicle is given by 𝜆22 = 𝑓0 × 𝑓12 × 𝑓22 .
The first term 𝑓0 is some base value to be discussed shortly. Thus these six
parameters are understood as numerical representations of the levels within
each rating factor, and are to be estimated from the dataset.
The multiplicative form (8.16) is easy to understand and use, because it clearly
shows how the expected loss count (per unit exposure) changes as each rating
factor varies. For example, if 𝑓11 = 1 and 𝑓12 = 1.2, then the expected loss count
of a policyholder with a vehicle of type B would be 20% larger than type A, when
the other factors are the same. In non-life insurance, the parameters 𝑓1𝑗 and
𝑓2𝑘 are known as relativities as they determine how much expected loss should
change relative to the base value $f_0$. The idea of relativity is quite convenient in practice, as we can decide the premium for a policyholder by simply applying a series of corresponding relativities to the base value.
Dropping an existing rating factor or adding a new one is also transparent with
this multiplicative structure. In addition, the insurer may adjust the overall
premium for all policyholders by controlling the base value 𝑓0 without chang-
ing individual relativities. However, by adopting the multiplicative form, we
implicitly assume that there is no serious interaction among the risk factors.
When the multiplicative form is used we need to address an identification issue.
That is, for any 𝑐 > 0, we can write
$$\lambda_{jk} = f_0 \times \frac{f_{1j}}{c} \times c\,f_{2k}.$$
By comparing with (8.16), we see that the identical rate parameter $\lambda_{jk}$ can be obtained for very different sets of individual relativities. This over-parametrization can be resolved by fixing the relativities of the base levels, as discussed below. (Preferring the multiplicative form to others (e.g., an additive one) was already hinted at in (8.4).)
This clearly shows that the Poisson rate parameter 𝜆 varies across different tariff
cells, with the same log linear form used in a Poisson regression framework. In
fact the reader may see that (8.18) is an extended version of the early expression
(8.6) with multiple risk factors and that the log relativities now play the role of
𝛽𝑖 parameters. Therefore all the relativities can be readily estimated via fitting
a Poisson regression with a suitably chosen set of indicator variables.
For the second rating factor, we employ two indicator variables for the age band, that is,
$$x_2 = \begin{cases} 1 & \text{if age band 2,} \\ 0 & \text{otherwise,} \end{cases}$$
and
$$x_3 = \begin{cases} 1 & \text{if age band 3,} \\ 0 & \text{otherwise.} \end{cases}$$
The triple (𝑥1 , 𝑥2 , 𝑥3 ) then can effectively and uniquely determine each risk
class. By observing that the indicator variables associated with Type A and
Age band 1 are omitted, we see that tariff cell (𝑗, 𝑘) = (1, 1) plays the role of
the base cell. We emphasize that our choice of the three indicator variables
above has been carefully made so that it is consistent with the choice of the
base levels in the multiplicative tariff model in the previous subsection (i.e.,
𝑓11 = 1 and 𝑓21 = 1).
With the proposed indicator variables we can rewrite the log rate (8.17) as
$$\log \lambda = \log f_0 + x_1 \log f_{12} + x_2 \log f_{22} + x_3 \log f_{23}, \tag{8.19}$$
which is identical to (8.18) when each triple value is actually applied. For
example, we can verify that the base tariff cell (𝑗, 𝑘) = (1, 1) corresponds to
(𝑥1 , 𝑥2 , 𝑥3 ) = (0, 0, 0), and in turn produces log 𝜆 = log 𝑓0 or 𝜆 = 𝑓0 in (8.19) as
required.
Poisson regression for the tariff model
Under this specification, let us consider 𝑛 policyholders in the portfolio with the
𝑖th policyholder’s risk characteristic given by a vector of explanatory variables
$\mathbf{x}_i = (1, x_{i1}, x_{i2}, x_{i3})'$, for $i = 1, \ldots, n$. We then recognize (8.19) as
$$\log \lambda_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} = \mathbf{x}_i'\boldsymbol{\beta}, \quad \text{where } \beta_0 = \log f_0,\ \beta_1 = \log f_{12},\ \beta_2 = \log f_{22},\ \beta_3 = \log f_{23}, \tag{8.20}$$
with 𝑓11 = 1 and 𝑓21 = 1 from the original construction. For the actual dataset,
𝛽𝑖 , 𝑖 = 0, 1, 2, 3, is replaced with the mle 𝑏𝑖 using the method in the technical
supplement at the end of this chapter (Section 8.A).
from the relation given in (8.20). The R script and the output are as follows.
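The original script and output are not reproduced in this extract. Below is a hedged reconstruction in the same spirit: the cell-level counts and exposures are invented for illustration (they are not Table 8.2), but the fitting step, a Poisson regression on the two factors with a log-exposure offset, and the exponentiation implied by (8.20) are as described above.

```r
# Hedged reconstruction; the numbers below are invented, not Table 8.2.
# Grouped data: one row per tariff cell with total exposure and claim count.
tariff <- data.frame(
  cartype = factor(c("A", "A", "A", "B", "B", "B")),
  ageband = factor(c(1, 2, 3, 1, 2, 3)),
  m = c(120.5, 208.5, 155.0, 90.0, 140.3, 110.8),  # exposures in years
  y = c(6, 8, 4, 7, 9, 5)                          # claim counts
)
fit <- glm(y ~ cartype + ageband + offset(log(m)),
           family = poisson, data = tariff)
summary(fit)
exp(coef(fit))  # base value f0 (intercept) and the estimated relativities
```

Because R drops the first level of each factor, the base cell is (car type A, age band 1), matching the normalization $f_{11} = f_{21} = 1$.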
other types of vehicles are recorded to be aged 21 or less with sex unspecified,
except for one policy, indicating that no driver information has been collected
for non-automobile vehicles. Second, type A vehicles are all classified as private
vehicles and all the other types are not.
When we include these risk factors, we assume all unspecified sex to be male.
As the age information is only applicable to type A vehicles, we set the model
accordingly. That is, we apply the age variable only to vehicles of type A.
Also we used five vehicle age bands, simplifying the original seven bands, by
combining vehicle ages 0,1 and 2; the combined band is marked as level 23 in
the data file. Thus our Poisson model has the following explicit form:
$$\log \mu_i = \mathbf{x}_i'\boldsymbol{\beta} + \log m_i = \beta_0 + \beta_1 I(Sex_i = M) + \sum_{t=2}^{6}\beta_t I(Vage_i = t) + \sum_{t=7}^{13}\beta_t I(Vtype_i = A)\times I(Age_i = t-7) + \log m_i.$$
The fitting result is given in Table 8.3, for which we have several comments.
• The claim frequency is higher for males by 17.3%, when other rating
factors are held fixed. However, this may have been affected by the fact
that all unspecified sex has been assigned to male.
• Regarding the vehicle age, the claim frequency gradually decreases as the
vehicle age increases, when other rating factors are held fixed. The level
starts from 2 for this variable but, again, the numbering is nominal and
does not affect the numerical result.
relativity (that is, the common premium reduction) applied to all policies with
vehicle type A and the latter is the base value for age band 1. Then the relativity
of age band 2 can be seen as 0.917 = 0.918 × 0.999, where 0.999 is understood as
the relativity for age band 2. The remaining age bands can be treated similarly.
Table 8.3. Singapore Insurance Claims Data
Let us try several examples based on Table 8.3. Suppose a male policyholder aged 40 owns a 7-year-old vehicle of type A. The expected claim frequency for this policyholder is then given by
Note that for this policy the age band variable is not used as the vehicle type
is not A. The R script is given as follows.
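The script itself is not reproduced in this extract; here is a sketch of the prediction step, reusing the toy tariff model object `fit` from the earlier sketch (covariate names and level codings are illustrative assumptions, not the book's data file).

```r
# Sketch: expected claim frequency for a new policyholder from the toy model
# above. For a non-type-A vehicle the book's age-band terms would not apply;
# in this simplified toy model we simply choose illustrative factor levels.
newpolicy <- data.frame(cartype = "B", ageband = "1", m = 1)  # one year exposure
predict(fit, newdata = newpolicy, type = "response")
```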
Contributor
• Joseph H. T. Kim, Yonsei University, is the principal author of the
initial version of this chapter. Email: [email protected] for chapter
comments and suggested improvements.
• Chapter reviewers include: Chun Yong Chew, Lina Xu, Jeffrey Zheng.
$$\log L(\boldsymbol{\beta}) = l(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left(-\mu_i + y_i \log \mu_i - \log y_i!\right) = \sum_{i=1}^{n}\left(-m_i \exp(\mathbf{x}_i'\boldsymbol{\beta}) + y_i(\log m_i + \mathbf{x}_i'\boldsymbol{\beta}) - \log y_i!\right) \tag{8.21}$$

$$\left.\frac{\partial}{\partial \boldsymbol{\beta}} l(\boldsymbol{\beta})\right|_{\boldsymbol{\beta}=\mathbf{b}} = \sum_{i=1}^{n}\left(y_i - m_i \exp(\mathbf{x}_i'\mathbf{b})\right)\mathbf{x}_i = \mathbf{0}. \tag{8.22}$$
$$\sum_{i=1}^{n}\left(y_i - \hat{\mu}_i\right)\mathbf{x}_i = \mathbf{0}.$$
Since the solution b satisfies this equation, it follows that the first among the
array of 𝑘 + 1 equations, corresponding to the first constant element of x𝑖 , yields
$$\sum_{i=1}^{n}\left(y_i - \hat{\mu}_i\right)\times 1 = 0, \quad \text{so that} \quad \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y} = \frac{1}{n}\sum_{i=1}^{n}\hat{\mu}_i.$$
This is an interesting property: the average of the individual losses, $\bar{y}$, is the same as the average of the fitted values. That is, the sample mean is preserved under the fitted Poisson regression model.
Maximum Likelihood Estimation for Grouped Data
Sometimes the data are not available at the individual policy level. For example,
Table 8.2 provides collective loss information for each risk class after grouping
individual policies. When this is the case, 𝑦𝑖 and 𝑚𝑖 , the quantities needed for
the mle calculation in (8.22), are unavailable for each 𝑖. However this does not
pose a problem as long as we have the total loss counts and total exposure for
each risk class.
To elaborate, let us assume that there are $K$ different risk classes, and further that, in the $k$th risk class, we have $n_k$ policies with the total exposure $m_{(k)}$ and the average loss count $\bar{y}_{(k)}$, for $k = 1, \ldots, K$; the total loss count for the $k$th risk class is then $n_k \bar{y}_{(k)}$. We denote the set of indices of the policies belonging to the $k$th class by $C_k$. As all policies in a given risk class share the same risk characteristics, we may denote $\mathbf{x}_i = \mathbf{x}_{(k)}$ for all $i \in C_k$. With this notation, we can rewrite (8.22) as
$$\begin{aligned}
\sum_{i=1}^{n}(y_i - m_i\exp(\mathbf{x}_i'\mathbf{b}))\mathbf{x}_i &= \sum_{k=1}^{K}\left\{\sum_{i\in C_k}(y_i - m_i\exp(\mathbf{x}_i'\mathbf{b}))\mathbf{x}_i\right\} \\
&= \sum_{k=1}^{K}\left\{\sum_{i\in C_k}(y_i - m_i\exp(\mathbf{x}_{(k)}'\mathbf{b}))\mathbf{x}_{(k)}\right\} \\
&= \sum_{k=1}^{K}\left\{\left(\sum_{i\in C_k} y_i - \sum_{i\in C_k} m_i\exp(\mathbf{x}_{(k)}'\mathbf{b})\right)\mathbf{x}_{(k)}\right\} \\
&= \sum_{k=1}^{K}\left(n_k\bar{y}_{(k)} - m_{(k)}\exp(\mathbf{x}_{(k)}'\mathbf{b})\right)\mathbf{x}_{(k)} = \mathbf{0}.
\end{aligned} \tag{8.23}$$
Since $n_k\bar{y}_{(k)}$ in (8.23) represents the total loss count for the $k$th risk class and $m_{(k)}$ is its total exposure, we see that for Poisson regression the mle $\mathbf{b}$ is the same whether we use the individual data or the grouped data.
Information matrix

Section 17.1 defines information matrices. Taking second derivatives of (8.21) gives the information matrix of the mle estimators,
$$\mathbf{I}(\boldsymbol{\beta}) = -\mathrm{E}\left(\frac{\partial^2}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}'} l(\boldsymbol{\beta})\right) = \sum_{i=1}^{n} m_i \exp(\mathbf{x}_i'\boldsymbol{\beta})\,\mathbf{x}_i\mathbf{x}_i' = \sum_{i=1}^{n}\mu_i\,\mathbf{x}_i\mathbf{x}_i'. \tag{8.24}$$
For the grouped data, the information matrix becomes
$$\mathbf{I}(\boldsymbol{\beta}) = \sum_{k=1}^{K}\left\{\sum_{i\in C_k} m_i\exp(\mathbf{x}_i'\boldsymbol{\beta})\,\mathbf{x}_i\mathbf{x}_i'\right\} = \sum_{k=1}^{K} m_{(k)}\exp(\mathbf{x}_{(k)}'\boldsymbol{\beta})\,\mathbf{x}_{(k)}\mathbf{x}_{(k)}'.$$
Statistical Criteria
From an analyst’s perspective, the discussion starts with the statistical signif-
icance of a rating factor. If the factor is not statistically significant, then the
variable is not even worthy of consideration for inclusion in a rating plan. The
statistical significance is judged not only on an in-sample basis but also on how
well it fares on an out-of-sample basis, as per our discussion in Section 4.2.
It is common in insurance applications to have many rating factors. Handling
multivariate aspects can be difficult with traditional univariate methods. Ana-
lysts employ techniques such as generalized linear models as described in Section
8.3.
Rating factors are introduced to create cells that contain similar risks. A rating
group should be large enough to measure costs with sufficient accuracy. There
is an inherent trade-off between theoretical accuracy and homogeneity.
As an example, most insurers charge the same automobile insurance premiums
for drivers between the ages of 30 and 50, not varying the premium by age.
Presumably costs do not vary much by age, or cost variances are due to other
identifiable factors.
Operational Criteria
From a business perspective, statistical criteria only provide a starting point
for discussions of potential inclusion of rating factors. Inclusion of a rating
factor must also induce economically meaningful results. From an insured’s
perspective, if differentiation by a factor produces little change in a rate then it
is not worth including. From an insurer’s perspective, the inclusion of a factor
should help segment the marketplace in a way that helps attract the business
that they seek. For example, we introduce the Gini index in Section 7.6 as one
metric that insurers use to describe the financial impact of a rating variable.
Rating factors should also be objective, inexpensive to administer, and verifi-
able. For example, automobile insurance underwriters often talk of “maturity”
and “responsibility” as important criteria for youthful drivers. Yet, these are
difficult to define objectively and to apply consistently. As another example, in automobile insurance it has long been known that the number of miles (or kilometers) driven is an excellent rating factor. However, insurers have been reluctant to adopt
this factor because it is subject to abuse. Historically, driving mileage has not
been used because of the difficulty in verifying this variable (it is far too easy
to alter the car’s odometer to change reported mileage). Going forward, mod-
ern day drivers and cars are equipped with global positioning devices and other
equipment that allow insurers to use distance driven as a rating factor because
it can be verified.
low risks
• Classification Costs - Money spent by society and by insurers to classify people appropriately.
Legal Criteria
For example, some states have statutes prohibiting the use of gender in rating
insurance while others permit it as a rating variable. As a result, an insurer
writing in multiple states may include gender as a rating variable in those states
where it is permitted, but not include it in a state that prohibits its use for rating.
If allowed by law, the company may continue to charge the average rate but
utilize the characteristic to identify, attract, and select the lower-risk insureds
that exist in the insured population; this is called skimming the cream. See Frees and Huang (2020) for a broad discussion of discrimination in pricing.
Chapter 9

Experience Rating Using Credibility Theory
An experience rating plan attempts to capture some of the variation in the risk
of loss among insureds within a rating class by using the insured’s own loss
experience to complement the rate from the classification rating plan. One way
to do this is to use a credibility weight 𝑍 with 0 ≤ 𝑍 ≤ 1 to compute
$$\hat{R} = Z\bar{X} + (1-Z)M,$$
For a risk whose loss experience is stable from year to year, 𝑍 might be close to
1. For a risk whose losses vary widely from year to year, 𝑍 may be close to 0.
Credibility theory is also used for computing rates for individual classes within a
classification rating plan. When classification plan rates are being determined,
some or many of the groups may not have sufficient data to produce stable
and reliable rates. The actual loss experience for a group will be assigned a
credibility weight 𝑍 and the complement of credibility 1 − 𝑍 may be given to
the average experience for risks across all classes. Or, if a class rating plan is
being updated, the complement of credibility may be assigned to the current
class rate. Credibility theory can also be applied to the calculation of expected
frequencies and severities.
Computing numeric values for 𝑍 requires analysis and understanding of the
data. What are the variances in the number of losses and sizes of losses for
risks? What is the variance between expected values across risks?
The expected number of claims required for the probability on the left-hand
side of (9.1) to equal 𝑝 is called the full credibility standard.
If the expected number of claims is greater than or equal to the full credibility
standard then full credibility can be assigned to the data so 𝑍 = 1. Usually the
expected value 𝜇𝑁 is not known so full credibility will be assigned to the data
if the actual observed number of claims 𝑛 is greater than or equal to the full
credibility standard. The 𝑘 and 𝑝 values must be selected and the actuary may
rely on experience, judgment, and other factors in making the choices.
Subtracting 𝜇𝑁 from each term in (9.1) and dividing by the standard deviation
𝜎𝑁 of 𝑁 gives
$$\Pr\left[\frac{-k\mu_N}{\sigma_N} \le \frac{N-\mu_N}{\sigma_N} \le \frac{k\mu_N}{\sigma_N}\right] \ge p. \tag{9.2}$$
Assuming that $(N-\mu_N)/\sigma_N$ is approximately standard normal, define $y_p$ such that
$$\Pr\left[-y_p \le \frac{N-\mu_N}{\sigma_N} \le y_p\right] = \Phi(y_p) - \Phi(-y_p) = p,$$
where Φ() is the cumulative distribution function of the standard normal. Be-
cause Φ(−𝑦𝑝 ) = 1 − Φ(𝑦𝑝 ), the equality can be rewritten as 2Φ(𝑦𝑝 ) − 1 = 𝑝.
Solving for 𝑦𝑝 gives 𝑦𝑝 = Φ−1 ((𝑝 + 1)/2) where Φ−1 () is the inverse of Φ().
Equation (9.2) will be satisfied if $k\mu_N/\sigma_N \ge y_p$, assuming the normal approximation. First we will consider this inequality for the case when $N$ has a Poisson distribution: $\Pr[N = n] = \lambda^n e^{-\lambda}/n!$. Because $\lambda = \mu_N = \sigma_N^2$ for the Poisson, taking square roots yields $\mu_N^{1/2} = \sigma_N$. So $k\mu_N/\mu_N^{1/2} \ge y_p$, which is equivalent to $\mu_N \ge (y_p/k)^2$. Let's define $\lambda_{kp}$ to be the value of $\mu_N$ for which equality holds. Then the full credibility standard for the Poisson distribution is
$$\lambda_{kp} = \left(\frac{y_p}{k}\right)^2 \quad \text{with} \quad y_p = \Phi^{-1}((p+1)/2). \tag{9.3}$$
If the expected number of claims 𝜇𝑁 is greater than or equal to 𝜆𝑘𝑝 then equation
(9.1) is assumed to hold and full credibility can be assigned to the data. As
noted previously, because 𝜇𝑁 is usually unknown, full credibility is given if the
observed number of claims 𝑛 satisfies 𝑛 ≥ 𝜆𝑘𝑝 .
Example 9.2.1. The full credibility standard is set so that the observed number
of claims is to be within 5% of the expected value with probability 𝑝 = 0.95.
If the number of claims has a Poisson distribution find the number of claims
needed for full credibility.
Solution. Referring to a standard normal distribution table, $y_p = \Phi^{-1}((p+1)/2) = \Phi^{-1}((0.95+1)/2) = \Phi^{-1}(0.975) = 1.960$. Using this value and $k = 0.05$, then $\lambda_{kp} = (y_p/k)^2 = (1.960/0.05)^2 = 1{,}536.64$. After rounding up, the full credibility standard is 1,537.
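A one-line verification of this calculation in R (a sketch using only base functions):

```r
# Full credibility standard for Poisson counts with k = 5% and p = 0.95
p <- 0.95; k <- 0.05
y_p <- qnorm((p + 1) / 2)   # 1.959964
(y_p / k)^2                 # 1536.58, so the standard rounds up to 1537
```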
If claims are not Poisson distributed then equation (9.2) does not imply (9.3).
Setting the upper bound of (𝑁 −𝜇𝑁 )/𝜎𝑁 in (9.2) equal to 𝑦𝑝 gives 𝑘𝜇𝑁 /𝜎𝑁 = 𝑦𝑝 .
Squaring both sides and moving everything to the right side except for one of the $\mu_N$'s gives $\mu_N = (y_p/k)^2(\sigma_N^2/\mu_N)$. This is the full credibility standard for frequency and will be denoted by $n_f$,
$$n_f = \left(\frac{y_p}{k}\right)^2\left(\frac{\sigma_N^2}{\mu_N}\right) = \lambda_{kp}\left(\frac{\sigma_N^2}{\mu_N}\right). \tag{9.4}$$
This is the same equation as the Poisson full credibility standard except for the $(\sigma_N^2/\mu_N)$ multiplier. When the claims distribution is Poisson this extra term is one because the variance equals the mean.
Example 9.2.2. The full credibility standard is set so that the total number of
claims is to be within 5% of the observed value with probability 𝑝 = 0.95. The
number of claims has a negative binomial distribution,
$$\Pr(N = x) = \binom{x+r-1}{x}\left(\frac{1}{1+\beta}\right)^{r}\left(\frac{\beta}{1+\beta}\right)^{x},$$
We see that the negative binomial distribution with $(\sigma_N^2/\mu_N) > 1$ requires more claims for full credibility than a Poisson distribution for the same $k$ and $p$ values. The next example shows that a binomial distribution, which has $(\sigma_N^2/\mu_N) < 1$, will need fewer claims for full credibility.
Example 9.2.3. The full credibility standard is set so that the total number of
claims is to be within 5% of the observed value with probability 𝑝 = 0.95. The
number of claims has a binomial distribution
$$\Pr(N = x) = \binom{m}{x}\, q^x (1-q)^{m-x}.$$
Rather than using expected number of claims to define the full credibility stan-
dard, the number of exposures can be used for the full credibility standard. An
exposure is a measure of risk. For example, one car insured for a full year would
be one car-year. Two cars each insured for exactly one-half year would also
result in one car-year. Car-years attempt to quantify exposure to loss. Two
car-years would be expected to generate twice as many claims as one car-year if
the vehicles have the same risk of loss. To translate a full credibility standard
denominated in terms of number of claims to a full credibility standard denom-
inated in exposures one needs a reasonable estimate of the expected number of
claims per exposure.
Example 9.2.4. The full credibility standard should be selected so that the ob-
served number of claims will be within 5% of the expected value with probability
𝑝 = 0.95. The number of claims has a Poisson distribution. If one exposure
is expected to have about 0.20 claims per year, find the number of exposures
needed for full credibility.
Solution. With $p = 0.95$ and $k = 0.05$, $\lambda_{kp} = (y_p/k)^2 = (1.960/0.05)^2 = 1{,}536.64$ claims are required for full credibility. The claims frequency rate is 0.20 claims per exposure. To convert the full credibility standard to a standard denominated in exposures the calculation is: (1,536.64 claims)/(0.20 claims/exposure) = 7,683.20 exposures. This can be rounded up to 7,684.
$$\Pr\left[(1-k)\frac{\mu_N}{m} \le \frac{N}{m} \le (1+k)\frac{\mu_N}{m}\right] \ge p.$$
$$S = X_1 + X_2 + \cdots + X_N.$$
The random variable 𝑁 represents the number of losses and random variables
𝑋1 , 𝑋2 , … , 𝑋𝑁 are the individual loss amounts. In this section it is assumed
that 𝑁 is independent of the loss amounts and that 𝑋1 , 𝑋2 , … , 𝑋𝑁 are iid.
The mean and variance of $S$ are
$$\mu_S = \mathrm{E}(S) = \mu_N\,\mu_X \quad \text{and} \quad \sigma_S^2 = \mathrm{Var}(S) = \mu_N\sigma_X^2 + \sigma_N^2\mu_X^2,$$
where $X$ is the amount of a single loss. See Section 5.3 on collective risk models for more discussion of this framework.
$$\Pr\left[\frac{-k\mu_S}{\sigma_S} \le \frac{S-\mu_S}{\sigma_S} \le \frac{k\mu_S}{\sigma_S}\right] \ge p.$$
$$n_S = \left(\frac{y_p}{k}\right)^2\left[\left(\frac{\sigma_N^2}{\mu_N}\right) + \left(\frac{\sigma_X}{\mu_X}\right)^2\right] = \lambda_{kp}\left[\left(\frac{\sigma_N^2}{\mu_N}\right) + \left(\frac{\sigma_X}{\mu_X}\right)^2\right]. \tag{9.5}$$
When the number of claims is Poisson distributed then equation (9.5) can be simplified using $(\sigma_N^2/\mu_N) = 1$. It follows that
$$\left[(\sigma_N^2/\mu_N) + (\sigma_X/\mu_X)^2\right] = \left[1 + (\sigma_X/\mu_X)^2\right] = \left[(\mu_X^2 + \sigma_X^2)/\mu_X^2\right] = \mathrm{E}(X^2)/\mathrm{E}(X)^2.$$
$$\Pr\left[(1-k)\left(\frac{\mu_S}{m}\right) \le \left(\frac{S}{m}\right) \le (1+k)\left(\frac{\mu_S}{m}\right)\right] \ge p.$$
This means that the full credibility standard $n_{PP}$ for the pure premium is the same as that for aggregate losses:
$$n_{PP} = n_S = \lambda_{kp}\left[\left(\frac{\sigma_N^2}{\mu_N}\right) + \left(\frac{\sigma_X}{\mu_X}\right)^2\right].$$
$$\bar{X} = \frac{1}{n}\left(X_1 + X_2 + \cdots + X_n\right).$$
How big does 𝑛 need to be to get a good estimate? Note that 𝑛 is not a random
variable whereas it is in the aggregate loss model.
In Section 9.2.1 the accuracy of an estimator for frequency was defined by requiring that the number of claims lie within a specified interval about the mean number of claims with a specified probability. For severity this requirement is
$$\Pr\left[(1-k)\mu_X \le \bar{X} \le (1+k)\mu_X\right] \ge p,$$
where $k$ and $p$ need to be specified. Following the steps in Section 9.2.1, the mean claim severity $\mu_X$ is subtracted from each term and the standard deviation of the claim severity estimator $\sigma_{\bar{X}}$ is divided into each term yielding
$$\Pr\left[\frac{-k\mu_X}{\sigma_{\bar{X}}} \le \frac{\bar{X}-\mu_X}{\sigma_{\bar{X}}} \le \frac{k\mu_X}{\sigma_{\bar{X}}}\right] \ge p.$$
$$n_X = \left(\frac{y_p}{k}\right)^2\left(\frac{\sigma_X}{\mu_X}\right)^2 = \lambda_{kp}\left(\frac{\sigma_X}{\mu_X}\right)^2. \tag{9.6}$$
Note that the term 𝜎𝑋 /𝜇𝑋 is the coefficient of variation for an individual claim.
Even though 𝜆𝑘𝑝 is the full credibility standard for frequency given a Poisson
distribution, there is no assumption about the distribution for the number of
claims.
Example 9.2.6. Individual loss amounts are independently and identically
distributed with a Type II Pareto distribution 𝐹 (𝑥) = 1 − [𝜃/(𝑥 + 𝜃)]𝛼 . How
many claims are required for the average severity of observed claims to be within
5% of the expected severity with probability 𝑝 = 0.95?
Solution. The mean of the Pareto is $\mu_X = \theta/(\alpha-1)$ and the variance is $\sigma_X^2 = \theta^2\alpha/[(\alpha-1)^2(\alpha-2)]$, so $(\sigma_X/\mu_X)^2 = \alpha/(\alpha-2)$. From a standard normal distribution table, $y_p = \Phi^{-1}((0.95+1)/2) = 1.960$. The full credibility standard is $n_X = (1.96/0.05)^2[\alpha/(\alpha-2)] = 1{,}536.64\,\alpha/(\alpha-2)$. Suppose $\alpha = 3$; then $n_X = 4{,}609.92$ for a full credibility standard of 4,610.
$$Z = \begin{cases} \sqrt{n/n_0} & \text{if } n < n_0 \\ 1 & \text{if } n \ge n_0, \end{cases}$$
where 𝑛0 is the full credibility standard. The quantity 𝑛 is the number of claims
for the data that is used to estimate the expected frequency, severity, or pure
premium.
Example 9.2.7. The number of claims has a Poisson distribution. Individ-
ual loss amounts are independently and identically distributed with a Type II
Pareto distribution 𝐹 (𝑥) = 1 − [𝜃/(𝑥 + 𝜃)]𝛼 . Assume that 𝛼 = 3. The number
of claims and loss amounts are independent. The full credibility standard is
that the observed pure premium should be within 5% of the expected value
with probability 𝑝 = 0.95. What credibility 𝑍 is assigned to a pure premium
computed from 1,000 claims?
Solution. Because the number of claims is Poisson,
$$\frac{\mathrm{E}(X^2)}{[\mathrm{E}(X)]^2} = \frac{\sigma_N^2}{\mu_N} + \left(\frac{\sigma_X}{\mu_X}\right)^2.$$
The mean of the Pareto is $\mu_X = \theta/(\alpha-1)$ and the second moment is $\mathrm{E}(X^2) = 2\theta^2/[(\alpha-1)(\alpha-2)]$, so $\mathrm{E}(X^2)/[\mathrm{E}(X)]^2 = 2(\alpha-1)/(\alpha-2)$. From a standard normal distribution table, $y_p = \Phi^{-1}((0.95+1)/2) = 1.960$. The full credibility standard is
$$n_0 = n_{PP} = 1{,}536.64 \times \frac{2(\alpha-1)}{\alpha-2} = 1{,}536.64 \times 4 = 6{,}146.56,$$
so the credibility assigned to 1,000 claims is $Z = \sqrt{1000/6146.56} = 0.40$.
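The same computation as a short R sketch:

```r
# Sketch for Example 9.2.7: partial credibility for the pure premium,
# Poisson frequency and Pareto severity with alpha = 3
p <- 0.95; k <- 0.05; alpha <- 3
lambda_kp <- (qnorm((p + 1) / 2) / k)^2          # about 1536.6
n0 <- lambda_kp * 2 * (alpha - 1) / (alpha - 2)  # E(X^2)/E(X)^2 = 2(a-1)/(a-2)
sqrt(1000 / n0)                                  # Z for 1,000 claims, about 0.40
```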
Limited fluctuation credibility uses the formula 𝑍 = √𝑛/𝑛0 to limit the fluctu-
ation in the credibility-weighted estimate to match the fluctuation allowed for
data with expected claims at the full credibility standard. Variance or standard
deviation is used as the measure of fluctuation. Next we show an example to
explain why the square-root formula is used.
Suppose that average claim severity is being estimated from a sample of size
𝑛 that is less than the full credibility standard 𝑛0 = 𝑛𝑋 . Applying credibility
theory, the estimate $\hat{\mu}_X$ would be
$$\hat{\mu}_X = Z\bar{X} + (1-Z)M_X,$$
where $M_X$ could be last year's estimated average severity adjusted for inflation, the average severity for a much larger pool of risks, or some other relevant quantity selected by the actuary. It is assumed that the variance of $M_X$ is zero or negligible. With this assumption
$$\mathrm{Var}(\hat{\mu}_X) = \mathrm{Var}(Z\bar{X}) = Z^2\,\mathrm{Var}(\bar{X}) = \frac{n}{n_0}\mathrm{Var}(\bar{X}).$$

$$\mathrm{Var}(\hat{\mu}_X) = \frac{n}{n_0}\mathrm{Var}(\bar{X}) = \frac{n}{n_0}\cdot\frac{\mathrm{Var}(X_i)}{n} = \frac{\mathrm{Var}(X_i)}{n_0}.$$
The last term is exactly the variance of a sample mean 𝑋̄ when the sample size
is equal to the full credibility standard 𝑛0 = 𝑛𝑋 .
$$\hat{\mu}(\theta) = Z\bar{X} + (1-Z)\mu \tag{9.7}$$
with
In the prior example the risk parameter 𝜃 is a random variable with an expo-
nential distribution. In the next example there are three types of risks and the
risk parameter has a discrete distribution.
If a risk is selected at random from the population, what is the expected aggre-
gate loss in a year?
Solution The expected number of claims for a risk is E(𝑁 |𝜆)=𝜆. The expected
value for a Pareto distributed random variable is E(𝑋|𝜃, 𝛼)=𝜃/(𝛼 − 1). The
expected value of the aggregate loss random variable 𝑆 = 𝑋1 + ⋯ + 𝑋𝑁 for
a risk with parameters 𝜆, 𝛼, and 𝜃 is E(𝑆) = E(𝑁 )E(𝑋) = 𝜆𝜃/(𝛼 − 1). The
expected aggregate loss for a risk of type A is E(𝑆A )=(0.5)(1000)/(2-1)=500.
The expected aggregate loss for a risk selected at random from the population
is E(𝑆) = 0.5[(0.5)(1000)]+0.3[(1.0)(1500)]+0.2[(2.0)(2000)]=1500.
What is the risk parameter for a risk (policyholder) in the prior example? One
could say that the risk parameter has three components (𝜆, 𝜃, 𝛼) with possible
values (0.5,1000,2.0), (1.0,1500,2.0), and (2.0,2000,2.0) depending on the type
of risk.
Note that in both of the examples the risk parameter is a random quantity with
its own probability distribution. We do not know the value of the risk parameter
for a randomly chosen risk.
Although formula (9.7) was introduced using experience rating as an example,
the Bühlmann credibility model has wider application. Suppose that a rating
plan has multiple classes. Credibility formula (9.7) can be used to determine
individual class rates. The overall mean 𝜇 would be the average loss for all
classes combined, $\bar{X}$ would be the experience for the individual class, and $\hat{\mu}(\theta)$ would be the estimated loss for the class.
$$EPV = \mathrm{E}(\mathrm{Var}(X_j|\theta)).$$
Because $\mathrm{Var}(\bar{X}|\theta) = \mathrm{Var}(X_j|\theta)/n$, it follows that $\mathrm{E}(\mathrm{Var}(\bar{X}|\theta)) = EPV/n$.
2. How homogeneous is the population of risks whose experience was com-
bined to compute the overall mean 𝜇? If all the risks are similar in loss
potential then more weight (1 − 𝑍) would be given to the overall mean
𝜇 because 𝜇 is the average for a group of similar risks whose means 𝜇(𝜃)
are not far apart. The homogeneity or heterogeneity of the population is
measured by the Variance of the Hypothetical Means with abbreviation
𝑉 𝐻𝑀 :
$$VHM = \mathrm{Var}(\mathrm{E}(X_j|\theta)) = \mathrm{Var}(\mathrm{E}(\bar{X}|\theta)).$$
Note that we used $\mathrm{E}(\bar{X}|\theta) = \mathrm{E}(X_j|\theta)$ for the second equality.
3. How many observations $n$ were used to compute $\bar{X}$? A larger sample implies a larger $Z$.
Example 9.3.3. The number of claims 𝑁 in a year for a risk in a population
has a Poisson distribution with mean 𝜆 > 0. The risk parameter 𝜆 is uniformly
distributed over the interval (0, 2). Calculate the 𝐸𝑃 𝑉 and 𝑉 𝐻𝑀 for the
population.
Solution. Random variable $N$ is Poisson with parameter $\lambda$ so $\mathrm{Var}(N|\lambda) = \lambda$. The Expected Value of the Process Variance is $EPV = \mathrm{E}(\mathrm{Var}(N|\lambda)) = \mathrm{E}(\lambda) = \int_0^2 \lambda\,\tfrac{1}{2}\,d\lambda = 1$. The Variance of the Hypothetical Means is $VHM = \mathrm{Var}(\mathrm{E}(N|\lambda)) = \mathrm{Var}(\lambda) = \mathrm{E}(\lambda^2) - (\mathrm{E}(\lambda))^2 = \int_0^2 \lambda^2\,\tfrac{1}{2}\,d\lambda - (1)^2 = \tfrac{1}{3}$.
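A numerical check of this example (a sketch using R's `integrate`):

```r
# Sketch verifying Example 9.3.3: lambda is uniform on (0, 2)
dens <- function(l) dunif(l, 0, 2)
EPV <- integrate(function(l) l * dens(l), 0, 2)$value    # E(lambda) = 1
El2 <- integrate(function(l) l^2 * dens(l), 0, 2)$value  # E(lambda^2) = 4/3
VHM <- El2 - EPV^2                                       # 1/3
c(EPV = EPV, VHM = VHM)
```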
$$Z = \frac{n}{n+K}, \quad K = \frac{EPV}{VHM}. \tag{9.8}$$
If the 𝑉 𝐻𝑀 increases then 𝑍 increases. If the 𝐸𝑃 𝑉 increases then 𝑍 gets
smaller. Unlike limited fluctuation credibility where 𝑍 = 1 when the expected
number of claims is greater than the full credibility standard, 𝑍 can approach
but not equal 1 as the number of observations 𝑛 goes to infinity.
$$Z = \frac{VHM}{VHM + (EPV/n)}.$$
$$\mathrm{Var}(\bar{X}) = \mathrm{E}(\mathrm{Var}(\bar{X}|\theta)) + \mathrm{Var}(\mathrm{E}(\bar{X}|\theta)).$$
In bullet (1) at the beginning of this section we showed $\mathrm{E}(\mathrm{Var}(\bar{X}|\theta)) = EPV/n$. In bullet (2), $\mathrm{Var}(\mathrm{E}(\bar{X}|\theta)) = VHM$. Reordering the right-hand side gives $\mathrm{Var}(\bar{X}) = VHM + (EPV/n)$. Another way to write the formula for credibility $Z$ is $Z = \mathrm{Var}(\mathrm{E}(\bar{X}|\theta))/\mathrm{Var}(\bar{X})$. This implies $(1-Z) = \mathrm{E}(\mathrm{Var}(\bar{X}|\theta))/\mathrm{Var}(\bar{X})$.
The following long example and solution demonstrate how to compute the
credibility-weighted estimate with frequency and severity data.
Example 9.3.5. For any risk in a population the number of losses 𝑁 in a year
has a Poisson distribution with parameter 𝜆. Individual loss amounts 𝑋 for
a selected risk are independent of 𝑁 and are iid with exponential distribution
𝐹 (𝑥) = 1 − 𝑒−𝑥/𝛽 . There are three types of risks in the population as shown
below. A risk was selected at random from the population and all losses were
recorded over a five-year period. The total amount of losses over the five-year
period was 5,000. Use Bühlmann credibility to estimate the annual expected
aggregate loss for the risk.
• Calculate required values including the Expected Value of the Process Vari-
ance (𝐸𝑃 𝑉 ), Variance of the Hypothetical Means (𝑉 𝐻𝑀 ) and collective
mean 𝜇.
• Recognize situations when the Bühlmann-Straub model is appropriate.
Define 𝑌𝑗𝑘 to be the loss for the 𝑘𝑡ℎ vehicle in the fleet for year 𝑗. Then, the
total losses for the fleet in year 𝑗 are 𝑌𝑗1 + ⋯ + 𝑌𝑗𝑚𝑗 where we are adding up
the losses for each of the 𝑚𝑗 vehicles. In the Bühlmann-Straub model it is
assumed that random variables 𝑌𝑗𝑘 are iid across all vehicles and years for the
policyholder. With this assumption the means E(𝑌𝑗𝑘 |𝜃) = 𝜇(𝜃) and variances
Var(𝑌𝑗𝑘 |𝜃) = 𝜎2 (𝜃) are the same for all vehicles and years. The quantity 𝜇(𝜃)
is the expected loss and 𝜎2 (𝜃) is the variance in the loss for one year for one
vehicle for a policyholder with risk parameter 𝜃.
$$\bar{X} = \frac{1}{m}\sum_{j=1}^{n} m_j X_j, \quad m = \sum_{j=1}^{n} m_j.$$
It follows that $\mathrm{E}(\bar{X}|\theta) = \mu(\theta)$ and $\mathrm{Var}(\bar{X}|\theta) = \sigma^2(\theta)/m$, where $\mu(\theta)$ and $\sigma^2(\theta)$ are the mean and variance for a single vehicle for one year for the policyholder.

Example 9.4.1. Prove that $\mathrm{Var}(\bar{X}|\theta) = \sigma^2(\theta)/m$ for a risk with risk parameter $\theta$.
Solution.
$$\begin{aligned}
\mathrm{Var}(\bar{X}|\theta) &= \mathrm{Var}\left(\frac{1}{m}\sum_{j=1}^{n} m_j X_j \,\Big|\, \theta\right) \\
&= \frac{1}{m^2}\sum_{j=1}^{n}\mathrm{Var}(m_j X_j|\theta) = \frac{1}{m^2}\sum_{j=1}^{n} m_j^2\,\mathrm{Var}(X_j|\theta) \\
&= \frac{1}{m^2}\sum_{j=1}^{n} m_j^2\left(\sigma^2(\theta)/m_j\right) = \frac{\sigma^2(\theta)}{m^2}\sum_{j=1}^{n} m_j = \sigma^2(\theta)/m.
\end{aligned}$$
$$\hat{\mu}(\theta) = Z\bar{X} + (1-Z)\mu \tag{9.9}$$
with
$$Z = \frac{\mathrm{Var}(\mathrm{E}(\bar{X}|\theta))}{\mathrm{Var}(\bar{X})} = \frac{\mathrm{Var}(\mathrm{E}(\bar{X}|\theta))}{\mathrm{E}(\mathrm{Var}(\bar{X}|\theta)) + \mathrm{Var}(\mathrm{E}(\bar{X}|\theta))}.$$
The denominator was expanded using the law of total variance. As noted above, $\mathrm{E}(\bar{X}|\theta) = \mu(\theta)$ so $\mathrm{Var}(\mathrm{E}(\bar{X}|\theta)) = \mathrm{Var}(\mu(\theta)) = VHM$. Because $\mathrm{Var}(\bar{X}|\theta) = \sigma^2(\theta)/m$ it follows that $\mathrm{E}(\mathrm{Var}(\bar{X}|\theta)) = \mathrm{E}(\sigma^2(\theta))/m = EPV/m$. Making these substitutions and using a little algebra gives
$$Z = \frac{m}{m+K}, \quad K = \frac{EPV}{VHM}. \tag{9.10}$$
• The number of claims in a year for each vehicle in the policyholder’s fleet
is Poisson distributed with the same mean (parameter) 𝜆.
• Parameter 𝜆 is distributed among the policyholders in the population with
pdf 𝑓(𝜆) = 6𝜆(1 − 𝜆) with 0 < 𝜆 < 1.
The policyholder has 18 vehicles in its fleet in year 4. Use Bühlmann-Straub
credibility to estimate the expected number of policyholder claims in year 4.
Solution. The expected number of claims for one vehicle for a randomly chosen policyholder is $\mu = \mathrm{E}(\lambda) = \int_0^1 \lambda[6\lambda(1-\lambda)]\,d\lambda = 1/2$. The average number of claims per vehicle for the policyholder is $\bar{X} = 13/36$. The expected value of the process variance for a single vehicle is $EPV = \mathrm{E}(\lambda) = 1/2$. The variance of the hypothetical means across policyholders is $VHM = \mathrm{Var}(\lambda) = \mathrm{E}(\lambda^2) - (\mathrm{E}(\lambda))^2 = \int_0^1 \lambda^2[6\lambda(1-\lambda)]\,d\lambda - (1/2)^2 = (3/10) - (1/4) = (6/20) - (5/20) = 1/20$. So, $K = EPV/VHM = (1/2)/(1/20) = 10$. The number of exposures in the experience period is $m = 9 + 12 + 15 = 36$. The credibility is $Z = 36/(36+10) = 18/23$. The credibility-weighted estimate for the number of claims for one vehicle is $\hat{\mu}(\theta) = Z\bar{X} + (1-Z)\mu = (18/23)(13/36) + (5/23)(1/2) = 9/23$. With 18 vehicles in the fleet in year 4 the expected number of claims is $18(9/23) = 162/23 = 7.04$.
• Use Bayes Theorem to determine a formula for the expected loss of a risk
given a likelihood and prior distribution.
• Determine the posterior distributions for the gamma-Poisson and beta-
binomial Bayesian models and compute expected values.
• Understand the connection between the Bühlmann and Bayesian estimates
for the gamma-Poisson and beta-binomial models.
Section 4.4 reviews Bayesian inference and it is assumed that the reader is fa-
miliar with that material. The reader is also advised to read the Bühlmann
credibility Section 9.3 in this chapter. This section will compare Bayesian infer-
ence with Bühlmann credibility and show connections between the two models.
A risk with risk parameter 𝜃 has expected loss 𝜇(𝜃) = E(𝑋|𝜃) with random
variable 𝑋 representing pure premium, aggregate loss, number of claims, claim
severity, or some other measure of loss during a period of time. If the risk
has $n$ losses $X_1, \ldots, X_n$ during $n$ separate periods of time, then these losses are assumed to be iid for the policyholder and $\mu(\theta) = \mathrm{E}(X_i|\theta)$ for $i = 1, \ldots, n$.
If the risk had 𝑛 losses 𝑥1 , … , 𝑥𝑛 then E(𝜇(𝜃)|𝑥1 , … , 𝑥𝑛 ) is the conditional ex-
pectation of 𝜇(𝜃). The Bühlmann credibility formula 𝜇(𝜃) ̂ = 𝑍 𝑋̄ + (1 − 𝑍)𝜇 is
̄
a linear function of 𝑋 = (𝑥1 + ⋯ + 𝑥𝑛 )/𝑛 used to estimate E(𝜇(𝜃)|𝑥1 , … , 𝑥𝑛 ).
The expectation E(𝜇(𝜃)|𝑥1 , … , 𝑥𝑛 ) can be calculated from the conditional den-
sity function 𝑓(𝑥|𝜃) and the posterior distribution 𝜋(𝜃|𝑥1 , … , 𝑥𝑛 ):
$$\pi(\theta|x_1,\ldots,x_n) = \frac{\prod_{j=1}^{n} f(x_j|\theta)}{f(x_1,\ldots,x_n)}\,\pi(\theta).$$
The conditional density function $f(x|\theta)$ and the prior distribution $\pi(\theta)$ must be specified. The numerator $\prod_{j=1}^{n} f(x_j|\theta)$ on the right-hand side is called the likelihood. The denominator $f(x_1,\ldots,x_n)$ is the joint density function for $n$ losses $x_1,\ldots,x_n$.
for $\lambda$ is gamma with $\pi(\lambda) = \beta^\alpha\lambda^{\alpha-1}e^{-\beta\lambda}/\Gamma(\alpha)$. (Note that a rate parameter $\beta$ is being used in the gamma distribution rather than a scale parameter.) The mean of the gamma is $\mathrm{E}(\lambda) = \alpha/\beta$ and the variance is $\mathrm{Var}(\lambda) = \alpha/\beta^2$. In this
section we will assume that 𝜆 is the expected number of claims per year though
we could have chosen another time interval.
If a risk is selected at random from the population then the expected number of
claims in a year is E(𝑁 ) = E(E[𝑁 |𝜆]) = E(𝜆) = 𝛼/𝛽. If we had no observations
for the selected risk then the expected number of claims for the risk is 𝛼/𝛽.
During 𝑛 years the following number of claims by year was observed for the ran-
domly selected risk: 𝑥1 , … , 𝑥𝑛 . From Bayes theorem the posterior distribution
is
$$\pi(\lambda|x_1,\ldots,x_n) = \frac{\prod_{j=1}^{n}\left(\lambda^{x_j}e^{-\lambda}/x_j!\right)}{\Pr(X_1 = x_1,\ldots,X_n = x_n)}\;\beta^\alpha\lambda^{\alpha-1}e^{-\beta\lambda}/\Gamma(\alpha).$$
Combining terms that have a 𝜆 and putting all other terms into constant 𝐶
gives
$$\pi(\lambda|x_1,\ldots,x_n) = C\,\lambda^{\left(\alpha+\sum_{j=1}^{n}x_j\right)-1}e^{-(\beta+n)\lambda}.$$
This is a gamma distribution with parameters $\alpha' = \alpha + \sum_{j=1}^{n}x_j$ and $\beta' = \beta + n$. The constant must be $C = \beta'^{\,\alpha'}/\Gamma(\alpha')$ so that $\int_0^\infty \pi(\lambda|x_1,\ldots,x_n)\,d\lambda = 1$, though we do not need to know $C$. As explained in Chapter 4 the gamma distribution
is a conjugate prior for the Poisson distribution so the posterior distribution is
also gamma. See also Appendix Section 16.3.2.
Because the posterior distribution is gamma the expected number of claims for
the selected risk is
$$\mathrm{E}(\lambda|x_1,\ldots,x_n) = \frac{\alpha + \sum_{j=1}^{n}x_j}{\beta + n} = \frac{\alpha + \text{number of claims}}{\beta + \text{number of years}}.$$
$$\hat{\mu} = \frac{n}{n+\beta}\cdot\frac{\sum_{j=1}^{n}x_j}{n} + \left(1 - \frac{n}{n+\beta}\right)\frac{\alpha}{\beta} = \frac{\alpha + \sum_{j=1}^{n}x_j}{\beta + n}.$$
For the gamma-Poisson model the Bühlmann credibility estimate matches the
Bayesian analysis result.
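A small numerical check (a sketch with made-up prior parameters and claim counts) that the two estimates coincide:

```r
# Sketch: gamma(alpha, rate = beta) prior with Poisson claims x_1, ..., x_n
alpha <- 2; beta <- 5
x <- c(0, 1, 0, 2, 1); n <- length(x)
bayes <- (alpha + sum(x)) / (beta + n)   # posterior (gamma) mean
K <- beta                                # K = EPV/VHM = (a/b) / (a/b^2) = b
Z <- n / (n + K)
buhlmann <- Z * mean(x) + (1 - Z) * alpha / beta
c(bayes = bayes, buhlmann = buhlmann)    # identical values
```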
$$\pi(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,p^{\alpha-1}(1-p)^{\beta-1}, \quad 0 < p < 1,\ \alpha > 0,\ \beta > 0.$$
Combining terms that have a $p$ and putting everything else into the constant $C$ yields
$$\pi(p|x) = C\,p^{\alpha+x-1}(1-p)^{\beta+n-x-1},$$
a beta distribution whose normalizing constant is
$$C = \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+x)\Gamma(\beta+n-x)}.$$
The mean for the beta distribution with parameters 𝛼 and 𝛽 is E(𝑝) = 𝛼/(𝛼+𝛽).
Given 𝑥 successes in 𝑛 trials in the beta-binomial model the mean of the posterior
distribution is
$$\mathrm{E}(p|x) = \frac{\alpha+x}{\alpha+\beta+n}.$$
The Bühlmann credibility estimate for $\mathrm{E}(p|x)$ is exactly the same as the Bayesian estimate, as demonstrated in the following example.
Example 9.5.1 The probability that a coin toss will yield heads is 𝑝. The prior
distribution for probability 𝑝 is beta with parameters 𝛼 and 𝛽. On 𝑛 tosses of
the coin there were exactly 𝑥 heads. Use Bühlmann credibility to estimate the
expected value of 𝑝.
Solution. Define random variables $Y_j$ such that $Y_j = 1$ if the $j$th coin toss is heads and $Y_j = 0$ if tails, for $j = 1, \ldots, n$. Random variables $Y_j$ are iid conditional on $p$ with $\Pr[Y = 1|p] = p$ and $\Pr[Y = 0|p] = 1-p$. The number of heads in $n$ tosses can be represented by the random variable $X = Y_1 + \cdots + Y_n$. We want to estimate $p = \mathrm{E}[Y_j]$ using Bühlmann credibility: $\hat{p} = Z\bar{Y} + (1-Z)\mu$. The overall mean is $\mu = \mathrm{E}(\mathrm{E}(Y_j|p)) = \mathrm{E}(p) = \alpha/(\alpha+\beta)$. The sample mean is $\bar{y} = x/n$. The credibility is $Z = n/(n+K)$ with $K = EPV/VHM$. With $\mathrm{Var}(Y_j|p) = p(1-p)$ it follows that $EPV = \mathrm{E}(\mathrm{Var}[Y_j|p]) = \mathrm{E}(p(1-p))$. Because $\mathrm{E}(Y_j|p) = p$, $VHM = \mathrm{Var}(\mathrm{E}(Y_j|p)) = \mathrm{Var}(p)$. For the beta distribution
$$\mathrm{E}(p) = \frac{\alpha}{\alpha+\beta}, \quad \mathrm{E}(p^2) = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}, \quad \text{and} \quad \mathrm{Var}(p) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$
It follows that $K = EPV/VHM = \alpha + \beta$, so
$$\hat{p} = \frac{n}{n+\alpha+\beta}\left(\frac{x}{n}\right) + \left(1 - \frac{n}{n+\alpha+\beta}\right)\frac{\alpha}{\alpha+\beta} = \frac{\alpha+x}{\alpha+\beta+n},$$
which agrees
with the Bayesian estimate. More information about exact credibility can be
found in (Bühlmann and Gisler, 2005), (Klugman et al., 2012), and (Tse, 2009).
The examples in this chapter have provided assumptions for calculating credi-
bility parameters. In actual practice the actuary must use real world data and
judgment to determine credibility parameters.
$$n_S = \left(\frac{y_p}{k}\right)^2\left[\left(\frac{\sigma_N^2}{\mu_N}\right) + \left(\frac{\sigma_X}{\mu_X}\right)^2\right],$$
with 𝑁 representing number of claims and 𝑋 the size of claims. If one assumes
𝜎𝑋 = 0 then the full credibility standard for frequency results. If 𝜎𝑁 = 0 then
the full credibility formula for severity follows. The probability $p$ and the value $k$ are often selected using judgment and experience.
If the number of claims is Poisson distributed, this simplifies to
$$n_S = \left(\frac{y_p}{k}\right)^2\left[\frac{\mathrm{E}(X^2)}{(\mathrm{E}(X))^2}\right].$$
An empirical mean and second moment for the sizes of individual claim losses
can be computed from past data, if available.
$$\widehat{EPV} = \frac{1}{r}\sum_{i=1}^{r}s_i^2 = \frac{1}{r(n-1)}\sum_{i=1}^{r}\sum_{j=1}^{n}(X_{ij} - \bar{X}_i)^2. \tag{9.11}$$
$$\widehat{\mathrm{Var}}(\bar{X}_i) = \frac{1}{r-1}\sum_{i=1}^{r}(\bar{X}_i - \bar{X})^2 \quad \text{and} \quad \bar{X} = \frac{1}{r}\sum_{i=1}^{r}\bar{X}_i,$$
but $\mathrm{Var}(\bar{X}_i)$ is not the $VHM$. Using equation (16.2), the total variance formula or unconditional variance formula is
$$\mathrm{Var}(\bar{X}_i) = \mathrm{E}(\mathrm{Var}(\bar{X}_i|\Theta)) + \mathrm{Var}(\mathrm{E}(\bar{X}_i|\Theta)).$$
The $VHM$ is the second term on the right because $\mu(\theta_i) = \mathrm{E}(\bar{X}_i|\Theta = \theta_i)$ is the hypothetical mean for risk $i$. So,
$$\widehat{VHM} = \frac{1}{r-1}\sum_{i=1}^{r}(\bar{X}_i - \bar{X})^2 - \frac{\widehat{EPV}}{n}. \tag{9.12}$$
Although the expected loss for a risk with parameter $\theta_i$ is $\mu(\theta_i) = \mathrm{E}(\bar{X}_i|\Theta = \theta_i)$, the variance of the sample mean $\bar{X}_i$ is greater than or equal to the variance of the hypothetical means: $\mathrm{Var}(\bar{X}_i) \ge \mathrm{Var}(\mu(\theta_i))$. The variance in the sample means $\mathrm{Var}(\bar{X}_i)$ includes the variance in the hypothetical means plus a process variance term.
In some cases formula (9.12) can produce a negative value for $\widehat{VHM}$ because of the subtraction of $\widehat{EPV}/n$, but a variance cannot be negative. This happens when the process variance within risks is so large that it overwhelms the measurement of the variance in means between risks. In this case we cannot use this method to determine the values needed for Bühlmann credibility.
Example 9.6.1. Two policyholders had claims over a three-year period as
shown in the table below. Estimate the expected number of claims for each
policyholder using Bühlmann credibility and calculating necessary parameters from the data.

Year                1  2  3
A Number of claims  0  1  0
B Number of claims  2  1  2
Solution. $\bar{x}_A = \frac{1}{3}(0+1+0) = \frac{1}{3}$, $\bar{x}_B = \frac{1}{3}(2+1+2) = \frac{5}{3}$

$\bar{x} = \frac{1}{2}\left(\frac{1}{3} + \frac{5}{3}\right) = 1$

$s_A^2 = \frac{1}{3-1}\left[(0-\frac{1}{3})^2 + (1-\frac{1}{3})^2 + (0-\frac{1}{3})^2\right] = \frac{1}{3}$

$s_B^2 = \frac{1}{3-1}\left[(2-\frac{5}{3})^2 + (1-\frac{5}{3})^2 + (2-\frac{5}{3})^2\right] = \frac{1}{3}$

$\widehat{EPV} = \frac{1}{2}\left(\frac{1}{3} + \frac{1}{3}\right) = \frac{1}{3}$

$\widehat{VHM} = \frac{1}{2-1}\left[(\frac{1}{3}-1)^2 + (\frac{5}{3}-1)^2\right] - \frac{1/3}{3} = \frac{7}{9}$

$K = \frac{1/3}{7/9} = \frac{3}{7}$

$Z = \frac{3}{3 + (3/7)} = \frac{7}{8}$

$\hat{\mu}_A = \frac{7}{8}\left(\frac{1}{3}\right) + \left(1 - \frac{7}{8}\right)1 = \frac{5}{12}$

$\hat{\mu}_B = \frac{7}{8}\left(\frac{5}{3}\right) + \left(1 - \frac{7}{8}\right)1 = \frac{19}{12}$
Solution. $\bar{x}_A = \frac{1}{3}(3+0+0) = 1$, $\bar{x}_B = \frac{1}{3}(3+0+3) = 2$

$\bar{x} = \frac{1}{2}(1+2) = \frac{3}{2}$

$s_A^2 = \frac{1}{3-1}\left[(3-1)^2 + (0-1)^2 + (0-1)^2\right] = 3$

$s_B^2 = \frac{1}{3-1}\left[(3-2)^2 + (0-2)^2 + (3-2)^2\right] = 3$

$\widehat{EPV} = \frac{1}{2}(3+3) = 3$

$\widehat{VHM} = \frac{1}{2-1}\left[(1-\frac{3}{2})^2 + (2-\frac{3}{2})^2\right] - \frac{3}{3} = -\frac{1}{2}.$
The process variance is so large that it is not possible to estimate the 𝑉 𝐻𝑀 .
$$s_i^2 = \frac{\sum_{j=1}^{n_i} m_{ij}(X_{ij} - \bar{X}_i)^2}{n_i - 1}.$$
The weights 𝑚𝑖𝑗 are applied to the squared differences because the 𝑋𝑖𝑗 are
the averages of 𝑚𝑖𝑗 exposures. The weighted average of the sample variances
𝑠𝑖 2 for each risk 𝑖 in the population with weights proportional to the number
of (𝑛𝑖 − 1) observation periods will produce the expected value of the process
variance (𝐸𝑃 𝑉 ) estimate
$$\widehat{EPV} = \frac{\sum_{i=1}^{r}(n_i - 1)s_i^2}{\sum_{i=1}^{r}(n_i - 1)} = \frac{\sum_{i=1}^{r}\sum_{j=1}^{n_i} m_{ij}(X_{ij} - \bar{X}_i)^2}{\sum_{i=1}^{r}(n_i - 1)}.$$
The quantity $\widehat{EPV}$ is an unbiased estimator for the expected value of the process variance of one exposure for a risk chosen at random from the population.
To calculate an estimator for the variance in the hypothetical means (𝑉 𝐻𝑀 )
the squared differences of the individual risk sample means 𝑋̄ 𝑖 and population
mean 𝑋̄ are used. An unbiased estimator for the 𝑉 𝐻𝑀 is
$$\widehat{VHM} = \frac{\sum_{i=1}^{r} m_i(\bar{X}_i - \bar{X})^2 - (r-1)\widehat{EPV}}{m - \frac{1}{m}\sum_{i=1}^{r} m_i^2}.$$
Year                1  2  3  4
A Number of claims  0  2  2  3
A Insured vehicles  1  2  2  2
B Number of claims  0  0  1  2
B Insured vehicles  0  2  3  4
Solution. $\bar{x}_A = \frac{0+2+2+3}{1+2+2+2} = 1$

$\bar{x}_B = \frac{0+1+2}{2+3+4} = \frac{1}{3}$

$\bar{x} = \frac{7(1) + 9(1/3)}{7+9} = \frac{5}{8}$

$s_A^2 = \frac{1}{4-1}\left[1(0-1)^2 + 2(1-1)^2 + 2(1-1)^2 + 2(\frac{3}{2}-1)^2\right] = \frac{1}{2}$

$s_B^2 = \frac{1}{3-1}\left[2(0-\frac{1}{3})^2 + 3(\frac{1}{3}-\frac{1}{3})^2 + 4(\frac{1}{2}-\frac{1}{3})^2\right] = \frac{1}{6}$

$\widehat{EPV} = \left[3\left(\frac{1}{2}\right) + 2\left(\frac{1}{6}\right)\right]/(3+2) = \frac{11}{30} = 0.3667$

$\widehat{VHM} = \left[7(1-\frac{5}{8})^2 + 9(\frac{1}{3}-\frac{5}{8})^2 - (2-1)\frac{11}{30}\right] / \left[16 - \left(\frac{1}{16}\right)(7^2 + 9^2)\right] = 0.1757$

$K = \frac{0.3667}{0.1757} = 2.0871$

$m_A = 7, \quad m_B = 9$

$Z_A = \frac{7}{7+2.0871} = 0.7703, \quad Z_B = \frac{9}{9+2.0871} = 0.8118$

$\hat{\mu}_A = 0.7703(1) + (1-0.7703)(5/8) = 0.9139$

$\hat{\mu}_B = 0.8118(1/3) + (1-0.8118)(5/8) = 0.3882$
Solution. $\bar{x}_A = \frac{1}{3}(0+1+0) = \frac{1}{3}$, $\bar{x}_B = \frac{1}{3}(2+1+2) = \frac{5}{3}$

$\bar{x} = \frac{1}{2}\left(\frac{1}{3} + \frac{5}{3}\right) = 1$

With the Poisson assumption the estimated variance for risk A is $\hat{\sigma}_A^2 = \bar{x}_A = \frac{1}{3}$. Similarly, $\hat{\sigma}_B^2 = \bar{x}_B = \frac{5}{3}$.

$\widehat{EPV} = \frac{1}{2}\left(\frac{1}{3}\right) + \frac{1}{2}\left(\frac{5}{3}\right) = 1$. This is also $\bar{x}$ because of the Poisson assumption.

$\widehat{VHM} = \frac{1}{2-1}\left[(\frac{1}{3}-1)^2 + (\frac{5}{3}-1)^2\right] - \frac{1}{3} = \frac{5}{9}$

$K = \frac{1}{5/9} = \frac{9}{5}$

$Z_A = Z_B = \frac{3}{3+(9/5)} = \frac{5}{8}$

$\hat{\mu}_A = \frac{5}{8}\left(\frac{1}{3}\right) + \left(1-\frac{5}{8}\right)1 = \frac{7}{12}$

$\hat{\mu}_B = \frac{5}{8}\left(\frac{5}{3}\right) + \left(1-\frac{5}{8}\right)1 = \frac{17}{12}.$
Although we assumed that the number of claims for each risk was Poisson distributed in the prior example, we did not need this additional assumption because there was enough information to use nonparametric estimation. In fact, the Poisson assumption might not be appropriate because for risk B the sample mean is not equal to the sample variance: $\bar{x}_B = \frac{5}{3} \ne s_B^2 = \frac{1}{3}$.
Number of Claims in 5 Years   Number of policies
0                             923
1                             682
2                             249
3                              70
4                              51
5                              25
In your model you assume that the number of claims for each policyholder
has a Poisson distribution and that a policyholder’s expected number of claims
is constant through time. Use Bühlmann credibility to estimate the annual
expected number of claims for policyholders with 3 claims during the five-year
period.
Solution Let 𝜃𝑖 be the risk parameter for the 𝑖𝑡ℎ risk in the portfolio with mean
𝜇(𝜃𝑖 ) and variance 𝜎2 (𝜃𝑖 ). With the Poisson assumption 𝜇(𝜃𝑖 ) = 𝜎2 (𝜃𝑖 ). The ex-
pected value of the process variance is 𝐸𝑃 𝑉 = E(𝜎2 (𝜃𝑖 )) where the expectation
is taken across all risks in the population. Because of the Poisson assumption
for all risks it follows that $EPV = \mathrm{E}(\sigma^2(\theta_i)) = \mathrm{E}(\mu(\theta_i))$. An estimate for the annual expected number of claims is $\hat{\mu}(\theta_i) = (\text{observed number of claims})/5$. This can also serve as the estimate for the expected value of the process variance for a risk. Weighting the process variance estimates (or means) by the number of policies in each group gives the estimators
$$\widehat{EPV} = \frac{1}{2000}\left[923(0) + 682(0.2) + 249(0.4) + 70(0.6) + 51(0.8) + 25(1.0)\right] = 0.1719$$
and
$$\widehat{EPV} = \frac{923(0) + 682(0.20) + 249(0.40) + 70(0.60) + 51(0.80) + 25(1.00)}{2000} = 0.1719$$
$$\widehat{VHM} = \frac{1}{2000-1}\left[923(0-0.1719)^2 + 682(0.20-0.1719)^2 + 249(0.40-0.1719)^2 + 70(0.60-0.1719)^2 + 51(0.80-0.1719)^2 + 25(1-0.1719)^2\right] - \frac{0.1719}{5} = 0.0111$$
$$\hat{K} = \widehat{EPV}/\widehat{VHM} = 0.1719/0.0111 = 15.49$$
$$\hat{Z} = \frac{5}{5 + 15.49} = 0.2440$$
$$\hat{\mu}_{3\text{ claims}} = 0.2440(3/5) + (1-0.2440)(0.1719) = 0.2764.$$
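A short R sketch (illustrative, not from the text) reproduces these estimates from the frequency table; small differences arise from rounding intermediate values.

n_pol  <- c(923, 682, 249, 70, 51, 25)      # policies with 0, 1, ..., 5 claims
mu_hat <- (0:5) / 5                          # annual frequency estimate per group
EPV <- sum(n_pol * mu_hat) / sum(n_pol)      # 0.1719 (Poisson: mean = variance)
VHM <- sum(n_pol * (mu_hat - EPV)^2) / (sum(n_pol) - 1) - EPV / 5   # approx. 0.0111
K <- EPV / VHM                               # approx. 15.5 (text: 15.49, rounded inputs)
Z <- 5 / (5 + K)                             # approx. 0.244
Z * (3 / 5) + (1 - Z) * EPV                  # approx. 0.276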
In ratemaking applications, one often wants the weighted average of the risk estimators to reproduce the overall mean:
$$\bar{X} = \sum_{i=1}^{r}(m_i/m)\,\bar{X}_i = \sum_{i=1}^{r}(m_i/m)\,\hat{\mu}(\theta_i).$$
If this equation is satisfied, then the estimated losses for each risk add up to the population total, an important goal in ratemaking; this may not happen if the complement of credibility is applied to $\bar{X}$.
To achieve balance, we will set 𝑀̂ 𝑋 as the amount that is applied to the com-
plement of credibility and thus analyze the following equation:
$$\sum_{i=1}^{r}(m_i/m)\,\bar{X}_i = \sum_{i=1}^{r}(m_i/m)\left\{Z_i\bar{X}_i + (1-Z_i)\,\hat{M}_X\right\}.$$
$$\sum_{i=1}^{r} m_i\bar{X}_i = \sum_{i=1}^{r} m_iZ_i\bar{X}_i + \hat{M}_X\sum_{i=1}^{r} m_i(1-Z_i),$$
and
$$\hat{M}_X = \frac{\sum_{i=1}^{r} m_i(1-Z_i)\bar{X}_i}{\sum_{i=1}^{r} m_i(1-Z_i)}.$$
Using this value for 𝑀̂ 𝑋 will bring the credibility weighted estimators into bal-
ance.
If credibilities 𝑍𝑖 were computed using the Bühlmann-Straub model, then 𝑍𝑖 =
𝑚𝑖 /(𝑚𝑖 + 𝐾). The prior formula can be simplified using the following relation-
ship
$$m_i(1-Z_i) = m_i\left(1 - \frac{m_i}{m_i+K}\right) = m_i\left(\frac{(m_i+K)-m_i}{m_i+K}\right) = KZ_i.$$
Substituting and cancelling the common factor $K$ gives
$$\hat{M}_X = \frac{\sum_{i=1}^{r} Z_i\bar{X}_i}{\sum_{i=1}^{r} Z_i}.$$
Example. Balancing the credibility estimators. Recall the claims experience for risks A and B:

                      Year 1   Year 2   Year 3   Year 4
A Number of claims       0        2        2        3
A Insured vehicles       1        2        2        2
B Number of claims       0        0        1        2
B Insured vehicles       0        2        3        4
Solution. The credibilities from the prior example are $Z_A = \frac{7}{7+2.0871} = 0.7703$ and $Z_B = \frac{9}{9+2.0871} = 0.8118$. The sample means are $\bar{x}_A = 1$ and $\bar{x}_B = 1/3$. The balanced complement of credibility is
$$\hat{M}_X = \frac{0.7703(1) + 0.8118(1/3)}{0.7703 + 0.8118} = 0.6579.$$
Contributors
• Gary Dean, Ball State University is the author of the initial version of
this chapter. Email: [email protected] for chapter comments and suggested
improvements.
• Chapter reviewers include: Liang (Jason) Hong, Ambrose Lo, Ranee Thi-
agarajah, Hongjuan Zhou.
Chapter 10
Insurance Portfolio
Management including
Reinsurance
In 1998 freezing rain fell on eastern Ontario and southwestern Quebec for six days. The event deposited double the precipitation the area had experienced in any prior ice storm and resulted in a catastrophe producing in excess of 840,000 insurance claims. This number is 20% more than the number of claims caused by Hurricane Andrew, one of the largest natural disasters in the history of North America. The catastrophe caused approximately 1.44 billion Canadian dollars in insurance settlements, the highest loss burden in the history of Canada. This is not an isolated example: similar catastrophic events that caused extreme insurance losses include Hurricane Harvey, Superstorm Sandy, the 2011 Japanese earthquake and tsunami, and so forth.
In the context of insurance, a few large losses hitting a portfolio and then con-
verting into claims usually represent the greatest part of the indemnities paid
by insurance companies. The aforementioned losses, also called ‘extremes’, are
quantitatively modeled by the tails of the associated probability distributions.
From the quantitative modeling standpoint, relying on probabilistic models with
improper tails is rather daunting. For instance, periods of financial stress may
appear with a higher frequency than expected, and insurance losses may oc-
cur with worse severity. Therefore, the study of probabilistic behavior in the
tail portion of actuarial models is of utmost importance in the modern frame-
work of quantitative risk management. For this reason, this section is devoted
to the introduction of a few mathematical notions that characterize the tail
weight of random variables. The applications of these notions will benefit us in
the construction and selection of appropriate models with desired mathematical
properties in the tail portion, that are suitable for a given task.
Formally, define 𝑋 to be the (random) obligations that arise from a collection
(portfolio) of insurance contracts. (In earlier chapters, we used 𝑆 for aggregate
losses. Now, the focus is on distributional aspects of only the collective and so we
revert to the traditional 𝑋 notation.) We are particularly interested in studying
the right tail of the distribution of 𝑋, which represents the occurrence of large
losses. Informally, a random variable is said to be heavy-tailed if high probabilities
are assigned to large values. Note that this by no means implies that the probability density/mass function is increasing as the value of $X$ goes to infinity. Indeed, for a real-valued random variable, the pdf/pmf must diminish at infinity in order to guarantee that the total probability equals one. Instead, what we are concerned about is the rate of decay of the pdf/pmf. Unwelcome outcomes
are more likely to occur for an insurance portfolio that is described by a loss
random variable possessing a heavier (right) tail. Tail weight can be an absolute
or a relative concept. Specifically, for the former, we may consider a random
variable to be heavy-tailed if certain mathematical properties of the probability
distribution are met. For the latter, we can say the tail of one distribution is
heavier/lighter than the other if some tail measures are larger/smaller.
Several quantitative approaches have been proposed to classify and compare
tail weight. Among most of these approaches, the survival function serves as
the building block. In what follows, we introduce two simple yet useful tail
classification methods both of which are based on the behavior of the survival
function of 𝑋.
One way to classify tail weight is through the existence of raw moments. The $k$-th raw moment of a positive random variable $X$ can be written in terms of the survival function $S$:
$$\mu_k' = \int_0^\infty x^k f(x)\,dx = k\int_0^\infty x^{k-1}S(x)\,dx.$$
Hence the existence of the raw moments depends on the asymptotic behavior of the survival function at infinity. Namely, the faster the survival function decays to zero, the higher the order of finite moment ($k$) the associated random variable possesses. You may interpret $k^*$ to be the largest value of $k$ so that the moment is finite. Formally, define $k^* = \sup\{k > 0 : \mu_k' < \infty\}$, where $\sup$ represents the supremum operator. This observation leads us to the moment-based tail weight classification method, which is defined formally next.
Definition 10.1.
• If all the positive raw moments exist, namely the maximal order of finite moment $k^* = \infty$, then $X$ is said to be light tailed based on the moment method.
• If 𝑘∗ < ∞, then 𝑋 is said to be heavy tailed based on the moment method.
• Moreover, for two positive loss random variables 𝑋1 and 𝑋2 with maximal
orders of moment 𝑘1∗ and 𝑘2∗ respectively, we say 𝑋1 has a heavier (right)
tail than 𝑋2 if 𝑘1∗ ≤ 𝑘2∗ .
The first part of Definition 10.1 is an absolute concept of tail weight, while the
second part is a relative concept of tail weight which compares the (right) tails
between two distributions. Next, we present a few examples that illustrate the
applications of the moment-based method for comparing tail weight.
Example 10.2.1. Light tail nature of the gamma distribution. Let $X \sim gamma(\alpha, \theta)$, with $\alpha > 0$ and $\theta > 0$; show that all positive raw moments are finite.

Solution. Using the substitution $y = x/\theta$,
$$\mu_k' = \int_0^\infty x^k\,\frac{x^{\alpha-1}e^{-x/\theta}}{\Gamma(\alpha)\theta^\alpha}\,dx = \int_0^\infty (y\theta)^k\,\frac{(y\theta)^{\alpha-1}e^{-y}}{\Gamma(\alpha)\theta^\alpha}\,\theta\,dy = \frac{\theta^k}{\Gamma(\alpha)}\,\Gamma(\alpha+k) < \infty.$$
Since all the positive moments exist, i.e., 𝑘∗ = ∞, in accordance with the
moment-based classification method in Definition 10.1, the gamma distribution
is light-tailed.
Example 10.2.2. Light tail nature of the Weibull distribution. Let $X \sim Weibull(\tau, \theta)$, with $\tau > 0$ and $\theta > 0$; show that all positive raw moments are finite.

Solution. Using the substitution $y = x^\tau$,
$$\mu_k' = \int_0^\infty x^k\,\frac{\tau x^{\tau-1}}{\theta^\tau}\,e^{-(x/\theta)^\tau}\,dx = \int_0^\infty \frac{y^{k/\tau}}{\theta^\tau}\,e^{-y/\theta^\tau}\,dy = \theta^k\,\Gamma(1+k/\tau) < \infty.$$
Again, due to the existence of all the positive moments, the Weibull distribution
is light-tailed.
The gamma and Weibull distributions are used quite extensively in actuarial practice. Applications of these two distributions are vast and include, but
are not limited to, insurance claim severity modeling, solvency assessment, loss
reserving, aggregate risk approximation, reliability engineering and failure anal-
ysis. We have thus far seen two examples of using the moment-based method
to analyze light-tailed distributions. We document a heavy-tailed example in
what follows.
Example 10.2.3. Heavy tail nature of the Pareto distribution. Let
𝑋 ∼ 𝑃 𝑎𝑟𝑒𝑡𝑜(𝛼, 𝜃), with 𝛼 > 0 and 𝜃 > 0, then for 𝑘 > 0
$$\mu_k' = \int_0^\infty x^k\,\frac{\alpha\theta^\alpha}{(x+\theta)^{\alpha+1}}\,dx = \alpha\theta^\alpha\int_\theta^\infty (y-\theta)^k\,y^{-(\alpha+1)}\,dy,$$
substituting $y = x + \theta$. Consider the comparison integral
$$g_k = \int_\theta^\infty y^{k-\alpha-1}\,dy\;\begin{cases} < \infty, & \text{for } k < \alpha; \\ = \infty, & \text{for } k \ge \alpha. \end{cases}$$
Meanwhile,
$$\lim_{y\to\infty}\frac{(y-\theta)^k\,y^{-(\alpha+1)}}{y^{k-\alpha-1}} = \lim_{y\to\infty}\left(1-\theta/y\right)^k = 1.$$
Application of the limit comparison theorem for improper integrals yields that $\mu_k'$ is finite if and only if $g_k$ is finite. Hence we can conclude that the raw moments of Pareto random variables exist only up to $k < \alpha$, i.e., $k^* = \alpha$, and thus the distribution is heavy-tailed. What is more, the maximal order of finite moment depends only on the shape parameter $\alpha$.

Another way to compare tail weight is to examine the limiting ratio of two survival functions. For two random variables $X$ and $Y$, define
$$\gamma = \lim_{t\to\infty}\frac{S_X(t)}{S_Y(t)}.$$
We say that
• $X$ has a heavier right tail than $Y$ if $\gamma = \infty$;
• $X$ and $Y$ have proportionally equivalent right tails if $\gamma = c$ for some constant $c \in (0, \infty)$;
• $X$ has a lighter right tail than $Y$ if $\gamma = 0$.
Example 10.2.4. Comparison of Pareto to Weibull distributions. Let $X \sim Pareto(\alpha, \theta)$ and $Y \sim Weibull(\tau, \theta)$, for $\alpha > 0$, $\tau > 0$, and $\theta > 0$; show that the Pareto distribution has a heavier right tail than the Weibull.

Solution.
$$\lim_{t\to\infty}\frac{S_X(t)}{S_Y(t)} = \lim_{t\to\infty}\frac{(1+t/\theta)^{-\alpha}}{\exp\{-(t/\theta)^\tau\}} = \lim_{t\to\infty}\frac{\exp\{t/\theta^\tau\}}{(1+t^{1/\tau}/\theta)^{\alpha}} = \lim_{t\to\infty}\frac{\sum_{i=0}^{\infty}\left(t/\theta^\tau\right)^i/i!}{(1+t^{1/\tau}/\theta)^{\alpha}} = \lim_{t\to\infty}\sum_{i=0}^{\infty}\left(t^{-i/\alpha} + \frac{t^{(1/\tau - i/\alpha)}}{\theta}\right)^{-\alpha}\Big/\left(\theta^{\tau i}\, i!\right) = \infty,$$
where the second equality follows from the substitution $t \mapsto t^{1/\tau}$.
Therefore, the Pareto distribution has a heavier tail than the Weibull distribution. One may also observe that exponentials go to infinity faster than polynomials, so the limit above must be infinite.
For some distributions whose survival functions do not admit explicit expressions, we may find the following alternative formula useful:
$$\lim_{t\to\infty}\frac{S_X(t)}{S_Y(t)} = \lim_{t\to\infty}\frac{S_X'(t)}{S_Y'(t)} = \lim_{t\to\infty}\frac{-f_X(t)}{-f_Y(t)} = \lim_{t\to\infty}\frac{f_X(t)}{f_Y(t)},$$
given that the density functions exist. This is an application of L'Hôpital's Rule from calculus.
Example 10.2.5. Comparison of Pareto to gamma distributions. Let
𝑋 ∼ 𝑃 𝑎𝑟𝑒𝑡𝑜(𝛼, 𝜃) and 𝑌 ∼ 𝑔𝑎𝑚𝑚𝑎(𝛼, 𝜃), for 𝛼 > 0 and 𝜃 > 0. Show that the
Pareto has a heavier right tail than the gamma.
Solution. Taking the ratio of the Pareto and gamma density functions,
$$\lim_{t\to\infty}\frac{f_X(t)}{f_Y(t)} \propto \lim_{t\to\infty}\frac{e^{t/\theta}}{(t+\theta)^{\alpha+1}\,t^{\alpha-1}} = \infty,$$
since the exponential term dominates. Hence the Pareto has a heavier right tail than the gamma.
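The divergence can also be seen numerically. The following R sketch (with illustrative parameter values, not from the text) evaluates the density ratio at increasingly large arguments.

alpha <- 3; theta <- 1000                  # hypothetical common parameters
t <- 10^(3:6)
f_pareto <- alpha * theta^alpha / (t + theta)^(alpha + 1)
f_gamma  <- dgamma(t, shape = alpha, scale = theta)
f_pareto / f_gamma                         # the ratio grows without bound in t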
In the previous section, we studied two methods for classifying the weight of
distribution tails. We may claim that the risk associated with one distribution is
more dangerous (asymptotically) than the others if the tail is heavier. However,
knowing one risk is more dangerous (asymptotically) than the others may not
provide sufficient information for a sophisticated risk management purpose, and
in addition, one is also interested in quantifying how much more. In fact, the
magnitude of risk associated with a given loss distribution is an essential input
for many insurance applications, such as actuarial pricing, reserving, hedging,
insurance regulatory oversight, and so forth.
One can check that all the aforementioned functions are risk measures: we input the loss random variable and the function outputs a numerical value. On a different note, for any real-valued $\alpha \neq 0$ and $\beta \neq 0$, the function $H^*(X) = \alpha X^\beta$ is not a risk measure because $H^*$ produces another random variable rather than a single numerical value.
Since risk measures are scalar measures which aim to use a single numerical
value to describe the stochastic nature of loss random variables, it should not
be surprising to us that there is no risk measure which can capture all the risk
information of the associated random variables. Therefore, when seeking useful
risk measures, it is important for us to keep in mind that the measures should
be at least
• interpretable practically;
• able to reflect the most critical information of risk underpinning the loss
distribution.
Several risk measures have been developed in the literature. Unfortunately,
there is no best risk measure that can outperform the others, and the selection
of appropriate risk measure depends mainly on the application questions at hand.
In this respect, it is imperative to emphasize that risk is a subjective concept,
and thus even given the same problem, there are multifarious approaches to
assess risk. However, for many risk management applications, there is wide agreement that economically sound risk measures should satisfy four major axioms, which we describe in detail next. Risk measures that satisfy these axioms are termed coherent risk measures.
Consider a risk measure 𝐻(⋅). It is said to be a coherent risk measure for two
random variables 𝑋 and 𝑌 if the following axioms are satisfied.
• Axiom 1. Subadditivity: 𝐻(𝑋 + 𝑌 ) ≤ 𝐻(𝑋) + 𝐻(𝑌 ). The economic
implication of this axiom is that diversification benefits exist if different
risks are combined.
Moreover, the standard deviation also does not satisfy the monotonicity prop-
erty. To see this, consider the following two random variables:
$$\Pr[X = 0] = 0.25, \quad \Pr[X = 4] = 0.75, \qquad (10.2)$$
$$\Pr[Y = 4] = 1. \qquad (10.3)$$
You can check that $\Pr[X \le Y] = 1$, but $SD(X) = \sqrt{4^2 \cdot 0.25 \cdot 0.75} = \sqrt{3} > SD(Y) = 0$.
We have so far checked that E[⋅] is a coherent risk measure, but not SD(⋅). Let us
now proceed to study the coherent property for the standard deviation principle
(10.1) which is a linear combination of coherent and incoherent risk measures.
It only remains to verify the monotonicity property, which may or may not be satisfied depending on the value of $\alpha$. To see this, consider again the setup of (10.2) and (10.3), in which $\Pr[X \le Y] = 1$. Let $\alpha = 0.1\sqrt{3}$; then $H_{\text{SD}}(X) = 3 + 0.3 = 3.3 < H_{\text{SD}}(Y) = 4$ and the monotonicity condition is met. On the other hand, let $\alpha = \sqrt{3}$; then $H_{\text{SD}}(X) = 3 + 3 = 6 > H_{\text{SD}}(Y) = 4$ and the monotonicity condition is not satisfied. More precisely, by setting
$$H_{\text{SD}}(X) = 3 + \alpha\sqrt{3} \le 4 = H_{\text{SD}}(Y),$$
we find that the monotonicity condition is satisfied in this example only for $0 \le \alpha \le 1/\sqrt{3}$; for larger values of $\alpha$ the standard deviation principle $H_{\text{SD}}$ fails monotonicity and hence is not coherent.
The literature on risk measures has been growing rapidly in popularity and
importance. In the succeeding two subsections, we introduce two indices which
have recently earned an unprecedented amount of interest among theoreticians,
practitioners, and regulators, namely the Value-at-Risk ($VaR$) and the Tail Value-at-Risk ($TVaR$) measures. The economic rationale behind these
two popular risk measures is similar to that for the tail classification methods
introduced in the previous section, with which we hope to capture the risk of
extremal losses represented by the distribution tails.
10.3.2 Value-at-Risk
In Section 4.1.1, we defined the quantile of a distribution. We now look to a
special case of this and offer the formal definition of the value-at-risk, or VaR.
Here, $\inf$ is the infimum operator, so the $VaR$ measure outputs the smallest value of $X$ such that the associated cdf first exceeds or equals $q$. This is simply the quantile that was introduced in Section 3.1.2 and further developed in Section 4.1.1.
For example, if $X \sim Normal(\mu, \sigma^2)$, then
$$q = F_X(VaR_q[X]) = \Pr\left[(X-\mu)/\sigma \le (VaR_q[X]-\mu)/\sigma\right] = \Phi\left((VaR_q[X]-\mu)/\sigma\right).$$
Therefore, we have
$$VaR_q[X] = \Phi^{-1}(q)\,\sigma + \mu.$$
We have thus far seen a number of examples of the $VaR$ for continuous random variables; let us now consider an example concerning the $VaR$ for a discrete random variable.
Example 10.3.4. 𝑉 𝑎𝑅 for a discrete random variable. Consider an
insurance loss random variable with the following probability distribution:
$$\Pr[X = x] = \begin{cases} 0.75, & \text{for } x = 1;\\ 0.20, & \text{for } x = 3;\\ 0.05, & \text{for } x = 4. \end{cases}$$
The cdf of $X$ is
$$F_X(x) = \begin{cases} 0, & x < 1;\\ 0.75, & 1 \le x < 3;\\ 0.95, & 3 \le x < 4;\\ 1, & 4 \le x. \end{cases}$$
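As a quick illustration (a sketch, not from the text), the $VaR$ of this discrete distribution can be computed in R as the smallest value of $x$ whose cumulative probability reaches $q$.

x  <- c(1, 3, 4)
p  <- c(0.75, 0.20, 0.05)
Fx <- cumsum(p)                        # 0.75, 0.95, 1.00
VaR <- function(q) x[min(which(Fx >= q))]
VaR(0.75)   # 1
VaR(0.90)   # 3
VaR(0.95)   # 3
VaR(0.96)   # 4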
and thus these two loss distributions have the same level of risk according to
𝑉 𝑎𝑅0.95 . However, 𝑌 is riskier than 𝑋 if extremal losses are of major concern
since 𝑋 is bounded above while 𝑌 is unbounded. Simply quantifying risk by
using 𝑉 𝑎𝑅 at a specific confidence level could be misleading and may not reflect
the true nature of risk.
As a remedy, the Tail Value-at-Risk (𝑇 𝑉 𝑎𝑅) was proposed to measure the
extremal losses that are above a given level of 𝑉 𝑎𝑅 as an average. We document
the definition of 𝑇 𝑉 𝑎𝑅 in what follows. For the sake of simplicity, we are going
to confine ourselves to continuous positive random variables only, which are
more frequently used in the context of insurance risk management. We refer
the interested reader to Hardy (2006) for a more comprehensive discussion of
𝑇 𝑉 𝑎𝑅 for both discrete and continuous random variables.
Definition 10.4. Fix $q \in (0,1)$; the tail value-at-risk of a (continuous) random variable $X$ is formulated as
$$TVaR_q[X] = \frac{1}{1-q}\int_{\pi_q}^{\infty} x f_X(x)\,dx, \qquad (10.5)$$
where $\pi_q = VaR_q[X]$ denotes the value-at-risk at level $q$.
For a normal random variable $X$ with mean $\mu$ and standard deviation $\sigma$, write $X = \mu + \sigma Z$ with $Z$ standard normal, so that $TVaR_q[X] \overset{(1)}{=} \mu + \sigma\,TVaR_q[Z]$, where (1) holds because of the results reported in Example 10.3.2. Next, we turn to study $TVaR_q[Z] = \mathrm{E}[Z \mid Z > VaR_q[Z]]$. Let $\omega(q) = (\Phi^{-1}(q))^2/2$; we have
$$(1-q)\,TVaR_q[Z] = \int_{\Phi^{-1}(q)}^{\infty} z\,\frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz = \int_{\omega(q)}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{-x}\,dx = \frac{1}{\sqrt{2\pi}}\,e^{-\omega(q)} = \phi(\Phi^{-1}(q)).$$
Thus,
$$TVaR_q[X] = \frac{\phi(\Phi^{-1}(q))}{1-q}\,\sigma + \mu.$$
Example. $TVaR$ of a lognormal distribution. Let $X$ follow a lognormal distribution with parameters $\mu$ and $\sigma$; show that
$$TVaR_q[X] = \frac{e^{\mu+\sigma^2/2}}{1-q}\left[1 - \Phi\left(\Phi^{-1}(q) - \sigma\right)\right].$$
Solution. The pdf of $X$ is
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}\,x}\exp\left\{-\frac{(\log x - \mu)^2}{2\sigma^2}\right\}, \quad \text{for } x > 0.$$
Then
$$TVaR_q[X] = \frac{1}{1-q}\int_{\pi_q}^{\infty} x f_X(x)\,dx = \frac{1}{1-q}\int_{\pi_q}^{\infty}\frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(\log x-\mu)^2}{2\sigma^2}\right\}dx \overset{(1)}{=} \frac{1}{1-q}\int_{\omega(q)}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}w^2+\sigma w+\mu}\,dw = \frac{e^{\mu+\sigma^2/2}}{1-q}\int_{\omega(q)}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(w-\sigma)^2}\,dw = \frac{e^{\mu+\sigma^2/2}}{1-q}\left[1-\Phi\left(\omega(q)-\sigma\right)\right], \qquad (10.6)$$
where (1) holds by applying the change of variable $w = (\log x - \mu)/\sigma$, and $\omega(q) = (\log\pi_q - \mu)/\sigma$. Invoking the formula for the $VaR$ of a lognormal random variable reported in Example 10.3.2 gives $\omega(q) = \Phi^{-1}(q)$, so expression (10.6) simplifies to
$$TVaR_q[X] = \frac{e^{\mu+\sigma^2/2}}{1-q}\left[1 - \Phi\left(\Phi^{-1}(q) - \sigma\right)\right].$$
Alternatively, integration by parts yields
$$TVaR_q[X] = \frac{1}{1-q}\left[-x\,S_X(x)\Big|_{\pi_q}^{\infty} + \int_{\pi_q}^{\infty} S_X(x)\,dx\right] = \pi_q + \frac{1}{1-q}\int_{\pi_q}^{\infty} S_X(x)\,dx.$$
For example, if $X$ is exponential with mean $\theta$, then $S_X(x) = e^{-x/\theta}$ and
$$\pi_q = -\theta\log(1-q),$$
$$TVaR_q[X] = \pi_q + \frac{1}{1-q}\int_{\pi_q}^{\infty} e^{-x/\theta}\,dx = \pi_q + \frac{\theta\,e^{-\pi_q/\theta}}{1-q} = \pi_q + \theta.$$
More generally,
$$TVaR_q[X] = \frac{1}{1-q}\int_{\pi_q}^{\infty}(x - \pi_q + \pi_q)\,f_X(x)\,dx = \pi_q + \frac{1}{1-q}\int_{\pi_q}^{\infty}(x-\pi_q)\,f_X(x)\,dx = \pi_q + e_X(\pi_q) = \pi_q + \frac{\mathrm{E}[X] - \mathrm{E}[X\wedge\pi_q]}{1-q}, \qquad (10.7)$$
where 𝑒𝑋 (𝑑) = E[𝑋 − 𝑑|𝑋 > 𝑑] for 𝑑 > 0 denotes the mean excess loss function.
For many commonly used parametric distributions, the formulas for calculating
E[𝑋] and E[𝑋 ∧ 𝜋𝑞 ] can be found in a table of distributions.
Example 10.3.9. 𝑇 𝑉 𝑎𝑅 of a Pareto distribution. Consider a loss random
variable 𝑋 ∼ 𝑃 𝑎𝑟𝑒𝑡𝑜(𝜃, 𝛼) with 𝜃 > 0 and 𝛼 > 0. The cdf of 𝑋 is given by
$$F_X(x) = 1 - \left(\frac{\theta}{\theta+x}\right)^{\alpha}, \quad \text{for } x > 0.$$
Using (10.7) and the Pareto formulas for $\mathrm{E}[X]$ and $\mathrm{E}[X\wedge\pi_q]$,
$$TVaR_q[X] = \pi_q + \frac{\theta}{\alpha-1}\cdot\frac{\left(\theta/(\theta+\pi_q)\right)^{\alpha-1}}{\left(\theta/(\theta+\pi_q)\right)^{\alpha}} = \pi_q + \frac{\theta}{\alpha-1}\left(\frac{\pi_q+\theta}{\theta}\right) = \pi_q + \frac{\pi_q+\theta}{\alpha-1},$$
for $\alpha > 1$.
A closely related measure is the conditional value at risk:
$$CVaR_q[X] = \frac{1}{1-q}\int_q^1 VaR_\alpha[X]\,d\alpha.$$
The conditional value at risk is also known as the average value at risk ($AVaR$) and the expected shortfall ($ES$). It can be shown that $CVaR_q[X] = TVaR_q[X]$ when $\Pr(X = VaR_q[X]) = 0$, which holds for continuous random variables. That is, if $X$ is continuous, then via a change of variables we can rewrite equation (10.5) as
$$TVaR_q[X] = \frac{1}{1-q}\int_q^1 VaR_\alpha[X]\,d\alpha. \qquad (10.9)$$
This alternative formula (10.9) tells us that $TVaR$ is the average of $VaR_\alpha[X]$ over varying confidence levels $\alpha \in [q, 1]$. Therefore, the $TVaR$ effectively resolves most of the limitations of $VaR$ outlined in the previous subsection. First, due to the averaging effect, the $TVaR$ may be less sensitive to a change in confidence level than $VaR$. Second, all the extremal losses above the $(1-q)\times 100\%$ worst probable event are taken into account.
In this respect, one can see that for any given $q \in (0, 1)$,
$$TVaR_q[X] \ge VaR_q[X].$$
10.4 Reinsurance
Under a proportional, or quota share, agreement the amounts paid by the primary insurer and the reinsurer are
$$Y_{insurer} = c\,X \qquad \text{and} \qquad Y_{reinsurer} = (1-c)\,X,$$
where $c \in (0,1)$ denotes the proportion retained by the insurer. Note that $Y_{insurer} + Y_{reinsurer} = X$.
Example 10.4.1. Distribution of losses under quota share. To develop
an intuition for the effect of quota-share agreement on the distribution of losses,
the following is a short R demonstration using simulation. The accompanying
figure provides the relative shapes of the distributions of total losses, the retained
portion (of the insurer), and the reinsurer’s portion.
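The original demonstration code is not reproduced here; the following R sketch, with an assumed gamma loss distribution and an assumed retained proportion, generates comparable plots.

set.seed(2020)
c_share <- 0.75                                  # assumed retained proportion
X <- rgamma(10000, shape = 2, scale = 500)       # assumed total losses
par(mfrow = c(1, 3))
plot(density(X), main = "Total losses", xlab = "Loss")
plot(density(c_share * X), main = "Insurer (retained)", xlab = "Loss")
plot(density((1 - c_share) * X), main = "Reinsurer", xlab = "Loss")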
for some generic function 𝑔(⋅) (known as the retention function). So that the
insurer does not retain more than the loss, we consider only functions so that
𝑔(𝑥) ≤ 𝑥. Suppose further that the insurer only cares about the variability of
retained claims and is indifferent to the choice of 𝑔 as long as 𝑉 𝑎𝑟(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) stays
the same and equals, say, 𝑄. Then, the following result shows that the quota
share reinsurance treaty minimizes the reinsurer’s uncertainty as measured by
𝑉 𝑎𝑟(𝑌𝑟𝑒𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ).
Proposition. Suppose that $Var(Y_{insurer}) = Q$. Then, $Var((1-c)X) \le Var(X - g(X))$ for all $g(\cdot)$ such that $Var(g(X)) = Q$, where $c = \sqrt{Q/Var(X)}$.
Proof of the Proposition. With $Y_{reinsurer} = X - Y_{insurer}$,
$$Var(Y_{reinsurer}) = Var(X) + Var(Y_{insurer}) - 2\,Cov(X, Y_{insurer}) = Var(X) + Q - 2\,Corr(X, Y_{insurer})\sqrt{Q\,Var(X)}.$$
In this expression, we see that 𝑄 and 𝑉 𝑎𝑟(𝑋) do not change with the choice
of 𝑔. Thus, we can minimize 𝑉 𝑎𝑟(𝑌𝑟𝑒𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) by maximizing the correlation
𝐶𝑜𝑟𝑟(𝑋, 𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ). If we use a quota share reinsurance agreement, then
𝐶𝑜𝑟𝑟(𝑋, 𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) = 𝐶𝑜𝑟𝑟(𝑋, (1 − 𝑐)𝑋) = 1, the maximum possible correlation.
This establishes the proposition.
The proposition is intuitively appealing - with quota share insurance, the rein-
surer shares the responsibility for very large claims in the tail of the distribution.
This is in contrast to non-proportional agreements where reinsurers take respon-
sibility for the very large claims.
In general, let us consider a variation of the basic quota share agreement where the amount retained by the insurer may vary with each risk, say $c_i$. Thus, the insurer's portion of the portfolio risk is $Y_{insurer} = \sum_{i=1}^{n} c_iX_i$. What is the best choice of the proportions $c_i$?
To minimize the insurer's variance subject to a revenue constraint $\mathrm{E}(Y_{insurer}) = K$, form the Lagrangian (assuming independent risks)
$$L = Var(Y_{insurer}) - \lambda\left(\mathrm{E}(Y_{insurer}) - K\right) = \sum_{i=1}^{n} c_i^2\,Var(X_i) - \lambda\left(\sum_{i=1}^{n} c_i\,\mathrm{E}(X_i) - K\right).$$
Taking a partial derivative with respect to 𝜆 and setting this equal to zero
simply means that the constraint, 𝐸(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) = 𝐾, is enforced and we have to
choose the proportions 𝑐𝑖 to satisfy this constraint. Moreover, taking the partial
derivative with respect to each proportion 𝑐𝑖 yields
$$\frac{\partial}{\partial c_i}L = 2c_i\,Var(X_i) - \lambda\,\mathrm{E}(X_i) = 0,$$
so that
$$c_i = \frac{\lambda}{2}\,\frac{\mathrm{E}(X_i)}{Var(X_i)}.$$
The constraint then determines $\lambda$:
$$K = \sum_{i=1}^{n} c_i\,\mathrm{E}(X_i) = \frac{\lambda}{2}\sum_{i=1}^{n}\frac{\mathrm{E}(X_i)^2}{Var(X_i)}.$$
From the math, it turns out that the constant for the $i$th risk, $c_i$, is proportional to $\frac{\mathrm{E}(X_i)}{Var(X_i)}$. This is intuitively appealing. Other things being equal, a higher
revenue as measured by 𝐸(𝑋𝑖 ) means a higher value of 𝑐𝑖 . In the same way,
a higher value of uncertainty as measured by 𝑉 𝑎𝑟(𝑋𝑖 ) means a lower value of
𝑐𝑖 . The proportional scaling factor is determined by the revenue requirement
𝐸(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) = 𝐾. The following example helps to develop a feel for this rela-
tionship.
Example 10.4.2. Three Pareto risks. Consider three risks that have a
Pareto distribution, each having a different set of parameters (so they are inde-
pendent but non-identical). Specifically, use the parameters:
• 𝛼1 = 3, 𝜃1 = 1000 for the first risk 𝑋1 ,
• 𝛼2 = 3, 𝜃2 = 2000 for the second risk 𝑋2 , and
• 𝛼3 = 4, 𝜃3 = 3000 for the third risk 𝑋3 .
Provide a graph that gives the values of $c_1$, $c_2$, and $c_3$ for a required revenue $K$. Note that these values increase linearly with $K$.
Solution.
Figure: the retained proportions $c_1$, $c_2$, and $c_3$, each increasing linearly in the required revenue $K$.
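A sketch of the computation behind such a graph (not the text's own code), using the Pareto moment formulas $\mathrm{E}(X) = \theta/(\alpha-1)$ and $Var(X) = \alpha\theta^2/\left((\alpha-1)^2(\alpha-2)\right)$:

alpha <- c(3, 3, 4); theta <- c(1000, 2000, 3000)
EX <- theta / (alpha - 1)
VX <- alpha * theta^2 / ((alpha - 1)^2 * (alpha - 2))
K  <- seq(0, 1500, by = 10)                  # required revenue levels
lam2  <- K / sum(EX^2 / VX)                  # lambda / 2 from the constraint
c_mat <- outer(lam2, EX / VX)                # c_i = (lambda/2) E(X_i)/Var(X_i)
matplot(K, c_mat, type = "l", lty = 1, xlab = "K", ylab = "proportion")
legend("topleft", c("c1", "c2", "c3"), col = 1:3, lty = 1)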
Stop-Loss Reinsurance. Under a stop-loss arrangement, the insurer retains losses up to a retention level $M$:
$$Y_{insurer} = \begin{cases} X & \text{for } X \le M \\ M & \text{for } X > M \end{cases} = \min(X, M) = X \wedge M$$
and
$$Y_{reinsurer} = \begin{cases} 0 & \text{for } X \le M \\ X - M & \text{for } X > M \end{cases} = \max(0, X - M).$$
Suppose that the insurer is indifferent among retention functions $g$ with the same expected retained claims $\mathrm{E}(Y_{insurer}) = K$ and wishes to minimize its uncertainty (as measured by the variance). Then, the following result shows that the stop-loss reinsurance treaty minimizes the insurer's uncertainty as measured by $Var(Y_{insurer})$.

Proposition. Suppose that $\mathrm{E}(Y_{insurer}) = K$. Then, $Var(X\wedge M) \le Var(g(X))$ for all $g(\cdot)$ with $\mathrm{E}(g(X)) = K$, where $M$ is such that $\mathrm{E}(X\wedge M) = K$.
Proof of the Proposition. Add and subtract a constant $M$ and expand the square to get
$$Var(g(X)) = \mathrm{E}\left[(g(X) - K)^2\right] = \mathrm{E}\left[(g(X) - M)^2\right] - (M - K)^2,$$
because $\mathrm{E}(g(X)) = K$.
Now, for any retention function, we have 𝑔(𝑋) ≤ 𝑋, that is, the insurer’s
retained claims are less than or equal to total claims. Using the notation
𝑔𝑆𝐿 (𝑋) = 𝑋 ∧ 𝑀 for stop-loss insurance, we have
$$M - g_{SL}(X) = M - (X\wedge M) = \max(M - X, 0) \le \max(M - g(X), 0).$$
Squaring each side and taking expectations yields $\mathrm{E}\left[(M - g_{SL}(X))^2\right] \le \mathrm{E}\left[(M - g(X))^2\right]$, and hence $Var(g_{SL}(X)) \le Var(g(X))$, which establishes the proposition.
Excess of Loss
A closely related form of non-proportional reinsurance is the excess of loss cov-
erage. Under this contract, we assume that the total risk $X$ can be thought of as composed of $n$ separate risks $X_1, \ldots, X_n$, each subject to an upper limit, say, $M_i$. So the insurer retains
$$Y_{insurer} = \sum_{i=1}^{n} Y_{i,insurer}, \qquad \text{where } Y_{i,insurer} = X_i \wedge M_i,$$
and the reinsurer is responsible for the excess, $Y_{reinsurer} = X - Y_{insurer}$. The retention limits may vary by risk or may be the same for all risks, that is, $M_i = M$ for all $i$.
As before, minimize the insurer's variance subject to the revenue constraint via the Lagrangian
$$L = Var(Y_{insurer}) - \lambda\left(\mathrm{E}(Y_{insurer}) - K\right) = \sum_{i=1}^{n} Var(X_i\wedge M_i) - \lambda\left(\sum_{i=1}^{n}\mathrm{E}(X_i\wedge M_i) - K\right).$$
Here, recall that
$$\mathrm{E}(X\wedge M) = \int_0^M \left(1 - F(x)\right)dx \qquad \text{and} \qquad \mathrm{E}\left[(X\wedge M)^2\right] = 2\int_0^M x\left(1 - F(x)\right)dx.$$
Taking a partial derivative of 𝐿 with respect to 𝜆 and setting this equal to zero
simply means that the constraint, 𝐸(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) = 𝐾, is enforced and we have
to choose the limits 𝑀𝑖 to satisfy this constraint. Moreover, taking the partial
derivative with respect to each limit 𝑀𝑖 yields
$$\frac{\partial}{\partial M_i}L = \frac{\partial}{\partial M_i}Var(X_i\wedge M_i) - \lambda\,\frac{\partial}{\partial M_i}\mathrm{E}(X_i\wedge M_i) = \frac{\partial}{\partial M_i}\left(\mathrm{E}\left[(X_i\wedge M_i)^2\right] - \left(\mathrm{E}[X_i\wedge M_i]\right)^2\right) - \lambda\left(1 - F_i(M_i)\right) = 2M_i\left(1-F_i(M_i)\right) - 2\,\mathrm{E}(X_i\wedge M_i)\left(1-F_i(M_i)\right) - \lambda\left(1-F_i(M_i)\right).$$
Setting $\frac{\partial}{\partial M_i}L = 0$ and solving for $\lambda$, we get
$$\lambda = 2M_i - 2\,\mathrm{E}(X_i\wedge M_i),$$
so that $M_i - \mathrm{E}(X_i\wedge M_i) = \lambda/2$ for every risk $i$.
From the math, it turns out that the retention limit less the expected insurer’s
claims, 𝑀𝑖 − 𝐸(𝑋𝑖 ∧ 𝑀𝑖 ), is the same for all risks. This is intuitively appealing.
Example 10.4.3. Excess of loss for three Pareto risks. Consider three
risks that have a Pareto distribution, each having a different set of parameters
(so they are independent but non-identical). Use the same set of parameters as
in Example 10.4.2. For this example, determine the optimal retention limits and compare the resulting distributions.

Solution
a. We first optimize the Lagrangian using the R package alabama for Augmented
Lagrangian Adaptive Barrier Minimization Algorithm.
[1] 1344.135
[1] 1344.133
[1] 1344.133

The three printed values of $M_i - \mathrm{E}(X_i\wedge M_i)$ are numerically equal, consistent with the first-order condition derived above.
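The optimization itself is not reproduced here. As an illustration, the retention limits can be recovered from the first-order condition $M_i - \mathrm{E}(X_i\wedge M_i) = \lambda/2$, taking the common value printed above (about 1344.13) as $\lambda/2$ and using the Pareto limited expected value; this is a sketch under those assumptions, not the text's own code.

lev_pareto <- function(M, alpha, theta)      # E(min(X, M)) for a Pareto risk
  theta / (alpha - 1) * (1 - (theta / (theta + M))^(alpha - 1))
alpha <- c(3, 3, 4); theta <- c(1000, 2000, 3000)
M <- sapply(1:3, function(i)
  uniroot(function(m) m - lev_pareto(m, alpha[i], theta[i]) - 1344.13,
          c(1, 1e7))$root)
M                                            # implied retention limits
sum(lev_pareto(M, alpha, theta))             # implied revenue constraint K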
Layers of Coverage
One can also extend non-proportional stop-loss treaties by introducing addi-
tional parties to the contract. For example, instead of simply an insurer and
reinsurer or an insurer and a policyholder, think about the situation with all
three parties, a policyholder, insurer, and reinsurer, who agree on how to share
a risk. More generally, we consider 𝑘 parties. If 𝑘 = 3, it could be an insurer
and two different reinsurers.
Example 10.4.4. Layers of coverage for three parties.
• Suppose that there are 𝑘 = 3 parties. The first party is responsible for
the first 100 of claims, the second responsible for claims from 100 to 3000,
and the third responsible for claims above 3000.
• If there are four claims in the amounts 50, 600, 1800 and 4000, then they would be allocated to the parties as follows:

Claim amount   Party 1   Party 2   Party 3
50                 50         0         0
600               100       500         0
1800              100      1700         0
4000              100      2900      1000
To handle the general situation with 𝑘 groups, partition the positive real line
into 𝑘 intervals using the cut-points
0 = 𝑀0 < 𝑀1 < ⋯ < 𝑀𝑘−1 < 𝑀𝑘 = ∞.
Note that the 𝑗th interval is (𝑀𝑗−1 , 𝑀𝑗 ]. Now let 𝑌𝑗 be the amount of risk
shared by the 𝑗th party. To illustrate, if a loss 𝑥 is such that 𝑀𝑗−1 < 𝑥 ≤ 𝑀𝑗 ,
then
$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_j \\ Y_{j+1} \\ \vdots \\ Y_k \end{pmatrix} = \begin{pmatrix} M_1 - M_0 \\ M_2 - M_1 \\ \vdots \\ x - M_{j-1} \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$
With the expression 𝑌𝑗 = min(𝑋, 𝑀𝑗 )−min(𝑋, 𝑀𝑗−1 ), we see that the 𝑗th party
is responsible for claims in the interval (𝑀𝑗−1 , 𝑀𝑗 ]. With this, you can check
that 𝑋 = 𝑌1 + 𝑌2 + ⋯ + 𝑌𝑘 . As emphasized in the following example, we also
remark that the parties need not be different.
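In R, the allocation rule $Y_j = \min(x, M_j) - \min(x, M_{j-1})$ can be sketched as follows (illustrative code, not from the text); applied to the cut-points of Example 10.4.4 it reproduces the allocation shown there.

allocate <- function(x, M) {        # M: interior cut-points M_1 < ... < M_{k-1}
  cuts <- c(0, M, Inf)
  sapply(seq_len(length(cuts) - 1),
         function(j) pmin(x, cuts[j + 1]) - pmin(x, cuts[j]))
}
allocate(c(50, 600, 1800, 4000), M = c(100, 3000))
# rows correspond to the four claims; columns to parties 1, 2, 3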
Example 10.4.5.
• Suppose that a policyholder is responsible for the first 100 of claims and
all claims in excess of 100,000. The insurer takes claims between 100 and
100,000.
• Then, we would use 𝑀1 = 100, 𝑀2 = 100000.
• The policyholder is responsible for 𝑌1 = min(𝑋, 100) and 𝑌3 = 𝑋 −
min(𝑋, 100000) = max(0, 𝑋 − 100000).
For additional reading, see the Wisconsin Property Fund site for an example on
layers of reinsurance.
To manage the risk, you seek some insurance protection. You wish to manage internally small building and motor vehicle amounts, up to $M_1$ and $M_2$, respectively. You seek insurance to cover all other risks. Specifically, with $X_1$ and $X_2$ denoting the building and motor vehicle losses, the insurer's portion is
$$Y_{insurer} = (X_1 - M_1)_+ + (X_2 - M_2)_+.$$
a. Determine the expected claim amount of (i) that retained, (ii) that ac-
cepted by the insurer, and (iii) the total overall amount.
b. Determine the 80th, 90th, 95th, and 99th percentiles for (i) that retained,
(ii) that accepted by the insurer, and (iii) the total overall amount.
c. Compare the distributions by plotting the densities for (i) that retained,
(ii) that accepted by the insurer, and (iii) the total overall amount.
Solution.
With these parameters, we can now simulate realizations of the portfolio risks.
(a) Here are the results for the expected claim amounts.
(c) Here are the results for the density plots of the retained, insurer, and total
portfolio risk.
Figure: densities of the retained portion (note the different vertical scale), the insurer's portion, and the total portfolio risk.
Chapter 11

Loss Reserving
Chapter Preview. This chapter introduces loss reserving (also known as claims
reserving) for property and casualty (P&C, or general, non-life) insurance prod-
ucts. In particular, the chapter sketches some basic, though essential, analytic
tools to assess the reserves on a portfolio of P&C insurance products. First,
Section 11.1 motivates the need for loss reserving, then Section 11.2 studies the
available data sources and introduces some formal notation to tackle loss reserv-
ing as a prediction challenge. Next, Section 11.3 covers the chain-ladder method
and Mack’s distribution-free chain-ladder model. Section 11.4 then develops a
fully stochastic approach to determine the outstanding reserve with generalized
linear models (GLMs), including the technique of bootstrapping to obtain a
predictive distribution of the outstanding reserve via simulation.
11.1 Motivation
Our starting point is the lifetime of a P&C insurance claim. Figure 11.1 pictures
the development of such a claim over time and identifies the events of interest:
The insured event or accident occurs at time 𝑡𝑜𝑐𝑐 . This incident is reported to the
insurance company at time 𝑡𝑟𝑒𝑝 , after some delay. If the filed claim is accepted
by the insurance company, payments will follow to reimburse the financial loss
of the policyholder. In this example the insurance company compensates the
incurred loss with loss payments at times 𝑡1 , 𝑡2 and 𝑡3 . Eventually, the claim
settles or closes at time 𝑡𝑠𝑒𝑡 .
Often claims will not settle immediately due to the presence of delay in the re-
porting of a claim, delay in the settlement process or both. The reporting delay
is the time that elapses between the occurrence of the insured event and the
reporting of this event to the insurance company. The time between reporting
and settlement of a claim is known as the settlement delay. For example, it is
very intuitive that a material or property damage claim settles quicker than a
bodily injury claim involving a complex type of injury. Closed claims may also
reopen due to new developments, e.g. an injury that requires extra treatment.
Put together, the development of a claim typically takes some time. The pres-
ence of this delay in the run-off of a claim requires the insurer to hold capital
in order to settle these claims in the future.
Figure: timeline of a claim that settles before the present moment, showing occurrence, reporting, loss payments, and settlement.
An RBNS claim is one that has been Reported, But is Not fully Settled at
the present moment or the moment of evaluation (the valuation date), that is,
the moment when the reserves should be calculated and booked by the insurer.
Occurrence, reporting and possibly some loss payments take place before the
present moment, but the closing of the claim happens in the future, beyond the
present moment.
Figure: timeline of an RBNS claim, with occurrence, reporting, and some loss payments before the present moment and uncertain further development.
An IBNR claim is one that has Incurred in the past But is Not yet Reported.
For such a claim the insured event took place, but the insurance company is
not yet aware of the associated claim. This claim will be reported in the future
and its complete development (from reporting to settlement) takes place in the
future.
Figure: timeline of an IBNR claim, with occurrence at $t_{occ}$ before the present moment and all reporting and development uncertain and in the future.
Insurance companies will reserve capital to fulfill their future liabilities with
respect to both RBNS as well as IBNR claims. The future development of such
claims is uncertain and predictive modeling techniques will be used to calculate
appropriate reserves, from the historical development data observed on similar
claims.
In a typical manufacturing industry this is not the case: the manufacturer knows, before selling a product, what the cost of producing it was. At
a specified evaluation moment 𝜏 the insurer will predict outstanding liabilities
with respect to contracts sold in the past. This is the claims reserve or loss
reserve; it is the capital necessary to settle open claims from past exposures. It
is a very important element on the balance sheet of the insurer, more specifically
on the liabilities side of the balance sheet.
Figure: the individual development of all claims in the portfolio (occurrence, reporting, loss payments, settlement) is compressed into aggregate data by year of occurrence and payment delay.
The vertical axis indicates the year during which the insured event occurred. The horizontal axis indicates the payment delay in years since occurrence of the insured event. A delay of 0 is used for payments made in the year of occurrence of the accident or insured event; one year of delay is used for payments made in the year after occurrence of the accident.
accident   payment delay (in years)
year          0        1       2      3      4      5     6     7     8     9
2004      5,947.0  3,721.2  895.7  207.8  206.7   62.1  65.8  14.9  11.1  15.8
2005      6,346.8  3,246.4  723.2  151.8   67.8   36.6  52.8  11.2  11.6
2006      6,269.1  2,976.2  847.1  262.8  152.7   65.4  53.5   8.9
2007      5,863.0  2,683.2  722.5  190.7  133.0   88.3  43.3
2008      5,778.9  2,745.2  653.9  273.4  230.3  105.2
2009      6,184.8  2,828.3  572.8  244.9  105.0
2010      5,600.2  2,893.2  563.1  225.5
2011      5,288.1  2,440.1  528.0
2012      5,290.8  2,357.9
2013      5,675.6
For example, cell (2004, 0) in the above triangle displays the number 5, 947, the
total amount paid in the year 2004 for all claims occurring in year 2004. Thus,
it is the total amount paid with 0 years of delay on all claims that occurred in
the year 2004. Similarly, the number in cell (2012, 1) displays the total 2, 357.9
paid in the year 2013 for all claims that occurred in year 2012.
accident payment delay (in years)
year 0 1 2 3 4 5 6 7 8 9
2004 5,947 9,668 10,564 10,772 10,978 11,041 11,106 11,121 11,132 11,148
2005 6,347 9,593 10,316 10,468 10,536 10,573 10,625 10,637 10,648
2006 6,269 9,245 10,092 10,355 10,508 10,573 10,627 10,636
2007 5,863 8,546 9,269 9,459 9,592 9,681 9,724
2008 5,779 8,524 9,178 9,451 9,682 9,787
2009 6,185 9,013 9,586 9,831 9,936
2010 5,600 8,493 9,057 9,282
2011 5,288 7,728 8,256
2012 5,291 7,649
2013 5,676
Whereas the triangle in Figure 11.6 displays incremental payment data, Figure 11.7 shows the same information in cumulative format. Now, cell (2004, 1) displays the total claim amount paid up to payment delay 1 for all claims that occurred in year 2004. Therefore, it is the sum of the amount paid in 2004 and the amount paid in 2005 on accidents that occurred in 2004.
Different pieces of information can be stored in run-off triangles as those shown
in Figure 11.6 and Figure 11.7. Depending on the kind of data stored, the
triangle will be used to estimate different quantities.
For example, in incremental format a cell may display:
• the claim payments, as motivated before
• the number of claims that occurred in a specific year and were reported
with a certain delay, when the goal is to estimate the number of IBNR
380 CHAPTER 11. LOSS RESERVING
claims
• the change in incurred amounts, where incurred claim amounts are the
sum of cumulative paid claims and the case estimates. The case estimate
is the claims handler’s expert estimate of the outstanding amount on a
claim.
In cumulative format a cell may display:
• the cumulative paid amount, as motivated before
• the total number of claims from an occurrence year, reported up to a
certain delay
• the incurred claim amounts.
Other sources of information are potentially available, e.g. covariates (like the
type of claim) and external information (like inflation or changes in regulation). Most claims reserving methods designed for run-off triangles are based on a single source of information, although recent contributions focus on the use of more detailed data for loss reserving.
The random variable 𝑋𝑖𝑗 denotes the incremental claims paid in development
period 𝑗 on claims from accident year 𝑖. Thus, 𝑋𝑖𝑗 is the total amount paid in
development year 𝑗 for all claims that happened in occurrence year 𝑖. These
payments are actually paid out in accounting or calendar year 𝑖 + 𝑗. Taking
a cumulative point of view, 𝐶𝑖𝑗 is the cumulative amount paid up until (and
including) development year 𝑗 for accidents that occurred in year 𝑖. Ultimately,
a total amount 𝐶𝑖𝐽 is paid in the final development year 𝐽 for claims that
occurred in accident year 𝑖. In this chapter time is expressed in years, though
other time units can be used as well, e.g. six-month periods or quarters.
$$\mathcal{R}_i^{(0)} = \sum_{\ell=I-i+1}^{J} X_{i\ell} = C_{i,J} - C_{i,I-i}.$$
We express the reserve either as a sum of incremental data, the $X_{i\ell}$, or as a difference between cumulative numbers. In the latter case the outstanding amount is the ultimate cumulative amount $C_{i,J}$ minus the most recently observed cumulative amount $C_{i,I-i}$. Following Wüthrich and Merz (2015), the notation $\mathcal{R}_i^{(0)}$ refers to the reserve for occurrence year $i$, where $i = 1, \ldots, I$. The superscript $(0)$ refers to the evaluation of the reserve at the present moment, say $\tau = 0$. We understand $\tau = 0$ as the end of occurrence year $I$, the most recent calendar year for which data are observed and registered.
dev
origin 0 1 2 3 4 5 6 7 8 9
2004 5947 9668 10564 10772 10978 11041 11106 11121 11132 11148
2005 6347 9593 10316 10468 10536 10573 10625 10637 10648 NA
2006 6269 9245 10092 10355 10508 10573 10627 10636 NA NA
2007 5863 8546 9269 9459 9592 9681 9724 NA NA NA
2008 5779 8524 9178 9451 9682 9787 NA NA NA NA
2009 6185 9013 9586 9831 9936 NA NA NA NA NA
2010 5600 8493 9057 9282 NA NA NA NA NA NA
2011 5288 7728 8256 NA NA NA NA NA NA NA
2012 5291 7649 NA NA NA NA NA NA NA NA
2013 5676 NA NA NA NA NA NA NA NA NA
dev
origin 0 1 2 3 4 5 6 7 8 9
2004 5947 3721 896 208 207 62 66 15 11 16
2005 6347 3246 723 152 68 37 53 11 12 NA
2006 6269 2976 847 263 153 65 54 9 NA NA
2007 5863 2683 723 191 133 88 43 NA NA NA
2008 5779 2745 654 273 230 105 NA NA NA NA
2009 6185 2828 573 245 105 NA NA NA NA NA
2010 5600 2893 563 226 NA NA NA NA NA NA
2011 5288 2440 528 NA NA NA NA NA NA NA
2012 5291 2358 NA NA NA NA NA NA NA NA
2013 5676 NA NA NA NA NA NA NA NA NA
Visualizing Triangles
To explore the evolution of the cumulative payments per occurrence year, Fig-
ure 11.9 shows my_triangle using the plot function available for objects of
type triangle in the ChainLadder package. Each line in this plot depicts an
occurrence year (from 2004 to 2013, labelled as 1 to 10). Development periods
are labelled from 1 to 10 (instead of 0 to 9, as used above).
plot(my_triangle)
Alternatively, the lattice argument creates one plot per occurrence year.
plot(my_triangle, lattice = TRUE)
Figure 11.9: development of cumulative payments per occurrence year, plotted against development period (one line per occurrence year, labelled 1 to 10); the lattice version shows one panel per occurrence year. A companion plot of my_triangle_incr shows the corresponding incremental payments.
𝐶𝑖,𝑗+1 = 𝑓𝑗 × 𝐶𝑖,𝑗 .
Thus, the development factor tells you how the cumulative amount in develop-
ment year 𝑗 grows to the cumulative amount in year 𝑗 + 1. We highlight the
cumulative amount in period 0 in blue and the cumulative amount in period 1
in red on the Figure 11.10 taken from Wüthrich and Merz (2008) (Table 2.2,
also used in Wüthrich and Merz (2015), Table 1.4).
accident payment delay (in years)
year 0 1 2 3 4 5 6 7 8 9
1 5,947 9,668 10,564 10,772 10,978 11,041 11,106 11,121 11,132 11,148
2 6,347 9,593 10,316 10,468 10,536 10,573 10,625 10,637 10,648
3 6,269 9,245 10,092 10,355 10,508 10,573 10,627 10,636
4 5,863 8,546 9,269 9,459 9,592 9,681 9,724
5 5,779 8,524 9,178 9,451 9,682 9,787
6 6,185 9,013 9,586 9,831 9,936
7 5,600 8,493 9,057 9,282
8 5,288 7,728 8,256
9 5,291 7,649
10 5,676
$$\hat{f}_0^{CL} = \frac{\sum_{i=1}^{10-0-1} C_{i,0+1}}{\sum_{i=1}^{10-0-1} C_{i,0}} = 1.4925.$$
Note that the index 𝑖, used in the sums in the numerator and denominator, runs
from the first occurrence period (1) to the last occurrence period (9) for which
both development periods 0 and 1 are observed. As such, this development
factor measures how the data in blue grow to the data in red, averaged across
all occurrence periods for which both periods are observed. The chain-ladder
method then uses this development factor estimator to predict the cumulative
amount 𝐶10,1 (i.e. the cumulative amount paid up until and including develop-
ment year 1 for accidents that occurred in year 10). This prediction is obtained
by multiplying the most recent observed cumulative claim amount for occurrence
period 10 (i.e., $C_{10,0}$ with development period 0) with the estimated development factor $\hat{f}_0^{CL}$:
$$\hat{C}_{10,1} = C_{10,0}\cdot\hat{f}_0^{CL} = 5{,}676 \cdot 1.4925 = 8{,}471.$$
Going forward with this reasoning, the next development factor 𝑓1 can be es-
timated. Since 𝑓1 captures the development from period 1 to period 2, it can
be estimated as the ratio of the numbers in red and the numbers in blue as
highlighted in Figure 11.11.
$$\hat{f}_1^{CL} = \frac{\sum_{i=1}^{10-1-1} C_{i,1+1}}{\sum_{i=1}^{10-1-1} C_{i,1}} = 1.0778.$$
Consequently, this factor measures how the cumulative paid amount in devel-
opment period 1 grows to period 2, averaged across all occurrence periods for
which both periods are observed. The index 𝑖 now runs from period 1 to 8, since
these are the occurrence periods for which both development periods 1 and 2
are observed. This estimate for the second development factor is then used to
predict the missing, unobserved cells in development period 2:
$$\hat{C}_{10,2} = C_{10,0}\cdot\hat{f}_0^{CL}\cdot\hat{f}_1^{CL} = \hat{C}_{10,1}\cdot\hat{f}_1^{CL} = 8{,}471 \cdot 1.0778 = 9{,}130$$
$$\hat{C}_{9,2} = C_{9,1}\cdot\hat{f}_1^{CL} = 7{,}649 \cdot 1.0778 = 8{,}244.$$
Note that for $\hat{C}_{10,2}$ you actually use the estimate $\hat{C}_{10,1}$ and multiply it with the estimated development factor $\hat{f}_1^{CL}$.
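All development factors can be computed at once. The following R sketch (illustrative, not from the text) assumes the cumulative triangle is stored as a 10 x 10 matrix C, as printed earlier, with NA below the diagonal.

f_hat <- sapply(1:9, function(j) {
  idx <- 1:(10 - j)        # occurrence years with both periods observed
  sum(C[idx, j + 1]) / sum(C[idx, j])
})
round(f_hat, 4)
# 1.4925 1.0778 1.0229 1.0148 1.0070 1.0051 1.0011 1.0010 1.0014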
Eventually we need to estimate the values in the final column. The last develop-
ment factor 𝑓8 measures the growth from development period 8 to development
period 9 in the triangle. Since only the first row in the triangle has both cells
observed, this last factor is estimated as the ratio of the value in red and the
value in blue in Figure 11.13.
Given observations $\mathcal{D}_I$, this factor estimate $\hat{f}_8^{CL}$ is equal to:
$$\hat{f}_8^{CL} = \frac{\sum_{i=1}^{10-8-1} C_{i,8+1}}{\sum_{i=1}^{10-8-1} C_{i,8}} = 1.001.$$
Typically this last development factor is close to 1 and hence the cash flows
paid in the final development period are minor. Using this development factor
estimate, we can now estimate the remaining cumulative claim amounts in the
column by multiplying the values for development year 8 with this factor.
The general math notation for the chain ladder predictions for the lower triangle
(𝑖 + 𝑗 > 𝐼) is as follows:
$$\hat{C}_{ij}^{CL} = C_{i,I-i}\cdot\prod_{l=I-i}^{j-1}\hat{f}_l^{CL}, \qquad \hat{f}_j^{CL} = \frac{\sum_{i=1}^{I-j-1} C_{i,j+1}}{\sum_{i=1}^{I-j-1} C_{ij}}.$$
The numbers in the last column show the estimates for the ultimate claim amounts. The estimate for the outstanding claim amount $\hat{\mathcal{R}}_i^{CL}$ for a particular occurrence period $i = I-J+1, \ldots, I$ is then given by the difference between the ultimate claim amount and the cumulative amount as observed on the most recent diagonal:
$$\hat{\mathcal{R}}_i^{CL} = \hat{C}_{iJ}^{CL} - C_{i,I-i}.$$
This is the chain-ladder estimate for the reserve necessary to fulfill future liabil-
ities with respect to claims that occurred in this particular occurrence period.
These reserves per occurrence period and for the total summed over all occur-
rence periods are summarized in Figure 11.15.
 i      $C_{i,I-i}$    Dev.To.Date   $\hat{C}_{iJ}^{CL}$   $\hat{\mathcal{R}}_i^{CL}$
 1      11,148,123     1.000         11,148,123                    0
 2      10,648,192     0.999         10,663,317               15,125
 3      10,635,750     0.998         10,662,007               26,257
 4       9,724,069     0.996          9,758,607               34,538
 5       9,786,915     0.991          9,872,216               85,301
 6       9,935,752     0.984         10,092,245              156,493
 7       9,282,022     0.970          9,568,142              286,120
 8       8,256,212     0.948          8,705,378              449,166
 9       7,648,729     0.880          8,691,971            1,043,242
 10      5,675,568     0.590          9,626,383            3,950,815
 totals 92,741,332     0.94          98,788,390            6,047,059
This means that the cumulative claims (𝐶𝑖𝑗 )𝑗=0,…,𝐽 are Markov processes (in
the development periods 𝑗) and hence the future only depends on the present.
Under these assumptions, the expected value of the ultimate claim amount $C_{iJ}$, given the available data in the upper triangle, is the cumulative amount on the most recent diagonal ($C_{i,I-i}$) multiplied with the appropriate development factors $f_j$. In mathematical notation we obtain for known development factors $f_j$ and observations $\mathcal{D}_I$:
$$\mathrm{E}[C_{iJ}\,|\,\mathcal{D}_I] = C_{i,I-i}\prod_{j=I-i}^{J-1} f_j.$$
The unknown development factors $f_j$ are estimated as before:
$$\hat{f}_j^{CL} = \frac{\sum_{i=1}^{I-j-1} C_{i,j+1}}{\sum_{i=1}^{I-j-1} C_{ij}}.$$
The predictions for the cells in the lower triangle (i.e., for cells $C_{ij}$ where $i+j > I$) are then obtained by replacing the unknown factors $f_j$ by their corresponding estimates $\hat{f}_j^{CL}$:
$$\hat{C}_{ij}^{CL} = C_{i,I-i}\prod_{l=I-i}^{j-1}\hat{f}_l^{CL}.$$
To quantify the prediction error that comes with the chain-ladder predictions, Mack also introduced variance parameters $\sigma_j^2$. To gain insight into the estimation of these variance parameters, so-called individual development factors $f_{i,j}$ are introduced (which are specific to occurrence period $i$):
$$f_{i,j} = \frac{C_{i,j+1}}{C_{ij}}.$$
These individual development factors also describe how the cumulative amount grows from period $j$ to period $j+1$, but they consider the ratio of only two cells (instead of taking the ratio of two sums over all available occurrence periods).
Note that the development factors can be written as a weighted average of
individual development factors:
$$\hat{f}_j^{CL} = \sum_{i=1}^{I-j-1}\frac{C_{ij}}{\sum_{n=1}^{I-j-1} C_{nj}}\,f_{i,j},$$
where the weights are equal to the cumulative claims 𝐶𝑖𝑗 .
Let us now estimate the variance parameters $\sigma_j^2$ by writing Mack's variance assumption in equivalent ways. First, the variance of the ratio of $C_{i,j+1}$ and $C_{ij}$, conditional on $C_{i0}, \ldots, C_{ij}$, is proportional to the inverse of $C_{ij}$:
$$Var\left[C_{i,j+1}/C_{ij}\,|\,C_{i0}, \ldots, C_{ij}\right] \propto \frac{1}{C_{ij}}.$$
This reminds us of a typical weighted least squares setting where the weights
are the inverse of the variability of a response. Therefore, a more volatile or
imprecise response variable will get less weight. The 𝐶𝑖,𝑗 play the role of the
weights. Using the unknown variance parameter $\sigma_j^2$, this variance assumption can be written as:
$$Var\left[C_{i,j+1}/C_{ij}\,|\,C_{i0}, \ldots, C_{ij}\right] = \frac{\sigma_j^2}{C_{ij}}.$$
The connection with weighted least squares then directly leads to an unbiased
estimate for the unknown variance parameter 𝜎𝑗2 in the form of a weighted
residual sum of squares:
$$\hat{\sigma}_j^2 = \frac{1}{I-j-2}\sum_{i=1}^{I-j-1} C_{ij}\left(\frac{C_{i,j+1}}{C_{ij}} - \hat{f}_j^{CL}\right)^2.$$
The weights are again equal to $C_{ij}$, and the residuals are the differences between the individual development factors $C_{i,j+1}/C_{ij}$ and the estimated factors $\hat{f}_j^{CL}$.
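Continuing the earlier R sketch (with the cumulative triangle in the matrix C), the variance parameters can be estimated as follows; note that the final parameter cannot be estimated this way, since only one individual factor is observed for it, and it is typically extrapolated.

sigma2_hat <- sapply(1:8, function(j) {
  idx <- 1:(10 - j)
  f   <- sum(C[idx, j + 1]) / sum(C[idx, j])
  sum(C[idx, j] * (C[idx, j + 1] / C[idx, j] - f)^2) / (length(idx) - 1)
})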
We now have all ingredients required to calibrate the distribution-free chain-
ladder model to the data. The next step is then to analyze the prediction
uncertainty and the prediction error. Hereto we use the chain-ladder predictor
where we replace the unknown development factors with their estimators:
$$\hat{C}_{iJ}^{CL} = C_{i,I-i}\prod_{l=I-i}^{J-1}\hat{f}_l^{CL}$$
and study the mean squared error of prediction (MSEP):
$$MSEP_{C_{iJ}|\mathcal{D}_I}\left(\hat{C}_{iJ}^{CL}\right) = \mathrm{E}\left[\left(C_{iJ} - \hat{C}_{iJ}^{CL}\right)^2\,\Big|\,\mathcal{D}_I\right].$$
The reason for this equivalence is the fact that the reserve is the ultimate claim amount minus the most recently observed claim amount. The latter is observed and used in both $\mathcal{R}_i$ and $\hat{\mathcal{R}}_i$.
It is interesting to decompose this MSEP into a component that captures process
variance and a component that captures parameter estimation variance:
$$MSEP_{C_{iJ}|\mathcal{D}_I}\left(\hat{C}_{iJ}^{CL}\right) = \mathrm{E}\left[\left(C_{iJ} - \hat{C}_{iJ}^{CL}\right)^2\,\Big|\,\mathcal{D}_I\right] = Var(C_{iJ}\,|\,\mathcal{D}_I) + \left(\mathrm{E}[C_{iJ}\,|\,\mathcal{D}_I] - \hat{C}_{iJ}^{CL}\right)^2 = \text{process variance} + \text{parameter estimation variance},$$
using the general decomposition
$$\mathrm{E}(X-a)^2 = Var(X) + \left[\mathrm{E}(X) - a\right]^2.$$
Mack derived the following estimator:
$$\widehat{MSEP}_{C_{iJ}|\mathcal{D}_I}\left(\hat{C}_{iJ}^{CL}\right) = \left(\hat{C}_{iJ}^{CL}\right)^2\sum_{j=I-i}^{J-1}\frac{\hat{\sigma}_j^2}{(\hat{f}_j^{CL})^2}\left(\frac{1}{\hat{C}_{ij}^{CL}} + \frac{1}{\sum_{n=1}^{I-j-1} C_{nj}}\right).$$
For the derivation of this popular formula, we refer to his paper. Note that it is an estimate of the MSEP: since the unknown parameters $f_j$ and $\sigma_j$ need to be estimated, the estimation error cannot be calculated explicitly.
Mack also derived a formula for the MSEP for the total reserve, across all
occurrence periods:
$$\widehat{MSEP}_{\sum_i C_{iJ}|\mathcal{D}_I}\left(\sum_{i=1}^{I}\hat{C}_{iJ}^{CL}\right) = \sum_{i=1}^{I}\widehat{MSEP}_{C_{iJ}|\mathcal{D}_I}\left(\hat{C}_{iJ}^{CL}\right) + 2\sum_{1\le i<k\le I}\hat{C}_{iJ}^{CL}\,\hat{C}_{kJ}^{CL}\sum_{j=I-i}^{J-1}\frac{\hat{\sigma}_j^2/(\hat{f}_j^{CL})^2}{\sum_{n=1}^{I-j-1} C_{nj}}.$$
The result is the sum of the MSEPs per occurrence period plus a covariance term. This covariance term is added because the MSEPs for different occurrence periods $i$ use the same parameter estimates $\hat{f}_j^{CL}$ of the development factors $f_j$.
MackChainLadder(Triangle = my_triangle)
Totals
Latest: 92,741,334.00
Dev: 0.94
Ultimate: 98,788,397.77
IBNR: 6,047,063.77
Mack.S.E 462,977.83
CV(IBNR): 0.08
round(summary(CL)$Totals)
Totals
Latest: 92741334
Dev: 1
Ultimate: 98788398
IBNR: 6047064
Mack S.E.: 462978
CV(IBNR): 0
[1] 1.4925 1.0778 1.0229 1.0148 1.0070 1.0051 1.0011 1.0010 1.0014 1.0000
dev
origin 0 1 2 3 4 5 6 7
2004 5946975 9668212 10563929 10771690 10978394 11040518 11106331 11121181
2005 6346756 9593162 10316383 10468180 10536004 10572608 10625360 10636546
2006 6269090 9245313 10092366 10355134 10507837 10573282 10626827 10635751
2007 5863015 8546239 9268771 9459424 9592399 9680740 9724068 9734574
2008 5778885 8524114 9178009 9451404 9681692 9786916 9837277 9847905
2009 6184793 9013132 9585897 9830796 9935753 10005044 10056528 10067393
2010 5600184 8493391 9056505 9282022 9419776 9485469 9534279 9544579
2011 5288066 7728169 8256211 8445057 8570389 8630159 8674567 8683939
2012 5290793 7648729 8243496 8432051 8557190 8616868 8661208 8670566
2013 5675568 8470989 9129696 9338521 9477113 9543206 9592313 9602676
dev
origin 8 9
2004 11132310 11148124
2005 10648192 10663318
2006 10646884 10662008
2007 9744764 9758606
2008 9858214 9872218
2009 10077931 10092247
2010 9554570 9568143
2011 8693029 8705378
2012 8679642 8691971
2013 9612728 9626383
The MSEP for the total reserve across all occurrence periods is given by:
CL$Total.Mack.S.E^2
9
214348469061
plot(CL)
Figure: diagnostic plots for the Mack chain-ladder fit produced by plot(CL), including latest and forecast amounts, development patterns, and standardised residuals.
The top left-hand plot is a bar-chart of the latest claims position plus IBNR and
Mack’s standard error by occurrence period. The top right-hand plot shows the
forecasted development patterns for all occurrence periods (starting with 1 for
the oldest occurrence period).
Figure: observed and forecasted cumulative development per occurrence period, one lattice panel per period.
This section is being written and is not yet complete nor edited. It
is here to give you a flavor of what will be in the final version.
This section covers regression models to analyze run-off triangles. When analyz-
ing the data in a run-off triangle with a regression model, the standard toolbox
for model building, estimation and prediction becomes available. Using these
tools we are able to go beyond the point estimate and standard error as derived
in Section 11.3. More specifically, we build a generalized linear model (GLM) for
the incremental payments 𝑋𝑖𝑗 in Figure 11.6. Whereas the chain-ladder method
works with cumulative data, typical GLMs assume the response variables to be
independent and therefore work with incremental run-off triangles.
First, in the Poisson regression model the mean of the incremental payments has the multiplicative structure
$$\mu_{ij} = \pi_i \cdot \gamma_j,$$
which, using a log link, can be written as $\mu_{ij} = \exp(\mu + \alpha_i + \beta_j)$. Second, in the over-dispersed Poisson model the incremental payments satisfy
$$X_{ij} = \phi \cdot Z_{ij},$$
where $Z_{ij}$ is Poisson distributed with mean $\mu_{ij}/\phi$ and
$$\mu_{ij} = \exp(\mu + \alpha_i + \beta_j).$$
Consequently, $X_{ij}$ has the same specification for the mean as in the basic Poisson regression model, but now $Var(X_{ij}) = \phi\,\mu_{ij}$. This construction allows for under-dispersion (when $\phi < 1$) and over-dispersion (when $\phi > 1$). Because $X_{ij}$ no longer follows a well-known distribution, this approach is referred to as quasi-likelihood. It is particularly useful for modeling a run-off triangle with incremental payments, as these typically reveal over-dispersion.
Third, the gamma regression model is relevant to model a run-off triangle with
claim payments. Recall from Section 3.2.1 (see also the Appendix Chapter
18) that the gamma distribution has shape parameter 𝛼 and scale parameter 𝜃.
From these, we reparameterize and define a new parameter 𝜇 = 𝛼⋅𝜃 while retain-
ing the scale parameter 𝜃. Further, assume that 𝑋𝑖𝑗 has a gamma distribution
and allow 𝜇 to vary by 𝑖𝑗 such that
𝜇𝑖𝑗 = exp (𝜇 + 𝛼𝑖 + 𝛽𝑗 ).
Point estimates for outstanding reserves (per occurrence year 𝑖 or the total
reserve) then follow by summing the cell-specific estimates. By combining the
observations in the upper triangle with their point estimates, we can construct
properly defined residuals and use these for residual inspection.
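As an illustration (not the text's own code), such GLMs can be fit with standard R tooling once the incremental triangle is reshaped into long format, say a data frame dat with columns origin, dev and X for the observed cells; the names here are assumptions.

fit <- glm(X ~ factor(origin) + factor(dev),
           family = quasipoisson(link = "log"), data = dat)
# Reserve point estimates follow by predicting the unobserved cells, e.g.:
# sum(predict(fit, newdata = future_cells, type = "response"))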
11.4.3 Bootstrap
Over time actuaries started to think about possible underlying models and we
mention some important contributions:
• Kremer (1982): two-way ANOVA
• Kremer (1984), Mack (1991): Poisson model
• Mack (1993): distribution-free chain-ladder model
• Renshaw (1989); Renshaw and Verrall (1998): over-dispersed Poisson
model
• Gisler (2006); Gisler and Wüthrich (2008); Bühlmann et al. (2009):
Bayesian chain-ladder model.
The various stochastic models proposed in actuarial literature rely on different
assumptions and have different model properties, but have in common that they
provide exactly the chain-ladder reserve estimates. For more information we also
refer to Mack and Venter (2000) and to the lively discussion that was published
in ASTIN Bulletin: Journal of the International Actuarial Association in 2006
(Venter, 2006).
To read more about exponential families and generalized linear models, see, for
example, McCullagh and Nelder (1989) and Wüthrich and Merz (2008). We
refer to (Kremer, 1982), (Renshaw and Verrall, 1998) and (England and Verrall,
2002), and the overviews in (Taylor, 2000), (Wüthrich and Merz, 2008) and
(Wüthrich and Merz, 2015) for more details on the discussed GLMs. XXX
presents alternative distributional assumptions and specifications of the linear
predictor.
Chapter 12

Experience Rating using Bonus-Malus
12.1 Introduction
The bonus-malus system, referred to interchangeably as a "no-fault discount", "merit rating", "experience rating" or "no-claim discount" system in different countries, is based on penalizing insureds who are responsible for one or more claims by a premium surcharge (malus), and rewarding insureds with a premium discount (bonus) if they do not have any claims. Insurers use bonus-malus systems for two main purposes: to encourage drivers to drive more carefully, and to ensure that insureds pay premiums proportional to their risks, based on their claims experience, via an experience rating mechanism.
In the Malaysian NCD system, a claim-free year moves the policyholder forward to the next discount class, such as from a 0% discount to a 25% discount in the renewal year. If a policyholder is already at the highest class, which is at a 55% discount, a claim-free year means that the policyholder remains in the same class. On the other hand, if one or more claims are made within the year, the NCD is forfeited and the policyholder has to start at a 0% discount in the renewal year. This set of transition rules can be summarized as a rule of -1/Top, that is, a class of bonus for a claim-free year, and a return to the 0% discount class after one or more claims. For illustration purposes, Table 12.1 and Figure 12.1 respectively show the classes and the transition diagram for the Malaysian NCD system.
The Brazilian NCD system follows a rule of -1/+1, that is, a class of bonus for a claim-free year, and a class of malus for each claim reported.
The NCD system in Switzerland is subdivided into twenty-two classes, with the following premium levels: 270, 250, 230, 215, 200, 185, 170, 155, 140, 130, 120, 110, 100, 90, 80, 75, 70, 65, 60, 55, 50 and 45 (Lemaire and Zi, 1994). These levels are equivalent to the following loadings (malus): 170%, 150%, 130%, 115%, 100%, 85%, 70%, 55%, 40%, 30%, 20%, and 10%, and the following discounts: 0%, 10%, 20%, 25%, 30%, 35%, 40%, 45%, 50% and 55%. New policyholders have to start at a 0% discount, or at a premium level of 100, and a claim-free year means that a policyholder moves one class forward. If one or more claims are incurred within the year, the policyholder has to move four classes backward for each claim. Table 12.3 and Figure 12.3 respectively show the classes and the transition diagram for the NCD system in Switzerland. This set of transition rules can be summarized as a rule of -1/+4.
A time-homogeneous Markov chain satisfies the property $p_{ij}(n, n+t) = p_{ij}^{(t)}$ for all $n$. For instance, we have $p_{ij}(n, n+1) = p_{ij}^{(1)} \equiv p_{ij}$. In this case, the Chapman-Kolmogorov equation can be written as
$$p_{ij}(0, m+n) = \sum_{l \in S} p_{il}(0, m)\, p_{lj}(m, m+n) = \sum_{l \in S} p_{il}^{(m)}\, p_{lj}^{(n)}.$$
In the context of a bonus-malus system (BMS), the transition between NCD classes in a given year is governed by these transition probabilities. The transition between NCD classes is a time-homogeneous Markov chain because the set of transition rules is fixed and independent of time. We can represent the one-step transition probabilities by a $k \times k$ transition matrix $\mathbf{P} = (p_{ij})$ that corresponds to NCD classes $0, 1, 2, \ldots, k-1$. Its $(i,j)$-th element is the transition probability from state $i$ to state $j$. In other words, each row of the transition matrix gives the probabilities of moving out of a state, whereas each column gives the probabilities of moving into a state. The transition probabilities out of each state must sum to 1, i.e. $\sum_j p_{ij} = 1$ for every row, and all probabilities must be non-negative (since they are probabilities), i.e. $p_{ij} \ge 0$.
Consider the Malaysian NCD system. Let $\{X_t : t = 0, 1, 2, \ldots\}$ be the NCD class occupied by a policyholder at time $t$, with state space $S = \{0, 1, 2, 3, 4, 5\}$. The transition probability in a no-claim year is the probability of moving from state $i$ to state $i+1$, i.e. $p_{i,i+1}$. If an insured has one or more claims within the year, the probability of transitioning back to state 0 is $p_{i0} = 1 - p_{i,i+1}$. Hence, the Malaysian NCD system can be represented by the following $6 \times 6$ transition matrix:
$$\mathbf{P} = \begin{bmatrix}
1-p_{01} & p_{01} & 0 & 0 & 0 & 0 \\
1-p_{12} & 0 & p_{12} & 0 & 0 & 0 \\
1-p_{23} & 0 & 0 & p_{23} & 0 & 0 \\
1-p_{34} & 0 & 0 & 0 & p_{34} & 0 \\
1-p_{45} & 0 & 0 & 0 & 0 & p_{45} \\
1-p_{55} & 0 & 0 & 0 & 0 & p_{55}
\end{bmatrix}.$$
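For readers who wish to reproduce this construction numerically, the following R sketch (not part of the original examples) builds the 6 × 6 matrix under the simplifying assumption that every class has the same no-claim probability of 0.9:

p <- rep(0.9, 6)                 # assumed no-claim probability for each class
P <- matrix(0, nrow = 6, ncol = 6)
for (i in 1:5) {
  P[i, 1] <- 1 - p[i]            # one or more claims: back to class 0
  P[i, i + 1] <- p[i]            # claim-free year: move one class forward
}
P[6, 1] <- 1 - p[6]              # claims from the top class also lead to class 0
P[6, 6] <- p[6]                  # the top class is retained in a claim-free year
rowSums(P)                       # each row sums to 1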
Example 12.3.1. Provide the transition matrix for the NCD system in Brazil.

Solution. Based on the NCD classes and the transition diagram shown respectively in Table 12.2 and Figure 12.2, the probability of a no-claim year is equal to the probability of moving one class forward, while each claim reported moves the policyholder one class backward. The transition matrix is:
$$\mathbf{P} = \begin{bmatrix}
1-p_{01} & p_{01} & 0 & 0 & 0 & 0 & 0 \\
1-p_{12} & 0 & p_{12} & 0 & 0 & 0 & 0 \\
1-\sum_j p_{2j} & p_{21} & 0 & p_{23} & 0 & 0 & 0 \\
1-\sum_j p_{3j} & p_{31} & p_{32} & 0 & p_{34} & 0 & 0 \\
1-\sum_j p_{4j} & p_{41} & p_{42} & p_{43} & 0 & p_{45} & 0 \\
1-\sum_j p_{5j} & p_{51} & p_{52} & p_{53} & p_{54} & 0 & p_{56} \\
1-\sum_j p_{6j} & p_{61} & p_{62} & p_{63} & p_{64} & p_{65} & p_{66}
\end{bmatrix}.$$
Example 12.3.2. Provide the transition matrix for the NCD system in Switzer-
land.
Solution.
From Table 12.3 and Figure 12.3, the probability of a no-claim year is equal to the probability of moving one class forward, whereas the probability of having one or more claims within the year is equal to the probability of moving four classes backward for each claim. The transition matrix is:
$$\mathbf{P} = \begin{bmatrix}
1-p_{01} & p_{01} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\
1-p_{12} & 0 & p_{12} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\
1-p_{23} & 0 & 0 & p_{23} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\
1-p_{34} & 0 & 0 & 0 & p_{34} & 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\
1-p_{45} & 0 & 0 & 0 & 0 & p_{45} & 0 & 0 & 0 & 0 & 0 & \cdots \\
1-\sum_j p_{5j} & p_{51} & 0 & 0 & 0 & 0 & p_{56} & 0 & 0 & 0 & 0 & \cdots \\
1-\sum_j p_{6j} & 0 & p_{62} & 0 & 0 & 0 & 0 & p_{67} & 0 & 0 & 0 & \cdots \\
1-\sum_j p_{7j} & 0 & 0 & p_{73} & 0 & 0 & 0 & 0 & p_{78} & 0 & 0 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \\
1-\sum_j p_{19,j} & 0 & 0 & p_{19,3} & 0 & 0 & 0 & p_{19,7} & 0 & 0 & 0 & \cdots \\
1-\sum_j p_{20,j} & 0 & 0 & 0 & p_{20,4} & 0 & 0 & 0 & p_{20,8} & 0 & 0 & \cdots \\
1-\sum_j p_{21,j} & p_{21,1} & 0 & 0 & 0 & p_{21,5} & 0 & 0 & 0 & p_{21,9} & 0 & \cdots
\end{bmatrix}.$$
12.4 BMS and Stationary Distribution

A row vector $\pi = (\pi_0, \pi_1, \ldots, \pi_{k-1})$ is a stationary distribution of the Markov chain if it satisfies:
$$0 \le \pi_j \le 1, \qquad \sum_j \pi_j = 1, \qquad \pi_j = \sum_i \pi_i\, p_{ij}.$$
The last equation can be written as $\pi \mathbf{P} = \pi$. The first two conditions are necessary for a probability distribution, whereas the last property indicates that the row vector $\pi$ is invariant (i.e. unchanged) under the one-step transition matrix. In other words, once the Markov chain has reached its stationary state, its probability distribution stays stationary over time. Mathematically, the stationary vector $\pi$ can also be obtained as the left eigenvector of the one-step transition matrix associated with the eigenvalue 1.
Example 12.4.1. Find the stationary distribution for the NCD system in Malaysia, assuming that the probability of a no-claim year for all NCD classes is $p_0$.
Solution. The transition matrix is:
$$\mathbf{P} = \begin{bmatrix}
1-p_0 & p_0 & 0 & 0 & 0 & 0 \\
1-p_0 & 0 & p_0 & 0 & 0 & 0 \\
1-p_0 & 0 & 0 & p_0 & 0 & 0 \\
1-p_0 & 0 & 0 & 0 & p_0 & 0 \\
1-p_0 & 0 & 0 & 0 & 0 & p_0 \\
1-p_0 & 0 & 0 & 0 & 0 & p_0
\end{bmatrix}.$$
The stationary probabilities satisfy
$$\begin{aligned}
\pi_0 &= \sum_i \pi_i p_{i0} = (1-p_0)\sum_i \pi_i = 1-p_0 \\
\pi_1 &= \sum_i \pi_i p_{i1} = \pi_0 p_{01} = (1-p_0)p_0 \\
\pi_2 &= \sum_i \pi_i p_{i2} = \pi_1 p_{12} = (1-p_0)p_0^2 \\
\pi_3 &= \sum_i \pi_i p_{i3} = \pi_2 p_{23} = (1-p_0)p_0^3 \\
\pi_4 &= \sum_i \pi_i p_{i4} = \pi_3 p_{34} = (1-p_0)p_0^4 \\
\pi_5 &= \sum_i \pi_i p_{i5} = \pi_4 p_{45} + \pi_5 p_{55} = (1-p_0)p_0^5 + \pi_5 p_0,
\end{aligned}$$
so that
$$\pi_5 = \frac{(1-p_0)p_0^5}{1-p_0} = p_0^5.$$
For example, with $p_0 = 0.9$:
$$\begin{aligned}
\pi_0 &= 1-p_0 = 0.1000 \\
\pi_1 &= (1-p_0)p_0 = 0.0900 \\
\pi_2 &= (1-p_0)p_0^2 = 0.0810 \\
\pi_3 &= (1-p_0)p_0^3 = 0.0729 \\
\pi_4 &= (1-p_0)p_0^4 = 0.0656 \\
\pi_5 &= p_0^5 = 0.5905
\end{aligned}$$
In other words, 𝜋0 = 0.10 indicates that 10% of insureds will eventually belong
to class 0, 𝜋1 = 0.09 indicates that 9% of insureds will eventually belong to
class 1, and so forth, until 𝜋5 = 0.59, which indicates that 59% of insureds will
eventually belong to class 5.
In R, the stationary distribution can be verified numerically as the (normalized) left eigenvector of $\mathbf{P}$ associated with the eigenvalue 1; the remaining eigenvectors, which R reports as complex conjugate pairs, are not needed. Normalizing the relevant eigenvector, $(0.162, 0.145, 0.131, 0.118, 0.106, 0.954)$, reproduces the stationary probabilities $(0.1000, 0.0900, 0.0810, 0.0729, 0.0656, 0.5905)$.
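A minimal R sketch of this eigenvector computation, under the same assumption of a common no-claim probability $p_0 = 0.9$, is:

p0 <- 0.9
P <- matrix(0, 6, 6)
P[, 1] <- 1 - p0                       # any claim sends the insured to class 0
for (i in 1:5) P[i, i + 1] <- p0       # claim-free year: one class forward
P[6, 1] <- 1 - p0
P[6, 6] <- p0                          # top class retained when claim-free
ev <- eigen(t(P))                      # left eigenvectors of P
v <- Re(ev$vectors[, which.min(abs(ev$values - 1))])
round(v / sum(v), 4)                   # 0.1000 0.0900 0.0810 0.0729 0.0656 0.5905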
Example 12.4.2. Find the stationary distribution for the NCD system in
Brazil assuming that the number of claims is Poisson distributed with parameter
𝜆 = 0.10.
Solution. Under the Poisson distribution, the probability of $k$ claims is
$$p_k = \frac{e^{-0.1}(0.1)^k}{k!}, \qquad k = 0, 1, 2, \ldots.$$
The transition matrix is:
$$\mathbf{P} = \begin{bmatrix}
1-p_0 & p_0 & 0 & 0 & 0 & 0 & 0 \\
1-p_0 & 0 & p_0 & 0 & 0 & 0 & 0 \\
1-(p_0+p_1) & p_1 & 0 & p_0 & 0 & 0 & 0 \\
1-(p_0+p_1+p_2) & p_2 & p_1 & 0 & p_0 & 0 & 0 \\
1-(p_0+\cdots+p_3) & p_3 & p_2 & p_1 & 0 & p_0 & 0 \\
1-(p_0+\cdots+p_4) & p_4 & p_3 & p_2 & p_1 & 0 & p_0 \\
1-(p_0+\cdots+p_5) & p_5 & p_4 & p_3 & p_2 & p_1 & p_0
\end{bmatrix} = \begin{bmatrix}
0.0952 & 0.9048 & 0 & 0 & 0 & 0 & 0 \\
0.0952 & 0 & 0.9048 & 0 & 0 & 0 & 0 \\
0.0047 & 0.0905 & 0 & 0.9048 & 0 & 0 & 0 \\
0.0002 & 0.0045 & 0.0905 & 0 & 0.9048 & 0 & 0 \\
0.0000 & 0.0002 & 0.0045 & 0.0905 & 0 & 0.9048 & 0 \\
0.0000 & 0.0000 & 0.0002 & 0.0045 & 0.0905 & 0 & 0.9048 \\
0.0000 & 0.0000 & 0.0000 & 0.0002 & 0.0045 & 0.0905 & 0.9048
\end{bmatrix}.$$
Using R, the stationary distribution is:
$$(\pi_0, \pi_1, \pi_2, \pi_3, \pi_4, \pi_5, \pi_6)' = (0.0000,\ 0.0000,\ 0.0003,\ 0.0022,\ 0.0145,\ 0.0936,\ 0.8894)'.$$
The probabilities indicate that, in the long run, 89% of insureds will belong to class 6, 9% will belong to class 5, and 1.5% will belong to class 4; each of the other classes would contain less than 1% of insureds.
Example 12.4.3. Using the results from Example 12.4.2, find the final pre-
mium under the steady state condition assuming that the premium prior to
implementing the NCD is 𝑚.
Solution. Using the stationary probabilities from Example 12.4.2, together with premium levels $b_j m$ where $b_j \in \{1.00, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65\}$ for classes $j = 0, \ldots, 6$ (the discount levels of the Brazilian system), the stationary final premium is
$$\sum_{j=0}^{6} \pi_j\, b_j\, m = \big(0.0003(0.85) + 0.0022(0.80) + 0.0145(0.75) + 0.0936(0.70) + 0.8894(0.65)\big)\, m = 0.6565m,$$
where the terms for classes 0 and 1 vanish to four decimal places. The results indicate that the final premium reduces from $m$ to $0.6565m$ in the long run under the stationary condition once the NCD is taken into account. From a financial standpoint, this implies that the collected premium is insufficient to cover the expected claim cost of $m$. This result is not surprising because none of the classes in the NCD system in Brazil imposes a malus loading on policyholders. More importantly, it indicates that an NCD system will only be financially balanced if there are both bonus and malus classes and the premium levels are re-calculated such that the expected premium under the stationary distribution equals $m$.
Example 12.4.4. Observe the premiums over 20 years under the NCD system in Malaysia, assuming that the number of claims is Poisson distributed with parameter $\lambda = 0.10$ and the premium prior to implementing the NCD is $m = 100$.
Solution. The one-step transition matrix is:
$$\mathbf{P}^{(1)} = \begin{bmatrix}
0.0952 & 0.9048 & 0 & 0 & 0 & 0 \\
0.0952 & 0 & 0.9048 & 0 & 0 & 0 \\
0.0952 & 0 & 0 & 0.9048 & 0 & 0 \\
0.0952 & 0 & 0 & 0 & 0.9048 & 0 \\
0.0952 & 0 & 0 & 0 & 0 & 0.9048 \\
0.0952 & 0 & 0 & 0 & 0 & 0.9048
\end{bmatrix}.$$
Assuming that policyholders are initially distributed uniformly across the six classes and using the Malaysian premium levels (100, 75, 70, 61.67, 55, 45), the premium in the first year after implementing the NCD is the average over the initial classes of $\sum_j p_{ij}^{(1)} b_j$, which equals 62.55. Using similar steps, the premium in the $n$-th year for $n = 1, 2, \ldots, 20$ can be computed. From R, the premiums over the 20 years are:

62.55, 59.87, 58.06, 57.06, 56.58, 56.58, 56.58, 56.58, 56.58, 56.58,
56.58, 56.58, 56.58, 56.58, 56.58, 56.58, 56.58, 56.58, 56.58, 56.58.
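A short R sketch of this calculation follows; the premium levels (100, 75, 70, 61.67, 55, 45) corresponding to the Malaysian discount classes and the uniform starting distribution are assumptions used for illustration:

p0 <- exp(-0.1)                      # Poisson probability of a claim-free year
P <- matrix(0, 6, 6)
P[, 1] <- 1 - p0
for (i in 1:5) P[i, i + 1] <- p0
P[6, 6] <- p0
b <- c(100, 75, 70, 61.67, 55, 45)   # premium levels by class (assumed)
Pn <- diag(6)
prem <- numeric(20)
for (n in 1:20) {
  Pn <- Pn %*% P                     # n-step transition matrix P^(n)
  prem[n] <- mean(Pn %*% b)          # average premium over a uniform start
}
round(prem, 2)                       # 62.55 59.87 58.06 ... converging to 56.58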
Example 12.4.5. Observe the premiums over 20 years under the NCD system in Brazil, assuming that the probability of $k$ claims is $p_k = \frac{e^{-0.1}(0.1)^k}{k!}$, $k = 0, 1, 2, \ldots$, and the premium prior to implementing the NCD is $m = 100$.
Solution. The transition matrix for the NCD system in Brazil is:
$$\mathbf{P} = \begin{bmatrix}
0.0952 & 0.9048 & 0 & 0 & 0 & 0 & 0 \\
0.0952 & 0 & 0.9048 & 0 & 0 & 0 & 0 \\
0.0047 & 0.0905 & 0 & 0.9048 & 0 & 0 & 0 \\
0.0002 & 0.0045 & 0.0905 & 0 & 0.9048 & 0 & 0 \\
0.0000 & 0.0002 & 0.0045 & 0.0905 & 0 & 0.9048 & 0 \\
0.0000 & 0.0000 & 0.0002 & 0.0045 & 0.0905 & 0 & 0.9048 \\
0.0000 & 0.0000 & 0.0000 & 0.0002 & 0.0045 & 0.0905 & 0.9048
\end{bmatrix}.$$
The results in Examples 12.4.4 and 12.4.5 allow us to observe the evolution of premiums for the NCD systems in Malaysia and Brazil, assuming that the number of claims is Poisson distributed with parameter $\lambda = 0.10$ and that the premium prior to implementing the NCD is $m = 100$. The evolution of premiums for both countries is provided in Table 12.4 and shown graphically in Figure 12.4.
Table 12.4. Evolution of Premium (Malaysia and Brazil)
The variation in class $j$ after $n$ years can be measured by the absolute difference between the average $n$-step transition probability into class $j$ and the stationary probability:
$$\left| \text{average}\left(p_{ij}^{(n)}\right) - \pi_j \right|.$$
Therefore, the total variation can be measured by the sum of the variations over all classes:
$$\sum_j \left| \text{average}\left(p_{ij}^{(n)}\right) - \pi_j \right|.$$
The total variation is also called the convergence rate because it measures how close the chain is to stationarity after $n$ years (or $n$ transitions). A lower total variation implies closer agreement between the $n$-step transition probabilities and the stationary distribution.
Example 12.4.6. Provide the total variations (convergence rates) over 20 years under the NCD system in Malaysia, assuming that the number of claims is Poisson distributed with parameter $\lambda = 0.10$.
Solution. Using R, the stationary probabilities are:
$$(\pi_0, \pi_1, \pi_2, \pi_3, \pi_4, \pi_5)' = (0.0952,\ 0.0861,\ 0.0779,\ 0.0705,\ 0.0638,\ 0.6064)'.$$
The one-step transition matrix is:
$$\mathbf{P}^{(1)} = \begin{bmatrix}
0.0952 & 0.9048 & 0 & 0 & 0 & 0 \\
0.0952 & 0 & 0.9048 & 0 & 0 & 0 \\
0.0952 & 0 & 0 & 0.9048 & 0 & 0 \\
0.0952 & 0 & 0 & 0 & 0.9048 & 0 \\
0.0952 & 0 & 0 & 0 & 0 & 0.9048 \\
0.0952 & 0 & 0 & 0 & 0 & 0.9048
\end{bmatrix}.$$
For $n = 1$, the variations by class are
$$\left| \sum_i \frac{p_{i0}}{6} - \pi_0 \right| = 0, \quad \left| \sum_i \frac{p_{i1}}{6} - \pi_1 \right| = 0.0647, \quad \ldots, \quad \left| \sum_i \frac{p_{i5}}{6} - \pi_5 \right| = 0.3048,$$
so the total variation is
$$\sum_j \left| \sum_i \frac{p_{ij}}{6} - \pi_j \right| = 0.6096.$$
In R, with powA(n) denoting a function that returns the $n$-step transition matrix $\mathbf{P}^{(n)}$ and SP the vector of stationary probabilities computed above, the total variation function is:

TV = function(n) {
  dif = numeric(6)
  for (j in 1:6) {
    dif[j] = abs(mean(powA(n)[, j]) - SP[j])
  }
  sum(dif)
}
# example for n = 1
signif(TV(1), digits = 4)

[1] 0.6096
# Provide the total variations (convergence rates) over 20 years
tot.var = numeric(0)
for (n in 1:20) {tot.var[n] = TV(n)}
signif(tot.var, 4)
Examples 12.4.6 and 12.4.7 provide the degree of convergence for two different BMS (two different countries). The Malaysian BMS reaches full stationarity only after five years, while the BMS in Brazil takes a longer period. As mentioned in Lemaire (1998), a more sophisticated BMS converges more slowly; this is considered a drawback because it takes a longer period to stabilize. The main objective of a BMS is to separate the good drivers from the bad drivers, and thus it is desirable to have a classification process that can be finalized (or stabilized) as soon as possible.
12.5 BMS and Premium Rating

An a priori risk classification divides policyholders into homogeneous risk classes based on observable rating variables, with all policyholders in the same risk class paying the same a priori premium. The underlying reason for utilizing a BMS that relies on claims experience information is to deal with the residual heterogeneity within each homogeneous risk class, since the observable variables are far from perfect in predicting the riskiness of driving behaviors.
The ideal a posteriori mechanism is the credibility premium framework (see Dionne and Vanasse, 1989), whereby premiums are derived on an individual basis for each policyholder by incorporating both the a priori and the a posteriori information. However, such individual premium determination is overly complex from a commercial standpoint for practical implementation by motor insurers. For this reason, a BMS is the preferred solution; it consists of three building blocks: (a) BMS classes; (b) transition rules; and (c) premium levels (also known as premium relativities or premium adjustment coefficients). The first two building blocks are pre-specified and have been discussed in previous sections. The determination of the premium relativities (rather than using pre-determined levels, as in the Malaysian, Brazilian, and Swiss systems discussed above) is important for motor insurers precisely because the relativities complement and correct the a priori risk classification, accounting for its imperfections and inaccuracies. In the following subsections, we briefly introduce the modelling setup required to study the determination of optimal relativities. We refer interested readers to Denuit et al. (2007) for a fuller discussion of the technical details.
Suppose the a priori expected claim frequency of policyholder $i$ is
$$\lambda_i = d_i \exp\left( \hat{\beta}_0 + \sum_{m=1}^{q} \hat{\beta}_m x_{im} \right),$$
where $x_{i1}, \ldots, x_{iq}$ are the a priori rating variables, $d_i$ denotes the exposure, and the $\hat{\beta}$'s are estimated regression coefficients. Conditional on the random effect $\Theta_i = \theta$, the number of claims is Poisson distributed:
$$\Pr(N_i = k \mid \Theta_i = \theta) = \exp(-\lambda_i \theta)\, \frac{(\lambda_i \theta)^k}{k!}, \qquad k = 0, 1, 2, \ldots.$$
We further assume that all the $\Theta_i$'s are independent and follow a gamma$(a, a)$ distribution with density function
$$f(\theta) = \frac{1}{\Gamma(a)}\, a^a\, \theta^{a-1} \exp(-a\theta), \qquad \theta > 0.$$
In this framework, 1 denotes the column vector of 1's and $\pi_{\ell}^{\lambda}(\lambda\theta)$ is the stationary probability for a driver with true expected claim frequency $\lambda\theta$ to be in level $\ell$ when the equilibrium steady state is reached in the long run. Note that with the incorporation of the random effect parameter $\theta$, the expression (not just numerical values) for $\pi_{\ell}^{\lambda}(\lambda\theta)$ can be derived symbolically with a tool such as MATLAB, but not with base R.
With this setup, the probability of a driver staying in BMS level $L = \ell$ can be computed from the stationary distribution, and the a posteriori premium takes the form $\hat{\lambda}\, r_L$, where $\hat{\lambda}$ is the constant expected claim frequency for all policyholders in the absence of a priori risk classification and $r_L$ is the premium relativity for BMS level $L$. Pitrebois et al. (2003) then incorporated the information of a priori risk classification into the optimization of the same objective function,
$$\min\, \mathbb{E}\left( (\Theta - r_L)^2 \right),$$
to derive $r_L$ analytically. Tan et al. (2015) further proposed the minimization of the objective function
$$\min\, \mathbb{E}\left( (\Lambda\Theta - \Lambda r_L)^2 \right), \qquad \text{subject to } \mathbb{E}(r_L) = 1,$$
under a financial balance constraint (that is, the expected premium relativity equals 1) to determine the optimal relativities of a BMS given pre-specified BMS levels and transition rules. Here $\Lambda$ denotes the a priori expected claim frequency, which takes values $\lambda_g$ with weights $w_g$, $g = 1, \ldots, h$. The objective function decomposes as
$$\begin{aligned}
\min\, \mathbb{E}\left( (\Lambda\Theta - \Lambda r_L)^2 \right)
&= \sum_{\ell=0}^{k-1} \mathbb{E}\left( (\Lambda\Theta - \Lambda r_L)^2 \mid L = \ell \right) \Pr(L = \ell) \\
&= \sum_{\ell=0}^{k-1} \mathbb{E}\left( \mathbb{E}\left( (\Lambda\Theta - \Lambda r_L)^2 \mid L = \ell, \Lambda \right) \mid L = \ell \right) \Pr(L = \ell) \\
&= \sum_{\ell=0}^{k-1} \sum_{g=1}^{h} \mathbb{E}\left( (\Lambda\Theta - \Lambda r_L)^2 \mid L = \ell, \Lambda = \lambda_g \right) \Pr(\Lambda = \lambda_g \mid L = \ell) \Pr(L = \ell) \\
&= \sum_{\ell=0}^{k-1} \sum_{g=1}^{h} \int_0^{\infty} (\lambda_g\theta - \lambda_g r_\ell)^2\, \pi_\ell(\lambda_g\theta)\, w_g\, f(\theta)\, d\theta \\
&= \sum_{g=1}^{h} w_g \int_0^{\infty} \sum_{\ell=0}^{k-1} (\lambda_g\theta - \lambda_g r_\ell)^2\, \pi_\ell(\lambda_g\theta)\, f(\theta)\, d\theta,
\end{aligned}$$
where $\mathbf{r} = (r_0, r_1, \ldots, r_{k-1})^T$ collects the relativities. The required first order conditions give, in the unconstrained case (Lagrange multiplier $\alpha^{\text{unconstrained}} = 0$),
$$r_\ell^{\text{unconstrained}} = \frac{\mathbb{E}(\Lambda^2 \Theta \mid L = \ell)}{\mathbb{E}(\Lambda^2 \mid L = \ell)}.$$
We also assume that the gamma parameter is fixed at $a = 1.5$. Note that while these modelling assumptions are simple, the purpose here is to demonstrate the determination of optimal relativities under a relatively simple setup; the optimization procedure for the BMS remains the same even if the a priori risk classification is performed extensively. Interested readers can use the motor vehicle claims data documented in De Jong and Heller (2008) to conduct the a priori risk segmentation before proceeding to the determination of optimal relativities.
Furthermore, as mentioned earlier, the inclusion of the random effect parameter $\theta$ implies that the expressions (not just numerical values) for the stationary probabilities $\pi_{\ell}^{\lambda}(\lambda\theta)$ used in the subsequent integrals must be derived symbolically, with a tool such as MATLAB rather than base R. Also, because the resulting form of the stationary probabilities is rather complex, we choose not to include R code for the determination of optimal relativities in this section. More importantly, the key take-away of this subsection is a solid conceptual understanding of how to account for all the relevant information in the design of a bonus-malus system.
For the Malaysian BMS with 6 levels and the transition rule of -1/Top, the
obtained numerical values of optimal relativities are presented in Table 12.5
together with the stationary probabilities. We find that around half of the policyholders will occupy the highest BMS level, with the lowest premium relativity, over the long run once the stationary state has been reached. We also observe that the constrained optimal relativities are higher than their unconstrained counterparts because of the need to satisfy the financial balance constraint.
Moreover, we see that except for the highest BMS level (level 5), all other BMS levels impose malus surcharges on the policyholders occupying them. This finding is not surprising, since our theoretical framework determines optimal relativities for a priori base premiums that rely solely on claim frequency information, not claim severity. In practice, insurers can afford to introduce NCD levels with only discounts (bonuses) and no loadings (maluses) because the a priori base premiums have been inflated accordingly, taking into account both claim frequency and claim severity information.
For the Brazilian BMS with 7 levels and the transition rule of -1/+1, the corre-
sponding numerical values of optimal relativities are shown in Table 12.6. We
find that around three quarters of the policyholders will occupy the highest
BMS level with the lowest premium relativity in the stationary state. This
finding is mainly due to the less severe penalty in the transition rule of -1/+1
in comparison to the rule of -1/Top, so more policyholders are expected to oc-
cupy the highest BMS level. Similar to the earlier example, we find that the
unconstrained optimal relativities are lower and result in a lower value of 𝔼(𝑟𝐿 ).
Table 12.6. Optimal Relativities with 𝑘 = 7 levels and transition rule
of -1/+1
Note that the obtained values of optimal relativities may not be desirable for
commercial implementations because of the possibility of irregular differences
between adjacent BMS levels. To alleviate this problem, insurers could consider
imposing linear optimal relativities of the form $r_L^{\text{linear}} = a + bL$, obtained by solving a constrained optimization problem with an inequality constraint.
Contributors
• Noriszura Ismail, Universiti Kebangsaan Malaysia and Chong It Tan,
Macquarie University, are the principal authors of the initial version of
this chapter. Email: [email protected] or [email protected] for
chapter comments and suggested improvements.
• This chapter has not yet been reviewed. Write Noriszura, Chong It, or Jed Frees ([email protected]) if you are interested.
Chapter 13

Data and Systems
Chapter Preview. This chapter covers the learning areas on data and systems
outlined in the IAA (International Actuarial Association) Education Syllabus
published in September 2015. This chapter is organized into three major parts:
data, data analysis, and data analysis techniques. The first part introduces data basics such as data types, data structures, data storage, and data sources.
The second part discusses the process and various aspects of data analysis. The
third part presents some commonly used techniques for data analysis.
13.1 Data
13.1.1 Data Types and Sources
In terms of how data are collected, data can be divided into two types (Hox
and Boeije, 2005): primary data and secondary data. Primary data are original
data that are collected for a specific research problem. Secondary data are
data originally collected for a different purpose and reused for another research
problem. A major advantage of using primary data is that the theoretical
constructs, the research design, and the data collection strategy can be tailored
to the underlying research question to ensure that data collected help to solve
the problem. A disadvantage of using primary data is that data collection can
be costly and time consuming. Using secondary data has the advantage of lower
cost and faster access to relevant information. However, using secondary data
may not be optimal for the research question under consideration.
In terms of the degree of organization of the data, data can also be divided into two types (Inmon and Linstedt, 2014; O'Leary, 2013; Hashem et al., 2015; Abdullah and Ahmad, 2013; Pries and Dunnigan, 2015): structured data and unstructured data. Structured data have a predictable and regularly occurring format. In contrast, unstructured data lack any regularly occurring format.
A structured dataset with $n$ observations and $d$ variables $V_1, \ldots, V_d$ can be arranged as a matrix, with row $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})$ containing the values of the variables for the $i$th observation:
$$\begin{array}{c|cccc}
 & V_1 & V_2 & \cdots & V_d \\
\hline
\mathbf{x}_1 & x_{11} & x_{12} & \cdots & x_{1d} \\
\mathbf{x}_2 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & & \vdots \\
\mathbf{x}_n & x_{n1} & x_{n2} & \cdots & x_{nd}
\end{array}$$
Data profiling can also be used to identify inaccurate data elements. There are five types of analysis that can be used to identify inaccurate data (Olson, 2003): data element analysis, structural analysis, value correlation, aggregation correlation, and value inspection.

Companies can create a data quality assurance program to build high-quality databases. For more information about the management of data quality issues and data profiling techniques, readers are referred to Olson (2003).
          EDA                   CDA
Data      Observational data    Experimental data
Techniques for EDA include descriptive statistics (e.g., mean, median, standard
deviation, quantiles), distributions, histograms, correlation analysis, dimension
reduction, and cluster analysis. Techniques for CDA include the traditional
statistical tools of inference, significance, and confidence.
Unsupervised learning methods work with unlabeled data, which include ex-
planatory variables only. In other words, unsupervised learning methods do not
use target variables. As a result, unsupervised learning methods are also called
descriptive modeling methods.
Business intelligence (BI) level analysis is typically conducted with data on the order of up to terabytes (TB). Massive level analysis is conducted when data surpass the capabilities of products for BI level analysis; usually Hadoop and MapReduce are used in massive level analysis.
• Competence. Do I or the whole team have the expertise to carry out the project? Incompetence may lead to weaknesses in the analytics, such as collecting large amounts of data poorly and drawing superficial conclusions.
• Benefits, costs, and reciprocity. Will each stakeholder gain from the project? Are the benefits and costs equitable? A project is likely to fail if the benefits and costs for a stakeholder do not match.
• Privacy and confidentiality. How do we make sure that the information is kept confidential? How do we verify where raw data and analysis results are stored, and who will have access to them? These questions should be addressed and documented in explicit confidentiality agreements.
                   Supervised       Unsupervised
Discrete data      Classification   Clustering
Continuous data    Regression       Dimension reduction
Descriptive Statistics
In one sense (as a “mass noun”), “descriptive statistics” is an area of statistics
that concerns the collection, organization, summarization, and presentation of
data (Bluman, 2012). In another sense (as a “count noun”), “descriptive statis-
tics” are summary statistics that quantitatively describe or summarize data.
Table 13.4. Commonly Used Descriptive Statistics

Measures of central tendency    Mean, median, mode, midrange
Measures of variation           Range, variance, standard deviation
Measures of position            Quantile
Table 13.4 lists some commonly used descriptive statistics. In R, we can use the
function summary to calculate some of the descriptive statistics. For numeric
data, we can visualize the descriptive statistics using a boxplot.
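For instance, a minimal sketch using one of R's built-in datasets:

summary(cars$dist)   # minimum, quartiles, median, mean, and maximum
boxplot(cars$dist)   # box plot visualizing the descriptive statistics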
In addition to these quantitative descriptive statistics, we can also qualitatively
describe shapes of the distributions (Bluman, 2012). For example, we can say
that a distribution is positively skewed, symmetric, or negatively skewed. To
visualize the distribution of a variable, we can draw a histogram.
Principal Component Analysis

Principal component analysis (PCA) creates new variables as linear combinations of the original variables. The $i$th principal component is
$$Z_i = \mathbf{e}_i' \mathbf{X} = \sum_{j=1}^{d} e_{ij} X_j,$$
where $\mathbf{e}_i$ is the eigenvector of the covariance matrix of $\mathbf{X}$ associated with its $i$th largest eigenvalue $\lambda_i$. The proportion of the total variance explained by the $i$th principal component is
$$\frac{\operatorname{Var}(Z_i)}{\sum_{j=1}^{d} \operatorname{Var}(Z_j)} = \frac{\lambda_i}{\lambda_1 + \lambda_2 + \cdots + \lambda_d}.$$
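In R, PCA can be carried out with the prcomp function; a minimal sketch using a built-in dataset is:

pc <- prcomp(USArrests, scale. = TRUE)  # PCA on standardized variables
pc$rotation                             # columns are the eigenvectors e_i
summary(pc)                             # proportion of variance lambda_i / sum_j lambda_j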
For more information about PCA, readers are referred to Mirkin (2011).
Cluster Analysis
Cluster analysis (aka data clustering) refers to the process of dividing a dataset
into homogeneous groups or clusters such that points in the same cluster are
similar and points from different clusters are quite distinct (Gan et al., 2007;
Gan, 2011). Data clustering is one of the most popular tools for exploratory
data analysis and has found applications in many scientific areas.
During the past several decades, many clustering algorithms have been proposed.
Among these clustering algorithms, the 𝑘-means algorithm is perhaps the most
well-known algorithm due to its simplicity. To describe the k-means algorithm,
let 𝑋 = {x1 , x2 , … , x𝑛 } be a dataset containing 𝑛 points, each of which is
described by 𝑑 numerical features. Given a desired number of clusters 𝑘, the
$k$-means algorithm aims at minimizing the following objective function:
$$P(U, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} u_{il}\, \|\mathbf{x}_i - \mathbf{z}_l\|^2,$$
where $u_{il} \in \{0, 1\}$ indicates whether point $\mathbf{x}_i$ belongs to cluster $l$, $\mathbf{z}_l$ is the center of the $l$th cluster, and the memberships satisfy
$$\sum_{l=1}^{k} u_{il} = 1, \qquad i = 1, 2, \ldots, n.$$
Linear Models
Linear models, also called linear regression models, aim at using a linear function
to approximate the relationship between the dependent variable and indepen-
dent variables. A linear regression model is called a simple linear regression
model if there is only one independent variable. When more than one indepen-
dent variable is involved, a linear regression model is called a multiple linear
regression model.
Let 𝑋 and 𝑌 denote the independent and the dependent variables, respectively.
For 𝑖 = 1, 2, … , 𝑛, let (𝑥𝑖 , 𝑦𝑖 ) be the observed values of (𝑋, 𝑌 ) in the 𝑖th case.
Then the simple linear regression model is specified as follows (Frees, 2009):
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, 2, \ldots, n,$$
where $\epsilon_i$ is a random error term. With $k$ explanatory variables, the multiple linear regression model is
$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i.$$
Generalized linear models (GLMs) extend this setup. For the mean response $\mu_i = \mathrm{E}[y_i]$, the systematic component is related to the mean through a link function:
$$\eta_i = \mathbf{x}_i' \boldsymbol{\beta} = g(\mu_i),$$
where $\mathbf{x}_i = (1, x_{i1}, x_{i2}, \ldots, x_{ik})'$ is a vector of regressor values, $\mu_i$ is the mean response for the $i$th case, and $\eta_i$ is the systematic component of the GLM. The function $g(\cdot)$ is known and is called the link function. The mean response can vary by observation by allowing some parameters to change; however, the regression parameters $\boldsymbol{\beta}$ are assumed to be the same across observations.
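As a minimal sketch, a Poisson GLM with a log link can be fit in R with the glm function (simulated data are used for illustration):

set.seed(1)
x <- rnorm(200)
y <- rpois(200, lambda = exp(0.5 + 0.8 * x))       # mean follows a log link
fit <- glm(y ~ x, family = poisson(link = "log"))
coef(fit)                                          # estimates close to (0.5, 0.8)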
Tree-based Models
Decision trees, also known as tree-based models, involve dividing the predictor
space (i.e., the space formed by independent variables) into a number of simple
regions and using the mean or the mode of the region for prediction (Breiman
et al., 1984). There are two types of tree-based models: classification trees
and regression trees. When the dependent variable is categorical, the result-
ing tree models are called classification trees. When the dependent variable is
continuous, the resulting tree models are called regression trees.
The process of building classification trees is similar to that of building regression
trees. Here we only briefly describe how to build a regression tree. To do
that, the predictor space is divided into non-overlapping regions such that the
following objective function is minimized:
$$f(R_1, R_2, \ldots, R_J) = \sum_{j=1}^{J} \sum_{i=1}^{n} I_{R_j}(\mathbf{x}_i)\,(y_i - \mu_j)^2,$$
where $I_{R_j}(\mathbf{x}_i)$ indicates whether the point $\mathbf{x}_i$ falls in region $R_j$ and $\mu_j$ is the mean response of the points in region $R_j$.
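A minimal sketch of a regression tree in R, using the recommended rpart package and a built-in dataset:

library(rpart)
fit <- rpart(dist ~ speed, data = cars, method = "anova")  # regression tree
print(fit)   # displays the regions and the mean response used in each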
Table 13.5 lists a few R functions for different data analysis tasks. Readers can
go to the R documentation to learn how to use these functions. There are also
other R packages that do similar things. However, the functions listed in this
table provide good starting points for readers to conduct data analysis in R. For
analyzing large datasets in R in an efficient way, readers are referred to Daroczi
(2015).
13.5 Summary
In this chapter, we give a high-level overview of data analysis by introducing data types, data structures, data storage, data sources, data analysis processes, and data analysis techniques. In particular, we present various aspects of data
analysis. In addition, we provide several websites where readers can obtain real-
world datasets to hone their data analysis skills. We also list some R packages
and functions that can be used to perform various data analysis tasks.
Chapter 14

Dependence Modeling
Chapter Preview. In practice, there are many types of variables that one encoun-
ters. The first step in dependence modeling is identifying the type of variable
to help direct you to the appropriate technique. This chapter introduces read-
ers to variable types and techniques for modeling dependence or association of
multivariate distributions. Section 14.1 provides an overview of the types of vari-
ables. Section 14.2 then elaborates basic measures for modeling the dependence
between variables.
Section 14.3 introduces an approach to modeling dependence using copulas, which is reinforced with practical illustrations in Section 14.4. The types of copula families and basic properties of copula functions are explained in Section 14.5. The chapter concludes by explaining why the study of dependence modeling is important in Section 14.6.
People, firms, and other entities that we want to understand are described in
a dataset by numerical characteristics. As these characteristics vary by entity,
they are commonly known as variables. To manage insurance systems, it will
be critical to understand the distribution of each variable and how they are
associated with one another. It is common for data sets to have many variables
(high dimensional), and so it is useful to begin by classifying them into different types. As will be seen, these classifications are not strict; there is overlap among the groups. Nonetheless, the grouping summarized in Table 14.1 and explained in the remainder of this section provides a solid first step in framing a data set.
Table 14.1. Variable Types
[Figure: Claim amounts by entity type (City, County, Misc, School, Town, Village).]
A binary variable is a special type of categorical variable where there are only
two categories commonly taken to be 0 and 1. For example, we might code a
variable in a dataset to be 1 if an insured is female and 0 if male.
Insurance data typically are multivariate in the sense that we can take many
measurements on a single entity. For example, when studying losses associated
with a firm’s workers’ compensation plan, we might want to know the location
of its manufacturing plants, the industry in which it operates, the number of
employees, and so forth. The usual strategy for analyzing multivariate data is
to begin by examining each variable in isolation of the others. This is known as
a univariate approach.
In contrast, for some variables, it makes little sense to only look at one dimen-
sional aspect. For example, insurers typically organize spatial data by longitude
and latitude to analyze the location of weather related insurance claims due to
hailstorms. Having only a single number, either longitude or latitude, provides
little information in understanding geographic location.
Another special case of a multivariate variable, less obvious, involves coding for
missing data. Historically, some statistical packages used a -99 to report when a
variable, such as policyholder’s age, was not available or not reported. This led
to many unsuspecting analysts providing strange statistics when summarizing
a set of data. When data are missing, it is better to think about the variable as
having two dimensions, one to indicate whether or not the variable is reported
and the second providing the age (if reported). In the same way, insurance data
are commonly censored and truncated. We refer you to Section 4.3 for more
on censored and truncated data. Aggregate claims, described in Chapter 5, can
also be coded as another special type of multivariate variable.
Perhaps the most complicated type of multivariate variable is a realization of
a stochastic process. You will recall that a stochastic process is little more
than a collection of random variables. For example, in insurance, we might
think about the times that claims arrive to an insurance company in a one-
year time horizon. This is a high dimensional variable that theoretically is
infinite dimensional. Special techniques are required to understand realizations
of stochastic processes that will not be addressed here.
Pearson Correlation

Define the sample covariance function $\widehat{Cov}(X,Y) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$, where $\bar{X}$ and $\bar{Y}$ are the sample means of $X$ and $Y$, respectively. Then, the product-moment (Pearson) correlation can be written as
$$r = \frac{\widehat{Cov}(X,Y)}{\sqrt{\widehat{Cov}(X,X)\,\widehat{Cov}(Y,Y)}} = \frac{\widehat{Cov}(X,Y)}{\sqrt{\widehat{Var}(X)}\sqrt{\widehat{Var}(Y)}}.$$

Spearman's Rho

Replacing each observation with its rank, $R(X_i)$ and $R(Y_i)$, yields the Spearman correlation statistic,
$$r_S = \frac{\widehat{Cov}(R(X), R(Y))}{\sqrt{\widehat{Cov}(R(X),R(X))\,\widehat{Cov}(R(Y),R(Y))}} = \frac{\widehat{Cov}(R(X), R(Y))}{(n^2-1)/12}.$$
You can obtain the Spearman correlation statistic 𝑟𝑆 using the cor() function
in R and selecting the spearman method. From below, the Spearman correlation
between the Coverage rating variable in millions of dollars and Claim amount
variable in dollars is 0.41.
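A hypothetical sketch of this computation, assuming the Coverage and Claim variables are columns of a data frame named dat:

cor(dat$Coverage, dat$Claim, method = "spearman")       # 0.41, as reported
cor(log(dat$Coverage), dat$Claim, method = "spearman")  # unchanged by the log transform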
We can show that the Spearman correlation statistic is invariant under strictly
increasing transformations. From the R Code for the Spearman correlation
statistic above, 𝑟𝑆 = 0.41 between the Coverage rating variable in logarithmic
millions of dollars and Claim amount variable in dollars.
Kendall's Tau

An alternative measure that uses ranks is based on the concept of concordance. An observation pair $(X, Y)$ is said to be concordant (discordant) if the observation with a larger value of $X$ also has the larger (smaller) value of $Y$. Then $\Pr(\text{concordance}) = \Pr[(X_1 - X_2)(Y_1 - Y_2) > 0]$, $\Pr(\text{discordance}) = \Pr[(X_1 - X_2)(Y_1 - Y_2) < 0]$, and $\Pr(\text{tie}) = \Pr[(X_1 - X_2)(Y_1 - Y_2) = 0]$.
To estimate this, the pairs (𝑋𝑖 , 𝑌𝑖 ) and (𝑋𝑗 , 𝑌𝑗 ) are said to be concordant if the
product 𝑠𝑔𝑛(𝑋𝑗 − 𝑋𝑖 )𝑠𝑔𝑛(𝑌𝑗 − 𝑌𝑖 ) equals 1 and discordant if the product equals
-1. Here, 𝑠𝑔𝑛(𝑥) = 1, 0, −1 as 𝑥 > 0, 𝑥 = 0, 𝑥 < 0, respectively. With this, we
can express the association measure of Kendall (1938), known as Kendall’s tau,
as
$$\hat{\tau} = \frac{2}{n(n-1)} \sum_{i<j} sgn(X_j - X_i) \times sgn(Y_j - Y_i) = \frac{2}{n(n-1)} \sum_{i<j} sgn(R(X_j) - R(X_i)) \times sgn(R(Y_j) - R(Y_i)).$$
Interestingly, Hougaard (2000), page 137, attributes the original discovery of this
statistic to Fechner (1897), noting that Kendall’s discovery was independent and
more complete than the original work.
You can obtain Kendall’s tau using the cor() function in R and selecting the
kendall method. From below, 𝜏 ̂ = 0.32 between the Coverage rating variable
in millions of dollars and the Claim amount variable in dollars. When there are
ties in the data, the cor() function computes Kendall’s tau_b as proposed by
Kendall (1945).
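A hypothetical sketch, again assuming a data frame dat with Coverage and Claim columns:

cor(dat$Coverage, dat$Claim, method = "kendall")       # 0.32, as reported
cor(log(dat$Coverage), dat$Claim, method = "kendall")  # unchanged by the log transform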
Also, to show that Kendall's tau is invariant under strictly increasing transformations, we see that $\hat{\tau} = 0.32$ between the Coverage rating variable in logarithmic millions of dollars and the Claim amount variable in dollars.
For a pair of binary variables, let $\pi_X = \Pr(X = 1)$, $\pi_Y = \Pr(Y = 1)$, and $\pi_{11} = \Pr(X = 1, Y = 1)$. The Pearson correlation is then
$$\rho = \frac{\pi_{11} - \pi_X \pi_Y}{\sqrt{\pi_X(1 - \pi_X)\,\pi_Y(1 - \pi_Y)}}.$$
Unlike the case for continuous data, it is not possible for this measure to achieve
the limiting boundaries of the interval [−1, 1]. To see this, students of probability
may recall the Fréchet-Höeffding bounds for a joint distribution that turn out
to be max{0, 𝜋𝑋 + 𝜋𝑌 − 1} ≤ 𝜋11 ≤ min{𝜋𝑋 , 𝜋𝑌 } for this joint probability.
(More discussion of these bounds is in Section 14.5.4.) This limit on the joint
probability imposes an additional restriction on the Pearson correlation. As an example, if $\pi_X = \pi_Y = 0.8$, then the lower bound forces $\pi_{11} \ge 0.6$, so the smallest that the Pearson correlation could be is $(0.6 - 0.8^2)/(0.8 \times 0.2) = -0.25$. More generally, there are bounds on $\rho$ that depend on $\pi_X$ and $\pi_Y$ that
make it difficult to interpret this measure.
As noted by Bishop et al. (1975) (page 382), squaring this correlation coefficient
yields the Pearson chi-square statistic (introduced in Section 2.7). Despite the
boundary problems described above, this feature makes the Pearson correlation
coefficient a good choice for describing dependence with binary data.
As an alternative measure for Bernoulli variables, the odds ratio is given by
$$OR(\pi_{11}) = \frac{\pi_{11}\,\pi_{00}}{\pi_{01}\,\pi_{10}},$$
where $\pi_{jk} = \Pr(X = j, Y = k)$. Unlike the Pearson correlation, the odds ratio is unaffected by the marginal distributions: if the cell probabilities are rescaled as $\pi_{ij}^{new} = a_i b_j \pi_{ij}$ with $\sum_{ij} \pi_{ij}^{new} = 1$, then the odds ratio computed from the $\pi^{new}$ is the same as that computed from the $\pi$.
For additional help with interpretation, Yule proposed two transforms of the odds ratio, the first in Yule (1900),
$$\frac{OR - 1}{OR + 1},$$
and the second in Yule (1912),
$$\frac{\sqrt{OR} - 1}{\sqrt{OR} + 1}.$$
Although these statistics provide the same information as the original odds ratio $OR$, they have the advantage of taking values in the interval $[-1, 1]$, making them easier to interpret. In a later section, we will also see that the marginal distributions have no effect on the Fréchet-Höeffding bounds or on the tetrachoric correlation, another measure of association; see also Joe (2014), page 48.
From Table 14.2, $OR(\pi_{11}) = \frac{1611(956)}{897(2175)} = 0.79$. You can obtain $OR(\pi_{11})$ using the oddsratio() function from the epitools library in R. From the output below, $OR(\pi_{11}) = 0.79$ for the binary variables NoClaimCredit and Fire5 from the LGPIF data.
Table 14.2. 2 × 2 Table of Counts for Fire5 and NoClaimCredit
Fire5
NoClaimCredit 0 1 Total
0 1611 2175 3786
1 897 956 1853
Total 2508 3131 5639
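A minimal sketch reproducing this odds ratio from the counts in Table 14.2 (the epitools call is shown as an alternative):

tab <- matrix(c(1611, 2175,
                897,  956), nrow = 2, byrow = TRUE)
(tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])  # 0.79
# library(epitools); oddsratio(tab)                # package alternative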
Categorical Variables
More generally, let $(X, Y)$ be a bivariate pair having $ncat_X$ and $ncat_Y$ categories, respectively. For a two-way table of counts, let $n_{jk}$ be the count in the $j$th row and $k$th column, let $n_{j\bullet}$ be the row margin total, $n_{\bullet k}$ the column margin total, and $n = \sum_{j,k} n_{jk}$. Define the Pearson chi-square statistic as
$$\chi^2 = \sum_{jk} \frac{\left( n_{jk} - n_{j\bullet} n_{\bullet k}/n \right)^2}{n_{j\bullet} n_{\bullet k}/n}$$
and the likelihood ratio test statistic as $G^2$; in terms of the empirical probabilities $\pi_{jk} = n_{jk}/n$, $\pi_{X,j} = n_{j\bullet}/n$, and $\pi_{Y,k} = n_{\bullet k}/n$, the latter satisfies
$$\frac{G^2}{n} \approx 2 \sum_{jk} \pi_{jk} \log \frac{\pi_{jk}}{\pi_{X,j}\, \pi_{Y,k}}.$$
Table 14.3. Two-way Table of Counts for EntityType and NoClaimCredit

              NoClaimCredit
EntityType    0       1
City          644     149
County        310     18
Misc          336     273
School        1103    494
Town          492     479
Village       901     440
You can obtain the Pearson chi-square statistic using the chisq.test() function in R. Here, we test whether the EntityType variable is independent of the NoClaimCredit variable using Table 14.3.
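A minimal sketch of this test using the counts in Table 14.3:

counts <- matrix(c(644, 149,
                   310, 18,
                   336, 273,
                   1103, 494,
                   492, 479,
                   901, 440), ncol = 2, byrow = TRUE)
chisq.test(counts)   # small p-value: reject independence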
As the p-value is less than the .05 significance level, we reject the null hypothesis
that EntityType is independent of NoClaimCredit.
Furthermore, you can obtain the likelihood ratio test statistic, using the
likelihood.test() function from the Deducer library in R. From below, we
test whether EntityType is independent of NoClaimCredit from the LGPIF
data. The same conclusion is drawn as the Pearson chi-square test.
The polychoric correlation estimate $\hat{\rho}_N$ maximizes the likelihood over the cells of the two-way table, based on the bivariate normal distribution function $\Phi_2(\cdot, \cdot; \rho)$ and the estimated cutpoints $\hat{\xi}$:
$$\hat{\rho}_N = \text{argmax}_{\rho} \sum_{s=m_1}^{m_2} \sum_{t=m_1}^{m_2} n_{st} \log\left\{ \Phi_2(\hat{\xi}_{1s}, \hat{\xi}_{2t}; \rho) - \Phi_2(\hat{\xi}_{1,s-1}, \hat{\xi}_{2t}; \rho) - \Phi_2(\hat{\xi}_{1s}, \hat{\xi}_{2,t-1}; \rho) + \Phi_2(\hat{\xi}_{1,s-1}, \hat{\xi}_{2,t-1}; \rho) \right\}.$$
Table 14.4. Two-way Table of Counts for AlarmCredit and NoClaimCredit

              NoClaimCredit
AlarmCredit   0      1
1             1669   942
2             121    118
3             195    132
4             1801   661

You can obtain the polychoric or tetrachoric correlation using the polychoric() or tetrachoric() function from the psych library in R. The polychoric correlation is illustrated using Table 14.4. Here, $\hat{\rho}_N = -0.14$, which means that there is a negative relationship between AlarmCredit and NoClaimCredit.
When one variable is continuous and the other ordinal, the polyserial correlation is defined analogously, as
$$\hat{\rho}_N = \text{argmax}_{\rho} \sum_{i=1}^{n} \log\left\{ \phi(z_{i1}) \left[ \Phi\left( \frac{\hat{\xi}_{2,y_{i2}} - \rho z_{i1}}{(1-\rho^2)^{1/2}} \right) - \Phi\left( \frac{\hat{\xi}_{2,y_{i2}-1} - \rho z_{i1}}{(1-\rho^2)^{1/2}} \right) \right] \right\}.$$
The biserial correlation is defined similarly, when one variable is continuous and the other binary.
Table 14.5. Summary of Claim by NoClaimCredit
You can obtain the polyserial or biserial correlation using the polyserial()
or biserial() function, respectively, from the psych library in R. Table 14.5
gives the summary of Claim by NoClaimCredit, and the biserial correlation is illustrated using the R code below. Here $\hat{\rho}_N = -0.04$, which means that there is a negative correlation between Claim and NoClaimCredit.
Copulas are widely used in insurance and many other fields to model the de-
pendence among multivariate outcomes. A copula is a multivariate distribution
function with uniform marginals. Specifically, let {𝑈1 , … , 𝑈𝑝 } be 𝑝 uniform
random variables on (0, 1). Their distribution function
𝐶(𝑢1 , … , 𝑢𝑝 ) = Pr(𝑈1 ≤ 𝑢1 , … , 𝑈𝑝 ≤ 𝑢𝑝 ),
is a copula. We seek to use copulas in applications that are based on more than
just uniformly distributed data. Thus, consider arbitrary marginal distribu-
tion functions 𝐹1 (𝑦1 ),…,𝐹𝑝 (𝑦𝑝 ). Then, we can define a multivariate distribution
function using the copula such that
$$F(y_1, \ldots, y_p) = C\left( F_1(y_1), \ldots, F_p(y_p) \right).$$
This construction is due to Sklar, who also showed that, if the marginal distributions are continuous, then there is a unique copula representation. In this chapter we focus on copula modeling with continuous variables. For the discrete case, readers can see Joe (2014) and Genest and Nešlehová (2007).

For the bivariate case where $p = 2$, we can write a copula and the distribution function of two random variables as
$$C(u_1, u_2) = \Pr(U_1 \le u_1, U_2 \le u_2)$$
and
$$F(y_1, y_2) = C\left( F_1(y_1), F_2(y_2) \right).$$
As an example, we can look at the copula due to Frank (1979). The copula (distribution function) is
$$C(u_1, u_2) = \frac{1}{\gamma} \log\left( 1 + \frac{(\exp(\gamma u_1) - 1)(\exp(\gamma u_2) - 1)}{\exp(\gamma) - 1} \right). \tag{14.2}$$
This is a bivariate distribution function with its domain on the unit square $[0,1]^2$. Here $\gamma$ is the dependence parameter; that is, the range of dependence is controlled by the parameter $\gamma$, and positive association increases as $\gamma$ increases. As we will see, this positive association can be summarized with Spearman's rho ($\rho_S$) and Kendall's tau ($\tau$). Frank's copula is commonly used; we will see other copula functions in Section 14.5.
This section analyzes the insurance losses and expenses data with the statistical
program R. The data set was introduced in Frees and Valdez (1998) and is now
readily available in the copula package. The model fitting process is started by
marginal modeling of each of the two variables, LOSS and ALAE. Then we model
the joint distribution of these marginal outcomes.
[Figure 14.2: Scatter plots of LOSS versus ALAE, on the original scale (left) and on the log scale (right).]
For the marginal distributions we use the Pareto distribution, with distribution function
$$F(y) = 1 - \left( \frac{\theta}{y + \theta} \right)^{\alpha}.$$
The marginal distributions of losses and expenses are fit using the method of
maximum likelihood. Specifically, we use the vglm function from the R VGAM
package. Firstly, we fit the marginal distribution of ALAE. Parameters are
summarized in Table 14.6.
We repeat this procedure to fit the marginal distribution of the LOSS variable.
Because the loss variable also appears right-skewed and heavy-tailed, we again model the marginal distribution with the Pareto distribution (although with different parameters).
Table 14.6. Summary of Pareto Maximum Likelihood Fitted Parameters from the LGPIF Data

        Scale θ̂        Shape α̂
ALAE    15133.60360    2.22304
LOSS    16228.14797    1.23766
To visualize the fitted distribution of LOSS and ALAE variables, one can use
the estimated parameters and plot the corresponding distribution function and
density function. For more details on the selection of marginal models, see
Chapter 4.
The fitted distribution function is used to transform ALAE into a variable $u_1$ that is uniformly distributed on $[0,1]$:
$$u_1 = \hat{F}_1(ALAE) = 1 - \left( 1 + \frac{ALAE}{\hat{\theta}} \right)^{-\hat{\alpha}}.$$
[Figure 14.3: Histogram of the transformed ALAE variable.]
In the same way, the variable LOSS is transformed to the variable $u_2$, which follows a uniform distribution on $[0, 1]$. The left-hand panel of Figure 14.4 shows a histogram of the transformed ALAE, again reinforcing the Pareto distribution specification. For another way of looking at the data, the variable $u_2$ can be transformed to a normal score with the quantile function of the standard normal distribution. As we see in Figure 14.4, normal scores of the variable LOSS are approximately marginally standard normal. This figure is helpful because analysts are used to looking for patterns of approximate normality (which seems to be evident in the figure). The logic is that, if the Pareto distribution is correctly specified, then the transformed losses $u_2$ should be approximately uniform, and the normal scores $\Phi^{-1}(u_2)$ should be approximately normal. (Here, $\Phi$ is the cumulative standard normal distribution function.)
[Figure 14.4: Histograms of the transformed variable (left) and of the corresponding normal scores (right).]
Figure 14.5: Left: scatter plot of the transformed variables. Right: scatter plot of the normal scores.
Recall that the Spearman correlation is based only on ranks. Therefore, the statistic is the same for (1) the original data in Figure 14.2, (2) the data transformed to uniform scales in the left-hand panel of Figure 14.5, and (3) the normal scores in the right-hand panel of Figure 14.5.
The next step is to calculate estimates of the copula parameters. One option is to use traditional maximum likelihood and determine all the parameters at the same time, which can be computationally burdensome. Even in our simple example, this means maximizing a (log) likelihood function over five parameters: two for the marginal ALAE distribution, two for the marginal LOSS distribution, and one for the copula. A widely used alternative, known as the inference for margins (IFM) approach, is to simply use the fitted marginal distributions, $u_1$ and $u_2$, as inputs when determining the copula. This is the approach taken here. In the following code, you will see that it turns out that the fitted copula parameter is $\hat{\gamma} = 3.114$.
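A hypothetical sketch of this IFM step with the copula package, assuming u1 and u2 hold the Pareto-transformed ALAE and LOSS values constructed above:

library(copula)
fit <- fitCopula(frankCopula(dim = 2), cbind(u1, u2),
                 method = "ml")  # maximum likelihood given the fitted margins
coef(fit)                        # fitted dependence parameter, about 3.114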
To visualize the fitted Frank’s copula, the distribution function and density
function perspective plots are drawn in Figure 14.6.
Figure 14.6: Distribution function (left) and density function (right) of the fitted Frank copula.
Several families of copulas have been described in the literature. Two main families are the Archimedean and elliptical copulas.
The $p$-dimensional multivariate normal probability density function with mean zero and variance-covariance matrix $\Sigma$ is
$$\phi_N(\mathbf{z}) = \frac{1}{(2\pi)^{p/2} \sqrt{\det \Sigma}} \exp\left( -\frac{1}{2} \mathbf{z}' \Sigma^{-1} \mathbf{z} \right). \tag{14.3}$$
The corresponding normal copula density function is
$$c_N(u_1, \ldots, u_p) = \phi_N\left( \Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_p) \right) \prod_{j=1}^{p} \frac{1}{\phi(\Phi^{-1}(u_j))}.$$
Here, we use $\Phi$ and $\phi$ to denote the standard normal distribution and density functions. Unlike the usual probability density function $\phi_N$, the copula density function has its domain on the hyper-cube $[0,1]^p$. For contrast, Figure 14.7 compares these two density functions.
[Figure 14.7: A bivariate normal density (left) and the corresponding normal copula density on the unit square (right).]
More generally, an elliptical distribution has a density of the form
$$h_E(\mathbf{z}) = \frac{k_p}{\sqrt{\det \Sigma}}\, g_p\left( \frac{1}{2} (\mathbf{z} - \mu)' \Sigma^{-1} (\mathbf{z} - \mu) \right),$$
where $k_p$ is a normalizing constant and $g_p(\cdot)$ is called the generator function. For example, the multivariate $t$-distribution with $r$ degrees of freedom has density
$$h_{t_r}(\mathbf{z}) = \frac{k_p}{\sqrt{\det \Sigma}} \left( 1 + \frac{(\mathbf{z} - \mu)' \Sigma^{-1} (\mathbf{z} - \mu)}{r} \right)^{-(p+r)/2}.$$
Table 14.7. Generator Functions ($g_p(\cdot)$) for Selected Elliptical Distributions

Distribution                                    Generator $g_p(x)$
Normal distribution                             $e^{-x}$
$t$-distribution with $r$ degrees of freedom    $(1 + 2x/r)^{-(p+r)/2}$
Cauchy                                          $(1 + 2x)^{-(p+1)/2}$
Logistic                                        $e^{-x}/(1 + e^{-x})^2$
Exponential power                               $\exp(-r x^s)$
In general, the copula density of an elliptical distribution with joint density $h_E$, and with marginal distribution function $H$ and density $h$, is
$$c_E(u_1, \ldots, u_p) = h_E\left( H^{-1}(u_1), \ldots, H^{-1}(u_p) \right) \prod_{j=1}^{p} \frac{1}{h(H^{-1}(u_j))}.$$
As noted above, most empirical work focuses on the normal copula and the $t$-copula. Specifically, $t$-copulas are useful for modeling dependency in the tails of bivariate distributions, especially in financial risk analysis applications. $t$-copulas with the same association parameter but varying degrees of freedom differ mainly in their tail behavior.
Clayton Copula

For $p = 2$, the Clayton copula, parameterized by $\gamma \in [-1, \infty)$, is defined by
$$C_{\gamma}^{C}(u) = \left( \max\{u_1^{-\gamma} + u_2^{-\gamma} - 1,\ 0\} \right)^{-1/\gamma}, \qquad u \in [0,1]^2.$$
This is a bivariate distribution function defined on the unit square $[0,1]^2$. The range of dependence is controlled by the parameter $\gamma$, similar to Frank's copula.
Gumbel-Hougaard Copula

The Gumbel-Hougaard copula is parametrized by $\gamma \in [1, \infty)$ and defined by
$$C_{\gamma}^{GH}(u) = \exp\left( -\left( \sum_{i=1}^{2} (-\log u_i)^{\gamma} \right)^{1/\gamma} \right), \qquad u \in [0,1]^2.$$
For more information on Archimedean copulas, see Joe (2014), Frees and Valdez
(1998), and Genest and Mackay (1986).
Bounds on Association

Any distribution function is bounded below by zero and above by one. Additional types of bounds are available in multivariate contexts; these bounds are useful when studying dependencies. That is, as an analyst thinks about variables as being extremely dependent, bounds are available that cannot be exceeded, regardless of the dependence. The most widely used bounds in dependence modeling are the Fréchet-Höeffding bounds, given by
$$\max\{u_1 + \cdots + u_p - p + 1,\ 0\} \le C(u_1, \ldots, u_p) \le \min\{u_1, \ldots, u_p\}.$$
Measures of Association
Empirical versions of Spearman’s rho and Kendall’s tau were introduced in
Sections 14.2.2 and 14.2.2, respectively. The interesting thing about these ex-
pressions is that these summary measures of association are based only on the
ranks of each variable. Thus, any strictly increasing transform does not affect
these measures of association. Specifically, consider two random variables, 𝑌1
and 𝑌2 , and let m1 and m2 be strictly increasing functions. Then, the associa-
tion, when measured by Spearman’s rho or Kendall’s tau, between 𝑚1 (𝑌1 ) and
𝑚2 (𝑌2 ) does not change regardless of the choice of m1 and m2 . For example,
this allows analysts to consider dollars, Euros, or log dollars, and still retain the
[Figure 14.8: Scatter plots of simulated $(U_1, U_2)$ realizations.]
same essential dependence. As we have seen in Section 14.2, this is not the case with Pearson's measure of correlation.

Schweizer et al. (1981) established that the copula accounts for all the dependence, in the sense that the way $Y_1$ and $Y_2$ “move together” is captured by the copula, regardless of the scale in which each variable is measured. They also showed that (population versions of) the two standard nonparametric measures of association can be expressed solely in terms of the copula function. Spearman's correlation coefficient is given by
$$\rho_S = 12 \int_0^1 \int_0^1 \left\{ C(u,v) - uv \right\} du\, dv, \tag{14.4}$$
and Kendall's tau by
$$\tau = 4 \int_0^1 \int_0^1 C(u,v)\, dC(u,v) - 1.$$
Example. Loss versus Expenses. Earlier, in Section 14.4, we saw that the Spearman correlation was 0.452, calculated with the rho function. Then, we fit Frank's copula to these data and estimated the dependence parameter to be $\hat{\gamma} = 3.114$. As an alternative, the following code shows how to use the empirical version of equation (14.4). In this case, the Spearman correlation coefficient is 0.462, which is close to the sample Spearman correlation coefficient, 0.452.
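With the copula package, the population versions of these measures for the fitted Frank copula can also be evaluated directly; a minimal sketch:

library(copula)
rho(frankCopula(param = 3.114))  # population Spearman's rho, about 0.46
tau(frankCopula(param = 3.114))  # population Kendall's tau, about 0.32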
Tail Dependency
There are applications in which it is useful to distinguish the part of the dis-
tribution in which the association is strongest. For example, in insurance it is
helpful to understand association among the largest losses, that is, association
in the right tails of the data.
To capture this type of dependency, we use tail concentration functions. The left-tail concentration function is
$$L(z) = \frac{\Pr(U_1 \le z, U_2 \le z)}{z} = \Pr(U_1 \le z \mid U_2 \le z) = \frac{C(z,z)}{z},$$
with the lower tail dependence parameter $L = \lim_{z \to 0} L(z)$. Analogously, the right-tail concentration function is
$$R(z) = \Pr(U_1 > z \mid U_2 > z) = \frac{1 - 2z + C(z,z)}{1 - z},$$
with the upper tail dependence parameter $R = \lim_{z \to 1} R(z)$. A tail dependence concentration function captures the probability of two random variables simultaneously having extreme values.
It is of interest to see how well a given copula can capture tail dependence. To this end, we calculate the left and right tail concentration functions for four different types of copulas: the normal, Frank, Gumbel, and $t$-copulas. The concentration function values for these four copulas are summarized in Table 14.8. As in Venter (2002), we show $L(z)$ for $z \le 0.5$ and $R(z)$ for $z > 0.5$ in the tail dependence plot in Figure 14.9. We interpret the tail dependence plot to mean that both the Frank and normal copulas exhibit no tail dependence, whereas the $t$- and Gumbel copulas do. The $t$-copula is symmetric in its treatment of the upper and lower tails.
Table 14.8. Tail Dependence Parameters for Four Copulas
[Figure 14.9: Tail dependence plot for the normal, Frank, Gumbel, and $t$ (5 df) copulas.]
In insurance, ignoring dependence modeling may not impact pricing, but it could lead to misestimation of the capital required to cover losses. For instance, in Section 14.4 we saw a positive relationship between LOSS and ALAE. This means that if there is a large loss, then we expect expenses to be large as well, and ignoring this relationship could lead to misestimation of reserves.
Table 14.9. Results for Portfolio Expected Value and Quantiles ($VaR_q$)
A related measure, sometimes called the median correlation coefficient or Blomqvist's beta, is defined as
$$\beta_B = 4F\left( F_X^{-1}(1/2), F_Y^{-1}(1/2) \right) - 1.$$
That is, first evaluate each marginal at its median ($F_X^{-1}(1/2)$ and $F_Y^{-1}(1/2)$, respectively). Then, evaluate the bivariate distribution function at the two medians. After rescaling (multiplying by 4 and subtracting 1), the coefficient turns out to have a range of $[-1, 1]$, where 0 occurs under independence.

Like Spearman's rho and Kendall's tau, an estimator based on ranks is easy to provide. First write $\beta_B = 4C(1/2, 1/2) - 1 = 2\Pr\left( (U_1 - 1/2)(U_2 - 1/2) > 0 \right) - 1$, where $U_1, U_2$ are uniform random variables. Then, define
$$\hat{\beta}_B = \frac{2}{n} \sum_{i=1}^{n} I\left( \left( R(X_i) - \frac{n+1}{2} \right)\left( R(Y_i) - \frac{n+1}{2} \right) \ge 0 \right) - 1.$$
See, for example, Joe (2014), page 57 or Hougaard (2000), page 135, for more
details.
For variables with ties, assign each tied observation the average of the ranks it spans. If the first variable takes ordered categories $m_1, \ldots, m_2$, the common rank of all observations in category $s$ is
$$r_{1s} = n_{m_1 \bullet} + \cdots + n_{s-1, \bullet} + \frac{1 + n_{s \bullet}}{2},$$
and similarly $r_{2t} = \frac{1}{2}\left[ \left( n_{\bullet m_1} + \cdots + n_{\bullet, t-1} + 1 \right) + \left( n_{\bullet m_1} + \cdots + n_{\bullet t} \right) \right]$. With this, Spearman's rho with tied ranks is
$$\hat{\rho}_S = \frac{\sum_{s=m_1}^{m_2} \sum_{t=m_1}^{m_2} n_{st}\,(r_{1s} - \bar{r})(r_{2t} - \bar{r})}{\left[ \sum_{s=m_1}^{m_2} n_{s\bullet}\,(r_{1s} - \bar{r})^2\; \sum_{t=m_1}^{m_2} n_{\bullet t}\,(r_{2t} - \bar{r})^2 \right]^{1/2}},$$
where $\bar{r} = (n+1)/2$ denotes the overall average rank. For binary variables taking values 0 and 1, we
have $r_{10} - \bar{r} = (n_{0\bullet} - n)/2$ and $r_{11} - \bar{r} = n_{0\bullet}/2$, with analogous expressions for the column ranks. Hence
$$\begin{aligned}
\sum_{s=0}^{1} \sum_{t=0}^{1} n_{st}\,(r_{1s} - \bar{r})(r_{2t} - \bar{r})
&= n_{00}\,\frac{n_{0\bullet} - n}{2}\,\frac{n_{\bullet 0} - n}{2} + n_{01}\,\frac{n_{0\bullet} - n}{2}\,\frac{n_{\bullet 0}}{2} + n_{10}\,\frac{n_{0\bullet}}{2}\,\frac{n_{\bullet 0} - n}{2} + n_{11}\,\frac{n_{0\bullet}}{2}\,\frac{n_{\bullet 0}}{2} \\
&= \frac{1}{4}\Big( n_{00}(n_{0\bullet} - n)(n_{\bullet 0} - n) + (n_{0\bullet} - n_{00})(n_{0\bullet} - n)\,n_{\bullet 0} \\
&\qquad\quad + (n_{\bullet 0} - n_{00})\,n_{0\bullet}(n_{\bullet 0} - n) + (n - n_{0\bullet} - n_{\bullet 0} + n_{00})\,n_{0\bullet}\,n_{\bullet 0} \Big) \\
&= \frac{n}{4}\left( n\,n_{00} - n_{0\bullet}\,n_{\bullet 0} \right).
\end{aligned}$$
This yields
$$\hat{\rho}_S = \frac{n\,n_{00} - n_{0\bullet}\,n_{\bullet 0}}{\sqrt{n_{0\bullet}(n - n_{0\bullet})\; n_{\bullet 0}(n - n_{\bullet 0})}} = \frac{\hat{\pi}_{11} - \hat{\pi}_X\,\hat{\pi}_Y}{\sqrt{\hat{\pi}_X(1 - \hat{\pi}_X)\,\hat{\pi}_Y(1 - \hat{\pi}_Y)}},$$
where $\hat{\pi}_X = (n - n_{0\bullet})/n$ and similarly for $\hat{\pi}_Y$. Note that this is the same form as the Pearson measure. From this, we see that the joint count $n_{00}$ drives this association measure.
You can obtain the ties-corrected Spearman correlation statistic 𝑟𝑆 using the
cor() function in R and selecting the spearman method. From below 𝜌𝑆̂ =
−0.09.
Chapter 15
Appendix A: Review of
Statistical Inference
In this section, you learn the following concepts related to statistical inference.
• Random sampling from a population that can be summarized using a list
of items or individuals within the population
• Sampling distributions that characterize the distributions of possible out-
comes for a statistic calculated from a random sample
• The central limit theorem that guides the distribution of the mean of a
random sample from the population
Statistical inference is the process of making conclusions about an unknown population based on a random sample drawn from the population that can be sampled. While the process has a broad spectrum of applications in various areas, including science, engineering, health, and the social and economic fields, statistical inference is important to insurance companies, which use data from their existing policyholders to make inferences about the characteristics (e.g., risk profiles) of a specific segment of target customers (i.e., the population) whom the insurance companies do not directly observe.
Example – Wisconsin Property Fund. Assume there are 1,377 individual
claims from the 2010 experience.
[Figure: Histograms of the 2010 claims, on the original scale (left) and the log scale (right).]
Using the 2010 claim experience (the sample), the Wisconsin Property Fund may be interested in assessing the severity of all claims that could potentially occur, in 2010, 2011, and so forth (the population). This process is important in the contexts of ratemaking and claim predictive modeling. In order for such inference to be valid, we need to assume that the observed claims form a random sample that is representative of the population, in the sense formalized below.
We assume that the random variable 𝑋 represents a draw from a population with
a distribution function 𝐹 (⋅) with mean E[𝑋] = 𝜇 and variance Var[𝑋] = E[(𝑋 −
𝜇)2 ], where 𝐸(⋅) denotes the expectation of a random variable. In random
sampling, we make a total of 𝑛 such draws represented by 𝑋1 , … , 𝑋𝑛 , each
unrelated to one another (i.e., statistically independent). We refer to 𝑋1 , … , 𝑋𝑛
as a random sample (with replacement) from 𝐹 (⋅), taking either a parametric
or nonparametric form. Alternatively, we may say that 𝑋1 , … , 𝑋𝑛 are identically
and independently distributed (iid) with distribution function 𝐹 (⋅).
When using a statistic (e.g., the sample mean X̄) to make statistical inference on a population attribute (e.g., the population mean μ), the quality of inference is determined by the bias and uncertainty in the estimation, owing to the use of a sample in place of the population. Hence, it is important to study the distribution of a statistic, which quantifies the bias and variability of the statistic. In particular, the distribution of the sample mean X̄ (or any other statistic) is called the sampling distribution. The sampling distribution depends on the sampling process, the statistic, the sample size n, and the population distribution F(⋅). The central limit theorem gives the large-sample (sampling) distribution of the sample mean under certain conditions.
Note that the CLT does not require a parametric form for F(⋅). Based on the CLT, we may perform statistical inference on the population mean (we infer, not deduce). The types of inference we may perform include estimation of population parameters, hypothesis testing on whether a null statement is true, and prediction of future samples from the population.
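To see the central limit theorem in action, the following minimal sketch simulates the sampling distribution of the mean for a skewed (exponential) population; the population and parameter values are purely illustrative.

    # Approximate the sampling distribution of the mean by simulation.
    set.seed(2020)
    n <- 100                                    # size of each random sample
    xbars <- replicate(5000, mean(rexp(n, rate = 1/1000)))
    hist(xbars, breaks = 40, freq = FALSE)
    # CLT approximation: normal with mean 1000 and sd 1000/sqrt(n)
    curve(dnorm(x, mean = 1000, sd = 1000/sqrt(n)), add = TRUE)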
In the method of moments (mme), we write the first m population moments as functions of the parameters, μₖ = hₖ(θ₁, …, θₘ), equate them to the corresponding sample moments μ̂ₖ, and solve the resulting system:

$$\hat{\mu}_1 = h_1(\hat{\theta}_1, \cdots, \hat{\theta}_m); \quad \hat{\mu}_2 = h_2(\hat{\theta}_1, \cdots, \hat{\theta}_m); \quad \cdots \quad \hat{\mu}_m = h_m(\hat{\theta}_1, \cdots, \hat{\theta}_m).$$

In maximum likelihood estimation, the mle is the parameter value that maximizes the log-likelihood of the observed sample,

$$\hat{\theta} = \mathrm{argmax}_{\theta\in\Theta}\, l(\theta|\mathbf{x}),$$
where Θ is the parameter space of 𝜃, and argmax𝜃∈Θ 𝑙(𝜃|x) is defined as the value
of 𝜃 at which the function 𝑙(𝜃|x) reaches its maximum.
Given the analytical form of the likelihood function, the mle can be obtained by taking the first derivative of the log-likelihood function with respect to θ = (θ₁, …, θₘ) and setting the values of the partial derivatives to zero. That is, the mle is the solution of the equations

$$\left.\frac{\partial l(\theta|\mathbf{x})}{\partial \theta_1}\right|_{\theta=\hat{\theta}} = 0; \qquad \left.\frac{\partial l(\theta|\mathbf{x})}{\partial \theta_2}\right|_{\theta=\hat{\theta}} = 0; \qquad \cdots \qquad \left.\frac{\partial l(\theta|\mathbf{x})}{\partial \theta_m}\right|_{\theta=\hat{\theta}} = 0,$$

provided that the matrix of second partial derivatives, evaluated at θ̂, is negative definite, so that the solution indeed maximizes the log-likelihood.
For parametric models, the mle of the parameters can be obtained either an-
alytically (e.g., in the case of normal distributions and linear estimators), or
numerically through iterative algorithms such as the Newton-Raphson method
and its adaptive versions (e.g., in the case of generalized linear models with a
non-normal response variable).
Normal distribution. Assume (𝑋1 , 𝑋2 , ⋯ , 𝑋𝑛 ) to be a random sample from
the normal distribution 𝑁 (𝜇, 𝜎2 ). With an observed sample (𝑋1 , 𝑋2 , ⋯ , 𝑋𝑛 ) =
(𝑥1 , 𝑥2 , ⋯ , 𝑥𝑛 ), we can write the likelihood function of 𝜇, 𝜎2 as
$$L(\mu, \sigma^2) = \prod_{i=1}^{n}\left[\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}\right],$$

with the corresponding log-likelihood function

$$l(\mu, \sigma^2) = -\frac{n}{2}\left[\log(2\pi) + \log(\sigma^2)\right] - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
By solving

$$\frac{\partial l(\mu, \sigma^2)}{\partial \mu} = 0,$$

we obtain μ̂ = x̄ = (1/n)∑ᵢ xᵢ. It is straightforward to verify that ∂²l(μ, σ²)/∂μ² < 0 at μ = x̄. Since this works for an arbitrary sample x, μ̂ = X̄ is the mle of μ.
Similarly, by solving

$$\frac{\partial l(\mu, \sigma^2)}{\partial \sigma^2} = 0,$$

we obtain σ̂² = (1/n)∑ᵢ (xᵢ − μ)². Further replacing μ by μ̂, we derive the mle of σ² as σ̂² = (1/n)∑ᵢ (Xᵢ − X̄)².
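As a check on the closed-form solution, a minimal sketch with hypothetical simulated data maximizes the normal log-likelihood numerically and compares the result with x̄ and (1/n)∑(xᵢ − x̄)²:

    set.seed(2020)
    x <- rnorm(200, mean = 10, sd = 2)
    # Negative log-likelihood; sd is parameterized on the log scale so the
    # optimizer can search an unconstrained space.
    negll <- function(par) -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
    fit <- optim(c(0, 0), negll)
    c(fit$par[1], exp(2 * fit$par[2]))    # numerical mle of (mu, sigma^2)
    c(mean(x), mean((x - mean(x))^2))     # closed-form mle for comparison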
Hence, the sample mean X̄ and σ̂² are both the mme and the mle for the mean μ and variance σ², respectively, under a normal population distribution F(⋅). More details regarding the properties of the likelihood function are given in Appendix Section 17.1.
In this section, you learn how to
• derive the exact sampling distribution of the mle of the normal mean;
• obtain the large-sample approximation of the sampling distribution using the large-sample properties of the mle;
• construct a confidence interval for a parameter based on the large-sample properties of the mle.
Now that we have introduced the mme and mle, we may perform the first type of statistical inference, interval estimation, which quantifies the uncertainty resulting from the use of a finite sample. By deriving the sampling distribution of the mle, we can estimate an interval (a confidence interval) for the parameter. Under the frequentist approach (e.g., that based on maximum likelihood estimation), confidence intervals generated from the same random sampling frame will cover the true value the majority of the time (e.g., 95% of the time) if we repeat the sampling process and re-calculate the interval over and over again. Such a process requires the derivation of the sampling distribution of the mle.
For a random sample from the normal distribution, the sample mean has the exact sampling distribution

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

More generally, large-sample theory gives the following properties of the mle.

• The mle is asymptotically normal: as the sample size n goes to infinity,
$$\sqrt{n}\,(\hat{\theta} - \theta) \rightarrow N(0, V),$$
where V is the inverse of the Fisher information. Hence, the mle θ̂ approximately follows a normal distribution with mean θ and variance V/n when the sample size is large.

• The mle is efficient, meaning that it has the smallest asymptotic variance V, commonly referred to as the Cramer–Rao lower bound. In particular, the Cramer–Rao lower bound is the inverse of the Fisher information, defined as ℐ(θ) = −E(∂² log f(X; θ)/∂θ²). Hence, Var(θ̂) can be estimated based on the observed Fisher information, which can be written as −∑ᵢ ∂² log f(Xᵢ; θ)/∂θ², evaluated at θ = θ̂.
For many parametric distributions, the Fisher information may be derived ana-
lytically for the mle of parameters. For more sophisticated parametric models,
the Fisher information can be evaluated numerically using numerical integration
for continuous distributions, or numerical summation for discrete distributions.
More details regarding maximum likelihood estimation are given in Appendix
Section 17.2.
Consider the statistic

$$\frac{\hat{\theta} - \theta}{se(\hat{\theta})},$$

where se(θ̂) denotes the standard error (an estimate of the standard deviation) of θ̂. For many applications, this statistic has an approximate Student-t distribution with n − p degrees of freedom, where p is the number of parameters, so that

$$\Pr\left[-t_{n-p}\left(1-\frac{\alpha}{2}\right) < \frac{\hat{\theta}-\theta}{se(\hat{\theta})} < t_{n-p}\left(1-\frac{\alpha}{2}\right)\right] = 1-\alpha,$$
from which we can derive a confidence interval for 𝜃. From the above equa-
tion we can derive a pair of statistics, 𝜃1̂ and 𝜃2̂ , that provide an interval of
the form [𝜃1̂ , 𝜃2̂ ]. This interval is a 1 − 𝛼 confidence interval for 𝜃 such that
Pr (𝜃1̂ ≤ 𝜃 ≤ 𝜃2̂ ) = 1 − 𝛼, where the probability 1 − 𝛼 is referred to as the con-
fidence level. Note that the above confidence interval is not valid for small
samples, except for the case of the normal mean.
Normal distribution. For the normal population mean μ, the mle X̄ has an exact sampling distribution X̄ ∼ N(μ, σ²/n), in which we can estimate se(X̄) = σ/√n by σ̂/√n. Based on Cochran's theorem, the resulting statistic has an exact Student-t distribution with n − 1 degrees of freedom. Hence, we can derive the lower and upper bounds of the confidence interval as

$$\hat{\mu}_1 = \hat{\mu} - t_{n-1}\left(1-\frac{\alpha}{2}\right)\frac{\hat{\sigma}}{\sqrt{n}}$$

and

$$\hat{\mu}_2 = \hat{\mu} + t_{n-1}\left(1-\frac{\alpha}{2}\right)\frac{\hat{\sigma}}{\sqrt{n}}.$$

When α = 0.05, tₙ₋₁(1 − α/2) ≈ 1.96 for large values of n. Based on Cochran's theorem, the confidence interval is valid regardless of the sample size.
A test is calibrated by its level of significance, say 5%, meaning that if the null hypothesis were true, we would reject it only 5% of the time were we to repeat the sampling process and perform the test over and over again.
Thus, the level of significance is the probability of making a type I error (error of the first kind), the error of incorrectly rejecting a true null hypothesis. For this reason, the level of significance α is also referred to as the type I error rate. Another type of error we may make in hypothesis testing is the type II error (error of the second kind), the error of incorrectly accepting a false null hypothesis. Similarly, we can define the type II error rate as the probability of not rejecting (accepting) a null hypothesis given that it is not true. That is, the type II error rate is given by

$$\Pr\left(\text{do not reject } H_0 \mid H_0 \text{ is false}\right).$$

Another important quantity concerning the quality of the statistical test is the power of the test β, defined as the probability of rejecting a false null hypothesis. The mathematical definition of the power is

$$\beta = \Pr\left(\text{reject } H_0 \mid H_0 \text{ is false}\right).$$
Note that the power of the test is typically calculated for a specific alternative value θ = θₐ, given a specific sampling distribution and a given sample size. In real experimental studies, researchers usually calculate the sample size required to ensure a large chance of obtaining a statistically significant test (i.e., a prespecified statistical power such as 85%).
For testing H₀: θ = θ₀ against a two-sided alternative, define the test statistic

$$t\text{-stat} = \frac{\hat{\theta} - \theta_0}{se(\hat{\theta})},$$

which has a large-sample Student-t distribution with n − p degrees of freedom under the null hypothesis. The rejection region consists of the two tails determined by

$$\Pr\left[t\text{-stat} < -t_{n-p}\left(1-\frac{\alpha}{2}\right)\right] = \Pr\left[t\text{-stat} > t_{n-p}\left(1-\frac{\alpha}{2}\right)\right] = \frac{\alpha}{2}.$$
In addition to using the rejection region, we may carry out the test based on the p-value, defined as 2 Pr(T > |t-stat|) for the aforementioned two-sided test, where the random variable T ∼ Tₙ₋ₚ. We reject the null hypothesis if the p-value is smaller than or equal to α. For a given sample, the p-value is the smallest significance level for which the null hypothesis would be rejected.
Similarly, we can construct a one-sided test for the null hypothesis 𝐻0 ∶ 𝜃 ≤ 𝜃0
(or 𝐻0 ∶ 𝜃 ≥ 𝜃0 ). Using the same test statistic, we reject the null hypothesis
when 𝑡-stat > 𝑡𝑛−𝑝 (1 − 𝛼) (or 𝑡-stat < −𝑡𝑛−𝑝 (1 − 𝛼) for the test on 𝐻0 ∶ 𝜃 ≥ 𝜃0 ).
The corresponding 𝑝-value is defined as Pr(𝑇 > |𝑡-stat|) (or Pr(𝑇 < |𝑡-stat|) for
the test on 𝐻0 ∶ 𝜃 ≥ 𝜃0 ). Note that the test is not valid for small samples, except
for the case of the test on the normal mean.
One-sample 𝑡 Test for Normal Mean. For the test on the normal mean
of the form 𝐻0 ∶ 𝜇 = 𝜇0 , 𝐻0 ∶ 𝜇 ≤ 𝜇0 or 𝐻0 ∶ 𝜇 ≥ 𝜇0 , we can define the test
statistic as
$$t\text{-stat} = \frac{\bar{X} - \mu_0}{\hat{\sigma}/\sqrt{n}},$$

for which we have an exact sampling distribution t-stat ∼ Tₙ₋₁ from Cochran's theorem, with Tₙ₋₁ denoting a Student-t distribution with n − 1 degrees of freedom. According to Cochran's theorem, the test is valid for both small and large samples.
Example – Wisconsin Property Fund. Assume that mean logarithmic claims have historically been approximately μ₀ = log(5000) = 8.517. We might want to use the 2010 data to assess whether the mean of the distribution has changed (a two-sided test), or whether it has decreased (a one-sided test). Given the actual 2010 average μ̂ = 7.804, we may use the one-sample t test to assess whether this is a significant departure from μ₀ = 8.517 (i.e., in testing H₀: μ = 8.517). The test statistic is t-stat = (7.804 − 8.517)/(1.683/√1377) = −15.72, so that |t-stat| > t₁₃₇₆(0.975). Hence, we reject the two-sided test at α = 5%. Similarly, we reject the one-sided test at α = 5%.
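The arithmetic of this example can be reproduced directly from the quoted summary statistics; a minimal sketch:

    mu0 <- log(5000)                               # 8.517
    t_stat <- (7.804 - mu0) / (1.683 / sqrt(1377))
    t_stat                                         # about -15.7
    abs(t_stat) > qt(0.975, df = 1376)             # TRUE, so reject at the 5% level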
Example – Wisconsin Property Fund. For numerical stability and extensions to regression applications, statistical packages often work with transformed versions of parameters. The following estimates are from the R package VGAM. More details on the mle of other distribution families are given in Appendix Chapter 17.
Given the likelihood function 𝐿(𝜃|x) and Θ0 ⊂ Θ, the likelihood ratio test
statistic for testing 𝐻0 ∶ 𝜃 ∈ Θ0 against 𝐻𝑎 ∶ 𝜃 ∉ Θ0 is given by
$$L = \frac{\sup_{\theta\in\Theta_0} L(\theta|\mathbf{x})}{\sup_{\theta\in\Theta} L(\theta|\mathbf{x})}.$$

When the null hypothesis is simple, so that Θ₀ = {θ₀}, the statistic reduces to

$$L = \frac{L(\theta_0|\mathbf{x})}{\sup_{\theta\in\Theta} L(\theta|\mathbf{x})}.$$
The LRT rejects the null hypothesis when 𝐿 < 𝑐, with the threshold depending
on the level of significance 𝛼, the sample size 𝑛, and the number of parameters
in 𝜃. Based on the Neyman–Pearson Lemma, the LRT is the uniformly
most powerful test for testing 𝐻0 ∶ 𝜃 = 𝜃0 versus 𝐻𝑎 ∶ 𝜃 = 𝜃𝑎 . That is, it
provides the largest power 𝛽 for a given 𝛼 and a given alternative value 𝜃𝑎 .
Based on the Wilks’s Theorem, the likelihood ratio test statistic −2 log(𝐿)
converges in distribution to a Chi-square distribution with the degree of freedom
being the difference between the dimensionality of the parameter spaces Θ and
Θ0 , when the sample size goes to infinity and when the null model is nested
within the alternative model. That is, when the null model is a special case of
the alternative model containing a restricted sample space, we may approximate
𝑐 by 𝜒2𝑝1 −𝑝2 (1 − 𝛼), the 100 × (1 − 𝛼) th percentile of the Chi-square distribution,
with 𝑝1 − 𝑝2 being the degrees of freedom, and 𝑝1 and 𝑝2 being the numbers of
parameters in the alternative and null models, respectively. Note that the LRT
is also a large-sample test that will not be valid for small samples.
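As a minimal sketch of the LRT in practice, the following compares an exponential null model against a gamma alternative (the exponential is the gamma with shape one, so the models are nested with one restricted parameter); the data are simulated and purely illustrative:

    set.seed(2020)
    x <- rgamma(500, shape = 2, scale = 1000)
    # Exponential mle: rate = 1/xbar
    ll_exp <- sum(dexp(x, rate = 1/mean(x), log = TRUE))
    # Gamma mle obtained numerically (parameters on the log scale)
    negll  <- function(p) -sum(dgamma(x, shape = exp(p[1]), scale = exp(p[2]), log = TRUE))
    ll_gam <- -optim(c(0, log(mean(x))), negll)$value
    -2 * (ll_exp - ll_gam) > qchisq(0.95, df = 1)   # TRUE here: reject the exponential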
A related approach to model comparison is the Akaike information criterion (AIC), defined as

$$AIC = -2\, l(\hat{\theta}|\mathbf{x}) + 2p,$$

where θ̂ denotes the mle of θ and p is the number of parameters in the model. The additional term 2p represents a penalty for the complexity of the model. That is, with the same maximized likelihood function, the AIC favors the model with fewer parameters. We note that the AIC does not consider the impact of the sample size n.
Alternatively, people use the Bayesian information criterion (BIC), which takes the sample size into consideration. The BIC is defined as

$$BIC = -2\, l(\hat{\theta}|\mathbf{x}) + p\,\log(n).$$

We observe that the BIC generally puts a higher weight on the number of parameters. With the same maximized likelihood function, the BIC will suggest a more parsimonious model than the AIC.
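A minimal sketch of both criteria computed from a maximized log-likelihood, here an exponential fit (p = 1 parameter) to hypothetical simulated claims:

    set.seed(2020)
    x <- rexp(500, rate = 1/1000)
    loglik <- sum(dexp(x, rate = 1/mean(x), log = TRUE))   # mle plugged in
    p <- 1; n <- length(x)
    c(AIC = -2 * loglik + 2 * p, BIC = -2 * loglik + p * log(n))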
Example – Wisconsin Property Fund. Both the AIC and BIC statistics suggest that the GB2 is the best fitting model, whereas the gamma is the worst. In this graph,
• black represents the actual (smoothed) logarithmic claims;
• the best approximation is the green curve, the fitted GB2;
• the Pareto (purple) and lognormal (light blue) fits are also quite good;
• the worst fits are the exponential (red) and gamma (dark blue).
(Density plot of the smoothed logarithmic claims together with the fitted model densities; the horizontal axis is log expenditures.)
Contributors
• Lei (Larry) Hua, Northern Illinois University, and Edward W. (Jed)
Frees, University of Wisconsin-Madison, are the principal authors of the
initial version of this chapter. Email: [email protected] or [email protected]
for chapter comments and suggested improvements.
Chapter 16
Appendix B: Iterated
Expectations
The iterated expectations are the laws regarding calculation of the expectation and variance of a random variable using the conditional distribution of that variable given another variable. Hence, we first introduce the concepts related to the conditional distribution, and the calculation of the conditional expectation and variance based on a given conditional distribution.
Discrete Case
Suppose that 𝑋 and 𝑌 are both discrete random variables, meaning that they
can take a finite or countable number of possible values with a positive proba-
bility. The joint probability (mass) function of (X, Y) is defined as

$$p(x, y) = \Pr[X = x, Y = y].$$
When 𝑋 and 𝑌 are independent (the value of 𝑋 does not depend on that of
𝑌 ), we have
𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦),
with 𝑝(𝑥) = Pr[𝑋 = 𝑥] and 𝑝(𝑦) = Pr[𝑌 = 𝑦] being the marginal probability
function of 𝑋 and 𝑌 , respectively.
Given the joint probability function, we may obtain the marginal probability function of Y as

$$p(y) = \sum_{x} p(x, y),$$

where the summation is over all possible values of x; the marginal probability function of X can be obtained in a similar manner.
The conditional probability (mass) function of (Y|X) is defined as

$$p(y|x) = \Pr[Y = y | X = x] = \frac{p(x, y)}{\Pr[X = x]},$$

defined for Pr[X = x] > 0.
Continuous Case
For continuous random variables X and Y, we may define their joint probability (density) function based on the joint cumulative distribution function. The joint cumulative distribution function of (X, Y) is defined as

$$F(x, y) = \Pr[X \le x, Y \le y].$$

When X and Y are independent, F(x, y) = F(x)F(y), with F(x) = Pr[X ≤ x] and F(y) = Pr[Y ≤ y] being the cumulative distribution functions (cdfs) of X and Y, respectively. The random variable X is referred to as a continuous random variable if its cdf is continuous in x.
When the cdf F(x) is continuous in x, we define f(x) = ∂F(x)/∂x as the (marginal) probability density function (pdf) of X. Similarly, if the joint cdf F(x, y) is continuous in both x and y, we define

$$f(x, y) = \frac{\partial^2 F(x, y)}{\partial x\,\partial y}$$

as the joint probability density function of (X, Y). When X and Y are independent, the joint density factors as

$$f(x, y) = f(x)\,f(y).$$
Given the joint density function, we may obtain the marginal density function of Y as

$$f(y) = \int_{x} f(x, y)\, dx,$$

where the integral is over all possible values of x; the marginal density function of X can be obtained in a similar manner.
Based on the joint pdf and the marginal pdf, we define the conditional probability density function of (Y|X) as

$$f(y|x) = \frac{f(x, y)}{f(x)},$$

defined for f(x) > 0.
Discrete Case
For a discrete random variable 𝑌 , its expectation is defined as E[𝑌 ] =
∑𝑦 𝑦 𝑝(𝑦) if its value is finite, and its variance is defined as Var[𝑌 ] =
E{(𝑌 − E[𝑌 ])2 } = ∑𝑦 𝑦2 𝑝(𝑦) − {E[𝑌 ]}2 if its value is finite.
For a discrete random variable 𝑌 , the conditional expectation of the random
variable 𝑌 given the event 𝑋 = 𝑥 is defined as
E[𝑌 |𝑋 = 𝑥] = ∑ 𝑦 𝑝(𝑦|𝑥),
𝑦
where X does not have to be a discrete variable, as long as the conditional probability function p(y|x) is given.
Note that the conditional expectation E[Y|X = x] is a fixed number. When we replace x with X on the right-hand side of the above equation, we define the expectation of Y given the random variable X as E[Y|X], itself a random variable that takes the value E[Y|X = x] when X = x.
Continuous Case
For a continuous random variable Y, its expectation is defined as E[Y] = ∫ y f(y) dy if the integral exists, and its variance is defined as Var[Y] = E{(Y − E[Y])²} = ∫ y² f(y) dy − {E[Y]}² if its value is finite.
For jointly continuous random variables 𝑋 and 𝑌 , the conditional expecta-
tion of the random variable 𝑌 given 𝑋 = 𝑥 is defined as
E[𝑌 |𝑋 = 𝑥] = ∫ 𝑦 𝑓(𝑦|𝑥)𝑑𝑦.
𝑦
In this section, you learn
• the Law of Iterated Expectations for calculating the expectation of a random variable based on its conditional distribution given another random variable;
• the Law of Total Variance for calculating the variance of a random variable based on its conditional distribution given another random variable;
• how to calculate the expectation and variance in an example of a two-stage model.
Assuming all the expectations exist and are finite, the Law of Iterated Expectations states that

$$\mathrm{E}[Y] = \mathrm{E}\left\{\mathrm{E}[Y|X]\right\},$$

where the first (inside) expectation is taken with respect to the random variable Y and the second (outside) expectation is taken with respect to X.
For the Law of Iterated Expectations, the random variables may be discrete,
continuous, or a hybrid combination of the two. We use the example of discrete
variables of 𝑋 and 𝑌 to illustrate the calculation of the unconditional expecta-
tion using the Law of Iterated Expectations. For continuous random variables,
we only need to replace the summation with the integral, as illustrated earlier
in the appendix.
Given the conditional pmf p(y|x), the conditional expectation of h(X, Y) given the event X = x is E[h(X, Y)|X = x] = ∑ᵧ h(x, y) p(y|x). Hence, since p(y|x)p(x) = p(x, y),

$$\mathrm{E}\left\{\mathrm{E}[h(X,Y)|X]\right\} = \sum_{x}\mathrm{E}[h(X,Y)|X=x]\,p(x) = \sum_{x}\sum_{y} h(x,y)\,p(y|x)\,p(x) = \mathrm{E}[h(X,Y)].$$

The Law of Iterated Expectations for the continuous and hybrid cases can be proved in a similar manner, by replacing the corresponding summation(s) by integral(s).
Assuming that all the variances exist and are finite, the Law of Total Variance states that

$$\mathrm{Var}[Y] = \mathrm{E}\left\{\mathrm{Var}[Y|X]\right\} + \mathrm{Var}\left\{\mathrm{E}[Y|X]\right\},$$

where the first (inside) expectation/variance is taken with respect to the random variable Y and the second (outside) expectation/variance is taken with respect to X. Thus, the unconditional variance equals the expectation of the conditional variance plus the variance of the conditional expectation.
In order to verify this rule, first note that we can calculate a conditional variance as

$$\mathrm{Var}[h(X,Y)|X] = \mathrm{E}[h(X,Y)^2|X] - \left\{\mathrm{E}[h(X,Y)|X]\right\}^2.$$

Taking expectations over X,

$$\mathrm{E}\left\{\mathrm{Var}[h(X,Y)|X]\right\} = \mathrm{E}\left\{\mathrm{E}[h(X,Y)^2|X]\right\} - \mathrm{E}\left(\left\{\mathrm{E}[h(X,Y)|X]\right\}^2\right) = \mathrm{E}[h(X,Y)^2] - \mathrm{E}\left(\left\{\mathrm{E}[h(X,Y)|X]\right\}^2\right). \tag{16.3}$$

Further, because E{E[h(X, Y)|X]} = E[h(X, Y)], the variance of the conditional expectation is

$$\mathrm{Var}\left\{\mathrm{E}[h(X,Y)|X]\right\} = \mathrm{E}\left(\left\{\mathrm{E}[h(X,Y)|X]\right\}^2\right) - \left\{\mathrm{E}[h(X,Y)]\right\}^2. \tag{16.4}$$

Thus, adding Equations (16.3) and (16.4) leads to the unconditional variance Var[h(X, Y)].
16.2.3 Application
To apply the Law of Iterated Expectations and the Law of Total Variance, we
generally adopt the following procedure.
1. Identify the random variable that is being conditioned upon, typically a
stage 1 outcome (that is not observed).
2. Conditional on the stage 1 outcome, calculate summary measures such as
a mean, variance, and the like.
3. There are several results from step 2, one for each stage 1 outcome. Then, combine these results using the iterated expectations or total variance rules.
Mixtures of Finite Populations. Suppose that the random variable N₁ represents a realization of the number of claims in a policy year from the population of good drivers and N₂ represents that from the population of bad drivers. For a specific driver, there is a probability α that (s)he is a good driver. Let T denote the driver type, with T = 1 for good drivers (so Pr(T = 1) = α) and T = 2 for bad drivers; for a specific draw N, the count is N₁ if the driver is good and N₂ otherwise. Suppose further that, given the type, the number of claims is Poisson distributed, so that

$$\mathrm{Var}[N|T=j] = \mathrm{E}[N|T=j] = \lambda_j, \qquad j = 1, 2.$$

From the Law of Iterated Expectations, E[N] = E{E[N|T]} = αλ₁ + (1 − α)λ₂. The conditional mean E[N|T] takes the value λ₁ with probability α and λ₂ otherwise, so its variance is that of a Bernoulli-type variable with outcomes λ₁ and λ₂ and binomial probability α: Var{E[N|T]} = α(1 − α)(λ₁ − λ₂)². Based on the Law of Total Variance, the unconditional variance of N is given by

$$\mathrm{Var}[N] = \mathrm{E}\left\{\mathrm{Var}[N|T]\right\} + \mathrm{Var}\left\{\mathrm{E}[N|T]\right\} = \alpha\lambda_1 + (1-\alpha)\lambda_2 + \alpha(1-\alpha)(\lambda_1-\lambda_2)^2.$$
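A minimal sketch checks these two-stage moment formulas by simulation (the parameter values are illustrative only):

    set.seed(2020)
    alpha <- 0.7; lambda <- c(1, 4)     # good and bad driver Poisson means
    type <- sample(1:2, 1e6, replace = TRUE, prob = c(alpha, 1 - alpha))
    N <- rpois(1e6, lambda[type])
    mean(N)   # close to 0.7*1 + 0.3*4 = 1.9
    var(N)    # close to 1.9 + 0.7*0.3*(1 - 4)^2 = 3.79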
Linear Exponential Family. A distribution belongs to the linear exponential family if its density or mass function can be written as

$$f(x; \gamma, \theta) = \exp\left(\frac{x\gamma - b(\gamma)}{\theta} + S(x, \theta)\right).$$
Each member is specified by its components (γ, θ, b(γ), S(x, θ)):

General: parameters γ, θ; density or mass function exp((xγ − b(γ))/θ + S(x, θ)); components γ, θ, b(γ), S(x, θ)
Normal: parameters μ, σ²; density (1/√(2πσ²)) exp(−(x−μ)²/(2σ²)); components μ, σ², γ²/2, −(x²/(2θ) + log(2πθ)/2)
Binomial: parameter π; mass function C(n, x) π^x (1−π)^{n−x}; components log(π/(1−π)), 1, n log(1+e^γ), log C(n, x)
Poisson: parameter λ; mass function λ^x exp(−λ)/x!; components log λ, 1, e^γ, −log(x!)
Negative Binomial*: parameters r, p; mass function [Γ(x+r)/(x! Γ(r))] p^r (1−p)^x; components log(1−p), 1, −r log(1−e^γ), log[Γ(x+r)/(x! Γ(r))]
Gamma: parameters α, θ; density [1/(Γ(α)θ^α)] x^{α−1} exp(−x/θ); components −1/(αθ), 1/α, −log(−γ), −θ^{−1} log θ − log Γ(θ^{−1}) + (θ^{−1} − 1) log x

*This assumes that the parameter r is fixed but need not be an integer.
The Tweedie (see Section 5.3.4) and inverse Gaussian distributions are also members of the linear exponential family. The linear exponential family is extensively used as the basis of generalized linear models, as described in, for example, Frees (2009).
Example. Suppose that, given γ, the observation x is normally distributed with mean γ and known variance σ². The prior distribution of γ is normal with mean a₁/a₂ and variance a₂⁻¹. The posterior distribution of γ given x is normal with mean (a₁ + x/σ²)/(a₂ + σ⁻²) and variance (a₂ + σ⁻²)⁻¹.
Contributors
• Lei (Larry) Hua, Northern Illinois University, and Edward W. (Jed)
Frees, University of Wisconsin-Madison, are the principal authors of the
initial version of this chapter. Email: [email protected] or [email protected]
for chapter comments and suggested improvements.
Chapter 17
Appendix C: Maximum
Likelihood Theory
Suppose that x = (x₁, …, xₙ) represents a sample of independent draws, each with density or mass function f(⋅|θ). The log-likelihood function is

$$l(\theta|\mathbf{x}) = \log L(\theta|\mathbf{x}) = \sum_{i=1}^{n} \log f(x_i|\theta).$$
Example. Single-parameter Pareto distribution. Suppose that X₁, …, Xₙ is a random sample with cumulative distribution function

$$F(x) = \Pr(X_i \le x) = 1 - \left(\frac{500}{x}\right)^{\alpha}, \qquad x > 500,$$

with parameter θ = α. The corresponding probability density function is f(x) = 500^α α x^{−α−1}, and the log-likelihood function can be derived as

$$l(\alpha|\mathbf{x}) = \sum_{i=1}^{n} \log f(x_i; \alpha) = n\alpha \log 500 + n\log\alpha - (\alpha+1)\sum_{i=1}^{n} \log x_i.$$
Define the score function as the derivative of the log-likelihood with respect to the parameter,

$$u(\theta) = \frac{\partial}{\partial\theta}\, l(\theta|\mathbf{x}) = \frac{\partial}{\partial\theta} \log \prod_{i=1}^{n} f(x_i;\theta) = \sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(x_i;\theta),$$

where u(θ) = (u₁(θ), u₂(θ), …, u_p(θ))′ when θ = (θ₁, …, θ_p) contains p parameters, with element u_k(θ) = ∂l(θ|x)/∂θ_k being the partial derivative with respect to θ_k (k = 1, 2, …, p).
The likelihood function has the following properties:
• One basic property of the likelihood function is that the expectation of
the score function with respect to x is 0. That is,
$$\mathrm{E}[u(\theta)] = \mathrm{E}\left[\frac{\partial}{\partial\theta}\, l(\theta|\mathbf{x})\right] = 0.$$

To illustrate this, in the continuous case we have

$$\mathrm{E}\left[\frac{\partial}{\partial\theta}\, l(\theta|\mathbf{x})\right] = \mathrm{E}\left[\frac{\partial f(\mathbf{x};\theta)/\partial\theta}{f(\mathbf{x};\theta)}\right] = \int \frac{\partial}{\partial\theta} f(\mathbf{y};\theta)\, d\mathbf{y} = \frac{\partial}{\partial\theta}\int f(\mathbf{y};\theta)\, d\mathbf{y} = \frac{\partial}{\partial\theta}\, 1 = 0,$$

where the interchange of differentiation and integration holds under mild regularity conditions.
• Denote by ∂²l(θ|x)/∂θ∂θ′ = ∂²l(θ|x)/∂θ² the second derivative of the log-likelihood function when θ is a single parameter, or by ∂²l(θ|x)/∂θ∂θ′ = (h_{jk}) = (∂²l(θ|x)/∂θ_j∂θ_k) the hessian matrix of the log-likelihood function when it contains multiple parameters. Denote [∂l(θ|x)/∂θ][∂l(θ|x)/∂θ]′ = u²(θ) when θ is a single parameter, or let [∂l(θ|x)/∂θ][∂l(θ|x)/∂θ]′ = (uu_{jk}) be a p × p matrix when θ contains a total of p parameters, with each element uu_{jk} = u_j(θ)u_k(θ) and u_j(θ) being the jth element of the score vector as defined earlier. Another basic property of the likelihood function is that the sum of the expectation of the hessian matrix and the expectation of the outer product of the score vector and its transpose is 0. That is,

$$\mathrm{E}\left(\frac{\partial^2}{\partial\theta\,\partial\theta'}\, l(\theta|\mathbf{x})\right) + \mathrm{E}\left(\frac{\partial l(\theta|\mathbf{x})}{\partial\theta}\,\frac{\partial l(\theta|\mathbf{x})}{\partial\theta'}\right) = 0.$$
Define the Fisher information as

$$\mathcal{I}(\theta) = \mathrm{E}\left(\frac{\partial l(\theta|\mathbf{x})}{\partial\theta}\,\frac{\partial l(\theta|\mathbf{x})}{\partial\theta'}\right) = -\mathrm{E}\left(\frac{\partial^2}{\partial\theta\,\partial\theta'}\, l(\theta|\mathbf{x})\right).$$
As the sample size 𝑛 goes to infinity, the score function (vector) converges in dis-
tribution to a normal distribution (or multivariate normal distribution
when 𝜃 contains multiple parameters) with mean 0 and variance (or covariance
matrix in the multivariate case) given by ℐ(𝜃).
The maximum likelihood estimator is

$$\hat{\theta}_{MLE} = \mathrm{argmax}_{\theta\in\Theta}\, l(\theta|\mathbf{x}).$$

Given the analytical form of the likelihood function, the mle can be obtained by taking the first derivative of the log-likelihood function with respect to θ and setting the values of the partial derivatives to zero. That is, the mle solves

$$\left.\frac{\partial l(\theta|\mathbf{x})}{\partial\theta}\right|_{\theta=\hat{\theta}} = 0.$$
Example. Course C/Exam 4. May 2000, 21. You are given the following five observations: 521, 658, 702, 819, 1217. You use the single-parameter Pareto with cumulative distribution function

$$F(x) = 1 - \left(\frac{500}{x}\right)^{\alpha}, \qquad x > 500.$$

The log-likelihood is

$$l(\alpha|\mathbf{x}) = \sum_{i=1}^{5} \log f(x_i;\alpha) = 5\alpha\log 500 + 5\log\alpha - (\alpha+1)\sum_{i=1}^{5}\log x_i.$$

Setting the score equal to zero,

$$\frac{\partial}{\partial\alpha}\, l(\alpha|\mathbf{x}) = 5\log 500 + \frac{5}{\alpha} - \sum_{i=1}^{5}\log x_i \stackrel{set}{=} 0 \;\Rightarrow\; \hat{\alpha}_{MLE} = \frac{5}{\sum_{i=1}^{5}\log x_i - 5\log 500} = 2.453.$$
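This closed form is easy to evaluate; a minimal sketch:

    x <- c(521, 658, 702, 819, 1217)
    length(x) / (sum(log(x)) - length(x) * log(500))   # 2.453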
• The mle of a parameter θ, θ̂_MLE, is a consistent estimator. That is, the mle θ̂_MLE converges in probability to the true value θ as the sample size n goes to infinity.
• The mle has the asymptotic normality property, meaning that the estimator converges in distribution to a (multivariate) normal distribution centered around the true value as the sample size goes to infinity. Namely,

$$\sqrt{n}\,(\hat{\theta}_{MLE} - \theta) \rightarrow N(0, V), \qquad \text{as } n \rightarrow \infty,$$

where V denotes the asymptotic variance (or covariance matrix) of the estimator.
• The mle is efficient, meaning that it has the smallest asymptotic variance V, commonly referred to as the Cramer–Rao lower bound. In particular, the Cramer–Rao lower bound is the inverse of the Fisher information (matrix) ℐ(θ) defined earlier in this appendix. Hence, Var(θ̂_MLE) can be estimated based on the observed Fisher information.
Based on the above results, we may perform statistical inference based on the
procedures defined in Appendix Chapter 15.
Example. Suppose that a model with two parameters θ = (θ₁, θ₂)′ has Fisher information matrix

$$\mathcal{I}(\theta_1, \theta_2) = -\mathrm{E}\left(\frac{\partial^2}{\partial\theta\,\partial\theta'}\, l(\theta|\mathbf{x})\right) = \begin{pmatrix} 5 & 3 \\ 3 & 2 \end{pmatrix}$$

and hence

$$\mathcal{I}^{-1}(\theta_1, \theta_2) = \frac{1}{5(2) - 3(3)}\begin{pmatrix} 2 & -3 \\ -3 & 5 \end{pmatrix} = \begin{pmatrix} 2 & -3 \\ -3 & 5 \end{pmatrix}.$$
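The inversion can be checked numerically; a minimal sketch:

    I <- matrix(c(5, 3, 3, 2), nrow = 2)
    solve(I)    # matches the inverse computed above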
Contributors
• Lei (Larry) Hua, Northern Illinois University, and Edward W. (Jed)
Frees, University of Wisconsin-Madison, are the principal authors of the
initial version of this chapter. Email: [email protected] or [email protected]
for chapter comments and suggested improvements.
Chapter 18
Appendix D: Summary of
Distributions
User Notes

18.1 Discrete Distributions

Poisson

Functions

Parameter assumptions: λ > 0
p₀: e^{−λ}
Probability mass function p_k: e^{−λ} λ^k / k!
Expected value E[N]: λ
Variance: λ
Probability generating function P(z): e^{λ(z−1)}
a and b for recursion: a = 0, b = λ
R Commands
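As a minimal sketch, base R provides the usual d/p/q/r functions for this distribution (the value λ = 2 is illustrative):

    dpois(0:3, lambda = 2)   # probability mass function p_k
    ppois(3, lambda = 2)     # distribution function
    rpois(5, lambda = 2)     # random draws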
Geometric
Functions
Parameter assumptions: β > 0
p₀: 1/(1+β)
Probability mass function p_k: β^k / (1+β)^{k+1}
Expected value E[N]: β
Variance: β(1+β)
Probability generating function P(z): [1 − β(z−1)]^{−1}
a and b for recursion: a = β/(1+β), b = 0
R Commands
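As a minimal sketch, base R parameterizes the geometric distribution by the success probability prob = 1/(1+β); the value β = 2 is illustrative:

    beta <- 2
    dgeom(0:3, prob = 1/(1 + beta))   # matches p_k = beta^k / (1+beta)^(k+1)
    pgeom(3, prob = 1/(1 + beta))     # distribution function
    rgeom(5, prob = 1/(1 + beta))     # random draws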
Binomial
Functions
Parameter assumptions: 0 < q < 1, m an integer, 0 ≤ k ≤ m
p₀: (1−q)^m
Probability mass function p_k: C(m, k) q^k (1−q)^{m−k}
Expected value E[N]: mq
Variance: mq(1−q)
Probability generating function P(z): [1 + q(z−1)]^m
a and b for recursion: a = −q/(1−q), b = (m+1)q/(1−q)
R Commands
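As a minimal sketch, with the document's notation (m trials, success probability q; the values are illustrative):

    dbinom(0:3, size = 10, prob = 0.3)   # probability mass function p_k
    pbinom(3, size = 10, prob = 0.3)     # distribution function
    rbinom(5, size = 10, prob = 0.3)     # random draws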
Negative Binomial
Functions
Parameter assumptions: r > 0, β > 0
p₀: (1+β)^{−r}
Probability mass function p_k: r(r+1)⋯(r+k−1) β^k / [k! (1+β)^{r+k}]
Expected value E[N]: rβ
Variance: rβ(1+β)
Probability generating function P(z): [1 − β(z−1)]^{−r}
a and b for recursion: a = β/(1+β), b = (r−1)β/(1+β)
R Commands
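As a minimal sketch, base R uses size = r and prob = 1/(1+β) for this parameterization (the values r = 3, β = 2 are illustrative):

    r <- 3; beta <- 2
    dnbinom(0:3, size = r, prob = 1/(1 + beta))   # probability mass function p_k
    rnbinom(5, size = r, prob = 1/(1 + beta))     # random draws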
Zero-Truncated Poisson

Functions

Parameter assumptions: λ > 0
p₁^T: λ/(e^λ − 1)
Probability mass function p_k^T: λ^k / [k!(e^λ − 1)]
Expected value E[N]: λ/(1 − e^{−λ})
Variance: λ[1 − (λ+1)e^{−λ}] / (1 − e^{−λ})²
Probability generating function P(z): (e^{λz} − 1)/(e^λ − 1)
a and b for recursion: a = 0, b = λ
R Commands
Zero-Truncated Geometric

Functions

Parameter assumptions: β > 0
p₁^T: 1/(1+β)
Probability mass function p_k^T: β^{k−1} / (1+β)^k
Expected value E[N]: 1 + β
Variance: β(1+β)
Probability generating function P(z): {[1 − β(z−1)]^{−1} − (1+β)^{−1}} / {1 − (1+β)^{−1}}
a and b for recursion: a = β/(1+β), b = 0
R Commands
Zero-Truncated Binomial

Functions

Parameter assumptions: 0 < q < 1, m an integer, 1 ≤ k ≤ m
p₁^T: m(1−q)^{m−1} q / [1 − (1−q)^m]
Probability mass function p_k^T: C(m, k) q^k (1−q)^{m−k} / [1 − (1−q)^m]
Expected value E[N]: mq / [1 − (1−q)^m]
Variance: mq[(1−q) − (1−q+mq)(1−q)^m] / [1 − (1−q)^m]²
Probability generating function P(z): {[1 + q(z−1)]^m − (1−q)^m} / {1 − (1−q)^m}
a and b for recursion: a = −q/(1−q), b = (m+1)q/(1−q)

R Commands
Zero-Truncated Negative Binomial

Functions

Parameter assumptions: r > −1, r ≠ 0, β > 0
p₁^T: rβ / [(1+β)^{r+1} − (1+β)]
Probability mass function p_k^T: [r(r+1)⋯(r+k−1) / (k![(1+β)^r − 1])] (β/(1+β))^k
Expected value E[N]: rβ / [1 − (1+β)^{−r}]
Variance: rβ[(1+β) − (1+β+rβ)(1+β)^{−r}] / [1 − (1+β)^{−r}]²
Probability generating function P(z): {[1 − β(z−1)]^{−r} − (1+β)^{−r}} / {1 − (1+β)^{−r}}
a and b for recursion: a = β/(1+β), b = (r−1)β/(1+β)
R Commands
Logarithmic
Functions
Parameter assumptions: β > 0
p₁^T: β / [(1+β) ln(1+β)]
Probability mass function p_k^T: β^k / [k(1+β)^k ln(1+β)]
Expected value E[N]: β / ln(1+β)
Variance: β[1 + β − β/ln(1+β)] / ln(1+β)
Probability generating function P(z): 1 − ln[1 − β(z−1)] / ln(1+β)
a and b for recursion: a = β/(1+β), b = −β/(1+β)
R Commands
18.2 Continuous Distributions

Exponential

Functions

Parameter assumptions: θ > 0
Probability density function f(x): (1/θ) e^{−x/θ}
Distribution function F(x): 1 − e^{−x/θ}
kth raw moment E[X^k]: θ^k Γ(k+1), k > −1
VaR_p(x): −θ ln(1−p)
Limited expected value E[X ∧ x]: θ(1 − e^{−x/θ})
R Commands
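As a minimal sketch, base R parameterizes the exponential by the rate = 1/θ (the value θ = 1000 is illustrative):

    theta <- 1000
    dexp(500, rate = 1/theta)      # density f(x)
    pexp(500, rate = 1/theta)      # F(x) = 1 - exp(-x/theta)
    qexp(0.95, rate = 1/theta)     # quantile; VaR_p(x) = -theta*log(1-p)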
Illustrative Graph
(Exponential density plot.)
Inverse Exponential
Functions
Parameter assumptions: θ > 0
Probability density function f(x): θ e^{−θ/x} / x²
Distribution function F(x): e^{−θ/x}
kth raw moment E[X^k]: θ^k Γ(1−k), k < 1
E[(X ∧ x)^k]: θ^k G(1−k; θ/x) + x^k (1 − e^{−θ/x})
R Commands
Illustrative Graph
(Inverse exponential density plot.)
Single-Parameter Pareto

Functions

Parameter assumptions: θ known, α > 0, x > θ
Probability density function f(x): αθ^α / x^{α+1}
Distribution function F(x): 1 − (θ/x)^α
kth raw moment E[X^k]: αθ^k/(α−k), k < α
E[(X ∧ x)^k]: αθ^k/(α−k) − kθ^α/[(α−k)x^{α−k}], x ≥ θ
R Commands
Illustrative Graph
(Single-parameter Pareto density plot.)
Pareto
Functions
Parameter assumptions: θ > 0, α > 0
Probability density function f(x): αθ^α / (x+θ)^{α+1}
Distribution function F(x): 1 − (θ/(x+θ))^α
kth raw moment E[X^k]: θ^k Γ(k+1)Γ(α−k)/Γ(α), −1 < k < α
Limited expected value E[X ∧ x], α ≠ 1: [θ/(α−1)][1 − (θ/(x+θ))^{α−1}]
Limited expected value E[X ∧ x], α = 1: −θ ln(θ/(x+θ))
E[(X ∧ x)^k]: [θ^k Γ(k+1)Γ(α−k)/Γ(α)] β(k+1, α−k; x/(x+θ)) + x^k (θ/(x+θ))^α
R Commands
Illustrative Graph
(Pareto density plot.)
Inverse Pareto
Functions
Parameter assumptions: θ > 0, τ > 0
Probability density function f(x): τθx^{τ−1} / (x+θ)^{τ+1}
Distribution function F(x): (x/(x+θ))^τ
kth raw moment E[X^k]: θ^k Γ(τ+k)Γ(1−k)/Γ(τ), −τ < k < 1
E[(X ∧ x)^k]: θ^k τ ∫₀^{x/(x+θ)} y^{τ+k−1}(1−y)^{−k} dy + x^k[1 − (x/(x+θ))^τ], k > −τ
R Commands
Illustrative Graph
(Inverse Pareto density plot.)
Loglogistic
Functions
Parameter assumptions: θ > 0, γ > 0, u = (x/θ)^γ / [1+(x/θ)^γ]
Probability density function f(x): γ(x/θ)^γ / (x[1+(x/θ)^γ]²)
Distribution function F(x): u
kth raw moment E[X^k]: θ^k Γ(1+k/γ)Γ(1−k/γ), −γ < k < γ
E[(X ∧ x)^k]: θ^k Γ(1+k/γ)Γ(1−k/γ) β(1+k/γ, 1−k/γ; u) + x^k(1−u), k > −γ
Illustrative Graph
(Loglogistic density plot.)
Paralogistic
Functions
Parameter assumptions: θ > 0, α > 0, u = 1/[1+(x/θ)^α]
Probability density function f(x): α²(x/θ)^α / (x[1+(x/θ)^α]^{α+1})
Distribution function F(x): 1 − u^α
kth raw moment E[X^k]: θ^k Γ(1+k/α)Γ(α−k/α)/Γ(α), −α < k < α²
E[(X ∧ x)^k]: [θ^k Γ(1+k/α)Γ(α−k/α)/Γ(α)] β(1+k/α, α−k/α; 1−u) + x^k u^α, k > −α
R Commands
Illustrative Graph
(Paralogistic density plot.)
Gamma
Functions
Parameter assumptions: θ > 0, α > 0
Probability density function f(x): [1/(Γ(α)θ^α)] x^{α−1} e^{−x/θ}
Distribution function F(x): Γ(α; x/θ)
kth raw moment E[X^k]: θ^k Γ(α+k)/Γ(α), k > −α
E[(X ∧ x)^k]: [θ^k Γ(α+k)/Γ(α)] Γ(α+k; x/θ) + x^k[1 − Γ(α; x/θ)], k > −α
R Commands
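As a minimal sketch, base R uses shape = α and scale = θ (the values are illustrative):

    dgamma(500, shape = 2, scale = 1000)   # density f(x)
    pgamma(500, shape = 2, scale = 1000)   # F(x) = Gamma(alpha; x/theta)
    rgamma(5, shape = 2, scale = 1000)     # random draws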
Illustrative Graph
(Gamma density plot.)
Inverse Gamma
Functions
Parameter assumptions: θ > 0, α > 0
Probability density function f(x): (θ/x)^α e^{−θ/x} / (x Γ(α))
Distribution function F(x): 1 − Γ(α; θ/x)
kth raw moment E[X^k]: θ^k Γ(α−k)/Γ(α), k < α
E[(X ∧ x)^k]: [θ^k Γ(α−k)/Γ(α)][1 − Γ(α−k; θ/x)] + x^k Γ(α; θ/x)
R Commands
Illustrative Graph
(Inverse gamma density plot.)
Weibull
Functions
Parameter assumptions: θ > 0, α > 0
Probability density function f(x): α(x/θ)^α exp(−(x/θ)^α) / x
Distribution function F(x): 1 − exp(−(x/θ)^α)
kth raw moment E[X^k]: θ^k Γ(1 + k/α), k > −α
E[(X ∧ x)^k]: θ^k Γ(1+k/α) Γ[1+k/α; (x/θ)^α] + x^k exp(−(x/θ)^α), k > −α
R Commands
Illustrative Graph
(Weibull density plot.)
Inverse Weibull
Functions
Parameter assumptions: θ > 0, τ > 0
Probability density function f(x): τ(θ/x)^τ exp(−(θ/x)^τ) / x
Distribution function F(x): exp(−(θ/x)^τ)
kth raw moment E[X^k]: θ^k Γ(1 − k/τ), k < τ
R Commands
Illustrative Graph
(Inverse Weibull density plot.)
Uniform
Functions
Parameter assumptions: −∞ < α < β < ∞
Probability density function f(x): 1/(β−α)
Distribution function F(x): (x−α)/(β−α)
Mean E[X]: (β+α)/2
Variance E[(X−μ)²]: (β−α)²/12
Central moments E[(X−μ)^k]: μ_k = 0 for odd k; μ_k = (β−α)^k / [2^k (k+1)] for even k
R Commands
Illustrative Graph
(Uniform density plot.)
Normal
Functions
Parameter assumptions: −∞ < μ < ∞, σ > 0
Probability density function f(x): (1/(σ√(2π))) exp(−(x−μ)²/(2σ²))
Distribution function F(x): Φ((x−μ)/σ)
Mean E[X]: μ
Variance E[(X−μ)²]: σ²
Central moments E[(X−μ)^k]: μ_k = 0 for odd k; μ_k = σ^k k! / [(k/2)! 2^{k/2}] for even k
R Commands
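As a minimal sketch, with mean μ and standard deviation σ (illustrative values):

    dnorm(0, mean = 0, sd = 1)   # density f(x)
    pnorm(1.96)                  # F(x) = Phi((x - mu)/sigma)
    qnorm(0.975)                 # quantile; about 1.96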
Illustrative Graph
(Normal density plot.)
Cauchy
Functions
Parameter assumptions: −∞ < α < ∞, β > 0
Probability density function f(x): (1/(πβ)) [1 + ((x−α)/β)²]^{−1}
R Commands
Illustrative Graph
(Cauchy density plot.)
Generalized Pareto (Beta of the Second Kind)

Functions

Parameter assumptions: θ > 0, α > 0, τ > 0, u = x/(x+θ)
Probability density function f(x): [Γ(α+τ)/(Γ(α)Γ(τ))] θ^α x^{τ−1} / (x+θ)^{α+τ}
Distribution function F(x): β(τ, α; u)
kth raw moment E[X^k]: θ^k Γ(τ+k)Γ(α−k)/[Γ(α)Γ(τ)], −τ < k < α
E[(X ∧ x)^k]: [θ^k Γ(τ+k)Γ(α−k)/(Γ(α)Γ(τ))] β(τ+k, α−k; u) + x^k[1 − β(τ, α; u)], k > −τ
R Commands
Illustrative Graph
(Generalized Pareto density plot.)
Burr
Functions
Parameter assumptions: θ > 0, α > 0, γ > 0, u = 1/[1+(x/θ)^γ]
Probability density function f(x): αγ(x/θ)^γ / (x[1+(x/θ)^γ]^{α+1})
Distribution function F(x): 1 − u^α
kth raw moment E[X^k]: θ^k Γ(1+k/γ)Γ(α−k/γ)/Γ(α), −γ < k < αγ
E[(X ∧ x)^k]: [θ^k Γ(1+k/γ)Γ(α−k/γ)/Γ(α)] β(1+k/γ, α−k/γ; 1−u) + x^k u^α, k > −γ
R Commands
Illustrative Graph
(Burr density plot.)
Inverse Burr
Functions
Parameter assumptions: θ > 0, τ > 0, γ > 0, u = (x/θ)^γ / [1+(x/θ)^γ]
Probability density function f(x): τγ(x/θ)^{τγ} / (x[1+(x/θ)^γ]^{τ+1})
Distribution function F(x): u^τ
kth raw moment E[X^k]: θ^k Γ(τ+k/γ)Γ(1−k/γ)/Γ(τ), −τγ < k < γ
E[(X ∧ x)^k]: [θ^k Γ(τ+k/γ)Γ(1−k/γ)/Γ(τ)] β(τ+k/γ, 1−k/γ; u) + x^k[1 − u^τ], k > −τγ
R Commands
Illustrative Graph
(Inverse Burr density plot.)
Generalized Beta of the Second Kind (GB2)

Functions

Parameter assumptions: θ > 0, α₁ > 0, α₂ > 0, σ > 0
Probability density function f(x): (x/θ)^{α₂/σ} / (xσ B(α₁, α₂)[1+(x/θ)^{1/σ}]^{α₁+α₂})
kth raw moment E[X^k]: θ^k B(α₁+kσ, α₂−kσ) / B(α₁, α₂), k > 0
R Commands
Please see the R Codes for Loss Data Analytics site for information about this
distribution.
Lognormal
Functions
Parameter assumptions: −∞ < μ < ∞, σ > 0
Probability density function f(x): (1/(xσ√(2π))) exp(−(ln x − μ)²/(2σ²))
Distribution function F(x): Φ((ln(x)−μ)/σ)
kth raw moment E[X^k]: exp(kμ + k²σ²/2)
E[(X ∧ x)^k]: exp(kμ + k²σ²/2) Φ((ln(x)−μ−kσ²)/σ) + x^k[1 − Φ((ln(x)−μ)/σ)]
Illustrative Graph
(Lognormal density plot.)
Inverse Gaussian
Functions
Parameter assumptions: θ > 0, μ > 0, z = (x−μ)/μ, y = (x+μ)/μ
Probability density function f(x): (θ/(2πx³))^{1/2} exp(−θz²/(2x))
Distribution function F(x): Φ[z(θ/x)^{1/2}] + exp(2θ/μ) Φ[−y(θ/x)^{1/2}]
Mean E[X]: μ
Variance Var[X]: μ³/θ
Limited expected value E[X ∧ x]: x − μz Φ[z(θ/x)^{1/2}] − μy exp(2θ/μ) Φ[−y(θ/x)^{1/2}]
R Commands
Illustrative Graph
(Inverse Gaussian density plot.)
Beta
Functions
Parameter assumptions: θ > 0, a > 0, b > 0, u = x/θ, 0 < x < θ
Probability density function f(x): [Γ(a+b)/(Γ(a)Γ(b))] u^a (1−u)^{b−1} (1/x)
Distribution function F(x): β(a, b; u)
kth raw moment E[X^k]: θ^k Γ(a+b)Γ(a+k) / [Γ(a)Γ(a+b+k)], k > −a
E[(X ∧ x)^k]: [θ^k a(a+1)⋯(a+k−1) / ((a+b)(a+b+1)⋯(a+b+k−1))] β(a+k, b; u) + x^k[1 − β(a, b; u)]
R Commands
Illustrative Graph
(Beta density plot.)
Generalized Beta
Functions
Parameter assumptions: θ > 0, a > 0, b > 0, τ > 0, 0 < x < θ, u = (x/θ)^τ
Probability density function f(x): [Γ(a+b)/(Γ(a)Γ(b))] u^a (1−u)^{b−1} (τ/x)
Distribution function F(x): β(a, b; u)
kth raw moment E[X^k]: θ^k Γ(a+b)Γ(a+k/τ) / [Γ(a)Γ(a+b+k/τ)], k > −aτ
E[(X ∧ x)^k]: [θ^k Γ(a+b)Γ(a+k/τ) / (Γ(a)Γ(a+b+k/τ))] β(a+k/τ, b; u) + x^k[1 − β(a, b; u)]
R Commands
Illustrative Graph
(Generalized beta density plot.)
18.3 Limited Expected Values

Limited Expected Value Functions, E[X ∧ x]

GB2: [θ Γ(τ+1)Γ(α−1)/(Γ(α)Γ(τ))] β(τ+1, α−1; x/(x+θ)) + x[1 − β(τ, α; x/(x+θ))]
Burr: [θ Γ(1+1/γ)Γ(α−1/γ)/Γ(α)] β(1+1/γ, α−1/γ; 1 − 1/(1+(x/θ)^γ)) + x(1/(1+(x/θ)^γ))^α
Inverse Burr: [θ Γ(τ+1/γ)Γ(1−1/γ)/Γ(τ)] β(τ+1/γ, 1−1/γ; (x/θ)^γ/(1+(x/θ)^γ)) + x[1 − ((x/θ)^γ/(1+(x/θ)^γ))^τ]
Pareto, α = 1: −θ ln(θ/(x+θ))
Pareto, α ≠ 1: [θ/(α−1)][1 − (θ/(x+θ))^{α−1}]
Inverse Pareto: θτ ∫₀^{x/(x+θ)} y^τ (1−y)^{−1} dy + x[1 − (x/(x+θ))^τ]
Loglogistic: θ Γ(1+1/γ)Γ(1−1/γ) β(1+1/γ, 1−1/γ; (x/θ)^γ/(1+(x/θ)^γ)) + x(1 − (x/θ)^γ/(1+(x/θ)^γ))
Paralogistic: [θ Γ(1+1/α)Γ(α−1/α)/Γ(α)] β(1+1/α, α−1/α; 1 − 1/(1+(x/θ)^α)) + x(1/(1+(x/θ)^α))^α
Inverse Paralogistic: [θ Γ(τ+1/τ)Γ(1−1/τ)/Γ(τ)] β(τ+1/τ, 1−1/τ; (x/θ)^τ/(1+(x/θ)^τ)) + x[1 − ((x/θ)^τ/(1+(x/θ)^τ))^τ]
Gamma: [θ Γ(α+1)/Γ(α)] Γ(α+1; x/θ) + x[1 − Γ(α; x/θ)]
Inverse Gamma: [θ Γ(α−1)/Γ(α)][1 − Γ(α−1; θ/x)] + x Γ(α; θ/x)
Weibull: θ Γ(1+1/α) Γ(1+1/α; (x/θ)^α) + x exp(−(x/θ)^α)
Inverse Weibull: θ Γ(1−1/α)[1 − Γ(1−1/α; (θ/x)^α)] + x[1 − exp(−(θ/x)^α)]
Exponential: θ(1 − exp(−x/θ))
Inverse Exponential: θ G(0; θ/x) + x(1 − exp(−θ/x))
Lognormal: exp(μ + σ²/2) Φ((ln(x)−μ−σ²)/σ) + x[1 − Φ((ln(x)−μ)/σ)]
Inverse Gaussian: x − μ((x−μ)/μ) Φ[((x−μ)/μ)(θ/x)^{1/2}] − μ((x+μ)/μ) exp(2θ/μ) Φ[−((x+μ)/μ)(θ/x)^{1/2}]
Single-Parameter Pareto: αθ/(α−1) − θ^α/[(α−1)x^{α−1}]
Generalized Beta: [θ Γ(a+b)Γ(a+1/τ)/(Γ(a)Γ(a+b+1/τ))] β(a+1/τ, b; (x/θ)^τ) + x[1 − β(a, b; (x/θ)^τ)]
Beta: [θa/(a+b)] β(a+1, b; x/θ) + x[1 − β(a, b; x/θ)]
Illustrative Graph
Comparison of Limited Expected Values for Selected Distributions

Distribution: Parameters; E[X]; E[X ∧ 100]; E[X ∧ 250]; E[X ∧ 500]; E[X ∧ 1000]
Pareto: α = 3, θ = 200; 100; 55.55; 80.25; 91.84; 97.22
Exponential: θ = 100; 100; 63.21; 91.79; 99.33; 99.99
Gamma: α = 2, θ = 50; 100; 72.93; 97.64; 99.97; 100
Weibull: τ = 2, θ = 200/√π; 100; 78.99; 99.82; 100; 100
GB2: α = 3, τ = 2, θ = 100; 100; 62.50; 86.00; 94.91; 98.42

(Plot of E[X ∧ x] against x for the Pareto, exponential, gamma, Weibull, and GB2 distributions.)
Chapter 19
Appendix E: Conventions
for Notation
19.2 Abbreviations
Here is a list of commonly used statistical symbols and operators, including the LaTeX code that we use to generate them (in parentheses).
𝐼(⋅) binary indicator function (𝐼). For example, 𝐼(𝐴) is one if an outcome in event
𝐴 occurs and is 0 otherwise.
Pr(⋅) probability (\Pr)
E(⋅) expectation operator (\mathrm{E}). For example, E(𝑋) = E 𝑋 is the
expected value of the random variable 𝑋, commonly denoted by 𝜇.
Var(⋅) variance operator (\mathrm{Var}). For example, Var(𝑋) = Var 𝑋 is the
variance of the random variable 𝑋, commonly denoted by 𝜎2 .
𝜇𝑘 = E 𝑋 𝑘 kth moment of the random variable X. For 𝑘=1, use 𝜇 = 𝜇1 .
Cov(⋅, ⋅) covariance operator (\mathrm{Cov}). For example,
Cov(𝑋, 𝑌 ) = E {(𝑋 − E 𝑋)(𝑌 − E 𝑌 )} = E(𝑋𝑌 ) − (E 𝑋)(E 𝑌 )
is the covariance between random variables 𝑋 and 𝑌 .
E(𝑋|⋅) conditional expectation operator. For example, E(𝑋|𝑌 = 𝑦) is the
conditional expected value of a random variable 𝑋 given that
the random variable 𝑌 equals y.
Φ(⋅) standard normal cumulative distribution function (\Phi)
𝜙(⋅) standard normal probability density function (\phi)
∼ means is distributed as (\sim). For example, 𝑋 ∼ 𝐹 means that the
random variable 𝑋 has distribution function 𝐹 .
se(β̂) standard error of the parameter estimate β̂ (\hat{\beta}); usually an estimate of the standard deviation of β̂, which is √(Var(β̂)).
𝐻0 null hypothesis
𝐻𝑎 or 𝐻1 alternative hypothesis
Glossary
Term Definition
analytics Analytics is the process of using data to make decisions.
renters insurance Renters insurance is an insurance policy that covers the contents of
an apartment or house that you are renting.
automobile insurance An insurance policy that covers damage to your vehicle, damage to
other vehicles in the accident, as well as medical expenses of those
injured in the accident.
casualty insurance Casualty insurance is a form of liability insurance providing coverage for negligent acts and omissions. Examples include workers compensation, errors and omissions, fidelity, crime, glass, boiler, and various malpractice coverages.
commercial insurance Insurance purchased by a business or other organization
term The duration of an insurance contract
insurance claim An insurance claim is the compensation provided by the insurer for
incurred hurt, loss, or damage that is covered by the policy.
homeowners insurance Homeowners insurance is an insurance policy that covers the
contents and property of a building that is owned by you or a friend.
property insurance Property insurance is a policy that protects the insured against loss
or damage to real or personal property. the cause of loss might be
fire, lightening, business interruption, loss of rents, glass breakage,
tornado, windstorm, hail, water damage, explosion, riot, civil
commotion, rain, or damage from aircraft or vehicles.
non-life Non-life insurance is any type of insurance where payments are not
based on the death (or survivorship) of a named insured. examples
include automobile, homeowners, and so on. also known as property
and casualty or general insurance.
life insurance Life insurance is a contract where the insurer promises to pay upon
the death of an insured person. the person being paid is the
beneficiary.
personal insurance Insurance purchased by a person
loss adjustment expenses Loss adjustment expenses are costs to the insurer that are directly attributable to settling claims; for example, the cost of an adjuster (someone who assesses the claim cost) or of a lawyer who becomes involved in settling an insurer's legal obligation on a claim
unallocated Unallocated loss adjustment expenses are costs that can only be
indirectly attributed to claim settlement; for example, the cost of an
office to support claims staff
allocated Allocated loss adjustment expenses, sometimes known by the acronym ALAE, are costs that can be directly attributed to settling a claim; for example, the cost of an adjuster
underwriting Underwriting is the process where the company makes a decision as
to whether or not to take on a risk.
loss reserving A loss reserve is an estimate of liability indicating the amount the insurer expects to pay for claims that have not yet been realized. This includes losses incurred but not yet reported (IBNR) and claims that have been reported but not yet paid (known by the acronym RBNS, for reported but not settled).
risk classification Risk classification is the process of grouping policyholders into
categories, or classes, where each insured in the class has a risk
profile that is similar to others in the class.
retrospective The process of determining the cost of an insurance policy based on
premiums the actual loss experience determined as an adjustment to the initial
premium payment.
claims adjustment Claims adjustment is the process of determining coverage, legal
liability, and settling claims.
claims leakage Claims leakage represents money lost through claims management inefficiencies.
adjuster An adjuster is a person who investigates claims and recommends
settlement options based on estimates of damage and insurance
policies held.
dividends A dividend is the refund of a portion of the premium paid by the
insured from insurer surplus.
indemnification Indemnification is the compensation provided by the insurer.
rating variables Rating variables are the components of an insurance pricing formula.
they can include numeric variables (like values, revenue, or area)
and classification variables (like location, type of vehicle, or type of
occupancy.)
frequency Count random variables that represent the number of claims
severity The amount, or size, of each payment for an insured event
probability mass A function that gives the probability that a discrete random
function (pmf) variable is exactly equal to some value
distribution function The chance that the random variable is less than or equal to x, as a
function of x
mean Average
moments The rth moment of a random variable is the average value of the random variable raised to the rth power
survival function The probability that the random variable takes on a value greater
than a number x
moment generating function (mgf) The mgf of a random variable N is defined as the expectation of exp(tN), as a function of t
probability generating function (pgf) For a random variable N, its pgf is defined as the expectation of s^N, as a function of s
convex hulls The convex hull of a set of points x is the smallest convex set that
contains x
risk classes The formation of different premiums for the same coverage based on
each homogeneous group’s characteristics.
binomial distribution A random variable has a binomial distribution (with parameters m
and q) if it is the number of ”successes” in a fixed number m of
independent random trials, all of which have the same probability q
of resulting in ”success.”
binary outcomes Outcomes whose unit can take on only two possible states,
traditionally labeled as 0 and 1
m-convolution The addition of m independent random variables
poisson distribution A discrete probability distribution that expresses the probability of
a given number of events occurring in a fixed interval of time or
space if these events occur with a known constant rate and
independently of the time since the last event
negative binomial The number of successes until we observe the rth failure in
distribution independent repetitions of an experiment with binary outcomes
overdispersed The presence of greater variability (statistical dispersion) in a data
set than would be expected based on a given statistical model
underdispersed There was less variation in the data than predicted
(a, b, 0) class The poisson, binomial and negative binomial distributions
maximum likelihood estimator (mle) The possible value of the parameter for which the chance of observing the data is largest
local extrema The largest and smallest values of a function within a given range
central limit theorem In some situations, when independent random variables are added,
(clt) their properly normalized sum tends toward a normal distribution
even if the original variables themselves are not normally
distributed.
newton’s method A root-finding algorithm which produces successively better
approximations to the roots of a real-valued function
central moment The kth central moment of a random variable x is the expected
value of (x-its mean)^k
skewness Measure of the symmetry of a distribution, 3rd central
moment/standard deviation^3
kurtosis Measure of the peaked-ness of a distribution, 4th central
moment/standard deviation^4
expected value Average
exponential distribution A single-parameter continuous probability distribution that is defined by its rate parameter
independent Two variables are independent if conditional information given about
one variable provides no information regarding the other variable
percentile The pth percentile of a random variable x is the smallest value x_p
such that the probability of not exceeding it is p%
chi-square distribution A common distribution used in chi-square tests for determining
goodness of fit of observed data to a theorized distribution
light tailed A distribution with thinner tails than the benchmark exponential
distribution distribution
pareto distribution A heavy-tailed and positively skewed distribution with 2 parameters
hazard function Ratio of the probability density function and the survival function:
f(x)/s(x), and represents an instantaneous probability within a small
time frame
weibull distribution A positively skewed continuous distribution with 2 parameters that
can have an increasing or decreasing hazard function depending on
the shape parameter
generalized beta A 4-parameter flexible distribution that encompasses many common
distribution of the distributions
second kind
parametric Probability distribution defined by a fixed set of parameters
distributions
transformation A function or method that turns one distribution into another
distribution function A transformation technique that involves finding the cdf of the
technique transformed distribution through its relation with the original cdf
change-of-variable A transformation technique that involves finding the pdf of the
technique transformed distribution through its relation with the original pdf
using inverse functions
moment-generating A transformation technique that uses moment generating functions
function technique properties to determine the mgf of a linear combination of variables
lognormal distribution A heavy-tailed, positively skewed 2-parameter continuous
distribution such that the natural log of the random variable is
normally distributed with the same parameter values
reliability data A dataset consisting of failure times for failed units and run times
for units still functioning
power transformation A transformation type that involves raising a random variable to a
power
method of maximum Statistical method used to derive the parameter values from data
likelihood that maximize the probability of observing the data given the
parameters
grouped data Data bucketed into categories with ranges, such as for use in
histograms or frequency tables
large-sample Asymptotic properties of a distribution as the amount of data
properties increases towards infinity
asymptotic variance Variability of the distribution of an estimator as the amount of data
increases towards infinity
delta method Statistical method used to approximate the asymptotic variance for
a function based on parameters whose asymptotic variance can be
determined
log-likelihood function Natural log of the likelihood function
covariance matrix Matrix where the (i,j)^th element represents the covariance between
the ith and jth random variables
complete data Data where each individual observation is known, and no values are
censored, truncated, or grouped
parametric Distributional assumptions made on the population from which the
data is drawn, with properties defined using parameters.
nonparametric No distributional assumptions are made on the population from
which the data is drawn.
sampling scheme How the data is obtained from the population and what data is
observed.
unbiased An estimator that has no bias, that is, the expected value of an
estimator equals the parameter being estimated.
plug-in principle The plug-in principle or analog principle of estimation proposes that
population parameters be estimated by sample statistics which have
the same property in the sample as the parameters do in the
population.
indicator A categorical variable that has only two groups. the numerical
values are usually taken to be one to indicate the presence of an
attribute, and zero otherwise. another name for a binary variable.
empirical distribution The empirical distribution is a non-parametric estimate of the
function underlying distribution of a random variable. it directly uses the
data observations to construct the distribution, with each observed
data point in a size-n sample having probability 1/n.
first quartile The 25th percentile; the number such that approximately 25% of
the data is below it.
third quartile The 75th percentile; the number such that approximately 75% of
the data is below it.
quantile The q-th quantile is the point(s) at which the distribution function
is equal to q, i.e. the inverse of the cumulative distribution function.
smoothed empirical A quantile obtained by linear interpolation between two empirical
quantile quantiles, i.e. data points.
bandwidth A small positive constant that defines the width of the steps and the
degree of smoothing.
kernel density A nonparametric estimator of the density function of a random
estimator variable.
bias-variance tradeoff The tradeoff between model simplicity (underfitting; high bias) and
flexibility (overfitting; high variance).
model diagnostics Procedures to assess the validity of a model
probability-probability A plot that compares two models through their cumulative
(pp) plot probabilities.
quantile-quantile (qq) A plot that compares two models through their quantiles.
plot
goodness of fit A measure used to assess how well a statistical model fits the data,
statistics usually by summarizing the discrepancy between the observations
and the expected values under the model.
method of moments The estimation of population parameters by approximating
parametric moments using empirical sample moments.
percentile matching The estimation of population parameters by approximating
parametric percentiles using empirical quantiles.
percentile A 100p-th percentile is the number such that 100 times p percent of
the data is below it.
gini index A measure for assessing income inequality. it measures the
discrepancy between the income and population distributions and is
calculated from the lorenz curve.
model selection The process of selecting a statistical model from a set of candidate
models using data.
in-sample A dataset used for analysis and model development. also known as a
training dataset.
out-of-sample A dataset used for model validation. also known as a test dataset.
cross-validation A model validation procedure in which the data sample is
partitioned into subsamples, where splits are formed by separately
taking each subsample as the out-of-sample dataset.
model validation The process of confirming that the proposed model is appropriate.
data-snooping Repeatedly fitting models to a data set without a prior hypothesis
of interest.
predictive inference Predictive inference is the process of using past data observations to predict future observations.
likelihood function A function of the likeliness of the parameters in a model, given the
observed data.
ogive estimator A nonparametric estimator for the distribution function in the
presence of grouped data.
product-limit A nonparametric estimator of the survival function in the presence
estimator of incomplete data. also known as the kaplan-meier estimator.
risk set The number of observations that are active (not censored) at a
specific point.
leave-one-out cross A special case of k-fold cross validation, where each single data
validation point gets a turn in being the lone hold-out test data point, and n
separate models in total are built and tested
jackknife statistics To calculate an estimator, leave out each observation in turn,
calculate the sample estimator statistic each time, and average over
the n separate estimates
accept-reject A sampling method that is used where the random sample is
mechanism discarded if not within a certain pre-specified range [a, b] and is
commonly used when the traditional inverse transform method
cannot be easily used
importance sampling Type of sampling method where values in the region of interest can
mechanism be over-sampled or values outside the region of interest can be
under-sampled
ergodic theorem Ergodic theory studies the behavior of a dynamical system when it
is allowed to run for an extended time
markov process A stochastic (time dependent) process that satisfies memorylessness,
meaning future predictions of the process can be made solely based
on its present state and not the historical path
invariant measure Any mathematical measure that is preserved by a function (the
mean is an example)
composants Components (smaller, self-contained parts of a larger entity)
hastings metropolis A markov chain monte carlo (mcmc) method for random sampling
from a probability distribution where values are iteratively
generated, with the distribution of the next sample dependent only
on the current sample value, and at each iteration, the candidate
sample can be either accepted or rejected
gibbs sampler A markov chain monte carlo (mcmc) method to obtain a sequence of
random samples from a specified multivariate continuous probability
distribution
premium Amount of money an insurer charges to provide the coverage
described in the policy
ratemaking Process used by insurers to calculate insurance rates, which drive
insurance premiums
insurance rates Amount of money needed to cover losses, expenses, and profit per
one unit of exposure
insured contingent event A condition that results in an insurance claim.
expected costs The cost to an insurer of payments to the insured and allocated loss adjustment expenses (ALAEs). Overhead and profit are not included.
underwriting profit Profit an insurer derives from providing coverage, excluding
investment income
experience rating A type of rating plan that uses the insured’s historical loss
experience as part of the premium determination
price A quantity, usually of money, that is exchanged for a good or service
rates A rate is the price, or premium, charged per unit of exposure. A rate is a premium expressed in standardized units.
technical prices Prices indicated by the actuarial analysis of expected costs, before adjustments for marketing or other business considerations.
loss cost The sum of losses divided by an exposure; it is also known as the
pure premium.
profit loading A factor or percentage applied to the premium calculation to
account for insurer profit in a policy
indicated change factor A factor, calculated from the loss ratio method, that indicates how the rates should change, with factors greater than 1 indicating an increase and vice versa.
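As a numerical sketch of the loss ratio method, the figures below are entirely hypothetical; the target (permissible) loss ratio would come from the insurer's expense and profit provisions.

```r
# Loss ratio method: indicated change factor (all figures hypothetical).
experience_losses <- 650000    # losses in the experience period
on_level_premium  <- 1000000   # on-level earned premium
target_loss_ratio <- 0.60      # permissible loss ratio
experience_lr <- experience_losses / on_level_premium   # 0.65
icf <- experience_lr / target_loss_ratio                # indicated change factor
icf   # about 1.083, i.e., an indicated rate increase of roughly 8.3%
```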
indicated rate In a rate filing, the amount that the loss experience suggests that
the insurer should charge to cover costs.
credibility Weight assigned to observed data vs. that assigned to an external or
broader-based set of data
parametric distribution Model assumption that the sample data comes from a population that can be modeled by a probability distribution with a fixed set of parameters.
commercial business property Line of business that insures businesses against damage to their buildings and contents due to a covered cause of loss.
continuous variables Type of variable that can take on any real value
discrimination Process of determining premiums on the basis of likelihood of loss. Insurance laws prohibit "unfair discrimination".
rating factor A rating factor, or rating variable, is a characteristic of the
policyholder or risk being insured by which rates vary.
rating variable A rating factor, or rating variable, is a characteristic of the
policyholder or risk being insured by which rates vary.
factor A variable that varies by groups or categories.
relativity The difference of the expected risk between a specific level of a rating factor and an accepted baseline value. This difference may be arithmetic or proportional.
scale distribution Suppose that Y = cX, where X comes from a parametric distribution family and c is a positive constant. The distribution is said to be a scale distribution if (i) the distributions of Y and X come from the same family and (ii) only a single parameter differs, and that by a factor of c.
written exposures Exposure is based on policies written/issued.
earned exposures Exposure is based on the amount exposed to loss for which coverage has been provided.
unearned exposures Exposure amount for which coverage has not yet been provided
in force exposures Exposure amount subject to loss at a particular point in time
calendar year method Experience for rating is aggregated based on calendar year, as
opposed to other methods such as when a policy term began
accident date Date of loss occurrence that gives rise to a claim
report date Date when insurer is notified of the claim
open claim A claim that has been reported but not yet closed
mix of business Different types of policies in an insurer’s portfolio
on-level earned premium Earned premium of historical policies restated using the current rate structure.
experience loss ratio Ratio of experience loss to on-level earned premium in the
experience period
claim The amount paid to an individual or corporation, under a policy of insurance, for the recovery of a loss that comes within that policy.
incurred but not reported A claim is said to be incurred but not reported if the insured event occurs prior to a valuation date (and hence the insurer is liable for payment) but the event has not been reported to the insurer.
closed A claim is said to be closed when the company deems its financial
obligations on the claim to be resolved.
valuation date A valuation date is the date at which a company summarizes its
financial position, typically quarterly or annually.
policy year This is the period between a policy’s anniversary dates.
gini index The Gini index is twice the area between a Lorenz curve and a 45-degree line.
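A minimal R sketch computing a Lorenz curve and the Gini index from a simulated income sample; the gamma parameters and the trapezoidal approximation of the area are illustrative choices.

```r
# Gini index: twice the area between the Lorenz curve and the 45-degree line.
set.seed(2020)
income <- sort(rgamma(1000, shape = 2, scale = 25000))
pop_share <- (0:length(income)) / length(income)           # x: population share
inc_share <- c(0, cumsum(income)) / sum(income)            # y: income share
# Area under the Lorenz curve by the trapezoidal rule:
area_under <- sum(diff(pop_share) *
                  (head(inc_share, -1) + tail(inc_share, -1)) / 2)
gini <- 1 - 2 * area_under   # = 2 * (area between curve and 45-degree line)
gini
```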
line of equality The 45-degree line equating x and y, representing a perfect alignment between the sample and population distributions.
pp plot Statistical plot used to assess how closely a data sample matches a theorized distribution.
performance curve A concentration curve is a graph of the distribution of two variables, where both variables are ordered by only one of the variables. For insurance applications, it is a graph of the distribution of losses versus premiums, where both losses and premiums are ordered by premiums.
community rating This generally refers to the premium principle where all risks pay
the same amount.
market conduct regulation Regulation that ensures consumers obtain fair and reasonable insurance prices and coverage.
government prescribed Government sets the entire rating system including coverages
prior approval Regulator must approve rates, forms, rules filed by insurers before
use
no file Insurers may use new rates, forms, rules without approval from
regulators
file only Insurers must file rates, forms, rules for record keeping and use
immediately
rating factors Characteristics of a risk that help price the insurance contract
multiplicative tariff model A rating method in which the rate is the product of parameters associated with each rating factor.
risk characteristics The distinguishing features of a policy that help determine the
expected loss on the policy
gross insurance premium Sum of expected losses, expenses, and profit on a policy.
adverse selection The tendency of a pricing structure to entice riskier individuals to purchase and to discourage low-risk individuals from purchasing.
adverse selection spiral Phenomenon where a book of business deteriorates as it attracts
ever-riskier individuals when forced to increase premiums due to
losses
a priori variables Variables that the insurer knows before policy inception.
closed-form expressions A mathematical expression that can be well defined with a formula that has a finite number of operations.
levels Different outcomes of a categorical variable
nominal A categorical variable where the categories do not have a natural
order and any numbering is arbitrary
dummy variables A variable that takes on a value of 0 or 1 to indicate the absence or
presence of a categorical characteristic
log linear form Regression model in which the natural log of the expected response is a linear function of the explanatory variables.
base case The categorical level chosen as the default with all dummy variable
indicators of 0
workers compensation A no-fault insurance system prescribed by state law where benefits
are provided by an employer to an employee due to a job-related
injury, including death, resulting from an accident or occupational
disease
exposure bases The unit of measurement chosen to represent the exposure for a
particular risk
offset Natural log of the exposure amount that is added to a regression
model to account for varying exposures
tariff A table or list that contains the rating factors and associated
premiums and other risk information
in-force times The timeframe during which a policy is active and the insurer is
bound by the contractual obligation
rate parameter Parameter in certain distributions, such as the exponential, that indicates how quickly the function decays; it is the reciprocal of the scale parameter.
functional forms The algebraic relationship between a dependent variable and
explanatory variables
multiplicative form Relationship where the dependent variable is a product of the
explanatory variables
base tariff cell The chosen set of rating categories where the rate equals the
intercept of the model (the base value)
relativities A numerical estimate of value in one category relative to the value
in a base classification, typically expressed as a factor
non-automobile vehicles Motorized vehicles which are not autos, such as ATVs, off-road vehicles, go-karts, etc.
distributional structure The manner in which a statistical distribution is parameterized.
information matrix Matrix that measures the amount of information that an observable random variable X carries about an unknown parameter of a distribution; it is used to calculate covariance matrices of maximum likelihood estimators.
classification rating plan A rating plan that uses an insured's risk characteristics to determine premium.
credibility weight The weight assigned to an insured’s historical loss experience for the
purposes of determining their premium in an experience rating plan
complement of credibility The remainder of the weight not assigned to an insured's historical loss experience in the experience rating plan.
class rate Average rate per exposure for an insured in a particular
classification group
full credibility standard The threshold of experience necessary to assign 100% credibility to the insured's own experience.
limited fluctuation credibility A credibility method that attempts to limit fluctuations in its estimates.
cumulative distribution function of the standard normal Cumulative distribution function for the normal distribution with mean 0 and standard deviation 1.
buhlmann credibility A credibility method that uses the amount of experience, expected
value of the process variance, and variance of the hypothetical
means to determine the credibility weight
collective mean The mean estimate of a risk when no loss information about the risk
is known
law of total expectation The expected value of the conditional expected value of X given Y is the same as the expected value of X; that is, E[E[X | Y]] = E[X].
risk parameter Parameter in a distribution whose value reflects the risk
categorization
expected value of the process variance Average of the natural variability of observations from within each risk.
variance of the hypothetical means Variance of the means across different classes, used to determine how similar or different the classes are from one another.
buhlmann-straub credibility An extension of the buhlmann credibility model that allows for varying exposure by year.
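Tying together the buhlmann credibility, expected value of the process variance (EPV), and variance of the hypothetical means (VHM) entries above, here is a minimal R sketch of the credibility weight Z = n / (n + k) with k = EPV/VHM; all input values are hypothetical.

```r
# Buhlmann credibility weight and credibility premium (illustrative inputs).
n   <- 5        # years of experience for the insured
epv <- 40000    # expected value of the process variance
vhm <- 8000     # variance of the hypothetical means
k <- epv / vhm                    # k = 5
Z <- n / (n + k)                  # credibility weight, here 0.5
xbar <- 1200    # insured's own mean loss experience
mu   <- 1000    # collective mean
Z * xbar + (1 - Z) * mu           # credibility-weighted estimate
```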
bayes theorem A probability law that expresses the conditional probability of the event A given the event B in terms of the conditional probability of the event B given the event A and the unconditional probability of A.
bayesian inference A branch of statistics that leverages Bayes' theorem to update the distribution as more experience becomes available.
gamma-poisson model A statistical model that assumes the frequency of claims is Poisson, with a mean that has a gamma prior distribution.
exact credibility A situation where the bayesian credibility estimate matches that of
the buhlmann credibility estimate
beta-binomial model A statistical model for modeling the probability of an event using
the binomial distribution with a probability that has a prior
distribution from a beta distribution
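A short R sketch of the gamma-Poisson model and exact credibility: with a gamma(shape alpha, scale theta) prior on the Poisson mean, the posterior mean after n observed claim counts equals a credibility-weighted average. The prior parameters and claim counts below are illustrative.

```r
# Gamma-Poisson updating; posterior mean equals the credibility estimate.
alpha <- 3; theta <- 0.5        # prior mean = alpha * theta = 1.5
claims <- c(1, 0, 2, 1, 0)      # observed annual claim counts (illustrative)
n <- length(claims)
post_mean <- (alpha + sum(claims)) * theta / (1 + n * theta)
Z <- n * theta / (1 + n * theta)                 # credibility weight
post_mean                                        # posterior mean, here 1.0
Z * mean(claims) + (1 - Z) * alpha * theta       # same value: exact credibility
```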
nonparametric estimation Statistical method that allows the functional form of a fit from data to have no assumed prior distribution, constraints, or parameters.
empirical bayes methods Credibility methods that estimate the credibility weight without using any assumptions about prior distributions or likelihoods, instead relying only on empirical data.
semiparametric estimation Credibility method that assumes a distribution for the loss per exposure random variable and otherwise uses empirical data.
portfolios A collection of contracts
insurance portfolios A collection, or aggregation, of insurance contracts
reinsurers A company that sells reinsurance
heavy tailed A rv is said to be heavy tailed if high probabilities are assigned to
large values
survival function One minus the distribution function. It gives the probability that a rv exceeds a specific value.
coherent risk measure A risk measure that is subadditive, monotonic, has positive homogeneity, and is translation invariant.
mean excess loss function The expected value of a loss in excess of a quantity, given that the loss exceeds the quantity.
risk measure A measure that summarizes the riskiness, or uncertainty, of a
distribution
value-at-risk A risk measure based on a quantile function
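A minimal R sketch of value-at-risk as an empirical quantile of simulated losses, together with the mean excess loss at that level; the gamma loss distribution is purely illustrative.

```r
# Empirical value-at-risk and mean excess loss from simulated losses.
set.seed(2020)
losses <- rgamma(10000, shape = 2, scale = 5000)
var_95 <- quantile(losses, 0.95)           # value-at-risk at the 95% level
var_95
mean(losses[losses > var_95] - var_95)     # mean excess loss beyond var_95
```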
ceding company A company that purchases reinsurance (also known as the reinsured)
excess of loss Under an excess of loss arrangement, the insurer sets a retention
level for each claim and pays claim amounts less than the level with
the reinsurer paying the excess.
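As a sketch of the excess of loss arrangement just described, the R code below splits each claim at a per-claim retention level; the retention and claim amounts are hypothetical.

```r
# Excess of loss: insurer pays up to the retention, reinsurer pays the excess.
retention <- 100000
claims <- c(20000, 150000, 80000, 400000)        # illustrative claim amounts
insurer_pays   <- pmin(claims, retention)        # retained portion
reinsurer_pays <- pmax(claims - retention, 0)    # ceded excess
cbind(claims, insurer_pays, reinsurer_pays)
```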
primary insurance Insurance purchased by a non-insurer
proportional reinsurance An agreement between a reinsurer and a ceding company (also known as the reinsured) in which the reinsurer assumes a given percent of losses and premium.
quota share A proportional treaty where the reinsurer receives a flat percent of the premium for the book of business reinsured and pays a percentage of losses, including allocated loss adjustment expenses. The reinsurer may also pay the ceding company a ceding commission, which is designed to reflect the differences in underwriting expenses incurred.
reinsured A company that purchases reinsurance (also known as the ceding
company)
retained line The amount of exposure that the reinsured retains on a given line in a surplus share reinsurance agreement.
retention function A function that maps the insurer portfolio loss into the amount of
loss retained by the insurer.
stop-loss Under a stop-loss arrangement, the insurer sets a retention level and
pays in full total claims less than the level with the reinsurer paying
the excess.
surplus share A proportional reinsurance treaty that is common in commercial property insurance. A surplus share treaty allows the reinsured to limit its exposure on any one risk to a given amount (the retained line). The reinsurer assumes a part of the risk in proportion to the amount that the insured value exceeds the retained line, up to a given limit (expressed as a multiple of the retained line, or number of lines).
treaty A reinsurance contract that applies to a designated book of business
or exposures.
bonus-malus system A type of rating mechanism where insured premiums are adjusted
based on their individual loss experience history
no claim discount (ncd) system A type of experience rating where insureds obtain discounts on future years' premiums based on claims-free experience.
hunger for bonus Phenomenon where insureds under an experience rating system are
dissuaded from filing minor claims in order to keep their no-claims
discount
takaful Co-operative system of reimbursement or repayment in case of loss
as an insurance alternative
markov chain A stochastic model (time dependent) where the probability of each
event depends only on the current state and not the historical path
transition matrix Matrix that represents all probabilities for transition from one state to another (which could be the same state) for a Markov chain.
stationary distribution Probability distribution that remains unchanged in the Markov chain as time progresses.
ergodic Irreducible Markov chain where it is eventually possible to move from any state to any other state, with positive probability.
irreversible A Markov chain where there does not exist a probability distribution that allows for the chain to be walked backwards in time.
eigenvector A non-zero vector that changes by only a scalar factor when a linear transformation is applied.
n-step transition probability Probability of ending in a state j after n periods, starting in state i, where i and j can be the same state.
convergence rate After n transitions, the total variation between the probability in each state and the stationary probability.
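Drawing on the Markov chain entries above, here is a minimal R sketch of n-step transition probabilities and the stationary distribution for a two-state chain; the transition matrix is hypothetical (it could represent, say, a simple no claim discount system).

```r
# n-step transition probabilities and stationary distribution.
P <- matrix(c(0.9, 0.1,
              0.5, 0.5), nrow = 2, byrow = TRUE)   # hypothetical transition matrix
P %*% P                      # 2-step transition probabilities
Pn <- P
for (i in 1:50) Pn <- Pn %*% P
Pn[1, ]                      # rows converge to the stationary distribution
# Alternatively, solve pi = pi P via the left eigenvector for eigenvalue 1:
e <- eigen(t(P))
statd <- Re(e$vectors[, 1])
statd / sum(statd)           # stationary distribution, here (5/6, 1/6)
```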
poisson regression model Type of regression model used for fitting data with an integral (count) response variable with mean equal to the variance.
negative binomial regression model Type of regression model used for fitting data with an integral (count) response variable that can account for variance greater than the mean.
overdispersion Phenomenon where the variance of data is larger than what is
modeled
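For illustration, a minimal R sketch of Poisson regression for claim counts, with the log of exposure as an offset (see the offset entry earlier in this glossary); the data frame and variable names are hypothetical.

```r
# Poisson regression with a log-exposure offset (simulated data).
set.seed(2020)
dat <- data.frame(
  age      = sample(c("young", "old"), 500, replace = TRUE),
  exposure = runif(500, 0.5, 1)
)
dat$counts <- rpois(500,
  lambda = dat$exposure * ifelse(dat$age == "young", 0.3, 0.1))
fit_pois <- glm(counts ~ age + offset(log(exposure)),
                family = poisson, data = dat)
exp(coef(fit_pois))   # base rate and multiplicative relativities
# With overdispersion (variance > mean), a negative binomial regression
# could be fit instead, e.g., MASS::glm.nb with the same offset.
```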
cross-classified rating classes Table that combines the effects of multiple rating classifications.
structured data Data that can be organized into a repository format, typically a
database
unstructured data Data that is not in a predefined format, most notably text, audio, and video.
qualitative data Data which is non-numerical in nature.
quantitative data Data which is numerical in nature.
ordinal data Data field with a natural ordering
interval data Continuous data which is broken into interval bands with a natural
ordering
key-value databases Data storage method that stores and finds records using a unique key hash.
column-oriented databases Data storage method that stores records by column instead of by row.
document databases Data storage method that uses the document metadata for search
and retrieval, also known as semi-structured data
data decay Corruption of data due to hardware failure in the storage device
reverification Manual process of checking the integrity of data
data element analysis Analysis of the format and definition of each field
structural analysis Statistical analysis of the structured data present to detect
irregularities
robust Statistics which are relatively unaffected by outliers or small departures from model assumptions.
exploratory data analysis Approach to analyzing data sets to summarize their main characteristics, using visual methods, descriptive statistics, clustering, and dimension reduction.
confirmatory data analysis Process used to challenge assumptions about the data through hypothesis tests, significance testing, model estimation, prediction, confidence intervals, and inference.
supervised learning methods Models that predict a response target variable using explanatory predictors as input.
unsupervised learning methods Models that work with explanatory variables only to describe patterns or groupings.
classification methods Supervised learning method where the response is a categorical
variable
regression methods Supervised learning method where the response is a continuous
variable
categorical variable A variable whose values are qualitative groups and can have no
natural ordering (nominal) or an ordering (ordinal)
variables A variable is any characteristic, number, or quantity that can be measured or counted.
interval variable An ordinal variable with the additional property that the
magnitudes of the differences between two values are meaningful
spatial data Data and information having an implicit or explicit association with
a location relative to the earth
high dimensional A data set is high dimensional when it has many variables. In many applications, the number of variables may be larger than the sample size.
qualitative This is a type of variable in which the measurement denotes
membership in a set of groups, or categories
nominal variable This is a type of qualitative/categorical variable which has two or more categories without having any kind of natural order.
ordinal variable This is a type of qualitative/categorical variable which has two or more ordered categories.
binary variable A special type of categorical variable where there are only two categories.
quantitative variable A quantitative variable is a type of variable in which numerical level
is a realization from some scale so that the distance between any
two levels of the scale takes on meaning.
continuous variable A continuous variable is a quantitative variable that can take on any
value within a finite interval.
policyholder Person in actual possession of insurance policy; policy owner.
discrete variable A discrete variable is a quantitative variable that takes on only a finite number of values in any finite interval.
count variable A count variable is a discrete variable with values on the nonnegative integers.
circular data In circular data, all values around the circle are equally likely. For example, imagine the face of an analog clock.
insurers An insurance company authorized to write insurance under the laws
of any state.
multivariate A multivariate variable involves taking many measurements on a single entity.
workers’ compensation Insurance that covers an employer’s liability for injuries, disability
or death to persons in their employment, without regard to fault, as
prescribed by state or federal workers’ compensation laws and other
statutes.
univariate Univariate analysis is the simplest form of analyzing data. “Uni” means “one”; in other words, the data has only one variable.
missing data Missing data occur when no data value is stored for a variable in an
observation. Missing data can occur because of nonresponse: no
information is provided for one or more items or for a whole unit or
subject.
censored Censored data have unknown values beyond a bound on either end
of the number line or both. Here, the data is observed but the
values (measurements) are not known completely.
truncated Truncation occurs when values beyond a boundary are either
excluded when gathered or excluded when analyzed. An object can
be detected only if its value is greater than some number.
stochastic process A stochastic process is a collection of random variables indexed by some mathematical set, meaning that each random variable of the stochastic process is uniquely associated with an element in the set.
deductibles A deductible is a parameter specified in the contract. Typically, losses below the deductible are paid by the policyholder whereas losses in excess of the deductible are the insurer's responsibility (subject to policy limits and coinsurance).
rank based measures Measures of statistical dependence based on the rankings of two variables.
odds ratio A statistic quantifying the strength of the association between two events, A and B, defined as the ratio of the odds of A in the presence of B to the odds of A in the absence of B.
likelihood ratio test A statistical test of the goodness-of-fit between two models
pearson correlation A measure of the linear correlation between two variables
product-moment (pearson) correlation Pearson correlation, a measure of the linear correlation between two variables.
kendall’s tau A statistic used to measure the ordinal association between two
measured quantities
concordant An observation pair (x, y) is said to be concordant if the observation with a larger value of x also has the larger value of y.
discordant An observation pair (x, y) is said to be discordant if the observation with a larger value of x has the smaller value of y.
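Tying together the Kendall's tau, concordant, and discordant entries above, here is a minimal R sketch on simulated data; the sample size and dependence structure are illustrative.

```r
# Kendall's tau from concordant and discordant pairs (no ties here).
set.seed(2020)
x <- rnorm(50); y <- 0.5 * x + rnorm(50)
pairs <- combn(50, 2)    # all pairs of observation indices
s <- sign(x[pairs[1, ]] - x[pairs[2, ]]) *
     sign(y[pairs[1, ]] - y[pairs[2, ]])
sum(s > 0)                                   # concordant pairs
sum(s < 0)                                   # discordant pairs
(sum(s > 0) - sum(s < 0)) / choose(50, 2)    # Kendall's tau
cor(x, y, method = "kendall")                # matches the built-in value
```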
pearson chi-square statistic A statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance.
tetrachoric correlation A technique for estimating the correlation between two theorized normally distributed continuous latent variables, from two observed binary variables.
polychoric correlation A technique for estimating the correlation between two theorized normally distributed continuous latent variables, from two observed ordinal variables.
polyserial correlation The correlation between two continuous variables with a bivariate
normal distribution, where one variable is observed directly, and the
other is unobserved
tail value-at-risk The expected value of a risk given that the risk exceeds a
value-at-risk
coefficient of variation Standard deviation divided by the mean of a distribution, to
measure variability in terms of units of the mean
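For illustration, a minimal R sketch of the tail value-at-risk and the coefficient of variation from simulated losses; the gamma loss distribution is illustrative only.

```r
# Tail value-at-risk and coefficient of variation (simulated losses).
set.seed(2020)
losses <- rgamma(10000, shape = 2, scale = 5000)
var_95 <- quantile(losses, 0.95)
mean(losses[losses > var_95])   # tail value-at-risk at the 95% level
sd(losses) / mean(losses)       # coefficient of variation
```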
loss ratio The sum of losses divided by the premium.
homogeneous risks Risks that have the same distribution, that is, the distributions are
identical.
heterogeneous Heterogeneous risks have different distributions. Often, we can attribute differences to varying exposures or risk factors.
exposure A type of rating variable that is so important that premiums and losses are often quoted on a "per exposure" basis. That is, premiums and losses are commonly standardized by exposure variables.
loss The amount of damages sustained by an individual or corporation,
typically as the result of an insurable event.
iid Independent and identically distributed
pdf Probability density function
aic Akaike’s information criterion
bic Bayesian information criterion
pmf Probability mass function
mcmc Markov Chain Monte Carlo
cdf Cumulative distribution function
df Degrees of freedom
glm Generalized linear model
mle Maximum likelihood estimate
ols Ordinary least squares
pf Probability function
rv Random variable
reporting delay The time that elapses between the occurrence of the insured event
and the reporting of this event to the insurance company.
settlement delay The time between reporting and settlement of a claim.
rbns Reported But Not fully Settled.
ibnr Incurred But Not yet Reported. For such a claim, the insured event took place, but the insurance company is not yet aware of the associated claim.
granular Claims data recorded at the individual claim level, as opposed to aggregate data such as a run-off triangle.
case estimates The claims handler's expert estimate of the outstanding amount on a claim.
.csv Comma separated value file
.txt Text file
run-off triangle Triangular display of loss reserve data. Accident or occurrence
periods on one axis (often vertical) with development periods on the
other (often horizontal). Also known as a development triangle.