Loss Data Analytics

An open text authored by the Actuarial Community


Contents

Preface 13
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Reviewers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Other Collaborators . . . . . . . . . . . . . . . . . . . . . . . . . 19
Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
For our Readers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

1 Introduction to Loss Data Analytics 21


1.1 Relevance of Analytics to Insurance Activities . . . . . . . . . . . 21
1.1.1 Nature and Relevance of Insurance . . . . . . . . . . . . . 21
1.1.2 What is Analytics? . . . . . . . . . . . . . . . . . . . . . . 22
1.1.3 Insurance Processes . . . . . . . . . . . . . . . . . . . . . 23
1.2 Insurance Company Operations . . . . . . . . . . . . . . . . . . . 24
1.2.1 Initiating Insurance . . . . . . . . . . . . . . . . . . . . . 27
1.2.2 Renewing Insurance . . . . . . . . . . . . . . . . . . . . . 28
1.2.3 Claims and Product Management . . . . . . . . . . . . . . 29
1.2.4 Loss Reserving . . . . . . . . . . . . . . . . . . . . . . . . 30
1.3 Case Study: Wisconsin Property Fund . . . . . . . . . . . . . . . 31
1.3.1 Fund Claims Variables: Frequency and Severity . . . . . . 32
1.3.2 Fund Rating Variables . . . . . . . . . . . . . . . . . . . . 34
1.3.3 Fund Operations . . . . . . . . . . . . . . . . . . . . . . . 37
1.4 Further Resources and Contributors . . . . . . . . . . . . . . . . 39

2 Frequency Modeling 41
2.1 Frequency Distributions . . . . . . . . . . . . . . . . . . . . . . . 42
2.1.1 How Frequency Augments Severity Information . . . . . . 42
2.2 Basic Frequency Distributions . . . . . . . . . . . . . . . . . . . . 44
2.2.1 Foundations . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.2.2 Moment and Probability Generating Functions . . . . . . 46
2.2.3 Important Frequency Distributions . . . . . . . . . . . . . 47
2.3 The (a, b, 0) Class . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.4 Estimating Frequency Distributions . . . . . . . . . . . . . . . . . 54


2.4.1 Parameter Estimation . . . . . . . . . . . . . . . . . . . . 55


2.4.2 Frequency Distributions MLE . . . . . . . . . . . . . . . . 57
2.5 Other Frequency Distributions . . . . . . . . . . . . . . . . . . . 64
2.5.1 Zero Truncation or Modification . . . . . . . . . . . . . . 65
2.6 Mixture Distributions . . . . . . . . . . . . . . . . . . . . . . . . 67
2.7 Goodness of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
2.9 Further Resources and Contributors . . . . . . . . . . . . . . . . 75
2.9.1 TS 2.A. R Code for Plots . . . . . . . . . . . . . . . . . . 75

3 Modeling Loss Severity 77


3.1 Basic Distributional Quantities . . . . . . . . . . . . . . . . . . . 77
3.1.1 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.1.2 Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.1.3 Moment Generating Function . . . . . . . . . . . . . . . . 80
3.2 Continuous Distributions for Modeling Loss Severity . . . . . . . 81
3.2.1 Gamma Distribution . . . . . . . . . . . . . . . . . . . . . 82
3.2.2 Pareto Distribution . . . . . . . . . . . . . . . . . . . . . . 84
3.2.3 Weibull Distribution . . . . . . . . . . . . . . . . . . . . . 86
3.2.4 The Generalized Beta Distribution of the Second Kind . . 89
3.3 Methods of Creating New Distributions . . . . . . . . . . . . . . 90
3.3.1 Functions of Random Variables and their Distributions . 90
3.3.2 Multiplication by a Constant . . . . . . . . . . . . . . . . 90
3.3.3 Raising to a Power . . . . . . . . . . . . . . . . . . . . . . 92
3.3.4 Exponentiation . . . . . . . . . . . . . . . . . . . . . . . . 94
3.3.5 Finite Mixtures . . . . . . . . . . . . . . . . . . . . . . . . 95
3.3.6 Continuous Mixtures . . . . . . . . . . . . . . . . . . . . . 97
3.4 Coverage Modifications . . . . . . . . . . . . . . . . . . . . . . . . 99
3.4.1 Policy Deductibles . . . . . . . . . . . . . . . . . . . . . . 99
3.4.2 Policy Limits . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.4.3 Coinsurance and Inflation . . . . . . . . . . . . . . . . . . 106
3.4.4 Reinsurance . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.5 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . 110
3.5.1 Maximum Likelihood Estimators for Complete Data . . . 110
3.5.2 Maximum Likelihood Estimators using Modified Data . . 118
3.6 Further Resources and Contributors . . . . . . . . . . . . . . . . 123

4 Model Selection and Estimation 125


4.1 Nonparametric Inference . . . . . . . . . . . . . . . . . . . . . . . 125
4.1.1 Nonparametric Estimation . . . . . . . . . . . . . . . . . . 126
4.1.2 Tools for Model Selection and Diagnostics . . . . . . . . . 136
4.1.3 Starting Values . . . . . . . . . . . . . . . . . . . . . . . . 141
4.2 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.2.1 Iterative Model Selection . . . . . . . . . . . . . . . . . . 145
4.2.2 Model Selection Based on a Training Dataset . . . . . . . 146
4.2.3 Model Selection Based on a Test Dataset . . . . . . . . . 147

4.2.4 Model Selection Based on Cross-Validation . . . . . . . . 149


4.3 Estimation using Modified Data . . . . . . . . . . . . . . . . . . . 150
4.3.1 Parametric Estimation using Modified Data . . . . . . . . 150
4.3.2 Nonparametric Estimation using Modified Data . . . . . . 158
4.4 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.4.1 Introduction to Bayesian Inference . . . . . . . . . . . . . 165
4.4.2 Bayesian Model . . . . . . . . . . . . . . . . . . . . . . . . 168
4.4.3 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . 169
4.4.4 Conjugate Distributions . . . . . . . . . . . . . . . . . . . 175
4.5 Further Resources and Contributors . . . . . . . . . . . . . . . . 177

5 Aggregate Loss Models 179


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
5.2 Individual Risk Model . . . . . . . . . . . . . . . . . . . . . . . . 180
5.3 Collective Risk Model . . . . . . . . . . . . . . . . . . . . . . . . 188
5.3.1 Moments and Distribution . . . . . . . . . . . . . . . . . . 188
5.3.2 Stop-loss Insurance . . . . . . . . . . . . . . . . . . . . . . 194
5.3.3 Analytic Results . . . . . . . . . . . . . . . . . . . . . . . 197
5.3.4 Tweedie Distribution . . . . . . . . . . . . . . . . . . . . . 199
5.4 Computing the Aggregate Claims Distribution . . . . . . . . . . 200
5.4.1 Recursive Method . . . . . . . . . . . . . . . . . . . . . . 200
5.4.2 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 202
5.5 Effects of Coverage Modifications . . . . . . . . . . . . . . . . . . 205
5.5.1 Impact of Exposure on Frequency . . . . . . . . . . . . . 205
5.5.2 Impact of Deductibles on Claim Frequency . . . . . . . . 206
5.5.3 Impact of Policy Modifications on Aggregate Claims . . . 210
5.6 Further Resources and Contributors . . . . . . . . . . . . . . . . 214
TS 5.A.1. Individual Risk Model Properties . . . . . . . . . . . . 214
TS 5.A.2. Relationship Between Probability Generating Functions of 𝑋𝑖 and 𝑋𝑖𝑇 . . . . . . . . . . . . 216
TS 5.A.3. Example 5.3.8 Moment Generating Function of Aggregate Loss 𝑆𝑁 . . . . . . . . . . . . 216

6 Simulation and Resampling 219


6.1 Simulation Fundamentals . . . . . . . . . . . . . . . . . . . . . . 219
6.1.1 Generating Independent Uniform Observations . . . . . . 220
6.1.2 Inverse Transform Method . . . . . . . . . . . . . . . . . . 221
6.1.3 Simulation Precision . . . . . . . . . . . . . . . . . . . . . 225
6.1.4 Simulation and Statistical Inference . . . . . . . . . . . . 230
6.2 Bootstrapping and Resampling . . . . . . . . . . . . . . . . . . . 233
6.2.1 Bootstrap Foundations . . . . . . . . . . . . . . . . . . . . 233
6.2.2 Bootstrap Precision: Bias, Standard Deviation, and Mean Square Error . . . . . . . . . . . . 235
6.2.3 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . 239
6.2.4 Parametric Bootstrap . . . . . . . . . . . . . . . . . . . . 241
6.3 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

6.3.1 k-Fold Cross-Validation . . . . . . . . . . . . . . . . . . . 244


6.3.2 Leave-One-Out Cross-Validation . . . . . . . . . . . . . . 246
6.3.3 Cross-Validation and Bootstrap . . . . . . . . . . . . . . . 247
6.4 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 248
6.5 Monte Carlo Markov Chain (MCMC) . . . . . . . . . . . . . . . 249
6.5.1 Metropolis Hastings . . . . . . . . . . . . . . . . . . . . . 250
6.5.2 Gibbs Sampler . . . . . . . . . . . . . . . . . . . . . . . . 250
6.6 Further Resources and Contributors . . . . . . . . . . . . . . . . 251
6.6.1 TS 6.A. Bootstrap Applications in Predictive Modeling . 251

7 Premium Foundations 253


7.1 Introduction to Ratemaking . . . . . . . . . . . . . . . . . . . . . 253
7.2 Aggregate Ratemaking Methods . . . . . . . . . . . . . . . . . . 256
7.2.1 Pure Premium Method . . . . . . . . . . . . . . . . . . . 257
7.2.2 Loss Ratio Method . . . . . . . . . . . . . . . . . . . . . . 258
7.3 Pricing Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
7.3.1 Premium Principles . . . . . . . . . . . . . . . . . . . . . 260
7.3.2 Properties of Premium Principles . . . . . . . . . . . . . . 262
7.4 Heterogeneous Risks . . . . . . . . . . . . . . . . . . . . . . . . . 263
7.4.1 Exposure to Risk . . . . . . . . . . . . . . . . . . . . . . . 263
7.4.2 Rating Factors . . . . . . . . . . . . . . . . . . . . . . . . 265
7.5 Development and Trending . . . . . . . . . . . . . . . . . . . . . 267
7.5.1 Exposures and Premiums . . . . . . . . . . . . . . . . . . 268
7.5.2 Losses, Claims, and Payments . . . . . . . . . . . . . . . . 270
7.5.3 Comparing Pure Premium and Loss Ratio Methods . . . 271
7.6 Selecting a Premium . . . . . . . . . . . . . . . . . . . . . . . . . 274
7.6.1 Classic Lorenz Curve . . . . . . . . . . . . . . . . . . . . . 274
7.6.2 Performance Curve and a Gini Statistic . . . . . . . . . . 275
7.6.3 Out-of-Sample Validation . . . . . . . . . . . . . . . . . . 278
7.7 Further Resources and Contributors . . . . . . . . . . . . . . . . 280
TS 7.A. Rate Regulation . . . . . . . . . . . . . . . . . . . . . . . 281

8 Risk Classification 285


8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
8.2 Poisson Regression Model . . . . . . . . . . . . . . . . . . . . . . 287
8.2.1 Need for Poisson Regression . . . . . . . . . . . . . . . . . 288
8.2.2 Poisson Regression . . . . . . . . . . . . . . . . . . . . . . 291
8.2.3 Incorporating Exposure . . . . . . . . . . . . . . . . . . . 292
8.2.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
8.3 Categorical Variables and Multiplicative Tariff . . . . . . . . . . 295
8.3.1 Rating Factors and Tariff . . . . . . . . . . . . . . . . . . 295
8.3.2 Multiplicative Tariff Model . . . . . . . . . . . . . . . . . 297
8.3.3 Poisson Regression for Multiplicative Tariff . . . . . . . . 298
8.3.4 Numerical Examples . . . . . . . . . . . . . . . . . . . . . 300
8.4 Further Resources and Contributors . . . . . . . . . . . . . . . . 303
TS 8.A. Estimating Poisson Regression Models . . . . . . . . . . 303

TS 8.B. Selecting Rating Factors . . . . . . . . . . . . . . . . . . 305

9 Experience Rating Using Credibility Theory 309


9.1 Introduction to Applications of Credibility Theory . . . . . . . . 309
9.2 Limited Fluctuation Credibility . . . . . . . . . . . . . . . . . . . 310
9.2.1 Full Credibility for Claim Frequency . . . . . . . . . . . . 311
9.2.2 Full Credibility for Aggregate Losses and Pure Premium . 314
9.2.3 Full Credibility for Severity . . . . . . . . . . . . . . . . . 316
9.2.4 Partial Credibility . . . . . . . . . . . . . . . . . . . . . . 317
9.3 Bühlmann Credibility . . . . . . . . . . . . . . . . . . . . . . . . 319
9.3.1 Credibility Z, EPV, and VHM . . . . . . . . . . . . . . . 321
9.4 Bühlmann-Straub Credibility . . . . . . . . . . . . . . . . . . . . 324
9.5 Bayesian Inference and Bühlmann Credibility . . . . . . . . . . . 327
9.5.1 Gamma-Poisson Model . . . . . . . . . . . . . . . . . . . . 328
9.5.2 Beta-Binomial Model . . . . . . . . . . . . . . . . . . . . 330
9.5.3 Exact Credibility . . . . . . . . . . . . . . . . . . . . . . . 331
9.6 Estimating Credibility Parameters . . . . . . . . . . . . . . . . . 332
9.6.1 Full Credibility Standard for Limited Fluctuation Credibility . . . . . . . . . . . . 332
9.6.2 Nonparametric Estimation for Bühlmann and Bühlmann-Straub Models . . . . . . . . . . . . 333
9.6.3 Semiparametric Estimation for Bühlmann and Bühlmann-Straub Models . . . . . . . . . . . . 337
9.6.4 Balancing Credibility Estimators . . . . . . . . . . . . . . 339
9.7 Further Resources and Contributors . . . . . . . . . . . . . . . . 341

10 Insurance Portfolio Management including Reinsurance 343


10.1 Introduction to Insurance Portfolios . . . . . . . . . . . . . . . . 343
10.2 Tails of Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 344
10.2.1 Classification Based on Moments . . . . . . . . . . . . . . 345
10.2.2 Comparison Based on Limiting Tail Behavior . . . . . . . 348
10.3 Risk Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
10.3.1 Coherent Risk Measures . . . . . . . . . . . . . . . . . . . 350
10.3.2 Value-at-Risk . . . . . . . . . . . . . . . . . . . . . . . . . 354
10.3.3 Tail Value-at-Risk . . . . . . . . . . . . . . . . . . . . . . 357
10.4 Reinsurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
10.4.1 Proportional Reinsurance . . . . . . . . . . . . . . . . . . 363
10.4.2 Non-Proportional Reinsurance . . . . . . . . . . . . . . . 367
10.4.3 Additional Reinsurance Treaties . . . . . . . . . . . . . . 371
10.5 Further Resources and Contributors . . . . . . . . . . . . . . . . 374

11 Loss Reserving 375


11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
11.1.1 Closed, IBNR, and RBNS Claims . . . . . . . . . . . . . . 376
11.1.2 Why Reserving? . . . . . . . . . . . . . . . . . . . . . . . 377
11.2 Loss Reserve Data . . . . . . . . . . . . . . . . . . . . . . . . . . 378

11.2.1 From Micro to Macro . . . . . . . . . . . . . . . . . . . . 378


11.2.2 Run-off Triangles . . . . . . . . . . . . . . . . . . . . . . . 378
11.2.3 Loss Reserve Notation . . . . . . . . . . . . . . . . . . . . 380
11.2.4 R Code to Summarize Loss Reserve Data . . . . . . . . . 381
11.3 The Chain-Ladder Method . . . . . . . . . . . . . . . . . . . . . 385
11.3.1 The Deterministic Chain-Ladder . . . . . . . . . . . . . . 386
11.3.2 Mack’s Distribution-Free Chain-Ladder Model . . . . . . 390
11.3.3 R code for Chain-Ladder Predictions . . . . . . . . . . . . 394
11.4 GLMs and Bootstrap for Loss Reserves . . . . . . . . . . . . . . 397
11.4.1 Model Specification . . . . . . . . . . . . . . . . . . . . . 397
11.4.2 Model Estimation and Prediction . . . . . . . . . . . . . . 399
11.4.3 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
11.5 Further Resources and Contributors . . . . . . . . . . . . . . . . 399

12 Experience Rating using Bonus-Malus 401


12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
12.2 NCD System in Several Countries . . . . . . . . . . . . . . . . . . 402
12.2.1 NCD System in Malaysia . . . . . . . . . . . . . . . . . . 402
12.2.2 NCD System in Other Countries . . . . . . . . . . . . . . 403
12.3 BMS and Markov Chain Model . . . . . . . . . . . . . . . . . . . 405
12.3.1 Transition Probability . . . . . . . . . . . . . . . . . . . . 405
12.4 BMS and Stationary Distribution . . . . . . . . . . . . . . . . . . 408
12.4.1 Stationary Distribution . . . . . . . . . . . . . . . . . . . 408
12.4.2 R Code for a Stationary Distribution . . . . . . . . . . . . 409
12.4.3 Premium Evolution . . . . . . . . . . . . . . . . . . . . . 412
12.4.4 R Program for Premium Evolution . . . . . . . . . . . . . 413
12.4.5 Convergence Rate . . . . . . . . . . . . . . . . . . . . . . 415
12.4.6 R Program for Convergence Rate . . . . . . . . . . . . . . 417
12.5 BMS and Premium Rating . . . . . . . . . . . . . . . . . . . . . 418
12.5.1 Premium Rating . . . . . . . . . . . . . . . . . . . . . . . 418
12.5.2 A Priori Risk Classification . . . . . . . . . . . . . . . . . 419
12.5.3 Modelling of Residual Heterogeneity . . . . . . . . . . . . 419
12.5.4 Stationary Distribution Allowing for Residual Heterogeneity . . 420
12.5.5 Determination of Optimal Relativities . . . . . . . . . . . 421
12.5.6 Numerical Illustrations . . . . . . . . . . . . . . . . . . . . 423

13 Data and Systems 427


13.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
13.1.1 Data Types and Sources . . . . . . . . . . . . . . . . . . . 427
13.1.2 Data Structures and Storage . . . . . . . . . . . . . . . . 429
13.1.3 Data Quality . . . . . . . . . . . . . . . . . . . . . . . . . 430
13.1.4 Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . 431
13.2 Data Analysis Preliminaries . . . . . . . . . . . . . . . . . . . . . 431
13.2.1 Data Analysis Process . . . . . . . . . . . . . . . . . . . . 432
13.2.2 Exploratory versus Confirmatory . . . . . . . . . . . . . . 433
13.2.3 Supervised versus Unsupervised . . . . . . . . . . . . . . . 434

13.2.4 Parametric versus Nonparametric . . . . . . . . . . . . . . 434


13.2.5 Explanation versus Prediction . . . . . . . . . . . . . . . . 435
13.2.6 Data Modeling versus Algorithmic Modeling . . . . . . . 435
13.2.7 Big Data Analysis . . . . . . . . . . . . . . . . . . . . . . 436
13.2.8 Reproducible Analysis . . . . . . . . . . . . . . . . . . . . 437
13.2.9 Ethical Issues . . . . . . . . . . . . . . . . . . . . . . . . . 437
13.3 Data Analysis Techniques . . . . . . . . . . . . . . . . . . . . . . 438
13.3.1 Exploratory Techniques . . . . . . . . . . . . . . . . . . . 438
13.3.2 Confirmatory Techniques . . . . . . . . . . . . . . . . . . 440
13.4 Some R Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 443
13.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
13.6 Further Resources and Contributors . . . . . . . . . . . . . . . . 443

14 Dependence Modeling 445


14.1 Variable Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
14.1.1 Qualitative Variables . . . . . . . . . . . . . . . . . . . . . 446
14.1.2 Quantitative Variables . . . . . . . . . . . . . . . . . . . . 448
14.1.3 Multivariate Variables . . . . . . . . . . . . . . . . . . . . 448
14.2 Classic Measures of Scalar Associations . . . . . . . . . . . . . . 449
14.2.1 Association Measures for Quantitative Variables . . . . . 450
14.2.2 Rank Based Measures . . . . . . . . . . . . . . . . . . . . 451
14.2.3 Nominal Variables . . . . . . . . . . . . . . . . . . . . . . 452
14.2.4 Ordinal Variables . . . . . . . . . . . . . . . . . . . . . . . 455
14.2.5 Interval Variables . . . . . . . . . . . . . . . . . . . . . . . 456
14.2.6 Discrete and Continuous Variables . . . . . . . . . . . . . 456
14.3 Introduction to Copulas . . . . . . . . . . . . . . . . . . . . . . . 457
14.4 Application Using Copulas . . . . . . . . . . . . . . . . . . . . . . 458
14.4.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . 459
14.4.2 Marginal Models . . . . . . . . . . . . . . . . . . . . . . . 459
14.4.3 Probability Integral Transformation . . . . . . . . . . . . 460
14.4.4 Joint Modeling with Copula Function . . . . . . . . . . . 461
14.5 Types of Copulas . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
14.5.1 Normal (Gaussian) Copulas . . . . . . . . . . . . . . . . . 464
14.5.2 t- and Elliptical Copulas . . . . . . . . . . . . . . . . . . . 465
14.5.3 Archimedean Copulas . . . . . . . . . . . . . . . . . . . . 467
14.5.4 Properties of Copulas . . . . . . . . . . . . . . . . . . . . 468
14.6 Why is Dependence Modeling Important? . . . . . . . . . . . . . 470
14.7 Further Resources and Contributors . . . . . . . . . . . . . . . . 472
TS 14.A. Other Classic Measures of Scalar Associations . . . . . 472

15 Appendix A: Review of Statistical Inference 475


15.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
15.1.1 Random Sampling . . . . . . . . . . . . . . . . . . . . . . 477
15.1.2 Sampling Distribution . . . . . . . . . . . . . . . . . . . . 477
15.1.3 Central Limit Theorem . . . . . . . . . . . . . . . . . . . 478
15.2 Point Estimation and Properties . . . . . . . . . . . . . . . . . . 478

15.2.1 Method of Moments Estimation . . . . . . . . . . . . . . . 479


15.2.2 Maximum Likelihood Estimation . . . . . . . . . . . . . . 480
15.3 Interval Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 482
15.3.1 Exact Distribution for Normal Sample Mean . . . . . . . 482
15.3.2 Large-sample Properties of MLE . . . . . . . . . . . . . . 483
15.3.3 Confidence Interval . . . . . . . . . . . . . . . . . . . . . . 483
15.4 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 485
15.4.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . 485
15.4.2 Student-t test based on mle . . . . . . . . . . . . . . . . . 486
15.4.3 Likelihood Ratio Test . . . . . . . . . . . . . . . . . . . . 488
15.4.4 Information Criteria . . . . . . . . . . . . . . . . . . . . . 489

16 Appendix B: Iterated Expectations 491


16.1 Conditional Distribution and Conditional Expectation . . . . . . 491
16.1.1 Conditional Distribution . . . . . . . . . . . . . . . . . . . 492
16.1.2 Conditional Expectation and Conditional Variance . . . . 494
16.2 Iterated Expectations and Total Variance . . . . . . . . . . . . . 495
16.2.1 Law of Iterated Expectations . . . . . . . . . . . . . . . . 496
16.2.2 Law of Total Variance . . . . . . . . . . . . . . . . . . . . 497
16.2.3 Application . . . . . . . . . . . . . . . . . . . . . . . . . . 498
16.3 Conjugate Distributions . . . . . . . . . . . . . . . . . . . . . . . 499
16.3.1 Linear Exponential Family . . . . . . . . . . . . . . . . . 499
16.3.2 Conjugate Distributions . . . . . . . . . . . . . . . . . . . 500

17 Appendix C: Maximum Likelihood Theory 503


17.1 Likelihood Function . . . . . . . . . . . . . . . . . . . . . . . . . 503
17.1.1 Likelihood and Log-likelihood Functions . . . . . . . . . . 503
17.1.2 Properties of Likelihood Functions . . . . . . . . . . . . . 504
17.2 Maximum Likelihood Estimators . . . . . . . . . . . . . . . . . . 506
17.2.1 Definition and Derivation of MLE . . . . . . . . . . . . . 506
17.2.2 Asymptotic Properties of MLE . . . . . . . . . . . . . . . 507
17.2.3 Use of Maximum Likelihood Estimation . . . . . . . . . . 508
17.3 Statistical Inference Based on Maximum Likelihood Estimation . 509
17.3.1 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . 509
17.3.2 MLE and Model Validation . . . . . . . . . . . . . . . . . 510

18 Appendix D: Summary of Distributions 513


18.1 Discrete Distributions . . . . . . . . . . . . . . . . . . . . . . . . 513
18.1.1 The (a,b,0) Class . . . . . . . . . . . . . . . . . . . . . . . 513
18.1.2 The (a,b,1) Class . . . . . . . . . . . . . . . . . . . . . . . 516
18.2 Continuous Distributions . . . . . . . . . . . . . . . . . . . . . . 519
18.2.1 One Parameter Distributions . . . . . . . . . . . . . . . . 519
18.2.2 Two Parameter Distributions . . . . . . . . . . . . . . . . 523
18.2.3 Three Parameter Distributions . . . . . . . . . . . . . . . 535
18.2.4 Four Parameter Distribution . . . . . . . . . . . . . . . . 539
18.2.5 Other Distributions . . . . . . . . . . . . . . . . . . . . . 539

18.2.6 Distributions with Finite Support . . . . . . . . . . . . . 541


18.3 Limited Expected Values . . . . . . . . . . . . . . . . . . . . . . . 544

19 Appendix E: Conventions for Notation 547


19.1 General Conventions . . . . . . . . . . . . . . . . . . . . . . . . . 547
19.2 Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
19.3 Common Statistical Symbols and Operators . . . . . . . . . . . . 548
19.4 Common Mathematical Symbols and Functions . . . . . . . . . . 549
19.5 Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . 550

20 Glossary 551
Preface

Date: 23 August 2020

Book Description

Loss Data Analytics is an interactive, online, freely available text.

• The online version contains many interactive objects (quizzes, computer demonstrations, interactive graphs, video, and the like) to promote deeper learning.
• A subset of the book is available for offline reading in pdf and EPUB
formats.
• The online text will be available in multiple languages to promote access
to a worldwide audience.

What will success look like?

The online text will be freely available to a worldwide audience. The online ver-
sion will contain many interactive objects (quizzes, computer demonstrations,
interactive graphs, video, and the like) to promote deeper learning. Moreover, a
subset of the book will be available in pdf format for low-cost printing. The on-
line text will be available in multiple languages to promote access to a worldwide
audience.

How will the text be used?

This book will be useful in actuarial curricula worldwide. It will cover the loss
data learning objectives of the major actuarial organizations. Thus, it will be
suitable for classroom use at universities as well as for use by independent learn-
ers seeking to pass professional actuarial examinations. Moreover, the text will
also be useful for the continuing professional development of actuaries and other
professionals in insurance and related financial risk management industries.


Why is this good for the profession?


An online text is a type of open educational resource (OER). One important
benefit of an OER is that it equalizes access to knowledge, thus permitting a
broader community to learn about the actuarial profession. Moreover, it has
the capacity to engage viewers through active learning that deepens the learning
process, producing analysts more capable of solid actuarial work.
Why is this good for students and teachers and others involved in the learning
process? Cost is often cited as an important factor for students and teachers in
textbook selection (see a recent post on the $400 textbook). Students will also
appreciate the ability to “carry the book around” on their mobile devices.

Why loss data analytics?


The intent is that this type of resource will eventually permeate throughout the
actuarial curriculum. Given the dramatic changes in the way that actuaries
treat data, loss data seems like a natural place to start. The idea behind the
name loss data analytics is to integrate classical loss data models from applied
probability with modern analytic tools. In particular, we recognize that big
data (including social media and usage based insurance) are here to stay and
that high speed computation is readily available.

Project Goal
The project goal is to have the actuarial community author our textbooks in
a collaborative fashion. To get involved, please visit our Open Actuarial Text-
books Project Site.

Acknowledgements
Edward Frees acknowledges the John and Anne Oros Distinguished Chair for
Inspired Learning in Business which provided seed money to support the project.
Frees and his Wisconsin colleagues also acknowledge a Society of Actuaries Cen-
ter of Excellence Grant that provided funding to support work in dependence
modeling and health initiatives. Wisconsin also provided an education innova-
tion grant that provided partial support for the many students who have worked
on this project.
We acknowledge the Society of Actuaries for permission to use problems from
their examinations.
We thank Rob Hyndman, Monash University, for allowing us to use his excellent
style files to produce the online version of the book.
We thank Yihui Xie and his colleagues at Rstudio for the R bookdown package
that allows us to produce this book.

We also wish to acknowledge the support and sponsorship of the International Association of Black Actuaries in our joint efforts to provide actuarial educational content to all.

Contributors
The project goal is to have the actuarial community author our textbooks in a
collaborative fashion. The following contributors have taken a leadership role
in developing Loss Data Analytics.

• Zeinab Amin is a Professor at the Department of Mathematics and Actuarial Science and Associate Provost for Assessment and Accreditation at
the American University in Cairo (AUC). Amin holds a PhD in Statistics
and is an Associate of the Society of Actuaries. Amin is the recipient of
the 2016 Excellence in Academic Service Award and the 2009 Excellence
in Teaching Award from AUC. Amin has designed and taught a variety of
statistics and actuarial science courses. Amin’s current area of research
includes quantitative risk assessment, reliability assessment, general sta-
tistical modelling, and Bayesian statistics.

• Katrien Antonio, KU Leuven

• Jan Beirlant, KU Leuven

• Arthur Charpentier is a professor in the Department of Mathematics at the Université du Québec à Montréal. Prior to that, he worked at a
large general insurance company in Hong Kong, China, and the French
Federation of Insurers in Paris, France. He received an MS in mathematical economics at Université Paris Dauphine and an MS in actuarial science at
ENSAE (National School of Statistics) in Paris, and a PhD degree from
KU Leuven, Belgium. His research interests include econometrics, applied
probability and actuarial science. He has published several books (the
most recent one on Computational Actuarial Science with R, CRC) and
papers on a variety of topics. He is a Fellow of the French Institute of
Actuaries, and was in charge of the ‘Data Science for Actuaries’ program
from 2015 to 2018.

• Curtis Gary Dean is the Lincoln Financial Distinguished Professor of Actuarial Science at Ball State University. He is a Fellow of the Casualty
Actuarial Society and a CFA charterholder. He has extensive practical
experience as an actuary at American States Insurance, SAFECO, and
Travelers. He has served the CAS and actuarial profession as chair of
the Examination Committee, first editor-in-chief for Variance: Advancing
the Science of Risk, and as a member of the Board of Directors and the
Executive Council. He contributed a chapter to Predictive Modeling Applications in Actuarial Science published by Cambridge University Press.
• Edward W. (Jed) Frees is an emeritus professor, formerly the Hickman-
Larson Chair of Actuarial Science at the University of Wisconsin-Madison.
He is a Fellow of both the Society of Actuaries and the American Statisti-
cal Association. He has published extensively (a four-time winner of the
Halmstad Prize for best paper published in the actuarial literature)
and has written three books. He also is a co-editor of the two-volume
series Predictive Modeling Applications in Actuarial Science published by
Cambridge University Press.
• Guojun Gan is an associate professor in the Department of Mathematics
at the University of Connecticut, where he has been since August 2014.
Prior to that, he worked at a large life insurance company in Toronto,
Canada for six years. He received a BS degree from Jilin University,
Changchun, China, in 2001 and MS and PhD degrees from York Uni-
versity, Toronto, Canada, in 2003 and 2007, respectively. His research
interests include data mining and actuarial science. He has published
several books and papers on a variety of topics, including data cluster-
ing, variable annuity, mathematical finance, applied statistics, and VBA
programming.
• Lisa Gao is a PhD candidate in the Risk and Insurance department at
the University of Wisconsin-Madison. She holds a BMath in Actuarial
Science and Statistics from the University of Waterloo and is an Associate
of the Society of Actuaries.
• José Garrido, Concordia University
• Lei (Larry) Hua is an Associate Professor of Actuarial Science at North-
ern Illinois University. He earned a PhD degree in Statistics from the
University of British Columbia. He is an Associate of the Society of Ac-
tuaries. His research work focuses on multivariate dependence modeling
for non-Gaussian phenomena and innovative applications for financial and
insurance industries.
• Noriszura Ismail is a Professor and Head of Actuarial Science Program,
Universiti Kebangsaan Malaysia (UKM). She specializes in Risk Modelling
and Applied Statistics. She obtained her BSc and MSc (Actuarial Science)
in 1991 and 1993 from University of Iowa, and her PhD (Statistics) in 2007
from UKM. She also passed several papers from Society of Actuaries in
1994. She has received several research grants from Ministry of Higher
Education Malaysia (MOHE) and UKM, totaling about MYR1.8 million.
She has successfully supervised and co-supervised several PhD students
(13 completed and 11 on-going). She currently has about 180 publications,
consisting of 88 journals and 95 proceedings.
• Joseph H.T. Kim, Ph.D., FSA, CERA, is Associate Professor of Applied
Statistics at Yonsei University, Seoul, Korea. He holds a Ph.D. degree in
Actuarial Science from the University of Waterloo, at which he taught as
Assistant Professor. He also worked in the life insurance industry. He
has published papers in Insurance Mathematics and Economics, Journal
of Risk and Insurance, Journal of Banking and Finance, ASTIN Bulletin,
and North American Actuarial Journal, among others.
• Nii-Armah Okine is an assistant professor at the Mathematical Sciences
Department at Appalachian State University. He holds a Ph.D. in Busi-
ness (Actuarial Science) from the University of Wisconsin - Madison and
obtained his master’s degree in Actuarial science from Illinois State Univer-
sity. His research interest includes micro-level reserving, joint longitudinal-
survival modeling, dependence modeling, micro-insurance, and machine
learning.
• Emine Selin Sarıdaş is a doctoral candidate in the Statistics department
of Mimar Sinan University. She holds a bachelor's degree in Actuarial Science with a minor in Economics and a master's degree in Actuarial Science from Hacettepe University. Her research interests include dependence modeling, regression, loss models, and life contingencies.
• Peng Shi is an associate professor in the Risk and Insurance Department
at the Wisconsin School of Business. He is also the Charles & Laura Al-
bright Professor in Business and Finance. Professor Shi is an Associate of
the Casualty Actuarial Society (ACAS) and a Fellow of the Society of Ac-
tuaries (FSA). He received a Ph.D. in actuarial science from the University
of Wisconsin-Madison. His research interests are problems at the inter-
section of insurance and statistics. He has won several research awards,
including the Charles A. Hachemeister Prize, the Ronald Bornhuetter Loss
Reserve Prize, and the American Risk and Insurance Association Prize.
• Nariankadu D. Shyamalkumar (Shyamal) is an associate professor
in the Department of Statistics and Actuarial Science at The University of
Iowa. He is an Associate of the Society of Actuaries, and has volunteered
in various elected and non-elected roles within the SoA. Having a broad
theoretical interest as well as interest in computing, he has published in
prominent actuarial, computer science, probability theory, and statistical
journals. Moreover, he has worked in the financial industry, and since
then served as an independent consultant to the insurance industry. He
has experience educating actuaries in both Mexico and the US, serving
in the roles of directing an undergraduate program, and as a graduate
adviser for both masters and doctoral students.
• Jianxi Su is an Assistant Professor at the Department of Statistics at Pur-
due University. He is the Associate Director of Purdue’s Actuarial Science.
Prior to joining Purdue in 2016, he completed his PhD at York University (2012-2015). He became a Fellow of the Society of Actuaries (FSA) in 2017. His research expertise is in dependence modelling, risk management, and pricing. During his PhD candidature, Jianxi also worked as
a research associate at the Model Validation and ORSA Implementation
team of Sun Life Financial (Toronto office).
• Tim Verdonck is an associate professor at the University of Antwerp. He
has a degree in Mathematics and a PhD in Science: Mathematics, ob-
tained at the University of Antwerp. During his PhD he successfully took
the Master in Insurance and the Master in Financial and Actuarial Engi-
neering, both at KU Leuven. His research focuses on the adaptation and
application of robust statistical methods for insurance and finance data.
• Krupa Viswanathan is an Associate Professor in the Risk, Insurance
and Healthcare Management Department in the Fox School of Business,
Temple University. She is an Associate of the Society of Actuaries. She
teaches courses in Actuarial Science and Risk Management at the under-
graduate and graduate levels. Her research interests include corporate
governance of insurance companies, capital management, and sentiment
analysis. She received her Ph.D. from The Wharton School of the Univer-
sity of Pennsylvania.

Reviewers
Our goal is to have the actuarial community author our textbooks in a collabora-
tive fashion. Part of the writing process involves many reviewers who generously
donated their time to help make this book better. They are:
• Yair Babab
• Chunsheng Ban, Ohio State University
• Vytaras Brazauskas, University of Wisconsin - Milwaukee
• Yvonne Chueh, Central Washington University
• Chun Yong Chew, Universiti Tunku Abdul Rahman (UTAR)
• Eren Dodd, University of Southampton
• Gordon Enderle, University of Wisconsin - Madison
• Rob Erhardt, Wake Forest University
• Runhuan Feng, University of Illinois
• Brian Hartman, Brigham Young University
• Liang (Jason) Hong, University of Texas at Dallas
• Fei Huang, Australian National University
• Hirokazu (Iwahiro) Iwasawa
• Himchan Jeong, University of Connecticut
• Min Ji, Towson University
• Paul Herbert Johnson, University of Wisconsin - Madison
• Dalia Khalil, Cairo University
• Samuel Kolins, Lebanon Valley College
• Andrew Kwon-Nakamura, Zurich North America
• Ambrose Lo, University of Iowa
• Mark Maxwell, University of Texas at Austin


• Tatjana Miljkovic, Miami University
• Bell Ouelega, American University in Cairo
• Zhiyu (Frank) Quan, University of Connecticut
• Jiandong Ren, Western University
• Rajesh V. Sahasrabuddhe, Oliver Wyman
• Sherly Paola Alfonso Sanchez, Universidad Nacional de Colombia
• Ranee Thiagarajah, Illinois State University
• Ping Wang, Saint Johns University
• Chengguo Weng, University of Waterloo
• Toby White, Drake University
• Michelle Xia, Northern Illinois University
• Di (Cindy) Xu, University of Nebraska - Lincoln
• Lina Xu, Columbia University
• Lu Yang, University of Amsterdam
• Jorge Yslas, University of Copenhagen
• Jeffrey Zheng, Temple University
• Hongjuan Zhou, Arizona State University

Other Collaborators
• Alyaa Nuval Binti Othman, Aisha Nuval Binti Othman, and Khairina
(Rina) Binti Ibraham were three of many students at the University of Wisconsin-Madison who helped with the text over the years.
• Maggie Lee, Macquarie University, and Anh Vu (then at University of
New South Wales) contributed the end of the section quizzes.
• Jeffrey Zheng, Temple University, Lu Yang (University of Amsterdam),
and Paul Johnson, University of Wisconsin-Madison, led the work on the
glossary.

Version
• This is Version 1.1, August 2020. Edited by Edward (Jed) Frees and
Paul Johnson.
• Version 1.0, January 2020, was edited by Edward (Jed) Frees.

You can also access pdf and epub (current and older) versions of the text in our
Offline versions of the text.

For our Readers


We hope that you find this book worthwhile and even enjoyable. For your
convenience, at our Github Landing site (https://openacttexts.github.io/), you
will find links to the book that you can (freely) download for offline reading,
including a pdf version (for Adobe Acrobat) and an EPUB version suitable for
mobile devices. Data for running our examples are available at the same site.
In developing this book, we are emphasizing the online version, which has lots of great features such as a glossary, and code and solutions to examples that can be revealed interactively. For example, in the online version the statistical code is hidden and can only be seen by clicking on designated terms.
We hide the code because we don’t want to insist that you use the R statistical
software (although we like it). Still, we encourage you to try some statistical
code as you read the book – we have opted to make it easy to learn R as you
go. We have set up a separate R Code for Loss Data Analytics site to explain
more of the details of the code.
Like any book, we have a set of notations and conventions. It will probably save
you time if you regularly visit our Appendix Chapter 19 to get used to ours.
Freely available, interactive textbooks represent a new venture in actuarial ed-
ucation and we need your input. Although a lot of effort has gone into the
development, we expect hiccoughs. Please let your instructor know about oppor-
tunities for improvement, write us through our project site, or contact chapter
contributors directly with suggested improvements.
Chapter 1

Introduction to Loss Data Analytics

Chapter Preview. This book introduces readers to methods of analyzing insurance data. Section 1.1 begins with a discussion of why the use of data is
important in the insurance industry. Section 1.2 gives a general overview of
the purposes of analyzing insurance data which is reinforced in the Section 1.3
case study. Naturally, there is a huge gap between the broad goals summarized
in the overview and a case study application; this gap is covered through the
methods and techniques of data analysis covered in the rest of the text.

1.1 Relevance of Analytics to Insurance Activities

In this section, you learn how to:


• Summarize the importance of insurance to consumers and the economy
• Describe analytics
• Identify data generating events associated with the timeline of a typical
insurance contract

1.1.1 Nature and Relevance of Insurance


This book introduces the process of using data to make decisions in an insur-
ance context. It does not assume that readers are familiar with insurance but
introduces insurance concepts as needed. If you are new to insurance, then it is
probably easiest to think about an insurance policy that covers the contents of
an apartment or house that you are renting (known as renters insurance) or the
contents and property of a building that is owned by you or a friend (known as
homeowners insurance). Another common example is automobile insurance. In
the event of an accident, this policy may cover damage to your vehicle, damage
to other vehicles in the accident, as well as medical expenses of those injured in
the accident.
One way to think about the nature of insurance is who buys it. Renters, home-
owners, and auto insurance are examples of personal insurance in that these
are policies issued to people. Businesses also buy insurance, such as coverage
on their properties, and this is known as commercial insurance. The seller, an
insurance company, is also known as an insurer. Even insurance companies need
insurance; this is known as reinsurance.
Another way to think about the nature of insurance is the type of risk being
covered. In the U.S., policies such as renters and homeowners are known as
property insurance whereas a policy such as auto that covers medical damages
to people is known as casualty insurance. In the rest of the world, these are both
known as non-life or general insurance, to distinguish them from life insurance.
Both life and non-life insurances are important components of the world econ-
omy. The Insurance Information Institute (2016) estimates that direct insurance
premiums in the world for 2014 were 2,654,549 for life and 2,123,699 for non-life;
these figures are in millions of U.S. dollars. The total represents 6.2% of the
world gross domestic product (GDP). Put another way, life accounts for 55.5%
of insurance premiums and 3.4% of world GDP whereas non-life accounts for
44.5% of insurance premiums and 2.8% of world GDP. Both life and non-life
represent important economic activities.
Insurance may not be as entertaining as the sports industry (another industry
that depends heavily on data) but it does affect the financial livelihoods of
many. By almost any measure, insurance is a major economic activity. As
noted earlier, on a global level, insurance premiums comprised about 6.2% of
the world GDP in 2014 (Insurance Information Institute, 2016). As examples,
premiums accounted for 18.9% of GDP in Taiwan (the highest in the study)
and represented 7.3% of GDP in the United States. On a personal level, almost
everyone owning a home has insurance to protect themselves in the event of a
fire, hailstorm, or some other calamitous event. Almost every country requires
insurance for those driving a car. In sum, although not particularly entertaining,
insurance plays an important role in the economies of nations and the lives of
individuals.

1.1.2 What is Analytics?


Insurance is a data-driven industry. Like all major corporations and organiza-
tions, insurers use data when trying to decide how much to pay employees, how
many employees to retain, how to market their services and products, how to
forecast financial trends, and so on. These represent general areas of activities
that are not specific to the insurance industry. Although each industry has its
own data nuances and needs, the collection, analysis and use of data is an ac-
tivity shared by all, from the internet giants to a small business, by public and
governmental organizations, and is not specific to the insurance industry. You
will find that the data collection and analysis methods and tools introduced in
this text are relevant for all.
In any data-driven industry, analytics is a key to deriving and extracting in-
formation from data. But what is analytics? Making data-driven business
decisions has been described as business analytics, business intelligence, and
data science. These terms, among others, are sometimes used interchangeably
and sometimes refer to distinct applications. Business intelligence may focus
on processes of collecting data, often through databases and data warehouses,
whereas business analytics utilizes tools and methods for statistical analyses of
data. In contrast to these two terms that emphasize business applications, the
term data science can encompass broader data related applications in many sci-
entific domains. For our purposes, we use the term analytics to refer to the
process of using data to make decisions. This process involves gathering data,
understanding concepts and models of uncertainty, making general inferences,
and communicating results.
When introducing data methods in this text, we focus on losses that arise from,
or are related to, obligations in insurance contracts. This could be the amount of
damage to one’s apartment under a renter’s insurance agreement, the amount
needed to compensate someone that you hurt in a driving accident, and the
like. We call this type of obligation an insurance claim. With this focus, we
are able to introduce and directly use generally applicable statistical tools and
techniques.

1.1.3 Insurance Processes


Yet another way to think about the nature of insurance is by the duration of
an insurance contract, known as the term. This text will focus on short-term
insurance contracts. By short-term, we mean contracts where the insurance
coverage is typically provided for a year or six months. Most commercial and
personal contracts are for a year so that is our default duration. An important
exception is U.S. auto policies that are often six months in length.
In contrast, we typically think of life insurance as a long-term contract where
the default is to have a multi-year contract. For example, if a person 25 years
old purchases a whole life policy that pays upon death of the insured and that
person does not die until age 100, then the contract is in force for 75 years.
There are other important differences between life and non-life products. In
life insurance, the benefit amount is often stipulated in the contract provisions.
In contrast, most non-life contracts provide for compensation of insured losses
which are unknown before the accident. (There are usually limits placed on the
compensation amounts.) In a life insurance contract that stretches over many
years, the time value of money plays a prominent role. In a non-life contract,
the random amount of compensation takes priority.
In both life and non-life insurances, the frequency of claims is very important.
For many life insurance contracts, the insured event (such as death) happens
only once. In contrast, for non-life insurances such as automobile, it is common
for individuals (especially young male drivers) to get into more than one accident
during a year. So, our models need to reflect this observation; we introduce
different frequency models that you may also see when studying life insurance.
For short-term insurance, the framework of the probabilistic model is straight-
forward. We think of a one-period model (the period length, e.g., one year, will
be specified in the situation).
• At the beginning of the period, the insured pays the insurer a known
premium that is agreed upon by both parties to the contract.
• At the end of the period, the insurer reimburses the insured for a (possibly
multivariate) random loss.
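To make this one-period framework concrete, the following R sketch simulates a small portfolio of one-year contracts; the premium level and the Poisson/gamma distributional choices are illustrative assumptions, not prescribed by the text (frequency and severity models are developed in Chapters 2 and 3).

# A minimal sketch of the one-period model: a known premium is paid at the
# start of the period and a random aggregate loss is reimbursed at the end.
set.seed(2020)
n_policies <- 1000            # portfolio of one-year contracts
premium    <- 150             # known premium agreed at contract initiation
n_claims   <- rpois(n_policies, lambda = 0.2)                 # illustrative claim frequency
claim_sizes <- function(k) rgamma(k, shape = 2, scale = 300)  # illustrative claim severity
losses <- sapply(n_claims, function(k) sum(claim_sizes(k)))   # aggregate loss per policy
summary(losses)
mean(premium - losses)        # average margin of premium over losses per policy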
This framework will be developed as we proceed; but we first focus on inte-
grating this framework with concerns about how the data may arise. From an
insurer’s viewpoint, contracts may be only for a year but they tend to be re-
newed. Moreover, payments arising from claims during the year may extend
well beyond a single year. One way to describe the data arising from operations
of an insurance company is to use a granular, timeline-based approach. A process
approach provides an overall view of the events occurring during the life of an
insurance contract, and their nature – random or planned, loss events (claims)
and contract change events, and so forth. In this micro-oriented view, we can
think about what happens to a contract at various stages of its existence.
Figure 1.1 traces a timeline of a typical insurance contract. Throughout the
life of the contract, the company regularly processes events such as premium
collection and valuation, described in Section 1.2; these are marked with an x
on the timeline. Non-regular and unanticipated events also occur. To illustrate,
t2 and t4 mark the event of an insurance claim (some contracts, such as life
insurance, can have only a single claim). Times t3 and t5 mark events when
a policyholder wishes to alter certain contract features, such as the choice of
a deductible or the amount of coverage. From a company perspective, one
can even think about the contract initiation (arrival, time t1 ) and contract
termination (departure, time t6 ) as uncertain events. (Alternatively, for some
purposes, you may condition on these events and treat them as certain.)

Figure 1.1: Timeline of a Typical Insurance Policy. Arrows mark the occurrences of random events. Each x marks the time of scheduled events that are typically non-random. (The timeline shows the policy entering with the premium paid at t1, claims occurring at t2 and t4, contract alterations such as increasing the deductible or coverage at t3 and t5, and policy nonrenewal at t6, with scheduled valuation dates and renewal premium payments in between.)

1.2 Insurance Company Operations

In this section, you learn how to:

• Describe five major operational areas of insurance companies.


• Identify the role of data and analytics opportunities within each opera-
tional area.

Armed with insurance data, the end goal is to use data to make decisions. We
will learn more about methods of analyzing and extrapolating data in future
chapters. To begin, let us think about why we want to do the analysis. We
take the insurance company’s viewpoint (not the insured person) and introduce
ways of bringing money in, paying it out, managing costs, and making sure
that we have enough money to meet obligations. The emphasis is on insurance-
specific operations rather than on general business activities such as advertising,
marketing, and human resources management.
Specifically, in many insurance companies, it is customary to aggregate detailed
insurance processes into larger operational units; many companies use these
functional areas to segregate employee activities and areas of responsibilities.
Actuaries, other financial analysts, and insurance regulators work within these
units and use data for the following activities:
1. Initiating Insurance. At this stage, the company makes a decision as
to whether or not to take on a risk (the underwriting stage) and assign
an appropriate premium (or rate). Insurance analytics has its actuarial
roots in ratemaking, where analysts seek to determine the right price for
the right risk.
2. Renewing Insurance. Many contracts, particularly in general insurance,
have relatively short durations such as 6 months or a year. Although
there is an implicit expectation that such contracts will be renewed, the
insurer has the opportunity to decline coverage and to adjust the premium.
Analytics is also used at this policy renewal stage where the goal is to retain
profitable customers.
3. Claims Management. Analytics has long been used in (1) detecting
and preventing claims fraud, (2) managing claim costs, including identi-
fying the appropriate support for claims handling expenses, as well as (3)
understanding excess layers for reinsurance and retention.
4. Loss Reserving. Analytic tools are used to provide management with an
appropriate estimate of future obligations and to quantify the uncertainty
of those estimates.
5. Solvency and Capital Allocation. Deciding on the requisite amount
of capital and on ways of allocating capital among alternative investments
are also important analytics activities. Companies must understand how
much capital is needed so that they have sufficient flow of cash available
to meet their obligations at the times they are expected to materialize
(solvency). This is an important question that concerns not only company
managers but also customers, company shareholders, regulatory authori-
ties, as well as the public at large. Related to issues of how much capital
is the question of how to allocate capital to differing financial projects,
typically to maximize an investor's return. Although this question can
arise at several levels, insurance companies are typically concerned with
how to allocate capital to different lines of business within a firm and to
different subsidiaries of a parent firm.
Although data represent a critical component of solvency and capital alloca-
tion, other components including the local and global economic framework, the
financial investments environment, and quite specific requirements according
to the regulatory environment of the day, are also important. Because of the
background needed to address these components, we do not address solvency,
capital allocation, and regulation issues in this text.
Nonetheless, for all operating functions, we emphasize that analytics in the
insurance industry is not an exercise that a small group of analysts can do
by themselves. It requires an insurer to make significant investments in their
information technology, marketing, underwriting, and actuarial functions. As
these areas represent the primary end goals of the analysis of data, additional
background on each operational unit is provided in the following subsections.

1.2.1 Initiating Insurance


Setting the price of an insurance product can be a perplexing problem. This
is in contrast to other industries such as manufacturing where the cost of a
product is (relatively) known and provides a benchmark for assessing a market
demand price. Similarly, in other areas of financial services, market prices are
available and provide the basis for a market-consistent pricing structure of prod-
ucts. However, for many lines of insurance, the cost of a product is uncertain
and market prices are unavailable. The expected value of the random cost is a reason-
able place to start for a price. (If you have studied finance, then you will recall
that an expectation is the optimal price for a risk-neutral insurer.) It has been
traditional in insurance pricing to begin with the expected cost. Insurers then
add margins to this, to account for the product’s riskiness, expenses incurred in
servicing the product, and an allowance for profit/surplus of the company.
Use of expected costs as a foundation for pricing is prevalent in some lines of the
insurance business. These include automobile and homeowners insurance. For
these lines, analytics has served to sharpen the market by making the calculation
of the product’s expected cost more precise. The increasing availability of the
internet to consumers has also promoted transparency in pricing; in today’s
marketplace, consumers have ready access to competing quotes from a host
of insurers. Insurers seek to increase their market share by refining their risk
classification systems, thus achieving a better approximation of the products’
prices and enabling cream-skimming underwriting strategies (“cream-skimming”
is a phrase used when the insurer underwrites only the best risks). Surveys (e.g.,
Earnix (2013)) indicate that pricing is the most common use of analytics among
insurers.
Underwriting, the process of classifying risks into homogeneous categories and
assigning policyholders to these categories, lies at the core of ratemaking. Poli-
cyholders within a class (category) have similar risk profiles and so are charged
the same insurance price. This is the concept of an actuarially fair premium; it
is fair to charge different rates to policyholders only if they can be separated by
identifiable risk factors. An early article, Two Studies in Automobile Insurance
Ratemaking (Bailey and LeRoy, 1960), provided a catalyst to the acceptance of
analytic methods in the insurance industry. This paper addresses the problem
of classification ratemaking. It describes an example of automobile insurance
that has five use classes cross-classified with four merit rating classes. At that
time, the contribution to premiums for use and merit rating classes were deter-
mined independently of each other. Thinking about the interacting effects of
different classification variables is a more difficult problem.
When the risk is initially obtained, the insurer’s obligations can be managed
by imposing contract parameters that modify contract payouts. Chapter 3
describes common modifications including coinsurance, deductibles and policy
upper limits.

1.2.2 Renewing Insurance


Insurance is a type of financial service and, like many service contracts, insurance
coverage is often agreed upon for a limited time period at which time coverage
commitments are complete. Particularly for general insurance, the need for
coverage continues and so efforts are made to issue a new contract providing
similar coverage when the existing contract comes to the end of its term. This
is called policy renewal. Renewal issues can also arise in life insurance, e.g.,
term (temporary) life insurance. At the same time other contracts, such as life
annuities, terminate upon the insured’s death and so issues of renewability are
irrelevant.
In the absence of legal restrictions, at renewal the insurer has the opportunity
to:
• accept or decline to underwrite the risk; and
• determine a new premium, possibly in conjunction with a new classifica-
tion of the risk.
Risk classification and rating at renewal is based on two types of information.
First, at the initial stage, the insurer has available many rating variables upon
which decisions can be made. Many variables are not likely to change, e.g.,
sex, whereas others are likely to change, e.g., age, and still others may or may
not change, e.g., credit score. Second, unlike the initial stage, at renewal the
insurer has available a history of policyholder’s loss experience, and this history
can provide insights into the policyholder that are not available from rating
variables. Modifying premiums with claims history is known as experience rating,
also sometimes referred to as merit rating.
Experience rating methods are either applied retrospectively or prospectively.
With retrospective methods, a refund of a portion of the premium is provided to
the policyholder in the event of favorable (to the insurer) experience. Retrospec-
tive premiums are common in life insurance arrangements (where policyholders
earn dividends in the U.S., bonuses in the U.K., and profit sharing in Israeli
term life coverage). In general insurance, prospective methods are more com-
mon, where favorable insured experience is rewarded through a lower renewal
premium.

Claims history can provide information about a policyholder’s risk appetite. For
example, in personal lines it is common to use a variable to indicate whether
or not a claim has occurred in the last three years. As another example, in
a commercial line such as worker’s compensation, one may look to a policy-
holder’s average claim frequency or severity over the last three years. Claims
history can reveal information that is otherwise hidden (to the insurer) about
the policyholder.

1.2.3 Claims and Product Management


In some types of insurance, the process of paying claims for insured events
is relatively straightforward. For example, in life insurance, a simple death
certificate is all that is needed to pay the benefit amount as provided in the
contract. However, in non-life areas such as property and casualty insurance,
the process can be much more complex. Think about a relatively simple insured
event such as an automobile accident. Here, it is often required to determine
which party is at fault and then one needs to assess damage to all of the vehicles
and people involved in the incident, both insured and non-insured. Further, the
expenses incurred in assessing the damages must be assessed, and so forth. The
process of determining coverage, legal liability, and settling claims is known as
claims adjustment.

Insurance managers sometimes use the phrase claims leakage to mean dollars
lost through claims management inefficiencies. There are many ways in which
analytics can help manage the claims process, c.f., Gorman and Swenson (2013).
Historically, the most important has been fraud detection. The claim adjusting
process involves reducing information asymmetry (the claimant knows what
happened; the company knows some of what happened). Mitigating fraud is an
important part of the claims management process.

Fraud detection is only one aspect of managing claims. More broadly, one can
think about claims management as consisting of the following components:

• Claims triaging. Just as in the medical world, early identification and
appropriate handling of high cost claims (patients, in the medical world),
can lead to dramatic savings. For example, in workers compensation,
insurers look to achieve early identification of those claims that run the
risk of high medical costs and a long payout period. Early intervention
into these cases could give insurers more control over the handling of the
claim, the medical treatment, and the overall costs with an earlier return-
to-work.
• Claims processing. The goal is to use analytics to identify routine situa-
tions that are anticipated to have small payouts. More complex situations
may require more experienced adjusters and legal assistance to appropri-
ately handle claims with high potential payouts.
• Adjustment decisions. Once a complex claim has been identified and
assigned to an adjuster, analytic driven routines can be established to aid
subsequent decision-making processes. Such processes can also be helpful
for adjusters in developing case reserves, an estimate of the insurer’s future
liability. This is an important input to the insurer’s loss reserves, described
in Section 1.2.4.
In addition to the insured’s reimbursement for losses, the insurer also needs to be
concerned with another source of revenue outflow, expenses. Loss adjustment
expenses are part of an insurer’s cost of managing claims. Analytics can be
used to reduce expenses directly related to claims handling (allocated) as well
as general staff time for overseeing the claims processes (unallocated). The
insurance industry has high operating costs relative to other portions of the
financial services sectors.
In addition to claims payments, there are many other ways in which insurers
use data to manage their products. We have already discussed the need for
analytics in underwriting, that is, risk classification at the initial acquisition
and renewal stages. Insurers are also interested in which policyholders elect to
renew their contracts and, as with other products, monitor customer loyalty.
Analytics can also be used to manage the portfolio, or collection, of risks that an
insurer has acquired. As described in Chapter 10, after the contract has been
agreed upon with an insured, the insurer may still modify its net obligation
by entering into a reinsurance agreement. This type of agreement is with a
reinsurer, an insurer of an insurer. It is common for insurance companies to
purchase insurance on their portfolios of risks to gain protection from unusual
events, just as people and other companies do.

1.2.4 Loss Reserving


An important feature that distinguishes insurance from other sectors of the
economy is the timing of the exchange of considerations. In manufacturing,
payments for goods are typically made at the time of a transaction. In contrast,
for insurance, money received from a customer occurs in advance of benefits or
services; these are rendered at a later date if the insured event occurs. This leads
to the need to hold a reservoir of wealth to meet future obligations in respect
to obligations made, and to gain the trust of the insureds that the company
will be able to fulfill its commitments. The size of this reservoir of wealth, and
the importance of ensuring its adequacy, is a major concern for the insurance
industry.
Setting aside money for unpaid claims is known as loss reserving; in some juris-
dictions, reserves are also known as technical provisions. We saw in Figure 1.1
several times at which a company summarizes its financial position; these times
are known as valuation dates. Claims that arise prior to valuation dates have
either been paid, are in the process of being paid, or are about to be paid; claims
in the future of these valuation dates are unknown. A company must estimate
these outstanding liabilities when determining its financial strength. Accurately
determining loss reserves is important to insurers for many reasons.
1. Loss reserves represent an anticipated claim that the insurer owes its cus-
tomers. Under-reserving may result in a failure to meet claim liabilities.
Conversely, an insurer with excessive reserves may present a conservative
estimate of surplus and thus portray a weaker financial position than it
truly has.
2. Reserves provide an estimate for the unpaid cost of insurance that can be
used for pricing contracts.
3. Loss reserving is required by laws and regulations. The public has a strong
interest in the financial strength and solvency of insurers.
4. In addition to regulators, other stakeholders such as insurance company
management, investors, and customers make decisions that depend on
company loss reserves. Whereas regulators and customers appreciate con-
servative estimates of unpaid claims, managers and investors seek more
unbiased estimates to represent the true financial health of the company.
Loss reserving is a topic where there are substantive differences between life
and general (also known as property and casualty, or non-life) insurance. In life
insurance, the severity (amount of loss) is often not a source of uncertainty as
payouts are specified in the contract. The frequency, driven by mortality of the
insured, is a concern. However, because of the lengthy time for settlement of
life insurance contracts, the time value of money uncertainty as measured from
issue to date of payment can dominate frequency concerns. For example, for
an insured who purchases a life contract at age 20, it would not be unusual for
the contract to still be open in 60 years time, when the insured celebrates his
or her 80th birthday. See, for example, Bowers et al. (1986) or Dickson et al.
(2013) for introductions to reserving for life insurance. In contrast, for most
lines of non-life business, severity is a major source of uncertainty and contract
durations tend to be shorter.

1.3 Case Study: Wisconsin Property Fund

In this section, we use the Wisconsin Property Fund as a case study. You learn
how to:
• Describe how data generating events can produce data of interest to in-
surance analysts.
• Produce relevant summary statistics for each variable.
• Describe how these summary statistics can be used in each of the major
operational areas of an insurance company.

Let us illustrate the kind of data under consideration and the goals that we
wish to achieve by examining the Local Government Property Insurance Fund
(LGPIF), an insurance pool administered by the Wisconsin Office of the Insur-
ance Commissioner. The LGPIF was established to provide property insurance
for local government entities that include counties, cities, towns, villages, school
districts, and library boards. The fund insures local government property such
as government buildings, schools, libraries, and motor vehicles. It covers all
property losses except those resulting from flood, earthquake, wear and tear,
extremes in temperature, mold, war, nuclear reactions, and embezzlement or
theft by an employee.
The fund covers over a thousand local government entities who pay approxi-
mately 25 million dollars in premiums each year and receive insurance coverage
of about 75 billion. State government buildings are not covered; the LGPIF is
for local government entities that have separate budgetary responsibilities and
who need insurance to moderate the budget effects of uncertain insurable events.
Coverage for local government property has been made available by the State
of Wisconsin since 1911, thus providing a wealth of historical data.
In this illustration, we restrict consideration to claims from coverage of building
and contents; we do not consider claims from motor vehicles and specialized
equipment owned by local entities (such as snow plowing machines). We also
consider only claims that are closed, with obligations fully met.

1.3.1 Fund Claims Variables: Frequency and Severity


At a fundamental level, insurance companies accept premiums in exchange for
promises to compensate a policyholder upon the occurrence of an insured event.
Indemnification is the compensation provided by the insurer for incurred hurt,
loss, or damage that is covered by the policy. This compensation is also known
as a claim. The extent of the payout, known as the severity, is a key financial
expenditure for an insurer.
In terms of money outgo, an insurer is indifferent to having ten claims of 100
when compared to one claim of 1,000. Nonetheless, it is common for insurers to
study how often claims arise, known as the frequency of claims. The frequency
is important for expenses, but it also influences contractual parameters (such as
deductibles and policy limits that are described later) that are written on a per
occurrence basis. Frequency is routinely monitored by insurance regulators and
can be a key driver in the overall indemnification obligation of the insurer. We
shall consider the frequency and severity as the two main claim variables that
we wish to understand, model, and manage.
To illustrate, in 2010 there were 1,110 policyholders in the property fund who
experienced a total of 1,377 claims. Table 1.1 shows the distribution. Almost
two-thirds (0.637) of the policyholders did not have any claims and an additional
18.8% had only one claim. The remaining 17.5% (=1 - 0.637 - 0.188) had more
than one claim; the policyholder with the highest number recorded 239 claims.
The average number of claims for this sample was 1.24 (=1377/1110).
Table 1.1. 2010 Claims Frequency Distribution

Number         0      1      2      3      4      5      6      7      8  9 or more     Sum
Policies     707    209     86     40     18     12      9      4      6         19   1,110
Claims         0    209    172    120     72     60     54     28     48        617   1,377
Proportion 0.637  0.188  0.077  0.036  0.016  0.011  0.008  0.004  0.005      0.017   1.000
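
The following R snippet is a minimal sketch that reproduces the proportions and the average claim frequency reported above from the tabulated policy counts.

# Policy counts from Table 1.1, for 0, 1, ..., 8, and 9-or-more claims
n_policies <- c(707, 209, 86, 40, 18, 12, 9, 4, 6, 19)
sum(n_policies)                          # 1,110 policyholders
round(n_policies / sum(n_policies), 3)   # proportions 0.637, 0.188, 0.077, ...
1377 / sum(n_policies)                   # average claim frequency, about 1.24
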

For the severity distribution, a common approach is to examine the distribution
of the sample of 1,377 claims. However, another common approach is to examine
the distribution of the average claims of those policyholders with claims. In
our 2010 sample, there were 403 (=1110-707) such policyholders. For 209 of
these policyholders with one claim, the average claim equals the only claim they
experienced. For the policyholder with highest frequency, the average claim
is an average over 239 separately reported claim events. This average is also
known as the pure premium or loss cost.
Table 1.2 summarizes the sample distribution of average severities from the 403
policyholders who made a claim; it shows that the average claim amount was
56,330 (all amounts are in U.S. Dollars). However, the average gives only a
limited look at the distribution. More information can be gleaned from the
summary statistics which show a very large claim in the amount of 12,920,000.
Figure 1.2 provides further information about the distribution of sample claims,
showing a distribution that is dominated by this single large claim so that the
histogram is not very helpful. Even when removing the large claim, you will
find a distribution that is skewed to the right. A generally accepted technique
is to work with claims in logarithmic units especially for graphical purposes; the
corresponding figure in the right-hand panel is much easier to interpret.
Table 1.2. 2010 Average Severity Distribution

Minimum   First Quartile   Median     Mean   Third Quartile      Maximum
    167            2,226    4,951   56,330           11,900   12,920,000
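
As a minimal sketch of this graphical comparison, the R code below draws the two histograms, assuming a numeric vector avg_severity (a hypothetical object name) that holds the 403 positive average severities.

# Histograms of average severities, on the original and the logarithmic scale
par(mfrow = c(1, 2))
hist(avg_severity, main = "", xlab = "Average Claims")
hist(log(avg_severity), main = "", xlab = "Logarithmic Average Claims")
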
(Two histogram panels: the left shows the distribution of average claims in dollars, the right shows average claims in logarithmic units; the vertical axes give frequency.)
Figure 1.2: Distribution of Positive Average Severities

1.3.2 Fund Rating Variables


Developing models to represent and manage the two outcome variables, fre-
quency and severity, is the focus of the early chapters of this text. However,
when actuaries and other financial analysts use those models, they do so in the
context of external variables. In general statistical terminology, one might call
these explanatory or predictor variables; there are many other names in statis-
tics, economics, psychology, and other disciplines. Because of our insurance
focus, we call them rating variables as they are useful in setting insurance rates
and premiums.
We earlier considered observations from a sample of 1,110 policyholders which
may seem like a lot. However, as we will see in our forthcoming applications,
because of the preponderance of zeros and the skewed nature of claims, actuaries
typically yearn for more data. One common approach that we adopt here is to
examine outcomes from multiple years, thus increasing the sample size. We will
discuss the strengths and limitations of this strategy later but, at this juncture,
we just wish to show the reader how it works.
Specifically, Table 1.3 shows that we now consider policies over five years of data,
2006, …, 2010, inclusive. The data begins in 2006 because there was a shift in
claim coding in 2005 so that comparisons with earlier years are not helpful. To
mitigate the effect of open claims, we consider policy years prior to 2011. An
open claim means that not all of the obligations for the claim are known at the
time of the analysis; for some claims, such an injury to a person in an auto
accident or in the workplace, it can take years before costs are fully known.
Table 1.3. Claims Summary by Policyholder

          Average    Average        Average       Number of
Year    Frequency   Severity       Coverage   Policyholders
2006        0.951      9,695     32,498,186           1,154
2007        1.167      6,544     35,275,949           1,138
2008        0.974      5,311     37,267,485           1,125
2009        1.219      4,572     40,355,382           1,112
2010        1.241     20,452     41,242,070           1,110

Table 1.3 shows that the average claim varies over time, especially with the high
2010 value (that we saw was due to a single large claim)1 . The total number
of policyholders is steadily declining and, conversely, the coverage is steadily
increasing. The coverage variable is the amount of coverage of the property
and contents. Roughly, you can think of it as the maximum possible payout
of the insurer. For our immediate purposes, the coverage is our first rating
variable. Other things being equal, we would expect that policyholders with
larger coverage have larger claims. We will make this vague idea much more
precise as we proceed, and also justify this expectation with data.
For a different look at the 2006-2010 data, Table 1.4 summarizes the distribution
of our two outcomes, frequency and claims amount. In each case, the average
exceeds the median, suggesting that the two distributions are right-skewed. In
addition, the table summarizes our continuous rating variables, coverage and
deductible amount. The table also suggests that these variables also have right-
skewed distributions.
Table 1.4. Summary of Claim Frequency and Severity, Deductibles,
and Coverages

                     Minimum    Median   Average       Maximum
Claim Frequency            0         0     1.109           263
Claim Severity             0         0     9,292    12,922,218
Deductible               500     1,000     3,365       100,000
Coverage (000’s)       8.937    11,354    37,281     2,444,797

Table 1.5 describes the rating variables considered in this chapter. Hopefully,
these are variables that you think might naturally be related to claims outcomes.
You can learn more about them in Frees et al. (2016). To handle the skewness,
we henceforth focus on logarithmic transformations of coverage and deductibles.
1 Note that the average severity in Table 1.3 differs from that reported in Table 1.2. This is
because the former includes policyholders with zero claims whereas the latter does not. This
is an important distinction that we will address in later portions of the text.
Table 1.5. Description of Rating Variables

Variable          Description
EntityType Categorical variable that is one of six types: (Village, City,
County, Misc, School, or Town)
LnCoverage Total building and content coverage, in logarithmic millions of dollars
LnDeduct Deductible, in logarithmic dollars
AlarmCredit Categorical variable that is one of four types: (0, 5, 10, or 15)
for automatic smoke alarms in main rooms
NoClaimCredit Binary variable to indicate no claims in the past two years
Fire5 Binary variable to indicate the fire class is below 5
(The range of fire class is 0 to 10)

To get a sense of the relationship between the non-continuous rating variables
and claims, Table 1.6 relates the claims outcomes to these categorical variables.
Table 1.6 suggests substantial variation in the claim frequency and average
severity of the claims by entity type. It also demonstrates higher frequency and
severity for the Fire5 variable and the reverse for the NoClaimCredit variable.
The relationship for the Fire5 variable is counter-intuitive in that one would
expect lower claim amounts for those policyholders in areas with better public
protection (when the protection code is five or less). Naturally, there are other
variables that influence this relationship. We will see that these background
variables are accounted for in the subsequent multivariate regression analysis,
which yields an intuitive, appealing (negative) sign for the Fire5 variable.
Table 1.6. Claims Summary by Entity Type, Fire Class, and No Claim
Credit

                     Number of        Claim      Average
Variable              Policies    Frequency     Severity
EntityType
Village 1,341 0.452 10,645
City 793 1.941 16,924
County 328 4.899 15,453
Misc 609 0.186 43,036
School 1,597 1.434 64,346
Town 971 0.103 19,831
Fire
Fire5=0 2,508 0.502 13,935
Fire5=1 3,131 1.596 41,421
No Claims Credit
NoClaimCredit=0 3,786 1.501 31,365
NoClaimCredit=1 1,853 0.310 30,499
Total 5,639 1.109 31,206
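
A summary of this form might be computed in R along the following lines; this is only a sketch, assuming a hypothetical data frame fund with one row per policyholder-year and columns EntityType, Freq (claim count), and ClaimTotal (total claim amount), and taking average severity to be total claim dollars divided by total claim count.

# Group-level summaries by entity type (hypothetical column names)
num_policies <- tapply(fund$Freq, fund$EntityType, length)
claim_freq   <- tapply(fund$Freq, fund$EntityType, mean)
avg_sev      <- tapply(fund$ClaimTotal, fund$EntityType, sum) /
                tapply(fund$Freq, fund$EntityType, sum)
cbind(num_policies, claim_freq, avg_sev)
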
Table 1.7 shows the claims experience by alarm credit. It underscores the dif-
ficulty of examining variables individually. For example, when looking at the
experience for all entities, we see that policyholders with no alarm credit have
on average lower frequency and severity than policyholders with the highest
(15%, with 24/7 monitoring by a fire station or security company) alarm credit.
In particular, when we look at the entity type School, the frequency is 0.422
and the severity 25,523 for no alarm credit, whereas for the highest alarm level
it is 2.008 and 85,140, respectively. This may simply imply that entities with
more claims are the ones that are likely to have an alarm system. Summary
tables do not examine multivariate effects; for example, Table 1.6 ignores the
effect of size (as measured through coverage amounts) on claims.
Table 1.7. Claims Summary by Entity Type and Alarm Credit (AC)
Category

                     AC0        AC0        AC0         AC5        AC5        AC5
Entity             Claim       Avg.       Num.       Claim       Avg.       Num.
Type           Frequency   Severity   Policies   Frequency   Severity   Policies
Village 0.326 11,078 829 0.278 8,086 54
City 0.893 7,576 244 2.077 4,150 13
County 2.140 16,013 50 - - 1
Misc 0.117 15,122 386 0.278 13,064 18
School 0.422 25,523 294 0.410 14,575 122
Town 0.083 25,257 808 0.194 3,937 31
Total 0.318 15,118 2,611 0.431 10,762 239

                    AC10       AC10       AC10        AC15       AC15       AC15
Entity             Claim       Avg.       Num.       Claim       Avg.       Num.
Type           Frequency   Severity   Policies   Frequency   Severity   Policies
Village 0.500 8,792 50 0.725 10,544 408
City 1.258 8,625 31 2.485 20,470 505
County 2.125 11,688 8 5.513 15,476 269
Misc 0.077 3,923 26 0.341 87,021 179
School 0.488 11,597 168 2.008 85,140 1,013
Town 0.091 2,338 44 0.261 9,490 88
Total 0.517 10,194 327 2.093 41,458 2,462

1.3.3 Fund Operations


We have now seen distributions of the Fund’s two outcome variables: a count
variable for the number of claims, and a continuous variable for the claims
amount. We have also introduced a continuous rating variable (coverage); a dis-
crete quantitative variable (logarithmic deductibles); two binary rating variables
(no claims credit and fire class); and two categorical rating variables (entity type
and alarm credit). Subsequent chapters will explain how to analyze and model
the distribution of these variables and their relationships. Before getting into
these technical details, let us first think about where we want to go. General in-
surance company functional areas are described in Section 1.2; we now consider
how these areas might apply in the context of the property fund.

Initiating Insurance
Because this is a government sponsored fund, we do not have to worry about
selecting good or avoiding poor risks; the fund is not allowed to deny a cover-
age application from a qualified local government entity. If we do not have to
underwrite, what about how much to charge?

We might look at the most recent experience in 2010, where the total fund claims
were approximately 28.16 million USD (= 1377 claims × 20,452 average severity).
Dividing that among 1,110 policyholders suggests a rate of about 25,370
(≈ 28,160,000/1110). However, 2010 was a bad year; using the same method, our
premium would be much lower based on 2009 data. This swing in premiums
would defeat the primary purpose of the fund, to allow for a steady charge that
local property managers could utilize in their budgets.
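
The back-of-the-envelope 2010 calculation can be reproduced directly in R:

# 2010 claims and an (overly simple) single rate per policyholder
total_2010 <- 1377 * 20452      # approximately 28.16 million in claims
total_2010 / 1110               # about 25,370 per policyholder
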

Having a single price for all policyholders is nice but hardly seems fair. For
example, Table 1.6 suggests that schools have higher aggregate claims than
other entities and so should pay more. However, simply doing the calculation
on an entity by entity basis is not right either. For example, we saw in Table
1.7 that had we used this strategy, entities with a 15% alarm credit (for good
behavior, having top alarm systems) would actually wind up paying more.

So, we have the data for thinking about the appropriate rates to charge but need
to dig deeper into the analysis. We will explore this topic further in Chapter 7
on premium calculation fundamentals. Selecting appropriate risks is introduced
in Chapter 8 on risk classification.

Renewing Insurance
Although property insurance is typically a one-year contract, Table 1.3 suggests
that policyholders tend to renew; this is typical of general insurance. For re-
newing policyholders, in addition to their rating variables we have their claims
history and this claims history can be a good predictor of future claims. For
example, Table 1.6 shows that policyholders without a claim in the last two
years had much lower claim frequencies than those with at least one accident
(0.310 compared to 1.501); a lower predicted frequency typically results in a
lower premium. This is why it is common for insurers to use variables such as
NoClaimCredit in their rating. We will explore this topic further in Chapter 9
on experience rating.
Claims Management
Of course, the main story line of the 2010 experience was the large claim of over
12 million USD, nearly half the amount of claims for that year. Are there ways
that this could have been prevented or mitigated? Are there ways for the fund to
purchase protection against such large unusual events? Another unusual feature
of the 2010 experience noted earlier was the very large frequency of claims (239)
for one policyholder. Given that there were only 1,377 claims that year, this
means that a single policyholder accounted for 17.4% of the claims. These extreme
features of the data suggest opportunities for managing claims, the subject of
Chapter 10.

Loss Reserving
In our case study, we look only at the one year outcomes of closed claims (the op-
posite of open). However, like many lines of insurance, obligations from insured
events to buildings such as fire, hail, and the like, are not known immediately
and may develop over time. Other lines of business, including those where there
are injuries to people, take much longer to develop. Chapter 11 introduces this
concern and loss reserving, the discipline of determining how much the insurance
company should retain to meet its obligations.

1.4 Further Resources and Contributors


Contributor
• Edward W. (Jed) Frees, University of Wisconsin-Madison, is the princi-
pal author of the initial version of this chapter. Email: [email protected]
for chapter comments and suggested improvements.
• Chapter reviewers include: Yair Babad, Chunsheng Ban, Aaron Bruhn,
Gordon Enderle, Hirokazu (Iwahiro) Iwasawa, Dalia Khalil, Bell Ouelega,
Michelle Xia.
This book introduces loss data analytic tools that are most relevant to actuaries
and other financial risk analysts. We have also introduced you to many new
insurance terms; more terms can be found at the NAIC Glossary (2018). Here
are a few references cited in the chapter.
Chapter 2

Frequency Modeling

Chapter Preview. A primary focus for insurers is estimating the magnitude of
aggregate claims they must bear under their insurance contracts. Aggregate claims
are affected by both the frequency and the severity of the insured event. De-
composing aggregate claims into these two components, each of which warrant
significant attention, is essential for analysis and pricing. This chapter dis-
cusses frequency distributions, summary measures, and parameter estimation
techniques.
In Section 2.1, we present terminology and discuss reasons why we study fre-
quency and severity separately. The foundations of frequency distributions and
measures are presented in Section 2.2 along with three principal distributions:
the binomial, the Poisson, and the negative binomial. These three distributions
are members of what is known as the (𝑎, 𝑏, 0) class of distributions, a distinguish-
ing, identifying feature which allows for efficient calculation of probabilities,
further discussed in Section 2.3. When fitting a dataset with a distribution,
parameter values need to be estimated and in Section 2.4, the procedure for
maximum likelihood estimation is explained.
For insurance datasets, the observation at zero denotes no occurrence of a par-
ticular event; this often deserves additional attention. As explained further in
Section 2.5, for some datasets it may be impossible to have zero of the studied
event or zero events may follow a different model than other event counts. In
either case, direct fitting of typical count models could lead to improper esti-
mates. Zero truncation or modification techniques allow for more appropriate
distribution fit.
Noting that our insurance portfolio could consist of different sub-groups, each
with its own set of individual characteristics, Section 2.6 introduces mixture
distributions and methodology to allow for this heterogeneity within a portfolio.
Section 2.7 describes goodness of fit which measures the reasonableness of the
parameter estimates. Exercises are presented in Section 2.8 and Section 2.9.1
concludes the chapter with R Code for plots depicted in Section 2.4.

2.1 Frequency Distributions

In this section, you learn how to summarize the importance of frequency mod-
eling in terms of

• contractual,
• behavioral,
• database, and
• regulatory/administrative motivations.

2.1.1 How Frequency Augments Severity Information


Basic Terminology
In this chapter, loss, also referred to as ground-up loss, denotes the amount
of financial loss suffered by the insured. We use claim to denote the indem-
nification upon the occurrence of an insured event, thus the amount paid by
the insurer. While some texts use loss and claim interchangeably, we wish to
make a distinction here to recognize how insurance contractual provisions, such
as deductibles and limits, affect the size of the claim stemming from a loss. Fre-
quency represents how often an insured event occurs, typically within a policy
contract. Here, we focus on count random variables that represent the number
of claims, that is, how frequently an event occurs. Severity denotes the amount,
or size, of each payment for an insured event. In future chapters, the aggregate
model, which combines frequency models with severity models, is examined.

The Importance of Frequency


Recall from Section 1.2 that setting the price of an insurance good can be a
complex problem. In manufacturing, the cost of a good is (relatively) known.
In other financial service areas, market prices are available. In insurance, we can
generalize the price setting as follows. Start with an expected cost, then add
“margins” to account for the product’s riskiness, expenses incurred in servicing
the product, and a profit/surplus allowance for the insurer.

The expected cost for insurance can be determined as the expected number of
claims times the amount per claim, that is, expected value of frequency times
severity. The focus on claim count allows the insurer to consider those factors
which directly affect the occurrence of a loss, thereby potentially generating a
claim.
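
As a small numerical illustration (the figures below are hypothetical and are not from the Property Fund data), the expected cost is simply the product of the two expectations:

expected_frequency <- 0.12   # hypothetical: 0.12 claims per policy per year
expected_severity  <- 5000   # hypothetical: 5,000 per claim
expected_frequency * expected_severity   # expected cost of 600 per policy
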
Why Examine Frequency Information?


Insurers and other stakeholders, including governmental organizations, have var-
ious motivations for gathering and maintaining frequency datasets.
• Contractual. In insurance contracts, it is common for particular de-
ductibles and policy limits to be listed and invoked for each occurrence of
an insured event. Correspondingly, the claim count data generated would
indicate the number of claims which meet these criteria, offering a unique
claim frequency measure. Extending this, models of total insured losses
would need to account for deductibles and policy limits for each insured
event.
• Behavioral. In considering factors that influence loss frequency, the risk-
taking and risk-reducing behavior of individuals and companies should be
considered. Explanatory (rating) variables can have different effects on
models of how often an event occurs in contrast to the size of the event.
– In healthcare, the decision to utilize healthcare by individuals, and
minimize such healthcare utilization through preventive care and
wellness measures, is related primarily to his or her personal char-
acteristics. The cost per user is determined by the patient’s medical
condition, potential treatment measures, and decisions made by the
healthcare provider (such as the physician) and the patient. While
there is overlap in those factors and how they affect total healthcare
costs, attention can be focused on those separate drivers of healthcare
visit frequency and healthcare cost severity.
– In personal lines, prior claims history is an important underwriting
factor. It is common to use an indicator of whether or not the insured
had a claim within a certain time period prior to the contract. Also,
the number of claims incurred by the insured in previous periods has
predictive power.
– In homeowners insurance, in modeling potential loss frequency, the
insurer could consider loss prevention measures that the homeowner
has adopted, such as visible security systems. Separately, when mod-
eling loss severity, the insurer would examine those factors that affect
repair and replacement costs.
• Databases. Insurers may hold separate data files that suggest developing
separate frequency and severity models. For example, a policyholder file is
established when a policy is written. This file records much underwriting
information about the insured(s), such as age, gender, and prior claims
experience, policy information such as coverage, deductibles and limita-
tions, as well as any insurance claims event. A separate file, known as the
“claims” file, records details of the claim against the insurer, including the
amount. (There may also be a “payments” file that records the timing of
the payments although we shall not deal with that here.) This recording
process could then extend to insurers modeling the frequency and severity
as separate processes.

• Regulatory and Administrative. Insurance is a highly regulated and
monitored industry, given its importance in providing financial security to
individuals and companies facing risk. As part of their duties, regulators
routinely require the reporting of claims numbers as well as amounts. This
may be due to the fact that there can be alternative definitions of an
“amount,” e.g., paid versus incurred, and there is less potential error when
reporting claim numbers. This continual monitoring helps ensure financial
stability of these insurance companies.

2.2 Basic Frequency Distributions

In this section, you learn how to:

• Determine quantities that summarize a distribution such as the distri-
bution and survival function, as well as moments such as the mean and
variance
• Define and compute the moment and probability generating functions
• Describe and understand relationships among three important frequency
distributions, the binomial, Poisson, and negative binomial distributions

In this section, we introduce the distributions that are commonly used in actu-
arial practice to model count data. The claim count random variable is denoted
by 𝑁 ; by its very nature it assumes only non-negative integer values. Hence
the distributions below are all discrete distributions supported on the set of
non-negative integers {0, 1, …}.

2.2.1 Foundations
Since 𝑁 is a discrete random variable taking values in {0, 1, …}, the most natural
full description of its distribution is through the specification of the probabilities
with which it assumes each of the non-negative integer values. This leads us to
the concept of the probability mass function (pmf) of 𝑁 , denoted as 𝑝𝑁 (⋅) and
defined as follows:

𝑝𝑁 (𝑘) = Pr(𝑁 = 𝑘), for 𝑘 = 0, 1, …

We note that there are alternate complete descriptions, or characterizations, of
the distribution of 𝑁 ; for example, the distribution function of 𝑁 defined by
𝐹𝑁 (𝑥) = Pr(𝑁 ≤ 𝑥) and determined as:
$$F_N(x) = \begin{cases} \displaystyle\sum_{k=0}^{\lfloor x \rfloor} \Pr(N = k), & x \ge 0; \\ 0, & \text{otherwise.} \end{cases}$$

In the above, ⌊⋅⌋ denotes the floor function; ⌊𝑥⌋ denotes the greatest integer
less than or equal to 𝑥. This expression also suggests the descriptor cumula-
tive distribution function, a commonly used alternative way of expressing the
distribution function. We also note that the survival function of 𝑁 , denoted
by 𝑆𝑁 (⋅), is defined as the ones’-complement of 𝐹𝑁 (⋅), i.e. 𝑆𝑁 (⋅) = 1 − 𝐹𝑁 (⋅).
Clearly, the latter is another characterization of the distribution of 𝑁 .
Often one is interested in quantifying a certain aspect of the distribution and
not in its complete description. This is particularly useful when comparing
distributions. A center of location of the distribution is one such aspect, and
there are many different measures that are commonly used to quantify it. Of
these, the mean is the most popular; the mean of 𝑁 , denoted by 𝜇𝑁 ,1 is defined
as


$$\mu_N = \sum_{k=0}^{\infty} k\, p_N(k).$$

We note that 𝜇𝑁 is the expected value of the random variable 𝑁 , i.e. 𝜇𝑁 = E[𝑁 ].
This leads to a general class of measures, the moments of the distribution; the
𝑟-th raw moment of 𝑁 , for 𝑟 > 0, is defined as E[𝑁 𝑟 ] and denoted by 𝜇′𝑁 (𝑟).
We remark that the prime ′ here does not denote differentiation. Rather, it is
commonly used notation to distinguish a raw from a central moment, as will be
introduced in Section 3.1.1. For 𝑟 > 0, we have


$$\mu'_N(r) = \mathrm{E}[N^r] = \sum_{k=0}^{\infty} k^r\, p_N(k).$$

We note that 𝜇′𝑁 (⋅) is a well-defined non-decreasing function taking values in
[0, ∞], as Pr(𝑁 ∈ {0, 1, …}) = 1; also, note that 𝜇𝑁 = 𝜇′𝑁 (1). In the following,
when we refer to a moment it will be implicit that it is finite unless mentioned
otherwise.
Another basic aspect of a distribution is its dispersion, and of the various mea-
sures of dispersion studied in the literature, the standard deviation is the most
popular. Towards defining it, we first define the variance of 𝑁 , denoted by
Var[𝑁 ], as Var[𝑁 ] = E[(𝑁 − 𝜇𝑁 )2 ] when 𝜇𝑁 is finite. By basic properties of
the expected value of a random variable, we see that Var[𝑁 ] = E[𝑁 2 ] − [E(𝑁 )]2 .
The standard deviation of 𝑁 , denoted by 𝜎𝑁 , is defined as the square root of
1 For convenience, we have indexed 𝜇𝑁 with the random variable 𝑁 instead of 𝐹𝑁 or 𝑝𝑁 ,
even though it is solely a function of the distribution of the random variable.
Var[𝑁 ]. Note that the latter is well defined because Var[𝑁 ], by its definition as the
average squared deviation from the mean, is non-negative; Var[𝑁 ] is denoted by
𝜎𝑁². Note that these two measures take values in [0, ∞].
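
As a small numerical illustration of these definitions, the R sketch below computes the mean, variance, and standard deviation directly from a pmf; a Poisson pmf with mean 2 is used purely for illustration, truncated at a point where the omitted tail is negligible.

k  <- 0:100                          # support, truncated for computation
pk <- dpois(k, lambda = 2)           # an illustrative pmf
mu     <- sum(k * pk)                # mean
sigma2 <- sum((k - mu)^2 * pk)       # variance
c(mean = mu, variance = sigma2, std_dev = sqrt(sigma2))
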

2.2.2 Moment and Probability Generating Functions


Now we introduce two generating functions that are found to be useful when
working with count variables. Recall that for a discrete random variable, the
moment generating function (mgf) of 𝑁 , denoted as 𝑀𝑁 (⋅), is defined as

$$M_N(t) = \mathrm{E}\left[e^{tN}\right] = \sum_{k=0}^{\infty} e^{tk}\, p_N(k), \quad t \in \mathbb{R}.$$

We note that 𝑀𝑁 (⋅) is well defined as it is the expectation of a non-negative
random variable (𝑒^{𝑡𝑁}), though it can assume the value ∞. Note that for a
count random variable, 𝑀𝑁 (⋅) is finite valued on (−∞, 0] with 𝑀𝑁 (0) = 1.
The following theorem, whose proof can be found in Billingsley (2008) (pages
285-6), encapsulates the reason for its name.

Theorem 2.1.

Let 𝑁 be a count random variable such that E[𝑒^{𝑡∗ 𝑁}] is finite for some 𝑡∗ > 0.
We have the following:
a. All moments of 𝑁 are finite, i.e.

E[𝑁 𝑟 ] < ∞, 𝑟 > 0.


b. The mgf can be used to generate its moments as follows:

$$\frac{d^m}{dt^m} M_N(t)\bigg|_{t=0} = \mathrm{E}[N^m], \quad m \ge 1.$$

c. The mgf 𝑀𝑁 (⋅) characterizes the distribution; in other words it uniquely
specifies the distribution.

Another reason that the mgf is very useful as a tool is that for two independent
random variables 𝑋 and 𝑌 , with their mgfs existing in a neighborhood of 0,
the mgf of 𝑋 + 𝑌 is the product of their respective mgfs, that is, 𝑀𝑋+𝑌 (𝑡) =
𝑀𝑋 (𝑡)𝑀𝑌 (𝑡), for small 𝑡.
A related generating function to the mgf is the probability generating function
(pgf), and is a useful tool for random variables taking values in the non-negative
integers. For a random variable 𝑁 , by 𝑃𝑁 (⋅) we denote its pgf and we define it
as follows2 :
2 Here we use the convention 0^0 = 1.
𝑃𝑁 (𝑠) = E [𝑠𝑁 ], 𝑠 ≥ 0.

It is straightforward to see that if the mgf 𝑀𝑁 (⋅) exists on (−∞, 𝑡∗ ) then



𝑃𝑁 (𝑠) = 𝑀𝑁 (log(𝑠)), 𝑠 < 𝑒^{𝑡∗}.

Moreover, if the pgf exists on an interval [0, 𝑠∗ ) with 𝑠∗ > 1, then the mgf
𝑀𝑁 (⋅) exists on (−∞, log(𝑠∗ )), and hence uniquely specifies the distribution of
𝑁 by Theorem 2.1. (As a reminder, throughout this text we use log as the
natural logarithm, not the base ten (common) logarithm or other version.) The
following result for pgf is an analog of Theorem 2.1, and in particular justifies
its name.

Theorem 2.2. Let 𝑁 be a count random variable such that E[(𝑠∗)^𝑁] is finite
for some 𝑠∗ > 1. We have the following:
a. All moments of 𝑁 are finite, i.e.

E[𝑁^𝑟] < ∞, 𝑟 ≥ 0.

b. The 𝑝𝑚𝑓 of 𝑁 can be derived from the pgf as follows:

$$p_N(m) = \begin{cases} P_N(0), & m = 0; \\[4pt] \dfrac{1}{m!} \dfrac{d^m}{ds^m} P_N(s)\bigg|_{s=0}, & m \ge 1. \end{cases}$$

c. The factorial moments of 𝑁 can be derived as follows:


$$\frac{d^m}{ds^m} P_N(s)\bigg|_{s=1} = \mathrm{E}\left[\prod_{i=0}^{m-1} (N - i)\right], \quad m \ge 1.$$

d. The pgf 𝑃𝑁 (⋅) characterizes the distribution; in other words it uniquely
specifies the distribution.

2.2.3 Important Frequency Distributions


In this sub-section we study three important frequency distributions used in
statistics, namely the binomial, the Poisson, and the negative binomial distri-
butions. In the following, a risk denotes a unit covered by insurance. A risk
could be an individual, a building, a company, or some other identifier for which
insurance coverage is provided. For context, imagine an insurance data set con-
taining the number of claims by risk or stratified in some other manner. The
above mentioned distributions also happen to be the most commonly used in
insurance practice for reasons, some of which we mention below.
• These distributions can be motivated by natural random experiments
which are good approximations to real life processes from which many
insurance data arise. Hence, not surprisingly, they together offer a rea-
sonable fit to many insurance data sets of interest. The appropriateness
of a particular distribution for the set of data can be determined using
standard statistical methodologies, as we discuss later in this chapter.
• They provide a rich enough basis for generating other distributions that
even better approximate or well cater to more real situations of interest
to us.
– The three distributions are either one-parameter or two-parameter
distributions. In fitting to data, a parameter is assigned a particular
value. The set of these distributions can be enlarged to their convex
hulls by treating the parameter(s) as a random variable (or vector)
with its own probability distribution, with this larger set of distri-
butions offering greater flexibility. A simple example that is better
addressed by such an enlargement is a portfolio of claims generated
by insureds belonging to many different risk classes.
– In insurance data, we may observe either a marginal or inordinate
number of zeros, that is, zero claims by risk. When fitting to the data,
a frequency distribution in its standard specification often fails to rea-
sonably account for this occurrence. The natural modification of the
above three distributions, however, accommodates this phenomenon
well towards offering a better fit.
– In insurance we are interested in total claims paid, whose distribu-
tion results from compounding the fitted frequency distribution with
a severity distribution. These three distributions have properties that
make it easy to work with the resulting aggregate severity distribu-
tion.

Binomial Distribution
We begin with the binomial distribution which arises from any finite sequence
of identical and independent experiments with binary outcomes. The most
canonical of such experiments is the (biased or unbiased) coin tossing experiment
with the outcome being heads or tails. So if 𝑁 denotes the number of heads
in a sequence of 𝑚 independent coin tossing experiments with an identical coin
which turns heads up with probability 𝑞, then the distribution of 𝑁 is called
the binomial distribution with parameters (𝑚, 𝑞), with 𝑚 a positive integer
and 𝑞 ∈ [0, 1]. Note that when 𝑞 = 0 (resp., 𝑞 = 1) then the distribution is
degenerate with 𝑁 = 0 (resp., 𝑁 = 𝑚) with probability 1. Clearly, its support
when 𝑞 ∈ (0, 1) equals {0, 1, … , 𝑚} with pmf given by 3

$$p_k = \binom{m}{k} q^k (1-q)^{m-k}, \quad k = 0, \ldots, m.$$
3 In the following we suppress the reference to 𝑁 and denote the pmf by the sequence
{𝑝𝑘 }𝑘≥0 , instead of the function 𝑝𝑁 (⋅).


where
$$\binom{m}{k} = \frac{m!}{k!(m-k)!}.$$

The reason for its name is that the pmf takes values among the terms that arise
from the binomial expansion of (𝑞 + (1 − 𝑞))𝑚 . This realization then leads to
the following expression for the pgf of the binomial distribution:

$$P_N(z) = \sum_{k=0}^{m} z^k \binom{m}{k} q^k (1-q)^{m-k} = \sum_{k=0}^{m} \binom{m}{k} (zq)^k (1-q)^{m-k} = (qz + (1-q))^m = (1 + q(z-1))^m.$$

Note that the above expression for the pgf confirms the fact that the binomial
distribution is the m-convolution of the Bernoulli distribution, which is the
binomial distribution with 𝑚 = 1 and pgf (1 + 𝑞(𝑧 − 1)). By “m-convolution,”
we mean that we can write 𝑁 as the sum of 𝑁1 , … , 𝑁𝑚 . Here, 𝑁𝑖 are iid
Bernoulli variates. Also, note that the mgf of the binomial distribution is given
by (1 + 𝑞(𝑒𝑡 − 1))𝑚 .
The mean and variance of the binomial distribution can be found in a few
different ways. To emphasize the key property that it is a 𝑚-convolution of
the Bernoulli distribution, we derive below the moments using this property.
We begin by observing that the Bernoulli distribution with parameter 𝑞 assigns
probability of 𝑞 and 1 − 𝑞 to 1 and 0, respectively. So its mean equals 𝑞 (=
0 × (1 − 𝑞) + 1 × 𝑞); note that its raw second moment equals its mean as 𝑁 2 = 𝑁
with probability 1. Using these two facts we see that the variance equals 𝑞(1−𝑞).
Moving on to the binomial distribution with parameters 𝑚 and 𝑞, using the fact
that it is the 𝑚-convolution of the Bernoulli distribution, we write 𝑁 as the
sum of 𝑁1 , … , 𝑁𝑚 , where 𝑁𝑖 are iid Bernoulli variates, as above. Now using
the moments of Bernoulli and linearity of the expectation, we see that

$$\mathrm{E}[N] = \mathrm{E}\left[\sum_{i=1}^{m} N_i\right] = \sum_{i=1}^{m} \mathrm{E}[N_i] = mq.$$

Also, using the fact that the variance of the sum of independent random
variables is the sum of their variances, we see that

$$\mathrm{Var}[N] = \mathrm{Var}\left[\sum_{i=1}^{m} N_i\right] = \sum_{i=1}^{m} \mathrm{Var}[N_i] = mq(1-q).$$

Alternate derivations of the above moments are suggested in the exercises. One
important observation, especially from the point of view of applications, is that
the mean is greater than the variance unless 𝑞 = 0.
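
These formulas are easy to verify numerically; here is a minimal R check that computes the moments directly from the pmf via dbinom().

m <- 10; q <- 0.3
k  <- 0:m
pk <- dbinom(k, size = m, prob = q)
c(mean = sum(k * pk), m_q = m * q)                                      # both 3.0
c(variance = sum(k^2 * pk) - sum(k * pk)^2, m_q_1_q = m * q * (1 - q))  # both 2.1
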
Poisson Distribution
After the binomial distribution, the Poisson distribution (named after the French
polymath Simeon Denis Poisson) is probably the most well known of discrete
distributions. This is partly due to the fact that it arises naturally as the
distribution of the count of the random occurrences of a type of event in a
certain time period, if the rate of occurrences of such events is a constant. It
also arises as the asymptotic limit of the binomial distribution with 𝑚 → ∞
and 𝑚𝑞 → 𝜆.
The Poisson distribution is parametrized by a single parameter usually denoted
by 𝜆 which takes values in (0, ∞). Its pmf is given by

$$p_k = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, \ldots$$
It is easy to check that the above specifies a pmf as the terms are clearly non-
negative, and that they sum to one follows from the infinite Taylor series expan-
sion of 𝑒𝜆 . More generally, we can derive its pgf, 𝑃𝑁 (⋅), as follows:
$$P_N(z) = \sum_{k=0}^{\infty} p_k z^k = \sum_{k=0}^{\infty} \frac{e^{-\lambda} \lambda^k z^k}{k!} = e^{-\lambda} e^{\lambda z} = e^{\lambda(z-1)}, \quad \forall z \in \mathbb{R}.$$

From the above, we derive its mgf as follows:


$$M_N(t) = P_N(e^t) = e^{\lambda(e^t - 1)}, \quad t \in \mathbb{R}.$$

Towards deriving its mean, we note that for the Poisson distribution

$$k\, p_k = \begin{cases} 0, & k = 0; \\ \lambda\, p_{k-1}, & k \ge 1. \end{cases}$$

This can be checked easily. In particular, this implies that

$$\mathrm{E}[N] = \sum_{k \ge 0} k\, p_k = \lambda \sum_{k \ge 1} p_{k-1} = \lambda \sum_{j \ge 0} p_j = \lambda.$$

In fact, more generally, using either a generalization of the above or using The-
orem 2.1, we see that
$$\mathrm{E}\left[\prod_{i=0}^{m-1} (N - i)\right] = \frac{d^m}{ds^m} P_N(s)\bigg|_{s=1} = \lambda^m, \quad m \ge 1.$$

This, in particular, implies that

$$\mathrm{Var}[N] = \mathrm{E}[N^2] - (\mathrm{E}[N])^2 = \mathrm{E}[N(N-1)] + \mathrm{E}[N] - (\mathrm{E}[N])^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.$$

Note that interestingly for the Poisson distribution Var[𝑁 ] = E[𝑁 ].
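
A quick numerical check of this equality of mean and variance, using the Poisson pmf from dpois():

lambda <- 3
k  <- 0:200                      # truncation point with negligible omitted tail
pk <- dpois(k, lambda)
mu <- sum(k * pk)
c(mean = mu, variance = sum((k - mu)^2 * pk))   # both approximately 3
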


Negative Binomial Distribution


The third important count distribution is the negative binomial distribution.
Recall that the binomial distribution arose as the distribution of the number of
successes in 𝑚 independent repetition of an experiment with binary outcomes.
If we instead consider the number of successes until we observe the 𝑟-th failure
in independent repetitions of an experiment with binary outcomes, then its
distribution is a negative binomial distribution. A particular case, when 𝑟 = 1,
is the geometric distribution. However, when 𝑟 is not an integer, the above
random experiment would not be applicable. In the following, we allow the
parameter 𝑟 to be any positive real number to then motivate the distribution
more generally. To explain its name, we recall the binomial series, i.e.

$$(1 + x)^s = 1 + sx + \frac{s(s-1)}{2!} x^2 + \cdots, \quad s \in \mathbb{R}; \ |x| < 1.$$
If we define $\binom{s}{k}$, the generalized binomial coefficient, by

$$\binom{s}{k} = \frac{s(s-1) \cdots (s-k+1)}{k!},$$

then we have

$$(1 + x)^s = \sum_{k=0}^{\infty} \binom{s}{k} x^k, \quad s \in \mathbb{R}; \ |x| < 1.$$
If we let 𝑠 = −𝑟, then we see that the above yields

$$(1 - x)^{-r} = 1 + rx + \frac{(r+1)r}{2!} x^2 + \cdots = \sum_{k=0}^{\infty} \binom{r+k-1}{k} x^k, \quad r \in \mathbb{R}; \ |x| < 1.$$

This implies that if we define 𝑝𝑘 as


$$p_k = \binom{k+r-1}{k} \left(\frac{1}{1+\beta}\right)^r \left(\frac{\beta}{1+\beta}\right)^k, \quad k = 0, 1, \ldots$$

for 𝑟 > 0 and 𝛽 ≥ 0, then it defines a valid pmf. The distribution so defined is
called the negative binomial distribution with parameters (𝑟, 𝛽) with 𝑟 > 0 and
𝛽 ≥ 0. Moreover, the binomial series also implies that the pgf of this distribution
is given by
$$P_N(z) = (1 - \beta(z-1))^{-r}, \quad |z| < 1 + \frac{1}{\beta}, \ \beta \ge 0.$$

The above implies that the mgf is given by


$$M_N(t) = (1 - \beta(e^t - 1))^{-r}, \quad t < \log\left(1 + \frac{1}{\beta}\right), \ \beta \ge 0.$$

We derive its moments using Theorem 2.1 as follows:


$$\begin{aligned}
\mathrm{E}[N] &= M'_N(0) = r\beta e^t (1 - \beta(e^t - 1))^{-r-1} \big|_{t=0} = r\beta; \\
\mathrm{E}[N^2] &= M''_N(0) = \left[ r\beta e^t (1 - \beta(e^t - 1))^{-r-1} + r(r+1)\beta^2 e^{2t} (1 - \beta(e^t - 1))^{-r-2} \right] \big|_{t=0} \\
&= r\beta(1+\beta) + r^2\beta^2; \\
\mathrm{Var}[N] &= \mathrm{E}[N^2] - (\mathrm{E}[N])^2 = r\beta(1+\beta) + r^2\beta^2 - r^2\beta^2 = r\beta(1+\beta).
\end{aligned}$$

We note that when 𝛽 > 0, we have Var[𝑁 ] > E[𝑁 ]. In other words, this
distribution is overdispersed (relative to the Poisson); similarly, when 𝑞 > 0 the
binomial distribution is said to be underdispersed (relative to the Poisson).
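
For a numerical check in R, note that dnbinom() uses the (size, prob) parametrization; the (r, beta) form used here corresponds to size = r and prob = 1/(1 + beta). A minimal sketch verifying the moment formulas:

r <- 2; beta <- 1.5
k  <- 0:500                                       # truncated support
pk <- dnbinom(k, size = r, prob = 1 / (1 + beta))
mu <- sum(k * pk)
c(mean = mu, r_beta = r * beta)                                            # both 3.0
c(variance = sum((k - mu)^2 * pk), r_beta_1_beta = r * beta * (1 + beta))  # both 7.5
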
Finally, we observe that the Poisson distribution also emerges as a limit of
negative binomial distributions. Towards establishing this, let 𝛽𝑟 be such that
as 𝑟 approaches infinity 𝑟𝛽𝑟 approaches 𝜆 > 0. Then we see that the mgfs of
negative binomial distributions with parameters (𝑟, 𝛽𝑟 ) satisfy

$$\lim_{r \to \infty} (1 - \beta_r(e^t - 1))^{-r} = \exp\{\lambda(e^t - 1)\},$$

with the right hand side of the above equation being the mgf of the Poisson
distribution with parameter 𝜆.4

2.3 The (a, b, 0) Class

In this section, you learn how to:


• Define the (a,b,0) class of frequency distributions
• Discuss the importance of the recursive relationship underpinning this
class of distributions
• Identify conditions under which this general class reduces to each of the
binomial, Poisson, and negative binomial distributions

In the previous section we studied three distributions, namely the binomial, the
Poisson and the negative binomial distributions. In the case of the Poisson, to
derive its mean we used the fact that

𝑘𝑝𝑘 = 𝜆𝑝𝑘−1 , 𝑘 ≥ 1,

which can be expressed equivalently as

\frac{p_k}{p_{k-1}} = \frac{\lambda}{k}, \qquad k \ge 1.

Interestingly, we can similarly show that for the binomial distribution

\frac{p_k}{p_{k-1}} = \frac{-q}{1-q} + \left(\frac{(m+1)q}{1-q}\right)\frac{1}{k}, \qquad k = 1, \ldots, m,

and that for the negative binomial distribution

\frac{p_k}{p_{k-1}} = \frac{\beta}{1+\beta} + \left(\frac{(r-1)\beta}{1+\beta}\right)\frac{1}{k}, \qquad k \ge 1.
The above relationships are all of the form

\frac{p_k}{p_{k-1}} = a + \frac{b}{k}, \qquad k \ge 1; \qquad (2.1)

this raises the question of whether there are any other distributions that satisfy this
seemingly general recurrence relation. Note that the ratio on the left, being the ratio
of two probabilities, is non-negative.

Snippet of Theory. To begin with, let 𝑎 < 0. In this case, as 𝑘 → ∞, we have
(𝑎 + 𝑏/𝑘) → 𝑎 < 0. Since the ratio on the left of (2.1) must remain non-negative, the
probabilities must vanish from some point onward, which requires 𝑎 + 𝑏/𝑘 = 0 for some 𝑘;
it follows that if 𝑎 < 0 then 𝑏 must satisfy 𝑏 = −𝑘𝑎, for some 𝑘 ≥ 1. Any such pair (𝑎, 𝑏)
can be written as

\left(\frac{-q}{1-q}, \frac{(m+1)q}{1-q}\right), \qquad q \in (0,1),\ m \ge 1;

note that the case 𝑎 < 0 with 𝑎 + 𝑏 = 0 yields the distribution degenerate at 0,
which is the binomial distribution with 𝑞 = 0 and arbitrary 𝑚 ≥ 1.
In the case of 𝑎 = 0, again by non-negativity of the ratio 𝑝𝑘 /𝑝𝑘−1 , we have 𝑏 ≥ 0.
If 𝑏 = 0 the distribution is degenerate at 0, which is a binomial with 𝑞 = 0 or a
Poisson distribution with 𝜆 = 0 or a negative binomial distribution with 𝛽 = 0.
If 𝑏 > 0, then clearly such a distribution is a Poisson distribution with mean
(i.e. 𝜆) equal to 𝑏, as presented at the beginning of this section.
In the case of 𝑎 > 0, again by non-negativity of the ratio 𝑝𝑘 /𝑝𝑘−1 , we have
𝑎 + 𝑏/𝑘 ≥ 0 for all 𝑘 ≥ 1. The most stringent of these is the inequality 𝑎 + 𝑏 ≥ 0.
Note that 𝑎 + 𝑏 = 0 again results in degeneracy at 0; excluding this case we have
𝑎 + 𝑏 > 0, or equivalently 𝑏 = (𝑟 − 1)𝑎 with 𝑟 > 0. Some algebra easily yields
the following expression for 𝑝𝑘 :

p_k = \binom{k+r-1}{k}\, p_0\, a^k, \qquad k = 1, 2, \ldots.

The above series converges for 𝑎 < 1 when 𝑟 > 0, with the sum given by
p_0 \cdot \left((1-a)^{-r} - 1\right). Hence, equating the latter to 1 − 𝑝0 we get p_0 = (1-a)^{r}.
So in this case the pair (𝑎, 𝑏) is of the form (𝑎, (𝑟 − 1)𝑎), for 𝑟 > 0 and 0 < 𝑎 < 1;
since an equivalent parametrization is (𝛽/(1 + 𝛽), (𝑟 − 1)𝛽/(1 + 𝛽)), for 𝑟 > 0
and 𝛽 > 0, we see from above that such distributions are negative binomial
distributions.

From the above development we see that not only does the recurrence (2.1) tie
these three distributions together, but also it characterizes them. For this reason
these three distributions are collectively referred to in the actuarial literature
as (a,b,0) class of distributions, with 0 referring to the starting point of the
recurrence. Note that the value of 𝑝0 is implied by (𝑎, 𝑏) since the probabilities
have to sum to one. Of course, (2.1) as a recurrence relation for 𝑝𝑘 makes the
computation of the pmf efficient by removing redundancies. Later, we will see
that it does so even in the case of compound distributions with the frequency
distribution belonging to the (𝑎, 𝑏, 0) class - this fact is the more important
motivating reason to study these three distributions from this viewpoint.
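As a small illustration of this computational point, the following R sketch computes Poisson probabilities through the recurrence (2.1) and compares them with R's built-in dpois; the choice λ = 3 is arbitrary.

# Poisson(lambda = 3) probabilities via the (a, b, 0) recursion with a = 0, b = lambda
lambda <- 3
a <- 0; b <- lambda
p <- numeric(11)
p[1] <- exp(-lambda)                     # p_0, implied by the probabilities summing to one
for (k in 1:10) p[k + 1] <- (a + b / k) * p[k]
max(abs(p - dpois(0:10, lambda)))        # essentially zero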
Example 2.3.1. A discrete probability distribution has the following properties:

p_k = c\left(1 + \frac{2}{k}\right)p_{k-1}, \qquad k = 1, 2, 3, \ldots

p_1 = \frac{9}{256}
Determine the expected value of this discrete random variable.
Solution: Since the pmf satisfies the (𝑎, 𝑏, 0) recurrence relation with 𝑎 = 𝑐 and
𝑏 = 2𝑐, we know that the underlying distribution is one among the binomial, Poisson,
and negative binomial distributions. Since the ratio 𝑏/𝑎 equals 2 and is positive, it
cannot be the binomial (for which 𝑏/𝑎 = −(𝑚 + 1) < 0) or the Poisson (for which 𝑎 = 0);
hence it is negative binomial with 𝑟 − 1 = 2, that is, 𝑟 = 3. Moreover, since for a
negative binomial p_1 = r\beta(1+\beta)^{-(r+1)}, we have

\frac{9}{256} = \frac{3\beta}{(1+\beta)^4} \implies \frac{3}{(1+3)^4} = \frac{\beta}{(1+\beta)^4} \implies \beta = 3.

Finally, since the mean of a negative binomial is 𝑟𝛽, the mean of the given distribution
equals 9.
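A quick check in R (using the size = r, prob = 1/(1 + β) parametrization noted earlier):

# Check for Example 2.3.1: with r = 3 and beta = 3, p_1 = 9/256 and the mean is r*beta = 9
r <- 3; beta <- 3
dnbinom(1, size = r, prob = 1 / (1 + beta))     # 0.03515625, which equals 9/256
r * beta                                        # mean of the fitted distribution, 9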

2.4 Estimating Frequency Distributions

In this section, you learn how to:


• Define a likelihood for a sample of observations from a discrete distribution
• Define the maximum likelihood estimator for a random sample of obser-
vations from a discrete distribution
• Calculate the maximum likelihood estimator for the binomial, Poisson, and
negative binomial distributions

2.4.1 Parameter Estimation


In Section 2.2 we introduced three distributions of importance in modeling var-
ious types of count data arising from insurance. Let us now suppose that we
have a set of count data to which we wish to fit a distribution, and that we
have determined that one of these (𝑎, 𝑏, 0) distributions is more appropriate
than the others. Since each one of these forms a class of distributions if we
allow its parameter(s) to take any permissible value, there remains the task of
determining the best value of the parameter(s) for the data at hand. This is
a statistical point estimation problem, and in parametric inference problems
the statistical inference paradigm of maximum likelihood usually yields efficient
estimators. In this section we describe this paradigm and derive the maximum
likelihood estimators.
Let us suppose that we observe the independent and identically distributed, iid,
random variables 𝑋1 , 𝑋2 , … , 𝑋𝑛 from a distribution with pmf 𝑝𝜃 , where 𝜃 is a
vector of parameters and an unknown value in the parameter space Θ ⊆ ℝ𝑑 . For
example, in the case of the Poisson distribution, there is a single parameter so
that 𝑑 = 1 and

p_\theta(x) = e^{-\theta}\,\frac{\theta^x}{x!}, \qquad x = 0, 1, \ldots,

with 𝜃 > 0. In the case of the binomial distribution we have

p_\theta(x) = \binom{m}{x} q^x (1-q)^{m-x}, \qquad x = 0, 1, \ldots, m.

For some applications, we can view 𝑚 as a parameter and so take 𝑑 = 2 so that


𝜃 = (𝑚, 𝑞) ∈ {0, 1, 2, …} × [0, 1].
Let us suppose that the observations are 𝑥1 , … , 𝑥𝑛 , observed values of the ran-
dom sample 𝑋1 , 𝑋2 , … , 𝑋𝑛 presented earlier. In this case, the probability of
observing this sample from 𝑝𝜃 equals
\prod_{i=1}^{n} p_\theta(x_i).

The above, denoted by 𝐿(𝜃), viewed as a function of 𝜃, is called the likelihood.


Note that we suppressed its dependence on the data, to emphasize that we are

viewing it as a function of the parameter vector. For example, in the case of


the Poisson distribution we have
L(\lambda) = e^{-n\lambda}\, \lambda^{\sum_{i=1}^{n} x_i} \left(\prod_{i=1}^{n} x_i!\right)^{-1}.

In the case of the binomial distribution we have


L(m, q) = \left(\prod_{i=1}^{n} \binom{m}{x_i}\right) q^{\sum_{i=1}^{n} x_i} (1-q)^{nm - \sum_{i=1}^{n} x_i}.

The maximum likelihood estimator (mle) for 𝜃 is any maximizer of the likeli-
hood; in a sense the mle chooses the set of parameter values that best explains
the observed observations. Appendix Section 15.2.2 reviews the foundations of
maximum likelihood estimation with more mathematical details in Appendix
Chapter 17.
Special Case: Three Bernoulli Outcomes. To illustrate, consider a sample
of size 𝑛 = 3 from a Bernoulli distribution (binomial with 𝑚 = 1) with values
0, 1, 0. The likelihood in this case is easily checked to equal
L(q) = q(1-q)^2,
and the plot of the likelihood is given in Figure 2.1. As shown in the plot, the
maximum value of the likelihood equals 4/27 and is attained at 𝑞 = 1/3, and
hence the maximum likelihood estimate for 𝑞 is 1/3 for the given sample. In
this case one can resort to algebra to show that
q(1-q)^2 = \left(q - \frac{1}{3}\right)^2\left(q - \frac{4}{3}\right) + \frac{4}{27},
and conclude that the maximum equals 4/27, and is attained at 𝑞 = 1/3 (using
the fact that the first term is non-positive in the interval [0, 1]).
But as is apparent, this way of deriving the mle using algebra does not gener-
alize. In general, one resorts to calculus to derive the mle - note that for some
likelihoods one may have to resort to other optimization methods, especially
when the likelihood has many local extrema. It is customary to equivalently
maximize the logarithm of the likelihood5 𝐿(⋅), denoted by 𝑙(⋅), and look at
the set of zeros of its first derivative6 𝑙′ (⋅). In the case of the above likelihood,
𝑙(𝑞) = log(𝑞) + 2 log(1 − 𝑞), and
l'(q) = \frac{d}{dq}\, l(q) = \frac{1}{q} - \frac{2}{1-q}.
The unique zero of 𝑙′(⋅) equals 1/3, and since 𝑙″(⋅) is negative, 1/3 is the
unique maximizer of the likelihood and hence the maximum likelihood estimate.
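The same answer is easy to confirm numerically; below is a minimal R sketch using optimize on the log-likelihood for the sample (0, 1, 0).

# Maximize l(q) = log(q) + 2*log(1 - q), the Bernoulli log-likelihood for the sample (0, 1, 0)
loglik <- function(q) log(q) + 2 * log(1 - q)
optimize(loglik, interval = c(0, 1), maximum = TRUE)   # maximum near q = 1/3, value log(4/27)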
5 The set of maximizers of 𝐿(⋅) is the same as the set of maximizers of any strictly increasing
function of 𝐿(⋅), and hence the same as those for 𝑙(⋅).
6 A slight benefit of working with 𝑙(⋅) is that constant terms in 𝐿(⋅) do not appear in 𝑙′(⋅)
whereas they do in 𝐿′(⋅).



Figure 2.1: Likelihood of a (0, 1, 0) 3-sample from Bernoulli

2.4.2 Frequency Distributions MLE


In the following, we derive the maximum likelihood estimator, mle, for the
three members of the (𝑎, 𝑏, 0) class. We begin by summarizing the discussion
above. In the setting of observing iid, independent and identically distributed,
random variables 𝑋1 , 𝑋2 , … , 𝑋𝑛 from a distribution with pmf 𝑝𝜃 , where 𝜃 takes
an unknown value in Θ ⊆ ℝ𝑑 , the likelihood 𝐿(⋅), a function on Θ is defined as
𝑛
𝐿(𝜃) = ∏ 𝑝𝜃 (𝑥𝑖 ),
𝑖=1

where 𝑥1 , … , 𝑥𝑛 are the observed values. The mle of 𝜃, denoted as \hat{\theta}_{MLE}, is a
function which maps the observations to an element of the set of maximizers of
𝐿(⋅), namely

\{\theta \mid L(\theta) = \max_{\eta \in \Theta} L(\eta)\}.

Note the above set is a function of the observations, even though this dependence
is not made explicit. In the case of the three distributions that we study, and
quite generally, the above set is a singleton with probability tending to one (with
increasing sample size). In other words, for many commonly used distributions
and when the sample size is large, the likelihood estimate is uniquely defined
with high probability.
In the following, we assume that we have observed 𝑛 iid random variables
𝑋1 , 𝑋2 , … , 𝑋𝑛 from the distribution under consideration, even though the para-
metric value is unknown. Also, 𝑥1 , 𝑥2 , … , 𝑥𝑛 will denote the observed values.
We note that in the case of count data, and data from discrete distributions in
general, the likelihood can alternately be represented as
L(\theta) = \prod_{k \ge 0} \left(p_\theta(k)\right)^{m_k},

where 𝑚𝑘 is the number of observations equal to 𝑘. Mathematically, we have


m_k = \left|\{i \mid x_i = k,\ 1 \le i \le n\}\right| = \sum_{i=1}^{n} I(x_i = k), \qquad k \ge 0.

Note that this transformation retains all of the data, compiling it in a stream-
lined manner. For large 𝑛 it leads to compression of the data in the sense of
sufficiency. Below, we present expressions for the mle in terms of {𝑚𝑘 }𝑘≥1 as
well.
Special Case: Poisson Distribution. In this case, as noted above, the
likelihood is given by
L(\lambda) = \left(\prod_{i=1}^{n} x_i!\right)^{-1} e^{-n\lambda}\, \lambda^{\sum_{i=1}^{n} x_i}.

Taking logarithms, the log-likelihood is


l(\lambda) = -\sum_{i=1}^{n} \log(x_i!) - n\lambda + \log(\lambda) \cdot \sum_{i=1}^{n} x_i.

Taking a derivative, we have


l'(\lambda) = -n + \frac{1}{\lambda}\sum_{i=1}^{n} x_i.

In evaluating l''(\lambda), when \sum_{i=1}^{n} x_i > 0, we have l'' < 0. Consequently, the maximum
is attained at the sample mean, \bar{x}, presented below. When \sum_{i=1}^{n} x_i = 0, the
likelihood is a decreasing function and hence the maximum is attained at the
least possible parameter value; this results in the maximum likelihood estimate
being zero. Hence, we have

\bar{x} = \hat{\lambda}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} x_i.

Note that the sample mean can be computed also as

\bar{x} = \frac{1}{n}\sum_{k \ge 1} k \cdot m_k.

It is noteworthy that in the case of the Poisson, the exact distribution of 𝜆̂ MLE is
available in closed form - it is a scaled Poisson - when the underlying distribution
is a Poisson. This is so as the sum of independent Poisson random variables is a
Poisson as well. Of course, for large sample size one can use the ordinary Central
Limit Theorem (CLT) to derive a normal approximation. Note that the latter
approximation holds even if the underlying distribution is any distribution with
a finite second moment.
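A short R sketch confirms numerically that the sample mean maximizes the Poisson log-likelihood; the simulated sample below is purely illustrative.

# Poisson mle equals the sample mean; the simulated data are for illustration only
set.seed(2020)
x <- rpois(100, lambda = 1.5)
negll <- function(lambda) -sum(dpois(x, lambda, log = TRUE))
optimize(negll, interval = c(0.01, 10))$minimum    # numerical maximizer of the likelihood
mean(x)                                            # the sample mean, essentially the same value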
Special Case: Binomial Distribution. Unlike the case of the Poisson distri-
bution, the parameter space in the case of the binomial is 2-dimensional. Hence
the optimization problem is a bit more challenging. We begin by observing that
the likelihood is given by
L(m, q) = \left(\prod_{i=1}^{n} \binom{m}{x_i}\right) q^{\sum_{i=1}^{n} x_i} (1-q)^{nm - \sum_{i=1}^{n} x_i}.

Taking logarithms, the log-likelihood is

l(m, q) = \sum_{i=1}^{n} \log\binom{m}{x_i} + \left(\sum_{i=1}^{n} x_i\right)\log(q) + \left(nm - \sum_{i=1}^{n} x_i\right)\log(1-q)
        = \sum_{i=1}^{n} \log\binom{m}{x_i} + n\bar{x}\log(q) + n(m - \bar{x})\log(1-q),

where \bar{x} = n^{-1}\sum_{i=1}^{n} x_i. Note that since m takes only non-negative integer values,
we cannot use multivariate calculus to find the optimal values. Nevertheless,
we can use single variable calculus to show that

\hat{q}_{MLE} \times \hat{m}_{MLE} = \bar{x}. \qquad (2.2)

Towards this we note that for a fixed value of 𝑚,

\frac{\partial}{\partial q}\, l(m, q) = \frac{n\bar{x}}{q} - \frac{n(m-\bar{x})}{1-q},

and that

\frac{\partial^2}{\partial q^2}\, l(m, q) = -\frac{n\bar{x}}{q^2} - \frac{n(m-\bar{x})}{(1-q)^2} \le 0.

The above implies that for any fixed value of m, the maximizing value of q satisfies

mq = \bar{x},

and hence we establish equation (2.2).

With equation (2.2), the above reduces the task to the search for 𝑚̂ MLE , which
is a maximizer of

L\left(m, \frac{\bar{x}}{m}\right). \qquad (2.3)

Note the likelihood would be zero for values of m smaller than \max_{1 \le i \le n} x_i, and
hence \hat{m}_{MLE} \ge \max_{1 \le i \le n} x_i.

Towards specifying an algorithm to compute 𝑚̂ MLE , we first point out that for
some data sets 𝑚̂ MLE could equal ∞, indicating that a Poisson distribution
would render a better fit than any binomial distribution. This is so as the bino-
mial distribution with parameters (𝑚, 𝑥/𝑚) approaches the Poisson distribution
with parameter 𝑥 with 𝑚 approaching infinity. The fact that some data sets
prefer a Poisson distribution should not be surprising since in the above sense

the set of Poisson distribution is on the boundary of the set of binomial distri-
butions. Interestingly, in Olkin et al. (1981) they show that if the sample mean
is less than or equal to the sample variance then 𝑚̂ MLE = ∞; otherwise, there
exists a finite 𝑚 that maximizes equation (2.3).

In Figure 2.2 below we display the plot of 𝐿 (𝑚, 𝑥/𝑚) for three different samples
of size 5; they differ only in the value of the sample maximum. The first sample
of (2, 2, 2, 4, 5) has the ratio of sample mean to sample variance greater than
1 (1.875), the second sample of (2, 2, 2, 4, 6) has the ratio equal to 1.25 which
is closer to 1, and the third sample of (2, 2, 2, 4, 7) has the ratio less than 1
(0.885). For these three samples, as shown in Figure 2.2, 𝑚̂ MLE equals 7, 18 and
∞, respectively. Note that the limiting value of 𝐿 (𝑚, 𝑥/𝑚) as 𝑚 approaches
infinity equals

\left(\prod_{i=1}^{n} x_i!\right)^{-1} \exp(-n\bar{x})\; \bar{x}^{\,n\bar{x}}. \qquad (2.4)

Also, note that Figure 2.2 shows that the mle of 𝑚 is non-robust, i.e. changes
in a small proportion of the data set can cause large changes in the estimator.

The above discussion suggests the following simple algorithm:

• Step 1. If the sample mean is less than or equal to the sample variance,
then set 𝑚̂ 𝑀𝐿𝐸 = ∞. The mle suggested distribution is a Poisson distri-
bution with 𝜆̂ = 𝑥.
• Step 2. If the sample mean is greater than the sample variance, then
compute 𝐿(𝑚, 𝑥/𝑚) for 𝑚 values greater than or equal to the sample
maximum until 𝐿(𝑚, 𝑥/𝑚) is close to the value of the Poisson likelihood
given in (2.4). The value of 𝑚 that corresponds to the maximum value of
𝐿(𝑚, 𝑥/𝑚) among those computed equals 𝑚̂ 𝑀𝐿𝐸 .
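A minimal R sketch of this algorithm is given below; the sample (2, 2, 2, 4, 5) is the first of the three samples discussed above, and the variance is computed with the divide-by-n convention used in the text. The grid upper bound of 100 is an arbitrary but ample choice here.

# Sketch of the search for the binomial m: profile likelihood L(m, xbar/m) over m >= max(x)
x <- c(2, 2, 2, 4, 5)
xbar <- mean(x); sig2 <- mean((x - xbar)^2)
xbar > sig2                                    # TRUE here, so a finite mhat exists (Step 2)
Lm <- function(m) prod(dbinom(x, size = m, prob = xbar / m))
m_grid <- max(x):100
m_grid[which.max(sapply(m_grid, Lm))]          # equals 7 for this sample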

We note that if the underlying distribution is the binomial distribution with


parameters (𝑚, 𝑞) (with 𝑞 > 0) then 𝑚̂ 𝑀𝐿𝐸 equals 𝑚 for large sample sizes.
Also, \hat{q}_{MLE} will have an asymptotically normal distribution and converge with
probability one to 𝑞.

Special Case: Negative Binomial Distribution. The case of the negative


binomial distribution is similar to that of the binomial distribution in the sense
that we have two parameters and the mles are not available in closed form. A
difference between them is that unlike the binomial parameter 𝑚 which takes
positive integer values, the parameter 𝑟 of the negative binomial can assume any
positive real value. This makes the optimization problem a tad more complex.
We begin by observing that the likelihood can be expressed in the following

Figure 2.2: Plot of L(m, \bar{x}/m) for a Binomial Distribution

form:
L(r, \beta) = \left(\prod_{i=1}^{n} \binom{r + x_i - 1}{x_i}\right) (1+\beta)^{-n(r+\bar{x})}\, \beta^{n\bar{x}}.

The above implies that the log-likelihood is given by

l(r, \beta) = \sum_{i=1}^{n} \log\binom{r + x_i - 1}{x_i} - n(r + \bar{x})\log(1+\beta) + n\bar{x}\log\beta,

and hence
\frac{\partial}{\partial \beta}\, l(r, \beta) = -\frac{n(r+\bar{x})}{1+\beta} + \frac{n\bar{x}}{\beta}.

Equating the above to zero, we get

\hat{r}_{MLE} \times \hat{\beta}_{MLE} = \bar{x}.

The above reduces the two dimensional optimization problem to a one-


dimensional problem - we need to maximize
l(r, \bar{x}/r) = \sum_{i=1}^{n} \log\binom{r + x_i - 1}{x_i} - n(r+\bar{x})\log(1 + \bar{x}/r) + n\bar{x}\log(\bar{x}/r),

with respect to 𝑟, with the maximizing 𝑟 being its mle and \hat{\beta}_{MLE} = \bar{x}/\hat{r}_{MLE}.
In Levin et al. (1977) it is shown that if the sample variance is greater than
the sample mean then there exists a unique 𝑟 > 0 that maximizes 𝑙(𝑟, 𝑥/𝑟) and
hence a unique mle for 𝑟 and 𝛽. Also, they show that if 𝜎̂ 2 ≤ 𝑥, then the
negative binomial likelihood will be dominated by the Poisson likelihood with
𝜆̂ = 𝑥. In other words, a Poisson distribution offers a better fit to the data. The
guarantee in the case of 𝜎̂ 2 > 𝜇̂ permits us to use some algorithm to maximize
𝑙(𝑟, 𝑥/𝑟). Towards an alternate method of computing the likelihood, we note
that

l(r, \bar{x}/r) = \sum_{i=1}^{n}\sum_{j=1}^{x_i} \log(r - 1 + j) - \sum_{i=1}^{n}\log(x_i!) - n(r+\bar{x})\log(r+\bar{x}) + nr\log(r) + n\bar{x}\log(\bar{x}),

which yields

\frac{1}{n}\frac{\partial}{\partial r}\, l(r, \bar{x}/r) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{x_i} \frac{1}{r-1+j} - \log(r+\bar{x}) + \log(r).

We note that, in the above expressions for the terms involving a double sum-
mation, the inner sum equals zero if 𝑥𝑖 = 0. The maximum likelihood estimate
for 𝑟 is a root of the last expression and we can use a root finding algorithm to
compute it. Also, we have
\frac{1}{n}\frac{\partial^2}{\partial r^2}\, l(r, \bar{x}/r) = \frac{\bar{x}}{r(r+\bar{x})} - \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{x_i} \frac{1}{(r-1+j)^2}.

A simple but quickly converging iterative root-finding algorithm is Newton's
method, which incidentally the Babylonians are believed to have used for com-
puting square roots. Under this method, an initial approximation is selected
for the root and new approximations for the root are successively generated
until convergence. Applying the Newton’s method to our problem results in the
following algorithm:
Step i. Choose an approximate solution, say 𝑟0 . Set 𝑘 to 0.
Step ii. Define 𝑟𝑘+1 as
r_{k+1} = r_k - \frac{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{x_i}\frac{1}{r_k - 1 + j} - \log(r_k + \bar{x}) + \log(r_k)}{\frac{\bar{x}}{r_k(r_k + \bar{x})} - \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{x_i}\frac{1}{(r_k - 1 + j)^2}}.

Step iii. If r_{k+1} ≈ r_k, then report r_{k+1} as the maximum likelihood estimate; else
increment 𝑘 by 1 and repeat Step ii.

For example, we simulated a sample of 5 observations, 41, 49, 40, 27, 23, from the
negative binomial distribution with parameters 𝑟 = 10 and 𝛽 = 5. We choose the
starting value of 𝑟 such that

r\beta = \hat{\mu} \quad \text{and} \quad r\beta(1+\beta) = \hat{\sigma}^2,

where 𝜇̂ is the estimated mean and 𝜎̂² is the estimated variance. This leads to the
starting value for 𝑟 of 23.14286. The iterates of 𝑟 from Newton's method are

21.39627, \quad 21.60287, \quad 21.60647, \quad 21.60647;

the rapid convergence seen above is typical of Newton's method. Hence in
this example, \hat{r}_{MLE} ≈ 21.60647 and \hat{\beta}_{MLE} = 1.66616.
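The following R sketch reproduces this calculation; it is a direct transcription of Steps i-iii above, with the biased (divide-by-n) sample variance used to form the starting value.

# Newton's method for the profile log-likelihood in r; data are the five simulated observations above
x <- c(41, 49, 40, 27, 23)
n <- length(x); xbar <- mean(x)
sig2 <- mean((x - xbar)^2)                     # divide-by-n sample variance
r <- xbar^2 / (sig2 - xbar)                    # starting value from r*beta = mean, r*beta*(1+beta) = variance
for (iter in 1:25) {
  d1 <- sum(sapply(x, function(xi) if (xi > 0) sum(1 / (r - 1 + 1:xi))   else 0)) / n
  d2 <- sum(sapply(x, function(xi) if (xi > 0) sum(1 / (r - 1 + 1:xi)^2) else 0)) / n
  lp  <- d1 - log(r + xbar) + log(r)           # (1/n) times the first derivative
  lpp <- xbar / (r * (r + xbar)) - d2          # (1/n) times the second derivative
  r_new <- r - lp / lpp                        # Newton update (Step ii)
  if (abs(r_new - r) < 1e-8) { r <- r_new; break }
  r <- r_new
}
c(r, xbar / r)                                 # approximately 21.60647 and 1.66616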

To summarize our discussion of MLE for the (𝑎, 𝑏, 0) class of distributions, in Fig-
ure 2.3 below we plot the maximum value of the Poisson likelihood, 𝐿(𝑚, 𝑥/𝑚)
for the binomial, and 𝐿(𝑟, 𝑥/𝑟) for the negative binomial, for the three sam-
ples of size 5 given in Table 2.1. The data was constructed to cover the three
orderings of the sample mean and variance. As shown in the Figure 2.3, and
supported by theory, if 𝜇̂ < 𝜎̂ 2 then the negative binomial results in a higher
maximum likelihood value; if 𝜇̂ = 𝜎̂ 2 the Poisson has the highest likelihood
value; and finally in the case that 𝜇̂ > 𝜎̂ 2 the binomial gives a better fit than
the others. So before fitting frequency data with an (𝑎, 𝑏, 0) distribution, it
is best to start by examining the ordering of 𝜇̂ and 𝜎̂². We again emphasize
that the Poisson is on the boundary of the negative binomial and binomial
distributions. So in the case that 𝜇̂ ≥ 𝜎̂ 2 (𝜇̂ ≤ 𝜎̂ 2 , resp.) the Poisson yields
a better fit than the negative binomial (binomial, resp.), which is indicated by
𝑟 ̂ = ∞ (𝑚̂ = ∞, respectively).

Table 2.1. Three Samples of Size 5

Data                  Mean (𝜇̂)    Variance (𝜎̂²)
(2, 3, 6, 8, 9)        5.60         7.44
(2, 5, 6, 8, 9)        6            6
(4, 7, 8, 10, 11)      8            6

Figure 2.3: Plot of (𝑎, 𝑏, 0) Partially Maximized Likelihoods

2.5 Other Frequency Distributions

In this section, you learn how to:

• Define the (a,b,1) class of frequency distributions and discuss the impor-
tance of the recursive relationship underpinning this class of distributions
• Interpret zero truncated and modified versions of the binomial, Poisson,
and negative binomial distributions
• Compute probabilities using the recursive relationship

In the previous sections, we discussed three distributions with supports con-


tained in the set of non-negative integers, which well cater to many insurance
applications. Moreover, typically by allowing the parameters to be a function
of known (to the insurer) explanatory variables such as age, sex, geographic
location (territory), and so forth, these distributions allow us to explain claim
probabilities in terms of these variables. The field of statistical study that stud-
ies such models is known as regression analysis - it is an important topic of
actuarial interest that we will not pursue in this book; see Frees (2009).

There are clearly infinitely many other count distributions, and more impor-
tantly the above distributions by themselves do not cater to all practical needs.
In particular, one feature of some insurance data is that the proportion of zero
counts can be out of place with the proportion of other counts to be explainable
by the above distributions. In the following we modify the above distributions
to allow for arbitrary probability for zero count irrespective of the assignment of
relative probabilities for the other counts. Another feature of a data set which
is naturally comprised of homogeneous subsets is that while the above distribu-
tions may provide good fits to each subset, they may fail to do so to the whole
data set. Later we naturally extend the (𝑎, 𝑏, 0) distributions to be able to cater
to, in particular, such data sets.

2.5.1 Zero Truncation or Modification


Let us suppose that we are looking at auto insurance policies which appear in a
database of auto claims made in a certain period. If one is to study the number
of claims that these policies have made during this period, then clearly the
distribution has to assign a probability of zero to the count variable assuming
the value zero. In other words, by restricting attention to count data from
policies in the database of claims, we have in a sense zero-truncated the count
data of all policies. In personal lines (like auto), policyholders may not want
to report that first claim because of fear that it may increase future insurance
rates - this behavior inflates the proportion of zero counts. Examples such as the
latter modify the proportion of zero counts. Interestingly, natural modifications
of the three distributions considered above are able to provide good fits to zero-
modified/truncated data sets arising in insurance.

As presented below, we modify the probability assigned to zero count by the


(𝑎, 𝑏, 0) class while maintaining the relative probabilities assigned to non-zero
counts - zero modification. Note that since the (𝑎, 𝑏, 0) class of distributions sat-
isfies the recurrence (2.1), maintaining relative probabilities of non-zero counts
implies that recurrence (2.1) is satisfied for 𝑘 ≥ 2. This leads to the definition
of the following class of distributions.

Definition. A count distribution is a member of the (𝑎, 𝑏, 1) class if for some


constants 𝑎 and 𝑏 the probabilities 𝑝𝑘 satisfy

\frac{p_k}{p_{k-1}} = a + \frac{b}{k}, \qquad k \ge 2. \qquad (2.5)

Note that since the recursion starts with 𝑝1 , and not 𝑝0 , we refer to this super-
class of (𝑎, 𝑏, 0) distributions by (a,b,1). To understand this class, recall that
each valid pair of values for 𝑎 and 𝑏 of the (𝑎, 𝑏, 0) class corresponds to a unique
vector of probabilities {𝑝𝑘 }𝑘≥0 . If we now look at the probability vector {𝑝𝑘̃ }𝑘≥0
given by
\tilde{p}_k = \frac{1 - \tilde{p}_0}{1 - p_0} \cdot p_k, \qquad k \ge 1,
where 𝑝0̃ ∈ [0, 1) is arbitrarily chosen, then since the relative probabilities for
positive values according to {𝑝𝑘 }𝑘≥0 and {𝑝𝑘̃ }𝑘≥0 are the same, we have {𝑝𝑘̃ }𝑘≥0
satisfies recurrence (2.5). This, in particular, shows that the class of (𝑎, 𝑏, 1)
distributions is strictly wider than that of (𝑎, 𝑏, 0).
In the above, we started with a pair of values for 𝑎 and 𝑏 that led to a valid
(𝑎, 𝑏, 0) distribution, and then looked at the (𝑎, 𝑏, 1) distributions that corre-
sponded to this (𝑎, 𝑏, 0) distribution. We now argue that the (𝑎, 𝑏, 1) class allows
for a larger set of permissible distributions for 𝑎 and 𝑏 than the (𝑎, 𝑏, 0) class.
Recall from Section 2.3 that in the case of 𝑎 < 0 we did not use the fact that
the recurrence (2.1) started at 𝑘 = 1, and hence the set of pairs (𝑎, 𝑏) with
𝑎 < 0 that are permissible for the (𝑎, 𝑏, 0) class is identical to those that are
permissible for the (𝑎, 𝑏, 1) class. The same conclusion is easily drawn for pairs
with 𝑎 = 0. In the case that 𝑎 > 0, instead of the constraint 𝑎 + 𝑏 > 0 for the
(𝑎, 𝑏, 0) class we now have the weaker constraint of 𝑎 + 𝑏/2 > 0 for the (𝑎, 𝑏, 1)
class. With the parametrization 𝑏 = (𝑟 − 1)𝑎 as used in Section 2.3, instead of
𝑟 > 0 we now have the weaker constraint of 𝑟 > −1. In particular, we see that
while zero modifying a (𝑎, 𝑏, 0) distribution leads to a distribution in the (𝑎, 𝑏, 1)
class, the conclusion does not hold in the other direction.
Zero modification of a count distribution 𝐹 such that it assigns zero probability
to zero count is called a zero truncation of 𝐹 . Hence, the zero truncated version
of probabilities {𝑝𝑘 }𝑘≥0 is given by

\tilde{p}_k = \begin{cases} 0, & k = 0; \\ \dfrac{p_k}{1-p_0}, & k \ge 1. \end{cases}

In particular, we have that a zero modification of a count distribution {𝑝𝑘 }𝑘≥0 ,
denoted by {𝑝𝑘𝑀 }𝑘≥0 , can be written as a convex combination of the degenerate
distribution at 0 and the zero truncation of {𝑝𝑘 }𝑘≥0 , denoted by {𝑝𝑘𝑇 }𝑘≥0 . That
is, we have

p_k^M = p_0^M \cdot \delta_0(k) + (1 - p_0^M) \cdot p_k^T, \qquad k \ge 0.

Example 2.5.1. Zero Truncated/Modified Poisson. Consider a Poisson


distribution with parameter 𝜆 = 2. Calculate 𝑝𝑘 , 𝑘 = 0, 1, 2, 3, for the usual
(unmodified), truncated and a modified version with (𝑝0𝑀 = 0.6).

Solution. For the Poisson distribution as a member of the (𝑎, 𝑏,0) class, we
have 𝑎 = 0 and 𝑏 = 𝜆 = 2. Thus, we may use the recursion 𝑝𝑘 = 𝜆𝑝𝑘−1 /𝑘 =
2𝑝𝑘−1 /𝑘 for each type, after determining starting probabilities. The calculation
of probabilities for 𝑘 ≤ 3 is shown in Table 2.2.
Table 2.2. Calculation of Probabilities for k ≤ 3

k    p_k                               p_k^T                                p_k^M
0    p_0 = e^{-λ} = 0.135335           0                                    0.6
1    p_1 = p_0(0 + λ/1) = 0.27067      p_1^T = p_1/(1 - p_0) = 0.313035     p_1^M = [(1 - p_0^M)/(1 - p_0)] p_1 = 0.125214
2    p_2 = p_1(λ/2) = 0.27067          p_2^T = p_1^T(λ/2) = 0.313035        p_2^M = p_1^M(λ/2) = 0.125214
3    p_3 = p_2(λ/3) = 0.180447         p_3^T = p_2^T(λ/3) = 0.208690        p_3^M = p_2^M(λ/3) = 0.083476
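These values are straightforward to reproduce in R:

# Zero-truncated and zero-modified Poisson(2) probabilities for k = 0, ..., 3
lambda <- 2
p   <- dpois(0:3, lambda)                       # unmodified probabilities p_k
pT  <- c(0, p[-1] / (1 - p[1]))                 # zero-truncated version
p0M <- 0.6
pM  <- c(p0M, (1 - p0M) / (1 - p[1]) * p[-1])   # zero-modified version with p_0^M = 0.6
round(cbind(k = 0:3, p, pT, pM), 6)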

2.6 Mixture Distributions

In this section, you learn how to:


• Define a mixture distribution when the mixing component is based on a
finite number of sub-groups
• Compute mixture distribution probabilities from mixing proportions and
knowledge of the distribution of each subgroup
• Define a mixture distribution when the mixing component is continuous

In many applications, the underlying population consists of naturally defined


sub-groups with some homogeneity within each sub-group. In such cases it
is convenient to model the individual sub-groups, and in a ground-up manner
model the whole population. As we shall see below, beyond the aesthetic appeal
of the approach, it also extends the range of applications that can be catered to
by standard parametric distributions.
Let 𝑘 denote the number of defined sub-groups in a population, and let 𝐹𝑖 denote
the distribution of an observation drawn from the 𝑖-th subgroup. If we let 𝛼𝑖
denote the proportion of the population in the 𝑖-th subgroup, with \sum_{i=1}^{k} \alpha_i = 1,
then the distribution of a randomly chosen observation from the population,
denoted by 𝐹 , is given by

F(x) = \sum_{i=1}^{k} \alpha_i \cdot F_i(x). \qquad (2.6)

The above expression can be seen as a direct application of the Law of Total
Probability. As an example, consider a population of drivers split broadly into

two sub-groups, those with at most five years of driving experience and those
with more than five years experience. Let 𝛼 denote the proportion of drivers
with less than 5 years experience, and 𝐹≤5 and 𝐹>5 denote the distribution of
the count of claims in a year for a driver in each group, respectively. Then the
distribution of claim count of a randomly selected driver is given by

𝛼 ⋅ 𝐹≤5 (𝑥) + (1 − 𝛼)𝐹>5 (𝑥).

An alternate definition of a mixture distribution is as follows. Let 𝑁𝑖 be a


random variable with distribution 𝐹𝑖 , 𝑖 = 1, … , 𝑘. Let 𝐼 be a random vari-
able taking values 1, 2, … , 𝑘 with probabilities 𝛼1 , … , 𝛼𝑘 , respectively. Then the
random variable 𝑁𝐼 has a distribution given by equation (2.6)7 .
In (2.6) we see that the distribution function is a convex combination of the
component distribution functions. This result easily extends to the probability
mass function, the survival function, the raw moments, and the expectation as
these are all linear mappings of the distribution function. We note that this is
not true for central moments like the variance, and conditional measures like
the hazard rate function. In the case of variance it is easily seen as

\mathrm{Var}[N_I] = \mathrm{E}[\mathrm{Var}[N_I|I]] + \mathrm{Var}[\mathrm{E}[N_I|I]] = \sum_{i=1}^{k} \alpha_i \mathrm{Var}[N_i] + \mathrm{Var}[\mathrm{E}[N_I|I]]. \qquad (2.7)

Appendix Chapter 16 provides additional background about this important ex-


pression.
Example 2.6.1. Actuarial Exam Question. In a certain town the number
of common colds an individual will get in a year follows a Poisson distribution
that depends on the individual’s age and smoking status. The distribution of
the population and the mean number of colds are as follows:
Table 2.3. The Distribution of the Population and the Mean Number
of Colds

                      Proportion of population    Mean number of colds
Children                        0.3                         3
Adult Non-Smokers               0.6                         1
Adult Smokers                   0.1                         4

1. Calculate the probability that a randomly drawn person has 3 common


colds in a year.
2. Calculate the conditional probability that a person with exactly 3 common
colds in a year is an adult smoker.
7 This in particular lays out a way to simulate from a mixture distribution that makes use

of efficient simulation schemes that may exist for the component distributions.

Solution.
1. Using Law of Total Probability, we can write the required probability as
Pr(𝑁𝐼 = 3), with 𝐼 denoting the group of the randomly selected individual
with 1, 2 and 3 signifying the groups Children, Adult Non-Smoker, and
Adult Smoker, respectively. Now by conditioning we get

Pr(𝑁𝐼 = 3) = 0.3 ⋅ Pr(𝑁1 = 3) + 0.6 ⋅ Pr(𝑁2 = 3) + 0.1 ⋅ Pr(𝑁3 = 3),

with 𝑁1 , 𝑁2 and 𝑁3 following Poisson distributions with means 3, 1, and


4, respectively. Using the above, we get Pr(𝑁𝐼 = 3) ≈ 0.1235.
2. The conditional probability of event A given event B is Pr(𝐴|𝐵) = Pr(𝐴, 𝐵)/Pr(𝐵).
The required conditional probability in this problem can then be written
as Pr(𝐼 = 3|𝑁𝐼 = 3), which equals

\Pr(I = 3 \mid N_I = 3) = \frac{\Pr(I = 3, N_3 = 3)}{\Pr(N_I = 3)} = \frac{0.1 \times 0.1954}{0.1235} \approx 0.1581.
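Both numbers are easy to verify in R:

# Check for Example 2.6.1: a mixture of three Poisson sub-groups
alpha <- c(0.3, 0.6, 0.1)                # Children, Adult Non-Smokers, Adult Smokers
means <- c(3, 1, 4)
p3 <- sum(alpha * dpois(3, means))       # Pr(N_I = 3), approximately 0.1235
p3
alpha[3] * dpois(3, means[3]) / p3       # Pr(I = 3 | N_I = 3), approximately 0.1581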

In the above example, the number of subgroups 𝑘 was equal to three. In general,
𝑘 can be any natural number, but when 𝑘 is large it is parsimonious from a
modeling point of view to take the following infinitely many subgroup approach.
To motivate this approach, let the 𝑖-th subgroup be such that its component
distribution 𝐹𝑖 is given by G_{\tilde{\theta}_i}, where G_{\cdot} is a parametric family of distributions
with parameter space \Theta \subseteq \mathbb{R}^d. With this assumption, the distribution function
𝐹 of a randomly drawn observation from the population is given by

F(x) = \sum_{i=1}^{k} \alpha_i\, G_{\tilde{\theta}_i}(x), \qquad \forall x \in \mathbb{R},

similar to equation (2.6). Alternately, it can be written as

𝐹 (𝑥) = E[𝐺𝜗̃(𝑥)], ∀𝑥 ∈ ℝ,
where 𝜗 ̃ takes values 𝜃𝑖̃ with probability 𝛼𝑖 , for 𝑖 = 1, … , 𝑘. The above makes it
clear that when 𝑘 is large, one could model the above by treating 𝜗̃ as a continuous
random variable.
To illustrate this approach, suppose we have a population of drivers with the
distribution of claims for an individual driver being distributed as a Poisson.
Each person has their own (personal) expected number of claims 𝜆 - smaller
values for good drivers, and larger values for others. There is a distribution of 𝜆
in the population; a popular and convenient choice for modeling this distribution
is a gamma distribution with parameters (𝛼, 𝜃) (the gamma distribution will be
introduced formally in Section 3.2.1). With these specifications it turns out
that the resulting distribution of 𝑁 , the claims of a randomly chosen driver, is a
negative binomial with parameters (𝑟 = 𝛼, 𝛽 = 𝜃). This can be shown in many
ways, but a straightforward argument is as follows:

\Pr(N = k) = \int_0^{\infty} \frac{e^{-\lambda}\lambda^k}{k!}\, \frac{\lambda^{\alpha-1} e^{-\lambda/\theta}}{\Gamma(\alpha)\theta^{\alpha}}\, d\lambda = \frac{1}{k!\,\Gamma(\alpha)\theta^{\alpha}} \int_0^{\infty} \lambda^{\alpha+k-1} e^{-\lambda(1+1/\theta)}\, d\lambda
            = \frac{\Gamma(\alpha+k)}{k!\,\Gamma(\alpha)\,\theta^{\alpha}\,(1+1/\theta)^{\alpha+k}}
            = \binom{\alpha+k-1}{k} \left(\frac{1}{1+\theta}\right)^{\alpha} \left(\frac{\theta}{1+\theta}\right)^{k}, \qquad k = 0, 1, \ldots

Note that the above derivation implicitly uses the following:

f_{N|\Lambda=\lambda}(N = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k \ge 0; \qquad \text{and} \qquad f_{\Lambda}(\lambda) = \frac{\lambda^{\alpha-1} e^{-\lambda/\theta}}{\Gamma(\alpha)\theta^{\alpha}}, \quad \lambda > 0.

By considering mixtures of a parametric class of distributions, we increase the


richness of the class. This expansion of distributions results in the mixture
class being able to cater well to more applications that the parametric class we
started with. Mixture modeling is an important modeling technique in insur-
ance applications and later chapters will cover more aspects of this modeling
technique.
Example 2.6.2. Suppose that 𝑁 |Λ ∼ Poisson(Λ) and that Λ ∼ gamma with
mean of 1 and variance of 2. Determine the probability that 𝑁 = 1.
Solution. For a gamma distribution with parameters (𝛼, 𝜃), we have that the
mean is 𝛼𝜃 and the variance is 𝛼𝜃2 . Using these expressions we have
\alpha = \frac{1}{2} \quad \text{and} \quad \theta = 2.

Now, one can directly use the above result to conclude that 𝑁 is distributed as
a negative binomial with r = \alpha = \tfrac{1}{2} and \beta = \theta = 2. Thus

\Pr(N = 1) = \binom{1 + r - 1}{1} \left(\frac{1}{(1+\beta)^r}\right) \left(\frac{\beta}{1+\beta}\right)
           = \binom{1 + \tfrac{1}{2} - 1}{1} \frac{1}{(1+2)^{1/2}} \left(\frac{2}{1+2}\right)
           = \frac{1}{3^{3/2}} = 0.19245.
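This result can be checked in R either directly through dnbinom or by numerically integrating the Poisson-gamma mixture:

# Check for Example 2.6.2: the Poisson-gamma mixture equals a negative binomial
dnbinom(1, size = 0.5, prob = 1 / (1 + 2))                                            # 0.1924501
integrate(function(l) dpois(1, l) * dgamma(l, shape = 0.5, scale = 2), 0, Inf)$value  # same value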

2.7 Goodness of Fit

In this section, you learn how to:


• Calculate a goodness of fit statistic to compare a hypothesized discrete
distribution to a sample of discrete observations

• Compare the statistic to a reference distribution to assess the adequacy of


the fit

In the above we have discussed three basic frequency distributions, along with
their extensions through zero modification/truncation and by looking at mix-
tures of these distributions. Nevertheless, these classes still remain parametric
and hence by their very nature a small subset of the class of all possible fre-
quency distributions (that is, the set of distributions on non-negative integers).
Hence, even though we have talked about methods for estimating the unknown
parameters, the fitted distribution need not be a good representation of the un-
derlying distribution if the latter is far from the class of distributions used for
modeling. In fact, it can be shown that the maximum likelihood estimator con-
verges to a value such that the corresponding distribution is a Kullback-Leibler
projection of the underlying distribution on the class of distributions used for
modeling. Below we present one testing method - Pearson’s chi-square statistic
- to check for the goodness of fit of the fitted distribution. For more details on
the Pearson’s chi-square test, at an introductory mathematical statistics level,
we refer the reader to Section 9.1 of Hogg et al. (2015).

In 1993, a portfolio of 𝑛 = 7, 483 automobile insurance policies from a ma-


jor Singaporean insurance company had the distribution of auto accidents per
policyholder as given in Table 2.4.

Table 2.4. Singaporean Automobile Accident Data

Count (k)                                    0      1     2    3    4    Total
No. of Policies with k accidents (m_k)     6,996   455    28    4    0    7,483

If we fit a Poisson distribution, then the mle for 𝜆, the Poisson mean, is the
sample mean, which is given by

\bar{N} = \frac{0 \cdot 6996 + 1 \cdot 455 + 2 \cdot 28 + 3 \cdot 4 + 4 \cdot 0}{7483} = 0.06989.

Now if we use Poisson (𝜆̂ 𝑀𝐿𝐸 ) as the fitted distribution, then a tabular compar-
ison of the fitted counts and observed counts is given by Table 2.5 below, where
𝑝𝑘̂ represents the estimated probabilities under the fitted Poisson distribution.

Table 2.5. Comparison of Observed to Fitted Counts: Singaporean Auto Data

Count (k)    Observed (m_k)    Fitted Counts Using Poisson (n p̂_k)
0                6,996                 6,977.86
1                  455                   487.70
2                   28                    17.04
3                    4                     0.40
≥ 4                  0                     0.01
Total            7,483                 7,483.00

While the fit seems reasonable, a tabular comparison falls short of a statistical
test of the hypothesis that the underlying distribution is indeed Poisson. The
Pearson’s chi-square statistic is a goodness of fit statistical measure that can
be used for this purpose. To explain this statistic, let us suppose that a dataset
of size n is grouped into K cells, with m_k/n and \hat{p}_k, for k = 1, \ldots, K, being the
observed and estimated probabilities of an observation belonging to the k-th
cell, respectively. The Pearson's chi-square test statistic is then given by

\sum_{k=1}^{K} \frac{(m_k - n\hat{p}_k)^2}{n\hat{p}_k}.

The motivation for the above statistic derives from the fact that

\sum_{k=1}^{K} \frac{(m_k - np_k)^2}{np_k}
has a limiting chi-square distribution with 𝐾 − 1 degrees of freedom if 𝑝𝑘 , 𝑘 =


1, … , 𝐾 are the true cell probabilities. Now suppose that only the summarized
data represented by 𝑚𝑘 , 𝑘 = 1, … , 𝐾 is available. Further, if 𝑝𝑘 ’s are functions
of 𝑠 parameters, replacing 𝑝𝑘 ’s by any efficiently estimated probabilities 𝑝𝑘̂ ’s
results in the statistic continuing to have a limiting chi-square distribution but
with degrees of freedom given by 𝐾 − 1 − 𝑠. Such efficient estimates can be
derived for example by using the mle method (with a multinomial likelihood)
or by estimating the 𝑠 parameters which minimizes the Pearson’s chi-square
statistic above. For example, the R code below calculates an estimate for 𝜆
along these lines, by minimizing the squared distance between the observed and
fitted cell probabilities; it results in the estimate 0.06623153, close to but different
from the mle of 𝜆 using the full data:
m  <- c(6996, 455, 28, 4, 0)                      # observed counts for k = 0, 1, 2, 3, >= 4
op <- m / sum(m)                                  # observed cell proportions
# squared distance between observed and fitted Poisson cell probabilities
g <- function(lam) { sum((op - c(dpois(0:3, lam), 1 - ppois(3, lam)))^2) }
optim(sum(op * (0:4)), g, method = "Brent", lower = 0, upper = 10)$par
When one uses the full data to estimate the probabilities, the asymptotic distri-
bution is in between chi-square distributions with parameters 𝐾 −1 and 𝐾 −1−𝑠.
In practice it is common to ignore this subtlety and assume the limiting chi-
square has 𝐾 − 1 − 𝑠 degrees of freedom. Interestingly, this practical shortcut
works quite well in the case of the Poisson distribution.

For the Singaporean auto data the Pearson’s chi-square statistic equals 41.98
using the full data mle for 𝜆. Using the limiting distribution of chi-square with
5 − 1 − 1 = 3 degrees of freedom, we see that the value of 41.98 is way out in
the tail (99-th percentile is below 12). Hence we can conclude that the Poisson
distribution provides an inadequate fit for the data.
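The computation of this statistic is a direct transcription of the table above:

# Pearson chi-square statistic for the Singaporean data under the fitted Poisson
m      <- c(6996, 455, 28, 4, 0)                        # observed counts for k = 0, 1, 2, 3, >= 4
n      <- sum(m)
lambda <- sum(m * (0:4)) / n                            # full-data mle, 0.06989
fitted <- n * c(dpois(0:3, lambda), 1 - ppois(3, lambda))
sum((m - fitted)^2 / fitted)                            # approximately 41.98
qchisq(0.99, df = 5 - 1 - 1)                            # about 11.34, far below the statistic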
In the above, we started with the cells as given in the above tabular summary.
In practice, a relevant question is how to define the cells so that the chi-square
distribution is a good approximation to the finite sample distribution of the
statistic. A rule of thumb is to define the cells in such a way to have at least
80%, if not all, of the cells having expected counts greater than 5. Also, it is
clear that a larger number of cells results in a higher power of the test, and
hence a simple rule of thumb is to maximize the number of cells such that each
cell has at least 5 observations.

2.8 Exercises
Theoretical Exercises
Exercise 2.1. Derive an expression for 𝑝𝑁 (⋅) in terms of 𝐹𝑁 (⋅) and 𝑆𝑁 (⋅).
Exercise 2.2. A measure of center of location must be equi-variant with
respect to shifts, or location transformations. In other words, if 𝑁1 and 𝑁2 are
two random variables such that 𝑁1 +𝑐 has the same distribution as 𝑁2 , for some
constant 𝑐, then the difference between the measures of the center of location
of 𝑁2 and 𝑁1 must equal 𝑐. Show that the mean satisfies this property.
Exercise 2.3. Measures of dispersion should be invariant with respect to shifts
and scale equi-variant. Show that standard deviation satisfies these properties
by doing the following:
• Show that for a random variable 𝑁 , its standard deviation equals that of
𝑁 + 𝑐, for any constant 𝑐.
• Show that for a random variable 𝑁 , its standard deviation equals 1/𝑐
times that of 𝑐𝑁 , for any positive constant 𝑐.
Exercise 2.4. Let 𝑁 be a random variable with probability mass function
given by
p_N(k) = \begin{cases} \left(\dfrac{6}{\pi^2}\right)\left(\dfrac{1}{k^2}\right), & k \ge 1; \\ 0, & \text{otherwise.} \end{cases}
Show that the mean of 𝑁 is ∞.
Exercise 2.5. Let 𝑁 be a random variable with a finite second moment. Show
that the function 𝜓(⋅) defined by
\psi(x) = \mathrm{E}\left[(N - x)^2\right], \qquad x \in \mathbb{R},
is minimized at 𝜇𝑁 without using calculus. Also, give a proof of this fact using
derivatives. Conclude that the minimum value equals the variance of 𝑁 .

Exercise 2.6. Derive the first two central moments of the (𝑎, 𝑏, 0) distributions
using the methods mentioned below:
• For the binomial distribution, derive the moments using only its pmf, then
its mgf, and then its pgf.
• For the Poisson distribution, derive the moments using only its mgf.
• For the negative binomial distribution, derive the moments using only its
pmf, and then its pgf.
Exercise 2.7. Let 𝑁1 and 𝑁2 be two independent Poisson random variables
with means 𝜆1 and 𝜆2 , respectively. Identify the conditional distribution of 𝑁1
given 𝑁1 + 𝑁2 .
Exercise 2.8. (Non-Uniqueness of the MLE) Consider the following para-
metric family of densities indexed by the parameter 𝑝 taking values in [0, 1]:

𝑓𝑝 (𝑥) = 𝑝 ⋅ 𝜙(𝑥 + 2) + (1 − 𝑝) ⋅ 𝜙(𝑥 − 2), 𝑥 ∈ ℝ,

where 𝜙(⋅) represents the standard normal density.


• Show that for all 𝑝 ∈ [0, 1], 𝑓𝑝 (⋅) above is a valid density function.
• Find an expression in 𝑝 for the mean and the variance of 𝑓𝑝 (⋅).
• Let us consider a sample of size one consisting of 𝑥. Show that when 𝑥
equals 0, the set of maximum likelihood estimates for 𝑝 equals [0, 1]; also
show that the mle is unique otherwise.
Exercise 2.9. Graph the region of the plane corresponding to values of (𝑎, 𝑏)
that give rise to valid (𝑎, 𝑏, 0) distributions. Do the same for (𝑎, 𝑏, 1) distribu-
tions.
Exercise 2.10. (Computational Complexity) For the (𝑎, 𝑏, 0) class of distri-
butions, count the number of basic mathematical operations (addition, subtrac-
tion, multiplication, division) needed to compute the 𝑛 probabilities 𝑝0 … 𝑝𝑛−1
using the recurrence relationship. For the negative binomial distribution with
non-integer 𝑟, count the number of such operations. What do you observe?
Exercise 2.11. (** **) Using the development of Section 2.3 rigorously show
that not only does the recurrence (2.1) tie the binomial, the Poisson and the
negative binomial distributions together, but that it also characterizes them.

Exercises with a Practical Focus


Exercise 2.12. Actuarial Exam Question. You are given:
1. 𝑝𝑘 denotes the probability that the number of claims equals 𝑘 for 𝑘 =
0, 1, 2, …
2. \frac{p_n}{p_m} = \frac{m!}{n!}, \quad m \ge 0,\ n \ge 0

Using the corresponding zero-modified claim count distribution with 𝑝0𝑀 = 0.1,
calculate 𝑝1𝑀 .

Exercise 2.13. Actuarial Exam Question. During a one-year period, the


number of accidents per day was distributed as follows:

No. of Accidents 0 1 2 3 4 5
No. of Days 209 111 33 7 5 2

You use a chi-square test to measure the fit of a Poisson distribution with mean
0.60. The minimum expected number of observations in any group should be
5. The maximum number of groups should be used. Determine the value of the
chi-square statistic.

Additional Exercises
Here are a set of exercises that guide the viewer through some of the theoretical
foundations of Loss Data Analytics. Each tutorial is based on one or more
questions from the professional actuarial examinations – typically the Society
of Actuaries Exam C/STAM.
Frequency Distribution Guided Tutorials

2.9 Further Resources and Contributors


Appendix Chapter 15 gives a general introduction to maximum likelihood theory
regarding estimation of parameters from a parametric family. Appendix Chapter
17 gives more specific examples and expands some of the concepts.

Contributors
• N.D. Shyamalkumar, The University of Iowa, and Krupa
Viswanathan, Temple University, are the principal authors of the
initial version of this chapter. Email: [email protected] for
chapter comments and suggested improvements.
• Chapter reviewers include: Chunsheng Ban, Paul Johnson, Hirokazu
(Iwahiro) Iwasawa, Dalia Khalil, Tatjana Miljkovic, Rajesh Sahasrabud-
dhe, and Michelle Xia.

2.9.1 TS 2.A. R Code for Plots


Code for Figure 2.2:

Code for Figure 2.3:


Chapter 3

Modeling Loss Severity

Chapter Preview. The traditional loss distribution approach to modeling aggre-


gate losses starts by separately fitting a frequency distribution to the number of
losses and a severity distribution to the size of losses. The estimated aggregate
loss distribution combines the loss frequency distribution and the loss severity
distribution by convolution. Discrete distributions often referred to as counting
or frequency distributions were used in Chapter 2 to describe the number of
events such as number of accidents to the driver or number of claims to the
insurer. Lifetimes, asset values, losses and claim sizes are usually modeled as
continuous random variables and as such are modeled using continuous distribu-
tions, often referred to as loss or severity distributions. A mixture distribution
is a weighted combination of simpler distributions that is used to model phe-
nomenon investigated in a heterogeneous population, such as modeling more
than one type of claims in liability insurance (small frequent claims and large
relatively rare claims). In this chapter we explore the use of continuous as well
as mixture distributions to model the random size of loss. Sections 3.1 and
3.2 present key attributes that characterize continuous models and means of
creating new distributions from existing ones. Section 3.4 describes the effect
of coverage modifications, which change the conditions that trigger a payment,
such as applying deductibles, limits, or adjusting for inflation, on the distribu-
tion of individual loss amounts. For calibrating models, Section 3.5 deepens our
understanding of maximum likelihood methods. The frequency distributions
from Chapter 2 will be combined with the ideas from this chapter to describe
the aggregate losses over the whole portfolio in Chapter 5.

3.1 Basic Distributional Quantities

In this section, you learn how to define some basic distributional quantities:


• moments,
• percentiles, and
• generating functions.

3.1.1 Moments
Let 𝑋 be a continuous random variable with probability density function (pdf )
𝑓𝑋 (𝑥) and distribution function 𝐹𝑋 (𝑥). The k-th raw moment of 𝑋, denoted
by 𝜇′𝑘 , is the expected value of the k-th power of 𝑋, provided it exists. The first
raw moment 𝜇′1 is the mean of 𝑋 usually denoted by 𝜇. The formula for 𝜇′𝑘 is
given as

\mu'_k = \mathrm{E}(X^k) = \int_0^{\infty} x^k f_X(x)\, dx.

The support of the random variable 𝑋 is assumed to be nonnegative since actu-


arial phenomena are rarely negative. For example, an easy integration by parts
shows that the raw moments for nonnegative variables can also be computed
using

\mu'_k = \int_0^{\infty} k\, x^{k-1} \left[1 - F_X(x)\right] dx,

that is based on the survival function, denoted as 𝑆𝑋 (𝑥) = 1 − 𝐹𝑋 (𝑥). This


formula is particularly useful when 𝑘 = 1. Section 3.4.2 discusses this approach
in more detail.
The k-th central moment of 𝑋, denoted by 𝜇𝑘 , is the expected value of the k-th
power of the deviation of 𝑋 from its mean 𝜇. The formula for 𝜇𝑘 is given as

\mu_k = \mathrm{E}\left[(X - \mu)^k\right] = \int_0^{\infty} (x - \mu)^k f_X(x)\, dx.

The second central moment 𝜇2 defines the variance of 𝑋, denoted by 𝜎2 . The


square root of the variance is the standard deviation 𝜎.
From a classical perspective, further characterization of the shape of the dis-
tribution includes its degree of symmetry as well as its flatness compared to
the normal distribution. The ratio of the third central moment to the cube
of the standard deviation (𝜇3 /𝜎3 ) defines the coefficient of skewness which is
a measure of symmetry. A positive coefficient of skewness indicates that the
distribution is skewed to the right (positively skewed). The ratio of the fourth
central moment to the fourth power of the standard deviation (𝜇4 /𝜎4 ) defines
the coefficient of kurtosis. The normal distribution has a coefficient of kurtosis
of 3. Distributions with a coefficient of kurtosis greater than 3 have heavier tails
than the normal, whereas distributions with a coefficient of kurtosis less than 3
have lighter tails and are flatter. Section 10.2 describes the tails of distributions
from an insurance and actuarial perspective.

Example 3.1.1. Actuarial Exam Question. Assume that the rv 𝑋 has a


gamma distribution with mean 8 and skewness 1. Find the variance of 𝑋. (Hint:
The gamma distribution is reviewed in Section 3.2.1.)
Solution. The pdf of X is given by

f_X(x) = \frac{(x/\theta)^{\alpha}}{x\,\Gamma(\alpha)}\, e^{-x/\theta}

for x > 0. For \alpha > 0, the k-th raw moment is

\mu'_k = \mathrm{E}(X^k) = \frac{1}{\Gamma(\alpha)\theta^{\alpha}} \int_0^{\infty} x^{k+\alpha-1} e^{-x/\theta}\, dx = \frac{\Gamma(k+\alpha)}{\Gamma(\alpha)}\, \theta^k.

Given \Gamma(r+1) = r\Gamma(r) and \Gamma(1) = 1, then \mu'_1 = \mathrm{E}(X) = \alpha\theta, \mu'_2 = \mathrm{E}(X^2) = (\alpha+1)\alpha\theta^2, \mu'_3 = \mathrm{E}(X^3) = (\alpha+2)(\alpha+1)\alpha\theta^3, and \mathrm{Var}(X) = (\alpha+1)\alpha\theta^2 - (\alpha\theta)^2 = \alpha\theta^2.

\text{Skewness} = \frac{\mathrm{E}\left[(X-\mu'_1)^3\right]}{(\mathrm{Var}\,X)^{3/2}} = \frac{\mu'_3 - 3\mu'_2\mu'_1 + 2\mu'^{\,3}_1}{(\mathrm{Var}\,X)^{3/2}} = \frac{(\alpha+2)(\alpha+1)\alpha\theta^3 - 3(\alpha+1)\alpha^2\theta^3 + 2\alpha^3\theta^3}{(\alpha\theta^2)^{3/2}} = \frac{2}{\alpha^{1/2}} = 1.

Hence, \alpha = 4. Since \mathrm{E}(X) = \alpha\theta = 8, then \theta = 2 and finally, \mathrm{Var}(X) = \alpha\theta^2 = 16.

3.1.2 Quantiles
Quantiles can also be used to describe the characteristics of the distribution of
𝑋. When the distribution of 𝑋 is continuous, for a given fraction 0 ≤ 𝑝 ≤ 1 the
corresponding quantile is the solution of the equation

𝐹𝑋 (𝜋𝑝 ) = 𝑝.

For example, the middle point of the distribution, 𝜋0.5 , is the median. A per-
centile is a type of quantile; a 100𝑝 percentile is the number such that 100 × 𝑝
percent of the data is below it.
Example 3.1.2. Actuarial Exam Question. Let X be a continuous random
variable with density function f_X(x) = \theta e^{-\theta x}, for x > 0 and 0 elsewhere. If the
median of this distribution is 1/3, find \theta.

Solution. The distribution function is F_X(x) = 1 - e^{-\theta x}. So, F_X(\pi_{0.5}) = 1 - e^{-\theta\pi_{0.5}} = 0.5.
As \pi_{0.5} = 1/3, we have F_X(1/3) = 1 - e^{-\theta/3} = 0.5 and \theta = 3\log 2.

Section 4.1.1 will extend the definition of quantiles to include distributions that
are discrete, continuous, or a hybrid combination.

3.1.3 Moment Generating Function


The moment generating function (mgf), denoted by 𝑀𝑋 (𝑡) uniquely character-
izes the distribution of 𝑋. While it is possible for two different distributions to
have the same moments and yet still differ, this is not the case with the moment
generating function. That is, if two random variables have the same moment
generating function, then they have the same distribution. The moment gener-
ating function is given by

M_X(t) = \mathrm{E}(e^{tX}) = \int_0^{\infty} e^{tx} f_X(x)\, dx

for all 𝑡 for which the expected value exists. The mgf is a real function whose
k-th derivative at zero is equal to the k-th raw moment of 𝑋. In symbols, this
is
\left.\frac{d^k}{dt^k} M_X(t)\right|_{t=0} = \mathrm{E}(X^k).

Example 3.1.3. Actuarial Exam Question. The random variable 𝑋 has an


exponential distribution with mean 1/b. It is found that M_X(-b^2) = 0.2. Find
b. (Hint: The exponential is a special case of the gamma distribution which is
reviewed in Section 3.2.1.)

Solution. With X having an exponential distribution with mean 1/b, we have that

M_X(t) = \mathrm{E}(e^{tX}) = \int_0^{\infty} e^{tx}\, b e^{-bx}\, dx = \int_0^{\infty} b e^{-x(b-t)}\, dx = \frac{b}{b-t}.

Then,

M_X(-b^2) = \frac{b}{b+b^2} = \frac{1}{1+b} = 0.2.

Thus, b = 4.

Example 3.1.4. Actuarial Exam Question. Let 𝑋1 , … , 𝑋𝑛 be independent


random variables, where 𝑋𝑖 has a gamma distribution with parameters 𝛼𝑖 and 𝜃.
Find the distribution of S = \sum_{i=1}^{n} X_i, the mean \mathrm{E}(S), and the variance \mathrm{Var}(S).
Solution.

The mgf of S is

M_S(t) = \mathrm{E}(e^{tS}) = \mathrm{E}\left(e^{t\sum_{i=1}^{n} X_i}\right) = \mathrm{E}\left(\prod_{i=1}^{n} e^{tX_i}\right).

Using independence, we get

M_S(t) = \prod_{i=1}^{n} \mathrm{E}(e^{tX_i}) = \prod_{i=1}^{n} M_{X_i}(t).

The moment generating function of the gamma random variable X_i is M_{X_i}(t) = (1 - \theta t)^{-\alpha_i}. Then,

M_S(t) = \prod_{i=1}^{n} (1 - \theta t)^{-\alpha_i} = (1 - \theta t)^{-\sum_{i=1}^{n} \alpha_i}.

This indicates that the distribution of S is gamma with parameters \sum_{i=1}^{n} \alpha_i and \theta.

This is a demonstration of how we can use the uniqueness property of the moment
generating function to determine the probability distribution of a function of random
variables.

We can find the mean and variance from the properties of the gamma distribution.
Alternatively, by finding the first and second derivatives of M_S(t) at zero, we can show that

\mathrm{E}(S) = \left.\frac{\partial M_S(t)}{\partial t}\right|_{t=0} = \alpha\theta, \quad \text{where } \alpha = \sum_{i=1}^{n} \alpha_i, \qquad \text{and} \qquad \mathrm{E}(S^2) = \left.\frac{\partial^2 M_S(t)}{\partial t^2}\right|_{t=0} = (\alpha+1)\alpha\theta^2.

Hence, \mathrm{Var}(S) = \alpha\theta^2.

One can also use the moment generating function to compute the probability
generating function

𝑃𝑋 (𝑧) = E (𝑧𝑋 ) = 𝑀𝑋 (log 𝑧) .

As introduced in Section 2.2.2, the probability generating function is more useful


for discrete random variables.

3.2 Continuous Distributions for Modeling Loss


Severity

In this section, you learn how to define and apply four fundamental severity
distributions:
• gamma,
• Pareto,
• Weibull, and
• generalized beta distribution of the second kind.

3.2.1 Gamma Distribution


Recall that the traditional approach in modeling losses is to fit separate models
for frequency and claim severity. When frequency and severity are modeled
separately it is common for actuaries to use the Poisson distribution (introduced
in Section 2.2.3) for claim count and the gamma distribution to model severity.
An alternative approach for modeling losses that has recently gained popularity
is to create a single model for pure premium (average claim cost) that will be
described in Chapter 4.
The continuous variable 𝑋 is said to have the gamma distribution with shape
parameter 𝛼 and scale parameter 𝜃 if its probability density function is given
by
f_X(x) = \frac{(x/\theta)^{\alpha}}{x\,\Gamma(\alpha)} \exp(-x/\theta) \qquad \text{for } x > 0.
Note that 𝛼 > 0, 𝜃 > 0.
The two panels in Figure 3.1 demonstrate the effect of the scale and shape
parameters on the gamma density function.
When 𝛼 = 1 the gamma reduces to an exponential distribution and when 𝛼 = 𝑛/2
and 𝜃 = 2 the gamma reduces to a chi-square distribution with 𝑛 degrees of
freedom. As we will see in Section 15.4, the chi-square distribution is used
extensively in statistical hypothesis testing.
The distribution function of the gamma model is the incomplete gamma function,
denoted by \Gamma(\alpha;\, x/\theta), and defined as

F_X(x) = \Gamma\left(\alpha; \frac{x}{\theta}\right) = \frac{1}{\Gamma(\alpha)}\int_0^{x/\theta} t^{\alpha-1} e^{-t}\, dt,

with \alpha > 0, \theta > 0. For an integer \alpha, it can be written as \Gamma(\alpha;\, x/\theta) = 1 - e^{-x/\theta}\sum_{k=0}^{\alpha-1} \frac{(x/\theta)^k}{k!}.

The 𝑘-th raw moment of the gamma distributed random variable for any positive
𝑘 is given by
\mathrm{E}(X^k) = \theta^k\, \frac{\Gamma(\alpha+k)}{\Gamma(\alpha)}.
[Figure 3.1 appears here: two panels of gamma density curves (legend values: scale = 100, 150, 200, 250 and shape = 2, 3, 4, 5); axes show x versus Gamma Density.]

Figure 3.1: Gamma Densities. The left-hand panel is with shape=2 and
varying scale. The right-hand panel is with scale=100 and varying shape.

The mean and variance are given by E (𝑋) = 𝛼𝜃 and Var (𝑋) = 𝛼𝜃2 , respec-
tively.
Since all moments exist for any positive 𝑘, the gamma distribution is considered
a light tailed distribution, which may not be suitable for modeling risky assets
as it will not provide a realistic assessment of the likelihood of severe losses.
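For readers following along in R, the distribution function and moments above are easy to check numerically; the shape and scale values below are arbitrary.

# Gamma(shape = 2, scale = 100): distribution function and first two moments
alpha <- 2; theta <- 100
pgamma(150, shape = alpha, scale = theta)                                            # F_X(150)
integrate(function(x) x   * dgamma(x, shape = alpha, scale = theta), 0, Inf)$value   # alpha*theta = 200
integrate(function(x) x^2 * dgamma(x, shape = alpha, scale = theta), 0, Inf)$value   # (alpha+1)*alpha*theta^2 = 60000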

3.2.2 Pareto Distribution


The Pareto distribution, named after the Italian economist Vilfredo Pareto
(1843-1923), has many economic and financial applications. It is a positively
skewed and heavy-tailed distribution which makes it suitable for modeling in-
come, high-risk insurance claims and severity of large casualty losses. The sur-
vival function of the Pareto distribution which decays slowly to zero was first
used to describe the distribution of income where a small percentage of the
population holds a large proportion of the total wealth. For extreme insurance
claims, the tail of the severity distribution (losses in excess of a threshold) can
be modeled using a Generalized Pareto distribution.
The continuous variable 𝑋 is said to have the (two parameter) Pareto distribu-
tion with shape parameter 𝛼 and scale parameter 𝜃 if its pdf is given by

$$f_X(x) = \frac{\alpha\theta^{\alpha}}{(x+\theta)^{\alpha+1}}, \quad x > 0, \; \alpha > 0, \; \theta > 0. \tag{3.1}$$

The two panels in Figure 3.2 demonstrate the effect of the scale and shape
parameters on the Pareto density function. There are other formulations of
the Pareto distribution including a one parameter version given in Appendix
Section 18.2. Henceforth, when we refer to the Pareto distribution, we mean the
version given through the pdf in equation (3.1).
The distribution function of the Pareto distribution is given by
$$F_X(x) = 1 - \left(\frac{\theta}{x+\theta}\right)^{\alpha}, \quad x > 0, \; \alpha > 0, \; \theta > 0.$$

It can be easily seen that the hazard function of the Pareto distribution is a
decreasing function in 𝑥, another indication that the distribution is heavy tailed.
Again using the analogy of the income of a population, when the hazard function
decreases over time the population dies off at a decreasing rate resulting in a
heavier tail for the distribution. The hazard function reveals information about
the tail distribution and is often used to model data distributions in survival
analysis. The hazard function is defined as the instantaneous potential that the
event of interest occurs within a very narrow time frame.
Figure 3.2: Pareto Densities. The left-hand panel is with scale=2000 and
varying shape. The right-hand panel is with shape=3 and varying scale.

The 𝑘-th raw moment of the Pareto distributed random variable exists if and
only if 𝛼 > 𝑘. If 𝑘 is a positive integer then
$$\mathrm{E}(X^k) = \frac{\theta^k\, k!}{(\alpha-1)\cdots(\alpha-k)}, \quad \alpha > k.$$
The mean and variance are given by
$$\mathrm{E}(X) = \frac{\theta}{\alpha-1} \quad \text{for } \alpha > 1$$
and
$$\mathrm{Var}(X) = \frac{\alpha\theta^2}{(\alpha-1)^2(\alpha-2)} \quad \text{for } \alpha > 2,$$
respectively.
Example 3.2.1. The claim size of an insurance portfolio follows the Pareto
distribution with mean and variance of 40 and 1800, respectively. Find
a. The shape and scale parameters.
b. The 95-th percentile of this distribution.
Solution.
a. As 𝑋 ∼ 𝑃𝑎(𝛼, 𝜃), we have $\mathrm{E}(X) = \frac{\theta}{\alpha-1} = 40$ and $\mathrm{Var}(X) = \frac{\alpha\theta^2}{(\alpha-1)^2(\alpha-2)} = 1800$.
By dividing the square of the first equation by the second we get $\frac{\alpha-2}{\alpha} = \frac{40^2}{1800}$.
Thus, 𝛼 = 18.02 and 𝜃 = 680.72.
b. The 95-th percentile, $\pi_{0.95}$, satisfies the equation
$$F_X(\pi_{0.95}) = 1 - \left(\frac{680.72}{\pi_{0.95} + 680.72}\right)^{18.02} = 0.95.$$
Thus, $\pi_{0.95} = 122.96$.
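As a quick numerical check, the following R sketch (ours, not from the original solution) reproduces the calculation; because it carries exact intermediate values rather than the rounded figures above, its output differs slightly from the numbers shown.

# Solve for the Pareto parameters from mean 40 and variance 1800
m <- 40; v <- 1800
alpha <- 2 / (1 - m^2 / v)            # from (alpha - 2)/alpha = m^2/v
theta <- m * (alpha - 1)
c(alpha = alpha, theta = theta)
# 95th percentile from F(x) = 1 - (theta/(x + theta))^alpha
theta * (0.05^(-1 / alpha) - 1)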

3.2.3 Weibull Distribution


The Weibull distribution, named after the Swedish physicist Waloddi Weibull
(1887-1979) is widely used in reliability, life data analysis, weather forecasts and
general insurance claims. Truncated data arise frequently in insurance studies.
The Weibull distribution has been used to model losses under excess of loss treaties
for automobile insurance as well as earthquake inter-arrival times.
The continuous variable 𝑋 is said to have the Weibull distribution with shape
parameter 𝛼 and scale parameter 𝜃 if its pdf is given by
$$f_X(x) = \frac{\alpha}{\theta}\left(\frac{x}{\theta}\right)^{\alpha-1} \exp\left(-\left(\frac{x}{\theta}\right)^{\alpha}\right), \quad x > 0, \; \alpha > 0, \; \theta > 0.$$
The two panels in Figure 3.3 demonstrate the effects of the scale and shape
parameters on the Weibull density function.
Figure 3.3: Weibull Densities. The left-hand panel is with shape=3 and
varying scale. The right-hand panel is with scale=100 and varying shape.

The distribution function of the Weibull distribution is given by


$$F_X(x) = 1 - \exp\left(-\left(\frac{x}{\theta}\right)^{\alpha}\right), \quad x > 0, \; \alpha > 0, \; \theta > 0.$$

It can be easily seen that the shape parameter 𝛼 describes the shape of the
hazard function of the Weibull distribution. The hazard function is a decreasing
function when 𝛼 < 1 (heavy tailed distribution), constant when 𝛼 = 1 and
increasing when 𝛼 > 1 (light tailed distribution). This behavior of the hazard
function makes the Weibull distribution a suitable model for a wide variety of
phenomena such as weather forecasting, electrical and industrial engineering,
insurance modeling, and financial risk analysis.
The 𝑘-th raw moment of the Weibull distributed random variable is given by
$$\mathrm{E}(X^k) = \theta^k\, \Gamma\left(1 + \frac{k}{\alpha}\right).$$
The mean and variance are given by
$$\mathrm{E}(X) = \theta\, \Gamma\left(1 + \frac{1}{\alpha}\right)$$
and
$$\mathrm{Var}(X) = \theta^2\left(\Gamma\left(1 + \frac{2}{\alpha}\right) - \left[\Gamma\left(1 + \frac{1}{\alpha}\right)\right]^2\right),$$
respectively.
respectively.
Example 3.2.2. Suppose that the probability distribution of the lifetime of
AIDS patients (in months) from the time of diagnosis is described by the Weibull
distribution with shape parameter 1.2 and scale parameter 33.33.
a. Find the probability that a randomly selected person from this population
survives at least 12 months.
b. A random sample of 10 patients will be selected from this population.
What is the probability that at most two will die within one year of diagnosis?
c. Find the 99-th percentile of the distribution of lifetimes.
Solution.
a. Let 𝑋 be the lifetime of AIDS patients (in months) having a Weibull distri-
bution with parameters (1.2, 33.33). We have
$$\Pr(X \geq 12) = S_X(12) = e^{-(12/33.33)^{1.2}} = 0.746.$$
b. Let 𝑌 be the number of patients who die within one year of diagnosis. Then,
𝑌 ∼ 𝐵𝑖𝑛(10, 0.254) and Pr(𝑌 ≤ 2) = 0.514.
c. Let $\pi_{0.99}$ denote the 99-th percentile of this distribution. Then,
$$S_X(\pi_{0.99}) = \exp\left\{-\left(\frac{\pi_{0.99}}{33.33}\right)^{1.2}\right\} = 0.01.$$
Solving for $\pi_{0.99}$, we get $\pi_{0.99} = 118.99$.
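These quantities can be verified directly in R; the short sketch below is an illustrative check we add here, not part of the original solution.

# Weibull lifetime model: shape 1.2, scale 33.33
shape <- 1.2; scale <- 33.33
# a. probability of surviving at least 12 months
p_survive <- pweibull(12, shape = shape, scale = scale, lower.tail = FALSE)
p_survive                                      # approximately 0.746
# b. probability that at most 2 of 10 patients die within a year
pbinom(2, size = 10, prob = 1 - p_survive)     # approximately 0.514
# c. 99th percentile of the lifetime distribution
qweibull(0.99, shape = shape, scale = scale)   # approximately 119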

3.2.4 The Generalized Beta Distribution of the Second Kind
The Generalized Beta Distribution of the Second Kind (GB2) was introduced
by Venter (1983) in the context of insurance loss modeling and by McDonald
(1984) as an income and wealth distribution. It is a very flexible four-parameter
distribution that can model positively as well as negatively skewed distributions.
The continuous variable 𝑋 is said to have the GB2 distribution with parameters
𝜎, 𝜃, 𝛼1 and 𝛼2 if its pdf is given by

(𝑥/𝜃)𝛼2 /𝜎
𝑓𝑋 (𝑥) = 𝛼1 +𝛼2 for 𝑥 > 0, (3.2)
1/𝜎
𝑥𝜎 B (𝛼1 , 𝛼2 ) [1 + (𝑥/𝜃) ]

𝜎, 𝜃, 𝛼1 , 𝛼2 > 0, and where the beta function B (𝛼1 , 𝛼2 ) is defined as

$$\mathrm{B}(\alpha_1, \alpha_2) = \int_0^1 t^{\alpha_1 - 1}(1-t)^{\alpha_2 - 1}\, dt.$$

The GB2 provides a model for heavy as well as light tailed data. It includes the
exponential, gamma, Weibull, Burr, Lomax, F, chi-square, Rayleigh, lognormal
and log-logistic as special or limiting cases. For example, by setting the param-
eters 𝜎 = 𝛼1 = 𝛼2 = 1, the GB2 reduces to the log-logistic distribution. When
𝜎 = 1 and 𝛼2 → ∞, it reduces to the gamma distribution, and when 𝛼1 = 1 and
𝛼2 → ∞, it reduces to the Weibull distribution.
A GB2 random variable can be constructed as follows. Suppose that 𝐺1 and
𝐺2 are independent random variables where 𝐺𝑖 has a gamma distribution with
shape parameter 𝛼𝑖 and scale parameter 1. Then, one can show that the random
variable $X = \theta\left(\frac{G_1}{G_2}\right)^{\sigma}$ has a GB2 distribution with pdf summarized in equation
(3.2). This theoretical result has several implications. For example, when the
moments exist, one can show that the 𝑘-th raw moment of the GB2 distributed
random variable is given by
$$\mathrm{E}(X^k) = \frac{\theta^k\, \mathrm{B}(\alpha_1 + k\sigma,\, \alpha_2 - k\sigma)}{\mathrm{B}(\alpha_1, \alpha_2)}, \quad k > 0.$$

As will be described in Section 3.3.3, the GB2 is also related to an 𝐹 -distribution,


a result that can be useful in simulation and residual analysis.
Earlier applications of the GB2 were to income data; more recently it has been
used to model long-tailed claims data (Section 10.2 describes different
interpretations of the descriptor “long-tail”). The GB2 has been used to model
different types of automobile insurance claims, severity of fire losses, as well as
medical insurance claim data.
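For readers who wish to experiment with the GB2, a minimal R sketch coded directly from equation (3.2) follows; the function name dGB2 is ours (it is not taken from a package) and the parameter values are arbitrary.

# A minimal GB2 density written from equation (3.2); dGB2 is our own name
dGB2 <- function(x, sigma, theta, alpha1, alpha2) {
  (x / theta)^(alpha2 / sigma) /
    (x * sigma * beta(alpha1, alpha2) * (1 + (x / theta)^(1 / sigma))^(alpha1 + alpha2))
}
# sanity check: the density should integrate to approximately one
integrate(dGB2, lower = 0, upper = Inf,
          sigma = 1, theta = 100, alpha1 = 2, alpha2 = 3)$value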

3.3 Methods of Creating New Distributions

In this section, you learn how to:


• Understand connections among the distributions
• Give insights into when a distribution is preferred when compared to alternatives
• Provide foundations for creating new distributions

3.3.1 Functions of Random Variables and their Distributions
In Section 3.2 we discussed some elementary known distributions. In this sec-
tion we discuss means of creating new parametric probability distributions from
existing ones. Specifically, let 𝑋 be a continuous random variable with a known
pdf 𝑓𝑋 (𝑥) and distribution function 𝐹𝑋 (𝑥). We are interested in the distribution
of 𝑌 = 𝑔 (𝑋), where 𝑔(𝑋) is a one-to-one transformation defining a new random
variable 𝑌 . In this section we apply the following techniques for creating new
families of distributions: (a) multiplication by a constant, (b) raising to a power,
(c) exponentiation, and (d) mixing.

3.3.2 Multiplication by a Constant


If claim data show changes over time, then such a transformation can be useful to
adjust for inflation. If the level of inflation is positive then claim costs are rising,
and if it is negative then costs are falling. To adjust for inflation we multiply
the cost 𝑋 by 1 plus the inflation rate (negative inflation is deflation). To account
for currency impact on claim costs we also use a transformation to apply currency
conversion from a base to a counter currency.
Consider the transformation 𝑌 = 𝑐𝑋, where 𝑐 > 0; then the distribution function
of 𝑌 is given by
$$F_Y(y) = \Pr(Y \leq y) = \Pr(cX \leq y) = \Pr\left(X \leq \frac{y}{c}\right) = F_X\left(\frac{y}{c}\right).$$

Using the chain rule for differentiation, the pdf of interest $f_Y(y)$ can be written
as
$$f_Y(y) = \frac{1}{c}\, f_X\left(\frac{y}{c}\right).$$
Suppose that 𝑋 belongs to a certain set of parametric distributions and define
a rescaled version 𝑌 = 𝑐𝑋, 𝑐 > 0. If 𝑌 is in the same set of distributions
then the distribution is said to be a scale distribution. When a member of a
scale distribution is multiplied by a constant 𝑐 (𝑐 > 0), the scale parameter for
this scale distribution meets two conditions:
• The parameter is changed by multiplying by 𝑐;
• All other parameters remain unchanged.
Example 3.3.1. Actuarial Exam Question. Losses of Eiffel Auto Insurance
are denoted in Euro currency and follow a lognormal distribution with 𝜇 = 8
and 𝜎 = 2. Given that 1 euro = 1.3 dollars, find the set of lognormal parameters
which describe the distribution of Eiffel’s losses in dollars.
Solution.
Let 𝑋 and 𝑌 denote the aggregate losses of Eiffel Auto Insurance in euro cur-
rency and dollars respectively. As 𝑌 = 1.3𝑋, we have,
$$F_Y(y) = \Pr(Y \leq y) = \Pr(1.3X \leq y) = \Pr\left(X \leq \frac{y}{1.3}\right) = F_X\left(\frac{y}{1.3}\right).$$

𝑋 follows a lognormal distribution with parameters 𝜇 = 8 and 𝜎 = 2. The pdf
of 𝑋 is given by
$$f_X(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{\log x - \mu}{\sigma}\right)^2\right\} \quad \text{for } x > 0.$$

As $\left|\frac{dx}{dy}\right| = \frac{1}{1.3}$, the pdf of interest $f_Y(y)$ is
$$f_Y(y) = \frac{1}{1.3}\, f_X\left(\frac{y}{1.3}\right)
= \frac{1}{1.3} \cdot \frac{1.3}{y\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{\log(y/1.3) - \mu}{\sigma}\right)^2\right\}
= \frac{1}{y\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{\log y - (\log 1.3 + \mu)}{\sigma}\right)^2\right\}.$$

Then 𝑌 follows a lognormal distribution with parameters log 1.3 + 𝜇 = 8.26 and
𝜎 = 2.00. If we let 𝜇 = log(𝑚) then it can be easily seen that 𝑚 = 𝑒𝜇 is the
scale parameter which was multiplied by 1.3 while 𝜎 is the shape parameter that
remained unchanged.

Example 3.3.2. Actuarial Exam Question. Demonstrate that the gamma
distribution is a scale distribution.

Solution.
Let 𝑋 ∼ 𝐺𝑎(𝛼, 𝜃) and 𝑌 = 𝑐𝑋. As $\left|\frac{dx}{dy}\right| = \frac{1}{c}$, then
$$f_Y(y) = \frac{1}{c}\, f_X\left(\frac{y}{c}\right) = \frac{\left(y/(c\theta)\right)^{\alpha}}{y\,\Gamma(\alpha)} \exp\left(-\frac{y}{c\theta}\right).$$
We can see that 𝑌 ∼ 𝐺𝑎(𝛼, 𝑐𝜃), indicating that the gamma is a scale distribution
and 𝜃 is a scale parameter.
Using the same approach you can demonstrate that other distributions intro-
duced in Section 3.2 are also scale distributions. In actuarial modeling, working
with a scale distribution is very convenient because it allows us to incorporate the
effect of inflation and to accommodate changes in the currency unit.
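The scale property can also be checked numerically. The following R sketch (ours, with arbitrary parameter values) compares the change-of-variable density (1/c) f_X(y/c) with the gamma density whose scale parameter has been multiplied by c.

# Numerical check that the gamma family is a scale family
alpha <- 2; theta <- 100; c_factor <- 1.3
y <- c(50, 150, 400)
lhs <- (1 / c_factor) * dgamma(y / c_factor, shape = alpha, scale = theta)
rhs <- dgamma(y, shape = alpha, scale = c_factor * theta)
all.equal(lhs, rhs)    # TRUE: Y = cX is gamma with shape alpha and scale c*theta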

3.3.3 Raising to a Power


In Section 3.2.3 we talked about the flexibility of the Weibull distribution in
fitting reliability data. Looking to the origins of the Weibull distribution, we
recognize that the Weibull is a power transformation of the exponential distri-
bution. This is an application of another type of transformation which involves
raising the random variable to a power.
Consider the transformation $Y = X^{\tau}$, where 𝜏 > 0; then the distribution function
of 𝑌 is given by
$$F_Y(y) = \Pr(Y \leq y) = \Pr(X^{\tau} \leq y) = \Pr\left(X \leq y^{1/\tau}\right) = F_X\left(y^{1/\tau}\right).$$
Hence, the pdf of interest $f_Y(y)$ can be written as
$$f_Y(y) = \frac{1}{\tau}\, y^{(1/\tau)-1}\, f_X\left(y^{1/\tau}\right).$$
On the other hand, if 𝜏 < 0, then the distribution function of 𝑌 is given by
$$F_Y(y) = \Pr(Y \leq y) = \Pr(X^{\tau} \leq y) = \Pr\left(X \geq y^{1/\tau}\right) = 1 - F_X\left(y^{1/\tau}\right),$$
and
$$f_Y(y) = \left|\frac{1}{\tau}\right| y^{(1/\tau)-1}\, f_X\left(y^{1/\tau}\right).$$

Example 3.3.3. We assume that 𝑋 follows the exponential distribution with


mean 𝜃 and consider the transformed variable 𝑌 = 𝑋 𝜏 . Show that 𝑌 follows
the Weibull distribution when 𝜏 is positive and determine the parameters of the
Weibull distribution.
3.3. METHODS OF CREATING NEW DISTRIBUTIONS 93

Solution.
As 𝑋 follows the exponential distribution with mean 𝜃, we have
$$f_X(x) = \frac{1}{\theta}\, e^{-x/\theta}, \quad x > 0.$$
Solving for 𝑥 yields $x = y^{1/\tau}$. Taking the derivative, we have
$$\left|\frac{dx}{dy}\right| = \frac{1}{\tau}\, y^{\frac{1}{\tau}-1}.$$
Thus,
$$f_Y(y) = \frac{1}{\tau}\, y^{\frac{1}{\tau}-1}\, f_X\left(y^{1/\tau}\right) = \frac{1}{\tau\theta}\, y^{\frac{1}{\tau}-1}\, e^{-y^{1/\tau}/\theta} = \frac{\alpha}{\beta}\left(\frac{y}{\beta}\right)^{\alpha-1} e^{-(y/\beta)^{\alpha}},$$
where $\alpha = \frac{1}{\tau}$ and $\beta = \theta^{\tau}$. Then, 𝑌 follows the Weibull distribution with shape
parameter 𝛼 and scale parameter 𝛽.
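A quick simulation illustrates this result; the R sketch below (ours, with arbitrary values of 𝜃 and 𝜏) compares quantiles of simulated values of X^τ with the corresponding Weibull quantiles.

# If X ~ exponential(mean theta), then X^tau ~ Weibull(shape 1/tau, scale theta^tau)
set.seed(2020)
theta <- 5; tau <- 2
y <- rexp(100000, rate = 1 / theta)^tau
probs <- c(0.25, 0.5, 0.75, 0.95)
cbind(simulated = quantile(y, probs),
      weibull   = qweibull(probs, shape = 1 / tau, scale = theta^tau))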

Special Case. Relating a GB2 to an 𝐹-Distribution. We can use
transforms such as multiplication by a constant and raising to a power to verify
that the GB2 distribution is related to an 𝐹-distribution, a distribution widely
used in applied statistics.

To see this relationship, we first note that $\frac{1}{2} G_1$ has a gamma distribution with
shape parameter 𝛼1 and scale parameter 0.5. Readers with some background in
applied statistics may also recognize this to be a chi-square distribution with
degrees of freedom 2𝛼1. The ratio of independent chi-squares has an 𝐹-distribution.
That is,
$$\frac{G_1}{G_2} = \frac{0.5\, G_1}{0.5\, G_2} = F$$

has an 𝐹-distribution with numerator degrees of freedom 2𝛼1 and denominator
degrees of freedom 2𝛼2. Thus, a random variable 𝑋 with a GB2 distribution
can be expressed as $X = \theta\left(\frac{G_1}{G_2}\right)^{\sigma} = \theta F^{\sigma}$. With this, you can think of a GB2
as a “power 𝐹” or a “generalized 𝐹”, as it is sometimes known in the literature.
Simulation, discussed in Chapter 6, provides a direct application of this result.
Suppose we know how to simulate an outcome with an 𝐹-distribution (that
is easy to do using, for example, the R function rf(n,df1,df2)), say 𝐹. Then
we raise it to the power 𝜎 and multiply it by 𝜃 so that $\theta F^{\sigma}$ is an outcome that
has a GB2 distribution.
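The recipe just described translates directly into R; the snippet below is a minimal sketch following that recipe, with parameter values chosen only for illustration.

# Simulate GB2 outcomes by raising F-distributed draws to a power
set.seed(2020)
sigma <- 0.5; theta <- 1000; alpha1 <- 3; alpha2 <- 4
f_draws   <- rf(10000, df1 = 2 * alpha1, df2 = 2 * alpha2)
gb2_draws <- theta * f_draws^sigma
summary(gb2_draws)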

Residual analysis provides another direct application. Suppose we have an out-
come, say 𝑋, that we think comes from a GB2 distribution. Then we can
examine the transformed version $X^* = (X/\theta)^{1/\sigma}$. If the original specification
is correct, then $X^*$ has an 𝐹-distribution and there are many well-known
techniques, some described in Chapter 4, for verifying this assertion.

3.3.4 Exponentiation
The normal distribution is a very popular model for a wide number of applica-
tions and when the sample size is large, it can serve as an approximate distri-
bution for other models. If the random variable 𝑋 has a normal distribution
with mean 𝜇 and variance 𝜎2 , then 𝑌 = 𝑒𝑋 has a lognormal distribution with
parameters 𝜇 and 𝜎2 . The lognormal random variable has a lower bound of
zero, is positively skewed and has a long right tail. A lognormal distribution is
commonly used to describe distributions of financial assets such as stock prices.
It is also used in fitting claim amounts for automobile as well as health in-
surance. This is an example of another type of transformation which involves
exponentiation.
In general, consider the transformation $Y = e^X$. Then, the distribution function
of 𝑌 is given by
$$F_Y(y) = \Pr(Y \leq y) = \Pr(e^X \leq y) = \Pr(X \leq \log y) = F_X(\log y).$$
Taking derivatives, we see that the pdf of interest $f_Y(y)$ can be written as
$$f_Y(y) = \frac{1}{y}\, f_X(\log y).$$

As an important special case, suppose that 𝑋 is normally distributed with mean
𝜇 and variance 𝜎². Then, the density of $Y = e^X$ is
$$f_Y(y) = \frac{1}{y}\, f_X(\log y) = \frac{1}{y\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{\log y - \mu}{\sigma}\right)^2\right\}.$$
This is known as a lognormal distribution.


Example 3.3.4. Actuarial Exam Question. Assume that 𝑋 has a uniform
distribution on the interval (0, 𝑐) and define 𝑌 = 𝑒𝑋 . Find the distribution of
𝑌.
Solution.
We begin with the cdf of 𝑌 ,

𝐹𝑌 (𝑦) = Pr (𝑌 ≤ 𝑦) = Pr (𝑒𝑋 ≤ 𝑦) = Pr (𝑋 ≤ log 𝑦) = 𝐹𝑋 (log 𝑦) .


3.3. METHODS OF CREATING NEW DISTRIBUTIONS 95

Taking the derivative, we have
$$f_Y(y) = \frac{1}{y}\, f_X(\log y) = \frac{1}{cy}.$$
Since 0 < 𝑥 < 𝑐, we have $1 < y < e^c$.
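As a numerical illustration (ours, not part of the exam solution), one can compare a simulated sample of Y = exp(X) against the cdf F(y) = log(y)/c implied by the density above; c_par = 2 is an arbitrary choice.

# Check the derived distribution of Y = exp(X) when X is uniform on (0, c)
set.seed(2020)
c_par <- 2
y <- exp(runif(100000, min = 0, max = c_par))
y_grid <- c(1.5, 3, 6)
cbind(empirical   = ecdf(y)(y_grid),
      theoretical = log(y_grid) / c_par)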

3.3.5 Finite Mixtures


Mixture distributions represent a useful way of modeling data that are drawn
from a heterogeneous population. This parent population can be thought to be
divided into multiple subpopulations with distinct distributions.

Two-point Mixture
If the underlying phenomenon is diverse and can actually be described as two
phenomena representing two subpopulations with different modes, we can con-
struct the two-point mixture random variable 𝑋. Given random variables 𝑋1
and 𝑋2 , with pdf s 𝑓𝑋1 (𝑥) and 𝑓𝑋2 (𝑥) respectively, the pdf of 𝑋 is the weighted
average of the component pdf 𝑓𝑋1 (𝑥) and 𝑓𝑋2 (𝑥). The pdf and distribution
function of 𝑋 are given by

𝑓𝑋 (𝑥) = 𝑎𝑓𝑋1 (𝑥) + (1 − 𝑎) 𝑓𝑋2 (𝑥) ,

and
𝐹𝑋 (𝑥) = 𝑎𝐹𝑋1 (𝑥) + (1 − 𝑎) 𝐹𝑋2 (𝑥) ,

for 0 < 𝑎 < 1, where the mixing parameters 𝑎 and (1 − 𝑎) represent the propor-
tions of data points that fall under each of the two subpopulations respectively.
This weighted average can be applied to a number of other distribution related
quantities. The k-th raw moment and moment generating function of 𝑋 are
given by $\mathrm{E}(X^k) = a\,\mathrm{E}(X_1^k) + (1-a)\,\mathrm{E}(X_2^k)$, and

𝑀𝑋 (𝑡) = 𝑎𝑀𝑋1 (𝑡) + (1 − 𝑎) 𝑀𝑋2 (𝑡),

respectively.
Example 3.3.5. Actuarial Exam Question. A collection of insurance poli-
cies consists of two types. 25% of policies are Type 1 and 75% of policies are
Type 2. For a policy of Type 1, the loss amount per year follows an exponential
distribution with mean 200, and for a policy of Type 2, the loss amount per year
follows a Pareto distribution with parameters 𝛼 = 3 and 𝜃 = 200. For a policy
chosen at random from the entire collection of both types of policies, find the
probability that the annual loss will be less than 100, and find the average loss.
Solution.
The two types of losses are the random variables 𝑋1 and 𝑋2. 𝑋1 has an expo-
nential distribution with mean 200, so $F_{X_1}(100) = 1 - e^{-100/200} = 0.393$. 𝑋2 has
a Pareto distribution with parameters 𝛼 = 3 and 𝜃 = 200, so
$F_{X_2}(100) = 1 - \left(\frac{200}{100+200}\right)^3 = 0.704$.
Hence, $F_X(100) = (0.25 \times 0.393) + (0.75 \times 0.704) = 0.626$.
The average loss is given by
$$\mathrm{E}(X) = 0.25\,\mathrm{E}(X_1) + 0.75\,\mathrm{E}(X_2) = (0.25 \times 200) + (0.75 \times 100) = 125.$$
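The mixture calculation can be verified in R; the short sketch below is an illustrative check we add here.

# Two-point mixture: 25% exponential(mean 200), 75% Pareto(alpha = 3, theta = 200)
a <- 0.25
F1 <- pexp(100, rate = 1 / 200)          # exponential cdf at 100
F2 <- 1 - (200 / (100 + 200))^3          # Pareto cdf at 100
a * F1 + (1 - a) * F2                    # approximately 0.626
a * 200 + (1 - a) * 200 / (3 - 1)        # mean of the mixture, 125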

k-point Mixture
In case of finite mixture distributions, the random variable of interest 𝑋 has
a probability 𝑝𝑖 of being drawn from homogeneous subpopulation 𝑖, where
𝑖 = 1, 2, … , 𝑘 and 𝑘 is the initially specified number of subpopulations in our mix-
ture. The mixing parameter 𝑝𝑖 represents the proportion of observations from
subpopulation 𝑖. Consider the random variable 𝑋 generated from 𝑘 distinct sub-
populations, where subpopulation 𝑖 is modeled by the continuous distribution
𝑓𝑋𝑖 (𝑥). The probability distribution of 𝑋 is given by

$$f_X(x) = \sum_{i=1}^{k} p_i\, f_{X_i}(x),$$
where $0 < p_i < 1$ and $\sum_{i=1}^{k} p_i = 1$.
This model is often referred to as a finite mixture or a 𝑘-point mixture. The
distribution function, 𝑟-th raw moment and moment generating function of the
𝑘-point mixture are given as

$$F_X(x) = \sum_{i=1}^{k} p_i\, F_{X_i}(x),$$
$$\mathrm{E}(X^r) = \sum_{i=1}^{k} p_i\, \mathrm{E}(X_i^r), \quad \text{and}$$
$$M_X(t) = \sum_{i=1}^{k} p_i\, M_{X_i}(t),$$
respectively.
Example 3.3.6. Actuarial Exam Question. 𝑌1 is a mixture of 𝑋1 and 𝑋2
with mixing weights 𝑎 and (1 − 𝑎). 𝑌2 is a mixture of 𝑋3 and 𝑋4 with mixing
weights 𝑏 and (1 − 𝑏). 𝑍 is a mixture of 𝑌1 and 𝑌2 with mixing weights 𝑐 and
(1 − 𝑐).
3.3. METHODS OF CREATING NEW DISTRIBUTIONS 97

Show that 𝑍 is a mixture of 𝑋1 , 𝑋2 , 𝑋3 and 𝑋4 , and find the mixing weights.


Solution. Applying the formula for a mixed distribution, we get

𝑓𝑌1 (𝑥) = 𝑎𝑓𝑋1 (𝑥) + (1 − 𝑎) 𝑓𝑋2 (𝑥)

𝑓𝑌2 (𝑥) = 𝑏𝑓𝑋3 (𝑥) + (1 − 𝑏) 𝑓𝑋4 (𝑥)

𝑓𝑍 (𝑥) = 𝑐𝑓𝑌1 (𝑥) + (1 − 𝑐) 𝑓𝑌2 (𝑥)

Substituting the first two equations into the third, we get
$$f_Z(x) = c\left[a f_{X_1}(x) + (1-a) f_{X_2}(x)\right] + (1-c)\left[b f_{X_3}(x) + (1-b) f_{X_4}(x)\right]$$
$$= ca\, f_{X_1}(x) + c(1-a) f_{X_2}(x) + (1-c)b\, f_{X_3}(x) + (1-c)(1-b) f_{X_4}(x).$$
Then, 𝑍 is a mixture of 𝑋1, 𝑋2, 𝑋3 and 𝑋4, with mixing weights $ca$, $c(1-a)$,
$(1-c)b$ and $(1-c)(1-b)$, respectively. It can be easily seen that the mixing
weights sum to one.

3.3.6 Continuous Mixtures


A mixture with a very large number of subpopulations (𝑘 goes to infinity) is often
referred to as a continuous mixture. In a continuous mixture, subpopulations are
not distinguished by a discrete mixing parameter but by a continuous variable
Θ, where Θ plays the role of 𝑝𝑖 in the finite mixture. Consider the random
variable 𝑋 with a distribution depending on a parameter Θ, where Θ itself is a
continuous random variable. This description yields the following model for 𝑋

$$f_X(x) = \int_{-\infty}^{\infty} f_X(x\,|\,\theta)\, g_\Theta(\theta)\, d\theta,$$

where 𝑓𝑋 (𝑥|𝜃) is the conditional distribution of 𝑋 at a particular value of Θ = 𝜃


and 𝑔Θ (𝜃) is the probability statement made about the unknown parameter 𝜃.
In a Bayesian context (described in Section 4.4), this is known as the prior
distribution of Θ (the prior information or expert opinion to be used in the
analysis).
The distribution function, 𝑘-th raw moment and moment generating functions
of the continuous mixture are given as


$$F_X(x) = \int_{-\infty}^{\infty} F_X(x\,|\,\theta)\, g_\Theta(\theta)\, d\theta,$$
$$\mathrm{E}(X^k) = \int_{-\infty}^{\infty} \mathrm{E}(X^k\,|\,\theta)\, g_\Theta(\theta)\, d\theta,$$
$$M_X(t) = \mathrm{E}(e^{tX}) = \int_{-\infty}^{\infty} \mathrm{E}(e^{tX}\,|\,\theta)\, g_\Theta(\theta)\, d\theta,$$
respectively.
The 𝑘-th raw moment of the mixture distribution can be rewritten as
$$\mathrm{E}(X^k) = \int_{-\infty}^{\infty} \mathrm{E}(X^k\,|\,\theta)\, g_\Theta(\theta)\, d\theta = \mathrm{E}\left[\mathrm{E}(X^k\,|\,\Theta)\right].$$

Using the law of iterated expectations (see Appendix Chapter 16), we can define
the mean and variance of 𝑋 as

E (𝑋) = E [E (𝑋 |Θ )]

and
Var (𝑋) = E [Var (𝑋 |Θ )] + Var [E (𝑋 |Θ )] .

Example 3.3.7. Actuarial Exam Question. 𝑋 has a normal distribution


with a mean of Λ and variance of 1. Λ has a normal distribution with a mean
of 1 and variance of 1. Find the mean and variance of 𝑋.
Solution.
𝑋 is a continuous mixture with mean
$$\mathrm{E}(X) = \mathrm{E}[\mathrm{E}(X|\Lambda)] = \mathrm{E}(\Lambda) = 1$$
and variance
$$\mathrm{Var}(X) = \mathrm{Var}[\mathrm{E}(X|\Lambda)] + \mathrm{E}[\mathrm{Var}(X|\Lambda)] = \mathrm{Var}(\Lambda) + \mathrm{E}(1) = 1 + 1 = 2.$$

Example 3.3.8. Actuarial Exam Question. Claim sizes, 𝑋, are uniform


on the interval (Θ, Θ + 10) for each policyholder. Θ varies by policyholder
according to an exponential distribution with mean 5. Find the unconditional
distribution, mean and variance of 𝑋.
Solution.
The conditional distribution of 𝑋 is $f_X(x|\theta) = \frac{1}{10}$ for $\theta < x < \theta + 10$. The prior
distribution of 𝜃 is $g_\Theta(\theta) = \frac{1}{5} e^{-\theta/5}$ for $0 < \theta < \infty$.
Multiplying and integrating yields the unconditional distribution of 𝑋
$$f_X(x) = \int f_X(x|\theta)\, g_\Theta(\theta)\, d\theta.$$
For this example, this is
$$f_X(x) = \begin{cases} \int_0^x \frac{1}{50} e^{-\theta/5}\, d\theta = \frac{1}{10}\left(1 - e^{-x/5}\right) & 0 \leq x \leq 10, \\ \int_{x-10}^{x} \frac{1}{50} e^{-\theta/5}\, d\theta = \frac{1}{10}\left(e^{-(x-10)/5} - e^{-x/5}\right) & 10 < x < \infty. \end{cases}$$

One can use this to derive the mean and variance of the unconditional distribu-
tion. Alternatively, start with the conditional mean and variance of 𝑋, given by
$$\mathrm{E}(X|\theta) = \frac{\theta + \theta + 10}{2} = \theta + 5$$
and
$$\mathrm{Var}(X|\theta) = \frac{\left[(\theta+10) - \theta\right]^2}{12} = \frac{100}{12},$$
respectively. With these, the unconditional mean and variance of 𝑋 are given by
$$\mathrm{E}(X) = \mathrm{E}[\mathrm{E}(X|\Theta)] = \mathrm{E}(\Theta + 5) = \mathrm{E}(\Theta) + 5 = 5 + 5 = 10,$$
and
$$\mathrm{Var}(X) = \mathrm{E}[\mathrm{Var}(X|\Theta)] + \mathrm{Var}[\mathrm{E}(X|\Theta)] = \mathrm{E}\left(\frac{100}{12}\right) + \mathrm{Var}(\Theta + 5) = 8.33 + \mathrm{Var}(\Theta) = 33.33.$$
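A simulation check (ours, not part of the exam solution) of the unconditional mean and variance:

# Uniform-exponential continuous mixture
set.seed(2020)
theta <- rexp(100000, rate = 1 / 5)        # Theta ~ exponential with mean 5
x <- runif(100000, min = theta, max = theta + 10)
c(mean = mean(x), variance = var(x))       # approximately 10 and 33.33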

3.4 Coverage Modifications


In this section we evaluate the impacts of coverage modifications: a) deductibles,
b) policy limit, c) coinsurance and d) inflation on insurer’s costs.

3.4.1 Policy Deductibles


Under an ordinary deductible policy, the insured (policyholder) agrees to cover
a fixed amount of an insurance claim before the insurer starts to pay. This fixed
expense paid out of pocket is called the deductible and often denoted by 𝑑. If
the loss exceeds 𝑑 then the insurer is responsible for covering the loss X less
the deductible 𝑑. Depending on the agreement, the deductible may apply to
each covered loss or to the total losses during a defined benefit period (such as
a month, year, etc.)
Deductibles reduce premiums for the policyholders by eliminating a large num-
ber of small claims, the costs associated with handling these claims, and the
100 CHAPTER 3. MODELING LOSS SEVERITY

potential moral hazard arising from having insurance. Moral hazard occurs
when the insured takes more risks, increasing the chances of loss due to perils
insured against, knowing that the insurer will incur the cost (e.g. a policyholder
with collision insurance may be encouraged to drive recklessly). The larger the
deductible, the less the insured pays in premiums for an insurance policy.
Let 𝑋 denote the loss incurred to the insured and 𝑌 denote the amount of
paid claim by the insurer. Speaking of the benefit paid to the policyholder, we
differentiate between two variables: The payment per loss and the payment per
payment. The payment per loss variable, denoted by 𝑌 𝐿 or (𝑋 − 𝑑)+ is left
censored because values of 𝑋 that are less than 𝑑 are set equal to zero. This
variable is defined as

$$Y^L = (X-d)_+ = \begin{cases} 0 & X \leq d, \\ X - d & X > d. \end{cases}$$
𝑌 𝐿 is often referred to as left censored and shifted variable because the values
below 𝑑 are not ignored and all losses are shifted by a value 𝑑.
On the other hand, the payment per payment variable, denoted by $Y^P$, is defined
only when there is a payment. Specifically, $Y^P$ equals 𝑋 − 𝑑 on the event
{𝑋 > 𝑑}, denoted as $Y^P = X - d \mid X > d$. Another way of expressing this that
is commonly used is
$$Y^P = \begin{cases} \text{Undefined} & X \leq d, \\ X - d & X > d. \end{cases}$$
Here, $Y^P$ is often referred to as the left truncated and shifted variable or excess loss
variable because the claims smaller than 𝑑 are not reported and values above 𝑑
are shifted by 𝑑.
Even when the distribution of 𝑋 is continuous, the distribution of 𝑌 𝐿 is a hybrid
combination of discrete and continuous components. The discrete part of the
distribution is concentrated at 𝑌 = 0 (when 𝑋 ≤ 𝑑) and the continuous part
is spread over the interval 𝑌 > 0 (when 𝑋 > 𝑑). For the discrete part, the
probability that no payment is made is the probability that losses fall below the
deductible; that is,

Pr (𝑌 𝐿 = 0) = Pr (𝑋 ≤ 𝑑) = 𝐹𝑋 (𝑑) .

Using the transformation 𝑌 𝐿 = 𝑋 −𝑑 for the continuous part of the distribution,


we can find the pdf of 𝑌 𝐿 given by

$$f_{Y^L}(y) = \begin{cases} F_X(d) & y = 0, \\ f_X(y+d) & y > 0. \end{cases}$$

We can see that the payment per payment variable is the payment per loss
variable (𝑌 𝑃 = 𝑌 𝐿 ) conditional on the loss exceeding the deductible (𝑋 > 𝑑);
3.4. COVERAGE MODIFICATIONS 101

that is, $Y^P = Y^L \mid X > d$. Alternatively, it can be expressed as 𝑌 𝑃 = (𝑋 −


𝑑)|𝑋 > 𝑑, that is, 𝑌 𝑃 is the loss in excess of the deductible given that the loss
exceeds the deductible. Hence, the pdf of 𝑌 𝑃 is given by

$$f_{Y^P}(y) = \frac{f_X(y+d)}{1 - F_X(d)},$$

for 𝑦 > 0. Accordingly, the distribution functions of $Y^L$ and $Y^P$ are given by
$$F_{Y^L}(y) = \begin{cases} F_X(d) & y = 0, \\ F_X(y+d) & y > 0, \end{cases}$$
and
$$F_{Y^P}(y) = \frac{F_X(y+d) - F_X(d)}{1 - F_X(d)},$$
for 𝑦 > 0, respectively.


The raw moments of $Y^L$ and $Y^P$ can be found directly using the pdf of 𝑋 as
follows
$$\mathrm{E}\left[(Y^L)^k\right] = \int_d^{\infty} (x-d)^k f_X(x)\, dx,$$
and
$$\mathrm{E}\left[(Y^P)^k\right] = \frac{\int_d^{\infty} (x-d)^k f_X(x)\, dx}{1 - F_X(d)} = \frac{\mathrm{E}\left[(Y^L)^k\right]}{1 - F_X(d)},$$
respectively. For 𝑘 = 1, we can use the survival function to calculate $\mathrm{E}(Y^L)$ as
$$\mathrm{E}(Y^L) = \int_d^{\infty} \left[1 - F_X(x)\right] dx.$$
This could be easily proved if we start with the initial definition of $\mathrm{E}(Y^L)$ and
use integration by parts.
We have seen that the deductible 𝑑 imposed on an insurance policy is the amount
of loss that has to be paid out of pocket before the insurer makes any payment.
The deductible 𝑑 imposed on an insurance policy reduces the insurer’s payment.
The loss elimination ratio (LER) is the percentage decrease in the expected
payment of the insurer as a result of imposing the deductible. It is defined as
$$LER = \frac{\mathrm{E}(X) - \mathrm{E}(Y^L)}{\mathrm{E}(X)}.$$

A little less common type of policy deductible is the franchise deductible. The
franchise deductible will apply to the policy in the same way as an ordinary de-
ductible except that when the loss exceeds the deductible 𝑑, the full loss is
covered by the insurer. The payment per loss and payment per payment vari-
ables are defined as
$$Y^L = \begin{cases} 0 & X \leq d, \\ X & X > d, \end{cases}$$
and
$$Y^P = \begin{cases} \text{Undefined} & X \leq d, \\ X & X > d, \end{cases}$$
respectively.
Example 3.4.1. Actuarial Exam Question. A claim severity distribution
is exponential with mean 1000. An insurance company will pay the amount of
each claim in excess of a deductible of 100. Calculate the variance of the amount
paid by the insurance company for one claim, including the possibility that the
amount paid is 0.
Solution.
Let $Y^L$ denote the amount paid by the insurance company for one claim.
$$Y^L = (X - 100)_+ = \begin{cases} 0 & X \leq 100, \\ X - 100 & X > 100. \end{cases}$$
The first and second moments of $Y^L$ are
$$\mathrm{E}(Y^L) = \int_{100}^{\infty} (x-100) f_X(x)\, dx = \int_{100}^{\infty} S_X(x)\, dx = 1000\, e^{-100/1000},$$
and
$$\mathrm{E}\left[(Y^L)^2\right] = \int_{100}^{\infty} (x-100)^2 f_X(x)\, dx = 2 \times 1000^2\, e^{-100/1000}.$$
So,
$$\mathrm{Var}(Y^L) = \left(2 \times 1000^2\, e^{-100/1000}\right) - \left(1000\, e^{-100/1000}\right)^2 = 990{,}944.$$
An arguably simpler path to the solution is to make use of the relationship
between 𝑋 and $Y^P$. If 𝑋 is exponentially distributed with mean 1000, then $Y^P$
is also exponentially distributed with the same mean, because of the memoryless
property of the exponential distribution. Hence, $\mathrm{E}(Y^P) = 1000$ and
$$\mathrm{E}\left[(Y^P)^2\right] = 2 \times 1000^2.$$
Using the relationship between $Y^L$ and $Y^P$ we find
$$\mathrm{E}(Y^L) = \mathrm{E}(Y^P)\, S_X(100) = 1000\, e^{-100/1000},$$
$$\mathrm{E}\left[(Y^L)^2\right] = \mathrm{E}\left[(Y^P)^2\right] S_X(100) = 2 \times 1000^2\, e^{-100/1000}.$$
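A quick simulation check (ours) of this variance:

# Var(Y^L) for an exponential loss with mean 1000 and deductible 100
set.seed(2020)
x <- rexp(1e6, rate = 1 / 1000)
yL <- pmax(x - 100, 0)
var(yL)                                       # close to 990,944
2 * 1000^2 * exp(-0.1) - (1000 * exp(-0.1))^2 # exact value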

The relationship between 𝑋 and 𝑌 𝑃 can also be used when dealing with the
uniform or the Pareto distributions. You can easily show that if 𝑋 is uniform
over the interval (0, 𝜃) then 𝑌 𝑃 is uniform over the interval (0, 𝜃 − 𝑑) and if 𝑋
is Pareto with parameters 𝛼 and 𝜃 then 𝑌 𝑃 is Pareto with parameters 𝛼 and
𝜃 + 𝑑.

Example 3.4.2. Actuarial Exam Question. For an insurance:


• Losses have a density function
$$f_X(x) = \begin{cases} 0.02x & 0 < x < 10, \\ 0 & \text{elsewhere.} \end{cases}$$

• The insurance has an ordinary deductible of 4 per loss.


• 𝑌 𝑃 is the claim payment per payment random variable.
Calculate E (𝑌 𝑃 ).
Solution.
We define $Y^P$ as follows
$$Y^P = \begin{cases} \text{Undefined} & X \leq 4, \\ X - 4 & X > 4. \end{cases}$$
So,
$$\mathrm{E}(Y^P) = \frac{\int_4^{10} (x-4)\, 0.02x\, dx}{1 - F_X(4)} = \frac{2.88}{0.84} = 3.43.$$
Note that we divide by $S_X(4) = 1 - F_X(4)$, as this is the probability with which the
variable $Y^P$ is defined.
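The integral can be checked numerically in R (our illustration):

# Numerical check of E(Y^P) for Example 3.4.2
f <- function(x) 0.02 * x                      # density on (0, 10)
num <- integrate(function(x) (x - 4) * f(x), lower = 4, upper = 10)$value
den <- 1 - integrate(f, lower = 0, upper = 4)$value
num / den                                      # approximately 3.43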

Example 3.4.3. Actuarial Exam Question. You are given:


• Losses follow an exponential distribution with the same mean in all years.
• The loss elimination ratio this year is 70%.
• The ordinary deductible for the coming year is 4/3 of the current de-
ductible.
Compute the loss elimination ratio for the coming year.
Solution.
Let the losses 𝑋 ∼ 𝐸𝑥𝑝(𝜃) and the deductible for the coming year be $d' = \frac{4}{3}d$,
where 𝑑 is the deductible of the current year. The LER for the current year is
$$\frac{\mathrm{E}(X) - \mathrm{E}(Y^L)}{\mathrm{E}(X)} = \frac{\theta - \theta e^{-d/\theta}}{\theta} = 1 - e^{-d/\theta} = 0.7.$$
Then, $e^{-d/\theta} = 0.3$.
The LER for the coming year is
$$\frac{\theta - \theta \exp\left(-\frac{4d}{3\theta}\right)}{\theta} = 1 - \exp\left(-\frac{4d}{3\theta}\right) = 1 - \left(e^{-d/\theta}\right)^{4/3} = 1 - 0.3^{4/3} = 0.8.$$

3.4.2 Policy Limits


Under a limited policy, the insurer is responsible for covering the actual loss 𝑋
up to the limit of its coverage. This fixed limit of coverage is called the policy
limit and often denoted by 𝑢. If the loss exceeds the policy limit, the difference
𝑋 − 𝑢 has to be paid by the policyholder. While a higher policy limit means a
higher payout to the insured, it is associated with a higher premium.
Let 𝑋 denote the loss incurred to the insured and 𝑌 denote the amount of paid
claim by the insurer. The variable 𝑌 is known as the limited loss variable and
is denoted by 𝑋 ∧ 𝑢. It is a right censored variable because values above 𝑢 are
set equal to 𝑢. The limited loss random variable 𝑌 is defined as

$$Y = X \wedge u = \begin{cases} X & X \leq u, \\ u & X > u. \end{cases}$$

It can be seen that the distinction between 𝑌 𝐿 and 𝑌 𝑃 is not needed under
limited policy as the insurer will always make a payment.
Using the definitions of (𝑋 − 𝑢)+ and (𝑋 ∧ 𝑢), it can be easily seen that the
expected payment without any coverage modification, 𝑋, is equal to the sum of
the expected payments with deductible 𝑢 and limit 𝑢. That is, 𝑋 = (𝑋 − 𝑢)+ +
(𝑋 ∧ 𝑢).
When a loss is subject to a deductible 𝑑 and a limit 𝑢, the per-loss variable $Y^L$
is defined as
$$Y^L = \begin{cases} 0 & X \leq d, \\ X - d & d < X \leq u, \\ u - d & X > u. \end{cases}$$
Hence, $Y^L$ can be expressed as $Y^L = (X \wedge u) - (X \wedge d)$.
Even when the distribution of 𝑋 is continuous, the distribution of 𝑌 is a hybrid
combination of discrete and continuous components. The discrete part of the
distribution is concentrated at 𝑌 = 𝑢 (when 𝑋 > 𝑢), while the continuous part
is spread over the interval 𝑌 < 𝑢 (when 𝑋 ≤ 𝑢). For the discrete part, the
probability that the benefit paid is 𝑢, is the probability that the loss exceeds
the policy limit 𝑢; that is,

Pr (𝑌 = 𝑢) = Pr (𝑋 > 𝑢) = 1 − 𝐹 𝑋 (𝑢) .

For the continuous part of the distribution 𝑌 = 𝑋, hence the pdf of 𝑌 is given
by
$$f_Y(y) = \begin{cases} f_X(y) & 0 < y < u, \\ 1 - F_X(u) & y = u. \end{cases}$$
Accordingly, the distribution function of 𝑌 is given by
$$F_Y(y) = \begin{cases} F_X(y) & 0 < y < u, \\ 1 & y \geq u. \end{cases}$$

The raw moments of 𝑌 can be found directly using the pdf of 𝑋 as follows
$$\mathrm{E}(Y^k) = \mathrm{E}\left[(X \wedge u)^k\right] = \int_0^u x^k f_X(x)\, dx + \int_u^{\infty} u^k f_X(x)\, dx = \int_0^u x^k f_X(x)\, dx + u^k\left[1 - F_X(u)\right].$$
An alternative expression using the survival function is
$$\mathrm{E}\left[(X \wedge u)^k\right] = \int_0^u k x^{k-1}\left[1 - F_X(x)\right] dx.$$
In particular, for 𝑘 = 1, this is
$$\mathrm{E}(Y) = \mathrm{E}(X \wedge u) = \int_0^u \left[1 - F_X(x)\right] dx.$$
This could be easily proved if we start with the initial definition of E(𝑌) and
use integration by parts. Alternatively, see the following justification of this
limited expectation result.
$$\begin{aligned}
\mathrm{E}\left[(X \wedge u)^k\right] &= \mathrm{E}\left[\int_0^{X \wedge u} k x^{k-1}\, dx\right] \\
&= \mathrm{E}\left[\int_0^u k x^{k-1} I(X > x)\, dx\right] \\
&= \int_0^u k x^{k-1}\, \mathrm{E}\,I(X > x)\, dx \\
&= \int_0^u k x^{k-1}\left[1 - F_X(x)\right] dx.
\end{aligned}$$

This approach uses the Fubini-Tonelli theorem to exchange the expectation and
integration. Note that it does not make any continuity assumptions about the
distribution of 𝑋.

Example 3.4.4. Actuarial Exam Question. Under a group insurance policy,


an insurer agrees to pay 100% of the medical bills incurred during the year by
employees of a small company, up to a maximum total of one million dollars.
The total amount of bills incurred, 𝑋, has pdf
$$f_X(x) = \begin{cases} \frac{x(4-x)}{9} & 0 < x < 3, \\ 0 & \text{elsewhere,} \end{cases}$$

where 𝑥 is measured in millions. Calculate the total amount, in millions of


dollars, the insurer would expect to pay under this policy.
Solution.
Define the total amount of bills paid by the insurer as
$$Y = X \wedge 1 = \begin{cases} X & X \leq 1, \\ 1 & X > 1. \end{cases}$$
So
$$\mathrm{E}(Y) = \mathrm{E}(X \wedge 1) = \int_0^1 \frac{x^2(4-x)}{9}\, dx + 1 \cdot \int_1^3 \frac{x(4-x)}{9}\, dx = 0.935.$$
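A numerical check (ours) in R:

# Numerical check of E(X ^ 1) for Example 3.4.4
f <- function(x) x * (4 - x) / 9
integrate(function(x) x * f(x), 0, 1)$value + integrate(f, 1, 3)$value   # approximately 0.935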

3.4.3 Coinsurance and Inflation


As we have seen in Section 3.4.1, the amount of loss retained by the policyholder
can be losses up to the deductible 𝑑. The retained loss can also be a percentage
of the claim. The percentage 𝛼, often referred to as the coinsurance factor, is
the percentage of claim the insurance company is required to cover. If the policy
is subject to an ordinary deductible and policy limit, coinsurance refers to the
percentage of claim the insurer is required to cover, after imposing the ordinary
deductible and policy limit. The payment per loss variable, 𝑌 𝐿 , is defined as

$$Y^L = \begin{cases} 0 & X \leq d, \\ \alpha(X-d) & d < X \leq u, \\ \alpha(u-d) & X > u. \end{cases}$$

The maximum amount paid by the insurer in this case is 𝛼 (𝑢 − 𝑑), while 𝑢 is
the maximum covered loss.
We have seen in Section 3.4.2 that when a loss is subject to both a de-
ductible 𝑑 and a limit 𝑢, the per-loss variable $Y^L$ can be expressed as
$Y^L = (X \wedge u) - (X \wedge d)$. With coinsurance, this becomes
$Y^L = \alpha\left[(X \wedge u) - (X \wedge d)\right]$.
The 𝑘-th raw moment of $Y^L$ is given by
$$\mathrm{E}\left[(Y^L)^k\right] = \int_d^u \left[\alpha(x-d)\right]^k f_X(x)\, dx + \left[\alpha(u-d)\right]^k \left[1 - F_X(u)\right].$$

A growth factor (1 + 𝑟) may be applied to 𝑋 resulting in an inflated loss random
variable (1 + 𝑟)𝑋 (the prespecified 𝑑 and 𝑢 remain unchanged). The resulting
per loss variable can be written as
$$Y^L = \begin{cases} 0 & X \leq \frac{d}{1+r}, \\ \alpha\left[(1+r)X - d\right] & \frac{d}{1+r} < X \leq \frac{u}{1+r}, \\ \alpha(u-d) & X > \frac{u}{1+r}. \end{cases}$$

The first and second moments of $Y^L$ can be expressed as
$$\mathrm{E}(Y^L) = \alpha(1+r)\left[\mathrm{E}\left(X \wedge \frac{u}{1+r}\right) - \mathrm{E}\left(X \wedge \frac{d}{1+r}\right)\right],$$
and
$$\mathrm{E}\left[(Y^L)^2\right] = \alpha^2(1+r)^2\left\{\mathrm{E}\left[\left(X \wedge \tfrac{u}{1+r}\right)^2\right] - \mathrm{E}\left[\left(X \wedge \tfrac{d}{1+r}\right)^2\right] - 2\left(\tfrac{d}{1+r}\right)\left[\mathrm{E}\left(X \wedge \tfrac{u}{1+r}\right) - \mathrm{E}\left(X \wedge \tfrac{d}{1+r}\right)\right]\right\},$$
respectively.
The formulas given for the first and second moments of 𝑌 𝐿 are general. Under
full coverage, 𝛼 = 1, 𝑟 = 0, 𝑢 = ∞, 𝑑 = 0 and E (𝑌 𝐿 ) reduces to E (𝑋). If only
an ordinary deductible is imposed, 𝛼 = 1, 𝑟 = 0, 𝑢 = ∞ and E (𝑌 𝐿 ) reduces to
E (𝑋) − E (𝑋 ∧ 𝑑). If only a policy limit is imposed 𝛼 = 1, 𝑟 = 0, 𝑑 = 0 and
E (𝑌 𝐿 ) reduces to E (𝑋 ∧ 𝑢).
Example 3.4.5. Actuarial Exam Question. The ground up loss random
variable for a health insurance policy in 2006 is modeled with 𝑋, a random
variable with an exponential distribution having mean 1000. An insurance policy
pays the loss above an ordinary deductible of 100, with a maximum annual
payment of 500. The ground up loss random variable is expected to be 5%
larger in 2007, but the insurance in 2007 has the same deductible and maximum
payment as in 2006. Find the percentage increase in the expected cost per
payment from 2006 to 2007.
Solution.
We define the amount per loss $Y^L$ in both years as
$$Y^L_{2006} = \begin{cases} 0 & X \leq 100, \\ X - 100 & 100 < X \leq 600, \\ 500 & X > 600, \end{cases}
\qquad
Y^L_{2007} = \begin{cases} 0 & X \leq 95.24, \\ 1.05X - 100 & 95.24 < X \leq 571.43, \\ 500 & X > 571.43. \end{cases}$$
So,
$$\mathrm{E}(Y^L_{2006}) = \mathrm{E}(X \wedge 600) - \mathrm{E}(X \wedge 100) = 1000\left(1 - e^{-600/1000}\right) - 1000\left(1 - e^{-100/1000}\right) = 356.026.$$
Further,
$$\mathrm{E}(Y^L_{2007}) = 1.05\left[\mathrm{E}(X \wedge 571.43) - \mathrm{E}(X \wedge 95.24)\right] = 1.05\left[1000\left(1 - e^{-571.43/1000}\right) - 1000\left(1 - e^{-95.24/1000}\right)\right] = 361.659.$$
The corresponding per-payment expectations are
$$\mathrm{E}(Y^P_{2006}) = \frac{356.026}{e^{-100/1000}} = 393.469, \qquad \mathrm{E}(Y^P_{2007}) = \frac{361.659}{e^{-95.24/1000}} = 397.797.$$
Because $\mathrm{E}(Y^P_{2007})/\mathrm{E}(Y^P_{2006}) - 1 = 0.011$, there is an increase of 1.1% from 2006 to 2007.
Due to the policy limit, the cost per payment event grew by only 1.1% between
2006 and 2007 even though the ground up losses increased by 5% between the
two years.
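These limited expected values can be computed in R; the following sketch (ours) uses the closed form E(X ^ u) = theta(1 - exp(-u/theta)) for the exponential distribution, with lev_exp as our own helper name.

# Limited expected value for an exponential loss with mean theta
lev_exp <- function(u, theta) theta * (1 - exp(-u / theta))
theta <- 1000
eyl_2006 <- lev_exp(600, theta) - lev_exp(100, theta)
eyl_2007 <- 1.05 * (lev_exp(600 / 1.05, theta) - lev_exp(100 / 1.05, theta))
eyp_2006 <- eyl_2006 / exp(-100 / theta)
eyp_2007 <- eyl_2007 / exp(-(100 / 1.05) / theta)
eyp_2007 / eyp_2006 - 1      # approximately 0.011, a 1.1% increase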

3.4.4 Reinsurance
In Section 3.4.1 we introduced the policy deductible feature of the insurance
contract. In this feature, there is a contractual arrangement under which an
insured transfers part of the risk by securing coverage from an insurer in return
for an insurance premium. Under that policy, the insured must pay all losses
up to the deductible, and the insurer only pays the amount (if any) above the
deductible. We now introduce reinsurance, a mechanism of insurance for in-
surance companies. Reinsurance is a contractual arrangement under which an
insurer transfers part of the underlying insured risk by securing coverage from
another insurer (referred to as a reinsurer) in return for a reinsurance premium.
Although reinsurance involves a relationship between three parties: the original
insured, the insurer (often referred to as the cedant or cedent) and the reinsurer,
the parties of the reinsurance agreement are only the primary insurer and the
reinsurer. There is no contractual agreement between the original insured and
the reinsurer. Though many different types of reinsurance contracts exist, a
common form is excess of loss coverage. In such contracts, the primary insurer
must make all required payments to the insured until the primary insurer’s total
payments reach a fixed reinsurance deductible. The reinsurer is then only respon-
sible for paying losses above the reinsurance deductible. The maximum amount
retained by the primary insurer in the reinsurance agreement (the reinsurance
deductible) is called retention.

Reinsurance arrangements allow insurers with limited financial resources to in-


crease the capacity to write insurance and meet client requests for larger in-
surance coverage while reducing the impact of potential losses and protecting
the insurance company against catastrophic losses. Reinsurance also allows
the primary insurer to benefit from underwriting skills, expertise and proficient
complex claim file handling of the larger reinsurance companies.
Example 3.4.6. Actuarial Exam Question. Losses arising in a certain
portfolio have a two-parameter Pareto distribution with 𝛼 = 5 and 𝜃 = 3, 600.
A reinsurance arrangement has been made, under which (a) the reinsurer accepts
15% of losses up to 𝑢 = 5, 000 and all amounts in excess of 5,000 and (b) the
insurer pays for the remaining losses.
a) Express the random variables for the reinsurer’s and the insurer’s pay-
ments as a function of 𝑋, the portfolio losses.
b) Calculate the mean amount paid on a single claim by the insurer.
c) By assuming that the upper limit is 𝑢 = ∞, calculate an upper bound on
the standard deviation of the amount paid on a single claim by the insurer
(retaining the 15% copayment).
Solution.
a) The reinsurer’s portion is
$$Y_{reinsurer} = \begin{cases} 0.15X & X < 5000, \\ 0.15(5000) + X - 5000 & X \geq 5000, \end{cases}$$
and the insurer’s portion is
$$Y_{insurer} = \begin{cases} 0.85X & X < 5000, \\ 0.85(5000) & X \geq 5000 \end{cases} = 0.85(X \wedge 5000).$$
b) Using the limited expected value tables for the Pareto distribution, we have
$$\mathrm{E}(Y_{insurer}) = 0.85\, \mathrm{E}(X \wedge 5000) = 0.85\, \frac{\theta}{\alpha-1}\left[1 - \left(\frac{\theta}{5000+\theta}\right)^{\alpha-1}\right] = 0.85\, \frac{3600}{5-1}\left[1 - \left(\frac{3600}{5000+3600}\right)^{5-1}\right] = 741.51.$$
c) The unlimited variable is 0.85𝑋. For the first moment, we have
$$0.85\, \mathrm{E}(X) = 0.85\, \frac{\theta}{\alpha-1} = 0.85\, \frac{3600}{5-1} = 765.$$
For the second moment of the unlimited variable, we use the table of distribu-
tions to get
$$0.85^2\, \mathrm{E}(X^2) = 0.85^2\, \frac{\theta^2\, \Gamma(2+1)\,\Gamma(\alpha-2)}{\Gamma(\alpha)} = 0.85^2\, \frac{3600^2 \cdot 2 \cdot 2}{24} = 1{,}560{,}600.$$
Thus, the variance is $1{,}560{,}600 - 765^2 = 975{,}375$. Alternatively, you can use the
formula
$$0.85^2\, \mathrm{Var}(X) = 0.85^2\, \frac{\alpha\theta^2}{(\alpha-1)^2(\alpha-2)} = 0.85^2\, \frac{5(3600^2)}{(5-1)^2(5-2)} = 975{,}375.$$
Taking square roots, the standard deviation is $\sqrt{975{,}375} \approx 987.6108$.

Further discussions of reinsurance will be provided in Section 10.4.

3.5 Maximum Likelihood Estimation

In this section, you learn how to:


• Define a likelihood for a sample of observations from a continuous distri-
bution
• Define the maximum likelihood estimator for a random sample of obser-
vations from a continuous distribution
• Estimate parametric distributions based on grouped, censored, and trun-
cated data

3.5.1 Maximum Likelihood Estimators for Complete Data


Up to this point, the chapter has focused on parametric distributions that are
commonly used in insurance applications. However, to be useful in applied work,
these distributions must use “realistic” values for the parameters and for this we
turn to data. At a foundational level, we assume that the analyst has available a
random sample 𝑋1 , … , 𝑋𝑛 from a distribution with distribution function 𝐹𝑋 (for
brevity, we sometimes drop the subscript 𝑋). As is common, we use the vector
𝜃 to denote the set of parameters for 𝐹 . This basic sample scheme is reviewed
in Appendix Section 15.1. Although basic, this sampling scheme provides the
foundations for understanding more complex schemes that are regularly used in
practice, and so it is important to master the basics.
Before drawing from a distribution, we consider potential outcomes summarized
by the random variable 𝑋𝑖 (here, 𝑖 is 1, 2, …, 𝑛). After the draw, we observe
𝑥𝑖 . Notationally, we use uppercase roman letters for random variables and lower
case ones for realizations. We have seen this set-up already in Section 2.4, where
we used Pr(𝑋1 = 𝑥1 , … , 𝑋𝑛 = 𝑥𝑛 ) to quantify the “likelihood” of drawing a
sample {𝑥1 , … , 𝑥𝑛 }. With continuous data, we use the joint probability density
function instead of joint probabilities. With the independence assumption, the

joint pdf may be written as the product of pdfs. Thus, we define the likelihood
to be

$$L(\theta) = \prod_{i=1}^{n} f(x_i). \tag{3.3}$$

From the notation, note that we consider this to be a function of the parameters
in 𝜃, with the data {𝑥1 , … , 𝑥𝑛 } held fixed. The maximum likelihood estimator
is that value of the parameters in 𝜃 that maximize 𝐿(𝜃).
From calculus, we know that maximizing a function produces the same results
as maximizing the logarithm of a function (this is because the logarithm is a
monotone function). Because we get the same results, to ease computational
considerations, it is common to consider the logarithmic likelihood, denoted
as

$$l(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i). \tag{3.4}$$

Appendix Section 15.2.2 reviews the foundations of maximum likelihood estima-


tion with more mathematical details in Appendix Chapter 17.
Example 3.5.1. Actuarial Exam Question. You are given the following
five observations: 521, 658, 702, 819, 1217. You use the single-parameter Pareto
with distribution function:

$$F(x) = 1 - \left(\frac{500}{x}\right)^{\alpha}, \quad x > 500.$$

With 𝑛 = 5, the log-likelihood function is
$$l(\alpha) = \sum_{i=1}^{5} \log f(x_i; \alpha) = 5\alpha \log 500 + 5 \log\alpha - (\alpha+1) \sum_{i=1}^{5} \log x_i.$$

Figure 3.4 shows the logarithmic likelihood as a function of the parameter 𝛼.
We can determine the maximum value of the logarithmic likelihood by taking
derivatives and setting them equal to zero. This yields
$$\frac{\partial}{\partial\alpha}\, l(\alpha) = 5 \log 500 + \frac{5}{\alpha} - \sum_{i=1}^{5} \log x_i \stackrel{set}{=} 0
\;\Rightarrow\; \hat{\alpha}_{MLE} = \frac{5}{\sum_{i=1}^{5} \log x_i - 5 \log 500} = 2.453.$$

Naturally, there are many problems where it is not practical to use hand calcula-
tions for optimization. Fortunately there are many statistical routines available
such as the R function optim.

Figure 3.4: Logarithmic Likelihood for a One-Parameter Pareto
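The R snippet from the original text is not reproduced in this extraction; a minimal sketch (ours) of how the optimization could be carried out with optim, as mentioned above, is:

# Negative log-likelihood for the single-parameter Pareto with theta = 500
x <- c(521, 658, 702, 819, 1217)
negloglik <- function(alpha) {
  -(5 * alpha * log(500) + 5 * log(alpha) - (alpha + 1) * sum(log(x)))
}
optim(par = 2, fn = negloglik, method = "Brent", lower = 0.01, upper = 10)$par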

This code confirms our hand calculation result where the maximum likelihood
estimator is $\hat{\alpha}_{MLE} = 2.453125$.

We present a few additional examples to illustrate how actuaries fit a parametric


distribution model to a set of claim data using maximum likelihood.
Example 3.5.2. Actuarial Exam Question. Consider a random sample of
claim amounts: 8000 10000 12000 15000. You assume that claim amounts follow
an inverse exponential distribution, with parameter 𝜃. Calculate the maximum
likelihood estimator for 𝜃.
Solution.
The pdf is
$$f_X(x) = \frac{\theta\, e^{-\theta/x}}{x^2},$$
where 𝑥 > 0.
The likelihood function, 𝐿(𝜃), can be viewed as the probability of the observed
data, written as a function of the model’s parameter 𝜃
$$L(\theta) = \prod_{i=1}^{4} f_{X_i}(x_i) = \frac{\theta^4\, e^{-\theta \sum_{i=1}^{4} \frac{1}{x_i}}}{\prod_{i=1}^{4} x_i^2}.$$

The log-likelihood function, log 𝐿(𝜃), is the sum of the individual logarithms
$$\log L(\theta) = 4 \log\theta - \theta \sum_{i=1}^{4} \frac{1}{x_i} - 2 \sum_{i=1}^{4} \log x_i.$$

Taking a derivative, we have
$$\frac{d \log L(\theta)}{d\theta} = \frac{4}{\theta} - \sum_{i=1}^{4} \frac{1}{x_i}.$$
The maximum likelihood estimator of 𝜃, denoted by $\hat{\theta}$, is the solution to the
equation
$$\frac{4}{\hat{\theta}} - \sum_{i=1}^{4} \frac{1}{x_i} = 0.$$
Thus, $\hat{\theta} = \frac{4}{\sum_{i=1}^{4} 1/x_i} = 10{,}667$.

The second derivative of log 𝐿(𝜃) is given by
$$\frac{d^2 \log L(\theta)}{d\theta^2} = \frac{-4}{\theta^2}.$$
Evaluating the second derivative of the loglikelihood function at $\hat{\theta} = 10{,}667$
gives a negative value, indicating $\hat{\theta}$ as the value that maximizes the loglikelihood
function.

Example 3.5.3. Actuarial Exam Question. A random sample of size 6 is


from a lognormal distribution with parameters 𝜇 and 𝜎. The sample values are

200 3000 8000 60000 60000 160000.

Calculate the maximum likelihood estimator for 𝜇 and 𝜎.


Solution.
The pdf is

$$f_X(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{\log x - \mu}{\sigma}\right)^2\right),$$

where 𝑥 > 0.
The likelihood function, 𝐿(𝜇, 𝜎), is the product of the pdf for each data point.
$$L(\mu, \sigma) = \prod_{i=1}^{6} f_{X_i}(x_i) = \frac{1}{\sigma^6 (2\pi)^3 \prod_{i=1}^{6} x_i} \exp\left(-\frac{1}{2} \sum_{i=1}^{6} \left(\frac{\log x_i - \mu}{\sigma}\right)^2\right).$$

Taking a logarithm yields the loglikelihood function, log 𝐿(𝜇, 𝜎), which is the
sum of the individual logarithms.
$$\log L(\mu, \sigma) = -6 \log\sigma - 3 \log(2\pi) - \sum_{i=1}^{6} \log x_i - \frac{1}{2} \sum_{i=1}^{6} \left(\frac{\log x_i - \mu}{\sigma}\right)^2.$$

The first partial derivatives are
$$\frac{\partial \log L(\mu,\sigma)}{\partial\mu} = \frac{1}{\sigma^2} \sum_{i=1}^{6} (\log x_i - \mu), \qquad
\frac{\partial \log L(\mu,\sigma)}{\partial\sigma} = \frac{-6}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{6} (\log x_i - \mu)^2.$$
The maximum likelihood estimators of 𝜇 and 𝜎, denoted by $\hat{\mu}$ and $\hat{\sigma}$, are the
solutions to the equations
$$\frac{1}{\hat{\sigma}^2} \sum_{i=1}^{6} (\log x_i - \hat{\mu}) = 0, \qquad
\frac{-6}{\hat{\sigma}} + \frac{1}{\hat{\sigma}^3} \sum_{i=1}^{6} (\log x_i - \hat{\mu})^2 = 0.$$
These yield the estimates
$$\hat{\mu} = \frac{\sum_{i=1}^{6} \log x_i}{6} = 9.38 \quad \text{and} \quad \hat{\sigma}^2 = \frac{\sum_{i=1}^{6} (\log x_i - \hat{\mu})^2}{6} = 5.12.$$
To check that these estimates maximize, and do not minimize, the likelihood,
you may also wish to compute the second partial derivatives. These are
$$\frac{\partial^2 \log L(\mu,\sigma)}{\partial\mu^2} = \frac{-6}{\sigma^2}, \qquad
\frac{\partial^2 \log L(\mu,\sigma)}{\partial\mu\,\partial\sigma} = \frac{-2}{\sigma^3} \sum_{i=1}^{6} (\log x_i - \mu),$$
and
$$\frac{\partial^2 \log L(\mu,\sigma)}{\partial\sigma^2} = \frac{6}{\sigma^2} - \frac{3}{\sigma^4} \sum_{i=1}^{6} (\log x_i - \mu)^2.$$

Two follow-up questions rely on large sample properties that you may have
seen in an earlier course. Appendix Chapter 17 reviews the definition of the
likelihood function, introduces its properties, reviews the maximum likelihood
estimators, extends their large-sample properties to the case where there are
multiple parameters in the model, and reviews statistical inference based on
maximum likelihood estimators. In the solutions of these examples we derive the
asymptotic variance of maximum-likelihood estimators of the model parameters.
We use the delta method to derive the asymptotic variances of functions of these
parameters.

Example 3.5.2 - Follow - Up. Refer to Example 3.5.2.


a. Approximate the variance of the maximum likelihood estimator.
b. Determine an approximate 95% confidence interval for 𝜃.
c. Determine an approximate 95% confidence interval for Pr (𝑋 ≤ 9, 000) .
Solution.
a. Taking the reciprocal of the negative expectation of the second derivative of log 𝐿(𝜃),
we obtain an estimate of the variance of $\hat{\theta}$:
$$\widehat{\mathrm{Var}}(\hat{\theta}) = \left.\left[-\mathrm{E}\left(\frac{d^2 \log L(\theta)}{d\theta^2}\right)\right]^{-1}\right|_{\theta=\hat{\theta}} = \frac{\hat{\theta}^2}{4} = 28{,}446{,}222.$$
It should be noted that as the sample size 𝑛 → ∞, the distribution of the
maximum likelihood estimator 𝜃 ̂ converges to a normal distribution with mean
̂ The approximate confidence interval in this example is
𝜃 and variance 𝑉 ̂ (𝜃).
based on the assumption of normality, despite the small sample size, only for
the purpose of illustration.
b. The 95% confidence interval for 𝜃 is given by
$$10{,}667 \pm 1.96\sqrt{28{,}446{,}222} = (213.34,\; 21120.66).$$


c. The distribution function of 𝑋 is $F(x) = 1 - e^{-x/\theta}$. Then, the maximum
likelihood estimate of $g(\theta) = F(9{,}000)$ is
$$g(\hat{\theta}) = 1 - e^{-9{,}000/10{,}667} = 0.57.$$

We use the delta method to approximate the variance of $g(\hat{\theta})$.
$$\frac{dg(\theta)}{d\theta} = -\frac{9000}{\theta^2}\, e^{-9000/\theta}.$$
$$\widehat{\mathrm{Var}}\left[g(\hat{\theta})\right] = \left(-\frac{9000}{\hat{\theta}^2}\, e^{-9000/\hat{\theta}}\right)^2 \widehat{\mathrm{Var}}(\hat{\theta}) = 0.0329.$$

The 95% confidence interval for 𝐹(9000) is given by
$$0.57 \pm 1.96\sqrt{0.0329} = (0.214,\; 0.926).$$

Example 3.5.3 - Follow - Up. Refer to Example 3.5.3.


a. Estimate the covariance matrix of the maximum likelihood estimator.
b. Determine approximate 95% confidence intervals for 𝜇 and 𝜎.
c. Determine an approximate 95% confidence interval for the mean of the
lognormal distribution.

a. To derive the covariance matrix of the mle we need to find the expectations
of the second derivatives. Since the random variable 𝑋 is from a lognormal
distribution with parameters 𝜇 and 𝜎, then log 𝑋 is normally distributed with
mean 𝜇 and variance 𝜎².
$$\mathrm{E}\left(\frac{\partial^2 \log L(\mu,\sigma)}{\partial\mu^2}\right) = \mathrm{E}\left(\frac{-6}{\sigma^2}\right) = \frac{-6}{\sigma^2},$$
$$\mathrm{E}\left(\frac{\partial^2 \log L(\mu,\sigma)}{\partial\mu\,\partial\sigma}\right) = \frac{-2}{\sigma^3} \sum_{i=1}^{6} \mathrm{E}(\log x_i - \mu) = \frac{-2}{\sigma^3} \sum_{i=1}^{6} \left[\mathrm{E}(\log x_i) - \mu\right] = \frac{-2}{\sigma^3} \sum_{i=1}^{6} (\mu - \mu) = 0,$$
and
$$\mathrm{E}\left(\frac{\partial^2 \log L(\mu,\sigma)}{\partial\sigma^2}\right) = \frac{6}{\sigma^2} - \frac{3}{\sigma^4} \sum_{i=1}^{6} \mathrm{E}(\log x_i - \mu)^2 = \frac{6}{\sigma^2} - \frac{3}{\sigma^4} \sum_{i=1}^{6} \mathrm{Var}(\log x_i) = \frac{6}{\sigma^2} - \frac{3}{\sigma^4} \sum_{i=1}^{6} \sigma^2 = \frac{-12}{\sigma^2}.$$

Using the negatives of these expectations we obtain the Fisher information ma-
trix
$$\begin{bmatrix} \frac{6}{\sigma^2} & 0 \\ 0 & \frac{12}{\sigma^2} \end{bmatrix}.$$
The covariance matrix, Σ, is the inverse of the Fisher information matrix
$$\Sigma = \begin{bmatrix} \frac{\sigma^2}{6} & 0 \\ 0 & \frac{\sigma^2}{12} \end{bmatrix}.$$
The estimated matrix is given by
$$\hat{\Sigma} = \begin{bmatrix} 0.8533 & 0 \\ 0 & 0.4267 \end{bmatrix}.$$

b. The 95% confidence interval for 𝜇 is given by $9.38 \pm 1.96\sqrt{0.8533} = (7.57,\; 11.19)$.
The 95% confidence interval for 𝜎² is given by $5.12 \pm 1.96\sqrt{0.4267} = (3.84,\; 6.40)$.
c. The mean of 𝑋 is $\exp\left(\mu + \frac{\sigma^2}{2}\right)$. Then, the maximum likelihood estimate of
$$g(\mu, \sigma) = \exp\left(\mu + \frac{\sigma^2}{2}\right)$$
is
$$g(\hat{\mu}, \hat{\sigma}) = \exp\left(\hat{\mu} + \frac{\hat{\sigma}^2}{2}\right) = 153{,}277.$$

We use the delta method to approximate the variance of the mle $g(\hat{\mu}, \hat{\sigma})$.
$$\frac{\partial g(\mu,\sigma)}{\partial\mu} = \exp\left(\mu + \frac{\sigma^2}{2}\right) \quad \text{and} \quad \frac{\partial g(\mu,\sigma)}{\partial\sigma} = \sigma \exp\left(\mu + \frac{\sigma^2}{2}\right).$$
Using the delta method, the approximate variance of $g(\hat{\mu}, \hat{\sigma})$ is given by
$$\widehat{\mathrm{Var}}\left(g(\hat{\mu}, \hat{\sigma})\right) = \left.\left[\frac{\partial g(\mu,\sigma)}{\partial\mu} \;\; \frac{\partial g(\mu,\sigma)}{\partial\sigma}\right] \Sigma \begin{bmatrix} \frac{\partial g(\mu,\sigma)}{\partial\mu} \\ \frac{\partial g(\mu,\sigma)}{\partial\sigma} \end{bmatrix}\right|_{\mu=\hat{\mu},\,\sigma=\hat{\sigma}}
= \left[153{,}277 \;\; 346{,}826\right] \begin{bmatrix} 0.8533 & 0 \\ 0 & 0.4267 \end{bmatrix} \begin{bmatrix} 153{,}277 \\ 346{,}826 \end{bmatrix} = 71{,}374{,}380{,}000.$$
𝜎2
The 95% confidence interval for exp (𝜇 + 2 ) is given by

153277 ± 1.96√71, 374, 380, 000 = (−370356, 676910) .

Since the mean of the lognormal distribution cannot be negative, we should


replace the negative lower limit in the previous interval by a zero.

Example 3.5.4. Wisconsin Property Fund. To see how maximum likeli-


hood estimators work with real data, we return to the 2010 claims data intro-
duced in Section 1.3.
The following snippet of code shows how to fit the exponential, gamma, Pareto,
lognormal, and 𝐺𝐵2 models. For consistency, the code employs the R package
VGAM. The acronym stands for Vector Generalized Linear and Additive Models;
as suggested by the name, this package can do far more than fit these models
although it suffices for our purposes. The one exception is the 𝐺𝐵2 density
which is not widely used outside of insurance applications; however, we can
code this density and compute maximum likelihood estimators using the optim
general purpose optimizer.
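The original snippet is not reproduced in this extraction. As an illustration only, the sketch below shows how one of these fits (the lognormal) could be done with optim instead of the VGAM calls used in the text; the vector name claims is a hypothetical placeholder for the 2010 claim amounts from Section 1.3.

# Illustration only: a simplified maximum likelihood fit using optim.
# 'claims' is a hypothetical numeric vector of 2010 claim amounts.
fit_lognormal <- function(claims) {
  negloglik <- function(par) {
    -sum(dlnorm(claims, meanlog = par[1], sdlog = exp(par[2]), log = TRUE))
  }
  # moment-style starting values; sdlog is optimized on the log scale to keep it positive
  init <- c(mean(log(claims)), log(sd(log(claims))))
  fit <- optim(init, negloglik)
  c(meanlog = fit$par[1], sdlog = exp(fit$par[2]))
}
# Example usage (once 'claims' is available):
# fit_lognormal(claims)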
Results from the fitting exercise are summarized in Figure 3.5. Here, the black
“longdash” curve is a smoothed histogram of the actual data (that we will in-
troduce in Section 4.1); the other curves are parametric curves where the pa-
rameters are computed via maximum likelihood. We see poor fits in the red
dashed line from the exponential distribution fit and the blue dotted line from
the gamma distribution fit. Fits of the other curves, Pareto, lognormal, and
GB2, all seem to provide reasonably good fits to the actual data. Chapter 4
describes in more detail the principles of model selection.
Figure 3.5: Density Comparisons for the Wisconsin Property Fund

3.5.2 Maximum Likelihood Estimators using Modified Data
In many applications, actuaries and other analysts wish to estimate model pa-
rameters based on individual data that are not limited. However, there are also
important applications when only limited, or modified, data are available. This
section introduces maximum likelihood estimation for grouped, censored, and
truncated data. Later, we will follow up with additional details in Section 4.3.

Maximum Likelihood Estimators for Grouped Data

In the previous section we considered the maximum likelihood estimation of


continuous models from complete (individual) data. Each individual observation
is recorded, and its contribution to the likelihood function is the density at
that value. In this section we consider the problem of obtaining maximum
likelihood estimates of parameters from grouped data. The observations are
only available in grouped form, and the contribution of each observation to the
likelihood function is the probability of falling in a specific group (interval). Let
𝑛𝑗 represent the number of observations in the interval $(c_{j-1}, c_j]$. The grouped
data likelihood function is thus given by

$$L(\theta) = \prod_{j=1}^{k} \left[F_X(c_j \mid \theta) - F_X(c_{j-1} \mid \theta)\right]^{n_j},$$

where 𝑐0 is the smallest possible observation (often set to zero) and 𝑐𝑘 is the
largest possible observation (often set to infinity).
Example 3.5.5. Actuarial Exam Question. For a group of policies, you are
given that losses follow the distribution function $F_X(x) = 1 - \frac{\theta}{x}$, for $\theta < x < \infty$.
Further, a sample of 20 losses resulted in the following:

Interval Number of Losses


(𝜃, 10] 9
(10, 25] 6
(25, ∞) 5

Calculate the maximum likelihood estimate of 𝜃.


Solution.
The contribution of each of the 9 observations in the first interval to the likeli-
hood function is the probability of 𝑋 ≤ 10; that is, Pr (𝑋 ≤ 10) = 𝐹𝑋 (10). Simi-
larly, the contributions of each of 6 and 5 observations in the second and third in-
tervals are Pr (10 < 𝑋 ≤ 25) = 𝐹𝑋 (25) − 𝐹𝑋 (10) and 𝑃 (𝑋 > 25) = 1 − 𝐹𝑋 (25),
respectively. The likelihood function is thus given by
L(θ) = [F_X(10)]^9 [F_X(25) − F_X(10)]^6 [1 − F_X(25)]^5
     = (1 − θ/10)^9 (θ/10 − θ/25)^6 (θ/25)^5
     = ((10 − θ)/10)^9 (15θ/250)^6 (θ/25)^5.

Taking logarithms, the logarithmic likelihood is

log 𝐿 (𝜃) = 9 log (10 − 𝜃) + 6 log 𝜃 + 5 log 𝜃 − 9 log 10 + 6 log 15 − 6 log 250 − 5 log 25
= 9 log (10 − 𝜃) + 11 log 𝜃 + 𝑐𝑜𝑛𝑠𝑡𝑎𝑛𝑡.

Here, 𝑐𝑜𝑛𝑠𝑡𝑎𝑛𝑡 is a number that does not depend on 𝜃. Taking derivatives, we


have

d log L(θ)/dθ = −9/(10 − θ) + 11/θ.

The maximum likelihood estimator, θ̂, is the solution to the equation

−9/(10 − θ̂) + 11/θ̂ = 0,

which yields θ̂ = 5.5.
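For readers who wish to verify this result numerically, the following is a minimal sketch that maximizes the grouped data log likelihood in R; the interval counts are those given above and optimize takes the place of the calculus.

# Numerical check of Example 3.5.5: maximize the grouped data log likelihood
loglik <- function(theta) {
  9 * log(1 - theta / 10) +            # nine observations with X <= 10
  6 * log(theta / 10 - theta / 25) +   # six observations with 10 < X <= 25
  5 * log(theta / 25)                  # five observations with X > 25
}
optimize(loglik, interval = c(0.01, 9.99), maximum = TRUE)$maximum
# approximately 5.5, matching the closed-form answer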



Maximum Likelihood Estimators for Censored Data


Another possible distinguishing feature of a data gathering mechanism is censoring. While for some events of interest (losses, claims, lifetimes, etc.) the complete data may be available, for others only partial information is available; all that may be known is that the observation exceeds a specific value. The limited policy introduced in Section 3.4.2 is an example of right censoring. Any loss greater than or equal to the policy limit is recorded at the limit. The contribution of a censored observation to the likelihood function is the probability of the random variable exceeding this specific limit. Note that contributions of both complete and censored data share the survival function; for a complete point this survival function is multiplied by the hazard function, but for a censored observation it is not. The likelihood function for censored data is then given by
given by

L(θ) = [∏_{i=1}^r f_X(x_i)] [S_X(u)]^m,

where 𝑟 is the number of known loss amounts below the limit 𝑢 and 𝑚 is the
number of loss amounts larger than the limit 𝑢.
Example 3.5.6. Actuarial Exam Question. The random variable 𝑋 has
survival function:
S_X(x) = θ^4 / (θ^2 + x^2)^2.
Two values of 𝑋 are observed to be 2 and 4. One other value exceeds 4. Calculate
the maximum likelihood estimate of 𝜃.
Solution.
The contributions of the two observations 2 and 4 are 𝑓𝑋 (2) and 𝑓𝑋 (4) respec-
tively. The contribution of the third observation, which is only known to exceed
4 is 𝑆𝑋 (4). The likelihood function is thus given by

𝐿 (𝜃) = 𝑓𝑋 (2) 𝑓𝑋 (4) 𝑆𝑋 (4) .

The pdf of 𝑋 is given by

f_X(x) = 4xθ^4 / (θ^2 + x^2)^3.

Thus,

L(θ) = [8θ^4 / (θ^2 + 4)^3] × [16θ^4 / (θ^2 + 16)^3] × [θ^4 / (θ^2 + 16)^2] = 128θ^12 / [(θ^2 + 4)^3 (θ^2 + 16)^5],

So,
log 𝐿 (𝜃) = log 128 + 12 log 𝜃 − 3 log (𝜃2 + 4) − 5 log (𝜃2 + 16) ,

and
d log L(θ)/dθ = 12/θ − 6θ/(θ^2 + 4) − 10θ/(θ^2 + 16).

The maximum likelihood estimator, 𝜃,̂ is the solution to the equation

12/θ̂ − 6θ̂/(θ̂^2 + 4) − 10θ̂/(θ̂^2 + 16) = 0,

or

12(θ̂^2 + 4)(θ̂^2 + 16) − 6θ̂^2(θ̂^2 + 16) − 10θ̂^2(θ̂^2 + 4) = −4θ̂^4 + 104θ̂^2 + 768 = 0,

which yields θ̂^2 = 32 and θ̂ = 5.7.
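As a check, the same answer can be obtained numerically by maximizing the log likelihood above; the following is a minimal sketch.

# Numerical check of Example 3.5.6
loglik <- function(theta) {
  log(128) + 12 * log(theta) - 3 * log(theta^2 + 4) - 5 * log(theta^2 + 16)
}
optimize(loglik, interval = c(0.01, 100), maximum = TRUE)$maximum
# approximately 5.66 = sqrt(32)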

Maximum Likelihood Estimators for Truncated Data


This section is concerned with the maximum likelihood estimation of the con-
tinuous distribution of the random variable 𝑋 when the data is incomplete due
to truncation. If the values of 𝑋 are truncated at 𝑑, then it should be noted
that we would not have been aware of the existence of these values had they not
exceeded 𝑑. The policy deductible introduced in Section 3.4.1 is an example of
left truncation. Any loss less than or equal to the deductible is not recorded.
The contribution to the likelihood function of an observation x truncated at d will be a conditional probability, and f_X(x) will be replaced by f_X(x)/S_X(d). The likelihood function for truncated data is then given by

L(θ) = ∏_{i=1}^k f_X(x_i) / S_X(d),

where 𝑘 is the number of loss amounts larger than the deductible 𝑑.


Example 3.5.7. Actuarial Exam Question. For the single-parameter
Pareto distribution with 𝜃 = 2, maximum likelihood estimation is applied to
estimate the parameter 𝛼. Find the estimated mean of the ground up loss dis-
tribution based on the maximum likelihood estimate of 𝛼 for the following data
set:
• Ordinary policy deductible of 5, maximum covered loss of 25 (policy limit
20)
• 8 insurance payment amounts: 2, 4, 5, 5, 8, 10, 12, 15
• 2 limit payments: 20, 20.

Solution.
The contributions of the different observations can be summarized as follows:
• For the exact losses: f_X(x).
• For censored observations: S_X(25).
• For truncated observations: f_X(x)/S_X(5).

Given that ground up losses smaller than 5 are omitted from the data set,
the contribution of all observations should be conditional on exceeding 5. The
likelihood function becomes

L(α) = [∏_{i=1}^8 f_X(x_i) / [S_X(5)]^8] × [S_X(25)/S_X(5)]^2.

For the single-parameter Pareto the probability density and distribution func-
tions are given by

f_X(x) = αθ^α / x^(α+1)   and   F_X(x) = 1 − (θ/x)^α,
for 𝑥 > 𝜃, respectively. Then, the likelihood is given by

L(α) = α^8 5^(10α) / [25^(2α) ∏_{i=1}^8 x_i^(α+1)].

Taking logarithms, the loglikelihood function is

log L(α) = 8 log α − (α + 1) ∑_{i=1}^8 log x_i + 10α log 5 − 2α log 25.

Taking a derivative, we have

d log L(α)/dα = 8/α − ∑_{i=1}^8 log x_i + 10 log 5 − 2 log 25.

With this, the maximum likelihood estimator, 𝛼,̂ is the solution to the equation

8/α̂ − ∑_{i=1}^8 log x_i + 10 log 5 − 2 log 25 = 0.

Here the ground up losses x_i are the payment amounts plus the deductible of 5, that is, 7, 9, 10, 10, 13, 15, 17, 20. Solving yields

α̂ = 8 / [∑_{i=1}^8 log x_i − 10 log 5 + 2 log 25]
   = 8 / [(log 7 + log 9 + ⋯ + log 20) − 10 log 5 + 2 log 25] = 0.785.

The mean of the single-parameter Pareto is finite only for α > 1 (see Appendix Section 18.2). Since α̂ = 0.785 < 1, the estimated mean of the ground up loss distribution is infinite.
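The closed-form estimator is easy to reproduce in R; the sketch below reconstructs the ground up losses from the payments and evaluates α̂.

# Sketch reproducing Example 3.5.7; ground up losses are payments plus the deductible of 5
x <- c(2, 4, 5, 5, 8, 10, 12, 15) + 5        # 7, 9, 10, 10, 13, 15, 17, 20
alpha_hat <- 8 / (sum(log(x)) - 10 * log(5) + 2 * log(25))
alpha_hat                                    # approximately 0.785, so the fitted mean is infinite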

3.6 Further Resources and Contributors


Contributors
• Zeinab Amin, The American University in Cairo, is the principal author
of this chapter. Email: [email protected] for chapter comments
and suggested improvements.
• Many helpful comments have been provided by Hirokazu (Iwahiro) Iwa-
sawa, [email protected] .
• Other chapter reviewers include: Rob Erhardt, Samuel Kolins, Tatjana
Miljkovic, Michelle Xia, and Jorge Yslas.

Exercises
Here is a set of exercises that guides the viewer through some of the theoretical
foundations of Loss Data Analytics. Each tutorial is based on one or more
questions from the professional actuarial examinations – typically the Society
of Actuaries Exam C/STAM.
Severity Distribution Guided Tutorials

Further Readings and References


Notable contributions include: Cummins and Derrig (2012), Frees and Valdez
(2008), Klugman et al. (2012), Kreer et al. (2015), McDonald (1984), McDonald
and Xu (1995), Tevet (2016), and Venter (1983).
Chapter 4

Model Selection and Estimation

Chapter Preview. Chapters 2 and 3 have described how to fit parametric mod-
els to frequency and severity data, respectively. This chapter begins with the
selection of models. To compare alternative parametric models, it is helpful to
summarize data without reference to a specific parametric distribution. Section
4.1 describes nonparametric estimation, how we can use it for model compar-
isons and how it can be used to provide starting values for parametric procedures.
The process of model selection is then summarized in Section 4.2. Although our
focus is on data from continuous distributions, the same process can be used for
discrete versions or data that come from a hybrid combination of discrete and
continuous distributions.
Model selection and estimation are fundamental aspects of statistical modeling.
To provide a flavor as to how they can be adapted to alternative sampling
schemes, Section 4.3.1 describes estimation for grouped, censored and truncated
data (following the Section 3.5 introduction). To see how they can be adapted
to alternative models, the chapter closes with Section 4.4 on Bayesian inference,
an alternative procedure where the (typically unknown) parameters are treated
as random variables.

4.1 Nonparametric Inference

In this section, you learn how to:


• Estimate moments, quantiles, and distributions without reference to a
parametric distribution


• Summarize the data graphically without reference to a parametric distribution
• Determine measures that summarize deviations of a parametric from a
nonparametric fit
• Use nonparametric estimators to approximate parameters that can be used
to start a parametric estimation procedure

4.1.1 Nonparametric Estimation


In Section 2.2 for frequency and Section 3.1 for severity, we learned how to sum-
marize a distribution by computing means, variances, quantiles/percentiles, and
so on. To approximate these summary measures using a dataset, one strategy
is to:
i. assume a parametric form for a distribution, such as a negative binomial
for frequency or a gamma distribution for severity,
ii. estimate the parameters of that distribution, and then
iii. use the distribution with the estimated parameters to calculate the desired
summary measure.
This is the parametric approach. Another strategy is to estimate the desired
summary measure directly from the observations without reference to a para-
metric model. Not surprisingly, this is known as the nonparametric approach.
Let us start by considering the most basic type of sampling scheme and assume
that observations are realizations from a set of random variables 𝑋1 , … , 𝑋𝑛 that
are iid draws from an unknown population distribution 𝐹 (⋅). An equivalent way
of saying this is that 𝑋1 , … , 𝑋𝑛 , is a random sample (with replacement) from
𝐹 (⋅). To see how this works, we now describe nonparametric estimators of many
important measures that summarize a distribution.

Moment Estimators
We learned how to define moments in Section 2.2.2 for frequency and Section
3.1.1 for severity. In particular, the 𝑘-th moment, E [𝑋 𝑘 ] = 𝜇′𝑘 , summarizes
many aspects of the distribution for different choices of 𝑘. Here, 𝜇′𝑘 is sometimes
called the 𝑘th population moment to distinguish it from the 𝑘th sample moment,

(1/n) ∑_{i=1}^n X_i^k,

which is the corresponding nonparametric estimator. In typical applications, 𝑘


is a positive integer, although it need not be in theory.
An important special case is the first moment where 𝑘 = 1. In this case, the
prime symbol (′) and the 1 subscript are usually dropped and one uses 𝜇 = 𝜇′1
to denote the population mean, or simply the mean. The corresponding sample

estimator for 𝜇 is called the sample mean, denoted with a bar on top of the
random variable:

X̄ = (1/n) ∑_{i=1}^n X_i.

Another type of summary measure of interest is the k-th central moment, E[(X − μ)^k] = μ_k. (Sometimes, μ′_k is called the k-th raw moment to distinguish it from the central moment μ_k.) A nonparametric, or sample, estimator of μ_k is

(1/n) ∑_{i=1}^n (X_i − X̄)^k.

The second central moment (k = 2) is an important case for which we typically assign a new symbol, σ^2 = E[(X − μ)^2], known as the variance. The sample moment estimator of the variance, n^(−1) ∑_{i=1}^n (X_i − X̄)^2, has been studied extensively but it is not the only possible estimator. The most widely used version is one where the effective sample size is reduced by one, and so we define

s^2 = (1/(n − 1)) ∑_{i=1}^n (X_i − X̄)^2.

Dividing by 𝑛 − 1 instead of 𝑛 matters little when you have a large sample size
𝑛 as is common in insurance applications. The sample variance estimator 𝑠2 is
unbiased in the sense that E [𝑠2 ] = 𝜎2 , a desirable property particularly when
interpreting results of an analysis.

Empirical Distribution Function


We have seen how to compute nonparametric estimators of the kth moment E[X^k]. In the same way, for any known function g(⋅), we can estimate E[g(X)] using n^(−1) ∑_{i=1}^n g(X_i).
Now consider the function g(𝑋) = 𝐼(𝑋 ≤ 𝑥) for a fixed 𝑥. Here, the notation
𝐼(⋅) is the indicator function; it returns 1 if the event (⋅) is true and 0 otherwise.
Note that now the random variable g(𝑋) has Bernoulli distribution (a binomial
distribution with 𝑛 = 1). We can use this distribution to readily calculate
quantities such as the mean and the variance. For example, for this choice of
g(⋅), the expected value is E [𝐼(𝑋 ≤ 𝑥)] = Pr(𝑋 ≤ 𝑥) = 𝐹 (𝑥), the distribution
function evaluated at 𝑥. Using the analog principle, we define the nonparametric
estimator of the distribution function

F_n(x) = (1/n) ∑_{i=1}^n I(X_i ≤ x)
       = (number of observations less than or equal to x) / n.

As 𝐹𝑛 (⋅) is based only on the observations and does not assume a parametric fam-
ily for the distribution, it is nonparametric and also known as the empirical
distribution function. It is also known as the empirical cumulative distribution
function and, in R, one can use the ecdf(.) function to compute it.
Example 4.1.1. Toy Data Set. To illustrate, consider a fictitious, or “toy,”
data set of 𝑛 = 10 observations. Determine the empirical distribution function.

𝑖 1 2 3 4 5 6 7 8 9 10
𝑋𝑖 10 15 15 15 20 23 23 23 23 30

You should check that the sample mean is 𝑋 = 19.7 and that the sample variance
is 𝑠2 = 34.45556. The corresponding empirical distribution function is

F_n(x) =
  0     for x < 10
  0.1   for 10 ≤ x < 15
  0.4   for 15 ≤ x < 20
  0.5   for 20 ≤ x < 23
  0.9   for 23 ≤ x < 30
  1     for x ≥ 30,

as shown in Figure 4.1. The empirical distribution function is generally discrete and continuous from the right.
Figure 4.1: Empirical Distribution Function of a Toy Example
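In R, these calculations can be reproduced with a few lines; ecdf() returns the empirical distribution function plotted in Figure 4.1. The vector name xExample matches the code used later in this example.

# Toy data set of Example 4.1.1
xExample <- c(10, 15, 15, 15, 20, 23, 23, 23, 23, 30)
mean(xExample)              # 19.7
var(xExample)               # 34.45556 (uses the n - 1 divisor)
Fn <- ecdf(xExample)        # empirical distribution function
Fn(c(10, 15, 20, 23, 30))   # 0.1 0.4 0.5 0.9 1.0
plot(Fn)                    # reproduces the step function in Figure 4.1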

Quartiles, Percentiles and Quantiles


We have already seen in Section 3.1.1 the median, which is the number such
that approximately half of a data set is below (or above) it. The first quartile is

the number such that approximately 25% of the data is below it and the third
quartile is the number such that approximately 75% of the data is below it. A
100𝑝 percentile is the number such that 100 × 𝑝 percent of the data is below it.
To generalize this concept, consider a distribution function 𝐹 (⋅), which may or
may not be continuous, and let 𝑞 be a fraction so that 0 < 𝑞 < 1. We want
to define a quantile, say 𝑞𝐹 , to be a number such that 𝐹 (𝑞𝐹 ) ≈ 𝑞. Notice that
when 𝑞 = 0.5, 𝑞𝐹 is the median; when 𝑞 = 0.25, 𝑞𝐹 is the first quartile, and so
on. In the same way, when 𝑞 = 0, 0.01, 0.02, … , 0.99, 1.00, the resulting 𝑞𝐹 is
a percentile. So, a quantile generalizes the concepts of median, quartiles, and
percentiles.
To be precise, for a given 0 < 𝑞 < 1, define the 𝑞th quantile 𝑞𝐹 to be any
number that satisfies

𝐹 (𝑞𝐹 −) ≤ 𝑞 ≤ 𝐹 (𝑞𝐹 ) (4.1)

Here, the notation 𝐹 (𝑥−) means to evaluate the function 𝐹 (⋅) as a left-hand
limit.
To get a better understanding of this definition, let us look at a few special
cases. First, consider the case where 𝑋 is a continuous random variable so that
the distribution function 𝐹 (⋅) has no jump points, as illustrated in Figure 4.2.
In this figure, a few fractions, 𝑞1 , 𝑞2 , and 𝑞3 are shown with their corresponding
quantiles 𝑞𝐹 ,1 , 𝑞𝐹 ,2 , and 𝑞𝐹 ,3 . In each case, it can be seen that 𝐹 (𝑞𝐹 −) = 𝐹 (𝑞𝐹 )
so that there is a unique quantile. Because we can find a unique inverse of the
distribution function at any 0 < 𝑞 < 1, we can write 𝑞𝐹 = 𝐹 −1 (𝑞).

Figure 4.2: Continuous Quantile Case

Figure 4.3 shows three cases for distribution functions. The left panel corre-
sponds to the continuous case just discussed. The middle panel displays a jump
point similar to those we already saw in the empirical distribution function
of Figure 4.1. For the value of 𝑞 shown in this panel, we still have a unique
value of the quantile 𝑞𝐹 . Even though there are many values of 𝑞 such that
𝐹 (𝑞𝐹 −) ≤ 𝑞 ≤ 𝐹 (𝑞𝐹 ), for a particular value of 𝑞, there is only one solution to
equation (4.1). The right panel depicts a situation in which the quantile cannot
be uniquely determined for the 𝑞 shown as there is a range of 𝑞𝐹 ’s satisfying
equation (4.1).

Figure 4.3: Three Quantile Cases

Example 4.1.2. Toy Data Set: Continued. Determine quantiles corresponding to the 20th, 50th, and 95th percentiles.
Solution. Consider Figure 4.1. The case of 𝑞 = 0.20 corresponds to the middle
panel of Figure 4.3, so the 20th percentile is 15. The case of 𝑞 = 0.50
corresponds to the right panel, so the median is any number between 20 and 23
inclusive. Many software packages use the average 21.5 (e.g. R, as seen below).
For the 95th percentile, the solution is 30. We can see from Figure 4.1 that 30
also corresponds to the 99th and the 99.99th percentiles.
quantile(xExample, probs=c(0.2, 0.5, 0.95), type=6)

##  20%  50%  95%
## 15.0 21.5 30.0

By taking a weighted average between data observations, smoothed empirical


quantiles can handle cases such as the right panel in Figure 4.3. The 𝑞th
smoothed empirical quantile is defined as

𝜋𝑞̂ = (1 − ℎ)𝑋(𝑗) + ℎ𝑋(𝑗+1)

where 𝑗 = ⌊(𝑛 + 1)𝑞⌋, ℎ = (𝑛 + 1)𝑞 − 𝑗, and 𝑋(1) , … , 𝑋(𝑛) are the ordered values
(known as the order statistics) corresponding to 𝑋1 , … , 𝑋𝑛 . (Recall that the
brackets ⌊⋅⌋ are the floor function denoting the greatest integer value.) Note
that 𝜋𝑞̂ is simply a linear interpolation between 𝑋(𝑗) and 𝑋(𝑗+1) .

Example 4.1.3. Toy Data Set: Continued. Determine the 50th and 20th
smoothed percentiles.
Solution. Take 𝑛 = 10 and 𝑞 = 0.5. Then, 𝑗 = ⌊(11)(0.5)⌋ = ⌊5.5⌋ = 5 and
ℎ = (11)(0.5) − 5 = 0.5. Then the 0.5-th smoothed empirical quantile is

π̂_0.5 = (1 − 0.5)X_(5) + (0.5)X_(6) = 0.5(20) + (0.5)(23) = 21.5.

Now take 𝑛 = 10 and 𝑞 = 0.2. In this case, 𝑗 = ⌊(11)(0.2)⌋ = ⌊2.2⌋ = 2 and


ℎ = (11)(0.2) − 2 = 0.2. Then the 0.2-th smoothed empirical quantile is

π̂_0.2 = (1 − 0.2)X_(2) + (0.2)X_(3) = 0.8(15) + (0.2)(15) = 15.
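The smoothed empirical quantile is straightforward to code directly; the sketch below reproduces these two values and confirms that they agree with quantile() using type = 6.

# Smoothed empirical quantile of the toy data set (a sketch)
xExample <- c(10, 15, 15, 15, 20, 23, 23, 23, 23, 30)
smoothed_quantile <- function(x, q) {
  xs <- sort(x)
  n  <- length(x)
  j  <- floor((n + 1) * q)
  h  <- (n + 1) * q - j
  (1 - h) * xs[j] + h * xs[j + 1]     # linear interpolation between order statistics
}
smoothed_quantile(xExample, 0.5)      # 21.5
smoothed_quantile(xExample, 0.2)      # 15
quantile(xExample, probs = c(0.2, 0.5), type = 6)   # same values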

Density Estimators
Discrete Variable. When the random variable is discrete, estimating the
probability mass function 𝑓(𝑥) = Pr(𝑋 = 𝑥) is straightforward. We simply use
the sample average, defined to be

f_n(x) = (1/n) ∑_{i=1}^n I(X_i = x),

which is the proportion of the sample equal to 𝑥.


Continuous Variable within a Group. For a continuous random variable,
consider a discretized formulation in which the domain of 𝐹 (⋅) is partitioned by
constants {𝑐0 < 𝑐1 < ⋯ < 𝑐𝑘 } into intervals of the form [𝑐𝑗−1 , 𝑐𝑗 ), for 𝑗 = 1, … , 𝑘.
The data observations are thus “grouped” by the intervals into which they fall.
Then, we might use the basic definition of the empirical mass function, or a
variation such as
f_n(x) = n_j / [n × (c_j − c_{j−1})],   for c_{j−1} ≤ x < c_j,

where 𝑛𝑗 is the number of observations (𝑋𝑖 ) that fall into the interval [𝑐𝑗−1 , 𝑐𝑗 ).
Continuous Variable (not grouped). Extending this notion to instances
where we observe individual data, note that we can always create arbitrary
groupings and use this formula. More formally, let 𝑏 > 0 be a small positive
constant, known as a bandwidth, and define a density estimator to be

f_n(x) = (1/(2nb)) ∑_{i=1}^n I(x − b < X_i ≤ x + b).      (4.2)

Snippet of Theory. The idea is that the estimator 𝑓𝑛 (𝑥) in equation (4.2) is
the average over 𝑛 iid realizations of a random variable with mean

E[(1/(2b)) I(x − b < X ≤ x + b)] = (1/(2b)) (F(x + b) − F(x − b)) → F′(x) = f(x),

as 𝑏 → 0. That is, 𝑓𝑛 (𝑥) is an asymptotically unbiased estimator of 𝑓(𝑥) (its


expectation approaches the true value as sample size increases to infinity). This
development assumes some smoothness of 𝐹 (⋅), in particular, twice differentia-
bility at 𝑥, but makes no assumptions on the form of the distribution function
𝐹 . Because of this, the density estimator 𝑓𝑛 is said to be nonparametric.

More generally, define the kernel density estimator of the pdf at 𝑥 as

f_n(x) = (1/(nb)) ∑_{i=1}^n w((x − X_i)/b),      (4.3)

where w is a probability density function centered about 0. Note that equation (4.2) is a special case of the kernel density estimator where w(x) = (1/2) I(−1 < x ≤ 1), also known as the uniform kernel. Other popular choices are shown in Table 4.1.
Table 4.1. Popular Kernel Choices

Kernel         w(x)
Uniform        (1/2) I(−1 < x ≤ 1)
Triangle       (1 − |x|) × I(|x| ≤ 1)
Epanechnikov   (3/4)(1 − x^2) × I(|x| ≤ 1)
Gaussian       ϕ(x)

Here, 𝜙(⋅) is the standard normal density function. As we will see in the following
example, the choice of bandwidth 𝑏 comes with a bias-variance tradeoff between
matching local distributional features and reducing the volatility.

Example 4.1.4. Property Fund. Figure 4.4 shows a histogram (with shaded
gray rectangles) of logarithmic property claims from 2010. The (blue) thick
curve represents a Gaussian kernel density where the bandwidth was selected
automatically using an ad hoc rule based on the sample size and volatility of
these data. For this dataset, the bandwidth turned out to be 𝑏 = 0.3255. For
comparison, the (red) dashed curve represents the density estimator with a
bandwidth equal to 0.1 and the green smooth curve uses a bandwidth of 1. As
anticipated, the smaller bandwidth (0.1) indicates taking local averages over less

data so that we get a better idea of the local average, but at the price of higher
volatility. In contrast, the larger bandwidth (1) smooths out local fluctuations,
yielding a smoother curve that may miss perturbations in the local average. For
actuarial applications, we mainly use the kernel density estimator to get a quick
visual impression of the data. From this perspective, you can simply use the
default ad hoc rule for bandwidth selection, knowing that you have the ability
to change it depending on the situation at hand.


Figure 4.4: Histogram of Logarithmic Property Claims with Superimposed Kernel Density Estimators
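A plot in the spirit of Figure 4.4 can be produced with the density() function; the sketch below assumes that logclaims holds the logarithmic 2010 property claims (the variable name is illustrative, not from the text), so the automatically selected bandwidth may differ slightly from the value reported above.

# Kernel density estimates with three bandwidths (a sketch; logclaims is an assumed placeholder)
d_default <- density(logclaims, kernel = "gaussian")   # bandwidth chosen by the default ad hoc rule
d_small   <- density(logclaims, bw = 0.1)              # more local detail, more volatility
d_large   <- density(logclaims, bw = 1)                # smoother, may miss local features
plot(d_default, main = "Kernel Density Estimates of Log Claims")
lines(d_small, lty = 2)
lines(d_large, lty = 3)
d_default$bw                                           # the automatically selected bandwidth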

Nonparametric density estimators, such as the kernel estimator, are regularly


used in practice. The concept can also be extended to give smooth versions of
an empirical distribution function. Given the definition of the kernel density
estimator, the kernel estimator of the distribution function can be found as

F̃_n(x) = (1/n) ∑_{i=1}^n W((x − X_i)/b),

where W is the distribution function associated with the kernel density w. To illustrate, for the uniform kernel, we have w(y) = (1/2) I(−1 < y ≤ 1), so

W(y) =
  0            for y < −1
  (y + 1)/2    for −1 ≤ y < 1
  1            for y ≥ 1.

Example 4.1.5. Actuarial Exam Question.



You study five lives to estimate the time from the onset of a disease to death.
The times to death are:

2 3 3 3 7

Using a triangular kernel with bandwidth 2, calculate the density function esti-
mate at 2.5.
Solution. For the kernel density estimate, we have

f_n(x) = (1/(nb)) ∑_{i=1}^n w((x − X_i)/b),

where 𝑛 = 5, 𝑏 = 2, and 𝑥 = 2.5. For the triangular kernel, 𝑤(𝑥) = (1 − |𝑥|) ×


𝐼(|𝑥| ≤ 1). Thus,

X_i   (x − X_i)/b             w((x − X_i)/b)
2     (2.5 − 2)/2 = 1/4       (1 − 1/4)(1) = 3/4
3     (2.5 − 3)/2 = −1/4      (1 − |−1/4|)(1) = 3/4
7     (2.5 − 7)/2 = −2.25     (1 − |−2.25|)(0) = 0

Then the kernel density estimate at 𝑥 = 2.5 is


f_n(2.5) = (1/(5(2))) [3/4 + 3(3/4) + 0] = 3/10.
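Coding the triangular kernel directly confirms the calculation; here is a minimal sketch.

# Hand-coded triangular kernel density estimate for Example 4.1.5
tri_kernel <- function(u) (1 - abs(u)) * (abs(u) <= 1)
kde <- function(x, data, b) sum(tri_kernel((x - data) / b)) / (length(data) * b)
kde(2.5, data = c(2, 3, 3, 3, 7), b = 2)   # 0.3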

Plug-in Principle
One way to create a nonparametric estimator of some quantity is to use the
analog or plug-in principle where one replaces the unknown cdf 𝐹 with a known
estimate such as the empirical cdf F_n. So, if we are trying to estimate E[g(X)] = E_F[g(X)] for a generic function g, then we define a nonparametric estimator to be E_{F_n}[g(X)] = n^(−1) ∑_{i=1}^n g(X_i).
To see how this works, as a special case of g we consider the loss per payment random variable Y = (X − d)_+ and the loss elimination ratio introduced in Section 3.4.1. We can express the latter as

LER(d) = E[X − (X − d)_+] / E[X] = E[min(X, d)] / E[X],

for a fixed deductible 𝑑.


Example 4.1.6. Bodily Injury Claims and Loss Elimination Ratios

We use a sample of 432 closed auto claims from Boston from Derrig et al. (2001).
Losses are recorded for payments due to bodily injuries in auto accidents. Losses
are not subject to deductibles but are limited by various maximum coverage
amounts that are also available in the data. It turns out that only 17 out of 432
(≈ 4%) were subject to these policy limits and so we ignore these data for this
illustration.

The average loss paid is 6906 in U.S. dollars. Figure 4.5 shows other aspects of
the distribution. Specifically, the left-hand panel shows the empirical distribu-
tion function, the right-hand panel gives a nonparametric density plot.

Figure 4.5: Bodily Injury Claims. The left-hand panel gives the empirical
distribution function. The right-hand panel presents a nonparametric density
plot.

The impact of bodily injury losses can be mitigated by the imposition of limits
or purchasing reinsurance policies (see Section 10.3). To quantify the impact
of these risk mitigation tools, it is common to compute the loss elimination
ratio (LER) as introduced in Section 3.4.1. The distribution function is not
available and so must be estimated in some way. Using the plug-in principle, a
nonparametric estimator can be defined as

LER_n(d) = [n^(−1) ∑_{i=1}^n min(X_i, d)] / [n^(−1) ∑_{i=1}^n X_i] = ∑_{i=1}^n min(X_i, d) / ∑_{i=1}^n X_i.

Figure 4.6 shows the estimator 𝐿𝐸𝑅𝑛 (𝑑) for various choices of 𝑑. For example,
at 𝑑 = 1, 000, we have 𝐿𝐸𝑅𝑛 (1000) ≈ 0.1442. Thus, imposing a limit of 1,000
means that expected retained claims are 14.42 percent lower when compared to
expected claims with a zero deductible.


Figure 4.6: LER for Bodily Injury Claims. The figure presents the loss
elimination ratio (LER) as a function of deductible 𝑑.
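A curve like Figure 4.6 can be traced with a one-line estimator; the sketch below assumes that claims holds the bodily injury losses retained for this illustration (the 432 claims minus the 17 censored ones).

# Empirical loss elimination ratio (a sketch; claims is an assumed placeholder)
LERn <- function(d, x) sum(pmin(x, d)) / sum(x)
dvec <- seq(0, 25000, by = 100)
plot(dvec, sapply(dvec, LERn, x = claims), type = "l",
     xlab = "Deductible d", ylab = "LER")
LERn(1000, claims)   # approximately 0.1442 for these data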

4.1.2 Tools for Model Selection and Diagnostics


The previous section introduced nonparametric estimators in which there was
no parametric form assumed about the underlying distributions. However, in
many actuarial applications, analysts seek to employ a parametric fit of a dis-
tribution for ease of explanation and the ability to readily extend it to more
complex situations such as including explanatory variables in a regression set-
ting. When fitting a parametric distribution, one analyst might try to use a
gamma distribution to represent a set of loss data. However, another analyst
may prefer to use a Pareto distribution. How does one determine which model
to select?
Nonparametric tools can be used to corroborate the selection of parametric mod-
els. Essentially, the approach is to compute selected summary measures under
a fitted parametric model and to compare it to the corresponding quantity un-
der the nonparametric model. As the nonparametric model does not assume a
specific distribution and is merely a function of the data, it is used as a bench-

mark to assess how well the parametric distribution/model represents the data.
Also, as the sample size increases, the empirical distribution converges almost
surely to the underlying population distribution (by the strong law of large num-
bers). Thus the empirical distribution is a good proxy for the population. The
comparison of parametric to nonparametric estimators may alert the analyst to
deficiencies in the parametric model and sometimes point ways to improving
the parametric specification. Procedures geared towards assessing the validity
of a model are known as model diagnostics.

Graphical Comparison of Distributions

We have already seen the technique of overlaying graphs for comparison pur-
poses. To reinforce the application of this technique, Figure 4.7 compares the
empirical distribution to two parametric fitted distributions. The left panel
shows the distribution functions of claims distributions. The dots forming an
“S-shaped” curve represent the empirical distribution function at each observa-
tion. The thick blue curve gives corresponding values for the fitted gamma
distribution and the light purple is for the fitted Pareto distribution. Because
the Pareto is much closer to the empirical distribution function than the gamma,
this provides evidence that the Pareto is the better model for this data set. The
right panel gives similar information for the density function and provides a
consistent message. Based (only) on these figures, the Pareto distribution is the
clear choice for the analyst.

Figure 4.7: Nonparametric Versus Fitted Parametric Distribution and


Density Functions. The left-hand panel compares distribution functions, with
the dots corresponding to the empirical distribution, the thick blue curve corre-
sponding to the fitted gamma and the light purple curve corresponding to the
fitted Pareto. The right hand panel compares these three distributions summa-
rized using probability density functions.

For another way to compare the appropriateness of two fitted models, consider
the probability-probability (pp) plot. A 𝑝𝑝 plot compares cumulative probabili-
ties under two models. For our purposes, these two models are the nonparamet-
ric empirical distribution function and the parametric fitted model. Figure 4.8
shows 𝑝𝑝 plots for the Property Fund data introduced in Section 1.3. The fitted
gamma is on the left and the fitted Pareto is on the right, compared to the same
empirical distribution function of the data. The straight line represents equality
between the two distributions being compared, so points close to the line are
desirable. As seen in earlier demonstrations, the Pareto is much closer to the
empirical distribution than the gamma, providing additional evidence that the
Pareto is the better model.


Figure 4.8: Probability-Probability (𝑝𝑝) Plots. The horizontal axis gives


the empirical distribution function at each observation. In the left-hand panel,
the corresponding distribution function for the gamma is shown in the vertical
axis. The right-hand panel shows the fitted Pareto distribution. Lines of 𝑦 = 𝑥
are superimposed.

A 𝑝𝑝 plot is useful in part because no artificial scaling is required, such as with


the overlaying of densities in Figure 4.7, in which we switched to the log scale to
better visualize the data. The Chapter 4 Technical Supplement A.1 introduces
a variation of the 𝑝𝑝 plot known as a Lorenz curve; this is an important tool for
assessing income inequality. Furthermore, 𝑝𝑝 plots are available in multivariate
settings where more than one outcome variable is available. However, a limi-
tation of the 𝑝𝑝 plot is that, because it plots cumulative distribution functions,
it can sometimes be difficult to detect where a fitted parametric distribution is
deficient. As an alternative, it is common to use a quantile-quantile (qq) plot,
as demonstrated in Figure 4.9.
The 𝑞𝑞 plot compares two fitted models through their quantiles. As with 𝑝𝑝
plots, we compare the nonparametric to a parametric fitted model. Quan-
tiles may be evaluated at each point of the data set, or on a grid (e.g., at

0, 0.001, 0.002, … , 0.999, 1.000), depending on the application. In Figure 4.9, for
each point on the aforementioned grid, the horizontal axis displays the empir-
ical quantile and the vertical axis displays the corresponding fitted parametric
quantile (gamma for the upper two panels, Pareto for the lower two). Quan-
tiles are plotted on the original scale in the left panels and on the log scale in
the right panels to allow us to see where a fitted distribution is deficient. The
straight line represents equality between the empirical distribution and fitted
distribution. From these plots, we again see that the Pareto is an overall bet-
ter fit than the gamma. Furthermore, the lower-right panel suggests that the
Pareto distribution does a good job with large claims, but provides a poorer fit
for small claims.


Figure 4.9: Quantile-Quantile (qq) Plots. The horizontal axis gives the empirical quantiles at each observation; in the right-hand panels they are graphed on a logarithmic basis. The vertical axis gives the quantiles from the fitted distributions; gamma quantiles are in the upper panels, Pareto quantiles are in the lower panels.
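A qq plot of this type takes only a few lines to construct; the sketch below compares empirical quantiles to a fitted lognormal (a stand-in choice rather than the gamma or Pareto shown in the figure), with claims as an assumed placeholder for the data and the lognormal parameters taken from the mean and standard deviation of the log claims.

# Sketch of a qq plot: empirical versus fitted lognormal quantiles
# (claims is an assumed placeholder for the claim amounts)
p     <- ppoints(length(claims))                   # evenly spaced probabilities in (0, 1)
emp_q <- quantile(claims, probs = p, type = 6)     # empirical quantiles
fit_q <- qlnorm(p, meanlog = mean(log(claims)), sdlog = sd(log(claims)))
plot(emp_q, fit_q, xlab = "Empirical Quantile", ylab = "Lognormal Quantile")
abline(0, 1)                                       # line of equality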

Example 4.1.7. Actuarial Exam Question. The graph below shows a 𝑝𝑝


plot of a fitted distribution compared to a sample.
Comment on the two distributions with respect to left tail, right tail, and median
probabilities.

[pp plot: the sample distribution function is on the horizontal axis and the fitted distribution function is on the vertical axis.]

Solution. The tail of the fitted distribution is too thick on the left, too thin
on the right, and the fitted distribution has less probability around the median
than the sample. To see this, recall that the 𝑝𝑝 plot graphs the cumulative
distribution of two distributions on its axes (empirical on the x-axis and fitted
on the y-axis in this case). For small values of 𝑥, the fitted model assigns greater
probability to being below that value than occurred in the sample (i.e. 𝐹 (𝑥) >
𝐹𝑛 (𝑥)). This indicates that the model has a heavier left tail than the data. For
large values of 𝑥, the model again assigns greater probability to being below that
value and thus less probability to being above that value (i.e. 𝑆(𝑥) < 𝑆𝑛 (𝑥)).
This indicates that the model has a lighter right tail than the data. In addition,
as we go from 0.4 to 0.6 on the horizontal axis (thus looking at the middle 20%
of the data), the 𝑝𝑝 plot increases from about 0.3 to 0.4. This indicates that
the model puts only about 10% of the probability in this range.

Statistical Comparison of Distributions


When selecting a model, it is helpful to make the graphical displays presented.
However, for reporting results, it can be effective to supplement the graphical
displays with selected statistics that summarize model goodness of fit. Table
4.2 provides three commonly used goodness of fit statistics. In this table, 𝐹𝑛
is the empirical distribution, 𝐹 is the fitted or hypothesized distribution, and
𝐹𝑖∗ = 𝐹 (𝑥𝑖 ).
Table 4.2. Three Goodness of Fit Statistics

Statistic: Kolmogorov-Smirnov
  Definition: max_x |F_n(x) − F(x)|
  Computational expression: max(D^+, D^−), where D^+ = max_{i=1,…,n} |i/n − F_i^*| and D^− = max_{i=1,…,n} |F_i^* − (i−1)/n|

Statistic: Cramer-von Mises
  Definition: n ∫ (F_n(x) − F(x))^2 f(x) dx
  Computational expression: 1/(12n) + ∑_{i=1}^n (F_i^* − (2i−1)/(2n))^2

Statistic: Anderson-Darling
  Definition: n ∫ (F_n(x) − F(x))^2 / [F(x)(1 − F(x))] f(x) dx
  Computational expression: −n − (1/n) ∑_{i=1}^n (2i−1) log(F_i^* (1 − F_{n+1−i}^*))

The Kolmogorov-Smirnov statistic is the maximum absolute difference between


the fitted distribution function and the empirical distribution function. Instead
of comparing differences between single points, the Cramer-von Mises statistic
integrates the difference between the empirical and fitted distribution functions
over the entire range of values. The Anderson-Darling statistic also integrates
this difference over the range of values, although weighted by the inverse of the
variance. It therefore places greater emphasis on the tails of the distribution
(i.e., when 𝐹 (𝑥) or 1 − 𝐹 (𝑥) = 𝑆(𝑥) is small).

Example 4.1.8. Actuarial Exam Question (modified). A sample of claim


payments is:

29 64 90 135 182

Compare the empirical claims distribution to an exponential distribution with


mean 100 by calculating the value of the Kolmogorov-Smirnov test statistic.
Solution. For an exponential distribution with mean 100, the cumulative dis-
tribution function is 𝐹 (𝑥) = 1 − 𝑒−𝑥/100 . Thus,

𝑥 𝐹 (𝑥) 𝐹𝑛 (𝑥) 𝐹𝑛 (𝑥−) max(|𝐹 (𝑥) − 𝐹𝑛 (𝑥)|, |𝐹 (𝑥) − 𝐹𝑛 (𝑥−)|)


29 0.2517 0.2 0 max(0.0517, 0.2517) = 0.2517
64 0.4727 0.4 0.2 max(0.0727, 0.2727) = 0.2727
90 0.5934 0.6 0.4 max(0.0066, 0.1934) = 0.1934
135 0.7408 0.8 0.6 max(0.0592, 0.1408) = 0.1408
182 0.8380 1 0.8 max(0.1620, 0.0380) = 0.1620

The Kolmogorov-Smirnov test statistic is therefore

𝐾𝑆 = max(0.2517, 0.2727, 0.1934, 0.1408, 0.1620) = 0.2727.
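The same statistic is easy to compute in R; the following sketch reproduces the table above.

# Kolmogorov-Smirnov statistic for Example 4.1.8
x  <- sort(c(29, 64, 90, 135, 182))
n  <- length(x)
Fx <- pexp(x, rate = 1 / 100)          # fitted exponential with mean 100
Dplus  <- max((1:n) / n - Fx)          # compare to F_n(x)
Dminus <- max(Fx - (0:(n - 1)) / n)    # compare to F_n(x-)
max(Dplus, Dminus)                     # 0.2727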

4.1.3 Starting Values


The method of moments and percentile matching are nonparametric estimation
methods that provide alternatives to maximum likelihood. Generally, maximum
likelihood is the preferred technique because it employs data more efficiently.
(See Appendix Chapter 17 for precise definitions of efficiency.) However, meth-
ods of moments and percentile matching are useful because they are easier to
interpret and therefore allow the actuary or analyst to explain procedures to
others. Additionally, the numerical estimation procedure (e.g. if performed in
R) for the maximum likelihood is iterative and requires starting values to begin

the recursive process. Although many problems are robust to the choice of the
starting values, for some complex situations, it can be important to have a start-
ing value that is close to the (unknown) optimal value. Method of moments and
percentile matching are techniques that can produce desirable estimates without
a serious computational investment and can thus be used as a starting value for
computing maximum likelihood.

Method of Moments
Under the method of moments, we approximate the moments of the parametric
distribution using the empirical (nonparametric) moments described in Section
4.1.1. We can then algebraically solve for the parameter estimates.

Example 4.1.9. Property Fund. For the 2010 property fund, there are
n = 1,377 individual claims (in thousands of dollars) with

m_1 = (1/n) ∑_{i=1}^n X_i = 26.62259   and   m_2 = (1/n) ∑_{i=1}^n X_i^2 = 136154.6.

Fit the parameters of the gamma and Pareto distributions using the method of
moments.
Solution. To fit a gamma distribution, we have μ_1 = αθ and μ′_2 = α(α + 1)θ^2. Equating these to the corresponding sample moments and solving, easy algebra shows that

α = μ_1^2 / (μ′_2 − μ_1^2)   and   θ = (μ′_2 − μ_1^2) / μ_1.

Thus, the method of moment estimators are

α̂ = 26.62259^2 / (136154.6 − 26.62259^2) = 0.005232809
θ̂ = (136154.6 − 26.62259^2) / 26.62259 = 5,087.629.

For comparison, the maximum likelihood values turn out to be α̂_MLE = 0.2905959 and θ̂_MLE = 91.61378, so there are big discrepancies between the two estimation procedures. This is one indication, as we have seen before, that the gamma model fits poorly.
In contrast, now assume a Pareto distribution so that 𝜇1 = 𝜃/(𝛼 − 1) and
𝜇′2 = 2𝜃2 /((𝛼 − 1)(𝛼 − 2)). Note that this expression for 𝜇′2 is only valid for
𝛼 > 2. Easy algebra shows

α = 1 + μ′_2 / (μ′_2 − μ_1^2)   and   θ = (α − 1)μ_1.

Thus, the method of moment estimators are

α̂ = 1 + 136154.6 / (136154.6 − 26.62259^2) = 2.005233
θ̂ = (2.005233 − 1) ⋅ 26.62259 = 26.7619.

The maximum likelihood values turn out to be α̂_MLE = 0.9990936 and θ̂_MLE = 2.2821147. It is interesting that α̂_MLE < 1; for the Pareto distribution, recall that α < 1 means that the mean is infinite. This is another indication that the property claims data come from a long tail distribution.
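The method of moments calculations above amount to a few lines of arithmetic; here is a sketch using the sample moments reported in the example.

# Method of moments fits for Example 4.1.9 (a sketch)
m1 <- 26.62259
m2 <- 136154.6
# gamma
alpha_gamma <- m1^2 / (m2 - m1^2)         # 0.00523
theta_gamma <- (m2 - m1^2) / m1           # 5087.6
# Pareto, using the expressions in the text
alpha_pareto <- 1 + m2 / (m2 - m1^2)      # 2.005
theta_pareto <- (alpha_pareto - 1) * m1   # 26.76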

As the above example suggests, there is flexibility with the method of moments.
For example, we could have matched the second and third moments instead
of the first and second, yielding different estimators. Furthermore, there is no
guarantee that a solution will exist for each problem. For data that are censored
or truncated, matching moments is possible for a few problems but, in general,
this is a more difficult scenario. Finally, for distributions where the moments do
not exist or are infinite, method of moments is not available. As an alternative,
one can use the percentile matching technique.

Percentile Matching
Under percentile matching, we approximate the quantiles or percentiles of the
parametric distribution using the empirical (nonparametric) quantiles or per-
centiles described in Section 4.1.1.

Example 4.1.10. Property Fund. For the 2010 property fund, we illus-
trate matching on quantiles. In particular, the Pareto distribution is intuitively
pleasing because of the closed-form solution for the quantiles. Recall that the
distribution function for the Pareto distribution is
F(x) = 1 − (θ/(x + θ))^α.
Easy algebra shows that we can express the quantile as

F^(−1)(q) = θ ((1 − q)^(−1/α) − 1),


for a fraction 𝑞, 0 < 𝑞 < 1.

Determine estimates of the Pareto distribution parameters using the 25th and
95th empirical quantiles.
Solution.
The 25th percentile (the first quartile) turns out to be 0.78853 and the 95th
percentile is 50.98293 (both in thousands of dollars). With two equations

0.78853 = θ ((1 − 0.25)^(−1/α) − 1)   and   50.98293 = θ ((1 − 0.95)^(−1/α) − 1)

and two unknowns, the solution is

𝛼̂ = 0.9412076 and 𝜃 ̂ = 2.205617.

We remark here that a numerical routine is required for these solutions as no analytic solution is available. Furthermore, recall that the maximum likelihood estimates are α̂_MLE = 0.9990936 and θ̂_MLE = 2.2821147, so percentile matching provides a better approximation for the Pareto distribution than the method of moments.
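One way to implement the numerical routine is to eliminate θ by taking the ratio of the two quantile equations, leaving a single equation in α that uniroot can solve; the following is a minimal sketch.

# Percentile matching for the Pareto in Example 4.1.10 (a sketch)
q25 <- 0.78853
q95 <- 50.98293
ratio_eq <- function(alpha) {
  ((1 - 0.95)^(-1 / alpha) - 1) / ((1 - 0.25)^(-1 / alpha) - 1) - q95 / q25
}
alpha_hat <- uniroot(ratio_eq, c(0.1, 10))$root         # approximately 0.941
theta_hat <- q25 / ((1 - 0.25)^(-1 / alpha_hat) - 1)    # approximately 2.206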

Example 4.1.11. Actuarial Exam Question. You are given:


(i) Losses follow a loglogistic distribution with cumulative distribution func-
tion:
F(x) = (x/θ)^γ / (1 + (x/θ)^γ)
(ii) The sample of losses is:

10 35 80 86 90 120 158 180 200 210 1500

Calculate the estimate of 𝜃 by percentile matching, using the 40th and 80th
empirically smoothed percentile estimates.
Solution. With 11 observations, we have 𝑗 = ⌊(𝑛+1)𝑞⌋ = ⌊12(0.4)⌋ = ⌊4.8⌋ = 4
and ℎ = (𝑛 + 1)𝑞 − 𝑗 = 12(0.4) − 4 = 0.8. By interpolation, the 40th empirically
smoothed percentile estimate is 𝜋0.4
̂ = (1−ℎ)𝑋(𝑗) +ℎ𝑋(𝑗+1) = 0.2(86)+0.8(90) =
89.2.
Similarly, for the 80th empirically smoothed percentile estimate, we have
12(0.8) = 9.6 so the estimate is 𝜋0.8
̂ = 0.4(200) + 0.6(210) = 206.
Using the loglogistic cumulative distribution, we need to solve the following two
equations for parameters 𝜃 ̂ and 𝛾:̂

0.4 = (89.2/θ̂)^γ̂ / (1 + (89.2/θ̂)^γ̂)   and   0.8 = (206/θ̂)^γ̂ / (1 + (206/θ̂)^γ̂).

Solving for each parenthetical expression gives 2/3 = (89.2/θ̂)^γ̂ and 4 = (206/θ̂)^γ̂. Taking the ratio of the second equation to the first gives 6 = (206/89.2)^γ̂ ⇒ γ̂ = log(6)/log(206/89.2) = 2.1407. Then 4^(1/2.1407) = 206/θ̂ ⇒ θ̂ = 107.8.

Like the method of moments, percentile matching is almost too flexible in the
sense that estimators can vary depending on different percentiles chosen. For ex-
ample, one actuary may use estimation on the 25th and 95th percentiles whereas
another uses the 20th and 80th percentiles. In general estimated parameters will
differ and there is no compelling reason to prefer one over the other. Also as
with the method of moments, percentile matching is appealing because it pro-
vides a technique that can be readily applied in selected situations and has an
intuitive basis. Although most actuarial applications use maximum likelihood
estimators, it can be convenient to have alternative approaches such as method
of moments and percentile matching available.

4.2 Model Selection

In this section, you learn how to:


• Describe the iterative model selection specification process
• Outline steps needed to select a parametric model
• Describe pitfalls of model selection based purely on in-sample data when
compared to the advantages of out-of-sample model validation

This section underscores the idea that model selection is an iterative process in
which models are cyclically (re)formulated and tested for appropriateness before
using them for inference. After an overview, we describe the model selection
process based on:
• an in-sample or training dataset,
• an out-of-sample or test dataset, and
• a method that combines these approaches known as cross-validation.

4.2.1 Iterative Model Selection


In our development, we examine the data graphically, hypothesize a model
structure, and compare the data to a candidate model in order to formulate
an improved model. Box (1980) describes this as an iterative process which is
shown in Figure 4.10.
This iterative process provides a useful recipe for structuring the task of speci-
fying a model to represent a set of data.

Figure 4.10: Iterative Model Specification Process

1. The first step, the model formulation stage, is accomplished by examining


the data graphically and using prior knowledge of relationships, such as
from economic theory or industry practice.
2. The second step in the iteration is fitting based on the assumptions of the
specified model. These assumptions must be consistent with the data to
make valid use of the model.
3. The third step is diagnostic checking; the data and model must be con-
sistent with one another before additional inferences can be made. Di-
agnostic checking is an important part of the model formulation; it can
reveal mistakes made in previous steps and provide ways to correct these
mistakes.
The iterative process also emphasizes the skills you need to make analytics
work. First, you need a willingness to summarize information numerically and
portray this information graphically. Second, it is important to develop an
understanding of model properties. You should understand how a probabilistic
model behaves in order to match a set of data to it. Third, theoretical properties
of the model are also important for inferring general relationships based on the
behavior of the data.

4.2.2 Model Selection Based on a Training Dataset


It is common to refer to a dataset used for analysis as an in-sample or train-
ing dataset. Techniques available for selecting a model depend upon whether
the outcomes 𝑋 are discrete, continuous, or a hybrid of the two, although the
principles are the same.
Graphical and other Basic Summary Measures. Begin by summarizing
the data graphically and with statistics that do not rely on a specific parametric
form, as summarized in Section 4.1. Specifically, you will want to graph both
the empirical distribution and density functions. Particularly for loss data that
contain many zeros and that can be skewed, deciding on the appropriate scale

(e.g., logarithmic) may present some difficulties. For discrete data, tables are
often preferred. Determine sample moments, such as the mean and variance, as
well as selected quantiles, including the minimum, maximum, and the median.
For discrete data, the mode (or most frequently occurring value) is usually
helpful.
These summaries, as well as your familiarity of industry practice, will suggest
one or more candidate parametric models. Generally, start with the simpler
parametric models (for example, one parameter exponential before a two param-
eter gamma), gradually introducing more complexity into the modeling process.
Critique the candidate parametric model numerically and graphically. For the
graphs, utilize the tools introduced in Section 4.1.2 such as 𝑝𝑝 and 𝑞𝑞 plots. For
the numerical assessments, examine the statistical significance of parameters
and try to eliminate parameters that do not provide additional information.
Likelihood Ratio Tests. For comparing model fits, if one model is a subset
of another, then a likelihood ratio test may be employed; the general approach
to likelihood ratio testing is described in Sections 15.4.3 and 17.3.2.
Goodness of Fit Statistics. Generally, models are not proper subsets of one
another so overall goodness of fit statistics are helpful for comparing models.
Information criteria are one type of goodness of fit statistic. The most widely used
examples are Akaike’s Information Criterion (AIC) and the (Schwarz) Bayesian
Information Criterion (BIC); they are widely cited because they can be readily
generalized to multivariate settings. Section 15.4.4 provides a summary of these
statistics.
For selecting the appropriate distribution, statistics that compare a parametric
fit to a nonparametric alternative, summarized in Section 4.1.2, are useful for
model comparison. For discrete data, a goodness of fit statistic (as described in
Section 2.7) is generally preferred as it is more intuitive and simpler to explain.

4.2.3 Model Selection Based on a Test Dataset


Model validation is the process of confirming that the proposed model is appro-
priate, especially in light of the purposes of the investigation. An important
limitation of the model selection process based only on in-sample data is that it
can be susceptible to data-snooping, that is, fitting a great number of models
to a single set of data. By looking at a large number of models, we may overfit
the data and understate the natural variation in our representation.
Selecting a model based only on in-sample data also does not support the goal of
predictive inference. Particularly in actuarial applications, our goal is to make
statements about new experience rather than a dataset at hand. For example,
we use claims experience from one year to develop a model that can be used to
price insurance contracts for the following year. As an analogy, we can think
about the training data set as experience from one year that is used to predict
the behavior of the next year’s test data set.

We can respond to these criticisms by using a technique known as out-of-


sample validation. The ideal situation is to have available two sets of data,
one for training, or model development, and the other for testing, or model
validation. We initially develop one or several models on the first data set that
we call our candidate models. Then, the relative performance of the candidate
models can be measured on the second set of data. In this way, the data used
to validate the model are unaffected by the procedures used to formulate the
model.

Random Split of the Data. Unfortunately, rarely will two sets of data be
available to the investigator. However, we can implement the validation process
by splitting the data set into training and test subsamples, respectively. Figure
4.11 illustrates this splitting of the data.

[Diagram: an original sample of size n is randomly split into a training subsample of size n1 and a test subsample of size n2.]

Figure 4.11: Model Validation. A data set is randomly split into two sub-
samples.

Various researchers recommend different proportions for the allocation. Snee


(1977) suggests that data-splitting not be done unless the sample size is mod-
erately large. The guidelines of Picard and Berk (1990) show that the greater
the number of parameters to be estimated, the greater the proportion of obser-
vations needed for the model development subsample.

Model Validation Statistics. Much of the literature supporting the estab-


lishment of a model validation process is based on regression and classification
models that you can think of as an input-output problem (James et al. (2013)).
That is, we have several inputs 𝑥1 , … , 𝑥𝑘 that are related to an output 𝑦 through
a function such as
𝑦 = g (𝑥1 , … , 𝑥𝑘 ) .

One uses the training sample to develop an estimate of g, say, g,̂ and then
calibrate the distance from the observed outcomes to the predictions using a
criterion of the form

∑_i d(y_i, ĝ(x_i1, …, x_ik)).      (4.4)

Here, “d” is some measure of distance and the sum over i is taken over the test data. In
many regression applications, it is common to use squared Euclidean distance
of the form d(𝑦𝑖 , g) = (𝑦𝑖 − g)2 . In actuarial applications, Euclidean distance
d(𝑦𝑖 , g) = |𝑦𝑖 − g| is often preferred because of the skewed nature of the data
(large outlying values of 𝑦 can have a large effect on the measure). Chapter
7 describes another measure, the Gini index, that is useful in actuarial appli-
cations particularly when there is a large proportion of zeros in claims data
(corresponding to no claims).
Selecting a Distribution. Still, our focus so far has been to select a distribu-
tion for a data set that can be used for actuarial modeling without additional
inputs 𝑥1 , … , 𝑥𝑘 . Even in this more fundamental problem, the model validation
approach is valuable. If we base all inference on only in-sample data, then there
is a tendency to select more complicated models than needed. For example, we
might select a four parameter GB2, generalized beta of the second kind, distri-
bution when only a two parameter Pareto is needed. Information criteria such
as AIC and BIC include penalties for model complexity and so provide some protection, but using a test sample is the best guarantee of achieving parsimonious models. From a quote often attributed to Albert Einstein, we want to make the model “as simple as possible, but no simpler.”

4.2.4 Model Selection Based on Cross-Validation


Although out-of-sample validation is the gold standard in predictive modeling, it
is not always practical to do so. The main reason is that we have limited sample
sizes and the out-of-sample model selection criterion in equation (4.4) depends
on a random split of the data. This means that different analysts, even when
working the same data set and same approach to modeling, may select different
models. This is likely in actuarial applications because we work with skewed
data sets where there is a large chance of getting some very large outcomes and
large outcomes may have a great influence on the parameter estimates.
Cross-Validation Procedure. Alternatively, one may use cross-validation,
as follows.
• The procedure begins by using a random mechanism to split the data into
𝐾 subsets of roughly equal size known as folds, where analysts typically
use 5 to 10.
• Next, one uses the first 𝐾-1 subsamples to estimate model parameters.
Then, “predict” the outcomes for the 𝐾th subsample and use a measure
such as in equation (4.4) to summarize the fit.
• Now, repeat this by holding out each of the 𝐾 subsamples, summariz-
ing with an out-of-sample statistic. Thus, summarize these 𝐾 statistics,

typically by averaging, to give a single overall statistic for comparison


purposes.
Repeat these steps for several candidate models and choose the model with the
lowest overall cross-validation statistic.
Cross-validation is widely used because it retains the predictive flavor of the
out-of-sample model validation process but, due to the re-use of the data, is
more stable over random samples.
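As an illustration of the procedure, the sketch below compares an exponential and a lognormal severity model by five-fold cross-validation. The data vector x is an assumed placeholder, and the held-out negative log likelihood is used as the distance measure in equation (4.4); both choices are illustrative.

# K-fold cross-validation comparing two candidate severity models (a sketch;
# x is an assumed placeholder for the claims data)
K <- 5
folds <- sample(rep(1:K, length.out = length(x)))   # random fold labels
cv_exp <- cv_lnorm <- numeric(K)
for (k in 1:K) {
  train <- x[folds != k]
  test  <- x[folds == k]
  # exponential: the MLE of the mean is the training sample mean
  cv_exp[k] <- -sum(dexp(test, rate = 1 / mean(train), log = TRUE))
  # lognormal: closed-form MLEs of meanlog and sdlog from the training fold
  m <- mean(log(train))
  s <- sqrt(mean((log(train) - m)^2))
  cv_lnorm[k] <- -sum(dlnorm(test, meanlog = m, sdlog = s, log = TRUE))
}
c(exponential = mean(cv_exp), lognormal = mean(cv_lnorm))   # smaller is better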

4.3 Estimation using Modified Data

In this section, you learn how to:


• Describe grouped, censored, and truncated data
• Estimate parametric distributions based on grouped, censored, and trun-
cated data
• Estimate distributions nonparametrically based on grouped, censored, and
truncated data

4.3.1 Parametric Estimation using Modified Data


Basic theory and many applications are based on individual observations that
are “complete” and “unmodified,” as we have seen in the previous section. Sec-
tion 3.5 introduced the concept of observations that are “modified” due to two
common types of limitations: censoring and truncation. For example, it is
common to think about an insurance deductible as producing data that are trun-
cated (from the left) or policy limits as yielding data that are censored (from the
right). This viewpoint is from the primary insurer (the seller of the insurance).
Another viewpoint is that of a reinsurer (an insurer of an insurance company)
that will be discussed more in Chapter 10. A reinsurer may not observe a claim
smaller than an amount, only that a claim exists; this is an example of censor-
ing from the left. So, in this section, we cover the full gamut of alternatives.
Specifically, this section will address parametric estimation methods for three
alternatives to individual, complete, and unmodified data: interval-censored
data available only in groups, data that are limited or censored, and data that
may not be observed due to truncation.

Parametric Estimation using Grouped Data


Consider a sample of size 𝑛 observed from the distribution 𝐹 (⋅), but in groups
so that we only know the group into which each observation fell, not the exact
value. This is referred to as grouped or interval-censored data. For example,
we may be looking at two successive years of annual employee records. People

employed in the first year but not the second have left sometime during the year.
With an exact departure date (individual data), we could compute the amount
of time that they were with the firm. Without the departure date (grouped
data), we only know that they departed sometime during a year-long interval.
Formalizing this idea, suppose there are 𝑘 groups or intervals delimited by
boundaries 𝑐0 < 𝑐1 < ⋯ < 𝑐𝑘 . For each observation, we only observe the interval
into which it fell (e.g. (𝑐𝑗−1 , 𝑐𝑗 )), not the exact value. Thus, we only know the
number of observations in each interval. The constants {𝑐0 < 𝑐1 < ⋯ < 𝑐𝑘 } form
some partition of the domain of 𝐹 (⋅). Then the probability of an observation
𝑋𝑖 falling in the 𝑗th interval is

Pr (𝑋𝑖 ∈ (𝑐𝑗−1 , 𝑐𝑗 ]) = 𝐹 (𝑐𝑗 ) − 𝐹 (𝑐𝑗−1 ).

The corresponding probability mass function for an observation is

$$f(x) = \begin{cases} F(c_1) - F(c_0) & \text{if } x \in (c_0, c_1] \\ \quad\vdots & \quad\vdots \\ F(c_k) - F(c_{k-1}) & \text{if } x \in (c_{k-1}, c_k] \end{cases} = \prod_{j=1}^{k} \left\{F(c_j) - F(c_{j-1})\right\}^{I(x \in (c_{j-1}, c_j])}$$

Now, define 𝑛𝑗 to be the number of observations that fall in the 𝑗th interval,
(𝑐𝑗−1 , 𝑐𝑗 ]. Thus, the likelihood function (with respect to the parameter(s) 𝜃) is

$$L(\theta) = \prod_{i=1}^{n} f(x_i) = \prod_{j=1}^{k} \left\{F(c_j) - F(c_{j-1})\right\}^{n_j}$$

And the log-likelihood function is

$$l(\theta) = \log L(\theta) = \log \prod_{i=1}^{n} f(x_i) = \sum_{j=1}^{k} n_j \log\left\{F(c_j) - F(c_{j-1})\right\}$$

Maximizing the likelihood function (or equivalently, maximizing the log-likelihood function) would then produce the maximum likelihood estimates for grouped data.
Example 4.3.1. Actuarial Exam Question. You are given:
(i) Losses follow an exponential distribution with mean 𝜃.
(ii) A random sample of 20 losses is distributed as follows:

Loss Range Frequency


[0, 1000] 7
(1000, 2000] 6
(2000, ∞) 7

Calculate the maximum likelihood estimate of 𝜃.


Solution.
$$\begin{aligned} L(\theta) &= F(1000)^7 \left[F(2000) - F(1000)\right]^6 \left[1 - F(2000)\right]^7 \\ &= \left(1 - e^{-1000/\theta}\right)^7 \left(e^{-1000/\theta} - e^{-2000/\theta}\right)^6 \left(e^{-2000/\theta}\right)^7 \\ &= (1-p)^7 (p - p^2)^6 (p^2)^7 \\ &= p^{20} (1-p)^{13} \end{aligned}$$

where $p = e^{-1000/\theta}$. Maximizing this expression with respect to $p$ is equivalent to maximizing the likelihood with respect to $\theta$. The maximum occurs at $p = 20/33$ and so $\hat{\theta} = \frac{-1000}{\log(20/33)} = 1996.90$.
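The grouped-data likelihood is also easy to maximize numerically. The following R sketch reproduces the answer above; the use of optimize() over an arbitrary search interval is an implementation choice for this illustration.

```r
# Sketch: numerical MLE for the grouped exponential data of Example 4.3.1.
bounds <- c(0, 1000, 2000, Inf)   # group boundaries c_0 < c_1 < ... < c_k
counts <- c(7, 6, 7)              # n_j, the number of losses in each group
loglik <- function(theta) {
  probs <- pexp(bounds[-1], rate = 1/theta) -
           pexp(bounds[-length(bounds)], rate = 1/theta)
  sum(counts * log(probs))        # sum_j n_j log{F(c_j) - F(c_{j-1})}
}
optimize(loglik, interval = c(100, 10000), maximum = TRUE)$maximum
# approximately 1996.9, matching -1000/log(20/33)
```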

Censored Data
Censoring occurs when we record only a limited value of an observation. The
most common form is right-censoring, in which we record the smaller of the
“true” dependent variable and a censoring value. Using notation, let 𝑋 represent
an outcome of interest, such as the loss due to an insured event or time until an
event. Let 𝐶𝑈 denote the censoring amount. With right-censored observations,
we record 𝑋𝑈∗ = min(𝑋, 𝐶𝑈 ) = 𝑋 ∧𝐶𝑈 . We also record whether or not censoring
has occurred. Let 𝛿𝑈 = 𝐼(𝑋 ≤ 𝐶𝑈 ) be a binary variable that is 0 if censoring
occurs and 1 if it does not, that is, 𝛿𝑈 indicates whether or not 𝑋 is uncensored.
For an example that we saw in Section 3.4.2, 𝐶𝑈 may represent the upper limit
of coverage of an insurance policy (we used 𝑢 for the upper limit in that section).
The loss may exceed the amount 𝐶𝑈 , but the insurer only has 𝐶𝑈 in its records
as the amount paid out and does not have the amount of the actual loss 𝑋 in
its records.
Similarly, with left-censoring, we record the larger of a variable of interest
and a censoring variable. If 𝐶𝐿 is used to represent the censoring amount, we
record 𝑋𝐿∗ = max(𝑋, 𝐶𝐿 ) along with the censoring indicator 𝛿𝐿 = 𝐼(𝑋 > 𝐶𝐿 ).
As an example, you got a brief introduction to reinsurance (insurance for insur-
ers) in Section 3.4.4 and will see more in Chapter 10. Suppose a reinsurer will
cover insurer losses greater than 𝐶𝐿 ; this means that the reinsurer is responsi-
ble for the excess of 𝑋𝐿∗ over 𝐶𝐿 . Using notation, the loss of the reinsurer is
𝑌 = 𝑋𝐿∗ − 𝐶𝐿 . To see this, first consider the case where the policyholder loss
𝑋 < 𝐶𝐿 . Then, the insurer will pay the entire claim and 𝑌 = 𝐶𝐿 − 𝐶𝐿 = 0,
no loss for the reinsurer. For contrast, if the loss 𝑋 ≥ 𝐶𝐿 , then 𝑌 = 𝑋 − 𝐶𝐿


represents the reinsurer’s portion of the claim. Put another way, if a loss occurs, the
reinsurer records the actual amount if it exceeds the limit 𝐶𝐿 and otherwise it
only records that it had a loss of 0.

Truncated Data
Censored observations are recorded for study, although in a limited form. In
contrast, truncated outcomes are a type of missing data. An outcome is poten-
tially truncated when the availability of an observation depends on the outcome.
In insurance, it is common for observations to be left-truncated at $C_L$ when the amount is
$$Y = \begin{cases} \text{we do not observe } X & X \le C_L \\ X & X > C_L. \end{cases}$$

In other words, if 𝑋 is less than the threshold 𝐶𝐿 , then it is not observed.


For an example we saw in Section 3.4.1, 𝐶𝐿 may represent the deductible of an
insurance policy (we used 𝑑 for the deductible in that section). If the insured
loss is less than the deductible, then the insurer may not observe or record the
loss at all. If the loss exceeds the deductible, then the excess 𝑋 − 𝐶𝐿 is the
claim that the insurer covers. In Section 3.4.1, we defined the per payment loss
to be
$$Y^P = \begin{cases} \text{Undefined} & X \le d \\ X - d & X > d, \end{cases}$$
so that if a loss exceeds a deductible, we record the excess amount 𝑋 − 𝑑. This
is very important when considering amounts that the insurer will pay. However,
for estimation purposes of this section, it matters little if we subtract a known
constant such as 𝐶𝐿 = 𝑑. So, for our truncated variable 𝑌 , we use the simpler
convention and do not subtract 𝑑.
Similarly for right-truncated data, if 𝑋 exceeds a threshold 𝐶𝑈 , then it is not
observed. In this case, the amount is

$$Y = \begin{cases} X & X \le C_U \\ \text{we do not observe } X & X > C_U. \end{cases}$$

Classic examples of truncation from the right include 𝑋 as a measure of distance


to a star. When the distance exceeds a certain level 𝐶𝑈 , the star is no longer
observable.
Figure 4.12 compares truncated and censored observations. Values of 𝑋 that are smaller than the “lower” truncation limit 𝐶𝐿 are not observed at all (left-truncated), while values of 𝑋 that are greater than the “upper” censoring limit 𝐶𝑈 are observed, but only as 𝐶𝑈 rather than the actual value of 𝑋 (right-censored).
(Figure: the horizontal axis shows 𝑋 from 0 through 𝐶𝐿 to 𝐶𝑈; there is no observed value under left-truncation or right-truncation, and no exact value under left-censoring, interval-censoring, or right-censoring.)

Figure 4.12: Censoring and Truncation

Example – Mortality Study. Suppose that you are conducting a two-year


study of mortality of high-risk subjects, beginning January 1, 2010 and finishing
January 1, 2012. Figure 4.13 graphically portrays the six types of subjects
recruited. For each subject, the beginning of the arrow represents that the
subject was recruited and the arrow end represents the event time where in this
example the event represents death. The arrow represents exposure time.

Figure 4.13: Timeline for Several Subjects on Test in a Mortality Study (calendar time runs from 1/1/2010 to 1/1/2012)

• Type A - Right-censored. This subject is alive at the beginning and


the end of the study. Because the time of death is not known by the end
of the study, it is right-censored. Most subjects are Type A.
• Type B - Complete information is available for a type B subject. The
subject is alive at the beginning of the study and the death occurs within
the observation period.


• Type C - Right-censored and left-truncated. A type C subject is
right-censored, in that death occurs after the observation period. However,
the subject entered after the start of the study and is said to have a delayed
entry time. Because the subject would not have been observed had death
occurred before entry, it is left-truncated.
• Type D - Left-truncated. A type D subject also has delayed entry.
Because death occurs within the observation period, this subject is not
right censored.
• Type E - Left-truncated. A type E subject is not included in the study
because death occurs prior to the observation period.
• Type F - Right-truncated. Similarly, a type F subject is not included
because the entry time occurs after the observation period.

To summarize, for outcome 𝑋 and constants 𝐶𝐿 and 𝐶𝑈 ,

Limitation Type Limited Variable Recording Information


right censoring 𝑋𝑈∗ = min(𝑋, 𝐶𝑈 ) 𝛿𝑈 = 𝐼(𝑋 ≤ 𝐶𝑈 )
left censoring 𝑋𝐿∗ = max(𝑋, 𝐶𝐿 ) 𝛿𝐿 = 𝐼(𝑋 > 𝐶𝐿 )
interval censoring
right truncation 𝑋 observe 𝑋 if 𝑋 ≤ 𝐶𝑈
left truncation 𝑋 observe 𝑋 if 𝑋 > 𝐶𝐿

Parametric Estimation using Censored and Truncated Data


For simplicity, we assume non-random censoring amounts and a continuous
outcome 𝑋. To begin, consider the case of right-censored data where we record
𝑋𝑈∗ = min(𝑋, 𝐶𝑈 ) and censoring indicator 𝛿 = 𝐼(𝑋 ≤ 𝐶𝑈 ). If censoring occurs
so that 𝛿 = 0, then 𝑋 > 𝐶𝑈 and the likelihood is Pr(𝑋 > 𝐶𝑈 ) = 1 − 𝐹 (𝐶𝑈 ). If
censoring does not occur so that 𝛿 = 1, then 𝑋 ≤ 𝐶𝑈 and the likelihood is 𝑓(𝑥).
Summarizing, we have the likelihood of a single observation as

$$\begin{cases} 1 - F(C_U) & \text{if } \delta = 0 \\ f(x) & \text{if } \delta = 1 \end{cases} = \left\{f(x)\right\}^{\delta} \left\{1 - F(C_U)\right\}^{1-\delta}.$$

The right-hand expression allows us to present the likelihood more compactly.


Now, for an iid sample of size 𝑛, the likelihood is

$$L(\theta) = \prod_{i=1}^{n} \left\{f(x_i)\right\}^{\delta_i} \left\{1 - F(C_{U_i})\right\}^{1-\delta_i} = \prod_{\delta_i = 1} f(x_i) \prod_{\delta_i = 0} \left\{1 - F(C_{U_i})\right\},
$$

with potential censoring times $\{C_{U_1}, \ldots, C_{U_n}\}$. Here, the notation “$\prod_{\delta_i = 1}$” means to take the product over uncensored observations, and similarly for “$\prod_{\delta_i = 0}$.”

On the other hand, truncated data are handled in likelihood inference via condi-
tional probabilities. Specifically, we adjust the likelihood contribution by divid-
ing by the probability that the variable was observed. To summarize, we have
the following contributions to the likelihood function for six types of outcomes:

Outcome Likelihood Contribution


exact value 𝑓(𝑥)
right-censoring 1 − 𝐹 (𝐶𝑈 )
left-censoring 𝐹 (𝐶𝐿 )
right-truncation 𝑓(𝑥)/𝐹 (𝐶𝑈 )
left-truncation 𝑓(𝑥)/(1 − 𝐹 (𝐶𝐿 ))
interval-censoring 𝐹 (𝐶𝑈 ) − 𝐹 (𝐶𝐿 )

For known outcomes and censored data, the likelihood is

$$L(\theta) = \prod_{E} f(x_i) \prod_{R}\left\{1 - F(C_{U_i})\right\} \prod_{L} F(C_{L_i}) \prod_{I}\left(F(C_{U_i}) - F(C_{L_i})\right),$$

where “∏𝐸 ” is the product over observations with Exact values, and similarly
for 𝑅ight-, 𝐿eft- and 𝐼nterval-censoring.
For right-censored and left-truncated data, the likelihood is

$$L(\theta) = \prod_{E} \frac{f(x_i)}{1 - F(C_{L_i})} \prod_{R} \frac{1 - F(C_{U_i})}{1 - F(C_{L_i})},$$

and similarly for other combinations. To get further insights, consider the fol-
lowing.

Special Case: Exponential Distribution. Consider data that are


right-censored and left-truncated, with random variables 𝑋𝑖 that are expo-
nentially distributed with mean 𝜃. With these specifications, recall that
𝑓(𝑥) = 𝜃−1 exp(−𝑥/𝜃) and 𝐹 (𝑥) = 1 − exp(−𝑥/𝜃).
For this special case, the log-likelihood is

$$\begin{aligned} l(\theta) &= \sum_{E} \left\{\log f(x_i) - \log(1 - F(C_{L_i}))\right\} + \sum_{R} \left\{\log(1 - F(C_{U_i})) - \log(1 - F(C_{L_i}))\right\} \\ &= \sum_{E}\left(-\log \theta - (x_i - C_{L_i})/\theta\right) - \sum_{R}(C_{U_i} - C_{L_i})/\theta. \end{aligned}$$

To simplify the notation, define $\delta_i = I(X_i \ge C_{U_i})$ to be a binary variable that indicates right-censoring. Let $X_i^{**} = \min(X_i, C_{U_i}) - C_{L_i}$ be the amount that the observed variable exceeds the lower truncation limit. With this, the log-likelihood is

$$l(\theta) = -\sum_{i=1}^{n} \left((1 - \delta_i) \log \theta + \frac{x_i^{**}}{\theta}\right) \qquad (4.5)$$

Taking derivatives with respect to the parameter 𝜃 and setting it equal to zero
yields the maximum likelihood estimator

$$\hat{\theta} = \frac{1}{n_u} \sum_{i=1}^{n} x_i^{**},$$

where 𝑛𝑢 = ∑𝑖 (1 − 𝛿𝑖 ) is the number of uncensored observations.

Example 4.3.2. Actuarial Exam Question. You are given:


(i) A sample of losses is: 600 700 900
(ii) No information is available about losses of 500 or less.
(iii) Losses are assumed to follow an exponential distribution with mean 𝜃.
Calculate the maximum likelihood estimate of 𝜃.
Solution. These observations are truncated at 500. The contribution of each
observation to the likelihood function is

$$\frac{f(x)}{1 - F(500)} = \frac{\theta^{-1} e^{-x/\theta}}{e^{-500/\theta}}$$

Then the likelihood function is

$$L(\theta) = \frac{\theta^{-1} e^{-600/\theta} \cdot \theta^{-1} e^{-700/\theta} \cdot \theta^{-1} e^{-900/\theta}}{\left(e^{-500/\theta}\right)^3} = \theta^{-3} e^{-700/\theta}$$

The log-likelihood is

$$l(\theta) = \log L(\theta) = -3 \log \theta - 700\theta^{-1}$$

Maximizing this expression by setting the derivative with respect to $\theta$ equal to 0, we have

$$l'(\theta) = -3\theta^{-1} + 700\theta^{-2} = 0 \Rightarrow \hat{\theta} = \frac{700}{3} = 233.33.$$
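As a numerical check, the same estimate can be obtained in R by writing each likelihood contribution as $f(x)/(1 - F(500))$; the search interval passed to optimize() is an arbitrary choice for this sketch.

```r
# Sketch: numerical MLE for the left-truncated exponential losses of
# Example 4.3.2 (observations truncated from below at 500).
losses <- c(600, 700, 900)
d      <- 500   # truncation point
loglik <- function(theta) {
  sum(dexp(losses, rate = 1/theta, log = TRUE) -
      pexp(d, rate = 1/theta, lower.tail = FALSE, log.p = TRUE))
}
optimize(loglik, interval = c(10, 5000), maximum = TRUE)$maximum
# approximately 233.33 = 700/3
```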

Example 4.3.3. Actuarial Exam Question. You are given the following
information about a random sample:
(i) The sample size equals five.
(ii) The sample is from a Weibull distribution with 𝜏 = 2.
(iii) Two of the sample observations are known to exceed 50, and the remaining
three observations are 20, 30, and 45.
Calculate the maximum likelihood estimate of 𝜃.
Solution. The likelihood function is

$$\begin{aligned} L(\theta) &= f(20)f(30)f(45)\left[1 - F(50)\right]^2 \\ &= \frac{2(20/\theta)^2 e^{-(20/\theta)^2}}{20} \cdot \frac{2(30/\theta)^2 e^{-(30/\theta)^2}}{30} \cdot \frac{2(45/\theta)^2 e^{-(45/\theta)^2}}{45} \cdot \left(e^{-(50/\theta)^2}\right)^2 \\ &\propto \frac{1}{\theta^6} e^{-8325/\theta^2} \end{aligned}$$

The natural logarithm of the above expression is $-6 \log \theta - \frac{8325}{\theta^2}$. Maximizing this expression by setting its derivative to 0, we get

$$\frac{-6}{\theta} + \frac{16650}{\theta^3} = 0 \Rightarrow \hat{\theta} = \left(\frac{16650}{6}\right)^{1/2} = 52.6783$$

4.3.2 Nonparametric Estimation using Modified Data


Nonparametric estimators provide useful benchmarks, so it is helpful to under-
stand the estimation procedures for grouped, censored, and truncated data.

Grouped Data
As we have seen in Section 4.3.1, observations may be grouped (also referred
to as interval censored) in the sense that we only observe them as belonging in
one of 𝑘 intervals of the form (𝑐𝑗−1 , 𝑐𝑗 ], for 𝑗 = 1, … , 𝑘. At the boundaries, the
empirical distribution function is defined in the usual way:

$$F_n(c_j) = \frac{\text{number of observations} \le c_j}{n}.$$

Ogive Estimator. For other values of 𝑥 ∈ (𝑐𝑗−1 , 𝑐𝑗 ), we can estimate the dis-
tribution function with the ogive estimator, which linearly interpolates between
𝐹𝑛 (𝑐𝑗−1 ) and 𝐹𝑛 (𝑐𝑗 ), i.e. the values of the boundaries 𝐹𝑛 (𝑐𝑗−1 ) and 𝐹𝑛 (𝑐𝑗 ) are
connected with a straight line. This can formally be expressed as
$$F_n(x) = \frac{c_j - x}{c_j - c_{j-1}}\, F_n(c_{j-1}) + \frac{x - c_{j-1}}{c_j - c_{j-1}}\, F_n(c_j) \quad \text{for } c_{j-1} \le x < c_j$$

The corresponding density is

$$f_n(x) = F_n'(x) = \frac{F_n(c_j) - F_n(c_{j-1})}{c_j - c_{j-1}} \quad \text{for } c_{j-1} < x < c_j.$$

Example 4.3.4. Actuarial Exam Question. You are given the following
information regarding claim sizes for 100 claims:

Claim Size Number of Claims


0 − 1, 000 16
1, 000 − 3, 000 22
3, 000 − 5, 000 25
5, 000 − 10, 000 18
10, 000 − 25, 000 10
25, 000 − 50, 000 5
50, 000 − 100, 000 3
over 100, 000 1

Using the ogive, calculate the estimate of the probability that a randomly chosen
claim is between 2000 and 6000.
Solution. At the boundaries, the empirical distribution function is defined in
the usual way, so we have

𝐹100 (1000) = 0.16, 𝐹100 (3000) = 0.38, 𝐹100 (5000) = 0.63, 𝐹100 (10000) = 0.81

For other claim sizes, the ogive estimator linearly interpolates between these
values:

𝐹100 (2000) = 0.5𝐹100 (1000) + 0.5𝐹100 (3000) = 0.5(0.16) + 0.5(0.38) = 0.27


𝐹100 (6000) = 0.8𝐹100 (5000) + 0.2𝐹100 (10000) = 0.8(0.63) + 0.2(0.81) = 0.666

Thus, the probability that a claim is between 2000 and 6000 is 𝐹100 (6000) −
𝐹100 (2000) = 0.666 − 0.27 = 0.396.
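Because the ogive is simply linear interpolation of the empirical cdf at the group boundaries, the calculation can be sketched in R with approxfun(); the open-ended group over 100,000 is not needed for the values requested here.

```r
# Sketch: ogive estimate for Example 4.3.4.
boundaries <- c(0, 1000, 3000, 5000, 10000, 25000, 50000, 100000)
counts     <- c(16, 22, 25, 18, 10, 5, 3)
Fn    <- cumsum(c(0, counts)) / 100     # empirical cdf at the boundaries
ogive <- approxfun(boundaries, Fn)      # linear interpolation between boundaries
ogive(6000) - ogive(2000)
# 0.396, matching the hand calculation
```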

Right-Censored Empirical Distribution Function


It can be useful to calibrate parametric estimators with nonparametric methods
that do not rely on a parametric form of the distribution. The product-limit
estimator due to (Kaplan and Meier, 1958) is a well-known estimator of the
distribution function in the presence of censoring.
Motivation for the Kaplan-Meier Product Limit Estimator. To explain
why the product-limit works so well with censored observations, let us first
return to the “usual” case without censoring. Here, the empirical distribution
function 𝐹𝑛 (𝑥) is an unbiased estimator of the distribution function 𝐹 (𝑥). This
is because 𝐹𝑛 (𝑥) is the average of indicator variables each of which are unbiased,
that is, E [𝐼(𝑋𝑖 ≤ 𝑥)] = Pr(𝑋𝑖 ≤ 𝑥) = 𝐹 (𝑥).
Now suppose the random outcome is censored on the right by a limiting amount,
say, 𝐶𝑈 , so that we record the smaller of the two, 𝑋 ∗ = min(𝑋, 𝐶𝑈 ). For values
of 𝑥 that are smaller than 𝐶𝑈 , the indicator variable still provides an unbiased
estimator of the distribution function before we reach the censoring limit. That
is, E [𝐼(𝑋 ∗ ≤ 𝑥)] = 𝐹 (𝑥) because 𝐼(𝑋 ∗ ≤ 𝑥) = 𝐼(𝑋 ≤ 𝑥) for 𝑥 < 𝐶𝑈 . In the
same way, E [𝐼(𝑋 ∗ > 𝑥)] = 1 − 𝐹 (𝑥) = 𝑆(𝑥). But, for 𝑥 > 𝐶𝑈 , 𝐼(𝑋 ∗ ≤ 𝑥) is in
general not an unbiased estimator of 𝐹 (𝑥).
As an alternative, consider two random variables that have different censor-
ing limits. For illustration, suppose that we observe 𝑋1∗ = min(𝑋1 , 5) and
𝑋2∗ = min(𝑋2 , 10) where 𝑋1 and 𝑋2 are independent draws from the same dis-
tribution. For 𝑥 ≤ 5, the empirical distribution function 𝐹2 (𝑥) is an unbiased
estimator of 𝐹 (𝑥). However, for 5 < 𝑥 ≤ 10, the first observation cannot be used
for the distribution function because of the censoring limitation. Instead, the
strategy developed by (Kaplan and Meier, 1958) is to use 𝑆2 (5) as an estimator
of 𝑆(5) and then to use the second observation to estimate the survival function
conditional on survival to time 5, $\Pr(X > x | X > 5) = S(x)/S(5)$. Specifically, for
5 < 𝑥 ≤ 10, the estimator of the survival function is
$$\hat{S}(x) = S_2(5) \times I(X_2^* > x).$$

Kaplan-Meier Product Limit Estimator. Extending this idea, for each


observation 𝑖, let 𝑢𝑖 be the upper censoring limit (= ∞ if no censoring). Thus,
the recorded value is 𝑥𝑖 in the case of no censoring and 𝑢𝑖 if there is censoring.
Let 𝑡1 < ⋯ < 𝑡𝑘 be 𝑘 distinct points at which an uncensored loss occurs, and
let 𝑠𝑗 be the number of uncensored losses 𝑥𝑖 ’s at 𝑡𝑗 . The corresponding risk set
is the number of observations that are active (not censored) at a value less than $t_j$, denoted as $R_j = \sum_{i=1}^{n} I(x_i \ge t_j) + \sum_{i=1}^{n} I(u_i \ge t_j)$.
With this notation, the product-limit estimator of the distribution function
is

$$\hat{F}(x) = \begin{cases} 0 & x < t_1 \\ 1 - \prod_{j: t_j \le x} \left(1 - \frac{s_j}{R_j}\right) & x \ge t_1 \end{cases} \qquad (4.6)$$

For example, if $x$ is smaller than the smallest uncensored loss, then $x < t_1$ and $\hat{F}(x) = 0$. As another example, if $x$ falls between the second and third smallest uncensored losses, then $x \in (t_2, t_3]$ and $\hat{F}(x) = 1 - \left(1 - \frac{s_1}{R_1}\right)\left(1 - \frac{s_2}{R_2}\right)$.

As usual, the corresponding estimate of the survival function is $\hat{S}(x) = 1 - \hat{F}(x)$.

Example 4.3.5. Actuarial Exam Question. The following is a sample of


10 payments:

4 4 5+ 5+ 5+ 8 10+ 10+ 12 15

where + indicates that a loss has exceeded the policy limit.


Using the Kaplan-Meier product-limit estimator, calculate the probability that the loss on a policy exceeds 11, $\hat{S}(11)$.
Solution. There are four event times (non-censored observations). For each
time 𝑡𝑗 , we can calculate the number of events 𝑠𝑗 and the risk set 𝑅𝑗 as the
following:

𝑗 𝑡 𝑗 𝑠𝑗 𝑅𝑗
1 4 2 10
2 8 1 5
3 12 1 2
4 15 1 1

Thus, the Kaplan-Meier estimate of $S(11)$ is

$$\hat{S}(11) = \prod_{j: t_j \le 11} \left(1 - \frac{s_j}{R_j}\right) = \prod_{j=1}^{2} \left(1 - \frac{s_j}{R_j}\right) = \left(1 - \frac{2}{10}\right)\left(1 - \frac{1}{5}\right) = (0.8)(0.8) = 0.64.$$
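The same estimate can be reproduced with the survival package, anticipating the discussion of R's survfit routine later in this section. In this sketch the status indicator is 1 for an uncensored payment and 0 for a payment censored at the policy limit.

```r
# Sketch: Kaplan-Meier estimate of S(11) for Example 4.3.5 via survfit().
library(survival)
amount <- c(4, 4, 5, 5, 5, 8, 10, 10, 12, 15)
status <- c(1, 1, 0, 0, 0, 1, 0, 0, 1, 1)
km <- survfit(Surv(amount, status) ~ 1)
summary(km, times = 11)$surv
# 0.64
```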

Example. 4.3.6. Bodily Injury Claims. We consider again the Boston


auto bodily injury claims data from Derrig et al. (2001) that was introduced in
Example 4.1.11. In that example, we omitted the 17 claims that were censored
by policy limits. Now, we include the full dataset and use the Kaplan-Meier
product limit to estimate the survival function. This is given in Figure 4.14.

Right-Censored, Left-Truncated Empirical Distribution Function. In


addition to right-censoring, we now extend the framework to allow for left-
truncated data. As before, for each observation 𝑖, let 𝑢𝑖 be the upper censoring
limit (= ∞ if no censoring). Further, let 𝑑𝑖 be the lower truncation limit (0 if
no truncation). Thus, the recorded value (if it is greater than 𝑑𝑖 ) is 𝑥𝑖 in the
case of no censoring and 𝑢𝑖 if there is censoring. Let 𝑡1 < ⋯ < 𝑡𝑘 be 𝑘 distinct
points at which an event of interest occurs, and let 𝑠𝑗 be the number of recorded
(Figure: the Kaplan-Meier survival estimate, from 1.0 down to 0.0 on the vertical axis, plotted against claim amounts from 0 to 25,000.)

Figure 4.14: Kaplan-Meier Estimate of the Survival Function for Bodily Injury Claims

events 𝑥𝑖 ’s at time point 𝑡𝑗 . The corresponding risk set is

$$R_j = \sum_{i=1}^{n} I(x_i \ge t_j) + \sum_{i=1}^{n} I(u_i \ge t_j) - \sum_{i=1}^{n} I(d_i \ge t_j).$$

With this new definition of the risk set, the product-limit estimator of the dis-
tribution function is as in equation (4.6).
Greenwood’s Formula. (Greenwood, 1926) derived the formula for the esti-
mated variance of the product-limit estimator to be

$$\widehat{\mathrm{Var}}(\hat{F}(x)) = \left(1 - \hat{F}(x)\right)^2 \sum_{j: t_j \le x} \frac{s_j}{R_j (R_j - s_j)}.$$

As usual, we refer to the square root of the estimated variance as a standard


error, a quantity that is routinely used in confidence intervals and for hypoth-
esis testing. To compute this, R‘s survfit method takes a survival data ob-
ject and creates a new object containing the Kaplan-Meier estimate of the
survival function along with confidence intervals. The Kaplan-Meier method
(type='kaplan-meier') is used by default to construct an estimate of the sur-
vival curve. The resulting discrete survival function has point masses at the
observed event times (discharge dates) 𝑡𝑗 , where the probability of an event
given survival to that duration is estimated as the number of observed events
at the duration 𝑠𝑗 divided by the number of subjects exposed or ’at-risk’ just
prior to the event duration 𝑅𝑗 .
Alternative Estimators. Two alternate types of estimation are also avail-
able for the survfit method. The alternative (type='fh2') handles ties, in
essence, by assuming that multiple events at the same duration occur in some
arbitrary order. Another alternative (type='fleming-harrington') uses the
Nelson-Aalen (see (Aalen, 1978)) estimate of the cumulative hazard func-
tion to obtain an estimate of the survival function. The estimated cumulative
̂
hazard 𝐻(𝑥) starts at zero and is incremented at each observed event duration
𝑡𝑗 by the number of events 𝑠𝑗 divided by the number at risk 𝑅𝑗 . With the same
notation as above, the Nelson-Aalen estimator of the distribution function is
$$\hat{F}_{NA}(x) = \begin{cases} 0 & x < t_1 \\ 1 - \exp\left(-\sum_{j: t_j \le x} \frac{s_j}{R_j}\right) & x \ge t_1. \end{cases}$$

Note that the above expression is a result of the Nelson-Aalen estimator of the cumulative hazard function
$$\hat{H}(x) = \sum_{j: t_j \le x} \frac{s_j}{R_j}$$
and the relationship between the survival function and cumulative hazard function, $\hat{S}_{NA}(x) = e^{-\hat{H}(x)}$.
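As a sketch, the estimator types described above can be compared on the payment data of Example 4.3.5. Note that recent versions of the survival package deprecate the type argument in favor of stype/ctype, and the treatment of tied event times may differ slightly across versions, so the output below is indicative.

```r
# Sketch: product-limit versus Nelson-Aalen based estimates of S(11)
# for the payments of Example 4.3.5.
library(survival)
amount <- c(4, 4, 5, 5, 5, 8, 10, 10, 12, 15)
status <- c(1, 1, 0, 0, 0, 1, 0, 0, 1, 1)
km <- survfit(Surv(amount, status) ~ 1, type = "kaplan-meier")
fh <- survfit(Surv(amount, status) ~ 1, type = "fleming-harrington")
summary(km, times = 11)$surv     # 0.64
summary(fh, times = 11)$surv     # approximately exp(-0.4) = 0.67
summary(km, times = 11)$std.err  # Greenwood-based standard error, about 0.175
```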

Example 4.3.7. Actuarial Exam Question.

For observation 𝑖 of a survival study:

• 𝑑𝑖 is the left truncation point


• 𝑥𝑖 is the observed value if not right censored
• 𝑢𝑖 is the observed value if right censored

You are given:

Observation (𝑖) 1 2 3 4 5 6 7 8 9 10
𝑑𝑖 0 0 0 0 0 0 0 1.3 1.5 1.6
𝑥𝑖 0.9 − 1.5 − − 1.7 − 2.1 2.1 −
𝑢𝑖 − 1.2 − 1.5 1.6 − 1.7 − − 2.3

Calculate the Kaplan-Meier product-limit estimate, $\hat{S}(1.6)$.
Solution. Recall the risk set $R_j = \sum_{i=1}^{n} \left\{I(x_i \ge t_j) + I(u_i \ge t_j) - I(d_i \ge t_j)\right\}$.
Then

$j$    $t_j$    $s_j$    $R_j$              $\hat{S}(t_j)$
1      0.9      1        $10 - 3 = 7$       $1 - \frac{1}{7} = \frac{6}{7}$
2      1.5      1        $8 - 2 = 6$        $\frac{6}{7}\left(1 - \frac{1}{6}\right) = \frac{5}{7}$
3      1.7      1        $5 - 0 = 5$        $\frac{5}{7}\left(1 - \frac{1}{5}\right) = \frac{4}{7}$
4      2.1      2        $3$                $\frac{4}{7}\left(1 - \frac{2}{3}\right) = \frac{4}{21}$

The Kaplan-Meier estimate is therefore $\hat{S}(1.6) = 5/7$.
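Delayed-entry (left-truncated) data of this kind can be handled in R with the counting-process form Surv(entry, exit, status); the following sketch reproduces the estimate above.

```r
# Sketch: Kaplan-Meier with left truncation (delayed entry) and right
# censoring for Example 4.3.7.
library(survival)
entry  <- c(0, 0, 0, 0, 0, 0, 0, 1.3, 1.5, 1.6)
exit   <- c(0.9, 1.2, 1.5, 1.5, 1.6, 1.7, 1.7, 2.1, 2.1, 2.3)
status <- c(1, 0, 1, 0, 0, 1, 0, 1, 1, 0)
fit <- survfit(Surv(entry, exit, status) ~ 1)
summary(fit, times = 1.6)$surv
# 5/7 = 0.714
```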

Example 4.3.8. Actuarial Exam Question. - Continued.

a) Using the Nelson-Aalen estimator, calculate the probability that the loss on a policy exceeds 11, $\hat{S}_{NA}(11)$.
b) Calculate Greenwood’s approximation to the variance of the product-limit estimate $\hat{S}(11)$.

Solution. As before, there are four event times (non-censored observations).


For each time 𝑡𝑗 , we can calculate the number of events 𝑠𝑗 and the risk set 𝑅𝑗
as the following:

$j$    $t_j$    $s_j$    $R_j$
1      4        2        10
2      8        1        5
3      12       1        2
4      15       1        1

The Nelson-Aalen estimate of $S(11)$ is $\hat{S}_{NA}(11) = e^{-\hat{H}(11)} = e^{-0.4} = 0.67$, since
$$\hat{H}(11) = \sum_{j: t_j \le 11} \frac{s_j}{R_j} = \sum_{j=1}^{2} \frac{s_j}{R_j} = \frac{2}{10} + \frac{1}{5} = 0.2 + 0.2 = 0.4.$$

From earlier work, the Kaplan-Meier estimate of $S(11)$ is $\hat{S}(11) = 0.64$. Then Greenwood’s estimate of the variance of the product-limit estimate of $S(11)$ is
$$\widehat{\mathrm{Var}}(\hat{S}(11)) = \left(\hat{S}(11)\right)^2 \sum_{j: t_j \le 11} \frac{s_j}{R_j(R_j - s_j)} = (0.64)^2 \left(\frac{2}{10(8)} + \frac{1}{5(4)}\right) = 0.0307.$$

4.4 Bayesian Inference

In this section, you learn how to:


• Describe the Bayesian model as an alternative to the frequentist approach
and summarize the five components of this modeling approach.
• Summarize posterior distributions of parameters and use these posterior
distributions to predict new outcomes.
• Use conjugate distributions to determine posterior distributions of param-
eters.

4.4.1 Introduction to Bayesian Inference


Up to this point, our inferential methods have focused on the frequentist set-
ting, in which samples are repeatedly drawn from a population. The vector of
parameters 𝜃 is fixed yet unknown, whereas the outcomes 𝑋 are realizations of
random variables.
In contrast, under the Bayesian framework, we view both the model parameters
and the data as random variables. We are uncertain about the parameters 𝜃
and use probability tools to reflect this uncertainty.

To get a sense of the Bayesian framework, begin by recalling Bayes’ rule,

$$\Pr(\text{parameters} \mid \text{data}) = \frac{\Pr(\text{data} \mid \text{parameters}) \times \Pr(\text{parameters})}{\Pr(\text{data})},$$

where
• Pr(𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠) is the distribution of the parameters, known as the prior
distribution.
• Pr(𝑑𝑎𝑡𝑎|𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠) is the sampling distribution. In a frequentist context,
it is used for making inferences about the parameters and is known as the
likelihood.
• Pr(𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠|𝑑𝑎𝑡𝑎) is the distribution of the parameters having observed
the data, known as the posterior distribution.
• Pr(𝑑𝑎𝑡𝑎) is the marginal distribution of the data. It is generally obtained
by integrating (or summing) the joint distribution of data and parameters
over parameter values.
Why Bayes? There are several advantages of the Bayesian approach. First, we
can describe the entire distribution of parameters conditional on the data. This
allows us, for example, to provide probability statements regarding the likeli-
hood of parameters. Second, the Bayesian approach provides a unified approach
for estimating parameters. Some non-Bayesian methods, such as least squares,
require a separate approach to estimate variance components. In contrast, in
Bayesian methods, all parameters can be treated in a similar fashion. This is
convenient for explaining results to consumers of the data analysis. Third, this
approach allows analysts to blend prior information known from other sources
with the data in a coherent manner. This topic is developed in detail in the cred-
ibility Chapter 9. Fourth, Bayesian analysis is particularly useful for forecasting
future responses.
Gamma - Poisson Special Case. To develop intuition, we consider the
gamma-Poisson case that holds a prominent position in actuarial applications.
The idea is to consider a set of random variables 𝑋1 , … , 𝑋𝑛 where each 𝑋𝑖 could
represent the number of claims for the 𝑖th policyholder. Assume that claims of
all policyholders follow the same Poisson so that 𝑋𝑖 has a Poisson distribution
with parameter 𝜆. This is analogous to the likelihood that we first saw in Chap-
ter 2. In a non-Bayesian (or frequentist) context, the parameter 𝜆 is viewed
as an unknown quantity that is not random (it is said to be “fixed”). In the
Bayesian context, the unknown parameter 𝜆 is viewed as uncertain and is mod-
eled as a random variable. In this special case, we use the gamma distribution
to reflect this uncertainty, the prior distribution.
Think of the following two-stage sampling scheme to motivate our probabilistic
set-up.
1. In the first stage, the parameter 𝜆 is drawn from a gamma distribution.

2. In the second stage, for that value of 𝜆, there are 𝑛 draws from the same
(identical) Poisson distribution that are independent, conditional on 𝜆.

From this simple set-up, some important conclusions emerge.

• The marginal, or unconditional, distribution of 𝑋𝑖 is no longer Poisson.


For this special case, it turns out to be a negative binomial distribution
(see the following “Snippet of Theory”).
• The random variables 𝑋1 , … , 𝑋𝑛 are not independent. This is because
they share the common random variable 𝜆.
• As in the frequentist context, the goal is to make statements about likely
values of parameters such as 𝜆 given the observed data 𝑋1 , … , 𝑋𝑛 . How-
ever, because now both the parameter and the data are random variables,
we can use the language of conditional probability to make such state-
ments. As we will see in Section 4.4.4, it turns out that the distribution
of 𝜆 given the data 𝑋1 , … , 𝑋𝑛 is also gamma (with updated parameters),
a result that simplifies the task of inferring likely values of the parameter
𝜆.

Let us demonstrate that the distribution of 𝑋 is negative binomial. We assume


that the distribution of 𝑋 given 𝜆 is Poisson, so that

$$\Pr(X = x \mid \lambda) = e^{-\lambda} \frac{\lambda^x}{\Gamma(x+1)},$$

using notation Γ(𝑥 + 1) = 𝑥! for integer 𝑥. Assume that 𝜆 is a draw from a


gamma distribution with fixed parameters, say, 𝛼 and 𝜃, so this has pdf

$$f(\lambda) = \frac{\lambda^{\alpha - 1}}{\theta^{\alpha} \Gamma(\alpha)} \exp(-\lambda/\theta).$$

We know that a pdf integrates to one and so we have

$$\int_0^{\infty} f(\lambda)\, d\lambda = 1 \implies \theta^{\alpha}\Gamma(\alpha) = \int_0^{\infty} \lambda^{\alpha-1} \exp(-\lambda/\theta)\, d\lambda.$$

From Appendix Chapter 16 on iterated expectations, we have that the pmf of $X$ can be computed in an iterated fashion as

$$\begin{aligned}
\Pr(X = x) &= \mathrm{E}\left\{\Pr(X = x \mid \lambda)\right\} \\
&= \int_0^{\infty} \Pr(X = x \mid \lambda) f(\lambda)\, d\lambda \\
&= \int_0^{\infty} e^{-\lambda} \frac{\lambda^x}{\Gamma(x+1)} \frac{\lambda^{\alpha-1}}{\theta^{\alpha}\Gamma(\alpha)} \exp(-\lambda/\theta)\, d\lambda \\
&= \frac{1}{\theta^{\alpha}\Gamma(x+1)\Gamma(\alpha)} \int_0^{\infty} \lambda^{x+\alpha-1} \exp\left(-\lambda\left(1 + \frac{1}{\theta}\right)\right) d\lambda \\
&= \frac{1}{\theta^{\alpha}\Gamma(x+1)\Gamma(\alpha)}\, \Gamma(x+\alpha)\left(1 + \frac{1}{\theta}\right)^{-(x+\alpha)} \\
&= \frac{\Gamma(x+\alpha)}{\Gamma(x+1)\Gamma(\alpha)} \left(\frac{1}{1+\theta}\right)^{\alpha} \left(\frac{\theta}{1+\theta}\right)^{x}.
\end{aligned}$$

Here, we used the gamma distribution equality with the substitution $\theta_r = 1/(1 + 1/\theta)$. As can be seen from Section 2.2.3, this is a negative binomial distribution with parameters $r = \alpha$ and $\beta = \theta$.
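A short simulation sketch illustrates the two-stage sampling scheme and the negative binomial marginal; the parameter values $\alpha = 3$ and $\theta = 0.5$ are arbitrary choices for this illustration.

```r
# Sketch: the gamma-Poisson mixture has a negative binomial marginal.
set.seed(2020)
alpha <- 3; theta <- 0.5
lambda <- rgamma(100000, shape = alpha, scale = theta)  # stage 1: draw lambda
x      <- rpois(100000, lambda)                         # stage 2: draw X given lambda
mean(x == 2)                                            # simulated Pr(X = 2)
dnbinom(2, size = alpha, prob = 1/(1 + theta))          # negative binomial, r = alpha, beta = theta
```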

In this section, we use small examples that can be done by hand in order to
focus on the foundations. For practical implementation, analysts rely heavily
on simulation methods using modern computational methods such as Markov
Chain Monte Carlo (MCMC) simulation. We will get an exposure to simulation
techniques in Chapter 6, but more intensive techniques such as MCMC require
yet more background. See Hartman (2016) for an introduction to computational
Bayesian methods from an actuarial perspective.

4.4.2 Bayesian Model


With the intuition developed in the preceding Section 4.4.1, we now restate
the Bayesian model with a bit more precision using mathematical notation.
For simplicity, we assume both the outcomes and parameters are continuous
random variables. In the examples, we sometimes ask the viewer to apply these
same principles to discrete versions. Conceptually both the continuous and
discrete cases are the same; mechanically, one replaces a pdf by a pmf and an
integral by a sum.
To emphasize, under the Bayesian perspective, the model parameters and data
are both viewed as random. Our uncertainty about the parameters of the un-
derlying data generating process is reflected in the use of probability tools.
Prior Distribution. Specifically, think about parameters 𝜃 as a random vec-
tor and let 𝜋(𝜃) denote the corresponding mass or density function. This is
knowledge that we have before outcomes are observed and is called the prior
distribution. Typically, the prior distribution is a regular distribution and so
integrates or sums to one, depending on whether 𝜃 is continuous or discrete.

However, we may be very uncertain (or have no clue) about the distribution of
𝜃; the Bayesian machinery allows the following situation

∫ 𝜋(𝜃) 𝑑𝜃 = ∞,

in which case 𝜋(⋅) is called an improper prior.


Model Distribution. The distribution of outcomes given an assumed value of
𝜃 is known as the model distribution and denoted as 𝑓(𝑥|𝜃) = 𝑓𝑋|𝜃 (𝑥|𝜃). This
is the usual frequentist mass or density function. This is simply the likelihood
in the frequentist context and so it is also convenient to use this as a descriptor
for the model distribution.
Joint Distribution. The distribution of outcomes and model parameters is a
joint distribution of two random quantities. Its joint density function is denoted
as 𝑓(𝑥, 𝜃) = 𝑓(𝑥|𝜃)𝜋(𝜃).
Marginal Outcome Distribution. The distribution of outcomes can be ex-
pressed as

𝑓(𝑥) = ∫ 𝑓(𝑥|𝜃)𝜋(𝜃) 𝑑𝜃.

This is analogous to a frequentist mixture distribution. In the mixture distribu-


tion, we combine (or “mix”) different subpopulations. In the Bayesian context,
the marginal distribution is a combination of different realizations of parameters
(in some literatures, you can think about this as combining different “states of
nature”).
Posterior Distribution of Parameters. After outcomes have been observed
(hence the terminology “posterior”), one can use Bayes theorem to write the
density function as

$$\pi(\theta \mid x) = \frac{f(x, \theta)}{f(x)} = \frac{f(x \mid \theta)\pi(\theta)}{f(x)}.$$

The idea is to update your knowledge of the distribution of 𝜃 (𝜋(𝜃)) with the
data 𝑥. Making statements about potential values of parameters is an important
aspect of statistical inference.

4.4.3 Bayesian Inference


Summarizing the Posterior Distribution of Parameters
One way to summarize a distribution is to use a confidence interval type state-
ment. To summarize the posterior distribution of parameters, the interval [𝑎, 𝑏]
is said to be a 100(1 − 𝛼)% credibility interval for 𝜃 if
Pr(𝑎 ≤ 𝜃 ≤ 𝑏|x) ≥ 1 − 𝛼.

Particularly for insurance applications, this is also known as a credible interval


to distinguish it from credibility theory introduced in Chapter 9.
For another approach to summarization, we can look to classical decision analy-
sis. In this set-up, the loss function 𝑙(𝜃,̂ 𝜃) determines the penalty paid for using
the estimate 𝜃 ̂ instead of the true 𝜃. The Bayes estimate is the value that
minimizes the expected loss E [𝑙(𝜃,̂ 𝜃)]. Some important special cases include:

Loss function $l(\hat{\theta}, \theta)$    Descriptor                                   Bayes Estimate

$(\hat{\theta} - \theta)^2$                squared error loss                           $\mathrm{E}(\theta \mid X)$
$|\hat{\theta} - \theta|$                  absolute deviation loss                      median of $\pi(\theta \mid x)$
$I(\hat{\theta} \ne \theta)$               zero-one loss (for discrete probabilities)   mode of $\pi(\theta \mid x)$

Minimizing expected loss is a rigorous method for providing a single “best guess”
about a likely value of a parameter, comparable to a frequentist estimator of
the unknown (fixed) parameter.

Example 4.4.1. Actuarial Exam Question. You are given:


(i) In a portfolio of risks, each policyholder can have at most one claim per
year.
(ii) The probability of a claim for a policyholder during a year is 𝑞.
(iii) The prior density is
𝜋(𝑞) = 𝑞 3 /0.07, 0.6 < 𝑞 < 0.8
A randomly selected policyholder has one claim in Year 1 and zero claims in Year
2. For this policyholder, calculate the posterior probability that 0.7 < 𝑞 < 0.8.
Solution. The posterior density is proportional to the product of the likelihood
function and prior density. Thus,
𝜋(𝑞|1, 0) ∝ 𝑓(1|𝑞) 𝑓(0|𝑞) 𝜋(𝑞) ∝ 𝑞(1 − 𝑞)𝑞 3 = 𝑞 4 − 𝑞 5

To get the exact posterior density, we integrate the above function over its range
(0.6, 0.8)

$$\int_{0.6}^{0.8} (q^4 - q^5)\, dq = \left.\frac{q^5}{5} - \frac{q^6}{6}\right|_{0.6}^{0.8} = 0.014069 \Rightarrow \pi(q \mid 1, 0) = \frac{q^4 - q^5}{0.014069}$$

Then
$$\Pr(0.7 < q < 0.8 \mid 1, 0) = \int_{0.7}^{0.8} \frac{q^4 - q^5}{0.014069}\, dq = 0.5572$$
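The calculation is easily verified numerically, for example with R's integrate() function:

```r
# Sketch: numerical check of Example 4.4.1.
kernel <- function(q) q^4 - q^5                # likelihood x prior, up to a constant
const  <- integrate(kernel, 0.6, 0.8)$value    # 0.014069
integrate(kernel, 0.7, 0.8)$value / const      # 0.5572
```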

Example 4.4.2. Actuarial Exam Question. You are given:


(i) The prior distribution of the parameter Θ has probability density function:

$$\pi(\theta) = \frac{1}{\theta^2}, \quad 1 < \theta < \infty$$
(ii) Given Θ = 𝜃, claim sizes follow a Pareto distribution with parameters
𝛼 = 2 and 𝜃.
A claim of 3 is observed. Calculate the posterior probability that Θ exceeds 2.
Solution: The posterior density, given an observation of 3 is

$$\pi(\theta \mid 3) = \frac{f(3 \mid \theta)\pi(\theta)}{\int_1^{\infty} f(3 \mid \theta)\pi(\theta)\, d\theta} = \frac{\frac{2\theta^2}{(3+\theta)^3}\frac{1}{\theta^2}}{\int_1^{\infty} 2(3+\theta)^{-3}\, d\theta} = \frac{2(3+\theta)^{-3}}{\left.-(3+\theta)^{-2}\right|_1^{\infty}} = 32(3+\theta)^{-3}, \quad \theta > 1$$

Then
$$\Pr(\Theta > 2 \mid 3) = \int_2^{\infty} 32(3+\theta)^{-3}\, d\theta = \left.-16(3+\theta)^{-2}\right|_2^{\infty} = \frac{16}{25} = 0.64$$

Bayesian Predictive Distribution


For another type of statistical inference, it is often of interest to “predict” the
value of a random outcome that is yet to be observed. Specifically, for new data
𝑦, the predictive distribution is

𝑓(𝑦|𝑥) = ∫ 𝑓(𝑦|𝜃)𝜋(𝜃|𝑥)𝑑𝜃.

It is also sometimes called a “posterior predictive” distribution as the distribu-


tion of the new data is conditional on a base set of data.
Using squared error loss for the loss function, the Bayesian prediction of 𝑌
is

E(𝑌 |𝑋) = ∫ 𝑦𝑓(𝑦|𝑋)𝑑𝑦 = ∫ 𝑦 (∫ 𝑓(𝑦|𝜃)𝜋(𝜃|𝑋)𝑑𝜃) 𝑑𝑦

= ∫ (∫ 𝑦𝑓(𝑦|𝜃) 𝑑𝑦) 𝜋(𝜃|𝑋) 𝑑𝜃

= ∫ E(𝑌 |𝜃)𝜋(𝜃|𝑋) 𝑑𝜃.



As noted earlier, for some situations the distribution of parameters is discrete,


not continuous. Having a discrete set of possible parameters allows us to think
of them as alternative “states of nature,” a helpful interpretation.

Example 4.4.3. Actuarial Exam Question. For a particular policy, the


conditional probability of the annual number of claims given Θ = 𝜃, and the
probability distribution of Θ are as follows:

Number of Claims 0 1 2
Probability 2𝜃 𝜃 1 − 3𝜃

𝜃 0.05 0.30
Probability 0.80 0.20

Two claims are observed in Year 1. Calculate the Bayesian prediction of the
number of claims in Year 2.
Solution. Start with the posterior distribution of the parameter

$$\Pr(\theta \mid X) = \frac{\Pr(X \mid \theta)\Pr(\theta)}{\sum_{\theta} \Pr(X \mid \theta)\Pr(\theta)}$$

so
$$\begin{aligned}\Pr(\theta = 0.05 \mid X = 2) &= \frac{\Pr(X=2 \mid \theta=0.05)\Pr(\theta=0.05)}{\Pr(X=2 \mid \theta=0.05)\Pr(\theta=0.05) + \Pr(X=2 \mid \theta=0.3)\Pr(\theta=0.3)} \\ &= \frac{(1 - 3 \times 0.05)(0.8)}{(1 - 3 \times 0.05)(0.8) + (1 - 3 \times 0.3)(0.2)} = \frac{68}{70}.\end{aligned}$$

Thus, $\Pr(\theta = 0.3 \mid X = 2) = 1 - \Pr(\theta = 0.05 \mid X = 2) = \frac{2}{70}$.

From the model distribution, we have

𝐸(𝑋|𝜃) = 0 × 2𝜃 + 1 × 𝜃 + 2 × (1 − 3𝜃) = 2 − 5𝜃.

Thus,

$$\begin{aligned} \mathrm{E}(Y \mid X) &= \sum_{\theta} \mathrm{E}(Y \mid \theta)\pi(\theta \mid X) \\ &= \mathrm{E}(Y \mid \theta=0.05)\pi(\theta=0.05 \mid X) + \mathrm{E}(Y \mid \theta=0.3)\pi(\theta=0.3 \mid X) \\ &= \frac{68}{70}\left(2 - 5(0.05)\right) + \frac{2}{70}\left(2 - 5(0.3)\right) = 1.714. \end{aligned}$$
70 70

Example 4.4.4. Actuarial Exam Question. You are given:



(i) Losses on a company’s insurance policies follow a Pareto distribution with


probability density function:
$$f(x \mid \theta) = \frac{\theta}{(x+\theta)^2}, \quad 0 < x < \infty$$

(ii) For half of the company’s policies 𝜃 = 1 , while for the other half 𝜃 = 3.
For a randomly selected policy, losses in Year 1 were 5. Calculate the posterior
probability that losses for this policy in Year 2 will exceed 8.
Solution. We are given the prior distribution of $\theta$ as $\Pr(\theta = 1) = \Pr(\theta = 3) = \frac{1}{2}$,
the conditional distribution 𝑓(𝑥|𝜃), and the fact that we observed 𝑋1 = 5. The
goal is to find the predictive probability Pr(𝑋2 > 8|𝑋1 = 5).
The posterior probabilities are

$$\begin{aligned}
\Pr(\theta = 1 \mid X_1 = 5) &= \frac{f(5 \mid \theta=1)\Pr(\theta=1)}{f(5 \mid \theta=1)\Pr(\theta=1) + f(5 \mid \theta=3)\Pr(\theta=3)} \\
&= \frac{\frac{1}{36}\left(\frac{1}{2}\right)}{\frac{1}{36}\left(\frac{1}{2}\right) + \frac{3}{64}\left(\frac{1}{2}\right)} = \frac{\frac{1}{72}}{\frac{1}{72} + \frac{3}{128}} = \frac{16}{43} \\
\Pr(\theta = 3 \mid X_1 = 5) &= \frac{f(5 \mid \theta=3)\Pr(\theta=3)}{f(5 \mid \theta=1)\Pr(\theta=1) + f(5 \mid \theta=3)\Pr(\theta=3)} \\
&= 1 - \Pr(\theta = 1 \mid X_1 = 5) = \frac{27}{43}
\end{aligned}$$
Note that the conditional probability that losses exceed 8 is
$$\Pr(X_2 > 8 \mid \theta) = \int_8^{\infty} f(x \mid \theta)\, dx = \int_8^{\infty} \frac{\theta}{(x+\theta)^2}\, dx = \left.-\frac{\theta}{x+\theta}\right|_8^{\infty} = \frac{\theta}{8+\theta}$$

The predictive probability is therefore

$$\begin{aligned}\Pr(X_2 > 8 \mid X_1 = 5) &= \Pr(X_2 > 8 \mid \theta=1)\Pr(\theta=1 \mid X_1=5) + \Pr(X_2 > 8 \mid \theta=3)\Pr(\theta=3 \mid X_1=5) \\ &= \frac{1}{8+1}\left(\frac{16}{43}\right) + \frac{3}{8+3}\left(\frac{27}{43}\right) = 0.2126\end{aligned}$$
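The same answer follows from a few lines of R that combine the two-point posterior with the conditional exceedance probabilities:

```r
# Sketch: numerical check of Example 4.4.4.
f      <- function(x, theta) theta / (x + theta)^2   # model density from (i)
thetas <- c(1, 3)
prior  <- c(0.5, 0.5)
post   <- f(5, thetas) * prior
post   <- post / sum(post)                           # 16/43 and 27/43
S8     <- thetas / (8 + thetas)                      # Pr(X2 > 8 | theta)
sum(S8 * post)
# 0.2126
```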

Example 4.4.5. Actuarial Exam Question. You are given:


(i) The probability that an insured will have at least one loss during any year
is 𝑝.
(ii) The prior distribution for 𝑝 is uniform on [0, 0.5].
(iii) An insured is observed for 8 years and has at least one loss every year.
Calculate the posterior probability that the insured will have at least one loss
during Year 9.
Solution. To ease notation, define x = (1, 1, 1, 1, 1, 1, 1, 1) represent the data
indicating that an insured has at least one loss every year for 8 years. Condi-
tional on knowing 𝑝, this has probability 𝑝8 . With this, the posterior probability
density is proportional to

𝜋(𝑝|x) ∝ Pr(x|𝑝) 𝜋(𝑝) = 𝑝8 (2) ∝ 𝑝8 .

Because a pdf integrates to one, we can calculate the proportionality constant


as
$$\pi(p \mid \mathbf{x}) = \frac{p^8}{\int_0^{0.5} p^8\, dp} = \frac{p^8}{(0.5^9)/9} = 9(0.5^{-9})p^8.$$

Thus, the posterior probability that the insured will have at least one loss during
Year 9 is

$$\begin{aligned}\Pr(X_9 = 1 \mid \mathbf{x}) &= \int_0^{0.5} \Pr(X_9 = 1 \mid p)\, \pi(p \mid \mathbf{x})\, dp \\ &= \int_0^{0.5} p\, \left\{(9)(0.5^{-9})p^8\right\}\, dp \\ &= 9(0.5^{-9})(0.5^{10})/10 = 0.45\end{aligned}$$

Example 4.4.6. Actuarial Exam Question. You are given:


(i) Each risk has at most one claim each year.

Type of Risk Prior Probability Annual Claim Probability


I 0.7 0.1
II 0.2 0.2
III 0.1 0.4

One randomly chosen risk has three claims during Years 1-6. Calculate the
posterior probability of a claim for this risk in Year 7.
Solution. The probabilities are from a binomial distribution with 6 trials in
which 3 successes were observed.

$$\begin{aligned}\Pr(3 \mid \text{I}) &= \binom{6}{3}(0.1^3)(0.9^3) = 0.01458 \\ \Pr(3 \mid \text{II}) &= \binom{6}{3}(0.2^3)(0.8^3) = 0.08192 \\ \Pr(3 \mid \text{III}) &= \binom{6}{3}(0.4^3)(0.6^3) = 0.27648\end{aligned}$$

The probability of observing three successes is

Pr(3) = Pr(3|I) Pr(I) + Pr(3|II) Pr(II) + Pr(3|III) Pr(III)


= 0.7(0.01458) + 0.2(0.08192) + 0.1(0.27648) = 0.054238

The three posterior probabilities are


$$\begin{aligned}\Pr(\text{I} \mid 3) &= \frac{\Pr(3 \mid \text{I})\Pr(\text{I})}{\Pr(3)} = \frac{0.7(0.01458)}{0.054238} = 0.18817 \\ \Pr(\text{II} \mid 3) &= \frac{\Pr(3 \mid \text{II})\Pr(\text{II})}{\Pr(3)} = \frac{0.2(0.08192)}{0.054238} = 0.30208 \\ \Pr(\text{III} \mid 3) &= \frac{\Pr(3 \mid \text{III})\Pr(\text{III})}{\Pr(3)} = \frac{0.1(0.27648)}{0.054238} = 0.50975\end{aligned}$$

The posterior probability of a claim is then

Pr(claim|3) = Pr(claim|I) Pr(I|3) + Pr(claim|II) Pr(II|3) + Pr(claim|III) Pr(III|3)


= 0.1(0.18817) + 0.2(0.30208) + 0.4(0.50975) = 0.28313

4.4.4 Conjugate Distributions


In the Bayesian framework, the key to statistical inference is understanding the
posterior distribution of the parameters. As described in Section 4.4.1, modern
data analysis using Bayesian methods utilize computationally intensive tech-
niques such as MCMC simulation. Another approach for computing posterior
distributions are based on conjugate distributions. Although this approach is
available only for a limited number of distributions, it has the appeal that it
provides closed-form expressions for the distributions, allowing for easy inter-
pretations of results.
To relate the prior and posterior distributions of the parameters, we have the
relationship

$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\pi(\theta)}{f(x)} \propto f(x \mid \theta)\pi(\theta)$$
Posterior is proportional to likelihood $\times$ prior.

For conjugate distributions, the posterior and the prior belong to the same
family of distributions. The following illustration looks at the gamma-Poisson
special case, the most well-known in actuarial applications.
Special Case – Gamma-Poisson - Continued. Assume a Poisson(𝜆) model
distribution and that 𝜆 follows a gamma(𝛼, 𝜃) prior distribution. Then, the
posterior distribution of 𝜆 given the data follows a gamma distribution with
new parameters 𝛼𝑝𝑜𝑠𝑡 = ∑𝑖 𝑥𝑖 + 𝛼 and 𝜃𝑝𝑜𝑠𝑡 = 1/(𝑛 + 1/𝜃).
The model distribution is
$$f(\mathbf{x} \mid \lambda) = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}.$$

The prior distribution is
$$\pi(\lambda) = \frac{(\lambda/\theta)^{\alpha} \exp(-\lambda/\theta)}{\lambda\, \Gamma(\alpha)}.$$

Thus, the posterior distribution is proportional to
$$\pi(\lambda \mid \mathbf{x}) \propto f(\mathbf{x} \mid \lambda)\pi(\lambda) = C \lambda^{\sum_i x_i + \alpha - 1} \exp\left(-\lambda(n + 1/\theta)\right)$$

where $C$ is a constant. We recognize this to be a gamma distribution with new parameters $\alpha_{post} = \sum_i x_i + \alpha$ and $\theta_{post} = 1/(n + 1/\theta)$. Thus, the gamma distribution is a conjugate prior for the Poisson model distribution.
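A minimal sketch of the conjugate updating in R; the prior parameters and claim counts are illustrative assumptions.

```r
# Sketch: gamma-Poisson conjugate updating.
alpha <- 2; theta <- 0.5     # prior: lambda ~ gamma(shape = alpha, scale = theta)
x <- c(0, 1, 0, 2, 1)        # observed claim counts
alpha_post <- sum(x) + alpha
theta_post <- 1 / (length(x) + 1 / theta)
c(alpha_post = alpha_post, theta_post = theta_post,
  posterior_mean = alpha_post * theta_post)  # Bayes estimate under squared error loss
```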

Example 4.4.7. Actuarial Exam Question. You are given:


(i) The conditional distribution of the number of claims per policyholder is
Poisson with mean 𝜆.
(ii) The variable 𝜆 has a gamma distribution with parameters 𝛼 and 𝜃.
(iii) For policyholders with 1 claim in Year 1, the Bayes prediction for the
number of claims in Year 2 is 0.15.
(iv) For policyholders with an average of 2 claims per year in Year 1 and Year
2, the Bayes prediction for the number of claims in Year 3 is 0.20.
Calculate 𝜃.
Solution.
Since the conditional distribution of the number of claims per policyholder,
E(𝑋|𝜆) = Var(𝑋|𝜆) = 𝜆, the Bayes prediction is

E(𝑋2 |𝑋1 ) = ∫ E(𝑋2 |𝜆)𝜋(𝜆|𝑋1 )𝑑𝜆 = 𝛼𝑛𝑒𝑤 𝜃𝑛𝑒𝑤

because the posterior distribution is gamma with parameters 𝛼𝑛𝑒𝑤 and 𝜃𝑛𝑒𝑤 .
For year 1, we have
$$0.15 = (X_1 + \alpha) \times \frac{1}{n + 1/\theta} = (1 + \alpha) \times \frac{1}{1 + 1/\theta},$$
so $0.15(1 + 1/\theta) = 1 + \alpha$. For year 2, we have
$$0.2 = (X_1 + X_2 + \alpha) \times \frac{1}{n + 1/\theta} = (4 + \alpha) \times \frac{1}{2 + 1/\theta},$$
so 0.2(2 + 1/𝜃) = 4 + 𝛼. Equating these yields

0.2(2 + 1/𝜃) = 3 + 0.15(1 + 1/𝜃)

resulting in 𝜃 = 1/55 = 0.018182.

Closed-form expressions mean that results can be readily interpreted and easily
computed; hence, conjugate distributions are useful in actuarial practice. Two
other special cases used extensively are:
• The uncertainty of parameters is summarized using a beta distribution and
the outcomes have a (conditional on the parameter) binomial distribution.
• The uncertainty about the mean of the normal distribution is summarized
using a normal distribution and the outcomes are conditionally normally
distributed.
Additional results on conjugate distributions are summarized in the Appendix
Section 16.3.

4.5 Further Resources and Contributors


Exercises
Here are a set of exercises that guide the viewer through some of the theoretical
foundations of Loss Data Analytics. Each tutorial is based on one or more
questions from the professional actuarial examinations, typically the Society of
Actuaries Exam C/STAM.
Model Selection Guided Tutorials

Contributors
• Edward W. (Jed) Frees and Lisa Gao, University of Wisconsin-Madison, are the principal authors of the initial version of this chapter. Email: jfrees@bus.wisc.edu for chapter comments and suggested improvements.
• Chapter reviewers include: Vytaras Brazauskas, Yvonne Chueh, Eren
Dodd, Hirokazu (Iwahiro) Iwasawa, Joseph Kim, Andrew Kwon-
Nakamura, Jiandong Ren, and Di (Cindy) Xu.
Chapter 5

Aggregate Loss Models

Chapter Preview. This chapter introduces probability models for describing


the aggregate (total) claims that arise from a portfolio of insurance contracts.
We present two standard modeling approaches, the individual risk model and
the collective risk model. Further, we discuss strategies for computing the dis-
tribution of the aggregate claims, including exact methods for special cases,
recursion, and simulation. Finally, we examine the effects of individual policy
modifications such as deductibles, coinsurance, and inflation, on the frequency
and severity distributions, and thus on the aggregate loss distribution.

5.1 Introduction
The objective of this chapter is to build a probability model to describe the
aggregate claims by an insurance system occurring in a fixed time period. The
insurance system could be a single policy, a group insurance contract, a business
line, or an entire book of an insurer’s business. In this chapter, aggregate claims
refer to either the number or the amount of claims from a portfolio of insurance
contracts. However, the modeling framework can be readily applied in the more
general setup.
Consider an insurance portfolio of 𝑛 individual contracts, and let 𝑆 denote the
aggregate losses of the portfolio in a given time period. There are two approaches
to modeling the aggregate losses 𝑆, the individual risk model and the collective
risk model. The individual risk model emphasizes the loss from each individual
contract and represents the aggregate losses as:

𝑆𝑛 = 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 ,

where 𝑋𝑖 (𝑖 = 1, … , 𝑛) is interpreted as the loss amount from the 𝑖th contract.


It is worth stressing that 𝑛 denotes the number of contracts in the portfolio


and thus is a fixed number rather than a random variable. For the individual
risk model, one usually assumes the 𝑋𝑖 ’s are independent. Because of different
contract features such as coverage and exposure, the 𝑋𝑖 ’s are not necessarily
identically distributed. A notable feature of the distribution of each 𝑋𝑖 is the
probability mass at zero corresponding to the event of no claims.
The collective risk model represents the aggregate losses in terms of a frequency
distribution and a severity distribution:

𝑆𝑁 = 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑁 .

Here, one thinks of a random number of claims 𝑁 that may represent either
the number of losses or the number of payments. In contrast, in the individual
risk model, we use a fixed number of contracts 𝑛. We think of 𝑋1 , 𝑋2 , … , 𝑋𝑁
as representing the amount of each loss. Each loss may or may not correspond
to a unique contract. For instance, there may be multiple claims arising from
a single contract. It is natural to think about 𝑋𝑖 > 0 because if 𝑋𝑖 = 0
then no claim has occurred. Typically we assume that conditional on 𝑁 = 𝑛,
𝑋1 , 𝑋2 , … , 𝑋𝑛 are iid random variables. The distribution of 𝑁 is known as
the frequency distribution, and the common distribution of 𝑋 is known as the
severity distribution. We further assume 𝑁 and 𝑋 are independent. With the
collective risk model, we may decompose the aggregate losses into the frequency
(𝑁 ) process and the severity (𝑋) model. This flexibility allows the analyst to
comment on these two separate components. For example, sales growth due to
lower underwriting standards could lead to higher frequency of losses but might
not affect severity. Similarly, inflation or other economic forces could have an
impact on severity but not on frequency.

5.2 Individual Risk Model


As noted earlier, for the individual risk model, we think of 𝑋𝑖 as the loss from
𝑖th contract and interpret

𝑆𝑛 = 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 ,

to be the aggregate loss from all contracts in a portfolio or group of contracts.


Here, the 𝑋𝑖 ’s are not necessarily identically distributed and we have
𝑛
E(𝑆𝑛 ) = ∑ E(𝑋𝑖 ) .
𝑖=1

Under the independence assumption on 𝑋𝑖 ’s (which implies Cov (𝑋𝑖 , 𝑋𝑗 ) = 0


for all 𝑖 ≠ 𝑗), it can further be shown that
$$\begin{aligned}\mathrm{Var}(S_n) &= \sum_{i=1}^{n} \mathrm{Var}(X_i) \\ P_{S_n}(z) &= \prod_{i=1}^{n} P_{X_i}(z) \\ M_{S_n}(t) &= \prod_{i=1}^{n} M_{X_i}(t),\end{aligned}$$

where 𝑃𝑆𝑛 (⋅) and 𝑀𝑆𝑛 (⋅) are the probability generating function (pgf ) and the
moment generating function (mgf ) of 𝑆𝑛 , respectively. The distribution of each
𝑋𝑖 contains a probability mass at zero, corresponding to the event of no claims
from the 𝑖th contract. One strategy to incorporate the zero mass in the distri-
bution is to use the two-part framework:
$$X_i = I_i \times B_i = \begin{cases} 0, & \text{if } I_i = 0 \\ B_i, & \text{if } I_i = 1. \end{cases}$$
Here, 𝐼𝑖 is a Bernoulli variable indicating whether or not a loss occurs for
the 𝑖th contract, and 𝐵𝑖 is a random variable with nonnegative support rep-
resenting the amount of losses of the contract given loss occurrence. Assume
that 𝐼1 , … , 𝐼𝑛 , 𝐵1 , … , 𝐵𝑛 are mutually independent. Denote Pr(𝐼𝑖 = 1) = 𝑞𝑖 ,
𝜇𝑖 = E(𝐵𝑖 ), and 𝜎𝑖2 = Var(𝐵𝑖 ). It can be shown (see Technical Supplement
5.A.1 for details) that
$$\begin{aligned}
\mathrm{E}(S_n) &= \sum_{i=1}^{n} q_i \mu_i \\
\mathrm{Var}(S_n) &= \sum_{i=1}^{n} \left(q_i \sigma_i^2 + q_i(1 - q_i)\mu_i^2\right) \\
P_{S_n}(z) &= \prod_{i=1}^{n} \left(1 - q_i + q_i P_{B_i}(z)\right) \\
M_{S_n}(t) &= \prod_{i=1}^{n} \left(1 - q_i + q_i M_{B_i}(t)\right).
\end{aligned}$$

A special case of the above model is when 𝐵𝑖 follows a degenerate distribution


with 𝜇𝑖 = 𝑏𝑖 and 𝜎𝑖2 = 0. One example is term life insurance or a pure en-
dowment insurance where 𝑏𝑖 represents the insurance benefit amount of the 𝑖th
contract.
Another strategy to accommodate the zero mass in the loss from each contract
is to consider them in aggregate at the portfolio level, as in the collective risk
model. Here, the aggregate loss is 𝑆𝑁 = 𝑋1 + ⋯ + 𝑋𝑁 , where 𝑁 is a random
variable representing the number of non-zero claims that occurred out of the
entire group of contracts. Thus, not every contract in the portfolio may be
represented in this sum, and 𝑆𝑁 = 0 when 𝑁 = 0. The collective risk model
will be discussed in detail in the next section.

Example 5.2.1. Actuarial Exam Question. An insurance company sold


300 fire insurance policies as follows:

Number of Policy Probability of


Policies Maximum Claim Per Policy
(𝑀𝑖 ) (𝑞𝑖 )
100 400 0.05
200 300 0.06

You are given:


(i) The claim amount for each policy, 𝑋𝑖 , is uniformly distributed between 0
and the policy maximum 𝑀𝑖 .
(ii) The probability of more than one claim per policy is 0.
(iii) Claim occurrences are independent.
Calculate the mean, E (𝑆300 ), and variance, Var (𝑆300 ), of the aggregate claims.
How would these results change if every claim is equal to the policy maximum?
Solution. The aggregate claims are 𝑆300 = 𝑋1 + ⋯ + 𝑋300 , where 𝑋1 , … , 𝑋300
are independent but not identically distributed. Policy claims amounts are
uniformly distributed on (0, 𝑀𝑖 ), so the mean claim amount is 𝑀𝑖 /2 and the
variance is 𝑀𝑖2 /12. Thus, for policy 𝑖 = 1, … , 300, we have

Number of Policy Probability of Mean Variance


Policies Maximum Claim Per Policy Amount Amount
(𝑀𝑖 ) (𝑞𝑖 ) (𝜇𝑖 ) (𝜎𝑖2 )
100 400 0.05 200 4002 /12
200 300 0.06 150 3002 /12

The mean of the aggregate claims is

300
E (𝑆300 ) = ∑ 𝑞𝑖 𝜇𝑖 = 100 {0.05(200)} + 200 {0.06(150)} = 2, 800
𝑖=1

The variance of the aggregate claims is

$$\begin{aligned}
\mathrm{Var}(S_{300}) &= \sum_{i=1}^{300}\left(q_i \sigma_i^2 + q_i(1-q_i)\mu_i^2\right) \quad \text{since } X_i\text{'s are independent} \\
&= 100\left\{0.05\left(\frac{400^2}{12}\right) + 0.05(1 - 0.05)200^2\right\} + 200\left\{0.06\left(\frac{300^2}{12}\right) + 0.06(1 - 0.06)150^2\right\} \\
&= 600{,}467.
\end{aligned}$$

Follow-Up. Now suppose everybody receives the policy maximum $M_i$ if a claim occurs. What is the expected aggregate loss $\mathrm{E}(\tilde{S})$ and variance of the aggregate loss $\mathrm{Var}(\tilde{S})$?
Each policy claim amount 𝑋𝑖 is now deterministic and fixed at 𝑀𝑖 instead of a
randomly distributed amount, so 𝜎𝑖2 = Var (𝑋𝑖 ) = 0 and 𝜇𝑖 = 𝑀𝑖 . Again, the
probability of a claim occurring for each policy is 𝑞𝑖 . Under these circumstances,
the expected aggregate loss is
300
E (𝑆)̃ = ∑ 𝑞𝑖 𝜇𝑖 = 100 {0.05(400)} + 200 {0.06(300)} = 5, 600.
𝑖=1

The variance of the aggregate loss is


$$\begin{aligned}\mathrm{Var}(\tilde{S}) &= \sum_{i=1}^{300}\left(q_i\sigma_i^2 + q_i(1-q_i)\mu_i^2\right) = \sum_{i=1}^{300}\left(q_i(1-q_i)\mu_i^2\right) \\ &= 100\left\{(0.05)(1-0.05)400^2\right\} + 200\left\{(0.06)(1-0.06)300^2\right\} \\ &= 1{,}775{,}200.\end{aligned}$$
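For illustration, the two-part formulas reproduce both sets of results in a few lines of R:

```r
# Sketch: individual risk model moments for Example 5.2.1 and its follow-up,
# using E(S) = sum(q * mu) and Var(S) = sum(q * sigma^2 + q * (1 - q) * mu^2).
n_pol <- c(100, 200)      # number of policies of each type
M     <- c(400, 300)      # policy maximums
q     <- c(0.05, 0.06)    # claim probabilities
# uniform claim amounts on (0, M):
mu <- M / 2; sigma2 <- M^2 / 12
c(mean = sum(n_pol * q * mu),
  var  = sum(n_pol * (q * sigma2 + q * (1 - q) * mu^2)))   # 2800 and 600467
# follow-up: every claim equals the policy maximum (degenerate severity)
mu <- M; sigma2 <- 0
c(mean = sum(n_pol * q * mu),
  var  = sum(n_pol * (q * sigma2 + q * (1 - q) * mu^2)))   # 5600 and 1775200
```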

The individual risk model can also be used for claim frequency. If 𝑋𝑖 denotes
the number of claims from the 𝑖th contract, then 𝑆𝑛 is interpreted as the total
number of claims from the portfolio. In this case, the above two-part frame-
work still applies since there is a probability mass at zero for contracts that
do not experience any claims. Assume 𝑋𝑖 belongs to the (𝑎, 𝑏, 0) class with
pmf denoted by 𝑝𝑖𝑘 = Pr(𝑋𝑖 = 𝑘) for 𝑘 = 0, 1, … (see Section 2.3). Let 𝑋𝑖𝑇
denote the associated zero-truncated distribution in the (𝑎, 𝑏, 1) class with pmf
𝑇
𝑝𝑖𝑘 = 𝑝𝑖𝑘 /(1 − 𝑝𝑖0 ) for 𝑘 = 1, 2, … (see Section 2.5.1). Using the relationship
between their probability generating functions (see Technical Supplement 5.A.2
for details):
𝑃𝑋𝑖 (𝑧) = 𝑝𝑖0 + (1 − 𝑝𝑖0 )𝑃𝑋𝑖𝑇 (𝑧),
we can write 𝑋𝑖 = 𝐼𝑖 ×𝐵𝑖 with 𝑞𝑖 = Pr(𝐼𝑖 = 1) = Pr(𝑋𝑖 > 0) = 1−𝑝𝑖0 and 𝐵𝑖 =
𝑋𝑖𝑇 . Notice that in this case, we have a zero-modified distribution since the 𝐼𝑖
variable covers the modified probability mass at zero with 𝑞𝑖 = Pr(𝐼𝑖 = 1), while
the 𝐵𝑖 = 𝑋𝑖𝑇 covers the discrete non-zero frequency portion. See Section 2.5.1
for the relationship between zero-truncated and zero-modified distributions.

Example 5.2.2. An insurance company sold a portfolio of 100 independent


homeowners insurance policies, each of which has claim frequency following a
zero-modified Poisson distribution, as follows:

Type of Number of Probability of 𝜆


Policy Policies At Least 1 Claim
Low-risk 40 0.03 1
High-risk 60 0.05 2

Find the expected value and variance of the claim frequency for the entire
portfolio.
Solution. For each policy, we can write the zero-modified Poisson claim fre-
quency 𝑁𝑖 as 𝑁𝑖 = 𝐼𝑖 × 𝐵𝑖 , where

𝑞𝑖 = Pr(𝐼𝑖 = 1) = Pr(𝑁𝑖 > 0) = 1 − 𝑝𝑖0 .

For the low-risk policies, we have 𝑞𝑖 = 0.03 and for the high-risk policies, we
have 𝑞𝑖 = 0.05. Further, 𝐵𝑖 = 𝑁𝑖𝑇 , the zero-truncated version of 𝑁𝑖 . Thus, we
have

$$\begin{aligned}\mu_i &= \mathrm{E}(B_i) = \mathrm{E}(N_i^T) = \frac{\lambda}{1 - e^{-\lambda}} \\ \sigma_i^2 &= \mathrm{Var}(B_i) = \mathrm{Var}(N_i^T) = \frac{\lambda\left[1 - (\lambda+1)e^{-\lambda}\right]}{(1 - e^{-\lambda})^2}.\end{aligned}$$

Using $n = 100$, let the portfolio claim frequency be $S_{100} = \sum_{i=1}^{100} N_i$. Using the formulas above, the expected claim frequency of the portfolio is
$$\begin{aligned}\mathrm{E}(S_{100}) &= \sum_{i=1}^{100} q_i\mu_i \\ &= 40\left[0.03\left(\frac{1}{1-e^{-1}}\right)\right] + 60\left[0.05\left(\frac{2}{1-e^{-2}}\right)\right] \\ &= 40(0.03)(1.5820) + 60(0.05)(2.3130) = 8.8375.\end{aligned}$$

The variance of the claim frequency of the portfolio is

$$\begin{aligned}\mathrm{Var}(S_{100}) &= \sum_{i=1}^{100}\left(q_i\sigma_i^2 + q_i(1-q_i)\mu_i^2\right) \\ &= 40\left[0.03\left(\frac{1 - 2e^{-1}}{(1-e^{-1})^2}\right) + 0.03(0.97)(1.5820^2)\right] \\ &\quad + 60\left[0.05\left(\frac{2\left[1 - 3e^{-2}\right]}{(1-e^{-2})^2}\right) + 0.05(0.95)(2.3130^2)\right] \\ &= 23.7214.\end{aligned}$$
Note that equivalently, we could have calculated the mean and variance of an
individual policy directly using the relationship between the zero-modified and
zero-truncated Poisson distributions (see Section 2.3).
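A Monte Carlo sketch can be used to check these moments. The zero-truncated Poisson draws below use an inverse-cdf construction, which is an implementation choice for this illustration.

```r
# Sketch: Monte Carlo check of the portfolio mean and variance in Example 5.2.2.
# A zero-modified Poisson count is 0 with probability 1 - q and otherwise a
# draw from the zero-truncated Poisson.
set.seed(2020)
rztpois <- function(n, lambda) qpois(runif(n, dpois(0, lambda), 1), lambda)
sim_portfolio <- function() {
  low  <- rbinom(40, 1, 0.03) * rztpois(40, 1)
  high <- rbinom(60, 1, 0.05) * rztpois(60, 2)
  sum(low) + sum(high)
}
s <- replicate(10000, sim_portfolio())
c(mean = mean(s), var = var(s))
# close to 8.84 and 23.72
```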

To understand the distribution of the aggregate loss, one could use the central
limit theorem to approximate the distribution of $S_n$ for large $n$. Denote $\mu_{S_n} = \mathrm{E}(S_n)$ and $\sigma^2_{S_n} = \mathrm{Var}(S_n)$ and let $Z \sim N(0,1)$, a standard normal random variable with cdf $\Phi$. Then the cdf of $S_n$ can be approximated as follows:
$$F_{S_n}(s) = \Pr(S_n \le s) = \Pr\left(\frac{S_n - \mu_{S_n}}{\sigma_{S_n}} \le \frac{s - \mu_{S_n}}{\sigma_{S_n}}\right) \approx \Pr\left(Z \le \frac{s - \mu_{S_n}}{\sigma_{S_n}}\right) = \Phi\left(\frac{s - \mu_{S_n}}{\sigma_{S_n}}\right).$$

Example 5.2.3. Actuarial Exam Question - Follow-Up. As in the Ex-


ample 5.2.1 earlier, an insurance company sold 300 fire insurance policies, with
claim amounts 𝑋𝑖 uniformly distributed between 0 and the policy maximum 𝑀𝑖 .
Using the normal approximation, calculate the probability that the aggregate
claim amount 𝑆300 exceeds $3, 500.
Solution. We have seen earlier that E(𝑆300 ) = 2, 800 and Var(𝑆300 ) = 600, 467.
Then

$$\begin{aligned}\Pr(S_{300} > 3{,}500) &= 1 - \Pr(S_{300} \le 3{,}500) \\ &\approx 1 - \Phi\left(\frac{3{,}500 - 2{,}800}{\sqrt{600{,}467}}\right) = 1 - \Phi(0.90334) \\ &= 1 - 0.8168 = 0.1832.\end{aligned}$$
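In R, the approximation is a single call to pnorm() using the moments computed earlier:

```r
# Sketch: normal approximation for Example 5.2.3.
mu <- 2800; sigma2 <- 600467
pnorm(3500, mean = mu, sd = sqrt(sigma2), lower.tail = FALSE)
# 0.1832
```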

For small 𝑛, the distribution of 𝑆𝑛 is likely skewed, and the normal approxima-
tion would be a poor choice. To examine the aggregate loss distribution, we go
back to first principles. Specifically, the distribution can be derived recursively.
Define 𝑆𝑘 = 𝑋1 + ⋯ + 𝑋𝑘 , 𝑘 = 1, … , 𝑛.
For 𝑘 = 1:
𝐹𝑆1 (𝑠) = Pr(𝑆1 ≤ 𝑠) = Pr(𝑋1 ≤ 𝑠) = 𝐹𝑋1 (𝑠).

For 𝑘 = 2, … , 𝑛:

$$F_{S_k}(s) = \Pr(X_1 + \cdots + X_k \le s) = \Pr(S_{k-1} + X_k \le s)
= \mathrm{E}_{X_k}\left[\Pr(S_{k-1} \le s - X_k \mid X_k)\right] = \mathrm{E}_{X_k}\left[F_{S_{k-1}}(s - X_k)\right].$$

A special case is when 𝑋𝑖 ’s are identically distributed. Let 𝐹𝑋 (𝑥) = Pr(𝑋 ≤ 𝑥)
be the common distribution of 𝑋𝑖 , 𝑖 = 1, … , 𝑛. We define

$$F_X^{*n}(x) = \Pr(X_1 + \cdots + X_n \le x),$$

the 𝑛-fold convolution of 𝐹𝑋 . More generally, we can compute $F_X^{*n}$ recursively.
Begin the recursion at 𝑘 = 1 using $F_X^{*1}(x) = F_X(x)$. Next, for 𝑘 = 2, we have

$$\begin{aligned}
F_X^{*2}(x) &= \Pr(X_1 + X_2 \le x) = \mathrm{E}_{X_2}\left[\Pr(X_1 \le x - X_2 \mid X_2)\right] \\
&= \mathrm{E}_{X_2}\left[F(x - X_2)\right] \\
&= \begin{cases} \int_0^x F(x-y) f(y)\, dy & \text{for continuous } X_i\text{'s} \\ \sum_{y \le x} F(x-y) f(y) & \text{for discrete } X_i\text{'s.} \end{cases}
\end{aligned}$$

Recall 𝐹 (0) = 0.

Similarly for 𝑘 = 𝑛, we have 𝑆𝑛 = 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 and

$$\begin{aligned}
F^{*n}(x) &= \Pr(S_n \le x) = \Pr(S_{n-1} + X_n \le x) \\
&= \mathrm{E}_{X_n}\left[\Pr(S_{n-1} \le x - X_n \mid X_n)\right] = \mathrm{E}_{X}\left[F^{*(n-1)}(x - X)\right] \\
&= \begin{cases} \int_0^x F^{*(n-1)}(x-y) f(y)\, dy & \text{for continuous } X_i\text{'s} \\ \sum_{y \le x} F^{*(n-1)}(x-y) f(y) & \text{for discrete } X_i\text{'s.} \end{cases}
\end{aligned}$$

When the 𝑋𝑖 ’s are independent and belong to the same family of distributions,
there are some simple cases where 𝑆𝑛 has a closed form. This makes it easy
to compute Pr(𝑆𝑛 ≤ 𝑥). This property is known as closed under convolution,
meaning the distribution of the sum of independent random variables belongs
to the same family of distributions as that of the component variables, just with
different parameters. Table 5.1 provides a few examples.

Table 5.1. Closed Form Partial Sum Distributions

  Distribution of 𝑋𝑖                                   Abbreviation      Distribution of 𝑆𝑛
  Normal with mean 𝜇𝑖 and variance 𝜎𝑖²                 𝑁 (𝜇𝑖 , 𝜎𝑖²)       𝑁 (∑ 𝜇𝑖 , ∑ 𝜎𝑖²)
  Exponential with mean 𝜃                              𝐸𝑥𝑝(𝜃)            𝐺𝑎𝑚(𝑛, 𝜃)
  Gamma with shape 𝛼𝑖 and scale 𝜃                      𝐺𝑎𝑚(𝛼𝑖 , 𝜃)       𝐺𝑎𝑚 (∑ 𝛼𝑖 , 𝜃)
  Poisson with mean (and variance) 𝜆𝑖                  𝑃 𝑜𝑖(𝜆𝑖 )          𝑃 𝑜𝑖 (∑ 𝜆𝑖 )
  Binomial with 𝑚𝑖 trials and 𝑞 success probability    𝐵𝑖𝑛(𝑚𝑖 , 𝑞)       𝐵𝑖𝑛 (∑ 𝑚𝑖 , 𝑞)
  Geometric with mean 𝛽                                𝐺𝑒𝑜(𝛽)            𝑁 𝐵(𝛽, 𝑛)
  Negative binomial with mean 𝑟𝑖 𝛽                     𝑁 𝐵(𝛽, 𝑟𝑖 )       𝑁 𝐵 (𝛽, ∑ 𝑟𝑖 )
    and variance 𝑟𝑖 𝛽(1 + 𝛽)

  (All sums run over 𝑖 = 1, … , 𝑛.)
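As an illustration of this closed-under-convolution property (this check is not in the original text), one can verify numerically that the sum of iid exponentials is gamma distributed; the parameter values below are arbitrary choices.

set.seed(2020)
n <- 5; theta <- 100
S <- replicate(10000, sum(rexp(n, rate = 1/theta)))   # partial sums of n exponentials
ks.test(S, "pgamma", shape = n, scale = theta)        # large p-value: consistent with Gam(n, theta)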

Example 5.2.4. Gamma Distribution. Assume that 𝑋1 , … , 𝑋𝑛 are independent
random variables with 𝑋𝑖 ∼ 𝐺𝑎𝑚(𝛼𝑖 , 𝜃). The mgf of 𝑋𝑖 is $M_{X_i}(t) = (1 - \theta t)^{-\alpha_i}$.
Thus, the mgf of the sum 𝑆𝑛 = 𝑋1 + ⋯ + 𝑋𝑛 is

$$\begin{aligned}
M_{S_n}(t) &= \prod_{i=1}^n M_{X_i}(t) \quad \text{from the independence of the } X_i\text{'s} \\
&= \prod_{i=1}^n (1 - \theta t)^{-\alpha_i} = (1 - \theta t)^{-\sum_{i=1}^n \alpha_i},
\end{aligned}$$

which is the mgf of a gamma random variable with parameters $(\sum_{i=1}^n \alpha_i, \theta)$.
Thus, $S_n \sim Gam\left(\sum_{i=1}^n \alpha_i, \theta\right)$.

Example 5.2.5. Negative Binomial Distribution. Assume that 𝑋1 , … , 𝑋𝑛
are independent random variables with 𝑋𝑖 ∼ 𝑁 𝐵(𝛽, 𝑟𝑖 ). The pgf of 𝑋𝑖 is
$P_{X_i}(z) = \left[1 - \beta(z-1)\right]^{-r_i}$. Thus, the pgf of the sum 𝑆𝑛 = 𝑋1 + ⋯ + 𝑋𝑛 is

$$\begin{aligned}
P_{S_n}(z) &= \mathrm{E}\left[z^{S_n}\right] \\
&= \mathrm{E}\left[z^{X_1}\right] \cdots \mathrm{E}\left[z^{X_n}\right] \quad \text{from the independence of the } X_i\text{'s} \\
&= \prod_{i=1}^n P_{X_i}(z) = \prod_{i=1}^n \left[1 - \beta(z-1)\right]^{-r_i} = \left[1 - \beta(z-1)\right]^{-\sum_{i=1}^n r_i},
\end{aligned}$$

which is the pgf of a negative binomial random variable with parameters
$(\beta, \sum_{i=1}^n r_i)$. Thus, $S_n \sim NB\left(\beta, \sum_{i=1}^n r_i\right)$.

Example 5.2.6. Actuarial Exam Question (modified). The annual num-


ber of doctor visits for each individual in a family of 4 has geometric distribution
with mean 1.5. The annual numbers of visits for the family members are mu-
tually independent. An insurance pays 100 per doctor visit beginning with the
4th visit per family. Calculate the probability that the family will not receive
an insurance payment this year.
Solution. Let 𝑋𝑖 ∼ 𝐺𝑒𝑜(𝛽 = 1.5) be the number of doctor visits for one
individual in the family and 𝑆4 = 𝑋1 + 𝑋2 + 𝑋3 + 𝑋4 be the number of doctor
visits for the family. The sum of 4 independent geometric random variables each
with mean 𝛽 = 1.5 follows a negative binomial distribution, i.e. 𝑆4 ∼ 𝑁 𝐵(𝛽 =
1.5, 𝑟 = 4).
If the insurance pays 100 per visit beginning with the 4th visit for the family,
then the family will not receive an insurance payment if they have less than 4
claims. This probability is

$$\begin{aligned}
\Pr(S_4 < 4) &= \Pr(S_4 = 0) + \Pr(S_4 = 1) + \Pr(S_4 = 2) + \Pr(S_4 = 3) \\
&= (1 + 1.5)^{-4} + \frac{4(1.5)}{(1+1.5)^5} + \frac{4(5)(1.5^2)}{2(1+1.5)^6} + \frac{4(5)(6)(1.5^3)}{3!(1+1.5)^7} \\
&= 0.0256 + 0.0614 + 0.0922 + 0.1106 = 0.2898.
\end{aligned}$$
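As a check (not part of the original solution), the same probability is available from R's built-in negative binomial cdf; note that R parameterizes the distribution by the success probability prob = 1/(1 + 𝛽):

pnbinom(3, size = 4, prob = 1/(1 + 1.5))   # Pr(S_4 <= 3) = 0.2898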

5.3 Collective Risk Model


5.3.1 Moments and Distribution
Under the collective risk model 𝑆𝑁 = 𝑋1 + ⋯ + 𝑋𝑁 , {𝑋𝑖 } are iid, and indepen-
dent of 𝑁 . Let 𝜇 = E (𝑋𝑖 ) and 𝜎2 = Var (𝑋𝑖 ) for all 𝑖. Thus, conditional on
𝑁 , we have that the expectation of the sum is the sum of expectations and that
the variance of the sum is the sum of variances,

E(𝑆|𝑁 ) = E(𝑋1 + ⋯ + 𝑋𝑁 |𝑁 ) = 𝜇𝑁
Var(𝑆|𝑁 ) = Var(𝑋1 + ⋯ + 𝑋𝑁 |𝑁 ) = 𝜎2 𝑁 .

Using the law of iterated expectations from Appendix Section 16.2, the mean
of the aggregate loss is

E(𝑆𝑁 ) = E𝑁 [E𝑆 (𝑆|𝑁 )] = E𝑁 (𝑁 𝜇) = 𝜇 E(𝑁 ).

Using the law of total variance from Appendix Section 16.2, the variance of the
aggregate loss is

Var(𝑆𝑁 ) = E𝑁 [Var(𝑆𝑁 |𝑁 )] + Var𝑁 [E(𝑆𝑁 |𝑁 )]


= E𝑁 [𝜎2 𝑁 ] + Var𝑁 [𝜇𝑁 ]
= 𝜎2 E[𝑁 ] + 𝜇2 Var[𝑁 ].

Special Case: Poisson Distributed Frequency. If 𝑁 ∼ 𝑃 𝑜𝑖(𝜆), then

E(𝑁 ) = Var(𝑁 ) = 𝜆
E(𝑆𝑁 ) = 𝜆 E(𝑋)
Var(𝑆𝑁 ) = 𝜆(𝜎2 + 𝜇2 ) = 𝜆 E(𝑋 2 ).

Example 5.3.1. Actuarial Exam Question. The number of accidents fol-


lows a Poisson distribution with mean 12. Each accident generates 1, 2, or 3
claimants with probabilities 1/2, 1/3, and 1/6 respectively.
Calculate the variance in the total number of claimants.
Solution.

$$\mathrm{E}(X^2) = 1^2\left(\frac{1}{2}\right) + 2^2\left(\frac{1}{3}\right) + 3^2\left(\frac{1}{6}\right) = \frac{10}{3}
\quad\Rightarrow\quad \mathrm{Var}(S_N) = \lambda\, \mathrm{E}(X^2) = 12\left(\frac{10}{3}\right) = 40.$$

Alternatively, using the general approach, $\mathrm{Var}(S_N) = \sigma^2 \mathrm{E}(N) + \mu^2 \mathrm{Var}(N)$,
where

$$\begin{aligned}
\mathrm{E}(N) &= \mathrm{Var}(N) = 12 \\
\mu = \mathrm{E}(X) &= 1\left(\frac{1}{2}\right) + 2\left(\frac{1}{3}\right) + 3\left(\frac{1}{6}\right) = \frac{5}{3} \\
\sigma^2 = \mathrm{E}(X^2) - [\mathrm{E}(X)]^2 &= \frac{10}{3} - \frac{25}{9} = \frac{5}{9} \\
\Rightarrow \mathrm{Var}(S_N) &= \left(\frac{5}{9}\right)(12) + \left(\frac{5}{3}\right)^2(12) = 40.
\end{aligned}$$

In general, the moments of 𝑆𝑁 can be derived from its moment generating


function (mgf ). Because 𝑋𝑖 ’s are iid, we denote the mgf of 𝑋 as 𝑀𝑋 (𝑡) =
E (𝑒𝑡𝑋 ). Using the law of iterated expectations, the mgf of 𝑆𝑁 is

𝑀𝑆𝑁 (𝑡) = E(𝑒𝑡𝑆𝑁 ) = E𝑁 [ E(𝑒𝑡𝑆𝑁 |𝑁 ) ]


= E𝑁 [ E (𝑒𝑡(𝑋1 +⋯+𝑋𝑁 ) ) ] = E𝑁 [E(𝑒𝑡𝑋1 ) ⋯ E(𝑒𝑡𝑋𝑁 )] since 𝑋𝑖 ’s are independent
𝑁
= E𝑁 [ (𝑀𝑋 (𝑡)) ].

Now, recall that the probability generating function (pgf ) of 𝑁 is 𝑃𝑁 (𝑧) = E(𝑧 𝑁 ).
Denote 𝑀𝑋 (𝑡) = 𝑧. Substituting into the expression for the mgf of 𝑆𝑁 above,
it is shown

𝑀𝑆𝑁 (𝑡) = E (𝑧 𝑁 ) = 𝑃𝑁 (𝑧) = 𝑃𝑁 [𝑀𝑋 (𝑡)].

Similarly, if 𝑆𝑁 is discrete, one can show the pgf of 𝑆𝑁 is:

𝑃𝑆𝑁 (𝑧) = 𝑃𝑁 [𝑃𝑋 (𝑧)].

To get $\mathrm{E}(S_N) = M_{S_N}'(0)$, we use the chain rule

$$M_{S_N}'(t) = \frac{\partial}{\partial t} P_N(M_X(t)) = P_N'(M_X(t))\, M_X'(t)$$

and recall $M_X(0) = 1$, $M_X'(0) = \mathrm{E}(X) = \mu$, and $P_N'(1) = \mathrm{E}(N)$. So,

$$\mathrm{E}(S_N) = M_{S_N}'(0) = P_N'(M_X(0))\, M_X'(0) = \mu\, \mathrm{E}(N).$$

Similarly, one could use the relation $\mathrm{E}(S_N^2) = M_{S_N}''(0)$ to get

$$\mathrm{Var}(S_N) = \sigma^2 \mathrm{E}(N) + \mu^2 \mathrm{Var}(N).$$


Special Case. Poisson Frequency. Let 𝑁 ∼ 𝑃 𝑜𝑖(𝜆). Thus, the pgf of 𝑁 is


𝑃𝑁 (𝑧) = 𝑒𝜆(𝑧−1) and the mgf of 𝑆𝑁 is

𝑀𝑆𝑁 (𝑡) = 𝑃𝑁 [𝑀𝑋 (𝑡)] = 𝑒𝜆(𝑀𝑋 (𝑡)−1) .

Taking derivatives yields

$$\begin{aligned}
M_{S_N}'(t) &= e^{\lambda(M_X(t)-1)}\, \lambda\, M_X'(t) = M_{S_N}(t)\, \lambda\, M_X'(t) \\
M_{S_N}''(t) &= M_{S_N}(t)\, \lambda\, M_X''(t) + \left[M_{S_N}(t)\, \lambda\, M_X'(t)\right] \lambda\, M_X'(t).
\end{aligned}$$

Evaluating these at 𝑡 = 0 yields

$$\mathrm{E}(S_N) = M_{S_N}'(0) = \lambda\, \mathrm{E}(X) = \lambda\mu$$

and

$$M_{S_N}''(0) = \lambda\, \mathrm{E}(X^2) + \lambda^2\mu^2
\quad\Rightarrow\quad \mathrm{Var}(S_N) = \lambda\, \mathrm{E}(X^2) + \lambda^2\mu^2 - (\lambda\mu)^2 = \lambda\, \mathrm{E}(X^2).$$

Example 5.3.2. Actuarial Exam Question. You are the producer of a


television quiz show that gives cash prizes. The number of prizes, 𝑁 , and prize
amount, 𝑋, have the following distributions:

  𝑛    Pr(𝑁 = 𝑛)          𝑥       Pr(𝑋 = 𝑥)
  1    0.8                0       0.2
  2    0.2                100     0.7
                          1000    0.1

Your budget for prizes equals the expected aggregate cash prizes plus the stan-
dard deviation of aggregate cash prizes. Calculate your budget.
Solution. We need to calculate the mean and standard deviation of the
aggregate (sum) of cash prizes. The moments of the frequency distribution 𝑁
are

$$\begin{aligned}
\mathrm{E}(N) &= 1(0.8) + 2(0.2) = 1.2 \\
\mathrm{E}(N^2) &= 1^2(0.8) + 2^2(0.2) = 1.6 \\
\mathrm{Var}(N) &= \mathrm{E}(N^2) - [\mathrm{E}(N)]^2 = 0.16.
\end{aligned}$$

The moments of the severity distribution 𝑋 are

$$\begin{aligned}
\mathrm{E}(X) &= 0(0.2) + 100(0.7) + 1000(0.1) = 170 = \mu \\
\mathrm{E}(X^2) &= 0^2(0.2) + 100^2(0.7) + 1000^2(0.1) = 107{,}000 \\
\mathrm{Var}(X) &= \mathrm{E}(X^2) - [\mathrm{E}(X)]^2 = 78{,}100 = \sigma^2.
\end{aligned}$$

Thus, the mean and variance of the aggregate cash prize are

$$\begin{aligned}
\mathrm{E}(S_N) &= \mu\, \mathrm{E}(N) = 170(1.2) = 204 \\
\mathrm{Var}(S_N) &= \sigma^2 \mathrm{E}(N) + \mu^2 \mathrm{Var}(N) = 78{,}100(1.2) + 170^2(0.16) = 98{,}344.
\end{aligned}$$

This gives the following required budget

$$\text{Budget} = \mathrm{E}(S_N) + \sqrt{\mathrm{Var}(S_N)} = 204 + \sqrt{98{,}344} = 517.60.$$
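The following R lines (not from the text) reproduce the budget calculation directly from the two distributions:

n <- c(1, 2);         pn <- c(0.8, 0.2)         # prize-count distribution
x <- c(0, 100, 1000); px <- c(0.2, 0.7, 0.1)    # prize-amount distribution
EN <- sum(n * pn); VarN <- sum(n^2 * pn) - EN^2
EX <- sum(x * px); VarX <- sum(x^2 * px) - EX^2
ES   <- EX * EN                                  # 204
VarS <- VarX * EN + EX^2 * VarN                  # 98,344
ES + sqrt(VarS)                                  # budget of about 517.60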

The distribution of 𝑆𝑁 is called a compound distribution, and it can be derived
based on the convolution of 𝐹𝑋 as follows:

$$\begin{aligned}
F_{S_N}(s) &= \Pr(X_1 + \cdots + X_N \le s) \\
&= \mathrm{E}_N\left[\Pr(X_1 + \cdots + X_N \le s \mid N)\right] \\
&= \mathrm{E}_N\left[F_X^{*N}(s)\right] \\
&= p_0 + \sum_{n=1}^{\infty} p_n F_X^{*n}(s).
\end{aligned}$$

Example 5.3.3. Actuarial Exam Question. The number of claims in a


period has a geometric distribution with mean 4. The amount of each claim 𝑋
follows Pr(𝑋 = 𝑥) = 0.25, 𝑥 = 1, 2, 3, 4, i.e. a discrete uniform distribution on
{1, 2, 3, 4}. The number of claims and the claim amounts are independent. Let
𝑆𝑁 denote the aggregate claim amount in the period. Calculate 𝐹𝑆𝑁 (3).

Solution. By definition, we have

$$\begin{aligned}
F_{S_N}(3) &= \Pr\left(\sum_{i=1}^{N} X_i \le 3\right) = \sum_{n=0}^{\infty} \Pr\left(\sum_{i=1}^{n} X_i \le 3 \,\Big|\, N = n\right) \Pr(N = n) \\
&= \sum_{n} F^{*n}(3)\, p_n = \sum_{n=0}^{3} F^{*n}(3)\, p_n \\
&= p_0 + F^{*1}(3)\, p_1 + F^{*2}(3)\, p_2 + F^{*3}(3)\, p_3.
\end{aligned}$$

Because 𝑁 ∼ 𝐺𝑒𝑜(𝛽 = 4), we know that

$$p_n = \frac{1}{1+\beta}\left(\frac{\beta}{1+\beta}\right)^n = \frac{1}{5}\left(\frac{4}{5}\right)^n.$$

For the claim severity distribution, recursively, we have

$$\begin{aligned}
F^{*1}(3) &= \Pr(X \le 3) = \frac{3}{4} \\
F^{*2}(3) &= \sum_{y \le 3} F^{*1}(3-y) f(y) = F^{*1}(2) f(1) + F^{*1}(1) f(2) \\
&= \frac{1}{4}\left[F^{*1}(2) + F^{*1}(1)\right] = \frac{1}{4}\left[\Pr(X \le 2) + \Pr(X \le 1)\right] = \frac{1}{4}\left(\frac{2}{4} + \frac{1}{4}\right) = \frac{3}{16} \\
F^{*3}(3) &= \Pr(X_1 + X_2 + X_3 \le 3) = \Pr(X_1 = X_2 = X_3 = 1) = \left(\frac{1}{4}\right)^3.
\end{aligned}$$

Notice that we did not need to recursively calculate 𝐹 ∗3 (3) by recognizing that
each 𝑋 ∈ {1, 2, 3, 4}, so the only way of obtaining 𝑋1 + 𝑋2 + 𝑋3 ≤ 3 is to have
𝑋1 = 𝑋2 = 𝑋3 = 1. Additionally, for 𝑛 ≥ 4, 𝐹 ∗𝑛 (3) = 0 since it is impossible
for the sum of 4 or more 𝑋’s to be less than or equal to 3. For 𝑛 = 0, 𝐹 ∗0 (3) = 1 since
the sum of 0 𝑋’s is 0, which is always less than 3. Laying out the probabilities
systematically,

  𝑥    𝐹 ∗1 (𝑥)    𝐹 ∗2 (𝑥)    𝐹 ∗3 (𝑥)
  1    1/4         0
  2    2/4         (1/4)²
  3    3/4         3/16        (1/4)³

Finally,

$$F_{S_N}(3) = p_0 + F^{*1}(3)\, p_1 + F^{*2}(3)\, p_2 + F^{*3}(3)\, p_3
= \frac{1}{5} + \frac{3}{4}\left(\frac{4}{25}\right) + \frac{3}{16}\left(\frac{16}{125}\right) + \frac{1}{64}\left(\frac{64}{625}\right) = 0.3456.$$
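A Monte Carlo check of this convolution calculation (not in the original text; the seed and sample size are arbitrary choices):

set.seed(2020)
m <- 100000
n <- rnbinom(m, size = 1, prob = 1/5)            # Geo(beta = 4) claim counts
s <- sapply(n, function(k) sum(sample(1:4, k, replace = TRUE)))
mean(s <= 3)                                      # close to 0.3456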

When E(𝑁 ) and Var(𝑁 ) are known, one may also use a type of central limit
theorem to approximate the distribution of 𝑆𝑁 as in the individual risk model.
That is, $(S_N - \mathrm{E}(S_N))/\sqrt{\mathrm{Var}(S_N)}$ approximately follows the standard normal
distribution 𝑁 (0, 1). From this type of central limit theorem, the approximation
works well if E[𝑁 ] is sufficiently large.

Example 5.3.4. Actuarial Exam Question. You are given:

                         Mean      Standard Deviation
  Number of Claims       8         3
  Individual Losses      10,000    3,937

As a benchmark, use the normal approximation to determine the probability
that the aggregate loss will exceed 150% of the expected loss.
Solution. To use the normal approximation, we must first find the mean and
variance of the aggregate loss 𝑆𝑁

$$\begin{aligned}
\mathrm{E}(S_N) &= \mu\, \mathrm{E}(N) = 10{,}000(8) = 80{,}000 \\
\mathrm{Var}(S_N) &= \sigma^2 \mathrm{E}(N) + \mu^2 \mathrm{Var}(N) = 3{,}937^2(8) + 10{,}000^2(3^2) = 1{,}023{,}999{,}752 \\
\sqrt{\mathrm{Var}(S_N)} &= 31{,}999.996 \approx 32{,}000.
\end{aligned}$$

Then under the normal approximation, aggregate loss 𝑆𝑁 is approximately normal
with mean 80,000 and standard deviation 32,000. The probability that 𝑆𝑁
will exceed 150% of the expected aggregate loss is therefore

$$\begin{aligned}
\Pr(S_N > 1.5\,\mathrm{E}(S_N)) &= \Pr\left(\frac{S_N - \mathrm{E}(S_N)}{\sqrt{\mathrm{Var}(S_N)}} > \frac{1.5\,\mathrm{E}(S_N) - \mathrm{E}(S_N)}{\sqrt{\mathrm{Var}(S_N)}}\right) \\
&\approx \Pr\left(Z > \frac{0.5\,\mathrm{E}(S_N)}{\sqrt{\mathrm{Var}(S_N)}}\right), \text{ where } Z \sim N(0,1) \\
&= \Pr\left(Z > \frac{0.5(80{,}000)}{32{,}000}\right) = \Pr(Z > 1.25) = 1 - \Phi(1.25) = 0.1056.
\end{aligned}$$
Example 5.3.5. Actuarial Exam Question. For an individual over 65:


(i) The number of pharmacy claims is a Poisson random variable with mean 25.
(ii) The amount of each pharmacy claim is uniformly distributed between 5 and
95.
(iii) The amounts of the claims and the number of claims are mutually indepen-
dent.
Estimate the probability that aggregate claims for this individual will exceed
2000 using the normal approximation.
Solution. We have claim frequency 𝑁 ∼ 𝑃 𝑜𝑖(𝜆 = 25) and claim severity
𝑋 ∼ 𝑈 (5, 95). To use the normal approximation, we need to find the mean and
variance of the aggregate claims 𝑆𝑁 . Note

$$\begin{aligned}
\mathrm{E}(N) &= 25, \quad \mathrm{Var}(N) = 25 \\
\mathrm{E}(X) &= \frac{5+95}{2} = 50 = \mu, \quad \mathrm{Var}(X) = \frac{(95-5)^2}{12} = 675 = \sigma^2.
\end{aligned}$$

Then for 𝑆𝑁 ,

$$\begin{aligned}
\mathrm{E}(S_N) &= \mu\, \mathrm{E}(N) = 50(25) = 1{,}250 \\
\mathrm{Var}(S_N) &= \sigma^2 \mathrm{E}(N) + \mu^2 \mathrm{Var}(N) = 675(25) + 50^2(25) = 79{,}375.
\end{aligned}$$

Using the normal approximation, 𝑆𝑁 is approximately normal with mean 1,250
and variance 79,375. The probability that 𝑆𝑁 exceeds 2,000 is

$$\begin{aligned}
\Pr(S_N > 2{,}000) &= \Pr\left(\frac{S_N - \mathrm{E}(S_N)}{\sqrt{\mathrm{Var}(S_N)}} > \frac{2{,}000 - \mathrm{E}(S_N)}{\sqrt{\mathrm{Var}(S_N)}}\right) \\
&\approx \Pr\left(Z > \frac{2{,}000 - 1{,}250}{\sqrt{79{,}375}}\right), \text{ where } Z \sim N(0,1) \\
&= \Pr(Z > 2.662) = 1 - \Phi(2.662) = 0.003884.
\end{aligned}$$

5.3.2 Stop-loss Insurance


Recall the coverage modifications on the individual policy level in Section 3.4.
Insurance on the aggregate loss 𝑆𝑁 , subject to a deductible 𝑑, is called net stop-
loss insurance. The expected value of the amount of the aggregate loss in excess
of the deductible,

E[(𝑆 − 𝑑)+ ]

is known as the net stop-loss premium.


To calculate the net stop-loss premium, we have

$$\mathrm{E}(S_N - d)_+ = \begin{cases} \int_d^{\infty} (s-d) f_{S_N}(s)\, ds & \text{for continuous } S_N \\ \sum_{s > d} (s-d) f_{S_N}(s) & \text{for discrete } S_N \end{cases}
= \mathrm{E}(S_N) - \mathrm{E}(S_N \wedge d).$$

Example 5.3.6. Actuarial Exam Question. In a given week, the number


of projects that require you to work overtime has a geometric distribution with
𝛽 = 2. For each project, the distribution of the number of overtime hours in
the week, 𝑋, is as follows:

𝑥 𝑓(𝑥)
5 0.2
10 0.3
20 0.5

The number of projects and the number of overtime hours are independent. You
will get paid for overtime hours in excess of 15 hours in the week. Calculate the
expected number of overtime hours for which you will get paid in the week.

Solution. The number of projects in a week requiring overtime work has distri-
bution 𝑁 ∼ 𝐺𝑒𝑜(𝛽 = 2), while the number of overtime hours worked per project
has distribution 𝑋 as described above. The aggregate number of overtime hours
in a week is 𝑆𝑁 and we are therefore looking for

E(𝑆𝑁 − 15)+ = E(𝑆𝑁 ) − E(𝑆𝑁 ∧ 15).

To find E(𝑆𝑁 ) = E(𝑋) E(𝑁 ), we have

$$\begin{aligned}
\mathrm{E}(X) &= 5(0.2) + 10(0.3) + 20(0.5) = 14 \\
\mathrm{E}(N) &= 2 \\
\Rightarrow \mathrm{E}(S_N) &= \mathrm{E}(X)\, \mathrm{E}(N) = 14(2) = 28.
\end{aligned}$$

To find $\mathrm{E}(S_N \wedge 15) = 0 \Pr(S_N = 0) + 5 \Pr(S_N = 5) + 10 \Pr(S_N = 10) + 15 \Pr(S_N \ge 15)$, we have

$$\begin{aligned}
\Pr(S_N = 0) &= \Pr(N = 0) = \frac{1}{1+\beta} = \frac{1}{3} \\
\Pr(S_N = 5) &= \Pr(X = 5, N = 1) = 0.2\left(\frac{2}{9}\right) = \frac{0.4}{9} \\
\Pr(S_N = 10) &= \Pr(X = 10, N = 1) + \Pr(X_1 = X_2 = 5, N = 2) \\
&= 0.3\left(\frac{2}{9}\right) + (0.2)(0.2)\left(\frac{4}{27}\right) = 0.0726 \\
\Pr(S_N \ge 15) &= 1 - \left(\frac{1}{3} + \frac{0.4}{9} + 0.0726\right) = 0.5496 \\
\Rightarrow \mathrm{E}(S_N \wedge 15) &= 0\left(\frac{1}{3}\right) + 5\left(\frac{0.4}{9}\right) + 10(0.0726) + 15(0.5496) = 9.193.
\end{aligned}$$

Therefore,

$$\mathrm{E}(S_N - 15)_+ = \mathrm{E}(S_N) - \mathrm{E}(S_N \wedge 15) = 28 - 9.193 = 18.807.$$
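For readers who want a numerical check (not part of the original solution), the net stop-loss premium can also be approximated by simulation:

set.seed(2020)
m <- 100000
n <- rnbinom(m, size = 1, prob = 1/3)             # Geo(beta = 2) project counts
s <- sapply(n, function(k) sum(sample(c(5, 10, 20), k, replace = TRUE,
                                      prob = c(0.2, 0.3, 0.5))))
mean(pmax(s - 15, 0))                              # close to 18.807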

Recursive Net Stop-Loss Premium Calculation. For the discrete case,


this can be computed recursively as

E [(𝑆𝑁 − (𝑗 + 1)ℎ)+ ] = E [(𝑆𝑁 − 𝑗ℎ)+ ] − ℎ (1 − 𝐹𝑆𝑁 (𝑗ℎ)) .

This assumes that the support of 𝑆𝑁 is equally spaced over units of ℎ.


To establish this, we assume that ℎ = 1. We have

E [(𝑆𝑁 − (𝑗 + 1))+ ] = E(𝑆𝑁 ) − E[𝑆𝑁 ∧ (𝑗 + 1)] , and


E [(𝑆𝑁 − 𝑗)+ ] = E(𝑆𝑁 ) − E[𝑆𝑁 ∧ 𝑗]

Thus,

$$\mathrm{E}\left[(S_N - (j+1))_+\right] - \mathrm{E}\left[(S_N - j)_+\right] = \left\{\mathrm{E}(S_N) - \mathrm{E}(S_N \wedge (j+1))\right\} - \left\{\mathrm{E}(S_N) - \mathrm{E}(S_N \wedge j)\right\}
= \mathrm{E}(S_N \wedge j) - \mathrm{E}\left[S_N \wedge (j+1)\right].$$

We can write

$$\begin{aligned}
\mathrm{E}\left[S_N \wedge (j+1)\right] &= \sum_{x=0}^{j} x f_{S_N}(x) + (j+1) \Pr(S_N \ge j+1) \\
&= \sum_{x=0}^{j-1} x f_{S_N}(x) + j \Pr(S_N = j) + (j+1) \Pr(S_N \ge j+1).
\end{aligned}$$

Similarly,

$$\mathrm{E}(S_N \wedge j) = \sum_{x=0}^{j-1} x f_{S_N}(x) + j \Pr(S_N \ge j).$$

With these expressions, we have

$$\begin{aligned}
&\mathrm{E}\left[(S_N - (j+1))_+\right] - \mathrm{E}\left[(S_N - j)_+\right] = \mathrm{E}(S_N \wedge j) - \mathrm{E}\left[S_N \wedge (j+1)\right] \\
&= \left\{\sum_{x=0}^{j-1} x f_{S_N}(x) + j \Pr(S_N \ge j)\right\} - \left\{\sum_{x=0}^{j-1} x f_{S_N}(x) + j \Pr(S_N = j) + (j+1) \Pr(S_N \ge j+1)\right\} \\
&= j\left[\Pr(S_N \ge j) - \Pr(S_N = j)\right] - (j+1) \Pr(S_N \ge j+1) \\
&= j \Pr(S_N > j) - (j+1) \Pr(S_N \ge j+1) \quad \text{(note } \Pr(S_N > j) = \Pr(S_N \ge j+1)\text{)} \\
&= -\Pr(S_N \ge j+1) = -\left[1 - F_{S_N}(j)\right],
\end{aligned}$$

as required.

Example 5.3.7. Actuarial Exam Question - Continued. Recall that the
goal of this question was to calculate E(𝑆𝑁 − 15)+ . Note that the support of 𝑆𝑁
is equally spaced over units of 5, so this question can also be done recursively,
using the expression above with steps of ℎ = 5:

• Step 1:
$$\mathrm{E}(S_N - 5)_+ = \mathrm{E}(S_N) - 5\left[1 - \Pr(S_N \le 0)\right] = 28 - 5\left(1 - \frac{1}{3}\right) = \frac{74}{3} = 24.6667.$$

• Step 2:
$$\mathrm{E}(S_N - 10)_+ = \mathrm{E}(S_N - 5)_+ - 5\left[1 - \Pr(S_N \le 5)\right] = \frac{74}{3} - 5\left(1 - \frac{1}{3} - \frac{0.4}{9}\right) = 21.555.$$

• Step 3:
$$\mathrm{E}(S_N - 15)_+ = \mathrm{E}(S_N - 10)_+ - 5\left[1 - \Pr(S_N \le 10)\right] = \mathrm{E}(S_N - 10)_+ - 5\Pr(S_N \ge 15) = 21.555 - 5(0.5496) = 18.807.$$

5.3.3 Analytic Results


There are a few combinations of claim frequency and severity distributions that
result in an easy-to-compute distribution for aggregate losses. This section
provides some simple examples. Although these examples are computationally
convenient, they are generally too simple to be used in practice.

Example 5.3.8. One has a closed-form expression for the aggregate loss dis-
tribution by assuming a geometric frequency distribution and an exponential
severity distribution.
Assume that claim count 𝑁 is geometric with mean E(𝑁 ) = 𝛽, and that claim
amount 𝑋 is exponential with E(𝑋) = 𝜃. Recall that the pgf of 𝑁 and the mgf
of 𝑋 are:
$$P_N(z) = \frac{1}{1 - \beta(z-1)}, \qquad M_X(t) = \frac{1}{1 - \theta t}.$$

Thus, the mgf of aggregate loss 𝑆𝑁 can be expressed two ways (for details, see
Technical Supplement 5.A.3):

$$\begin{aligned}
M_{S_N}(t) = P_N[M_X(t)] &= \frac{1}{1 - \beta\left(\frac{1}{1-\theta t} - 1\right)} \\
&= 1 + \frac{\beta}{1+\beta}\left(\left[1 - \theta(1+\beta)t\right]^{-1} - 1\right) \qquad (5.1) \\
&= \frac{1}{1+\beta}(1) + \frac{\beta}{1+\beta}\left(\frac{1}{1 - \theta(1+\beta)t}\right). \qquad (5.2)
\end{aligned}$$

From (5.1), we note that 𝑆𝑁 is equivalent to the compound distribution of
$S_N = X_1^* + \cdots + X_{N^*}^*$, where $N^*$ is a Bernoulli with mean 𝛽/(1 + 𝛽) and 𝑋 ∗ is
an exponential with mean 𝜃(1 + 𝛽). To see this, we examine the mgf of 𝑆𝑁 :

$$M_{S_N}(t) = P_N[M_X(t)] = P_{N^*}[M_{X^*}(t)],$$

where

$$P_{N^*}(z) = 1 + \frac{\beta}{1+\beta}(z - 1), \qquad M_{X^*}(t) = \frac{1}{1 - \theta(1+\beta)t}.$$

From (5.2), we note that 𝑆𝑁 is also equivalent to a two-point mixture of 0 and
𝑋 ∗ . Specifically,

$$S_N = \begin{cases} 0 & \text{with probability } \Pr(N^* = 0) = 1/(1+\beta) \\ X^* & \text{with probability } \Pr(N^* = 1) = \beta/(1+\beta). \end{cases}$$

The distribution function of 𝑆𝑁 is:

$$\Pr(S_N = 0) = \frac{1}{1+\beta}, \qquad
\Pr(S_N > s) = \frac{\beta}{1+\beta}\Pr(X^* > s) = \frac{\beta}{1+\beta}\exp\left(-\frac{s}{\theta(1+\beta)}\right),$$

with pdf for 𝑠 > 0,

$$f_{S_N}(s) = \frac{\beta}{\theta(1+\beta)^2}\exp\left(-\frac{s}{\theta(1+\beta)}\right).$$
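The closed-form survival function can be compared with a simulation, as in the hedged sketch below (not from the text; the values of 𝛽, 𝜃, and the evaluation point are arbitrary):

set.seed(2020)
beta <- 2; theta <- 100; m <- 100000
n <- rnbinom(m, size = 1, prob = 1/(1 + beta))              # geometric frequency
s <- sapply(n, function(k) sum(rexp(k, rate = 1/theta)))     # exponential severities
s0 <- 250
mean(s > s0)                                                 # simulated Pr(S_N > s0)
beta/(1 + beta) * exp(-s0/(theta * (1 + beta)))              # closed form, about 0.29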

Example 5.3.9. Consider a collective risk model with an exponential severity


and an arbitrary frequency distribution. Recall that if 𝑋𝑖 ∼ 𝐸𝑥𝑝(𝜃), then the
sum of iid exponential random variables, 𝑆𝑛 = 𝑋1 + ⋯ + 𝑋𝑛 , has a gamma
distribution, i.e. 𝑆𝑛 ∼ 𝐺𝑎𝑚(𝑛, 𝜃). This has cdf:

$$F_X^{*n}(s) = \Pr(S_n \le s) = \int_0^s \frac{1}{\Gamma(n)\theta^n}\, x^{n-1} \exp\left(-\frac{x}{\theta}\right) dx
= 1 - \sum_{j=0}^{n-1} \frac{1}{j!}\left(\frac{s}{\theta}\right)^j e^{-s/\theta}.$$

The last equality is derived by applying integration by parts 𝑛 − 1 times.
For the aggregate loss distribution, we can interchange the order of summations
in the second line below to get

$$\begin{aligned}
F_S(s) &= p_0 + \sum_{n=1}^{\infty} p_n F_X^{*n}(s) \\
&= 1 - \sum_{n=1}^{\infty} p_n \sum_{j=0}^{n-1} \frac{1}{j!}\left(\frac{s}{\theta}\right)^j e^{-s/\theta} \\
&= 1 - e^{-s/\theta} \sum_{j=0}^{\infty} \frac{1}{j!}\left(\frac{s}{\theta}\right)^j P_j,
\end{aligned}$$

where $P_j = p_{j+1} + p_{j+2} + \cdots = \Pr(N > j)$ is the “survival function” of the claims
count distribution.

5.3.4 Tweedie Distribution


In this section, we examine a particular compound distribution where the num-
ber of claims has a Poisson distribution and the amount of claims has a gamma
distribution. This specification leads to what is known as a Tweedie distribu-
tion. The Tweedie distribution has a mass probability at zero and a continuous
component for positive values. Because of this feature, it is widely used in in-
surance claims modeling, where the zero mass is interpreted as no claims and
the positive component as the amount of claims.
Specifically, consider the collective risk model 𝑆𝑁 = 𝑋1 +⋯+𝑋𝑁 . Suppose that
𝑁 has a Poisson distribution with mean 𝜆, and each 𝑋𝑖 has a gamma distribution
with shape parameter 𝛼 and scale parameter 𝛾. The Tweedie distribution is
derived as the Poisson sum of gamma variables. To understand the distribution
of 𝑆𝑁 , we first examine the mass probability at zero. The aggregate loss is zero
when no claims occurred, i.e.

Pr(𝑆𝑁 = 0) = Pr(𝑁 = 0) = 𝑒−𝜆 .

In addition, note that 𝑆𝑁 conditional on 𝑁 = 𝑛, denoted by 𝑆𝑛 = 𝑋1 + ⋯ + 𝑋𝑛 ,
follows a gamma distribution with shape 𝑛𝛼 and scale 𝛾. Thus, for 𝑠 > 0, the
density of a Tweedie distribution can be calculated as

$$f_{S_N}(s) = \sum_{n=1}^{\infty} p_n f_{S_n}(s)
= \sum_{n=1}^{\infty} e^{-\lambda} \frac{\lambda^n}{n!}\, \frac{\gamma^{n\alpha}}{\Gamma(n\alpha)}\, s^{n\alpha - 1} e^{-s\gamma}.$$

Thus, the Tweedie distribution can be thought of as a mixture of zero and a
positive valued distribution, which makes it a convenient tool for modeling insurance
claims and for calculating pure premiums. The mean and variance of the
Tweedie compound Poisson model are:

$$\mathrm{E}(S_N) = \lambda\frac{\alpha}{\gamma} \quad \text{and} \quad \mathrm{Var}(S_N) = \lambda\frac{\alpha(1+\alpha)}{\gamma^2}.$$

As another important feature, the Tweedie distribution is a special case of
exponential dispersion models, a class of models used to describe the random
component in generalized linear models. To see this, we consider the following
reparameterization:

$$\lambda = \frac{\mu^{2-p}}{\phi(2-p)}, \qquad \alpha = \frac{2-p}{p-1}, \qquad \frac{1}{\gamma} = \phi(p-1)\mu^{p-1}.$$

With the above relationships, one can show that the distribution of 𝑆𝑁 is

$$f_{S_N}(s) = \exp\left[\frac{1}{\phi}\left(\frac{-s}{(p-1)\mu^{p-1}} - \frac{\mu^{2-p}}{2-p}\right) + C(s; \phi)\right]$$
where

$$C(s; \phi) = \begin{cases} 0 & \text{if } s = 0 \\ \log \displaystyle\sum_{n \ge 1} \left\{\dfrac{(1/\phi)^{1/(p-1)}\, s^{(2-p)/(p-1)}}{(2-p)(p-1)^{(2-p)/(p-1)}}\right\}^n \dfrac{1}{n!\, \Gamma[n(2-p)/(p-1)]\, s} & \text{if } s > 0. \end{cases}$$

Hence, the distribution of 𝑆𝑁 belongs to the exponential family with parameters


𝜇, 𝜙, and 1 < 𝑝 < 2, and we have

E(𝑆𝑁 ) = 𝜇 and Var(𝑆𝑁 ) = 𝜙𝜇𝑝 .

This allows us to use the Tweedie distribution with generalized linear models to
model claims. It is also worth mentioning the two limiting cases of the Tweedie
model: 𝑝 → 1 results in the Poisson distribution and 𝑝 → 2 results in the gamma
distribution. Thus, the Tweedie model accommodates the situations in between
the gamma and Poisson distributions, which makes intuitive sense as it is the
Poisson sum of gamma random variables.
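A simulation sketch (not in the original text) of the Tweedie model as a Poisson sum of gammas, checking the moment formulas above; here 𝛾 is treated as the rate parameter of the gamma severity, consistent with the density and moment formulas, and the parameter values are arbitrary:

set.seed(2020)
lambda <- 2; alpha <- 3; gamma <- 0.1
n <- rpois(100000, lambda)
s <- sapply(n, function(k) sum(rgamma(k, shape = alpha, rate = gamma)))
mean(s); lambda * alpha / gamma                     # both near 60
var(s);  lambda * alpha * (1 + alpha) / gamma^2     # both near 2,400
mean(s == 0); exp(-lambda)                          # mass at zero, both near 0.135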

5.4 Computing the Aggregate Claims Distribu-


tion
Computing the distribution of aggregate losses is a difficult, yet important, problem.
As we have seen, for both the individual risk model and the collective risk model,
computing the distribution frequently involves the evaluation of an 𝑛-fold convolution.
To make the problem tractable, one strategy is to use a distribution
that is easy to evaluate to approximate the aggregate loss distribution. For
instance, the normal distribution is a natural choice based on the central limit theorem,
where the parameters of the normal distribution can be estimated by matching
moments. This approach has its strengths and limitations. The main advantage
is the ease of computation. The disadvantages are: first, the size and direction
of the approximation error are unknown; second, the approximation may fail to
capture some special features of the aggregate loss, such as a mass point at zero.
This section discusses two practical approaches to computing the distribution
of aggregate losses: the recursive method and simulation.

5.4.1 Recursive Method


The recursive method applies to compound models where the frequency com-
ponent 𝑁 belongs to either (𝑎, 𝑏, 0) or (𝑎, 𝑏, 1) class (see Sections 2.3 and 2.5.1)
and the severity component 𝑋 has a discrete distribution. For continuous 𝑋, a
common practice is to first discretize the severity distribution, after which the
recursive method is ready to apply.

Assume that 𝑁 is in the (𝑎, 𝑏, 1) class so that $p_k = \left(a + \frac{b}{k}\right) p_{k-1}$, 𝑘 = 2, 3, ….
Further assume that the support of 𝑋 is {0, 1, … , 𝑚}, discrete and finite. Then,
the probability function of 𝑆𝑁 is:

$$f_{S_N}(s) = \Pr(S_N = s)
= \frac{1}{1 - a f_X(0)}\left\{\left[p_1 - (a+b)p_0\right] f_X(s) + \sum_{x=1}^{s \wedge m}\left(a + \frac{bx}{s}\right) f_X(x) f_{S_N}(s-x)\right\}.$$

If 𝑁 is in the (𝑎, 𝑏, 0) class, then 𝑝1 = (𝑎 + 𝑏)𝑝0 and so

$$f_{S_N}(s) = \frac{1}{1 - a f_X(0)}\left\{\sum_{x=1}^{s \wedge m}\left(a + \frac{bx}{s}\right) f_X(x) f_{S_N}(s-x)\right\}.$$

Special Case: Poisson Frequency. If 𝑁 ∼ 𝑃 𝑜𝑖(𝜆), then 𝑎 = 0 and 𝑏 = 𝜆,
and thus

$$f_{S_N}(s) = \frac{\lambda}{s}\left\{\sum_{x=1}^{s \wedge m} x f_X(x) f_{S_N}(s-x)\right\}.$$

Example 5.4.1. Actuarial Exam Question. The number of claims in a


period 𝑁 has a geometric distribution with mean 4. The amount of each claim
𝑋 follows Pr(𝑋 = 𝑥) = 0.25, for 𝑥 = 1, 2, 3, 4. The number of claims and the
claim amount are independent. 𝑆𝑁 is the aggregate claim amount in the period.
Calculate 𝐹𝑆𝑁 (3).

Solution. The severity distribution 𝑋 follows

$$f_X(x) = \frac{1}{4}, \quad x = 1, 2, 3, 4.$$

The frequency distribution 𝑁 is geometric with mean 4, which is a member of
the (𝑎, 𝑏, 0) class with $b = 0$, $a = \frac{\beta}{1+\beta} = \frac{4}{5}$, and $p_0 = \frac{1}{1+\beta} = \frac{1}{5}$. The support of
severity component 𝑋 is {1, … , 𝑚 = 4}, discrete and finite. Thus, we can use
the recursive method

$$f_{S_N}(x) = \frac{1}{1 - a f_X(0)}\sum_{y=1}^{x \wedge m}\left(a + 0\right) f_X(y) f_{S_N}(x-y)
= \frac{4}{5}\sum_{y=1}^{x \wedge m} f_X(y) f_{S_N}(x-y).$$

Specifically, we have

$$\begin{aligned}
f_{S_N}(0) &= \Pr(N = 0) = p_0 = \frac{1}{5} \\
f_{S_N}(1) &= \frac{4}{5}\sum_{y=1}^{1} f_X(y) f_{S_N}(1-y) = \frac{4}{5} f_X(1) f_{S_N}(0) = \frac{4}{5}\left(\frac{1}{4}\right)\left(\frac{1}{5}\right) = \frac{1}{25} \\
f_{S_N}(2) &= \frac{4}{5}\sum_{y=1}^{2} f_X(y) f_{S_N}(2-y) = \frac{4}{5}\left[f_X(1) f_{S_N}(1) + f_X(2) f_{S_N}(0)\right] \\
&= \frac{4}{5}\left[\frac{1}{4}\left(\frac{1}{25} + \frac{1}{5}\right)\right] = \frac{4}{5}\left(\frac{6}{100}\right) = \frac{6}{125} \\
f_{S_N}(3) &= \frac{4}{5}\left[f_X(1) f_{S_N}(2) + f_X(2) f_{S_N}(1) + f_X(3) f_{S_N}(0)\right] \\
&= \frac{4}{5}\left[\frac{1}{4}\left(\frac{6}{125} + \frac{1}{25} + \frac{1}{5}\right)\right] = \frac{1}{5}\left(\frac{6 + 5 + 25}{125}\right) = 0.0576 \\
\Rightarrow F_{S_N}(3) &= f_{S_N}(0) + f_{S_N}(1) + f_{S_N}(2) + f_{S_N}(3) = 0.3456.
\end{aligned}$$
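A minimal R implementation of this recursion (not from the text; the function name panjer_ab0 is ours, and the starting value assumes 𝑓𝑋(0) = 0 so that 𝑓𝑆𝑁(0) = 𝑝0) reproduces these values:

panjer_ab0 <- function(a, b, p0, fx, smax) {
  # fx[x + 1] = Pr(X = x) on the support {0, 1, ..., m}
  fs <- numeric(smax + 1)
  fs[1] <- p0                                  # f_S(0) = Pr(N = 0) when f_X(0) = 0
  for (s in 1:smax) {
    x <- 1:min(s, length(fx) - 1)
    fs[s + 1] <- sum((a + b * x / s) * fx[x + 1] * fs[s - x + 1]) / (1 - a * fx[1])
  }
  fs
}

fx <- c(0, rep(1/4, 4))                        # Pr(X = 0) = 0, Pr(X = x) = 1/4 for x = 1, ..., 4
fs <- panjer_ab0(a = 4/5, b = 0, p0 = 1/5, fx = fx, smax = 3)
fs                                              # 0.2000 0.0400 0.0480 0.0576
sum(fs)                                         # F_{S_N}(3) = 0.3456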

5.4.2 Simulation
The distribution of aggregate loss can be evaluated using Monte Carlo simula-
tion. You can get a broad introduction to simulation procedures in Chapter 6.
For aggregate losses, the idea is that one can calculate the empirical distribution
of 𝑆𝑁 using a random sample. The expected value and variance of the aggregate
loss can also be estimated using the sample mean and sample variance of the
simulated values.
We now summarize simulation procedures for aggregate loss models. Let 𝑚 be
the size of the generated random sample of aggregate losses.
1. Individual Risk Model: 𝑆𝑛 = 𝑋1 + ⋯ + 𝑋𝑛
• Let 𝑗 = 1, … , 𝑚 be a counter. Start by setting 𝑗 = 1.
• Generate each individual loss realization 𝑥𝑖𝑗 for 𝑖 = 1, … , 𝑛. For
example, this can be done using the inverse transformation method
(Section 6.2).
• Calculate the aggregate loss 𝑠𝑗 = 𝑥1𝑗 + ⋯ + 𝑥𝑛𝑗 .
• Repeat the above two steps for 𝑗 = 2, … , 𝑚 to obtain a size-𝑚 sample
of 𝑆𝑛 , i.e. {𝑠1 , … , 𝑠𝑚 }.
2. Collective Risk Model: 𝑆𝑁 = 𝑋1 + ⋯ + 𝑋𝑁
• Let 𝑗 = 1, … , 𝑚 be a counter. Start by setting 𝑗 = 1.
• Generate the number of claims 𝑛𝑗 from the frequency distribution 𝑁 .

• Given 𝑛𝑗 , generate the amount of each claim independently from


severity distribution 𝑋, denoted by 𝑥1𝑗 , … , 𝑥𝑛𝑗 𝑗 .
• Calculate the aggregate loss 𝑠𝑗 = 𝑥1𝑗 + ⋯ + 𝑥𝑛𝑗 𝑗 .
• Repeat the above three steps for 𝑗 = 2, … , 𝑚 to obtain a size-𝑚
sample of 𝑆𝑁 , i.e. {𝑠1 , … , 𝑠𝑚 }.
Given the random sample of 𝑆, the empirical distribution can be calculated as

$$\hat{F}_S(s) = \frac{1}{m}\sum_{i=1}^{m} I(s_i \le s),$$

where 𝐼(⋅) is an indicator function. The empirical distribution $\hat{F}_S(s)$ will converge
to 𝐹𝑆 (𝑠) almost surely as the sample size 𝑚 → ∞.
The above procedure assumes that the probability distributions, including the
parameter values, of the frequency and severity distributions are known. In
practice, one would need to first assume these distributions, estimate their pa-
rameters from data, and then assess the quality of model fit using various model
validation tools (see Chapter 4). For instance, the assumptions in the collective
risk model suggest a two-stage estimation where one model is developed for the
number of claims 𝑁 from data on claim counts, and another model is developed
for the severity of claims 𝑋 from data on the amount of claims.

Example 5.4.2. Recall Example 5.3.5 with an individual’s claim frequency 𝑁


has a Poisson distribution with mean 𝜆 = 25 and claim severity 𝑋 is uniformly
distributed on the interval (5, 95). Using a simulated sample of 10,000 obser-
vations, estimate the mean and variance of the aggregate loss 𝑆𝑁 . In addition,
use the simulated sample to estimate the probability that aggregate claims for
this individual will exceed 2,000 and compare with the normal approximation
estimates from Example 5.3.5.
Solution. We follow the algorithm for the collective risk model, where we
first simulate frequencies 𝑛1 , … , 𝑛10000 , and conditional on 𝑛𝑗 , 𝑗 = 1, … , 10000,
simulate each individual loss 𝑥𝑖𝑗 , 𝑖 = 1, … 𝑛𝑗 .
set.seed(4321)        # For reproducibility of results
m <- 10000            # Number of observations to simulate
lambda <- 25          # Parameter for frequency distribution N
a <- 5; b <- 95       # Parameters for severity distribution X
S <- rep(NA, m)       # Initialize an empty vector to store S observations

n <- rpois(m, lambda) # Generate m=10000 observations of N from Poisson

for(j in 1:m){
  n_j <- n[j]         # Given each n_j (j=1,...,m), generate n_j observations of X from uniform
  x_j <- runif(n_j, min=a, max=b)
  s_j <- sum(x_j)     # Calculate the aggregate loss s_j
  S[j] <- s_j         # Store s_j in the vector of observations
}
mean(S)      # Compare to theoretical value of 1,250

[1] 1248.09

var(S)       # Compare to theoretical value of 79,375

[1] 77441.22

mean(S>2000) # Proportion of simulated observations s_j that are > 2000

[1] 0.0062

# Compare to normal approximation method of 0.003884

Using simulation, we estimate the mean and variance of the aggregate claims
to be approximately 1248 and 77441 respectively, compared to the theoretical
values of 1,250 and 79,375. In addition, we estimate the probability that aggre-
gate losses exceed 2000 to be 0.0062, compared to the normal approximation
estimate of 0.003884.

We can assess the appropriateness of the normal approximation by comparing


the empirical distribution of the simulated aggregate losses to the density of the
normal distribution used for the normal approximation, 𝑁 (𝜇 = 1, 250 , 𝜎2 =
79, 375):

[Figure: Distribution of Simulated Aggregate Losses — histogram of the simulated values of S with the approximating normal density overlaid; horizontal axis: Aggregate Loss S; vertical axis: Density.]

The simulated losses are slightly more right-skewed than the normal distribution,
with a longer right tail. This explains why the normal approximation estimate
of Pr(𝑆𝑁 > 2000) is lower than the simulated estimate.

5.5 Effects of Coverage Modifications


5.5.1 Impact of Exposure on Frequency
This section focuses on an individual risk model for claim counts. Recall the
individual risk model involves a fixed 𝑛 number of contracts and independent
loss random variables 𝑋𝑖 . Consider the number of claims from a group of 𝑛
policies:
𝑆 = 𝑋1 + ⋯ + 𝑋 𝑛 ,
where we assume 𝑋𝑖 are iid representing the number of claims from policy 𝑖. In
this case, the exposure for the portfolio is 𝑛, using policy as exposure base. In
Section 7.4.1 we will introduce other exposure bases. The pgf of 𝑆 is

$$P_S(z) = \mathrm{E}(z^S) = \mathrm{E}\left(z^{\sum_{i=1}^n X_i}\right) = \prod_{i=1}^n \mathrm{E}(z^{X_i}) = [P_X(z)]^n.$$

Special Case: Poisson. If 𝑋𝑖 ∼ 𝑃 𝑜𝑖(𝜆), its pgf is 𝑃𝑋 (𝑧) = 𝑒𝜆(𝑧−1) . Then the
pgf of 𝑆 is
𝑃𝑆 (𝑧) = [𝑒𝜆(𝑧−1) ]𝑛 = 𝑒𝑛𝜆(𝑧−1) .
So 𝑆 ∼ 𝑃 𝑜𝑖(𝑛𝜆). That is, the sum of 𝑛 independent Poisson random variables
each with mean 𝜆 has a Poisson distribution with mean 𝑛𝜆.

Special Case: Negative Binomial. If 𝑋𝑖 ∼ 𝑁 𝐵(𝛽, 𝑟), its pgf is 𝑃𝑋 (𝑧) =


[1 − 𝛽(𝑧 − 1)]−𝑟 . Then the pgf of 𝑆 is

𝑃𝑆 (𝑧) = [[1 − 𝛽(𝑧 − 1)]−𝑟 ]𝑛 = [1 − 𝛽(𝑧 − 1)]−𝑛𝑟 .

So 𝑆 ∼ 𝑁 𝐵(𝛽, 𝑛𝑟).

Example 5.5.1. Assume that the number of claims for each vehicle is Poisson
with mean 𝜆. Given the following data on the observed number of claims for
each household, calculate the MLE of 𝜆.

Household ID Number of vehicles Number of claims


1 2 0
2 1 2
3 3 2
4 1 0
5 1 1

Solution. Each of the 5 households has number of exposures 𝑛𝑗 (number of
vehicles) and number of claims 𝑆𝑗 , 𝑗 = 1, ..., 5. Note for each household, the
number of claims 𝑆𝑗 ∼ 𝑃 𝑜𝑖(𝑛𝑗 𝜆). The likelihood function is

$$\begin{aligned}
L(\lambda) &= \prod_{j=1}^{5} \Pr(S_j = s_j) = \prod_{j=1}^{5} \frac{e^{-n_j\lambda}(n_j\lambda)^{s_j}}{s_j!} \\
&= \left(\frac{e^{-2\lambda}(2\lambda)^0}{0!}\right)\left(\frac{e^{-1\lambda}(1\lambda)^2}{2!}\right)\left(\frac{e^{-3\lambda}(3\lambda)^2}{2!}\right)\left(\frac{e^{-1\lambda}(1\lambda)^0}{0!}\right)\left(\frac{e^{-1\lambda}(1\lambda)^1}{1!}\right) \\
&\propto e^{-8\lambda}\lambda^5.
\end{aligned}$$

Taking the logarithm, we have

$$l(\lambda) = \log L(\lambda) = -8\lambda + 5\log(\lambda).$$

Setting the first derivative of the log-likelihood to 0, we get $\hat{\lambda} = \frac{5}{8}$.
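A quick numerical confirmation (not from the text) maximizes the Poisson log-likelihood directly:

vehicles <- c(2, 1, 3, 1, 1)
claims   <- c(0, 2, 2, 0, 1)
negloglik <- function(lambda) -sum(dpois(claims, lambda * vehicles, log = TRUE))
optimize(negloglik, interval = c(0.01, 5))$minimum   # about 0.625 = 5/8
sum(claims) / sum(vehicles)                          # closed-form MLE: 0.625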

If the exposure of the portfolio changes from 𝑛1 to 𝑛2 , we can establish the
following relation between the aggregate claim counts:

$$P_{S_{n_2}}(z) = [P_X(z)]^{n_2} = \left[P_X(z)^{n_1}\right]^{n_2/n_1} = P_{S_{n_1}}(z)^{n_2/n_1}.$$

5.5.2 Impact of Deductibles on Claim Frequency


This section examines the effect of deductibles on claim frequency. Intuitively,
there will be fewer claims filed when a policy deductible is imposed because a
loss below the deductible level may not result in a claim. Even if an insured
does file a claim, this may not result in a payment by the policy, since the claim
may be denied or the loss amount may ultimately be determined to be below
deductible. Let 𝑁 𝐿 denote the number of losses (i.e. the number of claims with
no deductible), and 𝑁 𝑃 denote the number of payments when a deductible 𝑑 is
imposed. Our goal is to identify the distribution of 𝑁 𝑃 given the distribution
of 𝑁 𝐿 . We show below that the relationship between 𝑁 𝐿 and 𝑁 𝑃 can be
established within an aggregate risk model framework.
Note that sometimes changes in deductibles will affect policyholder claim be-
havior. We assume that this is not the case, i.e. the underlying distributions of
losses for both frequency and severity remain unchanged when the deductible
changes.
Given there are 𝑁 𝐿 losses, let 𝑋1 , 𝑋2 … , 𝑋𝑁 𝐿 be the associated amount of losses.
For 𝑗 = 1, … , 𝑁 𝐿 , define

$$I_j = \begin{cases} 1 & \text{if } X_j > d \\ 0 & \text{otherwise.} \end{cases}$$

Then we establish
𝑁 𝑃 = 𝐼 1 + 𝐼2 + ⋯ + 𝐼 𝑁 𝐿 ,

that is, the total number of payments is equal to the number of losses above the
deductible level. Given that 𝐼𝑗 ’s are independent Bernoulli random variables
with probability of success 𝑣 = Pr(𝑋 > 𝑑), the sum of a fixed number of such
variables is then a binomial random variable. Thus, conditioning on 𝑁 𝐿 , 𝑁 𝑃
has a binomial distribution, i.e. 𝑁 𝑃 |𝑁 𝐿 ∼ 𝐵𝑖𝑛(𝑁 𝐿 , 𝑣), where 𝑣 = Pr(𝑋 > 𝑑).
This implies that

$$\mathrm{E}\left(z^{N^P} \mid N^L\right) = \left[1 + v(z-1)\right]^{N^L}.$$

So the pgf of 𝑁 𝑃 is

$$\begin{aligned}
P_{N^P}(z) = \mathrm{E}_{N^P}\left(z^{N^P}\right) &= \mathrm{E}_{N^L}\left[\mathrm{E}_{N^P}\left(z^{N^P} \mid N^L\right)\right] \\
&= \mathrm{E}_{N^L}\left[\left(1 + v(z-1)\right)^{N^L}\right] = P_{N^L}\left(1 + v(z-1)\right).
\end{aligned}$$

Thus, we can write the pgf of 𝑁 𝑃 as the pgf of 𝑁 𝐿 , evaluated at a new argument
𝑧 ∗ = 1 + 𝑣(𝑧 − 1). That is, 𝑃𝑁 𝑃 (𝑧) = 𝑃𝑁 𝐿 (𝑧 ∗ ).
Special Cases:
• 𝑁 𝐿 ∼ 𝑃 𝑜𝑖(𝜆). The pgf of 𝑁 𝐿 is 𝑃𝑁 𝐿 = 𝑒𝜆(𝑧−1) . Thus the pgf of 𝑁 𝑃 is

𝑃𝑁 𝑃 (𝑧) = 𝑒𝜆(1+𝑣(𝑧−1)−1)
= 𝑒𝜆𝑣(𝑧−1) ,

So 𝑁 𝑃 ∼ 𝑃 𝑜𝑖(𝜆𝑣). This means the number of payments has the same


distribution as the number of losses, but with the expected number of
payments equal to 𝜆𝑣 = 𝜆 Pr(𝑋 > 𝑑).
−𝑟
• 𝑁 𝐿 ∼ 𝑁 𝐵(𝛽, 𝑟). The pgf of 𝑁 𝐿 is 𝑃𝑁 𝐿 (𝑧) = [1 − 𝛽 (𝑧 − 1)] . Thus the
pgf of 𝑁 𝑃 is
−𝑟
𝑃𝑁 𝑃 (𝑧) = (1 − 𝛽(1 + 𝑣(𝑧 − 1) − 1))
−𝑟
= (1 − 𝛽𝑣(𝑧 − 1)) ,

So 𝑁 𝑃 ∼ 𝑁 𝐵(𝛽𝑣, 𝑟). This means the number of payments has the same
distribution as the number of losses, but with parameters 𝛽𝑣 and 𝑟.

Example 5.5.2. Suppose that loss amounts 𝑋𝑖 ∼ 𝑃 𝑎𝑟𝑒𝑡𝑜(𝛼 = 4, 𝜃 = 150).


You are given that the loss frequency is 𝑁 𝐿 ∼ 𝑃 𝑜𝑖(𝜆) and the payment frequency
distribution is 𝑁1𝑃 ∼ 𝑃 𝑜𝑖(0.4) at deductible level 𝑑1 = 30. Find the distribution
of the payment frequency 𝑁2𝑃 when the deductible level is 𝑑2 = 100.
Solution. Because the loss frequency 𝑁 𝐿 is Poisson, we can relate the means
of the loss distribution 𝑁 𝐿 and the first payment distribution 𝑁1𝑃 (under
deductible 𝑑1 = 30) through 0.4 = 𝜆𝑣1 , where

$$v_1 = \Pr(X > 30) = \left(\frac{150}{30 + 150}\right)^4 = \left(\frac{5}{6}\right)^4
\quad\Rightarrow\quad \lambda = 0.4\left(\frac{6}{5}\right)^4.$$

With this, we can assess the second payment distribution 𝑁2𝑃 (under deductible
𝑑2 = 100) as being Poisson with mean 𝜆2 = 𝜆𝑣2 , where

$$v_2 = \Pr(X > 100) = \left(\frac{150}{100 + 150}\right)^4 = \left(\frac{3}{5}\right)^4
\quad\Rightarrow\quad \lambda_2 = \lambda v_2 = 0.4\left(\frac{6}{5}\right)^4\left(\frac{3}{5}\right)^4 = 0.1075.$$
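The same calculation in R (not from the text), using the Pareto survival function $\Pr(X > d) = (\theta/(d+\theta))^{\alpha}$:

alpha <- 4; theta <- 150
v1 <- (theta / (30 + theta))^alpha       # Pr(X > 30)
v2 <- (theta / (100 + theta))^alpha      # Pr(X > 100)
lambda  <- 0.4 / v1                      # loss frequency mean
lambda2 <- lambda * v2                   # payment frequency mean at d = 100
lambda2                                  # about 0.1075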

Example 5.5.3. Follow-Up. Now suppose instead that the loss frequency
is 𝑁 𝐿 ∼ 𝑁 𝐵(𝛽, 𝑟) and for deductible 𝑑1 = 30, the payment frequency 𝑁1𝑃 is
negative binomial with mean 0.4. Find the mean of the payment frequency 𝑁2𝑃
for deductible 𝑑2 = 100.
Solution. Because the loss frequency 𝑁 𝐿 is negative binomial, we can relate the
parameter 𝛽 of the 𝑁 𝐿 distribution and the parameter 𝛽1 of the first payment
distribution 𝑁1𝑃 using 𝛽1 = 𝛽𝑣1 , where

$$v_1 = \Pr(X > 30) = \left(\frac{5}{6}\right)^4.$$

Thus, the mean of 𝑁1𝑃 and the mean of 𝑁 𝐿 are related via

$$0.4 = r\beta_1 = r(\beta v_1) \quad\Rightarrow\quad r\beta = \frac{0.4}{v_1} = 0.4\left(\frac{6}{5}\right)^4.$$

Note that $v_2 = \Pr(X > 100) = \left(\frac{3}{5}\right)^4$ as in the original example. Then the second
payment frequency distribution under deductible 𝑑2 = 100 is 𝑁2𝑃 ∼ 𝑁 𝐵(𝛽𝑣2 , 𝑟)
with mean

$$r(\beta v_2) = (r\beta)v_2 = 0.4\left(\frac{6}{5}\right)^4\left(\frac{3}{5}\right)^4 = 0.1075.$$

Next, we examine the more general case where 𝑁 𝐿 is a zero-modified distri-


bution. Recall that a zero-modified distribution can be defined in terms of an
unmodified one (as was shown in Section 2.5.1). That is,

1 − 𝑝0𝑀
𝑝𝑘𝑀 = 𝑐 𝑝𝑘0 , for 𝑘 = 1, 2, 3, … , with 𝑐 = ,
1 − 𝑝00

where 𝑝𝑘0 is the pmf of the unmodified distribution. In the case that 𝑝0𝑀 = 0,
we call this a zero-truncated distribution, or 𝑍𝑇 . For other arbitrary values
of 𝑝0𝑀 , this is a zero-modified, or 𝑍𝑀 , distribution. The pgf for the modified
distribution is shown as

𝑃 𝑀 (𝑧) = 1 − 𝑐 + 𝑐 𝑃 0 (𝑧),

expressed in terms of the pgf of the unmodified distribution, 𝑃 0 (𝑧). When 𝑁 𝐿


follows a zero-modified distribution, the distribution of 𝑁 𝑃 is established using
the same relation from earlier, 𝑃𝑁 𝑃 (𝑧) = 𝑃𝑁 𝐿 (1 + 𝑣(𝑧 − 1)).
Special Cases:
• 𝑁 𝐿 is a ZM-Poisson random variable with parameters 𝜆 and 𝑝0𝑀 . The pgf
of 𝑁 𝐿 is
$$P_{N^L}(z) = 1 - \frac{1 - p_0^M}{1 - e^{-\lambda}} + \frac{1 - p_0^M}{1 - e^{-\lambda}}\left(e^{\lambda(z-1)}\right).$$
Thus the pgf of 𝑁 𝑃 is
$$P_{N^P}(z) = 1 - \frac{1 - p_0^M}{1 - e^{-\lambda}} + \frac{1 - p_0^M}{1 - e^{-\lambda}}\left(e^{\lambda v(z-1)}\right).$$
So the number of payments is also a ZM-Poisson distribution with parameters
𝜆𝑣 and 𝑝0𝑀 . The probability at zero can be evaluated using Pr(𝑁 𝑃 = 0) = 𝑃𝑁 𝑃 (0).
• 𝑁 𝐿 is a ZM-negative binomial random variable with parameters 𝛽, 𝑟, and
𝑝0𝑀 . The pgf of 𝑁 𝐿 is
$$P_{N^L}(z) = 1 - \frac{1 - p_0^M}{1 - (1+\beta)^{-r}} + \frac{1 - p_0^M}{1 - (1+\beta)^{-r}}\left[1 - \beta(z-1)\right]^{-r}.$$
Thus the pgf of 𝑁 𝑃 is
$$P_{N^P}(z) = 1 - \frac{1 - p_0^M}{1 - (1+\beta)^{-r}} + \frac{1 - p_0^M}{1 - (1+\beta)^{-r}}\left[1 - \beta v(z-1)\right]^{-r}.$$
So the number of payments is also a ZM-negative binomial distribution
with parameters 𝛽𝑣, 𝑟, and 𝑝0𝑀 . Similarly, the probability at zero can be
evaluated using Pr(𝑁 𝑃 = 0) = 𝑃𝑁 𝑃 (0).

Example 5.5.4. Aggregate losses are modeled as follows:


(i) The number of losses follows a zero-modified Poisson distribution with 𝜆 = 3
and 𝑝0𝑀 = 0.5.
(ii) The amount of each loss has a Burr distribution with 𝛼 = 3, 𝜃 = 50, 𝛾 = 1.
(iii) There is a deductible of 𝑑 = 30 on each loss.
(iv) The number of losses and the amounts of the losses are mutually indepen-
dent.

Calculate E(𝑁 𝑃 ) and Var(𝑁 𝑃 ).

Solution. Since 𝑁 𝐿 follows a ZM-Poisson distribution with parameters 𝜆 and
𝑝0𝑀 , we know that 𝑁 𝑃 also follows a ZM-Poisson distribution, but with parameters
𝜆𝑣 and 𝑝0𝑀 , where

$$v = \Pr(X > 30) = \left(\frac{1}{1 + (30/50)}\right)^3 = 0.2441.$$

Thus, 𝑁 𝑃 follows a ZM-Poisson distribution with parameters 𝜆∗ = 𝜆𝑣 = 0.7324
and 𝑝0𝑀 = 0.5. Finally,

$$\begin{aligned}
\mathrm{E}(N^P) &= (1 - p_0^M)\frac{\lambda^*}{1 - e^{-\lambda^*}} = 0.5\left(\frac{0.7324}{1 - e^{-0.7324}}\right) = 0.7053 \\
\mathrm{Var}(N^P) &= (1 - p_0^M)\frac{\lambda^*\left[1 - (\lambda^* + 1)e^{-\lambda^*}\right]}{(1 - e^{-\lambda^*})^2} + p_0^M(1 - p_0^M)\left(\frac{\lambda^*}{1 - e^{-\lambda^*}}\right)^2 \\
&= 0.5\left(\frac{0.7324(1 - 1.7324\, e^{-0.7324})}{(1 - e^{-0.7324})^2}\right) + 0.5^2\left(\frac{0.7324}{1 - e^{-0.7324}}\right)^2 = 0.7244.
\end{aligned}$$

5.5.3 Impact of Policy Modifications on Aggregate Claims


In this section, we examine how a change in the deductible affects the aggregate
payments from an insurance portfolio. We assume that the presence of policy
limits (𝑢), coinsurance (𝛼), and inflation (𝑟) have no effect on the underlying
distribution of frequency of payments made by an insurer. As in the previous
section, we further assume that deductible changes do not impact the underlying
distributions of losses for both frequency and severity.

Recall the notation 𝑁 𝐿 for the number of losses. With ground-up loss amount
𝑋 and policy deductible 𝑑, we use 𝑁 𝑃 for the number of payments (as defined
in the previous section 5.5.2). Also, define the amount of payment on a per-loss
basis as

$$X^L = \begin{cases} 0, & \text{if } X < \dfrac{d}{1+r} \\ \alpha\left[(1+r)X - d\right], & \text{if } \dfrac{d}{1+r} \le X < \dfrac{u}{1+r} \\ \alpha(u - d), & \text{if } X \ge \dfrac{u}{1+r} \end{cases}$$

and the amount of payment on a per-payment basis as

$$X^P = \begin{cases} \text{undefined}, & \text{if } X < \dfrac{d}{1+r} \\ \alpha\left[(1+r)X - d\right], & \text{if } \dfrac{d}{1+r} \le X < \dfrac{u}{1+r} \\ \alpha(u - d), & \text{if } X \ge \dfrac{u}{1+r}. \end{cases}$$

In the above, 𝑟, 𝑢, and 𝛼 represent the inflation rate, policy limit, and coinsur-
ance, respectively. Hence, aggregate costs (payment amounts) can be expressed
either on a per loss or per payment basis:

$$S = X_1^L + \cdots + X_{N^L}^L = X_1^P + \cdots + X_{N^P}^P.$$

(Recall that when we introduced the per-loss and per-payment bases in Section
3.4, we used another letter 𝑌 to distinguish losses from insurance payments, or
claims. At this point in our development, we use the letter 𝑋 to reduce notation
complexity.)
The fundamentals regarding collective risk models are ready to apply. For in-
stance, we have:

$$\begin{aligned}
\mathrm{E}(S) &= \mathrm{E}(N^L)\, \mathrm{E}(X^L) = \mathrm{E}(N^P)\, \mathrm{E}(X^P) \\
\mathrm{Var}(S) &= \mathrm{E}(N^L)\, \mathrm{Var}(X^L) + \left[\mathrm{E}(X^L)\right]^2 \mathrm{Var}(N^L) \\
&= \mathrm{E}(N^P)\, \mathrm{Var}(X^P) + \left[\mathrm{E}(X^P)\right]^2 \mathrm{Var}(N^P) \\
M_S(z) &= P_{N^L}\left[M_{X^L}(z)\right] = P_{N^P}\left[M_{X^P}(z)\right].
\end{aligned}$$

Example 5.5.5. Actuarial Exam Question. A group dental policy has


a negative binomial claim count distribution with mean 300 and variance 800.
Ground-up severity is given by the following table:

Severity Probability
40 0.25
80 0.25
120 0.25
200 0.25

You expect severity to increase 50% with no change in frequency. You decide
to impose a per claim deductible of 100. Calculate the expected total claim
payment 𝑆 after these changes.
Solution. The cost per loss with a 50% increase in severity and a 100 deductible
per claim is

$$X^L = \begin{cases} 0 & 1.5x < 100 \\ 1.5x - 100 & 1.5x \ge 100. \end{cases}$$

This has expectation

$$\begin{aligned}
\mathrm{E}(X^L) &= \frac{1}{4}\left[(1.5(40) - 100)_+ + (1.5(80) - 100)_+ + (1.5(120) - 100)_+ + (1.5(200) - 100)_+\right] \\
&= \frac{1}{4}\left[(60 - 100)_+ + (120 - 100)_+ + (180 - 100)_+ + (300 - 100)_+\right] \\
&= \frac{1}{4}\left[0 + 20 + 80 + 200\right] = 75.
\end{aligned}$$

Thus, the expected aggregate loss is

$$\mathrm{E}(S) = \mathrm{E}(N)\, \mathrm{E}(X^L) = 300(75) = 22{,}500.$$

Example 5.5.6. Follow-Up. What is the variance of the total claim payment,
Var (𝑆)?
Solution. On a per loss basis, we have

$$\mathrm{Var}(S) = \mathrm{E}(N)\, \mathrm{Var}(X^L) + \left[\mathrm{E}(X^L)\right]^2 \mathrm{Var}(N),$$

where E(𝑁 ) = 300 and Var(𝑁 ) = 800. We find

$$\mathrm{E}\left[(X^L)^2\right] = \frac{1}{4}\left[0^2 + 20^2 + 80^2 + 200^2\right] = 11{,}700
\quad\Rightarrow\quad \mathrm{Var}(X^L) = \mathrm{E}\left[(X^L)^2\right] - \left[\mathrm{E}(X^L)\right]^2 = 11{,}700 - 75^2 = 6{,}075.$$

Thus, the variance of the aggregate claim payment is

$$\mathrm{Var}(S) = 300(6{,}075) + 75^2(800) = 6{,}322{,}500.$$

Alternative Method: Using the Per Payment Basis. Previously, we calculated
the expected total claim payment by multiplying the expected number of losses
by the expected payment per loss. Recall that we can also multiply the expected
number of payments by the expected payment per payment. In this case, we
have

$$S = X_1^P + \cdots + X_{N^P}^P.$$

The probability of a payment is

$$\Pr(1.5X \ge 100) = \Pr(X \ge 66.\overline{6}) = \frac{3}{4}.$$

Thus, the number of payments 𝑁 𝑃 has a negative binomial distribution (see the
negative binomial special case in Section 5.5.2) with mean

$$\mathrm{E}(N^P) = \mathrm{E}(N^L)\, \Pr(1.5X \ge 100) = 300\left(\frac{3}{4}\right) = 225.$$

The cost per payment is

$$X^P = \begin{cases} \text{undefined}, & \text{if } 1.5x < 100 \\ 1.5x - 100, & \text{if } 1.5x \ge 100. \end{cases}$$

This has expectation

$$\mathrm{E}(X^P) = \frac{\mathrm{E}(X^L)}{\Pr(1.5X > 100)} = \frac{75}{(3/4)} = 100.$$

Thus, as before, the expected aggregate loss is

$$\mathrm{E}(S) = \mathrm{E}(X^P)\, \mathrm{E}(N^P) = 100(225) = 22{,}500.$$

Example 5.5.7. Actuarial Exam Question. A company insures a fleet of


vehicles. Aggregate losses have a compound Poisson distribution. The expected
number of losses is 20. Loss amounts, regardless of vehicle type, have exponential
distribution with 𝜃 = 200. To reduce the cost of the insurance, two modifications
are to be made:
(i) A certain type of vehicle will not be insured. It is estimated that this will
reduce loss frequency by 20%.
(ii) A deductible of 100 per loss will be imposed.
Calculate the expected aggregate amount paid by the insurer after the modifi-
cations.
Solution. On a per loss basis, we have a 100 deductible. Thus, the expectation
per loss is

$$\mathrm{E}(X^L) = \mathrm{E}\left[(X - 100)_+\right] = \mathrm{E}(X) - \mathrm{E}(X \wedge 100) = 200 - 200(1 - e^{-100/200}) = 121.31.$$

Loss frequency has been reduced by 20%, resulting in an expected number of
losses

$$\mathrm{E}(N^L) = 0.8(20) = 16.$$

Thus, the expected aggregate amount paid after the modifications is

$$\mathrm{E}(S) = \mathrm{E}(X^L)\, \mathrm{E}(N^L) = 121.31(16) = 1{,}941.$$



Alternative Method: Using the Per Payment Basis. We can also use the per
payment basis to find the expected aggregate amount paid after the modifications.
With the deductible of 100, the probability that a payment occurs
is Pr(𝑋 > 100) = 𝑒−100/200 . For the per payment severity, plugging in the
expression for E(𝑋 𝐿 ) from the original example, we have

$$\mathrm{E}(X^P) = \frac{\mathrm{E}(X^L)}{\Pr(X > 100)} = \frac{200 - 200(1 - e^{-100/200})}{e^{-100/200}} = 200.$$

This is not surprising – recall that the exponential distribution is memoryless,
so the expected claim amount paid in excess of 100 is still exponential with
mean 200.
Now we look at the payment frequency

$$\mathrm{E}(N^P) = \mathrm{E}(N^L)\, \Pr(X > 100) = 16\, e^{-100/200} = 9.7.$$

Putting this together, we produce the same answer using the per payment basis
as the per loss basis from earlier:

$$\mathrm{E}(S) = \mathrm{E}(X^P)\, \mathrm{E}(N^P) = 200(9.7) = 1{,}941.$$

5.6 Further Resources and Contributors


Exercises
Here are a set of exercises that guide the viewer through some of the theoretical
foundations of Loss Data Analytics. Each tutorial is based on one or more
questions from the professional actuarial examinations, typically the Society of
Actuaries Exam C/STAM.
Aggregate Loss Guided Tutorials

Contributors
• Peng Shi and Lisa Gao, University of Wisconsin-Madison, are the princi-
pal authors of the initial version of this chapter. Email: [email protected]
for chapter comments and suggested improvements.
• Chapter reviewers include: Vytaras Brazauskas, Mark Maxwell, Jiadong
Ren, Sherly Paola Alfonso Sanchez, and Di (Cindy) Xu.

TS 5.A.1. Individual Risk Model Properties


For the expected value of the aggregate loss under the individual risk model,

𝑛 𝑛 𝑛
E(𝑆𝑛 ) = ∑ E(𝑋𝑖 ) = ∑ E(𝐼𝑖 × 𝐵𝑖 ) = ∑ E(𝐼𝑖 ) E(𝐵𝑖 ) from the independence of 𝐼𝑖 ’s and 𝐵𝑖 ’s
𝑖=1 𝑖=1 𝑖=1
𝑛
= ∑ Pr(𝐼𝑖 = 1) 𝜇𝑖 since the expectation of an indicator variable is the probability it equals 1
𝑖=1
𝑛
= ∑ 𝑞𝑖 𝜇𝑖 .
𝑖=1

For the variance of the aggregate loss under the individual risk model,

𝑛
Var(𝑆𝑛 ) = ∑ Var(𝑋𝑖 ) from the independence of 𝑋𝑖 ’s
𝑖=1
𝑛
= ∑ ( E [Var(𝑋𝑖 |𝐼𝑖 )] + Var [E(𝑋𝑖 |𝐼𝑖 )] ) from the conditional variance formulas
𝑖=1
𝑛
= ∑ (𝑞𝑖 𝜎𝑖2 + 𝑞𝑖 (1 − 𝑞𝑖 ) 𝜇2𝑖 ) .
𝑖=1

To see this, note that


E [Var(𝑋𝑖 |𝐼𝑖 )] = Var(𝑋𝑖 |𝐼𝑖 = 0) Pr(𝐼𝑖 = 0) + Var(𝑋𝑖 |𝐼𝑖 = 1) Pr(𝐼𝑖 = 1)
= 𝑞𝑖 𝜎𝑖2 + (1 − 𝑞𝑖 ) (0) = 𝑞𝑖 𝜎𝑖2 ,

and
Var [E(𝑋𝑖 |𝐼𝑖 )] = 𝑞𝑖 (1 − 𝑞𝑖 ) 𝜇2𝑖 ,

using the Bernoulli variance shortcut since E(𝑋𝑖 |𝐼𝑖 ) = 0 when 𝐼𝑖 = 0 (prob-
ability Pr(𝐼𝑖 = 0) = 1 − 𝑞𝑖 ) and E(𝑋𝑖 |𝐼𝑖 ) = 𝜇𝑖 when 𝐼𝑖 = 1 (probability
Pr(𝐼𝑖 = 1) = 𝑞𝑖 ).
For the probability generating function of the aggregate loss under the individual
risk model,

𝑛
𝑃𝑆𝑛 (𝑧) = ∏ 𝑃𝑋𝑖 (𝑧) from the independence of 𝑋𝑖 ’s
𝑖=1
𝑛 𝑛 𝑛
𝑋𝑖 𝐼𝑖 ×𝐵𝑖 𝐼𝑖 ×𝐵𝑖
= ∏ E(𝑧 ) = ∏ E(𝑧 ) = ∏ E [E(𝑧 |𝐼𝑖 )] from the law of iterated expectations
𝑖=1 𝑖=1 𝑖=1
𝑛
𝐼𝑖 ×𝐵𝑖 𝐼𝑖 ×𝐵𝑖
= ∏ [ 𝐸 (𝑧 |𝐼𝑖 = 0) Pr(𝐼𝑖 = 0) + 𝐸 (𝑧 |𝐼𝑖 = 1) Pr(𝐼𝑖 = 1) ]
𝑖=1
𝑛 𝑛
= ∏ [ (1) (1 − 𝑞𝑖 ) + 𝑃𝐵𝑖 (𝑧) 𝑞𝑖 ] = ∏ ( 1 − 𝑞𝑖 + 𝑞𝑖 𝑃𝐵𝑖 (𝑧) )
𝑖=1 𝑖=1

Lastly, for the moment generating function of the aggregate loss under the
individual risk model,

𝑛
𝑀𝑆𝑛 (𝑡) = ∏ 𝑀𝑋𝑖 (𝑡) from the independence of 𝑋𝑖 ’s
𝑖=1
𝑛 𝑛
= ∏ E(𝑒𝑡 𝑋𝑖
) = ∏ E (𝑒 𝑡 (𝐼𝑖 ×𝐵𝑖 )
)
𝑖=1 𝑖=1
𝑛
𝑡 (𝐼𝑖 ×𝐵𝑖 )
= ∏ E [E (𝑒 |𝐼𝑖 )] from the law of iterated expectations
𝑖=1
𝑛
𝑡 (𝐼𝑖 ×𝐵𝑖 ) 𝑡 (𝐼𝑖 ×𝐵𝑖 )
= ∏ [ E (𝑒 |𝐼𝑖 = 0) Pr(𝐼𝑖 = 0) + E (𝑒 |𝐼𝑖 = 1) Pr(𝐼𝑖 = 1) ]
𝑖=1
𝑛 𝑛
= ∏ [ (1) (1 − 𝑞𝑖 ) + 𝑀𝐵𝑖 (𝑡) 𝑞𝑖 ] = ∏ ( 1 − 𝑞𝑖 + 𝑞𝑖 𝑀𝐵𝑖 (𝑡) ) .
𝑖=1 𝑖=1

TS 5.A.2. Relationship Between Probability Generating


Functions of 𝑋𝑖 and 𝑋𝑖𝑇
Let 𝑋𝑖 belong to the (𝑎, 𝑏, 0) class with pmf 𝑝𝑖𝑘 = Pr(𝑋𝑖 = 𝑘) for 𝑘 = 0, 1, …
and 𝑋𝑖𝑇 be the associated zero-truncated distribution in the (𝑎, 𝑏, 1) class with
pmf $p_{ik}^T = p_{ik}/(1 - p_{i0})$ for 𝑘 = 1, 2, …. Then the relationship between the pgf
of 𝑋𝑖 and the pgf of 𝑋𝑖𝑇 is shown by

$$\begin{aligned}
P_{X_i}(z) = \mathrm{E}\left(z^{X_i}\right) &= \mathrm{E}\left[\mathrm{E}\left(z^{X_i} \mid X_i\right)\right] \quad \text{from the law of iterated expectations} \\
&= \mathrm{E}\left(z^{X_i} \mid X_i = 0\right)\Pr(X_i = 0) + \mathrm{E}\left(z^{X_i} \mid X_i > 0\right)\Pr(X_i > 0) \\
&= (1)\, p_{i0} + \mathrm{E}\left(z^{X_i^T}\right)(1 - p_{i0}) \quad \text{since } (X_i \mid X_i > 0) \text{ is the zero-truncated random variable } X_i^T \\
&= p_{i0} + (1 - p_{i0}) P_{X_i^T}(z).
\end{aligned}$$

TS 5.A.3. Example 5.3.8 Moment Generating Function of


Aggregate Loss 𝑆𝑁
For 𝑁 ∼ 𝐺𝑒𝑜(𝛽) and 𝑋 ∼ 𝐸𝑥𝑝(𝜃), we have

$$P_N(z) = \frac{1}{1 - \beta(z-1)}, \qquad M_X(t) = \frac{1}{1 - \theta t}.$$

Thus, the mgf of aggregate loss 𝑆𝑁 is

$$\begin{aligned}
M_{S_N}(t) = P_N[M_X(t)] &= \frac{1}{1 - \beta\left(\frac{1}{1-\theta t} - 1\right)} \\
&= \frac{1}{1 - \beta\left(\frac{\theta t}{1-\theta t}\right)} + 1 - 1 = 1 + \frac{\beta\left(\frac{\theta t}{1-\theta t}\right)}{1 - \beta\left(\frac{\theta t}{1-\theta t}\right)} \\
&= 1 + \frac{\beta\theta t}{(1 - \theta t) - \beta\theta t} = 1 + \frac{\beta\theta t}{1 - \theta t(1+\beta)}\cdot\frac{1+\beta}{1+\beta} \\
&= 1 + \frac{\beta}{1+\beta}\left[\frac{\theta(1+\beta)t}{1 - \theta(1+\beta)t}\right] \\
&= 1 + \frac{\beta}{1+\beta}\left[\frac{1}{1 - \theta(1+\beta)t} - 1\right],
\end{aligned}$$

which gives the expression (5.1). For the alternate expression of the mgf (5.2),
we continue from where we just left off:

$$\begin{aligned}
M_{S_N}(t) &= 1 + \frac{\beta}{1+\beta}\left[\frac{\theta(1+\beta)t}{1 - \theta(1+\beta)t}\right] \\
&= \frac{1+\beta}{1+\beta} + \frac{\beta}{1+\beta}\left[\frac{\theta(1+\beta)t}{1 - \theta(1+\beta)t}\right] \\
&= \frac{1}{1+\beta} + \frac{\beta}{1+\beta} + \frac{\beta}{1+\beta}\left[\frac{\theta(1+\beta)t}{1 - \theta(1+\beta)t}\right] \\
&= \frac{1}{1+\beta} + \frac{\beta}{1+\beta}\left[1 + \frac{\theta(1+\beta)t}{1 - \theta(1+\beta)t}\right] \\
&= \frac{1}{1+\beta} + \frac{\beta}{1+\beta}\left[\frac{1}{1 - \theta(1+\beta)t}\right].
\end{aligned}$$
Chapter 6

Simulation and Resampling

Chapter Preview. Simulation is a computationally intensive method used to


solve difficult problems. Instead of creating physical processes and experiment-
ing with them in order to understand their operational characteristics, a simu-
lation study is based on a computer representation - it considers various hypo-
thetical conditions as inputs and summarizes the results. Through simulation,
a vast number of hypothetical conditions can be quickly and inexpensively ex-
amined. Section 6.1 introduces simulation, a wonderful computational tool that
is especially useful in complex, multivariate settings.
We can also use simulation to draw from an empirical distribution - this process
is known as resampling. Resampling allows us to assess the uncertainty of
estimates in complex models. Section 6.2 introduces resampling in the context
of bootstrapping to determine the precision of estimators.
Subsequent sections introduce other topics in resampling. Section 6.3 on cross-
validation shows how to use it for model selection and validation. Section 6.4 on
importance sampling describes resampling in specific regions of interest, such
as long-tailed actuarial applications. Section 6.5 on Monte Carlo Markov Chain
(MCMC) introduces the simulation and resampling engine underpinning much
of modern Bayesian analysis.

6.1 Simulation Fundamentals

In this section, you learn how to:


• Generate approximately independent realizations that are uniformly dis-
tributed
• Transform the uniformly distributed realizations to observations from a
probability distribution of interest


• Calculate quantities of interest and determine the precision of the calcu-


lated quantities

6.1.1 Generating Independent Uniform Observations


The simulations that we consider are generated by computers. A major strength
of this approach is that they can be replicated, allowing us to check and improve
our work. Naturally, this also means that they are not really random. Nonethe-
less, algorithms have been produced so that results appear to be random for
all practical purposes. Specifically, they pass sophisticated tests of indepen-
dence and can be designed so that they come from a single distribution - our
iid assumption, identically and independently distributed.
To get a sense as to what these algorithms do, we consider a historically promi-
nent method.
Linear Congruential Generator. To generate a sequence of random num-
bers, start with 𝐵0 , a starting value that is known as a seed. This value is
updated using the recursive relationship

𝐵𝑛+1 = (𝑎𝐵𝑛 + 𝑐) modulo 𝑚, 𝑛 = 0, 1, 2, … .

This algorithm is called a linear congruential generator. The case of 𝑐 = 0 is


called a multiplicative congruential generator; it is particularly useful for really
fast computations.
For illustrative values of 𝑎 and 𝑚, Microsoft’s Visual Basic uses $m = 2^{24}$, 𝑎 =
1,140,671,485, and 𝑐 = 12,820,163 (see https://en.wikipedia.org/wiki/Linear_congruential_generator).
This is the engine underlying the random number generation in Microsoft’s Excel program.
The sequence used by the analyst is defined as 𝑈𝑛 = 𝐵𝑛 /𝑚. The analyst may
interpret the sequence {𝑈𝑖 } to be (approximately) identically and independently
uniformly distributed on the interval (0,1). To illustrate the algorithm, consider
the following.
Example 6.1.1. Illustrative Sequence. Take 𝑚 = 15, 𝑎 = 3, 𝑐 = 2 and
𝐵0 = 1. Then we have:

  step 𝑛    𝐵𝑛                                𝑈𝑛
  0         𝐵0 = 1
  1         𝐵1 = (3 × 1 + 2) mod 15 = 5       𝑈1 = 5/15
  2         𝐵2 = (3 × 5 + 2) mod 15 = 2       𝑈2 = 2/15
  3         𝐵3 = (3 × 2 + 2) mod 15 = 8       𝑈3 = 8/15
  4         𝐵4 = (3 × 8 + 2) mod 15 = 11      𝑈4 = 11/15
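A small R implementation of this generator (not from the text; the function name lcg is ours) reproduces the illustrative sequence:

lcg <- function(n, m, a, c, seed) {
  B <- numeric(n)
  B[1] <- (a * seed + c) %% m
  if (n > 1) for (i in 2:n) B[i] <- (a * B[i - 1] + c) %% m
  B / m                                   # the sequence U_n = B_n / m
}
lcg(4, m = 15, a = 3, c = 2, seed = 1)    # 5/15, 2/15, 8/15, 11/15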

Sometimes computer generated random results are known as pseudo-random


numbers to reflect the fact that they are machine generated and can be repli-
cated. That is, despite the fact that {𝑈𝑖 } appears to be i.i.d, it can be reproduced
by using the same seed number (and the same algorithm).
Example 6.1.2. Generating Uniform Random Numbers in R. The fol-
lowing code shows how to generate three uniform (0,1) numbers in R using the
runif command. The set.seed() function sets the initial seed. In many com-
puter packages, the initial seed is set using the system clock unless specified
otherwise.

Three Uniform Random Variates

  Uniform
  0.92424
  0.53718
  0.46920

The linear congruential generator is just one method of producing pseudo-


random outcomes. It is easy to understand and is widely used. The linear
congruential generator does have limitations, including the fact that it is pos-
sible to detect long-run patterns over time in the sequences generated (recall
that we can interpret independence to mean a total lack of functional patterns).
Not surprisingly, advanced techniques have been developed that address some
of this method’s drawbacks.

6.1.2 Inverse Transform Method


With the sequence of uniform random numbers, we next transform them to a
distribution of interest, say 𝐹 . A prominent technique is the inverse transform
method, defined as

𝑋𝑖 = 𝐹 −1 (𝑈𝑖 ) .

Here, recall from Section 4.1.1 that we introduced the inverse of the distribution
function, 𝐹 −1 , and referred to it also as the quantile function. Specifically, it is
defined to be

𝐹 −1 (𝑦) = inf {𝐹 (𝑥) ≥ 𝑦}.


𝑥

Recall that inf stands for infimum or the greatest lower bound. It is essentially
the smallest value of x that satisfies the inequality {𝐹 (𝑥) ≥ 𝑦}. The result is
that the sequence {𝑋𝑖 } is approximately iid with distribution function 𝐹 if the
{𝑈𝑖 } are iid with uniform on (0, 1) distribution function.

The inverse transform result is available when the underlying random variable
is continuous, discrete or a hybrid combination of the two. We now present a
series of examples to illustrate its scope of applications.
Example 6.1.3. Generating Exponential Random Numbers. Suppose
that we would like to generate observations from an exponential distribution
with scale parameter 𝜃 so that 𝐹 (𝑥) = 1 − 𝑒−𝑥/𝜃 . To compute the inverse
transform, we can use the following steps:
𝑦 = 𝐹 (𝑥) ⇔ 𝑦 = 1 − 𝑒−𝑥/𝜃
⇔ −𝜃 ln(1 − 𝑦) = 𝑥 = 𝐹 −1 (𝑦).

Thus, if 𝑈 has a uniform (0,1) distribution, then 𝑋 = −𝜃 ln(1 − 𝑈 ) has an


exponential distribution with parameter 𝜃.
The following R code shows how we can start with the same three uniform
random numbers as in Example 6.1.2 and transform them to independent ex-
ponentially distributed random variables with a mean of 10. Alternatively, you
can directly use the rexp function in R to generate random numbers from the
exponential distribution. The algorithm built into this routine is different so
even with the same starting seed number, individual realizations will differ.

  Uniform    Exponential 1    Exponential 2
  0.92424    25.80219         3.25222
  0.53718    7.70409          8.47652
  0.46920    6.33362          5.40176

Example 6.1.4. Generating Pareto Random Numbers. Suppose that we
would like to generate observations from a Pareto distribution with parameters
𝛼 and 𝜃 so that $F(x) = 1 - \left(\frac{\theta}{x+\theta}\right)^{\alpha}$. To compute the inverse transform, we can
use the following steps:

$$\begin{aligned}
y = F(x) &\Leftrightarrow 1 - y = \left(\frac{\theta}{x+\theta}\right)^{\alpha} \\
&\Leftrightarrow (1-y)^{-1/\alpha} = \frac{x+\theta}{\theta} = \frac{x}{\theta} + 1 \\
&\Leftrightarrow \theta\left((1-y)^{-1/\alpha} - 1\right) = x = F^{-1}(y).
\end{aligned}$$

Thus, $X = \theta\left((1-U)^{-1/\alpha} - 1\right)$ has a Pareto distribution with parameters 𝛼 and
𝜃.
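A short R sketch of this transform (not from the text; the parameter values 𝛼 = 3 and 𝜃 = 1000 are arbitrary choices):

set.seed(2020)
alpha <- 3; theta <- 1000
u <- runif(5)                               # uniform (0,1) draws
x <- theta * ((1 - u)^(-1/alpha) - 1)       # Pareto variates via X = F^{-1}(U)
x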

Inverse Transform Justification. Why does the random variable 𝑋 =


𝐹 −1 (𝑈 ) have a distribution function 𝐹 ?

This is easy to establish in the continuous case. Because $U$ is a uniform random variable on (0,1), we know that $\Pr(U \le y) = y$, for $0 \le y \le 1$. Thus,
$$\begin{aligned}
\Pr(X \le x) &= \Pr(F^{-1}(U) \le x) \\
&= \Pr(F(F^{-1}(U)) \le F(x)) \\
&= \Pr(U \le F(x)) = F(x)
\end{aligned}$$

as required. The key step is that 𝐹 (𝐹 −1 (𝑢)) = 𝑢 for each 𝑢, which is clearly
true when 𝐹 is strictly increasing.

We now consider some discrete examples.


Example 6.1.5. Generating Bernoulli Random Numbers. Suppose that
we wish to simulate random variables from a Bernoulli distribution with param-
eter 𝑞 = 0.85.


Figure 6.1: Distribution Function of a Binary Random Variable

A graph of the cumulative distribution function in Figure 6.1 shows that the
quantile function can be written as

$$F^{-1}(y) = \begin{cases} 0 & 0 < y \le 0.85 \\ 1 & 0.85 < y \le 1.0. \end{cases}$$

Thus, with the inverse transform we may define

$$X = \begin{cases} 0 & 0 < U \le 0.85 \\ 1 & 0.85 < U \le 1.0. \end{cases}$$
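A sketch of this inverse transform in R follows; the seed is again an illustrative assumption, so the three variates shown below reflect the authors' settings rather than necessarily this code.

set.seed(2017)                   # illustrative seed
U <- runif(3)
X <- ifelse(U <= 0.85, 0, 1)     # 0 when U <= 0.85, 1 otherwise
cbind(Uniform = U, BinaryX = X)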

For illustration, we generate three random numbers to get



Three Random Variates

  Uniform   Binary X
  0.92424          1
  0.53718          0
  0.46920          0

Example 6.1.6. Generating Random Numbers from a Discrete Dis-


tribution. Consider the time of a machine failure in the first five years. The
distribution of failure times is given as:

Discrete Distribution

  Time                    1.0   2.0   3.0   4.0   5.0
  Probability             0.1   0.2   0.1   0.4   0.2
  Distribution Function   0.1   0.3   0.4   0.8   1.0


Figure 6.2: Distribution Function of a Discrete Random Variable

Using the graph of the distribution function in Figure 6.2, with the inverse
transform we may define

⎧ 1 0 < 𝑈 ≤ 0.1
{ 2 0.1 < 𝑈 ≤ 0.3
{
𝑋=⎨ 3 0.3 < 𝑈 ≤ 0.4
{ 4 0.4 < 𝑈 ≤ 0.8
{
⎩ 5 0.8 < 𝑈 ≤ 1.0.

For general discrete random variables there may not be an ordering of outcomes.
For example, a person could own one of five types of life insurance products and
we might use the following algorithm to generate random outcomes:

⎧ whole life 0 < 𝑈 ≤ 0.1


{ endowment 0.1 < 𝑈 ≤ 0.3
{
𝑋=⎨ term life 0.3 < 𝑈 ≤ 0.4
{ universal life 0.4 < 𝑈 ≤ 0.8
{
⎩ variable life 0.8 < 𝑈 ≤ 1.0.

Another analyst may use an alternative procedure such as:

⎧ whole life 0.9 < 𝑈 < 1.0


{ endowment 0.7 ≤ 𝑈 < 0.9
{
𝑋=⎨ term life 0.6 ≤ 𝑈 < 0.7
{ universal life 0.2 ≤ 𝑈 < 0.6
{
⎩ variable life 0 ≤ 𝑈 < 0.2.

Both algorithms produce (in the long-run) the same probabilities, e.g.,
Pr(whole life) = 0.1, and so forth. So, neither is incorrect. You should be
aware that there is more than one way to accomplish a goal. Similarly, you
could use an alternative algorithm for ordered outcomes (such as failure times
1, 2, 3, 4, or 5, above).
Example 6.1.7. Generating Random Numbers from a Hybrid Dis-
tribution. Consider a random variable that is 0 with probability 70% and is
exponentially distributed with parameter 𝜃 = 10, 000 with probability 30%. In
an insurance application, this might correspond to a 70% chance of having no
insurance claims and a 30% chance of a claim - if a claim occurs, then it is
exponentially distributed. The distribution function, depicted in Figure 6.3, is
given as

$$F(x) = \begin{cases} 0 & x < 0 \\ 1 - 0.3\exp(-x/10000) & x \ge 0. \end{cases}$$

From Figure 6.3, we can see that the inverse transform for generating random
variables with this distribution function is

$$X = F^{-1}(U) = \begin{cases} 0 & 0 < U \le 0.7 \\ -10000 \ln\left(\frac{1-U}{0.3}\right) & 0.7 < U < 1. \end{cases}$$
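A sketch of this transform is given below; the mixing probability 0.7 and scale 10,000 follow the example, while the seed and number of draws are illustrative assumptions.

# Hybrid draw: 0 with probability 0.7, otherwise an exponential quantile on the rescaled uniform
r_hybrid <- function(n, p0 = 0.7, theta = 10000) {
  U <- runif(n)
  ifelse(U <= p0, 0, -theta * log((1 - U) / (1 - p0)))
}
set.seed(2017)    # illustrative seed
r_hybrid(5)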

For discrete and hybrid random variables, the key is to draw a graph of the
distribution function that allows you to visualize potential values of the inverse
function.

6.1.3 Simulation Precision


From the prior subsections, we now know how to generate independent simulated
realizations from a distribution of interest. With these realizations, we can
construct an empirical distribution and approximate the underlying distribution


Figure 6.3: Distribution Function of a Hybrid Random Variable

as precisely as needed. As we introduce more actuarial applications in this book,


you will see that simulation can be applied in a wide variety of contexts.
Many of these applications can be reduced to the problem of approximating
E [ℎ(𝑋)], where ℎ(⋅) is some known function. Based on 𝑅 simulations (replica-
tions), we get 𝑋1 , … , 𝑋𝑅 . From this simulated sample, we calculate an average

$$\overline{h}_R = \frac{1}{R} \sum_{i=1}^R h(X_i)$$

that we use as our simulation-based approximation (estimate) of $\mathrm{E}[h(X)]$. To estimate the precision of this approximation, we use the simulation variance

$$s_{h,R}^2 = \frac{1}{R-1} \sum_{i=1}^R \left( h(X_i) - \overline{h}_R \right)^2 .$$


From the independence, the standard error of the estimate is $s_{h,R}/\sqrt{R}$. This
can be made as small as we like by increasing the number of replications 𝑅.
Example 6.1.8. Portfolio Management. In Section 3.4, we learned how
to calculate the expected value of policies with deductibles. For an example of
something that cannot be done with closed form expressions, we now consider
two risks. This is a variation of a more complex example that will be covered
as Example 10.3.6.
We consider two property risks of a telecommunications firm:
• 𝑋1 - buildings, modeled using a gamma distribution with mean 200 and
scale parameter 100.
• 𝑋2 - motor vehicles, modeled using a gamma distribution with mean 400
and scale parameter 200.
Denote the total risk as 𝑋 = 𝑋1 + 𝑋2 . For simplicity, you assume that these
risks are independent.
To manage the risk, you seek some insurance protection. You are willing to
retain internally small building and motor vehicle amounts, up to 𝑀, say. Ran-
dom amounts in excess of 𝑀 will have an unpredictable effect on your budget
and so for these amounts you seek insurance protection. Stated mathematically,
your retained risk is 𝑌𝑟𝑒𝑡𝑎𝑖𝑛𝑒𝑑 = min(𝑋1 + 𝑋2 , 𝑀 ) and the insurer’s portion is
𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 = 𝑋 − 𝑌𝑟𝑒𝑡𝑎𝑖𝑛𝑒𝑑 .
To be specific, we use 𝑀 = 400 as well as 𝑅 = 1000000 simulations.
a. With the settings, we wish to determine the expected claim amount and
the associated standard deviation of (i) that retained, (ii) that accepted by the
insurer, and (iii) the total overall amount.
Here is the code for the expected claim amounts.
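A sketch of such code follows. The gamma shape parameters are implied by the stated means and scales (mean = shape * scale, so both shapes equal 2), and the seed is illustrative, so results will differ slightly from the figures reported below.

set.seed(2017)                             # illustrative seed
R  <- 1000000
X1 <- rgamma(R, shape = 2, scale = 100)    # buildings: mean 200
X2 <- rgamma(R, shape = 2, scale = 200)    # motor vehicles: mean 400
X  <- X1 + X2
M  <- 400
Y_retained <- pmin(X, M)                   # retained portion
Y_insurer  <- X - Y_retained               # insurer's portion
stats <- function(y) c(Mean = mean(y), SD = sd(y))
round(cbind(Retained = stats(Y_retained), Insurer = stats(Y_insurer), Total = stats(X)), 2)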
The results of these calculations are:
Retained Insurer Total
Mean 365.17 235.01 600.18
Standard Deviation 69.51 280.86 316.36
b. For insured claims, the standard error of the simulation approximation is
$s_{h,R}/\sqrt{1000000} = 280.86/\sqrt{1000000} = 0.281$. For this example, simulation is
quick and so a large value such as 1000000 is an easy choice. However, for more
complex problems, the simulation size may be an issue.
Figure 6.4 allows us to visualize the development of the approximation as the
number of simulations increases.

Determination of Number of Simulations


How many simulated values are recommended? 100? 1,000,000? We can use
the central limit theorem to respond to this question.
As one criterion for your confidence in the result, suppose that you wish
to be within 1% of the mean with 95% certainty. That is, you want
Pr (|ℎ𝑅 − E [ℎ(𝑋)]| ≤ 0.01E [ℎ(𝑋)]) ≥ 0.95. According to the central limit
theorem, your estimate should be approximately normally distributed and so we
want to have $R$ large enough to satisfy $0.01\mathrm{E}[h(X)]/\sqrt{\mathrm{Var}[h(X)]/R} \ge 1.96$.


Figure 6.4: Estimated Expected Insurer Claims versus Number of Simulations

(Recall that 1.96 is the 97.5th percentile from the standard normal distribu-
tion.) Replacing E [ℎ(𝑋)] and Var [ℎ(𝑋)] with estimates, you continue your
simulation until

$$\frac{0.01\,\overline{h}_R}{s_{h,R}/\sqrt{R}} \ge 1.96$$
or equivalently

$$R \ge 38{,}416\, \frac{s_{h,R}^2}{\overline{h}_R^2}. \qquad (6.1)$$

This criterion is a direct application of the approximate normality. Note that $\overline{h}_R$ and $s_{h,R}$ are not known in advance, so you will have to come up with estimates, either by doing a small pilot study in advance or by interrupting your procedure intermittently to see if the criterion is satisfied.
Example 6.1.8. Portfolio Management - continued
For our example, the average insurance claim is 235.011 and the corresponding
standard deviation is 280.862. Using equation (6.1), to be within 1% of the
mean, we would only require at least 54.87 thousand simulations. However, to
be within 0.1% we would want at least 5.49 million simulations.

Example 6.1.9. Approximation Choices. An important application of


simulation is the approximation of E [ℎ(𝑋)]. In this example, we show that the

choice of the ℎ(⋅) function and the distribution of 𝑋 can play a role.
Consider the following question: what is $\Pr[X > 2]$ when $X$ has a Cauchy distribution, with density $f(x) = \left(\pi(1+x^2)\right)^{-1}$, on the real line? The true value is
$$\Pr[X > 2] = \int_2^{\infty} \frac{dx}{\pi(1+x^2)}.$$
One can use an R numerical integration function (which usually works well on improper integrals) to evaluate this; as sketched below, the result is 0.14758.
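A one-line sketch using R's built-in integrate function is:

# numerical evaluation of the Cauchy tail probability Pr[X > 2]
integrate(function(x) 1 / (pi * (1 + x^2)), lower = 2, upper = Inf)
# equivalently, 1 - pcauchy(2)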
Approximation 1. Alternatively, one can use simulation techniques to approx-
imate that quantity. From calculus, you can check that the quantile function
of the Cauchy distribution is 𝐹 −1 (𝑦) = tan (𝜋(𝑦 − 0.5)). Then, with simulated
uniform (0,1) variates, 𝑈1 , … , 𝑈𝑅 , we can construct the estimator

$$p_1 = \frac{1}{R}\sum_{i=1}^R I\left(F^{-1}(U_i) > 2\right) = \frac{1}{R}\sum_{i=1}^R I\left(\tan(\pi(U_i - 0.5)) > 2\right).$$
[1] 0.147439
[1] 0.0003545432
With one million simulations, we obtain an estimate of 0.14744 with standard
error 0.355 (divided by 1000). One can prove that the variance of 𝑝1 is of order
0.127/𝑅.
Approximation 2. With other choices of ℎ(⋅) and 𝐹 (⋅) it is possible to reduce
uncertainty even using the same number of simulations 𝑅. To begin, one can use
the symmetry of the Cauchy distribution to write Pr[𝑋 > 2] = 0.5 ⋅ Pr[|𝑋| > 2].
With this, we can construct a new estimator,

$$p_2 = \frac{1}{2R}\sum_{i=1}^R I\left(|F^{-1}(U_i)| > 2\right).$$

With one million simulations, we obtain an estimate of 0.14748 with standard


error 0.228 (divided by 1000). One can prove that the variance of 𝑝2 is of order
0.052/𝑅.
Approximation 3. But one can go one step further. The improper integral can
be written as a proper one by a simple symmetry property (since the function
is symmetric and the integral on the real line is equal to 1)
$$\int_2^{\infty} \frac{dx}{\pi(1+x^2)} = \frac{1}{2} - \int_0^2 \frac{dx}{\pi(1+x^2)}.$$

From this expression, a natural approximation would be

$$p_3 = \frac{1}{2} - \frac{1}{R}\sum_{i=1}^R h_3(2U_i), \quad \text{where } h_3(x) = \frac{2}{\pi(1+x^2)}.$$

With one million simulations, we obtain an estimate of 0.14756 with standard


error 0.169 (divided by 1000). One can prove that the variance of 𝑝3 is of order
0.0285/𝑅.
Approximation 4. Finally, one can also consider some change of variable in
the integral
$$\int_2^{\infty} \frac{dx}{\pi(1+x^2)} = \int_0^{1/2} \frac{y^{-2}\,dy}{\pi(1+y^{-2})}.$$
From this expression, a natural approximation would be

$$p_4 = \frac{1}{R}\sum_{i=1}^R h_4(U_i/2), \quad \text{where } h_4(x) = \frac{1}{2\pi(1+x^2)}.$$

The expression seems rather similar to the previous one.


With one million simulations, we obtain an estimate of 0.14759 with standard
error 0.01 (divided by 1000). One can prove that the variance of 𝑝4 is of order
0.00009/𝑅, which is much smaller than what we had so far!
Table 6.1 summarizes the four choices of ℎ(⋅) and 𝐹 (⋅) to approximate Pr[𝑋 >
2] = 0.14758. The standard error varies dramatically. Thus, if we have a desired
degree of accuracy, then the number of simulations depends strongly on how we
write the integrals we try to approximate.
Table 6.1. Summary of Four Choices to Approximate Pr[𝑋 > 2]

  Estimator   Definition                                          Support Function                   Estimate   Standard Error
  $p_1$       $\frac{1}{R}\sum_{i=1}^R I(F^{-1}(U_i) > 2)$        $F^{-1}(u) = \tan(\pi(u-0.5))$     0.147439   0.000355
  $p_2$       $\frac{1}{2R}\sum_{i=1}^R I(|F^{-1}(U_i)| > 2)$     $F^{-1}(u) = \tan(\pi(u-0.5))$     0.147477   0.000228
  $p_3$       $\frac{1}{2} - \frac{1}{R}\sum_{i=1}^R h_3(2U_i)$   $h_3(x) = \frac{2}{\pi(1+x^2)}$    0.147558   0.000169
  $p_4$       $\frac{1}{R}\sum_{i=1}^R h_4(U_i/2)$                $h_4(x) = \frac{1}{2\pi(1+x^2)}$   0.147587   0.000010

6.1.4 Simulation and Statistical Inference


Simulations not only help us approximate expected values but are also useful in
calculating other aspects of distribution functions. In particular, they are very
useful when distributions of test statistics are too complicated to derive; in this
case, one can use simulations to approximate the reference distribution. We
now illustrate this with the Kolmogorov-Smirnov test that we learned about in
Section 4.1.2.

Example 6.1.10. Kolmogorov-Smirnov Test of Distribution. Suppose


that we have available 𝑛 = 100 observations {𝑥1 , ⋯ , 𝑥𝑛 } that, unknown to the
analyst, were generated from a gamma distribution with parameters 𝛼 = 6 and
𝜃 = 2. The analyst believes that the data come from a lognormal distribution
with parameters 1 and 0.4 and would like to test this assumption.
The first step is to visualize the data.
With this set-up, Figure 6.5 provides a graph of a histogram and empirical dis-
tribution. For reference, superimposed are red dashed lines from the lognormal
distribution.


Figure 6.5: Histogram and Empirical Distribution Function of Data used in Kolmogorov-Smirnov Test. The red dashed lines are fits based on the (incorrectly) hypothesized lognormal distribution.

Recall that the Kolmogorov-Smirnov statistic equals the largest discrepancy between the empirical and the hypothesized distribution. This is $\max_x |F_n(x) - F_0(x)|$, where $F_0$ is the hypothesized lognormal distribution. We can calculate this directly as:
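A sketch of a direct calculation is below. The helper D() computes the maximum discrepancy between the empirical cdf of a sample and a hypothesized cdf F0 and is reused in the simulation code that follows; the data-generating seed is an illustrative assumption, so the value 0.09704 quoted in the text will not be reproduced exactly.

n <- 100
set.seed(2029)                          # illustrative seed only
x <- rgamma(n, shape = 6, scale = 2)    # the "unknown" data-generating model
D <- function(dat, F0) {
  xs <- sort(dat)
  m  <- length(xs)
  i  <- seq_len(m)
  # largest gap between the empirical cdf (checked just below and at each jump)
  # and the hypothesized cdf
  max(pmax(abs(i / m - F0(xs)), abs((i - 1) / m - F0(xs))))
}
D(x, function(x) plnorm(x, meanlog = 1, sdlog = 0.4))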
[1] 0.09703627
Fortunately, for the lognormal distribution, R has built-in tests that allow us to
determine this without complex programming:

One-sample Kolmogorov-Smirnov test

data: x
D = 0.097037, p-value = 0.3031
alternative hypothesis: two-sided
However, for many distributions of actuarial interest, pre-built programs are not available. We can use simulation to approximate the distribution of the test statistic. Specifically, to compute the 𝑝-value, let us generate thousands of random samples from a 𝐿𝑁(1, 0.4) distribution (with the same sample size), and compute empirically the distribution of the statistic:
ns <- 1e4
d_KS <- rep(NA,ns)
# compute the test statistics for a large (ns) number of simulated samples
for(s in 1:ns) d_KS[s] <- D(rlnorm(n,1,.4),function(x) plnorm(x,1,.4))

mean(d_KS>D(x,function(x) plnorm(x,1,.4)))

[1] 0.2843

Figure 6.6: Simulated Distribution of the Kolmogorov-Smirnov Test Statistic. The vertical red dashed line marks the test statistic for the sample of 100.

The simulated distribution based on 10,000 random samples is summarized in Figure 6.6. Here, the statistic exceeded the empirical value (0.09704) in 28.43%
of the scenarios, while the theoretical 𝑝-value is 0.3031. For both the simulation
and the theoretical 𝑝-values, the conclusions are the same; the data do not
provide sufficient evidence to reject the hypothesis of a lognormal distribution.
Although only an approximation, the simulation approach works in a variety
of distributions and test statistics without needing to develop the nuances of
the underpinning theory for each situation. We summarize the procedure for
developing simulated distributions and p-values as follows:
1. Draw a sample of size $n$, say, $X_1, \ldots, X_n$, from a known distribution function $F$. Compute a statistic of interest, denoted as $\hat{\theta}(X_1, \ldots, X_n)$. Call this $\hat{\theta}^r$ for the $r$th replication.
2. Repeat this $r = 1, \ldots, R$ times to get a sample of statistics, $\hat{\theta}^1, \ldots, \hat{\theta}^R$.
3. From the sample of statistics in Step 2, $\{\hat{\theta}^1, \ldots, \hat{\theta}^R\}$, compute a summary measure of interest, such as a p-value.

6.2 Bootstrapping and Resampling

In this section, you learn how to:


• Generate a nonparametric bootstrap distribution for a statistic of interest
• Use the bootstrap distribution to generate estimates of precision for the
statistic of interest, including bias, standard deviations, and confidence
intervals
• Perform bootstrap analyses for parametric distributions

6.2.1 Bootstrap Foundations


Simulation presented up to now is based on sampling from a known distribution.
Section 6.1 showed how to use simulation techniques to sample and compute
quantities from known distributions. However, statistical science is dedicated to
providing inferences about distributions that are unknown. We gather summary
statistics based on this unknown population distribution. But how do we sample
from an unknown distribution?
Naturally, we cannot simulate draws from an unknown distribution but we can
draw from a sample of observations. If the sample is a good representation
from the population, then our simulated draws from the sample should well
approximate the simulated draws from a population. The process of sampling
from a sample is called resampling or bootstrapping. The term bootstrap comes
from the phrase “pulling oneself up by one’s bootstraps” (Efron, 1979). With
resampling, the original sample plays the role of the population and estimates
from the sample play the role of true population parameters.

The resampling algorithm is the same as introduced in Section 6.1.4 except that
now we use simulated draws from a sample. It is common to use {𝑋1 , … , 𝑋𝑛 } to
denote the original sample and let {𝑋1∗ , … , 𝑋𝑛∗ } denote the simulated draws. We
draw them with replacement so that the simulated draws will be independent
from one another, the same assumption as with the original sample. For each
sample, we also use n simulated draws, the same number as the original sample
size. To distinguish this procedure from the simulation, it is common to use
B (for bootstrap) to be the number of simulated samples. We could also write
$\{X_1^{(b)}, \ldots, X_n^{(b)}\}$, $b = 1, \ldots, B$, to clarify this.
There are two basic resampling methods, model-free and model-based, which are also known,
respectively, as nonparametric and parametric. In the nonparametric approach,
no assumption is made about the distribution of the parent population. The
simulated draws come from the empirical distribution function 𝐹𝑛 (⋅), so each
draw comes from {𝑋1 , … , 𝑋𝑛 } with probability 1/n.
In contrast, for the parametric approach, we assume that we have knowledge
of the distribution family $\mathcal{F}$. The original sample $X_1, \ldots, X_n$ is used to estimate parameters of that family, say, $\hat{\theta}$. Then, simulated draws are taken from $F_{\hat{\theta}}$. Section 6.2.4 discusses this approach in further detail.

Nonparametric Bootstrap
The idea of the nonparametric bootstrap is to use the inverse transform method
on 𝐹𝑛 , the empirical cumulative distribution function, depicted in Figure 6.7.


Figure 6.7: Inverse of an Empirical Distribution Function

Because $F_n$ is a step function, $F_n^{-1}$ takes values in $\{x_1, \cdots, x_n\}$. More precisely, as illustrated in Figure 6.8:
• if 𝑦 ∈ (0, 1/𝑛) (with probability 1/𝑛) we draw the smallest value (min{𝑥𝑖 })
• if 𝑦 ∈ (1/𝑛, 2/𝑛) (with probability 1/𝑛) we draw the second smallest value,

• …
• if 𝑦 ∈ ((𝑛 − 1)/𝑛, 1) (with probability 1/𝑛) we draw the largest value
(max{𝑥𝑖 }).

Figure 6.8: Inverse of an Empirical Distribution Function

Using the inverse transform method with 𝐹𝑛 means sampling from {𝑥1 , ⋯ , 𝑥𝑛 },
with probability 1/𝑛. Generating a bootstrap sample of size 𝐵 means sampling
from {𝑥1 , ⋯ , 𝑥𝑛 }, with probability 1/𝑛, with replacement. See the following
illustrative R code.
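A sketch of such code is simply a call to sample() with replacement; here x denotes the vector of original observations, and the draws shown below reflect the authors' seed.

# resample eight values, with replacement, from the original observations
sample(x, size = 8, replace = TRUE)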

[1] 2.6164 5.7394 5.7394 2.6164 2.6164 7.0899 0.8823 5.7394

Observe that the values 2.6164 and 5.7394 were each obtained three times.

6.2.2 Bootstrap Precision: Bias, Standard Deviation, and


Mean Square Error
We summarize the nonparametric bootstrap procedure as follows:

1. From the sample $\{X_1, \ldots, X_n\}$, draw a sample of size $n$ (with replacement), say, $X_1^*, \ldots, X_n^*$. From the simulated draws compute a statistic of interest, denoted as $\hat{\theta}(X_1^*, \ldots, X_n^*)$. Call this $\hat{\theta}_b^*$ for the $b$th replicate.
2. Repeat this $b = 1, \ldots, B$ times to get a sample of statistics, $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$.
3. From the sample of statistics in Step 2, $\{\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*\}$, compute a summary measure of interest.

In this section, we focus on three summary measures, the bias, the standard deviation, and the mean square error (MSE). Table 6.2 summarizes these three measures. Here, $\overline{\theta^*}$ is the average of $\{\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*\}$.

Table 6.2. Bootstrap Summary Measures


  Population Measure    Population Definition                   Bootstrap Approximation                                                             Bootstrap Symbol
  Bias                  $\mathrm{E}(\hat{\theta}) - \theta$     $\overline{\theta^*} - \hat{\theta}$                                                $Bias_{boot}(\hat{\theta})$
  Standard Deviation    $\sqrt{\mathrm{Var}(\hat{\theta})}$     $\sqrt{\frac{1}{B-1}\sum_{b=1}^B (\hat{\theta}_b^* - \overline{\theta^*})^2}$       $s_{boot}(\hat{\theta})$
  Mean Square Error     $\mathrm{E}(\hat{\theta} - \theta)^2$   $\frac{1}{B}\sum_{b=1}^B (\hat{\theta}_b^* - \hat{\theta})^2$                       $MSE_{boot}(\hat{\theta})$

Example 6.2.1. Bodily Injury Claims and Loss Elimination Ratios.


To show how the bootstrap can be used to quantify the precision of estimators,
we return to the Section 4.1.1 Example 4.1.6 bodily injury claims data where
we introduced a nonparametric estimator of the loss elimination ratio.
Table 6.3 summarizes the results of the bootstrap estimation. For example, at
𝑑 = 14000, the nonparametric estimate of LER is 0.97678. This has an estimated
bias of 0.00016 with a standard deviation of 0.00687. For some applications, you
may wish to apply the estimated bias to the original estimate to give a bias-
corrected estimator. This is the focus of the next example. For this illustration,
the bias is small and so such a correction is not relevant.
Table 6.3. Bootstrap Estimates of LER at Selected Deductibles

      d   NP Estimate   Bootstrap Bias   Bootstrap SD   Lower Normal 95% CI   Upper Normal 95% CI
   4000       0.54113          0.00011        0.01237               0.51678               0.56527
   5000       0.64960          0.00027        0.01412               0.62166               0.67700
  10500       0.93563          0.00004        0.01017               0.91567               0.95553
  11500       0.95281         -0.00003        0.00941               0.93439               0.97128
  14000       0.97678          0.00016        0.00687               0.96316               0.99008
  18500       0.99382          0.00014        0.00331               0.98719               1.00017

The bootstrap standard deviation gives a measure of precision. For one appli-
cation of standard deviations, we can use the normal approximation to create
a confidence interval. For example, the R function boot.ci produces the nor-
mal confidence intervals at 95%. These are produced by creating an interval
of twice the length of 1.95994 bootstrap standard deviations, centered about
the bias-corrected estimator (1.95994 is the 97.5th quantile of the standard nor-
mal distribution). For example, the lower normal 95% CI at 𝑑 = 14000 is
(0.97678 − 0.00016) − 1.95994 ∗ 0.00687 = 0.96316. We further discuss bootstrap
confidence intervals in the next section.

Example 6.2.2. Estimating exp(𝜇). The bootstrap can be used to quantify


the bias of an estimator, for instance. Consider here a sample x = {𝑥1 , ⋯ , 𝑥𝑛 }
that is iid with mean 𝜇.

Suppose that the quantity of interest is $\theta = \exp(\mu)$. A natural estimator would be $\hat{\theta}_1 = \exp(\bar{x})$. This estimator is biased (due to the Jensen inequality) but is asymptotically unbiased. For our sample, the estimate is as follows.
[1] 19.13463
One can use the central limit theorem to get a correction using
$$\bar{X} \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right), \quad \text{where } \sigma^2 = \mathrm{Var}[X_i],$$
so that, with the normal moment generating function, we have
$$\mathrm{E}\left[\exp(\bar{X})\right] \approx \exp\left(\mu + \frac{\sigma^2}{2n}\right).$$
Hence, one can consider naturally
$$\hat{\theta}_2 = \exp\left(\bar{x} - \frac{\hat{\sigma}^2}{2n}\right).$$
For our data, this turns out to be as follows.
[1] 18.73334
As another strategy (that we do not pursue here), one can also use Taylor’s approximation to get a more accurate estimator (as in the delta method),
$$g(\bar{x}) = g(\mu) + (\bar{x} - \mu) g'(\mu) + (\bar{x} - \mu)^2 \frac{g''(\mu)}{2} + \cdots$$
The alternative we do explore is to use a bootstrap strategy: given a bootstrap sample, $\mathbf{x}_b^*$, let $\bar{x}_b^*$ denote its mean, and set
$$\hat{\theta}_3 = \frac{1}{B} \sum_{b=1}^B \exp(\bar{x}_b^*).$$
To implement this, we have the following code.
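A sketch of that code is given next; it mirrors the boot() call reproduced in the output below, with an illustrative seed (sample_x denotes the data vector of this example, listed explicitly in Example 6.3.2).

library(boot)
set.seed(2017)   # illustrative seed; bootstrap replicates will vary
results <- boot(data = sample_x,
                statistic = function(y, indices) exp(mean(y[indices])),
                R = 1000)
results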
Then, you can plot(results) and print(results) to see the following.

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = sample_x, statistic = function(y, indices) exp(mean(y[indices])),
R = 1000)

Bootstrap Statistics :
original bias std. error
t1* 19.13463 0.2536551 3.909725


Figure 6.9: Distribution of Bootstrap Replicates. The left-hand panel is a histogram of replicates. The right-hand panel is a quantile-quantile plot, comparing the bootstrap distribution to the standard normal distribution.

This results in three estimators: the raw estimator $\hat{\theta}_1 = 19.135$, the second-order
correction $\hat{\theta}_2 = 18.733$, and the bootstrap estimator $\hat{\theta}_3 = 19.388$.
How does this work with differing sample sizes? We now suppose that the 𝑥𝑖 ’s
are generated from a lognormal distribution 𝐿𝑁 (0, 1), so that 𝜇 = exp(0+1/2) =
1.648721 and 𝜃 = exp(1.648721) = 5.200326. We use simulation to draw samples
of varying sizes but then act as if each were a realized set of observations. See the
following illustrative code.
The results of the comparison are summarized in Figure 6.10. This figure shows
that the bootstrap estimator is closer to the true parameter value for almost
all sample sizes. The bias of all three estimators decreases as the sample size
increases.


Figure 6.10: Comparison of Estimates. The raw estimator, second-order correction, and bootstrap estimator are plotted against the sample size n. The true value of the parameter is given by the solid horizontal line at 5.20.

6.2.3 Confidence Intervals


The bootstrap procedure generates $B$ replicates $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$ of the estimator $\hat{\theta}$. In
Example 6.2.1, we saw how to use standard normal approximations to create
a confidence interval for parameters of interest. However, given that a major
point is to use bootstrapping to avoid relying on assumptions of approximate
normality, it is not surprising that there are alternative confidence intervals
available.

For an estimator $\hat{\theta}$, the basic bootstrap confidence interval is
$$\left(2\hat{\theta} - q_U, \; 2\hat{\theta} - q_L\right), \qquad (6.2)$$
where $q_L$ and $q_U$ are the lower and upper 2.5% quantiles from the bootstrap sample $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$.

To see where this comes from, start with the idea that $(q_L, q_U)$ provides a 95% interval for $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$. So, for a random $\hat{\theta}_b^*$, there is a 95% chance that $q_L \le \hat{\theta}_b^* \le q_U$. Reversing the inequalities and adding $\hat{\theta}$ to each side gives a 95% interval
$$\hat{\theta} - q_U \le \hat{\theta} - \hat{\theta}_b^* \le \hat{\theta} - q_L.$$
So, $(\hat{\theta} - q_U, \hat{\theta} - q_L)$ is a 95% interval for $\hat{\theta} - \hat{\theta}_b^*$. The bootstrap approximation idea says that this is also a 95% interval for $\theta - \hat{\theta}$. Adding $\hat{\theta}$ to each side gives the 95% interval in equation (6.2).
Many alternative bootstrap intervals are available. The easiest to explain is the
percentile bootstrap interval which is defined as (𝑞𝐿 , 𝑞𝑈 ). However, this has the
drawback of potentially poor behavior in the tails which can be of concern in
some actuarial problems of interest.
Example 6.2.3. Bodily Injury Claims and Risk Measures. To see how
the bootstrap confidence intervals work, we return to the bodily injury auto
claims considered in Example 6.2.1. Instead of the loss elimination ratio, sup-
pose we wish to estimate the 95th percentile 𝐹 −1 (0.95) and a measure defined
as
𝑇 𝑉 𝑎𝑅0.95 [𝑋] = E[𝑋|𝑋 > 𝐹 −1 (0.95)].
This measure is called the tail value-at-risk; it is the expected value of 𝑋 condi-
tional on 𝑋 exceeding the 95th percentile. Section 10.2 explains how quantiles
and the tail value-at-risk are the two most important examples of so-called risk
measures. For now, we will simply think of these as measures that we wish
to estimate. For the percentile, we use the nonparametric estimator 𝐹𝑛−1 (0.95)
defined in Section 4.1.1. For the tail value-at-risk, we use the plug-in principle
to define the nonparametric estimator

$$TVaR_{n,0.95}[X] = \frac{\sum_{i=1}^n X_i \, I\left(X_i > F_n^{-1}(0.95)\right)}{\sum_{i=1}^n I\left(X_i > F_n^{-1}(0.95)\right)}.$$
In this expression, the denominator counts the number of observations that
exceed the 95th percentile 𝐹𝑛−1 (0.95). The numerator adds up losses for those
observations that exceed 𝐹𝑛−1 (0.95). Table 6.4 summarizes the estimator for
selected fractions.
Table 6.4. Bootstrap Estimates of Quantiles at Selected Fractions

  Fraction   NP Estimate   Bootstrap Bias   Bootstrap SD   Lower Normal 95% CI   Upper Normal 95% CI   Lower Basic 95% CI   Upper Basic 95% CI   Lower Percentile 95% CI   Upper Percentile 95% CI
      0.50       6500.00          -128.02         200.36               6235.32               7020.72              6300.00              7000.00                   6000.00                   6700.00
      0.80       9078.40            89.51         200.27               8596.38               9381.41              8533.20              9230.40                   8926.40                   9623.60
      0.90      11454.00            55.95         480.66              10455.96              12340.13             10530.49             12415.00                  10493.00                  12377.51
      0.95      13313.40            13.59         667.74              11991.07              14608.55             11509.70             14321.00                  12305.80                  15117.10
      0.98      16758.72           101.46        1273.45              14161.34              19153.19             14517.44             19326.95                  14190.49                  19000.00

For example, when the fraction is 0.50, we see that the lower and upper 2.5% quan-
tiles of the bootstrap simulations are $q_L = 6000$ and $q_U = 6700$, respectively.
These form the percentile bootstrap confidence interval. With the nonpara-
metric estimator 6500, these yield the lower and upper bounds of the basic
confidence interval 6300 and 7000, respectively. Table 6.4 also shows bootstrap
estimates of the bias, standard deviation, and a normal confidence interval, con-
cepts introduced in Section 6.2.2.
Table 6.5 shows similar calculations for the tail value-at-risk. In each case,
we see that the bootstrap standard deviation increases as the fraction increases.
This is because there are fewer observations to estimate quantiles as the fraction
increases, leading to greater imprecision. Confidence intervals also become wider.
Interestingly, there does not seem to be the same pattern in the estimates of
the bias.
Table 6.5. Bootstrap Estimates of TVaR at Selected Risk Levels

  Fraction   NP Estimate   Bootstrap Bias   Bootstrap SD   Lower Normal 95% CI   Upper Normal 95% CI   Lower Basic 95% CI   Upper Basic 95% CI   Lower Percentile 95% CI   Upper Percentile 95% CI
      0.50       9794.69          -120.82         273.35               9379.74              10451.27              9355.14             10448.87                   9140.51                  10234.24
      0.80      12454.18            30.68         481.88              11479.03              13367.96             11490.62             13378.52                  11529.84                  13417.74
      0.90      14720.05            17.51         718.23              13294.82              16110.25             13255.45             16040.72                  13399.38                  16184.65
      0.95      17072.43             5.99        1103.14              14904.31              19228.56             14924.50             19100.88                  15043.97                  19220.36
      0.98      20140.56            73.43        1587.64              16955.40              23178.85             16942.36             22984.40                  17296.71                  23338.75

6.2.4 Parametric Bootstrap


The idea of the nonparametric bootstrap is to resample by drawing independent
variables from the empirical cumulative distribution function 𝐹𝑛 . In contrast,
with parametric bootstrap, we draw independent variables from 𝐹𝜃 ̂ where the
underlying distribution is assumed to be in a parametric family ℱ = {𝐹𝜃 , 𝜃 ∈ Θ}.
Typically, parameters from this distribution are estimated based on a sample
and denoted as 𝜃.̂
Example 6.2.4. Lognormal distribution. Consider again the dataset used in Example 6.2.2.
The classical (nonparametric) bootstrap was based on samples drawn from the empirical distribution of these data.
Instead, for the parametric bootstrap, we have to assume that the distribution
of 𝑥𝑖 ’s is from a specific family. As an example, the following code utilizes a
lognormal distribution.
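A sketch of that code uses fitdistr() from the MASS package; sample_x denotes the data vector, and the output below reports the maximum likelihood estimates with standard errors in parentheses.

library(MASS)
fit_ln <- fitdistr(sample_x, densfun = "lognormal")   # ML fit of meanlog and sdlog
fit_ln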
meanlog sdlog
1.03630697 0.30593440
(0.06840901) (0.04837027)

Then we draw from that distribution.


Figure 6.11 compares the bootstrap distributions for the coefficient of variation,
one based on the nonparametric approach and the other based on a parametric
approach, assuming a lognormal distribution.

Figure 6.11: Comparison of Nonparametric and Parametric Bootstrap Distributions for the Coefficient of Variation

Example 6.2.5. Bootstrapping Censored Observations. The parametric


bootstrap draws simulated realizations from a parametric estimate of the distribution function. In the same way, we can draw simulated realizations from other estimates of the distribution function. As one example, we might draw from smoothed estimates of a distribution function introduced in Section 4.1.1. Another special case, considered here, is to draw from the Kaplan-Meier estimator introduced in Section 4.3.2. In this way, we can handle observations that are
censored.
Specifically, return to the bodily injury data in Examples 6.2.1 and 6.2.3 but now
we include the 17 claims that were censored by policy limits. In Example 4.3.6,
we used this full dataset to estimate the Kaplan-Meier estimator of the survival
function introduced in Section 4.3.2. Table 6.6 presents bootstrap estimates of
the quantiles from the Kaplan-Meier survival function estimator. These include
the bootstrap precision estimates, bias and standard deviation, as well as the
basic 95% confidence interval.

Table 6.6. Bootstrap Kaplan-Meier Estimates of Quantiles at Selected


Fractions

  Fraction   KM NP Estimate   Bootstrap Bias   Bootstrap SD   Lower Basic 95% CI   Upper Basic 95% CI
      0.50             6500            18.77         177.38                 6067                 6869
      0.80             9500           167.08         429.59                 8355                 9949
      0.90            12756            37.73         675.21                10812                13677
      0.95            18500              Inf            NaN                12500                22300
      0.98            25000              Inf            NaN                 -Inf                27308

Results in Table 6.6 are consistent with the results for the uncensored subsample
in Table 6.4. In Table 6.6, we note the difficulty in estimating quantiles at large
fractions due to the censoring. However, for moderate size fractions (0.50, 0.80,
and 0.90), the Kaplan-Meier nonparametric (KM NP) estimates of the quantile
are consistent with those Table 6.4. The bootstrap standard deviation is smaller
at the 0.50 (corresponding to the median) but larger at the 0.80 and 0.90 levels.
The censored data analysis summarized in Table 6.6 uses more data than the
uncensored subsample analysis in Table 6.4 but also has difficulty extracting
information for large quantiles.

6.3 Cross-Validation
In this section, you learn how to:
• Compare and contrast cross-validation to simulation techniques and boot-
strap methods.
• Use cross-validation techniques for model selection
• Explain the jackknife method as a special case of cross-validation and
calculate jackknife estimates of bias and standard errors

Cross-validation, briefly introduced in Section 4.2.4, is a technique based on


simulated outcomes. We now compare and contrast cross-validation to other
simulation techniques already introduced in this chapter.
• Simulation, or Monte-Carlo, introduced in Section 6.1, allows us to com-
pute expected values and other summaries of statistical distributions, such
as 𝑝-values, readily.
• Bootstrap, and other resampling methods introduced in Section 6.2, pro-
vides estimators of the precision, or variability, of statistics.
• Cross-validation is important when assessing how accurately a predictive
model will perform in practice.

Overlap exists but nonetheless it is helpful to think about the broad goals asso-
ciated with each statistical method.
To discuss cross-validation, let us recall from Section 4.2 some of the key ideas
of model validation. When assessing, or validating, a model, we look to perfor-
mance measured on new data, or at least not those that were used to fit the
model. A classical approach, described in Section 4.2.3, is to split the sample in
two: a subpart (the training dataset) is used to fit the model and the other one
(the testing dataset) is used to validate. However, a limitation of this approach
is that results depend on the split; even though the overall sample is fixed, the
split between training and test subsamples varies randomly. A different train-
ing sample means that model estimated parameters will differ. Different model
parameters and a different test sample means that validation statistics will dif-
fer. Two analysts may use the same data and same models yet reach different
conclusions about the viability of a model (based on different random splits), a
frustrating situation.

6.3.1 k-Fold Cross-Validation


To mitigate this difficulty, it is common to use a cross-validation approach as
introduced in Section 4.2.4. The key idea is to emulate the basic test/training
approach to model validation by repeating it many times through averaging
over different splits of the data. A key advantage is that the validation statistic
is not tied to a specific parametric (or nonparametric) model - one can use a
nonparametric statistic or a statistic that has economic interpretations - and so
this can be used to compare models that are not nested (unlike likelihood ratio
procedures).
Example 6.3.1. Wisconsin Property Fund. For the 2010 property fund
data introduced in Section 1.3, we fit gamma and Pareto distributions to the
1,377 claims data. For details of the related goodness of fit, see Appendix Section
15.4.4. We now consider the Kolmogorov-Smirnov statistic introduced in Section
4.1.2. When the entire dataset was fit, the Kolmogorov-Smirnov goodness of fit
statistic for the gamma distribution turns out to be 0.2639 and for the Pareto
distribution is 0.0478. The lower value for the Pareto distribution indicates that
this distribution is a better fit than the gamma.
To see how k-fold cross-validation works, we randomly split the data into 𝑘 = 8
groups, or folds, each having about 1377/8 ≈ 172 observations. Then, we
fit gamma and Pareto models to a data set with the first seven folds (about
172 ⋅ 7 = 1204 observations), determine estimated parameters, and then used
these fitted models with the held-out data to determine the Kolmogorov-Smirnov
statistic.
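A sketch of this k-fold procedure is given below. The vector claims (the 1,377 claim amounts) is assumed to be available, the Pareto functions come from the actuar package, and the seed and starting values are illustrative, so this outlines the logic rather than reproducing the authors' exact figures.

library(MASS)    # fitdistr, for the gamma fit
library(actuar)  # dpareto, ppareto
set.seed(2017)
k <- 8
folds <- sample(rep(1:k, length.out = length(claims)))   # random fold assignment
ks <- matrix(NA, k, 2, dimnames = list(NULL, c("Gamma", "Pareto")))
for (j in 1:k) {
  train <- claims[folds != j]
  test  <- claims[folds == j]
  fg <- fitdistr(train, "gamma")                         # gamma MLE on training folds
  nll <- function(p) -sum(dpareto(train, shape = exp(p[1]), scale = exp(p[2]), log = TRUE))
  fp <- optim(c(0, log(mean(train))), nll)               # Pareto MLE (log-scale parameters)
  ks[j, "Gamma"]  <- ks.test(test, "pgamma", fg$estimate["shape"], fg$estimate["rate"])$statistic
  ks[j, "Pareto"] <- ks.test(test, "ppareto", exp(fp$par[1]), exp(fp$par[2]))$statistic
}
ks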
This result appears at Fold=1 on the horizontal axis of Figure 6.12. The process
was repeated for the other seven folds. The results summarized in Figure 6.12
show that the Pareto consistently provides a more reliable predictive distribution
than the gamma.


Figure 6.12: Cross Validated Kolmogorov-Smirnov (KS) Statistics for the Property Fund Claims Data. The solid black line is for the Pareto distribution, the green dashed line is for the gamma distribution. The KS statistic measures the largest deviation between the fitted distribution and the empirical distribution for each of 8 groups, or folds, of randomly selected data.

6.3.2 Leave-One-Out Cross-Validation


A special case where 𝑘 = 𝑛 is known as leave-one-out cross validation. This case
is historically prominent and is closely related to jackknife statistics, a precursor
of the bootstrap technique.
Even though we present it as a special case of cross-validation, it is helpful to give an explicit definition. Consider a generic statistic $\hat{\theta} = t(\mathbf{x})$ that is an estimator for a parameter of interest $\theta$. The idea of the jackknife is to compute $n$ values $\hat{\theta}_{-i} = t(\mathbf{x}_{-i})$, where $\mathbf{x}_{-i}$ is the subsample of $\mathbf{x}$ with the $i$th value removed. The average of these values is denoted as
$$\hat{\theta}_{(\cdot)} = \frac{1}{n} \sum_{i=1}^n \hat{\theta}_{-i}.$$

These values can be used to create estimates of the bias of the statistic $\hat{\theta}$,
$$Bias_{jack} = (n-1)\left(\hat{\theta}_{(\cdot)} - \hat{\theta}\right), \qquad (6.3)$$
as well as a standard deviation estimate,
$$s_{jack} = \sqrt{\frac{n-1}{n} \sum_{i=1}^n \left(\hat{\theta}_{-i} - \hat{\theta}_{(\cdot)}\right)^2}. \qquad (6.4)$$

Example 6.3.2. Coefficient of Variation. To illustrate, consider a small


fictitious sample 𝑥 = {𝑥1 , … , 𝑥𝑛 } with realizations
sample_x <- c(2.46,2.80,3.28,3.86,2.85,3.67,3.37,3.40,
5.22,2.55,2.79,4.50,3.37,2.88,1.44,2.56,2.00,2.07,2.19,1.77)
Suppose that we are interested in the coefficient of variation $\theta = CV = \sqrt{\mathrm{Var}[X]}/\mathrm{E}[X]$.
With this dataset, the estimator of the coefficient of variation turns out to be
0.31196. But how reliable is it? To answer this question, we can compute the
jackknife estimates of bias and its standard deviation. The following code shows
that the jackknife estimator of the bias is 𝐵𝑖𝑎𝑠𝑗𝑎𝑐𝑘 = -0.00627 and the jackknife
standard deviation is 𝑠𝑗𝑎𝑐𝑘 = 0.01293.
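A sketch of that calculation, following equations (6.3) and (6.4), is:

CV <- function(x) sd(x) / mean(x)          # coefficient of variation
theta_hat <- CV(sample_x)
n <- length(sample_x)
theta_minus_i <- sapply(1:n, function(i) CV(sample_x[-i]))   # leave-one-out values
theta_dot <- mean(theta_minus_i)
bias_jack <- (n - 1) * (theta_dot - theta_hat)                          # equation (6.3)
sd_jack   <- sqrt(((n - 1) / n) * sum((theta_minus_i - theta_dot)^2))   # equation (6.4)
c(estimate = theta_hat, bias = bias_jack, sd = sd_jack)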

Example 6.3.3. Bodily Injury Claims and Loss Elimination Ratios. In


Example 6.2.1, we showed how to compute bootstrap estimates of the bias and
standard deviation for the loss elimination ratio using the Example 4.1.11 bodily
injury claims data. We follow up now by providing comparable quantities using
jackknife statistics.

Table 6.7 summarizes the results of the jackknife estimation. It shows that
jackknife estimates of the bias and standard deviation of the loss elimination ra-
tio E [min(𝑋, 𝑑)]/E [𝑋] are largely consistent with the bootstrap methodology.
Moreover, one can use the standard deviations to construct normal based con-
fidence intervals, centered around a bias-corrected estimator. For example, at
𝑑 = 14000, we saw in Example 4.1.11 that the nonparametric estimate of LER
is 0.97678. This has an estimated bias of 0.00010, resulting in the (jackknife)
bias-corrected estimator 0.97688. The 95% confidence intervals are produced by
creating an interval of twice the length of 1.96 jackknife standard deviations,
centered about the bias-corrected estimator (1.96 is the approximate 97.5th
quantile of the standard normal distribution).
Table 6.7. Jackknife Estimates of LER at Selected Deductibles

      d   NP Estimate   Bootstrap Bias   Bootstrap SD   Jackknife Bias   Jackknife SD   Lower Jackknife 95% CI   Upper Jackknife 95% CI
   4000       0.54113          0.00011        0.01237          0.00031        0.00061                  0.53993                  0.54233
   5000       0.64960          0.00027        0.01412          0.00033        0.00068                  0.64825                  0.65094
  10500       0.93563          0.00004        0.01017          0.00019        0.00053                  0.93460                  0.93667
  11500       0.95281         -0.00003        0.00941          0.00016        0.00047                  0.95189                  0.95373
  14000       0.97678          0.00016        0.00687          0.00010        0.00034                  0.97612                  0.97745
  18500       0.99382          0.00014        0.00331          0.00003        0.00017                  0.99350                  0.99415

Discussion. One of the many interesting things about the leave-one-out special
case is the ability to replicate estimates exactly. That is, when the size of the
fold is only one, then there is no additional uncertainty induced by the cross-
validation. This means that analysts can exactly replicate work of one another,
an important consideration.
Jackknife statistics were developed to understand precision of estimators, pro-
ducing estimators of bias and standard deviation in equations (6.3) and (6.4).
This crosses into goals that we have associated with bootstrap techniques, not
cross-validation methods. This demonstrates how statistical techniques can be
used to achieve different goals.

6.3.3 Cross-Validation and Bootstrap


The bootstrap is useful in providing estimators of the precision, or variability, of
statistics. It can also be useful for model validation. The bootstrap approach to
model validation is similar to the leave-one-out and k-fold validation procedures:
• Create a bootstrap sample by re-sampling (with replacement) 𝑛 indices in
{1, ⋯ , 𝑛}. That will be our training sample. Estimate the model under
consideration based on this sample.
• The test, or validation sample, consists of those observations not selected
for training. Evaluate the fitted model (based on the training data) using
the test data.

Repeat this process many (say 𝐵) times. Take an average over the results and
choose the model based on the average evaluation statistic.
Example 6.3.4. Wisconsin Property Fund. Return to Example 6.3.1 where
we investigate the fit of the gamma and Pareto distributions on the property
fund data. We again compare the predictive performance using the Kolmogorov-
Smirnov (KS) statistic but this time using the bootstrap procedure to split the
data between training and testing samples. The following provides illustrative
code.
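A sketch of that code is below; as in the earlier k-fold sketch, the claims vector and the actuar Pareto functions are assumptions, and B = 100 matches the text.

library(MASS); library(actuar)
set.seed(2017)
B <- 100
n <- length(claims)
ks_boot <- matrix(NA, B, 2, dimnames = list(NULL, c("Gamma", "Pareto")))
for (b in 1:B) {
  idx   <- sample(1:n, n, replace = TRUE)    # bootstrap (training) indices
  train <- claims[idx]
  test  <- claims[-unique(idx)]              # observations never selected form the test set
  fg <- fitdistr(train, "gamma")
  nll <- function(p) -sum(dpareto(train, shape = exp(p[1]), scale = exp(p[2]), log = TRUE))
  fp <- optim(c(0, log(mean(train))), nll)
  ks_boot[b, "Gamma"]  <- ks.test(test, "pgamma", fg$estimate["shape"], fg$estimate["rate"])$statistic
  ks_boot[b, "Pareto"] <- ks.test(test, "ppareto", exp(fp$par[1]), exp(fp$par[2]))$statistic
}
colMeans(ks_boot)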
We did the sampling using 𝐵 = 100 replications. The average KS statistic
for the Pareto distribution was 0.058 compared to the average for the gamma
distribution, 0.262. This is consistent with earlier results and provides another
piece of evidence that the Pareto is a better model for these data than the
gamma.

6.4 Importance Sampling


Section 6.1 introduced Monte Carlo techniques using the inversion technique: to generate a random variable $X$ with distribution $F$, apply $F^{-1}$ to draws from a random generator that is uniform on the unit interval. What if we want to draw from $X$, conditional on $X \in [a, b]$?
One can use an accept-reject mechanism: draw 𝑥 from distribution 𝐹
• if 𝑥 ∈ [𝑎, 𝑏]: keep it (“accept”)
• if 𝑥 ∉ [𝑎, 𝑏]: draw another one (“reject”)
Observe that from 𝑛 values initially generated, we keep here only [𝐹 (𝑏)−𝐹 (𝑎)]⋅𝑛
draws, on average.
Example 6.4.1. Draws from a Normal Distribution. Suppose that we
draw from a normal distribution with mean 2.5 and variance 1, 𝑁 (2.5, 1), but
are only interested in draws greater than 𝑎 = 2 and less than 𝑏 = 4. That is, we
can only use the proportion 𝐹(4) − 𝐹(2) = Φ(4 − 2.5) − Φ(2 − 2.5) = 0.9332 − 0.3085 = 0.6247
of the draws. Figure ?? demonstrates that some draws lie within the
interval (2, 4) and some are outside.

Instead, one can draw according to the conditional distribution 𝐹 ⋆ defined as

$$F^{\star}(x) = \Pr(X \le x \,|\, a < X \le b) = \frac{F(x) - F(a)}{F(b) - F(a)}, \quad \text{for } a < x \le b.$$

Using the inverse transform method in Section 6.1.2, we have that the draw

𝑋 ⋆ = 𝐹 ⋆−1 (𝑈 ) = 𝐹 −1 (𝐹 (𝑎) + 𝑈 ⋅ [𝐹 (𝑏) − 𝐹 (𝑎)])



has distribution 𝐹 ⋆ . Expressed another way, define

𝑈̃ = (1 − 𝑈 ) ⋅ 𝐹 (𝑎) + 𝑈 ⋅ 𝐹 (𝑏)

and then use 𝐹 −1 (𝑈̃ ). With this approach, each draw counts.
This can be related to the importance sampling mechanism: we draw more
frequently in regions where we expect to find quantities of interest.
This transform can be considered as a “change of measure.”
In Example 6.4.1, the inverse of the normal distribution is readily available (in
R, the function is qnorm). However, for other applications, this is not the case.
Then, one simply uses numerical methods to determine 𝑋 ⋆ as the solution of
the equation 𝐹 (𝑋 ⋆ ) = 𝑈̃ where 𝑈̃ = (1 − 𝑈 ) ⋅ 𝐹 (𝑎) + 𝑈 ⋅ 𝐹 (𝑏). See the following
illustrative code.
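A sketch of such code is shown here, using the Example 6.4.1 settings; the closed-form route uses qnorm, while the general route solves F(X*) = U_tilde numerically with uniroot.

set.seed(2017)                                      # illustrative seed
a <- 2; b <- 4
U <- runif(5)
U_tilde <- (1 - U) * pnorm(a, 2.5, 1) + U * pnorm(b, 2.5, 1)
X_star  <- qnorm(U_tilde, mean = 2.5, sd = 1)       # closed-form inverse
# numerical alternative when the quantile function is not available:
X_num <- sapply(U_tilde, function(u)
  uniroot(function(x) pnorm(x, 2.5, 1) - u, lower = a, upper = b)$root)
cbind(X_star, X_num)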

6.5 Monte Carlo Markov Chain (MCMC)

This section is being written and is not yet complete nor edited. It
is here to give you a flavor of what will be in the final version.

The idea of Monte Carlo techniques relies on the law of large numbers (that
ensures the convergence of the average towards the integral) and the central
limit theorem (that is used to quantify uncertainty in the computations). Recall
that if (𝑋𝑖 ) is an iid sequence of random variables with distribution 𝐹 , then

$$\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n h(X_i) - \int h(x)\,dF(x)\right) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \sigma^2), \text{ as } n \to \infty,$$

for some variance $\sigma^2 > 0$. But actually, the ergodic theorem can be used to
weaken the assumptions of the previous result, since it is not necessary to have independence of the
variables. More precisely, if (𝑋𝑖 ) is a Markov Process with invariant measure 𝜇,
under some additional technical assumptions, we can obtain that

$$\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n h(X_i) - \int h(x)\,d\mu(x)\right) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \sigma_{\star}^2), \text{ as } n \to \infty,$$

for some variance 𝜎⋆2 > 0.


Hence, from this property, we see that it is not necessary to generate independent values from $F$; instead, we can generate a Markov process with invariant measure $F$ and consider averages over that process (whose values are not necessarily independent).

Consider the case of a constrained Gaussian vector: we want to generate random pairs from a random vector $X$, but we are interested only in the case where the sum of the components is large enough, which can be written $X^T \mathbf{1} > m$ for some real valued $m$. Of course, it is possible to use the accept-reject algorithm, but we have seen that it might be quite inefficient. One can use the Metropolis-Hastings algorithm and the Gibbs sampler to generate a Markov process with such an invariant measure.

6.5.1 Metropolis Hastings


The algorithm is rather simple to generate from 𝑓: we start with a feasible value
𝑥1 . Then, at step 𝑡, we need to specify a transition kernel : given 𝑥𝑡 , we need
a conditional distribution for 𝑋𝑡+1 given 𝑥𝑡 . The algorithm will work well if
that conditional distribution can easily be simulated. Let 𝜋(⋅|𝑥𝑡 ) denote that
probability.

Draw a potential value $x^{\star}_{t+1}$ from the transition kernel $\pi(\cdot|x_t)$ and draw $u$ from a uniform (0,1) distribution. Compute the ratio
$$r = \frac{f(x^{\star}_{t+1})}{f(x_t)}$$

and

• if $u < r$, then set $x_{t+1} = x^{\star}_{t+1}$
• if $u \ge r$, then set $x_{t+1} = x_t$

Here $r$ is called the acceptance ratio: we accept the new value with probability $\min(1, r)$ (since $r$ can exceed 1).

For instance, assume that the proposal $\pi(\cdot|x_t)$ is uniform on $[x_t - \varepsilon, x_t + \varepsilon]$ for some $\varepsilon > 0$, and that $f$ (our target distribution) is the $\mathcal{N}(0, 1)$ density. We will never draw directly from $f$, but we will use it to compute our acceptance ratio at each step.
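A sketch of such code follows; it is written to match the object names referred to in the text (vec for the chain and innov for the innovations), while the chain length, epsilon, and seed are illustrative assumptions.

set.seed(2017)
n_iter  <- 1000
epsilon <- 1
vec     <- numeric(n_iter)
vec[1]  <- 0                                   # feasible starting value
innov   <- runif(n_iter, -epsilon, epsilon)    # uniform proposal innovations
for (t in 2:n_iter) {
  candidate <- vec[t - 1] + innov[t]           # proposal, uniform on [x - eps, x + eps]
  r <- min(1, dnorm(candidate) / dnorm(vec[t - 1]))   # acceptance ratio for N(0,1) target
  vec[t] <- if (runif(1) < r) candidate else vec[t - 1]
}
head(vec)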

In the code above, vec contains values of 𝑥 = (𝑥1 , 𝑥2 , ⋯), innov is the innova-
tion.

Now, if we use more simulations, we get a closer approximation to the target distribution.

6.5.2 Gibbs Sampler


Consider some vector $X = (X_1, \cdots, X_d)$ with independent components, $X_i \sim \mathcal{E}(\lambda_i)$. We seek to sample from $X$ given $X^T \mathbf{1} > s$ for some threshold $s > 0$.

• start with some feasible point $\mathbf{x}^0$,
• pick (randomly) an index $i \in \{1, \cdots, d\}$,
• $X_i$ given $X_i > s - \mathbf{x}_{(-i)}^T \mathbf{1}$ is, by the memoryless property, that threshold plus an Exponential $\mathcal{E}(\lambda_i)$ variable,
• so draw $Y \sim \mathcal{E}(\lambda_i)$ and set $x_i = y + (s - \mathbf{x}_{(-i)}^T \mathbf{1})_+$, which ensures $\mathbf{x}_{(-i)}^T \mathbf{1} + x_i > s$.

[Figure: scatter plot of the simulated pairs, sim[,1] (horizontal axis) versus sim[,2] (vertical axis)]

The construction of the sequence (MCMC algorithms are iterative) can be visualized below.

6.6 Further Resources and Contributors


• Include historical references for jackknife (Quenouille, Tukey, Efron)
• Here are some links to learn more about reproducibility and randomness
and how to go from a random generator to a sample function.

Contributors
• Arthur Charpentier, Université du Québec à Montréal, and Edward
W. (Jed) Frees, University of Wisconsin-Madison, are the principal au-
thors of the initial version of this chapter. Email: [email protected]
and/or [email protected] for chapter comments and sug-
gested improvements.
• Chapter reviewers include Yvonne Chueh and Brian Hartman. Write Jed
or Arthur to add your name here.

6.6.1 TS 6.A. Bootstrap Applications in Predictive Mod-


eling

This section is under construction.


Chapter 7

Premium Foundations

Chapter Preview. Setting prices for insurance products, premiums, is an impor-


tant task for actuaries and other data analysts. This chapter introduces the
foundations for pricing non-life products.

7.1 Introduction to Ratemaking

In this section, you learn how to:


• Describe expectations as a baseline method for determining insurance pre-
miums
• Analyze an accounting equation for relating premiums to losses, expenses
and profits
• Summarize a strategy for extending pricing to include heterogeneous risks
and trends over time

This chapter explains how you can think about determining the appropriate
price for an insurance product. As described in Section 1.2, one of the core
actuarial functions is ratemaking, where the analyst seeks to determine the
right price for a risk.
As this is a core function, let us first take a step back to define terms. A price
is a quantity, usually of money, that is exchanged for a good or service. In
insurance, we typically use the word premium for the amount of money charged
for insurance protection against contingent events. The amount of protection
varies by risk being insured. For example, in homeowners insurance the amount
of insurance protection depends on the value of the house. In life insurance, the
amount of protection depends on a policyholder’s financial status (e.g. income


and wealth) as well as a perceived need for financial security. So, it is common to
express insurance prices as a unit of the protection being purchased, for example,
a price per thousand dollars of coverage on a home or benefit in the event of
death. These prices/premiums are known as rates because they are expressed
in standardized units,

To determine premiums, it is common in economics to consider the supply and


demand of a product. The demand is sensitive to price as well as the existence
of competing firms and substitute products. The supply is also sensitive to price
as well as the resources required for production. For the individual firm, the
price is set to meet some objective such as profit maximization which is met by
choosing the output level that balances costs and revenues at the margins.

However, a peculiarity of insurance is that the costs of insurance protection


are not known at the sale of the contract. If the insured contingent event,
such as the loss of a house or life, does not occur, then the contract costs are
only administrative (to set up the contract) and are relatively minor. If the
insured event occurs, then the cost includes not only administrative costs but
also payment of the amount insured and expenses to settle claims. So, the
cost is random; when the contract is written, by design neither the insurer nor
the insured knows the contract costs. Moreover, costs may not be revealed
for months or years. For example, a typical time to settlement in medical
malpractice is five years.

Because costs are unknown at the time of sale, insurance pricing differs from
common economic approaches. This chapter squarely addresses the uncertain
nature of costs by introducing traditional actuarial approaches that determine
prices as a function of insurance costs. As we will see, this pricing approach is
sufficient for some insurance markets such as personal automobile or homeown-
ers where the insurer has a portfolio of many independent risks. However, there
are other insurance markets where actuarial prices only provide an input to gen-
eral market prices. To reinforce this distinction, actuarial cost-based premiums
are sometimes known as technical prices. From the perspective of economists,
corporate decisions such as pricing are to be evaluated with reference to their
impact on the firm’s market value. This objective is more comprehensive than
the static notion of profit maximization. That is, you can think of the value
of the firm as the capitalized value of all future expected profits. Decisions im-
pacting this value in turn affect all groups having claims on the firm, including
stockholders, bondholders, policyowners (in the case of mutual companies), and
so forth.

For cost-based prices, it is helpful to think of a premium as revenue source that


provides for claim payments, contract expenses, and an operating margin. We
formalize this in an accounting equation

Premium = Loss + Expense + UW Profit. (7.1)



The Expense term can be split into those that vary by premium (such as
sales commissions) and those that do not (such as building costs and employee
salaries). The term UW Profit is a residual that stands for underwriting profit.
It may also include a cost of capital (for example, an annual dividend to
company investors). Because fixed expenses and costs of capital are difficult to
interpret for individual contracts, we think of the equation (7.1) relationship as
holding over the sum of many contracts (a portfolio) and work with it in aggre-
gate. Then, in Section 7.2 we use this approach to help us think about setting
premiums, for example by setting profit objectives. Specifically, Sections 7.2.1
and 7.2.2 introduce two prevailing methods used in practice for determining
premiums, the pure premium and the loss ratio methods.
The Loss in equation (7.1) is random and so, as a baseline, we use the expected
costs to determine rates. There are several ways to motivate this perspective
that we expand upon in Section 7.3. For now, we will suppose that the insurer
enters into many contracts with risks that are similar except, by pure chance,
in some cases the insured event occurs and in others it does not. The insurer
is obligated to pay the total amount of claim payments for all contracts. If
risks are similar, then all policyholders are equally likely to contribute to the
total loss. So, from this perspective, it makes sense to look at the average claim
payment over many insureds. From probability theory, specifically the law of
large numbers, we know that the average of iid risks is close to the expected
amount, so we use the expectation as a baseline pricing principle.
Nonetheless, by using expected losses, we essentially assume that the uncertainty
is non-existent. If the insurer sells enough independent policies, this may be a
reasonable approximation. However, there will be other cases, such as a single
contract issued to a large corporation to insure all of its buildings against fire
damage, where the use of only an expectation for pricing is not sufficient. So,
Section 7.3 also summarizes alternative premium principles that incorporate
uncertainty into our pricing. Note that an emphasis of this text is estimation of
the entire distribution of losses so the analyst is not restricted to working only
with expectations.
The aggregate methods derived from equation (7.1) focus on collections of ho-
mogeneous risks that are similar except for the occurrence of random losses. In
statistical language that we have introduced, this is a discussion about risks
that have identical distributions. Naturally, when examining risks that insur-
ers work with, there are many variations in the risks being insured including
the features of the contracts and the people being insured. Section 7.4 extends
pricing considerations to heterogeneous collections of risks.
Section 7.5 introduces development and trending. When developing rates, we
want to use the most recent loss experience because the goal is to develop rates
that are forward looking. However, at contract initiation, recent loss experience
is often not known; it may be several years until it is fully realized. So, this
section introduces concepts needed for incorporating recent loss experience into
our premium development. Development and trending of experience is related to

but also differs from the idea of experience rating that suggests that experience
reveals hidden information about the insured and so should be incorporated in
our forward thinking viewpoint. Chapter 9 discusses this idea in more detail.

The final section of this chapter introduces methods for selecting a premium.
This is done by comparing a premium rating method to losses from a held-out
portfolio and selecting the method that produces the best match with the held-
out data. For a typical insurance portfolio, most policies produce zero losses,
that is, do not have a claim. Because the distribution of held-out losses is
a combination of (a large number of) zeros and continuous amounts, special
techniques are useful. Section 7.6 introduces concepts of concentration curves
and corresponding Gini statistics to help in this selection.

The chapter also includes a technical supplement on government regulation of insurance rates to keep our work grounded in applications.

7.2 Aggregate Ratemaking Methods

In this section, you learn how to:

• Define a pure premium as a loss cost as well as in terms of frequency and severity
• Calculate an indicated rate using pure premiums, expenses, and profit
loadings
• Define a loss ratio
• Calculate an indicated rate change using loss ratios
• Compare the pure premium and loss ratio methods for determining pre-
miums

It is common to consider an aggregate portfolio of insurance experience. Consistent with earlier notation, consider a collection of 𝑛 contracts with losses
𝑋1 , … , 𝑋𝑛 . In this section, we assume that contracts have the same loss distri-
bution, that is they form a homogeneous portfolio, and so are iid. For motiva-
tion, you can think about personal insurance such as auto or homeowners where
insurers write many contracts on risks that appear very similar. Further, the
assumption of identical distributions is not as limiting as you might think. In
Section 7.4.1 we will introduce the idea of an exposure variable that allows us to
rescale experience to make it comparable. For example, by rescaling losses we
will be able to treat homeowner losses from a house with 80,000 insurable value
and a house with a 320,000 insurable value as coming from the same distribution.
For now, we simply assume that 𝑋1 , … , 𝑋𝑛 are iid.

7.2.1 Pure Premium Method


If the number of policies in a collection, 𝑛, is large, then the average provides a
good approximation of the expected loss
$$
\mathrm{E}(X) \approx \frac{\sum_{i=1}^n X_i}{n} = \frac{\text{Loss}}{\text{Exposure}} = \text{Pure Premium}.
$$
With this as motivation, we define the pure premium to be the sum of losses
divided by the exposure; it is also known as a loss cost. In the case of homo-
geneous risks, all policies are treated the same and we can use the number of
policies 𝑛 for the exposure. In Section 7.4.1 we extend the concept of exposure
when policies are not the same.
We can multiply and divide by the number of claims, claim count, to get
$$
\text{Pure Premium} = \frac{\text{claim count}}{\text{Exposure}} \times \frac{\text{Loss}}{\text{claim count}} = \text{frequency} \times \text{severity}.
$$

So, when premiums are determined using the pure premium method, we either
take the average loss (loss cost) or use the frequency-severity approach.
To get a bit closer to applications in practice, we now return to equation (7.1)
that includes expenses. Equation (7.1) also refers to UW Profit for underwriting
profit. When rescaled by premiums, this is known as the profit loading. Because
claims are uncertain, the insurer must hold capital to ensure that all claims are
paid. Holding this extra capital is a cost of doing business, investors in the
company need to be compensated for this, thus the extra loading.
We now decompose Expenses into those that vary by premium, Variable, and
those that do not, Fixed so that Expenses = Variable + Fixed. Thinking of
variable expenses and profit as a fraction of premiums, we define

$$
V = \frac{\text{Variable}}{\text{Premium}} \quad \text{and} \quad Q = \frac{\text{UW Profit}}{\text{Premium}}.
$$

With these definitions and equation (7.1), we may write


Premium = Losses + Fixed + Premium × Variable + UW Profit
Premium
= Losses + Fixed + Premium × (𝑉 + 𝑄).

Solving for premiums yields

$$
\text{Premium} = \frac{\text{Losses} + \text{Fixed}}{1 - V - Q}. \qquad (7.2)
$$

Dividing by exposure, the rate can be calculated as

$$
\begin{aligned}
\text{Rate} = \frac{\text{Premium}}{\text{Exposure}} &= \frac{\text{Losses}/\text{Exposure} + \text{Fixed}/\text{Exposure}}{1 - V - Q} \\
&= \frac{\text{Pure Premium} + \text{Fixed}/\text{Exposure}}{1 - V - Q}.
\end{aligned}
$$

In words, this is

$$
\text{Rate} = \frac{\text{pure premium} + \text{fixed expense per exposure}}{1 - \text{variable expense factor} - \text{profit and contingencies factor}}.
$$

Example. CAS Exam 5, 2004, Number 13. Determine the indicated rate
per exposure unit, given the following information:
• Frequency per exposure unit = 0.25
• Severity = $100
• Fixed expense per exposure unit = $10
• Variable expense factor = 20%
• Profit and contingencies factor = 5%
Solution. Under the pure premium method, the indicated rate is

$$
\begin{aligned}
\text{Rate} &= \frac{\text{pure premium} + \text{fixed expense per exposure}}{1 - \text{variable expense factor} - \text{profit and contingencies factor}} \\
&= \frac{\text{frequency} \times \text{severity} + 10}{1 - 0.20 - 0.05} = \frac{0.25 \times 100 + 10}{1 - 0.20 - 0.05} = 46.67.
\end{aligned}
$$
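For readers who want to verify the arithmetic, here is a minimal R sketch of this calculation (the variable names are ours, not from the exam question).

```r
# Pure premium method: indicated rate for the example above
frequency <- 0.25   # claims per exposure unit
severity  <- 100    # average claim size
fixed     <- 10     # fixed expense per exposure unit
V <- 0.20           # variable expense factor
Q <- 0.05           # profit and contingencies factor

pure_premium <- frequency * severity
rate <- (pure_premium + fixed) / (1 - V - Q)
rate
#> [1] 46.66667
```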

From the example, note that the rates produced by the pure premium method
are commonly known as indicated rates.
From our development, note also that the profit is associated with the under-
writing aspect of the contract and not investments. Premiums are typically
paid at the beginning of a contract and insurers receive investment income from
holding this money. However, due in part to the short-term nature of the con-
tracts, investment income is typically ignored in pricing. This builds a bit of
conservatism into the process that insurers welcome. It is probably most rel-
evant in the very long “tail” lines such as workers’ compensation and medical
malpractice. In these lines, it can sometimes take 20 years or even longer to set-
tle claims. But, these are also the most volatile lines with some claim amounts
being large relative to the rest of the distribution. The mitigating factor is that
these large claim amounts tend to be far in the future and so are less extreme
when viewed in a discounted sense.

7.2.2 Loss Ratio Method


The loss ratio is the ratio of the sum of losses to the premium

$$
\text{Loss Ratio} = \frac{\text{Loss}}{\text{Premium}}.
$$

When determining premiums, it is a bit counter-intuitive to emphasize this ratio because the premium component is built into the denominator. As we will see,
the loss ratio method develops rate changes rather than rates; we can use
rate changes to update past experience to get a current rate. To do this, rate
changes consist of the ratio of the experience loss ratio to the target loss ratio.

This adjustment factor is then applied to current rates to get new indicated
rates.
To see how this works in a simple context, let us return to equation (7.1) but now
ignore expenses to get Premium = Losses + UW Profit. Dividing by premiums
yields

$$
\frac{\text{UW Profit}}{\text{Premium}} = 1 - LR = 1 - \frac{\text{Loss}}{\text{Premium}}.
$$
Suppose that we have in mind a new “target” profit loading, say 𝑄𝑡𝑎𝑟𝑔𝑒𝑡 . As-
suming that losses, exposure, and other things about the contract stay the same,
then to achieve the new target profit loading we adjust the premium. We use ICF for the indicated change factor, which is defined through the expression

$$
\frac{\text{New UW Profit}}{\text{Premium}} = Q_{target} = 1 - \frac{\text{Loss}}{ICF \times \text{Premium}}.
$$
Solving for ICF, we get

$$
ICF = \frac{\text{Loss}}{\text{Premium} \times (1 - Q_{target})} = \frac{LR}{1 - Q_{target}}.
$$

So, for example, if we have a current loss ratio = 85% and a target profit load-
ing 𝑄𝑡𝑎𝑟𝑔𝑒𝑡 = 0.20, then 𝐼𝐶𝐹 = 0.85/0.80 = 1.0625, meaning that we increase
premiums by 6.25%.
Now let’s see how this works with expenses in equation (7.1). We can use the
same development as in Section 7.2.1 and so start with equation (7.2), solve for
the profit loading to get

$$
Q = 1 - \frac{\text{Loss} + \text{Fixed}}{\text{Premium}} - V.
$$
We interpret the quantity Fixed /Premium + V as the “operating expense ratio.”
Now, fix the profit percentage Q at a target and adjust premiums through the
“indicated change factor” 𝐼𝐶𝐹
$$
Q_{target} = 1 - \frac{\text{Loss} + \text{Fixed}}{\text{Premium} \times ICF} - V.
$$
Solving for 𝐼𝐶𝐹 yields

$$
\begin{aligned}
ICF &= \frac{\text{Loss} + \text{Fixed}}{\text{Premium} \times (1 - V - Q_{target})} \\
&= \frac{\text{Loss Ratio} + \text{Fixed Expense Ratio}}{1 - V - Q_{target}}.
\end{aligned} \qquad (7.3)
$$

Example. Loss Ratio Indicated Change Factor. Assume the following information:
• Projected ultimate loss and LAE ratio = 65%

• Projected fixed expense ratio = 6.5%
• Variable expense = 25%
• Target UW profit = 10%
With these assumptions and equation (7.3), the indicated change factor can be calculated as
$$
ICF = \frac{(\text{Losses} + \text{Fixed})/\text{Premium}}{1 - V - Q_{target}} = \frac{0.65 + 0.065}{1 - 0.25 - 0.10} = 1.10.
$$

This means that the overall average rate level should be increased by 10%.
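As a check, the same calculation can be done in a short R sketch (with our own variable names).

```r
# Loss ratio method: indicated change factor for the example above
loss_LAE_ratio <- 0.65    # projected ultimate loss and LAE ratio
fixed_ratio    <- 0.065   # projected fixed expense ratio
V              <- 0.25    # variable expense factor
Q_target       <- 0.10    # target UW profit

ICF <- (loss_LAE_ratio + fixed_ratio) / (1 - V - Q_target)
ICF
#> [1] 1.1
```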

We later provide a comparison of the pure premium and loss ratio methods
in Section 7.5.3. As inputs, that section will require discussions of trended
exposures and on-level premiums defined in Section 7.5.

7.3 Pricing Principles

In this section, you learn how to:


• Describe common actuarial pricing principles
• Describe properties of pricing principles
• Choose a pricing principle based on a desired property

Approaches to pricing vary by the type of contract. For example, personal automobile is a widely available product throughout the world and is known
as part of the retail general insurance market in the United Kingdom. Here,
one can expect to do pricing based on a large pool of independent contracts, a
situation in which expectations of losses provide an excellent starting point. In
contrast, an actuary may wish to price an insurance contract issued to a large
employer that covers complex health benefits for thousands of employees. In this
example, knowledge of the entire distribution of potential losses, not just the
expected value, is critical for starting the pricing negotiations. To cover a range
of potential applications, this section describes general premium principles and
their properties that one can use to decide whether or not a specific principle is
applicable in a given situation.

7.3.1 Premium Principles


This chapter introduces traditional actuarial pricing principles that provide a
price based only on the insurance loss distribution; the price does not depend on
the demand for insurance or other aspects of the costs such as expenses. Assume
that the loss 𝑋 has distribution function 𝐹 (⋅) and that there exists some rule

(which in mathematics is known as a functional), say 𝐻, that takes 𝐹 (⋅) into the positive real line, denoted as 𝑃 = 𝐻(𝐹 ). For notation purposes, it is often
convenient to substitute the random variable 𝑋 for its distribution function and
write 𝑃 = 𝐻(𝑋). Table 7.1 provides several examples.
Table 7.1. Common Premium Principles

| Description | Definition $H(X)$ |
|---|---|
| Net (pure) premium | $\mathrm{E}[X]$ |
| Expected value | $(1+\alpha)\mathrm{E}[X]$ |
| Standard deviation | $\mathrm{E}[X] + \alpha\,SD(X)$ |
| Variance | $\mathrm{E}[X] + \alpha\,\mathrm{Var}(X)$ |
| Zero utility | solution of $u(w) = \mathrm{E}[u(w + P - X)]$ |
| Exponential | $\frac{1}{\alpha}\log \mathrm{E}[e^{\alpha X}]$ |

A premium principle is similar to a risk measure that is introduced in Section 10.3. Mathematically, both are rules that map the loss rv of interest to a
numerical value. From a practical viewpoint, a premium principle provides a
guide as to how much an insurer will charge for accepting a risk 𝑋. In contrast,
a risk measure quantifies the level of uncertainty, or riskiness, that an insurer
can use to decide on a capital level to be assured of remaining solvent.
The net, or pure, premium essentially assumes no uncertainty. The expected
value, standard deviation, and variance principles each add an explicit loading
for uncertainty through the risk parameter 𝛼 ≥ 0. For the principle of zero
utility, we think of an insurer with utility function 𝑢(⋅) and wealth w as being
indifferent to accepting and not accepting risk 𝑋. In this case, 𝑃 is known as an
indifference price or, in economics, a reservation price. With exponential utility, that is, assuming $u(x) = (1 - e^{-\alpha x})/\alpha$, the principle of zero utility reduces to the exponential premium principle.
For small values of the risk parameters, the variance principle is approximately equal to the exponential premium principle, as illustrated in the following special case.

Special Case: Gamma Distribution. Consider a loss that is gamma distributed with parameters $\eta$ and $\theta$ (we usually use $\alpha$ for the shape parameter but, to distinguish it from the risk parameter, for this example we call it $\eta$). From the Appendix Chapter 18, the mean is $\eta\theta$ and the variance is $\eta\theta^2$. Using $\alpha_{Var}$ for the risk parameter, the variance premium is $H_{Var}(X) = \eta\theta + \alpha_{Var}(\eta\theta^2)$. From this appendix, it is straightforward to derive the well-known moment generating function, $M(t) = \mathrm{E}[e^{tX}] = (1 - t\theta)^{-\eta}$. With this and a risk parameter $\alpha_{Exp}$, we may express the exponential premium as

$$
H_{Exp}(X) = \frac{-\eta}{\alpha_{Exp}} \log\left(1 - \alpha_{Exp}\theta\right).
$$

To see the relationship between $H_{Exp}(X)$ and $H_{Var}(X)$, we choose $\alpha_{Exp} = 2\alpha_{Var}$. With an approximation from calculus ($\log(1-x) = -x - x^2/2 - x^3/3 - \cdots$), we write
$$
\begin{aligned}
H_{Exp}(X) &= \frac{-\eta}{\alpha_{Exp}} \log\left(1 - \alpha_{Exp}\theta\right) = \frac{-\eta}{\alpha_{Exp}}\left\{-\alpha_{Exp}\theta - (\alpha_{Exp}\theta)^2/2 - \cdots\right\} \\
&\approx \eta\theta + \frac{\alpha_{Exp}}{2}\left(\eta\theta^2\right) = H_{Var}(X).
\end{aligned}
$$
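A quick numerical check of this approximation can be done in R. The shape, scale, and risk parameter values below are illustrative choices, not values from the text; the approximation is close when $\alpha_{Exp}\theta$ is small.

```r
# Variance versus exponential premium principles for a gamma loss
eta   <- 5        # gamma shape (illustrative)
theta <- 10       # gamma scale (illustrative)
alpha_var <- 0.001          # variance principle risk parameter
alpha_exp <- 2 * alpha_var  # matching exponential risk parameter

H_var <- eta * theta + alpha_var * (eta * theta^2)
H_exp <- (-eta / alpha_exp) * log(1 - alpha_exp * theta)
c(variance = H_var, exponential = H_exp)
#> about 50.50 versus 50.51 -- nearly identical for small risk parameters
```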

7.3.2 Properties of Premium Principles


Properties of premium principles help guide the selection of a premium principle
in applications. Table 7.2 provides examples of properties of premium principles.
Table 7.2. Common Properties of Premium Principles

| Description | Definition |
|---|---|
| Nonnegative loading | $H(X) \ge \mathrm{E}[X]$ |
| Additivity | $H(X_1 + X_2) = H(X_1) + H(X_2)$, for independent $X_1, X_2$ |
| Scale invariance | $H(cX) = cH(X)$, for $c \ge 0$ |
| Consistency | $H(c + X) = c + H(X)$ |
| No rip-off | $H(X) \le \max\{X\}$ |

This is simply a subset of the many properties quoted in the actuarial literature.
For example, the review paper of Young (2014) lists 15 properties. See also the
properties described as coherent axioms that we introduce for risk measures in
Section 10.3.
Some of the properties listed in Table 7.2 are mild in the sense that they will
nearly always be satisfied. For example, the no rip-off property indicates that
the premium charge will be no larger than the largest or “maximal” value of the loss 𝑋 (here, we use the notation max{𝑋} for this maximal value, which is defined as an “essential supremum” in mathematics). Other properties may not
be so mild. For example, for a portfolio of independent risks, the actuary may
want the additivity property to hold. It is easy to see that this property holds for
the expected value, variance, and exponential premium principles but not for the
standard deviation principle. Another example is the consistency property that
does not hold for the expected value principle when the risk loading parameter
𝛼 is positive.
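To illustrate the additivity discussion, here is a small R simulation (with illustrative exponential losses and an illustrative risk parameter) showing that the standard deviation principle prices two independent risks separately for more than it prices their sum.

```r
# The standard deviation principle is not additive for independent risks
set.seed(1)
alpha <- 0.5
H_sd <- function(x) mean(x) + alpha * sd(x)

X1 <- rexp(1e5, rate = 1 / 100)   # simulated independent losses
X2 <- rexp(1e5, rate = 1 / 200)

c(sum_of_prices = H_sd(X1) + H_sd(X2),   # roughly 450
  price_of_sum  = H_sd(X1 + X2))         # roughly 412, since SD(X1+X2) < SD(X1) + SD(X2)
```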
The scale invariance principle is known as homogeneity of degree one in eco-
nomics. For example, it allows us to work in different currencies (e.g., from
dollars to Euros) as well as a host of other applications and will be discussed
further in the following Section 7.4. Although a generally accepted principle,
we note that this principle does not hold for a large value of 𝑋 that may border
on a surplus constraint of an insurer; if an insurer has a large probability of
becoming insolvent, then that insurer may not wish to use linear pricing. It

is easy to check that this principle holds for the expected value and standard
deviation principles, although not for the variance and exponential principles.

7.4 Heterogeneous Risks

In this section, you learn how to:


• Describe insurance exposures in terms of scale distributions
• Explain an exposure in terms of common types of insurance such as auto
and homeowners insurance
• Describe how rating factors can be used to account for the heterogeneity
among risks in a collection
• Measure the impact of a rating factor through relativities

As noted in Section 7.1, there are many variations in the risks being insured,
the features of the contracts, and the people being insured. As an example, you
might have a twin brother or sister who works in the same town and earns a
roughly similar amount of money. Still, when it comes to selecting choices in
rental insurance to insure contents of your apartment, you can imagine differ-
ences in the amount of contents to be insured, choices of deductibles for the
amount of risk retained, and perhaps different levels of uncertainty given the
relative safety of your neighborhoods. People and risks that they insure are
different.
When thinking about a collection of different (heterogeneous) risks, one option
is to price all risks the same. This is common in government sponsored programs
for flood or health insurance. However, it is also common to have different prices
where the differences are commensurate with the risk being insured.

7.4.1 Exposure to Risk


One way to make heterogeneous risks comparable is through the concept of an
exposure. To explain exposures, let us use scale distributions that we learned
about in Chapter 3. To recall a scale distribution, suppose that 𝑋 has a para-
metric distribution and define a rescaled version 𝑅 = 𝑋/𝐸, 𝐸 > 0. If 𝑅 is in
the same parametric family as 𝑋, then the distribution is said to be a scale dis-
tribution. As we have seen, the gamma, exponential, and Pareto distributions
are examples of scale distributions.
Intuitively, the idea behind exposures is to make risks more comparable to
one another. For example, it may be that risks 𝑋1 , … , 𝑋𝑛 come from different
distributions and yet, with the choice of the right exposures, the rates 𝑅1 , … , 𝑅𝑛
come from the same distribution. Here, we interpret the rate 𝑅𝑖 = 𝑋𝑖 /𝐸𝑖 to be
the loss divided by exposure.

Table 7.3 provides a few examples. We remark that this table refers to “earned”
car and house years, concepts that will be explained in Section 7.5.
Table 7.3. Commonly used Exposures in Different Types of Insurance

| Type of Insurance | Exposure Basis |
|---|---|
| Personal Automobile | Earned Car Year, Amount of Insurance Coverage |
| Homeowners | Earned House Year, Amount of Insurance Coverage |
| Workers Compensation | Payroll |
| Commercial General Liability | Sales Revenue, Payroll, Square Footage, Number of Units |
| Commercial Business Property | Amount of Insurance Coverage |
| Physician’s Professional Liability | Number of Physician Years |
| Professional Liability | Number of Professionals (e.g., Lawyers or Accountants) |
| Personal Articles Floater | Value of Item |

An exposure is a type of rating factor, a concept that we define explicitly in the next Section 7.4.2. It is typically the most important rating factor, so important
that both premiums and losses are quoted on a “per exposure” basis.
For frequency and severity modeling, it is customary to think about the fre-
quency aspect as proportional to exposure and the severity aspect in terms of
loss per claim (not dependent upon exposure). However, this does not cover
the entire story. For many lines of business, it is convenient for exposures to be
proportional to inflation. Inflation is typically viewed as unrelated to frequency
but proportional to severity.

Criteria for Choosing an Exposure


An exposure base should meet the following criteria. It should:
• be an accurate measure of the quantitative exposure to loss
• be easy for the insurer to determine (at the time the policy is initiated)
and not subject to manipulation by the insured,
• be easy to understand by the insured and to calculate by the insurer,
• consider any preexisting exposure base established within the industry,
and
• for some lines of business, be proportional to inflation. In this way, rates are not sensitive to the changing value of money over time as these changes are captured in the exposure base.
To illustrate, consider personal automobile coverage. Instead of the exposure
basis “earned car year,” a more accurate measure of the quantitative exposure
to loss might be number of miles driven. Historically, this measure had been
difficult to determine at the time the policy is issued and subject to potential
manipulation by the insured and so it is still not typically used. Modern telematic devices that allow for accurate mileage recording are changing the use of this variable in some marketplaces.

As another example, the exposure measure in commercial business property, e.g. fire insurance, is typically the amount of insurance coverage. As property values grow with inflation, so will the amount of insurance coverage. Thus, rates quoted per amount of insurance coverage are less sensitive to inflation than otherwise.

7.4.2 Rating Factors


A rating factor, or rating variable, is simply a characteristic of the policyholder
or risk being insured by which rates vary. For example, when you purchase auto
insurance, it is likely that the insurer has rates that differ by age, gender, type of
car, where the car is garaged, accident history, and so forth. These variables are
known as rating factors. Although some variables may be continuous, such as
age, most are categorical; the term factor is a label used for categorical variables.
In fact, even with continuous variables such as age, it is common to categorize
them by creating groups such as “young,” “intermediate,” and “old” for rating
purposes.

Table 7.4 provides just a few examples. In many jurisdictions, the personal
insurance market (e.g., auto and homeowners) is very competitive - using 10 or
20 variables for rating purposes is not uncommon.

Table 7.4. Commonly used Rating Factors in Different Types of Insurance

| Type of Insurance | Rating Factors |
|---|---|
| Personal Automobile | Driver Age and Gender, Model Year, Accident History |
| Homeowners | Amount of Insurance, Age of Home, Construction Type |
| Workers Compensation | Occupation Class Code |
| Commercial General Liability | Classification, Territory, Limit of Liability |
| Medical Malpractice | Specialty, Territory, Limit of Liability |
| Commercial Automobile | Driver Class, Territory, Limit of Liability |

Example. Losses and Premium by Amount of Insurance and Territory. To illustrate, Table 7.5 presents a small fictitious data set from Werner
and Modlin (2016). The data consists of loss and loss adjustment expenses
(LossLAE), decomposed by three levels of amount of insurance (AOI ), and
three territories (Terr). For each combination of AOI and Terr, we also have
available the number of policies issued, given as the Exposure.

Table 7.5. Losses and Premium by Amount of Insurance and Territory



| AOI | Terr | Exposure | LossLAE | Premium |
|---|---|---|---|---|
| Low | 1 | 7 | 210.93 | 335.99 |
| Medium | 1 | 108 | 4,458.05 | 6,479.87 |
| High | 1 | 179 | 10,565.98 | 14,498.71 |
| Low | 2 | 130 | 6,206.12 | 10,399.79 |
| Medium | 2 | 126 | 8,239.95 | 12,599.75 |
| High | 2 | 129 | 12,063.68 | 17,414.65 |
| Low | 3 | 143 | 8,441.25 | 14,871.70 |
| Medium | 3 | 126 | 10,188.70 | 16,379.68 |
| High | 3 | 40 | 4,625.34 | 7,019.86 |
| Total | | 988 | 65,000.00 | 99,664.01 |

In this case, the rating factors AOI and Terr produce nine cells. Note that one might combine the cell “territory one with a low amount of insurance” with another cell because there are only 7 policies in that cell. Doing so is perfectly acceptable; considerations of this sort are one of the main jobs of the analyst. An
outline on selecting variables is in Chapter 8, including Technical Supplement
TS 8.B. Alternatively, you can also think about reinforcing information about
the cell (Terr 1, Low AOI ) by “borrowing” information from neighboring cells
(e.g., other territories with the same AOI, or other amounts of AOI within Terr
1). This is the subject of credibility that is introduced in Chapter 9.

To understand the impact of rating factors, it is common to use relativities. A relativity compares the expected risk at a specific level of a rating factor to an
accepted baseline value. In this book, we work with relativities defined through
ratios; it is also possible to define relativities through arithmetic differences.
Thus, our relativity is defined as

$$
\text{Relativity}_j = \frac{(\text{Loss}/\text{Exposure})_j}{(\text{Loss}/\text{Exposure})_{Base}}.
$$

Example. Losses and Premium by Amount of Insurance and Territory - Continued. Traditional classification methods consider only one classification
variable at a time - they are univariate. Thus, if we wanted relativities for losses
and expenses (LossLAE) by amount of insurance, we might sum over territories
to get the information displayed in Table 7.6.
Table 7.6. Losses and Relativities by Amount of Insurance

| AOI | Exposure | LossLAE | Loss/Exp | Relativity |
|---|---|---|---|---|
| Low | 280 | 14,858.3 | 53.065 | 0.835 |
| Medium | 360 | 22,886.7 | 63.574 | 1.000 |
| High | 348 | 27,255.0 | 78.319 | 1.232 |
| Total | 988 | 65,000.0 | | |

Thus, losses and expenses per unit of exposure are 23.2% higher for risks with
a high amount of insurance compared to those with a medium amount. These
relativities do not control for territory.
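The following R sketch reproduces Table 7.6 from the Table 7.5 data (using base R only; the data frame below simply re-enters the table values).

```r
# Relativities by amount of insurance, reproducing Table 7.6
dat <- data.frame(
  AOI  = rep(c("Low", "Medium", "High"), times = 3),
  Terr = rep(1:3, each = 3),
  Exposure = c(7, 108, 179, 130, 126, 129, 143, 126, 40),
  LossLAE  = c(210.93, 4458.05, 10565.98, 6206.12, 8239.95, 12063.68,
               8441.25, 10188.70, 4625.34)
)
byAOI <- aggregate(cbind(Exposure, LossLAE) ~ AOI, data = dat, FUN = sum)
byAOI$LossPerExp <- byAOI$LossLAE / byAOI$Exposure
byAOI$Relativity <- byAOI$LossPerExp /
  byAOI$LossPerExp[byAOI$AOI == "Medium"]   # medium AOI is the base level
byAOI
#> relativities of about 0.835 (Low), 1.000 (Medium), and 1.232 (High)
```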

The introduction of rating factors allows the analyst to create cells that define
small collections of risks – the goal is to choose the right combination of rating
factors so that all risks within a cell may be treated the same. In statistical
terminology, we want all risks within a cell to have the same distribution (subject
to rescaling by an exposure variable). This is the foundation of insurance pricing.
All risks within a cell have the same price per exposure yet risks from different
cells may have different prices.
Said another way, insurers are allowed to charge different rates for different
risks; discrimination of risks is legal and routinely done. Nonetheless, the basis
of discrimination, the choice of risk factors, is the subject of extensive debate.
The actuarial community, insurance management, regulators, and consumer ad-
vocates are all active participants in this debate. Technical Supplement TS 7.A
describes these issues from a regulatory perspective.
In addition to statistical criteria for assessing the significance of a rating factor,
analysts must pay attention to business concerns of the company (e.g., is it
expensive to implement a rating factor?), social criteria (is a variable under the
control of a policyholder?), legal criteria (are there regulations that prohibit
the use of a rating factor such as gender?), and other societal issues. These
questions are largely beyond the scope of this text. Nonetheless, because they
are so fundamental to pricing of insurance, a brief overview is given in Chapter
8, including Technical Supplement TS 8.B.

7.5 Development and Trending

In this section, you learn how to:


• Define and calculate different types of exposure and premium summary
measures that appear in financial reports
• Describe the development of a claim over several payments and link that
to various unpaid claim measures, including incurred but not reported
(IBNR) as well as case reserves
• Compare and contrast relative strengths and weaknesses of the pure pre-
mium and loss ratio methods for ratemaking

As we have seen in Section 7.2, insurers consider aggregate information for ratemaking such as exposures to risk, premiums, expenses, claims, and payments. This aggregate information is also useful for managing an insurer's activities; financial reports are commonly created at least annually and oftentimes
quarterly. At any given financial reporting date, information about recent poli-
cies and claims will be ongoing and necessarily incomplete; this section intro-
duces concepts for projecting risk information so that it is useful for ratemak-
ing purposes. Information about the risks, such as exposures, premium, claim
counts, losses, and rating factors, is typically organized into three databases:
• policy database - contains information about the risk being insured, the
policyholder, and the contract provisions
• claims database - contains information about each claim; these are linked
to the policy database.
• payment database - contains information on each claims transaction, typically payments but may also include changes to case reserves. These are linked to the claims database.
With these detailed databases, it is straightforward (in principle) to sum up
policy level detail to aggregate information needed for financial reports. This
section describes various summary measures commonly used.

7.5.1 Exposures and Premiums


A financial reporting period is a length of time that is fixed in the calendar;
we use January 1 to December 31 for the examples in this book although other
reporting periods are also common. The reporting period is fixed but policies
may begin at any time during the year. Even if all policies have a common
contract length of (say) one year, because of the differing starting time, they
can end at any time during the financial reporting period. Figure 7.1 presents four illustrative policies. Because of these differing start and end times, there need to be standards as to what types of measures are most useful for summarizing experience in a given reporting period.
Figure 7.1: Timeline of Exposures for Four 12-Month Policies

Some commonly used exposure measures are:


• written exposures, the amount of exposures on policies issued (underwrit-
ten or written) during the period in question,
• earned exposures, the exposure units actually exposed to loss during the
period, that is, where coverage has already been provided

• unearned exposures, the portion of the written exposures for which coverage has not yet been provided as of that point in time, and
• in force exposures, exposure units exposed to loss at a given point in time.
Table 7.7 gives detailed calculations for the four illustrative policies.
Table 7.7. Exposures for Four 12-Month Policies

In-Force
Effective Written Exposure Earned Exposure Unearned Exposure Exposure
𝑃 𝑜𝑙𝑖𝑐𝑦 Date 1/1/2019 1/1/2020 1/1/2019 1/1/2020 1/1/2019 1/1/2020 1/1/2020
A 1 Jan 2019 1.00 0.00 1.00 0.00 0.00 0.00 0.00
B 1 April 2019 1.00 0.00 0.75 0.25 0.25 0.00 1.00
C 1 July 2019 1.00 0.00 0.50 0.50 0.50 0.00 1.00
D 1 Oct 2019 1.00 0.00 0.25 0.75 0.75 0.00 1.00
𝑇 𝑜𝑡𝑎𝑙 4.00 0.00 2.50 1.50 1.50 0.00 3.00
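A short R sketch shows how the earned exposure entries can be computed from effective dates; it counts days of coverage falling in each calendar year (Table 7.7 uses exact quarter-years, so the values agree after rounding).

```r
# Earned exposure by calendar year for the four illustrative 12-month policies
effective <- as.Date(c("2019-01-01", "2019-04-01", "2019-07-01", "2019-10-01"))
expiry    <- effective + 365

earned_in <- function(period_start, period_end, eff, xpr) {
  days_overlap <- as.numeric(pmin(xpr, period_end) - pmax(eff, period_start),
                             units = "days")
  pmax(days_overlap, 0) / as.numeric(xpr - eff, units = "days")
}

data.frame(policy = LETTERS[1:4],
           earned_2019 = round(earned_in(as.Date("2019-01-01"), as.Date("2020-01-01"),
                                         effective, expiry), 2),
           earned_2020 = round(earned_in(as.Date("2020-01-01"), as.Date("2021-01-01"),
                                         effective, expiry), 2))
#> earned_2019: 1.00, 0.75, 0.50, 0.25; earned_2020: 0.00, 0.25, 0.50, 0.75
```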

This summarization is sometimes known as the calendar year method of aggregation, in contrast to the policy year method. In the policy year method, all premiums and losses for policies written in the policy year are aggregated regardless of when earned. For example, policies A, B, C, and D each have written and earned exposures of 1.00 for policy year 2019 (starting on 1/1 or 1 Jan), despite the fact that they do not all start at the
beginning of the year. This method is useful for ratemaking methods based on
individual contracts and we do not pursue this further here.
In the same way as exposures, one can summarize premiums. Premiums, like
exposures, can be either written, earned, unearned, or in force. Consider the
following example.
Example 7.5.1. CAS Exam 5, 2003, Number 10. A 12-month policy is
written on March 1, 2002 for a premium of $900. As of December 31, 2002,
which of the following is true?

| | Calendar Year 2002 Written Premium | Calendar Year 2002 Earned Premium | Inforce Premium |
|---|---|---|---|
| A. | 900 | 900 | 900 |
| B. | 750 | 750 | 900 |
| C. | 900 | 750 | 750 |
| D. | 750 | 750 | 750 |
| E. | 900 | 750 | 900 |

Solution.
Only earned premium differs from written premium and inforce premium and
therefore needs to be computed. Thus, earned premium at Dec 31, 2002, equals
$900 × 10/12 = $750. Answer E.

7.5.2 Losses, Claims, and Payments


Broadly speaking, the terms loss and claim refer to the amount of compensation
paid or potentially payable to the claimant under the terms of the insurance
policy. Definitions can vary:
• Sometimes, the term claim is used interchangeably with the term loss.
• In some insurance and actuarial sources, the term loss is used for the
amount of damage sustained in an insured event. The claim is the amount
paid by the insurer with differences typically due to deductibles, upper
policy limits, and the like.
• In economics, a claim is a demand for payment by an insured or by an
injured third-party under the terms and conditions of an insurance contract
and the loss is the amount paid by the insurer.
This text will follow the second bullet. However, when reading other sources,
you will need to take care when thinking about definitions for the terms loss
and claim.
To establish additional terminology, it is helpful to follow the timeline of a
claim as it develops. In Figure 7.2, the claim occurs at time 𝑡1 and the insuring
company is notified at time 𝑡3 . There can be a long gap between occurrence and
notification such that the end of a company financial reporting period, known
as a valuation date, occurs (𝑡2 ) before the loss is reported. In this case, the
claim is said to be incurred but not reported at this valuation date.
After claim notification, there may be one or more loss payments. Not all of the
payments may be made by the next valuation date (𝑡4 ). As the claim devel-
ops, eventually the company deems its financial obligations on the claim to be
resolved and declares the claim closed. However, it is possible that new facts
arise and the claim must be re-opened, giving rise to additional loss payments
prior to being closed again.

Figure 7.2: Timeline of Claim Development



• Accident date - the date of the occurrence which gave rise to the claim.
This is also known as the date of loss or the occurrence date.
• Report date - the date the insurer receives notice of the claim. Claims
not currently known by the insurer are referred to as unreported claims
or incurred but not reported (IBNR) claims.
Until the claim is settled, the reported claim is considered an open claim. Once
the claim is settled, it is categorized as a closed claim. In some instances, further
activity may occur after the claim is closed, and the claim may be re-opened.
Recall that a claim is the amount paid or payable to claimants under the terms
of insurance policies. Further, we have
• Paid losses are those losses for a particular period that have actually been
paid to claimants.
• Where there is an expectation that payment will be made in the future,
a claim will have an associated case reserve representing the estimated
amount of that payment.
• Reported Losses, also known as case incurred, are Paid Losses + Case Reserves.
The ultimate loss is the amount of money required to close and settle all claims
for a defined group of policies.

7.5.3 Comparing Pure Premium and Loss Ratio Methods


Now that we have learned how exposures, premiums, and claims develop over
time, we can consider how they can be used for ratemaking. We have seen that
insurers offer many different types of policies that cover different policyholders
and amounts of risks. This aggregation is sometimes loosely referred to as the
mix of business. Importantly, the mix changes over time as policyholders come
and go, amounts of risks change, and so forth. The exposures, premiums, and
types of risks from a prior financial reporting period may not be representative of the
period for which the rates are being developed. The process of extrapolating
exposures, premiums, and risk types is known as trending. For example, an on-
level earned premium is that earned premium that would have resulted for the
experience period had the current rates been in effect for the entire period; this
is also known as an earned premium at current rates. Most trending methods
used in practice are mathematically straight-forward although they can become
complicated given contractual and administrative complexities. We refer the
reader to standard references that describe approaches in detail such as Werner
and Modlin (2016) and Friedland (2013).

Loss Ratio Method


The expression for the loss ratio method indicated change factor in equation
(7.3) assumes a certain amount of consistency in the portfolio experience over
time. For another approach, we can define the experience loss ratio to be:

$$
LR_{experience} = \frac{\text{experience losses}}{\text{experience period earned exposure} \times \text{current rate}}.
$$

Here, we think of the experience period earned exposure × current rate as the
experience premium.
Using equation (7.2), we can write a loss ratio as

$$
LR = \frac{\text{Losses}}{\text{Premium}} = \frac{1 - V - Q}{(\text{Losses} + \text{Fixed})/\text{Losses}} = \frac{1 - V - Q}{1 + G},
$$
where 𝐺 = Fixed/Losses, the ratio of fixed expenses to losses. With this expres-
sion, we define the target loss ratio

1−𝑉 −𝑄 1 − premium related expense factor - profit and contingencies factor


𝐿𝑅𝑡𝑎𝑟𝑔𝑒𝑡 = = .
1+𝐺 1 + ratio of non-premium related expenses to losses

With these, the indicated change factor is

$$
ICF = \frac{LR_{experience}}{LR_{target}}. \qquad (7.4)
$$

Comparing equation (7.3) to (7.4), we see that the latter offers more flexibility
to explicitly incorporate trended experience. As the loss ratio method is based
on rate changes, this flexibility is certainly warranted.

Comparison of Methods
Assuming that exposures, premiums, and claims have been trended to be repre-
sentative of a period that rates are being developed for, we are now in a position
to compare the pure premium and loss ratio methods for ratemaking. We start
with the observation that for the same data inputs, these two approaches pro-
duce the same results. That is, they are algebraically equivalent. However, they
rely on different inputs:

| Pure Premium Method | Loss Ratio Method |
|---|---|
| Based on exposures | Based on premiums |
| Does not require existing rates | Requires existing rates |
| Does not use on-level premiums | Uses on-level premiums |
| Produces indicated rates | Produces indicated rate changes |

Comparing the pure premium and loss ratio methods, we note that:
• The pure premium method requires well-defined, responsive exposures.
• The loss ratio method cannot be used for new business because it produces
indicated rate changes.

• The pure premium method is preferable where on-level premium is difficult to calculate. In some instances, such as commercial lines where individual risk rating adjustments are made to individual policies, it is difficult to determine the on-level earned premium required for the loss ratio method.
In many developed countries like the US, where most lines of business have long been in existence, the loss ratio approach is more popular.
Example 7.5.2. CAS Exam 5, 2006, Number 36. You are given the
following information:
• Experience period on-level earned premium = $500,000
• Experience period trended and developed losses = $300,000
• Experience period earned exposure = 10,000
• Premium-related expenses factor = 23%
• Non-premium related expenses = $21,000
• Profit and contingency factor = 5%
(a) Calculate the indicated rate level change using the loss ratio method.
(b) Calculate the indicated rate level change using the pure premium method.
(c) Describe one situation in which it is preferable to use the loss ratio method,
and one situation in which it is preferable to use the pure premium method.
Solution.
(a) We will calculate the experience and target loss ratios, then take the ratio
to get the indicated rate change. The experience loss ratio is
$$
LR_{experience} = \frac{\text{experience losses}}{\text{experience period premium}} = \frac{300000}{500000} = 0.60.
$$
The target loss ratio is:

$$
\begin{aligned}
LR_{target} &= \frac{1 - V - Q}{1 + G} = \frac{1 - \text{premium related expense factor} - \text{profit and contingencies factor}}{1 + \text{ratio of non-premium related expenses to losses}} \\
&= \frac{1 - 0.23 - 0.05}{1 + 0.07} = 0.673.
\end{aligned}
$$

Here, the ratio of non-premium related expenses to losses is $G = \frac{21000}{300000} = 0.07$.
Thus, the (new) indicated rate level change is

$$
ICF - 1 = \frac{LR_{experience}}{LR_{target}} - 1 = \frac{0.60}{0.673} - 1 = -10.8\%.
$$
(b) Using the pure premium method with equation (7.2),

$$
\text{Premium}_{experience} = \frac{\text{Losses} + \text{Fixed}}{1 - Q - V} = \frac{300000 + 21000}{1 - 0.23 - 0.05} = 445833.33.
$$

Thus, the indicated rate level change is $\frac{445833.33}{500000} - 1 = -10.8\%$.

(c) The loss ratio method is preferable when the exposure unit is not available.
The loss ratio method is preferable when the exposure unit is not reasonably
consistent between risks.
The pure premium method is preferable for a new line of business.
The pure premium method is preferable where on-level premiums are difficult
to calculate.
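Here is a minimal R sketch of parts (a) and (b); the variable names are ours, and the two methods are algebraically equivalent, so they return the same indicated change.

```r
# Example 7.5.2: indicated rate level change by both methods
onlevel_premium <- 500000   # experience period on-level earned premium
losses          <- 300000   # trended and developed losses
fixed           <- 21000    # non-premium related expenses
V        <- 0.23            # premium-related expense factor
Q_target <- 0.05            # profit and contingency factor

# (a) loss ratio method
LR_experience <- losses / onlevel_premium
LR_target     <- (1 - V - Q_target) / (1 + fixed / losses)
change_lr     <- LR_experience / LR_target - 1

# (b) pure premium method
indicated_premium <- (losses + fixed) / (1 - V - Q_target)
change_pp         <- indicated_premium / onlevel_premium - 1

c(loss_ratio = change_lr, pure_premium = change_pp)
#> both approximately -0.108, i.e., a 10.8% rate decrease
```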

7.6 Selecting a Premium

In this section, you learn how to:


• Describe skewed distributions via a Lorenz curve and Gini index
• Define a concentration curve and the corresponding Gini statistic
• Use the concentration curve and Gini statistic for premium selection based on out-of-sample validation

For a portfolio of insurance contracts, insurers collect premiums and pay out
losses. After making adjustments for expenses and profit considerations, tools
for comparing distributions of premiums and losses can be helpful when selecting
a premium calculation principle.

7.6.1 Classic Lorenz Curve


In welfare economics, it is common to compare distributions via the Lorenz
curve, developed by Max Otto Lorenz (Lorenz, 1905). A Lorenz curve is a
graph of the proportion of a population on the horizontal axis and a distribu-
tion function of interest on the vertical axis. It is typically used to represent
income distributions. When the income distribution is perfectly aligned with
the population distribution, the Lorenz curve results in a 45 degree line that
is known as the line of equality. Because the graph compares two distribution
functions, one can also think of a Lorenz curve as a type of pp plot that was
introduced in Section 4.1.2. The area between the Lorenz curve and the line
of equality is a measure of the discrepancy between the income and population
distributions. Two times this area is known as the Gini index, introduced by
Corrado Gini in 1912.
Example – Classic Lorenz Curve. For an insurance example, Figure 7.3
shows a distribution of insurance losses. This figure is based on a random
sample of 2000 losses. The left-hand panel shows a right-skewed histogram of
losses. The right-hand panel provides the corresponding Lorenz curve, showing
again a skewed distribution. For example, the arrow marks the point where 60
percent of the policyholders have 30 percent of losses. The 45 degree line is the

line of equality; if each policyholder has the same loss, then the loss distribution
would be at this line. The Gini index, twice the area between the Lorenz curve
and the 45 degree line, is 37.6 percent for this data set.

Figure 7.3: Distribution of Insurance Losses. The left-hand panel is a density plot of losses. The right-hand panel presents the same data using a Lorenz curve.

7.6.2 Performance Curve and a Gini Statistic


We now introduce a modification of the classic Lorenz curve and Gini statistic
that is useful in insurance applications. Specifically, we introduce a performance
curve that, in this case, is a graph of the distribution of losses versus premiums,
where both losses and premiums are ordered by premiums. To make the ideas
concrete, we provide some notation and consider 𝑖 = 1, … , 𝑛 policies. For the
𝑖th policy, let
• 𝑦𝑖 denote the insurance loss,
• x𝑖 be a set of rating variables known to the analyst, and
• 𝑃𝑖 = 𝑃 (x𝑖 ) be the associated premium that is a function of x𝑖 .
The set of information used to calculate the performance curve for the 𝑖th policy
is (𝑃𝑖 , 𝑦𝑖 ).

Performance Curve
It is convenient to first sort the set of policies based on premiums (from smallest
to largest) and then compute the premium and loss distributions. The premium
distribution is
$$
\hat{F}_P(s) = \frac{\sum_{i=1}^n P_i \, I(P_i \le s)}{\sum_{i=1}^n P_i}, \qquad (7.5)
$$

and the loss distribution is



$$
\hat{F}_L(s) = \frac{\sum_{i=1}^n y_i \, I(P_i \le s)}{\sum_{i=1}^n y_i}, \qquad (7.6)
$$

where I(⋅) is the indicator function, returning a 1 if the event is true and zero
otherwise. For a given value 𝑠, 𝐹𝑃̂ (𝑠) gives the proportion of premiums less than
or equal to 𝑠, and 𝐹𝐿̂ (𝑠) gives the proportion of losses for those policyholders
with premiums less than or equal to 𝑠. The graph (𝐹𝑃̂ (𝑠), 𝐹𝐿̂ (𝑠)) is known as
a performance curve.
Example – Loss Distribution. Suppose we have 𝑛 = 5 policyholders with
experience as follows. The data have been ordered by premiums.

| Variable $i$ | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Premium $P(\mathbf{x}_i)$ | 2 | 4 | 5 | 7 | 16 |
| Cumulative Premiums $\sum_{j=1}^{i} P(\mathbf{x}_j)$ | 2 | 6 | 11 | 18 | 34 |
| Loss $y_i$ | 2 | 5 | 6 | 6 | 17 |
| Cumulative Loss $\sum_{j=1}^{i} y_j$ | 2 | 7 | 13 | 19 | 36 |

Figure 7.4 compares the Lorenz curve to the performance curve. The left-hand panel shows the Lorenz curve. The horizontal axis is the cumulative proportion of policyholders (0, 0.2, 0.4, 0.6, 0.8, 1.0) and the vertical axis is the cumulative proportion of losses (0, 2/36, 7/36, 13/36, 19/36, 36/36). For the Lorenz curve,
you first order by the loss size (which turns out to be the same order as premi-
ums for this simple dataset). This figure shows a large separation between the
distributions of losses and policyholders.
The right-hand panel shows the performance curve. Because observations are
sorted by premiums, the first point after the origin (reading from left to right)
is (2/34, 2/36). The second point is (6/34, 7/36), with the pattern continu-
ing. From the figure, we see that there is little separation between losses and
premiums.

The performance curve can be helpful to the analyst who thinks about forming
profitable portfolios for the insurer. For example, suppose that 𝑠 is chosen to
represent the 95th percentile of the premium distribution. Then, the horizontal
axis, 𝐹𝑃̂ (𝑠), represents the fraction of premiums for this portfolio and the vertical
axis, 𝐹𝐿̂ (𝑠), the fraction of losses for this portfolio. When developing premium
principles, analysts wish to avoid unprofitable situations and make profits, or
at least break even.
The expectation of the denominator in equation (7.6) is $\sum_{i=1}^n \mathrm{E}[y_i] = \sum_{i=1}^n \mu_i$. Thus, if the premium principle is chosen such that $P_i = \mu_i$, then we anticipate

Figure 7.4: Lorenz versus Performance Curve

a close relation between the premium and loss distributions, resulting in a 45 degree line. The 45 degree line represents equality between losses and premiums, a break-even situation which is the benchmark for insurance pricing.

Gini Statistic
The classic Lorenz curve shows the proportion of policyholders on the horizontal
axis and the loss distribution function on the vertical axis. The performance
curve extends the classical Lorenz curve in two ways, (1) through the ordering
of risks and prices by prices and (2) by allowing prices to vary by observation.
We summarize the performance curve in the same way as the classic Lorenz
curve using a Gini statistic, defined as twice the area between the curve and
a 45 degree line. The analyst seeks performance curves that come close to the 45 degree line; these have the least separation between the loss and premium distributions and therefore small Gini statistics.
Specifically, the Gini statistic can be calculated as follows. Suppose that the
empirical performance curve is given by {(𝑎0 = 0, 𝑏0 = 0), (𝑎1 , 𝑏1 ), … , (𝑎𝑛 =
1, 𝑏𝑛 = 1)} for a sample of 𝑛 observations. Here, we use 𝑎𝑗 = 𝐹𝑃̂ (𝑃𝑗 ) and
𝑏𝑗 = 𝐹𝐿̂ (𝑃𝑗 ). Then, the empirical Gini statistic is

$$
\begin{aligned}
\widehat{Gini} &= 2\sum_{j=0}^{n-1}(a_{j+1} - a_j)\left\{\frac{a_{j+1}+a_j}{2} - \frac{b_{j+1}+b_j}{2}\right\} \\
&= 1 - \sum_{j=0}^{n-1}(a_{j+1} - a_j)(b_{j+1} + b_j).
\end{aligned} \qquad (7.7)
$$

To understand the formula for the Gini statistic, here is a sketch of a parallelogram connecting points $(a_1, b_1)$, $(a_2, b_2)$, and a 45 degree line. You can use basic geometry to check that the area of the figure is
$$
Area = (a_2 - a_1)\left\{\frac{a_2 + a_1}{2} - \frac{b_2 + b_1}{2}\right\}.
$$

The definition of the Gini statistic in equation (7.7) is simply twice the sum of
the parallelograms. The second equality in equation (7.7) is the result of some
straight-forward algebra.

(Sketch: the parallelogram formed by the curve segment from (a1, b1) to (a2, b2), the corresponding points (a1, a1) and (a2, a2) on the 45 degree line, and the Area between them.)

Example – Loss Distribution: Continued. The Gini statistic for the Lorenz
curve (left-hand panel of Figure 7.4) is 34.4 percent. In contrast, the Gini
statistic for the performance curve (right-hand panel) is 1.7 percent.
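Equation (7.7) is straightforward to code. The sketch below defines the empirical Gini statistic in R and applies it to the Lorenz curve points of the five-policyholder example, reproducing the 34.4 percent figure quoted above; the same function applies to any set of ordered curve points, such as a performance curve built from cumulative premiums and losses.

```r
# Empirical Gini statistic of equation (7.7); (a, b) must run from (0, 0) to (1, 1)
gini_stat <- function(a, b) {
  m <- length(a)
  1 - sum(diff(a) * (b[-1] + b[-m]))
}

# Lorenz curve for the five-policyholder example: order by loss size
y <- c(2, 5, 6, 6, 17)
a <- seq(0, 1, length.out = length(y) + 1)   # cumulative proportion of policyholders
b <- c(0, cumsum(y) / sum(y))                # cumulative proportion of losses
gini_stat(a, b)
#> [1] 0.3444444
```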

7.6.3 Out-of-Sample Validation


The benefits of out-of-sample validation for model selection were introduced in
Section 4.2. We now demonstrate the use of the Gini statistic and performance
curve in this context. The procedure follows:
1. Use an in-sample data set to estimate several competing models, each
producing a premium function.
2. Designate an out-of-sample, or validation, data set of the form {(x𝑖 , 𝑦𝑖 ), 𝑖 =
1, … , 𝑛}.
3. Use the explanatory variables from the validation sample to form premi-
ums of the form 𝑃 (x𝑖 ).
4. Compute the Gini statistic for each model. Choose the model with the
lowest Gini statistic.

Example – Community Rating versus Premiums that Vary by State. Suppose that we have experience from 25 states and that for each state we
have available 200 observations that can be used to predict future losses. For
simplicity, assume that the analyst knows that these losses were generated by a
gamma distribution with a common shape parameter equal to 5. Unknown to
the analyst, the scale parameters vary by state from a low of 20 to 66.
• To compute base premiums, the analyst assumes a scale parameter that
is common to all states that is to be estimated from the data. You can
think of this common premium as based on a community rating principle.
• As an alternative, the analyst allows the scale parameters to vary by state
and will again use the data to estimate these parameters.
An out of sample validation set of 100 losses from each state is available. For
each of the two rating procedures, determine the performance curve and the
corresponding Gini statistic. Choose the rate procedure with the lower Gini
statistic.

Recall for the gamma distribution that the mean equals the shape times the scale or, for our example, 5 times the scale parameter. So, you can check that the maximum likelihood estimates are simply the average experience.
For our base premium, we assume a common distribution among all states. For
these simulated data, the average in-sample loss is 𝑃1 =221.36.
As an alternative, we use averages that are state-specific; these averages form
our premiums 𝑃2 . Because this illustration uses means that vary by states, we
anticipate this alternative rating procedure to be preferred to the community
rating procedure.
Out of sample claims were generated from the same gamma distribution as the
in-sample model, with 100 observations for each state. The following R code
shows how to calculate the performance curves.
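Below is a minimal R sketch of such a calculation, assuming an illustrative seed and the gamma setup described above; this is not the text's original code, so the computed Gini values will only roughly approximate those reported in the next paragraph.

```r
# Out-of-sample comparison of community rating versus state-specific rates
# (a sketch; seed and data arrangement are illustrative assumptions)
set.seed(2020)
n_states <- 25
shape    <- 5
scales   <- seq(20, 66, length.out = n_states)   # state scale parameters

# in-sample data: 200 losses per state, used to estimate premiums
train <- data.frame(state = rep(1:n_states, each = 200))
train$loss <- rgamma(nrow(train), shape = shape, scale = scales[train$state])
P_flat  <- mean(train$loss)                       # community rate
P_state <- tapply(train$loss, train$state, mean)  # state-specific rates

# out-of-sample (validation) data: 100 losses per state
valid <- data.frame(state = rep(1:n_states, each = 100))
valid$loss <- rgamma(nrow(valid), shape = shape, scale = scales[valid$state])

# performance curve Gini statistic of equation (7.7), ordering by premium
perf_gini <- function(premium, loss) {
  ord <- order(premium)           # stable ordering; ties keep the data order
  a <- c(0, cumsum(premium[ord]) / sum(premium))
  b <- c(0, cumsum(loss[ord]) / sum(loss))
  1 - sum(diff(a) * (b[-1] + b[-length(b)]))
}

c(flat  = perf_gini(rep(P_flat, nrow(valid)), valid$loss),
  state = perf_gini(P_state[valid$state], valid$loss))
```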
For these data, the Gini statistics are 19.6 percent for the flat rate premium
and -0.702 percent for the state-specific alternative. This indicates that the
state-specific alternative procedure is strongly preferred to the base community
rating procedure.

Discussion
In insurance claims modeling, standard out-of-sample validation measures are
not the most informative due to the high proportions of zeros (corresponding
to no claim) and the skewed fat-tailed distribution of the positive values. In
contrast, the Gini statistic works well with many zeros (see the demonstration
in (Frees et al., 2014)).

(Figure: performance curves for the flat (community) rate and the state-specific rates, plotting the loss distribution against the premium distribution in each panel.)

The value of the performance curves and Gini statistics has been recently
advanced in the paper of Denuit et al. (2019). Properties of an extended version,
dealing with relativities for new premiums, were developed by Frees et al. (2011)
and Frees et al. (2014). In these articles you can find formulas for the standard
errors and additional background information.

7.7 Further Resources and Contributors


This chapter serves as a bridge between the technical introduction of this book
and an introduction to pricing and ratemaking for practicing actuaries. For
readers interested in learning practical aspects of pricing, we recommend intro-
ductions by the Society of Actuaries in Friedland (2013) and by the Casualty
Actuarial Society in Werner and Modlin (2016). For a classic risk management
introduction to pricing, see Niehaus and Harrington (2003). See also Finger
(2006) and Frees (2014).
Bühlmann (1985) was the first in the academic literature to argue that pricing
should be done first at the portfolio level (he referred to this as a top down
approach) which would be subsequently reconciled with pricing of individual
contracts. See also the discussion in Kaas et al. (2008), Chapter 5.
For more background on pricing principles, a classic treatment is by Gerber
(1979) with a more modern approach in Kaas et al. (2008). For more discussion
of pricing from a financial economics viewpoint, see Bauer et al. (2013).
• Edward W. (Jed) Frees, University of Wisconsin-Madison, and José Garrido, Concordia University, are the principal authors of the initial version of this chapter. Email: [email protected] and/or jose.garrido@concordia.ca for chapter comments and suggested improvements.
• Chapter reviewers include Chun Yong Chew, Curtis Gary Dean, Brian Hartman, and Jeffrey Pai. Write Jed or José to add your name here.

TS 7.A. Rate Regulation


Insurance regulation helps to ensure the financial stability of insurers and to
protect consumers. Insurers receive premiums in return for promises to pay in
the event of a contingent (insured) event. Like other financial institutions such
as banks, there is a strong public interest in promoting the continuing viability
of insurers.

Market Conduct
To help protect consumers, regulators impose administrative rules on the behav-
ior of market participants. These rules, known as market conduct regulation,
provide systems of regulatory controls that require insurers to demonstrate that
they are providing fair and reliable services, including rating, in accordance with
the statutes and regulations of a jurisdiction.
1. Product regulation serves to protect consumers by ensuring that insurance
policy provisions are reasonable and fair, and do not contain major gaps
in coverage that might be misunderstood by consumers and leave them
unprotected.
2. The insurance product is the insurance contract (policy) and the coverage
it provides. Insurance contracts are regulated for these reasons:
a. Insurance policies are complex legal documents that are often difficult
to interpret and understand.
b. Insurers write insurance policies and sell them to the public on a
“take it or leave it” basis.
Market conduct includes rules for intermediaries such as agents (who sell in-
surance to individuals) and brokers (who sell insurance to businesses). Market
conduct also includes competition policy regulation, designed to ensure an effi-
cient and competitive marketplace that offers low prices to consumers.

Rate Regulation
Rate regulation helps guide the development of premiums and so is the focus
of this chapter. As with other aspects of market conduct regulation, the in-
tent of these regulations is to ensure that insurers not take unfair advantage of
consumers. Rate (and policy form) regulation is common worldwide.
The amount of regulatory scrutiny varies by insurance product. Rate regulation
is uncommon in life insurance. Further, in non-life insurance, most commercial
lines and reinsurance are free from regulation. Rate regulation is common in
automobile insurance, health insurance, workers compensation, medical malpractice, and homeowners insurance. These are markets in which insurance is
mandatory or in which universal coverage is thought to be socially desirable.
There are three principles that guide rate regulation: rates should
• be adequate (to maintain insurance company solvency),
• but not excessive (not so high as to lead to exorbitant profits),
• nor unfairly discriminatory (price differences must reflect expected claim
and expense differences).
Recently, in auto and home insurance, the twin issues of availability and afford-
ability, which are not explicitly included in the guiding principles, have been
assuming greater importance in regulatory decisions.

Rates are Not Unfairly Discriminatory


Some government regulations of insurance restrict the amount, or level, of pre-
mium rates. These are based on the first two of the three guiding rate regulation
principles, that rates be adequate but not excessive. This type of regulation is
discussed further in the following section on types of rate regulation.
Other government regulations restrict the type of information that can be used
in risk classification. These are based on the third guiding principle, that rates
not be unfairly discriminatory. “Discrimination” in an insurance context has a
different meaning than commonly used; for our purposes, discrimination means
the ability to distinguish among things or, in our case, policyholders. The real
issue is what is meant by the adjective “fair.”
In life insurance, it has long been held that it is reasonable and fair to charge
different premium rates by age. For example, a life insurance premium differs
dramatically between an 80 year old and someone aged 20. In contrast, it is
unheard of to use rates that differ by:
• ethnicity or race,
• political affiliation, or
• religion.
It is not a matter of whether data can be used to establish statistical significance
among the levels of any of these variables. Rather, it is a societal decision as to
what constitutes notions of “fairness.”
Different jurisdictions have taken different stances on what constitutes a fair
rating variable. For example, in some jurisdictions for some insurance products,
gender is no longer a permissible variable. As an illustration, the European
Union now prohibits the use of gender for automobile rating. As another ex-
ample, in the U.S., many discussions have revolved around the use of credit
ratings to be used in automobile insurance pricing. Credit ratings are designed
to measure consumer financial responsibility. Yet, some argue that credit scores
are good proxies for ethnicity and hence should be prohibited.

In an age where more data is being used in imaginative ways, discussions of
what constitutes a fair rating variable will only become more important going
forward and much of that discussion is beyond the scope of this text. However,
it is relevant to the discussion to remark that actuaries and other data analysts
can contribute to societal discussions on what constitutes a “fair” rating variable
in unique ways by establishing the magnitude of price differences when using
variables under discussion.

Types of Rate Regulation


There are several methods, varying in the level of scrutiny, by which regulators may restrict the rates that insurers offer.
The most restrictive is a government prescribed regulatory system, where the
government regulator determines and promulgates the rates, classifications,
forms, and so forth, to which all insurers must adhere. Also restrictive are prior
approval systems. Here, the insurer must file rates, rules, and so forth, with
government regulators. Depending on the statute, the filing becomes effective
when a specified waiting period elapses (if the government regulator does not
take specific action on the filing, it is deemed approved automatically) or when
the government regulator formally approves the filing.
The least restrictive is a no file or record maintenance system where the insurer
need not file rates, rules, and so forth, with the government regulator. The
regulator may periodically examine the insurer to ensure compliance with the
law. Another relatively flexible system is the file only system, also known as
competitive rating, where the insurer simply keeps files to ensure compliance
with the law.
In between these two extremes are the (1) file and use, (2) use and file, (3)
modified prior approval, and (4) flex rating systems.
1. File and Use: The insurer must file rates, rules, and so forth, with the
government regulator. The filing becomes effective immediately or on a
future date specified by the filer.
2. Use and File: The filing becomes effective when used. The insurer must
file rates, rules, and so forth, with the government regulator within a
specified time period after first use.
3. Modified Prior Approval: This is a hybrid of “prior approval” and “file
and use” laws. If the rate revision is based solely on a change in loss
experience then “file and use” may apply. However, if the rate revision
is based on a change in expense relationships or rate classifications, then
“prior approval” may apply.
4. Flex (or Band) Rating: The insurer may increase or decrease a rate within
a “flex band,” or range, without approval of the government regulator.
Generally, either “file and use” or “use and file” provisions apply.

For a broad introduction to government insurance regulation from a global perspective, see the website of the International Association of Insurance Supervisors (IAIS).
Chapter 8

Risk Classification

Chapter Preview. This chapter motivates the use of risk classification in in-
surance pricing and introduces readers to Poisson regression as a prominent
example of risk classification. In Section 8.1 we explain why insurers need to
incorporate various risk characteristics, or rating factors, of individual policy-
holders in pricing insurance contracts. In Section 8.2, we introduce Poisson
regression as a pricing tool to achieve such premium differentials. The con-
cept of exposure is also introduced in this section. As most rating factors are
categorical, we show in Section 8.3 how the multiplicative tariff model can be
incorporated into a Poisson regression model in practice, along with numerical
examples for illustration.

8.1 Introduction

In this section, you learn:


• Why premiums should vary across policyholders with different risk
characteristics.

• The meaning of the adverse selection spiral.

• The need for risk classification.

Through insurance contracts, the policyholders effectively transfer their risks


to the insurer in exchange for premiums. For the insurer to stay in business,
the premium income collected from a pool of policyholders must at least equal
the benefit outgo. In general insurance products where a premium is charged
for a single period, say annually, the gross insurance premium based on the
equivalence principle is stated as

Gross Premium = Expected Losses + Expected Expenses + Profit.

Thus, ignoring frictional expenses associated with the administrative expenses


and the profit, the net or pure premium charged by the insurer should be equal
to the expected losses occurring from the risk that is transferred from the poli-
cyholder.
If all policyholders in the insurance pool have identical risk profiles, the insurer
simply charges the same premium for all policyholders because they have the
same expected loss. In reality, however, the policyholders are hardly homoge-
neous. For example, mortality risk in life insurance depends on the character-
istics of the policyholder, such as, age, sex and life style. In auto insurance,
those characteristics may include age, occupation, the type or use of the car,
and the area where the driver resides. The knowledge of these characteristics
or variables can enhance the ability of calculating fair premiums for individual
policyholders, as they can be used to estimate or predict the expected losses
more accurately.
Adverse Selection. Indeed, if the insurer does not differentiate the risk char-
acteristics of individual policyholders and simply charges the same premium to
all insureds based on the average loss in the portfolio, the insurer would face
adverse selection, a situation where individuals with a higher chance of loss are
attracted in the portfolio and low-risk individuals are repelled.
For example, consider a health insurance industry where smoking status is an
important risk factor for mortality and morbidity. Most health insurers in the
market require different premiums depending on smoking status, so smokers pay
higher premiums than non-smokers, with other characteristics being identical.
Now suppose that there is an insurer, we will call EquitabAll, that offers the
same premium to all insureds regardless of smoking status, unlike other com-
petitors. The net premium of EquitabAll is naturally an average mortality loss
accounting for both smokers and non-smokers. That is, the net premium is a
weighted average of the losses with the weights being the proportions of smok-
ers and non-smokers, respectively. Thus it is easy to see that a smoker would have a greater incentive to purchase insurance from EquitabAll than from other insurers, as the premium offered by EquitabAll is relatively lower. At
the same time non-smokers would prefer buying insurance from somewhere else
where lower premiums, computed from the non-smoker group only, are offered.
As a result, there will be more smokers and fewer non-smokers in EquitabAll's portfolio, which leads to larger-than-expected losses and hence a higher
premium for insureds in the next period to cover the higher costs. With the
raised new premium in the next period, non-smokers in EquitabAll will have
even greater incentives to switch insurers. As this cycle continues over time,
EquitabAll would gradually retain more smokers and fewer non-smokers in its portfolio, with the premium continually raised, eventually leading to a collapse of business.
In the literature, this phenomenon is known as the adverse selection spiral or
death spiral. Therefore, incorporating and differentiating important risk characteristics of individuals in the insurance pricing process is an essential component of both the determination of fair premiums for individual policyholders and the long-term sustainability of insurers.
Rating Factors. In order to incorporate relevant risk characteristics of pol-
icyholders in the pricing process, insurers maintain some classification system
that assigns each policyholder to one of the risk classes based on a relatively
small number of risk characteristics that are deemed most relevant. These char-
acteristics used in the classification system are called rating factors, which are
a priori variables in the sense that they are known before the contract begins
(e.g., sex, health status, vehicle type, etc, are known during underwriting). All
policyholders sharing identical risk factors thus are assigned to the same risk
class, and are considered homogeneous from a pricing viewpoint; the insurer
consequently charges them the same premium or rate.
Regarding the risk factors and premiums, the Actuarial Standard of Practice
(ASOP) No. 12 of the Actuarial Standards Board (2018) states that the actuary
should select risk characteristics that are related to expected outcomes, and that
rates within a risk classification system would be considered equitable if differ-
ences in rates reflect material differences in expected cost for risk characteristics.
In the process of choosing risk factors, ASOP also requires the actuary to con-
sider the following: relationship of risk characteristics and expected outcomes,
causality, objectivity, practicality, applicable law, industry practices, and busi-
ness practices. Technical Supplement TS 8.B provides additional discussion of
selection of rating factors.
On the quantitative side, an important task for the actuary in building a risk
classification framework is to construct a statistical model that can determine
the expected loss given various rating factors of a policyholder. The standard
approach is to adopt a regression model which produces the expected loss as the
output when the relevant risk factors are given as the inputs. In this chapter
we learn about Poisson regression, which can be used when the loss is a count
variable, as a prominent example of an insurance pricing tool.

8.2 Poisson Regression Model


The Poisson regression model has been successfully used in a wide range of
applications and has an advantage of allowing closed-form expressions for im-
portant quantities. In this section we introduce Poisson regression as a natural
extension of the Poisson distribution.

In this section you will:


• Understand Poisson regression as a convenient tool for combining individ-
ual Poisson distributions.
• Sharpen your understanding of the concept of exposure and its impor-
tance.

• Formally learn how to formulate a Poisson regression model using indicator


variables when the explanatory variables are categorical.

8.2.1 Need for Poisson Regression


Poisson Distribution
To introduce Poisson regression, let us consider a hypothetical health insurance
portfolio where all policyholders are of the same age and only one risk factor,
smoking status, is relevant. Smoking status thus is a categorical variable with
two levels: smoker and non-smoker. As there are two levels for smoking status,
we may denote smoker and non-smoker by level 1 and 2, respectively. Here
the numbering is arbitrary; smoking status is a nominal categorical variable.
(See Section 14.1.1 for more discussion of categorical and nominal variables.)
Suppose now that we are interested in pricing a health insurance where the
premium for each policyholder is determined by the number of outpatient visits
to doctor’s office during a year. The medical cost for each visit is assumed to
be the same regardless of smoking status for simplicity. Thus if we believe that
smoking status is a valid risk factor in this health insurance, it is natural to
consider observations from smokers separately from non-smokers. In Table 8.1
we present data for this portfolio.
Table 8.1. Number of Visits to Doctor’s Office in Last Year

Count    Smoker (level 1)    Non-smoker (level 2)      Both
0                    2213                    6671      8884
1                     178                     430       608
2                      11                      25        36
3                       6                       9        15
4                       0                       4         4
5                       1                       2         3
Total                2409                    7141      9550
Mean               0.0926                  0.0746    0.0792

As this dataset contains random counts, we try to fit a Poisson distribution for
each level.
As introduced in Section 2.2.3, the probability mass function of the Poisson with
mean 𝜇 is given by
$$\Pr(Y = y) = \frac{\mu^y e^{-\mu}}{y!}, \quad y = 0, 1, 2, \ldots \qquad (8.1)$$

and E (𝑌 ) = Var (𝑌 ) = 𝜇. In regression contexts, it is common to use 𝜇


for mean parameters instead of the Poisson parameter 𝜆 although certainly
both symbols are suitable. As we saw in Section 2.4, the mle of the Poisson
distribution is given by the sample mean. Thus if we denote the Poisson mean
parameter for each level by $\mu^{(1)}$ (smoker) and $\mu^{(2)}$ (non-smoker), we see from Table 8.1 that $\hat{\mu}^{(1)} = 0.0926$ and $\hat{\mu}^{(2)} = 0.0746$. This simple example shows the basic idea of risk classification. Depending on smoking status, a policyholder will have a different risk characteristic that can be incorporated via varying Poisson mean parameters to compute the fair premium. In this example the ratio of expected loss frequencies is $\hat{\mu}^{(1)}/\hat{\mu}^{(2)} = 1.2402$, implying that smokers tend to visit a doctor's office 24.02% more frequently than non-smokers.
It is also informative to note that if the insurer charges the same premium to all
policyholders regardless of smoking status, based on the average characteristic
of the portfolio, as was the case for EquitabAll described in Introduction, the
expected frequency (or premium) 𝜇̂ is 0.0792, obtained from the last column of
Table 8.1. It can be verified that

$$\hat{\mu} = \left(\frac{n_1}{n_1 + n_2}\right) \hat{\mu}^{(1)} + \left(\frac{n_2}{n_1 + n_2}\right) \hat{\mu}^{(2)} = 0.0792, \qquad (8.2)$$

where 𝑛𝑖 is the number of observations in each level. Clearly, this premium is


a weighted average of the premiums for each level with the weight equal to the
proportion of insureds in that level.
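As a quick check of these figures, the level means in Table 8.1 and the weighted average in (8.2) can be reproduced in a few lines of R. This is a small illustrative sketch (not part of the original text); the object names are arbitrary and the vectors simply restate the table.

```r
# Recompute the Table 8.1 means and the weighted average in (8.2).
counts    <- 0:5
smoker    <- c(2213, 178, 11, 6, 0, 1)     # observed frequencies, smokers
nonsmoker <- c(6671, 430, 25, 9, 4, 2)     # observed frequencies, non-smokers
mu1 <- sum(counts * smoker) / sum(smoker)          # 0.0926
mu2 <- sum(counts * nonsmoker) / sum(nonsmoker)    # 0.0746
(sum(smoker) * mu1 + sum(nonsmoker) * mu2) /
  (sum(smoker) + sum(nonsmoker))                   # 0.0792, as in (8.2)
```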
A simple Poisson regression
In the example above, we have fitted a Poisson distribution for each level sepa-
rately, but we can actually combine them together in a unified fashion so that
a single Poisson model can encompass both smoking and non-smoking statuses.
This can be done by relating the Poisson mean parameter with the risk factor.
In other words, we make the Poisson mean, which is the expected loss frequency,
respond to the change in the smoking status. The conventional approach to deal
with a categorical variable is to adopt indicator or dummy variables that take
either 1 or 0, so that we turn the switch on for one level and off for others.
Therefore we may propose to use

𝜇 = 𝛽0 + 𝛽1 𝑥1 (8.3)

or, more commonly, a log linear form

log 𝜇 = 𝛽0 + 𝛽1 𝑥1 , (8.4)
where 𝑥1 is an indicator variable with

$$x_1 = \begin{cases} 1 & \text{if smoker}, \\ 0 & \text{otherwise}. \end{cases} \qquad (8.5)$$

We generally prefer the log linear relation in (8.4) to the linear one in (8.3) to
prevent producing negative 𝜇 values, which can happen when there are many
different risk factors and levels. The setup in (8.4) and (8.5) then results in
different Poisson frequency parameters depending on the level in the risk factor:

$$\log \mu = \begin{cases} \beta_0 + \beta_1 & \text{if smoker (level 1)}, \\ \beta_0 & \text{if non-smoker (level 2)}, \end{cases} \quad \text{or equivalently,} \quad \mu = \begin{cases} e^{\beta_0 + \beta_1} & \text{if smoker (level 1)}, \\ e^{\beta_0} & \text{if non-smoker (level 2)}. \end{cases} \qquad (8.6)$$
This is the simplest form of Poisson regression. Note that we require a single
indicator variable to model two levels in this case. Alternatively, it is also
possible to use two indicator variables through a different coding scheme. This
scheme requires dropping the intercept term so that (8.4) is modified to

log 𝜇 = 𝛽1 𝑥1 + 𝛽2 𝑥2 , (8.7)

where 𝑥2 is the second indicator variable with

$$x_2 = \begin{cases} 1 & \text{if non-smoker}, \\ 0 & \text{otherwise}. \end{cases}$$

Then we have, from (8.7),

$$\log \mu = \begin{cases} \beta_1 & \text{if smoker (level 1)}, \\ \beta_2 & \text{if non-smoker (level 2)}, \end{cases} \quad \text{or} \quad \mu = \begin{cases} e^{\beta_1} & \text{if smoker (level 1)}, \\ e^{\beta_2} & \text{if non-smoker (level 2)}. \end{cases} \qquad (8.8)$$

The numerical result of (8.6) is the same as (8.8) as all coefficients are given
as numbers in actual estimation, with the former setup more common in most
texts; we also stick to the former.
With this Poisson regression model we can readily understand how the coeffi-
cients 𝛽0 and 𝛽1 are linked to the expected loss frequency in each level. Accord-
ing to (8.6), the Poisson mean of the smokers, 𝜇(1) , is given by

$$\mu^{(1)} = e^{\beta_0 + \beta_1} = \mu^{(2)} e^{\beta_1}, \quad \text{or} \quad \mu^{(1)}/\mu^{(2)} = e^{\beta_1},$$
where 𝜇(2) is the Poisson mean for the non-smokers. This relation between the
smokers and non-smokers suggests a useful way to compare the risks embedded
in different levels of a given risk factor. That is, the proportional increase in
the expected loss frequency of the smokers compared to that of the non-smokers
is simply given by a multiplicative factor 𝑒𝛽1 . Put another way, if we set the
expected loss frequency of the non-smokers as the base value, the expected loss
frequency of the smokers is obtained by applying 𝑒𝛽1 to the base value.
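To make this concrete, here is a minimal R sketch (not from the original text; the object names are illustrative) that fits the two-level model (8.4) to the Table 8.1 data by expanding the tabulated counts into individual observations. The fitted quantities reproduce the level means and the relativity discussed above.

```r
# Fit log(mu) = beta0 + beta1 * x1 to the Table 8.1 data with glm().
y  <- c(rep(0:5, times = c(2213, 178, 11, 6, 0, 1)),   # smokers
        rep(0:5, times = c(6671, 430, 25, 9, 4, 2)))   # non-smokers
x1 <- c(rep(1, 2409), rep(0, 7141))                    # smoker indicator as in (8.5)
fit <- glm(y ~ x1, family = poisson(link = "log"))
exp(coef(fit))   # exp(b0) = 0.0746 (non-smoker mean); exp(b1) = 1.2402 (ratio)
```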
Dealing with multi-level case
We can readily extend the two-level case to a multi-level one where 𝑙 different
levels are involved for a single rating factor. For this we generally need 𝑙 − 1
indicator variables to formulate

log 𝜇 = 𝛽0 + 𝛽1 𝑥1 + ⋯ + 𝛽𝑙−1 𝑥𝑙−1 , (8.9)

where 𝑥𝑘 is an indicator variable that equals 1 if the policy belongs to level


𝑘 and 0 otherwise, for 𝑘 = 1, 2, … , 𝑙 − 1. By omitting the indicator variable
associated with the last level in (8.9), we effectively choose level 𝑙 as the base case,
or reference level, but this choice is arbitrary and does not matter numerically.
The resulting Poisson parameter for policies in level 𝑘 then becomes, from (8.9),

$$\mu = \begin{cases} e^{\beta_0 + \beta_k} & \text{if the policy belongs to level } k, \; (k = 1, 2, \ldots, l-1), \\ e^{\beta_0} & \text{if the policy belongs to level } l. \end{cases}$$

Thus if we denote the Poisson parameter for policies in level 𝑘 by 𝜇(𝑘) , we


can relate the Poisson parameter for different levels through 𝜇(𝑘) = 𝜇(𝑙) 𝑒𝛽𝑘 ,
𝑘 = 1, 2, … , 𝑙 − 1. This indicates that, just like the two-level case, the expected
loss frequency of the 𝑘th level is obtained from the base value multiplied by
the relative factor 𝑒𝛽𝑘 . This relative interpretation becomes more powerful
when there are many risk factors with multi-levels, and leads us to a better
understanding of the underlying risk and a more accurate prediction of future
losses. Finally, we note that the varying Poisson mean is completely driven by
the coefficient parameters 𝛽𝑘 ’s, which are to be estimated from the dataset; the
procedure of the parameter estimation will be discussed later in this chapter.
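As a side note (an illustrative sketch, not from the text), statistical software typically builds these $l-1$ indicator variables automatically when a rating factor is stored as a categorical variable. In R, for example:

```r
# R's default treatment coding creates l - 1 indicators for an l-level factor.
# Note that R drops the *first* level as the reference, whereas the text drops
# the last; the choice of base level is arbitrary and does not change the
# fitted Poisson means.
age_band <- factor(c("young", "middle", "old"),
                   levels = c("young", "middle", "old"))
model.matrix(~ age_band)   # columns: (Intercept), age_bandmiddle, age_bandold
```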

8.2.2 Poisson Regression


We now describe Poisson regression in a formal and more general setting. Let us
assume that there are 𝑛 independent policyholders with a set of rating factors
characterized by a 𝑘-variate vector (for example, if there are 3 risk factors with 2, 3, and 4 levels, respectively, we need 𝑘 = (2 − 1) + (3 − 1) + (4 − 1) = 6 indicator variables). The 𝑖th policyholder's rating factors are thus denoted by the vector x𝑖 = (1, 𝑥𝑖1, … , 𝑥𝑖𝑘)′, and the policyholder has recorded the loss count 𝑦𝑖 ∈ {0, 1, 2, …} from the last period of loss observation, for 𝑖 = 1, … , 𝑛. In the regression literature, the values 𝑥𝑖1, … , 𝑥𝑖𝑘 are generally known as explanatory variables, as these are measurements providing information about the variable of interest 𝑦𝑖. In essence, regression analysis is a method to quantify the relationship between a variable of interest and explanatory variables.
We also assume, for now, that all policyholders have the same one unit period
for loss observation, or equal exposure of 1, to keep things simple; we will discuss
more details regarding the exposure in the following subsection.
We describe Poisson regression through its mean function. For this we first
denote 𝜇𝑖 as the expected loss count of the 𝑖th policyholder under the Poisson
specification (8.1):

𝜇𝑖 = E (𝑦𝑖 |x𝑖 ), 𝑦𝑖 ∼ 𝑃 𝑜𝑖𝑠(𝜇𝑖 ), 𝑖 = 1, … , 𝑛. (8.10)

The condition inside the expectation in equation (8.10) indicates that the loss
frequency 𝜇𝑖 is the model expected response to the given set of risk factors or
explanatory variables. In principle the conditional mean E (𝑦𝑖 |x𝑖 ) in (8.10) can
take different forms depending on how we specify the relationship between x
and 𝑦. The standard choice for Poisson regression is to adopt the exponential
function, as we mentioned previously, so that


$$\mu_i = \mathrm{E}(y_i | \mathbf{x}_i) = e^{\mathbf{x}_i'\boldsymbol\beta}, \quad y_i \sim Pois(\mu_i), \quad i = 1, \ldots, n. \qquad (8.11)$$

Here 𝛽 = (𝛽0 , … , 𝛽𝑘 )′ is the vector of coefficients so that x′𝑖 𝛽 = 𝛽0 + 𝛽1 𝑥𝑖1 +


… + 𝛽𝑘 𝑥𝑖𝑘 . The exponential function in (8.11) ensures that 𝜇𝑖 > 0 for any set
of rating factors x𝑖 . Often (8.11) is rewritten as a log linear form

log 𝜇𝑖 = log E (𝑦𝑖 |x𝑖 ) = x′𝑖 𝛽, 𝑦𝑖 ∼ 𝑃 𝑜𝑖𝑠(𝜇𝑖 ), 𝑖 = 1, … , 𝑛 (8.12)

to reveal the relationship when the right side is set as the linear form, x′𝑖 𝛽.
Again, we see that the mapping works well as both sides of (8.12), log 𝜇𝑖 and
x𝑖 𝛽, can now cover all real values. This is the formulation of Poisson regres-
sion, assuming that all policyholders have the same unit period of exposure.
When the exposures differ among the policyholders, however, as is the case in
most practical cases, we need to revise this formulation by adding an exposure
component as an additional term in (8.12).

8.2.3 Incorporating Exposure


Concept of Exposure
We first saw the concept of exposures in Section 7.4. In order to determine
the size of potential losses in any type of insurance, one must always know the
corresponding exposure. The concept of exposure is an extremely important
ingredient in insurance pricing, though we usually take it for granted. For
example, when we say the expected claim frequency of a health insurance policy
is 0.2, it does not mean much without the specification of the exposure such
as, in this case, per month or per year. In fact, all premiums and losses need
the exposure precisely specified and must be quoted accordingly; otherwise all
subsequent statistical analyses and predictions will be distorted.
In the previous section we assumed the same unit of exposure across all policy-
holders, but this is hardly realistic in practice. In health insurance, for example,
two different policyholders with different lengths of insurance coverage (e.g., 3
months and 12 months, respectively) could have recorded the same number of
claim counts. As the expected number of claim counts would be proportional
to the length of coverage, we should not treat these two policyholders’ loss ex-
periences identically in the modeling process. This motivates the need for the concept of exposure in Poisson regression.
The Poisson distribution in (8.1) is parametrized via its mean. To understand
the exposure, we alternatively parametrize the Poisson pmf in terms of the rate
parameter 𝜆, based on the definition of the Poisson process:

$$\Pr(Y = y) = \frac{(\lambda t)^y e^{-\lambda t}}{y!}, \quad y = 0, 1, 2, \ldots \qquad (8.13)$$

with E (𝑌 ) = Var (𝑌 ) = 𝜆𝑡. Here 𝜆 is known as the rate or intensity per unit
period of the Poisson process and 𝑡 represents the length of time or exposure, a
known constant value. For given 𝜆 the Poisson distribution (8.13) produces a
larger expected loss count as the exposure 𝑡 gets larger. Clearly, (8.13) reduces
to (8.1) when 𝑡 = 1, which means that the mean and the rate become the same
for an exposure of 1, the case we considered in the previous subsection.
In principle, the exposure does not need to be measured in units of time and
may represent different things depending on the problem at hand. For example:
1. In health insurance, the rate may be the occurrence of a specific disease
per 1,000 people and the exposure is the number of people considered in
the unit of 1,000.
2. In auto insurance, the rate may be the number of accidents per year of a
driver and the exposure is the length of the observed period for the driver
in the unit of year.

3. For workers compensation that covers lost wages resulting from an em-
ployee’s work-related injury or illness, the rate may be the probability of
injury in the course of employment per dollar and the exposure is the
payroll amount in dollars.
4. In marketing, the rate may be the number of customers who enter a store
per hour and the exposure is the number of hours observed.

5. In civil engineering, the rate may be the number of major cracks on the paved road per 10 km and the exposure is the length of road considered in units of 10 km.

6. In credit risk modelling, the rate may be the number of default events per
1000 firms and the exposure is the number of firms under consideration in
the unit of 1,000.
Actuaries may be able to use different exposure bases for a given insurable loss.
For example, in auto insurance, both the number of kilometers driven and the
number of months covered by insurance can be used as exposure bases. Here the
former is more accurate and useful in modelling the losses from car accidents, but
more difficult to measure and manage for insurers. Thus, a good exposure base
may not be the theoretically best one due to various practical constraints. As a
rule, an exposure base must be easy to determine, accurately measurable, legally
and socially acceptable, and free from potential manipulation by policyholders.
Incorporating exposure in Poisson regression
As exposures affect the Poisson mean, constructing Poisson regressions requires
us to carefully separate the rate and exposure in the modelling process. Focusing
on the insurance context, let us denote the rate of the loss event of the 𝑖th
policyholder by 𝜆𝑖 , the known exposure (the length of coverage) by 𝑚𝑖 and the
expected loss count under the given exposure by 𝜇𝑖 . Then the Poisson regression
formulation in (8.11) and (8.12) should be revised in light of (8.13) as


$$\mu_i = \mathrm{E}(y_i | \mathbf{x}_i) = m_i \lambda_i = m_i e^{\mathbf{x}_i'\boldsymbol\beta}, \quad y_i \sim Pois(\mu_i), \quad i = 1, \ldots, n, \qquad (8.14)$$

which gives

$$\log \mu_i = \log m_i + \mathbf{x}_i'\boldsymbol\beta, \quad y_i \sim Pois(\mu_i), \quad i = 1, \ldots, n. \qquad (8.15)$$

Adding log 𝑚𝑖 in (8.15) does not pose a problem in fitting as we can always
specify this as an extra explanatory variable, as it is a known constant, and fix
its coefficient to 1. In the literature the log of exposure, log 𝑚𝑖 , is commonly
called the offset.
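A small simulated R sketch (all numbers and names here are illustrative, not from the text) shows the offset in action: the coefficients on the per-unit-exposure scale are recovered even though policyholders have unequal exposures.

```r
# Poisson regression with an exposure offset: simulate counts, then refit.
set.seed(1)
n  <- 5000
x1 <- rbinom(n, 1, 0.4)                    # an indicator rating factor
m  <- runif(n, 0.25, 1)                    # exposures (fractions of a year)
y  <- rpois(n, m * exp(-2 + 0.5 * x1))     # true coefficients: (-2, 0.5)
fit <- glm(y ~ x1, family = poisson(link = "log"), offset = log(m))
coef(fit)                                  # estimates should be close to (-2, 0.5)
```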

8.2.4 Exercises
1. Regarding Table 8.1 answer the following.
(a) Verify the mean values in the table.

(b) Verify the number in equation (8.2).

(c) Produce the fitted Poisson counts for each smoking status in the
table.
2. In a Poisson regression formulation (8.10), consider using $\mu_i = \mathrm{E}(y_i|\mathbf{x}_i) = (\mathbf{x}_i'\boldsymbol\beta)^2$, for $i = 1, \ldots, n$, instead of the exponential function. What potential issue would you have?

8.3 Categorical Variables and Multiplicative


Tariff

In this section you will learn:


• The multiplicative tariff model when the rating factors are categorical.

• How to construct a Poisson regression model based on the multiplicative


tariff structure.

8.3.1 Rating Factors and Tariff


In practice most rating factors in insurance are categorical variables, meaning
that they take one of a predetermined number of possible values. Examples
of categorical variables include sex, type of cars, the driver’s region of residence
and occupation. Continuous variables, such as age or auto mileage, can also be
grouped by bands and treated as categorical variables. Thus we can imagine
that, with a small number of rating factors, there will be many policyholders
falling into the same risk class, charged with the same premium. For the remainder of this chapter we assume that all rating factors are categorical variables.
To illustrate how categorical variables are used in the pricing process, we con-
sider a hypothetical auto insurance with only two rating factors:
• Type of vehicle: Type A (personally owned) and B (owned by corpora-
tions). We use index 𝑗 = 1 and 2 to respectively represent each level of
this rating factor.

• Age band of the driver: Young (age < 25), middle (25 ≤ age < 60) and
old age (age ≥ 60). We use index 𝑘 = 1, 2 and 3, respectively, for this
rating factor.
From this classification rule, we may create an organized table or list, such
as the one shown in Table 8.2, collected from all policyholders. Clearly there
are 2 × 3 = 6 different risk classes in total. Each row of the table shows a
combination of different risk characteristics of individual policyholders. Our
goal is to compute six different premiums for each of these combinations. Once
the premium for each row has been determined using the given exposure and
claim counts, the insurer can replace the last two columns in Table 8.2 with
a single column containing the computed premiums. This new table then can
serve as a manual to determine the premium for a new policyholder given rating
factors during the underwriting process. In non-life insurance, a table (or a set of
tables) or list that contains each set of rating factors and the associated premium
is referred to as a tariff. Each unique combination of the rating factors in a tariff
is called a tariff cell; thus, in Table 8.2 the number of tariff cells is six, same as
the number of risk classes.

Table 8.2. Loss Record of the Illustrative Auto Insurer

Type (𝑗)   Age (𝑘)   Exposure in years   Claim count observed
1          1                      89.1                      9
1          2                     208.5                      8
1          3                     155.2                      6
2          1                      19.3                      1
2          2                     360.4                     13
2          3                     276.7                      6

Let us now look at the loss information in Table 8.2 more closely. The exposure
in each row represents the sum of the length of insurance coverages, or in-force
times, in years, of all the policyholders in that tariff cell. Similarly, the claim count in each row is the number of claims in that cell. Naturally the exposures
and claim counts vary due to the different number of drivers across the cells, as
well as different in-force time periods among the drivers within each cell.

In light of the Poisson regression framework, we denote the exposure and claim
count of cell (𝑗, 𝑘) as 𝑚𝑗𝑘 and 𝑦𝑗𝑘 , respectively, and define the claim count per
unit exposure as

$$z_{jk} = \frac{y_{jk}}{m_{jk}}, \quad j = 1, 2; \; k = 1, 2, 3.$$

For example, $z_{12} = 8/208.5 = 0.03837$, meaning that a policyholder in tariff cell (1,2) would, on average, have 0.03837 accidents if insured for a full year. The set of $z_{jk}$ values then corresponds to the rate parameter in the Poisson distribution (8.13), as they are the event occurrence rates per unit exposure. That is, we have $z_{jk} = \hat{\lambda}_{jk}$, where $\lambda_{jk}$ is the Poisson rate parameter. Producing $z_{jk}$ values
however does not do much beyond comparing the average loss frequencies across
risk classes. To fully exploit the dataset, we will construct a pricing model from
Table 8.2 using Poisson regression, for the remaining part of the chapter.
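For instance, the full set of $z_{jk}$ values can be read off Table 8.2 with a one-line computation. This is a small sketch, not from the text; the two vectors simply restate the table.

```r
# Claim counts per unit exposure for the six tariff cells of Table 8.2.
exposure <- c(89.1, 208.5, 155.2, 19.3, 360.4, 276.7)
count    <- c(9, 8, 6, 1, 13, 6)
round(count / exposure, 5)   # e.g., cell (1,2): 8 / 208.5 = 0.03837
```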

We comment that actual loss records used by insurers typically include many
more risk factors, in which case the number of cells grows exponentially. The
tariff would then consist of a set of tables, instead of one, separated by some of
the basic rating factors, such as sex or territory.

8.3.2 Multiplicative Tariff Model


In this subsection, we introduce the multiplicative tariff model, a popular pricing
structure that can be naturally used within the Poisson regression framework.
The developments here are based on Table 8.2. Recall that the loss count of
a policyholder is described by a Poisson regression model with rate 𝜆 and the
exposure 𝑚, so that the expected loss count becomes 𝑚𝜆. As 𝑚 is a known
constant, we are essentially concerned with modelling 𝜆, so that it responds
to the change in rating factors. Among other possible functional forms, we commonly choose the multiplicative relation to model the Poisson rate 𝜆𝑗𝑘 for cell (𝑗, 𝑘) (a preference for the multiplicative form over, e.g., an additive one was already hinted at in (8.4)):

𝜆𝑗𝑘 = 𝑓0 × 𝑓1𝑗 × 𝑓2𝑘 , 𝑗 = 1, 2; 𝑘 = 1, 2, 3. (8.16)

Here {𝑓1𝑗 , 𝑗 = 1, 2} are the parameters associated with the two levels in the first
rating factor, car type, and {𝑓2𝑘 , 𝑘 = 1, 2, 3} associated with the three levels
in the age band, the second rating factor. For instance, the Poisson rate for a
mid-aged policyholder with a Type B vehicle is given by 𝜆22 = 𝑓0 × 𝑓12 × 𝑓22 .
The first term 𝑓0 is some base value to be discussed shortly. Thus these six
parameters are understood as numerical representations of the levels within
each rating factor, and are to be estimated from the dataset.
The multiplicative form (8.16) is easy to understand and use, because it clearly
shows how the expected loss count (per unit exposure) changes as each rating
factor varies. For example, if 𝑓11 = 1 and 𝑓12 = 1.2, then the expected loss count
of a policyholder with a vehicle of type B would be 20% larger than type A, when
the other factors are the same. In non-life insurance, the parameters 𝑓1𝑗 and
𝑓2𝑘 are known as relativities as they determine how much expected loss should
change relative to the base value 𝑓0 . The idea of relativity is quite convenient in
practice, as we can decide the premium for a policyholder by simply multiplying
a series of corresponding relativities to the base value.
Dropping an existing rating factor or adding a new one is also transparent with
this multiplicative structure. In addition, the insurer may adjust the overall
premium for all policyholders by controlling the base value 𝑓0 without chang-
ing individual relativities. However, by adopting the multiplicative form, we
implicitly assume that there is no serious interaction among the risk factors.
When the multiplicative form is used we need to address an identification issue.
That is, for any 𝑐 > 0, we can write

$$\lambda_{jk} = f_0 \times \frac{f_{1j}}{c} \times c\, f_{2k}.$$
By comparing with (8.16), we see that the identical rate parameter 𝜆𝑗𝑘 can
be obtained for very different individual relativities. This over-parametrization, meaning that many different sets of parameters arrive at an identical model, obviously calls for some restriction on 𝑓1𝑗 and 𝑓2𝑘. The standard practice is to make one relativity in each rating factor equal to one. In theory this choice can be made arbitrarily, but the usual convention is to set the relativity of the most common class (the base class) equal to one. We will assume that type A vehicles and young drivers are the most common classes, that is, 𝑓11 = 1 and 𝑓21 = 1. This
way all other relativities are uniquely determined. The tariff cell (𝑗, 𝑘) = (1, 1)
is then called the base tariff cell, where the rate simply becomes 𝜆11 = 𝑓0 ,
corresponding to the base value according to (8.16). Thus the base value 𝑓0 is
generally interpreted as the Poisson rate of the base tariff cell.
Again, (8.16) is log-transformed and rewritten as

log 𝜆𝑗𝑘 = log 𝑓0 + log 𝑓1𝑗 + log 𝑓2𝑘 , (8.17)

as it is easier to work with in estimating process, similar to (8.12). This log


linear form makes the log relativities of the base level in each rating factor
equal to zero, i.e., log 𝑓11 = log 𝑓21 = 0, and leads to the following alternative,
more explicit expression for (8.17):

$$\log \lambda_{jk} = \begin{cases}
\log f_0 + 0 + 0 & \text{for a policy in cell } (1, 1),\\
\log f_0 + 0 + \log f_{22} & \text{for a policy in cell } (1, 2),\\
\log f_0 + 0 + \log f_{23} & \text{for a policy in cell } (1, 3),\\
\log f_0 + \log f_{12} + 0 & \text{for a policy in cell } (2, 1),\\
\log f_0 + \log f_{12} + \log f_{22} & \text{for a policy in cell } (2, 2),\\
\log f_0 + \log f_{12} + \log f_{23} & \text{for a policy in cell } (2, 3).
\end{cases} \qquad (8.18)$$

This clearly shows that the Poisson rate parameter 𝜆 varies across different tariff
cells, with the same log linear form used in a Poisson regression framework. In
fact the reader may see that (8.18) is an extended version of the early expression
(8.6) with multiple risk factors and that the log relativities now play the role of
𝛽𝑖 parameters. Therefore all the relativities can be readily estimated via fitting
a Poisson regression with a suitably chosen set of indicator variables.

8.3.3 Poisson Regression for Multiplicative Tariff


Indicator Variables for Tariff Cells
We now explain how the relativities can be incorporated into Poisson regression.
As seen early in this chapter we use indicator variables to deal with categorical
variables. For our illustrative auto insurer, therefore, we define an indicator
variable for the first rating factor as

$$x_1 = \begin{cases} 1 & \text{for vehicle type B}, \\ 0 & \text{otherwise}. \end{cases}$$
For the second rating factor, we employ two indicator variables for the age band,
that is,

$$x_2 = \begin{cases} 1 & \text{for age band 2}, \\ 0 & \text{otherwise}, \end{cases} \quad \text{and} \quad x_3 = \begin{cases} 1 & \text{for age band 3}, \\ 0 & \text{otherwise}. \end{cases}$$

The triple (𝑥1 , 𝑥2 , 𝑥3 ) then can effectively and uniquely determine each risk
class. By observing that the indicator variables associated with Type A and
Age band 1 are omitted, we see that tariff cell (𝑗, 𝑘) = (1, 1) plays the role of
the base cell. We emphasize that our choice of the three indicator variables
above has been carefully made so that it is consistent with the choice of the
base levels in the multiplicative tariff model in the previous subsection (i.e.,
𝑓11 = 1 and 𝑓21 = 1).

With the proposed indicator variables we can rewrite the log rate (8.17) as

$$\log \lambda = \log f_0 + (\log f_{12})\, x_1 + (\log f_{22})\, x_2 + (\log f_{23})\, x_3, \qquad (8.19)$$

which is identical to (8.18) when each triple value is actually applied. For
example, we can verify that the base tariff cell (𝑗, 𝑘) = (1, 1) corresponds to
(𝑥1 , 𝑥2 , 𝑥3 ) = (0, 0, 0), and in turn produces log 𝜆 = log 𝑓0 or 𝜆 = 𝑓0 in (8.19) as
required.
Poisson regression for the tariff model
Under this specification, let us consider 𝑛 policyholders in the portfolio with the
𝑖th policyholder’s risk characteristic given by a vector of explanatory variables
x𝑖 = (1, 𝑥𝑖1 , 𝑥𝑖2 , 𝑥𝑖3 )′ , for 𝑖 = 1, … , 𝑛. We then recognize (8.19) as

log 𝜆𝑖 = 𝛽0 + 𝛽1 𝑥𝑖1 + 𝛽2 𝑥𝑖2 + 𝛽3 𝑥𝑖3 = x′𝑖 𝛽, 𝑖 = 1, … , 𝑛,

where 𝛽0 , … , 𝛽3 can be mapped to the corresponding log relativities in (8.19).


This is exactly the same setup as in (8.15) except for the exposure component.
Therefore, by incorporating the exposure in each risk class, a Poisson regression
model for this multiplicative tariff model finally becomes

$$\log \mu_i = \log \lambda_i + \log m_i = \log m_i + \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} = \log m_i + \mathbf{x}_i'\boldsymbol\beta,$$
for 𝑖 = 1, … , 𝑛. As a result, the relativities are given by

$$f_0 = e^{\beta_0}, \quad f_{12} = e^{\beta_1}, \quad f_{22} = e^{\beta_2}, \quad \text{and} \quad f_{23} = e^{\beta_3}, \qquad (8.20)$$

with 𝑓11 = 1 and 𝑓21 = 1 from the original construction. For the actual dataset,
𝛽𝑖 , 𝑖 = 0, 1, 2, 3, is replaced with the mle 𝑏𝑖 using the method in the technical
supplement at the end of this chapter (Section 8.A).

8.3.4 Numerical Examples


We present two numerical examples of Poisson regression. In the first example
we construct a Poisson regression model from Table 8.2, which is a dataset of a
hypothetical auto insurer. The second example uses an actual industry dataset
with more risk factors. As our purpose is to show how a Poisson regression
model can be used under a given classification rule, we are not concerned with
the quality of the Poisson model fit in this chapter.
Example 8.1: Poisson regression for the illustrative auto insurer
In the last few subsections we considered a dataset of a hypothetical auto insurer
with two risk factors, as given in Table 8.2. We now apply a Poisson regression
model to this dataset. As done before, we have set (𝑗, 𝑘) = (1, 1) as the base
tariff cell, so that 𝑓11 = 𝑓21 = 1. The result of the regression gives the coefficient
estimates (𝑏0 , 𝑏1 , 𝑏2 , 𝑏3 ) = (−2.3359, −0.3004, −0.7837, −1.0655), which in turn
produces the corresponding estimated relativities

𝑓0 = 0.0967, 𝑓12 = 0.7405, 𝑓22 = 0.4567 and 𝑓23 = 0.3445,

from the relation given in (8.20). The R script and the output are as follows.
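The script below is a minimal sketch of such a fit using base R's glm (the data frame and variable names are illustrative, and the original output is not reproduced here); it recovers the coefficient estimates and relativities quoted above.

```r
# Example 8.1: Poisson regression with an exposure offset for Table 8.2.
dat <- data.frame(
  Type     = factor(c(1, 1, 1, 2, 2, 2)),   # vehicle type: 1 = A (base), 2 = B
  Age      = factor(c(1, 2, 3, 1, 2, 3)),   # driver age band: 1 = young (base)
  Exposure = c(89.1, 208.5, 155.2, 19.3, 360.4, 276.7),
  Count    = c(9, 8, 6, 1, 13, 6)
)
fit <- glm(Count ~ Type + Age, family = poisson(link = "log"),
           offset = log(Exposure), data = dat)
coef(fit)        # approximately (-2.3359, -0.3004, -0.7837, -1.0655)
exp(coef(fit))   # relativities f0, f12, f22, f23 as in (8.20)
```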

Example 8.2. Poisson regression for Singapore insurance claims data


This actual dataset is a subset of the data used by Frees and Valdez (2008). The
data are from the General Insurance Association of Singapore, an organization
consisting of non-life insurers in Singapore. These data contain the number of car accidents for 𝑛 = 7,483 auto insurance policies with several categorical ex-
planatory variables and the exposure for each policy. The explanatory variables
include four risk factors: the type of the vehicle insured (either automobile (A)
or other (O), denoted by Vtype), the age of the vehicle in years (Vage), gender
of the policyholder (Sex) and the age of the policyholder (in years, grouped into
seven categories, denoted Age).
Based on the data description, there are several things to consider before con-
structing a model. First, there are 3,842 policies with vehicle type A (auto-
mobile) and 3,641 policies with other vehicle types. However, age and sex
information is available for the policies of vehicle type A only; the drivers of all
8.3. CATEGORICAL VARIABLES AND MULTIPLICATIVE TARIFF 301

other types of vehicles are recorded to be aged 21 or less with sex unspecified,
except for one policy, indicating that no driver information has been collected
for non-automobile vehicles. Second, type A vehicles are all classified as private
vehicles and all the other types are not.
When we include these risk factors, we assume all unspecified sex to be male.
As the age information is only applicable to type A vehicles, we set the model
accordingly. That is, we apply the age variable only to vehicles of type A.
Also we used five vehicle age bands, simplifying the original seven bands, by
combining vehicle ages 0, 1, and 2; the combined band is marked as level 2 in the data file (corresponding to VAgecat1). Thus our Poisson model has the following explicit form:

$$\log \mu_i = \mathbf{x}_i'\boldsymbol\beta + \log m_i = \beta_0 + \beta_1 I(Sex_i = M) + \sum_{t=2}^{6} \beta_t I(Vage_i = t) + \sum_{t=7}^{13} \beta_t I(Vtype_i = A) \times I(Age_i = t - 7) + \log m_i.$$

The fitting result is given in Table 8.3, for which we have several comments.
• The claim frequency is higher for males by 17.3%, when other rating
factors are held fixed. However, this may have been affected by the fact
that all unspecified sex has been assigned to male.

• Regarding the vehicle age, the claim frequency gradually decreases as the
vehicle age increases, when other rating factors are held fixed. The level
starts from 2 for this variable but, again, the numbering is nominal and
does not affect the numerical result.

• The policyholder age variable only applies to type A (automobile) vehicles, and there is no policy in the first age band. We may speculate that younger
drivers less than age 21 drive their parents’ cars rather than having their
own because of high insurance premiums or related regulations. The miss-
ing relativity may be estimated by some interpolation or the professional
judgement of the actuary. The claim frequency is the lowest for age band
3 and 4, but gets substantially higher for older age bands, a reasonable
pattern seen in many auto insurance loss datasets.
We also note that there is no base level in the policyholder age variable, in
the sense that no relativity is equal to 1. This is because the variable is only
applicable to vehicle type A. This does not cause a problem numerically, but one
may set the base relativity as follows if necessary for other purposes. Since there
is no policy in age band 0, we consider band 1 as the base case. Specifically, we
treat its relativity as a product of 0.918 and 1, where the former is the common relativity (that is, the common premium reduction) applied to all policies with
vehicle type A and the latter is the base value for age band 1. Then the relativity
of age band 2 can be seen as 0.917 = 0.918 × 0.999, where 0.999 is understood as
the relativity for age band 2. The remaining age bands can be treated similarly.
Table 8.3. Singapore Insurance Claims Data

Rating factor          Level           Relativity in the tariff   Note
Base value                             0.167                      𝑓0
Sex                    1 (F)           1.000                      Base level
                       2 (M)           1.173
Vehicle age            2 (0-2 yrs)     1.000                      Base level
                       3 (3-5 yrs)     0.843
                       4 (6-10 yrs)    0.553
                       5 (11-15 yrs)   0.269
                       6 (16+ yrs)     0.189
Policyholder age       0 (0-21)        N/A                        No policy
(only applicable to    1 (22-25)       0.918
vehicle type A)        2 (26-35)       0.917
                       3 (36-45)       0.758
                       4 (46-55)       0.632
                       5 (56-65)       1.102
                       6 (65+)         1.179

Let us try several examples based on Table 8.3. Suppose a male policyholder
aged 40 who owns a 7-year-old vehicle of type A. The expected claim frequency
for this policyholder is then given by

𝜆 = 0.167 × 1.173 × 0.553 × 0.758 = 0.082.

As another example consider a female policyholder aged 60 who owns a 3-year-


old vehicle of type O. The expected claim frequency for this policyholder is

𝜆 = 0.167 × 1 × 0.843 = 0.141.

Note that for this policy the age band variable is not used as the vehicle type
is not A. The R script is given as follows.
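Since the original script and data file are not reproduced here, the following is only a schematic sketch of how such a model could be set up in R. The data frame sg, its column names, and the simulated values are assumptions made for illustration; with the actual Singapore data the fitted relativities would be compared with Table 8.3.

```r
# Schematic sketch: encode "policyholder age applies only to vehicle type A"
# via a derived factor whose reference level collects all non-A policies.
# The data below are simulated placeholders, not the Singapore data.
set.seed(42)
n <- 2000
sg <- data.frame(
  expo     = runif(n, 0.25, 1),
  Sex      = factor(sample(c("F", "M"), n, replace = TRUE)),
  VAgeBand = factor(sample(2:6, n, replace = TRUE)),
  Vtype    = factor(sample(c("A", "O"), n, replace = TRUE)),
  AgeBand  = factor(sample(1:6, n, replace = TRUE))
)
sg$count <- rpois(n, sg$expo * 0.15)       # placeholder claim counts
sg$AgeA  <- relevel(factor(ifelse(sg$Vtype == "A",
                                  as.character(sg$AgeBand), "NotA")),
                    ref = "NotA")
fit2 <- glm(count ~ Sex + VAgeBand + AgeA, family = poisson(link = "log"),
            offset = log(expo), data = sg)
exp(coef(fit2))                            # relativities (here from simulated data)
```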

As a concluding remark, we comment that Poisson regression is not the only


possible count regression model. Actually, the Poisson distribution can be re-
strictive in the sense that it has a single parameter and its mean and variance are always equal. There are other count regression models that allow a more
flexible distributional structure, such as negative binomial regressions and zero-
inflated (ZI) regressions; details of these alternative regressions can be found in
other texts listed in the next section.
8.4. FURTHER RESOURCES AND CONTRIBUTORS 303

8.4 Further Resources and Contributors


Further Reading and References
Poisson regression is a special member of a more general regression model class
known as the generalized linear model (GLM). The GLM develops a unified
regression framework for datasets when the response variables are continuous,
binary or discrete. The classical linear regression model with a normally dis-
tributed error is also a member of the GLM. There are many standard sta-
tistical texts dealing with the GLM, including McCullagh and Nelder (1989).
More accessible texts are Dobson and Barnett (2008), Agresti (1996) and Far-
away (2016). For actuarial and insurance GLM applications, see Frees (2009),
De Jong and Heller (2008). Also, Ohlsson and Johansson (2010) discusses GLM
in non-life insurance pricing context with tariff analyses.

Contributor
• Joseph H. T. Kim, Yonsei University, is the principal author of the
initial version of this chapter. Email: [email protected] for chapter
comments and suggested improvements.
• Chapter reviewers include: Chun Yong Chew, Lina Xu, Jeffrey Zheng.

TS 8.A. Estimating Poisson Regression Models


The principles of maximum likelihood estimation (mle) are introduced in Sec-
tions 2.4.1 and 3.5, defined in Section 15.2.2, and theoretically developed in
Chapter 17. Here we present the mle procedure of Poisson regression so that
the reader can see how the explanatory variables are treated in maximizing the
likelihood function in the regression setting.
Maximum Likelihood Estimation for Individual Data
In Poisson regression the varying Poisson mean is determined by parameters 𝛽𝑖 ’s,
as shown in (8.15). In this subsection we use the maximum likelihood method to
estimate these parameters. Again, we assume that there are 𝑛 policyholders and
the 𝑖th policyholder is characterized by x𝑖 = (1, 𝑥𝑖1 , … , 𝑥𝑖𝑘 )′ with the observed
loss count 𝑦𝑖 . Then, from (8.14) and (8.15), the log-likelihood function of vector
𝛽 = (𝛽0 , … , 𝛽𝑘 ) is given by

$$\log L(\boldsymbol\beta) = l(\boldsymbol\beta) = \sum_{i=1}^{n} \left(-\mu_i + y_i \log \mu_i - \log y_i!\right) = \sum_{i=1}^{n} \left(-m_i \exp(\mathbf{x}_i'\boldsymbol\beta) + y_i(\log m_i + \mathbf{x}_i'\boldsymbol\beta) - \log y_i!\right). \qquad (8.21)$$

To obtain the mle of $\boldsymbol\beta = (\beta_0, \ldots, \beta_k)'$, we differentiate $l(\boldsymbol\beta)$ with respect to the vector $\boldsymbol\beta$ (using matrix derivatives) and set the result to zero:

$$\left.\frac{\partial}{\partial \boldsymbol\beta} l(\boldsymbol\beta)\right|_{\boldsymbol\beta = \mathbf{b}} = \sum_{i=1}^{n} \left(y_i - m_i \exp(\mathbf{x}_i'\mathbf{b})\right) \mathbf{x}_i = \mathbf{0}. \qquad (8.22)$$

Numerically solving this equation system gives the mle of 𝛽, denoted by b =


(𝑏0 , 𝑏1 , … , 𝑏𝑘 )′ . Note that, as x𝑖 = (1, 𝑥𝑖1 , … , 𝑥𝑖𝑘 )′ is a column vector, equation
(8.22) is a system of 𝑘 + 1 equations with both sides written as column vectors
of size 𝑘 + 1. If we denote 𝜇𝑖̂ = 𝑚𝑖 exp(x′𝑖 b), we can rewrite (8.22) as

$$\sum_{i=1}^{n} \left(y_i - \hat{\mu}_i\right) \mathbf{x}_i = \mathbf{0}.$$

Since the solution b satisfies this equation, it follows that the first among the
array of 𝑘 + 1 equations, corresponding to the first constant element of x𝑖 , yields

$$\sum_{i=1}^{n} \left(y_i - \hat{\mu}_i\right) \times 1 = 0,$$

which implies that we must have

$$n^{-1} \sum_{i=1}^{n} y_i = \bar{y} = n^{-1} \sum_{i=1}^{n} \hat{\mu}_i.$$

This is an interesting property: the average of the individual losses, $\bar{y}$, is the same as the average of the fitted values. That is, the sample mean is preserved under the fitted Poisson regression model.
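As a check of this procedure, the log-likelihood (8.21) can be maximized directly with a general-purpose optimizer. The sketch below (illustrative only; the objects m, y, X restate the grouped Table 8.2 data) gives essentially the same estimates as glm in Example 8.1.

```r
# Direct maximization of the Poisson log-likelihood (8.21) for Table 8.2.
m <- c(89.1, 208.5, 155.2, 19.3, 360.4, 276.7)          # exposures
y <- c(9, 8, 6, 1, 13, 6)                                # claim counts
X <- cbind(1,                                            # intercept
           c(0, 0, 0, 1, 1, 1),                          # x1: vehicle type B
           c(0, 1, 0, 0, 1, 0),                          # x2: age band 2
           c(0, 0, 1, 0, 0, 1))                          # x3: age band 3
negloglik <- function(beta) {
  mu <- as.vector(m * exp(X %*% beta))
  -sum(dpois(y, mu, log = TRUE))
}
optim(rep(0, 4), negloglik, method = "BFGS")$par   # close to (-2.336, -0.300, -0.784, -1.066)
```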
Maximum Likelihood Estimation for Grouped Data
Sometimes the data are not available at the individual policy level. For example,
Table 8.2 provides collective loss information for each risk class after grouping
individual policies. When this is the case, 𝑦𝑖 and 𝑚𝑖 , the quantities needed for
the mle calculation in (8.22), are unavailable for each 𝑖. However this does not
pose a problem as long as we have the total loss counts and total exposure for
each risk class.
To elaborate, let us assume that there are 𝐾 different risk classes, and further
that, in the 𝑘th risk class, we have $n_k$ policies with total exposure $m_{(k)}$ and average loss count $\bar{y}_{(k)}$, for $k = 1, \ldots, K$; the total loss count for the 𝑘th risk class is then $n_k \bar{y}_{(k)}$. We denote the set of indices of the policies belonging
to the 𝑘th class by 𝐶𝑘 . As all policies in a given risk class share the same risk
characteristics, we may denote x𝑖 = x(𝑘) for all 𝑖 ∈ 𝐶𝑘 . With this notation, we
can rewrite (8.22) as
8.4. FURTHER RESOURCES AND CONTRIBUTORS 305

$$\begin{aligned}
\sum_{i=1}^{n} \left(y_i - m_i \exp(\mathbf{x}_i'\mathbf{b})\right) \mathbf{x}_i
&= \sum_{k=1}^{K} \left\{\sum_{i \in C_k} \left(y_i - m_i \exp(\mathbf{x}_i'\mathbf{b})\right) \mathbf{x}_i\right\}\\
&= \sum_{k=1}^{K} \left\{\sum_{i \in C_k} \left(y_i - m_i \exp(\mathbf{x}_{(k)}'\mathbf{b})\right) \mathbf{x}_{(k)}\right\}\\
&= \sum_{k=1}^{K} \left\{\left(\sum_{i \in C_k} y_i - \sum_{i \in C_k} m_i \exp(\mathbf{x}_{(k)}'\mathbf{b})\right) \mathbf{x}_{(k)}\right\}\\
&= \sum_{k=1}^{K} \left(n_k \bar{y}_{(k)} - m_{(k)} \exp(\mathbf{x}_{(k)}'\mathbf{b})\right) \mathbf{x}_{(k)} = \mathbf{0}. \qquad (8.23)
\end{aligned}$$

Since $n_k \bar{y}_{(k)}$ in (8.23) represents the total loss count for the 𝑘th risk class and $m_{(k)}$ is its total exposure, we see that for Poisson regression the mle $\mathbf{b}$ is the same whether we use the individual data or the grouped data.
Information matrix
Section 17.1 defines information matrices. Taking second derivatives of (8.21) gives the information matrix of the mle estimators,

$$I(\boldsymbol\beta) = -\mathrm{E}\left(\frac{\partial^2}{\partial\boldsymbol\beta\, \partial\boldsymbol\beta'} l(\boldsymbol\beta)\right) = \sum_{i=1}^{n} m_i \exp(\mathbf{x}_i'\boldsymbol\beta)\, \mathbf{x}_i \mathbf{x}_i' = \sum_{i=1}^{n} \mu_i\, \mathbf{x}_i \mathbf{x}_i'. \qquad (8.24)$$

For actual datasets, 𝜇𝑖 in (8.24) is replaced with 𝜇𝑖̂ = 𝑚𝑖 exp(x′𝑖 b) to estimate


the relevant variances and covariances of the mle b or its functions.
For grouped datasets, we have

$$I(\boldsymbol\beta) = \sum_{k=1}^{K} \left\{\sum_{i \in C_k} m_i \exp(\mathbf{x}_i'\boldsymbol\beta)\, \mathbf{x}_i \mathbf{x}_i'\right\} = \sum_{k=1}^{K} m_{(k)} \exp(\mathbf{x}_{(k)}'\boldsymbol\beta)\, \mathbf{x}_{(k)} \mathbf{x}_{(k)}'.$$
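To illustrate, the estimated information matrix and the implied standard errors can be computed for the Table 8.2 example as follows. This is a sketch only; the design matrix, exposures, and mle values below restate quantities already given in Example 8.1.

```r
# Information matrix (8.24) evaluated at the mle, and approximate standard errors.
m <- c(89.1, 208.5, 155.2, 19.3, 360.4, 276.7)
X <- cbind(1, c(0, 0, 0, 1, 1, 1), c(0, 1, 0, 0, 1, 0), c(0, 0, 1, 0, 0, 1))
b <- c(-2.3359, -0.3004, -0.7837, -1.0655)   # mle from Example 8.1
mu_hat <- as.vector(m * exp(X %*% b))
I_hat  <- t(X) %*% (mu_hat * X)              # sum of mu_i x_i x_i'
sqrt(diag(solve(I_hat)))                     # standard errors of b0, b1, b2, b3
```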

TS 8.B. Selecting Rating Factors


A complete discussion of rating factor selection is beyond the scope of this
book. In addition to technical analyses, you have to think carefully about the
type of business (personal, commercial) as well as the regulatory landscape.
Nonetheless, a broad overview of some key concerns may serve to ground the
reader as one thinks about the pricing of insurance contracts.

Statistical Criteria
From an analyst’s perspective, the discussion starts with the statistical signif-
icance of a rating factor. If the factor is not statistically significant, then the
variable is not even worthy of consideration for inclusion in a rating plan. The
statistical significance is judged not only on an in-sample basis but also on how
well it fares on an out-of-sample basis, as per our discussion in Section 4.2.
It is common in insurance applications to have many rating factors. Handling
multivariate aspects can be difficult with traditional univariate methods. Ana-
lysts employ techniques such as generalized linear models as described in Section
8.3.
Rating factors are introduced to create cells that contain similar risks. A rating
group should be large enough to measure costs with sufficient accuracy. There
is an inherent trade-off between theoretical accuracy and homogeneity.
As an example, most insurers charge the same automobile insurance premiums
for drivers between the ages of 30 and 50, not varying the premium by age.
Presumably costs do not vary much by age, or cost variances are due to other
identifiable factors.

Operational Criteria
From a business perspective, statistical criteria only provide a starting point
for discussions of potential inclusion of rating factors. Inclusion of a rating
factor must also induce economically meaningful results. From an insured’s
perspective, if differentiation by a factor produces little change in a rate then it
is not worth including. From an insurer’s perspective, the inclusion of a factor
should help segment the marketplace in a way that helps attract the business
that they seek. For example, we introduce the Gini index in Section 7.6 as one
metric that insurers use to describe the financial impact of a rating variable.
Rating factors should also be objective, inexpensive to administer, and verifi-
able. For example, automobile insurance underwriters often talk of “maturity”
and “responsibility” as important criteria for youthful drivers. Yet, these are
difficult to define objectively and to apply consistently. As another example, in
automobile it has long been known that amount of miles (or kilometers) driven
is an excellent rating factor. However, insurers have been reluctant to adopt
this factor because it is subject to abuse. Historically, driving mileage has not
been used because of the difficulty in verifying this variable (it is far too easy
to alter the car’s odometer to change reported mileage). Going forward, mod-
ern day drivers and cars are equipped with global positioning devices and other
equipment that allow insurers to use distance driven as a rating factor because
it can be verified.

Rating Factors from the Perspective of a Consumer


Insurance companies sell insurance products to a variety of consumers; conse-
quently, companies are affected by public perception. On the one hand, free
market competition dictates rating factors that insurers use, as is common in
commercial insurance. On the other hand, insurance may be required by law.
This is common in personal insurance such as third party automobile liability and homeowners. In these instances, the mandatory and de facto mandatory
purchase of insurance may mean that free market competition is insufficient to
protect policyholders. Here, the following items affect the social acceptability
of using a particular risk characteristic as a rating variable:
• Affordability - introduction of some variables may be mitigated by result-
ing high costs of insurance.
• Causality - other things being equal, a rating variable is easier to justify if
there is a “causal” relationship with losses. A good example is the effect
of smoking in life insurance. For many years, this factor was viewed with
suspicion by the industry. However, over time, scientific studies provided
overwhelming evidence that smoking is an important predictor of mortality.
• Controllability - A controllable variable is one that is under the control of
the insured, e.g., installing burglar alarms. The use of controllable rating
variables encourages accident prevention.
• Privacy concerns - people are reluctant to disclose personal information.
In today’s world with increasing emphasis on social media and the avail-
ability of personal information, consumer advocates are concerned that
the benefits of big data skew heavily in insurers’ favor. They reason that
insureds do not have equivalent new tools to compare quality of cover-
age/policies and performance of insurance companies.
Example: Youthful Drivers. In some cases, a particular risk characteristic
may identify a small group of insureds whose risk level is extremely high, and if
used as a rating variable, the resulting premium may be unaffordable for that
high-risk class. To the extent that this occurs, companies may wish to or be
required by regulators to combine classes and introduce subsidies. For example,
16-year-old drivers are generally higher risk than 17-year-old drivers. Some
companies have chosen to use the same rates for 16- and 17-year-old drivers to
minimize the affordability issues that arise when a family adds a 16-year-old to
the auto policy.

Societal Effects of Rating Factors


With public discussions of rating factors, it is also important to think about the
societal effects of classification.
For example, does a rating variable encourage “good” behavior? As an example,
we return to the use of distance driven as a rating factor. Many people advocate
for including this variable as a factor. The motivation is that if insurance, like
fuel, is priced based on distance driven, this will induce consumers to reduce
the amount driven, thereby benefiting society.
One can consider other aspects of societal effects of classification, see, for exam-
ple, Niehaus and Harrington (2003):
• Re-distributive Effects - provide a cross-subsidy from e.g., high risks to
low risks
• Classification Costs - money spent by society and insurers to classify people
appropriately.

Legal Criteria
Laws and regulations constrain the rating factors that insurers may use. For
example, some states have statutes prohibiting the use of gender in rating
insurance while others permit it as a rating variable. As a result, an insurer
writing in multiple states may include gender as a rating variable in the states
where it is permitted, but not include it in a state that prohibits its use for rating.
If allowed by law, the company may continue to charge the average rate but
utilize the characteristic to identify, attract, and select the lower-risk insureds
that exist in the insured population; this is called skimming the cream. See
Frees and Huang (2020) for a broad discussion of discrimination in pricing.
Chapter 9

Experience Rating Using Credibility Theory

Chapter Preview. This chapter introduces credibility theory as an important
actuarial tool for estimating pure premiums, frequencies, and severities for indi-
vidual risks or classes of risks. Credibility theory provides a convenient frame-
work for combining the experience for an individual risk or class with other data
to produce more stable and accurate estimates. Several models for calculating
credibility estimates will be discussed including limited fluctuation, Bühlmann,
Bühlmann-Straub, and nonparametric and semiparametric credibility methods.
The chapter will also show a connection between credibility theory and Bayesian
estimation which was introduced in Chapter 4.

9.1 Introduction to Applications of Credibility Theory
What premium should be charged to provide insurance? The answer depends
upon the exposure to the risk of loss. A common method to compute an in-
surance premium is to rate an insured using a classification rating plan. A
classification plan is used to select an insurance rate based on an insured’s rat-
ing characteristics such as geographic territory, age, etc. All classification rating
plans use a limited set of criteria to group insureds into a “class” and there will
be variation in the risk of loss among insureds within the class.

An experience rating plan attempts to capture some of the variation in the risk
of loss among insureds within a rating class by using the insured’s own loss
experience to complement the rate from the classification rating plan. One way
to do this is to use a credibility weight 𝑍 with 0 ≤ 𝑍 ≤ 1 to compute


𝑅̂ = 𝑍 𝑋̄ + (1 − 𝑍)𝑀 ,

𝑅̂ = credibility weighted rate for risk,


𝑋̄ = average loss for the risk over a specified time period,
𝑀 = the rate for the classification group, often called the manual rate.

For a risk whose loss experience is stable from year to year, 𝑍 might be close to
1. For a risk whose losses vary widely from year to year, 𝑍 may be close to 0.
Credibility theory is also used for computing rates for individual classes within a
classification rating plan. When classification plan rates are being determined,
some or many of the groups may not have sufficient data to produce stable
and reliable rates. The actual loss experience for a group will be assigned a
credibility weight 𝑍 and the complement of credibility 1 − 𝑍 may be given to
the average experience for risks across all classes. Or, if a class rating plan is
being updated, the complement of credibility may be assigned to the current
class rate. Credibility theory can also be applied to the calculation of expected
frequencies and severities.
Computing numeric values for 𝑍 requires analysis and understanding of the
data. What are the variances in the number of losses and sizes of losses for
risks? What is the variance between expected values across risks?

9.2 Limited Fluctuation Credibility

In this section, you learn how to:


• Calculate full credibility standards for number of claims, average size of
claims, and aggregate losses.
• Learn how the relationship between means and variances of underlying
distributions affects full credibility standards.
• Determine credibility-weight 𝑍 using the square-root partial credibility
formula.

Limited fluctuation credibility, also called “classical credibility” and “American


credibility,” was given this name because the method explicitly attempts to limit
fluctuations in estimates for claim frequencies, severities, or losses. For example,
suppose that you want to estimate the expected number of claims 𝑁 for a group
of risks in an insurance rating class. How many risks are needed in the class
to ensure that a specified level of accuracy is attained in the estimate? First
the question will be considered from the perspective of how many claims are
needed.

9.2.1 Full Credibility for Claim Frequency


Let 𝑁 be a random variable representing the number of claims for a group of
risks, for example, risks within a particular rating classification. The observed
number of claims will be used to estimate 𝜇𝑁 = E[𝑁 ], the expected number
of claims. How big does 𝜇𝑁 need to be to get a good estimate? One way
to quantify the accuracy of the estimate would be with a statement like: “The
observed value of 𝑁 should be within 5% of 𝜇𝑁 at least 90% of the time.” Writing
this as a mathematical expression would give Pr[0.95𝜇𝑁 ≤ 𝑁 ≤ 1.05𝜇𝑁 ] ≥ 0.90.
Generalizing this statement by letting the range parameter 𝑘 replace 5% and
probability level 𝑝 replace 0.90 gives the equation

Pr[(1 − 𝑘)𝜇𝑁 ≤ 𝑁 ≤ (1 + 𝑘)𝜇𝑁 ] ≥ 𝑝. (9.1)

The expected number of claims required for the probability on the left-hand
side of (9.1) to equal 𝑝 is called the full credibility standard.
If the expected number of claims is greater than or equal to the full credibility
standard then full credibility can be assigned to the data so 𝑍 = 1. Usually the
expected value 𝜇𝑁 is not known so full credibility will be assigned to the data
if the actual observed number of claims 𝑛 is greater than or equal to the full
credibility standard. The 𝑘 and 𝑝 values must be selected and the actuary may
rely on experience, judgment, and other factors in making the choices.
Subtracting 𝜇𝑁 from each term in (9.1) and dividing by the standard deviation
𝜎𝑁 of 𝑁 gives

$$\Pr\left[ \frac{-k\mu_N}{\sigma_N} \le \frac{N - \mu_N}{\sigma_N} \le \frac{k\mu_N}{\sigma_N} \right] \ge p. \qquad (9.2)$$

In limited fluctuation credibility the standard normal distribution is used to


approximate the distribution of (𝑁 − 𝜇𝑁 )/𝜎𝑁 . If 𝑁 is the sum of many claims
from a large group of similar risks and the claims are independent, then the
approximation may be reasonable.
Let 𝑦𝑝 be the value such that

$$\Pr\left[ -y_p \le \frac{N - \mu_N}{\sigma_N} \le y_p \right] = \Phi(y_p) - \Phi(-y_p) = p$$

where Φ() is the cumulative distribution function of the standard normal. Be-
cause Φ(−𝑦𝑝 ) = 1 − Φ(𝑦𝑝 ), the equality can be rewritten as 2Φ(𝑦𝑝 ) − 1 = 𝑝.
Solving for 𝑦𝑝 gives 𝑦𝑝 = Φ−1 ((𝑝 + 1)/2) where Φ−1 () is the inverse of Φ().
Equation (9.2) will be satisfied if $k\mu_N/\sigma_N \ge y_p$ assuming the normal approximation. First we will consider this inequality for the case when $N$ has a Poisson distribution: $\Pr[N = n] = \lambda^n e^{-\lambda}/n!$. Because $\lambda = \mu_N = \sigma_N^2$ for the Poisson, taking square roots yields $\mu_N^{1/2} = \sigma_N$. So, $k\mu_N/\mu_N^{1/2} \ge y_p$, which is equivalent to $\mu_N \ge (y_p/k)^2$. Let's define $\lambda_{kp}$ to be the value of $\mu_N$ for which equality holds. Then the full credibility standard for the Poisson distribution is

$$\lambda_{kp} = \left(\frac{y_p}{k}\right)^2 \quad \text{with} \quad y_p = \Phi^{-1}((p+1)/2). \qquad (9.3)$$

If the expected number of claims 𝜇𝑁 is greater than or equal to 𝜆𝑘𝑝 then equation
(9.1) is assumed to hold and full credibility can be assigned to the data. As
noted previously, because 𝜇𝑁 is usually unknown, full credibility is given if the
observed number of claims 𝑛 satisfies 𝑛 ≥ 𝜆𝑘𝑝 .
Example 9.2.1. The full credibility standard is set so that the observed number
of claims is to be within 5% of the expected value with probability 𝑝 = 0.95.
If the number of claims has a Poisson distribution find the number of claims
needed for full credibility.
Solution. Referring to a standard normal distribution table, 𝑦𝑝 = Φ−1 ((𝑝 +
1)/2) = Φ−1 ((0.95 + 1)/2)=Φ−1 (0.975) = 1.960. Using this value and 𝑘 = .05
then 𝜆𝑘𝑝 = (𝑦𝑝 /𝑘)2 = (1.960/0.05)2 = 1, 536.64. After rounding up the full
credibility standard is 1,537.
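This calculation is easy to reproduce in R. The following sketch is our own illustration (the helper name `full_cred_poisson` is not from the text); it uses the standard normal quantile function `qnorm`.

```r
# Full credibility standard for Poisson claim counts, equation (9.3)
full_cred_poisson <- function(k, p) {
  y_p <- qnorm((p + 1) / 2)  # standard normal quantile
  (y_p / k)^2                # lambda_{kp}
}
lambda_kp <- full_cred_poisson(k = 0.05, p = 0.95)
lambda_kp           # about 1536.6 (the text rounds y_p to 1.960, giving 1536.64)
ceiling(lambda_kp)  # 1537, the full credibility standard of Example 9.2.1
```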

If claims are not Poisson distributed then equation (9.2) does not imply (9.3). Setting the upper bound of $(N - \mu_N)/\sigma_N$ in (9.2) equal to $y_p$ gives $k\mu_N/\sigma_N = y_p$. Squaring both sides and moving everything to the right side except for one of the $\mu_N$'s gives $\mu_N = (y_p/k)^2(\sigma_N^2/\mu_N)$. This is the full credibility standard for frequency and will be denoted by $n_f$,

$$n_f = \left(\frac{y_p}{k}\right)^2 \left(\frac{\sigma_N^2}{\mu_N}\right) = \lambda_{kp}\left(\frac{\sigma_N^2}{\mu_N}\right). \qquad (9.4)$$

This is the same equation as the Poisson full credibility standard except for the $(\sigma_N^2/\mu_N)$ multiplier. When the claims distribution is Poisson this extra term is one because the variance equals the mean.
Example 9.2.2. The full credibility standard is set so that the total number of
claims is to be within 5% of the observed value with probability 𝑝 = 0.95. The
number of claims has a negative binomial distribution,

$$\Pr(N = x) = \binom{x+r-1}{x}\left(\frac{1}{1+\beta}\right)^{r}\left(\frac{\beta}{1+\beta}\right)^{x},$$
with 𝛽 = 1. Calculate the full credibility standard.


Solution. From the prior example, $\lambda_{kp} = 1{,}536.64$. The mean and variance for the negative binomial are $\mathrm{E}(N) = r\beta$ and $\mathrm{Var}(N) = r\beta(1+\beta)$ so $(\sigma_N^2/\mu_N) = r\beta(1+\beta)/(r\beta) = 1+\beta$, which equals 2 when $\beta = 1$. So, $n_f = \lambda_{kp}(\sigma_N^2/\mu_N) = 1{,}536.64(2) = 3{,}073.28$ and rounding up gives a full credibility standard of 3,074.

We see that the negative binomial distribution with $(\sigma_N^2/\mu_N) > 1$ requires more claims for full credibility than a Poisson distribution for the same $k$ and $p$ values. The next example shows that a binomial distribution, which has $(\sigma_N^2/\mu_N) < 1$, will need fewer claims for full credibility.
Example 9.2.3. The full credibility standard is set so that the total number of
claims is to be within 5% of the observed value with probability 𝑝 = 0.95. The
number of claims has a binomial distribution

$$\Pr(N = x) = \binom{m}{x} q^x (1-q)^{m-x}.$$

Calculate the full credibility standard for 𝑞 = 1/4.


Solution. From the first example in this section $\lambda_{kp} = 1{,}536.64$. The mean and variance for a binomial are $\mathrm{E}(N) = mq$ and $\mathrm{Var}(N) = mq(1-q)$ so $(\sigma_N^2/\mu_N) = mq(1-q)/(mq) = 1-q$, which equals $3/4$ when $q = 1/4$. So, $n_f = \lambda_{kp}(\sigma_N^2/\mu_N) = 1{,}536.64(3/4) = 1{,}152.48$ and rounding up gives a full credibility standard of 1,153.
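Examples 9.2.2 and 9.2.3 differ from the Poisson case only through the variance-to-mean ratio in equation (9.4). The short R sketch below is our own illustration and makes the comparison explicit.

```r
# n_f = lambda_kp * (sigma_N^2 / mu_N), equation (9.4)
lambda_kp   <- (qnorm((0.95 + 1) / 2) / 0.05)^2
var_to_mean <- c(poisson = 1,         # variance equals the mean
                 negbin  = 1 + 1,     # 1 + beta with beta = 1 (Example 9.2.2)
                 binom   = 1 - 1/4)   # 1 - q with q = 1/4 (Example 9.2.3)
ceiling(lambda_kp * var_to_mean)      # 1537, 3074, 1153
```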

Rather than using expected number of claims to define the full credibility stan-
dard, the number of exposures can be used for the full credibility standard. An
exposure is a measure of risk. For example, one car insured for a full year would
be one car-year. Two cars each insured for exactly one-half year would also
result in one car-year. Car-years attempt to quantify exposure to loss. Two
car-years would be expected to generate twice as many claims as one car-year if
the vehicles have the same risk of loss. To translate a full credibility standard
denominated in terms of number of claims to a full credibility standard denom-
inated in exposures one needs a reasonable estimate of the expected number of
claims per exposure.
Example 9.2.4. The full credibility standard should be selected so that the ob-
served number of claims will be within 5% of the expected value with probability
𝑝 = 0.95. The number of claims has a Poisson distribution. If one exposure
is expected to have about 0.20 claims per year, find the number of exposures
needed for full credibility.
Solution With 𝑝 = 0.95 and 𝑘 = .05, 𝜆𝑘𝑝 = (𝑦𝑝 /𝑘)2 = (1.960/0.05)2 = 1, 536.64
claims are required for full credibility. The claims frequency rate is 0.20 claims
per exposure. To convert the full credibility standard to a standard denominated
in exposures the calculation is: (1,536.64 claims)/(0.20 claims/exposures) =
7,683.20 exposures. This can be rounded up to 7,684.

Frequency can be defined as the number of claims per exposure. Let $m$ denote the number of exposures. Then, if the observed claim frequency $N/m$ is used to estimate $\mathrm{E}(N/m)$:

$$\Pr[(1-k)\mathrm{E}(N/m) \le N/m \le (1+k)\mathrm{E}(N/m)] \ge p.$$

Because the number of exposures is not a random variable, $\mathrm{E}(N/m) = \mathrm{E}(N)/m = \mu_N/m$ and the prior equation becomes

$$\Pr\left[(1-k)\frac{\mu_N}{m} \le \frac{N}{m} \le (1+k)\frac{\mu_N}{m}\right] \ge p.$$

Multiplying through by $m$ results in equation (9.1) at the beginning of the section. The full credibility standards that were developed for estimating the expected number of claims also apply to frequency.

9.2.2 Full Credibility for Aggregate Losses and Pure Premium
Aggregate losses are the total of all loss amounts for a risk or group of risks.
Letting 𝑆 represent aggregate losses

𝑆 = 𝑋 1 + 𝑋2 + ⋯ + 𝑋 𝑁 .

The random variable 𝑁 represents the number of losses and random variables
𝑋1 , 𝑋2 , … , 𝑋𝑁 are the individual loss amounts. In this section it is assumed
that 𝑁 is independent of the loss amounts and that 𝑋1 , 𝑋2 , … , 𝑋𝑁 are iid.
The mean and variance of 𝑆 are

𝜇𝑆 = E(𝑆) = E(𝑁 )E(𝑋) = 𝜇𝑁 𝜇𝑋

and

$$\sigma_S^2 = \mathrm{Var}(S) = \mathrm{E}(N)\mathrm{Var}(X) + [\mathrm{E}(X)]^2\,\mathrm{Var}(N) = \mu_N\sigma_X^2 + \mu_X^2\sigma_N^2,$$

where 𝑋 is the amount of a single loss. See the discussion on collective risk
models in Section 5.3 for more discussion of this framework.
Observed losses 𝑆 will be used to estimate expected losses 𝜇𝑆 = E(𝑆). As with


the frequency model in the previous section, the observed losses must be close
to the expected losses as quantified in the equation

Pr[(1 − 𝑘)𝜇𝑆 ≤ 𝑆 ≤ (1 + 𝑘)𝜇𝑆 ] ≥ 𝑝.

After subtracting the mean and dividing by the standard deviation,

$$\Pr\left[\frac{-k\mu_S}{\sigma_S} \le \frac{S - \mu_S}{\sigma_S} \le \frac{k\mu_S}{\sigma_S}\right] \ge p.$$

As done in the previous section, the distribution of $(S-\mu_S)/\sigma_S$ is assumed to be standard normal and $k\mu_S/\sigma_S = y_p = \Phi^{-1}((p+1)/2)$. This equation can be rewritten as $\mu_S^2 = (y_p/k)^2\sigma_S^2$. Using the prior formulas for $\mu_S$ and $\sigma_S^2$ gives $(\mu_N\mu_X)^2 = (y_p/k)^2(\mu_N\sigma_X^2 + \mu_X^2\sigma_N^2)$. Dividing both sides by $\mu_N\mu_X^2$ and reordering terms on the right side results in a full credibility standard $n_S$ for aggregate losses

$$n_S = \left(\frac{y_p}{k}\right)^2\left[\left(\frac{\sigma_N^2}{\mu_N}\right) + \left(\frac{\sigma_X}{\mu_X}\right)^2\right] = \lambda_{kp}\left[\left(\frac{\sigma_N^2}{\mu_N}\right) + \left(\frac{\sigma_X}{\mu_X}\right)^2\right]. \qquad (9.5)$$

Example 9.2.5. The number of claims has a Poisson distribution. Individual loss amounts are independently and identically distributed with a Pareto distribution $F(x) = 1 - [\theta/(x+\theta)]^\alpha$. The number of claims and loss amounts are independent. If observed aggregate losses should be within 5% of the expected value with probability $p = 0.95$, how many losses are required for full credibility?

Solution. Because the number of claims is Poisson, $(\sigma_N^2/\mu_N) = 1$. The mean of the Pareto is $\mu_X = \theta/(\alpha-1)$ and the variance is $\sigma_X^2 = \theta^2\alpha/[(\alpha-1)^2(\alpha-2)]$ so $(\sigma_X/\mu_X)^2 = \alpha/(\alpha-2)$. Combining the frequency and severity terms gives $[(\sigma_N^2/\mu_N) + (\sigma_X/\mu_X)^2] = 2(\alpha-1)/(\alpha-2)$. From a standard normal distribution table $y_p = \Phi^{-1}((0.95+1)/2) = 1.960$. The full credibility standard is $n_S = (1.96/0.05)^2[2(\alpha-1)/(\alpha-2)] = 3{,}073.28(\alpha-1)/(\alpha-2)$. Suppose $\alpha = 3$; then $n_S = 6{,}146.56$ for a full credibility standard of 6,147. Note that considerably more claims are needed for full credibility for aggregate losses than for frequency alone.
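A short R sketch of this calculation (ours, not part of the text) combines the frequency and severity terms of equation (9.5):

```r
# Full credibility standard for aggregate losses, Poisson frequency and
# Pareto severity (Example 9.2.5)
k <- 0.05; p <- 0.95; alpha <- 3
lambda_kp <- (qnorm((p + 1) / 2) / k)^2
freq_term <- 1                     # sigma_N^2 / mu_N for the Poisson
sev_term  <- alpha / (alpha - 2)   # (sigma_X / mu_X)^2 for the Pareto
ceiling(lambda_kp * (freq_term + sev_term))  # about 6147
```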

When the number of claims is Poisson distributed then equation (9.5) can be simplified using $(\sigma_N^2/\mu_N) = 1$. It follows that

$$\left[\left(\frac{\sigma_N^2}{\mu_N}\right) + \left(\frac{\sigma_X}{\mu_X}\right)^2\right] = \left[1 + \left(\frac{\sigma_X}{\mu_X}\right)^2\right] = \frac{\mu_X^2 + \sigma_X^2}{\mu_X^2} = \frac{\mathrm{E}(X^2)}{\mathrm{E}(X)^2}$$

using the relationship $\mu_X^2 + \sigma_X^2 = \mathrm{E}(X^2)$. The full credibility standard is $n_S = \lambda_{kp}\,\mathrm{E}(X^2)/\mathrm{E}(X)^2$.
The pure premium 𝑃 𝑃 is equal to aggregate losses 𝑆 divided by exposures 𝑚:
𝑃 𝑃 = 𝑆/𝑚. The full credibility standard for pure premium will require

Pr [(1 − 𝑘)𝜇𝑃 𝑃 ≤ 𝑃 𝑃 ≤ (1 + 𝑘)𝜇𝑃 𝑃 ] ≥ 𝑝.

The number of exposures $m$ is assumed fixed and not a random variable so $\mu_{PP} = \mathrm{E}(S/m) = \mathrm{E}(S)/m = \mu_S/m$, and

$$\Pr\left[(1-k)\left(\frac{\mu_S}{m}\right) \le \left(\frac{S}{m}\right) \le (1+k)\left(\frac{\mu_S}{m}\right)\right] \ge p.$$

Multiplying through by 𝑚 returns the bounds for losses

Pr[(1 − 𝑘)𝜇𝑆 ≤ 𝑆 ≤ (1 + 𝑘)𝜇𝑆 ] ≥ 𝑝.

This means that the full credibility standard 𝑛𝑃 𝑃 for the pure premium is the
same as that for aggregate losses

$$n_{PP} = n_S = \lambda_{kp}\left[\left(\frac{\sigma_N^2}{\mu_N}\right) + \left(\frac{\sigma_X}{\mu_X}\right)^2\right].$$

9.2.3 Full Credibility for Severity


Let 𝑋 be a random variable representing the size of one claim. Claim severity is
𝜇𝑋 = E(𝑋). Suppose that 𝑋1 , 𝑋2 , … , 𝑋𝑛 is a random sample of 𝑛 claims that
will be used to estimate claim severity 𝜇𝑋 . The claims are assumed to be iid.
The average value of the sample is

$$\bar{X} = \frac{1}{n}(X_1 + X_2 + \cdots + X_n).$$

How big does 𝑛 need to be to get a good estimate? Note that 𝑛 is not a random
variable whereas it is in the aggregate loss model.
In Section 9.2.1 the accuracy of an estimator for frequency was defined by re-
quiring that the number of claims lie within a specified interval about the mean
number of claims with a specified probability. For severity this requirement is

Pr[(1 − 𝑘)𝜇𝑋 ≤ 𝑋̄ ≤ (1 + 𝑘)𝜇𝑋 ] ≥ 𝑝,


where 𝑘 and 𝑝 need to be specified. Following the steps in Section 9.2.1, the
mean claim severity 𝜇𝑋 is subtracted from each term and the standard deviation
of the claim severity estimator 𝜎𝑋̄ is divided into each term yielding

$$\Pr\left[\frac{-k\mu_X}{\sigma_{\bar{X}}} \le \frac{\bar{X} - \mu_X}{\sigma_{\bar{X}}} \le \frac{k\mu_X}{\sigma_{\bar{X}}}\right] \ge p.$$

As in prior sections, it is assumed that $(\bar{X} - \mu_X)/\sigma_{\bar{X}}$ is approximately normally distributed and the prior equation is satisfied if $k\mu_X/\sigma_{\bar{X}} \ge y_p$ with $y_p = \Phi^{-1}((p+1)/2)$. Because $\bar{X}$ is the average of individual claims $X_1, X_2, \ldots, X_n$, its standard deviation is equal to the standard deviation of an individual claim divided by $\sqrt{n}$: $\sigma_{\bar{X}} = \sigma_X/\sqrt{n}$. So, $k\mu_X/(\sigma_X/\sqrt{n}) \ge y_p$ and with a little algebra this can be rewritten as $n \ge (y_p/k)^2(\sigma_X/\mu_X)^2$. The full credibility standard for severity is

$$n_X = \left(\frac{y_p}{k}\right)^2\left(\frac{\sigma_X}{\mu_X}\right)^2 = \lambda_{kp}\left(\frac{\sigma_X}{\mu_X}\right)^2. \qquad (9.6)$$

Note that the term 𝜎𝑋 /𝜇𝑋 is the coefficient of variation for an individual claim.
Even though 𝜆𝑘𝑝 is the full credibility standard for frequency given a Poisson
distribution, there is no assumption about the distribution for the number of
claims.
Example 9.2.6. Individual loss amounts are independently and identically
distributed with a Type II Pareto distribution 𝐹 (𝑥) = 1 − [𝜃/(𝑥 + 𝜃)]𝛼 . How
many claims are required for the average severity of observed claims to be within
5% of the expected severity with probability 𝑝 = 0.95?
Solution. The mean of the Pareto is $\mu_X = \theta/(\alpha-1)$ and the variance is $\sigma_X^2 = \theta^2\alpha/[(\alpha-1)^2(\alpha-2)]$ so $(\sigma_X/\mu_X)^2 = \alpha/(\alpha-2)$. From a standard normal distribution table $y_p = \Phi^{-1}((0.95+1)/2) = 1.960$. The full credibility standard is $n_X = (1.96/0.05)^2[\alpha/(\alpha-2)] = 1{,}536.64\,\alpha/(\alpha-2)$. Suppose $\alpha = 3$; then $n_X = 4{,}609.92$ for a full credibility standard of 4,610.

9.2.4 Partial Credibility


In prior sections full credibility standards were calculated for estimating fre-
quency (𝑛𝑓 ), pure premium (𝑛𝑃 𝑃 ), and severity (𝑛𝑋 ) - in this section these full
credibility standards will be denoted by 𝑛0 . In each case the full credibility
standard was the expected number of claims required to achieve a defined level
of accuracy when using empirical data to estimate an expected value. If the ob-
served number of claims is greater than or equal to the full credibility standard
then a full credibility weight 𝑍 = 1 is given to the data.
In limited fluctuation credibility, credibility weights 𝑍 assigned to data are
$$Z = \begin{cases} \sqrt{n/n_0} & \text{if } n < n_0 \\ 1 & \text{if } n \ge n_0, \end{cases}$$

where 𝑛0 is the full credibility standard. The quantity 𝑛 is the number of claims
for the data that is used to estimate the expected frequency, severity, or pure
premium.
Example 9.2.7. The number of claims has a Poisson distribution. Individ-
ual loss amounts are independently and identically distributed with a Type II
Pareto distribution 𝐹 (𝑥) = 1 − [𝜃/(𝑥 + 𝜃)]𝛼 . Assume that 𝛼 = 3. The number
of claims and loss amounts are independent. The full credibility standard is
that the observed pure premium should be within 5% of the expected value
with probability 𝑝 = 0.95. What credibility 𝑍 is assigned to a pure premium
computed from 1,000 claims?
Solution. Because the number of claims is Poisson,

$$\frac{\mathrm{E}(X^2)}{[\mathrm{E}(X)]^2} = \frac{\sigma_N^2}{\mu_N} + \left(\frac{\sigma_X}{\mu_X}\right)^2.$$

The mean of the Pareto is 𝜇𝑋 = 𝜃/(𝛼 − 1) and the second moment is E(𝑋 2 ) =
2𝜃2 /[(𝛼 − 1)(𝛼 − 2)] so E(𝑋 2 )/[E (𝑋)]2 = 2(𝛼 − 1)/(𝛼 − 2). From a standard
normal distribution table, 𝑦𝑝 = Φ−1 ((0.95 + 1)/2) = 1.960. The full credibility
standard is

𝑛𝑃 𝑃 = (1.96/0.05)2 [2(𝛼 − 1)/(𝛼 − 2)] = 3, 073.28(𝛼 − 1)/(𝛼 − 2)

and if $\alpha = 3$ then $n_0 = n_{PP} = 6{,}146.56$, or 6,147 if rounded up. The credibility assigned to 1,000 claims is $Z = (1{,}000/6{,}147)^{1/2} = 0.40$.
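The square-root rule is easy to wrap in a small helper; the sketch below is illustrative and the name `partial_Z` is our own.

```r
# Square-root partial credibility: Z = min(sqrt(n / n0), 1)
partial_Z <- function(n, n0) pmin(sqrt(n / n0), 1)
partial_Z(n = 1000, n0 = 6147)  # about 0.40, as in Example 9.2.7
partial_Z(n = 8000, n0 = 6147)  # capped at full credibility, Z = 1
```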

Limited fluctuation credibility uses the formula 𝑍 = √𝑛/𝑛0 to limit the fluctu-
ation in the credibility-weighted estimate to match the fluctuation allowed for
data with expected claims at the full credibility standard. Variance or standard
deviation is used as the measure of fluctuation. Next we show an example to
explain why the square-root formula is used.
Suppose that average claim severity is being estimated from a sample of size
𝑛 that is less than the full credibility standard 𝑛0 = 𝑛𝑋 . Applying credibility
theory, the estimate $\hat{\mu}_X$ would be

$$\hat{\mu}_X = Z\bar{X} + (1-Z)M_X,$$

with 𝑋̄ = (𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 )/𝑛 and 𝑖𝑖𝑑 random variables 𝑋𝑖 representing the


sizes of individual claims. The complement of credibility is applied to 𝑀𝑋 which
could be last year’s estimated average severity adjusted for inflation, the average
severity for a much larger pool of risks, or some other relevant quantity selected
by the actuary. It is assumed that the variance of 𝑀𝑋 is zero or negligible.
With this assumption

$$\mathrm{Var}(\hat{\mu}_X) = \mathrm{Var}(Z\bar{X}) = Z^2\,\mathrm{Var}(\bar{X}) = \frac{n}{n_0}\,\mathrm{Var}(\bar{X}).$$

Because 𝑋̄ = (𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 )/𝑛 it follows that Var(𝑋)̄ = Var(𝑋𝑖 )/𝑛 where


random variable 𝑋𝑖 is one claim. So,

$$\mathrm{Var}(\hat{\mu}_X) = \frac{n}{n_0}\,\mathrm{Var}(\bar{X}) = \frac{n}{n_0}\,\frac{\mathrm{Var}(X_i)}{n} = \frac{\mathrm{Var}(X_i)}{n_0}.$$

The last term is exactly the variance of a sample mean 𝑋̄ when the sample size
is equal to the full credibility standard 𝑛0 = 𝑛𝑋 .

9.3 Bühlmann Credibility

In this section, you learn how to:


• Compute a credibility-weighted estimate for the expected loss for a risk
or group of risks.
• Determine the credibility 𝑍 assigned to observations.
• Calculate the values required in Bühlmann credibility including the Ex-
pected Value of the Process Variance (𝐸𝑃 𝑉 ), Variance of the Hypothetical
Means (𝑉 𝐻𝑀 ) and collective mean 𝜇.
• Recognize situations when the Bühlmann model is appropriate.

A classification rating plan groups policyholders together into classes based on


risk characteristics. Although policyholders within a class have similarities, they
are not identical and their expected losses will not be exactly the same. An ex-
perience rating plan can supplement a class rating plan by credibility weighting
an individual policyholder’s loss experience with the class rate to produce a
more accurate rate for the policyholder.
In the presentation of Bühlmann credibility it is convenient to assign a risk
parameter 𝜃 to each policyholder. Losses 𝑋 for the policyholder will have a
common distribution function 𝐹𝜃 (𝑥) with mean 𝜇(𝜃) = E(𝑋|𝜃) and variance
𝜎2 (𝜃) = Var(𝑋|𝜃). Losses 𝑋 can represent pure premiums, aggregate losses,
number of claims, claim severities, or some other measure of loss for a period of
time, often one year. Risk parameter 𝜃 may be continuous or discrete and may
be multivariate depending on the model.
If a policyholder with risk parameter 𝜃 had losses 𝑋1 , … , 𝑋𝑛 during 𝑛 time


periods then the goal is to find E(𝜇(𝜃)|𝑋1 , … , 𝑋𝑛 ), the conditional expecta-
tion of 𝜇(𝜃) given 𝑋1 , … , 𝑋𝑛 . The Bühlmann credibility-weighted estimate for
E(𝜇(𝜃)|𝑋1 , … , 𝑋𝑛 ) for the policyholder is

$$\hat{\mu}(\theta) = Z\bar{X} + (1-Z)\mu \qquad (9.7)$$

with

$\theta$ = a risk parameter that identifies a policyholder's risk level
$\hat{\mu}(\theta)$ = estimated expected loss for a policyholder with parameter $\theta$ and loss experience $\bar{X}$
$\bar{X} = (X_1 + \cdots + X_n)/n$ is the average of $n$ observations of the policyholder
$Z$ = credibility assigned to $n$ observations
$\mu$ = the expected loss for a randomly chosen policyholder in the class.

For a selected policyholder, random variables 𝑋𝑗 are assumed to be iid for


𝑗 = 1, … , 𝑛 because it is assumed that the policyholder’s exposure to loss is not
changing through time. The quantity 𝑋̄ is the average of 𝑛 observations and
̄
E(𝑋|𝜃) = E(𝑋𝑗 |𝜃) = 𝜇(𝜃).
If a policyholder is randomly chosen from the class and there is no loss informa-
tion about the risk then the expected loss is 𝜇 = E(𝜇(𝜃)) where the expectation
is taken over all 𝜃’s in the class. In this situation 𝑍 = 0 and the expected loss
is $\hat{\mu}(\theta) = \mu$ for the risk. The quantity 𝜇 can also be written as 𝜇 = E(𝑋𝑗 ) or
𝜇 = E(𝑋)̄ and is often called the overall mean or collective mean. Note that
E(𝑋𝑗 ) is evaluated with the law of total expectation: E(𝑋𝑗 ) = E(E[𝑋𝑗 |𝜃]).
Example 9.3.1. The number of claims 𝑋 for an insured in a class has a Poisson
distribution with mean 𝜃 > 0. The risk parameter 𝜃 is exponentially distributed
within the class with pdf 𝑓(𝜃) = 𝑒−𝜃 . What is the expected number of claims
for an insured chosen at random from the class?
Solution. Random variable $X$ is Poisson with parameter $\theta$ and $\mathrm{E}(X|\theta) = \theta$. The expected number of claims for a randomly chosen insured is $\mu = \mathrm{E}(\mu(\theta)) = \mathrm{E}(\mathrm{E}(X|\theta)) = \mathrm{E}(\theta) = \int_0^\infty \theta e^{-\theta}\,d\theta = 1$.

In the prior example the risk parameter 𝜃 is a random variable with an expo-
nential distribution. In the next example there are three types of risks and the
risk parameter has a discrete distribution.
Example 9.3.2. For any risk (policyholder) in a population the number of


losses 𝑁 in a year has a Poisson distribution with parameter 𝜆. Individual loss
amounts 𝑋𝑖 for a risk are independent of 𝑁 and are iid with Type II Pareto
distribution 𝐹 (𝑥) = 1 − [𝜃/(𝑥 + 𝜃)]𝛼 . There are three types of risks in the
population as follows:

Risk Type   Percentage of Population   Poisson Parameter   Pareto Parameters
A           50%                        λ = 0.5             θ = 1000, α = 2.0
B           30%                        λ = 1.0             θ = 1500, α = 2.0
C           20%                        λ = 2.0             θ = 2000, α = 2.0

If a risk is selected at random from the population, what is the expected aggre-
gate loss in a year?
Solution The expected number of claims for a risk is E(𝑁 |𝜆)=𝜆. The expected
value for a Pareto distributed random variable is E(𝑋|𝜃, 𝛼)=𝜃/(𝛼 − 1). The
expected value of the aggregate loss random variable 𝑆 = 𝑋1 + ⋯ + 𝑋𝑁 for
a risk with parameters 𝜆, 𝛼, and 𝜃 is E(𝑆) = E(𝑁 )E(𝑋) = 𝜆𝜃/(𝛼 − 1). The
expected aggregate loss for a risk of type A is E(𝑆A )=(0.5)(1000)/(2-1)=500.
The expected aggregate loss for a risk selected at random from the population
is E(𝑆) = 0.5[(0.5)(1000)]+0.3[(1.0)(1500)]+0.2[(2.0)(2000)]=1500.

What is the risk parameter for a risk (policyholder) in the prior example? One
could say that the risk parameter has three components (𝜆, 𝜃, 𝛼) with possible
values (0.5,1000,2.0), (1.0,1500,2.0), and (2.0,2000,2.0) depending on the type
of risk.
Note that in both of the examples the risk parameter is a random quantity with
its own probability distribution. We do not know the value of the risk parameter
for a randomly chosen risk.
Although formula (9.7) was introduced using experience rating as an example,
the Bühlmann credibility model has wider application. Suppose that a rating
plan has multiple classes. Credibility formula (9.7) can be used to determine
individual class rates. The overall mean 𝜇 would be the average loss for all
classes combined, $\bar{X}$ would be the experience for the individual class, and $\hat{\mu}(\theta)$ would be the estimated loss for the class.

9.3.1 Credibility Z, EPV, and VHM


When computing the credibility estimate $\hat{\mu}(\theta) = Z\bar{X} + (1-Z)\mu$, how much weight $Z$ should go to experience $\bar{X}$ and how much weight $(1-Z)$ to the overall mean $\mu$? In Bühlmann credibility there are three factors that need to be considered:
1. How much variation is there in a single observation $X_j$ for a selected risk? With $\bar{X} = (X_1 + \cdots + X_n)/n$ and assuming that the observations are iid conditional on $\theta$, it follows that $\mathrm{Var}(\bar{X}|\theta) = \mathrm{Var}(X_j|\theta)/n$. For larger values of $\mathrm{Var}(X_j|\theta)$, less credibility weight $Z$ should be given to experience $\bar{X}$. The Expected Value of the Process Variance, abbreviated $EPV$, is the expected value of $\mathrm{Var}(X_j|\theta)$ across all risks:

$$EPV = \mathrm{E}(\mathrm{Var}(X_j|\theta)).$$

Because $\mathrm{Var}(\bar{X}|\theta) = \mathrm{Var}(X_j|\theta)/n$, it follows that $\mathrm{E}(\mathrm{Var}(\bar{X}|\theta)) = EPV/n$.

2. How homogeneous is the population of risks whose experience was combined to compute the overall mean $\mu$? If all the risks are similar in loss potential then more weight $(1-Z)$ would be given to the overall mean $\mu$ because $\mu$ is the average for a group of similar risks whose means $\mu(\theta)$ are not far apart. The homogeneity or heterogeneity of the population is measured by the Variance of the Hypothetical Means, abbreviated $VHM$:

$$VHM = \mathrm{Var}(\mathrm{E}(X_j|\theta)) = \mathrm{Var}(\mathrm{E}(\bar{X}|\theta)).$$

Note that we used $\mathrm{E}(\bar{X}|\theta) = \mathrm{E}(X_j|\theta)$ for the second equality.

3. How many observations $n$ were used to compute $\bar{X}$? A larger sample calls for a larger $Z$.
Example 9.3.3. The number of claims 𝑁 in a year for a risk in a population
has a Poisson distribution with mean 𝜆 > 0. The risk parameter 𝜆 is uniformly
distributed over the interval (0, 2). Calculate the 𝐸𝑃 𝑉 and 𝑉 𝐻𝑀 for the
population.
Solution. Random variable $N$ is Poisson with parameter $\lambda$ so $\mathrm{Var}(N|\lambda) = \lambda$. The Expected Value of the Process Variance is $EPV = \mathrm{E}(\mathrm{Var}(N|\lambda)) = \mathrm{E}(\lambda) = \int_0^2 \lambda\,\tfrac{1}{2}\,d\lambda = 1$. The Variance of the Hypothetical Means is $VHM = \mathrm{Var}(\mathrm{E}(N|\lambda)) = \mathrm{Var}(\lambda) = \mathrm{E}(\lambda^2) - (\mathrm{E}(\lambda))^2 = \int_0^2 \lambda^2\,\tfrac{1}{2}\,d\lambda - (1)^2 = \tfrac{1}{3}$.
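The two moments in Example 9.3.3 can be checked numerically. The following R sketch is our own illustration and uses `integrate` with the uniform(0, 2) density.

```r
# EPV and VHM for Poisson claims with lambda ~ uniform(0, 2)
dens <- function(lambda) dunif(lambda, min = 0, max = 2)
EPV  <- integrate(function(l) l   * dens(l), 0, 2)$value  # E(lambda) = 1
E2   <- integrate(function(l) l^2 * dens(l), 0, 2)$value  # E(lambda^2) = 4/3
VHM  <- E2 - EPV^2                                        # Var(lambda) = 1/3
c(EPV = EPV, VHM = VHM)
```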

The Bühlmann credibility formula includes values for 𝑛, 𝐸𝑃 𝑉 , and 𝑉 𝐻𝑀 :

$$Z = \frac{n}{n+K}, \qquad K = \frac{EPV}{VHM}. \qquad (9.8)$$
If the 𝑉 𝐻𝑀 increases then 𝑍 increases. If the 𝐸𝑃 𝑉 increases then 𝑍 gets
smaller. Unlike limited fluctuation credibility where 𝑍 = 1 when the expected
number of claims is greater than the full credibility standard, 𝑍 can approach
but not equal 1 as the number of observations 𝑛 goes to infinity.
If you multiply the numerator and denominator of the $Z$ formula by $(VHM/n)$ then $Z$ can be rewritten as

$$Z = \frac{VHM}{VHM + (EPV/n)}.$$

The number of observations $n$ is captured in the term $(EPV/n)$. As shown in bullet (1) at the beginning of the section, $\mathrm{E}(\mathrm{Var}(\bar{X}|\theta)) = EPV/n$. As the number of observations gets larger, the expected variance of $\bar{X}$ gets smaller and credibility $Z$ increases, so that more weight gets assigned to $\bar{X}$ in the credibility-weighted estimate $\hat{\mu}(\theta)$.
Example 9.3.4. Use the law of total variance to show that $\mathrm{Var}(\bar{X}) = VHM + (EPV/n)$ and derive a formula for $Z$ in terms of $\bar{X}$.

Solution. The quantity $\mathrm{Var}(\bar{X})$ is called the unconditional variance or the total variance of $\bar{X}$. The law of total variance, equation (16.2), says

$$\mathrm{Var}(\bar{X}) = \mathrm{E}(\mathrm{Var}(\bar{X}|\theta)) + \mathrm{Var}(\mathrm{E}(\bar{X}|\theta)).$$

In bullet (1) at the beginning of this section we showed $\mathrm{E}(\mathrm{Var}(\bar{X}|\theta)) = EPV/n$. In bullet (2), $\mathrm{Var}(\mathrm{E}(\bar{X}|\theta)) = VHM$. Reordering the right-hand side gives $\mathrm{Var}(\bar{X}) = VHM + (EPV/n)$. Another way to write the formula for credibility $Z$ is $Z = \mathrm{Var}(\mathrm{E}(\bar{X}|\theta))/\mathrm{Var}(\bar{X})$. This implies $(1-Z) = \mathrm{E}(\mathrm{Var}(\bar{X}|\theta))/\mathrm{Var}(\bar{X})$.

The following long example and solution demonstrate how to compute the
credibility-weighted estimate with frequency and severity data.
Example 9.3.5. For any risk in a population the number of losses 𝑁 in a year
has a Poisson distribution with parameter 𝜆. Individual loss amounts 𝑋 for
a selected risk are independent of 𝑁 and are iid with exponential distribution
𝐹 (𝑥) = 1 − 𝑒−𝑥/𝛽 . There are three types of risks in the population as shown
below. A risk was selected at random from the population and all losses were
recorded over a five-year period. The total amount of losses over the five-year
period was 5,000. Use Bühlmann credibility to estimate the annual expected
aggregate loss for the risk.

Risk Type   Percentage of Population   Poisson Parameter   Exponential Parameter
A           50%                        λ = 0.5             β = 1000
B           30%                        λ = 1.0             β = 1500
C           20%                        λ = 2.0             β = 2000

Solution. Because individual loss amounts $X$ are exponentially distributed, $\mathrm{E}(X|\beta) = \beta$ and $\mathrm{Var}(X|\beta) = \beta^2$. For aggregate loss $S = X_1 + \cdots + X_N$, the mean is $\mathrm{E}(S) = \mathrm{E}(N)\mathrm{E}(X)$ and the process variance is $\mathrm{Var}(S) = \mathrm{E}(N)\mathrm{Var}(X) + [\mathrm{E}(X)]^2\mathrm{Var}(N)$. With Poisson frequency and exponentially distributed loss amounts, $\mathrm{E}(S|\lambda,\beta) = \lambda\beta$ and $\mathrm{Var}(S|\lambda,\beta) = \lambda\beta^2 + \beta^2\lambda = 2\lambda\beta^2$.
Population mean $\mu$: risk means are $\mu(A) = 0.5(1000) = 500$; $\mu(B) = 1.0(1500) = 1500$; $\mu(C) = 2.0(2000) = 4000$; and $\mu = 0.50(500) + 0.30(1500) + 0.20(4000) = 1{,}500$.
VHM: $VHM = 0.50(500-1500)^2 + 0.30(1500-1500)^2 + 0.20(4000-1500)^2 = 1{,}750{,}000$.
EPV: process variances are $\sigma^2(A) = 2(0.5)(1000)^2 = 1{,}000{,}000$; $\sigma^2(B) = 2(1.0)(1500)^2 = 4{,}500{,}000$; $\sigma^2(C) = 2(2.0)(2000)^2 = 16{,}000{,}000$; and $EPV = 0.50(1{,}000{,}000) + 0.30(4{,}500{,}000) + 0.20(16{,}000{,}000) = 5{,}050{,}000$.
$\bar{X}$: $\bar{X} = 5{,}000/5 = 1{,}000$.
K: $K = 5{,}050{,}000/1{,}750{,}000 = 2.89$.
Z: there are five years of observations so $n = 5$ and $Z = 5/(5 + 2.89) = 0.63$.
$\hat{\mu}(\theta)$: $\hat{\mu}(\theta) = 0.63(1{,}000) + (1 - 0.63)1{,}500 = 1{,}185.00$.
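The arithmetic in Example 9.3.5 is easy to mirror in R; the sketch below is our own and the object names are illustrative.

```r
# Buhlmann credibility for Example 9.3.5
w      <- c(A = 0.50, B = 0.30, C = 0.20)  # population weights
lambda <- c(A = 0.5,  B = 1.0,  C = 2.0)   # Poisson means
beta   <- c(A = 1000, B = 1500, C = 2000)  # exponential means

mu_type  <- lambda * beta        # E(S | type)
var_type <- 2 * lambda * beta^2  # Var(S | type)

mu  <- sum(w * mu_type)                   # 1500
EPV <- sum(w * var_type)                  # 5,050,000
VHM <- sum(w * (mu_type - mu)^2)          # 1,750,000
K   <- EPV / VHM                          # about 2.886
n   <- 5; xbar <- 5000 / 5
Z   <- n / (n + K)                        # about 0.634
Z * xbar + (1 - Z) * mu                   # about 1183 (1185 when Z is rounded to 0.63)
```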

In real world applications of Bühlmann credibility the value of 𝐾 = 𝐸𝑃 𝑉 /𝑉 𝐻𝑀


must be estimated. Sometimes a value for 𝐾 is selected using judgment. A
smaller 𝐾 makes estimator 𝜇(𝜃)̂ more responsive to actual experience 𝑋̄ whereas
a larger 𝐾 produces a more stable estimate by giving more weight to 𝜇. Judg-
ment may be used to balance responsiveness and stability. A later section in
this chapter will discuss methods for determining 𝐾 from data.
For a policyholder with risk parameter $\theta$, Bühlmann credibility uses a linear approximation $\hat{\mu}(\theta) = Z\bar{X} + (1-Z)\mu$ to estimate $\mathrm{E}(\mu(\theta)|X_1,\ldots,X_n)$, the expected loss for the policyholder given prior losses $X_1,\ldots,X_n$. We can rewrite this as $\hat{\mu}(\theta) = a + b\bar{X}$, which makes it obvious that the credibility estimate is a linear function of $\bar{X}$.
If $\mathrm{E}(\mu(\theta)|X_1,\ldots,X_n)$ is approximated by the linear function $a + b\bar{X}$ and constants $a$ and $b$ are chosen so that $\mathrm{E}[(\mathrm{E}(\mu(\theta)|X_1,\ldots,X_n) - (a + b\bar{X}))^2]$ is minimized, what are $a$ and $b$? The answer is $b = n/(n+K)$ and $a = (1-b)\mu$ with $K = EPV/VHM$ and $\mu = \mathrm{E}(\mu(\theta))$. More details can be found in references (Bühlmann, 1967), (Bühlmann and Gisler, 2005), (Klugman et al., 2012), and (Tse, 2009).
Bühlmann credibility is also called least-squares credibility, greatest accuracy
credibility, or Bayesian credibility.

9.4 Bühlmann-Straub Credibility

In this section, you learn how to:


• Compute a credibility-weighted estimate for the expected loss for a risk
or group of risks using the Bühlmann-Straub model.
• Determine the credibility 𝑍 assigned to observations.
• Calculate required values including the Expected Value of the Process Vari-
ance (𝐸𝑃 𝑉 ), Variance of the Hypothetical Means (𝑉 𝐻𝑀 ) and collective
mean 𝜇.
• Recognize situations when the Bühlmann-Straub model is appropriate.

With standard Bühlmann or least-squares credibility as described in the prior


section, losses 𝑋1 , … , 𝑋𝑛 arising from a selected policyholder are assumed to be
iid. If the subscripts indicate year 1, year 2 and so on up to year 𝑛, then the
iid assumption means that the policyholder has the same exposure to loss every
year. For commercial insurance this assumption is frequently violated.

Consider a commercial policyholder that uses a fleet of vehicles in its business.


In year 1 there are 𝑚1 vehicles in the fleet, 𝑚2 vehicles in year 2, .., and 𝑚𝑛
vehicles in year 𝑛. The exposure to loss from ownership and use of this fleet is
not constant from year to year. The annual losses for the fleet are not iid.

Define 𝑌𝑗𝑘 to be the loss for the 𝑘𝑡ℎ vehicle in the fleet for year 𝑗. Then, the
total losses for the fleet in year 𝑗 are 𝑌𝑗1 + ⋯ + 𝑌𝑗𝑚𝑗 where we are adding up
the losses for each of the 𝑚𝑗 vehicles. In the Bühlmann-Straub model it is
assumed that random variables 𝑌𝑗𝑘 are iid across all vehicles and years for the
policyholder. With this assumption the means E(𝑌𝑗𝑘 |𝜃) = 𝜇(𝜃) and variances
Var(𝑌𝑗𝑘 |𝜃) = 𝜎2 (𝜃) are the same for all vehicles and years. The quantity 𝜇(𝜃)
is the expected loss and 𝜎2 (𝜃) is the variance in the loss for one year for one
vehicle for a policyholder with risk parameter 𝜃.

If 𝑋𝑗 is the average loss per unit of exposure in year 𝑗, 𝑋𝑗 = (𝑌𝑗1 + ⋯ +


𝑌𝑗𝑚𝑗 )/𝑚𝑗 , then E(𝑋𝑗 |𝜃) = 𝜇(𝜃) and Var(𝑋𝑗 |𝜃) = 𝜎2 (𝜃)/𝑚𝑗 for a policyholder
with risk parameter 𝜃. Note that we used the fact that the 𝑌𝑗𝑘 are iid for a
given policyholder. The average loss per vehicle for the entire 𝑛-year period is

$$\bar{X} = \frac{1}{m}\sum_{j=1}^{n} m_j X_j, \qquad m = \sum_{j=1}^{n} m_j.$$

It follows that $\mathrm{E}(\bar{X}|\theta) = \mu(\theta)$ and $\mathrm{Var}(\bar{X}|\theta) = \sigma^2(\theta)/m$ where $\mu(\theta)$ and $\sigma^2(\theta)$ are the mean and variance for a single vehicle for one year for the policyholder.

Example 9.4.1. Prove that $\mathrm{Var}(\bar{X}|\theta) = \sigma^2(\theta)/m$ for a risk with risk parameter $\theta$.

Solution
$$\begin{aligned}
\mathrm{Var}(\bar{X}|\theta) &= \mathrm{Var}\left(\frac{1}{m}\sum_{j=1}^{n} m_j X_j \,\Big|\, \theta\right) \\
&= \frac{1}{m^2}\sum_{j=1}^{n} \mathrm{Var}(m_j X_j|\theta) = \frac{1}{m^2}\sum_{j=1}^{n} m_j^2\,\mathrm{Var}(X_j|\theta) \\
&= \frac{1}{m^2}\sum_{j=1}^{n} m_j^2\,(\sigma^2(\theta)/m_j) = \frac{\sigma^2(\theta)}{m^2}\sum_{j=1}^{n} m_j = \sigma^2(\theta)/m.
\end{aligned}$$

The Bühlmann-Straub credibility estimate is

$$\hat{\mu}(\theta) = Z\bar{X} + (1-Z)\mu \qquad (9.9)$$

with

$\theta$ = a risk parameter that identifies a policyholder's risk level
$\hat{\mu}(\theta)$ = estimated expected loss for one exposure for the policyholder with loss experience $\bar{X}$
$\bar{X} = \frac{1}{m}\sum_{j=1}^{n} m_j X_j$ is the average loss per exposure for $m$ exposures, where $X_j$ is the average loss per exposure and $m_j$ is the number of exposures in year $j$
$Z$ = credibility assigned to $m$ exposures
$\mu$ = expected loss for one exposure for a randomly chosen policyholder from the population.

Note that $\hat{\mu}(\theta)$ is the estimator for the expected loss for one exposure. If the policyholder has $m_j$ exposures then the expected loss is $m_j\hat{\mu}(\theta)$.

In Example 9.3.4, it was shown that $Z = \mathrm{Var}(\mathrm{E}(\bar{X}|\theta))/\mathrm{Var}(\bar{X})$ where $\bar{X}$ is the average loss for $n$ observations. In equation (9.9) the $\bar{X}$ is the average loss for $m$ exposures and the same $Z$ formula can be used:

$$Z = \frac{\mathrm{Var}(\mathrm{E}(\bar{X}|\theta))}{\mathrm{Var}(\bar{X})} = \frac{\mathrm{Var}(\mathrm{E}(\bar{X}|\theta))}{\mathrm{E}(\mathrm{Var}(\bar{X}|\theta)) + \mathrm{Var}(\mathrm{E}(\bar{X}|\theta))}.$$

The denominator was expanded using the law of total variance. As noted above, $\mathrm{E}(\bar{X}|\theta) = \mu(\theta)$ so $\mathrm{Var}(\mathrm{E}(\bar{X}|\theta)) = \mathrm{Var}(\mu(\theta)) = VHM$. Because $\mathrm{Var}(\bar{X}|\theta) = \sigma^2(\theta)/m$, it follows that $\mathrm{E}(\mathrm{Var}(\bar{X}|\theta)) = \mathrm{E}(\sigma^2(\theta))/m = EPV/m$. Making these substitutions and using a little algebra gives

$$Z = \frac{m}{m+K}, \qquad K = \frac{EPV}{VHM}. \qquad (9.10)$$

This is the same 𝑍 as for Bühlmann credibility except number of exposures 𝑚


replaces number of years or observations 𝑛.
Example 9.4.2. A commercial automobile policyholder had the following ex-
posures and claims over a three-year period:

Year Number of Vehicles Number of Claims


1 9 5
2 12 4
3 15 4

• The number of claims in a year for each vehicle in the policyholder’s fleet
is Poisson distributed with the same mean (parameter) 𝜆.
• Parameter 𝜆 is distributed among the policyholders in the population with
pdf 𝑓(𝜆) = 6𝜆(1 − 𝜆) with 0 < 𝜆 < 1.
The policyholder has 18 vehicles in its fleet in year 4. Use Bühlmann-Straub
credibility to estimate the expected number of policyholder claims in year 4.
Solution. The expected number of claims for one vehicle for a randomly chosen policyholder is $\mu = \mathrm{E}(\lambda) = \int_0^1 \lambda[6\lambda(1-\lambda)]\,d\lambda = 1/2$. The average number of claims per vehicle for the policyholder is $\bar{X} = 13/36$. The expected value of the process variance for a single vehicle is $EPV = \mathrm{E}(\lambda) = 1/2$. The variance of the hypothetical means across policyholders is $VHM = \mathrm{Var}(\lambda) = \mathrm{E}(\lambda^2) - (\mathrm{E}(\lambda))^2 = \int_0^1 \lambda^2[6\lambda(1-\lambda)]\,d\lambda - (1/2)^2 = (3/10) - (1/4) = (6/20) - (5/20) = 1/20$. So, $K = EPV/VHM = (1/2)/(1/20) = 10$. The number of exposures in the experience period is $m = 9 + 12 + 15 = 36$. The credibility is $Z = 36/(36+10) = 18/23$. The credibility-weighted estimate for the number of claims for one vehicle is $\hat{\mu}(\theta) = Z\bar{X} + (1-Z)\mu = (18/23)(13/36) + (5/23)(1/2) = 9/23$. With 18 vehicles in the fleet in year 4 the expected number of claims is $18(9/23) = 162/23 = 7.04$.
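A hedged R sketch of the Bühlmann-Straub calculation in Example 9.4.2 (our own illustration):

```r
# Buhlmann-Straub estimate for Example 9.4.2
claims   <- c(5, 4, 4)
vehicles <- c(9, 12, 15)

mu   <- 1/2                 # E(lambda) for f(lambda) = 6 * lambda * (1 - lambda)
EPV  <- 1/2                 # Poisson process variance for one vehicle is lambda
VHM  <- 3/10 - (1/2)^2      # Var(lambda) = 1/20
K    <- EPV / VHM           # 10
m    <- sum(vehicles)       # 36 exposures
xbar <- sum(claims) / m     # 13/36
Z    <- m / (m + K)         # 18/23
est_per_vehicle <- Z * xbar + (1 - Z) * mu  # 9/23
18 * est_per_vehicle        # about 7.04 expected claims in year 4
```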

9.5 Bayesian Inference and Bühlmann Credibility

In this section, you learn how to:


• Use Bayes Theorem to determine a formula for the expected loss of a risk
given a likelihood and prior distribution.
• Determine the posterior distributions for the gamma-Poisson and beta-
binomial Bayesian models and compute expected values.
• Understand the connection between the Bühlmann and Bayesian estimates
for the gamma-Poisson and beta-binomial models.

Section 4.4 reviews Bayesian inference and it is assumed that the reader is fa-
miliar with that material. The reader is also advised to read the Bühlmann
credibility Section 9.3 in this chapter. This section will compare Bayesian infer-
ence with Bühlmann credibility and show connections between the two models.
A risk with risk parameter 𝜃 has expected loss 𝜇(𝜃) = E(𝑋|𝜃) with random
variable 𝑋 representing pure premium, aggregate loss, number of claims, claim
severity, or some other measure of loss during a period of time. If the risk
has 𝑛 losses 𝑋1 , … , 𝑋𝑛 during n separate periods of time, then these losses are
assumed to be 𝑖𝑖𝑑 for the policyholder and 𝜇(𝜃) = E(𝑋𝑖 |𝜃) for 𝑖 = 1, .., 𝑛.
If the risk had 𝑛 losses 𝑥1 , … , 𝑥𝑛 then E(𝜇(𝜃)|𝑥1 , … , 𝑥𝑛 ) is the conditional ex-
pectation of $\mu(\theta)$. The Bühlmann credibility formula $\hat{\mu}(\theta) = Z\bar{X} + (1-Z)\mu$ is a linear function of $\bar{X} = (x_1 + \cdots + x_n)/n$ used to estimate $\mathrm{E}(\mu(\theta)|x_1,\ldots,x_n)$.
a linear function of 𝑋 = (𝑥1 + ⋯ + 𝑥𝑛 )/𝑛 used to estimate E(𝜇(𝜃)|𝑥1 , … , 𝑥𝑛 ).
The expectation E(𝜇(𝜃)|𝑥1 , … , 𝑥𝑛 ) can be calculated from the conditional den-
sity function 𝑓(𝑥|𝜃) and the posterior distribution 𝜋(𝜃|𝑥1 , … , 𝑥𝑛 ):

E(𝜇(𝜃)|𝑥1 , … , 𝑥𝑛 ) = ∫ 𝜇(𝜃)𝜋(𝜃|𝑥1 , … , 𝑥𝑛 )𝑑𝜃

𝜇(𝜃) = E(𝑋|𝜃) = ∫ 𝑥𝑓(𝑥|𝜃)𝑑𝑥.

The posterior distribution comes from Bayes theorem

$$\pi(\theta|x_1,\ldots,x_n) = \frac{\prod_{j=1}^{n} f(x_j|\theta)}{f(x_1,\ldots,x_n)}\,\pi(\theta).$$

The conditional density function $f(x|\theta)$ and the prior distribution $\pi(\theta)$ must be specified. The numerator $\prod_{j=1}^{n} f(x_j|\theta)$ on the right-hand side is called the likelihood. The denominator $f(x_1,\ldots,x_n)$ is the joint density function for $n$ losses $x_1,\ldots,x_n$.

9.5.1 Gamma-Poisson Model


In the Gamma-Poisson model the number of claims 𝑋 has a Poisson distribution
Pr(𝑋 = 𝑥|𝜆) = 𝜆𝑥 𝑒−𝜆 /𝑥! for a risk with risk parameter 𝜆. The prior distribution
for 𝜆 is gamma with 𝜋(𝜆) = 𝛽 𝛼 𝜆𝛼−1 𝑒−𝛽𝜆 /Γ(𝛼). (Note that a rate parameter 𝛽
is being used in the gamma distribution rather than a scale parameter.) The
mean of the gamma is E(𝜆) = 𝛼/𝛽 and the variance is Var(𝜆) = 𝛼/𝛽 2 . In this
section we will assume that 𝜆 is the expected number of claims per year though
we could have chosen another time interval.
If a risk is selected at random from the population then the expected number of
claims in a year is E(𝑁 ) = E(E[𝑁 |𝜆]) = E(𝜆) = 𝛼/𝛽. If we had no observations
for the selected risk then the expected number of claims for the risk is 𝛼/𝛽.
During 𝑛 years the following number of claims by year was observed for the ran-
domly selected risk: 𝑥1 , … , 𝑥𝑛 . From Bayes theorem the posterior distribution
is

$$\pi(\lambda|x_1,\ldots,x_n) = \frac{\prod_{j=1}^{n}\left(\lambda^{x_j} e^{-\lambda}/x_j!\right)}{\Pr(X_1 = x_1,\ldots,X_n = x_n)}\,\beta^\alpha\lambda^{\alpha-1}e^{-\beta\lambda}/\Gamma(\alpha).$$

Combining terms that have a 𝜆 and putting all other terms into constant 𝐶
gives

$$\pi(\lambda|x_1,\ldots,x_n) = C\,\lambda^{\left(\alpha + \sum_{j=1}^{n} x_j\right) - 1} e^{-(\beta+n)\lambda}.$$

This is a gamma distribution with parameters $\alpha' = \alpha + \sum_{j=1}^{n} x_j$ and $\beta' = \beta + n$. The constant must be $C = \beta'^{\,\alpha'}/\Gamma(\alpha')$ so that $\int_0^\infty \pi(\lambda|x_1,\ldots,x_n)\,d\lambda = 1$, though we do not need to know $C$. As explained in Chapter 4 the gamma distribution is a conjugate prior for the Poisson distribution so the posterior distribution is also gamma. See also Appendix Section 16.3.2.
Because the posterior distribution is gamma the expected number of claims for
the selected risk is

$$\mathrm{E}(\lambda|x_1,\ldots,x_n) = \frac{\alpha + \sum_{j=1}^{n} x_j}{\beta + n} = \frac{\alpha + \text{number of claims}}{\beta + \text{number of years}}.$$

This formula is slightly different from Chapter 4 because parameter 𝛽 is multi-


plied by 𝜆 in the exponential of the gamma pdf whereas in Chapter 4 𝜆 is divided
by parameter 𝜃. We have chosen this form for the exponential to simplify the
equation for the expected number of claims.
Now we will compute the Bühlmann credibility estimate for the gamma-Poisson
model. The variance for a Poisson distribution with parameter 𝜆 is 𝜆 so 𝐸𝑃 𝑉 =
E(Var(𝑋|𝜆)) = E(𝜆) = 𝛼/𝛽. The mean number of claims per year for the risk
is 𝜆 so 𝑉 𝐻𝑀 = Var(E(𝑋|𝜆)) = Var(𝜆) = 𝛼/𝛽 2 . The credibility parameter
is $K = EPV/VHM = (\alpha/\beta)/(\alpha/\beta^2) = \beta$. The overall mean is $\mathrm{E}(\mathrm{E}(X|\lambda)) = \mathrm{E}(\lambda) = \alpha/\beta$. The sample mean is $\bar{X} = (\sum_{j=1}^{n} x_j)/n$. The credibility-weighted estimate for the expected number of claims for the risk is

$$\hat{\mu} = \frac{n}{n+\beta}\,\frac{\sum_{j=1}^{n} x_j}{n} + \left(1 - \frac{n}{n+\beta}\right)\frac{\alpha}{\beta} = \frac{\alpha + \sum_{j=1}^{n} x_j}{\beta + n}.$$

For the gamma-Poisson model the Bühlmann credibility estimate matches the
Bayesian analysis result.
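The agreement can be verified numerically. In the R sketch below the prior parameters and the claim history are hypothetical values chosen only for illustration.

```r
# Gamma-Poisson: Bayesian posterior mean versus Buhlmann estimate
alpha <- 3; beta <- 2      # hypothetical prior parameters (rate parameterization)
x     <- c(0, 2, 1, 1)     # hypothetical annual claim counts
n     <- length(x)

bayes_mean <- (alpha + sum(x)) / (beta + n)   # posterior gamma mean

K <- beta                                     # K = EPV / VHM = beta
Z <- n / (n + K)
buhlmann <- Z * mean(x) + (1 - Z) * alpha / beta

c(bayes = bayes_mean, buhlmann = buhlmann)    # the two estimates are identical
```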

9.5.2 Beta-Binomial Model


The Beta-Binomial model is useful for modeling the probability of an event.
Assume that random variable 𝑋 is the number of successes in 𝑛 trials and that
𝑋 has a binomial distribution Pr(𝑋 = 𝑥|𝑝) = (𝑛𝑥)𝑝𝑥 (1 − 𝑝)𝑛−𝑥 . In the beta-
binomial model the prior distribution for probability 𝑝 is a beta distribution
with pdf

$$\pi(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,p^{\alpha-1}(1-p)^{\beta-1}, \quad 0 < p < 1,\ \alpha > 0,\ \beta > 0.$$

The posterior distribution for 𝑝 given an outcome of 𝑥 successes in 𝑛 trials is

$$\pi(p|x) = \frac{\binom{n}{x}p^x(1-p)^{n-x}}{\Pr(x)}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,p^{\alpha-1}(1-p)^{\beta-1}.$$

Combining terms that have a 𝑝 and putting everything else into the constant 𝐶
yields

𝜋(𝑝|𝑥) = 𝐶𝑝𝛼+𝑥−1 (1 − 𝑝)𝛽+(𝑛−𝑥)−1 .

This is a beta distribution with new parameters 𝛼′ = 𝛼 + 𝑥 and 𝛽 ′ = 𝛽 + (𝑛 − 𝑥).


The constant must be

$$C = \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+x)\Gamma(\beta+n-x)}.$$

The mean for the beta distribution with parameters 𝛼 and 𝛽 is E(𝑝) = 𝛼/(𝛼+𝛽).
Given 𝑥 successes in 𝑛 trials in the beta-binomial model the mean of the posterior
distribution is

$$\mathrm{E}(p|x) = \frac{\alpha+x}{\alpha+\beta+n}.$$

As the number of trials 𝑛 and successes 𝑥 increase, the expected value of 𝑝


approaches 𝑥/𝑛.
The Bühlmann credibility estimate for E(𝑝|𝑥) is exactly the same as the
Bayesian estimate, as demonstrated in the following example.
Example 9.5.1 The probability that a coin toss will yield heads is 𝑝. The prior
distribution for probability 𝑝 is beta with parameters 𝛼 and 𝛽. On 𝑛 tosses of
the coin there were exactly 𝑥 heads. Use Bühlmann credibility to estimate the
expected value of 𝑝.
Solution Define random variables 𝑌𝑗 such that 𝑌𝑗 = 1 if the 𝑗𝑡ℎ coin toss is
heads and 𝑌𝑗 = 0 if tails for 𝑗 = 1, … , 𝑛. Random variables 𝑌𝑗 are iid conditional
on 𝑝 with Pr[𝑌 = 1|𝑝] = 𝑝 and Pr[𝑌 = 0|𝑝] = 1 − 𝑝. The number of heads in 𝑛
tosses can be represented by the random variable 𝑋 = 𝑌1 + ⋯ + 𝑌𝑛 . We want to
estimate 𝑝 = 𝐸[𝑌𝑗 ] using Bühlmann credibility: 𝑝̂ = 𝑍 𝑌 ̄ + (1 − 𝑍)𝜇. The overall
mean is 𝜇 = E(E(𝑌𝑗 |𝑝)) = E(𝑝) = 𝛼/(𝛼 + 𝛽). The sample mean is 𝑦 ̄ = 𝑥/𝑛. The
credibility is 𝑍 = 𝑛/(𝑛 + 𝐾) and 𝐾 = 𝐸𝑃 𝑉 /𝑉 𝐻𝑀 . With Var(𝑌𝑗 |𝑝) = 𝑝(1 − 𝑝)
it follows that 𝐸𝑃 𝑉 = E(Var[𝑌𝑗 |𝑝]) = E(𝑝(1 − 𝑝)). Because E(𝑌𝑗 |𝑝) = 𝑝 then
$VHM = \mathrm{Var}(\mathrm{E}(Y_j|p)) = \mathrm{Var}(p)$. For the beta distribution

$$\mathrm{E}(p) = \frac{\alpha}{\alpha+\beta}, \quad \mathrm{E}(p^2) = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}, \quad \text{and} \quad \mathrm{Var}(p) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$

Parameter $K = EPV/VHM = [\mathrm{E}(p) - \mathrm{E}(p^2)]/\mathrm{Var}(p)$. With some algebra this reduces to $K = \alpha+\beta$. The Bühlmann credibility-weighted estimate is

$$\hat{p} = \frac{n}{n+\alpha+\beta}\left(\frac{x}{n}\right) + \left(1 - \frac{n}{n+\alpha+\beta}\right)\frac{\alpha}{\alpha+\beta} = \frac{\alpha+x}{\alpha+\beta+n},$$

which is the same as the Bayesian posterior mean.

9.5.3 Exact Credibility


As demonstrated in the prior section, the Bühlmann credibility estimates for
the gamma-Poisson and beta-binomial models exactly match the Bayesian anal-
ysis results. The term exact credibility is applied in these situations. Exact
credibility may occur if the probability distribution for 𝑋𝑗 is in the linear ex-
ponential family and the prior distribution is a conjugate prior. Besides these
two models, examples of exact credibility also include Gamma-Exponential and
Normal-Normal models.
It is also noteworthy that if the conditional mean E(𝜇(𝜃)|𝑋1 , ..., 𝑋𝑛 ) is linear
in the past observations, then the Bühlmann credibility estimate will coincide
with the Bayesian estimate. More information about exact credibility can be
found in (Bühlmann and Gisler, 2005), (Klugman et al., 2012), and (Tse, 2009).

9.6 Estimating Credibility Parameters

In this section, you learn how to:

• Perform nonparametric estimation with the Bühlmann and Bühlmann-


Straub credibility models.
• Identify situations when semiparametric estimation is appropriate.
• Use data to approximate the 𝐸𝑃 𝑉 and 𝑉 𝐻𝑀 .
• Balance credibility-weighted estimates.

The examples in this chapter have provided assumptions for calculating credi-
bility parameters. In actual practice the actuary must use real world data and
judgment to determine credibility parameters.

9.6.1 Full Credibility Standard for Limited Fluctuation Credibility
Limited-fluctuation credibility requires a full credibility standard. The general
formula for aggregate losses or pure premium, as obtained in formula (9.5), is

$$n_S = \left(\frac{y_p}{k}\right)^2\left[\left(\frac{\sigma_N^2}{\mu_N}\right) + \left(\frac{\sigma_X}{\mu_X}\right)^2\right],$$

with 𝑁 representing number of claims and 𝑋 the size of claims. If one assumes
𝜎𝑋 = 0 then the full credibility standard for frequency results. If 𝜎𝑁 = 0 then
the full credibility formula for severity follows. Probability 𝑝 and 𝑘 value are
often selected using judgment and experience.

In practice it is often assumed that the number of claims is Poisson distributed so that $\sigma_N^2/\mu_N = 1$. In this case the formula can be simplified to

$$n_S = \left(\frac{y_p}{k}\right)^2\left[\frac{\mathrm{E}(X^2)}{(\mathrm{E}(X))^2}\right].$$

An empirical mean and second moment for the sizes of individual claim losses
can be computed from past data, if available.
9.6.2 Nonparametric Estimation for Bühlmann and Bühlmann-Straub Models
Bayesian analysis as described previously requires assumptions about a prior
distribution and likelihood. It is possible to produce estimates without these
assumptions and these methods are often referred to as empirical Bayes methods.
Bühlmann and Bühlmann-Straub credibility with parameters estimated from
the data are included in the category of empirical Bayes methods.
Bühlmann Model First we will address the simpler Bühlmann model. Assume
that there are 𝑟 risks in a population. For risk 𝑖 with risk parameter 𝜃𝑖 the
losses for 𝑛 periods are 𝑋𝑖1 , … , 𝑋𝑖𝑛 . The losses for a given risk are iid across
periods as assumed in the Bühlmann model. For risk $i$ the sample mean is $\bar{X}_i = \sum_{j=1}^{n} X_{ij}/n$ and the unbiased sample process variance is $s_i^2 = \sum_{j=1}^{n}(X_{ij} - \bar{X}_i)^2/(n-1)$. An unbiased estimator for the $EPV$ can be calculated by taking the average of $s_i^2$ for the $r$ risks in the population:

$$\widehat{EPV} = \frac{1}{r}\sum_{i=1}^{r} s_i^2 = \frac{1}{r(n-1)}\sum_{i=1}^{r}\sum_{j=1}^{n}(X_{ij} - \bar{X}_i)^2. \qquad (9.11)$$

The individual risk means 𝑋̄ 𝑖 for 𝑖 = 1, … , 𝑟 can be used to estimate the 𝑉 𝐻𝑀 .


An unbiased estimator of Var(𝑋̄ 𝑖 ) is

$$\widehat{\mathrm{Var}}(\bar{X}_i) = \frac{1}{r-1}\sum_{i=1}^{r}(\bar{X}_i - \bar{X})^2 \quad \text{and} \quad \bar{X} = \frac{1}{r}\sum_{i=1}^{r}\bar{X}_i,$$

but Var(𝑋̄ 𝑖 ) is not the 𝑉 𝐻𝑀 . Using equation (16.2), the total variance formula
or unconditional variance formula is

Var(𝑋̄ 𝑖 ) = E(Var(𝑋̄ 𝑖 |Θ = 𝜃𝑖 )) + Var(E(𝑋̄ 𝑖 |Θ = 𝜃𝑖 )).

The 𝑉 𝐻𝑀 is the second term on the right because 𝜇(𝜃𝑖 ) = E(𝑋̄ 𝑖 |Θ = 𝜃𝑖 ) is the
hypothetical mean for risk 𝑖. So,

$$VHM = \mathrm{Var}(\mu(\theta_i)) = \mathrm{Var}(\bar{X}_i) - \mathrm{E}(\mathrm{Var}(\bar{X}_i|\Theta = \theta_i)).$$

As discussed previously in Section 9.3.1, 𝐸𝑃 𝑉 /𝑛 = E(Var[𝑋̄ 𝑖 |Θ = 𝜃𝑖 ]) and


using the above estimators gives an unbiased estimator for the 𝑉 𝐻𝑀 :

$$\widehat{VHM} = \frac{1}{r-1}\sum_{i=1}^{r}(\bar{X}_i - \bar{X})^2 - \frac{\widehat{EPV}}{n}. \qquad (9.12)$$
Although the expected loss for a risk with parameter 𝜃𝑖 is 𝜇(𝜃𝑖 )=E(𝑋̄ 𝑖 |Θ = 𝜃𝑖 ),
the variance of the sample mean 𝑋̄ 𝑖 is greater than or equal to the variance
of the hypothetical means: Var(𝑋̄ 𝑖 ) ≥Var(𝜇(𝜃𝑖 )). The variance in the sample
means Var(𝑋̄ 𝑖 ) includes both the variance in the hypothetical means plus a
process variance term.
In some cases formula (9.12) can produce a negative value for $\widehat{VHM}$ because of the subtraction of $\widehat{EPV}/n$, but a variance cannot be negative. The process
variance within risks is so large that it overwhelms the measurement of the
variance in means between risks. In this case we cannot use this method to
determine the values needed for Bühlmann credibility.
Example 9.6.1. Two policyholders had claims over a three-year period as
shown in the table below. Estimate the expected number of claims for each
policyholder using Bühlmann credibility and calculating necessary parameters
from the data.

Year Risk A Risk B


1 0 2
2 1 1
3 0 2

Solution.
$\bar{x}_A = \frac{1}{3}(0+1+0) = \frac{1}{3}$, $\bar{x}_B = \frac{1}{3}(2+1+2) = \frac{5}{3}$
$\bar{x} = \frac{1}{2}\left(\frac{1}{3}+\frac{5}{3}\right) = 1$
$s_A^2 = \frac{1}{3-1}\left[(0-\frac{1}{3})^2 + (1-\frac{1}{3})^2 + (0-\frac{1}{3})^2\right] = \frac{1}{3}$
$s_B^2 = \frac{1}{3-1}\left[(2-\frac{5}{3})^2 + (1-\frac{5}{3})^2 + (2-\frac{5}{3})^2\right] = \frac{1}{3}$
$\widehat{EPV} = \frac{1}{2}\left(\frac{1}{3}+\frac{1}{3}\right) = \frac{1}{3}$
$\widehat{VHM} = \frac{1}{2-1}\left[(\frac{1}{3}-1)^2 + (\frac{5}{3}-1)^2\right] - \frac{1/3}{3} = \frac{7}{9}$
$K = \frac{1/3}{7/9} = \frac{3}{7}$
$Z = \frac{3}{3+(3/7)} = \frac{7}{8}$
$\hat{\mu}_A = \frac{7}{8}\left(\frac{1}{3}\right) + \left(1-\frac{7}{8}\right)(1) = \frac{5}{12}$
$\hat{\mu}_B = \frac{7}{8}\left(\frac{5}{3}\right) + \left(1-\frac{7}{8}\right)(1) = \frac{19}{12}$
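The nonparametric formulas (9.11) and (9.12) applied to Example 9.6.1 can be coded in a few lines of R; this sketch is our own illustration.

```r
# Nonparametric Buhlmann estimation for Example 9.6.1
X <- rbind(A = c(0, 1, 0), B = c(2, 1, 2))  # rows = risks, columns = years
r <- nrow(X); n <- ncol(X)

xbar_i <- rowMeans(X)                        # 1/3 and 5/3
xbar   <- mean(xbar_i)                       # 1
EPV    <- mean(apply(X, 1, var))             # 1/3; var() divides by n - 1
VHM    <- sum((xbar_i - xbar)^2) / (r - 1) - EPV / n  # 7/9
Z      <- n / (n + EPV / VHM)                # 7/8
Z * xbar_i + (1 - Z) * xbar                  # 5/12 and 19/12
```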

Example 9.6.2. Two policyholders had claims over a three-year period as


shown in the table below. Calculate the nonparametric estimate for the 𝑉 𝐻𝑀 .

Year Risk A Risk B


1 3 3
2 0 0
3 0 3
Solution.
$\bar{x}_A = \frac{1}{3}(3+0+0) = 1$, $\bar{x}_B = \frac{1}{3}(3+0+3) = 2$
$\bar{x} = \frac{1}{2}(1+2) = \frac{3}{2}$
$s_A^2 = \frac{1}{3-1}\left[(3-1)^2 + (0-1)^2 + (0-1)^2\right] = 3$
$s_B^2 = \frac{1}{3-1}\left[(3-2)^2 + (0-2)^2 + (3-2)^2\right] = 3$
$\widehat{EPV} = \frac{1}{2}(3+3) = 3$
$\widehat{VHM} = \frac{1}{2-1}\left[(1-\frac{3}{2})^2 + (2-\frac{3}{2})^2\right] - \frac{3}{3} = -\frac{1}{2}$.
The process variance is so large that it is not possible to estimate the $VHM$.

Bühlmann-Straub Model Empirical formulas for 𝐸𝑃 𝑉 and 𝑉 𝐻𝑀 in the


Bühlmann-Straub model are more complicated because a risk’s number of ex-
posures can change from one period to another. Also, the number of experience
periods does not have to be constant across the population. First some defini-
tions:
• 𝑋𝑖𝑗 is the losses per exposure for risk 𝑖 in period 𝑗. Losses can refer to
number of claims or amount of loss. There are 𝑟 risks so 𝑖 = 1, … , 𝑟.
• 𝑛𝑖 is the number of observation periods for risk 𝑖
• 𝑚𝑖𝑗 is the number of exposures for risk 𝑖 in period 𝑗 for 𝑗 = 1, … , 𝑛𝑖
Risk 𝑖 with risk parameter 𝜃𝑖 has 𝑚𝑖𝑗 exposures in period 𝑗 which means that
the losses per exposure random variable can be written as 𝑋𝑖𝑗 = (𝑌𝑖1 + ⋯ +
𝑌𝑖𝑚𝑖𝑗 )/𝑚𝑖𝑗 . Random variable 𝑌𝑖𝑘 is the loss for one exposure. For risk 𝑖 losses
𝑌𝑖𝑘 are iid with mean E(𝑌𝑖𝑘 ) = 𝜇(𝜃𝑖 ) and process variance Var(𝑌𝑖𝑘 ) = 𝜎2 (𝜃𝑖 ).
It follows that Var(𝑋𝑖𝑗 ) = 𝜎2 (𝜃𝑖 )/𝑚𝑖𝑗 .
Two more important definitions are:
• $\bar{X}_i = \frac{1}{m_i}\sum_{j=1}^{n_i} m_{ij} X_{ij}$ with $m_i = \sum_{j=1}^{n_i} m_{ij}$; $\bar{X}_i$ is the average loss per exposure for risk $i$ for all observation periods combined.
• $\bar{X} = \frac{1}{m}\sum_{i=1}^{r} m_i \bar{X}_i$ with $m = \sum_{i=1}^{r} m_i$; $\bar{X}$ is the average loss per exposure for all risks for all observation periods combined.
An unbiased estimator for the process variance 𝜎2 (𝜃𝑖 ) of one exposure for risk 𝑖
is

s_i^2 = \frac{\sum_{j=1}^{n_i} m_{ij}(X_{ij}-\bar{X}_i)^2}{n_i - 1}.

The weights m_{ij} are applied to the squared differences because the X_{ij} are
averages of m_{ij} exposures. The weighted average of the sample variances
s_i^2 for the risks in the population, with weights proportional to (n_i - 1),
produces the expected value of the process variance (EPV) estimate

\widehat{EPV} = \frac{\sum_{i=1}^{r}(n_i-1)s_i^2}{\sum_{i=1}^{r}(n_i-1)} = \frac{\sum_{i=1}^{r}\sum_{j=1}^{n_i} m_{ij}(X_{ij}-\bar{X}_i)^2}{\sum_{i=1}^{r}(n_i-1)}.

The quantity \widehat{EPV} is an unbiased estimator for the expected value of the process
variance of one exposure for a risk chosen at random from the population.
To calculate an estimator for the variance in the hypothetical means (𝑉 𝐻𝑀 )
the squared differences of the individual risk sample means 𝑋̄ 𝑖 and population
mean 𝑋̄ are used. An unbiased estimator for the 𝑉 𝐻𝑀 is

\widehat{VHM} = \frac{\sum_{i=1}^{r} m_i(\bar{X}_i-\bar{X})^2 - (r-1)\widehat{EPV}}{m - \frac{1}{m}\sum_{i=1}^{r} m_i^2}.

This complicated formula is necessary because of the varying number of expo-


sures. Proofs that the 𝐸𝑃 𝑉 and 𝑉 𝐻𝑀 estimators shown above are unbiased
can be found in several references mentioned at the end of this chapter including
(Bühlmann and Gisler, 2005), (Klugman et al., 2012), and (Tse, 2009).
Example 9.6.3. Two policyholders had claims shown in the table below. Es-
timate the expected number of claims per vehicle for each policyholder using
Bühlmann-Straub credibility and calculating parameters from the data.

Policyholder Year 1 Year 2 Year 3 Year 4


A Number of claims 0 2 2 3
A Insured vehicles 1 2 2 2

B Number of claims 0 0 1 2
B Insured vehicles 0 2 3 4

Solution.
\bar{x}_A = \frac{0+2+2+3}{1+2+2+2} = 1
\bar{x}_B = \frac{0+1+2}{2+3+4} = \frac{1}{3}
\bar{x} = \frac{7(1)+9(1/3)}{7+9} = \frac{5}{8}
s_A^2 = \frac{1}{4-1}\left[1(0-1)^2 + 2(1-1)^2 + 2(1-1)^2 + 2(\tfrac{3}{2}-1)^2\right] = \frac{1}{2}
s_B^2 = \frac{1}{3-1}\left[2(0-\tfrac{1}{3})^2 + 3(\tfrac{1}{3}-\tfrac{1}{3})^2 + 4(\tfrac{1}{2}-\tfrac{1}{3})^2\right] = \frac{1}{6}
\widehat{EPV} = \left[3\left(\tfrac{1}{2}\right) + 2\left(\tfrac{1}{6}\right)\right]/(3+2) = \frac{11}{30} = 0.3667
\widehat{VHM} = \left[7(1-\tfrac{5}{8})^2 + 9(\tfrac{1}{3}-\tfrac{5}{8})^2 - (2-1)\tfrac{11}{30}\right] / \left[16 - \left(\tfrac{1}{16}\right)(7^2+9^2)\right] = 0.1757
K = \frac{0.3667}{0.1757} = 2.0871
m_A = 7, \quad m_B = 9
Z_A = \frac{7}{7+2.0871} = 0.7703, \quad Z_B = \frac{9}{9+2.0871} = 0.8118
\hat{\mu}_A = 0.7703(1) + (1-0.7703)(5/8) = 0.9139
\hat{\mu}_B = 0.8118(1/3) + (1-0.8118)(5/8) = 0.3882

9.6.3 Semiparametric Estimation for Bühlmann and Bühlmann-Straub Models

In the prior section on nonparametric estimation, there were no assumptions
about the distribution of the losses per exposure 𝑋𝑖𝑗 . Assuming that the 𝑋𝑖𝑗
have a particular distribution and using properties of the distribution along with
the data to determine credibility parameters is referred to as semiparametric
estimation.
An example of semiparametric estimation would be the assumption of a Poisson
distribution when estimating claim frequencies. The Poisson distribution has
the property that the mean and variance are identical and this property can
simplify calculations. The following simple example comes from the prior section
but now includes a Poisson assumption about claim frequencies.
Example 9.6.4. Two policyholders had claims over a three-year period as
shown in the table below. Assume that the number of claims for each risk
has a Poisson distribution. Estimate the expected number of claims for each
policyholder using Bühlmann credibility and calculating necessary parameters
from the data.

Year Risk A Risk B


1 0 2
2 1 1
3 0 2

Solution.
\bar{x}_A = \frac{1}{3}(0+1+0) = \frac{1}{3}, \quad \bar{x}_B = \frac{1}{3}(2+1+2) = \frac{5}{3}
\bar{x} = \frac{1}{2}\left(\frac{1}{3}+\frac{5}{3}\right) = 1
With the Poisson assumption, the estimated variance for risk A is \hat{\sigma}_A^2 = \bar{x}_A = \frac{1}{3}. Similarly, \hat{\sigma}_B^2 = \bar{x}_B = \frac{5}{3}.
\widehat{EPV} = \frac{1}{2}\left(\frac{1}{3}\right) + \frac{1}{2}\left(\frac{5}{3}\right) = 1. This is also \bar{x} because of the Poisson assumption.
\widehat{VHM} = \frac{1}{2-1}\left[(\tfrac{1}{3}-1)^2+(\tfrac{5}{3}-1)^2\right] - \frac{1}{3} = \frac{5}{9}
K = \frac{1}{5/9} = \frac{9}{5}
Z_A = Z_B = \frac{3}{3+(9/5)} = \frac{5}{8}
\hat{\mu}_A = \frac{5}{8}\left(\frac{1}{3}\right) + \left(1-\frac{5}{8}\right)1 = \frac{7}{12}
\hat{\mu}_B = \frac{5}{8}\left(\frac{5}{3}\right) + \left(1-\frac{5}{8}\right)1 = \frac{17}{12}.

Although we assumed that the number of claims for each risk was Poisson
distributed in the prior example, we did not need this additional assumption
because there was enough information to use nonparametric estimation. In fact,
the Poisson assumption might not be appropriate because for risk B the sample
mean is not equal to the sample variance: \bar{x}_B = \frac{5}{3} \ne s_B^2 = \frac{1}{3}.

The following example is commonly used to demonstrate a situation where semi-


parametric estimation is needed. There is insufficient information for nonpara-
metric estimation but with the Poisson assumption, estimates can be calculated.

Example 9.6.5. A portfolio of 2,000 policyholders generated the following


claims profile during a five-year period:

Number of Claims
In 5 Years Number of policies
0 923
1 682
2 249
3 70
4 51
5 25

In your model you assume that the number of claims for each policyholder
has a Poisson distribution and that a policyholder’s expected number of claims
is constant through time. Use Bühlmann credibility to estimate the annual
expected number of claims for policyholders with 3 claims during the five-year
period.

Solution Let 𝜃𝑖 be the risk parameter for the 𝑖𝑡ℎ risk in the portfolio with mean
𝜇(𝜃𝑖 ) and variance 𝜎2 (𝜃𝑖 ). With the Poisson assumption 𝜇(𝜃𝑖 ) = 𝜎2 (𝜃𝑖 ). The ex-
pected value of the process variance is 𝐸𝑃 𝑉 = E(𝜎2 (𝜃𝑖 )) where the expectation
is taken across all risks in the population. Because of the Poisson assumption
for all risks it follows that 𝐸𝑃 𝑉 = E(𝜎2 (𝜃𝑖 )) = E(𝜇(𝜃𝑖 )). An estimate for the an-
nual expected number of claims is 𝜇(𝜃 ̂ 𝑖 )= (observed number of claims)/5. This
can also serve as the estimate for the expected value of the process variance for
a risk. Weighting the process variance estimates (or means) by the number of
policies in each group gives the estimators

\widehat{EPV} = \bar{x} = \frac{923(0)+682(1)+249(2)+70(3)+51(4)+25(5)}{(5)(2000)} = 0.1719.

Using formula (9.12), the VHM estimator is



\widehat{VHM} = \frac{1}{2000-1}\left[923(0-0.1719)^2 + 682(0.20-0.1719)^2 + 249(0.40-0.1719)^2 + 70(0.60-0.1719)^2 + 51(0.80-0.1719)^2 + 25(1-0.1719)^2\right] - \frac{0.1719}{5} = 0.0111
\hat{K} = \widehat{EPV}/\widehat{VHM} = 0.1719/0.0111 = 15.49
\hat{Z} = \frac{5}{5+15.49} = 0.2440
\hat{\mu}_{3\ \mathrm{claims}} = 0.2440(3/5) + (1-0.2440)(0.1719) = 0.2764.
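A minimal R sketch of the same semiparametric calculation, assuming the claim-count table above is entered as two vectors:

# Semiparametric (Poisson) estimation for the five-year portfolio of Example 9.6.5
claims_in_5yr <- 0:5
n_policies    <- c(923, 682, 249, 70, 51, 25)

r    <- sum(n_policies)                                   # 2000 policyholders
xbar <- sum(n_policies * claims_in_5yr) / (5 * r)         # annual mean, 0.1719
EPV  <- xbar                                              # Poisson: mean equals variance

annual_rate <- claims_in_5yr / 5
VHM <- sum(n_policies * (annual_rate - xbar)^2) / (r - 1) - EPV / 5   # about 0.0111
K   <- EPV / VHM                                          # about 15.5
Z   <- 5 / (5 + K)                                        # about 0.244
Z * (3 / 5) + (1 - Z) * xbar                              # about 0.276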

9.6.4 Balancing Credibility Estimators


The credibility weighted model \hat{\mu}(\theta_i) = Z_i\bar{X}_i + (1-Z_i)\bar{X}, where \bar{X}_i is the loss per exposure for risk i and \bar{X} is the loss per exposure for the population, can be used to estimate the expected loss for risk i. The overall mean is \bar{X} = \sum_{i=1}^{r}(m_i/m)\bar{X}_i, where m_i and m are the number of exposures for risk i and the population, respectively.
For the credibility weighted estimators to be in balance we want

\bar{X} = \sum_{i=1}^{r}(m_i/m)\bar{X}_i = \sum_{i=1}^{r}(m_i/m)\hat{\mu}(\theta_i).

If this equation is satisfied then the estimated losses for each risk will add up to
the population total, an important goal in ratemaking, but this may not happen
if the complement of credibility is applied to 𝑋.̄
To achieve balance, we will set 𝑀̂ 𝑋 as the amount that is applied to the com-
plement of credibility and thus analyze the following equation:

\sum_{i=1}^{r}(m_i/m)\bar{X}_i = \sum_{i=1}^{r}(m_i/m)\left\{Z_i\bar{X}_i + (1-Z_i)\hat{M}_X\right\}.

A little algebra gives

\sum_{i=1}^{r} m_i\bar{X}_i = \sum_{i=1}^{r} m_iZ_i\bar{X}_i + \hat{M}_X\sum_{i=1}^{r} m_i(1-Z_i),

and

\hat{M}_X = \frac{\sum_{i=1}^{r} m_i(1-Z_i)\bar{X}_i}{\sum_{i=1}^{r} m_i(1-Z_i)}.

Using this value for 𝑀̂ 𝑋 will bring the credibility weighted estimators into bal-
ance.
If credibilities 𝑍𝑖 were computed using the Bühlmann-Straub model, then 𝑍𝑖 =
𝑚𝑖 /(𝑚𝑖 + 𝐾). The prior formula can be simplified using the following relation-
ship

m_i(1-Z_i) = m_i\left(1 - \frac{m_i}{m_i+K}\right) = m_i\left(\frac{(m_i+K)-m_i}{m_i+K}\right) = KZ_i.

Therefore, an amount that, when applied to the complement of credibility, will
bring the credibility-weighted estimators into balance with the overall mean loss
per exposure is

\hat{M}_X = \frac{\sum_{i=1}^{r} Z_i\bar{X}_i}{\sum_{i=1}^{r} Z_i}.

Example 9.6.6. An example from the nonparametric Bühlmann-Straub sec-


tion had the following data for two risks. Find the amount associated with the
complement of credibility, 𝑀̂ 𝑋 , that will produce credibility-weighted estimates
that are in balance.

Policyholder Year 1 Year 2 Year 3 Year 4


A Number of claims 0 2 2 3
A Insured vehicles 1 2 2 2

B Number of claims 0 0 1 2
B Insured vehicles 0 2 3 4

Solution. The credibilities from the prior example are Z_A = \frac{7}{7+2.0871} = 0.7703 and Z_B = \frac{9}{9+2.0871} = 0.8118. The sample means are \bar{x}_A = 1 and \bar{x}_B = 1/3. The balanced complement of credibility is

\hat{M}_X = \frac{0.7703(1) + 0.8118(1/3)}{0.7703 + 0.8118} = 0.6579.

The updated credibility estimates are \hat{\mu}_A = 0.7703(1) + (1-0.7703)(0.6579) = 0.9214 versus the previous 0.9139 and \hat{\mu}_B = 0.8118(1/3) + (1-0.8118)(0.6579) = 0.3944 versus the previous 0.3882. Checking the balance on the new estimators: (7/16)(0.9214) + (9/16)(0.3944) = 0.6250. This
exactly matches \bar{X} = 10/16 = 0.6250.
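A minimal R sketch of the balancing calculation, reusing the credibilities and sample means computed in the prior example (entered here as rounded constants):

# Balancing the credibility estimators of Example 9.6.6
Z      <- c(A = 0.7703, B = 0.8118)
xbar_i <- c(A = 1, B = 1 / 3)
m_i    <- c(A = 7, B = 9)

M_X <- sum(Z * xbar_i) / sum(Z)               # about 0.6579
mu_balanced <- Z * xbar_i + (1 - Z) * M_X     # about 0.9214 and 0.3944
sum(m_i / sum(m_i) * mu_balanced)             # equals the overall mean 10/16 = 0.6250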

9.7 Further Resources and Contributors


Exercises
Here is a set of exercises that guide the reader through some of the theoretical
foundations of Loss Data Analytics. Each tutorial is based on one or more
questions from the professional actuarial examinations, typically the Society of
Actuaries Exam C/STAM.
Credibility Guided Tutorials

Contributors
• Gary Dean, Ball State University is the author of the initial version of
this chapter. Email: [email protected] for chapter comments and suggested
improvements.
• Chapter reviewers include: Liang (Jason) Hong, Ambrose Lo, Ranee Thi-
agarajah, Hongjuan Zhou.
Chapter 10

Insurance Portfolio Management including Reinsurance

Chapter Preview. An insurance portfolio is simply a collection of insurance


contracts. To help manage the uncertainty of the portfolio, this chapter
• quantifies unusually large obligations by examining the tail of the distri-
bution,
• quantifies the overall riskiness by introducing summaries known as risk
measures, and
• discusses options of spreading portfolio risk through reinsurance, the pur-
chase of insurance protection by an insurer.

10.1 Introduction to Insurance Portfolios


Most of our analyses in prior chapters have been at the contract level, which is an
agreement between a policyholder and an insurer. Insurers hold, and manage,
portfolios that are simply collections of contracts. As in other areas of finance,
there are management decision-making choices that occur only at the portfolio
level. For example, strategic decision-making does not occur at the contract
level. It happens in the conference room, where management reviews available
data and possibly steers a new course. From the portfolio perspective, insurers
want to do capacity planning, set management policies, and balance the mix of
products being booked to grow revenue while controlling volatility.
Conceptually, one can think about an insurance company as nothing more than
a collection, or portfolio, of insurance contracts. In Chapter 5 we learned about


modeling insurance portfolios as the sum of individual contracts based on as-


sumptions of independence among contracts. Because of their importance, this
chapter focuses directly on portfolio distributions.
• Insurance portfolios represent obligations of insurers and so we are par-
ticularly interested in probabilities of large outcomes as these represent
unusually large obligations. To formalize this concept, we introduce the
notion of a heavy-tail distribution in Section 10.2.
• Insurance portfolios represent company obligations and so insurers keep
an equivalent amount of assets to meet these obligations. Risk measures,
introduced in Section 10.3, summarize the distribution of the insurance
portfolio and these summary measures are used to quantify the amount
of assets that an insurer needs to retain to meet obligations.
• In Section 3.4, we learned about mechanisms that policyholders use to
spread risks such as deductibles and policy limits. In the same way, in-
surers use similar mechanisms in order to spread portfolio risks. They
purchase risk protection from reinsurers, an insurance company for insur-
ers. This sharing of insurance portfolio risk is described in Section 10.4.

10.2 Tails of Distributions

In this section, you learn how to:


• Describe a heavy tail distribution intuitively.
• Classify the heaviness of a distribution’s tails based on moments.
• Compare the tails of two distributions.

In 1998 freezing rain fell on eastern Ontario and southwestern Quebec for six days.
The event brought double the precipitation the area had experienced in any prior
ice storm and resulted in a catastrophe that produced in excess of 840,000 insurance
claims. This number is 20% more than the number of claims caused by Hurricane
Andrew, one of the largest natural disasters in the history of North America. The
catastrophe caused approximately 1.44 billion Canadian dollars in insurance
settlements, the highest loss burden in the history of Canada. This is not an
isolated example; other catastrophic events that caused extreme insurance losses
include Hurricane Harvey, Superstorm Sandy, the 2011 Japanese earthquake and
tsunami, and so forth.
In the context of insurance, a few large losses hitting a portfolio and then con-
verting into claims usually represent the greatest part of the indemnities paid
by insurance companies. The aforementioned losses, also called ‘extremes’, are
quantitatively modeled by the tails of the associated probability distributions.
From the quantitative modeling standpoint, relying on probabilistic models with
improper tails is rather daunting. For instance, periods of financial stress may
10.2. TAILS OF DISTRIBUTIONS 345

appear with a higher frequency than expected, and insurance losses may oc-
cur with worse severity. Therefore, the study of probabilistic behavior in the
tail portion of actuarial models is of utmost importance in the modern frame-
work of quantitative risk management. For this reason, this section is devoted
to the introduction of a few mathematical notions that characterize the tail
weight of random variables. The applications of these notions will benefit us in
the construction and selection of appropriate models with desired mathematical
properties in the tail portion, that are suitable for a given task.
Formally, define 𝑋 to be the (random) obligations that arise from a collection
(portfolio) of insurance contracts. (In earlier chapters, we used 𝑆 for aggregate
losses. Now, the focus is on distributional aspects of only the collective and so we
revert to the traditional 𝑋 notation.) We are particularly interested in studying
the right tail of the distribution of 𝑋, which represents the occurrence of large
losses. Informally, a random variable is said to be heavy-tailed if high probabilities
are assigned to large values. Note that this by no means implies the probability
density/mass functions are increasing as the value of 𝑋 goes to infinity. Indeed,
for a real-valued random variable, the pdf/pmf must diminish at infinity in
order to guarantee that the total probability equals one. Instead, what we
are concerned about is the rate of decay of the pdf/pmf. Unwelcome outcomes
are more likely to occur for an insurance portfolio that is described by a loss
random variable possessing a heavier (right) tail. Tail weight can be an absolute
or a relative concept. Specifically, for the former, we may consider a random
variable to be heavy-tailed if certain mathematical properties of the probability
distribution are met. For the latter, we can say the tail of one distribution is
heavier/lighter than the other if some tail measures are larger/smaller.
Several quantitative approaches have been proposed to classify and compare
tail weight. Among most of these approaches, the survival function serves as
the building block. In what follows, we introduce two simple yet useful tail
classification methods both of which are based on the behavior of the survival
function of 𝑋.

10.2.1 Classification Based on Moments


One way of classifying the tail weight of a distribution is by assessing the exis-
tence of raw moments. Since our major interest lies in the right tails of distri-
butions, we henceforth assume the obligation or loss random variable 𝑋 to be
positive. At the outset, the 𝑘−th raw moment of a continuous random variable
𝑋, introduced in Section 3.1, can be computed as

\mu_k' = \int_0^{\infty} x^k f(x)\,dx = k\int_0^{\infty} x^{k-1}S(x)\,dx,

where 𝑆(⋅) denotes the survival function of 𝑋. This expression emphasizes


that the existence of the raw moments depends on the asymptotic behavior
346CHAPTER 10. INSURANCE PORTFOLIO MANAGEMENT INCLUDING REINSURANCE

of the survival function at infinity. Namely, the faster the survival function
decays to zero, the higher the order of finite moment (k) the associated random
variable possesses. You may interpret k^* to be the largest value of k such that
the moment is finite. Formally, define k^* = \sup\{k > 0 : \mu_k' < \infty\}, where sup
represents the supremum operator. This observation leads us to the moment-
based tail weight classification method, which is defined formally next.

Definition 10.1. Consider a non-negative loss random variable 𝑋.

• If all the positive raw moments exist, namely the maximal order of finite
moment 𝑘∗ = ∞, then 𝑋 is said to be light tailed based on the moment
method.
• If 𝑘∗ < ∞, then 𝑋 is said to be heavy tailed based on the moment method.
• Moreover, for two positive loss random variables 𝑋1 and 𝑋2 with maximal
orders of moment 𝑘1∗ and 𝑘2∗ respectively, we say 𝑋1 has a heavier (right)
tail than 𝑋2 if 𝑘1∗ ≤ 𝑘2∗ .

The first part of Definition 10.1 is an absolute concept of tail weight, while the
second part is a relative concept of tail weight which compares the (right) tails
between two distributions. Next, we present a few examples that illustrate the
applications of the moment-based method for comparing tail weight.

Example 10.2.1. Light tail nature of the gamma distribution. Let


𝑋 ∼ 𝑔𝑎𝑚𝑚𝑎(𝛼, 𝜃), with 𝛼 > 0 and 𝜃 > 0, then for all 𝑘 > 0, show that 𝜇′𝑘 < ∞.

Solution.


\mu_k' = \int_0^{\infty} x^k \frac{x^{\alpha-1}e^{-x/\theta}}{\Gamma(\alpha)\theta^{\alpha}}\,dx
 = \int_0^{\infty} (y\theta)^k \frac{(y\theta)^{\alpha-1}e^{-y}}{\Gamma(\alpha)\theta^{\alpha}}\,\theta\,dy
 = \frac{\theta^k}{\Gamma(\alpha)}\,\Gamma(\alpha+k) < \infty.

Since all the positive moments exist, i.e., 𝑘∗ = ∞, in accordance with the
moment-based classification method in Definition 10.1, the gamma distribution
is light-tailed.

Example 10.2.2. Light tail nature of the Weibull distribution. Let


𝑋 ∼ 𝑊 𝑒𝑖𝑏𝑢𝑙𝑙(𝜃, 𝜏 ), with 𝜃 > 0 and 𝜏 > 0, then for all 𝑘 > 0, show that 𝜇′𝑘 < ∞.

Solution.
10.2. TAILS OF DISTRIBUTIONS 347


\mu_k' = \int_0^{\infty} x^k \frac{\tau x^{\tau-1}}{\theta^{\tau}}e^{-(x/\theta)^{\tau}}\,dx
 = \int_0^{\infty} \frac{y^{k/\tau}}{\theta^{\tau}}e^{-y/\theta^{\tau}}\,dy
 = \theta^k\,\Gamma(1+k/\tau) < \infty.

Again, due to the existence of all the positive moments, the Weibull distribution
is light-tailed.

The gamma and Weibull distributions are used quite extensively in the actuar-
ial practice. Applications of these two distributions are vast which include, but
are not limited to, insurance claim severity modeling, solvency assessment, loss
reserving, aggregate risk approximation, reliability engineering and failure anal-
ysis. We have thus far seen two examples of using the moment-based method
to analyze light-tailed distributions. We document a heavy-tailed example in
what follows.
Example 10.2.3. Heavy tail nature of the Pareto distribution. Let
𝑋 ∼ 𝑃 𝑎𝑟𝑒𝑡𝑜(𝛼, 𝜃), with 𝛼 > 0 and 𝜃 > 0, then for 𝑘 > 0


\mu_k' = \int_0^{\infty} x^k \frac{\alpha\theta^{\alpha}}{(x+\theta)^{\alpha+1}}\,dx
 = \alpha\theta^{\alpha}\int_{\theta}^{\infty}(y-\theta)^k y^{-(\alpha+1)}\,dy.

Consider a similar integration:

g_k = \int_{\theta}^{\infty} y^{k-\alpha-1}\,dy
 \begin{cases} < \infty, & \text{for } k < \alpha; \\ = \infty, & \text{for } k \ge \alpha. \end{cases}

Meanwhile,

\lim_{y\to\infty}\frac{(y-\theta)^k y^{-(\alpha+1)}}{y^{k-\alpha-1}} = \lim_{y\to\infty}(1-\theta/y)^k = 1.

Application of the limit comparison theorem for improper integrals yields 𝜇′𝑘 is
finite if and only if 𝑔𝑘 is finite. Hence we can conclude that the raw moments
of Pareto random variables exist only up to 𝑘 < 𝛼, i.e., 𝑘∗ = 𝛼, and thus the
distribution is heavy-tailed. What is more, the maximal order of finite moment

depends only on the shape parameter 𝛼 and it is an increasing function of 𝛼. In


other words, based on the moment method, the tail weight of Pareto random
variables is solely manipulated by 𝛼 – the smaller the value of 𝛼, the heavier
the tail weight becomes. Since 𝑘∗ < ∞, the tail of Pareto distribution is heavier
than those of the gamma and Weibull distributions.

We conclude this section with an open discussion on the limitations of the


moment-based method. Despite its simple implementation and intuitive in-
terpretation, there are certain circumstances in which the application of the
moment-based method is not suitable. First, for more complicated probabilistic
models, the 𝑘-th raw moment may not be simple to derive, and thus the identi-
fication of the maximal order of finite moment can be challenging. Second, the
moment-based method does not align well with the well-established body of heavy
tail theory in the literature. Specifically, the existence of the moment
generating function is arguably the most popular criterion for classifying heavy
tails versus light tails within the community of academic actuaries. However, for
some random variables, such as lognormal random variables, the moment
generating function does not exist even though all the positive moments are finite.
In these cases, application of the moment-based method can lead to a different
tail weight assessment. Third, when we need to compare the tail weight between
two light-tailed distributions that both have all positive moments finite, the
moment-based method is no longer informative (see, e.g., Examples 10.2.1 and
10.2.2).

10.2.2 Comparison Based on Limiting Tail Behavior


In order to resolve the aforementioned issues of the moment-based classification
method, an alternative approach for comparing tail weight is to directly study
the limiting behavior of the survival functions.
Definition 10.2. For two random variables 𝑋 and 𝑌 , let

\gamma = \lim_{t\to\infty}\frac{S_X(t)}{S_Y(t)}.
We say that
• 𝑋 has a heavier right tail than 𝑌 if 𝛾 = ∞;

• 𝑋 and 𝑌 are proportionally equivalent in the right tail if 𝛾 = 𝑐 ∈


(0, ∞);
• 𝑋 has a lighter right tail than 𝑌 if 𝛾 = 0.
Example 10.2.4. Comparison of Pareto to Weibull distributions. Let
𝑋 ∼ 𝑃 𝑎𝑟𝑒𝑡𝑜(𝛼, 𝜃) and 𝑌 ∼ 𝑊 𝑒𝑖𝑏𝑢𝑙𝑙(𝜏 , 𝜃), for 𝛼 > 0, 𝜏 > 0, and 𝜃 > 0. Show
that the Pareto has a heavier right tail than the Weibull.

Solution.

\lim_{t\to\infty}\frac{S_X(t)}{S_Y(t)} = \lim_{t\to\infty}\frac{(1+t/\theta)^{-\alpha}}{\exp\{-(t/\theta)^{\tau}\}}
 = \lim_{t\to\infty}\frac{\exp\{t/\theta^{\tau}\}}{(1+t^{1/\tau}/\theta)^{\alpha}}
 = \lim_{t\to\infty}\frac{\sum_{i=0}^{\infty}\left(t/\theta^{\tau}\right)^i/i!}{(1+t^{1/\tau}/\theta)^{\alpha}}
 = \lim_{t\to\infty}\sum_{i=0}^{\infty}\left(t^{-i/\alpha}+\frac{t^{(1/\tau-i/\alpha)}}{\theta}\right)^{-\alpha}\Big/\left(\theta^{\tau i}\,i!\right)
 = \infty.

Therefore, the Pareto distribution has a heavier tail than the Weibull distribu-
tion. One may also realize that exponentials go to infinity faster than polyno-
mials, thus the aforementioned limit must be infinite.

For some distributions of which the survival functions do not admit explicit
expressions, we may find the following alternative formula useful:


\lim_{t\to\infty}\frac{S_X(t)}{S_Y(t)} = \lim_{t\to\infty}\frac{S_X'(t)}{S_Y'(t)}
 = \lim_{t\to\infty}\frac{-f_X(t)}{-f_Y(t)}
 = \lim_{t\to\infty}\frac{f_X(t)}{f_Y(t)},

given that the density functions exist. This is an application of L’Hôpital’s Rule
from calculus.
Example 10.2.5. Comparison of Pareto to gamma distributions. Let
𝑋 ∼ 𝑃 𝑎𝑟𝑒𝑡𝑜(𝛼, 𝜃) and 𝑌 ∼ 𝑔𝑎𝑚𝑚𝑎(𝛼, 𝜃), for 𝛼 > 0 and 𝜃 > 0. Show that the
Pareto has a heavier right tail than the gamma.
Solution.

\lim_{t\to\infty}\frac{f_X(t)}{f_Y(t)} = \lim_{t\to\infty}\frac{\alpha\theta^{\alpha}(t+\theta)^{-\alpha-1}}{t^{\alpha-1}e^{-t/\theta}\,\theta^{-\alpha}\,\Gamma(\alpha)^{-1}}
 \propto \lim_{t\to\infty}\frac{e^{t/\theta}}{(t+\theta)^{\alpha+1}t^{\alpha-1}}
 = \infty,

as exponentials go to infinity faster than polynomials.
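To build intuition for these limit comparisons, the following R sketch evaluates the survival function ratios numerically; the parameter values (and the choice tau = 1 for the Weibull) are illustrative assumptions, not taken from the text.

# Numerical comparison of right tails: Pareto versus Weibull and gamma
alpha <- 3; theta <- 1000
t <- c(5000, 10000, 20000, 50000)

S_pareto  <- (1 + t / theta)^(-alpha)
S_weibull <- exp(-(t / theta)^1)                                   # Weibull with tau = 1
S_gamma   <- pgamma(t, shape = alpha, scale = theta, lower.tail = FALSE)

S_pareto / S_weibull   # ratio grows without bound as t increases
S_pareto / S_gamma     # ratio grows without bound as t increases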

10.3 Risk Measures

In this section, you learn how to:


• Define the idea of coherence and determine whether or not a risk measure
is coherent.
• Define the value-at-risk and calculate this quantity for a given distribution.
• Define the tail value-at-risk and calculate this quantity for a given distri-
bution.

In the previous section, we studied two methods for classifying the weight of
distribution tails. We may claim that the risk associated with one distribution is
more dangerous (asymptotically) than the others if the tail is heavier. However,
knowing one risk is more dangerous (asymptotically) than the others may not
provide sufficient information for a sophisticated risk management purpose, and
in addition, one is also interested in quantifying how much more. In fact, the
magnitude of risk associated with a given loss distribution is an essential input
for many insurance applications, such as actuarial pricing, reserving, hedging,
insurance regulatory oversight, and so forth.

10.3.1 Coherent Risk Measures


To compare the magnitude of risk in a practically convenient manner, we seek
a function that maps the loss random variable of interest to a numerical value
indicating the level of riskiness, which is termed the risk measure. Put math-
ematically, the risk measure simply summarizes the distribution function of a
random variable as a single number. Two simple risk measures are the mean
E[𝑋] and the standard deviation SD(𝑋) = √Var(𝑋). Other classical examples
of risk measures include the standard deviation principle

𝐻SD (𝑋) = E[𝑋] + 𝛼SD(𝑋), for 𝛼 ≥ 0, (10.1)

and the variance principle

𝐻Var (𝑋) = E[𝑋] + 𝛼Var(𝑋), for 𝛼 ≥ 0.

One can check that all the aforementioned functions are risk measures in which
we input the loss random variable and the functions output a numerical value.
On a different note, the function 𝐻 ∗ (𝑋) = 𝛼𝑋 𝛽 for any real-valued 𝛼, 𝛽 ≠ 0, is

not a risk measure because 𝐻 ∗ produces another random variable rather than
a single numerical value.
Since risk measures are scalar measures which aim to use a single numerical
value to describe the stochastic nature of loss random variables, it should not
be surprising to us that there is no risk measure which can capture all the risk
information of the associated random variables. Therefore, when seeking useful
risk measures, it is important for us to keep in mind that the measures should
be at least
• interpretable practically;

• computable conveniently; and

• able to reflect the most critical information of risk underpinning the loss
distribution.
Several risk measures have been developed in the literature. Unfortunately,
there is no best risk measure that can outperform the others, and the selection
of appropriate risk measure depends mainly on the application questions at hand.
In this respect, it is imperative to emphasize that risk is a subjective concept,
and thus even given the same problem, there are multifarious approaches to
assess risk. However, for many risk management applications, there is wide
agreement that economically sound risk measures should satisfy four major
axioms, which we are going to describe in detail next. Risk measures that satisfy
these axioms are termed coherent risk measures.
Consider a risk measure 𝐻(⋅). It is said to be a coherent risk measure for two
random variables 𝑋 and 𝑌 if the following axioms are satisfied.
• Axiom 1. Subadditivity: 𝐻(𝑋 + 𝑌 ) ≤ 𝐻(𝑋) + 𝐻(𝑌 ). The economic
implication of this axiom is that diversification benefits exist if different
risks are combined.

• Axiom 2. Monotonicity: if Pr[𝑋 ≤ 𝑌 ] = 1, then 𝐻(𝑋) ≤ 𝐻(𝑌 ). Recall


that X and Y are random variables representing losses; the underlying
economic implication is that higher losses essentially lead to a higher
level of risk.

• Axiom 3. Positive homogeneity: 𝐻(𝑐𝑋) = 𝑐𝐻(𝑋) for any positive


constant 𝑐. A potential economic implication about this axiom is that risk
measure should be independent of the monetary units in which the risk
is measured. For example, let 𝑐 be the currency exchange rate between
the US and Canadian dollars, then the risk of random losses measured in
terms of US dollars (i.e., 𝑋) and Canadian dollars (i.e., 𝑐𝑋) should be
different only up to the exchange rate 𝑐 (i.e., 𝑐𝐻(𝑥) = 𝐻(𝑐𝑋)).

• Axiom 4. Translation invariance: 𝐻(𝑋 + 𝑐) = 𝐻(𝑋) + 𝑐 for any positive



constant 𝑐. If the constant 𝑐 is interpreted as risk-free cash and 𝑋 is


an insurance portfolio, then adding cash to a portfolio only increases the
portfolio risk by the amount of cash.
Verifying the coherent properties for some risk measures can be quite straight-
forward, but it can be very challenging sometimes. For example, it is a simple
matter to check that the mean is a coherent risk measure.
Special Case. The Mean is a Coherent Risk Measure.
For any pair of random variables 𝑋 and 𝑌 having finite means and constant
𝑐 > 0,
• validation of subadditivity: E[𝑋 + 𝑌 ] = E[𝑋] + E[𝑌 ];
• validation of monotonicity: if Pr[𝑋 ≤ 𝑌 ] = 1, then E[𝑋] ≤ E[𝑌 ];
• validation of positive homogeneity: E[𝑐𝑋] = 𝑐E[𝑋];
• validation of translation invariance: E[𝑋 + 𝑐] = E[𝑋] + 𝑐

With a little more effort, we can determine the following.


Special Case. The Standard Deviation is not a Coherent Risk Mea-
sure.
Verification of the Special Case.
To see that the standard deviation is not a coherent risk measure, start by
checking that the standard deviation satisfies
• validation of subadditivity:

SD(X+Y) = \sqrt{\mathrm{Var}(X)+\mathrm{Var}(Y)+2\,\mathrm{Cov}(X,Y)}
 \le \sqrt{SD(X)^2+SD(Y)^2+2\,SD(X)\,SD(Y)}
 = SD(X)+SD(Y);

• validation of positive homogeneity: SD[𝑐𝑋] = 𝑐 SD[𝑋].


However, the standard deviation does not comply with translation invariance
property as for any positive constant 𝑐,

SD(𝑋 + 𝑐) = SD(𝑋) < SD(𝑋) + 𝑐.

Moreover, the standard deviation also does not satisfy the monotonicity prop-
erty. To see this, consider the following two random variables:

X = \begin{cases} 0, & \text{with probability } 0.25; \\ 4, & \text{with probability } 0.75, \end{cases}  (10.2)
10.3. RISK MEASURES 353

and 𝑌 is a degenerate random variable such that

Pr[𝑌 = 4] = 1. (10.3)
You can check that Pr[X \le Y] = 1, but SD(X) = \sqrt{4^2 \cdot 0.25 \cdot 0.75} = \sqrt{3} > SD(Y) = 0.

We have so far checked that E[⋅] is a coherent risk measure, but not SD(⋅). Let us
now proceed to study the coherent property for the standard deviation principle
(10.1) which is a linear combination of coherent and incoherent risk measures.

Special Case. The Standard Deviation Principle (10.1) is a Coherent


Risk Measure.
Verification of the Special Case. To this end, for a given 𝛼 > 0, we check
the four axioms for 𝐻SD (𝑋 + 𝑌 ) one by one:
• validation of subadditivity:

𝐻SD (𝑋 + 𝑌 ) = E[𝑋 + 𝑌 ] + 𝛼SD(𝑋 + 𝑌 )


≤ E[𝑋] + E[𝑌 ] + 𝛼[SD(𝑋) + SD(𝑌 )]
= 𝐻SD (𝑋) + 𝐻SD (𝑌 );

• validation of positive homogeneity:

𝐻SD (𝑐𝑋) = 𝑐E[𝑋] + 𝑐𝛼SD(𝑋) = 𝑐𝐻SD (𝑋);

• validation of translation invariance:

𝐻SD (𝑋 + 𝑐) = E[𝑋] + 𝑐 + 𝛼SD(𝑋) = 𝐻SD (𝑋) + 𝑐.

It only remains to verify the monotonicity property, which may or may not be
satisfied depending on the value of \alpha. To see this, consider again the setup of
(10.2) and (10.3) in which Pr[X \le Y] = 1. Let \alpha = 0.1\sqrt{3}; then H_{SD}(X) =
3 + 0.3 = 3.3 < H_{SD}(Y) = 4 and the monotonicity condition is met. On the
other hand, let \alpha = \sqrt{3}; then H_{SD}(X) = 3 + 3 = 6 > H_{SD}(Y) = 4 and the
monotonicity condition is not satisfied. More precisely, by setting

H_{SD}(X) = 3 + \alpha\sqrt{3} \le 4 = H_{SD}(Y),

we find that the monotonicity condition is satisfied only for 0 \le \alpha \le 1/\sqrt{3},
and for those values the standard deviation principle H_{SD} is coherent. This result appears to

be very intuitive to us since the standard deviation principle 𝐻SD is a linear


combination of two risk measures of which one is coherent and the other is
incoherent. If \alpha \le 1/\sqrt{3}, then the coherent measure dominates the incoherent
one, thus the resulting measure H_{SD} is coherent, and vice versa. Note that
the aforementioned conclusion may not be generalized to any pair of random
variables 𝑋 and 𝑌 .
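A minimal R sketch of the monotonicity check above for the two-point X in (10.2) and the degenerate Y in (10.3):

# Monotonicity of the standard deviation principle for different alpha
x <- c(0, 4); px <- c(0.25, 0.75)
EX  <- sum(x * px)                          # 3
SDX <- sqrt(sum((x - EX)^2 * px))           # sqrt(3)

H_SD <- function(mean, sd, alpha) mean + alpha * sd

alpha <- c(0.1 * sqrt(3), 1 / sqrt(3), sqrt(3))
cbind(alpha,
      H_X = H_SD(EX, SDX, alpha),           # 3.3, 4, 6
      H_Y = H_SD(4, 0, alpha))              # always 4; monotonicity fails once H_X > 4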

The literature on risk measures has been growing rapidly in popularity and
importance. In the succeeding two subsections, we introduce two indices which
have recently earned an unprecedented amount of interest among theoreticians,
practitioners, and regulators. They are namely the Value-at-Risk (𝑉 𝑎𝑅) and
the Tail Value-at-Risk (𝑇 𝑉 𝑎𝑅) measures. The economic rationale behind these
two popular risk measures is similar to that for the tail classification methods
introduced in the previous section, with which we hope to capture the risk of
extremal losses represented by the distribution tails.

10.3.2 Value-at-Risk
In Section 4.1.1, we defined the quantile of a distribution. We now look to a
special case of this and offer the formal definition of the value-at-risk, or VaR.

Definition 10.3. Consider an insurance loss random variable 𝑋. The value-at-


risk measure of 𝑋 with confidence level 𝑞 ∈ (0, 1) is formulated as

𝑉 𝑎𝑅𝑞 [𝑋] = inf{𝑥 ∶ 𝐹𝑋 (𝑥) ≥ 𝑞}. (10.4)

Here, 𝑖𝑛𝑓 is the infimum operator so that the 𝑉 𝑎𝑅 measure outputs the smallest
value of 𝑋 such that the associated cdf first exceeds or equates to 𝑞. This is
simply the quantile that was introduced in Section 3.1.2 and further developed
in Section 4.1.1.

Here is how we should interpret 𝑉 𝑎𝑅 in the context of actuarial applications.


The 𝑉 𝑎𝑅 is a measure of the ‘maximal’ probable loss for an insurance prod-
uct/portfolio or a risky investment occurring 𝑞 × 100% of times, over a specific
time horizon (typically, one year). For instance, let 𝑋 be the annual loss random
variable of an insurance product, 𝑉 𝑎𝑅0.95 [𝑋] = 100 million means that there
is no more than a 5% chance that the loss will exceed 100 million over a given
year. Owing to this meaningful interpretation, VaR has become the industry
standard for measuring financial and insurance risks since the 1990s. Financial con-
glomerates, regulators, and academics often utilize 𝑉 𝑎𝑅 to measure risk capital,
ensure the compliance with regulatory rules, and disclose the financial positions.

Next, we present a few examples about the computation of 𝑉 𝑎𝑅.



Example 10.3.1. 𝑉 𝑎𝑅 for the exponential distribution. Consider an


insurance loss random variable 𝑋 with an exponential distribution having pa-
rameter 𝜃 for 𝜃 > 0, then the cdf of 𝑋 is given by

𝐹𝑋 (𝑥) = 1 − 𝑒−𝑥/𝜃 , for 𝑥 > 0.

Give a closed-form expression for the 𝑉 𝑎𝑅.


Solution.
Because exponential distribution is a continuous distribution, the smallest value
such that the cdf first exceeds or equates to 𝑞 ∈ (0, 1) must be at the point 𝑥𝑞
satisfying
𝑞 = 𝐹𝑋 (𝑥𝑞 ) = 1 − exp{−𝑥𝑞 /𝜃}.
Thus

VaR_q[X] = F_X^{-1}(q) = -\theta\log(1-q).
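As a quick check, the closed form agrees with R's built-in exponential quantile function; the value theta = 1000 is an illustrative assumption.

# VaR for an exponential loss: closed form versus qexp
theta <- 1000
q <- c(0.90, 0.95, 0.99)
-theta * log(1 - q)          # closed form
qexp(q, rate = 1 / theta)    # same values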

The result reported in Example 10.3.1 can be generalized to any continuous


random variables having a strictly increasing cdf. Specifically, the 𝑉 𝑎𝑅 of any
continuous random variables is simply the inverse of the corresponding cdf. Let
us consider another example of continuous random variable which has the sup-
port from negative infinity to positive infinity.
Example 10.3.2. 𝑉 𝑎𝑅 for the normal distribution. Consider an insurance
loss random variable 𝑋 ∼ 𝑁 𝑜𝑟𝑚𝑎𝑙(𝜇, 𝜎2 ) with 𝜎 > 0. In this case, one may
interpret the negative values of 𝑋 as profit or revenue. Give a closed-form
expression for the 𝑉 𝑎𝑅.
Solution.
Because normal distribution is a continuous distribution, the 𝑉 𝑎𝑅 of 𝑋 must
satisfy

𝑞 = 𝐹𝑋 (𝑉 𝑎𝑅𝑞 [𝑋])
= Pr [(𝑋 − 𝜇)/𝜎 ≤ (𝑉 𝑎𝑅𝑞 [𝑋] − 𝜇)/𝜎]
= Φ((𝑉 𝑎𝑅𝑞 [𝑋] − 𝜇)/𝜎).

Therefore, we have
𝑉 𝑎𝑅𝑞 [𝑋] = Φ−1 (𝑞) 𝜎 + 𝜇.

In many insurance applications, we have to deal with transformations of ran-


dom variables. For instance, in Example 10.3.2, the loss random variable
𝑋 ∼ 𝑁 𝑜𝑟𝑚𝑎𝑙(𝜇, 𝜎2 ) can be viewed as a linear transformation of a standard nor-
mal random variable 𝑍 ∼ 𝑁 𝑜𝑟𝑚𝑎𝑙(0, 1), namely 𝑋 = 𝑍𝜎 + 𝜇. By setting 𝜇 = 0
356CHAPTER 10. INSURANCE PORTFOLIO MANAGEMENT INCLUDING REINSURANCE

and 𝜎 = 1, it is straightforward for us to check 𝑉 𝑎𝑅𝑞 [𝑍] = Φ−1 (𝑞). A useful


finding revealed from Example 10.3.2 is that the 𝑉 𝑎𝑅 of a linear transformation
of the normal random variables is equivalent to the linear transformation of the
𝑉 𝑎𝑅 of the original random variables. This finding can be further generalized
to any random variables as long as the transformations are strictly increasing.
Example 10.3.3. 𝑉 𝑎𝑅 for transformed variables. Consider an insurance
loss random variable 𝑌 with a lognormal distribution with parameters 𝜇 ∈ R
and 𝜎2 > 0. Give an expression of the 𝑉 𝑎𝑅 of 𝑌 in terms of the standard
normal inverse cdf.
Solution.
Note that log 𝑌 ∼ 𝑁 𝑜𝑟𝑚𝑎𝑙(𝜇, 𝜎2 ), or equivalently let 𝑋 ∼ 𝑁 𝑜𝑟𝑚𝑎𝑙(𝜇, 𝜎2 ), then
𝑑 𝑑
𝑌 = 𝑒𝑋 which is strictly increasing transformation. Here, the notation ‘=’
means equality in distribution. The 𝑉 𝑎𝑅 of 𝑌 is thus given by the exponential
transformation of the 𝑉 𝑎𝑅 of 𝑋. Precisely, for 𝑞 ∈ (0, 1),

𝑉 𝑎𝑅𝑞 [𝑌 ] = 𝑒𝑉 𝑎𝑅𝑞 [𝑋] = exp{Φ−1 (𝑞) 𝜎 + 𝜇}.

We have thus far seen a number of examples about the 𝑉 𝑎𝑅 for continuous
random variables, let us consider an example concerning the 𝑉 𝑎𝑅 for a discrete
random variable.
Example 10.3.4. 𝑉 𝑎𝑅 for a discrete random variable. Consider an
insurance loss random variable with the following probability distribution:

0.75, for 𝑥 = 1
Pr[𝑋 = 𝑥] = { 0.20, for 𝑥 = 3
0.05, for 𝑥 = 4.

Determine the VaR at q = 0.6, 0.9, 0.95, 0.950001.


Solution.
The corresponding cdf of X is

F_X(x) = \begin{cases} 0, & x < 1; \\ 0.75, & 1 \le x < 3; \\ 0.95, & 3 \le x < 4; \\ 1, & 4 \le x. \end{cases}

By the definition of VaR, we thus have


• 𝑉 𝑎𝑅0.6 [𝑋] = 1;
• 𝑉 𝑎𝑅0.9 [𝑋] = 3;
• 𝑉 𝑎𝑅0.95 [𝑋] = 3;
• 𝑉 𝑎𝑅0.950001 [𝑋] = 4.
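The same values can be reproduced in R directly from the definition of VaR as the smallest support point at which the cdf reaches q:

# VaR of the discrete loss in Example 10.3.4
x  <- c(1, 3, 4)
Fx <- c(0.75, 0.95, 1)                       # cdf at the support points
VaR <- function(q) x[min(which(Fx >= q))]
sapply(c(0.6, 0.9, 0.95, 0.950001), VaR)     # 1, 3, 3, 4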

Let us now conclude the current subsection by an open discussion of the 𝑉 𝑎𝑅


measure. Some advantages of utilizing 𝑉 𝑎𝑅 include
• possessing a practically meaningful interpretation;
• relatively simple to compute for many distributions with closed-form dis-
tribution functions;
• no additional assumption is required for the computation of 𝑉 𝑎𝑅.
On the other hand, the limitations of 𝑉 𝑎𝑅 can be particularly pronounced for
some risk management practices. We report some of them herein:
• the selection of the confidence level 𝑞 ∈ (0, 1) is highly subjective, while
the 𝑉 𝑎𝑅 can be very sensitive to the choice of 𝑞 (e.g., in Example 10.3.4,
𝑉 𝑎𝑅0.95 [𝑋] = 3 and 𝑉 𝑎𝑅0.950001 [𝑋] = 4);
• the scenarios/loss information that are above the (1 − 𝑞) × 100% worst
event, are completely neglected;
• 𝑉 𝑎𝑅 is not a coherent risk measure (specifically, the 𝑉 𝑎𝑅 measure does
not satisfy the subadditivity axiom, meaning that diversification benefits
may not be fully reflected).

10.3.3 Tail Value-at-Risk


Recall that the 𝑉 𝑎𝑅 represents the (1 − 𝑞) × 100% chance maximal loss. As we
mentioned in the previous section, one major drawback of the 𝑉 𝑎𝑅 measure is
that it does not reflect the extremal losses occurring beyond the (1 − 𝑞) × 100%
chance worst scenario. For illustrative purposes, let us consider the following
slightly unrealistic yet inspiring example.
Example 10.3.5. Consider two loss random variables X ~ Uniform[0, 100],
and 𝑌 with an exponential distribution having parameter 𝜃 = 31.71. We use
𝑉 𝑎𝑅 at 95% confidence level to measure the riskiness of 𝑋 and 𝑌 . Simple
calculation yields (see, also, Example 10.3.1),

𝑉 𝑎𝑅0.95 [𝑋] = 𝑉 𝑎𝑅0.95 [𝑌 ] = 95,

and thus these two loss distributions have the same level of risk according to
𝑉 𝑎𝑅0.95 . However, 𝑌 is riskier than 𝑋 if extremal losses are of major concern
since 𝑋 is bounded above while 𝑌 is unbounded. Simply quantifying risk by
using 𝑉 𝑎𝑅 at a specific confidence level could be misleading and may not reflect
the true nature of risk.
As a remedy, the Tail Value-at-Risk (𝑇 𝑉 𝑎𝑅) was proposed to measure the
extremal losses that are above a given level of 𝑉 𝑎𝑅 as an average. We document
the definition of 𝑇 𝑉 𝑎𝑅 in what follows. For the sake of simplicity, we are going
to confine ourselves to continuous positive random variables only, which are
more frequently used in the context of insurance risk management. We refer
the interested reader to Hardy (2006) for a more comprehensive discussion of
𝑇 𝑉 𝑎𝑅 for both discrete and continuous random variables.

Definition 10.4. Fix 𝑞 ∈ (0, 1), the tail value-at-risk of a (continuous) random
variable 𝑋 is formulated as

𝑇 𝑉 𝑎𝑅𝑞 [𝑋] = E[𝑋|𝑋 > 𝑉 𝑎𝑅𝑞 [𝑋]],

given that the expectation exists.


In light of Definition 10.4, the computation of 𝑇 𝑉 𝑎𝑅 typically consists of two
major components - the 𝑉 𝑎𝑅 and the average of losses that are above the 𝑉 𝑎𝑅.
The 𝑇 𝑉 𝑎𝑅 can be computed via a number of formulas. Consider a continuous
positive random variable X; for notational convenience, henceforth let us write
𝜋𝑞 = 𝑉 𝑎𝑅𝑞 [𝑋]. By definition, the 𝑇 𝑉 𝑎𝑅 can be computed via


TVaR_q[X] = \frac{1}{1-q}\int_{\pi_q}^{\infty} x f_X(x)\,dx.  (10.5)

Example 10.3.6. 𝑇 𝑉 𝑎𝑅 for a normal distribution. Consider an insurance


loss random variable 𝑋 ∼ 𝑁 𝑜𝑟𝑚𝑎𝑙(𝜇, 𝜎2 ) with 𝜇 ∈ R and 𝜎 > 0. Give an
expression for 𝑇 𝑉 𝑎𝑅.
Solution.
Let 𝑍 be the standard normal random variable. For 𝑞 ∈ (0, 1), the 𝑇 𝑉 𝑎𝑅 of 𝑋
can be computed via

TVaR_q[X] = E[X \mid X > VaR_q[X]]
 = E[\sigma Z + \mu \mid \sigma Z + \mu > VaR_q[X]]
 = \sigma E[Z \mid Z > (VaR_q[X]-\mu)/\sigma] + \mu
 \overset{(1)}{=} \sigma E[Z \mid Z > VaR_q[Z]] + \mu,

where (1) holds because of the results reported in Example 10.3.2. Next, we turn to study TVaR_q[Z] = E[Z \mid Z > VaR_q[Z]]. Let \omega(q) = (\Phi^{-1}(q))^2/2; we have

(1-q)\,TVaR_q[Z] = \int_{\Phi^{-1}(q)}^{\infty} z\,\frac{1}{\sqrt{2\pi}}e^{-z^2/2}\,dz
 = \int_{\omega(q)}^{\infty} \frac{1}{\sqrt{2\pi}}e^{-x}\,dx
 = \frac{1}{\sqrt{2\pi}}e^{-\omega(q)}
 = \phi(\Phi^{-1}(q)).

Thus,

TVaR_q[X] = \sigma\,\frac{\phi(\Phi^{-1}(q))}{1-q} + \mu.
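A minimal R check of this closed form against numerical integration of E[X | X > VaR_q]; the parameter values are illustrative assumptions.

# Normal TVaR: closed form versus numerical integration
mu <- 100; sigma <- 25; q <- 0.95
VaR  <- qnorm(q, mu, sigma)
TVaR <- mu + sigma * dnorm(qnorm(q)) / (1 - q)                   # closed form
TVaR_num <- integrate(function(x) x * dnorm(x, mu, sigma),
                      lower = VaR, upper = Inf)$value / (1 - q)
c(TVaR, TVaR_num)                                                # both about 151.6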

We mentioned earlier in the previous subsection that the 𝑉 𝑎𝑅 of a strictly


increasing function of random variable is equal to the function of 𝑉 𝑎𝑅 of the
original random variable. Motivated by the results in Example 10.3.6, one can
show that the TVaR of a strictly increasing linear transformation of a random
variable is equal to the same linear transformation of the TVaR of the original
random variable. This is due to the linearity property of expectations. However, the aforementioned
finding cannot be extended to non-linear functions. The following example of
lognormal random variable serves as a counter example.

Example 10.3.7. 𝑇 𝑉 𝑎𝑅 of a lognormal distribution. Consider an insur-


ance loss random variable X having a lognormal distribution with parameters \mu \in \mathbb{R} and
𝜎 > 0. Show that

TVaR_q[X] = \frac{e^{\mu+\sigma^2/2}}{1-q}\,\Phi\!\left(\sigma - \Phi^{-1}(q)\right).

Solution.

Recall that the pdf of the lognormal distribution is

f_X(x) = \frac{1}{\sigma\sqrt{2\pi}\,x}\exp\{-(\log x-\mu)^2/2\sigma^2\}, \quad \text{for } x > 0.

Fix 𝑞 ∈ (0, 1), then the 𝑇 𝑉 𝑎𝑅 of 𝑋 can be computed via


TVaR_q[X] = \frac{1}{1-q}\int_{\pi_q}^{\infty} x f_X(x)\,dx
 = \frac{1}{1-q}\int_{\pi_q}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(\log x-\mu)^2}{2\sigma^2}\right\}dx
 \overset{(1)}{=} \frac{1}{1-q}\int_{\omega(q)}^{\infty} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}w^2+\sigma w+\mu}\,dw
 = \frac{e^{\mu+\sigma^2/2}}{1-q}\int_{\omega(q)}^{\infty} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(w-\sigma)^2}\,dw
 = \frac{e^{\mu+\sigma^2/2}}{1-q}\,\Phi(\sigma-\omega(q)),  (10.6)

where (1) holds by applying the change of variable w = (\log x - \mu)/\sigma, and \omega(q) = (\log\pi_q - \mu)/\sigma. Evoking the formula of the VaR for the lognormal random variable reported in Example 10.3.3, we can simplify expression (10.6) into

TVaR_q[X] = \frac{e^{\mu+\sigma^2/2}}{1-q}\,\Phi\!\left(\sigma - \Phi^{-1}(q)\right).

Clearly, the 𝑇 𝑉 𝑎𝑅 of lognormal random variable is not the exponential of the


𝑇 𝑉 𝑎𝑅 of normal random variable.
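A minimal numerical check of the lognormal TVaR formula above; the values of mu, sigma, and q are illustrative assumptions.

# Lognormal TVaR: closed form versus numerical integration
mu <- 5; sigma <- 1; q <- 0.95
pi_q <- qlnorm(q, mu, sigma)                                        # VaR
TVaR_closed <- exp(mu + sigma^2 / 2) * pnorm(sigma - qnorm(q)) / (1 - q)
TVaR_num <- integrate(function(x) x * dlnorm(x, mu, sigma),
                      lower = pi_q, upper = Inf)$value / (1 - q)
c(TVaR_closed, TVaR_num)                                            # the two agree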
For distributions of which the survival distribution functions are more tractable
to work with, we may apply the integration by parts technique (assuming the
mean is finite) to rewrite equation (10.5) as


TVaR_q[X] = \frac{1}{1-q}\left[-x\,S_X(x)\Big|_{\pi_q}^{\infty} + \int_{\pi_q}^{\infty} S_X(x)\,dx\right]
 = \pi_q + \frac{1}{1-q}\int_{\pi_q}^{\infty} S_X(x)\,dx.

Example 10.3.8. 𝑇 𝑉 𝑎𝑅 of an exponential distribution. Consider an


insurance loss random variable 𝑋 with an exponential distribution having pa-
rameter 𝜃 for 𝜃 > 0. Give an expression for the 𝑇 𝑉 𝑎𝑅.
Solution.
We have seen from the previous subsection that

𝜋𝑞 = −𝜃[log(1 − 𝑞)].

Let us now consider the 𝑇 𝑉 𝑎𝑅:


TVaR_q[X] = \pi_q + \frac{1}{1-q}\int_{\pi_q}^{\infty} e^{-x/\theta}\,dx
 = \pi_q + \frac{\theta e^{-\pi_q/\theta}}{1-q}
 = \pi_q + \theta.

It can also be helpful to express the 𝑇 𝑉 𝑎𝑅 in terms of limited expected values.


Specifically, we have


TVaR_q[X] = \frac{1}{1-q}\int_{\pi_q}^{\infty}(x-\pi_q+\pi_q)f_X(x)\,dx
 = \pi_q + \frac{1}{1-q}\int_{\pi_q}^{\infty}(x-\pi_q)f_X(x)\,dx
 = \pi_q + e_X(\pi_q)
 = \pi_q + \frac{E[X]-E[X\wedge\pi_q]}{1-q},  (10.7)

where 𝑒𝑋 (𝑑) = E[𝑋 − 𝑑|𝑋 > 𝑑] for 𝑑 > 0 denotes the mean excess loss function.
For many commonly used parametric distributions, the formulas for calculating
E[𝑋] and E[𝑋 ∧ 𝜋𝑞 ] can be found in a table of distributions.
Example 10.3.9. 𝑇 𝑉 𝑎𝑅 of a Pareto distribution. Consider a loss random
variable 𝑋 ∼ 𝑃 𝑎𝑟𝑒𝑡𝑜(𝜃, 𝛼) with 𝜃 > 0 and 𝛼 > 0. The cdf of 𝑋 is given by
F_X(x) = 1 - \left(\frac{\theta}{\theta+x}\right)^{\alpha}, \quad \text{for } x > 0.

Fix q \in (0,1) and set F_X(\pi_q) = q; we readily obtain

\pi_q = \theta\left[(1-q)^{-1/\alpha} - 1\right].  (10.8)

From Section 18.2, we know

E[X] = \frac{\theta}{\alpha-1},

and

E[X\wedge\pi_q] = \frac{\theta}{\alpha-1}\left[1 - \left(\frac{\theta}{\theta+\pi_q}\right)^{\alpha-1}\right].
Evoking equation (10.7) yields

TVaR_q[X] = \pi_q + \frac{\theta}{\alpha-1}\,\frac{(\theta/(\theta+\pi_q))^{\alpha-1}}{(\theta/(\theta+\pi_q))^{\alpha}}
 = \pi_q + \frac{\theta}{\alpha-1}\left(\frac{\pi_q+\theta}{\theta}\right)
 = \pi_q + \frac{\pi_q+\theta}{\alpha-1},

where 𝜋𝑞 is given by (10.8).
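A minimal R sketch of the Pareto TVaR, computed both from the closed form above and from the limited-expected-value formula (10.7); the values of alpha, theta, and q are illustrative assumptions.

# Pareto TVaR: closed form and limited-expected-value formula
alpha <- 3; theta <- 1000; q <- 0.95
pi_q <- theta * ((1 - q)^(-1 / alpha) - 1)           # VaR, formula (10.8)
TVaR_closed <- pi_q + (pi_q + theta) / (alpha - 1)

EX    <- theta / (alpha - 1)
EXlim <- theta / (alpha - 1) * (1 - (theta / (theta + pi_q))^(alpha - 1))
TVaR_lev <- pi_q + (EX - EXlim) / (1 - q)            # formula (10.7)
c(TVaR_closed, TVaR_lev)                             # identical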



The following measure is closely related to 𝑇 𝑉 𝑎𝑅.


Definition 10.5. Fix 𝑞 ∈ (0, 1), the conditional value at risk of a random
variable 𝑋 is formulated as

CVaR_q[X] = \frac{1}{1-q}\int_q^1 VaR_{\alpha}[X]\,d\alpha.

The conditional value at risk is also known as the average value at risk (AVaR)
and the expected short-fall (ES). It can be shown that 𝐶𝑉 𝑎𝑅𝑞 [𝑋] = 𝑇 𝑉 𝑎𝑅𝑞 [𝑋]
when Pr(X = VaR_q[X]) = 0, which holds for continuous random variables. That
is, if X is continuous, then via a change of variables we can rewrite equation
(10.5) as

TVaR_q[X] = \frac{1}{1-q}\int_q^1 VaR_{\alpha}[X]\,d\alpha.  (10.9)

This alternative formula (10.9) tells us that 𝑇 𝑉 𝑎𝑅 is the average of 𝑉 𝑎𝑅𝛼 [𝑋]
with varying degree of confidence level over 𝛼 ∈ [𝑞, 1]. Therefore, the 𝑇 𝑉 𝑎𝑅
effectively resolves most of the limitations of 𝑉 𝑎𝑅 outlined in the previous
subsection. First, due to the averaging effect, the 𝑇 𝑉 𝑎𝑅 may be less sensitive
to the change of confidence level compared with VaR. Second, all the extremal
losses that are above the (1 - q) × 100% worst probable event are taken into
account.
In this respect, one can see that for any given 𝑞 ∈ (0, 1)

𝑇 𝑉 𝑎𝑅𝑞 [𝑋] ≥ 𝑉 𝑎𝑅𝑞 [𝑋].


Third and perhaps foremost, 𝑇 𝑉 𝑎𝑅 is a coherent risk measure and thus is able to
more accurately capture the diversification effects of insurance portfolio. Herein,
we do not intend to provide the proof of the coherent feature for 𝑇 𝑉 𝑎𝑅, which
is considered to be challenging technically.

10.4 Reinsurance

In this section, you learn how to:


• Define basic reinsurance treaties including proportional, quota share, non-
proportional, stop-loss, excess of loss, and surplus share.
• Interpret the optimality of quota share for reinsurers and compute optimal
quota share agreements.
• Interpret the optimality of stop-loss for insurers.

• Interpret and calculate optimal excess of loss retention limits.

Recall that reinsurance is simply insurance purchased by an insurer. Insur-


ance purchased by non-insurers is sometimes known as primary insurance to
distinguish it from reinsurance. Reinsurance differs from personal insurance
purchased by individuals, such as auto and homeowners insurance, in contract
flexibility. Like insurance purchased by major corporations, reinsurance pro-
grams are generally tailored more closely to the buyer. For contrast, in personal
insurance buyers typically cannot negotiate on the contract terms although they
may have a variety of different options (contracts) from which to choose.
The two broad types are proportional and non-proportional reinsurance. A
proportional reinsurance contract is an agreement between a reinsurer and a
ceding company (also known as the reinsured) in which the reinsurer assumes a
given percent of losses and premium. A reinsurance contract is also known as
a treaty. Non-proportional agreements are simply everything else. As examples
of non-proportional agreements, this chapter focuses on stop-loss and excess of
loss contracts. For all types of agreements, we split the total risk 𝑋 into the
portion taken on by the reinsurer, 𝑌𝑟𝑒𝑖𝑛𝑠𝑢𝑟𝑒𝑟 , and that retained by the insurer,
𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 , that is, 𝑋 = 𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 + 𝑌𝑟𝑒𝑖𝑛𝑠𝑢𝑟𝑒𝑟 .
The mathematical structure of a basic reinsurance treaty is the same as the
coverage modifications of personal insurance introduced in Chapter 3. For a
proportional reinsurance, the transformation 𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 = 𝑐𝑋 is identical to a
coinsurance adjustment in personal insurance. For stop-loss reinsurance, the
transformation 𝑌𝑟𝑒𝑖𝑛𝑠𝑢𝑟𝑒𝑟 = max(0, 𝑋 − 𝑀 ) is the same as an insurer’s payment
with deductible 𝑀 and 𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 = min(𝑋, 𝑀 ) = 𝑋 ∧ 𝑀 is equivalent to what a
policyholder pays with deductible 𝑀 . For practical applications of the mathe-
matics, in personal insurance the focus is generally upon the expectation as this
is a key ingredient used in pricing. In contrast, for reinsurance the focus is on
the entire distribution of the risk, as the extreme events are a primary concern
of the financial stability of the insurer and reinsurer.
This section describes the foundational and most basic of reinsurance treaties:
Section 10.4.1 for proportional and Section 10.4.2 for non-proportional reinsur-
ance. Section 10.4.3 gives a flavor of more complex contracts.

10.4.1 Proportional Reinsurance


The simplest example of a proportional treaty is called quota share.
• In a quota share treaty, the reinsurer receives a flat percent, say 50%, of
the premium for the book of business reinsured.
• In exchange, the reinsurer pays 50% of losses, including allocated loss
adjustment expenses
• The reinsurer also pays the ceding company a ceding commission which is
designed to reflect the differences in underwriting expenses incurred.

The amounts paid by the primary insurer and the reinsurer are summarized as

𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 = 𝑐𝑋 and 𝑌𝑟𝑒𝑖𝑛𝑠𝑢𝑟𝑒𝑟 = (1 − 𝑐)𝑋,

where 𝑐 ∈ (0, 1) denotes the proportion retained by the insurer. Note that
𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 + 𝑌𝑟𝑒𝑖𝑛𝑠𝑢𝑟𝑒𝑟 = 𝑋.
Example 10.4.1. Distribution of losses under quota share. To develop
an intuition for the effect of quota-share agreement on the distribution of losses,
the following is a short R demonstration using simulation. The accompanying
figure provides the relative shapes of the distributions of total losses, the retained
portion (of the insurer), and the reinsurer’s portion.

[Figure: Simulated densities of total losses and their split under a 75%/25% quota share. Panels: Total Loss, Insurer (75%), Reinsurer (25%); horizontal axis: Losses; vertical axis: Density.]
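The R code behind this demonstration is not reproduced in this extraction. The following is a minimal sketch under an assumed gamma model for total losses and a 75%/25% quota share, matching the panel titles in the figure above.

# Simulated quota share split of losses (assumed gamma loss model)
set.seed(2020)
X <- rgamma(10000, shape = 2, scale = 500)    # assumed total-loss distribution
c_retained <- 0.75
Y_insurer   <- c_retained * X
Y_reinsurer <- (1 - c_retained) * X

par(mfrow = c(1, 3))
plot(density(X), main = "Total Loss", xlab = "Losses")
plot(density(Y_insurer), main = "Insurer (75%)", xlab = "Losses")
plot(density(Y_reinsurer), main = "Reinsurer (25%)", xlab = "Losses")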

Quota Share is Desirable for Reinsurers


The quota share contract is particularly desirable for the reinsurer. To see this,
suppose that an insurer and reinsurer wish to enter a contract to share total
losses 𝑋 such that

𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 = 𝑔(𝑋) and 𝑌𝑟𝑒𝑖𝑛𝑠𝑢𝑟𝑒𝑟 = 𝑋 − 𝑔(𝑋),

for some generic function 𝑔(⋅) (known as the retention function). So that the
insurer does not retain more than the loss, we consider only functions so that
𝑔(𝑥) ≤ 𝑥. Suppose further that the insurer only cares about the variability of
retained claims and is indifferent to the choice of 𝑔 as long as 𝑉 𝑎𝑟(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) stays
the same and equals, say, 𝑄. Then, the following result shows that the quota
share reinsurance treaty minimizes the reinsurer’s uncertainty as measured by
𝑉 𝑎𝑟(𝑌𝑟𝑒𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ).
Proposition. Suppose that Var(Y_{insurer}) = Q. Then, Var((1-c)X) \le Var(X - g(X)) for all g(\cdot) such that Var(g(X)) = Q, where c = \sqrt{Q/Var(X)}.
Proof of the Proposition. With Y_{reinsurer} = X - Y_{insurer} and the usual decomposition of the variance of a difference,

Var(Y_{reinsurer}) = Var(X - Y_{insurer})
 = Var(X) + Var(Y_{insurer}) - 2\,Cov(X, Y_{insurer})
 = Var(X) + Q - 2\,Corr(X, Y_{insurer})\sqrt{Q\,Var(X)}.

In this expression, we see that 𝑄 and 𝑉 𝑎𝑟(𝑋) do not change with the choice
of 𝑔. Thus, we can minimize 𝑉 𝑎𝑟(𝑌𝑟𝑒𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) by maximizing the correlation
𝐶𝑜𝑟𝑟(𝑋, 𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ). If we use a quota share reinsurance agreement, then
Corr(X, Y_{insurer}) = Corr(X, cX) = 1, the maximum possible correlation.
This establishes the proposition.

The proposition is intuitively appealing - with quota share insurance, the rein-
surer shares the responsibility for very large claims in the tail of the distribution.
This is in contrast to non-proportional agreements where reinsurers take respon-
sibility for the very large claims.

Optimizing Quota Share Agreements for Insurers


Now assume 𝑛 risks in the portfolio, 𝑋1 , … , 𝑋𝑛 , so that the portfolio sum is
𝑋 = 𝑋1 + ⋯ + 𝑋𝑛 . For simplicity, we focus on the case of independent risks
(extensions to dependence is the subject of Chapter 14). Each risk 𝑋𝑖 may
represent risk of an individual policy, claim, or a sub-portfolio, depending on
the application. As an example of the latter, the insurer may subdivide its
portfolio into subportfolios consisting of lines of business such as (1) personal
auto, (2) commercial auto, (3) homeowners, (4) workers’ compensation, and so
forth.

In general, let us consider a variation of the basic quota share agreement where
the amount retained by the insurer may vary with each risk, say c_i. Thus, the
insurer's portion of the portfolio risk is Y_{insurer} = \sum_{i=1}^{n} c_iX_i. What is the best
choice of the proportions c_i?

To formalize this question, we seek to find those values of 𝑐𝑖 that minimize


𝑉 𝑎𝑟(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) subject to the constraint that 𝐸(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) = 𝐾. The requirement
that 𝐸(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) = 𝐾 suggests that the insurers wishes to retain a revenue in
at least the amount of the constant 𝐾. Subject to this revenue constraint, the
insurer wishes to minimize the uncertainty of the retained risks as measured by
the variance.

The Optimal Retention Proportions

Minimizing 𝑉 𝑎𝑟(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) subject to 𝐸(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) = 𝐾 is a constrained opti-


mization problem. We can use the method of Lagrange multipliers, a calculus
technique, to solve this. To this end, define the Lagrangian

L = Var(Y_{insurer}) - \lambda\left(E(Y_{insurer}) - K\right)
 = \sum_{i=1}^{n} c_i^2\,Var(X_i) - \lambda\left(\sum_{i=1}^{n} c_i\,E(X_i) - K\right)

Taking a partial derivative with respect to 𝜆 and setting this equal to zero
simply means that the constraint, 𝐸(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) = 𝐾, is enforced and we have to
choose the proportions 𝑐𝑖 to satisfy this constraint. Moreover, taking the partial
derivative with respect to each proportion 𝑐𝑖 yields

\frac{\partial}{\partial c_i}L = 2c_i\,Var(X_i) - \lambda\,E(X_i) = 0

so that

c_i = \frac{\lambda}{2}\,\frac{E(X_i)}{Var(X_i)}.

With our constraint, we may determine 𝜆 as the solution of

K = \sum_{i=1}^{n} c_i\,E(X_i) = \frac{\lambda}{2}\sum_{i=1}^{n}\frac{E(X_i)^2}{Var(X_i)}

and use this value of 𝜆 to determine the proportions.

From the math, it turns out that the constant for the ith risk, c_i, is proportional
to E(X_i)/Var(X_i). This is intuitively appealing. Other things being equal, a higher
revenue as measured by 𝐸(𝑋𝑖 ) means a higher value of 𝑐𝑖 . In the same way,
a higher value of uncertainty as measured by 𝑉 𝑎𝑟(𝑋𝑖 ) means a lower value of
𝑐𝑖 . The proportional scaling factor is determined by the revenue requirement
𝐸(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) = 𝐾. The following example helps to develop a feel for this rela-
tionship.
Example 10.4.2. Three Pareto risks. Consider three risks that have a
Pareto distribution, each having a different set of parameters (so they are inde-
pendent but non-identical). Specifically, use the parameters:
• 𝛼1 = 3, 𝜃1 = 1000 for the first risk 𝑋1 ,
• 𝛼2 = 3, 𝜃2 = 2000 for the second risk 𝑋2 , and
• 𝛼3 = 4, 𝜃3 = 3000 for the third risk 𝑋3 .
Provide a graph that give values of 𝑐1 , 𝑐2 , and 𝑐3 for a required revenue 𝐾. Note
that these values increase linearly with 𝐾.
Solution.

[Figure: Optimal retention proportions c1, c2, and c3 plotted against the required revenue (K); vertical axis: proportion. Each proportion increases linearly in K.]
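A minimal R sketch of the optimal proportions for the three Pareto risks, using the Pareto moments E(X) = theta/(alpha - 1) and Var(X) = alpha*theta^2/((alpha - 1)^2 (alpha - 2)); the choice K = 1000 below is an illustrative assumption.

# Optimal quota-share proportions for the three Pareto risks of Example 10.4.2
alpha <- c(3, 3, 4); theta <- c(1000, 2000, 3000)
EX   <- theta / (alpha - 1)
VarX <- alpha * theta^2 / ((alpha - 1)^2 * (alpha - 2))

c_i <- function(K) {
  lambda_over_2 <- K / sum(EX^2 / VarX)     # from the revenue constraint
  lambda_over_2 * EX / VarX                 # c_i proportional to E(X_i)/Var(X_i)
}
c_i(1000)                                   # proportions at K = 1000; each grows linearly in K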

10.4.2 Non-Proportional Reinsurance


The Optimality of Stop-Loss Insurance
Under a stop-loss arrangement, the insurer sets a retention level 𝑀 (> 0) and
pays in full total claims for which 𝑋 ≤ 𝑀 . Further, for claims for which 𝑋 > 𝑀 ,
the primary insurer pays 𝑀 and the reinsurer pays the remaining amount 𝑋−𝑀 .
Thus, the insurer retains an amount 𝑀 of the risk. Summarizing, the amounts
paid by the primary insurer and the reinsurer are

Y_{insurer} = \begin{cases} X & \text{for } X \le M \\ M & \text{for } X > M \end{cases} = \min(X, M) = X \wedge M

and

Y_{reinsurer} = \begin{cases} 0 & \text{for } X \le M \\ X - M & \text{for } X > M \end{cases} = \max(0, X-M).

As before, note that 𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 + 𝑌𝑟𝑒𝑖𝑛𝑠𝑢𝑟𝑒𝑟 = 𝑋.


The stop-loss type of contract is particularly desirable for the insurer. Similar
to earlier, suppose that an insurer and reinsurer wish to enter a contract so
that 𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 = 𝑔(𝑋) and 𝑌𝑟𝑒𝑖𝑛𝑠𝑢𝑟𝑒𝑟 = 𝑋 − 𝑔(𝑋) for some generic retention
function 𝑔(⋅). Suppose further that the insurer only cares about the variability
of retained claims and is indifferent to the choice of 𝑔 as long as 𝑉 𝑎𝑟(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 )
can be minimized. Again, we impose the constraint that 𝐸(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) = 𝐾; the
insurer needs to retain a revenue 𝐾. Subject to this revenue constraint, the
insurer wishes to minimize uncertainty of the retained risks (as measured by

the variance). Then, the following result shows that the stop-loss reinsurance
treaty minimizes the insurer's uncertainty as measured by Var(Y_insurer).
Proposition. Suppose that 𝐸(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) = 𝐾. Then, 𝑉 𝑎𝑟(𝑋 ∧𝑀 ) ≤ 𝑉 𝑎𝑟(𝑔(𝑋))
for all 𝑔(.), where 𝑀 is such that 𝐸(𝑋 ∧ 𝑀 ) = 𝐾.
Proof of the Proposition. Add and subtract a constant 𝑀 and expand the
square to get

𝑉 𝑎𝑟(𝑔(𝑋)) = 𝐸(𝑔(𝑋) − 𝐾)2 = 𝐸(𝑔(𝑋) − 𝑀 + 𝑀 − 𝐾)2


= 𝐸(𝑔(𝑋) − 𝑀 )2 + (𝑀 − 𝐾)2 + 2𝐸(𝑔(𝑋) − 𝑀 )(𝑀 − 𝐾)
= 𝐸(𝑔(𝑋) − 𝑀 )2 − (𝑀 − 𝐾)2 ,

because 𝐸(𝑔(𝑋)) = 𝐾.
Now, for any retention function, we have 𝑔(𝑋) ≤ 𝑋, that is, the insurer’s
retained claims are less than or equal to total claims. Using the notation
𝑔𝑆𝐿 (𝑋) = 𝑋 ∧ 𝑀 for stop-loss insurance, we have

𝑀 − 𝑔𝑆𝐿 (𝑋) = 𝑀 − (𝑋 ∧ 𝑀 )
= max(𝑀 − 𝑋, 0)
≤ max(𝑀 − 𝑔(𝑋), 0).
Squaring each side yields

(𝑀 − 𝑔𝑆𝐿 (𝑋))2 ≤ max((𝑀 − 𝑔(𝑋))2 , 0) ≤ (𝑀 − 𝑔(𝑋))2 .

Returning to our expression for the variance, we have

𝑉 𝑎𝑟(𝑔𝑆𝐿 (𝑋)) = 𝐸(𝑔𝑆𝐿 (𝑋) − 𝑀 )2 − (𝑀 − 𝐾)2


≤ 𝐸(𝑔(𝑋) − 𝑀 )2 − (𝑀 − 𝐾)2 = 𝑉 𝑎𝑟(𝑔(𝑋)),

for any retention function 𝑔. This establishes the proposition.


The proposition is intuitively appealing: with stop-loss insurance, the reinsurer
takes the responsibility for very large claims in the tail of the distribution, not
the insurer.

Excess of Loss
A closely related form of non-proportional reinsurance is the excess of loss cov-
erage. Under this contract, we assume that the total risk X can be thought
of as composed of n separate risks X_1, … , X_n and that each of these risks is
subject to an upper limit, say, M_i. So the insurer retains

Y_insurer = ∑_{i=1}^{n} Y_{i,insurer} , where Y_{i,insurer} = X_i ∧ M_i ,

and the reinsurer is responsible for the excess, 𝑌𝑟𝑒𝑖𝑛𝑠𝑢𝑟𝑒𝑟 = 𝑋 − 𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 . The
retention limits may vary by risk or may be the same for all risks, that is,
𝑀𝑖 = 𝑀 , for all 𝑖.

Optimal Choice for Excess of Loss Retention Limits


What is the best choice of the excess of loss retention limits 𝑀𝑖 ? To formalize
this question, we seek to find those values of 𝑀𝑖 that minimize 𝑉 𝑎𝑟(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 )
subject to the constraint that 𝐸(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) = 𝐾. Subject to this revenue con-
straint, the insurer wishes to minimize the uncertainty of the retained risks (as
measured by the variance).
The Optimal Retention Limits
Minimizing 𝑉 𝑎𝑟(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) subject to 𝐸(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) = 𝐾 is a constrained opti-
mization problem. We can use the method of Lagrange multipliers, a calculus
technique, to solve this. As before, define the Lagrangian

𝐿 = 𝑉 𝑎𝑟(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) − 𝜆(𝐸(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) − 𝐾)
= ∑_{i=1}^{n} Var(X_i ∧ M_i) − λ (∑_{i=1}^{n} E(X_i ∧ M_i) − K).

We first recall the relationships

E(X ∧ M) = ∫_0^M (1 − F(x)) dx

and

E[(X ∧ M)^2] = 2 ∫_0^M x (1 − F(x)) dx.

Taking a partial derivative of 𝐿 with respect to 𝜆 and setting this equal to zero
simply means that the constraint, 𝐸(𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 ) = 𝐾, is enforced and we have
to choose the limits 𝑀𝑖 to satisfy this constraint. Moreover, taking the partial
derivative with respect to each limit 𝑀𝑖 yields

∂L/∂M_i = ∂/∂M_i Var(X_i ∧ M_i) − λ ∂/∂M_i E(X_i ∧ M_i)
= ∂/∂M_i ( E[(X_i ∧ M_i)^2] − (E(X_i ∧ M_i))^2 ) − λ (1 − F_i(M_i))
= 2 M_i (1 − F_i(M_i)) − 2 E(X_i ∧ M_i)(1 − F_i(M_i)) − λ (1 − F_i(M_i)).

Setting ∂L/∂M_i = 0 and solving for λ, we get

λ = 2 (M_i − E(X_i ∧ M_i)).



From the math, it turns out that the retention limit less the expected insurer’s
claims, 𝑀𝑖 − 𝐸(𝑋𝑖 ∧ 𝑀𝑖 ), is the same for all risks. This is intuitively appealing.

Example 10.4.3. Excess of loss for three Pareto risks. Consider three
risks that have a Pareto distribution, each having a different set of parameters
(so they are independent but non-identical). Use the same set of parameters as
in Example 10.4.2. For this example:

a. Show numerically that, at the optimal retention limits M_1, M_2, and M_3, the
resulting retention limit minus expected insurer's claims, M_i − E(X_i ∧ M_i),
is the same for all risks, as we derived theoretically.
b. Further, graphically compare the distribution of total risks to that retained
by the insurer and by the reinsurer.

Solution

a. We first optimize the Lagrangian using the R package alabama, which implements
an augmented Lagrangian adaptive barrier minimization algorithm.
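The text's own code chunk is not displayed. A minimal sketch of such a constrained optimization with alabama::auglag might look as follows; the limited moments are computed by numerical integration, and the revenue target K below is an assumed illustrative value (the text does not state the K that produced the printed output).

# Sketch only: minimize sum of Var(X_i ^ M_i) subject to sum of E(X_i ^ M_i) = K.
library(alabama)
S_pareto <- function(x, alpha, theta) (theta / (x + theta))^alpha     # survival function
lim_mean <- function(M, alpha, theta)
  integrate(function(x) S_pareto(x, alpha, theta), 0, M)$value        # E(X ^ M)
lim_2nd  <- function(M, alpha, theta)
  2 * integrate(function(x) x * S_pareto(x, alpha, theta), 0, M)$value  # E[(X ^ M)^2]
lim_var  <- function(M, alpha, theta)
  lim_2nd(M, alpha, theta) - lim_mean(M, alpha, theta)^2

alpha <- c(3, 3, 4); theta <- c(1000, 2000, 3000)
K <- 2000   # assumed revenue constraint (illustrative choice)
obj <- function(M) sum(mapply(lim_var,  M, alpha, theta))   # total retained variance
heq <- function(M) sum(mapply(lim_mean, M, alpha, theta)) - K
fit <- auglag(par = c(500, 500, 500), fn = obj, heq = heq,
              control.outer = list(trace = FALSE))
fit$par - mapply(lim_mean, fit$par, alpha, theta)   # M_i - E(X_i ^ M_i), equal across risks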

At the optimal retention limits M_1, M_2, and M_3, the resulting retention limit
minus expected insurer's claims, M_i − E(X_i ∧ M_i), is the same for all risks, as
we derived theoretically:

[1] 1344.135

[1] 1344.133

[1] 1344.133

b. We graphically compare the distribution of total risks to that retained by


the insurer and by the reinsurer.

Figure: Densities of the total losses, the amounts retained by the insurer, and the amounts paid by the reinsurer (losses on the horizontal axes).



10.4.3 Additional Reinsurance Treaties


Surplus Share Proportional Treaty
Another proportional treaty is known as surplus share; this type of contract is
common in commercial property insurance.
• A surplus share treaty allows the reinsured to limit its exposure on a risk
to a given amount (the retained line).
• The reinsurer assumes a part of the risk in proportion to the amount that
the insured value exceeds the retained line, up to a given limit (expressed
as a multiple of the retained line, or number of lines).
For example, let the retained line be 100,000 and the given limit be 4 lines
(400,000). Then, if 𝑋 is the loss, the reinsurer’s portion is min(400000, (𝑋 −
100000)+ ).

Layers of Coverage
One can also extend non-proportional stop-loss treaties by introducing addi-
tional parties to the contract. For example, instead of simply an insurer and
reinsurer or an insurer and a policyholder, think about the situation with all
three parties, a policyholder, insurer, and reinsurer, who agree on how to share
a risk. More generally, we consider 𝑘 parties. If 𝑘 = 3, it could be an insurer
and two different reinsurers.
Example 10.4.4. Layers of coverage for three parties.
• Suppose that there are 𝑘 = 3 parties. The first party is responsible for
the first 100 of claims, the second responsible for claims from 100 to 3000,
and the third responsible for claims above 3000.
• If there are four claims in the amounts 50, 600, 1800 and 4000, then they
would be allocated to the parties as follows:

Layer Claim 1 Claim 2 Claim 3 Claim 4 Total


(0, 100] 50 100 100 100 350
(100, 3000] 0 500 1700 2900 5100
(3000, ∞) 0 0 0 1000 1000
Total 50 600 1800 4000 6450
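A quick sketch in R (not part of the text) reproduces this allocation by capping each claim at the cut-points 0, 100, 3000, and ∞ and taking differences (this rule is formalized just below):

# Each claim is split over the layers by differencing the capped claim amounts.
claims <- c(50, 600, 1800, 4000)
cuts <- c(0, 100, 3000, Inf)
layers <- sapply(claims, function(x) diff(pmin(x, cuts)))
rownames(layers) <- c("(0, 100]", "(100, 3000]", "(3000, Inf)")
colnames(layers) <- paste("Claim", 1:4)
cbind(layers, Total = rowSums(layers))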

To handle the general situation with 𝑘 groups, partition the positive real line
into 𝑘 intervals using the cut-points
0 = 𝑀0 < 𝑀1 < ⋯ < 𝑀𝑘−1 < 𝑀𝑘 = ∞.

Note that the 𝑗th interval is (𝑀𝑗−1 , 𝑀𝑗 ]. Now let 𝑌𝑗 be the amount of risk
shared by the 𝑗th party. To illustrate, if a loss 𝑥 is such that 𝑀𝑗−1 < 𝑥 ≤ 𝑀𝑗 ,

then

Y_1 = M_1 − M_0 , Y_2 = M_2 − M_1 , … , Y_{j−1} = M_{j−1} − M_{j−2} , Y_j = x − M_{j−1} , and Y_{j+1} = ⋯ = Y_k = 0.

More succinctly, we can write

𝑌𝑗 = min(𝑋, 𝑀𝑗 ) − min(𝑋, 𝑀𝑗−1 ).

With the expression 𝑌𝑗 = min(𝑋, 𝑀𝑗 )−min(𝑋, 𝑀𝑗−1 ), we see that the 𝑗th party
is responsible for claims in the interval (𝑀𝑗−1 , 𝑀𝑗 ]. With this, you can check
that 𝑋 = 𝑌1 + 𝑌2 + ⋯ + 𝑌𝑘 . As emphasized in the following example, we also
remark that the parties need not be different.
Example 10.4.5.
• Suppose that a policyholder is responsible for the first 100 of claims and
all claims in excess of 100,000. The insurer takes claims between 100 and
100,000.
• Then, we would use 𝑀1 = 100, 𝑀2 = 100000.
• The policyholder is responsible for 𝑌1 = min(𝑋, 100) and 𝑌3 = 𝑋 −
min(𝑋, 100000) = max(0, 𝑋 − 100000).
For additional reading, see the Wisconsin Property Fund site for an example on
layers of reinsurance.

Portfolio Management Example


Many other variations of the foundational contracts are possible. For one more
illustration, consider the following.
Example 10.4.6. Portfolio Management. You are the Chief Risk Officer
of a telecommunications firm. Your firm has several property and liability risks.
We will consider:
• 𝑋1 - buildings, modeled using a gamma distribution with mean 200 and
scale parameter 100.
• 𝑋2 - motor vehicles, modeled using a gamma distribution with mean 400
and scale parameter 200.
• 𝑋3 - directors and executive officers risk, modeled using a Pareto distri-
bution with mean 1000 and scale parameter 1000.
• 𝑋4 - cyber risks, modeled using a Pareto distribution with mean 1000 and
scale parameter 2000.

Denote the total risk as 𝑋 = 𝑋1 + 𝑋2 + 𝑋3 + 𝑋4 . For simplicity, you assume


that these risks are independent. (Later, in Section 14.6, we will consider the
more complex case of dependence.)

To manage the risk, you seek some insurance protection. You wish to manage
internally small building and motor vehicles amounts, up to 𝑀1 and 𝑀2 , respec-
tively. You seek insurance to cover all other risks. Specifically, the insurer’s
portion is

𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 = (𝑋1 − 𝑀1 )+ + (𝑋2 − 𝑀2 )+ + 𝑋3 + 𝑋4 ,

so that your retained risk is 𝑌𝑟𝑒𝑡𝑎𝑖𝑛𝑒𝑑 = 𝑋 − 𝑌𝑖𝑛𝑠𝑢𝑟𝑒𝑟 = min(𝑋1 , 𝑀1 ) +


min(𝑋2 , 𝑀2 ). Using deductibles 𝑀1 = 100 and 𝑀2 = 200:

a. Determine the expected claim amount of (i) that retained, (ii) that ac-
cepted by the insurer, and (iii) the total overall amount.
b. Determine the 80th, 90th, 95th, and 99th percentiles for (i) that retained,
(ii) that accepted by the insurer, and (iii) the total overall amount.
c. Compare the distributions by plotting the densities for (i) that retained,
(ii) that accepted by the insurer, and (iii) the total overall amount.

Solution.

In preparation, here is the code needed to set the parameters.

With these parameters, we can now simulate realizations of the portfolio risks.
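The text's code chunks are not displayed. A minimal simulation sketch under the stated assumptions might look as follows; the gamma shapes follow from mean = shape × scale, the Pareto shapes from mean = scale/(shape − 1), and the use of actuar::rpareto, the seed, and the sample size are choices made here for illustration.

library(actuar)                              # for rpareto()
set.seed(2020)                               # arbitrary seed
n <- 100000
X1 <- rgamma(n, shape = 2, scale = 100)      # buildings: mean 200
X2 <- rgamma(n, shape = 2, scale = 200)      # motor vehicles: mean 400
X3 <- rpareto(n, shape = 2, scale = 1000)    # directors and officers: mean 1000
X4 <- rpareto(n, shape = 3, scale = 2000)    # cyber: mean 1000
M1 <- 100; M2 <- 200
retained <- pmin(X1, M1) + pmin(X2, M2)
insurer  <- pmax(X1 - M1, 0) + pmax(X2 - M2, 0) + X3 + X4
total    <- retained + insurer
colMeans(cbind(Retained = retained, Insurer = insurer, Total = total))   # part (a)
sapply(list(Retained = retained, Insurer = insurer, Total = total),
       quantile, probs = c(0.80, 0.90, 0.95, 0.99))                      # part (b)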

(a) Here are the results for the expected claim amounts.

Retained Insurer Total


[1,] 269.05 5274.41 5543.46

(b) Here are the results for the quantiles.

80% 90% 95% 99%


Retained 300.00 300.00 300.00 300.00
Insurer 6075.67 7399.80 9172.69 14859.02
Total 6351.35 7675.04 9464.20 15159.02

(c) Here are the results for the density plots of the retained, insurer, and total
portfolio risk.

Figure: Densities of the retained portfolio risk, the insurer portfolio risk, and the total portfolio risk (note the different horizontal and vertical scales across the panels).

10.5 Further Resources and Contributors


• Edward W. (Jed) Frees, University of Wisconsin-Madison, and Jianxi
Su, Purdue University are the principal authors of the initial version of
this chapter. Email: [email protected] and/or [email protected] for
chapter comments and suggested improvements.
• Chapter reviewers include: Fei Huang, Hirokazu (Iwahiro) Iwasawa, Peng
Shi, Ranee Thiagarajah, Ping Wang, and Chengguo Weng.
Some of the examples from this chapter were borrowed from Clark (1996), Klug-
man et al. (2012), and Bahnemann (2015). These resources provide excellent
sources for additional discussions and examples.
Chapter 11

Loss Reserving

Chapter Preview. This chapter introduces loss reserving (also known as claims
reserving) for property and casualty (P&C, or general, non-life) insurance prod-
ucts. In particular, the chapter sketches some basic, though essential, analytic
tools to assess the reserves on a portfolio of P&C insurance products. First,
Section 11.1 motivates the need for loss reserving, then Section 11.2 studies the
available data sources and introduces some formal notation to tackle loss reserv-
ing as a prediction challenge. Next, Section 11.3 covers the chain-ladder method
and Mack’s distribution-free chain-ladder model. Section 11.4 then develops a
fully stochastic approach to determine the outstanding reserve with generalized
linear models (GLMs), including the technique of bootstrapping to obtain a
predictive distribution of the outstanding reserve via simulation.

11.1 Motivation
Our starting point is the lifetime of a P&C insurance claim. Figure 11.1 pictures
the development of such a claim over time and identifies the events of interest:
The insured event or accident occurs at time 𝑡𝑜𝑐𝑐 . This incident is reported to the
insurance company at time 𝑡𝑟𝑒𝑝 , after some delay. If the filed claim is accepted
by the insurance company, payments will follow to reimburse the financial loss
of the policyholder. In this example the insurance company compensates the
incurred loss with loss payments at times 𝑡1 , 𝑡2 and 𝑡3 . Eventually, the claim
settles or closes at time 𝑡𝑠𝑒𝑡 .
Often claims will not settle immediately due to the presence of delay in the re-
porting of a claim, delay in the settlement process or both. The reporting delay
is the time that elapses between the occurrence of the insured event and the
reporting of this event to the insurance company. The time between reporting
and settlement of a claim is known as the settlement delay. For example, it is
very intuitive that a material or property damage claim settles quicker than a

bodily injury claim involving a complex type of injury. Closed claims may also
reopen due to new developments, e.g. an injury that requires extra treatment.
Put together, the development of a claim typically takes some time. The presence
of this delay in the run-off of a claim requires the insurer to hold capital
in order to settle these claims in the future.

Figure 11.1: Lifetime or Run-off of a Claim (timeline showing the occurrence t_occ, reporting t_rep, loss payments at t_1, t_2 and t_3, and settlement t_set, together with the reporting delay and the settlement delay).

11.1.1 Closed, IBNR, and RBNS Claims


Based on the status of the claim’s run-off we distinguish three types of claims
in the books of an insurance company. A first type of claim is a closed claim.
For these claims the complete development has been observed. With the red
line in Figure 11.2 indicating the present moment, all events from the claim’s
development take place before the present moment. Hence, these events are
observed at the present moment. For convenience, we will assume that a closed
claim can not reopen.

Figure 11.2: Lifetime of a Closed Claim (occurrence, reporting, loss payments, and settlement all lie before the present moment).

An RBNS claim is one that has been Reported, But is Not fully Settled at
the present moment or the moment of evaluation (the valuation date), that is,
the moment when the reserves should be calculated and booked by the insurer.
Occurrence, reporting and possibly some loss payments take place before the
present moment, but the closing of the claim happens in the future, beyond the
present moment.

Figure 11.3: Lifetime of an RBNS Claim (occurrence, reporting, and some loss payments are observed; the remaining development, up to settlement, is uncertain).

An IBNR claim is one that has Incurred in the past But is Not yet Reported.
For such a claim the insured event took place, but the insurance company is
not yet aware of the associated claim. This claim will be reported in the future
and its complete development (from reporting to settlement) takes place in the
future.

Figure 11.4: Lifetime of an IBNR Claim (only the occurrence lies before the present moment; the entire development from reporting onward is uncertain).

Insurance companies will reserve capital to fulfill their future liabilities with
respect to both RBNS as well as IBNR claims. The future development of such
claims is uncertain and predictive modeling techniques will be used to calculate
appropriate reserves, from the historical development data observed on similar
claims.

11.1.2 Why Reserving?


The inverted production cycle of the insurance market and the claim dynam-
ics pictured in Section 11.1.1 motivate the need for reserving and the design
of predictive modeling tools to estimate reserves. In insurance, the premium
income precedes the costs. An insurer will charge a client a premium, before
actually knowing how costly the insurance policy or contract will become. In
a typical manufacturing industry this is not the case and the manufacturer knows,
before selling a product, what the cost of producing this product was. At
a specified evaluation moment 𝜏 the insurer will predict outstanding liabilities
with respect to contracts sold in the past. This is the claims reserve or loss
reserve; it is the capital necessary to settle open claims from past exposures. It
is a very important element on the balance sheet of the insurer, more specifically
on the liabilities side of the balance sheet.

11.2 Loss Reserve Data


11.2.1 From Micro to Macro
We now shed light on the data available to estimate the outstanding reserve for
a portfolio of P&C contracts. Insurance companies typically register data on the
development of an individual claim as sketched in the timeline on the left hand
side of Figure 11.5. We refer to data registered at this level as granular or
micro-level data. Typically, an actuary aggregates the information registered
on the individual development of claims across all claims in a portfolio. This
aggregation leads to data structured in a triangular format as shown on the right
hand side of Figure 11.5. Such data are called aggregate or macro-level data
because each cell in the triangle displays information obtained by aggregating
the development of multiple claims.

Figure 11.5: From Granular Data to Run-off Triangle (the individual claim timeline on the left is compressed into a triangle with occurrence years in rows and payment delays in columns, aggregated over all claims in the portfolio).

The triangular display used in loss reserving is called a run-off or develop-


ment triangle. On the vertical axis the triangle lists the accident or occurrence
years during which a portfolio is followed. The loss payments booked for a spe-
cific claim are connected to the year during the which the insured event occurred.
The horizontal axis indicates the payment delay since occurrence of the insured
event.

11.2.2 Run-off Triangles


A first example of a run-off triangle with incremental payments is displayed
in Figure 11.6 (taken from Wüthrich and Merz (2008), Table 2.2, also used in
Wüthrich and Merz (2015), Table 1.4). Accident years (or years of occurrence)
are shown on the vertical axis and run from 2004 up to 2013. These refer to the
year during which the insured event occurred. The horizontal axis indicates the
payment delay in years since occurrence of the insured event. A delay of 0 is used
for payments made in the year of occurrence of the accident or insured event.
One year of delay is used for payments made in the year after occurrence of the
accident.
accident                        payment delay (in years)
year        0        1        2      3      4      5      6     7     8     9
2004   5,947.0  3,721.2   895.7  207.8  206.7   62.1   65.8  14.9  11.1  15.8
2005   6,346.8  3,246.4   723.2  151.8   67.8   36.6   52.8  11.2  11.6
2006   6,269.1  2,976.2   847.1  262.8  152.7   65.4   53.5   8.9
2007   5,863.0  2,683.2   722.5  190.7  133.0   88.3   43.3
2008   5,778.9  2,745.2   653.9  273.4  230.3  105.2
2009   6,184.8  2,828.3   572.8  244.9  105.0
2010   5,600.2  2,893.2   563.1  225.5
2011   5,288.1  2,440.1   528.0
2012   5,290.8  2,357.9
2013   5,675.6

Figure 11.6: A Run-off Triangle with Incremental Payment Data (amounts in thousands). Source: Wüthrich and Merz (2008), Table 2.2.

For example, cell (2004, 0) in the above triangle displays the number 5,947, the
total amount paid in the year 2004 for all claims occurring in year 2004. Thus,
it is the total amount paid with 0 years of delay on all claims that occurred in
the year 2004. Similarly, the number in cell (2012, 1) displays the total 2,357.9
paid in the year 2013 for all claims that occurred in year 2012.
accident payment delay (in years)
year 0 1 2 3 4 5 6 7 8 9
2004 5,947 9,668 10,564 10,772 10,978 11,041 11,106 11,121 11,132 11,148
2005 6,347 9,593 10,316 10,468 10,536 10,573 10,625 10,637 10,648
2006 6,269 9,245 10,092 10,355 10,508 10,573 10,627 10,636
2007 5,863 8,546 9,269 9,459 9,592 9,681 9,724
2008 5,779 8,524 9,178 9,451 9,682 9,787
2009 6,185 9,013 9,586 9,831 9,936
2010 5,600 8,493 9,057 9,282
2011 5,288 7,728 8,256
2012 5,291 7,649
2013 5,676

Figure 11.7: A Run-off Triangle with Cumulative Payment Data. Source: Wüthrich and Merz (2008), Table 2.2.

Whereas the triangle in Figure 11.6 displays incremental payment data, Figure
11.7 shows the same information in cumulative format. Now, cell (2004, 1)
displays the total claim amount paid up to payment delay 1 for all claims that
occurred in year 2004. Therefore, it is the sum of the amount paid in 2004 and
the amount paid in 2005 on accidents that occurred in 2004.
Different pieces of information can be stored in run-off triangles as those shown
in Figure 11.6 and Figure 11.7. Depending on the kind of data stored, the
triangle will be used to estimate different quantities.
For example, in incremental format a cell may display:
• the claim payments, as motivated before
• the number of claims that occurred in a specific year and were reported
with a certain delay, when the goal is to estimate the number of IBNR
claims
• the change in incurred amounts, where incurred claim amounts are the
sum of cumulative paid claims and the case estimates. The case estimate
is the claims handler’s expert estimate of the outstanding amount on a
claim.
In cumulative format a cell may display:
• the cumulative paid amount, as motivated before
• the total number of claims from an occurrence year, reported up to a
certain delay
• the incurred claim amounts.
Other sources of information are potentially available, e.g. covariates (like the
type of claim), external information (like inflation, change in regulation). Most
claims reserving methods designed for run-off triangles are rather based on a
single source of information, although recent contributions focus on the use of
more detailed data for loss reserving.

11.2.3 Loss Reserve Notation


Run-off Triangles
To formalize the displays shown in Figures 11.6 and 11.7, we let 𝑖 refer to the
occurrence or accident year, the year in which the insured event happened. In
our notation the first accident year considered in the portfolio is denoted with
1 and the latest, most recent accident year is denoted with 𝐼. Then, 𝑗 refers to
the payment delay or development year, where a delay equal to 0 corresponds
to the accident year itself. Figure 11.8 shows a triangle where the same number
of years is considered in both the vertical as well as the horizontal direction,
hence 𝑗 runs from 0 up to 𝐽 = 𝐼 − 1.
Figure 11.8: Mathematical notation for a run-off triangle. Accident years i = 1, … , I are listed in the rows and payment delays j = 0, 1, … , J = I − 1 in the columns; the upper triangle holds the observations 𝒟_I = {C_{ij} : i + j ≤ I}, while the lower triangle 𝒟_I^c = {C_{ij} : i + j > I} is to be predicted. Source: Wüthrich and Merz (2008)

The random variable 𝑋𝑖𝑗 denotes the incremental claims paid in development
period 𝑗 on claims from accident year 𝑖. Thus, 𝑋𝑖𝑗 is the total amount paid in
development year 𝑗 for all claims that happened in occurrence year 𝑖. These
payments are actually paid out in accounting or calendar year 𝑖 + 𝑗. Taking
a cumulative point of view, 𝐶𝑖𝑗 is the cumulative amount paid up until (and
including) development year 𝑗 for accidents that occurred in year 𝑖. Ultimately,
a total amount 𝐶𝑖𝐽 is paid in the final development year 𝐽 for claims that
occurred in accident year 𝑖. In this chapter time is expressed in years, though
other time units can be used as well, e.g. six-month periods or quarters.

The Loss Reserve


At the evaluation moment 𝜏 , the data in the upper triangle have been observed,
whereas the lower triangle has to be predicted. Here, the evaluation moment is
the end of accident year 𝐼 which implies that a cell (𝑖, 𝑗) with 𝑖+𝑗 ≤ 𝐼 is observed,
and a cell (𝑖, 𝑗) with 𝑖 + 𝑗 > 𝐼 belongs to the future and has to be predicted.
Thus, for a cumulative run-off triangle, the goal of a loss reserving method is to
predict 𝐶𝑖,𝐼−1 , the ultimate claim amount for occurrence year 𝑖, corresponding
to the final development period 𝐼 − 1 in Figure 11.7. We assume that - beyond
this period - no further payments will follow, although this assumption can be
relaxed.

Since 𝐶𝑖,𝐼−1 is cumulative, it includes both an observed part as well as a part


that has to be predicted. Therefore, the outstanding liability or loss reserve for
accident year 𝑖 is

ℛ_i^{(0)} = ∑_{ℓ=I−i+1}^{I−1} X_{iℓ} = C_{i,I−1} − C_{i,I−i} .

We express the reserve either as a sum of incremental data, the X_{iℓ}, or as a dif-
ference between cumulative numbers. In the latter case the outstanding amount
is the ultimate cumulative amount C_{i,I−1} minus the most recently observed cumu-
lative amount C_{i,I−i}. Following Wüthrich and Merz (2015), the notation ℛ_i^{(0)}
refers to the reserve for occurrence year i where i = 1, … , I. The superscript
(0) refers to the evaluation of the reserve at the present moment, say τ = 0. We
understand τ = 0 as the end of occurrence year I, the most recent calendar year
for which data are observed and registered.

11.2.4 R Code to Summarize Loss Reserve Data


We use the ChainLadder package (Gesmann et al., 2019) to import run-off
triangles in R and to explore the trends present in these triangles. The package’s
vignette nicely documents its functions for working with triangular data. First,
we explore two ways to import a triangle.

Long Format Data


The dataset triangle_W_M_long.txt stores the cumulative run-off triangle
from Wüthrich and Merz (2008) (Table 2.2) in long format. That is: each
cell in the triangle is one row in this data set, and three features are stored: the
payment size (cumulative, in this example), the year of occurrence (𝑖) and the
payment delay (𝑗). We import the .txt file and store the resulting data frame
as my_triangle_long:
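The import statement itself is not shown in the text; a plausible version, assuming a plain whitespace-delimited file with a header row, is:

my_triangle_long <- read.table("triangle_W_M_long.txt", header = TRUE)
head(my_triangle_long)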
payment origin dev
1 5946975 2004 0
2 9668212 2004 1
3 10563929 2004 2
4 10771690 2004 3
5 10978394 2004 4
6 11040518 2004 5
We use the as.triangle function from the ChainLadder package to transform
the data frame into a triangular display. The resulting object my_triangle is
now of type triangle.
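A sketch of this conversion (the argument names below match the column names shown above):

library(ChainLadder)
my_triangle <- as.triangle(my_triangle_long, origin = "origin",
                           dev = "dev", value = "payment")
str(my_triangle)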
'triangle' int [1:10, 1:10] 5946975 6346756 6269090 5863015 5778885 6184793 5600184 52
- attr(*, "dimnames")=List of 2
..$ origin: chr [1:10] "2004" "2005" "2006" "2007" ...
..$ dev : chr [1:10] "0" "1" "2" "3" ...
We display the triangle and recognize the numbers (in thousands) from Figure
11.7. Cells in the lower triangle are indicated as not available, NA.
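A display call along the following lines produces the output below; dividing by 1,000 is an assumption made here to match the rounded display, since the stored values are the full payment amounts.

round(my_triangle / 1000)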
dev
origin 0 1 2 3 4 5 6 7 8 9
2004 5947 9668 10564 10772 10978 11041 11106 11121 11132 11148
2005 6347 9593 10316 10468 10536 10573 10625 10637 10648 NA
2006 6269 9245 10092 10355 10508 10573 10627 10636 NA NA
2007 5863 8546 9269 9459 9592 9681 9724 NA NA NA
2008 5779 8524 9178 9451 9682 9787 NA NA NA NA
2009 6185 9013 9586 9831 9936 NA NA NA NA NA
2010 5600 8493 9057 9282 NA NA NA NA NA NA
2011 5288 7728 8256 NA NA NA NA NA NA NA
2012 5291 7649 NA NA NA NA NA NA NA NA
2013 5676 NA NA NA NA NA NA NA NA NA

Triangular Format Data


Alternatively, the triangle may be stored in a .csv file with the occurrence years
in the rows and the development years in the column cells. We import this .csv
file and transform the resulting my_triangle_csv to a matrix.
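The import code is not displayed in the text; a plausible sketch, in which the .csv file name is an assumption, is:

my_triangle_csv <- read.csv("triangle_W_M.csv", row.names = 1)   # years in rows, delays in columns
my_triangle2 <- as.triangle(as.matrix(my_triangle_csv))          # matrix method of as.triangle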
We inspect the triangle:

dev
origin 0 1 2 3 4 5 6 7 8 9
2004 5947 9668 10564 10772 10978 11041 11106 11121 11132 11148
2005 6347 9593 10316 10468 10536 10573 10625 10637 10648 NA
2006 6269 9245 10092 10355 10508 10573 10627 10636 NA NA
2007 5863 8546 9269 9459 9592 9681 9724 NA NA NA
2008 5779 8524 9178 9451 9682 9787 NA NA NA NA
2009 6185 9013 9586 9831 9936 NA NA NA NA NA
2010 5600 8493 9057 9282 NA NA NA NA NA NA
2011 5288 7728 8256 NA NA NA NA NA NA NA
2012 5291 7649 NA NA NA NA NA NA NA NA
2013 5676 NA NA NA NA NA NA NA NA NA

From Cumulative to Incremental, and vice versa

The R functions cum2incr() and incr2cum() enable us to switch from cumula-


tive to incremental displays, and vice versa, in an easy way.
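For instance (the division by 1,000 again matches the rounded display below and is an assumption made here):

my_triangle_incr <- cum2incr(my_triangle)
round(my_triangle_incr / 1000)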

dev
origin 0 1 2 3 4 5 6 7 8 9
2004 5947 3721 896 208 207 62 66 15 11 16
2005 6347 3246 723 152 68 37 53 11 12 NA
2006 6269 2976 847 263 153 65 54 9 NA NA
2007 5863 2683 723 191 133 88 43 NA NA NA
2008 5779 2745 654 273 230 105 NA NA NA NA
2009 6185 2828 573 245 105 NA NA NA NA NA
2010 5600 2893 563 226 NA NA NA NA NA NA
2011 5288 2440 528 NA NA NA NA NA NA NA
2012 5291 2358 NA NA NA NA NA NA NA NA
2013 5676 NA NA NA NA NA NA NA NA NA

We recognize the incremental triangle from Figure 11.6.

Visualizing Triangles

To explore the evolution of the cumulative payments per occurrence year, Fig-
ure 11.9 shows my_triangle using the plot function available for objects of
type triangle in the ChainLadder package. Each line in this plot depicts an
occurrence year (from 2004 to 2013, labelled as 1 to 10). Development periods
are labelled from 1 to 10 (instead of 0 to 9, as used above).
plot(my_triangle)

Alternatively, the lattice argument creates one plot per occurrence year.
plot(my_triangle, lattice = TRUE)

Figure 11.9: Claim Development by Occurrence Year (cumulative payments against development period, one line per occurrence year, labelled 1 for 2004 through 10 for 2013).

Figure: The lattice version shows the same cumulative development in a separate panel for each occurrence year, 2004 through 2013.

Instead of plotting the cumulative triangle stored in my_triangle, we can plot


the incremental run-off triangle.
plot(my_triangle_incr)

Figure: Incremental claim development by occurrence year (incremental payment amounts against development period).

plot(my_triangle_incr, lattice = TRUE)

Figure: Lattice display of the incremental claim development, one panel per occurrence year (2004 through 2013).

11.3 The Chain-Ladder Method


The most widely used method to estimate outstanding loss reserves is the so-
called chain-ladder method. The origins of this method are obscure, but it was
firmly entrenched in practical applications by the early 1970s (Taylor, 1986).
As will be seen, the name refers to the chaining of a sequence of (year-to-year
development) factors into a ladder of factors; immature losses climb toward
maturity when multiplied by this concatenation of ratios, hence the apt descrip-
tor chain-ladder method. We will start with exploring the chain-ladder method
in its deterministic or algorithmic version, hence without making any stochas-
tic assumptions. Then we will describe Mack’s distribution-free chain-ladder


model.

11.3.1 The Deterministic Chain-Ladder


The deterministic chain-ladder method focuses on the run-off triangle in cumu-
lative form. Recall that a cell (𝑖, 𝑗) in this triangle displays the cumulative
amount paid up until development period 𝑗 for claims that occurred in year 𝑖.
The chain-ladder method assumes that development factors 𝑓𝑗 (also called
age-to-age factors, link ratios or chain-ladder factors) exist such that

𝐶𝑖,𝑗+1 = 𝑓𝑗 × 𝐶𝑖,𝑗 .

Thus, the development factor tells you how the cumulative amount in develop-
ment year 𝑗 grows to the cumulative amount in year 𝑗 + 1. We highlight the
cumulative amount in period 0 in blue and the cumulative amount in period 1
in red on the Figure 11.10 taken from Wüthrich and Merz (2008) (Table 2.2,
also used in Wüthrich and Merz (2015), Table 1.4).
accident payment delay (in years)
year 0 1 2 3 4 5 6 7 8 9
1 5,947 9,668 10,564 10,772 10,978 11,041 11,106 11,121 11,132 11,148
2 6,347 9,593 10,316 10,468 10,536 10,573 10,625 10,637 10,648
3 6,269 9,245 10,092 10,355 10,508 10,573 10,627 10,636
4 5,863 8,546 9,269 9,459 9,592 9,681 9,724
5 5,779 8,524 9,178 9,451 9,682 9,787
6 6,185 9,013 9,586 9,831 9,936
7 5,600 8,493 9,057 9,282
8 5,288 7,728 8,256
9 5,291 7,649
10 5,676

Figure 11.10: A Run-off Triangle with Cumulative Payment Data High-


lighting the Cumulative Amount in Period 0 in Blue and the Cumu-
lative Amount in Period 1 in Red. Source: Wüthrich and Merz (2008),
Table 2.2.

The chain-ladder method then presents an intuitive recipe to estimate or calcu-


late these development factors. Since the first development factor 𝑓0 describes
the development of the cumulative claim amount from development period 0
to development period 1, it can be estimated as the ratio of the cumulative
amounts in red and the cumulative amounts in blue, highlighted in the Figure
11.10. We then obtain the following estimate 𝑓0𝐶𝐿 ̂ for the first development
factor 𝑓0 , given observations 𝒟𝐼 :

f̂_0^{CL} = ∑_{i=1}^{10−0−1} C_{i,0+1} / ∑_{i=1}^{10−0−1} C_{i0} = 1.4925.

Note that the index 𝑖, used in the sums in the numerator and denominator, runs
from the first occurrence period (1) to the last occurrence period (9) for which
both development periods 0 and 1 are observed. As such, this development
factor measures how the data in blue grow to the data in red, averaged across
all occurrence periods for which both periods are observed. The chain-ladder
method then uses this development factor estimator to predict the cumulative
amount 𝐶10,1 (i.e. the cumulative amount paid up until and including develop-
ment year 1 for accidents that occurred in year 10). This prediction is obtained
by multiplying the most recent observed cumulative claim amount for occurrence
period 10 (i.e. C_{10,0} with development period 0) with the estimated development
factor f̂_0^{CL}:

Ĉ_{10,1} = C_{10,0} ⋅ f̂_0^{CL} = 5,676 ⋅ 1.4925 = 8,471.
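This hand calculation can be reproduced directly from the cumulative triangle of Section 11.2.4 (a sketch; recall that my_triangle stores the full, unrounded payment amounts):

f0_hat <- sum(my_triangle[1:9, 2]) / sum(my_triangle[1:9, 1])
f0_hat                        # approximately 1.4925
my_triangle[10, 1] * f0_hat   # predicted C_{10,1}, about 8,471 thousand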

Going forward with this reasoning, the next development factor 𝑓1 can be es-
timated. Since 𝑓1 captures the development from period 1 to period 2, it can
be estimated as the ratio of the numbers in red and the numbers in blue as
highlighted in Figure 11.11.

accident payment delay (in years)


year 0 1 2 3 4 5 6 7 8 9
1 5,947 9,668 10,564 10,772 10,978 11,041 11,106 11,121 11,132 11,148
2 6,347 9,593 10,316 10,468 10,536 10,573 10,625 10,637 10,648
3 6,269 9,245 10,092 10,355 10,508 10,573 10,627 10,636
4 5,863 8,546 9,269 9,459 9,592 9,681 9,724
5 5,779 8,524 9,178 9,451 9,682 9,787
6 6,185 9,013 9,586 9,831 9,936
7 5,600 8,493 9,057 9,282
8 5,288 7,728 8,256
9 5,291 7,649
10 5,676

Figure 11.11: A Run-off Triangle with Cumulative Payment Data Highlighting the Cumulative Amount in Period 1 in Blue and the Cumulative Amount in Period 2 in Red. Source: Wüthrich and Merz (2008), Table 2.2.

The mathematical notation of the estimate f̂_1^{CL} for the next development factor
f_1, given observations 𝒟_I, equals:

f̂_1^{CL} = ∑_{i=1}^{10−1−1} C_{i,1+1} / ∑_{i=1}^{10−1−1} C_{i1} = 1.0778.

Consequently, this factor measures how the cumulative paid amount in devel-
opment period 1 grows to period 2, averaged across all occurrence periods for
which both periods are observed. The index 𝑖 now runs from period 1 to 8, since
these are the occurrence periods for which both development periods 1 and 2
are observed. This estimate for the second development factor is then used to
predict the missing, unobserved cells in development period 2:

Ĉ_{10,2} = C_{10,0} ⋅ f̂_0^{CL} ⋅ f̂_1^{CL} = Ĉ_{10,1} ⋅ f̂_1^{CL} = 8,471 ⋅ 1.0778 = 9,130
Ĉ_{9,2} = C_{9,1} ⋅ f̂_1^{CL} = 7,649 ⋅ 1.0778 = 8,244.
Note that for Ĉ_{10,2} you actually use the estimate Ĉ_{10,1} and multiply it with the
estimated development factor f̂_1^{CL}.

We continue analogously and obtain the following predictions, printed in italics in
Figure 11.12:

accident payment delay (in years)


year 0 1 2 3 4 5 6 7 8 9
1 5,947 9,668 10,564 10,772 10,978 11,041 11,106 11,121 11,132 11,148
2 6,347 9,593 10,316 10,468 10,536 10,573 10,625 10,637 10,648
3 6,269 9,245 10,092 10,355 10,508 10,573 10,627 10,636 10,647
4 5,863 8,546 9,269 9,459 9,592 9,681 9,724 9,735 9,745
5 5,779 8,524 9,178 9,451 9,682 9,787 9,837 9,848 9,858
6 6,185 9,013 9,586 9,831 9,936 10,005 10,057 10,067 10,078
7 5,600 8,493 9,057 9,282 9,420 9,485 9,534 9,545 9,555
8 5,288 7,728 8,256 8,445 8,570 8,630 8,675 8,684 8,693
9 5,291 7,649 8,243 8,432 8,557 8,617 8,661 8,671 8,680
10 5,676 8,471 9,130 9,339 9,477 9,543 9,592 9,603 9,613
fˆCL 1.493 1.078 1.023 1.015 1.007 1.005 1.001 1.001

Figure 11.12: A Run-off Triangle with Cumulative Payment Data In-


cluding Predictions in Italic Source: Wüthrich and Merz (2008), Table 2.2.

Eventually we need to estimate the values in the final column. The last develop-
ment factor 𝑓8 measures the growth from development period 8 to development
period 9 in the triangle. Since only the first row in the triangle has both cells
observed, this last factor is estimated as the ratio of the value in red and the
value in blue in Figure 11.13.

accident payment delay (in years)


year 0 1 2 3 4 5 6 7 8 9
1 5,947 9,668 10,564 10,772 10,978 11,041 11,106 11,121 11,132 11,148
2 6,347 9,593 10,316 10,468 10,536 10,573 10,625 10,637 10,648
3 6,269 9,245 10,092 10,355 10,508 10,573 10,627 10,636 10,647
4 5,863 8,546 9,269 9,459 9,592 9,681 9,724 9,735 9,745
5 5,779 8,524 9,178 9,451 9,682 9,787 9,837 9,848 9,858
6 6,185 9,013 9,586 9,831 9,936 10,005 10,057 10,067 10,078
7 5,600 8,493 9,057 9,282 9,420 9,485 9,534 9,545 9,555
8 5,288 7,728 8,256 8,445 8,570 8,630 8,675 8,684 8,693
9 5,291 7,649 8,243 8,432 8,557 8,617 8,661 8,671 8,680
10 5,676 8,471 9,130 9,339 9,477 9,543 9,592 9,603 9,613
fˆCL 1.493 1.078 1.023 1.015 1.007 1.005 1.001 1.001

Figure 11.13: A Run-off Triangle with Cumulative Payment Data Highlighting the Cumulative Amount in Period 8 in Blue and the Cumulative Amount in Period 9 in Red. Source: Wüthrich and Merz (2008), Table 2.2.

Given observations 𝒟_I, this factor estimate f̂_8^{CL} is equal to:

f̂_8^{CL} = ∑_{i=1}^{10−8−1} C_{i,8+1} / ∑_{i=1}^{10−8−1} C_{i8} = 1.001.

Typically this last development factor is close to 1 and hence the cash flows
paid in the final development period are minor. Using this development factor
estimate, we can now estimate the remaining cumulative claim amounts in the
column by multiplying the values for development year 8 with this factor.

The general math notation for the chain ladder predictions for the lower triangle
(i + j > I) is as follows:

Ĉ_{ij}^{CL} = C_{i,I−i} ⋅ ∏_{l=I−i}^{j−1} f̂_l^{CL}
f̂_j^{CL} = ∑_{i=1}^{I−j−1} C_{i,j+1} / ∑_{i=1}^{I−j−1} C_{ij} ,

where 𝐶𝑖,𝐼−𝑖 is on the last observed diagonal. It is clear that an important


assumption of the chain-ladder method is that the proportional developments
of claims from one development period to the next are similar for all occurrence
years.
This yields the following Figure 11.14:
accident payment delay (in years)
year 0 1 2 3 4 5 6 7 8 9
1 5,947 9,668 10,564 10,772 10,978 11,041 11,106 11,121 11,132 11,148
2 6,347 9,593 10,316 10,468 10,536 10,573 10,625 10,637 10,648 10,663
3 6,269 9,245 10,092 10,355 10,508 10,573 10,627 10,636 10,647 10,662
4 5,863 8,546 9,269 9,459 9,592 9,681 9,724 9,735 9,745 9,759
5 5,779 8,524 9,178 9,451 9,682 9,787 9,837 9,848 9,858 9,872
6 6,185 9,013 9,586 9,831 9,936 10,005 10,057 10,067 10,078 10,092
7 5,600 8,493 9,057 9,282 9,420 9,485 9,534 9,545 9,555 9,568
8 5,288 7,728 8,256 8,445 8,570 8,630 8,675 8,684 8,693 8,705
9 5,291 7,649 8,243 8,432 8,557 8,617 8,661 8,671 8,680 8,692
10 5,676 8,471 9,130 9,339 9,477 9,543 9,592 9,603 9,613 9,626
ˆ
f CL 1.493 1.078 1.023 1.015 1.007 1.005 1.001 1.001 1.001

Figure 11.14: A Run-off Triangle with Cumulative Payment Data In-


cluding Predictions in Italic Source: Wüthrich and Merz (2008), Table 2.2.

The numbers in the last column show the estimates for the ultimate claim
amounts. The estimate for the outstanding claim amount ℛ̂𝐶𝐿 𝑖 for a particular
occurrence period 𝑖 = 𝐼 − 𝐽 + 1, … , 𝐼 is then given by the difference between
the ultimate claim amount and the cumulative amount as observed on the most
recent diagonal:

ℛ̂_i^{CL} = Ĉ_{iJ}^{CL} − C_{i,I−i} .

This is the chain-ladder estimate for the reserve necessary to fulfill future liabil-
ities with respect to claims that occurred in this particular occurrence period.
These reserves per occurrence period and for the total summed over all occur-
rence periods are summarized in Figure 11.15.
i        C_{i,I−i}      Dev.To.Date      Ĉ_{iJ}^{CL}      ℛ̂_i^{CL}
1 11,148,123 1.000 11,148,123 0
2 10,648,192 0.999 10,663,317 15,125
3 10,635,750 0.998 10,662,007 26,257
4 9,724,069 0.996 9,758,607 34,538
5 9,786,915 0.991 9,872,216 85,301
6 9,935,752 0.984 10,092,245 156,493
7 9,282,022 0.970 9,568,142 286,120
8 8,256,212 0.948 8,705,378 449,166
9 7,648,729 0.880 8,691,971 1,043,242
10 5,675,568 0.590 9,626,383 3,950,815
totals 92,741,332.00 0.94 98,788,390.50 6,047,058.50

Figure 11.15: Reserves per Occurrence Period and for the Total



11.3.2 Mack’s Distribution-Free Chain-Ladder Model


At this stage, the traditional chain-ladder method provides a point estimator
Ĉ_{iJ}^{CL} for the forecast of C_{iJ}, using the information 𝒟_I. Since the chain-ladder
method is a purely deterministic and intuitively natural algorithm to complete
a run-off triangle, we are not able to determine how reliable that point estimator
is or to model the variation of the future payments. To answer such questions an
underlying stochastic model that reproduces the chain-ladder reserve estimates
is needed.

In this section we will focus on the distribution-free chain-ladder model as an


underlying stochastic model, introduced in Mack (1993). This method allows
us to estimate the standard errors of the chain-ladder predictions. In the next
Section 11.4, generalized linear models are used to develop a fully stochastic
approach for predicting the outstanding reserve.

In Mack’s approach the following conditions (without assuming a distribution)


hold:

• Cumulative claims (𝐶𝑖𝑗 )𝑗=0,…,𝐽 are independent over different occurrence


periods 𝑖.

• There exist fixed constants f_0, … , f_{J−1} and σ_0^2, … , σ_{J−1}^2 such that for all
i = 1, … , I and j = 0, … , J − 1:

E[C_{i,j+1} | C_{i0} , … , C_{ij} ] = f_j ⋅ C_{ij}
Var(C_{i,j+1} | C_{ij}) = σ_j^2 ⋅ C_{ij} .

This means that the cumulative claims (𝐶𝑖𝑗 )𝑗=0,…,𝐽 are Markov processes (in
the development periods 𝑗) and hence the future only depends on the present.

Under these assumptions, the expected value of the ultimate claim amount 𝐶𝑖,𝐽 ,
given the available data in the upper triangle, is the cumulative amount on the
most recent diagonal (𝐶𝑖,𝐼−1 ) multiplied with appropriate development factors
𝑓𝑗 . In mathematical notation we obtain for known development factors 𝑓𝑗 and
observations 𝒟𝐼 :

E[C_{iJ} | 𝒟_I ] = C_{i,I−i} ∏_{j=I−i}^{J−1} f_j .

This is exactly what the deterministic chain-ladder method does, as explained


in Section 11.3.1. In practice, the development factors are not known and need
to be estimated from the data that is available in the upper triangle. In Mack’s
approach we obtain exactly the same expression for estimating the development
factors 𝑓𝑗 at time 𝐼 as in the deterministic chain-ladder algorithm:

f̂_j^{CL} = ∑_{i=1}^{I−j−1} C_{i,j+1} / ∑_{i=1}^{I−j−1} C_{ij} .
The predictions for the cells in the lower triangle (i.e. for cells C_{i,j} where
i + j > I) are then obtained by replacing the unknown factors f_j by their
corresponding estimates f̂_j^{CL}:

Ĉ_{ij}^{CL} = C_{i,I−i} ∏_{l=I−i}^{j−1} f̂_l^{CL} .

To quantify the prediction error that comes with the chain-ladder predictions,
Mack also introduced variance parameters 𝜎𝑗2 . To gain insight in the estimation
of these variance parameters, so-called individual development factors 𝑓𝑖,𝑗 are
introduced (which are specific to occurrence period 𝑖):

f_{i,j} = C_{i,j+1} / C_{ij} .
These individual development factors also describe how the cumulative amount grows
from period 𝑗 to period 𝑗 + 1, but they consider the ratio of only two cells
(instead of taking the ratio of two sums over all available occurrence periods).
Note that the development factors can be written as a weighted average of
individual development factors:

f̂_j^{CL} = ∑_{i=1}^{I−j−1} ( C_{ij} / ∑_{n=1}^{I−j−1} C_{nj} ) f_{i,j} ,
where the weights are equal to the cumulative claims 𝐶𝑖𝑗 .
Let us now estimate the variance parameters σ_j^2 by writing Mack's variance
assumption in equivalent ways. First, the variance of the ratio of C_{i,j+1} and C_{i,j}
conditional on C_{i,0} , … , C_{i,j} is proportional to the inverse of C_{i,j}:

Var[ C_{i,j+1}/C_{ij} | C_{i0} , … , C_{ij} ] ∝ 1/C_{ij} .

This reminds us of a typical weighted least squares setting where the weights
are the inverse of the variability of a response. Therefore, a more volatile or
imprecise response variable will get less weight. The 𝐶𝑖,𝑗 play the role of the
weights. Using the unknown variance parameter 𝜎𝑗2 this variance assumption
can be written as:

Var[C_{i,j+1} | C_{i0} , … , C_{ij} ] = σ_j^2 ⋅ C_{ij} .



The connection with weighted least squares then directly leads to an unbiased
estimate for the unknown variance parameter 𝜎𝑗2 in the form of a weighted
residual sum of squares:

σ̂_j^2 = (1/(I − j − 2)) ∑_{i=1}^{I−j−1} C_{ij} ( C_{i,j+1}/C_{ij} − f̂_j^{CL} )^2 .

The weights are again equal to 𝐶𝑖,𝑗 and the residuals are the differences between
the ratios 𝐶𝑖,𝑗+1 /𝐶𝑖,𝑗 and the individual development factors.
We now have all ingredients required to calibrate the distribution-free chain-
ladder model to the data. The next step is then to analyze the prediction
uncertainty and the prediction error. Hereto we use the chain-ladder predictor
where we replace the unknown development factors with their estimators:

Ĉ_{iJ}^{CL} = C_{i,I−i} ∏_{l=I−i}^{J−1} f̂_l^{CL}

We use this expression either as an estimator for the conditional expectation of


the ultimate claim amount (given the observed upper triangle) or as a predictor
for the ultimate claim amount as a random variable (given the observed upper
triangle).
In statistics the simplest measure to analyze the uncertainty that comes with a
point estimate or prediction is the Mean Squared Error of Prediction (MSEP).
Here we consider a conditional MSEP, conditional on the data observed in the
upper triangle:

MSEP_{C_{iJ} | 𝒟_I}(Ĉ_{iJ}^{CL}) = E[ (C_{iJ} − Ĉ_{iJ}^{CL})^2 | 𝒟_I ] .

This conditional MSEP measures:


• the distance between the (true) ultimate claim C_{iJ} and its chain-ladder
predictor Ĉ_{iJ}^{CL} at time I, and
• the total prediction uncertainty over the entire run-off of the nominal
ultimate claim 𝐶𝑖𝐽 . It does not consider time value of money, a risk
margin nor any dynamics in claim development.
The MSEP that comes with the estimate for the ultimate cumulative claim
amount is equal to the MSEP that measures the squared distance between the
true and the estimated reserve:

MSEP_{ℛ_i^I | 𝒟_I}(ℛ̂_i^I) = E[(ℛ̂_i^I − ℛ_i^I)^2 | 𝒟_I ]
= E[(Ĉ_{iJ}^{CL} − C_{iJ})^2 | 𝒟_I ] = MSEP_{C_{iJ} | 𝒟_I}(Ĉ_{iJ}^{CL}).

The reason for this equivalence is the fact that the reserve is the ultimate claim
amount minus the most recently observed claim amount. The latter is observed
and used in both ℛ𝐼𝑖 and ℛ̂𝐼𝑖 .
It is interesting to decompose this MSEP into a component that captures process
variance and a component that captures parameter estimation variance:

MSEP_{C_{iJ} | 𝒟_I}(Ĉ_{iJ}) = E[ (C_{iJ} − Ĉ_{iJ})^2 | 𝒟_I ]
= Var(C_{iJ} | 𝒟_I) + ( E[C_{iJ} | 𝒟_I ] − Ĉ_{iJ} )^2
= process variance + parameter estimation variance,

for a 𝒟𝐼 measurable estimator/predictor 𝐶𝑖𝐽 ̂ . The process variance component


captures the volatility or uncertainty in the random variable 𝐶𝑖,𝐽 and the pa-
rameter estimation variance measures the error that arises from replacing the
unknown development factors 𝑓𝑗 with their estimated values. This result follows
immediately from following equality about the variance of a shifted random vari-
able 𝑋 where the shift 𝑎 is deterministic:

E(X − a)^2 = Var(X) + [E(X) − a]^2 .

Applied to the expression of the MSEP, you treat Ĉ_{i,J} as fixed because you work
conditionally on the data in the upper triangle and Ĉ_{i,J} only uses information
from this upper triangle.
Mack (1993) then derived the important formula for the conditional MSEP in
the distribution-free chain-ladder model for a single occurrence period 𝑖:

M̂SEP_{C_{iJ} | 𝒟_I}(Ĉ_{iJ}^{CL}) = (Ĉ_{iJ}^{CL})^2 ∑_{j=I−i}^{J−1} [ (σ̂_j^2 / (f̂_j^{CL})^2) ( 1/Ĉ_{ij}^{CL} + 1/∑_{n=1}^{I−j−1} C_{nj} ) ] .

For the derivation of this popular formula, we refer to his paper. Note that it is
an estimate of the MSEP since the unknown parameters 𝑓𝑗 and 𝜎𝑗 need to be
estimated as the estimation error cannot be calculated explicitly.
Mack also derived a formula for the MSEP for the total reserve, across all
occurrence periods:

M̂SEP_{∑_i C_{iJ} | 𝒟_I}(∑_{i=1}^{I} Ĉ_{iJ}^{CL})
= ∑_{i=1}^{I} M̂SEP_{C_{iJ} | 𝒟_I}(Ĉ_{iJ}^{CL}) + 2 ∑_{1≤i<k≤I} Ĉ_{iJ}^{CL} Ĉ_{kJ}^{CL} ∑_{j=I−i}^{J−1} ( σ̂_j^2/(f̂_j^{CL})^2 ) / ( ∑_{n=1}^{I−j−1} C_{nj} ) .

The result is the sum of the MSEPs per occurrence period plus a covariance
term. This covariance term is added because the MSEPs for different occurrence
periods i use the same parameter estimates f̂_j^{CL} of f_j for different accident years i.

11.3.3 R code for Chain-Ladder Predictions


We use the object my_triangle of type triangle that was created in Sec-
tion 11.2.4. The distribution-free chain-ladder model of Mack (1993) is im-
plemented in the ChainLadder package (Gesmann et al., 2019) (as a special
form of weighted least squares) and can be applied on the data my_triangle to
predict outstanding claim amounts and to estimate the standard error around
those forecasts.
CL <- MackChainLadder(my_triangle)
CL

MackChainLadder(Triangle = my_triangle)

Latest Dev.To.Date Ultimate IBNR Mack.S.E CV(IBNR)


2004 11,148,124 1.000 11,148,124 0 0 NaN
2005 10,648,192 0.999 10,663,318 15,126 716 0.0474
2006 10,635,751 0.998 10,662,008 26,257 1,131 0.0431
2007 9,724,068 0.996 9,758,606 34,538 3,121 0.0904
2008 9,786,916 0.991 9,872,218 85,302 7,654 0.0897
2009 9,935,753 0.984 10,092,247 156,494 33,347 0.2131
2010 9,282,022 0.970 9,568,143 286,121 73,469 0.2568
2011 8,256,211 0.948 8,705,378 449,167 85,400 0.1901
2012 7,648,729 0.880 8,691,971 1,043,242 134,338 0.1288
2013 5,675,568 0.590 9,626,383 3,950,815 410,818 0.1040

Totals
Latest: 92,741,334.00
Dev: 0.94
Ultimate: 98,788,397.77
IBNR: 6,047,063.77
Mack.S.E 462,977.83
CV(IBNR): 0.08
round(summary(CL)$Totals)

Totals
Latest: 92741334
Dev: 1
Ultimate: 98788398
IBNR: 6047064
Mack S.E.: 462978
CV(IBNR): 0

The development factors are obtained as follows:


round(CL$f,digits = 4)

[1] 1.4925 1.0778 1.0229 1.0148 1.0070 1.0051 1.0011 1.0010 1.0014 1.0000

We can also print the complete run-off triangle (including predictions).


CL$FullTriangle

dev
origin 0 1 2 3 4 5 6 7
2004 5946975 9668212 10563929 10771690 10978394 11040518 11106331 11121181
2005 6346756 9593162 10316383 10468180 10536004 10572608 10625360 10636546
2006 6269090 9245313 10092366 10355134 10507837 10573282 10626827 10635751
2007 5863015 8546239 9268771 9459424 9592399 9680740 9724068 9734574
2008 5778885 8524114 9178009 9451404 9681692 9786916 9837277 9847905
2009 6184793 9013132 9585897 9830796 9935753 10005044 10056528 10067393
2010 5600184 8493391 9056505 9282022 9419776 9485469 9534279 9544579
2011 5288066 7728169 8256211 8445057 8570389 8630159 8674567 8683939
2012 5290793 7648729 8243496 8432051 8557190 8616868 8661208 8670566
2013 5675568 8470989 9129696 9338521 9477113 9543206 9592313 9602676
dev
origin 8 9
2004 11132310 11148124
2005 10648192 10663318
2006 10646884 10662008
2007 9744764 9758606
2008 9858214 9872218
2009 10077931 10092247
2010 9554570 9568143
2011 8693029 8705378
2012 8679642 8691971
2013 9612728 9626383

The MSEP for the total reserve across all occurrence periods is given by:
CL$Total.Mack.S.E^2

9
214348469061

It is strongly advised to validate Mack’s assumptions by checking that there


are no trends in the residual plots. The last four plots that we obtain with
the following command show respectively the standardized residuals versus the
fitted values, the origin period, the calendar period and the development period.

plot(CL)

Figure: Output of plot(CL), consisting of six diagnostic panels (Mack chain ladder results by origin period, chain ladder developments by origin period, and standardized residual plots).

The top left-hand plot is a bar-chart of the latest claims position plus IBNR and
Mack’s standard error by occurrence period. The top right-hand plot shows the
forecasted development patterns for all occurrence periods (starting with 1 for
the oldest occurrence period).

When setting the argument lattice=TRUE we obtain a plot of the development,


including the prediction and estimated standard errors by occurrence period:
plot(CL, lattice=TRUE)

Figure: Chain ladder developments by origin period (lattice display), showing for each occurrence year (2004 to 2013) the chain ladder development together with Mack's standard error.

11.4 GLMs and Bootstrap for Loss Reserves

This section is being written and is not yet complete nor edited. It
is here to give you a flavor of what will be in the final version.

This section covers regression models to analyze run-off triangles. When analyz-
ing the data in a run-off triangle with a regression model, the standard toolbox
for model building, estimation and prediction becomes available. Using these
tools we are able to go beyond the point estimate and standard error as derived
in Section 11.3. More specifically, we build a generalized linear model (GLM) for
the incremental payments 𝑋𝑖𝑗 in Figure 11.6. Whereas the chain-ladder method
works with cumulative data, typical GLMs assume the response variables to be
independent and therefore work with incremental run-off triangles.

11.4.1 Model Specification


Let 𝑋𝑖𝑗 denote the incremental payment in cell (𝑖, 𝑗) of the run-off triangle.
We assume the 𝑋𝑖𝑗 s to be independent with a density 𝑓(𝑥𝑖𝑗 ; 𝜃𝑖𝑗 , 𝜙) from the
exponential family of distributions. We identify
• 𝜇𝑖𝑗 = 𝐸[𝑋𝑖𝑗 ] the expected value of cell 𝑋𝑖𝑗
• 𝜙 the dispersion parameter and Var[𝑋𝑖𝑗 ] = 𝜙 ⋅ 𝑉 (𝜇𝑖𝑗 ), where 𝑉 (.) is the


variance function
• 𝜂𝑖𝑗 the linear predictor such that 𝜂𝑖𝑗 = 𝑔(𝜇𝑖𝑗 ) with 𝑔 the link function.
Distributions from the exponential family and their default link functions are
listed on http://stat.ethz.ch/R-manual/R-patched/library/stats/html/family.html.
We now discuss three specific GLMs widely used for loss reserving.
First, the Poisson regression model was introduced in Section 8.2. In this model,
we assume that 𝑋𝑖𝑗 has a Poisson distribution with parameter

𝜇𝑖𝑗 = 𝜋𝑖 ⋅ 𝛾𝑗 ,

a cross-classified structure that captures a multiplicative effect of the occurrence


year 𝑖 and the development period 𝑗. The proposed model structure is not
identifiable without an additional constraint on the parameters, e.g. ∑_{j=0}^{J} γ_j = 1.
This constraint gives an explicit interpretation to 𝜋𝑖 (with 𝑖 = 1, … , 𝐼) as the
exposure or volume measure for occurrence year 𝑖 and 𝛾𝑗 as the fraction of
the total volume paid out with delay 𝑗. However, when calibrating GLMs in R
alternative constraints such as 𝜋1 = 1 or 𝛾1 = 1, or a reparametrization where
𝜇𝑖𝑗 = exp (𝜇 + 𝛼𝑖 + 𝛽𝑗 ) are easier to implement. We continue with the latter
specification, including 𝛼1 = 𝛽0 = 0, the so-called corner constraints. This GLM
treats the occurrence year and the payment delay as factor variables and fits a
parameter per level, next to an intercept 𝜇. The corner constraints put the effect
of the first level of a factor variable equal to zero. The Poisson assumption is
particularly useful for a run-off triangle with numbers of reported claims, often
used in the estimation of the number of IBNR claims (see Section 11.2).
Second, an interesting modification of the basic Poisson regression model is the
over-dispersed Poisson regression model where 𝑍𝑖𝑗 has a Poisson distribution
with parameter 𝜇𝑖𝑗 /𝜙 and

X_{ij} = ϕ ⋅ Z_{ij}
μ_{ij} = exp (μ + α_i + β_j ).

Consequently, 𝑋𝑖𝑗 has the same specification for the mean as in the basic Poisson
regression model, but now

Var[𝑋𝑖𝑗 ] = 𝜙2 ⋅ Var[𝑍𝑖𝑗 ] = 𝜙 ⋅ exp (𝜇 + 𝛼𝑖 + 𝛽𝑗 ).

This construction allows for under-dispersion (when 𝜙 < 1) and over-dispersion (when 𝜙 > 1). Because 𝑋𝑖𝑗 no longer follows a well-known distribution, this approach
is referred to as quasi-likelihood. It is particularly useful to model a run-off
triangle with incremental payments, as these typically reveal over-dispersion.

Third, the gamma regression model is relevant to model a run-off triangle with
claim payments. Recall from Section 3.2.1 (see also the Appendix Chapter
18) that the gamma distribution has shape parameter 𝛼 and scale parameter 𝜃.
From these, we reparameterize and define a new parameter 𝜇 = 𝛼⋅𝜃 while retain-
ing the scale parameter 𝜃. Further, assume that 𝑋𝑖𝑗 has a gamma distribution
and allow 𝜇 to vary by 𝑖𝑗 such that

𝜇𝑖𝑗 = exp (𝜇 + 𝛼𝑖 + 𝛽𝑗 ).

11.4.2 Model Estimation and Prediction


We now estimate the regression parameters 𝜇, 𝛼𝑖 and 𝛽𝑗 in the proposed GLMs.
In R the glm function is readily available to estimate these parameters via max-
imum likelihood estimation (mle) or quasi-likelihood estimation (in the case
of the over-dispersed Poisson). Having the parameter estimates 𝜇,̂ 𝛼𝑖̂ and 𝛽𝑗̂
available, a point estimate for each cell in the upper triangle follows

𝑋̂ 𝑖𝑗 = 𝐸[𝑋̂ 𝑖𝑗 ] = exp (𝜇̂ + 𝛼𝑖̂ + 𝛽𝑗̂ ), with 𝑖 + 𝑗 ≤ 𝐼.


Similarly, a cell in the lower triangle will be predicted as

𝑋̂ 𝑖𝑗 = 𝐸[𝑋̂ 𝑖𝑗 ] = exp (𝜇̂ + 𝛼𝑖̂ + 𝛽𝑗̂ ), with 𝑖 + 𝑗 > 𝐼.

Point estimates for outstanding reserves (per occurrence year 𝑖 or the total
reserve) then follow by summing the cell-specific estimates. By combining the
observations in the upper triangle with their point estimates, we can construct
properly defined residuals and use these for residual inspection.
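To fix ideas, a minimal sketch of this workflow in R is given below. It assumes a hypothetical data frame Xlong holding the upper triangle in long format, with columns origin (occurrence year 1, …, 10), dev (development period 0, …, 9) and X (incremental payment); the data frame, its column names and the 10 × 10 triangle size are our own illustrative choices, not part of the text.

# Sketch: calibrate an over-dispersed Poisson GLM for the incremental payments.
# Treating origin and dev as factors with treatment contrasts gives the corner
# constraints (first factor level set to zero) automatically.
fit_odp <- glm(X ~ factor(origin) + factor(dev),
               family = quasipoisson(link = "log"), data = Xlong)
# A gamma GLM for payment triangles could be fitted analogously with
# family = Gamma(link = "log").

# Predict the cells in the lower triangle and sum them into reserve estimates
lower <- expand.grid(origin = 1:10, dev = 0:9)
lower <- lower[lower$origin + lower$dev > 10, ]      # cells below the diagonal
lower$Xhat <- predict(fit_odp, newdata = lower, type = "response")
reserve_by_origin <- tapply(lower$Xhat, lower$origin, sum)
total_reserve <- sum(lower$Xhat)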

11.4.3 Bootstrap

11.5 Further Resources and Contributors


Contributors
• Katrien Antonio, KU Leuven and University of Amsterdam, Jan Beirlant, KU Leuven, and Tim Verdonck, University of Antwerp, are the principal authors of the initial version of this chapter. Email: katrien.an [email protected] for chapter comments and suggested improvements.

Further Readings and References


As displayed in Figure 11.1, similar timelines and visualizations are discussed
(among others) in Wüthrich and Merz (2008), Antonio and Plat (2014) and
Wüthrich and Merz (2015).

Over time actuaries started to think about possible underlying models and we
mention some important contributions:
• Kremer (1982): two-way ANOVA
• Kremer (1984), Mack (1991): Poisson model
• Mack (1993): distribution-free chain-ladder model
• Renshaw (1989); Renshaw and Verrall (1998): over-dispersed Poisson
model
• Gisler (2006); Gisler and Wüthrich (2008); Bühlmann et al. (2009):
Bayesian chain-ladder model.
The various stochastic models proposed in actuarial literature rely on different
assumptions and have different model properties, but have in common that they
provide exactly the chain-ladder reserve estimates. For more information we also
refer to Mack and Venter (2000) and to the lively discussion that was published
in ASTIN Bulletin: Journal of the International Actuarial Association in 2006
(Venter, 2006).
To read more about exponential families and generalized linear models, see, for example, McCullagh and Nelder (1989) and Wüthrich and Merz (2008). We refer to Kremer (1982), Renshaw and Verrall (1998) and England and Verrall (2002), and to the overviews in Taylor (2000), Wüthrich and Merz (2008) and Wüthrich and Merz (2015), for more details on the GLMs discussed here. XXX presents alternative distributional assumptions and specifications of the linear predictor.
Chapter 12

Experience Rating using Bonus-Malus

Chapter Preview. This chapter introduces the bonus-malus system used in motor insurance ratemaking. In particular, the chapter discusses the features of a bonus-malus system and studies its modelling and properties via basic statistical techniques. Section 12.1 introduces the use of a bonus-malus system as an experience rating scheme, followed by Section 12.2, which describes its practical implementation in several countries. Section 12.3 covers its modelling setup via a discrete time Markov Chain. Next, Section 12.4 studies a number of simple relevant properties associated with the stationary distribution of a bonus-malus system. Section 12.5 focuses on the determination of a posteriori premium rating to complement a priori ratemaking.
This chapter is being written and is not yet complete nor edited. It
is here to give you a flavor of what will be in the final version.

12.1 Introduction
The bonus-malus system, referred to interchangeably as “no-fault discount”, “merit rating”, “experience rating” or “no-claim discount” in different countries, is based on penalizing insureds who are responsible for one or more claims with a premium surcharge (malus), and rewarding insureds with a premium discount (bonus) if they do not have any claims. Insurers use bonus-malus systems for two main purposes: to encourage drivers to drive more carefully, and to ensure that insureds pay premiums proportional to their risks based on their claims experience via an experience rating mechanism.


The No Claim Discount (NCD) system is an experience rating system commonly used in motor insurance. It represents an attempt to categorize insureds into homogeneous groups who pay premiums based on their claims experience. Depending on the rules in the scheme, new policyholders may be required to pay the full premium initially and obtain discounts in future years as a result of claim-free years. An NCD system rewards policyholders for not making any claims during a year; in other words, it grants a bonus to a careful driver. This bonus principle may affect policyholders’ decisions whether to claim or not, especially for accidents with slight damage, which is known as the ‘hunger for bonus’ phenomenon. The ‘hunger for bonus’ under an NCD system may reduce insurers’ claim costs and may offset the expected decrease in premium income.

12.2 NCD System in Several Countries


12.2.1 NCD System in Malaysia
Before the liberalization of Motor Tariff on 1st July 2017, the rating of motor
insurance in Malaysia was governed by the Motor Tariff. Under the tariff, the
rate charged should not be lower than the rates specified under the classes of
risks, to ensure that the price competition among insurers will not go below
the country’s economic level. The basic rating factors considered were scope
of insurance, cubic capacity of vehicle and estimated value of vehicle (or sum
insured, whichever is lower). Under the Motor Tariff, the final premium to be
paid is adjusted by the policyholder’s claim experience, or equivalently, his NCD
entitlement.
Effective on 1st July 2017, the premium rates for motor insurance are liberalized,
or de-tariffed. The pricing of premium is now determined by individual insur-
ers and takaful operators, and the consumers are able to enjoy a wider choice
of motor insurance products at competitive prices. Since tariff liberalization
encourages innovation and competition among insurers and takaful operators,
the premiums are based on broader risk factors other than the two rating fac-
tors specified in the Motor Tariff, i.e. sum insured and cubic capacity of vehicle.
Other rating factors may be defined in the risk profile of an insured, such as
age of vehicle, age of driver, safety and security features of vehicle, geographi-
cal location of vehicle and traffic offences of driver. As different insurers and
takaful operators have different ways of defining the risk profile of an insured,
the price of a policy may differ from one insurer to another. However, the NCD
structure from the Motor Tariff remains ‘unchanged’, continues to exist, and is ‘transferable’ from one insurer, or from one takaful operator, to another.
The discounts in the Malaysian NCD system are divided into six classes, starting from the initial class of 0% discount, followed by classes of 25%, 30%, 38.33%, 45% and 55% discounts. Table 12.1 provides the classes of the NCD system in Malaysia.
A claim-free year indicates that a policyholder is entitled to move one-step
forward to the next discount class, such as from a 0% discount to a 25% discount
in the renewal year. If a policyholder is already at the highest class, which is
at a 55% discount, a claim-free year indicates that the policyholder remains in
the same class. On the other hand, if one or more claims are made within the
year, the NCD will be forfeited and the policyholder has to start at 0% discount
in the renewal year. This set of transition rules can also be summarized as a rule of -1/Top, that is, one class of bonus for each claim-free year, and a return to the 0% discount class (the highest premium level) after one or more claims. For illustration, Table 12.1 and Figure 12.1 respectively show the classes and the transition diagram for the Malaysian NCD system.

Table 12.1. Classes of NCD (Malaysia)

Classes (claim-free years)   Discounts (%)
0                            0
1                            25
2                            30
3                            38.33
4                            45
5 (and above)                55

Figure 12.1: Transition Diagram for NCD Classes (Malaysia)

12.2.2 NCD System in Other Countries


The NCD system in Brazil is subdivided into seven classes, with the following premium levels (Lemaire and Zi, 1994): 100, 90, 85, 80, 75, 70, and 65. These premium levels are equivalent to the following discount classes: 0%, 10%, 15%, 20%, 25%, 30% and 35%. New policyholders start at a 0% discount, or at the premium level of 100, and a claim-free year entitles a policyholder to move one class forward. If one or more claims are incurred within the year, the policyholder moves one class backward for each claim. Table 12.2 and Figure 12.2 respectively show the classes and the transition diagram for the NCD system in Brazil. This set of transition rules can also be summarized as a rule of -1/+1, that is, a class of bonus for a claim-free year, and a class of malus for each claim reported.

Table 12.2. Classes of NCD (Brazil)

Classes (claim-free years)   Discounts (%)
0                            0
1                            10
2                            15
3                            20
4                            25
5                            30
6 (and above)                35

Figure 12.2: Transition Diagram for NCD Classes (Brazil)

The NCD system in Switzerland is subdivided into twenty-two classes, with the following premium levels: 270, 250, 230, 215, 200, 185, 170, 155, 140, 130, 120, 110, 100, 90, 80, 75, 70, 65, 60, 55, 50 and 45 (Lemaire and Zi, 1994). These levels are equivalent to the following loadings (malus): 170%, 150%, 130%, 115%, 100%, 85%, 70%, 55%, 40%, 30%, 20%, and 10%, and the following discounts: 0%, 10%, 20%, 25%, 30%, 35%, 40%, 45%, 50% and 55%. New policyholders start at a 0% discount, or at the premium level of 100, and a claim-free year entitles a policyholder to move one class forward. If one or more claims are incurred within the year, the policyholder moves four classes backward for each claim. Table 12.3 and Figure 12.3 respectively show the classes and the transition diagram for the NCD system in Switzerland. This set of transition rules can be summarized as a rule of -1/+4.

Table 12.3. Classes of NCD (Switzerland)


Classes Loadings (%) Classes Discounts (%)


0 170 12 0
1 150 13 10
2 130 14 20
3 115 15 25
4 100 16 30
5 85 17 35
6 70 18 40
7 55 19 45
8 40 20 50
9 30 21 55
10 20
11 10

Figure 12.3: Transition Diagram for NCD Classes (Switzerland)

12.3 BMS and Markov Chain Model


A BMS can be represented by a discrete time Markov chain. A stochastic
process is said to possess the Markov property if the evolution of the process in
the future depends only on the present state but not the past. A discrete time
Markov Chain is a Markov process with discrete state space.

12.3.1 Transition Probability


A Markov Chain is determined by its transition probabilities. The transition
probability from state 𝑖 (at time 𝑛) to state 𝑗 (at time 𝑛+1) is called a one-step
transition probability, and is denoted by 𝑝𝑖𝑗 (𝑛, 𝑛 + 1) = 𝑃 𝑟(𝑋𝑛+1 = 𝑗|𝑋𝑛 = 𝑖),
𝑖 = 1, 2, … , 𝑘, 𝑗 = 1, 2, … , 𝑘. For general transition from time 𝑚 to time 𝑛,
for 𝑚 < 𝑛, by conditioning on 𝑋𝑜 for 𝑚 ≤ 𝑜 ≤ 𝑛, we have the Chapman-
Kolmogorov equation of

𝑝𝑖𝑗 (𝑚, 𝑛) = ∑𝑙∈𝑆 𝑝𝑖𝑙 (𝑚, 𝑜) 𝑝𝑙𝑗 (𝑜, 𝑛).

A time-homogeneous Markov Chain satisfies the property 𝑝𝑖𝑗 (𝑛, 𝑛 + 𝑡) = 𝑝𝑖𝑗^(𝑡) for all 𝑛. For instance, we have 𝑝𝑖𝑗 (𝑛, 𝑛 + 1) = 𝑝𝑖𝑗^(1) ≡ 𝑝𝑖𝑗 . In this case, the Chapman-Kolmogorov equation can be written as

𝑝𝑖𝑗 (0, 𝑚 + 𝑛) = ∑𝑙∈𝑆 𝑝𝑖𝑙 (0, 𝑚) 𝑝𝑙𝑗 (𝑚, 𝑚 + 𝑛) = ∑𝑙∈𝑆 𝑝𝑖𝑙^(𝑚) 𝑝𝑙𝑗^(𝑛).

In the context of BMS, the transition of the NCD classes is governed by the
transition probability in a given year. The transition of the NCD classes is also
a time-homogeneous Markov Chain since the set of transition rules is fixed and
independent of time. We can represent the one-step transition probabilities by a
𝑘×𝑘 transition matrix P = (𝑝𝑖𝑗 ) that corresponds to NCD classes 0, 1, 2, … , 𝑘−1.

$$
\mathbf{P} = \begin{pmatrix}
p_{00} & p_{01} & \cdots & p_{0,k-1} \\
p_{10} & p_{11} & \cdots & p_{1,k-1} \\
\vdots & \vdots & \ddots & \vdots \\
p_{k-1,0} & p_{k-1,1} & \cdots & p_{k-1,k-1}
\end{pmatrix}
$$

Here, the (𝑖, 𝑗)-th element of P is the transition probability from state 𝑖 to state 𝑗. In other words, each row of the transition matrix represents the transitions flowing out of a state, whereas each column represents the transitions flowing into a state. The transition probabilities flowing out of a state must sum to 1, that is, each row of the matrix must sum to 1, i.e. ∑𝑗 𝑝𝑖𝑗 = 1. All entries must also be non-negative (since they are probabilities), i.e. 𝑝𝑖𝑗 ≥ 0.
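These two properties are easy to verify numerically. The following short R sketch checks them for an arbitrary illustrative matrix (the matrix itself is our own example, not one of the systems discussed in this chapter):

# Sketch: verify that a candidate matrix is a valid (stochastic) transition matrix
P <- matrix(c(0.1, 0.9, 0.0,
              0.1, 0.0, 0.9,
              0.1, 0.0, 0.9), nrow = 3, byrow = TRUE)
stopifnot(all(P >= 0))                        # non-negative entries
stopifnot(all(abs(rowSums(P) - 1) < 1e-12))   # each row sums to one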
Consider the Malaysian NCD system. Let {𝑋𝑡 ∶ 𝑡 = 0, 1, 2, …} be the NCD
class occupied by a policyholder at time 𝑡 with state space 𝑆 = {0, 1, 2, 3, 4, 5}.
Therefore, the transition probability in a no-claim year is equal to the probability
of transition from state 𝑖 to state 𝑖 + 1, i.e. 𝑝𝑖𝑖+1 . If an insured has one or
more claims within the year, the probability of transitioning back to state 0
is represented by 𝑝𝑖0 = 1 − 𝑝𝑖𝑖+1 . Hence, the Malaysian NCD system can be
represented by the following 6 × 6 transition matrix:

$$
\mathbf{P} = \begin{pmatrix}
p_{00} & p_{01} & 0 & 0 & 0 & 0 \\
p_{10} & 0 & p_{12} & 0 & 0 & 0 \\
p_{20} & 0 & 0 & p_{23} & 0 & 0 \\
p_{30} & 0 & 0 & 0 & p_{34} & 0 \\
p_{40} & 0 & 0 & 0 & 0 & p_{45} \\
p_{50} & 0 & 0 & 0 & 0 & p_{55}
\end{pmatrix}
=
\begin{pmatrix}
1-p_{01} & p_{01} & 0 & 0 & 0 & 0 \\
1-p_{12} & 0 & p_{12} & 0 & 0 & 0 \\
1-p_{23} & 0 & 0 & p_{23} & 0 & 0 \\
1-p_{34} & 0 & 0 & 0 & p_{34} & 0 \\
1-p_{45} & 0 & 0 & 0 & 0 & p_{45} \\
1-p_{55} & 0 & 0 & 0 & 0 & p_{55}
\end{pmatrix}
$$

Example 12.3.1. Provide the transition matrix for the NCD system in Brazil.
Solution
Based on the NCD classes and the transition diagram shown in Table 12.2 and Figure 12.2 respectively, the probability of a no-claim year is equal to the probability of moving one class forward, whereas the probability of having one or more claims within the year is equal to the probability of moving one class backward for each claim. Therefore, each row can contain two or more transition probabilities: one probability for advancing to the next state, and one or more probabilities for moving backward. The transition matrix is:

$$
\mathbf{P} = \begin{pmatrix}
1-p_{01} & p_{01} & 0 & 0 & 0 & 0 & 0 \\
1-p_{12} & 0 & p_{12} & 0 & 0 & 0 & 0 \\
1-\sum_j p_{2j} & p_{21} & 0 & p_{23} & 0 & 0 & 0 \\
1-\sum_j p_{3j} & p_{31} & p_{32} & 0 & p_{34} & 0 & 0 \\
1-\sum_j p_{4j} & p_{41} & p_{42} & p_{43} & 0 & p_{45} & 0 \\
1-\sum_j p_{5j} & p_{51} & p_{52} & p_{53} & p_{54} & 0 & p_{56} \\
1-\sum_j p_{6j} & p_{61} & p_{62} & p_{63} & p_{64} & p_{65} & p_{66}
\end{pmatrix}
$$

Example 12.3.2. Provide the transition matrix for the NCD system in Switzer-
land.

Solution.

From Table 12.3 and Figure 12.3, the probability of a no-claim year is equal to the probability of moving one class forward, whereas the probability of having one or more claims within the year is equal to the probability of moving four classes backward for each claim. The transition matrix is:

$$
\mathbf{P} = \begin{pmatrix}
1-p_{01} & p_{01} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\
1-p_{12} & 0 & p_{12} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\
1-p_{23} & 0 & 0 & p_{23} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\
1-p_{34} & 0 & 0 & 0 & p_{34} & 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\
1-p_{45} & 0 & 0 & 0 & 0 & p_{45} & 0 & 0 & 0 & 0 & 0 & \cdots \\
1-\sum_j p_{5j} & p_{51} & 0 & 0 & 0 & 0 & p_{56} & 0 & 0 & 0 & 0 & \cdots \\
1-\sum_j p_{6j} & 0 & p_{62} & 0 & 0 & 0 & 0 & p_{67} & 0 & 0 & 0 & \cdots \\
1-\sum_j p_{7j} & 0 & 0 & p_{73} & 0 & 0 & 0 & 0 & p_{78} & 0 & 0 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \\
1-\sum_j p_{19,j} & 0 & 0 & p_{19,3} & 0 & 0 & 0 & p_{19,7} & 0 & 0 & 0 & \cdots \\
1-\sum_j p_{20,j} & 0 & 0 & 0 & p_{20,4} & 0 & 0 & 0 & p_{20,8} & 0 & 0 & \cdots \\
1-\sum_j p_{21,j} & p_{21,1} & 0 & 0 & 0 & p_{21,5} & 0 & 0 & 0 & p_{21,9} & 0 & \cdots
\end{pmatrix}
$$
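Transition matrices of this -1/+𝑠 type can also be generated numerically once a claim count distribution is assumed. The following R sketch is our own illustration (not part of the original text) and assumes Poisson claim counts; the Malaysian -1/Top rule is recovered by choosing 𝑠 at least as large as the number of classes.

# Sketch: one-step transition matrix of a -1/+s NCD system with Poisson claim counts
ncd_transition_matrix <- function(n_classes, s, lambda, kmax = 50) {
  P <- matrix(0, n_classes, n_classes)
  for (i in 0:(n_classes - 1)) {
    for (k in 0:kmax) {
      # no claim: one class of bonus; k >= 1 claims: s classes of malus per claim
      j <- if (k == 0) min(i + 1, n_classes - 1) else max(i - s * k, 0)
      P[i + 1, j + 1] <- P[i + 1, j + 1] + dpois(k, lambda)
    }
  }
  P / rowSums(P)   # renormalize for the truncation of claim counts at kmax
}

# Examples: Brazil (-1/+1, 7 classes) and Switzerland (-1/+4, 22 classes)
P_brazil <- ncd_transition_matrix(7, 1, lambda = 0.10)
P_swiss  <- ncd_transition_matrix(22, 4, lambda = 0.10)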

12.4 BMS and Stationary Distribution


12.4.1 Stationary Distribution
A stationary distribution of a Markov Chain is a probability distribution that remains unchanged as time progresses into the future. It is represented by a row vector 𝜋 = (𝜋0 , 𝜋1 , … , 𝜋𝑘−1 ) with the following properties:

0 ≤ 𝜋𝑗 ≤ 1,  ∑𝑗 𝜋𝑗 = 1,  𝜋𝑗 = ∑𝑖 𝜋𝑖 𝑝𝑖𝑗 .

The last equation can be written as 𝜋P = 𝜋. The first two conditions are
necessary for probability distribution whereas the last property indicates that
the row vector 𝜋 is invariant (i.e. unchanged) by the one-step transition matrix.
In other words, once the Markov Chain has reached the stationary state, its
probability distribution will stay stationary over time. Mathematically, the stationary vector 𝜋 can also be obtained as the left eigenvector of the one-step transition matrix associated with the eigenvalue 1, normalized so that its entries sum to one.

Example 12.4.1. Find the stationary distribution for the NCD system in Malaysia assuming that the probability of a no-claim year for all NCD classes is 𝑝0 .

Solution. The transition matrix can be re-written as:

$$
\mathbf{P} = \begin{pmatrix}
1-p_0 & p_0 & 0 & 0 & 0 & 0 \\
1-p_0 & 0 & p_0 & 0 & 0 & 0 \\
1-p_0 & 0 & 0 & p_0 & 0 & 0 \\
1-p_0 & 0 & 0 & 0 & p_0 & 0 \\
1-p_0 & 0 & 0 & 0 & 0 & p_0 \\
1-p_0 & 0 & 0 & 0 & 0 & p_0
\end{pmatrix}
$$

The stationary distribution can be calculated using 𝜋𝑗 = ∑𝑖 𝜋𝑖 𝑝𝑖𝑗 . The solutions are:

𝜋0 = ∑𝑖 𝜋𝑖 𝑝𝑖0 = (1 − 𝑝0 ) ∑𝑖 𝜋𝑖 = 1 − 𝑝0
𝜋1 = ∑𝑖 𝜋𝑖 𝑝𝑖1 = 𝜋0 𝑝01 = (1 − 𝑝0 )𝑝0
𝜋2 = ∑𝑖 𝜋𝑖 𝑝𝑖2 = 𝜋1 𝑝12 = (1 − 𝑝0 )𝑝0^2
𝜋3 = ∑𝑖 𝜋𝑖 𝑝𝑖3 = 𝜋2 𝑝23 = (1 − 𝑝0 )𝑝0^3
𝜋4 = ∑𝑖 𝜋𝑖 𝑝𝑖4 = 𝜋3 𝑝34 = (1 − 𝑝0 )𝑝0^4
𝜋5 = ∑𝑖 𝜋𝑖 𝑝𝑖5 = 𝜋4 𝑝45 + 𝜋5 𝑝55 = (1 − 𝑝0 )𝑝0^5 + 𝜋5 𝑝0
∴ 𝜋5 = (1 − 𝑝0 )𝑝0^5 / (1 − 𝑝0 ) = 𝑝0^5

The stationary distribution shown in Example 12.4.1 represents the asymptotic distribution of the NCD system, or the distribution in the long run. As an example, assuming that the probability of a no-claim year is 𝑝0 = 0.90, the stationary probabilities are:

𝜋0 = 1 − 𝑝0 = 0.1000
𝜋1 = (1 − 𝑝0 )𝑝0 = 0.0900
𝜋2 = (1 − 𝑝0 )𝑝0^2 = 0.0810
𝜋3 = (1 − 𝑝0 )𝑝0^3 = 0.0729
𝜋4 = (1 − 𝑝0 )𝑝0^4 = 0.0656
𝜋5 = 𝑝0^5 = 0.5905

In other words, 𝜋0 = 0.10 indicates that 10% of insureds will eventually belong
to class 0, 𝜋1 = 0.09 indicates that 9% of insureds will eventually belong to
class 1, and so forth, until 𝜋5 = 0.59, which indicates that 59% of insureds will
eventually belong to class 5.

12.4.2 R Code for a Stationary Distribution


We can use the left eigenvector of a transition matrix to calculate a stationary
distribution. The following R code can be used to calculate the left eigenvector:
1. Create a Transition Matrix
#create transition matrix
entries = c(0.1,0.9,0,0,0,0,
0.1,0,0.9,0,0,0,
0.1,0,0,0.9,0,0,
0.1,0,0,0,0.9,0,
0.1,0,0,0,0,0.9,
0.1,0,0,0,0,0.9)
(TP <- matrix(entries,nrow=6,byrow=TRUE) )

[,1] [,2] [,3] [,4] [,5] [,6]


[1,] 0.1 0.9 0.0 0.0 0.0 0.0
[2,] 0.1 0.0 0.9 0.0 0.0 0.0
[3,] 0.1 0.0 0.0 0.9 0.0 0.0
[4,] 0.1 0.0 0.0 0.0 0.9 0.0
[5,] 0.1 0.0 0.0 0.0 0.0 0.9
[6,] 0.1 0.0 0.0 0.0 0.0 0.9

2. Calculate eigenvalues and eigenvectors using the eigen function


#hint -- left eigenvector is the same as right eigenvector of transpose
#of transition matrix
eigenTP <- eigen(t(TP))
signif(eigenTP$values, digits = 3)
signif(eigenTP$vectors, digits = 3)

[1] 1.000000+0.000000i -0.000488+0.000355i -0.000488-0.000355i


[4] 0.000187+0.000574i 0.000187-0.000574i 0.000603+0.000000i

[,1] [,2]
[1,] 0.162+0i 0.000000000000115+0.000000000000084i
[2,] 0.145+0i -0.000000000066000-0.000000000203000i
[3,] 0.131+0i -0.000000098000000+0.000000302000000i
[4,] 0.118+0i 0.000384000000000-0.000279000000000i
[5,] 0.106+0i -0.707000000000000+0.000000000000000i
[6,] 0.954+0i 0.707000000000000+0.000000000000000i
[,3]
[1,] 0.000000000000115-0.000000000000084i
[2,] -0.000000000066000+0.000000000203000i
[3,] -0.000000098000000-0.000000302000000i
[4,] 0.000384000000000+0.000279000000000i
[5,] -0.707000000000000+0.000000000000000i
[6,] 0.707000000000000+0.000000000000000i
[,4]
[1,] -0.000000000000044+0.000000000000136i
[2,] 0.000000000172000+0.000000000125000i
[3,] 0.000000257000000-0.000000187000000i
[4,] -0.000147000000000-0.000451000000000i
[5,] -0.707000000000000+0.000000000000000i
[6,] 0.707000000000000+0.000000000000000i
[,5] [,6]
[1,] -0.000000000000044-0.000000000000136i 0.000000000000143+0i
[2,] 0.000000000172000-0.000000000125000i 0.000000000213000+0i
[3,] 0.000000257000000+0.000000187000000i 0.000000317000000+0i
[4,] -0.000147000000000+0.000451000000000i 0.000474000000000+0i
[5,] -0.707000000000000+0.000000000000000i 0.707000000000000+0i
[6,] 0.707000000000000+0.000000000000000i -0.707000000000000+0i

3. Calculate the left eigenvector


#divide entry of first column by sum of elements, so that entries sum to 1


#provide answers in 4 decimal places
signif(eigen(t(TP))$vectors[,1]/sum(eigen(t(TP))$vectors[,1]), digits = 4)

[1] 0.10000+0i 0.09000+0i 0.08100+0i 0.07290+0i 0.06561+0i 0.59050+0i

Example 12.4.2. Find the stationary distribution for the NCD system in
Brazil assuming that the number of claims is Poisson distributed with parameter
𝜆 = 0.10.
Solution. Under the Poisson distribution, the probability of 𝑘 claims is 𝑝𝑘 = 𝑒^{−0.1}(0.1)^𝑘 / 𝑘! , 𝑘 = 0, 1, 2, … .
The transition matrix is:

$$
\mathbf{P} = \begin{pmatrix}
1-p_0 & p_0 & 0 & 0 & 0 & 0 & 0 \\
1-p_0 & 0 & p_0 & 0 & 0 & 0 & 0 \\
1-\sum_i p_i & p_1 & 0 & p_0 & 0 & 0 & 0 \\
1-\sum_i p_i & p_2 & p_1 & 0 & p_0 & 0 & 0 \\
1-\sum_i p_i & p_3 & p_2 & p_1 & 0 & p_0 & 0 \\
1-\sum_i p_i & p_4 & p_3 & p_2 & p_1 & 0 & p_0 \\
1-\sum_i p_i & p_5 & p_4 & p_3 & p_2 & p_1 & p_0
\end{pmatrix}
= \begin{pmatrix}
0.0952 & 0.9048 & 0 & 0 & 0 & 0 & 0 \\
0.0952 & 0 & 0.9048 & 0 & 0 & 0 & 0 \\
0.0047 & 0.0905 & 0 & 0.9048 & 0 & 0 & 0 \\
0.0002 & 0.0045 & 0.0905 & 0 & 0.9048 & 0 & 0 \\
0.0000 & 0.0002 & 0.0045 & 0.0905 & 0 & 0.9048 & 0 \\
0.0000 & 0.0000 & 0.0002 & 0.0045 & 0.0905 & 0 & 0.9048 \\
0.0000 & 0.0000 & 0.0000 & 0.0002 & 0.0045 & 0.0905 & 0.9048
\end{pmatrix}
$$

Using R code, the stationary probabilities are:

𝜋0 = 0.0000
𝜋1 = 0.0000
𝜋2 = 0.0003
𝜋3 = 0.0022
𝜋4 = 0.0145
𝜋5 = 0.0936
𝜋6 = 0.8894

The probabilities indicate that 89% of insureds will eventually belong to class
6, 9% of insureds will eventually belong to class 5, and 1.5% of insureds will
eventually belong to class 4. Other classes would have less than 1% of insureds
in the long run.
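A possible R sketch for this calculation, mirroring the eigenvector approach of Section 12.4.2, is given below; the Poisson-based construction of the matrix is our own convenience and the object names are ours.

# Sketch: stationary distribution of the Brazilian NCD system (-1/+1, 7 classes)
# with Poisson(0.1) claim counts, via the left eigenvector of the transition matrix
lambda <- 0.10
nc <- 7
TPb <- matrix(0, nc, nc)
for (i in 0:(nc - 1)) {
  TPb[i + 1, min(i + 1, nc - 1) + 1] <- dpois(0, lambda)      # claim-free year: one class up
  for (k in 1:25)                                             # k claims: k classes down
    TPb[i + 1, max(i - k, 0) + 1] <- TPb[i + 1, max(i - k, 0) + 1] + dpois(k, lambda)
}
SPb <- eigen(t(TPb))$vectors[, 1]
SPb <- Re(SPb / sum(SPb))
round(SPb, 4)
# approximately 0.0000 0.0000 0.0003 0.0022 0.0145 0.0936 0.8894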

Example 12.4.3. Using the results from Example 12.4.2, find the final pre-
mium under the steady state condition assuming that the premium prior to
implementing the NCD is 𝑚.
Solution. Using the stationary probabilities from Example 12.4.2, the station-
ary final premium is:

= ∑𝑗 (premium) × (proportion in class 𝑗 in the long run) × (1 − NCD in class 𝑗)
= 𝑚[𝜋0 (1) + 𝜋1 (1 − 0.10) + 𝜋2 (1 − 0.15) + … + 𝜋6 (1 − 0.35)]
= 𝑚[0 + 0 + (0.0003)(0.85) + … + (0.8894)(0.65)]
= 0.6565𝑚.

The results indicate that the final premium reduces from 𝑚 to 0.6565𝑚 in the long run under the stationary condition if the NCD is considered. From a financial standpoint, this implies that the collected premium is insufficient to cover the expected claim cost of 𝑚. This result is not surprising because none of the classes in the NCD system in Brazil imposes a malus loading on policyholders. More importantly, it indicates that an NCD will only be financially balanced if there are both bonus and malus classes and the premium levels are re-calculated such that the expected premium under the stationary distribution equals 𝑚.
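The arithmetic above can be reproduced in a few lines of R; the stationary probabilities are those computed in Example 12.4.2 and the object names are ours.

# Sketch: expected premium under the stationary distribution (with m = 100)
SPb <- c(0.0000, 0.0000, 0.0003, 0.0022, 0.0145, 0.0936, 0.8894)
disc_brazil <- c(0, .10, .15, .20, .25, .30, .35)   # NCD discount per class
m <- 100
sum(m * SPb * (1 - disc_brazil))                    # approximately 65.65, i.e. 0.6565 m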

12.4.3 Premium Evolution


We may be interested in the evolution of the mean premium after 𝑛 years (or 𝑛 steps). Under the NCD system, the 𝑛-step transition probability, 𝑝𝑖𝑗^(𝑛) = Pr(𝑋𝑛 = 𝑗|𝑋0 = 𝑖), can be used to calculate the evolution of the mean premium. The probability 𝑝𝑖𝑗^(𝑛) can be obtained as the (𝑖, 𝑗)-th element of the 𝑛-th power of the transition matrix P, that is, P^𝑛 .

Example 12.4.4. Observe the premiums in 20 years under the NCD system in Malaysia, assuming that the number of claims is Poisson distributed with parameter 𝜆 = 0.10 and the premium prior to implementing the NCD is 𝑚 = 100.

Solution. Under the Malaysian NCD, we only need the Poisson probability 𝑝𝑘 = 𝑒^{−0.1}(0.1)^𝑘/𝑘! through 𝑝0 (a claim-free year) and 1 − 𝑝0 (one or more claims). Therefore, the transition matrix in the first year is:

$$
\mathbf{P}^{(1)} = \begin{pmatrix}
0.0952 & 0.9048 & 0 & 0 & 0 & 0 \\
0.0952 & 0 & 0.9048 & 0 & 0 & 0 \\
0.0952 & 0 & 0 & 0.9048 & 0 & 0 \\
0.0952 & 0 & 0 & 0 & 0.9048 & 0 \\
0.0952 & 0 & 0 & 0 & 0 & 0.9048 \\
0.0952 & 0 & 0 & 0 & 0 & 0.9048
\end{pmatrix}
$$

The premium in the first year, after implementing the NCD, is:

= ∑𝑗 (premium) × (average proportion in class 𝑗) × (1 − NCD in class 𝑗)
= 𝑚 [ (∑𝑖 𝑝𝑖0 /6)(1) + (∑𝑖 𝑝𝑖1 /6)(1 − 0.25) + … + (∑𝑖 𝑝𝑖5 /6)(1 − 0.55) ]
= 100[0.0952(1) + 0.1508(0.75) + ⋯ + 0.3016(0.45)]
= 62.55.

Using similar steps, the premium in the 𝑛-th year for 𝑛 = 1, 2, ..., 20 can be
observed. From R, the premiums in 20 years are:
62.55, 59.87, 58.06, 57.06, 56.58, 56.58, 56.58, 56.58, 56.58, 56.58,
56.58, 56.58, 56.58, 56.58, 56.58, 56.58, 56.58, 56.58, 56.58, 56.58.

12.4.4 R Program for Premium Evolution


The following R code can be used to find the premium in the n-th year and the
premiums in 20 years under the NCD system in Malaysia (to find the solution
in Example 12.4.4).
1. Create a Transition Matrix
#create transition matrix
entries = c(0.0952,0.9048,0,0,0,0,
0.0952,0,0.9048,0,0,0,
0.0952,0,0,0.9048,0,0,
0.0952,0,0,0,0.9048,0,
0.0952,0,0,0,0,0.9048,
0.0952,0,0,0,0,0.9048)
(TP <- matrix(entries,nrow=6,byrow=TRUE) )

[,1] [,2] [,3] [,4] [,5] [,6]


[1,] 0.0952 0.9048 0.0000 0.0000 0.0000 0.0000
[2,] 0.0952 0.0000 0.9048 0.0000 0.0000 0.0000
[3,] 0.0952 0.0000 0.0000 0.9048 0.0000 0.0000
[4,] 0.0952 0.0000 0.0000 0.0000 0.9048 0.0000
[5,] 0.0952 0.0000 0.0000 0.0000 0.0000 0.9048
[6,] 0.0952 0.0000 0.0000 0.0000 0.0000 0.9048
2. Create a function for the 𝑛th power of a square matrix
#create function for nth power of square matrix
powA <- function(n) {
if (n==1) return (TP)
414 CHAPTER 12. EXPERIENCE RATING USING BONUS-MALUS

if (n==2) return (TP%*%TP)


if (n>2) return ( TP%*%powA(n-1))}
#example for n=3
signif(powA(3), digits = 3)

[,1] [,2] [,3] [,4] [,5] [,6]


[1,] 0.0952 0.0861 0.0779 0.741 0.000 0.000
[2,] 0.0952 0.0861 0.0779 0.000 0.741 0.000
[3,] 0.0952 0.0861 0.0779 0.000 0.000 0.741
[4,] 0.0952 0.0861 0.0779 0.000 0.000 0.741
[5,] 0.0952 0.0861 0.0779 0.000 0.000 0.741
[6,] 0.0952 0.0861 0.0779 0.000 0.000 0.741

3. Create a function for the premium in the 𝑛th year

#define premium factors (1 - NCD discount) for each class
NCD = c(1,.75,.7,.6167,.55,.45)

#create function for premium in nth year


p = numeric(0)
prem <- function(n){
for (j in 1:length(NCD))
p[j] = mean(powA(n)[,j])
100*sum(p*NCD)
}
#example for n=3
signif(prem(3), digits = 3)

[1] 58.1

4. Provide Premiums for 20 years


premium=numeric(0)
for (n in 1:20) {premium[n] = prem(n) }
signif(premium, digits = 2)

[1] 63 60 58 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57

Example 12.4.5. Observe the premiums in 20 years under the NCD system in Brazil, assuming that the probability of 𝑘 claims is 𝑝𝑘 = 𝑒^{−0.1}(0.1)^𝑘/𝑘! , 𝑘 = 0, 1, 2, …, and the premium prior to implementing the NCD is 𝑚 = 100.

Solution. The transition matrix for the NCD system in Brazil is:
$$
\mathbf{P} = \begin{pmatrix}
0.0952 & 0.9048 & 0 & 0 & 0 & 0 & 0 \\
0.0952 & 0 & 0.9048 & 0 & 0 & 0 & 0 \\
0.0047 & 0.0905 & 0 & 0.9048 & 0 & 0 & 0 \\
0.0002 & 0.0045 & 0.0905 & 0 & 0.9048 & 0 & 0 \\
0.0000 & 0.0002 & 0.0045 & 0.0905 & 0 & 0.9048 & 0 \\
0.0000 & 0.0000 & 0.0002 & 0.0045 & 0.0905 & 0 & 0.9048 \\
0.0000 & 0.0000 & 0.0000 & 0.0002 & 0.0045 & 0.0905 & 0.9048
\end{pmatrix}
$$

Using R, the premiums in 20 years are:


76.69, 73.76, 71.31, 69.38, 67.92, 66.93, 66.40, 66.05, 65.88, 65.78,
65.72, 65.69, 65.67, 65.66, 65.66, 65.66, 65.66, 65.65, 65.65, 65.65.
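A possible R sketch for this example, following the same pattern as the Malaysian code in Section 12.4.4 (the matrix construction and object names are our own):

# Sketch: premium evolution under the Brazilian NCD (-1/+1, 7 classes),
# assuming Poisson(0.1) claim counts and m = 100
lambda <- 0.10; nc <- 7; m <- 100
TPb <- matrix(0, nc, nc)
for (i in 0:(nc - 1)) {
  TPb[i + 1, min(i + 1, nc - 1) + 1] <- dpois(0, lambda)
  for (k in 1:25)
    TPb[i + 1, max(i - k, 0) + 1] <- TPb[i + 1, max(i - k, 0) + 1] + dpois(k, lambda)
}
factorb <- c(1, .90, .85, .80, .75, .70, .65)      # 1 - NCD discount per class
premium_b <- sapply(1:20, function(nyr) {
  Pn <- Reduce(`%*%`, replicate(nyr, TPb, simplify = FALSE))   # n-th matrix power
  m * sum(colMeans(Pn) * factorb)
})
round(premium_b, 2)
# approximately 76.69 73.76 71.31 ... 65.65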

The results in Examples 12.4.4-5 allow us to observe the evolution of premiums for the NCD systems in Malaysia and Brazil, assuming that the number of claims is Poisson distributed with parameter 𝜆 = 0.10 and the premium prior to implementing the NCD is 𝑚 = 100. The evolution of premiums for both countries is provided in Table 12.4 and shown graphically in Figure 12.4.
Table 12.4. Evolution of Premium (Malaysia and Brazil)

Year   Premium (Malaysia)   Premium (Brazil)     Year   Premium (Malaysia)   Premium (Brazil)
0      100                  100                  11     56.58                65.72
1      62.55                76.69                12     56.58                65.69
2      59.87                73.76                13     56.58                65.67
3      58.06                71.31                14     56.58                65.66
4      57.06                69.38                15     56.58                65.66
5      56.58                67.92                16     56.58                65.66
6      56.58                66.93                17     56.58                65.66
7      56.58                66.40                18     56.58                65.65
8      56.58                66.05                19     56.58                65.65
9      56.58                65.88                20     56.58                65.65

12.4.5 Convergence Rate


We may also be interested in the variation between the probability in the 𝑛-th year, 𝑝𝑖𝑗^(𝑛), and the stationary probability, 𝜋𝑗 . The variation between the probabilities can be measured using

∣average(𝑝𝑖𝑗^(𝑛)) − 𝜋𝑗 ∣ ,

where the average is taken over the starting classes 𝑖. Therefore, the total variation can be measured by the sum of the variation over all classes:

Figure 12.4: Evolution of Premium (Malaysia and Brazil)

∑𝑗 ∣average(𝑝𝑖𝑗^(𝑛)) − 𝜋𝑗 ∣ .

The total variation is also called the convergence rate because it measures how far the system is from stationarity after 𝑛 years (or 𝑛 transitions). A lower total variation implies better convergence of the 𝑛-step transition probabilities to the stationary distribution.
Example 12.4.6. Provide the total variations (convergence rates) in 20 years under the NCD system in Malaysia, assuming that the number of claims is Poisson distributed with parameter 𝜆 = 0.10.
Solution. Using R, the stationary probabilities are:

𝜋0 = 0.0952
𝜋1 = 0.0861
𝜋2 = 0.0779
𝜋3 = 0.0705
𝜋4 = 0.0638
𝜋5 = 0.6064

The transition matrix in the first year is:

$$
\mathbf{P}^{(1)} = \begin{pmatrix}
0.0952 & 0.9048 & 0 & 0 & 0 & 0 \\
0.0952 & 0 & 0.9048 & 0 & 0 & 0 \\
0.0952 & 0 & 0 & 0.9048 & 0 & 0 \\
0.0952 & 0 & 0 & 0 & 0.9048 & 0 \\
0.0952 & 0 & 0 & 0 & 0 & 0.9048 \\
0.0952 & 0 & 0 & 0 & 0 & 0.9048
\end{pmatrix}
$$

The variation can be computed as:

∣∑𝑖 𝑝𝑖0 /6 − 𝜋0 ∣ = 0
∣∑𝑖 𝑝𝑖1 /6 − 𝜋1 ∣ = 0.0647
⋮
∣∑𝑖 𝑝𝑖5 /6 − 𝜋5 ∣ = 0.3048

Therefore, the total variation in the first year is

∑𝑗 ∣∑𝑖 𝑝𝑖𝑗 /6 − 𝜋𝑗 ∣ = 0.6096.

Using R, the total variations (or convergence rate) in 20 years are:


0.6096, 0.3941, 0.2252, 0.0958, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.

12.4.6 R Program for Convergence Rate


The following R code can be used to calculate the total variation in the 𝑛th year,
and the total variations (convergence rates) in 20 years under the NCD system
in Malaysia (the solution in Example 12.4.6).
1. Recall the Transition Matrix
TP

[,1] [,2] [,3] [,4] [,5] [,6]


[1,] 0.0952 0.9048 0.0000 0.0000 0.0000 0.0000
[2,] 0.0952 0.0000 0.9048 0.0000 0.0000 0.0000
[3,] 0.0952 0.0000 0.0000 0.9048 0.0000 0.0000
[4,] 0.0952 0.0000 0.0000 0.0000 0.9048 0.0000
[5,] 0.0952 0.0000 0.0000 0.0000 0.0000 0.9048
[6,] 0.0952 0.0000 0.0000 0.0000 0.0000 0.9048
2. Create stationary probabilities
SP <- eigen(t(TP))$vectors[,1]/sum(eigen(t(TP))$vectors[,1])
signif(SP, digits = 3)

[1] 0.0952+0i 0.0861+0i 0.0779+0i 0.0705+0i 0.0638+0i 0.6060+0i


3. Create a function for the total variation in the 𝑛th year
TV=function(n){
  dif =numeric(0)
  for (j in 1:length(SP))
    dif[j]=abs(mean(powA(n)[,j])-SP[j])
  sum(dif)
}
#example for n=1
signif(TV(1), digits = 4)

[1] 0.6096
4. Provide total variations (convergence rate) in 20 years
tot.var=numeric(0)
for (n in 1:20) {tot.var[n] = TV(n)}
signif(tot.var,4)

[1] 6.096e-01 3.941e-01 2.252e-01 9.580e-02 4.177e-15 4.274e-15 4.372e-15


[8] 4.483e-15 4.594e-15 4.594e-15 4.594e-15 4.594e-15 4.594e-15 4.594e-15
[15] 4.594e-15 4.594e-15 4.594e-15 4.594e-15 4.594e-15 4.594e-15
Example 12.4.7. Provide the total variations (or convergence rate) in 20
years under the NCD system in Brazil, assuming that the number of claims is
distributed as Poisson with parameter 𝜆 = 0.10.
Solution. Using R code, the total variations (or convergence rates) in 20 years
for the NCD system in Brazil are:
1.2617, 1.0536, 0.8465, 0.6412, 0.4362, 0.2316, 0.1531, 0.0747, 0.0480, 0.0232,
0.0145, 0.0071, 0.0043, 0.0021, 0.0013, 0.0006, 0.0004, 0.0002, 0.0001, 0.0001.
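These values could be reproduced with an R sketch along the following lines (the Poisson-based matrix construction and object names are our own):

# Sketch: total variation (convergence rate) for the Brazilian NCD system
# with Poisson(0.1) claim counts
lambda <- 0.10; nc <- 7
TPb <- matrix(0, nc, nc)
for (i in 0:(nc - 1)) {
  TPb[i + 1, min(i + 1, nc - 1) + 1] <- dpois(0, lambda)
  for (k in 1:25)
    TPb[i + 1, max(i - k, 0) + 1] <- TPb[i + 1, max(i - k, 0) + 1] + dpois(k, lambda)
}
SPb <- eigen(t(TPb))$vectors[, 1]
SPb <- Re(SPb / sum(SPb))                                     # stationary distribution
tv_b <- sapply(1:20, function(nyr) {
  Pn <- Reduce(`%*%`, replicate(nyr, TPb, simplify = FALSE))  # n-step transition matrix
  sum(abs(colMeans(Pn) - SPb))
})
round(tv_b, 4)
# approximately 1.2617 1.0536 0.8465 ... 0.0001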

Examples 12.4.6-7 provide the degree of convergence for two different BMS (two different countries). The Malaysian BMS reaches full stationarity after only five years, while the BMS in Brazil takes a longer period. As mentioned in Lemaire (1998), a more sophisticated BMS converges more slowly, which is considered a drawback since it takes a longer period to stabilize. The main objective of a BMS is to separate the good drivers from the bad drivers, and thus it is desirable to have a classification process that can be finalized (or stabilized) as soon as possible.

12.5 BMS and Premium Rating


12.5.1 Premium Rating
In motor insurance ratemaking, BMS is a form of a posteriori rating mechanism
to complement the use of a priori risk classification. The a priori risk segmenta-
tion divides portfolio of drivers into a number of homogeneous risk classes based
on observable characteristics, such that policyholders in the same risk class pay
the same a priori premium. The underlying reason for utilizing BMS that re-
lies on claims experience information is to deal with the residual heterogeneity
within each homogeneous risk class since the observable variables are far from
perfect in predicting the riskiness of driving behaviors.
The ideal a posteriori mechanism is the credibility premium (see Dionne and
Vanasse, 1989) framework, whereby premiums are derived on an individual basis
for each policyholder by incorporating both the a priori and a posteriori infor-
mation. However, such individual premium determination is overly complex
from a commercial standpoint for practical implementations by motor insurers.
For this reason, BMS is the preferred solution and it consists of the following
three building blocks: (a) BMS classes; (b) transition rules; (c) premium levels
(also known as premium relativities or premium adjustment coefficients). The
first two building blocks are pre-specified in advance and have been discussed
in previous sections, whereas the determination of the premium relativities (rather than taking them as pre-determined, as in the Malaysian, Brazilian and Swiss systems discussed above) is important for motor insurers precisely because of their complementary and corrective nature, which accounts for imperfections or inaccuracies in the a priori risk classification. In the following subsections, we briefly introduce the modelling setup required to study the determination of optimal relativities. We refer interested readers to Denuit et al. (2007) for a fuller discussion of the technical details.

12.5.2 A Priori Risk Classification


Let us consider a portfolio of 𝑛 policies, where the risk exposure of driver 𝑖
is denoted as 𝑑𝑖 and the number of claims reported is represented by 𝑁𝑖 . Let
X𝑇𝑖 = (𝑋𝑖1 , 𝑋𝑖2 , … , 𝑋𝑖𝑞 ) be the vector of observable variables for 𝑖 = 1, 2, … , 𝑛.
The Poisson regression is commonly chosen to model 𝑁𝑖 under the generalized
linear models (GLM) framework, see McCullagh and Nelder (1989).
We can express the predicted a priori expected claim frequency for policyholder
𝑖 as

$$ \lambda_i = d_i \exp\left( \hat{\beta}_0 + \sum_{m=1}^{q} \hat{\beta}_m x_{im} \right), $$

where 𝛽0̂ , 𝛽1̂ , … , 𝛽𝑞̂ are the estimated regression coefficients.
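For illustration, such an a priori model could be fitted in R along the following lines; the data frame and variable names (policies, nclaims, exposure, agecat, region) are hypothetical and only serve to show the structure of the call.

# Sketch: a priori claim frequency via Poisson regression with a log-exposure offset
fit_apriori <- glm(nclaims ~ agecat + region + offset(log(exposure)),
                   family = poisson(link = "log"), data = policies)
lambda_hat <- predict(fit_apriori, type = "response")   # predicted a priori frequencies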

12.5.3 Modelling of Residual Heterogeneity


Since unobserved factors that may affect driving behaviors are not taken into
account in estimating the expected claim frequency, insurers would have to ac-
count for the residual heterogeneity within each a priori risk class by introducing
a random effect component Θ𝑖 into the conditional distribution of 𝑁𝑖 . Given
Θ𝑖 = 𝜃, 𝑁𝑖 follows a Poisson distribution with mean 𝜆𝑖 𝜃, that is,
$$ \Pr(N_i = k \mid \Theta_i = \theta) = \exp(-\lambda_i\theta)\, \frac{(\lambda_i\theta)^k}{k!}, \quad k = 0, 1, 2, \ldots $$

We further assume that all the Θ𝑖 ’s are independent and follow a gamma(𝑎, 𝑎) distribution with density function

$$ f(\theta) = \frac{a^a}{\Gamma(a)}\, \theta^{a-1} \exp(-a\theta), \quad \theta > 0, $$

where the use of the Poisson-gamma mixture produces a negative binomial distribution for 𝑁𝑖 . With these specifications, we obtain 𝔼(Θ𝑖 ) = 1 and hence 𝔼(𝑁𝑖 ) = 𝔼(𝔼(𝑁𝑖 |Θ𝑖 )) = 𝔼(𝜆𝑖 Θ𝑖 ) = 𝜆𝑖 .

12.5.4 Stationary Distribution Allowing for Residual Heterogeneity
Suppose that a driver is selected at random from the portfolio that has been
classified into ℎ risk classes via the use of observed a priori variables. The true
expected claim frequency for this driver is given by ΛΘ, where Λ is the unknown
a priori expected claim frequency and Θ is the random residual heterogeneity.
Let us further denote 𝑤𝑔 as the proportion of drivers in the 𝑔-th risk class, that is, 𝑤𝑔 = Pr(Λ = 𝜆𝑔 ) = 𝑛𝑔 /𝑛, where 𝑛𝑔 is the number of drivers classified in the 𝑔-th risk class. Note that since there are two different concepts of risk classes (from a priori risk classification) and BMS (or NCD) classes (for the a posteriori rating mechanism), for the rest of this chapter we will refer to BMS classes as BMS levels to avoid unnecessary confusion.
Let $p^{\lambda}_{ij}(\lambda\theta)$ be the transition probability of moving from BMS level 𝑖 to level 𝑗 for a driver with expected claim frequency 𝜆𝜃 belonging to the risk class with predicted claim frequency 𝜆. In other words, the one-step transition matrix can be written as $\mathbf{P}(\lambda\theta; \lambda) = \{p^{\lambda}_{ij}(\lambda\theta)\}$. The row vector of the stationary distribution $\pi = (\pi^{\lambda}_0(\lambda\theta), \pi^{\lambda}_1(\lambda\theta), \ldots, \pi^{\lambda}_{k-1}(\lambda\theta))$ can be obtained by solving the following conditions:

𝜋(𝜆𝜃; 𝜆) = 𝜋(𝜆𝜃; 𝜆)P(𝜆𝜃; 𝜆)


𝜋(𝜆𝜃; 𝜆)1 = 1,

where 1 is the column vector of 1's and $\pi^{\lambda}_{\ell}(\lambda\theta)$ is the stationary probability for a driver with true expected claim frequency 𝜆𝜃 to be in level ℓ when the equilibrium steady state is reached in the long run. Note that with the incorporation of the random effect parameter 𝜃, a closed-form expression (rather than numerical values) for $\pi^{\lambda}_{\ell}(\lambda\theta)$ can be found with symbolic software such as MATLAB, but not with R.
With this setup, the probability of drivers staying in BMS level 𝐿 = ℓ for ℓ = 0, 1, … , 𝑘 − 1 in the context of the entire portfolio can be obtained as

$$
\begin{aligned}
\Pr(L = \ell) &= \sum_{g=1}^{h} \Pr(L = \ell \mid \Lambda = \lambda_g)\, \Pr(\Lambda = \lambda_g) \\
&= \sum_{g=1}^{h} \Pr(\Lambda = \lambda_g) \int_0^{\infty} \Pr(L = \ell \mid \Lambda = \lambda_g, \Theta = \theta)\, f(\theta)\, d\theta \\
&= \sum_{g=1}^{h} w_g \int_0^{\infty} \pi_{\ell}^{\lambda_g}(\lambda_g\theta)\, f(\theta)\, d\theta.
\end{aligned}
$$

12.5.5 Determination of Optimal Relativities


The optimal relativity for each BMS level was first derived by Norberg (1976) through the minimization of the following objective function, more commonly known as Norberg's criterion:

$$ \min \; E\!\left( (\bar{\lambda}\Theta - \bar{\lambda} r_L)^2 \right) \equiv \min \; E\!\left( (\Theta - r_L)^2 \right), $$

where $\bar{\lambda}$ is the constant expected claim frequency for all policyholders in the absence of a priori risk classification and 𝑟𝐿 is the premium relativity for BMS level 𝐿. Pitrebois et al. (2003) then incorporated the information of the a priori risk classification into the optimization of the same objective function

$$ \min \; E\!\left( (\Theta - r_L)^2 \right) $$

to derive 𝑟𝐿 analytically. Tan et al. (2015) further proposed the minimization of the objective function

$$ \min \; E\!\left( (\Lambda\Theta - \Lambda r_L)^2 \right), \quad \text{subject to } E(r_L) = 1, $$

under a financial balanced constraint (that is, the expected premium relativity equals 1) to determine the optimal relativities of a BMS given pre-specified BMS levels and transition rules, where
$$
\begin{aligned}
\min \; E\!\left( (\Lambda\Theta - \Lambda r_L)^2 \right)
&= \sum_{\ell=0}^{k-1} E\!\left( (\Lambda\Theta - \Lambda r_L)^2 \,\middle|\, L=\ell \right) \Pr(L=\ell) \\
&= \sum_{\ell=0}^{k-1} E\!\left( E\!\left( (\Lambda\Theta - \Lambda r_L)^2 \,\middle|\, L=\ell, \Lambda \right) \,\middle|\, L=\ell \right) \Pr(L=\ell) \\
&= \sum_{\ell=0}^{k-1} \sum_{g=1}^{h} E\!\left( (\Lambda\Theta - \Lambda r_L)^2 \,\middle|\, L=\ell, \Lambda=\lambda_g \right) \Pr(\Lambda=\lambda_g \mid L=\ell) \Pr(L=\ell) \\
&= \sum_{\ell=0}^{k-1} \sum_{g=1}^{h} \int_0^{\infty} (\lambda_g\theta - \lambda_g r_\ell)^2\, \pi_{\ell}^{\lambda_g}(\lambda_g\theta)\, w_g f(\theta)\, d\theta \\
&= \sum_{g=1}^{h} w_g \int_0^{\infty} \sum_{\ell=0}^{k-1} (\lambda_g\theta - \lambda_g r_\ell)^2\, \pi_{\ell}^{\lambda_g}(\lambda_g\theta) f(\theta)\, d\theta.
\end{aligned}
$$

It is crucially important that the optimal relativities have an average of 100%, so that the bonuses and maluses exactly offset each other and result in a financial equilibrium condition. Note that the approach considered by Pitrebois et al. (2003) does not require the financial balanced constraint because the analytical solution to its objective function is given by 𝑟ℓ = 𝔼(Θ|𝐿 = ℓ), so it follows that 𝔼(𝑟𝐿 ) = 𝔼 (𝔼(Θ|𝐿)) = 𝔼(Θ) = 1 with the specific choice of the gamma(𝑎, 𝑎) distribution for the random effect component Θ.
In this case, the optimization problem can be solved by specifying the Lagrangian as

$$
\begin{aligned}
\mathcal{L}(\mathbf{r}, \alpha) &= E\!\left( (\Lambda\Theta - \Lambda r_L)^2 \right) + \alpha\left( E(r_L) - 1 \right) \\
&= \sum_{\ell=0}^{k-1} E\!\left( (\Lambda\Theta - \Lambda r_L)^2 \,\middle|\, L = \ell \right) \Pr(L=\ell) + \alpha\left( \sum_{\ell=0}^{k-1} r_\ell \Pr(L=\ell) - 1 \right),
\end{aligned}
$$

where $\mathbf{r} = (r_0, r_1, \ldots, r_{k-1})^T$. The required first order conditions are given as follows:

$$ \Pr(L = \ell) \times \left( 2\, E\!\left( \Lambda^2\Theta - \Lambda^2 r_L \,\middle|\, L = \ell \right) - \alpha \right) = 0, \quad \ell = 0, 1, \ldots, k-1, $$
$$ \sum_{\ell=0}^{k-1} r_\ell \Pr(L = \ell) - 1 = 0. $$

Finally, the solution set for 𝛼 and 𝑟ℓ , ℓ = 0, 1, … , 𝑘 − 1, is obtained as

$$
\alpha = \frac{\left( \sum_{\ell=0}^{k-1} \frac{E(\Lambda^2\Theta \,\mid\, L=\ell)}{E(\Lambda^2 \,\mid\, L=\ell)} \Pr(L=\ell) \right) - 1}{\sum_{\ell=0}^{k-1} \frac{\Pr(L=\ell)}{2\, E(\Lambda^2 \,\mid\, L=\ell)}},
\qquad
r_\ell = \frac{E(\Lambda^2\Theta \,\mid\, L=\ell)}{E(\Lambda^2 \,\mid\, L=\ell)} - \frac{\alpha}{2\, E(\Lambda^2 \,\mid\, L=\ell)},
$$
where

$$ \Pr(L = \ell) = \sum_{g=1}^{h} w_g \int_0^{\infty} \pi_{\ell}^{\lambda_g}(\lambda_g\theta) f(\theta)\, d\theta, $$
$$ E(\Lambda^2\Theta \mid L = \ell) = \frac{\sum_{g=1}^{h} w_g \int_0^{\infty} \lambda_g^2\, \theta\, \pi_{\ell}^{\lambda_g}(\lambda_g\theta) f(\theta)\, d\theta}{\sum_{g=1}^{h} w_g \int_0^{\infty} \pi_{\ell}^{\lambda_g}(\lambda_g\theta) f(\theta)\, d\theta},
\qquad
E(\Lambda^2 \mid L = \ell) = \frac{\sum_{g=1}^{h} w_g \int_0^{\infty} \lambda_g^2\, \pi_{\ell}^{\lambda_g}(\lambda_g\theta) f(\theta)\, d\theta}{\sum_{g=1}^{h} w_g \int_0^{\infty} \pi_{\ell}^{\lambda_g}(\lambda_g\theta) f(\theta)\, d\theta}. $$

If we perform the optimization without the financial balanced constraint, then we obtain

$$ \alpha^{\text{unconstrained}} = 0, \qquad r_{\ell}^{\text{unconstrained}} = \frac{E(\Lambda^2\Theta \mid L = \ell)}{E(\Lambda^2 \mid L = \ell)}. $$

12.5.6 Numerical Illustrations


In this section, we present two numerical illustrations that integrate a priori
information into the determination of optimal relativities. We consider the
BMS levels and the transition rules of both Malaysian and Brazilian systems but
choose to calculate the set of optimal relativities instead of the specified premium
levels given earlier. In our illustrations, we also assume that the following 3
values of a priori expected claim frequency are given, 𝜆1 = 0.1, 𝜆2 = 0.3, and 𝜆3 = 0.5, with the following proportions:

Pr(Λ = 𝜆1 ) = 0.6, Pr(Λ = 𝜆2 ) = 0.3, Pr(Λ = 𝜆3 ) = 0.1.

We also assume that the gamma parameter is fixed at 𝑎 = 1.5. Note that while
these modelling assumptions are simple, the purpose here is to demonstrate
the determination of optimal relativities under a relatively simple setup, and
that the optimization procedure for the BMS remains the same even if the a
priori risk classification is performed extensively. We refer interested readers to
the motor vehicle claims data as documented in De Jong and Heller (2008) to
conduct the a priori risk segmentation before proceeding to the determination
of optimal relativities.
Furthermore, as mentioned earlier, the inclusion of the random effect parameter 𝜃 implies that the required expressions (rather than numerical values) for the stationary probabilities $\pi^{\lambda}_{\ell}(\lambda\theta)$ used in the subsequent integrals can be found with symbolic software such as MATLAB, but not with R. Also, since the resulting form of the stationary probabilities is rather complex, in this section we choose not to include any R code for the determination of optimal relativities. More importantly, we hope the key take-away of this subsection is a solid conceptual understanding of how to account for all the relevant information in the design of a bonus-malus system.
For the Malaysian BMS with 6 levels and the transition rule of -1/Top, the
obtained numerical values of optimal relativities are presented in Table 12.5
together with the stationary probabilities. We find that around half of the poli-
cyholders will occupy the highest BMS level with the lowest premium relativity
over the long run when the stationary state has been reached. We also observe
that the constrained optimal relativities are higher than the unconstrained coun-
terparts because of the need to satisfy the financial balanced constraint.

Table 12.5. Optimal Relativities with 𝑘 = 6 Levels and Transition Rule of -1/Top

Level ℓ   Pr(𝐿 = ℓ)   𝑟ℓ        𝑟ℓ (unconstrained)
0         16.22%      131.99%   126.33%
1         11.29%      127.33%   120.66%
2         8.49%       120.64%   113.03%
3         6.69%       113.93%   105.44%
4         5.44%       107.79%   98.47%
5         51.87%      78.06%    63.15%
𝐸(𝑟𝐿 )                100%      88.87%

Moreover, we see that except for the highest BMS level (level 5), other BMS lev-
els will impose malus surcharges to policyholders occupying those levels. This
finding is not surprising since our theoretical framework here is to determine
optimal relativities given the calculation of a priori base premiums by solely
relying on claim frequency information but not claim severity. In practice, in-
surers could afford to introduce NCD levels with only discounts (bonuses) but
not loadings (maluses) because the a priori base premiums have been inflated
accordingly taking into account both the information of claim frequency and
claim severity.
For the Brazilian BMS with 7 levels and the transition rule of -1/+1, the corre-
sponding numerical values of optimal relativities are shown in Table 12.6. We
find that around three quarters of the policyholders will occupy the highest
BMS level with the lowest premium relativity in the stationary state. This
finding is mainly due to the less severe penalty in the transition rule of -1/+1
in comparison to the rule of -1/Top, so more policyholders are expected to oc-
cupy the highest BMS level. Similar to the earlier example, we find that the
unconstrained optimal relativities are lower and result in a lower value of 𝔼(𝑟𝐿 ).
Table 12.6. Optimal Relativities with 𝑘 = 7 Levels and Transition Rule of -1/+1

Level ℓ   Pr(𝐿 = ℓ)   𝑟ℓ        𝑟ℓ (unconstrained)
0         3.28%       234.94%   228.65%
1         2.21%       196.24%   189.27%
2         2.00%       168.36%   160.59%
3         2.38%       145.96%   137.03%
4         4.02%       125.53%   114.63%
5         10.38%      106.25%   91.12%
6         75.74%      85.89%    61.74%
𝐸(𝑟𝐿 )                100%      78.97%

Note that the obtained values of optimal relativities may not be desirable for commercial implementation because of the possibility of irregular differences between adjacent BMS levels. To alleviate this problem, insurers could consider imposing linear optimal relativities of the form $r_L^{\text{linear}} = a + bL$ by solving the following constrained optimization with an inequality constraint

$$ \min \; E\!\left( (\Lambda\Theta - \Lambda a - \Lambda b L)^2 \right) \quad \text{subject to } a + b\, E(L) \ge 1. $$

We refer interested readers to Tan (2016) for a discussion on how to incorporate further commercial constraints and on the solution to this optimization problem involving Kuhn-Tucker conditions.

Contributors
• Noriszura Ismail, Universiti Kebangsaan Malaysia, and Chong It Tan, Macquarie University, are the principal authors of the initial version of this chapter. Email: [email protected] or [email protected] for chapter comments and suggested improvements.
• This chapter has not yet been reviewed. Write Noriszura, Chong It, or Jed Frees ([email protected]) if you are interested.
Chapter 13

Data and Systems

Chapter Preview. This chapter covers the learning areas on data and systems
outlined in the IAA (International Actuarial Association) Education Syllabus
published in September 2015. This chapter is organized into three major parts:
data, data analysis, and data analysis techniques. The first part introduces
data basics such as data types, data structures, data storages, and data sources.
The second part discusses the process and various aspects of data analysis. The
third part presents some commonly used techniques for data analysis.

13.1 Data
13.1.1 Data Types and Sources
In terms of how data are collected, data can be divided into two types (Hox
and Boeije, 2005): primary data and secondary data. Primary data are original
data that are collected for a specific research problem. Secondary data are
data originally collected for a different purpose and reused for another research
problem. A major advantage of using primary data is that the theoretical
constructs, the research design, and the data collection strategy can be tailored
to the underlying research question to ensure that data collected help to solve
the problem. A disadvantage of using primary data is that data collection can
be costly and time consuming. Using secondary data has the advantage of lower
cost and faster access to relevant information. However, using secondary data
may not be optimal for the research question under consideration.
In terms of the degree of organization of the data, data can be also divided
into two types (Inmon and Linstedt, 2014; O’Leary, 2013; Hashem et al., 2015;
Abdullah and Ahmad, 2013; Pries and Dunnigan, 2015): structured data and
unstructured data. Structured data have a predictable and regularly occurring
format. In contrast, unstructured data lack any regularly occurring format and


have no structure that is recognizable to a computer. Structured data consist of


records, attributes, keys, and indices and are typically managed by a database
management system (DBMS) such as IBM DB2, Oracle, MySQL, and Microsoft
SQL Server. As a result, most units of structured data can be located quickly
and easily. Unstructured data have many different forms and variations. One
common form of unstructured data is text. Accessing unstructured data can be
awkward. To find a given unit of data in a long text, for example, a sequential
search is usually performed.
Data can be classified as qualitative or quantitative. Qualitative data are data
about qualities, which cannot be actually measured. As a result, qualitative
data are extremely varied in nature and include interviews, documents, and ar-
tifacts (Miles et al., 2014). Quantitative data are data about quantities, which
can be measured numerically with numbers. In terms of the level of measure-
ment, quantitative data can be further classified as nominal, ordinal, interval,
or ratio (Gan, 2011). Nominal data, also called categorical data, are discrete
data without a natural ordering. Ordinal data are discrete data with a natural
order. Interval data are continuous data with a specific order and equal inter-
vals. Ratio data are interval data with a natural zero. See Section 14.1 for a
more detailed discussion, with examples, on types of data.
There exist a number of data sources. First, data can be obtained from
university-based researchers who collect primary data. Second, data can be ob-
tained from organizations that are set up for the purpose of releasing secondary
data for the general research community. Third, data can be obtained from
national and regional statistical institutes that collect data. Finally, companies
have corporate data that can be obtained for research purposes.
While it might be difficult to obtain data to address a specific research problem
or answer a business question, it is relatively easy to obtain data to test a model
or an algorithm for data analysis. In the modern era, readers can obtain datasets
from the Internet. The following is a list of some websites to obtain real-world
data:
• UCI Machine Learning Repository. This website (url: http://archive.ics.uci.edu/ml/index.php) maintains more than 400 datasets that can be used to test machine learning algorithms.
• Kaggle. The Kaggle website (url: https://www.kaggle.com/) includes real-world datasets used for data science competitions. Readers can download data from Kaggle by registering an account.
• DrivenData. DrivenData aims at bringing cutting-edge practices in data science to solve some of the world's biggest social challenges. On its website (url: https://www.drivendata.org/), readers can participate in data science competitions and download datasets.
• Analytics Vidhya. This website (url: https://datahack.analyticsvidhya.com/contest/all/) allows you to participate in and download datasets from practice problems and hackathon problems.
• KDD Cup. KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining. This website (url: http://www.kdd.org/kdd-cup) contains the datasets used in past KDD Cup competitions since 1997.
• U.S. Government's open data. This website (url: https://www.data.gov/) contains about 200,000 datasets covering a wide range of areas including climate, education, energy, and finance.
• AWS Public Datasets. On this website (url: https://aws.amazon.com/datasets/), Amazon provides a centralized repository of public datasets, including some huge datasets.

13.1.2 Data Structures and Storage


As mentioned in the previous subsection, there are structured data as well as
unstructured data. Structured data are highly organized data and usually have
the following tabular format:

𝑉1 𝑉2 ⋯ 𝑉𝑑
x1 𝑥11 𝑥12 ⋯ 𝑥1𝑑
x2 𝑥21 𝑥22 ⋯ 𝑥2𝑑
⋮ ⋮ ⋮ ⋯ ⋮
x𝑛 𝑥𝑛1 𝑥𝑛2 ⋯ 𝑥𝑛𝑑

In other words, structured data can be organized into a table consisting of


rows and columns. Typically, each row represents a record and each column
represents an attribute. A table can be decomposed into several tables that can
be stored in a relational database such as the Microsoft SQL Server. The SQL
(Structured Query Language) can be used to access and modify the data easily
and efficiently.
Unstructured data do not follow a regular format (Abdullah and Ahmad, 2013).
Examples of unstructured data include documents, videos, and audio files. Most
of the data we encounter are unstructured data. In fact, the term “big data”
was coined to reflect this fact. Traditional relational databases cannot meet
the challenges on the varieties and scales brought by massive unstructured data
nowadays. NoSQL databases have been used to store massive unstructured
data.
There are three main NoSQL databases (Chen et al., 2014): key-value
databases, column-oriented databases, and document-oriented databases.
Key-value databases use a simple data model and store data according to
key values. Modern key-value databases have higher expandability and
smaller query response times than relational databases. Examples of key-value
databases include Dynamo used by Amazon and Voldemort used by LinkedIn.
Column-oriented databases store and process data according to columns rather
than rows. The columns and rows are segmented in multiple nodes to achieve
expandability. Examples of column-oriented databases include BigTable developed by Google and Cassandra developed by Facebook. Document databases

oped by Google and Cassandra developed by FaceBook. Document databases
are designed to support more complex data forms than those stored in key-value
databases. Examples of document databases include MongoDB, SimpleDB, and
CouchDB. MongoDB is an open-source document-oriented database that stores
documents as binary objects. SimpleDB is a distributed NoSQL database used
by Amazon. CouchDB is another open-source document-oriented database.

13.1.3 Data Quality


Accurate data are essential to useful data analysis. The lack of accurate data
may lead to significant costs to organizations in areas such as correction ac-
tivities, lost customers, missed opportunities, and incorrect decisions (Olson,
2003).
Data have quality if it satisfies its intended use; that is, the data are accurate,
timely, relevant, complete, understandable, and trusted (Olson, 2003). As a
result, we first need to know the specification of the intended uses and then
judge the suitability for those uses in order to assess data quality. Unintended
uses of data can arise from a variety of reasons and lead to serious problems.
Accuracy is the single most important component of high-quality data. Accurate
data have the following properties (Olson, 2003):
• The data elements are not missing and have valid values.
• The values of the data elements are in the right ranges and have the right
representations.
Inaccurate data arise from different sources. In particular, the following areas
are common areas where inaccurate data occur:
• Initial data entry. Mistakes (including deliberate errors) and system errors
can occur during the initial data entry. Flawed data entry processes can
result in inaccurate data.
• Data decay. Data decay, also known as data degradation, refers to the
gradual corruption of computer data due to an accumulation of non-critical
failures in a storage device.
• Data moving and restructuring. Inaccurate data can also arise from data
extracting, cleansing, transforming, loading, or integrating.
• Data usage. Faulty reporting and lack of understanding can lead to inac-
curate data.
Reverification and analysis are two approaches used to find inaccurate data
elements. The first approach is done by people, who manually check every
data element by going back to the original source of the data. The second
approach is done by software with the skills of an analyst to search through
data to find possible inaccurate data elements. To ensure that the data elements
are 100% accurate, we must use reverification. However, reverification can be
time consuming and may not be possible for some data. Analytical techniques
can also be used to identify inaccurate data elements. There are five types
of analysis that can be used to identify inaccurate data (Olson, 2003): data
element analysis, structural analysis, value correlation, aggregation correlation,
and value inspection.
Companies can create a data quality assurance program to create high-quality
databases. For more information about management of data quality issues and
data profiling techniques, readers are referred to Olson (2003).

13.1.4 Data Cleaning


Raw data usually need to be cleaned before useful analysis can be conducted. In
particular, the following areas need attention when preparing data for analysis
(Janert, 2010):
• Missing values. It is common to have missing values in raw data. De-
pending on the situation, we can discard the record, discard the variable,
or impute the missing values.
• Outliers. Raw data may contain unusual data points such as outliers. We
need to handle outliers carefully. We cannot just remove outliers without
knowing the reason for their existence. Although sometimes outliers can
be simple mistakes such as those caused by clerical errors, sometimes
their unusual behavior can point to precisely the type of effect that we are
looking for.
• Junk. Raw data may contain garbage, or junk, such as nonprintable
characters. When it happens, junk is typically rare and not easily noticed.
However, junk can cause serious problems in downstream applications.
• Format. Raw data may be formatted in a way that is inconvenient for
subsequent analysis. For example, components of a record may be split
into multiple lines in a text file. In such cases, lines corresponding to a
single record should be merged before loading to a data analysis software
such as R.
• Duplicate records. Raw data may contain duplicate records. Duplicate
records should be recognized and removed. This task may not be trivial
depending on what you consider “duplicate.”
• Merging datasets. Raw data may come from different sources. In such
cases, we need to merge data from different sources to ensure compatibility.
For more information about how to handle data in R, readers are referred to
Forte (2015) and Buttrey and Whitaker (2017).

13.2 Data Analysis Preliminaries


Data analysis involves inspecting, cleansing, transforming, and modeling data
to discover useful information to suggest conclusions and make decisions. Data
analysis has a long history. In 1962, statistician John Tukey defined data anal-
ysis as:
procedures for analyzing data, techniques for interpreting the results
of such procedures, ways of planning the gathering of data to make
its analysis easier, more precise or more accurate, and all the machin-
ery and results of (mathematical) statistics which apply to analyzing
data.
— (Tukey, 1962)
Recently, Judd and coauthors defined data analysis as the following equation
(Judd et al., 2017):

Data = Model + Error,

where Data represents a set of basic scores or observations to be analyzed, Model


is a compact representation of the data, and Error is the amount by which an
observation differs from its model representation. Using the above equation for
data analysis, an analyst must resolve the following two conflicting goals:
• to add more parameters to the model so that the model represents data
better, and
• to remove parameters from the model so that the model is simple and
parsimonious.
In this section, we give a high-level introduction to data analysis, including
different types of methods.

13.2.1 Data Analysis Process


Data analysis is part of an overall study. For example, Figure 13.1 shows the
process of a typical study in behavioral and social sciences as described in Albers
(2017). Data analysis consists of the following steps:
• Exploratory analysis. Through summary statistics and graphical rep-
resentations, understand all relevant data characteristics and determine
what type of analysis for the data makes sense.
• Statistical analysis. Perform statistical analyses such as determining
statistical significance and effect size.
• Make sense of the results. Interpret the statistical results in the
context of the overall study.
• Determine implications. Interpret the data by connecting them to the
study goals and the larger field of this study.
The goal of the data analysis as described above focuses on explaining some
phenomenon (See Section 13.2.5).
Shmueli (2010) described a general process for statistical modeling, which is
shown in Figure 13.2. Depending on the goal of the analysis, the steps differ in
terms of the choice of methods, criteria, data, and information.
Figure 13.1: The Process of a Typical Study in Behavioral and Social Sciences

Figure 13.2: The Process of Statistical Modeling

13.2.2 Exploratory versus Confirmatory


There are two phases of data analysis (Good, 1983): exploratory data analysis
(EDA) and confirmatory data analysis (CDA). Table 13.1 summarizes some
differences between EDA and CDA. EDA is usually applied to observational data
with the goal of looking for patterns and formulating hypotheses. In contrast,
CDA is often applied to experimental data (i.e., data obtained by means of a
formal design of experiments) with the goal of quantifying the extent to which
discrepancies between the model and the data could be expected to occur by
chance (Gelman, 2004).
Table 13.1. Comparison of Exploratory Data Analysis and Confirmatory Data Analysis

             EDA                          CDA
Data         Observational data           Experimental data
Goal         Pattern recognition,         Hypothesis testing,
             formulate hypotheses         estimation, prediction
Techniques   Descriptive statistics,      Traditional statistical tools of
             visualization, clustering    inference, significance, and
                                          confidence

Techniques for EDA include descriptive statistics (e.g., mean, median, standard
deviation, quantiles), distributions, histograms, correlation analysis, dimension
reduction, and cluster analysis. Techniques for CDA include the traditional
statistical tools of inference, significance, and confidence.

13.2.3 Supervised versus Unsupervised


Methods for data analysis can be divided into two types (Abbott, 2014; Igual and
Segu, 2017): supervised learning methods and unsupervised learning methods.
Supervised learning methods work with labeled data, which include a target
variable. Mathematically, supervised learning methods try to approximate the
following function:
𝑌 = 𝑓(𝑋1 , 𝑋2 , … , 𝑋𝑝 ),
where 𝑌 is a target variable and 𝑋1 , 𝑋2 , …, 𝑋𝑝 are explanatory variables. Table
13.2 gives a list of common names for different types of variables (Frees, 2009).
When the target variable is a categorical variable, supervised learning meth-
ods are called classification methods. When the target variable is continuous,
supervised learning methods are called regression methods.
Table 13.2. Common Names of Different Variables

Target Variable Explanatory Variable


Dependent variable Independent variable
Response Treatment
Output Input
Endogenous variable Exogenous variable
Predicted variable Predictor variable
Regressand Regressor

Unsupervised learning methods work with unlabeled data, which include ex-
planatory variables only. In other words, unsupervised learning methods do not
use target variables. As a result, unsupervised learning methods are also called
descriptive modeling methods.

13.2.4 Parametric versus Nonparametric


Methods for data analysis can be parametric or nonparametric (Abbott, 2014).
Parametric methods assume that the data follow a certain distribution. Non-
parametric methods, introduced in Section 4.1, do not assume distributions for
the data and therefore are called distribution-free methods.
Parametric methods have the advantage that if the distribution of the data
is known, properties of the data and properties of the method (e.g., errors,
convergence, coefficients) can be derived. A disadvantage of parametric methods
is that analysts need to spend considerable time figuring out an appropriate distribution.
For example, analysts may try different transformation methods to transform
the data so that they follow a certain distribution.
Because nonparametric methods make fewer assumptions, nonparametric meth-
ods have the advantage that they are more flexible, more robust, and more
applicable to non-quantitative data. However, a drawback of nonparametric
methods is that it is more difficult to extrapolate findings outside of the ob-
served domain of the data, a key consideration in predictive modeling.

13.2.5 Explanation versus Prediction


There are two goals in data analysis (Breiman, 2001; Shmueli, 2010): expla-
nation and prediction. In some scientific areas such as economics, psychology,
and environmental science, the focus of data analysis is to explain the causal
relationships between the input variables and the response variable. In other
scientific areas such as natural language processing and bioinformatics, the focus
of data analysis is to predict what the responses are going to be given the input
variables.
Shmueli (2010) discussed in detail the distinction between explanatory modeling
and predictive modeling. Explanatory modeling is commonly used for theory
building and testing. However, predictive modeling is rarely used in many sci-
entific fields as a tool for developing theory.
Explanatory modeling is typically done as follows:
• State the prevailing theory.
• State causal hypotheses, which are given in terms of theoretical constructs
rather than measurable variables. A causal diagram is usually included
to illustrate the hypothesized causal relationship between the theoretical
constructs.
• Operationalize constructs. In this step, previous literature and theoretical
justification are used to build a bridge between theoretical constructs and
observable measurements.
• Collect data and build models alongside the statistical hypotheses, which
are operationalized from the research hypotheses.
• Reach research conclusions and recommend policy. The statistical conclu-
sions are converted into research conclusions or policy recommendations.
Shmueli (2010) defined predictive modeling as the process of applying a statisti-
cal model or data mining algorithm to data for the purpose of predicting new or
future observations. Predictions include point predictions, interval predictions,
regions, distributions, and rankings of new observations. A predictive model
can be any method that produces predictions.

13.2.6 Data Modeling versus Algorithmic Modeling


Breiman (2001) discussed two cultures for the use of statistical modeling to reach
conclusions from data: the data modeling culture and the algorithmic modeling
culture. In the data modeling culture, data are assumed to be generated by
a given stochastic data model. In the algorithmic modeling culture, the data
mechanism is treated as unknown and algorithmic models are used.
Data modeling allows statisticians to analyze data and acquire information
about the data mechanisms. However, Breiman (2001) argued that the focus
on data modeling in the statistical community has led to some side effects such
as:
• It produced irrelevant theory and questionable scientific conclusions.
• It kept statisticians from using algorithmic models that might be more
suitable.
• It restricted the ability of statisticians to deal with a wide range of prob-
lems.
Algorithmic modeling was used by industrial statisticians a long time ago. Sadly,
the development of algorithmic methods was taken up by a community out-
side statistics (Breiman, 2001). The goal of algorithmic modeling is predictive
accuracy. For some complex prediction problems, data models are not suit-
able. These prediction problems include speech recognition, image recognition,
handwriting recognition, nonlinear time series prediction, and financial market
prediction. The theory in algorithmic modeling focuses on the properties of
algorithms, such as convergence and predictive accuracy.

13.2.7 Big Data Analysis


Unlike traditional data analysis, big data analysis employs additional methods
and tools that can extract information rapidly from massive data. In particular,
big data analysis uses the following processing methods (Chen et al., 2014):
• A bloom filter is a space-efficient probabilistic data structure that is used
to determine whether an element belongs to a set. It has the advantages
of high space efficiency and high query speed. A drawback of using a Bloom
filter is that it has a certain false positive rate.
• Hashing is a method that transforms data into fixed-length numerical
values through a hash function. It has the advantages of rapid reading
and writing. However, sound hash functions are difficult to find.
• Indexing refers to a process of partitioning data in order to speed up
reading. Hashing is a special case of indexing.
• A trie, also called digital tree, is a method to improve query efficiency by
using common prefixes of character strings to reduce comparisons among
character strings.
• Parallel computing uses multiple computing resources to complete a
computation task. Parallel computing tools include Message Passing In-
terface (MPI), MapReduce, and Dryad.
Big data analysis can be conducted in the following levels (Chen et al., 2014):
memory-level, business intelligence (BI) level, and massive level. Memory-level
analysis is conducted when data can be loaded to the memory of a cluster of
computers. Current hardware can handle hundreds of gigabytes (GB) of data
in memory. BI level analysis can be conducted when data surpass the memory
level. It is common for BI level analysis products to support data over terabytes
(TB). Massive level analysis is conducted when data surpass the capabilities of
products for BI level analysis. Usually Hadoop and MapReduce are used in
massive level analysis.

13.2.8 Reproducible Analysis


As mentioned in Section 13.2.1, a typical data analysis workflow includes col-
lecting data, analyzing data, and reporting results. The data collected are saved
in a database or files. Data are then analyzed by one or more scripts, which
may save some intermediate results or always work on the raw data. Finally a
report is produced to describe the results, which include relevant plots, tables,
and summaries of data. The workflow may be subject to the following potential
issues (Mailund, 2017, Chapter 2):
• Data are separated from the analysis scripts.
• The documentation of the analysis is separated from the analysis itself.
If the analysis is done on the raw data with a single script, then the first issue
is not a major problem. If the analysis consists of multiple scripts and a script
saves intermediate results that are read by the next script, then the scripts
describe a workflow of data analysis. To reproduce an analysis, the scripts have
to be executed in the right order. The workflow may cause major problems if
the order of the scripts is not documented or the documentation is not updated
or lost. One way to address the first issue is to write the scripts so that any
part of the workflow can be run automatically at any time.
If the documentation of the analysis is synchronized with the analysis, then the
second issue is not a major problem. However, the documentation may become
useless if the scripts are changed but the documentation is not updated.
Literate programming is an approach to address the two issues mentioned above,
where the documentation of a program and the code of the program are written
together. To do literate programming in R, one way is to use R Markdown and
the knitr package.

13.2.9 Ethical Issues


Analysts may face ethical issues and dilemmas during the data analysis process.
In some fields, ethical issues and dilemmas include participant consent, benefits,
risk, confidentiality, and data ownership (Miles et al., 2014). For data analysis
in actuarial science and insurance in particular, we face the following ethical
matters and issues (Miles et al., 2014):
• Worthiness of the project. Is the project worth doing? Will the project
contribute in some significant way to a domain broader than my career?
If a project is only opportunistic and does not have a larger significance,
then it might be pursued with less care. The result may look good but
not be right.
• Competence. Do I or the whole team have the expertise to carry out the
project? Incompetence may lead to weakness in the analytics such as col-
lecting large amounts of data poorly and drawing superficial conclusions.
• Benefits, costs, and reciprocity. Will each stakeholder gain from the
project? Are the benefits and costs equitable? A project will likely fail
if the benefits and the costs for a stakeholder do not match.
• Privacy and confidentiality. How do we make sure that the information
is kept confidential? How do we verify where raw data and analysis
results are stored? How will we have access to them? These questions should
be addressed and documented in explicit confidentiality agreements.

13.3 Data Analysis Techniques


Techniques for data analysis are drawn from different but overlapping fields such
as statistics, machine learning, pattern recognition, and data mining. Statistics
is a field that addresses reliable ways of gathering data and making inferences
(Bandyopadhyay and Forster, 2011; Bluman, 2012). The term machine learning
was coined by Samuel in 1959 (Samuel, 1959). Originally, machine learning
referred to the field of study where computers have the ability to learn without
being explicitly programmed. Nowadays, machine learning has evolved to the
broad field of study where computational methods use experience (i.e., the past
information available for analysis) to improve performance or to make accurate
predictions (Bishop, 2007; Clarke et al., 2009; Mohri et al., 2012; Kubat, 2017).
There are four types of machine learning algorithms (see Table 13.3) depending
on the type of data and the type of the learning tasks.

Table 13.3. Types of Machine Learning Algorithms

Supervised Unsupervised
Discrete Data Classification Clustering
Continuous Data Regression Dimension reduction

Originating in engineering, pattern recognition is a field that is closely related


to machine learning, which grew out of computer science. In fact, pattern
recognition and machine learning can be considered to be two facets of the same
field (Bishop, 2007). Data mining is a field that concerns collecting, cleaning,
processing, analyzing, and gaining useful insights from data (Aggarwal, 2015).

13.3.1 Exploratory Techniques


Exploratory data analysis techniques include descriptive statistics as well as
many unsupervised learning techniques such as data clustering and principal
component analysis.
Descriptive Statistics
In one sense (as a “mass noun”), “descriptive statistics” is an area of statistics
that concerns the collection, organization, summarization, and presentation of
data (Bluman, 2012). In another sense (as a “count noun”), “descriptive statis-
tics” are summary statistics that quantitatively describe or summarize data.
Table 13.4. Commonly Used Descriptive Statistics

Descriptive Statistics
Measures of central tendency Mean, median, mode, midrange
Measures of variation Range, variance, standard deviation
Measures of position Quantile

Table 13.4 lists some commonly used descriptive statistics. In R, we can use the
function summary to calculate some of the descriptive statistics. For numeric
data, we can visualize the descriptive statistics using a boxplot.
In addition to these quantitative descriptive statistics, we can also qualitatively
describe shapes of the distributions (Bluman, 2012). For example, we can say
that a distribution is positively skewed, symmetric, or negatively skewed. To
visualize the distribution of a variable, we can draw a histogram.

Principal Component Analysis


Principal component analysis (PCA) is a statistical procedure that transforms
a dataset described by possibly correlated variables into a dataset described
by linearly uncorrelated variables, which are called principal components and
are ordered according to their variances. PCA is a technique for dimension
reduction. If the original variables are highly correlated, then the first few
principal components can account for most of the variation of the original data.
The principal components of the variables are related to the eigenvalues and
eigenvectors of the covariance matrix of the variables. For 𝑖 = 1, 2, … , 𝑑, let
(𝜆𝑖 , e𝑖 ) be the 𝑖th eigenvalue-eigenvector pair of the covariance matrix Σ of 𝑑
variables 𝑋1 , 𝑋2 , … , 𝑋𝑑 such that 𝜆1 ≥ 𝜆2 ≥ … ≥ 𝜆𝑑 ≥ 0 and the eigenvectors
are normalized. Then the 𝑖th principal component is given by

$$Z_i = \mathbf{e}_i' \mathbf{X} = \sum_{j=1}^{d} e_{ij} X_j,$$

where X = (𝑋1 , 𝑋2 , … , 𝑋𝑑 )′ . It can be shown that Var (𝑍𝑖 ) = 𝜆𝑖 . As a result,


the proportion of variance explained by the 𝑖th principal component is calculated
as

$$\frac{\operatorname{Var}(Z_i)}{\sum_{j=1}^{d} \operatorname{Var}(Z_j)} = \frac{\lambda_i}{\lambda_1 + \lambda_2 + \cdots + \lambda_d}.$$
For more information about PCA, readers are referred to Mirkin (2011).
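As an illustrative sketch (not an example from the chapter's data), the following R code applies prcomp to simulated correlated variables and reports the proportion of variance explained by each principal component; the variables x1, x2, and x3 are artificial.

# Illustrative PCA on simulated correlated data
set.seed(123)
x1 <- rnorm(200)
x2 <- 0.8 * x1 + 0.6 * rnorm(200)   # correlated with x1
x3 <- rnorm(200)                    # mostly independent noise
dat <- data.frame(x1, x2, x3)

pca <- prcomp(dat, center = TRUE, scale. = TRUE)
summary(pca)                        # standard deviations and proportion of variance explained
pca$rotation                        # loadings, i.e., the eigenvectors e_i
(pca$sdev^2) / sum(pca$sdev^2)      # proportion of variance, lambda_i / sum(lambda_j)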

Cluster Analysis
Cluster analysis (aka data clustering) refers to the process of dividing a dataset
into homogeneous groups or clusters such that points in the same cluster are
similar and points from different clusters are quite distinct (Gan et al., 2007;
Gan, 2011). Data clustering is one of the most popular tools for exploratory
data analysis and has found its applications in many scientific areas.
During the past several decades, many clustering algorithms have been proposed.
Among these clustering algorithms, the 𝑘-means algorithm is perhaps the most
well-known algorithm due to its simplicity. To describe the k-means algorithm,
let 𝑋 = {x1 , x2 , … , x𝑛 } be a dataset containing 𝑛 points, each of which is
described by 𝑑 numerical features. Given a desired number of clusters 𝑘, the
𝑘-means algorithm aims at minimizing the following objective function:

$$P(U, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} u_{il} \| \mathbf{x}_i - \mathbf{z}_l \|^2,$$

where 𝑈 = (𝑢𝑖𝑙 )𝑛×𝑘 is an 𝑛 × 𝑘 partition matrix, 𝑍 = {z1 , z2 , … , z𝑘 } is a set of


cluster centers, and ‖ ⋅ ‖ is the 𝐿2 norm or Euclidean distance. The partition
matrix 𝑈 satisfies the following conditions:

$$u_{il} \in \{0, 1\}, \quad i = 1, 2, \ldots, n, \; l = 1, 2, \ldots, k,$$
$$\sum_{l=1}^{k} u_{il} = 1, \quad i = 1, 2, \ldots, n.$$

The 𝑘-means algorithm employs an iterative procedure to minimize the objective


function. It repeatedly updates the partition matrix 𝑈 and the cluster centers
𝑍 alternately until some stop criterion is met. For more information about
𝑘-means, readers are referred to Gan et al. (2007) and Mirkin (2011).
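The following R sketch illustrates the 𝑘-means algorithm on simulated two-dimensional data; the simulated groups and the choice of 𝑘 = 2 are illustrative assumptions only.

# Illustrative k-means clustering on simulated two-dimensional data
set.seed(123)
grp1 <- matrix(rnorm(100, mean = 0), ncol = 2)
grp2 <- matrix(rnorm(100, mean = 3), ncol = 2)
dat  <- rbind(grp1, grp2)

fit <- kmeans(dat, centers = 2, nstart = 20)  # k = 2 clusters, 20 random starts
fit$centers                                   # estimated cluster centers z_l
table(fit$cluster)                            # cluster sizes
fit$tot.withinss                              # minimized objective P(U, Z)
plot(dat, col = fit$cluster, pch = 19)        # visualize the clusters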

13.3.2 Confirmatory Techniques


Confirmatory data analysis techniques include the traditional statistical tools
of inference, significance, and confidence.

Linear Models
Linear models, also called linear regression models, aim at using a linear function
to approximate the relationship between the dependent variable and indepen-
dent variables. A linear regression model is called a simple linear regression
model if there is only one independent variable. When more than one indepen-
dent variable is involved, a linear regression model is called a multiple linear
regression model.
Let 𝑋 and 𝑌 denote the independent and the dependent variables, respectively.
For 𝑖 = 1, 2, … , 𝑛, let (𝑥𝑖 , 𝑦𝑖 ) be the observed values of (𝑋, 𝑌 ) in the 𝑖th case.
Then the simple linear regression model is specified as follows (Frees, 2009):

𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜖𝑖 , 𝑖 = 1, 2, … , 𝑛,

where 𝛽0 and 𝛽1 are parameters and 𝜖𝑖 is a random variable representing the


error for the 𝑖th case.
When there are multiple independent variables, the following multiple linear
regression model is used:

𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖1 + ⋯ + 𝛽𝑘 𝑥𝑖𝑘 + 𝜖𝑖 ,

where 𝛽0 , 𝛽1 , …, 𝛽𝑘 are unknown parameters to be estimated.


Linear regression models usually make the following assumptions:
(a) 𝑥𝑖1 , 𝑥𝑖2 , … , 𝑥𝑖𝑘 are nonstochastic variables.
(b) Var (𝑦𝑖 ) = 𝜎2 , where Var (𝑦𝑖 ) denotes the variance of 𝑦𝑖 .
(c) 𝑦1 , 𝑦2 , … , 𝑦𝑛 are independent random variables.
For the purpose of obtaining tests and confidence statements with small samples,
the following strong normality assumption is also made:
(d) 𝜖1 , 𝜖2 , … , 𝜖𝑛 are normally distributed.
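As a minimal illustration, the following R sketch fits a multiple linear regression with lm to simulated data; the variable names and true parameter values are assumptions made only for this example.

# Illustrative multiple linear regression on simulated data
set.seed(123)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.5)  # true parameters: 1, 2, -0.5

fit <- lm(y ~ x1 + x2)
summary(fit)      # estimates, standard errors, and t-tests under assumption (d)
confint(fit)      # confidence intervals for the regression parameters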

Generalized Linear Models


The generalized linear model (GLM) consists of a wide family of regression
models that include linear regression models as a special case. In a GLM, the
mean of the response (i.e., the dependent variable) is assumed to be a function
of linear combinations of the explanatory variables, i.e.,

𝜇𝑖 = E [𝑦𝑖 ],

𝜂𝑖 = x′𝑖 𝛽 = 𝑔(𝜇𝑖 ),

where x𝑖 = (1, 𝑥𝑖1 , 𝑥𝑖2 , … , 𝑥𝑖𝑘 )′ is a vector of regressor values, 𝜇𝑖 is the mean
response for the 𝑖th case, and 𝜂𝑖 is a systematic component of the GLM. The
function 𝑔(⋅) is known and is called the link function. The mean response can
vary by observations by allowing some parameters to change. However, the re-
gression parameters 𝛽 are assumed to be the same among different observations.
GLMs make the following assumptions:
(a) 𝑥𝑖1 , 𝑥𝑖2 , … , 𝑥𝑖𝑛 are nonstochastic variables.
(b) 𝑦1 , 𝑦2 , … , 𝑦𝑛 are independent.
(c) The dependent variable is assumed to follow a distribution from the linear
exponential family.
(d) The variance of the dependent variable is not assumed to be constant but
is a function of the mean, i.e.,

Var (𝑦𝑖 ) = 𝜙𝜈(𝜇𝑖 ),

where 𝜙 denotes the dispersion parameter and 𝜈(⋅) is a function.


As we can see from the above specification, the GLM provides a unifying frame-
work to handle different types of dependent variables, including discrete and
continuous variables. For more information about GLMs, readers are referred
to De Jong and Heller (2008) and Frees (2009).
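As an illustrative sketch, the following R code fits a Poisson GLM with a log link to simulated count data using glm; the simulated design and coefficient values are assumptions for the example.

# Illustrative Poisson GLM (log link) for simulated claim counts
set.seed(123)
n      <- 500
x      <- rnorm(n)
lambda <- exp(0.2 + 0.5 * x)          # mean response mu_i = g^{-1}(eta_i)
counts <- rpois(n, lambda)

fit <- glm(counts ~ x, family = poisson(link = "log"))
summary(fit)                          # coefficient estimates and deviance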

Tree-based Models
Decision trees, also known as tree-based models, involve dividing the predictor
space (i.e., the space formed by independent variables) into a number of simple
regions and using the mean or the mode of the region for prediction (Breiman
et al., 1984). There are two types of tree-based models: classification trees
and regression trees. When the dependent variable is categorical, the result-
ing tree models are called classification trees. When the dependent variable is
continuous, the resulting tree models are called regression trees.
The process of building classification trees is similar to that of building regression
trees. Here we only briefly describe how to build a regression tree. To do
that, the predictor space is divided into non-overlapping regions such that the
following objective function

$$f(R_1, R_2, \ldots, R_J) = \sum_{j=1}^{J} \sum_{i=1}^{n} I_{R_j}(\mathbf{x}_i)(y_i - \mu_j)^2$$

is minimized, where 𝐼 is an indicator function, 𝑅𝑗 denotes the set of indices


of the observations that belong to the 𝑗th box, 𝜇𝑗 is the mean response of the
observations in the 𝑗th box, x𝑖 is the vector of predictor values for the 𝑖th
observation, and 𝑦𝑖 is the response value for the 𝑖th observation.
In terms of predictive accuracy, decision trees generally do not perform to the
level of other regression and classification models. However, tree-based models
may outperform linear models when the relationship between the response and
the predictors is nonlinear. For more information about decision trees, readers
are referred to Breiman et al. (1984) and Mitchell (1997).
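The following hedged R sketch grows a regression tree with the rpart package on simulated data having a nonlinear, step-shaped relationship; the data and settings are illustrative assumptions only.

# Illustrative regression tree on simulated nonlinear data
library(rpart)

set.seed(123)
x <- runif(300, 0, 10)
y <- ifelse(x < 5, 2, 8) + rnorm(300, sd = 0.5)    # step-shaped (nonlinear) relationship
dat <- data.frame(x, y)

fit <- rpart(y ~ x, data = dat, method = "anova")  # regression tree
printcp(fit)                                       # complexity parameter table
predict(fit, newdata = data.frame(x = c(2, 7)))    # region means used as predictions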
13.4 Some R Functions


R is open-source software for statistical computing and graphics. The R
software can be downloaded from the R project website at https://www.r-project.org/.
In this section, we give some R functions for data analysis, especially for
the data analysis tasks mentioned in previous sections.
Table 13.5. Some R Functions for Data Analysis

Data Analysis Task R Package R Function


Descriptive Statistics base summary
Principal Component Analysis stats prcomp
Data Clustering stats kmeans, hclust
Fitting Distributions MASS fitdistr
Linear Regression Models stats lm
Generalized Linear Models stats glm
Regression Trees rpart rpart
Survival Analysis survival survfit

Table 13.5 lists a few R functions for different data analysis tasks. Readers can
go to the R documentation to learn how to use these functions. There are also
other R packages that do similar things. However, the functions listed in this
table provide good starting points for readers to conduct data analysis in R. For
analyzing large datasets in R in an efficient way, readers are referred to Daroczi
(2015).
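As one starting point, the following sketch shows how the fitdistr function from the MASS package listed in Table 13.5 might be used to fit a gamma distribution by maximum likelihood; the severity data are simulated for illustration.

# Illustrative use of MASS::fitdistr to fit a distribution by maximum likelihood
library(MASS)

set.seed(123)
sev <- rgamma(500, shape = 2, rate = 0.1)      # simulated severity data

fit <- fitdistr(sev, densfun = "gamma")        # MLE of shape and rate
fit$estimate                                   # fitted parameters
fit$sd                                         # approximate standard errors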

13.5 Summary
In this chapter, we give a high-level overview of data analysis by introducing
data types, data structures, data storages, data sources, data analysis processes,
and data analysis techniques. In particular, we present various aspects of data
analysis. In addition, we provide several websites where readers can obtain real-
world datasets to hone their data analysis skills. We also list some R packages
and functions that can be used to perform various data analysis tasks.

13.6 Further Resources and Contributors


Contributor
• Guojun Gan, University of Connecticut, is the principal author of the
initial version of this chapter. Email: [email protected] for chapter
comments and suggested improvements.
• Chapter reviewers include: Runhuan Feng, Himchan Jeong, Lei Hua, Min
Ji, and Toby White.
Chapter 14

Dependence Modeling

Chapter Preview. In practice, there are many types of variables that one encoun-
ters. The first step in dependence modeling is identifying the type of variable
to help direct you to the appropriate technique. This chapter introduces read-
ers to variable types and techniques for modeling dependence or association of
multivariate distributions. Section 14.1 provides an overview of the types of vari-
ables. Section 14.2 then elaborates basic measures for modeling the dependence
between variables.
Section 14.3 introduces an approach to modeling dependence using copulas
which is reinforced with practical illustrations in Section 14.4. The types of
copula families and basic properties of copula functions are explained in Section
14.5. The chapter concludes by explaining why the study of dependence model-
ing is important in Section 14.6.

14.1 Variable Types

In this section, you learn how to:


• Classify variables as qualitative or quantitative.
• Describe multivariate variables.

People, firms, and other entities that we want to understand are described in
a dataset by numerical characteristics. As these characteristics vary by entity,
they are commonly known as variables. To manage insurance systems, it will
be critical to understand the distribution of each variable and how they are
associated with one another. It is common for data sets to have many variables
(high dimensional) and so it is useful to begin by classifying them into different
types. As will be seen, these classifications are not strict; there is overlap among
the groups. Nonetheless, the grouping summarized in Table 14.1 and explained
in the remainder of this section provides a solid first step in framing a data set.
Table 14.1. Variable Types

Variable Type Example


𝑄𝑢𝑎𝑙𝑖𝑡𝑎𝑡𝑖𝑣𝑒
Binary Sex
Categorical (Unordered, Nominal) Territory (e.g., state/province) in which an insured resides
Ordered Category (Ordinal) Claimant satisfaction (five point scale ranging from 1=dissatisfied
to 5 =satisfied)
𝑄𝑢𝑎𝑛𝑡𝑖𝑡𝑎𝑡𝑖𝑣𝑒
Continuous Policyholder’s age, weight, income
Discrete Amount of deductible (0, 250, 500, and 1000)
Count Number of insurance claims
Combinations of Policy losses, mixture of 0’s (for no loss)
Discrete and Continuous and positive claim amount
Interval Variable Driver Age: 16-24 (young), 25-54 (intermediate),
55 and over (senior)
Circular Data Time of day measures of customer arrival
𝑀 𝑢𝑙𝑡𝑖𝑣𝑎𝑟𝑖𝑎𝑡𝑒 𝑉 𝑎𝑟𝑖𝑎𝑏𝑙𝑒
High Dimensional Data Characteristics of a firm purchasing worker’s compensation
insurance (location of plants, industry, number of employees,
and so on)
Spatial Data Longitude/latitude of the location an insurance hailstorm claim
Missing Data Policyholder’s age (continuous/interval) and -99 for ‘not reported,’ that is, missing
Censored and Truncated Data Amount of insurance claims in excess of a deductible
Aggregate Claims Losses recorded for each claim in a motor vehicle policy.
Stochastic Process Realizations The time and amount of each occurrence of an insured loss

In data analysis, it is important to understand what type of variable


you are working with. For example, consider a pair of random variables
(Coverage, Claim) from the LGPIF data introduced in Section 1.3 as displayed
in Figure 14.1 below. We would like to know whether the distribution of
Coverage depends on the distribution of Claim or whether they are statistically
independent. We would also want to know how the Claim distribution depends
on the EntityType variable. Because the EntityType variable belongs to a dif-
ferent class of variables, modeling the dependence between Claim and Coverage
may require a different technique from that of Claim and EntityType.

14.1.1 Qualitative Variables

In this sub-section, you learn how to:


Figure 14.1: Scatter Plot of (Coverage, Claim) from LGPIF Data. The plot shows Claim (dollars) against Coverage (millions), with points distinguished by EntityType (City, County, Misc, School, Town, Village).

• Classify qualitative variables as nominal or ordinal


• Describe a binary variable

A qualitative, or categorical variable is one for which the measurement denotes


membership in a set of groups, or categories. For example, if you were coding in
which area of the country an insured resides, you might use 1 for the northern
part, 2 for southern, and 3 for everything else. This location variable is an
example of a nominal variable, one for which the levels have no natural ordering.
Any analysis of nominal variables should not depend on the labeling of the
categories. For example, instead of using a 1,2,3 for north, south, other, I
should arrive at the same set of summary statistics if I used a 2,1,3 coding
instead, interchanging north and south.

In contrast, an ordinal variable is a type of categorical variable for which an


ordering does exist. For example, with a survey to see how satisfied customers
are with our claims servicing department, we might use a five point scale that
ranges from 1 meaning dissatisfied to a 5 meaning satisfied. Ordinal variables
provide a clear ordering of levels of a variable but the amount of separation
between levels is unknown.

A binary variable is a special type of categorical variable where there are only
two categories commonly taken to be 0 and 1. For example, we might code a
variable in a dataset to be 1 if an insured is female and 0 if male.
14.1.2 Quantitative Variables

In this sub-section, you learn how to:


• Differentiate between continuous and discrete variables
• Use a combination of continuous and discrete variables
• Describe circular data

Unlike a qualitative variable, a quantitative variable is one in which each nu-


merical level is a realization from some scale so that the distance between any
two levels of the scale takes on meaning. A continuous variable is one that can
take on any value within a finite interval. For example, one could represent a
policyholder’s age, weight, or income, as continuous variables. In contrast, a
discrete variable is one that takes on only a finite number of values in any finite
interval. For example, when examining a policyholder’s choice of deductibles, it
may be that values of 0, 250, 500, and 1000 are the only possible outcomes. Like
an ordinal variable, these represent distinct categories that are ordered. Unlike
an ordinal variable, the numerical difference between levels takes on economic
meaning. A special type of discrete variable is a count variable, one with values
on the nonnegative integers. For example, we will be particularly interested in
the number of claims arising from a policy during a given period.
Some variables are inherently a combination of discrete and continuous compo-
nents. For example, when we analyze the insured loss of a policyholder, we will
encounter a discrete outcome at zero, representing no insured loss, and a con-
tinuous amount for positive outcomes, representing the amount of the insured
loss. Another interesting variation is an interval variable, one that gives a range
of possible outcomes.
Circular data represent an interesting category typically not analyzed by insur-
ers. As an example of circular data, suppose that you monitor calls to your
customer service center and would like to know when is the peak time of the
day for calls to arrive. In this context, one can think about the time of the day
as a variable with realizations on a circle, e.g., imagine an analog picture of a
clock. For circular data, observations at 00:15 and 00:45 are just as close to
each other as observations at 23:45 and 00:15 (the convention HH:MM means
hours and minutes).

14.1.3 Multivariate Variables

In this sub-section, you learn how to:


• Differentiate between univariate and multivariate data
• Handle missing variables
Insurance data typically are multivariate in the sense that we can take many
measurements on a single entity. For example, when studying losses associated
with a firm’s workers’ compensation plan, we might want to know the location
of its manufacturing plants, the industry in which it operates, the number of
employees, and so forth. The usual strategy for analyzing multivariate data is
to begin by examining each variable in isolation of the others. This is known as
a univariate approach.
In contrast, for some variables, it makes little sense to only look at one dimen-
sional aspect. For example, insurers typically organize spatial data by longitude
and latitude to analyze the location of weather related insurance claims due to
hailstorms. Having only a single number, either longitude or latitude, provides
little information in understanding geographic location.
Another special case of a multivariate variable, less obvious, involves coding for
missing data. Historically, some statistical packages used a -99 to report when a
variable, such as policyholder’s age, was not available or not reported. This led
to many unsuspecting analysts providing strange statistics when summarizing
a set of data. When data are missing, it is better to think about the variable as
having two dimensions, one to indicate whether or not the variable is reported
and the second providing the age (if reported). In the same way, insurance data
are commonly censored and truncated. We refer you to Section 4.3 for more
on censored and truncated data. Aggregate claims, described in Chapter 5, can
also be coded as another special type of multivariate variable.
Perhaps the most complicated type of multivariate variable is a realization of
a stochastic process. You will recall that a stochastic process is little more
than a collection of random variables. For example, in insurance, we might
think about the times that claims arrive to an insurance company in a one-
year time horizon. This is a high dimensional variable that theoretically is
infinite dimensional. Special techniques are required to understand realizations
of stochastic processes that will not be addressed here.

14.2 Classic Measures of Scalar Associations

In this section, you learn how to:


• Estimate correlation using the Pearson method
• Use rank based measures like Spearman, Kendall to estimate correlation
• Measure dependence using the odds ratio, Pearson chi-square, and likeli-
hood ratio test statistics
• Use normal-based correlations to quantify associations involving ordinal
variables
14.2.1 Association Measures for Quantitative Variables


For this section, consider a pair of random variables (𝑋, 𝑌 ) having joint dis-
tribution function 𝐹 (⋅) and a random sample (𝑋𝑖 , 𝑌𝑖 ), 𝑖 = 1, … , 𝑛. For the
continuous case, suppose that 𝐹 (⋅) has absolutely continuous marginals with
marginal density functions.

Pearson Correlation
̂ 𝑛 ̄ 𝑖 − 𝑌 ̄ ),
Define the sample covariance function 𝐶𝑜𝑣(𝑋, 𝑌 ) = 𝑛1 ∑𝑖=1 (𝑋𝑖 − 𝑋)(𝑌
where 𝑋̄ and 𝑌 ̄ are the sample means of 𝑋 and 𝑌 , respectively. Then, the
product-moment (Pearson) correlation can be written as

̂
𝐶𝑜𝑣(𝑋, 𝑌) ̂
𝐶𝑜𝑣(𝑋, 𝑌)
𝑟= = .
̂
√𝐶𝑜𝑣(𝑋, ̂ ,𝑌)
𝑋)𝐶𝑜𝑣(𝑌 √𝑉̂
𝑎𝑟(𝑋)√𝑉̂
𝑎𝑟(𝑌 )

The correlation statistic 𝑟 is widely used to capture linear association between


random variables. It is a (nonparametric) estimator of the correlation parameter
𝜌, defined to be the covariance divided by the product of standard deviations.
This statistic has several important features. Unlike regression estimators, it is
symmetric between random variables, so the correlation between 𝑋 and 𝑌 equals
the correlation between 𝑌 and 𝑋. It is unchanged by linear transformations of
random variables (up to sign changes) so that we can multiply random variables
or add constants as is helpful for interpretation. The range of the statistic is
[−1, 1] which does not depend on the distribution of either 𝑋 or 𝑌 .
Further, in the case of independence, the correlation coefficient 𝑟 is 0. However,
it is well known that zero correlation does not in general imply independence;
one exception is the case of normally distributed random variables. The cor-
relation statistic 𝑟 is also a (maximum likelihood) estimator of the association
parameter for the bivariate normal distribution. So, for normally distributed
data, the correlation statistic 𝑟 can be used to assess independence. For addi-
tional interpretations of this well-known statistic, readers will enjoy Lee Rodgers
and Nicewander (1998).
You can obtain the Pearson correlation statistic 𝑟 using the cor() function in
R and selecting the pearson method. This is demonstrated below by using the
Coverage rating variable in millions of dollars and Claim amount variable in
dollars from the LGPIF data introduced in chapter 1.
From the R output above, 𝑟 = 0.31, which indicates a positive association be-
tween Claim and Coverage. This means that as the coverage amount of a policy
increases we expect claims to increase.
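The chapter's calculations use the LGPIF data, which are not reproduced here; the following hedged sketch assumes a data frame dat with columns Coverage and Claim (hypothetical names) and shows the call that produces the Pearson correlation.

# Hedged sketch: Pearson correlation between Coverage and Claim,
# assuming a data frame `dat` with columns Coverage (millions) and Claim (dollars)
cor(dat$Coverage, dat$Claim, method = "pearson")
# The same call on a transformed variable can give a quite different value, e.g.,
cor(log(dat$Coverage), dat$Claim, method = "pearson")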
14.2.2 Rank Based Measures


Spearman’s Rho
The Pearson correlation coefficient does have the drawback that it is not invari-
ant to nonlinear transforms of the data. For example, the correlation between 𝑋
and log 𝑌 can be quite different from the correlation between 𝑋 and 𝑌 . As we
see from the R code for the Pearson correlation statistic above, the correlation
statistic 𝑟 between the Coverage rating variable in logarithmic millions of dol-
lars and the Claim amounts variable in dollars is 0.1 as compared to 0.31 when
we calculate the correlation between the Coverage rating variable in millions of
dollars and the Claim amounts variable in dollars. This limitation is one reason
for considering alternative statistics.
Alternative measures of correlation are based on ranks of the data. Let 𝑅(𝑋𝑗 )
denote the rank of 𝑋𝑗 from the sample 𝑋1 , … , 𝑋𝑛 and similarly for 𝑅(𝑌𝑗 ).

Let 𝑅(𝑋) = (𝑅(𝑋1 ), … , 𝑅(𝑋𝑛 )) denote the vector of ranks, and similarly for
𝑅(𝑌 ). For example, if 𝑛 = 3 and 𝑋 = (24, 13, 109), then 𝑅(𝑋) = (2, 1, 3).
A comprehensive introduction of rank statistics can be found in, for example,
Hettmansperger (1984). Also, ranks can be used to obtain the empirical dis-
tribution function, refer to Section 4.1.1 for more on the empirical distribution
function.
With this, the correlation measure of Spearman (1904) is simply the product-moment correlation computed on the ranks:

$$r_S = \frac{\widehat{Cov}(R(X), R(Y))}{\sqrt{\widehat{Cov}(R(X), R(X))\,\widehat{Cov}(R(Y), R(Y))}} = \frac{\widehat{Cov}(R(X), R(Y))}{(n^2 - 1)/12}.$$

You can obtain the Spearman correlation statistic 𝑟𝑆 using the cor() function
in R and selecting the spearman method. From below, the Spearman correlation
between the Coverage rating variable in millions of dollars and Claim amount
variable in dollars is 0.41.
We can show that the Spearman correlation statistic is invariant under strictly
increasing transformations. From the R Code for the Spearman correlation
statistic above, 𝑟𝑆 = 0.41 between the Coverage rating variable in logarithmic
millions of dollars and Claim amount variable in dollars.
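A hedged sketch of the corresponding Spearman calculation, again assuming a hypothetical data frame dat with columns Coverage and Claim, is as follows.

# Hedged sketch: Spearman correlation, assuming columns Coverage and Claim in `dat`
cor(dat$Coverage, dat$Claim, method = "spearman")
# Invariance under strictly increasing transformations:
cor(log(dat$Coverage), dat$Claim, method = "spearman")   # same value as above
# Equivalently, the Pearson correlation computed on the ranks:
cor(rank(dat$Coverage), rank(dat$Claim), method = "pearson")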

Kendall’s Tau
An alternative measure that uses ranks is based on the concept of concordance.
An observation pair (𝑋, 𝑌 ) is said to be concordant (discordant) if the ob-
servation with a larger value of 𝑋 has also the larger (smaller) value of 𝑌 .
Then Pr(𝑐𝑜𝑛𝑐𝑜𝑟𝑑𝑎𝑛𝑐𝑒) = Pr[(𝑋1 − 𝑋2 )(𝑌1 − 𝑌2 ) > 0] , Pr(𝑑𝑖𝑠𝑐𝑜𝑟𝑑𝑎𝑛𝑐𝑒) =
Pr[(𝑋1 − 𝑋2 )(𝑌1 − 𝑌2 ) < 0], Pr(𝑡𝑖𝑒) = Pr[(𝑋1 − 𝑋2 )(𝑌1 − 𝑌2 ) = 0] and
𝜏 (𝑋, 𝑌 ) = Pr(𝑐𝑜𝑛𝑐𝑜𝑟𝑑𝑎𝑛𝑐𝑒) − Pr(𝑑𝑖𝑠𝑐𝑜𝑟𝑑𝑎𝑛𝑐𝑒)


= 2 Pr(𝑐𝑜𝑛𝑐𝑜𝑟𝑑𝑎𝑛𝑐𝑒) − 1 + Pr(𝑡𝑖𝑒).

To estimate this, the pairs (𝑋𝑖 , 𝑌𝑖 ) and (𝑋𝑗 , 𝑌𝑗 ) are said to be concordant if the
product 𝑠𝑔𝑛(𝑋𝑗 − 𝑋𝑖 )𝑠𝑔𝑛(𝑌𝑗 − 𝑌𝑖 ) equals 1 and discordant if the product equals
-1. Here, 𝑠𝑔𝑛(𝑥) = 1, 0, −1 as 𝑥 > 0, 𝑥 = 0, 𝑥 < 0, respectively. With this, we
can express the association measure of Kendall (1938), known as Kendall’s tau,
as

$$\hat{\tau} = \frac{2}{n(n-1)} \sum_{i<j} sgn(X_j - X_i) \times sgn(Y_j - Y_i) = \frac{2}{n(n-1)} \sum_{i<j} sgn(R(X_j) - R(X_i)) \times sgn(R(Y_j) - R(Y_i)).$$

Interestingly, Hougaard (2000), page 137, attributes the original discovery of this
statistic to Fechner (1897), noting that Kendall’s discovery was independent and
more complete than the original work.
You can obtain Kendall’s tau using the cor() function in R and selecting the
kendall method. From below, 𝜏 ̂ = 0.32 between the Coverage rating variable
in millions of dollars and the Claim amount variable in dollars. When there are
ties in the data, the cor() function computes Kendall’s tau_b as proposed by
Kendall (1945).
Also, to show that the Kendall’s tau is invariant under strictly increasing trans-
formations, we see that 𝜏 ̂ = 0.32 between the Coverage rating variable in loga-
rithmic millions of dollars and the Claim amount variable in dollars.
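Similarly, a hedged sketch of the Kendall calculation, under the same hypothetical data frame dat, is:

# Hedged sketch: Kendall's tau, assuming columns Coverage and Claim in `dat`
cor(dat$Coverage, dat$Claim, method = "kendall")
cor(log(dat$Coverage), dat$Claim, method = "kendall")    # unchanged by the log transform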

14.2.3 Nominal Variables


Bernoulli Variables
To see why dependence measures for continuous variables may not be the best
for discrete variables, let us focus on the case of Bernoulli variables that take on
simple binary outcomes, 0 and 1. For notation, let 𝜋𝑗𝑘 = Pr(𝑋 = 𝑗, 𝑌 = 𝑘) for
𝑗, 𝑘 = 0, 1 and let 𝜋𝑋 = Pr(𝑋 = 1) and similarly for 𝜋𝑌 . Then, the population
version of the product-moment (Pearson) correlation can be easily seen to be

$$\rho = \frac{\pi_{11} - \pi_X \pi_Y}{\sqrt{\pi_X(1 - \pi_X)\pi_Y(1 - \pi_Y)}}.$$

Unlike the case for continuous data, it is not possible for this measure to achieve
the limiting boundaries of the interval [−1, 1]. To see this, students of probability
may recall the Fréchet-Höeffding bounds for a joint distribution that turn out
to be max{0, 𝜋𝑋 + 𝜋𝑌 − 1} ≤ 𝜋11 ≤ min{𝜋𝑋 , 𝜋𝑌 } for this joint probability.
(More discussion of these bounds is in Section 14.5.4.) This limit on the joint
probability imposes an additional restriction on the Pearson correlation. As an
illustration, assume equal probabilities $\pi_X = \pi_Y = \pi > 1/2$. Then, the lower bound is

$$\frac{2\pi - 1 - \pi^2}{\pi(1 - \pi)} = -\frac{1 - \pi}{\pi}.$$

For example, if 𝜋 = 0.8, then the smallest that the Pearson correlation could be
is -0.25. More generally, there are bounds on 𝜌 that depend on 𝜋𝑋 and 𝜋𝑌 that
make it difficult to interpret this measure.
As noted by Bishop et al. (1975) (page 382), squaring this correlation coefficient
yields the Pearson chi-square statistic (introduced in Section 2.7). Despite the
boundary problems described above, this feature makes the Pearson correlation
coefficient a good choice for describing dependence with binary data.
As an alternative measure for Bernoulli variables, the odds ratio is given by

$$OR(\pi_{11}) = \frac{\pi_{11} \pi_{00}}{\pi_{01} \pi_{10}} = \frac{\pi_{11}(1 + \pi_{11} - \pi_X - \pi_Y)}{(\pi_X - \pi_{11})(\pi_Y - \pi_{11})}.$$

Pleasant calculations show that 𝑂𝑅(𝑧) is 0 at the lower Fréchet-Höeffding bound


𝑧 = max{0, 𝜋𝑋 + 𝜋𝑌 − 1} and is ∞ at the upper bound 𝑧 = min{𝜋𝑋 , 𝜋𝑌 }. Thus,
the bounds on this measure do not depend on the marginal probabilities 𝜋𝑋
and 𝜋𝑌 , making it easier to interpret this measure.
As noted by Yule (1900), odds ratios are invariant to the labeling of 0 and 1.
Further, they are invariant to the marginals in the sense that one can rescale 𝜋𝑋
and 𝜋𝑌 by positive constants and the odds ratio remains unchanged. Specifically,
suppose that 𝑎𝑖 , 𝑏𝑗 are sets of positive constants and that

$$\pi_{ij}^{new} = a_i b_j \pi_{ij}$$

and $\sum_{ij} \pi_{ij}^{new} = 1$. Then,

$$OR^{new} = \frac{(a_1 b_1 \pi_{11})(a_0 b_0 \pi_{00})}{(a_0 b_1 \pi_{01})(a_1 b_0 \pi_{10})} = \frac{\pi_{11} \pi_{00}}{\pi_{01} \pi_{10}} = OR^{old}.$$

For additional help with interpretation, Yule proposed two transforms for the
odds ratio, the first in Yule (1900),

$$\frac{OR - 1}{OR + 1},$$
and the second in Yule (1912),

$$\frac{\sqrt{OR} - 1}{\sqrt{OR} + 1}.$$
Although these statistics provide the same information as is the original odds
ratio 𝑂𝑅, they have the advantage of taking values in the interval [−1, 1], making
them easier to interpret.
In a later section, we will also see that the marginal distributions have no ef-
fect on the Fréchet-Höeffding bounds of the tetrachoric correlation, another measure of
association; see also Joe (2014), page 48.
From Table 14.2, $OR(\pi_{11}) = \frac{1611 \times 956}{897 \times 2175} = 0.79$. You can obtain the 𝑂𝑅(𝜋11 ),
using the oddsratio() function from the epitools library in R. From the output
below, 𝑂𝑅(𝜋11 ) = 0.79 for the binary variables NoClaimCredit and Fire5 from
the LGPIF data.
Table 14.2. 2 × 2 Table of Counts for Fire5 and NoClaimCredit

Fire5
NoClaimCredit 0 1 Total
0 1611 2175 3786
1 897 956 1853
Total 2508 3131 5639
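The odds ratio can also be reproduced directly from the counts in Table 14.2; the following R sketch computes the cross-product ratio, and the oddsratio() function from the epitools package mentioned above provides related estimates together with confidence intervals.

# Cross-product (odds) ratio computed directly from the Table 14.2 counts
counts <- matrix(c(1611, 2175,
                   897,  956),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(NoClaimCredit = c("0", "1"),
                                 Fire5 = c("0", "1")))
OR <- (counts[1, 1] * counts[2, 2]) / (counts[1, 2] * counts[2, 1])
OR   # approximately 0.79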

Categorical Variables
More generally, let (𝑋, 𝑌 ) be a bivariate pair having 𝑛𝑐𝑎𝑡𝑋 and 𝑛𝑐𝑎𝑡𝑌 numbers
of categories, respectively. For a two-way table of counts, let 𝑛𝑗𝑘 be the number
in the 𝑗th row, 𝑘th column. Let 𝑛𝑗 be the row margin total, 𝑛𝑘 be the column
margin total and 𝑛 = ∑𝑗,𝑘 𝑛𝑗,𝑘 . Define the Pearson chi-square statistic as

$$\chi^2 = \sum_{jk} \frac{(n_{jk} - n_j n_k / n)^2}{n_j n_k / n}.$$

The likelihood ratio test statistic is


$$G^2 = 2 \sum_{jk} n_{jk} \log \frac{n_{jk}}{n_j n_k / n}.$$

Under the assumption of independence, both 𝜒2 and 𝐺2 have an asymptotic


chi-square distribution with (𝑛𝑐𝑎𝑡𝑋 − 1)(𝑛𝑐𝑎𝑡𝑌 − 1) degrees of freedom.
To help see what these statistics are estimating, let 𝜋𝑗𝑘 = Pr(𝑋 = 𝑗, 𝑌 = 𝑘)
and let 𝜋𝑋,𝑗 = Pr(𝑋 = 𝑗) and similarly for 𝜋𝑌 ,𝑘 . Assuming that 𝑛𝑗𝑘 /𝑛 ≈ 𝜋𝑗𝑘
for large 𝑛 and similarly for the marginal probabilities, we have
$$\frac{\chi^2}{n} \approx \sum_{jk} \frac{(\pi_{jk} - \pi_{X,j} \pi_{Y,k})^2}{\pi_{X,j} \pi_{Y,k}}$$

and

$$\frac{G^2}{n} \approx 2 \sum_{jk} \pi_{jk} \log \frac{\pi_{jk}}{\pi_{X,j} \pi_{Y,k}}.$$
Under the null hypothesis of independence, we have 𝜋𝑗𝑘 = 𝜋𝑋,𝑗 𝜋𝑌 ,𝑘 and it is
clear from these approximations that we anticipate that these statistics will be
small under this hypothesis.

Classical approaches, as described in Bishop et al. (1975) (page 374), distinguish
between tests of independence and measures of association. The former are
designed to detect whether a relationship exists whereas the latter are meant
to assess the type and extent of a relationship. We acknowledge these differing
purposes but are less concerned with this distinction for actuarial applications.

Table 14.3. Two-way Table of Counts for EntityType and NoClaimCredit

NoClaimCredit
EntityType 0 1
City 644 149
County 310 18
Misc 336 273
School 1103 494
Town 492 479
Village 901 440

You can obtain the Pearson chi-square statistic, using the chisq.test() func-
tion from the MASS library in R. Here, we test whether the EntityType variable
is independent of the NoClaimCredit variable using Table 14.3.

As the p-value is less than the .05 significance level, we reject the null hypothesis
that EntityType is independent of NoClaimCredit.

Furthermore, you can obtain the likelihood ratio test statistic, using the
likelihood.test() function from the Deducer library in R. From below, we
test whether EntityType is independent of NoClaimCredit from the LGPIF
data. The same conclusion is drawn as the Pearson chi-square test.
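Because the counts in Table 14.3 are given in full, these tests can be reproduced directly; the following R sketch runs the Pearson chi-square test with chisq.test and computes the 𝐺2 statistic from its definition above.

# Pearson chi-square test of independence using the Table 14.3 counts
tab <- matrix(c(644, 149,
                310,  18,
                336, 273,
                1103, 494,
                492, 479,
                901, 440),
              ncol = 2, byrow = TRUE,
              dimnames = list(EntityType = c("City", "County", "Misc",
                                             "School", "Town", "Village"),
                              NoClaimCredit = c("0", "1")))
chisq.test(tab)                                   # small p-value: reject independence

# Likelihood ratio (G^2) statistic computed from its definition
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
G2 <- 2 * sum(tab * log(tab / expected))
G2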

14.2.4 Ordinal Variables


As the analyst moves from the continuous to the nominal scale, there are two
main sources of loss of information (Bishop et al., 1975, page 343). The first is
breaking the precise continuous measurements into groups. The second is losing
the ordering of the groups. So, it is sensible to describe what we can do with
variables that are in discrete groups but where the ordering is known.

As described in Section 14.1.1, ordinal variables provide a clear ordering of


levels of a variable but distances between levels are unknown. Associations have
traditionally been quantified parametrically using normal-based correlations and
nonparametrically using Spearman correlations with tied ranks.
Parametric Approach Using Normal Based Correlations


Refer to page 60, Section 2.12.7 of Joe (2014). Let (𝑦1 , 𝑦2 ) be a bivariate pair
with discrete values on 𝑚1 , … , 𝑚2 . For a two-way table of ordinal counts, let
$n_{st}$ be the number in the $s$th row, $t$th column. Let $(n_{m_1 \bullet}, \ldots, n_{m_2 \bullet})$ be the row
margin totals, $(n_{\bullet m_1}, \ldots, n_{\bullet m_2})$ be the column margin totals, and $n = \sum_{s,t} n_{st}$.
Let $\hat{\xi}_{1s} = \Phi^{-1}((n_{m_1 \bullet} + \cdots + n_{s \bullet})/n)$ for $s = m_1, \ldots, m_2$ be a cutpoint and similarly
for $\hat{\xi}_{2t}$. The polychoric correlation, based on a two-step estimation procedure, is

$$\hat{\rho}_N = \operatorname{argmax}_{\rho} \sum_{s=m_1}^{m_2} \sum_{t=m_1}^{m_2} n_{st} \log \left\{ \Phi_2(\hat{\xi}_{1s}, \hat{\xi}_{2t}; \rho) - \Phi_2(\hat{\xi}_{1,s-1}, \hat{\xi}_{2t}; \rho) - \Phi_2(\hat{\xi}_{1s}, \hat{\xi}_{2,t-1}; \rho) + \Phi_2(\hat{\xi}_{1,s-1}, \hat{\xi}_{2,t-1}; \rho) \right\}.$$

It is called a tetrachoric correlation for binary variables.


Table 14.4. Two-way Table of Counts for AlarmCredit and NoClaimCredit

NoClaimCredit
AlarmCredit 0 1
1 1669 942
2 121 118
3 195 132
4 1801 661

You can obtain the polychoric or tetrachoric correlation using the polychoric()
or tetrachoric() function from the psych library in R. The polychoric correla-
tion is illustrated using Table 14.4. Here, 𝜌𝑁
̂ = −0.14, which means that there
is a negative relationship between AlarmCredit and NoClaimCredit.
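As a hedged sketch, the counts in Table 14.4 can be expanded into raw ordinal data and passed to the polychoric() function from the psych package; the expansion step below is an illustrative device, not code from the chapter.

# Hedged sketch: polychoric correlation from the Table 14.4 counts
library(psych)

counts <- matrix(c(1669, 942,
                   121,  118,
                   195,  132,
                   1801, 661),
                 ncol = 2, byrow = TRUE)
# Expand the cell counts into one record per policy (cells taken column-wise)
AlarmCredit   <- rep(rep(1:4, times = 2), times = as.vector(counts))
NoClaimCredit <- rep(rep(0:1, each = 4),  times = as.vector(counts))

polychoric(data.frame(AlarmCredit, NoClaimCredit))$rho   # about -0.14 per the text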

14.2.5 Interval Variables


As described in Section 14.1.2, interval variables provide a clear ordering of
levels of a variable and the numerical distance between any two levels of the
scale can be readily interpretable. For example, driver’s age group variable is
an interval variable.
For measuring association, both the continuous variable and ordinal variable ap-
proaches make sense. The former takes advantage of knowledge of the ordering
although assumes continuity. The latter does not rely on continuity but also
does not make use of the information given by the distance between scales.

14.2.6 Discrete and Continuous Variables


The polyserial correlation is defined similarly, when one variable (𝑦1 ) is contin-
uous and the other (𝑦2 ) ordinal. Define 𝑧 to be the normal score of 𝑦1 . The
polyserial correlation is
$$\hat{\rho}_N = \operatorname{argmax}_{\rho} \sum_{i=1}^{n} \log \left\{ \phi(z_{i1}) \left[ \Phi\left(\frac{\hat{\xi}_{2,y_{i2}} - \rho z_{i1}}{(1 - \rho^2)^{1/2}}\right) - \Phi\left(\frac{\hat{\xi}_{2,y_{i2}-1} - \rho z_{i1}}{(1 - \rho^2)^{1/2}}\right) \right] \right\}.$$

The biserial correlation is defined similarly, when one variable is continuous and
the other binary.
Table 14.5. Summary of Claim by NoClaimCredit

NoClaimCredit   Mean Claim   Total Claim
0               22,505       85,200,483
1               6,629        12,282,618

You can obtain the polyserial or biserial correlation using the polyserial()
or biserial() function, respectively, from the psych library in R. Table 14.5
gives the summary of Claim by NoClaimCredit and the biserial correlation is
illustrated using R code below. The 𝜌𝑁̂ = −0.04 which means that there is a
negative correlation between Claim and NoClaimCredit.

14.3 Introduction to Copulas

In this section, you learn how to:


• Describe a multivariate distribution function in terms of a copula function.

Copulas are widely used in insurance and many other fields to model the de-
pendence among multivariate outcomes. A copula is a multivariate distribution
function with uniform marginals. Specifically, let {𝑈1 , … , 𝑈𝑝 } be 𝑝 uniform
random variables on (0, 1). Their distribution function
𝐶(𝑢1 , … , 𝑢𝑝 ) = Pr(𝑈1 ≤ 𝑢1 , … , 𝑈𝑝 ≤ 𝑢𝑝 ),

is a copula. We seek to use copulas in applications that are based on more than
just uniformly distributed data. Thus, consider arbitrary marginal distribu-
tion functions 𝐹1 (𝑦1 ),…,𝐹𝑝 (𝑦𝑝 ). Then, we can define a multivariate distribution
function using the copula such that

𝐹 (𝑦1 , … , 𝑦𝑝 ) = 𝐶(𝐹1 (𝑦1 ), … , 𝐹𝑝 (𝑦𝑝 )). (14.1)

Here, 𝐹 is a multivariate distribution function. Sklar (1959) showed that any
multivariate distribution function 𝐹 can be written in the form of equation
(14.1), that is, using a copula representation.
Sklar also showed that, if the marginal distributions are continuous, then there is a unique copula representation. In this chapter we focus on copula modeling with continuous variables. For the discrete case, readers can see Joe (2014) and Genest and Nešlehová (2007).
For the bivariate case where 𝑝 = 2, we can write a copula and the distribution
function of two random variables as

𝐶(𝑢1 , 𝑢2 ) = Pr(𝑈1 ≤ 𝑢1 , 𝑈2 ≤ 𝑢2 )

and

$F(y_1, y_2) = C(F_1(y_1), F_2(y_2)).$

As an example, we can look to the copula due to Frank (1979). The copula
(distribution function) is

$$
C(u_1, u_2) = \frac{1}{\gamma} \log \left( 1 + \frac{(\exp(\gamma u_1) - 1)(\exp(\gamma u_2) - 1)}{\exp(\gamma) - 1} \right). \qquad (14.2)
$$

This is a bivariate distribution function with its domain on the unit square [0, 1]². Here, 𝛾 is the dependence parameter; that is, the range of dependence is controlled by the parameter 𝛾. Positive association increases as 𝛾 increases. As we will see, this positive association can be summarized with Spearman's rho (𝜌𝑆) and Kendall's tau (𝜏). Frank's copula is commonly used. We will see other copula functions in Section 14.5.
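As a hedged illustration of equation (14.2), the copula package provides a Frank copula object whose distribution and density functions can be evaluated directly; the parameter value 3 is ours, and the package's sign convention for the Frank parameter may differ from the one used in equation (14.2).

library(copula)

fc <- frankCopula(param = 3, dim = 2)
pCopula(c(0.5, 0.8), fc)   # distribution function C(0.5, 0.8)
dCopula(c(0.5, 0.8), fc)   # copula density c(0.5, 0.8)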

14.4 Application Using Copulas

In this section, you learn how to:


• Discover dependence structure between random variables
• Model the dependence with a copula function

This section analyzes the insurance losses and expenses data with the statistical
program R. The data set was introduced in Frees and Valdez (1998) and is now
readily available in the copula package. The model fitting process is started by
marginal modeling of each of the two variables, LOSS and ALAE. Then we model
the joint distribution of these marginal outcomes.

14.4.1 Data Description


We start with a sample (𝑛 = 1500) from the whole data set. We consider the first two variables of the data: losses and expenses.
• LOSS, general liability claims from the Insurance Services Office, Inc. (ISO)
• ALAE, allocated loss adjustment expenses specifically attributable to the settlement of individual claims (e.g., lawyer's fees, claims investigation expenses)
To visualize the relationship between losses and expenses, the scatterplots in
Figure 14.2 are created on dollar and log dollar scales. It is difficult to see any
relationship between the two variables in the left-hand panel. Their dependence
is more evident when viewed on the log scale, as in the right-hand panel.
Figure 14.2: Scatter Plot of LOSS and ALAE. Left panel: ALAE versus LOSS (dollar scale). Right panel: log(ALAE) versus log(LOSS).

14.4.2 Marginal Models


We first examine the marginal distributions of losses and expenses before going through the joint modeling. The histograms show that both LOSS and ALAE are right-skewed and fat-tailed. Because of these features, for both marginal distributions of losses and expenses, we consider a Pareto distribution, with distribution function of the form

$$
F(y) = 1 - \left( \frac{\theta}{y + \theta} \right)^{\alpha}.
$$

Here, 𝜃 is a scale parameter and 𝛼 is a shape parameter. Section 18.2 provides


details of this distribution.

The marginal distributions of losses and expenses are fit using the method of
maximum likelihood. Specifically, we use the vglm function from the R VGAM
package. Firstly, we fit the marginal distribution of ALAE. Parameters are
summarized in Table 14.6.
We repeat this procedure to fit the marginal distribution of the LOSS variable. Because the loss variable also appears right-skewed and heavy-tailed, we again model the marginal distribution with a Pareto distribution (although with different parameters).
Table 14.6. Summary of Pareto Maximum Likelihood Fitted Parameters for the LOSS and ALAE Data

         Scale 𝜃̂        Shape 𝛼̂
ALAE     15133.60360     2.22304
LOSS     16228.14797     1.23766

To visualize the fitted distribution of LOSS and ALAE variables, one can use
the estimated parameters and plot the corresponding distribution function and
density function. For more details on the selection of marginal models, see
Chapter 4.
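A sketch of this marginal fitting step is given below. The loss-ALAE data are shipped with the copula package as the data set loss (with lower-case columns loss and alae); whether that version matches the n = 1500 sample used in the text is not guaranteed, so the fitted values may differ slightly from Table 14.6.

library(copula)
library(VGAM)

data(loss, package = "copula")
lossdat <- loss                        # columns include loss and alae

# Pareto (type II, location 0) marginals, fit by maximum likelihood
fit_alae <- vglm(alae ~ 1, paretoII(location = 0), data = lossdat)
fit_loss <- vglm(loss ~ 1, paretoII(location = 0), data = lossdat)

Coef(fit_alae)   # fitted scale (theta) and shape (alpha) for ALAE
Coef(fit_loss)   # fitted scale (theta) and shape (alpha) for LOSS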

14.4.3 Probability Integral Transformation


When studying simulation, in Section 6.1.2 we learned about the inverse transform method. This is a way of mapping a 𝑈(0, 1) random variable into a random variable 𝑋 with distribution function 𝐹 via the inverse of the distribution function, that is, 𝑋 = 𝐹⁻¹(𝑈). The probability integral transformation goes in the other direction: it states that 𝐹(𝑋) = 𝑈, a uniform random variable on (0, 1). Although the inverse transform result is available when the underlying random variable is continuous, discrete, or a hybrid combination of the two, the probability integral transform is mainly useful when the distribution is continuous. That is the focus of this chapter.
We use the probability integral transform for two purposes: (1) for diagnostic
purposes, to check that we have correctly specified a distribution function and
(2) as an input into the copula function in equation (14.1).
For the first purpose, we can check whether the Pareto is a reasonable distribution to model our marginal distributions. Given the fitted Pareto distribution, the variable ALAE is transformed to the variable 𝑢₁, which follows a uniform distribution on [0, 1]:

$$
u_1 = \hat{F}_1(ALAE) = 1 - \left( 1 + \frac{ALAE}{\hat{\theta}} \right)^{-\hat{\alpha}}.
$$

After applying the probability integral transformation to the ALAE variable, we plot the histogram of Transformed ALAE in Figure 14.3. This plot appears reasonably close to what we expect to see with a uniform distribution, suggesting that the Pareto distribution is a reasonable specification.

Figure 14.3: Histogram of Transformed ALAE
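Continuing the sketch above, the probability integral transform of ALAE can be computed directly from the fitted parameters reported in Table 14.6; the object names are ours.

theta_alae <- 15133.60360
alpha_alae <- 2.22304

# Probability integral transform: should look approximately uniform on [0, 1]
u1 <- 1 - (1 + lossdat$alae / theta_alae)^(-alpha_alae)
hist(u1, xlab = "Transformed ALAE", main = "")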

In the same way, the variable LOSS is also transformed to the variable 𝑢₂, which follows a uniform distribution on [0, 1]. The left-hand panel of Figure 14.4 shows a histogram of the transformed LOSS, again reinforcing the Pareto distribution specification. For another way of looking at the data, the variable 𝑢₂ can be transformed to a normal score with the quantile function of the standard normal distribution. As we see in Figure 14.4, normal scores of the variable LOSS are approximately marginally standard normal. This figure is helpful because analysts are used to looking for patterns of approximate normality (which seems to be evident in the figure). The logic is that, if the Pareto distribution is correctly specified, then the transformed losses 𝑢₂ should be approximately uniform, and the normal scores Φ⁻¹(𝑢₂) should be approximately normal. (Here, Φ is the standard normal cumulative distribution function.)

14.4.4 Joint Modeling with Copula Function


Before jointly modeling losses and expenses, we draw the scatterplot of the transformed variables (𝑢₁, 𝑢₂) and the scatterplot of normal scores in Figure 14.5. The left-hand panel is a plot of 𝑢₁ versus 𝑢₂, where 𝑢₁ = 𝐹̂₁(𝐴𝐿𝐴𝐸) and 𝑢₂ = 𝐹̂₂(𝐿𝑂𝑆𝑆). Then we transform each one using the inverse standard normal distribution function, Φ⁻¹(⋅), or qnorm in R, to get normal scores. As in Figure 14.2, it is difficult to see patterns in the left-hand panel. However, with rescaling, patterns are evident in the right-hand panel. To learn more details about normal scores and their applications in copula modeling, see Joe (2014).
The right-hand panel of Figure 14.2 shows us that there is a positive dependency between these two random variables. This can be summarized using, for example, Spearman's rho, which turns out to be 0.451. As we learned in Section 14.2.2, this statistic depends only on the order of the two variables through their respective ranks.
Figure 14.4: Histogram of Transformed Loss. The left-hand panel shows the distribution of probability integral transformed losses. The right-hand panel shows the distribution for the corresponding normal scores.
Figure 14.5: Left: Scatter plot for transformed variables. Right: Scatter plot for normal scores.

Therefore, the statistic is the same for (1) the original data in Figure 14.2, (2) the data transformed to uniform scales in the left-hand panel of Figure 14.5, and (3) the normal scores in the right-hand panel of Figure 14.5.

The next step is to calculate estimates of the copula parameters. One option is to use traditional maximum likelihood and determine all the parameters at the same time, which can be computationally burdensome. Even in our simple example, this means maximizing a (log) likelihood function over five parameters: two for the marginal ALAE distribution, two for the marginal LOSS distribution, and one for the copula. A widely used alternative, known as the inference for margins (IFM) approach, is to simply use the fitted marginal distributions, 𝑢₁ and 𝑢₂, as inputs when determining the copula. This is the approach taken here. In the following code, you will see that the fitted copula parameter turns out to be 𝛾̂ = 3.114.
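A hedged sketch of this IFM step, continuing with the transformed marginals, is shown below. Note that the copula package's parameterization of the Frank copula may differ from equation (14.2) by a sign convention, so the reported estimate need not equal 3.114 exactly.

library(copula)

theta_loss <- 16228.14797
alpha_loss <- 1.23766
u2 <- 1 - (1 + lossdat$loss / theta_loss)^(-alpha_loss)   # transformed LOSS

# Fit Frank's copula to the transformed marginals (inference for margins idea)
frank_fit <- fitCopula(frankCopula(dim = 2), data = cbind(u1, u2), method = "ml")
coef(frank_fit)   # fitted dependence parameter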

To visualize the fitted Frank’s copula, the distribution function and density
function perspective plots are drawn in Figure 14.6.

Figure 14.6: Left: Plot of the distribution function for Frank's copula. Right: Plot of the density function for Frank's copula.

14.5 Types of Copulas



In this section, you learn how to:


• Define the basic types of copulas, including the normal, 𝑡-, elliptical, and
Archimedean copulas
• Interpret bounds that limit copula distribution functions as the amount
of dependence varies
• Calculate measures of association for different copulas and interpret their
properties
• Interpret tail dependency for different copulas

Several families of copulas have been described in the literature. Two main families are the Archimedean and elliptical copulas.

14.5.1 Normal (Gaussian) Copulas


We started our study with Frank's copula in equation (14.2) because it can capture both positive and negative dependence and has a readily understood analytic form. However, extensions to multivariate cases where 𝑝 > 2 are not easy, and so we look to alternatives. In particular, the normal, or Gaussian, distribution has been used for many years in empirical work, dating back to the nineteenth-century work of Gauss. So, it is natural to turn to this distribution as a benchmark for understanding multivariate dependencies.
For a multivariate normal distribution, think of 𝑝 normal random variables, each
with mean zero and standard deviation one. Their dependence is controlled by
Σ, a correlation matrix, with ones on the diagonal. The number in the 𝑖th row
and 𝑗th column, say Σ𝑖𝑗 , gives the correlation between the 𝑖th and 𝑗th normal
random variables. This collection of random variables has a multivariate normal
distribution with probability density function

$$
\phi_N(\mathbf{z}) = \frac{1}{(2\pi)^{p/2} \sqrt{\det \Sigma}} \exp\left( -\frac{1}{2} \mathbf{z}' \Sigma^{-1} \mathbf{z} \right). \qquad (14.3)
$$

To develop the corresponding copula version, it is possible to start with equation


(14.1), evaluate this using normal variables, and go through a bit of calculus.
Instead, we simply state as a definition, the normal (Gaussian) copula density
function is

$$
c_N(u_1, \ldots, u_p) = \phi_N\left( \Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_p) \right) \prod_{j=1}^{p} \frac{1}{\phi\left(\Phi^{-1}(u_j)\right)}.
$$

Here, we use Φ and 𝜙 to denote the standard normal distribution and density functions. Unlike the usual probability density function 𝜙𝑁, the copula density function has its domain on the hyper-cube [0, 1]ᵖ. For contrast, Figure 14.7 compares these two density functions.

Figure 14.7: Bivariate Normal Probability Density Function Plots. The left-hand panel is a traditional bivariate normal probability density function. The right-hand panel is a plot of the copula density for the normal distribution.
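As a hedged check of the density formula above (the correlation value 0.5 and the evaluation point are ours), the copula package's normal copula can be evaluated and used to recover the bivariate normal density.

library(copula)

nc <- normalCopula(param = 0.5, dim = 2)
u  <- c(0.3, 0.7)
z  <- qnorm(u)

dCopula(u, nc)                     # normal copula density c_N(u1, u2)
dCopula(u, nc) * prod(dnorm(z))    # recovers the bivariate normal pdf phi_N(z1, z2)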

14.5.2 t- and Elliptical Copulas


Another copula used widely in practice is the 𝑡- copula. Both the 𝑡- and the
normal copula are special cases of a family known as elliptical copulas, so we
introduce this general family first, then specialize to the case of the 𝑡- copula.
The normal and the 𝑡- distributions are examples of symmetric distributions. More generally, elliptical distributions are a class of symmetric distributions that can be multivariate. In short, an elliptical distribution is a type of symmetric, multivariate distribution. The multivariate normal and multivariate 𝑡- are special types of elliptical distributions.

Elliptical copulas are constructed from elliptical distributions. By Sklar's theorem, such a copula decomposes a (multivariate) elliptical distribution into its univariate elliptical marginal distributions. Properties of elliptical copulas can be obtained from the properties of the corresponding elliptical distributions; see, for example, Hofert et al. (2018).
In general, a 𝑝-dimensional vector of random variables has an elliptical distribution if the density can be written as

$$
h_E(\mathbf{z}) = \frac{k_p}{\sqrt{\det \Sigma}} \, g_p\left( \frac{1}{2}(\mathbf{z} - \boldsymbol{\mu})' \Sigma^{-1} (\mathbf{z} - \boldsymbol{\mu}) \right),
$$

for z ∈ 𝑅𝑝 and 𝑘𝑝 is a constant, determined so the density integrates to one.


The function 𝑔𝑝(⋅) is called a generator because it can be used to produce different distributions. Table 14.7 summarizes a few choices used in actuarial practice. The choice 𝑔𝑝(𝑥) = exp(−𝑥) gives rise to the normal pdf in equation (14.3). The choice 𝑔𝑝(𝑥) = (1 + 2𝑥/𝑟)^(−(𝑝+𝑟)/2) gives rise to a multivariate 𝑡- distribution with 𝑟 degrees of freedom, with pdf

$$
h_{t_r}(\mathbf{z}) = \frac{k_p}{\sqrt{\det \Sigma}} \left( 1 + \frac{(\mathbf{z} - \boldsymbol{\mu})' \Sigma^{-1} (\mathbf{z} - \boldsymbol{\mu})}{r} \right)^{-(p+r)/2}.
$$

Table 14.7. Generator Functions (𝑔𝑝(⋅)) for Selected Elliptical Distributions

Distribution                                  Generator 𝑔𝑝(𝑥)
Normal distribution                           exp(−x)
t-distribution with r degrees of freedom      (1 + 2x/r)^(−(p+r)/2)
Cauchy                                        (1 + 2x)^(−(p+1)/2)
Logistic                                      exp(−x)/(1 + exp(−x))²
Exponential power                             exp(−r xˢ)

We can use elliptical distributions to generate copulas. Because copulas are


concerned primarily with relationships, we may restrict our considerations to
the case where 𝜇 = 0 and Σ is a correlation matrix. With these restrictions,
the marginal distributions of the multivariate elliptical copula are identical; we
use 𝐻 to refer to this marginal distribution function and ℎ is the corresponding
density. This marginal density is ℎ(𝑧) = 𝑘1 𝑔1 (𝑧 2 /2). For example, in the normal
case we have 𝐻(⋅) = Φ(⋅) and ℎ(⋅) = 𝜙(⋅).
We are now ready to define the pdf of the elliptical copula, a function defined
on the unit cube [0, 1]𝑝 as

$$
c_E(u_1, \ldots, u_p) = h_E\left( H^{-1}(u_1), \ldots, H^{-1}(u_p) \right) \prod_{j=1}^{p} \frac{1}{h\left(H^{-1}(u_j)\right)}.
$$

As noted above, most empirical work focuses on the normal copula and 𝑡-copula. Specifically, 𝑡-copulas are useful for modeling the dependency in the tails of bivariate distributions, especially in financial risk analysis applications. 𝑡-copulas with the same association parameter but varying degrees of freedom exhibit different tail dependency structures. For more information about 𝑡-copulas, readers can see Joe (2014) and Hofert et al. (2018).
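A hedged sketch contrasting t-copulas with the same association parameter but different degrees of freedom is given below; the parameter values are ours.

library(copula)

tc5  <- tCopula(param = 0.5, dim = 2, df = 5)
tc50 <- tCopula(param = 0.5, dim = 2, df = 50)

# Density near the joint upper tail: the low-df t-copula puts more mass there
dCopula(c(0.95, 0.95), tc5)
dCopula(c(0.95, 0.95), tc50)   # much closer to the normal copula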

14.5.3 Archimedean Copulas


This class of copulas is also constructed from a generator function. For Archimedean copulas, we assume that 𝑔(⋅) is a convex, decreasing function with domain [0, 1] and range [0, ∞) such that 𝑔(1) = 0. Use 𝑔⁻¹ for the inverse function of 𝑔. Then the function

𝐶𝑔 (𝑢1 , … , 𝑢𝑝 ) = 𝑔−1 (𝑔(𝑢1 ) + ⋯ + 𝑔(𝑢𝑝 ))

is said to be an Archimedean copula distribution function.


For the bivariate case, 𝑝 = 2, an Archimedean copula function can be written
by the function

𝐶𝑔 (𝑢1 , 𝑢2 ) = 𝑔−1 (𝑔(𝑢1 ) + 𝑔(𝑢2 )) .

Some important special cases of Archimedean copulas include the Frank, Clayton/Cook-Johnson, and Gumbel/Hougaard copulas. Each copula class is derived from a different generator function. As another useful special case, recall Frank's copula described in Sections 14.3 and 14.4. To illustrate, we now provide explicit expressions for the Clayton and Gumbel/Hougaard copulas.

Clayton Copula
For 𝑝 = 2, the Clayton copula, parameterized by 𝛾 ∈ [−1, ∞), is defined by

$$
C^{C}_{\gamma}(u) = \max\left\{ u_1^{-\gamma} + u_2^{-\gamma} - 1,\, 0 \right\}^{-1/\gamma}, \qquad u \in [0,1]^2.
$$

This is a bivariate distribution function defined on the unit square [0, 1]2 . The
range of dependence is controlled by the parameter 𝛾, similar to Frank’s copula.

Gumbel-Hougaard Copula
The Gumbel-Hougaard copula is parametrized by 𝛾 ∈ [1, ∞) and defined by

$$
C^{GH}_{\gamma}(u) = \exp\left( - \left( \sum_{i=1}^{2} (-\log u_i)^{\gamma} \right)^{1/\gamma} \right), \qquad u \in [0,1]^2.
$$

For more information on Archimedean copulas, see Joe (2014), Frees and Valdez
(1998), and Genest and Mackay (1986).
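As a hedged illustration, both copulas are available in the copula package; the parameter values below are ours.

library(copula)

cc <- claytonCopula(param = 2, dim = 2)
gc <- gumbelCopula(param = 2, dim = 2)

pCopula(c(0.5, 0.5), cc)   # Clayton C(0.5, 0.5)
pCopula(c(0.5, 0.5), gc)   # Gumbel-Hougaard C(0.5, 0.5)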

14.5.4 Properties of Copulas


With many choices of copulas available, it is helpful for analysts to understand
general features of how these alternatives behave.

Bounds on Association
Any distribution function is bounded below by zero and above by one. Additional types of bounds are available in multivariate contexts. These bounds are useful when studying dependencies. That is, as an analyst thinks about variables as being extremely dependent, one has available bounds that cannot be exceeded, regardless of the dependence. The most widely used bounds in dependence modeling are known as the Fréchet-Hoeffding bounds, given as

$$
\max(u_1 + \cdots + u_p - p + 1,\, 0) \le C(u_1, \ldots, u_p) \le \min(u_1, \ldots, u_p).
$$

To see the right-hand side of this equation, note that

𝐶(𝑢1 , … , 𝑢𝑝 ) = Pr(𝑈1 ≤ 𝑢1 , … , 𝑈𝑝 ≤ 𝑢𝑝 ) ≤ Pr(𝑈𝑗 ≤ 𝑢𝑗 ),

for 𝑗 = 1, … , 𝑝. The bound is achieved when 𝑈1 = ⋯ = 𝑈𝑝 . To see the left-hand


side when 𝑝 = 2, consider 𝑈2 = 1 − 𝑈1 . In this case, if 1 − 𝑢2 < 𝑢1 then

Pr(𝑈1 ≤ 𝑢1 , 𝑈2 ≤ 𝑢2 ) = Pr(1 − 𝑢2 ≤ 𝑈1 < 𝑢1 ) = 𝑢1 + 𝑢2 − 1.

See, for example, Nelson (1997) for additional discussion.


To see how these bounds relate to the concept of dependence, consider the case of
𝑝 = 2. As a benchmark, first note that the product copula is 𝐶(𝑢1 , 𝑢2 ) = 𝑢1 ⋅ 𝑢2
is the result of assuming independence between random variables. Now, from
the above discussion, we see that the lower bound is achieved when the two
random variables are perfectly negatively related (𝑈2 = 1 − 𝑈1 ). Further, it is
clear that the upper bound is achieved when they are perfectly positively related
(𝑈2 = 𝑈1 ). To emphasize this, the Frechet-Hoeffding bounds for two random
variables appear in Figure 14.8.

Measures of Association
Empirical versions of Spearman’s rho and Kendall’s tau were introduced in
Sections 14.2.2 and 14.2.2, respectively. The interesting thing about these ex-
pressions is that these summary measures of association are based only on the
ranks of each variable. Thus, any strictly increasing transform does not affect
these measures of association. Specifically, consider two random variables, 𝑌1
and 𝑌2 , and let m1 and m2 be strictly increasing functions. Then, the associa-
tion, when measured by Spearman’s rho or Kendall’s tau, between 𝑚1 (𝑌1 ) and
𝑚2 (𝑌2 ) does not change regardless of the choice of m1 and m2 . For example,
this allows analysts to consider dollars, Euros, or log dollars, and still retain the

Figure 14.8: Perfect Positive and Perfect Negative Dependence Plots. Left panel: perfect negative dependency. Right panel: perfect positive dependency.

same essential dependence. As we have seen in Section 14.2, this is not the case
with the Pearson’s measure of correlation.

Schweizer et al. (1981) established that the copula accounts for all the dependence in the sense that the way 𝑌₁ and 𝑌₂ “move together” is captured by the copula, regardless of the scale in which each variable is measured. They also showed that (population versions of) the two standard nonparametric measures of association could be expressed solely in terms of the copula function.
Spearman’s correlation coefficient is given by

$$
\rho_S = 12 \int_0^1 \int_0^1 \left\{ C(u,v) - uv \right\} du\, dv. \qquad (14.4)
$$

Kendall’s tau is given by

$$
\tau = 4 \int_0^1 \int_0^1 C(u,v)\, dC(u,v) - 1.
$$

For these expressions, we assume that 𝑌1 and 𝑌2 have a jointly continuous


distribution function.

Example. Loss versus Expenses. Earlier, in Section 14.4, we saw that the Spearman's correlation was 0.452, calculated with the rho function. Then, we fit Frank's copula to these data and estimated the dependence parameter to be 𝛾̂ = 3.114. As an alternative, the following code shows how to use the empirical version of equation (14.4). In this case, the Spearman's correlation coefficient is 0.462, which is close to the sample Spearman's correlation coefficient, 0.452.
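The code referred to above is not reproduced in this extract. As a related, hedged illustration, the population Spearman's rho and Kendall's tau implied by a fitted Frank copula can be obtained from the copula package as follows; the parameter value 3.114 is taken from Section 14.4.

library(copula)

fc <- frankCopula(param = 3.114)
rho(fc)   # population Spearman's rho implied by the fitted Frank copula
tau(fc)   # population Kendall's tau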

Tail Dependency
There are applications in which it is useful to distinguish the part of the dis-
tribution in which the association is strongest. For example, in insurance it is
helpful to understand association among the largest losses, that is, association
in the right tails of the data.
To capture this type of dependency, we use the right-tail concentration function,
defined as

$$
R(z) = \frac{\Pr(U_1 > z, U_2 > z)}{1 - z} = \Pr(U_1 > z \mid U_2 > z) = \frac{1 - 2z + C(z,z)}{1 - z}.
$$

As a benchmark, 𝑅(𝑧) will be equal to 1 − 𝑧 under independence. Joe (1997) uses the term “upper tail dependence parameter” for 𝑅 = lim_{𝑧→1} 𝑅(𝑧).
In the same way, one can define the left-tail concentration function as

$$
L(z) = \frac{\Pr(U_1 \le z, U_2 \le z)}{z} = \Pr(U_1 \le z \mid U_2 \le z) = \frac{C(z,z)}{z},
$$

with the lower tail dependence parameter 𝐿 = lim𝑧→0 𝐿(𝑧). A tail dependency
concentration function captures the probability of two random variables simul-
taneously having extreme values.
It is of interest to see how well a given copula can capture tail dependence. To this end, we calculate the left and right tail concentration functions for four different types of copulas: the normal, Frank, Gumbel, and 𝑡- copulas. The concentration function values for these four copulas are summarized in Table 14.8. As in Venter (2002), we show 𝐿(𝑧) for 𝑧 ≤ 0.5 and 𝑅(𝑧) for 𝑧 > 0.5 in the tail dependence plot in Figure 14.9. We interpret the tail dependence plot to mean that both the Frank and normal copulas exhibit no tail dependence, whereas the 𝑡- and the Gumbel copulas do. The 𝑡- copula is symmetric in its treatment of the upper and lower tails.
Table 14.8. Tail Dependence Parameters for Four Copulas

Copula Lower Upper


Frank 0 0
Gumbel 0 0.74
Normal 0 0
𝑡− 0.10 0.10
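As a hedged check, the copula package reports the lower and upper tail dependence parameters of a copula object directly; the parameter values below are ours for illustration and are not those underlying Table 14.8.

library(copula)

lambda(gumbelCopula(param = 2))        # lower = 0, upper > 0
lambda(tCopula(param = 0.5, df = 5))   # symmetric lower and upper tail dependence
lambda(frankCopula(param = 3))         # both zero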

14.6 Why is Dependence Modeling Important?


Dependence modeling is important because it enables us to understand the
dependence structure by defining the relationship between variables in a dataset.

Figure 14.9: Tail Dependence Plots. Curves are shown for the Gumbel, t (with 5 df), normal, and Frank copulas.

In insurance, ignoring dependence modeling may not impact pricing but could lead to misestimation of the capital required to cover losses. For instance, from Section 14.4, it is seen that there was a positive relationship between LOSS and ALAE. This means that, if there is a large loss, then we expect expenses to be large as well, and ignoring this relationship could lead to misestimation of reserves.

To illustrate the importance of dependence modeling, we refer you back to the portfolio management example in Section 10.4.3 that assumed that the property and liability risks are independent. Now, we incorporate dependence by allowing the four lines of business to depend on one another through a Gaussian copula. In Table 14.9, we show that dependence affects the portfolio quantiles (𝑉𝑎𝑅𝑞), although not the expected values. For instance, the 𝑉𝑎𝑅₀.₉₉ for total risk, which is the amount of capital required to ensure, with a 99% degree of certainty, that the firm does not become technically insolvent, is higher when we incorporate dependence. This means that less capital is allocated when dependence is ignored, which can cause unexpected solvency problems.

Table 14.9. Results for Portfolio Expected Value and Quantiles (𝑉𝑎𝑅𝑞)

Independent        Expected Value   VaR 0.9   VaR 0.95   VaR 0.99
Retained           269              300       300        300
Insurer            2,274            4,400     6,173      11,859
Total              2,543            4,675     6,464      12,159

Gaussian Copula    Expected Value   VaR 0.9   VaR 0.95   VaR 0.99
Retained           269              300       300        300
Insurer            2,340            4,988     7,339      14,905
Total              2,609            5,288     7,639      15,205

14.7 Further Resources and Contributors


Contributors
• Edward W. (Jed) Frees and Nii-Armah Okine, University of
Wisconsin-Madison, and Emine Selin Sarıdaş, Mimar Sinan University,
are the principal authors of the initial version of this chapter. Email:
[email protected] for chapter comments and suggested improvements.
• Chapter reviewers include: Runhuan Feng, Fei Huang, Himchan Jeong,
Min Ji, and Toby White.

TS 14.A. Other Classic Measures of Scalar Associations


TS 14.A.1. Blomqvist’s Beta
Blomqvist (1950) developed a measure of dependence now known as Blomqvist’s
beta, also called the median concordance coefficient and the medial correlation
coefficient. Using distribution functions, this parameter can be expressed as

$$
\beta_B = 4F\left( F_X^{-1}(1/2),\, F_Y^{-1}(1/2) \right) - 1.
$$

That is, first evaluate each marginal at its median ($F_X^{-1}(1/2)$ and $F_Y^{-1}(1/2)$, respectively). Then, evaluate the bivariate distribution function at the two medians. After rescaling (multiplying by 4 and subtracting 1), the coefficient turns out to have a range of [−1, 1], where 0 occurs under independence.
Like Spearman's rho and Kendall's tau, an estimator based on ranks is easy to provide. First write $\beta_B = 4C(1/2, 1/2) - 1 = 2\Pr\left((U_1 - 1/2)(U_2 - 1/2) > 0\right) - 1$, where 𝑈₁, 𝑈₂ are uniform random variables. Then, define

$$
\hat{\beta}_B = \frac{2}{n} \sum_{i=1}^{n} I\left( \left(R(X_i) - \frac{n+1}{2}\right)\left(R(Y_i) - \frac{n+1}{2}\right) \ge 0 \right) - 1.
$$

See, for example, Joe (2014), page 57 or Hougaard (2000), page 135, for more
details.

Because Blomqvist's parameter is based on the center of the distribution, it is particularly useful when data are censored; in this case, information in the extreme parts of the distribution is not always reliable. How does this affect a choice
of association measures? First, recall that association measures are based on a
bivariate distribution function. So, if one has knowledge of a good approxima-
tion of the distribution function, then calculation of an association measure is
straightforward in principle. Second, for censored data, bivariate extensions of
the univariate Kaplan-Meier distribution function estimator are available. For
example, the version introduced in Dabrowska (1988) is appealing. However,
because of instances when large masses of data appear at the upper range of
the data, this and other estimators of the bivariate distribution function are
unreliable. This means that summary measures of the estimated distribution function based on Spearman's rho or Kendall's tau can be unreliable.
situation, Blomqvist’s beta appears to be a better choice as it focuses on the
center of the distribution. Hougaard (2000), Chapter 14, provides additional
discussion.
You can obtain Blomqvist's beta using the betan() function from the copula library in R. From below, 𝛽𝐵 = 0.3 between the Coverage rating variable in millions of dollars and the Claim amount variable in dollars.

In addition, to show that Blomqvist's beta is invariant under strictly increasing transformations, 𝛽𝐵 = 0.3 between the Coverage rating variable in logarithmic millions of dollars and the Claim amount variable in dollars.
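A hedged sketch of the computation is given below; since the Coverage and Claim data are not reproduced here, simulated stand-in data are used, so the value will not equal 0.3.

library(copula)

set.seed(2020)
xy <- rCopula(500, frankCopula(3))   # stand-in for the Coverage and Claim pairs
betan(pobs(xy))                      # empirical Blomqvist's beta from pseudo-observations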

TS 14.A.2. Nonparametric Approach Using Spearman Correlation


with Tied Ranks
For the first variable, the average rank of observations in the 𝑠th row is

$$
r_{1s} = n_{m_1\cdot} + \cdots + n_{s-1,\cdot} + \frac{1}{2}\left(1 + n_{s\cdot}\right)
$$

and similarly $r_{2t} = \frac{1}{2}\left[ \left(n_{\cdot m_1} + \cdots + n_{\cdot,t-1} + 1\right) + \left(n_{\cdot m_1} + \cdots + n_{\cdot t}\right) \right]$. With this, Spearman's rho with tied ranks is

$$
\hat{\rho}_S = \frac{\sum_{s=m_1}^{m_2} \sum_{t=m_1}^{m_2} n_{st}\,(r_{1s} - \bar{r})(r_{2t} - \bar{r})}{\left[ \sum_{s=m_1}^{m_2} n_{s\cdot}\,(r_{1s} - \bar{r})^2 \; \sum_{t=m_1}^{m_2} n_{\cdot t}\,(r_{2t} - \bar{r})^2 \right]^{1/2}}
$$

where the average rank is 𝑟 ̄ = (𝑛 + 1)/2.


Special Case: Binary Data. Here, $m_1 = 0$ and $m_2 = 1$. For the first variable ranks, we have $r_{10} = (1 + n_{0\cdot})/2$ and $r_{11} = (n_{0\cdot} + 1 + n)/2$. Thus, $r_{10} - \bar{r} = (n_{0\cdot} - n)/2$ and $r_{11} - \bar{r} = n_{0\cdot}/2$. This means that we have $\sum_{s=0}^{1} n_{s\cdot}(r_{1s} - \bar{r})^2 = n(n - n_{0\cdot})\,n_{0\cdot}/4$ and similarly for the second variable. For the numerator, we have

$$
\begin{aligned}
\sum_{s=0}^{1} \sum_{t=0}^{1} n_{st}\,(r_{1s} - \bar{r})(r_{2t} - \bar{r})
&= n_{00}\,\frac{n_{0\cdot} - n}{2}\,\frac{n_{\cdot 0} - n}{2} + n_{01}\,\frac{n_{0\cdot} - n}{2}\,\frac{n_{\cdot 0}}{2} + n_{10}\,\frac{n_{0\cdot}}{2}\,\frac{n_{\cdot 0} - n}{2} + n_{11}\,\frac{n_{0\cdot}}{2}\,\frac{n_{\cdot 0}}{2} \\
&= \frac{1}{4}\Big( n_{00}(n_{0\cdot} - n)(n_{\cdot 0} - n) + (n_{0\cdot} - n_{00})(n_{0\cdot} - n)\,n_{\cdot 0} \\
&\qquad\quad + (n_{\cdot 0} - n_{00})\,n_{0\cdot}(n_{\cdot 0} - n) + (n - n_{0\cdot} - n_{\cdot 0} + n_{00})\,n_{0\cdot}\,n_{\cdot 0} \Big) \\
&= \frac{n}{4}\left( n\,n_{00} - n_{0\cdot}\,n_{\cdot 0} \right).
\end{aligned}
$$

This yields

$$
\hat{\rho}_S = \frac{n\left( n\,n_{00} - n_{0\cdot}\,n_{\cdot 0} \right)}{4\sqrt{\left( n(n - n_{0\cdot})\,n_{0\cdot}/4 \right)\left( n(n - n_{\cdot 0})\,n_{\cdot 0}/4 \right)}}
= \frac{n\,n_{00} - n_{0\cdot}\,n_{\cdot 0}}{\sqrt{n_{0\cdot}\,n_{\cdot 0}\,(n - n_{0\cdot})(n - n_{\cdot 0})}}
= \frac{n_{00}/n - (1 - \hat{\pi}_X)(1 - \hat{\pi}_Y)}{\sqrt{\hat{\pi}_X(1 - \hat{\pi}_X)\,\hat{\pi}_Y(1 - \hat{\pi}_Y)}},
$$

where $\hat{\pi}_X = (n - n_{0\cdot})/n$ and similarly for $\hat{\pi}_Y$. Note that this is the same form as the Pearson measure. From this, we see that the joint count $n_{00}$ drives this association measure.

You can obtain the ties-corrected Spearman correlation statistic 𝑟𝑆 using the
cor() function in R and selecting the spearman method. From below 𝜌𝑆̂ =
−0.09.
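A minimal sketch with base R is shown below, reusing the expanded data frame dat from the polychoric illustration; since the extract does not show which pair of variables produced the −0.09, the value below need not match exactly.

cor(dat$AlarmCredit, dat$NoClaimCredit, method = "spearman")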
Chapter 15

Appendix A: Review of
Statistical Inference

Chapter Preview. The appendix gives an overview of concepts and methods


related to statistical inference on the population of interest, using a random
sample of observations from the population. In the appendix, Section 15.1 in-
troduces the basic concepts related to the population and the sample used for
making the inference. Section 15.2 presents the commonly used methods for
point estimation of population characteristics. Section 15.3 demonstrates inter-
val estimation that takes into consideration the uncertainty in the estimation,
due to use of a random sample from the population. Section 15.4 introduces the
concept of hypothesis testing for the purpose of variable and model selection.

15.1 Basic Concepts

In this section, you learn the following concepts related to statistical inference.
• Random sampling from a population that can be summarized using a list
of items or individuals within the population
• Sampling distributions that characterize the distributions of possible out-
comes for a statistic calculated from a random sample
• The central limit theorem that guides the distribution of the mean of a
random sample from the population

Statistical inference is the process of making conclusions on the characteris-


tics of a large set of items/individuals (i.e., the population), using a representa-
tive set of data (e.g., a random sample) from a list of items or individuals from


the population that can be sampled. While the process has a broad spectrum of
applications in various areas including science, engineering, health, social, and
economic fields, statistical inference is important to insurance companies that
use data from their existing policy holders in order to make inference on the
characteristics (e.g., risk profiles) of a specific segment of target customers (i.e.,
the population) whom the insurance companies do not directly observe.
Example – Wisconsin Property Fund. Assume there are 1,377 individual
claims from the 2010 experience.

                     Minimum   First Quartile   Median   Mean     Third Quartile   Maximum       Standard Deviation
Claims               1         788              2,250    26,620   6,171            12,920,000
Logarithmic Claims   0         6.670            7.719    7.804    8.728            16.370
Figure 15.1: Distribution of Claims. Left panel: histogram of claims. Right panel: histogram of logarithmic claims.

## Sample size: 1377

Using the 2010 claim experience (the sample), the Wisconsin Property Fund may
be interested in assessing the severity of all claims that could potentially occur,
such as 2010, 2011, and so forth (the population). This process is important
in the contexts of ratemaking or claim predictive modeling. In order for such
inference to be valid, we need to assume that

• the set of 2010 claims is a random sample that is representative of the


population,

• the sampling distribution of the average claim amount can be estimated,


so that we can quantify the bias and uncertainty in the estimation due to
use of a finite sample.

15.1.1 Random Sampling


In statistics, a sampling error occurs when the sampling frame, the list from
which the sample is drawn, is not an adequate approximation of the population
of interest. A sample must be a representative subset of a population, or uni-
verse, of interest. If the sample is not representative, taking a larger sample
does not eliminate bias, as the same mistake is repeated over again and again.
Thus, we introduce the concept for random sampling that gives rise to a simple
random sample that is representative of the population.

We assume that the random variable 𝑋 represents a draw from a population with
a distribution function 𝐹 (⋅) with mean E[𝑋] = 𝜇 and variance Var[𝑋] = E[(𝑋 −
𝜇)2 ], where 𝐸(⋅) denotes the expectation of a random variable. In random
sampling, we make a total of 𝑛 such draws represented by 𝑋1 , … , 𝑋𝑛 , each
unrelated to one another (i.e., statistically independent). We refer to 𝑋1 , … , 𝑋𝑛
as a random sample (with replacement) from 𝐹 (⋅), taking either a parametric
or nonparametric form. Alternatively, we may say that 𝑋1 , … , 𝑋𝑛 are identically
and independently distributed (iid) with distribution function 𝐹 (⋅).

15.1.2 Sampling Distribution


Using the random sample 𝑋₁, …, 𝑋ₙ, we are interested in making a conclusion on a specific attribute of the population distribution 𝐹(⋅). For example, we may be interested in making an inference on the population mean, denoted 𝜇. It is natural to think of the sample mean, $\bar{X} = (1/n)\sum_{i=1}^{n} X_i$, as an estimate of the population mean 𝜇. We call the sample mean a statistic calculated from the random sample 𝑋₁, …, 𝑋ₙ. Other commonly used summary statistics include the sample standard deviation and sample quantiles.

When using a statistic (e.g., the sample mean 𝑋)̄ to make statistical inference
on the population attribute (e.g., population mean 𝜇), the quality of inference
is determined by the bias and uncertainty in the estimation, owing to the use
of a sample in place of the population. Hence, it is important to study the
distribution of a statistic that quantifies the bias and variability of the statistic.
In particular, the distribution of the sample mean, 𝑋̄ (or any other statistic), is
called the sampling distribution. The sampling distribution depends on the
sampling process, the statistic, the sample size 𝑛 and the population distribution
𝐹 (⋅). The central limit theorem gives the large-sample (sampling) distribution
of the sample mean under certain conditions.

15.1.3 Central Limit Theorem


In statistics, there are variations of the central limit theorem (CLT) ensuring
that, under certain conditions, the sample mean will approach the population
mean with its sampling distribution approaching the normal distribution as the
sample size goes to infinity. We give the Lindeberg–Levy CLT that establishes
the asymptotic sampling distribution of the sample mean 𝑋̄ calculated using a
random sample from a universe population having a distribution 𝐹 (⋅).
Lindeberg–Levy CLT. Let 𝑋₁, …, 𝑋ₙ be a random sample from a population distribution 𝐹(⋅) with mean 𝜇 and variance 𝜎² < ∞. The difference between the sample mean 𝑋̄ and 𝜇, when multiplied by √𝑛, converges in distribution to a normal distribution as the sample size goes to infinity. That is,

$$
\sqrt{n}(\bar{X} - \mu) \xrightarrow{d} N(0, \sigma^2).
$$

Note that the CLT does not require a parametric form for 𝐹 (⋅). Based on the
CLT, we may perform statistical inference on the population mean (we infer,
not deduce). The types of inference we may perform include estimation of
the population, hypothesis testing on whether a null statement is true, and
prediction of future samples from the population.

15.2 Point Estimation and Properties

In this section, you learn how to


• estimate population parameters using method of moments estimation
• estimate population parameters based on maximum likelihood estimation

The population distribution function 𝐹 (⋅) can usually be characterized by a


limited (finite) number of terms called parameters, in which case we refer to the
distribution as a parametric distribution. In contrast, in nonparametric
analysis, the attributes of the sampling distribution are not limited to a small
number of parameters.
For obtaining the population characteristics, there are different attributes re-
lated to the population distribution 𝐹 (⋅). Such measures include the mean,
median, percentiles (e.g., the 95th percentile), and standard deviation. Because
these summary measures do not depend on a specific parametric reference, they
are nonparametric summary measures.
In parametric analysis, on the other hand, we may assume specific families of distributions with specific parameters. For example, people usually think of the logarithm of claim amounts as being normally distributed with mean 𝜇 and standard deviation 𝜎. That is, we assume that the claims have a lognormal distribution

with parameters 𝜇 and 𝜎. Alternatively, insurance companies commonly assume


that claim severity follows a gamma distribution with a shape parameter 𝛼 and
a scale parameter 𝜃. Here, the normal, lognormal, and gamma distributions are
examples of parametric distributions. In the above examples, the quantities of
𝜇, 𝜎, 𝛼, and 𝜃 are known as parameters. For a given parametric distribution
family, the distribution is uniquely determined by the values of the parameters.
One often uses 𝜃 to denote a summary attribute of the population. In parametric models, 𝜃 can be a parameter or a function of parameters from a distribution, such as the normal mean and variance parameters. In nonparametric analysis, it can take the form of a nonparametric summary such as the population mean or standard deviation. Let $\hat{\theta} = \hat{\theta}(X_1, \ldots, X_n)$ be a function of the sample that provides a proxy, or an estimate, of 𝜃. It is referred to as a statistic, a function of the sample 𝑋₁, …, 𝑋ₙ.
Example – Wisconsin Property Fund. The sample mean 7.804 and the sam-
ple standard deviation 1.683 can be either deemed as nonparametric estimates
of the population mean and standard deviation, or as parametric estimates of
𝜇 and 𝜎 of the normal distribution concerning the logarithmic claims. Using
results from the lognormal distribution, we may estimate the expected claim,
the lognormal mean, as 10,106.8 ( = exp(7.804 + 1.6832 /2) ).
For the Wisconsin Property Fund data, we may denote 𝜇̂ = 7.804 and 𝜎̂ = 1.683,
with the hat notation denoting an estimate of the parameter based on the
sample. In particular, such an estimate is referred to as a point estimate,
a single approximation of the corresponding parameter. For point estimation,
we introduce the two commonly used methods called the method of moments
estimation and maximum likelihood estimation.

15.2.1 Method of Moments Estimation


Before defining the method of moments estimation, we define the concept of moments. Moments are population attributes that characterize the dis-
tribution function 𝐹 (⋅). Given a random draw 𝑋 from 𝐹 (⋅), the expectation
𝜇𝑘 = E[𝑋 𝑘 ] is called the 𝑘th moment of 𝑋, 𝑘 = 1, 2, 3, … For example, the pop-
ulation mean 𝜇 is the first moment. Furthermore, the expectation E[(𝑋 − 𝜇)𝑘 ]
is called a 𝑘th central moment. Thus, the variance is the second central
moment.
Using the random sample 𝑋1 , … , 𝑋𝑛 , we may construct the corresponding sam-
𝑛
ple moment, 𝜇𝑘̂ = (1/𝑛) ∑𝑖=1 𝑋𝑖𝑘 , for estimating the population attribute 𝜇𝑘 .
For example, we have used the sample mean 𝑋̄ as an estimator for the pop-
ulation mean 𝜇. Similarly, the second central moment can be estimated as
𝑛
(1/𝑛) ∑𝑖=1 (𝑋𝑖 − 𝑋)̄ 2 . Without assuming a parametric form for 𝐹 (⋅), the sample
moments constitute nonparametric estimates of the corresponding population
attributes. Such an estimator based on matching of the corresponding sample
and population moments is called a method of moments estimator (mme).

While the mme works naturally in a nonparametric model, it can be used to


estimate parameters when a specific parametric family of distribution is assumed
for 𝐹 (⋅). Denote by 𝜃 = (𝜃1 , ⋯ , 𝜃𝑚 ) the vector of parameters corresponding to
a parametric distribution 𝐹 (⋅). Given a distribution family, we commonly know
the relationships between the parameters and the moments. In particular, we
know the specific forms of the functions ℎ1 (⋅), ℎ2 (⋅), ⋯ , ℎ𝑚 (⋅) such that 𝜇1 =
ℎ1 (𝜃), 𝜇2 = ℎ2 (𝜃), ⋯ , 𝜇𝑚 = ℎ𝑚 (𝜃). Given the mme 𝜇1̂ , … , 𝜇𝑚 ̂ from the random
sample, the mme of the parameters 𝜃1̂ , ⋯ , 𝜃𝑚 ̂ can be obtained by solving the
equations of
$$
\hat{\mu}_1 = h_1(\hat{\theta}_1, \cdots, \hat{\theta}_m); \quad
\hat{\mu}_2 = h_2(\hat{\theta}_1, \cdots, \hat{\theta}_m); \quad \cdots \quad
\hat{\mu}_m = h_m(\hat{\theta}_1, \cdots, \hat{\theta}_m).
$$

Example – Wisconsin Property Fund. Assume that the claims follow a


lognormal distribution, so that logarithmic claims follow a normal distribution.
Specifically, assume log(𝑋) has a normal distribution with mean 𝜇 and variance 𝜎², denoted as log(𝑋) ∼ 𝑁(𝜇, 𝜎²). It is straightforward that the mme are $\hat{\mu} = \bar{X}$ and $\hat{\sigma} = \sqrt{(1/n)\sum_{i=1}^{n}(X_i - \bar{X})^2}$. For the Wisconsin Property Fund example, the method of moments estimates are 𝜇̂ = 7.804 and 𝜎̂ = 1.683.
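A sketch of the computation is given below. The 2010 claims are not reproduced in this extract, so a simulated stand-in vector is used; with the actual data the results are 7.804 and 1.683.

set.seed(2020)
claims <- rlnorm(1377, meanlog = 7.8, sdlog = 1.7)   # stand-in for the 2010 claims

logclaims <- log(claims)
n <- length(logclaims)
mu_hat    <- mean(logclaims)                         # mme of mu
sigma_hat <- sqrt(mean((logclaims - mu_hat)^2))      # mme of sigma (1/n divisor)
exp(mu_hat + sigma_hat^2 / 2)                        # implied lognormal mean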

15.2.2 Maximum Likelihood Estimation


When 𝐹 (⋅) takes a parametric form, the maximum likelihood method is widely
used for estimating the population parameters 𝜃. Maximum likelihood estima-
tion is based on the likelihood function, a function of the parameters given the
observed sample. Denote by 𝑓(𝑥𝑖 |𝜃) the probability function of 𝑋𝑖 evaluated
at 𝑋𝑖 = 𝑥𝑖 (𝑖 = 1, 2, ⋯ , 𝑛); it is the probability mass function in the case of
a discrete 𝑋 and the probability density function in the case of a continuous
𝑋. Assuming independence, the likelihood function of 𝜃 associated with the
observation (𝑋1 , 𝑋2 , ⋯ , 𝑋𝑛 ) = (𝑥1 , 𝑥2 , ⋯ , 𝑥𝑛 ) = x can be written as
𝑛
𝐿(𝜃|x) = ∏ 𝑓(𝑥𝑖 |𝜃),
𝑖=1

with the corresponding log-likelihood function given by


𝑛
𝑙(𝜃|x) = log(𝐿(𝜃|x)) = ∑ log 𝑓(𝑥𝑖 |𝜃).
𝑖=1

The maximum likelihood estimator (mle) of 𝜃 is the set of values of 𝜃 that


maximize the likelihood function (log-likelihood function), given the observed
sample. That is, the mle 𝜃 ̂ can be written as

𝜃 ̂ = argmax𝜃∈Θ 𝑙(𝜃|x),

where Θ is the parameter space of 𝜃, and argmax𝜃∈Θ 𝑙(𝜃|x) is defined as the value
of 𝜃 at which the function 𝑙(𝜃|x) reaches its maximum.
Given the analytical form of the likelihood function, the mle can be obtained by
taking the first derivative of the log-likelihood function with respect to 𝜃, and
setting the values of the partial derivatives to zero. That is, the mle are the
solutions of the equations of

$$
\frac{\partial l(\hat{\theta}|\mathbf{x})}{\partial \hat{\theta}_1} = 0; \quad
\frac{\partial l(\hat{\theta}|\mathbf{x})}{\partial \hat{\theta}_2} = 0; \quad \cdots \quad
\frac{\partial l(\hat{\theta}|\mathbf{x})}{\partial \hat{\theta}_m} = 0,
$$
provided that the second partial derivatives are negative.
For parametric models, the mle of the parameters can be obtained either an-
alytically (e.g., in the case of normal distributions and linear estimators), or
numerically through iterative algorithms such as the Newton-Raphson method
and its adaptive versions (e.g., in the case of generalized linear models with a
non-normal response variable).
Normal distribution. Assume (𝑋1 , 𝑋2 , ⋯ , 𝑋𝑛 ) to be a random sample from
the normal distribution 𝑁 (𝜇, 𝜎2 ). With an observed sample (𝑋1 , 𝑋2 , ⋯ , 𝑋𝑛 ) =
(𝑥1 , 𝑥2 , ⋯ , 𝑥𝑛 ), we can write the likelihood function of 𝜇, 𝜎2 as
$$
L(\mu, \sigma^2) = \prod_{i=1}^{n} \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) \right],
$$

with the corresponding log-likelihood function given by

$$
l(\mu, \sigma^2) = -\frac{n}{2}\left[ \log(2\pi) + \log(\sigma^2) \right] - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2.
$$

By solving

$$
\frac{\partial l(\hat{\mu}, \sigma^2)}{\partial \hat{\mu}} = 0,
$$

we obtain $\hat{\mu} = \bar{x} = (1/n)\sum_{i=1}^{n} x_i$. It is straightforward to verify that $\partial^2 l(\hat{\mu}, \sigma^2)/\partial \hat{\mu}^2 \,\big|_{\hat{\mu} = \bar{x}} < 0$. Since this works for arbitrary $\mathbf{x}$, $\hat{\mu} = \bar{X}$ is the mle of $\mu$.

Similarly, by solving

$$
\frac{\partial l(\mu, \hat{\sigma}^2)}{\partial \hat{\sigma}^2} = 0,
$$

we obtain $\hat{\sigma}^2 = (1/n)\sum_{i=1}^{n} (x_i - \mu)^2$. Further replacing $\mu$ by $\hat{\mu}$, we derive the mle of $\sigma^2$ as $\hat{\sigma}^2 = (1/n)\sum_{i=1}^{n} (X_i - \bar{X})^2$.

Hence, the sample mean 𝑋̄ and 𝜎̂ 2 are both the mme and MLE for the mean
𝜇 and variance 𝜎2 , under a normal population distribution 𝐹 (⋅). More details
regarding the properties of the likelihood function are given in Appendix Section
17.1.
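As a hedged numerical check of the analytic result, the log-likelihood can also be maximized directly with optim(), continuing with the stand-in logclaims vector from the method of moments sketch; the log standard deviation is used as the working parameter to keep it positive.

negloglik <- function(par) {
  # negative log-likelihood of a normal model for the log claims;
  # par[1] is the mean, exp(par[2]) is the standard deviation
  -sum(dnorm(logclaims, mean = par[1], sd = exp(par[2]), log = TRUE))
}
fit <- optim(c(7, 0.5), negloglik)
c(mu = fit$par[1], sigma = exp(fit$par[2]))   # numerically close to the analytic mle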

15.3 Interval Estimation

In this section, you learn how to

• derive the exact sampling distribution of the mle of the normal mean
• obtain the large-sample approximation of the sampling distribution using
the large sample properties of the mle
• construct a confidence interval of a parameter based on the large sample
properties of the mle

Now that we have introduced the mme and mle, we may perform the first type
of statistical inference, interval estimation that quantifies the uncertainty
resulting from the use of a finite sample. By deriving the sampling distribution of
mle, we can estimate an interval (a confidence interval) for the parameter. Under
the frequentist approach (e.g., that based on maximum likelihood estimation),
the confidence intervals generated from the same random sampling frame will
cover the true value the majority of times (e.g., 95% of the times), if we repeat
the sampling process and re-calculate the interval over and over again. Such a
process requires the derivation of the sampling distribution for the mle.

15.3.1 Exact Distribution for Normal Sample Mean


Due to the additivity property of the normal distribution (i.e., a sum of normal
random variables that follows a multivariate normal distribution still follows a
normal distribution) and that the normal distribution belongs to the location–
scale family (i.e., a location and/or scale transformation of a normal random
variable has a normal distribution), the sample mean 𝑋̄ of a random sample
from a normal 𝐹 (⋅) has a normal sampling distribution for any finite 𝑛. Given
𝑋𝑖 ∼𝑖𝑖𝑑 𝑁 (𝜇, 𝜎2 ), 𝑖 = 1, … , 𝑛, the mle of 𝜇 has an exact distribution

$$
\bar{X} \sim N\left( \mu, \frac{\sigma^2}{n} \right).
$$

Hence, the sample mean is an unbiased estimator of 𝜇. In addition, the uncer-


tainty in the estimation can be quantified by its variance 𝜎2 /𝑛, that decreases
with the sample size 𝑛. When the sample size goes to infinity, the sample mean
will approach a single mass at the true value.

15.3.2 Large-sample Properties of MLE


For the mle of the mean parameter and any other parameters of other paramet-
ric distribution families, however, we usually cannot derive an exact sampling
distribution for finite samples. Fortunately, when the sample size is sufficiently
large, mles can be approximated by a normal distribution. Due to the general
maximum likelihood theory, the mle has some nice large-sample properties.
• The mle 𝜃 ̂ of a parameter 𝜃, is a consistent estimator. That is, 𝜃 ̂ converges
in probability to the true value 𝜃, as the sample size 𝑛 goes to infinity.
• The mle has the asymptotic normality property, meaning that the es-
timator will converge in distribution to a normal distribution centered
around the true value, when the sample size goes to infinity. Namely,

$\sqrt{n}(\hat{\theta} - \theta) \rightarrow_d N(0, V)$, as $n \rightarrow \infty$,

where 𝑉 is the inverse of the Fisher Information. Hence, the mle 𝜃 ̂ ap-
proximately follows a normal distribution with mean 𝜃 and variance 𝑉 /𝑛,
when the sample size is large.
• The mle is efficient, meaning that it has the smallest asymptotic vari-
ance 𝑉 , commonly referred to as the Cramer–Rao lower bound. In
particular, the Cramer–Rao lower bound is the inverse of the Fisher infor-
mation defined as ℐ(𝜃) = −E(𝜕 2 log 𝑓(𝑋; 𝜃)/𝜕𝜃2 ). Hence, Var(𝜃)̂ can be
estimated based on the observed Fisher information that can be written
𝑛
as − ∑𝑖=1 𝜕 2 log 𝑓(𝑋𝑖 ; 𝜃)/𝜕𝜃2 .
For many parametric distributions, the Fisher information may be derived ana-
lytically for the mle of parameters. For more sophisticated parametric models,
the Fisher information can be evaluated numerically using numerical integration
for continuous distributions, or numerical summation for discrete distributions.
More details regarding maximum likelihood estimation are given in Appendix
Section 17.2.

15.3.3 Confidence Interval


Given that the mle 𝜃 ̂ has either an exact or an approximate normal distribution
with mean 𝜃 and variance Var(𝜃), ̂ we may take the square root of the variance
and plug-in the estimate to define 𝑠𝑒(𝜃)̂ = √Var(𝜃). ̂ A standard error is an
estimated standard deviation that quantifies the uncertainty in the estimation
resulting from the use of a finite sample. Under some regularity conditions
governing the population distribution, we may establish that the statistic

$$
\frac{\hat{\theta} - \theta}{se(\hat{\theta})}
$$

converges in distribution to a Student-𝑡 distribution with degrees of freedom (a


parameter of the distribution) 𝑛 − 𝑝, where 𝑝 is the number of parameters in
the model other than the variance. For example, for the normal distribution
case, we have 𝑝 = 1 for the parameter 𝜇; for a linear regression model with an
independent variable, we have 𝑝 = 2 for the parameters of the intercept and the
independent variable. Denote by 𝑡𝑛−𝑝 (1 − 𝛼/2) the 100 × (1 − 𝛼/2)-th percentile
of the Student-𝑡 distribution that satisfies Pr [𝑡 < 𝑡𝑛−𝑝 (1 − 𝛼/2)] = 1 − 𝛼/2.
We have,

$$
\Pr\left[ -t_{n-p}\left(1 - \frac{\alpha}{2}\right) < \frac{\hat{\theta} - \theta}{se(\hat{\theta})} < t_{n-p}\left(1 - \frac{\alpha}{2}\right) \right] = 1 - \alpha,
$$

from which we can derive a confidence interval for 𝜃. From the above equa-
tion we can derive a pair of statistics, 𝜃1̂ and 𝜃2̂ , that provide an interval of
the form [𝜃1̂ , 𝜃2̂ ]. This interval is a 1 − 𝛼 confidence interval for 𝜃 such that
Pr (𝜃1̂ ≤ 𝜃 ≤ 𝜃2̂ ) = 1 − 𝛼, where the probability 1 − 𝛼 is referred to as the con-
fidence level. Note that the above confidence interval is not valid for small
samples, except for the case of the normal mean.
Normal distribution. For the normal population mean 𝜇, the mle has an exact sampling distribution $\bar{X} \sim N(\mu, \sigma^2/n)$, in which we can estimate $se(\hat{\theta})$ by $\hat{\sigma}/\sqrt{n}$. Based on the Cochran's theorem, the resulting statistic has an exact Student-𝑡 distribution with degrees of freedom 𝑛 − 1. Hence, we can derive the lower and upper bounds of the confidence interval as

$$
\hat{\mu}_1 = \hat{\mu} - t_{n-1}\left(1 - \frac{\alpha}{2}\right)\frac{\hat{\sigma}}{\sqrt{n}}
$$

and

$$
\hat{\mu}_2 = \hat{\mu} + t_{n-1}\left(1 - \frac{\alpha}{2}\right)\frac{\hat{\sigma}}{\sqrt{n}}.
$$

When 𝛼 = 0.05, 𝑡𝑛−1 (1 − 𝛼/2) ≈ 1.96 for large values of 𝑛. Based on the
Cochran’s theorem, the confidence interval is valid regardless of the sample size.

Example – Wisconsin Property Fund. For the lognormal claim model,


(7.715235, 7.893208) is a 95% confidence interval for 𝜇.
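A sketch of the interval computation, continuing with the stand-in logclaims, mu_hat, sigma_hat, and n defined earlier, is shown below; with the actual Fund data it reproduces (7.715235, 7.893208).

alpha <- 0.05
mu_hat + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * sigma_hat / sqrt(n)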
More details regarding interval estimation based the mle of other parameters
and distribution families are given in Appendix Chapter 17.

15.4 Hypothesis Testing

In this section, you learn how to


• understand the basic concepts in hypothesis testing including the level of
significance and the power of a test
• perform hypothesis testing such as a Student-𝑡 test based on the properties
of the mle
• construct a likelihood ratio test for a single parameter or multiple param-
eters from the same statistical model
• use information criteria such as the Akaike’s information criterion or the
Bayesian information criterion to perform model selection

For the parameter(s) 𝜃 from a parametric distribution, an alternative type of


statistical inference is called hypothesis testing that verifies whether a hy-
pothesis regarding the parameter(s) is true, under a given probability called the
level of significance 𝛼 (e.g., 5%). In hypothesis testing, we reject the null hy-
pothesis, a restrictive statement concerning the parameter(s), if the probability
of observing a random sample as extremal as the observed one is smaller than
𝛼, if the null hypothesis were true.

15.4.1 Basic Concepts


In a statistical test, we are usually interested in testing whether a statement
regarding some parameter(s), a null hypothesis (denoted 𝐻0 ), is true given
the observed data. The null hypothesis can take a general form 𝐻0 ∶ 𝜃 ∈ Θ0 ,
where Θ0 is a subset of the parameter space Θ of 𝜃 that may contain multiple
parameters. For the case with a single parameter 𝜃, the null hypothesis usually
takes either the form 𝐻0 ∶ 𝜃 = 𝜃0 or 𝐻0 ∶ 𝜃 ≤ 𝜃0 . The opposite of the null
hypothesis is called the alternative hypothesis that can be written as 𝐻𝑎 ∶
𝜃 ≠ 𝜃₀ or 𝐻ₐ: 𝜃 > 𝜃₀. The statistical test on 𝐻₀: 𝜃 = 𝜃₀ is called a two-sided test, as the alternative hypothesis contains the two inequalities 𝐻ₐ: 𝜃 < 𝜃₀ or 𝜃 > 𝜃₀.
In contrast, the statistical test on either 𝐻0 ∶ 𝜃 ≤ 𝜃0 or 𝐻0 ∶ 𝜃 ≥ 𝜃0 is called a
one-sided test.
A statistical test is usually constructed based on a statistic 𝑇 and its exact
or large-sample distribution. The test typically rejects a two-sided test when
either 𝑇 > 𝑐1 or 𝑇 < 𝑐2 , where the two constants 𝑐1 and 𝑐2 are obtained based
on the sampling distribution of 𝑇 at a probability level 𝛼 called the level of
significance. In particular, the level of significance 𝛼 satisfies

𝛼 = Pr(reject 𝐻0 |𝐻0 is true),

meaning that if the null hypothesis were true, we would reject the null hypothesis

only 5% of the times, if we repeat the sampling process and perform the test
over and over again.
Thus, the level of significance is the probability of making a type I error (error
of the first kind), the error of incorrectly rejecting a true null hypothesis. For
this reason, the level of significance 𝛼 is also referred to as the type I error
rate. Another type of error we may make in hypothesis testing is the type II
error (error of the second kind), the error of incorrectly accepting a false null
hypothesis. Similarly, we can define the type II error rate as the probability
of not rejecting (accepting) a null hypothesis given that it is not true. That is,
the type II error rate is given by

Pr(accept 𝐻0 |𝐻0 is false).

Another important quantity concerning the quality of the statistical test is called
the power of the test 𝛽, defined as the probability of rejecting a false null
hypothesis. The mathematical definition of the power is

𝛽 = Pr(reject 𝐻0 |𝐻0 is false).

Note that the power of the test is typically calculated based on a specific alter-
native value of 𝜃 = 𝜃𝑎 , given a specific sampling distribution and a given sample
size. In real experimental studies, people usually calculate the required sample
size in order to choose a sample size that will ensure a large chance of obtaining
a statistically significant test (i.e., with a prespecified statistical power such as
85%).

15.4.2 Student-t test based on mle


Based on the results from Section 15.3.1, we can define a Student-𝑡 test for
testing 𝐻0 ∶ 𝜃 = 𝜃0 . In particular, we define the test statistic as

$$
t\textrm{-stat} = \frac{\hat{\theta} - \theta_0}{se(\hat{\theta})},
$$

which has a large-sample distribution of a student-𝑡 distribution with degrees of


freedom 𝑛 − 𝑝, when the null hypothesis is true (i.e., when 𝜃 = 𝜃0 ).
For a given level of significance 𝛼, say 5%, we reject the null hypothesis if the
event 𝑡-stat < −𝑡𝑛−𝑝 (1 − 𝛼/2) or 𝑡-stat > 𝑡𝑛−𝑝 (1 − 𝛼/2) occurs (the rejection
region). Under the null hypothesis 𝐻0 , we have

$$
\Pr\left[ t\textrm{-stat} < -t_{n-p}\left(1 - \frac{\alpha}{2}\right) \right] = \Pr\left[ t\textrm{-stat} > t_{n-p}\left(1 - \frac{\alpha}{2}\right) \right] = \frac{\alpha}{2}.
$$

In addition to the concept of rejection region, we may reject the test based on
the 𝑝-value defined as 2 Pr(𝑇 > |𝑡-stat|) for the aforementioned two-sided test,
where the random variable 𝑇 ∼ 𝑇𝑛−𝑝 . We reject the null hypothesis if 𝑝-value
is smaller than and equal to 𝛼. For a given sample, a 𝑝-value is defined to be
the smallest significance level for which the null hypothesis would be rejected.
Similarly, we can construct a one-sided test for the null hypothesis 𝐻0 ∶ 𝜃 ≤ 𝜃0
(or 𝐻0 ∶ 𝜃 ≥ 𝜃0 ). Using the same test statistic, we reject the null hypothesis
when 𝑡-stat > 𝑡𝑛−𝑝 (1 − 𝛼) (or 𝑡-stat < −𝑡𝑛−𝑝 (1 − 𝛼) for the test on 𝐻0 ∶ 𝜃 ≥ 𝜃0 ).
The corresponding 𝑝-value is defined as Pr(𝑇 > 𝑡-stat) (or Pr(𝑇 < 𝑡-stat) for the test on 𝐻₀: 𝜃 ≥ 𝜃₀). Note that the test is not valid for small samples, except
for the case of the test on the normal mean.
One-sample 𝑡 Test for Normal Mean. For the test on the normal mean
of the form 𝐻0 ∶ 𝜇 = 𝜇0 , 𝐻0 ∶ 𝜇 ≤ 𝜇0 or 𝐻0 ∶ 𝜇 ≥ 𝜇0 , we can define the test
statistic as

$$
t\textrm{-stat} = \frac{\bar{X} - \mu_0}{\hat{\sigma}/\sqrt{n}},
$$

for which we have an exact sampling distribution 𝑡-stat ∼ 𝑇𝑛−1 from the
Cochran’s theorem, with 𝑇𝑛−1 denoting a Student-𝑡 distribution with degrees of
freedom 𝑛 − 1. According to the Cochran’s theorem, the test is valid for both
small and large samples.
Example – Wisconsin Property Fund. Assume that mean logarithmic
claims have historically been approximately 𝜇0 = log(5000) = 8.517. We
might want to use the 2010 data to assess whether the mean of the distribution
has changed (a two-sided test), or whether it has increased (a one-sided test).
Given the actual 2010 average 𝜇̂ = 7.804, we may use the one-sample 𝑡 test to
assess whether this is a significant departure from 𝜇0 = 8.517 (i.e., in testing
𝐻0 ∶ 𝜇 = 8.517). The test statistic satisfies |𝑡-stat| = |7.804 − 8.517|/(1.683/√1377) =
15.72 > 𝑡1376 (0.975). Hence, we reject the null hypothesis in the two-sided test at
𝛼 = 5%. Similarly, we reject in the one-sided test at 𝛼 = 5%.
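
This calculation can be reproduced in R. The sketch below is illustrative only: it assumes the 2010 logarithmic claims are stored in a (hypothetical) numeric vector named logclaim, so its output matches the figures above only when applied to the actual Fund data.

# One-sample t test of H0: mu = log(5000), assuming a (hypothetical) vector
# logclaim containing the 2010 logarithmic claims.
mu0 <- log(5000)
t.test(logclaim, mu = mu0, alternative = "two.sided")   # two-sided test
t.test(logclaim, mu = mu0, alternative = "greater")     # one-sided test
# Hand computation of the test statistic and the 5% two-sided cut-off
tstat <- (mean(logclaim) - mu0) / (sd(logclaim) / sqrt(length(logclaim)))
qt(0.975, df = length(logclaim) - 1)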
Example – Wisconsin Property Fund. For numerical stability and exten-
sions to regression applications, statistical packages often work with transformed
versions of parameters. The following estimates are from the R package VGAM
(the function). More details on the mle of other distribution families are given
in Appendix Chapter 17.

Distribution   Parameter Estimate   Standard Error   𝑡-stat
Gamma          10.190               0.050            203.831
               -1.236               0.030            -41.180
Lognormal      7.804                0.045            172.089
               0.520                0.019            27.303
Pareto         7.733                0.093            82.853
               -0.001               0.054            -0.016
GB2            2.831                1.000            2.832
               1.203                0.292            4.120
               6.329                0.390            16.220
               1.295                0.219            5.910

15.4.3 Likelihood Ratio Test


In the previous subsection, we have introduced the Student-𝑡 test on a single
parameter, based on the properties of the mle. In this section, we define an
alternative test called the likelihood ratio test (LRT). The LRT may be used
to test multiple parameters from the same statistical model.

Given the likelihood function 𝐿(𝜃|x) and Θ0 ⊂ Θ, the likelihood ratio test
statistic for testing 𝐻0 ∶ 𝜃 ∈ Θ0 against 𝐻𝑎 ∶ 𝜃 ∉ Θ0 is given by

𝐿 = sup𝜃∈Θ0 𝐿(𝜃|x) / sup𝜃∈Θ 𝐿(𝜃|x),

and that for testing 𝐻0 ∶ 𝜃 = 𝜃0 versus 𝐻𝑎 ∶ 𝜃 ≠ 𝜃0 is

𝐿 = 𝐿(𝜃0 |x) / sup𝜃∈Θ 𝐿(𝜃|x).

The LRT rejects the null hypothesis when 𝐿 < 𝑐, with the threshold depending
on the level of significance 𝛼, the sample size 𝑛, and the number of parameters
in 𝜃. Based on the Neyman–Pearson Lemma, the LRT is the uniformly
most powerful test for testing 𝐻0 ∶ 𝜃 = 𝜃0 versus 𝐻𝑎 ∶ 𝜃 = 𝜃𝑎 . That is, it
provides the largest power 𝛽 for a given 𝛼 and a given alternative value 𝜃𝑎 .

Based on the Wilks’s Theorem, the likelihood ratio test statistic −2 log(𝐿)
converges in distribution to a Chi-square distribution with the degree of freedom
being the difference between the dimensionality of the parameter spaces Θ and
Θ0 , when the sample size goes to infinity and when the null model is nested
within the alternative model. That is, when the null model is a special case of
the alternative model containing a restricted sample space, we may approximate
𝑐 by 𝜒2𝑝1 −𝑝2 (1 − 𝛼), the 100 × (1 − 𝛼) th percentile of the Chi-square distribution,
with 𝑝1 − 𝑝2 being the degrees of freedom, and 𝑝1 and 𝑝2 being the numbers of
parameters in the alternative and null models, respectively. Note that the LRT
is also a large-sample test that will not be valid for small samples.
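
As a concrete illustration of the LRT, the following R sketch compares an exponential null model (a gamma with shape parameter equal to one) against a gamma alternative. The data are simulated, so the numbers are illustrative rather than taken from the text.

# Likelihood ratio test of an exponential null model against a gamma
# alternative, using simulated data for illustration.
library(MASS)
set.seed(2020)
x <- rgamma(500, shape = 2, rate = 1)
fit_gamma <- fitdistr(x, "gamma")        # alternative model, 2 parameters
fit_exp   <- fitdistr(x, "exponential")  # null model, 1 parameter
LRT <- 2 * (as.numeric(logLik(fit_gamma)) - as.numeric(logLik(fit_exp)))
LRT
qchisq(0.95, df = 1)   # critical value; reject the exponential if LRT exceeds it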

15.4.4 Information Criteria


In real-life applications, the LRT is commonly used for comparing two nested
models. As a model selection tool, however, the LRT has two major drawbacks:
1) it typically requires the null model to be nested within the alternative model;
2) models selected by the LRT tend to overfit in-sample, leading to poor
out-of-sample prediction. To overcome these issues, model selection based on
information criteria, which applies to non-nested models while taking model
complexity into consideration, is more widely used. Here, we introduce the two
most widely used criteria, Akaike's information criterion and the Bayesian
information criterion.
In particular, Akaike's information criterion (𝐴𝐼𝐶) is defined as

𝐴𝐼𝐶 = −2 log 𝐿(𝜃)̂ + 2𝑝,

where 𝜃 ̂ denotes the mle of 𝜃, and 𝑝 is the number of parameters in the model.
The additional term 2𝑝 represents a penalty for the complexity of the model.
That is, with the same maximized likelihood, the 𝐴𝐼𝐶 favors the model with
fewer parameters. We note that the 𝐴𝐼𝐶 does not take the sample size 𝑛 into account.
Alternatively, people use the Bayesian information criterion (𝐵𝐼𝐶) that
takes into consideration the sample size. The 𝐵𝐼𝐶 is defined as

𝐵𝐼𝐶 = −2 log 𝐿(𝜃)̂ + 𝑝 log(𝑛).

We observe that the 𝐵𝐼𝐶 generally puts a higher weight on the number of
parameters. With the same maximized likelihood function, the 𝐵𝐼𝐶 will suggest
a more parsimonious model than the 𝐴𝐼𝐶.
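
The following R sketch shows how the 𝐴𝐼𝐶 and 𝐵𝐼𝐶 can be computed for fitted severity models. It uses simulated data and the MASS package, so it is a sketch of the mechanics rather than a reproduction of the Fund results reported below.

# AIC and BIC for two fitted severity models, using simulated data.
library(MASS)
set.seed(2020)
y <- rlnorm(1000, meanlog = 8, sdlog = 1.3)   # illustrative positive losses
fit_ln  <- fitdistr(y, "lognormal")
fit_exp <- fitdistr(y, "exponential")
n <- length(y)
ic <- function(fit) {
  p <- length(fit$estimate)                   # number of parameters
  c(AIC = -2 * fit$loglik + 2 * p,
    BIC = -2 * fit$loglik + p * log(n))
}
ic(fit_ln)
ic(fit_exp)
# The model with the smaller AIC (or BIC) is preferred.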
Example – Wisconsin Property Fund. Both the 𝐴𝐼𝐶 and 𝐵𝐼𝐶 statistics
suggest that the GB2 is the best fitting model whereas gamma is the worst.

Distribution AIC BIC


Gamma 28,305.2 28,315.6
Lognormal 26,837.7 26,848.2
Pareto 26,813.3 26,823.7
GB2 26,768.1 26,789.0

In Figure 15.2,
• black represents the actual (smoothed) distribution of logarithmic claims
• the best approximation is the fitted GB2 (green)
• the Pareto (purple) and lognormal (light blue) fits are also quite good
• the worst fits are the exponential (red) and gamma (dark blue)

Figure 15.2: Fitted Claims Distribution (fitted densities plotted against log expenditures)

## Sample size: 6258


R Code for Fitted Claims Distributions

Contributors
• Lei (Larry) Hua, Northern Illinois University, and Edward W. (Jed)
Frees, University of Wisconsin-Madison, are the principal authors of the
initial version of this chapter. Email: [email protected] or [email protected]
for chapter comments and suggested improvements.
Chapter 16

Appendix B: Iterated
Expectations

This appendix introduces the laws related to iterated expectations. In particular,


Section 16.1 introduces the concepts of conditional distribution and conditional
expectation. Section 16.2 introduces the Law of Iterated Expectations and the
Law of Total Variance.
In some situations, we only observe a single outcome but can conceptualize
that outcome as resulting from a two (or more) stage process. Such statistical
models are called two-stage, or hierarchical, models. Some special cases of
hierarchical models include:
• models where the parameters of the distribution are random variables;
• mixture distributions, where Stage 1 represents the draw of a sub-population
and Stage 2 represents a random variable from a distribution determined by
the sub-population drawn in Stage 1;
• aggregate distributions, where Stage 1 represents the draw of the number of
events and Stage 2 represents the loss amount incurred per event.
In these situations, the process gives rise to a conditional distribution of a
random variable (the Stage 2 outcome) given the other (the Stage 1 outcome).
The Law of Iterated Expectations can be useful for obtaining the unconditional
expectation or variance of a random variable in such cases.

16.1 Conditional Distribution and Conditional Expectation

In this section, you learn


• the concepts related to the conditional distribution of a random variable given another
• how to define the conditional expectation and variance based on the conditional distribution function

The laws of iterated expectations concern the calculation of the expectation
and variance of a random variable using a conditional distribution of the variable
given another variable. Hence, we first introduce the concepts related to the
conditional distribution, and the calculation of the conditional expectation and
variance based on a given conditional distribution.

16.1.1 Conditional Distribution


Here we introduce the concept of a conditional distribution, first for discrete
and then for continuous random variables.

Discrete Case
Suppose that 𝑋 and 𝑌 are both discrete random variables, meaning that they
can take a finite or countable number of possible values with a positive proba-
bility. The joint probability (mass) function of (𝑋, 𝑌 ) is defined as

𝑝(𝑥, 𝑦) = Pr[𝑋 = 𝑥, 𝑌 = 𝑦].

When 𝑋 and 𝑌 are independent (the value of 𝑋 does not depend on that of
𝑌 ), we have

𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦),

with 𝑝(𝑥) = Pr[𝑋 = 𝑥] and 𝑝(𝑦) = Pr[𝑌 = 𝑦] being the marginal probability
function of 𝑋 and 𝑌 , respectively.
Given the joint probability function, we may obtain the marginal probability
functions of 𝑌 as

𝑝(𝑦) = ∑𝑥 𝑝(𝑥, 𝑦),

where the summation is over all possible values of 𝑥, and the marginal probabil-
ity function of 𝑋 can be obtained in a similar manner.
The conditional probability (mass) function of (𝑌 |𝑋) is defined as

𝑝(𝑦|𝑥) = Pr[𝑌 = 𝑦|𝑋 = 𝑥] = 𝑝(𝑥, 𝑦)/Pr[𝑋 = 𝑥],

where the conditional probability function of (𝑋|𝑌 ) may be obtained in a
similar manner. In particular, the above conditional probability represents the
probability of the event 𝑌 = 𝑦 given the event 𝑋 = 𝑥. In applications, the
conditional probability function may also be specified directly in a particular
form, even in cases where Pr[𝑋 = 𝑥] = 0.
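
As a small numerical illustration (with a made-up joint pmf, not data from the text), the conditional pmf can be computed in R by dividing the joint probabilities by the marginal probability of the conditioning event.

# Conditional pmf p(y|x) from a made-up joint pmf of (X, Y).
joint <- matrix(c(0.10, 0.20, 0.05,
                  0.30, 0.25, 0.10),
                nrow = 2, byrow = TRUE,
                dimnames = list(X = c("x1", "x2"), Y = c("y1", "y2", "y3")))
p_x <- rowSums(joint)                        # marginal pmf of X
cond_y_given_x <- sweep(joint, 1, p_x, "/")  # row i gives p(y | X = x_i)
rowSums(cond_y_given_x)                      # each row sums to 1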

Continuous Case
For continuous random variables 𝑋 and 𝑌 , we may define their joint probability
(density) function based on the joint cumulative distribution function. The joint
cumulative distribution function of (𝑋, 𝑌 ) is defined as

𝐹 (𝑥, 𝑦) = Pr[𝑋 ≤ 𝑥, 𝑌 ≤ 𝑦].

When 𝑋 and 𝑌 are independent, we have

𝐹 (𝑥, 𝑦) = 𝐹 (𝑥)𝐹 (𝑦),

with 𝐹 (𝑥) = Pr[𝑋 ≤ 𝑥] and 𝐹 (𝑦) = Pr[𝑌 ≤ 𝑦] being the cumulative distri-
bution function (cdf) of 𝑋 and 𝑌 , respectively. The random variable 𝑋 is
referred to as a continuous random variable if its cdf is continuous on 𝑥.
When the cdf 𝐹 (𝑥) is continuous on 𝑥, then we define 𝑓(𝑥) = 𝜕𝐹 (𝑥)/𝜕𝑥 as the
(marginal) probability density function (pdf) of 𝑋. Similarly, if the joint
cdf 𝐹 (𝑥, 𝑦) is continuous on both 𝑥 and 𝑦, we define

𝑓(𝑥, 𝑦) = 𝜕²𝐹(𝑥, 𝑦)/(𝜕𝑥 𝜕𝑦)

as the joint probability density function of (𝑋, 𝑌 ), in which case we refer


to the random variables as jointly continuous.
When 𝑋 and 𝑌 are independent, we have

𝑓(𝑥, 𝑦) = 𝑓(𝑥)𝑓(𝑦).

Given the joint density function, we may obtain the marginal density function
of 𝑌 as

𝑓(𝑦) = ∫𝑥 𝑓(𝑥, 𝑦) 𝑑𝑥,

where the integral is over all possible values of 𝑥, and the marginal density
function of 𝑋 can be obtained in a similar manner.

Based on the joint pdf and the marginal pdf, we define the conditional prob-
ability density function of (𝑌 |𝑋) as

𝑓(𝑦|𝑥) = 𝑓(𝑥, 𝑦)/𝑓(𝑥),

where the conditional density function of (𝑋|𝑌 ) may be obtained in a similar
manner. Here, the conditional density function is the density of 𝑌 given
𝑋 = 𝑥. In applications, the conditional density may also be specified directly
in a particular form, even in cases where Pr[𝑋 = 𝑥] = 0 or 𝑓(𝑥) is not defined.

16.1.2 Conditional Expectation and Conditional Variance


Now we define the conditional expectation and variance based on the conditional
distribution defined in the previous subsection.

Discrete Case
For a discrete random variable 𝑌 , its expectation is defined as E[𝑌 ] =
∑𝑦 𝑦 𝑝(𝑦) if its value is finite, and its variance is defined as Var[𝑌 ] =
E{(𝑌 − E[𝑌 ])2 } = ∑𝑦 𝑦2 𝑝(𝑦) − {E[𝑌 ]}2 if its value is finite.
For a discrete random variable 𝑌 , the conditional expectation of the random
variable 𝑌 given the event 𝑋 = 𝑥 is defined as

E[𝑌 |𝑋 = 𝑥] = ∑𝑦 𝑦 𝑝(𝑦|𝑥),

where 𝑋 does not have to be a discrete variable, as long as the conditional
probability function 𝑝(𝑦|𝑥) is given.
Note that the conditional expectation E[𝑌 |𝑋 = 𝑥] is a fixed number. When we
replace 𝑥 with 𝑋 on the right hand side of the above equation, we can define
the expectation of 𝑌 given the random variable 𝑋 as

E[𝑌 |𝑋] = ∑𝑦 𝑦 𝑝(𝑦|𝑋),

which is still a random variable, and the randomness comes from 𝑋.


In a similar manner, we can define the conditional variance of the random
variable 𝑌 given the event 𝑋 = 𝑥 as

Var[𝑌 |𝑋 = 𝑥] = E[𝑌²|𝑋 = 𝑥] − {E[𝑌 |𝑋 = 𝑥]}² = ∑𝑦 𝑦² 𝑝(𝑦|𝑥) − {E[𝑌 |𝑋 = 𝑥]}².

The variance of 𝑌 given 𝑋, Var[𝑌 |𝑋] can be defined by replacing 𝑥 by 𝑋 in the


above equation, and Var[𝑌 |𝑋] is still a random variable and the randomness
comes from 𝑋.

Continuous Case
For a continuous random variable 𝑌 , its expectation is defined as E[𝑌 ] =
∫𝑦 𝑦 𝑓(𝑦)𝑑𝑦 if the integral exists, and its variance is defined as Var[𝑌 ] = E{(𝑌 −
E[𝑌 ])²} = ∫𝑦 𝑦² 𝑓(𝑦)𝑑𝑦 − {E[𝑌 ]}² if its value is finite.
For jointly continuous random variables 𝑋 and 𝑌 , the conditional expecta-
tion of the random variable 𝑌 given 𝑋 = 𝑥 is defined as

E[𝑌 |𝑋 = 𝑥] = ∫𝑦 𝑦 𝑓(𝑦|𝑥) 𝑑𝑦,

where 𝑋 does not have to be a continuous variable, as long as the conditional
density function 𝑓(𝑦|𝑥) is given.
Similarly, the conditional expectation E[𝑌 |𝑋 = 𝑥] is a fixed number. When we
replace 𝑥 with 𝑋 on the right-hand side of the above equation, we can define
the expectation of 𝑌 given the random variable 𝑋 as

E[𝑌 |𝑋] = ∫𝑦 𝑦 𝑓(𝑦|𝑋) 𝑑𝑦,

which is still a random variable, and the randomness comes from 𝑋.


In a similar manner, we can define the conditional variance of the random
variable 𝑌 given the event 𝑋 = 𝑥 as

Var[𝑌 |𝑋 = 𝑥] = E[𝑌²|𝑋 = 𝑥] − {E[𝑌 |𝑋 = 𝑥]}² = ∫𝑦 𝑦² 𝑓(𝑦|𝑥) 𝑑𝑦 − {E[𝑌 |𝑋 = 𝑥]}².

The variance of 𝑌 given 𝑋, Var[𝑌 |𝑋] can then be defined by replacing 𝑥 by 𝑋


in the above equation, and similarly Var[𝑌 |𝑋] is also a random variable and the
randomness comes from 𝑋.

16.2 Iterated Expectations and Total Variance

In this section, you learn


• the Law of Iterated Expectations for calculating the expectation of a ran-
dom variable based on its conditional distribution given another random
variable

• the Law of Total Variance for calculating the variance of a random variable
based on its conditional distribution given another random variable
• how to calculate the expectation and variance based on an example of a
two-stage model

16.2.1 Law of Iterated Expectations


Consider two random variables 𝑋 and 𝑌 , and ℎ(𝑋, 𝑌 ), a random variable de-
pending on the function ℎ, 𝑋 and 𝑌 .

Assuming all the expectations exist and are finite, the Law of Iterated Ex-
pectations states that

E[ℎ(𝑋, 𝑌 )] = E {E [ℎ(𝑋, 𝑌 )|𝑋]} , (16.1)

where the first (inside) expectation is taken with respect to the random variable
𝑌 and the second (outside) expectation is taken with respect to 𝑋.

For the Law of Iterated Expectations, the random variables may be discrete,
continuous, or a hybrid combination of the two. We use the example of discrete
variables of 𝑋 and 𝑌 to illustrate the calculation of the unconditional expecta-
tion using the Law of Iterated Expectations. For continuous random variables,
we only need to replace the summation with the integral, as illustrated earlier
in the appendix.

Given 𝑝(𝑦|𝑥), the conditional pmf of 𝑌 given 𝑋 = 𝑥, the conditional expectation
of ℎ(𝑋, 𝑌 ) given the event 𝑋 = 𝑥 is defined as

E[ℎ(𝑋, 𝑌 )|𝑋 = 𝑥] = ∑𝑦 ℎ(𝑥, 𝑦) 𝑝(𝑦|𝑥),

and the conditional expectation of ℎ(𝑋, 𝑌 ) given 𝑋 being a random variable


can be written as

E[ℎ(𝑋, 𝑌 )|𝑋] = ∑𝑦 ℎ(𝑋, 𝑦) 𝑝(𝑦|𝑋).

The unconditional expectation of ℎ(𝑋, 𝑌 ) can then be obtained by taking the


expectation of E [ℎ(𝑋, 𝑌 )|𝑋] with respect to the random variable 𝑋. That is,
we can obtain E[ℎ(𝑋, 𝑌 )] as

E{E[ℎ(𝑋, 𝑌 )|𝑋]} = ∑𝑥 {∑𝑦 ℎ(𝑥, 𝑦) 𝑝(𝑦|𝑥)} 𝑝(𝑥)
 = ∑𝑥 ∑𝑦 ℎ(𝑥, 𝑦) 𝑝(𝑦|𝑥) 𝑝(𝑥)
 = ∑𝑥 ∑𝑦 ℎ(𝑥, 𝑦) 𝑝(𝑥, 𝑦) = E[ℎ(𝑋, 𝑌 )].

The Law of Iterated Expectations for the continuous and hybrid cases can be
proved in a similar manner, by replacing the corresponding summation(s) by
integral(s).

16.2.2 Law of Total Variance


Assuming that all the variances exist and are finite, the Law of Total Variance
states that

Var[ℎ(𝑋, 𝑌 )] = E {Var [ℎ(𝑋, 𝑌 )|𝑋]} + Var {E [ℎ(𝑋, 𝑌 )|𝑋]} , (16.2)

where the first (inside) expectation/variance is taken with respect to the ran-
dom variable 𝑌 and the second (outside) expectation/variance is taken with
respect to 𝑋. Thus, the unconditional variance equals the expectation of the
conditional variance plus the variance of the conditional expectation.

In order to verify this rule, first note that we can calculate a conditional variance
as

Var[ℎ(𝑋, 𝑌 )|𝑋] = E[ℎ(𝑋, 𝑌 )²|𝑋] − {E[ℎ(𝑋, 𝑌 )|𝑋]}².

From this, the expectation of the conditional variance is

E{Var[ℎ(𝑋, 𝑌 )|𝑋]} = E{E[ℎ(𝑋, 𝑌 )²|𝑋]} − E({E[ℎ(𝑋, 𝑌 )|𝑋]}²)
 = E[ℎ(𝑋, 𝑌 )²] − E({E[ℎ(𝑋, 𝑌 )|𝑋]}²).   (16.3)

Further, note that the conditional expectation, E [ℎ(𝑋, 𝑌 )|𝑋], is a function of


𝑋, denoted 𝑔(𝑋). Thus, 𝑔(𝑋) is a random variable with mean E[ℎ(𝑋, 𝑌 )] and
variance

Var{E[ℎ(𝑋, 𝑌 )|𝑋]} = Var[𝑔(𝑋)]
 = E[𝑔(𝑋)²] − {E[𝑔(𝑋)]}²
 = E({E[ℎ(𝑋, 𝑌 )|𝑋]}²) − {E[ℎ(𝑋, 𝑌 )]}².   (16.4)

Thus, adding Equations (16.3) and (16.4) leads to the unconditional variance
Var [ℎ(𝑋, 𝑌 )].

16.2.3 Application
To apply the Law of Iterated Expectations and the Law of Total Variance, we
generally adopt the following procedure.
1. Identify the random variable that is being conditioned upon, typically a
stage 1 outcome (that is not observed).
2. Conditional on the stage 1 outcome, calculate summary measures such as
a mean, variance, and the like.
3. Step 2 produces several results, one for each stage 1 outcome. Combine
these results using the iterated expectations or total variance rules.
Mixtures of Finite Populations. Suppose that the random variable 𝑁1 rep-
resents a realization of the number of claims in a policy year from the population
of good drivers and 𝑁2 represents that from the population of bad drivers. For a
specific driver, there is a probability 𝛼 that (s)he is a good driver. For a specific
draw 𝑁 , we have

𝑁 = 𝑁1 if the driver is a good driver, and 𝑁 = 𝑁2 otherwise.

Let 𝑇 indicate whether the driver is a good driver, with 𝑇 = 1 representing a
good driver, Pr[𝑇 = 1] = 𝛼, and 𝑇 = 2 representing a bad driver,
Pr[𝑇 = 2] = 1 − 𝛼.
From equation (16.1), we can obtain the expected number of claims as

E[𝑁 ] = E {E [𝑁 |𝑇 ]} = E[𝑁1 ] × 𝛼 + E[𝑁2 ] × (1 − 𝛼).

From equation (16.2), we can obtain the variance of 𝑁 as

Var[𝑁 ] = E {Var [𝑁 |𝑇 ]} + Var {E [𝑁 |𝑇 ]} .



To be more concrete, suppose that 𝑁𝑗 follows a Poisson distribution with the


mean 𝜆𝑗 , 𝑗 = 1, 2. Then we have

Var[𝑁 |𝑇 = 𝑗] = E[𝑁 |𝑇 = 𝑗] = 𝜆𝑗 , 𝑗 = 1, 2.

Thus, we can derive the expectation of the conditional variance as

E {Var [𝑁 |𝑇 ]} = 𝛼𝜆1 + (1 − 𝛼)𝜆2

and the variance of the conditional expectation as

Var {E [𝑁 |𝑇 ]} = (𝜆1 − 𝜆2 )2 𝛼(1 − 𝛼).

Note that the latter is the variance of a two-point (Bernoulli-type) random
variable taking the values 𝜆1 and 𝜆2 with probabilities 𝛼 and 1 − 𝛼.
Based on the Law of Total Variance, the unconditional variance of 𝑁 is given
by

Var[𝑁 ] = 𝛼𝜆1 + (1 − 𝛼)𝜆2 + (𝜆1 − 𝜆2 )2 𝛼(1 − 𝛼).
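
The formulas for this example can be checked by simulation. The following R sketch uses illustrative parameter values (not values from the text) and compares the simulated mean and variance of 𝑁 with the expressions above.

# Simulation check of E[N] and Var[N] for the two-population Poisson mixture.
set.seed(2020)
alpha <- 0.7; lambda1 <- 0.5; lambda2 <- 2   # illustrative values
n_sim <- 100000
good <- rbinom(n_sim, size = 1, prob = alpha)
N <- ifelse(good == 1, rpois(n_sim, lambda1), rpois(n_sim, lambda2))
c(simulated = mean(N),
  formula = alpha * lambda1 + (1 - alpha) * lambda2)
c(simulated = var(N),
  formula = alpha * lambda1 + (1 - alpha) * lambda2 +
            (lambda1 - lambda2)^2 * alpha * (1 - alpha))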

16.3 Conjugate Distributions


As described in Section 4.4.1, for conjugate distributions the posterior and the
prior come from the same family of distributions. In insurance applications,
this broadly occurs in a “family of distribution families” known as the linear
exponential family which we introduce first.

16.3.1 Linear Exponential Family


Definition. The distribution of the linear exponential family is

𝑓(𝑥; 𝛾, 𝜃) = exp( (𝑥𝛾 − 𝑏(𝛾))/𝜃 + 𝑆(𝑥, 𝜃) ).

Here, 𝑥 is a dependent variable and 𝛾 is the parameter of interest. The quantity


𝜃 is a scale parameter. The term 𝑏(𝛾) depends only on the parameter 𝛾, not the
dependent variable. The statistic 𝑆 (𝑥, 𝜃) is a function of the dependent variable
and the scale parameter, not the parameter 𝛾.
The dependent variable 𝑥 may be discrete, continuous or a hybrid combination
of the two. Thus, 𝑓 (⋅) may be interpreted to be a density or mass function,
depending on the application. Table 16.1 provides several examples, including
the normal, binomial and Poisson distributions.

Table 16.1. Selected Distributions of the Linear Exponential Family

• General — parameters 𝛾, 𝜃; density or mass function exp((𝑥𝛾 − 𝑏(𝛾))/𝜃 + 𝑆(𝑥, 𝜃)); components 𝛾, 𝜃, 𝑏(𝛾), 𝑆(𝑥, 𝜃).
• Normal — parameters 𝜇, 𝜎²; density (1/(𝜎√(2𝜋))) exp(−(𝑥 − 𝜇)²/(2𝜎²)); components 𝛾 = 𝜇, 𝜃 = 𝜎², 𝑏(𝛾) = 𝛾²/2, 𝑆(𝑥, 𝜃) = −(𝑥²/(2𝜃) + log(2𝜋𝜃)/2).
• Binomial — parameter 𝜋; mass function (𝑛 choose 𝑥) 𝜋^𝑥 (1 − 𝜋)^(𝑛−𝑥); components 𝛾 = log(𝜋/(1 − 𝜋)), 𝜃 = 1, 𝑏(𝛾) = 𝑛 log(1 + 𝑒^𝛾), 𝑆(𝑥, 𝜃) = log (𝑛 choose 𝑥).
• Poisson — parameter 𝜆; mass function 𝜆^𝑥 𝑒^(−𝜆)/𝑥!; components 𝛾 = log 𝜆, 𝜃 = 1, 𝑏(𝛾) = 𝑒^𝛾, 𝑆(𝑥, 𝜃) = −log(𝑥!).
• Negative Binomial — parameters 𝑟, 𝑝; mass function [Γ(𝑥 + 𝑟)/(𝑥! Γ(𝑟))] 𝑝^𝑟 (1 − 𝑝)^𝑥; components 𝛾 = log(1 − 𝑝), 𝜃 = 1, 𝑏(𝛾) = −𝑟 log(1 − 𝑒^𝛾), 𝑆(𝑥, 𝜃) = log[Γ(𝑥 + 𝑟)/(𝑥! Γ(𝑟))].
• Gamma — parameters 𝛼, 𝜃; density [1/(Γ(𝛼)𝜃^𝛼)] 𝑥^(𝛼−1) exp(−𝑥/𝜃); components 𝛾 = −1/(𝛼𝜃), 𝜃 = 1/𝛼, 𝑏(𝛾) = −log(−𝛾), 𝑆(𝑥, 𝜃) = −𝜃^(−1) log 𝜃 − log(Γ(𝜃^(−1))) + (𝜃^(−1) − 1) log 𝑥.

This assumes that the parameter r is fixed but need not be an integer.

The Tweedie (see Section 5.3.4) and inverse Gaussian distributions are also
members of the linear exponential family. The linear exponential family of
distribution families is extensively used as the basis of generalized linear models
as described in, for example, Frees (2009).

16.3.2 Conjugate Distributions


Now assume that the parameter 𝛾 is random with distribution 𝜋(𝛾, 𝜏 ), where
𝜏 is a vector of parameters that describe the distribution of 𝛾. In Bayesian
models, the distribution 𝜋 is known as the prior and reflects our belief or infor-
mation about 𝛾. The likelihood 𝑓(𝑥|𝛾) is a probability conditional on 𝛾. The
distribution of 𝛾 with knowledge of the random variables, 𝜋(𝛾, 𝜏 |𝑥), is called the
posterior distribution. For a given likelihood distribution, priors and posteriors
that come from the same parametric family are known as conjugate families of
distributions.
For a linear exponential likelihood, there exists a natural conjugate family.
Specifically, consider a likelihood of the form 𝑓(𝑥|𝛾) = exp {(𝑥𝛾 − 𝑏(𝛾))/𝜃} exp {𝑆 (𝑥, 𝜃)}.
For this likelihood, define the prior distribution
𝜋(𝛾, 𝜏 ) = 𝐶 exp{𝛾 𝑎1(𝜏 ) − 𝑏(𝛾) 𝑎2(𝜏 )},
where 𝐶 is a normalizing constant. Here, 𝑎1 (𝜏 ) = 𝑎1 and 𝑎2 (𝜏 ) = 𝑎2 are
functions of the parameters 𝜏 although we simplify the notation by dropping
explicit dependence on 𝜏 . The joint distribution of 𝑥 and 𝛾 is given by 𝑓(𝑥, 𝛾) =
𝑓(𝑥|𝛾)𝜋(𝛾, 𝜏 ). Using Bayes Theorem, the posterior distribution is
𝜋(𝛾, 𝜏 |𝑥) = 𝐶1 exp{𝛾(𝑎1 + 𝑥/𝜃) − 𝑏(𝛾)(𝑎2 + 1/𝜃)},
where 𝐶1 is a normalizing constant. Thus, we see that 𝜋(𝛾, 𝜏 |𝑥) has the same
form as 𝜋(𝛾, 𝜏 ).

Special case. Gamma-Poisson Model. Consider a Poisson likelihood so


that 𝑏(𝛾) = 𝑒𝛾 and scale parameter (𝜃) equals one. Thus, we have
𝜋(𝛾, 𝜏 ) = 𝐶 exp{𝛾𝑎1 − 𝑎2 𝑒^𝛾} = 𝐶 (𝑒^𝛾)^(𝑎1) exp(−𝑎2 𝑒^𝛾).

From the table of exponential family distributions, we recognize this to be a
gamma distribution. That is, the prior distribution of 𝜆 = 𝑒^𝛾 is a gamma
distribution with parameters 𝛼prior = 𝑎1 + 1 and 𝜃prior = 𝑎2^(−1). The posterior
distribution is a gamma distribution with parameters 𝛼post = 𝑎1 + 𝑥 + 1 =
𝛼prior + 𝑥 and 𝜃post^(−1) = 𝑎2 + 1 = 𝜃prior^(−1) + 1. This is consistent with
our Section 4.4.4 result.
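
The gamma-Poisson update can be illustrated with a short R sketch. The prior parameters and the observed count below are made up for illustration; 𝛼 denotes the gamma shape and 𝜃 the gamma scale.

# Gamma-Poisson conjugate update for a single observed count x.
alpha_prior <- 3; theta_prior <- 0.5   # illustrative prior values
x <- 4                                 # observed Poisson count
alpha_post <- alpha_prior + x
theta_post <- 1 / (1 / theta_prior + 1)   # since 1/theta_post = 1/theta_prior + 1
c(prior_mean = alpha_prior * theta_prior,
  post_mean  = alpha_post * theta_post)
# The prior and posterior densities of lambda can be compared with, e.g.,
# dgamma(lambda, shape = alpha_post, scale = theta_post).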

Special case. Normal-Normal Model. Consider a normal likelihood so that


𝑏(𝛾) = 𝛾 2 /2 and the scale parameter is 𝜎2 . Thus, we have
𝜋(𝛾, 𝜏 ) = 𝐶 exp{𝛾𝑎1 − 𝑎2 𝛾²/2} = 𝐶1(𝜏 ) exp{−(𝑎2/2)(𝛾 − 𝑎1/𝑎2)²},

The prior distribution of 𝛾 is normal with mean 𝑎1/𝑎2 and variance 𝑎2^(−1). The
posterior distribution of 𝛾 given 𝑥 is normal with mean (𝑎1 + 𝑥/𝜎²)/(𝑎2 + 𝜎^(−2))
and variance (𝑎2 + 𝜎^(−2))^(−1).

Special case. Beta-Binomial Model. Consider a binomial likelihood so that


𝑏(𝛾) = 𝑛 log(1 + 𝑒𝛾 ) and scale parameter equals one. Thus, we have
𝜋(𝛾, 𝜏 ) = 𝐶 exp{𝛾𝑎1 − 𝑛𝑎2 log(1 + 𝑒^𝛾)} = 𝐶 (𝑒^𝛾/(1 + 𝑒^𝛾))^(𝑎1) (1 − 𝑒^𝛾/(1 + 𝑒^𝛾))^(𝑛𝑎2−𝑎1).

This is a beta distribution. As in the other cases, prior parameters 𝑎1 and 𝑎2


are updated to become posterior parameters 𝑎1 + 𝑥 and 𝑎2 + 1.

Contributors
• Lei (Larry) Hua, Northern Illinois University, and Edward W. (Jed)
Frees, University of Wisconsin-Madison, are the principal authors of the
initial version of this chapter. Email: [email protected] or [email protected]
for chapter comments and suggested improvements.
Chapter 17

Appendix C: Maximum
Likelihood Theory

Chapter Preview. Appendix Chapter 15 introduced the maximum likelihood


theory regarding estimation of parameters from a parametric family. This ap-
pendix gives more specific examples and expands some of the concepts. Section
17.1 reviews the definition of the likelihood function, and introduces its proper-
ties. Section 17.2 reviews the maximum likelihood estimators, and extends their
large-sample properties to the case where there are multiple parameters in the
model. Section 17.3 reviews statistical inference based on maximum likelihood
estimators, with specific examples on cases with multiple parameters.

17.1 Likelihood Function

In this section, you learn


• the definitions of the likelihood function and the log-likelihood function
• the properties of the likelihood function.

From Appendix Chapter 15, the likelihood function is a function of parameters


given the observed data. Here, we review the concepts of the likelihood function,
and introduce its properties that form the basis for maximum likelihood inference.

17.1.1 Likelihood and Log-likelihood Functions


Here, we give a brief review of the likelihood function and the log-likelihood
function from Appendix Chapter 15. Let 𝑓(⋅|𝜃) be the probability function of


𝑋, the probability mass function (pmf) if 𝑋 is discrete or the probability density


function (pdf) if it is continuous. The likelihood is a function of the parameters
(𝜃) given the data (x). Hence, it is a function of the parameters with the data
being fixed, rather than a function of the data with the parameters being fixed.
The vector of data x is usually a realization of a random sample as defined in
Appendix Chapter 15.
Given a realization of a random sample x = (𝑥1 , 𝑥2 , ⋯ , 𝑥𝑛 ) of size 𝑛, the likelihood
function is defined as
𝐿(𝜃|x) = 𝑓(x|𝜃) = ∏(𝑖=1 to 𝑛) 𝑓(𝑥𝑖 |𝜃),

with the corresponding log-likelihood function given by

𝑙(𝜃|x) = log 𝐿(𝜃|x) = ∑(𝑖=1 to 𝑛) log 𝑓(𝑥𝑖 |𝜃),

where 𝑓(x|𝜃) denotes the joint probability function of x. The log-likelihood


function leads to an additive structure that is easy to work with.
In Appendix Chapter 15, we have used the normal distribution to illustrate con-
cepts of the likelihood function and the log-likelihood function. Here, we derive
the likelihood and corresponding log-likelihood functions when the population
distribution is from the Pareto distribution family.
Example – Pareto Distribution. Suppose that 𝑋1 , … , 𝑋𝑛 represents a ran-
dom sample from a single-parameter Pareto distribution with the cumulative
distribution function given by

𝐹 (𝑥) = Pr(𝑋𝑖 ≤ 𝑥) = 1 − (500/𝑥)^𝛼 , 𝑥 > 500,

with parameter 𝜃 = 𝛼.
The corresponding probability density function is 𝑓(𝑥) = 𝛼 500^𝛼 𝑥^(−𝛼−1), and
the log-likelihood function can be derived as

𝑙(𝛼|x) = ∑(𝑖=1 to 𝑛) log 𝑓(𝑥𝑖 ; 𝛼) = 𝑛𝛼 log 500 + 𝑛 log 𝛼 − (𝛼 + 1) ∑(𝑖=1 to 𝑛) log 𝑥𝑖 .

17.1.2 Properties of Likelihood Functions


In mathematical statistics, the first derivative of the log-likelihood function
with respect to the parameters, 𝑢(𝜃) = 𝜕𝑙(𝜃|x)/𝜕𝜃, is referred to as the score
function, or the score vector when there are multiple parameters in 𝜃. The
score function or score vector can be written as

𝑢(𝜃) = (𝜕/𝜕𝜃) 𝑙(𝜃|x) = (𝜕/𝜕𝜃) log ∏(𝑖=1 to 𝑛) 𝑓(𝑥𝑖 ; 𝜃) = ∑(𝑖=1 to 𝑛) (𝜕/𝜕𝜃) log 𝑓(𝑥𝑖 ; 𝜃),

where 𝑢(𝜃) = (𝑢1 (𝜃), 𝑢2 (𝜃), ⋯ , 𝑢𝑝 (𝜃)) when 𝜃 = (𝜃1 , ⋯ , 𝜃𝑝 ) contains 𝑝 > 2 pa-
rameters, with the element 𝑢𝑘 (𝜃) = 𝜕𝑙(𝜃|x)/𝜕𝜃𝑘 being the partial derivative
with respect to 𝜃𝑘 (𝑘 = 1, 2, ⋯ , 𝑝).
The likelihood function has the following properties:
• One basic property of the likelihood function is that the expectation of
the score function with respect to x is 0. That is,

E[𝑢(𝜃)] = E[(𝜕/𝜕𝜃) 𝑙(𝜃|x)] = 0.
To illustrate this, we have

E[(𝜕/𝜕𝜃) 𝑙(𝜃|x)] = E[ (𝜕𝑓(x; 𝜃)/𝜕𝜃) / 𝑓(x; 𝜃) ] = ∫ (𝜕/𝜕𝜃) 𝑓(y; 𝜃) 𝑑y
 = (𝜕/𝜕𝜃) ∫ 𝑓(y; 𝜃) 𝑑y = (𝜕/𝜕𝜃) 1 = 0.
• Denote by 𝜕²𝑙(𝜃|x)/𝜕𝜃𝜕𝜃′ = 𝜕²𝑙(𝜃|x)/𝜕𝜃² the second derivative of the
log-likelihood function when 𝜃 is a single parameter, or by 𝜕²𝑙(𝜃|x)/𝜕𝜃𝜕𝜃′ =
(ℎ𝑗𝑘) = (𝜕²𝑙(𝜃|x)/𝜕𝜃𝑗 𝜕𝜃𝑘) the hessian matrix of the log-likelihood function
when it contains multiple parameters. Denote [𝜕𝑙(𝜃|x)/𝜕𝜃][𝜕𝑙(𝜃|x)/𝜕𝜃′] = 𝑢²(𝜃)
when 𝜃 is a single parameter, or let [𝜕𝑙(𝜃|x)/𝜕𝜃][𝜕𝑙(𝜃|x)/𝜕𝜃′] = (𝑢𝑢𝑗𝑘) be a
𝑝 × 𝑝 matrix when 𝜃 contains a total of 𝑝 parameters, with each element
𝑢𝑢𝑗𝑘 = 𝑢𝑗(𝜃)𝑢𝑘(𝜃) and 𝑢𝑗(𝜃) being the 𝑗th element of the score vector as
defined earlier. Another basic property of the likelihood function is that the
sum of the expectation of the hessian matrix and the expectation of the
Kronecker product of the score vector and its transpose is 0. That is,

E(𝜕²𝑙(𝜃|x)/𝜕𝜃𝜕𝜃′) + E([𝜕𝑙(𝜃|x)/𝜕𝜃][𝜕𝑙(𝜃|x)/𝜕𝜃′]) = 0.

• Define the Fisher information matrix as

ℐ(𝜃) = E([𝜕𝑙(𝜃|x)/𝜕𝜃][𝜕𝑙(𝜃|x)/𝜕𝜃′]) = −E(𝜕²𝑙(𝜃|x)/𝜕𝜃𝜕𝜃′).
As the sample size 𝑛 goes to infinity, the score function (vector) converges in dis-
tribution to a normal distribution (or multivariate normal distribution
when 𝜃 contains multiple parameters) with mean 0 and variance (or covariance
matrix in the multivariate case) given by ℐ(𝜃).
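
The first of these properties can be illustrated numerically. The following R sketch simulates repeated exponential samples with a known (illustrative) true rate and checks that the score evaluated at the true parameter averages to approximately zero, with variance close to the Fisher information.

# Simulation check: the score has mean zero and variance equal to the Fisher
# information, for an exponential model with (illustrative) true rate 0.02.
set.seed(2020)
rate0 <- 0.02
n <- 50
# Score of an exponential sample at rate: d/d(rate) of log-likelihood = n/rate - sum(x)
scores <- replicate(5000, {
  x <- rexp(n, rate = rate0)
  n / rate0 - sum(x)
})
mean(scores)        # approximately 0
var(scores)         # approximately n / rate0^2, the Fisher information
n / rate0^2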

17.2 Maximum Likelihood Estimators

In this section, you learn


• the definition and derivation of the maximum likelihood estimator (mle)
for parameters from a specific distribution family
• the properties of maximum likelihood estimators that ensure valid large-
sample inference of the parameters
• why the mle-based method is used, and what cautions need to be taken.

In statistics, maximum likelihood estimators are values of the parameters 𝜃 that


are most likely to have been produced by the data.

17.2.1 Definition and Derivation of MLE


Based on the definition given in Appendix Chapter 15, the value of 𝜃, say
𝜃̂𝑀𝐿𝐸 , that maximizes the likelihood function is called the maximum likelihood
estimator (mle) of 𝜃.
Because the log function log(⋅) is a one-to-one function, we can also determine
𝜃̂𝑀𝐿𝐸 by maximizing the log-likelihood function, 𝑙(𝜃|x). That is, the mle is
defined as

𝜃̂𝑀𝐿𝐸 = argmax𝜃∈Θ 𝑙(𝜃|x).

Given the analytical form of the likelihood function, the mle can be obtained by
taking the first derivative of the log-likelihood function with respect to 𝜃, and
setting the values of the partial derivatives to zero. That is, the mle are the
solutions of the equations of

𝜕𝑙(𝜃̂|x)/𝜕 𝜃̂ = 0.

Example. Course C/Exam 4. May 2000, 21. You are given the following
five observations: 521, 658, 702, 819, 1217. You use the single-parameter Pareto
with cumulative distribution function:

𝐹 (𝑥) = 1 − (500/𝑥)^𝛼 , 𝑥 > 500.

Calculate the maximum likelihood estimate of the parameter 𝛼.


Solution. With 𝑛 = 5, the log-likelihood function is

𝑙(𝛼|x) = ∑(𝑖=1 to 5) log 𝑓(𝑥𝑖 ; 𝛼) = 5𝛼 log 500 + 5 log 𝛼 − (𝛼 + 1) ∑(𝑖=1 to 5) log 𝑥𝑖 .

Solving for the root of the score function yields

(𝜕/𝜕𝛼) 𝑙(𝛼|x) = 5 log 500 + 5/𝛼 − ∑(𝑖=1 to 5) log 𝑥𝑖 = 0 ⇒ 𝛼̂𝑀𝐿𝐸 = 5 / (∑(𝑖=1 to 5) log 𝑥𝑖 − 5 log 500) = 2.453.
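
The following R sketch reproduces this calculation for the five observations above, first with the closed-form solution of the score equation and then with a numerical maximization as a check.

# mle of alpha for the single-parameter Pareto with the five observations above.
x <- c(521, 658, 702, 819, 1217)
n <- length(x)
alpha_hat <- n / (sum(log(x)) - n * log(500))   # closed-form mle, about 2.453
alpha_hat
# Numerical check: maximize the log-likelihood directly
loglik <- function(alpha) {
  n * alpha * log(500) + n * log(alpha) - (alpha + 1) * sum(log(x))
}
optimize(loglik, interval = c(0.01, 50), maximum = TRUE)$maximum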

17.2.2 Asymptotic Properties of MLE


From Appendix Chapter 15, the MLE has some nice large-sample properties, un-
der certain regularity conditions. We presented the results for a single parameter
in Appendix Chapter 15, but results are true for the case when 𝜃 contains mul-
tiple parameters. In particular, we have the following results, in a general case
when 𝜃 = (𝜃1 , 𝜃2 , ⋯ , 𝜃𝑝 ).

• The mle of a parameter 𝜃, 𝜃̂𝑀𝐿𝐸 , is a consistent estimator. That is, the
mle 𝜃̂𝑀𝐿𝐸 converges in probability to the true value 𝜃 as the sample size
𝑛 goes to infinity.

• The mle has the asymptotic normality property, meaning that the esti-
mator will converge in distribution to a multivariate normal distribution
centered around the true value, when the sample size goes to infinity.
Namely,

√𝑛 (𝜃̂𝑀𝐿𝐸 − 𝜃) → 𝑁 (0, 𝑉 ) , as 𝑛 → ∞,

where 𝑉 denotes the asymptotic variance (or covariance matrix) of the
estimator. Hence, the mle 𝜃̂𝑀𝐿𝐸 has an approximate normal distribution
with mean 𝜃 and variance (covariance matrix when 𝑝 > 1) 𝑉 /𝑛 when the
sample size is large.

• The mle is efficient, meaning that it has the smallest asymptotic variance
𝑉 , commonly referred to as the Cramer–Rao lower bound. In particu-
lar, the Cramer–Rao lower bound is the inverse of the Fisher information
(matrix) ℐ(𝜃) defined earlier in this appendix. Hence, Var(𝜃̂𝑀𝐿𝐸 ) can be
estimated based on the observed Fisher information.

Based on the above results, we may perform statistical inference based on the
procedures defined in Appendix Chapter 15.

Example. Course C/Exam 4. Nov 2000, 13. A sample of ten observations


comes from a parametric family 𝑓(𝑥; 𝜃1 , 𝜃2 ) with log-likelihood function

𝑙(𝜃1 , 𝜃2 ) = ∑(𝑖=1 to 10) log 𝑓(𝑥𝑖 ; 𝜃1 , 𝜃2 ) = −2.5𝜃1² − 3𝜃1 𝜃2 − 𝜃2² + 5𝜃1 + 2𝜃2 + 𝑘,

where 𝑘 is a constant. Determine the estimated covariance matrix of the maxi-


mum likelihood estimator, 𝜃1̂ , 𝜃2̂ .
Solution. Denoting 𝑙 = 𝑙(𝜃1 , 𝜃2 ), the hessian matrix of second derivatives is

( 𝜕²𝑙/𝜕𝜃1²      𝜕²𝑙/𝜕𝜃1𝜕𝜃2 )   ( −5  −3 )
( 𝜕²𝑙/𝜕𝜃1𝜕𝜃2   𝜕²𝑙/𝜕𝜃2²    ) = ( −3  −2 )

Thus, the information matrix is

ℐ(𝜃1 , 𝜃2 ) = −E(𝜕²𝑙(𝜃|x)/𝜕𝜃𝜕𝜃′) = ( 5  3 )
                                    ( 3  2 )
and
ℐ^(−1)(𝜃1 , 𝜃2 ) = 1/(5(2) − 3(3)) ( 2  −3 )   ( 2  −3 )
                                   ( −3  5 ) = ( −3  5 ).
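
The matrix computations in this solution can be verified in R with solve().

# Verification of the covariance matrix calculation above.
hessian <- matrix(c(-5, -3,
                    -3, -2), nrow = 2, byrow = TRUE)
info <- -hessian    # Fisher information matrix
solve(info)         # estimated covariance matrix of (theta1_hat, theta2_hat)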

17.2.3 Use of Maximum Likelihood Estimation


The method of maximum likelihood has many advantages over alternative
methods such as the method of moments introduced in Appendix Chapter 15.
• It is a general tool that works in many situations. For example, we may
be able to write out the closed-form likelihood function for censored and
truncated data. Maximum likelihood estimation can be used for regres-
sion models including covariates, such as survival regression, generalized
linear models and mixed models, that may include covariates that are
time-dependent.
• From the efficiency of the mle, it is optimal in the sense that it has the
smallest variance among the class of all unbiased estimators for large
sample sizes.
• From the results on the asymptotic normality of the mle, we can obtain
a large-sample distribution for the estimator, allowing users to assess the
variability in the estimation and perform statistical inference on the param-
eters. The approach is less computationally intensive than re-sampling
methods that require a large number of fittings of the model.
Despite its numerous advantages, the mle has drawbacks in cases, such as
generalized linear models, where it does not have a closed analytical form. In
such cases, maximum likelihood estimators are computed iteratively using numerical

optimization methods. For example, we may use the Newton-Raphson iterative


algorithm or its variations to obtain the mle. Iterative algorithms require
starting values. For some problems, the choice of a close starting value is
critical, particularly when the likelihood function has local minima or maxima,
and there may be convergence issues when the starting value is far from the
maximum. It is therefore important to start from different values across the
parameter space and compare the maximized likelihood or log-likelihood to
make sure the algorithms have converged to a global maximum.
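
As an illustration of such an iterative scheme, the following R sketch applies the Newton-Raphson algorithm to the single-parameter Pareto log-likelihood of Section 17.2.1, where the closed-form mle is available as a check.

# Newton-Raphson iteration for the single-parameter Pareto mle.
x <- c(521, 658, 702, 819, 1217)
n <- length(x)
score <- function(a) n * log(500) + n / a - sum(log(x))  # first derivative
hess  <- function(a) -n / a^2                            # second derivative
a <- 1                                                   # starting value
for (iter in 1:50) {
  a_new <- a - score(a) / hess(a)
  if (abs(a_new - a) < 1e-10) break
  a <- a_new
}
a                                   # converges to about 2.453
n / (sum(log(x)) - n * log(500))    # closed-form check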

17.3 Statistical Inference Based on Maximum Likelihood Estimation

In this section, you learn how to


• perform hypothesis testing based on mle for cases where there are multiple
parameters in 𝜃
• perform likelihood ratio test for cases where there are multiple parameters
in 𝜃

In Appendix Chapter 15, we have introduced maximum likelihood-based meth-


ods for statistical inference when 𝜃 contains a single parameter. Here, we will
extend the results to cases where there are multiple parameters in 𝜃.

17.3.1 Hypothesis Testing


In Appendix Chapter 15, we defined hypothesis testing concerning the null hy-
pothesis, a statement on the parameter(s) of a distribution or model. One
important type of inference is to assess whether a parameter estimate is statis-
tically significant, meaning whether the value of the parameter is zero or not.
We have learned earlier that the mle 𝜃̂𝑀𝐿𝐸 has a large-sample normal
distribution with mean 𝜃 and variance-covariance matrix ℐ^(−1)(𝜃). Based on the
multivariate normal distribution, the 𝑗th element of 𝜃̂𝑀𝐿𝐸 , say 𝜃̂𝑀𝐿𝐸,𝑗 , has a
large-sample univariate normal distribution.
Define 𝑠𝑒(𝜃̂𝑀𝐿𝐸,𝑗 ), the standard error (estimated standard deviation), to be
the square root of the 𝑗th diagonal element of ℐ^(−1)(𝜃̂𝑀𝐿𝐸 ). To assess the null
hypothesis that 𝜃𝑗 = 𝜃0 , we define the 𝑡-statistic or 𝑡-ratio to be
𝑡(𝜃̂𝑀𝐿𝐸,𝑗 ) = (𝜃̂𝑀𝐿𝐸,𝑗 − 𝜃0 )/𝑠𝑒(𝜃̂𝑀𝐿𝐸,𝑗 ).
Under the null hypothesis, it has a Student-𝑡 distribution with degrees of free-
dom equal to 𝑛 − 𝑝, with 𝑝 being the dimension of 𝜃.

For most actuarial applications, we have a large sample size 𝑛, so the 𝑡-


distribution is very close to the (standard) normal distribution. In the case
when 𝑛 is very large or when the standard error is known, the 𝑡-statistic can be
referred to as a 𝑧-statistic or 𝑧-score.
Based on the results from Appendix Chapter 15, if the 𝑡-statistic 𝑡(𝜃̂𝑀𝐿𝐸,𝑗 )
exceeds a cut-off (in absolute value), then the test for the 𝑗th parameter 𝜃𝑗 is
said to be statistically significant. If 𝜃𝑗 is the regression coefficient of the 𝑗th
independent variable, then we say that the 𝑗th variable is statistically significant.
For example, if we use a 5% significance level, then the cut-off value is 1.96 using
a normal distribution approximation for cases with a large sample size. More
generally, using a 100𝛼% significance level, the cut-off is the 100(1 − 𝛼/2)%
quantile from a Student-𝑡 distribution with 𝑛 − 𝑝 degrees of freedom.
Another useful concept in hypothesis testing is the 𝑝-value, shorthand for prob-
ability value. From the mathematical definition in Appendix Chapter 15, a
𝑝-value is defined as the smallest significance level for which the null hypothesis
would be rejected. Hence, the 𝑝-value is a useful summary statistic for the data
analyst to report because it allows the reader to understand the strength of
statistical evidence concerning the deviation from the null hypothesis.

17.3.2 MLE and Model Validation


In addition to hypothesis testing and interval estimation introduced in Appendix
Chapter 15 and the previous subsection, another important type of inference is
selection of a model from two choices, where one choice is a special case of the
other with certain parameters being restricted. For such two models with one
being nested in the other, we have introduced the likelihood ratio test (LRT) in
Appendix Chapter 15. Here, we will briefly review the process of performing a
LRT based on a specific example of two alternative models.
Suppose that we have a (large) model under which we derive the maximum
likelihood estimator, 𝜃̂𝑀𝐿𝐸 . Now assume that some of the 𝑝 elements in 𝜃
are equal to zero and determine the maximum likelihood estimator over the
remaining set, with the resulting estimator denoted 𝜃̂𝑅𝑒𝑑𝑢𝑐𝑒𝑑 .
Based on the definition in Appendix Chapter 15, the statistic
𝐿𝑅𝑇 = 2 (𝑙(𝜃̂𝑀𝐿𝐸 ) − 𝑙(𝜃̂𝑅𝑒𝑑𝑢𝑐𝑒𝑑 )) is called the likelihood ratio statistic. Under the
null hypothesis that the reduced model is correct, the likelihood ratio statistic has a
Chi-square distribution with degrees of freedom equal to 𝑑, the number of
variables set to zero.
Such a test allows us to judge which of the two models is more likely to be
correct, given the observed data. If the statistic 𝐿𝑅𝑇 is large relative to the
critical value from the chi-square distribution, then we reject the reduced model
in favor of the larger one. Details regarding the critical value and alternative
methods based on information criteria are given in Appendix Chapter 15.

Contributors
• Lei (Larry) Hua, Northern Illinois University, and Edward W. (Jed)
Frees, University of Wisconsin-Madison, are the principal authors of the
initial version of this chapter. Email: [email protected] or [email protected]
for chapter comments and suggested improvements.
Chapter 18

Appendix D: Summary of
Distributions

User Notes

• The R functions are from the packages actuar and invgamma.


• Tables appear when first loaded by the browser. To hide them, click on
one of the distributions, e.g., Poisson, and then click on the Hide button.
• More information on the R codes is available at the R Codes for Loss Data
Analytics site.

18.1 Discrete Distributions


Overview. This section summarizes selected discrete probability distributions
used throughout Loss Data Analytics. Relevant functions and R code are pro-
vided.

18.1.1 The (a,b,0) Class


Poisson

Functions


Parameter assumptions: 𝜆 > 0
𝑝0: 𝑒^(−𝜆)
Probability mass function 𝑝𝑘: 𝑒^(−𝜆) 𝜆^𝑘 / 𝑘!
Expected value E[𝑁]: 𝜆
Variance: 𝜆
Probability generating function 𝑃(𝑧): 𝑒^(𝜆(𝑧−1))
𝑎 and 𝑏 for recursion: 𝑎 = 0, 𝑏 = 𝜆

R Commands

Function Name R Command


Probability mass function dpois(𝑥 =, 𝑙𝑎𝑚𝑏𝑑𝑎 = 𝜆)
Distribution function ppois(𝑝 =, 𝑙𝑎𝑚𝑏𝑑𝑎 = 𝜆)
Quantile function qpois(𝑞 =, 𝑙𝑎𝑚𝑏𝑑𝑎 = 𝜆)
Random sampling function rpois(𝑛 =, 𝑙𝑎𝑚𝑏𝑑𝑎 = 𝜆)
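
The recursion constants 𝑎 and 𝑏 can be checked directly in R. The sketch below uses an illustrative value 𝜆 = 3 and compares dpois with the (a, b, 0) recursion 𝑝𝑘 = (𝑎 + 𝑏/𝑘) 𝑝𝑘−1.

# Check of the (a, b, 0) recursion for the Poisson with illustrative lambda = 3.
lambda <- 3
a <- 0; b <- lambda
kmax <- 10
p_rec <- numeric(kmax)
p_prev <- dpois(0, lambda)          # p_0 = exp(-lambda)
for (k in 1:kmax) {
  p_rec[k] <- (a + b / k) * p_prev
  p_prev <- p_rec[k]
}
all.equal(p_rec, dpois(1:kmax, lambda))   # TRUE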

Geometric

Functions

Parameter assumptions: 𝛽 > 0
𝑝0: 1/(1 + 𝛽)
Probability mass function 𝑝𝑘: 𝛽^𝑘/(1 + 𝛽)^(𝑘+1)
Expected value E[𝑁]: 𝛽
Variance: 𝛽(1 + 𝛽)
Probability generating function 𝑃(𝑧): [1 − 𝛽(𝑧 − 1)]^(−1)
𝑎 and 𝑏 for recursion: 𝑎 = 𝛽/(1 + 𝛽), 𝑏 = 0

R Commands

Function Name R Command


Probability mass function: dgeom(𝑥 =, 𝑝𝑟𝑜𝑏 = 1/(1 + 𝛽))
Distribution function: pgeom(𝑝 =, 𝑝𝑟𝑜𝑏 = 1/(1 + 𝛽))
Quantile function: qgeom(𝑞 =, 𝑝𝑟𝑜𝑏 = 1/(1 + 𝛽))
Random sampling function: rgeom(𝑛 =, 𝑝𝑟𝑜𝑏 = 1/(1 + 𝛽))

Binomial

Functions

Parameter assumptions: 0 < 𝑞 < 1, 𝑚 is an integer, 0 ≤ 𝑘 ≤ 𝑚
𝑝0: (1 − 𝑞)^𝑚
Probability mass function 𝑝𝑘: (𝑚 choose 𝑘) 𝑞^𝑘 (1 − 𝑞)^(𝑚−𝑘)
Expected value E[𝑁]: 𝑚𝑞
Variance: 𝑚𝑞(1 − 𝑞)
Probability generating function 𝑃(𝑧): [1 + 𝑞(𝑧 − 1)]^𝑚
𝑎 and 𝑏 for recursion: 𝑎 = −𝑞/(1 − 𝑞), 𝑏 = (𝑚 + 1)𝑞/(1 − 𝑞)

R Commands

Function Name R Command


Probability mass function dbinom(𝑥 =, 𝑠𝑖𝑧𝑒 = 𝑚, 𝑝𝑟𝑜𝑏 = 𝑞)
Distribution function pbinom(𝑝 =, 𝑠𝑖𝑧𝑒 = 𝑚, 𝑝𝑟𝑜𝑏 = 𝑞)
Quantile function qbinom(𝑞 =, 𝑠𝑖𝑧𝑒 = 𝑚, 𝑝𝑟𝑜𝑏 = 𝑞)
Random sampling function rbinom(𝑛 =, 𝑠𝑖𝑧𝑒 = 𝑚, 𝑝𝑟𝑜𝑏 = 𝑞)

Negative Binomial

Functions

Parameter assumptions: 𝑟 > 0, 𝛽 > 0
𝑝0: (1 + 𝛽)^(−𝑟)
Probability mass function 𝑝𝑘: 𝑟(𝑟 + 1)⋯(𝑟 + 𝑘 − 1) 𝛽^𝑘 / [𝑘! (1 + 𝛽)^(𝑟+𝑘)]
Expected value E[𝑁]: 𝑟𝛽
Variance: 𝑟𝛽(1 + 𝛽)
Probability generating function 𝑃(𝑧): [1 − 𝛽(𝑧 − 1)]^(−𝑟)
𝑎 and 𝑏 for recursion: 𝑎 = 𝛽/(1 + 𝛽), 𝑏 = (𝑟 − 1)𝛽/(1 + 𝛽)

R Commands

Function Name R Command


Probability mass function: dnbinom(𝑥 =, 𝑠𝑖𝑧𝑒 = 𝑟, 𝑝𝑟𝑜𝑏 = 1/(1 + 𝛽))
Distribution function: pnbinom(𝑝 =, 𝑠𝑖𝑧𝑒 = 𝑟, 𝑝𝑟𝑜𝑏 = 1/(1 + 𝛽))
Quantile function: qnbinom(𝑞 =, 𝑠𝑖𝑧𝑒 = 𝑟, 𝑝𝑟𝑜𝑏 = 1/(1 + 𝛽))
Random sampling function: rnbinom(𝑛 =, 𝑠𝑖𝑧𝑒 = 𝑟, 𝑝𝑟𝑜𝑏 = 1/(1 + 𝛽))

18.1.2 The (a,b,1) Class


Zero Truncated Poisson
Functions

Parameter assumptions: 𝜆 > 0
𝑝1^𝑇: 𝜆/(𝑒^𝜆 − 1)
Probability mass function 𝑝𝑘^𝑇: 𝜆^𝑘/[𝑘!(𝑒^𝜆 − 1)]
Expected value E[𝑁]: 𝜆/(1 − 𝑒^(−𝜆))
Variance: 𝜆[1 − (𝜆 + 1)𝑒^(−𝜆)]/(1 − 𝑒^(−𝜆))^2
Probability generating function 𝑃(𝑧): (𝑒^(𝜆𝑧) − 1)/(𝑒^𝜆 − 1)
𝑎 and 𝑏 for recursion: 𝑎 = 0, 𝑏 = 𝜆

R Commands

Function Name R Command


Probability mass function dztpois(𝑥 =, 𝑙𝑎𝑚𝑏𝑑𝑎 = 𝜆)
Distribution function pztpois(𝑝 =, 𝑙𝑎𝑚𝑏𝑑𝑎 = 𝜆)
Quantile function qztpois(𝑞 =, 𝑙𝑎𝑚𝑏𝑑𝑎 = 𝜆)
Random sampling function rztpois(𝑛 =, 𝑙𝑎𝑚𝑏𝑑𝑎 = 𝜆)

Zero Truncated Geometric

Functions

Parameter assumptions: 𝛽 > 0
𝑝1^𝑇: 1/(1 + 𝛽)
Probability mass function 𝑝𝑘^𝑇: 𝛽^(𝑘−1)/(1 + 𝛽)^𝑘
Expected value E[𝑁]: 1 + 𝛽
Variance: 𝛽(1 + 𝛽)
Probability generating function 𝑃(𝑧): ([1 − 𝛽(𝑧 − 1)]^(−1) − (1 + 𝛽)^(−1))/(1 − (1 + 𝛽)^(−1))
𝑎 and 𝑏 for recursion: 𝑎 = 𝛽/(1 + 𝛽), 𝑏 = 0

R Commands

Function Name R Command


Probability mass function: dztgeom(𝑥 =, 𝑝𝑟𝑜𝑏 = 1/(1 + 𝛽))
Distribution function: pztgeom(𝑝 =, 𝑝𝑟𝑜𝑏 = 1/(1 + 𝛽))
Quantile function: qztgeom(𝑞 =, 𝑝𝑟𝑜𝑏 = 1/(1 + 𝛽))
Random sampling function: rztgeom(𝑛 =, 𝑝𝑟𝑜𝑏 = 1/(1 + 𝛽))

Zero Truncated Binomial

Functions

Parameter assumptions: 0 < 𝑞 < 1, 𝑚 is an integer, 0 ≤ 𝑘 ≤ 𝑚
𝑝1^𝑇: 𝑚(1 − 𝑞)^(𝑚−1) 𝑞/[1 − (1 − 𝑞)^𝑚]
Probability mass function 𝑝𝑘^𝑇: (𝑚 choose 𝑘) 𝑞^𝑘 (1 − 𝑞)^(𝑚−𝑘)/[1 − (1 − 𝑞)^𝑚]
Expected value E[𝑁]: 𝑚𝑞/[1 − (1 − 𝑞)^𝑚]
Variance: 𝑚𝑞[(1 − 𝑞) − (1 − 𝑞 + 𝑚𝑞)(1 − 𝑞)^𝑚]/[1 − (1 − 𝑞)^𝑚]^2
Probability generating function 𝑃(𝑧): ([1 + 𝑞(𝑧 − 1)]^𝑚 − (1 − 𝑞)^𝑚)/[1 − (1 − 𝑞)^𝑚]
𝑎 and 𝑏 for recursion: 𝑎 = −𝑞/(1 − 𝑞), 𝑏 = (𝑚 + 1)𝑞/(1 − 𝑞)

R Commands

Function Name R Command


Probability mass function dztbinom(𝑥 =, 𝑠𝑖𝑧𝑒 = 𝑚, 𝑝𝑟𝑜𝑏 = 𝑝)
Distribution function pztbinom(𝑝 =, 𝑠𝑖𝑧𝑒 = 𝑚, 𝑝𝑟𝑜𝑏 = 𝑝)
Quantile function qztbinom(𝑞 =, 𝑠𝑖𝑧𝑒 = 𝑚, 𝑝𝑟𝑜𝑏 = 𝑝)
Random sampling function rztbinom(𝑛 =, 𝑠𝑖𝑧𝑒 = 𝑚, 𝑝𝑟𝑜𝑏 = 𝑝)

Zero Truncated Negative Binomial


Functions

Parameter assumptions: 𝑟 > −1, 𝑟 ≠ 0
𝑝1^𝑇: 𝑟𝛽/[(1 + 𝛽)^(𝑟+1) − (1 + 𝛽)]
Probability mass function 𝑝𝑘^𝑇: [𝑟(𝑟 + 1)⋯(𝑟 + 𝑘 − 1)/(𝑘![(1 + 𝛽)^𝑟 − 1])] (𝛽/(1 + 𝛽))^𝑘
Expected value E[𝑁]: 𝑟𝛽/[1 − (1 + 𝛽)^(−𝑟)]
Variance: 𝑟𝛽[(1 + 𝛽) − (1 + 𝛽 + 𝑟𝛽)(1 + 𝛽)^(−𝑟)]/[1 − (1 + 𝛽)^(−𝑟)]^2
Probability generating function 𝑃(𝑧): ([1 − 𝛽(𝑧 − 1)]^(−𝑟) − (1 + 𝛽)^(−𝑟))/[1 − (1 + 𝛽)^(−𝑟)]
𝑎 and 𝑏 for recursion: 𝑎 = 𝛽/(1 + 𝛽), 𝑏 = (𝑟 − 1)𝛽/(1 + 𝛽)

R Commands

Function Name R Command


Probability mass function: dztnbinom(𝑥 =, 𝑠𝑖𝑧𝑒 = 𝑟, 𝑝𝑟𝑜𝑏 = 1/(1 + 𝛽))
Distribution function: pztnbinom(𝑝 =, 𝑠𝑖𝑧𝑒 = 𝑟, 𝑝𝑟𝑜𝑏 = 1/(1 + 𝛽))
Quantile function: qztnbinom(𝑞 =, 𝑠𝑖𝑧𝑒 = 𝑟, 𝑝𝑟𝑜𝑏 = 1/(1 + 𝛽))
Random sampling function: rztnbinom(𝑛 =, 𝑠𝑖𝑧𝑒 = 𝑟, 𝑝𝑟𝑜𝑏 = 1/(1 + 𝛽))

Logarithmic
Functions

Parameter assumptions: 𝛽 > 0
𝑝1^𝑇: 𝛽/[(1 + 𝛽) ln(1 + 𝛽)]
Probability mass function 𝑝𝑘^𝑇: 𝛽^𝑘/[𝑘(1 + 𝛽)^𝑘 ln(1 + 𝛽)]
Expected value E[𝑁]: 𝛽/ln(1 + 𝛽)
Variance: 𝛽[1 + 𝛽 − 𝛽/ln(1 + 𝛽)]/ln(1 + 𝛽)
Probability generating function 𝑃(𝑧): 1 − ln[1 − 𝛽(𝑧 − 1)]/ln(1 + 𝛽)
𝑎 and 𝑏 for recursion: 𝑎 = 𝛽/(1 + 𝛽), 𝑏 = −𝛽/(1 + 𝛽)

R Commands

Function Name R Command


Probability mass function: dnbinom(𝑥 =, 𝑝𝑟𝑜𝑏 = 𝛽/(1 + 𝛽))
Distribution function: pnbinom(𝑝 =, 𝑝𝑟𝑜𝑏 = 𝛽/(1 + 𝛽))
Quantile function: qnbinom(𝑞 =, 𝑝𝑟𝑜𝑏 = 𝛽/(1 + 𝛽))
Random sampling function: rnbinom(𝑛 =, 𝑝𝑟𝑜𝑏 = 𝛽/(1 + 𝛽))

18.2 Continuous Distributions


Overview. This section summarizes selected continuous probability distribu-
tions used throughout Loss Data Analytics. Relevant functions, R code, and
illustrative graphs are provided.

18.2.1 One Parameter Distributions


Exponential
Functions

Parameter assumptions: 𝜃 > 0
Probability density function 𝑓(𝑥): (1/𝜃) 𝑒^(−𝑥/𝜃)
Distribution function 𝐹(𝑥): 1 − 𝑒^(−𝑥/𝜃)
kth raw moment E[𝑋^𝑘]: 𝜃^𝑘 Γ(𝑘 + 1), 𝑘 > −1
𝑉𝑎𝑅𝑝(𝑥): −𝜃 ln(1 − 𝑝)
Limited expected value E[𝑋 ∧ 𝑥]: 𝜃(1 − 𝑒^(−𝑥/𝜃))

R Commands

Function Name R Command


Density function dexp(𝑥 =, 𝑟𝑎𝑡𝑒 = 1/𝜃)
Distribution function pexp(𝑝 =, 𝑟𝑎𝑡𝑒 = 1/𝜃)
Quantile function qexp(𝑞 =, 𝑟𝑎𝑡𝑒 = 1/𝜃)
Random sampling function rexp(𝑛 =, 𝑟𝑎𝑡𝑒 = 1/𝜃)

Illustrative Graph

[Figure: Exponential Distribution — probability density versus X]

Inverse Exponential
Functions

Parameter assumptions: 𝜃 > 0
Probability density function 𝑓(𝑥): 𝜃𝑒^(−𝜃/𝑥)/𝑥^2
Distribution function 𝐹(𝑥): 𝑒^(−𝜃/𝑥)
kth raw moment E[𝑋^𝑘]: 𝜃^𝑘 Γ(1 − 𝑘), 𝑘 < 1
E[(𝑋 ∧ 𝑥)^𝑘]: 𝜃^𝑘 𝐺(1 − 𝑘; 𝜃/𝑥) + 𝑥^𝑘 (1 − 𝑒^(−𝜃/𝑥))

R Commands

Function Name R Command


Density function dinvexp(𝑥 =, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Distribution function pinvexp(𝑝 =, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Quantile function qinvexp(𝑞 =, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Random sampling function rinvexp(𝑛 =, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)

Illustrative Graph

[Figure: Inverse Exponential Distribution — probability density versus X]

Single Parameter Pareto


Functions

Parameter assumptions: 𝜃 is known, 𝑥 > 𝜃, 𝛼 > 0
Probability density function 𝑓(𝑥): 𝛼𝜃^𝛼/𝑥^(𝛼+1)
Distribution function 𝐹(𝑥): 1 − (𝜃/𝑥)^𝛼
kth raw moment E[𝑋^𝑘]: 𝛼𝜃^𝑘/(𝛼 − 𝑘), 𝑘 < 𝛼
E[(𝑋 ∧ 𝑥)^𝑘]: 𝛼𝜃^𝑘/(𝛼 − 𝑘) − 𝑘𝜃^𝛼/[(𝛼 − 𝑘)𝑥^(𝛼−𝑘)], 𝑥 ≥ 𝜃

R Commands

Function Name R Command


Density function dpareto1(𝑥 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑚𝑖𝑛 = 𝜃)
Distribution function ppareto1(𝑝 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑚𝑖𝑛 = 𝜃)
Quantile function qpareto1(𝑞 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑚𝑖𝑛 = 𝜃)
Random sampling function rpareto1(𝑛 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑚𝑖𝑛 = 𝜃)

Illustrative Graph

[Figure: Single Parameter Pareto Distribution — probability density versus X]

18.2.2 Two Parameter Distributions

Pareto

Functions

Parameter assumptions: 𝜃 > 0, 𝛼 > 0
Probability density function 𝑓(𝑥): 𝛼𝜃^𝛼/(𝑥 + 𝜃)^(𝛼+1)
Distribution function 𝐹(𝑥): 1 − (𝜃/(𝑥 + 𝜃))^𝛼
kth raw moment E[𝑋^𝑘]: 𝜃^𝑘 Γ(𝑘 + 1)Γ(𝛼 − 𝑘)/Γ(𝛼), −1 < 𝑘 < 𝛼
Limited expected value E[𝑋 ∧ 𝑥], 𝛼 ≠ 1: [𝜃/(𝛼 − 1)][1 − (𝜃/(𝑥 + 𝜃))^(𝛼−1)]
Limited expected value E[𝑋 ∧ 𝑥], 𝛼 = 1: −𝜃 ln(𝜃/(𝑥 + 𝜃))
E[(𝑋 ∧ 𝑥)^𝑘]: [𝜃^𝑘 Γ(𝑘 + 1)Γ(𝛼 − 𝑘)/Γ(𝛼)] 𝛽(𝑘 + 1, 𝛼 − 𝑘; 𝑥/(𝑥 + 𝜃)) + 𝑥^𝑘 (𝜃/(𝑥 + 𝜃))^𝛼

R Commands

Function Name R Command


Density function dpareto(𝑥 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Distribution function ppareto(𝑝 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Quantile function qpareto(𝑞 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Random sampling function rpareto(𝑛 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)

Illustrative Graph

[Figure: Pareto Distribution — probability density]

Inverse Pareto
Functions

Parameter assumptions: 𝜃 > 0, 𝜏 > 0
Probability density function 𝑓(𝑥): 𝜏𝜃𝑥^(𝜏−1)/(𝑥 + 𝜃)^(𝜏+1)
Distribution function 𝐹(𝑥): (𝑥/(𝑥 + 𝜃))^𝜏
kth raw moment E[𝑋^𝑘]: 𝜃^𝑘 Γ(𝜏 + 𝑘)Γ(1 − 𝑘)/Γ(𝜏), −𝜏 < 𝑘 < 1
E[(𝑋 ∧ 𝑥)^𝑘]: 𝜃^𝑘 𝜏 ∫ from 0 to 𝑥/(𝑥+𝜃) of 𝑦^(𝜏+𝑘−1)(1 − 𝑦)^(−𝑘) 𝑑𝑦 + 𝑥^𝑘 [1 − (𝑥/(𝑥 + 𝜃))^𝜏], 𝑘 > −𝜏

R Commands

Function Name R Command


Density function dinvpareto(𝑥 =, 𝑠ℎ𝑎𝑝𝑒 = 𝜏 , 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Distribution function pinvpareto(𝑝 =, 𝑠ℎ𝑎𝑝𝑒 = 𝜏 , 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Quantile function qinvpareto(𝑞 =, 𝑠ℎ𝑎𝑝𝑒 = 𝜏 , 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Random sampling function rinvpareto(𝑛 =, 𝑠ℎ𝑎𝑝𝑒 = 𝜏 , 𝑠𝑐𝑎𝑙𝑒 = 𝜃)

Illustrative Graph

[Figure: Inverse Pareto Distribution — probability density]

Loglogistic

Functions

Parameter assumptions: 𝜃 > 0, 𝛾 > 0, 𝑢 = (𝑥/𝜃)^𝛾/[1 + (𝑥/𝜃)^𝛾]
Probability density function 𝑓(𝑥): 𝛾(𝑥/𝜃)^𝛾/(𝑥[1 + (𝑥/𝜃)^𝛾]^2)
Distribution function 𝐹(𝑥): 𝑢
kth raw moment E[𝑋^𝑘]: 𝜃^𝑘 Γ(1 + 𝑘/𝛾)Γ(1 − 𝑘/𝛾), −𝛾 < 𝑘 < 𝛾
E[(𝑋 ∧ 𝑥)^𝑘]: 𝜃^𝑘 Γ(1 + 𝑘/𝛾)Γ(1 − 𝑘/𝛾) 𝛽(1 + 𝑘/𝛾, 1 − 𝑘/𝛾; 𝑢) + 𝑥^𝑘 (1 − 𝑢), 𝑘 > −𝛾

Illustrative Graph

[Figure: Loglogistic Distribution — probability density versus X]

Paralogistic
Functions

Parameter assumptions: 𝜃 > 0, 𝛼 > 0, 𝑢 = 1/[1 + (𝑥/𝜃)^𝛼]
Probability density function 𝑓(𝑥): 𝛼^2 (𝑥/𝜃)^𝛼/(𝑥[1 + (𝑥/𝜃)^𝛼]^(𝛼+1))
Distribution function 𝐹(𝑥): 1 − 𝑢^𝛼
kth raw moment E[𝑋^𝑘]: 𝜃^𝑘 Γ(1 + 𝑘/𝛼)Γ(𝛼 − 𝑘/𝛼)/Γ(𝛼), −𝛼 < 𝑘 < 𝛼^2
E[(𝑋 ∧ 𝑥)^𝑘]: [𝜃^𝑘 Γ(1 + 𝑘/𝛼)Γ(𝛼 − 𝑘/𝛼)/Γ(𝛼)] 𝛽(1 + 𝑘/𝛼, 𝛼 − 𝑘/𝛼; 1 − 𝑢) + 𝑥^𝑘 𝑢^𝛼, 𝑘 > −𝛼

R Commands

Function Name R Command


Density function dparalogis(𝑥 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Distribution function pparalogis(𝑝 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Quantile function qparalogis(𝑞 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Random sampling function rparalogis(𝑛 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)

Illustrative Graph

[Figure: Paralogistic Distribution — probability density]

Gamma
Functions

Parameter assumptions: 𝜃 > 0, 𝛼 > 0
Probability density function 𝑓(𝑥): [1/(Γ(𝛼)𝜃^𝛼)] 𝑥^(𝛼−1) 𝑒^(−𝑥/𝜃)
Distribution function 𝐹(𝑥): Γ(𝛼; 𝑥/𝜃)
kth raw moment E[𝑋^𝑘]: 𝜃^𝑘 Γ(𝛼 + 𝑘)/Γ(𝛼), 𝑘 > −𝛼
E[(𝑋 ∧ 𝑥)^𝑘]: [𝜃^𝑘 Γ(𝑘 + 𝛼)/Γ(𝛼)] Γ(𝑘 + 𝛼; 𝑥/𝜃) + 𝑥^𝑘 [1 − Γ(𝛼; 𝑥/𝜃)], 𝑘 > −𝛼

R Commands

Density function dgamma(𝑥 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)


Distribution function pgamma(𝑝 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Quantile function qgamma(𝑞 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Random sampling function rgamma(𝑛 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)

Illustrative Graph

[Figure: Gamma Distribution — probability density]

Inverse Gamma
Functions

Probability density function 𝑓(𝑥): (𝜃/𝑥)^𝛼 𝑒^(−𝜃/𝑥)/(𝑥Γ(𝛼))
Distribution function 𝐹(𝑥): 1 − Γ(𝛼; 𝜃/𝑥)
kth raw moment E[𝑋^𝑘]: 𝜃^𝑘 Γ(𝛼 − 𝑘)/Γ(𝛼), 𝑘 < 𝛼
E[(𝑋 ∧ 𝑥)^𝑘]: [𝜃^𝑘 Γ(𝛼 − 𝑘)/Γ(𝛼)] [1 − Γ(𝛼 − 𝑘; 𝜃/𝑥)] + 𝑥^𝑘 Γ(𝛼; 𝜃/𝑥)

R Commands

Function Name R Command


Density function dinvgamma(𝑥 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Distribution function pinvgamma(𝑝 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Quantile function qinvgamma(𝑞 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Random sampling function rinvgamma(𝑛 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)

Illustrative Graph

[Figure: Inverse Gamma Distribution — probability density]

Weibull

Functions

Parameter assumptions: 𝜃 > 0, 𝛼 > 0
Probability density function 𝑓(𝑥): 𝛼(𝑥/𝜃)^𝛼 exp(−(𝑥/𝜃)^𝛼)/𝑥
Distribution function 𝐹(𝑥): 1 − exp(−(𝑥/𝜃)^𝛼)
kth raw moment E[𝑋^𝑘]: 𝜃^𝑘 Γ(1 + 𝑘/𝛼), 𝑘 > −𝛼
E[(𝑋 ∧ 𝑥)^𝑘]: 𝜃^𝑘 Γ(1 + 𝑘/𝛼) Γ[1 + 𝑘/𝛼; (𝑥/𝜃)^𝛼] + 𝑥^𝑘 exp(−(𝑥/𝜃)^𝛼), 𝑘 > −𝛼

R Commands

Function Name R Command


Density function dweibull(𝑥 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Distribution function pweibull(𝑝 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Quantile function qweibull(𝑞 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Random sampling function rweibull(𝑛 =, 𝑠ℎ𝑎𝑝𝑒 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝜃)

Illustrative Graph

[Figure: Weibull Distribution — probability density]

Inverse Weibull
Functions

Parameter assumptions: 𝜃 > 0, 𝜏 > 0
Probability density function 𝑓(𝑥): 𝜏(𝜃/𝑥)^𝜏 exp(−(𝜃/𝑥)^𝜏)/𝑥
Distribution function 𝐹(𝑥): exp(−(𝜃/𝑥)^𝜏)
kth raw moment E[𝑋^𝑘]: 𝜃^𝑘 Γ(1 − 𝑘/𝜏), 𝑘 < 𝜏
E[(𝑋 ∧ 𝑥)^𝑘]: 𝜃^𝑘 Γ(1 − 𝑘/𝜏)[1 − Γ(1 − 𝑘/𝜏; (𝜃/𝑥)^𝜏)] + 𝑥^𝑘 [1 − 𝑒^(−(𝜃/𝑥)^𝜏)]

R Commands

Function Name R Command


Density function dinvweibull(𝑥 =, 𝑠ℎ𝑎𝑝𝑒 = 𝜏 , 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Distribution function pinvweibull(𝑝 =, 𝑠ℎ𝑎𝑝𝑒 = 𝜏 , 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Quantile function qinvweibull(𝑞 =, 𝑠ℎ𝑎𝑝𝑒 = 𝜏 , 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Random sampling function rinvweibull(𝑛 =, 𝑠ℎ𝑎𝑝𝑒 = 𝜏 , 𝑠𝑐𝑎𝑙𝑒 = 𝜃)

Illustrative Graph

[Figure: Inverse Weibull Distribution — probability density]

Uniform

Functions

Parameter assumptions: −∞ < 𝛼 < 𝛽 < ∞
Probability density function 𝑓(𝑥): 1/(𝛽 − 𝛼)
Distribution function 𝐹(𝑥): (𝑥 − 𝛼)/(𝛽 − 𝛼)
Mean E[𝑋]: (𝛽 + 𝛼)/2
Variance E[(𝑋 − 𝜇)^2]: (𝛽 − 𝛼)^2/12
Central moments E[(𝑋 − 𝜇)^𝑘]: 𝜇𝑘 = 0 for odd 𝑘; 𝜇𝑘 = (𝛽 − 𝛼)^𝑘/[2^𝑘(𝑘 + 1)] for even 𝑘

R Commands

Function Name R Command


Density function dunif(𝑥 =, 𝑚𝑖𝑛 = 𝑎, 𝑚𝑎𝑥 = 𝑏)
Distribution function punif(𝑝 =, 𝑚𝑖𝑛 = 𝑎, 𝑚𝑎𝑥 = 𝑏)
Quantile function qunif(𝑞 =, 𝑚𝑖𝑛 = 𝑎, 𝑚𝑎𝑥 = 𝑏)
Random sampling function runif(𝑛 =, 𝑚𝑖𝑛 = 𝑎, 𝑚𝑎𝑥 = 𝑏)

Illustrative Graph

[Figure: Continuous Uniform Distribution — probability density versus X]

Normal

Functions

Parameter assumptions: −∞ < 𝜇 < ∞, 𝜎 > 0
Probability density function 𝑓(𝑥): [1/(𝜎√(2𝜋))] exp(−(𝑥 − 𝜇)^2/(2𝜎^2))
Distribution function 𝐹(𝑥): Φ((𝑥 − 𝜇)/𝜎)
Mean E[𝑋]: 𝜇
Variance E[(𝑋 − 𝜇)^2]: 𝜎^2
Central moments E[(𝑋 − 𝜇)^𝑘]: 𝜇𝑘 = 0 for odd 𝑘; 𝜇𝑘 = 𝑘!𝜎^𝑘/[(𝑘/2)! 2^(𝑘/2)] for even 𝑘

R Commands

Function Name R Command


Density function dnorm(𝑥 =, 𝑚𝑒𝑎𝑛 = 𝜇, 𝑠𝑑 = 𝜎)
Distribution function pnorm(𝑝 =, 𝑚𝑒𝑎𝑛 = 𝜇, 𝑠𝑑 = 𝜎)
Quantile function qnorm(𝑞 =, 𝑚𝑒𝑎𝑛 = 𝜇, 𝑠𝑑 = 𝜎)
Random sampling function rnorm(𝑛 =, 𝑚𝑒𝑎𝑛 = 𝜇, 𝑠𝑑 = 𝜎)

Illustrative Graph

[Figure: Normal Distribution — probability density]

Cauchy

Functions

Parameter assumptions: −∞ < 𝛼 < ∞, 𝛽 > 0
Probability density function 𝑓(𝑥): [1/(𝜋𝛽)] [1 + ((𝑥 − 𝛼)/𝛽)^2]^(−1)

R Commands

Function Name R Command


Density function dcauchy(𝑥 =, 𝑙𝑜𝑐𝑎𝑡𝑖𝑜𝑛 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝛽)
Distribution function pcauchy(𝑝 =, 𝑙𝑜𝑐𝑎𝑡𝑖𝑜𝑛 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝛽)
Quantile function qcauchy(𝑞 =, 𝑙𝑜𝑐𝑎𝑡𝑖𝑜𝑛 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝛽)
Random sampling function rcauchy(𝑛 =, 𝑙𝑜𝑐𝑎𝑡𝑖𝑜𝑛 = 𝛼, 𝑠𝑐𝑎𝑙𝑒 = 𝛽)

Illustrative Graph

[Figure: Cauchy Distribution — probability density]

18.2.3 Three Parameter Distributions


Generalized Pareto

Functions

Parameter assumptions: 𝜃 > 0, 𝛼 > 0, 𝜏 > 0, 𝑢 = 𝑥/(𝑥 + 𝜃)
Probability density function 𝑓(𝑥): [Γ(𝛼 + 𝜏)/(Γ(𝛼)Γ(𝜏))] 𝜃^𝛼 𝑥^(𝜏−1)/(𝑥 + 𝜃)^(𝛼+𝜏)
Distribution function 𝐹(𝑥): 𝛽(𝜏, 𝛼; 𝑢)
kth raw moment E[𝑋^𝑘]: 𝜃^𝑘 Γ(𝜏 + 𝑘)Γ(𝛼 − 𝑘)/[Γ(𝛼)Γ(𝜏)], −𝜏 < 𝑘 < 𝛼
E[(𝑋 ∧ 𝑥)^𝑘]: [𝜃^𝑘 Γ(𝜏 + 𝑘)Γ(𝛼 − 𝑘)/(Γ(𝛼)Γ(𝜏))] 𝛽(𝜏 + 𝑘, 𝛼 − 𝑘; 𝑢) + 𝑥^𝑘 [1 − 𝛽(𝜏, 𝛼; 𝑢)], 𝑘 > −𝜏

R Commands

Function Name R Command


Density function dgenpareto(𝑥 =, 𝑠ℎ𝑎𝑝𝑒1 = 𝛼, 𝑠ℎ𝑎𝑝𝑒2 = 𝜏 , 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Distribution function pgenpareto(𝑞 =, 𝑠ℎ𝑎𝑝𝑒1 = 𝛼, 𝑠ℎ𝑎𝑝𝑒2 = 𝜏 , 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Quantile function qgenpareto(𝑝 =, 𝑠ℎ𝑎𝑝𝑒1 = 𝛼, 𝑠ℎ𝑎𝑝𝑒2 = 𝜏 , 𝑠𝑐𝑎𝑙𝑒 = 𝜃)
Random sampling function rgenpareto(𝑟 =, 𝑠ℎ𝑎𝑝𝑒1 = 𝛼, 𝑠ℎ𝑎𝑝𝑒2 = 𝜏 , 𝑠𝑐𝑎𝑙𝑒 = 𝜃)

Illustrative Graph

[Figure: Generalized Pareto Distribution — probability density]

Burr
Functions

Parameter assumptions: 𝜃 > 0, 𝛼 > 0, 𝛾 > 0, 𝑢 = 1/[1 + (𝑥/𝜃)^𝛾]
Probability density function 𝑓(𝑥): 𝛼𝛾(𝑥/𝜃)^𝛾/(𝑥[1 + (𝑥/𝜃)^𝛾]^(𝛼+1))
Distribution function 𝐹(𝑥): 1 − 𝑢^𝛼
kth raw moment E[𝑋^𝑘]: 𝜃^𝑘 Γ(1 + 𝑘/𝛾)Γ(𝛼 − 𝑘/𝛾)/Γ(𝛼), −𝛾 < 𝑘 < 𝛼𝛾
E[(𝑋 ∧ 𝑥)^𝑘]: [𝜃^𝑘 Γ(1 + 𝑘/𝛾)Γ(𝛼 − 𝑘/𝛾)/Γ(𝛼)] 𝛽(1 + 𝑘/𝛾, 𝛼 − 𝑘/𝛾; 1 − 𝑢) + 𝑥^𝑘 𝑢^𝛼, 𝑘 > −𝛾

R Commands

Function Name   R Command
Density function   dburr(x =, shape1 = α, shape2 = γ, scale = θ)
Distribution function   pburr(q =, shape1 = α, shape2 = γ, scale = θ)
Quantile function   qburr(p =, shape1 = α, shape2 = γ, scale = θ)
Random sampling function   rburr(n =, shape1 = α, shape2 = γ, scale = θ)

Illustrative Graph

[Figure: Burr Distribution — probability density plotted for x from 0 to 1000.]
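A sketch (assuming actuar; α = 3, γ = 2, θ = 200, and x = 150 are illustrative) confirming that pburr agrees with the closed-form distribution function 1 − u^α above.

# Sketch: closed-form Burr cdf vs. actuar's pburr
library(actuar)
alpha <- 3; gam <- 2; theta <- 200; x <- 150
u <- 1 / (1 + (x / theta)^gam)
1 - u^alpha
pburr(x, shape1 = alpha, shape2 = gam, scale = theta)   # same value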

Inverse Burr

Functions

Name   Function
Parameter assumptions   θ > 0, τ > 0, γ > 0, u = (x/θ)^γ/(1 + (x/θ)^γ)
Probability density function f(x)   τγ(x/θ)^(τγ) / ( x[1 + (x/θ)^γ]^(τ+1) )
Distribution function F(x)   u^τ
kth raw moment E[X^k]   θ^k Γ(τ + k/γ)Γ(1 − k/γ)/Γ(τ),  −τγ < k < γ
kth limited moment E[(X ∧ x)^k]   [θ^k Γ(τ + k/γ)Γ(1 − k/γ)/Γ(τ)] β(τ + k/γ, 1 − k/γ; u) + x^k [1 − u^τ],  k > −τγ

R Commands

Function Name   R Command
Density function   dinvburr(x =, shape1 = τ, shape2 = γ, scale = θ)
Distribution function   pinvburr(q =, shape1 = τ, shape2 = γ, scale = θ)
Quantile function   qinvburr(p =, shape1 = τ, shape2 = γ, scale = θ)
Random sampling function   rinvburr(n =, shape1 = τ, shape2 = γ, scale = θ)

Illustrative Graph

[Figure: Inverse Burr Distribution — probability density plotted for x from 0 to 1000.]
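A sketch (assuming actuar; τ = 2, γ = 3, θ = 100, and x = 250 are illustrative) checking the limited moment formula with k = 1 against direct numerical integration, using the fact that the incomplete beta function β(a, b; u) is pbeta(u, a, b) in R.

# Sketch: inverse Burr E[min(X, x)] -- closed form vs. numerical integration
library(actuar)
tau <- 2; gam <- 3; theta <- 100; x <- 250
u <- (x / theta)^gam / (1 + (x / theta)^gam)
theta * gamma(tau + 1 / gam) * gamma(1 - 1 / gam) / gamma(tau) *
  pbeta(u, tau + 1 / gam, 1 - 1 / gam) + x * (1 - u^tau)
integrate(function(t) t * dinvburr(t, shape1 = tau, shape2 = gam, scale = theta),
          lower = 0, upper = x)$value +
  x * (1 - pinvburr(x, shape1 = tau, shape2 = gam, scale = theta))   # same value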

18.2.4 Four Parameter Distribution


Generalized Beta of the Second Kind (GB2)

Functions

Name   Function
Parameter assumptions   θ > 0, α1 > 0, α2 > 0, σ > 0
Probability density function f(x)   (x/θ)^(α2/σ) / ( xσ B(α1, α2)[1 + (x/θ)^(1/σ)]^(α1+α2) )
kth raw moment E[X^k]   θ^k B(α1 + kσ, α2 − kσ)/B(α1, α2),  k > 0

R Commands

Please see the R Codes for Loss Data Analytics site for information about this
distribution.
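In the meantime, a minimal sketch (a direct transcription of the density formula above, not the package code referred to on that site; the parameter values in the check are illustrative) is:

# Sketch: GB2 density transcribed directly from the formula above
dGB2 <- function(x, theta, alpha1, alpha2, sigma) {
  (x / theta)^(alpha2 / sigma) /
    (x * sigma * beta(alpha1, alpha2) * (1 + (x / theta)^(1 / sigma))^(alpha1 + alpha2))
}
# the density should integrate to one
integrate(dGB2, lower = 0, upper = Inf,
          theta = 100, alpha1 = 3, alpha2 = 2, sigma = 1)$value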

18.2.5 Other Distributions



Lognormal

Functions

Name   Function
Parameter assumptions   −∞ < μ < ∞, σ > 0
Probability density function f(x)   (1/(xσ√(2π))) exp(−(ln x − μ)²/(2σ²))
Distribution function F(x)   Φ((ln x − μ)/σ)
kth raw moment E[X^k]   exp(kμ + k²σ²/2)
kth limited moment E[(X ∧ x)^k]   exp(kμ + k²σ²/2) Φ((ln x − μ − kσ²)/σ) + x^k [1 − Φ((ln x − μ)/σ)]

Illustrative Graph

[Figure: Lognormal Distribution — probability density plotted for x from 0 to 1000.]
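A sketch (base R, whose built-in lognormal functions are dlnorm and plnorm with arguments meanlog and sdlog; μ = 5, σ = 1, k = 1, and x = 500 are illustrative) checking the k-th limited moment formula numerically.

# Sketch: lognormal E[min(X, x)^k], closed form vs. numerical integration
mu <- 5; sigma <- 1; k <- 1; x <- 500
exp(k * mu + k^2 * sigma^2 / 2) * pnorm((log(x) - mu - k * sigma^2) / sigma) +
  x^k * (1 - pnorm((log(x) - mu) / sigma))
integrate(function(t) t^k * dlnorm(t, meanlog = mu, sdlog = sigma),
          lower = 0, upper = x)$value +
  x^k * (1 - plnorm(x, meanlog = mu, sdlog = sigma))   # same value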

Inverse Gaussian

Functions

Name   Function
Parameter assumptions   θ > 0, μ > 0, z = (x − μ)/μ, y = (x + μ)/μ
Probability density function f(x)   (θ/(2πx³))^(1/2) exp(−θz²/(2x))
Distribution function F(x)   Φ[z(θ/x)^(1/2)] + exp(2θ/μ) Φ[−y(θ/x)^(1/2)]
Mean E[X]   μ
Variance Var[X]   μ³/θ
Limited expected value E[X ∧ x]   x − μz Φ[z(θ/x)^(1/2)] − μy exp(2θ/μ) Φ[−y(θ/x)^(1/2)]

R Commands

Function Name   R Command
Density function   dinvgauss(x =, mean = μ, dispersion = 1/θ)
Distribution function   pinvgauss(q =, mean = μ, dispersion = 1/θ)
Quantile function   qinvgauss(p =, mean = μ, dispersion = 1/θ)
Random sampling function   rinvgauss(n =, mean = μ, dispersion = 1/θ)

Illustrative Graph

[Figure: Inverse Gaussian Distribution — probability density plotted for x from 0 to 100.]
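A sketch (assuming the statmod package's parameterization, in which the θ above enters as shape = θ, equivalently dispersion = 1/θ; μ = 20, θ = 80, and x = 30 are illustrative) checking the distribution function formula.

# Sketch: inverse Gaussian cdf from the table vs. statmod's pinvgauss
library(statmod)
mu <- 20; theta <- 80; x <- 30
z <- (x - mu) / mu; y <- (x + mu) / mu
pnorm(z * sqrt(theta / x)) + exp(2 * theta / mu) * pnorm(-y * sqrt(theta / x))
pinvgauss(x, mean = mu, dispersion = 1 / theta)   # same value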

18.2.6 Distributions with Finite Support

Beta

Functions

Name   Function
Parameter assumptions   θ > 0, a > 0, b > 0, u = x/θ, 0 < x < θ
Probability density function f(x)   [Γ(a + b)/(Γ(a)Γ(b))] u^a (1 − u)^(b−1) (1/x)
Distribution function F(x)   β(a, b; u)
kth raw moment E[X^k]   θ^k Γ(a + b)Γ(a + k)/(Γ(a)Γ(a + b + k)),  k > −a
kth limited moment E[(X ∧ x)^k]   θ^k [a(a + 1)⋯(a + k − 1)/((a + b)(a + b + 1)⋯(a + b + k − 1))] β(a + k, b; u) + x^k [1 − β(a, b; u)]

R Commands

Function Name   R Command
Density function   dbeta(x =, shape1 = a, shape2 = b)
Distribution function   pbeta(q =, shape1 = a, shape2 = b)
Quantile function   qbeta(p =, shape1 = a, shape2 = b)
Random sampling function   rbeta(n =, shape1 = a, shape2 = b)

Note that base R's beta functions have no scale argument; they correspond to θ = 1 (the ncp argument is a noncentrality parameter rather than the scale θ).

[Figure: Beta Distribution — probability density plotted for x from 0 to 1.]
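A sketch (base R, which covers the θ = 1 case; a = 2, b = 3, and k = 2 are illustrative) checking the k-th raw moment formula by numerical integration.

# Sketch: k-th raw moment of the beta distribution (theta = 1)
a <- 2; b <- 3; k <- 2
gamma(a + b) * gamma(a + k) / (gamma(a) * gamma(a + b + k))       # closed form, 0.2
integrate(function(x) x^k * dbeta(x, shape1 = a, shape2 = b),
          lower = 0, upper = 1)$value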

Generalized Beta
Functions

Name   Function
Parameter assumptions   θ > 0, a > 0, b > 0, τ > 0, 0 < x < θ, u = (x/θ)^τ
Probability density function f(x)   [Γ(a + b)/(Γ(a)Γ(b))] u^a (1 − u)^(b−1) (τ/x)
Distribution function F(x)   β(a, b; u)
kth raw moment E[X^k]   θ^k Γ(a + b)Γ(a + k/τ)/(Γ(a)Γ(a + b + k/τ)),  k > −aτ
kth limited moment E[(X ∧ x)^k]   [θ^k Γ(a + b)Γ(a + k/τ)/(Γ(a)Γ(a + b + k/τ))] β(a + k/τ, b; u) + x^k [1 − β(a, b; u)]

R Commands

Function Name   R Command
Density function   dgenbeta(x =, shape1 = a, shape2 = b, shape3 = τ, scale = θ)
Distribution function   pgenbeta(q =, shape1 = a, shape2 = b, shape3 = τ, scale = θ)
Quantile function   qgenbeta(p =, shape1 = a, shape2 = b, shape3 = τ, scale = θ)
Random sampling function   rgenbeta(n =, shape1 = a, shape2 = b, shape3 = τ, scale = θ)

Illustrative Graph

[Figure: Generalized Beta Distribution — probability density plotted for x from 0 to 1000.]
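A sketch (assuming actuar's pgenbeta; a = 2, b = 3, τ = 1.5, θ = 100, and x = 60 are illustrative) confirming that the distribution function equals the incomplete beta function β(a, b; u) with u = (x/θ)^τ.

# Sketch: generalized beta cdf vs. its incomplete beta representation
library(actuar)
a <- 2; b <- 3; tau <- 1.5; theta <- 100; x <- 60
pgenbeta(x, shape1 = a, shape2 = b, shape3 = tau, scale = theta)
pbeta((x / theta)^tau, a, b)                        # same value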

18.3 Limited Expected Values


Overview. This section summarizes limited expected values for selected continuous distributions.

Functions
Limited Expected Value Functions

Distribution   Limited Expected Value E[X ∧ x]
GB2   [θΓ(τ+1)Γ(α−1)/(Γ(α)Γ(τ))] β(τ+1, α−1; x/(x+θ)) + x[1 − β(τ, α; x/(x+θ))]
Burr   [θΓ(1+1/γ)Γ(α−1/γ)/Γ(α)] β(1+1/γ, α−1/γ; 1 − 1/(1+(x/θ)^γ)) + x (1/(1+(x/θ)^γ))^α
Inverse Burr   [θΓ(τ+1/γ)Γ(1−1/γ)/Γ(τ)] β(τ+1/γ, 1−1/γ; (x/θ)^γ/(1+(x/θ)^γ)) + x[1 − ((x/θ)^γ/(1+(x/θ)^γ))^τ]
Pareto (α = 1)   −θ ln(θ/(x+θ))
Pareto (α ≠ 1)   [θ/(α−1)] [1 − (θ/(x+θ))^(α−1)]
Inverse Pareto   θτ ∫_0^(x/(x+θ)) y^τ (1−y)^(−1) dy + x[1 − (x/(x+θ))^τ]
Loglogistic   θΓ(1+1/γ)Γ(1−1/γ) β(1+1/γ, 1−1/γ; (x/θ)^γ/(1+(x/θ)^γ)) + x(1 − (x/θ)^γ/(1+(x/θ)^γ))
Paralogistic   [θΓ(1+1/α)Γ(α−1/α)/Γ(α)] β(1+1/α, α−1/α; 1 − 1/(1+(x/θ)^α)) + x (1/(1+(x/θ)^α))^α
Inverse Paralogistic   [θΓ(τ+1/τ)Γ(1−1/τ)/Γ(τ)] β(τ+1/τ, 1−1/τ; (x/θ)^τ/(1+(x/θ)^τ)) + x[1 − ((x/θ)^τ/(1+(x/θ)^τ))^τ]
Gamma   [θΓ(α+1)/Γ(α)] Γ(α+1; x/θ) + x[1 − Γ(α; x/θ)]
Inverse Gamma   [θΓ(α−1)/Γ(α)] [1 − Γ(α−1; θ/x)] + x Γ(α; θ/x)
Weibull   θΓ(1+1/α) Γ(1+1/α; (x/θ)^α) + x exp(−(x/θ)^α)
Inverse Weibull   θΓ(1−1/α) [1 − Γ(1−1/α; (θ/x)^α)] + x[1 − exp(−(θ/x)^α)]
Exponential   θ(1 − exp(−x/θ))
Inverse Exponential   θ G(0; θ/x) + x(1 − exp(−θ/x))
Lognormal   exp(μ + σ²/2) Φ((ln(x) − μ − σ²)/σ) + x[1 − Φ((ln(x) − μ)/σ)]
Inverse Gaussian   x − μ((x−μ)/μ) Φ[((x−μ)/μ)(θ/x)^(1/2)] − μ((x+μ)/μ) exp(2θ/μ) Φ[−((x+μ)/μ)(θ/x)^(1/2)]
Single-Parameter Pareto   αθ/(α−1) − θ^α/((α−1) x^(α−1))
Generalized Beta   [θΓ(a+b)Γ(a+1/τ)/(Γ(a)Γ(a+b+1/τ))] β(a+1/τ, b; (x/θ)^τ) + x[1 − β(a, b; (x/θ)^τ)]
Beta   [θa/(a+b)] β(a+1, b; x/θ) + x[1 − β(a, b; x/θ)]

Illustrative Graph
Comparison of Limited Expected Values for Selected Distributions

Distribution   Parameters   E[X]   E[X ∧ 100]   E[X ∧ 250]   E[X ∧ 500]   E[X ∧ 1000]
Pareto   α = 3, θ = 200   100   55.55   80.25   91.84   97.22
Exponential   θ = 100   100   63.21   91.79   99.33   99.99
Gamma   α = 2, θ = 50   100   72.93   97.64   99.97   100
Weibull   τ = 2, θ = 200/√π   100   78.99   99.82   100   100
GB2   α = 3, τ = 2, θ = 100   100   62.50   86.00   94.91   98.42
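The table entries can be reproduced numerically; the sketch below (base R, using the identity E[X ∧ u] = ∫_0^u S(t) dt for a nonnegative loss) recomputes the exponential and Pareto rows.

# Sketch: reproduce E[min(X, u)] for the exponential and Pareto rows above
lev_num <- function(surv, u) integrate(surv, lower = 0, upper = u)$value
# Exponential, theta = 100: S(t) = exp(-t/100); approx 63.21 91.79 99.33 99.99
sapply(c(100, 250, 500, 1000), function(u) lev_num(function(t) exp(-t / 100), u))
# Pareto, alpha = 3, theta = 200: S(t) = (200/(t+200))^3; approx 55.56 80.25 91.84 97.22
sapply(c(100, 250, 500, 1000), function(u) lev_num(function(t) (200 / (t + 200))^3, u))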

[Figure: Limited Expected Values for Several Distributions — E[X ∧ x] plotted for x from 0 to 1000 for the Pareto, Exponential, Gamma, Weibull, and GB2 distributions in the table above.]
Chapter 19

Appendix E: Conventions for Notation

Chapter Preview. Loss Data Analytics serves as a bridge between actuarial problems and methods and widely accepted statistical concepts and tools. Thus, the notation should be consistent with standard usage employed in probability and mathematical statistics. See, for example, (Halperin et al., 1965) for a description of one standard.

19.1 General Conventions


• Random variables are denoted by upper-case italicized Roman letters, with
𝑋 or 𝑌 denoting a claim size variable, 𝑁 a claim count variable, and 𝑆 an
aggregate loss variable. Realizations of random variables are denoted by
corresponding lower-case italicized Roman letters, with 𝑥 or 𝑦 for claim
sizes, 𝑛 for a claim count, and 𝑠 for an aggregate loss.
• Probability events are denoted by upper-case Roman letters, such as Pr(A) for the probability that an outcome in the event "A" occurs.
• Cumulative probability functions are denoted by 𝐹 (𝑧) and probability
density functions by the associated lower-case Roman letter: 𝑓(𝑧).
• For distributions, parameters are denoted by lower-case Greek letters. A caret or "hat" indicates a sample estimate of the corresponding population parameter. For example, β̂ is an estimate of β.
• The arithmetic mean of a set of numbers, say, x1, …, xn, is usually denoted by x̄; the use of x is, of course, optional.
• Use upper-case boldface Roman letters to denote a matrix other than a vector. Use lower-case boldface Roman letters to denote a (column) vector. Use a superscript prime "′" for transpose. For example, x′Ax is a quadratic form.


• Acronyms are to be used sparingly, given the international focus of our audience. Introduce acronyms commonly used in statistical nomenclature but limit the number of acronyms introduced. For example, pdf for probability density function is useful but GS for Gini statistic is not.

19.2 Abbreviations

Here is a list of abbreviations that we adopt. We italicize these acronyms. For example, we can discuss the goodness of fit in terms of the AIC criterion.

𝐴𝐼𝐶 Akaike information criterion


𝐵𝐼𝐶 (Schwarz) Bayesian information criterion
𝑐𝑑𝑓 cumulative distribution function
𝑑𝑓 degrees of freedom
𝑖𝑖𝑑 independent and identically distributed
𝐺𝐿𝑀 generalized linear model
𝑚𝑙𝑒 maximum likelihood estimate/estimator
𝑜𝑙𝑠 ordinary least squares
𝑝𝑑𝑓 probability density function
𝑝𝑚𝑓 probability mass function

19.3 Common Statistical Symbols and Operators

Here is a list of commonly used statistical symbols and operators, including the latex code that we use to generate them (in the parens).

𝐼(⋅) binary indicator function (𝐼). For example, 𝐼(𝐴) is one if an outcome in event
𝐴 occurs and is 0 otherwise.
Pr(⋅) probability (\Pr)
E(⋅) expectation operator (\mathrm{E}). For example, E(𝑋) = E 𝑋 is the
expected value of the random variable 𝑋, commonly denoted by 𝜇.
Var(⋅) variance operator (\mathrm{Var}). For example, Var(𝑋) = Var 𝑋 is the
variance of the random variable 𝑋, commonly denoted by 𝜎2 .
𝜇𝑘 = E 𝑋 𝑘 kth moment of the random variable X. For 𝑘=1, use 𝜇 = 𝜇1 .
Cov(⋅, ⋅) covariance operator (\mathrm{Cov}). For example,
Cov(𝑋, 𝑌 ) = E {(𝑋 − E 𝑋)(𝑌 − E 𝑌 )} = E(𝑋𝑌 ) − (E 𝑋)(E 𝑌 )
is the covariance between random variables 𝑋 and 𝑌 .
E(𝑋|⋅) conditional expectation operator. For example, E(𝑋|𝑌 = 𝑦) is the
conditional expected value of a random variable 𝑋 given that
the random variable 𝑌 equals y.
Φ(⋅) standard normal cumulative distribution function (\Phi)
𝜙(⋅) standard normal probability density function (\phi)
∼ means is distributed as (\sim). For example, 𝑋 ∼ 𝐹 means that the
random variable 𝑋 has distribution function 𝐹 .
se(β̂) standard error of the parameter estimate β̂ (\hat{\beta}), usually an estimate of the standard deviation of β̂, which is √Var(β̂).
𝐻0 null hypothesis
𝐻𝑎 or 𝐻1 alternative hypothesis
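As a small illustration of how these codes combine in latex source (a sketch only; the displayed identity is a standard conditional-expectation decomposition, not a line taken from the text), one might typeset:

% sketch: combining the operators listed above in latex source
\mathrm{E}(X) = \mathrm{E}\{\mathrm{E}(X \mid Y)\}, \qquad
\mathrm{Var}(X) = \mathrm{E}\{\mathrm{Var}(X \mid Y)\} + \mathrm{Var}\{\mathrm{E}(X \mid Y)\}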

19.4 Common Mathematical Symbols and Functions

Here is a list of commonly used mathematical symbols and functions, including the latex code that we use to generate them (in the parens).

≡ identity, equivalence (\equiv)


⟹ implies (\implies)
⟺ if and only if (\iff)
→, ⟶ converges to (\to, \longrightarrow)
ℕ natural numbers 1, 2, … (\mathbb{N})
ℝ real numbers (\mathbb{R})
∈ belongs to (\in)
∉ does not belong to (\notin)
⊆ is a subset of (\subseteq)
⊂ is a proper subset of (\subset)
∪ union (\cup)
∩ intersection (\cap)
∅ empty set (\emptyset)
𝐴𝑐 complement of 𝐴

g∗f convolution, (g ∗ f)(x) = ∫_{−∞}^{∞} g(y) f(x − y) dy
exp exponential (\exp)
log natural logarithm (\log)
log𝑎 logarithm to the base 𝑎
! factorial
sgn(x) sign of x (sgn)
⌊𝑥⌋ integer part of x, that is, largest integer ≤ 𝑥
(\lfloor, \rfloor)
|𝑥| absolute value of scalar 𝑥
Γ (𝑥) gamma (generalized factorial) function (\varGamma),
satisfying Γ (𝑥 + 1) = 𝑥Γ (𝑥)
𝐵(𝑥, 𝑦) beta function, Γ (𝑥)Γ (𝑦)/Γ (𝑥 + 𝑦)

19.5 Further Readings


To make connections to other literatures, see (Abadir and Magnus, 2002) http://www.janmagnus.nl/misc/notation.zip for a summary of notation from the econometrics perspective. A terrific feature of this reference is that many latex symbols are defined in the article. Further, there is a long history of discussion
and debate surrounding actuarial notation; see (Boehm et al., 1975) for one
contribution.
Chapter 20

Glossary

Term Definition
analytics Analytics is the process of using data to make decisions.
renters insurance Renters insurance is an insurance policy that covers the contents of
an apartment or house that you are renting.
automobile insurance An insurance policy that covers damage to your vehicle, damage to
other vehicles in the accident, as well as medical expenses of those
injured in the accident.
casualty insurance Casualty insurance is a form of liability insurance providing
coverage for negligent acts and omissions. examples include workers
compensation, errors and omissions, fidelity, crime, glass, boiler, and
various malpractice coverages.
commercial insurance
term The duration of an insurance contract
insurance claim An insurance claim is the compensation provided by the insurer for
incurred hurt, loss, or damage that is covered by the policy.
homeowners insurance Homeowners insurance is an insurance policy that covers the
contents and property of a building that is owned by you or a friend.
property insurance Property insurance is a policy that protects the insured against loss
or damage to real or personal property. the cause of loss might be
fire, lightning, business interruption, loss of rents, glass breakage,
tornado, windstorm, hail, water damage, explosion, riot, civil
commotion, rain, or damage from aircraft or vehicles.
non-life Non-life insurance is any type of insurance where payments are not
based on the death (or survivorship) of a named insured. examples
include automobile, homeowners, and so on. also known as property
and casualty or general insurance.


life insurance Life insurance is a contract where the insurer promises to pay upon
the death of an insured person. the person being paid is the
beneficiary.
personal insurance Insurance purchased by a person
loss adjustment expenses Loss adjustment expenses are costs to the insurer that are directly attributable to settling a claim; for example, the cost of an adjuster who assesses the claim cost, or of a lawyer who becomes involved in settling an insurer's legal obligation on a claim
unallocated Unallocated loss adjustment expenses are costs that can only be
indirectly attributed to claim settlement; for example, the cost of an
office to support claims staff
allocated Allocated loss adjustment expenses, sometimes known by the acronym alae, are costs that can be directly attributed to settling a claim; for example, the cost of an adjuster
underwriting Underwriting is the process where the company makes a decision as
to whether or not to take on a risk.
loss reserving A loss reserve is an estimate of liability indicating the amount the
insurer expects to pay for claims that have not yet been realized.
this includes losses incurred but not yet reported (ibnr) and those
claims that have been reported claims that haven’t been paid
(known by the acronym rbns for reported but not settled).
risk classification Risk classification is the process of grouping policyholders into
categories, or classes, where each insured in the class has a risk
profile that is similar to others in the class.
retrospective The process of determining the cost of an insurance policy based on
premiums the actual loss experience determined as an adjustment to the initial
premium payment.
claims adjustment Claims adjustment is the process of determining coverage, legal
liability, and settling claims.
claims leakage Claims leakage represents money lost through claims management
inefficiencies.
adjuster An adjuster is a person who investigates claims and recommends
settlement options based on estimates of damage and insurance
policies held.
dividends A dividend is the refund of a portion of the premium paid by the
insured from insurer surplus.
indemnification Indemnification is the compensation provided by the insurer.
rating variables Rating variables are the components of an insurance pricing formula.
they can include numeric variables (like values, revenue, or area)
and classification variables (like location, type of vehicle, or type of
occupancy.)
frequency Count random variables that represent the number of claims
severity The amount, or size, of each payment for an insured event

probability mass A function that gives the probability that a discrete random
function (pmf) variable is exactly equal to some value
distribution function The chance that the random variable is less than or equal to x, as a
function of x
mean Average
moments The rth moment of a list is the average value of the random variable
raised to the rth power
survival function The probability that the random variable takes on a value greater
than a number x
moment generating The mgf of random variable n is defined the expectation of exp(tn),
function (mgf) as a function of t
probability generating For a random variable n, its pgf is defined as the expectation of s^n,
function (pgf) as a function of s
convex hulls The convex hull of a set of points x is the smallest convex set that
contains x
risk classes The formation of different premiums for the same coverage based on
each homogeneous group’s characteristics.
binomial distribution A random variable has a binomial distribution (with parameters m
and q) if it is the number of ”successes” in a fixed number m of
independent random trials, all of which have the same probability q
of resulting in ”success.”
binary outcomes Outcomes whose unit can take on only two possible states,
traditionally labeled as 0 and 1
m-convolution The addition of m independent random variables
poisson distribution A discrete probability distribution that expresses the probability of
a given number of events occurring in a fixed interval of time or
space if these events occur with a known constant rate and
independently of the time since the last event
negative binomial The number of successes until we observe the rth failure in
distribution independent repetitions of an experiment with binary outcomes
overdispersed The presence of greater variability (statistical dispersion) in a data
set than would be expected based on a given statistical model
underdispersed Less variation in the data than would be expected based on a given statistical model
(a, b, 0) class The poisson, binomial and negative binomial distributions
maximum likelihood The possible value of the parameter for which the chance of
estimator (mle) observing the data is largest
local extrema The largest and smallest value of the function within a given range
central limit theorem In some situations, when independent random variables are added,
(clt) their properly normalized sum tends toward a normal distribution
even if the original variables themselves are not normally
distributed.
newton’s method A root-finding algorithm which produces successively better
approximations to the roots of a real-valued function
robust Resistant to errors in the results, produced by deviations from assumptions
explanatory variables In regression, the explanatory variable is the one that is supposed to
”explain” the other.
regression analysis A set of statistical processes for estimating the relationships among
variables
homogeneous Units of exposure that face approximately the same expected
frequency and severity of loss.
(a,b,1) A count distribution with probabilities satisfying p_k/p_{k-1}=a+b/k, for some constants a and b and k>=2
zero truncation Zero modification of a count distribution such that it assigns zero
probability to zero count
degenerate A deterministic distribution and takes only a single value
distribution
convex combination A linear combination of points where all coefficients are
non-negative and sum to 1
convex function A real-valued function defined on an interval is called convex if the
line segment between any two points on the graph of the function
lies above or on the graph.
mixture distribution The probability distribution of a random variable that is derived
from a collection of other random variables as follows: first, a
random variable is selected by chance from the collection according
to given probabilities of selection, and then the value of the selected
random variable is realized
chi-square distribution The chi-squared distribution with k degrees of freedom is the
distribution of a sum of the squares of k independent standard
normal random variables
goodness of fit The goodness of fit of a statistical model describes how well it fits a
set of observations.
pearson’s chi-square A statistical test applied to sets of categorical data to evaluate how
test likely it is that any observed difference between the sets arose by
chance
multinomial likelihood The multinomial distribution models the probability of counts for
rolling a k-sided die n times
aggregate losses Aggregate claims, or total claims observed in the time period
liability insurance Insurance that compensates an insured for loss due to legal liability
towards others
mixture distribution A weighted average of other distributions, which may be continuous
or discrete
continuous random Random variable which can take infinitely many values in its
variable specified domain
raw moment The kth moment of a random variable x is the average (expected)
value of x^k

central moment The kth central moment of a random variable x is the expected
value of (x-its mean)^k
skewness Measure of the symmetry of a distribution, 3rd central
moment/standard deviation^3
kurtosis Measure of the peaked-ness of a distribution, 4th central
moment/standard deviation^4
expected value Average
exponential A single parameter continous probability distribution that is defined
distribution by its rate parameter
independent Two variables are independent if conditional information given about
one variable provides no information regarding the other variable
percentile The pth percentile of a random variable x is the smallest value x_p
such that the probability of not exceeding it is p%
chi-square distribution A common distribution used in chi-square tests for determining
goodness of fit of observed data to a theorized distribution
light tailed A distribution with thinner tails than the benchmark exponential
distribution distribution
pareto distribution A heavy-tailed and positively skewed distribution with 2 parameters
hazard function Ratio of the probability density function and the survival function:
f(x)/s(x), and represents an instantaneous probability within a small
time frame
weibull distribution A positively skewed continuous distribution with 2 parameters that
can have an increasing or decreasing hazard function depending on
the shape parameter
generalized beta A 4-parameter flexible distribution that encompasses many common
distribution of the distributions
second kind
parametric Probability distribution defined by a fixed set of parameters
distributions
transformation A function or method that turns one distribution into another
distribution function A transformation technique that involves finding the cdf of the
technique transformed distribution through its relation with the original cdf
change-of-variable A transformation technique that involves finding the pdf of the
technique transformed distribution through its relation with the original pdf
using inverse functions
moment-generating A transformation technique that uses moment generating functions
function technique properties to determine the mgf of a linear combination of variables
lognormal distribution A heavy-tailed, positively skewed 2-parameter continuous
distribution such that the natural log of the random variable is
normally distributed with the same parameter values
reliability data A dataset consisting of failure times for failed units and run times
for units still functioning
power transformation A transformation type that involves raising a random variable to a
power
exponential transformation A transformation type that involves raising a random variable in the exponent
mixing parameters Proportion weight given to each subpopulation in a mixture
heterogeneous A dataset where the subpopulations are represented by separate
population distinct distributions
finite mixture A mixture distribution with a finite k number of subpopulations
continuous mixture A mixture distribution with an infinite number of subpopulations,
where the mixing parameter is itself a continuous distribution
conditional A probability distribution that applies to a subpopulation satisfying
distribution the condition
unconditional A probability distribution independent of any another imposed
distribution conditions
prior distribution A probability distribution assigned prior to observing additional
data
scale distribution A distribution with the property that multiplying all values by a
constant leads to the same distribution family with only the scale
parameter changed
moral hazard Situation where an insured is more likely to be risk seeking if they
do not bear sufficient consequences for a loss
payment per loss Amount insurer pays when a loss occurs and can be 0
payment per payment Amount insurer pays given a payment is needed and is greater than
0
left censored Values below a threshold d are not ignored but converted to 0
left truncated Values below a threshold d are not reported and unknown
loss elimination ratio % decrease of the expected payment by the insurer as a result of the
(ler) deductible
franchise deductible Insurer pays nothing for losses below the deductible, but pays the
full amount for any loss above the deductible
limit of coverage Policy limit, or maximum contractual financial obligation of the
insurer for a loss
group insurance Insurance provided to groups of people to take advantage of lower
administrative costs vs. individual policies
growth factor Multiplicative factor applied to a distribution to account for the
impact of inflation, typically (1+rate)
cedent Party that is transferring the risk to a reinsurer
excess of loss coverage Contract where an insurer pays all claims up to a specified amount
and then the reinsurer pays claims in excess of stated reinsurance
deductible
retention Maximum amount payable by the primary insurer in a reinsurance
arrangement
right censored variable Values above a threshold u are not ignored but converted to u
reinsurance A transaction where the primary insurer buys insurance from a
re-insurer who will cover part of the losses and/or loss adjustment
expenses of the primary insurer

method of maximum Statistical method used to derive the parameter values from data
likelihood that maximize the probability of observing the data given the
parameters
grouped data Data bucketed into categories with ranges, such as for use in
histograms or frequency tables
large-sample Asymptotic properties of a distribution as the amount of data
properties increases towards infinity
asymptotic variance Variability of the distribution of an estimator as the amount of data
increases towards infinity
delta method Statistical method used to approximate the asymptotic variance for
a function based on parameters whose asymptotic variance can be
determined
log-likelihood function Natural log of the likelihood function
covariance matrix Matrix where the (i,j)^th element represents the covariance between
the ith and jth random variables
complete data Data where each individual observation is known, and no values are
censored, truncated, or grouped
parametric Distributional assumptions made on the population from which the
data is drawn, with properties defined using parameters.
nonparametric No distributional assumptions are made on the population from
which the data is drawn.
sampling scheme How the data is obtained from the population and what data is
observed.
unbiased An estimator that has no bias, that is, the expected value of an
estimator equals the parameter being estimated.
plug-in principle The plug-in principle or analog principle of estimation proposes that
population parameters be estimated by sample statistics which have
the same property in the sample as the parameters do in the
population.
indicator A categorical variable that has only two groups. the numerical
values are usually taken to be one to indicate the presence of an
attribute, and zero otherwise. another name for a binary variable.
empirical distribution The empirical distribution is a non-parametric estimate of the
function underlying distribution of a random variable. it directly uses the
data observations to construct the distribution, with each observed
data point in a size-n sample having probability 1/n.
first quartile The 25th percentile; the number such that approximately 25% of
the data is below it.
third quartile The 75th percentile; the number such that approximately 75% of
the data is below it.
quantile The q-th quantile is the point(s) at which the distribution function
is equal to q, i.e. the inverse of the cumulative distribution function.
smoothed empirical A quantile obtained by linear interpolation between two empirical
quantile quantiles, i.e. data points.

bandwidth A small positive constant that defines the width of the steps and the
degree of smoothing.
kernel density A nonparametric estimator of the density function of a random
estimator variable.
bias-variance tradeoff The tradeoff between model simplicity (underfitting; high bias) and
flexibility (overfitting; high variance).
model diagnostics Procedures to assess the validity of a model
probability-probability A plot that compares two models through their cumulative
(pp) plot probabilities.
quantile-quantile (qq) A plot that compares two models through their quantiles.
plot
goodness of fit A measure used to assess how well a statistical model fits the data,
statistics usually by summarizing the discrepancy between the observations
and the expected values under the model.
method of moments The estimation of population parameters by approximating
parametric moments using empirical sample moments.
percentile matching The estimation of population parameters by approximating
parametric percentiles using empirical quantiles.
percentile A 100p-th percentile is the number such that 100 times p percent of
the data is below it.
gini index A measure for assessing income inequality. it measures the
discrepancy between the income and population distributions and is
calculated from the lorenz curve.
model selection The process of selecting a statistical model from a set of candidate
models using data.
in-sample A dataset used for analysis and model development. also known as a
training dataset.
out-of-sample A dataset used for model validation. also known as a test dataset.
cross-validation A model validation procedure in which the data sample is
partitioned into subsamples, where splits are formed by separately
taking each subsample as the out-of-sample dataset.
model validation The process of confirming that the proposed model is appropriate.
data-snooping Repeatedly fitting models to a data set without a prior hypothesis
of interest.
predictive inference Predictive inference is the process of using past data observations to
predict future observations.
likelihood function A function of the likeliness of the parameters in a model, given the
observed data.
ogive estimator A nonparametric estimator for the distribution function in the
presence of grouped data.
product-limit A nonparametric estimator of the survival function in the presence
estimator of incomplete data. also known as the kaplan-meier estimator.
risk set The number of observations that are active (not censored) at a
specific point.
nelson-aalen A nonparametric estimator of the cumulative hazard function in the presence of incomplete data.
credibility An actuarial method of balancing an individual’s loss experience
and the experience in the overall portfolio to improve ratemaking
estimates.
bayesian A type of statistical inference in which the model parameters and
the data are random variables.
predictive distribution The distribution of new data, conditional on a base set of data,
under the bayesian framework.
least squares A technique for estimating parameters in linear regression. it is a
standard approach in regression analysis to the approximate
solution of overdetermined systems. in this technique, one
determines the parameters that minimize the sum of squared
differences between each observation and the corresponding linear
combination of explanatory variables.
markov chain monte The class of numerical methods that use markov chains to generate
carlo (mcmc) draws from a posterior distribution.
simulation
improper prior A prior distribution in which the sum or integral of the distribution
is not finite.
confidence interval Another term for interval estimate. unlike a point estimate, it gives
a range of reliability for approximating a parameter of interest.
decision analysis Bayesian decision theory is the study of an agent’s choices, which is
informed by bayesian probability.
conjugate Distributions such that the posterior and the prior come from the
distributions same family of distributions.
credibility interval A summary of the posterior distribution of parameters under the
bayesian framework.
prior distribution The distribution of the parameters prior to observing data under
the bayesian framework.
exposure A measure of the rating units for which rates are applied to
determine the premium. for example, exposures may be measured
on a per unit basis (e.g. a family with auto insurance under one
contract may have an exposure of 2 cars) or per $1,000 of value (e.g.
homeowners insurance).
inflation Inflation is a sustained increase in the general price level of goods
and services over a period of time.
business line
individual risk model A modeling approach for aggregate losses in which the loss from
each individual contract is considered.
collective risk model A modeling approach for aggregate losses in which the aggregate
loss is represented in terms of a frequency distribution and a
severity distribution.
coverage Insurance coverage is the amount of risk or liability that is covered for an individual or entity by an insurance policy.
frequency distribution The random number of claims that occur under the collective risk
model.
severity distribution The randomly distributed amount of each loss under the collective
risk model.
central limit theorem Given certain conditions, the arithmetic mean of a large number of
replications of independent random variables, each with a finite
mean and variance, will be approximately normally distributed,
regardless of the underlying distribution.
term life insurance A term life insurance policy is payable only if death of the insured
occurs within a specified time, such as 5 or 10 years, or before a
specified age.
pure endowment A pure endowment is an insurance policy that is payable at the end
of the policy period if the insured is still alive. if the insured has
died, there is nothing paid in the form of benefits.
support The set of all outcomes for a random variable following some
distribution. for example, exponentially distributed random variable
x has support x>0.
convolution The convolution of probability distributions is the distribution
corresponding to the addition of independent random variables.
law of iterated A decomposition of the expected value of a random variable into
expectations conditional components. specifically, for random variables x and y,
e(x) = e[e(x|y)].
compound distribution A random variable follows a compound distribution if it is
parameterized and contains at least one parameter that is itself a
random variable. for example, the tweedie distribution is a
compound distribution.
tweedie distribution A compound distribution that is a poisson sum of gamma random
variables. because it can accommodate a discrete probability mass
at zero and a continuous positive component, it is suitable for
modeling aggregate insurance claims.
shape parameter A numerical parameter of a parametric distribution affecting the
shape of a distribution rather than simply shifting it (as a location
parameter does) or stretching/shrinking it (as a scale parameter
does).
scale parameter A numerical parameter of a parametric distribution that
stretches/shrinks the distribution without changing its location or
shape. the larger the scale parameter, the more spread out the
distribution. the scale parameter is also the reciprocal of the rate
parameter. for example, the normal distribution has scale parameter
\sigma.
exponential dispersion A set of distributions that represents a generalisation of the natural exponential family and also plays an important role in generalized linear models.
generalized linear Commonly known by the acronym glm. an extension of the linear
models regression model where the dependent variable is a member of the
linear exponential family. glm encompasses linear, binary, count,
and long-tailed, regressions all as special cases.
exponential family A family of parametric distributions that are practical for modeling
the underlying response variable in generalized linear models. this
family includes the normal, bernoulli, poisson, and tweedie
distributions as special cases, among many others.
monte carlo simulation A computerized statistical model that simulates the effects of
various types of uncertainty.
empirical distribution The empirical distribution is a non-parametric estimate of the
underlying distribution of a random variable. it directly uses the
data observations to construct the distribution, with each observed
data point in a size-n sample having probability 1/n.
converge A type of stochastic convergence for a sequence of random variables
x_1,…, x_n that approaches some other distribution as n
approaches \infty.
policy limits A policy limit is the maximum value covered by a policy.
ground-up loss The total amount of loss sustained before policy adjustments are
made (i.e. before deductions are applied for coinsurance,
deductibles, and/or policy limits.)
per-loss basis Due to policy modifications (e.g. deductibles), not all losses that
occur result in payment. the per-loss basis considers every loss that
occurs.
per-payment basis Due to policy modifications (e.g. deductibles), not all losses that
occur result in payment. the per-payment basis which considers only
the losses that result in some payment to the insured.
memoryless The memoryless property means that a given probability
distribution is independent of its history and what has already
elapsed. specifically, random variable x is memoryless if pr(x > s+t
| x >= s) = pr(x > t). note that it does not mean x > s+t and x
>= s are independent events.
central limit theorem The sample mean and sample sum of a random sample of n from a
population will converge to a normal curve as the sample size n
grows
simulations A computer generation of various hypothetical conditions and
outputs, based on the model structure provided
linear congruential Algorithm that yields pseudo-randomized numbers calculated using
generator a linear recursive relationship and a starting seed value
pseudo-random Values that appear random but can be replicated by formula
numbers
inverse transform method Samples a uniform number between 0 and 1 to represent the randomly selected percentile, then uses the inverse of the cumulative density function of the desired distribution in order to find the simulated value from that distribution
quantile function Inverse function for the cumulative density function which takes a
percentile value in [0,1] as the input, and outputs the corresponding
value in the distribution
greatest lower bound Largest value that is less than or equal to a specified subset of
values/elements
universal life insurance Type of cash value life insurance where the policy’s cash value is the
excess of premium payments over the cost of insurance, accumulated
with interest, with adjustable premiums and coverage over time
variable life insurance Type of life insurance whose face value and coverage term can vary
depending upon the performance of underlying invested securities
sampling variability How much an estimate can vary between samples
cauchy distribution A continuous distribution that represents the distribution of the
ratio of two independent normally random variables, where the
denominator distribution has mean zero
kolmogorov-smirnov A nonparametric statistical test used to determine if a data sample
test could come from a hypothesized continuous probability distribution
bootstrap A method of sampling with replacement from the original dataset to
create additional simulated datasets of the same size as the original
nonparametric A statistical method where no assumption is made about the
approach distribution of the population
parametric approach A statistical method where a prior assumption is made about the
distribution or model form
bias The difference between the expected value of an estimator and the
parameter being estimated. bias is an estimation error that does not
become smaller as one observes larger sample sizes.
bias-corrected If an estimator is known to be consistently biased in a manner, it
estimator can be corrected using a factor to become less biased or unbiased
jensen inequality For a convex function f(x), f(expected value of x) <= expected value
of f(x)
natural estimator An estimator that uses the sample moments as the estimators for
the population
percentile bootstrap Confidence interval for the parameter estimates determined using
interval the actual percentile results from the bootstrap sampling approach,
as every bootstrap sample has an associated parameter estimate(s)
that can be ranked against the others
k-fold cross-validation A type of validation method where the data is randomly split into k
groups, and each of the k groups is held out as a test dataset in
turn, while the other k-1 groups are used for distribution or model
fitting, with the process repeated k times in total
leave-one-out cross validation A special case of k-fold cross validation, where each single data point gets a turn in being the lone hold-out test data point, and n separate models in total are built and tested
jackknife statistics To calculate an estimator, leave out each observation in turn,
calculate the sample estimator statistic each time, and average over
the n separate estimates
accept-reject A sampling method that is used where the random sample is
mechanism discarded if not within a certain pre-specified range [a, b] and is
commonly used when the traditional inverse transform method
cannot be easily used
importance sampling Type of sampling method where values in the region of interest can
mechanism be over-sampled or values outside the region of interest can be
under-sampled
ergodic theorem Ergodic theory studies the behavior of a dynamical system when it
is allowed to run for an extended time
markov process A stochastic (time dependent) process that satisfies memorylessness,
meaning future predictions of the process can be made solely based
on its present state and not the historical path
invariant measure Any mathematical measure that is preserved by a function (the
mean is an example)
composants Component (smaller, self-contained part of larger entity)
hastings metropolis A markov chain monte carlo (mcmc) method for random sampling
from a probability distribution where values are iteratively
generated, with the distribution of the next sample dependent only
on the current sample value, and at each iteration, the candidate
sample can be either accepted or rejected
gibbs sampler A markov chain monte carlo (mcmc) method to obtain a sequence of
random samples from a specified multivariate continuous probability
distribution
premium Amount of money an insurer charges to provide the coverage
described in the policy
ratemaking Process used by insurers to calculate insurance rates, which drive
insurance premiums
insurance rates Amount of money needed to cover losses, expenses, and profit per
one unit of exposure
insured contingent A condition that results in an insurance claim
event
expected costs The cost to an insurer of payments to the insured and allocated loss
adjustment expenses (alaes). overhead and profit are not included
underwriting profit Profit an insurer derives from providing coverage, excluding
investment income
experience rating A type of rating plan that uses the insured’s historical loss
experience as part of the premium determination
price A quantity, usually of money, that is exchanged for a good or service

rates A rate is the price, or premium, charged per unit of exposure. a rate
is a premium expressed in standardized units.
technical prices
loss cost The sum of losses divided by an exposure; it is also known as the
pure premium.
profit loading A factor or percentage applied to the premium calculation to
account for insurer profit in a policy
indicated change A factor calculated from the loss ratio method that calculates how
factor the rates should change, with factors > 1 indicating an increase and
vice versa
indicated rate In a rate filing, the amount that the loss experience suggests that
the insurer should charge to cover costs.
credibility Weight assigned to observed data vs. that assigned to an external or
broader-based set of data
parametric Model assumption that the sample data comes from a population
distribution that can be modeled by a probability distribution with a fixed set of
parameters
commercial business Line of business that insures against damage to their buildings and
property contents due to a covered cause of loss
continuous variables Type of variable that can take on any real value
discrimination Process of determining premiums on the basis of likelihood of loss.
insurance laws prohibit ”unfair discrimination”.
rating factor A rating factor, or rating variable, is a characteristic of the
policyholder or risk being insured by which rates vary.
rating variable A rating factor, or rating variable, is a characteristic of the
policyholder or risk being insured by which rates vary.
factor A variable that varies by groups or categories.
relativity The difference of the expected risk between a specific level of a
rating factor and an accepted baseline value. this difference may be
arithmetic or proportional.
scale distribution Suppose that y = c x, where x comes from a parametric distribution
family and c is a positive constant. the distribution is said to be a
scale distribution if (i) the distributions of y and x come from the
same family and (ii) only a single parameter differs and that by a
factor of c.
written exposures Exposure is based off policies written/issued
earned exposures Exposure is based off amount exposed to loss for which coverage has
been provided
unearned exposures Exposure amount for which coverage has not yet been provided
in force exposures Exposure amount subject to loss at a particular point in time
calendar year method Experience for rating is aggregated based on calendar year, as
opposed to other methods such as when a policy term began
accident date Date of loss occurrence that gives rise to a claim
report date Date when insurer is notified of the claim

open claim A claim that has been reported but not yet closed
mix of business Different types of policies in an insurer’s portfolio
on-level earned Earned premium of historical policies using the current rate
premium structure
experience loss ratio Ratio of experience loss to on-level earned premium in the
experience period
claim The amount paid to an individual or corporation for the recovery,
under a policy of insurance, for loss that comes within that policy.
incurred but not A claim is said to be incurred but not reported if the insured event
reported occurs prior to a valuation date (and hence the insurer is liable for
payment) but the event has not been reported to the insurer.
closed A claim is said to be closed when the company deems its financial
obligations on the claim to be resolved.
valuation date A valuation date is the date at which a company summarizes its
financial position, typically quarterly or annually.
policy year This is the period between a policy’s anniversary dates.
gini index The gini index is twice the area between a lorenz curve and a 45
degree line.
line of equality 45 degree line equating x and y, that represents a perfect alignment
in the sample and population distribution
pp plot Statistical plot used to assess how close a data sample matches a
theorized distribution
performance curve A concentration curve is a graph of the distribution of two variables,
where both variables are ordered by only one of variables. for
insurance applications, it is a graph of distribution of losses versus
premiums, where both losses and premiums are ordered by
premiums.
community rating This generally refers to the premium principle where all risks pay
the same amount.
market conduct Regulation that ensures consumers obtain fair and reasonable
regulation insurance prices and coverage
government prescribed Government sets the entire rating system including coverages
prior approval Regulator must approve rates, forms, rules filed by insurers before
use
no file Insurers may use new rates, forms, rules without approval from
regulators
file only Insurers must file rates, forms, rules for record keeping and use
immediately
rating factors Characteristics of a risk that help price the insurance contract
multiplicative tariff A rating method where each rating factor is the product of
model parameters associated with that rating factor
risk characteristics The distinguishing features of a policy that help determine the
expected loss on the policy
gross insurance premium Sum of expected losses and expenses and profit on a policy
adverse selection A pricing structure that entices riskier individuals to purchase and
discourages low-risk individuals from purchasing
adverse selection spiral Phenomenon where a book of business deteriorates as it attracts
ever-riskier individuals when forced to increase premiums due to
losses
a priori variables Variables which the insurer has prior knowledge of before the policy
inception
closed-form A mathematical expression that can be well defined with a formula
expressions that has a finite number of operations
levels Different outcomes of a categorical variable
nominal A categorical variable where the categories do not have a natural
order and any numbering is arbitrary
dummy variables A variable that takes on a value of 0 or 1 to indicate the absence or
presence of a categorical characteristic
log linear form Linear regression model where the response variable is the natural
log of the expected response value
base case The categorical level chosen as the default with all dummy variable
indicators of 0
workers compensation A no-fault insurance system prescribed by state law where benefits
are provided by an employer to an employee due to a job-related
injury, including death, resulting from an accident or occupational
disease
exposure bases The unit of measurement chosen to represent the exposure for a
particular risk
offset Natural log of the exposure amount that is added to a regression
model to account for varying exposures
tariff A table or list that contains the rating factors and associated
premiums and other risk information
in-force times The timeframe during which a policy is active and the insurer is
bound by the contractual obligation
rate parameter Parameter in certain distributions, such as the exponential, that
indicate how quickly the function decays, and it is the reciprocal of
the scale parameter
functional forms The algebraic relationship between a dependent variable and
explanatory variables
multiplicative form Relationship where the dependent variable is a product of the
explanatory variables
base tariff cell The chosen set of rating categories where the rate equals the
intercept of the model (the base value)
relativities A numerical estimate of value in one category relative to the value
in a base classification, typically expressed as a factor
non-automobile vehicles Motorized vehicles which are not autos, such as atvs, off-road vehicles, go-carts, etc.
distributional The manner in which a statistical distribution is parameterized
structure
information matrix Matrix that measures the amount of information that an observable
random variable x carries about an unknown parameter of a
distribution, and is used to calculate covariance matrices of
maximum likelihood estimators
classification rating A rating plan that uses an insured’s risk characteristics to determine
plan premium
credibility weight The weight assigned to an insured’s historical loss experience for the
purposes of determining their premium in an experience rating plan
complement of The remainder of the weight not assigned to an insured’s historical
credibility loss experience in the experience rating plan
class rate Average rate per exposure for an insured in a particular
classification group
full credibility The threshold of experience necessary to assign 100% credibility to
standard the insured’s own experience
limited fluctuation A credibility method that attempts to limit fluctuations in its
credibility estimates
cumulative Cumulative density function for the normal distribution with mean
distribution function 0 and standard deviation 1
of the standard normal
buhlmann credibility A credibility method that uses the amount of experience, expected
value of the process variance, and variance of the hypothetical
means to determine the credibility weight
collective mean The mean estimate of a risk when no loss information about the risk
is known
law of total The expected value of the conditional expected value of x given y is
expectation the same as the expected value of x
risk parameter Parameter in a distribution whose value reflects the risk
categorization
expected value of the Average of the natural variability of observations from within each
process variance risk
variance of the Variance of the means across different classes, used to determine
hypothetical means how similar or different the classes are from one another
buhlmann-straub An extension of the buhlmann credibility model that allows for
credibility varying exposure by year
bayes theorem A probability law that expresses conditional probability of the event
a given the event b in terms of the conditional probability of the
event b given the event a and the unconditional probability of a
bayesian inference A branch of statistics that leverages bayes theorem to update the
distribution as more experience becomes available

gamma-poisson model A statistical model that assumes the frequency of claims is poisson
whose mean has a prior distribution that is a gamma distribution
exact credibility A situation where the bayesian credibility estimate matches that of
the buhlmann credibility estimate
beta-binomial model A statistical model for modeling the probability of an event using
the binomial distribution with a probability that has a prior
distribution from a beta distribution
nonparametric Statistical method that allows the functional form of a fit from data
estimation to have no assumed prior distribution, constraints, or parameters
empirical bayes Credibility methods that estimate the credibility weight without
methods using any assumptions about prior distributions or likelihoods,
instead relying only on empirical data
semiparametric Credibility method that assumes a distribution for the loss per
estimation exposure random variable and otherwise uses empirical data
portfolios A collection of contracts
insurance portfolios A collection, or aggregation, of insurance contracts
reinsurers A company that sells reinsurance
heavy tailed A rv is said to be heavy tailed if high probabilities are assigned to
large values
survival function One minus the distribution function. it gives the probability that a
rv exceeds a specific value.
coherent risk measure A risk measure that is subadditive, monotonic, has positive
homogeneity, and is translation invariant.
mean excess loss The expected value of a loss in excess of a quantity, given that the
function loss exceeds the quantity
risk measure A measure that summarizes the riskiness, or uncertainty, of a
distribution
value-at-risk A risk measure based on a quantile function
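A minimal empirical sketch in R, using simulated losses from a hypothetical lognormal distribution (all parameter values are illustrative); the last line also previews the related tail value-at-risk defined later in this glossary.

set.seed(1)
losses <- rlnorm(10000, meanlog = 7, sdlog = 1.5)  # hypothetical simulated losses
VaR95 <- quantile(losses, 0.95)                    # value-at-risk: the 0.95 quantile
TVaR95 <- mean(losses[losses > VaR95])             # average loss beyond the VaR
c(VaR95, TVaR95)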
ceding company A company that purchases reinsurance (also known as the reinsured)
excess of loss Under an excess of loss arrangement, the insurer sets a retention
level for each claim and pays claim amounts less than the level with
the reinsurer paying the excess.
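A short R illustration, assuming a hypothetical per-claim retention of 1,000 and three example claims.

retention <- 1000
claims <- c(400, 2500, 7000)
insurer_pays <- pmin(claims, retention)        # the insurer pays amounts up to the retention
reinsurer_pays <- pmax(claims - retention, 0)  # the reinsurer pays the excess of each claim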
primary insurance Insurance purchased by a non-insurer
proportional An agreement between a reinsurer and a ceding company (also
reinsurance known as the reinsured) in which the reinsurer assumes a given
percent of losses and premium
quota share A proportional treaty where the reinsurer receives a flat percent of
the premium for the book of business reinsured and pays a
percentage of losses, including allocated loss adjustment expenses.
The reinsurer may also pay the ceding company a ceding
commission which is designed to reflect the differences in
underwriting expenses incurred.
reinsured A company that purchases reinsurance (also known as the ceding
company)
retained line The amount of exposure that the reinsured retains on a given
line in a surplus share reinsurance agreement.
retention function A function that maps the insurer portfolio loss into the amount of
loss retained by the insurer.
stop-loss Under a stop-loss arrangement, the insurer sets a retention level and
pays in full total claims less than the level with the reinsurer paying
the excess.
surplus share A proportional reinsurance treaty that is common in commercial
property insurance. A surplus share treaty allows the reinsured to
limit its exposure on any one risk to a given amount (the retained
line). The reinsurer assumes a part of the risk in proportion to the
amount that the insured value exceeds the retained line, up to a
given limit (expressed as a multiple of the retained line, or number
of lines).
treaty A reinsurance contract that applies to a designated book of business
or exposures.
bonus-malus system A type of rating mechanism where insured premiums are adjusted
based on their individual loss experience history
no claim discount A type of experience rating where insureds obtain discounts on
(ncd) system future years’ premiums based on claims-free experience
hunger for bonus Phenomenon where insureds under an experience rating system are
dissuaded from filing minor claims in order to keep their no-claims
discount
takaful Co-operative system of reimbursement or repayment in case of loss
as an insurance alternative
markov chain A stochastic model (time dependent) where the probability of each
event depends only on the current state and not the historical path
transition matrix Matrix that represents all probabilities for transition from one state
to another (could be same state) for a markov chain
stationary distribution Probability distribution that remains unchanged in the markov chain as
time progresses
ergodic Irreducible markov chain where it is eventually possible to move
from any state to any other state, with positive probability
irreversible A markov chain where there does not exist a probability distribution
that allows for the chain to be walked backwards in time
eigenvector A non-zero vector that changes by only a scalar factor when a given
linear transformation is applied to it
n-step transition Probability of ending in a state j after n periods, starting in state i,
probability where i and j can be the same state
convergence rate After n transitions, the total (summed over states) deviation between the
probability of being in each state and the stationary probability
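The short R sketch below, built on a hypothetical two-state transition matrix, ties several of the preceding entries together: matrix powers give n-step transition probabilities, and the rescaled left eigenvector associated with eigenvalue 1 gives the stationary distribution.

P <- matrix(c(0.90, 0.10,
              0.25, 0.75), nrow = 2, byrow = TRUE)  # hypothetical transition matrix
P %*% P                                             # two-step transition probabilities
ev <- eigen(t(P))                                   # eigen-decomposition of the transpose
pi_stat <- Re(ev$vectors[, 1])                      # eigenvector for eigenvalue 1
pi_stat <- pi_stat / sum(pi_stat)                   # rescale so the probabilities sum to 1
pi_stat                                             # stationary distribution, roughly (0.714, 0.286)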
poisson regression Type of regression model used for fitting data with an integral
model (count) response variable with mean equal to the variance
negative binomial Type of regression model used for fitting data with an integral
regression model (count) response variable and can account for variance greater than
the mean
overdispersion Phenomenon where the variance of data is larger than what is
modeled
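A minimal R sketch of a count regression with an exposure offset; the variable names and parameter values are purely illustrative. Refitting the same formula as a negative binomial model (for example, with MASS::glm.nb) relaxes the mean-equals-variance restriction and so accommodates overdispersion.

set.seed(1)
n <- 500
dat <- data.frame(age = factor(sample(c("young", "old"), n, replace = TRUE)),
                  exposure = runif(n, 0.5, 1))
dat$counts <- rpois(n, lambda = dat$exposure * ifelse(dat$age == "young", 0.30, 0.10))
fit_pois <- glm(counts ~ age + offset(log(exposure)), family = poisson, data = dat)
summary(fit_pois)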
cross-classified rating Table that combines the effects of multiple rating classifications
classes
structured data Data that can be organized into a repository format, typically a
database
unstructured data Data that is not in a predefined format, most notably text, audio,
and visual content
qualitative data Data which is non numerical in nature
quantitative data Data which is numerical in nature
ordinal data Data field with a natural ordering
interval data Continuous data which is broken into interval bands with a natural
ordering
key-value databases Data storage method that stores and finds records using a unique
key hash
column-oriented Data storage method that stores records by column instead of by
databases row
document databases Data storage method that uses the document metadata for search
and retrieval, also known as semi-structured data
data decay Corruption of data due to hardware failure in the storage device
reverification Manual process of checking the integrity of data
data element analysis Analysis of the format and definition of each field
structural analysis Statistical analysis of the structured data present to detect
irregularities
robust Statistics which are relatively unaffected by outliers or small departures
from model assumptions
exploratory data Approach to analyzing data sets to summarize their main
analysis characteristics, using visual methods, descriptive statistics,
clustering, dimension reduction
confirmatory data Process used to challenge assumptions about the data through
analysis hypothesis tests, significance testing, model estimation, prediction,
confidence intervals, and inference
supervised learning Model that predicts a response target variable using explanatory
methods predictors as input
unsupervised learning Models that work with explanatory variables only to describe
methods patterns or groupings
classification methods Supervised learning method where the response is a categorical
variable
regression methods Supervised learning method where the response is a continuous
variable
model flexibility A measure of model complexity, typically based on the number of
estimated parameters
explanatory modeling Process where the modeling goal is to identify variables with
meaningful and statistically significant relationships and test
hypotheses
predictive modeling Process where the modeling goal is to predict new observations
data modeling Assumes the observed data are generated by a stochastic data model
algorithmic modeling Treats the data generating mechanism as unknown and relies on algorithmic models
predictive accuracy Quantitative measure of how well the explanatory variables predict
the response outcome
scripts A program or sequence of instructions that is executed by another
program
reproducible analysis Modeling practice where data, code, analyses are published together
in a manner so that others may verify the findings
literate programming Coding practice where documentation and code are written together
data ownership Governance process that details legal ownership of enterprise-wide
data and outlines who has ability to create, edit, modify, share and
restrict access to the data
machine learning Study of algorithms and statistical models that perform a specific
task without using explicit instructions, relying on patterns and
inference
pattern recognition Automated recognition of patterns and regularities in data
data mining Process of collecting, cleaning, processing, analyzing, and
discovering patterns and useful insights from large data sets
principal component Dimension reduction technique that uses orthogonal transformations
analysis to convert a set of possibly correlated variables into a set of linearly
uncorrelated variables
cluster analysis Unsupervised learning method that aims to split data into
homogeneous groups using a similarity measure
k-means algorithm Type of clustering that aims to partition data into k mutually
exclusive clusters by assigning observations to the cluster with the
nearest centroid
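A compact R sketch of these two unsupervised methods on a built-in data set; the choice of three clusters is illustrative only.

X <- scale(iris[, 1:4])                    # standardize the numeric columns
pc <- prcomp(X)                            # principal component analysis
km <- kmeans(X, centers = 3, nstart = 25)  # k-means with k = 3 clusters
table(km$cluster, iris$Species)            # compare the clusters with the known species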
linear regression Supervised model that uses a linear function to approximate the
models relationship between the target and explanatory variables
generalized linear Supervised model that generalizes linear regression by allowing the
model linear component to be related to the response variable via a link
function and by allowing the variance of each measurement to be a
function of its predicted value
systematic component The linear combination of explanatory variables component in a glm
link function Function that relates the linear predictor component to the
mean of the target variable
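As a sketch of how the link function enters a glm fit in R, using hypothetical gamma-distributed severities and a log link (all names and parameter values are illustrative):

set.seed(1)
sev <- data.frame(region = factor(rep(c("A", "B"), each = 100)))
sev$loss <- rgamma(200, shape = 2, scale = ifelse(sev$region == "A", 500, 800))
fit <- glm(loss ~ region, family = Gamma(link = "log"), data = sev)
exp(coef(fit))  # with a log link, exponentiated coefficients act as multiplicative relativities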
decision trees Modeling technique that uses a tree-like model of decisions to divide
the sample space into non-overlapping regions to make predictions
categorical variable A variable whose values are qualitative groups and can have no
natural ordering (nominal) or an ordering (ordinal)
variables A variable is any characteristic, number, or quantity that can be
measured or counted.
interval variable An ordinal variable with the additional property that the
magnitudes of the differences between two values are meaningful
spatial data Data and information having an implicit or explicit association with
a location relative to the earth
high dimensional Data set is high dimensional when it has many variables. In many
applications, the number of variables may be larger than the sample
size.
qualitative This is a type of variable in which the measurement denotes
membership in a set of groups, or categories
nominal variable This is a type of qualitative/categorical variable which has two or
more categories without having any kind of natural order.
ordinal variable This is a type of qualitative/categorical variable which has two or
more ordered categories.
binary variable Is a special type of categorical variable where there are only two
categories.
quantitative variable A quantitative variable is a type of variable in which numerical level
is a realization from some scale so that the distance between any
two levels of the scale takes on meaning.
continuous variable A continuous variable is a quantitative variable that can take on any
value within a finite interval.
policyholder Person in actual possession of insurance policy; policy owner.
discrete variable A discrete variable is a quantitative variable that takes on only a
finite number of values in any finite interval.
count variable A count variable is a discrete variable taking values on the nonnegative
integers.
circular data In circular data, all values around the circle are equally likely.
For example, imagine the face of an analog clock.
insurers An insurance company authorized to write insurance under the laws
of any state.
multivariate A multivariate variable involves taking many measurements on a
single entity.
workers’ compensation Insurance that covers an employer’s liability for injuries, disability
or death to persons in their employment, without regard to fault, as
prescribed by state or federal workers’ compensation laws and other
statutes.
univariate Univariate analysis is the simplest form of analyzing data. “Uni”
means “one”, so in other words your data has only one variable.
missing data Missing data occur when no data value is stored for a variable in an
observation. Missing data can occur because of nonresponse: no
information is provided for one or more items or for a whole unit or
subject.
censored Censored data have unknown values beyond a bound on either end
of the number line or both. Here, the data is observed but the
values (measurements) are not known completely.
truncated Truncation occurs when values beyond a boundary are either
excluded when gathered or excluded when analyzed. An object can
be detected only if its value is greater than some number.
stochastic process Stochastic process is defined as a collection of random variables that
is indexed by some mathematical set, meaning that each random
variable of the stochastic process is uniquely associated with an
element in the set.
deductibles A deductible is a parameter specified in the contract. Typically,
losses below the deductible are paid by the policyholder whereas
losses in excess of the deductible are the insurer’s responsibility
(subject to policy limits and coinsurance).
rank based measures Measures of statistical dependence based on the rankings of two variables
odds ratio A statistic quantifying the strength of the association between two
events, a and b, which is defined as the ratio of the odds of a in the
presence of b and the odds of a in the absence of b
likelihood ratio test A statistical test of the goodness-of-fit between two models
pearson correlation A measure of the linear correlation between two variables
product-moment Pearson correlation, a measure of the linear correlation between two
(pearson) correlation variables
kendall’s tau A statistic used to measure the ordinal association between two
measured quantities
concordant An observation pair (x,y) is said to be concordant if the observation
with a larger value of x also has the larger value of y
discordant An observation pair (x,y) is said to be discordant if the observation
with a larger value of x has the smaller value of y
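All three association measures discussed in the neighboring entries are available through a single base R function; the small vectors below are illustrative only.

x <- c(1.2, 2.0, 3.1, 4.5, 5.0)
y <- c(1.0, 1.9, 2.5, 8.0, 9.5)
cor(x, y, method = "pearson")   # linear (product-moment) correlation
cor(x, y, method = "kendall")   # based on concordant and discordant pairs
cor(x, y, method = "spearman")  # rank correlation (spearman's rho, defined below)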
pearson chi-square A statistical test applied to sets of categorical data to evaluate how
statistic likely it is that any observed difference between the sets arose by
chance
tetrachoric correlation A technique for estimating the correlation between two theorised
normally distributed continuous latent variables, from two observed
binary variables
polychoric correlation A technique for estimating the correlation between two theorised
normally distributed continuous latent variables, from two observed
ordinal variables
polyserial correlation The correlation between two continuous variables with a bivariate
normal distribution, where one variable is observed directly, and the
other is unobserved
biserial correlation A correlation coefficient used when one variable is dichotomous
normal score Transformed data which closely resemble a standard normal
distribution
copula A multivariate distribution function with uniform marginals
spearman’s rho A nonparametric measure of rank correlation
marginal distributions The probability distribution of the variables contained in a subset
of a collection of random variables
fat-tailed A fat-tailed distribution is a probability distribution that exhibits a
large skewness or kurtosis, relative to that of either a normal
distribution or an exponential distribution
probability integral Any continuous variable can be mapped to a uniform random
transformation variable via its distribution function
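A base R sketch of how the probability integral transformation and a copula fit together: correlated normals are mapped to uniform marginals and then to arbitrary marginals. The gaussian dependence structure, the gamma and lognormal marginals, and every parameter value are hypothetical.

set.seed(1)
rho <- 0.6
z1 <- rnorm(1000)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(1000)  # correlated standard normals
u1 <- pnorm(z1); u2 <- pnorm(z2)                # probability integral transformation: uniform marginals
x1 <- qgamma(u1, shape = 2, scale = 500)        # hypothetical gamma marginal
x2 <- qlnorm(u2, meanlog = 6, sdlog = 1)        # hypothetical lognormal marginal
cor(x1, x2, method = "spearman")                # rank correlation is unchanged by the marginal transforms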
elliptical copulas The copulas of elliptical distributions
correlation matrix A table showing correlation coefficients between variables
elliptical distributions Any member of a broad family of probability distributions that
generalize the multivariate normal distribution
tail dependency A measure of the comovement of two random variables in the tails of their distributions
frechet-hoeffding Bounds of multivariate distribution functions
bounds
blomqvist’s beta A dependence measure based on the center of the distribution
reinsurance Insurance purchased by an insurer
deductible A deductible is a parameter specified in the contract. Typically,
losses below the deductible are paid by the policyholder whereas
losses in excess of the deductible are the insurer’s responsibility
(subject to policy limits and coinsurance).
coinsurance Coinsurance is an arrangement whereby the insured and insurer
share the covered losses. Typically, a specified coinsurance parameter
means that both parties receive a proportional share, e.g., 50%, of
the loss.
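A small R illustration of how these contract features combine, under one common convention in which the coinsurance share is applied to the post-deductible, limited layer; the numbers are hypothetical.

d <- 500; u <- 10000; alpha <- 0.80  # deductible, maximum covered loss, coinsurance share
loss <- c(200, 1500, 25000)
insurer_pays <- alpha * pmin(pmax(loss - d, 0), u - d)
insurer_pays                         # 0, 800, 7600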
pure premium Pure premium is the total severity divided by the number of claims.
It does not include insurance company expenses, premium taxes,
contingencies, nor an allowance for profits. Also called loss costs.
Some definitions include allocated loss adjustment expenses (alae).
standard deviation The square-root of variance
variance Second central moment of a random variable x, measuring the
expected squared deviation between the variable and its mean
aggregate claims The sum of all claims observed in a period of time
median 50th percentile of a distribution, or middle value where half of the
distribution lies below
lorenz curve A graph of the proportion of a population on the horizontal axis and
a distribution function of interest on the vertical axis.
law of total variance A decomposition of the variance of a random variable into
conditional components. Specifically, for random variables x and y
on the same probability space, var(x) = e[var(x|y)] + var[e(x|y)].
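As a standard illustration of this decomposition: if a claim count n given a random mean lambda is poisson(lambda), then var(n) = e[var(n|lambda)] + var[e(n|lambda)] = e(lambda) + var(lambda), so a mixed poisson count always has variance at least as large as its mean.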
tail value-at-risk The expected value of a risk given that the risk exceeds a
value-at-risk
coefficient of variation Standard deviation divided by the mean of a distribution, to
measure variability in terms of units of the mean
loss ratio The sum of losses divided by the premium.
homogeneous risks Risks that have the same distribution, that is, the distributions are
identical.
heterogeneous Heterogeneous risks have different distributions. Often, we can
attribute differences to varying exposures or risk factors.
exposure A type of rating variable that is so important that premiums and
losses are often quoted on a “per exposure” basis. That is, premiums
and losses are commonly standardized by exposure variables.
loss The amount of damages sustained by an individual or corporation,
typically as the result of an insurable event.
iid Independent and identically distributed
pdf Probability density function
aic Akaike’s information criterion
bic Bayesian information criterion
pmf Probability mass function
mcmc Markov Chain Monte Carlo
cdf Cumulative distribution function
df Degrees of freedom
glm Generalized linear model
mle Maximum likelihood estimate
ols Ordinary least squares
pf Probability function
rv Random variable
reporting delay The time that elapses between the occurrence of the insured event
and the reporting of this event to the insurance company.
settlement delay The time between reporting and settlement of a claim.
rbns Reported, But is Not fully Settled
ibnr Incurred in the past But is Not yet Reported. For such a claim the
insured event took place, but the insurance company is not yet
aware of the associated claim.
granular Describing data recorded at a detailed level, e.g., individual claims
rather than aggregated totals
case estimates The claims handler’s expert estimate of the outstanding amount on a
claim.
.csv Comma separated value file
.txt Text file
run-off triangle Triangular display of loss reserve data. Accident or occurrence
periods on one axis (often vertical) with development periods on the
other (often horizontal). Also known as a development triangle.
development triangle Triangular display of loss reserve data. Accident or occurrence
periods on one axis (often vertical) with development periods on the
other (often horizontal). Also known as a run-off triangle.
msep Mean Squared Error of Prediction
chain-ladder method An algorithm for predicting incomplete losses to their ultimate
cumulative value. The name refers to the chaining of a sequence of
(year-to-year development) factors into a ladder of factors.
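A minimal base R sketch of the algorithm on a tiny hypothetical cumulative triangle (three accident years, three development years); packages such as ChainLadder, cited in the bibliography, automate this on real data.

tri <- matrix(c(100, 150, 165,
                110, 160,  NA,
                120,  NA,  NA), nrow = 3, byrow = TRUE)  # rows: accident years; columns: development years
f <- sapply(1:2, function(j) {                           # volume-weighted development factors
  ok <- !is.na(tri[, j + 1])
  sum(tri[ok, j + 1]) / sum(tri[ok, j])
})
tri[2, 3] <- tri[2, 2] * f[2]                            # complete the second accident year
tri[3, 2:3] <- tri[3, 1] * cumprod(f)                    # chain the factors for the most recent year
tri                                                      # projected cumulative losses to ultimate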
wls Weighted least squares
glm Generalized linear model
Bibliography

Aalen, Odd (1978). “Nonparametric inference for a family of counting processes,” The Annals of Statistics, Vol. 6, pp. 701–726.
Abadir, Karim and Jan Magnus (2002). “Notation in econometrics: a proposal
for a standard,” The Econometrics Journal, Vol. 5, pp. 76–90.
Abbott, Dean (2014). Applied Predictive Analytics: Principles and Techniques
for the Professional Data Analyst, Hoboken, NJ. Wiley.
Abdullah, Mohammad F. and Kamsuriah Ahmad (2013). “The mapping process
of unstructured data to structured data,” in 2013 International Conference
on Research and Innovation in Information Systems (ICRIIS), pp. 151–155.
Actuarial Standards Board (2018). “Actuarial Standards of Practice,” American
Academy of Actuaries, URL: http://www.actuarialstandardsboard.org/stan
dards-of-practice/, [Retrieved on Oct 3, 2018].
Aggarwal, Charu C. (2015). Data Mining: The Textbook, New York, NY.
Springer.
Agresti, Alan (1996). An Introduction to Categorical Data Analysis. Wiley New
York.
Albers, Michael J. (2017). Introduction to Quantitative Data Analysis in the
Behavioral and Social Sciences, Hoboken, NJ. John Wiley & Sons, Inc.
Antonio, K. and R. Plat (2014). “Micro–level stochastic loss reserving for general
insurance,” Scandinavian Actuarial Journal, Vol. 7, pp. 649–669.
Bahnemann, David (2015). Distributions for Actuaries, No. 2, URL: https:
//www.casact.org/pubs/monographs/papers/02-Bahnemann.pdf.
Bailey, Robert A. and J. Simon LeRoy (1960). “Two studies in automobile
ratemaking,” Proceedings of the Casualty Actuarial Society Casualty Actuarial
Society, Vol. XLVII.
Bandyopadhyay, Prasanta S. and Malcolm R. Forster eds. (2011). Philosophy of
Statistics, Handbook of the Philosophy of Science 7. North Holland.

Bauer, Daniel, Richard D. Phillips, and George H. Zanjani (2013). “Financial pricing of insurance,” in Handbook of Insurance. Springer, pp. 627–645.
Billingsley, Patrick (2008). Probability and measure. John Wiley & Sons.
Bishop, Christopher M. (2007). Pattern Recognition and Machine Learning, New
York, NY. Springer.
Bishop, Yvonne M., Stephen E. Fienberg, and Paul W. Holland (1975). Discrete
Multivariate Analysis: Theory and Practice. Cambridge [etc.]: MIT.
Blomqvist, Nils (1950). “On a measure of dependence between two random
variables,” The Annals of Mathematical Statistics, pp. 593–600.
Bluman, Allan (2012). Elementary Statistics: A Step By Step Approach, New
York, NY. McGraw-Hill.
Boehm, C, J Engelfriet, M Helbig, A IM Kool, P Leepin, E Neuburger, and
AD Wilkie (1975). “Thoughts on the harmonization of some proposals for
a new International actuarial notation,” Blätter der DGVFM, Vol. 12, pp.
99–129.
Bowers, Newton L., Hans U. Gerber, James C. Hickman, Donald A. Jones, and
Cecil J. Nesbitt (1986). Actuarial Mathematics. Society of Actuaries Itasca,
Ill.
Box, George E. P. (1980). “Sampling and Bayes’ inference in scientific modelling
and robustness,” Journal of the Royal Statistical Society. Series A (General),
pp. 383–430.
Breiman, Leo (2001). “Statistical Modeling: The Two Cultures,” Statistical
Science, Vol. 16, pp. 199–231.
Breiman, Leo, Jerome Friedman, Charles J. Stone, and R.A. Olshen (1984).
Classification and Regression Trees, Raton Boca, FL. Chapman and
Hall/CRC.
Bühlmann, Hans (1967). “The Complement of Credibility,” pp. 199–207.
Bühlmann, Hans (1985). “Premium calculation from top down,” ASTIN Bul-
letin: The Journal of the IAA, Vol. 15, pp. 89–101.
Bühlmann, Hans, Massimo De Felice, Alois Gisler, Franco Moriconi, and
Mario V Wüthrich (2009). “Recursive credibility formula for chain ladder
factors and the claims development result,” ASTIN Bulletin: The Journal of
the IAA, Vol. 39, pp. 275–306.
Bühlmann, Hans and Alois Gisler (2005). A Course in Credibility Theory and
its Applications. ACTEX Publications.
Buttrey, Samuel E. and Lyn R. Whitaker (2017). A Data Scientist’s Guide to
Acquiring, Cleaning, and Managing Data in R, Hoboken, NJ. Wiley.
Chen, Min, Shiwen Mao, Yin Zhang, and Victor CM Leung (2014). Big
Data: Related Technologies, Challenges and Future Prospects, New York, NY.
Springer.
Clark, David R (1996). Basics of reinsurance pricing, pp.41–43, URL: https:
//www.soa.org/files/edu/edu-2014-exam-at-study-note-basics-rein.pdf.
Clarke, Bertrand, Ernest Fokoue, and Hao Helen Zhang (2009). Principles and
theory for data mining and machine learning, New York, NY. Springer-Verlag.
Cummins, J. David and Richard A. Derrig (2012). Managing the Insolvency Risk
of Insurance Companies: Proceedings of the Second International Conference
on Insurance Solvency, Vol. 12. Springer Science & Business Media.
Dabrowska, Dorota M. (1988). “Kaplan-meier estimate on the plane,” The An-
nals of Statistics, pp. 1475–1489.
Daroczi, Gergely (2015). Mastering Data Analysis with R, Birmingham, UK.
Packt Publishing.
De Jong, Piet and Gillian Z. Heller (2008). Generalized linear models for insur-
ance data. Cambridge University Press, Cambridge.
Denuit, Michel, Xavier Maréchal, Sandra Pitrebois, and Jean-François Walhin
(2007). Actuarial modelling of claim counts: risk classification, credibility and
bonus-malus systems. John Wiley & Sons, Chichester.
Denuit, Michel, Dominik Sznajder, and Julien Trufin (2019). “Model selection
based on Lorenz and concentration curves, Gini indices and convex order,”
Insurance: Mathematics and Economics.
Derrig, Richard A, Krzysztof M Ostaszewski, and Grzegorz A Rempala (2001).
“Applications of resampling methods in actuarial practice,” in Proceedings
of the Casualty Actuarial Society, Vol. 87, pp. 322–364, Casualty Actuarial
Society.
Dickson, David C. M., Mary Hardy, and Howard R. Waters (2013). Actuarial
Mathematics for Life Contingent Risks. Cambridge University Press.
Dionne, Georges and Charles Vanasse (1989). “A generalization of automobile
insurance rating models: the negative binomial distribution with a regression
component,” ASTIN Bulletin, Vol. 19(2), pp. 199–212.
Dobson, Annette J and Adrian Barnett (2008). An Introduction to Generalized
Linear Models. CRC press.
Earnix (2013). “2013 Insurance Predictive Modeling Survey,” Earnix and Insur-
ance Services Office, Inc. URL: https://www.verisk.com/archived/2013/m
ajority-of-north-american-insurance-companies-use-predictive-analytics-to-
enhance-business-performance-new-earnix-iso-survey-shows/, [Retrieved on
July 23, 2020].
England, P. and R. Verrall (2002). “Stochastic claims reserving in general insurance,” British Actuarial Journal, Vol. 8/3, pp. 443–518.
Faraway, Julian J (2016). Extending the Linear Model with R: Generalized Lin-
ear, Mixed Effects and Nonparametric Regression Models, Vol. 124. CRC press.
Fechner, G. T (1897). “Kollektivmasslehre,” Wilhelm Englemann, Leipzig.
Finger, Robert J. (2006). “Risk classification,”, pp. 231–276.
Forte, Rui Miguel (2015). Mastering Predictive Analytics with R, Birmingham,
UK. Packt Publishing.
Frank, Maurice J (1979). “On the simultaneous associativity of F(x, y) and
x+y-F(x, y),” Aequationes mathematicae, Vol. 19, pp. 194–226.
Frees, Edward W. (2009). Regression Modeling with Actuarial and Financial
Applications. Cambridge University Press.
Frees, Edward W (2014). “Frequency and severity models,” in Edward W Frees,
Glenn Meyers, and Richard Derrig eds. Predictive Modeling Applications in
Actuarial Science, Vol. 1, pp. 138–164. Cambridge University Press Cam-
bridge.
Frees, Edward W and Fei Huang (2020). “The Discriminating (Pricing) Actuary,”
URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3592475,
[Retrieved on August 3, 2020].
Frees, Edward W, Gee Lee, and Lu Yang (2016). “Multivariate frequency-
severity regression models in insurance,” Risks, Vol. 4, p. 4.
Frees, Edward W., Glenn Meyers, and A. David Cummings (2011). “Summariz-
ing insurance scores using a Gini index,” Journal of the American Statistical
Association, Vol. 106, pp. 1085–1098.
(2014). “Insurance ratemaking and a Gini index,” Journal of Risk and
Insurance, Vol. 81, pp. 335–366.
Frees, Edward W. and Emiliano A. Valdez (1998). “Understanding relationships
using copulas,” North American Actuarial Journal, Vol. 2, pp. 1–25.
(2008). “Hierarchical insurance claims modeling,” Journal of the Amer-
ican Statistical Association, Vol. 103, pp. 1457–1469.
Friedland, Jacqueline (2013). Fundamentals of General Insurance Actuarial
Analysis. Society of Actuaries.
Gan, Guojun (2011). Data Clustering in C++: An Object-Oriented Approach,
Data Mining and Knowledge Discovery Series, Boca Raton, FL, USA. Chap-
man & Hall/CRC Press, DOI: http://dx.doi.org/10.1201/b10814.
Gan, Guojun, Chaoqun Ma, and Jianhong Wu (2007). Data Clustering: Theory,
Algorithms, and Applications, Philadelphia, PA. SIAM Press, DOI: http://dx
.doi.org/10.1137/1.9780898718348.
Gelman, Andrew (2004). “Exploratory Data Analysis for Complex Models,” Journal of Computational and Graphical Statistics, Vol. 13, pp. 755–779.
Genest, Christian and Josh Mackay (1986). “The joy of copulas: Bivariate dis-
tributions with uniform marginals,” The American Statistician, Vol. 40, pp.
280–283.
Genest, Christian and Johanna Nešlehová (2007). “A primer on copulas for count
data,” Journal of the Royal Statistical Society, pp. 475–515.
Gerber, Hans U (1979). An Introduction to Mathematical Risk Theory, vol.
8 of SS Heubner Foundation Monograph Series. University of Pennsylvania
Wharton School SS Huebner Foundation for Insurance Education.
Gesmann, Markus, Daniel Murphy, Yanwei Zhang, Alessandro Carrato, Mario
Wuthrich, Fabio Concina, and Eric Dal Moro (2019). ChainLadder: Statistical
Methods and Models for Claims Reserving in General Insurance, URL: https:
//CRAN.R-project.org/package=ChainLadder, R package version 0.2.10.
Gisler, Alois (2006). “The estimation error in the chain-ladder reserving method:
a Bayesian approach,” ASTIN Bulletin: The Journal of the IAA, Vol. 36, pp.
554–565.
Gisler, Alois and Mario V Wüthrich (2008). “Credibility for the chain ladder
reserving method,” ASTIN Bulletin: The Journal of the IAA, Vol. 38, pp.
565–600.
Good, I. J. (1983). “The Philosophy of Exploratory Data Analysis,” Philosophy
of Science, Vol. 50, pp. 283–295.
Gorman, Mark and Stephen Swenson (2013). “Building believers: How to ex-
pand the use of predictive analytics in claims,” SAS, URL: https://www.the-
digital-insurer.com/wp-content/uploads/2014/10/265-wp-59831.pdf, [Re-
trieved on July 23, 2020].
Greenwood, Major (1926). “The errors of sampling of the survivorship tables,”
in Reports on Public Health and Statistical Subjects, Vol. 33. London: Her
Majesty’s Stationary Office.
Halperin, Max, Herman O Hartley, and Paul G Hoel (1965). “Recommended
standards for statistical symbols and notation: Copss Committee on Symbols
and Notation,” The American Statistician, Vol. 19, pp. 12–14.
Hardy, Mary R. (2006). An introduction to risk measures for actuarial applica-
tions. Society of Actuaries, URL: https://www.soa.org/globalassets/assets/fi
les/edu/c-25-07.pdf, [Retrieved on August 6, 2020].
Hartman, Brian (2016). “Bayesian Computational Methods,” Predictive Model-
ing Applications in Actuarial Science: Volume 2, Case Studies in Insurance.
Hashem, Ibrahim Abaker Targio, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Mokhtar, Abdullah Gani, and Samee Ullah Khan (2015). “The rise of “big data” on cloud computing: Review and open research issues,” Information Systems, Vol. 47, pp. 98–115.
Hettmansperger, T. P. (1984). Statistical Inference Based on Ranks. Wiley.
Hofert, Marius, Ivan Kojadinovic, Martin Mächler, and Jun Yan (2018). Ele-
ments of Copula Modeling with R. Springer.
Hogg, Robert V, Elliot A Tanis, and Dale L Zimmerman (2015). Probability and
Statistical Inference, 9th Edition. Pearson, New York.
Hougaard, P (2000). Analysis of Multivariate Survival Data. Springer New York.
Hox, Joop J. and Hennie R. Boeije (2005). “Data collection, primary versus
secondary,” in Encyclopedia of social measurement. Elsevier, pp. 593 – 599.
Igual, Laura and Santi Segu (2017). Introduction to Data Science. A Python
Approach to Concepts, Techniques and Applications, New York, NY. Springer.
Inmon, W.H. and Dan Linstedt (2014). Data Architecture: A Primer for the
Data Scientist: Big Data, Data Warehouse and Data Vault, Cambridge, MA.
Morgan Kaufmann.
Insurance Information Institute (2016). “International Insurance Fact Book,”
Insurance Information Institute, URL: http://www.iii.org/sites/default/file
s/docs/pdf/international_insurance_factbook_2016.pdf, [Retrieved on Sept
9, 2018].
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani (2013).
An introduction to statistical learning, Vol. 112. Springer.
Janert, Philipp K. (2010). Data Analysis with Open Source Tools, Sebastopol,
CA. O’Reilly Media.
Joe, Harry (2014). Dependence Modeling with Copulas. CRC Press.
Judd, Charles M., Gary H. McClelland, and Carey S. Ryan (2017). Data Analy-
sis. A Model Comparison Approach to Regression, ANOVA and beyond, New
York, NY. Routledge, 3rd edition.
Kaas, Rob, Marc Goovaerts, Jan Dhaene, and Michel Denuit (2008). Modern
actuarial risk theory: using R, Vol. 128. Springer Science & Business Media.
Kaplan, Edward L. and Paul Meier (1958). “Nonparametric estimation from
incomplete observations,” Journal of the American statistical association, Vol.
53, pp. 457–481.
Kendall, M. G (1945). “The treatment of ties in ranking problems,” Biometrika,
Vol. 33(3), pp. 239–251.
Kendall, Maurice G (1938). “A new measure of rank correlation,” Biometrika,
pp. 81–93.
Klugman, Stuart A., Harry H. Panjer, and Gordon E. Willmot (2012). Loss
Models: From Data to Decisions. John Wiley & Sons.
Kreer, Markus, Ayşe Kızılersü, Anthony W Thomas, and Alfredo D Egídio dos
Reis (2015). “Goodness-of-fit tests and applications for left-truncated Weibull
distributions to non-life insurance,” European Actuarial Journal, Vol. 5, pp.
139–163.
Kremer, Erhard (1982). “IBNR-claims and the two-way model of ANOVA,”
Scandinavian Actuarial Journal, Vol. 1982, pp. 47–55.
(1984). “A class of autoregressive models for predicting the final claims
amount,” Insurance: Mathematics and Economics, Vol. 3, pp. 111–119.
Kubat, Miroslav (2017). An Introduction to Machine Learning, New York, NY.
Springer, 2nd edition.
Lee Rodgers, J and W. A Nicewander (1988). “Thirteen ways to look at the
correlation coefficient,” The American Statistician, Vol. 42, pp. 59–66.
Lemaire, Jean (1998). “Bonus-malus systems: the European and Asian approach
to merit rating,” North American Actuarial Journal, Vol. 2(1), pp. 26–38.
Lemaire, Jean and Hongmin Zi (1994). “A comparative analysis of 30 bonus-
malus systems,” ASTIN Bulletin, Vol. 24(2), pp. 287–309.
Levin, Bruce, James Reeds et al. (1977). “Compound multinomial likelihood
functions are unimodal: Proof of a conjecture of IJ Good,” The Annals of
Statistics, Vol. 5, pp. 79–87.
Lorenz, Max O. (1905). “Methods of measuring the concentration of wealth,”
Publications of the American statistical association, Vol. 9, pp. 209–219.
Mack, Thomas (1991). “A simple parametric model for rating automobile insur-
ance or estimating IBNR claims reserves,” ASTIN Bulletin: The Journal of
the IAA, Vol. 21, pp. 93–109.
(1993). “Distribution-free calculation of the standard error of chain
ladder reserve estimates,” ASTIN Bulletin: The Journal of the IAA, Vol. 23,
pp. 213–225.
Mack, Thomas and Gary Venter (2000). “A comparison of stochastic models
that reproduce chain ladder reserve estimates,” Insurance: mathematics and
economics, Vol. 26, pp. 101–107.
Mailund, Thomas (2017). Beginning Data Science in R: Data Analysis, Visual-
ization, and Modelling for the Data Scientist. Apress.
McCullagh, Peter and John A. Nelder (1989). Generalized Linear Models, Sec-
ond Edition, Chapman and Hall/CRC Monographs on Statistics and Applied
Probability Series. Chapman & Hall, London.
McDonald, James B (1984). “Some generalized functions for the size distribution
of income,” Econometrica: journal of the Econometric Society, pp. 647–663.
McDonald, James B and Yexiao J Xu (1995). “A generalization of the beta
distribution with applications,” Journal of Econometrics, Vol. 66, pp. 133–
152.
Miles, Matthew, Michael Hberman, and Johnny Sdana (2014). Qualitative Data
Analysis: A Methods Sourcebook, Thousand Oaks, CA. Sage, 3rd edition.
Mirkin, Boris (2011). Core Concepts in Data Analysis: Summarization, Corre-
lation and Visualization, London, UK. Springer.
Mitchell, Tom M. (1997). Machine Learning. McGraw-Hill.
Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar (2012). Founda-
tions of Machine Learning, Cambridge, MA. MIT Press.
NAIC Glossary (2018). “Glossary of Insurance Terms,” National Association of
Insurance Commissioners, URL: https://www.naic.org/consumer_glossary.
htm, [Retrieved on Sept 11, 2018].
Nelsen, Roger B. (1997). An Introduction to Copulas. Lecture Notes in Statistics
139.
Niehaus, Gregory and Scott Harrington (2003). Risk Management and Insur-
ance, New York. McGraw Hill.
Norberg, Ragnar (1976). “A credibility theory for automobile bonus system,”
Scandinavian Actuarial Journal, Vol. 2, pp. 92–107.
Ohlsson, Esbjörn and Björn Johansson (2010). Non-life Insurance Pricing with
Generalized Linear Models, Vol. 21. Springer.
O’Leary, D. E. (2013). “Artificial Intelligence and Big Data,” IEEE Intelligent
Systems, Vol. 28, pp. 96–99.
Olkin, Ingram, A John Petkau, and James V Zidek (1981). “A comparison of n
estimators for the binomial distribution,” Journal of the American Statistical
Association, Vol. 76, pp. 637–642.
Olson, Jack E. (2003). Data Quality: The Accuracy Dimension, San Francisco,
CA. Morgan Kaufmann.
Picard, Richard R. and Kenneth N. Berk (1990). “Data splitting,” The American
Statistician, Vol. 44, pp. 140–147.
Pitrebois, Sandra, Michel Denuit, and Jean-François Walhin (2003). “Setting
a bonus-malus scale in the presence of other rating factors: Taylor’s work
revisited,” ASTIN Bulletin, Vol. 33(2), pp. 419–436.
Pries, Kim H. and Robert Dunnigan (2015). Big Data Analytics: A Practical
Guide for Managers, Boca Raton, FL. CRC Press.
Renshaw, A. and R. Verrall (1998). “A stochastic model underlying the chain-ladder technique,” British Actuarial Journal, Vol. 4/4, pp. 903–923.
Renshaw, Arthur E (1989). “Chain ladder and interactive modelling.(Claims
reserving and GLIM),” Journal of the Institute of Actuaries, Vol. 116, pp.
559–587.
Samuel, A. L. (1959). “Some Studies in Machine Learning Using the Game of
Checkers,” IBM Journal of Research and Development, Vol. 3, pp. 210–229.
Schweizer, Berthold, Edward F Wolff et al. (1981). “On nonparametric measures
of dependence for random variables,” The Annals of Statistics, Vol. 9, pp. 879–
885.
Shmueli, Galit (2010). “To Explain or to Predict?” Statistical Science, Vol. 25,
pp. 289–310.
Sklar, M (1959). “Fonctions de repartition a N dimensions et leurs marges,”
Publ. inst. statist. univ. Paris, Vol. 8, pp. 229–231.
Snee, Ronald D. (1977). “Validation of regression models: methods and exam-
ples,” Technometrics, Vol. 19, pp. 415–428.
Spearman, C (1904). “The proof and measurement of association between two
things,” The American Journal of Psychology, Vol. 15, pp. 72–101.
Tan, Chong It (2016). “Optimal design of a bonus-malus system: linear relativ-
ities revisited,” Annals of Actuarial Science, Vol. 10(1), pp. 52–64.
Tan, Chong It, Jackie Li, Johnny Siu-Hang Li, and Uditha Balasooriya (2015).
“Optimal relativities and transition rules of a bonus-malus system,” Insurance:
Mathematics and Economics, Vol. 61, pp. 255–263.
Taylor, G. (2000). Loss reserving: an actuarial perspective. Kluwer Academic
Publishers.
Taylor, Gregory Clive (1986). Claims reserving in non-life insurance. North
Holland.
Tevet, Dan (2016). “Applying Generalized Linear Models to Insurance Data,”
Predictive Modeling Applications in Actuarial Science: Volume 2, Case Stud-
ies in Insurance, p. 39.
Tse, Yiu-Kuen (2009). Nonlife Actuarial Models: Theory, Methods and Evalua-
tion. Cambridge University Press.
Tukey, John W. (1962). “The Future of Data Analysis,” The Annals of Mathe-
matical Statistics, Vol. 33, pp. 1–67.
Venter, Gary (1983). “Transformed beta and gamma distributions and aggregate
losses,” in Proceedings of the Casualty Actuarial Society, Vol. 70, pp. 289–308.
Venter, Gary G. (2002). “Tails of copulas,” in Proceedings of the Casualty Ac-
tuarial Society, Vol. 89, pp. 68–113.
Venter, Gary G (2006). “Discussion of the mean square error of prediction in the
chain ladder reserving method,” ASTIN Bulletin: The Journal of the IAA,
Vol. 36, pp. 566–571.
Werner, Geoff and Claudine Modlin (2016). Basic Ratemaking, Fifth Edition.
Casualty Actuarial Society, URL: https://www.casact.org/library/studynote
s/werner_modlin_ratemaking.pdf, [Retrieved on April 1, 2019].
Wüthrich, Mario V. and Michael Merz (2008). Stochastic claims reserving meth-
ods in insurance, Vol. 435 of Wiley Finance. John Wiley & Sons.
(2015). Stochastic Claims Reserving Manual: Advances in Dynamic
Modeling. SSRN.
Young, Virginia R (2014). “Premium principles,” Wiley StatsRef: Statistics
Reference Online.
Yule, G. Udny (1900). “On the association of attributes in statistics: with il-
lustrations from the material of the childhood society,” Philosophical Trans-
actions of the Royal Society of London. Series A, Containing Papers of a
Mathematical or Physical Character, pp. 257–319.
(1912). “On the methods of measuring association between two at-
tributes,” Journal of the Royal Statistical Society, pp. 579–652.
