
# Fitting distributions to data and why you are probably doing it wrong

Courtesy of Vose Software

A common problem in risk analysis is fitting a probability distribution to a set of observations for a variable. One does this to be able to make forecasts about the future. The most common situation is fitting a distribution to a single variable (like the lifetime of a mechanical or electrical component), but some problems require fitting a multivariate distribution: for example, if one wishes to predict the weight and height of a random person, or the simultaneous change in price of two stocks.

There are a number of software tools on the market that will fit distributions to a data set, and most risk analysis tools incorporate a component that will do this. Unfortunately, the methods these tools use to measure goodness of fit are wrong, and they are very limited in the types of data that they can handle. This paper explains why, and describes a method that is both correct and sufficiently flexible to handle any type of data set.

## Fitting a single distribution

The principle behind fitting distributions to data is to find the type of distribution (normal, lognormal, gamma, beta, etc.) and the values of its parameters (mean, variance, etc.) that give the highest probability of producing the observed data. For example, Figure 1 shows the normal distribution with parameters that best fit a particular data set. The data were randomly generated from a normal distribution with a mean of 4 and a standard deviation of 1. The data set consists of 1026 values, many more than one usually has to work with, so the parameter estimates (4.026 and 1.038) are close to the true values.
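
The idea of choosing the parameters that give the highest probability of producing the observed data can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to produce Figure 1: the seed is arbitrary, the helper names `fit_normal_mle` and `normal_log_likelihood` are hypothetical, and the simulated values will differ from the data set described above, so the estimates will not be exactly 4.026 and 1.038.

```python
import math
import random


def fit_normal_mle(data):
    """Return the maximum likelihood estimates (mean, sd) for a normal distribution."""
    n = len(data)
    mu = sum(data) / n
    # The maximum likelihood estimate of the variance divides by n, not n - 1.
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, math.sqrt(var)


def normal_log_likelihood(data, mu, sigma):
    """Log of the joint probability density of the data under Normal(mu, sigma)."""
    n = len(data)
    sum_sq = sum((x - mu) ** 2 for x in data)
    return -n * math.log(sigma * math.sqrt(2 * math.pi)) - sum_sq / (2 * sigma ** 2)


# Assumption: an arbitrary seed, chosen only to make the sketch repeatable.
rng = random.Random(12345)

# 1026 draws from a normal distribution with mean 4 and standard deviation 1.
data = [rng.gauss(4.0, 1.0) for _ in range(1026)]

mu_hat, sigma_hat = fit_normal_mle(data)
ll_fit = normal_log_likelihood(data, mu_hat, sigma_hat)
ll_true = normal_log_likelihood(data, 4.0, 1.0)

print(f"estimated mean = {mu_hat:.3f}, estimated sd = {sigma_hat:.3f}")
print(f"log-likelihood at the estimates   = {ll_fit:.2f}")
print(f"log-likelihood at the true values = {ll_true:.2f}")
```

The log-likelihood at the fitted parameters is never lower than at the true parameters, because the fitted values are by construction the ones that maximise it for this particular sample.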

Usually, of course, we do not know that the data came from any specific type of distribution, though we can often guess at good candidates by matching the nature of the variable to the theory on which the probability distributions are based. The normal distribution, for example, is a good candidate if the random variation of the variable is driven by a large number of random factors (none of which dominates) acting additively, whereas the lognormal is a good candidate if a large number of factors influence the value of the variable multiplicatively.
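
The additive versus multiplicative argument can be checked with a small simulation. The sketch below is an assumption-laden illustration: the choice of 50 factors, each uniform on (0.5, 1.5), the number of trials, the seed, and the helper `skewness` are all hypothetical and chosen only for convenience. It compares the sample skewness of the sum of the factors, of their product, and of the logarithm of their product. A roughly symmetric (near-zero skewness) result is consistent with a normal shape, and a product whose logarithm is roughly symmetric is consistent with a lognormal shape.

```python
import math
import random


def skewness(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Assumptions: arbitrary seed, factor count, trial count and factor range.
rng = random.Random(2024)
N_FACTORS = 50
N_TRIALS = 5000

sums = []
products = []
for _ in range(N_TRIALS):
    factors = [rng.uniform(0.5, 1.5) for _ in range(N_FACTORS)]
    sums.append(sum(factors))           # factors combined additively
    products.append(math.prod(factors))  # factors combined multiplicatively

# The log of a product is a sum of logs, so it should again look roughly normal.
log_products = [math.log(p) for p in products]

skew_sum = skewness(sums)
skew_prod = skewness(products)
skew_log_prod = skewness(log_products)

print(f"skewness of sums             = {skew_sum:.3f}")
print(f"skewness of products         = {skew_prod:.3f}")
print(f"skewness of log of products  = {skew_log_prod:.3f}")
```

Skewness is only one crude indicator of shape. It is used here because it needs nothing beyond the standard library, and it does not replace a proper comparison of candidate distributions.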