Calculating Correlation and Means with Missing Data

A great deal has been written about fixing 'invalid correlation matrices' for risk management purposes. A correlation matrix is invalid when it is mathematically impossible to generate random numbers with those mutual correlations. The most common causes of this problem, as seen in finance, are that the numbers in the correlation matrix are made up, or that they are based on historical data with an incorrect treatment of missing values. This article shows how to solve the latter type of problem by explaining how to correctly estimate the correlation coefficient for datasets with missing data. The technique is called 'expectation maximization' (EM). We illustrate the method with a practical example.
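To make the notion of an invalid matrix concrete: a set of mutual correlations can only be realised by random numbers if the matrix is symmetric, has ones on the diagonal, and is positive semidefinite. The sketch below is a minimal pure-Python check under that assumption. The function name `is_valid_correlation` and both example matrices are hypothetical and are not taken from the article.

```python
import math

def is_valid_correlation(matrix, tol=1e-10):
    """Check symmetry, unit diagonal and positive semidefiniteness.

    Positive semidefiniteness is tested by attempting a Cholesky-style
    factorization; a negative pivot means no random numbers can have
    these mutual correlations.
    """
    n = len(matrix)
    for i in range(n):
        if abs(matrix[i][i] - 1.0) > tol:
            return False
        for j in range(i):
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                return False

    lower = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = matrix[j][j] - sum(lower[j][k] ** 2 for k in range(j))
        if pivot < -tol:
            return False
        if pivot > tol:
            lower[j][j] = math.sqrt(pivot)
        for i in range(j + 1, n):
            rest = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if pivot > tol:
                lower[i][j] = rest / lower[j][j]
            elif abs(rest) > tol:
                return False  # zero pivot with a nonzero column: not semidefinite
    return True

# Hypothetical matrices, for illustration only.
consistent = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
made_up = [[1.0, 0.9, -0.9], [0.9, 1.0, 0.9], [-0.9, 0.9, 1.0]]
print(is_valid_correlation(consistent))  # True
print(is_valid_correlation(made_up))     # False
```

The second matrix claims that A and B move together, B and C move together, yet A and C move in opposite directions, which is the kind of inconsistency that made-up numbers can produce.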
The first step: Finding initial guesses

Suppose we have the following set of observations for two variables.
The first thing to do is to get initial estimates of the means and (co)variances. These first estimates are made using only complete data, which are shown in bold in Table 2. All pairs of data with a missing value are discarded.
Estimating the mean

We estimate the means of the two variables.
Estimating (co)variances

Next we estimate the covariances using one of the formulas below. In this case we use the second method.
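As a hedged reminder, two standard and algebraically equivalent forms of the covariance over the n complete pairs are given here. Whether these match the article's two formulas, and which of them it calls the second method, is an assumption; the symbols x, y and n are placeholders introduced for this sketch.

```latex
\[
\operatorname{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
\]
\[
\operatorname{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} x_i y_i \; - \; \bar{x}\,\bar{y}
\]
```

Setting y = x in either form gives the variance of x. The 1/n normalisation is the maximum likelihood convention that EM works with; a 1/(n-1) normalisation would change the variances and covariance by the same factor and leave the correlation unchanged.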
Estimating the correlation

Finally, we get the first estimate of the correlation, which for this example is -0.21.
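The initial-guess step can be sketched in a few lines of Python. This is a minimal illustration, not the article's own calculation: the data are hypothetical (the article's table is not reproduced here), missing values are represented by `None`, and the function name `complete_case_estimates` is made up for this example.

```python
import math

def complete_case_estimates(xs, ys):
    """Means, (co)variances and correlation using only complete pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # "Mean of products minus product of means" form, normalised by n.
    var_x = sum(x * x for x, _ in pairs) / n - mean_x ** 2
    var_y = sum(y * y for _, y in pairs) / n - mean_y ** 2
    cov_xy = sum(x * y for x, y in pairs) / n - mean_x * mean_y
    corr = cov_xy / math.sqrt(var_x * var_y)
    return {"mean_x": mean_x, "mean_y": mean_y,
            "var_x": var_x, "var_y": var_y, "cov_xy": cov_xy, "corr": corr}

# Hypothetical data: the first two x values and the last two y values are missing.
xs = [None, None, 2.0, 4.0, 6.0, 8.0, 5.0, 7.0]
ys = [3.0, 5.0, 1.0, 3.0, 2.0, 4.0, None, None]
initial = complete_case_estimates(xs, ys)
print(initial["corr"])
```

Only the four complete pairs contribute here; the four rows with a missing value are ignored entirely, which is exactly the information the EM iterations later try to recover.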
Using a linear model to fill in the missing data

The EM algorithm starts by finding values for the missing data, together with the uncertainty of those estimates. The best guesses for the missing values are made using least squares linear fits. The first model predicts the missing values of one variable from the observed values of the other. Using the means and covariances, we get the following linear models and, from them, an estimate for each missing value.
Table 3 shows the completed table with all four missing values estimated.
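A least squares fill-in of this kind can be sketched as follows. The helper names are hypothetical. The numeric call reuses the figures from the article's worked line, 13 + -0.5 / 2.222 (12 - 17.333) = 14.2; reading them as target mean 13, covariance -0.5, predictor variance 2.222, predictor mean 17.333 and observed predictor value 12 is an interpretation, since the variable symbols are not available.

```python
def regression_estimate(mean_target, mean_predictor, covariance, var_predictor, observed):
    """Least squares prediction of a missing value from its observed partner value."""
    slope = covariance / var_predictor
    return mean_target + slope * (observed - mean_predictor)

def residual_variance(var_target, covariance, var_predictor):
    """Uncertainty of the prediction: variance left unexplained by the linear fit."""
    return var_target - covariance ** 2 / var_predictor

# Figures taken from the article's worked line (interpretation stated above).
estimate = regression_estimate(13.0, 17.333, -0.5, 2.222, 12.0)
print(round(estimate, 1))  # 14.2
```

The residual variance is the "uncertainty of those estimates" that the later variance update needs; it is not evaluated with the article's numbers here because the target variable's variance is not stated in the text.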
Main loop

After estimating values for the missing data using least squares linear regression, we are ready to enter the iterative loop that converges to an optimal estimate of the correlation.

Updating the means

We estimate new values for the means from the completed table.
Updating the (co)variances

Next we update the covariances using one of the formulas below. These equations are the same as before, except that the variance estimates have an additional term that corrects for the uncertainty in the estimates of the missing values.
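One common way to write the corrected variance update is sketched below, for a variable y whose missing entries were filled in from x. This is the textbook form for a bivariate normal model and is offered as an assumption about what the article's formulas look like; all symbols are placeholders introduced here.

```latex
\[
\hat{y}_i = \mu_y + \frac{\sigma_{xy}}{\sigma_{xx}} (x_i - \mu_x),
\qquad
\sigma_{y|x}^2 = \sigma_{yy} - \frac{\sigma_{xy}^2}{\sigma_{xx}}
\]
\[
\sigma_{yy}^{\mathrm{new}} = \frac{1}{n} \left[
  \sum_{i \in \mathrm{obs}} \left( y_i - \mu_y^{\mathrm{new}} \right)^2
  + \sum_{i \in \mathrm{mis}} \left( \left( \hat{y}_i - \mu_y^{\mathrm{new}} \right)^2 + \sigma_{y|x}^2 \right)
\right]
\]
```

Here the quantities on the right of the first line come from the previous iteration, "obs" and "mis" index the rows where y was observed or filled in, and the residual variance is the additional term. When every row has at most one missing value, the covariance update needs no such term, because the partner value in each filled-in row is observed exactly.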
Updating the correlation

Finally, we get an update of the correlation.
Updating the estimates of the missing data

Using the updated means and covariances, we get updated linear models and, from them, updated estimates of the missing values.
Table 4 shows the completed table with all four missing values updated.
Repeating the process

These steps are repeated, starting with updating the means, until the values converge.
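The whole procedure can be put together in a short sketch. This is an illustrative implementation under stated assumptions, not the article's own code: it assumes a bivariate normal model, rows with at most one missing value (given as `None`), at least two complete rows, and a convergence test on the largest parameter change. The function name `em_correlation` and the data are hypothetical.

```python
import math

def em_correlation(xs, ys, tol=1e-10, max_iter=1000):
    """EM estimate of means, (co)variances and correlation for two variables."""
    n = len(xs)
    complete = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    m = len(complete)
    # Initial guesses from complete pairs only.
    mx = sum(x for x, _ in complete) / m
    my = sum(y for _, y in complete) / m
    sxx = sum((x - mx) ** 2 for x, _ in complete) / m
    syy = sum((y - my) ** 2 for _, y in complete) / m
    sxy = sum((x - mx) * (y - my) for x, y in complete) / m

    for _ in range(max_iter):
        sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
        for x, y in zip(xs, ys):
            extra_xx = extra_yy = 0.0
            if x is None:    # fill in x from y, remember the residual variance
                x = mx + sxy / syy * (y - my)
                extra_xx = sxx - sxy ** 2 / syy
            elif y is None:  # fill in y from x, remember the residual variance
                y = my + sxy / sxx * (x - mx)
                extra_yy = syy - sxy ** 2 / sxx
            sum_x += x
            sum_y += y
            sum_xx += x * x + extra_xx
            sum_yy += y * y + extra_yy
            sum_xy += x * y
        # Update means and (co)variances from the completed data.
        new_mx, new_my = sum_x / n, sum_y / n
        new = (new_mx, new_my,
               sum_xx / n - new_mx ** 2,
               sum_yy / n - new_my ** 2,
               sum_xy / n - new_mx * new_my)
        change = max(abs(a - b) for a, b in zip(new, (mx, my, sxx, syy, sxy)))
        mx, my, sxx, syy, sxy = new
        if change < tol:
            break

    return {"mean_x": mx, "mean_y": my, "var_x": sxx, "var_y": syy,
            "cov_xy": sxy, "corr": sxy / math.sqrt(sxx * syy)}

# Hypothetical data: the last two y values are missing.
result = em_correlation([2.0, 4.0, 6.0, 8.0, 5.0, 7.0],
                        [1.0, 3.0, 2.0, 4.0, None, None])
print(result["mean_y"], result["corr"])
```

With no missing values the loop stops after one pass and returns the ordinary estimates, which is a convenient sanity check before trusting the result on incomplete data.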
Conclusions

This example shows that estimating correlation from data with missing values can give substantially different results compared to methods that discard all samples with missing data. The initial guess of this algorithm gives a correlation of -0.21, which is the estimate obtained after removing all samples with missing data. The final result is a correlation of -0.56.
An explanation for this big difference is that two samples of

Bibliography

R. Rebonato, P. Jackel - The most general methodology to create a valid correlation matrix for risk management and option pricing purposes
Mortaza Jamshidian, Peter M. Bentler - ML Estimation of Mean and Covariance Structures with Missing Data Using Complete Data Routines
Roderick J. A. Little, Donald B. Rubin - Statistical Analysis with Missing Data
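The first two observations of one variable and the last two observations of the other are missing (denoted by '?').

The first linear model is illustrated in the left plot, where the line is the linear fit. The second model is the other way around, and is used to predict the missing values of the other variable. The estimate for the first missing value is

$$13 + \frac{-0.5}{2.222}\,(12 - 17.333) = 14.2$$

The updated means are computed using the completed data, and the additional term in the variance estimates accounts for the missing values.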