Calculating Correlation and Means with Missing Data

A great deal has been written about fixing 'invalid correlation matrices' for risk management purposes. A correlation matrix is invalid when it is mathematically impossible to generate random numbers with those mutual correlations (that is, when the matrix is not positive semidefinite). The most common causes of this problem in finance are that the entries of the correlation matrix are made up, or that they are estimated from historical data with an incorrect treatment of missing values. This article shows how to solve the latter type of problem by explaining how to correctly estimate the correlation coefficient for datasets with missing data. The technique is called 'expectation maximization' (EM). We illustrate the method with a practical example.


The first step: Finding initial guesses

Suppose we have the following set of observations for the two variables x_1 and x_2. The first two observations of x_1 and the last two observations of x_2 are missing (denoted by '?').

Table 1
k      1     2     3     4     5     6     7     8     9     10
x_1    ?     ?     15    12    14    13    14    10    15    16
x_2    12    11    18    16    19    17    15    19    ?     ?

The first thing to do is to get initial estimates of the means and (co)variances. These first estimates are made using only complete pairs, that is, observations 3 to 8, for which both x_1 and x_2 are present (Table 2). All pairs with a missing value are thrown away.


Estimating the mean

Table 2
k         1     2     3     4     5     6     7     8     9     10
x_1       ?     ?     15    12    14    13    14    10    15    16
x_2       12    11    18    16    19    17    15    19    ?     ?
complete  no    no    yes   yes   yes   yes   yes   yes   no    no

We estimate the means \mu_1, \mu_2 of the two variables from the six complete pairs.

\mu_1\; = (15 + 12 + ... + 10)/6 = 13
\mu_2\; = (18 + 16 + ... + 19)/6 = 17.333
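
The complete-case means above can be reproduced with a few lines of Python. This is a minimal sketch: using `None` to mark a missing value and the variable names are our own choices, not part of the original example.

```python
# Observations from Table 1; None marks a missing value (our convention).
x1 = [None, None, 15, 12, 14, 13, 14, 10, 15, 16]
x2 = [12, 11, 18, 16, 19, 17, 15, 19, None, None]

# Keep only the pairs in which both values are present.
complete = [(a, b) for a, b in zip(x1, x2) if a is not None and b is not None]

n = len(complete)
mu1 = sum(a for a, _ in complete) / n
mu2 = sum(b for _, b in complete) / n

print(n, round(mu1, 3), round(mu2, 3))
```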

Estimating (co)variances

Next we estimate the (co)variances using one of the two equivalent formulas below. In this case we use the second one. Note that the sums are divided by n (here n = 6), not by n - 1.

\begin{align} \sigma^2_{ij} &= \sum_k (x_i[k]-\mu_i)(x_j[k]-\mu_j)/n \\ &= \sum_k \frac{x_i[k] x_j[k]}{n}- \mu_i \mu_j \end{align}
\sigma^2_{11}\; =(15^2 + 12^2 + ... + 10^2)/6 - 13^2 = 2.667
\sigma^2_{12}\; =(15*18 + 12*16 + ... + 10*19)/6 - 13*17.333 = -0.5
\sigma^2_{22}\; =(18^2 + 16^2 + ... + 19^2)/6 - 17.333^2 = 2.222
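
As a check on the arithmetic, the second formula can be sketched in Python as follows. The helper name `cov` is hypothetical, and the two lists hold only the six complete pairs.

```python
def cov(xs, ys):
    """Covariance as mean(x*y) - mean(x)*mean(y); divides by n, not n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y

# Complete pairs only (observations 3 to 8 of Table 1).
c1 = [15, 12, 14, 13, 14, 10]
c2 = [18, 16, 19, 17, 15, 19]

s11 = cov(c1, c1)  # variance of x_1
s12 = cov(c1, c2)  # covariance of x_1 and x_2
s22 = cov(c2, c2)  # variance of x_2

print(round(s11, 3), round(s12, 3), round(s22, 3))
```

Passing the same list twice gives the variance, which is why a single helper covers all three estimates.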

Estimating the correlation

Finally, we get the first estimate of the correlation:

\rho = \sigma^2_{12}/\sqrt{\sigma^2_{11} \sigma^2_{22}}


\rho\; = -0.2054
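
The same number follows directly from the rounded (co)variances quoted above. A short sketch, with variable names of our own choosing:

```python
import math

# Rounded complete-case estimates taken from the text above.
s11, s12, s22 = 2.667, -0.5, 2.222

rho = s12 / math.sqrt(s11 * s22)
print(round(rho, 4))
```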

Using a linear model to fill in the missing data

The EM algorithm starts by finding values for the missing data, together with the uncertainty of those estimates.

The best guesses for the missing values are made using least squares linear fits.

[Figure: two scatter plots of the complete pairs with least squares fits. Left: x_1 predicted from x_2. Right: x_2 predicted from x_1.]

The first model predicts the missing values of x_1 from x_2 using a linear fit x_1 = a x_2 + b. This is illustrated in the left plot: the line is the linear fit, and values for x_1 are read off at x_2 = 11 and x_2 = 12. The second model works the other way around, x_2 = c x_1 + d (with its own coefficients), and is used to predict the missing values of x_2.

Using the means and covariances, we get the following linear models:

\begin{align} x_1 &= \mu_1 + \sigma^2_{12} / \sigma^2_{22} \left( x_2 - \mu_2\right)\\ x_2 &= \mu_2 + \sigma^2_{12} / \sigma^2_{11} \left( x_1 - \mu_1\right) \end{align}

The estimate for the first missing value of x_1, which corresponds to the value 12 for x_2, will thus be

x_1[1] = 13 + (-0.5 / 2.222) (12 - 17.333) = 14.2

Table 3 shows the completed table with all four missing values estimated.

Table 3
k      1     2     3     4     5     6     7     8     9     10
x_1    14.2  14.4  15    12    14    13    14    10    15    16
x_2    12    11    18    16    19    17    15    19    17.0  16.8
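
The four imputed values in Table 3 can be reproduced with the two linear models. The sketch below assumes the complete-case estimates derived earlier; the function names `predict_x1` and `predict_x2` are hypothetical.

```python
# Complete-case estimates from the first step (unrounded).
mu1, mu2 = 13.0, 104 / 6
s11, s12, s22 = 8 / 3, -0.5, 20 / 9

def predict_x1(x2_value):
    """Least squares prediction of x_1 from an observed x_2."""
    return mu1 + s12 / s22 * (x2_value - mu2)

def predict_x2(x1_value):
    """Least squares prediction of x_2 from an observed x_1."""
    return mu2 + s12 / s11 * (x1_value - mu1)

filled_x1 = [predict_x1(v) for v in (12, 11)]  # observations 1 and 2
filled_x2 = [predict_x2(v) for v in (15, 16)]  # observations 9 and 10

print(filled_x1, filled_x2)
```

Each model uses the slope covariance/variance of the predictor, so the two fitted lines are not inverses of each other unless the correlation is exactly 1 or -1.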

Main loop

After estimating values for the missing data using least squares linear regression, we are ready to enter the iterative loop that converges to an optimal solution of the correlation estimate.

Updating the means

We estimate new values for the means, \mu^\prime_1 and \mu^\prime_2, using the completed data (n = 10):

\mu^\prime_1 = (14.2 + 14.4 + 15 + ... + 10 + 15 + 16)/10 = 13.76
\mu^\prime_2 = (12 + 11 + 18 + ... + 19 + 17.0 + 16.8)/10 = 16.07
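
A short sketch of this update in Python. We assume the imputed values are carried at three decimals (14.425, 16.958, 16.771) rather than the one-decimal values displayed in Table 3; the means agree to two decimals either way.

```python
# Completed data: imputed values kept at three decimals (our assumption).
x1 = [14.2, 14.425, 15, 12, 14, 13, 14, 10, 15, 16]
x2 = [12, 11, 18, 16, 19, 17, 15, 19, 16.958, 16.771]

mu1_new = sum(x1) / len(x1)
mu2_new = sum(x2) / len(x2)

print(round(mu1_new, 2), round(mu2_new, 2))
```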

Updating the (co)variances

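Next we update the (co)variances using the formulas below. These equations are the same as before, except that the variance estimates have an additional term that corrects for the uncertainty in the estimates of the m_i missing values in x_i. In this term, \sigma^2_{ii} and \rho are the estimates from the previous step, and the sums now run over all n = 10 observations of the completed data.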

\begin{align} \sigma^{2\prime}_{ii} &= \sum_k (x_i[k]-\mu^\prime_i)^2/n + \sigma^2_{ii}(1-\rho^2)m_i/n\\ &= \sum_k \frac{x_i[k]^2}{n}- \mu^{\prime 2}_i + \sigma^2_{ii}(1-\rho^2)m_i/n\\ \sigma^{2\prime}_{ij} &= \sum_k (x_i[k]-\mu^\prime_i)(x_j[k]-\mu^\prime_j)/n \\ &= \sum_k \frac{x_i[k] x_j[k]}{n}- \mu^\prime_i \mu^\prime_j \end{align}
\sigma^{2\prime}_{11}\; =(14.2^2 + 14.4^2 + ... + 16^2)/10 - 13.76^2 + 2.667 (1 - 0.042) 2/10= 3.18
\sigma^{2\prime}_{12}\; =(14.2*12 + 14.4*11 + ... + 16*16.8)/10 - 13.76*16.07 = -1.13
\sigma^{2\prime}_{22}\; =(12^2 + 11^2 + ... + 16.8^2)/10 - 16.07^2 +2.222(1 - 0.042) 2/10 = 7.07
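
The update, including the correction term, can be sketched as below. The function name `update_covariances` and its argument names are hypothetical. The previous-step values (2.667, 2.222, -0.2054) and the imputed values are the rounded figures from the text, so the results match the numbers above only to about two decimals.

```python
def update_covariances(x1, x2, prev_s11, prev_s22, prev_rho, m1, m2):
    """Re-estimate (co)variances from completed data.

    The variances get an extra term prev_sigma^2 * (1 - prev_rho^2) * m / n
    that accounts for the uncertainty of the m imputed values.
    """
    n = len(x1)
    mu1 = sum(x1) / n
    mu2 = sum(x2) / n
    unexplained = 1 - prev_rho ** 2
    s11 = sum(a * a for a in x1) / n - mu1 ** 2 + prev_s11 * unexplained * m1 / n
    s22 = sum(b * b for b in x2) / n - mu2 ** 2 + prev_s22 * unexplained * m2 / n
    s12 = sum(a * b for a, b in zip(x1, x2)) / n - mu1 * mu2
    return s11, s12, s22

# Completed data, with imputed values kept at three decimals.
x1 = [14.2, 14.425, 15, 12, 14, 13, 14, 10, 15, 16]
x2 = [12, 11, 18, 16, 19, 17, 15, 19, 16.958, 16.771]

s11, s12, s22 = update_covariances(x1, x2, 2.667, 2.222, -0.2054, m1=2, m2=2)
print(round(s11, 2), round(s12, 2), round(s22, 2))
```

Without the correction term the imputed points would be treated as exact observations lying on the regression line, which understates the variances.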


Updating the correlation

Finally, we get an update of the correlation:

\rho^\prime = \sigma^{2\prime}_{12}/\sqrt{\sigma^{2\prime}_{11} \sigma^{2\prime}_{22}}


\rho^\prime\; = -0.2374

Updating the estimates of the missing data

Using the updated means and covariances, we get the following linear models:

\begin{align} x_1 &= \mu^\prime_1 + \sigma^{2\prime}_{12} / \sigma^{2\prime}_{22} \left( x_2 - \mu^\prime_2\right)\\ x_2 &= \mu^\prime_2 + \sigma^{2\prime}_{12} / \sigma^{2\prime}_{11} \left( x_1 - \mu^\prime_1\right) \end{align}

The first missing value of x_1, which corresponds to the value 12 for x_2, will thus become

x_1[1] = 13.76 + (-1.13 / 7.07) (12 - 16.07) = 14.41

Table 4 shows the completed table with all four missing values updated.

Table 4
k      1     2     3     4     5     6     7     8     9     10
x_1    14.4  14.6  15    12    14    13    14    10    15    16
x_2    12    11    18    16    19    17    15    19    15.6  15.3

Repeating the process

These steps are repeated, starting with updating the means, until the values converge. The table below lists the estimates after each step; step 0 is the initial guess based on complete pairs only.


step  \mu_1  \mu_2  \sigma^2_{11}  \sigma^2_{12}  \sigma^2_{22}  \rho   x_1[1]  x_1[2]  x_2[9]  x_2[10]
0     13.00  17.33  2.67           -0.50          2.22           -0.21  14.20   14.42   16.96   16.77
1     13.76  16.07  3.18           -1.13          7.07           -0.24  14.41   14.57   15.63   15.28
2     13.80  15.79  3.31           -1.77          7.86           -0.35  14.65   14.88   15.15   14.61
3     13.85  15.68  3.38           -2.21          8.04           -0.42  14.86   15.14   14.93   14.27
4     13.90  15.62  3.45           -2.51          8.09           -0.47  15.02   15.33   14.82   14.09
5     13.94  15.59  3.51           -2.70          8.09           -0.51  15.13   15.47   14.77   14.00
6     13.96  15.58  3.56           -2.83          8.07           -0.53  15.21   15.56   14.75   13.96
7     13.98  15.57  3.61           -2.91          8.05           -0.54  15.27   15.63   14.75   13.94
8     13.99  15.57  3.64           -2.96          8.04           -0.55  15.31   15.67   14.75   13.93
9     14.00  15.57  3.66           -3.00          8.02           -0.55  15.33   15.71   14.75   13.93
10    14.00  15.57  3.68           -3.02          8.01           -0.56  15.35   15.73   14.75   13.93
11    14.01  15.57  3.69           -3.04          8.01           -0.56  15.36   15.74   14.75   13.92
12    14.01  15.57  3.70           -3.05          8.00           -0.56  15.37   15.75   14.75   13.92
13    14.01  15.57  3.70           -3.06          8.00           -0.56  15.38   15.76   14.75   13.92
14    14.01  15.57  3.71           -3.07          7.99           -0.56  15.38   15.77   14.75   13.92
15    14.01  15.57  3.71           -3.07          7.99           -0.56  15.38   15.77   14.75   13.92
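
The whole procedure can be put together in one loop. The following is a minimal sketch of the steps described in this article, not a general-purpose EM implementation: it handles exactly two variables, assumes no observation is missing both values, and runs a fixed number of steps instead of testing for convergence. The function name `em_bivariate` and the use of `None` for missing values are our own choices.

```python
import math

def em_bivariate(x1, x2, steps=15):
    """Return per-step (mu1, mu2, s11, s12, s22, rho) and the completed data."""
    n = len(x1)
    miss1 = [k for k in range(n) if x1[k] is None]
    miss2 = [k for k in range(n) if x2[k] is None]
    both = [k for k in range(n) if x1[k] is not None and x2[k] is not None]

    # Step 0: means and (co)variances from the complete pairs only.
    c = len(both)
    mu1 = sum(x1[k] for k in both) / c
    mu2 = sum(x2[k] for k in both) / c
    s11 = sum(x1[k] ** 2 for k in both) / c - mu1 ** 2
    s22 = sum(x2[k] ** 2 for k in both) / c - mu2 ** 2
    s12 = sum(x1[k] * x2[k] for k in both) / c - mu1 * mu2

    f1, f2 = list(x1), list(x2)  # completed copies of the data
    history = []
    for _ in range(steps + 1):
        rho = s12 / math.sqrt(s11 * s22)
        # Fill in each missing value from its observed partner.
        for k in miss1:
            f1[k] = mu1 + s12 / s22 * (x2[k] - mu2)
        for k in miss2:
            f2[k] = mu2 + s12 / s11 * (x1[k] - mu1)
        history.append((mu1, mu2, s11, s12, s22, rho))

        # Re-estimate the moments for the next step; the variances get the
        # correction term built from the current s11, s22 and rho.
        unexplained = 1 - rho ** 2
        mu1, mu2 = sum(f1) / n, sum(f2) / n
        s12 = sum(a * b for a, b in zip(f1, f2)) / n - mu1 * mu2
        s11, s22 = (
            sum(a * a for a in f1) / n - mu1 ** 2 + s11 * unexplained * len(miss1) / n,
            sum(b * b for b in f2) / n - mu2 ** 2 + s22 * unexplained * len(miss2) / n,
        )
    return history, f1, f2

x1 = [None, None, 15, 12, 14, 13, 14, 10, 15, 16]
x2 = [12, 11, 18, 16, 19, 17, 15, 19, None, None]

history, f1, f2 = em_bivariate(x1, x2)
for step, row in enumerate(history):
    print(step, " ".join(f"{v:6.2f}" for v in row))
```

In practice one would stop when the change in the estimates between two steps falls below a tolerance; the fixed step count here simply mirrors the 15 rows of the table.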

Conclusions

This example shows that estimating a correlation with EM in the presence of missing data can give substantially different values compared to methods that throw away all samples that have missing data.

The initial guess of this algorithm gives a correlation of -0.21, which is the estimate obtained by removing all samples with missing data. The final result is a correlation of -0.56. An explanation for this large difference is that the two samples of x_2 that are thrown away (12 and 11) deviate strongly from the other values of x_2. When those samples are included, the estimate of the mean changes considerably (17.33 vs 16.07 after the first update), which in turn has a strong impact on the estimate of the variance of x_2 (2.22 vs 7.07).

Bibliography

R. Rebonato, P. Jackel - The most general methodology to create a valid correlation matrix for risk management and option pricing purposes
Mortaza Jamshidian, Peter M. Bentler - ML Estimation of Mean and Covariance Structures with Missing Data Using Complete Data Routines
Roderick J. A. Little, Donald B. Rubin - Statistical Analysis with Missing Data. John Wiley & Sons, Inc., New York, NY, USA, 1986
