DETECTION OF INFLUENTIAL VALUES IN THE LINEAR MODEL

Thierry Foucart
Department of Mathematics, University of Poitiers
SP2MI, Boulevard 3, téléport 2, BP 179, 86960 Futuroscope Cedex, FRANCE
e-mail : [email protected]

Abstract. The search for influential points is essential to interpret estimations in the linear model and has been the object of many works. We propose here to supplement the usual methods by calculating the derivatives of the regression coefficients with respect to each value of the explanatory variables and of the explained variable. These derivatives identify the statistical units i and the explanatory variables xj for which a small variation of the observed value xji causes a large variation of the regression vector (likewise for the explained variable y). The article ends with an application to data treated in the book of Belsley et al. (1980) on influential values : the derivatives very simply single out all the influential statistical units detected by Belsley and, moreover, clarify the nature of the influence they exert.

Keywords : influential values, linear model, multicollinearity, multiple correlation coefficient, regression coefficients.

AMS Subject classification : 62J05

1. SEARCH FOR INFLUENTIAL POINTS.

The search for influential points is essential to interpret estimations in linear or time-series modelling and has been the object of many papers and books. Some of them deal with multivariate data (Kosinski, 1999) or autoregressive models (Meintanis and Donatos, 1999). Peng and Pey (1995) present a Bayesian approach for detecting influential observations using divergence measures on the posterior distributions, whereas Schall and Dunn (1988) analyze the tails of the distribution. The jackknife and Cook's distance (Cook, 1977) are classical ways to assess the influence of a particular observation on the regression coefficients estimated by the least squares estimator. The approach of Belsley et al. (1980) generalizes the previous one by studying the derivative of the regression coefficients with respect to the weights assigned to the statistical units.

One drawback of the jackknife and of the methods derived from it is that they are founded on the deletion of a complete row of the data table. It is therefore difficult to detect which variables the influence of a row on the estimates is due to. Our approach supplements these methods by studying precisely the effect of a small variation of each observed value of the explanatory and explained variables on the estimates. It consists in calculating the derivatives of the regression coefficients with respect to the observed values of these variables.

This analysis is also particularly useful to study the collinearities between the explanatory variables, which make the regression vector very unstable and impose recourse

to biased estimators such as regression on principal components (Tomassone, 1992), ridge regression (Hoerl and Kennard, 1970; Foucart, 1999) or PLS regression (Wold et al., 1984, or Tenenhaus et al., 1995). The derivatives of the squared length of the regression vector actually make it possible to explain the instability of the regression vector with precision.

2. DERIVATIVE OF A CORRELATION COEFFICIENT.

Let cov(x,y) be the covariance and r(x,y) the correlation coefficient of a pair of statistical variables (x,y) = (xi, yi), i = 1, ..., n. As usual, we denote :

• mx and σx² the mean and the variance of the xi,
• my and σy² the mean and the variance of the yi,
• cov(x,y) their covariance,
• r(x,y) their correlation coefficient,
• b and b0 the regression coefficients estimated by the least squares estimator in the simple linear model y = b x + b0 :
  b = cov(x,y) / σx²    b0 = my - b mx
• (ei), i = 1, …, n, the residuals : ei = yi - (b xi + b0).

The following derivatives can be computed easily (cf. annex) :

(1)  ∂r(x,y) / ∂xi = (1/n) ei / (σx σy)

(2)  ∂b / ∂xi = (1/n) [ ei - b (xi - mx) ] / σx²

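As an illustrative check, added here for the reader and not part of the original derivation, formulas (1) and (2) can be compared with finite differences on simulated data. The Python sketch below uses arbitrary simulated data and a hypothetical helper named fit.

```python
import numpy as np

# simulated data, arbitrary; any sample (x, y) would do
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

def fit(x, y):
    """Least squares slope b, intercept b0 and residuals of y = b x + b0."""
    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    b0 = y.mean() - b * x.mean()
    return b, b0, y - (b * x + b0)

b, b0, e = fit(x, y)
sx, sy = x.std(), y.std()
i = 7                                              # any observation index

# analytic derivatives, formulas (1) and (2)
dr_dxi = e[i] / (n * sx * sy)
db_dxi = (e[i] - b * (x[i] - x.mean())) / (n * sx**2)

# finite-difference check: perturb x_i and recompute the statistics
eps = 1e-6
x_pert = x.copy()
x_pert[i] += eps
dr_num = (np.corrcoef(x_pert, y)[0, 1] - np.corrcoef(x, y)[0, 1]) / eps
db_num = (fit(x_pert, y)[0] - b) / eps
print(dr_dxi, dr_num)                              # the two values should nearly coincide
print(db_dxi, db_num)
```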
It is worth noting that the derivative of the correlation coefficient with respect to xi is proportional to the residual ei, whereas the derivative of b depends simultaneously on the residual and on the difference xi - mx.

3. MULTIPLE LINEAR MODEL.

The linear model applied to p explanatory variables xj and an explained variable y consists in supposing the following relation :

Y = X β + β0 + ε

in which Y is the vector (y1, y2, ..., yn)t of the n observations of the explained variable, X the n × p matrix of observed values of the explanatory variables (xji), with n rows and p columns, β0 a real constant and ε a centered random vector (ε1, ε2, ..., εn)t with variance matrix σ² I. We suppose that the variables are reduced in the following sense :

Σi=1,…,n xji = 0    Σi=1,…,n (xji)² = n    Σi=1,…,n yi = 0    Σi=1,…,n (yi)² = n

Under these conditions the covariance matrix of the explanatory variables is equal to the correlation matrix R and the estimate of β0 is 0. The vector β = (β1, β2, ..., βp)t is called the vector of regression. The estimation of β is usually carried out by the least squares estimator B = (B1, ..., Bp)t, whose expression is given below :

B = (1/n) R-1 Xt Y = R-1 RXY


In this expression the variables are reduced in the previous sense, so that the vector (1/n) Xt Y = RXY is defined by the coefficients of correlation between the explanatory variables and the explained variable.

The multiple correlation coefficient R is the correlation coefficient between the explained variable y and its estimator defined by the linear combination X B. The coefficient of determination is R² by definition, and its expression in terms of the coefficients of correlation between the explanatory variables is known (Foucart, 1999). From the derivatives of the linear correlation coefficients (see above), one can easily calculate the derivative of the multiple correlation coefficient R with respect to the observed values xki and yi. The results given by the programs show that the derivative of R with respect to xki is proportional to the residuals ei, as in simple linear regression (cf. previous paragraph), and that the coefficient of correlation between the derivative of R with respect to yi and the residuals is equal to R. We have not proved these properties mathematically because of their limited practical interest.

4. DERIVATIVES OF THE VECTOR OF REGRESSION.

One can calculate the derivative of the inverse correlation matrix R-1 with respect to any correlation coefficient rk,l (Graybill, 1983, or Foucart, 1997a); an algorithm is given in the annex. The derivative of R-1 with respect to a value xki or yi (see paragraph 2) can then be deduced from it, and hence the derivatives of B. To simplify, we denote S = R-1. One has :

∂B/∂xji = ∂S/∂xji RXY + S ∂RXY/∂xji

The following relations are obvious :

∀ u = 1, …, p                        ∂ru,u / ∂xji = 0
∀ u, v = 1, …, p,  u ≠ j,  v ≠ j      ∂ru,v / ∂xji = 0
∀ u = 1, …, p,  u ≠ j                 ∂ru,y / ∂xji = 0

From these relations we deduce the derivatives of the regression coefficients Bk with respect to the xji and the yi :

(3)  ∀ k = 1, …, p   ∂Bk/∂xji = Σl=1,…,p Σv=1,…,p, v≠j [ ∂sk,l/∂rj,v · ∂rj,v/∂xji ] rl,y + sk,j ∂rj,y/∂xji

(4)  ∀ k = 1, …, p   ∂Bk/∂yi = Σl=1,…,p sk,l ∂rl,y/∂yi

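Formulas (3) and (4) can be approximated numerically by perturbing a single observed value and recomputing B = R-1 RXY. The Python sketch below, an illustration added here and not taken from the paper, follows that finite-difference route; the data and the helper names regression_on_correlations and dB_wrt_value are hypothetical.

```python
import numpy as np

def regression_on_correlations(X, y):
    """Least squares coefficients B = R^-1 RXY computed from the correlations."""
    R = np.corrcoef(X, rowvar=False)                       # p x p matrix of r_{u,v}
    r_xy = np.array([np.corrcoef(X[:, l], y)[0, 1] for l in range(X.shape[1])])
    return np.linalg.solve(R, r_xy)

def dB_wrt_value(X, y, i, j=None, eps=1e-6):
    """Finite-difference estimate of dB/dx_j^i (or dB/dy^i when j is None)."""
    Xp, yp = X.copy(), y.copy()
    if j is None:
        yp[i] += eps
    else:
        Xp[i, j] += eps
    return (regression_on_correlations(Xp, yp) - regression_on_correlations(X, y)) / eps

# arbitrary simulated data; i and j are zero-based indices here
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -0.5, 0.2, 0.0]) + rng.normal(size=50)
print(dB_wrt_value(X, y, i=45, j=1))                       # the p derivatives dBk with respect to one value of x2
```

Because correlation coefficients are scale-invariant, perturbing a raw value and recomputing the correlations reproduces exactly the quantity that formulas (3) and (4) differentiate.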
The derivative of the squared length of the vector of regression is expressed easily :

(5)  ∀ k = 1, …, p   ∂‖B‖² / ∂xki = Σj=1,…,p 2 Bj ∂Bj/∂xki

(6)  ∂‖B‖² / ∂yi = Σj=1,…,p 2 Bj ∂Bj/∂yi

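Continuing the illustrative sketch above, formulas (5) and (6) follow from the chain rule, so a minimal finite-difference version only needs the hypothetical helpers already defined :

```python
def d_sq_length_wrt_value(X, y, i, j=None, eps=1e-6):
    """Finite-difference counterpart of formulas (5)-(6): derivative of ||B||^2."""
    B = regression_on_correlations(X, y)
    return 2.0 * B @ dB_wrt_value(X, y, i, j, eps)
```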

Formulas (3) and (4) make it possible to detect the observations of the explanatory variables or of the explained variable for which a small variation exerts a relatively strong influence on the values of the regression coefficients. Formulas (5) and (6) are useful in the case of collinear explanatory variables : it is indeed known that, in this case, the numerical results are very unstable and that the squared length of the regression vector is often overestimated by the least squares estimator (Hoerl and Kennard, 1970). By studying the results given by these formulas, one can detect which statistical units and variables are at the origin of this collinearity.

5. EXAMPLE

As an example we take the data of Sterling (1977) analyzed by Belsley et al. (1980, p. 39). These data give 5 economic variables collected in 50 countries. The variables are :

SR : mean saving rate per person from 1960 to 1970,
POP15 : percentage of population less than 15 years old,
POP75 : percentage of population more than 75 years old,
DPI : mean individual income from 1960 to 1970,
∆DPI : mean rate of increase of the mean income.

The explained variable y is the mean saving rate SR. We consider the four other variables as explanatory variables x1, x2, x3 and x4. By studying the variations of the regression coefficients and other parameters when one or more statistical units are removed from the data, Belsley et al. determine a list of potentially influential countries : their rows are 3, 6, 7, 10, 14, 19, 21, 23, 24, 32, 33, 34, 37, 39, 44, 46, 47, 49 (see Belsley, p. 54). We will see that the analysis of the derivatives presented in the previous paragraph makes it possible to recover this list completely and explains the influence of each of them.

The correlation matrix of the 5 variables is given in the next table :

        x1        x2        x3        x4        y
x1    1.000
x2   -0.908     1.000
x3   -0.756     0.787     1.000
x4   -0.048     0.025    -0.129     1.000
y    -0.456     0.317     0.220     0.305     1.000

Table 1 : correlation matrix (Sterling's data, 50 statistical units)

The existence of collinearity between the explanatory variables is brought out by the small size of the smallest eigenvalue of the correlation matrix (0.0897). Table 2 below gives the regression coefficients on the reduced variables and the inflation factors defined by the diagonal terms of the matrix R-1 :

        estimates               Student's t             Inflation factor
        (reduced variables)     (reduced variables)
b1      -0.9420                 -3.189                  5.94
b2      -0.4873                 -1.561                  6.63
b3      -0.0745                 -0.362                  2.88
b4       0.2624                  2.088                  1.07

Table 2 : regression coefficients and inflation factors

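As a numerical illustration, not in the original paper, the smallest eigenvalue and the inflation factors quoted above can be approximately reproduced from the rounded correlations of Table 1; small discrepancies are expected because the correlations are rounded to three decimals.

```python
import numpy as np

# correlation matrix of the explanatory variables x1, ..., x4 (Table 1, rounded)
R = np.array([
    [ 1.000, -0.908, -0.756, -0.048],
    [-0.908,  1.000,  0.787,  0.025],
    [-0.756,  0.787,  1.000, -0.129],
    [-0.048,  0.025, -0.129,  1.000],
])
r_xy = np.array([-0.456, 0.317, 0.220, 0.305])   # correlations with y

print(np.linalg.eigvalsh(R).min())               # smallest eigenvalue, roughly 0.09
print(np.diag(np.linalg.inv(R)))                 # inflation factors, close to Table 2
print(np.linalg.solve(R, r_xy))                  # coefficients on the reduced variables
```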

The inflation factors indicate that the coefficients b1 and b2 are the most sensitive to the multicollinearity. We can determine by formulas (3) and (4) the most influential values on these two regression coefficients : in both cases, the largest derivatives in absolute value are the derivatives with respect to x246, x223, x219 and x27. We thus obtain two properties :

• the most influential units on b1 and b2 are 46, 23, 19 and 7;
• their influence is due to their value on the explanatory variable x2.

The first property is also given by Belsley, but not the second one.

Formulas (5) and (6) give us the derivatives of the squared length of the regression vector. In Table 3 we give these derivatives for the statistical units detected by Belsley as influential :

i      ∂‖B‖²/∂x1i    ∂‖B‖²/∂x2i    ∂‖B‖²/∂x3i    ∂‖B‖²/∂x4i    ∂‖B‖²/∂yi
3      -0.005351     -0.077362      0.000033      0.015672     -0.035371
6       0.023841      0.152476      0.000003      0.001879     -0.002043
7       0.057420      0.410333     -0.000017     -0.010692      0.036863
10     -0.028665     -0.134702      0.000046      0.020034     -0.039362
14      0.002998     -0.008651      0.000063      0.029206     -0.058481
19      0.066217      0.460134      0.000045      0.016383     -0.014528
21     -0.008507     -0.057306      0.000091      0.039592     -0.076000
23     -0.067327     -0.518613     -0.000110     -0.042642      0.057625
46     -0.084249     -0.539934     -0.000019     -0.005255     -0.007469
47      0.030276      0.235564      0.000018      0.010178     -0.010639
49      0.045100      0.356040      0.000063      0.033057     -0.050571

Table 3 : derivatives of the squared length of the regression vector (statistical units detected as potentially influential by Belsley)

Table 3 brings out the exclusive importance of the variable x2 in the collinearity, since the derivatives with respect to this variable are the greatest in absolute value. We retrieve the statistical units of the list given by Belsley, except those of rows 3, 6, 10, 14 and 21, which we examine below.

We computed the regression coefficients after having replaced x246 = 0.56 by x246 = 0.88 and 1.2, which represent variations of a quarter and of a half of the standard deviation of the variable x2 (1.28), that is of 0.25 and 0.5 on the reduced variable. The regression coefficients on the reduced centered variables become :

        previous estimates    standard deviation    new estimates    new estimates
        (x246 = 0.56)         of the bj's           (x246 = 0.88)    (x246 = 1.2)
b1      -0.9420               0.2954                -0.8872          -0.8279
b2      -0.4873               0.3122                -0.4121          -0.3308
b3      -0.0745               0.2059                -0.0924          -0.1119
b4       0.2624               0.1257                 0.2619           0.2608

Table 4 : estimates of the regression coefficients (reduced variables) according to x246 = 0.56, 0.88 or 1.2

As hypothesized, we find a strong variation of b2 (approximately 25% and 50% of its standard deviation) and of b1 (20% and 40%), a weak variation of b3 (10% and 20%) and an almost non-existent variation of b4.


The squared length of the regression vector B is :

x246 = 0.56 : ‖B‖² = 1.20     x246 = 0.88 : ‖B‖² = 1.034     x246 = 1.2 : ‖B‖² = 0.875

The variation of x246 thus reduces the squared length ‖B‖² by 14% and 27%.

As for the rows 3, 6, 10, 14 and 21, the derivatives of b3 show that the statistical unit 6 is the most influential on b3, through its value y6, after unit 44. The other units, of rows 3, 10, 14 and 21, give derivatives of the squared length of the regression vector with respect to the explained variable that are among the greatest ones in absolute value. But these derivatives are much smaller than the largest ones given in Table 3, so there may remain some doubt about the influence of these units on the regression vector.

CONCLUSION

By computing the derivatives of the regression vector with respect to the value of a statistical unit on each variable, we can detect both the statistical units and the variables which are influential on the estimates. In case of collinearity between the explanatory variables, it is possible to analyze its causes precisely thanks to the derivatives of the squared length of the regression vector. This calculation of derivatives can be applied to detect outliers or influential points in any statistical method requiring a matrix inversion, such as canonical or discriminant analysis.

ANNEX.

Derivatives of the correlation coefficient and of the regression coefficient in the simple linear model (paragraph 2) :

mx = (1/n) ∑ xi

dmx /dxi = 1/n

cov(x,y) = (1/n) ∑ xi yi - mx my

dcov(x,y) /dxi = (1/n) [yi - my]

σx² = (1/n) ∑ xi² - mx²

dσx² / dxi = (2/n) [xi - mx]
dσx / dxi = (1/n) [xi - mx] / σx

r(x,y) = cov(x,y) / (σx σy)

dr(x,y)/dxi = [ σx σy dcov(x,y)/dxi - cov(x,y) σy dσx/dxi ] / (σx² σy²)
            = [ σx σy (1/n) (yi - my) - cov(x,y) σy (1/n) (xi - mx) / σx ] / (σx² σy²)
            = (1/n) (yi - my) / (σx σy) - cov(x,y) (1/n) (xi - mx) / (σx³ σy)
            = (1/n) (yi - my) / (σx σy) - b (1/n) (xi - mx) / (σx σy)
            = (1/n) [ (yi - my) - b (xi - mx) ] / (σx σy)
            = (1/n) ei / (σx σy)

b = cov(x,y) / σx²

db/dxi = [ σx² dcov(x,y)/dxi - cov(x,y) dσx²/dxi ] / (σx²)²
       = (1/n) [ σx² (yi - my) - 2 cov(x,y) (xi - mx) ] / (σx²)²
       = (1/n) [ (yi - my) - 2 b (xi - mx) ] / σx²
       = (1/n) [ ei - b (xi - mx) ] / σx²


Algorithm to compute the derivatives of the entries of the matrix S = R-1 with respect to each entry of R (paragraph 4).

The positive symmetric matrix R can be expressed as the following matrix product :

R = A At

in which A is a lower triangular matrix (Ciarlet, 1989) whose entries are given by :

(1)  ∀ i = 1, ..., p                          ai,1 = r1,i / [ r1,1 ]^(1/2)

(2)  ∀ i = 2, ..., p                          ai,i = [ ri,i - Σk=1,…,i-1 ai,k² ]^(1/2)

(3)  ∀ i = 2, ..., p, ∀ j = i+1, ..., p       aj,i = [ ri,j - Σk=1,…,i-1 ai,k aj,k ] / ai,i

From A we deduce the inverse matrix S = R-1 :

S = [At]-1 A-1

The entries sp,p and sp,p-1 are expressed in terms of ap,p, ap,p-1 and ap-1,p-1 by the following formulas :

sp,p-1 = - ap,p-1 / [ ap,p² ap-1,p-1 ]
sp,p = 1 / ap,p²

One need only permute the variables xk and xp, then xl and xp-1, to express any entry sk,l in terms of the entries of A. Each entry sk,l of the matrix S = R-1 is thus a function of the entries ri,j of the matrix R. Moreover the interval ]α, β[ on which this function is defined is known (Foucart, 1997b).

The functions sp,p-1 and sp,p of the entries ap,p, ap,p-1 and ap-1,p-1 are compositions of differentiable functions and are thus differentiable. The functions sk,l of the entries ri,j are therefore differentiable too. Consider the entries sp,p-1 and sp,p first; their derivatives with respect to any correlation coefficient are given by the following formulas :

(4)  s'p,p-1 = - [ ap,p² ap-1,p-1 a'p,p-1 - ap,p-1 ( 2 ap,p ap-1,p-1 a'p,p + ap,p² a'p-1,p-1 ) ] / [ ap,p⁴ ap-1,p-1² ]

(5)  s'p,p = - 2 a'p,p / ap,p³

Permute xk and xp, then xl and xp-1, to calculate the derivatives of sk,l and sk,k. Compute the derivatives of ap,p, ap,p-1 and ap-1,p-1 with respect to the correlation coefficient ri,j to obtain s'p,p-1 and s'p,p.

• Derivatives with respect to ri,j (i
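A possible cross-check of this algorithm, added here for illustration and not part of the paper, uses the standard identity ∂R-1/∂rk,l = - R-1 (Ek,l + El,k) R-1 for k ≠ l, where Ek,l has a 1 in position (k,l) and 0 elsewhere. The Python sketch below compares it with a finite difference on the correlation matrix of Table 1.

```python
import numpy as np

def dS_drkl(R, k, l):
    """Derivative of S = R^-1 with respect to r_{k,l} (k != l), keeping R symmetric."""
    S = np.linalg.inv(R)
    E = np.zeros_like(R)
    E[k, l] = E[l, k] = 1.0          # r_{k,l} and r_{l,k} vary together
    return -S @ E @ S

# correlation matrix of Table 1 (explanatory variables), used as an example
R = np.array([
    [ 1.000, -0.908, -0.756, -0.048],
    [-0.908,  1.000,  0.787,  0.025],
    [-0.756,  0.787,  1.000, -0.129],
    [-0.048,  0.025, -0.129,  1.000],
])

k, l, eps = 0, 2, 1e-7
Rp = R.copy()
Rp[k, l] += eps
Rp[l, k] += eps
fd = (np.linalg.inv(Rp) - np.linalg.inv(R)) / eps
print(np.max(np.abs(fd - dS_drkl(R, k, l))))   # should be close to zero
```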