Applied Econometrics

Outline

Maximum likelihood

Models of discrete choice

Sample selection models


Maximum likelihood

The model (simple case)

▷ Let's consider an ordinary linear model yt = β0 + β1 xt + ut
▷ Assume that the error term ut follows a normal distribution N(0, σ²)
▷ We thus have ut = yt − β0 − β1 xt, which follows a normal distribution N(0, σ²)
▷ And yt follows a normal distribution as well: N(β0 + β1 xt, σ²)
▷ The "likelihood" of the sample is the probability that this sample indeed occurred

Estimation

▷ The distribution of y depends on the parameters β0 and β1, which are unknown
▷ If we assume that the sample was correctly drawn, then β0 and β1 should be such that they maximize the probability of getting that particular sample
▷ β0 and β1 are thus estimated by maximizing the likelihood of the sample

Computation

▷ We assume that individuals in the sample are independent
▷ The likelihood of the sample is thus the product of the likelihoods of the individuals
▷ For this reason, we usually take its log to get a sum (easier to handle)
▷ We thus deal with the "log-likelihood"
▷ The parameters β0 and β1 are obtained by the usual maximization techniques, or by iterative methods
▷ ML gives the same point estimates as OLS, but σ̂² is downward biased (σ̂²ML = SSR/N rather than SSR/(N − 2)); a sketch follows this list
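As an illustration, here is a minimal sketch (simulated data; numpy and scipy assumed) that maximizes the normal log-likelihood numerically and compares the result with OLS:

```python
# Sketch: ML estimation of y = b0 + b1*x + u with normal errors,
# compared with OLS on simulated data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)  # true beta0=1, beta1=2

def neg_loglik(theta):
    b0, b1, log_sigma = theta
    sigma = np.exp(log_sigma)  # parameterized in logs to keep sigma > 0
    u = y - b0 - b1 * x
    # minus the sum of log normal densities over independent observations
    return 0.5 * n * np.log(2 * np.pi) + n * log_sigma + np.sum(u**2) / (2 * sigma**2)

ml = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
b0_ml, b1_ml, sigma_ml = ml.x[0], ml.x[1], np.exp(ml.x[2])

X = np.column_stack([np.ones(n), x])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
ssr = np.sum((y - X @ b_ols) ** 2)
print(b0_ml, b1_ml, b_ols)                  # same point estimates
print(sigma_ml**2, ssr / n, ssr / (n - 2))  # ML variance = SSR/N (biased)
```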

Maximum Likelihood general properties

▷ ML estimators can be biased
▷ But they are consistent,
▷ asymptotically normally distributed,
▷ and asymptotically efficient (smallest variance among all consistent asymptotically normal estimators)
▷ Caution: the likelihood must be well specified

How specification tests are run

We generally no longer have regular t-tests, since y may follow any distribution and we are no longer in the OLS framework. Assume we want to test a linear restriction on the parameter θ: Rθ = q. The possibilities, which are asymptotically equivalent, are:

▷ Wald test: estimate θ by ML, then check whether Rθ̂ − q is close to zero, using its asymptotic variance-covariance matrix (this procedure is close to the tests we are used to)
▷ Likelihood ratio test: estimate the model with and without the constraint, and check whether the difference between the two log-likelihoods is significantly different from 0 (a sketch follows this list)
▷ Lagrange multiplier test: estimate the model with the restriction, then check whether the first-order condition from the unrestricted model is significantly violated in that case
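For instance, a minimal sketch of the likelihood ratio test on simulated data (statsmodels assumed; here the restriction is that the coefficient on x2 is zero):

```python
# LR test: compare the log-likelihoods of the unrestricted and
# restricted models; 2*(LL_u - LL_r) is asymptotically chi2 under H0.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(1)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.0 * x1 + 0.8 * x2))))

X_u = sm.add_constant(np.column_stack([x1, x2]))  # unrestricted
X_r = sm.add_constant(x1)                         # restricted: x2 excluded
ll_u = sm.Logit(y, X_u).fit(disp=0).llf
ll_r = sm.Logit(y, X_r).fit(disp=0).llf

lr = 2 * (ll_u - ll_r)        # one restriction -> 1 degree of freedom
print(lr, chi2.sf(lr, df=1))  # small p-value: reject the restriction
```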

Models of discrete choice

Discrete choice models and OLS (1)

▷ Example: for individual i, yi = 1 if unemployed, yi = 0 if employed
▷ We want to explain y by age and education:
▷ yi = α + β agei + γ educationi + εi = xi b + εi
▷ So E(yi) = xi b
▷ And we know that E(yi) = 1 · P(yi = 1) + 0 · P(yi = 0), so P(yi = 1) = xi b
▷ xi b should lie between 0 and 1, which imposes strong restrictions on x and b

Discrete choice models and OLS (2)

▷ Furthermore, εi can take only two values: −xi b with probability 1 − xi b, or 1 − xi b with probability xi b
▷ So that V(εi) = xi b (1 − xi b)
▷ There is heteroskedasticity

Discrete choice models and OLS (3)

▷ Heteroskedasticity could be dealt with using White's standard errors
▷ Non-normality could be dealt with using a large sample size (Central Limit Theorem)
▷ But the main problem is that we need E(yi) to belong to [0; 1]
▷ For these various reasons, OLS is inappropriate, as the sketch below illustrates
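A quick simulated illustration of the problem (statsmodels assumed; the true data-generating process is a logit, chosen for illustration):

```python
# The linear probability model (OLS on a binary y) can produce fitted
# "probabilities" outside [0, 1] at extreme regressor values.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1000
age = rng.uniform(18, 65, size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-6 + 0.15 * age))))

fitted = sm.OLS(y, sm.add_constant(age)).fit().fittedvalues
print(fitted.min(), fitted.max())  # the linear fit can leave [0, 1]
```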

Modeling discrete choice models

▷ We proceed as if there were an unobserved continuous variable y* such that ∀i, yi* ≥ 0 ⇒ yi = 1 and yi* < 0 ⇒ yi = 0
▷ This new variable y* is continuous: we can thus refer to our usual model, with some adaptations

The model

▷ Assume that y* = Xβ + u
▷ Let F be the distribution function of u, assumed symmetric
▷ P(yi = 1) = P(y* > 0) = P(Xβ + u > 0) = P(u > −Xβ) = P(u < Xβ) = F(Xβ)
▷ We may take the normal distribution (Probit model) or the logistic (Logit model)

Distributions

▷ Normal: F(x) = ∫_{−∞}^{x} (1/√(2π)) e^(−t²/2) dt
▷ Logistic: F(x) = 1/(1 + e^(−x)) = e^x/(1 + e^x)
▷ These two distributions are very close, and give similar results
▷ Expectation: 0 for both distributions
▷ Variance: 1 for the standard normal, π²/3 for the logistic
▷ Characteristics of the Logit: extreme events have higher probability (heavier tails); parameters are interpreted more easily

Interpreting parameters (1)

Let xi be a row vector of observations (corresponding to individual i) and β be a column vector of parameters.

▷ P(yi = 1) = F(xi β)
▷ The parameters enter the expression non-linearly
▷ If we differentiate F(xi β) with respect to variable xk, calling f the derivative of F, we get: dF(xi β)/dxi,k = f(xi β) βk

Here f is either the standard normal density φ (Probit) or the logistic density exp(x)/(1 + exp(x))² (Logit). The effect of a change in xi,k thus depends on the values of xi.

Interpreting parameters (2)

dF(xi β)/dxi,k = f(xi β) βk

The sign of the effect of a change in xi,k can nevertheless be determined: it is the sign of βk. A sketch of the computation follows.
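A minimal sketch of this computation on simulated data, checked against statsmodels' get_margeff:

```python
# Marginal effect of x in a logit at the sample mean:
# dF/dx_k = f(xbar'b) * b_k, with f the logistic density.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 800
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 0.9 * x))))

X = sm.add_constant(x)
res = sm.Logit(y, X).fit(disp=0)

xb = X.mean(axis=0) @ res.params
f = np.exp(xb) / (1 + np.exp(xb)) ** 2     # logistic density at the mean
print(f * res.params[1])                   # manual marginal effect
print(res.get_margeff(at="mean").margeff)  # should match
```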

Log-likelihood (1)

The likelihood function for the entire sample is:

L = Πi P[yi = 1 | xi, β]^(yi) · P[yi = 0 | xi, β]^(1 − yi)

And we have P[yi = 1 | xi, β] = P[yi* > 0 | xi, β] = F(xi β), so:

LL = Σi yi log(F(xi β)) + Σi (1 − yi) log(1 − F(xi β))

Maximizing this log-likelihood amounts to differentiating the expression with respect to the parameter β and setting the derivative to zero (it can be shown that the log-likelihood is globally concave).

Log-likelihood (2)

LL = Σi yi log(F(xi β)) + Σi (1 − yi) log(1 − F(xi β))

The first-order condition is dLL/dβ = 0, which is in fact a column vector of derivatives (the derivative of LL with respect to β1, β2, etc.):

dLL/dβ = Σi [(yi − F(xi β)) / (F(xi β)(1 − F(xi β)))] f(xi β) xi′ = 0

For the logistic distribution, F(x) = e^x/(1 + e^x) and f(x) = e^x/(1 + e^x)², so:

dLL/dβ = Σi (yi − exp(xi β)/(1 + exp(xi β))) xi′ = 0

Predicted and actual frequency (logistic)

▷ p̂i = P̂(yi = 1) = F(xi β̂)
▷ So p̂i = 1/(1 + exp(−xi β̂)) = exp(xi β̂)/(1 + exp(xi β̂))
▷ β̂ is the solution of the first-order condition (previous slide), so that we get:

Σi (yi − exp(xi β̂)/(1 + exp(xi β̂))) xi′ = 0
Σi (yi − p̂i) xi′ = 0
Σi p̂i xi′ = Σi yi xi′

If there is a constant term in x, then Σi p̂i = Σi yi: the predicted frequency is equal to the actual frequency. This also holds for the Probit model.
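A quick numerical check of this property on simulated data (statsmodels assumed):

```python
# With a constant among the regressors, the fitted logit reproduces
# the observed frequency of ones (up to numerical tolerance).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 600
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + 1.1 * x))))

X = sm.add_constant(x)
p_hat = sm.Logit(y, X).fit(disp=0).predict(X)
print(p_hat.mean(), y.mean())  # equal: mean prediction = actual frequency
```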

Logit vs. Probit

▷ Since the two distributions are quite similar, the predictions will be similar as well
▷ However, the values of the parameters differ: this is because we use different formulas
▷ We usually get b̂_logit ≈ 1.6 b̂_probit
▷ One also finds b̂_logit ≈ (π/√3) b̂_probit (a sketch follows this list)
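A simulated illustration of this scaling, fitting both models to the same data:

```python
# The ratio of logit to probit coefficients is typically around 1.6-1.8.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 5000
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.4 + 1.0 * x))))

X = sm.add_constant(x)
b_logit = sm.Logit(y, X).fit(disp=0).params
b_probit = sm.Probit(y, X).fit(disp=0).params
print(b_logit / b_probit)  # elementwise ratios, roughly 1.6-1.8
```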

Logit: Odds-ratios

▷ The parameters β do not have a linear impact: how can they be interpreted?
▷ Recall that P(yi = 1) = P(y* > 0) = F(Xi β) = 1/(1 + e^(−Xi β))
▷ I.e. ∀i, P(Yi = 1) = pi and P(Yi = 0) = 1 − pi
▷ Let's define ci = pi/(1 − pi)
▷ The odds of [Yi = 1] against [Yi = 0] are ci to one
▷ We have pi = 1/(1 + e^(−Xi β)) = e^(Xi β)/(1 + e^(Xi β))
▷ And 1 − pi = 1/(1 + e^(Xi β))
▷ So ci = e^(Xi β)

Odds-ratios (2)

▷ We thus get ci = e^(Xi β)
▷ But Xi β = β0 + Σj xi,j βj
▷ So ci = e^(β0 + Σj xi,j βj) = e^(β0) Πj e^(xi,j βj)
▷ If variable xi,k increases by 1 unit, ci is multiplied by e^(βk)
▷ In other words, a 1-unit increase in xk multiplies the odds that [yi = 1] happens by e^(βk)
▷ The value e^(βk) is called the odds-ratio associated with variable xk; a sketch follows this list
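A minimal sketch on simulated data (the education variable and its coefficient are purely illustrative):

```python
# The odds-ratio of a regressor in a logit is exp(beta_k).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 1000
educ = rng.integers(8, 20, size=n).astype(float)
y = rng.binomial(1, 1 / (1 + np.exp(-(2.0 - 0.2 * educ))))

res = sm.Logit(y, sm.add_constant(educ)).fit(disp=0)
print(np.exp(res.params[1]))  # odds multiplier for one extra year of educ
```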

Back to Probit

Whatever the distribution chosen, we had:

LL = Σi yi log(F(xi β)) + Σi (1 − yi) log(1 − F(xi β))

dLL/dβ = Σi [(yi − F(xi β)) / (F(xi β)(1 − F(xi β)))] f(xi β) xi′ = 0

For the normal distribution,

F(xi β) = ∫_{−∞}^{xi β/σ} (1/√(2π)) exp(−t²/2) dt

It can be seen that we can only estimate β/σ, not the two separately. Without loss of generality, we thus act as if σ = 1 (a convenient normalization); hence the differing parameter scales between Logit and Probit.

Additional remarks

▷ The likelihood maximization in Probit involves integrals, but the scoring method works anyway
▷ There is no such thing as an odds-ratio in Probit estimation, so all we have to interpret the parameters is the computation of marginal effects (for the usefulness of Probit, see the next section)
▷ Marginal effects are of the same sort as in the Logit: their values vary with the values of x
▷ It is thus useful to compute these marginal effects at the regressor means and at various other relevant regressor values

Determining model adequacy (1)

▷ Pseudo-R²: R² = (LLfit − LL0) / (LLmax − LL0)
▷ LLfit: the log-likelihood of the fitted model
▷ LL0: the log-likelihood of a model with only a constant, where the probability is estimated by the proportion of ones in the sample
▷ LLmax: the maximum log-likelihood attainable; 0 in our case, because a model that perfectly predicts the observed values has likelihood 1, and log(1) = 0 (a sketch follows this list)
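A sketch of this computation on simulated data, compared with the McFadden pseudo-R² reported by statsmodels:

```python
# Pseudo-R2 = (LLfit - LL0)/(LLmax - LL0) = 1 - LLfit/LL0 since LLmax = 0.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 700
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.1 + 1.2 * x))))

res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
ll_fit = res.llf
ll_0 = sm.Logit(y, np.ones((n, 1))).fit(disp=0).llf  # constant-only model
print(1 - ll_fit / ll_0)  # manual pseudo-R2
print(res.prsquared)      # statsmodels' McFadden pseudo-R2
```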

Determining model adequacy (2)

▷ We can also evaluate the goodness of fit of the model by comparing correct and incorrect predictions
▷ Predicted outcomes: since 0 ≤ p̂i ≤ 1, we need to choose a cut-off point c that determines whether the prediction is 0 or 1 (usually c = 1/2)
▷ ROC (Receiver Operating Characteristic) curve: plots the fraction of y = 1 values correctly classified against the fraction of y = 0 values incorrectly classified, as c varies (a sketch follows this list)
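A minimal sketch using scikit-learn's ROC utilities on the fitted logit probabilities (simulated data):

```python
# ROC curve: true-positive rate against false-positive rate as the
# cut-off c varies; the area under the curve summarizes accuracy.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(8)
n = 1000
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-1.5 * x)))

X = sm.add_constant(x)
p_hat = sm.Logit(y, X).fit(disp=0).predict(X)

fpr, tpr, cutoffs = roc_curve(y, p_hat)  # one point per cut-off
print(roc_auc_score(y, p_hat))           # area under the ROC curve
```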


The ROC curve

▷ Comes from signal detection theory, developed during WW2
▷ The accuracy of the classification rule is measured by the area under the ROC curve
▷ This area represents the ability of the procedure to correctly classify individuals, and lies between 0 and 1
▷ Trade-off between sensitivity and specificity: the more bowed the curve, the better the model
▷ The closer the ROC curve is to the upper-left corner, the higher the predictive power
▷ The optimal cut-off point is chosen by the analyst, depending on the issue studied

Sample selection models

About selection

▷ Example: consider the interpretation of average scores on an achievement test over time
▷ A decline over time may be due to a real deterioration in student knowledge, or it may just reflect a selection effect: more students have been taking the test over time, and the new test takers are relatively weaker students
▷ Selection can arise from self-selection (individuals choose whether to participate in a particular activity) or from sample selection (those who participate in the activity are oversampled)
▷ These problems are treated alike in econometrics, through sample selection models

The Heckit model (or Generalized Tobit)

▷ Say a variable y2* is only partially observed (y2* is the latent variable)
▷ Say we have another latent variable y1*, such that y2* is observed only if y1* > 0
▷ Examples: health expenditures or wages
▷ We'll see here why the Probit model is useful for estimating the first-step equation
▷ Note: in the standard Tobit model, y2 is observed only if it is above (or below) a certain threshold

The model

Latent variables:

▷ Participation equation: y1 = 1 if y1* > 0, y1 = 0 otherwise
▷ Outcome equation: y2 = y2* if y1 = 1, y2 is missing otherwise

Model for the latent variables:

▷ Participation equation: y1* = x1 b1 + u1
▷ Outcome equation: y2* = x2 b2 + u2

The Logit model cannot be used here for the first equation, since the error term is assumed normal: we have to use the Probit model.

The likelihood

L = Πi P[y1,i* ≤ 0]^(1 − y1,i) · [f(y2,i | y1,i* > 0) · P(y1,i* > 0)]^(y1,i)

▷ First term: discrete contribution, when y1,i* ≤ 0
▷ Second term: continuous contribution, when y1,i* > 0

Conditional means

E(y2 | x, y1* > 0) = E(x2 b2 + u2 | x1 b1 + u1 > 0) = x2 b2 + E(u2 | u1 > −x1 b1)

In the normal case, it can be shown that:

E(y2 | x, y1* > 0) = x2 b2 + σ1,2 λ(x1 b1)

with λ the inverse Mills ratio: λ(z) = φ(z)/Φ(z)

▷ If u1 and u2 are independent, then E(u2 | u1 > −x1 b1) = 0 and OLS is consistent
▷ Otherwise it is not, because we have to take into account the link between the two error terms: the sample selection bias σ1,2 λ(x1 b1)
▷ The error terms represent unobservable heterogeneity: it is likely that individual heterogeneity affects both the decision to participate in the health care system and the subsequent health care expenditures
▷ The sign of corr(u1, u2) is the sign of σ1,2

The Heckman two-step estimator

We have: E(y2 | x, y1* > 0) = x2 b2 + σ1,2 λ(x1 b1)

The Heckit two-step estimator first estimates λ, then estimates the model with λ replaced by its estimate λ̂:

1. Run a Probit regression of y1 on x1 (indeed, P(y1* > 0) = Φ(x1 b1))
2. Compute λ̂ = φ(x1 b̂1) / Φ(x1 b̂1)
3. Estimate the regression E(y2 | x, y1* > 0) = x2 b2 + σ1,2 λ̂
4. The sign of the coefficient on λ̂ gives the sign of the correlation between the two error terms
5. If this coefficient is not significant, the two equations can be considered independent
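A minimal sketch of the two-step procedure on simulated data (the exclusion restriction, the error correlation of 0.6, and all variable names are illustrative assumptions):

```python
# Heckman two-step: probit on the full sample, inverse Mills ratio,
# then OLS on the observed subsample with the ratio as a regressor.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(9)
n = 2000
x, w = rng.normal(size=n), rng.normal(size=n)  # w: excluded instrument
u = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=n)

y1_star = 0.5 + 1.0 * x + 1.0 * w + u[:, 0]  # participation
y2_star = 1.0 + 2.0 * x + u[:, 1]            # outcome
observed = y1_star > 0                       # y2 seen only if True

# Step 1: probit for participation (w appears only here: exclusion)
X1 = sm.add_constant(np.column_stack([x, w]))
probit = sm.Probit(observed.astype(float), X1).fit(disp=0)
index = X1 @ probit.params
lam = norm.pdf(index) / norm.cdf(index)      # inverse Mills ratio

# Step 2: OLS of y2 on x2 and lambda, observed subsample only
X2 = sm.add_constant(np.column_stack([x[observed], lam[observed]]))
heckit = sm.OLS(y2_star[observed], X2).fit()
print(heckit.params)  # last coefficient estimates sigma_12 (here > 0)
```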

Remarks

▷ Two-step Heckman estimates are consistent
▷ The method is easy and fast to implement
▷ However, there is an efficiency loss (so it is not appropriate in small samples)
▷ A more widely used method is maximum likelihood on the full model
▷ Note that using a Logit would not be appropriate here

Identification considerations

▷ In theory, the exact same regressors can appear in both equations, because the selection term enters the outcome equation in a non-linear way: there is no strict multicollinearity
▷ However, in that case the model is close to unidentified, because the inverse Mills ratio is almost linear
▷ This leads to great instability of the parameters (see the slides on near-perfect multicollinearity)
▷ Thus, at least one exclusion restriction is usually required

Remarks

▷ If we find that there is in fact no correlation between the two equations, they can be estimated independently through a Two-Part model
▷ If there is no correlation, selection is said to be on observables, because it is fully explained by the regressors
▷ If there is correlation, selection is said to be on unobservables, because it is partly explained by the error terms, which contain all the unobservable information
▷ The Two-Part model can also be used if the correlation of the inverse Mills ratio with the regressors is too strong (near-perfect multicollinearity)

The Two-Part model

This model intentionally does not take into account potential correlation between the error terms of the two equations:

1. Participation equation: a Probit or Logit that gives the probability of participating
2. Conditional outcome equation: anything we like (linear model, count data, etc.) that gives the level of activity, given that the individual participates (distribution truncated at 0)

The prediction of the model is the product of the two, as in the sketch below.
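A minimal sketch on simulated data (both equations and all parameter values are illustrative):

```python
# Two-part model: probit for participation, then OLS for the outcome
# of participants; the two error terms are treated as independent.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 2000
x = rng.normal(size=n)
part = (0.3 + 0.8 * x + rng.normal(size=n)) > 0  # participation
y2 = np.where(part, 1.0 + 2.0 * x + rng.normal(size=n), np.nan)

X = sm.add_constant(x)
p_model = sm.Probit(part.astype(float), X).fit(disp=0)  # part 1
out_model = sm.OLS(y2[part], X[part]).fit()             # part 2

# Unconditional prediction: P(participate) * E(outcome | participates)
pred = p_model.predict(X) * out_model.predict(X)
print(pred[:5])
```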