Warsaw School of Economics Institute of Econometrics
Department of Applied Econometrics
Department of Applied Econometrics Working Papers
Warsaw School of Economics Al. Niepodleglosci 164 02-554 Warszawa, Poland
Working Paper No. 6-07
On modified discriminant analysis
Marcin Owczarczuk
Warsaw School of Economics
This paper is available at the Warsaw School of Economics
Department of Applied Econometrics website at: http://www.sgh.waw.pl/instytuty/zes/wp/
On modified discriminant analysis
Marcin Owczarczuk mo23628@sgh.waw.pl
22 May 2007
Abstract
Discriminant analysis is mostly used to predict the value of a discrete dependent variable of an observation on the basis of a set of predictors. The commonly used criterion of predictive power is the fraction of incorrectly predicted cases in the sample. In this article we construct a model for a modified discriminant problem: find a subpopulation of a given size having the highest percentage of observations of a chosen class. Our model maximizes the corresponding criterion of predictive power: the fraction of observations from the chosen class in the selected subpopulation.
Keywords: discriminant analysis, semiparametric estimation, smoothing, binary response.
JEL codes: C14, C35
1 Introduction
The aim of discriminant analysis is to predict the value of a discrete explained variable Y on the basis of explanatory variables X = (X_1, . . . , X_k). We may also treat discriminant analysis as a model whose aim is to split the space of observations into two regions: the first characterized by dominance of observations from class Y = 0 and the second by dominance of observations from class Y = 1.[1]
In this article we formulate the task in a different manner. Suppose we are given a parameter τ, a fraction of the population. We want to split the population, with respect to the explanatory variables, into two groups: the first with τ of all observations and the second with (1 − τ) of all observations. The group of size τ should have as high a percentage of observations characterized by Y = 1 as possible. In other words, among all subsets of size τ we want to find the one containing the highest number of observations from class Y = 1.
Our model is a modification of the maximum score estimator of Manski (1975) and the smoothed maximum score estimator of Horowitz (1992). In the case of their estimators the aim is to split the population into two groups, characterized by Y = 1 and Y = 0 respectively, in order to minimize the fraction of incorrectly classified observations in the sample.[2] We also want to minimize the number of incorrectly classified observations, but only in the group classified by the model as having Y = 1, with an additional condition on the size of this group.
This problem arises in many areas, for example in credit scoring and marketing campaigns. In these applications the aim is to separate a small, fixed-size group of clients with a relatively high probability of a positive value of the response variable. In the case of marketing campaigns one wants to find, on the basis of their features, the group of clients most likely to respond to the campaign, the target group. In the case of credit scoring the policy of the bank may be based on the assumption that a certain fraction of the worst clients, for example 5%, should not be granted a loan. In that case one should find the 5% of customers with the highest probability of default.
This problem can be solved approximately using p̂, an estimator of the probability P(Y = 1), calculated for example by logistic regression. We may use the following construction:

1. sort the observations in descending order of p̂,
2. choose the fraction τ of observations with the highest p̂.
[1] In this paper, for simplicity, we restrict ourselves to a binary explained variable and denote its levels by Y = 1 and Y = 0.
[2] Their estimators are parameterized by α ∈ (0, 1). For α = 1/2 the fraction of incorrectly classified observations in the sample is minimized.
This method has intuitive grounds: observations with higher p̂ are more likely to come from class Y = 1, so the group with high p̂ will contain more observations from class Y = 1. Unfortunately, we have no guarantee that the chosen group will be optimal.
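The two-step construction above can be sketched in Python. This is our own illustration, not the author's code: the function names and the synthetic data are invented, and the logistic fit is a plain Newton iteration rather than any particular package.

```python
import numpy as np

def logit_fit(X, y, iters=25):
    """Fit a logistic regression by Newton's method (intercept first).
    A minimal sketch: no regularization, no convergence checks."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(Z @ beta, -30, 30)))
        W = p * (1.0 - p)
        beta = beta + np.linalg.solve((Z.T * W) @ Z, Z.T @ (y - p))
    return beta

def top_fraction(X, y, tau):
    """Steps 1-2 from the text: rank by p_hat, keep the round(tau*n) highest."""
    Z = np.column_stack([np.ones(len(y)), X])
    p_hat = 1.0 / (1.0 + np.exp(-np.clip(Z @ logit_fit(X, y), -30, 30)))
    n_keep = int(round(tau * len(y)))
    return np.argsort(p_hat)[::-1][:n_keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 1).astype(float)
idx = top_fraction(X, y, tau=0.10)
print(len(idx), y[idx].mean() > y.mean())
```

The selected group is enriched in Y = 1 relative to the base rate, which illustrates the intuition; as the text notes, nothing guarantees this group is optimal.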
2 Problem formulation
We assume, as in the case of classical discriminant analysis for binary responses,[3] that X ∈ R^k and Y ∈ {0, 1}. X and Y are random variables.

We denote:

• A - the family of all measurable subsets of R^k,
• f_1 - the distribution of the random variable X|(Y = 1),
• f_0 - the distribution of the random variable X|(Y = 0),
• Bin(π_1) - the distribution of the random variable Y.

Formally, our goal is to find a subset A that maximizes

max_{A ∈ A: P(A) = τ} P(Y = 1 | X ∈ A)    (1)
We also assume that we have a random sample

x_{11}, . . . , x_{1n_1} for y = 1,
x_{01}, . . . , x_{0n_0} for y = 0,

with n = n_1 + n_0.

3 Model construction
In this section we show solutions to the problem described in the previous section.
3.1 Case 1. Predictors are normally distributed with common covariance matrix
This is probably the most common case. We assume that the predictors have a normal distribution with a common covariance matrix and possibly different means among classes. These assumptions are the same as in the case of Fisher linear discriminant analysis.
[3] See for example Hastie, Tibshirani, Friedman (2001).
• X|(Y = 1) ∼ N(m_1, Σ),[4]
• X|(Y = 0) ∼ N(m_0, Σ),
• Y ∼ Bin(π_1).
Theorem 3.1. Under the above assumptions the optimal solution to problem (1) is

A = {X : (m_1 − m_0)^T Σ^{−1} X ≥ b}    (2)

where b is the quantile of order 1 − τ of the random variable (m_1 − m_0)^T Σ^{−1} X.
Proof.

max_{A∈A: P(A)=τ} P(Y = 1 | X ∈ A)
  = max_{A∈A: P(A)=τ} P(Y = 1 ∧ X ∈ A) / P(X ∈ A)
  = max_{A∈A: P(A)=τ} [π_1 ∫_A f_1] / [π_1 ∫_A f_1 + (1 − π_1) ∫_A f_0]
  = max_{A∈A: P(A)=τ} 1 / (1 + [(1 − π_1) ∫_A f_0] / [π_1 ∫_A f_1])    (3)

Note that

argmax_{A∈A: P(A)=τ} 1 / (1 + [(1 − π_1) ∫_A f_0] / [π_1 ∫_A f_1])
  = argmax_{A∈A: P(A)=τ} [π_1 ∫_A f_1] / [(1 − π_1) ∫_A f_0]    (4)

Then observe that the ratio π_1 f_1(x) / [(1 − π_1) f_0(x)] has a constant value on the line

c = π_1 f_1(x) / [(1 − π_1) f_0(x)]
  = [π_1 (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − m_1)^T Σ^{−1} (x − m_1))] / [(1 − π_1) (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − m_0)^T Σ^{−1} (x − m_0))]
  = [π_1 exp(−(1/2)(x − m_1)^T Σ^{−1} (x − m_1))] / [(1 − π_1) exp(−(1/2)(x − m_0)^T Σ^{−1} (x − m_0))]    (5)

ln(c) = ln(π_1 / (1 − π_1)) − (1/2)(x − m_1)^T Σ^{−1} (x − m_1) + (1/2)(x − m_0)^T Σ^{−1} (x − m_0)
  = ln(π_1 / (1 − π_1)) + (m_1 − m_0)^T Σ^{−1} x − (1/2)(m_1 − m_0)^T Σ^{−1} (m_1 + m_0)    (6)

ln(c) − ln(π_1 / (1 − π_1)) + (1/2)(m_1 − m_0)^T Σ^{−1} (m_1 + m_0) = (m_1 − m_0)^T Σ^{−1} x    (7)

which defines a linear equation in x.

[4] W ∼ g denotes that the random variable W has distribution g.
So problem (1) has the solution

A = {X : (m_1 − m_0)^T Σ^{−1} X ≥ b}    (8)

where b is chosen so that P({X : (m_1 − m_0)^T Σ^{−1} X ≥ b}) = τ. In other words, b is the quantile of order 1 − τ of the random variable (m_1 − m_0)^T Σ^{−1} X. ■
Parameters Σ, m_1 and m_0 are usually unknown and have to be estimated from data. As in the case of Fisher discriminant analysis we estimate

m̂_1 = (1/n_1) ∑_{i=1}^{n_1} x_{1i}    (9)

m̂_0 = (1/n_0) ∑_{i=1}^{n_0} x_{0i}    (10)

Σ̂ = [1/(n − 2)] ∑_{k=0}^{1} ∑_{i=1}^{n_k} (x_{ki} − m̂_k)(x_{ki} − m̂_k)^T    (11)

We may note that the optimal discriminant line is parallel to the solution of Fisher linear discriminant analysis. Our model has the form:

• y* = (m_1 − m_0)^T Σ^{−1} x = a^T x = a_1 x_1 + · · · + a_k x_k,
• if y* ≥ b then y = 1,
• if y* < b then y = 0.
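The estimators (9)-(11) and the threshold rule (2) translate directly into code. Below is a minimal numpy sketch (the function name and the test data are ours, not from the paper); b is taken as the empirical quantile of order 1 − τ of the projected sample.

```python
import numpy as np

def fit_case1(X1, X0, tau):
    """Case 1 rule: a = Sigma^{-1}(m1 - m0), b = empirical (1 - tau)-quantile
    of a^T x. X1 holds rows with Y = 1, X0 rows with Y = 0."""
    m1, m0 = X1.mean(axis=0), X0.mean(axis=0)
    n1, n0 = len(X1), len(X0)
    # pooled covariance estimate, equation (11)
    S = ((X1 - m1).T @ (X1 - m1) + (X0 - m0).T @ (X0 - m0)) / (n1 + n0 - 2)
    a = np.linalg.solve(S, m1 - m0)       # a = Sigma^{-1}(m1 - m0)
    scores = np.concatenate([X1, X0]) @ a
    b = np.quantile(scores, 1.0 - tau)    # quantile of order 1 - tau
    return a, b

rng = np.random.default_rng(1)
X1 = rng.normal(loc=2.0, size=(200, 2))   # class Y = 1
X0 = rng.normal(loc=0.0, size=(1800, 2))  # class Y = 0
a, b = fit_case1(X1, X0, tau=0.10)
y = np.r_[np.ones(200), np.zeros(1800)]
sel = np.concatenate([X1, X0]) @ a >= b
print(round(sel.mean(), 2), y[sel].mean() > y.mean())
```

The selected group has size close to τ and a much higher share of Y = 1 than the base rate, as the theorem predicts for Gaussian classes with a common covariance.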
3.2 Case 2. Predictors are normally distributed with unequal covariance matrix
We assume that the predictors have a normal distribution but the covariance matrix differs among classes. These assumptions are the same as in the case of quadratic discriminant analysis.
• X|(Y = 1) ∼ N(m_1, Σ_1),
• X|(Y = 0) ∼ N(m_0, Σ_0),
• Y ∼ Bin(π_1).
Theorem 3.2. Under the above assumptions the optimal solution to problem (1) is

A = {X : X^T (Σ_1^{−1} m_1 − Σ_0^{−1} m_0) − (1/2) X^T (Σ_1^{−1} − Σ_0^{−1}) X ≥ b}    (12)

where b is the quantile of order 1 − τ of the random variable X^T (Σ_1^{−1} m_1 − Σ_0^{−1} m_0) − (1/2) X^T (Σ_1^{−1} − Σ_0^{−1}) X.
Proof. The proof is similar to the proof of Theorem 3.1. We may observe that the ratio π_1 f_1(x) / [(1 − π_1) f_0(x)] is constant on the curve

c = π_1 f_1(x) / [(1 − π_1) f_0(x)]
  = [π_1 (2π)^{−k/2} |Σ_1|^{−1/2} exp(−(1/2)(x − m_1)^T Σ_1^{−1} (x − m_1))] / [(1 − π_1) (2π)^{−k/2} |Σ_0|^{−1/2} exp(−(1/2)(x − m_0)^T Σ_0^{−1} (x − m_0))]    (13)

ln(c) = ln(π_1 / (1 − π_1)) + (1/2) ln(|Σ_0| / |Σ_1|) + x^T (Σ_1^{−1} m_1 − Σ_0^{−1} m_0) − (1/2) x^T (Σ_1^{−1} − Σ_0^{−1}) x − (1/2) m_1^T Σ_1^{−1} m_1 + (1/2) m_0^T Σ_0^{−1} m_0    (14)

ln(c) − ln(π_1 / (1 − π_1)) − (1/2) ln(|Σ_0| / |Σ_1|) + (1/2) m_1^T Σ_1^{−1} m_1 − (1/2) m_0^T Σ_0^{−1} m_0 = x^T (Σ_1^{−1} m_1 − Σ_0^{−1} m_0) − (1/2) x^T (Σ_1^{−1} − Σ_0^{−1}) x    (15)

which defines a quadratic equation in x.
So problem (1) has the solution

A = {X : X^T (Σ_1^{−1} m_1 − Σ_0^{−1} m_0) − (1/2) X^T (Σ_1^{−1} − Σ_0^{−1}) X ≥ b}    (16)

where b is chosen so that P({X : X^T (Σ_1^{−1} m_1 − Σ_0^{−1} m_0) − (1/2) X^T (Σ_1^{−1} − Σ_0^{−1}) X ≥ b}) = τ. In other words, b is the quantile of order 1 − τ of the random variable X^T (Σ_1^{−1} m_1 − Σ_0^{−1} m_0) − (1/2) X^T (Σ_1^{−1} − Σ_0^{−1}) X. ■
Parameters Σ_1, m_1, Σ_0 and m_0 are usually unknown and have to be estimated from data. As in the case of quadratic discriminant analysis we estimate

m̂_1 = (1/n_1) ∑_{i=1}^{n_1} x_{1i}    (17)

m̂_0 = (1/n_0) ∑_{i=1}^{n_0} x_{0i}    (18)

Σ̂_1 = [1/(n_1 − 1)] ∑_{i=1}^{n_1} (x_{1i} − m̂_1)(x_{1i} − m̂_1)^T    (19)

Σ̂_0 = [1/(n_0 − 1)] ∑_{i=1}^{n_0} (x_{0i} − m̂_0)(x_{0i} − m̂_0)^T    (20)
We may note that the optimal discriminant curve is parallel to the solution of quadratic discriminant analysis. In this case our model has the form:

• y* = x^T (Σ_1^{−1} m_1 − Σ_0^{−1} m_0) − (1/2) x^T (Σ_1^{−1} − Σ_0^{−1}) x = ∑_{i=1}^{k} a_i x_i + ∑_{j=1}^{k} ∑_{i=1}^{k} a_{ij} x_i x_j,
• if y* ≥ b then y = 1,
• if y* < b then y = 0.
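A corresponding sketch for Case 2, again with our own function names and test data: the per-class moments (17)-(20) are plugged into the quadratic score (12), and b is the empirical quantile of order 1 − τ of that score.

```python
import numpy as np

def quad_score(X, m1, S1, m0, S0):
    """y* = x^T(S1^{-1} m1 - S0^{-1} m0) - 0.5 x^T(S1^{-1} - S0^{-1}) x."""
    S1i, S0i = np.linalg.inv(S1), np.linalg.inv(S0)
    lin = X @ (S1i @ m1 - S0i @ m0)
    quad = 0.5 * np.einsum('ij,jk,ik->i', X, S1i - S0i, X)  # row-wise x'Mx
    return lin - quad

def fit_case2(X1, X0, tau):
    m1, m0 = X1.mean(axis=0), X0.mean(axis=0)
    S1 = np.cov(X1, rowvar=False)  # 1/(n1 - 1) normalization, as in (19)
    S0 = np.cov(X0, rowvar=False)  # 1/(n0 - 1) normalization, as in (20)
    s = quad_score(np.concatenate([X1, X0]), m1, S1, m0, S0)
    return (m1, S1, m0, S0), np.quantile(s, 1.0 - tau)

rng = np.random.default_rng(2)
X1 = rng.normal(loc=2.0, scale=1.0, size=(300, 2))   # Y = 1: tight cluster
X0 = rng.normal(loc=0.0, scale=3.0, size=(2700, 2))  # Y = 0: wide cluster
params, b = fit_case2(X1, X0, tau=0.10)
sel = quad_score(np.concatenate([X1, X0]), *params) >= b
y = np.r_[np.ones(300), np.zeros(2700)]
print(round(sel.mean(), 2), y[sel].mean() > y.mean())
```

Because the class covariances differ, the selected region is bounded by a quadratic rather than a linear surface.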
3.3 Case 3. Semiparametric model
In case we cannot make additional assumptions about the distribution of the predictors, problem (1) becomes NP-hard, so we must restrict ourselves to subsets A of a particular form. In this subsection we show the semiparametric optimal solution in the case where the subset A is bounded by a hyperplane:

A = {(X_1, . . . , X_k) : a_1 X_1 + a_2 X_2 + · · · + a_k X_k ≥ b} = {X : a^T X ≥ b}    (21)

for some a_1, a_2, . . . , a_k, b ∈ R. In other words, we want to construct a linear discriminant function. Since the inequality a^T X ≥ b still holds when multiplied by any positive constant, we impose the normalization ‖a‖ = 1.
Comment. Let us recall the definition of the maximum score estimator of Manski (1975):

max_a S_n^α = (1/n) ∑_{i=1}^{n} [(2Y_i − 1) − (1 − 2α)] 1(a^T x_i ≥ 0)    (22)

subject to ‖a‖ = 1    (23)

which in the case α = 1/2 takes the form

max_a S_n^{1/2} = (1/n) ∑_{i=1}^{n} (2Y_i − 1) 1(a^T x_i ≥ 0)    (24)

subject to ‖a‖ = 1    (25)

Our problem can be formulated as

max_{a,b} S_n = (1/n) ∑_{i=1}^{n} (2Y_i − 1) 1(a^T x_i ≥ b)    (26)

subject to ‖a‖ = 1 and b = q_{1−τ}(a^T x_1, . . . , a^T x_n)    (27)

where q_{1−τ}(·) denotes the empirical quantile of order 1 − τ. ■
Our model has the same form as in Case 1:

• y* = a_1 x_1 + · · · + a_k x_k,
• if y* ≥ b then y = 1,
• if y* < b then y = 0.
To derive the optimal solution let us recall its formulation:

max_{A∈A: P(A)=τ} P(Y = 1 | X ∈ A)    (28)

We may write

max_{A∈A: P(A)=τ} P(Y = 1 | X ∈ A)
  = max_{‖a‖=1, b: P(a^T X ≥ b)=τ} P(Y = 1 | a^T X ≥ b)
  = max_{‖a‖=1, b: P(a^T X ≥ b)=τ} P(Y = 1 ∧ a^T X ≥ b) / P(a^T X ≥ b)
  = max_{‖a‖=1, b: P(a^T X ≥ b)=τ} π_1 (1 − F_{a^T X|Y=1}(b)) / τ
  = max_{‖a‖=1, b: 1 − F_{a^T X}(b)=τ} (1 − F_{a^T X|Y=1}(b)) / τ    (29)

where in the last step the constant factor π_1 is dropped, since it does not affect the maximization.
In case the distributions f_1, f_0, and Bin(π_1) are known, problem (29) is a deterministic optimization problem of the form:

max_{a∈R^k, b∈R} (1 − F_{a^T X|Y=1}(b)) / τ    (30)

subject to

1 − F_{a^T X}(b) = τ    (31)
‖a‖ = 1    (32)
In case the distributions f_1, f_0, and Bin(π_1) are not known, they have to be estimated from data. The cumulative distribution functions F_{a^T X} and F_{a^T X|Y=1} can be estimated consistently by either the empirical cdf or a kernel cdf. In both cases problem (30)-(32) is replaced by

max_{a∈R^k, b∈R} (1 − F̂_{a^T X|Y=1}(b)) / τ    (33)

subject to

1 − F̂_{a^T X}(b) = τ    (34)
‖a‖ = 1    (35)

In the case of differentiable kernels this is a differentiable optimization problem and can be solved numerically by standard gradient techniques.
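To make the kernel-smoothed problem (33)-(35) concrete, here is a small numpy sketch for k = 2. It is our own construction, not the author's code: we use the logistic cdf as the smooth kernel, profile out b from constraint (34) by bisection, and replace gradient optimization with a crude grid search over the unit circle ‖a‖ = 1.

```python
import numpy as np

def K(z):
    """Logistic cdf used as a smooth kernel."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50, 50)))

def F_hat(s, b, h):
    """Kernel cdf estimate of P(a^T X <= b) from projected scores s."""
    return K((b - s) / h).mean()

def solve_b(s, tau, h):
    """Bisection for b in constraint (34): 1 - F_hat(b) = tau."""
    lo, hi = s.min() - 10 * h, s.max() + 10 * h
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if F_hat(s, mid, h) < 1 - tau else (lo, mid)
    return 0.5 * (lo + hi)

def fit_semiparametric(X, y, tau, h=1.0, grid=180):
    """Maximize objective (33) over a on the unit circle, with b profiled out."""
    best_val, best_a, best_b = -np.inf, None, None
    for theta in np.linspace(0.0, 2.0 * np.pi, grid, endpoint=False):
        a = np.array([np.cos(theta), np.sin(theta)])   # ||a|| = 1
        s = X @ a
        b = solve_b(s, tau, h)
        val = (1.0 - F_hat(s[y == 1], b, h)) / tau     # objective (33)
        if val > best_val:
            best_val, best_a, best_b = val, a, b
    return best_a, best_b

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(2.0, 1.0, size=(100, 2)),
               rng.normal(0.0, 1.0, size=(900, 2))])
y = np.r_[np.ones(100), np.zeros(900)]
a, b = fit_semiparametric(X, y, tau=0.10, h=0.5)
sel = X @ a >= b
print(round(sel.mean(), 2), y[sel].mean() > y.mean())
```

With differentiable kernels the same objective admits gradient-based optimization, as noted in the text; the grid search here is only for transparency in two dimensions.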
Let us denote by â and b̂ the solution of problem (33)-(35). The following theorem holds.

Theorem 3.3. Assume that

• (33)-(35) has almost surely one unique solution,
• (30)-(32) has one unique solution,
• F̂_{a^T x}(b) and F̂_{a^T x|Y=1}(b) are continuous functions of a, b and x,
• f_1 and f_0 are continuous.

Then â and b̂ are consistent estimators of a and b.

Lemma 3.1. If f : R^n × R^m → R is continuous and B ⊂ R^n is bounded and closed, then g(y) = max_{x∈B} f(x, y) is continuous.

Lemma 3.2. If X_n →^P X, f is continuous on A and P(X ∈ A) = 1, then f(X_n) →^P f(X).
Proof of Theorem 3.3. We may note that for any fixed a* and b*

(1 − F̂_{a*^T X|Y=1}(b*)) / τ →^P (1 − F_{a*^T X|Y=1}(b*)) / τ    (36)

1 − F̂_{a*^T X}(b*) →^P 1 − F_{a*^T X}(b*)    (37)

We may use Lemma 3.2 and Lemma 3.1 and write

max_{a∈R^k, b∈R} (1 − F̂_{a^T X|Y=1}(b)) / τ →^P max_{a∈R^k, b∈R} (1 − F_{a^T X|Y=1}(b)) / τ    (38)

■
Comment. Let us recall the definition of the smoothed maximum score estimator of Horowitz (1992) for α = 1/2:

max_a S_n^{1/2} = (1/n) ∑_{i=1}^{n} (2Y_i − 1) K(a^T x_i / h)    (39)

subject to ‖a‖ = 1    (40)

where K(·) is a smooth cdf. Our model can be formulated as

max_{a,b} S_n = (1/n) ∑_{i=1}^{n} Y_i K((a^T x_i − b) / h)    (41)

subject to ‖a‖ = 1 and (1/n) ∑_{i=1}^{n} K((a^T x_i − b) / h) = τ    (42)

■
4 Examples of model performance
In this section we show some examples of the model performance on artificial data. We compare our semiparametric model described in the previous section to logistic regression, which is also linear. We set τ = 10% , that is both models should find subpopulation of size 10% with highest fraction of Y = 1. Our model finds the best subpopulation directly. In case of logistic regression we find optimal τ % of observations by technique described in introduction, namely we calculate for each obervation p ˆ
i, the estimator of P (Y
i= 1), i = 1, . . . , n and choose τ % observations with highest p ˆ
i. The smoothing parameter h in semiparametric model is set to 1. There are 5%
observations with Y = 1 in the sample. The observations come from the mixture of multivariate
normal distributions.
4.0.1 Example 1
We generated the following sample:

(X_1, X_2) ∼ N((−1, −1)^T, diag(3, 3)), Y = 1, n = 110
(X_1, X_2) ∼ N((9, 9)^T, diag(3, 3)), Y = 1, n = 90
(X_1, X_2) ∼ N((10, 4)^T, diag(3, 3)), Y = 1, n = 100
(X_1, X_2) ∼ N((4, 4)^T, diag(10, 10)), Y = 0, n = 5700    (43)
The graph of this sample is shown in Figure 1.
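For reproducibility, a sample with the structure of (43) can be generated as follows (a numpy sketch; the random seed is arbitrary and not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
parts = [  # (mean, per-coordinate variance, Y, n), as in (43)
    ((-1.0, -1.0), 3.0, 1, 110),
    ((9.0, 9.0), 3.0, 1, 90),
    ((10.0, 4.0), 3.0, 1, 100),
    ((4.0, 4.0), 10.0, 0, 5700),
]
X = np.vstack([rng.normal(loc=m, scale=np.sqrt(v), size=(n, 2))
               for m, v, _, n in parts])
y = np.concatenate([np.full(n, lab) for _, _, lab, n in parts])
print(X.shape, y.mean())  # 6000 observations, share of Y = 1 equal to 5%
```

The diagonal covariance matrices in (43) reduce to independent coordinates with standard deviation sqrt(3) (respectively sqrt(10)), which is what the per-coordinate `scale` argument expresses.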
4.0.2 Example 2