Warsaw School of Economics Institute of Econometrics
Department of Applied Econometrics
Department of Applied Econometrics Working Papers
Warsaw School of Economics Al. Niepodleglosci 164 02-554 Warszawa, Poland
Working Paper No. 6-07
On modified discriminant analysis
Marcin Owczarczuk
Warsaw School of Economics
This paper is available at the Warsaw School of Economics
Department of Applied Econometrics website at: http://www.sgh.waw.pl/instytuty/zes/wp/
On modified discriminant analysis
Marcin Owczarczuk mo23628@sgh.waw.pl
22 May 2007
Abstract
Discriminant analysis is mostly used to predict the value of a discrete dependent variable of an observation on the basis of a set of predictors. The commonly used criterion of predictive power is the fraction of incorrectly predicted cases in the sample. In this article we construct a model for a modified discriminant problem: find a subpopulation of a given size having the highest percentage of observations of a chosen class. Our model maximizes the corresponding criterion of predictive power: the fraction of observations from the chosen class in the selected subpopulation.
Keywords: discriminant analysis, semiparametric estimation, smoothing, binary response.
JEL codes: C14, C35
1 Introduction
The aim of discriminant analysis is to predict the value of a discrete explained variable Y on the basis of explanatory variables X = (X_1, . . . , X_k). We may also treat discriminant analysis as a model whose aim is to split the space of observations into two regions: the first characterized by dominance of observations from class Y = 0 and the second by dominance of observations from class Y = 1.[1]
In this article we formulate the task in a different manner. Suppose we are given a parameter τ, a fraction of the population. We want to split the population, with respect to the explanatory variables, into two groups: the first with τ of all observations and the second with (1 − τ) of all observations. The group of size τ should have as high a percentage of observations characterized by Y = 1 as possible. In other words, among all subsets of size τ we want to find the one containing the highest number of observations from class Y = 1.
Our model is a modification of the maximum score estimator of Manski (1975) and the smoothed maximum score estimator of Horowitz (1992). In the case of their estimators the aim is to split the population into two groups, characterized by Y = 1 and Y = 0 respectively, in order to minimize the fraction of incorrectly classified observations in the sample.[2] We also want to minimize the number of incorrectly classified observations, but only in the group classified by the model as having Y = 1, with an additional condition on the size of this group.
This problem arises in many areas, for example in credit scoring and marketing campaigns. In these applications the aim is to separate a small, fixed-size group of clients with a relatively high probability of a positive value of the response variable. In the case of marketing campaigns one wants to find, on the basis of their features, the group of clients most likely to respond to the campaign, the target group. In the case of credit scoring the policy of the bank may be based on the assumption that a certain fraction of the worst clients, for example 5%, should not be granted a loan. In that case one should find the 5% of customers with the highest probability of default.
This problem can be solved approximately using p̂, an estimator of the probability P(Y = 1), calculated for example by logistic regression. We may use the following construction:

1. sort the observations in descending order of p̂,
2. choose the fraction τ of observations with the highest p̂.
[1] In this paper, for simplicity, we restrict ourselves to a binary explained variable and denote its levels by Y = 1 and Y = 0.
[2] Their estimators are parameterized by α ∈ (0, 1). For α = 1/2 the fraction of incorrectly classified observations in the sample is minimized.
This method has intuitive grounds: observations with higher p̂ are more likely to come from class Y = 1, so the group with high p̂ will contain more observations from class Y = 1. Unfortunately, we have no guarantee that the chosen group will be optimal.
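The two-step construction above can be sketched in Python. This is our own illustration, not the author's code: the function names and the synthetic data are invented, and the logistic fit is a plain Newton iteration rather than any particular package.

```python
import numpy as np

def logit_fit(X, y, iters=25):
    """Fit a logistic regression by Newton's method (intercept first).
    A minimal sketch: no regularization, no convergence checks."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(Z @ beta, -30, 30)))
        W = p * (1.0 - p)
        beta = beta + np.linalg.solve((Z.T * W) @ Z, Z.T @ (y - p))
    return beta

def top_fraction(X, y, tau):
    """Steps 1-2 from the text: rank by p_hat, keep the round(tau*n) highest."""
    Z = np.column_stack([np.ones(len(y)), X])
    p_hat = 1.0 / (1.0 + np.exp(-np.clip(Z @ logit_fit(X, y), -30, 30)))
    n_keep = int(round(tau * len(y)))
    return np.argsort(p_hat)[::-1][:n_keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 1).astype(float)
idx = top_fraction(X, y, tau=0.10)
print(len(idx), y[idx].mean() > y.mean())
```

The selected group is enriched in Y = 1 relative to the base rate, which illustrates the intuition; as the text notes, nothing guarantees this group is optimal.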
2 Problem formulation
We assume, as in the case of classical discriminant analysis for binary responses,[3] that X ∈ R^k and Y ∈ {0, 1}. X and Y are random variables.

We denote:

• A - the family of all measurable subsets of R^k,
• f_1 - the distribution of the random variable X|(Y = 1),
• f_0 - the distribution of the random variable X|(Y = 0),
• Bin(π_1) - the distribution of the random variable Y.

Formally, our goal is to find a subset A that maximizes

max_{A ∈ A: P(A) = τ} P(Y = 1 | X ∈ A)    (1)
We also assume that we have a random sample

x_{11}, . . . , x_{1n_1} for y = 1,
x_{01}, . . . , x_{0n_0} for y = 0,

with n = n_1 + n_0.

3 Model construction
In this section we show solutions to the problem described in the previous section.
3.1 Case 1. Predictors are normally distributed with common covariance matrix
This is probably the most common case. We assume that the predictors have a normal distribution with a common covariance matrix and possibly different means among classes. These assumptions are the same as in the case of Fisher linear discriminant analysis.
[3] See for example Hastie, Tibshirani, Friedman (2001).
• X|(Y = 1) ∼ N(m_1, Σ),[4]
• X|(Y = 0) ∼ N(m_0, Σ),
• Y ∼ Bin(π_1).
Theorem 3.1. Under the above assumptions the optimal solution to problem (1) is

A = {X : (m_1 − m_0)^T Σ^{−1} X ≥ b}    (2)

where b is the quantile of order 1 − τ of the random variable (m_1 − m_0)^T Σ^{−1} X.
Proof.

max_{A∈A: P(A)=τ} P(Y = 1 | X ∈ A)
  = max_{A∈A: P(A)=τ} P(Y = 1 ∧ X ∈ A) / P(X ∈ A)
  = max_{A∈A: P(A)=τ} [π_1 ∫_A f_1] / [π_1 ∫_A f_1 + (1 − π_1) ∫_A f_0]
  = max_{A∈A: P(A)=τ} 1 / (1 + [(1 − π_1) ∫_A f_0] / [π_1 ∫_A f_1])    (3)

Note that

argmax_{A∈A: P(A)=τ} 1 / (1 + [(1 − π_1) ∫_A f_0] / [π_1 ∫_A f_1])
  = argmax_{A∈A: P(A)=τ} [π_1 ∫_A f_1] / [(1 − π_1) ∫_A f_0]    (4)

Then observe that the ratio π_1 f_1(x) / [(1 − π_1) f_0(x)] has a constant value on the line

c = π_1 f_1(x) / [(1 − π_1) f_0(x)]
  = [π_1 (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − m_1)^T Σ^{−1} (x − m_1))] / [(1 − π_1) (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − m_0)^T Σ^{−1} (x − m_0))]
  = [π_1 exp(−(1/2)(x − m_1)^T Σ^{−1} (x − m_1))] / [(1 − π_1) exp(−(1/2)(x − m_0)^T Σ^{−1} (x − m_0))]    (5)

ln(c) = ln(π_1 / (1 − π_1)) − (1/2)(x − m_1)^T Σ^{−1} (x − m_1) + (1/2)(x − m_0)^T Σ^{−1} (x − m_0)
  = ln(π_1 / (1 − π_1)) + (m_1 − m_0)^T Σ^{−1} x − (1/2)(m_1 − m_0)^T Σ^{−1} (m_1 + m_0)    (6)

ln(c) − ln(π_1 / (1 − π_1)) + (1/2)(m_1 − m_0)^T Σ^{−1} (m_1 + m_0) = (m_1 − m_0)^T Σ^{−1} x    (7)

which defines a linear equation in x.

[4] W ∼ g denotes that the random variable W has distribution g.
So problem (1) has the solution

A = {X : (m_1 − m_0)^T Σ^{−1} X ≥ b}    (8)

where b is chosen so that P({X : (m_1 − m_0)^T Σ^{−1} X ≥ b}) = τ. In other words, b is the quantile of order 1 − τ of the random variable (m_1 − m_0)^T Σ^{−1} X. ■
Parameters Σ, m_1 and m_0 are usually unknown and have to be estimated from data. As in the case of Fisher discriminant analysis we estimate

m̂_1 = (1/n_1) ∑_{i=1}^{n_1} x_{1i}    (9)

m̂_0 = (1/n_0) ∑_{i=1}^{n_0} x_{0i}    (10)

Σ̂ = [1/(n − 2)] ∑_{k=0}^{1} ∑_{i=1}^{n_k} (x_{ki} − m̂_k)(x_{ki} − m̂_k)^T    (11)

We may note that the optimal discriminant line is parallel to the solution of Fisher linear discriminant analysis. Our model has the form:

• y* = (m_1 − m_0)^T Σ^{−1} x = a^T x = a_1 x_1 + · · · + a_k x_k,
• if y* ≥ b then y = 1,
• if y* < b then y = 0.
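The estimators (9)-(11) and the threshold rule (2) translate directly into code. Below is a minimal numpy sketch (the function name and the test data are ours, not from the paper); b is taken as the empirical quantile of order 1 − τ of the projected sample.

```python
import numpy as np

def fit_case1(X1, X0, tau):
    """Case 1 rule: a = Sigma^{-1}(m1 - m0), b = empirical (1 - tau)-quantile
    of a^T x. X1 holds rows with Y = 1, X0 rows with Y = 0."""
    m1, m0 = X1.mean(axis=0), X0.mean(axis=0)
    n1, n0 = len(X1), len(X0)
    # pooled covariance estimate, equation (11)
    S = ((X1 - m1).T @ (X1 - m1) + (X0 - m0).T @ (X0 - m0)) / (n1 + n0 - 2)
    a = np.linalg.solve(S, m1 - m0)       # a = Sigma^{-1}(m1 - m0)
    scores = np.concatenate([X1, X0]) @ a
    b = np.quantile(scores, 1.0 - tau)    # quantile of order 1 - tau
    return a, b

rng = np.random.default_rng(1)
X1 = rng.normal(loc=2.0, size=(200, 2))   # class Y = 1
X0 = rng.normal(loc=0.0, size=(1800, 2))  # class Y = 0
a, b = fit_case1(X1, X0, tau=0.10)
y = np.r_[np.ones(200), np.zeros(1800)]
sel = np.concatenate([X1, X0]) @ a >= b
print(round(sel.mean(), 2), y[sel].mean() > y.mean())
```

The selected group has size close to τ and a much higher share of Y = 1 than the base rate, as the theorem predicts for Gaussian classes with a common covariance.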
3.2 Case 2. Predictors are normally distributed with unequal covariance matrix
We assume that the predictors have a normal distribution but the covariance matrix differs among classes. These assumptions are the same as in the case of quadratic discriminant analysis.
• X|(Y = 1) ∼ N(m_1, Σ_1),
• X|(Y = 0) ∼ N(m_0, Σ_0),
• Y ∼ Bin(π_1).
Theorem 3.2. Under the above assumptions the optimal solution to problem (1) is

A = {X : X^T (Σ_1^{−1} m_1 − Σ_0^{−1} m_0) − (1/2) X^T (Σ_1^{−1} − Σ_0^{−1}) X ≥ b}    (12)

where b is the quantile of order 1 − τ of the random variable X^T (Σ_1^{−1} m_1 − Σ_0^{−1} m_0) − (1/2) X^T (Σ_1^{−1} − Σ_0^{−1}) X.
Proof. The proof is similar to the proof of Theorem 3.1. We may observe that the ratio π_1 f_1(x) / [(1 − π_1) f_0(x)] is constant on the curve

c = π_1 f_1(x) / [(1 − π_1) f_0(x)]
  = [π_1 (2π)^{−k/2} |Σ_1|^{−1/2} exp(−(1/2)(x − m_1)^T Σ_1^{−1} (x − m_1))] / [(1 − π_1) (2π)^{−k/2} |Σ_0|^{−1/2} exp(−(1/2)(x − m_0)^T Σ_0^{−1} (x − m_0))]    (13)

ln(c) = ln(π_1 / (1 − π_1)) + (1/2) ln(|Σ_0| / |Σ_1|) + x^T (Σ_1^{−1} m_1 − Σ_0^{−1} m_0) − (1/2) x^T (Σ_1^{−1} − Σ_0^{−1}) x − (1/2) m_1^T Σ_1^{−1} m_1 + (1/2) m_0^T Σ_0^{−1} m_0    (14)

ln(c) − ln(π_1 / (1 − π_1)) − (1/2) ln(|Σ_0| / |Σ_1|) + (1/2) m_1^T Σ_1^{−1} m_1 − (1/2) m_0^T Σ_0^{−1} m_0 = x^T (Σ_1^{−1} m_1 − Σ_0^{−1} m_0) − (1/2) x^T (Σ_1^{−1} − Σ_0^{−1}) x    (15)

which defines a quadratic equation in x.
So problem (1) has the solution

A = {X : X^T (Σ_1^{−1} m_1 − Σ_0^{−1} m_0) − (1/2) X^T (Σ_1^{−1} − Σ_0^{−1}) X ≥ b}    (16)

where b is chosen so that P({X : X^T (Σ_1^{−1} m_1 − Σ_0^{−1} m_0) − (1/2) X^T (Σ_1^{−1} − Σ_0^{−1}) X ≥ b}) = τ. In other words, b is the quantile of order 1 − τ of the random variable X^T (Σ_1^{−1} m_1 − Σ_0^{−1} m_0) − (1/2) X^T (Σ_1^{−1} − Σ_0^{−1}) X. ■
Parameters Σ_1, m_1, Σ_0 and m_0 are usually unknown and have to be estimated from data. As in the case of quadratic discriminant analysis we estimate

m̂_1 = (1/n_1) ∑_{i=1}^{n_1} x_{1i}    (17)

m̂_0 = (1/n_0) ∑_{i=1}^{n_0} x_{0i}    (18)

Σ̂_1 = [1/(n_1 − 1)] ∑_{i=1}^{n_1} (x_{1i} − m̂_1)(x_{1i} − m̂_1)^T    (19)

Σ̂_0 = [1/(n_0 − 1)] ∑_{i=1}^{n_0} (x_{0i} − m̂_0)(x_{0i} − m̂_0)^T    (20)
We may note that the optimal discriminant curve is parallel to the solution of quadratic discriminant analysis. In this case our model has the form:

• y* = x^T (Σ_1^{−1} m_1 − Σ_0^{−1} m_0) − (1/2) x^T (Σ_1^{−1} − Σ_0^{−1}) x = ∑_{i=1}^{k} a_i x_i + ∑_{j=1}^{k} ∑_{i=1}^{k} a_{ij} x_i x_j,
• if y* ≥ b then y = 1,
• if y* < b then y = 0.
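A corresponding sketch for Case 2, again with our own function names and test data: the per-class moments (17)-(20) are plugged into the quadratic score (12), and b is the empirical quantile of order 1 − τ of that score.

```python
import numpy as np

def quad_score(X, m1, S1, m0, S0):
    """y* = x^T(S1^{-1} m1 - S0^{-1} m0) - 0.5 x^T(S1^{-1} - S0^{-1}) x."""
    S1i, S0i = np.linalg.inv(S1), np.linalg.inv(S0)
    lin = X @ (S1i @ m1 - S0i @ m0)
    quad = 0.5 * np.einsum('ij,jk,ik->i', X, S1i - S0i, X)  # row-wise x'Mx
    return lin - quad

def fit_case2(X1, X0, tau):
    m1, m0 = X1.mean(axis=0), X0.mean(axis=0)
    S1 = np.cov(X1, rowvar=False)  # 1/(n1 - 1) normalization, as in (19)
    S0 = np.cov(X0, rowvar=False)  # 1/(n0 - 1) normalization, as in (20)
    s = quad_score(np.concatenate([X1, X0]), m1, S1, m0, S0)
    return (m1, S1, m0, S0), np.quantile(s, 1.0 - tau)

rng = np.random.default_rng(2)
X1 = rng.normal(loc=2.0, scale=1.0, size=(300, 2))   # Y = 1: tight cluster
X0 = rng.normal(loc=0.0, scale=3.0, size=(2700, 2))  # Y = 0: wide cluster
params, b = fit_case2(X1, X0, tau=0.10)
sel = quad_score(np.concatenate([X1, X0]), *params) >= b
y = np.r_[np.ones(300), np.zeros(2700)]
print(round(sel.mean(), 2), y[sel].mean() > y.mean())
```

Because the class covariances differ, the selected region is bounded by a quadratic rather than a linear surface.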
3.3 Case 3. Semiparametric model
In case we cannot make additional assumptions about the distribution of the predictors, problem (1) becomes NP-hard, so we must restrict ourselves to subsets A of a particular form. In this subsection we show the semiparametric optimal solution in the case where the subset A is bounded by a hyperplane:

A = {(X_1, . . . , X_k) : a_1 X_1 + a_2 X_2 + · · · + a_k X_k ≥ b} = {X : a^T X ≥ b}    (21)

for some a_1, a_2, . . . , a_k, b ∈ R. In other words, we want to construct a linear discriminant function. Since the inequality a^T X ≥ b still holds when multiplied by any positive constant, we impose the normalization ‖a‖ = 1.
Comment. Let us recall the definition of the maximum score estimator of Manski (1975):

max_a S_n^α = (1/n) ∑_{i=1}^{n} [(2Y_i − 1) − (1 − 2α)] 1(a^T x_i ≥ 0)    (22)

subject to ‖a‖ = 1    (23)

which in the case α = 1/2 takes the form

max_a S_n^{1/2} = (1/n) ∑_{i=1}^{n} (2Y_i − 1) 1(a^T x_i ≥ 0)    (24)

subject to ‖a‖ = 1    (25)

Our problem can be formulated as

max_{a,b} S_n = (1/n) ∑_{i=1}^{n} (2Y_i − 1) 1(a^T x_i ≥ b)    (26)

subject to ‖a‖ = 1 and b = q_{1−τ}(a^T x_1, . . . , a^T x_n)    (27)

where q_{1−τ}(·) denotes the empirical quantile of order 1 − τ. ■
Our model has the same form as in Case 1:

• y* = a_1 x_1 + · · · + a_k x_k,
• if y* ≥ b then y = 1,
• if y* < b then y = 0.
To derive the optimal solution let us recall its formulation:

max_{A∈A: P(A)=τ} P(Y = 1 | X ∈ A)    (28)

We may write

max_{A∈A: P(A)=τ} P(Y = 1 | X ∈ A)
  = max_{‖a‖=1, b: P(a^T X ≥ b)=τ} P(Y = 1 | a^T X ≥ b)
  = max_{‖a‖=1, b: P(a^T X ≥ b)=τ} P(Y = 1 ∧ a^T X ≥ b) / P(a^T X ≥ b)
  = max_{‖a‖=1, b: P(a^T X ≥ b)=τ} π_1 (1 − F_{a^T X|Y=1}(b)) / τ
  = max_{‖a‖=1, b: 1 − F_{a^T X}(b)=τ} (1 − F_{a^T X|Y=1}(b)) / τ    (29)

where in the last step the constant factor π_1 is dropped, since it does not affect the maximization.
In case the distributions f_1, f_0, and Bin(π_1) are known, problem (29) is a deterministic optimization problem of the form:

max_{a∈R^k, b∈R} (1 − F_{a^T X|Y=1}(b)) / τ    (30)

subject to

1 − F_{a^T X}(b) = τ    (31)
‖a‖ = 1    (32)
In case the distributions f_1, f_0, and Bin(π_1) are not known, they have to be estimated from data. The cumulative distribution functions F_{a^T X} and F_{a^T X|Y=1} can be estimated consistently by either the empirical cdf or a kernel cdf. In both cases problem (30)-(32) is replaced by

max_{a∈R^k, b∈R} (1 − F̂_{a^T X|Y=1}(b)) / τ    (33)

subject to

1 − F̂_{a^T X}(b) = τ    (34)
‖a‖ = 1    (35)

In the case of differentiable kernels this is a differentiable optimization problem and can be solved numerically by standard gradient techniques.
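To make the kernel-smoothed problem (33)-(35) concrete, here is a small numpy sketch for k = 2. It is our own construction, not the author's code: we use the logistic cdf as the smooth kernel, profile out b from constraint (34) by bisection, and replace gradient optimization with a crude grid search over the unit circle ‖a‖ = 1.

```python
import numpy as np

def K(z):
    """Logistic cdf used as a smooth kernel."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50, 50)))

def F_hat(s, b, h):
    """Kernel cdf estimate of P(a^T X <= b) from projected scores s."""
    return K((b - s) / h).mean()

def solve_b(s, tau, h):
    """Bisection for b in constraint (34): 1 - F_hat(b) = tau."""
    lo, hi = s.min() - 10 * h, s.max() + 10 * h
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if F_hat(s, mid, h) < 1 - tau else (lo, mid)
    return 0.5 * (lo + hi)

def fit_semiparametric(X, y, tau, h=1.0, grid=180):
    """Maximize objective (33) over a on the unit circle, with b profiled out."""
    best_val, best_a, best_b = -np.inf, None, None
    for theta in np.linspace(0.0, 2.0 * np.pi, grid, endpoint=False):
        a = np.array([np.cos(theta), np.sin(theta)])   # ||a|| = 1
        s = X @ a
        b = solve_b(s, tau, h)
        val = (1.0 - F_hat(s[y == 1], b, h)) / tau     # objective (33)
        if val > best_val:
            best_val, best_a, best_b = val, a, b
    return best_a, best_b

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(2.0, 1.0, size=(100, 2)),
               rng.normal(0.0, 1.0, size=(900, 2))])
y = np.r_[np.ones(100), np.zeros(900)]
a, b = fit_semiparametric(X, y, tau=0.10, h=0.5)
sel = X @ a >= b
print(round(sel.mean(), 2), y[sel].mean() > y.mean())
```

With differentiable kernels the same objective admits gradient-based optimization, as noted in the text; the grid search here is only for transparency in two dimensions.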
Let us denote by â and b̂ the solution of problem (33)-(35). The following theorem holds.

Theorem 3.3. Assume that

• (33)-(35) has almost surely one unique solution,
• (30)-(32) has one unique solution,
• F̂_{a^T x}(b) and F̂_{a^T x|Y=1}(b) are continuous functions of a, b and x,
• f_1 and f_0 are continuous.

Then â and b̂ are consistent estimators of a and b.

Lemma 3.1. If f : R^n × R^m → R is continuous and B ⊂ R^n is bounded and closed, then g(y) = max_{x∈B} f(x, y) is continuous.

Lemma 3.2. If X_n →^P X, f is continuous on A and P(X ∈ A) = 1, then f(X_n) →^P f(X).
Proof of Theorem 3.3. We may note that for any fixed a* and b*

(1 − F̂_{a*^T X|Y=1}(b*)) / τ →^P (1 − F_{a*^T X|Y=1}(b*)) / τ    (36)

1 − F̂_{a*^T X}(b*) →^P 1 − F_{a*^T X}(b*)    (37)

We may use Lemma 3.2 and Lemma 3.1 and write

max_{a∈R^k, b∈R} (1 − F̂_{a^T X|Y=1}(b)) / τ →^P max_{a∈R^k, b∈R} (1 − F_{a^T X|Y=1}(b)) / τ    (38)

■
Comment. Let us recall the definition of the smoothed maximum score estimator of Horowitz (1992) for α = 1/2:

max_a S_n^{1/2} = (1/n) ∑_{i=1}^{n} (2Y_i − 1) K(a^T x_i / h)    (39)

subject to ‖a‖ = 1    (40)

where K(·) is a smooth cdf. Our model can be formulated as

max_{a,b} S_n = (1/n) ∑_{i=1}^{n} Y_i K((a^T x_i − b) / h)    (41)

subject to ‖a‖ = 1 and (1/n) ∑_{i=1}^{n} K((a^T x_i − b) / h) = τ    (42)

■
4 Examples of model performance
In this section we show some examples of the model performance on artificial data. We compare our semiparametric model described in the previous section to logistic regression, which is also linear. We set τ = 10% , that is both models should find subpopulation of size 10% with highest fraction of Y = 1. Our model finds the best subpopulation directly. In case of logistic regression we find optimal τ % of observations by technique described in introduction, namely we calculate for each obervation p ˆ
i, the estimator of P (Y
i= 1), i = 1, . . . , n and choose τ % observations with highest p ˆ
i. The smoothing parameter h in semiparametric model is set to 1. There are 5%
observations with Y = 1 in the sample. The observations come from the mixture of multivariate
normal distributions.
4.0.1 Example 1
We generated the following sample:

(X_1, X_2) ∼ N((−1, −1)^T, diag(3, 3)), Y = 1, n = 110
(X_1, X_2) ∼ N((9, 9)^T, diag(3, 3)), Y = 1, n = 90
(X_1, X_2) ∼ N((10, 4)^T, diag(3, 3)), Y = 1, n = 100
(X_1, X_2) ∼ N((4, 4)^T, diag(10, 10)), Y = 0, n = 5700    (43)
The graph of this sample is shown in Figure 1.
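For reproducibility, a sample with the structure of (43) can be generated as follows (a numpy sketch; the random seed is arbitrary and not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
parts = [  # (mean, per-coordinate variance, Y, n), as in (43)
    ((-1.0, -1.0), 3.0, 1, 110),
    ((9.0, 9.0), 3.0, 1, 90),
    ((10.0, 4.0), 3.0, 1, 100),
    ((4.0, 4.0), 10.0, 0, 5700),
]
X = np.vstack([rng.normal(loc=m, scale=np.sqrt(v), size=(n, 2))
               for m, v, _, n in parts])
y = np.concatenate([np.full(n, lab) for _, _, lab, n in parts])
print(X.shape, y.mean())  # 6000 observations, share of Y = 1 equal to 5%
```

The diagonal covariance matrices in (43) reduce to independent coordinates with standard deviation sqrt(3) (respectively sqrt(10)), which is what the per-coordinate `scale` argument expresses.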
4.0.2 Example 2