Ensembles of Decision Rules —
General Framework for Rule Induction
Krzysztof Dembczyński, Jerzy Błaszczyński, Wojciech Kotłowski, Roman Słowiński, Marcin Szeląg
Intelligent Decision Support Systems, Institute of Computing Science, Poznań University of Technology
Those who ignore Statistics are condemned to reinvent it
A decision rule is a logical expression of the form: if [conditions], then [decision].
If an object satisfies the conditions of the rule, then it is assigned to the recommended class.
Decision rules were common in the early machine learning approaches (AQ, CN2, RIPPER)
The most popular decision rule induction algorithms are based on a sequential covering procedure (also known as the separate-and-conquer strategy).
Decision rule models are widely considered in the Rough Set approaches to knowledge discovery and in Logical Analysis of Data, where they are called patterns. The wide interest in decision rules may be explained by their simplicity and ease of interpretation.
However, it seems that decision trees are much more popular in machine learning approaches.
Ensembles of decision rules described here follow a specific and original approach to decision rule generation:
A single rule is treated as a subsidiary, base classifier in the ensemble that indicates only one of the decision classes.
Ensemble methods became a very popular and efficient approach to machine learning problems.
Ensembles consist in forming committees of simple learning and classification procedures often referred to as base (or weak) learners (or classifiers).
The ensemble members are applied to a prediction task and their individual outputs are then aggregated to one output of the whole ensemble.
The aggregation is computed as a linear combination of outputs.
The most popular base learners are decision trees. There are several approaches to construction of the ensemble, like bagging and boosting.
Ensembles are often treated as off-the-shelf methods of choice.
Ensembles of decision rules
Ensemble consists of single decision rules.
A variant of Forward Stagewise Additive Modeling (by Hastie, Tibshirani and Friedman) is used in construction of the ensemble.
A single rule is created in each iteration of Forward Stagewise Additive Modeling.
The rules are used in the prediction procedure through a linear combination of their outputs.
Ensembles of decision rules
Ensembles of decision rules are competitive with other machine learning methods.
The rules are easy to interpret.
The algorithm is characterized by low computational cost.
The approach is very flexible, for example, one can
There are some similar approaches:
RuleFit (by Friedman and Popescu): based on Forward Stagewise Additive Modeling; decision trees are used as base classifiers, and then each node (interior and terminal) of each resulting tree produces a rule, set up as the conjunction of conditions associated with all of the edges on the path from the root to that node; the rule ensemble is fitted by gradient directed regularization.
SLIPPER (by Cohen and Singer): uses the AdaBoost scheme, a specific case of Forward Stagewise Additive Modeling, to produce an ensemble of decision rules.
Lightweight Rule Induction (by Weiss and Indurkhya): uses a specific reweighting scheme and DNF formulas for single rules.
Ensembles of decision rules can be seen as a connection of three very important issues in machine learning:
induction of decision rules by sequential covering algorithms,
boosting weak learners, gradient boosting machines.
In our opinion the methodology is original; however, most of the theoretical results are based on those achieved by Friedman, Popescu, Hastie, Tibshirani, Schapire and Freund.
The originality comes from the fact that we are applying a specific weak learner that is a single decision rule.
Our goal is to present a general framework for rule induction.
Outline
1 Problem Statement
2 Ensembles of Decision Rules
The aim is to:
predict the unknown value of an attribute y (called output, response variable or decision attribute) of an object using the known joint values of other attributes (called predictors, condition attributes or independent variables) x = (x_1, x_2, …, x_n).
The goal of a learning task is to find a function F(x), using a set of training examples {y_i, x_i}_1^N, that accurately predicts y.
The optimal prediction procedure is given by:

F*(x) = arg min_{F(x)} E_{y,x} L(y, F(x))

where the expected value E_{y,x} is over the joint distribution of all attributes (y, x) for the data to be predicted.
L(y, F(x)) is the loss or cost for predicting F(x) when the actual value is y.
The learning procedure tries to construct F(x) to be the best possible approximation of F∗(x).
We consider the binary classification problem, in which y ∈ {−1, 1}, and the regression problem, in which y ∈ ℝ.
The typical loss in classification tasks is the 0-1 loss:

L(y, F(x)) = 0 if y = F(x), 1 if y ≠ F(x).

The typical loss in regression tasks is the squared loss:

L(y, F(x)) = (y − F(x))²
Loss functions in classification tasks:

Sigmoid loss: L(y, F(x)) = 1 / (1 + exp(β · y · F(x)))
SVM loss: L(y, F(x)) = 0 if y · F(x) ≥ 1, and 1 − y · F(x) if y · F(x) < 1
Exponential loss: L(y, F(x)) = exp(−β · y · F(x))
Binomial log-likelihood loss: L(y, F(x)) = log(1 + exp(−β · y · F(x)))

[Figure: 0-1, sigmoid, SVM, exponential and binomial log-likelihood losses plotted against y · F(x).]
Loss functions in regression tasks:

Least absolute deviation: L(y, F(x)) = |y − F(x)|
SVM regression loss: L(y, F(x)) = 0 if |y − F(x)| < 1, and |y − F(x)| − 1 if |y − F(x)| ≥ 1
Huber loss: L(y, F(x)) = ½ (y − F(x))² if |y − F(x)| ≤ σ, and σ(|y − F(x)| − σ/2) if |y − F(x)| > σ.

[Figure: squared, least absolute deviation, SVM regression and Huber losses plotted against y − F(x).]
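The loss functions above are easy to write down directly. A minimal sketch in Python; the default values β = 1 and σ = 1 are illustrative assumptions, not values fixed by the method:

```python
import math

# Sketch of the classification and regression losses listed above.
# beta and sigma play the same role as on the slides; the defaults
# here are illustrative assumptions.

def zero_one_loss(y, f):
    return 0.0 if y == f else 1.0

def sigmoid_loss(y, f, beta=1.0):
    return 1.0 / (1.0 + math.exp(beta * y * f))

def svm_loss(y, f):
    # hinge: 0 when y*f >= 1, linear penalty otherwise
    return max(0.0, 1.0 - y * f)

def exponential_loss(y, f, beta=1.0):
    return math.exp(-beta * y * f)

def squared_loss(y, f):
    return (y - f) ** 2

def absolute_loss(y, f):
    return abs(y - f)

def huber_loss(y, f, sigma=1.0):
    r = abs(y - f)
    return 0.5 * r * r if r <= sigma else sigma * (r - sigma / 2.0)
```

The Huber loss switches from quadratic to linear at σ, which makes it robust to outliers while staying differentiable.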
Linear models:

Linear models are among the most popular for data fitting:

F(x) = a_0 + Σ_{m=1}^M a_m · f_m(x)

where {a_m}_0^M are parameters to be estimated and {f_m}_1^M may be the original measured variables and/or selected functions constructed from them.
Linear models:

The parameters of the linear model are estimated through:

{â_m}_0^M = arg min_{{a_m}_0^M} Σ_{i=1}^N L(y_i, a_0 + Σ_{m=1}^M a_m · f_m(x_i)) + λ · P({a_m}_1^M)

where the first term measures the loss on the training sample, and the second term is used for regularization.
λ controls the degree of regularization.
Commonly employed penalty functions P({a_m}_1^M) are:

P_1({a_m}_1^M) = Σ_{m=1}^M |a_m|
P_2({a_m}_1^M) = Σ_{m=1}^M |a_m|²
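The penalized criterion above can be evaluated directly. A minimal sketch, where the names `objective`, `l1_penalty` and `l2_penalty` are illustrative, not part of any library:

```python
# Sketch of the regularized fitting criterion above.

def l1_penalty(a):
    # P1: sum of absolute values of a_1..a_M (intercept a_0 not penalized)
    return sum(abs(am) for am in a[1:])

def l2_penalty(a):
    # P2: sum of squares of a_1..a_M
    return sum(am * am for am in a[1:])

def objective(a, basis, data, loss, lam, penalty):
    # a[0] is the intercept a_0; basis holds the functions f_1..f_M;
    # data is a list of (x_i, y_i) pairs
    total = 0.0
    for x, y in data:
        fx = a[0] + sum(am * f(x) for am, f in zip(a[1:], basis))
        total += loss(y, fx)
    return total + lam * penalty(a)
```

Minimizing this objective over the coefficients gives the estimates {â_m}_0^M; the choice of P_1 vs. P_2 determines how strongly small coefficients are pushed to exactly zero.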
A decision rule is the simplest and most comprehensible representation of knowledge, in the form of a logical expression:
if [conditions], then [decision].
Example
if duration ≥ 31.5
and savings status ≠ no known savings
and savings status ∉ (500, 1000)
and checking status ≠ no checking account
and checking status < 200
and employment ≠ unemployed
and purpose = furniture/equipment,
then customer = bad
Decision rule:
The condition part of a decision rule is represented by a complex:

Φ = φ_1 ∧ φ_2 ∧ … ∧ φ_t,

where each φ is a selector and t is the number of selectors in the complex (i.e., the length of the rule).
A selector φ is defined as x_j ∝ v_j, where v_j is a value or a subset of values from the domain of the j-th attribute, and ∝ is specified as =, ≠, ∈, ≤ or ≥, depending on the type of the j-th attribute. Objects covered by a complex Φ are denoted by cov(Φ) and referred to as the cover of the complex Φ.
The decision part of a rule indicates one of the decision classes and is denoted by dec = d, where d = 1 or d = −1 in the simplest case.
Decision rule:
A decision rule, denoted r(x, c), where c = (Φ, dec), is defined as:

r(x, c) = d if x ∈ cov(Φ), 0 if x ∉ cov(Φ).
In the simplest form, the loss of a single decision rule (the 0-1-ℓ loss) takes the following form:

L(y, r(x, c)) = 0 if y · r(x, c) > 0, 1 if y · r(x, c) < 0, ℓ if r(x, c) = 0,

where ℓ ∈ [0, 1] is a penalty for specificity of the rule: the lower the value of ℓ, the smaller the number of objects covered by the rule from the opposite class.
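A rule and its 0-1-ℓ loss are small enough to sketch directly. The representation below (the complex Φ as a list of selector predicates plus a decision d) is an illustrative assumption, not the authors' implementation:

```python
# Hypothetical sketch: the complex Phi is a list of selector predicates
# over the attribute vector x, and d is the decision (+1 or -1).

def rule_output(x, selectors, d):
    # r(x, c): d if x is covered by the complex, 0 otherwise
    return d if all(phi(x) for phi in selectors) else 0

def loss_01l(y, r, ell=0.5):
    # 0-1-ell loss of a single rule; ell in [0, 1] penalizes abstention
    if r == 0:
        return ell
    return 0.0 if y * r > 0 else 1.0
```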
Ensembles of decision rules

input: set of training examples {y_i, x_i}_1^N, M – number of decision rules.
output: ensemble of decision rules {r_m(x)}_1^M.

F_0(x) := arg min_α Σ_{i=1}^N L(y_i, α); or F_0(x) := 0;
F_0(x) := ν · F_0(x);
for m = 1 to M do
    c_m := arg min_c Σ_{i∈S_m(η)} L(y_i, F_{m-1}(x_i) + r(x_i, c));
    r_m(x) = r(x, c_m);
    F_m(x) = F_{m-1}(x) + ν · r_m(x);
end
ensemble = {r_m(x)}_1^M;
Forward Stagewise Additive Modeling is a general framework suited to simulate ensemble approaches: bagging, random forest, boosting.
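The main loop above can be sketched in a few lines of Python. Here `minimize_rule` is a hypothetical stand-in for the greedy rule-induction heuristic described later, and the other names are illustrative:

```python
import random

# Sketch of the Forward Stagewise Additive Modeling loop above.
# data is a list of (y_i, x_i) pairs; minimize_rule is a stand-in for
# the greedy rule-induction heuristic and returns a callable rule.

def build_ensemble(data, M, loss, minimize_rule, nu=0.5, eta=None, f0=0.0):
    n = len(data)
    eta = eta if eta is not None else n
    F = [nu * f0] * n                            # F_0(x_i), shrunk by nu
    rules = []
    for m in range(M):
        subsample = random.sample(range(n), eta)  # S_m(eta), no replacement
        rule = minimize_rule(data, F, subsample, loss)
        rules.append(rule)
        for i, (y, x) in enumerate(data):
            F[i] += nu * rule(x)                 # F_m = F_{m-1} + nu * r_m
    return rules
```

Each rule is fit against the residual behavior of the current ensemble F_{m-1}, so later rules concentrate on examples the earlier ones handle poorly.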
Hastie, T., Tibshirani, R., Friedman, J. H.: The Elements of Statistical Learning. Springer (2003)
Friedman, J. H., Popescu, B. E.: Importance Sampled Learning Ensembles. Dept. of Statistics, Stanford University Technical Report, http://www-stat.stanford.edu/~jhf/, September (2003)
S_m(η) represents a different subsample of size η ≤ N, randomly drawn with or without replacement from the original training data.
ν ∈ [0, 1] is a shrinkage parameter that determines the degree to which previously generated decision rules r_k(x), k = 1, …, m, affect the successive one in the sequence, i.e., r_{m+1}(x).
F_0(x) := arg min_α Σ_{i=1}^N L(y_i, α) defines a default rule; F_0(x) := 0 means that there is no default rule.
Ensembles of decision rules

In each consecutive iteration m we augment the function F_{m-1}(x) by one additional rule r_m(x), weighted by the shrinkage parameter ν.
This gives a linear combination of rules F_m(x).
The additional rule r_m(x) = r(x, c) is chosen to minimize

Σ_{i∈S_m(η)} L(y_i, F_{m-1}(x_i) + r(x_i, c)).
The prediction procedure is performed according to:

F(x) = a_0 + Σ_{m=1}^M a_m · r_m(x).

It is a linear classifier in a very high dimensional space of derived decision rules, which are highly nonlinear functions of the original predictor variables x.
Parameters {a_m}_0^M can be obtained in many ways:
set to fixed values, for example a_0 = 0 and {a_m = 1/M}_1^M,
computed by some optimization techniques,
fitted in cross-validation experiments,
estimated in the process of constructing the ensemble (as in AdaBoost).
The prediction procedure is performed according to:

F(x) = a_0 + Σ_{m=1}^M a_m · r_m(x).

In the case of decision rules, parameters {a_m}_1^M can be identified with the output of a single rule r_m(x), i.e.:

r_m(x) = a_m · d_m if x ∈ cov(Φ), 0 if x ∉ cov(Φ),

where a_m ∈ ℝ₊ and, usually, d_m ∈ {−1, 1}.
Ensembles of decision rules

There are two crucial elements of the algorithm to be chosen: the loss function, and the heuristic constructing a single decision rule.
Greedy heuristic constructing a single decision rule

Search for c such that:

L_m = Σ_{i∈S_m(η)} L(y_i, F_{m-1}(x_i) + r(x_i, c))

is minimal.
At the beginning there is an empty rule: the complex of the rule is empty (no selectors are specified) and the decision is not determined.
In each next step, a new selector is added to the complex and the decision of the rule is determined.
The selector and the decision are chosen to give the minimal value of L_m.
Greedy heuristic constructing a single decision rule

Search for c such that:

L_m = Σ_{i∈S_m(η)} L(y_i, F_{m-1}(x_i) + r(x_i, c))

is minimal.
The minimal value of L_m is a natural stop criterion in building a single rule.
Additionally, another stop criterion can be introduced, for example, a limit on the length of the rule.
Ensembles of decision rules with different loss functions:
0-1-ℓ loss
Sigmoid loss
Exponential loss
Squared loss (regression rules)
Binomial log-likelihood loss
0-1-ℓ loss

L_{0-1-ℓ}(y, F_m(x)) = 0 if y · F_m(x) > 0, 1 if y · F_m(x) < 0, ℓ if y · F_m(x) = 0.

Population minimizer:

F*(x) = arg min_{F(x)} E_{y|x} L_{0-1-ℓ}(y, F(x)) = d if max_{d∈{−1,1}} P(y = d|x) > 1 − ℓ, 0 if max_{d∈{−1,1}} P(y = d|x) ≤ 1 − ℓ.
The prediction procedure is performed according to:

F(x) = sign(a_0 + Σ_{m=1}^M a_m · r_m(x)).
Classification rules: r(x, c) = d if x ∈ cov(Φ), 0 if x ∉ cov(Φ), where d ∈ {−1, 1}.
In each iteration m, the following expression has to be minimized:

Σ_{y_i·(F_{m-1}(x_i)+r(x_i,c))<0} 1 + ℓ · Σ_{F_{m-1}(x_i)+r(x_i,c)=0} 1
Sequential covering (separate-and-conquer)
Learn a rule that covers a part of the given training examples, remove the covered examples from the training set (the separate part) and recursively learn another rule that covers some of the remaining examples (the conquer part) until no examples remain.
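The separate-and-conquer loop above can be sketched in a few lines. Here `learn_one_rule` is a hypothetical heuristic returning a pair (covers, decision), where covers(x) tests the rule's conditions:

```python
# Sketch of sequential covering (separate-and-conquer). learn_one_rule
# is a stand-in heuristic; to guarantee termination it must cover at
# least one of the remaining examples.

def sequential_covering(examples, learn_one_rule):
    rules = []
    remaining = list(examples)
    while remaining:
        covers, decision = learn_one_rule(remaining)              # conquer
        rules.append((covers, decision))
        remaining = [(y, x) for y, x in remaining if not covers(x)]  # separate
    return rules
```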
Fürnkranz, J.: Separate-and-Conquer Rule Learning. Artificial Intelligence Review 13 (1996) 3–54
0-1-ℓ loss and sequential covering

The 0-1-ℓ loss gives a procedure similar to sequential covering, because the loss of training examples covered by one rule is already equal to 0, and there is no need to look for another rule covering them, i.e., it corresponds to removing them from the set of training examples.
Ensembles of decision rules – simple sequential covering

input: set of training examples {y_i, x_i}_1^N, M – number of decision rules.
output: ensemble of decision rules {r_m(x)}_1^M.

F_0(x) := 0;
for m = 1 to M = N do
    c_m := arg min_c Σ_{i∈S_m(η)} L_{0-1-ℓ}(y_i, F_{m-1}(x_i) + r(x_i, c));
    r_m(x) = r(x, c_m);
    F_m(x) = F_{m-1}(x) + ν · r_m(x);
end
ensemble = {r_m(x)}_1^M;

L(y_i, F(x)) = L_{0-1-ℓ}(y_i, F(x)), ℓ < 1/N,
the decision of the rules is set to d = 1 (positive examples), F_0(x) := 0, M = N, ν = 1, η = N.
Ensembles of decision rules and ModLEM

input: set of training examples X = {y_i, x_i}_1^N, set of positive examples X̂ ⊂ X.
output: set of decision rules {r_m(x)}_1^M.

F_0(x) = 0; m = 0;
while Σ_{i∈X̂} L(y_i, F_m(x_i, c)) ≠ 0 do
    m = m + 1;
    c_m = arg min_c Σ_{i∈X̂} L_{0-1-ℓ}(y_i, F_{m-1}(x_i) + r(x_i, c));
    r_m(x) = r(x, c_m);
    F_m(x) = F_{m-1}(x) + r_m(x);
end
M = m; rules = {r_m(x)}_1^M;

L(y_i, F(x)) = L_{0-1-ℓ}(y_i, F(x)), ℓ < 1/N,
a decision rule covers only positive examples,
F(x) = sign(Σ_{m=1}^M mat(r_m(x)) · sup(r_m(x)) · spe(r_m(x)) · r_m(x)).
Ensemble of decision rules – sequential covering II

input: set of training examples {y_i, x_i}_1^N, M – number of decision rules.
output: ensemble of decision rules {r_m(x)}_1^M.

F_0(x) := 0;
for m = 1 to M = N do
    c := arg min_c Σ_{i∈S_m(η)} L_{0-1-ℓ}(y_i, F_{m-1}(x_i) + r(x_i, c));
    r_m(x) = r(x, c);
    F_m(x) = F_{m-1}(x) + 2^{M−m} · r_m(x);
end
ensemble = {r_m(x)}_1^M;

L(y_i, F(x)) = L_{0-1-ℓ}(y_i, F(x)), ℓ < 1/N, d ∈ {−1, 1}, F_0(x) := 0, M = N, ν = 2^{M−m},
F(x) = sign(a_0 + Σ_{m=1}^M a_m · r_m(x)).
Sigmoid loss

L_sigm(y, F_m(x)) = 1 / (1 + exp(β · y · F_m(x)))

Population minimizer:

F*(x) = arg min_{F(x)} E_{y|x} L_sigm(y, F(x)) = +∞ if P(y = 1|x) > 1/2, −∞ if P(y = −1|x) > 1/2, default otherwise.
The prediction procedure is performed according to:

F(x) = sign(a_0 + Σ_{m=1}^M a_m · r_m(x)).
Classification rules: r(x, c) = d if x ∈ cov(Φ), 0 if x ∉ cov(Φ), where d ∈ {−1, 1}.
In each iteration m, the following expression has to be minimized:

Σ_{y_i·r(x_i,c)>0} 1 / (1 + exp(β · (y_i · F_{m-1}(x_i) + d))) + Σ_{y_i·r(x_i,c)<0} 1 / (1 + exp(β · (y_i · F_{m-1}(x_i) − d))) + Σ_{r(x_i,c)=0} 1 / (1 + exp(β · y_i · F_{m-1}(x_i)))
The sigmoid loss is a relaxed form of the 0-1-ℓ loss function, but its statistical interpretation is hard.
Our first (and promising) experiments were performed using sigmoid loss.
Exponential loss

L_exp(y, F_m(x)) = exp(−β · y · F_m(x))

Population minimizer:

F*(x) = arg min_{F(x)} E_{y|x} L_exp(y, F(x)) = (1/2) log( P(y = 1|x) / P(y = −1|x) )
The prediction procedure is performed according to:

F(x) = a_0 + Σ_{m=1}^M a_m · r_m(x), and P(y = 1|x) = 1 / (1 + exp(−2 F(x))).
Classification rules: r(x, c) = α if x ∈ cov(Φ), 0 if x ∉ cov(Φ), where α ∈ {−d, d} with d = const, or α ∈ ℝ.
For α ∈ {−d, d}, one has to minimize in each iteration m:

Σ_{y_i·r(x_i,c)>0} exp(−β · (y_i · F_{m-1}(x_i) + d)) + Σ_{y_i·r(x_i,c)<0} exp(−β · (y_i · F_{m-1}(x_i) − d)) + Σ_{r(x_i,c)=0} exp(−β · y_i · F_{m-1}(x_i)),

where exp(−β · y_i · F_{m-1}(x_i)) can be treated as w_i^(m) (i.e., the weight of the i-th training example in the m-th iteration).
The above leads to:

Σ_{y_i·r(x_i,c)<0} w_i^(m) + ℓ · Σ_{r(x_i,c)=0} w_i^(m)

where ℓ = (1 − e^{−d}) / (e^d − e^{−d}), i.e., d = log((1 − ℓ)/ℓ).
For α ∈ ℝ, the heuristic constructing a single rule has to minimize:

2 √( Σ_{y_i·r(x_i,c)>0} w_i^(m) · Σ_{y_i·r(x_i,c)<0} w_i^(m) ) + Σ_{r(x_i,c)=0} w_i^(m)

and the final output of the rule is:

α = (1/2) log( Σ_{y_i·r(x_i,c)>0} w_i^(m) / Σ_{y_i·r(x_i,c)<0} w_i^(m) )
One can reformulate

α = (1/2) log( Σ_{y_i·r(x_i,c)>0} w_i^(m) / Σ_{y_i·r(x_i,c)<0} w_i^(m) )

to the following form:

α = (1/2) log( P(y = 1|r(x, c)) / P(y = −1|r(x, c)) )

which is related also to a confirmation measure:

l(E|H) = log( P(E|H) / P(E|¬H) ) = log( (P(H|E) · P(¬H)) / (P(¬H|E) · P(H)) )

where E is the evidence, H is the hypothesis, and P(H) and P(¬H) are constant in classification tasks.
Ensembles of Decision Rules and AdaBoost

F_m(x) is updated according to:

F_m(x) = F_{m-1}(x) + α · r_m(x)

which causes the weights for the next iteration to be:

w_i^(m+1) = w_i^(m) · exp(−α · y_i · r_m(x_i))

which gives, in fact, the reweighting scheme well known from AdaBoost.
Parameters {2α}_1^M are used in AdaBoost to weigh the weak learners.
AdaBoost uses a linear classifier combining weak learners. A decision rule is a specific base classifier that exceeds a random classifier: the space covered by the rule has small error, while the uncovered space can be treated as a space of coin tossing.
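The reweighting step above can be written down directly as a sketch:

```python
import math

# Sketch of the AdaBoost-style reweighting above: after adding rule r_m
# with output alpha, each weight is multiplied by exp(-alpha * y_i * r_m(x_i)).
# Examples not covered by the rule (r_m(x_i) = 0) keep their weight.

def update_weights(weights, ys, rule_outputs, alpha):
    return [w * math.exp(-alpha * y * r)
            for w, y, r in zip(weights, ys, rule_outputs)]
```

Correctly classified covered examples are down-weighted, misclassified ones up-weighted, so the next rule focuses on what the current ensemble still gets wrong.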
SLIPPER (by Cohen and Singer)

Uses the AdaBoost scheme (i.e., uses exponential loss and minimizes with respect to α and c),
decision rules are generated only for positive examples,
there is a default rule,
the heuristic constructing a single rule applies post-pruning.
Cohen, W. W., Singer, Y.: A simple, fast, and effective rule learner. Proc. of 16th National Conference on Artificial Intelligence (1999) 335–342
Lightweight Rule Induction (by Weiss and Indurkhya)

The output of decision rules is r_m(x) ∈ {0, 1},
uses a specific reweighting scheme:

w_i^(m) = 1 + ( Σ_{j=1}^{m-1} I(y_i ≠ r_j(x_i, c)) )³

uses DNF formulas for single rules,
decision rules are induced separately for each class (the same number of rules for each class).
Weiss S., M., Indurkhya, N.: Lightweight Rule Induction. Proc. of 17th International Conference on Machine Learning (2000) 1135–1142
Lightweight Rule Induction (by Weiss and Indurkhya)

The heuristic constructing a single rule minimizes (for r(x_i, c) ∈ {0, 1}):

Σ_{y_i·r(x_i,c)=−1} w_i^(m) + k · Σ_{r(x_i,c)=0 ∧ y_i=1} w_i^(m),

where k = 1, 2, 4, … changes during the construction of a rule, Σ_{y_i·r(x_i,c)=−1} w_i^(m) are the weighted false positives, and Σ_{r(x_i,c)=0 ∧ y_i=1} w_i^(m) are the weighted false negatives.
Ensembles of Decision Rules and Lightweight Rule Induction

For k = const, minimization of

Σ_{y_i·r(x_i,c)=−1} w_i^(m) + k · Σ_{r(x_i,c)=0 ∧ y_i=1} w_i^(m)

is equivalent to minimization of

Σ_{y_i·r(x_i,c)=−1} w_i^(m) + ℓ · Σ_{r(x_i,c)=0} w_i^(m)

where ℓ = k / (k + 1).
Ensembles of Decision Rules and Lightweight Rule Induction

Taking several assumptions like:
k = const,
exponential example reweighting,
1-DNF formulas for single decision rules,
we obtain an algorithm similar to ensembles of decision rules with exponential loss and d = const.
Squared loss (regression rules)

L_squared(y, F_m(x)) = (y − F_m(x))²

Population minimizer:

F*(x) = arg min_{F(x)} E_{y|x} L_squared(y, F(x)) = E(y|x)
The prediction procedure is performed according to:

F(x) = a_0 + Σ_{m=1}^M a_m · r_m(x).
Regression rules: r(x, c) = α if x ∈ cov(Φ), 0 if x ∉ cov(Φ), where α ∈ ℝ.
In each iteration m, the following expression has to be minimized:

Σ_{r(x_i,c)=0} (y_i − F_{m-1}(x_i))² + Σ_{r(x_i,c)≠0} (y_i − F_{m-1}(x_i) − α)²

where α = Σ_{r(x_i,c)≠0} (y_i − F_{m-1}(x_i)) / Σ_{r(x_i,c)≠0} 1.
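Under squared loss the optimal α has the closed form above: the mean residual over the covered examples. A minimal sketch:

```python
# Sketch: under squared loss, the optimal output alpha of a regression
# rule is the mean residual y_i - F_{m-1}(x_i) over the covered examples.

def regression_rule_alpha(ys, F_prev, covered):
    residuals = [y - f for y, f, c in zip(ys, F_prev, covered) if c]
    return sum(residuals) / len(residuals)
```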
Regression Rules

Approximation of f(x) = 3x⁵ − x⁴ − x, model built on 41 observations.

[Figures: the approximation of f(x) with an increasing number of regression rules — 2, 6, 11, 21, 31, 41 and 51 rules.]
Binomial log-likelihood

L_log(y, F_m(x)) = log(1 + exp(−β · y · F_m(x)))

Population minimizer:

F*(x) = arg min_{F(x)} E_{y|x} L_log(y, F(x)) = (1/2) log( P(y = 1|x) / P(y = −1|x) )
The prediction procedure is performed according to:

F(x) = a_0 + Σ_{m=1}^M a_m · r_m(x), and P(y = 1|x) = 1 / (1 + exp(−2 F(x))).
Classification rules: r(x, c) = α if x ∈ cov(Φ), 0 if x ∉ cov(Φ), where α ∈ {−d, d} with d = const, or α ∈ ℝ.
For α ∈ {−d, d}, one has to minimize in each iteration m:

Σ_{r(x_i,c)=0} log(1 + exp(−β · y_i · F_{m-1}(x_i))) + Σ_{y_i·r(x_i,c)>0} log(1 + exp(−β · (y_i · F_{m-1}(x_i) + d))) + Σ_{y_i·r(x_i,c)<0} log(1 + exp(−β · (y_i · F_{m-1}(x_i) − d))),

where exp(−β · y_i · F_{m-1}(x_i)) can be treated as w_i^(m) (i.e., the weight of the i-th training example in the m-th iteration).
For α ∈ ℝ, there is no straightforward optimization; the solution is to use gradient boosting machines.

Friedman, J. H.: Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics 29 (2001) 1189–1232
In each iteration m, build a regression rule minimizing:

Σ_{r(x_i,c)=0} (ỹ_i)² + Σ_{r(x_i,c)≠0} (ỹ_i − α)²

where ỹ_i is a pseudo-response defined as:

ỹ_i = −∂L(y_i, F(x_i))/∂F(x_i) |_{F(x)=F_{m-1}(x)} = 2 y_i / (1 + exp(2 y_i F_{m-1}(x_i))).

The final output of the rule is α = Σ_{r(x_i,c)≠0} ỹ_i / Σ_{r(x_i,c)≠0} 1.
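The gradient-boosting step for this loss is small enough to sketch; the function names are illustrative:

```python
import math

# Sketch of the gradient boosting step for binomial log-likelihood loss:
# the pseudo-response is the negative gradient of the loss at F_{m-1},
# and the rule output alpha is its mean over the covered examples.

def pseudo_response(y, f_prev):
    # -dL/dF evaluated at F_{m-1}, matching the 2y/(1+exp(2yF)) formula
    return 2.0 * y / (1.0 + math.exp(2.0 * y * f_prev))

def gradient_rule_alpha(ys, F_prev, covered):
    vals = [pseudo_response(y, f) for y, f, c in zip(ys, F_prev, covered) if c]
    return sum(vals) / len(vals)
```

Fitting the rule to the pseudo-responses by least squares is exactly the squared-loss machinery of the previous section, now applied to the gradient of the log-likelihood.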
Ensembles of decision rules and Gradient Boosting Trees

In each iteration m of gradient boosting trees one looks for:

({γ_jm, c_jm}_1^J) = arg min_{{γ_j,c_j}_1^J} Σ_{i=1}^N L(y_i, F_{m-1}(x_i) + Σ_{j=1}^J γ_j · r(x_i, c_j))

where {r(x_i, c_j)}_1^J represents all paths from the root to the leaves of the tree.
In the ensembles of decision rules, each rule r_m(x, c) and its output α are optimized separately:

(α_m, c_m) = arg min_{α,c} Σ_{i=1}^N L(y_i, F_{m-1}(x_i) + α · r(x_i, c))
Conclusions

Several properties of ensembles of decision rules in comparison to decision tree ensembles:
a natural stop criterion in building a single rule,
the decision rule r_m(x) and its output α are optimized individually, taking into account all previously generated rules {r_k(x)}_1^{m−1},
parameters {a_m}_1^M correspond to the outputs of single rules {α_m}_1^M – one parameter to be optimized.
One can treat the ensembles of decision rules as an extension of sequential covering.
Software was written and experiments were performed using the Weka package.
Classifiers used in the experiment

Classifier (Weka options) | Abbrev.
NaiveBayes | NB
Logistic -R 1.0E-8 -M -1 | Log
RBFNetwork -B 2 -S 1 -R 1.0E-8 -M -1 -W 0.1 | RBF
SMO -C 1.0 -E 1.0 -G 0.01 -A 250007 -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 | SMO
IBk -K 5 -W 0 | IBL
AdaBoostM1 -P 100 -S 1 -I 50 -W DecisionStump | AB DS
AdaBoostM1 -P 100 -S 1 -I 50 -W REPTree -M 2 -V 0.0010 -N 3 -S 1 -L -1 | AB RT
AdaBoostM1 -P 100 -S 1 -I 50 -W J48 -C 0.25 -M 2 | AB J48
AdaBoostM1 -P 100 -S 1 -I 50 -W PART -M 2 -C 0.25 -Q 1 | AB PT
Bagging -P 100 -S 1 -I 50 -W REPTree -M 2 -V 0.0010 -N 3 -S 1 -L -1 | B RT
Bagging -P 100 -S 1 -I 50 -W J48 -C 0.25 -M 2 | B J48
Bagging -P 100 -S 1 -I 50 -W PART -M 2 -C 0.25 -Q 1 | B PT
LogitBoost -P 100 -F 0 -R 1 -L -1.8e308 -H 1.0 -S 1 -I 50 -W DecisionStump | LB DS
LogitBoost -P 100 -F 0 -R 1 -L -1.8e308 -H 1.0 -S 1 -I 50 -W REPTree -M 2 -V 0.0010 -N 3 -S 1 -L -1 | LB RT
J48 -C 0.25 -M 2 | J48
RandomForest -I 50 -K 0 -S 1 | RF
PART -M 2 -C 0.25 -Q 1 | PT
Ensemble of Decision Rules: L_sigm(y, F_m(x)), M = 50, bootstrap sample of size η = N, ν = 0.5, a_0 = F_0(x), {a_m}_1^M = 1 | EDR
Data sets included in the experiment

Data set | Attributes | Class −1 | Class 1
German Credit (credit-g) | 21 | 300 | 700
Pima Indians Diabetes (diabetes) | 9 | 268 | 500
Heart Statlog (heart-statlog) | 14 | 120 | 150
J. Hopkins University Ionosphere (ionosphere) | 35 | 126 | 225
King+Rook vs. King+Pawn on a7 (kr-vs-kp) | 37 | 1527 | 1669
Credit-g – leave-one-out estimate (accuracy [%])
[Bar chart, accuracy 66–78%, classifiers in order: PT, J48, AB DS, LB RT, RBF, AB RT, B J48, IBL, AB J48, AB PT, SMO, B RT, NB, Log, EDR, RF, B PT, LB DS.]

Diabetes – leave-one-out estimate (accuracy [%])
[Bar chart, accuracy 66–78%, classifiers in order: PT, AB J48, AB RT, RBF, J48, IBL, LB DS, B PT, RF, B RT, LB RT, EDR, NB, AB PT, B J48, AB DS, SMO, Log.]

Heart-statlog – leave-one-out estimate (accuracy [%])
[Bar chart, accuracy 74–84%, classifiers in order: J48, AB RT, PT, AB J48, LB RT, IBL, B J48, LB DS, RBF, B PT, RF, EDR, AB DS, B RT, NB, Log, SMO, AB PT.]

Ionosphere – leave-one-out estimate (accuracy [%])
[Bar chart, accuracy 80–90%, classifiers in order: NB, IBL, J48, SMO, Log, PT, B RT, LB DS, LB RT, RBF, AB RT, AB PT, B PT, AB DS, B J48, RF, AB J48, EDR.]

kr-vs-kp – leave-one-out estimate (accuracy [%])
[Bar chart, accuracy 80–100%, classifiers in order: LB RT, RBF, NB, AB DS, EDR, SMO, LB DS, IBL, Log, PT, B RT, AB RT, B J48, RF, B PT, J48, AB J48, AB PT.]

Sonar – leave-one-out estimate (accuracy [%])
[Bar chart, accuracy 65–90%, classifiers in order: NB, J48, RBF, LB RT, PT, Log, B RT, SMO, AB RT, B J48, EDR, IBL, LB DS, AB DS, B PT, AB J48, AB PT, RF.]
Future plans and conclusions

Promising first steps in the research on ensembles of decision rules, but still a lot to do:
exhaustive experimental research,
missing values,
interpretation of rules,
robust loss functions,
multi-class problems,
ordinal classification problems,
ranking problems.

Ensembles of decision rules are flexible, powerful, fast and interpretable.