Ensembles of Decision Rules —
General Framework for Rule Induction
Krzysztof Dembczyński, Jerzy Błaszczyński, Wojciech Kotłowski, Roman Słowiński, Marcin Szeląg
Intelligent Decision Support Systems, Institute of Computing Science, Poznań University of Technology
Those who ignore Statistics are condemned to reinvent it
A decision rule is a logical expression of the form: if [conditions], then [decision].
If an object satisfies the conditions of the rule, then it is assigned to the recommended class.
Decision rules were common in the early machine learning approaches (AQ, CN2, RIPPER)
The most popular decision rule induction algorithms are based on a sequential covering procedure (also known as the separate-and-conquer strategy).
Decision rule models are widely considered in the Rough Set approaches to knowledge discovery and in Logical Analysis of Data, where they are called patterns. The wide interest in decision rules may be explained by their simplicity and ease of interpretation.
However, it seems that decision trees are much more popular in machine learning approaches.
Ensembles of decision rules described here follow a specific and original approach to decision rule generation:
A single rule is treated as a subsidiary, base classifier in the ensemble that indicates only one of the decision classes.
Ensemble methods became a very popular and efficient approach to machine learning problems.
Ensembles consist in forming committees of simple learning and classification procedures often referred to as base (or weak) learners (or classifiers).
The ensemble members are applied to a prediction task and their individual outputs are then aggregated to one output of the whole ensemble.
The aggregation is computed as a linear combination of outputs.
The most popular base learners are decision trees. There are several approaches to construction of the ensemble, like bagging and boosting.
Ensembles are often treated as off-the-shelf methods of choice.
Ensembles of decision rules
Ensemble consists of single decision rules.
A variant of Forward Stagewise Additive Modeling (by Hastie, Tibshirani and Friedman) is used in construction of the ensemble.
A single rule is created in each iteration of Forward Stagewise Additive Modeling.
The rules are used in the prediction procedure through a linear combination of their outputs.
Ensembles of decision rules
Ensembles of decision rules are competitive with other machine learning methods.
The rules are easy to interpret.
The algorithm is characterized by low computational cost.
The approach is very flexible, for example, one can
There are some similar approaches:
RuleFit (by Friedman and Popescu): based on Forward Stagewise Additive Modeling; decision trees are used as base classifiers, and then each node (interior and terminal) of each resulting tree produces a rule, set up as the conjunction of conditions associated with all of the edges on the path from the root to that node; the rule ensemble is fitted by gradient directed regularization.
SLIPPER (by Cohen and Singer): uses the AdaBoost scheme, a specific case of Forward Stagewise Additive Modeling, to produce an ensemble of decision rules.
Lightweight Rule Induction (by Weiss and Indurkhya): uses a specific reweighting scheme and DNF formulas for single rules.
Ensembles of decision rules can be seen as a connection of three very important issues in machine learning:
induction of decision rules by sequential covering algorithms,
boosting weak learners, gradient boosting machines.
In our opinion the methodology is original; however, most of the theoretical results are based on those achieved by Friedman, Popescu, Hastie, Tibshirani, Schapire and Freund.
The originality comes from the fact that we are applying a specific weak learner that is a single decision rule.
Our goal is to present a general framework for rule induction.
Outline
1 Problem Statement
2 Ensembles of Decision Rules
The aim is to:
predict the unknown value of an attribute y (called output, response variable or decision attribute) of an object using the known joint values of other attributes (called predictors, condition attributes or independent variables) x = (x_1, x_2, …, x_n).
The goal of a learning task is to find a function F(x), using a set of training examples {y_i, x_i}_1^N, that accurately predicts y.
The optimal prediction procedure is given by:

F*(x) = arg min_{F(x)} E_{y,x} L(y, F(x))

where the expected value E_{y,x} is over the joint distribution of all attributes (y, x) for the data to be predicted.
L(y, F(x)) is the loss or cost for predicting F(x) when the actual value is y.
The learning procedure tries to construct F(x) to be the best possible approximation of F∗(x).
We consider the binary classification problem, in which y ∈ {−1, 1}, and the regression problem, in which y ∈ ℝ.
The typical loss in classification tasks is the 0-1 loss:

L(y, F(x)) = 0 if y = F(x), 1 if y ≠ F(x).

The typical loss in regression tasks is the squared loss:

L(y, F(x)) = (y − F(x))²
Loss functions in classification tasks:

Sigmoid loss: L(y, F(x)) = 1 / (1 + exp(β · y · F(x)))
SVM loss: L(y, F(x)) = 0 if y · F(x) ≥ 1, and 1 − y · F(x) if y · F(x) < 1
Exponential loss: L(y, F(x)) = exp(−β · y · F(x))
Binomial log-likelihood loss: L(y, F(x)) = log(1 + exp(−β · y · F(x)))

[Figure: 0-1, sigmoid, SVM, exponential and binomial log-likelihood losses plotted against y · F(x).]
Loss functions in regression tasks:

Least absolute deviation: L(y, F(x)) = |y − F(x)|
SVM regression loss: L(y, F(x)) = 0 if |y − F(x)| < 1, and |y − F(x)| − 1 if |y − F(x)| ≥ 1
Huber loss: L(y, F(x)) = ½ (y − F(x))² if |y − F(x)| ≤ σ, and σ(|y − F(x)| − σ/2) if |y − F(x)| > σ.

[Figure: squared, least absolute deviation, SVM regression and Huber losses plotted against y − F(x).]
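The loss functions above are easy to write down directly. A minimal sketch in Python; the default values β = 1 and σ = 1 are illustrative assumptions, not values fixed by the method:

```python
import math

# Sketch of the classification and regression losses listed above.
# beta and sigma play the same role as on the slides; the defaults
# here are illustrative assumptions.

def zero_one_loss(y, f):
    return 0.0 if y == f else 1.0

def sigmoid_loss(y, f, beta=1.0):
    return 1.0 / (1.0 + math.exp(beta * y * f))

def svm_loss(y, f):
    # hinge: 0 when y*f >= 1, linear penalty otherwise
    return max(0.0, 1.0 - y * f)

def exponential_loss(y, f, beta=1.0):
    return math.exp(-beta * y * f)

def squared_loss(y, f):
    return (y - f) ** 2

def absolute_loss(y, f):
    return abs(y - f)

def huber_loss(y, f, sigma=1.0):
    r = abs(y - f)
    return 0.5 * r * r if r <= sigma else sigma * (r - sigma / 2.0)
```

The Huber loss switches from quadratic to linear at σ, which makes it robust to outliers while staying differentiable.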
Linear models:

Linear models are among the most popular for data fitting:

F(x) = a_0 + Σ_{m=1}^M a_m · f_m(x)

where {a_m}_0^M are parameters to be estimated and {f_m}_1^M may be the original measured variables and/or selected functions constructed from them.
Linear models:

The parameters of the linear model are estimated through:

{â_m}_0^M = arg min_{{a_m}_0^M} Σ_{i=1}^N L(y_i, a_0 + Σ_{m=1}^M a_m · f_m(x_i)) + λ · P({a_m}_1^M)

where the first term measures the loss on the training sample, and the second term is used for regularization.
λ controls the degree of regularization.
Commonly employed penalty functions P({a_m}_1^M) are:

P_1({a_m}_1^M) = Σ_{m=1}^M |a_m|
P_2({a_m}_1^M) = Σ_{m=1}^M |a_m|²
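The penalized criterion above can be evaluated directly. A minimal sketch, where the names `objective`, `l1_penalty` and `l2_penalty` are illustrative, not part of any library:

```python
# Sketch of the regularized fitting criterion above.

def l1_penalty(a):
    # P1: sum of absolute values of a_1..a_M (intercept a_0 not penalized)
    return sum(abs(am) for am in a[1:])

def l2_penalty(a):
    # P2: sum of squares of a_1..a_M
    return sum(am * am for am in a[1:])

def objective(a, basis, data, loss, lam, penalty):
    # a[0] is the intercept a_0; basis holds the functions f_1..f_M;
    # data is a list of (x_i, y_i) pairs
    total = 0.0
    for x, y in data:
        fx = a[0] + sum(am * f(x) for am, f in zip(a[1:], basis))
        total += loss(y, fx)
    return total + lam * penalty(a)
```

Minimizing this objective over the coefficients gives the estimates {â_m}_0^M; the choice of P_1 vs. P_2 determines how strongly small coefficients are pushed to exactly zero.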
A decision rule is the simplest and most comprehensible representation of knowledge, in the form of a logical expression:
if [conditions], then [decision].
Example
if duration ≥ 31.5
and savings status ≠ no known savings
and savings status ∉ (500, 1000)
and checking status ≠ no checking account
and checking status < 200
and employment ≠ unemployed
and purpose = furniture/equipment,
then customer = bad
Decision rule:
The condition part of a decision rule is represented by a complex:

Φ = φ_1 ∧ φ_2 ∧ … ∧ φ_t,

where each φ is a selector and t is the number of selectors in the complex (i.e., the length of the rule).
A selector φ is defined as x_j ∝ v_j, where v_j is a value or a subset of values from the domain of the j-th attribute, and ∝ is specified as =, ≠, ∈, ≤ or ≥, depending on the type of the j-th attribute. Objects covered by a complex Φ are denoted by cov(Φ) and referred to as the cover of the complex Φ.
The decision part of a rule indicates one of the decision classes and is denoted by dec = d, where d = 1 or d = −1 in the simplest case.
Decision rule:
A decision rule, denoted r(x, c), where c = (Φ, dec), is defined as:

r(x, c) = d if x ∈ cov(Φ), 0 if x ∉ cov(Φ).
In the simplest form, the loss of a single decision rule (the 0-1-ℓ loss) takes the following form:

L(y, r(x, c)) = 0 if y · r(x, c) > 0, 1 if y · r(x, c) < 0, ℓ if r(x, c) = 0,

where ℓ ∈ [0, 1] is a penalty for specificity of the rule: the lower the value of ℓ, the smaller the number of objects covered by the rule from the opposite class.
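A rule and its 0-1-ℓ loss are small enough to sketch directly. The representation below (the complex Φ as a list of selector predicates plus a decision d) is an illustrative assumption, not the authors' implementation:

```python
# Hypothetical sketch: the complex Phi is a list of selector predicates
# over the attribute vector x, and d is the decision (+1 or -1).

def rule_output(x, selectors, d):
    # r(x, c): d if x is covered by the complex, 0 otherwise
    return d if all(phi(x) for phi in selectors) else 0

def loss_01l(y, r, ell=0.5):
    # 0-1-ell loss of a single rule; ell in [0, 1] penalizes abstention
    if r == 0:
        return ell
    return 0.0 if y * r > 0 else 1.0
```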
Ensembles of decision rules

input: set of training examples {y_i, x_i}_1^N, M – number of decision rules.
output: ensemble of decision rules {r_m(x)}_1^M.

F_0(x) := arg min_α Σ_{i=1}^N L(y_i, α); or F_0(x) := 0;
F_0(x) := ν · F_0(x);
for m = 1 to M do
    c_m := arg min_c Σ_{i∈S_m(η)} L(y_i, F_{m-1}(x_i) + r(x_i, c));
    r_m(x) = r(x, c_m);
    F_m(x) = F_{m-1}(x) + ν · r_m(x);
end
ensemble = {r_m(x)}_1^M;
Forward Stagewise Additive Modeling is a general framework suited to simulate ensemble approaches: bagging, random forest, boosting.
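The main loop above can be sketched in a few lines of Python. Here `minimize_rule` is a hypothetical stand-in for the greedy rule-induction heuristic described later, and the other names are illustrative:

```python
import random

# Sketch of the Forward Stagewise Additive Modeling loop above.
# data is a list of (y_i, x_i) pairs; minimize_rule is a stand-in for
# the greedy rule-induction heuristic and returns a callable rule.

def build_ensemble(data, M, loss, minimize_rule, nu=0.5, eta=None, f0=0.0):
    n = len(data)
    eta = eta if eta is not None else n
    F = [nu * f0] * n                            # F_0(x_i), shrunk by nu
    rules = []
    for m in range(M):
        subsample = random.sample(range(n), eta)  # S_m(eta), no replacement
        rule = minimize_rule(data, F, subsample, loss)
        rules.append(rule)
        for i, (y, x) in enumerate(data):
            F[i] += nu * rule(x)                 # F_m = F_{m-1} + nu * r_m
    return rules
```

Each rule is fit against the residual behavior of the current ensemble F_{m-1}, so later rules concentrate on examples the earlier ones handle poorly.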
Hastie, T., Tibshirani, R., Friedman, J. H.: The Elements of Statistical Learning. Springer (2003)
Friedman, J. H., Popescu, B. E.: Importance Sampled Learning Ensembles. Dept. of Statistics, Stanford University Technical Report, http://www-stat.stanford.edu/~jhf/, September (2003)
S_m(η) represents a different subsample of size η ≤ N, randomly drawn with or without replacement from the original training data.
ν ∈ [0, 1] is a shrinkage parameter that determines the degree to which previously generated decision rules r_k(x), k = 1, …, m, affect the successive one in the sequence, i.e., r_{m+1}(x).
F_0(x) := arg min_α Σ_{i=1}^N L(y_i, α) defines a default rule; F_0(x) := 0 means that there is no default rule.
Ensembles of decision rules

In each consecutive iteration m we augment the function F_{m-1}(x) by one additional rule r_m(x), weighted by the shrinkage parameter ν.
This gives a linear combination of rules F_m(x).
The additional rule r_m(x) = r(x, c) is chosen to minimize

Σ_{i∈S_m(η)} L(y_i, F_{m-1}(x_i) + r(x_i, c)).
The prediction procedure is performed according to:

F(x) = a_0 + Σ_{m=1}^M a_m · r_m(x).

It is a linear classifier in a very high dimensional space of derived decision rules, which are highly nonlinear functions of the original predictor variables x.
Parameters {a_m}_0^M can be obtained in many ways:
set to fixed values, for example a_0 = 0 and {a_m = 1/M}_1^M,
computed by some optimization techniques,
fitted in cross-validation experiments,
estimated in the process of constructing the ensemble (as in AdaBoost).
The prediction procedure is performed according to:

F(x) = a_0 + Σ_{m=1}^M a_m · r_m(x).

In the case of decision rules, parameters {a_m}_1^M can be identified with the output of a single rule r_m(x), i.e.:

r_m(x) = a_m · d_m if x ∈ cov(Φ), 0 if x ∉ cov(Φ),

where a_m ∈ ℝ₊ and, usually, d_m ∈ {−1, 1}.
Ensembles of decision rules

There are two crucial elements of the algorithm to be chosen: the loss function, and the heuristic constructing a single decision rule.
Greedy heuristic constructing a single decision rule

Search for c such that:

L_m = Σ_{i∈S_m(η)} L(y_i, F_{m-1}(x_i) + r(x_i, c))

is minimal.
At the beginning there is an empty rule: the complex of the rule is empty (no selectors are specified) and the decision is not determined.
In each next step, a new selector is added to the complex and the decision of the rule is determined.
The selector and the decision are chosen to give the minimal value of L_m.
Greedy heuristic constructing a single decision rule

Search for c such that:

L_m = Σ_{i∈S_m(η)} L(y_i, F_{m-1}(x_i) + r(x_i, c))

is minimal.
The minimal value of L_m is a natural stop criterion in building a single rule.
Additionally, another stop criterion can be introduced, for example, a limit on the length of the rule.
Ensembles of decision rules with different loss functions:
0-1-ℓ loss
Sigmoid loss
Exponential loss
Squared loss (regression rules)
Binomial log-likelihood loss
0-1-ℓ loss

L_{0-1-ℓ}(y, F_m(x)) = 0 if y · F_m(x) > 0, 1 if y · F_m(x) < 0, ℓ if y · F_m(x) = 0.

Population minimizer:

F*(x) = arg min_{F(x)} E_{y|x} L_{0-1-ℓ}(y, F(x)) = d if max_{d∈{−1,1}} P(y = d|x) > 1 − ℓ, 0 if max_{d∈{−1,1}} P(y = d|x) ≤ 1 − ℓ.
The prediction procedure is performed according to:

F(x) = sign(a_0 + Σ_{m=1}^M a_m · r_m(x)).
Classification rules: r(x, c) = d if x ∈ cov(Φ), 0 if x ∉ cov(Φ), where d ∈ {−1, 1}.
In each iteration m, the following expression has to be minimized:

Σ_{y_i·(F_{m-1}(x_i)+r(x_i,c))<0} 1 + ℓ · Σ_{F_{m-1}(x_i)+r(x_i,c)=0} 1
Sequential covering (separate-and-conquer)
Learn a rule that covers a part of the given training examples, remove the covered examples from the training set (the separate part) and recursively learn another rule that covers some of the remaining examples (the conquer part) until no examples remain.
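The separate-and-conquer loop above can be sketched in a few lines. Here `learn_one_rule` is a hypothetical heuristic returning a pair (covers, decision), where covers(x) tests the rule's conditions:

```python
# Sketch of sequential covering (separate-and-conquer). learn_one_rule
# is a stand-in heuristic; to guarantee termination it must cover at
# least one of the remaining examples.

def sequential_covering(examples, learn_one_rule):
    rules = []
    remaining = list(examples)
    while remaining:
        covers, decision = learn_one_rule(remaining)              # conquer
        rules.append((covers, decision))
        remaining = [(y, x) for y, x in remaining if not covers(x)]  # separate
    return rules
```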
Fürnkranz, J.: Separate-and-Conquer Rule Learning. Artificial Intelligence Review 13 (1996) 3–54
0-1-ℓ loss and sequential covering

The 0-1-ℓ loss gives a procedure similar to sequential covering, because the loss of training examples covered by one rule is already equal to 0, and there is no need to look for another rule covering them, i.e., it corresponds to removing them from the set of training examples.
Ensembles of decision rules – simple sequential covering

input: set of training examples {y_i, x_i}_1^N, M – number of decision rules.
output: ensemble of decision rules {r_m(x)}_1^M.

F_0(x) := 0;
for m = 1 to M = N do
    c_m := arg min_c Σ_{i∈S_m(η)} L_{0-1-ℓ}(y_i, F_{m-1}(x_i) + r(x_i, c));
    r_m(x) = r(x, c_m);
    F_m(x) = F_{m-1}(x) + ν · r_m(x);
end
ensemble = {r_m(x)}_1^M;

L(y_i, F(x)) = L_{0-1-ℓ}(y_i, F(x)), ℓ < 1/N,
the decision of the rules is set to d = 1 (positive examples), F_0(x) := 0, M = N, ν = 1, η = N.
Ensembles of decision rules and ModLEM

input: set of training examples X = {y_i, x_i}_1^N, set of positive examples X̂ ⊂ X.
output: set of decision rules {r_m(x)}_1^M.

F_0(x) = 0; m = 0;
while Σ_{i∈X̂} L(y_i, F_m(x_i, c)) ≠ 0 do
    m = m + 1;
    c_m = arg min_c Σ_{i∈X̂} L_{0-1-ℓ}(y_i, F_{m-1}(x_i) + r(x_i, c));
    r_m(x) = r(x, c_m);
    F_m(x) = F_{m-1}(x) + r_m(x);
end
M = m; rules = {r_m(x)}_1^M;

L(y_i, F(x)) = L_{0-1-ℓ}(y_i, F(x)), ℓ < 1/N,
a decision rule covers only positive examples,
F(x) = sign(Σ_{m=1}^M mat(r_m(x)) · sup(r_m(x)) · spe(r_m(x)) · r_m(x)).
Ensemble of decision rules – sequential covering II

input: set of training examples {y_i, x_i}_1^N, M – number of decision rules.
output: ensemble of decision rules {r_m(x)}_1^M.

F_0(x) := 0;
for m = 1 to M = N do
    c := arg min_c Σ_{i∈S_m(η)} L_{0-1-ℓ}(y_i, F_{m-1}(x_i) + r(x_i, c));
    r_m(x) = r(x, c);
    F_m(x) = F_{m-1}(x) + 2^{M−m} · r_m(x);
end
ensemble = {r_m(x)}_1^M;

L(y_i, F(x)) = L_{0-1-ℓ}(y_i, F(x)), ℓ < 1/N, d ∈ {−1, 1}, F_0(x) := 0, M = N, ν = 2^{M−m},
F(x) = sign(a_0 + Σ_{m=1}^M a_m · r_m(x)).
Sigmoid loss

L_sigm(y, F_m(x)) = 1 / (1 + exp(β · y · F_m(x)))

Population minimizer:

F*(x) = arg min_{F(x)} E_{y|x} L_sigm(y, F(x)) = +∞ if P(y = 1|x) > 1/2, −∞ if P(y = −1|x) > 1/2, default otherwise.
The prediction procedure is performed according to:

F(x) = sign(a_0 + Σ_{m=1}^M a_m · r_m(x)).
Classification rules: r(x, c) = d if x ∈ cov(Φ), 0 if x ∉ cov(Φ), where d ∈ {−1, 1}.
In each iteration m, the following expression has to be minimized:

Σ_{y_i·r(x_i,c)>0} 1 / (1 + exp(β · (y_i · F_{m-1}(x_i) + d))) + Σ_{y_i·r(x_i,c)<0} 1 / (1 + exp(β · (y_i · F_{m-1}(x_i) − d))) + Σ_{r(x_i,c)=0} 1 / (1 + exp(β · y_i · F_{m-1}(x_i)))
The sigmoid loss is a relaxed form of the 0-1-ℓ loss function, but its statistical interpretation is hard.
Our first (and promising) experiments were performed using sigmoid loss.
Exponential loss

L_exp(y, F_m(x)) = exp(−β · y · F_m(x))

Population minimizer:

F*(x) = arg min_{F(x)} E_{y|x} L_exp(y, F(x)) = (1/2) log( P(y = 1|x) / P(y = −1|x) )
The prediction procedure is performed according to:

F(x) = a_0 + Σ_{m=1}^M a_m · r_m(x), and P(y = 1|x) = 1 / (1 + exp(−2 F(x))).
Classification rules: r(x, c) = α if x ∈ cov(Φ), 0 if x ∉ cov(Φ), where α ∈ {−d, d} with d = const, or α ∈ ℝ.
For α ∈ {−d, d}, one has to minimize in each iteration m:

Σ_{y_i·r(x_i,c)>0} exp(−β · (y_i · F_{m-1}(x_i) + d)) + Σ_{y_i·r(x_i,c)<0} exp(−β · (y_i · F_{m-1}(x_i) − d)) + Σ_{r(x_i,c)=0} exp(−β · y_i · F_{m-1}(x_i)),

where exp(−β · y_i · F_{m-1}(x_i)) can be treated as w_i^(m) (i.e., the weight of the i-th training example in the m-th iteration).
The above leads to:

Σ_{y_i·r(x_i,c)<0} w_i^(m) + ℓ · Σ_{r(x_i,c)=0} w_i^(m)

where ℓ = (1 − e^{−d}) / (e^d − e^{−d}), i.e., d = log((1 − ℓ)/ℓ).
For α ∈ ℝ, the heuristic constructing a single rule has to minimize:

2 √( Σ_{y_i·r(x_i,c)>0} w_i^(m) · Σ_{y_i·r(x_i,c)<0} w_i^(m) ) + Σ_{r(x_i,c)=0} w_i^(m)

and the final output of the rule is:

α = (1/2) log( Σ_{y_i·r(x_i,c)>0} w_i^(m) / Σ_{y_i·r(x_i,c)<0} w_i^(m) )
One can reformulate

α = (1/2) log( Σ_{y_i·r(x_i,c)>0} w_i^(m) / Σ_{y_i·r(x_i,c)<0} w_i^(m) )

to the following form:

α = (1/2) log( P(y = 1|r(x, c)) / P(y = −1|r(x, c)) )

which is related also to a confirmation measure:

l(E|H) = log( P(E|H) / P(E|¬H) ) = log( (P(H|E) · P(¬H)) / (P(¬H|E) · P(H)) )

where E is the evidence, H is the hypothesis, and P(H) and P(¬H) are constant in classification tasks.
Ensembles of Decision Rules and AdaBoost

F_m(x) is updated according to:

F_m(x) = F_{m-1}(x) + α · r_m(x)

which causes the weights for the next iteration to be:

w_i^(m+1) = w_i^(m) · exp(−α · y_i · r_m(x_i))

which gives, in fact, the reweighting scheme well known from AdaBoost.
Parameters {2α}_1^M are used in AdaBoost to weigh the weak learners.
AdaBoost uses a linear classifier combining weak learners. A decision rule is a specific base classifier that exceeds a random classifier: the space covered by the rule has small error, while the uncovered space can be treated as a space of coin tossing.
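The reweighting step above can be written down directly as a sketch:

```python
import math

# Sketch of the AdaBoost-style reweighting above: after adding rule r_m
# with output alpha, each weight is multiplied by exp(-alpha * y_i * r_m(x_i)).
# Examples not covered by the rule (r_m(x_i) = 0) keep their weight.

def update_weights(weights, ys, rule_outputs, alpha):
    return [w * math.exp(-alpha * y * r)
            for w, y, r in zip(weights, ys, rule_outputs)]
```

Correctly classified covered examples are down-weighted, misclassified ones up-weighted, so the next rule focuses on what the current ensemble still gets wrong.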
SLIPPER (by Cohen and Singer)

Uses the AdaBoost scheme (i.e., uses exponential loss and minimizes with respect to α and c),
decision rules are generated only for positive examples,
there is a default rule,
the heuristic constructing a single rule applies post-pruning.
Cohen, W. W., Singer, Y.: A simple, fast, and effective rule learner. Proc. of 16th National Conference on Artificial Intelligence (1999) 335–342
Lightweight Rule Induction (by Weiss and Indurkhya)

The output of decision rules is r_m(x) ∈ {0, 1},
uses a specific reweighting scheme:

w_i^(m) = 1 + ( Σ_{j=1}^{m-1} I(y_i ≠ r_j(x_i, c)) )³

uses DNF formulas for single rules,
decision rules are induced separately for each class (the same number of rules for each class).
Weiss S., M., Indurkhya, N.: Lightweight Rule Induction. Proc. of 17th International Conference on Machine Learning (2000) 1135–1142
Lightweight Rule Induction (by Weiss and Indurkhya)

The heuristic constructing a single rule minimizes (for r(x_i, c) ∈ {0, 1}):

Σ_{y_i·r(x_i,c)=−1} w_i^(m) + k · Σ_{r(x_i,c)=0 ∧ y_i=1} w_i^(m),

where k = 1, 2, 4, … changes during the construction of a rule, Σ_{y_i·r(x_i,c)=−1} w_i^(m) are the weighted false positives, and Σ_{r(x_i,c)=0 ∧ y_i=1} w_i^(m) are the weighted false negatives.
Ensembles of Decision Rules and Lightweight Rule Induction

For k = const, minimization of

Σ_{y_i·r(x_i,c)=−1} w_i^(m) + k · Σ_{r(x_i,c)=0 ∧ y_i=1} w_i^(m)

is equivalent to minimization of

Σ_{y_i·r(x_i,c)=−1} w_i^(m) + ℓ · Σ_{r(x_i,c)=0} w_i^(m)

where ℓ = k / (k + 1).
Ensembles of Decision Rules and Lightweight Rule Induction

Taking several assumptions like:
k = const,
exponential example reweighting,
1-DNF formulas for single decision rules,
we obtain an algorithm similar to ensembles of decision rules with exponential loss and d = const.
Squared loss (regression rules)

L_squared(y, F_m(x)) = (y − F_m(x))²

Population minimizer:

F*(x) = arg min_{F(x)} E_{y|x} L_squared(y, F(x)) = E(y|x)
The prediction procedure is performed according to:

F(x) = a_0 + Σ_{m=1}^M a_m · r_m(x).
Regression rules: r(x, c) = α if x ∈ cov(Φ), 0 if x ∉ cov(Φ), where α ∈ ℝ.
In each iteration m, the following expression has to be minimized:

Σ_{r(x_i,c)=0} (y_i − F_{m-1}(x_i))² + Σ_{r(x_i,c)≠0} (y_i − F_{m-1}(x_i) − α)²

where α = Σ_{r(x_i,c)≠0} (y_i − F_{m-1}(x_i)) / Σ_{r(x_i,c)≠0} 1.
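Under squared loss the optimal α has the closed form above: the mean residual over the covered examples. A minimal sketch:

```python
# Sketch: under squared loss, the optimal output alpha of a regression
# rule is the mean residual y_i - F_{m-1}(x_i) over the covered examples.

def regression_rule_alpha(ys, F_prev, covered):
    residuals = [y - f for y, f, c in zip(ys, F_prev, covered) if c]
    return sum(residuals) / len(residuals)
```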
Regression Rules

Approximation of f(x) = 3x⁵ − x⁴ − x, model built on 41 observations.

[Figures: the approximation of f(x) with an increasing number of regression rules — 2, 6, 11, 21, 31, 41 and 51 rules.]
Binomial log-likelihood

L_log(y, F_m(x)) = log(1 + exp(−β · y · F_m(x)))

Population minimizer:

F*(x) = arg min_{F(x)} E_{y|x} L_log(y, F(x)) = (1/2) log( P(y = 1|x) / P(y = −1|x) )
The prediction procedure is performed according to:

F(x) = a_0 + Σ_{m=1}^M a_m · r_m(x), and P(y = 1|x) = 1 / (1 + exp(−2 F(x))).
Classification rules: r(x, c) = α if x ∈ cov(Φ), 0 if x ∉ cov(Φ), where α ∈ {−d, d} with d = const, or α ∈ ℝ.
For α ∈ {−d, d}, one has to minimize in each iteration m:

Σ_{r(x_i,c)=0} log(1 + exp(−β · y_i · F_{m-1}(x_i))) + Σ_{y_i·r(x_i,c)>0} log(1 + exp(−β · (y_i · F_{m-1}(x_i) + d))) + Σ_{y_i·r(x_i,c)<0} log(1 + exp(−β · (y_i · F_{m-1}(x_i) − d))),

where exp(−β · y_i · F_{m-1}(x_i)) can be treated as w_i^(m) (i.e., the weight of the i-th training example in the m-th iteration).
For α ∈ ℝ, there is no straightforward optimization; the solution is to use gradient boosting machines.

Friedman, J. H.: Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics 29 (2001) 1189–1232
In each iteration m, build a regression rule minimizing:

Σ_{r(x_i,c)=0} (ỹ_i)² + Σ_{r(x_i,c)≠0} (ỹ_i − α)²

where ỹ_i is a pseudo-response defined as:

ỹ_i = −∂L(y_i, F(x_i))/∂F(x_i) |_{F(x)=F_{m-1}(x)} = 2 y_i / (1 + exp(2 y_i F_{m-1}(x_i))).

The final output of the rule is α = Σ_{r(x_i,c)≠0} ỹ_i / Σ_{r(x_i,c)≠0} 1.
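The gradient-boosting step for this loss is small enough to sketch; the function names are illustrative:

```python
import math

# Sketch of the gradient boosting step for binomial log-likelihood loss:
# the pseudo-response is the negative gradient of the loss at F_{m-1},
# and the rule output alpha is its mean over the covered examples.

def pseudo_response(y, f_prev):
    # -dL/dF evaluated at F_{m-1}, matching the 2y/(1+exp(2yF)) formula
    return 2.0 * y / (1.0 + math.exp(2.0 * y * f_prev))

def gradient_rule_alpha(ys, F_prev, covered):
    vals = [pseudo_response(y, f) for y, f, c in zip(ys, F_prev, covered) if c]
    return sum(vals) / len(vals)
```

Fitting the rule to the pseudo-responses by least squares is exactly the squared-loss machinery of the previous section, now applied to the gradient of the log-likelihood.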
Ensembles of decision rules and Gradient Boosting Trees

In each iteration m of gradient boosting trees one looks for:

({γ_jm, c_jm}_1^J) = arg min_{{γ_j,c_j}_1^J} Σ_{i=1}^N L(y_i, F_{m-1}(x_i) + Σ_{j=1}^J γ_j · r(x_i, c_j))

where {r(x_i, c_j)}_1^J represents all paths from the root to the leaves of the tree.
In the ensembles of decision rules, each rule r_m(x, c) and its output α are optimized separately:

(α_m, c_m) = arg min_{α,c} Σ_{i=1}^N L(y_i, F_{m-1}(x_i) + α · r(x_i, c))
Conclusions

Several properties of ensembles of decision rules in comparison to decision tree ensembles:
a natural stop criterion in building a single rule,
the decision rule r_m(x) and its output α are optimized individually, taking into account all previously generated rules {r_k(x)}_1^{m−1},
parameters {a_m}_1^M correspond to the outputs of single rules {α_m}_1^M – one parameter to be optimized.
One can treat the ensembles of decision rules as an extension of sequential covering.
Software was written and experiments were performed using the Weka package.
Classifiers used in the experiment

Classifier (Weka options) | Abbrev.
NaiveBayes | NB
Logistic -R 1.0E-8 -M -1 | Log
RBFNetwork -B 2 -S 1 -R 1.0E-8 -M -1 -W 0.1 | RBF
SMO -C 1.0 -E 1.0 -G 0.01 -A 250007 -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 | SMO
IBk -K 5 -W 0 | IBL
AdaBoostM1 -P 100 -S 1 -I 50 -W DecisionStump | AB DS
AdaBoostM1 -P 100 -S 1 -I 50 -W REPTree -M 2 -V 0.0010 -N 3 -S 1 -L -1 | AB RT
AdaBoostM1 -P 100 -S 1 -I 50 -W J48 -C 0.25 -M 2 | AB J48
AdaBoostM1 -P 100 -S 1 -I 50 -W PART -M 2 -C 0.25 -Q 1 | AB PT
Bagging -P 100 -S 1 -I 50 -W REPTree -M 2 -V 0.0010 -N 3 -S 1 -L -1 | B RT
Bagging -P 100 -S 1 -I 50 -W J48 -C 0.25 -M 2 | B J48
Bagging -P 100 -S 1 -I 50 -W PART -M 2 -C 0.25 -Q 1 | B PT
LogitBoost -P 100 -F 0 -R 1 -L -1.8e308 -H 1.0 -S 1 -I 50 -W DecisionStump | LB DS
LogitBoost -P 100 -F 0 -R 1 -L -1.8e308 -H 1.0 -S 1 -I 50 -W REPTree -M 2 -V 0.0010 -N 3 -S 1 -L -1 | LB RT
J48 -C 0.25 -M 2 | J48
RandomForest -I 50 -K 0 -S 1 | RF
PART -M 2 -C 0.25 -Q 1 | PT
Ensemble of Decision Rules: L_sigm(y, F_m(x)), M = 50, bootstrap sample of size η = N, ν = 0.5, a_0 = F_0(x), {a_m}_1^M = 1 | EDR
Data sets included in the experiment

Data set | Attributes | Class −1 | Class 1
German Credit (credit-g) | 21 | 300 | 700
Pima Indians Diabetes (diabetes) | 9 | 268 | 500
Heart Statlog (heart-statlog) | 14 | 120 | 150
J. Hopkins University Ionosphere (ionosphere) | 35 | 126 | 225
King+Rook vs. King+Pawn on a7 (kr-vs-kp) | 37 | 1527 | 1669
Credit-g – leave-one-out estimate (accuracy [%])
[Bar chart, accuracy 66–78%, classifiers in order: PT, J48, AB DS, LB RT, RBF, AB RT, B J48, IBL, AB J48, AB PT, SMO, B RT, NB, Log, EDR, RF, B PT, LB DS.]

Diabetes – leave-one-out estimate (accuracy [%])
[Bar chart, accuracy 66–78%, classifiers in order: PT, AB J48, AB RT, RBF, J48, IBL, LB DS, B PT, RF, B RT, LB RT, EDR, NB, AB PT, B J48, AB DS, SMO, Log.]

Heart-statlog – leave-one-out estimate (accuracy [%])
[Bar chart, accuracy 74–84%, classifiers in order: J48, AB RT, PT, AB J48, LB RT, IBL, B J48, LB DS, RBF, B PT, RF, EDR, AB DS, B RT, NB, Log, SMO, AB PT.]

Ionosphere – leave-one-out estimate (accuracy [%])
[Bar chart, accuracy 80–90%, classifiers in order: NB, IBL, J48, SMO, Log, PT, B RT, LB DS, LB RT, RBF, AB RT, AB PT, B PT, AB DS, B J48, RF, AB J48, EDR.]

kr-vs-kp – leave-one-out estimate (accuracy [%])
[Bar chart, accuracy 80–100%, classifiers in order: LB RT, RBF, NB, AB DS, EDR, SMO, LB DS, IBL, Log, PT, B RT, AB RT, B J48, RF, B PT, J48, AB J48, AB PT.]

Sonar – leave-one-out estimate (accuracy [%])
[Bar chart, accuracy 65–90%, classifiers in order: NB, J48, RBF, LB RT, PT, Log, B RT, SMO, AB RT, B J48, EDR, IBL, LB DS, AB DS, B PT, AB J48, AB PT, RF.]
Future plans and conclusions

Promising first steps in the research on ensembles of decision rules, but still a lot to do:
exhaustive experimental research,
missing values,
interpretation of rules,
robust loss functions,
multi-class problems,
ordinal classification problems,
ranking problems.

Ensembles of decision rules are flexible, powerful, fast and interpretable.