
(1)

Surrogate regret bounds for generalized classification performance metrics

Wojciech Kotłowski, Krzysztof Dembczyński

Poznań University of Technology, Poland

IDSS Seminar, 22.03.2016

(2)

Motivation

(3)

Kaggle Higgs Boson Machine Learning Challenge

(4)

Data set and rules

# events   # features   % signal   % signal weight
250 000    30           34.3       0.17

Classes: “background” and “signal”.

Signal: Higgs boson decay h → τ⁺τ⁻. Background: collisions mimicking the signal.

Features: masses, momenta of produced particles, etc.

Weights: cancel over-representation of signal events.

Evaluation: Approximate Median Significance (AMS):

AMS = √( 2( (s + b + b_reg) log(1 + s/(b + b_reg)) − s ) ) ≈ s/√(b + b_reg)

s, b – total weight of signal/background events classified as signal; b_reg = 10.
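The formula above is easy to sanity-check numerically. Below is a minimal sketch (the function name and the toy values of s and b are illustrative, not from the challenge), comparing the exact AMS with its simplified approximation s/√(b + b_reg):

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate Median Significance.

    s, b: total weight of signal/background events classified as signal.
    """
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

# When s is small relative to b, AMS is close to s / sqrt(b + b_reg).
s, b = 100.0, 10000.0
print(ams(s, b))                 # exact AMS
print(s / math.sqrt(b + 10.0))   # simplified approximation
```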



(7)

Results

[Figure: Public and Private (Final) Leaderboard standings]

Classifier: ensembles of decision rules (ENDER) mixed a bit with tree ensembles.

(8)

How to optimize AMS?

Research Problem

How to optimize a global function of true/false positives/negatives that is not decomposable into individual losses over the observations?

Most popular approach:

Sort classifier’s scores and threshold to maximize AMS.

[Diagram: examples sorted by classifier's score; everything below the threshold is classified as negative, everything above as positive]

AMS not used while training, only for tuning the threshold.

Is this approach theoretically justified?


(11)

Statistical Learning Theory

h* = argmin_h E_(x,y)[ ℓ(y, h(x)) ]

ℓ(y, h(x)): pointwise (univariate) loss function
E_(x,y): expectation over the entire distribution Pr(x, y)
h*: optimal classifier

What about multivariate outputs?

Binary classification: Performance measures defined over the entire test set (like in the Higgs Challenge)

Multi-label classification/structured-output prediction: Multivariate output space


(16)

Multivariate performance measures

A vector of m labels y = (y1, . . . , ym) ∈ {0, 1}^m
A classifier h that delivers ŷ = (ŷ1, . . . , ŷm) ∈ {0, 1}^m
A multivariate loss Ψ(y, ŷ)

Two possible approaches:

Decision-Theoretic Approach (DTA)
Empirical Utility Maximization (EUM)

(The names are not perfect.)


(21)

Decision Theoretic Approach

Expectation over Ψ(y, ŷ):

h* = argmin_h E_y[ Ψ(y, ŷ) ]

Only the joint distribution of y is considered, with x being fixed.
Binary classification: a fixed test set.
Multi-label classification: a single test example with y to be predicted.


(25)

Empirical Utility Maximization

Ψ over expectations:

h* = argmin_h Ψ( TP(h), TN(h), FP(h), FN(h) ), where:

TP(h) = Pr(h(x) = 1 ∧ y = 1),    TN(h) = Pr(h(x) = −1 ∧ y = −1),
FP(h) = Pr(h(x) = 1 ∧ y = −1),   FN(h) = Pr(h(x) = −1 ∧ y = 1).
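In practice these four probabilities are replaced by their plug-in estimates, i.e. the fractions of the sample falling into each cell of the confusion matrix. A minimal sketch (the helper name and toy labels are illustrative):

```python
import numpy as np

def confusion_rates(y_true, y_pred):
    """Empirical estimates of (TP, TN, FP, FN) as sample fractions.
    Labels are in {-1, +1}; the four rates sum to 1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.mean((y_pred == 1) & (y_true == 1))
    tn = np.mean((y_pred == -1) & (y_true == -1))
    fp = np.mean((y_pred == 1) & (y_true == -1))
    fn = np.mean((y_pred == -1) & (y_true == 1))
    return tp, tn, fp, fn

y     = [1, 1, -1, -1, 1, -1]
y_hat = [1, -1, -1, 1, 1, -1]
print(confusion_rates(y, y_hat))
```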

(26)

Research on complex performance measures

P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.

M. Jansche. A maximum expected utility framework for binary sequence labeling. In ACL, pages 736–743, 2007.

W. Kotłowski, K. Dembczyński, and E. Hüllermeier. Bipartite ranking through minimization of univariate loss. In ICML, pages 1113–1120, 2011.

K. Dembczyński, W. Kotłowski, and E. Hüllermeier. Consistent multilabel ranking through univariate losses. In ICML, 2012.

Ye Nan, Kian Ming Adam Chai, Wee Sun Lee, and Hai Leong Chieu. Optimizing F-measure: A Tale of Two Approaches. In ICML, 2012.

(27)

Research on complex performance measures

Willem Waegeman, Krzysztof Dembczyński, Arkadiusz Jachnik, Weiwei Cheng, and Eyke Hüllermeier. On the Bayes-optimality of F-measure maximizers. Journal of Machine Learning Research, 15(1):3333–3388, 2014.

Shameem Puthiya Parambath, Nicolas Usunier, and Yves Grandvalet. Optimizing F-measures by cost-sensitive classification. In NIPS 27, 2014.

H. Narasimhan, R. Vaish, and S. Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In NIPS 27, 2014.

Oluwasanmi Koyejo, Nagarajan Natarajan, Pradeep Ravikumar, and Inderjit Dhillon. Consistent binary classification with generalized performance metrics. In NIPS 27, pages 2744–2752, 2014.

Sanmi Koyejo, Nagarajan Natarajan, Pradeep Ravikumar, and Inderjit Dhillon. Consistent multilabel classification. In NIPS 29, 2015.

(28)

Optimization of generalized performance metrics

Wojciech Kotłowski and Krzysztof Dembczyński. Surrogate regret bounds for generalized classification performance metrics.

(29)

Generalized performance metrics for binary classification

Given a binary classifier h : X → {−1, 1}, define:

Ψ(h) = Ψ( FP(h), FN(h) ), where:

FP(h) = Pr(h(x) = 1 ∧ y = −1),   FN(h) = Pr(h(x) = −1 ∧ y = 1).

                  predicted ŷ = h(x)
                  −1     +1     total
true y    −1      TN     FP     1 − P
          +1      FN     TP     P

(30)

Linear-fractional performance metric

Definition

Ψ(FP, FN) = (a0 + a1 FP + a2 FN) / (b0 + b1 FP + b2 FN)

Examples

Accuracy             Acc = 1 − FN − FP
Fβ-measure           Fβ = (1 + β²)(P − FN) / ((1 + β²)P − FN + FP)
Jaccard similarity   J = (P − FN) / (P + FP)
AM measure           AM = 1 − FN/(2P) − FP/(2(1 − P))
Weighted accuracy    WA = 1 − w₋ FP − w₊ FN
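The linear-fractional examples above can be written directly as functions of FP, FN and the class prior P. A minimal sketch (function names are illustrative; each metric equals 1 when FP = FN = 0):

```python
def f_beta(FP, FN, P, beta=1.0):
    # F_beta = (1 + beta^2)(P - FN) / ((1 + beta^2) P - FN + FP)
    b2 = beta ** 2
    return (1 + b2) * (P - FN) / ((1 + b2) * P - FN + FP)

def jaccard(FP, FN, P):
    # J = (P - FN) / (P + FP)
    return (P - FN) / (P + FP)

def am(FP, FN, P):
    # AM = 1 - FN / (2P) - FP / (2(1 - P))
    return 1.0 - FN / (2 * P) - FP / (2 * (1 - P))

# With no errors, each metric attains its optimum of 1:
P = 0.3
print(f_beta(0.0, 0.0, P), jaccard(0.0, 0.0, P), am(0.0, 0.0, P))
```

Note that all three are non-increasing in FP and FN, the property the main theorem later relies on.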


(32)

Convex performance metrics

Definition

Ψ(FP, FN) is jointly convex in FP and FN.

Example: AMS2 score

AMS2(TP, FP) = 2( (TP + FP) log(1 + TP/FP) − TP ).
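Joint convexity can be spot-checked numerically: for a convex function, the value at a midpoint never exceeds the average of the endpoint values. A minimal sketch (the sampled ranges are arbitrary illustrative choices):

```python
import math, random

def ams2(tp, fp):
    # AMS2(TP, FP) = 2((TP + FP) log(1 + TP/FP) - TP)
    return 2.0 * ((tp + fp) * math.log(1.0 + tp / fp) - tp)

# Midpoint test of joint convexity on random pairs of points.
random.seed(0)
for _ in range(1000):
    a = (random.uniform(0.01, 1), random.uniform(0.01, 1))
    b = (random.uniform(0.01, 1), random.uniform(0.01, 1))
    mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
    assert ams2(*mid) <= (ams2(*a) + ams2(*b)) / 2 + 1e-12
print("midpoint convexity holds on all sampled pairs")
```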


(34)

Example: F1-measure

[Figure: F1-measure as a function of FP]

(35)

Example: AMS2 score

[Figure: AMS2 score as a function of FP]

(36)

A simple approach to optimization of Ψ(h)

training data → learn real-valued f(x) (using a standard classification tool)

validation data → learn a threshold θ on f(x) by optimizing Ψ(h_{f,θ})

h_{f,θ}(x) = −1 if f(x) < θ,   h_{f,θ}(x) = +1 if f(x) ≥ θ
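The threshold-tuning step can be sketched in plain numpy: sweep over candidate thresholds on the validation scores and keep the one maximizing the empirical metric. This is an illustrative sketch, not the authors' code; the scores, labels, and the plug-in F1 metric below are assumed toy values:

```python
import numpy as np

def tune_threshold(scores, y, metric):
    """Pick the threshold on f maximizing the empirical metric on
    validation data. `metric` takes (FP, FN, P) as sample fractions;
    labels are in {-1, +1}."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y)
    P = np.mean(y == 1)
    # Candidates: below the minimum, each observed score, above the maximum.
    cand = np.concatenate(([scores.min() - 1], np.sort(scores), [scores.max() + 1]))
    best_theta, best_val = cand[0], -np.inf
    for theta in cand:
        pred = np.where(scores >= theta, 1, -1)
        FP = np.mean((pred == 1) & (y == -1))
        FN = np.mean((pred == -1) & (y == 1))
        val = metric(FP, FN, P)
        if val > best_val:
            best_theta, best_val = theta, val
    return best_theta, best_val

def f1(FP, FN, P):
    return 2 * (P - FN) / (2 * P - FN + FP)

# Hypothetical validation scores; higher score should mean "positive".
scores = [-2.0, -1.0, -0.5, 0.3, 1.0, 2.0]
y      = [-1,   -1,    1,  -1,   1,   1]
theta, val = tune_threshold(scores, y, f1)
print(theta, val)
```

Only step 2 is shown; step 1 (learning f itself) can be any standard classifier producing real-valued scores.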


(40)

Our results

Algorithm

1. Learn f minimizing a surrogate loss on the training sample.
2. Given f, tune a threshold θ on f on the validation sample by direct optimization of Ψ.

Our results (informally)

Assumptions:
the surrogate loss is strongly proper composite (e.g., logistic, exponential, squared-error loss),
Ψ is linear-fractional or jointly convex.

Claim:
If f is close to the minimizer of the surrogate loss, then h_{f,θ} is close to the maximizer of Ψ.


(42)

Ψ-regret and ℓ-regret

Ψ-regret of a classifier h : X → {−1, 1}:

Reg_Ψ(h) = Ψ(h*) − Ψ(h), where h* = argmax_h Ψ(h).

Measures suboptimality.

Surrogate loss ℓ(y, f(x)) of a real-valued function f : X → R.
Used in training: logistic loss, squared loss, hinge loss, . . .

Expected loss (ℓ-risk) of f:

Risk_ℓ(f) = E_(x,y)[ ℓ(y, f(x)) ].

ℓ-regret of f:

Reg_ℓ(f) = Risk_ℓ(f) − Risk_ℓ(f*), where f* = argmin_f Risk_ℓ(f).

Goal: relate the Ψ-regret of h_{f,θ} to the ℓ-regret of f.


(47)

Examples of surrogate losses

Logistic loss

ℓ(y, ŷ) = log(1 + e^(−y ŷ)).

Risk minimizer f*(x):

f*(x) = log( η(x) / (1 − η(x)) ), where η(x) = Pr(y = 1|x).

An invertible function of the conditional probability η(x).

Hinge loss

ℓ(y, ŷ) = (1 − y ŷ)₊.

Its risk minimizer f*(x) is non-invertible:

f*(x) = sgn(η(x) − 1/2).


(49)

Examples of surrogate losses

[Figure: 0/1 loss together with its convex surrogates: squared error loss, logistic loss, hinge loss, and exponential loss]

(50)

Examples of surrogate losses

loss            f*(η) = ψ(η)          η(f) = ψ⁻¹(f)
squared error   2η − 1                (1 + f)/2
logistic        log(η/(1 − η))        1/(1 + e^(−f))
exponential     ½ log(η/(1 − η))     1/(1 + e^(−2f))
hinge           sgn(η − 1/2)          does not exist

(51)

Proper composite losses

We call ℓ(y, f) proper composite if f*(x) is an invertible function of the conditional probability η(x) = Pr(y = 1|x).

In other words, there exists a strictly increasing link function ψ such that:

f*(x) = ψ(η(x)), where η(x) = Pr(y = 1|x).

Minimizing a proper composite loss implies probability estimation.
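The link functions from the table can be written out and round-tripped to verify that the risk minimizer indeed recovers η(x). A minimal sketch (function names are illustrative), covering the logistic and exponential links:

```python
import math

def psi_logistic(eta):      # f* = log(eta / (1 - eta))
    return math.log(eta / (1.0 - eta))

def psi_inv_logistic(f):    # eta = 1 / (1 + e^(-f))
    return 1.0 / (1.0 + math.exp(-f))

def psi_exp(eta):           # f* = (1/2) log(eta / (1 - eta))
    return 0.5 * math.log(eta / (1.0 - eta))

def psi_inv_exp(f):         # eta = 1 / (1 + e^(-2f))
    return 1.0 / (1.0 + math.exp(-2.0 * f))

# Round trip: recovering the conditional probability from the score.
for eta in (0.1, 0.5, 0.9):
    assert abs(psi_inv_logistic(psi_logistic(eta)) - eta) < 1e-12
    assert abs(psi_inv_exp(psi_exp(eta)) - eta) < 1e-12
print("links are invertible on (0, 1)")
```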

(52)

Strongly proper composite losses [Agarwal, 2014]

We call ℓ(y, f) λ-strongly proper composite if it is proper composite and, additionally, for any f, any distribution, and any x:

E_{y|x}[ ℓ(y, f(x)) − ℓ(y, f*(x)) ] ≥ (λ/2) ( η(x) − ψ⁻¹(f(x)) )².

A technical condition, satisfied by all commonly used proper composite losses.
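For the logistic loss the inequality (with λ = 4) can be spot-checked numerically: the conditional regret on the left is a binary KL divergence, which by Pinsker's inequality dominates 2(η − ψ⁻¹(f))². A minimal sketch (function names and sampled ranges are illustrative):

```python
import math, random

def sigma(f):
    return 1.0 / (1.0 + math.exp(-f))

def logistic_pointwise_risk(eta, f):
    # E_{y|x} of the logistic loss when Pr(y = 1 | x) = eta and score is f.
    return eta * math.log(1 + math.exp(-f)) + (1 - eta) * math.log(1 + math.exp(f))

random.seed(1)
lam = 4.0  # strong-properness constant of the logistic loss
for _ in range(1000):
    eta = random.uniform(0.01, 0.99)
    f = random.uniform(-5, 5)
    f_star = math.log(eta / (1 - eta))  # pointwise risk minimizer
    regret = logistic_pointwise_risk(eta, f) - logistic_pointwise_risk(eta, f_star)
    assert regret >= (lam / 2) * (eta - sigma(f)) ** 2 - 1e-12
print("strong properness inequality holds on all sampled points")
```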

(53)

Strongly proper composite losses: Examples

loss            f*(η) = ψ(η)          η(f) = ψ⁻¹(f)     λ
squared error   2η − 1                (1 + f)/2          8
logistic        log(η/(1 − η))        1/(1 + e^(−f))     4
exponential     ½ log(η/(1 − η))     1/(1 + e^(−2f))    4

(54)

Main result

Theorem for linear fractional measures

Let Ψ(FP, FN) be linear-fractional and non-increasing in FP and FN.
Assume the denominator of Ψ is bounded from below by γ > 0.
Let ℓ be a λ-strongly proper composite loss function. Then there exists a threshold θ*, such that for any real-valued function f,

Reg_Ψ(h_{f,θ*}) ≤ C √(2/λ) √(Reg_ℓ(f)),

where C = (1/γ)( Ψ(h*)(b1 + b2) − (a1 + a2) ) > 0.

metric               γ             C
Fβ-measure           β²P           (1 + β²)/(β²P)
Jaccard similarity   P             (J* + 1)/P
AM measure           2P(1 − P)     1/(2P(1 − P))

A similar theorem holds for convex performance metrics (such as AMS2).


(57)

Explanation of the theorem

The classifier h* maximizing Ψ is h*(x) = sgn(η(x) − η*) for some threshold η*, where η(x) = Pr(y = 1|x).

Minimizing a surrogate loss gives f*(x).

If ℓ is proper composite, f*(x) is an increasing, invertible function of η(x): f* = ψ(η).

Thresholding η(x) at η* is then the same as thresholding f*(x) at θ* = ψ(η*). This yields the optimal classifier h*.

[Diagram: example scores f*(x) = −3.5, −2.5, −2, −1, −0.5, 0.5, 1.5, 2, 3 mapped through η(x) = 1/(1 + e^(−f*(x))) to probabilities 0.03, 0.08, 0.12, 0.27, 0.38, 0.62, 0.82, 0.88, 0.95; the threshold θ* = ψ(η*) splits both scales at the same point.]

The gradients of Ψ and the constant λ measure the local variation of Ψ and ℓ when f is not equal to f*.
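The key equivalence, that thresholding η at η* gives the same decisions as thresholding f* at θ* = ψ(η*), follows from ψ being strictly increasing, and can be checked directly. A minimal sketch with the logistic link (the value of η* is an illustrative assumption; the probabilities are the example values above):

```python
import math

def psi(eta):                 # logistic link: f* = psi(eta)
    return math.log(eta / (1.0 - eta))

eta_star = 0.35               # hypothetical optimal probability threshold
theta_star = psi(eta_star)    # corresponding threshold on the score scale

# For every eta, (eta >= eta*) iff (psi(eta) >= psi(eta*)),
# because psi is strictly increasing.
for eta in (0.03, 0.12, 0.27, 0.38, 0.62, 0.88):
    assert (eta >= eta_star) == (psi(eta) >= theta_star)
print("thresholding eta at eta* == thresholding f* at psi(eta*)")
```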


(62)

The optimal threshold θ*

When Ψ is classification accuracy: θ* = 0, η* = 1/2 (that is why we threshold linear classifiers at 0).

When Ψ is weighted accuracy: η* = w₋/(w₊ + w₋).

For more complex measures, θ* is unknown, as it depends on the optimal value of Ψ, which is itself unknown . . .

=⇒ we can estimate θ from validation data
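The weighted-accuracy case is easy to verify pointwise: predicting +1 costs w₋(1 − η) in expectation and predicting −1 costs w₊η, so the optimal decision flips exactly at η* = w₋/(w₊ + w₋). A minimal sketch (the weights are illustrative assumptions):

```python
def bayes_optimal_label(eta, w_minus, w_plus):
    # Predict +1 iff the expected cost of a false positive, w_-(1 - eta),
    # is at most the expected cost of a false negative, w_+ * eta.
    return 1 if w_plus * eta >= w_minus * (1 - eta) else -1

w_minus, w_plus = 1.0, 3.0
eta_star = w_minus / (w_plus + w_minus)   # decision threshold on eta

# The decision flips exactly at eta*:
print(eta_star)
print(bayes_optimal_label(eta_star - 0.01, w_minus, w_plus))  # just below
print(bayes_optimal_label(eta_star + 0.01, w_minus, w_plus))  # just above
```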


(66)

Tuning the threshold

Corollary

Given a real-valued function f and a validation sample of size m, let θ̂ = argmax_θ Ψ̂(h_{f,θ}), where Ψ̂ is the performance metric calculated on the validation sample. Then, under the same assumptions and notation:

Reg_Ψ(h_{f,θ̂}) ≤ C √(2/λ) √(Reg_ℓ(f)) + O(1/√m).

Learning a standard binary classifier and tuning the threshold afterwards is thus able to recover the maximizer of Ψ in the limit.


(68)

Multilabel classification

A vector of m labels y = (y1, . . . , ym) for each x.

Multilabel classifier h(x) = (h1(x), . . . , hm(x)).

Separate false positive/negative rates for each label:

FP_i(h_i) = Pr(h_i(x) = 1 ∧ y_i = −1),   FN_i(h_i) = Pr(h_i(x) = −1 ∧ y_i = 1).

Given a binary classification performance metric Ψ, how can we use it in the multilabel setting?

We extend our bounds to cover micro- and macro-averaging.


(70)

Micro- and macro-averaging

Macro-averaging: average outside Ψ:

Ψ_macro(h) = Σ_{i=1}^m Ψ( FP_i(h_i), FN_i(h_i) ).

Our bound suggests that a separate threshold needs to be tuned for each label.

Micro-averaging: average inside Ψ:

Ψ_micro(h) = Ψ( (1/m) Σ_{i=1}^m FP_i(h_i), (1/m) Σ_{i=1}^m FN_i(h_i) ).

Our bound suggests that all labels share a single threshold.
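The two averaging schemes can be contrasted on concrete numbers. A minimal sketch with hypothetical per-label rates (the numbers, the normalization of the macro sum by m, and the pooled prior in the micro case are illustrative assumptions):

```python
def f1(FP, FN, P):
    return 2 * (P - FN) / (2 * P - FN + FP)

# Hypothetical per-label rates for m = 3 labels:
FP = [0.05, 0.10, 0.00]
FN = [0.10, 0.05, 0.20]
P  = [0.30, 0.40, 0.25]

m = len(FP)
# Macro-averaging: apply Psi per label, then average (normalized by m here).
f1_macro = sum(f1(FP[i], FN[i], P[i]) for i in range(m)) / m
# Micro-averaging: average the rates first, then apply Psi once.
f1_micro = f1(sum(FP) / m, sum(FN) / m, sum(P) / m)
print(f1_macro, f1_micro)
```

The two values generally differ: macro-averaging weights all labels equally, while micro-averaging is dominated by the pooled error rates.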


(72)

Experiments

(73)

Experimental results

Two synthetic and two benchmark data sets.

Surrogates: logistic loss (LR) and hinge loss (SVM).

Performance metrics: F-measure and AM measure.

The two-step learning procedure:

1. Minimize the surrogate ℓ (logistic or hinge) on the training data.
2. Tune the threshold θ̂ to optimize the performance metric on separate validation data.

(74)

Synthetic experiment I

X is a finite set X = {1, 2, . . . , 25} with Pr(x) uniform.

For each x ∈ X, η(x) is drawn uniformly at random.

[Figure, logistic loss surrogate: regret of the F-measure, the AM measure, and the logistic loss vs. number of training examples; all converge to 0 as expected.]

[Figure, hinge loss surrogate: hinge regret, F-measure regret, and AM regret vs. number of training examples.]

(75)

Synthetic experiment II

x ∈ X = R² is generated from a standard Gaussian.

Logistic model: η(x) = (1 + exp(−a0 − aᵀx))⁻¹.

[Figure, convergence of the F-measure: logistic regret (LR), hinge regret (SVM), F-measure regret (LR), and F-measure regret (SVM) vs. number of training examples.]

[Figure, convergence of the AM measure: logistic regret (LR), hinge regret (SVM), AM regret (LR), and AM regret (SVM) vs. number of training examples.]

(76)

Benchmark data experiment – a bit of surprise

dataset          #examples   #features
covtype.binary   581,012     54
gisette          7,000       5,000

[Figure: F-measure and AM measure on covtype.binary and gisette vs. number of training examples, for LR and SVM.]

(77)

Multilabel classification

data set    #labels   #training examples   #test examples   #features
scene       6         1211                 1169             294
yeast       14        1500                 917              103
mediamill   101       30993                12914            120

Surrogates: logistic loss (LR) and hinge loss (SVM).

Performance metrics: F-measure and AM measure.

Macro-averaging (separate threshold for each label) and micro-averaging (single threshold for all labels).

Cross-evaluation of algorithms tuned for micro-averaging in the macro-averaged setting, and vice versa.

(78)

Multilabel classification: F measure

[Figures: macro F-measure (left) and micro F-measure (right) on scene (top) and yeast (bottom) vs. number of training examples. Legend: LR Macro-F, SVM Macro-F, LR Micro-F, SVM Micro-F.]

(79)

Multilabel classification: AM measure

[Figures: macro AM (left) and micro AM (right) on scene (top) and yeast (bottom) vs. number of training examples. Legend: LR Macro-AM, SVM Macro-AM, LR Micro-AM, SVM Micro-AM.]

(80)

Open problem

max_h Ψ(h) − Ψ(h_{f,θ}) ≤ const · √( Risk_ℓ(f) − min_f Risk_ℓ(f) )

The right-hand side can be decreased to 0 only if the optimal f* is in the class we are optimizing over.

If f* is not in the class, the right-hand side does not converge to 0 even for proper losses . . .

. . . this can, however, be beneficial for non-proper losses (e.g., the hinge loss).

Most of the time the optimal f* is outside the class we are optimizing over, and a theory for this case is needed.


(82)

Summary

Theoretical analysis of the two-step approach to optimize generalized performance metrics for classification.

Regret bounds for linear-fractional and convex functions optimized by means of strongly proper composite surrogates.

The theorem relates convergence to the global risk minimizer.

Can we say anything about convergence to the risk minimizer within some class of functions?

Why does the hinge loss perform so well when the risk minimizer is outside the class of functions considered?
