Surrogate regret bounds for generalized classification performance metrics
Wojciech Kotłowski, Krzysztof Dembczyński
Poznań University of Technology, Poland
IDSS Seminar, 22.03.2016
Motivation
Kaggle Higgs Boson Machine Learning Challenge
Data set and rules
# events: 250,000    # features: 30    % signal: 34.3    % signal weight: 0.17
Classes: “background” and “signal”.
Signal: Higgs boson decay h → τ+τ−. Background: Collisions mimicking the signal.
Features: masses, momenta of produced particles, etc.
Weights: cancel over-representation of signal events.
Evaluation: Approximate Median Significance (AMS):

AMS = sqrt( 2 [ (s + b + b_reg) log(1 + s/(b + b_reg)) − s ] ) ≈ s / sqrt(b + b_reg),

where s, b are the total weights of signal/background events classified as signal, and b_reg = 10.
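The AMS formula and its first-order approximation are easy to check numerically; a minimal sketch (function names are ours, not from the challenge kit):

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate Median Significance for total signal weight s and total
    background weight b classified as signal; b_reg regularizes small b."""
    return math.sqrt(2 * ((s + b + b_reg) * math.log(1 + s / (b + b_reg)) - s))

def ams_approx(s, b, b_reg=10.0):
    """First-order approximation, valid when s << b + b_reg."""
    return s / math.sqrt(b + b_reg)
```

For s much smaller than b + b_reg the two expressions agree closely, which is the regime relevant to the challenge.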
Results

[Screenshots: Public Leaderboard and Private (Final) Leaderboard standings]
Classifier: ensembles of decision rules (ENDER) mixed a bit with tree ensembles.
How to optimize AMS?
Research Problem
How to optimize a global function of true/false positives/negatives, not decomposable into individual losses over the observations?

Most popular approach:
Sort the classifier's scores and pick the threshold that maximizes AMS.

[Diagram: scores sorted along an axis with a threshold; examples below the threshold classified as negative, above as positive]

AMS is not used during training, only for tuning the threshold.
Is this approach theoretically justified?
Statistical Learning Theory
h∗ = argmin_h E_{(x,y)}[ ℓ(y, h(x)) ]

ℓ: pointwise (univariate) loss function
Expectation over the entire distribution Pr(x, y)
h∗: optimal classifier
What about multivariate outputs?
Binary classification: Performance measures defined over the entire test set (like in the Higgs Challenge)
Multi-label classification/structured-output prediction: Multivariate output space
Multivariate performance measures
A vector of m labels y = (y1, ..., ym) ∈ {0, 1}^m
A classifier h that delivers ŷ = (ŷ1, ..., ŷm) ∈ {0, 1}^m
A multivariate loss Ψ(y, ŷ)

Two possible approaches:
Decision-Theoretic Approach (DTA)
Empirical Utility Maximization (EUM)
(The names are not perfect.)
Decision Theoretic Approach
Expectation over Ψ(y, ŷ):

h∗ = argmin_h E_y[ Ψ(y, ŷ) ]

Only the joint distribution of y is considered, with x being fixed.
Binary classification: a fixed test set.
Multi-label classification: a test example with y to be predicted.
Empirical Utility Maximization
Ψ over expectations:

h∗ = argmin_h Ψ( TP(h), TN(h), FP(h), FN(h) ), where:

TP(h) = Pr(h(x) = 1 ∧ y = 1),    TN(h) = Pr(h(x) = −1 ∧ y = −1),
FP(h) = Pr(h(x) = 1 ∧ y = −1),   FN(h) = Pr(h(x) = −1 ∧ y = 1).
Research on complex performance measures
P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.
M. Jansche. A maximum expected utility framework for binary sequence labeling. In ACL, pages 736–743, 2007.
W. Kotłowski, K. Dembczyński, and E. Hüllermeier. Bipartite ranking through minimization of univariate loss. In ICML, pages 1113–1120, 2011.
K. Dembczyński, W. Kotłowski, and E. Hüllermeier. Consistent multilabel ranking through univariate losses. In ICML, 2012.
Ye Nan, Kian Ming Adam Chai, Wee Sun Lee, and Hai Leong Chieu. Optimizing F-measure: A Tale of Two Approaches. In ICML, 2012.
Research on complex performance measures
Willem Waegeman, Krzysztof Dembczyński, Arkadiusz Jachnik, Weiwei Cheng, and Eyke Hüllermeier. On the Bayes-optimality of F-measure maximizers. Journal of Machine Learning Research, 15(1):3333–3388, 2014.
Shameem Puthiya Parambath, Nicolas Usunier, and Yves Grandvalet. Optimizing F-measures by cost-sensitive classification. In NIPS 27, 2014.
H. Narasimhan, R. Vaish, and S. Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In NIPS 27, 2014.
Oluwasanmi Koyejo, Nagarajan Natarajan, Pradeep Ravikumar, and Inderjit Dhillon. Consistent binary classification with generalized performance metrics. In NIPS 27, pages 2744–2752, 2014.
Sanmi Koyejo, Nagarajan Natarajan, Pradeep Ravikumar, and Inderjit Dhillon. Consistent multilabel classification. In NIPS 28, 2015.
Optimization of generalized performance metrics
Wojciech Kotłowski and Krzysztof Dembczyński. Surrogate regret bounds for generalized classification performance metrics.
Generalized performance metrics for binary classification
Given a binary classifier h : X → {−1, 1}, define:
Ψ(h) = Ψ( FP(h), FN(h) ), where:

FP(h) = Pr(h(x) = 1 ∧ y = −1),   FN(h) = Pr(h(x) = −1 ∧ y = 1).

              predicted ŷ = h(x)
               −1    +1    total
true y   −1    TN    FP    1 − P
         +1    FN    TP    P
Linear-fractional performance metric
Definition
Ψ(FP, FN) = (a0 + a1·FP + a2·FN) / (b0 + b1·FP + b2·FN)

Examples (with P = Pr(y = 1)):
Accuracy:            Acc = 1 − FN − FP
Fβ-measure:          Fβ = (1 + β²)(P − FN) / ((1 + β²)P − FN + FP)
Jaccard similarity:  J = (P − FN) / (P + FP)
AM measure:          AM = 1 − FN/(2P) − FP/(2(1 − P))
Weighted accuracy:   WA = 1 − w−·FP − w+·FN
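These linear-fractional examples can be coded directly from the rates; a small sketch (function names are ours, P = Pr(y = 1)):

```python
def f_beta(P, FP, FN, beta=1.0):
    """F_beta as a linear-fractional function of the FP and FN rates."""
    b2 = beta ** 2
    return (1 + b2) * (P - FN) / ((1 + b2) * P - FN + FP)

def jaccard(P, FP, FN):
    """Jaccard similarity: TP / (TP + FP + FN), with TP = P - FN."""
    return (P - FN) / (P + FP)

def am_measure(P, FP, FN):
    """Arithmetic mean of recall and specificity, via FP/FN rates."""
    return 1.0 - FN / (2 * P) - FP / (2 * (1 - P))
```

A quick consistency check: for β = 1 the formula reduces to the usual F1 = 2·TP/(2·TP + FP + FN).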
Convex performance metrics
Definition
Ψ(FP, FN) is jointly convex in FP and FN.
Example: AMS2 score

AMS2(TP, FP) = 2( (TP + FP) log(1 + TP/FP) − TP ).
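Joint convexity of this score can be probed numerically; the sketch below performs a midpoint-convexity spot check on random points (a sanity check, not a proof; we take the formula above at face value, with the b_reg term dropped as on the slide):

```python
import math
import random

def ams2(tp, fp):
    """AMS2 score as a function of the TP and FP rates (fp > 0)."""
    return 2 * ((tp + fp) * math.log(1 + tp / fp) - tp)

# Midpoint-convexity spot check: f((p+q)/2) <= (f(p) + f(q))/2.
random.seed(0)
for _ in range(1000):
    p = (random.uniform(0.01, 1.0), random.uniform(0.01, 1.0))
    q = (random.uniform(0.01, 1.0), random.uniform(0.01, 1.0))
    mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    assert ams2(*mid) <= (ams2(*p) + ams2(*q)) / 2 + 1e-9
```

This matches the structure of the function: it is b·h(t/b) for the convex h(u) = (1 + u) log(1 + u) − u, i.e., a perspective of a convex function, hence jointly convex.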
[Plots: the F1-measure and the AMS2 score as functions of the error rates (FP on the horizontal axis)]
A simple approach to optimization of Ψ(h)
training data → learn a real-valued f(x) (using a standard classification tool)
validation data → learn a threshold θ on f(x) by optimizing Ψ(h_{f,θ})

h_{f,θ}(x) = −1 if f(x) ≤ θ,   h_{f,θ}(x) = +1 if f(x) > θ
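The threshold-tuning step can be sketched as follows (a naive O(n²) scan for clarity; in practice a single pass over the sorted scores with incremental FP/FN updates suffices; names are ours):

```python
import numpy as np

def tune_threshold(scores, labels, metric):
    """Pick the threshold theta on f(x) maximizing a metric on validation
    data. labels in {-1, +1}; metric maps rates (P, FP, FN) to a score."""
    order = np.argsort(scores)
    scores, labels = scores[order], labels[order]
    P = np.mean(labels == 1)
    # Candidate thresholds: below all scores, midpoints, above all scores.
    candidates = np.concatenate(([scores[0] - 1.0],
                                 (scores[:-1] + scores[1:]) / 2,
                                 [scores[-1] + 1.0]))
    best_theta, best_val = candidates[0], -np.inf
    for theta in candidates:
        pred = np.where(scores > theta, 1, -1)
        FP = np.mean((pred == 1) & (labels == -1))
        FN = np.mean((pred == -1) & (labels == 1))
        val = metric(P, FP, FN)
        if val > best_val:
            best_theta, best_val = theta, val
    return best_theta, best_val
```

For example, passing `lambda P, FP, FN: 1 - FP - FN` tunes for accuracy; any of the linear-fractional metrics above can be plugged in the same way.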
Our results
[Diagram: training set → f(x); validation set → θ; output h_{f,θ}(x)]

Algorithm:
1. Learn f minimizing a surrogate loss on the training sample.
2. Given f, tune a threshold θ on f on the validation sample by direct optimization of Ψ.

Our results (informally)
Assumptions:
the surrogate loss is strongly proper composite (e.g., logistic, exponential, squared-error loss),
Ψ is linear-fractional or jointly convex.
Claim:
If f is close to the minimizer of the surrogate loss, then h_{f,θ} is close to the maximizer of Ψ.
Ψ-regret and ℓ-regret
Ψ-regret of a classifier h : X → {−1, 1}:

Reg_Ψ(h) = Ψ(h∗) − Ψ(h), where h∗ = argmax_h Ψ(h).

Measures the suboptimality of h with respect to Ψ.

Surrogate loss ℓ(y, f(x)) of a real-valued function f : X → R.
Used in training: logistic loss, squared loss, hinge loss, ...
Expected loss (ℓ-risk) of f:

Risk_ℓ(f) = E_{(x,y)}[ ℓ(y, f(x)) ].

ℓ-regret of f:

Reg_ℓ(f) = Risk_ℓ(f) − Risk_ℓ(f∗), where f∗ = argmin_f Risk_ℓ(f).
Goal: relate the Ψ-regret of h_{f,θ} to the ℓ-regret of f.
Examples of surrogate losses
Logistic loss:

ℓ(y, ŷ) = log(1 + e^{−yŷ}).

Risk minimizer f∗(x):

f∗(x) = log( η(x) / (1 − η(x)) ), where η(x) = Pr(y = 1|x).

An invertible function of the conditional probability η(x).

Hinge loss:

ℓ(y, ŷ) = (1 − yŷ)₊. Its risk minimizer f∗(x) is non-invertible:

f∗(x) = sgn(η(x) − 1/2).
Examples of surrogate losses

[Plot: the 0/1 loss together with the squared-error, logistic, hinge, and exponential losses]
Examples of surrogate losses

loss           f∗(η) = ψ(η)          η(f∗) = ψ⁻¹(f∗)
squared error  2η − 1                (1 + f∗)/2
logistic       log(η/(1 − η))        1/(1 + e^{−f∗})
exponential    (1/2) log(η/(1 − η))  1/(1 + e^{−2f∗})
hinge          sgn(η − 1/2)          does not exist
Proper composite losses
We call ℓ(y, f) proper composite if f∗(x) is an invertible function of the conditional probability η(x) = Pr(y = 1|x).
In other words, there exists a strictly increasing link function ψ such that:

f∗(x) = ψ(η(x)), where η(x) = Pr(y = 1|x).

Minimizing proper composite losses implies probability estimation.
Strongly proper composite losses [Agarwal, 2014]
We call ℓ(y, f) λ-strongly proper composite if it is proper composite and, additionally, for any f, any distribution, and any x:

E_{y|x}[ ℓ(y, f(x)) − ℓ(y, f∗(x)) ] ≥ (λ/2) ( η(x) − ψ⁻¹(f(x)) )².

A technical condition, satisfied by all commonly used proper composite losses.
Strongly proper composite losses: Examples
loss           f∗(η) = ψ(η)          η(f∗) = ψ⁻¹(f∗)     λ
squared error  2η − 1                (1 + f∗)/2           8
logistic       log(η/(1 − η))        1/(1 + e^{−f∗})      4
exponential    (1/2) log(η/(1 − η))  1/(1 + e^{−2f∗})     4
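The links in the table and their inverses can be written down directly; a sketch (the dictionary layout and names are ours) verifying that each ψ is invertible, i.e., ψ⁻¹(ψ(η)) = η:

```python
import math

# Link psi, inverse link, and strong-properness constant lambda for the
# strongly proper composite losses in the table above.
LOSSES = {
    "squared":     dict(psi=lambda eta: 2 * eta - 1,
                        inv=lambda f: (1 + f) / 2,
                        lambda_=8.0),
    "logistic":    dict(psi=lambda eta: math.log(eta / (1 - eta)),
                        inv=lambda f: 1 / (1 + math.exp(-f)),
                        lambda_=4.0),
    "exponential": dict(psi=lambda eta: 0.5 * math.log(eta / (1 - eta)),
                        inv=lambda f: 1 / (1 + math.exp(-2 * f)),
                        lambda_=4.0),
}

# Round-trip check: the inverse link recovers eta from f* = psi(eta).
for name, loss in LOSSES.items():
    for eta in (0.1, 0.5, 0.9):
        assert abs(loss["inv"](loss["psi"](eta)) - eta) < 1e-12
```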
Main result
Theorem (linear-fractional metrics)
Let Ψ(FP, FN) be linear-fractional and non-increasing in FP and FN.
Assume the denominator of Ψ is bounded from below by γ > 0.
Let ℓ be a λ-strongly proper composite loss function. Then there exists a threshold θ∗ such that, for any real-valued function f,

Reg_Ψ(h_{f,θ∗}) ≤ C · sqrt(2/λ) · sqrt(Reg_ℓ(f)),

where C = (1/γ)( Ψ(h∗)(b1 + b2) − (a1 + a2) ) > 0.

metric               γ            C
Fβ-measure           β²P          (1 + β²)/(β²P)
Jaccard similarity   P            (J∗ + 1)/P
AM measure           2P(1 − P)    1/(2P(1 − P))

A similar theorem holds for convex performance metrics (such as AMS2).
Explanation of the theorem
The classifier h∗ maximizing Ψ is h∗(x) = sgn(η(x) − η∗) for some threshold η∗, where η(x) = Pr(y = 1|x).
Minimizing a surrogate loss gives f∗(x).
If ℓ is proper composite, f∗(x) is an increasing, invertible function of η(x): f∗ = ψ(η).
Thresholding η(x) at η∗ is the same as thresholding f∗(x) at θ∗ = ψ(η∗). This yields the optimal classifier h∗.

[Diagram: scores f∗(x) on an axis with the threshold θ∗ = ψ(η∗); below, the corresponding probabilities η(x) = 1/(1 + e^{−f∗(x)}) with threshold η∗]

The gradient of Ψ and the constant λ measure the local variation of Ψ and ℓ when f is not equal to f∗.
The optimal threshold θ∗

When Ψ is classification accuracy, θ∗ = 0 and η∗ = 1/2 (that's why we threshold linear classifiers at 0).
When Ψ is weighted accuracy, η∗ = w− / (w+ + w−).
For more complex measures, θ∗ is unknown, as it depends on Ψ∗, which is itself unknown ...
⟹ we can estimate θ∗ from validation data.
Tuning the threshold
Corollary
Given a real-valued function f and a validation sample of size m, let θ̂ = argmax_θ Ψ̂(h_{f,θ}), where Ψ̂ is the performance metric calculated on the validation sample. Then, under the same assumptions and notation:

Reg_Ψ(h_{f,θ̂}) ≤ C · sqrt(2/λ) · sqrt(Reg_ℓ(f)) + O(1/√m).

Learning a standard binary classifier and tuning the threshold afterwards recovers the maximizer of Ψ in the limit.
Multilabel classification
A vector of m labels y = (y1, ..., ym) for each x.
A multilabel classifier h(x) = (h1(x), ..., hm(x)).
Separate false positive/negative rates for each label:

FP_i(h_i) = Pr(h_i = 1 ∧ y_i = −1),   FN_i(h_i) = Pr(h_i = −1 ∧ y_i = 1).

Given a binary classification performance metric Ψ, how can we use it in the multilabel setting?
We extend our bounds to cover micro- and macro-averaging.
Micro- and macro-averaging
Macro-averaging: average outside Ψ:

Ψ_macro(h) = (1/m) Σ_{i=1}^m Ψ( FP_i(h_i), FN_i(h_i) ).

Our bound suggests that a separate threshold needs to be tuned for each label.

Micro-averaging: average inside Ψ:

Ψ_micro(h) = Ψ( (1/m) Σ_{i=1}^m FP_i(h_i), (1/m) Σ_{i=1}^m FN_i(h_i) ).

Our bound suggests that all labels share a single threshold.
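The two schemes differ only in where the average is taken; a sketch for the F1-measure (per-label rates (P_i, FP_i, FN_i) are assumed given; helper names are ours):

```python
def f1(P, FP, FN):
    """F1-measure from the positive-class rate and error rates."""
    return 2 * (P - FN) / (2 * P - FN + FP)

def macro_f1(stats):
    """stats: list of per-label (P_i, FP_i, FN_i); average OUTSIDE the metric."""
    return sum(f1(P, FP, FN) for P, FP, FN in stats) / len(stats)

def micro_f1(stats):
    """Average the rates first, then apply the metric (average INSIDE)."""
    m = len(stats)
    P = sum(P for P, FP, FN in stats) / m
    FP = sum(FP for P, FP, FN in stats) / m
    FN = sum(FN for P, FP, FN in stats) / m
    return f1(P, FP, FN)
```

When all labels have identical statistics the two averages coincide; they diverge as soon as the labels have different prevalences or error profiles.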
Experiments
Experimental results
Two synthetic and two benchmark data sets.
Surrogates: logistic loss (LR) and hinge loss (SVM).
Performance metrics: the F-measure and the AM measure.
The two-step learning procedure:
1. Minimize surrogate ℓ (logistic or hinge) on the training data.
2. Tune the threshold θ̂ to optimize the performance metric on separate validation data.
Synthetic experiment I
X is a finite set, X = {1, 2, ..., 25}, with Pr(x) uniform.
For each x ∈ X, η(x) is drawn uniformly at random.

[Plots: regret of the F-measure, the AM measure, and the surrogate loss vs. the number of training examples (up to 10,000), for the logistic-loss surrogate (converges as expected) and for the hinge-loss surrogate]
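This first synthetic setup is easy to reproduce; the sketch below is our simplification, in which a plug-in frequency estimate of η stands in for the logistic-loss minimizer (over all functions on a finite X the two coincide), and we track the population F-measure regret of the thresholded estimate as the sample grows:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 25
eta = rng.uniform(size=K)        # eta(x) for x in {0,...,K-1}, Pr(x) uniform
P = eta.mean()                   # Pr(y = 1)

def f1(FP, FN):
    return 2 * (P - FN) / (2 * P - FN + FP)

def best_f1(scores):
    """Population F1 of the best threshold on the given per-x scores."""
    best = 0.0
    for theta in np.concatenate(([-np.inf], np.sort(scores))):
        h = scores > theta
        FP = np.mean((1 - eta) * h)   # Pr(h = +1, y = -1)
        FN = np.mean(eta * ~h)        # Pr(h = -1, y = +1)
        best = max(best, f1(FP, FN))
    return best

f1_opt = best_f1(eta)            # thresholding the true eta is optimal

regrets = []
for n in (100, 1000, 10000):
    x = rng.integers(K, size=n)
    y = rng.uniform(size=n) < eta[x]
    eta_hat = np.array([y[x == k].mean() if np.any(x == k) else 0.5
                        for k in range(K)])
    regrets.append(f1_opt - best_f1(eta_hat))
```

The regret is nonnegative by construction and shrinks toward zero as n grows, mirroring the convergence seen in the plots.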
Synthetic experiment II
x ∈ X = R² generated from a standard Gaussian.
Logistic model: η(x) = (1 + exp(−a0 − aᵀx))⁻¹.

[Plots: convergence of the F-measure regret and of the AM regret vs. the number of training examples (up to 3,000), together with the surrogate regrets, for logistic (LR) and hinge (SVM) surrogates]
Benchmark data experiment – a bit of surprise
dataset         #examples   #features
covtype.binary  581,012     54
gisette         7,000       5,000

[Plots: test-set F-measure and AM measure vs. the number of training examples on covtype.binary and gisette, for LR and SVM surrogates]
Multilabel classification
data set   #labels  #training examples  #test examples  #features
scene      6        1211                1169            294
yeast      14       1500                917             103
mediamill  101      30993               12914           120

Surrogates: logistic loss (LR) and hinge loss (SVM).
Performance metrics: the F-measure and the AM measure.
Macro-averaging (separate threshold for each label) and micro-averaging (single threshold for all labels).
Cross-evaluation of algorithms tuned for micro-averaging under macro-averaged metrics, and vice versa.
Multilabel classification: F-measure

[Plots: macro- and micro-averaged F-measure vs. the number of training examples on scene and yeast; curves for LR and SVM with thresholds tuned for macro-F and for micro-F]
Multilabel classification: AM measure

[Plots: macro- and micro-averaged AM measure vs. the number of training examples on scene and yeast; curves for LR and SVM with thresholds tuned for macro-AM and for micro-AM]
Open problem
The bound

max_h Ψ(h) − Ψ(h_{f,θ}) ≤ const · sqrt( Risk_ℓ(f) − min_f Risk_ℓ(f) )

can be decreased to 0 only if the optimal f is in the class we are optimizing over.
If the optimal f is not in the class, the right-hand side does not converge to 0 even for proper losses ...
... this can, however, be beneficial for non-proper losses (e.g., hinge loss).
Most of the time the optimal f is outside the class we are optimizing over, and a theory for this case is needed.
Summary
Theoretical analysis of the two-step approach to optimizing generalized performance metrics for classification.
Regret bounds for linear-fractional and convex functions optimized by means of strongly proper composite surrogates.
The theorem relates convergence to the global risk minimizer.
Can we say anything about convergence to the risk minimizer within some class of functions?
Why does hinge loss perform so well if the risk minimizer is outside the family of classification functions?