Surrogate regret bounds for generalized classification performance metrics
Wojciech Kotłowski, Krzysztof Dembczyński
Poznań University of Technology, Poland
IDSS Seminar, 22.03.2016
Motivation
Kaggle Higgs Boson Machine Learning Challenge
Data set and rules
# events: 250,000    # features: 30    % signal: 34.3    % signal weight: 0.17
Classes: “background” and “signal”.
Signal: Higgs boson decay h → τ+τ−. Background: Collisions mimicking the signal.
Features: masses, momenta of produced particles, etc.
Weights: cancel over-representation of signal events.
Evaluation: Approximate Median Significance (AMS):

AMS = sqrt( 2 [ (s + b + b_reg) log(1 + s/(b + b_reg)) − s ] ) ≈ s / sqrt(b + b_reg),

where s, b are the total weights of signal/background events classified as signal, and b_reg = 10.
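The AMS formula and its first-order approximation are easy to check numerically; a minimal sketch (function names are ours, not from the challenge kit):

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate Median Significance for total signal weight s and total
    background weight b classified as signal; b_reg regularizes small b."""
    return math.sqrt(2 * ((s + b + b_reg) * math.log(1 + s / (b + b_reg)) - s))

def ams_approx(s, b, b_reg=10.0):
    """First-order approximation, valid when s << b + b_reg."""
    return s / math.sqrt(b + b_reg)
```

For s much smaller than b + b_reg the two expressions agree closely, which is the regime relevant to the challenge.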
Results

[Screenshots: Public Leaderboard and Private (Final) Leaderboard standings]
Classifier: ensembles of decision rules (ENDER) mixed a bit with tree ensembles.
How to optimize AMS?
Research Problem
How to optimize a global function of true/false positives/negatives, not decomposable into individual losses over the observations?

Most popular approach:
Sort the classifier's scores and pick the threshold that maximizes AMS.

[Diagram: scores sorted along an axis with a threshold; examples below the threshold classified as negative, above as positive]

AMS is not used during training, only for tuning the threshold.
Is this approach theoretically justified?
Statistical Learning Theory
h∗ = argmin_h E_{(x,y)}[ ℓ(y, h(x)) ]

ℓ: pointwise (univariate) loss function
Expectation over the entire distribution Pr(x, y)
h∗: optimal classifier
What about multivariate outputs?
Binary classification: Performance measures defined over the entire test set (like in the Higgs Challenge)
Multi-label classification/structured-output prediction: Multivariate output space
Multivariate performance measures
A vector of m labels y = (y1, ..., ym) ∈ {0, 1}^m
A classifier h that delivers ŷ = (ŷ1, ..., ŷm) ∈ {0, 1}^m
A multivariate loss Ψ(y, ŷ)

Two possible approaches:
Decision-Theoretic Approach (DTA)
Empirical Utility Maximization (EUM)
(The names are not perfect.)
Decision Theoretic Approach
Expectation over Ψ(y, ŷ):

h∗ = argmin_h E_y[ Ψ(y, ŷ) ]

Only the joint distribution of y is considered, with x being fixed.
Binary classification: a fixed test set.
Multi-label classification: a test example with y to be predicted.
Empirical Utility Maximization
Ψ over expectations:

h∗ = argmin_h Ψ( TP(h), TN(h), FP(h), FN(h) ), where:

TP(h) = Pr(h(x) = 1 ∧ y = 1),    TN(h) = Pr(h(x) = −1 ∧ y = −1),
FP(h) = Pr(h(x) = 1 ∧ y = −1),   FN(h) = Pr(h(x) = −1 ∧ y = 1).
Research on complex performance measures
P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.
M. Jansche. A maximum expected utility framework for binary sequence labeling. In ACL, pages 736–743, 2007.
W. Kotłowski, K. Dembczyński, and E. Hüllermeier. Bipartite ranking through minimization of univariate loss. In ICML, pages 1113–1120, 2011.
K. Dembczyński, W. Kotłowski, and E. Hüllermeier. Consistent multilabel ranking through univariate losses. In ICML, 2012.
Ye Nan, Kian Ming Adam Chai, Wee Sun Lee, and Hai Leong Chieu. Optimizing F-measure: A Tale of Two Approaches. In ICML, 2012.
Research on complex performance measures
Willem Waegeman, Krzysztof Dembczyński, Arkadiusz Jachnik, Weiwei Cheng, and Eyke Hüllermeier. On the Bayes-optimality of F-measure maximizers. Journal of Machine Learning Research, 15(1):3333–3388, 2014.
Shameem Puthiya Parambath, Nicolas Usunier, and Yves Grandvalet. Optimizing F-measures by cost-sensitive classification. In NIPS 27, 2014.
H. Narasimhan, R. Vaish, and S. Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In NIPS 27, 2014.
Oluwasanmi Koyejo, Nagarajan Natarajan, Pradeep Ravikumar, and Inderjit Dhillon. Consistent binary classification with generalized performance metrics. In NIPS 27, pages 2744–2752, 2014.
Sanmi Koyejo, Nagarajan Natarajan, Pradeep Ravikumar, and Inderjit Dhillon. Consistent multilabel classification. In NIPS 28, 2015.
Optimization of generalized performance metrics
Wojciech Kotłowski and Krzysztof Dembczyński. Surrogate regret bounds for generalized classification performance metrics.
Generalized performance metrics for binary classification
Given a binary classifier h : X → {−1, 1}, define:
Ψ(h) = Ψ( FP(h), FN(h) ), where:

FP(h) = Pr(h(x) = 1 ∧ y = −1),   FN(h) = Pr(h(x) = −1 ∧ y = 1).

              predicted ŷ = h(x)
               −1    +1    total
true y   −1    TN    FP    1 − P
         +1    FN    TP    P
Linear-fractional performance metric
Definition
Ψ(FP, FN) = (a0 + a1·FP + a2·FN) / (b0 + b1·FP + b2·FN)

Examples (with P = Pr(y = 1)):
Accuracy:            Acc = 1 − FN − FP
Fβ-measure:          Fβ = (1 + β²)(P − FN) / ((1 + β²)P − FN + FP)
Jaccard similarity:  J = (P − FN) / (P + FP)
AM measure:          AM = 1 − FN/(2P) − FP/(2(1 − P))
Weighted accuracy:   WA = 1 − w−·FP − w+·FN
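These linear-fractional examples can be coded directly from the rates; a small sketch (function names are ours, P = Pr(y = 1)):

```python
def f_beta(P, FP, FN, beta=1.0):
    """F_beta as a linear-fractional function of the FP and FN rates."""
    b2 = beta ** 2
    return (1 + b2) * (P - FN) / ((1 + b2) * P - FN + FP)

def jaccard(P, FP, FN):
    """Jaccard similarity: TP / (TP + FP + FN), with TP = P - FN."""
    return (P - FN) / (P + FP)

def am_measure(P, FP, FN):
    """Arithmetic mean of recall and specificity, via FP/FN rates."""
    return 1.0 - FN / (2 * P) - FP / (2 * (1 - P))
```

A quick consistency check: for β = 1 the formula reduces to the usual F1 = 2·TP/(2·TP + FP + FN).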
Convex performance metrics
Definition
Ψ(FP, FN) is jointly convex in FP and FN.
Example: AMS2 score

AMS2(TP, FP) = 2( (TP + FP) log(1 + TP/FP) − TP ).
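Joint convexity of this score can be probed numerically; the sketch below performs a midpoint-convexity spot check on random points (a sanity check, not a proof; we take the formula above at face value, with the b_reg term dropped as on the slide):

```python
import math
import random

def ams2(tp, fp):
    """AMS2 score as a function of the TP and FP rates (fp > 0)."""
    return 2 * ((tp + fp) * math.log(1 + tp / fp) - tp)

# Midpoint-convexity spot check: f((p+q)/2) <= (f(p) + f(q))/2.
random.seed(0)
for _ in range(1000):
    p = (random.uniform(0.01, 1.0), random.uniform(0.01, 1.0))
    q = (random.uniform(0.01, 1.0), random.uniform(0.01, 1.0))
    mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    assert ams2(*mid) <= (ams2(*p) + ams2(*q)) / 2 + 1e-9
```

This matches the structure of the function: it is b·h(t/b) for the convex h(u) = (1 + u) log(1 + u) − u, i.e., a perspective of a convex function, hence jointly convex.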
[Plots: the F1-measure and the AMS2 score as functions of the error rates (FP on the horizontal axis)]
A simple approach to optimization of Ψ(h)
training data → learn a real-valued f(x) (using a standard classification tool)
validation data → learn a threshold θ on f(x) by optimizing Ψ(h_{f,θ})

h_{f,θ}(x) = −1 if f(x) ≤ θ,   h_{f,θ}(x) = +1 if f(x) > θ
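The threshold-tuning step can be sketched as follows (a naive O(n²) scan for clarity; in practice a single pass over the sorted scores with incremental FP/FN updates suffices; names are ours):

```python
import numpy as np

def tune_threshold(scores, labels, metric):
    """Pick the threshold theta on f(x) maximizing a metric on validation
    data. labels in {-1, +1}; metric maps rates (P, FP, FN) to a score."""
    order = np.argsort(scores)
    scores, labels = scores[order], labels[order]
    P = np.mean(labels == 1)
    # Candidate thresholds: below all scores, midpoints, above all scores.
    candidates = np.concatenate(([scores[0] - 1.0],
                                 (scores[:-1] + scores[1:]) / 2,
                                 [scores[-1] + 1.0]))
    best_theta, best_val = candidates[0], -np.inf
    for theta in candidates:
        pred = np.where(scores > theta, 1, -1)
        FP = np.mean((pred == 1) & (labels == -1))
        FN = np.mean((pred == -1) & (labels == 1))
        val = metric(P, FP, FN)
        if val > best_val:
            best_theta, best_val = theta, val
    return best_theta, best_val
```

For example, passing `lambda P, FP, FN: 1 - FP - FN` tunes for accuracy; any of the linear-fractional metrics above can be plugged in the same way.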
Our results
[Diagram: training set → f(x); validation set → θ; output h_{f,θ}(x)]

Algorithm:
1. Learn f minimizing a surrogate loss on the training sample.
2. Given f, tune a threshold θ on f on the validation sample by direct optimization of Ψ.

Our results (informally)
Assumptions:
the surrogate loss is strongly proper composite (e.g., logistic, exponential, squared-error loss),
Ψ is linear-fractional or jointly convex.
Claim:
If f is close to the minimizer of the surrogate loss, then h_{f,θ} is close to the maximizer of Ψ.
Ψ-regret and ℓ-regret
Ψ-regret of a classifier h : X → {−1, 1}:

Reg_Ψ(h) = Ψ(h∗) − Ψ(h), where h∗ = argmax_h Ψ(h).

Measures the suboptimality of h with respect to Ψ.

Surrogate loss ℓ(y, f(x)) of a real-valued function f : X → R.
Used in training: logistic loss, squared loss, hinge loss, ...
Expected loss (ℓ-risk) of f:

Risk_ℓ(f) = E_{(x,y)}[ ℓ(y, f(x)) ].

ℓ-regret of f:

Reg_ℓ(f) = Risk_ℓ(f) − Risk_ℓ(f∗), where f∗ = argmin_f Risk_ℓ(f).
Goal: relate the Ψ-regret of h_{f,θ} to the ℓ-regret of f.
Examples of surrogate losses
Logistic loss:

ℓ(y, ŷ) = log(1 + e^{−yŷ}).

Risk minimizer f∗(x):

f∗(x) = log( η(x) / (1 − η(x)) ), where η(x) = Pr(y = 1|x).

An invertible function of the conditional probability η(x).

Hinge loss:

ℓ(y, ŷ) = (1 − yŷ)₊. Its risk minimizer f∗(x) is non-invertible:

f∗(x) = sgn(η(x) − 1/2).
Examples of surrogate losses

[Plot: the 0/1 loss together with the squared-error, logistic, hinge, and exponential losses]
Examples of surrogate losses

loss           f∗(η) = ψ(η)          η(f∗) = ψ⁻¹(f∗)
squared error  2η − 1                (1 + f∗)/2
logistic       log(η/(1 − η))        1/(1 + e^{−f∗})
exponential    (1/2) log(η/(1 − η))  1/(1 + e^{−2f∗})
hinge          sgn(η − 1/2)          does not exist
Proper composite losses
We call ℓ(y, f) proper composite if f∗(x) is an invertible function of the conditional probability η(x) = Pr(y = 1|x).
In other words, there exists a strictly increasing link function ψ such that:

f∗(x) = ψ(η(x)), where η(x) = Pr(y = 1|x).

Minimizing proper composite losses implies probability estimation.
Strongly proper composite losses [Agarwal, 2014]
We call ℓ(y, f) λ-strongly proper composite if it is proper composite and, additionally, for any f, any distribution, and any x:

E_{y|x}[ ℓ(y, f(x)) − ℓ(y, f∗(x)) ] ≥ (λ/2) ( η(x) − ψ⁻¹(f(x)) )².

A technical condition, satisfied by all commonly used proper composite losses.
Strongly proper composite losses: Examples
loss           f∗(η) = ψ(η)          η(f∗) = ψ⁻¹(f∗)     λ
squared error  2η − 1                (1 + f∗)/2           8
logistic       log(η/(1 − η))        1/(1 + e^{−f∗})      4
exponential    (1/2) log(η/(1 − η))  1/(1 + e^{−2f∗})     4
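The links in the table and their inverses can be written down directly; a sketch (the dictionary layout and names are ours) verifying that each ψ is invertible, i.e., ψ⁻¹(ψ(η)) = η:

```python
import math

# Link psi, inverse link, and strong-properness constant lambda for the
# strongly proper composite losses in the table above.
LOSSES = {
    "squared":     dict(psi=lambda eta: 2 * eta - 1,
                        inv=lambda f: (1 + f) / 2,
                        lambda_=8.0),
    "logistic":    dict(psi=lambda eta: math.log(eta / (1 - eta)),
                        inv=lambda f: 1 / (1 + math.exp(-f)),
                        lambda_=4.0),
    "exponential": dict(psi=lambda eta: 0.5 * math.log(eta / (1 - eta)),
                        inv=lambda f: 1 / (1 + math.exp(-2 * f)),
                        lambda_=4.0),
}

# Round-trip check: the inverse link recovers eta from f* = psi(eta).
for name, loss in LOSSES.items():
    for eta in (0.1, 0.5, 0.9):
        assert abs(loss["inv"](loss["psi"](eta)) - eta) < 1e-12
```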
Main result
Theorem (linear-fractional metrics)
Let Ψ(FP, FN) be linear-fractional and non-increasing in FP and FN.
Assume the denominator of Ψ is bounded from below by γ > 0.
Let ℓ be a λ-strongly proper composite loss function. Then there exists a threshold θ∗ such that, for any real-valued function f,

Reg_Ψ(h_{f,θ∗}) ≤ C · sqrt(2/λ) · sqrt(Reg_ℓ(f)),

where C = (1/γ)( Ψ(h∗)(b1 + b2) − (a1 + a2) ) > 0.

metric               γ            C
Fβ-measure           β²P          (1 + β²)/(β²P)
Jaccard similarity   P            (J∗ + 1)/P
AM measure           2P(1 − P)    1/(2P(1 − P))

A similar theorem holds for convex performance metrics (such as AMS2).
Explanation of the theorem
The classifier h∗ maximizing Ψ is h∗(x) = sgn(η(x) − η∗) for some threshold η∗, where η(x) = Pr(y = 1|x).
Minimizing a surrogate loss gives f∗(x).
If ℓ is proper composite, f∗(x) is an increasing, invertible function of η(x): f∗ = ψ(η).
Thresholding η(x) at η∗ is the same as thresholding f∗(x) at θ∗ = ψ(η∗). This yields the optimal classifier h∗.

[Diagram: scores f∗(x) on an axis with the threshold θ∗ = ψ(η∗); below, the corresponding probabilities η(x) = 1/(1 + e^{−f∗(x)}) with threshold η∗]

The gradient of Ψ and the constant λ measure the local variation of Ψ and ℓ when f is not equal to f∗.
The optimal threshold θ∗

When Ψ is classification accuracy, θ∗ = 0 and η∗ = 1/2 (that's why we threshold linear classifiers at 0).
When Ψ is weighted accuracy, η∗ = w− / (w+ + w−).
For more complex measures, θ∗ is unknown, as it depends on Ψ∗, which is itself unknown ...
⟹ we can estimate θ∗ from validation data.
Tuning the threshold
Corollary
Given a real-valued function f and a validation sample of size m, let θ̂ = argmax_θ Ψ̂(h_{f,θ}), where Ψ̂ is the performance metric calculated on the validation sample. Then, under the same assumptions and notation:

Reg_Ψ(h_{f,θ̂}) ≤ C · sqrt(2/λ) · sqrt(Reg_ℓ(f)) + O(1/√m).

Learning a standard binary classifier and tuning the threshold afterwards recovers the maximizer of Ψ in the limit.
Multilabel classification
A vector of m labels y = (y1, ..., ym) for each x.
A multilabel classifier h(x) = (h1(x), ..., hm(x)).
Separate false positive/negative rates for each label:

FP_i(h_i) = Pr(h_i = 1 ∧ y_i = −1),   FN_i(h_i) = Pr(h_i = −1 ∧ y_i = 1).

Given a binary classification performance metric Ψ, how can we use it in the multilabel setting?
We extend our bounds to cover micro- and macro-averaging.
Micro- and macro-averaging
Macro-averaging: average outside Ψ:

Ψ_macro(h) = (1/m) Σ_{i=1}^m Ψ( FP_i(h_i), FN_i(h_i) ).

Our bound suggests that a separate threshold needs to be tuned for each label.

Micro-averaging: average inside Ψ:

Ψ_micro(h) = Ψ( (1/m) Σ_{i=1}^m FP_i(h_i), (1/m) Σ_{i=1}^m FN_i(h_i) ).

Our bound suggests that all labels share a single threshold.
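The two schemes differ only in where the average is taken; a sketch for the F1-measure (per-label rates (P_i, FP_i, FN_i) are assumed given; helper names are ours):

```python
def f1(P, FP, FN):
    """F1-measure from the positive-class rate and error rates."""
    return 2 * (P - FN) / (2 * P - FN + FP)

def macro_f1(stats):
    """stats: list of per-label (P_i, FP_i, FN_i); average OUTSIDE the metric."""
    return sum(f1(P, FP, FN) for P, FP, FN in stats) / len(stats)

def micro_f1(stats):
    """Average the rates first, then apply the metric (average INSIDE)."""
    m = len(stats)
    P = sum(P for P, FP, FN in stats) / m
    FP = sum(FP for P, FP, FN in stats) / m
    FN = sum(FN for P, FP, FN in stats) / m
    return f1(P, FP, FN)
```

When all labels have identical statistics the two averages coincide; they diverge as soon as the labels have different prevalences or error profiles.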
Experiments
Experimental results
Two synthetic and two benchmark data sets.
Surrogates: logistic loss (LR) and hinge loss (SVM).
Performance metrics: the F-measure and the AM measure.
The two-step learning procedure:
1. Minimize surrogate ℓ (logistic or hinge) on the training data.
2. Tune the threshold θ̂ to optimize the performance metric on separate validation data.
Synthetic experiment I
X is a finite set, X = {1, 2, ..., 25}, with Pr(x) uniform.
For each x ∈ X, η(x) is drawn uniformly at random.

[Plots: regret of the F-measure, the AM measure, and the surrogate loss vs. the number of training examples (up to 10,000), for the logistic-loss surrogate (converges as expected) and for the hinge-loss surrogate]
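This first synthetic setup is easy to reproduce; the sketch below is our simplification, in which a plug-in frequency estimate of η stands in for the logistic-loss minimizer (over all functions on a finite X the two coincide), and we track the population F-measure regret of the thresholded estimate as the sample grows:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 25
eta = rng.uniform(size=K)        # eta(x) for x in {0,...,K-1}, Pr(x) uniform
P = eta.mean()                   # Pr(y = 1)

def f1(FP, FN):
    return 2 * (P - FN) / (2 * P - FN + FP)

def best_f1(scores):
    """Population F1 of the best threshold on the given per-x scores."""
    best = 0.0
    for theta in np.concatenate(([-np.inf], np.sort(scores))):
        h = scores > theta
        FP = np.mean((1 - eta) * h)   # Pr(h = +1, y = -1)
        FN = np.mean(eta * ~h)        # Pr(h = -1, y = +1)
        best = max(best, f1(FP, FN))
    return best

f1_opt = best_f1(eta)            # thresholding the true eta is optimal

regrets = []
for n in (100, 1000, 10000):
    x = rng.integers(K, size=n)
    y = rng.uniform(size=n) < eta[x]
    eta_hat = np.array([y[x == k].mean() if np.any(x == k) else 0.5
                        for k in range(K)])
    regrets.append(f1_opt - best_f1(eta_hat))
```

The regret is nonnegative by construction and shrinks toward zero as n grows, mirroring the convergence seen in the plots.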
Synthetic experiment II
x ∈ X = R² generated from a standard Gaussian.
Logistic model: η(x) = (1 + exp(−a0 − aᵀx))⁻¹.

[Plots: convergence of the F-measure regret and of the AM regret vs. the number of training examples (up to 3,000), together with the surrogate regrets, for logistic (LR) and hinge (SVM) surrogates]
Benchmark data experiment – a bit of surprise
dataset         #examples   #features
covtype.binary  581,012     54
gisette         7,000       5,000

[Plots: test-set F-measure and AM measure vs. the number of training examples on covtype.binary and gisette, for LR and SVM surrogates]
Multilabel classification
data set   #labels  #training examples  #test examples  #features
scene      6        1211                1169            294
yeast      14       1500                917             103
mediamill  101      30993               12914           120

Surrogates: logistic loss (LR) and hinge loss (SVM).
Performance metrics: the F-measure and the AM measure.
Macro-averaging (separate threshold for each label) and micro-averaging (single threshold for all labels).
Cross-evaluation of algorithms tuned for micro-averaging under macro-averaged metrics, and vice versa.
Multilabel classification: F-measure

[Plots: macro- and micro-averaged F-measure vs. the number of training examples on scene and yeast; curves for LR and SVM with thresholds tuned for macro-F and for micro-F]
Multilabel classification: AM measure

[Plots: macro- and micro-averaged AM measure vs. the number of training examples on scene and yeast; curves for LR and SVM with thresholds tuned for macro-AM and for micro-AM]
Open problem
The bound

max_h Ψ(h) − Ψ(h_{f,θ}) ≤ const · sqrt( Risk_ℓ(f) − min_f Risk_ℓ(f) )

can be decreased to 0 only if the optimal f is in the class we are optimizing over.
If the optimal f is not in the class, the right-hand side does not converge to 0 even for proper losses ...
... this can, however, be beneficial for non-proper losses (e.g., hinge loss).
Most of the time the optimal f is outside the class we are optimizing over, and a theory for this case is needed.
Summary
Theoretical analysis of the two-step approach to optimizing generalized performance metrics for classification.
Regret bounds for linear-fractional and convex functions optimized by means of strongly proper composite surrogates.
The theorem relates convergence to the global risk minimizer.
Can we say anything about convergence to the risk minimizer within some class of functions?
Why does hinge loss perform so well if the risk minimizer is outside the family of classification functions?