Multi-class SVMs in the Extreme Classification Regime
Distributed Training Algorithms, Generalization Error Bounds, and Regularization Strategies
Marius Kloft
Joint work with Julian Zimmert (HU Berlin), Yunwen Lei (CU Hong Kong), Maximilian Albers (Berlin Big Data Center), Urun Dogan (Microsoft Research), Moustapha Cisse (Facebook Research), and Rohit Babbar (MPI Tübingen)
About::me What’s Extreme Classification? Basics Distributed Algorithms Theory Regularization Strategies References2 / 53
1 About::me
2 What's Extreme Classification?
3 Basics
4 Distributed Algorithms
  All-in-one MC-SVMs
  Parallelization
  Results
5 Theory
6 Regularization Strategies
Machine Learning Group @ HU Berlin

Our topics in research:
- Development of novel machine learning algorithms
- Scaling machine learning algorithms to big data (e.g., via distributed computing)
- Statistical learning theory
- Applications (e.g., in the biomedical domain)

Our topics in teaching:
- Machine Learning
- Data Modeling
What is Multi-class Classification?

Multi-class classification: given a data point x, decide which of C candidate classes it belongs to.
What is Extreme Classification?

Extreme classification is multi-class classification with an extremely large number of classes.
Example 1
We are continuously monitoring the internet for new webpages, which we would like to categorize.
Example 2
We have data from an online biomedical bibliographic database that we want to index for quick access to clinicians.
Example 3
We are collecting data from an online feed of photographs that we would like to classify into image categories.
Example 4
We add new articles to an online encyclopedia and intend to predict the categories of the articles.
Example 5
Given a huge collection of ads, we want to build a classifier from this data.
Need
Support Vector Machine (SVM) is a Popular Method
for Binary Classification (Cortes and Vapnik, ’95)
Core idea:
Support Vector Machine (SVM) is a Popular Method
for Binary Classification
Which hyperplane to take?
Popular Generalization to Multiple Classes: One-vs.-Rest SVM

Let C be the number of classes.

One-vs.-rest SVM:
1 for c = 1..C
2   class1 := c; class2 := union(allOtherClasses)
3   w_c := solutionOfSVM(class1, class2)
4 end
5 Given a test point x, predict c_predicted := arg max_c w_c^T x
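This loop is embarrassingly parallel over c. A minimal runnable sketch in Python (my own toy Pegasos-style subgradient solver standing in for solutionOfSVM; linear scores, no bias term — an illustration, not the implementation used in the talk):

```python
import numpy as np

def train_binary_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Toy Pegasos-style subgradient solver for a linear binary SVM.
    X: (n, d) data, y: labels in {-1, +1}. Returns the weight vector w."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)  # standard Pegasos step size
            if y[i] * (X[i] @ w) < 1.0:
                # margin violated: shrink (regularizer) + hinge subgradient step
                w = (1.0 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1.0 - eta * lam) * w
    return w

def train_ovr(X, y, num_classes):
    """One-vs.-rest: one binary SVM per class (class c vs. union of the rest).
    The loop over c could be distributed over C machines."""
    return np.stack([train_binary_svm(X, np.where(y == c, 1.0, -1.0))
                     for c in range(num_classes)])

def predict_ovr(W, X):
    """Predict arg max_c w_c^T x for each row of X."""
    return np.argmax(X @ W.T, axis=1)
```

On well-separated toy clusters, this recovers the labels almost perfectly.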
Problem With That

:) Training can be parallelized over the number of classes (extreme classification!)
:( It is just a hack: the one-vs.-rest SVM is not built for multiple classes (the coupling between classes is not exploited), and the occurring sub-SVMs have quite unbalanced class sizes.
There are "True" Multi-class SVMs, So-called All-in-one Multi-class SVMs

Binary SVM: Cortes and Vapnik ('95)
MC: Lee, Lin, and Wahba ('04); Weston and Watkins ('99); Crammer and Singer ('02)

Problem: State-of-the-art solvers require a training time complexity of O(dn · C²), where d = dimension, n = number of examples per class, and C = number of classes.

We are going to parallelize them all! We will develop algorithms where O(C) machines in parallel solve the problem in O(dn · C) time.
All-in-one SVMs

All of them have in common that they minimize a trade-off between a regularizer and a loss term:

\min_{w=(w_1,\dots,w_C)} \frac{1}{2} \sum_c \|w_c\|^2 + L(w, \text{data})
Overview of All-in-one SVMs

OVR: \sum_{i=1}^n \Big[ l(w_{y_i}^T x_i) + \sum_{c \neq y_i} l(-w_c^T x_i) \Big]
LLW: \sum_{i=1}^n \sum_{c \neq y_i} l(-w_c^T x_i), \quad \text{s.t. } \sum_c w_c = 0
WW:  \sum_{i=1}^n \sum_{c \neq y_i} l\big((w_{y_i} - w_c)^T x_i\big)
CS:  \sum_{i=1}^n \max_{c \neq y_i} l\big((w_{y_i} - w_c)^T x_i\big)

Sources: Lee, Lin, and Wahba (2004); Weston and Watkins (1999); Crammer and Singer (2002)
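For intuition, the four surrogate objectives can be evaluated per example with the hinge loss l(t) = max(0, 1 − t). A small sketch (function names are my own; rows of W are the class weight vectors; the LLW sum-to-zero constraint on the w_c is not enforced here):

```python
import numpy as np

def hinge(t):
    """Hinge loss l(t) = max(0, 1 - t), elementwise."""
    return np.maximum(0.0, 1.0 - t)

def multiclass_losses(W, x, y):
    """Per-example loss terms of the four all-in-one variants.
    W: (C, d) class weight vectors, x: (d,) input, y: true class index."""
    scores = W @ x                        # w_c^T x for each class c
    margins = scores[y] - scores          # (w_{y_i} - w_c)^T x
    others = np.arange(W.shape[0]) != y   # mask for c != y
    return {
        "OVR": float(hinge(scores[y]) + hinge(-scores[others]).sum()),
        "LLW": float(hinge(-scores[others]).sum()),
        "WW":  float(hinge(margins[others]).sum()),
        "CS":  float(hinge(margins[others]).max()),
    }
```

Since CS takes the maximum of the same nonnegative terms that WW sums, CS ≤ WW holds for every example.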
This is the LLW Dual Problem

\max_\alpha \sum_c \Big( -\tfrac12 \big\| X\big(\alpha_c - \tfrac1C \sum_{\tilde c} \alpha_{\tilde c}\big) \big\|^2 + \sum_{i: y_i = c} \alpha_i \Big)
\quad \text{s.t.} \quad \alpha_{i, y_i} = 0, \quad 0 \le \alpha_{i,c} \le C
This is the LLW Dual Problem

\max_\alpha \sum_c \Big( -\tfrac12 \|X\alpha_c - \bar w\|^2 + \sum_{i: y_i = c} \alpha_i \Big)
\quad \text{s.t.} \quad \alpha_{i, y_i} = 0, \quad 0 \le \alpha_{i,c} \le C, \qquad \bar w = \tfrac1C X \sum_c \alpha_c
This is the LLW Dual Problem

\max_{\alpha, \bar w} \sum_c \underbrace{-\tfrac12 \|X\alpha_c - \bar w\|^2 + \sum_{i: y_i = c} \alpha_i}_{D_c(\alpha_c, \bar w)}
\quad \text{s.t.} \quad \alpha_{i, y_i} = 0, \quad 0 \le \alpha_{i,c} \le C
LLW: Proposed Algorithm

Algorithm 1 Simple wrapper algorithm
1: function SimpleSolve-LLW(C, X, Y)
2:   while not converged do
3:     for c = 1..C do in parallel
4:       α_c ← arg max_{α̃_c} D_c(α̃_c, w̄)
5:     end for
6:     w̄ ← arg max_{w̄} D(α, w̄)
7:   end while
8: end function
Ok, fine so far with the LLW SVM.
WW: This is How the Dual Problem Looks

\max_{\alpha \in \mathbb{R}^{n \times C}} \underbrace{\sum_{c=1}^C \Big( -\tfrac12 \|X\alpha_c\|^2 + \sum_{i: y_i \neq c} \alpha_{i,c} \Big)}_{=: D(\alpha)}
\quad \text{s.t.} \quad \forall i: \ \alpha_{i, y_i} = -\sum_{c \neq y_i} \alpha_{i,c}, \qquad \forall c \neq y_i: \ 0 \le \alpha_{i,c} \le C

A common strategy for optimizing such a dual problem is to optimize one coordinate after another ("dual coordinate ascent"):

1 for i = 1, ..., n
2   for c = 1, ..., C
3     α_{i,c} := arg max_{α_{i,c}} D(α)
4   end
5 end
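For a box-constrained concave quadratic, the inner update in this pattern has a closed form. A hedged toy sketch of dual coordinate ascent (my own instance; it ignores the WW coupling constraint α_{i,y_i} = −Σ_{c≠y_i} α_{i,c}):

```python
import numpy as np

def coordinate_ascent(Q, b, box, sweeps=100):
    """Maximize D(a) = -0.5 a^T Q a + b^T a subject to 0 <= a_i <= box,
    one coordinate at a time; each update is the clipped 1-D maximizer."""
    a = np.zeros(len(b))
    for _ in range(sweeps):
        for i in range(len(b)):
            # partial derivative of D w.r.t. a_i, with the a_i term removed:
            g = b[i] - Q[i] @ a + Q[i, i] * a[i]
            # unconstrained 1-D optimum g / Q_ii, clipped to the box
            a[i] = np.clip(g / Q[i, i], 0.0, box)
    return a
```

With a diagonal Q the updates decouple and a single sweep already reaches the optimum.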
This is now the Story...

We optimize α_{i,c} in the gradient direction:

\frac{\partial D}{\partial \alpha_{i,c}} = 1 - (w_{y_i} - w_c)^T x_i

The derivative depends on only two weight vectors (not all C of them!).
Analogy: Soccer League Schedule

We are given a football league (e.g., the Bundesliga) with C teams. Before the season, we have to decide on a schedule such that each team plays every other team exactly once. Furthermore, all teams shall play on every matchday, so that in total we need only C − 1 matchdays.

Example: The Bundesliga has C = 18 teams ⇒ C − 1 = 17 matchdays (or twice that many if counting home and away matches).
This is a Classical Computer Science Problem...

This is the 1-factorization-of-a-graph problem. The solution is known:
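The known construction is the circle (polygon) method: fix one vertex and rotate the rest. A short sketch (function name is my own), producing C − 1 rounds in which every pair meets exactly once:

```python
def round_robin(C):
    """Circle-method 1-factorization of the complete graph K_C (C even):
    returns C - 1 rounds, each a perfect matching on {0, ..., C-1}."""
    assert C % 2 == 0, "add a dummy 'bye' team if C is odd"
    teams = list(range(C))
    rounds = []
    for _ in range(C - 1):
        # pair the i-th team from the front with the i-th from the back
        rounds.append([(teams[i], teams[C - 1 - i]) for i in range(C // 2)])
        # keep team 0 fixed, rotate the remaining teams by one position
        teams = [teams[0], teams[-1]] + teams[1:-1]
    return rounds
```

For the Bundesliga example, round_robin(18) yields 17 matchdays of 9 matches each.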
WW: Proposed Algorithm

Algorithm 2 Simplistic DBCA wrapper algorithm
1: function SimpleSolve-WW(C, X, Y)
2:   while not converged do
3:     for r = 1..C−1 do                 # iterate over "matchdays"
4:       for m = 1..C/2 do in parallel   # iterate over "matches"
5:         (c_i, c_j) ← the two classes ("opposing teams") of match m
6:         α^I_{c_i,c_j}, α^I_{c_j,c_i} ← arg max_{α_1,α_2} D_c(α_1, α_2)
7:       end for
8:     end for
9:   end while
10: end function
Accuracies

Dataset      #Training  #Test   #Classes  #Features
ALOI         98,200     10,800  1,000     128
LSHTCsmall   4,463      1,858   1,139     51,033
DMOZ2010     128,710    34,880  12,294    381,581

Dataset      OVR     CS      WW      LLW
ALOI         0.1824  0.0974  0.0930  0.6560
LSHTCsmall   0.5490  0.5919  0.5505  0.9263
DMOZ         0.5721  -       0.5432  0.9586

Table: Datasets used in our paper, their properties, and the best test error over a grid of C values.
Results: Speedup

[Figure: speedup vs. number of nodes, for LLW (aloi, dmoz_2010) and WW (dmoz_2010).]
Open questions
- higher efficiency via GPUs?
- parallelization for CS?
Theory and Algorithms in Extreme Classification

- Just saw: algorithms that better handle a large number of classes
- Theory is not prepared for extreme classification:
  - data-dependent bounds scale at least linearly with the number of classes (Koltchinskii and Panchenko, 2002; Mohri et al., 2012; Kuznetsov et al., 2014)

Questions
- Can we get bounds with a mild dependence on the number of classes? ⇒ Novel algorithms?
Multi-class Classification

Given:
- training data z_1 = (x_1, y_1), ..., z_n = (x_n, y_n) ∈ X × Y, drawn i.i.d. from P
- Y := {1, 2, ..., C}
- C = number of classes

[Figure: example images for the classes aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person.]
Formal Problem Setting

Aim:
- Define a hypothesis class H of functions h = (h_1, ..., h_C)
- Find an h ∈ H that "predicts well" via \hat y := \arg\max_{y \in Y} h_y(x)

Multi-class SVMs:
- h_y(x) = \langle w_y, \phi(x) \rangle
- Introduce the notion of the (multi-class) margin
  \rho_h(x, y) := h_y(x) - \max_{y': y' \neq y} h_{y'}(x)
- the larger the margin, the better
Types of Generalization Bounds for Multi-class Classification

Data-independent bounds
- based on covering numbers (Guermeur, 2002; Zhang, 2004a,b; Hill and Doucet, 2007)
- conservative
- unable to adapt to the data

Data-dependent bounds
- based on Rademacher complexity (Koltchinskii and Panchenko, 2002; Mohri et al., 2012; Cortes et al., 2013; Kuznetsov et al., 2014)
+ tighter
+ able to capture the real data
+ computable from the data
Rademacher & Gaussian Complexity

Definition
- Let \sigma_1, ..., \sigma_n be independent Rademacher variables (taking the values ±1 with equal probability).
- The Rademacher complexity (RC) is defined as
  R(H) := E_\sigma \sup_{h \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i h(z_i)

Definition
- Let g_1, ..., g_n ~ N(0, 1) be independent.
- The Gaussian complexity (GC) is defined as
  G(H) := E_g \sup_{h \in H} \frac{1}{n} \sum_{i=1}^n g_i h(z_i)

Interpretation: RC and GC reflect the ability of the hypothesis class to correlate with random noise.

Theorem (Ledoux and Talagrand, 1991)
  R(H) \le \sqrt{\pi/2}\, G(H) \le 3\sqrt{\pi/2}\, \sqrt{\log n}\, R(H).
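Data-dependence means these quantities can actually be estimated on a sample. A Monte Carlo sketch for the empirical RC of the linear class {x ↦ ⟨w, x⟩ : ‖w‖₂ ≤ 1}, where by Cauchy-Schwarz the supremum has the closed form ‖(1/n) Σᵢ σᵢ xᵢ‖₂ (function name and defaults are my own):

```python
import numpy as np

def empirical_rademacher_linear(X, trials=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{||w||<=1} (1/n) sum_i sigma_i <w, x_i>.
    The inner supremum equals ||(1/n) sum_i sigma_i x_i||_2 (Cauchy-Schwarz)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(trials, n))  # Rademacher draws
    return float(np.linalg.norm(sigma @ X, axis=1).mean() / n)
```

For n copies of a single unit vector the estimate is ≈ √(2/(πn)), consistent with the usual O(1/√n) decay.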
Existing Data-Dependent Analysis

The key step is estimating R({\rho_h : h \in H}), induced by the margin operator \rho_h and the class H.

Existing bounds build on the structural result

  R(\{\max\{h_1, ..., h_C\} : h_c \in H_c, c = 1, ..., C\}) \le \sum_{c=1}^C R(H_c).   (1)

The correlation among the class-wise components is ignored.

Best known dependence on the number of classes:
- quadratic dependence: Koltchinskii and Panchenko (2002); Mohri et al. (2012); Cortes et al. (2013)
- linear dependence: Kuznetsov et al. (2014)
A New Structural Lemma on Gaussian Complexities

We consider the Gaussian complexity. We show:

  G(\{\max\{h_1, ..., h_C\} : h = (h_1, ..., h_C) \in H\}) \le \frac{1}{n} E_g \sup_{h=(h_1,...,h_C) \in H} \sum_{i=1}^n \sum_{c=1}^C g_{ic} h_c(x_i).   (2)

Core idea: the comparison inequality for Gaussian processes (Slepian, 1962). Define, for all h \in H,

  X_h := \sum_{i=1}^n g_i \max\{h_1(x_i), ..., h_C(x_i)\}, \qquad Y_h := \sum_{i=1}^n \sum_{c=1}^C g_{ic} h_c(x_i).

Then E[(X_\theta - X_{\bar\theta})^2] \le E[(Y_\theta - Y_{\bar\theta})^2] \implies E[\sup_{\theta \in \Theta} X_\theta] \le E[\sup_{\theta \in \Theta} Y_\theta].
Example on Comparison of the Structural Lemma

- Consider H := \{(x_1, x_2) \mapsto (h_1, h_2)(x_1, x_2) = (w_1 x_1, w_2 x_2) : \|(w_1, w_2)\|_2 \le 1\}.
- For the function class \{\max\{h_1, h_2\} : h = (h_1, h_2) \in H\}, the classical structural result gives the decoupled bound

  \sup_{(h_1,h_2) \in H} \sum_{i=1}^n \sigma_i h_1(x_i) + \sup_{(h_1,h_2) \in H} \sum_{i=1}^n \sigma_i h_2(x_i),

  whereas the new lemma yields the coupled quantity

  \sup_{(h_1,h_2) \in H} \sum_{i=1}^n [g_{i1} h_1(x_i) + g_{i2} h_2(x_i)].
Estimating the Multi-class Gaussian Complexity

- Consider a vector-valued function class H := \{h_w = (\langle w_1, \phi(x)\rangle, ..., \langle w_C, \phi(x)\rangle) : f(w) \le \Lambda\}, where f is \beta-strongly convex w.r.t. \|\cdot\|:
  f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha) f(y) - \frac{\beta}{2}\alpha(1-\alpha)\|x - y\|^2.

Theorem
  \frac{1}{n} E_g \sup_{h_w \in H} \sum_{i=1}^n \sum_{c=1}^C g_{ic} h_{w_c}(x_i) \le \frac{1}{n} \sqrt{\frac{2\pi\Lambda}{\beta}\, E_g \Big\| \Big(\sum_{i=1}^n g_{ic}\phi(x_i)\Big)_{c=1}^C \Big\|_*^2 }.   (3)
Features of the Complexity Bound

- Applies to a general function class defined through a strongly convex regularizer f.
- The class-wise components h_1, ..., h_C are correlated through the term \big\| \big(\sum_{i=1}^n g_{ic}\phi(x_i)\big)_{c=1}^C \big\|_*.
- Consider the class H_{p,\Lambda} := \{h_w : \|w\|_{2,p} \le \Lambda\} (with 1/p + 1/p^* = 1); then

  \frac{1}{n} E_g \sup_{h_w \in H_{p,\Lambda}} \sum_{i=1}^n \sum_{c=1}^C g_{ic} h_{w_c}(x_i) \le \frac{\Lambda}{n} \sqrt{\sum_{i=1}^n k(x_i, x_i)} \times \begin{cases} \sqrt{e}\,(4\log C)^{\frac12 + \frac{1}{2\log C}}, & \text{if } p^* \ge 2\log C,\\ (2p^*)^{\frac12 + \frac{1}{p^*}}\, C^{\frac{1}{p^*}}, & \text{otherwise.} \end{cases}

The dependence on C is sublinear for 1 \le p \le 2, and even logarithmic as p \to 1.
ℓp-norm Multi-class SVM

Motivated by the mild dependence on C as p → 1, we consider the (ℓp-norm) multi-class SVM, 1 ≤ p ≤ 2:

  (P) \min_w \frac12 \Big[\sum_{c=1}^C \|w_c\|_2^p\Big]^{\frac2p} + C\sum_{i=1}^n (1 - t_i)_+,
      \text{s.t. } t_i = \langle w_{y_i}, \phi(x_i)\rangle - \max_{y \neq y_i} \langle w_y, \phi(x_i)\rangle.

Dual problem:

  (D) \sup_{\alpha \in \mathbb{R}^{n\times C}} -\frac12 \Big[\sum_{c=1}^C \Big\|\sum_{i=1}^n \alpha_{ic}\phi(x_i)\Big\|_2^{\frac{p}{p-1}}\Big]^{\frac{2(p-1)}{p}} + \sum_{i=1}^n \alpha_{i y_i}
      \text{s.t. } \alpha_i \le e_{y_i}\cdot C \ \wedge\ \alpha_i \cdot \mathbf{1} = 0, \quad \forall i = 1, ..., n.
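As a small sanity check on the primal (P), its regularizer is half a squared ℓ_{2,p} block norm of the stacked weight matrix. A sketch (function name is my own; rows of W are the class vectors w_c):

```python
import numpy as np

def lp_block_regularizer(W, p):
    """0.5 * (sum_c ||w_c||_2^p)^(2/p): the l_p-norm MC-SVM regularizer.
    For p = 2 it reduces to the usual 0.5 * sum_c ||w_c||_2^2."""
    norms = np.linalg.norm(W, axis=1)  # per-class l2 norms ||w_c||_2
    return float(0.5 * (norms ** p).sum() ** (2.0 / p))
```

As p → 1 the block norm couples the classes more strongly (a group-sparsity flavor across classes); p = 2 decouples them into the standard sum-of-squares regularizer.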
Equivalent Formulation

We introduce class weights \beta_1, ..., \beta_C to obtain a quadratic dual:

  \min_\beta \frac12 \sum_{c=1}^C \frac{\|w_c\|^2}{\beta_c} + \lambda\|\beta\|_p^p \quad \text{has its optimum at} \quad \beta_c \propto \sqrt[p+1]{\|w_c\|^2}.

Equivalent problem:

  (E) \min_{w,\beta} \sum_{c=1}^C \frac{\|w_c\|_2^2}{2\beta_c} + C\sum_{i=1}^n (1 - t_i)_+
      \text{s.t. } t_i \le \langle w_{y_i}, \phi(x_i)\rangle - \langle w_y, \phi(x_i)\rangle, \ y \neq y_i, \ i = 1, ..., n,
      \|\beta\|_{\bar p} \le 1, \ \bar p = p(2-p)^{-1}, \ \beta_j \ge 0.
Empirical Results

Description of the datasets used in the experiments:

Dataset       #Classes  #Training Examples  #Test Examples  #Attributes
Sector        105       6,412               3,207           55,197
News 20       20        15,935              3,993           62,060
Rcv1          53        15,564              518,571         47,236
Birds 50      200       9,958               1,830           4,096
Caltech 256   256       12,800              16,980          4,096

Empirical results (accuracies):

Method / Dataset   Sector     News 20    Rcv1       Birds 50   Caltech 256
ℓp-norm MC-SVM    94.2±0.3   86.2±0.1   85.7±0.7   27.9±0.2   56.0±1.2
Crammer & Singer   93.9±0.3   85.1±0.3   85.2±0.3   26.3±0.3   55.0±1.1

The proposed ℓp-norm MC-SVM is consistently better on the benchmark datasets.
Future Directions

Theory: a data-dependent bound independent of the class size?
- ⇒ Need a more powerful structural result on the Gaussian complexity of function classes induced by the maximum operator.
- It might be worth looking into ℓ∞-norm covering numbers.

Algorithms: new models & efficient solvers
- Novel models motivated by theory: top-k MC-SVM (Lapin et al., 2015), nuclear-norm regularization, ...
- Scalable algorithms
- Analyze the p > 2 regime
References

C. Cortes, M. Mohri, and A. Rostamizadeh. Multi-class classification with maximum margin multiple kernel. In ICML-13, pages 46–54, 2013.

K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2002.

Y. Guermeur. Combining discriminant models with new multi-class SVMs. Pattern Analysis & Applications, 5(2):168–179, 2002.

S. I. Hill and A. Doucet. A framework for kernel-based multi-category classification. Journal of Artificial Intelligence Research, 30(1):525–564, 2007.

V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, pages 1–50, 2002.

V. Kuznetsov, M. Mohri, and U. Syed. Multi-class deep boosting. In Advances in Neural Information Processing Systems, pages 2501–2509, 2014.

M. Lapin, M. Hein, and B. Schiele. Top-k multiclass SVM. CoRR, abs/1511.06683, 2015. URL http://arxiv.org/abs/1511.06683.

M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer, Berlin, 1991.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67–82, 2004.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

D. Slepian. The one-sided barrier problem for Gaussian noise. Bell System Technical Journal, 41(2):463–501, 1962.

J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In M. Verleysen, editor, Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN), pages 219–224. Evere, Belgium: d-side publications, 1999.

T. Zhang. Class-size independent generalization analysis of some discriminative multi-category classification. In Advances in Neural Information Processing Systems, pages 1625–1632, 2004a.

T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225–1251, 2004b.