Multi-class SVMs in the Extreme Classification Regime
Distributed Training Algorithms, Generalization Error Bounds, and Regularization Strategies
Marius Kloft
Joint work with Julian Zimmert (HU Berlin), Yunwen Lei (CU Hong Kong), Maximilian Albers (Berlin Big Data Center), Urun Dogan (Microsoft Research), Moustapha Cisse (Facebook Research), and Rohit Babbar (MPI Tübingen)
About::me What’s Extreme Classification? Basics Distributed Algorithms Theory Regularization Strategies References2 / 53
1 About::me
2 What's Extreme Classification?
3 Basics
4 Distributed Algorithms
  All-in-one MC-SVMs
  Parallelization
  Results
5 Theory
6 Regularization Strategies
Machine Learning Group @ HU Berlin

Our topics in research:
- Development of novel machine learning algorithms
- Scaling machine learning algorithms to big data (e.g., via distributed computing)
- Statistical learning theory
- Applications (e.g., in the biomedical domain)

Our topics in teaching:
- Machine Learning
- Data Modeling
What is Multi-class Classification?

Multi-class classification: given a data point x, decide which of C candidate classes it belongs to.
What is Extreme Classification?

Extreme classification is multi-class classification with an extremely large number of classes.
Example 1
We are continuously monitoring the internet for new webpages, which we would like to categorize.
Example 2
We have data from an online biomedical bibliographic database that we want to index for quick access to clinicians.
Example 3
We are collecting data from an online feed of photographs that we would like to classify into image categories.
Example 4
We add new articles to an online encyclopedia and intend to predict the categories of the articles.
Example 5
Given a huge collection of ads, we want to build a classifier from this data.
Need
Support Vector Machine (SVM) is a Popular Method
for Binary Classification (Cortes and Vapnik, ’95)
Core idea:
Support Vector Machine (SVM) is a Popular Method
for Binary Classification
Which hyperplane to take?
Popular Generalization to Multiple Classes: One-vs.-Rest SVM

Let C be the number of classes.

One-vs.-rest SVM:
1 for c = 1..C
2   class1 := c; class2 := union(allOtherClasses)
3   w_c := solutionOfSVM(class1, class2)
4 end
5 Given a test point x, predict c_predicted := arg max_c w_c^T x
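This loop is embarrassingly parallel over c. A minimal runnable sketch in Python (my own toy Pegasos-style subgradient solver standing in for solutionOfSVM; linear scores, no bias term — an illustration, not the implementation used in the talk):

```python
import numpy as np

def train_binary_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Toy Pegasos-style subgradient solver for a linear binary SVM.
    X: (n, d) data, y: labels in {-1, +1}. Returns the weight vector w."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)  # standard Pegasos step size
            if y[i] * (X[i] @ w) < 1.0:
                # margin violated: shrink (regularizer) + hinge subgradient step
                w = (1.0 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1.0 - eta * lam) * w
    return w

def train_ovr(X, y, num_classes):
    """One-vs.-rest: one binary SVM per class (class c vs. union of the rest).
    The loop over c could be distributed over C machines."""
    return np.stack([train_binary_svm(X, np.where(y == c, 1.0, -1.0))
                     for c in range(num_classes)])

def predict_ovr(W, X):
    """Predict arg max_c w_c^T x for each row of X."""
    return np.argmax(X @ W.T, axis=1)
```

On well-separated toy clusters, this recovers the labels almost perfectly.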
Problem With That

:) Training can be parallelized over the number of classes (extreme classification!)
:( It is just a hack: the one-vs.-rest SVM is not built for multiple classes (the coupling between classes is not exploited), and the occurring sub-SVMs have quite unbalanced class sizes.
There are "True" Multi-class SVMs, So-called All-in-one Multi-class SVMs

Binary SVM: Cortes and Vapnik ('95)
MC: Lee, Lin, and Wahba ('04); Weston and Watkins ('99); Crammer and Singer ('02)

Problem: State-of-the-art solvers require a training time complexity of O(dn · C²), where d = dimension, n = number of examples per class, and C = number of classes.

We are going to parallelize them all! We will develop algorithms where O(C) machines in parallel solve the problem in O(dn · C) time.
All-in-one SVMs

All of them have in common that they minimize a trade-off between a regularizer and a loss term:

\min_{w=(w_1,\dots,w_C)} \frac{1}{2} \sum_c \|w_c\|^2 + L(w, \text{data})
Overview of All-in-one SVMs

OVR: \sum_{i=1}^n \Big[ l(w_{y_i}^T x_i) + \sum_{c \neq y_i} l(-w_c^T x_i) \Big]
LLW: \sum_{i=1}^n \sum_{c \neq y_i} l(-w_c^T x_i), \quad \text{s.t. } \sum_c w_c = 0
WW:  \sum_{i=1}^n \sum_{c \neq y_i} l\big((w_{y_i} - w_c)^T x_i\big)
CS:  \sum_{i=1}^n \max_{c \neq y_i} l\big((w_{y_i} - w_c)^T x_i\big)

Sources: Lee, Lin, and Wahba (2004); Weston and Watkins (1999); Crammer and Singer (2002)
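For intuition, the four surrogate objectives can be evaluated per example with the hinge loss l(t) = max(0, 1 − t). A small sketch (function names are my own; rows of W are the class weight vectors; the LLW sum-to-zero constraint on the w_c is not enforced here):

```python
import numpy as np

def hinge(t):
    """Hinge loss l(t) = max(0, 1 - t), elementwise."""
    return np.maximum(0.0, 1.0 - t)

def multiclass_losses(W, x, y):
    """Per-example loss terms of the four all-in-one variants.
    W: (C, d) class weight vectors, x: (d,) input, y: true class index."""
    scores = W @ x                        # w_c^T x for each class c
    margins = scores[y] - scores          # (w_{y_i} - w_c)^T x
    others = np.arange(W.shape[0]) != y   # mask for c != y
    return {
        "OVR": float(hinge(scores[y]) + hinge(-scores[others]).sum()),
        "LLW": float(hinge(-scores[others]).sum()),
        "WW":  float(hinge(margins[others]).sum()),
        "CS":  float(hinge(margins[others]).max()),
    }
```

Since CS takes the maximum of the same nonnegative terms that WW sums, CS ≤ WW holds for every example.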
This is the LLW Dual Problem

\max_\alpha \sum_c \Big( -\tfrac12 \big\| X\big(\alpha_c - \tfrac1C \sum_{\tilde c} \alpha_{\tilde c}\big) \big\|^2 + \sum_{i: y_i = c} \alpha_i \Big)
\quad \text{s.t.} \quad \alpha_{i, y_i} = 0, \quad 0 \le \alpha_{i,c} \le C
This is the LLW Dual Problem

\max_\alpha \sum_c \Big( -\tfrac12 \|X\alpha_c - \bar w\|^2 + \sum_{i: y_i = c} \alpha_i \Big)
\quad \text{s.t.} \quad \alpha_{i, y_i} = 0, \quad 0 \le \alpha_{i,c} \le C, \qquad \bar w = \tfrac1C X \sum_c \alpha_c
This is the LLW Dual Problem

\max_{\alpha, \bar w} \sum_c \underbrace{-\tfrac12 \|X\alpha_c - \bar w\|^2 + \sum_{i: y_i = c} \alpha_i}_{D_c(\alpha_c, \bar w)}
\quad \text{s.t.} \quad \alpha_{i, y_i} = 0, \quad 0 \le \alpha_{i,c} \le C
LLW: Proposed Algorithm

Algorithm 1 Simple wrapper algorithm
1: function SimpleSolve-LLW(C, X, Y)
2:   while not converged do
3:     for c = 1..C do in parallel
4:       α_c ← arg max_{α̃_c} D_c(α̃_c, w̄)
5:     end for
6:     w̄ ← arg max_{w̄} D(α, w̄)
7:   end while
8: end function
Ok, fine so far with the LLW SVM.
WW: This is How the Dual Problem Looks

\max_{\alpha \in \mathbb{R}^{n \times C}} \underbrace{\sum_{c=1}^C \Big( -\tfrac12 \|X\alpha_c\|^2 + \sum_{i: y_i \neq c} \alpha_{i,c} \Big)}_{=: D(\alpha)}
\quad \text{s.t.} \quad \forall i: \ \alpha_{i, y_i} = -\sum_{c \neq y_i} \alpha_{i,c}, \qquad \forall c \neq y_i: \ 0 \le \alpha_{i,c} \le C

A common strategy for optimizing such a dual problem is to optimize one coordinate after another ("dual coordinate ascent"):

1 for i = 1, ..., n
2   for c = 1, ..., C
3     α_{i,c} := arg max_{α_{i,c}} D(α)
4   end
5 end
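For a box-constrained concave quadratic, the inner update in this pattern has a closed form. A hedged toy sketch of dual coordinate ascent (my own instance; it ignores the WW coupling constraint α_{i,y_i} = −Σ_{c≠y_i} α_{i,c}):

```python
import numpy as np

def coordinate_ascent(Q, b, box, sweeps=100):
    """Maximize D(a) = -0.5 a^T Q a + b^T a subject to 0 <= a_i <= box,
    one coordinate at a time; each update is the clipped 1-D maximizer."""
    a = np.zeros(len(b))
    for _ in range(sweeps):
        for i in range(len(b)):
            # partial derivative of D w.r.t. a_i, with the a_i term removed:
            g = b[i] - Q[i] @ a + Q[i, i] * a[i]
            # unconstrained 1-D optimum g / Q_ii, clipped to the box
            a[i] = np.clip(g / Q[i, i], 0.0, box)
    return a
```

With a diagonal Q the updates decouple and a single sweep already reaches the optimum.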
This is now the Story...

We optimize α_{i,c} in the gradient direction:

\frac{\partial D}{\partial \alpha_{i,c}} = 1 - (w_{y_i} - w_c)^T x_i

The derivative depends on only two weight vectors (not all C of them!).
Analogy: Soccer League Schedule

We are given a football league (e.g., the Bundesliga) with C teams. Before the season, we have to decide on a schedule such that each team plays every other team exactly once. Furthermore, all teams shall play on every matchday, so that in total we need only C − 1 matchdays.

Example: The Bundesliga has C = 18 teams ⇒ C − 1 = 17 matchdays (or twice that many if counting home and away matches).
This is a Classical Computer Science Problem...

This is the 1-factorization-of-a-graph problem. The solution is known:
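The known construction is the circle (polygon) method: fix one vertex and rotate the rest. A short sketch (function name is my own), producing C − 1 rounds in which every pair meets exactly once:

```python
def round_robin(C):
    """Circle-method 1-factorization of the complete graph K_C (C even):
    returns C - 1 rounds, each a perfect matching on {0, ..., C-1}."""
    assert C % 2 == 0, "add a dummy 'bye' team if C is odd"
    teams = list(range(C))
    rounds = []
    for _ in range(C - 1):
        # pair the i-th team from the front with the i-th from the back
        rounds.append([(teams[i], teams[C - 1 - i]) for i in range(C // 2)])
        # keep team 0 fixed, rotate the remaining teams by one position
        teams = [teams[0], teams[-1]] + teams[1:-1]
    return rounds
```

For the Bundesliga example, round_robin(18) yields 17 matchdays of 9 matches each.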
WW: Proposed Algorithm

Algorithm 2 Simplistic DBCA wrapper algorithm
1: function SimpleSolve-WW(C, X, Y)
2:   while not converged do
3:     for r = 1..C−1 do                 # iterate over "matchdays"
4:       for m = 1..C/2 do in parallel   # iterate over "matches"
5:         (c_i, c_j) ← the two classes ("opposing teams") of match m
6:         α^I_{c_i,c_j}, α^I_{c_j,c_i} ← arg max_{α_1,α_2} D_c(α_1, α_2)
7:       end for
8:     end for
9:   end while
10: end function
Accuracies

Dataset      #Training  #Test   #Classes  #Features
ALOI         98,200     10,800  1,000     128
LSHTCsmall   4,463      1,858   1,139     51,033
DMOZ2010     128,710    34,880  12,294    381,581

Dataset      OVR     CS      WW      LLW
ALOI         0.1824  0.0974  0.0930  0.6560
LSHTCsmall   0.5490  0.5919  0.5505  0.9263
DMOZ         0.5721  -       0.5432  0.9586

Table: Datasets used in our paper, their properties, and the best test error over a grid of C values.
Results: Speedup

[Figure: speedup vs. number of nodes, for LLW (aloi, dmoz_2010) and WW (dmoz_2010).]
Open questions
- higher efficiency via GPUs?
- parallelization for CS?
Theory and Algorithms in Extreme Classification

- Just saw: algorithms that better handle a large number of classes
- Theory is not prepared for extreme classification:
  - data-dependent bounds scale at least linearly with the number of classes (Koltchinskii and Panchenko, 2002; Mohri et al., 2012; Kuznetsov et al., 2014)

Questions
- Can we get bounds with a mild dependence on the number of classes? ⇒ Novel algorithms?
Multi-class Classification

Given:
- training data z_1 = (x_1, y_1), ..., z_n = (x_n, y_n) ∈ X × Y, drawn i.i.d. from P
- Y := {1, 2, ..., C}
- C = number of classes

[Figure: example images for the classes aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person.]
Formal Problem Setting

Aim:
- Define a hypothesis class H of functions h = (h_1, ..., h_C)
- Find an h ∈ H that "predicts well" via \hat y := \arg\max_{y \in Y} h_y(x)

Multi-class SVMs:
- h_y(x) = \langle w_y, \phi(x) \rangle
- Introduce the notion of the (multi-class) margin
  \rho_h(x, y) := h_y(x) - \max_{y': y' \neq y} h_{y'}(x)
- the larger the margin, the better
Types of Generalization Bounds for Multi-class Classification

Data-independent bounds
- based on covering numbers (Guermeur, 2002; Zhang, 2004a,b; Hill and Doucet, 2007)
- conservative
- unable to adapt to the data

Data-dependent bounds
- based on Rademacher complexity (Koltchinskii and Panchenko, 2002; Mohri et al., 2012; Cortes et al., 2013; Kuznetsov et al., 2014)
+ tighter
+ able to capture the real data
+ computable from the data
Rademacher & Gaussian Complexity

Definition
- Let \sigma_1, ..., \sigma_n be independent Rademacher variables (taking the values ±1 with equal probability).
- The Rademacher complexity (RC) is defined as
  R(H) := E_\sigma \sup_{h \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i h(z_i)

Definition
- Let g_1, ..., g_n ~ N(0, 1) be independent.
- The Gaussian complexity (GC) is defined as
  G(H) := E_g \sup_{h \in H} \frac{1}{n} \sum_{i=1}^n g_i h(z_i)

Interpretation: RC and GC reflect the ability of the hypothesis class to correlate with random noise.

Theorem (Ledoux and Talagrand, 1991)
  R(H) \le \sqrt{\pi/2}\, G(H) \le 3\sqrt{\pi/2}\, \sqrt{\log n}\, R(H).
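Data-dependence means these quantities can actually be estimated on a sample. A Monte Carlo sketch for the empirical RC of the linear class {x ↦ ⟨w, x⟩ : ‖w‖₂ ≤ 1}, where by Cauchy-Schwarz the supremum has the closed form ‖(1/n) Σᵢ σᵢ xᵢ‖₂ (function name and defaults are my own):

```python
import numpy as np

def empirical_rademacher_linear(X, trials=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{||w||<=1} (1/n) sum_i sigma_i <w, x_i>.
    The inner supremum equals ||(1/n) sum_i sigma_i x_i||_2 (Cauchy-Schwarz)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(trials, n))  # Rademacher draws
    return float(np.linalg.norm(sigma @ X, axis=1).mean() / n)
```

For n copies of a single unit vector the estimate is ≈ √(2/(πn)), consistent with the usual O(1/√n) decay.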
Existing Data-Dependent Analysis

The key step is estimating R({\rho_h : h \in H}), induced by the margin operator \rho_h and the class H.

Existing bounds build on the structural result

  R(\{\max\{h_1, ..., h_C\} : h_c \in H_c, c = 1, ..., C\}) \le \sum_{c=1}^C R(H_c).   (1)

The correlation among the class-wise components is ignored.

Best known dependence on the number of classes:
- quadratic dependence: Koltchinskii and Panchenko (2002); Mohri et al. (2012); Cortes et al. (2013)
- linear dependence: Kuznetsov et al. (2014)
A New Structural Lemma on Gaussian Complexities

We consider the Gaussian complexity. We show:

  G(\{\max\{h_1, ..., h_C\} : h = (h_1, ..., h_C) \in H\}) \le \frac{1}{n} E_g \sup_{h=(h_1,...,h_C) \in H} \sum_{i=1}^n \sum_{c=1}^C g_{ic} h_c(x_i).   (2)

Core idea: the comparison inequality for Gaussian processes (Slepian, 1962). Define, for all h \in H,

  X_h := \sum_{i=1}^n g_i \max\{h_1(x_i), ..., h_C(x_i)\}, \qquad Y_h := \sum_{i=1}^n \sum_{c=1}^C g_{ic} h_c(x_i).

Then E[(X_\theta - X_{\bar\theta})^2] \le E[(Y_\theta - Y_{\bar\theta})^2] \implies E[\sup_{\theta \in \Theta} X_\theta] \le E[\sup_{\theta \in \Theta} Y_\theta].
Example on Comparison of the Structural Lemma

- Consider H := \{(x_1, x_2) \mapsto (h_1, h_2)(x_1, x_2) = (w_1 x_1, w_2 x_2) : \|(w_1, w_2)\|_2 \le 1\}.
- For the function class \{\max\{h_1, h_2\} : h = (h_1, h_2) \in H\}, the classical structural result gives the decoupled bound

  \sup_{(h_1,h_2) \in H} \sum_{i=1}^n \sigma_i h_1(x_i) + \sup_{(h_1,h_2) \in H} \sum_{i=1}^n \sigma_i h_2(x_i),

  whereas the new lemma yields the coupled quantity

  \sup_{(h_1,h_2) \in H} \sum_{i=1}^n [g_{i1} h_1(x_i) + g_{i2} h_2(x_i)].
Estimating the Multi-class Gaussian Complexity

- Consider a vector-valued function class H := \{h_w = (\langle w_1, \phi(x)\rangle, ..., \langle w_C, \phi(x)\rangle) : f(w) \le \Lambda\}, where f is \beta-strongly convex w.r.t. \|\cdot\|:
  f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha) f(y) - \frac{\beta}{2}\alpha(1-\alpha)\|x - y\|^2.

Theorem
  \frac{1}{n} E_g \sup_{h_w \in H} \sum_{i=1}^n \sum_{c=1}^C g_{ic} h_{w_c}(x_i) \le \frac{1}{n} \sqrt{\frac{2\pi\Lambda}{\beta}\, E_g \Big\| \Big(\sum_{i=1}^n g_{ic}\phi(x_i)\Big)_{c=1}^C \Big\|_*^2 }.   (3)
Features of the Complexity Bound

- Applies to a general function class defined through a strongly convex regularizer f.
- The class-wise components h_1, ..., h_C are correlated through the term \big\| \big(\sum_{i=1}^n g_{ic}\phi(x_i)\big)_{c=1}^C \big\|_*.
- Consider the class H_{p,\Lambda} := \{h_w : \|w\|_{2,p} \le \Lambda\} (with 1/p + 1/p^* = 1); then

  \frac{1}{n} E_g \sup_{h_w \in H_{p,\Lambda}} \sum_{i=1}^n \sum_{c=1}^C g_{ic} h_{w_c}(x_i) \le \frac{\Lambda}{n} \sqrt{\sum_{i=1}^n k(x_i, x_i)} \times \begin{cases} \sqrt{e}\,(4\log C)^{\frac12 + \frac{1}{2\log C}}, & \text{if } p^* \ge 2\log C,\\ (2p^*)^{\frac12 + \frac{1}{p^*}}\, C^{\frac{1}{p^*}}, & \text{otherwise.} \end{cases}

The dependence on C is sublinear for 1 \le p \le 2, and even logarithmic as p \to 1.
ℓp-norm Multi-class SVM

Motivated by the mild dependence on C as p → 1, we consider the (ℓp-norm) multi-class SVM, 1 ≤ p ≤ 2:

  (P) \min_w \frac12 \Big[\sum_{c=1}^C \|w_c\|_2^p\Big]^{\frac2p} + C\sum_{i=1}^n (1 - t_i)_+,
      \text{s.t. } t_i = \langle w_{y_i}, \phi(x_i)\rangle - \max_{y \neq y_i} \langle w_y, \phi(x_i)\rangle.

Dual problem:

  (D) \sup_{\alpha \in \mathbb{R}^{n\times C}} -\frac12 \Big[\sum_{c=1}^C \Big\|\sum_{i=1}^n \alpha_{ic}\phi(x_i)\Big\|_2^{\frac{p}{p-1}}\Big]^{\frac{2(p-1)}{p}} + \sum_{i=1}^n \alpha_{i y_i}
      \text{s.t. } \alpha_i \le e_{y_i}\cdot C \ \wedge\ \alpha_i \cdot \mathbf{1} = 0, \quad \forall i = 1, ..., n.
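As a small sanity check on the primal (P), its regularizer is half a squared ℓ_{2,p} block norm of the stacked weight matrix. A sketch (function name is my own; rows of W are the class vectors w_c):

```python
import numpy as np

def lp_block_regularizer(W, p):
    """0.5 * (sum_c ||w_c||_2^p)^(2/p): the l_p-norm MC-SVM regularizer.
    For p = 2 it reduces to the usual 0.5 * sum_c ||w_c||_2^2."""
    norms = np.linalg.norm(W, axis=1)  # per-class l2 norms ||w_c||_2
    return float(0.5 * (norms ** p).sum() ** (2.0 / p))
```

As p → 1 the block norm couples the classes more strongly (a group-sparsity flavor across classes); p = 2 decouples them into the standard sum-of-squares regularizer.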
Equivalent Formulation

We introduce class weights \beta_1, ..., \beta_C to obtain a quadratic dual:

  \min_\beta \frac12 \sum_{c=1}^C \frac{\|w_c\|^2}{\beta_c} + \lambda\|\beta\|_p^p \quad \text{has its optimum at} \quad \beta_c \propto \sqrt[p+1]{\|w_c\|^2}.

Equivalent problem:

  (E) \min_{w,\beta} \sum_{c=1}^C \frac{\|w_c\|_2^2}{2\beta_c} + C\sum_{i=1}^n (1 - t_i)_+
      \text{s.t. } t_i \le \langle w_{y_i}, \phi(x_i)\rangle - \langle w_y, \phi(x_i)\rangle, \ y \neq y_i, \ i = 1, ..., n,
      \|\beta\|_{\bar p} \le 1, \ \bar p = p(2-p)^{-1}, \ \beta_j \ge 0.
Empirical Results

Description of the datasets used in the experiments:

Dataset       #Classes  #Training Examples  #Test Examples  #Attributes
Sector        105       6,412               3,207           55,197
News 20       20        15,935              3,993           62,060
Rcv1          53        15,564              518,571         47,236
Birds 50      200       9,958               1,830           4,096
Caltech 256   256       12,800              16,980          4,096

Empirical results (accuracies):

Method / Dataset   Sector     News 20    Rcv1       Birds 50   Caltech 256
ℓp-norm MC-SVM    94.2±0.3   86.2±0.1   85.7±0.7   27.9±0.2   56.0±1.2
Crammer & Singer   93.9±0.3   85.1±0.3   85.2±0.3   26.3±0.3   55.0±1.1

The proposed ℓp-norm MC-SVM is consistently better on the benchmark datasets.
Future Directions

Theory: a data-dependent bound independent of the class size?
- ⇒ Need a more powerful structural result on the Gaussian complexity of function classes induced by the maximum operator.
- It might be worth looking into ℓ∞-norm covering numbers.

Algorithms: new models & efficient solvers
- Novel models motivated by theory: top-k MC-SVM (Lapin et al., 2015), nuclear-norm regularization, ...
- Scalable algorithms
- Analyze the p > 2 regime
References

C. Cortes, M. Mohri, and A. Rostamizadeh. Multi-class classification with maximum margin multiple kernel. In ICML-13, pages 46–54, 2013.

K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2002.

Y. Guermeur. Combining discriminant models with new multi-class SVMs. Pattern Analysis & Applications, 5(2):168–179, 2002.

S. I. Hill and A. Doucet. A framework for kernel-based multi-category classification. Journal of Artificial Intelligence Research, 30(1):525–564, 2007.

V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, pages 1–50, 2002.

V. Kuznetsov, M. Mohri, and U. Syed. Multi-class deep boosting. In Advances in Neural Information Processing Systems, pages 2501–2509, 2014.

M. Lapin, M. Hein, and B. Schiele. Top-k multiclass SVM. CoRR, abs/1511.06683, 2015. URL http://arxiv.org/abs/1511.06683.

M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer, Berlin, 1991.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67–82, 2004.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

D. Slepian. The one-sided barrier problem for Gaussian noise. Bell System Technical Journal, 41(2):463–501, 1962.

J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In M. Verleysen, editor, Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN), pages 219–224. Evere, Belgium: d-side publications, 1999.

T. Zhang. Class-size independent generalization analysis of some discriminative multi-category classification. In Advances in Neural Information Processing Systems, pages 1625–1632, 2004a.

T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225–1251, 2004b.