
Multi-class SVMs in the Extreme Classification Regime

Distributed Training Algorithms, Generalization Error Bounds, and Regularization Strategies

Marius Kloft

Joint work with Julian Zimmert (HU Berlin), Yunwen Lei (CU Hong Kong), Maximilian Albers (Berlin Big Data Center), Urun Dogan (Microsoft Research), Moustapha Cisse (Facebook Research), and Rohit Babbar (MPI Tübingen)


1 About::me
2 What's Extreme Classification?
3 Basics
4 Distributed Algorithms
   All-in-one MC-SVMs
   Parallelization
   Results
5 Theory
6 Regularization Strategies


Machine Learning Group @ HU Berlin

Our topics in research:
- Development of novel machine learning algorithms
- Scaling machine learning algorithms up to big data (e.g., via distributed computing)
- Statistical learning theory
- Applications (e.g., in the biomedical domain)

Our topics in teaching:
- Machine Learning
- Data Modeling


What is Multi-class Classification?

Multi-class classification: given a data point x, decide on the class y ∈ {1, ..., C} to which it belongs.


What is Extreme Classification?

Extreme classification is multi-class classification using an extremely large number of classes.


Example 1

We are continuously monitoring the internet for new webpages, which we would like to categorize.


Example 2

We have data from an online biomedical bibliographic database that we want to index for quick access by clinicians.


Example 3

We are collecting data from an online feed of photographs that we would like to classify into image categories.


Example 4

We add new articles to an online encyclopedia and intend to predict the categories of the articles.


Example 5

Given a huge collection of ads, we want to build a classifier from this data.


Need: methods that can handle an extremely large number of classes.


Support Vector Machine (SVM) is a Popular Method for Binary Classification (Cortes and Vapnik, '95)

Core idea: separate the two classes by a hyperplane.


Support Vector Machine (SVM) is a Popular Method for Binary Classification

- Which hyperplane to take? The SVM takes the one with the largest margin to the data.


Popular Generalization to Multiple Classes: One-vs.-Rest SVM

Let C be the number of classes.

One-vs.-rest SVM:
1 for c = 1..C
2   class1 := c, class2 := union(allOtherClasses)
3   w_c := solutionOfSVM(class1, class2)
4 end
5 Given a test point x, predict c_predicted := arg max_c w_c^T x
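A minimal NumPy sketch of this scheme (our illustration, not from the slides; binary_svm_fit stands in for any binary SVM solver and is a hypothetical function):

    import numpy as np

    def train_one_vs_rest(X, y, C, binary_svm_fit):
        """Train C one-vs.-rest weight vectors. binary_svm_fit(X, t), with
        targets t in {+1, -1}, returns a weight vector w (hypothetical solver)."""
        W = np.zeros((C, X.shape[1]))
        for c in range(C):                   # embarrassingly parallel over classes
            t = np.where(y == c, 1.0, -1.0)  # class c vs. the union of all others
            W[c] = binary_svm_fit(X, t)
        return W

    def predict(W, x):
        return int(np.argmax(W @ x))         # c_predicted = arg max_c w_c^T x

Note that the C training runs are independent, which is exactly why one-vs.-rest parallelizes so easily.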


Problem With That

:) Training can be parallelized in the number of classes (extreme classification!)

:( It is just a hack: the one-vs.-rest SVM is not built for multiple classes (the coupling between the classes is not exploited), and the resulting binary sub-SVMs have quite unbalanced class sizes.


There are "True" Multi-class SVMs, So-called All-in-one Multi-class SVMs

Generalizations of the binary SVM to multiple classes: Lee, Lin, and Wahba ('04); Weston and Watkins ('99); Crammer and Singer ('02).

Problem: State-of-the-art solvers require a training time complexity of O(dn · C^2), where d = dim, n = examples_per_class, and C = number_of_classes.

We're gonna parallelize 'em all! We will develop algorithms where O(C) machines in parallel solve the problem in O(dn · C) time.


All-in-one SVMs

All of them have in common that they minimize a trade-off between a regularizer and a loss term:

$$\min_{w=(w_1,\dots,w_C)} \; \frac{1}{2}\sum_{c} \|w_c\|^2 + L(w, \text{data})$$

(24)

About::me What’s Extreme Classification? Basics Distributed Algorithms Theory Regularization Strategies References22 / 53

Overview on All-in-one SVMs

$$\text{OVR:}\quad \sum_{i=1}^{n} \Big[\, \ell(w_{y_i}^{\top} x_i) + \sum_{c \neq y_i} \ell(-w_c^{\top} x_i) \Big]$$

$$\text{LLW:}\quad \sum_{i=1}^{n} \sum_{c \neq y_i} \ell(-w_c^{\top} x_i), \quad \text{s.t. } \sum_{c} w_c = 0$$

$$\text{WW:}\quad \sum_{i=1}^{n} \sum_{c \neq y_i} \ell\big((w_{y_i} - w_c)^{\top} x_i\big)$$

$$\text{CS:}\quad \sum_{i=1}^{n} \max_{c \neq y_i} \ell\big((w_{y_i} - w_c)^{\top} x_i\big)$$

Sources: Lee, Lin, and Wahba (2004); Weston and Watkins (1999); Crammer and Singer (2002)
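These four loss terms are easy to evaluate side by side; here is a hedged NumPy sketch (our own illustration) using the hinge loss l(t) = max(0, 1 - t):

    import numpy as np

    def hinge(t):
        return np.maximum(0.0, 1.0 - t)

    def all_in_one_losses(W, X, y):
        """Evaluate the OVR / LLW / WW / CS loss terms for weights W (C x d),
        data X (n x d), and labels y (n,)."""
        S = X @ W.T                           # S[i, c] = w_c^T x_i
        n = X.shape[0]
        s_true = S[np.arange(n), y]           # w_{y_i}^T x_i
        mask = np.ones_like(S, dtype=bool)
        mask[np.arange(n), y] = False         # selects c != y_i
        ovr = np.sum(hinge(s_true)) + np.sum(hinge(-S[mask]))
        llw = np.sum(hinge(-S[mask]))         # the sum_c w_c = 0 constraint is separate
        margins = s_true[:, None] - S         # (w_{y_i} - w_c)^T x_i
        ww = np.sum(hinge(margins)[mask])
        cs = np.sum(np.max(np.where(mask, hinge(margins), 0.0), axis=1))
        return ovr, llw, ww, cs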


This is the LLW Dual Problem

$$\max_{\alpha} \; \sum_{c} \Big[ -\frac{1}{2} \Big\| X \Big( \alpha_c - \frac{1}{C} \sum_{\tilde c} \alpha_{\tilde c} \Big) \Big\|^2 + \sum_{i:\, y_i = c} \alpha_i \Big] \qquad \text{s.t. } \alpha_{i, y_i} = 0, \quad 0 \le \alpha_{i,c} \le C$$


This is the LLW Dual Problem

$$\max_{\alpha} \; \sum_{c} \Big[ -\frac{1}{2} \| X\alpha_c - \bar w \|^2 + \sum_{i:\, y_i = c} \alpha_i \Big] \qquad \text{s.t. } \alpha_{i, y_i} = 0, \quad 0 \le \alpha_{i,c} \le C, \qquad \bar w = \frac{1}{C} X \sum_{c} \alpha_c$$


This is the LLW Dual Problem

$$\max_{\alpha, \bar w} \; \sum_{c} \underbrace{\Big[ -\frac{1}{2} \| X\alpha_c - \bar w \|^2 + \sum_{i:\, y_i = c} \alpha_i \Big]}_{D_c(\alpha_c, \bar w)} \qquad \text{s.t. } \alpha_{i, y_i} = 0, \quad 0 \le \alpha_{i,c} \le C$$


LLW: Proposed Algorithm

Algorithm 1: Simple wrapper algorithm

1: function SIMPLESOLVE-LLW(C, X, Y)
2:   while not converged do
3:     for c = 1..C do in parallel
4:       α_c ← arg max_{α̃_c} D_c(α̃_c, w̄)
5:     end for
6:     w̄ ← arg max_w D(α, w)
7:   end while
8: end function
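A hedged Python sketch of this wrapper (our illustration; solve_subproblem_c is a hypothetical solver for the per-class block D_c, and X is assumed to hold one example per row, so the slides' X α_c becomes X.T @ alpha_c):

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def simple_solve_llw(C, X, Y, solve_subproblem_c, n_rounds=50):
        """Block-coordinate ascent on the LLW dual: all C blocks alpha_c are
        maximized in parallel for fixed w_bar, then w_bar is recomputed in
        closed form, w_bar = (1/C) X^T (sum_c alpha_c)."""
        n, d = X.shape
        alpha = np.zeros((n, C))
        w_bar = np.zeros(d)
        for _ in range(n_rounds):                       # "while not converged"
            with ProcessPoolExecutor() as pool:
                blocks = list(pool.map(solve_subproblem_c,
                                       [(c, X, Y, w_bar) for c in range(C)]))
            alpha = np.column_stack(blocks)             # alpha_c <- arg max D_c(., w_bar)
            w_bar = X.T @ alpha.sum(axis=1) / C         # w_bar <- arg max_w D(alpha, w)
        return alpha, w_bar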


Ok, fine so far with the LLW SVM.


WW: This is What the Dual Problem Looks Like

$$\max_{\alpha \in \mathbb{R}^{n \times C}} \; \underbrace{\sum_{c=1}^{C} \Big[ -\frac{1}{2} \| X\alpha_c \|^2 + \sum_{i:\, y_i \neq c} \alpha_{i,c} \Big]}_{=:\, D(\alpha)} \qquad \text{s.t. } \forall i:\ \alpha_{i, y_i} = -\sum_{c:\, c \neq y_i} \alpha_{i,c}, \quad \forall c \neq y_i:\ 0 \le \alpha_{i,c} \le C$$

A common strategy to optimize such a dual problem is to optimize one coordinate after another ("dual coordinate ascent"):

1 for i = 1, ..., n
2   for c = 1, ..., C
3     α_{i,c} ← arg max_{α_{i,c}} D(α)
4   end
5 end


This is now the Story...

We optimize α_{i,c} in the gradient direction:

$$\frac{\partial D}{\partial \alpha_{i,c}} = 1 - (w_{y_i} - w_c)^{\top} x_i$$

The derivative depends on only two weight vectors (not on all C of them!).
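A minimal sketch of one such coordinate step (our illustration; the weight convention w_c = -X^T alpha_c and the plain clipped gradient step, rather than an exact line search, are assumptions):

    import numpy as np

    def ww_coordinate_step(i, c, alpha, W, X, y, C_reg, lr=1.0):
        """One dual-coordinate-ascent step on alpha[i, c], c != y[i]. The
        gradient touches only w_{y_i} and w_c (rows of W, one per class);
        the coupling constraint alpha[i, y_i] = -sum_{c != y_i} alpha[i, c]
        and the box 0 <= alpha[i, c] <= C_reg are maintained explicitly."""
        x_i, y_i = X[i], y[i]
        grad = 1.0 - (W[y_i] - W[c]) @ x_i           # dD / d alpha_{i,c}
        new = np.clip(alpha[i, c] + lr * grad, 0.0, C_reg)
        delta = new - alpha[i, c]
        alpha[i, c] = new
        alpha[i, y_i] -= delta                       # keep the row-sum constraint
        W[c]   -= delta * x_i                        # keep W consistent with alpha
        W[y_i] += delta * x_i
        return delta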


Analogy: Soccer League Schedule

We are given a football league (e.g., the Bundesliga) with C teams.

Before the season, we have to decide on a schedule such that each team plays every other team exactly once.

Furthermore, all teams shall play on every matchday, so in total we need only C − 1 matchdays.

Example: The Bundesliga has C = 18 teams ⇒ C − 1 = 17 matchdays (or twice that many if counting home and away matches).


This is a Classical Computer Science Problem...

This is the 1-factorization-of-a-graph problem. The solution is known; a classical construction is sketched below.
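A hedged Python sketch of the standard "circle method" construction (our illustration; it assumes C is even, since an odd C is handled by adding a dummy team that gives its opponent a bye):

    def round_robin_schedule(C):
        """1-factorization of the complete graph K_C: returns C - 1 rounds
        ("matchdays"), each a perfect matching of the C classes ("teams")."""
        teams = list(range(C))
        rounds = []
        for _ in range(C - 1):
            rounds.append([(teams[i], teams[C - 1 - i]) for i in range(C // 2)])
            teams = [teams[0]] + [teams[-1]] + teams[1:-1]   # rotate all but the first
        return rounds

    # e.g., round_robin_schedule(4) -> [[(0, 3), (1, 2)], [(0, 2), (3, 1)], [(0, 1), (2, 3)]]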


WW: Proposed Algorithm

Algorithm 2: Simplistic DBCA wrapper algorithm

1: function SIMPLESOLVE-WW(C, X, Y)
2:   while not converged do
3:     for r = 1..C − 1 do                    # iterate over "matchdays"
4:       for c = 1..C/2 do in parallel        # iterate over "matches"
5:         (c_i, c_j) ← the two classes ("opposing teams") of match c on matchday r
6:         α^I_{c_i, c_j}, α^I_{c_j, c_i} ← arg max_{α_1, α_2} D_c(α_1, α_2)
7:       end for
8:     end for
9:   end while
10: end function
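Wiring the schedule into the wrapper might look as follows (a sketch under assumptions: solve_pair is a hypothetical two-class subproblem solver returning updated dual columns, and round_robin_schedule is the construction above):

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def simple_solve_ww(C, X, Y, solve_pair, n_epochs=20):
        """DBCA wrapper: within one matchday, the C/2 class pairs touch
        disjoint dual blocks, so they can be solved in parallel."""
        alpha = np.zeros((X.shape[0], C))
        for _ in range(n_epochs):                          # "while not converged"
            for matchday in round_robin_schedule(C):       # C - 1 rounds
                with ProcessPoolExecutor() as pool:
                    jobs = [(ci, cj, X, Y, alpha) for (ci, cj) in matchday]
                    for (ci, cj), (a_ci, a_cj) in zip(matchday,
                                                      pool.map(solve_pair, jobs)):
                        alpha[:, ci], alpha[:, cj] = a_ci, a_cj
        return alpha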


Accuracies

Datasets used in our paper and their properties:

Dataset       #Training   #Test    #Classes   #Features
ALOI          98,200      10,800   1,000      128
LSHTCsmall    4,463       1,858    1,139      51,033
DMOZ2010      128,710     34,880   12,294     381,581

Best test error over a grid of values of the regularization parameter C:

Dataset       OVR      CS       WW       LLW
ALOI          0.1824   0.0974   0.0930   0.6560
LSHTCsmall    0.549    0.5919   0.5505   0.9263
DMOZ          0.5721   -        0.5432   0.9586


Results: Speedup

[Figure: speedup vs. number of compute nodes. Left panel (LLW): dmoz_2010 and aloi, 1-32 nodes, speedups up to roughly 25. Right panel (WW): dmoz_2010, 1-16 active nodes, speedups up to roughly 8.]


Open Questions

- Higher efficiency via GPUs?
- A parallelization scheme for CS?


Theory and Algorithms in Extreme Classification

- Just saw: algorithms that better handle a large number of classes
- Theory is not prepared for extreme classification:
  data-dependent bounds scale at least linearly with the number of classes
  (Koltchinskii and Panchenko, 2002; Mohri et al., 2012; Kuznetsov et al., 2014)

Questions:
- Can we get bounds with mild dependence on the number of classes? ⇒ Novel algorithms?


Multi-class Classification

Given:
- Training data $z_1 = (x_1, y_1), \dots, z_n = (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$, drawn i.i.d. from P
- $\mathcal{Y} := \{1, 2, \dots, C\}$
- C = number of classes

[Figure: example image categories - aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person]


Formal Problem Setting

Aim:
- Define a hypothesis class H of functions $h = (h_1, \dots, h_C)$
- Find an $h \in H$ that "predicts well" via $\hat y := \arg\max_{y \in \mathcal{Y}} h_y(x)$

Multi-class SVMs:
- $h_y(x) = \langle w_y, \phi(x) \rangle$
- Introduce the notion of the (multi-class) margin:

$$\rho_h(x, y) := h_y(x) - \max_{y':\, y' \neq y} h_{y'}(x)$$

- The larger the margin, the better.
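For concreteness, a small sketch of the margin computation for linear hypotheses (our illustration; phi is taken as the identity for simplicity):

    import numpy as np

    def multiclass_margin(W, x, y):
        """rho_h(x, y) = h_y(x) - max_{y' != y} h_{y'}(x) for h_c(x) = <w_c, x>.
        Positive iff x would be predicted as class y (with a unique argmax)."""
        scores = W @ x                          # scores[c] = <w_c, x>
        rival = np.max(np.delete(scores, y))    # best competing class score
        return scores[y] - rival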


Types of Generalization Bounds for Multi-class Classification

Data-independent bounds
- based on covering numbers (Guermeur, 2002; Zhang, 2004a,b; Hill and Doucet, 2007)
- conservative
- unable to adapt to the data

Data-dependent bounds
- based on Rademacher complexity (Koltchinskii and Panchenko, 2002; Mohri et al., 2012; Cortes et al., 2013; Kuznetsov et al., 2014)
- tighter
- able to capture the real data
- computable from the data


Rademacher & Gaussian Complexity

Definition. Let $\sigma_1, \dots, \sigma_n$ be independent Rademacher variables (taking only the values ±1, with equal probability). The Rademacher complexity (RC) is defined as

$$R(H) := \mathbb{E}_{\sigma} \sup_{h \in H} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(z_i).$$

Definition. Let $g_1, \dots, g_n \sim N(0, 1)$. The Gaussian complexity (GC) is defined as

$$G(H) := \mathbb{E}_{g} \sup_{h \in H} \frac{1}{n} \sum_{i=1}^{n} g_i h(z_i).$$

Interpretation: RC and GC reflect the ability of the hypothesis class to correlate with random noise.

Theorem (Ledoux and Talagrand, 1991).

$$R(H) \le \sqrt{\frac{\pi}{2}}\, G(H) \le 3 \sqrt{\frac{\pi}{2}} \sqrt{\log n}\, R(H).$$
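The "correlate with random noise" reading can be made concrete by Monte-Carlo estimation; a sketch for the unit-ball linear class (our own illustration), where the supremum has the closed form $(1/n)\|\sum_i \sigma_i x_i\|_2$:

    import numpy as np

    def empirical_rademacher(X, n_draws=1000, seed=0):
        """Monte-Carlo estimate of the empirical RC of
        {x -> <w, x> : ||w||_2 <= 1} on the sample X (n x d)."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        vals = [np.linalg.norm(rng.choice([-1.0, 1.0], size=n) @ X) / n
                for _ in range(n_draws)]
        return float(np.mean(vals))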


Existing Data-Dependent Analysis

The key step is estimating $R(\{\rho_h : h \in H\})$, the complexity induced by the margin operator $\rho_h$ and the class H.

Existing bounds build on the structural result

$$R\big(\big\{\max\{h_1, \dots, h_C\} : h_c \in H_c,\ c = 1, \dots, C\big\}\big) \;\le\; \sum_{c=1}^{C} R(H_c). \tag{1}$$

The correlation among the class-wise components is ignored.

Best known dependence on the number of classes:
- quadratic dependence: Koltchinskii and Panchenko (2002); Mohri et al. (2012); Cortes et al. (2013)
- linear dependence: Kuznetsov et al. (2014)


A New Structural Lemma on Gaussian Complexities

We consider the Gaussian complexity. We show:

$$G\big(\{\max\{h_1, \dots, h_C\} : h = (h_1, \dots, h_C) \in H\}\big) \;\le\; \frac{1}{n} \mathbb{E}_g \sup_{h = (h_1, \dots, h_C) \in H} \sum_{i=1}^{n} \sum_{c=1}^{C} g_{ic}\, h_c(x_i). \tag{2}$$

Core idea: a comparison inequality on Gaussian processes (Slepian, 1962). Define

$$X_h := \sum_{i=1}^{n} g_i \max\{h_1(x_i), \dots, h_C(x_i)\}, \qquad Y_h := \sum_{i=1}^{n} \sum_{c=1}^{C} g_{ic}\, h_c(x_i), \quad \forall h \in H.$$

Then

$$\mathbb{E}\big[(X_\theta - X_{\bar\theta})^2\big] \le \mathbb{E}\big[(Y_\theta - Y_{\bar\theta})^2\big] \;\Longrightarrow\; \mathbb{E}\Big[\sup_{\theta \in \Theta} X_\theta\Big] \le \mathbb{E}\Big[\sup_{\theta \in \Theta} Y_\theta\Big].$$


Example on Comparison of the Structural Lemma

Consider

$$H := \big\{(x_1, x_2) \mapsto (h_1, h_2)(x_1, x_2) = (w_1 x_1, w_2 x_2) : \|(w_1, w_2)\|_2 \le 1\big\}.$$

For the function class $\{\max\{h_1, h_2\} : h = (h_1, h_2) \in H\}$, the classical result (1) leads to the decoupled quantity

$$\sup_{(h_1, h_2) \in H} \sum_{i=1}^{n} \sigma_i h_1(x_i) \;+\; \sup_{(h_1, h_2) \in H} \sum_{i=1}^{n} \sigma_i h_2(x_i),$$

whereas the new lemma (2) uses the coupled quantity

$$\sup_{(h_1, h_2) \in H} \sum_{i=1}^{n} \big[ g_{i1} h_1(x_i) + g_{i2} h_2(x_i) \big].$$
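For this toy class, both suprema have closed forms (two independent maximizations over the unit ball vs. a single joint one), so the gap can be simulated directly; a hedged sketch (our own illustration, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    n, n_draws = 200, 2000
    x = rng.normal(size=(n, 2))              # data points (x_{i1}, x_{i2})

    dec, coup = [], []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)
        g = rng.normal(size=(n, 2))
        # decoupled (old result): sup of w1*a over ||(w1, w2)|| <= 1 equals |a|
        dec.append(abs(sigma @ x[:, 0]) + abs(sigma @ x[:, 1]))
        # coupled (new lemma): single sup equals the Euclidean norm of (a, b)
        coup.append(np.hypot(g[:, 0] @ x[:, 0], g[:, 1] @ x[:, 1]))

    print("decoupled:", np.mean(dec), "coupled:", np.mean(coup))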


Estimating the Multi-class Gaussian Complexity

Consider a vector-valued function class defined by

$$H := \big\{ h^w = \big( \langle w_1, \phi(x) \rangle, \dots, \langle w_C, \phi(x) \rangle \big) : f(w) \le \Lambda \big\},$$

where f is β-strongly convex w.r.t. $\|\cdot\|$, i.e.

$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y) - \frac{\beta}{2} \alpha (1 - \alpha) \|x - y\|^2.$$

Theorem.

$$\frac{1}{n} \mathbb{E}_g \sup_{h^w \in H} \sum_{i=1}^{n} \sum_{c=1}^{C} g_{ic}\, h^w_c(x_i) \;\le\; \frac{1}{n} \sqrt{ \frac{2\pi\Lambda}{\beta}\; \mathbb{E}_g \Big\| \sum_{i=1}^{n} \big( g_{ic} \phi(x_i) \big)_{c=1}^{C} \Big\|_*^2 }. \tag{3}$$


Features of the Complexity Bound

- Applies to a general function class defined through a strongly convex regularizer f
- The class-wise components $h_1, \dots, h_C$ are correlated through the term $\big\| \sum_{i=1}^{n} (g_{ic}\phi(x_i))_{c=1}^{C} \big\|_*^2$
- Consider the class $H_{p,\Lambda} := \{h^w : \|w\|_{2,p} \le \Lambda\}$, $(\frac{1}{p} + \frac{1}{p^*} = 1)$; then:

$$\frac{1}{n} \mathbb{E}_g \sup_{h^w \in H_{p,\Lambda}} \sum_{i=1}^{n} \sum_{c=1}^{C} g_{ic}\, h^w_c(x_i) \;\le\; \frac{\Lambda}{n} \sqrt{ \sum_{i=1}^{n} k(x_i, x_i) } \times \begin{cases} \sqrt{e}\, (4 \log C)^{\frac{1}{2} + \frac{1}{2 \log C}}, & \text{if } p^* \ge 2 \log C, \\ (2 p^*)^{\frac{1}{2} + \frac{1}{p^*}}\, C^{\frac{1}{p^*}}, & \text{otherwise.} \end{cases}$$

The dependence on C is sublinear for 1 ≤ p ≤ 2, and even logarithmic as p → 1.
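To see how mild the dependence is, one can evaluate the C-dependent factor of the bound for several p (a small script of our own; the two-case factor is transcribed from the display above, so the exact constants inherit that reconstruction):

    import numpy as np

    def class_factor(p, C):
        """C-dependent factor of the l_p-norm Gaussian-complexity bound."""
        p_star = p / (p - 1.0) if p > 1.0 else np.inf
        if p_star >= 2.0 * np.log(C):
            return np.sqrt(np.e) * (4.0 * np.log(C)) ** (0.5 + 1.0 / (2.0 * np.log(C)))
        return (2.0 * p_star) ** (0.5 + 1.0 / p_star) * C ** (1.0 / p_star)

    C = 10_000
    for p in (1.0, 1.1, 1.5, 2.0):
        print(f"p = {p}: factor = {class_factor(p, C):.1f}")
    # p -> 1 gives a logarithmic factor (~12 here), p = 2 a sqrt(C)-type factor (~400)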


ℓ_p-norm Multi-class SVM

Motivated by the mild dependence on C as p → 1, we consider the (ℓ_p-norm) multi-class SVM, 1 ≤ p ≤ 2:

$$\min_{w} \; \frac{1}{2} \Big[ \sum_{c=1}^{C} \|w_c\|_2^p \Big]^{\frac{2}{p}} + C \sum_{i=1}^{n} (1 - t_i)_+, \quad \text{s.t. } t_i = \langle w_{y_i}, \phi(x_i) \rangle - \max_{y:\, y \neq y_i} \langle w_y, \phi(x_i) \rangle. \tag{P}$$

Dual problem:

$$\sup_{\alpha \in \mathbb{R}^{n \times C}} \; -\frac{1}{2} \Big[ \sum_{c=1}^{C} \Big\| \sum_{i=1}^{n} \alpha_{ic} \phi(x_i) \Big\|_2^{\frac{p}{p-1}} \Big]^{\frac{2(p-1)}{p}} + \sum_{i=1}^{n} \alpha_{i y_i} \quad \text{s.t. } \alpha_i \le e_{y_i} \cdot C \;\wedge\; \alpha_i \cdot \mathbf{1} = 0, \quad \forall i = 1, \dots, n. \tag{D}$$


Equivalent Formulation

We introduce class weights $\beta_1, \dots, \beta_C$ to get a quadratic dual:

$$\min_{\beta} \; \frac{1}{2} \sum_{c=1}^{C} \frac{\|w_c\|^2}{\beta_c} + \lambda \|\beta\|_p^p \quad \text{has its optimum at} \quad \beta_c \propto \sqrt[p+1]{\|w_c\|^2}.$$

Equivalent problem:

$$\min_{w, \beta} \; \sum_{c=1}^{C} \frac{\|w_c\|_2^2}{2\beta_c} + C \sum_{i=1}^{n} (1 - t_i)_+$$
$$\text{s.t. } t_i \le \langle w_{y_i}, \phi(x_i) \rangle - \langle w_y, \phi(x_i) \rangle, \quad \forall y \neq y_i,\ i = 1, \dots, n,$$
$$\|\beta\|_{\bar p} \le 1, \quad \bar p = p(2 - p)^{-1}, \quad \beta_j \ge 0. \tag{E}$$
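The closed-form β-step suggests a simple alternating scheme; a hedged sketch (our illustration; fit_weighted_mcsvm, a solver for (E) in w at fixed β returning one weight row per class, is hypothetical, and we assume 1 ≤ p < 2 so that p̄ = p/(2 − p) is finite):

    import numpy as np

    def update_beta(W, p):
        """beta_c proportional to ||w_c||^{2/(p+1)}, rescaled so that
        ||beta||_pbar = 1 with pbar = p / (2 - p)."""
        pbar = p / (2.0 - p)
        beta = np.linalg.norm(W, axis=1) ** (2.0 / (p + 1.0))
        return beta / np.linalg.norm(beta, ord=pbar)

    def lp_norm_mcsvm(X, y, n_classes, p, fit_weighted_mcsvm, n_rounds=10):
        """Alternate between solving (E) in w for fixed beta and the
        closed-form beta update."""
        beta = np.full(n_classes, n_classes ** (-(2.0 - p) / p))  # feasible start
        for _ in range(n_rounds):
            W = fit_weighted_mcsvm(X, y, beta)     # hypothetical inner solver
            beta = update_beta(W, p)
        return W, beta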


Empirical Results

Description of the datasets used in the experiments:

Dataset       #Classes   #Training Examples   #Test Examples   #Attributes
Sector        105        6,412                3,207            55,197
News 20       20         15,935               3,993            62,060
Rcv1          53         15,564               518,571          47,236
Birds 50      200        9,958                1,830            4,096
Caltech 256   256        12,800               16,980           4,096

Empirical results:

Method / Dataset    Sector     News 20    Rcv1       Birds 50   Caltech 256
ℓp-norm MC-SVM      94.2±0.3   86.2±0.1   85.7±0.7   27.9±0.2   56.0±1.2
Crammer & Singer    93.9±0.3   85.1±0.3   85.2±0.3   26.3±0.3   55.0±1.1

The proposed ℓp-norm MC-SVM is consistently better on these benchmarks.


Future Directions

Theory: a data-dependent bound independent of the class size?
- Needs a more powerful structural result on the Gaussian complexity of function classes induced by the maximum operator.
- It might be worth looking into ℓ∞-norm covering numbers.

Algorithms: new models & efficient solvers
- Novel models motivated by theory: top-k MC-SVM (Lapin et al., 2015), nuclear-norm regularization, ...
- Scalable algorithms
- Analyze the p > 2 regime


References

C. Cortes, M. Mohri, and A. Rostamizadeh. Multi-class classification with maximum margin multiple kernel. In ICML-13, pages 46-54, 2013.

K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2002.

Y. Guermeur. Combining discriminant models with new multi-class SVMs. Pattern Analysis & Applications, 5(2):168-179, 2002.

S. I. Hill and A. Doucet. A framework for kernel-based multi-category classification. Journal of Artificial Intelligence Research, 30(1):525-564, 2007.

V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, pages 1-50, 2002.

V. Kuznetsov, M. Mohri, and U. Syed. Multi-class deep boosting. In Advances in Neural Information Processing Systems, pages 2501-2509, 2014.

M. Lapin, M. Hein, and B. Schiele. Top-k multiclass SVM. CoRR, abs/1511.06683, 2015. URL http://arxiv.org/abs/1511.06683.

M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer, Berlin, 1991.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67-82, 2004.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

D. Slepian. The one-sided barrier problem for Gaussian noise. Bell System Technical Journal, 41(2):463-501, 1962.

J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In M. Verleysen, editor, Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN), pages 219-224. Evere, Belgium: d-side publications, 1999.

T. Zhang. Class-size independent generalization analysis of some discriminative multi-category classification methods. In Advances in Neural Information Processing Systems, pages 1625-1632, 2004a.

T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225-1251, 2004b.
