Wojciech Kotłowski
Institute of Computing Science, Poznań University of Technology
IDSS, 04.06.2013
1 / 53
1 Example: Online (Stochastic) Gradient Descent
2 Statistical learning theory
3 Online learning
4 Algorithms for prediction with expert advice
5 Algorithms for classification and regression
6 Conclusions
2 / 53
1 Example: Online (Stochastic) Gradient Descent
3 / 53
Data set: Reuters RCV1:
Set of ∼ 800 000 documents from Reuters News published during 1996–1997.
781 265 training examples, 23 149 test examples.
47 152 TF-IDF features.
Each document was assigned one or more topics (categories).
For the sake of illustration, the problem is simplified to binary classification: predicting whether a document belongs to category CCAT (Corporate/Industrial).
4 / 53
<?xml version="1.0" encoding="iso-8859-1" ?>
<newsitem itemid="2330" id="root" date="1996-08-20" xml:lang="en">
<title>USA: Tylan stock jumps; weighs sale of company.</title>
<headline>Tylan stock jumps; weighs sale of company.</headline>
<dateline>SAN DIEGO</dateline>
<text>
<p>The stock of Tylan General Inc. jumped Tuesday after the maker of process-management equipment said it is exploring the sale of the company and added that it has already received some inquiries from potential buyers.</p>
<p>Tylan was up $2.50 to $12.75 in early trading on the Nasdaq market.</p>
<p>The company said it has set up a committee of directors to oversee the sale and that Goldman, Sachs &amp; Co. has been retained as its financial adviser.</p>
</text>
<copyright>(c) Reuters Limited 1996</copyright>
<metadata>
<codes class="bip:countries:1.0">
<code code="USA"> </code>
</codes>
<codes class="bip:industries:1.0">
<code code="I34420"> </code>
</codes>
<codes class="bip:topics:1.0">
<code code="C15"> </code>
<code code="C152"> </code>
<code code="C18"> </code>
<code code="C181"> </code>
<code code="CCAT"> </code>
</codes>
<dc element="dc.publisher" value="Reuters Holdings Plc"/>
<dc element="dc.date.published" value="1996-08-20"/>
<dc element="dc.source" value="Reuters"/>
<dc element="dc.creator.location" value="SAN DIEGO"/>
<dc element="dc.creator.location.country.name" value="USA"/>
<dc element="dc.source" value="Reuters"/>
</metadata>
</newsitem>
5 / 53
Two types of loss functions:
Logistic loss (logistic regression)
Hinge loss (SVM)
Two types of learning:
Standard (batch) setting: minimization of the (regularized) empirical risk on the training data.
Online gradient descent (stochastic gradient descent) with L2 regularization.
Source: http://leon.bottou.org/projects/sgd
6 / 53
Hinge loss
method | comp. time | training error | test error
SVMLight (batch) | 23 642 sec | 0.2275 | 6.02%
SVMPerf (batch) | 66 sec | 0.2278 | 6.03%
SGD | 1.4 sec | 0.2275 | 6.02%
Logistic loss
method | comp. time | training error | test error
LibLinear (batch) | 30 sec | 0.18907 | 5.68%
SGD | 2.3 sec | 0.18893 | 5.66%
7 / 53
Data set | #objects | #features | time LIBSVM | time SGD
Reuters | 781 000 | 47 000 | 2.5 days | 7 sec
Translation | 1 000 000 | 274 000 | many days | 7 sec
SuperTag | 950 000 | 46 000 | 8 h | 1 sec
Voicetone | 579 000 | 88 000 | 10 h | 1 sec
8 / 53
2 Statistical learning theory
9 / 53
[Figure: batch learning. A learning algorithm maps the training set S = {(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)} to a function (classifier) w_S : X → Y. On the test set T = {(x_5, ?), (x_6, ?), (x_7, ?)} the classifier makes predictions ŷ_i = w_S(x_i); the feedback y_5, y_6, y_7 determines the accuracy via the losses ℓ(y_i, ŷ_i).]
10 / 53
Test set performance
Ultimate question of machine learning theory
Given a training set S = {(x_i, y_i)}_{i=1}^n, how to learn a function w_S : X → Y, so that the total loss on a separate test set T = {(x_i, y_i)}_{i=n+1}^m,
∑_{i=n+1}^m ℓ(y_i, w_S(x_i)),
is minimized?
No reasonable answer without any assumptions (“no free lunch”).
Training data and test data must be in some sense similar.
11 / 53
Statistical learning theory
Assumption
Training data and test data were independently generated from the same distribution P (i.i.d.).
Mean training error of w (empirical risk): L_S(w) = (1/n) ∑_{i=1}^n ℓ(y_i, w(x_i)).
Test error = expected error of w (risk): L(w) = E_{(x,y)∼P}[ℓ(y, w(x))].
Given training data S, how to construct w_S to make L(w_S) as small as possible? =⇒ Empirical risk minimization.
12 / 53
Generalization bounds
Theorem for finite classes (0/1 loss) [Occam’s Razor]
Let the class of functions W be finite. If the classifier w_S was trained on S by minimizing the empirical risk within W, w_S = argmin_{w∈W} L_S(w), then:
E_{S∼P} L(w_S) = min_{w∈W} L(w) + O(√(log |W| / n)).
(our test set performance = best test set performance in W + overhead for learning on a finite training set)
13 / 53
Generalization bounds
Theorem for VC classes (0/1 loss) [Vapnik–Chervonenkis, 1971]
Let W have VC dimension d_VC. If the classifier w_S was trained on S by minimizing the empirical risk within W, w_S = argmin_{w∈W} L_S(w), then:
E_{S∼P} L(w_S) = min_{w∈W} L(w) + O(√(d_VC / n)).
(our test set performance = best test set performance in W + overhead for learning on a finite training set)
14 / 53
The bounds are asymptotically tight.
Typically d_VC ∼ d, the number of parameters, so that:
L(w_S) = min_{w∈W} L(w) + O(√(d / n)).
Similar results (sometimes better) for losses other than 0/1.
Holds for any distribution P.
Improvements: data-dependent bounds.
The theory only tells you what happens for a given class W; choosing W is an “art” (domain knowledge).
15 / 53
3 Online learning
16 / 53
Alternative view on learning
Stochastic (i.i.d.) assumption often criticized, sometimes clearly invalid (e.g. time series).
Learning process by its very nature is incremental.
We do not observe the distributions, we only see the data.
Motivation
Can we relax the i.i.d. assumption and treat the data-generating process as completely arbitrary?
Can we obtain performance guarantees based solely on observed quantities?
We can! =⇒ online learning theory
17 / 53
Example: weather prediction (rain/sunny)
i = 1 i = 2 i = 3 i = 4 . . .
50% 25% 10% 25% . . .
18 / 53
Example: weather prediction (rain/sunny)
expert i = 1 i = 2 i = 3 i = 4 . . .
30% 50% 50% 10% . . .
10% 80% 50% 10% . . .
20% 70% 50% 30% . . .
60% 30% 50% 80% . . .
30% 65% 50% 10% . . .
19 / 53
Example: weather prediction (rain/sunny)
Prediction accuracy evaluated by a loss function, e.g. ℓ(y_i, ŷ_i) = |y_i − ŷ_i|.
Total performance evaluated by the regret: the learner’s cumulative loss minus the cumulative loss of the best expert in hindsight.
expert 1 2 3 4 | cumulative loss
30% 50% 50% 10% | 0.3 + 0.5 + 0.5 + 0.1 = 1.4
10% 80% 50% 10% | 0.1 + 0.2 + 0.5 + 0.1 = 0.9
20% 70% 50% 30% | 0.2 + 0.3 + 0.5 + 0.3 = 1.3
60% 30% 50% 80% | 0.6 + 0.7 + 0.5 + 0.8 = 2.6
30% 65% 50% 10% | 0.3 + 0.45 + 0.5 + 0.1 = 1.35
Regret of the learner: 1.35 − 0.9 = 0.45.
The goal is to have small regret for any data sequence.
20 / 53
[Figure: one round of online learning. The learner holds a strategy w_i : X → Y, receives a new instance (x_i, ?), predicts ŷ_i = w_i(x_i), receives the feedback y_i, suffers loss ℓ(y_i, ŷ_i), and moves to round i + 1.]
21 / 53
Set of strategies (actions) W; known loss function ℓ.
The learner starts with some initial strategy (action) w_1. For i = 1, 2, . . .:
1 Learner observes instance x_i.
2 Learner predicts with ŷ_i = w_i(x_i).
3 The environment reveals the outcome y_i.
4 Learner suffers loss ℓ_i(w_i) = ℓ(y_i, ŷ_i).
5 Learner updates its strategy: w_i → w_{i+1}.
22 / 53
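The protocol above can be sketched as a generic loop; everything concrete here (the `ConstantLearner` class, `run_protocol`, the squared loss, the toy data) is an illustrative assumption, not part of the slides.

```python
# A toy instance of the online learning protocol (illustrative names).
class ConstantLearner:
    """A trivial learner whose strategy w_i ignores the instance."""
    def __init__(self, c):
        self.c = c

    def predict(self, x):    # step 2: y_hat_i = w_i(x_i)
        return self.c

    def update(self, x, y):  # step 5: w_i -> w_{i+1} (a no-op here)
        pass

def run_protocol(learner, data, loss):
    """Run steps 1-5 for each round; return the learner's cumulative loss."""
    total = 0.0
    for x, y in data:                # step 1: observe x_i (y_i revealed in step 3)
        y_hat = learner.predict(x)   # step 2: predict
        total += loss(y, y_hat)      # step 4: suffer loss
        learner.update(x, y)         # step 5: update the strategy
    return total

square = lambda y, y_hat: (y - y_hat) ** 2
L_hat = run_protocol(ConstantLearner(0.5), [(None, 1.0), (None, 0.0)], square)
```

Any learner that implements `predict` and `update` can be plugged into the same loop.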
The goal of the learner is to be close to the best w in hindsight.
Cumulative loss of the learner: L̂_n = ∑_{i=1}^n ℓ_i(w_i).
Cumulative loss of the best strategy w in hindsight: L*_n = min_{w∈W} ∑_{i=1}^n ℓ_i(w).
Regret of the learner: R_n = L̂_n − L*_n.
The goal is to minimize regret over all possible data sequences.
23 / 53
N prediction strategies (“experts”) to follow.
Instance x = (x_1, . . . , x_N): the vector of experts’ predictions.
Strategy w = (w_1, . . . , w_N): a probability vector, ∑_{k=1}^N w_k = 1, w_k ≥ 0 (the learner’s beliefs about the experts, or the probability of following a given expert).
Learner’s prediction: ŷ = w(x) = wᵀx = ∑_{k=1}^N w_k x_k.
Regret: competing with the best expert,
R_n = L̂_n − L*_n, where L*_n = min_{k=1,...,N} L_n(k)
and L_n(k) is the cumulative loss of the kth expert.
24 / 53
Combining advice from actual experts :-)
Combining N learning algorithms/classifiers.
Gambling (e.g., horse racing).
Portfolio selection.
Routing/shortest path.
Spam filtering/document classification.
Learning boolean formulae.
Ranking/ordering.
. . .
25 / 53
4 Algorithms for prediction with expert advice
26 / 53
Follow the leader (FTL)
At iteration i, choose the expert with the smallest loss on the past data:
k_min = argmin_{k=1,...,N} ∑_{j=1}^{i−1} ℓ_j(k) = argmin_{k=1,...,N} L_{i−1}(k).
Corresponding strategy w: w_{k_min} = 1, w_k = 0 for k ≠ k_min.
27 / 53
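A minimal Python sketch of FTL (the function name, the tie-breaking toward the lowest index, and the use of expert 0 in the first round are illustrative assumptions):

```python
# Sketch of Follow the Leader over N experts.
def ftl_choices(losses):
    """losses[i][k]: loss of expert k at round i. Returns the expert
    followed at each round, chosen by the smallest cumulative past loss."""
    N = len(losses[0])
    cum = [0.0] * N
    choices = []
    for round_losses in losses:
        leader = min(range(N), key=lambda k: cum[k])  # argmin of L_{i-1}(k)
        choices.append(leader)
        for k in range(N):                            # this round's losses revealed
            cum[k] += round_losses[k]
    return choices

# Expert 0 looks worse after round 1, so FTL switches to expert 1.
choices = ftl_choices([[0.5, 0.4], [1.0, 0.0], [1.0, 0.0]])
```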
Failure of FTL
expert 1 2 3 4 5 6 7 | loss
75% 100% 100% 100% 100% 100% 100% | 3.75
25% 0% 0% 0% 0% 0% 0% | 3.25
50% 0% 100% 0% 100% 0% 100% | 6.5
L̂_n ≃ n, L*_n ≃ n/2, R_n ≃ n/2. The learner must hedge its bets on the experts!
28 / 53
Hedge [Littlestone & Warmuth, 1994; Freund & Schapire, 1997]
Algorithm
Each time expert k receives a loss ℓ(k), multiply the weight w_k associated with that expert by e^{−ηℓ(k)}, where η > 0:
w_{i+1,k} = w_{i,k} e^{−ηℓ_i(k)} / Z_i, where Z_i = ∑_{k=1}^N w_{i,k} e^{−ηℓ_i(k)}.
Unwinding this update:
w_{i+1,k} = e^{−ηL_i(k)} / Z_i, where L_i(k) = ∑_{j≤i} ℓ_j(k) and Z_i = ∑_{k=1}^N e^{−ηL_i(k)}.
29 / 53
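The multiplicative update above fits in a few lines of Python (names are illustrative):

```python
import math

# Sketch of one Hedge step: multiply each weight by exp(-eta * loss)
# and renormalize by Z_i.
def hedge_update(w, losses, eta):
    unnorm = [wk * math.exp(-eta * lk) for wk, lk in zip(w, losses)]
    Z = sum(unnorm)                       # normalization constant Z_i
    return [u / Z for u in unnorm]

w = [0.25, 0.25, 0.25, 0.25]              # uniform prior over 4 experts
w = hedge_update(w, [0.3, 0.1, 0.2, 0.6], eta=2.0)
```

After the update, the expert with the smallest loss carries the largest weight.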
Hedge as Bayesian updates
Prior probability over N alternatives E_1, . . . , E_N; data likelihoods P(D_i | E_k), k = 1, . . . , N:
P(E_k | D_i) = P(D_i | E_k) × P(E_k) / ∑_{j=1}^N P(D_i | E_j) × P(E_j).
Correspondence with Hedge: posterior probability ↔ w_{i+1,k}, data likelihood ↔ e^{−ηℓ_i(k)}, prior probability ↔ w_{i,k}, normalization ↔ Z_i.
30 / 53
Hedge example (η = 2)
[Figure: bars showing the weights over four experts, updated after each round.]
Round 1: expert predictions 30%, 10%, 20%, 60%; learner 30%; losses 0.3, 0.1, 0.2, 0.6; learner’s loss 0.3.
Round 2: expert predictions 50%, 80%, 70%, 30%; learner 64%; losses 0.5, 0.2, 0.3, 0.7; learner’s loss 0.36.
Round 3: expert predictions 50%, 50%, 50%, 50%; learner 50%; all losses 0.5.
Round 4: expert predictions 10%, 10%, 30%, 80%; learner 21%; losses 0.1, 0.1, 0.3, 0.8; learner’s loss 0.21.
31 / 53
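The round-2 aggregate in the example can be checked numerically: starting from uniform weights, one Hedge update with η = 2 on the round-1 losses reweights the experts so that the combined round-2 prediction comes out near the 64% shown (a sketch; variable names are illustrative):

```python
import math

eta = 2.0
losses_round1 = [0.3, 0.1, 0.2, 0.6]   # experts' losses in round 1
preds_round2 = [0.5, 0.8, 0.7, 0.3]    # experts' predictions in round 2

# Hedge weights after round 1 (the uniform prior cancels in the normalization).
unnorm = [math.exp(-eta * l) for l in losses_round1]
Z = sum(unnorm)
w = [u / Z for u in unnorm]

# The learner's round-2 prediction is the weighted combination w^T x.
y_hat = sum(wk * pk for wk, pk in zip(w, preds_round2))  # about 0.64
```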
Regret bound
For any data sequence, when η = √(8 log N / n):
R_n ≤ √(n log N / 2).
Regret bound
For any data sequence, when η = √(2 log N / L*_n):
R_n ≤ √(2 L*_n log N) + log N.
Both bounds are tight.
32 / 53
Statistical learning theory vs. online learning theory
Statistical learning theory
Theorem
If W is finite, then for a classifier w_S trained by empirical risk minimization:
E_{S∼P} L(w_S) − min_{w∈W} L(w) = O(√(log |W| / n)).
Online learning theory
Theorem
In the expert setting (|W| = N), for the Hedge algorithm:
(1/n) R_n = (1/n) L̂_n − (1/n) L*_n = O(√(log |W| / n)).
Essentially the same performance without the i.i.d. assumption, but at the price of averaging!
33 / 53
Large (or countably infinite) class of experts.
Concept drift: competing with the best sequence of experts.
Competing with the best small set of recurring experts.
Ranking: competing with the best permutation.
Partial feedback: multi-armed bandits.
. . .
34 / 53
Concept drift
Competing with the best sequence of m experts
Idea: treat each sequence of m experts as a new expert and run Hedge on top:
w_{i+1,k} = α (1/N) + (1 − α) w_{i,k} e^{−ηℓ_i(k)} / ∑_{j=1}^N w_{i,j} e^{−ηℓ_i(j)},
a mixture of the standard Hedge weights and the initial distribution.
Parameter α = m/n (the frequency of changes).
Bayesian interpretation: prior and posterior over sequences of experts.
Competing with the best small set of m recurring experts: a mixture over all past Hedge weights.
35 / 53
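The mixing update above can be sketched directly (illustrative names; a uniform initial distribution is assumed):

```python
import math

# Sketch of the drift-robust update: combine the Hedge posterior with the
# uniform initial distribution, weighted by alpha.
def drift_update(w, losses, eta, alpha):
    N = len(w)
    unnorm = [wk * math.exp(-eta * lk) for wk, lk in zip(w, losses)]
    Z = sum(unnorm)
    return [alpha / N + (1.0 - alpha) * u / Z for u in unnorm]

w = drift_update([0.7, 0.2, 0.1], [1.0, 0.0, 0.5], eta=1.0, alpha=0.1)
```

The mixing floor keeps every weight at least α/N, so an expert written off in the past can be picked up again after a change.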
5 Algorithms for classification and regression
36 / 53
Instances x: feature vectors.
Outcome y: class/real output.
Strategy w ∈ W: a parameter vector w = (w_1, . . . , w_d) ∈ R^d. W can be R^d or a regularization ball W = {w : ‖w‖_p ≤ B}.
Prediction is linear: ŷ = wᵀx.
Loss ℓ depends on the task we want to solve, but we assume ℓ(y, ŷ) is convex in ŷ.
37 / 53
Linear regression: y ∈ R and ℓ(y, ŷ) = (y − ŷ)².
Logistic regression: y ∈ {−1, 1} and ℓ(y, ŷ) = log(1 + e^{−yŷ}).
Support vector machines: y ∈ {−1, 1} and ℓ(y, ŷ) = (1 − yŷ)_+.
38 / 53
[Figure: hinge (SVM), square, and logistic losses as functions of the prediction. Logistic and hinge losses plotted for y = 1; squared error loss plotted for y = 0.]
39 / 53
Gradient descent method
Minimize a function f(w) over w ∈ R^d.
Gradient descent method: w_{i+1} = w_i − η_i ∇_w f(w_i), where η_i is a step size.
If we have a set of constraints w ∈ W, after each step we need to project back onto W:
w_{i+1} := argmin_{w∈W} ‖w_{i+1} − w‖².
Source: http://www-bcf.usc.edu/∼larry/, http://takisword.files.wordpress.com
40 / 53
Algorithm
Start with any initial vector w 1 ∈ W.
For i = 1, 2, . . .:
1 Observe input vector x i .
2 Predict with ˆ y i = w > i x i .
3 Outcome y i is revealed.
4 Suffer loss ` i (w i ) = `(y i , ˆ y i ).
5 Push the weight vector toward the negative gradient of the loss: w_{i+1} := w_i − η_i ∇_{w_i} ℓ_i(w_i).
6 If w_{i+1} ∉ W, project it back onto W: w_{i+1} := argmin_{w∈W} ‖w_{i+1} − w‖².
41 / 53
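The loop above can be sketched for the squared loss with an optional L2-ball constraint (a toy illustration; the function name, step size, and data are assumptions):

```python
# Sketch of online gradient descent for the squared loss, with an optional
# L2-ball constraint (projection = renormalization).
def ogd_squared(data, eta=0.1, B=None):
    """data: list of (x, y) with x a list of floats; returns final weights."""
    d = len(data[0][0])
    w = [0.0] * d                                     # initial vector w_1
    for x, y in data:
        y_hat = sum(wk * xk for wk, xk in zip(w, x))  # step 2: predict
        grad = [-2.0 * (y - y_hat) * xk for xk in x]  # gradient of (y - y_hat)^2
        w = [wk - eta * gk for wk, gk in zip(w, grad)]  # step 5: gradient step
        if B is not None:                             # step 6: project onto the ball
            norm = sum(wk * wk for wk in w) ** 0.5
            if norm > B:
                w = [B * wk / norm for wk in w]
    return w

# Repeated rounds of a 1-d problem with y = 2x drive w toward 2.
data = [([1.0], 2.0), ([1.0], 2.0)] * 20
w = ogd_squared(data, eta=0.1)
```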
The function we want to minimize: f(w) = ∑_{j=1}^n ℓ_j(w).
Standard (batch) GD: w_{i+1} := w_i − η_i ∇_{w_i} f(w_i) = w_i − η_i ∑_j ∇_{w_i} ℓ_j(w_i). Cost O(n) per iteration; needs to see all the data.
Online GD: w_{i+1} := w_i − η_i ∇_{w_i} ℓ_i(w_i). Cost O(1) per iteration; needs to see a single data point.
42 / 53
Online (stochastic) gradient descent
[Figure: iterates w_1, w_2, w_3, w_4 inside the feasible set W; each step moves along the negative gradient −η∇_{w_i} ℓ_i(w_i), and a step that leaves W is projected back onto W.]
43 / 53
The gradient ∇_{w_i} ℓ_i(w_i) can be obtained by applying the chain rule:
∂ℓ_i(w)/∂w_k = ∂ℓ(y_i, wᵀx_i)/∂w_k = [∂ℓ(y, ŷ)/∂ŷ]_{ŷ=wᵀx_i} · ∂(wᵀx_i)/∂w_k = [∂ℓ(y, ŷ)/∂ŷ]_{ŷ=wᵀx_i} · x_{ik}.
If we denote ℓ′_i(w) := [∂ℓ(y, ŷ)/∂ŷ]_{ŷ=wᵀx_i}, then we can write:
∇_{w_i} ℓ_i(w_i) = ℓ′_i(w_i) x_i.
44 / 53
Update rules for specific losses
Linear regression: ℓ(y, ŷ) = (y − ŷ)², ℓ′(w) = −2(y − ŷ). Update: w_{i+1} = w_i + 2η_i (y_i − ŷ_i) x_i.
Logistic regression: ℓ(y, ŷ) = log(1 + e^{−yŷ}), ℓ′(w) = −y / (1 + e^{yŷ}). Update: w_{i+1} = w_i + η_i y_i x_i / (1 + e^{y_i ŷ_i}).
Support vector machines: ℓ(y, ŷ) = (1 − yŷ)_+, ℓ′(w) = −y if yŷ ≤ 1 and 0 if yŷ > 1. Update: w_{i+1} = w_i + η_i 1[y_i ŷ_i ≤ 1] y_i x_i ⇐ the perceptron!
45 / 53
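The three updates can be written uniformly through the scalar derivative ℓ′ (an illustrative sketch, assuming y ∈ {−1, +1} for the logistic and hinge cases):

```python
import math

# The update rules above, expressed through the scalar derivative l'(y, y_hat).
def lprime_square(y, y_hat):
    return -2.0 * (y - y_hat)

def lprime_logistic(y, y_hat):
    return -y / (1.0 + math.exp(y * y_hat))

def lprime_hinge(y, y_hat):
    return -y if y * y_hat <= 1 else 0.0

def ogd_step(w, x, y, eta, lprime):
    """One online step: w <- w - eta * l'(y, w^T x) * x."""
    y_hat = sum(wk * xk for wk, xk in zip(w, x))
    g = lprime(y, y_hat)
    return [wk - eta * g * xk for wk, xk in zip(w, x)]

# The hinge loss on a point with y*y_hat <= 1 gives the perceptron-style
# update w <- w + eta * y * x; on a well-classified point it does nothing.
w = ogd_step([0.0, 0.0], [1.0, 2.0], +1, 0.5, lprime_hinge)
```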
Projection
w_i := argmin_{w∈W} ‖w_i − w‖².
When W = R^d ⇒ no projection step.
When W = {w : ‖w‖ ≤ B} is an L2-ball, projection corresponds to a renormalization of the weight vector: if ‖w_i‖ > B ⇒ w_i := B w_i / ‖w_i‖. Equivalent to L2 regularization.
When W = {w : ∑_{k=1}^d |w_k| ≤ B} is an L1-ball, projection corresponds to an additive shift of the absolute values, clipping the smaller weights to 0. Equivalent to L1 regularization; results in sparse solutions.
46 / 53
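The L2-ball case is simple enough to sketch directly (illustrative; the more involved shift-and-clip procedure for the L1 ball is omitted):

```python
# Sketch of projection onto the L2-ball of radius B: rescale only if the
# vector lies outside the ball.
def project_l2(w, B):
    norm = sum(wk * wk for wk in w) ** 0.5
    if norm <= B:
        return list(w)                    # already inside W
    return [B * wk / norm for wk in w]    # renormalize onto the boundary

w = project_l2([3.0, 4.0], B=1.0)         # norm 5 -> rescaled to norm 1
```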
L1 vs. L2 projection
w_i := argmin_{w∈W} ‖w_i − w‖².
[Figure: a point w_i outside W and its projection onto an L1-ball and an L2-ball in the (w_1, w_2) plane.]
47 / 53
Theorem
Let 0 ∈ W. Assume ‖∇_w ℓ_i(w)‖ ≤ L and let ‖W‖ = max_{w∈W} ‖w‖. Then, with w_1 = 0 and η_i = (1/√i) ‖W‖/L, the regret is bounded by:
R_n ≤ ‖W‖L √(2n),
so that the per-iteration regret is:
(1/n) R_n ≤ ‖W‖L √(2/n).
48 / 53
Gradient descent: w_{i+1} := w_i − η_i ∇_{w_i} ℓ_i(w_i).
Exponentiated descent: w_{i+1} := (1/Z_i) w_i e^{−η_i ∇_{w_i} ℓ_i(w_i)} (componentwise).
A direct extension of Hedge to the classification/regression framework.
Requires positive weights, but can be applied in a general setting using the doubling trick.
Works much better than online gradient descent when d is very large (many features) and only a small number of features is relevant.
Works very well when the best model is sparse, but does not keep the solution sparse itself!
49 / 53
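A sketch of the exponentiated-descent step (illustrative names; the gradient is supplied directly, and the weights are kept on the simplex by the normalization):

```python
import math

# One exponentiated-descent (EG) step: componentwise exponential
# reweighting by the gradient, then renormalization by Z_i.
def eg_update(w, grad, eta):
    unnorm = [wk * math.exp(-eta * gk) for wk, gk in zip(w, grad)]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

# A positive gradient on coordinate 0 shifts weight to the other coordinates.
w = eg_update([0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0], eta=1.0)
```

Note the update is multiplicative: a weight that starts positive stays positive, which is why the method needs positive weights in the first place.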
Theorem
Let W be the positive orthant of an L1-ball with radius ‖W‖, and assume ‖∇_w ℓ_i(w)‖ ≤ L. Then, with w_1 = (1/d, . . . , 1/d) and η_i = (1/√i) √(8 log d)/(‖W‖L), the regret is bounded by:
R_n ≤ ‖W‖L √(n log d / 2),
so that the per-iteration regret is:
(1/n) R_n ≤ ‖W‖L √(log d / (2n)).
50 / 53
Concept drift: competing with drifting parameter vectors.
Partial feedback: contextual multi-armed bandit problems.
Improvements for some (strongly convex, exp-concave) loss functions.
Infinite-dimensional feature spaces via kernel trick.
Learning matrix parameters (matrix norm regularization, positive definiteness, permutation matrices).
. . .
51 / 53
6 Conclusions
52 / 53
A theoretical framework for learning without i.i.d. assumption.
Performance bounds often simpler to prove than in the stochastic setting.
Easy to generalize to changing environments (concept drift), more general actions (reinforcement learning), partial information (bandits), etc.
Results in online algorithms directly applicable to large-scale learning problems.
Many of the offline learning algorithms in current use employ online learning as an optimization routine.
53 / 53