Wojciech Kotłowski
Institute of Computing Science, Poznań University of Technology
IDSS, 04.06.2013
1 / 53
1 Example: Online (Stochastic) Gradient Descent
2 Statistical learning theory
3 Online learning
4 Algorithms for prediction with expert advice
5 Algorithms for classification and regression
6 Conclusions
2 / 53
1 Example: Online (Stochastic) Gradient Descent
3 / 53
Data set: Reuters RCV1:
Set of ∼ 800 000 documents from Reuters News published during 1996–1997.
781 265 training examples, 23 149 test examples.
47 152 TF-IDF features.
Each document was assigned one or more topics (categories).
For the sake of illustration, the problem is simplified to binary classification: predicting whether a document belongs to category CCAT (Corporate/Industrial).
4 / 53
<?xml version="1.0" encoding="iso-8859-1" ?>
<newsitem itemid="2330" id="root" date="1996-08-20" xml:lang="en">
<title>USA: Tylan stock jumps; weighs sale of company.</title>
<headline>Tylan stock jumps; weighs sale of company.</headline>
<dateline>SAN DIEGO</dateline>
<text>
<p>The stock of Tylan General Inc. jumped Tuesday after the maker of process-management equipment said it is exploring the sale of the company and added that it has already received some inquiries from potential buyers.</p>
<p>Tylan was up $2.50 to $12.75 in early trading on the Nasdaq market.</p>
<p>The company said it has set up a committee of directors to oversee the sale and that Goldman, Sachs &amp; Co. has been retained as its financial adviser.</p>
</text>
<copyright>(c) Reuters Limited 1996</copyright>
<metadata>
<codes class="bip:countries:1.0">
<code code="USA"> </code>
</codes>
<codes class="bip:industries:1.0">
<code code="I34420"> </code>
</codes>
<codes class="bip:topics:1.0">
<code code="C15"> </code>
<code code="C152"> </code>
<code code="C18"> </code>
<code code="C181"> </code>
<code code="CCAT"> </code>
</codes>
<dc element="dc.publisher" value="Reuters Holdings Plc"/>
<dc element="dc.date.published" value="1996-08-20"/>
<dc element="dc.source" value="Reuters"/>
<dc element="dc.creator.location" value="SAN DIEGO"/>
<dc element="dc.creator.location.country.name" value="USA"/>
<dc element="dc.source" value="Reuters"/>
</metadata>
</newsitem>
5 / 53
Two types of loss functions:
Logistic loss (logistic regression)
Hinge loss (SVM)
Two types of learning:
Standard (batch) setting: minimization of the (regularized) empirical risk on the training data.
Online gradient descent (stochastic gradient descent) with L2 regularization.
Source: http://leon.bottou.org/projects/sgd
6 / 53
Hinge loss
method | comp. time | training error | test error
SVMLight (batch) | 23 642 sec | 0.2275 | 6.02%
SVMPerf (batch) | 66 sec | 0.2278 | 6.03%
SGD | 1.4 sec | 0.2275 | 6.02%
Logistic loss
method | comp. time | training error | test error
LibLinear (batch) | 30 sec | 0.18907 | 5.68%
SGD | 2.3 sec | 0.18893 | 5.66%
7 / 53
Data set | #objects | #features | time LIBSVM | time SGD
Reuters | 781 000 | 47 000 | 2.5 days | 7 sec
Translation | 1 000 000 | 274 000 | many days | 7 sec
SuperTag | 950 000 | 46 000 | 8 h | 1 sec
Voicetone | 579 000 | 88 000 | 10 h | 1 sec
8 / 53
2 Statistical learning theory
9 / 53
[Figure: batch learning. A learning algorithm maps the training set S = {(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)} to a function (classifier) w_S : X → Y. On the test set T = {(x_5, ?), (x_6, ?), (x_7, ?)} the classifier makes predictions ŷ_i = w_S(x_i); the feedback y_5, y_6, y_7 determines the accuracy via the losses ℓ(y_i, ŷ_i).]
10 / 53
Test set performance
Ultimate question of machine learning theory
Given a training set S = {(x_i, y_i)}_{i=1}^n, how to learn a function w_S : X → Y, so that the total loss on a separate test set T = {(x_i, y_i)}_{i=n+1}^m,
∑_{i=n+1}^m ℓ(y_i, w_S(x_i)),
is minimized?
No reasonable answer without any assumptions (“no free lunch”).
Training data and test data must be in some sense similar.
11 / 53
Statistical learning theory
Assumption
Training data and test data were independently generated from the same distribution P (i.i.d.).
Mean training error of w (empirical risk): L_S(w) = (1/n) ∑_{i=1}^n ℓ(y_i, w(x_i)).
Test error = expected error of w (risk): L(w) = E_{(x,y)∼P}[ℓ(y, w(x))].
Given training data S, how to construct w_S to make L(w_S) as small as possible? =⇒ Empirical risk minimization.
12 / 53
Generalization bounds
Theorem for finite classes (0/1 loss) [Occam’s Razor]
Let the class of functions W be finite. If the classifier w_S was trained on S by minimizing the empirical risk within W, w_S = argmin_{w∈W} L_S(w), then:
E_{S∼P} L(w_S) = min_{w∈W} L(w) + O(√(log |W| / n)).
(our test set performance = best test set performance in W + overhead for learning on a finite training set)
13 / 53
Generalization bounds
Theorem for VC classes (0/1 loss) [Vapnik–Chervonenkis, 1971]
Let W have VC dimension d_VC. If the classifier w_S was trained on S by minimizing the empirical risk within W, w_S = argmin_{w∈W} L_S(w), then:
E_{S∼P} L(w_S) = min_{w∈W} L(w) + O(√(d_VC / n)).
(our test set performance = best test set performance in W + overhead for learning on a finite training set)
14 / 53
The bounds are asymptotically tight.
Typically d_VC ∼ d, the number of parameters, so that:
L(w_S) = min_{w∈W} L(w) + O(√(d / n)).
Similar results (sometimes better) for losses other than 0/1.
Holds for any distribution P.
Improvements: data-dependent bounds.
The theory only tells you what happens for a given class W; choosing W is an “art” (domain knowledge).
15 / 53
3 Online learning
16 / 53
Alternative view on learning
Stochastic (i.i.d.) assumption often criticized, sometimes clearly invalid (e.g. time series).
Learning process by its very nature is incremental.
We do not observe the distributions, we only see the data.
Motivation
Can we relax the i.i.d. assumption and treat the data-generating process as completely arbitrary?
Can we obtain performance guarantees based solely on observed quantities?
We can! =⇒ online learning theory
17 / 53
Example: weather prediction (rain/sunny)
i = 1 i = 2 i = 3 i = 4 . . .
50% 25% 10% 25% . . .
18 / 53
Example: weather prediction (rain/sunny)
expert i = 1 i = 2 i = 3 i = 4 . . .
30% 50% 50% 10% . . .
10% 80% 50% 10% . . .
20% 70% 50% 30% . . .
60% 30% 50% 80% . . .
30% 65% 50% 10% . . .
19 / 53
Example: weather prediction (rain/sunny)
Prediction accuracy evaluated by a loss function, e.g. ℓ(y_i, ŷ_i) = |y_i − ŷ_i|.
Total performance evaluated by the regret: the learner’s cumulative loss minus the cumulative loss of the best expert in hindsight.
expert 1 2 3 4 | cumulative loss
30% 50% 50% 10% | 0.3 + 0.5 + 0.5 + 0.1 = 1.4
10% 80% 50% 10% | 0.1 + 0.2 + 0.5 + 0.1 = 0.9
20% 70% 50% 30% | 0.2 + 0.3 + 0.5 + 0.3 = 1.3
60% 30% 50% 80% | 0.6 + 0.7 + 0.5 + 0.8 = 2.6
30% 65% 50% 10% | 0.3 + 0.45 + 0.5 + 0.1 = 1.35
Regret of the learner: 1.35 − 0.9 = 0.45.
The goal is to have small regret for any data sequence.
20 / 53
[Figure: one round of online learning. The learner holds a strategy w_i : X → Y, receives a new instance (x_i, ?), predicts ŷ_i = w_i(x_i), receives the feedback y_i, suffers loss ℓ(y_i, ŷ_i), and moves to round i + 1.]
21 / 53
Set of strategies (actions) W; known loss function ℓ.
The learner starts with some initial strategy (action) w_1. For i = 1, 2, . . .:
1 Learner observes instance x_i.
2 Learner predicts with ŷ_i = w_i(x_i).
3 The environment reveals the outcome y_i.
4 Learner suffers loss ℓ_i(w_i) = ℓ(y_i, ŷ_i).
5 Learner updates its strategy: w_i → w_{i+1}.
22 / 53
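The protocol above can be sketched as a generic loop; everything concrete here (the `ConstantLearner` class, `run_protocol`, the squared loss, the toy data) is an illustrative assumption, not part of the slides.

```python
# A toy instance of the online learning protocol (illustrative names).
class ConstantLearner:
    """A trivial learner whose strategy w_i ignores the instance."""
    def __init__(self, c):
        self.c = c

    def predict(self, x):    # step 2: y_hat_i = w_i(x_i)
        return self.c

    def update(self, x, y):  # step 5: w_i -> w_{i+1} (a no-op here)
        pass

def run_protocol(learner, data, loss):
    """Run steps 1-5 for each round; return the learner's cumulative loss."""
    total = 0.0
    for x, y in data:                # step 1: observe x_i (y_i revealed in step 3)
        y_hat = learner.predict(x)   # step 2: predict
        total += loss(y, y_hat)      # step 4: suffer loss
        learner.update(x, y)         # step 5: update the strategy
    return total

square = lambda y, y_hat: (y - y_hat) ** 2
L_hat = run_protocol(ConstantLearner(0.5), [(None, 1.0), (None, 0.0)], square)
```

Any learner that implements `predict` and `update` can be plugged into the same loop.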
The goal of the learner is to be close to the best w in hindsight.
Cumulative loss of the learner: L̂_n = ∑_{i=1}^n ℓ_i(w_i).
Cumulative loss of the best strategy w in hindsight: L*_n = min_{w∈W} ∑_{i=1}^n ℓ_i(w).
Regret of the learner: R_n = L̂_n − L*_n.
The goal is to minimize regret over all possible data sequences.
23 / 53
N prediction strategies (“experts”) to follow.
Instance x = (x_1, . . . , x_N): the vector of experts’ predictions.
Strategy w = (w_1, . . . , w_N): a probability vector, ∑_{k=1}^N w_k = 1, w_k ≥ 0 (the learner’s beliefs about the experts, or the probability of following a given expert).
Learner’s prediction: ŷ = w(x) = wᵀx = ∑_{k=1}^N w_k x_k.
Regret: competing with the best expert,
R_n = L̂_n − L*_n, where L*_n = min_{k=1,...,N} L_n(k)
and L_n(k) is the cumulative loss of the kth expert.
24 / 53
Combining advice from actual experts :-)
Combining N learning algorithms/classifiers.
Gambling (e.g., horse racing).
Portfolio selection.
Routing/shortest path.
Spam filtering/document classification.
Learning boolean formulae.
Ranking/ordering.
. . .
25 / 53
4 Algorithms for prediction with expert advice
26 / 53
Follow the leader (FTL)
At iteration i, choose the expert with the smallest loss on the past data:
k_min = argmin_{k=1,...,N} ∑_{j=1}^{i−1} ℓ_j(k) = argmin_{k=1,...,N} L_{i−1}(k).
Corresponding strategy w: w_{k_min} = 1, w_k = 0 for k ≠ k_min.
27 / 53
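A minimal Python sketch of FTL (the function name, the tie-breaking toward the lowest index, and the use of expert 0 in the first round are illustrative assumptions):

```python
# Sketch of Follow the Leader over N experts.
def ftl_choices(losses):
    """losses[i][k]: loss of expert k at round i. Returns the expert
    followed at each round, chosen by the smallest cumulative past loss."""
    N = len(losses[0])
    cum = [0.0] * N
    choices = []
    for round_losses in losses:
        leader = min(range(N), key=lambda k: cum[k])  # argmin of L_{i-1}(k)
        choices.append(leader)
        for k in range(N):                            # this round's losses revealed
            cum[k] += round_losses[k]
    return choices

# Expert 0 looks worse after round 1, so FTL switches to expert 1.
choices = ftl_choices([[0.5, 0.4], [1.0, 0.0], [1.0, 0.0]])
```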
Failure of FTL
expert 1 2 3 4 5 6 7 | loss
75% 100% 100% 100% 100% 100% 100% | 3.75
25% 0% 0% 0% 0% 0% 0% | 3.25
50% 0% 100% 0% 100% 0% 100% | 6.5
L̂_n ≃ n, L*_n ≃ n/2, R_n ≃ n/2. The learner must hedge its bets on the experts!
28 / 53
Hedge [Littlestone & Warmuth, 1994; Freund & Schapire, 1997]
Algorithm
Each time expert k receives a loss ℓ(k), multiply the weight w_k associated with that expert by e^{−ηℓ(k)}, where η > 0:
w_{i+1,k} = w_{i,k} e^{−ηℓ_i(k)} / Z_i, where Z_i = ∑_{k=1}^N w_{i,k} e^{−ηℓ_i(k)}.
Unwinding this update:
w_{i+1,k} = e^{−ηL_i(k)} / Z_i, where L_i(k) = ∑_{j≤i} ℓ_j(k) and Z_i = ∑_{k=1}^N e^{−ηL_i(k)}.
29 / 53
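The multiplicative update above fits in a few lines of Python (names are illustrative):

```python
import math

# Sketch of one Hedge step: multiply each weight by exp(-eta * loss)
# and renormalize by Z_i.
def hedge_update(w, losses, eta):
    unnorm = [wk * math.exp(-eta * lk) for wk, lk in zip(w, losses)]
    Z = sum(unnorm)                       # normalization constant Z_i
    return [u / Z for u in unnorm]

w = [0.25, 0.25, 0.25, 0.25]              # uniform prior over 4 experts
w = hedge_update(w, [0.3, 0.1, 0.2, 0.6], eta=2.0)
```

After the update, the expert with the smallest loss carries the largest weight.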
Hedge as Bayesian updates
Prior probability over N alternatives E_1, . . . , E_N; data likelihoods P(D_i | E_k), k = 1, . . . , N:
P(E_k | D_i) = P(D_i | E_k) × P(E_k) / ∑_{j=1}^N P(D_i | E_j) × P(E_j).
Correspondence with Hedge: posterior probability ↔ w_{i+1,k}, data likelihood ↔ e^{−ηℓ_i(k)}, prior probability ↔ w_{i,k}, normalization ↔ Z_i.
30 / 53
Hedge example (η = 2)
[Figure: bars showing the weights over four experts, updated after each round.]
Round 1: expert predictions 30%, 10%, 20%, 60%; learner 30%; losses 0.3, 0.1, 0.2, 0.6; learner’s loss 0.3.
Round 2: expert predictions 50%, 80%, 70%, 30%; learner 64%; losses 0.5, 0.2, 0.3, 0.7; learner’s loss 0.36.
Round 3: expert predictions 50%, 50%, 50%, 50%; learner 50%; all losses 0.5.
Round 4: expert predictions 10%, 10%, 30%, 80%; learner 21%; losses 0.1, 0.1, 0.3, 0.8; learner’s loss 0.21.
31 / 53
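The round-2 aggregate in the example can be checked numerically: starting from uniform weights, one Hedge update with η = 2 on the round-1 losses reweights the experts so that the combined round-2 prediction comes out near the 64% shown (a sketch; variable names are illustrative):

```python
import math

eta = 2.0
losses_round1 = [0.3, 0.1, 0.2, 0.6]   # experts' losses in round 1
preds_round2 = [0.5, 0.8, 0.7, 0.3]    # experts' predictions in round 2

# Hedge weights after round 1 (the uniform prior cancels in the normalization).
unnorm = [math.exp(-eta * l) for l in losses_round1]
Z = sum(unnorm)
w = [u / Z for u in unnorm]

# The learner's round-2 prediction is the weighted combination w^T x.
y_hat = sum(wk * pk for wk, pk in zip(w, preds_round2))  # about 0.64
```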
Regret bound
For any data sequence, when η = √(8 log N / n):
R_n ≤ √(n log N / 2).
Regret bound
For any data sequence, when η = √(2 log N / L*_n):
R_n ≤ √(2 L*_n log N) + log N.
Both bounds are tight.
32 / 53
Statistical learning theory vs. online learning theory
Statistical learning theory
Theorem
If W is finite, then for a classifier w_S trained by empirical risk minimization:
E_{S∼P} L(w_S) − min_{w∈W} L(w) = O(√(log |W| / n)).
Online learning theory
Theorem
In the expert setting (|W| = N), for the Hedge algorithm:
(1/n) R_n = (1/n) L̂_n − (1/n) L*_n = O(√(log |W| / n)).
Essentially the same performance without the i.i.d. assumption, but at the price of averaging!
33 / 53
Large (or countably infinite) class of experts.
Concept drift: competing with the best sequence of experts.
Competing with the best small set of recurring experts.
Ranking: competing with the best permutation.
Partial feedback: multi-armed bandits.
. . .
34 / 53
Concept drift
Competing with the best sequence of m experts
Idea: treat each sequence of m experts as a new expert and run Hedge on top:
w_{i+1,k} = α (1/N) + (1 − α) w_{i,k} e^{−ηℓ_i(k)} / ∑_{j=1}^N w_{i,j} e^{−ηℓ_i(j)},
a mixture of the standard Hedge weights and the initial distribution.
Parameter α = m/n (the frequency of changes).
Bayesian interpretation: prior and posterior over sequences of experts.
Competing with the best small set of m recurring experts: a mixture over all past Hedge weights.
35 / 53
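The mixing update above can be sketched directly (illustrative names; a uniform initial distribution is assumed):

```python
import math

# Sketch of the drift-robust update: combine the Hedge posterior with the
# uniform initial distribution, weighted by alpha.
def drift_update(w, losses, eta, alpha):
    N = len(w)
    unnorm = [wk * math.exp(-eta * lk) for wk, lk in zip(w, losses)]
    Z = sum(unnorm)
    return [alpha / N + (1.0 - alpha) * u / Z for u in unnorm]

w = drift_update([0.7, 0.2, 0.1], [1.0, 0.0, 0.5], eta=1.0, alpha=0.1)
```

The mixing floor keeps every weight at least α/N, so an expert written off in the past can be picked up again after a change.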
5 Algorithms for classification and regression
36 / 53
Instances x: feature vectors.
Outcome y: class/real output.
Strategy w ∈ W: a parameter vector w = (w_1, . . . , w_d) ∈ R^d. W can be R^d or a regularization ball W = {w : ‖w‖_p ≤ B}.
Prediction is linear: ŷ = wᵀx.
Loss ℓ depends on the task we want to solve, but we assume ℓ(y, ŷ) is convex in ŷ.
37 / 53
Linear regression: y ∈ R and ℓ(y, ŷ) = (y − ŷ)².
Logistic regression: y ∈ {−1, 1} and ℓ(y, ŷ) = log(1 + e^{−yŷ}).
Support vector machines: y ∈ {−1, 1} and ℓ(y, ŷ) = (1 − yŷ)_+.
38 / 53
[Figure: hinge (SVM), square, and logistic losses as functions of the prediction. Logistic and hinge losses plotted for y = 1; squared error loss plotted for y = 0.]
39 / 53
Gradient descent method
Minimize a function f(w) over w ∈ R^d.
Gradient descent method: w_{i+1} = w_i − η_i ∇_w f(w_i), where η_i is a step size.
If we have a set of constraints w ∈ W, after each step we need to project back onto W:
w_{i+1} := argmin_{w∈W} ‖w_{i+1} − w‖².
Source: http://www-bcf.usc.edu/∼larry/, http://takisword.files.wordpress.com
40 / 53
Algorithm
Start with any initial vector w 1 ∈ W.
For i = 1, 2, . . .:
1 Observe input vector x i .
2 Predict with ˆ y i = w > i x i .
3 Outcome y i is revealed.
4 Suffer loss ` i (w i ) = `(y i , ˆ y i ).
5 Push the weight vector toward the negative gradient of the loss: w_{i+1} := w_i − η_i ∇_{w_i} ℓ_i(w_i).
6 If w_{i+1} ∉ W, project it back onto W: w_{i+1} := argmin_{w∈W} ‖w_{i+1} − w‖².
41 / 53
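The loop above can be sketched for the squared loss with an optional L2-ball constraint (a toy illustration; the function name, step size, and data are assumptions):

```python
# Sketch of online gradient descent for the squared loss, with an optional
# L2-ball constraint (projection = renormalization).
def ogd_squared(data, eta=0.1, B=None):
    """data: list of (x, y) with x a list of floats; returns final weights."""
    d = len(data[0][0])
    w = [0.0] * d                                     # initial vector w_1
    for x, y in data:
        y_hat = sum(wk * xk for wk, xk in zip(w, x))  # step 2: predict
        grad = [-2.0 * (y - y_hat) * xk for xk in x]  # gradient of (y - y_hat)^2
        w = [wk - eta * gk for wk, gk in zip(w, grad)]  # step 5: gradient step
        if B is not None:                             # step 6: project onto the ball
            norm = sum(wk * wk for wk in w) ** 0.5
            if norm > B:
                w = [B * wk / norm for wk in w]
    return w

# Repeated rounds of a 1-d problem with y = 2x drive w toward 2.
data = [([1.0], 2.0), ([1.0], 2.0)] * 20
w = ogd_squared(data, eta=0.1)
```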
The function we want to minimize: f(w) = ∑_{j=1}^n ℓ_j(w).
Standard (batch) GD: w_{i+1} := w_i − η_i ∇_{w_i} f(w_i) = w_i − η_i ∑_j ∇_{w_i} ℓ_j(w_i). Cost O(n) per iteration; needs to see all the data.
Online GD: w_{i+1} := w_i − η_i ∇_{w_i} ℓ_i(w_i). Cost O(1) per iteration; needs to see a single data point.
42 / 53
Online (stochastic) gradient descent
[Figure: iterates w_1, w_2, w_3, w_4 inside the feasible set W; each step moves along the negative gradient −η∇_{w_i} ℓ_i(w_i), and a step that leaves W is projected back onto W.]
43 / 53
The gradient ∇_{w_i} ℓ_i(w_i) can be obtained by applying the chain rule:
∂ℓ_i(w)/∂w_k = ∂ℓ(y_i, wᵀx_i)/∂w_k = [∂ℓ(y, ŷ)/∂ŷ]_{ŷ=wᵀx_i} · ∂(wᵀx_i)/∂w_k = [∂ℓ(y, ŷ)/∂ŷ]_{ŷ=wᵀx_i} · x_{ik}.
If we denote ℓ′_i(w) := [∂ℓ(y, ŷ)/∂ŷ]_{ŷ=wᵀx_i}, then we can write:
∇_{w_i} ℓ_i(w_i) = ℓ′_i(w_i) x_i.
44 / 53
Update rules for specific losses
Linear regression: ℓ(y, ŷ) = (y − ŷ)², ℓ′(w) = −2(y − ŷ). Update: w_{i+1} = w_i + 2η_i (y_i − ŷ_i) x_i.
Logistic regression: ℓ(y, ŷ) = log(1 + e^{−yŷ}), ℓ′(w) = −y / (1 + e^{yŷ}). Update: w_{i+1} = w_i + η_i y_i x_i / (1 + e^{y_i ŷ_i}).
Support vector machines: ℓ(y, ŷ) = (1 − yŷ)_+, ℓ′(w) = −y if yŷ ≤ 1 and 0 if yŷ > 1. Update: w_{i+1} = w_i + η_i 1[y_i ŷ_i ≤ 1] y_i x_i ⇐ the perceptron!
45 / 53
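The three updates can be written uniformly through the scalar derivative ℓ′ (an illustrative sketch, assuming y ∈ {−1, +1} for the logistic and hinge cases):

```python
import math

# The update rules above, expressed through the scalar derivative l'(y, y_hat).
def lprime_square(y, y_hat):
    return -2.0 * (y - y_hat)

def lprime_logistic(y, y_hat):
    return -y / (1.0 + math.exp(y * y_hat))

def lprime_hinge(y, y_hat):
    return -y if y * y_hat <= 1 else 0.0

def ogd_step(w, x, y, eta, lprime):
    """One online step: w <- w - eta * l'(y, w^T x) * x."""
    y_hat = sum(wk * xk for wk, xk in zip(w, x))
    g = lprime(y, y_hat)
    return [wk - eta * g * xk for wk, xk in zip(w, x)]

# The hinge loss on a point with y*y_hat <= 1 gives the perceptron-style
# update w <- w + eta * y * x; on a well-classified point it does nothing.
w = ogd_step([0.0, 0.0], [1.0, 2.0], +1, 0.5, lprime_hinge)
```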
Projection
w_i := argmin_{w∈W} ‖w_i − w‖².
When W = R^d ⇒ no projection step.
When W = {w : ‖w‖ ≤ B} is an L2-ball, projection corresponds to a renormalization of the weight vector: if ‖w_i‖ > B ⇒ w_i := B w_i / ‖w_i‖. Equivalent to L2 regularization.
When W = {w : ∑_{k=1}^d |w_k| ≤ B} is an L1-ball, projection corresponds to an additive shift of the absolute values, clipping the smaller weights to 0. Equivalent to L1 regularization; results in sparse solutions.
46 / 53
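The L2-ball case is simple enough to sketch directly (illustrative; the more involved shift-and-clip procedure for the L1 ball is omitted):

```python
# Sketch of projection onto the L2-ball of radius B: rescale only if the
# vector lies outside the ball.
def project_l2(w, B):
    norm = sum(wk * wk for wk in w) ** 0.5
    if norm <= B:
        return list(w)                    # already inside W
    return [B * wk / norm for wk in w]    # renormalize onto the boundary

w = project_l2([3.0, 4.0], B=1.0)         # norm 5 -> rescaled to norm 1
```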
L1 vs. L2 projection
w_i := argmin_{w∈W} ‖w_i − w‖².
[Figure: a point w_i outside W and its projection onto an L1-ball and an L2-ball in the (w_1, w_2) plane.]
47 / 53
Theorem
Let 0 ∈ W. Assume ‖∇_w ℓ_i(w)‖ ≤ L and let ‖W‖ = max_{w∈W} ‖w‖. Then, with w_1 = 0 and η_i = (1/√i) ‖W‖/L, the regret is bounded by:
R_n ≤ ‖W‖L √(2n),
so that the per-iteration regret is:
(1/n) R_n ≤ ‖W‖L √(2/n).
48 / 53
Gradient descent: w_{i+1} := w_i − η_i ∇_{w_i} ℓ_i(w_i).
Exponentiated descent: w_{i+1} := (1/Z_i) w_i e^{−η_i ∇_{w_i} ℓ_i(w_i)} (componentwise).
A direct extension of Hedge to the classification/regression framework.
Requires positive weights, but can be applied in a general setting using the doubling trick.
Works much better than online gradient descent when d is very large (many features) and only a small number of features is relevant.
Works very well when the best model is sparse, but does not keep the solution sparse itself!
49 / 53
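A sketch of the exponentiated-descent step (illustrative names; the gradient is supplied directly, and the weights are kept on the simplex by the normalization):

```python
import math

# One exponentiated-descent (EG) step: componentwise exponential
# reweighting by the gradient, then renormalization by Z_i.
def eg_update(w, grad, eta):
    unnorm = [wk * math.exp(-eta * gk) for wk, gk in zip(w, grad)]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

# A positive gradient on coordinate 0 shifts weight to the other coordinates.
w = eg_update([0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0], eta=1.0)
```

Note the update is multiplicative: a weight that starts positive stays positive, which is why the method needs positive weights in the first place.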
Theorem
Let W be the positive orthant of an L1-ball with radius ‖W‖, and assume ‖∇_w ℓ_i(w)‖ ≤ L. Then, with w_1 = (1/d, . . . , 1/d) and η_i = (1/√i) √(8 log d)/(‖W‖L), the regret is bounded by:
R_n ≤ ‖W‖L √(n log d / 2),
so that the per-iteration regret is:
(1/n) R_n ≤ ‖W‖L √(log d / (2n)).
50 / 53
Concept drift: competing with drifting parameter vectors.
Partial feedback: contextual multi-armed bandit problems.
Improvements for some (strongly convex, exp-concave) loss functions.
Infinite-dimensional feature spaces via kernel trick.
Learning matrix parameters (matrix norm regularization, positive definiteness, permutation matrices).
. . .
51 / 53
6 Conclusions
52 / 53
A theoretical framework for learning without i.i.d. assumption.
Performance bounds often simpler to prove than in the stochastic setting.
Easy to generalize to changing environments (concept drift), more general actions (reinforcement learning), partial information (bandits), etc.
Results in online algorithms directly applicable to large-scale learning problems.
Many of the offline learning algorithms in current use employ online learning as an optimization routine.
53 / 53