Scale-invariant online learning
Michał Kempka, Wojciech Kotłowski
IDSS Seminar, 27.11.2018
Online learning example: travel time estimation

• At every timestamp t, navigation software needs to predict the travel time $y_t$ at a given road segment
• Given a feature vector $x_t \in \mathbb{R}^d$ representing current traffic conditions, predict $\hat{y}_t = x_t^\top w_t$ with a linear model
• Observe the real $y_t$ and measure the prediction loss, e.g. $(y_t - \hat{y}_t)^2$
• Improve the model parameters $w_t \to w_{t+1}$
Online learning example: spam filtering

• At every timestamp t, a spam filter needs to classify an incoming email as spam/no-spam ($y_t \in \{+1, -1\}$)
• Given a feature vector $x_t \in \mathbb{R}^d$ representing the email's body, predict $\hat{y}_t = x_t^\top w_t$ with a linear model
• Receive feedback $y_t$ from the user and measure the prediction loss, e.g. the logistic loss $\log(1 + e^{-y_t \hat{y}_t})$
• Improve the model parameters $w_t \to w_{t+1}$
Online learning with linear models

At each trial t = 1, ..., T:
• Nature reveals the input instance $x_t \in \mathbb{R}^d$ (revealed before the prediction!)
• Learner predicts with a linear model $\hat{y}_t = x_t^\top w_t$, where $w_t \in \mathbb{R}^d$
• Nature reveals the label $y_t$
• Learner suffers the loss $\ell(y_t, \hat{y}_t)$, convex and L-Lipschitz in $\hat{y}_t$

L-Lipschitz = (sub)derivative bounded by L:

Loss function | $\ell(y, \hat{y})$        | $\partial_{\hat{y}}\, \ell(y, \hat{y})$ | L
logistic      | $\log(1 + e^{-y\hat{y}})$ | $-y / (1 + e^{y\hat{y}})$               | 1
hinge         | $\max\{0, 1 - y\hat{y}\}$ | $-y\,\mathbb{1}[y\hat{y} \le 1]$        | 1
absolute      | $|\hat{y} - y|$           | $\mathrm{sgn}(\hat{y} - y)$             | 1

Without loss of generality assume L = 1.

No stochastic assumptions on the data sequence $(x_t, y_t)$ are made. Minimize the regret relative to an oracle weight vector $w^\star \in \mathbb{R}^d$:
$$\mathrm{regret}_T(w^\star) = \sum_{t=1}^{T} \ell(y_t, x_t^\top w_t) - \sum_{t=1}^{T} \ell(y_t, x_t^\top w^\star).$$
Goal: sublinear regret for any $w^\star$ and any data sequence $(x_t, y_t)$
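The protocol and the regret definition can be made concrete in a few lines of Python (a toy sketch, not from the slides; the helper names and the trivial zero-weight learner are ours for illustration):

```python
import numpy as np

def logistic_loss(y, y_hat):
    """Logistic loss log(1 + e^{-y * y_hat}); convex and 1-Lipschitz in y_hat."""
    return np.log1p(np.exp(-y * y_hat))

def regret(data, weight_seq, w_star):
    """regret_T(w_star): the learner's cumulative loss minus the oracle's."""
    learner = sum(logistic_loss(y, x @ w) for (x, y), w in zip(data, weight_seq))
    oracle = sum(logistic_loss(y, x @ w_star) for x, y in data)
    return learner - oracle

# Two trials; a trivial learner that always predicts with w_t = 0.
data = [(np.array([1.0]), 1), (np.array([-1.0]), -1)]
weights = [np.zeros(1), np.zeros(1)]
# Against the oracle w_star = 0 the regret is zero by definition...
assert abs(regret(data, weights, np.zeros(1))) < 1e-12
# ...but against a well-fitting w_star the trivial learner has positive regret.
assert regret(data, weights, np.array([5.0])) > 0
```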
Stochastic Gradient Descent (SGD)

• Fixed learning rate: make a small step along the negative gradient of the loss,
$$w_{t+1} = w_t - \eta \nabla_t, \qquad \text{where } \nabla_t = \nabla_{w_t} \ell(y_t, x_t^\top w_t).$$
Starting at $w_1 = 0$,
$$\mathrm{regret}_T(w^\star) \le \frac{\|w^\star\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|\nabla_t\|^2.$$
The optimal in-hindsight tuning $\eta^\star = \frac{\|w^\star\|}{\sqrt{\sum_t \|\nabla_t\|^2}}$ minimizes this bound (impossible in practice) and gives
$$\mathrm{regret}_T(w^\star) \le \|w^\star\| \sqrt{\sum_t \|\nabla_t\|^2}.$$

• Separate fixed learning rate per feature (each feature has its own learning rate):
$$w_{t+1,i} = w_{t,i} - \eta_i \nabla_{t,i}, \qquad \mathrm{regret}_T(w^\star) \le \sum_{i=1}^{d} \left( \frac{(w_i^\star)^2}{2\eta_i} + \frac{\eta_i}{2} \sum_{t=1}^{T} \nabla_{t,i}^2 \right) \quad (w_1 = 0).$$
The optimal in-hindsight tuning $\eta_i^\star = \frac{|w_i^\star|}{\sqrt{\sum_t \nabla_{t,i}^2}}$ gives
$$\mathrm{regret}_T(w^\star) \le \sum_{i=1}^{d} |w_i^\star| \sqrt{\sum_t \nabla_{t,i}^2},$$
better than the previous bound (a single tuning per feature). Can we get this optimal SGD regret bound with some adaptive tuning strategy?

• Adaptive learning rate per feature (AdaGrad [Duchi et al., 2011]): tuning the learning rate mimics the optimal tuning,
$$w_{t+1,i} = w_{t,i} - \eta_{i,t} \nabla_{t,i}, \qquad \text{where } \eta_{i,t} = \frac{\eta_i}{\sqrt{\epsilon + \sum_{j \le t} \nabla_{j,i}^2}},$$
$$\mathrm{regret}_T(w^\star) \le \sum_{i=1}^{d} \left( \frac{\max_t |w_i^\star - w_{i,t}|^2}{2\eta_i} + \eta_i \sqrt{\epsilon + \sum_t \nabla_{t,i}^2} \right).$$
Not there yet: this still requires tuning $\eta_i$ depending on the unknown $w^\star$!
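The per-feature AdaGrad update can be sketched on the logistic loss as follows (our toy transcription, not a tuned implementation; `eta` and `eps` correspond to the $\eta_i$ and $\epsilon$ above, here shared across features):

```python
import numpy as np

def adagrad(data, d, eta=1.0, eps=1.0):
    """Per-feature AdaGrad: eta_{i,t} = eta / sqrt(eps + sum_{j<=t} grad_{j,i}^2)."""
    w = np.zeros(d)
    s = np.zeros(d)                            # running sum of squared gradients per feature
    losses = []
    for x, y in data:
        y_hat = x @ w
        losses.append(np.log1p(np.exp(-y * y_hat)))
        g = -y / (1.0 + np.exp(y * y_hat))     # derivative of the logistic loss in y_hat
        grad = g * x                           # chain rule: gradient w.r.t. w
        s += grad ** 2
        w -= eta / np.sqrt(eps + s) * grad     # adaptive per-feature step
    return w, losses

data = [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1), (np.array([1.0, 0.0]), 1)]
w, losses = adagrad(data, d=2)
assert w[0] > 0 and w[1] < 0        # weights move against the gradients' signs
assert losses[2] < losses[0]        # loss on the repeated example has decreased
```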
Feature scales

Consider the per-feature update $w_{t+1,i} = w_{t,i} - \eta_i \nabla_{t,i}$. By the chain rule,
$$\nabla_t = \nabla_{w_t} \ell(y_t, x_t^\top w_t) = \underbrace{\frac{\partial \ell(y_t, \hat{y})}{\partial \hat{y}} \Big|_{\hat{y} = x_t^\top w_t}}_{g_t} \, x_t,$$
so the update reads $w_{t+1,i} = w_{t,i} - \eta_i g_t x_{t,i}$. For example, for the squared-error loss:
$$\nabla_{w_t} (y_t - w_t^\top x_t)^2 = \underbrace{-2(y_t - w_t^\top x_t)}_{g_t} \, x_t.$$

Suppose feature i has a physical unit $[X_i]$, while the label and the prediction are dimensionless (as in, e.g., classification) $\Longrightarrow$ the i-th weight coordinate $w_i$ must have unit $1/[X_i]$.

In the update $w_{t+1,i} = w_{t,i} - \eta_i g_t x_{t,i}$, both $w_{t+1,i}$ and $w_{t,i}$ have unit $1/[X_i]$, $g_t$ is dimensionless, and $x_{t,i}$ has unit $[X_i]$: the units do not match! . . . unless $[\eta_i] = 1/[X_i]^2$.

The learning rate should compensate the units on each coordinate! (In fact, the optimal in-hindsight tuning $\eta_i = \frac{|w_i^\star|}{\sqrt{\sum_t \nabla_{t,i}^2}}$ achieves exactly that: its unit is $\frac{1/[X_i]}{[X_i]} = 1/[X_i]^2$.) A single learning rate is unable to compensate the units.
Feature scales

AdaGrad [Duchi et al., 2011]:
$$w_{t+1,i} = w_{t,i} - \frac{\eta}{\sqrt{\epsilon + \sum_{j \le t} \nabla_{j,i}^2}} \, \nabla_{t,i}.$$
Unit check: $w_{t+1,i}$ and $w_{t,i}$ have unit $1/[X_i]$, while $\nabla_{j,i}$ has unit $[X_i]$, so $\nabla_{t,i} / \sqrt{\sum_j \nabla_{j,i}^2}$ is dimensionless and $\eta$ would need unit $1/[X_i]$ on every coordinate simultaneously. The learning rate still needs to compensate the units, but cannot do so for all coordinates at the same time.
• Also applies to RMSprop [Tieleman and Hinton, 2012] and Adam [Kingma and Ba, 2014]
• Heuristically solved by Adadelta [Zeiler, 2012]

Motivation: fully adaptive algorithms need to resolve this scaling issue
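The unit mismatch is easy to see numerically: running single-$\eta$ AdaGrad on the same data with the two features rescaled by very different factors changes the predictions (a toy demonstration; the absolute loss and the chosen scales are our assumptions):

```python
import numpy as np

def adagrad_preds(data, d, eta=1.0, eps=1.0):
    """Single-eta AdaGrad on the absolute loss; records the predictions."""
    w, s, preds = np.zeros(d), np.zeros(d), []
    for x, y in data:
        y_hat = x @ w
        preds.append(y_hat)
        g = np.sign(y_hat - y)                 # subgradient of |y_hat - y| in y_hat
        grad = g * x
        s += grad ** 2
        w -= eta / np.sqrt(eps + s) * grad
    return preds

data = [(np.array([1.0, 1.0]), 1.0), (np.array([2.0, -1.0]), -1.0), (np.array([1.0, 3.0]), 2.0)]
a = np.array([100.0, 0.01])                    # rescale the two features very differently
scaled = [(a * x, y) for x, y in data]
# A scale-invariant learner would predict identically on both sequences; AdaGrad does not.
assert not np.allclose(adagrad_preds(data, 2), adagrad_preds(scaled, 2))
```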
Scale invariance

A natural symmetry in linear problems: rescaling the features followed by the inverse scaling of the weights keeps the predictions (and hence the losses) invariant,
$$\forall t \quad x_t \mapsto A^{-1} x_t, \quad w \mapsto A w \;\Longrightarrow\; x_t^\top w \mapsto x_t^\top w \qquad \text{for any diagonal matrix } A.$$

In particular: if $w^\star$ is optimal (the loss minimizer) for the sequence $\{(x_t, y_t)\}_{t=1}^{T}$, then $A^{-1} w^\star$ is optimal for the sequence $\{(A x_t, y_t)\}_{t=1}^{T}$.

Example: minimizing the squared-error loss,
$$w^\star = \Big( \sum_t x_t x_t^\top \Big)^{-1} \sum_t x_t y_t,$$
transforms under $x_t \mapsto A x_t$ as
$$w^\star \mapsto \Big( \sum_t A x_t (A x_t)^\top \Big)^{-1} \sum_t A x_t \, y_t = A^{-1} \Big( \sum_t x_t x_t^\top \Big)^{-1} A^{-1} A \sum_t x_t y_t = A^{-1} w^\star.$$

A learning algorithm is scale-invariant if it returns the same predictions under arbitrary rescaling of the data:
$$\forall t \quad x_t \mapsto A x_t \;\Longrightarrow\; w_t \mapsto A^{-1} w_t \;\Longrightarrow\; w_t^\top x_t \mapsto w_t^\top x_t$$
— no initial data normalization required!

Motivation: a fully adaptive algorithm needs to be scale-invariant
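The symmetry can be checked numerically (a small sketch; the data and the per-feature scales are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
w = rng.normal(size=4)
a = np.array([0.01, 0.5, 10.0, 100.0])         # diag(A): per-feature scales

# Rescaling the features with the inverse rescaling of the weights preserves predictions:
assert np.isclose(x @ w, (a * x) @ (w / a))

# Consequently the least-squares minimizer transforms as w* -> A^{-1} w*:
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
w_star_scaled = np.linalg.lstsq(X * a, y, rcond=None)[0]
assert np.allclose(w_star_scaled, w_star / a)
```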
Past work

Scale-invariant algorithms with bounded predictions [Ross et al., 2013, Orabona et al., 2015]. Assumption: $|x_{t,i} w_i^\star| \le C$ for all $i, t$, for some constant $C$:
$$\mathrm{regret}_T(w^\star) = O\big(d \sqrt{C^2 T}\big).$$
Compare with the optimal SGD regret $\sum_{i=1}^{d} |w_i^\star| \sqrt{\sum_t \nabla_{t,i}^2} = \sum_{i=1}^{d} \sqrt{\sum_t (\nabla_{t,i} w_i^\star)^2}$: the bound above replaces each $\sum_t (\nabla_{t,i} w_i^\star)^2$ by its upper bound $C^2 T$ (since $|\nabla_{t,i} w_i^\star| = |g_t|\,|x_{t,i} w_i^\star| \le C$ for a 1-Lipschitz loss) and the sum over $i$ by the factor $d$.

[Luo et al., 2016] considers a more general version of scale invariance, but also with bounded predictions.

Some more recent work on unconstrained online learning: [McMahan and Streeter, 2010, McMahan and Abernethy, 2013, Orabona, 2013, Cutkosky and Boahen, 2017, Cutkosky and Orabona, 2018].

Prior to this work: [Kotłowski, 2017]
Scale-invariant algorithms

Scale-Invariant Online Learning, Algorithm 1: ScInOL$_1$. Parameter: $\epsilon = 1$.

Keep track of the data statistics
$$M_{t,i} = \max_{j \le t} |x_{j,i}| \ \text{(maximum value of feature } i\text{)}, \quad S_{t,i}^2 = \sum_{j \le t} \nabla_{j,i}^2 \ \text{(sum of squared gradients)}, \quad G_{t,i} = \sum_{j \le t} \nabla_{j,i} \ \text{(sum of gradients)},$$
and an auxiliary variable
$$\beta_{t,i} = \min\Big\{ \beta_{t-1,i}, \; \frac{S_{t-1,i}^2 + M_{t,i}^2}{x_{t,i}^2 \, t} \Big\} \qquad \text{with } \beta_{0,i} = \epsilon.$$
The weights are
$$w_{t,i} = \frac{\beta_{t,i} \, \mathrm{sgn}(\theta_{t,i})}{2 \sqrt{S_{t-1,i}^2 + M_{t,i}^2}} \big( e^{|\theta_{t,i}|/2} - 1 \big), \qquad \text{where } \theta_{t,i} = \frac{G_{t-1,i}}{\sqrt{S_{t-1,i}^2 + M_{t,i}^2}}.$$
Unit check: $\theta_{t,i}$ and $\beta_{t,i}$ are unitless (the $[X_i]$ factors cancel in both ratios), and $\sqrt{S_{t-1,i}^2 + M_{t,i}^2}$ has unit $[X_i]$, so $w_{t,i}$ gets unit $1/[X_i]$ as required.

$$\mathrm{regret}_T(w^\star) = \sum_{i=1}^{d} \tilde{O}\Big( |w_i^\star| \sqrt{\max_t x_{t,i}^2 + \sum_t \nabla_{t,i}^2} \Big), \qquad \text{where } \tilde{O}(\cdot) \text{ hides logarithmic factors.}$$
Optimal up to logarithmic terms.
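The update above can be transcribed into a short Python sketch (our illustrative reading of the slide's formulas with $\epsilon = 1$, not the authors' reference implementation), and the scale-invariance property can be checked by running it on a rescaled copy of the data:

```python
import numpy as np

class ScInOL1:
    """Sketch of the ScInOL_1 update, transcribed from the formulas above."""

    def __init__(self, d, eps=1.0):
        self.M = np.zeros(d)           # M_{t,i}: max_{j<=t} |x_{j,i}|
        self.S2 = np.zeros(d)          # S_{t,i}^2: sum of squared gradients
        self.G = np.zeros(d)           # G_{t,i}: sum of gradients
        self.beta = np.full(d, eps)    # auxiliary variable, beta_{0,i} = eps
        self.t = 0

    def weights(self, x):
        """w_t, computed after seeing x_t but before seeing its gradient."""
        self.t += 1
        self.M = np.maximum(self.M, np.abs(x))
        denom = np.sqrt(self.S2 + self.M ** 2)
        nz = x != 0                    # cap beta only where x_t is active
        self.beta[nz] = np.minimum(
            self.beta[nz], (self.S2[nz] + self.M[nz] ** 2) / (x[nz] ** 2 * self.t))
        theta = np.divide(self.G, denom, out=np.zeros_like(denom), where=denom > 0)
        return np.divide(self.beta * np.sign(theta) * np.expm1(np.abs(theta) / 2),
                         2 * denom, out=np.zeros_like(denom), where=denom > 0)

    def update(self, grad):            # grad_i = g_t * x_{t,i}
        self.S2 += grad ** 2
        self.G += grad

def run(data, d=2):
    alg, preds = ScInOL1(d), []
    for x, y in data:
        w = alg.weights(x)
        y_hat = x @ w
        preds.append(y_hat)
        g = -y / (1.0 + np.exp(y * y_hat))     # logistic loss derivative in y_hat
        alg.update(g * x)
    return np.array(preds)

data = [(np.array([1.0, 2.0]), 1), (np.array([-3.0, 0.5]), -1), (np.array([0.5, -1.0]), 1)]
a = np.array([100.0, 0.01])
scaled = [(a * x, y) for x, y in data]
# Same predictions with and without rescaling: the algorithm is scale-invariant.
assert np.allclose(run(data), run(scaled))
```

Every statistic on coordinate i scales by a power of $a_i$ that cancels in $\theta_{t,i}$ and $\beta_{t,i}$, so the weights pick up exactly the factor $1/a_i$ and the predictions are unchanged.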