
Scale-invariant online learning

Michał Kempka, Wojciech Kotłowski

IDSS Seminar, 27.11.2018

1 / 22

Online learning example: travel time estimation

• At every timestamp t, navigation software needs to predict the travel time y_t at a given road segment
• Given a feature vector x_t ∈ R^d representing current traffic conditions, predict ŷ_t = x_t^⊤ w_t with a linear model
• Observe the real y_t and measure the prediction loss, e.g. (y_t − ŷ_t)²
• Improve the model parameters w_t → w_{t+1}

2 / 22

Online learning example: spam filtering

• At every timestamp t, a spam filter needs to classify an incoming email as spam/no-spam (y_t ∈ {+1, −1})
• Given a feature vector x_t ∈ R^d representing the email's body, predict ŷ_t = x_t^⊤ w_t with a linear model
• Receive feedback y_t from the user and measure the prediction loss, e.g. the logistic loss log(1 + e^{−y_t ŷ_t})
• Improve the model parameters w_t → w_{t+1}

3 / 22

Online learning with linear models

At each trial t = 1, ..., T:
• Nature reveals an input instance x_t ∈ R^d  (revealed before the prediction!)
• The learner predicts with a linear model ŷ_t = x_t^⊤ w_t, where w_t ∈ R^d
• Nature reveals the label y_t
• The learner suffers the loss ℓ(y_t, ŷ_t), convex and L-Lipschitz in ŷ_t

L-Lipschitz = (sub)derivative bounded by L:

  Loss function | ℓ(y, ŷ)           | ∂_ŷ ℓ(y, ŷ)       | L
  logistic      | log(1 + e^{−yŷ})  | −y / (1 + e^{yŷ}) | 1
  hinge         | max{0, 1 − yŷ}    | −y · 1[yŷ ≤ 1]    | 1
  absolute      | |ŷ − y|           | sgn(ŷ − y)        | 1

Without loss of generality assume L = 1.

No stochastic assumptions on the data sequence (x_t, y_t) are made. Minimize the regret relative to an oracle weight vector w* ∈ R^d:

  regret_T(w*) = Σ_{t=1}^T ℓ(y_t, x_t^⊤ w_t) − Σ_{t=1}^T ℓ(y_t, x_t^⊤ w*)

Goal: sublinear regret for any w* and any data sequence (x_t, y_t).

4 / 22
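The protocol above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names and the choice of the logistic loss are assumptions.

```python
import numpy as np

def logistic_loss(y, y_hat):
    # log(1 + exp(-y * y_hat)), computed stably; convex and 1-Lipschitz in y_hat
    return np.logaddexp(0.0, -y * y_hat)

def run_protocol(predict, update, data):
    """Online protocol: for each (x_t, y_t) the learner commits to w_t via
    predict(x_t) BEFORE y_t is revealed, then may update on (x_t, y_t)."""
    losses = []
    for x_t, y_t in data:
        w_t = predict(x_t)
        losses.append(logistic_loss(y_t, x_t @ w_t))
        update(x_t, y_t)
    return np.array(losses)

def regret(losses, data, w_star):
    # cumulative loss of the learner minus that of the fixed oracle w*
    oracle = sum(logistic_loss(y, x @ w_star) for x, y in data)
    return losses.sum() - oracle
```

For instance, the trivial learner that always plays w_t = 0 suffers loss log 2 per round on ±1 labels and has zero regret against w* = 0.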

Stochastic Gradient Descent (SGD)

• Fixed learning rate — make a small step along the negative gradient of the loss:

  w_{t+1} = w_t − η ∇_t,  where ∇_t = ∇_{w_t} ℓ(y_t, x_t^⊤ w_t)

  regret_T(w*) ≤ ‖w*‖² / (2η) + (η/2) Σ_{t=1}^T ‖∇_t‖²   (starting at w_1 = 0)

  The optimal in-hindsight tuning η* = ‖w*‖ / √(Σ_t ‖∇_t‖²) minimizes the regret (impossible in practice) and gives

  regret_T(w*) ≤ ‖w*‖ √(Σ_t ‖∇_t‖²)

• Separate fixed learning rate per feature — each feature has its own learning rate:

  w_{t+1,i} = w_{t,i} − η_i ∇_{t,i}

  regret_T(w*) ≤ Σ_{i=1}^d ( (w*_i)² / (2η_i) + (η_i/2) Σ_{t=1}^T ∇²_{t,i} )   (w_1 = 0)

  The optimal in-hindsight tuning η*_i = |w*_i| / √(Σ_t ∇²_{t,i}) gives

  regret_T(w*) ≤ Σ_{i=1}^d |w*_i| √(Σ_t ∇²_{t,i}),

  better than the previous bound (a single tuning per feature). Can we get this optimal SGD regret bound with some adaptive tuning strategy?

• Adaptive learning rate per feature (AdaGrad [Duchi et al., 2011]) — tuning the learning rate mimics the optimal tuning:

  w_{t+1,i} = w_{t,i} − η_{i,t} ∇_{t,i},  where η_{i,t} = η_i / √(ε + Σ_{j≤t} ∇²_{j,i})

  regret_T(w*) ≤ Σ_{i=1}^d ( max_t |w*_i − w_{t,i}|² / (2η_i) + η_i ) √(ε + Σ_t ∇²_{t,i})

  Not there yet: this still requires tuning η_i depending on the unknown w*!

5 / 22
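The three update rules above can be sketched as follows. This is a minimal illustration for the logistic loss; the function names are mine, and the derivative is written via tanh for numerical stability.

```python
import numpy as np

def dloss(y, y_hat):
    # derivative of the logistic loss in y_hat: -y / (1 + e^{y*y_hat}),
    # expressed via tanh so it never overflows
    return -y * 0.5 * (1.0 - np.tanh(0.5 * y * y_hat))

def sgd_fixed(data, d, eta):
    # w_{t+1} = w_t - eta * grad_t, starting from w_1 = 0
    w = np.zeros(d)
    for x, y in data:
        w = w - eta * dloss(y, x @ w) * x
    return w

def sgd_per_feature(data, d, etas):
    # w_{t+1,i} = w_{t,i} - eta_i * grad_{t,i}: one fixed rate per coordinate
    w = np.zeros(d)
    for x, y in data:
        w = w - etas * dloss(y, x @ w) * x
    return w

def adagrad(data, d, eta=1.0, eps=1e-8):
    # w_{t+1,i} = w_{t,i} - eta / sqrt(eps + sum_{j<=t} grad_{j,i}^2) * grad_{t,i}
    w, sq = np.zeros(d), np.zeros(d)
    for x, y in data:
        g = dloss(y, x @ w) * x
        sq = sq + g * g
        w = w - eta / np.sqrt(eps + sq) * g
    return w
```

On data whose label is the sign of the first feature, all three variants drive the first weight coordinate positive, as expected.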

Feature scales

w_{t+1,i} = w_{t,i} − η_i ∇_{t,i}

By the chain rule, ∇_t = ∇_{w_t} ℓ(y_t, x_t^⊤ w_t) = g_t x_t, where g_t = ∂ℓ(y_t, ŷ)/∂ŷ evaluated at ŷ_t = x_t^⊤ w_t, so:

  w_{t+1,i} = w_{t,i} − η_i g_t x_{t,i}

For example, for the squared-error loss: ∇_{w_t} (y_t − w_t^⊤ x_t)² = −2(y_t − w_t^⊤ x_t) x_t, i.e. g_t = −2(y_t − w_t^⊤ x_t).

Suppose feature i has a physical unit [X_i], while the label and the prediction are dimensionless (as in, e.g., classification)
⟹ the i-th weight coordinate w_i must have unit 1/[X_i].

In the update, w_{t+1,i} and w_{t,i} have unit 1/[X_i], g_t is dimensionless, and x_{t,i} has unit [X_i]: the units do not match ... unless [η_i] = 1/[X_i]².

The learning rate should compensate the units on each coordinate! (In fact, the optimal in-hindsight tuning η_i = |w*_i| / √(Σ_t ∇²_{t,i}) achieves exactly that.) A single learning rate is unable to compensate the units.

6 / 22
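The unit mismatch is easy to observe numerically: rescaling one feature changes the predictions of single-rate SGD, because the shared η cannot carry the 1/[X_i]² unit each coordinate needs. A small assumed experiment (all names illustrative):

```python
import numpy as np

def sgd_predictions(data, d, eta):
    # single-learning-rate SGD for the logistic loss; returns the predictions y_hat_t
    w, preds = np.zeros(d), []
    for x, y in data:
        y_hat = x @ w
        preds.append(y_hat)
        g = -y * 0.5 * (1.0 - np.tanh(0.5 * y * y_hat))  # logistic derivative in y_hat
        w = w - eta * g * x
    return np.array(preds)

rng = np.random.default_rng(1)
data = [(rng.normal(size=2), float(rng.choice([-1.0, 1.0]))) for _ in range(20)]
# express feature 1 in different "units": multiply it by 1000
scaled = [(np.array([1000.0, 1.0]) * x, y) for x, y in data]

p_orig = sgd_predictions(data, 2, eta=0.1)
p_scaled = sgd_predictions(scaled, 2, eta=0.1)
# the prediction sequences diverge: SGD with one shared eta is not scale-invariant
assert not np.allclose(p_orig, p_scaled)
```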

Feature scales

AdaGrad [Duchi et al., 2011]:

  w_{t+1,i} = w_{t,i} − η / √(ε + Σ_{j≤t} ∇²_{j,i}) · ∇_{t,i}

Unit check: w_{t,i} has unit 1/[X_i], while ∇_{t,i} has unit [X_i] and so does √(ε + Σ_{j≤t} ∇²_{j,i}), so their ratio is dimensionless and η itself would need unit 1/[X_i] — for every i at once. The learning rate still needs to compensate the units, but cannot do so for all coordinates at the same time.

• This also applies to RMSprop [Tieleman and Hinton, 2012] and Adam [Kingma and Ba, 2014]
• Heuristically solved by Adadelta [Zeiler, 2012]

Motivation: fully adaptive algorithms need to resolve this scaling issue.

7 / 22

Scale invariance

A natural symmetry in linear problems.

Rescaling the features followed by the inverse rescaling of the weights keeps the predictions (and hence the losses) invariant: for any invertible diagonal matrix A,

  ∀t  x_t ↦ A⁻¹ x_t,  w ↦ A w  ⟹  x_t^⊤ w ↦ x_t^⊤ w

(coordinate-wise: x_{t,i} ↦ a_i x_{t,i} together with w_i ↦ a_i⁻¹ w_i leaves x_t^⊤ w unchanged).

In particular: if w* is optimal (the loss minimizer) for the sequence {(x_t, y_t)}_{t=1}^T, then A⁻¹ w* is optimal for the sequence {(A x_t, y_t)}_{t=1}^T.

Example: minimizing the squared-error loss,

  w* = (Σ_t x_t x_t^⊤)⁻¹ Σ_t x_t y_t,

and under x_t ↦ A x_t:

  w* ↦ (Σ_t A x_t (A x_t)^⊤)⁻¹ Σ_t A x_t y_t = A⁻¹ (Σ_t x_t x_t^⊤)⁻¹ A⁻¹ A Σ_t x_t y_t = A⁻¹ w*

A learning algorithm is scale-invariant if it returns the same predictions under arbitrary rescaling of the data:

  ∀t  x_t ↦ A x_t  ⟹  w_t ↦ A⁻¹ w_t  ⟹  w_t^⊤ x_t ↦ w_t^⊤ x_t

— no initial data normalization required!

Motivation: a fully adaptive algorithm needs to be scale-invariant.

8 / 22
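The squared-error example above can be checked numerically. This is a small illustrative experiment on random data (not from the talk):

```python
import numpy as np

def lsq(X, y):
    # w* = (sum_t x_t x_t^T)^{-1} sum_t x_t y_t  (least-squares minimizer)
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # row t is x_t^T
y = rng.normal(size=50)
A = np.diag([10.0, 0.1, 3.0])           # an arbitrary diagonal rescaling

w_star = lsq(X, y)
w_scaled = lsq(X @ A, y)                # features rescaled: x_t -> A x_t

assert np.allclose(w_scaled, np.linalg.inv(A) @ w_star)   # w* -> A^{-1} w*
assert np.allclose(X @ A @ w_scaled, X @ w_star)          # predictions invariant
```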

Past work

Scale-invariant algorithms with bounded predictions [Ross et al., 2013, Orabona et al., 2015]

Assumption: |x_{t,i} w*_i| ≤ C for all i, t, for some constant C. Then

  regret_T(w*) = O(d √(C² T))

Compare with the optimal SGD regret Σ_{i=1}^d |w*_i| √(Σ_t ∇²_{t,i}) = Σ_{i=1}^d √(Σ_t (∇_{t,i} w*_i)²): here C² T replaces Σ_t (∇_{t,i} w*_i)², and d replaces the sum over i.

[Luo et al., 2016] considers a more general version of scale invariance, but also with bounded predictions.

Some more recent work on unconstrained online learning: [McMahan and Streeter, 2010, McMahan and Abernethy, 2013, Orabona, 2013, Cutkosky and Boahen, 2017, Cutkosky and Orabona, 2018]

Prior to this work: [Kotłowski, 2017]

9 / 22

Scale-invariant algorithms

Scale Invariant Online Learning, Algorithm 1: ScInOL₁

Parameter: ε = 1

Keep track of data statistics:

  M_{t,i} = max_{j≤t} |x_{j,i}|   (maximum value of feature i)
  S²_{t,i} = Σ_{j≤t} ∇²_{j,i}     (sum of squared gradients)
  G_{t,i} = Σ_{j≤t} ∇_{j,i}       (sum of gradients)

and an auxiliary variable β_{t,i} = min{ β_{t−1,i}, ε (S²_{t−1,i} + M²_{t,i}) / (x²_{t,i} t) } with β_{0,i} = ε.

  w_{t,i} = β_{t,i} sgn(θ_{t,i}) / (2 √(S²_{t−1,i} + M²_{t,i})) · (e^{|θ_{t,i}|/2} − 1),  where θ_{t,i} = G_{t−1,i} / √(S²_{t−1,i} + M²_{t,i})

Unit check: S_{t,i}, M_{t,i} and G_{t,i} all have unit [X_i], so θ_{t,i} and β_{t,i} are unitless and w_{t,i} has unit 1/[X_i], as required.

  regret_T(w*) = Σ_{i=1}^d Õ( |w*_i| √(max_t x²_{t,i} + Σ_t ∇²_{t,i}) ),  where Õ(·) hides logarithmic factors

Optimal up to logarithmic terms.

10 / 22
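As a rough illustration, the ScInOL₁ update above can be coded as follows. The exact subscript conventions are reconstructed from a garbled source, so treat this as a sketch rather than the authors' reference implementation; the class and method names are mine.

```python
import numpy as np

class ScInOL1:
    """Sketch of the ScInOL_1 update (eps = 1 on the slide).
    Illustrative only; subscript details are reconstructed."""

    def __init__(self, d, eps=1.0):
        self.eps = eps
        self.M = np.zeros(d)         # M_i: running max of |x_{t,i}|
        self.S2 = np.zeros(d)        # S^2_i: running sum of squared gradients
        self.G = np.zeros(d)         # G_i: running sum of gradients
        self.beta = np.full(d, eps)  # auxiliary variable, beta_{0,i} = eps
        self.t = 0

    def weights(self, x):
        """Weights w_t used to predict on x (call before seeing the label)."""
        self.t += 1
        self.M = np.maximum(self.M, np.abs(x))
        denom = np.sqrt(np.maximum(self.S2 + self.M ** 2, 1e-300))
        nz = x != 0.0
        # beta_{t,i} = min{beta_{t-1,i}, eps*(S^2_{t-1,i}+M^2_{t,i})/(x^2_{t,i}*t)}
        self.beta[nz] = np.minimum(
            self.beta[nz], self.eps * denom[nz] ** 2 / (x[nz] ** 2 * self.t))
        theta = self.G / denom
        return (self.beta * np.sign(theta) / (2.0 * denom)
                * (np.exp(np.abs(theta) / 2.0) - 1.0))

    def update(self, grad):
        """grad = g_t * x_t, the loss gradient at w_t."""
        self.S2 += grad ** 2
        self.G += grad
```

Running this learner on {(x_t, y_t)} and on {(A x_t, y_t)} for a diagonal A yields the same predictions up to floating-point error, matching the scale-invariance property claimed for the algorithm.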
