
Scale-invariant online learning

Michał Kempka, Wojciech Kotłowski

IDSS Seminar, 27.11.2018

1 / 22

Online learning example: travel time estimation

• At every timestamp t, navigation software needs to predict the travel time y_t at a given road segment
• Given a feature vector x_t ∈ R^d representing current traffic conditions, predict ŷ_t = x_t^⊤ w_t with a linear model
• Observe the real y_t and measure the prediction loss, e.g. (y_t − ŷ_t)²
• Improve the model parameters w_t → w_{t+1}

2 / 22

Online learning example: spam filtering

• At every timestamp t, a spam filter needs to classify an incoming email as spam/no-spam (y_t ∈ {+1, −1})
• Given a feature vector x_t ∈ R^d representing the email's body, predict ŷ_t = x_t^⊤ w_t with a linear model
• Receive feedback y_t from the user and measure the prediction loss, e.g. the logistic loss log(1 + e^{−y_t ŷ_t})
• Improve the model parameters w_t → w_{t+1}

3 / 22

Online learning with linear models

At each trial t = 1, ..., T:
• Nature reveals an input instance x_t ∈ R^d  (revealed before the prediction!)
• The learner predicts with a linear model ŷ_t = x_t^⊤ w_t, where w_t ∈ R^d
• Nature reveals the label y_t
• The learner suffers the loss ℓ(y_t, ŷ_t), convex and L-Lipschitz in ŷ_t

L-Lipschitz = (sub)derivative bounded by L:

  Loss function | ℓ(y, ŷ)           | ∂_ŷ ℓ(y, ŷ)       | L
  logistic      | log(1 + e^{−yŷ})  | −y / (1 + e^{yŷ}) | 1
  hinge         | max{0, 1 − yŷ}    | −y · 1[yŷ ≤ 1]    | 1
  absolute      | |ŷ − y|           | sgn(ŷ − y)        | 1

Without loss of generality assume L = 1.

No stochastic assumptions on the data sequence (x_t, y_t) are made. Minimize the regret relative to an oracle weight vector w* ∈ R^d:

  regret_T(w*) = Σ_{t=1}^T ℓ(y_t, x_t^⊤ w_t) − Σ_{t=1}^T ℓ(y_t, x_t^⊤ w*)

Goal: sublinear regret for any w* and any data sequence (x_t, y_t).

4 / 22
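The protocol above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names and the choice of the logistic loss are assumptions.

```python
import numpy as np

def logistic_loss(y, y_hat):
    # log(1 + exp(-y * y_hat)), computed stably; convex and 1-Lipschitz in y_hat
    return np.logaddexp(0.0, -y * y_hat)

def run_protocol(predict, update, data):
    """Online protocol: for each (x_t, y_t) the learner commits to w_t via
    predict(x_t) BEFORE y_t is revealed, then may update on (x_t, y_t)."""
    losses = []
    for x_t, y_t in data:
        w_t = predict(x_t)
        losses.append(logistic_loss(y_t, x_t @ w_t))
        update(x_t, y_t)
    return np.array(losses)

def regret(losses, data, w_star):
    # cumulative loss of the learner minus that of the fixed oracle w*
    oracle = sum(logistic_loss(y, x @ w_star) for x, y in data)
    return losses.sum() - oracle
```

For instance, the trivial learner that always plays w_t = 0 suffers loss log 2 per round on ±1 labels and has zero regret against w* = 0.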

Stochastic Gradient Descent (SGD)

• Fixed learning rate — make a small step along the negative gradient of the loss:

  w_{t+1} = w_t − η ∇_t,  where ∇_t = ∇_{w_t} ℓ(y_t, x_t^⊤ w_t)

  regret_T(w*) ≤ ‖w*‖² / (2η) + (η/2) Σ_{t=1}^T ‖∇_t‖²   (starting at w_1 = 0)

  The optimal in-hindsight tuning η* = ‖w*‖ / √(Σ_t ‖∇_t‖²) minimizes the regret (impossible in practice) and gives

  regret_T(w*) ≤ ‖w*‖ √(Σ_t ‖∇_t‖²)

• Separate fixed learning rate per feature — each feature has its own learning rate:

  w_{t+1,i} = w_{t,i} − η_i ∇_{t,i}

  regret_T(w*) ≤ Σ_{i=1}^d ( (w*_i)² / (2η_i) + (η_i/2) Σ_{t=1}^T ∇²_{t,i} )   (w_1 = 0)

  The optimal in-hindsight tuning η*_i = |w*_i| / √(Σ_t ∇²_{t,i}) gives

  regret_T(w*) ≤ Σ_{i=1}^d |w*_i| √(Σ_t ∇²_{t,i}),

  better than the previous bound (a single tuning per feature). Can we get this optimal SGD regret bound with some adaptive tuning strategy?

• Adaptive learning rate per feature (AdaGrad [Duchi et al., 2011]) — tuning the learning rate mimics the optimal tuning:

  w_{t+1,i} = w_{t,i} − η_{i,t} ∇_{t,i},  where η_{i,t} = η_i / √(ε + Σ_{j≤t} ∇²_{j,i})

  regret_T(w*) ≤ Σ_{i=1}^d ( max_t |w*_i − w_{t,i}|² / (2η_i) + η_i ) √(ε + Σ_t ∇²_{t,i})

  Not there yet: this still requires tuning η_i depending on the unknown w*!

5 / 22
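The three update rules above can be sketched as follows. This is a minimal illustration for the logistic loss; the function names are mine, and the derivative is written via tanh for numerical stability.

```python
import numpy as np

def dloss(y, y_hat):
    # derivative of the logistic loss in y_hat: -y / (1 + e^{y*y_hat}),
    # expressed via tanh so it never overflows
    return -y * 0.5 * (1.0 - np.tanh(0.5 * y * y_hat))

def sgd_fixed(data, d, eta):
    # w_{t+1} = w_t - eta * grad_t, starting from w_1 = 0
    w = np.zeros(d)
    for x, y in data:
        w = w - eta * dloss(y, x @ w) * x
    return w

def sgd_per_feature(data, d, etas):
    # w_{t+1,i} = w_{t,i} - eta_i * grad_{t,i}: one fixed rate per coordinate
    w = np.zeros(d)
    for x, y in data:
        w = w - etas * dloss(y, x @ w) * x
    return w

def adagrad(data, d, eta=1.0, eps=1e-8):
    # w_{t+1,i} = w_{t,i} - eta / sqrt(eps + sum_{j<=t} grad_{j,i}^2) * grad_{t,i}
    w, sq = np.zeros(d), np.zeros(d)
    for x, y in data:
        g = dloss(y, x @ w) * x
        sq = sq + g * g
        w = w - eta / np.sqrt(eps + sq) * g
    return w
```

On data whose label is the sign of the first feature, all three variants drive the first weight coordinate positive, as expected.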

Feature scales

w_{t+1,i} = w_{t,i} − η_i ∇_{t,i}

By the chain rule, ∇_t = ∇_{w_t} ℓ(y_t, x_t^⊤ w_t) = g_t x_t, where g_t = ∂ℓ(y_t, ŷ)/∂ŷ evaluated at ŷ_t = x_t^⊤ w_t, so:

  w_{t+1,i} = w_{t,i} − η_i g_t x_{t,i}

For example, for the squared-error loss: ∇_{w_t} (y_t − w_t^⊤ x_t)² = −2(y_t − w_t^⊤ x_t) x_t, i.e. g_t = −2(y_t − w_t^⊤ x_t).

Suppose feature i has a physical unit [X_i], while the label and the prediction are dimensionless (as in, e.g., classification)
⟹ the i-th weight coordinate w_i must have unit 1/[X_i].

In the update, w_{t+1,i} and w_{t,i} have unit 1/[X_i], g_t is dimensionless, and x_{t,i} has unit [X_i]: the units do not match ... unless [η_i] = 1/[X_i]².

The learning rate should compensate the units on each coordinate! (In fact, the optimal in-hindsight tuning η_i = |w*_i| / √(Σ_t ∇²_{t,i}) achieves exactly that.) A single learning rate is unable to compensate the units.

6 / 22
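The unit mismatch is easy to observe numerically: rescaling one feature changes the predictions of single-rate SGD, because the shared η cannot carry the 1/[X_i]² unit each coordinate needs. A small assumed experiment (all names illustrative):

```python
import numpy as np

def sgd_predictions(data, d, eta):
    # single-learning-rate SGD for the logistic loss; returns the predictions y_hat_t
    w, preds = np.zeros(d), []
    for x, y in data:
        y_hat = x @ w
        preds.append(y_hat)
        g = -y * 0.5 * (1.0 - np.tanh(0.5 * y * y_hat))  # logistic derivative in y_hat
        w = w - eta * g * x
    return np.array(preds)

rng = np.random.default_rng(1)
data = [(rng.normal(size=2), float(rng.choice([-1.0, 1.0]))) for _ in range(20)]
# express feature 1 in different "units": multiply it by 1000
scaled = [(np.array([1000.0, 1.0]) * x, y) for x, y in data]

p_orig = sgd_predictions(data, 2, eta=0.1)
p_scaled = sgd_predictions(scaled, 2, eta=0.1)
# the prediction sequences diverge: SGD with one shared eta is not scale-invariant
assert not np.allclose(p_orig, p_scaled)
```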

Feature scales

AdaGrad [Duchi et al., 2011]:

  w_{t+1,i} = w_{t,i} − η / √(ε + Σ_{j≤t} ∇²_{j,i}) · ∇_{t,i}

Unit check: w_{t,i} has unit 1/[X_i], while ∇_{t,i} has unit [X_i] and so does √(ε + Σ_{j≤t} ∇²_{j,i}), so their ratio is dimensionless and η itself would need unit 1/[X_i] — for every i at once. The learning rate still needs to compensate the units, but cannot do so for all coordinates at the same time.

• This also applies to RMSprop [Tieleman and Hinton, 2012] and Adam [Kingma and Ba, 2014]
• Heuristically solved by Adadelta [Zeiler, 2012]

Motivation: fully adaptive algorithms need to resolve this scaling issue.

7 / 22

Scale invariance

A natural symmetry in linear problems.

Rescaling the features followed by the inverse rescaling of the weights keeps the predictions (and hence the losses) invariant: for any invertible diagonal matrix A,

  ∀t  x_t ↦ A⁻¹ x_t,  w ↦ A w  ⟹  x_t^⊤ w ↦ x_t^⊤ w

(coordinate-wise: x_{t,i} ↦ a_i x_{t,i} together with w_i ↦ a_i⁻¹ w_i leaves x_t^⊤ w unchanged).

In particular: if w* is optimal (the loss minimizer) for the sequence {(x_t, y_t)}_{t=1}^T, then A⁻¹ w* is optimal for the sequence {(A x_t, y_t)}_{t=1}^T.

Example: minimizing the squared-error loss,

  w* = (Σ_t x_t x_t^⊤)⁻¹ Σ_t x_t y_t,

and under x_t ↦ A x_t:

  w* ↦ (Σ_t A x_t (A x_t)^⊤)⁻¹ Σ_t A x_t y_t = A⁻¹ (Σ_t x_t x_t^⊤)⁻¹ A⁻¹ A Σ_t x_t y_t = A⁻¹ w*

A learning algorithm is scale-invariant if it returns the same predictions under arbitrary rescaling of the data:

  ∀t  x_t ↦ A x_t  ⟹  w_t ↦ A⁻¹ w_t  ⟹  w_t^⊤ x_t ↦ w_t^⊤ x_t

— no initial data normalization required!

Motivation: a fully adaptive algorithm needs to be scale-invariant.

8 / 22
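The squared-error example above can be checked numerically. This is a small illustrative experiment on random data (not from the talk):

```python
import numpy as np

def lsq(X, y):
    # w* = (sum_t x_t x_t^T)^{-1} sum_t x_t y_t  (least-squares minimizer)
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # row t is x_t^T
y = rng.normal(size=50)
A = np.diag([10.0, 0.1, 3.0])           # an arbitrary diagonal rescaling

w_star = lsq(X, y)
w_scaled = lsq(X @ A, y)                # features rescaled: x_t -> A x_t

assert np.allclose(w_scaled, np.linalg.inv(A) @ w_star)   # w* -> A^{-1} w*
assert np.allclose(X @ A @ w_scaled, X @ w_star)          # predictions invariant
```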

Past work

Scale-invariant algorithms with bounded predictions [Ross et al., 2013, Orabona et al., 2015]

Assumption: |x_{t,i} w*_i| ≤ C for all i, t, for some constant C. Then

  regret_T(w*) = O(d √(C² T))

Compare with the optimal SGD regret Σ_{i=1}^d |w*_i| √(Σ_t ∇²_{t,i}) = Σ_{i=1}^d √(Σ_t (∇_{t,i} w*_i)²): here C² T replaces Σ_t (∇_{t,i} w*_i)², and d replaces the sum over i.

[Luo et al., 2016] considers a more general version of scale invariance, but also with bounded predictions.

Some more recent work on unconstrained online learning: [McMahan and Streeter, 2010, McMahan and Abernethy, 2013, Orabona, 2013, Cutkosky and Boahen, 2017, Cutkosky and Orabona, 2018]

Prior to this work: [Kotłowski, 2017]

9 / 22

Scale-invariant algorithms

Scale Invariant Online Learning, Algorithm 1: ScInOL₁

Parameter: ε = 1

Keep track of data statistics:

  M_{t,i} = max_{j≤t} |x_{j,i}|   (maximum value of feature i)
  S²_{t,i} = Σ_{j≤t} ∇²_{j,i}     (sum of squared gradients)
  G_{t,i} = Σ_{j≤t} ∇_{j,i}       (sum of gradients)

and an auxiliary variable β_{t,i} = min{ β_{t−1,i}, ε (S²_{t−1,i} + M²_{t,i}) / (x²_{t,i} t) } with β_{0,i} = ε.

  w_{t,i} = β_{t,i} sgn(θ_{t,i}) / (2 √(S²_{t−1,i} + M²_{t,i})) · (e^{|θ_{t,i}|/2} − 1),  where θ_{t,i} = G_{t−1,i} / √(S²_{t−1,i} + M²_{t,i})

Unit check: S_{t,i}, M_{t,i} and G_{t,i} all have unit [X_i], so θ_{t,i} and β_{t,i} are unitless and w_{t,i} has unit 1/[X_i], as required.

  regret_T(w*) = Σ_{i=1}^d Õ( |w*_i| √(max_t x²_{t,i} + Σ_t ∇²_{t,i}) ),  where Õ(·) hides logarithmic factors

Optimal up to logarithmic terms.

10 / 22
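As a rough illustration, the ScInOL₁ update above can be coded as follows. The exact subscript conventions are reconstructed from a garbled source, so treat this as a sketch rather than the authors' reference implementation; the class and method names are mine.

```python
import numpy as np

class ScInOL1:
    """Sketch of the ScInOL_1 update (eps = 1 on the slide).
    Illustrative only; subscript details are reconstructed."""

    def __init__(self, d, eps=1.0):
        self.eps = eps
        self.M = np.zeros(d)         # M_i: running max of |x_{t,i}|
        self.S2 = np.zeros(d)        # S^2_i: running sum of squared gradients
        self.G = np.zeros(d)         # G_i: running sum of gradients
        self.beta = np.full(d, eps)  # auxiliary variable, beta_{0,i} = eps
        self.t = 0

    def weights(self, x):
        """Weights w_t used to predict on x (call before seeing the label)."""
        self.t += 1
        self.M = np.maximum(self.M, np.abs(x))
        denom = np.sqrt(np.maximum(self.S2 + self.M ** 2, 1e-300))
        nz = x != 0.0
        # beta_{t,i} = min{beta_{t-1,i}, eps*(S^2_{t-1,i}+M^2_{t,i})/(x^2_{t,i}*t)}
        self.beta[nz] = np.minimum(
            self.beta[nz], self.eps * denom[nz] ** 2 / (x[nz] ** 2 * self.t))
        theta = self.G / denom
        return (self.beta * np.sign(theta) / (2.0 * denom)
                * (np.exp(np.abs(theta) / 2.0) - 1.0))

    def update(self, grad):
        """grad = g_t * x_t, the loss gradient at w_t."""
        self.S2 += grad ** 2
        self.G += grad
```

Running this learner on {(x_t, y_t)} and on {(A x_t, y_t)} for a diagonal A yields the same predictions up to floating-point error, matching the scale-invariance property claimed for the algorithm.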
