(1)

Online isotonic regression

Wojciech Kotłowski

Joint work with: Wouter Koolen (CWI, Amsterdam)

Alan Malek (MIT)

Poznań University of Technology, 06.06.2017

(2)

Outline

1 Motivation

2 Isotonic regression

3 Online learning

4 Online isotonic regression

5 Fixed design online isotonic regression

6 Random permutation online isotonic regression

(3)

Outline

1 Motivation

2 Isotonic regression

3 Online learning

4 Online isotonic regression

5 Fixed design online isotonic regression

6 Random permutation online isotonic regression

(4)

Motivation I – house pricing

(5)

Motivation I – house pricing

Den Bosch data set

[Scatter plot: price vs. area for the Den Bosch data set]

(6)

Motivation I – house pricing

Fitting linear function

[Scatter plot: price vs. area with a fitted linear function]

(7)

Motivation I – house pricing

Fitting isotonic function

[Scatter plot: price vs. area with a fitted isotonic function]

(8)

Motivation II – predicting good probabilities

Predictions of SVM classifier (german credit)

[Scatter plot: binary label vs. SVM score on the german credit data set]

(9)

Motivation II – predicting good probabilities

Fitting an isotonic function to the labels [Zadrozny & Elkan, 2002]

[Scatter plot: label/probability vs. score with the fitted isotonic function]

(10)

Motivation II – predicting good probabilities

Calibration plots (reliability curve)

[Plot: fraction of positives vs. mean predicted value]

Perfectly calibrated, Logistic (0.099), SVM (0.163), SVM + Isotonic (0.100)

(11)

Motivation II – predicting good probabilities

Calibration plots (reliability curve)

[Plot: fraction of positives vs. mean predicted value]

Perfectly calibrated, Logistic (0.099), Naive Bayes (0.118), Naive Bayes + Isotonic (0.098)

(12)

Outline

1 Motivation

2 Isotonic regression

3 Online learning

4 Online isotonic regression

5 Fixed design online isotonic regression

6 Random permutation online isotonic regression

(13)

Isotonic regression

Definition

Fit an isotonic (monotonically increasing) function to the data.

Extensively studied in statistics [Ayer et al., 55; Brunk, 55; Robertson et al., 98].

Numerous applications:

Biology, medicine, psychology, etc.

Multicriteria decision support.

Hypothesis tests under order constraints.

Multidimensional scaling.

(14)
(15)

Isotonic regression

Definition

Given data {(x_t, y_t)}_{t=1}^T ⊂ R × R, find an isotonic (nondecreasing) function f*: R → R which minimizes the squared error over the labels:

$$\min_{f} \sum_{t=1}^{T} \big(y_t - f(x_t)\big)^2, \quad \text{subject to: } x_t \geq x_q \implies f(x_t) \geq f(x_q), \quad q, t \in \{1, \ldots, T\}.$$

The optimal solution f* is called the isotonic regression function. Only the values f(x_t), t = 1, . . . , T, matter.
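A minimal sketch of this fit in Python (my own addition, assuming scikit-learn is available; the toy data are invented for illustration):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical toy data: x is the feature, y the (noisy) label.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.1, 0.3, 0.2, 0.6, 0.5, 0.9])

# Fit the isotonic (nondecreasing) regression function f*.
iso = IsotonicRegression(increasing=True)
f_star = iso.fit_transform(x, y)  # values f*(x_t), t = 1, ..., T

print(f_star)  # nondecreasing, minimizes the squared error over the labels
```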

(16)

Isotonic regression example

(17)

Properties of isotonic regression

Depends on the instances x only through their order relation.

Only defined at the points {x_1, . . . , x_T}; often extended to R by linear interpolation.

Piecewise constant (splits the data into level sets).

Self-averaging property: the value of f* on a given level set equals the average of the labels in that level set. For any value v:

$$v = \frac{1}{|S_v|} \sum_{t \in S_v} y_t, \quad \text{where } S_v = \{t : f^*(x_t) = v\}.$$

(18)

Isotonic regression gives calibrated probabilities

Definition

Let y ∈ {0, 1}. A probability estimator p̂ of y is calibrated if

$$E[y \mid \hat{p} = v] = v.$$

Fact

For binary labels, the isotonic regression f* is a calibrated probability estimator on the data set.

Proof: Let S_v = {t : f*(x_t) = v}. By self-averaging:

$$E[y \mid f^*(x) = v] = \frac{1}{|S_v|} \sum_{t \in S_v} y_t = v.$$
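A quick numeric check of this fact, sketched in Python with scikit-learn on synthetic binary data (my own illustration): within each level set of the fitted function, the fitted value equals the fraction of positive labels.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(size=200))
y = (rng.uniform(size=200) < x).astype(float)  # synthetic binary labels

f_star = IsotonicRegression().fit_transform(x, y)

# Self-averaging / calibration: on each level set, f* equals the label average.
for v in np.unique(f_star):
    level_set = (f_star == v)
    assert np.isclose(v, y[level_set].mean())
print("isotonic regression is calibrated on the data set")
```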


(20)

Pool Adjacent Violators Algorithm (PAVA)

Iteratively merge data points into blocks until no violators of the isotonic constraints remain.

The value assigned to each block is the average of the labels in this block.

The final assignment to blocks corresponds to the level sets of the isotonic regression.
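A minimal PAVA sketch in Python (not the authors' code), assuming the points are already sorted by x; the optional weights allow starting from pre-merged blocks, e.g. for tied x values.

```python
def pava(y, w=None):
    """Pool Adjacent Violators: isotonic fit of values y with weights w.

    y : labels, already ordered by increasing x.
    w : block weights (sizes); defaults to 1 per point.
    Returns the fitted value for every input position.
    """
    if w is None:
        w = [1.0] * len(y)
    # Each block is [value, weight, count of points it covers].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, n2 = blocks.pop()
            v1, w1, n1 = blocks.pop()
            merged = (w1 * v1 + w2 * v2) / (w1 + w2)  # weighted average
            blocks.append([merged, w1 + w2, n1 + n2])
    # Expand block values back to per-point fitted values.
    fit = []
    for v, _, n in blocks:
        fit.extend([v] * n)
    return fit
```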

(21)

PAVA: example

x:  7  −1  −2   9   2   0   6   3  −3   5  −3   7  −5
y:  1 0.4 0.2 0.7 0.7 0.6 0.8 0.2 0.3 0.6 0.4   1   0

[Scatter plot of the data points (x, y)]

(22)

PAVA: example

Step 1: Sort the data in increasing order of x.

x:  7  −1  −2   9   2   0   6   3  −3   5  −3   7  −5
y:  1 0.4 0.2 0.7 0.7 0.6 0.8 0.2 0.3 0.6 0.4   1   0

⇓

x: −5  −3  −3  −2  −1   0   2   3   5   6   7   7   9
y:  0 0.4 0.3 0.2 0.4 0.6 0.7 0.2 0.6 0.8   1   1 0.7

(23)

PAVA: example

Step 2: Split the data into blocks B_1, . . . , B_r such that points with the same x_t fall into the same block. Assign to each block a value f_i (i = 1, . . . , r), the average of the labels in that block.

x: −5  −3  −3  −2  −1   0   2   3   5   6   7   7   9
y:  0 0.4 0.3 0.2 0.4 0.6 0.7 0.2 0.6 0.8   1   1 0.7

⇓

block:  B1    B2   B3   B4   B5   B6   B7   B8   B9   B10      B11
data:  {1} {2,3}  {4}  {5}  {6}  {7}  {8}  {9} {10}  {11,12}  {13}
f_i:     0  0.35  0.2  0.4  0.6  0.7  0.2  0.6  0.8        1   0.7

(24)

PAVA: example

Step 3: While there exists a violator, i.e. a pair of blocks B_i, B_{i+1} such that f_i > f_{i+1}: merge B_i and B_{i+1} and assign the weighted average

$$f_i = \frac{|B_i| f_i + |B_{i+1}| f_{i+1}}{|B_i| + |B_{i+1}|}.$$

block:  B1    B2   B3   B4   B5   B6   B7   B8   B9   B10      B11
data:  {1} {2,3}  {4}  {5}  {6}  {7}  {8}  {9} {10}  {11,12}  {13}
f_i:     0  0.35  0.2  0.4  0.6  0.7  0.2  0.6  0.8        1   0.7

⇓ (merge B2 and B3)

block:  B1      B2   B3   B4   B5   B6   B7   B8   B9       B10
data:  {1} {2,3,4}  {5}  {6}  {7}  {8}  {9} {10}  {11,12}  {13}
f_i:     0     0.3  0.4  0.6  0.7  0.2  0.6  0.8        1   0.7

(25)

PAVA: example

Step 3 (continued): merge the next violating pair.

block:  B1      B2   B3   B4   B5   B6   B7   B8   B9       B10
data:  {1} {2,3,4}  {5}  {6}  {7}  {8}  {9} {10}  {11,12}  {13}
f_i:     0     0.3  0.4  0.6  0.7  0.2  0.6  0.8        1   0.7

⇓ (merge B5 and B6)

block:  B1      B2   B3   B4     B5   B6   B7       B8   B9
data:  {1} {2,3,4}  {5}  {6}  {7,8}  {9} {10}  {11,12}  {13}
f_i:     0     0.3  0.4  0.6   0.45  0.6  0.8        1   0.7

(26)

PAVA: example

Step 3 (continued):

block:  B1      B2   B3   B4     B5   B6   B7       B8   B9
data:  {1} {2,3,4}  {5}  {6}  {7,8}  {9} {10}  {11,12}  {13}
f_i:     0     0.3  0.4  0.6   0.45  0.6  0.8        1   0.7

⇓ (merge B4 and B5)

block:  B1      B2   B3       B4   B5   B6       B7   B8
data:  {1} {2,3,4}  {5}  {6,7,8}  {9} {10}  {11,12}  {13}
f_i:     0     0.3  0.4      0.5  0.6  0.8        1   0.7

(27)

PAVA: example

Step 3 (continued):

block:  B1      B2   B3       B4   B5   B6       B7   B8
data:  {1} {2,3,4}  {5}  {6,7,8}  {9} {10}  {11,12}  {13}
f_i:     0     0.3  0.4      0.5  0.6  0.8        1   0.7

⇓ (merge B7 and B8)

block:  B1      B2   B3       B4   B5   B6          B7
data:  {1} {2,3,4}  {5}  {6,7,8}  {9} {10}  {11,12,13}
f_i:     0     0.3  0.4      0.5  0.6  0.8         0.9

(28)

PAVA: example

Reading out the solution.

block:  B1      B2   B3       B4   B5   B6          B7
data:  {1} {2,3,4}  {5}  {6,7,8}  {9} {10}  {11,12,13}
f_i:     0     0.3  0.4      0.5  0.6  0.8         0.9

⇓

x:  −5  −3  −3  −2  −1   0    2    3    5    6    7    7    9
y:   0 0.4 0.3 0.2 0.4 0.6  0.7  0.2  0.6  0.8    1    1  0.7
f*:  0 0.3 0.3 0.3 0.4 0.5  0.5  0.5  0.6  0.8  0.9  0.9  0.9

(29)

PAVA: example

x:  −5  −3  −3  −2  −1   0    2    3    5    6    7    7    9
y:   0 0.4 0.3 0.2 0.4 0.6  0.7  0.2  0.6  0.8    1    1  0.7
f*:  0 0.3 0.3 0.3 0.4 0.5  0.5  0.5  0.6  0.8  0.9  0.9  0.9

[Plot of the data points and the fitted step function f*]
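For illustration, running the pava sketch from above on the sorted labels of this example reproduces the f* row (up to rounding):

```python
# Labels sorted by x (the tied points at x = -3 and x = 7 end up in
# common blocks anyway on this data).
y_sorted = [0, 0.4, 0.3, 0.2, 0.4, 0.6, 0.7, 0.2, 0.6, 0.8, 1, 1, 0.7]

f_star = pava(y_sorted)
print([round(v, 2) for v in f_star])
# [0, 0.3, 0.3, 0.3, 0.4, 0.5, 0.5, 0.5, 0.6, 0.8, 0.9, 0.9, 0.9]
```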

(30)

Generalized isotonic regression

Definition

Given data {(x_t, y_t)}_{t=1}^T ⊂ R × R, find an isotonic f*: R → R which minimizes

$$\min_{\text{isotonic } f} \sum_{t=1}^{T} \Delta\big(y_t, f(x_t)\big).$$

The squared loss (y_t − f(x_t))² is replaced with a general loss ∆(y_t, f(x_t)).

Theorem [Robertson et al., 1998]

All loss functions of the form

$$\Delta(y, z) = \Psi(y) - \Psi(z) - \Psi'(z)(y - z)$$

for some strictly convex Ψ result in the same isotonic regression function f*.


(32)

Generalized isotonic regression – examples

$$\Delta(y, z) = \Psi(y) - \Psi(z) - \Psi'(z)(y - z)$$

Squared function Ψ(y) = y²:
∆(y, z) = y² − z² − 2z(y − z) = (y − z)² (squared loss).

Entropy Ψ(y) = −y log y − (1 − y) log(1 − y), y ∈ [0, 1]:
∆(y, z) = −y log z − (1 − y) log(1 − z) (cross-entropy).

Negative logarithm Ψ(y) = −log y, y > 0:
∆(y, z) = y/z − log(y/z) − 1.
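A small sketch of the Bregman-divergence construction (my own illustration, not from the slides), checking the squared-function and negative-logarithm cases numerically:

```python
import math

def bregman(psi, dpsi):
    """Return the loss ∆(y, z) = Ψ(y) − Ψ(z) − Ψ'(z)(y − z)."""
    return lambda y, z: psi(y) - psi(z) - dpsi(z) * (y - z)

# Squared function Ψ(y) = y² gives the squared loss (y − z)².
sq = bregman(lambda y: y * y, lambda z: 2 * z)
assert math.isclose(sq(0.9, 0.4), (0.9 - 0.4) ** 2)

# Negative logarithm Ψ(y) = −log y gives y/z − log(y/z) − 1.
nl = bregman(lambda y: -math.log(y), lambda z: -1.0 / z)
assert math.isclose(nl(0.9, 0.4), 0.9 / 0.4 - math.log(0.9 / 0.4) - 1)
```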

(33)

Outline

1 Motivation

2 Isotonic regression

3 Online learning

4 Online isotonic regression

5 Fixed design online isotonic regression

6 Random permutation online isotonic regression

(34)

Online learning framework

A theoretical framework for the analysis of online algorithms.

The learning process is by its very nature incremental.

Avoids stochastic (e.g., i.i.d.) assumptions on the data sequence; designs algorithms which work well for any data.

Meaningful performance guarantees based on observed quantities: regret bounds.

(35)

Online learning framework

[Protocol diagram: at each round t, the learner (strategy f_t: X → Y) receives a new instance (x_t, ?), predicts ŷ_t = f_t(x_t), obtains the feedback y_t, suffers loss ℓ(y_t, ŷ_t), and moves on to round t + 1]

(36)

Online learning framework

Set of strategies (actions) F; known loss function ℓ. The learner starts with some initial strategy (action) f_1.

For t = 1, 2, . . .:

1 The learner observes instance x_t.

2 The learner predicts with ŷ_t = f_t(x_t).

3 The environment reveals the outcome y_t.

4 The learner suffers loss ℓ(y_t, ŷ_t).
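A generic skeleton of this protocol in Python (my own illustration); `learner` and `environment` are hypothetical objects standing in for the abstract notions above.

```python
def run_online_protocol(learner, environment, T, loss):
    """Play T rounds of the online learning protocol and return the losses."""
    losses = []
    for t in range(T):
        x_t = environment.next_instance()      # 1. observe instance
        y_hat = learner.predict(x_t)           # 2. predict
        y_t = environment.reveal_label()       # 3. outcome revealed
        losses.append(loss(y_t, y_hat))        # 4. suffer loss
        learner.update(x_t, y_t)               # prepare strategy f_{t+1}
    return losses
```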

(37)

Online learning framework

The goal of the learner is to be close to the best f in hindsight.

Cumulative loss of the learner:

$$\hat{L}_T = \sum_{t=1}^{T} \ell(y_t, \hat{y}_t).$$

Cumulative loss of the best strategy f in hindsight:

$$L_T = \min_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(y_t, f(x_t)).$$

Regret of the learner:

$$\text{regret}_T = \hat{L}_T - L_T.$$

(38)

Outline

1 Motivation

2 Isotonic regression

3 Online learning

4 Online isotonic regression

5 Fixed design online isotonic regression

6 Random permutation online isotonic regression

(39)

Online isotonic regression

[Figure: points x_1, . . . , x_8 with labels in [0, 1]; e.g. at the queried point x_5 the learner predicts ŷ_5, the label y_5 is revealed, and the loss (ŷ_5 − y_5)² is suffered; similarly at x_1]


(50)

Online isotonic regression

The protocol

Given: x1 < x2 < . . . < xT.

At trial t = 1, . . . , T :

The environment chooses a yet unlabeled point x_{i_t}.

The learner predicts ŷ_{i_t} ∈ [0, 1].

The environment reveals the label y_{i_t} ∈ [0, 1].

The learner suffers the squared loss (y_{i_t} − ŷ_{i_t})².

Strategies = isotonic functions:

$$\mathcal{F} = \{f : f(x_1) \leq f(x_2) \leq \ldots \leq f(x_T)\}, \qquad \text{regret}_T = \sum_{t=1}^{T} \big(y_{i_t} - \hat{y}_{i_t}\big)^2 - \min_{f \in \mathcal{F}} \sum_{t=1}^{T} \big(y_{i_t} - f(x_{i_t})\big)^2.$$
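A sketch of this game as code (my own illustration, not from the paper); `predictor` is a hypothetical callable mapping the points, the labels revealed so far, and the queried index to a prediction in [0, 1].

```python
import random

def play_online_isotonic(xs, ys, predictor, seed=0):
    """Play the online isotonic regression game on points xs with labels ys.

    The environment reveals the (x, y) pairs in a random order here;
    an adversarial order would simply use a different permutation.
    """
    rng = random.Random(seed)
    order = list(range(len(xs)))
    rng.shuffle(order)

    labeled = {}            # index -> revealed label
    total_loss = 0.0
    for i in order:
        y_hat = predictor(xs, labeled, i)     # prediction for x_i in [0, 1]
        total_loss += (ys[i] - y_hat) ** 2    # squared loss
        labeled[i] = ys[i]                    # label revealed
    return total_loss
```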


(53)

Online isotonic regression

$$\mathcal{F} = \{f : f(x_1) \leq f(x_2) \leq \ldots \leq f(x_T)\}, \qquad \text{regret}_T = \sum_{t=1}^{T} \big(y_{i_t} - \hat{y}_{i_t}\big)^2 - \min_{f \in \mathcal{F}} \sum_{t=1}^{T} \big(y_{i_t} - f(x_{i_t})\big)^2.$$

The cumulative loss of the learner should not be much larger than the loss of the (optimal) isotonic regression function in hindsight.
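For illustration, a hedged sketch of how this regret could be computed after the game, assuming scikit-learn is available; predictions[i] is whatever the learner predicted when point i was queried.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def regret(xs, ys, predictions):
    """Regret of a sequence of online predictions against the best
    isotonic function in hindsight (squared loss)."""
    xs, ys, predictions = map(np.asarray, (xs, ys, predictions))
    online_loss = np.sum((ys - predictions) ** 2)
    offline_fit = IsotonicRegression().fit_transform(xs, ys)
    offline_loss = np.sum((ys - offline_fit) ** 2)
    return online_loss - offline_loss
```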

(54)

The adversary is too powerful!

Every algorithm will have Ω(T ) regret

[Figure: at each revealed point x_t the adversary picks the binary label y_t far from the prediction ŷ_t, so the learner suffers loss ≥ 1/4 per round]


(68)

Outline

1 Motivation

2 Isotonic regression

3 Online learning

4 Online isotonic regression

5 Fixed design online isotonic regression

6 Random permutation online isotonic regression

(69)

Fixed design

The data x_1, . . . , x_T are known in advance to the learner.

We will show that in this model, efficient online algorithms exist.

K., Koolen, Malek: Online Isotonic Regression. Proc. of Conference on Learning Theory (COLT), pp. 1165–1189, 2016.

(70)

Off-the-shelf online algorithms

Algorithm                     General bound          Bound for online IR
Stochastic Gradient Descent   G₂D₂√T                 T
Exponentiated Gradient        GD₁√(T log d)          √(T log T)
Follow the Leader             G₂²D₂² d log T         T² log T
Exponential Weights           d log T                T log T

(71)

Exponential Weights (Bayes) with uniform prior

Let f = (f1, . . . , fT) denote values of f at (x1, . . . , xT).

π(f ) = const, for all f : f1 ≤ . . . ≤ fT,

$$P(f \mid y_{i_1}, \ldots, y_{i_t}) \propto \pi(f)\, e^{-\frac{1}{2}\,\mathrm{loss}_{1 \ldots t}(f)}, \qquad \hat{y}_{i_{t+1}} = \underbrace{\int f_{i_{t+1}}\, P(f \mid y_{i_1}, \ldots, y_{i_t})\, df}_{\text{posterior mean}}.$$

(72)

Exponential Weights with uniform prior does not learn

[Plot: prior mean]

(73)

Exponential Weights with uniform prior does not learn

[Plot: posterior mean after t = 10 observations, compared with the prior mean]

(74)

Exponential Weights with uniform prior does not learn

[Plot: posterior mean after t = 20 observations, compared with the prior mean]

(75)

Exponential Weights with uniform prior does not learn

[Plot: posterior mean after t = 50 observations, compared with the prior mean]

(76)

Exponential Weights with uniform prior does not learn

[Plot: posterior mean after t = 100 observations, compared with the prior mean]

(77)

The algorithm

Exponential Weights on a covering net

$$\mathcal{F}_K = \Big\{ f : f_t = \frac{k_t}{K},\; k_t \in \{0, 1, \ldots, K\},\; f_1 \leq \ldots \leq f_T \Big\}, \qquad \pi(f) \text{ uniform on } \mathcal{F}_K.$$

Efficient implementation by dynamic programming: O(Kt) at trial t.
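The dynamic program is not spelled out on the slides; below is my own forward–backward sketch in Python of the posterior-mean computation over F_K for the squared loss (with the factor 1/2 as in the posterior above), written for clarity rather than numerical robustness (use log-space weights for large T).

```python
import math

def ew_posterior_mean(T, K, labeled, query):
    """Exponential Weights (Bayes) on the covering net F_K, fixed design.

    T       : number of design points x_1 < ... < x_T (all known in advance)
    K       : grid resolution; functions take values k/K, k = 0, ..., K
    labeled : dict position -> observed label in [0, 1]
    query   : position i_{t+1} to predict at
    Returns the posterior mean of f at the queried position.
    """
    # Per-position weight of putting value k/K there (1 if still unlabeled).
    def w(j, k):
        if j in labeled:
            return math.exp(-0.5 * (labeled[j] - k / K) ** 2)
        return 1.0

    # Forward pass: F[j][k] = total weight of monotone prefixes ending at value k.
    F = [[0.0] * (K + 1) for _ in range(T)]
    for k in range(K + 1):
        F[0][k] = w(0, k)
    for j in range(1, T):
        prefix = 0.0
        for k in range(K + 1):
            prefix += F[j - 1][k]          # sum over k' <= k
            F[j][k] = w(j, k) * prefix

    # Backward pass: B[j][k] = total weight of monotone suffixes starting at value k.
    B = [[0.0] * (K + 1) for _ in range(T)]
    for k in range(K + 1):
        B[T - 1][k] = w(T - 1, k)
    for j in range(T - 2, -1, -1):
        suffix = 0.0
        for k in range(K, -1, -1):
            suffix += B[j + 1][k]          # sum over k' >= k
            B[j][k] = w(j, k) * suffix

    # Posterior marginal of f_query = k/K, up to normalization.
    marg = [F[query][k] * B[query][k] / w(query, k) for k in range(K + 1)]
    z = sum(marg)
    return sum((k / K) * marg[k] for k in range(K + 1)) / z
```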

(78)

Covering net

A finite set of isotonic functions on a discrete grid of y values.

[Figure: grid of values 0, 0.1, . . . , 1 over the points x_1, . . . , x_12, with an isotonic function restricted to the grid]



(86)

Performance of the algorithm

Regret bound

When K = Θ(T^{1/3} log^{-1/3}(T)),

Regret = O(T^{1/3} log^{2/3}(T)).

Matching lower bound Ω(T^{1/3}) (up to a log factor).

Proof idea

$$\text{Regret} = \underbrace{\text{Loss(alg)} - \min_{f \in \mathcal{F}_K} \text{Loss}(f)}_{= 2\log|\mathcal{F}_K| = O(K \log T)} \;+\; \underbrace{\min_{f \in \mathcal{F}_K} \text{Loss}(f) - \min_{\text{isotonic } f} \text{Loss}(f)}_{= O(T/K^2)}$$

(87)

Performance of the algorithm

[Plot: prior mean]

(88)

Performance of the algorithm

[Plot: posterior mean after t = 10 observations, compared with the prior mean]

(89)

Performance of the algorithm

[Plot: posterior mean after t = 20 observations, compared with the prior mean]

(90)

Performance of the algorithm

[Plot: posterior mean after t = 50 observations, compared with the prior mean]

(91)

Performance of the algorithm

[Plot: posterior mean after t = 100 observations, compared with the prior mean]

(92)

Other loss functions

Cross-entropy loss

ℓ(y, ŷ) = −y log ŷ − (1 − y) log(1 − ŷ)

The same bound O(T^{1/3} log^{2/3}(T)). The covering net F_K is obtained by a non-uniform discretization.

Absolute loss

ℓ(y, ŷ) = |y − ŷ|

O(√(T log T)), obtained by Exponentiated Gradient. Matching lower bound Ω(√T) (up to a log factor).


(94)

Outline

1 Motivation

2 Isotonic regression

3 Online learning

4 Online isotonic regression

5 Fixed design online isotonic regression

6 Random permutation online isotonic regression

(95)

Random permutation model

A more realistic scenario for generating x_1, . . . , x_T which allows the data to be unknown in advance.

The data are chosen adversarially before the game begins, but are then presented to the learner in a random order.

Motivation: the data-gathering process is independent of the underlying data-generation mechanism.

Still a very weak assumption.

Evaluation: regret averaged over all permutations of the data: E_σ[regret_T].

K., Koolen, Malek: Random Permutation Online Isotonic Regression. Submitted, 2017.


(97)

Leave-one-out loss

Definition

Given t labeled points {(x_i, y_i)}_{i=1}^t, for i = 1, . . . , t:

Take out the i-th point and give the remaining t − 1 points to the learner as training data.

The learner predicts ŷ_i on x_i and receives loss ℓ(y_i, ŷ_i).

Evaluate the learner by

$$\ell\mathrm{oo}_t = \frac{1}{t} \sum_{i=1}^{t} \ell(y_i, \hat{y}_i).$$

No sequential structure in the definition.

Theorem


(99)

Fixed design to random permutation conversion

Any algorithm for fixed design can be used in the random permutation setup by re-running it from scratch in each trial. We have shown that:

$$\ell\mathrm{oo}_t \leq \frac{1}{t}\, E_\sigma[\text{fixed-design-regret}_t].$$

We thus get an optimal algorithm (Exponential Weights on a grid) with Õ(t^{−2/3}) leave-one-out loss "for free", but it is complicated.
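A sketch (with hypothetical helper names, my own illustration) of estimating the leave-one-out loss ℓoo_t of an arbitrary predictor:

```python
def leave_one_out_loss(xs, ys, predictor, loss):
    """Average leave-one-out loss of `predictor` on t labeled points.

    predictor(train_x, train_y, x) returns a prediction for x given the
    remaining t - 1 training points; loss(y, y_hat) is e.g. squared loss.
    """
    t = len(xs)
    total = 0.0
    for i in range(t):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        y_hat = predictor(train_x, train_y, xs[i])
        total += loss(ys[i], y_hat)
    return total / t
```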

(100)

Follow the Leader (FTL) algorithm

Definition

Given the past t − 1 data, compute the optimal (loss-minimizing) function f* and predict on a new instance x according to f*(x).

FTL is undefined for isotonic regression:

x:      −3   −1    0    2   3
y:       0  0.2       0.7   1
f*(x):   0  0.2   ??  0.7   1


(103)

Forward Algorithm (FA)

Definition

Given the past t − 1 data and a new instance x, take any guess y0 ∈ [0, 1] of the new label and predict according to the optimal function f* on the past data including the new point (x, y0).

x:      −3   −1       0     2   3
y:       0  0.2  y0 = 1   0.7   1
f*(x):   0  0.2    0.85  0.85   1

Various popular prediction algorithms for IR fall into this framework (including linear interpolation [Zadrozny & Elkan, 2002] and many others [Vovk et al., 2015]).
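A sketch of a forward algorithm with the guess-1 strategy, reusing the hypothetical pava helper sketched earlier; points are assumed distinct.

```python
def forward_predict(train_x, train_y, x_new, guess=1.0):
    """Forward algorithm: insert (x_new, guess), run isotonic regression
    on the augmented data, and return the fitted value at x_new."""
    # Sort the augmented data by x.
    pairs = sorted(zip(list(train_x) + [x_new], list(train_y) + [guess]))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    fit = pava(ys)                      # isotonic fit on the augmented data
    return fit[xs.index(x_new)]

# Example from the slide: past data and a new instance at x = 0 with guess 1.
print(forward_predict([-3, -1, 2, 3], [0, 0.2, 0.7, 1], 0.0))  # ≈ 0.85
```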


(107)

Foward Algorithm (FA)

Two extreme FAs: guess-1 and guess-0, denoted f*_1 and f*_0. The prediction of any FA always lies between them: f*_0(x) ≤ f*(x) ≤ f*_1(x).

[Figure: labeled points x_1, . . . , x_8 with the two extreme fits f*_1 and f*_0 bracketing the prediction at the new point x_4]


(111)

Performance of FA

Theorem

For the squared loss, every forward algorithm has

$$\ell\mathrm{oo}_t = O\!\left(\sqrt{\frac{\log t}{t}}\right).$$

The bound is suboptimal, but only a factor of O(t^{1/6}) off. For the cross-entropy loss, the same bound holds, but a more careful choice of the guess must be made.
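A toy experiment (illustrative only, reusing the hypothetical sketches above) that estimates the leave-one-out loss of the guess-1 forward algorithm on random data:

```python
import random

rng = random.Random(1)
t = 200
xs = sorted(rng.random() for _ in range(t))
ys = [min(1.0, max(0.0, x + rng.gauss(0, 0.1))) for x in xs]

squared = lambda y, y_hat: (y - y_hat) ** 2
print(leave_one_out_loss(xs, ys, forward_predict, squared))
```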

(112)

Outline

1 Motivation

2 Isotonic regression

3 Online learning

4 Online isotonic regression

5 Fixed design online isotonic regression

6 Random permutation online isotonic regression

(113)

Conclusions

Two models for online isotonic regression: fixed design and

random permutation.

Optimal algorithm in both models: Exponential Weights (Bayes) on a grid.

In the random permutation model, a class of forward algorithms with good bounds on the leave-one-out loss.

Open problem:

(114)

Bibliography

Statistics

M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, 26(4):641–647, 1955.

H. D. Brunk. Maximum likelihood estimates of monotone parameters. Annals of Mathematical Statistics, 26(4):607–616, 1955.

J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.

R. E. Barlow and H. D. Brunk. The isotonic regression problem and its dual. Journal of the American Statistical Association, 67:140–147, 1972.

T. Robertson, F. T. Wright, and R. L. Dykstra. Order Restricted Statistical Inference. John Wiley & Sons, 1998.

Sara Van de Geer. Estimating a regression function. Annals of Statistics, 18:907–924, 1990.

Cun-Hui Zhang. Risk bounds in isotonic regression. The Annals of Statistics, 30(2):528–555, 2002.

Jan de Leeuw, Kurt Hornik, and Patrick Mair. Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software, 32:1–24, 2009.

(115)

Bibliography

Machine Learning

Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, pages 694–699, 2002.

Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In ICML, volume 119, pages 625–632. ACM, 2005.

Tom Fawcett and Alexandru Niculescu-Mizil. PAV and the ROC convex hull. Machine Learning, 68(1):97–106, 2007.

Vladimir Vovk, Ivan Petej, and Valentina Fedorova. Large-scale probabilistic predictors with and without guarantees of validity. In NIPS, pages 892–900, 2015.

Aditya Krishna Menon, Xiaoqian Jiang, Shankar Vembu, Charles Elkan, and Lucila Ohno-Machado. Predicting accurate probabilities with a ranking loss. In ICML, 2012.

Rasmus Kyng, Anup Rao, and Sushant Sachdeva. Fast, provable algorithms for isotonic regression in all ℓp-norms. In NIPS, 2015.

Adam Tauman Kalai and Ravi Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.

T. Moon, A. Smola, Y. Chang, and Z. Zheng. IntervalRank: Isotonic regression with listwise and pairwise constraints. In WSDM, pages 151–160. ACM, 2010.

Sham M. Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In NIPS, pages 927–935, 2011.

(116)

Bibliography

Online isotonic regression

Alexander Rakhlin and Karthik Sridharan. Online nonparametric regression. In COLT, pages 1232–1264, 2014.

Pierre Gaillard and Sébastien Gerchinovitz. A chaining algorithm for online nonparametric regression. In COLT, pages 764–796, 2015.

Wojciech Kotłowski, Wouter M. Koolen, and Alan Malek. Online isotonic regression. In COLT, pages 1165–1189, 2016.

Wojciech Kotłowski, Wouter M. Koolen, and Alan Malek. Random permutation online isotonic regression. Submitted, 2017.
