
ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 194, 2005

Eugeniusz Gatnar*

GRADIENT BOOSTING IN REGRESSION

Abstract

The successful tree-based methodology has one serious disadvantage: lack of stability. That is, a regression tree model depends on the training set, and even a small change in a predictor value can lead to a quite different model. In order to solve this problem, single trees are combined into one model. There are three aggregation methods used in classification: bootstrap aggregation (bagging), adaptive resample and combine (boosting), and adaptive bagging (a hybrid bagging-boosting procedure).

In the field of regression a variant of boosting, i.e. gradient boosting, can be used. Friedman (1999) proved that boosting is equivalent to a stepwise function approximation in which, at each step, a regression tree models the residuals of the previous step's model.

Key words: tree-based models, regression, boosting.

I. INTRODUCTION

The goal of regression is to find a function F*(x) that maps x to y:

F^*(x): x \to y, \qquad (1)

and minimises the expected value of a specified loss function L(y, F(x)) over the joint distribution of all (x, y) values:

F^*(x) = \arg\min_{F(x)} E_{y,x} L(y, F(x)), \qquad (2)

given a sample (called the "training set"):

* Professor, Institute of Statistics, University of Economics in Katowice.


(y_1, x_1), (y_2, x_2), \ldots, (y_N, x_N). \qquad (3)

The most frequently used loss function for measuring errors between y and F(x) is the squared error:

L(y, F(x)) = (y - F(x))^2. \qquad (4)

In this paper we consider F(x) having an additive form:

F(x) = \sum_{m=0}^{M} \beta_m f_m(x, a_m), \qquad (5)

where f_m(x, a_m) is a simple function of x with parameters a_m (called a "base learner"), for example the linear function:

f_m(x, a_m) = \sum_{j=1}^{M} a_{jm} x_j. \qquad (6)

When the base learner (6) is a tree, the parameters a are the splitting variables, the split locations and the mean values of y in the regions R_k.

II. REGRESSION TREES

The tree corresponds to an additive model of the form:

f(x, a) = \sum_{k=1}^{K} a_k I(x \in R_k), \qquad (7)

where R_k are hyper-rectangular disjoint regions in the M-dimensional feature space, a_k denote real parameters and I is an indicator function (Gatnar, 2001).

Each real-valued dimension of the region R_k is characterised by its lower and upper boundary, v_{km}^{L} and v_{km}^{U} respectively. Therefore the region induces a product of M indicator functions:

I(x \in R_k) = \prod_{m=1}^{M} I(v_{km}^{L} \leq x_m < v_{km}^{U}). \qquad (8)


For categorical dimensions, where the region is defined by a subset of the variable's values, the indicator becomes:

I(x \in R_k) = \prod_{m=1}^{M} I(x_m \in B_{km}), \qquad (9)

where B_{km} is a subset of the set of values of the m-th variable.

The parameter estimation formula depends on how the homogeneity of the region R_k is measured. In the simplest case, when variance is used, the best estimate is the mean of all y values in R_k:

\hat{a}_k = \frac{1}{N(k)} \sum_{x_i \in R_k} y_i, \qquad (10)

where N(k) is the number of objects from the training set belonging to the region R_k. Tree-based regression models are represented by step functions (Fig. 1).

[Figure 1. Example of a step function (axis label: NOX)]

Because their lack of smoothness can sometimes be a disadvantage, Friedman (1991) proposed the use of splines (in the MARS procedure) to solve this problem.
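To make the representation (7)-(10) concrete, the following minimal Python sketch (an illustration added here, not part of the original paper) stores a fitted tree as a list of rectangular regions with per-region means and evaluates the resulting step function; the region boundaries and the one-dimensional data are hypothetical.

import numpy as np

def region_means(X, y, regions):
    # Estimate a_k as the mean of y over the training points in R_k, eq. (10).
    means = []
    for bounds in regions:
        mask = np.ones(len(X), dtype=bool)
        for m, (lo, hi) in enumerate(bounds):
            mask &= (X[:, m] >= lo) & (X[:, m] < hi)
        means.append(y[mask].mean() if mask.any() else y.mean())
    return np.array(means)

def predict(X, regions, means):
    # Evaluate f(x, a) = sum_k a_k * I(x in R_k), eq. (7).
    preds = np.zeros(len(X))
    for bounds, a_k in zip(regions, means):
        mask = np.ones(len(X), dtype=bool)
        for m, (lo, hi) in enumerate(bounds):
            mask &= (X[:, m] >= lo) & (X[:, m] < hi)
        preds += a_k * mask                       # indicator I(x in R_k)
    return preds

# Hypothetical one-dimensional example: two regions split at x = 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 1))
y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + rng.normal(0.0, 0.1, size=100)
regions = [[(0.0, 0.5)], [(0.5, 1.0)]]            # R_1 and R_2
means = region_means(X, y, regions)
print(predict(X[:5], regions, means))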

III. BOOSTING

The successful tree-based methodology has one undesirable feature: lack of stability. That is, a regression tree model depends on the training set, and even a small change in a predictor value can lead to a quite different model.

To solve this problem in the field of classification, single trees are combined into one model and then averaged. Three aggregation methods have been developed so far:


1) bootstrap aggregation (bagging), developed by Breiman (1996),

2) adaptive resample and combine (boosting), proposed by Freund and Schapire (1996),

3) adaptive bagging, proposed by Breiman (1999).

Boosting is seen as the most successful and powerful idea in statistical learning (Hastie et al., 2001). It was developed by Freund and Schapire (1996), originally for classification problems, to produce the most accurate model as a committee of many "weak" classifiers.

Given a set of training data (3) and a classifier f_m(x, a) producing values from the set {-1, +1}, the AdaBoost algorithm trains the classifier on a modified training sample, giving higher weights to cases that are currently misclassified. This is repeated for a sequence of weighted samples, and the result is a linear combination of the classifiers from each stage.

The algorithm works as follows:

1. Start with equal weights for each case:

w_i = \frac{1}{N}, \qquad i = 1, \ldots, N. \qquad (11)

2. Repeat for m = 1 to M:

a) fit the classifier f_m(x, a) to the training data using weights w_i,

b) compute the classification error:

err_m = \frac{\sum_{i=1}^{N} w_i I(y_i \neq f_m(x_i, a))}{\sum_{i=1}^{N} w_i}, \qquad (12)

c) compute the classifier weight:

\beta_m = \log\frac{1 - err_m}{err_m}, \qquad (13)

d) update the case weights:

w_i \leftarrow w_i \exp\bigl(\beta_m I(y_i \neq f_m(x_i, a))\bigr), \qquad i = 1, \ldots, N. \qquad (14)

3. The final classifier is:

F(x) = \operatorname{sgn}\left( \sum_{m=1}^{M} \beta_m f_m(x, a) \right). \qquad (15)

In step 2d) the cases misclassified by f_m(x, a) have their weights increased before the next classifier f_{m+1}(x, a) is fitted.
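A minimal Python sketch of the AdaBoost scheme (11)-(15) may help fix the notation. It is an illustration added here, not the author's code: the weak classifier f_m(x, a) is assumed to be a depth-one tree ("stump") from scikit-learn, and the function names are arbitrary.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    # y is assumed to take values in {-1, +1}.
    N = len(y)
    w = np.full(N, 1.0 / N)                      # (11) equal starting weights
    stumps, betas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # 2a) weighted fit of f_m(x, a)
        miss = stump.predict(X) != y
        err = np.sum(w * miss) / np.sum(w)       # (12) weighted classification error
        err = np.clip(err, 1e-10, 1.0 - 1e-10)   # guard against err = 0 or 1
        beta = np.log((1.0 - err) / err)         # (13) classifier weight
        w = w * np.exp(beta * miss)              # (14) up-weight misclassified cases
        stumps.append(stump)
        betas.append(beta)
    return stumps, betas

def adaboost_predict(X, stumps, betas):
    # (15) final classifier: sign of the weighted committee vote.
    votes = sum(b * s.predict(X) for s, b in zip(stumps, betas))
    return np.sign(votes)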


IV. GRADIENT BOOSTING

Friedman (1999) developed a variant of boosting, i.e. "gradient boosting" of trees, which produces highly robust models, especially appropriate for imperfect data. He proved that boosting is equivalent to forward stepwise modelling, that is, sequentially adding new functions to the expansion:

F(x) = f_0(x) + \beta_1 f_1(x) + \beta_2 f_2(x) + \ldots \qquad (16)

Using the steepest-descent method from numerical minimisation, the negative gradient:

g_m(x) = E_y\left[ \frac{\partial L(y, F(x))}{\partial F(x)} \,\middle|\, x \right]_{F(x) = F_{m-1}(x)} \qquad (17)

defines the "steepest-descent" direction:

f_m(x) = -\lambda_m g_m(x), \qquad (18)

and:

F_{m-1}(x) = \sum_{i=0}^{m-1} f_i(x). \qquad (19)

The weights \lambda_m in (18) are estimated as:

\lambda_m = \arg\min_{\lambda} E_{y,x} L\bigl(y, F_{m-1}(x) + \lambda f_m(x)\bigr), \qquad (20)

and the approximation is updated:

F_m(x) = F_{m-1}(x) + \lambda_m f_m(x). \qquad (21)

For the squared error loss function (4), or its minor modification:

L(y, F(x)) = \frac{1}{2}\,(y - F(x))^2, \qquad (22)

the negative gradient is just the residual:

-g_m(x_i) = y_i - F_{m-1}(x_i). \qquad (23)


Gradient boosting is thus a stepwise function approximation in which each step models the residuals of the previous step's model.
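The claim can be checked directly by differentiating the loss (22) with respect to F(x); a short worked step:

% derivative of the halved squared-error loss (22)
\frac{\partial L(y, F(x))}{\partial F(x)}
  = \frac{\partial}{\partial F(x)} \left[ \tfrac{1}{2} \bigl( y - F(x) \bigr)^2 \right]
  = -\bigl( y - F(x) \bigr),
\qquad \text{hence} \qquad
-g_m(x_i) = y_i - F_{m-1}(x_i).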

If the base learner f_m(x, a) is a regression tree (7), then the boosted tree is induced according to the following procedure:

1. Initialise:

F_0(x) = \bar{y}. \qquad (24)

2. For m = 1 to M:

a) repeat for each i = 1, \ldots, N:

u_i = y_i - F_{m-1}(x_i). \qquad (25)

b) grow a regression tree for the residuals u_i, finding homogeneous regions R_{jm},

c) for j = 1, \ldots, J_m compute:

a_{jm} = \arg\min_{a} \sum_{x_i \in R_{jm}} \bigl( y_i - (F_{m-1}(x_i) + a) \bigr)^2, \qquad (26)

d) modify:

F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J_m} a_{jm} I(x \in R_{jm}). \qquad (27)

3. The final model:

F^*(x) = F_M(x). \qquad (28)

The tree is grown to group observations into homogeneous subsets. Once we have the subsets, the update quantities for each subset are computed in a separate step.
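A minimal Python sketch of the procedure (24)-(28), assuming scikit-learn regression trees as the base learner. This is an illustration added here, not the original MART implementation; the function names and parameter values are arbitrary.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_trees(X, y, M=100, max_leaf_nodes=4):
    F0 = float(np.mean(y))                       # (24) F_0(x) = mean of y
    F = np.full(len(y), F0)
    trees = []
    for _ in range(M):
        u = y - F                                # (25) current residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, u)                           # (26)-(27): each leaf stores the mean
        F = F + tree.predict(X)                  # residual in its region R_jm
        trees.append(tree)
    return F0, trees

def boost_predict(X, F0, trees):
    # (28) final model F_M(x).
    return F0 + sum(tree.predict(X) for tree in trees)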

V. EXAMPLE

Consider the Boston Housing data set (Harrison and Rubinfeld, 1978). The data consist of 14 variables measured for each of 506 census tracts in the Boston area. The dependent variable is MV, the median of neighborhood home values, and the independent variables include CRIM (crime rate),


RM (average number of rooms), LSTAT (percentage of lower-status population), etc.

The average value of MV is $22.533 thousand. We start the model F_0(x) with the mean (24) and construct the residuals. The residuals are then modelled with a two-node tree, and the tree separates positive from negative residuals.

Then we update the model, obtain new residuals and repeat the process (e.g. twice). The estimated function consists of three parts and is shown in Figure 2.

[Figure 2. Boosting tree for Boston data: MV = 22.5 plus two-node corrections based on the splits LSTAT < 14.3 and RM < 6.8 (leaf values +0.4, -8.4, +13.7)]

As we can see (Fig. 2), only three independent variables were selected into the model¹: RM, LSTAT and CRIM.

The resulting boosting tree is a better model, in terms of goodness-of-fit, than other regression models. In Table 1 we present a comparison of the values of R² for two data sets: Boston Housing and California Housing².

¹ We used the MART system implemented in the S-Plus environment.

² The California Housing data set is available from the StatLib repository. It was analysed by Pace and Barry (1997) and consists of data from 20,460 neighborhoods (1990 census block groups) in California.


Table 1. Comparison of R² for two data sets

Data set              Single regression tree    Gradient boosting tree
Boston Housing                 0.67                      0.84
California Housing             0.70                      0.86
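For readers who wish to reproduce a comparison of this kind today, the following sketch uses scikit-learn and its built-in California Housing data. It is not the original MART/S-Plus analysis; the model settings are arbitrary and the resulting R² values will differ somewhat from Table 1.

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single_tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X_train, y_train)
boosted = GradientBoostingRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)

# R^2 on the held-out part, roughly comparable to the columns of Table 1.
print("single regression tree R^2:", round(single_tree.score(X_test, y_test), 2))
print("gradient boosting R^2:     ", round(boosted.score(X_test, y_test), 2))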

VI. CONCLUSIONS

There are several advantages to using the method of gradient boosting in nonparametric regression. Boosted regression trees can cope with outliers, are invariant to monotone transformations of variables, and can handle missing values. They also automatically select variables into the model and perform regression very fast.

The regression model obtained in the form of a boosting tree is also extremely easy to interpret and to use for prediction.

REFERENCES

Breiman L. (1996), Bagging predictors, Machine Learning, 24, 123-140.

Breiman L. (1999), Using adaptive bagging to debias regressions, Technical Report 547, Statistics Department, University of California, Berkeley.

Breiman L., Friedman J., Olshen R., Stone C. (1984), Classification and Regression Trees, Wadsworth, Belmont, CA.

Freund Y., Schapire R.E. (1997), A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences, 55, 119-139.

Friedman J.H. (1991), Multivariate adaptive regression splines, Annals of Statistics, 19, 1-141.

Friedman J.H. (1999), Greedy Function Approximation: a Gradient Boosting Machine, Statistics Department, Stanford University, Stanford.

Gatnar E. (2001), Nieparametryczna metoda dyskryminacji i regresji (Nonparametric method for discrimination and regression; in Polish), PWN, Warszawa.

Harrison D., Rubinfeld D.L. (1978), Hedonic prices and the demand for clean air, Journal of Environmental Economics and Management, 8, 81-102.

Hastie T., Tibshirani R., Friedman J. (2001), The Elements of Statistical Learning, Springer, New York.

Pace R.K., Barry R. (1997), Sparse spatial autoregressions, Statistics and Probability Letters, 33, 291-297.


Eugeniusz Gatnar

THE GRADIENT VARIANT OF BOOSTING IN REGRESSION ANALYSIS

Summary

The nonparametric methods based on so-called regression trees, widely used in practice, have one serious drawback. They are unstable, which means that a small change in the feature values of objects in the training set can lead to a completely different model. Obviously, this negatively affects their predictive accuracy. This drawback can, however, be eliminated by aggregating several individual models into one.

Three methods of model aggregation are known, and all of them rely on sampling objects from the training set with replacement into successive training samples: bootstrap aggregation (bagging), adaptive resampling (boosting), and a hybrid method combining elements of both.

In regression analysis, the gradient, sequential variant of boosting is particularly worth applying. In essence, it uses regression trees in successive steps to model the residuals of the model obtained in the previous step.
