
ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 194, 2005

Eugeniusz Gatnar*

GRADIENT BOOSTING IN REGRESSION

Abstract

The successful tree-based methodology has one serious disadvantage: lack of stability. That is, a regression tree model depends on the training set, and even a small change in a predictor value can lead to a quite different model. In order to solve this problem, single trees are combined into one model. There are three aggregation methods used in classification: bootstrap aggregation (bagging), adaptive resample and combine (boosting), and adaptive bagging (a hybrid bagging-boosting procedure).

In the field of regression a variant of boosting, i.e. gradient boosting, can be used. Friedman (1999) proved that boosting is equivalent to a stepwise function approximation in which, in each step, a regression tree models the residuals from the previous step's model.

Key words: tree-based models, regression, boosting.

I. INTRODUCTION

The goal of a regression is to find a function F*(x) that maps x to y:

$F^*(\mathbf{x}): \mathbf{x} \to y,$ (1)

and minimises the expected value of a specified loss function $L(y, F(\mathbf{x}))$ over the joint distribution of all $(\mathbf{x}, y)$ values:

$F^*(\mathbf{x}) = \arg\min_{F(\mathbf{x})} E_{y,\mathbf{x}} L(y, F(\mathbf{x})),$ (2)

given a sample (called the "training set"):

* Professor, Institute of Statistics, University of Economics in Katowice.


$(y_1, \mathbf{x}_1), (y_2, \mathbf{x}_2), \ldots, (y_N, \mathbf{x}_N).$ (3)

The most frequently used loss function for measuring errors between y and F(x) is the squared error:

$L(y, F(\mathbf{x})) = (y - F(\mathbf{x}))^2.$ (4)

In this paper we consider F(x) having an additive form:

$F(\mathbf{x}) = \sum_{m=0}^{M} \beta_m f_m(\mathbf{x}, \mathbf{a}_m),$ (5)

where $f_m(\mathbf{x}, \mathbf{a})$ is a simple function of x with parameters a (called a "base learner"), for example the linear function:

$f_m(\mathbf{x}, \mathbf{a}_m) = \sum_{j=1}^{M} a_{jm} x_j.$ (6)

When the base learner (6) is a tree, the parameters a are the splitting variables, split locations and mean values of y in the regions R_k.

II. REGRESSION TREES

The tree corresponds to an additive model of the form:

$f(\mathbf{x}, \mathbf{a}) = \sum_{k=1}^{K} a_k I(\mathbf{x} \in R_k),$ (7)

where R_k are hyper-rectangular disjoint regions in the M-dimensional feature space, a_k denotes real parameters and I is an indicator function (Gatnar, 2001).

Each real-valued dimension of the region R_k is characterised by its upper and lower boundaries, $v_{km}^{(u)}$ and $v_{km}^{(l)}$ respectively. Therefore the region induces a product of M indicator functions:

$I(\mathbf{x} \in R_k) = \prod_{m=1}^{M} I\left(v_{km}^{(l)} \leq x_m < v_{km}^{(u)}\right),$ (8)


More generally, when each dimension of R_k is described by a subset of the variable's values, the indicator function takes the form:

$I(\mathbf{x} \in R_k) = \prod_{m=1}^{M} I(x_m \in B_{km}),$ (9)

where B_{km} is a subset of the set of the variable's values.

The parameter estimation formula depends on how the homogeneity of the region R_k is measured. In the simplest case, when variance is used, the best estimate is the mean of all y values in R_k:

$\hat{a}_k = \frac{1}{N(k)} \sum_{\mathbf{x}_i \in R_k} y_i,$ (10)

where N(k) is the number of objects from the training set belonging to region R_k. Tree-based regression models are represented by step functions (Fig. 1).

Figure 1. Example of a step function (axis label: NOX)

Because their lack of smoothness can sometimes be a disadvantage, Friedman (1991) proposed the use of splines (in the MARS procedure) to solve this problem.
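To make the construction concrete, the following minimal Python sketch (an illustration added here, not part of the original paper's software) shows the simplest case of (7)-(10): a single split producing K = 2 regions whose estimates are the region means, so the fitted predictor is a step function of the kind shown in Fig. 1. The data and function names are hypothetical.

    import numpy as np

    def fit_one_split(x, y):
        # Simplest case of (7)-(10): one split, K = 2 regions, region means as estimates.
        best = None
        for s in np.unique(x)[1:]:                    # candidate split locations
            left, right = y[x < s], y[x >= s]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, s, left.mean(), right.mean())
        return best[1:]                               # split point and the two region means (eq. 10)

    def predict_step(x, split, a_left, a_right):
        # Eq. (7) with two indicator regions: a step function as in Fig. 1.
        return np.where(x < split, a_left, a_right)

    # synthetic illustration (hypothetical data, not the Boston NOX variable)
    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 10.0, 200)
    y = np.where(x < 5.0, 1.0, 3.0) + rng.normal(0.0, 0.3, 200)
    split, a_l, a_r = fit_one_split(x, y)
    print(split, a_l, a_r)                            # approximately 5, 1 and 3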

III. BOOSTING

The successful tree-based methodology has one undesirable feature: lack of stability. That is, a regression tree model depends on the training set, and even a small change in a predictor value can lead to a quite different model.

To solve this problem in the field of classification, single trees are combined into one model and then averaged. There are three aggregation methods developed so far:


1) bootstrap aggregation (bagging), developed by Breiman (1996),
2) adaptive resample and combine (boosting), proposed by Freund and Schapire (1996),

3) adaptive bagging, proposed by Breiman (1999).

Boosting is seen as the most successful and powerful idea in statistical learning (Hastie et al., 2001). It was developed by Freund and Schapire (1996), originally for classification problems, to produce the most accurate model as a committee of many "weak" classifiers.

Given a set of training data (3) and a classifier f_m(x, a) producing values from the set {-1, +1}, the AdaBoost algorithm trains the classifier on modified training samples, giving higher weights to cases that are currently misclassified. This is repeated for a sequence of weighted samples, and the result is a linear combination of the classifiers from each stage.

The algorithm works as follows:

1. Start with equal weights for each case:

$w_i = \frac{1}{N}, \quad i = 1, \ldots, N.$ (11)

2. Repeat for m = 1 to M:

a) fit the classifier f_m(x, a) to the training data using weights w_i,
b) compute the classification error:

$err_m = \frac{\sum_{i=1}^{N} w_i I(y_i \neq f_m(\mathbf{x}_i, \mathbf{a}))}{\sum_{i=1}^{N} w_i},$ (12)

c) compute the classifier weight:

$\beta_m = \log \frac{1 - err_m}{err_m},$ (13)

d) set new weights for the cases:

$w_i \leftarrow w_i \cdot \exp\left[\beta_m I(y_i \neq f_m(\mathbf{x}_i, \mathbf{a}))\right], \quad i = 1, \ldots, N.$ (14)

3. The final classifier is:

$F(\mathbf{x}) = \mathrm{sgn}\left[\sum_{m=1}^{M} \beta_m f_m(\mathbf{x}, \mathbf{a})\right].$ (15)

In step 2d) the cases misclassified by f_m(x, a) have their weights increased, so they have a greater influence on the next classifier f_{m+1}(x, a).
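For illustration, a minimal Python sketch of the loop (11)-(15) follows. The decision-stump weak learner, the scikit-learn calls, and the explicit form of the classifier weight in (13), taken here as the version given by Hastie et al. (2001), are assumptions of this sketch rather than details prescribed by the text.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, M=50):
        # Steps (11)-(15); y must take values in {-1, +1}.
        N = len(y)
        w = np.full(N, 1.0 / N)                      # (11) equal starting weights
        learners, betas = [], []
        for m in range(M):
            f = DecisionTreeClassifier(max_depth=1)  # decision stump as the "weak" classifier (assumption)
            f.fit(X, y, sample_weight=w)
            miss = (f.predict(X) != y).astype(float)
            err = np.sum(w * miss) / np.sum(w)       # (12) weighted classification error
            err = np.clip(err, 1e-10, 1.0 - 1e-10)   # guard against log(0)
            beta = np.log((1.0 - err) / err)         # (13) classifier weight (Hastie et al. form)
            w = w * np.exp(beta * miss)              # (14) increase weights of misclassified cases
            w = w / w.sum()
            learners.append(f)
            betas.append(beta)
        return learners, betas

    def committee_predict(learners, betas, X):
        # (15) sign of the weighted committee vote.
        votes = sum(b * f.predict(X) for f, b in zip(learners, betas))
        return np.sign(votes)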


IV. GRADIENT BOOSTING

Friedman (1999) developed a variant of boosting, i.e. "gradient boosting" of trees, which produces highly robust models, especially appropriate for imperfect data. He proved that boosting is equivalent to forward stepwise modelling, that is, sequentially adding new functions to the expansion:

$F(\mathbf{x}) = f_0(\mathbf{x}) + \beta_1 f_1(\mathbf{x}) + \beta_2 f_2(\mathbf{x}) + \ldots$ (16)

Using the steepest-descent method from numerical minimisation, the negative gradient:

$g_m(\mathbf{x}) = \left[\frac{\partial E_y[L(y, F(\mathbf{x})) \mid \mathbf{x}]}{\partial F(\mathbf{x})}\right]_{F(\mathbf{x}) = F_{m-1}(\mathbf{x})}$ (17)

defines the "steepest-descent" direction:

$f_m(\mathbf{x}) = -\lambda_m g_m(\mathbf{x}),$ (18)

and:

$F_{m-1}(\mathbf{x}) = \sum_{i=0}^{m-1} f_i(\mathbf{x}).$ (19)

The weights $\lambda_m$ in (18) are estimated as:

$\lambda_m = \arg\min_{\lambda} E_{y,\mathbf{x}} L(y, F_{m-1}(\mathbf{x}) + \lambda f_m(\mathbf{x})),$ (20)

and the approximation is updated:

$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \lambda_m f_m(\mathbf{x}).$ (21)

For the squared error loss function (4), or its minor modification:

$L(y, F(\mathbf{x})) = \frac{1}{2}(y - F(\mathbf{x}))^2,$ (22)

the negative gradient is just the residual:

$-g_m(\mathbf{x}_i) = y_i - F_{m-1}(\mathbf{x}_i).$ (23)

Gradient boosting is thus a stepwise function approximation in which each step models the residuals from the previous step's model.
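To make the link explicit, a short derivation (added here for clarity; it follows directly from (22)) shows why the negative gradient is the ordinary residual:

$-\frac{\partial L(y, F(\mathbf{x}))}{\partial F(\mathbf{x})} = -\frac{\partial}{\partial F(\mathbf{x})} \frac{1}{2}\bigl(y - F(\mathbf{x})\bigr)^2 = y - F(\mathbf{x}),$

so evaluating at $F(\mathbf{x}) = F_{m-1}(\mathbf{x})$ gives the residual $y - F_{m-1}(\mathbf{x})$ used in (23) and (25) below.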

If the base learner f_m(x, a) is a regression tree (7), then the boosted tree is induced according to the following procedure:

1. Initialise:

$F_0(\mathbf{x}) = \bar{y}.$ (24)

2. For m = 1 to M:

a) repeat for each i = 1, ..., N:

$u_i = y_i - F_{m-1}(\mathbf{x}_i).$ (25)

b) grow a regression tree for the residuals u_i, finding homogeneous regions R_{jm},

c) for j = 1, ..., J_m compute:

$a_{jm} = \arg\min_{a} \sum_{\mathbf{x}_i \in R_{jm}} \left(y_i - (F_{m-1}(\mathbf{x}_i) + a)\right)^2.$ (26)

d) modify:

$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \sum_{j=1}^{J_m} a_{jm} I(\mathbf{x} \in R_{jm}).$ (27)

3. The final model:

$F^*(\mathbf{x}) = F_M(\mathbf{x}).$ (28)

The tree is grown to group observations into homogeneous subsets. Once we have the subsets, the update quantity for each subset is computed in a separate step.
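A minimal Python sketch of the procedure (24)-(28) follows; it is an illustrative reimplementation, not the MART/S-Plus code used in the paper. scikit-learn's DecisionTreeRegressor is assumed as the way to find the homogeneous regions R_jm, and since for squared-error loss the optimal constant a_jm in (26) is simply the mean residual in R_jm, which is exactly the leaf value the tree fits, the tree's predictions implement the update (27) directly.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boost_regression_trees(X, y, M=100, max_depth=2):
        # Steps (24)-(28) for squared-error loss.
        F0 = y.mean()                                 # (24) initialise with the mean of y
        Fm = np.full(len(y), F0)
        trees = []
        for m in range(M):
            u = y - Fm                                # (25) residuals of the current model
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X, u)                            # grow a tree on the residuals: regions R_jm
            Fm = Fm + tree.predict(X)                 # (26)-(27) add the per-region constants a_jm
            trees.append(tree)
        return F0, trees

    def boosted_predict(F0, trees, X):
        # (28) the final model F*(x) = F_M(x).
        return F0 + sum(tree.predict(X) for tree in trees)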

V. EXAMPLE

Consider the Boston Housing data set (Harrison and Rubinfeld, 1978). The data consist of 14 variables measured for each of 506 census tracts in the Boston area. The dependent variable is MV, the median neighborhood home value, and the independent variables are: CRIM - crime rate,


RM - average number of rooms, LSTAT - percent of lower-status population, etc.

The average value of MV is $22,533. We start the model F_0(x) with the mean (24) and compute the residuals. The residuals are modelled with a two-node tree, and the tree separates positive from negative residuals.

Then we update the model, obtain new residuals and repeat the process (e.g. twice). The estimated function consists of three parts and is shown in Figure 2.

Figure 2. Boosting tree for Boston data (MV = 22.5 plus corrections from splits on LSTAT < 14.3 and RM < 6.8, with leaf values +0.4, -8.4 and +13.7)

As we can see (Fig. 2), only three independent variables were selected into the model¹: RM, LSTAT and CRIM.

The resulting boosting tree is a better model, in terms of goodness-of-fit, than other regression models. In Table 1 we present a comparison of the values of R² for two data sets: Boston Housing and California Housing².

¹ We used the MART system implemented in the S-Plus environment.

² The California Housing data set is available from the StatLib repository. It was analysed by Pace and Barry (1997) and consists of data from 20,640 neighborhoods (1990 census block groups) in California.


Table 1. Comparison of R² for two data sets

Data set              Single regression tree    Gradient boosting tree
Boston Housing        0.67                      0.84
California Housing    0.70                      0.86
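For readers who want to reproduce a comparison of this kind, a hedged sketch using scikit-learn on the California Housing data is given below. The original results were obtained with the MART system in S-Plus, and the Boston Housing data is not bundled with current scikit-learn releases, so the hyperparameters here are illustrative guesses and the resulting R² values will not match Table 1 exactly.

    from sklearn.datasets import fetch_california_housing
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    # California Housing data (Pace and Barry, 1997); downloaded on first use
    X, y = fetch_california_housing(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # single regression tree vs. gradient boosting tree; hyperparameters are
    # illustrative choices, not those of the MART/S-Plus run reported in the paper
    single = DecisionTreeRegressor(max_depth=6).fit(X_train, y_train)
    boosted = GradientBoostingRegressor(n_estimators=500, max_depth=3,
                                        learning_rate=0.1).fit(X_train, y_train)

    print("single tree R^2:  ", r2_score(y_test, single.predict(X_test)))
    print("boosted trees R^2:", r2_score(y_test, boosted.predict(X_test)))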

VI. CONCLUSIONS

There are several advantages of using the method of gradient boosting in nonparametric regression. Boosted regression trees can cope with outliers, are invariant to monotone transformations of the variables, and can handle missing values. They also automatically select variables into the model and perform regression very fast.

The regression model obtained in the form of a boosting tree is also extremely easy to interpret and to use for prediction.

REFERENCES

Breiman L. (1996), Bagging predictors, Machine Learning, 24, 123-140.

Breiman L. (1999), Using adaptive bagging to debias regressions, Technical Report 547, Statistics Department, University of California, Berkeley.

Breiman L., Friedman J., Olshen R., Stone C. (1984), Classification and Regression Trees, Wadsworth, Belmont, CA.

Freund Y., Schapire R.E. (1997), A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences, 55, 119-139.

Friedman J.H. (1991), Multivariate adaptive regression splines, Annals of Statistics, 19, 1-141.

Friedman J.H. (1999), Greedy Function Approximation: a Gradient Boosting Machine, Statistics Department, Stanford University, Stanford.

Gatnar E. (2001), Nieparametryczna metoda dyskryminacji i regresji (Nonparametric Method for Discrimination and Regression; in Polish), PWN, Warszawa.

Harrison D., Rubinfeld D.L. (1978), Hedonic prices and the demand for clean air, Journal of Environmental Economics and Management, 8, 81-102.

Hastie T., Tibshirani R., Friedman J. (2001), The Elements of Statistical Learning, Springer, New York.

Pace R.K., Barry R. (1997), Sparse spatial autoregressions, Statistics and Probability Letters, 33, 291-297.


Eugeniusz Gatnar

A GRADIENT VARIANT OF THE BOOSTING METHOD IN REGRESSION ANALYSIS

Summary

Nonparametric methods based on so-called regression trees, widely used in practice, have one serious drawback. They are unstable, which means that a small change in the feature values of objects in the training set may lead to a completely different model. This obviously has a negative effect on their predictive accuracy. This drawback can, however, be eliminated by aggregating several individual models into one.

Three model aggregation methods are known, and all of them rely on sampling objects with replacement from the training set into consecutive training samples: bootstrap aggregation (bagging), adaptive resampling (boosting), and a hybrid method combining elements of both.

In regression analysis it is particularly worthwhile to apply the gradient, sequential variant of boosting. In essence, it consists in using regression trees in consecutive steps to model the residuals of the model obtained in the previous step.
