ACTA UNIVERSITATIS LODZIENSIS
FOLIA OECONOMICA 194, 2005
Małgorzata Misztal*

ON THE APPLICATION OF CLASSIFICATION AND REGRESSION TREES IN MEDICAL DIAGNOSIS
Abstract
A decision tree is a graphical presentation of the recursive partitioning of the learning set into homogeneous subsets considering the dependent variable y.

If the dependent variable y is nominal we deal with nonparametric discriminant analysis (classification trees); when y is numerical, with nonparametric regression analysis (regression trees).

The aim of the paper is to present some applications of regression and classification trees in medical diagnosis for solving decision-making problems.

Key words: classification and regression trees, medical diagnosis.
I. INTRODUCTION
A decision tree can be described as a tree-like way of representing a collection of hierarchical rules that lead to a class or to a value.
Let us consider a learning set:

U = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)},  (1)

where x is the vector of independent variables, x = [x_1, x_2, ..., x_p]^T, and y is the response (dependent) variable.
The model building process is based on recursive partitioning of the learning set into homogeneous subsets U_1, U_2, ..., U_M considering the dependent variable y.

Tree-based models are simple, flexible and powerful tools for classification and regression analysis; they deal with different kinds of variables, including missing values, and are very easy to interpret.
II. MODEL BUILDING PROCESS
We consider an additive model:

y = a_0 + Σ_{m=1}^{M} a_m g_m(x, β),  (2)

where g_m(x, β) are functions of x with parameters β. An approximation of (2) can be written as:

ŷ = a_0 + Σ_{m=1}^{M} a_m I{x ∈ R_m},  (3)

where R_m (m = 1, 2, ..., M) are disjoint regions in the p-dimensional feature space, a_m are real parameters and I{A} is an indicator function:

I{A} = 1, if the proposition A inside the brackets is true; 0, otherwise.  (4)

For each real-valued variable x_r, with the region R_m characterized by its lower and upper boundaries x_r^(d) and x_r^(g), we have:

I{x ∈ R_m} = Π_{r=1}^{p} I{x_r^(d) ≤ x_r ≤ x_r^(g)}.  (5)

For each categorical variable x_r we have:

I{x ∈ R_m} = Π_{r=1}^{p} I{x_r ∈ B_mr},  (6)

where B_mr is a subset of the set of the variable values (see Gatnar, 2001).

If the dependent variable y is nominal we deal with nonparametric discriminant analysis (so we have classification trees); when y is numerical, with nonparametric regression analysis (regression trees).

In discriminant analysis the response variable y represents a class label, so that we want to predict the class of an object from the values of its predictor variables.

In regression analysis the response variable y is assumed to depend on the regressors x_1, x_2, ..., x_p through the relationship:

y = f(x_1, x_2, ..., x_p) + ε.  (7)

The goal of the regression analysis is to find an estimate ŷ of y that minimizes a certain loss function (e.g. squared error loss). The estimation of y can be done by recursively partitioning the data and the regressor space using techniques that yield a piecewise constant estimate of y:

f̂(x) = Σ_{m=1}^{M} θ_m I{x ∈ R_m},  (8)

so that for each x ∈ R_m:

f̂(x) = θ_m.  (9)
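To make the construction above concrete, here is a minimal Python sketch (our illustration, not code from the paper) of the piecewise constant estimate (8) built from the per-coordinate indicators (5)-(6); the regions, variable types and leaf values below are invented.

```python
def in_region(x, region):
    """I{x in R_m}: product of per-coordinate indicators, eqs (5)-(6).
    Each entry of `region` is either a (low, high) interval for a
    numerical coordinate or a set of admissible categories B_mr."""
    for xi, bounds in zip(x, region):
        if isinstance(bounds, set):          # categorical: x_r in B_mr
            if xi not in bounds:
                return 0
        else:                                # numerical: low <= x_r <= high
            low, high = bounds
            if not (low <= xi <= high):
                return 0
    return 1

def f_hat(x, regions, thetas):
    """Piecewise constant estimate, eq (8)."""
    return sum(t * in_region(x, r) for r, t in zip(regions, thetas))

# Two disjoint hypothetical regions over (age, priority of operation):
regions = [
    [(0, 59), {"elective"}],
    [(60, 120), {"elective"}],
]
thetas = [2.0, 4.5]          # hypothetical fitted leaf values
print(f_hat([45, "elective"], regions, thetas))   # falls into R_1 -> 2.0
```

Since the regions are disjoint, at most one indicator is non-zero, so the sum simply picks out the leaf value θ_m of the region containing x.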
The procedure of partitioning the sample space requires selecting appropriate variables.

The goal of the variable selection process is to choose those variables that yield the best partition. The quality of a partition of a learning set U is measured by the difference between the heterogeneity (in terms of y) of U and of the resulting subsets U_1, U_2, ..., U_M (see Gatnar, 2001). In other words, we have a reduction of impurity:
ΔI(U; x_r) = H(U) − Σ_{m=1}^{M} H(U_m) p(m),  (10)
where:
H(·) is a heterogeneity measure,
p(m) = N(m)/N is the proportion of objects in U_m,
N(m) is the number of objects in U_m; of course Σ_{m=1}^{M} N(m) = N.
If y is categorical we have, for example, the entropy function:

H(U_m) = −Σ_{i=1}^{k} p(i|m) log_2 p(i|m)  (11)
or the Gini index:

H(U_m) = 1 − Σ_{i=1}^{k} p²(i|m),  (12)

where p(i|m) = N_i(m)/N(m); N_i(m) is the number of objects from class i in U_m.
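The impurity reduction (10) with the entropy (11) and Gini (12) heterogeneity measures can be sketched as follows (illustrative Python with made-up class labels, not the paper's software):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Eq (11): -sum p(i|m) log2 p(i|m) over the classes in a node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Eq (12): 1 - sum p(i|m)^2 over the classes in a node."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_reduction(U, subsets, H):
    """Eq (10): H(U) - sum_m H(U_m) * p(m)."""
    N = len(U)
    return H(U) - sum(H(Um) * len(Um) / N for Um in subsets)

U = [1, 1, 1, 2, 2, 2]              # class labels in the parent node
split = [[1, 1, 1], [2, 2, 2]]      # a perfect candidate split
print(impurity_reduction(U, split, entropy))   # 1.0: all impurity removed
print(impurity_reduction(U, split, gini))      # 0.5
```

A split that separates the classes completely drives the child impurities to zero, so the reduction equals the parent impurity itself.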
In regression trees analysis the most frequently used heterogeneity measure is the variance:

H(U_m) = s²(U_m).  (13)

Using (13) we have:

ΔI(U; x_r) = s²(U) − Σ_{m=1}^{M} s²(U_m) p(m)  (14)

and

s²(U_m) = (1/N(m)) Σ_{x_i ∈ U_m} (y_i − ȳ_m)²,  (15)

where ȳ_m is the mean of y in U_m.
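The variance criterion (13)-(15) suggests a simple exhaustive search for the best cut point of a numerical regressor. The sketch below is illustrative only; the BSA and ICU-stay values are invented and are not the CABG data:

```python
def variance(ys):
    """Eq (15): mean squared deviation from the node mean."""
    n = len(ys)
    mean = sum(ys) / n
    return sum((y - mean) ** 2 for y in ys) / n

def best_split(xs, ys):
    """Return (cut, gain) maximizing the variance reduction, eq (14)."""
    n = len(ys)
    pairs = sorted(zip(xs, ys))
    best = (None, -1.0)
    for i in range(1, n):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = variance(ys) - (variance(left) * len(left) / n
                               + variance(right) * len(right) / n)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint cut point
        if gain > best[1]:
            best = (cut, gain)
    return best

# Hypothetical example: ICU stay is longer for BSA below about 1.75
bsa = [1.6, 1.65, 1.7, 1.8, 1.9, 2.0]
icu = [7.0, 6.0, 7.0, 2.0, 3.0, 2.0]
print(best_split(bsa, icu))   # cut near 1.75 with a large reduction
```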
An ideal goal of recursive partitioning is to find a decision tree of minimal size and maximal predictive accuracy. To evaluate the model accuracy a test set is required; if none is available, cross-validation is used.
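When no separate test set is available, the cross-validation estimate can be sketched as follows (pure Python; the "model" here is only a majority-class rule standing in for a tree, and the labels are hypothetical):

```python
from collections import Counter

def cv_misclassification(y, k=5):
    """Average held-out error of a majority-class rule over k folds."""
    n = len(y)
    errors = 0
    for j in range(k):
        held_out = [i for i in range(n) if i % k == j]   # fold j
        train = [i for i in range(n) if i % k != j]      # the rest
        majority = Counter(y[i] for i in train).most_common(1)[0][0]
        errors += sum(y[i] != majority for i in held_out)
    return errors / n

y = [1, 1, 1, 1, 2, 2, 1, 1, 2, 1]
print(cv_misclassification(y))   # 0.3: the three class-2 cases are missed
```

In a real tree analysis the majority rule would be replaced by a tree grown on the training folds.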
The right size of the tree can be chosen using pruning methods. The usual procedure is to choose optimal splits all the way down the tree, so that as an intermediary step one obtains a tree with as many leaves as observations. Then the tree is pruned back using a function that assigns a cost to complexity (see Breiman et al., 1984):
R_α(T) = R(T) + α · size(T),  (16)

where size(T) is some measure of the complexity of the tree, such as its size (= number of leaves), α is the cost-complexity parameter and R(T) is a risk function that penalizes bad prediction.
The optimal tree can be determined, for example, by the 1-SE rule proposed by Breiman et al. (1984): it is the smallest tree with error not more than one standard error greater than that of the lowest-error tree in the sequence.
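The 1-SE rule can be sketched directly from a pruning sequence (a hedged illustration; the tree sizes, cross-validated errors and standard errors below are hypothetical):

```python
def one_se_rule(sizes, cv_errors, cv_ses):
    """sizes[i] = number of leaves of the i-th tree in the pruning
    sequence; returns the size of the tree chosen by the 1-SE rule."""
    i_min = min(range(len(cv_errors)), key=cv_errors.__getitem__)
    threshold = cv_errors[i_min] + cv_ses[i_min]
    # smallest tree whose CV error is within one SE of the minimum
    candidates = [i for i in range(len(sizes)) if cv_errors[i] <= threshold]
    return sizes[min(candidates, key=sizes.__getitem__)]

sizes     = [12, 8, 5, 3, 1]
cv_errors = [0.45, 0.40, 0.41, 0.43, 0.60]
cv_ses    = [0.03, 0.03, 0.03, 0.03, 0.05]
print(one_se_rule(sizes, cv_errors, cv_ses))  # 3: 0.43 <= 0.40 + 0.03
```

The rule trades a small, statistically insignificant loss of accuracy for a much simpler and more interpretable tree.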
Tree-structured models have many practical applications. One of them is medical diagnosis, where the learning sample consists of case records containing a description of a patient's symptoms and the corresponding outcome.
III. APPLICATION: PREDICTION OF DURATION OF ICU STAY AFTER CABG
In the following example we analyse some data from the Department of Cardiothoracic Surgery of Łódź Medical University, where a set of 244 case records of patients undergoing CABG (Coronary Artery Bypass Grafting) during 1997-1998 was collected.

The dependent variable y is the duration of a patient's stay in the Intensive Care Unit (ICU), in days. For classification trees analysis the length of stay is categorized as follows: class 1, ICU stay 1-4 days; class 2, ICU stay 5 or more days. Deaths are excluded.
Predictor variables are the following:
1) sex;
2) age in years;
3) BMI, body mass index;
4) BSA, body surface area;
5) diabetes mellitus (yes/no);
6) chronic pulmonary diseases (yes/no);
7) AO, arterial obstruction (yes/no);
8) hyperthyroidism (on medication) (yes/no);
9) diagnosis (stable angina, unstable angina);
10) CCS, Canadian Coronary Score (class I, II, III, IV);
11) history of myocardial infarction;
12) previous cardiac surgery (yes/no);
13) left main stenosis > 75% (yes/no);
14) EF%, left ventricular ejection fraction in %;
15) AspAt, aspartate aminotransferase in U/L;
16) priority of operation (elective, urgent, emergent);
17) intraoperative myocardial infarction or low cardiac output syndrome (yes/no);
18) perfusion time in minutes;
19) aortic clamping time in minutes;
20) reperfusion time in minutes;
21) severity score based on preoperative risk factors (see Tab. 1).

The following algorithms are used:
1) CART (Classification and Regression Trees) by Breiman et al. (1984);
2) QUEST (Quick, Unbiased, Efficient Statistical Trees) by Loh and Shih (1997);
3) GUIDE (Generalized, Unbiased Interaction Detection and Estimation) by Loh (2002).
CART and QUEST are well-known algorithms, available for example in the STATISTICA PL package.

GUIDE is a new algorithm for building piecewise constant or piecewise linear regression models with univariate splits. It has four useful properties: (i) negligible selection bias, (ii) sensitivity to curvature and local pairwise interactions between regressor variables, (iii) inclusion of categorical predictor variables, including ordinal categorical variables, (iv) choice of three roles for each ordered predictor variable: split selection only, regression modelling only, or both. For more details see Loh (2002).
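As a rough illustration of what the recursive partitioning performed by such algorithms looks like, here is a toy pure-Python sketch of CART-style tree growing with Gini splits and majority-vote leaves. It is our simplification, not the CART, QUEST or GUIDE implementations used in the paper, and the (BSA, BMI) records are invented:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow(rows, labels, depth=0, max_depth=2, min_leaf=1):
    """rows: numeric feature vectors. Returns a nested dict tree."""
    if depth == max_depth or gini(labels) == 0:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    best, n = None, len(rows)
    for j in range(len(rows[0])):              # every variable...
        for cut in sorted({r[j] for r in rows}):   # ...every cut point
            left = [i for i in range(n) if rows[i][j] <= cut]
            right = [i for i in range(n) if rows[i][j] > cut]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = (gini([labels[i] for i in left]) * len(left)
                     + gini([labels[i] for i in right]) * len(right))
            if best is None or score < best[0]:
                best = (score, j, cut, left, right)
    if best is None:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, j, cut, left, right = best
    return {"var": j, "cut": cut,
            "left": grow([rows[i] for i in left], [labels[i] for i in left],
                         depth + 1, max_depth, min_leaf),
            "right": grow([rows[i] for i in right], [labels[i] for i in right],
                          depth + 1, max_depth, min_leaf)}

def predict(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["var"]] <= tree["cut"] else tree["right"]
    return tree["leaf"]

# Hypothetical (BSA, BMI) records with ICU-stay class 1 or 2:
X = [[1.6, 24], [1.7, 31], [1.9, 25], [2.0, 30], [1.65, 29], [1.95, 26]]
y = [2, 2, 1, 1, 2, 1]
tree = grow(X, y)
print([predict(tree, x) for x in X])   # recovers y exactly
```

On these toy data a single split on the first variable (BSA at 1.7) separates the classes, mirroring the role BSA plays in the real trees below.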
Table 1. Clinical Severity Scoring System

Risk Factors                                 | Score
---------------------------------------------|---------------------------
Left ventricular ejection fraction < 40%     | 3
Emergency case                               | 3
Age > 60                                     | 1 (+1 point per 5 years)
Hyperthyroidism (on medication)              | 2
Diabetes mellitus                            | 2
Previous cardiac surgery                     | 2
Chronic pulmonary diseases                   | 2
Unstable angina                              | 2
BSA < 1.75 m²                                | 2
AspAt > 40 U/L                               | 1
Creatinine level > 1.2 mg/dl                 | 1
Arterial obstruction                         | 1
Left main stenosis > 75%                     | 2
Unstable hemodynamic state                   | 4

Source: Elaborated by the Department of Cardiothoracic Surgery of Łódź Medical University and the Chair of Statistical Methods, University of Łódź.
The results of using the selected tree building methods are shown in Figures 1-4.
Figure 1 shows the piecewise constant GUIDE tree. The 0-SE tree has 8 terminal nodes. Five variables appear in the splits: priority of operation, body surface area, age, severity score and sex. The number in each terminal node is the sample mean of the ICU stay. The predicted mean squared error is 5.245.

The piecewise constant GUIDE tree from the quantile (median) regression model is shown in Figure 2. It is shorter than the least squares tree in Figure 1. The 0-SE tree has 3 terminal nodes. The number in each terminal node is the sample median of the ICU stay. The splitting variables are: priority of operation and body surface area. The predicted mean squared error equals 5.893.

Figures 3 and 4 show trees obtained using the classification trees algorithms (QUEST and CART respectively).
Figure 1. GUIDE regression tree, least squares regression, piecewise constant model. Source: Author's calculations

Figure 2. GUIDE regression tree, median regression, piecewise constant model. Source: Author's calculations

Figure 3. QUEST classification tree. Source: Author's calculations

Figure 4. CART classification tree. Source: Author's calculations
Both the CART and QUEST 0-SE trees have 4 terminal nodes. The trees are quite similar. Three predictor variables may be sufficient to predict the response: body surface area, body mass index and priority of operation. The cross-validation estimate of the misclassification error is 0.4303 for the QUEST tree and 0.3934 for the CART tree.
IV. CONCLUSIONS

Prediction of the length of stay in the Intensive Care Unit after cardiac surgery is not easy. Prolonged stay in the ICU increases the overall costs of cardiac surgery and may also limit the number of operations performed. Therefore, the ability to accurately predict the length of stay in the ICU seems to be very important (see Michalopoulos et al., 1996).
In order to predict the duration of the patient's stay in the ICU we propose to use tree-structured methods. Advantages of tree-based models are, among others:
1) no requirement of knowledge of the variable distribution;
2) dealing with different types of variables, both categorical and continuous, including missing values;
3) robustness to outliers;
4) direct and intuitive way of interpretation;
5) reduction of the cost of the research by selecting only some important variables for splitting nodes, so that each new object can be described by a few risk factors.
The interpretability of a tree structure decreases as its complexity increases, so in further research we will pay more attention to fitting piecewise linear models, which tend to possess better prediction accuracy.
REFERENCES
Breiman L., Friedman J., Olshen R., Stone C. (1984), Classification and Regression Trees, CRC Press, London.
Domański Cz., Pruska K., Wagner W. (1998), Wnioskowanie statystyczne przy nieklasycznych założeniach, Wyd. UŁ, Łódź.
Gatnar E. (2001), Nieparametryczna metoda dyskryminacji i regresji, PWN, Warszawa.
Loh W.-Y. (2002), Regression trees with unbiased variable selection and interaction detection, Statistica Sinica, 12, 361-386.
Loh W.-Y., Shih Y.-S. (1997), Split selection methods for classification trees, Statistica Sinica, 7, 815-840.
Michalopoulos A., Tzelepis G., Pavlides G., Kriars J., Dafni U., Geroulanos S. (1996), Determinants of duration of ICU stay after coronary artery bypass graft surgery, British
ON THE APPLICATION OF CLASSIFICATION AND REGRESSION TREES IN MEDICAL DIAGNOSIS

Summary

A decision tree is a graphical presentation of the recursive partitioning method. The method consists of the stepwise division of a set of objects into disjoint subsets until they become homogeneous with respect to a distinguished variable y.

When y is a nominal variable, we deal with nonparametric discriminant analysis (classification trees); when it is a quantitative variable, with nonparametric regression analysis (regression trees).

The paper presents possible applications of regression and classification trees to solving decision-making problems in medical diagnosis.