ACTA UNIVERSITATIS LODZIENSIS
FOLIA OECONOMICA 194, 2005
Małgorzata Misztal*

ON THE APPLICATION OF CLASSIFICATION AND REGRESSION TREES IN MEDICAL DIAGNOSIS
Abstract
A decision tree is a graphical presentation of the recursive partitioning of the learning set into homogeneous subsets considering the dependent variable y.

If the dependent variable y is nominal we deal with nonparametric discriminant analysis (classification trees); when y is numerical, with nonparametric regression analysis (regression trees).

The aim of the paper is to present some applications of regression and classification trees in medical diagnosis for solving decision-making problems.

Key words: classification and regression trees, medical diagnosis.
I. INTRODUCTION
A decision tree can be described as a tree-like way of representing a collection of hierarchical rules that lead to a class or to a value.
Let us consider a learning set:

U = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)},  (1)

where x is the vector of independent variables, x = [x_1, x_2, ..., x_p]^T, and y is the response (dependent) variable.
The model building process is based on recursive partitioning of the learning set into homogeneous subsets U_1, U_2, ..., U_M considering the dependent variable y.

Tree-based models are simple, flexible and powerful tools for classification and regression analysis; they deal with different kinds of variables, including missing values, and are very easy to interpret.
II. MODEL BUILDING PROCESS
We consider an additive model:

y = a_0 + Σ_{m=1}^{M} a_m g_m(x, β),  (2)

where g_m(x, β) are functions of x with parameters β. An approximation of (2) can be written as:

ŷ = a_0 + Σ_{m=1}^{M} a_m I{x ∈ R_m},  (3)

where R_m (m = 1, 2, ..., M) are disjoint regions in the p-dimensional feature space, a_m are real parameters and I{A} is an indicator function:

I{A} = 1, if the proposition A inside the brackets is true; 0, otherwise.  (4)

For each real-valued variable x_r, with the region R_m characterized by its lower and upper boundaries x_r^(d) and x_r^(g), we have:

I{x ∈ R_m} = Π_{r=1}^{p} I{x_r^(d) ≤ x_r ≤ x_r^(g)}.  (5)

For each categorical variable x_r we have:

I{x ∈ R_m} = Π_{r=1}^{p} I{x_r ∈ B_mr},  (6)

where B_mr is a subset of the set of the variable values (see Gatnar, 2001).

If the dependent variable y is nominal we deal with nonparametric discriminant analysis (so we have classification trees); when y is numerical, with nonparametric regression analysis (regression trees).

In discriminant analysis the response variable y represents a class label, so that we want to predict the class of an object from the values of its predictor variables.

In regression analysis the response variable y is assumed to depend on the regressors x_1, x_2, ..., x_p through the relationship:

y = f(x_1, x_2, ..., x_p) + ε.  (7)

The goal of the regression analysis is to find an estimate ŷ of y that minimizes a certain loss function (e.g. squared error loss). The estimation of y can be done by recursively partitioning the data and the regressor space using techniques that yield a piecewise constant estimate of y:

f̂(x) = Σ_{m=1}^{M} θ_m I{x ∈ R_m},  (8)

so that for each x ∈ R_m:

f̂(x) = θ_m.  (9)
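To make the construction above concrete, here is a minimal Python sketch (our illustration, not code from the paper) of the piecewise constant estimate (8) built from the per-coordinate indicators (5)-(6); the regions, variable types and leaf values below are invented.

```python
def in_region(x, region):
    """I{x in R_m}: product of per-coordinate indicators, eqs (5)-(6).
    Each entry of `region` is either a (low, high) interval for a
    numerical coordinate or a set of admissible categories B_mr."""
    for xi, bounds in zip(x, region):
        if isinstance(bounds, set):          # categorical: x_r in B_mr
            if xi not in bounds:
                return 0
        else:                                # numerical: low <= x_r <= high
            low, high = bounds
            if not (low <= xi <= high):
                return 0
    return 1

def f_hat(x, regions, thetas):
    """Piecewise constant estimate, eq (8)."""
    return sum(t * in_region(x, r) for r, t in zip(regions, thetas))

# Two disjoint hypothetical regions over (age, priority of operation):
regions = [
    [(0, 59), {"elective"}],
    [(60, 120), {"elective"}],
]
thetas = [2.0, 4.5]          # hypothetical fitted leaf values
print(f_hat([45, "elective"], regions, thetas))   # falls into R_1 -> 2.0
```

Since the regions are disjoint, at most one indicator is non-zero, so the sum simply picks out the leaf value θ_m of the region containing x.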
The procedure of partitioning the sample space requires selecting appropriate variables.

The goal of the variable selection process is to choose those variables that yield the best partition. The quality of a partition of a learning set U is measured by the difference between the heterogeneity (in terms of y) of U and of the resulting subsets U_1, U_2, ..., U_M (see Gatnar, 2001). In other words, we have a reduction of impurity:
ΔI(U; x_r) = H(U) − Σ_{m=1}^{M} H(U_m) p(m),  (10)
where:
H(·) is a heterogeneity measure,
p(m) = N(m)/N is the proportion of objects in U_m,
N(m) is the number of objects in U_m; of course Σ_{m=1}^{M} N(m) = N.
If y is categorical we have, for example, the entropy function:

H(U_m) = −Σ_{i=1}^{k} p(i|m) log_2 p(i|m)  (11)
or the Gini index:

H(U_m) = 1 − Σ_{i=1}^{k} p²(i|m),  (12)

where p(i|m) = N_i(m)/N(m); N_i(m) is the number of objects from class i in U_m.
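The impurity reduction (10) with the entropy (11) and Gini (12) heterogeneity measures can be sketched as follows (illustrative Python with made-up class labels, not the paper's software):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Eq (11): -sum p(i|m) log2 p(i|m) over the classes in a node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Eq (12): 1 - sum p(i|m)^2 over the classes in a node."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_reduction(U, subsets, H):
    """Eq (10): H(U) - sum_m H(U_m) * p(m)."""
    N = len(U)
    return H(U) - sum(H(Um) * len(Um) / N for Um in subsets)

U = [1, 1, 1, 2, 2, 2]              # class labels in the parent node
split = [[1, 1, 1], [2, 2, 2]]      # a perfect candidate split
print(impurity_reduction(U, split, entropy))   # 1.0: all impurity removed
print(impurity_reduction(U, split, gini))      # 0.5
```

A split that separates the classes completely drives the child impurities to zero, so the reduction equals the parent impurity itself.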
In regression trees analysis the most frequently used heterogeneity measure is the variance:

H(U_m) = s²(U_m).  (13)

Using (13) we have:

ΔI(U; x_r) = s²(U) − Σ_{m=1}^{M} s²(U_m) p(m)  (14)

and

s²(U_m) = (1/N(m)) Σ_{x_i ∈ U_m} (y_i − ȳ_m)²,  (15)

where ȳ_m is the mean of y in U_m.
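The variance criterion (13)-(15) suggests a simple exhaustive search for the best cut point of a numerical regressor. The sketch below is illustrative only; the BSA and ICU-stay values are invented and are not the CABG data:

```python
def variance(ys):
    """Eq (15): mean squared deviation from the node mean."""
    n = len(ys)
    mean = sum(ys) / n
    return sum((y - mean) ** 2 for y in ys) / n

def best_split(xs, ys):
    """Return (cut, gain) maximizing the variance reduction, eq (14)."""
    n = len(ys)
    pairs = sorted(zip(xs, ys))
    best = (None, -1.0)
    for i in range(1, n):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = variance(ys) - (variance(left) * len(left) / n
                               + variance(right) * len(right) / n)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint cut point
        if gain > best[1]:
            best = (cut, gain)
    return best

# Hypothetical example: ICU stay is longer for BSA below about 1.75
bsa = [1.6, 1.65, 1.7, 1.8, 1.9, 2.0]
icu = [7.0, 6.0, 7.0, 2.0, 3.0, 2.0]
print(best_split(bsa, icu))   # cut near 1.75 with a large reduction
```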
An ideal goal of recursive partitioning is to find a decision tree of minimal size and maximal predictive accuracy. To evaluate the model accuracy a test set is required; if none is available, cross-validation is used.
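When no separate test set is available, the cross-validation estimate can be sketched as follows (pure Python; the "model" here is only a majority-class rule standing in for a tree, and the labels are hypothetical):

```python
from collections import Counter

def cv_misclassification(y, k=5):
    """Average held-out error of a majority-class rule over k folds."""
    n = len(y)
    errors = 0
    for j in range(k):
        held_out = [i for i in range(n) if i % k == j]   # fold j
        train = [i for i in range(n) if i % k != j]      # the rest
        majority = Counter(y[i] for i in train).most_common(1)[0][0]
        errors += sum(y[i] != majority for i in held_out)
    return errors / n

y = [1, 1, 1, 1, 2, 2, 1, 1, 2, 1]
print(cv_misclassification(y))   # 0.3: the three class-2 cases are missed
```

In a real tree analysis the majority rule would be replaced by a tree grown on the training folds.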
The right size of the tree can be chosen using pruning methods. The usual procedure is to choose optimal splits all the way down the tree, so that as an intermediary step one obtains a tree with as many leaves as observations. Then the tree is pruned back using a function that assigns a cost to complexity (see Breiman et al., 1984):
R_α(T) = R(T) + α · size(T),  (16)

where size(T) is some measure of the complexity of the tree, such as its size (= number of leaves), α is the cost-complexity parameter and R(T) is a risk function that penalizes bad prediction.
The optimal tree can be determined, for example, by the 1-SE rule proposed by Breiman et al. (1984): it is the smallest tree with error not more than one standard error greater than that of the lowest-error tree in the sequence.
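The 1-SE rule can be sketched directly from a pruning sequence (a hedged illustration; the tree sizes, cross-validated errors and standard errors below are hypothetical):

```python
def one_se_rule(sizes, cv_errors, cv_ses):
    """sizes[i] = number of leaves of the i-th tree in the pruning
    sequence; returns the size of the tree chosen by the 1-SE rule."""
    i_min = min(range(len(cv_errors)), key=cv_errors.__getitem__)
    threshold = cv_errors[i_min] + cv_ses[i_min]
    # smallest tree whose CV error is within one SE of the minimum
    candidates = [i for i in range(len(sizes)) if cv_errors[i] <= threshold]
    return sizes[min(candidates, key=sizes.__getitem__)]

sizes     = [12, 8, 5, 3, 1]
cv_errors = [0.45, 0.40, 0.41, 0.43, 0.60]
cv_ses    = [0.03, 0.03, 0.03, 0.03, 0.05]
print(one_se_rule(sizes, cv_errors, cv_ses))  # 3: 0.43 <= 0.40 + 0.03
```

The rule trades a small, statistically insignificant loss of accuracy for a much simpler and more interpretable tree.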
Tree-structured models have many practical applications. One of them is medical diagnosis, where the learning sample consists of case records containing a description of a patient's symptoms and the corresponding outcome.
III. APPLICATION: PREDICTION OF DURATION OF ICU STAY AFTER CABG
In the following example we analyse some data from the Department of Cardiothoracic Surgery of Łódź Medical University, where a set of 244 case records of patients undergoing CABG (Coronary Artery Bypass Grafting) during 1997-1998 was collected.

The dependent variable y is the duration of a patient's stay in the Intensive Care Unit (ICU), in days. For classification trees analysis the length of stay is categorized as follows: class 1, ICU stay 1-4 days; class 2, ICU stay 5 or more days. Deaths are excluded.
Predictor variables are the following:
1) sex;
2) age in years;
3) BMI, body mass index;
4) BSA, body surface area;
5) diabetes mellitus (yes/no);
6) chronic pulmonary diseases (yes/no);
7) AO, arterial obstruction (yes/no);
8) hyperthyroidism (on medication) (yes/no);
9) diagnosis (stable angina, unstable angina);
10) CCS, Canadian Coronary Score (class I, II, III, IV);
11) history of myocardial infarction;
12) previous cardiac surgery (yes/no);
13) left main stenosis > 75% (yes/no);
14) EF%, left ventricular ejection fraction in %;
15) AspAt, aspartate aminotransferase in U/L;
16) priority of operation (elective, urgent, emergent);
17) intraoperative myocardial infarction or low cardiac output syndrome (yes/no);
18) perfusion time in minutes;
19) aortic clamping time in minutes;
20) reperfusion time in minutes;
21) severity score based on preoperative risk factors (see Tab. 1).

The following algorithms are used:
1) CART (Classification and Regression Trees) by Breiman et al. (1984);
2) QUEST (Quick, Unbiased, Efficient Statistical Trees) by Loh and Shih (1997);
3) GUIDE (Generalized, Unbiased Interaction Detection and Estimation) by Loh (2002).
CART and QUEST are well-known algorithms, available for example in the STATISTICA PL package.

GUIDE is a new algorithm for building piecewise constant or piecewise linear regression models with univariate splits. It has four useful properties: (i) negligible selection bias, (ii) sensitivity to curvature and local pairwise interactions between regressor variables, (iii) inclusion of categorical predictor variables, including ordinal categorical variables, (iv) choice of three roles for each ordered predictor variable: split selection only, regression modelling only, or both. For more details see Loh (2002).
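As a rough illustration of what the recursive partitioning performed by such algorithms looks like, here is a toy pure-Python sketch of CART-style tree growing with Gini splits and majority-vote leaves. It is our simplification, not the CART, QUEST or GUIDE implementations used in the paper, and the (BSA, BMI) records are invented:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow(rows, labels, depth=0, max_depth=2, min_leaf=1):
    """rows: numeric feature vectors. Returns a nested dict tree."""
    if depth == max_depth or gini(labels) == 0:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    best, n = None, len(rows)
    for j in range(len(rows[0])):              # every variable...
        for cut in sorted({r[j] for r in rows}):   # ...every cut point
            left = [i for i in range(n) if rows[i][j] <= cut]
            right = [i for i in range(n) if rows[i][j] > cut]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = (gini([labels[i] for i in left]) * len(left)
                     + gini([labels[i] for i in right]) * len(right))
            if best is None or score < best[0]:
                best = (score, j, cut, left, right)
    if best is None:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, j, cut, left, right = best
    return {"var": j, "cut": cut,
            "left": grow([rows[i] for i in left], [labels[i] for i in left],
                         depth + 1, max_depth, min_leaf),
            "right": grow([rows[i] for i in right], [labels[i] for i in right],
                          depth + 1, max_depth, min_leaf)}

def predict(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["var"]] <= tree["cut"] else tree["right"]
    return tree["leaf"]

# Hypothetical (BSA, BMI) records with ICU-stay class 1 or 2:
X = [[1.6, 24], [1.7, 31], [1.9, 25], [2.0, 30], [1.65, 29], [1.95, 26]]
y = [2, 2, 1, 1, 2, 1]
tree = grow(X, y)
print([predict(tree, x) for x in X])   # recovers y exactly
```

On these toy data a single split on the first variable (BSA at 1.7) separates the classes, mirroring the role BSA plays in the real trees below.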
Table 1. Clinical Severity Scoring System

Risk Factors                                 | Score
---------------------------------------------|---------------------------
Left ventricular ejection fraction < 40%     | 3
Emergency case                               | 3
Age > 60                                     | 1 (+1 point per 5 years)
Hyperthyroidism (on medication)              | 2
Diabetes mellitus                            | 2
Previous cardiac surgery                     | 2
Chronic pulmonary diseases                   | 2
Unstable angina                              | 2
BSA < 1.75 m²                                | 2
AspAt > 40 U/L                               | 1
Creatinine level > 1.2 mg/dl                 | 1
Arterial obstruction                         | 1
Left main stenosis > 75%                     | 2
Unstable hemodynamic state                   | 4

Source: Elaborated by the Department of Cardiothoracic Surgery of Łódź Medical University and the Chair of Statistical Methods, University of Łódź.
The results of using the selected tree building methods are shown in Figures 1-4.
Figure 1 shows the piecewise constant GUIDE tree. The 0-SE tree has 8 terminal nodes. Five variables appear in the splits: priority of operation, body surface area, age, severity score and sex. The number in each terminal node is the sample mean of the ICU stay. The predicted mean squared error is 5.245.

The piecewise constant GUIDE tree from the quantile (median) regression model is shown in Figure 2. It is shorter than the least squares tree in Figure 1. The 0-SE tree has 3 terminal nodes. The number in each terminal node is the sample median of the ICU stay. The splitting variables are: priority of operation and body surface area. The predicted mean squared error equals 5.893.

Figures 3 and 4 show trees obtained using the classification trees algorithms (QUEST and CART respectively).
Figure 1. GUIDE regression tree, least squares regression, piecewise constant model. Source: Author's calculations

Figure 2. GUIDE regression tree, median regression, piecewise constant model. Source: Author's calculations

Figure 3. QUEST classification tree. Source: Author's calculations

Figure 4. CART classification tree. Source: Author's calculations
Both the CART and QUEST 0-SE trees have 4 terminal nodes. The trees are quite similar. Three predictor variables may be sufficient to predict the response: body surface area, body mass index and priority of operation. The cross-validation estimate of the misclassification error is 0.4303 for the QUEST tree and 0.3934 for the CART tree.
IV. CONCLUSIONS

Prediction of the length of stay in the Intensive Care Unit after cardiac surgery is not easy. Prolonged stay in the ICU increases the overall costs of cardiac surgery and may also limit the number of operations performed. Therefore, the ability to accurately predict the length of stay in the ICU seems to be very important (see Michalopoulos et al., 1996).
In order to predict the duration of the patient's stay in the ICU we propose to use tree-structured methods. Advantages of tree-based models are, among others:
1) no requirement of knowledge of the variable distribution;
2) dealing with different types of variables, both categorical and continuous, including missing values;
3) robustness to outliers;
4) direct and intuitive way of interpretation;
5) reduction of the cost of the research by selecting only some important variables for splitting nodes, so that each new object can be described by a few risk factors.
The interpretability of a tree structure decreases as its complexity increases, so in further research we will pay more attention to fitting piecewise linear models, which tend to possess better prediction accuracy.
REFERENCES
Breiman L., Friedman J., Olshen R., Stone C. (1984), Classification and Regression Trees, CRC Press, London.
Domański Cz., Pruska K., Wagner W. (1998), Wnioskowanie statystyczne przy nieklasycznych założeniach, Wyd. UŁ, Łódź.
Gatnar E. (2001), Nieparametryczna metoda dyskryminacji i regresji, PWN, Warszawa.
Loh W.-Y. (2002), Regression trees with unbiased variable selection and interaction detection, Statistica Sinica, 12, 361-386.
Loh W.-Y., Shih Y.-S. (1997), Split selection methods for classification trees, Statistica Sinica, 7, 815-840.
Michalopoulos A., Tzelepis G., Pavlides G., Kriars J., Dafni U., Geroulanos S. (1996), Determinants of duration of ICU stay after coronary artery bypass graft surgery, British
ON THE APPLICATION OF CLASSIFICATION AND REGRESSION TREES IN MEDICAL DIAGNOSIS

Summary

A decision tree is a graphical presentation of the recursive partitioning method. The method consists of the stepwise division of a set of objects into disjoint subsets until they become homogeneous with respect to a distinguished variable y.

When y is a nominal variable, we deal with nonparametric discriminant analysis (classification trees); when it is a quantitative variable, with nonparametric regression analysis (regression trees).

The paper presents possible applications of regression and classification trees to solving decision-making problems in medical diagnosis.