


ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 194, 2005

Małgorzata Misztal*

ON THE APPLICATION OF CLASSIFICATION AND REGRESSION TREES IN MEDICAL DIAGNOSIS

Abstract

A decision tree is a graphical presentation of the recursive partitioning of the learning set into homogeneous subsets with respect to the dependent variable $y$.

If the dependent variable $y$ is nominal, we deal with nonparametric discriminant analysis (classification trees); when $y$ is numerical, with nonparametric regression analysis (regression trees).

The aim of the paper is to present some applications of regression and classification trees in medical diagnosis for solving decision-making problems.

Key words: classification and regression trees, medical diagnosis.

I. INTRODUCTION

A decision tree can be described as a tree-like way of representing a collection of hierarchical rules that lead to a class or to a value.

Let us consider a learning set:

$$U = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)\} \tag{1}$$

where $\mathbf{x}$ is the vector of independent variables, $\mathbf{x} = [x_1, x_2, \ldots, x_p]^T$, and $y$ is the response (dependent) variable.

The model building process is based on recursive partitioning of the learning set into homogeneous subsets $U_1, U_2, \ldots, U_M$ with respect to the dependent variable $y$.

Tree-based models are simple, flexible and powerful tools for classification and regression analysis. They deal with different kinds of variables, handle missing values and are very easy to interpret.


II. MODEL BUILDING PROCESS

We consider an additive model:

$$y = a_0 + \sum_{m=1}^{M} a_m g_m(\mathbf{x}, \beta) \tag{2}$$

where $g_m(\mathbf{x}, \beta)$ are functions of $\mathbf{x}$ with parameters $\beta$. An approximation of (2) can be written as:

$$y = a_0 + \sum_{m=1}^{M} a_m I\{\mathbf{x} \in R_m\} \tag{3}$$

where $R_m$ ($m = 1, \ldots, M$) are disjoint regions in the $p$-dimensional feature space, $a_m$ are real parameters and $I\{A\}$ is an indicator function:

$$I\{A\} = \begin{cases} 1, & \text{if the proposition inside the brackets is true} \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

For a real-valued dimension of the region $R_m$, characterized by its lower and upper boundaries $x_r^{(d)}$ and $x_r^{(g)}$, we have:

$$I\{\mathbf{x} \in R_m\} = \prod_{r=1}^{p} I\{x_r^{(d)} \le x_r \le x_r^{(g)}\} \tag{5}$$

For each categorical variable $x_r$ we have:

$$I\{\mathbf{x} \in R_m\} = \prod_{r=1}^{p} I\{x_r \in B_{mr}\} \tag{6}$$

where $B_{mr}$ is a subset of the set of the variable values (see Gatnar, 2001).

If the dependent variable $y$ is nominal, we deal with nonparametric discriminant analysis (so we have classification trees); when $y$ is numerical, with nonparametric regression analysis (regression trees).

In discriminant analysis the response variable $y$ represents a class label, so we want to predict the class of an object from the values of its predictor variables.

In regression analysis the response variable $y$ is assumed to depend on the regressors $x_1, x_2, \ldots, x_p$ through the relationship:

$$y = f(x_1, x_2, \ldots, x_p) + \varepsilon \tag{7}$$

The goal of the regression analysis is to find an estimate $\hat{y}$ of $y$ that minimizes a certain loss function (e.g. squared error loss). The estimation of $y$ can be done by recursive partitioning of the data and the regressor space using techniques that yield a piecewise constant estimate of $y$:

$$f(\mathbf{x}) = \sum_{m=1}^{M} \theta_m I\{\mathbf{x} \in R_m\} \tag{8}$$

so that for each $\mathbf{x} \in R_m$:

$$f(\mathbf{x}) = \theta_m \tag{9}$$
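The piecewise constant estimate in (8)-(9) can be illustrated with a minimal Python sketch. The function names, the box representation of the regions and the numbers are hypothetical and not taken from the paper; half-open intervals are assumed here so that neighbouring regions remain disjoint.

```python
import math

def in_region(x, box):
    """Indicator I{x in R_m} as a product of per-coordinate indicators.

    `box` holds one (lower, upper) pair per coordinate. Half-open
    intervals [lower, upper) are an assumption made for this sketch so
    that adjacent regions do not overlap on their common boundary.
    """
    return int(all(lo <= xr < hi for xr, (lo, hi) in zip(x, box)))

def predict(x, boxes, thetas):
    """Piecewise constant estimate f(x) = sum_m theta_m * I{x in R_m}."""
    return sum(theta * in_region(x, box) for box, theta in zip(boxes, thetas))

INF = math.inf
# Two made-up regions that split the first coordinate at 5.0.
boxes = [[(-INF, 5.0), (-INF, INF)], [(5.0, INF), (-INF, INF)]]
thetas = [1.5, 4.0]

print(predict([2.0, 0.3], boxes, thetas))  # falls in the first region
print(predict([7.0, 0.3], boxes, thetas))  # falls in the second region
```

Because the regions are disjoint, exactly one indicator is non-zero for any point they cover, so the sum reduces to the single constant $\theta_m$ of the region containing $\mathbf{x}$.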

The procedure of partitioning the sample space requires selecting appropriate variables.

The goal of the variable selection process is to choose those variables that yield the best partition. The quality of a partition of a learning set $U$ is measured by the difference between the heterogeneity (in terms of $y$) of $U$ and of the resulting subsets $U_1, U_2, \ldots, U_M$ (see Gatnar, 2001). In other words, we have a reduction of impurity:

$$\Delta H(U; x_r) = H(U) - \sum_{m=1}^{M} H(U_m)\, p(m) \tag{10}$$

where:

$H(\cdot)$ - is a heterogeneity measure,

$p(m) = \frac{N(m)}{N}$ - is the proportion of objects in $U_m$,

$N(m)$ - is the number of objects in $U_m$; of course $\sum_{m=1}^{M} N(m) = N$.

If $y$ is categorical we have, for example, the entropy function:

$$H(U_m) = -\sum_{i=1}^{k} p(i/m) \log_2 p(i/m) \tag{11}$$

or the Gini index:

$$H(U_m) = 1 - \sum_{i=1}^{k} p^2(i/m) \tag{12}$$

where $p(i/m) = \frac{N_i(m)}{N(m)}$; $N_i(m)$ - is the number of objects from class $i$ in $U_m$.
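As an illustration of (10)-(12), the following Python sketch computes the entropy, the Gini index and the resulting reduction of impurity for a candidate partition. The function names and the toy labels are hypothetical and serve only to make the formulas concrete.

```python
from collections import Counter
from math import log2

def class_proportions(labels):
    """Proportions p(i/m) of each class among the labels of one subset."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def entropy(labels):
    """Entropy H(U_m) = -sum_i p(i/m) * log2 p(i/m), as in (11)."""
    return -sum(p * log2(p) for p in class_proportions(labels))

def gini(labels):
    """Gini index H(U_m) = 1 - sum_i p(i/m)^2, as in (12)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def impurity_reduction(parent, subsets, measure):
    """Delta H = H(U) - sum_m H(U_m) * p(m), with p(m) = N(m) / N, as in (10)."""
    n = len(parent)
    return measure(parent) - sum(measure(s) * len(s) / n for s in subsets)

# Made-up class labels: a perfectly separating split of a balanced set.
parent = [1, 1, 2, 2]
split = [[1, 1], [2, 2]]
print(entropy(parent), gini(parent))
print(impurity_reduction(parent, split, gini))
```

Passing the heterogeneity measure as an argument mirrors the role of $H(\cdot)$ in (10): the same reduction formula applies whichever measure is chosen.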


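In regression tree analysis the most frequently used heterogeneity measure is variance:

$$s^2(U_m) = \frac{1}{N(m)} \sum_{\mathbf{x}_i \in U_m} (y_i - \hat{\theta}_m)^2 \tag{13}$$

Using (13) we have:

$$\Delta H(U; x_r) = s^2(U) - \sum_{m=1}^{M} s^2(U_m)\, p(m) \tag{14}$$

and

$$\hat{\theta}_m = \frac{1}{N(m)} \sum_{\mathbf{x}_i \in U_m} y_i \tag{15}$$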
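For a numerical response, the variance-based reduction of impurity can be sketched in the same way. The exhaustive threshold scan below is a CART-style greedy search and is only an assumption for illustration; QUEST and GUIDE select split variables differently. All names and data are hypothetical.

```python
def variance(y):
    """Variance of the responses in one subset, around the subset mean."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y) / len(y)

def variance_reduction(y, subsets):
    """s^2(U) minus the weighted variances of the subsets, weights N(m)/N."""
    n = len(y)
    return variance(y) - sum(variance(s) * len(s) / n for s in subsets)

def best_threshold(x, y):
    """Scan midpoints between sorted distinct x values for the best binary split.

    Returns (threshold, reduction); threshold is None if no split helps.
    """
    values = sorted(set(x))
    best = (None, 0.0)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        gain = variance_reduction(y, [left, right])
        if gain > best[1]:
            best = (t, gain)
    return best

# Made-up data: the response jumps between x = 2 and x = 3.
x = [1, 2, 3, 4]
y = [1.0, 1.0, 5.0, 5.0]
print(best_threshold(x, y))
```

The mean of each subset plays the role of the fitted constant in the corresponding terminal node.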

An ideal goal of recursive partitioning is to find a decision tree of minimal size and maximal predictive accuracy. To evaluate the model accuracy a test set is required; if none is available, cross-validation is used.

The right size of the tree can be chosen using pruning methods. The usual procedure is to choose optimal splits all the way down the tree, so that as an intermediate step one obtains a tree with as many leaves as observations. Then the tree is pruned back using a function that assigns a cost to complexity (see Breiman et al., 1984):

$$R_\alpha(T) = R(T) + \alpha \times \mathrm{size}(T) \tag{16}$$

where $\mathrm{size}(T)$ is some measure of the complexity of the tree, such as its size (the number of leaves), $\alpha$ is the cost-complexity parameter and $R(T)$ is a risk function that penalizes bad prediction.

The optimal tree is determined, for example, by the 1-SE rule proposed by Breiman et al. (1984): it is the smallest tree whose error is no more than one standard error greater than the error of the lowest-error tree in the sequence.
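The cost-complexity criterion (16) and the 1-SE rule can be sketched as follows. The sequence of pruned trees, their cross-validated errors and standard errors are made-up numbers, and the function names are hypothetical; the sketch only shows how the selection works once such a sequence is available.

```python
def cost_complexity(risk, size, alpha):
    """R_alpha(T) = R(T) + alpha * size(T) for one tree."""
    return risk + alpha * size

def select_tree_1se(sizes, errors, std_errors):
    """Index of the smallest tree within one SE of the lowest-error tree."""
    best = min(range(len(errors)), key=errors.__getitem__)
    threshold = errors[best] + std_errors[best]
    candidates = [i for i in range(len(errors)) if errors[i] <= threshold]
    return min(candidates, key=lambda i: sizes[i])

# Hypothetical pruning sequence: number of leaves, CV error, standard error.
sizes = [1, 3, 4, 8, 15]
errors = [0.50, 0.41, 0.40, 0.39, 0.45]
std_errors = [0.03, 0.03, 0.03, 0.03, 0.03]

zero_se = min(range(len(errors)), key=errors.__getitem__)
one_se = select_tree_1se(sizes, errors, std_errors)
print(sizes[zero_se], sizes[one_se])
```

In this made-up sequence the lowest-error (0-SE) tree has 8 leaves, while the 1-SE rule accepts the 3-leaf tree because its error lies within one standard error of the minimum.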

Tree-structured models have many practical applications. One of them is medical diagnosis, where the learning sample consists of case records containing a description of a patient's symptoms and the corresponding outcome.


III. APPLICATION: PREDICTION OF DURATION OF ICU STAY AFTER CABG

In the following example we analyse some data from the Department of Cardiothoracic Surgery of Lodz Medical University, where a set of 244 case records of patients undergoing CABG (Coronary Artery Bypass Grafting) during 1997-1998 was collected.

The dependent variable $y$ is the duration of a patient's stay in the Intensive Care Unit (ICU), in days. For classification tree analysis the length of stay is categorized as follows: class 1 - ICU stay of 1-4 days; class 2 - ICU stay of 5 or more days. Deaths are excluded.

Predictor variables are the following:

1) sex;
2) age in years;
3) BMI - body mass index;
4) BSA - body surface area;
5) diabetes mellitus (yes/no);
6) chronic pulmonary diseases (yes/no);
7) AO - arterial obstruction (yes/no);
8) hyperthyroidism (on medication) (yes/no);
9) diagnosis (stable angina, unstable angina);
10) CCS - Canadian Coronary Score (class I, II, III, IV);
11) history of myocardial infarction;
12) previous cardiac surgery (yes/no);
13) left main stenosis > 75% (yes/no);
14) EF% - left ventricular ejection fraction in %;
15) AspAt - aspartate aminotransferase in U/L;
16) priority of operation (elective, urgent, emergent);
17) intraoperative myocardial infarction or low cardiac output syndrome (yes/no);
18) perfusion time in minutes;
19) aortic clamping time in minutes;
20) reperfusion time in minutes;
21) severity score based on preoperative risk factors (see Tab. 1).

The following algorithms are used:

1) CART (Classification and Regression Trees) by Breiman et al. (1984);
2) QUEST (Quick, Unbiased, Efficient Statistical Trees) by Loh and Shih (1997);
3) GUIDE (Generalized, Unbiased Interaction Detection and Estimation) by Loh (2002).

CART and QUEST are well-known algorithms, available for example in the STATISTICA PL package.


GUIDE is a new algorithm for building piecewise constant or piecewise linear regression models with univariate splits. It has four useful properties: (i) negligible selection bias, (ii) sensitivity to curvature and local pairwise interactions between regressor variables, (iii) inclusion of categorical predictor variables, including ordinal categorical variables, (iv) choice of three roles for each ordered predictor variable: split selection only, regression modelling only, or both. For more details see Loh (2002).

Table 1. Clinical Severity Scoring System

| Risk factors | Score |
|---|---|
| Left ventricular ejection fraction < 40% | 3 |
| Emergency case | 3 |
| Age > 60 | 1 (+1 point per 5 years) |
| Hyperthyroidism (on medication) | 2 |
| Diabetes mellitus | 2 |
| Previous cardiac surgery | 2 |
| Chronic pulmonary diseases | 2 |
| Unstable angina | 2 |
| BSA < 1.75 m² | 2 |
| AspAt > 40 U/L | 1 |
| Creatinine level > 1.2 mg/dl | 1 |
| Arterial obstruction | 1 |
| Left main stenosis > 75% | 2 |
| Unstable haemodynamic state | 4 |

Source: Elaborated by the Department of Cardiothoracic Surgery of Łódź Medical University and the Chair of Statistical Methods, University of Łódź.
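The scoring system of Table 1 can be written down as a short Python sketch. The field names are hypothetical, and the reading of the age rule is an assumption: 1 point for age above 60 plus 1 further point for every full 5 years above 60. The table itself does not spell this out, so the age component below is illustrative only.

```python
# Points for yes/no risk factors, taken from Table 1 (keys are hypothetical).
BINARY_POINTS = {
    "emergency_case": 3,
    "hyperthyroidism": 2,
    "diabetes_mellitus": 2,
    "previous_cardiac_surgery": 2,
    "chronic_pulmonary_disease": 2,
    "unstable_angina": 2,
    "arterial_obstruction": 1,
    "left_main_stenosis_over_75": 2,
    "unstable_haemodynamic_state": 4,
}

def age_points(age):
    """ASSUMED reading of 'Age > 60: 1 (+1 point per 5 years)'."""
    if age <= 60:
        return 0
    return 1 + int((age - 60) // 5)

def severity_score(patient):
    """Sum the Table 1 points for one patient record (a dict)."""
    score = sum(pts for key, pts in BINARY_POINTS.items() if patient.get(key, False))
    score += 3 if patient["ef_percent"] < 40 else 0
    score += 2 if patient["bsa_m2"] < 1.75 else 0
    score += 1 if patient["aspat_u_l"] > 40 else 0
    score += 1 if patient["creatinine_mg_dl"] > 1.2 else 0
    score += age_points(patient["age"])
    return score

# Made-up patient record, not taken from the study data.
patient = {
    "age": 67, "ef_percent": 35, "bsa_m2": 1.90,
    "aspat_u_l": 30, "creatinine_mg_dl": 1.0,
    "diabetes_mellitus": True,
}
print(severity_score(patient))
```

Keeping the yes/no factors in one dictionary keeps the thresholds for the measured quantities (ejection fraction, BSA, AspAt, creatinine, age) visible as separate lines that can be checked against the table.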

The results of using the selected tree-building methods are shown in Figures 1-4.

Figure 1 shows the piecewise constant GUIDE tree. The 0-SE tree has 8 terminal nodes. Five variables appear in the splits: priority of operation, body surface area, age, severity score and sex. The number in each terminal node is the sample mean of the ICU stay. The predicted mean squared error is 5.245.

The piecewise constant GUIDE tree from the quantile (median) regression model is shown in Figure 2. It is shorter than the least squares tree in Figure 1. The 0-SE tree has 3 terminal nodes. The number in each terminal node is the sample median of the ICU stay. The splitting variables are priority of operation and body surface area. The predicted mean squared error equals 5.893.

Figures 3 and 4 show trees obtained using the classification tree algorithms (QUEST and CART, respectively).

Figure 1. GUIDE regression tree - least squares regression, piecewise constant model. Source: Author's calculations

Figure 2. GUIDE regression tree - median regression, piecewise constant model. Source: Author's calculations

Figure 3. QUEST classification tree. Source: Author's calculations

Figure 4. CART classification tree. Source: Author's calculations

Both the CART and QUEST 0-SE trees have 4 terminal nodes. The trees are quite similar. Three predictor variables may be sufficient to predict the response: body surface area, body mass index and priority of operation. The cross-validation estimate of misclassification error is 0.4303 for the QUEST tree and 0.3934 for the CART tree.
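To give the error rates a more tangible reading, the short sketch below converts them into approximate counts of misclassified cases. It assumes that all 244 case records entered the cross-validation, which the text does not state explicitly; the function name is hypothetical.

```python
def implied_misclassified(error_rate, n_cases):
    """Number of misclassified cases implied by an error rate (rounded)."""
    return round(error_rate * n_cases)

N_CASES = 244  # ASSUMPTION: every collected case record was used

for name, rate in [("QUEST", 0.4303), ("CART", 0.3934)]:
    wrong = implied_misclassified(rate, N_CASES)
    # Recompute the rate from the rounded count as a consistency check.
    print(name, wrong, round(wrong / N_CASES, 4))
```

Under this assumption the reported rates are consistent with whole numbers of misclassified patients, and the difference between the two trees amounts to fewer than ten cases.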

IV. CONCLUSIONS

Prediction of the length of stay in the Intensive Care Unit after cardiac surgery is not easy. A prolonged stay in the ICU increases the overall costs of cardiac surgery and may also limit the number of operations performed. Therefore, the ability to accurately predict the length of stay in the ICU seems to be very important (see Michalopoulos et al., 1996).

In order to predict the duration of the patient's stay in the ICU we propose to use tree-structured methods. Advantages of tree-based models are, among others:

1) no requirement of knowledge of the variable distribution;

2) dealing with different types of variables, both categorical and continuous, including missing values;

3) robustness to outliers;

4) a direct and intuitive way of interpretation;

5) reduction of the cost of the research by selecting only some important variables for splitting nodes, so that each new object can be described by a few risk factors.

The interpretability of a tree structure decreases as its complexity increases, so in further research we will pay more attention to fitting piecewise linear models, which tend to possess better prediction accuracy.

REFERENCES

Breiman L., Friedman J., Olshen R., Stone C. (1984), Classification and Regression Trees, CRC Press, London.

Domański Cz., Pruska K., Wagner W. (1998), Wnioskowanie statystyczne przy nieklasycznych założeniach, Wyd. UŁ, Łódź.

Gatnar E. (2001), Nieparametryczna metoda dyskryminacji i regresji, PWN, Warszawa.

Loh W.-Y. (2002), Regression trees with unbiased variable selection and interaction detection, Statistica Sinica, 12, 361-386.

Loh W.-Y., Shih Y.-S. (1997), Split selection methods for classification trees, Statistica Sinica, 7, 815-840.

Michalopoulos A., Tzelepis G., Pavlides G., Kriars J., Dafni U., Geroulanos S. (1996), Determinants of duration of ICU stay after coronary artery bypass graft surgery, British

Małgorzata Misztal

O ZASTOSOWANIU DRZEW KLASYFIKACYJNYCH I REGRESYJNYCH W DIAGNOSTYCE MEDYCZNEJ
(On the Application of Classification and Regression Trees in Medical Diagnosis)

Summary

A decision tree is a graphical presentation of the recursive partitioning method. The method consists in the stepwise division of a set of objects into disjoint subsets until they become homogeneous with respect to a distinguished feature $y$.

When $y$ is a nominal variable, we deal with nonparametric discriminant analysis (classification trees); when it is a quantitative variable, with nonparametric regression analysis (regression trees).

The paper presents possible applications of regression and classification trees to solving decision-type problems in medical diagnosis.
