A Proposal for Using Selected Tree-Based Models to Identify Operative Risk Subgroups among Patients Undergoing Coronary Artery Bypass Grafting


ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 206, 2007

Małgorzata Misztal*


Abstract. Classification and regression trees are very popular and attractive types of classifiers, widely used to solve decision-making problems in different fields of science.

The study was conducted to identify preoperative risk factors associated with morbidity outcome among patients undergoing isolated Coronary Artery Bypass Grafting (CABG) and to develop some classification rules assigning patients to selected risk subgroups. Prediction rules were established on the basis of the selected tree-structured models. The following tree-based algorithms were used: QUEST, CRUISE, LOTUS and PLUS.

Key words: recursive partitioning method, classification and regression trees, coronary artery disease, coronary artery bypass grafting.

1. INTRODUCTION AND OBJECTIVES

The decision to perform coronary artery bypass grafting (CABG) surgery on a patient is taken under conditions of uncertainty. The benefits of CABG must therefore be balanced against its risk. To estimate this risk we must simultaneously consider many types of information, including characteristics of the patient and characteristics of the disease.

The main goal of the study was to identify factors associated with morbidity outcome among patients undergoing CABG and to develop decision rules for the classification of patients into selected risk subgroups. Prediction rules were established on the basis of tree-structured models.

A decision tree can be described as a tree-like representation of a collection of hierarchical rules that lead to a class or to a value.

We consider a learning set $U = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)\}$, where $\mathbf{x}$ is the vector of independent variables $\mathbf{x} = [x_1, x_2, \ldots, x_L]^T$ and $y$ is the response (dependent) variable. The model building process is based on recursive partitioning of the learning set into homogeneous subsets $U_1, U_2, \ldots, U_M$ with respect to the dependent variable $y$. If $y$ is nominal we deal with nonparametric discriminant analysis (classification tree); if $y$ is numerical, with nonparametric regression analysis (regression tree) (see e.g. Breiman et al. 1984; Gatnar 2001).

In medical diagnosis tasks the vector x consists of variables describing the patient's symptoms, characteristics of the disease and the state of the patient before and during the treatment. The response variable y, in our study, is the number of the class (risk subgroup) the patient belongs to.
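To make the idea of recursive partitioning concrete, the following minimal sketch searches for the single binary split of one numerical predictor that minimises the weighted Gini impurity of the two resulting subsets. It is a generic CART-style illustration and not the split-selection procedure of QUEST, CRUISE, LOTUS or PLUS; the function names and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure subset)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(x, y):
    """Return (threshold, weighted impurity) of the best split x <= threshold."""
    values = sorted(set(x))
    best = (None, gini(y))  # no split: impurity of the parent node
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if w < best[1]:
            best = (t, w)
    return best

# Toy data (hypothetical): a risk score and the outcome class of six patients.
score = [2, 3, 4, 6, 8, 9]
outcome = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(score, outcome)
print(threshold, impurity)
```

A full tree would be obtained by applying the same search recursively to each resulting subset until a stopping rule is met.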

2. MATERIAL AND METHODS

A set of 2568 case records of patients undergoing CABG during 2003-2004 in Poland was analysed. The data from 2003 (N = 947) constituted the learning set and the data from 2004 (N = 1621) the test set.

Only preoperative risk factors were taken into account. 37 predictor variables were evaluated. Three clinical scoring systems were also taken into consideration: EuroSCORE (Nashef et al. 1999), Cleveland Clinic Foundation (Higgins et al. 1992) and the recently created Łódź Score of Surgical Risk (Domański et al. 2003, see Tab. 1).

Table 1. Łódź Clinical Scoring System

| Risk Factors | Score |
|---|---|
| EF < 40% | 3 |
| Emergency case | 3 |
| Age > 60 | 1 (+1 point per 5 years) |
| Hyperthyroidism (on medication) | 2 |
| Diabetes mellitus | 2 |
| Previous cardiac surgery | 2 |
| Chronic pulmonary diseases | 2 |
| Unstable angina | 2 |
| BSA < 1.75 m² | 2 |
| AspAT > 40 U/L | 1 |
| Creatinine level > 1.2 mg/dl | 1 |
| Arterial obstruction | 1 |
| Left main stenosis > 75% | 2 |
| Unstable haemodynamic state | 4 |

Source: Elaborated by Department of Cardiac Surgery of Łódź Medical University and Chair of Statistical Methods, University of Łódź.
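The additive scoring in Tab. 1 can be sketched as a small function. This is only an illustration: the dictionary keys are hypothetical names, and the age rule is an assumption, since the table does not say exactly how "+1 point per 5 years" is counted (here: 1 point above 60 plus 1 point for each further completed 5 years).

```python
# Points for yes/no risk factors from Tab. 1 (key names are hypothetical).
BINARY_POINTS = {
    "emergency": 3,
    "hyperthyroidism": 2,
    "diabetes": 2,
    "previous_cardiac_surgery": 2,
    "chronic_pulmonary_disease": 2,
    "unstable_angina": 2,
    "arterial_obstruction": 1,
    "unstable_haemodynamics": 4,
}

def lodz_score(p):
    """Sum the Tab. 1 points for a patient described by a dict."""
    score = sum(pts for key, pts in BINARY_POINTS.items() if p.get(key, False))
    if p["ef"] < 40:                  # ejection fraction in percent
        score += 3
    if p["age"] > 60:
        # ASSUMPTION: 1 point above 60, plus 1 per further completed 5 years.
        score += 1 + (p["age"] - 60) // 5
    if p["bsa"] < 1.75:               # body surface area in m^2
        score += 2
    if p["aspat"] > 40:               # AspAT in U/L
        score += 1
    if p["creatinine"] > 1.2:         # creatinine in mg/dl
        score += 1
    if p["left_main_stenosis"] > 75:  # stenosis in percent
        score += 2
    return score

# Hypothetical patient: EF 35%, age 67, diabetes and unstable angina.
patient = {"ef": 35, "age": 67, "bsa": 1.9, "aspat": 30, "creatinine": 1.0,
           "left_main_stenosis": 50, "diabetes": True, "unstable_angina": True}
print(lodz_score(patient))
```

Under the stated age assumption this patient collects 3 points for EF, 2 for age, 2 for diabetes and 2 for unstable angina.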


The outcome after CABG included the following 2 classes:

1) class 0 - patients with uncomplicated postoperative outcome (629 patients in the learning set and 922 patients in the test set);

2) class 1 - patients with one or more of the following: i) death; ii) cardiac complications; iii) central nervous system complications; iv) renal failure; v) respiratory failure; vi) any serious infection (318 cases in the learning sample and 699 cases in the test sample).

The potential association of each of the considered factors with the postoperative outcome was assessed using the χ² test or the Mann-Whitney test. Factors significant at the p < 0.10 level at least were used to establish classification rules to identify the high-risk subgroup. Statistical analyses were performed with STATISTICA PL software ver. 6.0.
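As an illustration of the screening step for a binary risk factor, the sketch below computes the Pearson χ² statistic for a 2×2 table and its p-value (1 degree of freedom) using only the standard library. The analyses in the study were run in STATISTICA; this code is not the author's procedure, no continuity correction is applied, and the cell counts are hypothetical (only the class totals 629 and 318 come from the learning set).

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). For 1 degree of freedom the upper tail of
    the chi-square distribution equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows = factor present / absent, columns = class 0 / class 1.
stat, p = chi2_2x2(100, 90, 529, 228)
keep_factor = p < 0.10  # screening threshold used in the study
print(round(stat, 2), keep_factor)
```

Numerical predictors would need a different test (the study used the Mann-Whitney test), which is not covered by this sketch.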

The following tree-based algorithms were used:

1) QUEST (Quick, Unbiased, Efficient Statistical Trees) described in W.-Y. Loh and Y.-S. Shih (1997) - designed to have unbiased variable selection in the splitting procedures (obtained from: http://www.stat.wisc.edu/~loh/quest.html);

2) CRUISE (Classification Rule with Unbiased Interaction Selection and Estimation) described in H. Kim and W.-Y. Loh (2001) - with interaction detection methods in the splitting process (obtained from: http://www.wpi.edu/~hkim/cruise/);

3) LOTUS (Logistic Regression Trees with Unbiased Selection) described in K.-Y. Chan and W.-Y. Loh (2004) - designed to fit a piecewise (multiple or simple) linear logistic regression model by recursively partitioning the data and fitting a different logistic regression in each partition (obtained from: http://www.stat.nus.sg/~kinyee/lotus.html);

4) PLUS (Polytomous Logistic Regression Trees with Unbiased Split) described in T.-S. Lim (2000) - which combines a polytomous logistic regression with tree-based models (obtained from: http://www.recursive-partitioning.com/plus).

We used the learning set to develop decision rules for the classification and the test set to evaluate the model accuracy.

3. RESULTS

The following 19 risk factors were significantly associated with morbidity outcome:

• age (p < 0.001);

• BSA (body surface area, p < 0.001);

• BMI (body mass index, p < 0.05);

• unstable angina (p < 0.001);

• recent (< 90 days) myocardial infarction (p < 0.001);

• mitral regurgitation (p < 0.01);

• EF (left ventricular ejection fraction, p < 0.001);

• anticoagulation and/or antiplatelet treatment (p < 0.05);

• Cleveland Clinic Foundation Score (p < 0.001);

• Łódź Clinical Scoring System (p < 0.001);

• carotid arteries arteriosclerosis - symptomatic TIA (p < 0.01);

• preoperative hematocrit level (p < 0.10);

• critical preoperative state (at least one of: preoperative cardiac massage, preoperative intubation, preoperative intra-aortic balloon; p < 0.01);

• unstable haemodynamic state (p < 0.001);

• priority of operation (p < 0.001);

• EuroSCORE (p < 0.001);

• type of operation (in extracorporeal circulation - ECC or off-pump operation - without ECC; p < 0.10).

The risk factors mentioned above were employed in the tree-structured analysis. All the results are shown in Fig. 1-4. Tables 2-3 present details of the terminal node models of the logistic regression trees for the LOTUS and PLUS algorithms.

Fig. 1. QUEST classification tree
Source: own elaboration.

Fig. 2. CRUISE classification tree
Source: own elaboration.

Fig. 3. LOTUS logistic regression tree (best simple linear logistic models in terminal nodes)

Fig. 4. PLUS logistic regression tree
Source: own elaboration.

Table 2. Terminal node models of logistic regression tree (LOTUS algorithm)

| Node | Variable | Coefficient | t-value |
|---|---|---|---|
| 4 | Intercept | 1.628 | 1.051 |
| | HCT | -0.043 | -1.167 |
| 5 | Intercept | 3.735 | 1.748 |
| | EF | -0.105 | -2.386 |
| 6 | Intercept | 1.178 | 1.852 |
| | EF | -0.020 | -1.638 |
| 7 | Intercept | -1.433 | -0.487 |
| | Age | 0.048 | 1.112 |


Table 3. Terminal node models of logistic regression tree (PLUS algorithm)

| Node | Variable | Coefficient | t-value |
|---|---|---|---|
| 2 | Intercept | -2.653 | -1.318 |
| | ŁódźScore | 0.5615 | 1.928 |
| 6 | Intercept | -0.713 | -3.107 |
| | ŁódźScore | 0.190 | 4.302 |
| 14 | Intercept | -0.513 | -2.399 |
| | EuroScore | 0.208 | 3.314 |
| 15 | Intercept | -3.468 | -4.470 |
| | ŁódźScore | 0.589 | 3.927 |

Source: author's calculations.

The following abbreviations were used:

• EuroScore - the European System for Cardiac Operative Risk Evaluation;

• EF - left ventricular ejection fraction;

• Haemod. - unstable haemodynamic state;

• HCT - preoperative hematocrit level;

• TIA - carotid arteries arteriosclerosis - symptomatic TIA;

• MI < 90 - recent (< 90 days) MI;

• Off-pump - operation without ECC (extracorporeal circulation);

• LodzScore - Łódź Clinical Scoring System based on preoperative risk factors.

The classification rules obtained from the QUEST tree for high-risk patients can be described as follows:

- Łódź Clinical Scoring System > 7.5;

- Łódź Clinical Scoring System ∈ (3.5; 7.5] ∧ [(TIA = 'yes') ∨ (operation in ECC)];

- Łódź Clinical Scoring System < 3.5 ∧ operation in ECC ∧ MI < 90 days = 'yes'.

The decision rules for high-risk patients, constructed using the CRUISE algorithm, are the following:

- operation without ECC ∧ Łódź Clinical Scoring System > 4.5;

- operation in ECC ∧ MI < 90 days = 'yes' ∧ age > 50.5 years;

- operation in ECC ∧ MI < 90 days = 'no' ∧ [(preoperative hematocrit level < 40%) ∨ unstable haemodynamic state ∨ (age > 64.5 years)].
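The QUEST rules for high-risk patients listed above translate directly into a few "if-then" conditions. The sketch below is one possible encoding; the function and argument names are hypothetical, and the thresholds are those quoted in the text.

```python
def quest_high_risk(lodz_score, tia, ecc, recent_mi):
    """Return True if a patient falls into the QUEST high-risk subgroup.

    lodz_score: value of the Łódź Clinical Scoring System
    tia:        symptomatic TIA (carotid arteries arteriosclerosis)
    ecc:        operation in extracorporeal circulation
    recent_mi:  myocardial infarction within the last 90 days
    """
    if lodz_score > 7.5:
        return True
    if 3.5 < lodz_score <= 7.5:
        return tia or ecc
    if lodz_score < 3.5:
        return ecc and recent_mi
    return False  # a score of exactly 3.5 is not covered by the quoted rules

# Hypothetical patients.
print(quest_high_risk(9, False, False, False))  # score above 7.5
print(quest_high_risk(5, False, True, False))   # middle range, operation in ECC
print(quest_high_risk(2, False, True, False))   # low score, no recent MI
```

Because the score is a sum of whole points, the half-point thresholds simply separate neighbouring integer values.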


The trees obtained from the LOTUS and PLUS algorithms have 4 terminal nodes each. The best simple linear logistic regression model is fitted in every terminal node. More details and the interpretation of the parameters are given in M. Misztal (2005). Logistic regression trees are shorter than classification trees.
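To show how a terminal node model is read, the sketch below evaluates a simple logistic model with the coefficients reported in Tab. 2 for LOTUS node 5 (intercept 3.735, EF coefficient -0.105). Two points are assumptions rather than statements of the source: that the model describes the probability of class 1 (complicated outcome), and that EF is entered in percent.

```python
import math

def node_probability(intercept, slope, x):
    """Simple logistic model: p = 1 / (1 + exp(-(intercept + slope * x)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

# Coefficients of LOTUS terminal node 5 as reported in Tab. 2.
INTERCEPT, EF_SLOPE = 3.735, -0.105

# ASSUMPTION: EF in percent; the output is read as P(class 1).
for ef in (30, 40, 50, 60):
    print(ef, round(node_probability(INTERCEPT, EF_SLOPE, ef), 3))
```

With a negative EF coefficient the estimated probability falls as the ejection fraction rises, which fits the clinical reading that a low EF indicates higher risk.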

The results of the application of the selected tree-based models for the learning and test sets are summarized in Tab. 4-5.

Table 4. Classification matrix based on the learning sample

| Method | Actual risk group | Predicted class 0 | Predicted class 1 | % of correct classifications | 10-fold CV error rate |
|---|---|---|---|---|---|
| QUEST | class 0 | 418 | 211 | 66.45 | 37.17 |
| | class 1 | 121 | 197 | 61.95 | |
| CRUISE | class 0 | 486 | 139 | 77.76 | 26.14 |
| | class 1 | 64 | 254 | 79.87 | |
| LOTUS | class 0 | 405 | 224 | 64.39 | 38.50 |
| | class 1 | 117 | 201 | 63.21 | |
| PLUS | class 0 | 409 | 220 | 65.02 | 37.90 |
| | class 1 | 103 | 215 | 67.61 | |

Source: author's calculations.

Table 5. Classification matrix based on the test sample

| Method | Actual risk group | Predicted class 0 | Predicted class 1 | % of correct classifications |
|---|---|---|---|---|
| QUEST | class 0 | 700 | 222 | 75.92 |
| | class 1 | 273 | 426 | 60.94 |
| CRUISE | class 0 | 542 | 380 | 58.79 |
| | class 1 | 200 | 499 | 71.39 |
| LOTUS | class 0 | 695 | 227 | 75.38 |
| | class 1 | 314 | 386 | 55.14 |
| PLUS | class 0 | 616 | 306 | 66.81 |
| | class 1 | 200 | 499 | 71.39 |


The results obtained for the test sample are usually worse than those for the training set. On the basis of the results shown in Tab. 4 and 5, we can recommend the decision rules constructed using the QUEST and PLUS trees for further analyses.
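The per-class percentages in Tab. 5 can be recomputed from the cell counts, and an overall accuracy can be derived alongside them. The sketch below assumes that rows are actual classes and columns are predicted classes, which is the reading under which the reported percentages are reproduced; the overall accuracy is not reported in the source and is derived here for illustration only.

```python
# Test-set counts from Tab. 5 per method:
# (class 0 -> 0, class 0 -> 1, class 1 -> 0, class 1 -> 1), actual -> predicted.
TAB5 = {
    "QUEST":  (700, 222, 273, 426),
    "CRUISE": (542, 380, 200, 499),
    "LOTUS":  (695, 227, 314, 386),
    "PLUS":   (616, 306, 200, 499),
}

def summarize(tn, fp, fn, tp):
    """Return (specificity, sensitivity, overall accuracy) as fractions."""
    specificity = tn / (tn + fp)   # share of class 0 classified correctly
    sensitivity = tp / (fn + tp)   # share of class 1 classified correctly
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    return specificity, sensitivity, accuracy

for method, counts in TAB5.items():
    spec, sens, acc = summarize(*counts)
    print(f"{method:7s} class 0: {spec:6.2%}  class 1: {sens:6.2%}  overall: {acc:6.2%}")
```

Under this reading QUEST and PLUS have the highest derived overall accuracy on the test sample, which is in line with the recommendation above.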

4. CONCLUSIONS

According to L. Breiman et al. (1984) there are at least two main objectives of a classification task: 1) to get as accurate a prediction as possible on unseen data and 2) to gain understanding and insight into the predictive structure of the data.

The results obtained from classification and logistic regression trees are not very good in terms of accuracy. One of the reasons may be that we focused only on preoperative risk factors and did not take into account events during the intraoperative and immediate postoperative period that can affect the outcome after CABG.

However, the results are better than those obtained from classical multivariate statistical analysis on the learning sample (multiple logistic regression model: 62.5% of correct classifications for class 0 and 59.75% for class 1; discriminant analysis: 68.7% and 55.03% for class 0 and class 1 respectively). The results for the test set are even worse.

Moreover, tree-based models have some other advantages over many traditional statistical methods:

1) no requirement of knowledge of the variable distribution;

2) ability to deal with large data sets, high dimensionality, mixed data types, missing values and outliers;

3) a direct and intuitive way of interpretation (a hierarchy of questions is asked and the final decision depends on the answers to all the previous questions; the predicted classification of each patient as a class 0 or class 1 member can be made from a few simple "if-then" logical conditions);

4) reduction of the cost of the research by selecting only some important variables for splitting nodes, so that each new object can be described by a few risk factors;

5) ability to make sense of the data.

Recursive partitioning can be recommended as a supplement to classical statistical methods such as discriminant analysis or logistic regression. It identifies subgroups with different risk and may also uncover relationships between variables in different parts of the measurement space that may be overlooked in the traditional analysis.


REFERENCES

Breiman L., Friedman J., Olshen R., Stone C. (1984), Classification and Regression Trees, CRC Press, London.

Chan K.-Y., Loh W.-Y. (2004), LOTUS: An Algorithm for Building Accurate and Comprehensible Logistic Regression Trees, “Journal of Computational and Graphical Statistics”, 13, Issue 4, 826-852.

Domański Cz., Iwaszkiewicz-Zasłonka A., Jaszewski R., Zasłonka J. (red.) (2003), Zastosowanie metod statystycznych w badaniach pacjentów z chorobą niedokrwienną serca leczonych operacyjnie, Wydawnictwo Uniwersytetu Łódzkiego, Łódź.

Gatnar E. (2001), Nieparametryczna metoda dyskryminacji i regresji, Wydawnictwo Naukowe PWN, Warszawa.

Higgins T. L., Estafanous F. G., Loop F. D., Beck G. J., Blum J. M., Paranandi L. (1992), Stratification of Morbidity and Mortality Outcome by Preoperative Risk Factors in Coronary Artery Bypass Patients. A Clinical Severity Score, JAMA (May 6), 267, 17, 2344-2348.

Kim H., Loh W.-Y. (2001), Classification Trees with Unbiased Multiway Splits, “Journal of the American Statistical Association”, 96, 598-604.

Lim T.-S. (2000), Polytomous Logistic Regression Trees, Department of Statistics, University of Wisconsin, Madison, PhD Thesis.

Loh W.-Y., Shih Y.-S. (1997), Split Selection Methods for Classification Trees, “Statistica Sinica”, 7, 815-840.

Misztal M. (2005), Wykorzystanie metody rekurencyjnego podziału do identyfikacji grup ryzyka operacyjnego pacjentów z chorobą wieńcową, [in:] Klasyfikacja i analiza danych - teoria i zastosowania, K. Jajuga, M. Walesiak (red.), “Taksonomia”, 12, (Prace Naukowe Akademii Ekonomicznej we Wrocławiu nr 1076, Wydawnictwo AE we Wrocławiu), 330-338.

Nashef S. A., Roques F., Michel P., Gauducheau E., Lemeshow S., Salamon R. (1999), European System for Cardiac Operative Risk Evaluation (EuroSCORE), “European Journal of Cardiothoracic Surgery”, 16, 9-13.

Małgorzata Misztal

A PROPOSAL FOR USING SELECTED CLASSIFICATION AND REGRESSION TREE MODELS TO IDENTIFY OPERATIVE RISK GROUPS AMONG SURGICALLY TREATED PATIENTS WITH CORONARY ARTERY DISEASE

Classification and regression trees are among the most popular classification methods, mainly because of their ease of interpretation and the clear visual presentation of results. They are therefore widely used to solve decision problems in various fields of science.

The aim of the study was to identify preoperative risk factors associated with the occurrence of intraoperative and postoperative complications among surgically treated patients with coronary artery disease.

In addition, an attempt was made to define decision rules that could allow a patient to be assigned to one of the distinguished operative risk groups on the basis of the patient's preoperative characteristics.

The classification rules were built using the recursive partitioning method. The analysis included the QUEST and CRUISE algorithms, which build classification trees, and the LOTUS and PLUS algorithms, which combine recursive partitioning of the feature space with logistic regression analysis.