Imputation of Missing Data Using R Package


ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 269, 2012


Małgorzata Misztal*

IMPUTATION OF MISSING DATA USING R PACKAGE

Abstract. Missing data are quite common in practical applications of statistical methods. Imputation is a general statistical method for the analysis of incomplete data sets. The goal of the paper is to review selected imputation techniques. Special attention is paid to methods implemented in some packages working in the R environment. An example is presented to show how to handle missing values using a few procedures of single and multiple imputation implemented in R.

Key words: missing values, single imputation, multiple imputation, R-project.

I. INTRODUCTION

Incomplete data are quite common in practical applications of statistical methods. When dealing with data sets that contain missing values, researchers often discard every observation with any missing value and perform a complete case analysis. This can lead to biased estimates, incorrect standard errors and incorrect inferences.

Another way to deal with missing data is to impute all missing values before analysis, using single or multiple imputation methods.

The goal of the paper is to review selected imputation techniques implemented in some packages working in the R environment. An example is presented to show how to handle missing values using different imputation methods implemented in R.

II. IMPUTATION PROCEDURES

Whatever method is used to deal with missing values, it is important to understand why the data are missing. Little and Rubin (2002) described three missing data mechanisms: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR).

According to Molenberghs and Kenward (2007, p. 4), the MCAR mechanism potentially depends on observed covariates, but not on observed or unobserved outcomes. The MAR mechanism depends on the observed outcomes and perhaps also on the covariates, but not on unobserved measurements. Finally, the NMAR mechanism depends on unobserved measurements, perhaps in addition to dependencies on covariates and on observed outcomes.

Under the MCAR mechanism the observed values are essentially a random sample of the full data set, so complete case analysis gives unbiased estimates of what the full data set would have shown, although with a smaller sample and therefore less precision.
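The difference between the mechanisms can be illustrated with a small simulation. The sketch below uses base R only and simulated data (not the data analyzed in this paper); the 30% missingness rate and the logistic form of the MAR mechanism are assumptions made purely for illustration.

```r
# Sketch: complete case means under MCAR and MAR missingness (simulated data)
set.seed(1)
n <- 10000
x <- rnorm(n)                      # fully observed covariate
y <- 2 + x + rnorm(n)              # outcome that will receive missing values

# MCAR: every y has the same 30% chance of being missing
miss_mcar <- runif(n) < 0.3
# MAR: the chance of a missing y depends only on the observed x
miss_mar <- runif(n) < plogis(2 * x)

mean_full <- mean(y)
mean_mcar <- mean(y[!miss_mcar])   # complete cases under MCAR
mean_mar  <- mean(y[!miss_mar])    # complete cases under MAR

round(c(full = mean_full, MCAR = mean_mcar, MAR = mean_mar), 3)
```

In this setting the complete case mean under MCAR stays close to the full data mean, while under MAR it is pulled downwards because cases with large x (and hence large y) are more often missing.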

Under the assumption of a MAR mechanism, missing data can be handled with (among others) imputation-based or model-based procedures.

In imputation-based techniques the missing values are filled in (using single or multiple imputation methods) and the completed data are analyzed by standard statistical methods. Some details are given below.

In model-based procedures one defines a model for the observed data; inferences are based on the likelihood or posterior distribution under that model, with parameters estimated by procedures such as maximum likelihood (see Little and Rubin (2002) for details).

Since imputations are means or draws from a predictive distribution of the missing values, there is a need for a method of creating such a predictive distribution for the imputation based on the observed data. Little and Rubin (2002) state that there are two approaches to generating this distribution:

1. Explicit modeling – where the predictive distribution is based on a formal statistical model (e.g. multivariate normal);

2. Implicit modeling – where the focus is on the algorithm, which implies an underlying model.

The most popular explicit modeling methods are:

(1) mean/mode imputation – for any continuous variable missing values are imputed using the mean of the observed values; for categorical variables the mode is used;

(2) conditional mean imputation (regression imputation) – missing values are replaced by predicted values from a regression model relating the predictor with missing values to all other predictors; least squares, logistic and ordinal regressions are used with continuous, binary and ordered categorical predictors, respectively;

(3) stochastic regression imputation – missing values are imputed by predicted values from a regression model plus a residual.
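The difference between methods (2) and (3) can be sketched in base R as follows. The data are simulated and the variable names (x, y, dat) are illustrative assumptions; the residual is drawn from a normal distribution with the estimated residual standard deviation, which is one common choice rather than the only possible one.

```r
# Sketch: regression imputation versus stochastic regression imputation
set.seed(2)
n <- 200
x <- rnorm(n, mean = 50, sd = 10)
y <- 10 + 0.5 * x + rnorm(n, sd = 5)
y[sample(n, 40)] <- NA                 # make 40 values of y missing
dat <- data.frame(x = x, y = y)
miss <- is.na(dat$y)

fit <- lm(y ~ x, data = dat)           # lm() drops incomplete rows by default
pred <- predict(fit, newdata = dat[miss, , drop = FALSE])

# (2) conditional mean imputation: imputed values lie exactly on the fitted line
y_reg <- dat$y
y_reg[miss] <- pred

# (3) stochastic regression imputation: prediction plus a random residual
y_sreg <- dat$y
y_sreg[miss] <- pred + rnorm(sum(miss), sd = summary(fit)$sigma)

c(sd_regression = sd(y_reg[miss]), sd_stochastic = sd(y_sreg[miss]))
```

Because the values imputed by method (2) carry no residual variation, they understate the variability of y; method (3) restores it by adding the random residual.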

The most popular implicit modeling methods are:

(1) hot deck imputation – missing values are imputed using sampling with replacement from the observed data;

(2) substitution – nonresponding units are replaced with alternative units not selected into the sample;


(3) cold deck imputation – missing values are filled in by a constant value from an external source;

(4) predictive mean matching – a combination of regression imputation and the hot deck method: the method starts by regressing the variable to be imputed, Y, on a set of predictors for cases with complete data; on the basis of this regression model predicted values are generated for both the missing and non-missing cases; then, for each case with missing data, a set of cases with complete data whose predicted values of Y are “close” to the predicted value for the case with missing data is found, and one case is randomly chosen from this set; its Y value is used to impute the missing case (see Allison 2002).
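Hot deck imputation and predictive mean matching can be sketched in a few lines of base R. The example uses simulated data; the number of candidate donors (k = 5) and the absolute difference of predicted values as the measure of "closeness" are assumptions of this sketch, not settings taken from any particular package.

```r
# Sketch: hot deck imputation and predictive mean matching (PMM)
set.seed(3)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[sample(n, 50)] <- NA
miss <- is.na(y)

# Hot deck: draw donors with replacement from the observed values
y_hot <- y
y_hot[miss] <- sample(y[!miss], sum(miss), replace = TRUE)

# PMM: regress y on x using the complete cases, predict for all cases
fit <- lm(y ~ x)
pred_all <- predict(fit, newdata = data.frame(x = x))
pred_obs <- pred_all[!miss]
y_obs <- y[!miss]
k <- 5                                          # size of the donor pool (assumed)

y_pmm <- y
y_pmm[miss] <- sapply(pred_all[miss], function(p) {
  donors <- order(abs(pred_obs - p))[1:k]       # k closest complete cases
  y_obs[donors[sample.int(k, 1)]]               # borrow one donor's observed y
})

summary(y_pmm)
```

Both methods impute only values that were actually observed, which keeps the imputations within the range of the data.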

Single imputation does not take into account the uncertainty in the imputations. For this reason multiple imputation (MI) is recommended as an appropriate way of handling missingness in data. The multiple imputation process consists of three steps (Yu et al. 2007):

I. generate m>1 imputed data sets by filling in the missing values with plausible values;

II. perform standard analyses on each of the m imputed data sets;

III. combine the results from the m analyses.
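The three steps can be sketched in base R on simulated data. In this sketch the imputations come from a bootstrap followed by stochastic regression imputation, which is only one simple way of generating plausible values, and the results are combined with the usual rules attributed to Rubin (pooled estimate, within- and between-imputation variance); all object names are illustrative assumptions.

```r
# Sketch: the three steps of multiple imputation with m = 5
set.seed(4)
n <- 300
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[sample(n, 90)] <- NA
dat <- data.frame(x = x, y = y)
m <- 5

# Step I: generate m completed data sets
impute_once <- function(d) {
  obs <- d[!is.na(d$y), ]
  boot <- obs[sample(nrow(obs), replace = TRUE), ]   # reflects parameter uncertainty
  fit <- lm(y ~ x, data = boot)
  na_rows <- is.na(d$y)
  d$y[na_rows] <- predict(fit, newdata = d[na_rows, , drop = FALSE]) +
    rnorm(sum(na_rows), sd = summary(fit)$sigma)
  d
}
completed <- replicate(m, impute_once(dat), simplify = FALSE)

# Step II: perform the same analysis on each completed data set
fits <- lapply(completed, function(d) lm(y ~ x, data = d))
est <- sapply(fits, function(f) coef(f)[["x"]])
var_within <- sapply(fits, function(f) vcov(f)["x", "x"])

# Step III: combine the m results
q_bar <- mean(est)                   # pooled estimate
u_bar <- mean(var_within)            # within-imputation variance
b <- var(est)                        # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b     # total variance
c(estimate = q_bar, se = sqrt(t_var))
```

The between-imputation term b is what single imputation ignores; it enlarges the standard error to reflect the uncertainty about the imputed values.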

According to van Buuren and Groothuis-Oudshoorn (2011) there are two general approaches to multiple imputation: joint modeling (JM), proposed by Schafer (1997), and fully conditional specification (FCS), developed by van Buuren (2007).

Joint modeling entails specifying a multivariate distribution for the missing data and drawing imputations from their conditional distributions by Markov Chain Monte Carlo (MCMC) techniques (e.g. data augmentation).

FCS is based on an iterative process that involves specifying a conditional distribution for each incomplete variable. It does not explicitly assume a particular multivariate distribution, but assumes that one exists and that draws can be generated from it using Gibbs sampling (see Yu et al. 2007). The imputed values can be either predicted values sampled from the posterior distribution of the incomplete variable or, with predictive mean matching, the observed value from the complete case whose predicted value is closest to that of the incomplete case.
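The iterative idea behind FCS can be sketched in base R for two incomplete continuous variables. This is a deliberately simplified illustration on simulated data: each conditional model is a linear regression, imputations are predictions plus a normal residual, and the regression parameters are not drawn from their posterior distribution as a full implementation would do. The number of iterations (10) is an assumption.

```r
# Sketch: a simplified chained-equations loop for two incomplete variables
set.seed(5)
n <- 300
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n, sd = 0.8)
x3 <- 0.5 * x1 - 0.4 * x2 + rnorm(n, sd = 0.8)
dat <- data.frame(x1, x2, x3)
dat$x2[sample(n, 60)] <- NA
dat$x3[sample(n, 60)] <- NA
miss <- is.na(dat)                       # remember where values were missing
incomplete <- c("x2", "x3")

# Starting values: random draws from the observed values of each variable
imp <- dat
for (v in incomplete) {
  imp[miss[, v], v] <- sample(dat[!miss[, v], v], sum(miss[, v]), replace = TRUE)
}

# Cycle through the incomplete variables, one conditional model for each
for (iter in 1:10) {
  for (v in incomplete) {
    form <- reformulate(setdiff(names(imp), v), response = v)
    fit <- lm(form, data = imp[!miss[, v], ])        # rows where v was observed
    pred <- predict(fit, newdata = imp[miss[, v], ])
    imp[miss[, v], v] <- pred + rnorm(sum(miss[, v]), sd = summary(fit)$sigma)
  }
}

colSums(is.na(imp))
```

Repeating the whole procedure m times with different random seeds would give m completed data sets for a multiple imputation analysis.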

The MCAR and MAR mechanisms are called ignorable, and many techniques exist for handling ignorable missing data.

The NMAR mechanism is called non-ignorable and requires a different and more complex approach, e.g. selection models or pattern-mixture models (see details in Allison 2002, Little and Rubin 2002 or Molenberghs and Kenward 2007).


III. IMPUTATION SOFTWARE

Imputation techniques are implemented in some statistical packages. SOLAS (Statistical Solutions Inc., Saugus, MA, USA) is a dedicated software package designed for handling missing data and performing multiple imputations.

Several standard statistical packages (SAS, SPSS, STATA and R-project) also offer built-in and user-written programs for dealing with missing data. The performance of these packages is compared, for example, by Yu et al. (2007) or by Horton and Kleinman (2007). In this paper only R-project is considered.

In R missing values are indicated by NAs. There are (at least) 11 packages working in the R environment that handle missing data: Amelia II, Hmisc, mi, mice, yaImpute, mix, cat, norm, pan, monomvn, mvnmle. Two further packages, mitools and VIM, can be useful for combining the results from multiple imputations and for exploring the data and the structure of the missing values. A short description of every package is presented in Table 1.
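Before any package is used, the NA values can be inspected with base R functions. The small data frame below is made up for illustration only; its column names merely imitate the naming used later in the example.

```r
# Sketch: inspecting missing values (NA) with base R
dat <- data.frame(
  X1 = c(25, NA, 41, 33),
  X2 = c(1200, 800, NA, NA),
  Y  = c(0, 1, 0, 1)
)

is.na(dat$X1)                # logical flag for every value of X1
colSums(is.na(dat))          # number of NAs per variable
mean(is.na(dat))             # overall share of missing values
complete.cases(dat)          # TRUE for rows without any NA
nrow(na.omit(dat))           # rows left for a complete case analysis
mean(dat$X1)                 # NA: most summaries propagate missing values
mean(dat$X1, na.rm = TRUE)   # mean of the observed values only
```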

Some of the packages mentioned above are used in an example.

IV. EXAMPLE

Consider a data set of 467 people who were granted a consumer credit. The aim of the study was to classify the borrowers into two risk classes: bad (defaulted loans) and good (paid off loans).

There were 6 independent variables (age, loan amount, borrower’s seniority in months, average income of the last three months, monthly installment, loan period in months). Decision rules were established on the basis of a logistic regression model.

From the complete data set of 467 objects, 5.72% of values were randomly removed and replaced by NAs.
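The way such a test data set can be produced is sketched below in base R. The credit data are not available here, so a simulated stand-in with six numeric columns is used; it is also an assumption of the sketch that the values are removed from the six predictors only and completely at random.

```r
# Sketch: removing a given share of values completely at random
set.seed(6)
n <- 467
full <- data.frame(matrix(rnorm(n * 6), nrow = n,
                          dimnames = list(NULL, paste0("X", 1:6))))

share <- 0.0572
cells <- n * ncol(full)                          # number of data cells
drop <- sample(cells, round(share * cells))      # cell positions to blank out

mat <- as.matrix(full)
mat[drop] <- NA
cred_sim <- as.data.frame(mat)

mean(is.na(cred_sim))             # realized share of missing cells
sum(complete.cases(cred_sim))     # rows without any missing value
```

Even a small share of missing cells removes a much larger share of rows, because a row is lost as soon as any one of its values is missing.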

Data with missing values are stored in the cred.txt file and read into R using the command:

> cred=read.table("C:/Documents and Settings/dane/cred.txt", header=TRUE)

Fitting a logistic regression model to the complete original data set produces the results presented in Table 2.

After discarding observations with any missing value, 294 cases remain for complete case analysis. The results from complete case analysis using logistic regression are also summarized in Table 2. The Design package was used to estimate the logistic regression model coefficients.
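A complete case analysis of this kind can be sketched with base R alone. The example below uses simulated data and glm() instead of the Design package, so it illustrates the idea rather than reproducing the results of Table 2; all names and parameter values are assumptions.

```r
# Sketch: complete case analysis with a logistic regression model
set.seed(7)
n <- 467
sim <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
sim$Y <- rbinom(n, 1, plogis(-0.5 + 0.8 * sim$X1))
sim$X1[sample(n, 40)] <- NA
sim$X2[sample(n, 40)] <- NA

cc <- na.omit(sim)                        # keep only rows without any NA
nrow(cc)                                  # number of complete cases

fit_cc <- glm(Y ~ X1 + X2, family = binomial(link = "logit"), data = cc)
summary(fit_cc)$coefficients              # estimates, SEs, z values, p-values
```

Note that glm() applied directly to the incomplete data would silently perform the same complete case analysis, because its default treatment of missing values is to omit incomplete rows.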

Table 1. Handling missing data with R – basic information

Package: Amelia II
Version / Date: 1.2-18 / 2010-11-04
Title: Amelia II: A Program for Missing Data
Authors: James Honaker, Gary King, Matthew Blackwell - Harvard University
Description: Uses a bootstrap + EM algorithm to impute missing values from a dataset and produces multiple output datasets for analysis
Basic command: amelia(x, m = 5, p2s = 1, frontend = FALSE, idvars = NULL, ts = NULL, cs = NULL, polytime = NULL, splinetime = NULL, intercs = FALSE, lags = NULL, leads = NULL, startvals = 0, tolerance = 0.0001, logs = NULL, sqrts = NULL, lgstc = NULL, noms = NULL, ords = NULL, incheck = TRUE, collect = FALSE, arglist = NULL, empri = NULL, priors = NULL, autopri = 0.05, emburn = c(0,0), bounds = NULL, max.resample = 100, ...)

Package: Hmisc
Version / Date: 3.8-3 / 2010-09-08
Title: Harrell Miscellaneous
Authors: Frank E Harrell Jr - Vanderbilt University School of Medicine
Description (1): Multiple Imputation using Additive Regression, Bootstrapping, and Predictive Mean Matching
Basic command (1): aregImpute(formula, data, subset, n.impute=5, group=NULL, nk=3, tlinear=TRUE, type=c('pmm','regression'), match=c('weighted','closest'), fweighted=0.2, curtail=TRUE, boot.method=c('simple', 'approximate bayesian'), burnin=3, x=FALSE, pr=TRUE, plotTrans=FALSE, tolerance=NULL, B=75)
Description (2): Transformations/Imputations using Canonical Variates
Basic command (2): transcan(x, method=c("canonical","pc"), categorical=NULL, asis=NULL, nk, imputed=FALSE, n.impute, boot.method=c('approximate bayesian', 'simple'), trantab=FALSE, transformed=FALSE, impcat=c("score", "multinom", "rpart", "tree"), mincut=40, inverse=c('linearInterp','sample'), tolInverse=.05, pr=TRUE, pl=TRUE, allpl=FALSE, show.na=TRUE, imputed.actual=c('none','datadensity','hist','qq','ecdf'), iter.max=50, eps=.1, curtail=TRUE, imp.con=FALSE, shrink=FALSE, init.cat="mode", nres=if(boot.method=='simple')200 else 400, data, subset, na.action, treeinfo=FALSE, rhsImp=c('mean','random'), details.impcat='', ...)

Package: mice
Version / Date: 2.4 / 2010-10-18
Title: Multivariate Imputation by Chained Equations
Authors: Stef van Buuren (TNO Quality of Life, Leiden + University of Utrecht) & Karin Groothuis-Oudshoorn (Roessingh RD, Enschede + University Twente)
Description: Multiple Imputation using Fully Conditional Specification
Basic command: mice(data, m = 5, method = vector("character",length=ncol(data)), predictorMatrix = (1 - diag(1, ncol(data))), visitSequence = (1:ncol(data))[apply(is.na(data),2,any)], post = vector("character", length = ncol(data)), defaultMethod = c("pmm","logreg","polyreg"), maxit = 5, diagnostics = TRUE, printFlag = TRUE, seed = NA, imputationMethod = NULL, defaultImputationMethod = NULL)

Package: mi
Version / Date: 0.09-11.03 / 2010-11-11
Title: Missing Data Imputation and Model Checking
Authors: Andrew Gelman, Jennifer Hill, Yu-Sung Su, Masanao Yajima, Maria Grazia Pittau - Columbia University
Description: Multiple Iterative Regression Imputation – the basic command generates a multiply imputed matrix, applying the elementary functions iteratively to the variables with missingness in the data, randomly imputing each variable and looping through until approximate convergence
Basic command: mi(object, info, n.imp = 3, n.iter = 30, R.hat = 1.1, max.minutes = 20, rand.imp.method = "bootstrap", run.past.convergence = FALSE, seed = NA, check.coef.convergence = FALSE, add.noise = noise.control())

Package: yaImpute
Version / Date: 1.0-12 / 2010-09-01
Title: yaImpute: An R Package for k-NN Imputation
Authors: Nicholas L. Crookston & Andrew O. Finley - Michigan State University
Description: Performs popular nearest neighbor routines for imputation
Basic command: Find K nearest neighbors: yai(x=NULL, y=NULL, data=NULL, k=1, noTrgs=FALSE, noRefs=FALSE, nVec=NULL, pVal=.05, method="msn", ann=TRUE, mtry=NULL, ntree=500, rfMode="buildClasses"). Impute variables from references to targets: impute(object, ancillaryData=NULL, method="closest", method.factor=method, k=NULL, vars=NULL, observed=TRUE, ...)

Package: mix
Version / Date: 1.0-8 / 2010-01-03
Title: Estimation/multiple Imputation for Mixed Categorical and Continuous Data
Authors: Joseph L. Schafer - The Pennsylvania State University
Description: Imputes Missing Data Under General Location Model
Basic command: imp.mix(s, theta, x)

Package: norm
Version / Date: 1.0-9.2 / 2010-04-29
Title: Analysis of multivariate normal datasets with missing values
Authors: Ported to R by Alvaro A. Novo. Original by Joseph L. Schafer
Description: Imputes missing multivariate normal data
Basic command: imp.norm(s, theta, x)

Package: cat
Version / Date: 0.0-6.2 / 2009-07-28
Title: Analysis of categorical-variable datasets with missing values
Authors: Ported to R by Ted Harding and Fernando Tusell. Original by Joseph L. Schafer
Description: Imputes missing categorical data - performs single random imputation of missing values in a categorical dataset under a user-supplied value of the underlying cell probabilities
Basic command: imp.cat(s, theta)

Package: pan
Version / Date: 0.2-6 / 2009-04-19
Title: Multiple imputation for multivariate panel or clustered data
Authors: Joseph L. Schafer - The Pennsylvania State University
Description: Imputation of multivariate panel or cluster data using the Gibbs sampler algorithm
Basic command: pan(y, subj, pred, xcol, zcol, prior, seed, iter=1, start)

Package: monomvn
Version / Date: 1.8-3 / 2010-04-23
Title: Estimation for multivariate normal and Student-t data with monotone missingness
Authors: Robert B. Gramacy - University of Chicago
Description: Maximum likelihood estimation of the mean and covariance matrix of multivariate normal (MVN) distributed data with a monotone missingness pattern
Basic command: monomvn(y, pre = TRUE, method = c("plsr", "pcr", "lasso", "lar", "forward.stagewise", "stepwise", "ridge", "factor"), p = 0.9, ncomp.max = Inf, batch = TRUE, validation = c("CV", "LOO", "Cp"), obs = FALSE, verb = 0, quiet = TRUE)

Package: mvnmle
Version / Date: 0.1-8 / 2009-04-17
Title: ML estimation for multivariate normal data with missing values
Authors: Kevin Gross, with help from Douglas Bates, North Carolina State University
Description: Finds the maximum likelihood estimate of the mean vector and variance-covariance matrix for multivariate normal data with missing values
Basic command: mlest(data, ...)

Package: mitools
Version / Date: 2.0.1 / 2010-05-07
Title: Tools for multiple imputation of missing data
Authors: Thomas Lumley - University of Auckland
Description: Tools to perform analyses and combine results from multiple-imputation datasets
Basic command: MIcombine(results, variances, call=sys.call(), df.complete=Inf, ...)

Package: VIM
Version / Date: 1.4.2 / 2010-10-20
Authors: Matthias Templ, Andreas Alfons, Alexander Kowarik - Vienna University of Technology
Description: Package introduces new tools for the visualization of missing values in R, which can be used for exploring the data and the structure of the missing values
Basic command: A lot of commands for visualization and exploring missing data

Source: Self-prepared on the basis of manuals available on http://www.r-project.org/.

Table 2. The results of using logistic regression model – original data and complete case analysis.

                 Complete original data set           Complete case analysis
                 (no missing values, n=467)           (no missing values, n=294)

Variables        Coeff.      SE       p-value         Coeff.      SE       p-value

Intercept       -0.15900    0.5967    0.7899         -0.36000    0.8438    0.6696
X1              -0.01883    0.0110    0.0855         -0.00443    0.0137    0.7466
X2              -0.00004    0.0002    0.8060         -0.00009    0.0003    0.7660
X3              -0.00507    0.0017    0.0036         -0.00549    0.0029    0.0612
X4              -0.00059    0.0002    0.0006         -0.00066    0.0002    0.0045
X5               0.00392    0.0026    0.1282          0.00343    0.0043    0.4273
X6               0.05681    0.0220    0.0097          0.04718    0.0315    0.1346

Source: Author’s calculations.

The results of fitting the logistic regression model to some data sets obtained from using different strategies for dealing with missing data are summarized in Table 3. Seven procedures were employed for handling missing data. A short description and some examples of commands in R are presented below.

The most popular single imputation method, and the one most often used in practice, is mean imputation: for each continuous predictor missing values are imputed using the mean of the observed values. Assuming that the mean-imputed complete data set is stored in the file cred_mean.txt, the following list of commands gives the set of estimated regression coefficients and fitted probabilities:

> require(Design)

> cred_mean=read.table("C:/Documents and Settings/dane/cred_mean.txt", header=TRUE)

> cred.mean.lr=lrm(Y~X1+X2+X3+X4+X5+X6, data=cred_mean, method="lrm.fit")

> cred.mean.lr

> cred.mean.pred=predict.lrm(cred.mean.lr, type="fitted")
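The commands above assume that the mean-imputed file already exists. One way in which such a data set could be prepared is sketched below in base R; the small data frame, the helper impute_mean() and the output file name in the commented line are illustrative assumptions, not part of the original analysis.

```r
# Sketch: column-wise mean imputation of a data frame
cred_sim <- data.frame(
  Y  = c(1, 0, 0, 1, 0),
  X1 = c(23, NA, 45, 31, NA),
  X2 = c(1500, 2200, NA, 900, 1800)
)

# Replace the NAs of a numeric column by the mean of its observed values
impute_mean <- function(v) {
  if (is.numeric(v)) v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

cred_mean_sim <- as.data.frame(lapply(cred_sim, impute_mean))
cred_mean_sim

# Hypothetical export step, left commented out:
# write.table(cred_mean_sim, "cred_mean.txt", row.names = FALSE)
```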

The next single imputation method used in the example is the nearest neighbor search and imputation procedure, implemented in the yaImpute package. The complete data set can be obtained with the following commands:

> require(yaImpute)

> x=as.data.frame(cred[,"Y"]) # the list of variables measured on all observations

> y=cred[, c("X1", "X2", "X3", "X4", "X5", "X6")] # the list of variables with missing values

> cred.yai=yai(x=x, y=y, data=cred, method="euclidean") # the kNN search

> cred.yai.imp=impute(cred.yai) # imputation


Since single imputation methods suffer from the problem that tests and confidence intervals are distorted by overstated precision, multiple imputation procedures have been developed to alleviate this problem (Ambler et al. 2007). Three packages working in the R environment are used in our example: Amelia II, Hmisc and mice.

Multiple imputation using the Amelia package can be performed with the following list of commands:

> require(Amelia)

> bds=matrix(c(2,3,4,5,6,7,18,500,1,400,30,4,65,10000,320,4500,700,36), nrow=6, ncol=3) # setting the logical bounds for variables with missing values

> cred.aimp=amelia(cred, m=5, bounds=bds, max.resample=1000) # multiple imputation, m=5

> summary(cred.aimp) # summarizing the results

> write.amelia(cred.aimp, "C:/Documents and Settings/dane/cred_imp", format="csv") # writing the imputed data sets to file

To combine the results from multiple imputation data sets the Zelig package can be used:

> require(Zelig)

> cred.zelig=zelig(Y~X1+X2+X3+X4+X5+X6, data=cred.aimp$imputations, model="logit")

> summary(cred.zelig)

Multiple imputation performed by the Hmisc package is based on additive regression, bootstrapping and predictive mean matching techniques (the aregImpute function) or on transformations/imputations using canonical variates (the transcan function):

> require(Hmisc)

> cred.Himp=aregImpute(~Y+X1+X2+X3+X4+X5+X6, n.impute=5, data=cred)

> cred.H.fit=fit.mult.impute(Y~X1+X2+X3+X4+X5+X6, lm, cred.Himp, data=cred) # combining the results from multiple imputation

> summary(cred.H.fit)

> cred.Himp.t=transcan(~Y+X1+X2+X3+X4+X5+X6, method="canonical", n.impute=5, imputed=TRUE, data=cred)

> cred.H.t.fit=fit.mult.impute(Y~X1+X2+X3+X4+X5+X6, lm, cred.Himp.t, data=cred)

> summary(cred.H.t.fit)

The last package used in the example is mice. The multiple imputation by chained equations method, implemented in mice, uses regression models and Bayesian sampling to impute missing values conditional on other predictors. The following commands can be used to obtain the results:


> require(mice)

> cred.mice=mice(data=cred, m=5, seed=123) # multiple imputation by predictive mean matching

> cred.mice.fit=glm.mids(Y~X1+X2+X3+X4+X5+X6, family=binomial(link=logit), data=cred.mice) # applying glm() to a multiply imputed data set

> cred.mice.fit.pool=pool(cred.mice.fit) # pooling the results of m=5 repeated complete data analyses

> summary(cred.mice.fit.pool)

> cred.mice.sample=mice(data=cred, m=5, seed=123, imputationMethod="sample") # multiple imputation by simple random sampling

> cred.mice.sample.fit=glm.mids(Y~X1+X2+X3+X4+X5+X6, family=binomial(link=logit), data=cred.mice.sample)

> cred.mice.sample.fit.pool=pool(cred.mice.sample.fit)

> summary(cred.mice.sample.fit.pool)

All the results obtained from the described imputation techniques are presented in Table 3. The results of classifying borrowers into the risk groups based on their predicted probabilities are summarized in Table 4.

Table 3. The results of fitting logistic regression models to imputed data sets.

Imputation method / Variable      Coeff.      SE       p-value

Mean Imputation
  Intercept                       0.74652    0.5904    0.2061
  X1                             -0.01823    0.0112    0.1041
  X2                              0.00043    0.0002    0.0118
  X3                             -0.00487    0.0018    0.0060
  X4                             -0.00056    0.0002    0.0010
  X5                             -0.00290    0.0025    0.2478
  X6                              0.01046    0.0191    0.5832

kNN Imputation (yaImpute)
  Intercept                       0.91901    0.5196    0.0770
  X1                             -0.00773    0.0109    0.4788
  X2                              0.00031    0.0001    0.0182
  X3                             -0.00667    0.0018    0.0002
  X4                             -0.00060    0.0002    0.0004
  X5                             -0.00214    0.0020    0.2889
  X6                             -0.00135    0.0149    0.9274

Multiple Imputation by Bootstrapping and EM algorithm (Amelia II)
  Intercept                       0.34823    0.6398    0.5863
  X1                             -0.01907    0.0121    0.1186
  X2                              0.00020    0.0002    0.3632
  X3                             -0.00446    0.0018    0.0159
  X4                             -0.00056    0.0002    0.0014
  X5                              0.00054    0.0033    0.8683
  X6                              0.02901    0.0238    0.2228

Multiple Imputation by Additive Regression, Bootstrapping and Predictive Mean Matching techniques (Hmisc)
  Intercept                       0.49910    0.1459    0.0007
  X1                             -0.00381    0.0025    0.1229
  X2                              0.00002    0.0000    0.6680
  X3                             -0.00111    0.0004    0.0028
  X4                             -0.00012    0.0000    0.0013
  X5                              0.00046    0.0007    0.5191
  X6                              0.00942    0.0053    0.0751

Multiple Imputation by Canonical Variates (Hmisc)
  Intercept                       0.51020    0.1348    0.0002
  X1                             -0.00482    0.0024    0.0486
  X2                              0.00000    0.0000    0.9888
  X3                             -0.00103    0.0004    0.0047
  X4                             -0.00013    0.0000    0.0004
  X5                              0.00081    0.0006    0.2067
  X6                              0.01076    0.0050    0.0312

Multiple Imputation by Chained Equations using Predictive Mean Matching (mice)
  Intercept                       0.51010    0.1473    0.0006
  X1                             -0.00340    0.0026    0.1978
  X2                              0.00002    0.0000    0.6038
  X3                             -0.00114    0.0004    0.0071
  X4                             -0.00012    0.0000    0.0013
  X5                              0.00039    0.0007    0.5858

Multiple Imputation by Chained Equations using Simple Random Sampling (mice)
  Intercept                       0.62004    0.1255    0.0000
  X1                             -0.00396    0.0025    0.1075
  X2                              0.00007    0.0000    0.0295
  X3                             -0.00110    0.0004    0.0042
  X4                             -0.00012    0.0000    0.0009
  X5                             -0.00030    0.0005    0.5783
  X6                              0.00406    0.0037    0.2741

Table 4. Proportions of correctly classified objects.

Method                                                                                                      % of correct classifications

Original data set (no missing values)                                                                       64.88%
Complete Case Analysis                                                                                      62.24%
Mean Imputation                                                                                             64.03%
kNN Imputation (yaImpute)                                                                                   65.52%
Multiple Imputation by Bootstrapping and EM algorithm (Amelia II)                                           64.24%
Multiple Imputation by Additive Regression, Bootstrapping and Predictive Mean Matching techniques (Hmisc)   64.03%
Multiple Imputation by Canonical Variates (Hmisc)                                                           63.81%
Multiple Imputation by Chained Equations using Predictive Mean Matching (mice)                              64.88%
Multiple Imputation by Chained Equations using Simple Random Sampling (mice)                                63.38%

Since an example is presented (not a simulation study), no general conclusions can be drawn but, as we can see, the worst results are obtained from complete case analysis: only one coefficient in the logistic regression model is significant (at the 0.05 level) and the misclassification error rate is the highest. Imputation procedures lead to quite similar results concerning both the logistic regression model and the misclassification error rate.
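The percentage of correct classifications reported in Table 4 can be computed from the fitted probabilities of a logistic regression model. The sketch below uses simulated data and base R; the cut-off of 0.5 for assigning a case to class 1 is an assumption, since the classification threshold is not stated in the text.

```r
# Sketch: percentage of correctly classified objects from fitted probabilities
set.seed(8)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.3 + 1.2 * x))

fit <- glm(y ~ x, family = binomial(link = "logit"))
p_hat <- fitted(fit)                           # predicted probabilities
class_hat <- as.integer(p_hat > 0.5)           # assumed cut-off of 0.5

conf_mat <- table(observed = factor(y, levels = 0:1),
                  predicted = factor(class_hat, levels = 0:1))
pct_correct <- 100 * mean(class_hat == y)

conf_mat
round(pct_correct, 2)
```

The misclassification error rate is simply 100 minus this percentage.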

V. CONCLUDING REMARKS

The objective of the paper was to review selected imputation techniques. Special attention was paid to methods implemented in some packages working in the R environment. The goal of the example was only to show how to handle missing values using a few procedures implemented in R and not to compare any imputation techniques.


Ambler et al. (2007) presented the results of a simulation comparison of different imputation techniques for handling missing predictor values in a risk model based on logistic regression. They showed that missing data could affect the predictions from risk models and simply ignoring missing data and performing a complete case analysis could lead to substantial bias and poor predictions. Single imputation procedures improved the results but they did not allow for imputation uncertainty so the confidence intervals of the regression coefficients could be too narrow and p-values too small. The best way to handle missing data is multiple imputation. Multiple imputation techniques generally performed well and they should be recommended in practical applications.

REFERENCES

Allison P. D. (2002), Missing data, Series: Quantitative Applications in the Social Sciences 07-136, SAGE Publications, Thousand Oaks, London, New Delhi.

Ambler G., Omar R. Z., Royston P. (2007), A comparison of imputation techniques for handling missing predictor values in a risk model with a binary outcome, “Statistical Methods in Medical Research” 2007; 16: 277–298.

Crookston N. L., Finley A. O. (2008), yaImpute: An R Package for kNN Imputation, “Journal of Statistical Software”, January 2008, Volume 23, Issue 10.

Horton N. J., Kleinman K. P. (2007), Much Ado About Nothing: A Comparison of Missing Data Methods and Software to Fit Incomplete Data Regression Models, “The American Statistician” 2007, 61 (1): 79–90.

Kenward M. G., Carpenter J. (2007), Multiple imputation: current perspectives, “Statistical Methods in Medical Research” 2007; 16: 199–218.

Little R. J. A., Rubin D. B. (2002), Statistical Analysis with Missing Data, Wiley, New Jersey.

Molenberghs G., Kenward M. G. (2007), Missing Data in Clinical Studies, Wiley, England.

Schafer J. L. (1997), Analysis of Incomplete Multivariate Data, Chapman & Hall, New York.

Su Y.-S., Gelman A., Hill J., Yajima M. (2011), Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box, “Journal of Statistical Software”, in press.

van Buuren S., Groothuis-Oudshoorn K. (2011), MICE: Multivariate Imputation by Chained Equations in R, “Journal of Statistical Software”, in press.

Wayman J. C. (2003), Multiple Imputation for Missing Data: What Is It And How Can I Use It?, http://www.csos.jhu.edu/contact/staff/jwayman_pub/wayman_multimp_aera2003.pdf.

Yu L.-M., Burton A., Rivero-Arias O. (2007), Evaluation of software for multiple imputation of semi-continuous data, “Statistical Methods in Medical Research” 2007; 16: 243–258.

Małgorzata Misztal

IMPUTATION OF MISSING DATA USING THE R ENVIRONMENT

In practical applications of statistical methods, the problem of missing values in data sets often arises. In such situations one can use data imputation methods, which replace the missing data with specific values in order to obtain a complete data set.

The paper reviews data imputation methods and describes how the necessary computations can be carried out with packages available in the R environment that implement single and multiple imputation procedures.