

Acta Universitatis Lodziensis, Folia Oeconomica 4(337) 2018
www.czasopisma.uni.lodz.pl/foe/
ISSN 0208-6018, e-ISSN 2353-7663
DOI: http://dx.doi.org/10.18778/0208-6018.337.07

Joanna Trzęsiok

University of Economics in Katowice, Faculty of Finance and Insurance, Department of Economic and Financial Analysis, joanna.trzesiok@ue.katowice.pl

Outliers vs Robustness in Nonparametric Methods of Regression

Abstract: The article addresses the question of how robust methods of regression are against outliers in a given data set. In the first part, we presented the selected methods used to detect outliers. Then, we tested the robustness of three nonparametric methods of regression: PPR, POLYMARS, and RANDOM FORESTS. The analysis was conducted by applying simulation procedures to the data sets where outliers were detected. Contrary to a relatively common conviction about the robustness of nonparametric regression, the study revealed that the models built on the basis of complete data sets have a significantly lower predictive capability than models based on the sets from which outliers were discarded.

Keywords: outliers, robustness, nonparametric regression methods

JEL: C14


1. Introduction

The assumption of the homogeneity of a given data set is one of the key assumptions in regression analysis. Its adoption means that we treat the data used for analysis as a set of observations coming from the same population. In data sets, however, especially real data sets, there may be data points that are distant from other observations. They require particular attention, as they may cause the model based on such a data set to be inappropriate for the analysed phenomenon. Accordingly, it is highly likely that inference, prediction and decision making based on such a model will be erroneous.

Robustness is another complex problem. In the most general terms, the application of a robust regression method means that we obtain a model that follows the tendency manifested by the majority of observations. The robustness of regression, however, may be approached from a number of angles.

A regression method can be robust to:

1) the occurrence, in a training set, of distant (outlying) points which may disturb and significantly alter the equation of the regression function;

2) random disturbances in the value of a dependent variable (e.g. random measurement errors with a normal distribution);

3) the occurrence, in a training set, of insignificant variables that do not have an impact on the model and the value of a dependent variable;

4) sampling of a training set that is the basis for the construction of a given model;

5) the lack of values of some variables in a training set;

6) the method falling short of expectations.

While referring to the robustness of regression, we tend to equate it with the insensitivity of the model to the quality of data, so – primarily – with the presence of distant (outlying) observations in a training set. These may result from disturbances in the values of both a dependent variable and explanatory variables. This is the context in which we will discuss the robustness of the selected regression methods presented in this article.

The article attempts to identify distant observations using three criteria: Ward's cluster analysis, multidimensional scaling, and the Mahalanobis distance amended by Filzmoser, Maronna and Werner (2008). While the method applying the Mahalanobis distance to outlier detection is quite commonly used, the approach based on taxonomic analysis and multidimensional scaling is the author's original idea. However, the main goal of the article was not to identify outliers, but to verify the hypothesis about the robustness of nonparametric regression methods to the occurrence of outliers.


2. Outliers and their identification

The notion of an outlier does not have a single unequivocal definition in the literature. On the contrary, it is defined in many ways. This article adopts the definition proposed by Hawkins (1980), who argues that an outlier is "an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism".

In terms of the causes of their occurrence, outliers can be divided into (Rousseeuw, Leroy, 2003):

1) outliers originating from a number of different types of errors: measurement errors, errors involved in data collection and entry, deliberate dishonesty in reporting, unsuitable research methodology, poor sampling, or wrong assumptions;

2) outliers arising from a heavy‑tailed distribution;

3) influential observations which have a significant impact on a given model and may lead to interesting hypotheses.

The detection of outliers and the ways of handling them are important issues related to the notion of robustness in statistics (Trzpiot, 2013). The literature provides many approaches to the identification of outliers. The most popular ones are: a one‑dimensional quantile criterion (Tukey, 1977), methods based on Cook's distance (Cook, 1977), estimates based on the Mahalanobis distance (Healy, 1968), and the method involving the local outlier factor (Breunig, Kriegel, Ng, Sander, 2000). The criteria for outlier detection were discussed in detail, inter alia, in Trzęsiok (2014).
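To make the simplest of these approaches concrete, a minimal R sketch of the one‑dimensional quantile criterion follows; the function name and the multiplier k = 1.5 are conventional illustrative choices, not taken from the article.

```r
# One-dimensional quantile criterion (Tukey, 1977): values falling
# outside [Q1 - k*IQR, Q3 + k*IQR] are flagged as suspected outliers.
tukey_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75))
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr   # TRUE = suspected outlier
}
```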

A number of researchers showed interest in topics related to outliers and nonparametric regression. Outlier detection and identification were, for example, discussed by:

1) Majewska (2015), who, apart from classical methods, uses non‑traditional methods based on robust PCA in her work;

2) Batóg (2016), whose work is based on the comparison of methods that enable the identification of spatial outliers;

3) Ganczarek‑Gamrot (2016), who used electricity market data to present methods for detecting outliers within time series;

4) Trzęsiok (2014), who discussed outliers in the context of data quality.

In the context of robust regression, on the other hand, applications and comparisons of various robust methods, with particular emphasis on the regression depth concept, were proposed by Kosiorowski (e.g.: 2007; 2012).

This article uses three criteria.

1. Criterion based on the Mahalanobis distance (Healy, 1968):

$$MD(\mathbf{x}) = \sqrt{(\mathbf{x} - \hat{\boldsymbol{\mu}})^{T}\,\hat{\boldsymbol{\Sigma}}^{-1}(\mathbf{x} - \hat{\boldsymbol{\mu}})} \quad (1)$$

where $\hat{\boldsymbol{\mu}}$ is the mean vector, while $\hat{\boldsymbol{\Sigma}}$ is the variance‑covariance matrix:

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^{T} \quad (2)$$

According to this criterion, we treat an observation as an outlier when it is matched by a high value of MD(x) compared to the critical values in χ² distribution tables.

The major weakness of this method is that it draws on classical statistics, which are very sensitive to outliers; in consequence, the values of the MD measure cannot always be deemed reliable. For this reason, the literature proposes many modifications of the Mahalanobis distance. One such modification is the MD* approach, developed by Filzmoser, Maronna and Werner in 2008, which applies principal component analysis to outlier detection. This method is presented in detail in Filzmoser, Maronna, Werner (2008).
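A minimal R sketch of both variants follows. It assumes a numeric data matrix X with observations in rows; the cut‑off level 0.975 is an illustrative choice, and pcout() from the mvoutlier package is used here as an available implementation of the MD* approach of Filzmoser, Maronna and Werner (2008).

```r
library(mvoutlier)  # provides pcout(), an implementation of MD*

# Classical criterion: squared Mahalanobis distances, eqs. (1)-(2),
# compared with a chi-squared quantile (df = number of variables).
classical_outliers <- function(X, level = 0.975) {
  md2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
  md2 > qchisq(level, df = ncol(X))   # TRUE = suspected outlier
}

# Robust variant: pcout() returns final 0/1 weights, where 0 marks
# an observation flagged as an outlier by the MD* procedure.
mdstar_outliers <- function(X) {
  pcout(X, makeplot = FALSE)$wfinal01 == 0
}
```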

2. Ward's method, or hierarchical cluster analysis, is one of the most frequently applied agglomerative methods and one that yields the best results. It involves the successive merging of clusters into increasingly larger ones. The way the method works (just as in the case of all hierarchical methods) can be represented with a dendrogram, which allows the reconstruction of the classification process. A dendrogram also enables the visualisation and graphic representation of the results of clustering. Hierarchical methods, including Ward's method, were discussed in (Walesiak, Gatnar, 2009).

The application of clustering methods to outlier detection has attracted criticism in the literature (Breunig, Kriegel, Ng, Sander, 2000), due to their other, primary, goal. However, in this case we intended to apply several complementary criteria that also enable the visualisation of multidimensional observations. A sketch of this screening step follows.
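A minimal R sketch under stated assumptions: X is a numeric data matrix, the cut into 8 groups mirrors the choice made for the flats set later in the article, and the minimum cluster size is an illustrative threshold, not a value from the article.

```r
# Outlier screening with Ward's hierarchical clustering: after cutting
# the dendrogram into k groups, members of very small clusters are
# treated as outlier candidates.
ward_outliers <- function(X, k = 8, min_size = 25) {
  hc <- hclust(dist(X), method = "ward.D")   # Ward linkage, as in the figures
  plot(hc)                                   # dendrogram for visual inspection
  groups <- cutree(hc, k = k)
  sizes <- table(groups)
  groups %in% as.integer(names(sizes)[sizes < min_size])  # TRUE = candidate
}
```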

3. Multidimensional scaling is a method that allows the visualisation of relations between individual cases in a data set. It involves transforming the original observations into a space that has fewer dimensions (most frequently 2 or 3), so that the distances between the objects in the new coordinate system are as close as possible to the original distances between the relevant observations. This enables the identification of outliers in a lower‑dimensional space (e.g. a two‑dimensional one). The method also has the advantage of being able to generate a graphic representation of the analysis results. Multidimensional scaling was presented in more detail in, for example, (Walesiak, Gatnar, 2009). A short sketch follows.
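A minimal sketch using classical (metric) scaling via base R's cmdscale(); the article does not state which variant of scaling was used, so this choice is an assumption.

```r
# Project the observations to two dimensions and plot them; points
# lying far from the main cloud are inspected as outlier candidates.
mds_view <- function(X) {
  coords <- cmdscale(dist(X), k = 2)   # 2-D configuration approximating distances
  labs <- if (is.null(rownames(X))) seq_len(nrow(X)) else rownames(X)
  plot(coords, xlab = "Dimension 1", ylab = "Dimension 2",
       main = "Multidimensional scaling")
  text(coords, labels = labs, cex = 0.6, pos = 3)  # label each case
  invisible(coords)
}
```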

Outlier detection is not a simple task. Moreover, it is only the first step in the analysis. Outliers are not always a negative occurrence. They may result from a measurement error, yet they may also be influential observations, which should not be removed from a data set, since they may carry meaningful and potentially useful information. However, discovering the nature of an observation is a complex and difficult task, so the right decision seems to be to preserve outliers in a data set and apply robust statistical methods for further analysis. The question arises which methods are robust to the occurrence of outliers in a data set.

3. Regression methods used in the study

Robustness is of particular importance in the case of nonparametric regression models, which are characterised by high flexibility and the capacity for an adaptive and precise fit to data, accounting for variability caused by disturbances. The question arises how nonparametric models built on training sets disturbed by outliers behave.

In view of the above, nonparametric methods may generate models that are not robust to the occurrence of outliers in training sets, have poor predictive capabilities and, as a result, do not hold a substantive cognitive value for researchers. On the other hand, however, many of these methods have an in‑built regularisation mechanism which reduces the problem of overfitting a model to a training set. The mechanism involves adopting a certain compromise between the fit of a model and its complexity (Trzęsiok, 2011), which results in the increased predictive capabilities of the model. The question, however, arises to what extent the mechanism is effective and whether the methods are really robust to outliers.
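This fit–complexity compromise can be written generically as a penalised least‑squares criterion; the notation below is a standard illustration, not a formula taken from the article:

$$\hat{f} = \arg\min_{f}\left[\sum_{i=1}^{n}\big(y_i - f(\mathbf{x}_i)\big)^2 + \lambda\,J(f)\right],$$

where $J(f)$ measures the complexity of the model and $\lambda \ge 0$ controls the balance between fit and complexity.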

The study used three selected nonparametric methods that are frequently applied in comparative analyses and possess good predictive capabilities (Meyer, Leisch, Hornik, 2003); a fitting sketch in R follows the list:

1) projection pursuit regression PPR (Friedman, Stuetzle, 1981),

2) multivariate adaptive regression splines POLYMARS (Kooperberg, Bose, Stone, 1997),

3) random forests (Breiman, 2001).
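A minimal R sketch of fitting the three models, assuming a data frame dat whose dependent variable is named y. The package names match the standard R implementations of these methods (stats::ppr, polspline::polymars, randomForest), while the tuning values shown are illustrative, not those used in the study.

```r
library(polspline)     # polymars()
library(randomForest)  # randomForest()

# Projection pursuit regression (base R); nterms is an illustrative choice.
fit_ppr <- ppr(y ~ ., data = dat, nterms = 3)

# POLYMARS: response and (numeric) predictor columns are passed separately.
fit_pm <- polymars(dat$y, as.matrix(dat[, setdiff(names(dat), "y")]))

# Random forest with the default number of trees made explicit.
fit_rf <- randomForest(y ~ ., data = dat, ntree = 500)
```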

4. Research procedure

As mentioned above, the study aimed not only to detect outliers in data sets but also to test nonparametric methods for robustness to the occurrence of such observations. Accordingly, the analytical procedure applied in the study can be presented in the following steps:

1. Outlier detection:

– the three outlier detection criteria presented above were used to analyse the data sets,


– then, the majority rule was applied: an observation was classified as an outlier when it was detected as such by the majority of the three criteria.

2. The construction of nonparametric regression models:

– based on the entire original data set,

– based on the data set from which outliers were eliminated.

3. The comparison of the models in terms of their predictive capabilities, using the mean squared error MSE_CV calculated with the cross‑validation method (involving the breakdown of a data set into 10 parts); a sketch of this step follows the list.
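A minimal sketch of the 10‑fold cross‑validated error, assuming the same data frame dat with dependent variable y; the helper names fit_fun and pred_fun are hypothetical conveniences, not names from the article.

```r
# 10-fold cross-validated mean squared error: the data set is split
# into 10 parts; each part serves once as the held-out test fold.
mse_cv <- function(dat, fit_fun, pred_fun, folds = 10) {
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(dat)))
  errs <- sapply(seq_len(folds), function(f) {
    train <- dat[fold_id != f, ]
    test  <- dat[fold_id == f, ]
    model <- fit_fun(train)
    mean((test$y - pred_fun(model, test))^2)  # error on the held-out fold
  })
  mean(errs)
}

# Example with random forests:
# mse_cv(dat,
#        fit_fun  = function(d) randomForest(y ~ ., data = d),
#        pred_fun = function(m, d) predict(m, d))
```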

The robustness of the selected regression methods to the occurrence of outliers in a training set was tested on three data sets:

1) crime, proposed in (Agresti, Finlay, 2009); it is a set of real data on criminal activity in the US states (51 observations); it contains three outliers;

2) hbk, presented in (Rousseeuw, Leroy, 2003); it is a computer generated data set, containing 75 observations, 14 of which are outliers;

3) flats, a set of real data generated based on information about sale transactions of flats provided by the online service www.oferty.net; the data concern sale transactions completed from June 2007 to September 2009; the flats data set contains 747 observations described by 8 explanatory variables (5 of which are variables measured in interval or ratio scales). As for the flats set, we do not know the number of outliers, because it is a set of real data.

Ward's method does not yield unequivocal identifications of outliers in the crime set (apart from detecting the object DC – District of Columbia). Multidimensional scaling detects 3 outliers, whereas the MD* method – the Mahalanobis distance amended by Filzmoser, Maronna and Werner (2008) – identifies 4 such observation points. The final conclusion is that the following states are outliers: MS (Mississippi), DC (District of Columbia) and LA (Louisiana).

In the case of the hbk set, all three criteria indicated that the first 14 observations in the set were outliers.

In the flats set, Ward's method showed 23 outliers (they belong to the smallest of the classes created as a result of breaking down the set into 8 groups in accordance with the silhouette index). Multidimensional scaling identified 31 such observations, while the Mahalanobis distance amended by Filzmoser, Maronna and Werner (2008) – 68 outliers.

As mentioned above, we conducted two variants of the analysis for each set. First, the model was built based on the set containing the outliers; then the outliers were removed and a new model was constructed. In each case (for each set and each regression method), we calculated the cross‑validated mean squared error MSE_CV. The results are presented in Table 1.



Figure 1. The dendrogram for Ward’s method and the visualisation of multidimensional scaling for the crime set

Source: own computation


Figure 2. The dendrogram for Ward’s method and the visualisation of multidimensional scaling for the hbk set


Figure 3. The visualisation of multidimensional scaling for the flats set

Source: own computation

Table 1. The values of the mean squared errors MSE_CV, calculated for different regression models built on the data sets with and without outliers

Methods     crime     crime without outliers    hbk     hbk without outliers    flats     flats without outliers
PPR         78 236    31 311                    2.72    0.29                    11 321    3 566
POLYMARS    109 334   29 628                    1.74    0.33                    10 348    3 275
R.FORESTS   61 893    21 669                    0.81    0.22                    8 037     1 804

Source: own computation

While analysing the results for particular methods, presented in Table 1, we should compare the pairs of MSE_CV values obtained for the models constructed based on:

1) the set containing outliers, and

2) the set from which outliers were removed.


It is not important which model achieves the lowest values of MSE_CV, but how these values (in corresponding pairs) change as a result of removing outliers. Comparing the figures in columns 2 and 3, 4 and 5, as well as 6 and 7 of Table 1, we can observe that in each case there was a relatively large decrease in the value of the mean squared error, which means that none of the methods under consideration is robust to the occurrence of outliers in a training set.

5. Conclusion

The article presents selected outlier detection methods which enable the preliminary analysis of a data set and, as a result, can bring certain anomalies occurring in the set to a researcher's attention. However, we cannot be certain that these methods will detect all outliers in real data sets.

It is also worth emphasising that the occurrence of outliers does not mean the immediate necessity to remove them from a data set. On the contrary, they may have a significant but positive influence on a given model. Therefore, a good solution is to apply robust methods to the analysis of such a data set. This study tested three nonparametric regression methods – PPR, POLYMARS and RANDOM FORESTS – for robustness to outliers.

The studies on the topics related to outliers mentioned in Part 2 focused primarily on the identification and detection of these observations. This article was only the initial stage of the study, as it aimed to examine the properties of selected regression methods that are commonly considered robust. The results of the examination, however, clearly show that the selected regression methods achieve significantly lower values of the mean squared errors MSE_CV after the removal of outliers from the data sets. Thus, the research hypothesis proposed in the introduction was verified negatively and rejected. These nonparametric regression methods cannot be considered robust to the occurrence of outlying observations in a training set.


References

Agresti A., Finlay B. (2009), Statistical Methods for the Social Sciences, 4th ed., Pearson, New Jersey.

Batóg J. (2016), Identyfikacja obserwacji odstających w analizie skupień, [in:] K. Jajuga, M. Walesiak (eds.), Taksonomia 26. Klasyfikacja i analiza danych, "Prace Naukowe Uniwersytetu Ekonomicznego we Wrocławiu", no. 426, pp. 13–21.

Breiman L. (2001), Random Forests, "Machine Learning", no. 45, pp. 5–32.

Breunig M.M., Kriegel H.‑P., Ng R.T., Sander J. (2000), LOF: Identifying Density‑Based Local Outliers, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), Dallas.

Cook R.D. (1977), Detection of Influential Observations in Linear Regression, "Technometrics", no. 19(1), pp. 15–18.

Filzmoser P., Maronna R.A., Werner M. (2008), Outlier Identification in High Dimensions, "Computational Statistics & Data Analysis", no. 52, pp. 1694–1711.

Friedman J., Stuetzle W. (1981), Projection Pursuit Regression, "Journal of the American Statistical Association", no. 76, pp. 817–823.

Ganczarek‑Gamrot A. (2016), Obserwacje odstające na rynku energii elektrycznej, "Studia Ekonomiczne. Zeszyty Naukowe Uniwersytetu Ekonomicznego w Katowicach", no. 288, pp. 7–20.

Hawkins D. (1980), Identification of Outliers, Chapman and Hall, London.

Healy M.J.R. (1968), Multivariate Normal Plotting, "Applied Statistics", no. 17, pp. 157–161.

Kooperberg C., Bose S., Stone C. (1997), Polychotomous Regression, "Journal of the American Statistical Association", no. 92, pp. 117–127.

Kosiorowski D. (2007), O odpornej analizie regresji w ekonomii na przykładzie koncepcji głębi regresyjnej, "Przegląd Statystyczny", vol. 54, pp. 109–121.

Kosiorowski D. (2012), Statystyczne funkcje głębi w odpornej analizie ekonomicznej, Wydawnictwo UEK w Krakowie, Kraków.

Majewska J. (2015), Identification of Multivariate Outliers – Problems and Challenges of Visualization Methods, "Studia Ekonomiczne. Zeszyty Naukowe Uniwersytetu Ekonomicznego w Katowicach", no. 247, pp. 69–83.

Meyer D., Leisch F., Hornik K. (2003), The Support Vector Machine under Test, "Neurocomputing", vol. 1–2, no. 55, pp. 169–186.

Rousseeuw P., Leroy A. (2003), Robust Regression and Outlier Detection, John Wiley & Sons Inc., New York.

Trzęsiok J. (2011), Przegląd metod regularyzacji w zagadnieniach regresji nieparametrycznej, [in:] K. Jajuga, M. Walesiak (eds.), Taksonomia 18. Klasyfikacja i analiza danych, "Prace Naukowe Uniwersytetu Ekonomicznego we Wrocławiu", no. 176, pp. 330–339.

Trzęsiok M. (2014), Wybrane metody identyfikacji obserwacji oddalonych, [in:] K. Jajuga, M. Walesiak (eds.), Taksonomia 22. Klasyfikacja i analiza danych – teoria i zastosowania, "Prace Naukowe Uniwersytetu Ekonomicznego we Wrocławiu", no. 327, pp. 157–166.

Trzpiot G. (ed.) (2013), Wybrane elementy statystyki odpornej, Wydawnictwo Uniwersytetu Ekonomicznego w Katowicach, Katowice.

Tukey J.W. (1977), Exploratory Data Analysis, Addison‑Wesley, Boston.

Walesiak M., Gatnar E. (2009), Statystyczna analiza danych z wykorzystaniem programu R, Wydawnictwo Naukowe PWN, Warszawa.



© by the author, licensee Łódź University – Łódź University Press, Łódź, Poland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license CC‑BY (http://creativecommons.org/licenses/by/3.0/).

Received: 2016‑12‑17; verified: 2018‑04‑11; accepted: 2018‑06‑18.
