Searching for an Optimal MDS Procedure for Metric and Interval-Valued Data using mdsOpt R package

Academic year: 2021

Marek WALESIAK

Wroclaw University of Economics and Business, Wroclaw, Poland ORCID 0000-0003-0922-2323

marek.walesiak@ue.wroc.pl

Andrzej DUDEK

Wroclaw University of Economics and Business, Wroclaw, Poland ORCID 0000-0002-4943-8703

andrzej.dudek@ue.wroc.pl

Abstract

In multidimensional scaling (MDS) applied to a metric data matrix (interval, ratio) or an interval-valued data table, three approaches can be distinguished: classic-to-classic (for metric data), and symbolic-to-classic and symbolic-to-symbolic (for interval-valued data). The article presents the mdsOpt package, which helps to solve the problem of choosing the optimal MDS procedure. It uses two criteria for selecting the optimal MDS procedure: Kruskal's Stress-1 fit measure (I-Stress in the case of the symbolic-to-symbolic approach) and the Hirschman-Herfindahl HHI index, calculated using Stress per point values (interval stress per box values in the case of the symbolic-to-symbolic approach). In the first part three possible approaches are described, including the theoretical background of the methods and the relationships between the mdsOpt package and existing R packages. The second part explains the procedure and criteria for selecting the optimal MDS procedure for metric and interval-valued data.

The last part contains details on how to use the package and applications to real data sets.

Keywords: Multidimensional Scaling, Metric and Interval-Valued Data, Tourist Attractiveness, mdsOpt

Introduction

The article proposes a solution for choosing the optimal MDS procedure according to various scenarios. The novelty of the study presented in this article is related to a family of algorithms for selecting the optimal MDS procedure implemented in the mdsOpt package of the R program. The effectiveness of the algorithms is demonstrated using real data sets.

The Aim of Multidimensional Scaling

Classical MDS is a method that represents (dis)similarity in the data as distances in a low-dimensional space (typically 2 or 3 dimensions) in order to enable data exploration by means of visualization (Borg and Groenen (2005), p. 3). Classical MDS requires that each entry of the dissimilarity matrix be a single numerical value. Dissimilarity between object $i$ and object $k$ can be fuzzy (Groenen et al. (2006), p. 361). The fuzzy dissimilarity is represented by an interval, and each entry of the $n \times n$ dissimilarity matrix is an interval of values $[\delta_{ik}^{l}; \delta_{ik}^{u}]$, where $\delta_{ik}^{l}$ ($\delta_{ik}^{u}$) denotes the lower (upper) bound of the dissimilarity of objects $i$ and $k$ in $m$-dimensional space. MDS of interval dissimilarities represents the lower and upper bounds of dissimilarities as distances between hypercubes (rectangles in a two-dimensional space and cubes in a three-dimensional space). The dimensions are not directly observable. They can be treated as latent variables. MDS makes it possible to explain the similarities and differences between the analyzed objects.

MDS is a widely used technique in many areas, including psychology (Takane (2007)), sociology (Pinkley et al. (2005)), linguistics (Embleton et al. (2013)), marketing research (Cooper (1983)), tourism (Marcussen (2014)) and musicology (McAdams et al. (1995)).

The approaches in multidimensional scaling using the mdsOpt package

When MDS is applied to a metric data matrix (interval, ratio) or interval-valued data table by means of the mdsOpt package, three approaches can be distinguished:

1. Classic-to-classic – for metric data:

$$\mathbf{X} = [x_{ij}]_{n \times m} \rightarrow \mathbf{Z} = [z_{ij}]_{n \times m} \rightarrow [\delta_{ik}(\mathbf{Z})]_{n \times n} \rightarrow f: \delta_{ik}(\mathbf{Z}) \rightarrow d_{ik}(\mathbf{V}) \rightarrow \mathbf{V} = [v_{ij}]_{n \times q}, \qquad (1)$$

where: $x_{ij}$ ($z_{ij}$) – the value (the normalized value) of the j-th variable for the i-th object, $i = 1, \ldots, n$ – the number of the object, $j = 1, \ldots, m$ – the number of the variable, $[\delta_{ik}(\mathbf{Z})]_{n \times n}$ – a distance matrix (dissimilarities) between objects in an m-dimensional space (distances are calculated using e.g. city-block, Euclidean, Chebyshev, squared Euclidean), $[d_{ik}(\mathbf{V})]_{n \times n}$ – a distance matrix in a q-dimensional space ($q < m$), $f$ – function which maps distances in an m-dimensional space $\delta_{ik}(\mathbf{Z})$ into corresponding distances $d_{ik}(\mathbf{V})$ in a q-dimensional space, $\mathbf{V} = [v_{ij}]_{n \times q}$ – data matrix in a q-dimensional space.

The starting point of MDS in the classic-to-classic approach is a metric data matrix $\mathbf{X} = [x_{ij}]_{n \times m}$, for which observations are obtained from secondary data sources. It is a typical situation in socio-economic research. Methods of determining the distance matrix $[\delta_{ik}]_{n \times n}$ can be divided into direct methods (typically results from similarity ratings on pairs of objects, from rankings, or from card-sorting tasks) and indirect methods (see (1) and e.g. Borg and Groenen (2005), pp. 111-133).

2. Symbolic-to-classic – for interval-valued data:

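$$\mathbf{X} = [x_{ij}^{l}, x_{ij}^{u}]_{n \times m} \rightarrow \mathbf{Z} = [z_{ij}^{l}, z_{ij}^{u}]_{n \times m} \rightarrow [\delta_{ik}(\mathbf{Z})]_{n \times n} \rightarrow f: \delta_{ik}(\mathbf{Z}) \rightarrow d_{ik}(\mathbf{V}) \rightarrow \mathbf{V} = [v_{ij}]_{n \times q}, \qquad (2)$$

where: $[x_{ij}^{l}, x_{ij}^{u}]$ – the observation of the j-th interval-valued variable for the i-th object ($x_{ij}^{l} \le x_{ij}^{u}$), $x_{ij}^{l}$ ($x_{ij}^{u}$) – the lower (upper) bound of the interval, $\mathbf{Z} = [z_{ij}^{l}, z_{ij}^{u}]_{n \times m}$ – the normalized interval-valued data table in an m-dimensional space, $[\delta_{ik}(\mathbf{Z})]_{n \times n}$ – a distance matrix (dissimilarities) in an m-dimensional space (distances are calculated using distance measures for interval-valued data – see Table 3).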

3. Symbolic-to-symbolic – for interval-valued data:

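$$\mathbf{X} = [x_{ij}^{l}, x_{ij}^{u}]_{n \times m} \rightarrow \mathbf{Z} = [z_{ij}^{l}, z_{ij}^{u}]_{n \times m} \rightarrow [\delta_{ik}^{l}, \delta_{ik}^{u}] \rightarrow f: [\delta_{ik}^{l}, \delta_{ik}^{u}] \rightarrow [d_{ik}^{l}, d_{ik}^{u}] \rightarrow \mathbf{V} = [v_{ij}^{l}, v_{ij}^{u}]_{n \times q}, \qquad (3)$$

where: $\delta_{ik}^{l}$, $\delta_{ik}^{u}$ [$d_{ik}^{l}$, $d_{ik}^{u}$] – the lower and upper bound of the dissimilarity of objects i and k in an m-dimensional [q-dimensional] space, $f$ – function which represents the lower and upper bounds of the dissimilarities by minimum and maximum distances between rectangles (cubes in a three-dimensional space) as well as possible in the least-squares sense (Groenen et al. (2006), p. 363), $\mathbf{V} = [v_{ij}^{l}, v_{ij}^{u}]_{n \times q}$ – an interval-valued data table in q-dimensional space.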
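The classic-to-classic chain in (1) can be illustrated with a minimal base-R sketch. The data matrix is made up, and `cmdscale` (classical scaling from the stats package) is used here only as a simple stand-in for the `smacofSym` function that the article relies on; the result is therefore an illustration of the data flow, not of the article's procedure.

```r
# Toy metric data matrix X (n = 5 objects, m = 3 variables); values are made up.
X <- matrix(c(10, 200, 3.1,
              12, 180, 2.9,
              15, 260, 4.0,
               9, 150, 2.5,
              20, 300, 4.4),
            ncol = 3, byrow = TRUE,
            dimnames = list(paste0("o", 1:5), c("x1", "x2", "x3")))

Z     <- scale(X)                       # X -> Z (standardization, method n1)
delta <- dist(Z, method = "euclidean")  # Z -> [delta_ik], distances in m dimensions
V     <- cmdscale(delta, k = 2)         # stand-in for f: configuration in q = 2 dimensions
d     <- dist(V)                        # [d_ik], distances in the q-dimensional space

print(round(V, 3))
```

Replacing `cmdscale` with an iterative MDS routine changes only the third step; the normalization and distance steps stay the same.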

In the symbolic-to-classic and symbolic-to-symbolic approaches the starting point of MDS is a data table $\mathbf{X} = [x_{ij}^{l}; x_{ij}^{u}]$ ($x_{ij}^{l} \le x_{ij}^{u}$). Gioia and Lauro (2006, p. 344) provide different real examples of interval data:

− financial data (e.g. opening and closing value in a session),

− customer satisfaction data (expected or perceived characteristics of the quality of a product),

− tolerance limits in quality control,

− confidence intervals of estimates from sample surveys,

− database queries.

Additional examples of real life interval-valued data can be found in Brito et al. (2015):

− high–low intervals of financial prices,

− some questions in questionnaire surveys (e.g. age, income, time spent).

Interval-valued data can be obtained by generalizing classical single-valued variables into interval-valued variables (see e.g. Bock (2000), pp. 43-44).

The main idea of the mdsOpt package

The authors of the monograph (Borg et al. (2018), chapter 7) point out typical mistakes made by MDS users. One frequent mistake consists in evaluating Stress mechanically (rejecting an MDS solution on the grounds that its value seems “too high”). According to the authors (Borg et al. (2018), pp. 85-86), “The Stress value is, however, merely a technical index, a target criterion for an optimization algorithm. An MDS solution can be robust and replicable, even if its Stress value is high” and “Stress is a summative index for all proximities. It does not inform the user how well a particular proximity value is represented in the given MDS space (...) The least one can do is to take a look at the Stress-per-point values”. Accordingly, one should take into account Stress per point values (Borg and Mair (2017)) and the Shepard diagram (Mair et al. (2016); De Leeuw and Mair (2015)) for the classic-to-classic and symbolic-to-classic approaches, or the I-Stress per box index (ispb) and the I-dist diagram for the symbolic-to-symbolic approach.

Criteria for selecting the optimal MDS procedure

To solve the problem of choosing the optimal MDS procedure, two criteria are implemented in the mdsOpt package (Walesiak and Dudek (2019b)):

− in the classic-to-classic and symbolic-to-classic approaches: Kruskal’s Stress-1 (standardized residual sum of squares) fit measure and the Hirschman-Herfindahl HHI index, calculated from Stress per point values (spp).

− in the symbolic-to-symbolic approach: the I-Stress fit measure and the HHI index, calculated from I-Stress per box index values (ispb).

Package mdsOpt versus other packages

The algorithms implemented in the mdsOpt package have not been used in other R packages so far, and mdsOpt can be treated as a complementary package for the well-known libraries smacof (Mair et al. (2019); De Leeuw and Mair (2009)) and smds (Terada and Groenen (2015)), extending their capabilities. The relationships between mdsOpt and other R packages are presented in Table 1.

Additionally, the mdsOpt package contains functions for calculating the I-Stress per box index (ispb) and charting the I-dist diagram for interval-valued data.

Selection of the optimal multidimensional scaling procedure

The article proposes a solution for selecting the optimal MDS procedure depending on whether metric or interval-valued data are used.

Basic Decision Problems

For the purpose of the classic-to-classic and symbolic-to-classic approaches exemplified in the study, the smacofSym function from the smacof package was used. In the smacofSym function, the user has to choose the following attributes:

– the normalization method (18 normalization methods are available – see Table 2),

– the distance measure: 5 for metric data (Manhattan, Euclidean, Chebyshev, squared Euclidean, GDM1 – see e.g. Everitt et al. (2011), pp. 49-50; Jajuga et al. (2003)) and 4 for interval-valued data (see Table 3),

– the MDS model (3 MDS models are available: ratio, interval, polynomial).

For the purpose of the symbolic-to-symbolic approach exemplified in the study, the IMDS function from the smds package was used. The following attributes need to be selected in the IMDS function:

– the normalization method – 18 normalization methods are available,

– the optimization method – 2 methods are available: the majorization-minimization algorithm “MM” (Groenen et al. (2006), p. 366); the quasi-Newton method “BFGS” (Nash (1990)).

Table 1: Relationships between mdsOpt and other R packages

| | Classic-to-classic | Symbolic-to-classic | Symbolic-to-symbolic |
|---|---|---|---|
| Type of data | metric | interval-valued | interval-valued |
| Functions of mdsOpt package | optSmacofSym_mMDS | optSmacofSymInterval | optIscalInterval |
| Decision problem 1: normalization method | clusterSim (data.Normalization); base (R Core Team 2019) (scale) | clusterSim (interval_normalization) | clusterSim (interval_normalization) |
| Decision problem 2: distance measure | Manhattan, Euclidean, Chebyshev, squared Euclidean, GDM1; stats (R Core Team 2019) (dist); clusterSim (dist.GDM) | Ichino-Yaguchi, Euclidean Ichino-Yaguchi, Hausdorff, Euclidean Hausdorff; clusterSim (dist.Symbolic) | – |
| Decision problem 3: MDS model / optimization method | ratio, interval, polynomial; smacof (smacofSym) | ratio, interval, polynomial; smacof (smacofSym) | majorization-minimization (MM), quasi-Newton (BFGS); smds (IMDS) |

Table 2 presents normalization methods, given by linear formula (4), which were used to select the optimal MDS procedure (see Jajuga and Walesiak (2000), pp. 106-107):

$$z_{ij} = b_j x_{ij} + a_j = \frac{x_{ij} - A_j}{B_j} = \frac{1}{B_j} x_{ij} - \frac{A_j}{B_j} \quad (b_j > 0), \qquad (4)$$

where: $x_{ij}$ ($z_{ij}$) – the value (the normalized value) of the j-th variable for the i-th object, $A_j$ – shift parameter to arbitrary zero, $B_j$ – scale parameter.

The variables describing the analyzed objects are normalized when they are metric or interval-valued.

The purpose of normalization is to achieve comparability of variables (Milligan and Cooper (1988)).

For classical metric data an observation on the j-th variable for the i-th object in a data matrix $\mathbf{X} = [x_{ij}]_{n \times m}$ is expressed as one real number. Column 1 in Table 2 presents the type of normalization method selected in the data.Normalization function from the clusterSim package (Walesiak and Dudek (2019a)). A similar data normalization procedure is available in the scale function of the base package. In this function the user defines parameters $A_j$ and $B_j$.
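A minimal base-R sketch of formula (4) is given below. The helper `normalize_linear` is hypothetical (it is not part of clusterSim) and the data are made up; it only shows how a choice of $A_j$ and $B_j$ from Table 2 turns into a column-wise transformation, here for methods n1 and n4.

```r
# Hypothetical helper: column-wise linear normalization z = (x - A) / B.
normalize_linear <- function(X, A, B) {
  sweep(sweep(X, 2, A, "-"), 2, B, "/")
}

# Made-up data: 5 objects, 2 variables.
X <- matrix(c(2, 4, 6, 8, 10,
              1, 3, 2, 5, 9), ncol = 2)

# n1 (standardization): A = mean, B = standard deviation.
z_n1 <- normalize_linear(X, A = colMeans(X), B = apply(X, 2, sd))

# n4 (unitization with zero minimum): A = minimum, B = range.
rng  <- apply(X, 2, function(x) diff(range(x)))
z_n4 <- normalize_linear(X, A = apply(X, 2, min), B = rng)

# The base function scale() gives the same result as n1.
print(all.equal(z_n1, scale(X), check.attributes = FALSE))
```

In `scale`, the arguments `center` and `scale` play the roles of $A_j$ and $B_j$.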

For interval-valued variables each cell in a data table represents the interval $[x_{ij}^{l}; x_{ij}^{u}]$ ($x_{ij}^{l} \le x_{ij}^{u}$). Interval-valued data require a special normalization approach. The lower and upper bounds of the interval of the j-th variable for $n$ objects are combined into one vector containing $2n$ observations.

This approach makes it possible to apply normalization methods used for classical metric data. After normalization, observations on each variable from 1 to $n$ are the lower bounds of intervals while observations from $n + 1$ to $2n$ are the upper bounds. In the study the data were normalized using the interval_normalization function from the clusterSim package.
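The stacking idea described above can be sketched in base R as follows. The bounds are made up and standardization (n1) via `scale` is used as the example method; this is an illustration of the principle, not the code of `interval_normalization`.

```r
# Made-up interval-valued data: n = 3 objects, m = 2 variables.
lower <- matrix(c(1, 2, 4, 10, 30, 20), ncol = 2)
upper <- matrix(c(3, 5, 6, 25, 45, 40), ncol = 2)
n <- nrow(lower)

# Combine lower and upper bounds into one vector of 2n observations per variable.
stacked <- rbind(lower, upper)

# Apply a classical normalization (here n1) to the stacked columns.
z <- scale(stacked)

# Rows 1..n are the normalized lower bounds, rows n+1..2n the upper bounds.
z_lower <- z[1:n, , drop = FALSE]
z_upper <- z[(n + 1):(2 * n), , drop = FALSE]

print(cbind(z_lower, z_upper))
```

Because the scale parameter is positive, the ordering of the bounds within each interval is preserved after normalization.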

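Table 2: Normalization methods

| Type | Method | Scale parameter $B_j$ | Shift parameter $A_j$ |
|---|---|---|---|
| n1 | Standardization | $s_j$ | $\bar{x}_j$ |
| n2 | Positional standardization | $mad_j$ | $med_j$ |
| n3 | Unitization | $r_j$ | $\bar{x}_j$ |
| n3a | Positional unitization | $r_j$ | $med_j$ |
| n4 | Unitization with zero minimum | $r_j$ | $\min_i\{x_{ij}\}$ |
| n5 | Normalization in range [–1; 1] | $\max_i \lvert x_{ij} - \bar{x}_j \rvert$ | $\bar{x}_j$ |
| n5a | Positional normalization in range [–1; 1] | $\max_i \lvert x_{ij} - med_j \rvert$ | $med_j$ |
| n6 | Quotient transformation | $s_j$ | 0 |
| n6a | Quotient transformation | $mad_j$ | 0 |
| n7 | Quotient transformation | $r_j$ | 0 |
| n8 | Quotient transformation | $\max_i\{x_{ij}\}$ | 0 |
| n9 | Quotient transformation | $\bar{x}_j$ | 0 |
| n9a | Quotient transformation | $med_j$ | 0 |
| n10 | Quotient transformation | $\sum_{i=1}^{n} x_{ij}$ | 0 |
| n11 | Quotient transformation | $\sqrt{\sum_{i=1}^{n} x_{ij}^2}$ | 0 |
| n12 | Normalization | $\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}$ | $\bar{x}_j$ |
| n12a | Positional normalization | $\sqrt{\sum_{i=1}^{n} (x_{ij} - med_j)^2}$ | $med_j$ |
| n13 | Normalization with zero being the central point | $r_j / 2$ | $m_j$ |

$\bar{x}_j$ ($s_j$, $r_j$) – mean (standard deviation, range) for the j-th variable, $m_j = (\max_i\{x_{ij}\} + \min_i\{x_{ij}\}) / 2$ – mid-range for the j-th variable, $med_j = med_i(x_{ij})$ – median for the j-th variable, $mad_j = mad_i(x_{ij})$ – median absolute deviation for the j-th variable.

Source: based on Jajuga and Walesiak (2000), Walesiak (2018).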

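Table 3: Distance measures for interval-valued data

| Symbol | Name | Distance measure $\delta_{ik}$ |
|---|---|---|
| U_2_q1 | Ichino-Yaguchi ($q = 1$, $\gamma = 0.5$) | $\sum_{j=1}^{m} \varphi(x_{ij}, x_{kj})$ |
| U_2_q2 | Euclidean Ichino-Yaguchi ($q = 2$, $\gamma = 0.5$) | $\sqrt{\sum_{j=1}^{m} \varphi(x_{ij}, x_{kj})^2}$ |
| H_q1 | Hausdorff ($q = 1$) | $\sum_{j=1}^{m} \max\left(\lvert x_{ij}^{l} - x_{kj}^{l} \rvert, \lvert x_{ij}^{u} - x_{kj}^{u} \rvert\right)$ |
| H_q2 | Euclidean Hausdorff ($q = 2$) | $\left\{\sum_{j=1}^{m} \left[\max\left(\lvert x_{ij}^{l} - x_{kj}^{l} \rvert, \lvert x_{ij}^{u} - x_{kj}^{u} \rvert\right)\right]^2\right\}^{1/2}$ |

$\varphi(x_{ij}, x_{kj}) = \lvert x_{ij} \oplus x_{kj} \rvert - \lvert x_{ij} \otimes x_{kj} \rvert + \gamma\left(2 \cdot \lvert x_{ij} \otimes x_{kj} \rvert - \lvert x_{ij} \rvert - \lvert x_{kj} \rvert\right)$; $\lvert \cdot \rvert$ – length of interval, $x_{ij} \oplus x_{kj} = x_{ij} \cup x_{kj}$, $x_{ij} \otimes x_{kj} = x_{ij} \cap x_{kj}$.

Source: based on Billard and Diday (2006), pp. 244-246; Esposito et al. (2000), pp. 165-185; Ichino and Yaguchi (1994).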
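The two Hausdorff-type measures from Table 3 can be sketched in base R as below. The function `hausdorff_interval` is a hypothetical helper written for illustration (the article itself uses dist.Symbolic from clusterSim), and the two toy objects are made up.

```r
# Hypothetical helper: Hausdorff-type distance between interval-valued objects.
# lower, upper: n x m matrices of interval bounds; q = 1 (H_q1) or q = 2 (H_q2).
hausdorff_interval <- function(lower, upper, q = 1) {
  n <- nrow(lower)
  D <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (k in seq_len(n)) {
      # Per-variable maximum of the lower-bound and upper-bound differences.
      per_var <- pmax(abs(lower[i, ] - lower[k, ]), abs(upper[i, ] - upper[k, ]))
      D[i, k] <- sum(per_var^q)^(1 / q)
    }
  }
  as.dist(D)
}

# Two made-up objects described by two interval-valued variables.
lower <- rbind(c(0, 0), c(1, 2))
upper <- rbind(c(2, 1), c(4, 3))

h1 <- hausdorff_interval(lower, upper, q = 1)  # max(1, 2) + max(2, 2) = 4
h2 <- hausdorff_interval(lower, upper, q = 2)  # sqrt(2^2 + 2^2) = sqrt(8)
print(c(H_q1 = as.numeric(h1), H_q2 = as.numeric(h2)))
```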

Stages in selecting the optimal procedure for MDS

The starting point in applying the smacofSym function is to set the argument values, for example as follows (all parameters can be changed by the user):

− initial configuration (“torgerson” classical scaling starting solution),

− convergence criterion (eps=1e-06),

− maximum number of iterations (itmax=1000).

The first step in applying the IMDS function from the smds package is to set the argument values, for example as follows (all parameters can be changed by the user):

− initial configuration (the hyper-rectangles with centers assigned as a result of classical multidimensional scaling of primary space interval centers and vertices located at unit distance from the centers),

− convergence criterion (eps=1e-5),

− maximum number of iterations (maxit=1000).

Selecting the optimal procedure for MDS takes place in several stages:

1. Set the number of dimensions in MDS to two (ndim=2).

2. Take into account the following options, depending on the approach:

• In the classic-to-classic approach – 10 normalization methods, 5 distance measures and 4 MDS models (ratio, interval, and the mspline model with a polynomial function of second and third degree) yield a total of 200 MDS procedures.

Since the normalization methods listed in groups A, B, C and D (see Table 4) yield identical MDS results, only the first method listed in each group (n1, n2, n3, n9), plus the other methods (n5, n5a, n8, n9a, n11, n12a), are used in further analysis.

• In the symbolic-to-classic approach – 18 normalization methods, 4 distance measures for interval-valued data and 4 MDS models yield a total of 288 MDS procedures.

• In the symbolic-to-symbolic approach – 18 normalization methods and 2 optimization methods yield a total of 36 MDS procedures.

Table 4: Groups of normalization methods resulting in identical distance matrices

| Group of normalization methods | Normalization methods: GDM1 distance | Normalization methods: Minkowski distances, squared Euclidean distance* |
|---|---|---|
| A | n1, n6, n12 | n1, n6, n12 |
| B | n2, n6a | n2, n6a |
| C | n3, n3a, n4, n7, n13 | n3, n3a, n4, n7, n13 |
| D | n9, n10 | n9, n10 |

* after dividing distances in each distance matrix by the maximum value.

Source: Walesiak and Dudek (2017).
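The numbers of MDS procedures in the three approaches follow from simple enumeration, which the base-R sketch below reproduces with `expand.grid`. The labels "mspline2" and "mspline3" are illustrative names for the second- and third-degree polynomial models and are not identifiers used by mdsOpt.

```r
# Normalization methods: 10 after removing duplicates within groups A-D, and all 18.
norm10 <- c("n1", "n2", "n3", "n5", "n5a", "n8", "n9", "n9a", "n11", "n12a")
norm18 <- c("n1", "n2", "n3", "n3a", "n4", "n5", "n5a", "n6", "n6a", "n7",
            "n8", "n9", "n9a", "n10", "n11", "n12", "n12a", "n13")

dist_metric   <- c("manhattan", "euclidean", "maximum", "seuclidean", "GDM1")
dist_interval <- c("U_2_q1", "U_2_q2", "H_q1", "H_q2")
models        <- c("ratio", "interval", "mspline2", "mspline3")  # illustrative labels
opt_methods   <- c("MM", "BFGS")

classic_to_classic   <- expand.grid(normalization = norm10,
                                    distance = dist_metric, model = models)
symbolic_to_classic  <- expand.grid(normalization = norm18,
                                    distance = dist_interval, model = models)
symbolic_to_symbolic <- expand.grid(normalization = norm18,
                                    optimization = opt_methods)

print(c(nrow(classic_to_classic), nrow(symbolic_to_classic),
        nrow(symbolic_to_symbolic)))
```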

3. Perform MDS for each procedure separately. The procedures are ordered by increasing values of:

• the Stress-1 fit measure in the classic-to-classic and symbolic-to-classic approaches (see e.g. Borg et al. (2018), p. 32):

$$\mathrm{Stress\text{-}1}_p = \sqrt{\frac{\sum_{i<k} \left(d_{ik}(\mathbf{V}) - \hat{d}_{ik}\right)^2}{\sum_{i<k} d_{ik}^2(\mathbf{V})}}, \qquad (5)$$

where: $p$ – MDS procedure number, $\hat{d}_{ik}$ – d-hats, disparities, target distances or pseudo distances (see Borg and Groenen (2005), p. 199), $\hat{d}_{ik} = f(\delta_{ik})$, with $f$ defined in different ways (ratio, interval, polynomial MDS).

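• the I-Stress fit measure in the symbolic-to-symbolic approach (Groenen et al. 2006, p. 363):

$$\mathrm{I\text{-}Stress}_p = \frac{\sum_{i<k} w_{ik}\left(\delta_{ik}^{u} - d_{ik}^{u}\right)^2 + \sum_{i<k} w_{ik}\left(\delta_{ik}^{l} - d_{ik}^{l}\right)^2}{\sum_{i<k} w_{ik}\left(\delta_{ik}^{u}\right)^2 + \sum_{i<k} w_{ik}\left(\delta_{ik}^{l}\right)^2}, \qquad (6)$$

where: $\delta_{ik}^{l}$, $\delta_{ik}^{u}$ ($d_{ik}^{l}$, $d_{ik}^{u}$) – the lower and upper bound of the dissimilarity in m-dimensional space (q-dimensional space), $w_{ik}$ – nonnegative weight (in general $w_{ik} = 1$).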

4. Calculate the Hirschman-Herfindahl HHI index (Herfindahl (1950); Hirschman (1964)) using Stress per point (spp) values (Stress contribution in percentages) in the classic-to-classic and symbolic-to-classic approaches:

$$HHI_p = \sum_{i=1}^{n} spp_{pi}^2, \qquad (7)$$

where: $i = 1, \ldots, n$ – object number.

In the symbolic-to-symbolic approach, the Hirschman-Herfindahl HHI index is calculated on the basis of Interval stress per box (ispb) values (Interval Stress contribution in percentages):

$$HHI_p = \sum_{i=1}^{n} ispb_{pi}^2, \qquad (8)$$

where:

$$ispb_{pi} = \frac{\sum_{k=1}^{n} w_{ik}\left(\delta_{ik}^{u} - d_{ik}^{u}\right)^2 + \sum_{k=1}^{n} w_{ik}\left(\delta_{ik}^{l} - d_{ik}^{l}\right)^2}{\sum_{i=1}^{n}\left[\sum_{k=1}^{n} w_{ik}\left(\delta_{ik}^{u} - d_{ik}^{u}\right)^2 + \sum_{k=1}^{n} w_{ik}\left(\delta_{ik}^{l} - d_{ik}^{l}\right)^2\right]} \cdot 100\%.$$

The $HHI_p$ index takes values in the interval $\left[\frac{10{,}000}{n}; 10{,}000\right]$. The value $\frac{10{,}000}{n}$ means that the distribution of errors for individual objects is uniform. The maximum value appears when the summary fit measure (Stress-1, I-Stress) is the result of loss assigned only to one object. For other objects, the loss function will be equal to zero. The optimal situation for an MDS procedure is the minimum value of the $HHI_p$ index.

5. Draw a chart with $\mathrm{Stress\text{-}1}_p$ ($\mathrm{I\text{-}Stress}_p$) fit measure values on the x-axis and $HHI_p$ index values on the y-axis for all $p$ procedures of MDS.

6. Use the maximum acceptable value of Stress-1 (I-Stress) as the threshold $s$ (it may be calculated as the mid-range or median of $\mathrm{Stress\text{-}1}_p$ ($\mathrm{I\text{-}Stress}_p$)). Among all MDS procedures for which $\mathrm{Stress\text{-}1}_p \le s$ ($\mathrm{I\text{-}Stress}_p \le s$), we choose the one with $\min_p\{HHI_p\}$.

7. Perform MDS for the selected procedure and check if interpretation results are acceptable. The correctness of the model scaling is evaluated using the Shepard diagram (I-dist diagram) and the Stress (I-Stress) plot. If the results are acceptable, the procedure ends, otherwise it returns to step 1 and MDS for three dimensions is performed (ndim=3).
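Steps 4 to 6 can be sketched in base R on a toy table of results. The fit and HHI values below are invented for illustration, the helper `hhi` is hypothetical, and the threshold is taken as the mid-range of the fit measure, which is one of the two options mentioned in step 6.

```r
# Hypothetical helper: Hirschman-Herfindahl index from percentage shares.
hhi <- function(share_pct) sum(share_pct^2)

# Bounds of the index for n = 4 objects.
n <- 4
hhi_uniform <- hhi(rep(100 / n, n))   # errors spread evenly: 10000 / n
hhi_single  <- hhi(c(100, 0, 0, 0))   # all error on one object: 10000

# Invented results for six MDS procedures.
res <- data.frame(procedure = 1:6,
                  stress    = c(0.04, 0.10, 0.13, 0.20, 0.08, 0.25),
                  hhi       = c(900, 420, 380, 350, 600, 340))

# Step 6: threshold as the mid-range of the fit measure.
s <- (min(res$stress) + max(res$stress)) / 2

# Keep procedures with acceptable fit, then pick the one with minimal HHI.
admissible <- res[res$stress <= s, ]
best <- admissible[which.min(admissible$hhi), ]
print(best)
```

In this toy table the procedure with the globally smallest HHI (procedure 6) is not selected, because its fit measure exceeds the threshold.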

Evaluation of tourism attractiveness of districts in Dolnoslaskie province (the classic-to-classic approach)

In the first application, we find the optimal solution for the classic-to-classic MDS approach. The mdsOpt package contains a dataset called data_lower_silesian containing 16 metric variables that describe tourism attractiveness of 31 objects (29 districts in Dolnoslaskie – Lower Silesia province, plus the pattern and anti-pattern object). Variables x1-x3, x7, x8, x10-x16 are stimulants (where higher values are preferable), variables x4, x5 and x6 are destimulants (where lower values are preferable), and x9 is a nominant (preferable values lie within a certain range – 50% level was adopted as the optimal one). Variable x9 was transformed into a stimulant. The coordinates of the pattern object correspond to the most favourable values of the preference variables (the maximum for a stimulant, the minimum for a destimulant). The coordinates of the anti-pattern object correspond to the least favourable values of the preference variables (the minimum for a stimulant, the maximum for a destimulant).

Ten normalization methods, five distance measures and four MDS models are used for selecting the optimal MDS procedure (see Script 1 in the Appendix).

Figure 1 shows the dependency between Stress-1 and the HHI index, with the best solution marked by the red circle. In the end we choose the MDS solution that satisfies the condition Stress-1 ≤ $s$ and minimizes HHI.

Fig. 1: The values of Stress-1 fit measure and HHI index for p MDS procedures (the best solution marked by the red circle)

The results of the optimal MDS procedure (117: n12a normalization method, Euclidean distance and interval MDS model), obtained by applying Script 2 from the Appendix, for 31 objects described in terms of tourism attractiveness are presented in Figure 2.

Figure 2 (a configuration plot with bubbles) shows the share of each object in total error, which is indicated by the radius of the circle around each object. The Shepard diagram and the Stress plot confirm the correctness of the selected scaling model. Figure 2 (a configuration plot with bubbles) includes the set axis, i.e. the shortest connection between the pattern and anti-pattern object. It indicates the level of tourism attractiveness of the districts. The objects located closer to the pattern object are rated higher in terms of tourism attractiveness.

For comparison with the best MDS procedure (117), we also include results from one of the worst procedures (13): n9a normalization method, mspline of third degree model, Chebyshev distance. These results can be obtained by running Script 3 from the Appendix, with changes in lines 3-5 in relation to Script 2 and in the Shepard diagram.

Fig. 2: The results of MDS (procedure 117) for 31 objects in terms of tourism attractiveness

The results of MDS for procedure 13 are presented in Figure 3.

Overall Stress for procedure 13 (0.0381) is much better than for procedure 117 (0.1322). Figure 3 (Stress plot) shows that objects Jeleniogorski (3), the anti-pattern (31) and Zgorzelecki (7) contribute most to the overall Stress (56.62%). It also shows (see the Shepard diagram in the lower left-hand corner) that two points (the distance between Jeleniogorski (3) and the anti-pattern object (31), and between Jeleniogorski (3) and Zgorzelecki (7)) are outliers. These outliers contribute over-proportionally to total Stress. The MDS configuration (Figure 3 – a configuration plot with bubbles) does not represent all proximities equally well. Jeleniogorski (3) is one of the most highly rated districts in the province in terms of tourism attractiveness. In the configuration plot with bubbles, this district lies near the anti-pattern object (the worst object). As the value of the $HHI_p$ index increases, the ability of multidimensional scaling to represent real relationships between objects deteriorates.
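How a few badly fitted objects inflate the HHI index can be seen from a small base-R sketch that computes Stress-1, per-object error shares and HHI. It rests on simplifying assumptions: the data are random, the configuration comes from `cmdscale` rather than smacofSym, and the disparities are taken to be the dissimilarities themselves, so the numbers are not comparable with those reported by smacof.

```r
set.seed(1)
X <- matrix(rnorm(24), nrow = 6)                 # toy data: 6 objects, 4 variables

delta <- as.matrix(dist(X))                      # dissimilarities, used here as disparities
d     <- as.matrix(dist(cmdscale(dist(X), k = 2)))  # fitted distances in 2 dimensions

# Stress-1 over pairs i < k, following formula (5).
low     <- lower.tri(delta)
stress1 <- sqrt(sum((d[low] - delta[low])^2) / sum(d[low]^2))

# Share of each object in the total squared error, in percent.
err2 <- (d - delta)^2
spp  <- 100 * rowSums(err2) / sum(err2)

# Hirschman-Herfindahl index of the error shares, formula (7).
hhi <- sum(spp^2)

print(round(c(stress1 = stress1, hhi = hhi), 4))
print(round(spp, 2))
```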

Fig. 3: The results of MDS (procedure 13) of 31 objects in terms of tourism attractiveness

Evaluation of tourism attractiveness of Polish provinces (symbolic-to-symbolic approach)

In the second application we find the optimal solution for the symbolic-to-symbolic MDS approach.

The dataset data_symbolic_interval_polish_voivodships comes from the clusterSim package. Tourism attractiveness of Polish provinces in 2016 is evaluated using a dataset obtained in two steps.

Step 1. A dataset containing 9 metric variables describing tourism attractiveness for 380 districts was created. Three variables (x4, x5 and x6) can be treated as destimulants. All other variables are stimulants.

Step 2. The data for districts were aggregated at province level yielding a set of interval-valued data.

The lower bound of the interval for each variable was obtained by calculating the first quartile based on district data. The upper bound of the interval was obtained by calculating the third quartile. The final dataset contains data about 18 objects (16 provinces, plus the pattern and anti-pattern objects) described by 9 interval-valued variables.

Eighteen normalization methods (see Table 2) and two optimization methods ("MM", "BFGS") are used for selecting the optimal MDS procedure (with the I-Scal algorithm). The results of the optimal MDS procedure (5: n2 normalization method, “MM” optimization method), obtained by means of Script 4 from the Appendix, for 18 objects ranked in terms of tourism attractiveness are presented in Figure 4.

Fig. 4: The results of multidimensional scaling (procedure 5) of 18 objects in terms of tourism attractiveness

Figure 4 (I-dist diagram and I-Stress plot) confirms the correctness of the MDS results (Configuration plot). Objects located closer to the pattern of development are characterized by a higher level of tourism attractiveness.

For comparison with the best MDS procedure (5), we include the results from one of the worst procedures (12), in terms of the HHI index. These results can be obtained by running Script 5 from the Appendix, with changes in lines 3-5 in relation to Script 4 and in the I-dist diagram.

The results of multidimensional scaling for procedure 12 are presented in Figure 5.

Fig. 5: The results of multidimensional scaling (procedure 12) of 18 objects in terms of tourism attractiveness

Figure 5 (I-Stress plot) shows that Lubuskie (4), the pattern object (17) and Zachodniopomorskie (16) contribute most to overall I-Stress (57.68%). It also shows (see Figure 5 – I-dist diagram) that some points (upper distances between Zachodniopomorskie (16) and the pattern object (17); the pattern object (17) and Lubuskie (4); the lower distance between Zachodniopomorskie (16) and Podkarpackie (9)) are outliers. These outliers contribute over-proportionally to total I-Stress. The MDS configuration (Figure 5 – Configuration plot) does not represent all proximities equally well. Zachodniopomorskie (16) is the most highly rated province in terms of tourist attractiveness. However, in Figure 5 (Configuration plot) it lies further away from the pattern object than Lubuskie (4). As the value of the $HHI_p$ index increases, the ability of multidimensional scaling to represent real relationships between objects deteriorates.

Summary

The article proposes a method for selecting the optimal MDS procedure for classical metric and interval-valued data. In the classic-to-classic approach the best MDS procedure is selected by choosing an optimal combination of normalization methods, distance measures and scaling models based on the metric data matrix. In the symbolic-to-classic approach the best MDS procedure is selected by choosing an optimal combination of normalization methods, distance measures for interval-valued data and scaling models based on an interval-valued data table. In the symbolic-to-symbolic approach the best MDS procedure is selected by choosing an optimal combination of normalization and optimization methods carried out on the basis of the interval-valued data table.

The optimal MDS procedure was selected by applying two criteria implemented in the mdsOpt package: Kruskal’s Stress-1 fit measure and the Hirschman-Herfindahl HHI index (in the classic-to-classic and symbolic-to-classic approaches) and the I-Stress fit measure and the HHI index (in the symbolic-to-symbolic approach).

In step 6 the maximum acceptable value of the fit measures Stress-1 and I-Stress was chosen arbitrarily. It is not determined how much the error distribution for each object may deviate from the uniform distribution. Among the MDS procedures for which Stress-1 ≤ $s$ (I-Stress ≤ $s$), the one with $\min_p\{HHI_p\}$ is selected. This constraint does not essentially limit the presented proposal, as additional acceptability criteria, such as the Shepard diagram (De Leeuw and Mair (2015)) and Stress plot, or the I-dist diagram and I-Stress plot, confirm the correctness of the MDS results.

Appendix

Script 1

library(mdsOpt)
data(data_lower_silesian)
metnor<-c("n1","n2","n3","n5","n5a","n8","n9","n9a","n11","n12a")
metscale<-c("ratio","interval","mspline")
metdist<-c("euclidean","manhattan","seuclidean","maximum","GDM1")
res<-optSmacofSym_mMDS(data_lower_silesian,normalizations=metnor,
  distances=metdist,mdsmodels=metscale,spline.degrees=c(2:3),outDec=".",
  stressDigits=6,HHIDigits=2)
options(max.print=1200)
stress<-as.numeric(res[,"STRESS 1"])
hhi<-as.numeric(res[,"HHI spp"])
cs<-(min(stress)+max(stress))/2
t<-findOptimalSmacofSym(res,cs)
plot(stress[-t$Nr],hhi[-t$Nr],
  xlab="Stress-1",ylab="HHI",type="n",font.lab=3)
text(stress[-t$Nr],hhi[-t$Nr],labels=(1:nrow(res))[-t$Nr])
abline(v=cs,col="red")
points(stress[t$Nr],hhi[t$Nr],cex=5,col="red")
text(stress[t$Nr],hhi[t$Nr],labels=(1:nrow(res))[t$Nr],col="red")

Script 2

library(mdsOpt)
data(data_lower_silesian)
z<-data.Normalization(data_lower_silesian,type="n12a")
d<-dist(z,method="euclidean")
res<-smacofSym(delta=d,ndim=2,type="interval")
par(mfrow=c(2,2),pty="s")
#Shepard Diagram
plot(res,plot.type="Shepard",cex.main=0.8,cex.lab=0.8,
  cex.axis=0.8,cex=0.2)
#Stress plot
spp<-sort(res$spp,decreasing=TRUE)
names(spp)<-order(res$spp,decreasing=TRUE)
plot(spp,main="Stress plot",ylab="Stress contribution in percents",
  xlab="Objects",ylim=c(-2,30),cex=0.4,cex.main=0.8,
  cex.lab=0.8,cex.axis=0.8)
text(spp,pos=3,names(spp),cex=0.4)
#Configuration plot with bubble
bubsize=res$spp/length(spp)*4
plot(res$conf,main="Configuration plot with bubble",xlab="Dimension 1",
  ylab="Dimension 2",cex=bubsize,cex.main=0.8,cex.lab=0.8,
  cex.axis=0.8,asp=1)
text(res$conf[,1],res$conf[,2],pos=3,1:nrow(res$conf),cex=0.7)
arrows(res$conf[nrow(z),1],res$conf[nrow(z),2],res$conf[nrow(z)-1,1],
  res$conf[nrow(z)-1,2],length=0.05,col="black")
plot.new()
legend("center",paste(1:nrow(res$conf),rownames(res$conf)),
  bty="n",cex=0.7,ncol=2,title="Legend")

Script 3

z<-data.Normalization(data_lower_silesian,type="n9a")
d<-dist(z,method="maximum")
res<-smacofSym(delta=d,ndim=2,type="mspline",spline.degree=3)
...
#Shepard Diagram
plot(res,plot.type="Shepard",cex.main=0.8,cex.lab=0.8,
  cex.axis=0.8,cex=0.2)
t1<-as.matrix(res$delta)
t2<-as.matrix(res$confdist)
text(t1[7,3],t2[7,3],pos=4,"(7,3)",cex=0.6)
text(t1[31,3],t2[31,3],pos=1,"(31,3)",cex=0.6)

Script 4

library(smds)
library(mdsOpt)
data("data_symbolic_interval_polish_voivodships")
data<-data_symbolic_interval_polish_voivodships
normalized<-interval_normalization(x=data,dataType="simple",type="n2")
x<-normalized$simple[,,1];y<-normalized$simple[,,2]
my.idiss<-idistBox(X=(x+y)/2,R=(y-x)/2)
cmat<-(my.idiss[2,,]+my.idiss[1,,])/2
iniX<-cmdscale(as.dist(cmat),k=2)
n=dim(my.idiss)[2]
iniR<-matrix(rep(1,n*2),nrow=n,ncol=2)
res.box<-IMDS(IDM=my.idiss,p=2,model="box",opt.method="MM",
  report=1001,ini=list(iniX,iniR))
y_l<-res.box$IDM[1,,];x_u<-res.box$IDM[2,,]
x_l<-res.box$EIDM[1,,];y_u<-res.box$EIDM[2,,]
spb<-ispb(res.box$EIDM,my.idiss)
HHI<-sum(spb^2)
par(mfrow=c(2,2),pty="s")
#I-dist diagram
plot(x_u,y_u, main="I-dist diagram",
  ylab="The lower (red) and upper (green)\n configuration distances",
  xlab="The lower (red) and upper\n (green) dissimilarities",
  col="green",cex.main=0.8,cex.lab=0.8,cex.axis=0.8,cex=0.5)
points(x_l,y_l,col="red",cex=0.5)
#I-Stress plot
w<-sort(spb,decreasing=TRUE)
names(w)<-order(spb,decreasing=TRUE)
plot(w,main="I-Stress plot",xlab="Object",ylab="ispb in percents",
  ylim=c(-2,25),cex=0.4,cex.main=0.8,cex.lab=0.8,cex.axis=0.8)
text(w,pos=3,names(w),cex=0.6)
#Configuration plot
x<-(res.box$X-res.box$R);y<-(res.box$X+res.box$R)
plot(NULL,xlim=c(min(x[,1]),max(y[,1])),ylim=c(min(x[,2]),max(y[,2])),
  pch=1,cex=0.4,main="Configuration plot",xlab="Dimension 1",
  ylab="Dimension 2",cex.main=0.8,cex.lab=0.8,asp=1,cex.axis=0.8)
rect(x[,1],x[,2],y[,1],y[,2])
text(res.box$X[,1],res.box$X[,2],labels=1:18,cex=0.8)
plot.new()
legend("center",legend=paste(1:dim(data)[[1]],attr(data,"row.names")),
  bty="n",ncol=2,cex=0.65,title="Legend")

Script 5

normalized<-interval_normalization(x=data,dataType="simple",type="n5a")
...
res.box<-IMDS(IDM=my.idiss,p=2,model="box",opt.method="BFGS",
  report=1001,ini=list(iniX,iniR))
...
#I-dist diagram
plot(x_u,y_u,main="I-dist diagram",
  ylab="The lower (red) and upper (green)\n configuration distances",
  xlab="The lower (red) and upper\n (green) dissimilarities",col="green",
  cex.main=0.8,cex.lab=0.8,cex.axis=0.8,cex=0.5)
points(x_l,y_l,col="red",cex=0.5)
text(x_u[17,16],y_u[17,16],pos=2,"(17,16)",cex=0.6)
text(x_u[17,4],y_u[17,4],pos=1,"(17,4)",cex=0.6)
text(x_l[16,9],y_l[16,9],pos=3,"(16,9)",cex=0.6)

Acknowledgments

The project is financed by the Ministry of Science and Higher Education in Poland under the program “Regional Initiative of Excellence” 2019-2022, project number 015/RID/2018/19, total funding amount 10,721,040 PLN.

References

• Billard, L. and Diday, E. (2006), Symbolic Data Analysis: Conceptual Statistics and Data Mining, John Wiley & Sons, Chichester, DOI:10.1002/9780470090183.

• Bock, H.H. (2000), Symbolic data, Bock, H.H. and Diday E. (eds.), Analysis of Symbolic Data.

Exploratory Methods for Extracting Statistical Information from Complex Data, 39–53. Springer- Verlag, Berlin, Heidelberg, DOI:10.1007/978-3-642-57155-8.

• Borg, I. and Groenen, P.J.F. (2005), Modern Multidimensional Scaling. Theory and Applications, Springer Science+Business Media, New York, DOI:10.1007/0-387-28981-X.

• Borg, I., Groenen, P.J.F. and Mair, P. (2018), Applied Multidimensional Scaling and Unfolding, Springer, Heidelberg, New York, Dordrecht, London, DOI:10.1007/978-3-319-73471-2.

• Borg, I. and Mair, P. (2017), ‘The Choice of Initial Configurations in Multidimensional Scaling:

Local Minima, Fit, and Interpretability’, Austrian Journal of Statistics, 46(2), 19–32, DOI:10.17713/ajs.v46i2.561.

• Brito, P., Noirhomme-Fraiture M. and Arroyo, J. (2015), ‘Editorial for Special Issue on Symbolic

(17)

Data Analysis’, Advanced in Data Analysis and Classification, 9(1), 1–4, DOI:10.1007/ s11634-015- 0202-1.

• Cooper, L.G. (1983), ‘A Review of Multidimensional Scaling in Marketing Research’, Applied Psychological Measurement, 7(4), 427–450, DOI:10.1177/014662168300700404.

• De Leeuw, J. and Mair, P. (2009), ‘Multidimensional Scaling Using Majorization: SMACOF in R’, Journal of Statistical Software, 31(3), 1–30, DOI:10.18637/jss.v031.i03.

• De Leeuw, J. and Mair, P. (2015), Shepard Diagram, Wiley StatsRef: Statistics Reference Online, DOI:10.1002/9781118445112.stat06268.pub2.

• Embleton, S., Uritescu, D. and Wheeler, E.S. (2013), ‘Defining Dialect Regions with Interpreta- tions: Advancing the Multidimensional Scaling Approach’, Literary and Linguistic Computing, 28(1), 13–22, DOI:10.1093/llc/fqs048.

• Esposito, F., Malerba, D. and Tamma, V. (2000), Dissimilarity Measures for Symbolic Objects, Bock, H.H. and Diday, E. (eds.), Analysis of Symbolic Data. Exploratory Methods for Extracting Statistical Information from Complex Data, 165–185, Springer-Verlag, Berlin, Heidelberg, DOI:10.1007/978-3-642-57155-8.

• Everitt, B., Landau, S., Leese, M. and Stahl, D. (2011), Cluster Analysis, John Wiley & Sons, Chichester, DOI:10.1002/9780470977811.

• Gioia, F. and Lauro, C.N. (2006), ‘Principal Component Analysis on Interval Data’, Computa- tional Statistics, 21(2), 343–363, DOI:10.1007/s00180-006-0267-6.

• Groenen, P.J.F., Winsberg, S., Rodriguez, O. and Diday, E. (2006), ‘I-Scal: Multidimensional Scaling of Interval Dissimilarities’, Computational Statistics & Data Analysis, 51(1), 360–378, DOI:10.1016/j.csda.2006.04.003.

• Herfindahl, O.C. (1950), Concentration in the Steel Industry, Ph.D. thesis, Columbia University.

• Hirschman, A.O. (1964), ‘The Paternity of an Index’, The American Economic Review, 54(5), 761–762, http://www.jstor.org/stable/1818582.

• Ichino, M. and Yaguchi, H. (1994), ‘Generalized Minkowski Metrics for Mixed Feature-type Data Analysis’, IEEE Transactions on Systems, Man, and Cybernetics, 24(4), 698–708, DOI:10.1109/21.286391.

• Jajuga, K. and Walesiak, M. (2000), Standardisation of Data Set under Different Measurement Scales, Decker, R. and Gaul, W. (eds.), Classification and Information Processing at the Turn of the Millennium, 105–112. Springer-Verlag, Berlin, Heidelberg. DOI:10.1007/978-3-642-57280-7_11.

• Jajuga, K., Walesiak, M. and Bak, A. (2003), On the General Distance Measure. Schwaiger, M.

and Opitz, O. (eds.), Exploratory Data Analysis in Empirical Research, 104–109, Springer-Verlag, Berlin, Heidelberg, DOI:10.1007/978-3-642-55721-7_12.

• Mair, P., Borg, I. and Rusch, T. (2016), ‘Goodness-of-fit Assessment in Multidimensional Scaling and Unfolding’, Multivariate Behavioral Research, 51(6), 772–789, DOI:10.1080/00273171.2016.1235966.

• Mair, P., De Leeuw, J., Borg, I. and Groenen, P.J.F. (2019), smacof: Multidimensional Scaling, R package version 2.0-0 edition, https://CRAN.R-project.org/package=smacof.

(18)

• Marcussen, C. (2014), ‘Multidimensional Scaling in Tourism Literature’, Tourism Management Perspectives, 12(October), 31–40, DOI:10.1016/j.tmp.2014.07.003.

• McAdams, S., Winsberg, S., Donnadieu, S., De Soete, G. and Krimphoff, J. (1995). ‘Perceptual Scaling of Synthesized Musical Timbres: Common Dimensions, Specificities, and Latent Subject Classes’, Psychological Research, 58(3), 177–192, DOI:10.1007/BF00419633.

• Milligan, G.W. and Cooper, M.C. (1988), ‘A Study of Standardization of Variables in Cluster Analysis’, Journal of Classification, 5(2), 181–204.

• Nash, J.C. (1990), Compact Numerical Methods for Computers. Linear Algebra and Function Minimisation, Adam Hilger, Bristol and New York.

• Pinkley, R.L., Gelfand, M.J. and Duan, L. (2005), ‘When, Where and How: The Use of Multidi- mensional Scaling Methods in the Study of Negotiation and Social Conflict’, International Negotia- tion, 10(1), 79–96, DOI:10.1163/1571806054741056.

• R Core Team (2019), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

• Takane, Y. (2007), Applications of Multidimensional Scaling in Psychometrics, Rao, C.R. and Sinharay, S. (eds.), Handbook of Statistics (Vol. 26): Psychometrics, 359–400. Elsevier, Amsterdam.

• Terada, Y. and Groenen, P.J.F. (2015), smds: Symbolic Multidimensional Scaling, R package ver- sion 1.0 edition, https://CRAN.R-project.org/package=smds.

• Walesiak, M. (2018), ‘The Choice of Normalization Method and Rankings of the Set of Objects Based on Composite Indicator Values’, Statistics in Transition – new series, 19(4), 693–710, DOI:10.21307/stattrans-2018-036.

• Walesiak, M. and Dudek, A. (2017), ‘Selecting the Optimal Multidimensional Scaling Procedure for Metric Data with R Environment’, Statistics in Transition – new series, 18(3), 521–540, DOI:10.21307/stattrans-2016-084.

• Walesiak, M. and Dudek, A. (2019a), clusterSim: Searching for Optimal Clustering Procedure for a Data Set, R package version 0.48-3 edition, https://CRAN.R-project.org/package= clusterSim.

• Walesiak, M. and Dudek, A. (2019b), mdsOpt: Searching for Optimal MDS Procedure for Metric and Interval-valued Data, R package version 0.4-3 edition, https://CRAN.R-project.

org/package=mdsOpt.
