Multidimensional Scaling for Symbolic Interval Data


ACTA UNIVERSITATIS LODZIENSIS FOLIA OECONOMICA 228, 2009

Andrzej Dudek*

MULTIDIMENSIONAL SCALING FOR SYMBOLIC INTERVAL DATA

Abstract. The aim of multidimensional scaling is to represent dissimilarities among objects in a high-dimensional space as distances in a low-dimensional (usually 2- or 3-dimensional) space. The input to a multidimensional scaling procedure is usually a square, symmetric matrix indicating relationships (similarities or dissimilarities) among a set of items. There are many techniques of classical multidimensional scaling, but all of them assume that each entry in the relationship matrix is a single numeric value.

Denoeux and Masson (2000) proposed extending multidimensional scaling to symbolic interval data. The input to their INTERSCAL algorithm is an interval dissimilarity table containing the minimum and maximum distances between the hyper-rectangles representing objects. The same approach is used in the SYMSCAL and I-SCAL algorithms proposed by Groenen et al. (2005).

The article presents the main algorithms of multidimensional scaling for symbolic data in the form of intervals, along with some examples on datasets taken from the symbolic data repository (http://www.ceremade.dauphine.fr/~touati/sodas-pagegarde.htm).

Key words: Multidimensional scaling, visualization, symbolic data.

I. INTRODUCTION

Visualizing data in the form of illustrative diagrams and searching these diagrams for structures, clusters, trends, dependencies etc. is one of the main aims of multivariate statistical analysis. In the case of symbolic data (e.g. data in the form of single quantitative values, categorical values, intervals, multi-valued variables, or multi-valued variables with weights), some well-known methods are provided by suitable "symbolic" adaptations of classical methods such as principal component analysis, factor analysis or multidimensional scaling (MDS) (Kruskal (1964)). The main difference between classical methods and "symbolic" methods is the form of the data they deal with. In the case of symbolic interval data analysis, the input contains intervals instead of single numerical values.


This paper describes methods of multidimensional scaling of symbolic objects described by variables in the form of intervals (symbolic interval data). The aim of multidimensional scaling of symbolic interval data is the same as in the classical case: to represent dissimilarities among objects in a high-dimensional space as distances in a reduced 2- or 3-dimensional space. However, while in classical MDS points of the high-dimensional space are transformed into points in p-dimensional space (p = 2 or 3), in multidimensional scaling of symbolic interval data hyper-rectangles of the higher-dimensional space are translated into intervals in the lower-dimensional space.

The first section describes the form of the input and output data for algorithms of multidimensional scaling of symbolic interval data. The second presents the three main methods of multidimensional scaling for symbolic interval data: Interscal, I-Scal and SymScal, the last of which is described in more detail in the third section. The fourth section presents an example of the use of the SymScal algorithm on symbolic data acquired from the symbolic data repository. Finally, some remarks and conclusions are given.

II. INPUT AND OUTPUT DATA

All methods of multidimensional scaling for symbolic interval data require a matrix of minimal and maximal distances between objects, in a form similar to formula 1.

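$$
\Delta = \begin{bmatrix}
[\underline{\delta}_{11}, \overline{\delta}_{11}] & [\underline{\delta}_{12}, \overline{\delta}_{12}] & \cdots & [\underline{\delta}_{1n}, \overline{\delta}_{1n}] \\
[\underline{\delta}_{21}, \overline{\delta}_{21}] & [\underline{\delta}_{22}, \overline{\delta}_{22}] & \cdots & [\underline{\delta}_{2n}, \overline{\delta}_{2n}] \\
\vdots & \vdots & \ddots & \vdots \\
[\underline{\delta}_{n1}, \overline{\delta}_{n1}] & [\underline{\delta}_{n2}, \overline{\delta}_{n2}] & \cdots & [\underline{\delta}_{nn}, \overline{\delta}_{nn}]
\end{bmatrix} \tag{1}
$$

where:

$\underline{\delta}_{ij}$ - minimal distance between $i$-th and $j$-th symbolic object,
$\overline{\delta}_{ij}$ - maximal distance between $i$-th and $j$-th symbolic object,
$n$ - number of symbolic objects.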

Sometimes this matrix is not given directly but has to be calculated from n symbolic objects described by intervals. In this case, according to Denoeux and Masson (2000), the minimal and maximal distances are computed according to formulas 2 and 3.


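$$
\overline{\delta}_{ij} = \sqrt{\sum_{k=1}^{m} \left\{ \frac{1}{2} \left[ (\overline{x}_{ik} - \underline{x}_{ik}) + (\overline{x}_{jk} - \underline{x}_{jk}) + 2 \left| \frac{\underline{x}_{ik} + \overline{x}_{ik}}{2} - \frac{\underline{x}_{jk} + \overline{x}_{jk}}{2} \right| \right] \right\}^{2}} \tag{2}
$$

$$
\underline{\delta}_{ij} = \sqrt{\sum_{k=1}^{m} \left\{ \frac{1}{2} \max \left[ 0,\; 2 \left| \frac{\underline{x}_{ik} + \overline{x}_{ik}}{2} - \frac{\underline{x}_{jk} + \overline{x}_{jk}}{2} \right| - (\overline{x}_{ik} - \underline{x}_{ik}) - (\overline{x}_{jk} - \underline{x}_{jk}) \right] \right\}^{2}} \tag{3}
$$

where:

$\underline{\delta}_{ij}$ - minimal distance between $i$-th and $j$-th symbolic object,
$\overline{\delta}_{ij}$ - maximal distance between $i$-th and $j$-th symbolic object,
$[\underline{x}_{ik}, \overline{x}_{ik}]$ - $k$-th variable of $i$-th object (the beginning and the end of the interval),
$m$ - number of symbolic variables describing each object.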
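The construction of the interval dissimilarity table from raw interval data can be sketched in Python. This is a minimal illustration under stated assumptions, not the author's implementation: each object is assumed to be a list of (lower, upper) bounds, the distance between hyper-rectangles is assumed to be Euclidean, and the function names `interval_distances` and `interval_dissimilarity_table` are hypothetical.

```python
import math


def interval_distances(a, b):
    """Return (minimal, maximal) Euclidean distance between two hyper-rectangles.

    a and b are lists of (lower, upper) pairs, one pair per interval variable.
    """
    min_sq = 0.0
    max_sq = 0.0
    for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b):
        # Distance between interval centers on this variable.
        center_gap = abs((a_lo + a_hi) / 2 - (b_lo + b_hi) / 2)
        # Sum of the two half-widths (spreads) on this variable.
        half_widths = (a_hi - a_lo) / 2 + (b_hi - b_lo) / 2
        max_sq += (center_gap + half_widths) ** 2
        # Overlapping intervals contribute nothing to the minimal distance.
        min_sq += max(0.0, center_gap - half_widths) ** 2
    return math.sqrt(min_sq), math.sqrt(max_sq)


def interval_dissimilarity_table(objects):
    """Build the n x n table of (minimal, maximal) distances."""
    n = len(objects)
    return [[interval_distances(objects[i], objects[j]) for j in range(n)]
            for i in range(n)]


# Three illustrative objects described by two interval variables (made-up data).
objects = [
    [(0.0, 1.0), (0.0, 1.0)],
    [(3.0, 4.0), (0.0, 1.0)],
    [(0.5, 2.0), (0.5, 3.0)],
]
table = interval_dissimilarity_table(objects)
for row in table:
    print(row)
```

Note that the diagonal entries are not degenerate: the minimal distance of an object to itself is zero, but the maximal distance equals the length of the diagonal of its hyper-rectangle.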

The aim of multidimensional scaling of symbolic interval data is the same as in the classical case: to represent dissimilarities among objects in a high-dimensional space as distances in a 2- or 3-dimensional space. The output matrix therefore also contains minimal and maximal distances between objects and can be written in the form of formula 4.

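$$
\begin{bmatrix}
[\underline{d}_{11}, \overline{d}_{11}] & [\underline{d}_{12}, \overline{d}_{12}] & \cdots & [\underline{d}_{1n}, \overline{d}_{1n}] \\
[\underline{d}_{21}, \overline{d}_{21}] & [\underline{d}_{22}, \overline{d}_{22}] & \cdots & [\underline{d}_{2n}, \overline{d}_{2n}] \\
\vdots & \vdots & \ddots & \vdots \\
[\underline{d}_{n1}, \overline{d}_{n1}] & [\underline{d}_{n2}, \overline{d}_{n2}] & \cdots & [\underline{d}_{nn}, \overline{d}_{nn}]
\end{bmatrix} \tag{4}
$$

where:

$\underline{d}_{ij}$ - minimal distance between $i$-th and $j$-th symbolic object in reduced space,
$\overline{d}_{ij}$ - maximal distance between $i$-th and $j$-th symbolic object in reduced space,
$n$ - number of symbolic objects.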

III. METHODS OF MULTIDIMENSIONAL SCALING FOR SYMBOLIC INTERVAL DATA

There are three main algorithms of multidimensional scaling for symbolic interval data: one non-iterative (Interscal) and two (I-Scal and SymScal) that iteratively search for the optimal value of the transformation loss function.

The main steps of the Interscal method (Denoeux, Masson (2000)) can be stated as:

• Calculation of the modified matrix $\Delta$ containing 2n rows and 2n columns, with the minimal distance, maximal distance and average distance for every pair of objects,
• Calculation of $B$, the matrix of scalar products of the rows of $\Delta$,
• Calculation of the eigenvectors $v$ and eigenvalues $\lambda$ of $B$,
• Calculation of $\underline{d}_{ij}$, $\overline{d}_{ij}$.

The I-Scal and SymScal algorithms start from the matrices X (centers of intervals) and R (spreads of intervals) and iteratively search for the optimal value of the I-Stress or STRESS-Sym loss function, respectively.

IV. SYMSCAL

The idea of the SymScal algorithm is majorization of the loss function. In the first step, the matrices X (centers of intervals) and R (spreads of intervals) are generated (usually at random, although Groenen et al. (2006) suggest using the Interscal algorithm to find initial values of X and R).

In the second and subsequent steps, the STRESS-Sym measure is calculated according to formula 5.

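$$
\text{STRESS-Sym}(X, R) = \sum_{i<j} \omega_{ij} \left[ \overline{\delta}_{ij} - \overline{d}_{ij}(X, R) \right]^{2} + \sum_{i<j} \omega_{ij} \left[ \underline{\delta}_{ij} - \underline{d}_{ij}(X, R) \right]^{2} \tag{5}
$$

where:

$X, R$ - centers of intervals and spreads of intervals,
$\omega_{ij}$ - weights (usually equal),
$\underline{\delta}_{ij}$ - minimal distance between $i$-th and $j$-th symbolic object,
$\overline{\delta}_{ij}$ - maximal distance between $i$-th and $j$-th symbolic object,
$\underline{d}_{ij}, \overline{d}_{ij}$ - minimal and maximal distance between $i$-th and $j$-th symbolic object in reduced space, calculated from $X$ and $R$ according to formulas 6 and 7,

$$
\overline{d}_{ij}(X, R) = \left[ \sum_{s=1}^{p} \left( |x_{is} - x_{js}| + (r_{is} + r_{js}) \right)^{2} \right]^{1/2} \tag{6}
$$

$$
\underline{d}_{ij}(X, R) = \left[ \sum_{s=1}^{p} \max \left( 0,\; |x_{is} - x_{js}| - (r_{is} + r_{js}) \right)^{2} \right]^{1/2} \tag{7}
$$

$p$ - dimensionality of reduced space.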
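Evaluating the loss for a given configuration can be sketched as follows. This is an illustrative Python sketch only: it evaluates formulas 5, 6 and 7 for fixed X and R, assumes equal weights unless a weight matrix is supplied, and does not implement the majorization updates of Groenen et al. (2006). The function names `reduced_distances` and `stress_sym` and the example data are hypothetical.

```python
import math


def reduced_distances(X, R, i, j):
    """Minimal and maximal distance between objects i and j in the reduced space.

    X holds interval centers and R interval spreads, both as n x p nested lists.
    """
    lower_sq = 0.0
    upper_sq = 0.0
    for s in range(len(X[i])):
        gap = abs(X[i][s] - X[j][s])
        spread = R[i][s] + R[j][s]
        lower_sq += max(0.0, gap - spread) ** 2  # formula 7
        upper_sq += (gap + spread) ** 2          # formula 6
    return math.sqrt(lower_sq), math.sqrt(upper_sq)


def stress_sym(delta_lower, delta_upper, X, R, weights=None):
    """Weighted sum of squared errors on upper and lower distances (formula 5)."""
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # only pairs with i < j enter the sum
            w = 1.0 if weights is None else weights[i][j]
            d_lo, d_up = reduced_distances(X, R, i, j)
            total += w * (delta_upper[i][j] - d_up) ** 2
            total += w * (delta_lower[i][j] - d_lo) ** 2
    return total


# Two objects in a 2-dimensional reduced space (made-up configuration).
X = [[0.0, 0.0], [3.0, 0.0]]
R = [[0.5, 0.5], [0.5, 0.5]]
# Made-up input dissimilarities; diagonal entries are not used by the loss.
delta_lower = [[0.0, 1.0], [1.0, 0.0]]
delta_upper = [[0.0, 5.0], [5.0, 0.0]]
loss = stress_sym(delta_lower, delta_upper, X, R)
print(loss)
```

An iterative algorithm would call such a function after every update of X and R and stop once the decrease in the loss becomes negligible.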

The method of majorization of STRESS-Sym is described in detail in Groenen et al. (2006). The main weakness of this method is that STRESS-Sym is not a normalized measure. Thus one can observe the improvement of the loss function during the iteration process, but cannot compare the quality of the transformation with that obtained for other datasets.
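The scale dependence can be made explicit with a short calculation. This is an illustrative sketch rather than part of the original derivation: it assumes that formula 5 is a weighted sum of squared differences, and $c > 0$ is an arbitrary rescaling factor (for example a change of measurement unit) introduced here only for illustration. The distances in formulas 6 and 7 are homogeneous of degree one in $(X, R)$:

$$
\overline{d}_{ij}(cX, cR) = c\,\overline{d}_{ij}(X, R), \qquad \underline{d}_{ij}(cX, cR) = c\,\underline{d}_{ij}(X, R).
$$

Multiplying all input distances by $c$ and rescaling the configuration accordingly therefore gives

$$
\sum_{i<j} \omega_{ij} \left[ c\,\overline{\delta}_{ij} - c\,\overline{d}_{ij}(X, R) \right]^{2} + \sum_{i<j} \omega_{ij} \left[ c\,\underline{\delta}_{ij} - c\,\underline{d}_{ij}(X, R) \right]^{2} = c^{2}\,\text{STRESS-Sym}(X, R).
$$

The same configuration expressed in different units thus yields a different loss value, which is why raw STRESS-Sym values cannot be compared across datasets.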

V. EXAMPLE OF USE OF MULTIDIMENSIONAL SCALING FOR SYMBOLIC INTERVAL DATA

As an illustration of the use of multidimensional scaling for symbolic interval data, the wine.xml set taken from the symbolic data repository has been used. This set contains 21 symbolic objects described by 23 symbolic interval variables. Figure 1 shows the original data.

Fig. 1. wine.xml dataset in original space

As one can see, there is no clear structure in the data. In fact, there are too many dimensions on the scatterplot to observe anything.


Figure 2 shows the data in the first 6 dimensions, but even with this limitation no structure can be observed.

Fig. 2. wine.xml dataset in original space (first 6 dimensions)

The SymScal procedure has been applied to these data. All calculations have been made in the R statistical environment with the use of the SymbolicDA package written by the author. The results of multidimensional scaling are presented in figure 6. The STRESS-Sym loss function decreased from 476620 to 816.35 over 8 iteration steps.

1 Ausone; 2 Cheval Blanc; 3 Cos d'Estournel; 4 Ducru-Beaucaillou; 5 Haut-Brion; 6 Lafite-Rothschild; 7 Lafleur; 8 Latour; 9 Leoville Las Cases; 10 L'Evangile; 11 Lynch-Bages; 12 Margaux; 13 Mission Haut-Brion; 14 Montrose; 15 Mouton-Rothschild; 16 Petit Village; 17 Petrus; 18 Pichon C. de Lalande; 19 Pichon Longueville; 20 Sassicaia; 21 Trotanoy.


VI. FINAL REMARKS

Methods of multidimensional scaling can be adapted to symbolic interval data. There are three main methods of multidimensional scaling for symbolic interval data: Interscal, SymScal and I-Scal. The main weakness of these methods is the lack of an objective measure of the quality of the transformation.

Another open issue is the adaptation or development of new methods of multidimensional scaling for other types of symbolic data (nominal and multinominal, categorical data, distributions).

REFERENCES

Billard L., Diday E. (2006), Symbolic data analysis. Conceptual statistics and data mining, Wiley, Chichester.

Bock H.-H., Diday E. (eds.) (2000), Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data, Springer Verlag, Berlin.

Denoeux T., Masson M. (2000), Multidimensional scaling of interval-valued dissimilarity data, Pattern Recognition Letters, vol. 21, issue 1, 83-92.

Groenen P. J. F., Winsberg S., Rodriguez O., Diday E. (2005), SymScal: Symbolic Multidimensional Scaling of Interval Dissimilarities, Econometric Report EI 2005-15, Erasmus University, Rotterdam.

Groenen P. J. F., Winsberg S., Rodriguez O., Diday E. (2006), I-Scal: Multidimensional scaling of interval dissimilarities, Computational Statistics & Data Analysis, vol. 51, issue 1, 360-378.

Kruskal J. B. (1964a), Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Psychometrika, 29, 1-27.

Kruskal J. B. (1964b), Nonmetric multidimensional scaling: A numerical method, Psychometrika, 29, 115-129.

Andrzej Dudek

MULTIDIMENSIONAL SCALING FOR SYMBOLIC INTERVAL DATA

The basic aim of multidimensional scaling is to represent relationships among objects in a multidimensional space as distances in a 2- or 3-dimensional space. The input to multidimensional scaling procedures is usually a symmetric square matrix indicating relationships (similarities or dissimilarities) among the objects of a certain set. There are many techniques of classical multidimensional scaling, but all of them require that the individual cells of this matrix contain single numeric values.

Denoeux and Masson (2000) proposed extending classical multidimensional scaling to symbolic data in the form of numeric intervals. The input to the INTERSCAL algorithm they developed is a table containing the minimal and maximal distances between the hyper-rectangles representing objects. The same approach appears in the SYMSCAL and I-SCAL algorithms proposed by Groenen et al. (2005).

The article presents the most important algorithms of multidimensional scaling for symbolic data in the form of numeric intervals, together with examples of their application to symbolic data taken from the repository