ACTA UNIVERSITATIS LODZIENSIS, FOLIA OECONOMICA 228, 2009

Andrzej Dudek*

MULTIDIMENSIONAL SCALING FOR SYMBOLIC INTERVAL DATA

Abstract. The aim of multidimensional scaling is to represent dissimilarities among objects in a high-dimensional space as distances in a low-dimensional (usually 2- or 3-dimensional) space. The input to a multidimensional scaling procedure is usually a square, symmetric matrix indicating relationships (similarities or dissimilarities) among a set of items. There are many techniques of classical multidimensional scaling, but all of them assume that each entry of the relationship matrix is a single numeric value.

Denoeux and Masson (2000) proposed extending multidimensional scaling to symbolic interval data. The input to their INTERSCAL algorithm is an interval dissimilarity table containing the minimum and maximum distances between the hyper-rectangles representing the objects. The same approach is used in the SYMSCAL and I-SCAL algorithms proposed by Groenen et al. (2005).

The article presents the main algorithms of multidimensional scaling for symbolic data in the form of intervals, along with some examples on datasets taken from the symbolic data repository (http://www.ceremade.dauphine.fr/~touati/sodas-pagegarde.htm).

Key words: Multidimensional scaling, visualization, symbolic data.

I. INTRODUCTION

Visualizing data in the form of illustrative diagrams and searching these diagrams for structures, clusters, trends, dependencies etc. is one of the main aims of multivariate statistical analysis. In the case of symbolic data (e.g. data in the form of single quantitative values, categorical values, intervals, multi-valued variables, or multi-valued variables with weights), some well-known methods are provided by suitable "symbolic" adaptations of classical methods such as principal component analysis, factor analysis or multidimensional scaling (MDS) (Kruskal, 1964a, 1964b). The main difference between classical methods and "symbolic" methods is the form of the data they deal with. In the case of symbolic interval data the input contains intervals instead of single numerical values.


This paper describes methods of multidimensional scaling of symbolic objects containing variables in the form of intervals (symbolic interval data). The aim of multidimensional scaling of symbolic interval data is the same as in the classical case: to represent dissimilarities among objects in an n-dimensional space as distances in a reduced 2- or 3-dimensional space. However, while in classical MDS points of the n-dimensional space are transformed into points of a p-dimensional space (p = 2 or 3), in multidimensional scaling of symbolic interval data hyper-rectangles of the higher-dimensional space are mapped to intervals in the lower-dimensional space.

Section II describes the form of the input and output data for algorithms of multidimensional scaling of symbolic interval data. Section III presents the three main methods of multidimensional scaling for symbolic interval data: Interscal, I-Scal and Symscal. Symscal is described in more detail in Section IV. Section V presents an example of the use of the Symscal algorithm on symbolic data taken from the symbolic data repository. Finally, Section VI gives some remarks and conclusions.

II. INPUT AND OUTPUT DATA

All methods of multidimensional scaling for symbolic interval data require a matrix of minimal and maximal distances between objects, in a form similar to formula (1).

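$$
\begin{bmatrix}
[\underline{\delta}_{11}, \overline{\delta}_{11}] & [\underline{\delta}_{12}, \overline{\delta}_{12}] & \cdots & [\underline{\delta}_{1n}, \overline{\delta}_{1n}] \\
[\underline{\delta}_{21}, \overline{\delta}_{21}] & [\underline{\delta}_{22}, \overline{\delta}_{22}] & \cdots & [\underline{\delta}_{2n}, \overline{\delta}_{2n}] \\
\vdots & \vdots & \ddots & \vdots \\
[\underline{\delta}_{n1}, \overline{\delta}_{n1}] & [\underline{\delta}_{n2}, \overline{\delta}_{n2}] & \cdots & [\underline{\delta}_{nn}, \overline{\delta}_{nn}]
\end{bmatrix} \tag{1}
$$

where:

$\underline{\delta}_{ij}$ - minimal distance between the i-th and j-th symbolic objects,

$\overline{\delta}_{ij}$ - maximal distance between the i-th and j-th symbolic objects,

n - number of symbolic objects.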

Sometimes this matrix is not given directly but has to be calculated from n symbolic objects described by intervals. In this case, according to Denoeux and Masson (2000), the minimal and maximal distances are computed according to formulas (2) and (3).

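$$
\underline{\delta}_{ij} = \frac{1}{2} \sqrt{\sum_{k=1}^{m} \left[ \max\left( 0,\; 2\left| \frac{\underline{x}_{ik} + \overline{x}_{ik}}{2} - \frac{\underline{x}_{jk} + \overline{x}_{jk}}{2} \right| - (\overline{x}_{ik} - \underline{x}_{ik}) - (\overline{x}_{jk} - \underline{x}_{jk}) \right) \right]^2} \tag{2}
$$

$$
\overline{\delta}_{ij} = \frac{1}{2} \sqrt{\sum_{k=1}^{m} \left[ (\overline{x}_{ik} - \underline{x}_{ik}) + (\overline{x}_{jk} - \underline{x}_{jk}) + 2\left| \frac{\underline{x}_{ik} + \overline{x}_{ik}}{2} - \frac{\underline{x}_{jk} + \overline{x}_{jk}}{2} \right| \right]^2} \tag{3}
$$

where:

$\underline{\delta}_{ij}$ - minimal distance between the i-th and j-th symbolic objects,

$\overline{\delta}_{ij}$ - maximal distance between the i-th and j-th symbolic objects,

$[\underline{x}_{ik}, \overline{x}_{ik}]$ - k-th variable of the i-th object (the beginning and the end of the interval),

m - number of symbolic variables describing each object.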
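As an illustration of how the interval dissimilarity table can be built from interval-valued objects, the following minimal Python sketch computes, for every pair of objects, the smallest and the largest Euclidean distance between two axis-aligned hyper-rectangles: the gap between interval centres reduced (but not below zero) or increased by the sum of the half-widths. The function names and the toy data are assumptions made for this example only; they are not taken from any package mentioned in the paper.

```python
from math import sqrt


def interval_distance_bounds(a, b):
    """Return (minimal, maximal) Euclidean distance between two hyper-rectangles.

    a and b are lists of (lower, upper) pairs, one pair per interval variable.
    """
    min_sq = 0.0
    max_sq = 0.0
    for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b):
        # Distance between the centres of the two intervals on this variable.
        centre_gap = abs((a_lo + a_hi) / 2 - (b_lo + b_hi) / 2)
        # Sum of the half-widths of the two intervals.
        half_widths = (a_hi - a_lo) / 2 + (b_hi - b_lo) / 2
        # Overlapping intervals contribute nothing to the minimal distance.
        min_sq += max(0.0, centre_gap - half_widths) ** 2
        max_sq += (centre_gap + half_widths) ** 2
    return sqrt(min_sq), sqrt(max_sq)


def interval_dissimilarity_table(objects):
    """Build the n x n table of (minimal, maximal) distances."""
    n = len(objects)
    return [
        [interval_distance_bounds(objects[i], objects[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: three objects described by two interval variables.
objects = [
    [(0.0, 1.0), (0.0, 1.0)],
    [(4.0, 5.0), (0.0, 1.0)],
    [(0.5, 2.0), (3.0, 4.0)],
]
table = interval_dissimilarity_table(objects)
```

Note that under this definition the maximal distance of an object to itself is the length of the diagonal of its hyper-rectangle, not zero, while the minimal distance is zero.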

The aim of multidimensional scaling of symbolic interval data is the same as in the classical case: to represent dissimilarities among objects in an n-dimensional space as distances in a 2- or 3-dimensional space. The output matrix therefore also contains minimal and maximal distances between objects and can be written in the form of (4).

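$$
\begin{bmatrix}
[\underline{d}_{11}, \overline{d}_{11}] & [\underline{d}_{12}, \overline{d}_{12}] & \cdots & [\underline{d}_{1n}, \overline{d}_{1n}] \\
[\underline{d}_{21}, \overline{d}_{21}] & [\underline{d}_{22}, \overline{d}_{22}] & \cdots & [\underline{d}_{2n}, \overline{d}_{2n}] \\
\vdots & \vdots & \ddots & \vdots \\
[\underline{d}_{n1}, \overline{d}_{n1}] & [\underline{d}_{n2}, \overline{d}_{n2}] & \cdots & [\underline{d}_{nn}, \overline{d}_{nn}]
\end{bmatrix} \tag{4}
$$

where:

$\underline{d}_{ij}$ - minimal distance between the i-th and j-th symbolic objects in the reduced space,

$\overline{d}_{ij}$ - maximal distance between the i-th and j-th symbolic objects in the reduced space,

n - number of symbolic objects.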

III. METHODS OF MULTIDIMENSIONAL SCALING FOR SYMBOLIC INTERVAL DATA

There are three main algorithms of multidimensional scaling for symbolic interval data: one non-iterative (Interscal) and two (I-Scal and Symscal) that search iteratively for the optimal value of a loss function.

The main steps of the Interscal method (Denoeux, Masson, 2000) can be stated as:

• Calculation of a modified matrix A with 2n rows and 2n columns, containing the minimal, maximal and average distances for every pair of objects.

• Calculation of B, the matrix of scalar products of the rows of A.

• Calculation of the eigenvectors v and eigenvalues λ of B.

• Calculation of $\underline{d}_{ij}$ and $\overline{d}_{ij}$.

The I-Scal and Symscal algorithms start from the matrices X (centers of intervals) and R (spreads of intervals) and search iteratively for the optimal value of the I-Stress or STRESS-Sym loss function, respectively.

IV. SYMSCAL

The idea of the Symscal algorithm is majorization of the loss function. In the first step the matrices X (centers of intervals) and R (spreads of intervals) are generated (usually at random, but Groenen et al. (2006) suggest using the Interscal algorithm to find the initial values of X and R).

In the second and subsequent steps, the STRESS-Sym measure is calculated according to formula (5):

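$$
\text{STRESS-Sym}(X, R) = \sum_{i<j} \omega_{ij} \left[ \overline{\delta}_{ij} - \overline{d}_{ij}(X, R) \right]^2 + \sum_{i<j} \omega_{ij} \left[ \underline{\delta}_{ij} - \underline{d}_{ij}(X, R) \right]^2 \tag{5}
$$

where:

X, R - centers of intervals and spreads of intervals,

$\omega_{ij}$ - weights (usually equal),

$\underline{\delta}_{ij}$ - minimal distance between the i-th and j-th symbolic objects,

$\overline{\delta}_{ij}$ - maximal distance between the i-th and j-th symbolic objects,

$\underline{d}_{ij}$, $\overline{d}_{ij}$ - minimal and maximal distances between the i-th and j-th symbolic objects in the reduced space, calculated from X and R according to (6) and (7),

$$
\overline{d}_{ij}(X, R) = \left[ \sum_{s=1}^{p} \left( |x_{is} - x_{js}| + (r_{is} + r_{js}) \right)^2 \right]^{1/2} \tag{6}
$$

$$
\underline{d}_{ij}(X, R) = \left[ \sum_{s=1}^{p} \max\left[ 0,\; |x_{is} - x_{js}| - (r_{is} + r_{js}) \right]^2 \right]^{1/2} \tag{7}
$$

$x_{is}$, $r_{is}$ - elements of X and R (center and spread of the i-th object in the s-th dimension),

p - dimensionality of the reduced space.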

The method of majorization of STRESS-Sym is described in detail in Groenen et al. (2006). The main weakness of this method is that STRESS-Sym is not a normalized measure. Thus one can observe the improvement of the loss function during the iteration process, but one cannot compare the quality of the transformation with that obtained for other datasets.
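To make the loss function concrete, the following Python sketch evaluates STRESS-Sym for a given configuration: it computes the minimal and maximal distances in the reduced space from the centers X and spreads R, and then sums the weighted squared differences from the input bounds over all pairs i < j. It only evaluates the loss; the majorization update of X and R is not shown. All names (`reduced_space_bounds`, `stress_sym`) and the toy values are assumptions made for illustration and do not reproduce the interface of the SymbolicDA package.

```python
from math import sqrt


def reduced_space_bounds(X, R, i, j):
    """Minimal and maximal distance between objects i and j in the reduced space."""
    lo_sq = 0.0
    hi_sq = 0.0
    for s in range(len(X[i])):
        gap = abs(X[i][s] - X[j][s])   # distance between interval centers
        spread = R[i][s] + R[j][s]     # sum of the two spreads
        lo_sq += max(0.0, gap - spread) ** 2
        hi_sq += (gap + spread) ** 2
    return sqrt(lo_sq), sqrt(hi_sq)


def stress_sym(delta_lo, delta_hi, X, R, weights=None):
    """Sum of weighted squared errors on both distance bounds over pairs i < j."""
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            w = 1.0 if weights is None else weights[i][j]  # equal weights by default
            d_lo, d_hi = reduced_space_bounds(X, R, i, j)
            total += w * (delta_hi[i][j] - d_hi) ** 2
            total += w * (delta_lo[i][j] - d_lo) ** 2
    return total


# Hypothetical toy configuration: two objects in a 2-dimensional reduced space.
X = [[0.0, 0.0], [4.0, 0.0]]
R = [[0.5, 0.5], [0.5, 0.5]]
delta_lo = [[0.0, 2.0], [2.0, 0.0]]   # assumed input minimal distances
delta_hi = [[0.0, 6.0], [6.0, 0.0]]   # assumed input maximal distances
stress = stress_sym(delta_lo, delta_hi, X, R)
```

Because the value is a raw sum of squared errors, it grows with the number of objects and with the scale of the distances, which illustrates why values obtained for different datasets are not directly comparable.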

V. EXAMPLE OF USE OF MULTIDIMENSIONAL SCALING FOR SYMBOLIC INTERVAL DATA

To illustrate the use of multidimensional scaling for symbolic interval data, the wine.xml set taken from the symbolic data repository has been used. This set contains 21 symbolic objects described by 23 symbolic interval variables. Figure 1 shows the original data.

Fig. 1. wine.xml dataset in original space

As one can see, there is no clear structure in the data. In fact, there are too many dimensions on the scatterplot to observe anything.


Figure 2 shows the data in the first 6 dimensions, but even with this limitation no structure can be observed.


Fig. 2. wine.xml dataset in original space (first 6 dimensions)

The Symscal procedure has been applied to these data. All calculations have been made in the R statistical environment with the use of the SymbolicDA package written by the author. The results of multidimensional scaling are presented in figure 6. The STRESS-Sym loss function decreased from 476620 to 816.35 over 8 iteration steps.

1 Ausone; 2 Cheval Blanc; 3 Cos d'Estournel; 4 Ducru-Beaucaillou; 5 Haut-Brion; 6 Lafite-Rothschild; 7 Lafleur; 8 Latour; 9 Leoville Las Cases; 10 L'Evangile; 11 Lynch-Bages; 12 Margaux; 13 Mission Haut-Brion; 14 Montrose; 15 Mouton-Rothschild; 16 Petit Village; 17 Petrus; 18 Pichon C. de Lalande; 19 Pichon Longueville; 20 Sassicaia; 21 Trotanoy


VI. FINAL REMARKS

Methods of multidimensional scaling can be adapted to symbolic interval data. There are three main methods of multidimensional scaling for symbolic interval data: Interscal, Symscal and I-Scal. The main weakness of these methods is the lack of an objective measure of the quality of the transformation.

Another open issue is the adaptation of existing methods, or the development of new methods, of multidimensional scaling for other types of symbolic data (nominal and multi-nominal data, categorical data, distributions).

REFERENCES

Billard L., Diday E. (2006), Symbolic data analysis. Conceptual statistics and data mining, Wiley, Chichester.

Bock H.-H., Diday E. (eds.) (2000), Analysis of symbolic data. Exploratory methods for extracting statistical information from complex data, Springer Verlag, Berlin.

Denoeux T., Masson M. (2000), Multidimensional scaling of interval-valued dissimilarity data, Pattern Recognition Letters, vol. 21, issue 1, 83-92.

Groenen P. J. F., Winsberg S., Rodriguez O., Diday E. (2005), SymScal: Symbolic Multidimensional Scaling of Interval Dissimilarities, Econometric Report EI 2005-15, Erasmus University, Rotterdam.

Groenen P. J. F., Winsberg S., Rodriguez O., Diday E. (2006), I-Scal: Multidimensional scaling of interval dissimilarities, Computational Statistics & Data Analysis, vol. 51, issue 1, 360-378.

Kruskal J. B. (1964a), Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Psychometrika, 29, 1-27.

Kruskal J. B. (1964b), Nonmetric multidimensional scaling: A numerical method, Psychometrika, 29, 115-129.

Andrzej Dudek

MULTIDIMENSIONAL SCALING FOR SYMBOLIC INTERVAL DATA

The basic aim of multidimensional scaling is to represent relationships between objects in a multidimensional space as distances in a 2- or 3-dimensional space. The input to multidimensional scaling procedures is usually a symmetric square matrix indicating relationships (similarities or dissimilarities) between the objects of a set. There are many techniques of classical multidimensional scaling, but all of them require that each cell of this matrix contains a single numeric value.

Denoeux and Masson (2000) proposed extending classical multidimensional scaling to symbolic data in the form of numeric intervals. The input to their INTERSCAL algorithm is a table containing the minimal and maximal distances between the hyper-rectangles representing the objects. The same approach is used in the SYMSCAL and I-SCAL algorithms proposed by Groenen et al. (2005).

The article presents the most important algorithms of multidimensional scaling for symbolic data in the form of numeric intervals, together with examples of their application to symbolic data taken from the repository
