Andrzej Dudek*
KOHONEN SELF-ORGANIZING MAPS FOR SYMBOLIC OBJECTS
ABSTRACT. Visualizing data in the form of illustrative diagrams and searching these diagrams for structures, clusters, trends, dependencies, etc. is one of the main aims of multivariate statistical analysis. In the case of symbolic data (e.g. data in the form of single quantitative values, categorical values, intervals, multi-valued variables, or multi-valued variables with weights), some well-known methods are provided by suitable 'symbolic' adaptations of classical methods such as principal component analysis or factor analysis. An alternative visualization of symbolic data is obtained by constructing a Kohonen map. Instead of displaying the individual items k = 1, ..., n as n points or rectangles in a two-dimensional space, the n items are first clustered into a number m of mini-clusters, and these mini-clusters are then assigned to the vertices of a rectangular lattice of points in the plane such that 'similar' clusters are represented by neighbouring vertices in the lattice.
The article presents an algorithm for creating Kohonen self-organizing maps for symbolic objects, along with some examples on data sets taken from the symbolic data repository (http://www.ceremade.dauphine.fr/~touati/sodas-pagegarde.htm).
Key words: Classification, visualization, symbolic data, neural networks.
I. INTRODUCTION
Self-organizing maps (SOMs) are a data visualization technique invented by Professor Teuvo Kohonen which reduces the dimensionality of data through the use of self-organizing neural networks. They can also be treated as a classification method, since mini-clusters of objects are assigned to the nodes of a rectangular lattice. Kohonen's algorithm was developed for "traditional" numerical data; recently, El Golli, Conan-Guez and Rossi [2004] proposed extensions of Kohonen's algorithm adapting SOMs to data in the form of intervals. The first part of this paper explains how self-organizing maps are created and locates SOMs among the methods of multivariate statistical analysis. The second part is an
* Ph.D., Chair of Econometrics and Informatics, University of Economics, Wrocław.
introduction to symbolic data analysis: symbolic objects and symbolic variables are described, and dissimilarity measures for symbolic objects are presented. The third part describes modifications of the original Kohonen algorithm that extend SOMs to data in the form of intervals. The fourth part presents examples of creating a SOM from interval-valued data and of using a Kohonen map for symbolic objects as a discriminant analysis technique. Finally, some conclusions and remarks are given.
II. KOHONEN'S SELF-ORGANIZING MAPS AMONG OTHER TECHNIQUES OF MULTIVARIATE STATISTICAL ANALYSIS
Self-organizing maps are a data visualization technique, but they can also be treated as a classification method as well as a branch of neural networks. The method assumes that objects are first clustered into m mini-clusters, and these mini-clusters are then assigned to the vertices of a rectangular lattice of points in the plane such that 'similar' clusters are represented by neighbouring vertices in the lattice.
The algorithm of self-organizing map creation can be described in four main steps (Kohonen [1997]).
1. Cluster prototypes (centers) are defined randomly or in a deterministic way (for example, using eigenvectors of principal components).
2. Iteratively, each input object (x) is assigned to the cluster whose prototype is at the smallest distance. The squared Euclidean distance is used for this purpose (which is important for later considerations, because the Euclidean distance is not defined for symbolic data).
3. Prototypes of the clusters are re-calculated according to formula (1):
p_i ← p_i + α_k · h(c(x), i) · (x − p_i),   (1)

where: p_i – cluster prototype; α_k – learning factor in the k-th iteration step, typically decreasing with k; h(·) – neighbourhood kernel function, typically a threshold, Gaussian, Epanechnikov or exponential kernel; c(x) – the cluster to which x is assigned.
4. Steps two and three are repeated until a convergence criterion is fulfilled. The final effect can be presented in the form of a mini-cluster map, as in Figure 1.
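The four steps above can be sketched in code. The following is a minimal, illustrative implementation (not the R Kohonen library used for Figure 1); the function name, grid size and learning-rate schedule are arbitrary choices, and a Gaussian neighbourhood kernel h is assumed.

```python
import numpy as np

def train_som(X, grid_rows=4, grid_cols=4, n_iter=2000,
              lr0=0.5, sigma0=1.5, seed=0):
    """Minimal SOM training loop following steps 1-4 above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step 1: initialise prototypes randomly within the data range
    protos = rng.uniform(X.min(axis=0), X.max(axis=0),
                         size=(grid_rows * grid_cols, d))
    # Fixed lattice coordinates of each node, used by the kernel h(.)
    grid = np.array([(i, j) for i in range(grid_rows)
                     for j in range(grid_cols)], dtype=float)
    for k in range(n_iter):
        frac = k / n_iter
        lr = lr0 * (1.0 - frac)              # alpha_k decreases with k
        sigma = sigma0 * (1.0 - frac) + 0.1  # neighbourhood radius shrinks
        x = X[rng.integers(n)]
        # Step 2: winner c(x) = node with smallest squared Euclidean distance
        c = int(np.argmin(((protos - x) ** 2).sum(axis=1)))
        # Step 3: p_i <- p_i + alpha_k * h(c(x), i) * (x - p_i), Gaussian h
        h = np.exp(-((grid - grid[c]) ** 2).sum(axis=1) / (2 * sigma ** 2))
        protos += lr * h[:, None] * (x - protos)
    return protos

# Usage: map three noisy 2-D clusters onto a 4x4 lattice
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.1, size=(30, 2)) for m in (0.0, 1.0, 2.0)])
prototypes = train_som(X)
```

In practice the convergence criterion of step 4 is here replaced by a fixed iteration budget, which is the common simplification in online SOM training.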
Figure 1. Examples of self-organizing maps.
Source: own research; graphics and calculations made in the R environment with the Kohonen library.
It is hard to assign SOMs to exactly one of the techniques of multivariate statistical analysis, but there are relations between Kohonen's maps and other methods, as shown in Figure 2.
[Figure 2 diagram: Kohonen's maps positioned among neural networks, multidimensional scaling, principal component analysis, discriminant analysis, classification methods and visualization methods.]
Figure 2. Self-organizing networks and other multivariate statistical analysis methods. Source: Own research based on Kohonen [1997], Bock [2004].
SOMs can be treated as a branch of neural networks due to the fact that the data are processed sequentially; no calculations are made on the data treated as a numerical matrix. It is also a clustering method, because mini-clusters are displayed in the nodes of the lattice and there is a classification process in the background. It is quite obvious that this is a visualization method, and because self-organizing maps reduce the number of dimensions there is an affinity to multidimensional scaling and principal component analysis. Last but not least, supervised Kohonen's maps can also be used as a discriminant method. An example of such use for symbolic data is presented in chapter V.
III. SYMBOLIC VARIABLES AND SYMBOLIC OBJECTS
Symbolic data, unlike classical data, are more complex than tables of numeric values. While Table 1 presents the usual data representation, with objects in rows, variables (attributes) in columns and a number in each cell, Table 2 presents symbolic objects with interval, set and text data.
Table 1
Classical data situation

X    Variable 1    Variable 2    Variable 3
1    1             108            11,98
2    1,3           123           -23,37
3    0,9            99            14,35

Source: own research.
Table 2
Symbolic data table

X    Variable 1    Variable 2         Variable 3    Variable 4
1    (0,9;0,9)     {106;108;110}       11,98        {blue;green}
2    (1;2)         {123;124;125}      -23,37        {light-grey}
3    (0,9;1,3)     {100;102;99;97}     14,35        {pale}

Source: own research.
Bock and Diday [2000] define five types of symbolic variables: single quantitative value, categorical value, interval, multi-valued variable, and multi-valued variable with weights.
Regardless of their type, variables in a symbolic object can also be (Diday [2002]): taxonomic – representing a hierarchical structure, hierarchically dependent, or logically dependent.
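As an illustration, one possible in-memory representation of a single symbolic object (one row of a table like Table 2) is sketched below; the class, field and variable names are hypothetical and not part of any symbolic-data library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    """Interval-valued variable, e.g. (0,9; 1,3)."""
    low: float
    high: float

# One symbolic object mixing the variable types defined by Bock and Diday:
# an interval, a single quantitative value, a multi-valued variable, and
# a multi-valued variable with weights.
symbolic_object = {
    "engine_size": Interval(0.9, 1.3),        # interval
    "score": 14.35,                           # single quantitative value
    "doors": {3, 5},                          # multi-valued variable
    "colour": {"blue": 0.7, "green": 0.3},    # multi-valued with weights
}
```

A categorical value would simply be a plain string; taxonomic or dependent variables would additionally need links between cells, which this flat dictionary does not capture.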
There are four main types of dissimilarity measures for symbolic objects (Malerba et al. [2000], Ichino and Yaguchi [1994]):
• Gowda, Krishna and Diday – mutual neighbourhood value, with no taxonomic variables implemented;
• Ichino and Yaguchi – a dissimilarity measure based on the Cartesian join and Cartesian meet operators, which extend the operators ∪ (union of sets) and ∩ (intersection of sets) to all data types represented in a symbolic object;
• De Carvalho measures – extensions of the Ichino and Yaguchi measure based on a comparison function (CF), an aggregation function (AF) and the description potential of an object;
• Hausdorff distance (for symbolic objects containing intervals).
For symbolic data containing only interval-type variables, the Hausdorff distance and the vertex-type distance (the sum of squares of all distances between corresponding vertices of the n-dimensional hyper-cubes defined by the n interval variables) are often used.
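These two distances can be sketched as follows. The exact aggregation over variables varies between authors, so the functions below are one plausible variant: city-block aggregation for the Hausdorff-type distance, and a brute-force enumeration of the 2^n hyper-cube vertices for the vertex-type distance.

```python
import itertools
import numpy as np

def hausdorff_interval(u, v):
    """Hausdorff-type distance between two interval vectors.

    u, v: arrays of shape (n, 2) holding [lower, upper] per variable.
    Per variable, the Hausdorff distance between intervals [a, b] and
    [c, d] is max(|a - c|, |b - d|); here distances are summed over
    variables (city-block aggregation).
    """
    return sum(max(abs(u[j, 0] - v[j, 0]), abs(u[j, 1] - v[j, 1]))
               for j in range(len(u)))

def vertex_distance(u, v):
    """Sum of squared Euclidean distances between corresponding vertices
    of the two n-dimensional hyper-cubes defined by the intervals."""
    n = len(u)
    total = 0.0
    for bits in itertools.product((0, 1), repeat=n):  # one vertex per choice
        pu = np.array([u[j, b] for j, b in enumerate(bits)])
        pv = np.array([v[j, b] for j, b in enumerate(bits)])
        total += float(((pu - pv) ** 2).sum())
    return total

# Two objects with two interval variables each
u = np.array([[0.9, 0.9], [1.0, 2.0]])
v = np.array([[1.0, 1.3], [0.5, 1.5]])
d_h = hausdorff_interval(u, v)   # 0.4 + 0.5 = 0.9
d_v = vertex_distance(u, v)      # 2 * (0.01 + 0.16 + 0.25 + 0.25) = 1.34
```

Since each bound appears in exactly half of the 2^n vertices, the vertex-type distance also equals 2^(n-1) times the sum of squared bound differences, which avoids the exponential enumeration for large n.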
IV. CREATION OF SOMS FOR SYMBOLIC DATA IN THE FORM OF INTERVALS
El Golli, Conan-Guez and Rossi [2004] proposed an extension of the original Kohonen algorithm which allows the creation of self-organizing maps for symbolic data containing intervals.
Two main innovations for SOMs for symbolic data are proposed. In the original Kohonen algorithm, the squared Euclidean distance is used in step 2 to assign the current object to the closest cluster. For data in the form of intervals, the Hausdorff distance or the vertex-type distance can be used instead.
The second change is that cluster prototypes are not points but hyper-cubes. Thus the prototype adjustment step (step 3) is repeated for each vertex of the hyper-cube defined by the intervals, and formula (1) recalculates the coordinates of each vertex separately.
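Because every vertex coordinate of a prototype hyper-cube is either a lower or an upper bound of one of its intervals, applying formula (1) to each vertex separately amounts to updating the two bounds of every interval. A minimal sketch of one such adaptation step is given below; the function name is hypothetical and a Gaussian neighbourhood kernel is assumed.

```python
import numpy as np

def update_interval_prototypes(protos, x, c, grid, lr, sigma):
    """One adaptation step of an interval-valued SOM.

    protos: (m, n, 2) array - m prototype hyper-cubes over n variables,
            the last axis holding the [lower, upper] bounds.
    x:      (n, 2) array - the current interval-valued object.
    c:      index of the winning node (found with a Hausdorff or
            vertex-type distance rather than squared Euclidean).
    grid:   (m, 2) lattice coordinates of the nodes.
    """
    # Gaussian neighbourhood kernel h around the winner c
    h = np.exp(-((grid - grid[c]) ** 2).sum(axis=1) / (2 * sigma ** 2))
    # Formula (1) applied to every bound (vertex coordinate) separately
    protos += lr * h[:, None, None] * (x - protos)
    return protos

# Usage on a 2x2 lattice with two interval variables, prototypes at zero
grid = np.array([(0, 0), (0, 1), (1, 0), (1, 1)], dtype=float)
protos = np.zeros((4, 2, 2))
x = np.array([[0.9, 1.3], [100.0, 110.0]])
protos = update_interval_prototypes(protos, x, c=0, grid=grid, lr=0.5, sigma=1.0)
```

The winner (node 0) moves halfway toward the object's bounds, while the neighbouring nodes move by the same rule damped by the kernel value h.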
V. EXAMPLES OF USE OF KOHONEN'S MAPS FOR SYMBOLIC OBJECTS
A symbolic data set containing information about car models has been used as input for constructing a self-organizing map. The result of this process is shown in Figure 3. The mini-cluster assignment is the following.
Cluster(1x1): Alfa 145, Vectra, Skoda Octavia; Cluster(1x2): Alfa 156, Rover 75; Cluster(1x3): Alfa 166, Lancia K, Mercedes Class C; Cluster(1x4): Aston Martin; Cluster(2x1): Audi A3; Cluster(2x2): Audi A6; Cluster(2x3): Audi A8, Maserati GT; Cluster(2x4): Bmw serie 3; Cluster(3x1): Bmw serie 5; Cluster(3x2): Bmw serie 7, Mercedes SL, Mercedes Class E, Porsche; Cluster(3x3): Ferrari, Mercedes Class S; Cluster(3x4): Punto, Lancia Y; Cluster(4x1): Fiesta, Nissan Micra, Corsa, Twingo, Rover 25, Skoda Fabia; Cluster(4x2): Focus, Passat; Cluster(4x3): Honda NSX; Cluster(4x4): Lamborghini.
Figure 3. Self-organizing map for the car.sds data set.
Source: own calculations with SODAS 2.5 software; the file car.sds comes from the symbolic data repository http://www.ceremade.dauphine.fr/~touati/sodas-pagegarde.htm.
In the second example, SOMs have been used for discriminant analysis. From a set of 177 objects containing information about French wines (taken from http://www.ceremade.dauphine.fr/~touati/sodas-pagegarde.htm), 120 objects were treated as the training set and 57 as the test set. The map after the learning process is shown in Figure 4.
Figure 4. Self-organizing map for the wine.sds data set after the learning process. Source: own calculations.
Table 3 presents the contingency table between the actual and predicted class assignments.

Table 3
Contingency table for the wine.sds file

       1      2      3
1     18      0      0
2      1     21      2
3      0      0     15

Source: own calculations.
The error ratio for this prediction is 5,2%, and the Rand and corrected Rand indices of agreement between the real and predicted cluster structures are 0,92 and 0,84, respectively.
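These figures can be re-derived from the contingency table. The sketch below recomputes the error ratio and the (corrected) Rand index directly from Table 3, using the standard pair-counting formulation; the results come out close to the reported values (error ratio 3/57 ≈ 5,3%, Rand ≈ 0,93, corrected Rand ≈ 0,84).

```python
import numpy as np
from math import comb

# Contingency table from Table 3 (rows: actual class, columns: predicted)
M = np.array([[18, 0, 0],
              [1, 21, 2],
              [0, 0, 15]])

n = int(M.sum())                            # 57 test objects
error_ratio = (n - int(np.trace(M))) / n    # off-diagonal share

# Pair counts for the Rand index
s = sum(comb(int(v), 2) for v in M.ravel())      # pairs together in both
a = sum(comb(int(v), 2) for v in M.sum(axis=1))  # pairs together: actual
b = sum(comb(int(v), 2) for v in M.sum(axis=0))  # pairs together: predicted
t = comb(n, 2)                                   # all pairs of objects

rand = 1.0 + (2 * s - a - b) / t                 # Rand index
expected = a * b / t                             # chance-expected agreement
ari = (s - expected) / (0.5 * (a + b) - expected)  # corrected (adjusted) Rand
```

The corrected Rand subtracts the agreement expected under random labelling, which is why it is noticeably lower than the raw Rand index for the same table.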
VI. FINAL REMARKS
After small modifications of the original algorithm, Kohonen self-organizing maps can be adapted to symbolic data in the form of intervals. It is a visualization method for this kind of data as well as a clustering method. Supervised SOMs can also be treated as a discriminant analysis technique for symbolic interval data.
An open issue worth further development is how to adapt the Kohonen algorithm to other symbolic data types (nominal and multi-nominal variables, categorical data, distributions).
REFERENCES

Bock H.-H., Diday E. (Eds.) (2000), Analysis of symbolic data. Exploratory methods for extracting statistical information from complex data, Springer Verlag, Berlin.
Bock H.-H. (2003), Clustering algorithms and Kohonen maps for symbolic data, Journal of the Japanese Society of Computational Statistics, 15.2, 217–229.
Diday E. (2002), An introduction to symbolic data analysis and the SODAS software, Journal of Symbolic Data Analysis, Vol. 1.
El Golli A., Conan-Guez B., Rossi F. (2004), Self Organizing Map and Symbolic Data, Journal of Symbolic Data Analysis, Vol. 2.
Ichino M., Yaguchi H. (1994), Generalized Minkowski Metrics for Mixed Feature-Type Data Analysis, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 24, No. 4, 698–707.
Malerba D., Esposito F., Gioviale V., Tamma V. (2001), Comparing Dissimilarity Measures for Symbolic Data Analysis, New Techniques and Technologies for Statistics (ETK-NTTS'01), 473–481.
Verde R. (2004), Clustering Methods in Symbolic Data Analysis, Classification, Clustering and Data Mining, Springer-Verlag, Berlin, 299–318.
Andrzej Dudek

KOHONEN SELF-ORGANIZING MAPS FOR SYMBOLIC OBJECTS

Visualizing data in the form of diagrams and searching these diagrams for structures, classes, trends, dependencies, etc. is one of the main tasks of multivariate statistical analysis. In the case of symbolic data (that is, data represented as numbers, numeric intervals, sets of categories, or sets of categories with weights), versions of well-known methods such as factor analysis or principal component analysis can be applied after certain modifications.

An alternative data visualization method is the Kohonen self-organizing map. Instead of displaying k = 1, ..., n objects in a two-dimensional space as points or rectangles, the objects are first divided into m mini-clusters, and these mini-clusters are then assigned to the vertices of a rectangular lattice in the plane in such a way that "similar" mini-clusters are assigned to neighbouring lattice vertices.

The article presents an algorithm for creating Kohonen maps for symbolic data, along with examples of its application to symbolic data taken from the repository http://www.ceremade.dauphine.fr/~touati/sodas-pagegarde.htm.