
(1)

Unsupervised Competitive Learning: Kohonen’s Self-Organizing Maps (SOM)

AGH University of Science and Technology

Krakow, Poland

Adrian Horzyk

horzyk@agh.edu.pl

ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE AND

KNOWLEDGE ENGINEERING

(2)

SELF-ORGANIZING MAPS

Kohonen’s SOM network (invented in 1982 by Teuvo Kohonen):

• is a network with a regular structure consisting of nodes, each of which represents a group of similar objects from the input space;

• represents multidimensional data in a lower-dimensional space (usually 2D or 3D), which does not always allow groups of similar objects to be represented by neighboring nodes;

• is adapted using unsupervised competitive learning, i.e. trained without a teacher (without any indication of the learning target, such as a class or output values), in such a way that the nodes compete with each other for the representation of the input data;

• relies on the neighborhood of nodes, which plays a significant role during the adaptation of the network: nodes close to the winner are also adapted, which leads to the automatic reorganization of the representation of objects by the Kohonen network during the learning process;

• places nodes that represent similar groups (clusters) of objects in close proximity on the map;

• resembles neural networks only to a small extent, because the inputs to the nodes of this network are not weighted and summed, but compared to the weight vectors using a distance measure (usually the Euclidean distance), as in the sketch below;

• stores in each node's weight vector (approximately) the average of the patterns represented by that node, i.e. the patterns for which this node becomes the winner (the closest node).
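The distance-based matching described above can be sketched in a few lines of Python/NumPy. This is only an illustrative sketch, not the original lecture code; the array names and values are assumed for the example.

```python
import numpy as np

# A single SOM node holds a weight vector of the same length as the input.
w = np.array([0.2, 0.7, 0.1])      # weight vector of one node (assumed values)
x = np.array([0.3, 0.6, 0.2])      # input pattern (assumed values)

# No weighted sum is computed; the node's response is its distance to the input.
distance = np.linalg.norm(x - w)   # Euclidean distance ||x - w||
print(distance)                    # the node with the smallest distance wins
```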

(3)

SAMPLE SOM NETWORK

The SOM network was adapted using data describing the quality of life in different countries of the world. The resulting map aggregates the representation of the countries and places countries with a similar quality of life in neighboring nodes, e.g.:

• Poland (POL) is represented by the same node as Hungary (HUN) and Portugal (PRT); to its left there is a nearby node representing the Czech Republic (CSK).

• Developed countries with a higher quality of life, i.e. the USA and Canada, are located in yellow nodes, while western European countries, i.e. Germany (DEU) and France (FRA), are represented by one yellow-orange node.

(4)

CREATING A MAP FOR TRAINING DATA

The x points from the input space are mapped to points I(x) in the output space.

(5)

SELF-ORGANIZING MAPS

Trying to map the 3D space of RGB colors into a 2D space, we are not able to place all similar colors side by side:

Yellow should lie between red and green.

Pink should lie between red and blue.

Blue-green (cyan) shades should lie between green and blue.

As you can see in the pictures below, there is no way to perfectly project a 3D space onto a 2D one!

(6)

COMPETITION BETWEEN NODES

Based on the similarity of the input patterns, the SOM network updates the weights of its nodes on a competitive learning basis, in such a way that the node whose weights are closest to the input data (the most similar) in the sense of the adopted metric (e.g. the Euclidean metric) becomes the winner.

The weights of the winning node are updated most strongly towards the input data (i.e. the distance is reduced the most), while the weights of the neighboring nodes are updated more weakly, the farther they are from the winner.

By also modifying the weights of the neighboring nodes, we make it possible for neighboring SOM nodes to represent similar groups of samples:

Figure: rectangular and hexagonal grid topologies.

(7)

COMPETITION BETWEEN NODES

Let us consider learning patterns in the form of vectors X_k = {x_1, x_2, …, x_n}, whose elements are represented by the n input nodes of the SOM network.

Each input node is connected to every node of the output map.

We initialize the weights with random values, so that the nodes initially react most strongly to different combinations of the input data X_k = {x_1, x_2, …, x_n}.

Figure: the input nodes are connected to all output map nodes; the winner and its neighboring nodes are marked.
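A minimal sketch of this setup in Python/NumPy (the sizes are assumed for the example): the weights connecting the n input nodes to an I x J output map form an I x J x n array initialized with small random values.

```python
import numpy as np

n = 3            # number of input nodes (length of each pattern X_k), assumed
I, J = 10, 10    # size of the 2D output map, assumed

rng = np.random.default_rng(0)
# One weight vector of length n per output node: shape (I, J, n).
weights = rng.uniform(0.01, 0.1, size=(I, J, n))
```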

(8)

SOM NETWORK LEARNING ALGORITHM

1. Build the output node map (usually 2D) and define the neighborhood relationship (square, hexagonal, star, radial, rhomboid/diamond).

2. Initialize the weights of each node with small random numbers different from 0.

3. Take the next (or a random) vector Xk = {x1, x2, …, xn} from the training set X1, …, XK.

4. Calculate the output value of every node of the SOM map, using the selected distance (usually Euclidean, but the Manhattan, Mahalanobis or another distance can also be used), according to the following formula, which determines the distance of the input pattern to the node's weight vector:

d(X_k, W_{i,j}(t)) = ||X_k − W_{i,j}(t)||²   for i = 1, …, I and j = 1, …, J

5. Determine which node is closest to the input pattern Xk = {x1, x2, …, xn}.

6. Update the weights of the winning node so that they better match the input vector (they move towards it).

7. Update the weights of the winner's neighboring nodes in a similar manner, but with a decreasing coefficient: the farther they are from the winning node, the less their weights are updated.

8. Return to step 3 until all learning patterns are represented by the network with sufficient accuracy, which means that all learning patterns have to be shown to the network repeatedly, many times.
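Steps 1 to 8 can be sketched compactly in Python/NumPy. This is an illustrative implementation under assumed names and parameter values (grid size, decay constants), not the original lecture code; it uses a Gaussian neighborhood and exponentially decaying radius and learning rate, matching the formulas given later in these slides.

```python
import numpy as np

def train_som(X, I=10, J=10, steps=5000, sigma0=None, gamma0=1.0, alpha=1000.0, seed=0):
    """Train a 2D SOM on data X of shape (K, n); returns weights of shape (I, J, n)."""
    rng = np.random.default_rng(seed)
    K, n = X.shape
    W = rng.uniform(0.01, 0.1, size=(I, J, n))          # step 2: small random weights
    if sigma0 is None:
        sigma0 = max(I, J)                               # initial neighborhood radius
    ii, jj = np.meshgrid(np.arange(I), np.arange(J), indexing="ij")

    for t in range(steps):
        x = X[rng.integers(K)]                           # step 3: random training vector
        d = np.sum((W - x) ** 2, axis=2)                 # step 4: squared distances to all nodes
        a, b = np.unravel_index(np.argmin(d), d.shape)   # step 5: winning node

        sigma = sigma0 * np.exp(-t / alpha)              # shrinking neighborhood radius
        gamma = gamma0 * np.exp(-t / alpha)              # decaying learning rate
        grid_d2 = (ii - a) ** 2 + (jj - b) ** 2          # squared grid distance to the winner
        delta = np.exp(-grid_d2 / (2.0 * sigma ** 2))    # neighborhood coefficient

        # steps 6-7: move the winner strongly, its neighbors more weakly
        W += (delta * gamma)[:, :, None] * (x - W)
    return W                                             # step 8: many repeated presentations
```

For instance, W = train_som(np.random.rand(200, 3)) would map 200 random 3-dimensional vectors onto an assumed 10 x 10 grid.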

(9)

SOM SELF-ORGANIZATION PROCESS

Suppose we have four data points (crosses) in our continuous 2D input space, and want to map them onto four points in a discrete 1D output space. The output nodes map to points in the input space (circles). Random initial weights start the circles at random positions in the center of the input space.

We randomly pick one of the data points for training (cross in a circle). The closest output point represents the winning node (solid diamond). That winning node is moved towards the data point by a certain amount, and the two neighboring nodes move by smaller amounts (small arrows).

(10)

SOM SELF-ORGANIZATION PROCESS

Next, we randomly pick another data point for training (cross in a circle). The closest output point gives the new winning node (solid diamond). The winning node moves towards the data point by a certain amount, and the one neighboring node moves by a smaller amount (arrows).

We carry on randomly picking data points for training (cross in a circle). Each winning node moves towards the data point by a certain amount, and its neighboring node(s) move by smaller amounts (small arrows). Eventually, the whole output grid unravels itself to represent the input space.
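The 1D toy example above can be reproduced with a few lines of Python/NumPy. This is only an illustrative sketch with assumed data points and decay constants, not the original lecture material.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.random((4, 2))                                    # four 2D data points (assumed)
nodes = np.full((4, 2), 0.5) + rng.normal(0, 0.01, (4, 2))   # 1D chain of 4 nodes near the center

for t in range(2000):
    x = data[rng.integers(len(data))]                        # pick a random data point
    win = np.argmin(np.sum((nodes - x) ** 2, axis=1))        # winning node on the chain
    sigma = 2.0 * np.exp(-t / 500)                           # shrinking neighborhood radius
    gamma = 0.5 * np.exp(-t / 500)                           # decaying learning rate
    dist = np.abs(np.arange(len(nodes)) - win)               # distance along the 1D chain
    delta = np.exp(-dist ** 2 / (2 * sigma ** 2))
    nodes += gamma * delta[:, None] * (x - nodes)            # winner moves most, neighbors less

print(nodes)   # the chain has unravelled to cover the four data points
```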

(11)

WINNER DESIGNATION

We successively compute the Euclidean distances of the individual training patterns X1, …, XK to all weight vectors W_{1,1}, …, W_{I,J}, where I × J is the number of nodes in the 2D SOM network:

d(X_k, W_{i,j}(t)) = ||X_k − W_{i,j}(t)||²   for i = 1, …, I and j = 1, …, J

While calculating these distances, we simultaneously determine the arguments (i, j) of the weight vector nearest (of the shortest distance) to each training pattern:

(a, b) = argmin_{(i,j)} d(X_k, W_{i,j}(t))

This node becomes the winner: it updates its weights the most, and the distances of the neighboring nodes to it determine how strongly their weights are updated (the farther from the winner, the weaker the update).
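The winner search translates directly into NumPy's argmin. A short hedged sketch follows (names are assumed; W has shape (I, J, n) as in the earlier sketches).

```python
import numpy as np

def find_winner(W, x):
    """Return the grid coordinates (a, b) of the node whose weights are closest to x."""
    d = np.sum((W - x) ** 2, axis=2)            # squared Euclidean distance to every node
    return np.unravel_index(np.argmin(d), d.shape)
```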

(12)

PARAMETERS OF SOM NETWORK ADAPTATION

During the learning of the SOM network, the range of updated neighbors gradually decreases. At the beginning, the neighborhood update range is large (it can even cover the entire network) and then it gradually narrows down. Mathematically, this can be expressed by changing the neighborhood radius around the winner as a function of the time t elapsed since the beginning of learning and a narrowing constant α, e.g. α = 1000:

σ(t) = σ₀ · e^(−t/α)

where σ₀ specifies the initial radius, which at the beginning can cover even the entire network of nodes, i.e. σ₀ = max(I, J).

Then we calculate the coefficient δ(t), which depends on the distance of the node N_{i,j}(t) from the winner N_{a,b}(t), where N_{i,j}(t) is the node placed at position (i, j) in the 2D grid (matrix):

δ(t) = e^( −d(N_{i,j}(t), N_{a,b}(t))² / (2·σ²(t)) )

Similarly, we determine the coefficient of the strength of the adaptation of the weights:

γ(t) = γ₀ · e^(−t/α),   e.g. γ₀ = 1

We update the weights in the next discrete time step t+1 according to:

W_{i,j}(t+1) = W_{i,j}(t) + δ(t) · γ(t) · (X_k − W_{i,j}(t))
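These formulas translate almost line for line into code. The following hedged sketch (Python/NumPy, constant values assumed) computes σ(t), γ(t) and the neighborhood coefficient δ(t), and applies one weight update for a given winner.

```python
import numpy as np

def update_step(W, x, a, b, t, sigma0, gamma0=1.0, alpha=1000.0):
    """Apply one SOM weight update for input x and winner (a, b) at time step t."""
    I, J, _ = W.shape
    sigma = sigma0 * np.exp(-t / alpha)                   # sigma(t) = sigma0 * exp(-t/alpha)
    gamma = gamma0 * np.exp(-t / alpha)                   # gamma(t) = gamma0 * exp(-t/alpha)
    ii, jj = np.meshgrid(np.arange(I), np.arange(J), indexing="ij")
    d2 = (ii - a) ** 2 + (jj - b) ** 2                    # squared grid distance to the winner
    delta = np.exp(-d2 / (2.0 * sigma ** 2))              # Gaussian neighborhood coefficient
    return W + (delta * gamma)[:, :, None] * (x - W)      # W(t+1) = W(t) + delta*gamma*(X_k - W(t))
```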

(13)

COMPUTING THE DISTANCES BETWEEN NODES

Calculation of the distance of the nodes N_{i,j}(t) from the winner N_{a,b}(t) depends on the adopted grid and on the distance measure (Euclidean, Manhattan or another metric), e.g.:

d(N_{i,j}(t), N_{a,b}(t)) = |i − a| + |j − b|   (Manhattan)

d(N_{i,j}(t), N_{a,b}(t)) = √((i − a)² + (j − b)²)   (Euclidean)

There are various types of node grids and different ways to determine the closest neighbors:

Figure: rectangular and hexagonal grids.
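A hedged sketch of the two grid-distance variants in Python (plain functions of the grid indices, names assumed):

```python
def manhattan_grid_distance(i, j, a, b):
    """Manhattan distance between node (i, j) and the winner (a, b) on the grid."""
    return abs(i - a) + abs(j - b)

def euclidean_grid_distance(i, j, a, b):
    """Euclidean distance between node (i, j) and the winner (a, b) on the grid."""
    return ((i - a) ** 2 + (j - b) ** 2) ** 0.5
```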

(14)

NETWORK TYPE AND DIMENSION

Choose the appropriate grid of nodes and its dimension:

Specify the number of independent attributes m in the input data vectors / matrices, where 1 ≤ m ≤ n.

In order to enable SOM networks not only to define sample groups, but also to correctly map the distances between the groups, the dimension v of the grid space onto which the input patterns are projected should be chosen accordingly.

This dimension certainly cannot be smaller than the number of independent attributes, i.e. m ≤ v ≤ n.

Therefore, you can start with the dimension m and, if necessary, gradually increase it, but not beyond the dimension n.

Recognize whether the grid size (the number of nodes) is right or wrong:

The size of the network is inadequate if, as a result of many adaptations starting from randomly drawn initial weights, the distances between the winner nodes representing the strongest classes differ significantly.

These nodes do not have to be in the same places, but they should be at roughly the same distances from each other.

(15)

HOW TO CHOOSE PARAMETERS?

From the point of view of adaptation of the SOM network, it is important to choose the appropriate parameters:

• the range of the initially randomly selected small positive weights (different from zero); this range should preferably span from one hundredth to one tenth of the variation range of the given attribute, e.g. if x_i^min ≤ x_i ≤ x_i^max, then the weights should be drawn from the range (see the sketch after this list):

(x_i^max − x_i^min)/100 ≤ w_i ≤ (x_i^max − x_i^min)/10

• the independence of the input data attributes Xk = {x1, x2, …, xn} (where n is the size of the input data),

• the speed of adaptation and the selection of the coefficient α, which controls how the neighborhood radius and the learning strength decrease with the learning time t,

• the number of degrees of freedom (the type of grid and its size), i.e. potentially the number of nearest neighbors with a similar degree of closeness,

• the number of grid output nodes, which should be significantly smaller than the number of learning patterns in order to "force" the network to represent subsets of similar patterns by the same nodes (compression of the representation); the number of these nodes cannot be smaller than the number of expected classes or the dimensionality of the data.

If the number of desirable classes / groups c is known, then the initial number of nodes can be set to c^m, forming an m-dimensional hypercube (matrix) of nodes with side c, instead of creating a 2D grid.
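A hedged sketch of the weight-initialization rule from the first bullet above (Python/NumPy, data array X of shape (K, n) assumed): each weight component is drawn between 1/100 and 1/10 of the value range of its attribute.

```python
import numpy as np

def init_weights(X, I, J, seed=0):
    """Draw initial SOM weights per attribute from [range/100, range/10]."""
    rng = np.random.default_rng(seed)
    span = X.max(axis=0) - X.min(axis=0)          # variation range of each attribute
    low, high = span / 100.0, span / 10.0
    return rng.uniform(low, high, size=(I, J, X.shape[1]))
```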

(16)

EXPLORATION >>> EXPLOITATION

First, the grid must group similar patterns, so the learning area is wide (exploration); then it is narrowed down to accelerate the learning and the specialization of the individual network nodes (exploitation).

The effect of narrowing the learning area too quickly, i.e. of a too fast switch to exploitation and into operation, is the separation of similar sample groups.

(17)

NETWORK DIMENSION AND SIZE

How to recognize that the number of grid nodes is insufficient?

If, as a result of learning over many adaptations that start from differently drawn initial weights, the winners differ noticeably from run to run, this difficulty can be resolved by gradually increasing the number of nodes, but to no more than half the number of patterns, and for very large sets of patterns to no more than their square root, i.e. √(number of patterns).

The method of choosing the number of nodes and the size of the grid:

1. We start with a grid of dimension m equal to the number of independent attributes; if this number is unknown, we start with the value 1, 2 or √(number of attributes).

2. The initial number of nodes per dimension, c, is selected according to the number of expected / desired classes / groups, which gives c^m nodes for the entire grid.

3. Next, we run the learning process, examining and comparing the numbers of representatives of the individual nodes after a certain number of steps, e.g. 1000, restarting the learning of the SOM network many times (e.g. 10 times) from differently drawn initial weights.

4. If we obtain significantly different numbers of representatives for the individual output nodes, we increase the number of nodes in one or several dimensions, e.g. by 1.

(18)

NETWORK DIMENSION AND SIZE

5. When the number of representatives (patterns) of the individual nodes (i.e. the number of patterns represented by the individual nodes) stabilizes as far as possible, it is worth calculating the distances between the winners as the smallest number of transitions (hops) separating these nodes. It is worth comparing those winners that represent similar sets of input patterns of similar size.

6. If these distances differ significantly, it means that the dimension m of the hyperspace onto which we have projected the set of learning patterns is insufficient to correctly map the relations between the groups / classes.

7. It is therefore necessary to increase the dimension of the hyperspace to m + 1, simultaneously reduce the number of nodes in all dimensions, and start again from point 2.

8. When the distances between the winners representing the largest numbers of patterns stabilize over several learning processes started from different random weights, we have most likely achieved the proper size of the SOM subspace and the correct number of nodes.

9. In addition, you can still experiment with the way the neighborhood is determined in such a grid, i.e. whether only the orthogonal neighborhood is considered or also the diagonal one. Generally, the diagonal neighborhood should give better results and should not require increasing the size of the SOM grid too much, which speeds up learning.

Such an adaptation process may require several dozen or even several hundred adaptations of the SOM network, starting from various initial weights, in order to determine the correct (optimal) model.

(19)

Bibliography and Literature

1. Shyam M. Guthikonda, Kohonen Self-Organizing Maps, 2005.

2. Mat Buckland, http://www.ai-junkie.com/ann/som/som1.html

3. ET-Map, http://ai.eller.arizona.edu/research/dl/etspace.htm

4. Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, MIT Press, 2016, ISBN 978-1-59327-741-3, or PWN 2018.

5. Holk Cruse, Neural Networks as Cybernetic Systems, 2nd and revised edition.

6. R. Rojas, Neural Networks, Springer-Verlag, Berlin, 1996.

7. Convolutional Neural Network (Stanford).

8. M. Zeiler, R. Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014.

9. IBM: https://www.ibm.com/developerworks/library/ba-data-becomes-knowledge-1/index.html

AGH University of Science and Technology in Krakow, Poland, Adrian Horzyk, horzyk@agh.edu.pl

Google: Horzyk
