

2.2. Biologically inspired artificial intelligence methods

2.2.1. Artificial neural networks

Information processing in natural biological nervous systems has become the inspiration for building artificial structures with properties similar in some respects, although built of simplified elements (Tadeusiewicz 2007). The most complex biological information processor is, of course, the human brain, the only system complex enough to make the occurrence of self-consciousness possible.

Tadeusiewicz (1993) summarizes the physical parameters of the brain in the context of information processing speed. The human brain's volume is only 1.4 l, its surface is approximately 2000 cm², and its typical weight is around 1.5 kg. The part of the brain responsible for logical activity is the cerebral cortex, which is only 3 mm thick. Despite such compactness, the number of nerve cells in the brain is of the order of 10^10-10^11, and, what seems even more important, the number of connections (synapses) between neurons is between 10^14 and 10^15. The huge number of extremely small information processors (neurons) stands in contrast to the speed of operation of a single neuron. Typical nerve cell impulses have a frequency of 1-100 Hz, a duration of 1-2 ms, and a voltage of 100 mV. Therefore, the maximum speed of the brain, computed as the number of synapse switchings per second, reaches a rate of 10^15 synapses × 100 Hz = 10^17 operations/s. When the processing of sensory perception is considered, the fastest of the senses, the visual channel, operates at a speed of about 100 Mb/s (Tadeusiewicz 1993).

The history of artificial neural networks started with the work of McCulloch and Pitts (1943), who proposed the mathematical model of the artificial neuron (see Fig. 2.2:1) as an element operating according to

n_i = \sum_{j=0}^{n} w_{ij} x_j, \qquad y_i = \mathbf{1}(n_i), \qquad (2.2:1)

where n_i is the network excitation, x_j are the inputs for j = 1, 2, ..., n and x_0 = 1, w_ij are the weights (corresponding to synapses in biological nervous systems) connecting the receiving neuron i with the source neuron j, y_i is the output of the neuron, and 1(n) is the Heaviside step function, which is a discontinuous function whose value is zero for non-positive arguments and one for positive arguments. The Heaviside step function, proposed by McCulloch and Pitts for use in their artificial neuron, is one of the possible activation functions, i.e., functions which generate the output of the artificial neuron based on the value of the network excitation.

[Figure: inputs x_1, ..., x_m together with the constant input 1, weights w_0, w_1, ..., w_m, the summed excitation n, and the step activation 1(n) producing the output y]
Fig. 2.2:1. McCulloch-Pitts artificial neuron
Rys. 2.2:1. Sztuczny neuron McCullocha-Pittsa
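To make the operation of equation (2.2:1) concrete, the following minimal Python sketch implements a McCulloch-Pitts neuron; the AND-gate weights at the end are chosen purely for illustration and do not come from the original paper.

import numpy as np

def mp_neuron(x, w):
    # Excitation (2.2:1): n = sum_j w_j x_j, with the constant input x_0 = 1.
    n = w @ np.concatenate(([1.0], x))
    # Heaviside step activation 1(n): zero for n <= 0, one for n > 0.
    return 1.0 if n > 0 else 0.0

# Hypothetical weights realizing the logical AND of two binary inputs.
w = np.array([-1.5, 1.0, 1.0])               # w[0] acts on the constant input
print(mp_neuron(np.array([1.0, 1.0]), w))    # -> 1.0
print(mp_neuron(np.array([1.0, 0.0]), w))    # -> 0.0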

During the history of neural networks, other activation functions have been proposed, both linear and nonlinear, the most prominent being the sigmoid function, given by (Żurada 1992)

y_i = \frac{1}{1 + \exp(-\lambda n_i)}, \qquad (2.2:2)

where λ is a parameter responsible for the slope of the function around network excitation equal to zero. The sigmoid function is most often used due to its non-linearity, differentiability, and continuity. Also, for large values of λ, it approximates the Heaviside function arbitrarily closely.

By grouping artificial neurons with the sigmoid activation function in layers, a multilayer perceptron (MLP) network is obtained, which is the most universal neural network architecture. The neurons in each layer of the MLP are fully interconnected with the neurons of the next layer. The connections correspond to synapses in nervous systems, and they are implemented as vectors of weights. The input layer does not process any information; it serves only as a buffer. The last layer produces outputs which are considered the outputs of the whole MLP. Between the input and output layers, an arbitrary number of hidden layers can occur, although it is known (see for example Osowski 1996) that a network with two hidden layers can solve a classification problem in an arbitrarily complex feature space.
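A minimal sketch of an MLP forward pass is given below, under the assumptions that the sigmoid (2.2:2) is used in every processing layer and that the bias is realized, as in (2.2:1), by a constant input appended at each layer; the layer sizes and random weights are illustrative only.

import numpy as np

def sigmoid(n, lam=1.0):
    # Sigmoid activation (2.2:2) with slope parameter lambda.
    return 1.0 / (1.0 + np.exp(-lam * n))

def mlp_forward(x, layers):
    # layers: list of weight matrices, one per processing layer; the input
    # layer is only a buffer, so it has no weights of its own.
    o = np.asarray(x, dtype=float)
    for w in layers:
        o = sigmoid(w @ np.concatenate((o, [1.0])))   # 1.0 is the bias input
    return o

# Illustrative 2-4-1 network; the extra column in each weight matrix
# holds the bias weights.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 3)), rng.normal(size=(1, 5))]
print(mlp_forward([0.5, -0.2], layers))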

A few years after the proposition of the mathematical model of the first artificial neuron, Hebb (1949) proposed the coincidence rule for training such an element. Later, many different learning rules were developed, for both supervised and unsupervised learning. They all can be described as a product of two functions g and h, which together constitute a learning rule that in general can depend on the network excitation n_i, the desired output value d_i, the actual output O_j of the source neuron, and the weight w_ij. This general learning rule is given by

\Delta w_{ij} = g(n_i, d_i) \cdot h(O_j, w_{ij}). \qquad (2.2:3)

The unsupervised learning rule uses a function g in formula (2.2:3) which does not depend on d_i, while the supervised learning rule uses a function g which depends on the desired value d_i. For example, the unsupervised Hebb's rule, given by (Hebb 1949)

\Delta w_{ij} = \eta\, n_i O_j, \qquad (2.2:4)

is a special case of (2.2:3) with g = η n_i and h = O_j. Similarly, the Widrow and Hoff (1960) supervised delta rule, given by

\Delta w_{ij} = \eta\, (d_i - n_i)\, O_j, \qquad (2.2:5)

and applied to Adaptive Linear Elements (ADALINE), assumes g = η(d_i − n_i) and h = O_j. While ADALINE and Multiple ADALINE (MADALINE) were linear neural networks, Rosenblatt (1958) proposed the perceptron, which was a nonlinear network. In today's classification, Rosenblatt's perceptron is considered a much reduced version of the MLP network; however, it should be mentioned that it was in fact the first neural network ever implemented, and it was used for the recognition of alphanumeric characters. The perceptron was built as an electronic-electromechanical system, and Rosenblatt proved that if a solution of the problem exists, then the perceptron can be trained using a convergent algorithm.
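The two rules above can be stated compactly in code. The sketch below assumes a single linear element whose source outputs O_j are identified with its input signals x_j, and a learning rate η; both functions follow the general scheme (2.2:3).

import numpy as np

def hebb_update(w, x, eta=0.1):
    # Hebb's rule (2.2:4): g = eta * n_i, h = O_j; unsupervised.
    n = w @ x
    return w + eta * n * x

def delta_update(w, x, d, eta=0.1):
    # Widrow-Hoff delta rule (2.2:5): g = eta * (d_i - n_i), h = O_j.
    n = w @ x
    return w + eta * (d - n) * x

# One illustrative supervised step of an ADALINE-like linear element.
w = np.zeros(3)
x = np.array([1.0, 0.5, -1.0])
w = delta_update(w, x, d=1.0)
print(w)    # weights moved toward reproducing d for this input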

Two very fruitful decades for artificial neural networks ended with the famous book of Minsky and Papert (1969), criticizing the connectionist approach as appropriate only for linearly separable problems and, therefore, inappropriate for problems as simple as the exclusive OR function. This critique was addressed to one-layer artificial neural networks, but it resulted in a decade of stagnation of the whole field. The rebirth of interest in ANNs is connected with works showing that nonlinear multilayer networks are free from the limitations signaled by Minsky and Papert for one-layer perceptrons. The additional, decisive step toward contemporary artificial neural networks was the development of the back-propagation algorithm (Rumelhart, Hinton, and Williams 1986a, 1986b, and Rumelhart et al. 1992), an efficient method for supervised training of the MLP. The derivation of the back-propagation algorithm, implementing the steepest descent method, is presented below after Tadeusiewicz (1993) and Lawrence (1994).

Let {(x^(1), d^(1)), ..., (x^(L), d^(L))} be a training set. Observe that the superscripts in parentheses denote the number of the training fact for which the learning occurs. The error E computed for the whole training set is the sum of the errors for all training examples,

E = \sum_{l=1}^{L} E^{(l)}.

Definition 2.2:1 (Learning of the neural network)

The learning of the neural network is a minimization of error E in a space of weights wij.

Since even the simplest networks have a huge number of weights, this is a minimization of a scalar field over a space with hundreds (or thousands) of dimensions. To minimize E, the steepest descent, gradient-based method is used, in which each weight is changed in the direction opposite to the gradient of the error,

\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}. \qquad (2.2:8)

In practice, the weights are modified after each training fact with an appropriately smaller value of the parameter η, called the learning rate. This parameter should be a positive number, typically less than one. Too large a value of the learning rate can cause oscillations around the minimum of the error function, while too small a value results in slow convergence. When modification after each training fact is applied, then (2.2:8) should be replaced by an equation indexed by the training fact number l. Therefore,

\Delta w_{ij}^{(l)} = -\eta \frac{\partial E^{(l)}}{\partial w_{ij}}.

Using (2.2:1) it follows that

\frac{\partial E^{(l)}}{\partial w_{ij}} = \frac{\partial E^{(l)}}{\partial n_i^{(l)}} \frac{\partial n_i^{(l)}}{\partial w_{ij}} = \frac{\partial E^{(l)}}{\partial n_i^{(l)}}\, O_j^{(l)}.

Definition 2.2:2 (Generalized delta, after Lawrence 1994)

The generalized delta δ_i of neuron i for training example (l) is defined as the negative partial derivative of the error E^(l) with respect to the network excitation n_i^(l),

\delta_i^{(l)} = -\frac{\partial E^{(l)}}{\partial n_i^{(l)}}.

With this definition, the weight update for training fact (l) becomes

\Delta w_{ij}^{(l)} = \eta\, \delta_i^{(l)} O_j^{(l)}. \qquad (2.2:14)

By applying the chain rule, the generalized delta can be expressed as

\delta_i^{(l)} = -\frac{\partial E^{(l)}}{\partial O_i^{(l)}}\, \frac{\partial O_i^{(l)}}{\partial n_i^{(l)}},

and therefore, for the output layer R, after substituting the derivative of the sigmoid (logistic) activation function, the generalized delta is given as

\delta_i^{R(l)} = \lambda \left( d_i^{(l)} - O_i^{R(l)} \right) O_i^{R(l)} \left( 1 - O_i^{R(l)} \right). \qquad (2.2:16)

For hidden layers, the generalized delta has to be computed recursively. Let us start from the last hidden layer, with index R − 1. It follows that

\delta_i^{R-1(l)} = -\sum_{k} \frac{\partial E^{(l)}}{\partial n_k^{R(l)}}\, \frac{\partial n_k^{R(l)}}{\partial O_i^{R-1(l)}}\, \frac{\partial O_i^{R-1(l)}}{\partial n_i^{R-1(l)}}.

Using (2.2:1) it follows also that

\frac{\partial n_k^{R(l)}}{\partial O_i^{R-1(l)}} = w_{ki}^{R},

and for any layer r, the following recursive equation holds

\delta_i^{r(l)} = \lambda\, O_i^{r(l)} \left( 1 - O_i^{r(l)} \right) \sum_{k} \delta_k^{r+1(l)}\, w_{ki}^{r+1}. \qquad (2.2:21)

Equation (2.2:21) uses the back-propagation of generalized deltas in the neural network, which is the reason for the name of the whole algorithm. The back-propagation algorithm, described by equations (2.2:16), (2.2:21), and (2.2:14), is a universal but slowly convergent error minimization technique. Therefore, this method is often modified by the introduction of an inertial term called momentum (Tadeusiewicz 1993). Then equation (2.2:14) becomes

\Delta w_{ij}^{(l)}(p) = \eta\, \delta_i^{(l)} O_j^{(l)} + \alpha\, \Delta w_{ij}^{(l)}(p-1), \qquad (2.2:22)

or, in a version called exponential smoothing (Lawrence 1994),

\Delta w_{ij}^{(l)}(p) = (1 - \alpha)\, \eta\, \delta_i^{(l)} O_j^{(l)} + \alpha\, \Delta w_{ij}^{(l)}(p-1), \qquad (2.2:23)

where p is the step number and α is the momentum (smoothing) coefficient.
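As an illustration, here is a minimal Python sketch of one back-propagation step for a network with a single hidden layer, implementing (2.2:16), (2.2:21), and (2.2:14); biases and the momentum term (2.2:22) are omitted for brevity, and the shapes and data are assumptions made for the example.

import numpy as np

def sigmoid(n, lam=1.0):
    return 1.0 / (1.0 + np.exp(-lam * n))

def backprop_step(x, d, w_hid, w_out, eta=0.5, lam=1.0):
    # Forward pass through the hidden and output layers.
    o_hid = sigmoid(w_hid @ x, lam)
    o_out = sigmoid(w_out @ o_hid, lam)
    # Generalized delta of the output layer (2.2:16).
    delta_out = lam * (d - o_out) * o_out * (1.0 - o_out)
    # Back-propagated deltas of the hidden layer (2.2:21).
    delta_hid = lam * o_hid * (1.0 - o_hid) * (w_out.T @ delta_out)
    # Weight updates (2.2:14).
    return (w_hid + eta * np.outer(delta_hid, x),
            w_out + eta * np.outer(delta_out, o_hid))

# One illustrative step on a single training fact.
rng = np.random.default_rng(1)
w_hid, w_out = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
w_hid, w_out = backprop_step(np.array([0.0, 1.0]), np.array([1.0]),
                             w_hid, w_out)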

MLP networks trained with the back-propagation algorithm with inertial modifications have proved to be among the most universal networks, applicable to an enormous class of practical problems, from pattern recognition, through financial instrument prediction, to medical diagnosis support. However, these networks were not the only ANNs developed in the eighties of the twentieth century.

Hopfield (1982) designed a recurrent ANN capable of serving as an autoassociative memory and of heuristically solving the traveling salesman problem. This is a network with an associated Lyapunov energy function (Cohen and Grossberg 1983), which is minimized during the operation of the network. The structure of the Hopfield network is given in Fig. 2.2:2. It is worth mentioning that the operation of this network can also be expressed in terms of statistical mechanics, using the notion of the Hamiltonian to denote the energy function, as shown by Hertz, Krogh, and Palmer (1991).

Fig. 2.2:2. Hopfield's network
Rys. 2.2:2. Sieć Hopfielda

The operation of the discrete version of the Hopfield network is described by two formulae, as given in Korbicz, Obuchowicz, and Uciński (1994). The first is the formula used for computing the excitation of a neuron,

n_i^{(p)} = \sum_{j=1}^{N} w_{ij} O_j^{(p)} + t_i, \qquad (2.2:24)

where t_i is the threshold of the ith neuron. In equation (2.2:24) and in all other equations describing the Hopfield network, the superscripts in parentheses are used to denote the actual step number during the operation of the network, rather than the number of the training fact. The second formula describing the operation of the Hopfield network is used for the definition of the activation function. It follows that the output of the network depends on the network excitation as

O_i^{(p+1)} = \begin{cases} 1 & \text{for } n_i^{(p)} > 0, \\ O_i^{(p)} & \text{for } n_i^{(p)} = 0, \\ 0 & \text{for } n_i^{(p)} < 0. \end{cases} \qquad (2.2:25)

Note that equation (2.2:25) defines the activation function in the discrete Hopfield network, which is used as an autoassociative memory. This activation function is very similar to the Heaviside function 1(n); however, special attention is paid to the value of the function for n = 0. In this situation, the Hopfield network simply does not change the current output of the neuron, so the new state can be either 0 or 1, depending on the present value.

The next two important features which characterize the Hopfield network (see Korbicz, Obuchowicz, and Uciński 1994) include the lack of self-dependence,

w_{ii} = 0, \quad \forall i, \qquad (2.2:26)

and the symmetry of weights

w_{ij} = w_{ji}, \quad \forall i, j. \qquad (2.2:27)

As has been mentioned, with each Hopfield network the so-called energy function (a Lyapunov function) is associated. This is a function which has a finite lower bound and which is non-increasing during the evolution of the process considered (in our context, the process of state changes in the recurrent Hopfield network).

The operation is started for p = 0 by connecting the inputs to the processing units. Assuming that the input vector is x = [x_1, x_2, ..., x_N], x_i ∈ {0, 1}, it follows that O_i^(0) = x_i for i = 1, 2, ..., N. Then the input signals are disconnected and the recurrent operation of the network begins. This process satisfies equations (2.2:24) and (2.2:25). The network operates asynchronously, i.e., at a given moment each neuron can be chosen with equal probability, and only this neuron is activated. After a finite number of iterations, the network settles in a stable state, for which

O_i^{(p+1)} = O_i^{(p)}, \quad \forall i. \qquad (2.2:28)

This is a state corresponding to a local minimum of the energy function. This state is transmitted to the outputs of the network.

The energy function is chosen as (Korbicz, Obuchowicz, and Uciński 1994)

E(\mathbf{O}) = -\frac{1}{2}\, \mathbf{O}^T \mathbf{w}\, \mathbf{O} - \mathbf{t}^T \mathbf{O}, \qquad (2.2:29)

which, in scalar notation, is equivalent to

E(\mathbf{O}) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} O_i O_j - \sum_{i=1}^{N} t_i O_i. \qquad (2.2:30)
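Before turning to the formal properties, a minimal sketch of the asynchronous operation of the discrete Hopfield network is given below, implementing (2.2:24), (2.2:25), and (2.2:30); the two-neuron weights and thresholds are chosen only to illustrate that the energy does not increase.

import numpy as np

def energy(w, t, o):
    # Energy function in matrix/scalar form (2.2:29)-(2.2:30).
    return -0.5 * o @ w @ o - t @ o

def hopfield_step(w, t, o, k):
    # Excitation of neuron k (2.2:24) and activation rule (2.2:25);
    # for n_k = 0 the current output is kept unchanged.
    n_k = w[k] @ o + t[k]
    if n_k > 0:
        o[k] = 1.0
    elif n_k < 0:
        o[k] = 0.0
    return o

# Symmetric weights with a zero diagonal, as required by (2.2:26)-(2.2:27).
w = np.array([[0.0, 1.0], [1.0, 0.0]])
t = np.array([-0.5, -0.5])
o = np.array([1.0, 0.0])
rng = np.random.default_rng(2)
for _ in range(10):
    o = hopfield_step(w, t, o, rng.integers(2))
    print(o, energy(w, t, o))        # the printed energy never increases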

Lemma 2.2:1

The energy E(O) is a non-increasing function of time during the operation of a network.

Proof

Let at a moment p the state of the kth neuron be randomly chosen to be changed,

O_k^{(p+1)} = O_k^{(p)} + \Delta O_k^{(p)}. \qquad (2.2:31)

Moreover, let the states of the other neurons remain unchanged,

O_i^{(p+1)} = O_i^{(p)}, \quad i \neq k. \qquad (2.2:32)

Then, it follows that

\Delta E^{(p)} = E\left(\mathbf{O}^{(p+1)}\right) - E\left(\mathbf{O}^{(p)}\right) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} \left( O_i^{(p+1)} O_j^{(p+1)} - O_i^{(p)} O_j^{(p)} \right) - \sum_{i=1}^{N} t_i \left( O_i^{(p+1)} - O_i^{(p)} \right),

which, after expanding the summations, canceling corresponding terms for the states of neurons different than the kth, whose outputs are identical at moments p and p + 1, and using the lack of self-dependence (2.2:26), becomes

\Delta E^{(p)} = -\frac{1}{2} \sum_{j \neq k} w_{kj} \left( O_k^{(p+1)} - O_k^{(p)} \right) O_j^{(p)} - \frac{1}{2} \sum_{i \neq k} w_{ik} O_i^{(p)} \left( O_k^{(p+1)} - O_k^{(p)} \right) - t_k \left( O_k^{(p+1)} - O_k^{(p)} \right). \qquad (2.2:35)

Using (2.2:31) for the terms at the moment p + 1, equation (2.2:35) can be transformed to

\Delta E^{(p)} = -\frac{1}{2}\, \Delta O_k^{(p)} \sum_{j \neq k} w_{kj} O_j^{(p)} - \frac{1}{2}\, \Delta O_k^{(p)} \sum_{i \neq k} w_{ik} O_i^{(p)} - t_k\, \Delta O_k^{(p)}.

Canceling identical terms, and using the symmetry property (2.2:27) together with (2.2:24), the above can be simplified to

\Delta E^{(p)} = -\Delta O_k^{(p)}\, n_k^{(p)}. \qquad (2.2:37)

If n_k^(p) = 0 then, based on (2.2:37), the energy remains constant, i.e., it is not increasing. All possible situations for n_k^(p) ≠ 0 and the corresponding changes of the energy E(O), based on (2.2:25) and (2.2:37), are presented in Table 2.2:1.

Table 2.2:1
Possible changes of the energy function in the Hopfield network (after Korbicz, Obuchowicz, and Uciński 1994)

O_k^{(p+1)}   O_k^{(p)}   ΔO_k^{(p)}   n_k^{(p)}   ΔE^{(p)}
0             0            0           < 0         0
0             1           -1           < 0         < 0
1             0            1           > 0         < 0
1             1            0           > 0         0

Inspection of the last column in Table 2.2:1 assures that ΔE^(p) is always zero or negative: ΔE^(p) ≤ 0. Hence, E(O^(p+1)) ≤ E(O^(p)), which is what was to be proved.

Theorem 2.2:1 (after Korbicz, Obuchowicz, and Uciński 1994)

The energy E(O) decreases with every change in a state of the network.

Proof

By Lemma 2.2:1 it is clear that the energy cannot increase. Therefore, to prove the theorem it is enough to show that each situation in which the energy remains constant corresponds to a situation in which the network state is not changed. Then, each state change will result in an energy decrease. Consider first the case when n_k^(p) = 0. Then from (2.2:25) it follows that O_k^(p+1) = O_k^(p) (i.e., the outputs are not changed). This is one of the conditions for which ΔE^(p) = 0. The other situations for ΔE^(p) = 0 can be taken from Table 2.2:1. It follows that for unchanged energy O_k^(p+1) = O_k^(p) = 0 or O_k^(p+1) = O_k^(p) = 1, so the outputs are not changed either. Hence, for all changes of the network state, the energy decreases.

Theorem 2.2:2 (after Korbicz, Obuchowicz, and Uciński 1994)

In the discrete Hopfield network, the minimum energy Emin is finite and it is achieved in a finite number of steps.

Proof (after Korbicz, Obuchowicz, and Uciński 1994). From (2.2:30) it follows that

E(\mathbf{O}) \geq -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} |w_{ij}| - \sum_{i=1}^{N} |t_i|. \qquad (2.2:38)

Because of (2.2:38) and the discrete domain of the network outputs, it follows that the minimum non-zero change of energy is not infinitesimally small,

\exists\, c > 0: \quad \Delta E \neq 0 \;\Rightarrow\; |\Delta E| \geq c. \qquad (2.2:39)

By Theorem 2.2:1, the energy function decreases with each change of the network state. Since from (2.2:38) it is clear that the energy function has a finite lower bound, and from (2.2:39) it follows that each change of E is at least as large as c, the process of approaching the finite minimum value E_min has to be composed of a finite number of steps.

As the last example of an ANN developed in the decade of the great rebirth of connectionism, let us present the problem of self-organization (Kohonen 1984) occurring in the Kohonen (1990) Self-Organizing Map (SOM). This is an unsupervised network, in which only the winning neuron and its neighborhood are trained. The main application of the SOM is the search for regions in the input space which are activated by similar feature values. The goal of self-organizing learning is such a choice of the weights that minimizes the expected value of the distortion, measured as the error of approximation of the input vector x by the weights of the winning neuron (see Osowski 1996)

E = \frac{1}{p} \sum_{i=1}^{p} \left\| \mathbf{x}_i - \mathbf{w}_{w(i)} \right\|^2, \qquad (2.2:40)

where w(i) is the index of the neuron which wins for the input vector x_i, and w_{w(i)} is the vector of weights leading to this neuron.

After learning, the network implements vector quantization (VQ), i.e., the approximation of an arbitrary vector by the pattern vector which is closest to the vector considered. This process is equivalent to a quantization of the input space. Since this quantization is a result of the learning process, the operation is called learning vector quantization. Let us discuss the unsupervised learning of the SOM in more detail.

One of the simplest algorithms able to train the SOM is the algorithm called winner takes all (WTA). The name points out the fact that only the winning neuron, i.e., the neuron for which the distance between its weight vector w_w and the input vector x is the smallest, is subject to learning. It is also worth noticing that the WTA algorithm used in connectionist approaches corresponds to the K-means algorithm in classical cluster analysis.

Learning is an adaptation tending to change the weights of the winner in the direction of x (see Osowski 1996)

\mathbf{w}_w(k+1) = \mathbf{w}_w(k) + \eta \left[ \mathbf{x} - \mathbf{w}_w(k) \right]. \qquad (2.2:41)

Note that if the input vectors are normalized, then the minimum distance between the weight vector w_w and the input vector x corresponds to the maximum of the dot product w_w · x.

However, training only one neuron per training fact leads to relatively slow convergence, and therefore a modification called winner takes most (WTM) is more often used. In this generalized version of the Kohonen SOM, a neighborhood of the winner is introduced, which is also modified during learning together with the winner.

Additionally, it is possible to introduce a modification which takes into consideration that neurons become tired after learning, and therefore are not activated in subsequent moments. This modification is inspired by biology, and its goal is to favor neurons with smaller initial activation.

Learning of the Kohonen map using the WTM algorithm proceeds according to the formula (Osowski 1996)

\mathbf{w}_i(k+1) = \mathbf{w}_i(k) + \eta\, G(i, \mathbf{x}) \left[ \mathbf{x} - \mathbf{w}_i(k) \right], \qquad (2.2:42)

for all neurons i which belong to the neighborhood Sw of the winner. The neighborhood function G defines the influence of the distance from the winner on the modification strength.

By defining the function G as

G(i, \mathbf{x}) = \begin{cases} 1 & \text{for } i = w, \\ 0 & \text{for } i \neq w, \end{cases} \qquad (2.2:43)

where w denotes the index of the winner, the classical WTA algorithm, as a special case of WTM, is obtained.

In the classical Kohonen map, the neighborhood function G(i, x) is of the form (Osowski 1996)

G(i, \mathbf{x}) = \begin{cases} 1 & \text{for } d(i, w) \leq \lambda, \\ 0 & \text{otherwise}, \end{cases} \qquad (2.2:44)

where d(i, w) denotes the Euclidean distance between the weight vectors of the winner w and the ith neuron. The coefficient λ is the radius of the neighborhood; its value decreases as the learning of the network progresses. The function G given by (2.2:44) defines the so-called rectangular neighborhood.

Another type of the function G used in Kohonen maps defines a Gaussian neighborhood. In this type of neighborhood, the function G(i, x) is given as (Osowski 1996)

G(i, \mathbf{x}) = \exp\left( -\frac{d^2(i, w)}{2\lambda^2} \right). \qquad (2.2:45)

The Gaussian neighborhood results in better self-organization than the rectangular neighborhood, because the strength of learning gradually decreases with increasing distance.
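A minimal sketch of one WTM step (2.2:42) with the Gaussian neighborhood (2.2:45) is given below; following the text, d(i, w) is computed between the weight vectors of neuron i and of the winner, and the map size, λ, and η are illustrative assumptions.

import numpy as np

def wtm_step(weights, x, lam=0.5, eta=0.1):
    # Winner: the neuron whose weight vector is closest to x.
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))
    # Distances d(i, w) between weight vectors of neuron i and the winner.
    d = np.linalg.norm(weights - weights[winner], axis=1)
    g = np.exp(-d ** 2 / (2.0 * lam ** 2))               # Gaussian G (2.2:45)
    return weights + eta * g[:, None] * (x - weights)    # update (2.2:42)

# Adapting an illustrative 5-neuron map to a single input vector.
rng = np.random.default_rng(3)
weights = rng.uniform(size=(5, 2))
weights = wtm_step(weights, np.array([0.9, 0.1]))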

While both the rectangular and the Gaussian neighborhoods are deterministic functions of the distance d(i, w), the stochastic relaxation algorithm (see Osowski 1996) defines a neighborhood to which neurons belong with probabilities given by the Gibbs distribution

P(i) = \frac{\exp\left( -d^2(i, \mathbf{x})/T \right)}{\sum_{j=1}^{N} \exp\left( -d^2(j, \mathbf{x})/T \right)}. \qquad (2.2:46)

In the above distribution, T is a parameter called the temperature, which plays a role similar to that of temperature in simulated annealing-based optimization.

When the temperature is high at the initial stage of learning, all neurons belong to the neighborhood with approximately the same probability, which is reflected by the limit

\lim_{T \to \infty} P(i) = \frac{1}{N}. \qquad (2.2:47)

As the temperature is lowered during learning, the process becomes more and more deterministic, achieving for very small temperatures a behavior resembling the WTA algorithm,

\lim_{T \to 0} P(i) = \begin{cases} 1 & \text{for } i = w, \\ 0 & \text{for } i \neq w. \end{cases} \qquad (2.2:48)

The stochastic relaxation defines a random neighborhood of the rectangular type. Therefore, the function G is given by

G(i, \mathbf{x}) = \begin{cases} 1 & \text{for } P \leq P(i), \\ 0 & \text{otherwise}, \end{cases} \qquad (2.2:49)

where P is a random number taken from the uniform distribution with the range (0,1).

The next algorithm considered is the soft competition scheme (SCS). It is a deterministic version of the stochastic relaxation algorithm, with better effectiveness than the original probabilistic algorithm (see Osowski 1996). Instead of the rectangular neighborhood taken with probability P(i), as in stochastic relaxation, the SCS uses the Gibbs distribution (2.2:46) as the definition of the deterministic function G,

G(i, \mathbf{x}) = P(i). \qquad (2.2:50)
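A small sketch of the Gibbs distribution (2.2:46), which is shared by stochastic relaxation and the SCS, is given below; the map, the input vector, and the temperature are illustrative assumptions.

import numpy as np

def gibbs_P(weights, x, T):
    # Gibbs membership probabilities (2.2:46) over all neurons.
    d2 = np.sum((weights - x) ** 2, axis=1)
    e = np.exp(-d2 / T)
    return e / e.sum()

# The SCS uses G(i, x) = P(i) directly (2.2:50); stochastic relaxation
# instead includes neuron i in the neighborhood when a uniform random
# number P satisfies P <= P(i), as in (2.2:49).
rng = np.random.default_rng(4)
weights = rng.uniform(size=(5, 2))
P = gibbs_P(weights, np.array([0.9, 0.1]), T=0.5)
in_neighborhood = rng.uniform(size=5) <= P    # random rectangular neighborhood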

The last algorithm considered in the context of the SOM and the neighborhood function G is the neuron gas algorithm (see Osowski 1996), in which all neurons are sorted according to increasing distance from the vector x. Then the function G is given by

G(i, \mathbf{x}) = \exp\left( -\frac{m(i)}{\lambda} \right), \qquad (2.2:51)

where m(i) denotes the rank of neuron i in the sorted sequence, which starts from m = 0 for the winner, and λ is a parameter decreasing in time, analogous to the radius of the neighborhood in the classical Kohonen WTM algorithm. If λ = 0, then only the winner is modified, and the algorithm becomes the WTA. Otherwise, the algorithm resembles the fuzzy approach, by associating with each neuron a membership function (2.2:51) of belonging to the winner's neighborhood. If the quantization error (2.2:40) is the criterion, then the following ordering of the self-organizing algorithms, from the best to the worst, is given by Osowski (1996): neuron gas, SCS, K-means, classical Kohonen map.
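Finally, a sketch of the neuron gas neighborhood function (2.2:51), under the assumption discussed above that the winner has rank m = 0.

import numpy as np

def neuron_gas_G(weights, x, lam):
    # Rank m(i) of each neuron after sorting by increasing distance from x.
    order = np.argsort(np.linalg.norm(weights - x, axis=1))
    m = np.empty(len(weights))
    m[order] = np.arange(len(weights))
    if lam == 0.0:
        return (m == 0).astype(float)    # WTA special case: winner only
    return np.exp(-m / lam)              # neighborhood function (2.2:51)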

Despite the successes of ANNs presented above in many pattern recognition and other machine learning problems, many scientists were not convinced, lacking a mathematical theory describing the efficiency of ANN-based classifiers. The response to these reservations was given in the nineties of the previous century by proving the following theorem.

Theorem 2.2:3 (after Tebelskis 1995)

Properly trained ANNs are optimal classifiers in pattern recognition problems using the statistical uncertainty model, i.e., the output neurons approximate arbitrarily closely the posterior probabilities of all the classes considered.

Proof (after Tebelskis 1995)

Consider an ANN-based classifier trained with many training facts in the form of pairs (x, C_j), where x is the input vector and C_j is the correct abstract class corresponding to that vector.
