

On the Application of Connectionist Models for Pattern Recognition, Robotics and Computer Vision

A Technical Report

Martin A. Kraaijveld


Pattern Recognition Group
Martin A. Kraaijveld
Faculty of Applied Physics
Delft University of Technology
P.O. Box 5046
2600 GA Delft
The Netherlands


Published and Distributed by: Delft University Press
Stevinweg 1
2628 CN Delft
Tel. 31 - (0)15 - 783254

By order of:
Pattern Recognition Group
Faculty of Applied Physics
Delft University of Technology
P.O. Box 5046
2600 GA Delft
The Netherlands
Tel. 31 - (0)15 - 78 14 16

CIP-Data Koninklijke Bibliotheek, The Hague
ISBN 90-6275-554-2
NUGI 841

Copyright © 1989 by M.A. Kraaijveld

No part of this book may be reproduced in any form by print, photo-print, microfilm or any other means, without written permission from Delft University Press.


Abstract

Connectionist models, commonly referred to as neural networks, are computing models in which large numbers of processing units are connected to each other with variable "weights". These weight values represent the "strength" of the connection between two units, which can be positive (excitatory, i.e. exciting the activity of a unit) or negative (inhibitory, i.e. suppressing the activity of a unit). The functional behavior of a connectionist network is determined by these weight values. Changing the weight values or the topology of the network results in different nets with different applications. It has been demonstrated that connectionist models are well suited to implement some pattern recognition, optimization and/or adaptive learning techniques in a massively parallel, fault-resistant manner.

The aim of this report is to provide an overview of the literature in this field, and to investigate the practical applications of connectionist models for pattern recognition, robotics and computer vision. From the perspective of an engineer, the tools provided by connectionist models are compared to other available tools, and it is shown in which cases these tools are more efficient than other implementations. The improved efficiency can be based on (one of) the following properties: the massive parallelism, the robustness of the implementation, the large variety of algorithms that can be implemented in a connectionist network, the availability of a suitable technology for hardware implementations, and finally the specific properties of some particular models.


Contents

Abstract ... iii

Contents ... v

1. Introduction ... 1

2. The General Framework ... 3

2.1. The General Structure of a Connectionist Model ... 3

2.2. Capabilities of Connectionist Models ... 7

2.3. A Taxonomy of Connectionist Models ... 7

2.4. Literature for First Reading ... 10

3. A Survey of the Literature ... 11

3.1. A Brief History ... 11

3.1.1 The First Wave ... 11

3.1.2 The Second Wave ... 12

3.2. An Overview of the Currently Used Models ... 15

3.2.1 The Hopfield Model ... 15

3.2.2 The Backpropagation Algorithm ... 20

3.2.3 The Boltzmann Machine ... 28

3.2.4 The Adaptive Resonance Theory ... 30

3.2.5 Kohonen's Topology Preserving Maps ... 33

3.2.6 The Neocognitron ... 36

4. Applications ... 39

4.1. Pattern Recognition ... 39

4.1.1 Recognition with Hopfield Networks ... 39

4.1.2 Recognition with Multi-Layer Feedforward Networks ... 41

4.1.3 Unsupervised Learning ... 42

4.2. Computer Vision ... 43

4.2.1 Evidence Integration ... 43

4.2.2 Relaxation ... 44

4.3. Robotics ... 45

4.4. Connectionist Models and Physics ... 46

4.5. Connectionist Models and AI ... 46

4.6. Connectionist Models and the Cognitive Sciences ... 47

5. Implementations ... 49

5.1. Software ... 49

5.2. Hardware ... 50

5.2.1 General Purpose Hardware ... 50

5.2.2 Special Purpose Hardware ... 50

6. Conclusions ... 55


Acknowledgements ... 57

Literature ... 59


1. Introduction

One of the prominent research topics of the scientific community in the 1980's is undoubtedly the research on neural networks, also known as parallel distributed processing, connectionist models or neurocomputing. One of the most striking aspects of the world-wide interest in this subject is that there are so many disciplines involved. It seems that in this field the lines of research of many previously unrelated disciplines come together: the huge progress in neuroscience in the understanding of the central nervous system, the research of psychologists and psychiatrists concerning the computational aspects of cognition, the developments in VLSI-technology and computer architecture, the knowledge of statistical and structural pattern recognition that has been gained during the last few decades, and many other recent trends in scientific research. Currently, there are thousands of researchers working on this subject, and a large number of conferences, magazines and research projects have appeared during the last few years. The number of disciplines that is involved in this subject is very large and covers: neurology, neurophysiology, biology, zoology, psychology, psychiatry, philosophy, computer science, mathematics, physics, robotics, computer architecture, computer vision, pattern recognition and many others.

However, there seems to be another side of the coin. In the first place, the claims that some researchers have made about connectionist models seem to be highly unrealistic. The straightforward hypothesis, for example, that artificial neural networks should easily solve all kinds of "natural" problems, like speech recognition, vision, tactile sensing, etc., is still not proven. In the second place, the euphoria of a cognitive scientist about networks that are able to recognize patterns does not necessarily imply euphoria for an expert in pattern recognition. The same holds for networks that solve other problems, like optimization. In the third place, it seems that history repeats itself; the current "wave" of interest can be considered the "third wave" in this field. The first two waves (the first during the late 1940's and the early 1950's, and the second during the 1960's) showed an equal amount of interest, but died out because most of the expectations appeared to be unrealistic, or very limited in practical use.

During the 1980's we are confronted with a revival of the subject. Basically, this is due to a few breakthroughs that were discovered in the first half of the decade. One of the questions that is therefore addressed in this report is whether these breakthroughs should be regarded as revolutionary or as evolutionary, especially from an engineering perspective. This also shows in which respect this report differs from many other publications in this field: connectionist models are considered as tools for engineering problems. Their functionality is compared to other available tools, to provide the arguments for the selection of the connectionist versus the conventional tools. In this report the biological plausibility of a model is therefore not considered as an important argument!


Another important question that is addressed in this report is which topics in this field are related to the work and expertise of the Pattern Recognition Group, and in which of these the most interesting research is being performed.

The structure of this report is as follows. Chapter two covers a global technical introduction to the subject and sketches the outline of the common part of the different connectionist models. This chapter is meant for first reading. The third chapter addresses the specific details of some connectionist models, starting with the old models of the 1950's and 1960's and ending with those that are currently of interest. The fourth chapter explains how the models of chapter three can be applied in some specific disciplines. Chapter five describes some topics concerning the implementation of connectionist models in hard- and/or software and finally, chapter six contains the conclusions.

A final remark is that this report is not made with the intention to give a complete overview of all activities. The world-wide effort is so enormous that it is virtually impossible to follow every line of research. There is a risk, therefore, that people who have made significant contributions, or some very recent developments, are not mentioned here. The author hopes that despite these shortcomings, the reader is able to achieve a broad overview of this new and exciting field.


2. The General Framework

One of the remarkable aspects of connectionist models is their similar structure. Basically, the "architecture" of each network is based on the same building blocks, whereas minor variations in these blocks are responsible for different models. This chapter explains what the basics of these building blocks are and in what respect they differ. Paragraph one describes the general structure of a connectionist model. Paragraph two provides a short overview of their capabilities and applications. In the third paragraph a taxonomy of the different models is presented, and finally, in paragraph four a short overview of introductory literature is presented.

2.1. The General Structure of a Connectionist Model

When we sketch a framework that is sufficiently rich to incorporate the existing connectionist models, we can distinguish five major aspects (fig. 2.1, see also: [Rumelhart 1986b] and [Lippmann 1987]).

- A set of processing units.
- A pattern of connectivity among units.
- A state of activation determined by a combination of the inputs impinging on a unit.
- An output function for each unit.
- An (iterative) rule to determine the connection strengths.

The set of processing units:

An essential element of a connectionist model is that the processing is distributed over many, relatively simple units, and that all actions can be considered as local asynchronous processes, altogether performing a larger global task. The "behavior", the "knowledge" or the "algorithm" is distributed over the network, and most models can be considered fault-tolerant, because not a single unit or connection between units is responsible for the functioning of the system as a whole. This provides a greater degree of robustness than the standard von Neumann type sequential computer.

We can distinguish three different types of units: the input units, which receive input from external sources, the output units, which send signals out of the system, and the hidden units, whose inputs and outputs remain within the system.



Figure 2.1: Block scheme of a unit in a connectionist model. The output is a non-linear function of a weighted summation of the inputs.

The pattern of connectivity among units.


A specific property of a connectionist model is the fact that a number is assigned to each connection, representing the connection strength between the two units it is connecting. When this number, commonly referred to as the "weight value" of a connection, is positive, we speak of an "excitatory" connection (which stimulates the activity of the unit that it is connected to), and when it is negative we speak of an inhibitory connection (which suppresses activity). The absolute value of the weight can be considered as the connection strength between the units. It is often convenient to define a matrix W in which each entry w_ij represents the weight value between unit i and unit j.

The pattern of connectivity is very important, because it is one of the most significant aspects in which the various models differ. The network topologies that are most commonly used are:

- A fully connected network: each unit is connected to all other units in the network.
- A layered network: the units are grouped in layers and a unit is connected to all units, or to some units, in the layer below.
- A randomly connected network: each unit is connected to a random number of other units.


The state of activation.

The state of activation of the system can be considered as the representation of the state of the system at time t. It is primarily determined by the vector a(t), representing the activation states of the individual processing units at time t. The activation state of a unit is a discrete, continuous or stochastic function of the inputs and the current value of the activation state.

A common class of activation states is determined by the following rule:

    a_i(t) = Σ_j w_ij(t) i_j(t)

or in matrix notation:

    a(t) = W(t) i(t)

In this case, the activation state of a unit is determined by a weighted summation of the inputs. The weights for this summation are the weight values associated to the connections leading to the unit. This linear combination of the inputs can be considered as the inner product between the input vector and the weight vector, having a value zero when the vectors are orthogonal, and having a maximum value when they are parallel.


The output function for each unit.

The output function maps the current activation state of a unit to the output signal:

    o_i(t) = f(a_i(t))

This output function f(.) is generally a non-linear function of the activation state, usually some sort of threshold function. The threshold can be hard-limiting (like a step function), smoothly limiting (like a hyperbolic tangent or sigmoid function), a stochastic function of the activation state, a function like f(x) = max(0, x) or, in some specific cases, the identity function f(x) = x (see fig. 2.2).

In the case of a hard-limiting non-linearity, the behavior of the unit can be sketched as follows: the linear combination of the inputs divides the input space into two regions, with the output taking the value zero (or minus one) on one side of the (hyper)plane and the value plus one on the other side (see fig. 2.3).
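To make the preceding definitions concrete, the following sketch (illustrative only, not taken from the report; the names and values are chosen here) computes the output of a single unit as a hard-limited weighted summation of its inputs.

    import numpy as np

    def hard_limit(a):
        # Hard-limiting output function: 1 on one side of the hyperplane, 0 on the other.
        return 1.0 if a >= 0.0 else 0.0

    def unit_output(weights, inputs, output_function=hard_limit):
        # Activation state: weighted summation of the inputs (the inner product
        # of the weight vector and the input vector).
        activation = np.dot(weights, inputs)
        # The output is a non-linear function of the activation state.
        return output_function(activation)

    # A unit with two inputs; the weights determine on which side of the
    # decision boundary an input pattern falls.
    print(unit_output(np.array([0.5, -0.3]), np.array([1.0, 1.0])))    # prints 1.0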


soI"t-llmiUng output rUDdloD

1.2 1.0 0.8 0.6 1 0.4 0.2 0.0 -0.2 -10 10 x

hard.Jimiling output runction

1.2 1.0 0.8 0.6 1 0.4 0.2 0.0 -0.2 -10 0 10 x

output runctlon: 1 = mali (O,x)

12 10 8 1 6 2 0 ·2 -10 0 10 x


The rule to determine the connection strengths.

Associated with all models is a rule that tells how to determine the weight values of the connections between the units. This rule can be straightforward, providing the desired weight values immediately for a given problem (e.g. the Hopfield model - [Hopfield 1982], [Hopfield 1984], [Hopfield 1985]), or it can be iterative, whereby the weight values start with a certain initialization and slowly converge to the appropriate values (e.g. the backpropagation algorithm - [Rumelhart 1986c]). The iterative rules are usually called "learning rules" and virtually all of them can be considered variants of the Hebbian learning rule [Hebb 1949]. The basic idea of Hebb's rule is as follows: if a unit u_i receives an input from a unit u_j, and both are highly active, then the weight w_ij, from u_j to u_i, should be strengthened (see also: [Rumelhart 1986b]).
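As an illustration of the Hebbian idea only (the report gives no formula at this point), the sketch below strengthens the weight between two units in proportion to their joint activity; the learning rate eta is an assumed parameter.

    import numpy as np

    def hebbian_update(W, activity, eta=0.01):
        # W[i, j] is the weight from unit j to unit i; 'activity' holds the activations.
        # When unit j and unit i are both highly active, W[i, j] is strengthened.
        return W + eta * np.outer(activity, activity)

    W = np.zeros((3, 3))
    activity = np.array([1.0, 0.0, 1.0])
    W = hebbian_update(W, activity)
    print(W)    # entries involving the active units (0 and 2) have grown; the rest stay zero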

The information processing.

When there is feedback in the network, the behavior of the system as a function of time is described by differential equations; when there is no feedback in the network, the time behavior is usually not taken into account.

The information processing of a connectionist model can basically be seen as a relaxation process. Computation proceeds by iteratively seeking to satisfy a large number of local constraints between units. A network should be thought of more as settling into a solution than as calculating a solution.

2.2. Capabilities of Connectionist Models

Among the information processing capabilities that connectionist models have been able to perform are the following (see [Hecht-Nielsen 1987c] and [Hecht-Nielsen 1988a]):

- Mathematical mapping approximation.
- Probability density function estimation.
- Extraction of relational knowledge from binary databases.
- Formation of a topologically continuous and statistically conformal mapping.
- Nearest neighbor pattern classification.
- Categorization of data.
- Optimization.

2.3. A Taxonomy of Connectionist Models

During the last few decades, numerous proposals for network architectures and learning rules have been made. Of these there are 14 types in common use (see [Hecht-Nielsen 1987c] and [Hecht-Nielsen 1988a]):


Figure 2.3: A unit with a hard-limiting output function, o = sgn(Σ_j w_j i_j + θ), divides the input (feature) space into two regions, separated by a decision boundary (a hyperplane).


- Adaptive Resonance: a class of networks that is able to form categories of input data. The coarseness of the categories is determined by the value of a selectable parameter (see [Grossberg 1986a], [Carpenter 1987a] and [Carpenter 1988]).

- Avalanche: a class of networks for learning, recognizing, and replaying spatio-temporal patterns (see [Grossberg 1982] and [Hecht-Nielsen 1987d]).

- Bidirectional Associative Memory: a class of single-stage heteroassociative networks, some capable of learning (see [Kosko 1987a] and [Kosko 1987b]).

- Boltzmann machine / Cauchy machine: networks that use a noise process to find the global minimum of a cost function (see [Hinton 1986]).

- Brain State in a Box: a single-stage autoassociative network that minimizes its mean squared error (see [Anderson 1983] and [Anderson 1987]).

- Cerebellatron: learns the averages of spatio-temporal command sequence patterns and replays these average command sequences on cue (see [Pellionez 1977]).

- Counterpropagation: a network that functions as a statistically optimal self-organizing lookup-table and probability function analyzer (see [Hecht-Nielsen 1987]).

- Hopfield: a class of single-stage autoassociative networks without learning (see [Hopfield 1982], [Hopfield 1984], [Hopfield 1985]).

- Lernmatrix: a single-pass non-recursive single-stage associative network (see [Steinbuch 1963]).

- Madaline: a bank of trainable linear combinations that minimize the mean squared error (see [Widrow 1960] and [Anderson 1987]).

- Multi-layer perceptron: a multi-layer mapping network that minimizes the mean squared mapping error (see [Rumelhart 1986c]).

- Neocognitron: a multi-layer hierarchical character recognition network (see [Fukushima 1983] and [Fukushima 1984]).

- Perceptron: a bank of trainable linear discriminants (see [Anderson 1987] and [Minsky 1969]).

- Topology Preserving Maps: forms a continuous topological mapping from one compact manifold to another, with the mapping metric density varying directly with a given probability density function on the second manifold (see [Kohonen 1984]).


When we classify these networks into groups according to their learning mechanism and their capability of working with binary or continuous information, we can make the following subdivision (see [Lippmann 1987]):

Binary and Supervised:
Bidirectional Associative Memory
Brain State in a Box
Hopfield (variation I)

Binary and Unsupervised:
Adaptive Resonance (variation I)

Continuous and Supervised:
Avalanche
Boltzmann machine / Cauchy machine
Cerebellatron
Counterpropagation
Hopfield (variation II)
Lernmatrix
Madaline
Multi-layer perceptron
Perceptron

Continuous and Unsupervised:
Adaptive Resonance (variation II)
Neocognitron
Self-Organizing Feature Maps

2.4. Literature for First Reading

The literature on this subject has grown enormously during the last few years. For first reading the following literature is highly recommended:

- [Lippmann 1987], an introductory technical paper.

- [Hecht-Nielsen 1987c] and [Hecht-Nielsen 1988a], introductory but less technical papers.

- [Anderson 1987], a collection of classic papers.

- The March 1988 issue of IEEE Computer: introductory articles by the world's leading scientists.

- [Rumelhart 1986a] and [McClelland 1986], the "Parallel Distributed Processing Bible".


3. A Survey of the Literature

This chapter describes the various connectionist models that have been proposed over the last few decades. The first part provides an overview of the models that have been proposed during the first two waves (i.e. between the 1940's and the 1960's), and the second part an overview of the models that are currently under investigation.

3.1. A Brief History

The history of connectionist models starts in 1943, when McCulloch and Pitts wrote their famous paper ([McCulloch 1943]) in which they gave the first formal description of the neuron. The publication of this paper marked the beginning of what we nowadays call the first wave. The second wave starts with the discovery of the Perceptron ([Rosenblatt 1958]) and ends with the publication of the book of Minsky and Papert about the computational capabilities of the Perceptron ([Minsky 1969]).

3.1.1 The First Wave

One of the papers that highly influenced the first wave was the paper of McCulloch and Pitts in 1943 ([McCulloch 1943]). In this paper, the authors propose a formal description of the neuron which has become known as the 'McCulloch and Pitts' neuron. The McCulloch and Pitts neuron is a binary device; that is, it can be in only one of two possible states. The mode of operation is simple. The neuron responds to the activity of its synapses (its inputs). If no inhibitory synapses are active, the neuron adds its synaptic inputs and checks to see if the sum meets or exceeds its threshold. If it does, the neuron becomes active. If it does not, the neuron is inactive. The central result of the paper is that any finite logical expression can be realized by these neurons ([Anderson 1987]).
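The mode of operation described above can be summarized in a few lines. This is a sketch based on the description in the text; the separation into excitatory and inhibitory inputs and the veto role of an active inhibitory synapse are modeled as stated there.

    def mcculloch_pitts(excitatory, inhibitory, threshold):
        # The neuron is a binary device: it is either active (1) or inactive (0).
        # An active inhibitory synapse keeps the neuron inactive.
        if any(inhibitory):
            return 0
        # Otherwise the neuron adds its synaptic inputs and compares the sum
        # with its threshold.
        return 1 if sum(excitatory) >= threshold else 0

    # Example: an AND-gate realized with a threshold of 2.
    print(mcculloch_pitts([1, 1], [], threshold=2))    # 1
    print(mcculloch_pitts([1, 0], [], threshold=2))    # 0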

A second important breakthrough of the first wave was due to Donald Hebb. In his book "The Organization of Behavior" [Hebb 1949], Hebb makes the first explicit statement of a learning rule for synaptic modification, which has become known as the Hebbian learning rule. The general idea of this learning rule is - to restate Hebb's description -:

"When an axon of cel! A is near enough to excite a cel! Band repeatedly or persistently takes pan infiring it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one ofthe cellsfiring B, is increased" [Hebb 1949, p. 50].

The first wave of interest was mainly important because some fundamental insights and concepts were discovered. It appeared, however, that the usefulness at that stage of scientific development was very limited. Neither neuroscientists nor computer scientists could build useful applications with the knowledge of that time.


3.1.2 The Second Wave

The second wave of interest started in 1958 with the design of the perceptron by Frank Rosenblatt ([Rosenblatt 1958]). The first publications about the perceptron caused a sensation, because it was the first proposal for a learning machine that was potentially capable of complex adaptive behavior. After more than a decade of thorough study and worldwide interest, it was revealed by Minsky and Papert ([Minsky 1969]) that the class of problems that the perceptron was suited for was very limited. The publication of their book put an end to nearly all research in this field.

3.1.2.1 The Perceptron

The perceptron was the first precisely specified, computationally oriented neural network. It was proposed by Frank Rosenblatt in 1958 ([Rosenblatt 1958]). The perceptron consists of four parts (see fig. 3.1):

- The retina, which receives the stimuli.

- A set of association units (A1) that receive inputs from a localized receptive field on the retina (N.B. these units are omitted in some variants of the perceptron; in those cases the retina is connected directly to the area A2).

- A set of association units (A2) that receive input from a random number of units in area A1.

- The "responses" R1, R2, ..., Rn.

The units of the perceptron are McCulloch and Pitts units; if the sum of the excitatory and inhibitory inputs is greater than or equal to a certain threshold, the unit is active and otherwise inactive. The goal of the operation of the perceptron was to make a specific R-cell respond to an input pattern on the retina. A simple reinforcement learning scheme was used for changing the weight values of the connections to improve the performance. In later studies it was proven that specific versions of the learning scheme (i.e. the "fixed-increment rule") converge under certain conditions to a stable solution. This Perceptron Convergence Theorem (see e.g. [Duda 1973]) guarantees that if the samples are linearly separable, the fixed-increment rule will yield a solution after a finite number of corrections of the weight values.
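A minimal sketch of the fixed-increment correction step follows. The bipolar class labels and the augmented input vector (whose constant first component carries the threshold) are representation choices made for this example, not part of Rosenblatt's original formulation.

    import numpy as np

    def fixed_increment(samples, labels, epochs=100):
        # samples: rows are augmented input vectors (a constant 1 carries the threshold).
        # labels:  +1 or -1 for the two classes.
        w = np.zeros(samples.shape[1])
        for _ in range(epochs):
            errors = 0
            for x, t in zip(samples, labels):
                if np.sign(np.dot(w, x)) != t:
                    w += t * x          # correct the weights by a fixed increment
                    errors += 1
            if errors == 0:             # the learning set is classified correctly
                break
        return w

    # Linearly separable example (an OR-like problem).
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    t = np.array([-1, 1, 1, 1])
    print(fixed_increment(X, t))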


Figure 3.1: The Perceptron (from [Rosenblatt 1958]).

3.1.2.2 The ADALINE

The ADALINE (the ADaptive LInear NEuron) was a proposal for an adaptive system that could learn more quickly and accurately than the perceptron ([Widrow 1960]). The architecture of the system was not exactly like the perceptron but was related to it (see fig. 3.2). The system consists of a set of inputs, which are multiplied by a weight value and summed in a summator. The output of the summator is thresholded, so the output of a unit is a binary value.

Widrow and Hoff proposed a rule to change the weights in order to let the system learn a set of patterns. Their learning rule, which has become known as the "Widrow-Hoff rule", the "LMS rule" or the "delta rule", performs a gradient descent on a bowl-shaped error surface, and is therefore guaranteed to find the best set of weights (see also: [Duda 1973] and [Rumelhart 1986c]). A mathematical treatment of the Widrow-Hoff rule is given in paragraph 3.2.2, where the backpropagation algorithm is described.
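The following sketch shows the Widrow-Hoff (LMS) update described above: the weights are moved along the negative gradient of the squared error of the linear output. The learning rate and the augmented input representation are assumptions made for the example.

    import numpy as np

    def lms_step(w, x, target, eta=0.1):
        # Gradient descent on the squared error of the linear output w.x
        error = target - np.dot(w, x)
        return w + eta * error * x

    w = np.zeros(3)
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    targets = np.array([-1.0, 1.0, 1.0, 1.0])
    for _ in range(200):
        for x, t in zip(X, targets):
            w = lms_step(w, x, t)
    print(np.sign(X @ w))    # the thresholded outputs match the targets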


Figure 3.2: The ADALINE.

The ADALINE has been used for system modelling (i.e. the network adaptively learns the dynamic response of a system), for statistical prediction (i.e. the network adaptively learns to predict a signal), adaptive noise cancelling, adaptive echo cancelling (e.g. the network learns the characteristics of a telephone circuit and adaptively suppresses the echo due to long-distance lines), inverse modelling (i.e. to adaptively learn the reciprocal of an unknown system's transfer function), channel equalization (i.e. the network adapts itself to become a channel inverse, to compensate for the irregularities in channel magnitude and phase response) and adaptive pattern recognition. A recent version of the ADALINE and some of the applications mentioned above are described in [Widrow 1988].

3.1.2.3 Minsky and Papert

The second wave came to an end with the publication of the famous book called "Perceptrons" by Minsky and Papert ([Minsky 1969]). "Perceptrons" is considered to be an exploration of the properties of the simplest machines that have a claim to be "parallel" and can perform computations that are non-trivial, both in practical and in mathematical respects [Minsky 1969, par. 0.2]. This exploration is based on a thorough mathematical treatment of the subject, in which the perceptron is considered to be computing predicates instead of classifications. It appeared that some intuitively simple predicates are extremely difficult to compute with perceptrons. One of the famous mathematical results in the book comes from the discussion of the geometrical predicate connectedness (the property that one can draw the object without lifting the pen from the paper). Minsky and Papert prove that a perceptron cannot compute the connectedness of an object on the retina, unless there is at least one association unit to which all points of the retina are connected. The same holds for the predicate parity: the parity of the number of points (i.e. pixels) in a geometric figure drawn on the retina.

The conclusions of Minsky and Papert that the perceptron was not capable of computing some simple predicates, that perceptrons usually work quite well on very simple problems but deteriorate very rapidly as the tasks assigned to them get harder, and that the results of hundreds of projects and experiments were generally disappointing ([Minsky 1969, par. 0.9]), nearly terminated the entire line of research.

3.2. An Overview of the Currently Used Models

Af ter the publication of "Perceptrons" in 1969 it remained silent until the early 1980's. During the first half of the 1980's, some new algorithms were discovered, and the interest in connectionist models revived. The assumption of Minsky and Papert that the limitations of the perceptron would also be true for its variants (specifically the multi-Iayer systems) appeared to be wrong, and many researchers presented ideas, and demonstrated capabilities that were completely new. The fact that scientists from many different disciplines rediscovered connectionism, resulted in what could be considered as the third wave.

This paragraph describes some of the recent networks. It starts with the Hopfield network (par. 3.2.1), followed by the Backpropagation-algorithm (par. 3.2.2), the Boltzmann machine (par. 3.2.3), the Adaptive Resonance Theory of Carpenter and Grossberg (par. 3.2.4), Kohonen's topology preserving maps (par. 3.2.5), and finally the Neocognitron (par. 3.2.6).

3.2.1 The Hopfield Model

The Hopfield model is due to John Hopfield of Caltech, who investigated the computational properties of some classes of networks. The following paragraphs follow the line of his investigations, starting with the Hopfield model as an associative memory (par. 3.2.1.1), the Hopfield model with continuous units (par. 3.2.1.2) and the mapping of optimization problems on Hopfield networks (par. 3.2.1.3). The last paragraph (par. 3.2.1.4) gives a short overview of the current research concerning the Hopfield model.


3.2.1.1 The Hopfield Network as an Associative Memory

The first description of the Hopfield model dates from 1982 ([Hopfield 1982]) and is based on research on the collective computational properties of systems having a large number of simple equivalent processing units. The network that Hopfield describes in this paper is a fully connected network (each processing unit is connected to all others) in which each unit has a binary output (see fig. 3.3).


Figure 3.3: The unit of a binary Hopfield network.

Hopfield discovered that it is possible, under certain conditions, to make such a network behave as an associative memory. The algorithm for the storage and the retrieval of patterns in his network consists of three steps [Lippmann 1987]:

1. The storage of patterns in the network by assigning the weight values of the connections:

For the storage of a set of n binary states v^s (s = 1, ..., n), we assign a weight value to each connection according to the following rule:

    w_ij = Σ_s (2 v_i^s - 1)(2 v_j^s - 1)    for i ≠ j
    w_ij = 0                                 for i = j,    0 ≤ i, j ≤ N-1

This means that if two units i and j in a certain pattern v^s have the same value, 1 is added to the weight; if their values differ, -1 is added to the weight. The total value of the weight is determined by a summation of these factors 1 and -1 over all n patterns.


2. The network is initialized with an unknown input pattern:

    v_i(0) = x_i,    0 ≤ i ≤ N-1

3. The network iterates until it converges; the units are evaluated according to:

    v_i(t+1) = 1    if Σ_j w_ij v_j(t) > 0
    v_i(t+1) = 0    if Σ_j w_ij v_j(t) < 0

It appears that when the network is clamped in an initial state, the convergence process of step 3 makes the network converge to the "closest" pattern of step 1. Closest means in this sense "the stored pattern with the smallest Hamming distance to the initial pattern". This behavior can be understood when we consider the following "energy function" (or Hamiltonian):

    E = - (1/2) Σ_i Σ_j w_ij V_i V_j

The change in energy due to a change in V is:

    ΔE = - ΔV_i Σ_j w_ij V_j

which shows that the algorithm for altering V_i causes E to be a monotonically decreasing function. The states of the V_i's will change until a (local) minimum of E is reached.

Hopfield empirically found that the number of patterns that can be stored in a fully connected network of N units is at most about 0.15N. This figure clearly shows that the (distributed) representation of the patterns in the weights of the network is very redundant. It is therefore not surprising that some of the connections can be removed, or some noise can be added to the weight values, without seriously affecting the performance of the network.
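A compact sketch of the three steps above for binary (0/1) patterns follows; the asynchronous update order, the number of iterations and the example patterns are choices made for this illustration.

    import numpy as np

    def store(patterns):
        # Step 1: assign the weights from the bipolar (2v - 1) pattern values.
        n = patterns.shape[1]
        W = np.zeros((n, n))
        for v in patterns:
            b = 2 * v - 1
            W += np.outer(b, b)
        np.fill_diagonal(W, 0)                       # w_ij = 0 for i = j
        return W

    def recall(W, x, iterations=10):
        # Steps 2 and 3: start from the unknown pattern and update units until stable.
        v = x.copy()
        for _ in range(iterations):
            for i in np.random.permutation(len(v)):  # asynchronous updates
                s = np.dot(W[i], v)
                if s != 0:
                    v[i] = 1 if s > 0 else 0
        return v

    patterns = np.array([[1, 0, 1, 0, 1, 0], [1, 1, 0, 0, 1, 1]])
    W = store(patterns)
    noisy = np.array([1, 0, 1, 1, 1, 0])             # the first pattern with one bit flipped
    print(recall(W, noisy))                          # recovers the first stored pattern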

3.2.1.2 Hopfield Networks with Graded Response

The second paper on this subject dates from 1984 and deals with networks with graded response. In this paper Hopfield describes that networks with continuous units have collective computational properties like those of binary units [Hopfield 1984]. In this case the equations that describe the system have to be replaced by differential equations. When we model the units as operational amplifiers, having a (parasitic) input capacitance and a (parasitic) input resistance, we can describe the "equation of motion" as:


    C_i (du_i / dt) = Σ_j w_ij V_j - u_i / R_i + I_i,        V_i = g_i(u_i)

C_i in this equation describes the input capacitance, w_ij the conductance between the output of unit j and the input of unit i, R_i the input resistance, I_i the current from an external source and g_i the non-linear output function of the units. The equation states that the current flowing through the capacitor is equal to the currents coming from the other units, minus the current flowing away through the resistor, plus the current that is injected by an external source.

The energy function is now defined as:

    E = - (1/2) Σ_i Σ_j w_ij V_i V_j + Σ_i (1/R_i) ∫_0^{V_i} g_i^{-1}(V) dV - Σ_i I_i V_i

which is very similar to the energy function of the binary network. It can be shown that when the gain of the amplifier goes to infinity (thus approximating the hard-limiting output function), the term with the integral of g_i^{-1}(V_i) goes to zero. Furthermore, it can be shown that the time derivative of E is smaller than or equal to zero. The network with continuous units is therefore able to find local minima in this energy landscape, and will behave like an associative memory.
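A numerical sketch of the equation of motion above, integrated with a simple Euler step; the time step, the tanh output function and its gain, and the circuit parameters are assumptions made for this example, not values from Hopfield's paper.

    import numpy as np

    def simulate(W, I, R=1.0, C=1.0, gain=5.0, dt=0.01, steps=2000):
        # Euler integration of  C du_i/dt = sum_j w_ij V_j - u_i/R + I_i,  V_i = g(u_i)
        u = np.zeros(len(I))
        for _ in range(steps):
            V = np.tanh(gain * u)                    # smooth (graded) output function g
            du = (W @ V - u / R + I) / C
            u += dt * du
        return np.tanh(gain * u)

    # Two mutually excitatory units settle into a state with both outputs high.
    W = np.array([[0.0, 1.0], [1.0, 0.0]])
    I = np.array([0.5, 0.5])
    print(simulate(W, I))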

3.2.1.3 The Mapping of Optimization Problems on Hopfield Networks

The next idea that Hopfield describes dates from 1985, and concerns the mapping of specific problems from other domains onto neural networks (see [Hopfield 1985], [Hopfield 1986], [Tank 1986] and [Tank 1987]). The general principle of this mapping is that when a problem can be formulated in terms of desired optima, a network can be constructed to solve this (optimization) problem.

To find a solution of a difficult optimization problem, the required step is to construct a cost function which reflects the quality of the solution. A good solution is associated with low costs, and a bad solution with high costs. When the cost function of the optimization problem can be associated with the energy function of the Hopfield network, the network can be used to minimize the energy, and thus to minimize the costs. Hopfield and Tank [Hopfield 1985] describe how to construct a network to find solutions for the Travelling Salesman Problem. For a 5-city problem, they use a network of 5 * 5 = 25 units. These units are configured as follows:

          1    2    3    4    5
    A     0    0    1    0    0
    B     0    1    0    0    0
    C     0    0    0    0    1
    D     1    0    0    0    0
    E     0    0    0    1    0

The horizontal direction indicates the order in which the cities are visited, and the vertical direction the 5 cities. The state of the network shown above would represent a tour in which city D is visited first, B second, A third, etc. The energy function that they constructed consists of two parts: one part reflecting the syntax of a solution, stating that valid solutions are those in which only one unit in each row and column is active, and a second, data part, reflecting the length of the path corresponding to a given state of the network. From the energy function the weights of the connections can be determined, so the problem is now completely stated in terms of a network. The supposition is that when the balance between the syntax term and the data term is correct, the network should give solutions that have a valid syntax and represent a reasonable solution.
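As an illustration of such a cost function (a sketch in the spirit of the description above, not Hopfield and Tank's exact energy function), the fragment below scores a network state V with a syntax term and a data term; the penalty weights A and B are assumed parameters.

    import numpy as np

    def tsp_cost(V, distances, A=10.0, B=1.0):
        # V[city, position] is the (0/1) state of the unit for a city at a tour position.
        n = V.shape[0]
        # Syntax term: penalize rows or columns that do not contain exactly one active unit.
        syntax = np.sum((V.sum(axis=1) - 1) ** 2) + np.sum((V.sum(axis=0) - 1) ** 2)
        # Data term: the length of the tour encoded by the state, read column by column.
        length = 0.0
        for pos in range(n):
            a = np.argmax(V[:, pos])
            b = np.argmax(V[:, (pos + 1) % n])
            length += distances[a, b]
        return A * syntax + B * length

    rng = np.random.default_rng(0)
    d = rng.random((5, 5))
    d = (d + d.T) / 2.0                              # a symmetric distance table
    V = np.eye(5)[[3, 1, 0, 4, 2]].T                 # the tour D, B, A, E, C as a 5 x 5 state
    print(tsp_cost(V, d))                            # a valid tour: only the length term contributes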

Because the network converges to a local minimum of the energy function, it is to be expected that the solutions are sub-optimal. The simulations of Hopfield, however, show that for a 10-city problem (which has a solution space of 10!/20 = 181,440 paths) the network converges in 16 out of 20 trials to a valid tour, and in 50% of the valid tours to the best or next-best path. Hence the network was capable of selecting a good path, preferring paths in the best 10^-5 fraction of all paths. In a 30-city problem it appeared that the selection of the parameters is more delicate, but the network provided solutions to the problem, excluding poor paths by a factor of 10^-22 to 10^-23.

In other papers Hopfield and Tank describe other (optimization) problems that can be mapped onto this type of network. [Tank 1986] describes implementations of an A/D converter, a signal decision circuit (a network that decomposes a signal into a set of Gaussian basis functions), and a linear programming network. [Tank 1987] describes the mapping of a task-assignment problem onto a Hopfield network.

3.2.1.4 Current Research

The publications of John Hopfield have inspired many people to contribute to this field of research. The following topics are currently under investigation:

- Theoretical investigations: There is a strong parallel between the physics of spin-glasses [Kirkpatrick 1978] and the (binary) Hopfield model. For this reason many physicists and mathematicians are working on theoretical questions such as: the information capacity of the Hopfield model ([Abu-Mostafa 1985] and [McEliece 1987]), partially connected networks ([Canning 1988]), three-state networks [Meunier 1988], the dynamics of the Hopfield model ([Coolen a], [Coolen 1988] and [Forshaw 1988]), and the capabilities of Hopfield nets in which the transmission delay of the connections is taken into account ([Coolen 1989]).

- Mapping of optimization problems on Hopfield networks: There are many disciplines in which fast solutions to complex optimization problems are potentially useful. For this reason, papers from many different disciplines have been published in which researchers show how to map an optimization problem onto a Hopfield network. A few examples are: the concentrator assignment problem [Page 1988], recognition of topological features of graphs [Kree 1988a] [Kree 1988b], image segmentation [Bilbro 1988], etc.

- VLSI-realizations: Many researchers in the field of VLSI design are involved in the implementation of Hopfield networks. This concerns questions of implementation ([Jackel 1986], [Howard 1987], [Murray 1987], [Graf 1987], [Tsividis 1987], [Graf 1988], [Weinfeld 1988], [Verleysen 1988]), and questions due to the technology that is involved ([Lamb 1987], [Schwartz 1987]).

3.2.2 The Backpropagation Algorithm

In the beginning of the 80's, a group of scientists (who have become known as the PDP research group) became interested in the potential power of connectionism, and started some research projects in this field. One of the results was the (re)invention of a learning rule for multi-layer feed-forward networks, the backpropagation algorithm, due to David Rumelhart ([Rumelhart 1986c]). This algorithm is one of the breakthroughs that caused the revival of the worldwide interest in connectionism, and soon some impressive demonstrations of the power of the algorithm appeared. Because the algorithm is very simple and easy to implement, it has become not only the most widely, but also the most wildly used connectionist model. An important critique is heard from scientists in neuroscience and biology, who state that the backpropagation algorithm is highly biologically implausible (see e.g. [Crick 1989]). As such it will never serve as a model of what is happening in a natural neural network.

In the following paragraphs, the backpropagation algorithm is explained (par. 3.2.2.1), some of the demonstrations of the capabilities of the algorithm are discussed (par. 3.2.2.2), some variants of the algorithm are described (par. 3.2.2.3) and finally an overview of the current research is presented (par. 3.2.2.4).


3.2.2.1 The Algorithm

The delta rule of Widrow and Hoff ([Widrow 1960], [Duda 1973]) is a rule that is capable of finding a linear discriminant in a high-dimensional feature space, or stated otherwise, of finding the correct weight values for a single-layer feed-forward network. This means that a network can be trained with this procedure to correctly classify a linearly separable learning set. The modification of the weights according to the delta rule is governed by the following steps ([Rumelhart 1986c]):

1. All samples from the learning set are cyclically presented to the network, until a certain stopping criterion is reached (i.e. when all samples are classified correctly in the linearly separable case, or when the performance improves too little in the non-linearly separable case).

2. Each time an input/output pair p (i.e. an input pattern with the corresponding output pattern) is presented to the network, the weights are changed according to:

    Δ_p w_ij = η (t_pj - o_pj) i_pi

with Δ_p w_ij the change in the weight value between input unit i and output unit j following the presentation of input/output pair p, η a constant, t_pj the target output of the j'th element, o_pj the j'th element of the actual output, and i_pi the value of the i'th element of the input pattern.

The equation states that the difference between the actual output and the target output (the delta) is a measure for the adaptation of the weights. It is not difficult to show that the delta rule performs a gradient descent on a bowl-shaped error surface, so the iterative adaptation of the weights is guaranteed to find the optimal set of weights.

A severe problem of the single-layer class of networks is that they are only capable of correctly classifying linearly separable learning sets. Linear separability, however, is not a general property of a realistic learning set, and the value of perceptron-like architectures therefore appeared to be very limited. The solution to this problem is clearly to add an extra (hidden) layer between the input layer and the output layer. For simple non-linearly separable problems, like the XOR problem, it is not difficult to figure out what values the weights should have, but until recently no procedure was known that could find the correct weights automatically. The problem is that a procedure like the delta rule adapts the weights according to the difference between the actual output and the target output of a unit, and the target is unknown for a hidden unit. The generalized delta rule (or backpropagation algorithm) of Rumelhart provides a way to adapt the weights of a multi-layer network by recursively propagating the error (the delta) back from the output layer, via the hidden layer(s), to the input layer, thus finding a measure for the delta in the hidden layers:


We denote net_pj as the activation of unit j, according to:

    net_pj = Σ_i w_ji o_pi

and the output of the unit, o_pj, as:

    o_pj = f_j(net_pj)

For a proper operation it is required that the output function is differentiable (e.g. a hyperbolic tangent). The modification of the weights according to the generalized delta rule is now given by:

    Δ_p w_ji = η δ_pj o_pi

where the delta for an output unit is determined by the difference of the actual output and the target output, multiplied by the derivative of the output function:

    δ_pj = (t_pj - o_pj) f_j'(net_pj)

and the delta for a hidden unit is determined by a weighted summation of the delta's in the higher layer, multiplied by the derivative of the output function:

    δ_pj = f_j'(net_pj) Σ_k δ_pk w_kj

Also for the generalized delta rule it is not difficult to show that the procedure performs a gradient descent in the error space. The problem is that the error surface of a multi-layer network is no longer bowl-shaped, so there is a risk that the system gets stuck in a local minimum. However, it appears that this rarely happens. This is a problem that is not completely understood yet, but there are a few reasons that might explain why the performance happens to be so good. First, there is a stochastic aspect in the algorithm: the samples are offered to the network in a random order, so the noise of this process is responsible for the fact that the system can jump out of small local minima. Second, the system has a strong preference for large and deep minima, just like the Hopfield network. The network selects (local) minima that are among the best (i.e. the deepest). Third, because there are so many free parameters in the system (i.e. the weights and thresholds), there are numerous ways to reach a deep minimum without getting stuck in local minima ([Rumelhart 1989]). Clearly this is one of the research topics that are still under investigation.
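A compact sketch of the generalized delta rule for a network with one hidden layer, following the equations above; the network size, the sigmoid output function, the learning rate and the training problem (XOR) are choices made for this example.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def forward(W1, W2, x):
        hidden = sigmoid(W1 @ x)
        return hidden, sigmoid(W2 @ hidden)

    def backprop_step(W1, W2, x, target, eta=0.5):
        hidden, output = forward(W1, W2, x)
        # Delta for the output units: (t - o) times the derivative of the sigmoid.
        delta_out = (target - output) * output * (1.0 - output)
        # Delta for the hidden units: the output deltas propagated back through W2.
        delta_hid = hidden * (1.0 - hidden) * (W2.T @ delta_out)
        # Weight changes according to the generalized delta rule (updated in place).
        W2 += eta * np.outer(delta_out, hidden)
        W1 += eta * np.outer(delta_hid, x)

    # Train on the XOR problem; the constant first input component carries the thresholds.
    rng = np.random.default_rng(1)
    W1 = rng.normal(0.0, 1.0, (4, 3))                # 4 hidden units, 3 inputs (incl. constant)
    W2 = rng.normal(0.0, 1.0, (1, 4))                # 1 output unit
    data = [(np.array([1, a, b], float), np.array([a ^ b], float))
            for a in (0, 1) for b in (0, 1)]
    mse = lambda: np.mean([(t - forward(W1, W2, x)[1]) ** 2 for x, t in data])
    print("error before training:", mse())
    for _ in range(5000):
        for x, t in data:
            backprop_step(W1, W2, x, t)
    print("error after training:", mse())            # the squared error has decreased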

Figure 3.4: NETtalk. The network (203 input units, 120 hidden units, 26 output units) performs the mapping of text to phonemes. The phoneme corresponding to the central character in a window of seven characters is derived. In this example, the phoneme /k/, corresponding to the c of "a cat", is obtained from the context.


3.2.2.2 Demonstrations and Applications

Rumelhart reports some interesting demonstrations of the algorithm in his original paper ([Rumelhart 1986c]). These demonstrations concern the well-known parity problem (the classification of an input pattern as odd when an odd number of input units is on - see also [Minsky 1969]), the encoding problem (the network learns to form an encoded version of the input patterns in a small number of hidden units), the symmetry problem (the network classifies input strings as to whether or not they are symmetric around their center), etc.

One of the most intriguing demonstrations of the backpropagation algorithm was presented by Sejnowski ([Sejnowski 1987a]). Sejnowski created a class of feed-forward networks that he called NETtalk, and trained them on the mapping from characters to phonemes (see fig. 3.4). This problem had previously been attacked by a knowledge-based approach or by a look-up table in which the transcriptions were stored, but these methods appeared to be extremely difficult.

Sejnowski showed that a two-layer network could be trained in about one night on a minicomputer to achieve a performance of 98% on a learning set of the 1000 most commonly used words. The performance of the network on a test set of over 20,000 words was about 90%, which is about equal to other text-to-speech systems. The total storage to define the network was 10 kbyte. The experiments of Sejnowski showed that the procedure was able to find a very compact representation of a complex mapping, merely by being shown examples of the mapping.

Other examples of classification and/or mapping problems for which the backpropagation algorithm has been used are: the problem of deducing the secondary structure of a protein from its amino-acid sequence ([Qian 1988]), the distinction between rocks and metal objects on the sea bottom ([Gorman 1988]) and the derivation of shape from shading ([Lehky 1988]).

3.2.2.3 Information Storage in Feed-forward Networks

An attractive aspect of storing associations between patterns in a feed-forward network is that the information is distributed over all the weights. This distributed representation accounts for the fault-tolerance and robustness of the information processing. As Sejnowski ([Sejnowski 1987a]) has demonstrated, it is possible to add an amount of noise to the weights that is equal to about half their average value, without a serious effect on the performance (+/- 1%). He also shows that the retraining of a damaged network is much faster than the original learning.

Another important aspect of feed-forward multi-layer networks is described by Rumelhart ([Rumelhart 1986c]). This is the property that a feed-forward network that is used for classification only stores the discriminating functions between the classes. This means that a network, in some cases, can find a very compact representation of a classifier when compared to a method like the nearest neighbor classifier (see e.g. [Duda 1973]). An example is the encoding problem ([Rumelhart 1986c]), in which a network encodes the input patterns in a few hidden units in the hidden layer, and decodes the patterns in the output layer.

Figure 3.5: Decision regions that can be formed by feed-forward networks: a single-layer network forms a half-plane bounded by a hyperplane, a two-layer network forms convex open or closed regions, and a three-layer network can form arbitrary regions (after [Lippmann 1987]).


3.2.2.4 Capabilities of Feed-forward Networks

In the literature there is a discussion going on about the capabilities of multi-layer feed-forward networks. Lippmann has shown ([Lippmann 1987]) that a network with two hidden layers can be used for any classification problem, provided that there are enough hidden units in the hidden layers (see fig. 3.5). Lippmann shows that the first hidden layer represents the hyperplanes in the feature space, so the activity of a unit in that layer represents whether an input pattern is on one side or the other of the hyperplane. When we consider the second hidden layer as an AND-gate, the activity of a unit in that layer represents that a pattern is in a convex section of the feature space. The output layer can be considered as an OR-gate, which detects whether a pattern is in one convex part of the feature space or in another, thereby forming concave subspaces of the feature space. In this way a network with hidden layers can classify an arbitrary learning set.

The problem so far is that a unit with a sigmoid output function can be considered as much more powerful than a simple OR- or AND-gate. For example, it is not difficult to show that a network with one hidden layer can correctly approximate any one-dimensional function (i.e. including all concave one-dimensional classification functions), and that a network with two hidden layers can approximate an arbitrary multi-dimensional function. It is not clear yet what the theoretical limits are, but there is much progress in this field (see e.g. [Hornik 1988]).

3.2.2.5 Variants of the Algorithm

After the publication of the backpropagation algorithm, a number of variants appeared in the literature. Some of them are worth mentioning here, because they provide either new capabilities or remarkable improvements in performance.

- The first variant is described in the original paper of Rumelhart ([Rumelhart 1986c]), and is a way to make the system converge faster to a reasonable solution. The adaptation of the weights is damped by a mass term, which prevents the system from tumbling into small ravines in error space from which it is difficult to escape. The adaptation of the weights in this variant is based on the following formula:

    Δw_ji(n + 1) = η δ_pj o_pi + α Δw_ji(n)

with α a constant which determines the effect of past weight changes on the current direction of movement in weight space. The value of α is usually about 0.9.

- A second variant is also described by Rumelhart ([Rumelhart 1986c]), and is based on what is called a Sigma-Pi unit. A Sigma-Pi unit has some inputs that are multiplicative, i.e. the input of such a unit is a multiplication factor for some other units. The activation of a unit is therefore a weighted summation of the weighted products of the inputs. Rumelhart has derived a similar learning rule for this type of unit.


- A third variant is the recurrent net ([Rumelhart 1986c]) in which the units have some feedback to themselves. For a proper description of a recurrent net, the factor time must be taken into account, because the activation of a unit at time t + 1 is a function of the activation of the unit at time t. The backpropagation algorithm can be applied to a recurrent net to learn temporal sequences of patterns.

- The fourth variant is due to Sejnowski and is described in his paper about NETtalk ([Sejnowski 1987a]). Instead of (always) adapting the weights after the presentation of a pattern, the weights are only changed when the absolute difference between the actual output and the target output is greater than a certain threshold (which is usually 0.1). The result is that the learning activity is concentrated on those cases that are difficult (e.g. outliers in the feature space, complex shaped clusters, etc.), and that the learning phase is sped up enormously.

- A fifth variant is a modification of a feed-forward network to enable it to learn (spatio)temporal patterns. The net is provided with a latched feedback from the output layer or a hidden layer to the input layer. In this way, the network can be taught to behave like a finite state machine. This is especially useful when the time behavior of a signal is taken into account, e.g. for speech recognition. An example is described in [Bourlard 1988], who also shows the relation of this particular architecture to discriminant hidden Markov models.

- A last variant is described by le Cun [le Cun 1988], and has a close relation to the Neocognitron (see paragraph 3.2.6). Le Cun describes how some a priori knowledge of the problem domain can be used to construct a feed-forward network for a classification task. He describes a digit recognition problem, in which 480 handwritten digits have been drawn in a matrix of 16 by 16 (binary) pixels. The learning set consisted of 320 and the test set of 160 samples. The performance on the test set was improved from 82% for a single-layer network to 98.4% for a network in which every unit of a hidden layer was connected to a small number of units in the lower layer. The knowledge that the relevant features only appear in a small neighborhood made it possible to tailor the network to the problem domain, thus increasing the performance. However, the algorithm decides which features on each level are important.

The recognition in this case is performed somewhere between a statistical and a structural approach: the network detects which structures are relevant during the learning phase, and integrates the evidence for the features (i.e. structures) over the complete network during the classification.

3.2.2.6 Current Research

A few topics that are currently being investigated by research groups around the world are:

- Theoretical studies: ways of speeding up the learning phase of the generalized delta rule (see e.g. [Moody 1988]), the theoretical limits on the approximation of functions with feed-forward nets (see also par. 3.2.2.4 and [Hecht-Nielsen 1987b]), research on convergence and stability aspects of the backpropagation algorithm (see e.g. [Sontag 1988]), and dealing with symbolic structure (see e.g. [Smolensky 1988] or [Dolan 1988]).

- Variants of the algorithm: generalizations of the activation functions of units (see e.g. [Robinson 1988b] and [Niranjan 1988]), minimizing the classification error and the complexity of the network during the learning phase (one of the current research topics of David Rumelhart), reasoning based on the activation value of output and/or hidden units (for AI purposes, see e.g. [Hendler 1988]), special rules for units with feedback (see e.g. [Almeida 1987], [Almeida 1988a], [Almeida 1988b]), etc.

- Applications of the algorithm: feed-forward networks applied to various classification tasks, see e.g. [Mozer 1987], [Tesauro 1988], [Yang 1988], [Bounds 1988], [Bridle 1988a], etc.

- Hardware: the development of special-purpose hardware to speed up learning and classification, see e.g. [Debenham 1988], [Duranton 1988] and [Faure 1988].

3.2.3 The Boltzmann Machine

As is known from the literature, the key to many practical problems is the search for global minima of an error or cost function. A powerful technique that has been proposed for this purpose is called simulated annealing ([Kirkpatrick 1983]). The method is based on an analogy with finding a very low energy state of a metal: the best strategy is to melt the metal and then slowly reduce its temperature. This process is called annealing. Simulated annealing is an analogue of this process: the amount of noise that is added to the optimization process is gradually decreased, thus providing a mechanism for the system to converge to a good (i.e. deep) minimum.
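As an illustration of the principle only (not of any specific system from the literature), a minimal simulated-annealing loop might look as follows; the cost function, the neighbour move and the geometric cooling schedule are assumptions chosen for the sketch.

    import math
    import random

    def anneal(cost, neighbour, x0, T0=1.0, alpha=0.95, steps=1000):
        x, T = x0, T0
        for _ in range(steps):
            y = neighbour(x)                     # propose a small random change
            dE = cost(y) - cost(x)
            # always accept improvements; accept deteriorations with probability
            # exp(-dE / T), which shrinks as the temperature is lowered
            if dE < 0 or random.random() < math.exp(-dE / T):
                x = y
            T *= alpha                           # geometric cooling schedule
        return x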

Simulated annealing has proven to be a very powerful method for solving complex optimization problems on conventional sequential computers. The link that Geoffrey Hinton made ([Hinton 1986]) was to use this technique for neural networks with binary stochastic units. Because the Boltzmann distribution governs the state of such a network, these networks were called "Boltzmann machines". The applications that he describes are: the search for the energy minima in Hopfield networks and the search for the optimum weights in a multi-layer network.

- For a Hopfield network, Hinton proposes the following modification of Hopfield's updating rule: if the energy gap between the 1 and 0 state of the k'th unit is ΔEk, then, regardless of the previous state, set sk = 1 with probability:

    pk = 1 / (1 + e^(-ΔEk / T))

where T is a parameter which acts like the temperature of a physical system (a small sketch of this stochastic rule is given below, after the second point). This rule ensures that in thermal equilibrium the relative probability of two global states is determined solely by their energy difference, and follows a Boltzmann distribution:

    Pα / Pβ = e^(-(Eα - Eβ) / T)

where Pα is the probability of being in the α'th global state, and Eα is the energy of that state. The minimum energy level can be found by simulated annealing. Aarts ([Aarts 1987]) describes how this approach can be used to find solutions for the Travelling Salesman Problem, on a network similar to the one Hopfield and Tank describe in their paper about the mapping of optimization problems onto Hopfield networks (see [Hopfield 1985] and paragraph 3.2.1.3).

- The second contribution of Hinton is a way to find a set of weights for a multi-layer network, with the help of simulated annealing. The general idea is to minimize the discrepancy between the environmental structure and the network's internal model. The learning procedure adapts the weights of the network in order to make the probability distributions of the activity of the visible units (i.e. the input and output units) equal to their probability distributions when the network is running freely, i.e. when no units are being clamped by the environment. This means that an (information theoretic) measure of the distance between the environmental and free-running probability distributions is being minimized by the procedure. This measure is given by ([Kullback 1959]):

    G = Σα P+(Vα) ln [ P+(Vα) / P-(Vα) ]

where P+(Vα) is the probability of the α'th state of the visible units in phase+, when the states are determined by the environment, and P-(Vα) is the corresponding probability in phase-, when the network is running freely. In order to minimize the difference between the distributions, a gradient descent in G is performed. Hinton derives the following relationship between the adaptation of a weight wij and the resulting change in G:

    ∂G/∂wij = - (1/T) (p+ij - p-ij)

where p+ij is the probability, averaged over all environmental inputs and measured at equilibrium, that the i'th and j'th unit are both on when the network is being driven by the environment, and p-ij is the corresponding probability when the network is free running. Additional information that is required for learning is how much to change each weight, how long to collect co-occurrence statistics before changing the weights, how many weights to change at a time, and what temperature schedule to use during the annealing searches.
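As an illustration, the stochastic updating rule from the first point might be sketched as follows (Python; the energy-gap computation assumes a symmetric weight matrix with zero self-connections and an explicit bias term, and all names are illustrative, not taken from Hinton's papers):

    import math
    import random

    def update_unit(states, weights, bias, k, T):
        # Energy gap between the "1" and "0" state of unit k, assuming the
        # usual Hopfield-type energy with symmetric weights and weights[k][k] = 0.
        delta_E = sum(weights[k][j] * states[j] for j in range(len(states))) + bias[k]
        p_on = 1.0 / (1.0 + math.exp(-delta_E / T))   # Boltzmann acceptance probability
        states[k] = 1 if random.random() < p_on else 0
        return states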
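The weight adaptation of the second point, given estimates of the two co-occurrence probabilities, could then be sketched roughly as follows (the 1/T factor is absorbed into the step size epsilon here; the names are assumptions):

    import numpy as np

    def boltzmann_weight_update(weights, p_plus, p_minus, epsilon=0.1):
        # p_plus[i, j]:  probability that units i and j are both on (clamped phase)
        # p_minus[i, j]: the same probability in the free-running phase
        # Gradient descent in G: the change in w_ij is proportional to (p+ij - p-ij).
        return weights + epsilon * (np.asarray(p_plus) - np.asarray(p_minus))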

It appears that the Boltzmann machine learning algorithm is a powerful learning rule. However, a severe problem is that both the learning phase and the classification phase are excessively slow.


3.2.4 The Adaptive Resonance Theory

The Adaptive Resonance Theory (ART) of Carpenter and Grossberg is one of the more biologically plausible connectionist models. In essence it is a theory of the unsupervised learning mechanisms of the brain, which is capable of explaining many biological properties (see e.g. [Grossberg 1980]). The theory grew out of analyses of a simpler learning model which is called competitive learning. The Adaptive Resonance Theory is a very sophisticated and promising model.

3.2.4.1 Competitive Learning Models

Competitive learning is a learning mechanism that has been investigated since the early 1970's (see e.g. [Malsburg 1973], [Grossberg 1976], [Rumelhart 1985] and [Rumelhart 1986d]). The Adaptive Resonance Theory is a connectionist model for a two-layer network that is based on competitive learning. The first layer in the model is a set of input units, which are connected by weighted connections to the units of the second layer. The weights of the connections are considered to be the long-term memory (LTM) of the system. The short-term memory (STM), i.e. the activation of the units in the second layer, is governed by the following equations:

Let the total signal received by unit i in the second layer be given by:

    Ii = Σj oj wij

then the activation of unit i is given by:

    ai = 1  if Ii > max(Ik : k ≠ i)
    ai = 0  if Ii < max(Ik : k ≠ i)

This means that the unit in the second layer that receives the largest signal is chosen for short-term memory storage, and is said to code, classify or cluster the corresponding input pattern. The weights of the winning node are changed for learning according to the following rule:

    Δwij = η (oj - wij)

where η is a learning-rate parameter and i is the index of the winning unit.
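Putting the two steps together, one iteration of competitive learning could be sketched as follows (Python; the learning rate and the use of a weight matrix W with one row per layer-two unit are assumptions made for the sketch):

    import numpy as np

    def competitive_step(W, o, eta=0.05):
        # W[i, j]: weight from input unit j to layer-two unit i;  o: input pattern
        signals = W @ o                        # total signal Ii for every unit i
        winner = int(np.argmax(signals))       # winner-take-all in short-term memory
        W[winner] += eta * (o - W[winner])     # move the winner's weights towards o
        return winner, W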

Grossberg has shown that this learning mechanism is not always stable ([Grossberg 1976]). There are sequences of input patterns that can cause temporally unstable learning. The Adaptive Resonance Theory is an improvement of this model, and is capable of proceeding stably in response to an arbitrary sequence of input patterns.


[Figure 3.6 (diagram not reproduced): the ART 1 circuit, showing the layer 1 and layer 2 STM fields, the bottom-up and top-down LTM pathways between them, and the gain control signals.]

Figure 3.6: ART 1 system: two layers encode patterns of activation in short-term memory (STM). Bottom-up and top-down pathways between the layers contain the adaptive long-term memory (LTM). The remainder of the circuit modulates these STM and LTM processes. Modulation by gain control enables layer one to distinguish between bottom-up input patterns and top-down priming. Gain control signals also enable layer two to react supraliminally to signals from layer one while an input pattern is on. A reset wave is generated when sufficiently large mismatches between bottom-up and top-down patterns occur at layer one. This reset wave selectively and enduringly inhibits previously active units (from [Carpenter 1988]).
