in: B.K. Ersboll, P. Johansen (eds.), SCIA'99, Proc. 11th Scandinavian Conference on Image Analysis, Vol. 2, (Kangerlussuaq, Greenland, June 7-11), Pattern Recognition Society of Denmark, Lyngby, 1999, 739-746
A weight set decorrelating training algorithm
for neural network interpretation and symmetry breaking
Dick de Ridder, Robert P.W. Duin, Piet W. Verbeek and Lucas J. van Vliet
Pattern Recognition Group, Dept. of Applied Physics, Faculty of Applied Sciences, Delft University of Technology, Lorentzweg 1, 2628 CJ Delft, The Netherlands
e-mail: dick@ph.tn.tudelft.nl
Keywords: neural network training algorithms, neural network interpretation, decorrelation, symmetry breaking
Abstract
A modification to neural network training algorithms is proposed which decorrelates certain weights within the network while minimising the mean squared error. The technique was developed to facilitate neural network interpretation in image processing problems, partly by allowing all network weights to be initialised with one fixed, low value. However, it can also be applied to classification tasks in which symmetry breaking is difficult.
1 Introduction
In previous work [3], neural networks were trained to perform image processing operations. One of the main goals of this effort was to understand how a neural network learns to solve this kind of problem, by inspecting the network after training. Understanding of a neural network might provide new insights into the problem at hand and can promote acceptance of neural network tools in image processing practice.
An obvious difficulty in neural network analysis is that an infinite number of distinct network realisations can lead to the same solution in terms of the mean squared error (MSE), of which only a small number can be easily interpreted. Another problem, found in applications in which hidden units are expected to specialise in certain functions, is a tendency of these units to perform approximately the same compound function. While these networks perform well, it is hard to understand their functionality.
We therefore seek ways to train neural networks and end up with distinct hidden unit weight sets, which are understandable in terms of image processing primitives, such as convolution filters. One way to do this is by constraining a neural network in its freedom, e.g. by lowering the number of free parameters or by imposing symmetries [4]. This does not, however, solve the problem of hidden units learning more or less identical functions. Although a modular approach has been suggested to address the latter problem [11], this is not applicable in cases in which there is no clear decomposition of a task's input space into several domains. In this paper, a training algorithm is proposed which minimises, besides the mean squared error (MSE), the squared correlation between hidden unit weight sets. This modified training rule is discussed in section 2.
Note that the problem described above is not the same as that addressed by other methods, which decorrelate or otherwise sparsify network outputs. These algorithms, such as principal component analysis (PCA), independent component analysis (ICA, [10]) or entropy maximisation [13], are unsupervised. In contrast, here we try to approximate a certain given mapping, i.e. the system is supervised, but we add the demand that the network weights be as little correlated as possible.
To demonstrate its usefulness, the technique is applied to two types of problems: image filtering (regression) and classification. In the first application, described in section 3.1, it will be shown that using the new algorithm gives weight sets which can be understood better and thus facilitates network interpretation. The second application, discussed in section 3.2, is to a classification problem in which symmetry breaking plays a role: at a certain point in the training process units have to specialise to learn a non-linear function. The proposed training algorithm can speed up this specialisation. Finally, the merits of the training method will be discussed in section 4.
2 A decorrelating training rule
In a previous paper [3], a range of standard feed-forward neural networks were trained to perform a non-linear image filtering operation. In this application, the neural network as a whole was applied as a non-linear filter in a convolution-like manner, applying it to regions in the input image to obtain one pixel in the output image (see figure 1 for an example of such a network). Each hidden
Figure 1: A neural network used to learn a non-linear image filtering operation: a 5×5 unit input layer, two hidden units (A and B, with incoming weight sets W_A and W_B) and one output unit.
unit performs a multiplication of the input with its set of incoming weights.
One experimental finding was that the number of hidden units used in these networks did not have a large influence on the final performance. Networks with only one hidden unit performed almost as well as networks with 2 hidden layers, each containing 250 hidden units. In an attempt to explain this phenomenon, the network weights were inspected.
Note that, in trained neural networks, the weight sets belonging to the different hidden units need not be exactly the same for the units to perform the same function. In a three layer network, such as the one shown in figure 1, the weight sets W_A and W_B can perform the same function even if they look quite different at first glance. As long as W_A = c_1 W_B + c_2, biases in the second and third layer and the weights between these layers can correct the differences between the two weight sets¹, and their functionality can be approximately identical. The conclusion is that to compare weight sets, one has to look at their correlation.
In the image filtering problem described above, closer inspection revealed that each hidden unit had learned a convolution filter which could well be modelled as a mixture of a Gaussian and a Laplacian filter (see section 3.1 for a more detailed discussion). Even in networks with a quite large hidden layer, there was no real specialisation of hidden units; each learned more or less the same function. It was shown that the average correlation between the weight sets was quite high, even for the large networks. If a way can be found to decrease this correlation during training, hidden units may be forced to
¹ Up to a point, naturally, due to the non-linearity of the transfer functions in the hidden and output layer. For this discussion we assume that the network operates in that part of the transfer function which is still reasonably linear.
specialise. The most simple approach, extracting the weights after a number of iterations and performing a PCA, will not work since that will change the functionality of the network completely. Instead, a balance will have to be found between performance and decorrelation. This idea will be worked out in the next section.
Correlation has been used before in neural network training. In the cascade correlation algorithm [6], it is used as a tool to find an optimal number of hidden units by taking the correlation between a hidden unit's output and the error criterion into account. However, it has not yet been applied to the weights themselves, to force hidden units to learn different functions during training.
2.1 Decorrelation
Suppose there are two weight sets W_A and W_B of incoming weights of hidden units A and B (as in figure 1), with |W_A| = |W_B| = N > 2, var(W_A) > 0 and var(W_B) > 0. If these weight sets are stored in vectors w_A and w_B, where the index each weight is stored in depends on the input unit it is connected to, the correlation C between these vectors can be calculated as:

  C(w_A, w_B) = cov(w_A, w_B) / sqrt( var(w_A) var(w_B) )
              = [ (1/N) Σ_{k=1}^{N} (w_A(k) − w̄_A)(w_B(k) − w̄_B) ]
                / sqrt( [ (1/N) Σ_{k=1}^{N} (w_A(k) − w̄_A)² ] [ (1/N) Σ_{k=1}^{N} (w_B(k) − w̄_B)² ] )      (1)

The correlation C(w_A, w_B) is a number in the range [−1, 1]. For C(w_A, w_B) = ±1, there is a strong correlation; for C(w_A, w_B) = 0 there is no correlation. Therefore, the squared correlation C(w_A, w_B)² has to be minimised in order to minimise the likeness of the two weight sets.
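Equation 1 can be checked directly in a few lines of NumPy (a sketch of ours, not code from the paper); it also illustrates the earlier observation that affinely related weight sets, W_A = c_1 W_B + c_2, are fully correlated:

```python
import numpy as np

def squared_correlation(w_a, w_b):
    """C(w_a, w_b)^2 as in equation 1 (the 1/N factors cancel)."""
    da = w_a - w_a.mean()
    db = w_b - w_b.mean()
    c = np.sum(da * db) / np.sqrt(np.sum(da**2) * np.sum(db**2))
    return c**2

rng = np.random.default_rng(0)
w_a = rng.uniform(-0.1, 0.1, size=25)      # e.g. a flattened 5x5 receptive field
w_b = 2.0 * w_a + 0.5                      # affine copy: functionally equivalent set

print(squared_correlation(w_a, w_b))       # -> 1.0 (up to rounding): fully correlated
```

An unrelated random weight set would instead give a squared correlation well below 1.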
For minimising the squared correlation, one has to compute its gradient w.r.t. a single weight w_A(i). This is rather straightforward, by using the chain rule. Writing

  F_1 = Σ_{k=1}^{N} (w_A(k) − w̄_A)(w_B(k) − w̄_B),
  F_2 = Σ_{k=1}^{N} (w_A(k) − w̄_A)²,
  F_3 = Σ_{k=1}^{N} (w_B(k) − w̄_B)²,

the gradient becomes

  ∂C(w_A, w_B)² / ∂w_A(i) = ∂/∂w_A(i) [ F_1² / (F_2 F_3) ]
    = (F_2 F_3)⁻¹ ∂F_1²/∂w_A(i) + F_1² ∂(F_2 F_3)⁻¹/∂w_A(i)
    = (F_2 F_3)⁻¹ 2 F_1 (w_B(i) − w̄_B) − F_1² (F_2 F_3)⁻² 2 (w_A(i) − w̄_A) F_3
    = 2 [ F_1 F_2⁻¹ F_3⁻¹ (w_B(i) − w̄_B) − F_1² F_2⁻² F_3⁻¹ (w_A(i) − w̄_A) ]      (2)

function w := ConjugateGradientDescent(α, N)
  w_0 := UniformRandom(−α, α)                          -- uniform distribution, range [−α, α]
  d := ∂E(w_0)/∂w                                      -- derivative of MSE
  g_0 := h_0 := −d
  for t := 1 to N do
    w_{t+1} := LineMinimise(E, w_t, h_t)               -- minimise from point w_t along direction h_t
    d := ∂E(w_{t+1})/∂w                                -- derivative of MSE
    g_{t+1} := −d
    γ := ((g_{t+1} − g_t) · g_{t+1}) / (g_t · g_t)
    h_{t+1} := g_{t+1} + γ h_t
  od
end

Figure 2: The conjugate gradient descent (CGD) algorithm. The parameters α and N determine the range of the uniform distribution from which w_0 is initialised and the number of iterations, respectively.

function w := DecorrelatingConjugateGradientDescent(α, β, N)
  w_0 := UniformRandom(−α, α)                          -- uniform distribution, range [−α, α]
  d_1 := ∂E(w_0)/∂w                                    -- derivative of MSE
  d_2 := 2/(H(H−1)) Σ_{k=1}^{H−1} Σ_{l=k+1}^{H} ∂C(w_0^k, w_0^l)²/∂w      -- derivative of squared correlation
  g_0 := h_0 := −(d_1 + β (‖d_1‖/‖d_2‖) d_2)           -- normalised sum of both derivatives
  for t := 1 to N do
    w_{t+1} := LineMinimise(E, w_t, h_t)               -- minimise from point w_t along direction h_t
    d_1 := ∂E(w_{t+1})/∂w                              -- derivative of MSE
    d_2 := 2/(H(H−1)) Σ_{k=1}^{H−1} Σ_{l=k+1}^{H} ∂C(w_{t+1}^k, w_{t+1}^l)²/∂w  -- derivative of squared correlation
    g_{t+1} := −(d_1 + β (‖d_1‖/‖d_2‖) d_2)            -- normalised sum of both derivatives
    γ := ((g_{t+1} − g_t) · g_{t+1}) / (g_t · g_t)
    h_{t+1} := g_{t+1} + γ h_t
  od
end

Figure 3: The decorrelating conjugate gradient descent (DCGD) algorithm. The parameter α again controls initialisation, β determines the relative weight of the derivative of the squared correlation and N determines the number of iterations; w^k denotes the incoming weight set of hidden unit k and H the number of hidden units.
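The analytic gradient of equation 2 is easy to get wrong; this sketch (ours, not code from the paper) implements it and verifies it against a central finite-difference estimate:

```python
import numpy as np

def sq_corr(w_a, w_b):
    """Squared correlation F1^2 / (F2 F3), equal to C(w_a, w_b)^2."""
    da, db = w_a - w_a.mean(), w_b - w_b.mean()
    f1, f2, f3 = np.sum(da * db), np.sum(da**2), np.sum(db**2)
    return f1**2 / (f2 * f3)

def sq_corr_grad(w_a, w_b):
    """Gradient of C(w_a, w_b)^2 w.r.t. w_a, following equation 2."""
    da, db = w_a - w_a.mean(), w_b - w_b.mean()
    f1, f2, f3 = np.sum(da * db), np.sum(da**2), np.sum(db**2)
    return 2.0 * (f1 / (f2 * f3) * db - f1**2 / (f2**2 * f3) * da)

rng = np.random.default_rng(1)
w_a, w_b = rng.normal(size=25), rng.normal(size=25)

# finite-difference check of the analytic gradient
eps, num = 1e-6, np.zeros_like(w_a)
for i in range(w_a.size):
    e = np.zeros_like(w_a)
    e[i] = eps
    num[i] = (sq_corr(w_a + e, w_b) - sq_corr(w_a - e, w_b)) / (2 * eps)

print(np.allclose(sq_corr_grad(w_a, w_b), num, atol=1e-6))   # -> True
```

Note that the terms involving the means w̄_A cancel in the derivative, which is why equation 2 contains only the deviations (w_A(i) − w̄_A) and (w_B(i) − w̄_B).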
2.2 Conjugate gradient descent
The correlation terms in equations 1 and 2 will have to be incorporated into a learning algorithm. Note that, using the derivative given in equation 2, the weight update for each neuron cannot be calculated locally in the network, since it depends on weights in the same layer. Therefore, a global optimisation technique has to be used. Although gradient descent (GD, as in normal back-propagation) can be used in this way, an algorithm known to be superior in speed and convergence was chosen: the conjugate gradient descent method (CGD, discussed in e.g. [14] and [8]). In short, this algorithm performs line minimisations along directions composed of the local gradient direction and previously traversed directions. The algorithm is given in pseudo code in figure 2. Note that the derivative of the error function E to be minimised is used only to update g, and the function E itself only in the line minimisation.
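As an illustration (not the paper's code), a minimal CGD loop with a simple golden-section LineMinimise can be sketched as follows; on a small convex quadratic it recovers the exact minimiser:

```python
import numpy as np

def line_minimise(f, w, h, t_max=1.0, iters=60):
    """Golden-section search for min over t in [0, t_max] of f(w + t*h)."""
    phi = (np.sqrt(5) - 1) / 2
    a, b = 0.0, t_max
    c, d = b - phi * (b - a), a + phi * (b - a)
    for _ in range(iters):
        if f(w + c * h) < f(w + d * h):
            b, d = d, c
            c = b - phi * (b - a)
        else:
            a, c = c, d
            d = a + phi * (b - a)
    return w + 0.5 * (a + b) * h

def cgd(f, grad, w0, n_iter=20):
    """Conjugate gradient descent with Polak-Ribiere updates, as in figure 2."""
    w = w0
    g = h = -grad(w)
    for _ in range(n_iter):
        w = line_minimise(f, w, h)
        g_new = -grad(w)
        gamma = (g_new - g) @ g_new / (g @ g)   # Polak-Ribiere coefficient
        h = g_new + gamma * h
        g = g_new
    return w

# sanity check on a convex quadratic: minimum of 0.5 w'Aw - b'w is A^{-1} b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w_star = cgd(lambda w: 0.5 * w @ A @ w - b @ w, lambda w: A @ w - b, np.zeros(2))
print(np.allclose(w_star, np.linalg.solve(A, b), atol=1e-3))   # -> True
```

In the real algorithm, f is the MSE of the network over the training set and grad its back-propagated derivative; only the line search evaluates f itself.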
2.3 Decorrelating conjugate gradient descent
A problem in minimising two different criteria at the same time is that of weighting. The mean squared error (MSE), which is most commonly used in neural network training, can start very high but usually drops rapidly. The squared correlation part, on the contrary, lies in the range [0, 1], but it may well be the case that it cannot be completely brought down to zero, or only at a significant cost to the error. The latter effect should be avoided: the main training goal is to reach an optimal solution. Therefore, the correlation information is used in the derivative function only, to determine the direction in which steps are taken. It is not used in the absolute minimisation function LineMinimise. The adapted algorithm, called the decorrelating conjugate gradient descent (DCGD) method, is given in figure 3.
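The direction computation of figure 3 can be sketched as follows (our reading of the algorithm, with hypothetical names); the correlation derivative is rescaled to the magnitude of the MSE derivative before being added, so neither criterion dominates purely by scale:

```python
import numpy as np

def dcgd_direction(d_mse, d_corr, beta):
    """Steepest-descent part of the DCGD search direction (figure 3):
    -(d1 + beta * ||d1|| / ||d2|| * d2)."""
    n2 = np.linalg.norm(d_corr)
    if n2 == 0.0:                     # no correlation signal: plain CGD direction
        return -d_mse
    return -(d_mse + beta * np.linalg.norm(d_mse) / n2 * d_corr)

d1 = np.array([0.3, -0.1, 0.2])       # toy MSE derivative
d2 = np.array([100.0, 0.0, 0.0])      # toy, much larger, correlation derivative
g = dcgd_direction(d1, d2, beta=1.0)
# the added correlation term has been rescaled to the same norm as d1
```

Only this direction mixes the two derivatives; the subsequent line minimisation still evaluates the plain MSE, so the error itself is never traded away inside a step.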
Figure 4: The original image (input) and its Kuwahara filtered version (output), used to train the network.
Model weight matrix:

   0.05  −0.04  −0.07  −0.04   0.05
  −0.04  −0.03   0.18  −0.03  −0.04
  −0.07   0.18   0.77   0.18  −0.07
  −0.04  −0.03   0.18  −0.03  −0.04
   0.05  −0.04  −0.07  −0.04   0.05

Figure 5: A mixture of a Gaussian and a Laplacian filter: (a) the model weight matrix above; (b) a cross-section of the fitted model, showing the Gaussian (A), the Laplacian (B) and the model (A−B).
The algorithm proposed above takes correlations between all pairs of hidden units into account. Due to the quadratic complexity in the number of hidden units, application of this technique to large networks is not feasible. A possible way to solve this problem is to take only a subset of correlations into account.
Note that the derivative of the squared correlation is only calculated once for each pair of weight sets and attributed to only one of the weight sets. This allows one weight set to learn a globally optimal function, while the second set is trained to both lower the error and avoid correlation with the first set. It also allows network initialisation with fixed values, since the asymmetrical contribution of the squared correlation term provides a symmetry breaking mechanism.
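The need for such a mechanism can be demonstrated with a toy network (an illustration of ours, not from the paper): when both hidden weight sets start at the same fixed value, their MSE gradients are identical, so any symmetric update rule keeps them identical forever:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(100, 9))                # 100 samples of a flattened 3x3 input window
y = np.tanh(x @ rng.normal(size=9))          # some fixed target mapping

W = np.full((9, 2), 0.1)                     # both hidden weight sets fixed to 0.1
v = np.array([0.5, 0.5])                     # symmetric hidden-to-output weights

h = np.tanh(x @ W)                           # hidden activations
err = h @ v - y                              # output error (linear output unit)
grad_W = x.T @ (err[:, None] * v * (1 - h**2)) / len(x)   # back-propagated MSE gradient

print(np.allclose(grad_W[:, 0], grad_W[:, 1]))   # -> True: the gradients cannot differ
```

Adding the correlation derivative to only one of the two columns makes the updates differ from the first step onward, which is exactly the symmetry breaking DCGD provides.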
3 Experiments
3.1 Weight interpretation
In general, the following issues play a role when training neural networks in order to interpret weight sets:

Architectural constraints: to obtain understandable weight sets, one can incorporate architectural constraints into the network, such as shared weights or connections with fixed weights. However, this quickly leads to training problems and carries the danger of imposing one's ideas of how to solve the problem on the network too much [2].

Initialisation: for later interpretation, it is advisable to initialise the weights not too wildly. The best initialisation would be with all weights set to a single value, preferably low. However, when all weights are the same, the gradients of these weights with respect to the MSE will be the same, and they will never differ. It will be shown that the DCGD method can overcome this problem.

Training algorithm: the GD training method gives filters that are easier to understand than the CGD method, since it takes only small steps in weight space. While this can lead to a suboptimal solution, it makes sure that weight sets will not change abruptly when doing so only minimally enhances performance. DCGD seems to avoid this behaviour of normal CGD.
3.1.1 The image processing problem
To illustrate the usefulness of the proposed decorrelating training algorithm, it was applied to the non-linear image processing problem discussed in section 2. In [3], several networks of the type shown in figure 1 were trained on a non-linear edge-preserving smoothing filter, the Kuwahara filter [12]. This filter operates in a (2k−1)×(2k−1) window around each pixel, which is further subdivided into four k×k subwindows. The central pixel is replaced by the mean of the subwindow with minimum variance.
The networks used had a varying number of hidden units per hidden layer (1, 2, ..., 5, 10, 25, 50, 100, 250) and one or two hidden layers. For all network sizes, most hidden units had learned a weight set which could well be modelled by a linear approximation of the Kuwahara filter. This approximation consists of a combination of a smoothing Gaussian filter and a Laplacian second derivative filter, which sharpens when it is subtracted from an image:

  f(x, y) = c_1 (1 / (2πσ_1²)) exp( −(x² + y²) / (2σ_1²) )
          − c_2 ((x² + y² − 2σ_2²) / (2πσ_2⁶)) exp( −(x² + y²) / (2σ_2²) )      (3)

in which c_1 and σ_1 are parameters to be estimated for the Gaussian and c_2 and σ_2 are parameters for the Laplacian. Figure 5 shows a weight set and a cross-section of a realisation of equation 3.
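Equation 3 can be sampled directly; the sketch below (with illustrative, not fitted, parameter values) reproduces the qualitative shape of figure 5, a positive centre peak with a negative surround:

```python
import numpy as np

def gauss_laplace_mix(x, y, c1, s1, c2, s2):
    """f(x, y) from equation 3: Gaussian minus Laplacian-of-Gaussian."""
    r2 = x**2 + y**2
    gauss = c1 / (2 * np.pi * s1**2) * np.exp(-r2 / (2 * s1**2))
    laplace = c2 * (r2 - 2 * s2**2) / (2 * np.pi * s2**6) * np.exp(-r2 / (2 * s2**2))
    return gauss - laplace

# sample the model on a 5x5 grid, matching the network's input window
xx, yy = np.meshgrid(np.arange(-2, 3), np.arange(-2, 3))
w = gauss_laplace_mix(xx, yy, c1=1.0, s1=1.0, c2=1.0, s2=1.0)
# w peaks at the centre and goes negative towards the corners of the window
```

Fitting c_1, σ_1, c_2 and σ_2 to a learned weight set (e.g. by least squares) then gives the decomposition into the Gaussian and Laplacian components discussed above.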
Figure 6: Weight sets found after training with various training algorithms: gradient descent (GD), conjugate gradient descent (CGD) and decorrelating conjugate gradient descent (DCGD). The two 5×5 weight matrices of each panel are omitted here; the settings and results were:
(a) GD, initialisation UniformRandom(−0.1, 0.1): MSE_test = 1.46·10⁻³, C = −0.90
(b) GD, fixed initialisation 0.1: MSE_test = 1.47·10⁻³, C = 1.0
(c) CGD, initialisation UniformRandom(−0.1, 0.1): MSE_test = 1.42·10⁻³, C = −0.49
(d) CGD, fixed initialisation 0.1: MSE_test = 1.43·10⁻³, C = 1.0
(e) DCGD, initialisation UniformRandom(−0.1, 0.1): MSE_test = 1.32·10⁻³, C = −0.27
(f) DCGD, fixed initialisation 0.1: MSE_test = 1.44·10⁻³, C = 0.59
(g) DCGD, initialisation UniformRandom(−0.1, 0.1): MSE_test = 1.44·10⁻³, C = 0.61
(h) DCGD, fixed initialisation 0.01: MSE_test = 1.44·10⁻³, C = 0.51
Figure 7: Weight sets found in a one hidden layer, 5 hidden unit network after training with DCGD. (The five 5×5 weight matrices are omitted here.)
3.1.2 Experiments with DCGD
A network with two hidden units was trained using GD, CGD and DCGD and different initialisations. The dataset consisted of 1,000 samples randomly drawn from an image (input) and a Kuwahara filtered version of that image (output). The images used are shown in figure 4. Training was stopped when the error on an independent validation set started to rise. The correlation weight parameter β was set to 1; the number of iterations per call to the algorithms, N, was set to 10.

In figure 6, some resulting weight sets are shown. The first pair of sets shows the results obtained with GD training, with a UniformRandom(−0.1, 0.1) initialisation, i.e. with weight values drawn from a uniform distribution on [−0.1, 0.1]. Both sets resemble a mixture of a Gaussian and a Laplacian. The second pair of sets shows what happens when all weights are initialised to a fixed value (0.1): both sets have learned the same mixture. Training with CGD and a random initialisation, shown in figure 6 (c), leads to even less understandable mixtures. CGD also finds two identical weight sets when initialised with a fixed value (figure 6 (d)). DCGD with random initialisation can lead to unclear filters (figure 6 (e)), but also to a separation into a Laplacian and a Gaussian (figure 6 (g)). Finally, DCGD makes it possible to control the formation of weight sets better by initialising with fixed values, since the method takes care of symmetry breaking. This is shown in figures 6 (f) and (h) for two different initialisations. In both cases, the mixture is well separated.
The MSE, calculated on an independent test set, does not change much in these experiments; i.e., although all networks perform well, DCGD is clearly the only method giving insight into the two distinct filters learned.
Note that in all weight sets the upper right weight is very high, and the weight in the third row, second column is rather too low. This is an artifact due to the specific data set used to train the networks.
Of course, the network discussed in this section is very small. Therefore, we applied the same method to larger networks, with more hidden units (3, 5, 7 and 10) and/or one more hidden layer. The addition of hidden units leads only to a gradual change from Gaussian to Laplacian, through a series of mixtures (for an example, see figure 7). Clearly, these units do not add functionality. Adding an extra hidden layer of 5 or 10 units to the basic network, between the layer trained with DCGD and the output layer, does not change the weight sets much and does not significantly increase performance.
3.2 Symmetry breaking
A well-known problem in training neural networks is that of symmetry breaking. At a certain point in the training process, hidden units will have to specialise in order for the network to perform better. This symmetry breaking is often accompanied by sudden sharp drops in the MSE [9].

A problem in which symmetry breaking is of great importance is that of classification of the Annema data set [1]. This two-class 2D dataset is constructed in such a way that a highly non-linear decision surface is necessary (see figure 8 (a)). To construct this decision surface a three layer, two hidden unit network suffices. But, although in principle such a network has enough parameters to solve the problem, finding them using a training process is very difficult. The CGD method is normally not capable of finding the right weight settings.
The original dataset was modified slightly, since the proposed training algorithm runs into problems when the weight sets to be decorrelated have only two values: two vectors of length two are always fully correlated, i.e. C(w_A, w_B) = ±1, ∀(w_A, w_B). Therefore, a third input value was added which was always 0. With this dataset, 1,000 networks were trained. Typical training runs are shown in figure 8 (c): either the network converges quite quickly or it does not converge at all. The number of training runs in which the network converged in fewer than 50 training cycles was counted, where convergence was defined as the MSE dropping below 0.018. This procedure was repeated for a number of different settings of the correlation term weight parameter β and for two different initialisations, with values drawn from a uniform distribution, one in the range [−0.01, 0.01] and one in the range [−0.5, 0.5], with the number of iterations per call, N, set to 2. To have a baseline, the whole procedure was repeated with a training algorithm in which the correlation derivative term d_2 was replaced by a vector drawn from a [−1, 1] uniform distribution (denoted by NCGD, for noise CGD).

Figure 8: (a) the Annema dataset; (b) the network used; (c) some typical training runs; (d) the performance of decorrelating CGD (DCGD) versus noise CGD (NCGD) for various settings of β, initialisation UniformRandom(−0.01, 0.01); (e) same, with initialisation UniformRandom(−0.5, 0.5).
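The degenerate case that motivated the extra zero input is easy to verify numerically (a small check of ours): after mean subtraction, every 2-vector lies on the line (t, −t), so any two non-constant 2-vectors are proportional and hence fully correlated:

```python
import numpy as np

rng = np.random.default_rng(3)
pairs = [(rng.normal(size=2), rng.normal(size=2)) for _ in range(5)]
corrs = [abs(np.corrcoef(a, b)[0, 1]) for a, b in pairs]
print(corrs)   # -> all values are (numerically) 1.0
```

With a constant third input, the weight vectors have length three, the centred vectors span a 2-D subspace, and the correlation can actually be driven towards zero.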
The results, shown in figure 8 (d) and (e), clearly show a large improvement in convergence of DCGD over standard CGD (note that the case β = 0 is standard CGD). For initialisation with small values, which is standard practice, the effect is most clear. Initialising with larger values (figure 8 (e)) improves the performance of standard CGD, but DCGD improves equally. DCGD clearly works better than just using random noise during the training process (NCGD). If the setting of β is too high (β > 1.75), correlation plays too large a role in the minimisation process and convergence worsens again. The same holds for the parameter β in NCGD, although this method seems to be a little less sensitive to the exact setting. Unfortunately, the maximum improvement is not reached for β = 1.0, which means that tuning of β will be necessary to obtain optimum performance in various applications.
4 Conclusions and discussion
A problem in understanding how neural networks learn is that it is often hard to interpret weight sets after training. To facilitate this interpretation, a modification to training algorithms has been proposed which, besides minimising the MSE, has minimisation of the squared correlation between weight sets as a goal. It can break the symmetry in networks initialised with fixed values, allowing weights to converge more uniformly to their optimal values.
The new method has been applied to an image processing problem and was shown to ease the interpretation of the parameters. Also, the algorithm was applied to a classification problem in which symmetry breaking, i.e. specialisation of hidden units, is important. It was demonstrated that the method can give large improvements for such problems.
A number of disadvantages of the method have to be addressed as well. Firstly, there is the large computational demand of the algorithm, which makes it impractical to apply the technique to very large networks. One can get around this problem by not taking all inter-weight set correlations into account, or by addressing a different subset of weight set pairs in each iteration.
Secondly, the outcome of the training process is still dependent on the choice of a number of parameters, to which the new method even adds one (the weight factor β). If the parameters are chosen poorly, one will still not train understandable networks. We believe this to be a problem of neural networks in general, which cannot be solved easily: a certain amount of operator skill in applying neural networks is a prerequisite for obtaining good results.
Thirdly, correlation may not always be the best criterion for obtaining interpretable weight sets. There may very well be problems for which the optimal weight sets are different, yet highly correlated. However, the method proposed in this paper does show that taking non-error criteria into account is feasible. We plan to investigate enforcing symmetries or invariances in this way. Another open question is whether the technique can be used in a constructive way to minimise the number of hidden units necessary for a certain task, as in cascade correlation.
Finally, the method could also be applied to more complicated problems, such as feature extraction from image data using neural networks. However, as problem complexity grows, it becomes increasingly difficult to judge what individual neurons do. For classification problems, this could be circumvented by not inspecting the extracted features themselves, but by using a method for quantifying the importance of features (e.g. [7, 5]).
Acknowledgements
This research is partly supported by the Foundation for Computer Science in the Netherlands (SION) and the Dutch Organisation for Scientific Research (NWO).
References
[1] A.J. Annema. Modeling and implementation of analog integrated neural networks. PhD thesis, University of Twente, Enschede, 1994.
[2] D. de Ridder. Shared weights neural networks in image analysis. Master's thesis, Delft University of Technology, March 1996. Download from http://www.ph.tn.tudelft.nl/~dick.
[3] D. de Ridder, R.P.W. Duin, P.W. Verbeek, and L.J. van Vliet. On the application of neural networks to non-linear image processing tasks. In S. Usui and T. Omori, editors, Proceedings International Conference on Neural Information Processing 1998, Vol. I, pages 161-165, Tokyo, 1998. JNNS, Ohmsha Ltd.
[4] D. de Ridder, A. Hoekstra, and R.P.W. Duin. Feature extraction in shared weights neural networks. In E.J.H. Kerckhoffs, P.M.A. Sloot, J.F.M. Tonino, and A.M. Vossepoel, editors, Proceedings of the 2nd annual conference of the Advanced School for Computing and Imaging, pages 289-294, Delft, The Netherlands, 1996. ASCI.
[5] M. Egmont-Petersen, J.L. Talmon, A. Hasman, and A.W. Ambergen. Assessing the importance of features for multi-layer perceptrons. Neural Networks, 11(4):623-635, 1998.
[6] S.E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 524-532, Los Altos, CA, 1990. Morgan Kaufmann.
[7] R.P. Gorman and T.J. Sejnowski. Analysis of the hidden units in a layered network trained to classify sonar targets. Neural Networks, 1(1):75-89, 1988.
[8] J. Hertz, A. Krogh, and R.G. Palmer. Introduction to the theory of neural computation. Addison-Wesley, Reading, MA, 1991.
[9] A. Hoekstra and R.P.W. Duin. Investigating redundancy in feed-forward neural classifiers. Pattern Recognition Letters, 18(11-13):1293-1300, 1997.
[10] J. Hurri. Independent component analysis of image data. Master's thesis, Dept. of Computer Science and Engineering, Helsinki University of Technology, Espoo, Finland, March 1997.
[11] R.A. Jacobs, M.I. Jordan, and A.G. Barto. Task decomposition through competition in a modular connectionist architecture: the what and where vision tasks. Cognitive Science, 15:219-250, 1991.
[12] M. Kuwahara, K. Hachimura, S. Eiho, and M. Kinoshita. Digital processing of biomedical images, pages 187-203. Plenum Press, New York, NY, 1976.
[13] J.C. Principe and D. Xu. Information-theoretic learning using Renyi's quadratic entropy. In Proceedings of the 1st International Workshop on Independent Component Analysis and Signal Separation, Aussois, France, January 11-15, 1999, pages 407-412, 1999.
[14] J.R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical Report CMU-CS-94-125, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, March 1994.