
RBF Networks with Mixed Radial Basis Functions

Ö. CIFTCIOGLU and S. SARIYILDIZ

TU Delft, Faculty of Architecture, Computer Science Department

Berlageweg 1, 2628 CR Delft, The Netherlands

o.ciftcioglu@bk.tudelft.nl

ABSTRACT

After the introduction of neural network technology for multivariable function approximation, radial basis function (RBF) networks have been studied from many different aspects in recent years. From the theoretical viewpoint, the approximation and uniqueness of the interpolation have been studied, and it has been established that an RBF network can approximate any multivariate continuous function arbitrarily well, provided enough radial basis functions are employed. The number of hidden nodes, the type of radial basis functions, the width of the basis functions and the cluster centres of the basis functions are some example issues on which numerous research works have appeared in the literature. In contrast, there are remarkably few papers addressing the functional approximation from the frequency-domain viewpoint. They identify that basis functions basically behave as low-pass filters. Due to this over-filtering effect, RBF networks are not favourable for high frequencies unless a relatively high number of hidden nodes is used. Therefore, for approximations that contain only low-frequency components, RBF networks provide satisfactory results, and this is presumably the case in many favourable RBF applications reported in the literature. However, considering the filtering characteristics of different radial basis functions, one can improve the performance of RBF networks with a mixture of radial basis functions.

1. INTRODUCTION

After the introduction of neural network technology (Broomhead and Lowe 88) for multivariable function approximation, radial basis function (RBF) networks have been studied from many different aspects in recent years. From the theoretical viewpoint, the approximation and uniqueness of the interpolation have been studied, and it has been established that an RBF network can approximate any multivariate continuous function arbitrarily well, provided enough radial basis functions are employed (Powell 92). This representation is closely related to approximation theory and regularisation techniques (Tikhonov 77). Poggio and Girosi (90) developed regularisation networks from approximation theory, with radial basis function networks as a special case.

The RBF network has a feed-forward neural network structure. Therefore, next to the standard back-propagation algorithm, many other training algorithms have been devised and reported in the literature. One of the outstanding algorithms of this kind is presumably the orthogonal least squares (OLS) algorithm (Chen et al. 91), which has a number of merits providing insight into the interpretation of the network. Next to the training algorithms, the number of hidden nodes, the type of radial basis functions, the width of the basis functions and the cluster centres of the basis functions are some example issues on which numerous research works have appeared in the literature. In contrast, there are remarkably few papers (Wong 91; Borghese 98) addressing the functional approximation from the frequency-domain viewpoint. They identify that basis functions basically behave as low-pass filters. Wong (91) hints at the shortcomings of RBF networks as over-filtering and difficult learning of high frequencies during training. Borghese (98) deals with gaussian low-pass filters, suggesting different gaussian units with different low-pass cut-off frequencies as hierarchical RBFs. Because of some essential desirable characteristics (e.g., factorisation and theoretical developments leading to analytical solutions), gaussian basis functions are almost always implemented, at least in practical applications. However, considering the different filtering characteristics of different types of radial basis functions, one can improve the performance of RBF networks with a mixture of basis functions. Such networks may be coined Mixture RBF Networks.

2. REGULARISATION BY RBF FILTER

We consider a set of N data vectors {x_i, i=1,...,N} of dimension p in R^p and N real numbers {d_i, i=1,...,N}. We seek a function f(x): R^p → R that satisfies the interpolation conditions f(x_i)=d_i, i=1,...,N. There are several solution methods for this interpolation problem, such as Lagrange interpolation functions. Here we consider radial basis functions (RBF) due to their suitability for multivariable interpolation. The characteristic feature of the radial functions considered here is that their response decreases monotonically with distance from a central point. The RBF approach constructs a linear space using a set of radial basis functions φ(||x − c_i||) defined with a norm, which is generally Euclidean. The centre, described by a vector c_i, a distance scale and the shape of the radial function are parameters of the model. By means of these basis functions, we can model the function as

$$ y = \sum_{i=1}^{N} h_i\,\varphi[d(u, c_i)], \qquad d(u, c_i) = \|u - c_i\| = \Big(\sum_{j=1}^{p} (u_j - c_{ij})^2\Big)^{1/2} $$

where d(·) is a distance measure, usually taken to be the Euclidean norm, and the h_i are coefficients. Radial functions are a special class of functions whose characteristic feature is that their response decreases (or increases) monotonically with distance from a central point. The RBF approach thus constructs a linear function space using a set of radial basis functions φ(||x − c_j||), where the centre described by a vector c_j, a distance scale and the shape of the radial function are parameters of the model. Writing the interpolant explicitly,

$$ f(x) = \sum_{j=1}^{N} w_j\,\varphi(\|x - c_j\|), \qquad c_j \in \mathbb{R}^p $$

where the w_j are weights or coefficients. The interpolation conditions f(x_i) = d_i, i=1,...,N can be generalised to the multivariable case as

$$ f_k(x) = \sum_{j=1}^{N} w_{kj}\,\varphi(\|x - c_j\|), \qquad k = 1, \ldots, s, \quad x \in \mathbb{R}^p $$

where the mapping from input to output is R^p → R^s and f_k(x_i) = d_{ik}, i=1,...,N; k=1,...,s. Once the appropriate basis functions (φ) and the distance measure are selected, the interpolation function can be established.
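As a concrete illustration, the mapping above can be evaluated directly. The following minimal Python sketch (NumPy only) computes f_k(x) for a batch of inputs; the centre positions, weights and the gaussian width a are illustrative assumptions, not values from the paper.

```python
import numpy as np

def rbf_forward(X, C, W, phi):
    """Evaluate f_k(x) = sum_j W[k, j] * phi(||x - c_j||) for each row x of X.
    X: (n, p) inputs, C: (N, p) centres, W: (s, N) weights."""
    # Pairwise Euclidean distances ||x_i - c_j||, shape (n, N)
    D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    return phi(D) @ W.T                      # shape (n, s)

a = 0.5                                      # assumed gaussian width parameter
phi_gauss = lambda r: np.exp(-a * r**2)

X = np.random.randn(5, 2)                    # 5 points in R^2
C = np.random.randn(4, 2)                    # N = 4 centres
W = np.random.randn(3, 4)                    # s = 3 outputs
print(rbf_forward(X, C, W, phi_gauss).shape) # (5, 3)
```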

The formulation above approximates a continuous multivariate function f(x) by an approximating function F(x,w). However, in the case of learning a smooth mapping from a discrete set of examples, the approximation problem becomes ill-posed. The conversion of the problem to a well-posed one is via regularisation, where a cost functional is defined as

$$ T[F] = \frac{1}{2} \sum_{i=1}^{N} \big[d_i - F(x_i, w)\big]^2 + \frac{\lambda}{2}\,\|DF\|^2 $$

Here D is the differential operator, λ is a positive real number called the regularisation parameter, and T[F] is called the Tikhonov functional. The solution is obtained through the Fréchet differential, which results in the Euler-Lagrange equation for the Tikhonov functional T[F] of the form

$$ \tilde{D} D\,F(x) = \frac{1}{\lambda} \sum_{i=1}^{N} \big[d_i - F(x_i, w)\big]\,\delta(x - x_i) $$

where D̃ is the adjoint of D. This defines a necessary condition for F(x) to be an extremum. The solution of this equation is obtained by an integral transformation of the right-hand side. The solution to the Euler-Lagrange equation is

$$ F_\lambda(x, w) = \sum_{i=1}^{N} w_i\,G(x, x_i) $$

where G(x,x_i) is the Green's function for the self-adjoint linear differential operator L = D̃D. The function G(x,x_i), for a specified centre x_i, depends only on the form of the differential operator D. If D is translationally invariant, the Green's function G(x,x_i) centred at x_i depends only on the difference between the arguments x and x_i. If D is both translationally and rotationally invariant, G(x,x_i) depends only on the Euclidean norm of the difference vector x − x_i:

G(x,x_i) = G(||x − x_i||); that is, the Green's function must be a radial basis function. The regularised solution then takes the form

$$ F_\lambda(x, w) = \sum_{i=1}^{N} w_i\,G(\|x - x_i\|) $$

The solution constructs a linear function space that depends on the known data points according to the Euclidean distance measure. The coefficients wi can be calculated by solving the set of linear equations

$$ \begin{bmatrix} f(x_1) \\ \vdots \\ f(x_N) \end{bmatrix} = \begin{bmatrix} G(x_1, x_1) & \cdots & G(x_1, x_N) \\ \vdots & \ddots & \vdots \\ G(x_N, x_1) & \cdots & G(x_N, x_N) \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix}, \qquad \text{i.e.}\quad w = G^{-1} d $$
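A minimal sketch of this solve under an assumed gaussian Green's function; the ridge term λI corresponds to the regularisation parameter above (λ=0 recovers exact interpolation), and the data set is illustrative.

```python
import numpy as np

def fit_rbf_weights(X, d, a=1.0, lam=0.0):
    """Solve (G + lam*I) w = d for the coefficients w, with
    G[i, j] = exp(-a * ||x_i - x_j||^2) the gaussian Green's matrix."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    G = np.exp(-a * D**2)
    return np.linalg.solve(G + lam * np.eye(len(X)), d)

X = np.linspace(0, 1, 8)[:, None]       # 8 scattered 1-D points
d = np.sin(2 * np.pi * X[:, 0])         # target values
w = fit_rbf_weights(X, d, a=20.0, lam=1e-6)
```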


For the present work, the following forms of radial basis functions are of particular interest:

$$ \varphi(r) = \exp(-a r^2), \qquad \varphi(r) = (1 - r^2)\exp(-r^2/2) $$

which are known as the gaussian and the Mexican-hat wavelet, respectively. For the approximation there can be as many basis functions as there are (x_i, d_i) pairs available. For frequency-domain considerations we can proceed as follows. When the distance between consecutive centres (c_{j+1} − c_j) becomes vanishingly small, for simplicity in the one-dimensional case, we write


$$ f(x) = \int_{\mathbb{R}} w(c)\,\varphi(x - c)\,dc = w(x) * \varphi(x) $$

where * indicates convolution. For the discrete case,

$$ f(x) = \sum_{k=-\infty}^{\infty} w_k\,\varphi(x - x_k) $$

and in the frequency domain

$$ F(\omega) = W(\omega)\,\Phi(\omega) $$

where W(ω) is the Fourier transform of the weight series, or equivalently of the data points, and Φ(ω) is the Fourier transform of the radial basis function. Φ(ω) is given by

$$ \Phi_g(\omega) = \sqrt{\pi/a}\,\exp\!\big(-\omega^2/(4a)\big) \qquad \text{and} \qquad \Phi_m(\omega) = \omega^2 \exp(-\omega^2/2) $$

for the gaussian and the Mexican-hat function, respectively.

Φ(ω) for both the normalised gaussian and the Mexican-hat function is shown in Fig.1. The important conclusion of this approximate frequency-domain analysis (see Borghese 98 for a more detailed treatment and the conditions imposed) is that the approximation to f(x) by radial basis functions is low-pass filtered by gaussian RBFs and band-pass filtered by Mexican-hat type RBFs. In both cases the filter widths depend on the width parameter σ of the basis functions. In the literature, RBF nets with gaussian functions are often reported, and the width parameter value is not very crucial due to the dominant low-pass filter effect, as this can be compensated by adding some more basis functions to the net to support the local behaviour.

Fig.1: Gaussian and Mexican-hat RBF filters

Namely, if high-frequency components of f(x) are of concern, then one can use more local gaussians with smaller widths. Such a treatment leads to the utilisation of a high number of basis functions; in the extreme case, the number of basis functions used equals the number of data points. From the above, it is easy to understand why Tikhonov regularisation provides a smooth function approximation from scattered data: the data, in a way, are low-pass filtered through the (low-pass) basis functions, e.g. gaussians, where large deviations are filtered out. Therefore, for approximations that need only low-frequency components, RBF networks provide satisfactory results, and this is presumably the case in many RBF applications reported in the literature. In fact, this is the essential reason that RBF networks are deemed to be over-smoothing. To circumvent this property, we can use a mixture of RBFs which have different filter characteristics. In this work, these filters are the gaussian and Mexican-hat filters. The band-pass characteristic of the Mexican-hat filter can easily represent the higher-frequency components of the function with fewer RBFs, whereas the same approximation would otherwise require an excessive number of local gaussians.
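The filtering claims can be checked numerically from the closed-form transforms above. The sketch below (the width a and the frequency grid are illustrative assumptions) confirms that Φ_g(ω) is maximal at ω=0, i.e. low-pass, while Φ_m(ω) vanishes at ω=0 and peaks at ω=√2, i.e. band-pass.

```python
import numpy as np

a = 1.0                                  # assumed gaussian width parameter
w = np.linspace(0.0, 5.0, 501)           # illustrative frequency grid

Phi_g = np.sqrt(np.pi / a) * np.exp(-w**2 / (4 * a))  # gaussian: low-pass
Phi_m = w**2 * np.exp(-w**2 / 2)                      # Mexican-hat: band-pass

print(w[np.argmax(Phi_g)])               # 0.0    -> pass-band centred at DC
print(w[np.argmax(Phi_m)])               # ~1.414 -> band-pass peak at sqrt(2)
```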

3. RBF NETWORK OF MIXTURE BASES

To demonstrate the functionality of mixed RBFs, two examples which are specific and demonstrative for the present research are selected for RBF analysis. The first example deals with the modelling of a single-input-single-output (SISO) non-linear dynamic system given by

$$ y(k+1) = \frac{y(k)\,y(k-1)\,y(k-2)\,u(k-1)\,\big(y(k-2) - 1\big) + u(k)}{1 + y(k-1)^2 + y(k-2)^2} $$

with a white-noise input. The second example deals with the wavelet transform by RBF networks, where the RBF structure contains 32 inputs and 32 outputs. For each case, the number of input-output pairs used is one hundred. The preceding research on the wavelet transform by RBF networks is already reported in the literature (Ciftcioglu 1999).
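The training data for the first example can be generated along the following lines; the unit-variance white noise and the zero initial conditions are assumptions, as the paper does not specify them.

```python
import numpy as np

def simulate_siso(n_steps, rng=np.random.default_rng(0)):
    """Simulate y(k+1) = [y(k)y(k-1)y(k-2)u(k-1)(y(k-2)-1) + u(k)]
                         / [1 + y(k-1)^2 + y(k-2)^2] with white-noise input."""
    u = rng.standard_normal(n_steps)     # white-noise input (assumed unit variance)
    y = np.zeros(n_steps + 1)            # zero initial conditions (assumed)
    for k in range(2, n_steps):
        num = y[k] * y[k-1] * y[k-2] * u[k-1] * (y[k-2] - 1.0) + u[k]
        den = 1.0 + y[k-1]**2 + y[k-2]**2
        y[k+1] = num / den
    return u, y

u, y = simulate_siso(100)                # 100 input-output pairs, as in the paper
```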

The RBF network is trained by means of the orthogonal least squares (OLS) method (see Appendix). In the OLS algorithm, initially the number of hidden layer nodes is equal to the total number of training patterns. During training the nodes are ordered according to their contribution to the function approximation. Before training, the radial basis function type selected for the first half of the hidden layer nodes is gaussian and for the second half Mexican-hat; a sketch of the resulting candidate pool is given below. The appropriate basis function types from this mixed basis function network are determined during training. In particular, the selection procedure by OLS is extremely consistent with the filtering interpretation of the RBF functions. Typical training results for the SISO system are presented in Figs. 2-4. Fig.2 gives the results obtained from gaussian RBFs with 31 basis functions.
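A sketch of the mixed candidate pool just described, for one-dimensional inputs with centres placed at the training points; the gaussian width a is an assumed value, and the OLS selection over these columns is sketched in the Appendix.

```python
import numpy as np

def mixed_dictionary(X, centres, a=1.0):
    """Candidate regressor matrix: first half gaussian columns,
    second half Mexican-hat columns, one column per centre."""
    R = np.abs(X[:, None] - centres[None, :])          # |x_i - c_j|
    half = len(centres) // 2
    gauss = np.exp(-a * R[:, :half]**2)
    mexhat = (1.0 - R[:, half:]**2) * np.exp(-R[:, half:]**2 / 2)
    return np.hstack([gauss, mexhat])

X = np.linspace(-1, 1, 100)
A = mixed_dictionary(X, centres=X.copy(), a=4.0)  # 100 candidates: 50 + 50
```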

[Fig.1 plot: amplitude versus normalized frequency for the gaussian and Mexican-hat RBF filters.]


Fig.2: Dynamic system output and its estimate obtained from the RBF network with gaussians, where the mean error=7.63E-2 and the number of gaussians=31.

Fig.3 gives the results obtained from Mexican-hat RBFs with the same number of basis functions. Fig.4 gives the results from a mixture of gaussians and Mexican-hat RBFs, where the total number of basis functions is 31 and the number of Mexican-hat type functions is 26. It is interesting to note that there is no essential difference among the results, since the non-linear system output covers a wide frequency range. The training algorithm in each case finds an appropriate non-linear model for the system with different numbers of basis functions of different types, including mixture RBFs. To compare the results in perspective, the number of basis functions is taken as 31 in each case; this number is the minimum obtained from the analysis by mixture basis functions with a given width. The differences among the reported results are slight, since each case has different merits over the total frequency range and the RBF network verifies this. However, the results are slightly better for Mexican-hat, as is also demonstrated in the mixture mode.

Fig.3: Dynamic system output and its estimate obtained from the RBF network with Mexican-hat basis functions, where the mean error=5.12E-2 and the number of functions=31.

Fig.4: Dynamic system output and its estimate obtained from the RBF network with mixed basis functions, where the mean error=6.58E-2, the number of Mexican-hat functions=26 and the number of gaussians=6.

The second example concerns the wavelet transform of a set of pump vibration data that contains dominantly high frequencies but also some low-frequency components. For the present analysis, the wavelet transform of these data is considered due to favourable analysis conditions. Very briefly, given a function f(x), the wavelet transform provides coefficients, called 'wavelet' coefficients, which result from the inner products of the signal and a family of small wave packets called 'wavelets'. With these coefficients and the associated wavelet functions, the function f(x) can be expressed at different levels of approximation, called multi-resolution. The wavelet transform is similar to the discrete Fourier transform in the sense that a block of data is transformed into another block of transformed data of the same size. However, while the Fourier transform provides no local information in the time domain, the wavelet transform does; conversely, the Fourier transform provides local information in the frequency domain, while the wavelet transform does not. Wavelet coefficients are orthogonal to each other and their mean is zero. The implication of this is that the Mexican-hat wavelet type RBF would expectedly



perform a better approximation for the wavelet transform representation by the RBF network for a given number of basis functions.

Figures 5-10 indicate the typical results obtained from the research on the wavelet transform by RBF networks. The length of the data block used for each wavelet transform is 32. The number of blocks, or in RBF terminology the number of patterns, used is 100.
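For reference, one 32-sample block can be turned into 32 wavelet coefficients with, e.g., PyWavelets; the choice of the Daubechies-4 wavelet and the periodized full decomposition are illustrative, since the paper does not state which wavelet family was used.

```python
import numpy as np
import pywt

block = np.random.randn(32)                        # one 32-sample data block
coeffs = pywt.wavedec(block, 'db4', mode='periodization')
flat = np.concatenate(coeffs)                      # 32 coefficients, same size as block
print(flat.shape)                                  # (32,)
```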

Fig.5: Wavelet transforms and their estimates (patterns 12 and 13 of total 100 patterns) using the Mexican-hat basis function network. The number of nodes=34 and the mean error=6.50E-2. Note that the two plots are virtually the same and the minor deviations are hardly visible.

The number of basis functions used in each case is 34. In particular, Fig.5 and Fig.8 give the wavelet transform and its estimate using the Mexican-hat basis function network. Figs.6 and 9 give the same using gaussian basis functions. Figs.7 and 10 give the same using the mixture basis function network, where in both cases the number of Mexican-hat bases is 25.

The least mean errors out of the hundred transforms are obtained using Mexican-hat basis functions. Since the patterns have a wide frequency band covering the whole range determined by the sampling frequency, band-pass type basis functions are supposedly superior. The band-pass filters can be placed at optimal locations during the OLS training of the RBF network.

DISCUSSION AND CONCLUSIONS

The present paper puts essential emphasis on the frequency-domain characteristics of the basis functions and aims to demonstrate the effectiveness of this new dimension in RBF network design for enhanced network performance. In principle, it is not straightforward to carry out such a comparative study, since the RBF network outcome depends on several independent factors. It is a well-known fact that RBF networks use considerably more hidden layer nodes than perceptron-type feed-forward neural networks, although the reason for this is in most cases not explicitly mentioned.

Fig.6: Wavelet transforms and their estimates (patterns 12 and 13 of total 100 patterns) using the gaussian basis function network. The number of nodes=34 and the mean error=4.58E-1.

Fig.7: Wavelet transforms and their estimates (patterns 12 and 13 of total 100 patterns) using the mixture (Mexican-hat and gaussian) basis function network. The number of nodes=34 and the mean error=2.9E-1. The number of Mexican-hat bases is 25.

Fig.8: Wavelet transforms and their estimates (patterns 74 and 75 of total 100 patterns) using the Mexican-hat basis function network. The number of nodes=34 and the mean error=6.50E-2.

Fig.9: Wavelet transforms and their estimates (patterns 74 and 75 of total 100 patterns) using the gaussian basis function network. The number of nodes=34 and the mean error=4.58E-1.

Fig.10: Wavelet transforms and their estimates (patterns 74 and 75 of total 100 patterns) using the mixture (Mexican-hat and gaussian) basis function network. The number of nodes=34 and the mean error=2.9E-1. The number of Mexican-hat bases is 25.

In some applications, the number of hidden nodes might even tend to be close to the number of patterns used for training. From the frequency-domain viewpoint this becomes easy to explain. Namely, since the gaussian basis functions are low-pass filters, for a large width parameter (σ) the filtering effect becomes excessive, so that function approximation becomes difficult. Conversely, for a small width parameter the basis functions are dominantly local, so that the required number of nodes is high; in the extreme case this number is equal to the number of patterns used for training. Surely, with an increasing number of nodes the function approximation by the RBF network should improve. However, concerning the generalisation capability of such a network as a neural network, this leads to the case known as the bias-variance dilemma. Therefore some optimality for the number of hidden nodes is always desirable. From the practical applications viewpoint, the minimal number against some prescribed criterion can replace the optimal number. Formulating the optimal choice of the width parameter is normally not easy, since it also depends on the input signal amplitude, and any formulation requires normalisation of the inputs. In this research, the multivariable function approximation by the RBF network is endeavoured using the least number of basis functions according to a mean-error criterion.
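The width trade-off just described can be exposed by a simple sweep; the sketch below varies only the gaussian width on normalised 1-D inputs against a held-out grid, and all grid values and the target signal are illustrative assumptions.

```python
import numpy as np

def fit_eval(a, lam=1e-8):
    """Fit a gaussian RBF interpolant on coarse samples of a target signal
    and report the mean absolute error on a fine held-out grid."""
    Xtr = np.linspace(0.0, 1.0, 25)          # normalised training inputs (assumed)
    Xte = np.linspace(0.0, 1.0, 201)         # held-out evaluation grid
    f = lambda x: np.sin(6 * np.pi * x)      # target with high-frequency content
    G = np.exp(-a * (Xtr[:, None] - Xtr[None, :])**2)
    w = np.linalg.solve(G + lam * np.eye(len(Xtr)), f(Xtr))
    Gte = np.exp(-a * (Xte[:, None] - Xtr[None, :])**2)
    return np.mean(np.abs(Gte @ w - f(Xte)))

for a in [1.0, 10.0, 100.0, 1000.0]:         # illustrative width grid
    print(a, fit_eval(a))                    # too smooth vs. too local shows up here
```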

Referring to the considerations above, an enhanced RBF network for multivariable function approximation can be achieved by a mixture of basis functions. With the frequency-domain interpretation, it is rather straightforward to understand the functionality and effectiveness of such a combination in an RBF structure. Strictly speaking, the filtering interpretation of the basis functions for function approximation requires equally spaced inputs and a number of nodes equal to the number of patterns used for training. However, such a network is not of much interest, at least in the context of neural networks. By the training procedure, the filtering interpretation in function approximation by RBF networks with fewer centres than the number of training patterns available becomes a new and extremely effective dimension for understanding the network behaviour in general. In conclusion, the work


elaborates on the utilisation of mixed radial basis functions in an RBF network and identifies the substantial merits of this approach with reference to the filtering properties of the network. Since RBF networks are generally deemed to be over-filtering, this can be circumvented by the novel approach of mixing radial bases, with a parsimonious number of centres for function approximation as a bonus at the same time. These are accomplished by means of appropriate training of the network, and in this respect the OLS training method plays an essential role, since the selection of the centres is based upon a competition between the different types of radial basis functions initially introduced. The parsimonious number of centres is of particular interest especially when the RBF structure is considered to be a feed-forward neural network rather than a basic mathematical approach for multivariable function approximation.

REFERENCES

Borghese, N.A. and S. Ferrari (1998), Hierarchical RBF Networks and Local Parameters Estimate, Neurocomputing 19, pp.259-283.

Broomhead, D.S. and D. Lowe (1988), Multivariable Function Interpolation and Adaptive Networks, Complex Systems, 2, pp.321-355.

Chen, S., C.F.N. Cowan and P.M. Grant (1991), Orthogonal Least Squares Learning Algorithm for Radial Basis Function Networks, IEEE Trans. on Neural Networks, Vol.2, No.2, March.

Ciftcioglu, Ö. (1999), From Neural to Wavelet Network, Proc. NAFIPS '99, 18th Int. Conf. of the North American Fuzzy Information Processing Society, June 10-12, 1999, New York, pp.894-898.

Poggio, T. and F. Girosi (1990), Networks for Approximation and Learning, Proc. IEEE, 78(9), September, pp.1481-1497.

Powell, M.J.D. (1992), The Theory of Radial Basis Function Approximation in 1990, Advances in Numerical Analysis, Vol.2, pp.105-210.

Tikhonov, A.N. and V.Y. Arsenin (1977), Solutions of Ill-Posed Problems, W.H. Winston, Washington D.C.

Wong, Y. (1991), How Gaussian Radial Basis Functions Work, Proc. IJCNN, International Joint Conference on Neural Networks, July 8-12, 1991, Seattle, WA.

Appendix: The OLS Algorithm

The OLS algorithm starts by selecting the node (RBF centre) that has the greatest contribution to the vector of desired outputs. Therefore, prepare a 2-D input matrix A = {r} by concatenating the m input vectors (r) column-wise, and do the same for the output (desired) vectors. The pseudo OLS code is performed in two steps.

Step 1: for i = 1 to k, compute

$$ g_i = \frac{r_i^T o}{r_i^T r_i}, \qquad \mathrm{err}_i = \frac{g_i^2\,(r_i^T r_i)}{o^T o}. $$

Set h_1 = r_χ, where χ = arg max { err_i, i = 1, ..., k }. Set C_1 = c_χ to be the best centre and drop the column vector r_χ out of { r_i, i = 1, ..., k }.

In Step 2 the rest of the nodes are selected one by one using an orthogonalisation (e.g., Gram-Schmidt) procedure. The measure used for the selection of the column vectors (r) in matrix A is the error reduction ratio, err. Namely, err_i is the measure of the variance at the output provided by a particular centre: the bigger the variance, the bigger the contribution of that centre to the network output.

Step 2: for j = 2 to m, compute for all remaining column vectors r_x:

$$ \alpha_{ij} = \frac{h_i^T r_x}{h_i^T h_i}, \quad i = 1, \ldots, j-1; \qquad \psi_x = r_x - \sum_{l=1}^{j-1} \alpha_{lj}\, h_l; $$

$$ g_x = \frac{\psi_x^T o}{\psi_x^T \psi_x}, \qquad \mathrm{err}_x = \frac{g_x^2\,(\psi_x^T \psi_x)}{o^T o}. $$

Set h_j = ψ_ξ, where ξ = arg max { err_x } over all remaining candidates x = 1, ..., k except the already selected indexes. Set C_j = c_ξ and drop r_ξ out of { r_i }.
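A compact runnable rendering of the two steps above in Python; the variable naming follows the appendix, and stopping after a fixed number m of selected centres is an assumption (a tolerance on the summed error reduction ratio is a common alternative).

```python
import numpy as np

def ols_select(A, o, m):
    """Forward selection of m columns of A by the OLS error reduction ratio.
    A: (n, k) float candidate regressors, o: (n,) desired output.
    Returns the indexes of the selected columns and their err values."""
    n, k = A.shape
    selected, H, errs = [], [], []
    remaining = list(range(k))
    for _ in range(m):
        best_err, best_idx, best_psi = -1.0, None, None
        for x in remaining:
            psi = A[:, x].copy()
            for h in H:                      # Gram-Schmidt against chosen regressors
                psi -= (h @ A[:, x]) / (h @ h) * h
            denom = psi @ psi
            if denom < 1e-12:                # numerically dependent column, skip
                continue
            g = (psi @ o) / denom
            err = g**2 * denom / (o @ o)     # error reduction ratio
            if err > best_err:
                best_err, best_idx, best_psi = err, x, psi
        if best_idx is None:                 # no usable candidate left
            break
        selected.append(best_idx)
        H.append(best_psi)
        errs.append(best_err)
        remaining.remove(best_idx)
    return selected, errs
```

With the first pass H is empty, so h_1 = r_χ exactly as in Step 1; subsequent passes orthogonalise each candidate against the already chosen regressors before scoring, as in Step 2.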
