
Propositions

complementing the dissertation

“Learning to Simulate and Predict Chaotic Dynamical Systems”

by Rembrandt Bakker

1. A model which is trained to predict data measured from a deterministic

chaotic system does not automatically learn the dynamical behavior of that

chaotic system [this thesis, chapter 7].

2. When asked a question to which he does not know the answer, an honest

human says: “I don't know.” This feature can, and should, be built into

black-box models [this thesis, chapter 9].

3. The statement “single hidden layer neural networks cannot approximate

certain types of functions” [Lippmann (1987), IEEE ASSP Magazine

4(2):4-22] is still valid, despite the mathematical proof that such neural networks

are universal approximators [Cybenko (1989), Math. Control Signals Syst.

2(4):303-314].

4. The Takens embedding theorem [F. Takens (1981), Lecture Notes in

Mathematics 898:366-381] should only be used when the accuracy of the

measurements (in bits) is more than the average loss of information (in

bits/s) within the embedding period (in s).

5. Algorithms to estimate a system's correlation dimension wrongly indicate a

low dimension for time-series consisting of coloured noise.

6. A single butterfly flapping its wings cannot trigger a storm, even if Lorenz'

three-dimensional model points to the opposite [E.N. Lorenz (1963), J.

Atmospheric Sci. 20: 130-141].

7. The slogan “to measure is to know” is good for scientists talking to business

executives. For scientists among themselves, a better slogan is “to measure is

to calibrate”.

8. Now that funding agencies measure the performance of research groups in

terms of the number of publications, the number of publications about a

given subject no longer reflects scientific progress on that matter.

9. As long as travel time is the limiting factor in people's choice for transport

solutions, the environmental impact of such solutions should not be

expressed per person and per distance, but per person and per unit time.

10. Just like entropy has no intention to create disorder and evolution has no intention to create survival chances, the economy has no intention to grow.


Stellingen

behorende bij het proefschrift

“Learning to Simulate and Predict Chaotic Dynamical Systems”

van Rembrandt Bakker

1. Een model dat getraind is om een gemeten variabele van een deterministisch

chaotisch systeem te voorspellen, leert niet automatisch ook het dynamisch

gedrag van dat chaotische systeem [dit proefschrift, hoofdstuk 7].

2. Op een vraag waarop hij het antwoord niet weet antwoordt een eerlijk mens:

“Dat weet ik niet.” Die functionaliteit kan, en moet, ook bij black-box

modellen worden ingebouwd [dit proefschrift, hoofdstuk 9].

3. De uitspraak “neurale netwerken met één tussenlaag kunnen bepaalde typen

functies niet benaderen” [Lippmann (1987), IEEE ASSP Magazine

4(2):4-22] geldt nog steeds, ook al is wiskundig bewezen dat zulke modellen elke

willekeurige functie kunnen benaderen [Cybenko (1989), Math. Controls,

Signals Syst. 2(4):303-314].

4. Het Takens embedding theorema [F. Takens (1981), Lecture Notes in

Mathematics 898:366-381] dient alleen gebruikt te worden als de

nauwkeurigheid van de metingen (in bits) groter is dan het gemiddelde

informatieverlies (in bits/s) binnen de embedding periode (in s).

5. Algoritmen voor het schatten van de correlatie dimensie geven een foutieve

lage waarde voor tijdreeksen die bestaan uit gekleurde ruis.

6. Eén flapperende vlinder maakt nog geen storm, ook al wijst Lorenz'

drie-dimensionale model op het tegendeel [E.N. Lorenz (1963), J. Atmospheric

Sci. 20:130-141].

7. De slogan “meten is weten” is goed voor wetenschappers in gesprek met

beleidsmakers. Voor wetenschappers onder elkaar is een betere slogan

“meten is ijken”.

8. Sinds de prestatie van onderzoeksgroepen door hun financiers wordt

gemeten aan de hand van het aantal publicaties, is het aantal publicaties over

een bepaald onderwerp niet langer een maat voor de wetenschappelijke

voortgang m.b.t. dat onderwerp.

9. Zolang reistijd de beperkende factor is bij de keuze voor een vervoermiddel,

moet de milieubelasting van dat vervoermiddel niet per reizigerskilometer

worden uitgedrukt, maar per reizigersminuut.

10. Net zoals entropie er niet op uit is om wanorde te scheppen en evolutie er

niet op uit is om overlevingskansen te scheppen, is de economie er niet op

uit om te groeien.


Learning to Simulate and Predict Chaotic Dynamical Systems


Learning to Simulate and Predict Chaotic Dynamical Systems

PROEFSCHRIFT

ter verkrijging van de graad van doctor aan de Technische Universiteit Delft,

op gezag van de Rector Magnificus prof.dr.ir. J.T. Fokkema, voorzitter van het College voor Promoties,

in het openbaar te verdedigen op dinsdag 4 september 2007 om 10:00 uur

door

Rembrandt BAKKER

scheikundig ingenieur


Dit proefschrift is goedgekeurd door de promotoren:
Prof.ir. C.M. van den Bleek
Prof.dr.ir. J.C. Schouten

Samenstelling promotiecommissie:

Rector Magnificus, voorzitter

Prof.ir. C.M. van den Bleek, Technische Universiteit Delft, promotor
Prof.dr.ir. J.C. Schouten, Technische Universiteit Eindhoven, promotor
Prof.dr.dr.h.c. F. Takens, Rijksuniversiteit Groningen
Prof.dr.ir. M.-O. Coppens, Rensselaer Polytechnic Institute
Prof.dr.ir. A.C.P.M. Backx, Technische Universiteit Eindhoven
Prof.ir. J. Grievink, Technische Universiteit Delft
Dr. C.G.H. Diks, Universiteit van Amsterdam

Bakker, Rembrandt

Learning to Simulate and Predict Chaotic Dynamical Systems / by Rembrandt Bakker

Dissertation at Delft University of Technology. - With references. - With summary in Dutch

ISBN 978-90-8891-011-1, NUR 924

Subject headings: Neural Networks / Chaotic Dynamics / Time-series Prediction / Gas-solid Fluidized Beds / Multi-phase Chemical Reactors

Copyright © 2007 by R. Bakker


Cover: R. Bakker.


Contents

Summary
Samenvatting (Summary in Dutch)

1 Introduction, Scope, and Conclusions
   1.1 Chaotic Dynamics, Control, and Modeling
      1.1.1 Determinism, dissipation, and autonomy
      1.1.2 Fractals and attractors
      1.1.3 Is my system chaotic?
      1.1.4 Chaos Control
      1.1.5 Data-driven modeling of chaotic dynamics
      1.1.6 Limits on deterministic modeling
   1.2 Motivation and previous work
      1.2.1 Fluidized beds
      1.2.2 Previous work at the CRE group
   1.3 Results presented in this thesis
      1.3.1 Main theme: Attractor learning
      1.3.2 Chapter 2: Neural Networks for Function Approximation
      1.3.3 Chapter 3: Neural network model to control an experimental pendulum
      1.3.4 Chapter 4: Prediction and Control of Chaotic Fluidized Bed Hydrodynamics
      1.3.5 Chapter 5: Learning Chaotic Attractors by Neural Networks
      1.3.6 Chapter 6: Selective regression and instant neural network pruning
      1.3.7 Chapter 7: Why capturing chaotic dynamics fails: a case study
      1.3.8 Chapter 8: The Split & Fit model
      1.3.9 Chapter 9: Learning Chaotic Attractors with Nonlinear Principal Component Regression
   1.4 Discussion and future work
      1.4.1 Recommendations for future work

2 Neural Networks for Function Approximation
   2.1 Introduction
      2.1.1 Standard Multi-Layer Perceptron
      2.1.2 MLP’s Approximation Capabilities
      2.1.3 Single-hidden-layer MLP
   2.2 Two-hidden-layer MLP
   2.3 Training the MLP networks
      2.3.1 Radial Basis Function network
      2.3.2 RBF’s Approximation Capabilities
   2.4 Discussion and developments

3 Neural network model to control an experimental pendulum
   3.1 Introduction
   3.2 Neural Network Modeling
   3.3 The Pendulum Model
   3.4 Measurements
   3.5 Network Training and Validation
   3.6 Prediction Surface
   3.7 Search for Unstable Periodic Orbits
   3.8 Applying Chaos Control
   3.9 Concluding Remarks
   3.10 Acknowledgements

4 Prediction and Control of Chaotic Fluidized Bed Hydrodynamics
   4.1 Introduction
   4.2 Neural network for learning chaotic dynamics
      4.2.1 Neural Network Modeling of Chaotic Systems
      4.2.2 Input data reduction with PCA
      4.2.3 Training Objective to Learn Chaotic Dynamics
   4.3 Neural network model of fluidized bed
      4.3.2 Neural network model trained on ECT data
   4.4 Concluding remarks
   4.5 Acknowledgements
   4.6 Appendix

5 Learning Chaotic Attractors by Neural Networks
   5.1 Introduction
   5.2 Data Sets
      5.2.1 Pendulum Data
   5.3 Model Structure
      5.3.1 Choice of Embedding
      5.3.2 Principal Component Embedding
      5.3.3 Updating Principal Components
      5.3.4 Prediction Model
   5.4 Training Algorithm
      5.4.1 Error Propagation
      5.4.2 Optimization and Training
   5.5 Diks Test Monitoring
   5.6 Modeling the Experimental Pendulum
      5.6.1 Pendulum Model I
      5.6.2 Pendulum Model II
   5.7 Laser Data Model
   5.8 1998 Leuven Time-Series Competition Data
   5.9 Summary and Discussion
   5.10 Appendix
   5.11 Acknowledgements

6 Selective regression and instant neural network pruning
   6.1 Introduction
   6.2 Post-training pruning
   6.3 Selective Regression: Introduction
      6.3.1 Application to Feedforward MLPs
   6.4 Application to Recurrent MLPs
   6.5 Selective Regression: Algorithms
      6.5.2 Forward Regression and Backward Elimination
   6.6 Benchmark Examples
      6.6.1 Linear System
      6.6.2 Monk problems
      6.6.3 Mackey-Glass time series
      6.6.4 Laser data neural network
   6.7 Concluding remarks

7 Why capturing chaotic dynamics fails: a case study
   7.1 Logistic Map Prediction Surface
   7.2 Model Prediction
   7.3 Bifurcation plots
   7.4 Discussion

8 The Split & Fit model
   8.1 Introduction
   8.2 Hierarchical partitioning
      8.2.1 Definitions
      8.2.2 Growing the partition tree
      8.2.3 Creating fuzzy boundaries
      8.2.4 Computing the boundary overlap
      8.2.5 Computing memberships
   8.3 Nonlinear modeling
      8.3.1 Parameter estimation
      8.3.2 The overall S&F model
      8.3.3 Connection to hierarchical mixtures of experts
   8.4 Applications
   8.5 Two Spiral Problem
   8.6 Diesel engine emission control
   8.7 Conclusion

9 Learning Chaotic Attractors with NLPCR
   9.1 Introduction
   9.2 Analogy with linear system identification
   9.3 Nonlinear PCR
      9.3.1 Extending the S&F model
      9.3.2 Self-intersecting curves
   9.4 Application Examples
      9.4.1 Logistic Map
      9.4.2 Laser data
      9.4.3 Experimental Bubble Column
   9.5 Conclusions

Bibliography
Dankwoord, Acknowledgements
Publications by the author


Summary

Learning to Simulate and Predict Chaotic Dynamical Systems

About a century ago, it was thought that all systems which produce random-looking time series require models in which the source of randomness is a stochastic process, until Henri Poincaré discovered that very simple sets of ordinary, nonlinear differential equations can produce dynamical behavior which, in many respects, cannot be distinguished from randomness. This type of dynamics is called chaotic.

Chaotic systems can be modeled by deterministic equations. Such models provide much more detailed information than stochastic models. With knowledge of the deterministic rules which govern the chaotic system, it is possible to control chaos: interact with the system and change its dynamics. This research is part of a larger project, in which chaos control is used to improve the bubbling behavior of multi-phase chemical reactors.

Chaos control requires models which capture the complete behavior of the system. If we replace the system by its model, or vice versa, we should not notice a change in dynamical behavior. In chaos terminology, the model and real system must have the same attractor. In this thesis we develop data-driven models for chaotic systems. Data-driven implies that the model learns both its structure and its parameters from measured data.

Neural networks are among the most common types of nonlinear, data-driven mod-els. We explore their application to the learning of chaotic dynamics. The neural network model uses (delays of) the measured variables at time t as its input, and predicts their value at time t + 1. After training the neural network, a long time-series is generated by the model, by feeding the predicted value back as an input to the network. If this time-series has the same chaotic properties as the time-series measured from the real system, then we say that the model has learnt the system’s attractor.
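As a concrete, minimal sketch of this setup (not code from the thesis), the example below builds delay vectors from a scalar time series and fits a one-step-ahead predictor; the synthetic signal, the choice of eight delays, and the use of scikit-learn's MLPRegressor are assumptions made only for illustration.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def delay_matrix(series, n_delays):
        """Rows are delay vectors [x(t - n_delays + 1), ..., x(t)]."""
        return np.column_stack([series[i:len(series) - n_delays + i]
                                for i in range(n_delays)])

    # 'series' stands in for a measured scalar variable; here a synthetic signal.
    series = np.sin(np.linspace(0.0, 60.0, 3000)) ** 3
    D = delay_matrix(series, n_delays=8)
    X, y = D[:-1], D[1:, -1]                     # inputs at time t, target x(t + 1)

    one_step_model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                                  random_state=0).fit(X, y)
    print(one_step_model.score(X, y))            # one-step-ahead fit on training data

Feeding the model's own predictions back as inputs, as described above, then turns such a one-step predictor into a generator of arbitrarily long time-series.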


The next step is to learn the chaotic dynamics of a gas-solids fluidized bed. This system is much more difficult, because it has a large number of state variables, while the pendulum has only three. To arrive at a good predictive model, the neural network approach is improved with two enhancements:

1. We use inputs which are compressed by Principal Component Analysis.

2. The model does not simply minimize the one-step-prediction error, but a so-called ‘error propagation’ scheme is introduced, in which the model learns to synchronize itself with the measured time series.

We succeed in creating a model which can generate time-series with the same chaotic properties as the original data. But we also create many models which have the same prediction accuracy, yet completely wrong attractors. Clearly, a model which has learnt to predict the system well does not necessarily have the correct chaotic attractor.

Five further enhancements are applied to the training strategy:

1. When doing input compression, recent delays are given more weight than older delays.

2. The neural network is connected in parallel to a linear predictive model, so that the neural network can spend its resources on the nonlinear parts of the problem.

3. The error propagation scheme is refined to incorporate the linear model.

4. A new pruning algorithm is developed which removes unused nodes from the network.

5. A statistical test by Diks et al. (1996) is used to monitor, during training of the neural network, whether the model-generated and measured time series are produced by the same dynamical system.


An algorithm is needed which locally detects and eliminates the unused dimensions. Such an algorithm is called Nonlinear Principal Component Regression (NLPCR). We develop a new NLPCR algorithm, based on a fuzzy partitioning of the input space. We call it ‘Split & Fit’ (S&F). In each region, unused dimensions are detected with local Principal Component Analysis (PCA), and the fuzzy boundaries between the regions combine the PCA results into a smooth global lower-dimensional subspace. We show that this algorithm can keep an otherwise unstable model for a chaotic laser on the desired trajectory.
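The following toy sketch is not the S&F algorithm itself; it illustrates only the local-PCA idea, using a hard k-means partition instead of fuzzy, overlapping regions, synthetic data, and an arbitrary 99% variance threshold, just to show how locally unused dimensions can be detected.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    # Toy data: points near a one-dimensional curve embedded in 3-D, plus small
    # noise, standing in for states that locally occupy fewer dimensions.
    rng = np.random.default_rng(0)
    t = rng.uniform(0.0, 4.0 * np.pi, 3000)
    X = np.column_stack([np.cos(t), np.sin(t), 0.3 * t])
    X += 0.01 * rng.normal(size=X.shape)

    # Partition the state space (hard clusters here; S&F uses fuzzy boundaries).
    labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

    for k in range(8):
        pca = PCA().fit(X[labels == k])
        # Number of directions needed to explain 99% of the local variance.
        n_used = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.99)) + 1
        print(f"region {k}: roughly {n_used} of 3 dimensions used locally")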

Meanwhile, Robert Jan de Korte found that even for long datasets and high-dimensional states, deterministic prediction of gas-solids fluidized beds is not feasible. Kaart (2002) turned his attention to an experimental gas-liquid bubble column with a single train of rising bubbles. We model this column with the S&F algorithm, combined with outlier detection. The model reveals that the bubble column's behavior is nearly periodic. In deterministic open-loop mode, the model outputs periodic behavior, but on driving the model with residual noise, its output matches the observed behavior.

The S&F model paves the way for robust learning of chaotic attractors, but we nevertheless recommend that future research starts from a completely different perspective: real-world systems rarely meet the requirements of determinism and low dimensionality. Algorithms are needed which can find structure in ‘noisy’ nonlinear behavior. For that task, a probabilistic representation (kernel smoother or mixture density) of how the measured data are distributed in state space should be the starting point, with predictions (conditional expectations) obtained as a by-product. The Hidden Markov Model, widely used in speech recognition, can be a useful starting point.


Samenvatting

Zelflerende algoritmen voor het voorspellen en simuleren van chaotische systemen

Ongeveer een eeuw geleden werd gedacht dat alle systemen die een op ruis gelijkend gedrag vertonen, alleen beschreven kunnen worden met modellen die eveneens een ruiscomponent bevatten. Totdat Henri Poincaré ontdekte dat een eenvoudig stelsel van gewone, niet-lineaire differentiaalvergelijkingen een type gedrag kan vertonen dat, in veel opzichten, niet van ruis kan worden onderscheiden.

Chaotische systemen kunnen door een deterministisch model worden beschreven. Zo’n model bevat veel gedetailleerdere informatie dan een door ruis aangedreven model. Als het krachtenspel dat het chaotische systeem aandrijft bekend is, biedt dat de mogelijkheid voor chaosregeling: verandering van het chaotisch gedrag door het geven van kleine impulsen. Dit onderzoek maakt deel uit van een groter project, waarin chaosregeling wordt gebruikt om het bellengedrag van meerfase (chemische) reactoren te verbeteren.

Voor chaosregeling zijn modellen nodig die het gedrag van het systeem volledig beschrijven. Als we het systeem vervangen door het model, of andersom, mag het dynamisch gedrag niet merkbaar veranderen. In chaos terminologie: het model en het systeem moeten dezelfde attractor hebben. In dit proefschrift ontwikkelen we zelflerende modellen voor chaotische systemen. Zelflerend wil zeggen dat het model zowel zijn structuur als zijn parameters schat op basis van meetgegevens.

Neurale netwerken behoren tot de meest gebruikte niet-lineaire, zelflerende modellen. We bekijken of ze geschikt zijn voor het leren van chaotische dynamica. Het neurale netwerk krijgt de gemeten variabelen op tijdstip t (of ouder) als input, en voorspelt hun waarde op tijdstip t + 1. Na het trainen van het neurale netwerk laten we het een lange tijdreeks genereren, door de voorspelde waarden recursief te gebruiken als input. Als deze lange tijdreeks dezelfde chaotische eigenschappen heeft als de gemeten tijdreeks, dan kunnen we stellen dat het model de attractor van het systeem heeft geleerd.


Het neurale netwerk blijkt een bijna perfect model te zijn voor dit systeem. De volgende stap is nu om het chaotisch gedrag van een wervelbedreactor te leren. Dit systeem is veel moeilijker, omdat het veel meer variabelen heeft die zijn toestand bepalen. De slinger had er slechts drie. Om een goed voorspellend model te krijgen, verbeteren we het neurale netwerk op twee manieren:

1. We comprimeren de inputs van het model met behulp van Principale Componenten Analyse.

2. We laten het model niet slechts één stap in de tijd vooruit voorspellen, maar introduceren een fout-correctie methode waardoor het model verder vooruit moet kijken om de gemeten data te kunnen volgen.

Het lukt nu om een model te maken dat tijdreeksen kan genereren met dezelfde chaotische eigenschappen als de gemeten reeksen. Maar, het gebeurt ook vaak dat een model op korte termijn wel goed voorspelt, maar een hele verkeerde attractor heeft. Het is duidelijk dat een model dat goede voorspellingen doet, niet automatisch ook het juiste lange-termijn gedrag vertoont.

Met vijf aanpassingen verbeteren we het trainingsalgoritme.

1. Bij het comprimeren van de inputs geven we nieuwere metingen een hogere nauwkeurigheid dan oudere.

2. Parallel aan het neurale netwerk wordt een lineair model gebruikt, zodat het neurale netwerk al zijn capaciteit kan besteden aan de niet-lineaire aspecten van het probleem.

3. De fout-correctie methode wordt aangepast om samen te kunnen werken met het lineaire model.

4. Een uitdunningsalgoritme wordt ontwikkeld, om ongebruikte nodes van het neurale netwerk te elimineren.

5. Een statistische test van Diks et al. (1996) wordt gebruikt om tijdens het trainen van het neurale netwerk voortdurend te zien of de model-gegenereerde en gemeten tijdreeks van hetzelfde dynamische systeem afkomstig (kunnen) zijn.


de metingen alle beschikbare dimensies netjes op, maar op microniveau zijn de metingen geconcentreerd op lager-dimensionale oppervlakken. Het gevolg is dat het globale niet-lineaire model op microniveau teveel vrijheidsgraden heeft, met willekeurig gedrag als gevolg.

Wat we nodig hebben is een algoritme dat de ongebruikte dimensies op microschaal kan opsporen en elimineren. Een dergelijk algoritme wordt Niet-Lineaire Principale Componenten Regressie (NLPCR) genoemd. We ontwikkelen een nieuwe NLPCR methode, die de toestandsruimte indeelt in gebiedjes met vage (fuzzy), overlappende grenzen. We noemen dit ‘Split & Fit’ (S&F). In elk gebiedje worden ongebruikte dimensies opgespoord met locale Principale Componenten Analyse (PCA), en met de overlappende gebiedsgrenzen kan het resultaat hiervan worden samengevoegd tot een vloeiend in elkaar overlopend geheel. We laten een voorbeeld zien waarbij dit algoritme een model met een onstabiele attractor voor een chaotische laser in de gewenste banen leidt.

In de tussentijd is uit het onderzoek van Robert Jan de Korte gebleken dat, zelfs bij gebruik van lange tijdreeksen en hoog dimensionale toestanden, de deterministische voorspelling van wervelbedreactoren niet lukt. Kaart (2002) heeft vervolgens zijn pijlen gericht op een experimentele gas-vloeistof kolom met een enkele straat van stijgende bellen. Deze kolom hebben we gemodelleerd met het S&F algoritme, in combinatie met een methode om uitschieters uit de metingen te filteren. Uit het model blijkt dat de bellenkolom bijna-periodiek gedrag vertoont. Bij deterministische simulaties vertoont het model periodiek gedrag, maar als een klein beetje ruis wordt toegevoegd komt het gedrag overeen met de metingen.

Het S&F model maakt het mogelijk om op robuuste wijze chaotische attractoren te modelleren. Desondanks bevelen we aan om bij vervolgonderzoek voor een heel andere invalshoek te kiezen: de meeste systemen uit de ‘echte’ wereld voldoen namelijk niet aan de eis dat ze deterministisch zijn en laag-dimensionaal. Voor deze systemen zijn algoritmes nodig die structuur kunnen vinden in niet-lineair stochastisch gedrag. Zo’n algoritme heeft als basis een waarschijnlijkheidsverdeling (op basis van een kernschatter of samengestelde verdeling) van de meetgegevens in de toestandsruimte. Vervolgens kunnen via conditionele verwachtingen voorspellingen worden gedaan. Het ‘Hidden Markov Model’, dat bij spraakherkenning veel gebruikt wordt, is hierbij een goed uitgangspunt.


Chapter 1

Introduction, Scope, and Conclusions

In this thesis, we develop models for experimental, chaotic dynamical systems. The models are of the black-box, data-driven type. No physical understanding of the system at hand is required, but instead, a long sequence of measured data is used to build the model. We create these models because they provide us with a powerful tool to manipulate the dynamics of the experimental system. For that purpose, the model must meet very high requirements. It must learn the complete behavior of the system. That is, if we replace the system by its model, or vice versa, we should not notice a change in dynamical behavior.


1.1 Chaotic Dynamics, Control, and Modeling

About a century ago, it was discovered by the French mathematician Henri Poincaré that very simple sets of nonlinear differential equations can produce dynamical behavior which, at first glance, cannot be distinguished from random behavior. A surprising discovery. How can a set of purely deterministic equations produce unpredictable behavior? This question became an important theme in mathematics, and the resulting concepts and theorems are now known as chaos theory. The word chaos reflects the paradox of deterministic rules causing apparent randomness. Application of chaos theory to experimental systems started much later, in the 1980s. The reason for this delay is twofold: (1) most analysis methods make extensive use of digital computers, and (2) it only then became clear that one can perform the analysis on the basis of only a single measured variable, even if the system's state has a higher dimension (Takens, 1981).

The basic mechanism underlying chaotic systems is the interplay between stabilizing and destabilizing forces. Stationary linear systems cannot have destabilizing forces, because these would make the system unstable. But in nonlinear systems, destabilizing forces can act locally. The system's state can be tossed around between different regions, much like a ball is tossed around in a pinball game. If the combination of the forces is such that a never-ending, non-periodic dynamic evolution results, we speak of chaotic dynamics.

1.1.1 Determinism, dissipation, and autonomy

When, in this thesis, we talk about ‘a chaotic system’, we more precisely mean a chaotic system which is deterministic, stationary, and dissipative. Except when chaos control is active, the system is also autonomous.

1. Deterministic implies that there are no random forces involved in the system’s dynamics. It also implies that we can define a state of the system which fully determines the future evolution of the system. This state typically consists of a (small) number of physical variables.


3. Stationary implies that the mechanism driving the system does not change with time.

4. Dissipative means that the system, when disturbed by a small external force, returns to its original behavior when this force is taken away. Any physical system in which friction plays a role is dissipative.

1.1.2 Fractals and attractors

A characteristic of a dissipative chaotic system, is that its state is confined to a bounded region in state-space. A second characteristic is that the behavior of a chaotic system is non-periodic: an autonomous chaotic system never visits the same state twice. These two properties may seem in contradiction. Won’t the system eventually visit all possible states in the bounded region, so that it has to return to a previous state? The answer is no, and the quick explanation is that the number of possible states in a space spanned by continuous variables is simply unlimited. Figure 1.1 shows a computer-generated chaotic time series. The plot reveals some basic characteristics of chaotic time series.

1. Sensitivity to initial conditions / Limited predictability: If we follow the evolution of two nearly identical states, we see that they eventually move away from each other, due to destabilizing forces. The further we want to predict into the future, the more accurately we need to measure the initial state.

2. Loss of information / Entropy: This term is directly related to the previous item. If we have measured the system’s state at time t, and predict the state at time t + 1, then on average, the predicted state has a higher error margin than the measured state. This average ‘loss of precision per unit time’ is called entropy and is commonly expressed in bits/s.

A detailed study of how dissipative chaotic systems ‘fill’ the state-space often reveals very interesting structures, known as fractals. Their most particular property is the possession of infinite detail: if you zoom in to a fractal pattern, you see new detail emerge, no matter how far you zoom in. This is illustrated in Fig. 1.2. A useful characteristic to compare different fractals is the fractal dimension. It expresses to what extent the fractal ‘fills’ the space in which it is defined (Ott, 1993, Chapter 3).


Figure 1.1: State-space plot of a time-series produced by the Lorenz system (see Ott, 1993), consisting of a set of three first-order differential equations, known to produce chaotic dynamics. The three state variables are labeled x1, x2, and x3. Two nearby states are selected, and their short-term evolution is followed.
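A plot such as Figure 1.1 can be reproduced with a few lines of code. The sketch below is an illustration only (it assumes SciPy is available and uses the standard Lorenz parameters sigma = 10, rho = 28, beta = 8/3); it also prints how fast two nearly identical initial states move apart, the sensitivity to initial conditions discussed above.

    import numpy as np
    from scipy.integrate import solve_ivp

    def lorenz(t, s, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
        """Right-hand side of the Lorenz equations for state s = (x1, x2, x3)."""
        x1, x2, x3 = s
        return [sigma * (x2 - x1), x1 * (rho - x3) - x2, x1 * x2 - beta * x3]

    s0 = np.array([1.0, 1.0, 1.0])
    s0_perturbed = s0 + np.array([1e-6, 0.0, 0.0])     # nearly identical initial state

    t_eval = np.linspace(0.0, 25.0, 5000)
    sol_a = solve_ivp(lorenz, (0.0, 25.0), s0, t_eval=t_eval)
    sol_b = solve_ivp(lorenz, (0.0, 25.0), s0_perturbed, t_eval=t_eval)

    # The distance between the two trajectories grows roughly exponentially at
    # first, before saturating at the size of the attractor.
    distance = np.linalg.norm(sol_a.y - sol_b.y, axis=0)
    print(distance[::1000])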


1.1.3 Is my system chaotic?

When deterministic chaos was first discovered as a source of apparent randomness, it challenged scientists of many different disciplines with the question: is my random-looking, experimental data produced by a chaotic system? A toolbox of techniques to distinguish chaotic behavior from randomness was developed. It consists mainly of algorithms which extract chaotic invariants (entropy, fractal dimension) from a measured time-series. The invariants are meaningful when applied to systems which are indeed chaotic, but when applied to (nonlinearly) correlated noise or any mixture of random and deterministic components, their value has no clear mathematical interpretation. Perhaps the best test for chaos is the following three-step approach:

1. Build a deterministic model, using any of the techniques described in this thesis

2. Use this model to synthesize a long time-series.

3. Compare the synthesized data to the original measurements. This comparison can be based on chaotic invariants, but more powerful is the Diks test described in Ch. 5.

If the model passes all validation tests and exhibits chaotic behavior, then that is strong evidence that the experimental system is also chaotic.

1.1.4 Chaos Control


1.1.5 Data-driven modeling of chaotic dynamics

For both the identification and control of experimental chaotic dynamics, it is essential to have an accurate dynamical model available. But, most real-world chaotic systems are far too complex to describe in terms of their underlying physics. Thanks to the availability of cheap computational power, new modeling concepts have emerged which do not require a physical understanding of the system. Instead, these models use large amounts of observed data to extract knowledge from the system. This is called data-driven or self-learning. The best-known models in this category are neural networks, and they provide the starting point for the models developed in this thesis. A comprehensive treatment of neural networks is given in Ch. 2. Basically, a neural network is a very flexible model structure, which contains a large number of adjustable parameters. These are fitted to a large set of measured data, thereby trying to ensure a good balance between the flexibility of the model and the sensitivity of the final model to missing data and measurement noise.

1.1.6 Limits on deterministic modeling

A dynamical system is deterministic if there is no source of randomness affecting it. But, a deterministic system can be so complex that the system’s state cannot be reconstructed from the available measurements. In that case, a deterministic model cannot be derived. The high complexity can have different causes:

1. the number of variables needed to represent the state of the system can be very high (high dimension)

2. the predictability of the system can be very low (high entropy)

If only a single measured variable is available, then these two causes of complexity reinforce one another. The Takens theorem (Takens, 1981) allows us to replace unmeasured variables by delayed values of the single measured variable, but only if certain conditions are met. One of those conditions is that the measured variable must be known at sufficiently high accuracy. This can be problematic. Consider a system whose state consists of 10 delayed values of the measured variable. If the measurement accuracy is 8 bits, and the entropy is 1 bit/delay, then the 9th and 10th delays have lost (on average) all predictive power. In this case the state is incomplete, and one has to fall back to a statistical modeling approach.
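A back-of-the-envelope reading of this argument, using the numbers above, can be written down explicitly; the simple linear 'remaining bits' bookkeeping below is an illustration only, not a formula from the thesis.

    # Information (in bits) that the k-th delayed measurement still carries about
    # the current state: measurement accuracy minus the entropy accumulated over
    # k delay intervals (floored at zero).
    accuracy_bits = 8.0        # accuracy of a single measurement
    entropy_per_delay = 1.0    # average loss of information per delay interval

    for k in range(1, 11):
        remaining = max(accuracy_bits - k * entropy_per_delay, 0.0)
        print(f"delay {k:2d}: about {remaining:.0f} bits of predictive information left")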


1.2 Motivation and previous work

The modeling of chaotic dynamical systems in this thesis is part of a larger body of work, all performed at the Chemical Reactor Engineering (CRE) group at Delft University of Technology. The overall research objective is to analyze and exploit the chaotic dynamical behavior of multi-phase chemical reactors. We briefly describe multi-phase reactors below, and then summarize the work done at CRE in Sec. 1.2.2.

1.2.1 Fluidized beds

Multi-phase reactors are chemical reactors in which fluids, gases and/or solids interact with one another. In this thesis, two types of reactors play a role: gas-solids fluidized beds, and gas-liquid bubble columns. For the bubble column we refer to chapter 9.

Fluidized beds are vessels filled with small particles. The layer of particles is supported by a porous plate near the bottom. Below this plate, a gas is injected. The gas flows through the porous plate, through the layer of particles, to the exit at the top of the column. If the gas flow is small, the particles rest at the bottom. Above a certain threshold, the particles start to move around, and as a whole, the layer of particles acts like a fluid. When the flowrate is further increased, bubbles emerge within the particle layer. The various regimes are illustrated in Fig. 1.3.

Fluidized beds are used in many industrial processes in which gases are converted by a catalytic reaction, whereby the catalyst is attached to the small particles. Fluidization provides excellent mixing of both gases and solids. The bubbling regime often gives the best compromise between throughput and conversion. The bubbles should not become too large, as the gas inside them is not in contact with the particles. If we can use chaos control to stabilize a mode of operation with (1) many small bubbles, and (2) a high gas flow, then this will be very beneficial to the reactor's economic performance.

1.2.2 Previous work at the CRE group


a single (or a few) measured variable. But a deterministic approach might work if particles and voids/bubbles interact in such a way that their aggregate behavior can effectively be summarized by a small number of independent variables. This phenomenon is referred to as self-organization.

Characterizing fluidized bed dynamics

In cooperation with Floris Takens from the RU Groningen, a selection of mathematical algorithms for chaos analysis was implemented, and adapted for dealing with noisy, experimental data. These methods were applied to pressure fluctuations, measured from various fluidized beds under a broad range of conditions by Michel van der Stappen. In his thesis “Chaotic Hydrodynamics of Fluidized Beds” (Van der Stappen, 1996), he develops an empirical correlation between the predictability of the measured time-series, expressed by the Kolmogorov entropy, and the bubble size distribution in the fluidized bed. This correlation can be used in the scale-up of fluidized beds, where the bubbling regime in the large system should be similar to that of the laboratory-size system.

Van der Stappen does not claim that the fluidized bed is low-dimensional chaotic. The values of the most important chaos descriptors turn out to depend on the scale at which one looks at the system. The smaller the scale, the larger the fractal dimension, and the larger the entropy. Interestingly, this typical behavior is also seen for time-series sampled from a random process. Such a process has an infinite dimension and entropy, but the estimated values for the fractal dimension and entropy are severely underestimated at higher length-scales (Van der Stappen, 1996, Sec. 4.3). Daw et al. (1995) suggest that the dynamics can perhaps be seen as a system of nonlinear equations at the level of bubbles and particle aggregates, disturbed by low-amplitude noise, due to particle interactions at the microscopic level.

This exploratory work was followed up by two PhD theses. Zijerveld (1998) expanded the measurements to different types of fluidized beds, while Van der Schaaf (2002) focused on the physical phenomena behind the measured pressure signals. Van der Schaaf rejects the proposition that low-amplitude system noise is responsible for the high entropy and dimension at small length-scales. In his view, the system is much more complex than indicated by the calculated entropy and dimension. He attributes the complexity to the spatially extended nature of fluidized beds. He proposes an alternative characterization of the dynamics, based on the shape of the Fourier power spectrum of the measured pressure time-series.


data with model-generated time-series.

Manipulating fluidized bed dynamics

In parallel with the above chaos analysis work, a project started to apply chaos control to multi-phase reactors. This work is part of that project. A driven and damped pendulum was acquired, as an experimental test-bed to gain experience with chaos control algorithms and hardware. Robert Jan de Korte (2000) started out with the control of the pendulum, and he contributed to Ch. 3 of this thesis. Controlling the chaotic hydrodynamics of fluidized beds turned out to be too difficult: the measured pressure time-series do not contain enough information to form the basis for a deterministic predictive model. He further experimented with periodic gas-pulse injections. Sander Kaart (2002) resorted to another type of multi-phase reactor, a gas-liquid bubble column with a single train of rising bubbles. In this system, passing bubbles can directly be measured by laser beams. With a control method based on synchronization and feedback, he managed to change the bubbling behavior from chaotic to periodic, while maintaining the same average gas throughput. Chapter 9 of this thesis presents a global nonlinear model for that bubble column.

1.3 Results presented in this thesis

The chapters of this thesis are self-contained, each having its own introduction and description of prior art. Here we summarize the chapters, to show how our insights have progressed in time.

1.3.1 Main theme: Attractor learning

Nonlinear black-box models are applied on a large scale to predict the future based on historical data. In financial applications, slightly better predictions make the difference between profit and loss. In this thesis, however, we are not looking for models which make the best possible predictions. For chaos control, it is more important that the model has the same dynamical behavior as the real system. In the context of deterministic chaos, that behavior is fully described by the system's attractor. If the model has correctly learnt the attractor, it can be explored to find interesting periodic orbits, which can be stabilized to alter the system's behavior. On the other hand, if one cannot find a model which can reproduce the observed attractor, then chaos control is unlikely to succeed.


a graph, showing the expected values with error bounds. The longer the prediction time, the larger the margin of error. For a one-year-ahead weather forecast, the best guess is to simply use climate data: the average over the past ten years. But climate data cannot be used to simulate next year’s weather: they only predict the average, not the day-to-day fluctuations. That missing dynamical information is contained in the attractor (presuming the weather system is deterministic chaotic). Unfortunately, no dedicated algorithm exists to perform attractor learning. Instead, models are first trained to make good predictions, and then the model’s attractor is obtained as a ‘by-product’. The attractor is extracted from the model on the basis of a very long, model-generated time series: the model is initialized to a randomly selected initial state, and then it runs in autonomous mode until the desired time series length is obtained.
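A minimal sketch of this free-running, autonomous mode (not taken from the thesis) is given below; the one-step predictor is deliberately a trivial stand-in (the logistic map applied to the most recent value) and the initial delay vector is arbitrary, so the loop itself is the point of the example.

    import numpy as np

    def free_run(predict_one_step, initial_delays, n_steps):
        """Generate a long model time series by iterating a one-step predictor."""
        state = np.asarray(initial_delays, dtype=float).copy()
        out = np.empty(n_steps)
        for k in range(n_steps):
            out[k] = predict_one_step(state)        # predicted value one step ahead
            state = np.append(state[1:], out[k])    # slide the delay window forward
        return out

    # Example with a toy 'model': the logistic map acting on the latest value.
    series = free_run(lambda s: 4.0 * s[-1] * (1.0 - s[-1]), [0.3, 0.4, 0.5], 1000)
    print(series[:5])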

1.3.2 Chapter 2: Neural Networks for Function Approximation

When this research started, data-driven, black-box modeling techniques were rapidly developing, mainly due to the expected high potential of neural networks. Successful reports of predicting chaotic time-series with neural networks arrived as early as 1987 (Lapedes and Farber, 1987), and we started out to apply neural nets to our own experimental data. The message of Ch. 2 is that neural networks should not be seen as magical brain-like structures which solve problems in mysterious ways. The chapter provides a comprehensive overview of how neural networks approximate functions, and discusses aspects which are often overlooked when applying neural networks, and other types of black-box models, to real-world problems.

1.3.3 Chapter 3: Neural network model to control an experimental pendulum


1.3.4 Chapter 4: Prediction and Control of Chaotic Fluidized Bed Hydrodynamics

We then move on to build a fluidized bed model. As discussed in Sec. 1.2.2, the low-dimensionality of this system is in doubt. Initially we use pressure time-series, which are the basis of Van der Stappen's exploratory work. But Robert-Jan de Korte finds that this data is not a suitable basis for predictive models. One explanation could be John van der Schaaf's observation that the pressure signal combines too many simultaneous events, occurring at different locations in the system. Fortunately, we have an alternative measurement setup available, which measures passing bubbles more directly. The technique is known as Electrical Capacitance Tomography, and its application to fluidized beds is the subject of Kühn (1998). The first results include many models which produce either unstable behavior or quickly converge to a fixed point. To improve this, the 'error-propagation' learning algorithm is proposed. It trains the model to avoid accumulation of prediction errors. The result is encouraging: some of the models produce random-like behavior resembling the measured time-series.

1.3.5 Chapter 5: Learning Chaotic Attractors by Neural Networks

For chaos control we need models which are robust and reliable, so that their periodic orbits match those of the real system. So far, the fluidized bed models are not good enough. Among several algorithmic improvements, this chapter implements the Diks test (Diks et al., 1996). This is a very sensitive statistical test, which compares the dynamics of measured and model-generated time series. We revisit the pendulum, and create a model which uses only one of the three state-variables. Another model learns the chaotic behavior of a well-known benchmark time series, the Santa Fe laser data. The Diks test shows that during the training of these models, their long-term dynamics often completely changes from one iteration to another. This is highly undesirable behavior. It implies that minimizing short-term prediction errors is unlikely to result in a model with the correct attractor.

1.3.6 Chapter 6: Selective regression and instant neural network pruning


the prediction accuracy. However, pruning does not seem to improve or deteriorate the model’s attractor.

1.3.7 Chapter 7: Why capturing chaotic dynamics fails: a case study

The three chapters of algorithmic improvements have improved attractor learning, but not enough: a large percentage of the identified models does not have the correct attractor.

This case study takes us to the root cause of the stability problem. We consider a low-dimensional, purely deterministic chaotic system, from which noise-free measurements are taken. We first show that there are infinitely many models which can predict this data with 100% accuracy. But only one of these has the correct chaotic behavior! Based on these results, we need to rethink our strategy for reconstructing chaotic dynamics.

1.3.8 Chapter 8: The Split & Fit model

This chapter introduces the Split & Fit (S&F) algorithm. This new nonlinear modeling algorithm has distinct advantages over neural networks in terms of training time and node overlap. S&F does not in itself provide a new solution to attractor learning yet; that is postponed to chapter 9, where we augment S&F with nonlinear principal component analysis. In this chapter we introduce S&F in its basic form, and apply the algorithm to the prediction of the NOx emission of a diesel engine.

1.3.9 Chapter 9: Learning Chaotic Attractors with Nonlinear Principal Component Regression


1.4 Discussion and future work

The previous section contains many intermediate conclusions from the individual chapters in this thesis. Here we return to our original goals:

1. The development of a data-driven algorithm which can robustly learn the dynamical behavior of an experimental system.

2. The application of such an algorithm to an experimental fluidized bed, with the ultimate purpose of using the resulting model for chaos control.

In the early stage of this research, experiments with attractor learning showed that existing approaches for nonlinear modeling are good at making short-term predictions, but they often fail to learn the correct attractor. We then identified the cause of the frequent failures. They occur because the attractor learning problem is ill-posed: different models may produce identical predictions, while having completely different dynamical behavior. So, one cannot learn a chaotic attractor by simply minimizing prediction errors. The solution is to use a nonlinear variant of principal component regression (PCR). No suitable algorithms exist in the literature, and therefore we develop a new PCR algorithm, based on our Split & Fit modeling approach. By itself, S&F is merely a fast and hierarchical alternative to neural networks. The addition of PCR turns it into a robust attractor learner. A possible future improvement is to automatically select the dimensionality of the reduced state space. At present this parameter is preset before learning starts, but local model validation tests could set this parameter automatically. A limitation to be aware of is that this form of PCR can create spurious attracting regions, such as the ones shown in Fig. 9.1, in the middle of the two circles of the figure eight.

Learning the chaotic attractor of the experimental fluidized bed has turned out not to be possible. We think that the fluidized bed has a much higher complexity than the exploratory research of Van der Stappen (1996) suggests. The early success in Ch. 4, and similar claims by Nakajima et al. (2001), might suggest otherwise, but they have only been validated on the basis of dimensions and entropies. We think these tests are inadequate, because for finite length-scales, the numbers are severely underestimated (Krakovská, 1995). A similar observation was made by the author of this thesis with respect to sea clutter data. This data is the background signal received by a radar when pointed at an empty sea. The signal is produced by the waves. It was first thought that this data is low-dimensional chaotic (Haykin and Puthusserypady, 1999), but more recent research points out that an amplitude-modulated, noise-driven linear model explains the observed phenomena well (Haykin et al., 2002).


whose dynamical behavior is periodic, whereas the observed data appears to have a chaotic attractor. But, when disturbing the model output with a small amount of noise (having the same variance as the model prediction errors), the model’s behavior matches that of the experimental system. Due to the presence of this noise term, chaos control of the bubble column will need perturbations which are large compared to the noise.

1.4.1 Recommendations for future work

It is tempting to recommend further improvements to the S&F+PCR algorithm which concludes this thesis. But such improvement will not bring the modeling of fluidized beds and other complex nonlinear systems any closer. It is not true that apparently random systems, in which there is no external source of randomness, can always be described by deterministic rules. That idea follows from a misinterpretation of the Takens theorem. Delays of a single measured state variable can indeed replace unobserved state variables, but when there is measurement noise, only a limited number of delays can be used for that purpose.

For many real-world dynamical systems, the data requirements cannot be met. Moreover, in many systems there really is a source of randomness. Or, the system would be greatly simplified if part of it could be considered a source of randomness. In speech synthesis, for example, the air passing through the vocal cords to produce sound is commonly considered a random source. The vocal tract can then simply be modeled as a frequency and amplitude modulator. Treating part of this problem as a random source avoids the need to figure out exactly how the compressed air in the lungs starts to vibrate.

In the bubble column in Ch. 9, we did include a noise source while generating the model output. However, that model is still built on the assumption of determinism. Adding noise to such a model is only realistic if the noise has a small amplitude, and is independent of the state. We think that further development of models for complex systems should start from a completely different perspective:

1. Current approach: deterministic model, train to make good predictions, augment with stochastic components if necessary to improve the model's attractor.

2. Proposed approach: stochastic model, train to match the real system’s data distribution in state space, add deterministic components to improve the model’s predictions.


A good starting point for this new approach is the Hidden Markov Model (HMM), widely used in speech recognition and biological sequence analysis (Durbin et al., 2000). The HMM is a collection of stochastic submodels, in which the system jumps from one model to another, based on a transition probability matrix. An HMM can directly learn a smoothed version of the system's attractor, in terms of a multi-dimensional probability density function (pdf). By itself, the HMM provides a very crude way to predict time series: from the currently active submodel, the next active submodel is predicted, and then a sample is drawn from that submodel's pdf.

The time series prediction part of the standard HMM needs improvement. This can be achieved by augmenting the state, so that the HMM does not learn the pdf of the current state x_t only, but rather the joint pdf p(x_t, x_{t+1}). Predictions can then be made using the conditional distribution p(x_{t+1} | x_t). Learning probability
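As a rough illustration of predicting through a learned joint density, the sketch below uses a plain Gaussian mixture rather than a full (augmented) HMM: it fits p(x_t, x_{t+1}) on logistic-map data and returns the conditional expectation E[x_{t+1} | x_t]. The data set, the number of mixture components, and the use of scikit-learn are assumptions made for the example only.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Toy data: pairs (x_t, x_{t+1}) from the logistic map.
    x = np.empty(2000)
    x[0] = 0.3
    for t in range(1999):
        x[t + 1] = 4.0 * x[t] * (1.0 - x[t])
    pairs = np.column_stack([x[:-1], x[1:]])

    # Learn the joint density p(x_t, x_{t+1}) as a Gaussian mixture.
    gmm = GaussianMixture(n_components=16, covariance_type="full",
                          random_state=0).fit(pairs)

    def predict_next(x_now):
        """Conditional expectation E[x_{t+1} | x_t = x_now] under the mixture."""
        means, covs, weights = gmm.means_, gmm.covariances_, gmm.weights_
        var1 = covs[:, 0, 0]
        # Responsibility of each component for the observed x_t.
        lik = weights * np.exp(-0.5 * (x_now - means[:, 0]) ** 2 / var1)
        lik /= np.sqrt(2.0 * np.pi * var1)
        resp = lik / lik.sum()
        # Per-component conditional means of x_{t+1} given x_t.
        cond_means = means[:, 1] + covs[:, 0, 1] / var1 * (x_now - means[:, 0])
        return float(np.sum(resp * cond_means))

    print(predict_next(0.3), 4.0 * 0.3 * 0.7)   # mixture prediction vs exact value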


Chapter 2

Neural Networks for Function Approximation

This chapter provides a comprehensive overview of how neural networks approximate functions, and discusses aspects which are often overlooked when applying neural networks, and other black-box models, to real-world problems.

2.1 Introduction

Neural Networks are highly adjustable nonlinear model structures. They consist of simple nonlinear building blocks or nodes, put together into a network. The flow of information from a node in the network to other nodes is regulated by the strengths of the connections between them. By adjusting the strengths of all the connections in the network, the flow of information through the network can be manipulated such that the neural network has desired responses for particular inputs. This enables neural networks to perform various tasks, such as nonlinear function approximation, clustering, and pattern recognition. Only function approximation is considered here. Importantly, that limits our scope to systems governed by deterministic equations. Excluded are systems which involve random (sub)processes: they need probabilistic models.


learn. Neural networks first became popular in the mid-1980s, a hallmark being the book on parallel distributed processing by Rumelhart and McClelland (1986). Engineers soon started to explore the basic asset of neural networks: finding nonlinear relationships which are hidden in large amounts of experimental data. This gave rise to thousands of publications during the 1990s, in which neural networks were applied to virtually any modeling problem that cannot readily be solved with first-principle models or linear methods. And yet, neural networks are not widely used for real-world, industrial applications. We distinguish the following aspects which are crucial to the adoption of neural networks, or any other black-box modeling tool:

1. A black-box model should match the target function as closely as possible, thereby providing a good trade-off between bias and variance. In popular terms, bias is the mismatch which occurs when the model is not flexible enough to fit the measured data: the model is smoother than the target function. Variance is the mismatch due to excess flexibility of the model: the model surface is more 'bumpy' than the target function. A high-variance model typically has a low prediction error for the dataset which it was trained on, but a high prediction error for new, unseen data (see the sketch after this list). For a more precise, statistical definition of bias and variance, we refer to Geman et al. (1992).

2. A black-box model should not only predict the most likely output for a given set of inputs, but it should also provide an error distribution for that prediction.

3. A black-box model cannot be expected to extrapolate well. Therefore, it should know the range of inputs which it is valid for.

4. The equations inside the black-box model should be transparent, ready for interpretation.

5. Training and use of the model should be fast.
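The bias/variance trade-off of item 1 can be made concrete with a small experiment; the polynomial fits below merely stand in for black-box models of increasing flexibility, and all numbers are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(-1.0, 1.0, 30)
    y = np.sin(3.0 * x) + 0.1 * rng.normal(size=x.size)   # noisy samples of a target
    x_test = np.linspace(-1.0, 1.0, 300)
    y_test = np.sin(3.0 * x_test)                          # noise-free test targets

    for degree in (1, 4, 15):
        coefs = np.polyfit(x, y, degree)                   # least-squares polynomial fit
        train_err = np.mean((np.polyval(coefs, x) - y) ** 2)
        test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
        print(f"degree {degree:2d}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")

A degree-1 fit is too smooth (bias), a very high degree fits the training noise and generalizes poorly (variance), and an intermediate degree gives the best trade-off.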


Figure 2.1: Threshold (a) and continuous (b) transfer function.

2.1.1 Standard Multi-Layer Perceptron

A standard Multi-Layer Perceptron (MLP) consists of nodes ordered in layers. The first and last layer are called input and output layer, and the in-between layers are called hidden layers. Each node in a layer is connected to all nodes in the previous layer, except for the nodes in the input layer - they merely distribute the network inputs to the first hidden layer. The notation MLP(i, h1, h2, j) denotes a network

with i nodes in the input layer, h1 nodes in the first hidden layer and h2 in the

second, and j nodes in the output layer. Neural networks are conveniently depicted as ordered graphs, with nodes drawn as circles and connections as arrows, see for example the MLP(2,3,1,1) in Fig. 2.5b. To compute the output of the network, the network input is propagated from the first to the last layer. The first layer leaves its input unmodified. At the subsequent layers, each node gets an activation a that is a weighted sum of the node inputs:

a = Σ_{n=1}^{N} w_n x_n + b, or in vector notation, a = w · x + b,     (2.1)

where w_n is the weight of the n-th connection to a node, x_n is the n-th input, and b is a bias term. From this activation, an output o is computed by passing a through a transfer function. In most applications, the transfer function is either a threshold (Fig. 2.1a) or the sigmoidal function (Fig. 2.1b)

o(a) = 1 / (1 + e^{-a}).     (2.2)
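In code, the forward pass defined by Eqs. 2.1 and 2.2 amounts to a few lines. The sketch below is an illustration with arbitrary example weights; it assumes sigmoidal hidden nodes and a linear output node (the network output being a plain weighted sum of node outputs, as used later in this chapter).

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))        # Eq. 2.2

    def mlp_forward(x, layers):
        """Forward pass: 'layers' is a list of (W, b) pairs, one per layer after
        the input layer; hidden layers are sigmoidal, the output layer linear."""
        out = np.asarray(x, dtype=float)
        for W, b in layers[:-1]:
            out = sigmoid(W @ out + b)         # activation a = w . x + b per node (Eq. 2.1)
        W, b = layers[-1]
        return W @ out + b                     # output node(s): plain weighted sum

    # An MLP(2, 3, 1): 3 sigmoidal hidden nodes, 1 linear output node.
    rng = np.random.default_rng(0)
    layers = [(rng.normal(size=(3, 2)), rng.normal(size=3)),
              (rng.normal(size=(1, 3)), rng.normal(size=1))]
    print(mlp_forward([0.5, -1.0], layers))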


2.1.2 MLP’s Approximation Capabilities

Around 1989 it was proven by a number of authors (Cybenko, 1989; Hornik, 1989; Funahashi, 1989) that MLPs with only a single hidden layer of sigmoidal nodes have the property of universal approximation: they can approximate any continuous nonlinear mapping arbitrarily closely, provided they have an infinite supply of nodes. Earlier research had indicated, mostly by trial and error, that sometimes two hidden layers are required (Lippmann, 1987; Lapedes and Farber, 1988). After the proof, many researchers used it to justify their choice to use only one-hidden-layer networks, forgetting about the previous findings that a second layer is sometimes necessary. A typical example is the scheme published by Lippmann (1987), shown in Fig. 2.2. It says that two hidden layers are required to create arbitrary decision regions. The scheme was modified and republished by Wells (1993). He incorporated the new universal approximation result by replacing the text 'two hidden layers' by 'single hidden layer with many nodes'. After 1989, several authors addressed the question of the number of required hidden layers and checked the validity of pre-1989 research with respect to the newly discovered universal approximation properties. Gibson (1992) studied the decision region problem of Fig. 2.2 and concluded that functions that are trivial for two-hidden-layer networks can be arbitrarily more complex to approximate for single-hidden-layer networks. Brightwell et al. (1997) elaborated on this work, in an effort to characterize decision regions computable with a single hidden layer. Chester (1990) demonstrated the high cost of a single hidden layer for the example of a 'pinnacle function', a local function having a nonzero output only inside the unit circle. Russell and Faucett (1996) compared one- and two-hidden-layer network accuracy for some arbitrary examples. The next section discusses why the confusion about the number of hidden layers has emerged.

2.1.3 Single-hidden-layer MLP

Looking at a single sigmoidal node, it can be seen from Eqs. 2.1 and 2.2 that its output approaches zero for small activations, say a < −2, and 1 for large activations, say a > 2. The boundaries of these two inequalities are two parallel (n − 1)-dimensional planes in the n-dimensional input space of the node, defined by:

w · x + b = −2, and w · x + b = 2. (2.3)

An example is shown in Fig. 2.3 for a node with two inputs. Since n = 2, the (n − 1)-dimensional planes are lines. Each sigmoidal node smoothly steps from 0 to 1 when crossing the line a = 0 in the direction of increasing a. The smoothness is inversely proportional to the magnitude of the weight vector, ‖w‖. The distance of the line a = 0 to the origin is equal to |b| / ‖w‖. Similarly, the output of a node with n inputs smoothly steps from 0 to 1 when crossing the (n − 1)-dimensional plane


Figure 2.3: Lines of constant activity a (a = −2, a = 0, a = 2) of a single node with two inputs x1 and x2.

a = 0. Because this (n − 1)-dimensional plane stretches out to infinity in each of the n − 1 non-normal directions, MLP nodes are said to act globally, and MLP networks are referred to as global approximators.

Changing the magnitude of the weight vector will affect the node’s output everywhere along the (n − 1)-dimensional plane. This raises the question whether a single-hidden-layer MLP can approximate a function that acts locally, having non-constant output in some limited part of its input space only, while being constant everywhere else.
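To make the geometry of Eq. 2.3 concrete, the short sketch below (Python; our own illustration, not code from the thesis) computes the distance of the line a = 0 to the origin and the width of the transition zone between a = −2 and a = 2. A larger ‖w‖ gives a sharper step, but the node still changes value along an infinitely long line, so it never becomes local.

import numpy as np

def line_geometry(w, b):
    """Geometry of a two-input sigmoidal node (cf. Fig. 2.3).

    Returns the distance of the line a = 0 to the origin (|b| / ||w||) and
    the width of the transition zone between a = -2 and a = +2, which is
    4 / ||w||: inversely proportional to the weight magnitude."""
    w = np.asarray(w, dtype=float)
    norm = np.linalg.norm(w)
    return abs(b) / norm, 4.0 / norm

# Scaling w and b by 10 leaves the step location unchanged but makes it sharper.
print(line_geometry([1.0, 2.0], 1.5))      # distance ~0.67, width ~1.79
print(line_geometry([10.0, 20.0], 15.0))   # distance ~0.67, width ~0.18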

To illustrate this point, a two-dimensional Gaussian function is used as an example of a function that acts locally. In this section, the previous insight on the ‘stepping’ behavior of nodes is used to construct networks that approximate the Gaussian function. Such construction of MLPs is very unusual, as it is only possible for very simple functions. The usual way of training MLPs is discussed in Sec. 2.3. First we take an MLP(2, h1, 1) with threshold nodes, where h1 is the number of nodes in the hidden layer.


[Figure 2.4: approximations of the local Gaussian function, plotted over x1, x2 ∈ [−5, 5]; panels (a)–(d) use 2, 3, 16 and 64 threshold nodes, panels (e)–(h) use 2, 3, 16 and 64 sigmoidal nodes.]

will become zero. More specifically, if the construction of Figs. 2.4c and 2.4d is extended for an increasing number of nodes in the hidden layer, and if the network parameters are such that the lowest output far from the center is zero and the output in the center is one, then the higher output far from the center will be 2/h. This is in agreement with the result of Chester (1990) who derived that there must be at least h neurons to have 1/h as the approximation error, using the maximum norm. For a network with sigmoidal nodes in the hidden layer the same conclusions hold, but the surfaces from Fig. 2.4 become much smoother, see Fig. 2.4e–h. In Fig. 2.4h we have a surface that is very close to a Gaussian function.
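The sketch below (Python; our own reconstruction of one possible strip-pair construction, not necessarily the exact network behind Fig. 2.4) builds an MLP(2, h, 1) with threshold nodes whose separating lines form h/2 strips through the origin, each contributing 2/h to the output. At the centre all strips overlap and the output is 1; far from the centre the output drops to 0 in generic directions but stays at 2/h along each strip, which illustrates the residual error discussed above.

import numpy as np

def strip_mlp(h, half_width=1.0):
    """Single-hidden-layer MLP(2, h, 1) with threshold nodes (h must be even).

    Node pair k implements a strip of height 2/h around the line u_k . x = 0:
    one node fires for u_k . x > -half_width (output weight +2/h), the other
    for u_k . x > +half_width (output weight -2/h)."""
    W, b, v = [], [], []
    for k in range(h // 2):
        angle = np.pi * k / (h // 2)          # strip directions spread over 180 degrees
        u = np.array([np.cos(angle), np.sin(angle)])
        W += [u, u]
        b += [half_width, -half_width]
        v += [2.0 / h, -2.0 / h]
    return np.array(W), np.array(b), np.array(v)

def evaluate(x, W, b, v):
    """Network output: weighted sum of threshold-node outputs (no output bias)."""
    o = (W @ x + b >= 0).astype(float)
    return v @ o

W, b, v = strip_mlp(h=64)
print(evaluate(np.array([0.0, 0.0]), W, b, v))    # centre: 1.0
print(evaluate(np.array([50.0, 30.0]), W, b, v))  # generic far point: 0.0
print(evaluate(np.array([100.0, 0.0]), W, b, v))  # far along a strip: 2/h = 0.03125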

To understand why the single-hidden-layer approach fails to work with only a few nodes, we analyze the MLP(2,3,1) (Fig. 2.5a) with threshold nodes in Fig. 2.5c. The three lines in the figure show the orientation of the three hidden-layer nodes in the input space. The lines separate the input space into seven regions. Each node outputs zero on one side of its separation line and 1 on the other, and this gives each of the seven regions its own unique triplet of node outputs (as indicated). We would have a function with localized output if the network could be made to output 1 in the (1,1,1) region and 0 in the other regions. But this is not possible. The output of the network is a linear combination of the node outputs,

y = w^I_1 o^I_1 + w^I_2 o^I_2 + w^I_3 o^I_3 + b^I,

where y is the network output, superscript I refers to the first hidden layer, w_1-w_3 and b are network weights and bias, respectively, and o_1-o_3 are the outputs of the three hidden-layer nodes. The four parameters w_1-w_3 and b are not enough to independently control the network output in each of the seven regions. Therefore, the network fails.

2.2 Two-hidden-layer MLP

It was recognized by Lapedes and Farber (1988) and Geva and Sitte (1992) how to construct a local function with a two-hidden-layer network. Fig. 2.5d explains how an MLP(2,3,1,1) with threshold nodes approximates a local function without the problem that it outputs two different values outside the central region. The previous paragraph showed that the MLP(2,3,1) separates the two-dimensional input space into seven regions (Fig. 2.5c). For the second hidden layer, the outputs of the first-hidden-layer nodes form a three-dimensional input space. Each of the seven discrete output triplets is represented by a point in this space. The single node in the second hidden layer separates this space into two regions, and the separation in Fig. 2.5d is done such that the (1,1,1) point falls on one side of the separation plane, and the other points fall on the other side.

The output of the network is a linear combination of the output of the second-hidden-layer node, y = w^II_1 o^II_1 + b^II, where superscript II refers to the second hidden layer. With the two parameters w^II_1 and b^II, we can independently specify the desired output on each side of the separation plane.
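As a concrete illustration of this construction (Python; the specific lines and thresholds are our own choice, not taken from Fig. 2.5), the sketch below builds an MLP(2,3,1,1) with threshold nodes: three first-layer nodes whose separating lines enclose a triangle around the origin, a second-layer node that fires only when all three first-layer nodes fire (the (1,1,1) region), and a linear output layer.

import numpy as np

def threshold(a):
    return (a >= 0).astype(float)

# First hidden layer: three nodes whose lines w . x + b = 0 enclose a triangle
# around the origin; each node outputs 1 on the side containing the origin.
angles = np.array([np.pi / 2, np.pi / 2 + 2 * np.pi / 3, np.pi / 2 + 4 * np.pi / 3])
W1 = -np.stack([np.cos(angles), np.sin(angles)], axis=1)   # normals pointing towards the origin
b1 = np.ones(3)                                             # lines at distance 1 from the origin

# Second hidden layer: a single node that fires only in the (1,1,1) region,
# i.e. when o1 + o2 + o3 >= 2.5.
W2 = np.ones((1, 3))
b2 = np.array([-2.5])

# Output layer: plain linear combination y = w * o + b.
w3, b3 = 1.0, 0.0

def mlp_2311(x):
    o1 = threshold(W1 @ x + b1)
    o2 = threshold(W2 @ o1 + b2)
    return w3 * o2[0] + b3

print(mlp_2311(np.array([0.0, 0.0])))    # inside the triangle: 1.0
print(mlp_2311(np.array([3.0, 0.0])))    # outside, near an edge: 0.0
print(mlp_2311(np.array([100.0, 7.0])))  # far away: 0.0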


Figure 2.6: Comparison of (a) a one-hidden-layer MLP(2,16,1) and (b) a two-hidden-layer MLP(2,3,1,1) approximation, obtained by training the networks with backpropagation on a set of 250 samples randomly taken from the function shown in Fig. 2.4h. The data are taken from the interval x1 ∈ [−5, 5] and x2 ∈ [−5, 5], while the plots show the model evaluation over a larger interval x1 ∈ [−7, 7] and x2 ∈ [−7, 7]. Clearly, the one-hidden-layer network has a much ‘wilder’ extrapolation behavior than the two-hidden-layer network.

In general, the nodes of the first hidden layer separate the input space into regions, and each node of the second hidden layer combines an arbitrary number of these regions into a larger one. Gray and Michel (1992) present a constructive algorithm for binary feedforward networks based on this notion.

2.3 Training the MLP networks

It is only for simple and well-defined functions that it is possible to construct MLP approximations as in the previous sections. The parameters of a neural network are normally set by means of the following training procedure:

1. Collect a large set of input-output samples from the function. If possible, cover all inputs that will be of interest when using the trained neural network in its final application.

2. Pick a suitable number of layers and nodes for the network. This step may require trial and error.

3. Train the network: adjust the weights and biases so as to minimize a cost function measured over the collected samples.


The cost function is usually the squared difference between the network outputs and the sampled outputs. Derivatives of the cost function with respect to the adjustable parameters are easily obtained analytically using the chain rule. For MLPs, the procedure of obtaining these derivatives and using them to minimize the cost function is known as backpropagation (Werbos, 1974). In this section it is investigated whether the popular backpropagation algorithm is powerful enough to find a good set of weights for the very concise MLP(2,3,1,1), and the resulting approximation is compared to that of a single-hidden-layer network.
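A minimal sketch of such gradient-based training (Python with NumPy; the target function, learning rate, initialization and training loop are our own illustrative choices, not the settings used in this chapter) for a one-hidden-layer network with a linear output:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def target(x):
    """The local function to be learned: a two-dimensional Gaussian bump."""
    return np.exp(-0.5 * np.sum(x**2, axis=1))

# Training data: 250 points drawn uniformly from [-5, 5] x [-5, 5].
X = rng.uniform(-5, 5, size=(250, 2))
y = target(X)

# MLP(2,16,1): one hidden layer of 16 sigmoidal nodes, linear output node.
W1, b1 = 0.5 * rng.standard_normal((16, 2)), np.zeros(16)
w2, b2 = 0.5 * rng.standard_normal(16), 0.0
lr = 0.05

for epoch in range(5000):
    # Forward pass (Eqs. 2.1 and 2.2).
    A = X @ W1.T + b1          # activations of the hidden nodes
    O = sigmoid(A)             # hidden-node outputs
    y_hat = O @ w2 + b2        # linear output layer
    err = y_hat - y

    # Backward pass: chain-rule gradients of the squared-error cost.
    grad_w2 = O.T @ err / len(X)
    grad_b2 = err.mean()
    delta1 = np.outer(err, w2) * O * (1.0 - O)
    grad_W1 = delta1.T @ X / len(X)
    grad_b1 = delta1.mean(axis=0)

    # Gradient-descent update.
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    w2 -= lr * grad_w2; b2 -= lr * grad_b2

y_final = sigmoid(X @ W1.T + b1) @ w2 + b2
print(f"train MSE after training: {np.mean((y_final - y) ** 2):.2e}")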

Two sets of 250 points each, taken uniformly from the interval x1 ∈ [−5, 5] and x2 ∈ [−5, 5], were used to train and test two MLP networks to approximate the MLP(2,64,1) from Fig. 2.4h. The first network, an MLP(2,16,1), reaches a mean squared error (MSE) of 1.1e-4 on the train set, and 2.7e-4 on the test set. The second, an MLP(2,3,1,1), reaches an MSE of 4.0e-4 on the train set, and 5.6e-4 on the test set. These errors are smaller than the squared ‘reference error’ of 2/h from Sec. 2.1.3, which amounts to 9.8e-4. The difference between the errors on the train and test datasets is many times bigger for the single-hidden-layer than for the two-hidden-layer network. In Fig. 2.6a,b we study the extrapolation behavior of the two networks by plotting the approximations in an extended input domain. Clearly, the single-hidden-layer network behaves much more wildly outside the original domain than the concise two-hidden-layer variant. The exercise shows that also when the networks are trained by backpropagation, the concise two-hidden-layer MLP generalizes much better.

2.3.1 Radial Basis Function network

Radial Basis Functions (RBFs) are functions whose output solely depends on the distance to a central point. The distance is called the radius and the central point is called the centre. When the radius goes to infinity, the node output goes to zero, and for this reason the nodes are said to act locally, and the RBF network is called a local approximator. RBF networks have a single hidden layer of nodes, so that their output is simply a weighted sum of the RBFs. A common choice for the RBF is the unnormalized Gaussian, o(r) = exp(−r² / (2σ²)), where the radius r is the Euclidean distance to the node’s centre, and σ is the width or spread of the node. Training involves optimizing the centres, spreads and weights of the RBF network. One of the most successful training methods is based on selective regression, which is a combination of subset selection and linear regression. The main steps are:

1. Choose the number of nodes equal to the number of training data points, such that the node centres coincide with the training data.


Figure 2.7: Gaussian RBFs (σ = 1/8) arranged in a 5-by-5 regularly spaced grid in two input dimensions. Approximation of a nonlinear function is obtained by taking a weighted sum of the node outputs.

2. Choose a value for the spread σ of the nodes (and add a constant node to the set, to act as a bias).

3. With the centres and spreads fixed, finding optimum weights is a linear least squares problem. But if all the nodes from the initialization in the first step are used, there are far too many parameters in the model. Selective regression is a procedure which leaves out those parameters that do not contribute much to the solution. It tries to find the golden mean in a second bias/variance tradeoff: selecting more nodes will lower the approximation error (lower bias) but causes wilder interpolation behavior (higher variance). For details, see chapter 6 in this thesis, or the textbook by Miller (1990). The paper by Chen et al. (1989) contains an efficient implementation of the orthogonal least squares algorithm. A minimal sketch of this selection procedure is given after this list.
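The sketch below (Python; a simplified greedy forward-selection variant of our own, not the orthogonal least squares implementation of Chen et al., 1989) follows the three steps above: candidate node centres are placed on the training data, the spread is fixed, and nodes are then added one by one, each time keeping the candidate that most reduces the least-squares training error.

import numpy as np

def rbf_design_matrix(X, centres, sigma):
    """Gaussian RBF outputs o(r) = exp(-r^2 / (2 sigma^2)) for all data/centre
    pairs, plus a constant column acting as a bias."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * sigma**2))
    return np.hstack([Phi, np.ones((len(X), 1))])

def select_nodes(X, y, sigma, max_nodes):
    """Greedy subset selection: repeatedly add the candidate node that gives the
    largest drop in least-squares training error (a simplified stand-in for
    selective regression / orthogonal least squares)."""
    Phi = rbf_design_matrix(X, X, sigma)          # step 1: one candidate node per sample
    selected = [Phi.shape[1] - 1]                 # always keep the bias column
    for _ in range(max_nodes):
        best = None
        for j in range(Phi.shape[1] - 1):
            if j in selected:
                continue
            cols = selected + [j]
            w, *_ = np.linalg.lstsq(Phi[:, cols], y, rcond=None)
            err = np.mean((Phi[:, cols] @ w - y) ** 2)
            if best is None or err < best[0]:
                best = (err, j)
        selected.append(best[1])
    w, *_ = np.linalg.lstsq(Phi[:, selected], y, rcond=None)
    return selected, w

# Toy usage: fit the two-dimensional Gaussian bump from Sec. 2.1.3.
rng = np.random.default_rng(1)
X = rng.uniform(-5, 5, size=(250, 2))
y = np.exp(-0.5 * (X**2).sum(axis=1))
selected, w = select_nodes(X, y, sigma=1.0, max_nodes=10)
Phi = rbf_design_matrix(X, X, sigma=1.0)
print(len(selected) - 1, "nodes selected; train MSE:",
      np.mean((Phi[:, selected] @ w - y) ** 2))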

2.3.2 RBF’s Approximation Capabilities

RBF networks are also universal approximators (Park and Sandberg, 1991). This can be understood by looking at Fig. 2.7, where the nodes of an RBF network are arranged in a rectangular grid. The spacing of the grid can be reduced by adding nodes to the network. With the reduced spacing, the weighted sum of the node outputs can be tuned to create a closer match to the desired nonlinear mapping, and in the limit of an infinite number of nodes, the approximation can be arbitrarily precise.
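To illustrate this, the short sketch below (Python; the grid sizes and the target function are our own choices) places Gaussian RBFs on regular grids of increasing density, fits the weights by linear least squares, and shows the approximation error shrinking as the grid spacing is reduced.

import numpy as np

def grid_rbf_fit(n_per_axis, X, y):
    """Place Gaussian RBFs on an n-by-n grid over [-1, 1]^2 (as in Fig. 2.7),
    with the spread tied to the grid spacing, and fit the weights by least squares."""
    g = np.linspace(-1, 1, n_per_axis)
    centres = np.array([(a, b) for a in g for b in g])
    sigma = g[1] - g[0]                           # spread proportional to grid spacing
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * sigma**2))
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return np.mean((Phi @ w - y) ** 2)

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(400, 2))
y = np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])     # some smooth nonlinear mapping
for n in (3, 5, 9, 17):
    print(f"{n}x{n} grid: train MSE = {grid_rbf_fit(n, X, y):.2e}")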
