
(1)

Sieci neuronowe – bezmodelowa analiza danych?
(Neural networks – model-free data analysis?)

K. M. Graczyk

IFT, Uniwersytet Wrocławski, Poland

(2)

Why Neural Networks?

• Inspired by C. Giunti (Torino)

– PDFs by Neural Networks

• Papers by Forte et al. (JHEP 0205:062, 2002; JHEP 0503:080, 2005; JHEP 0703:039, 2007; Nucl. Phys. B809:1-63, 2009)

• A largely model-independent way of fitting data and computing the associated uncertainties

• Learn, Implement, Publish (LIP rule)

– Cooperation with R. Sulej (IPJ, Warszawa) and P. Płoński (Politechnika Warszawska)

• NetMaker

– GrANNet ;) my own C++ library

(3)

Road map

• Artificial Neural Networks (NN) – idea

• Feed Forward NN

• PDFs by NN

• Bayesian statistics

• Bayesian approach to NN

• GrANNet

(4)

Inspired by Nature

The human brain consists of around 10^11 neurons, which are highly interconnected through around 10^15 connections.

(5)

Applications

• Function approximation, or regression analysis, including time-series prediction, fitness approximation, and modeling.

• Classification, including pattern and sequence recognition, novelty detection, and sequential decision making.

• Data processing, including filtering, clustering, blind source separation, and compression.

• Robotics, including directing manipulators and computer numerical control.

(6)

Artificial Neural Network

Input layer

Hidden layer

Output, target

Feed Forward

The simplest example, with linear activation functions, reduces to a matrix map:

$$
\begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix}
=
\begin{pmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{pmatrix}
\begin{pmatrix} i_1 \\ i_2 \end{pmatrix}
$$
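A minimal C++ sketch of this matrix map; the weight and input values are made-up examples, not taken from the talk.

```cpp
#include <cstdio>

// Minimal sketch: a feed-forward layer with linear activation functions is just
// the matrix-vector product t = W * i (here 3 outputs, 2 inputs).
int main() {
    const double w[3][2] = {{0.5, -1.0},
                            {2.0,  0.3},
                            {-0.7, 1.2}};   // example weights (made up)
    const double in[2] = {1.0, 2.0};        // example input (made up)

    double t[3] = {0.0, 0.0, 0.0};
    for (int k = 0; k < 3; ++k)
        for (int j = 0; j < 2; ++j)
            t[k] += w[k][j] * in[j];

    std::printf("t = (%.2f, %.2f, %.2f)\n", t[0], t[1], t[2]);
    return 0;
}
```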

(7)

[Diagram: the i-th perceptron. Inputs 1…k are multiplied by the weights, summed, and passed through an activation function with a threshold to produce the output.]

(8)

activation functions

• Heaviside function Θ(x) → 0 or 1 signal

• sigmoid function: $g(x) = \dfrac{1}{1 + e^{-x}}$

• tanh(x)

• linear

[Plot: tanh(x) and the sigmoid on the range (-4, 4); above the threshold the signal is amplified, below it the signal is weaker.]
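For concreteness, the activation functions listed above written out as plain C++ one-liners; the function names are illustrative and not taken from NetMaker or GrANNet.

```cpp
#include <cmath>

// Activation functions from the slide above.
double heaviside(double x) { return x >= 0.0 ? 1.0 : 0.0; }   // 0 or 1 signal
double sigmoid(double x)   { return 1.0 / (1.0 + std::exp(-x)); }
double tanh_act(double x)  { return std::tanh(x); }
double linear(double x)    { return x; }
```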

(9)

architecture

• 3-layer network, two hidden layers: 1:2:1:1

• 2+2+1 weights + 1+2+1 biases: #par = 9 (a parameter-count sketch follows below)

• input x → output F(x)

• symmetric sigmoid activation in the hidden layers, linear activation at the output

• Bias neurons (fed a constant signal of one) instead of thresholds
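As a cross-check of the parameter count above, a small sketch that counts weights and bias connections for a fully connected feed-forward architecture; count_parameters is a hypothetical helper, not part of GrANNet.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Count parameters of a fully connected feed-forward net given its layer sizes.
// Weights: sum over layers of n_l * n_{l+1}; biases: one per non-input neuron.
int count_parameters(const std::vector<int>& layers) {
    int weights = 0, biases = 0;
    for (std::size_t l = 0; l + 1 < layers.size(); ++l) {
        weights += layers[l] * layers[l + 1];
        biases  += layers[l + 1];
    }
    return weights + biases;
}

int main() {
    std::printf("#par(1:2:1:1) = %d\n", count_parameters({1, 2, 1, 1}));  // prints 9
    return 0;
}
```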

(10)

Neural Networks – Function Approximation

• The universal approximation theorem for neural networks states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layer perceptron with just one hidden layer. This result holds only for restricted classes of activation functions, e.g. for the sigmoidal functions.

(Wikipedia.org)

(11)

$$F_2 = F_2(x, Q^2;\, \{w_{ij}\})$$

The network realizes a map from one vector space (the inputs $x$, $Q^2$) to another (the structure function $F_2$).

(12)

Supervised Learning

• Propose the error function

– in principle any continuous function which has a global minimum

• Motivated by statistics: the standard error function, chi2, etc. (a minimal chi2 sketch follows below)

• Consider a set of data

• Train the given NN by showing it the data → minimize the error function

– back-propagation algorithms

• An iterative procedure which fixes the weights
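A minimal sketch of a chi2-type error function of the kind mentioned above, assuming independent Gaussian errors sigma_i on the targets; the data layout and function name are illustrative.

```cpp
#include <cstddef>
#include <vector>

// chi2-type error: E_D = (1/2) * sum_i ( (y_i - t_i) / sigma_i )^2
// y: network responses, t: targets, sigma: experimental uncertainties.
double chi2_error(const std::vector<double>& y,
                  const std::vector<double>& t,
                  const std::vector<double>& sigma) {
    double e = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        const double r = (y[i] - t[i]) / sigma[i];
        e += 0.5 * r * r;
    }
    return e;
}
```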

(13)

Learning Algorithms

• Gradient algorithms

– Gradient descent

– RPROP (Riedmiller & Braun)

– Conjugate gradients

• Algorithms that look at the curvature:

– QuickProp (Fahlman)

– Levenberg–Marquardt (Hessian)

– Newton's method (Hessian)

• Monte Carlo algorithms (based on Markov chain methods)

A minimal RPROP-style update is sketched below.
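A minimal sketch of an RPROP-style update (only the sign of the gradient is used, with per-weight step sizes); the step-size factors 1.2 and 0.5 are the commonly quoted defaults and are an assumption here, not values from the talk.

```cpp
#include <cstddef>
#include <vector>

// One RPROP-style update: the step size delta[i] grows when the gradient
// keeps its sign and shrinks when the sign flips; only sign(grad) is used.
void rprop_step(std::vector<double>& w,
                std::vector<double>& delta,
                std::vector<double>& prev_grad,
                const std::vector<double>& grad,
                double eta_plus = 1.2, double eta_minus = 0.5) {
    for (std::size_t i = 0; i < w.size(); ++i) {
        const double s = prev_grad[i] * grad[i];
        if (s > 0.0)      delta[i] *= eta_plus;   // same sign: accelerate
        else if (s < 0.0) delta[i] *= eta_minus;  // sign change: back off
        if (grad[i] > 0.0)      w[i] -= delta[i];
        else if (grad[i] < 0.0) w[i] += delta[i];
        prev_grad[i] = grad[i];
    }
}
```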

(14)

Overfitting

• More complex models describe the data better, but lose generality

– the bias–variance trade-off

• Overfitting → large values of the weights

• Compare with a test set (which must be about twice as large as the original set)

• Regularization → an additional penalty term in the error function (see the sketch below)

$$E = E_D + \alpha E_W, \qquad E_W = \frac{1}{2}\sum_{i=1}^{W} w_i^2$$

In the absence of data,
$$\frac{dw}{dt} \propto -\frac{\partial E}{\partial w} = -\alpha w \;\Rightarrow\; w(t) = w(0)\,\exp(-\alpha t),$$
so α acts as a decay rate for the weights.
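A small sketch of the regularized error and its gradient contribution corresponding to the formulas above; the helper names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Weight-decay penalty E_W = (1/2) * sum_i w_i^2 added to the data error:
// E = E_D + alpha * E_W,  dE/dw_i = dE_D/dw_i + alpha * w_i.
double regularized_error(double e_data, const std::vector<double>& w, double alpha) {
    double e_w = 0.0;
    for (double wi : w) e_w += 0.5 * wi * wi;
    return e_data + alpha * e_w;
}

void add_decay_to_gradient(std::vector<double>& grad,
                           const std::vector<double>& w, double alpha) {
    for (std::size_t i = 0; i < w.size(); ++i) grad[i] += alpha * w[i];
}
```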

(15)

What about physics?

Nature → Observation / Measurements → Data → Statistics

Idea → Model (nonparametric; QED: free parameters; nonperturbative QCD)

Data still more precise than theory

• PDFs – physics given directly by the data

Problems: some general constraints; model-independent analysis; statistical model → data → uncertainty of the predictions

(16)

Fitting data with Artificial Neural Networks

'The goal of the network training is not to learn an exact representation of the training data itself, but rather to build a statistical model of the process which generates the data.'

C. Bishop, ‘Neural Networks for Pattern Recognition’

(17)

Parton Distribution Function with NN

Some method but…

[Diagram: a neural network mapping $(x, Q^2)$ to $F_2$.]

(18)

Parton Distribution Functions – S. Forte, L. Garrido, J. I. Latorre and A. Piccione, JHEP 0205 (2002) 062

• A kind of model-independent analysis of the data

• Construction of the probability density P[G(Q^2)] in the space of the structure functions

– In practice only one neural-network architecture

• Probability density in the space of parameters of one particular NN

But in reality Forte et al. did:

(19)

Generating Monte Carlo pseudo-data, then training Nrep neural networks, one for each set of Ndat pseudo-data (a minimal sketch of the pseudo-data step follows below).

The Nrep trained neural networks provide a representation of the probability measure in the space of the structure functions.

The idea comes from W. T. Giele and S. Keller
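A minimal sketch of the pseudo-data step described above: each replica shifts every measured point by a Gaussian fluctuation of its quoted uncertainty (correlations between points are ignored here for brevity); one network would then be trained per replica.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Generate N_rep pseudo-data replicas: for each replica r and data point i,
// F_rep[r][i] = F[i] + gauss(0, sigma[i]).  Correlations are ignored in this sketch.
std::vector<std::vector<double>> make_replicas(const std::vector<double>& f,
                                               const std::vector<double>& sigma,
                                               int n_rep, unsigned seed = 12345) {
    std::mt19937 gen(seed);
    std::vector<std::vector<double>> reps(n_rep, std::vector<double>(f.size()));
    for (int r = 0; r < n_rep; ++r)
        for (std::size_t i = 0; i < f.size(); ++i) {
            std::normal_distribution<double> g(f[i], sigma[i]);
            reps[r][i] = g(gen);
        }
    return reps;   // train one neural network per replica
}
```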

(20)

correlation uncertainty

(21)

10, 100 and 1000 replicas

(22)

[Plots: fits to 30 data points with training that is short enough, long, and too long; the last case shows overfitting.]

(23)
(24)

My criticism

• Does the simultaneous use of artificial data and the chi2 error function overestimate the uncertainty?

• Other NN architectures are not discussed

• Problems with overfitting (a test set is needed)

• A relatively simple approach compared with present techniques in NN computing

• The uncertainty of the model predictions must be generated by the probability distribution obtained for the model rather than by the data itself

(25)

GraNNet – Why?

• I stole some ideas from FANN

• C++ library, easy to use

• User-defined error function (any you wish)

• Easy access to units and their weights

• Several ways of initializing a network of a given architecture

• Bayesian learning

• Main objects:

– Classes: NeuralNetwork, Unit

– Learning algorithms: so far QuickProp, Rprop+, Rprop-, iRprop-, iRprop+, …

– Network response uncertainty (based on the Hessian)

– Some simple restarting and stopping solutions

(26)

Structure of GraNNet

• Libraries:

– Unit class

– Neural_Network class

– Activation (activation and error function structures)

– Learning algorithms

– RProp+, RProp-, iRProp+, iRProp-, QuickProp, BackProp

– generatormt

– TNT inverse matrix package

(27)

Bayesian Approach

‘common sense reduced to calculations’

(28)

Bayesian Framework for BackProp NN, MacKay, Bishop,…

• Objective criteria for comparing alternative network solutions, in particular with different architectures

• Objective criteria for setting the decay rate α

• Objective choice of the regularizing function E_W

• Comparing with test data is not required.

(29)

Notation and Conventions

$$D:\; (x_1, t_1),\,(x_2, t_2),\,\ldots,\,(x_N, t_N)$$

$(x_i, t_i)$ – data point: vector input $x_i$, vector target $t_i$; $y(x)$ – network response.

$N$ – number of data points, $W$ – number of weights.

(30)

Model Classification

• A collection of models H_1, H_2, …, H_k

• We believe that the models are classified by prior probabilities P(H_1), P(H_2), …, P(H_k) (summing to 1)

• After observing the data D → Bayes' rule:

$$P(H_i \mid D) = \frac{P(D \mid H_i)\, P(H_i)}{P(D)}$$

P(D): normalizing constant; P(D | H_i): probability of D given H_i.

• Usually at the beginning P(H_1) = P(H_2) = … = P(H_k)

(31)

Single Model Statistics

• Assume that model H_i is the correct one

• The neural network A with weights w is considered

• Task 1: assuming some prior probability for w, construct the posterior after including the data

• Task 2: consider the space of hypotheses and construct the evidence for them

$$P(w \mid D, A_i) = \frac{P(D \mid w, A_i)\, P(w \mid A_i)}{P(D \mid A_i)}, \qquad
\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$$

$$P(D \mid A_i) = \int dw\; P(D \mid w, A_i)\, P(w \mid A_i)$$

(32)

Hierarchy

$$P(w \mid D, \alpha, \beta, A) = \frac{P(D \mid w, \beta, A)\, P(w \mid \alpha, A)}{P(D \mid \alpha, \beta, A)}$$

$$P(\alpha, \beta \mid D, A) = \frac{P(D \mid \alpha, \beta, A)\, P(\alpha, \beta \mid A)}{P(D \mid A)}$$

$$P(A \mid D) = \frac{P(D \mid A)\, P(A)}{P(D)}$$

(33)

Constructing prior and posterior functions

Likelihood:
$$P(D \mid w, \beta, A) = \frac{\exp(-\beta E_D)}{Z_D(\beta)}, \qquad
E_D = \frac{1}{2}\sum_{i=1}^{N}\big(y(x_i, w) - t_i\big)^2, \qquad
Z_D(\beta) = \left(\frac{2\pi}{\beta}\right)^{N/2}$$

Prior:
$$P(w \mid \alpha, A) = \frac{\exp(-\alpha E_W)}{Z_W(\alpha)}, \qquad
E_W = \frac{1}{2}\sum_{i=1}^{W} w_i^2, \qquad
Z_W(\alpha) = \left(\frac{2\pi}{\alpha}\right)^{W/2}$$

Posterior probability (at this level α and β are assumed constant):
$$P(w \mid D, \alpha, \beta, A) = \frac{P(D \mid w, \beta, A)\, P(w \mid \alpha, A)}{P(D \mid \alpha, \beta, A)}
= \frac{\exp(-S(w))}{Z_M(\alpha, \beta)}, \qquad
S(w) = \beta E_D + \alpha E_W, \qquad
Z_M = \int dw\, \exp(-S(w))$$

[Plot: the weight distribution P(w), with the most probable weight w_MP and w_0 marked.]

(34)

Computing Posterior

Expand $S(w)$ around the most probable weights $w_{MP}$ (Laplace approximation):
$$S(w) \simeq S(w_{MP}) + \frac{1}{2}\,\Delta w^{T} A\, \Delta w, \qquad \Delta w = w - w_{MP},$$
where $A$ is the Hessian,
$$A_{kl} = \frac{\partial^2 S}{\partial w_k\,\partial w_l}\bigg|_{w_{MP}}
= \beta \sum_{i=1}^{N}\left[\frac{\partial y(x_i)}{\partial w_k}\frac{\partial y(x_i)}{\partial w_l}
+ \big(y(x_i) - t_i\big)\frac{\partial^2 y(x_i)}{\partial w_k\,\partial w_l}\right] + \alpha\,\delta_{kl},$$
so that
$$P(w \mid D) \simeq \frac{1}{Z_M}\exp\!\Big(-S(w_{MP}) - \tfrac{1}{2}\,\Delta w^{T} A\,\Delta w\Big), \qquad
Z_M \simeq e^{-S(w_{MP})}\,(2\pi)^{W/2}\,(\det A)^{-1/2}.$$

Linearizing the network response, $y(x; w) \simeq y(x; w_{MP}) + g^{T}\Delta w$ with $g = \nabla_w y(x; w)\big|_{w_{MP}}$, gives the uncertainty of the prediction:
$$\sigma_y^2(x) = \frac{1}{\beta} + g^{T} A^{-1} g,$$
with $A^{-1}$ playing the role of the covariance matrix.
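A sketch of the network-response uncertainty formula above, assuming the inverse Hessian (covariance matrix) has already been computed elsewhere, e.g. with the TNT package mentioned later; the function name is illustrative.

```cpp
#include <cstddef>
#include <vector>

// sigma_y^2(x) = 1/beta + g^T * A^{-1} * g, where g = dy/dw at w_MP and
// a_inv is the inverse Hessian (covariance matrix) of S(w).
double response_variance(const std::vector<double>& g,
                         const std::vector<std::vector<double>>& a_inv,
                         double beta) {
    double var = 1.0 / beta;
    for (std::size_t k = 0; k < g.size(); ++k)
        for (std::size_t l = 0; l < g.size(); ++l)
            var += g[k] * a_inv[k][l] * g[l];
    return var;
}
```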

(35)

How to fix the proper α?

$$p(w \mid D, A) = \int d\alpha\; p(w \mid \alpha, D, A)\, p(\alpha \mid D, A) \simeq p(w \mid \alpha_{MP}, D, A)$$

valid if $p(\alpha \mid D, A)$ is sharply peaked!

Two ideas:

• Evidence approximation (MacKay): find $w_{MP}$, then find $\alpha_{MP}$

• Hierarchical: perform the integrals over $\alpha$ analytically

(36)

Getting α_MP

$$p(\alpha \mid D) = \frac{p(D \mid \alpha)\, p(\alpha)}{p(D)}, \qquad
p(D \mid \alpha) = \int dw\; p(D \mid w)\, p(w \mid \alpha) = \frac{Z_M(\alpha)}{Z_W(\alpha)\, Z_D}$$

Maximizing the evidence, $\dfrac{d}{d\alpha}\log p(D \mid \alpha) = 0$, gives
$$2\,\alpha_{MP} E_W = \gamma, \qquad
\gamma = \sum_{i=1}^{W}\frac{\lambda_i}{\lambda_i + \alpha} = W - \alpha\,\mathrm{Tr}\,A^{-1},$$
where γ is the effective number of well-determined parameters and the $\lambda_i$ are the eigenvalues of the data part of the Hessian.

An iterative procedure applied during training.
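A sketch of one re-estimation step implied above; the companion update for beta, beta_new = (N - gamma)/(2 E_D), is the standard evidence-approximation formula and is included as an assumption rather than a quote from the slide.

```cpp
#include <cstddef>
#include <vector>

// One evidence-approximation re-estimation step:
// gamma = W - alpha * Tr(A^{-1}),  alpha_new = gamma / (2 E_W),
// beta_new = (N - gamma) / (2 E_D)   [beta update: standard companion formula].
void reestimate_hyperparameters(const std::vector<std::vector<double>>& a_inv,
                                double e_w, double e_d, int n_data,
                                double& alpha, double& beta) {
    double trace = 0.0;
    for (std::size_t i = 0; i < a_inv.size(); ++i) trace += a_inv[i][i];
    const double w_count = static_cast<double>(a_inv.size());
    const double gamma = w_count - alpha * trace;   // well-determined parameters
    alpha = gamma / (2.0 * e_w);
    beta  = (n_data - gamma) / (2.0 * e_d);
}
```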

(37)

Bayesian Model Comparison – Occam Factor

$$P(A_i \mid D) = \frac{P(D \mid A_i)\, P(A_i)}{P(D)}$$

$$P(D \mid A_i) = \int dw\; p(D \mid w, A_i)\, p(w \mid A_i) \simeq p(D \mid w_{MP}, A_i)\; p(w_{MP} \mid A_i)\,\Delta w_{posterior}$$

If the prior is flat, $p(w_{MP} \mid A_i) = 1/\Delta w_{prior}$, then
$$P(D \mid A_i) \simeq \underbrace{p(D \mid w_{MP}, A_i)}_{\text{best-fit likelihood}}\;\times\;\underbrace{\frac{\Delta w_{posterior}}{\Delta w_{prior}}}_{\text{Occam factor}},
\qquad \Delta w_{posterior} = (2\pi)^{W/2}\,(\det A)^{-1/2}.$$

• The log of the Occam factor → the amount of information we gain after the data have arrived

• Large Occam factor → complex models: larger accessible phase space (larger range of the posterior)

• Small Occam factor → simple models: small accessible phase space (smaller range of the posterior)

(38)

Evidence

$$\ln p(D \mid A) \simeq -\alpha_{MP} E_W^{MP} - \beta_{MP} E_D^{MP} - \frac{1}{2}\ln\det A
+ \frac{W}{2}\ln\alpha_{MP} + \frac{N}{2}\ln\beta_{MP}
+ \ln M! + M\ln 2
+ \frac{1}{2}\ln\frac{2}{\gamma} + \frac{1}{2}\ln\frac{2}{N-\gamma}$$

• $2^M M!$ – symmetry factor for a network with $M$ hidden units

• $\beta_{MP} E_D^{MP}$ – misfit of the interpolant to the data

• remaining terms – Occam factor, the penalty term

[Diagram: a network with tanh(·) hidden units mapping $(x, Q^2)$ to $F_2$.]
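A sketch that adds up the log-evidence terms reconstructed above; the exact constant and symmetry-factor terms follow the standard MacKay/Bishop expression and should be read as an assumption about what the slide abbreviates, and the small width terms in gamma are omitted here.

```cpp
#include <cmath>

// ln p(D|A) ~ -alpha*E_W - beta*E_D - 0.5*ln(det A)
//             + 0.5*W*ln(alpha) + 0.5*N*ln(beta) + ln(M!) + M*ln 2
// (symmetry factor 2^M * M! for M hidden units; the small ln(2/gamma) width
//  terms of the full expression are omitted in this sketch).
double log_evidence(double alpha, double beta, double e_w, double e_d,
                    double log_det_hessian, int n_weights, int n_data, int n_hidden) {
    double log_mfact = 0.0;
    for (int m = 2; m <= n_hidden; ++m) log_mfact += std::log(static_cast<double>(m));
    return -alpha * e_w - beta * e_d
           - 0.5 * log_det_hessian
           + 0.5 * n_weights * std::log(alpha)
           + 0.5 * n_data * std::log(beta)
           + log_mfact + n_hidden * std::log(2.0);
}
```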

(39)

Occam hill: the 1:2:1 network is preferred by the data

(40)
(41)
(42)
