Neural Networks –
model-free data analysis?
K. M. Graczyk
IFT, University of Wrocław, Poland
Why Neural Networks?
• Inspired by C. Giunti (Torino)
– PDFs by neural networks
• Papers of Forte et al. (JHEP 0205:062,2002; JHEP 0503:080,2005; JHEP 0703:039,2007; Nucl. Phys. B809:1-63,2009).
• A kind of model-independent way of fitting data and computing the associated uncertainty
• Learn, Implement, Publish (LIP rule)
– Cooperation with R. Sulej (IPJ, Warszawa) and P.
Płoński (Politechnika Warszawska)
• NetMaker
– GrANNet ;) my own C++ library
Road map
• Artificial Neural Networks (NN) – idea
• Feed Forward NN
• PDFs by NN
• Bayesian statistics
• Bayesian approach to NN
• GrANNet
Inspired by Nature
The human brain consists of around 10^11 neurons, which are highly interconnected with around 10^15 connections.
Applications
• Function approximation, or regression analysis, including time-series prediction, fitness approximation, and modeling.
• Classification, including pattern and sequence recognition, novelty detection, and sequential decision making.
• Data processing, including filtering, clustering, blind source separation, and compression.
• Robotics, including directing manipulators and computer numerical control.
Artificial Neural Network
[Figure: layered network – input layer, hidden layer, output (target) layer]
Feed Forward
The simplest example: linear activation functions, in matrix form

$$\begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix} = \begin{pmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{pmatrix} \begin{pmatrix} i_1 \\ i_2 \end{pmatrix}$$
[Figure: the i-th perceptron – inputs 1, 2, 3, …, k are multiplied by weights, summed, a threshold is applied, and the activation function produces the output]
• Heaviside step function Θ(x): a 0 or 1 signal
• sigmoid function: g(x) = 1/(1 + e^{-x})
• tanh(x)
• linear
[Plot: tanh(x) and sigmoid over the range −4 to 4; above the threshold the signal is amplified, below it the signal is weaker]
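A minimal sketch of the activation functions listed above (assuming numpy; illustration only, not code from the talk):

```python
import numpy as np

def heaviside(x):
    """Step function: 0 or 1 signal."""
    return np.where(x >= 0.0, 1.0, 0.0)

def sigmoid(x):
    """g(x) = 1 / (1 + exp(-x)), output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# tanh and the identity (linear) come for free:
linear = lambda x: x

print(sigmoid(0.0))     # 0.5
print(np.tanh(0.0))     # 0.0
print(heaviside(-2.0))  # 0.0
```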
architecture
• 3-layer network, two hidden layers: 1:2:1:1
• weights: 2+2+1, bias weights: 1+2+1, so #par = 9
• [Figure: x → F(x); symmetric sigmoid hidden units, linear output unit]
• Bias neurons (emitting a constant signal of one) instead of thresholds
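The parameter count above (#par = 9 for a 1:2:1:1 network with bias neurons) can be checked with a short helper (illustrative only):

```python
def n_parameters(layers):
    """Weights plus biases of a fully connected feed-forward network.

    layers: unit counts per layer, e.g. [1, 2, 1, 1].
    Each non-input layer has fan_in * fan_out weights and fan_out biases.
    """
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))

print(n_parameters([1, 2, 1, 1]))  # 9, as on the slide: (2+2+1) weights + (1+2+1) biases
```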
Neural Networks – Function Approximation
• The universal approximation theorem for neural networks states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layer perceptron with just one hidden layer. This result holds only for restricted classes of activation functions, e.g. for the sigmoidal functions. (wikipedia.org)
A neural network as a map from one vector space to another:

$$(x, Q^2) \mapsto F_2(x, Q^2; w_{ij})$$

The structure function F_2 is parametrized by the network weights w_{ij}.
Supervised Learning
• Propose the Error Function
– in principle any continuous function which has a global minimum
• Motivated by statistics: the standard error function, χ², etc.
• Consider a set of data
• Train the given NN by showing it the data and minimizing the error function
– back-propagation algorithms
• An iterative procedure which fixes the weights
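A toy illustration of the supervised-learning loop described above: gradient descent with back-propagation on the standard error function E = ½ Σ (y − t)², for a small 1:3:1 tanh network. This is a sketch, not NetMaker/GrANNet code; the architecture, data, and constants are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
t = np.sin(np.pi * x)                        # "data" to be fitted

# 1:3:1 network: one input, three tanh hidden units, one linear output
W1 = rng.normal(0.0, 1.0, (3, 1)); b1 = np.zeros((3, 1))
W2 = rng.normal(0.0, 1.0, (1, 3)); b2 = np.zeros((1, 1))

def response():
    h = np.tanh(W1 @ x[None, :] + b1)        # hidden activations, shape (3, N)
    return h, (W2 @ h + b2).ravel()          # network response y(x)

_, y = response()
E0 = 0.5 * np.sum((y - t) ** 2)              # initial error

eta = 0.1                                    # learning rate (arbitrary)
for _ in range(3000):
    h, y = response()
    d = (y - t)[None, :] / x.size            # dE/dy, averaged over points
    gW2 = d @ h.T; gb2 = d.sum(axis=1, keepdims=True)
    dh = (W2.T @ d) * (1.0 - h ** 2)         # back-propagate through tanh
    gW1 = dh @ x[:, None]; gb1 = dh.sum(axis=1, keepdims=True)
    W2 -= eta * gW2; b2 -= eta * gb2
    W1 -= eta * gW1; b1 -= eta * gb1

_, y = response()
E = 0.5 * np.sum((y - t) ** 2)
print(E0, E)  # the error decreases during training
```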
Learning Algorithms
• Gradient algorithms
– Gradient descent
– RPROP (Riedmiller & Braun)
– Conjugate gradients
• Look at curvature:
– QuickProp (Fahlman)
– Levenberg-Marquardt (Hessian)
– Newton's method (Hessian)
• Monte Carlo algorithms (based on Markov chains)
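As a concrete example of one of the algorithms above, a simplified sketch of the sign-based RPROP- update (after Riedmiller & Braun) on a toy quadratic error surface; the step-size constants are the commonly quoted defaults, and the toy problem is an arbitrary choice:

```python
import numpy as np

def rprop_minimize(grad, w, n_steps=100, d0=0.1, dmin=1e-6, dmax=1.0,
                   eta_plus=1.2, eta_minus=0.5):
    """RPROP-: per-weight steps adapted from gradient signs only."""
    step = np.full_like(w, d0)
    g_prev = np.zeros_like(w)
    for _ in range(n_steps):
        g = grad(w)
        same = g * g_prev > 0          # sign agreement -> grow the step
        step = np.where(same, np.minimum(step * eta_plus, dmax),
                        np.where(g * g_prev < 0,
                                 np.maximum(step * eta_minus, dmin), step))
        w = w - np.sign(g) * step      # move against the gradient sign
        g_prev = g
    return w

# toy error surface: E(w) = 1/2 * sum(w^2), gradient = w, minimum at 0
w_opt = rprop_minimize(lambda w: w, np.array([3.0, -2.0]))
print(w_opt)  # close to [0, 0]
```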
Overfitting
• More complex models describe the data better, but lose generality
– the bias-variance trade-off
• Overfitting → large values of the weights
• Compare with a test set (must be twice as large as the original)
• Regularization: an additional penalty term in the error function
$$E = E_D + \alpha E_W,\qquad E_W = \frac{1}{2}\sum_{i=1}^{W} w_i^2$$

In the absence of data the gradient flow reduces to weight decay:

$$\frac{dw_i}{dt} = -\alpha w_i \;\Rightarrow\; w_i(t) = w_i(0)\, e^{-\alpha t}$$

α is the decay rate.
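The decay law in the absence of data, w(t) = w(0) exp(−αt), can be checked numerically (an illustrative Euler integration with arbitrary constants):

```python
import math

alpha = 0.5        # decay rate (arbitrary)
dt = 1e-4          # Euler step
w = 2.0            # initial weight w(0)

for _ in range(10000):      # integrate dw/dt = -alpha * w up to t = 1
    w -= dt * alpha * w     # gradient of the penalty (alpha/2) * w^2

print(w, 2.0 * math.exp(-alpha))  # Euler result vs. the exact exp(-alpha*t)
```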
What about physics
[Diagram: Nature → Observation/Measurements → Data → Statistics; Idea → model (nonperturbative QCD); most models have free parameters, unlike nonparametric QED]
• The data are still more precise than the theory
• Physics given directly by the data
• Problems: some general constraints; a model-independent analysis; a statistical model of the data; the uncertainty of the predictions
Fitting data with Artificial Neural Networks
'The goal of network training is not to learn an exact representation of the training data itself, but rather to build a statistical model of the process which generates the data.'
C. Bishop, 'Neural Networks for Pattern Recognition'
Parton Distribution Function with NN
Same method, but… [Figure: F_2(x, Q^2)]
Parton Distribution Functions: S. Forte, L. Garrido, J. I. Latorre and A. Piccione, JHEP 0205 (2002) 062
• A kind of model-independent analysis of the data
• Construction of the probability density P[G(Q^2)] in the space of the structure functions
– In practice only one neural-network architecture
• Probability density in the space of parameters of one particular NN
But in reality Forte et al. did:
• Generate Monte Carlo pseudo-data
• Train N_rep neural networks, one for each set of N_dat pseudo-data
• The N_rep trained neural networks provide a representation of the probability measure in the space of the structure functions
The idea comes from W. T. Giele and S. Keller.
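The replica strategy above can be sketched as follows; for brevity a cubic least-squares polynomial stands in for the neural-network fit of each replica, and all data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.1, 0.9, 30)
f_true = lambda x: x * (1 - x)               # hypothetical "true" curve
sigma = 0.01                                 # assumed measurement error
data = f_true(x) + rng.normal(0, sigma, x.size)   # one "measured" set

n_rep = 100
fits = []
for _ in range(n_rep):
    pseudo = data + rng.normal(0, sigma, x.size)  # Monte Carlo replica
    # stand-in for a neural-network fit: cubic least-squares polynomial
    fits.append(np.polynomial.Polynomial.fit(x, pseudo, deg=3))

x0 = 0.5
preds = np.array([p(x0) for p in fits])
print(preds.mean(), preds.std())  # central value and its uncertainty at x0
```

The spread of the N_rep fits at a point plays the role of the prediction uncertainty, exactly as in the Forte et al. procedure.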
[Plots: correlation and uncertainty for 10, 100 and 1000 replicas; 30 data points; overfitting as training runs short enough, long, and too long]
My criticism
• Does the simultaneous use of artificial data and the χ² error function overestimate the uncertainty?
• Other NN architectures are not discussed
• Problems with overfitting (a test set is needed)
• A relatively simple approach compared with present techniques in NN computing
• The uncertainty of the model predictions should be generated from the probability distribution obtained for the model, not from the data itself
GrANNet – Why?
• I stole some ideas from FANN
• C++ Library, easy in use
• User defined Error Function (any you wish)
• Easy access to units and their weights
• Several ways for initiating network of given architecture
• Bayesian learning
• Main objects:
– Classes: NeuralNetwork, Unit
– Learning algorithms: so far QuickProp, Rprop+, Rprop-, iRprop-, iRprop+, …
– Network response uncertainty (based on the Hessian)
– Some simple restarting and stopping solutions
Structure of GrANNet
• Libraries:
– Unit class
– Neural_Network class
– Activation (activation and error function structures)
– Learning algorithms
– RProp+, RProp-, iRProp+, iRProp-, QuickProp, BackProp
– generatormt
– TNT inverse matrix package
Bayesian Approach
‘common sense reduced to calculations’
Bayesian Framework for BackProp NN, MacKay, Bishop,…
• Objective Criteria for comparing alternative network solutions, in particular with different architectures
• Objective criteria for setting decay rate
• Objective choice of regularizing function Ew
• Comparing with test data is not required.
Notation and Conventions
Data point: input vector x_i, target vector t_i; y(x) – the network response.

$$D:\; (x_1, t_1),\, (x_2, t_2),\, \ldots,\, (x_N, t_N)$$

N – number of data points
W – number of weights
Model Classification
• A collection of models H_1, H_2, …, H_k
• We believe that the models are classified by prior probabilities P(H_1), P(H_2), …, P(H_k) (which sum to 1)
• After observing data D, Bayes' rule gives

$$P(H_i \mid D) = \frac{P(D \mid H_i)\, P(H_i)}{P(D)}$$

where P(D) is a normalizing constant and P(D|H_i) is the probability of D given H_i.
• Usually at the beginning P(H_1) = P(H_2) = … = P(H_k)
Single Model Statistics
• Assume that model H_i is the correct one
• The neural network A_i with weights w is considered
• Task 1: assuming some prior probability of w, construct the posterior after including the data
• Task 2: consider the space of hypotheses and construct the evidence for them

$$P(w \mid D, A_i) = \frac{P(D \mid w, A_i)\, P(w \mid A_i)}{P(D \mid A_i)},\qquad \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$$

$$P(D \mid A_i) = \int dw\, P(D \mid w, A_i)\, P(w \mid A_i),\qquad P(A_i \mid D) \propto P(D \mid A_i)\, P(A_i)$$
Hierarchy
$$P(w \mid D, \alpha, A) = \frac{P(D \mid w, \alpha, A)\, P(w \mid \alpha, A)}{P(D \mid \alpha, A)}$$

$$P(\alpha \mid D, A) = \frac{P(D \mid \alpha, A)\, P(\alpha \mid A)}{P(D \mid A)}$$

$$P(A \mid D) = \frac{P(D \mid A)\, P(A)}{P(D)}$$
Constructing prior and posterior functions
Assume Gaussian noise and a Gaussian prior, with α and β treated as constants:

$$P(D \mid w, \beta, A) = \frac{\exp(-\beta E_D)}{Z_D(\beta)},\qquad E_D = \frac{1}{2}\sum_{i=1}^{N}\bigl(y(x_i, w) - t_i\bigr)^2,\qquad Z_D(\beta) = \left(\frac{2\pi}{\beta}\right)^{N/2}$$

(likelihood),

$$P(w \mid \alpha, A) = \frac{\exp(-\alpha E_W)}{Z_W(\alpha)},\qquad E_W = \frac{1}{2}\sum_{i=1}^{W} w_i^2,\qquad Z_W(\alpha) = \left(\frac{2\pi}{\alpha}\right)^{W/2}$$

(prior), and hence the posterior probability

$$P(w \mid D, \alpha, \beta, A) = \frac{\exp(-S(w))}{Z_M},\qquad S(w) = \beta E_D + \alpha E_W,\qquad Z_M = \int dw\, \exp(-S(w))$$
[Plot: the weight distribution P(w) vs. w – a posterior peaked at w_MP, shifted from the prior center w_0]
Computing Posterior
Expand S(w) around its minimum w_MP (the Laplace approximation):

$$S(w) \approx S(w_{MP}) + \frac{1}{2}\,\Delta w^T A\, \Delta w,\qquad \Delta w = w - w_{MP},\qquad A = \nabla\nabla S(w)\big|_{w_{MP}}\ \text{(Hessian)}$$

$$P(w \mid D, A) \approx \frac{1}{Z}\exp\!\left(-S(w_{MP}) - \frac{1}{2}\,\Delta w^T A\, \Delta w\right)$$

so the posterior is Gaussian with covariance matrix A^{-1}. The network response then carries the uncertainty

$$\sigma_y^2(x) = g^T A^{-1} g,\qquad g = \nabla_w y(x, w)\big|_{w_{MP}}$$

with the Hessian elements

$$A_{kl} = \beta \sum_{i=1}^{N} \frac{\partial y_i}{\partial w_k}\frac{\partial y_i}{\partial w_l} + \beta \sum_{i=1}^{N} (y_i - t_i)\,\frac{\partial^2 y_i}{\partial w_k\, \partial w_l} + \alpha\, \delta_{kl}$$
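The Hessian-based covariance described above can be illustrated on a linear model y = w_0 + w_1 x, where the Laplace approximation is exact; all numbers here are made up for the example:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])   # data abscissas (arbitrary)
beta = 4.0                           # noise precision of the data term
alpha = 0.01                         # weight-decay strength of the prior

# For a linear model, dy/dw at each data point is the design matrix row
Phi = np.stack([np.ones_like(x), x], axis=1)

# Hessian of S(w) = beta*E_D + alpha*E_W (exact for a linear model)
A = beta * Phi.T @ Phi + alpha * np.eye(2)
A_inv = np.linalg.inv(A)             # covariance matrix of the posterior

g = np.array([1.0, 1.5])             # dy/dw at a new point x = 1.5
var_y = g @ A_inv @ g                # sigma_y^2 = g^T A^{-1} g
print(var_y ** 0.5)                  # 1-sigma band of the fit at x = 1.5
```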
How to fix the proper α?

Two ideas:
• Evidence approximation (MacKay): find w_MP, find α_MP, and approximate

$$p(w \mid D, A) = \int d\alpha\, p(w \mid \alpha, D, A)\, p(\alpha \mid D, A) \approx p(w \mid \alpha_{MP}, D, A)$$

which is valid if p(α|D, A) is sharply peaked.
• Hierarchical approach: perform the integrals over α analytically.
Getting α_MP

Maximizing p(α|D, A), i.e. setting d log p(D|α, A)/dα = 0, gives

$$2\alpha_{MP} E_W = \gamma,\qquad \gamma = \sum_{i=1}^{W} \frac{\lambda_i}{\lambda_i + \alpha}$$

where the λ_i are the eigenvalues of the data Hessian β∇∇E_D. γ is the effective number of well-determined parameters, and

$$\alpha^{\text{new}} = \frac{\gamma}{2 E_W}$$

is applied as an iterative procedure during training.
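The re-estimation of α above can be sketched as a fixed-point iteration; the eigenvalues and E_W here are illustrative, and in real training both are themselves updated between steps:

```python
import numpy as np

lam = np.array([100.0, 10.0, 0.01])  # eigenvalues of the data Hessian (toy)
E_W = 1.5                            # current 1/2 * sum w_i^2 (toy)

alpha = 1.0
for _ in range(50):                  # fixed-point iteration
    gamma = np.sum(lam / (lam + alpha))   # well-determined parameters
    alpha = gamma / (2.0 * E_W)           # MacKay's re-estimation formula

print(gamma, alpha)  # gamma ~ number of eigenvalues with lam_i >> alpha
```

Eigenvalues much larger than α each contribute ≈ 1 to γ, eigenvalues much smaller contribute ≈ 0, which is why γ counts the well-determined parameters.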
Bayesian Model Comparison – Occam Factor
$$P(A_i \mid D) \propto P(D \mid A_i)\, P(A_i)$$

$$P(D \mid A_i) = \int dw\, P(D \mid w, A_i)\, P(w \mid A_i) \approx P(D \mid w_{MP}, A_i)\, P(w_{MP} \mid A_i)\, \Delta w_{\text{posterior}}$$

If the prior is flat, $P(w \mid A_i) = 1/\Delta w_{\text{prior}}$, then

$$P(D \mid A_i) \approx \underbrace{P(D \mid w_{MP}, A_i)}_{\text{best-fit likelihood}} \times \underbrace{\frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}}_{\text{Occam factor}}$$

For a Gaussian posterior, $\Delta w_{\text{posterior}} = (2\pi)^{W/2} (\det A)^{-1/2}$.

• The log of the Occam factor measures the amount of information we gain after the data have arrived
• Large Occam factor → complex models: larger accessible phase space (larger range of the posterior)
• Small Occam factor → simple models: small accessible phase space (smaller range of the posterior)
Evidence
$$\ln p(D \mid A) \approx -\alpha_{MP} E_W^{MP} - \beta_{MP} E_D^{MP} - \frac{1}{2}\ln\det A + \frac{W}{2}\ln\alpha_{MP} + \frac{N}{2}\ln\beta_{MP} + \ln g,\qquad g = M!\, 2^M$$

g is the symmetry factor: for M hidden units, permuting the units and flipping the signs of their weights leaves the network response unchanged.
[Figure: F_2(x, Q^2) fits with tanh(·) hidden units]
• Misfit of the interpolant to the data vs. the Occam factor (penalty term)
• Occam hill: the 1:2:1 network is preferred by the data