DATA SCIENCE WITH MACHINE LEARNING:
REGRESSION
WFAiS UJ, Informatyka Stosowana (Applied Computer Science), first-cycle degree programme
This lecture is based on the course by E. Fox and C. Guestrin, University of Washington.
What is Data Science?
Data science is mainly about extracting knowledge from data (the terms "data mining" and "Knowledge Discovery in Databases" are closely related). It can be about analyzing trends, building predictive models, etc.
It is an agglomerate of data collection, data modeling and analysis, decision making, and everything you need to know to accomplish your goals. Eventually, it boils down to the following fields/skills:
Computer science:
Algorithms, programming (patterns, languages, etc.), understanding hardware & operating systems, high-performance computing
Mathematical aspects:
Linear algebra, differential equations for optimization problems, statistics
A few others:
Machine learning, domain knowledge, and data visualization & communication skills
Data Science and Machine Learning?
Machine learning algorithms are algorithms that learn (often predictive) models from data. I.e., instead of formulating "rules" manually, a machine learning algorithm will learn the model for you.
Machine learning - at its core - is about the use and development of these learning algorithms. Data science is more about the extraction of
knowledge from data to answer particular questions or solve particular problems.
Machine learning is often a big part of a "data science" project, e.g., it is often heavily used for exploratory analysis and discovery (clustering
algorithms) and building predictive models (supervised learning
algorithms). However, in data science, you often also worry about the collection, wrangling, and cleaning of your data (i.e., data engineering), and eventually, you want to draw conclusions from your data that help you solve a particular problem.
Deploying an intelligence module
Case studies are about building, evaluating, and deploying intelligence in data analysis.
Use pre-specified algorithms or develop your own
Case study
Prediction: Predicting house prices
Data
Input vs. output
• y is the quantity of interest
• assume y can be predicted from x
Model: assume functional relationship
"Essentially, all models are wrong, but some are useful."
George Box, 1987.
Task 1:
Which model to fit?
Task 2:
For a given model f(x), estimate the function from data
How it works: baseline flow chart
SIMPLE LINEAR REGRESSION
Simple linear regression model
The cost of using a given line
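As a rough illustration (assumed toy data, not from the lecture), here is a minimal numpy sketch of this cost: the residual sum of squares (RSS) of a candidate line w0 + w1*x.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # feature, e.g. sq. ft. (in 1000s)
y = np.array([1.5, 1.9, 3.2, 3.8])   # observed output, e.g. price

def rss(w0, w1, x, y):
    # Cost of using the line w0 + w1*x: sum of squared residuals.
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2)

print(rss(0.0, 1.0, x, y))   # cost of one particular candidate line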
Find the "best" line
Interpreting the coefficients
Interpreting the coefficients
The magnitude of the fit parameters depends on the units of both features and observations
ML algorithm: minimising the cost
Convergence criteria
That will be "good enough":
the value of the tolerance ε depends on the data we are looking at
Moving to multiple dimensions
Contour plots
Gradient descent
Compute the gradient
Approach 1: set gradient to 0
This method is called the "closed-form solution"
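A sketch of the closed-form estimates for simple linear regression, obtained by setting the RSS gradient to zero (toy data as above):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 1.9, 3.2, 3.8])

# Closed-form minimiser of RSS for y ~ w0 + w1*x:
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()
print(w0, w1)   # intercept and slope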
Approach 2: gradient descent
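A minimal gradient-descent sketch for the same problem; the step size eta and tolerance eps are illustrative choices, not values from the lecture.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 1.9, 3.2, 3.8])

w0, w1 = 0.0, 0.0            # initialise somewhere
eta, eps = 0.01, 1e-6        # step size and convergence tolerance
while True:
    err = y - (w0 + w1 * x)
    g0 = -2 * np.sum(err)        # dRSS/dw0
    g1 = -2 * np.sum(err * x)    # dRSS/dw1
    w0 -= eta * g0
    w1 -= eta * g1
    if np.hypot(g0, g1) < eps:   # gradient magnitude "good enough"
        break
print(w0, w1)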
Comparing the approaches
Asymmetric cost functions
We can weight positive and negative errors differently in the RSS calculation.
MULTIPLE REGRESSION
Multiple regression
Polynomial regression
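Polynomial regression is still linear regression after a basis expansion: the features are 1, x, x^2, …, x^p. A numpy sketch on synthetic data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 4.1, 8.3, 15.8])

p = 2
H = np.vander(x, p + 1, increasing=True)    # columns: 1, x, x^2
w, *_ = np.linalg.lstsq(H, y, rcond=None)   # least-squares fit
print(w)   # coefficients of the degree-2 polynomial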
Other functional forms of one input
Trends in time series
This trend can be modeled with a polynomial function.
Other functional forms of one input
Seasonality
Example of detrending
Example of detrending
Other examples of seasonality
Generic basis expansion
More realistic flow chart
Incorporating multiple inputs
Only one bathroom, not the same as my 3 bathrooms
Incorporating multiple inputs
Many possible inputs
General notation
Simple hyperplane
Noise term
More generally: D-dimensional curve
Fitting in D dimensions
Now look at this block
Rewriting in vector notation
y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + … + w_D h_D(x_i) + ε_i = w^T h(x_i) + ε_i
Rewriting in matrix notation
Here is our ML algorithm
Fitting in D dimensions
Now look at this block
Cost function in D dimensions
RSS in vector notation
Cost function in D dimensions
RSS in matrix notation
Regression model in D dimensions
Gradient of RSS
Regression model in D dimensions
Approach 1: set gradient to zero; this gives the closed-form solution
Closed-form solution
This matrix might not be invertible.
Computing it might not be computationally feasible.
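A numpy sketch of the closed-form solution w = (H^T H)^{-1} H^T y, using solve() rather than an explicit inverse (the feature matrix H below is illustrative):

import numpy as np

H = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])          # first column of 1s for the intercept
y = np.array([1.5, 1.9, 3.2, 3.8])

w = np.linalg.solve(H.T @ H, H.T @ y)
print(w)
# Caveats from the slide: H^T H may not be invertible (e.g. collinear
# features or D > N), and solving it costs roughly O(N*D^2 + D^3).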
Regression model in D dimensions
Approach 2: gradient descent
We initialise our solution somewhere and then …
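The same toy problem with gradient descent in matrix notation, where the gradient of RSS is -2 H^T (y - H w); again the step size and tolerance are illustrative.

import numpy as np

H = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([1.5, 1.9, 3.2, 3.8])

w = np.zeros(H.shape[1])        # initialise our solution somewhere
eta, eps = 0.01, 1e-6
while True:
    grad = -2 * H.T @ (y - H @ w)   # gradient of RSS(w)
    w -= eta * grad
    if np.linalg.norm(grad) < eps:
        break
print(w)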
Gradient descent
Summary of gradient descent
An extremely useful algorithm in many applications
ASSESSING PERFORMANCE
Measuring loss
Symmetric loss functions
Assessing the loss
Use training data
Compute training error
Training error
The convention is to take the average here
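A sketch of the training error as the average squared error of the fitted model on its own training data (H, y and w as in the earlier sketches):

import numpy as np

def training_error(w, H, y):
    # Mean squared error of predictions H @ w on the training data.
    residuals = y - H @ w
    return np.mean(residuals ** 2)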
Training error vs. model complexity
Training error decreases as you increase model complexity; it is intuitive why this is the case.
Is training error a good measure?
Is there something particularly wrong about having x_t square feet?
Generalisation (true) error
Generalisation error vs. model complexity
However, in contrast to the training error, in practice we cannot really compute the true generalisation error: we don't have data on all possible houses in the area.
Forming a test set
We want to approximate the generalisation error.
Test set: a proxy for "everything you might see"
Compute test error
Training, true, and test error vs. model complexity. Notion of overfitting.
Test error: a noisy version of the true error, due to limited statistics.
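A sketch of forming a test set and computing the test error on synthetic data (the 80/20 split fraction is an illustrative choice):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=100)
y = 1.0 + 0.7 * x + rng.normal(0, 0.3, size=100)   # synthetic "houses"
H = np.column_stack([np.ones_like(x), x])

idx = rng.permutation(len(y))
test, train = idx[:20], idx[20:]                   # 80/20 split

w = np.linalg.solve(H[train].T @ H[train], H[train].T @ y[train])
test_error = np.mean((y[test] - H[test] @ w) ** 2)
print(test_error)   # noisy estimate of the generalisation error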
Training/test splits
Three sources of errors
Data are inherently noisy
There is some true relationship between sq. ft. and the value of the house, specific to the given house.
We cannot reduce this noise by choosing a better model or procedure; it is beyond our control.
Bias contribution
This contribution we can control.
Bias contribution
Average over all possible fits
Bias contribution
Variance contribution
Variance contribution
Variance of high complexity models
For each fit, remove a few random houses from the training set
Bias of high complexity models
For each fit, remove a few random houses from the training set.
High complexity models are very flexible and, on average, pick up the true trend better.
Bias-variance tradeoff
MSE = mean squared error
Machine learning is all about this tradeoff.
But …
Errors vs. amount of data
The regression/ML workflow
Hypothetical implementation
Practical implementation
Typical splits
K-fold cross validation
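A pure-numpy sketch of K-fold cross validation for the linear model above (K = 5 is an illustrative choice): each block serves once as the validation set, and the K validation errors are averaged.

import numpy as np

def kfold_cv_error(H, y, K=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for k in range(K):
        val = folds[k]                                  # held-out block
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = np.linalg.solve(H[train].T @ H[train], H[train].T @ y[train])
        errors.append(np.mean((y[val] - H[val] @ w) ** 2))
    return np.mean(errors)                              # average CV error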
What value of K?
RIDGE REGRESSION
Flexibility of high-order polynomials
Symptom of overfitting: often associated with very large values of the estimated parameters
How does # of observations influence overfitting?
Let's improve the quality metric block
Desired total cost format
Want to balance
Measure of magnitude of regression coefficients
But … the coefficients are very large
Consider specific total cost
Consider resulting objectives
Ridge regression: bias-variance tradeoff
Ridge regression: coefficient path
Features scaled to unit norm; sweet spot for λ
Flow chart
Ridge regression: cost in matrix notation
Gradient of the ridge regression cost
Ridge regression: closed-form solution
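A sketch of the ridge closed-form solution w = (H^T H + λI)^{-1} H^T y; for λ > 0 the regularised matrix is always invertible. Whether the intercept should be excluded from the penalty is discussed a few slides below.

import numpy as np

def ridge_closed_form(H, y, lam):
    # Solve (H^T H + lam*I) w = H^T y without forming an explicit inverse.
    D = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)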
Ridge regression: gradient descent
Summary of ridge regression algorithm
How to handle the intercept
Recall multiple regression model
Do we penalize intercept?
Do we penalize intercept?
Option 1: don’t penalize intercept
Option 2: Center data first
FEATURE SELECTION
&
LASSO REGRESSION
Why feature selection?
Sparsity
Find best model of size: 0
Find best model of size: 1
Find best model of size: 2
Note: not necessarily nested!
Find best model of size: N
Which model complexity to choose?
Certainly not the one with the smallest training error!
Choosing model complexity
Complexity of "all subsets"
Greedy algorithm
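A sketch of the greedy forward stepwise idea: start from the empty model and repeatedly add the single feature that most reduces the training RSS (the final model size would then be chosen on a validation set or by cross validation, not on training error).

import numpy as np

def forward_stepwise(H, y, max_size):
    selected, remaining = [], list(range(H.shape[1]))
    for _ in range(max_size):
        def rss_with(j):
            # Training RSS of the least-squares fit on selected + [j].
            cols = selected + [j]
            w, *_ = np.linalg.lstsq(H[:, cols], y, rcond=None)
            return np.sum((y - H[:, cols] @ w) ** 2)
        best = min(remaining, key=rss_with)   # greedy choice
        selected.append(best)
        remaining.remove(best)
    return selected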
Visualizing greedy algorithm
Visualizing greedy algorithm
Visualizing greedy algorithm
Notice: it is suboptimal.
Adding the next best feature; the fits are nested now.
Visualizing greedy algorithm
Complexity of forward stepwise
Other greedy algorithms
Using regularisation for feature selection
Thresholding ridge coefficients?
Thresholding ridge coefficients?
Thresholding ridge coefficients?
Thresholding ridge coefficients?
Remember:
this is a linear model. If we assume that #showers = #bathrooms and remove one of them from the model, the two coefficients will add up.
Thresholding ridge coefficients?
Try this cost instead of ridge …
Lasso regression
Coefficient path: ridge
Coefficient path: lasso
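A sketch of computing the lasso coefficient path with scikit-learn (the library choice is an assumption, not from the lecture); unlike ridge, coefficients hit exactly zero as the penalty grows, which is what enables feature selection.

import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
w_true = np.array([3.0, -2.0, 0, 0, 1.5, 0, 0, 0])   # sparse ground truth
y = X @ w_true + rng.normal(0, 0.5, size=100)

alphas, coefs, _ = lasso_path(X, y)   # coefs: (n_features, n_alphas)
for a, c in zip(alphas[::20], coefs.T[::20]):
    print(f"alpha={a:.3f}  nonzero coefficients={int((c != 0).sum())}")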
NONPARAMETRIC REGRESSION
Fit globally vs. fit locally
Parametric models
Below, f(x) is not really a polynomial function.
(Fits shown: constant, linear, quadratic.)
What alternative do we have?
Nearest Neighbor & Kernel Regression (nonparametric approach)
Simple implementation; flexibility increases as we have more data
Fit locally to each data point
What people do naturally…
1-NN regression more formally
Transition point
Visualizing 1-NN in multiple dimensions
Distance metrics: notion of "closest"
Weighting housing inputs
Scaled Euclidean distance
Different distance metrics
Performing 1-NN search
1-NN algorithm
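A sketch of 1-NN regression: predict with the output of the single closest training point, here under a scaled Euclidean distance (the per-dimension scaling weights echo the earlier distance-metric slides).

import numpy as np

def one_nn_predict(x_query, X_train, y_train, scales=None):
    # Scaled squared Euclidean distance to every training point.
    if scales is None:
        scales = np.ones(X_train.shape[1])
    d2 = np.sum(scales * (X_train - x_query) ** 2, axis=1)
    return y_train[np.argmin(d2)]   # output of the nearest neighbour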
1-NN in practice
1-NN is sensitive to noise in the data (plot: true function vs. 1-NN fit).
Get more „comps”
K-NN regression more formally
K-NN more formally
K-NN algorithm
K-NN in practice
All k nearest neighbours of a specific (red) query point
K-NN in practice
Issues with discontinuities
Weighted k-NN
How to define weights
Kernel weights for d=1
The kernel determines how the weights decay, if at all, as a function of the distance.
Kernel regression
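A sketch of kernel (Nadaraya-Watson) regression for d = 1: the prediction is a weighted average of all training outputs, with weights from a Gaussian kernel of bandwidth lam (the Gaussian choice is illustrative; any of the kernels above would do).

import numpy as np

def kernel_regression(x_query, x_train, y_train, lam):
    # Gaussian kernel weights decay with distance, controlled by lam.
    w = np.exp(-((x_train - x_query) ** 2) / (2 * lam ** 2))
    return np.sum(w * y_train) / np.sum(w)   # weighted average of outputs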
Kernel regression in practice
Choice of bandwidth λ
Choosing λ (or k in k-NN)
Contrasting with global average
Contrasting with global average
Local linear regression
Local regression rules of thumb
Nonparametric approaches
Limiting behaviour of NN
Limiting behaviour of NN
Error vs. amount of data
Limiting behaviour of NN
Issues: NN and kernel methods
Issues: Complexity of NN search
Summarising