DATA SCIENCE WITH MACHINE LEARNING:
REGRESSION
WFAiS UJ, Informatyka Stosowana (Applied Computer Science), first-cycle degree programme
This lecture is based on the course by E. Fox and C. Guestrin, University of Washington.
What is Data Science?
Data science is mainly about extracting knowledge from data (the terms "data mining" and "Knowledge Discovery in Databases" are closely related). It can be about analyzing trends, building predictive models, etc.
It is an agglomerate of data collection, data modeling and analysis, decision making, and everything you need to know to accomplish your goals. Eventually, it boils down to the following fields/skills:
Computer science:
Algorithms, programming (patterns, languages, etc.), understanding hardware & operating systems, high-performance computing
Mathematical aspects:
Linear algebra, differential equations for optimization problems, statistics
A few others:
Machine learning, domain knowledge, and data visualization & communication skills
Data Science and Machine Learning?
Machine learning algorithms are algorithms that learn (often predictive) models from data. I.e., instead of formulating "rules" manually, a machine learning algorithm will learn the model for you.
Machine learning - at its core - is about the use and development of these learning algorithms. Data science is more about the extraction of
knowledge from data to answer particular questions or solve particular problems.
Machine learning is often a big part of a "data science" project, e.g., it is often heavily used for exploratory analysis and discovery (clustering
algorithms) and building predictive models (supervised learning
algorithms). However, in data science, you often also worry about the collection, wrangling, and cleaning of your data (i.e., data engineering), and eventually, you want to draw conclusions from your data that help you solve a particular problem.
Deploying an intelligence module
Case studies are about building, evaluating, and deploying intelligence in data analysis.
Use pre-specified algorithms or develop your own
Case study
Prediction: Predicting house prices
Data
Input vs. output
• y is the quantity of interest
• assume y can be predicted from x
Model: assume functional relationship
"Essentially, all models are wrong, but some are useful."
George Box, 1987.
Task 1:
Which model to fit?
Task 2:
For a given model f(x), estimate the function from data
How it works: baseline flow chart
SIMPLE LINEAR REGRESSION
Simple linear regression model
The cost of using a given line
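As a rough illustration (assumed toy data, not from the lecture), here is a minimal numpy sketch of this cost: the residual sum of squares (RSS) of a candidate line w0 + w1*x.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # feature, e.g. sq. ft. (in 1000s)
y = np.array([1.5, 1.9, 3.2, 3.8])   # observed output, e.g. price

def rss(w0, w1, x, y):
    # Cost of using the line w0 + w1*x: sum of squared residuals.
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2)

print(rss(0.0, 1.0, x, y))   # cost of one particular candidate line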
Find the "best" line
Interpreting the coefficients
Interpreting the coefficients
The magnitude of the fit parameters depends on the units of both features and observations
ML algorithm: minimising the cost
Convergence criteria
That will be "good enough":
the value of the tolerance ε depends on the data we are looking at
Moving to multiple dimensions
Contour plots
Gradient descent
Compute the gradient
Approach 1: set gradient to 0
This method is called the "closed-form solution"
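A sketch of the closed-form estimates for simple linear regression, obtained by setting the RSS gradient to zero (toy data as above):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 1.9, 3.2, 3.8])

# Closed-form minimiser of RSS for y ~ w0 + w1*x:
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()
print(w0, w1)   # intercept and slope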
Approach 2: gradient descent
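A minimal gradient-descent sketch for the same problem; the step size eta and tolerance eps are illustrative choices, not values from the lecture.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 1.9, 3.2, 3.8])

w0, w1 = 0.0, 0.0            # initialise somewhere
eta, eps = 0.01, 1e-6        # step size and convergence tolerance
while True:
    err = y - (w0 + w1 * x)
    g0 = -2 * np.sum(err)        # dRSS/dw0
    g1 = -2 * np.sum(err * x)    # dRSS/dw1
    w0 -= eta * g0
    w1 -= eta * g1
    if np.hypot(g0, g1) < eps:   # gradient magnitude "good enough"
        break
print(w0, w1)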
Comparing the approaches
Asymmetric cost functions
We can weight positive and negative errors differently in the RSS calculation.
MULTIPLE REGRESSION
Multiple regression
Polynomial regression
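Polynomial regression is still linear regression after a basis expansion: the features are 1, x, x^2, …, x^p. A numpy sketch on synthetic data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 4.1, 8.3, 15.8])

p = 2
H = np.vander(x, p + 1, increasing=True)    # columns: 1, x, x^2
w, *_ = np.linalg.lstsq(H, y, rcond=None)   # least-squares fit
print(w)   # coefficients of the degree-2 polynomial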
Other functional forms of one input
Trends in time series
This trend can be modeled with a polynomial function.
Other functional forms of one input
Seasonality
Example of detrending
Example of detrending
Other examples of seasonality
Generic basis expansion
More realistic flow chart
Incorporating multiple inputs
Only one bathroom, not the same as my 3 bathrooms
Incorporating multiple inputs
Many possible inputs
General notation
Simple hyperplane
Noise term
More generally: D-dimensional curve
Fitting in D dimensions
Now look at this block
Rewriting in vector notation
y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + … + w_D h_D(x_i) + ε_i = w^T h(x_i) + ε_i
Rewriting in matrix notation
Here is our ML algorithm
Fitting in D dimensions
Now look at this block
Cost function in D dimensions
RSS in vector notation
Cost function in D dimensions
RSS in matrix notation
Regression model in D dimensions
Gradient of RSS
Regression model in D dimensions
Approach 1: set gradient to zero; this gives the closed-form solution
Closed-form solution
This matrix might not be invertible.
Computing it might not be computationally feasible.
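A numpy sketch of the closed-form solution w = (H^T H)^{-1} H^T y, using solve() rather than an explicit inverse (the feature matrix H below is illustrative):

import numpy as np

H = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])          # first column of 1s for the intercept
y = np.array([1.5, 1.9, 3.2, 3.8])

w = np.linalg.solve(H.T @ H, H.T @ y)
print(w)
# Caveats from the slide: H^T H may not be invertible (e.g. collinear
# features or D > N), and solving it costs roughly O(N*D^2 + D^3).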
Regression model in D dimensions
Approach 2: gradient descent
We initialise our solution somewhere and then …
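The same toy problem with gradient descent in matrix notation, where the gradient of RSS is -2 H^T (y - H w); again the step size and tolerance are illustrative.

import numpy as np

H = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([1.5, 1.9, 3.2, 3.8])

w = np.zeros(H.shape[1])        # initialise our solution somewhere
eta, eps = 0.01, 1e-6
while True:
    grad = -2 * H.T @ (y - H @ w)   # gradient of RSS(w)
    w -= eta * grad
    if np.linalg.norm(grad) < eps:
        break
print(w)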
Gradient descent
Summary of gradient descent
An extremely useful algorithm in many applications
ASSESSING PERFORMANCE
Measuring loss
Symmetric loss functions
Assessing the loss
Use training data
Compute training error
Training error
The convention is to take the average here
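A sketch of the training error as the average squared error of the fitted model on its own training data (H, y and w as in the earlier sketches):

import numpy as np

def training_error(w, H, y):
    # Mean squared error of predictions H @ w on the training data.
    residuals = y - H @ w
    return np.mean(residuals ** 2)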
Training error vs. model complexity
Training error decreases as you increase model complexity; it is intuitive why this is the case.
Is training error a good measure?
Is there something particularly wrong about having x_t square feet?
Generalisation (true) error
Generalisation error vs. model complexity
However, in contrast to the training error, in practice we cannot really compute the true generalisation error: we don't have data on all possible houses in the area.
Forming a test set
We want to approximate the generalisation error.
Test set: a proxy for "everything you might see"
Compute test error
Training, true, and test error vs. model complexity. Notion of overfitting.
Test error: a noisy version of the true error, due to limited statistics.
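A sketch of forming a test set and computing the test error on synthetic data (the 80/20 split fraction is an illustrative choice):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=100)
y = 1.0 + 0.7 * x + rng.normal(0, 0.3, size=100)   # synthetic "houses"
H = np.column_stack([np.ones_like(x), x])

idx = rng.permutation(len(y))
test, train = idx[:20], idx[20:]                   # 80/20 split

w = np.linalg.solve(H[train].T @ H[train], H[train].T @ y[train])
test_error = np.mean((y[test] - H[test] @ w) ** 2)
print(test_error)   # noisy estimate of the generalisation error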
Training/test splits
Three sources of errors
Data are inherently noisy
There is some true relationship between sq. ft. and the value of the house, specific to the given house.
We cannot reduce this noise by choosing a better model or procedure; it is beyond our control.
Bias contribution
This contribution we can control.
Bias contribution
Average over all possible fits
Bias contribution
Variance contribution
Variance contribution
Variance of high complexity models
For each fit, remove a few random houses from the training set
Bias of high complexity models
For each fit, remove a few random houses from the training set.
High complexity models are very flexible and, on average, pick up the true trend better.
Bias-variance tradeoff
MSE = mean squared error
Machine learning is all about this tradeoff.
But …
Errors vs. amount of data
The regression/ML workflow
Hypothetical implementation
Practical implementation
Typical splits
K-fold cross validation
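A pure-numpy sketch of K-fold cross validation for the linear model above (K = 5 is an illustrative choice): each block serves once as the validation set, and the K validation errors are averaged.

import numpy as np

def kfold_cv_error(H, y, K=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for k in range(K):
        val = folds[k]                                  # held-out block
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = np.linalg.solve(H[train].T @ H[train], H[train].T @ y[train])
        errors.append(np.mean((y[val] - H[val] @ w) ** 2))
    return np.mean(errors)                              # average CV error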
What value of K?
RIDGE REGRESSION
Flexibility of high-order polynomials
Symptom of overfitting: often associated with very large values of the estimated parameters
How does # of observations influence overfitting?
Let's improve the quality metric block
Desired total cost format
Want to balance
Measure of magnitude of regression coefficients
But … the coefficients are very large
Consider specific total cost
Consider resulting objectives
Ridge regression: bias-variance tradeoff
Ridge regression: coefficient path
Features scaled to unit norm; sweet spot for λ
Flow chart
Ridge regression: cost in matrix notation
Gradient of the ridge regression cost
Ridge regression: closed-form solution
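A sketch of the ridge closed-form solution w = (H^T H + λI)^{-1} H^T y; for λ > 0 the regularised matrix is always invertible. Whether the intercept should be excluded from the penalty is discussed a few slides below.

import numpy as np

def ridge_closed_form(H, y, lam):
    # Solve (H^T H + lam*I) w = H^T y without forming an explicit inverse.
    D = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)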
Ridge regression: gradient descent
Summary of ridge regression algorithm
How to handle the intercept
Recall multiple regression model
Do we penalize intercept?
Do we penalize intercept?
Option 1: don’t penalize intercept
Option 2: Center data first
FEATURE SELECTION
&
LASSO REGRESSION
Why feature selection?
Sparsity
Find best model of size: 0
Find best model of size: 1
Find best model of size: 2
Note: not necessarily nested!
Find best model of size: N
Which model complexity to choose?
Certainly not the one with the smallest training error!
Choosing model complexity
Complexity of "all subsets"
Greedy algorithm
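A sketch of the greedy forward stepwise idea: start from the empty model and repeatedly add the single feature that most reduces the training RSS (the final model size would then be chosen on a validation set or by cross validation, not on training error).

import numpy as np

def forward_stepwise(H, y, max_size):
    selected, remaining = [], list(range(H.shape[1]))
    for _ in range(max_size):
        def rss_with(j):
            # Training RSS of the least-squares fit on selected + [j].
            cols = selected + [j]
            w, *_ = np.linalg.lstsq(H[:, cols], y, rcond=None)
            return np.sum((y - H[:, cols] @ w) ** 2)
        best = min(remaining, key=rss_with)   # greedy choice
        selected.append(best)
        remaining.remove(best)
    return selected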
Visualizing greedy algorithm
Visualizing greedy algorithm
Visualizing greedy algorithm
Notice: it is suboptimal.
Adding the next best feature; the fits are nested now.
Visualizing greedy algorithm
Complexity of forward stepwise
Other greedy algorithms
Using regularisation for feature selection
Thresholding ridge coefficients?
Thresholding ridge coefficients?
Thresholding ridge coefficients?
Thresholding ridge coefficients?
Remember:
this is a linear model. If we assume that #showers = #bathrooms and remove one of them from the model, the two coefficients will add up.
Thresholding ridge coefficients?
Try this cost instead of ridge …
Lasso regression
Coefficient path: ridge
Coefficient path: lasso
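A sketch of computing the lasso coefficient path with scikit-learn (the library choice is an assumption, not from the lecture); unlike ridge, coefficients hit exactly zero as the penalty grows, which is what enables feature selection.

import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
w_true = np.array([3.0, -2.0, 0, 0, 1.5, 0, 0, 0])   # sparse ground truth
y = X @ w_true + rng.normal(0, 0.5, size=100)

alphas, coefs, _ = lasso_path(X, y)   # coefs: (n_features, n_alphas)
for a, c in zip(alphas[::20], coefs.T[::20]):
    print(f"alpha={a:.3f}  nonzero coefficients={int((c != 0).sum())}")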
NONPARAMETRIC REGRESSION
Fit globally vs. fit locally
Parametric models
Below, f(x) is not really a polynomial function.
(Fits shown: constant, linear, quadratic.)
What alternative do we have?
Nearest Neighbor & Kernel Regression (nonparametric approach)
Simple implementation; flexibility increases as we have more data
Fit locally to each data point
What people do naturally…
1-NN regression more formally
Transition point
Visualizing 1-NN in multiple dimensions
Distance metrics: notion of "closest"
Weighting housing inputs
Scaled Euclidean distance
Different distance metrics
Performing 1-NN search
1-NN algorithm
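A sketch of 1-NN regression: predict with the output of the single closest training point, here under a scaled Euclidean distance (the per-dimension scaling weights echo the earlier distance-metric slides).

import numpy as np

def one_nn_predict(x_query, X_train, y_train, scales=None):
    # Scaled squared Euclidean distance to every training point.
    if scales is None:
        scales = np.ones(X_train.shape[1])
    d2 = np.sum(scales * (X_train - x_query) ** 2, axis=1)
    return y_train[np.argmin(d2)]   # output of the nearest neighbour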
1-NN in practice
1-NN is sensitive to noise in the data (plot: true function vs. 1-NN fit).
Get more „comps”
K-NN regression more formally
K-NN more formally
K-NN algorithm
K-NN in practice
All k nearest neighbours of a specific (red) query point
K-NN in practice
Issues with discontinuities
Weighted k-NN
How to define weights
Kernel weights for d=1
The kernel determines how the weights decay, if at all, as a function of the distance.
Kernel regression
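A sketch of kernel (Nadaraya-Watson) regression for d = 1: the prediction is a weighted average of all training outputs, with weights from a Gaussian kernel of bandwidth lam (the Gaussian choice is illustrative; any of the kernels above would do).

import numpy as np

def kernel_regression(x_query, x_train, y_train, lam):
    # Gaussian kernel weights decay with distance, controlled by lam.
    w = np.exp(-((x_train - x_query) ** 2) / (2 * lam ** 2))
    return np.sum(w * y_train) / np.sum(w)   # weighted average of outputs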
Kernel regression in practice
Choice of bandwidth λ
Choosing λ (or k in k-NN)
Contrasting with global average
Contrasting with global average
Local linear regression
Local regression rules of thumb
Nonparametric approaches
Limiting behaviour of NN
Limiting behaviour of NN
Error vs. amount of data
Limiting behaviour of NN
Issues: NN and kernel methods
Issues: Complexity of NN search
Summarising